UNIT II - INTRODUCTION TO DEEP LEARNING

History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and Regularization, Batch Normalization - VC Dimension and Neural Nets - Deep vs Shallow Networks - Convolutional Networks - Generative Adversarial Networks (GAN) - Semi-supervised Learning

 

 

1  History of Deep Learning [DL]:

·      The chain rule that underlies the back-propagation algorithm was invented in the seventeenth century (Leibniz, 1676; L'Hôpital, 1696)

·      Beginning in the 1940s, function approximation techniques were used to motivate machine learning models such as the perceptron

·      The earliest models were based on linear models. Critics including Marvin Minsky pointed out several flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach

·      Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s

·      Werbos (1981) proposed applying chain rule techniques for training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a)

·      Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006

·      The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use.

Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, because of more powerful computers and better software infrastructure. A small number of algorithmic changes have also improved the performance of neural networks noticeably. One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community.

The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. Rectification using the max{0, z} function was introduced in early neural network models and dates back at least as far as the Cognitron and Neo-Cognitron (Fukushima, 1975, 1980).

For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is even more important than learning the weights of the hidden layers. Random weights are sufficient to propagate useful information through a rectified linear network, enabling the classifier layer at the top to learn how to map different feature vectors to class identities. When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks that have curvature or two-sided saturation in their activation functions.

When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006 to 2012, it was widely believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today it is known that, with the right resources and engineering practices, feedforward networks perform very well, and gradient-based learning in feedforward networks is used as a tool to develop probabilistic models. Feedforward networks continue to have unfulfilled potential. In the future, we expect they will be applied to many more tasks, and that advances in optimization algorithms and model design will improve their performance even further.

 

 


    A Probabilistic Theory of Deep Learning

Probability is the science of quantifying uncertain things. Most machine learning and deep learning systems utilize a lot of data to learn about patterns in the data. Whenever data is utilized in a system rather than pure logic, uncertainty grows, and whenever uncertainty grows, probability becomes relevant.

By introducing probability to a deep learning system, we introduce common sense to the system. Several models used in deep learning, such as Bayesian models, probabilistic graphical models, and hidden Markov models, depend entirely on probability concepts.

 

 

Real-world data is chaotic. Since deep learning systems utilize real-world data, they require a tool to handle this chaos.


 

            Back Propagation Networks (BPN)

                                                                     Need for Multilayer Networks

·      Single-layer networks cannot be used to solve linearly inseparable problems; they can only solve linearly separable problems

·      Single-layer networks cannot solve complex problems

·      Single-layer networks cannot be used when a large input-output data set is available

·      Single-layer networks cannot capture the complex information available in the training pairs

Hence, to overcome the above limitations, we use multi-layer networks.

Multi-Layer Networks

·      Any neural network which has at least one layer between the input and output layers is called a multi-layer network

·      Layers present between the input and output layers are called hidden layers

·      Input layer neural units just collect the inputs and forward them to the next higher layer

·      Hidden layer and output layer neural units process the information fed to them and produce an appropriate output

·      Multi-layer networks provide optimal solutions for arbitrary classification problems

·      Multi-layer networks use linear discriminants operating on non-linear transformations of the inputs

Back Propagation Networks (BPN)

Introduced by Rumelhart, Hinton, and Williams in 1986. A BPN is a multi-layer feedforward network in which the error is propagated backwards, hence the name Back Propagation Network (BPN). It uses a supervised training process, has a systematic procedure for training the network, and is used for error detection and correction. The Generalized Delta law (also called the Continuous Perceptron law or Gradient Descent law) is used in this network. The generalized delta rule minimizes the mean squared error between the target output and the calculated output. The delta law has a faster convergence rate when compared with the perceptron law; it is the extended version of the perceptron training law. The limitation of this law is the local minima problem, which reduces the convergence speed, but it is still better than the perceptron's. Figure 1 represents a BPN network architecture. Even though multilevel perceptrons can be used, they are less flexible and efficient than BPN. In Figure 1, the weights between the input and the hidden layer are denoted Wij, and the weights between the hidden layer and the output layer are denoted Vjk. This network is valid only for differentiable activation functions. The training process used in backpropagation involves three stages, listed below:

1.   Feedforward of input training pair


2.   Calculation and backpropagation of associated error

3.   Adjustments of weights


 

Figure 1: Back Propagation Network

            BPN Algorithm

The BPN algorithm is classified into four major steps, as follows:

1.   Initialization of biases and weights

2.   Feedforward process

3.   Back propagation of errors

4.   Updating of weights and biases

Algorithm:

I.   Initialization of weights:

Step 1: Initialize the weights to small random values near zero

Step 2: While the stop condition is false, do steps 3 to 10

Step 3: For each training pair do steps 4 to 9

II.      Feed forward of inputs

Step 4: Each input xi is received and forwarded to the next (hidden) layer

Step 5: Each hidden unit sums its weighted inputs: Zinj = Woj + Σi xi Wij

Applying the activation function: Zj = f(Zinj)

This value is passed to the output layer.

Step 6: Each output unit sums its weighted inputs: yink = Vok + Σj Zj Vjk

Applying the activation function: Yk = f(yink)

III.    Backpropagation of Errors

Step 7: Each output unit computes its error term: δk = (tk − Yk) f′(yink)

Step 8: Each hidden unit sums its delta inputs from the output layer, δinj = Σk δk Vjk, and computes its error term: δj = δinj f′(Zinj)

IV.    Updating of Weights & Biases

Step 9: The weight corrections are ΔVjk = α δk Zj and ΔWij = α δj xi, and the bias corrections are ΔVok = α δk and ΔWoj = α δj

The new weights are

Wij(new) = Wij(old) + ΔWij ;  Vjk(new) = Vjk(old) + ΔVjk

The new biases are

Woj(new) = Woj(old) + ΔWoj ;  Vok(new) = Vok(old) + ΔVok

 

Step 10:  Test for Stop Condition
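To make the procedure concrete, here is a minimal NumPy sketch of the algorithm above — a hedged illustration rather than a definitive implementation. It assumes a sigmoid activation f and follows the notation of Figure 1 (W: input-to-hidden weights, V: hidden-to-output weights, α: learning rate); the XOR dataset, layer sizes, and epoch count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                 # derivative of the sigmoid
    s = f(z)
    return s * (1.0 - s)

# Illustrative training pairs: the XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, n_out, alpha = 2, 4, 1, 0.5

# Step 1: initialize weights to small random values near zero
W, W0 = rng.normal(0, 0.5, (n_in, n_hid)), np.zeros(n_hid)    # input -> hidden
V, V0 = rng.normal(0, 0.5, (n_hid, n_out)), np.zeros(n_out)   # hidden -> output

for epoch in range(10000):          # Step 2: stop condition (fixed epoch budget)
    for x, t in zip(X, T):          # Step 3: for each training pair
        # Steps 4-6: feedforward
        z_in = W0 + x @ W; z = f(z_in)
        y_in = V0 + z @ V; y = f(y_in)
        # Step 7: error term of the output units
        delta_k = (t - y) * f_prime(y_in)
        # Step 8: backpropagate the error to the hidden units
        delta_j = (delta_k @ V.T) * f_prime(z_in)
        # Step 9: update weights and biases
        V += alpha * np.outer(z, delta_k); V0 += alpha * delta_k
        W += alpha * np.outer(x, delta_j); W0 += alpha * delta_j

print(f(V0 + f(W0 + X @ W) @ V).round(2))   # should be close to the targets [0, 1, 1, 0]
```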

 

2.2.5 Merits

·      Has a smooth effect on weight correction

·      Computing time is less if the weights are small

·      100 times faster than the perceptron model

·      Has a systematic weight updating procedure

 

Demerits

·      The learning phase requires intensive calculations

·      Selection of the number of hidden layer neurons is an issue

·      Selection of the number of hidden layers is also an issue

·      The network can get trapped in local minima

·      Temporal instability

·      Network paralysis

·      Training time is long for complex problems

            Regularization

A fundamental problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.

Definition: - “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

·      In the context of deep learning, most regularization strategies are based on regularizing estimators.

·      Regularization of an estimator works by trading increased bias for reduced variance.

An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.

·      Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J˜:

J˜(θ; X, y) = J(θ; X, y) + αΩ(θ)

 

 

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective function J. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.

·      The parameter norm penalty typically penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized.

                  L2 Regularization

One of the simplest and most common kinds of parameter norm penalty is the L2 parameter norm penalty, commonly called weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = ½‖w‖²₂ to the objective function. L2 regularization is also known as ridge regression or Tikhonov regularization. To simplify, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:

J˜(w; X, y) = (α/2) wᵀw + J(w; X, y)

with the corresponding gradient

∇w J˜(w; X, y) = α w + ∇w J(w; X, y)

A single gradient step with learning rate ε updates the weights as

w ← (1 − εα) w − ε ∇w J(w; X, y)

·      We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor (1 − εα) on each step, just before performing the usual gradient update. This describes what happens in a single step.
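As a quick concrete illustration of this single step, here is a hedged Python sketch; the toy objective J, learning rate ε = 0.1, and decay coefficient α = 0.01 are placeholder choices, not prescribed values.

```python
import numpy as np

# One gradient step with L2 weight decay: the weight vector is first
# shrunk by the constant factor (1 - eps*alpha), then the usual
# gradient update is applied.
def step_with_weight_decay(w, grad_J, eps=0.1, alpha=0.01):
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

# Toy usage: J(w) = 0.5 * ||w - 1||^2, so grad_J(w) = w - 1.
w = np.array([5.0, -3.0])
for _ in range(500):
    w = step_with_weight_decay(w, lambda v: v - 1.0)
print(w)  # settles near 1/(1 + alpha) ~ 0.99, slightly below the unregularized optimum 1.0
```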

·      To analyse the effect of weight decay, we make a quadratic approximation to the objective function in the neighbourhood of the weights that give the minimal unregularized cost, w*. The approximation Ĵ is given by

Ĵ(θ) = J(w*) + ½ (w − w*)ᵀ H (w − w*)

where H is the Hessian matrix of J with respect to w, evaluated at w*.

The minimum of Ĵ occurs where its gradient ∇w Ĵ(w) = H(w − w*) is equal to 0. To study the effect of weight decay, we add the weight decay gradient to this and solve for the location w̃ of the minimum of the regularized objective:

α w̃ + H(w̃ − w*) = 0

(H + αI) w̃ = H w*

w̃ = (H + αI)⁻¹ H w*

·      As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQᵀ. Applying the decomposition to the above equation, we obtain

w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*

That is, the component of w* aligned with the i-th eigenvector of H is rescaled by a factor of λi/(λi + α): directions along which H has large eigenvalues (λi ≫ α) are affected little, while components with λi ≪ α are shrunk to nearly zero. This effect is illustrated in Figure 2.

Figure 2: Effect of weight decay on the weight update

The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L2 regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w*. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis: it pulls w1 close to zero. In the second dimension, the objective function is very sensitive to movements away from w*. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w2 relatively little.

                  L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L1 regularization.


·      L1 regularization on the model parameter w is defined as the sum of the absolute values of the individual parameters:

Ω(θ) = ‖w‖₁ = Σi |wi|

L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J˜(w; X, y) is given by

J˜(w; X, y) = α‖w‖₁ + J(w; X, y)

with the corresponding gradient (where sign(w) is applied element-wise)

∇w J˜(w; X, y) = α sign(w) + ∇w J(w; X, y)

By inspecting the gradient above, we can see immediately that the effect of L1 regularization is quite different from that of L2 regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with each wi; instead, it is a constant factor with a sign equal to sign(wi).

 


 

 


 

                  Difference between L1 & L2 Parameter Regularization

·      L1 regularization attempts to estimate the median of the data, while L2 regularization estimates the mean of the data, in order to evade overfitting.

·      L1 regularization adds the sum of the absolute values of the weights as a penalty term to the cost function, whereas L2 regularization appends the squared values of the weights to the cost function.

·      L1 regularization can be helpful in feature selection by eradicating unimportant features, whereas L2 regularization is not recommended for feature selection.

·      L1 does not have a closed-form solution, since it includes an absolute value, which is non-differentiable; L2 has a closed-form solution, since it is a square of the weights.
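The last two differences can be seen directly in the gradient contributions of the two penalties. Below is a hedged Python sketch; the penalty strength α = 0.5 and the example weight vector are arbitrary illustrative choices.

```python
import numpy as np

# Gradient contribution of each penalty: L2 contributes alpha * w
# (scales linearly with each weight), while L1 contributes
# alpha * sign(w) (a constant-magnitude push toward zero), which is
# why L1 tends to drive some weights exactly to zero (sparsity).
def l2_penalty_grad(w, alpha):
    return alpha * w

def l1_penalty_grad(w, alpha):
    return alpha * np.sign(w)

w = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(l2_penalty_grad(w, 0.5))  # [-1.   -0.05  0.    0.05  1.  ]
print(l1_penalty_grad(w, 0.5))  # [-0.5 -0.5  0.   0.5  0.5]
```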

 

 

            Batch Normalization:

 

Batch normalization is a method of adaptive reparameterization, motivated by the difficulty of training very deep models. In deep networks, the weights are updated at each layer, so the output of a layer is no longer on the same scale as its input (even though the input is normalized). Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape. When we feed data into a machine learning or deep learning algorithm, we tend to rescale the values to a balanced scale so that the model can generalize appropriately. (Normalization is used to bring the input into a balanced scale/range.)




Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/

 

Even though the input X was normalized, the output is no longer on the same scale. The data passes through multiple layers of the network, with (sigmoidal) activation functions applied many times, which leads to an internal covariate shift in the data.

This motivates us to move towards Batch Normalization

Normalization is the process of altering the input data to have a mean of zero and a standard deviation of one.

2.4.1 Procedure for Batch Normalization:

(1)     Consider the batch input from layer h. For this layer, we first calculate the mean of the hidden activations.

(2)     After calculating the mean, the next step is to calculate the standard deviation of the hidden activations.

(3)     Now we normalize the hidden activations using these mean and standard deviation values: we subtract the mean from each input and divide the result by the square root of the variance plus a small smoothing term (ε).

(4)     As the final stage, the re-scaling and offsetting of the input is performed using the two learnable components of the BN algorithm, γ (gamma) and β (beta), which re-scale (γ) and shift (β) the vector containing the values from the previous operations.

Because these two parameters are learnable, the optimal values of γ and β are obtained during the training of the neural network, giving an accurate normalization of each batch. A minimal sketch of these four steps is given below.
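The following NumPy sketch applies the four steps above to a single mini-batch. The variable names (gamma, beta), array shapes, and toy batch are illustrative assumptions; a full implementation would also maintain running statistics of the mean and variance for use at inference time.

```python
import numpy as np

# Sketch of the four batch-normalization steps above for one mini-batch
# of hidden activations h (rows = examples, columns = hidden units).
def batch_norm_forward(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                      # (1) per-unit batch mean
    var = h.var(axis=0)                      # (2) per-unit batch variance
    h_hat = (h - mu) / np.sqrt(var + eps)    # (3) normalize with smoothing term
    return gamma * h_hat + beta              # (4) learnable re-scale and shift

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=5.0, size=(32, 8))    # toy batch: 32 examples, 8 units
out = batch_norm_forward(h, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3))  # ~0 for every unit
print(out.std(axis=0).round(3))   # ~1 for every unit
```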


VC dimension and Neural nets

The Vapnik-Chervonenkis dimension, more commonly known as the VC dimension, is a measure of model capacity used in statistics and machine learning. It is frequently used to guide the model selection process while developing machine learning applications. To understand the VC dimension, we must first understand shattering.

Shattering

Shattering is the ability of a model to classify a set of points perfectly. More generally, the model can create a function that can divide the points into two distinct classes without overlapping. It is different from simple classification because it considers all possible combinations of labels on those points. Later in this section, we will see this concept in action while computing the VC dimension. In the context of shattering, the VC dimension of a model is defined as the size of the largest set of points that the model can shatter.

Finding the VC dimension

Let us consider a simple binary classification model which states that, for parameters (a, b), all points x such that a < x < b are labelled 1, and all other points are labelled 0:

h(x) = 1, if a < x < b

h(x) = 0, otherwise

(a, b) ∈ ℝ²

We take two points, m and n. For these two points, there can be 2² = 4 distinct labelings in binary classification. We list these cases as follows:

h(m) = 0;  h(n) = 0

h(m) = 0;  h(n) = 1

h(m) = 1;  h(n) = 0

h(m) = 1;  h(n) = 1

We can observe that, for all the possible labelling variations of m and n, the model can divide the points into two segments.

  

This is where we can claim that our model successfully shattered two points in the dataset. Consequently, the VC dimension for this interval model is (at least) 2. A similar test with a linear classifier (a hyperplane in the plane) also succeeds on three points, which bumps the VC dimension of that richer model to 3.

However, when we reach four points, we run into an issue. Specifically, in cases like these:

There is no possible division by a hyperplane that can distinctly classify these four points. Consequently, we can say that our shattering iteration failed, and the VC dimension is 3. A brute-force version of this shattering test, applied to the simple interval model, is sketched below.
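The following hedged Python sketch brute-forces the shattering test for the interval model h(x) = 1 if a < x < b: for every possible labeling of the points, it searches a grid of (a, b) parameters for a realization. The points and the grid are illustrative choices. It confirms that two points can be shattered, and shows the labeling (1, 0, 1) that defeats the interval model on three points — which is why a richer model, such as a hyperplane, is needed to go further.

```python
from itertools import product

# Brute-force shattering check for the interval model above:
# h(x) = 1 if a < x < b else 0. For every possible labeling of the
# points we search a grid of (a, b) parameters for a realization.
def can_shatter(points, grid):
    for labels in product([0, 1], repeat=len(points)):  # all 2^n labelings
        realizable = any(
            all((1 if a < x < b else 0) == y for x, y in zip(points, labels))
            for a in grid for b in grid if a < b
        )
        if not realizable:
            return False, labels  # this labeling defeats the model
    return True, None

grid = [0.5 * k for k in range(-4, 12)]       # candidate a, b values
print(can_shatter([1.0, 2.0], grid))          # (True, None): 2 points shattered
print(can_shatter([1.0, 2.0, 3.0], grid))     # (False, (1, 0, 1)): fails at 3
```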

VC dimension is an essential metric in determining the capacity of a machine learning algorithm. It should be noted that the terms “capacity” and “accuracy” refer to two different things. The capacity of a model is defined as its ability to learn from a given dataset while accuracy is its ability to correctly identify labels for a given batch of data.

One model can have a high VC dimension but lower accuracy, and another model can have a low VC dimension but higher accuracy. A model with a high VC dimension is more likely to overfit the data, while a model with a low VC dimension is more likely to underfit the data.

Much like other metrics in machine learning, the VC dimension merely acts as a guiding light in model selection and should be used together with practical intuition.

            Shallow Networks


Shallow neural networks, which consist of only one or two hidden layers, give us a basic idea about deep neural networks. Understanding a shallow neural network gives us insight into what exactly is going on inside a deep neural network. A neural network is built using various hidden layers. Now that we know the computations that occur in a particular layer, let us understand how the whole neural network computes the output for a given input X. These computations can also be called the forward-propagation equations; a minimal sketch is given below.
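Below is a minimal NumPy sketch of forward propagation for a generic shallow network with one hidden layer: Z1 = W1 X + b1, A1 = g(Z1), Z2 = W2 A1 + b2, A2 = g(Z2). The sigmoid choice for the activation g and the layer sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation through a shallow (one hidden layer) network.
def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1        # pre-activations of the hidden layer
    A1 = sigmoid(Z1)        # hidden layer activations
    Z2 = W2 @ A1 + b2       # pre-activations of the output layer
    A2 = sigmoid(Z2)        # network output
    return A2

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                           # 3 features, 5 examples
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))    # 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))    # 1 output unit
print(forward(X, W1, b1, W2, b2).shape)               # (1, 5): one output per example
```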

 

 


 


 

Figure 3: Shallow Networks – Generic Model


 

 

 

    Difference Between a Shallow Net & Deep Learning Net:

 

| Sl.No | Shallow Nets | Deep Learning Nets |
|-------|--------------|--------------------|
| 1 | One hidden layer (or very few hidden layers) | Many hidden layers, with more neurons in each layer |
| 2 | Take input only as vectors | Can take raw data such as images or text as input |
| 3 | Need more parameters to achieve a good fit | Can fit functions with fewer parameters than a shallow network |
| 4 | With one hidden layer (and the same number of neurons as a deep net), cannot place complex functions over the input space | Can compactly express highly complex functions over the input space |
| 5 | The number of units grows exponentially with task complexity | Do not need to grow exponentially in size (neurons) for complex problems |
| 6 | More difficult to train with current algorithms (e.g. issues of local minima) | Training is easier, with fewer local minima issues |

Reference Books:

1.      B. Yegnanarayana, "Artificial Neural Networks", Prentice Hall Publications.

2.      Simon Haykin, "Neural Networks: A Comprehensive Foundation", Second Edition, Pearson Education.

3.      Laurene Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms and Applications", Prentice Hall Publications.

4.      Cosma Rohilla Shalizi, "Advanced Data Analysis from an Elementary Point of View", 2015.

5.      Li Deng and Dong Yu, "Deep Learning: Methods and Applications", Now Publishers, 2013.

6.      Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning", MIT Press, 2016.

7.      Michael Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

 

Note: For further reference, kindly refer to the class notes, PPTs, and video lectures available in the Learning Management System (Moodle).

 

 


 
