
Gradient Descent and Delta Rule


Linearly and Non-Linearly Separable Data

A set of data points is said to be linearly separable if the data can be divided into two classes using a straight line. If the data cannot be divided into two classes using a straight line, the data points are said to be non-linearly separable.
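For instance, the Boolean AND function is linearly separable while XOR is not. The small Python check below illustrates the difference; the particular line and helper function are only an illustration, not part of the original text.

    # AND is linearly separable; XOR is not (no single line separates its classes).
    and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def line_separates(w1, w2, b, data):
        """Check whether the line w1*x1 + w2*x2 + b = 0 puts class 1 on its positive side."""
        return all((w1 * x1 + w2 * x2 + b > 0) == (label == 1) for (x1, x2), label in data)

    print(line_separates(1, 1, -1.5, and_data))  # True: AND is separated by x1 + x2 = 1.5
    print(line_separates(1, 1, -1.5, xor_data))  # False: this (and every) line fails for XOR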

Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.

A second training rule, called the delta rule, is designed to overcome this difficulty.

If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.

The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.

This rule is important because gradient descent provides the basis for the BACKPROPAGATION algorithm, which can learn networks with many interconnected units.

Derivation of Delta Rule

The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by

o(x) = w · x = w0x0 + w1x1 + … + wnxn

Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.

In order to derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples.

Although there are many ways to define this error, one common measure is

E(w) = (1/2) Σ_{d ∈ D} (t_d − o_d)²

where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the linear unit for training example d.

How to calculate the direction of steepest descent along the error surface?

The direction of steepest descent can be found by computing the derivative of E with respect to each component of the vector w. This vector derivative is called the gradient of E with respect to w, written as

∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]

Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent moves the weights in the opposite direction:

w ← w + Δw, where Δw = −η ∇E(w)

Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.

The negative sign is present because we want to move the weight vector in the direction that decreases E.

This training rule can also be written in its component form,

w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i

Here, differentiating E with respect to w_i gives

∂E/∂w_i = Σ_{d ∈ D} (t_d − o_d)(−x_{id})

where x_{id} is the value of input x_i for training example d. Finally, substituting this into the update rule yields the delta rule:

Δw_i = η Σ_{d ∈ D} (t_d − o_d) x_{id}
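Read as pseudocode, the final rule says: on each pass over the training data, accumulate η(t_d − o_d)x_{id} for every example and every weight, then add the accumulated change to the weights. A minimal Python sketch of this batch version follows; the toy training examples, learning rate, and epoch count are illustrative assumptions, not values from the text.

    def train_linear_unit(training_examples, eta=0.1, epochs=1000):
        """Delta-rule (gradient descent) training of an unthresholded linear unit.

        training_examples: list of (x, t) pairs, where x is a list of input values
        (x[0] is fixed at 1 so that w[0] acts as the threshold/bias weight) and
        t is the target output.
        """
        n = len(training_examples[0][0])
        w = [0.0] * n                                      # initial weight vector
        for _ in range(epochs):
            delta_w = [0.0] * n                            # accumulated updates for this pass
            for x, t in training_examples:
                o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output o = w . x
                for i in range(n):
                    delta_w[i] += eta * (t - o) * x[i]     # delta rule contribution of this example
            w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # one gradient descent step
        return w

    # Toy usage: recover t = 1 + 2*x1 - 3*x2 (x[0] = 1 is the bias input).
    examples = [([1, 0, 0], 1), ([1, 1, 0], 3), ([1, 0, 1], -2), ([1, 1, 1], 0)]
    print(train_linear_unit(examples))                     # approximately [1.0, 2.0, -3.0]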


Ques. Explain the delta rule. Explain the generalized delta learning rule (error backpropagation learning rule).

Answer:

The delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general backpropagation algorithm.

The generalized delta rule is a mathematically derived formula used to determine how to update a neural network's weights during a backpropagation training step.

  • A neural network learns a function that maps an input to an output based on given example pairs of inputs and outputs.
  • A set number of input and output pairs are presented repeatedly, in random order, during training.
  • The generalized delta rule is used repeatedly during training to modify the weights between node connections.
  • Before training, the network's connection weights are initialized with small, random numbers.
  • The purpose of the weight modifications is to reduce the overall network error, that is, the difference between the actual and expected output, as illustrated in the sketch after this list.
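As a concrete illustration of these points, here is a minimal Python sketch of the generalized delta rule for a network with one sigmoid hidden layer and a single sigmoid output: weights start as small random numbers, examples are presented in random order, an error term is computed for the output unit and propagated back to the hidden units, and each weight is nudged to reduce the output error. The network size, learning rate, and XOR task are illustrative assumptions, not values from the text.

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train_backprop(examples, n_hidden=2, eta=0.5, epochs=5000, seed=0):
        """examples: list of (x, t) pairs, with x a list of inputs and t a target in [0, 1]."""
        rng = random.Random(seed)
        n_in = len(examples[0][0])
        # Connection weights start as small random numbers (last entry of each row is the bias).
        w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

        for _ in range(epochs):
            for x, t in rng.sample(examples, len(examples)):        # random presentation order
                # Forward pass.
                h = [sigmoid(sum(wj[i] * xi for i, xi in enumerate(x)) + wj[-1]) for wj in w_hidden]
                o = sigmoid(sum(w_out[j] * hj for j, hj in enumerate(h)) + w_out[-1])
                # Backward pass: error terms for the output unit and the hidden units.
                delta_o = o * (1 - o) * (t - o)
                delta_h = [hj * (1 - hj) * w_out[j] * delta_o for j, hj in enumerate(h)]
                # Weight updates: eta * (downstream error term) * (input through that weight).
                for j, hj in enumerate(h):
                    w_out[j] += eta * delta_o * hj
                w_out[-1] += eta * delta_o
                for j in range(n_hidden):
                    for i, xi in enumerate(x):
                        w_hidden[j][i] += eta * delta_h[j] * xi
                    w_hidden[j][-1] += eta * delta_h[j]
        return w_hidden, w_out

    # Toy usage: XOR, which no single-layer network can represent.
    xor_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    w_hidden, w_out = train_backprop(xor_examples)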

Ques.   Write a short note on gradient descent.

Answer

  • Gradient descent is used to find the values of a function's parameters that minimize a cost function.
  • The gradient descent algorithm is an iterative algorithm for finding a minimum of an objective function (the cost function); for a convex cost function this is the global minimum. We start by defining initial parameter values, and from there gradient descent uses calculus to iteratively adjust the values so that they reduce the given cost function.
  • It is an optimization algorithm that is used when training a machine learning model.
  • The gradient measures how much the error changes with respect to a change in each weight.
  • The higher the gradient, the steeper the slope and the faster a model can learn, as in the sketch below.
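A minimal sketch of the idea, assuming a one-parameter cost function J(w) = (w − 3)² chosen purely for illustration:

    def gradient_descent(grad, w=0.0, eta=0.1, steps=100):
        """Repeatedly step the parameter against the gradient of the cost function."""
        for _ in range(steps):
            w = w - eta * grad(w)          # move in the direction that decreases the cost
        return w

    # Gradient of J(w) = (w - 3)**2 is 2 * (w - 3); the minimum is at w = 3.
    print(gradient_descent(lambda w: 2 * (w - 3)))   # approximately 3.0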

 

Ques.  Explain different types of gradient descent.

Answer

Batch Gradient Descent 

Batch gradient descent sums over all training examples in each iteration when performing the parameter updates. It is the first, basic type of gradient descent, in which we use the complete available dataset to compute the gradient of the cost function.

Stochastic Gradient Descent 

The first step of the algorithm is to shuffle the whole training set. Then, to update the parameters, we use only one training example in each iteration to compute the gradient of the cost function. Because it uses a single training example per iteration, this algorithm is faster for larger datasets.

Mini Batch Gradient Descent 

It is the most widely used variant and gives fast, accurate results using a batch of 'm' training examples. In mini-batch gradient descent, rather than using the complete dataset, in every iteration we use a set of 'm' training examples, called a batch, to compute the gradient of the cost function. It therefore sums over a smaller number of examples, determined by the batch size.
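The three variants differ only in how many training examples contribute to each parameter update. The following Python sketch contrasts them for a one-weight linear model fit by squared error; the model, data, learning rate, and batch size are illustrative assumptions, not part of the original text.

    import random

    def grad(w, batch):
        """Gradient of the mean squared error of the model o = w * x over the given batch."""
        return sum(2 * (w * x - t) * x for x, t in batch) / len(batch)

    def batch_gd(data, w=0.0, eta=0.05, epochs=50):
        for _ in range(epochs):
            w -= eta * grad(w, data)                          # one update per pass, using ALL examples
        return w

    def stochastic_gd(data, w=0.0, eta=0.05, epochs=50):
        for _ in range(epochs):
            for example in random.sample(data, len(data)):    # randomized order
                w -= eta * grad(w, [example])                 # one update per single example
        return w

    def mini_batch_gd(data, w=0.0, eta=0.05, epochs=50, m=2):
        for _ in range(epochs):
            shuffled = random.sample(data, len(data))
            for start in range(0, len(shuffled), m):
                w -= eta * grad(w, shuffled[start:start + m]) # one update per batch of m examples
        return w

    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # roughly t = 2 * x
    print(batch_gd(data), stochastic_gd(data), mini_batch_gd(data))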

 

Ques.   What are the advantages and disadvantages of batch gradient descent?

Answer

Advantages of Batch Gradient Descent

  1. Fewer oscillations and noisy steps are taken towards the global minima of the loss function.
  2. It can benefit from vectorization (which makes code execute faster), increasing the speed of processing all training samples together.
  3. It produces a more stable gradient descent convergence and stable error gradient than stochastic gradient descent.
  4. It is computationally efficient, since computer resources are not spent processing samples one at a time but are used to process all training samples together.

Disadvantages of Batch Gradient Descent

  1. Sometimes a stable error gradient can lead to convergence at a local minimum.
  2. The entire training set can be too large to process in the memory due to which additional memory might be needed.
  3. Depending on computer resources it can take too long for processing all training samples as a batch.

 

Ques.  What are the advantages and disadvantages of stochastic gradient descent?

Answer

Advantages of Stochastic Gradient Descent

  1. It is easier to fit in memory, since only a single training example is processed by the network at a time.
  2. It is computationally fast as only one sample is processed at a time.
  3. For larger datasets, it can converge faster as it causes updates to the parameters more frequently.

Disadvantages of Stochastic Gradient Descent

  1. Due to frequent updates, the steps taken towards the minimum are very noisy, which can often lead the gradient descent in other directions.
  2. Also, due to noisy steps, it may take longer to achieve convergence to the minima of the loss function.
  3. Frequent updates are computationally expensive, since all resources are used to process one training sample at a time.
