
GRADIENT DESCENT algorithm for training a linear unit

 

To summarize, the gradient descent algorithm for training linear units is as follows:
  • Pick an initial random weight vector.
  • Apply the linear unit to all training examples, then compute Δwi for each weight according to Equation (7).
  • Update each weight wi by adding Δwi, then repeat this process.
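The three steps above can be sketched as a short batch-training loop. This is a minimal illustration, not code from the text: the function name, the fixed initial weights (step 1 would normally use small random values), the learning rate η, and the epoch count are all our own choices.

```python
def train_linear_unit(examples, targets, eta=0.05, epochs=100):
    """Train a linear unit o = w . x by batch gradient descent."""
    n = len(examples[0])
    w = [0.1] * n                      # step 1: initial weights (random in practice)

    for _ in range(epochs):
        delta_w = [0.0] * n            # accumulate Delta w_i over all of D
        for x, t in zip(examples, targets):
            o = sum(wi * xi for wi, xi in zip(w, x))   # step 2: apply the linear unit
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]     # Equation (7) contribution
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # step 3: update each weight
    return w
```

On a small consistent dataset (targets generated by t = 2·x1 − 1·x2), the loop recovers the underlying weights.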

Issues in Gradient Descent Algorithm

Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever

  1. The hypothesis space contains continuously parameterized hypotheses
  2. The error can be differentiated with respect to these hypothesis parameters

 The key practical difficulties in applying gradient descent are

  1. Converging to a local minimum can sometimes be quite slow
  2. If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum

Stochastic Approximation to Gradient Descent

  • The gradient descent training rule presented in Equation (7) computes weight updates after summing over all the training examples in D
  • The idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example
                                        ∆wi = η (t – o) xi
  • where t, o, and xi are the target value, unit output, and ith input for the training example in question
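The incremental rule just above fits in a few lines of Python. A hedged sketch, assuming the weights and the example are plain lists; the function name and the η default are illustrative, not from the text:

```python
def sgd_update(w, x, t, eta=0.05):
    """Apply the per-example rule Delta w_i = eta * (t - o) * x_i."""
    o = sum(wi * xi for wi, xi in zip(w, x))          # unit output for this example
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
```

Each call performs one weight update from a single training example, in contrast to Equation (7), which sums over all of D before updating.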



One way to view this stochastic gradient descent is to consider a distinct error function Ed(w⃗) for each individual training example d, as follows:

         Ed(w⃗) = ½ (td − od)²

where td and od are the target value and the unit output value for training example d.

  • Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to Ed(w⃗).
  • The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function E(w⃗).
  • By making the value of η sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.

The key differences between standard gradient descent and stochastic gradient descent are

  • In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
  • Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
  • In cases where E(w⃗) has multiple local minima, stochastic gradient descent can sometimes avoid falling into these local minima, because it uses the various ∇Ed(w⃗) rather than ∇E(w⃗) to guide its search.
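The scheduling difference in the first two bullets can be seen side by side. A small illustrative comparison, assuming our own toy dataset, η, and epoch count: the standard version sums (t − o)·xi over all of D before each update, while the stochastic version updates after every example.

```python
def batch_epoch(w, data, eta):
    """One standard gradient-descent update: sum over all of D, then update."""
    delta = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi
    return [wi + di for wi, di in zip(w, delta)]

def stochastic_epoch(w, data, eta):
    """One pass of stochastic updates: change w after each example."""
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Toy data generated by t = 2*x1 - 1*x2, so both rules should find w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
wb = [0.0, 0.0]
ws = [0.0, 0.0]
for _ in range(300):
    wb = batch_epoch(wb, data, eta=0.05)
    ws = stochastic_epoch(ws, data, eta=0.05)
```

With a small η, the two trajectories end up close together, which is exactly the approximation claim made above.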
