- Pick an initial random weight vector.
- Apply the linear unit to all training examples, then compute Δwi for each weight according to Equation (7).
- Update each weight wi by adding Δwi, then repeat this process
Issues in Gradient Descent Algorithm
Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
- The hypothesis space contains continuously parameterized hypotheses
- The error can be differentiated with respect to these hypothesis parameters
The key practical difficulties in applying gradient descent are
- Converging to a local minimum can sometimes be quite slow
- If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum
Stochastic Approximation to Gradient Descent
- The gradient descent training rule presented in
Equation (7) computes weight updates after summing over all the training
examples in D
- The idea behind stochastic gradient descent is to
approximate this gradient descent search by updating
weights incrementally, following the calculation of the error for each individual example
∆wi = η (t – o) xi
- where
t, o, and xi are
the target value, unit output, and ith input for the training
example in question
One way to view this stochastic gradient descent is to consider a distinct error function Ed(wvector)for each individual training example d as follows
where, td and od are the target value and the unit output value for training example d.
- Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to Ed(wvector)
- The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function Ed(wvector)
- By making the value of η sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely
The key differences between standard gradient descent and stochastic gradient descent are
- In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
- Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
- In cases where there are multiple local minima with respect to stochastic gradient descent can sometimes avoid falling into these local minima because it uses the various ∇Ed(wvector) rather than ∇E(wvector) to guide its search



 

 
 
 
No comments:
Post a Comment