
GRADIENT DESCENT algorithm for training a linear unit

 

To summarize, the gradient descent algorithm for training linear units is as follows:
  • Pick an initial random weight vector.
  • Apply the linear unit to all training examples, then compute Δwi for each weight according to Equation (7).
  • Update each weight wi by adding Δwi, then repeat this process.
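The three steps above can be sketched as a short batch-training loop. This is a minimal illustration, not code from the text: the function name, the fixed initial weights (step 1 would normally use small random values), the learning rate η, and the epoch count are all our own choices.

```python
def train_linear_unit(examples, targets, eta=0.05, epochs=100):
    """Train a linear unit o = w . x by batch gradient descent."""
    n = len(examples[0])
    w = [0.1] * n                      # step 1: initial weights (random in practice)

    for _ in range(epochs):
        delta_w = [0.0] * n            # accumulate Delta w_i over all of D
        for x, t in zip(examples, targets):
            o = sum(wi * xi for wi, xi in zip(w, x))   # step 2: apply the linear unit
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]     # Equation (7) contribution
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # step 3: update each weight
    return w
```

On a small consistent dataset (targets generated by t = 2·x1 − 1·x2), the loop recovers the underlying weights.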

Issues in Gradient Descent Algorithm

Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever

  1. The hypothesis space contains continuously parameterized hypotheses
  2. The error can be differentiated with respect to these hypothesis parameters

 The key practical difficulties in applying gradient descent are

  1. Converging to a local minimum can sometimes be quite slow
  2. If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum

Stochastic Approximation to Gradient Descent

  • The gradient descent training rule presented in Equation (7) computes weight updates after summing over all the training examples in D
  • The idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example
                                        ∆wi = η (t – o) xi
  • where t, o, and xi are the target value, unit output, and ith input for the training example in question
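The incremental rule just above fits in a few lines of Python. A hedged sketch, assuming the weights and the example are plain lists; the function name and the η default are illustrative, not from the text:

```python
def sgd_update(w, x, t, eta=0.05):
    """Apply the per-example rule Delta w_i = eta * (t - o) * x_i."""
    o = sum(wi * xi for wi, xi in zip(w, x))          # unit output for this example
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
```

Each call performs one weight update from a single training example, in contrast to Equation (7), which sums over all of D before updating.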



One way to view this stochastic gradient descent is to consider a distinct error function Ed(w⃗) for each individual training example d, as follows:

         Ed(w⃗) = ½ (td − od)²

where td and od are the target value and the unit output value for training example d.

  • Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to Ed(w⃗).
  • The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function E(w⃗).
  • By making the value of η sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.

The key differences between standard gradient descent and stochastic gradient descent are

  • In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
  • Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
  • In cases where E(w⃗) has multiple local minima, stochastic gradient descent can sometimes avoid falling into these local minima, because it uses the various ∇Ed(w⃗) rather than ∇E(w⃗) to guide its search.
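The scheduling difference in the first two bullets can be seen side by side. A small illustrative comparison, assuming our own toy dataset, η, and epoch count: the standard version sums (t − o)·xi over all of D before each update, while the stochastic version updates after every example.

```python
def batch_epoch(w, data, eta):
    """One standard gradient-descent update: sum over all of D, then update."""
    delta = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi
    return [wi + di for wi, di in zip(w, delta)]

def stochastic_epoch(w, data, eta):
    """One pass of stochastic updates: change w after each example."""
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Toy data generated by t = 2*x1 - 1*x2, so both rules should find w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
wb = [0.0, 0.0]
ws = [0.0, 0.0]
for _ in range(300):
    wb = batch_epoch(wb, data, eta=0.05)
    ws = stochastic_epoch(ws, data, eta=0.05)
```

With a small η, the two trajectories end up close together, which is exactly the approximation claim made above.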
