- Consider the setting in which we wish to learn a nondeterministic (probabilistic) function f : X → {0, 1}, which has two discrete output values.
- We want a function approximator whose output is the probability that f(x) = 1. In other words, we want to learn the target function f′ : X → [0, 1] such that f′(x) = P(f(x) = 1).
- A brute-force approach would be to first collect the observed frequencies of 1's and 0's for each possible value of x, and to then train the neural network to output the target frequency for each x. A minimal sketch of this frequency-counting baseline follows (the dataset and function name are hypothetical, just to illustrate the idea):
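```python
from collections import Counter, defaultdict

def empirical_frequencies(examples):
    """Estimate P(f(x) = 1) for each x by counting observed 1's and 0's.

    examples: iterable of (x, d) pairs, where d is the observed 0/1 value of f(x).
    Returns a dict mapping each x to its observed frequency of 1's.
    """
    counts = defaultdict(Counter)
    for x, d in examples:
        counts[x][d] += 1
    return {x: c[1] / (c[0] + c[1]) for x, c in counts.items()}

# Hypothetical observations: x = 2 yields 1 three times out of four.
data = [(2, 1), (2, 1), (2, 0), (2, 1), (5, 0), (5, 1)]
print(empirical_frequencies(data))  # {2: 0.75, 5: 0.5}
```

These observed frequencies would then serve as the training targets for the network.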
What criterion should we optimize in order to find a maximum likelihood hypothesis for f′ in this setting?
- First obtain an expression for P(D|h)
- Assume the training data D is of the form D = {(x1, d1), …, (xm, dm)}, where di is the observed 0 or 1 value for f(xi).
- Treating both xi and di as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as

  P(D|h) = ∏_{i=1}^{m} P(xi, di|h) = ∏_{i=1}^{m} P(di|h, xi) P(xi)    (1)

  where the second step assumes that the probability of encountering any particular instance xi is independent of the hypothesis h.
- The probability P(di|h, xi) of observing di for the single instance xi is

  P(di|h, xi) = h(xi) if di = 1; (1 − h(xi)) if di = 0    (2)
- To substitute this into Equation (1), first re-express it in a more mathematically manipulable form:

  P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di)    (3)

  (when di = 1 the second factor equals 1, and when di = 0 the first factor equals 1, so this agrees with Equation (2)).
- Using Equation (3) to substitute for P(di|h, xi) in Equation (1), we obtain

  P(D|h) = ∏_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)    (4)
- The last term, P(xi), is a constant independent of h, so it can be dropped when maximizing:

  h_ML = argmax_{h∈H} ∏_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di)    (5)
- It is easier to work with the log of the likelihood, yielding

  h_ML = argmax_{h∈H} Σ_{i=1}^{m} di ln h(xi) + (1 − di) ln(1 − h(xi))    (6)
- Equation (6) describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting; call this quantity G(h, D). It is the negative of the cross-entropy between the observed di and the predicted h(xi).
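As a quick check on Equation (6), here is a minimal sketch that evaluates G(h, D) for hypothetical arrays of predictions h(xi) and labels di:

```python
import math

def log_likelihood(predictions, labels):
    """G(h, D) = sum over examples of d*ln h(x) + (1-d)*ln(1-h(x)).

    predictions: h(xi) values in (0, 1); labels: observed 0/1 values di.
    """
    return sum(d * math.log(p) + (1 - d) * math.log(1 - p)
               for p, d in zip(predictions, labels))

# Hypothetical example: predictions closer to the labels give higher G(h, D).
print(log_likelihood([0.9, 0.2, 0.8], [1, 0, 1]))   # ≈ -0.55
print(log_likelihood([0.5, 0.5, 0.5], [1, 0, 1]))   # ≈ -2.08
```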
Gradient Search to Maximize Likelihood in a Neural Net
- Derive a weight-training rule for neural network learning that seeks to maximize G(h,D) using gradient ascent.
- The gradient of G(h,D) is given by the vector of partial derivatives of G(h,D) with respect to the various network weights that define the hypothesis h represented by the learned network
- In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

  ∂G(h, D)/∂wjk = Σ_{i=1}^{m} ∂G(h, D)/∂h(xi) · ∂h(xi)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) / (h(xi)(1 − h(xi))) · ∂h(xi)/∂wjk    (7)

  where the second step uses ∂/∂h [di ln h + (1 − di) ln(1 − h)] = di/h − (1 − di)/(1 − h) = (di − h)/(h(1 − h)).
- Suppose our neural network is constructed from a single layer of sigmoid units. Then

  ∂h(xi)/∂wjk = σ′(xi) xijk = h(xi)(1 − h(xi)) xijk    (8)

  where xijk is the kth input to unit j for the ith training example, and σ′ is the derivative of the sigmoid squashing function, σ′(z) = σ(z)(1 − σ(z)).
- Finally, substituting Equation (8) into Equation (7), the h(xi)(1 − h(xi)) factors cancel and we obtain a simple expression for the derivatives that constitute the gradient:

  ∂G(h, D)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) xijk
- Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight update rule

  wjk ← wjk + Δwjk, where Δwjk = η Σ_{i=1}^{m} (di − h(xi)) xijk    (9)

  and η is a small positive constant that determines the step size of the gradient ascent search.
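To make the rule concrete, the sketch below trains a single sigmoid unit by batch gradient ascent on G(h, D) (one output unit, so the j index is dropped); the dataset, η, and epoch count are hypothetical:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(examples, eta=0.1, epochs=500):
    """Gradient ascent on G(h, D) for a single sigmoid unit.

    examples: list of (x, d) pairs, x a list of input values, d in {0, 1}.
    Uses the batch update Δw_k = η * Σ_i (d_i − h(x_i)) * x_ik from Eq. (9).
    """
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        grad = [0.0] * n
        for x, d in examples:
            h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k in range(n):
                grad[k] += (d - h) * x[k]          # (di − h(xi)) xijk
        w = [wk + eta * gk for wk, gk in zip(w, grad)]  # ascend the gradient
    return w

# Hypothetical data: first input is a constant bias term, second is the feature.
data = [([1.0, 0.0], 0), ([1.0, 0.2], 0), ([1.0, 0.8], 1), ([1.0, 1.0], 1)]
w = train_sigmoid_unit(data)
print([round(sigmoid(sum(wk * xk for wk, xk in zip(w, x))), 2) for x, _ in data])
```

For a single sigmoid unit this objective is exactly the logistic regression log-likelihood, which is concave in the weights, so gradient ascent with a small enough η approaches the global maximum.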