
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function, a problem that arises in settings such as neural network learning, linear regression, and polynomial curve fitting.

A straightforward Bayesian analysis will show that, under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.

  • Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀h ∈ H)[ h : X → R ], and training examples of the form <xi, di>
  • The problem faced by L is to learn an unknown target function f : X → R
  • A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
  • Each training example is a pair of the form <xi, di>, where di = f(xi) + ei (a small sketch of this data-generating model appears after this list).

        Here f(xi) is the noise-free value of the target function and ei is a random variable representing the noise. 

        It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.

  • The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
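
To make these assumptions concrete, here is a minimal sketch of the data-generating model in Python with NumPy. The target function f, the noise level sigma, and the sample size m are all invented for illustration; only the form di = f(xi) + ei comes from the text.

```python
# A minimal sketch (assumed setup): training examples <x_i, d_i> with
# d_i = f(x_i) + e_i, where e_i is zero-mean Normal noise.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical noise-free target function, for illustration only.
    return 2.0 * x + 1.0

m, sigma = 50, 0.5                          # sample size and noise std (assumed)
x = rng.uniform(-1.0, 1.0, size=m)          # instances drawn from X
d = f(x) + rng.normal(0.0, sigma, size=m)   # observed targets d_i = f(x_i) + e_i
```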

Using the definition of hML we have

hML = argmax_{h ∈ H} p(D|h)

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the individual p(di|h):

hML = argmax_{h ∈ H} ∏_{i=1}^{m} p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ² centered around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and we have

hML = argmax_{h ∈ H} ∏_{i=1}^{m} (1/√(2πσ²)) · exp( −(di − h(xi))² / (2σ²) )

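As a quick illustration of this product form, the sketch below evaluates p(D|h) for a candidate hypothesis h by multiplying the Normal densities of the individual di; the hypothesis, data, and σ are whatever the caller supplies, and nothing here is specific to the original text. Because a product of many small densities underflows quickly in floating point, the log form used in the next step is also the numerically sensible one.

```python
# A minimal sketch (assumed example): p(D|h) as a product of Normal densities
# with mean h(x_i) and variance sigma^2, plus the numerically safer log form.
import numpy as np

def likelihood(h, x, d, sigma):
    dens = (1.0 / np.sqrt(2 * np.pi * sigma**2)) * \
           np.exp(-(d - h(x)) ** 2 / (2 * sigma**2))
    return np.prod(dens)                    # p(D|h) = product over i of p(d_i|h)

def log_likelihood(h, x, d, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (d - h(x)) ** 2 / (2 * sigma**2))
```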
We maximize the less complicated logarithm, which is justified because ln p is a monotonic function of p:

hML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h and can therefore be discarded, yielding

hML = argmax_{h ∈ H} Σ_{i=1}^{m} −(di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

hML = argmin_{h ∈ H} Σ_{i=1}^{m} (di − h(xi))² / (2σ²)

Finally, discard constants that are independent of h:

hML = argmin_{h ∈ H} Σ_{i=1}^{m} (di − h(xi))²

Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
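
The conclusion is easy to check numerically. In the sketch below (an assumed example: the one-parameter hypothesis class hw(x) = w·x, the true slope 3, and the noise level are all invented), every candidate hypothesis on a grid is scored both by its sum of squared errors and by its Gaussian log-likelihood, and the same hypothesis wins under both criteria.

```python
# A minimal sketch (assumed example): minimizing squared error and maximizing
# the Gaussian likelihood pick the same hypothesis from the class h_w(x) = w * x.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = rng.uniform(-1.0, 1.0, size=200)
d = 3.0 * x + rng.normal(0.0, sigma, size=200)   # d_i = f(x_i) + e_i with f(x) = 3x

ws = np.linspace(0.0, 6.0, 601)                  # grid of candidate hypotheses
sse = np.array([np.sum((d - w * x) ** 2) for w in ws])
loglik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                          - (d - w * x) ** 2 / (2 * sigma**2)) for w in ws])

assert np.argmin(sse) == np.argmax(loglik)       # the same hypothesis wins
print(ws[np.argmin(sse)])                        # close to the true slope 3.0
```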

 

Note:

Why is it reasonable to choose the Normal distribution to characterize noise?

  • Good approximation of many types of noise in physical systems
  • The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables itself obeys a Normal distribution, regardless of the distributions of the individual variables (see the sketch after this list)
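
A small numerical illustration of the second point (assumed, not from the text): sums of independent Uniform(0, 1) variables, which are far from Normal individually, are already close to Normal after a few dozen terms, with the mean and standard deviation the Central Limit Theorem predicts.

```python
# A minimal sketch (assumed CLT illustration): each row sums 30 iid
# Uniform(0, 1) variables; the sums are approximately Normal with
# mean 30 * 0.5 and variance 30 * (1/12).
import numpy as np

rng = np.random.default_rng(2)
sums = rng.uniform(0.0, 1.0, size=(100_000, 30)).sum(axis=1)

print(sums.mean(), 30 * 0.5)             # empirical vs. predicted mean
print(sums.std(), np.sqrt(30 / 12.0))    # empirical vs. predicted std
```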

Note that only noise in the target value is considered, not noise in the attributes describing the instances themselves.
