Consider the problem of learning a continuous-valued target function, a problem addressed by approaches such as neural network learning, linear regression, and polynomial curve fitting.
A straightforward Bayesian analysis shows that, under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.
- Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R], and training examples of the form (xi, di).
- The problem faced by L is to learn an unknown target function f : X → R.
- A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution with zero mean.
- Each training example is a pair of the form (xi, di), where di = f(xi) + ei.
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.
- The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
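The setting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear `target` function, the noise level `sigma`, and the helper name `make_training_examples` are hypothetical choices, not part of the original problem statement.

```python
import random

def make_training_examples(f, xs, sigma, seed=0):
    """Return pairs (x_i, d_i) with d_i = f(x_i) + e_i, where each e_i is
    drawn independently from a Normal distribution with zero mean and
    standard deviation sigma."""
    rng = random.Random(seed)
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]

def target(x):
    # Hypothetical noise-free target function f (assumed for illustration).
    return 2.0 * x + 1.0

xs = [i / 10 for i in range(50)]
examples = make_training_examples(target, xs, sigma=0.5)

# The learner only sees (x_i, d_i); the noise e_i = d_i - f(x_i) is hidden.
noise = [d - target(x) for x, d in examples]
mean_noise = sum(noise) / len(noise)
print("sample mean of the noise:", mean_noise)
```

With zero-mean noise, the sample mean of the ei should be close to zero, although it will not be exactly zero for a finite sample.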
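Using the definition of hML, we have

$$h_{ML} = \operatorname*{argmax}_{h \in H} \; p(D \mid h)$$

Here p denotes a probability density, since the target values di are continuous. Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of the individual p(di|h):

$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)$$

Because the noise ei obeys a Normal distribution with zero mean and some unknown variance σ², each di obeys, under hypothesis h, a Normal distribution with mean h(xi) and variance σ²:

$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}\left(d_i - h(x_i)\right)^2}$$

Rather than maximizing this expression directly, we maximize its less complicated logarithm, which is justified because ln p is a monotonic function of p:

$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\left(d_i - h(x_i)\right)^2 \right]$$

The first term in this expression is a constant independent of h and can therefore be discarded, yielding

$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}\left(d_i - h(x_i)\right)^2$$

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

$$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \frac{1}{2\sigma^2}\left(d_i - h(x_i)\right)^2$$

Finally, discarding the constant factor 1/(2σ²), which is also independent of h, gives

$$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$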
Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
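This equivalence can be checked numerically. The sketch below assumes a hypothesis space of straight lines h(x) = w0 + w1·x, a hypothetical target f(x) = 2x + 1, and a known noise level `sigma`. None of these choices come from the original text, and the function names are illustrative only. It compares the least-squares line with a few perturbed lines, ranking them once by Gaussian log-likelihood and once by sum of squared errors.

```python
import math
import random

def fit_line_least_squares(examples):
    """Closed-form least-squares fit of h(x) = w0 + w1*x."""
    m = len(examples)
    mean_x = sum(x for x, _ in examples) / m
    mean_d = sum(d for _, d in examples) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in examples)
    w1 = sxd / sxx
    w0 = mean_d - w1 * mean_x
    return w0, w1

def sum_squared_error(w, examples):
    """Sum over i of (d_i - h(x_i))^2 for the line with weights w."""
    w0, w1 = w
    return sum((d - (w0 + w1 * x)) ** 2 for x, d in examples)

def log_likelihood(w, examples, sigma):
    """ln p(D|h) under independent Normal(0, sigma^2) noise."""
    m = len(examples)
    const = -0.5 * m * math.log(2 * math.pi * sigma ** 2)
    return const - sum_squared_error(w, examples) / (2 * sigma ** 2)

# Hypothetical data: d_i = 2*x_i + 1 + e_i with e_i ~ Normal(0, sigma^2).
rng = random.Random(0)
sigma = 0.5
examples = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, sigma))
            for x in (i / 10 for i in range(50))]

w_ls = fit_line_least_squares(examples)

# The least-squares line plus eight perturbed alternatives.
candidates = [(w_ls[0] + a, w_ls[1] + b)
              for a in (-0.2, 0.0, 0.2) for b in (-0.2, 0.0, 0.2)]

best_by_likelihood = max(candidates,
                         key=lambda w: log_likelihood(w, examples, sigma))
best_by_sse = min(candidates, key=lambda w: sum_squared_error(w, examples))
print(best_by_likelihood == w_ls, best_by_sse == w_ls)
```

Because the log-likelihood equals a constant minus the squared error scaled by 1/(2σ²), both rankings should select the same candidate, namely the least-squares line itself. The value of `sigma` changes the likelihood values but not the ranking.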
Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
- It is a good approximation of many types of noise in physical systems.
- The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables (with finite variance) approximately obeys a Normal distribution, regardless of the distribution of the individual variables.
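The Central Limit Theorem argument can be illustrated with a small simulation. The sketch below assumes that each noise value is the sum of 30 independent uniform disturbances on [-0.5, 0.5]. Both the number of terms and the uniform distribution are arbitrary choices made for illustration. It then checks one signature of a Normal distribution: roughly 68% of samples fall within one standard deviation of the mean.

```python
import random
import statistics

def sum_of_uniforms(n_terms, rng):
    """Sum of n_terms independent Uniform(-0.5, 0.5) disturbances."""
    return sum(rng.uniform(-0.5, 0.5) for _ in range(n_terms))

rng = random.Random(0)
n_terms = 30          # assumed number of small disturbances per noise value
samples = [sum_of_uniforms(n_terms, rng) for _ in range(20000)]

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)

# Fraction of samples within one standard deviation of the mean.
# For an exactly Normal distribution this fraction is about 0.683.
within_one_std = sum(abs(s - mean) <= std for s in samples) / len(samples)
print(mean, std, within_one_std)
```

Each uniform term has variance 1/12, so the sum should have a standard deviation near sqrt(30/12), even though no individual term is Normally distributed.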
Note also that this analysis considers noise only in the target value, not in the attributes describing the instances themselves.