What is the relationship between Bayes theorem and the problem of concept learning?
Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data. We can therefore use it as the basis for a straightforward learning algorithm that calculates the posterior probability of each possible hypothesis and outputs the most probable one.
Brute-Force Bayes Concept Learning
Consider the following concept learning problem:
- Assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept c : X → {0,1}.
- The learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is some instance from X and where di is the target value of xi (i.e., di = c(xi)).
- The sequence of target values is written D = (d1 . . . dm).
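To make the setup concrete, here is a minimal sketch in Python of one possible finite hypothesis space and a noise-free training sequence; the specific instances and hypothesis names are invented for illustration and are not part of the original problem statement.

```python
# Hypothetical toy setup: a finite hypothesis space H over a small instance
# space X, a boolean target concept c, and noise-free training examples.

X = [0, 1, 2, 3]                         # instance space

# Each hypothesis maps an instance to {0, 1}; H is finite.
H = {
    "h_always_0": lambda x: 0,
    "h_always_1": lambda x: 1,
    "h_ge_2":     lambda x: 1 if x >= 2 else 0,
    "h_is_odd":   lambda x: x % 2,
}

c = H["h_ge_2"]                          # target concept (unknown to the learner)

# Noise-free training examples (xi, di) with di = c(xi).
training_examples = [(x, c(x)) for x in [0, 2, 3]]
D = [d for _, d in training_examples]    # the sequence of target values
print(training_examples)                 # [(0, 0), (2, 1), (3, 1)]
```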
We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:
BRUTE-FORCE MAP LEARNING algorithm:
- For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D)
- Output the hypothesis hMAP with the highest posterior probability
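A minimal Python sketch of this procedure, assuming the prior P(h) and likelihood P(D|h) are supplied as functions (the function names are hypothetical; concrete choices for these two quantities are discussed below):

```python
# Sketch of BRUTE-FORCE MAP LEARNING: score every hypothesis by its
# unnormalized posterior P(D|h) * P(h) and return the best one. The evidence
# P(D) is constant across hypotheses, so it does not change the argmax.

def brute_force_map_learning(H, training_examples, prior, likelihood):
    posteriors = {name: likelihood(training_examples, h) * prior(name)
                  for name, h in H.items()}
    return max(posteriors, key=posteriors.get)   # name of hMAP
```

Note that this enumerates all of H, which is feasible only for small hypothesis spaces; hence the name brute force.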
To fully specify the learning problem for this algorithm we must choose values for P(h) and P(D|h). Let us choose them to be consistent with the following assumptions:
- The training data D is noise free (i.e., di = c(xi))
- The target concept c is contained in the hypothesis space H
- We have no a priori reason to believe that any hypothesis is more probable than any other.
What values should we specify for P(h)?
- Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H.
- Since we assume the target concept is contained in H, we require that these prior probabilities sum to 1; together these choices give P(h) = 1/|H| for all h in H.
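In code, the uniform prior following from these two assumptions is simply (a hypothetical helper matching the sketch above):

```python
# Uniform prior: P(h) = 1 / |H| for every hypothesis, so the priors sum to 1.
def uniform_prior_factory(H):
    p = 1.0 / len(H)
    return lambda name: p
```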
What choice shall we make for P(D|h)?
- P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
- Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore, P(D|h) = 1 if di = h(xi) for every training example, and P(D|h) = 0 otherwise.
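The corresponding likelihood under the noise-free assumption can be sketched as (again a hypothetical helper):

```python
# Noise-free likelihood: P(D|h) = 1 if h agrees with every training example
# (di = h(xi) for all i), and 0 as soon as it disagrees with any of them.
def noise_free_likelihood(training_examples, h):
    return 1.0 if all(h(x) == d for x, d in training_examples) else 0.0
```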
Consider the case where h is inconsistent with the training data D. Since P(D|h) = 0 for such a hypothesis, Bayes theorem gives P(h|D) = P(D|h) P(h) / P(D) = 0 · P(h) / P(D) = 0. The posterior probability of a hypothesis inconsistent with D is therefore zero.
Now consider the case where h is consistent with D. Then P(D|h) = 1, and with the uniform prior P(h) = 1/|H| Bayes theorem gives P(h|D) = (1 · 1/|H|) / P(D) = 1/|VSH,D|, where VSH,D denotes the version space of H with respect to D, i.e., the subset of hypotheses in H that are consistent with D. (The evidence is P(D) = |VSH,D| / |H|, since each consistent hypothesis contributes 1/|H| and the rest contribute zero.) Every consistent hypothesis therefore receives the same posterior probability, while every inconsistent hypothesis receives posterior probability zero.
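Putting the pieces together, the sketch below (reusing the hypothetical toy space, prior, and likelihood defined above) computes the exact posteriors and shows both cases: inconsistent hypotheses receive posterior zero, and each consistent hypothesis receives posterior 1/|VSH,D|.

```python
# Exact posteriors under the uniform prior and noise-free likelihood.
prior = uniform_prior_factory(H)

# Evidence: P(D) = sum over h of P(D|h) P(h) = |VS| / |H| under these assumptions.
p_D = sum(noise_free_likelihood(training_examples, h) * prior(name)
          for name, h in H.items())

posteriors = {name: noise_free_likelihood(training_examples, h) * prior(name) / p_D
              for name, h in H.items()}
print(posteriors)
# Inconsistent hypotheses get 0.0; each consistent hypothesis gets 1/|VS|.
# With the toy data only "h_ge_2" is consistent, so its posterior is 1.0.
```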