What is the relationship between Bayes theorem and the problem of concept learning?
Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data. We can therefore use it as the basis for a straightforward learning algorithm that calculates the posterior probability of each possible hypothesis and outputs the most probable one.
Brute-Force Bayes Concept Learning
Consider the following concept learning problem:
- Assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept c : X → {0,1}.
- The learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is some instance from X and where di is the target value of xi (i.e., di = c(xi)).
- The sequence of target values is written D = (d1 . . . dm).
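To make the setup concrete, here is a minimal sketch in Python of one possible finite hypothesis space and a noise-free training sequence; the specific instances and hypothesis names are invented for illustration and are not part of the original problem statement.

```python
# Hypothetical toy setup: a finite hypothesis space H over a small instance
# space X, a boolean target concept c, and noise-free training examples.

X = [0, 1, 2, 3]                         # instance space

# Each hypothesis maps an instance to {0, 1}; H is finite.
H = {
    "h_always_0": lambda x: 0,
    "h_always_1": lambda x: 1,
    "h_ge_2":     lambda x: 1 if x >= 2 else 0,
    "h_is_odd":   lambda x: x % 2,
}

c = H["h_ge_2"]                          # target concept (unknown to the learner)

# Noise-free training examples (xi, di) with di = c(xi).
training_examples = [(x, c(x)) for x in [0, 2, 3]]
D = [d for _, d in training_examples]    # the sequence of target values
print(training_examples)                 # [(0, 0), (2, 1), (3, 1)]
```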
We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:
BRUTE-FORCE MAP LEARNING algorithm:
- For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D)
- Output the hypothesis hMAP with the highest posterior probability
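A minimal Python sketch of this procedure, assuming the prior P(h) and likelihood P(D|h) are supplied as functions (the function names are hypothetical; concrete choices for these two quantities are discussed below):

```python
# Sketch of BRUTE-FORCE MAP LEARNING: score every hypothesis by its
# unnormalized posterior P(D|h) * P(h) and return the best one. The evidence
# P(D) is constant across hypotheses, so it does not change the argmax.

def brute_force_map_learning(H, training_examples, prior, likelihood):
    posteriors = {name: likelihood(training_examples, h) * prior(name)
                  for name, h in H.items()}
    return max(posteriors, key=posteriors.get)   # name of hMAP
```

Note that this enumerates all of H, which is feasible only for small hypothesis spaces; hence the name brute force.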
To fully specify the learning problem for this algorithm we must choose values for P(h) and P(D|h). Let us choose them to be consistent with the following assumptions:
- The training data D is noise free (i.e., di = c(xi))
- The target concept c is contained in the hypothesis space H
- We have no a priori reason to believe that any hypothesis is more probable than any other.
What values should we specify for P(h)?
- Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H.
- Since we assume the target concept is contained in H, we require that these prior probabilities sum to 1; together these choices give P(h) = 1/|H| for all h in H.
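In code, the uniform prior following from these two assumptions is simply (a hypothetical helper matching the sketch above):

```python
# Uniform prior: P(h) = 1 / |H| for every hypothesis, so the priors sum to 1.
def uniform_prior_factory(H):
    p = 1.0 / len(H)
    return lambda name: p
```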
What choice shall we make for P(D|h)?
- P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
- Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore, P(D|h) = 1 if di = h(xi) for every training example, and P(D|h) = 0 otherwise.
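The corresponding likelihood under the noise-free assumption can be sketched as (again a hypothetical helper):

```python
# Noise-free likelihood: P(D|h) = 1 if h agrees with every training example
# (di = h(xi) for all i), and 0 as soon as it disagrees with any of them.
def noise_free_likelihood(training_examples, h):
    return 1.0 if all(h(x) == d for x, d in training_examples) else 0.0
```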
Consider the case where h is inconsistent with the training data D. Since P(D|h) = 0 for such a hypothesis, Bayes theorem gives P(h|D) = P(D|h) P(h) / P(D) = 0 · P(h) / P(D) = 0. The posterior probability of a hypothesis inconsistent with D is therefore zero.
Now consider the case where h is consistent with D. Then P(D|h) = 1, and with the uniform prior P(h) = 1/|H| Bayes theorem gives P(h|D) = (1 · 1/|H|) / P(D) = 1/|VSH,D|, where VSH,D denotes the version space of H with respect to D, i.e., the subset of hypotheses in H that are consistent with D. (The evidence is P(D) = |VSH,D| / |H|, since each consistent hypothesis contributes 1/|H| and the rest contribute zero.) Every consistent hypothesis therefore receives the same posterior probability, while every inconsistent hypothesis receives posterior probability zero.
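Putting the pieces together, the sketch below (reusing the hypothetical toy space, prior, and likelihood defined above) computes the exact posteriors and shows both cases: inconsistent hypotheses receive posterior zero, and each consistent hypothesis receives posterior 1/|VSH,D|.

```python
# Exact posteriors under the uniform prior and noise-free likelihood.
prior = uniform_prior_factory(H)

# Evidence: P(D) = sum over h of P(D|h) P(h) = |VS| / |H| under these assumptions.
p_D = sum(noise_free_likelihood(training_examples, h) * prior(name)
          for name, h in H.items())

posteriors = {name: noise_free_likelihood(training_examples, h) * prior(name) / p_D
              for name, h in H.items()}
print(posteriors)
# Inconsistent hypotheses get 0.0; each consistent hypothesis gets 1/|VS|.
# With the toy data only "h_ge_2" is consistent, so its posterior is 1.0.
```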