Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
- For C1: use some obvious encoding of decision trees, in which the description length of a tree grows with its number of nodes and number of edges.
- For C2: Suppose that the sequence of instances (x1 ... xm) is already known to both the transmitter and the receiver, so that we need only transmit the classifications (f(x1) ... f(xm)).
- Now if the training classifications (f(x1) ... f(xm)) are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples: the description length of the classifications given the hypothesis is zero.
- If some examples are misclassified by h, then for each misclassification we need to transmit a message that identifies which example is misclassified (at most log2 m bits, since the receiver already knows the sequence of m instances) as well as its correct classification.
- The hypothesis hMDL under the encodings C1 and C2 is simply the one that minimizes the sum of these two description lengths: hMDL = argmin over h of L_C1(h) + L_C2(D | h). A small scoring sketch follows this list.
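To make the two description lengths concrete, here is a minimal Python sketch of an MDL score for a decision tree. The encodings are illustrative assumptions rather than a canonical scheme: C1 charges a fixed number of bits per tree node, and C2 charges log2(m) bits to point at each misclassified example plus log2(k) bits for its correct label (k = number of classes). The helper names (description_length_c1, description_length_c2, mdl_score) and the constant BITS_PER_NODE are hypothetical.

```python
import math

BITS_PER_NODE = 8  # assumed per-node cost under C1; any tree encoding
                   # whose length grows with the number of nodes works

def description_length_c1(num_nodes):
    """L_C1(h): description length of the tree itself."""
    return num_nodes * BITS_PER_NODE

def description_length_c2(predictions, labels, num_classes):
    """L_C2(D|h): zero when h predicts every training label correctly;
    otherwise, for each misclassification, the bits to identify the
    example (log2 m) plus its correct classification (log2 k)."""
    m = len(labels)
    errors = sum(1 for p, y in zip(predictions, labels) if p != y)
    if errors == 0:
        return 0.0  # receiver reconstructs all labels from h alone
    return errors * (math.log2(m) + math.log2(num_classes))

def mdl_score(num_nodes, predictions, labels, num_classes=2):
    """Total description length; hMDL minimizes this over hypotheses."""
    return (description_length_c1(num_nodes)
            + description_length_c2(predictions, labels, num_classes))

# Example: a 5-node tree that misclassifies 2 of 100 binary examples.
labels = [0] * 98 + [1, 1]
preds = [0] * 100  # this tree predicts 0 everywhere
print(mdl_score(5, preds, labels))  # 40 + 2*(log2 100 + 1) ≈ 55.3
```

Under such a score, growing the tree pays off only when the extra nodes eliminate enough misclassifications to save more C2 bits than they cost in C1 bits, which is exactly the trade-off between hypothesis complexity and fit to the data that hMDL resolves.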