
ISSUES IN DECISION TREE LEARNING

1.      Avoiding Overfitting the Data

Reduced error pruning 

Rule post-pruning

2.      Incorporating Continuous-Valued Attributes

3.      Alternative Measures for Selecting Attributes

4.      Handling Training Examples with Missing Attribute Values

5.      Handling Attributes with Differing Costs

1.  Avoiding Overfitting the Data

  • The ID3 algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples. This can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function; in either case the algorithm can produce trees that overfit the training examples.
  • Definition - Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
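Written out in symbols (the same definition, with error_train denoting error measured over the training examples and error_D denoting error over the entire distribution D of instances):

    (\exists\, h' \in H)\;\;
    \mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
    \;\wedge\;
    \mathrm{error}_{D}(h') < \mathrm{error}_{D}(h)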

The figure below illustrates the impact of overfitting in a typical application of decision tree learning.

  • The horizontal axis of this plot indicates the total number of nodes in the decision tree as the tree is being constructed. The vertical axis indicates the accuracy of predictions made by the tree.
  • The solid line shows the accuracy of the decision tree over the training examples. The broken line shows accuracy measured over an independent set of test examples.
  • The accuracy of the tree over the training examples increases monotonically as the tree is grown, while the accuracy measured over the independent test examples first increases, then decreases (a rough sketch reproducing this behaviour follows the list).
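The same qualitative behaviour can be reproduced in a few lines of Python. This is only a sketch, assuming scikit-learn is available; the synthetic dataset, noise level, and tree sizes are made up for illustration, so the exact numbers will not match the figure.

    # Sketch: training vs. test accuracy as the tree is allowed to grow larger.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data standing in for a small training sample.
    X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        random_state=0)

    # Grow trees of increasing size and record accuracy on both sets:
    # training accuracy typically keeps climbing, while test accuracy
    # typically drops once the tree gets large enough to fit the noise.
    for size in (2, 4, 8, 16, 32, 64):
        tree = DecisionTreeClassifier(max_leaf_nodes=size, random_state=0)
        tree.fit(X_train, y_train)
        print(size,
              round(tree.score(X_train, y_train), 3),
              round(tree.score(X_test, y_test), 3))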

 How can it be possible for tree h to fit the training examples better than h', but for it to perform more poorly over subsequent examples?

  1. Overfitting can occur when the training examples contain random errors or noise
  2. Overfitting can also occur when small numbers of examples are associated with leaf nodes, so that coincidental regularities in those few examples get built into the tree.

 Noisy Training Example

  • Example 15: <Sunny, Hot, Normal, Strong, ->
  • Example is noisy because the correct label is +
  • The previously constructed tree misclassifies it, so ID3 will grow the tree further to fit the noisy example, producing a more complex tree that is likely to perform worse on subsequent examples.


Approaches to avoiding overfitting in decision tree learning (both are sketched in code after this list):
  • Pre-pruning (avoidance): Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
  • Post-pruning (recovery): Allow the tree to overfit the data, and then post-prune the tree
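A rough sketch of both approaches, again assuming scikit-learn and a made-up synthetic dataset. Scikit-learn's cost-complexity pruning is used here only as a stand-in for the reduced error pruning and rule post-pruning procedures named earlier; the shared idea is to grow the full tree and then keep the pruned version that does best on a held-out validation set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic noisy data; hold out a validation set to judge pruning.
    X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1,
                               random_state=0)
    X_grow, X_val, y_grow, y_val = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

    # Pre-pruning (avoidance): stop growing early via depth / leaf-size limits.
    pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                        random_state=0).fit(X_grow, y_grow)

    # Post-pruning (recovery): grow the full, overfit tree, then keep the
    # candidate pruning that scores best on the held-out validation set.
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_grow, y_grow).ccp_alphas
    candidates = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_grow, y_grow)
                  for a in alphas]
    post_pruned = max(candidates, key=lambda t: t.score(X_val, y_val))
    print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())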

Criterion used to determine the correct final tree size

  • Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree
  • Use all the available data for training, but apply a statistical test (e.g., a chi-square test) to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set
  • Use a measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is called the Minimum Description Length (MDL) principle

                 MDL – Minimize: size(tree) + size(misclassifications(tree))
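As a toy illustration of the idea (the node counts, error counts, and the one-unit-per-node / one-unit-per-error costs below are made up; a real MDL encoding counts bits for both terms), the principle picks whichever candidate tree minimizes the combined description length:

    # Toy MDL-style comparison (illustrative numbers only):
    # cost = size(tree) + size(misclassifications(tree)), approximated here
    # as node count + number of misclassified training examples.
    candidates = {
        "small tree":  {"nodes": 3,  "train_errors": 20},
        "medium tree": {"nodes": 15, "train_errors": 4},
        "large tree":  {"nodes": 45, "train_errors": 0},
    }
    for name, c in candidates.items():
        print(name, c["nodes"] + c["train_errors"])
    # -> 23, 19, 45: the medium tree minimizes the combined cost, even though
    #    the large tree fits the training data perfectly.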

