Issues in Decision Tree Learning
1. Overfitting the data:
Definition: given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
How can we prevent overfitting? Here are some common heuristics:
- Don't try to fit all the examples; stop before the training set is exhausted.
- Fit all the examples, then prune the resultant tree.
How does one tell whether a given tree has overfit the data?
- Extract a validation set from the training set (not used for training) and use it to check for overfitting. Usually the validation set consists of one-third of the training set, chosen randomly.
- Then use statistical tests, e.g. the chi-squared metric, to determine whether changing the tree improves its performance over the validation set (a small sketch of such a test appears at the end of this section).
- A variation of the above is to use an MDL (minimum description length) criterion to check whether modifying the tree increases its description length with respect to the validation set.
If we use the validation set to guide pruning, how do we tell that the tree is not overfitting the validation set? In this case, we need to extract yet another set called the test set from the training set and use this for the final check.
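As a small illustration of the chi-squared heuristic mentioned above, here is a hedged sketch (the branch counts are invented for illustration, and scipy is assumed to be available): it asks whether the class distribution across the branches of a candidate split differs from chance, which can guide whether to keep or prune that split.

    # Sketch: chi-squared check on a candidate split (counts are made up).
    from scipy.stats import chi2_contingency

    # Contingency table: rows = branches of the candidate split,
    # columns = (positive, negative) class counts within each branch.
    split_counts = [
        [9, 1],   # branch 1: 9 positive, 1 negative
        [2, 8],   # branch 2: 2 positive, 8 negative
    ]

    chi2, p_value, dof, expected = chi2_contingency(split_counts)

    # Keep the split only if the class distribution across branches differs
    # from chance at the chosen significance level; otherwise prune it.
    ALPHA = 0.05
    print(f"chi2={chi2:.2f}, p={p_value:.3f}, keep split: {p_value < ALPHA}")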
2. Guarding against bad attribute choices:
The information gain measure has a bias that favors attributes with many values over those with only a few.
For example, if you imagine an attribute Date with unique values for each training example, then Gain(S,Date) will yield H(S), since each subset Sv contains a single example and therefore has zero entropy. Obviously no other attribute can do better. This will result in a very broad tree of depth 1.
To guard against this, Quinlan uses GainRatio(S,A) instead of Gain(S,A), where

    GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
    SplitInformation(S,A) = - sum over v in Values(A) of P(Sv) * log2 P(Sv)

with P(Sv) estimated by relative frequency as before. The SplitInformation term introduces a penalty for partitioning the examples into many small, equally likely groups.
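A minimal sketch of Gain versus GainRatio on a tiny made-up dataset (the attribute names Date and Outlook, the labels, and the helper functions below are illustrative assumptions, not part of the original notes):

    # Sketch: Gain vs. GainRatio; a unique-valued Date-like attribute gets
    # maximal Gain but a large SplitInformation, so its GainRatio drops.
    import math
    from collections import Counter

    def entropy(labels):
        """H(S) for a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, attr, labels):
        """Gain(S, A) = H(S) - sum_v P(Sv) * H(Sv)."""
        n = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(label)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    def split_information(rows, attr):
        """SplitInformation(S, A) = -sum_v P(Sv) * log2 P(Sv)."""
        n = len(rows)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(row[attr] for row in rows).values())

    def gain_ratio(rows, attr, labels):
        return gain(rows, attr, labels) / split_information(rows, attr)

    rows = [{"Date": f"d{i}", "Outlook": o}
            for i, o in enumerate(["Sunny", "Sunny", "Rain", "Rain"])]
    labels = ["Yes", "Yes", "No", "No"]
    for a in ("Date", "Outlook"):
        print(a, round(gain(rows, a, labels), 3), round(gain_ratio(rows, a, labels), 3))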
3. Handling continuous valued attributes:
Note that the range of a continuous valued attribute can be partitioned into a discrete number of disjoint intervals; we can then test examples for membership in these intervals.
If the Temperature attribute in the PlayTennis example took on continuous values in the range 40-90, note that we cannot just use the same approach as before. Why? Because Temperature becomes a bad choice for classification: it alone may perfectly classify the training examples and therefore promise the highest information gain, as in the earlier Date example, while remaining a poor predictor on the test set.
The solution to this problem is to classify based not on the actual temperature, but on dynamically determined intervals within which the temperature falls. For instance, we can introduce boolean attributes T<=a, a<T<=b, b<T<=c and T>c instead of the real valued T; the thresholds a, b and c can be computed dynamically from boundaries where the classification is likely or known to change.
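Here is a small sketch of how such thresholds might be found, assuming candidate thresholds are placed midway between adjacent sorted values where the classification changes and the one with the highest information gain is kept (the temperature values and labels are made up for illustration):

    # Sketch: choose a threshold for the boolean test Temperature <= t.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Return (threshold, gain) maximizing information gain for value <= threshold."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = (None, -1.0)
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 == y2:              # only boundaries where the class changes matter
                continue
            t = (v1 + v2) / 2
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            g = base - len(left) / len(pairs) * entropy(left) \
                     - len(right) / len(pairs) * entropy(right)
            if g > best[1]:
                best = (t, g)
        return best

    temps = [40, 48, 60, 72, 80, 90]
    plays = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temps, plays))   # picks the boundary near 54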
4. Handling missing attribute values:
What happens if some of the training examples contain one or more ``?'' entries, meaning ``value not known'', instead of actual attribute values?
Here are some common ad hoc solutions:
- Substitute ``?'' by the most common value in that column.
- Substitute ``?'' by the most common value among all training examples that have been sorted into the tree at that node.
- Substitute ``?'' by the most common value among all training examples that have been sorted into the tree at that node and share the same classification as the incomplete example.
- etc.
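As a sketch of the first heuristic only (the rows and the attribute name are invented), one could replace ``?'' entries with the most common known value in that column:

    # Sketch: fill '?' entries of one attribute with the column's most common value.
    from collections import Counter

    def impute_most_common(rows, attr):
        known = [r[attr] for r in rows if r[attr] != "?"]
        most_common = Counter(known).most_common(1)[0][0]
        return [{**r, attr: most_common} if r[attr] == "?" else r for r in rows]

    rows = [{"Outlook": "Sunny"}, {"Outlook": "?"},
            {"Outlook": "Rain"}, {"Outlook": "Sunny"}]
    print(impute_most_common(rows, "Outlook"))   # the '?' becomes 'Sunny'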
5. Handling attributes with differing costs:
Sometimes we want to introduce additional bias against the selection of certain attributes, e.g. ``wherever possible, try not to use InvasiveTest as an attribute, since this might inconvenience the patient.''
Here is another ad hoc heuristic to fix this: introduce a user-defined measure Cost(A) to specify a fixed bias against each attribute.
Now we can use a CostedGain(S,A), which is defined along the lines of

    CostedGain(S,A) = (2^Gain(S,A) - 1) / (Cost(A) + 1)^w

where w in [0,1] is a user-chosen constant that determines how strongly cost counts against gain.
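A quick sketch of this kind of cost-sensitive selection, assuming the form written above (the gains, costs, and attribute names are invented for illustration):

    # Sketch: trade information gain off against attribute cost.
    def costed_gain(gain, cost, w=0.5):
        """(2^Gain - 1) / (Cost + 1)^w; w in [0, 1] sets how much cost matters."""
        return (2 ** gain - 1) / ((cost + 1) ** w)

    candidates = {
        "Temperature":  (0.45, 1.0),    # (Gain(S, A), Cost(A))
        "InvasiveTest": (0.60, 50.0),
    }
    for attr, (g, c) in candidates.items():
        print(attr, round(costed_gain(g, c), 3))
    # The expensive InvasiveTest scores lower despite its higher raw gain.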