Issues in Decision Tree Learning
1. Overfitting the data:
Definition: given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
How can we prevent overfitting? Here are some common heuristics:
- Don't try to fit all the examples; stop before the training set is exhausted.
- Fit all the examples, then prune the resultant tree.
How does one tell whether a given tree has overfit the data?
- Extract a validation set from the training set (not used for training) and use it to check for overfitting. Usually the validation set consists of one-third of the training set, chosen randomly.
- Then use statistical tests, e.g. the chi-squared metric, to determine whether changing the tree improves its performance over the validation set (a small sketch of such a test appears at the end of this section).
- A variation of the above is to use an MDL (minimum description length) criterion to check whether modifying the tree increases its description length with respect to the validation set.
If we use the validation set to guide pruning, how do we tell that the tree is not overfitting the validation set? In this case, we need to extract yet another set called the test set from the training set and use this for the final check.
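As a small illustration of the chi-squared heuristic mentioned above, here is a hedged sketch (the branch counts are invented for illustration, and scipy is assumed to be available): it asks whether the class distribution across the branches of a candidate split differs from chance, which can guide whether to keep or prune that split.

    # Sketch: chi-squared check on a candidate split (counts are made up).
    from scipy.stats import chi2_contingency

    # Contingency table: rows = branches of the candidate split,
    # columns = (positive, negative) class counts within each branch.
    split_counts = [
        [9, 1],   # branch 1: 9 positive, 1 negative
        [2, 8],   # branch 2: 2 positive, 8 negative
    ]

    chi2, p_value, dof, expected = chi2_contingency(split_counts)

    # Keep the split only if the class distribution across branches differs
    # from chance at the chosen significance level; otherwise prune it.
    ALPHA = 0.05
    print(f"chi2={chi2:.2f}, p={p_value:.3f}, keep split: {p_value < ALPHA}")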
2. Guarding against bad attribute choices:
The information gain measure has a bias that favors attributes with many values over those with only a few.
For example, if you imagine an attribute Date with unique values for each training example, then Gain(S,Date) will yield H(S), since each subset Sv contains a single example and therefore has zero entropy. Obviously no other attribute can do better. This will result in a very broad tree of depth 1.
To guard against this, Quinlan uses GainRatio(S,A) instead of Gain(S,A), where

    GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
    SplitInformation(S,A) = - sum over v in Values(A) of P(Sv) * log2 P(Sv)

with P(Sv) estimated by relative frequency as before. The SplitInformation term introduces a penalty for partitioning the examples into many small, equally likely groups.
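A minimal sketch of Gain versus GainRatio on a tiny made-up dataset (the attribute names Date and Outlook, the labels, and the helper functions below are illustrative assumptions, not part of the original notes):

    # Sketch: Gain vs. GainRatio; a unique-valued Date-like attribute gets
    # maximal Gain but a large SplitInformation, so its GainRatio drops.
    import math
    from collections import Counter

    def entropy(labels):
        """H(S) for a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, attr, labels):
        """Gain(S, A) = H(S) - sum_v P(Sv) * H(Sv)."""
        n = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(label)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    def split_information(rows, attr):
        """SplitInformation(S, A) = -sum_v P(Sv) * log2 P(Sv)."""
        n = len(rows)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(row[attr] for row in rows).values())

    def gain_ratio(rows, attr, labels):
        return gain(rows, attr, labels) / split_information(rows, attr)

    rows = [{"Date": f"d{i}", "Outlook": o}
            for i, o in enumerate(["Sunny", "Sunny", "Rain", "Rain"])]
    labels = ["Yes", "Yes", "No", "No"]
    for a in ("Date", "Outlook"):
        print(a, round(gain(rows, a, labels), 3), round(gain_ratio(rows, a, labels), 3))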
3. Handling continuous valued attributes:
Note that the range of a continuous valued attribute can be partitioned into a discrete number of disjoint intervals; we can then test examples for membership in these intervals.
If the Temperature attribute in the PlayTennis example took on continuous values in the range 40-90, note that we cannot just use the same approach as before. Why? Because Temperature becomes a bad choice for classification: it alone may perfectly classify the training examples and therefore promise the highest information gain, as in the earlier Date example, while remaining a poor predictor on the test set.
The solution to this problem is to classify based not on the actual temperature, but on dynamically determined intervals within which the temperature falls. For instance, we can introduce boolean attributes T<=a, a<T<=b, b<T<=c and T>c instead of the real valued T; the thresholds a, b and c can be computed dynamically from boundaries where the classification is likely or known to change.
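Here is a small sketch of how such thresholds might be found, assuming candidate thresholds are placed midway between adjacent sorted values where the classification changes and the one with the highest information gain is kept (the temperature values and labels are made up for illustration):

    # Sketch: choose a threshold for the boolean test Temperature <= t.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Return (threshold, gain) maximizing information gain for value <= threshold."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = (None, -1.0)
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 == y2:              # only boundaries where the class changes matter
                continue
            t = (v1 + v2) / 2
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            g = base - len(left) / len(pairs) * entropy(left) \
                     - len(right) / len(pairs) * entropy(right)
            if g > best[1]:
                best = (t, g)
        return best

    temps = [40, 48, 60, 72, 80, 90]
    plays = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temps, plays))   # picks the boundary near 54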
4. Handling missing attribute values:
What happens if some of the training examples contain one or more ``?'' entries, meaning ``value not known'', instead of actual attribute values?
Here are some common ad hoc solutions:
- Substitute ``?'' by the most common value in that column.
- Substitute ``?'' by the most common value among all training examples that have been sorted into the tree at that node.
- Substitute ``?'' by the most common value among all training examples that have been sorted into the tree at that node and share the same classification as the incomplete example.
- etc.
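As a sketch of the first heuristic only (the rows and the attribute name are invented), one could replace ``?'' entries with the most common known value in that column:

    # Sketch: fill '?' entries of one attribute with the column's most common value.
    from collections import Counter

    def impute_most_common(rows, attr):
        known = [r[attr] for r in rows if r[attr] != "?"]
        most_common = Counter(known).most_common(1)[0][0]
        return [{**r, attr: most_common} if r[attr] == "?" else r for r in rows]

    rows = [{"Outlook": "Sunny"}, {"Outlook": "?"},
            {"Outlook": "Rain"}, {"Outlook": "Sunny"}]
    print(impute_most_common(rows, "Outlook"))   # the '?' becomes 'Sunny'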
5. Handling attributes with differing costs:
Sometimes we want to introduce additional bias against the selection of certain attributes, e.g. ``wherever possible, try not to use InvasiveTest as an attribute, since this might inconvenience the patient.''
Here is another ad hoc heuristic to fix this: introduce a user-defined measure Cost(A) to specify a fixed bias against each attribute.
Now we can use a CostedGain(S,A), which is defined along the lines of

    CostedGain(S,A) = (2^Gain(S,A) - 1) / (Cost(A) + 1)^w

where w in [0,1] is a user-chosen constant that determines how strongly cost counts against gain.
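A quick sketch of this kind of cost-sensitive selection, assuming the form written above (the gains, costs, and attribute names are invented for illustration):

    # Sketch: trade information gain off against attribute cost.
    def costed_gain(gain, cost, w=0.5):
        """(2^Gain - 1) / (Cost + 1)^w; w in [0, 1] sets how much cost matters."""
        return (2 ** gain - 1) / ((cost + 1) ** w)

    candidates = {
        "Temperature":  (0.45, 1.0),    # (Gain(S, A), Cost(A))
        "InvasiveTest": (0.60, 50.0),
    }
    for attr, (g, c) in candidates.items():
        print(attr, round(costed_gain(g, c), 3))
    # The expensive InvasiveTest scores lower despite its higher raw gain.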