Rule post-pruning is a successful method for finding high-accuracy hypotheses
- Rule post-pruning involves the following steps:
- Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
- Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
- Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.
- Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
For example, consider the PlayTennis decision tree. The leftmost path of the tree (Outlook = Sunny, then Humidity = High) is translated into the rule:
IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No
Given the above rule, rule post-pruning would consider removing the preconditions (Outlook = Sunny) and (Humidity = High).
- It would select whichever of these pruning steps produced the greatest improvement in estimated rule accuracy, then consider pruning the remaining precondition as a further pruning step.
- No pruning step is performed if it reduces the estimated rule accuracy (a small sketch of this procedure is given below).
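A minimal sketch of this pruning loop in Python, assuming each rule is a list of (attribute, value) preconditions, each example is a dict that includes the target attribute "PlayTennis", and accuracy is estimated on a held-out validation set (these representation choices are illustrative assumptions, not prescribed above):

```python
def rule_accuracy(preconditions, conclusion, examples):
    """Estimated accuracy: fraction of covered examples the rule classifies correctly."""
    covered = [ex for ex in examples
               if all(ex[attr] == val for attr, val in preconditions)]
    if not covered:
        return 0.0
    correct = sum(1 for ex in covered if ex["PlayTennis"] == conclusion)
    return correct / len(covered)

def post_prune_rule(preconditions, conclusion, validation_examples):
    """Greedily remove preconditions as long as estimated accuracy does not decrease."""
    preconditions = list(preconditions)
    best_acc = rule_accuracy(preconditions, conclusion, validation_examples)
    improved = True
    while improved and preconditions:
        improved = False
        # Evaluate the removal of each remaining precondition.
        candidates = [(rule_accuracy(preconditions[:i] + preconditions[i + 1:],
                                     conclusion, validation_examples), i)
                      for i in range(len(preconditions))]
        acc, i = max(candidates)          # best single removal
        if acc >= best_acc:               # never take a step that reduces accuracy
            best_acc = acc
            del preconditions[i]
            improved = True
    return preconditions, best_acc

# The rule for the leftmost path of the PlayTennis tree:
rule = [("Outlook", "Sunny"), ("Humidity", "High")]
# pruned, acc = post_prune_rule(rule, "No", validation_examples)
```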
There are three main advantages to converting the decision tree to rules before pruning:
- Converting to rules allows distinguishing among the different contexts in which a decision node is used. Because each distinct path through the decision node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
- Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. It thus avoids messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
- Converting to rules improves readability. Rules are often easier for people to understand.
1. Incorporating Continuous-Valued Attributes
- Define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Use thresholds for splitting nodes.
What threshold-based Boolean attribute should be defined based on Temperature?
- Pick a threshold, c, that produces the greatest information gain
- In the current example, there are two candidate thresholds, taken midway between the adjacent values of Temperature at which the value of PlayTennis changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85.
- The information gain can then be computed for each of the candidate attributes, Temperature > 54 and Temperature > 85, and the best one selected (here, Temperature > 54).
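A small Python sketch of this threshold search (the six Temperature/PlayTennis pairs below are assumed for illustration; the text only fixes the two change points that yield thresholds 54 and 85):

```python
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_for_threshold(values, labels, threshold):
    """Information gain of the Boolean attribute (value > threshold)."""
    above = [l for v, l in zip(values, labels) if v > threshold]
    below = [l for v, l in zip(values, labels) if v <= threshold]
    n = len(labels)
    remainder = (len(above) / n) * entropy(above) + (len(below) / n) * entropy(below)
    return entropy(labels) - remainder

def candidate_thresholds(values, labels):
    """Midpoints between adjacent values where the target label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [40, 48, 60, 72, 80, 90]            # assumed example data
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
for c in candidate_thresholds(temperature, play_tennis):
    print(c, round(gain_for_threshold(temperature, play_tennis, c), 3))
# Prints the gains for Temperature > 54 and Temperature > 85;
# the former is larger, so Temperature > 54 is selected.
```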
2. Handling Attributes with Many Values
- Problem: if an attribute has many possible values, Gain will tend to select it.
- Example: consider the attribute Date, which has a very large number of possible values (e.g., March 4, 1979).
- If this attribute is added to the PlayTennis data, it would have the highest information gain of any of the attributes. This is because Date alone perfectly predicts the target attribute over the training data. Thus, it would be selected as the decision attribute for the root node of the tree and lead to a tree of depth one, which perfectly classifies the training data.
- This decision tree with root node Date is not a useful predictor: although it perfectly separates the training data, it predicts poorly on subsequent examples.
One Approach: Use GainRatio instead of Gain
The gain ratio measure penalizes such attributes by incorporating a term called split information, which is sensitive to how broadly and uniformly the attribute splits the data:
SplitInformation(S, A) = − Σ (i = 1 to c) (|Si| / |S|) · log2(|Si| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
where Si is the subset of S for which attribute A has value vi.
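A sketch of these measures in Python, assuming each example is a dict of attribute values and the target attribute name is passed in:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(examples, attribute, target):
    """Gain(S, A): expected reduction in entropy from splitting on A."""
    n = len(examples)
    by_value = {}
    for ex in examples:
        by_value.setdefault(ex[attribute], []).append(ex[target])
    remainder = sum((len(labels) / n) * entropy(labels) for labels in by_value.values())
    return entropy([ex[target] for ex in examples]) - remainder

def split_information(examples, attribute):
    """SplitInformation(S, A): entropy of S with respect to the values of A."""
    n = len(examples)
    sizes = Counter(ex[attribute] for ex in examples).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)

def gain_ratio(examples, attribute, target):
    si = split_information(examples, attribute)
    return gain(examples, attribute, target) / si if si > 0 else 0.0
```

Note that for an attribute like Date, which gives each of n training examples its own value, SplitInformation(S, A) equals log2(n), so even a perfect Gain is sharply discounted by GainRatio.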
3. Handling Training Examples with Missing Attribute Values
The available data may contain missing values for some attributes. Example: medical diagnosis
- <Fever = true, Blood-Pressure = normal, …, Blood-Test = ?, …>
- Sometimes values are truly unknown; sometimes they are omitted because measuring them is low priority or too costly.
Strategies for dealing with a missing attribute value (sketched in code below):
- If node n tests A, assign the most common value of A among the other training examples sorted to node n.
- Assign the most common value of A among the training examples at node n that have the same target value.
- Assign a probability pi to each possible value vi of A, rather than simply assigning the most common value to A(x); the example is then passed down each branch with fractional weight pi.
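Minimal Python sketches of the three strategies, assuming missing values are stored as None and examples are dicts (the helper names are hypothetical):

```python
from collections import Counter

def most_common_value(examples, attribute):
    """Most common known value of the attribute among the given examples."""
    known = [ex[attribute] for ex in examples if ex[attribute] is not None]
    return Counter(known).most_common(1)[0][0] if known else None

def fill_with_node_majority(example, attribute, examples_at_node):
    """Strategy 1: most common value of A among examples sorted to node n."""
    if example[attribute] is None:
        example = dict(example, **{attribute: most_common_value(examples_at_node, attribute)})
    return example

def fill_with_class_majority(example, attribute, examples_at_node, target):
    """Strategy 2: most common value of A among examples at node n with the same target value."""
    if example[attribute] is None:
        same_class = [ex for ex in examples_at_node if ex[target] == example[target]]
        example = dict(example, **{attribute: most_common_value(same_class, attribute)})
    return example

def value_distribution(examples_at_node, attribute):
    """Strategy 3: probabilities p_i for each observed value v_i of A; the example
    is then passed down each branch with the corresponding fractional weight p_i."""
    known = [ex[attribute] for ex in examples_at_node if ex[attribute] is not None]
    counts = Counter(known)
    n = len(known)
    return {value: count / n for value, count in counts.items()}
```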
4. Handling Attributes with Differing Costs
- In some learning tasks the instance attributes may have associated costs.
- For example, in learning to classify medical diseases, patients are described in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc.
- These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort.
- We would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
How to Learn a Consistent Tree with Low Expected Cost?
One approach is to replace Gain with a Cost-Normalized-Gain measure.
Examples of normalization functions include Gain²(S, A) / Cost(A) (Tan and Schlimmer) and (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the relative importance of cost versus gain (Nunez).
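A small sketch of these two normalization functions in Python; Gain(S, A) and Cost(A) are assumed to be computed elsewhere:

```python
def tan_schlimmer(gain, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]
    controlling the relative importance of cost versus information gain."""
    return (2 ** gain - 1) / (cost + 1) ** w

# The attribute chosen at a node is the one maximizing the chosen
# cost-normalized measure instead of raw Gain.
```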