The basic design issues and approaches to machine learning are illustrated by designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament. The design involves the following choices:
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
- Estimating training values
- Adjusting the weights
1. Choosing the Training Experience
- The first design choice is to choose the type of training experience from which the system will learn.
- The type of training experience available can have a significant impact on success or failure of the learner.
Three attributes of the training experience affect the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
For example, in the checkers game:
Direct training examples consist of individual checkers board states and the correct move for each.
Indirect training examples consist of the move sequences and final outcomes of various games played. In this case, the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome. Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves.
Hence, learning from direct training feedback is typically easier than learning from indirect feedback.
2. The degree to which the learner controls the sequence of training examples
For example, in the checkers game:
The learner might depend on the teacher to select informative board states and to provide the correct move for each.
Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present.
3. How well it represents the distribution of examples over which the final system performance P must be measured
For example, in the checkers game:
In the checkers learning scenario, the performance metric P is the percent of games the system wins in the world tournament.
If its training experience E consists only of games played against itself, there is a danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from the one on which the final system will be evaluated. Learning is most reliable when the training examples follow a distribution similar to that of the future test examples.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board state.
The program needs only to learn how to choose the best move from among these legal moves.
Because the program must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state.
1. Let ChooseMove be the target function, with the notation
ChooseMove : B → M
which indicates that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
ChooseMove is an obvious choice for the target function in the checkers example, but this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
2. An alternative target function is an evaluation function that assigns a numerical score to any given board state
Let V be this target function, with the notation
V : B → R
which denotes that V maps any legal board state from the set B to some real value.
We intend for this target function V to assign higher scores to better board states. If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position.
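One way to read the last sentence: generate every board reachable by a legal move and pick the one that V scores highest. The following is a minimal sketch under stated assumptions. The names choose_move, legal_successors, and evaluate are hypothetical, and integers stand in for real checkers boards so the example stays runnable.

```python
from typing import Callable, Iterable, TypeVar

Board = TypeVar("Board")


def choose_move(
    board: Board,
    legal_successors: Callable[[Board], Iterable[Board]],
    evaluate: Callable[[Board], float],
) -> Board:
    """Return the successor board that the evaluation function scores highest."""
    return max(legal_successors(board), key=evaluate)


# Toy stand-ins (assumptions for illustration, not checkers rules):
# a "board" is an integer and a "move" changes it by -1, +1, or +2.
def toy_successors(board: int) -> list:
    return [board - 1, board + 1, board + 2]


def toy_value(board: int) -> float:
    # Higher is better; boards closer to 10 score higher.
    return -abs(board - 10)


best = choose_move(7, toy_successors, toy_value)
print(best)  # 9, the successor closest to 10
```

With a learned evaluation function in place of toy_value, the same selection step would turn board scores into move choices.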
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
- If b is a final board state that is won, then V(b) = 100
- If b is a final board state that is lost, then V(b) = -100
- If b is a final board state that is drawn, then V(b) = 0
- If b is not a final state in the game, then V(b) = V(b'),
where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
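The recursive definition above can be sketched on a tiny hand-built game tree. This is only an illustration: the tree, the state names, and the reading of "playing optimally" as both sides playing optimally (a minimax search) are assumptions. Searching a real checkers game to the end this way would be far too expensive, which is why the later steps learn an approximation instead.

```python
WIN, LOSS, DRAW = 100, -100, 0

# Hypothetical toy game: each non-final state maps to its successor states.
TREE = {
    "start": ["a", "b"],
    "a": ["a_win", "a_loss"],
    "b": ["b_draw_1", "b_draw_2"],
}

# Final states and their values as defined in the list above.
FINAL = {"a_win": WIN, "a_loss": LOSS, "b_draw_1": DRAW, "b_draw_2": DRAW}


def target_value(state: str, our_turn: bool = True) -> int:
    """Value of a state assuming both players play optimally to the end."""
    if state in FINAL:
        return FINAL[state]
    values = [target_value(s, not our_turn) for s in TREE[state]]
    # We pick the best outcome for us; the opponent picks the worst for us.
    return max(values) if our_turn else min(values)


print(target_value("start"))  # 0: moving to "a" lets the opponent force a loss
```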
3. Choosing a Representation for the Target Function
Let’s choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
- x1: the number of black pieces on the board
- x2: the number of red pieces on the board
- x3: the number of black kings on the board
- x4: the number of red kings on the board
- x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
- x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where
- w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
- Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board
- The weight w0 will provide an additive constant to the board value
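The linear representation described above can be written in a few lines. This is a minimal sketch: the function name v_hat and the weight values are illustrative assumptions, not learned values.

```python
from typing import Sequence


def v_hat(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear board evaluation: w0 + w1*x1 + ... + w6*x6.

    features: the board features (x1, ..., x6).
    weights:  (w0, w1, ..., w6), where w0 is the additive constant.
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected one more weight than features")
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


# Made-up weights for illustration only.
weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
board = (3, 0, 1, 0, 0, 0)  # x1 = 3 black pieces, x3 = 1 black king
score = v_hat(board, weights)
print(score)  # 0.5 + 1.0*3 + 2.0*1 = 5.5
```

Because the representation is just seven numbers, learning the evaluation function reduces to choosing values for w0 through w6.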
4. Choosing a Function Approximation Algorithm
In order to learn the target function V we require a set of training examples, each describing a specific board state b and the training value Vtrain(b) for b.
Each training example is an ordered pair of the form (b, Vtrain(b)).
For instance, the following training example describes a board state b in which black has won the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target function value Vtrain(b) is therefore +100.
((x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100)
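One possible way to hold such an ordered pair in code is shown below. The class names BoardFeatures and TrainingExample are hypothetical; only the feature meanings and the example values come from the text.

```python
from typing import NamedTuple


class BoardFeatures(NamedTuple):
    x1: int  # number of black pieces
    x2: int  # number of red pieces
    x3: int  # number of black kings
    x4: int  # number of red kings
    x5: int  # black pieces threatened by red
    x6: int  # red pieces threatened by black


class TrainingExample(NamedTuple):
    board: BoardFeatures
    v_train: float  # training value Vtrain(b)


# The example from the text: red has no pieces left, so black has won.
example = TrainingExample(
    BoardFeatures(x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), 100.0
)
print(example.board.x2, example.v_train)  # 0 100.0
```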
Function Approximation Procedure
1. Derive training examples from the indirect training experience available to the learner
2. Adjust the weights wi to best fit these training examples
1. Estimating training values
A simple approach for estimating training values for intermediate board states is to set the training value Vtrain(b) of any intermediate board state b to V̂(Successor(b)),
where
- V̂ is the learner's current approximation to V
- Successor(b) denotes the next board state following b for which it is again the program's turn to move
Rule for estimating training values
Vtrain(b) ← V̂ (Successor(b))
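The rule can be applied to the trace of one finished game. The sketch below assumes the trace lists, in order, the feature vectors of the boards at which it was the program's turn, so that the next entry plays the role of Successor(b). The final board receives the known game outcome (+100, -100, or 0). The names v_hat and training_values and the weight values are illustrative assumptions.

```python
def v_hat(features, weights):
    """Current linear approximation: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def training_values(trace, final_value, weights):
    """Pair each board in the trace with its estimated training value."""
    examples = []
    for i, board in enumerate(trace):
        if i + 1 < len(trace):
            # Intermediate board: Vtrain(b) <- V_hat(Successor(b)).
            target = v_hat(trace[i + 1], weights)
        else:
            # Final board: use the actual outcome of the game.
            target = final_value
        examples.append((board, target))
    return examples


# Made-up two-board trace and weights, for illustration only.
weights = [0, 1, -1, 2, -2, 0, 0]
trace = [(3, 1, 0, 0, 0, 0), (3, 0, 1, 0, 0, 0)]
examples = training_values(trace, final_value=100, weights=weights)
print(examples)
```

Here the first board is assigned the current estimate for the second board (3 + 2 = 5), and the final, won board is assigned 100.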
2. Adjusting the weights
The remaining task is to specify the learning algorithm for choosing the weights wi to best fit the set of training examples {(b, Vtrain(b))}.
A first step is to define what we mean by the best fit to the training data.
One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis.
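Written out, this error is the sum over all training examples of (Vtrain(b) - V̂(b))². A minimal sketch of the computation follows, with hypothetical function names and made-up weights and examples.

```python
def v_hat(features, weights):
    """Linear hypothesis: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def squared_error(examples, weights):
    """Sum of squared differences between training and predicted values."""
    return sum(
        (v_train - v_hat(features, weights)) ** 2
        for features, v_train in examples
    )


# Illustrative data: (features, Vtrain) pairs and an arbitrary weight vector.
weights = [0, 1, -1, 2, -2, 0, 0]
examples = [
    ((3, 0, 1, 0, 0, 0), 100),  # predicted 5, error 95
    ((3, 1, 0, 0, 0, 0), 5),    # predicted 2, error 3
]
print(squared_error(examples, weights))  # 95**2 + 3**2 = 9034
```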
Several algorithms are known for finding weights of a linear function that minimize E. One such algorithm is the least mean squares (LMS) training rule. For each observed training example, it adjusts the weights a small amount in the direction that reduces the error on this training example.
LMS weight update rule :- For each training example (b, Vtrain(b))
Use the current weights to calculate V̂ (b)
For each weight wi, update it as
wi ← wi + η (Vtrain(b) - V̂(b)) xi
Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.
Working of weight update rule
- When the error (Vtrain(b)- V̂(b)) is zero, no weights are changed.
- When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of V̂(b), reducing the error.
- If the value of some feature xi is zero, then its weight is not altered regardless of the error, so that the only weights updated are those whose features actually occur on the training example board.
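A single LMS step can be sketched as follows. One assumption goes beyond the text: the constant weight w0 is updated by pairing it with a fixed feature x0 = 1, which is a common convention. The function names and the step size used in the example are illustrative.

```python
def v_hat(features, weights):
    """Linear hypothesis: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, v_train, eta=0.1):
    """Return new weights after one LMS step on one training example."""
    error = v_train - v_hat(features, weights)
    # Assumption: x0 = 1 is prepended so that w0 is updated like the others.
    x = (1,) + tuple(features)
    return [w + eta * error * xi for w, xi in zip(weights, x)]


# Start from all-zero weights and the won board from the earlier example.
weights = [0.0] * 7
board = (3, 0, 1, 0, 0, 0)
new_weights = lms_update(weights, board, v_train=100, eta=0.01)

error_before = 100 - v_hat(board, weights)
error_after = 100 - v_hat(board, new_weights)
print(new_weights)
print(error_before, error_after)
```

In this run only w0, w1, and w3 change, because x2, x4, x5, and x6 are zero on this board, and the error on the example shrinks after the step. A smaller step size than 0.1 is used here purely to keep the illustration from overshooting.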