How can an agent learn an optimal policy π* for an arbitrary environment? The training information available to the learner is the sequence of immediate rewards r(s_i, a_i) for i = 0, 1, 2, .... Given this kind of training information, it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.
What evaluation function should the agent attempt to learn? One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1) > V*(s2), because the cumulative future reward will be greater from s1. The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ.
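Written out, and assuming δ(s, a) denotes the state that results from applying action a in state s (a notational shorthand introduced here for clarity), this appears to be the rule referred to below as Equation (3): π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))].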
The Q Function
The evaluation function Q(s, a) gives the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.
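Using the same δ(s, a) shorthand for the successor state, this definition can be written as Q(s, a) = r(s, a) + γ V*(δ(s, a)).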
Rewriting Equation (3) in terms of Q(s, a) gives Equation (5): π*(s) = argmax_a Q(s, a). As Equation (5) makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a).
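As a small illustration of this selection rule, the following Python sketch picks the action that maximizes Q(s, a) from a table of estimates; the dictionary-based table and the particular states, actions, and values are illustrative assumptions, not details from the text.

```python
# Sketch of the greedy selection rule of Equation (5): pick argmax_a Q(s, a).
# The Q-table layout and the example values below are illustrative assumptions.
def greedy_action(q_table, state, actions):
    """Return the action a that maximizes Q(state, a)."""
    return max(actions, key=lambda a: q_table[(state, a)])

# Example: with Q(s1, right) = 90 and Q(s1, up) = 72, the greedy choice is "right".
q_table = {("s1", "right"): 90.0, ("s1", "up"): 72.0}
print(greedy_action(q_table, "s1", ["right", "up"]))  # -> right
```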
An Algorithm for Learning Q
- Learning the Q function corresponds to learning the optimal policy.
- The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation.
- The Q learning algorithm (sketched below) assumes deterministic rewards and actions. The discount factor γ may be any constant such that 0 ≤ γ < 1.
- The symbol 𝑄̂ refers to the learner's estimate, or hypothesis, of the actual Q function.
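The following Python sketch shows one way the algorithm just described might look for a deterministic environment, using the training rule 𝑄̂(s, a) ← r + γ max_a' 𝑄̂(s', a'). The 1-D grid world, reward structure, and episode count are illustrative assumptions, not details taken from the text.

```python
import random

GAMMA = 0.9          # discount factor, any constant with 0 <= GAMMA < 1
N_CELLS = 4          # cells 0..3 in a 1-D grid, with cell 3 as the goal state
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic successor state for taking action a in state s."""
    return min(s + 1, N_CELLS - 1) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """Deterministic immediate reward: 100 for entering the goal cell, else 0."""
    return 100.0 if s != N_CELLS - 1 and step(s, a) == N_CELLS - 1 else 0.0

# Initialize each estimate Q-hat(s, a) to zero.
q_hat = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}

for _ in range(500):                      # repeat for many episodes
    s = random.randrange(N_CELLS - 1)     # start in a random non-goal cell
    while s != N_CELLS - 1:               # act until the goal state is reached
        a = random.choice(ACTIONS)        # select and execute some action a
        r = reward(s, a)                  # receive the immediate reward r(s, a)
        s_next = step(s, a)               # observe the new state s'
        # Training rule: Q-hat(s, a) <- r + GAMMA * max_a' Q-hat(s', a')
        q_hat[(s, a)] = r + GAMMA * max(q_hat[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(q_hat[(0, "right")])  # approaches 81.0 (= 0.9^2 * 100) under these assumptions
```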
An Illustrative Example
- To illustrate the operation of the Q learning algorithm, consider a single action taken by an agent and the corresponding refinement to 𝑄̂ shown in the figure below.
- The agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition.
- Applying the training rule, the new 𝑄̂ estimate for this transition is the sum of the received reward (zero) and the highest 𝑄̂ value associated with the resulting state (100), discounted by γ (0.9), as worked out below.
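Writing s1 for the agent's original cell, s2 for the cell it moves into, and a_right for the move (labels chosen here purely for illustration), the update works out to 𝑄̂(s1, a_right) ← r + γ max_a' 𝑄̂(s2, a') = 0 + 0.9 × 100 = 90.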