
Q LEARNING

How can an agent learn an optimal policy π* for an arbitrary environment?

The training information available to the learner is the sequence of immediate rewards r(s_i, a_i) for i = 0, 1, 2, .... Given this kind of training information, it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.


What evaluation function should the agent attempt to learn?

One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1) > V*(s2), because the cumulative future reward will be greater from s1.

The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ.
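Writing δ(s, a) for the state that results from executing action a in state s (the usual notation for the deterministic state-transition function), this can be expressed as

π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]        (3)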

The Q Function

The evaluation function Q(s, a) is defined as the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.
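In symbols,

Q(s, a) = r(s, a) + γ V*(δ(s, a))        (4)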


Rewriting Equation (3) in terms of Q(s, a) gives
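π*(s) = argmax_a Q(s, a)        (5)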

As Equation (5) makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a). Notably, it can do this without knowing the reward function r or the state-transition function δ.
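As a small illustration (the Q-table values and the state/action indices below are made up for the example), selecting the greedy action from a learned table requires nothing more than an argmax over Q(s, ·):

```python
import numpy as np

# Hypothetical learned Q-table: rows are states, columns are actions
# (the numbers are chosen only for illustration).
Q = np.array([
    [72.9, 81.0, 65.6],   # state 0
    [81.0, 90.0, 72.9],   # state 1
])

def greedy_action(state: int) -> int:
    """Equation (5): the optimal action simply maximizes Q(s, a)."""
    return int(np.argmax(Q[state]))

print(greedy_action(0))  # -> 1, the action with the largest Q value in state 0
```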

An Algorithm for Learning Q

  • Learning the Q function corresponds to learning the optimal policy.
  • The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation.

Noting that V*(s) = max_a' Q(s, a'), Equation (4) can be rewritten as a recursive definition of Q:
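Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')        (6)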
Q learning algorithm:
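In outline, the procedure is as follows (a reconstruction of the standard deterministic Q learning algorithm; the original table is not reproduced here):

  • For each s, a initialize the table entry Q̂(s, a) to zero.
  • Observe the current state s.
  • Do forever:
      • Select an action a and execute it.
      • Receive the immediate reward r.
      • Observe the new state s'.
      • Update the table entry for Q̂(s, a) using the training rule

            Q̂(s, a) ← r + γ max_a' Q̂(s', a')        (7)

      • s ← s'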

 

  • The Q learning algorithm above assumes deterministic rewards and actions. The discount factor γ may be any constant such that 0 ≤ γ < 1.
  • Q̂ denotes the learner's estimate, or hypothesis, of the actual Q function.
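Below is a minimal sketch of this procedure in Python. The grid-world environment, its reward values, and the helper names (step, N_STATES, N_ACTIONS) are assumptions made purely for illustration; the table update on the marked line is what implements the training rule of Equation (7).

```python
import random

# Toy deterministic environment (an assumption for illustration):
# states 0..5 laid out in a row; action 0 = left, action 1 = right.
# Entering the rightmost state (5) yields reward 100, everything else 0.
N_STATES, N_ACTIONS = 6, 2
GAMMA = 0.9  # discount factor, 0 <= gamma < 1

def step(s, a):
    """Deterministic transition delta(s, a) and immediate reward r(s, a)."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 100 if (s_next == N_STATES - 1 and s != N_STATES - 1) else 0
    return s_next, reward

# Initialize every table entry Q_hat(s, a) to zero.
Q_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

s = 0  # observe the current state
for _ in range(10_000):
    a = random.randrange(N_ACTIONS)              # select an action and execute it
    s_next, r = step(s, a)                       # receive r, observe the new state s'
    # Training rule (Equation 7): Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
    Q_hat[s][a] = r + GAMMA * max(Q_hat[s_next])
    s = s_next if s_next != N_STATES - 1 else 0  # restart after reaching the goal

print(Q_hat[4])  # Q_hat for state 4: moving right should approach 100
```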

An Illustrative Example

  • To illustrate the operation of the Q learning algorithm, consider a single action taken by an agent and the corresponding refinement to Q̂, shown in the figure below.

  • The agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition. 
  • It applies the training rule of Equation (7) to refine its estimate Q̂ for the state-action transition it just executed.

  • According to the training rule, the new Q̂ estimate for this transition is the sum of the received reward (zero) and the highest Q̂ value associated with the resulting state (100), discounted by γ (0.9).
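Writing s1 for the state the agent starts in, s2 for the state it moves into, and a_right for the move-right action (names used here only for illustration, since the figure itself is not reproduced), the update works out to

Q̂(s1, a_right) ← r + γ max_a' Q̂(s2, a')
               = 0 + 0.9 × 100
               = 90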
