How can an agent learn an optimal policy π* for an arbitrary environment? The training information available to the learner is the sequence of immediate rewards r(s_i, a_i) for i = 0, 1, 2, .... Given this kind of training information, it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.
What evaluation function should the agent attempt to learn? One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1) > V*(s2), because the cumulative future reward will be greater from s1. The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ.
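Written out, and assuming δ(s, a) denotes the state that results from applying action a in state s (a notational shorthand introduced here for clarity), this appears to be the rule referred to below as Equation (3): π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))].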
The Q Function
The evaluation function Q(s, a) gives the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.
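Using the same δ(s, a) shorthand for the successor state, this definition can be written as Q(s, a) = r(s, a) + γ V*(δ(s, a)).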
Rewriting Equation (3) in terms of Q(s, a) gives Equation (5): π*(s) = argmax_a Q(s, a). As Equation (5) makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a).
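As a small illustration of this selection rule, the following Python sketch picks the action that maximizes Q(s, a) from a table of estimates; the dictionary-based table and the particular states, actions, and values are illustrative assumptions, not details from the text.

```python
# Sketch of the greedy selection rule of Equation (5): pick argmax_a Q(s, a).
# The Q-table layout and the example values below are illustrative assumptions.
def greedy_action(q_table, state, actions):
    """Return the action a that maximizes Q(state, a)."""
    return max(actions, key=lambda a: q_table[(state, a)])

# Example: with Q(s1, right) = 90 and Q(s1, up) = 72, the greedy choice is "right".
q_table = {("s1", "right"): 90.0, ("s1", "up"): 72.0}
print(greedy_action(q_table, "s1", ["right", "up"]))  # -> right
```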
An Algorithm for Learning Q
- Learning the Q function corresponds to learning the optimal policy.
- The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation.
- The Q learning algorithm (sketched below) assumes deterministic rewards and actions. The discount factor γ may be any constant such that 0 ≤ γ < 1.
- The symbol 𝑄̂ refers to the learner's estimate, or hypothesis, of the actual Q function.
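The following Python sketch shows one way the algorithm just described might look for a deterministic environment, using the training rule 𝑄̂(s, a) ← r + γ max_a' 𝑄̂(s', a'). The 1-D grid world, reward structure, and episode count are illustrative assumptions, not details taken from the text.

```python
import random

GAMMA = 0.9          # discount factor, any constant with 0 <= GAMMA < 1
N_CELLS = 4          # cells 0..3 in a 1-D grid, with cell 3 as the goal state
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic successor state for taking action a in state s."""
    return min(s + 1, N_CELLS - 1) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """Deterministic immediate reward: 100 for entering the goal cell, else 0."""
    return 100.0 if s != N_CELLS - 1 and step(s, a) == N_CELLS - 1 else 0.0

# Initialize each estimate Q-hat(s, a) to zero.
q_hat = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}

for _ in range(500):                      # repeat for many episodes
    s = random.randrange(N_CELLS - 1)     # start in a random non-goal cell
    while s != N_CELLS - 1:               # act until the goal state is reached
        a = random.choice(ACTIONS)        # select and execute some action a
        r = reward(s, a)                  # receive the immediate reward r(s, a)
        s_next = step(s, a)               # observe the new state s'
        # Training rule: Q-hat(s, a) <- r + GAMMA * max_a' Q-hat(s', a')
        q_hat[(s, a)] = r + GAMMA * max(q_hat[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(q_hat[(0, "right")])  # approaches 81.0 (= 0.9^2 * 100) under these assumptions
```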
An Illustrative Example
- To illustrate the operation of the Q learning algorithm, consider a single action taken by an agent and the corresponding refinement to 𝑄̂ shown in the figure below.
- The agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition.
- Applying the training rule, the new 𝑄̂ estimate for this transition is the sum of the received reward (zero) and the highest 𝑄̂ value associated with the resulting state (100), discounted by γ (0.9), as worked out below.
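Writing s1 for the agent's original cell, s2 for the cell it moves into, and a_right for the move (labels chosen here purely for illustration), the update works out to 𝑄̂(s1, a_right) ← r + γ max_a' 𝑄̂(s2, a') = 0 + 0.9 × 100 = 90.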