
THE LEARNING TASK

  • Consider a Markov decision process (MDP) in which the agent can perceive a set S of distinct states of its environment and has a set A of actions that it can perform.
  • At each discrete time step t, the agent senses the current state st, chooses a current action at, and performs it.
  • The environment responds by giving the agent a reward rt = r(st, at) and by producing the succeeding state st+1 = δ(st, at). Here the functions δ(st, at) and r(st, at) depend only on the current state and action, and not on earlier states or actions.

The task of the agent is to learn a policy, π: S → A, for selecting its next action at based on the current observed state st; that is, π(st) = at.
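To make this interaction loop concrete, here is a minimal Python sketch of a few time steps. The particular states, actions, δ, r, and the random policy π below are made-up placeholders; only the structure of the loop (sense st, choose at = π(st), receive rt and st+1) follows the description above.

```python
import random

# Minimal sketch of the agent-environment interaction loop described above.
# The states, actions, delta() and r() here are illustrative placeholders.

S = ["s0", "s1", "s2"]    # a hypothetical set of states
A = ["left", "right"]     # a hypothetical set of actions

def delta(s, a):
    """Transition function δ(s, a): depends only on the current state and action."""
    i = S.index(s)
    return S[(i + (1 if a == "right" else -1)) % len(S)]

def r(s, a):
    """Reward function r(s, a): here, reward 1 only for stepping right out of s2."""
    return 1.0 if (s == "s2" and a == "right") else 0.0

def pi(s):
    """A policy π: S → A.  A random choice stands in for a learned policy."""
    return random.choice(A)

s = "s0"
for t in range(5):                     # a few discrete time steps
    a = pi(s)                          # choose a_t = π(s_t)
    reward = r(s, a)                   # environment returns r_t = r(s_t, a_t)
    s_next = delta(s, a)               # and the successor state s_{t+1} = δ(s_t, a_t)
    print(f"t={t}: s={s}, a={a}, r={reward}, s'={s_next}")
    s = s_next
```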

How shall we specify precisely which policy π we would like the agent to learn?


1. One approach is to require the policy that produces the greatest possible cumulative reward for the agent over time.
  • To state this requirement more precisely, define the cumulative value Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial state st as follows:

      Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σ_{i=0}^{∞} γⁱ rt+i

  • Here the sequence of rewards rt+i is generated by beginning at state st and repeatedly using the policy π to select actions.
  • The constant 0 ≤ γ ≤ 1 determines the relative value of delayed versus immediate rewards. If we set γ = 0, only the immediate reward is considered; as we set γ closer to 1, future rewards are given greater emphasis relative to the immediate reward.
  • The quantity Vπ(st) is called the discounted cumulative reward achieved by policy π from initial state st. It is reasonable to discount future rewards relative to immediate rewards because, in many cases, we prefer to obtain the reward sooner rather than later.
2. Another definition of total reward is the finite-horizon reward,

      Σ_{i=0}^{h} rt+i

which considers the un-discounted sum of rewards over a finite number of steps h.

3. Another approach is the average reward,

      lim_{h→∞} (1/h) Σ_{i=0}^{h} rt+i

which considers the average reward per time step over the entire lifetime of the agent. (The three reward measures are compared numerically in the sketch below.)
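As a quick comparison of the three measures, the sketch below evaluates each of them on a hypothetical reward sequence. The rewards, the discount factor, and the horizon h are made-up values chosen only for illustration; the formulas are the ones stated above.

```python
# Compare the three reward measures on a hypothetical reward sequence.
rewards = [0, 0, 100, 0, 0, 0]   # r_t, r_{t+1}, ... (illustrative values)
gamma = 0.9                      # discount factor, 0 <= gamma <= 1
h = 4                            # finite horizon (illustrative)

# 1. Discounted cumulative reward: sum_i gamma^i * r_{t+i}
discounted = sum(gamma**i * r for i, r in enumerate(rewards))

# 2. Finite-horizon reward: un-discounted sum over the first h steps
finite_horizon = sum(rewards[:h])

# 3. Average reward: average reward per time step
average = sum(rewards) / len(rewards)

print(discounted, finite_horizon, average)   # ≈ 81.0, 100, ≈ 16.67
```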

We require that the agent learn a policy π that maximizes Vπ(st) for all states st. Such a policy is called an optimal policy, and we denote it by π*.
We refer to the value function Vπ*(s) of an optimal policy as V*(s). V*(s) gives the maximum discounted cumulative reward that the agent can obtain starting from state s.
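In symbols, the definition above can be written as follows (a standard restatement of the two sentences above, not an additional assumption):

```latex
\pi^{*} \equiv \operatorname*{arg\,max}_{\pi} \; V^{\pi}(s) \quad (\forall s),
\qquad
V^{*}(s) \equiv V^{\pi^{*}}(s).
```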



Example:

 

A simple grid-world environment is depicted in the diagram.

  • The six grid squares in this diagram represent six possible states, or locations, for the agent.
  • Each arrow in the diagram represents a possible action the agent can take to move from one state to another.
  • The number associated with each arrow represents the immediate reward r(s, a) the agent receives if it executes the corresponding state-action transition.
  • The immediate reward in this environment is defined to be zero for all state-action transitions except for those leading into the state labelled G. We call G the goal state: the agent can receive a reward only by entering this state.

Once the states, actions, and immediate rewards are defined, we need only choose a value for the discount factor γ in order to determine the optimal policy π* and its value function V*(s).


Let’s choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy for this setting.

 

The values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ = 0.9. An optimal policy, corresponding to the actions with maximal Q values, is also shown.

The discounted future reward from the bottom centre state is

0 + γ·100 + γ²·0 + γ³·0 + ... = 90
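To check this number, here is a small Python sketch of the grid world as described: six states on a 2×3 grid, a goal state G in one corner (taken here to be the top-right square), and an immediate reward of 100 for any transition entering G, 0 otherwise, with G treated as absorbing. The exact layout and the reward value 100 are assumptions reconstructed from the example; V*(s) is computed with a standard value-iteration sweep, which this section does not describe but which is one convenient way to obtain V* and the Q values mentioned above.

```python
# Sketch of the grid world described above (layout and the reward of 100 into
# the goal state G are assumptions reconstructed from the example).
gamma = 0.9
rows, cols = 2, 3
G = (0, 2)                                   # goal state, assumed top-right
states = [(row, col) for row in range(rows) for col in range(cols)]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def actions(s):
    """Actions that keep the agent on the grid; the absorbing goal state has none."""
    if s == G:
        return []
    return [a for a, (dr, dc) in moves.items()
            if 0 <= s[0] + dr < rows and 0 <= s[1] + dc < cols]

def delta(s, a):
    dr, dc = moves[a]
    return (s[0] + dr, s[1] + dc)

def r(s, a):
    return 100.0 if delta(s, a) == G else 0.0

# Value iteration: V*(s) = max_a [ r(s, a) + gamma * V*(delta(s, a)) ]
V = {s: 0.0 for s in states}
for _ in range(50):
    V = {s: max((r(s, a) + gamma * V[delta(s, a)] for a in actions(s)),
                default=0.0)
         for s in states}

# Q(s, a) = r(s, a) + gamma * V*(delta(s, a)); an optimal policy picks argmax_a Q(s, a).
policy = {s: max(actions(s), key=lambda a: r(s, a) + gamma * V[delta(s, a)])
          for s in states if s != G}

print(V[(1, 1)])   # bottom-centre state: 0 + 0.9 * 100 = 90.0
print(V)           # V* for all six states
print(policy)      # one optimal policy for gamma = 0.9
```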
