Reinforcement Learning Demystified: Markov Decision Processes (Part 2)

In the previous blog post, Reinforcement Learning Demystified: Markov Decision Processes (Part 1), I explained the Markov Decision Process and Bellman equation without mentioning how to get the optimal policy or optimal value function.

Bellman Expectation Equation

As we mentioned before, the state-value function can be decomposed into the immediate reward Rt+1 and the discounted value of the successor state 𝛾V𝜋(St+1) under policy 𝜋:

V𝜋(s) = E𝜋[ Rt+1 + 𝛾V𝜋(St+1) | St = s ]

The idea is wherever you are, you take one step and you get your immediate reward for that step, and then you look at the value where you end up. The sum of those things together tells you how good it was to be in your original state.

In this case, you start in state S, following policy 𝜋. The value of being in that state is the immediate reward you get, added to the value of the successor state, if you know you are going to follow policy 𝜋 from that state onwards.

Similarly, the action-value function can be decomposed:

q𝜋(s, a) = E𝜋[ Rt+1 + 𝛾q𝜋(St+1, At+1) | St = s, At = a ]

If I’m in state S, and I take action a from there, I’ll get an immediate reward for that action, and then I’ll look at where I end up, and I can ask, what is the action-value in the state I end up in under the action I’ll pick from that point onwards. The sum of those things together tells us how good it was to take that action from that particular state.

Since we have multiple actions from one state S, and the policy defines a probability distribution over those actions, we are gonna average, and that is the Bellman expectation equation:

V𝜋(s) = ∑_a 𝜋(a | s) q𝜋(s, a)

From a particular state S, there are multiple actions, I’m gonna average over the actions that I might take. There is a probability that I’m gonna take the first action and another probability that I’m gonna take the second action and so on. This probability distribution is defined by a policy 𝜋.

When we reach q (the action value) for the action that we took, it tells us how good it is to take that action from that state. Averaging over possible action-values tells us how good it is to be in state S.

q𝜋(s, a) = R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) V𝜋(s′)

After I take that action, I’m committed to it. The environment might blow me left or right, since there is stochasticity in the environment. For each of the situations I might get blown into, I want to ask: how good is it? What is the value of being in that situation, following my policy onwards?

So we are gonna average over the possible things that might happen, i.e. the possible successor states the agent might land in, multiplying the value under policy 𝜋 of each state we might land in by the probability that we land in it. Summing all those things together tells us how good it was to take that particular action from that particular state S.
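
As a quick numeric sketch of that averaging step — the reward, transition probabilities, and successor values below are made up for illustration, not taken from any figure:

```python
import numpy as np

# One-step lookahead: q_pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) * V_pi(s')
gamma = 0.9
reward = 2.0                   # immediate reward R(s, a), hypothetical
P = np.array([0.3, 0.7])       # P(s' | s, a) over two successor states
V = np.array([1.0, 4.0])       # V_pi(s') for those successor states

q = reward + gamma * P.dot(V)  # 2.0 + 0.9 * (0.3*1.0 + 0.7*4.0) = 4.79
print(q)
```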

Remember, V𝜋(s) is telling us how good it is to be in a particular state, and q𝜋(s, a) is telling us how good it is to take a particular action from a given state.

Stitching the Bellman expectation equation together for V𝜋(s):

V𝜋(s) = ∑_a 𝜋(a | s) [ R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) V𝜋(s′) ]

At the root, we have the value function for state S. It tells us how good it is to be in that particular state. We consider all the actions we might take next, and all the things the environment might do to us after we take some action. For each of the things the environment might do, there are successor states, and we might land in one of them.

We want to know how good it is to be in the state we landed in and carry on with our usual policy, i.e. how much reward we’ll get for carrying on from that point onwards. We are gonna average over all of those together: we weight each first-level arc by the probability the policy assigns, and each second-level arc by the probability the environment assigns, and this gives us the value of being in the root.
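
That stitched equation can be turned directly into iterative policy evaluation: sweep the update until the values stop changing. The 2-state, 2-action MDP below is hypothetical, just to show the shape of the computation:

```python
import numpy as np

# Iterative policy evaluation with the stitched Bellman expectation equation:
# V(s) <- sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'], hypothetical
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],                        # R[s, a], hypothetical
              [2.0, -1.0]])
pi = np.full((n_states, n_actions), 0.5)         # uniform random policy

V = np.zeros(n_states)
for _ in range(1000):                            # sweep until convergence
    q = R + gamma * P.dot(V)                     # q[s, a]: one-step lookahead
    V_new = (pi * q).sum(axis=1)                 # average actions under pi
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```

Because the update is a 𝛾-contraction, the sweep converges to V𝜋 from any starting point.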

Stitching the Bellman expectation equation together for q𝜋(s, a):

q𝜋(s, a) = R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) ∑_a′ 𝜋(a′ | s′) q𝜋(s′, a′)

Starting from an action a, we can now look ahead, considering where the environment might blow us after we take that action. First we get the immediate reward for our action, and then we average over the possible states we might land in, i.e. the value of each state we might land in multiplied by a probability the environment selects.

Then, from the state we’re blown to, we consider which action we might take next. There is a probability distribution over the possible actions from there, and we average over those as well.

Student MDP

For our state space, we evaluate the state in red. The policy defines a uniform probability distribution for the two possible actions, we’re gonna weight each of the things that might happen after taking one action by 0.5.

For the action “Study”, we’re gonna weight it by 0.5 multiplied by the immediate reward for that action, and since the state we’re gonna land in is the terminal state, i.e. it has zero value, the action value will be just the probability of the action multiplied by the reward for that action.

For the second action that we might take “Pub”, we’re gonna weight the things that might happen after taking that action by the probability that we take that action.

First, we get the immediate reward for that action, +1, added to an average over possible successor states. From the chance node, we might go to one state, or another, or return to the state where we started.

We multiply the value of each state we might land in after taking the action by the probability that we land in it, which is defined by the environment. The sum of those things together gives us the value of being in our original state.
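
To make the arithmetic concrete, here is the red state’s value checked numerically. The numbers (𝛾 = 1, uniform policy, successor values −1.3, 2.7, 7.4, transition probabilities 0.2/0.4/0.4 for “Pub”) are the ones from David Silver’s Student MDP figure, so treat them as given rather than derived here:

```python
# Bellman expectation equation at the red state of the Student MDP.
gamma = 1.0

# Action "Study": reward +10, lands in the terminal state (value 0).
q_study = 10 + gamma * 0

# Action "Pub": reward +1, then the environment picks the successor state.
q_pub = 1 + gamma * (0.2 * (-1.3) + 0.4 * 2.7 + 0.4 * 7.4)

# Uniform random policy: average the two action values.
v = 0.5 * q_study + 0.5 * q_pub
print(round(v, 1))   # 7.4 -- matches the value shown for the red state
```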

The Bellman expectation equation can be expressed concisely using the induced MRP:

V𝜋 = R𝜋 + 𝛾P𝜋V𝜋

We can solve it directly,

V𝜋 = (I − 𝛾P𝜋)⁻¹ R𝜋
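
The direct solution is one linear solve. As a sketch, with a hypothetical induced P𝜋 and R𝜋 (already averaged over the policy):

```python
import numpy as np

# Direct solution of the induced MRP: V_pi = (I - gamma * P_pi)^(-1) * R_pi
gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])     # P_pi[s, s'], hypothetical
R_pi = np.array([1.0, 0.5])       # expected immediate reward per state

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)

# Sanity check: V satisfies the Bellman expectation equation.
assert np.allclose(V, R_pi + gamma * P_pi.dot(V))
```

The solve costs O(n³) in the number of states, which is why iterative methods take over for large MDPs.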

Optimal Value Function

The optimal state-value function V*(s) is the maximum value function over all policies.

V*(s) = max_𝜋 V𝜋(s)

It’s the best possible solution of an MDP. Of all the different policies we could follow in our MDP, all we care about is the best of them, i.e. the maximum possible reward we can extract from the MDP.

The optimal action-value function q*(s, a) is the maximum action-value function over all policies.

q*(s, a) = max_𝜋 q𝜋(s, a)

For the state–action pair (s, a), this function gives the expected return for taking action a in state S, and thereafter following an optimal policy, i.e. the maximum amount of rewards you can extract starting in state S, and taking action a.

If we know q*(s, a), then we’re basically done. It tells us the right action to take. The optimal value function specifies the best possible performance in the MDP. An MDP is solved when we know the optimal value function.


Optimal Policy

What is the best possible way to behave in an MDP? A policy is just a stochastic mapping from states to actions; we want to know the best one of those mappings.

We are gonna define a partial ordering over different policies:

𝜋 ≥ 𝜋′ if V𝜋(s) ≥ V𝜋′(s), ∀s

One policy is better than another policy if its value function is greater than or equal to the value function of the other policy in all states. For any Markov Decision Process, there exists an optimal policy 𝜋* that is better than or equal to all other policies, 𝜋* ≥ 𝜋, ∀𝜋.

It’s possible to have more than one optimal policy, e.g. if there are two separate actions that take us to the same state, it doesn’t matter which one of those we take.

Remember, all optimal policies achieve the optimal value function, and all optimal policies achieve the optimal action-value function.


Finding an Optimal Policy

An optimal policy can be found by maximizing over q*(s, a), i.e. picking the action that maximizes q*(s, a):

𝜋*(a | s) = 1 if a = argmax_a q*(s, a), 0 otherwise

If you are in some state S, you just pick the action a with probability 1 that maximizes q*(s, a), and that is the action that will give you maximum possible rewards.

There is always a deterministic optimal policy for any MDP. If we know q*(s, a), we immediately have the optimal policy.
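
Reading the deterministic policy off a q* table is just an argmax per state. The q-values below are made up for illustration:

```python
import numpy as np

# pi*(s) = argmax_a q*(s, a): pick the highest-valued action in each row.
q_star = np.array([[1.0, 3.0],    # q*[s, a] for 3 states, 2 actions
                   [2.5, 0.5],
                   [0.0, 4.0]])

pi_star = np.argmax(q_star, axis=1)   # best action index per state
print(pi_star)                        # [1 0 1]
```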

Student MDP

Here, we have different actions from every state. Notice that we pick the action that has the maximum value among different actions in a given state. That gives us the optimal policy in this state space.


Bellman Optimality Equation

The optimal value functions are recursively related by the Bellman optimality equation. The Bellman optimality equation for V*:

V*(s) = max_a q*(s, a)

What is the optimal value of being in state S? We consider each of the actions we might take, and an action will take us to one of the chance nodes. We look at the action value for each action.

Instead of taking an average like in the Bellman expectation equation, we take the maximum of q*(s, a), and that tells us how good it is to be in state S.

The Bellman optimality equation for q*:

q*(s, a) = R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) V*(s′)

We can do a one-step lookahead, but this time we look ahead at what the environment might do to us; remember, we don’t control what the environment might do.

Each of the states we might end up in has some optimal value, and we average over them. We don’t get to pick where the environment blows us; we have to average over all the things the environment might do to us, and that tells us how good our action is.

Stitching the Bellman optimality equation together for V*(s):

V*(s) = max_a [ R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) V*(s′) ]

This is a recursive equation that we can solve. We are looking ahead at the action we can take, and maximizing over those actions. We’re also looking ahead at the dice the environment might roll, we don’t control the dice, and we average over those things together.
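
Applying exactly this backup repeatedly is value iteration. A minimal sketch on a hypothetical 2-state, 2-action MDP:

```python
import numpy as np

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'], hypothetical
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],                        # R[s, a], hypothetical
              [2.0, -1.0]])

V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.max(R + gamma * P.dot(V), axis=1)   # greedy one-step backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)   # converges to V*, the optimal state-value function
```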

Stitching the Bellman optimality equation together for q*(s, a):

q*(s, a) = R(s, a) + 𝛾 ∑_s′ P(s′ | s, a) max_a′ q*(s′, a′)

The same goes for optimal action value function, we are already committed to the action that we took, so we got the reward associated with that action, and then the environment might blow us to different states.

We average over the optimal action values of those states we might land in, i.e. the best action that we can take in each of the successor states, and that gives us the value of the action that we took.
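
The same kind of backup can be run directly on q*: commit to the reward, average over the environment, maximize at the next step. The MDP below is hypothetical:

```python
import numpy as np

# Q-value iteration:
# Q(s, a) <- R(s, a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s', a')
n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'], hypothetical
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],                        # R[s, a], hypothetical
              [2.0, -1.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Average over successor states, maximize over the next action.
    Q_new = R + gamma * P.dot(np.max(Q, axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new
print(np.max(Q, axis=1))   # equals V* obtained from the V-form backup
```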

(Figure: Student MDP — Bellman optimality at a state with actions “Facebook” and “Study”.)

The value of this state is the maximum over the two possible actions. We can take the “Facebook” action and get a reward of -1, added to the optimal value of the state we land in, or take the “Study” action and get a reward of -2, added to the value of the state we land in.

Maximizing over those gives us the maximum expected sum of rewards we can get from our original state.


Solving The Bellman Optimality Equation

The Bellman optimality equation is non-linear; there is no closed-form solution in general, but there are many iterative solution methods we can apply, e.g. value iteration, policy iteration, Q-learning, and Sarsa.

In the next blog post, I’ll discuss how to solve the Bellman optimality equation using dynamic programming, with code examples, and that will enable us to do planning in an MDP.


The Intuition behind Bellman Equation

Remember, in the first blog post we talked about Atari games and how an RL agent learned to play Breakout like a human being after 400 training episodes.

In this example, q* and V* are telling the agent the maximum score it can get. Imagine that you are the agent: you’ve got the screen, and you ask, what is the maximum score I can get from this screen?

The principle of optimality, discussed later, tells us how to get the maximum score: we have to behave optimally for one step, and then behave optimally for the remainder of the trajectory.

Now all we need to do is to figure out how to behave optimally for one step, and the way to behave optimally for one step is to maximize over those optimal value functions in the places we might end up in.

The intuition is that, just by breaking the trajectory into those two parts — no, rather, into the optimal decision for one step and the optimal decisions from then onwards — we can describe what it means to behave optimally for the whole problem.


References

Richard Sutton’s Intro to Reinforcement Learning
