Reinforcement Learning: Solving MDPs with Dynamic Programming

In the previous two episodes, I illustrated the key concepts and ideas behind MDPs, and how they are used to model an environment in the reinforcement learning problem. In this episode, I’ll cover how to solve an MDP, with code examples, which will allow us to do prediction and control in any given MDP.

Brace yourself, this blog post is a bit longer than any of the previous ones, so grab your coffee and just dive in.

Dynamic Programming

Dynamic Programming (DP) is a method for solving complex problems by breaking them down into subproblems, solving the subproblems, and combining the solutions to the subproblems to solve the overall problem.

DP is a very general solution method for problems that have two properties. The first is “optimal substructure”, where the principle of optimality applies (discussed later). It tells us that we can solve some overall problem by breaking it down into two or more pieces, solving each of the pieces, and the optimal solutions to the pieces tell us how to get the optimal solution to the overall problem.

E.g. if your problem is to get from some point in space to the nearest wall, you will try to figure out the shortest path to the wall. DP tells us that the shortest path to the wall is the shortest path to a midpoint between you and the wall combined with the shortest path from that midpoint to the wall.

The second property is “overlapping subproblems”, which means that the same subproblems occur again and again. We broke down the problem of getting to the wall into first getting to the midpoint, and then getting to the wall. That second piece also helps us solve the subproblem of how to get from another point to the wall, so solutions can be cached and reused.

Markov Decision Processes satisfy both properties. The Bellman equation gives us the recursive decomposition (the first property): it tells us how to break down the optimal value function into two pieces, the optimal behavior for one step followed by the optimal behavior after that step.

What gives us the overlapping subproblems property in MDPs is the value function. It’s like a cache of the good information we figured out about the MDP.

E.g. once I’ve figured out the optimal path to the wall, I can store that in a value function V(s), remembering what I did to get the optimal path to the wall. Now if I want to work out anything that relies on V(s), I’ve already cached this information, and I don’t need to recompute it.

Dynamic programming assumes full knowledge of the MDP; it’s used in planning. There are two main problems we tackle in a given MDP. If someone gives us the MDP, M = (S, A, P, R, 𝛾), and a policy 𝜋 (or an MRP, M = (S, P, R, 𝛾)), we can do prediction, i.e. evaluate the given policy to get the value function under that policy.

The other main problem is: given an MDP, M = (S, A, P, R, 𝛾), find the optimal value function and the optimal policy. This is full optimization, it’s what we are after, and it will allow us to do control in a given MDP. What we are trying to figure out is the best thing to do, and this is what we mean when we talk about solving an MDP.

Policy Evaluation

In policy evaluation, we’re given an MDP and a policy 𝜋, and we try to evaluate that policy. We can do this through iterative application of the Bellman expectation equation; later we’ll use the Bellman optimality equation to do control.

We’re gonna start off with an initial value function, e.g. zero for all states, do a one-step lookahead using the Bellman expectation equation to figure out a new value function, and iterate this process many times for all states. At the end, we’ll end up with the true value function for the given policy.

One step Lookahead

To do this we’re gonna use synchronous backups, which means that at each iteration k+1, we consider all states s ∈ S and update Vk+1(s), the current iteration, from Vk(s′), where s′ is a successor state of s.

Formally, we’re gonna start with our previous value function, which is initially all zeros, apply the iterative update to every single state to produce a completely new value function for all states, and carry on with this process. This is called synchronous evaluation.

In DP, we’re gonna turn the Bellman expectation equation into an iterative update,

Iterative update

We’re gonna define the value function of the current iteration by plugging the previous iteration’s values in at the leaves, and then iterate this process. This is guaranteed to converge to the true value function for the given policy.
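Written out explicitly (my reconstruction of the update in the notation M = (S, A, P, R, 𝛾) from the previous episodes), the iterative Bellman expectation backup is:

```latex
V_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} \, V_k(s') \right)
```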

Evaluating a random policy in a small Gridworld

Consider a 4×4 grid as shown below,

Gridworld

The non-terminal states are S = {1, 2, …, 14}. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The reward is -1 on all transitions until the terminal state is reached. The terminal state is shaded in the figure (although it’s shown in two places, it’s formally one state). The agent follows a uniform random policy,

Here is the convergence of iterative policy evaluation, shown in the left column, on the small gridworld,

Fig. 1

Here is the code for policy evaluation,
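In case the embedded snippet doesn’t render, here is a minimal self-contained sketch of iterative policy evaluation (my own NumPy reconstruction, assuming the 4×4 gridworld described above) that produces the same values:

```python
import numpy as np

N = 4                                  # grid side; states 0 and 15 are terminal
n_states, n_actions = N * N, 4         # actions: 0=up, 1=right, 2=down, 3=left
gamma = 1.0                            # undiscounted episodic task

def step(s, a):
    """Deterministic gridworld transition: returns (next_state, reward)."""
    if s in (0, n_states - 1):         # terminal states absorb with 0 reward
        return s, 0.0
    row, col = divmod(s, N)
    if a == 0:   row = max(row - 1, 0)          # up
    elif a == 1: col = min(col + 1, N - 1)      # right
    elif a == 2: row = min(row + 1, N - 1)      # down
    else:        col = max(col - 1, 0)          # left
    return row * N + col, -1.0

def policy_evaluation(policy, theta=1e-5):
    """Iterative policy evaluation with synchronous backups."""
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                s_next, r = step(s, a)
                V_new[s] += policy[s, a] * (r + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:   # stop when the sweep barely changes V
            return V_new
        V = V_new

random_policy = np.full((n_states, n_actions), 0.25)   # uniform random policy
V = policy_evaluation(random_policy)
print(V)
print(np.round(V).astype(int))
```

The inner double loop is exactly the Bellman expectation backup applied to every state from the previous sweep’s values.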

The output is,

[  0.         -13.99989315 -19.99984167 -21.99982282 -13.99989315
 -17.99986052 -19.99984273 -19.99984167 -19.99984167 -19.99984273
 -17.99986052 -13.99989315 -21.99982282 -19.99984167 -13.99989315
   0.        ]
[  0 -14 -20 -22 -14 -18 -20 -20 -20 -20 -18 -14 -22 -20 -14   0]

Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function V𝜋 for an arbitrary deterministic policy 𝜋. For some state s we would like to know whether or not we should change the policy to deterministically choose an action a ≠ 𝜋(s).

We know how good it is to follow the current policy from s, that is V𝜋(s), but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting a in s and thereafter following the existing policy 𝜋.

The value of this way of behaving is,
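In symbols, this is the action-value of taking a in s and then following 𝜋 (my reconstruction in the notation of the previous episodes):

```latex
q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} \, V_\pi(s')
```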

The key criterion is whether this is greater than or less than V𝜋(s). If it’s greater, that is, if it’s better to select a once in s and thereafter follow 𝜋 than it would be to follow 𝜋 all the time, then one would expect it to be better still to select a every time s is encountered, and the new policy would in fact be a better one overall.

So formally, given a policy 𝜋, we’re going to evaluate it, i.e. we’re going to figure out a value function for that policy like in the left column in Fig. 1,

value function

and then we improve the policy by acting greedily with respect to V𝜋 like we did on the right column, where we look ahead in each direction and pick the best action, that’s what it means to act greedily,
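In symbols, acting greedily means picking, in every state, the action with the largest action-value under the current value function:

```latex
\pi'(s) = \operatorname*{argmax}_{a \in A} q_\pi(s, a)
```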

As shown in the figure below, we start off with some arbitrary value function and some policy. We’re gonna do policy evaluation on the up arrows and policy improvement on the down arrows,

We evaluate the initial policy on the first arrow. Once we’ve evaluated that policy, we act greedily with respect to its value function to get a new policy. Once we’ve got that new policy, we evaluate it on the third arc, which gives us a new value function. Once we have that new value function, we act greedily with respect to it to get a new policy, and so forth.

In the small gridworld, the improved policy was optimal after the third iteration. In general, we need more iterations of improvement/evaluation: in interesting problems, we go around these two steps of evaluation and improvement again and again, and this process of policy iteration always converges to the optimal policy 𝜋*.

The idea behind the proof of the policy improvement theorem is easy to understand,

As we act greedily, we pick the best action in state s, whose value is the reward for that action plus the discounted value of the successor state under policy 𝜋. This greedy action value is at least the value of s under 𝜋, and it only gets better if, in the successor state, we again replace the value under 𝜋 with the greedy action value there. Applying this idea to all states and all steps, we end up with the value function of the greedy policy 𝜋′, which is at least as good as that of the previous policy 𝜋.

If improvement stops, that means the greedy action value equals the value of the current state under the current policy,
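That is, in every state the greedy choice no longer changes anything:

```latex
\max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = V_\pi(s)
```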

Then the Bellman optimality equation has been satisfied,
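i.e. the value of each state equals the value of the best action from it:

```latex
V_\pi(s) = \max_{a \in A} q_\pi(s, a)
```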

therefore V𝜋(s) = V*(s) for all s ∈ S, so 𝜋 is an optimal policy.

Here is the code for policy iteration on the gridworld,
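In case the embedded snippet doesn’t render, here is a self-contained sketch of policy iteration (my own NumPy reconstruction of the same gridworld; the printed formatting differs from the original output):

```python
import numpy as np

N = 4
n_states, n_actions = N * N, 4         # actions: 0=up, 1=right, 2=down, 3=left
gamma = 1.0

def step(s, a):
    """Deterministic gridworld transition: returns (next_state, reward)."""
    if s in (0, n_states - 1):         # terminal states absorb with 0 reward
        return s, 0.0
    row, col = divmod(s, N)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: col = min(col + 1, N - 1)
    elif a == 2: row = min(row + 1, N - 1)
    else:        col = max(col - 1, 0)
    return row * N + col, -1.0

def q_values(s, V):
    """One-step lookahead: q(s, a) for every action."""
    return np.array([r + gamma * V[s2]
                     for s2, r in (step(s, a) for a in range(n_actions))])

def policy_evaluation(policy, theta=1e-5):
    V = np.zeros(n_states)
    while True:
        V_new = np.array([policy[s] @ q_values(s, V) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def policy_iteration():
    policy = np.full((n_states, n_actions), 0.25)   # start from the random policy
    while True:
        V = policy_evaluation(policy)               # evaluation step
        greedy = np.zeros_like(policy)              # improvement step: act greedily
        greedy[np.arange(n_states),
               [np.argmax(q_values(s, V)) for s in range(n_states)]] = 1.0
        if np.array_equal(greedy, policy):          # policy stable -> optimal
            return policy, V
        policy = greedy

policy, V = policy_iteration()
print(policy)                                       # one-hot greedy policy
print(np.round(V).astype(int).reshape(N, N))        # optimal values
```

Ties in the greedy step are broken by the lowest action index, which is what produces the particular arrows shown in the output above.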

The output is,

Policy Probability Distribution:
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]
Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):
[[0 3 3 2]
 [0 0 0 2]
 [0 0 1 2]
 [0 1 1 0]]
Value Function:
[ 0. -1. -2. -3. -1. -2. -3. -2. -2. -3. -2. -1. -3. -2. -1.  0.]
Reshaped Grid Value Function:
[[ 0. -1. -2. -3.]
 [-1. -2. -3. -2.]
 [-2. -3. -2. -1.]
 [-3. -2. -1.  0.]]

Modified Policy Iteration

In the gridworld, just a few iterations of the Bellman expectation equation were really enough to come up with a better policy. We didn’t need to evaluate the policy perfectly before acting greedily with respect to it; in fact, all the additional iterations were a waste of time. Can we truncate our evaluation process and use approximate policy evaluation rather than exact policy evaluation?

Does policy evaluation need to converge to V𝜋, or can we introduce a stopping condition? The basic idea is to stop when we get close enough to the value function we care about: look at how much the Bellman updates change the value function, and when the change drops below some tiny amount ε, just stop. This is called ε-convergence of the value function.

But even this will probably be a waste of time too. We can do something more naive and simply stop after k iterations of iterative policy evaluation; e.g. in the small gridworld, the third iteration was sufficient to achieve the optimal policy.
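As a sketch of the idea (using the matrix form of the backup on a hypothetical two-state chain, not the gridworld code), truncated evaluation just caps the number of Bellman expectation sweeps:

```python
import numpy as np

def truncated_evaluation(P_pi, R_pi, V, gamma=1.0, k=3):
    """Run exactly k sweeps of the Bellman expectation backup,
    V <- R_pi + gamma * P_pi @ V, instead of iterating to convergence
    (the 'stop after k iterations' variant of modified policy iteration)."""
    for _ in range(k):
        V = R_pi + gamma * P_pi @ V
    return V

# Hypothetical 2-state chain: state 0 steps to state 1 with reward -1;
# state 1 is absorbing with reward 0.
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
R_pi = np.array([-1.0, 0.0])
print(truncated_evaluation(P_pi, R_pi, np.zeros(2), k=3))   # [-1.  0.]
```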

Value Iteration

Any optimal policy can be subdivided into two components: an optimal first action, followed by an optimal policy from the successor state s′. If our first action is optimal and after that we follow an optimal policy from wherever we end up, then the overall behavior is optimal.

The principle of optimality theorem says,

A policy 𝜋(a|s) achieves the optimal value from state s, i.e. V𝜋(s)=V*(s), if and only if, for any state s′ reachable from s, 𝜋 achieves the optimal value from state s′, V𝜋(s′) = V*(s′).

So we can break down what it means to have an optimal policy in terms of having the optimal policy from all the states we can end up in. Once we have that, we just need to do a one-step lookahead to figure out the best first action, and we’re done. This is the principle of optimality applied to policies.

If we know the solution to the subproblem V*(s′), i.e. the optimal value function from s′, then V*(s) can be found by one step lookahead,

Optimal value function
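In symbols, this one-step lookahead is the Bellman optimality equation (my reconstruction in the notation of the previous episodes):

```latex
V_*(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} \, V_*(s') \right)
```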

In value iteration, we iteratively apply the Bellman optimality equation to get the optimal value function. At each iteration k+1, we update Vk+1(s) from Vk(s′) for all states s ∈ S. Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy. The convergence rate is independent of where we start off.

At every iteration, each state gets a turn to be the root. We start off with our old value function Vk(s′), put it in at the leaves, and turn the Bellman optimality equation into an iterative update.

In the gridworld, the intuition is that we start with the final rewards and work backward. Initially, we start off with a zero value function. In the next iteration, after applying the Bellman optimality equation, we end up with -1 for each non-terminal state, which is the result of maximizing over the possible actions in that state: the -1 is the immediate reward plus the discounted value of the successor state, which was zero in the previous iteration.

Here is the code for value iteration in a gridworld,
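In case the embedded snippet doesn’t render, here is a self-contained sketch of value iteration (my own NumPy reconstruction, same gridworld assumptions as before), with one final greedy step to read off a policy:

```python
import numpy as np

N = 4
n_states, n_actions = N * N, 4         # actions: 0=up, 1=right, 2=down, 3=left
gamma = 1.0

def step(s, a):
    """Deterministic gridworld transition: returns (next_state, reward)."""
    if s in (0, n_states - 1):         # terminal states absorb with 0 reward
        return s, 0.0
    row, col = divmod(s, N)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: col = min(col + 1, N - 1)
    elif a == 2: row = min(row + 1, N - 1)
    else:        col = max(col - 1, 0)
    return row * N + col, -1.0

def value_iteration(theta=1e-5):
    """Iteratively apply the Bellman optimality backup until convergence."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([max(r + gamma * V[s2]
                              for s2, r in (step(s, a) for a in range(n_actions)))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

V = value_iteration()
# No policy is maintained during the iterations; recover one by a final greedy step.
policy = np.array([np.argmax([r + gamma * V[s2]
                              for s2, r in (step(s, a) for a in range(n_actions))])
                   for s in range(n_states)])
print(V.reshape(N, N))
print(policy.reshape(N, N))
```

Note the only difference from policy evaluation is the `max` over actions in place of the expectation under a policy.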

The output is,

Policy Probability Distribution:
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]
Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):
[[0 3 3 2]
 [0 0 0 2]
 [0 0 1 2]
 [0 1 1 0]]
Value Function:
[ 0. -1. -2. -3. -1. -2. -3. -2. -2. -3. -2. -1. -3. -2. -1.  0.]
Reshaped Grid Value Function:
[[ 0. -1. -2. -3.]
 [-1. -2. -3. -2.]
 [-2. -3. -2. -1.]
 [-3. -2. -1.  0.]]

Synchronous Dynamic Programming algorithms

From David Silver’s class

In the prediction problem, we’re trying to figure out how much reward we get from some given policy. To figure that out we use the Bellman expectation equation and turn it into an iterative update. So we end up with iterative policy evaluation.

In control, there are two different approaches. The first was to use the Bellman expectation equation to evaluate our policy iteratively and alternate between policy evaluation and policy improvement, which gave us the policy iteration algorithm.

The second way was to use the Bellman optimality equation and turn it into an iterative update to end up with the value iteration algorithm.

Here is a link to the above-mentioned algorithms in case you want to run them or play with them on your machine.

In the next blog post, I’ll discuss how to do prediction in a model-free reinforcement learning problem.

Previous Episodes

References

Richard Sutton’s Intro to Reinforcement Learning
