## Reinforcement Learning Demystified: Model-Free Prediction Part 1.

In the last blog post, we talked about the reinforcement learning problem when we have a model of the environment, and we applied dynamic programming algorithms to do prediction and control.

In this blog post, we will talk about the same problems, prediction and control, but this time we don’t have a model of the environment, i.e. we don’t know its dynamics.

Starting with prediction: in model-free prediction, we estimate the value function of an unknown MDP. If someone gives us a policy for behaving in this MDP, we want to figure out how much reward we’re going to get by following that policy, entirely model-free.

### Monte Carlo Learning

The idea of these algorithms is fairly straightforward. The agent will learn directly from episodes of experience; we don’t need knowledge of MDP transitions or rewards. We estimate the value function by simply taking the mean return over episodes.

In one episode I got a return of 5, and in the next episode I got a return of 7, so we can estimate the value function to be (5 + 7) / 2 = 6 by taking the average. Note that MC only works for episodic MDPs, i.e. every episode must terminate.

### Monte Carlo Policy Evaluation

In Monte Carlo (MC) evaluation, the goal is to learn V_{π} from episodes of experience under policy π. The episodes of experience consist of states, actions, and rewards: S_{1}, A_{1}, R_{2}, …, S_{k} ~ π.

Recall that the return is the total discounted reward from time step t:

G_{t} = R_{t+1} + γR_{t+2} + … = Σ_{k=0}^{∞} γ^{k} R_{t+k+1}

We want to estimate that return from each state at different time steps. The value function is the expected return:

V_{π}(s) = E_{π}[G_{t} | S_{t} = s]

Imagine a robot that’s gonna wander around for a while. We’re gonna look at the episodes of experience, i.e. the stream of states, actions, and rewards that we sampled from our policy, and look at the discounted rewards that we got. That will give us an estimate of the value function under our policy.

So, in MC policy evaluation we use the empirical mean instead of the expected return. We’re gonna collect as many samples as we can from each of the states.

### First-visit Monte Carlo policy evaluation

This is the first algorithm and it’s a very intuitive idea. Imagine there’s a loop in the MDP, i.e. you come back to the same state repeatedly. To evaluate a state s, we only consider the first time step *t* at which s is visited in an episode, and increment a counter for that state: N(s) ← N(s) + 1.

From a programming point of view, you can think of it as a flag for each state: the first time the state is visited, the flag is set. States that are never visited are not considered in the calculations.

Then, at the end of the episode, we add the return from that first visit to a running total: S(s) ← S(s) + G_{t}. Now we can take the mean return to get an estimated value of that state: V(s) = S(s) / N(s).

You might wonder why all of that works. There is a well-known law in probability theory called “the law of large numbers”. Simply put, it says that when we take the mean of a bunch of samples of a random variable, that mean approaches the true expectation as the number of samples goes to infinity.

If we just randomly sample enough episodes, MC learning will converge to the true value function for our policy as we get more and more samples: V(s) → V_{π}(s) as N(s) → ∞.
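As a sketch, first-visit MC evaluation fits in a few lines of Python. Here I assume episodes are represented as lists of (state, reward) pairs, where the reward is the one received after leaving that state; the representation and function name are mine, not from the post:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    Each episode is a list of (state, reward) pairs, where `reward`
    is the reward received after leaving `state`.
    """
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # total returns S(s)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # Only the FIRST visit to each state counts in this episode.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}
```

With two episodes that both visit some state, its estimated value is simply the average of the two observed returns from that state onwards.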

### Every-visit Monte Carlo policy evaluation

This is the second algorithm. It’s exactly like the previous one, except that instead of accounting only for the first visit, we account for every time step t at which a state is visited in an episode: for each of these visits we increment the counter N(s) ← N(s) + 1 and the total return S(s) ← S(s) + G_{t}, where G_{t} is the return from that time step onwards.

The value is estimated by the mean return V(s) = S(s) / N(s). Again, MC learning will converge to the true value function for our policy as we get more and more samples: V(s) → V_{π}(s) as N(s) → ∞.
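A minimal sketch of the every-visit variant, again assuming episodes are lists of (state, reward) pairs (a representation I chose for illustration). The only difference from first-visit MC is that there is no first-visit filter, so every occurrence of a state contributes a return:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation: every occurrence
    of a state in an episode contributes a return to its average."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # total returns S(s)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:  # note: no first-visit filter here
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}
```

If a single episode visits state ‘a’ twice with returns 2 and 1, the estimate is their average, 1.5, whereas first-visit MC would only keep the first return.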

### Incremental Method

The algorithms above are not on-line, i.e. we can’t update the values of states as we act in the MDP. We’d like an on-line algorithm that updates the mean step by step, so that we don’t have to store everything, sum it all up, and then divide by the count. We can do this incrementally.

The mean μ_{1}, μ_{2}, … of a sequence x_{1}, x_{2}, … can be computed incrementally:

μ_{k} = μ_{k-1} + (1/k)(x_{k} − μ_{k-1})

The new mean is the old mean plus some step size 1/k times the difference between the new element and what we thought the mean was. Think of μ_{k-1} as our estimate of what the value is gonna be on average, i.e. our mean; then we see the new element x_{k}, and there is some kind of error between what we thought was gonna happen and what actually happened.

We update our mean a little bit in the direction of that error, i.e. we had a prediction and we correct that prediction in the direction of the error.
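This running-mean update is easy to check in code; a minimal sketch (the function name is mine):

```python
def incremental_mean(xs):
    """Running mean: mu_k = mu_{k-1} + (1/k) * (x_k - mu_{k-1})."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # correct the old estimate toward the new sample
    return mu
```

It produces exactly the ordinary mean, but without storing the samples, which is what makes the on-line update possible.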

### Incremental Monte Carlo Updates

To incorporate this idea in MC learning, we’ll update V(s) incrementally after each episode S_{1}, A_{1}, R_{2}, …, S_{T}. For each state S_{t} with return G_{t}, we increment the counter N(S_{t}) ← N(S_{t}) + 1 and update the value of that state:

V(S_{t}) ← V(S_{t}) + (1/N(S_{t}))(G_{t} − V(S_{t}))

Every time we see that state, we measure the return from that state onwards. We look at the error between the value function V(S_{t}), i.e. what we thought the return was going to be, and the return G_{t} we actually observed. We generate some error and update our mean estimate for that state a little bit, with step size 1/N(S_{t}), in the direction of that return.

Furthermore, we’re gonna move to algorithms that don’t have to maintain these count statistics and just incrementally update themselves. One way is to use a constant step size α, which gives an exponential forgetting rate, i.e. an exponentially weighted moving average of all the returns the agent has seen so far:

V(S_{t}) ← V(S_{t}) + α(G_{t} − V(S_{t}))
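Both variants, the exact running mean with step size 1/N(s) and the constant step size α, can be sketched as a single update function; the dict-based representation and names are mine:

```python
def incremental_mc_update(V, N, state, G, alpha=None):
    """One incremental MC update for `state` with observed return G.

    With alpha=None the step size is 1/N(state), which recovers the
    exact running mean of the returns; with a constant alpha we get
    an exponentially weighted average that forgets old episodes.
    """
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)  # move the estimate toward the return
```

For example, two updates with returns 10 and 20 under the 1/N schedule leave the estimate at their mean, 15.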

### Temporal Difference Learning

TD methods learn directly from episodes of experience, and they apply to non-episodic (continuing) problems as well. TD is model-free: no knowledge of MDP transitions or rewards is needed.

Unlike MC, TD learns from incomplete episodes by bootstrapping. The agent doesn’t have to go all the way until it hits a wall and then report how much reward it got from that complete trajectory.

The agent can take a partial trajectory, and use an estimate of how much reward it thinks it will get from now up until the wall, in place of the actual return.

This idea is called bootstrapping, i.e. substituting the remainder of the trajectory with that estimate of what would happen from that point onwards.

In a nutshell, I’m gonna start by guessing how long it’d take me to get to the wall, walk some number of steps, and then make another guess of how long it’ll take to get to the wall, *and I’m gonna update my original guess towards my subsequent guess*.
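That guess-toward-guess idea is the simplest TD update. As a minimal sketch (the dict representation and names are mine), the agent moves its estimate toward a bootstrapped target instead of waiting for the actual return:

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One TD update: move V(state) toward the bootstrapped target
    reward + gamma * V(next_state), i.e. toward our newer guess,
    rather than waiting for the end of the episode."""
    target = reward + gamma * V.get(next_state, 0.0)
    v = V.get(state, 0.0)
    V[state] = v + alpha * (target - v)  # correct toward the TD target
```

In the wall example: if my current estimate is 30 minutes, I walk for 5 minutes and my estimate from the new position is 35 minutes, the target becomes 5 + 35 = 40, and my original 30-minute guess moves toward 40 immediately.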

### MC and TD

The ultimate goal is to learn V_{π} online from experience under policy π.

In *incremental every-visit MC*, we update V(S_{t}) toward the actual return G_{t}:

V(S_{t}) ← V(S_{t}) + α(G_{t} − V(S_{t}))

In TD learning, we instead update V(S_{t}) toward an estimated return, R_{t+1} + γV(S_{t+1}): we start off thinking it was gonna take us 30 minutes to reach the wall; after one step we think it will take 40 minutes, so we can immediately update there already.

That’s it for this post. In part 2, I will discuss TD in greater detail, with elaborate examples and the different properties of both algorithms.

If you want to read more about reinforcement learning, check out all the blog posts here.

**Previous Episodes**

- Reinforcement Learning: A Gentle Introduction
- Markov Decision Processes: (Part 1)
- Markov Decision Processes: (Part 2)
