Reinforcement Learning Demystified: Markov Decision Processes (Part 1)
In the previous blog post, we talked about reinforcement learning and its characteristics. We mentioned the process of the agent observing the environment output consisting of a reward and the next state, and then acting upon that. This whole process is a Markov Decision Process or an MDP for short.
This blog post is a bit mathy. Grab your coffee and a comfortable chair, and just dive in.
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The agent and the environment interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
Formally, an MDP is used to describe an environment for reinforcement learning, where the environment is fully observable. Almost all RL problems can be formalized as MDPs.
What exactly is a Markov Decision Process?
To understand the MDP, first we have to understand the Markov property.
The Markov property
The Markov property states that,
“The future is independent of the past given the present.”
Once the current state is known, the history of information encountered so far may be thrown away; that state is a sufficient statistic that characterizes the future just as well as the entire history would.
In mathematical terms, a state St has the Markov property if and only if

$$\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \dots, S_t\right],$$

i.e. the state captures all relevant information from the history. For a Markov state s and successor state s′, the state transition probability function is defined by,
$$P_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right]$$
It’s a probability distribution over the possible successor states given the current state: from whatever state the agent is in, there is some probability of transitioning to each of the possible next states.
We can put this transition function in the form of a matrix, where each row sums to 1,
$$\mathcal{P} = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}$$
Markov Process
A Markov process is a memory-less random process, i.e. a sequence of random states S1, S2, ….. with the Markov property. A Markov process or Markov chain is a tuple (S, P) on state space S and transition function P. The dynamics of the system can be defined by these two components S and P. When we sample from an MDP, it’s basically a sequence of states or as we call it an episode.
[Figure: the example Markov chain, with states Class 1, Class 2, Class 3, Pass, Facebook, and Sleep and the transition probabilities between them.]
Here, we have different states with different successors, e.g. in Class 1 you might go to Class 2 with a probability of 0.5 or Facebook with a probability of 0.5. An episode would be, for example, [Class 1 → Class 2 → Class 3 → Pass → Sleep]. Sleep is the terminal state or absorbing state that terminates an episode.
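To make this concrete, here is a minimal Python/NumPy sketch of a Markov chain over the states of this example. Only the Class 1 → Class 2 and Class 1 → Facebook probabilities (0.5 each) come from the text; every other number is a made-up placeholder, chosen so that each row of the matrix sums to 1.

```python
import numpy as np

# States of the example Markov chain (Sleep is the terminal/absorbing state).
states = ["Class 1", "Class 2", "Class 3", "Pass", "Facebook", "Sleep"]

# Transition matrix: P[i, j] = probability of moving from state i to state j.
# Only the Class 1 row (0.5 to Class 2, 0.5 to Facebook) is from the text;
# the other rows are hypothetical, chosen so that each row sums to 1.
P = np.array([
    #  C1    C2    C3   Pass   FB  Sleep
    [0.0,  0.5,  0.0,  0.0,  0.5,  0.0],   # Class 1
    [0.0,  0.0,  0.8,  0.0,  0.0,  0.2],   # Class 2
    [0.0,  0.0,  0.0,  0.6,  0.0,  0.4],   # Class 3
    [0.0,  0.0,  0.0,  0.0,  0.0,  1.0],   # Pass
    [0.1,  0.0,  0.0,  0.0,  0.9,  0.0],   # Facebook
    [0.0,  0.0,  0.0,  0.0,  0.0,  1.0],   # Sleep (absorbing)
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution

def sample_episode(start="Class 1", rng=np.random.default_rng(0)):
    """Sample one episode (a sequence of states) until the absorbing state."""
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # e.g. ['Class 1', 'Class 2', 'Class 3', 'Pass', 'Sleep']
```

Sampling repeatedly produces different episodes, all of which eventually end in the absorbing Sleep state.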
Markov Reward Process
A Markov Reward Process, or MRP, is a Markov process with value judgments: it tells us how much reward is accumulated through a particular sequence that we sample.
An MRP is a tuple (S, P, R, γ), where S is a finite state space, P is the state transition probability function, and R is a reward function where
Rs = 𝔼[Rt+1 | St = s],
i.e. it tells us how much immediate reward we expect to get from state s at the current moment.
There is also the notion of the return Gt, which is the total discounted reward from time step t onwards. This is what we care about; the goal is to maximize this return,
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
γ is a discount factor, where γ ∈ [0, 1]. It tells the agent how much it should value rewards now relative to rewards in the future. If (γ = 0), the agent is short-sighted: it only cares about the first reward. If (γ = 1), the agent is far-sighted: it cares about all future rewards equally. What we care about is the total reward we are going to collect.
You might be wondering why we introduce a discount factor γ at all: why not use the undiscounted rewards? It turns out to be mathematically convenient to discount rewards; it guarantees that our algorithms converge and avoids infinite returns in loopy (cyclic) Markov processes.
Note that, although the return is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided γ < 1. For example, if the reward is a constant +1, then the return is,
$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}$$
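Here is a small Python sketch (the function name is my own, not from any library) that computes a truncated version of this sum and compares the constant-reward case against the closed form 1/(1 - γ):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

gamma = 0.9
# A long (but finite) stream of constant +1 rewards approximates the infinite sum.
approx = discounted_return([1.0] * 1000, gamma)
print(approx, 1 / (1 - gamma))  # both print (approximately) 10.0
```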
Another reason to discount rewards is that the agent is not certain about what will happen in the future; it might be better to take the immediate reward rather than wait in the hope of getting a larger reward later, so γ defines a kind of finite horizon for what to care about. γ also implicitly encodes the animal/human cognitive model, which shows a preference for immediate rewards.
Bear with me and imagine the following situation: someone offers you either $1000 right now, or $1000 after each passing hour for 10 hours, but with each passing hour the value of that $1000 decreases by some factor γ raised to the number of hours passed. As a rational person trying to get the maximum possible amount of money, your choice depends on γ.
If (γ = 0.1), with a simple calculation, your return would be,
Gt = 1000 + 100 + 10 + …,
on the other hand, if (γ = 0.9), the return would be,
Gt = 1000 + 900 + 810 + …
Notice that if the decreasing factor is near zero, you will only care about the first hour or two, then stop wasting your time and just leave, i.e. reach the terminal state; everything that comes afterward is worthless. If γ is near 1, you will wait the full 10 hours to get the money, i.e. keep taking actions to get rewards. So γ defines a horizon for acting in the environment.
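The two returns above can be reproduced in a couple of lines of Python; a small sketch following the example's 10-hour horizon:

```python
def money_return(gamma, hours=10, payout=1000):
    # Each hour's $1000 is worth payout * gamma**h, for h = 0 .. hours-1.
    return sum(payout * gamma ** h for h in range(hours))

print(money_return(0.1))  # 1111.11... = 1000 + 100 + 10 + ...
print(money_return(0.9))  # 6513.21... = 1000 + 900 + 810 + ...
```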
Sometimes it’s possible to use undiscounted MRPs (i.e. γ = 1), if we are sure that all sequences terminate.
As we mentioned in the previous blog post, the value function tells us how good each state and/or action is, i.e. how good it is to be in a particular state, and how good it is to take a particular action. It tells the agent how much reward to expect if it takes a particular action in a particular state.
The state-value function of an MRP is the expected return starting from state s,
$$V(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$$
E.g. the return of the previously mentioned episode [Class 1 → Class 2 → Class 3 → Pass → Sleep] would be,

which comes out to -2.25. Note that this is the return of a single sampled episode; the value of state Class 1 is the expected return over all episodes starting from it.
Bellman Equation
The agent tries to get the largest expected sum of rewards from every state it lands in. To achieve that, we must find the optimal value function, i.e. the maximum sum of cumulative rewards. The Bellman equation will help us do so.
Using the Bellman equation, the value function can be decomposed into two parts: an immediate reward, Rt+1, and the discounted value of the successor state, γV(St+1),
$$V(s) = \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right]$$
We unroll the return Gt,
$$V(s) = \mathbb{E}\left[G_t \mid S_t = s\right] = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]$$
then substitute the return Gt+1, starting from time step t+1,
$$V(s) = \mathbb{E}\left[R_{t+1} + \gamma \left(R_{t+2} + \gamma R_{t+3} + \dots\right) \mid S_t = s\right] = \mathbb{E}\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right]$$
finally, since expectation is linear, i.e. 𝔼(aX + bY) = a𝔼(X) + b𝔼(Y), and by the law of iterated expectations, the expected value of the return Gt+1 is the same as the expected value of V(St+1), the value of the successor state,
$$V(s) = \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right]$$
That gives us the Bellman equation for MRPs,
$$V(s) = \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right]$$
So, for each state in the state space, the Bellman equation gives us the value of that state,
$$V(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s')$$
The value of a state s is the reward we get upon leaving that state, plus a discounted average over the possible successor states, where the value of each successor state is weighted by the probability that we land in it.
[Figure: the Bellman equation applied to the example state space, computing the value of Class 3.]
For our example state space, the value of Class 3 is the reward of -2 that we get upon leaving that state, plus the discounted average of the values of its possible successor states.
This Bellman equation is a linear equation, i.e. it can be solved directly,
$$V = R + \gamma P V \;\;\Rightarrow\;\; V = (I - \gamma P)^{-1} R$$
The complexity of this computation is O(n³) for n states. This direct solution is only possible for small MRPs. For larger MRPs, there are many iterative methods, e.g. Dynamic Programming, Monte-Carlo evaluation, and Temporal-Difference learning.
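For a small MRP, the direct solution is a one-liner in NumPy. The sketch below uses a made-up two-state MRP (not the Class example) purely to show the linear solve:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state MRP: transition matrix P and reward vector R.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([1.0, -1.0])

# Bellman equation in matrix form: V = R + gamma * P V  =>  V = (I - gamma * P)^-1 R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)

# Sanity check: V satisfies the Bellman equation.
assert np.allclose(V, R + gamma * P @ V)
```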
Markov Decision Process
An MDP is a Markov Reward Process with decisions; it’s an environment in which all states are Markov. This is what we actually want to solve. An MDP is a tuple (S, A, P, R, γ), where S is the state space, A is a finite set of actions, P is the state transition probability function,
$$P^a_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right]$$
R is the reward function,
$$R^a_s = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$$
and γ is a discount factor, γ ∈ [0, 1].

Remember that a policy π is a distribution over actions given states. A policy fully defines the behavior of an agent,
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right]$$
MDP policies depend only on the current state, not the history, i.e. policies are stationary: At ~ π(·|St), ∀ t > 0. This means that whenever the agent lands in a particular state, it samples its action from the same distribution, regardless of the time step. Policies can be stochastic, which allows us to explore the state space.
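For a finite MDP, a stochastic policy can be stored as a matrix of probabilities π(a|s), one row per state. A tiny illustrative sketch, with arbitrary numbers:

```python
import numpy as np

n_states, n_actions = 3, 2

# pi[s, a] = probability of taking action a in state s; each row sums to 1.
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [0.2, 0.8]])
assert np.allclose(pi.sum(axis=1), 1.0)

# A_t ~ pi(. | S_t): the same distribution is used at every time step.
rng = np.random.default_rng(0)
s = 1
a = rng.choice(n_actions, p=pi[s])
print(a)
```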
We can always recover a Markov process or an MRP from an MDP. Given an MDP M = (S, A, P, R, γ) and a policy π, the state sequence S1, S2, … is a Markov process (S, Pπ) under policy π.
The state and reward sequence S1, R2, S2, R3, … is a Markov Reward Process (S, Pπ, Rπ, γ) where,
$$P^{\pi}_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}$$
i.e. we average over all the things that might happen under our policy π. We have now defined the transition dynamics under policy π as the average of the transition dynamics over all the actions we might take.
Under policy π there is some probability of taking each action a from state s; we multiply the probability of taking that action by what would happen after taking it in state s, i.e. the successor states we might land in.
The same goes for the reward function,
$$R^{\pi}_s = \sum_{a \in A} \pi(a \mid s) R^a_s$$
We average over all the possible rewards associated with the different possible actions from state s.
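In code, averaging the MDP dynamics over the policy is just a weighted sum. A sketch with hypothetical shapes, P[a, s, s′] for transitions and R[s, a] for rewards:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical MDP dynamics: P[a, s, s'] = P^a_{ss'} and R[s, a] = R^a_s.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # rows sum to 1
R = rng.normal(size=(n_states, n_actions))

# A uniform random policy pi(a|s).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# P^pi_{ss'} = sum_a pi(a|s) * P^a_{ss'}   and   R^pi_s = sum_a pi(a|s) * R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)   # shape (n_states, n_states)
R_pi = (pi * R).sum(axis=1)             # shape (n_states,)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # P^pi is still a valid transition matrix
```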
We already have the value function for MRPs, but there were no decisions involved. Now we have a policy, i.e. a way of choosing how to behave in a Markov process.
“The state-value function Vπ(s) of an MDP is the expected return starting from state s, and then following policy π.”
The value function tells us how good it is to be in state s if we are following policy π, i.e. the expectation when we sample all actions according to policy π,
$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$$
“The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.”
It tells us how good it is to take a particular action from a particular state,
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$

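As a closing illustration, the definition of vπ can be turned into a naive Monte-Carlo estimate: sample many episodes under π and average the discounted returns from the start state. A self-contained sketch on a tiny, made-up MDP (rewards are treated as deterministic here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Hypothetical MDP: P[a, s, s'] transition probabilities and R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 0.5)  # uniform random policy

def estimate_v(s0, episodes=5000, horizon=100):
    """Monte-Carlo estimate of v_pi(s0): average the discounted returns of sampled episodes."""
    returns = []
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate: gamma**100 is negligible
            a = rng.choice(n_actions, p=pi[s])
            g += discount * R[s, a]           # reward treated as deterministic R^a_s
            discount *= gamma
            s = rng.choice(n_states, p=P[a, s])
        returns.append(g)
    return np.mean(returns)

print(estimate_v(0))  # approximate value of state 0 under the random policy
```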
In part 2, I’ll cover the Bellman expectation equation, the Bellman optimality equation, and how to obtain the optimal policy, the optimal state-value function, and the optimal action-value function.