Reinforcement Learning Demystified: A Gentle Introduction
In a long blog-post series starting with this episode, I’ll try to simplify the theory behind the science of reinforcement learning and its applications, and cover code examples to make a solid illustration. You can read all my blog posts here.
What is Reinforcement Learning?
Reinforcement learning, or RL for short, is the science of decision making, or the optimal way of making decisions. When an infant plays and waves its arms, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals.
This is the key idea behind RL: we have an environment that represents the outside world to the agent, and an agent that takes actions and receives observations from the environment, consisting of a reward for its action and information about its new state. The reward informs the agent of how good or bad the taken action was, and the observation tells it what its next state in the environment is.
The agent tries to figure out the best actions to take, or the optimal way to behave in the environment, in order to carry out its task in the best possible way.
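This act-observe loop can be sketched in a few lines of code. The following is a minimal, hypothetical toy: a made-up one-dimensional world where the agent starts at position 0 and gets a reward only when it stumbles onto position 3 (none of these names come from a real library).

```python
import random

random.seed(0)  # for reproducibility

class Environment:
    """A toy 1-D world: the agent starts at 0 and must reach position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position is clamped at 0
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = Environment()
state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([-1, 1])         # act (here: randomly, no learning yet)
    state, reward, done = env.step(action)  # observe the new state and reward
    total_reward += reward                  # a learning update would go here
```

This agent only acts randomly; the whole point of RL is to replace `random.choice` with a learned rule that picks good actions.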
This is a simulation of a humanoid that learned how to run by repeatedly acting and observing until it finally figured out the best action to take at each time step to achieve its task, i.e. running efficiently.
Here is a successful example of an RL agent that learned to play Breakout like any human being after 400 training episodes. After 600 training episodes, the agent finds and exploits the best strategy of tunneling and then hitting the ball behind the wall.
Keep in mind there is no explicit human supervisor, the agent learns by trial and error.
Another amazing success story is how DeepMind used RL to simulate locomotion behaviors on MuJoCo’s simulation models. The agent is given proprioception and simplified vision to perceive the environment.
The agent learns to run, jump, crouch, and climb through relentless attempts of trials and learning from its errors.
We can’t ignore the biggest event in the AI community, the ultimate smackdown of human versus machine, where DeepMind’s AlphaGo ruthlessly managed to defeat Lee Sedol, the South Korean professional Go player of 9-dan rank, 4 matches to 1 in March 2016. Lee has 18 world championships under his belt.
“The Game of Go is the holy grail of artificial intelligence. Everything we’ve ever tried in AI, it just falls over when you try the game of Go.”
David Silver, Lead Researcher for AlphaGo
What makes RL different from other machine learning paradigms?
In RL there is no supervisor, only a reward signal, a real number that tells the agent how good or bad its action was. Feedback from the environment might be delayed over several time steps; it’s not necessarily instantaneous. E.g. for the task of reaching a goal in a grid world, the feedback might come only at the end, when the agent reaches the goal. The agent might spend some time exploring and wandering in the environment until it finally reaches the goal, and only then realize which of the actions it took were good and which were bad.
In supervised learning, we have a dataset that describes the environment to the algorithm, along with the right answers or actions to take when faced with a specific situation, and the algorithm tries to generalize from that data to new situations.
In RL, the data is not i.i.d. (independent and identically distributed). The agent might spend some time in a certain part of the environment and never see other parts that might be important for learning the optimal behavior. So time really matters: the agent must explore pretty much every part of the environment to be able to take the right actions.
The agent influences the environment through its actions which in turn affect the subsequent data it receives from the environment, it’s an active learning process.
What are Rewards?
A reward Rt is a scalar feedback signal that indicates how well the agent is doing at time step t. The agent’s job is to maximize the expected sum of rewards. Reinforcement learning is based on the Reward Hypothesis, which states that:
“All goals can be described by the maximization of expected cumulative rewards”.
E.g. if we want the humanoid to learn how to walk, we can break this goal down into a positive reward for forward motion and a negative reward for falling over. The agent will learn that the negative reward is associated with the actions that cause it to fall over, and eventually it’ll learn that it’s not good to take those actions; instead, it should take the actions that get it the positive reward.
We’ve seen that the agent must select the actions that maximize future rewards, but some actions have long-term consequences. Given that some rewards are delayed, the agent can’t be greedy all the time, i.e. take the action associated with the maximum reward at the current time; it has to plan ahead (more on that later). It might be better to sacrifice immediate reward to gain long-term reward, or vice versa.
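A tiny numerical illustration of why greed can fail: compare the total reward of two hypothetical plans over three time steps, one that grabs an immediate reward and one that waits for a larger delayed one (the numbers are made up for illustration).

```python
# Two candidate plans over three time steps:
greedy_rewards  = [1, 0, 0]   # grab a small reward right away
patient_rewards = [0, 0, 10]  # sacrifice now for a larger delayed reward

# The quantity the agent actually cares about is the cumulative reward:
greedy_return  = sum(greedy_rewards)
patient_return = sum(patient_rewards)
```

A purely greedy agent would pick the first plan at time step 0 (reward 1 beats reward 0) and still end up worse off overall.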
Agent and Environment.
In fig. 1, the agent and the environment interact with each other over a sequence of discrete time steps, t = 0, 1, 2, 3, …. At each time step t, the agent receives some representation of the environment’s state, St ∈ S, and on that basis selects an action, At ∈ A(s).
One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∈ R ⊂ ℝ, and finds itself in a new state, St+1.
The sequence of observations, actions, and rewards during the agent’s lifetime up to time step t is called the history,
Ht = S1, A1, R2, …, St-1, At-1, Rt.
It’s what the agent has seen so far, i.e. all the observable variables up to time step t.
What happens next depends on the history: the agent selects actions, i.e. a mapping from the history to an action, and the environment emits the next state and a reward associated with the taken action.
Naively, it seems that this is probably the best way to encode what the agent has encountered so far, and that the agent should use this information to decide on its next action. But the history is not very useful, since it’s enormous, and it would be infeasible to use this approach in interesting real-world problems.
Instead, we turn to the state, which is a summary of the information encountered so far, used to determine what happens next. A state is a function of the history, St = f(Ht).
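The idea that the state is just some function f of the history can be made concrete. Here is a hypothetical sketch of two possible choices of f: keeping the entire history (faithful but enormous) versus keeping only the latest observation (cheap, and sufficient whenever the Markov property discussed below holds).

```python
# A toy history: alternating states, actions, and rewards up to time t
history = ["s1", "a1", "r2", "s2", "a2", "r3", "s3"]

def full_history_state(h):
    """f(Ht) = Ht itself: exact, but grows without bound over the agent's lifetime."""
    return tuple(h)

def last_observation_state(h):
    """f(Ht) = the most recent observation: compact, and enough if it is Markov."""
    return h[-1]

big_state   = full_history_state(history)
small_state = last_observation_state(history)
```

Both are valid functions of the history; the art is picking an f whose output is small but still captures everything needed to act well.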
State of Agent and Environment.
The environment state is the information used within the environment to determine what happens next from the environment’s perspective, i.e. spit out the observation or next state and reward.
That environment state St is the private representation of the environment, i.e. whatever data the environment uses to pick the next state and reward. The environment state is not usually visible to the agent, even if it’s visible, it might contain some irrelevant information.
The agent state captures what happened to the agent so far, it summarizes what is going on and the agent uses this to pick the next action. The agent state is its internal representation, i.e. whatever information the agent uses to pick the next action.
The state can be any function of history as we mentioned, and the agent decides what this function is going to be.
How to represent the agent state?
An information state, a.k.a. a Markov state, contains all useful information from the history. We use a Markov state to represent the agent’s state.
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, …, St]
This means the agent state is Markov if it contains all the useful information the agent has encountered so far, which in turn means we can throw away all the previous states and just retain the agent’s current state.
If we have the Markov property, the future is independent of the past given the present. Once the current state is known, the history may be thrown away, and that state is a sufficient statistic that gives us the same characterization of the future as if we have all the history.
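A simple way to see the Markov property in code: in the hypothetical two-state weather chain below, sampling the next state consults only the current state, never the rest of the trajectory (the states and probabilities are invented for illustration).

```python
import random

random.seed(42)

# Transition probabilities P[s' | s] of a 2-state Markov chain
P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def next_state(current):
    """Sample s' given only the current state: the Markov property in action.

    Earlier states in the trajectory are irrelevant here; the current state
    is a sufficient statistic for the future.
    """
    r, cum = random.random(), 0.0
    for s, p in P[current].items():
        cum += p
        if r < cum:
            return s
    return s  # guard against floating-point rounding

trajectory = ["sunny"]
for _ in range(5):
    trajectory.append(next_state(trajectory[-1]))
```

Note that `next_state` receives `trajectory[-1]` only; handing it the whole trajectory would change nothing, which is exactly what "the future is independent of the past given the present" means.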
There is the notion of fully observable environments, where the agent directly observes the environment state; as a result, the observation emitted from the environment is both the agent’s new state and the environment’s new state. Formally, this is a Markov Decision Process, or MDP (next blog post).
On the other hand, there is the notion of partially observable environments, where the agent indirectly observes the environment. E.g. a robot with camera vision isn’t told its exact location; all the agent knows is what lies in front of it. Now the agent state ≠ the environment state. Formally, this is a Partially Observable Markov Decision Process, or POMDP.
In POMDP, the agent must find a way to construct its own state representation. As we mentioned before, we could use the complete history to construct the current state, but this is a naive approach.
Instead, we could use the agent’s beliefs about the environment, i.e. where the agent thinks it is in the environment. This is a Bayesian approach, where we maintain a probability distribution over the states the agent thinks it might be in.
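As a sketch of this belief-state idea, consider a hypothetical robot in a corridor of four cells, two of which have doors. The robot can’t see its position, only whether a door is in front of it, and it updates its belief over positions with Bayes’ rule (all numbers and names here are invented for illustration).

```python
# Corridor of 4 cells; doors at cells 0 and 2. The robot cannot observe its
# position directly, only "door" / "wall", with a slightly noisy sensor.
doors = [True, False, True, False]
belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior over the 4 positions

def update_belief(belief, observation, p_correct=0.9):
    """Bayes rule: posterior over positions ∝ likelihood × prior."""
    posterior = []
    for prior, has_door in zip(belief, doors):
        # Likelihood of the observation given the robot is in this cell
        likelihood = p_correct if (observation == "door") == has_door else 1 - p_correct
        posterior.append(likelihood * prior)
    total = sum(posterior)
    return [p / total for p in posterior]  # normalize so it sums to 1

belief = update_belief(belief, "door")
```

After seeing a door, the belief concentrates on the two door cells; the full probability vector, not any single guess, is the agent’s state.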
We could also use a recurrent neural network to construct the current state of the agent, where a linear combination of the previous state, multiplied by some weight, and the current observation, multiplied by some other weight, is passed through a non-linearity, giving us the current state of the agent.
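The recurrent update just described can be written out for scalar states and observations. This is a bare-bones sketch with hand-picked weights (in practice the state and observation are vectors, the weights are learned matrices, and the whole thing runs inside a deep learning framework):

```python
import math

def rnn_state(prev_state, observation, w_s=0.5, w_o=1.0):
    """s_t = tanh(w_s * s_{t-1} + w_o * o_t).

    A linear combination of the previous state and the current observation,
    squashed through a non-linearity, yields the new agent state.
    """
    return math.tanh(w_s * prev_state + w_o * observation)

state = 0.0  # initial state before any observation
for obs in [1.0, 0.5, -0.2]:
    state = rnn_state(state, obs)
```

Because each new state folds in the previous one, the final `state` is a compressed summary of the whole observation stream, which is exactly the "state as a function of history" idea.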
Components of a Reinforcement Learning Agent.
A reinforcement learning agent may include one or more of these components:
A Policy,
It’s a probability distribution over actions given states, i.e. the agent’s behavior function, or how the agent picks its actions given that it’s in a certain state. It could be a deterministic policy that we want to learn from experience, a = 𝜋(s), or a stochastic policy, 𝜋(a|s) = P[At = a | St = s].
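The two flavors of policy are easy to contrast in code. Below is a hypothetical recycling-robot example (the states and actions are made up): a deterministic policy is just a lookup from state to action, while a stochastic one stores a distribution per state and samples from it.

```python
import random

random.seed(0)

# Deterministic policy: a = π(s), exactly one action per state.
deterministic_policy = {"low_battery": "recharge",
                        "high_battery": "search"}

# Stochastic policy: π(a|s) = P[A=a | S=s], a distribution over actions.
stochastic_policy = {"low_battery":  {"recharge": 0.9, "search": 0.1},
                     "high_battery": {"recharge": 0.1, "search": 0.9}}

def sample_action(state):
    """Draw an action according to the stochastic policy's probabilities."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]

a_det  = deterministic_policy["low_battery"]  # always "recharge"
a_stoc = sample_action("high_battery")        # usually "search", sometimes not
```

Stochastic policies are useful precisely because they keep exploring: even a state the agent thinks it understands occasionally gets a different action tried in it.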
A Value Function,
It’s a function that tells us how good each state and/or action is, i.e. how good it is to be in a particular state, and how good it is to take a particular action. It informs the agent of how much reward to expect if it takes a particular action in a particular state.
In short, it’s a prediction of expected future rewards, used to evaluate the goodness/badness of states and thereby to select between different actions,
v𝜋(s) = E𝜋[Rt+1 + 𝛾Rt+2 + 𝛾²Rt+3 + … | St = s].
In some state s at time step t, the value function informs the agent of the expected sum of future rewards under a given policy 𝜋, so it can choose the action that maximizes that expected sum. The value function depends on how the agent is behaving.
𝛾 is a discount factor, where 𝛾 ∈ [0, 1]. It informs the agent of how much it should care about rewards now relative to rewards in the future.
If (𝛾 = 0), that means the agent is short-sighted, i.e. it only cares about the first reward. If (𝛾 = 1), that means the agent is far-sighted, i.e. it cares about all future rewards.
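The effect of 𝛾 is easiest to see by computing the discounted return for the two extremes on a toy reward sequence with a single delayed reward (the numbers are illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 10]  # nothing, nothing, then a delayed reward of 10

short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
far_sighted   = discounted_return(rewards, gamma=1.0)  # all rewards count equally
middle        = discounted_return(rewards, gamma=0.9)  # delayed reward, shrunk
```

With 𝛾 = 0 the delayed reward is invisible to the agent; with 𝛾 = 1 it counts in full; intermediate values of 𝛾 trade off between the two (here 0.9² × 10 = 8.1).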
A Model,
A model predicts what the environment will do next. It’s the agent’s representation of the environment, i.e. how the agent thinks the environment works.
There is a transition function P, which predicts the next state, i.e. the dynamics of the environment,
P(s′ | s, a) = P[St+1 = s′ | St = s, At = a].
It tells us the probability distribution over the possible successor states, given the current state and the action taken by the agent. We can try to learn these dynamics.
There is also a reward function R, which predicts the next immediate reward associated with the taken action, given the current state,
R(s, a) = E[Rt+1 | St = s, At = a].
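For small, discrete problems, both pieces of the model can be stored as plain tables. Here is a hypothetical two-state sketch (the state names, actions, and probabilities are invented): `transition[(s, a)]` plays the role of P(s′ | s, a) and `reward[(s, a)]` the role of R(s, a).

```python
# A tabular model: the agent's learned guess at the environment's dynamics.
# transition[(s, a)] maps successor states to probabilities, P(s' | s, a);
# reward[(s, a)] is the expected immediate reward, E[R_{t+1} | s, a].
transition = {("s0", "right"): {"s1": 0.9, "s0": 0.1},
              ("s1", "right"): {"goal": 1.0}}
reward = {("s0", "right"): 0.0,
          ("s1", "right"): 1.0}

def expected_reward(state, action):
    return reward[(state, action)]

probs = transition[("s0", "right")]  # distribution over successors of ("s0", "right")
```

Given such a model, the agent can plan, e.g. simulate action sequences in its head before committing to one in the real environment.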
Categorizing Reinforcement Learning Agents.
- Value-Based Agent, the agent evaluates all the states in the state space, and the policy is kind of implicit, i.e. the value function tells the agent how good each action is in a particular state, and the agent chooses the best one.
- Policy-Based Agent, instead of representing the value function inside the agent, we explicitly represent the policy. The agent searches directly for the optimal policy, which will enable it to act optimally.
- Actor-Critic Agent, this agent combines value-based and policy-based approaches. It’s an agent that stores both a policy and an estimate of how much reward it is getting from each state.
- Model-Based Agent, the agent tries to build a model of how the environment works, and then plans with it to get the best possible behavior.
- Model-Free Agent, here the agent doesn’t try to understand the environment, i.e. it doesn’t try to model the dynamics. Instead, it goes directly to a policy and/or value function: it just sees experience and tries to figure out a policy for behaving optimally to get the most possible reward.
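To tie the categories together, here is a minimal sketch of a model-free, value-based agent: tabular Q-learning on the same kind of hypothetical 1-D chain used earlier (the environment, states, and hyperparameters are all made up for illustration; this is not a production algorithm).

```python
import random

random.seed(0)

# A 1-D chain of 4 states (0..3); the agent must walk right to reach state 3.
n_states, actions = 4, [-1, +1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}  # action-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != 3:
        # ε-greedy action selection: mostly exploit, occasionally explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), 3)
        r = 1.0 if s_next == 3 else 0.0
        # Q-learning update: no model of the environment is ever built
        best_next = 0.0 if s_next == 3 else max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
```

The agent never learns transition probabilities or a reward function; it learns Q-values directly from experience, and the greedy policy "pick the action with the highest Q" falls out implicitly, which is exactly what "value-based" means.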