Reinforcement Learning: Exploration vs. Exploitation

Exploration vs. Exploitation

On-line decision-making involves a fundamental choice; exploration, where we gather more information that might lead us to better decisions in the future, or exploitation, where we make the best decision given current information.

This comes up because we’re learning on-line. In the reinforcement learning setting, no one gives us some batch of data like in supervised learning. We’re gathering data as we go, and the actions that we take affects the data that we see, and so sometimes it’s worth taking different actions to get new data.

A k-armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice, you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, 1000 action selections, or time steps.

This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers.

Each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action. We denote the action selected on time step t as At, and the corresponding reward as Rt. The value then of an arbitrary action a, denoted q*(a), is the expected reward given that a is selected:

If we knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with the highest value. We assume that we don’t know the action values with certainty, although we may have estimates. We denote the estimated value of action a at time step t as Qt(a). We would like Qt(a) to be close to q*(a).

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these greedy actions. When you select one of these actions, we say that you’re exploiting your current knowledge of the values of the actions. If instead, you select one of the non-greedy actions, then we say you are exploring, because this enables you to improve your estimate of the non-greedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one-step, but exploration may produce a greater total reward in the long run, so what should we do to solve this dilemma?.

Principles to Exploration

First, we begin by looking at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we are collectively called action-value methods. Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:

Action value

If the denominator is zero, then we instead define Qt(a) as some default value, such as 0. As the denominator goes to infinity, by the law of large numbers, Qt(a) converges to q*(a). We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards. Of course, this is just one way to estimate action values, and not necessarily the best one.


Rather than accounting for the amount of reward we’ve got, we might ask a question, how much worse have we done than the best we could have possibly done? The optimal value V* is the best we could have possibly done if we knew which machine paid out the most:

The regret is how much we are far away from V*, it’s the opportunity loss for one step in expectation,

The total regret is the total opportunity losses over all time steps,

We want to maximize cumulative reward which means we want to minimize total regret. It’s useful to think about regret because it helps us understand how well an algorithm can possibly do. We want to find algorithms that bring regret each step towards zero.

We can formulate regret in another way. Consider Nt(a) — the count — to be the expected number of times we selected action a. The gab Δa is the difference in value between action a and optimal action a*,

Now we can make regret as a function of gaps and counts,

If we’re summing up how much we lost each time we picked action a, this is the same as counting how many times we chose that action and multiplying it by how much we lost each time we did pick that action.

Each time the gab is huge, i.e. some machine is really horrible, we need to make sure that we pull that arm very few times, whereas if there is another machine that has a small gab, now we want to pull that arm more and more. A good algorithm ensures small counts for large gaps. The problem is that gabs are not known since we don’t know V*.

Greedy Algorithm

Fig.1 courtesy of David Silver

This figure shows the total regret as a function of time steps and different algorithms for picking actions. The first one and actually the simplest one is to select one of the actions with the highest estimated value, that is, one of the greedy actions. A greedy action is the one action whose estimated value is greatest. If there is more than one greedy action, then a selection is made among them in an arbitrary way, perhaps randomly. We write this greedy action selection method as,

where the argmax denotes the action a for which the expression that follows is maximized. Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. So greedy can lock onto suboptimal action forever causing the total regret to be linear in time steps.

Initial action values can also be used as a simple way to encourage exploration. This is called “greedy with optimistic initialization”. Suppose that instead of setting the initial action value to zero, we set them instead to +5, given that the mean of all actions is 0 for example, thus an initial estimate of +5 is widely optimistic. We’re gonna assume that everything is really good until proven otherwise.

This optimism encourages the greedy method to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the agent switches to other actions being “disappointed” with the reward it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time. This is a simple trick that can be quite effective on stationary problems, i.e. for problems in which the reward probabilities do not change over time. But this method is far from being a generally useful approach to encouraging exploration.

If the task changes, creating a renewed need for exploration, this method cannot help. Any method that focuses on the initial conditions in any special way is unlikely to help with the general non-stationary case. The beginning of time occurs only once, and thus we should not focus on it too much.

Another issue with this method is that few unlucky samples can lock out optimal actions forever. Let’s assume that I started off thinking that action a1 is the best. I tried it, I’m unlucky. I try it again, I’m unlucky. Now I can end up locking out this action forever because I might try some other action a2, and it turns out to be better and I never explore a1 again. So we end up carrying regret every time step.

ε-greedy Algorithm

A simple alternative to greedy action selection is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. An advantage of this method is that, in the limit as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the Qt(a) converge to q*(a).

The ε-greedy algorithm continues to explore forever, with probability 1- ε select the best action, with probability ε select a random action. Every time we explore randomly, we’re very much likely to make some mistakes and not pull the best arm, so we keep incurring regret every time-step. ε-greedy has linear total regret.

To assess the relative effectiveness of the greedy and ε-greedy algorithms, we compare them numerically on a suite of test problems. This is a set of 2000 randomly generated k-armed bandit problems with k = 10. For each bandit problem, such as the one shown in the figure,

The action values, q*(a), a = 1,….,10, were selected according to a normal distribution with mean 0 and variance 1. Then, when a learning method applied to that problem selected action At at time step t, the actual reward, Rt, was selected from a normal distribution with mean q*(At) and variance 1. These distributions are shown in gray in the figure. We call this suite of test tasks the 10-armed testbed.

For any learning method, we can measure its performance and behavior as it improves with experience over 1000 time steps when applied to one of the bandit problems. This makes up one run. Repeating this for 2000 independent runs, each with a different bandit problem, we obtain measures of the learning algorithm’s average behavior.

This graph compared a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1) on the 10-armed testbed. All the method formed their action-value estimates using the sample average technique mentioned above. The greedy method improved slightly faster than the other methods at the very beginning, but then leveled off at a lower level. The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.

The ε-greedy methods eventually performed better because they continued to explore and to improve their chances of recognizing the best action. The ε = 0.1 method explored more and usually found the optimal action earlier, but it never selected that action more than 91% of the time. The ε = 0.01 method improved more slowly but eventually would perform better than the ε = 0.1 method.

It’s also possible to reduce ε over time and pick a schedule for reducing the value of ε to try to get the best of both worlds. Decaying ε-greedy has logarithmic asymptotic total regret.

The advantage of ε-greedy over greedy methods depends on the task. Suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards, it takes more exploration to find the optimal action, and ε-greedy methods should fare even better relative to the greedy method. On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once.

Suppose the bandit task was non-stationary, that is, the true values of the actions changed over time. In this case, exploration is needed even in the deterministic case to make sure one of the non-greedy actions has not changed to become better than the greedy one. Non-stationarity is the case most commonly encountered in reinforcement learning.

Even if the underlying task is stationary and deterministic, the agent faces a set of bandit-like decision tasks each of which changes over time as learning proceeds and the agent’s decision-making policy changes.

A Lower Bound

There’s a lower bound on the regret, i.e. no algorithm can possibly do better than a certain lower bound. Meaning, there’s a lower bound on how well we can do in terms of regret. we want to push down our algorithms closer and closer to this lower bound. This lower bound is logarithmic in the number of time steps just as decaying ε-greedy algorithm.

In the bandit problem, the performance of any algorithm is determined by the similarity between the optimal arm and other arms. An easy bandit problem is where you got one arm that is obviously good and one arm that is obviously bad. You just try this arm once and gives you a good number, you try the other one once and it gives you a bad number then you’re done.

A hard bandit problem is something where the 1st arm for example is the best one, we don’t know that yet. We try each available arm. The 1st arm is sometimes better than the others, and sometimes it’s bad. There’s a lot of noise on them, and it’s really hard to disambiguate them. We’re making a lot of mistakes, and it takes a really long time to figure out, that the 1st arm is much better than the rest. This is the case of non-stationarity as we mentioned earlier.

So the hard problems have similar-looking arms with different means. We can describe that formally in terms of the gap between them and how similar their distributions are using the KL divergence method, which is a measure of how one probability distribution is different from another reference probability distribution.

Formally, this is a theorem that states,

Theorem (Lai and Robbins): Asymptotic total regret is at least logarithmic in a number of steps.

This means we can never do better than this lower bound in terms of time steps. It tells us that, the more different the arms are, the more regret will incur.

Optimism in the face of uncertainty

Imagine there are 3 different arms. There’s a probability distribution over the action value for each arm. Maybe we tried the green one a lot of time, so we’ve got a quite good idea of what the mean of this action is. Maybe we tried the blue one a couple of times, we’re not quite sure of the mean, and the red one is in-between. The question is, which arm should we pick next?

Optimism in the face of uncertainty principle says, don’t take the one you currently believe is the best. Take the action which has the most potential to be the best. The blue action has the most potential to have a higher mean. So we should try it and narrow down the distribution. As we narrow the distributions down, we start to get more and more confident about what the best action really is until we actually pick the one that’s got the maximum mean.

So it’s a way to push down our uncertainty and at the same time try things out that has the most potential to do well.

Upper Confidence Bound (UCB)

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look at the present, but some of the other actions may actually be better. ε-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain.

It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how their estimates are to being maximal and the uncertainties in those estimates.

We’re not just gonna try to estimate the mean of an action, we’re gonna estimate some upper confidence Ut(a) of what we think the mean could be. Think of it as the tail of the distribution above. We’re gonna estimate some addition, some bonus which characterizes how big the tail is of the distribution.

You can think of Ut(a) as a high probability upper confidence on what the value of an action could be. Then we’re gonna pick the action with the highest upper confidence value.

This depends on Nt(a), i.e. the number of times an action has been selected. Small Nt(a) means the larger Ut(a) will be (estimated value is uncertain). The larger Nt(a), the smaller Ut(a) will be (estimated value is accurate). We’re gonna add less and less bonus to actions that we tried more because we become more and more confident about what the mean of that action is. Eventually, we just end up using the mean.

We select action maximizing UCB. Here the maximization is over the action value added to it the upper confidence of that action. That helps us systematically look around the action space and figure out which of these actions is giving us the best results.

So how to calculate an upper confidence bound of an action? Here we won’t make any assumptions about the distributions of the actions.

Theorem (Hoeffding’s Inequality):

It basically tells us, if we have some random variables, we’re sampling them between [0, 1], we keep sampling these X values, again and again, we take an empirical mean of all our samples, what’s the probability that we’re actually making a mistake in our estimation of this empirical mean by at least u.

This is true for any distribution when the rewards are bounded between [0,1]. We’ll apply Hoeffding’s Inequality to rewards of the bandit conditioned on selecting action a,

This is saying, what’s the probability that we’re wrong in our estimate of Q by more than Ut(a). We’re gonna use that to solve for Ut(a) values and to set them to the appropriate amount, to guarantee that this probability is, say, within 95%.

We’re gonna solve for Ut(a),

and that gives us this upper confidence value. What is nice about this, is we don’t need to know anything about the gabs, we don’t need to know anything about the rewards except that they’re bounded. This term has all the properties that we want. The count is in the denominator, i.e. as we pick things more and more, this bonus term is gonna be pushed down towards zeros, and for actions, we haven’t tried very often, we’re gonna have a very large bonus term.

Now, we want to pick a schedule. We want to guarantee that we pick the optimal action as we continue, we want to get this asymptotic regret to be logarithmic in time steps. So we add a schedule to our P values as we observe more rewards, e.g. we might set P to be equal to t raised to the power of -4. Using the logarithmic power rule, “The logarithm of the exponent of x raised to the power of y, is y times the logarithm of x”,

we end up with this equation,

Thus we ensure that we select optimal action as t →.


This leads to UCB1 algorithm, which is a quite effective algorithm in the k-armed bandit setting.

Every step we estimate Q values using the sample-average method and then we add the bonus term that only depends on the number of time steps t and the number of times we picked that action, Nt(a).

It looks a lot like the lower bound, except that we don’t have the KL term, because there is no assumption about the probability distribution.


  • Greedy with optimistic initialization: We initialize the values of the actions to a highly optimistic value and assume that everything is good until proven otherwise. In the end, we suppress each action value to its realistic value.
  • Random Exploration: ε-greedy algorithm does well if we tune it just right. The difficulty is that if we get it wrong, it might be difficult to get to the optimal action in the end.
  • Optimism in the face of uncertainty: We estimate how much we don’t know about the value of an action, and we use that to guide us towards actions that have the most potential to be good. This is not a safe exploration. In industry, researchers and engineers don’t use this approach since it’s not safe.
  • UCB: Often, this algorithm performs well than ε-greedy, but it’s more difficult to extend beyond bandits to the more general reinforcement learning setting. When using function approximation, UCB action selection is usually not practical. UCB without any knowledge about the problem, actually systematically performs really quite well.

Previous Episodes


Richard Sutton’s Intro to Reinforcement Learning

No responses yet

Leave a Reply

Your email address will not be published. Required fields are marked *