Consider the following learning problem. You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

This is the original form of the k-armed bandit problem. Each of the k actions has an excepted or mean reward given that action is selected; let us call this value of that action. We denote the action selected on time step t as $A_{t}$, and the corresponding reward as $R_{t}$. The value then of an arbitrary action a, denoted $q_{\star}(a)$, is the excepted reward given that a is selected:
$$q_{\star}(a) = \mathbb{E}[R_t|A_t=a]$$
If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action a at time t as $Q_{t}(a) \approx q_{\star}(a)$.

We begin by looking more closely at some simple methods for estimating the values of actions and for using the estimates to make action selection decisions. Recall that the true value of an action is the mean reward when action is selected. One natural way to estimate this is by averaging the rewards actually received:

$$Q_t(a) \doteq \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}} = \frac{\sum_{i=1}^{t-1}R_i \cdot \mathbf{1}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbf{1}_{A_i=a}}$$

where $\mathbf{1}_{\text{predicate}}$ denotes the random variable that is 1 if predicate is true and 0 if it is not. If the denominator is zero, then we instead define $Q_t(a)$ as some default value, such as $Q_1(a) = 0$. As the denominator goes to infinity, by the law of large numbers, $Q_t(a)$ converges to $q_{\star} (a)$. We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards.

The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, that is, to select at step t one of the greedy actions, $A_t^\star$ for which $Q_t(A_t^\star) = \max_a Q_t(a)$. This greedy action selection method can be written as
$$A_t \doteq argmax_a Q_t(a)$$
Naturally, we could use the $\epsilon$-greedy method rather the greedy method. We’ll show their difference on the performance. Now, let’s jump into the implementation details. In order to be able to see the results quickly, we set to k to be 10. The first, we generate 10 stationary probability distributions that we’ll sample from to generate action values. The generate method is below:

We first generate randomly 10 true excepted values by np.random.randn(10), then I’m going to change the variance to 1. So we get 10 stationary probability distributions. The visualizations are as follows:

We’re going to compare how different $\epsilon$ values affect the end result.

Before we go into the details, we introduce the Bandit object first.

For now we just introduce sample-average method, so skip other methods parameters. Let us see the initialization method.

There are some important attributes. time is a number that represents the time steps now. actionCount is the times that correspond actions have been taken prior to current time steps. qTrue is a list. And each item is the true excepted value corresponding to each action. qEst is the estimate value of each action. It’s initialized to zero. epsilon is the $\epsilon$. Next, in the for loop, we initialize real rewards with N(0, 1) distribution and estimations with desired initial value. At last, the bestAction store the current best action will be take.

The next method tell us how to get the next action should be take:

We can skip the second and the third if statements (we’ll introduce this two methods later). If we use greedy method, we just return the action that has highest value. Otherwise, we’re choosing randomly at $\epsilon$ probability.

Similarly, we just skip other if statements and focus on this row:

This formula seems different with the earlier one. The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. We now turn to the question of how these averages can be computed in a computationally efficient manner, in particular, with constant memory and per-time-step computation.

To simplify notation we concentrate on a single action. Let $R_i$ now denote the reward received after ith selection of this action, and let $Q_n$ denote the estimate of its action value after it has been select $n-1$ times, which we can now write simply as
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$
The obvious implementation would be to maintain a record of all the rewards and then perform this computation whenever the estimated value was needed. However, in this case the memory and computational requirements would grow over time as more rewards are seen.

It is easy to devise incremental formulas for updating averages with small, constant computation required to process each new reward. Given $Q_n$ and the nth reward, $R_n$, the new average of all n rewards can be computed by
\begin{align} Q_{n+1} &\doteq \frac{1}{n}\sum_{i=1}^{n} R_i \\ &= \frac{1}{n}(R_n + \sum_{i=1}^{n-1} R_i) \\ &= \frac{1}{n}(R_n + (n-1) \frac{1}{n-1}\sum_{i=1}^{n-1} R_i) \\ &= \frac{1}{n}(R_n + (n-1) Q_n) \\ &= \frac{1}{n}(R_n + n Q_n - Q_n) \\ &= Q_n + \frac{1}{n}[R_n - Q_n] \end{align}
So this is why the code is look like this:

Back to epsilonGreedy() method:

Now, we get nBandits bandits and each bandit has 10 arm. Then we use a bandit simulation to simulate this process.

The bandits is a list that has three item. Each item is a list that contains nBandits bandits that has a corresponding epsilon value. We calculate the average total reward of 2000 bandits is to eliminate the effect of noise. Ok, let us see the final result.

We can see the algorithm reaches the best performance when epsilon is set to 0.1.

The averaging methods discussed so far are appropriate in a stationary environment, but not if the bandit is changing over time. As noted earlier, we often encounter reinforcement learning problems that are effectively nonstationary. In such cases it makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for updating an average $Q_n$ of the $n-1$ past rewards is modified to be
$$Q_{n+1} \doteq Q_n + \alpha [R_n - Q_n]$$
where the step-size parameter $\alpha \in (0, 1]$ is constant. This results in $Q_{n+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:
\begin{align} Q_{n+1} &\doteq Q_n + \alpha [R_n - Q_n] \\ &= \alpha R_n + (1 - \alpha) Q_n \\ &= \alpha R_n + (1 - \alpha) [\alpha R_{n-1} + (1 - \alpha) Q_{n-1}] \\ &= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1 - \alpha) ^2 Q_{n-1} \\ &= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1 - \alpha) ^2 R_{n-1} + \cdots + (1 - \alpha) ^ {n-1} \alpha R_1 + (1 - \alpha) ^ n Q_1 \\ &= (1 - \alpha) ^ n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha) ^ {n-i} R_i \end{align}
We call this a weighted average because the sum of the weights is 1. In fact, the weight decays exponentially according to the exponent on $1-\alpha$. According, this is sometimes called an exponential, recency-weighted average.

Sometimes it is convenient to vary the step-size parameter from step to step. Let $\alpha_n(a)$ denote the step-size parameter used to process the reward received after nth selection of action a. As we noted, the choice $\alpha_n(a)=\frac{1}{n}$ results in the sample-average method, which is guaranteed to converge to the true action values by the law of the large numbers. A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:
$$\sum_{n=1}^{\infty}\alpha_{n}(a) = \infty \; \text{and} \; \sum_{n=1}^{\infty}\alpha_{n}^{2}(a) < \infty$$
All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, $Q_1(a)$. In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant $\alpha$ (weighted avetrage), the bias is permanent, though decreasing over time. In practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be excepted. So, let us do a experiment. The first we set them all to zero and the we set them all to a constant, 5.

The Bandit object’s takeAction() has a little difference:

The result is as follows:

We can see it reaches the better performance than the $\epsilon$-greedy method, when they are all used the weighted average method. Notice that the $\epsilon$-greedy with weighted-average method is worsen than the $\epsilon$-greedy with sample-average method.

We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration. For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case.

Exploration is needed because the estimates of the action values are uncertain. The greedy actions are those that look best at present, but some of the other actions may actually be better. $\epsilon$-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions as
$$A_t \doteq argmax_a \left [Q_t(a) + c\sqrt{\frac{\ln{t}}{N_t(a)}}\ \right]$$
where $N_t(a)$ denotes the number of times that action a has been selected prior to time t, and the number $c>0$ controls the degree of exploration. If $N_t(a)=0$, then a is considered to be a maximizing action. The idea of this is called upper confidence bound (UCB). Let us implement it.

We note that the UCBParam=2. The Bandit object explains this. The getAction() method and takeAction() method are as follows:

We can see the policy get next action has changed but the update policy has not changed. The result is here:

We omit the figure about the optimal action because the trend of this two figures are the same. UCB will often perform well, but is more difficult than $\epsilon$-greedy to extend beyond bandits to the more general reinforcement learning setting considered in the more advanced problems. In these more advanced settings there is currently no known practical way of utilizing the idea of UCB action selection.

So far we have considered methods that estimate action values and use those estimates to select actions. This is often a good approach, but it is not the only one possible. Now, we consider learning a numerical preference $H_t(a)$ for each action a. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important; if we add 1000 to all the preferences there is no effect on the action probabilities, which are determined according to a softmax distribution as follows:
$$Pr\{A_t=a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}} \doteq \pi_t(a)$$
where here we have also introduced a useful new notation $\pi_t(a)$ for the probability of taking action a at time t. Initially all preferences are the same (e.g., $H_1(a)=0, \forall a$) so that all actions have an equal probability of being selected.

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selection the action $A_t$ and receiving the reward $R_t$, the preferences are updated by:
\begin{align} H_{t+1}(A_t) \doteq H_t(A_t) + \alpha (R_t - \overline{R_t}) (1 - \pi_t(A_t)), \; \text{and} \\ H_{t+1}(a) \doteq H_t(a) - \alpha (R_t - \overline{R_t}) \pi_t(a), \;\;\;\;\;\; \forall{a \neq A_t} \end{align}
where $\alpha>0$ is a ste-size parameter, and $\overline{R_t} \in \mathbb{R}$ is the average of all the rewards up through and including time t, which can be computed incrementally. The $\overline{R_t}$ term serves as a baseline with which the reward is compared. Let us see the implement details (takeAction() method and getAction() method).

The below figure shows results with the gradient bandit algorithm on a variant of the 10-armed testbed in which the true excepted rewards were selected according to a normal distribution with a mean of +4 instead of zero (and with unit variance as before).

Despite their simplicity, in our opinion the methods presented in here can fairly be considered the state of the art. Finally, we do a parameter study of the various bandit algorithms presented in here.

The results as follows: