Here we consider our first learning methods for estimating value functions and discovering optimal policies. Unlike the previous algorithms, we do not assume complete knowledge of the environment. Monte Carlo methods require only experience – sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. Although a model is required, the model need only generate sample transitions, not the complete probability distributions of all possible transitions that are required for dynamic programming (DP).

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Only on the completion of an episode are value estimates and policies changed.

To handle the nonstationarity, we adapt the idea of generalized policy iteration (GPI) developed earlier for DP.

Suppose we wish to estimate $v_{\pi}(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. Let us call the first time it is visited in an episode the first visit to $s$. The first-visit MC method estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$, whereas the every-visit MC method averages the returns following all visits to $s$.

First-visit MC policy evaluation (returns $V \approx v_{\pi}$)

Initialize:

​ $\pi \leftarrow$ policy to be evaluated

​ $V \leftarrow$ an arbitrary state-value function

​ $Returns(s) \leftarrow$ an empty list, for all $s \in \mathcal{S}$

Repeat forever:

​ Generate an episode using $\pi$

​ For each state $s$ appearing in the episode:

​ $G \leftarrow$ return following the first occurrence of $s$

​ Append $G$ to $Returns(s)$

​ $V(s) \leftarrow$ $\text{average}(Returns(s))$
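The pseudocode above translates almost line for line into Python. The sketch below is mine (the post's own implementation appears later and uses different names); it assumes `generate_episode()` returns a list of (state, reward) pairs, where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(generate_episode, n_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    generate_episode() returns a list of (state, reward) pairs, where
    reward is the reward received on leaving that state.
    """
    returns = defaultdict(list)   # Returns(s): list of observed returns
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode()
        # Return G_t following each time step, computed backwards.
        G = 0.0
        G_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]
            G_at[t] = G
        # Record the return following the *first* visit to each state.
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)
        for s, t in first.items():
            returns[s].append(G_at[t])
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```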

Next, we’ll use this algorithm to solve a simple problem, defined as follows:

The object of the popular casino card game of blackjack is to obtain cards the sum of whose numerical values is as great as possible without exceeding 21. All face cards count as 10, and an ace can count either as 1 or as 11. We consider the version in which each player competes independently against the dealer. The game begins with two cards dealt to both dealer and player. One of the dealer’s cards is face up and the other is face down. If the player has 21 immediately (an ace and a 10-card), it is called a natural. He then wins unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn. The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise, the outcome – win, lose, or draw – is determined by whose final sum is closer to 21.

Playing blackjack is naturally formulated as an episodic finite MDP. Each game of blackjack is an episode. Rewards of +1, -1, and 0 are given for winning, losing, and drawing, respectively. All rewards within a game are zero, and we do not discount ($\gamma=1$); therefore these terminal rewards are also returns. The player’s actions are to hit or to stick. We assume that cards are dealt from an infinite deck (i.e., with replacement). If the player holds an ace that he could count as 11 without going bust, then the ace is said to be usable. Consider the policy that sticks if the player’s sum is 20 or 21, and otherwise hits. Thus, the player makes decisions on the basis of three variables: his current sum (12-21), the dealer’s one showing card (ace-10), and whether or not he holds a usable ace. This makes for a total of 200 states.
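As a quick sanity check on that count, the three decision variables multiply out to exactly 200 states:

```python
# Player sum 12..21, dealer's showing card ace(1)..10, usable ace or not.
states = [(p, d, ace)
          for p in range(12, 22)
          for d in range(1, 11)
          for ace in (True, False)]
assert len(states) == 10 * 10 * 2 == 200
```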

Note that, although we have complete knowledge of the environment in this task, it would not be easy to apply DP methods to compute the value function. DP methods require the distribution of next events – in particular, they require the quantities $p(s^{\prime}, r|s, a)$ – and it is not easy to determine these for blackjack. For example, suppose the player’s sum is 14 and he chooses to stick. What is his expected reward as a function of the dealer’s showing card? All of these rewards and transition probabilities must be computed before DP can be applied, and such computations are often complex and error-prone.

The conceptual diagram of the experimental results is as follows:

Figure 1

First, we define some auxiliary variables and methods:

We also have a print method:

In order to get the figure above, we wrote the following code:

There is a term here named on-policy; we’ll explain it later. Now let us jump into the monteCarloOnPolicy method:

We ignore the first four variables for now and explain them later. nEpisodes represents the number of episodes, and the play method simulates the process of the blackjack card game. So what does it look like? This method is long (more than 100 lines), so we explain it in parts. First, it defines some auxiliary variables:

Then we generate a random initial state if the initial state is null; otherwise, we load the initial state from the initialState variable:

The game starts. First comes the player’s turn:

Then comes the dealer’s turn, once the player’s turn has ended:

Once both sides have finished the game:
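The original play method is not reproduced in this excerpt, so here is a compact, self-contained Python sketch of the same game flow described above (initial deal, player’s turn under a policy, dealer’s fixed strategy, final comparison). Names like `play` and `draw_card` are illustrative, and naturals get no special treatment in this simplification:

```python
import random

def draw_card():
    # Infinite deck (sampling with replacement): 1..9, plus 10/J/Q/K as 10.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (best total, usable_ace), counting one ace as 11 when possible."""
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable else (total, False)

def play(policy):
    """Play one episode. policy maps (player_sum, dealer_showing, usable_ace)
    to 'hit' or 'stick'. Returns (trajectory, reward)."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    trajectory = []

    # Player's turn.
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return trajectory, -1             # player goes bust
        if total < 12:                        # hitting below 12 cannot bust
            player.append(draw_card())
            continue
        state = (total, dealer[0], usable)
        action = policy(state)
        trajectory.append((state, action))
        if action == 'stick':
            break
        player.append(draw_card())

    # Dealer's turn: fixed strategy without choice, stick on 17 or greater.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total = hand_value(player)[0]
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return trajectory, 1
    if player_total == dealer_total:
        return trajectory, 0
    return trajectory, -1
```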

Now, let us come back to the monteCarloOnPolicy method:

In this method we ignore the player’s trajectory (represented by the playerTrajectory variable). If you recall a sentence from the game definition (quoted below), it is easy to understand why:

Thus, the player makes decisions on the basis of three variables: his current sum (12-21), the dealer’s one showing card (ace-10), and whether or not he holds a usable ace. This makes for a total of 200 states.

The following line computes the average return for each state:

Recall the beginning of the code, and let’s see what the results look like:

Figure 2 (from top to bottom). (1) Usable Ace, 10000 Episodes. (2) No Usable Ace, 10000 Episodes. (3) Usable Ace, 500000 Episodes. (4) No Usable Ace, 500000 Episodes.

If a model is not available, then it is particularly useful to estimate action values (the values of state–action pairs) rather than state values. With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did earlier for DP. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_{\star}$. To achieve this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate $q_{\pi}(s, a)$. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state–action pair rather than to a state. The only complication is that many state–action pairs may never be visited. This is the general problem of maintaining exploration, as discussed earlier in the context of the k-armed bandit problem. One way to address it is by specifying that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts.

We are now ready to consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. To begin, let us consider a Monte Carlo version of classical policy iteration. In this method, we perform alternating complete steps of policy evaluation and policy improvement, beginning with an arbitrary policy $\pi_{0}$ and ending with the optimal policy and optimal action-value function:
$$\pi_{0} \stackrel{E}\longrightarrow q_{\pi_{0}} \stackrel{I}\longrightarrow \pi_{1} \stackrel{E}\longrightarrow q_{\pi_{1}} \stackrel{I}\longrightarrow \pi_{2} \stackrel{E}\longrightarrow \cdots \stackrel{I}\longrightarrow \pi_{\star} \stackrel{E}\longrightarrow q_{\star},$$
We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes. We postpone consideration of the first assumption until later.

The primary approach to avoiding the infinite number of episodes nominally required for policy evaluation is to forgo complete policy evaluation before returning to policy improvement. For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

Monte Carlo ES (Exploring Starts)

Initialize, for all $s \in \mathcal{S},\; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow \text{arbitrary}$

​ $\pi(s) \leftarrow \text{arbitrary}$

​ $Returns(s,a) \leftarrow \text{empty list}$

Repeat forever:

​ Choose $S_{0} \in \mathcal{S}$ and $A_{0} \in \mathcal{A(S_{0})}$ s.t. all pairs have probability > 0

​ Generate an episode starting from $S_0, A_0$, following $\pi$

​ For each pair $s, a$ appearing in the episode:

​ $G \leftarrow \text{return following the first occurrence of} \; s, a$

​ Append $G$ to $Returns(s,a)$

​ $Q(s,a) \leftarrow \text{average}(Returns(s,a))$

​ For each $s$ in the episode:

​ $\pi(s) \leftarrow \arg\max_{a} Q(s,a)$
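To see the algorithm in isolation, here is a minimal Python sketch of Monte Carlo ES on a toy corridor MDP (not blackjack); the MDP and all the names are my own invention for illustration. Exploring starts pick a random state–action pair, and the policy is followed thereafter:

```python
import random
from collections import defaultdict

# Toy corridor MDP: states 0, 1, 2 are nonterminal, state 3 is terminal;
# entering state 3 yields reward +1, everything else 0.
# Actions: 0 = left, 1 = right. The optimal policy is "always right".
N = 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)

def mc_es(n_episodes, gamma=0.9, seed=0, max_len=50):
    rng = random.Random(seed)
    returns = defaultdict(list)
    Q = defaultdict(float)
    pi = {s: 0 for s in range(N - 1)}             # arbitrary initial policy
    for _ in range(n_episodes):
        # Exploring start: every (state, action) pair has probability > 0.
        s, a = rng.randrange(N - 1), rng.randrange(2)
        episode = []
        for _ in range(max_len):                  # cap episodes that loop
            s2, r = step(s, a)
            episode.append((s, a, r))
            if s2 == N - 1:
                break
            s, a = s2, pi[s2]                     # follow pi after the start
        # Returns following each step, computed backwards.
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][2]
            G_at[t] = G
        # First-visit updates for each (state, action) pair.
        first = {}
        for t, (s_t, a_t, _) in enumerate(episode):
            first.setdefault((s_t, a_t), t)
        for (s_t, a_t), t in first.items():
            returns[(s_t, a_t)].append(G_at[t])
            Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
            pi[s_t] = max((0, 1), key=lambda a2: Q[(s_t, a2)])
    return pi, Q
```

Running it for a few thousand episodes recovers the "always right" policy, since going right from state 2 always earns the undiscounted terminal reward.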

Recall the blackjack card game. It is straightforward to apply Monte Carlo ES to blackjack.

Running the code, we get a conceptual diagram like the following:

Let us see the implementation (the monteCarloES method) of this algorithm. Note that some auxiliary variables were defined earlier.

You can see that we use the trajectory variable now. The difference between Monte Carlo on-policy and Monte Carlo ES is that the former calculates the average return of each state, while the latter calculates the average return of each state–action pair.

The results are as follows:

How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods (remember this term?) and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data. The first-visit Monte Carlo method and the Monte Carlo ES method are examples of on-policy methods. Here we show how an on-policy Monte Carlo control method can be designed that does not use the unrealistic assumption of exploring starts. Off-policy methods are considered later.

The on-policy method we present here uses $\epsilon$-greedy policies. The complete algorithm is given below.

On-policy first-visit MC control (for $\epsilon \text{-soft}$ policies)

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a ) \leftarrow \text{arbitrary}$

​ $Returns(s,a) \leftarrow \text{empty list}$

​ $\pi(a|s) \leftarrow \text{an arbitrary} \; \epsilon \text{-soft policy}$

Repeat forever:

​ (a) Generate an episode using $\pi$

​ (b) For each pair $s, a$ appearing in the episode:

​ $G \leftarrow$ return following the first occurrence of $s, a$

​ Append $G$ to $Returns(s,a)$

​ $Q(s, a) \leftarrow \text{average}(Returns(s,a))$

​ (c) For each $s$ in the episode:

​ $A^{\star} \leftarrow \arg\max_{a} Q(s,a)$

​ For all $a \in \mathcal{A(s)}$:

​ $\pi(a|s) \leftarrow \left\{ \begin{array}{c} 1 - \epsilon + \epsilon/|\mathcal{A(s)}|\;\;\;\text{if} \;a=A^{\star} \\ \epsilon/|\mathcal{A(s)}|\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{if} \;a\neq A^{\star} \end{array} \right.$
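Step (c) of the box is the only part that differs from plain greedy improvement; here is a one-function Python sketch (with hypothetical names) of the $\epsilon$-soft probability assignment:

```python
def epsilon_soft_probs(Q, s, actions, eps=0.1):
    """Step (c) of the pseudocode: the epsilon-soft probabilities pi(.|s)
    that are greedy with respect to Q. The greedy action gets
    1 - eps + eps/|A(s)|; every other action gets eps/|A(s)|."""
    a_star = max(actions, key=lambda a: Q[(s, a)])
    n = len(actions)
    return {a: (1 - eps + eps / n) if a == a_star else eps / n
            for a in actions}
```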

All learning control methods face a dilemma: They seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions (to find the optimal actions). How can they learn about the optimal policy while behaving according to an exploratory policy? The on-policy approach in the preceding section is actually a compromise – it learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

We begin the study of off-policy methods by considering the prediction problem, in which both target and behavior policies are fixed. That is, suppose we wish to estimate $v_{\pi}$ or $q_{\pi}$, but all we have are episodes following another policy $\mu$, where $\mu \neq \pi$. In this case, $\pi$ is the target policy, $\mu$ is the behavior policy, and both policies are considered fixed and given.

In order to use episodes from $\mu$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $\mu$. That is, we require that $\pi(a|s) > 0$ implies $\mu(a|s) > 0$. This is called the assumption of coverage. It follows from coverage that $\mu$ must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic. Here, we consider the prediction problem, in which $\pi$ is unchanging and given.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio. Given a starting state $S_{t}$, the probability of the subsequent state–action trajectory, $A_{t}, S_{t+1}, A_{t+1}, \cdots, S_{T}$, occurring under any policy $\pi$ is
$$\prod_{k=t}^{T-1} \pi(A_k | S_k)p(S_{k+1} | S_k, A_k),$$
where $p$ here is the state-transition probability function. Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is
$$\rho_{t}^{T} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k|S_k)p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi (A_k | S_k)}{\mu (A_k | S_k)}$$
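Because the transition probabilities $p$ cancel between numerator and denominator, the ratio can be computed from policy probabilities alone, without a model. A tiny illustrative Python sketch (names are mine):

```python
def importance_ratio(trajectory, pi, mu):
    """Importance-sampling ratio for a trajectory of (state, action) pairs.
    The transition probabilities p cancel between numerator and denominator,
    leaving only a product of policy-probability ratios pi/mu."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi(a, s) / mu(a, s)
    return rho
```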
Now we are ready to give a Monte Carlo algorithm that uses a batch of observed episodes following policy $\mu$ to estimate $v_{\pi}(s)$. It is convenient here to number time steps in a way that increases across episode boundaries. That is, if the first episode of the batch ends in a terminal state at time 100, then the next episode begins at time $t=101$. This enables us to use time-step numbers to refer to particular steps in particular episodes. In particular, we can define the set of all time steps in which state $s$ is visited, denoted $\tau(s)$. This is for an every-visit method; for a first-visit method, $\tau(s)$ would only include time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$. Then $\{G_t\}_{t \in \tau(s)}$ are the returns that pertain to state $s$, and $\{\rho_{t}^{T(t)}\}_{t \in \tau(s)}$ are the corresponding importance-sampling ratios. To estimate $v_{\pi}(s)$, we simply scale the returns by the ratios and average the results:
$$V(s) \doteq \frac{\sum_{t \in \tau(s)} \rho_{t}^{T(t)} G_t}{|\tau(s)|}.$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling.

An important alternative is weighted importance sampling, which uses a weighted average, defined as
$$V(s) \doteq \frac{\sum_{t \in \tau(s)} \rho_{t}^{T(t)} G_t}{\sum_{t \in \tau(s)} \rho_{t}^{T(t)}},$$
or zero if the denominator is zero.
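The two estimators differ only in the denominator; a minimal Python sketch of both (names are mine):

```python
def ordinary_is(returns, ratios):
    # Simple average of ratio-scaled returns: unbiased, possibly huge variance.
    return sum(p * g for p, g in zip(ratios, returns)) / len(returns)

def weighted_is(returns, ratios):
    # Ratio-weighted average: biased (bias -> 0), each weight at most 1.
    denom = sum(ratios)
    return sum(p * g for p, g in zip(ratios, returns)) / denom if denom else 0.0
```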

We applied both ordinary and weighted importance-sampling methods to estimate the value of a single blackjack state from off-policy data. Recall that one of the advantages of Monte Carlo methods is that they can be used to evaluate a single state without forming estimates for any other states. In this example, we evaluated the state in which the dealer is showing a deuce, the sum of the player’s cards is 13, and the player has a usable ace (that is, the player holds an ace and a deuce, or equivalently three aces). The data was generated by starting in this state and then choosing to hit or stick at random with equal probability (the behavior policy). The target policy was to stick only on a sum of 20 or 21. The value of this state under the target policy is approximately -0.27726 (this was determined by separately generating one hundred million episodes using the target policy and averaging their returns). Both off-policy methods closely approximated this value after 1000 off-policy episodes using the random policy. To make sure they did this reliably, we performed 100 independent runs, each starting from estimates of zero and learning for 10,000 episodes. The implementation details are as follows:

Note the behaviorPolicyPlayer function, which defines the behavior policy:

The numerator and denominator of the importance-sampling estimators are both summations. As the number of episodes increases, we need to accumulate the result from each episode, so we need to store the previous results. The sumOfRewards and sumOfImportanceRatio variables are used for this purpose.

Then we need to show the result (mean squared error):

Result is as follows:

Formally, the difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased, whereas the weighted importance-sampling estimator is biased (though the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. Nevertheless, we will not totally abandon ordinary importance sampling, as it is easier to extend to the approximate methods using function approximation that we explore later.

The estimates of ordinary importance sampling can have infinite variance, and this easily happens in off-policy learning when trajectories contain loops. A simple example is shown below:

There is only one nonterminal state $s$ and two actions, end and back. The end action causes a deterministic transition to termination, whereas the back action transitions, with probability 0.9, back to $s$ or, with probability 0.1, on to termination. The rewards are +1 on the latter transition and otherwise zero. Consider the target policy that always selects back. All episodes under this policy consist of some number (possibly zero) of transitions back to $s$ followed by termination with a reward and return of +1. Thus the value of $s$ under the target policy is 1. Suppose we are estimating this value from off-policy data using the behavior policy that selects end and back with equal probability.
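Before looking at the implementation, here is a self-contained Python sketch of this one-state MDP and the ordinary importance-sampling estimate (names such as `episode_ratio_and_return` are illustrative). The ratio doubles on every back action and collapses to zero the moment end is chosen:

```python
import random

def episode_ratio_and_return(rng):
    """One behavior-policy episode in the one-state MDP.
    Returns (importance ratio, return). The target policy always picks
    'back', so the ratio doubles each step and becomes 0 if 'end' is chosen."""
    rho = 1.0
    while True:
        if rng.random() < 0.5:        # behavior picks 'end': terminate, r = 0
            return 0.0, 0.0           # pi('end'|s) = 0, so the ratio is 0
        rho *= 2.0                    # pi('back'|s) / mu('back'|s) = 1 / 0.5
        if rng.random() < 0.1:        # 'back' terminates with reward +1
            return rho, 1.0
        # with probability 0.9 we are back in s; keep going

def ordinary_is_estimate(n_episodes, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        rho, g = episode_ratio_and_return(rng)
        total += rho * g
    return total / n_episodes
```

Each sample is either 0 or an arbitrarily large power of 2, which is exactly why the sample variance is unbounded even though the expected value is 1.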

The implementation details are as follows. We first define the two policies:

Then we define how an episode runs:

Now we start our off-policy (first-visit MC) learning process:

Result is as follows:

The correct estimate here is 1, and, even though this is the expected value of a sample return (after importance sampling), the variance of the samples is infinite, and the estimates do not converge to this value.

Finally, we present two further algorithms: incremental off-policy every-visit MC policy evaluation and off-policy every-visit MC control.

Incremental off-policy every-visit MC policy evaluation

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow$ arbitrary

​ $C(s,a) \leftarrow$ 0

​ $\mu(a|s) \leftarrow$ an arbitrary soft behavior policy

​ $\pi(a|s) \leftarrow$ an arbitrary target policy

Repeat forever:

​ Generate an episode using $\mu$:

​ $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_{T}, S_{T}$

​ $G \leftarrow 0$

​ $W \leftarrow 1$

​ For $t=T-1, T-2, \cdots \;\text{downto} \; 0$:

​ $G \leftarrow \gamma G + R_{t+1}$

​ $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$

​ $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}[G - Q(S_t, A_t)]$

​ $W \leftarrow W\frac{\pi(A_t | S_t)}{\mu(A_t | S_t)}$

​ If $W = 0$ then ExitForLoop
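The box above maps directly onto a short Python sketch (names are mine; `pi` and `mu` are passed in as probability functions, and each episode is a list of $(S_t, A_t, R_{t+1})$ triples):

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, mu, gamma=1.0):
    """Incremental every-visit off-policy MC evaluation with weighted
    importance sampling. Each episode is a list of (S_t, A_t, R_{t+1})
    triples; pi(a, s) and mu(a, s) return action probabilities."""
    Q = defaultdict(float)
    C = defaultdict(float)                       # cumulative weights per pair
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):        # t = T-1 down to 0
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi(a, s) / mu(a, s)
            if W == 0.0:
                break                            # all earlier weights are zero
    return Q
```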

Off-policy every-visit MC control (returns $\pi \approx \pi_{\star}$)

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow$ arbitrary

​ $C(s,a) \leftarrow$ 0

​ $\mu(a|s) \leftarrow$ an arbitrary soft behavior policy

​ $\pi(s) \leftarrow$ a deterministic policy that is greedy with respect to $Q$

Repeat forever:

​ Generate an episode using $\mu$:

​ $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_{T}, S_{T}$

​ $G \leftarrow 0$

​ $W \leftarrow 1$

​ For $t=T-1, T-2, \cdots \;\text{downto} \; 0$:

​ $G \leftarrow \gamma G + R_{t+1}$

​ $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$

​ $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}[G - Q(S_t, A_t)]$

​ $\pi(S_t) \leftarrow \arg \max_{a}Q(S_t, a)$ (with ties broken consistently)

​ If $A_t \neq \pi(S_t)$ then ExitForLoop

​ $W \leftarrow W\frac{1}{\mu(A_t | S_t)}$
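A matching Python sketch of the control variant (again with names of my own choosing); note the early exit once the behavior action disagrees with the greedy target policy, at which point all earlier importance weights would be zero:

```python
import random
from collections import defaultdict

def off_policy_mc_control(generate_episode, n_episodes, actions, gamma=1.0):
    """Off-policy every-visit MC control with weighted importance sampling.
    generate_episode() returns ([(S_t, A_t, R_{t+1}), ...], mu_probs), where
    mu_probs[t] = mu(A_t | S_t) under the soft behavior policy."""
    Q = defaultdict(float)
    C = defaultdict(float)
    pi = {}                                       # greedy target policy
    for _ in range(n_episodes):
        episode, mu_probs = generate_episode()
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            pi[s] = max(actions, key=lambda a2: Q[(s, a2)])
            if a != pi[s]:
                break                             # pi(A_t|S_t) = 0 from here on
            W *= 1.0 / mu_probs[t]
    return pi, Q
```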