
Monte Carlo Methods (Reinforcement Learning)

Here we consider our first learning methods for estimating value functions and discovering optimal policies. Unlike the previous algorithms, we do not assume complete knowledge of the environment. Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. Although a model is required to learn from simulated experience, the model need only generate sample transitions, not the complete probability distributions of all possible transitions that are required for dynamic programming (DP).

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Only on the completion of an episode are value estimates and policies changed.

To handle the nonstationarity, we adapt the idea of generalized policy iteration (GPI) developed earlier for DP.

Suppose we wish to estimate $v_{\pi}(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. Let us call the first time it is visited in an episode the first visit to $s$. The first-visit MC method estimates $v_{\pi}(s)$ as the average of the returns following first visits to $s$, whereas the every-visit MC method averages the returns following all visits to $s$.

First-visit MC policy evaluation (returns $V \approx v_{\pi}$)

Initialize:

​ $\pi \leftarrow$ policy to be evaluated

​ $V \leftarrow $ an arbitrary state-value function

​ $Return(s) \leftarrow$ an empty list, for all $s \in \mathcal{S}$

Repeat forever:

​ Generate an episode using $\pi$

​ For each state $s$ appearing in the episode:

​ $G \leftarrow$ return following the first occurrence of $s$

​ Append $G$ to $Return(s)$

​ $V(s) \leftarrow$ $\text{average}(Return(s))$
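As a rough Python sketch of this pseudocode (not part of the original code; generateEpisode is a hypothetical function that returns one episode as a list of (state, reward) pairs, where the reward is the one received after leaving that state):

def firstVisitMCPrediction(policy, generateEpisode, nEpisodes, gamma=1.0):
    returns = {}  # state -> list of observed returns
    V = {}        # state -> estimated value
    for _ in range(nEpisodes):
        episode = generateEpisode(policy)  # [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        G = 0.0
        firstVisitReturns = {}
        # work backwards so that G is the return following each time step;
        # overwriting keeps the return following the FIRST visit to each state
        for state, reward in reversed(episode):
            G = gamma * G + reward
            firstVisitReturns[state] = G
        for state, G in firstVisitReturns.items():
            returns.setdefault(state, []).append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V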

Next, we’ll use this algorithm to solve a simple problem, defined as follows:

The object of the popular casino card game of blackjack is to obtain cards the sum of whose numerical values is as great as possible without exceeding 21. All face cards count as 10, and an ace can count either as 1 or as 11. We consider the version in which each player competes independently against the dealer. The game begins with two cards dealt to both dealer and player. One of the dealer’s cards is face up and the other is face down. If the player has 21 immediately (an ace and a 10-card), it is called a natural. He then wins unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn. The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.

Playing blackjack is naturally formulated as an episodic finite MDP. Each game of blackjack is an episode. Rewards of +1, -1, and 0 are given for winning, losing, and drawing, respectively. All rewards within a game are zero, and we do not discount ($\gamma=1$); therefore these terminal rewards are also the returns. The player’s actions are to hit or to stick. We assume that cards are dealt from an infinite deck (i.e., with replacement). If the player holds an ace that he could count as 11 without going bust, then the ace is said to be usable. Consider the policy that sticks if the player’s sum is 20 or 21, and otherwise hits. Thus, the player makes decisions on the basis of three variables: his current sum (12-21), the dealer’s one showing card (ace-10), and whether or not he holds a usable ace. This makes for a total of 200 states.
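As a quick sanity check of that count (a trivial sketch, not part of the original code), the state space is 10 player sums × 10 dealer cards × 2 values of the usable-ace flag:

# enumerate the 200 states: player sum 12-21, dealer showing card 1-10 (1 = ace), usable ace or not
states = [(playerSum, dealerCard, usableAce)
          for playerSum in range(12, 22)
          for dealerCard in range(1, 11)
          for usableAce in (False, True)]
print(len(states))  # 200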

Note that, although we have complete knowledge of the environment in this task, it would not be easy to apply DP methods to compute the value function. DP methods require the distribution of next events (in particular, the quantities $p(s^{\prime}, r|s, a)$), and it is not easy to determine these for blackjack. For example, suppose the player’s sum is 14 and he chooses to stick. What is his expected reward as a function of the dealer’s showing card? All of these rewards and transition probabilities must be computed before DP can be applied, and such computations are often complex and error-prone.

A conceptual diagram of the experiment is shown below:

Figure 1. Conceptual diagram of the blackjack experiment.

First, we define some auxiliary variables and methods:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # needed for the 3D projection used in prettyPrint below

# actions: hit or stand (stick)
ACTION_HIT = 0
ACTION_STAND = 1
actions = [ACTION_HIT, ACTION_STAND]

# policy for player: hit below 20, stand on 20 or 21
policyPlayer = np.zeros(22)
for i in range(12, 20):
    policyPlayer[i] = ACTION_HIT
policyPlayer[20] = ACTION_STAND
policyPlayer[21] = ACTION_STAND

# function form of target policy of player
def targetPolicyPlayer(usableAcePlayer, playerSum, dealerCard):
    return policyPlayer[playerSum]

# function form of behavior policy of player: hit or stand with equal probability
def behaviorPolicyPlayer(usableAcePlayer, playerSum, dealerCard):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT

# policy for dealer: hit below 17, stand on 17 or above
policyDealer = np.zeros(22)
for i in range(12, 17):
    policyDealer[i] = ACTION_HIT
for i in range(17, 22):
    policyDealer[i] = ACTION_STAND

# get a new card (face cards count as 10, an ace is drawn as 1)
def getCard():
    card = np.random.randint(1, 14)
    card = min(card, 10)
    return card

Furthermore, we also have a method for plotting the state values:

# plot the state values (or policy) as a 3D scatter plot
figureIndex = 0
def prettyPrint(data, title, zlabel='reward'):
    global figureIndex
    fig = plt.figure(figureIndex)
    figureIndex += 1
    fig.suptitle(title)
    ax = fig.add_subplot(111, projection='3d')
    axisX = []
    axisY = []
    axisZ = []
    for i in range(12, 22):
        for j in range(1, 11):
            axisX.append(i)
            axisY.append(j)
            axisZ.append(data[i - 12, j - 1])
    ax.scatter(axisX, axisY, axisZ)
    ax.set_xlabel('player sum')
    ax.set_ylabel('dealer showing')
    ax.set_zlabel(zlabel)

In order to get the figure above, we wrote the following code:

def onPolicy():
    statesUsableAce1, statesNoUsableAce1 = monteCarloOnPolicy(10000)
    statesUsableAce2, statesNoUsableAce2 = monteCarloOnPolicy(500000)
    prettyPrint(statesUsableAce1, 'Usable Ace, 10000 Episodes')
    prettyPrint(statesNoUsableAce1, 'No Usable Ace, 10000 Episodes')
    prettyPrint(statesUsableAce2, 'Usable Ace, 500000 Episodes')
    prettyPrint(statesNoUsableAce2, 'No Usable Ace, 500000 Episodes')
    plt.show()

The term on-policy will be explained later. For now, let us jump into the monteCarloOnPolicy method:

# Monte Carlo Sample with On-Policy
def monteCarloOnPolicy(nEpisodes):
    statesUsableAce = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    statesUsableAceCount = np.ones((10, 10))
    statesNoUsableAce = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    statesNoUsableAceCount = np.ones((10, 10))
    for i in range(0, nEpisodes):
        state, reward, _ = play(targetPolicyPlayer)
        state[1] -= 12
        state[2] -= 1
        if state[0]:
            statesUsableAceCount[state[1], state[2]] += 1
            statesUsableAce[state[1], state[2]] += reward
        else:
            statesNoUsableAceCount[state[1], state[2]] += 1
            statesNoUsableAce[state[1], state[2]] += reward
    return statesUsableAce / statesUsableAceCount, statesNoUsableAce / statesNoUsableAceCount

We ignore the first four variables for now and explain them later. nEpisodes is the number of episodes, and the play method simulates a game of blackjack. So what does it look like? The method is fairly long (more than 100 lines), so we explain it in parts. First, it defines some auxiliary variables:

# play a game
# @policyPlayerFn: specify policy for player
# @initialState: [whether player has a usable Ace, sum of player's cards, one card of dealer]
# @initialAction: the initial action
def play(policyPlayerFn, initialState=None, initialAction=None):
    # player status
    # sum of player
    playerSum = 0
    # trajectory of player
    playerTrajectory = []
    # whether player uses Ace as 11
    usableAcePlayer = False
    # dealer status
    dealerCard1 = 0
    dealerCard2 = 0
    usableAceDealer = False

Then, we generate a random initial state if initialState is None; otherwise, we load the initial state from the initialState variable:

    if initialState is None:
        # generate a random initial state
        numOfAce = 0
        # initialize cards of player
        while playerSum < 12:
            # if sum of player is less than 12, always hit
            card = getCard()
            # if get an Ace, use it as 11
            if card == 1:
                numOfAce += 1
                card = 11
                usableAcePlayer = True
            playerSum += card
        # if player's sum is larger than 21, he must hold at least one Ace, two Aces are possible
        if playerSum > 21:
            # use the Ace as 1 rather than 11
            playerSum -= 10
            # if the player only has one Ace, then he doesn't have usable Ace any more
            if numOfAce == 1:
                usableAcePlayer = False
        # initialize cards of dealer, suppose dealer will show the first card he gets
        dealerCard1 = getCard()
        dealerCard2 = getCard()
    else:
        # use specified initial state
        usableAcePlayer = initialState[0]
        playerSum = initialState[1]
        dealerCard1 = initialState[2]
        dealerCard2 = getCard()

    # initial state of the game
    state = [usableAcePlayer, playerSum, dealerCard1]

    # initialize dealer's sum
    dealerSum = 0
    if dealerCard1 == 1 and dealerCard2 != 1:
        dealerSum += 11 + dealerCard2
        usableAceDealer = True
    elif dealerCard1 != 1 and dealerCard2 == 1:
        dealerSum += dealerCard1 + 11
        usableAceDealer = True
    elif dealerCard1 == 1 and dealerCard2 == 1:
        dealerSum += 1 + 11
        usableAceDealer = True
    else:
        dealerSum += dealerCard1 + dealerCard2

The game starts. First comes the player’s turn:

    # player's turn
    while True:
        if initialAction is not None:
            action = initialAction
            initialAction = None
        else:
            # get action based on current sum
            action = policyPlayerFn(usableAcePlayer, playerSum, dealerCard1)

        # track player's trajectory for importance sampling
        playerTrajectory.append([action, (usableAcePlayer, playerSum, dealerCard1)])

        if action == ACTION_STAND:
            break
        # if hit, get new card
        playerSum += getCard()

        # player busts
        if playerSum > 21:
            # if player has a usable Ace, use it as 1 to avoid busting and continue
            if usableAcePlayer == True:
                playerSum -= 10
                usableAcePlayer = False
            else:
                # otherwise player loses
                return state, -1, playerTrajectory

Then comes the dealer’s turn, once the player’s turn has ended:

    while True:
        # get action based on current sum
        action = policyDealer[dealerSum]
        if action == ACTION_STAND:
            break
        # if hit, get a new card
        dealerSum += getCard()
        # dealer busts
        if dealerSum > 21:
            if usableAceDealer == True:
                # if dealer has a usable Ace, use it as 1 to avoid busting and continue
                dealerSum -= 10
                usableAceDealer = False
            else:
                # otherwise dealer loses
                return state, 1, playerTrajectory

If both sides have finished the game, we compare the sums:

    # compare the sum between player and dealer
    if playerSum > dealerSum:
        return state, 1, playerTrajectory
    elif playerSum == dealerSum:
        return state, 0, playerTrajectory
    else:
        return state, -1, playerTrajectory

Now, let us come back to the monteCarloOnPolicy method:

def monteCarloOnPolicy(nEpisodes):
    statesUsableAce = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    statesUsableAceCount = np.ones((10, 10))
    statesNoUsableAce = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    statesNoUsableAceCount = np.ones((10, 10))
    for i in range(0, nEpisodes):
        state, reward, _ = play(targetPolicyPlayer)
        state[1] -= 12
        state[2] -= 1
        if state[0]:
            statesUsableAceCount[state[1], state[2]] += 1
            statesUsableAce[state[1], state[2]] += reward
        else:
            statesNoUsableAceCount[state[1], state[2]] += 1
            statesNoUsableAce[state[1], state[2]] += reward
    return statesUsableAce / statesUsableAceCount, statesNoUsableAce / statesNoUsableAceCount

In this method we ignore the player’s trajectory (represented by the playerTrajectory variable). If you recall the following sentence from the game definition, it is easy to see why:

Thus, the player makes decisions on the basis of three variables: his current sum (12-21), the dealer’s one showing card (ace-10), and whether or not he holds a usable ace. This makes for a total of 200 states.

This line (shown below) calculates the average return of each state:

return statesUsableAce / statesUsableAceCount, statesNoUsableAce / statesNoUsableAceCount

Recall the code at the beginning, and let’s see what the results look like:

Figure 2 (from top to bottom). (1) Usable Ace, 10000 Episodes. (2) No Usable Ace, 10000 Episodes. (3) Usable Ace, 500000 Episodes. (4) No Usable Ace, 500000 Episodes.

If a model is not available, then it is particularly useful to estimate action values (the values of state-action pairs) rather than state values. With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did earlier with DP. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_{\star}$. To achieve this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate $q_{\pi}(s, a)$. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state. The only complication is that many state-action pairs may never be visited. This is the general problem of maintaining exploration, as discussed earlier in the context of the k-armed bandit problem. One way to handle this is by specifying that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts.

We are now ready to consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. To begin, let us consider a Monte Carlo version of classical policy iteration. In this method, we perform alternating complete steps of policy evaluation and policy improvement, beginning with an arbitrary policy $\pi_{0}$ and ending with the optimal policy and optimal action-value function:
$$
\pi_{0} \stackrel{E}{\longrightarrow} q_{\pi_{0}} \stackrel{I}{\longrightarrow} \pi_{1} \stackrel{E}{\longrightarrow} q_{\pi_{1}} \stackrel{I}{\longrightarrow} \pi_{2} \stackrel{E}{\longrightarrow} \cdots \stackrel{I}{\longrightarrow} \pi_{\star} \stackrel{E}{\longrightarrow} q_{\star}.
$$
We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes. We postpone consideration of the first assumption until later.

The primary approach to avoiding the infinite number of episodes nominally required for policy evaluation is to forgo complete policy evaluation before returning to policy improvement. For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

Monte Carlo ES (Exploring Starts)

Initialize, for all $s \in \mathcal{S},\; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow \text{arbitrary}$

​ $\pi(s) \leftarrow \text{arbitrary}$

​ $Returns(s,a) \leftarrow \text{empty list}$

Repeat forever:

​ Choose $S_{0} \in \mathcal{S}$ and $A_{0} \in \mathcal{A(S_{0})}$ s.t. all pairs have probability > 0

​ Generate an episode starting from $S_0, A_0$, following $\pi$

​ For each pair $s, a$ appearing in the episode:

​ $G \leftarrow \text{return following the first occurrence of} \; s, a$

​ Append $G$ to $Returns(s,a)$

​ $Q(s,a) \leftarrow \text{average}(Returns(s,a))$

​ For each $s$ in the episode:

​ $\pi(s) \leftarrow \arg\max_{a} Q(s,a)$

Recall the blackjack card game. It is straightforward to apply Monte Carlo ES to blackjack.

# argmax that breaks ties randomly; this helper is not defined in the snippets above,
# so we add one here as an assumption (np.argmax would always pick the first maximum)
def argmax(values):
    return np.random.choice(np.flatnonzero(values == np.max(values)))

def figure5_3():
    stateActionValues = monteCarloES(500000)
    stateValueUsableAce = np.zeros((10, 10))
    stateValueNoUsableAce = np.zeros((10, 10))
    # get the optimal policy
    actionUsableAce = np.zeros((10, 10), dtype='int')
    actionNoUsableAce = np.zeros((10, 10), dtype='int')
    for i in range(10):
        for j in range(10):
            stateValueNoUsableAce[i, j] = np.max(stateActionValues[i, j, 0, :])
            stateValueUsableAce[i, j] = np.max(stateActionValues[i, j, 1, :])
            actionNoUsableAce[i, j] = argmax(stateActionValues[i, j, 0, :])
            actionUsableAce[i, j] = argmax(stateActionValues[i, j, 1, :])
    prettyPrint(stateValueUsableAce, 'Optimal state value with usable Ace')
    prettyPrint(stateValueNoUsableAce, 'Optimal state value with no usable Ace')
    prettyPrint(actionUsableAce, 'Optimal policy with usable Ace', 'Action (0 Hit, 1 Stick)')
    prettyPrint(actionNoUsableAce, 'Optimal policy with no usable Ace', 'Action (0 Hit, 1 Stick)')
    plt.show()

Running the code, we get a conceptual diagram like the following:

(Figure: conceptual diagram of the Monte Carlo ES results.)

Let us see the implementation of this algorithm (the monteCarloES method). Note that some of the auxiliary variables it uses were defined earlier.

# Monte Carlo with Exploring Starts
def monteCarloES(nEpisodes):
    # (playerSum, dealerCard, usableAce, action)
    stateActionValues = np.zeros((10, 10, 2, 2))
    # set default counts to 1 to avoid division by zero
    stateActionPairCount = np.ones((10, 10, 2, 2))

    # behavior policy is greedy with respect to the current estimates
    def behaviorPolicy(usableAce, playerSum, dealerCard):
        usableAce = int(usableAce)
        playerSum -= 12
        dealerCard -= 1
        return argmax(stateActionValues[playerSum, dealerCard, usableAce, :])

    # play for several episodes
    for episode in range(nEpisodes):
        print('episode:', episode)
        # for each episode, use a randomly initialized state and action (exploring starts)
        initialState = [bool(np.random.choice([0, 1])),
                        np.random.choice(range(12, 22)),
                        np.random.choice(range(1, 11))]
        initialAction = np.random.choice(actions)
        _, reward, trajectory = play(behaviorPolicy, initialState, initialAction)
        for action, (usableAce, playerSum, dealerCard) in trajectory:
            usableAce = int(usableAce)
            playerSum -= 12
            dealerCard -= 1
            # update values of state-action pairs
            stateActionValues[playerSum, dealerCard, usableAce, action] += reward
            stateActionPairCount[playerSum, dealerCard, usableAce, action] += 1
    return stateActionValues / stateActionPairCount

You can see that we now use the trajectory variable. The difference between on-policy Monte Carlo and Monte Carlo ES is that the former calculates the average return of each state, whereas the latter calculates the average return of each state-action pair.

The results are as follows:

(Figures, from top to bottom: optimal policy with usable ace, optimal policy with no usable ace, optimal state value with usable ace, optimal state value with no usable ace.)

How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods (do you remember this term?) and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data. The first-visit Monte Carlo method and the Monte Carlo ES method are examples of on-policy methods. Here we show how an on-policy Monte Carlo control method can be designed that does not use the unrealistic assumption of exploring starts. Off-policy methods are considered later.

The on-policy method we present here uses $\epsilon$-greedy policies. The complete algorithm is given below.

On-policy first-visit MC control (for $\epsilon \text{-soft}$ policies)

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a ) \leftarrow \text{arbitrary}$

​ $Returns(s,a) \leftarrow \text{empty list}$

​ $\pi(a|s) \leftarrow \text{an arbitrary} \; \epsilon \text{-soft policy}$

Repeat forever:

​ (a) Generate an episode using $\pi$

​ (b) For each pair $s, a$ appearing in the episode:

​ $G \leftarrow $ return following the first occurrence of $s, a$

​ Append $G$ to $Returns(s,a)$

​ $Q(s, a) \leftarrow \text{average}(Returns(s,a))$

​ (c) For each s in the episode:

​ $A^{\star} \leftarrow \arg\max_{a} Q(s,a)$

​ For all $a \in \mathcal{A(s)}$:

​ $\pi(a|s) \leftarrow \left\{ \begin{array}{c} 1 - \epsilon + \epsilon/|\mathcal{A(s)}|\;\;\;\text{if} \;a=A^{\star} \\ \epsilon/|\mathcal{A(s)}|\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{if} \;a\neq A^{\star} \end{array} \right.$
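The post does not include code for this $\epsilon$-soft variant, so here is a minimal sketch of step (c) only (the $\epsilon$-greedy improvement), under the assumption that Q and pi are NumPy arrays indexed by (state, action); the names are hypothetical:

import numpy as np

def epsilonGreedyImprovement(pi, Q, state, epsilon):
    # pi: (nStates, nActions) array of action probabilities (an epsilon-soft policy)
    # Q:  (nStates, nActions) array of action-value estimates
    nActions = Q.shape[1]
    bestAction = np.argmax(Q[state])
    # every action gets at least epsilon / |A(s)| probability
    pi[state, :] = epsilon / nActions
    # the greedy action gets the remaining probability mass
    pi[state, bestAction] = 1 - epsilon + epsilon / nActions
    return pi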

All learning control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions (to find the optimal actions). How can they learn about the optimal policy while behaving according to an exploratory policy? The on-policy approach in the preceding section is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.

We begin the study of off-policy methods by considering the prediction problem, in which both target and behavior policies are fixed. That is, suppose we wish to estimate $v_{\pi}$ or $q_{\pi}$, but all we have are episodes following another policy $\mu$, where $\mu \neq \pi$. In this case, $\pi$ is the target policy, $\mu$ is the behavior policy, and both policies are considered fixed and given.

In order to use episodes from $\mu$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $\mu$. That is, we require that $\pi(a|s) > 0$ implies $\mu(a|s) > 0$. This is called the assumption of coverage. It follows from coverage that $\mu$ must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic. Here, we consider the prediction problem, in which $\pi$ is unchanging and given.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio. Given a starting state $S_{t}$, the probability of the subsequent state-action trajectory, $A_{t}, S_{t+1}, A_{t+1}, \cdots, S_{T}$, occurring under any policy $\pi$ is
$$
\prod_{k=t}^{T-1} \pi(A_k | S_k)p(S_{k+1} | S_k, A_k),
$$
where $p$ here is the state-transition probability function. Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is
$$
\rho_{t}^{T} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k|S_k)p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi (A_k | S_k)}{\mu (A_k | S_k)}
$$
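Note that the transition probabilities $p$ cancel, so the ratio depends only on the two policies and the observed actions. As a small illustrative sketch (the function names are hypothetical), the ratio for a recorded trajectory could be computed like this:

def importanceSamplingRatio(trajectory, targetPolicyProb, behaviorPolicyProb):
    # trajectory: list of (state, action) pairs for A_t, ..., A_{T-1}
    # targetPolicyProb(s, a) = pi(a|s), behaviorPolicyProb(s, a) = mu(a|s)
    rho = 1.0
    for state, action in trajectory:
        rho *= targetPolicyProb(state, action) / behaviorPolicyProb(state, action)
    return rho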
Now we are ready to give a Monte Carlo algorithm that uses a batch of observed episodes following policy $\mu$ to estimate $v_{\pi}(s)$. It is convenient here to number time steps in a way that increases across episode boundaries. That is, if the first episode of the batch ends in a terminal state at time 100, then the next episode begins at time $t=101$. This enables us to use time-step numbers to refer to particular steps in particular episodes. In particular, we can define the set of all time steps in which state $s$ is visited, denoted $\tau(s)$. This is for an every-visit method; for a first-visit method, $\tau(s)$ would only include time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$. Then $\{G_t\}_{t \in \tau(s)}$ are the returns that pertain to state $s$, and $\{\rho_{t}^{T(t)}\}_{t \in \tau(s)}$ are the corresponding importance-sampling ratios. To estimate $v_{\pi}(s)$, we simply scale the returns by the ratios and average the results:
$$
V(s) \doteq \frac{\sum_{t \in \tau(s)} \rho_{t}^{T(t)} G_t}{|\tau(s)|}.
$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling.

An important alternative is weighted importance sampling, which uses a weighted average, defined as
$$
V(s) \doteq \frac{\sum_{t \in \tau(s)} \rho_{t}^{T(t)} G_t}{\sum_{t \in \tau(s)} \rho_{t}^{T(t)}},
$$
or zero if the denominator is zero.
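For concreteness, here is a small sketch (not from the original code) of the two estimators, assuming we already have the returns $G_t$ and the ratios $\rho_t^{T(t)}$ for the visits to $s$ as two parallel lists:

import numpy as np

def ordinaryImportanceSampling(returns, ratios):
    returns, ratios = np.asarray(returns, dtype=float), np.asarray(ratios, dtype=float)
    # simple average of the importance-weighted returns
    return np.sum(ratios * returns) / len(returns)

def weightedImportanceSampling(returns, ratios):
    returns, ratios = np.asarray(returns, dtype=float), np.asarray(ratios, dtype=float)
    denominator = np.sum(ratios)
    # weighted average, defined as zero when the denominator is zero
    return np.sum(ratios * returns) / denominator if denominator != 0 else 0.0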

We applied both ordinary and weighted importance-sampling methods to estimate the value of a single blackjack state from off-policy data. Recall that one of the advantages of Monte Carlo methods is that they can be used to evaluate a single state without forming estimates for any other states. In this example, we evaluated the state in which the dealer is showing a deuce, the sum of the player’s cards is 13, and the player has a usable ace (that is, the player holds an ace and a deuce, or equivalently three aces). The data was generated by starting in this state and then choosing to hit or stick at random with equal probability (the behavior policy). The target policy was to stick only on a sum of 20 or 21. The value of this state under the target policy is approximately -0.27726 (this was determined by separately generating one hundred million episodes using the target policy and averaging their returns). Both off-policy methods closely approximated this value after 1000 off-policy episodes using the random policy. To make sure they did this reliably, we performed 100 independent runs, each starting from estimates of zero and learning for 10,000 episodes. The implementation details are as follows:

# Monte Carlo Sample with Off-Policy
def monteCarloOffPolicy(nEpisodes):
    initialState = [True, 13, 2]
    sumOfImportanceRatio = [0]
    sumOfRewards = [0]
    for i in range(0, nEpisodes):
        _, reward, playerTrajectory = play(behaviorPolicyPlayer, initialState=initialState)
        # get the importance ratio
        importanceRatioAbove = 1.0
        importanceRatioBelow = 1.0
        for action, (usableAce, playerSum, dealerCard) in playerTrajectory:
            if action == targetPolicyPlayer(usableAce, playerSum, dealerCard):
                importanceRatioBelow *= 0.5
            else:
                importanceRatioAbove = 0.0
                break
        importanceRatio = importanceRatioAbove / importanceRatioBelow
        sumOfImportanceRatio.append(sumOfImportanceRatio[-1] + importanceRatio)
        sumOfRewards.append(sumOfRewards[-1] + reward * importanceRatio)
    del sumOfImportanceRatio[0]
    del sumOfRewards[0]
    sumOfRewards = np.asarray(sumOfRewards)
    sumOfImportanceRatio = np.asarray(sumOfImportanceRatio)
    ordinarySampling = sumOfRewards / np.arange(1, nEpisodes + 1)
    with np.errstate(divide='ignore', invalid='ignore'):
        weightedSampling = np.where(sumOfImportanceRatio != 0, sumOfRewards / sumOfImportanceRatio, 0)
    return ordinarySampling, weightedSampling

Note that behaviorPolicyPlayer is the function (defined earlier) that specifies the behavior policy:

# function form of behavior policy of player
def behaviorPolicyPlayer(usableAcePlayer, playerSum, dealerCard):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT

The numerator and denominator of the two importance-sampling estimates are sums over episodes. As the number of episodes increases, we need to accumulate the contribution from each episode, so we store running totals; the sumOfRewards and sumOfImportanceRatio lists are used for this purpose.

Then we need to show the result (mean square error):

# Figure 5.4
def offPolicy():
    trueValue = -0.27726
    nEpisodes = 10000
    nRuns = 100
    ordinarySampling = np.zeros(nEpisodes)
    weightedSampling = np.zeros(nEpisodes)
    for i in range(0, nRuns):
        ordinarySampling_, weightedSampling_ = monteCarloOffPolicy(nEpisodes)
        # get the squared error
        ordinarySampling += np.power(ordinarySampling_ - trueValue, 2)
        weightedSampling += np.power(weightedSampling_ - trueValue, 2)
    ordinarySampling /= nRuns
    weightedSampling /= nRuns
    axisX = np.log10(np.arange(1, nEpisodes + 1))
    plt.plot(axisX, ordinarySampling, label='Ordinary Importance Sampling')
    plt.plot(axisX, weightedSampling, label='Weighted Importance Sampling')
    plt.xlabel('Episodes (10^x)')
    plt.ylabel('Mean square error')
    plt.legend()
    plt.show()

Result is as follows:

(Figure: mean square error of ordinary and weighted importance sampling, averaged over 100 runs.)

Formally, the difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased, whereas the weighted importance-sampling estimator is biased (though the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. Nevertheless, we will not totally abandon ordinary importance sampling, as it is easier to extend to the approximate methods using function approximation that we explore later.

The estimates of ordinary importance sampling will typically have infinite variance when trajectories contain loops, and this can easily happen in off-policy learning. A simple example is shown below:

(Figure: the one-state MDP used to illustrate infinite variance.)

There is only one nonterminal state $s$ and two actions, end and back. The end action causes a deterministic transition to termination, whereas the back action transitions, with probability 0.9, back to $s$ or, with probability 0.1, on to termination. The rewards are +1 on the latter transition and zero otherwise. Consider the target policy that always selects back. All episodes under this policy consist of some number (possibly zero) of transitions back to $s$ followed by termination with a reward and return of +1. Thus the value of $s$ under the target policy is 1. Suppose we are estimating this value from off-policy data using the behavior policy that selects end and back with equal probability.

The implementation details are as follows. We first define the two policies:

ACTION_BACK = 0
ACTION_END = 1

# behavior policy: choose back or end with equal probability
def behaviorPolicy():
    return np.random.binomial(1, 0.5)

# target policy: always choose back
def targetPolicy():
    return ACTION_BACK

Then we define how an episode runs:

# play one episode under the behavior policy
def play():
    # track the actions for the importance ratio
    trajectory = []
    while True:
        action = behaviorPolicy()
        trajectory.append(action)
        if action == ACTION_END:
            return 0, trajectory
        # the back action terminates with probability 0.1 and reward +1
        if np.random.binomial(1, 0.9) == 0:
            return 1, trajectory

Now we start our off-policy (first-visit MC) learning process:

# Figure 5.5
def monteCarloSample():
    runs = 10
    episodes = 100000
    axisX = np.log10(np.arange(1, episodes + 1))
    for run in range(0, runs):
        sumOfRewards = [0]
        for episode in range(0, episodes):
            reward, trajectory = play()
            if trajectory[-1] == ACTION_END:
                # the end action is impossible under the target policy, so the ratio is 0
                importanceRatio = 0
            else:
                importanceRatio = 1.0 / pow(0.5, len(trajectory))
            sumOfRewards.append(sumOfRewards[-1] + importanceRatio * reward)
        del sumOfRewards[0]
        estimations = np.asarray(sumOfRewards) / np.arange(1, episodes + 1)
        plt.plot(axisX, estimations)
    plt.xlabel('Episodes (10^x)')
    plt.ylabel('Ordinary Importance Sampling')
    plt.show()
    return

Result is as follows:

(Figure: ordinary importance-sampling estimates over ten independent runs.)

The correct estimate here is 1, and, even though this is the expected value of a sample return (after importance sampling), the variance of the samples is infinite, and the estimates do not converge to this value.

Finally, we present two more algorithms: Incremental off-policy every-visit MC policy evaluation and Off-policy every-visit MC control.

Incremental off-policy every-visit MC policy evaluation

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow$ arbitrary

​ $C(s,a) \leftarrow$ 0

​ $\mu(a|s) \leftarrow$ an arbitrary soft behavior policy

​ $\pi(a|s) \leftarrow$ an arbitrary target policy

Repeat forever:

​ Generate an episode using $\mu$:

​ $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_{T}, S_{T}$

​ $G \leftarrow 0$

​ $W \leftarrow 1$

​ For $t=T-1, T-2, \cdots \;\text{downto} \; 0$:

​ $G \leftarrow \gamma G + R_{t+1}$

​ $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$

​ $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}[G - Q(S_t, A_t)]$

​ $W \leftarrow W\frac{\pi(A_t | S_t)}{\mu(A_t | S_t)}$

​ If $W = 0$ then ExitForLoop
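A minimal sketch of this incremental update in Python (the names are hypothetical; the episode is assumed to be a list of (state, action, reward) triples generated under $\mu$, and targetPolicyProb / behaviorPolicyProb give $\pi(a|s)$ and $\mu(a|s)$):

def incrementalOffPolicyEvaluation(episode, Q, C, targetPolicyProb, behaviorPolicyProb, gamma=1.0):
    # episode: [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)]
    # Q, C: dictionaries keyed by (state, action), as in the pseudocode above
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        sa = (state, action)
        C[sa] = C.get(sa, 0.0) + W
        q = Q.get(sa, 0.0)
        Q[sa] = q + (W / C[sa]) * (G - q)
        W *= targetPolicyProb(state, action) / behaviorPolicyProb(state, action)
        if W == 0:
            break
    return Q, C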

Off-policy every-visit MC control (returns $\pi \approx \pi_{\star}$)

Initialize, for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}$:

​ $Q(s,a) \leftarrow$ arbitrary

​ $C(s,a) \leftarrow$ 0

​ $\mu(a|s) \leftarrow$ an arbitrary soft behavior policy

​ $\pi(s) \leftarrow$ a deterministic policy that is greedy with respect to $Q$

Repeat forever:

​ Generate an episode using $\mu$:

​ $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_{T}, S_{T}$

​ $G \leftarrow 0$

​ $W \leftarrow 1$

​ For $t=T-1, T-2, \cdots \;\text{downto} \; 0$:

​ $G \leftarrow \gamma G + R_{t+1}$

​ $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$

​ $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}[G - Q(S_t, A_t)]$

​ $\pi(S_t) \leftarrow \arg \max_{a}Q(S_t, a)$ (with ties broken consistently)

​ If $A_t \neq \pi(S_t)$ then ExitForLoop

​ $W \leftarrow W\frac{1}{\mu(A_t | S_t)}$