Temporal-Difference Learning

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome.

TD(0)

Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for $V(S_t)$. A simple every-visit Monte Carlo method suitable for nonstationary environments is
$$
V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)],
$$
where $G_t$ is the actual return following time $t$. Let us call this method $constant\text{-}\alpha \ MC$. Notice that in a stationary environment (as in earlier chapters, where we used sample averages rather than a constant step size), $\alpha$ would equal $\frac{1}{N(S_t)}$. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update
$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]
$$
immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. In effect, the target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called $TD(0)$, or one-step TD. The box below specifies TD(0) completely in procedural form.

td_0
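Before specializing it to an example, here is a minimal sketch of this procedure in Python. The environment interface (`reset()` and `step(action)` returning `(next_state, reward, done)`) and the `policy` function are hypothetical placeholders, not from the original text:

import numpy as np

def td0_prediction(env, policy, num_states, num_episodes, alpha=0.1, gamma=1.0):
    # V(s) is a table of value estimates, initialized arbitrarily (here: zeros)
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(S_t) toward the target R_{t+1} + gamma * V(S_{t+1});
            # the value of a terminal state is taken to be 0
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V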

TD(0)’s backup diagram is as follows:

td0bg

Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. From the definition of the value function we know that
$$
\begin{align}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi} [G_t \ | \ S_t=s] \\
&= \mathbb{E}_{\pi} [R_{t+1} + \gamma G_{t+1} \ | \ S_t=s] \\
&= \mathbb{E}_{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \ | \ S_t=s].
\end{align}
$$
Roughly speaking, Monte Carlo methods use an estimate of (3) as a target, whereas DP methods use an estimate of (5) as a target. The Monte Carlo target is an estimate because the expected value in (3) is not known; a sample return is used in place of the real expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment (the environment is known to DP methods), but because $v_{\pi}(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is used instead. The TD target is an estimate for both reasons: it samples the expected values and it uses the current estimate $V$ in place of the true $v_{\pi}$.

Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity, called the TD error, arises in various forms throughout reinforcement learning:
$$
\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t).
$$
Notice that the TD error at each time is the error in the estimate made at that time. Because the TD error depends on the next state and the next reward, it is not actually available until one time step later. Also note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:
$$
\begin{align}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma (G_{t+1} - V(S_{t+1})) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - V(S_{t+2})) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(G_T - V(S_T)) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k.
\end{align}
$$
This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately. Generalizations of this identity play an important role in the theory and algorithms of temporal-difference learning.
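As a quick numerical sanity check of this identity (a small sketch with hypothetical values, not from the original text), the Monte Carlo error and the discounted sum of TD errors coincide when $V$ is held fixed:

import numpy as np

# A fixed (not updated) value table and an arbitrary short episode
V = {1: 0.2, 2: 0.5, 3: 0.7, 'T': 0.0}      # 'T' is the terminal state, value 0
states = [1, 2, 3, 'T']                      # S_0, S_1, S_2, S_T
rewards = [0.0, 1.0, 2.0]                    # R_1, R_2, R_3
gamma = 1.0

# Monte Carlo error for S_0: G_0 - V(S_0)
G0 = sum(gamma ** k * r for k, r in enumerate(rewards))
mc_error = G0 - V[states[0]]

# Discounted sum of TD errors: sum_k gamma^(k-t) * delta_k
td_errors = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]] for k in range(len(rewards))]
td_sum = sum(gamma ** k * d for k, d in enumerate(td_errors))

print(mc_error, td_sum)   # both print the same number (here 2.8)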

Example: Random walk

In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC applied to the small Markov reward process shown in the upper part of the figure below. All episodes start in the center state, C, and then proceed either left or right by one state on each step, with equal probability. This behavior can be thought of as the combined effect of a fixed policy and an environment's state-transition probabilities, but we do not care which; we are concerned only with predicting returns however they are generated. Episodes terminate either on the extreme left or on the extreme right. When an episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state. Thus, the true value of the center state is $v_{\pi}(\text{C}) = 0.5$, and the true values of the states A through E are $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$, and $\frac{5}{6}$, respectively. In all cases the approximate value function was initialized to the intermediate value $V(s)=0.5$, for all $s$.

random_walk
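These true values can also be read off from the Bellman equation for this chain. Writing the nonterminal states in order as $s_1, \dots, s_5$ (A through E), with boundary values $v_{\pi}(s_0) = 0$ on the left and $v_{\pi}(s_6) = 1$ on the right (the terminal value of 1 standing in for the +1 reward, as in the code below), each nonterminal state satisfies
$$
v_{\pi}(s_k) = \tfrac{1}{2}\, v_{\pi}(s_{k-1}) + \tfrac{1}{2}\, v_{\pi}(s_{k+1}), \qquad k = 1, \dots, 5,
$$
so $v_{\pi}$ is linear in $k$, and the boundary conditions give $v_{\pi}(s_k) = k/6$.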

Now, let us develop the code to solve this problem.

First, we initialize the state values and the true values.

import numpy as np
import matplotlib.pyplot as plt

# 0 is the left terminal state
# 6 is the right terminal state
# 1 ... 5 represents A ... E
states = np.zeros(7)
states[1:6] = 0.5
# For convenience, we assume all rewards are 0
# and the left terminal state has value 0, the right terminal state has value 1
# This trick has been used in Gambler's Problem
states[6] = 1
# set up true state values
trueValue = np.zeros(7)
trueValue[1:6] = np.arange(1, 6) / 6.0
trueValue[6] = 1
ACTION_LEFT = 0
ACTION_RIGHT = 1

The code below implements the TD(0) algorithm:

def temporalDifference(states, alpha=0.1, batch=False):
    state = 3
    trajectory = [state]
    rewards = [0]
    while True:
        oldState = state
        if np.random.binomial(1, 0.5) == ACTION_LEFT:
            state -= 1
        else:
            state += 1
        # Assume all rewards are 0
        reward = 0
        trajectory.append(state)
        # TD update
        if not batch:
            states[oldState] += alpha * (reward + states[state] - states[oldState])
        if state == 6 or state == 0:
            break
        rewards.append(reward)
    return trajectory, rewards

And the code below implements the constant-$\alpha$ Monte Carlo algorithm:

def monteCarlo(states, alpha=0.1, batch=False):
    state = 3
    trajectory = [3]
    # if end up with left terminal state, all returns are 0
    # if end up with right terminal state, all returns are 1
    returns = 0
    while True:
        if np.random.binomial(1, 0.5) == ACTION_LEFT:
            state -= 1
        else:
            state += 1
        trajectory.append(state)
        if state == 6:
            returns = 1.0
            break
        elif state == 0:
            returns = 0.0
            break
    if not batch:
        for state_ in trajectory[:-1]:
            # MC update
            states[state_] += alpha * (returns - states[state_])
    return trajectory, [returns] * (len(trajectory) - 1)

First of all, let us test the performance of the TD(0) algorithm:

def stateValue():
    episodes = [0, 1, 10, 100]
    currentStates = np.copy(states)
    plt.figure(1)
    axisX = np.arange(0, 7)
    for i in range(0, episodes[-1] + 1):
        if i in episodes:
            plt.plot(axisX, currentStates, label=str(i) + ' episodes')
        temporalDifference(currentStates)
    plt.plot(axisX, trueValue, label='true values')
    plt.xlabel('state')
    plt.legend()

stateValue()

Results are as follows:

random_walk_td0

And then let us plot the RMS error of the TD(0) algorithm and the constant-$\alpha$ Monte Carlo algorithm for various $\alpha$ values:

def RMSError():
    # I'm lazy here, so do not let same alpha value appear in both arrays
    # For example, if in TD you want to use alpha = 0.2, then in MC you can use alpha = 0.201
    TDAlpha = [0.15, 0.1, 0.05]
    MCAlpha = [0.01, 0.02, 0.03, 0.04]
    episodes = 100 + 1
    runs = 100
    plt.figure(2)
    axisX = np.arange(0, episodes)
    for alpha in TDAlpha + MCAlpha:
        totalErrors = np.zeros(episodes)
        if alpha in TDAlpha:
            method = 'TD'
        else:
            method = 'MC'
        for run in range(0, runs):
            errors = []
            currentStates = np.copy(states)
            for i in range(0, episodes):
                errors.append(np.sqrt(np.sum(np.power(trueValue - currentStates, 2)) / 5.0))
                if method == 'TD':
                    temporalDifference(currentStates, alpha=alpha)
                else:
                    monteCarlo(currentStates, alpha=alpha)
            totalErrors += np.asarray(errors)
        totalErrors /= runs
        plt.plot(axisX, totalErrors, label=method + ', alpha=' + str(alpha))
    plt.xlabel('episodes')
    plt.legend()

RMSError()

Results are as follows:

random_walk_error

We can see that the TD method was consistently better than the MC method on this task.

Now, suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. We call this batch updating.

Example: Random walk under batch updating

After each new episode, all episodes seen so far are treated as a batch and repeatedly presented to the algorithm.

def batchUpdating(method, episodes, alpha=0.001):
    # perform 100 independent runs
    runs = 100
    totalErrors = np.zeros(episodes - 1)
    for run in range(0, runs):
        currentStates = np.copy(states)
        errors = []
        # track shown trajectories and reward/return sequences
        trajectories = []
        rewards = []
        for ep in range(1, episodes):
            print('Run:', run, 'Episode:', ep)
            if method == 'TD':
                trajectory_, rewards_ = temporalDifference(currentStates, batch=True)
            else:
                trajectory_, rewards_ = monteCarlo(currentStates, batch=True)
            trajectories.append(trajectory_)
            rewards.append(rewards_)
            while True:
                # keep feeding our algorithm with trajectories seen so far until state value function converges
                updates = np.zeros(7)
                for trajectory_, rewards_ in zip(trajectories, rewards):
                    for i in range(0, len(trajectory_) - 1):
                        if method == 'TD':
                            updates[trajectory_[i]] += rewards_[i] + currentStates[trajectory_[i + 1]] - currentStates[trajectory_[i]]
                        else:
                            updates[trajectory_[i]] += rewards_[i] - currentStates[trajectory_[i]]
                updates *= alpha
                if np.sum(np.abs(updates)) < 1e-3:
                    break
                # perform batch updating
                currentStates += updates
            # calculate rms error
            errors.append(np.sqrt(np.sum(np.power(currentStates - trueValue, 2)) / 5.0))
        totalErrors += np.asarray(errors)
    totalErrors /= runs
    return totalErrors

Notice the core code:

while True:
    # keep feeding our algorithm with trajectories seen so far until state
    # value function converges
    updates = np.zeros(7)
    for trajectory_, rewards_ in zip(trajectories, rewards):
        for i in range(0, len(trajectory_) - 1):
            if method == 'TD':
                updates[trajectory_[i]] += rewards_[i] + \
                    currentStates[trajectory_[i + 1]] - currentStates[trajectory_[i]]
            else:
                updates[trajectory_[i]] += rewards_[i] - currentStates[trajectory_[i]]
    updates *= alpha
    if np.sum(np.abs(updates)) < 1e-3:
        break
    # perform batch updating
    currentStates += updates

For both methods the batch updates are applied repeatedly until they become negligible: the TD version drives the summed TD errors toward zero, while the MC version drives the MC errors (return minus estimate) toward zero.
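The driver that produces the figure below is not shown above; a minimal sketch reusing the batchUpdating function (the figure number and labels are assumptions, not from the original code) could look like this:

def batchUpdatingFigure():
    # 100 episodes plus the initial error point, matching the RMS-error experiment above
    episodes = 100 + 1
    TDErrors = batchUpdating('TD', episodes)
    MCErrors = batchUpdating('MC', episodes)
    plt.figure(3)
    plt.plot(TDErrors, label='TD')
    plt.plot(MCErrors, label='MC')
    plt.xlabel('episodes')
    plt.ylabel('RMS error')
    plt.legend()

batchUpdatingFigure()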

The result is as follows:

batch_update

Under batch training, constant-$\alpha$ MC converges to values, $V(s)$, that are sample averages of the actual returns experienced after visiting each state $s$. These are optimal estimates in the sense that they minimize the mean-squared error from the actual returns in the training set. In this sense it is surprising that the batch TD method was able to perform better according to the root mean-squared error measure shown in the figure above. How is it that batch TD was able to perform better than this optimal method? Consider the example in the box below:

example6_4

The example illustrates a general difference between the estimates found by batch TD(0) and batch Monte Carlo methods. Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate.
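To make the certainty-equivalence idea concrete, here is a small sketch (a hypothetical helper, not part of the original code) that builds the maximum-likelihood model of the random-walk chain from a batch of trajectories and then solves the resulting linear system exactly; batch TD(0) converges to this same estimate:

import numpy as np

def certainty_equivalence_estimate(trajectories):
    """Maximum-likelihood model of the 7-state random walk, then its exact value function.

    trajectories: list of state sequences such as [3, 4, 5, 6]; states 0 and 6 are terminal,
    and the only nonzero reward (+1) is received on the transition into state 6.
    """
    n = 7
    counts = np.zeros((n, n))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1

    # Maximum-likelihood transition probabilities for the nonterminal states 1..5
    P = np.zeros((n, n))
    for s in range(1, 6):
        if counts[s].sum() > 0:
            P[s] = counts[s] / counts[s].sum()

    # Expected immediate reward under the model: +1 exactly when the next state is 6
    r = P[1:6, 6].copy()

    # Solve V = r + P V on the nonterminal states (terminal values are 0)
    A = np.eye(5) - P[1:6, 1:6]
    V = np.zeros(n)
    V[1:6] = np.linalg.solve(A, r)
    return V

# Example: a batch of two episodes
print(certainty_equivalence_estimate([[3, 4, 5, 6], [3, 2, 1, 0]]))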

Sarsa

We turn now to the use of TD methods for the control problem. As an on-policy method, Sarsa learns an action-value function $q_{\pi}(s, a)$ for the current behavior policy $\pi$ rather than a state-value function, applying essentially the same TD(0) update to state-action pairs:
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]
$$

This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm. The backup diagram for Sarsa is shown below.

sarsa_bg

The general form of the Sarsa control algorithm is given in the box below.

sarsa

Example: Windy Gridworld

The figure below shows a standard grid world, with start and goal states, but with one difference: there is a crosswind blowing upward through the middle of the grid. The actions are the standard four (up, down, right, and left), but in the middle region the resultant next states are shifted upward by a "wind," the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward. For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with constant rewards of −1 until the goal state is reached.

windy_gridworld

To demonstrate the problem clearly, we use the OpenAI gym toolkit to develop the algorithm.

First of all, we need to define an environment (the windy grid world):

# represent each action as an integer
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

The environment is a class that inherits from gym's discrete.DiscreteEnv class (indicating that the state space is discrete):

class WindyGridworldEnv(discrete.DiscreteEnv):

First we need to construct our world:

    def __init__(self):
        self.shape = (7, 10)
        # the number of all states
        nS = np.prod(self.shape)
        # the number of all actions
        nA = 4
        # Wind strength
        winds = np.zeros(self.shape)
        winds[:, [3, 4, 5, 8]] = 1
        winds[:, [6, 7]] = 2
        # Calculate transition probabilities
        # P is the transition matrix
        P = {}
        for s in range(nS):
            position = np.unravel_index(s, self.shape)
            P[s] = {a: [] for a in range(nA)}
            P[s][UP] = self._calculate_transition_prob(position, [-1, 0], winds)
            P[s][RIGHT] = self._calculate_transition_prob(position, [0, 1], winds)
            P[s][DOWN] = self._calculate_transition_prob(position, [1, 0], winds)
            P[s][LEFT] = self._calculate_transition_prob(position, [0, -1], winds)
        # We always start in state (3, 0)
        isd = np.zeros(nS)
        isd[np.ravel_multi_index((3, 0), self.shape)] = 1.0
        super(WindyGridworldEnv, self).__init__(nS, nA, P, isd)

This is quite natural, right? Notice that there is a method called _calculate_transition_prob:

    def _calculate_transition_prob(self, current, delta, winds):
        new_position = np.array(current) + np.array(delta) + np.array([-1, 0]) * winds[tuple(current)]
        new_position = self._limit_coordinates(new_position).astype(int)
        new_state = np.ravel_multi_index(tuple(new_position), self.shape)
        is_done = tuple(new_position) == (3, 7)
        return [(1.0, new_state, -1.0, is_done)]

and the _limit_coordinates method:

    def _limit_coordinates(self, coord):
        coord[0] = min(coord[0], self.shape[0] - 1)
        coord[0] = max(coord[0], 0)
        coord[1] = min(coord[1], self.shape[1] - 1)
        coord[1] = max(coord[1], 0)
        return coord

It is worth mentioning that gym's default DiscreteEnv class is constructed from some useful quantities: nS, nA, P, and isd. nS is the total number of states and nA is the total number of actions (here we assume every state admits the same fixed set of actions). P is the transition table: each entry P[s][a] is a list of (probability, next_state, reward, done) tuples, where done indicates whether the next state is terminal. isd is the initial state distribution. The default class provides a step method (accepting an action) that generates the episode automatically according to P.
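For example, once the class above is assembled, one can inspect P directly (a small sketch; the printed values assume the deterministic transitions defined above):

env = WindyGridworldEnv()
start = np.ravel_multi_index((3, 0), env.shape)
# One deterministic transition per action: probability 1.0, reward -1.0, not done.
# Taking RIGHT from the start (no wind in column 0) should print something like
# [(1.0, 31, -1.0, False)], i.e. the state one cell to the right.
print(env.P[start][RIGHT])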

Finally, we define a render method to pretty-print the current state of the world:

    def _render(self, mode='human', close=False):
        if close:
            return
        outfile = StringIO() if mode == 'ansi' else sys.stdout
        for s in range(self.nS):
            position = np.unravel_index(s, self.shape)
            # print(self.s)
            if self.s == s:
                output = " x "
            elif position == (3, 7):
                output = " T "
            else:
                output = " o "
            if position[1] == 0:
                output = output.lstrip()
            if position[1] == self.shape[1] - 1:
                output = output.rstrip()
                output += "\n"
            outfile.write(output)
        outfile.write("\n")

Then, let us test our environment:

env = WindyGridworldEnv()
print(env.reset())
env.render()
print(env.step(1))
env.render()
print(env.step(1))
env.render()
print(env.step(1))
env.render()
print(env.step(2))
env.render()
print(env.step(1))
env.render()
print(env.step(1))
env.render()

The results are as follows:

windy_show

At each state transition, the step method returns a tuple (next_state, reward, is_done, info).

Next, we define the episode-generation policy:

def make_epsilon_greedy_policy(Q, epsilon, nA):
    """
    Creates an epsilon-greedy policy based on a given Q-function and epsilon.

    Args:
        Q: A dictionary that maps from state -> action-values.
            Each value is a numpy array of length nA (see below)
        epsilon: The probability to select a random action. Float between 0 and 1.
        nA: Number of actions in the environment.

    Returns:
        A function that takes the observation as an argument and returns
        the probabilities for each action in the form of a numpy array of length nA.
    """
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn
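The returned policy_fn gives action probabilities for a state, from which we sample an action. A short hypothetical usage sketch (the numbers are illustrative, not from the original text):

from collections import defaultdict
import numpy as np

# 4 actions, all Q-values zero, epsilon = 0.1
Q = defaultdict(lambda: np.zeros(4))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=4)
probs = policy(0)                                  # e.g. [0.925, 0.025, 0.025, 0.025]
action = np.random.choice(np.arange(4), p=probs)   # sample an action from those probabilities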

Now, let us implement the Sarsa algorithm:

import sys
import itertools
import numpy as np
from collections import defaultdict
# plotting provides the EpisodeStats helper; it is assumed to come from
# dennybritz's reinforcement-learning repository (lib/plotting.py)
from lib import plotting

def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """
    SARSA algorithm: On-policy TD control. Finds the optimal epsilon-greedy policy.

    Args:
        env: OpenAI environment.
        num_episodes: Number of episodes to run for.
        discount_factor: Gamma discount factor.
        alpha: TD learning rate.
        epsilon: Chance to sample a random action. Float between 0 and 1.

    Returns:
        A tuple (Q, stats).
        Q is the optimal action-value function, a dictionary mapping state -> action values.
        stats is an EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.
    """
    # The final action-value function.
    # A nested dictionary that maps state -> (action -> action-value).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))
    # The policy we're following
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
    for i_episode in range(num_episodes):
        # Print out which episode we're on, useful for debugging.
        if (i_episode + 1) % 100 == 0:
            print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
            sys.stdout.flush()
        # Pick the first state and action of the episode
        state = env.reset()
        action_probs = policy(state)
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        for t in itertools.count():
            next_state, reward, is_done, _ = env.step(action)
            next_action_probs = policy(next_state)
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t
            next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)
            # Sarsa update: bootstrap from the action actually selected in the next state
            Q[state][action] += alpha * (reward + discount_factor * Q[next_state][next_action] - Q[state][action])
            if is_done:
                break
            state = next_state
            action = next_action
    return Q, stats
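A short usage sketch (the episode count of 200 is an arbitrary choice for illustration, not from the original text):

env = WindyGridworldEnv()
Q, stats = sarsa(env, num_episodes=200)
# stats.episode_lengths[i] and stats.episode_rewards[i] hold the length and total reward
# of episode i; curves like those shown below are presumably plotted from these arrays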

For ease of understanding, we show the pseudo-code here again:

sarsa

The results (with $\varepsilon=0.1,\ \alpha=0.5$) are as follows:

sarsa_result

The increasing slope (bottom figure) of the graph shows that the goal is reached more and more quickly over time. Note that Monte Carlo methods cannot easily be used on this task because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. Step-by-step learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such
policies are poor, and switch to something else.

Q-learning

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning, defined by
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]
$$
The algorithm is shown in procedural form in the box below:

q_learning

And below is the backup diagram:

q_bg
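The only difference from Sarsa is the target: Sarsa bootstraps from the action actually taken next, while Q-learning bootstraps from the greedy action, regardless of which action the behavior policy then takes. In code (a sketch reusing the variable names from the Sarsa implementation above):

# Sarsa (on-policy) target: uses the next action actually selected by the epsilon-greedy policy
sarsa_target = reward + discount_factor * Q[next_state][next_action]

# Q-learning (off-policy) target: uses the greedy action in the next state
q_learning_target = reward + discount_factor * np.max(Q[next_state])

Q[state][action] += alpha * (q_learning_target - Q[state][action])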

Example: Cliff Walking

This grid world example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the grid world shown in the figure below:

cliff_world

As before, we define the environment first. The new environment differs only slightly from the windy grid world, so we omit its code here.

Let us test the environment first:

env = CliffWalkingEnv()
print(env.reset())
env.render()
print(env.step(0))
env.render()
print(env.step(1))
env.render()
print(env.step(1))
env.render()
print(env.step(2))
env.render()

cliff_walk_show

Not bad.

Then, let us develop the Q-learning algorithm (the episode-generation policy is unchanged):

def q_learning(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """
    Q-Learning algorithm: Off-policy TD control. Finds the optimal greedy policy
    while following an epsilon-greedy policy.

    Args:
        env: OpenAI environment.
        num_episodes: Number of episodes to run for.
        discount_factor: Gamma discount factor.
        alpha: TD learning rate.
        epsilon: Chance to sample a random action. Float between 0 and 1.

    Returns:
        A tuple (Q, stats).
        Q is the optimal action-value function, a dictionary mapping state -> action values.
        stats is an EpisodeStats object with two numpy arrays for episode_lengths and episode_rewards.
    """
    # The final action-value function.
    # A nested dictionary that maps state -> (action -> action-value).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))
    # The policy we're following
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
    for i_episode in range(num_episodes):
        # Print out which episode we're on, useful for debugging.
        if (i_episode + 1) % 100 == 0:
            print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
            sys.stdout.flush()
        # Reset the environment and pick the first action
        state = env.reset()
        # One step in the environment
        for t in itertools.count():
            # Take a step
            action_probs = policy(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)
            # Update statistics
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t
            # TD Update
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta
            if done:
                break
            state = next_state
    return Q, stats

Results ($\varepsilon=0.1$) are as follows:

q_learning_result

For convenient comparison, we show the result of Sarsa here again:

sarsa_result

We can see that, on average, after an initial transient, Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. Although Q-learning actually learns the values of the optimal policy, its online performance is worse than that of Sarsa, which learns the roundabout policy. Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.

Expected Sarsa

Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy. That is, consider the algorithm with the update rule
$$
\begin{align}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \, \mathbb{E}_{\pi}[Q(S_{t+1}, A_{t+1}) \ | \ S_{t+1}] - Q(S_t, A_t) \right ] \\
&= Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t) \right ],
\end{align}
$$
but that otherwise follows the schema of Q-learning. Its backup diagram is shown below:

esarsa_bg
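In code, only the target changes relative to the gym-based Q-learning implementation above; a sketch reusing its variable names, with the expectation taken under the ε-greedy policy returned by make_epsilon_greedy_policy:

# Expected Sarsa target: expectation of Q over the epsilon-greedy action probabilities
next_action_probs = policy(next_state)                   # pi(a | S_{t+1})
expected_q = np.sum(next_action_probs * Q[next_state])   # E_pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}]
td_target = reward + discount_factor * expected_q
Q[state][action] += alpha * (td_target - Q[state][action])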

To compare Expected Sarsa with Sarsa and Q-learning on the cliff-walking task, we develop another implementation (this time without the OpenAI gym toolkit).

First, we define some facts about the environment:

# world height
WORLD_HEIGHT = 4
# world width
WORLD_WIDTH = 12
# probability for exploration
EPSILON = 0.1
# step size
ALPHA = 0.5
# gamma for Q-Learning and Expected Sarsa
GAMMA = 1
# all possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
actions = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]
# initial state action pair values
stateActionValues = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
startState = [3, 0]
goalState = [3, 11]
# reward for each action in each state
actionRewards = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, 4))
actionRewards[:, :, :] = -1.0
actionRewards[2, 1:11, ACTION_DOWN] = -100.0
actionRewards[3, 0, ACTION_RIGHT] = -100.0

And then we define the state transitions:

# set up destinations for each action in each state
actionDestination = []
for i in range(0, WORLD_HEIGHT):
    actionDestination.append([])
    for j in range(0, WORLD_WIDTH):
        destination = dict()
        destination[ACTION_UP] = [max(i - 1, 0), j]
        destination[ACTION_LEFT] = [i, max(j - 1, 0)]
        destination[ACTION_RIGHT] = [i, min(j + 1, WORLD_WIDTH - 1)]
        if i == 2 and 1 <= j <= 10:
            destination[ACTION_DOWN] = startState
        else:
            destination[ACTION_DOWN] = [min(i + 1, WORLD_HEIGHT - 1), j]
        actionDestination[-1].append(destination)
actionDestination[3][0][ACTION_RIGHT] = startState

We also need a policy to generate the next action according to the current state:

# choose an action based on epsilon greedy algorithm
def chooseAction(state, stateActionValues):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(actions)
    else:
        return np.argmax(stateActionValues[state[0], state[1], :])

The stateActionValues array is just our $Q$.

Then, let us develop the Sarsa (and Expected Sarsa) algorithm:

# an episode with Sarsa
# @stateActionValues: values for state action pair, will be updated
# @expected: if True, will use expected Sarsa algorithm
# @stepSize: step size for updating
# @return: total rewards within this episode
def sarsa(stateActionValues, expected=False, stepSize=ALPHA):
    currentState = startState
    currentAction = chooseAction(currentState, stateActionValues)
    rewards = 0.0
    while currentState != goalState:
        newState = actionDestination[currentState[0]][currentState[1]][currentAction]
        newAction = chooseAction(newState, stateActionValues)
        reward = actionRewards[currentState[0], currentState[1], currentAction]
        rewards += reward
        if not expected:
            valueTarget = stateActionValues[newState[0], newState[1], newAction]
        else:
            # calculate the expected value of new state
            valueTarget = 0.0
            actions_list = stateActionValues[newState[0], newState[1], :]
            bestActions = np.argwhere(actions_list == np.amax(actions_list)).flatten().tolist()
            for action in actions:
                if action in bestActions:
                    valueTarget += ((1.0 - EPSILON) / len(bestActions) + EPSILON / len(actions)) * stateActionValues[newState[0], newState[1], action]
                else:
                    valueTarget += EPSILON / len(actions) * stateActionValues[newState[0], newState[1], action]
        valueTarget *= GAMMA
        # Sarsa update
        stateActionValues[currentState[0], currentState[1], currentAction] += stepSize * (reward +
            valueTarget - stateActionValues[currentState[0], currentState[1], currentAction])
        currentState = newState
        currentAction = newAction
    return rewards

Because we developed the Sarsa algorithm earlier, we just concentrate on the Expected Sarsa part here:

# calculate the expected value of new state
valueTarget = 0.0
actions_list = stateActionValues[newState[0], newState[1], :]
bestActions = np.argwhere(actions_list == np.amax(actions_list)).flatten().tolist()
for action in actions:
    if action in bestActions:
        valueTarget += ((1.0 - EPSILON) / len(bestActions) + EPSILON / len(actions)) * stateActionValues[newState[0], newState[1], action]
    else:
        valueTarget += EPSILON / len(actions) * stateActionValues[newState[0], newState[1], action]

By the way, let us develop the Q-learning algorithm again:

# an episode with Q-Learning
# @stateActionValues: values for state action pair, will be updated
# @stepSize: step size for updating
# @return: total rewards within this episode
def qLearning(stateActionValues, stepSize=ALPHA):
    currentState = startState
    rewards = 0.0
    while currentState != goalState:
        currentAction = chooseAction(currentState, stateActionValues)
        reward = actionRewards[currentState[0], currentState[1], currentAction]
        rewards += reward
        newState = actionDestination[currentState[0]][currentState[1]][currentAction]
        # Q-Learning update
        stateActionValues[currentState[0], currentState[1], currentAction] += stepSize * (
            reward + GAMMA * np.max(stateActionValues[newState[0], newState[1], :]) -
            stateActionValues[currentState[0], currentState[1], currentAction])
        currentState = newState
    return rewards

Now let us look at the greedy policy learned in each state by both algorithms (something we did not show earlier):

# print optimal policy
def printOptimalPolicy(stateActionValues):
    optimalPolicy = []
    for i in range(0, WORLD_HEIGHT):
        optimalPolicy.append([])
        for j in range(0, WORLD_WIDTH):
            if [i, j] == goalState:
                optimalPolicy[-1].append('G')
                continue
            bestAction = np.argmax(stateActionValues[i, j, :])
            if bestAction == ACTION_UP:
                optimalPolicy[-1].append('U')
            elif bestAction == ACTION_DOWN:
                optimalPolicy[-1].append('D')
            elif bestAction == ACTION_LEFT:
                optimalPolicy[-1].append('L')
            elif bestAction == ACTION_RIGHT:
                optimalPolicy[-1].append('R')
    for row in optimalPolicy:
        print(row)

# averaging the reward sums from 10 successive episodes
averageRange = 10
# episodes of each run
nEpisodes = 500
# perform 20 independent runs
runs = 20
rewardsSarsa = np.zeros(nEpisodes)
rewardsQLearning = np.zeros(nEpisodes)
for run in range(0, runs):
    stateActionValuesSarsa = np.copy(stateActionValues)
    stateActionValuesQLearning = np.copy(stateActionValues)
    for i in range(0, nEpisodes):
        # cut off the value by -100 to draw the figure more elegantly
        rewardsSarsa[i] += max(sarsa(stateActionValuesSarsa), -100)
        rewardsQLearning[i] += max(qLearning(stateActionValuesQLearning), -100)
# averaging over independent runs
rewardsSarsa /= runs
rewardsQLearning /= runs
# averaging over successive episodes
smoothedRewardsSarsa = np.copy(rewardsSarsa)
smoothedRewardsQLearning = np.copy(rewardsQLearning)
for i in range(averageRange, nEpisodes):
    smoothedRewardsSarsa[i] = np.mean(rewardsSarsa[i - averageRange: i + 1])
    smoothedRewardsQLearning[i] = np.mean(rewardsQLearning[i - averageRange: i + 1])
# display optimal policy
print('Sarsa Optimal Policy:')
printOptimalPolicy(stateActionValuesSarsa)
print('Q-Learning Optimal Policy:')
printOptimalPolicy(stateActionValuesQLearning)

The results are as follows (we omit the plots of the reward per episode):

cliff_walk_optimal_policy

Now let us compare the three algorithms:

stepSizes = np.arange(0.1, 1.1, 0.1)
nEpisodes = 1000
runs = 10
ASY_SARSA = 0
ASY_EXPECTED_SARSA = 1
ASY_QLEARNING = 2
INT_SARSA = 3
INT_EXPECTED_SARSA = 4
INT_QLEARNING = 5
methods = range(0, 6)
performance = np.zeros((6, len(stepSizes)))
for run in range(0, runs):
    for ind, stepSize in zip(range(0, len(stepSizes)), stepSizes):
        stateActionValuesSarsa = np.copy(stateActionValues)
        stateActionValuesExpectedSarsa = np.copy(stateActionValues)
        stateActionValuesQLearning = np.copy(stateActionValues)
        for ep in range(0, nEpisodes):
            print('run:', run, 'step size:', stepSize, 'episode:', ep)
            sarsaReward = sarsa(stateActionValuesSarsa, expected=False, stepSize=stepSize)
            expectedSarsaReward = sarsa(stateActionValuesExpectedSarsa, expected=True, stepSize=stepSize)
            qLearningReward = qLearning(stateActionValuesQLearning, stepSize=stepSize)
            performance[ASY_SARSA, ind] += sarsaReward
            performance[ASY_EXPECTED_SARSA, ind] += expectedSarsaReward
            performance[ASY_QLEARNING, ind] += qLearningReward
            if ep < 100:
                performance[INT_SARSA, ind] += sarsaReward
                performance[INT_EXPECTED_SARSA, ind] += expectedSarsaReward
                performance[INT_QLEARNING, ind] += qLearningReward
performance[:3, :] /= nEpisodes * runs
performance[3:, :] /= runs * 100
labels = ['Asymptotic Sarsa', 'Asymptotic Expected Sarsa', 'Asymptotic Q-Learning',
          'Interim Sarsa', 'Interim Expected Sarsa', 'Interim Q-Learning']
plt.figure(2)
for method, label in zip(methods, labels):
    plt.plot(stepSizes, performance[method, :], label=label)
plt.xlabel('alpha')
plt.ylabel('reward per episode')
plt.legend()

The results are as follows:

compare3algo_cliff_walk

As an on-policy method, Expected Sarsa retains the significant advantage of Sarsa over Q-learning on this problem. In addition, Expected Sarsa shows a significant improvement over Sarsa over a wide range of values for the step-size parameter α. In cliff walking the state transitions are all deterministic and all randomness comes from the policy. In such cases, Expected Sarsa can safely set α = 1 without suffering any degradation of asymptotic performance, whereas Sarsa can only perform well in the long run at a small value of α, at which short-term performance is poor. In this and other examples there is a consistent empirical advantage of Expected Sarsa over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of the other more-well-known TD control algorithms.

Double Q-learning

All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. For example, in Q-learning the target policy is the greedy policy given the current action values, which is defined with a max, and in Sarsa the policy is often ε-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. To see why, consider a single state $s$ where there are many actions $a$ whose true values, $q(s, a)$, are all zero but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive: a positive bias. We call this maximization bias.
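A quick numerical illustration of this bias (a sketch, not from the original text): if the true values of ten actions are all zero but our estimates are corrupted by noise, the maximum of the estimates is positive on average:

import numpy as np

# 10,000 states, each with 10 actions whose true values are all 0;
# the estimates are corrupted by standard normal noise
estimates = np.random.normal(loc=0.0, scale=1.0, size=(10000, 10))

# The maximum of the true values is 0, but the average maximum of the estimates
# is clearly positive (roughly 1.5 for the max of 10 standard normal samples)
print(np.mean(np.max(estimates, axis=1)))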

Example: Maximization Bias

We have a small MDP:

mb

The MDP has two nonterminal states, A and B. Episodes always start in A, which has two actions, left and right: right transitions immediately to the terminal state with a reward of zero, while left transitions to B, also with a reward of zero. From B there are many possible actions, all of which cause immediate termination with a reward drawn from a normal distribution with mean −0.1 and variance 1.0. Thus, the expected return for any trajectory starting with left is −0.1, and taking left in state A is always a mistake. Nevertheless, our control methods may favor left because of maximization bias making B appear to have a positive value. The results (shown later) confirm that Q-learning with ε-greedy action selection initially learns to strongly favor the left action on this example. Even at asymptote, Q-learning takes the left action about 5% more often than is optimal at our parameter settings (ε = 0.1, α = 0.1, and γ = 1).

We can use the Double Q-learning algorithm to avoid this problem. One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays into two sets and used them to learn two independent estimates, $Q_1$ and $Q_2$, of the true action values.
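Concretely, on each time step one of the two estimates is updated, chosen at random with probability 0.5; one estimate picks the maximizing action and the other evaluates it. For example, the update of $Q_1$ is
$$
Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q_2\big(S_{t+1}, \text{argmax}_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \right],
$$
and with the remaining probability the symmetric update of $Q_2$ is performed with the roles of $Q_1$ and $Q_2$ swapped. The behavior policy (e.g., ε-greedy) can be based on the sum or average of the two estimates, which is exactly what the code below does.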

dbq

Of course there are also doubled versions of Sarsa and Expected Sarsa.

Now let us develop both algorithms and compare their performance on the example above. First, we define the problem environment:

# state A
STATE_A = 0
# state B
STATE_B = 1
# use one terminal state
STATE_TERMINAL = 2
# starts from state A
STATE_START = STATE_A
# possible actions in A
ACTION_A_RIGHT = 0
ACTION_A_LEFT = 1
# possible actions in B, maybe 10 actions
actionsOfB = range(0, 10)
# all possible actions
stateActions = [[ACTION_A_RIGHT, ACTION_A_LEFT], actionsOfB]
# state action pair values, if a state is a terminal state, then the value is always 0
stateActionValues = [np.zeros(2), np.zeros(len(actionsOfB)), np.zeros(1)]
# set up destination for each state and each action
actionDestination = [[STATE_TERMINAL, STATE_B], [STATE_TERMINAL] * len(actionsOfB)]
# probability for exploration
EPSILON = 0.1
# step size
ALPHA = 0.1
# discount for max value
GAMMA = 1.0

And we need a policy to take an action:

# argmax with random tie-breaking (this helper is not defined elsewhere in this post;
# breaking ties randomly matters here because all estimates start at zero)
def argmax(values):
    return np.random.choice(np.flatnonzero(values == np.max(values)))

# choose an action based on epsilon greedy algorithm
def chooseAction(state, stateActionValues):
    if np.random.binomial(1, EPSILON) == 1:
        return np.random.choice(stateActions[state])
    else:
        return argmax(stateActionValues[state])

After taking an action, we get the reward:

# take @action in @state, return the reward
def takeAction(state, action):
    if state == STATE_A:
        return 0
    return np.random.normal(-0.1, 1)

Next, we develop the Double Q-learning algorithm:

# if there are two state action pair value arrays, use double Q-Learning
# otherwise use normal Q-Learning
def qLearning(stateActionValues, stateActionValues2=None):
    currentState = STATE_START
    # track the # of action left in state A
    leftCount = 0
    while currentState != STATE_TERMINAL:
        if stateActionValues2 is None:
            currentAction = chooseAction(currentState, stateActionValues)
        else:
            # derive an action from Q1 + Q2
            currentAction = chooseAction(currentState, [item1 + item2 for item1, item2 in zip(stateActionValues, stateActionValues2)])
        if currentState == STATE_A and currentAction == ACTION_A_LEFT:
            leftCount += 1
        reward = takeAction(currentState, currentAction)
        newState = actionDestination[currentState][currentAction]
        if stateActionValues2 is None:
            currentStateActionValues = stateActionValues
            targetValue = np.max(currentStateActionValues[newState])
        else:
            if np.random.binomial(1, 0.5) == 1:
                currentStateActionValues = stateActionValues
                anotherStateActionValues = stateActionValues2
            else:
                currentStateActionValues = stateActionValues2
                anotherStateActionValues = stateActionValues
            bestAction = argmax(currentStateActionValues[newState])
            targetValue = anotherStateActionValues[newState][bestAction]
        # Q-Learning update
        currentStateActionValues[currentState][currentAction] += ALPHA * (
            reward + GAMMA * targetValue - currentStateActionValues[currentState][currentAction])
        currentState = newState
    return leftCount

And now, let us solve the example problem:

# each independent run has 300 episodes
episodes = 300
leftCountsQ = np.zeros(episodes)
leftCountsDoubleQ = np.zeros(episodes)
runs = 1000
for run in range(0, runs):
    print('run:', run)
    stateActionValuesQ = [np.copy(item) for item in stateActionValues]
    stateActionValuesDoubleQ1 = [np.copy(item) for item in stateActionValues]
    stateActionValuesDoubleQ2 = [np.copy(item) for item in stateActionValues]
    leftCountsQ_ = [0]
    leftCountsDoubleQ_ = [0]
    for ep in range(0, episodes):
        leftCountsQ_.append(leftCountsQ_[-1] + qLearning(stateActionValuesQ))
        leftCountsDoubleQ_.append(leftCountsDoubleQ_[-1] + qLearning(stateActionValuesDoubleQ1, stateActionValuesDoubleQ2))
    del leftCountsQ_[0]
    del leftCountsDoubleQ_[0]
    leftCountsQ += np.asarray(leftCountsQ_, dtype='float') / np.arange(1, episodes + 1)
    leftCountsDoubleQ += np.asarray(leftCountsDoubleQ_, dtype='float') / np.arange(1, episodes + 1)
leftCountsQ /= runs
leftCountsDoubleQ /= runs
plt.figure()
plt.plot(leftCountsQ, label='Q-Learning')
plt.plot(leftCountsDoubleQ, label='Double Q-Learning')
plt.plot(np.ones(episodes) * 0.05, label='Optimal')
plt.xlabel('episodes')
plt.ylabel('% left actions from A')
plt.legend()

Ok, results are as follows:

dbq_result