
n-step TD

In this post, we unify Monte Carlo methods and one-step TD methods into a single family of n-step TD methods. We first consider the prediction problem.

n-step TD Prediction

Monte Carlo methods perform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode, while one-step TD methods are based on just the one next reward. n-step TD methods perform a backup based on an intermediate number of rewards: more than one, but fewer than all of them until termination.

[Figure: backup diagrams of n-step methods, ranging from one-step TD to Monte Carlo]

More formally, consider the backup applied to state $S_t$ as a result of the state-reward sequence $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \cdots, R_T, S_T$ (omitting the actions for simplicity). In Monte Carlo backups, the estimate of $v_{\pi}(S_t)$ is updated in the direction of the complete return:
$$
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,
$$
where $T$ is the last time step of the episode. Let us call this quantity the target of the backup. Whereas in Monte Carlo backups the target is the return, in one-step backups the target is the first reward plus the discounted estimated value of the next state, which we call the one-step return:
$$
G_{t:t+1} \doteq R_{t+1} + \gamma V_t (S_{t+1}),
$$
where $V_t : \mathcal{S} \rightarrow \mathbb{R}$ here is the estimate at time $t$ of $v_{\pi}$. The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$ using rewards up until time $t+1$. In the one-step return, $\gamma V_t (S_{t+1})$ takes the place of the remaining terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return. The point is that this idea makes just as much sense after two steps as it does after one. The target for a two-step backup is the two-step return:
$$
G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1} (S_{t+2}),
$$
where now $\gamma^2 V_{t+1}(S_{t+2})$ corrects for the absence of the terms $\gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$. Similarly, the target for an arbitrary n-step backup is the n-step return:
$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1} (S_{t+n}),
$$
for all $n, t$ such that $n \geq 1$ and $0 \leq t \leq T-n$. If $t+n \geq T$, then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return ($G_{t:t+n} \doteq G_t$).
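
To make the definition concrete, here is a small helper that computes $G_{t:t+n}$ from the rewards and states stored during an episode and the current value estimates. It is a sketch, not part of the original post's code; `rewards` and `states` are assumed to be indexed by time step, and `V` by state:

# compute the n-step return G_{t:t+n} (a sketch, not part of the original code)
# rewards[k] holds R_k, states[k] holds S_k, V maps states to value estimates
def n_step_return(rewards, states, V, t, n, T, gamma):
    horizon = min(t + n, T)
    # discounted rewards R_{t+1}, ..., R_{min(t+n, T)}
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, horizon + 1))
    # bootstrap with the estimated value of S_{t+n} unless the episode has ended by then
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G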

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$; the first time these are available is $t+n$. The natural state-value learning algorithm for using n-step returns is therefore
$$
V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \;\;\;\;\;\; 0 \leq t < T,
$$
while the values of all other states remain unchanged. Note that no changes at all are made during the first $n-1$ steps of each episode. Complete pseudocode is given in the box below.

[Pseudocode box: n-step TD for estimating $v_{\pi}$]
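
The pseudocode box itself did not survive the formatting, so here is a rough sketch of the same procedure for a generic episodic environment. The `env.reset()` / `env.step(action)` interface and the `policy` function are assumptions for illustration, not part of the original post; the sketch reuses the `n_step_return` helper from above:

def n_step_td_prediction(env, policy, V, n, alpha, gamma, num_episodes):
    # n-step TD for estimating v_pi (sketch under an assumed env interface)
    for _ in range(num_episodes):
        state = env.reset()                     # S_0
        states, rewards = [state], [0]          # store S_t and R_t (R_0 is unused)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                # act according to the policy, observe R_{t+1} and S_{t+1}
                next_state, reward, done = env.step(policy(state))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                state = next_state
            tau = t - n + 1                     # time whose state's estimate is updated
            if tau >= 0:
                G = n_step_return(rewards, states, V, tau, n, T, gamma)
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V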

An important property of n-step returns is that their expectation is, in a worst-case sense, a better estimate of $v_{\pi}$ than $V_{t+n-1}$ is: the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$$
\max_s \left| \mathbb{E}[G_{t:t+n} \mid S_t = s] - v_{\pi}(s) \right| \leq \gamma^n \max_s \left| V_{t+n-1}(s) - v_{\pi}(s) \right|,
$$
for all $n \geq 1$. This is called the error reduction property of n-step returns. The n-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
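
A quick way to see why this holds (treating $V_{t+n-1}$ as fixed): $v_{\pi}$ itself satisfies the n-step Bellman equation, $v_{\pi}(s) = \mathbb{E}[R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n v_{\pi}(S_{t+n}) \mid S_t = s]$, so subtracting it from the expected n-step return cancels all the reward terms and leaves only the bootstrapping term:
$$
\left| \mathbb{E}[G_{t:t+n} \mid S_t = s] - v_{\pi}(s) \right|
= \gamma^n \left| \mathbb{E}\left[ V_{t+n-1}(S_{t+n}) - v_{\pi}(S_{t+n}) \mid S_t = s \right] \right|
\leq \gamma^n \max_{s'} \left| V_{t+n-1}(s') - v_{\pi}(s') \right|.
$$
Taking the maximum over $s$ on the left-hand side gives the error reduction property.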

Example: n-step TD Methods on the Random Walk

Now we consider a larger random walk MDP with 19 non-terminal states: from each state we move left or right with equal probability, an action leading to the left terminal state gives reward $-1$, an action leading to the right terminal state gives reward $+1$, and all other rewards are zero. First of all we need to define the new environment:

import numpy as np
import matplotlib.pyplot as plt

# number of non-terminal states
N_STATES = 19
# discount
GAMMA = 1
# initial state values
stateValues = np.zeros(N_STATES + 2)
# all states but terminal states
states = np.arange(1, N_STATES + 1)
# start from the middle state
START_STATE = 10
# two terminal states
# an action leading to the left terminal state has reward -1
# an action leading to the right terminal state has reward 1
END_STATES = [0, N_STATES + 1]
# true state values from the Bellman equation
realStateValues = np.arange(-20, 22, 2) / 20.0
realStateValues[0] = realStateValues[-1] = 0
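
The "Bellman equation" comment can be checked directly, for example by solving the linear system $v(i) = \tfrac{1}{2}\left(r_{\text{left}} + v(i-1)\right) + \tfrac{1}{2}\left(r_{\text{right}} + v(i+1)\right)$ for the 19 interior states. This is a sanity check added here, not part of the original post:

# sanity check (not in the original post): solve the Bellman equation for the interior states
A = np.eye(N_STATES)
b = np.zeros(N_STATES)
for i in range(N_STATES):           # i = 0..18 indexes the interior states 1..19
    if i > 0:
        A[i, i - 1] -= 0.5          # move left to another interior state
    else:
        b[i] += 0.5 * (-1)          # move left into the terminal state with reward -1
    if i < N_STATES - 1:
        A[i, i + 1] -= 0.5          # move right to another interior state
    else:
        b[i] += 0.5 * 1             # move right into the terminal state with reward +1
print(np.allclose(np.linalg.solve(A, b), realStateValues[1:-1]))   # prints True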

And then we develop the n-step TD algorithm:

# n-step TD method
# @stateValues: values for each state, will be updated
# @n: number of steps
# @alpha: step size
def temporalDifference(stateValues, n, alpha):
    # initial starting state
    currentState = START_STATE
    # arrays to store states and rewards for an episode
    # space isn't a major consideration, so I didn't use the mod trick
    states = [currentState]
    rewards = [0]
    # track the time
    time = 0
    # the length of this episode
    T = float('inf')
    while True:
        # go to next time step
        time += 1
        if time < T:
            # choose an action randomly
            if np.random.binomial(1, 0.5) == 1:
                newState = currentState + 1
            else:
                newState = currentState - 1
            # reward is -1 on the left termination, +1 on the right, 0 otherwise
            if newState == 0:
                reward = -1
            elif newState == N_STATES + 1:
                reward = 1
            else:
                reward = 0
            # store new state and new reward
            states.append(newState)
            rewards.append(reward)
            if newState in END_STATES:
                T = time
        # get the time of the state to update
        updateTime = time - n
        if updateTime >= 0:
            returns = 0.0
            # accumulate the discounted rewards of the n-step return
            for t in range(updateTime + 1, min(T, updateTime + n) + 1):
                returns += pow(GAMMA, t - updateTime - 1) * rewards[t]
            # add the bootstrapped state value to the return
            if updateTime + n <= T:
                returns += pow(GAMMA, n) * stateValues[states[updateTime + n]]
            stateToUpdate = states[updateTime]
            # update the state value
            if stateToUpdate not in END_STATES:
                stateValues[stateToUpdate] += alpha * (returns - stateValues[stateToUpdate])
        if updateTime == T - 1:
            break
        currentState = newState
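
As a quick sanity check of the implementation (again, not part of the original post), running many episodes with a moderate $n$ and step size should drive the estimates close to `realStateValues`; with all values initialized to zero the RMS error starts at about 0.55:

# rough sanity check: the value estimates should approach the true values
np.random.seed(0)
testValues = np.copy(stateValues)
for episode in range(1000):
    temporalDifference(testValues, 4, 0.1)
rmsError = np.sqrt(np.sum(np.power(testValues - realStateValues, 2)) / N_STATES)
print('RMS error after 1000 episodes:', rmsError)   # should be far below the initial 0.55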

Now, let us test the performance under different $n$ values and $\alpha$ values:

# truncate the error for better display
truncateValue = 0.55
# all possible steps: n = 1, 2, 4, ..., 512
steps = np.power(2, np.arange(0, 10))
# all possible step sizes
alphas = np.arange(0, 1.1, 0.1)
# each run has 10 episodes
episodes = 10
# perform 100 independent runs
runs = 100
# track the errors for each (step, alpha) combination
errors = np.zeros((len(steps), len(alphas)))
for run in range(0, runs):
    for stepInd, step in zip(range(len(steps)), steps):
        for alphaInd, alpha in zip(range(len(alphas)), alphas):
            print('run:', run, 'step:', step, 'alpha:', alpha)
            currentStateValues = np.copy(stateValues)
            for ep in range(0, episodes):
                temporalDifference(currentStateValues, step, alpha)
                # accumulate the RMS error after each episode
                errors[stepInd, alphaInd] += np.sqrt(np.sum(np.power(currentStateValues - realStateValues, 2)) / N_STATES)
# average over episodes and runs
errors /= episodes * runs
# truncate the error
errors[errors > truncateValue] = truncateValue
plt.figure()
for i in range(0, len(steps)):
    plt.plot(alphas, errors[i, :], label='n = ' + str(steps[i]))
plt.xlabel('alpha')
plt.ylabel('RMS error')
plt.legend()
plt.show()

Results are as follows:

[Figure: RMS error of n-step TD on the 19-state random walk as a function of $\alpha$ for various $n$, averaged over the first 10 episodes and 100 runs]

TODO: N-STEP SARSA

TODO: N-STEP OFF-POLICY ALGORITHM