In this post, we unify the Monte Carlo methods and the one-step TD methods. We first consider the prediction problem.

### n-step TD Prediction

Monte Carlo methods preform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode. One-step TD methods is based on just on next reward. So n-step TD methods perform a backup based on an intermediate number of rewards: more than one, but less than all of them until termination.

More formally, consider the backup applied to state $S_t$ as a result of the state-reward sequence, $S_t, R_{t+1},S_{t+1}, R_{t+2}, \cdots, R_T, S_T$ (omitting the actions for simplicity). We know that in Monte Carlo backups the estimate of $v_{\pi}(S_t)$ updated in the direction of the complete return:
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,$$
where $T$ is the last time step of the episode. Let us call this quantity the target of the backup. Whereas in Monte Carlo backups the target is the return, in one-step backups the target is the first reward plus the discounted estimated value of the next state, which we call the one-step return:
$$G_{t:t+1} \doteq R_{t+1} + \gamma V_t (S_{t+1}),$$
where $V_t : \mathcal{S} \rightarrow \mathbb{R}$ here is an estimate at time $t$ of $v_{\pi}$. The subscripts on $G_{t:t+1}$ indicate that it is truncated return for time t using rewards up until time $t+1$. In the one-step return, $\gamma V_t (S_{t+1})$ takes the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return. Our point now is that this idea makes just as much sense after two steps as it does after one. The target for a two-step backup is the two-step return:
$$G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1} (S_{t+2}),$$
where now $\gamma^2 V_{t+1}(S_{t+2})$ corrects for the absence of the terms $\gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$. Similarly, the target for an arbitrary n-step backup is the n-step return:
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1} (S_{t+n}),$$
for all $n,t$ such that $n \ge 1$ and $0 \leq t \leq T-n$. If $t+n \ge T$, then all the missing terms are taken as zero, and the n-step return defined to be equal to the ordinary full return.

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$. The natural algorithm state-value learning algorithm for using n-step returns is thus
$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)], \;\;\;\;\;\; 0 \leq t \leq T$$
while the values of all other states remain unchanged. Note that no changes at all are made during the first $n-1$ steps of each episode. Complete pseudocode is given in the box below.

The worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$$\max_s \left |\mathbb{E}[G_{t:t+1}|S_t=s] - v_{\pi}(s) \right | \leq \gamma^n \max_s |V_{t+n-1}(s) - v_{\pi}(s)|,$$
for all $n \geq 1$. This is called the error reduction property of n-step returns. The n-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.

#### Example: n-step TD Methods on the Random Walk

Now we have a larger MDP (19 non-terminal states). First of all we need to define the new environment:

And then develop the n-step TD algorithm:

Now, let us test the performance under different $n$ values and $\alpha$ values:

Results are as follows: