If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome.

### TD(0)

Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for $V(S_t)$. A simple every-visit Monte Carlo method suitable for nonstationary environment is
$$V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)],$$
where $G_t$ is the actual return following time $t$. Let us call this method $constant\text{-}\alpha \ MC$. Notice that, if we are in a stationary environment (like earlier. For some reason, don’t use incremental implementation), the $\alpha$ is equals to $\frac{1}{N(S_t)}$. whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$
immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. In effect, the target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called $TD(0)$, or one-step TD. The box below specifies TD(0) completely in procedural form.

TD(0)’s backup diagram is as follows:

Because the TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. We know that
\begin{align} v_{\pi}(s) &\doteq \mathbb{E}_{\pi} [G_t \ | \ S_t=s] \\ &= \mathbb{E}_{\pi} [R_{t+1} + \gamma G_{t+1} \ | \ S_t=s] \\ &= \mathbb{E}_{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \ | \ S_t=s]. \end{align}
Roughly speaking, Monte Carlo methods use an estimate of (3) as a target, whereas DP methods use an estimate of (5) as a target, The Monte Carlo target is an estimate because the expected value in (3) is not known; a sample return is used in place of the real expected return. The DP target is an estimate not because of the excepted value, which are assumed to be completely provided by a model of the environment (the environment is known for the DP methods), but because $v_{\pi}(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is used instead. The TD target is an estimate for both reasons.

Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity, called the TD error, arises in various forms throughout reinforcement learning:
$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$
Notice that the TD error at each time is the error in the estimate made at that time. Because the TD error depends on the next state and the next reward, it is not actually available until one time step later. Also note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:
\begin{align} G_t - V(S_t) &= R_{t+1} + \gamma G(S_{t+1}) - V(S_t) + \gamma V(S_{t+1} ) - \gamma V(S_{t+1}) \\ &= \delta_t + \gamma (G_{t+1} - V(S_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+1} - V(S_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+1} - V(S_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(G_t-V(S_T)) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 -0) \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k. \end{align}
This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately. Generalizations of this identity play an important role in the theory and algorithms of temporal-difference learning.

#### Example: Random walk

In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC applied to the small Markov reward process shown in the upper part of the figure below. All episodes start in the center state, C, and the proceed either left or right by one state on each step, with equal probability. This behavior can be thought of as due to the combined effect of a fixed policy and an environment’s state-transition probabilities, but we do not care which; we are concerned only with predicting returns however they are generated. Episodes terminates on the right, a reward of +1 occurs; all other reward are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state. Thus, the true value of the center state is $v_{\pi}(\text{C}) = 0.5$. The true values of all the states, A through E, are $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$, and $\frac{5}{6}$. In all cases the approximate value function was initialized to the intermediate value $V(s)=0.5$, for all $s$.

Now, let us develop the codes to solve problem.

The first, we initialize some truth.

The below box is the TD(0) algorithm:

And below box is the constant-$\alpha$ Monte Carlo algorithm:

First of all, let us test the performance of the TD(0) algorithm:

Results are as follows:

And then let us show the RMS error of the TD(0) algorithm and constant-$\alpha$ Monte Carlo algorithm, for various $\alpha$ values:

Results are as follows:

We can see, the TD method was consistently better than the MC method on this task.

Now, suppose that there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning method is to present the experience repeatedly until the method converges upon an answer. We call this batch updating.

#### Example: Random walk under batch updating

After each new episodes, all episodes seen so far were treated as a batch. They were repeatedly presented to the algorithm.

Notice that the core codes:

Either TD methods or MC methods, the target is to minimize the TD error (or MC error, I say).

The result is as follows:

Under batch training, constant-$\alpha$ MC converges to value, $V(s)$, that are sample averages of the actual returns experienced after visiting each state $s$. These are optimal estimate in the sense that they minimize the mean-squared error from the actual returns in the training set. In this sense it is surprising that the batch TD method was able to perform better according to the root mean-squared error measure shown in the top figure. How is it that batch TD was able to perform better than this optimal methods? Consider the example in below box:

Example illustrates a general difference between the estimates founds by batch TD(0) and batch Monte Carlo methods. Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. Given this model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate.

### Sarsa

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm. The backup diagram for Sarsa is as shown to the bottom.

The general form of the Sarsa control algorithm is given in the box below.

#### Example: Windy Gridworld

The figure below is a standard grid-world, with start and goal states, but with one diﬀerence: there is a crosswind upward through the middle of the grid. The actions are the standard four—up, down,right, and left—but in the middle region the resultant next states are shifted upward by a “wind,” the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward. For example, if you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. Let us treat this as an undiscounted episodic task, with constant rewards of −1 until the goal state is reached.

To demonstrate the problem clearly, we use the OpenAI gym toolkit to develop the algorithm.

First of all, we need to define a environment (the windy grid world):

The environment is a class that inherit the gym default class discrete.DiscreteEnv (shows that the states are discrete):

First we need to construct our world:

This is natural, uh? Notice that there is a method called _calculate_transition_prob:

and _limit_corrdinates method:

It is worth to mention that the default gym environment class has some useful parameters: nS, nA, P and is_done. nS is the total number of states and nA is the total number of actions (here assume all states only could take the same fixed actions). P is the state transition matrix, the default environment class has a step method (accept a parameter action) that could generates episode automatically according the P and is_done that represents whether a state is terminal state or not.

Finally, we define a output method for pretty show the result:

Then, let us test our model：

The results are as follows:

Each state transition, the step method return a tuple (next_state, reward, is_done, some_extra_info).

Next, we define the episodes generation policy:

def make_epsilon_greedy_policy(Q, epsilon, nA):

Now, let us implement the sarsa algorithm:

For understand easily, we put the pesudo-code here again:

The results (with $\varepsilon=0.1,\ \alpha=0.5$) are as follows:

The increasing slope (bottom figure) of the graph shows that the goal is reached more and more quickly over time. Note that Monte Carlo methods cannot easily be used on this task because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. Step-by-step learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such
policies are poor, and switch to something else.

### Q-learning

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning, defined by
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$
The algorithm is shown in procedural form in the box below:

And below is the backup diagram:

#### Example: Cliff Walking

This grid world example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the grid world shown in the figure below:

The same as earlier, we define the environment first. But the new environment just changes a little, so we just paste the code here.

Let us test the environment first:

Then, let us develop the Q-learning algorithm (the episodes generation policy is not change):

Results ($\varepsilon=0.1$) are as follows:

For compare convenience, we put the result of Sarsa here again:

We can see, for average, After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliﬀ. Unfortunately, this results in its occasionally falling oﬀ the cliﬀ because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the
grid. Although Q-learning actually learns the values of the optimal policy, its online performance is worse than that of Sarsa, which learns the roundabout policy. Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.

### Expected Sarsa

Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy. That is, consider the algorithm with the update rule
\begin{align} Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \mathbb{E}[Q(S_{t+1}, A_{t+1} \ | \ S_{t+1})] - Q(S_t, A_t) \right ] \\ &\leftarrow Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t) \right ], \end{align}
but that otherwise follows the schema of Q-learning. Its backup diagram is shown below:

For compare the results on the cliff-walking task with Excepted Sarsa with Sarsa and Q-learning, we develop another codes (here we are not use the OpenAI gym toolkit).

The first we define some truth of the environment:

And then we define the state transitions:

We also need a policy to generate the next action according to the current state:

The stateActionValues just is the Q.

Then, let us develop the Sarsa (and Excepted Sarsa) algorithm:

Because we develop the Sarsa algorithm earlier, so we just concentrate on the Excepted Sarsa algorithm here:

By the way, let us develop the Q-learning algorithm again:

Now we can see the optimal policy in each state of both algorithm (we are not mentioned earlier):

The results are as follows (emits the results of the changes of reward):

Now let us compare the three algorithms:

The results are as follows:

As an on-policy method, Expected Sarsa retains the signiﬁcant advantage of Sarsa over Q-learning on this problem. In addition, Expected Sarsa shows a signiﬁcant improvement over Sarsa over a wide range of values for the step-size parameter α. In cliﬀ walking the state transitions are all deterministic and all randomness comes from the policy. In such cases, Expected Sarsa can safely set α = 1 without suﬀering any degradation of asymptotic performance, whereas Sarsa can only perform well in the long run at a small value of α, at which short-term performance is poor. In this and other examples there is a consistent empirical advantage of Expected Sarsa over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of the other more-well-known TD control algorithms.

### Double Q-learning

All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. For example, in Q-learning the target policy is the greedy policy given the current action values, which is deﬁned with a max, and in Sarsa the policy is often ε-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a signiﬁcant positive bias. To see why, consider a single state s where there are many actions a whose true values, $q(s, a)$ are all zero but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization
bias.

#### Example: Maximization Bias

We have a small MDP:

the expected return for any trajectory starting with left (from B) is −0.1, and thus taking left in state A is always a mistake. Nevertheless, our control methods may favor left because of maximization bias making B appear to have a positive value. The results (paste later) shows that Q-learning with ε-greedy action selection initially learns to strongly favor the left action on this example. Even at asymptote, Q-learning takes the left action about 5% more often than is optimal at our parameter settings (ε = 0.1, α = 0.1, and γ = 1).

We could use the Double Q-learning algorithm to avoid this problem. One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two independent estimates.

Of course there are also doubled versions of Sarsa and Expected Sarsa.

Now let us develop the both algorithms and compare their performance on the earlier example. First we define the problem environment:

And we need a policy to take an action:

After take an action, we get the reward:

Next, we develop the Double Q-learning algorithm:

And now, let us solve the example problem:

Ok, results are as follows: