The novelty in this post is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$. We will write $\hat{v}(s, \mathbf{w}) \approx v_{\pi}(s)$ for the approximated value of state $s$ given weight vector $\mathbf{w}$. For example, $\hat{v}$ might be a linear function in features of the state, with $\mathbf{w}$ the vector of feature weights. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states.

### The Prediction Objective (MSVE)

In the tabular case a continuous measure of prediction quality was not necessary because the learned value function could come to equal the true value function exactly. But with genuine approximation, an update at one state affects many others, and it is not possible to get all states exactly correct. By assumption we have far more states than weights, so making one state's estimate more accurate invariably means making others' less accurate.

By the error in a state $s$ we mean the square of the difference between the approximate value $\hat{v}(s,\mathbf{w})$ and the true value $v_{\pi}(s)$. Weighting this over the state space by the distribution $\mu$, we obtain a natural objective function, the Mean Squared Value Error, or MSVE:
$$\text{MSVE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \Big[v_{\pi}(s) - \hat{v}(s, \mathbf{w}) \Big]^2.$$
The square root of this measure, the root MSVE or RMSVE, gives a rough measure of how much the approximate values differ from the true values and is often used in plots. Typically one chooses $\mu(s)$ to be the fraction of time spent in $s$ under the target policy $\pi$. This is called the on-policy distribution.

We assume that states appear in examples with the same distribution, $\mu$, over which we are trying to minimize the MSVE. A good strategy in this case is to try to minimize error on the observed examples. Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:
\begin{align} \mathbf{w}_{t+1} &\doteq \mathbf{w}_t - \frac{1}{2} \alpha \nabla \Big[ v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)\Big]^2 \\ &= \mathbf{w}_t + \alpha \Big[ v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)\Big] \nabla \hat{v}(S_t, \mathbf{w}_t), \end{align}
where $\nabla f(\mathbf{w})$, for any scalar function $f(\mathbf{w})$ of the weight vector, denotes the column vector of partial derivatives:
$$\nabla f(\mathbf{w}) \doteq \Big( \frac{\partial f(\mathbf{w})}{\partial w_1}, \frac{\partial f(\mathbf{w})}{\partial w_2}, \cdots, \frac{\partial f(\mathbf{w})}{\partial w_d}\Big)^{\top}.$$
Although $v_{\pi}(S_t)$ is unknown, we can approximate it by substituting the target $U_t$ of the $t$th training example in place of $v_{\pi}(S_t)$. This yields the following general SGD method for state-value prediction:
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat{v}(S_t, \mathbf{w}_t)\Big] \nabla \hat{v}(S_t, \mathbf{w}_t).$$
If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t \mid S_t] = v_{\pi}(S_t)$ for each $t$, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions for decreasing $\alpha$.

For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy $\pi$. Because the true value of a state is the expected value of the return following it, the Monte Carlo target $U_t \doteq G_t$ is by definition an unbiased estimate of $v_{\pi}(S_t)$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution. Pseudocode for a complete algorithm is shown in the box.
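In place of the box, here is a minimal Python sketch of gradient Monte Carlo prediction. The helpers are assumptions, not part of the original listing: `generate_episode()` returns the $(S_t, R_{t+1})$ pairs of one episode under $\pi$, and `v_hat`/`grad_v_hat` are the approximate value function and its gradient.

```python
import numpy as np

def gradient_mc_prediction(generate_episode, v_hat, grad_v_hat, d,
                           alpha=2e-5, gamma=1.0, n_episodes=100_000):
    """Gradient Monte Carlo prediction for estimating v_pi.
    `generate_episode()` yields the (S_t, R_{t+1}) pairs of one episode
    under pi; `v_hat(s, w)` and `grad_v_hat(s, w)` are the approximator
    and its gradient; d is the number of weights."""
    w = np.zeros(d)
    for _ in range(n_episodes):
        G = 0.0
        # Traverse the episode backwards so each return G_t is built
        # incrementally from G_t = R_{t+1} + gamma * G_{t+1}.
        for state, reward in reversed(generate_episode()):
            G = reward + gamma * G
            w += alpha * (G - v_hat(state, w)) * grad_v_hat(state, w)
    return w
```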

#### Example: State Aggregation on the 1000-state Random Walk

State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector $\mathbf{w}$) for each group. The value of a state is estimated as its group's component, and when the state is updated, that component alone is updated. State aggregation is a special case of SGD in which the gradient, $\nabla \hat{v}(S_t, \mathbf{w}_t)$, is 1 for $S_t$'s group's component and 0 for the other components.

Consider a 1000-state version of the random walk task. The states are numbered from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring states to its right, all with equal probability. Of course, if the current state is near an edge, then there may be fewer than 100 neighbors on that side of it. In this case, all the probability that would have gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As usual, termination on the left produces a reward of −1, and termination on the right produces a reward of +1. All other transitions have a reward of zero.

Now, let us solve this problem with the gradient Monte Carlo algorithm. First of all, we define the environment of the problem.
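The original listing is not reproduced here; the following constants are a minimal sketch under our own naming:

```python
import numpy as np

N_STATES = 1000        # non-terminal states, numbered 1..1000
START_STATE = 500      # all episodes begin near the center
STEP_RANGE = 100       # jump to one of up to 100 neighbors on each side
STATES = np.arange(1, N_STATES + 1)
# Indices 0 and N_STATES + 1 act as the left and right terminal states.
```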

We need the true value of each state, so we use dynamic programming to compute these values:
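A sketch of iterative policy evaluation for this task. The terminal entries are pinned to $-1$ and $+1$, which folds the termination rewards into the expectation (valid here because $\gamma = 1$):

```python
def compute_true_values(theta=1e-2):
    """Sweep the states until the value function stops changing."""
    v = np.zeros(N_STATES + 2)
    v[0], v[N_STATES + 1] = -1.0, 1.0
    while True:
        delta = 0.0
        for s in STATES:
            new_v = 0.0
            for step in range(1, STEP_RANGE + 1):
                # Jumps past an edge are absorbed by the terminal entry.
                new_v += v[max(s - step, 0)] + v[min(s + step, N_STATES + 1)]
            new_v /= 2 * STEP_RANGE
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    return v[1:N_STATES + 1]
```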

The policy used to generate episodes:
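A sketch of that policy:

```python
def random_policy(state):
    """Jump a uniform-random distance of 1..100, left or right with
    equal probability (the policy is the same in every state)."""
    step = np.random.randint(1, STEP_RANGE + 1)
    return step if np.random.rand() < 0.5 else -step
```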

The reward received after taking an action:
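A sketch of the transition:

```python
def take_action(state, action):
    """Apply the jump and return (next_state, reward). Jumps past an
    edge terminate with reward -1 (left) or +1 (right)."""
    next_state = state + action
    if next_state < 1:
        return 0, -1.0
    if next_state > N_STATES:
        return N_STATES + 1, 1.0
    return next_state, 0.0
```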

And we have a special value function, based on state aggregation:
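A minimal state-aggregation value function (our naming):

```python
class AggregationValueFunction:
    """One weight per group of contiguous states."""
    def __init__(self, n_groups):
        self.group_size = N_STATES // n_groups
        self.w = np.zeros(n_groups)

    def value(self, state):
        return self.w[(state - 1) // self.group_size]

    def update(self, delta, state):
        # The gradient is 1 for the state's group component and 0
        # elsewhere, so the SGD step touches a single weight.
        self.w[(state - 1) // self.group_size] += delta
```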

Finally, let us solve the problem:
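A sketch of the training loop, assuming the helpers above. With $\gamma = 1$ and all intermediate rewards zero, the return $G_t$ for every visited state is simply the terminal reward:

```python
def gradient_monte_carlo(value_function, alpha=2e-5, n_episodes=100_000):
    """Gradient MC: play one episode, then move every visited state's
    estimate toward its return G_t."""
    for _ in range(n_episodes):
        state, trajectory, reward = START_STATE, [], 0.0
        while 0 < state <= N_STATES:
            trajectory.append(state)
            state, reward = take_action(state, random_policy(state))
        for s in trajectory:
            error = reward - value_function.value(s)
            value_function.update(alpha * error, s)

value_function = AggregationValueFunction(n_groups=10)
gradient_monte_carlo(value_function)
```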

Results are as follows:

Bootstrapping targets such as the n-step return $G_{t:t+n}$ or the DP target $\sum_{a, s^{\prime}, r} \pi(a|S_t)p(s^{\prime}, r|S_t, a)[r + \gamma \hat{v}(s^{\prime}, \mathbf{w}_t)]$ all depend on the current value of the weight vector $\mathbf{w}_t$, which implies that they are biased and that they do not produce a true gradient-descent method. Such updates account for the effect of changing $\mathbf{w}_t$ on the estimate, but ignore its effect on the target; they include only a part of the gradient, and accordingly we call them semi-gradient methods.

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case. Moreover, they offer important advantages that often make them clearly preferred. One reason is that they are typically significantly faster to learn. Another is that they enable learning to be continual and online, without waiting for the end of an episode; this lets them be used on continuing problems and provides computational advantages. A prototypical semi-gradient method is semi-gradient TD(0), which uses $U_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ as its target. Complete pseudocode for this method is given in the box below.
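Again in place of the box, a minimal sketch reusing the random-walk helpers defined above (our naming):

```python
def semi_gradient_td0(value_function, alpha=2e-4, n_episodes=100_000):
    """Semi-gradient TD(0): bootstrap the target R + v(S') from the
    current weights, but follow only the gradient of v(S)."""
    for _ in range(n_episodes):
        state = START_STATE
        while 0 < state <= N_STATES:
            next_state, reward = take_action(state, random_policy(state))
            # Terminal states are given value 0 by convention.
            next_value = (value_function.value(next_state)
                          if 0 < next_state <= N_STATES else 0.0)
            delta = alpha * (reward + next_value - value_function.value(state))
            value_function.update(delta, state)
            state = next_state
```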

### Linear Methods

One of the most important special cases of function approximation is that in which the approximate function, $\hat{v}(\cdot, \mathbf{w})$, is a linear function of the weight vector, $\mathbf{w}$. Corresponding to every state $s$, there is a real-valued vector of features $\mathbf{x}(s) \doteq (x_1(s),x_2(s),\cdots,x_d(s))^{\top}$, with the same number of components as $\mathbf{w}$. However the features are constructed, the approximate state-value function is given by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:
$$\hat{v}(s, \mathbf{w}) \doteq \mathbf{w^{\top}x}(s) \doteq \sum_{i=1}^d w_i x_i(s).$$
The individual functions $x_i:\mathcal{S} \rightarrow \mathbb{R}$ are called basis functions. The gradient of the approximate value function with respect to $\mathbf{w}$ in this case is
$$\nabla \hat{v} (s, \mathbf{w}) = \mathbf{x}(s).$$
In particular, the semi-gradient TD(0) update at each time $t$ is

\begin{align} \mathbf{w}_{t+1} &\doteq \mathbf{w}_t + \alpha \Big( R_{t+1} + \gamma \mathbf{w}_t^{\top}\mathbf{x}_{t+1} - \mathbf{w}_t^{\top}\mathbf{x}_{t}\Big)\mathbf{x}_t \\ &= \mathbf{w}_t + \alpha \Big( R_{t+1}\mathbf{x}_t - \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^{\top} \mathbf{w}_t\Big), \end{align}
where here we have used the notational shorthand $\mathbf{x}_t = \mathbf{x}(S_t)$. If the system converges, it must converge to the weight vector $\mathbf{w}_{TD}$ at which
$$\mathbf{w}_{TD} \doteq \mathbf{A^{-1}b},$$
where
$$\mathbf{b} \doteq \mathbb{E}[R_{t+1}\mathbf{x}_t] \in \mathbb{R}^d \;\; \text{and} \;\; \mathbf{A} \doteq \mathbb{E}\big[ \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^{\top} \big] \in \mathbb{R}^{d \times d}.$$
This quantity is called the TD fixed point. At this point we have:
$$\text{MSVE}(\mathbf{w}_{TD}) \leq \frac{1}{1 - \gamma} \min_{\mathbf{w}} \text{MSVE}(\mathbf{w}).$$
Now we revisit the state-aggregation example, this time solving it with the semi-gradient TD(0) method.
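Given the sketches above, this is just:

```python
value_function = AggregationValueFunction(n_groups=10)
semi_gradient_td0(value_function)
```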

Results are as follows:

We could also use the n-step semi-gradient TD method. To obtain results quantitatively similar to the tabular case, we switch the state aggregation to 20 groups of 50 states each; the 20 groups are then quantitatively close to the 19 states of the tabular 19-state random walk.

The semi-gradient n-step TD algorithm we used in this example is the natural extension of the tabular n-step TD algorithm. The key equation is
$$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \Big[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\Big] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \;\; 0 \leq t < T,$$
where
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \;\; 0 \leq t \leq T-n.$$
Pseudocode for the complete algorithm is given in the box below.
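In place of the box, a sketch of the algorithm reusing the helpers above; because $\gamma = 1$ in the random walk, the discount factors in $G_{t:t+n}$ drop out:

```python
def semi_gradient_n_step_td(value_function, n, alpha=2e-4, n_episodes=100_000):
    """n-step semi-gradient TD for the random walk (gamma = 1)."""
    for _ in range(n_episodes):
        states = [START_STATE]   # S_0, S_1, ...
        rewards = [0.0]          # dummy R_0, then R_1, R_2, ...
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_t = states[t]
                next_state, reward = take_action(s_t, random_policy(s_t))
                states.append(next_state)
                rewards.append(reward)
                if not (0 < next_state <= N_STATES):
                    T = t + 1    # episode terminates at time T
            tau = t - n + 1      # the time whose state estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: n rewards plus a bootstrapped tail value.
                G = sum(rewards[tau + 1:min(tau + n, T) + 1])
                if tau + n < T:
                    G += value_function.value(states[tau + n])
                s = states[tau]
                value_function.update(alpha * (G - value_function.value(s)), s)
            if tau == T - 1:
                break
            t += 1
```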

Now let us compare the performance of different values of $n$:
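A sketch of such a sweep; the unweighted RMS error against the DP values is a rough stand-in for the $\mu$-weighted RMSVE, and the step size and episode count are illustrative rather than tuned:

```python
true_values = compute_true_values()
for n in (1, 2, 4, 8, 16):
    vf = AggregationValueFunction(n_groups=20)
    semi_gradient_n_step_td(vf, n=n, alpha=1e-3, n_episodes=10)
    estimates = np.array([vf.value(s) for s in STATES])
    rms = np.sqrt(np.mean((true_values - estimates) ** 2))
    print(f"n={n:2d}  RMS error={rms:.3f}")
```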

The results are as follows:

### Feature Construction for Linear Methods

Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems. Intuitively, the features should correspond to the natural features of the task, those along which generalization is most appropriate. If we are valuing geometric objects, for example, we might want to have features for each possible shape, color, size, or function. If we are valuing states of a mobile robot, then we might want to have features for locations, degrees of remaining battery power, recent sonar readings, and so on.

#### Fourier Basis

In the one-dimensional case, the Fourier cosine basis of order $n$ consists of the features $x_i(s) = \cos(i\pi s)$ for $s \in [0,1]$ and $i = 0, 1, \ldots, n$; in general, each feature has an integer coefficient vector $\mathbf{c}^i = (c_1^i, \ldots, c_k^i)^{\top}$, where $k$ is the dimension of the state, and $x_i(s) = \cos(\pi\,\mathbf{s}^{\top}\mathbf{c}^i)$. Konidaris et al. (2011) found that when using Fourier cosine basis functions with a learning algorithm such as semi-gradient TD(0) or semi-gradient Sarsa, it is helpful to use a different step-size parameter for each basis function. If $\alpha$ is the basic step-size parameter, they suggest setting the step-size parameter for basis function $x_i$ to $\alpha_i=\alpha/\sqrt{(c_1^i)^2 + \cdots + (c_k^i)^2}$ (except when each $c_j^i=0$, in which case $\alpha_i=\alpha$).
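For the one-dimensional basis, $\mathbf{c}^i$ is just the scalar $i$, so the rule reduces to $\alpha_i = \alpha/i$ with $\alpha_0 = \alpha$. A sketch:

```python
order, alpha = 5, 1e-4
c = np.arange(order + 1)           # c^i = i for the 1-D cosine basis
alphas = alpha / np.maximum(c, 1)  # alpha_i = alpha / i, and alpha_0 = alpha
```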

Now, let us compare the Fourier and polynomial bases on the 1000-state random walk example (for the one-dimensional polynomial basis, $x_i(s) = s^i$). In general, we do not recommend using the polynomial basis for online learning.
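A sketch of the feature construction for both bases; the function name and the scaling of states into $[0, 1]$ are our assumptions:

```python
def bases_features(state, order, kind):
    """Map a state to its feature vector under the chosen basis."""
    s = state / float(N_STATES)       # scale the state into [0, 1]
    i = np.arange(order + 1)
    if kind == 'polynomial':
        return s ** i                 # x_i(s) = s^i
    return np.cos(np.pi * s * i)      # x_i(s) = cos(i * pi * s)
```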

The function above is used to construct the features of states (it maps states to feature vectors).

Next, we compare performance for different values of the order hyperparameter:
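A sketch of the sweep, running gradient Monte Carlo with linear features under each basis (step sizes and episode counts are illustrative, not tuned):

```python
true_values = compute_true_values()
for kind in ('polynomial', 'fourier'):
    for order in (5, 10, 20):
        w = np.zeros(order + 1)
        alpha = 1e-4 if kind == 'polynomial' else 5e-5
        for _ in range(5000):
            state, trajectory, reward = START_STATE, [], 0.0
            while 0 < state <= N_STATES:
                trajectory.append(state)
                state, reward = take_action(state, random_policy(state))
            for s in trajectory:      # gradient MC with linear features
                x = bases_features(s, order, kind)
                w += alpha * (reward - w @ x) * x
        est = np.array([w @ bases_features(s, order, kind) for s in STATES])
        rms = np.sqrt(np.mean((true_values - est) ** 2))
        print(f"{kind:>10}  order={order:2d}  RMS error={rms:.3f}")
```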

Results: