In this post we turn to the control problem with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{\ast}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a finite-dimensional weight vector.

The general gradient-descent update for action-value prediction is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t).$$
For example, the update for the one-step Sarsa method is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)- \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t).$$
We call this method episode semi-gradient one-step sarsa.

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques applicable to continuous actions, or to actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution. On the other hand, if the action set is discrete and not too large, then we can use the techniques already developed in previous chapters.

Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram in the below.

The diﬃculty is that gravity is stronger than the car’s engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to ﬁrst move away from the goal and up the opposite slope on the left. Then, by applying full throttle the car can build up enough inertia to carry it up the steep slope even though it is slowing down the whole way. This is a simple example of a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better. Many control methodologies have great diﬃculties with tasks of this kind unless explicitly aided by a human designer.

The reward in this problem is −1 on all time steps until the car moves past its goal position at the top of the mountain, which ends the episode. There are three possible actions: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0). The car moves according to a simpliﬁed physics. Its position, $x_t$, and velocity, $\dot{x}_t$, are updated by
\begin{align} x_{t+1} &\doteq bound \big[x_t + \dot{x}_{t+1} \big] \\ \dot{x}_{t+1} &\doteq bound \big[\dot{x}_t + 0.001 A_t - 0.0025 \cos(3x_t) \big], \end{align}
where the bound operation enforces $-1.2 \leq x_{t+1} \leq 0.5 \;\; \text{and} \; -0.07 \leq \dot{x}_{t+1} \leq 0.07$. In addition, when $x_{t+1}$ reached the left bound, $\dot{x}_{t+1}$ was reset to zero. When it reached the right bound, the goal was reached and the episode was terminated. Each episode started from a random position $x_t \in [−0.6, −0.4)$ and zero velocity. To convert the two continuous state variables to binary features, we used grid-tilings.

First of all, we define the environment of this problem:

After take an action, we transition to a new state and get a reward:

The $\varepsilon$-greedy policy:

We need map out continuous state to discrete state:

Because the one-step semi-gradient sarsa is a special case of n-step, so we develop the n-step algorithm
$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \; n \geq 1,0 \leq t \leq T-n.$$
The n-step equation is
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ G_{t:t+n}- \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \;\;\; 0 \leq t \leq T.$$
Complete pseudocode is given in the box below.

So the code is as follows:

Next, we use the method mentioned earlier to solve this problem:

Result is as follows:

The result shows what typically happens while learning to solve this task with this form of function approximation. Shown is the negative of the value function (the cost-to-go function) learned on a single run. The initial action values were all zero, which was optimistic (all true values are negative in this task), causing extensive exploration to occur even though the exploration parameter, $\varepsilon$, was 0. All the states visited frequently are valued worse than unexplored states, because the actual rewards have been worse than what was (unrealistically) expected. This continually drives the agent away from wherever it has been, to explore new states, until a solution is found.

Next, let us test the performance of various step size (learning rate).

The result is as follows:

And then, let us test the performance of one-step sarsa and multi-step sarsa on the Mountain Car task:

The result is as follows:

Finally, let me shows the results of a more detailed study of the eﬀect of the parameters $\alpha$ and $n$ on the rate of learning on this task.

The result is as follows:

### Update

#### Use OpenAI gym

Now, let us use the OpenAI gym toolkit to simply our Mountain Car task. As we know, we need to define the environment of task before develop the detail algorithm. It’s easy to make mistakes with some detail. So it is fantastic if we do not need to care about that. The gym toolkit help us to define the environment. Such as the Mountain Car task, there is a build-in model in the gym toolkit. We just need to write one row code:

That is amazing!

We also can test the environment very convenience and get a pretty good user graphic:

These codes will return the result as follows:

Bravo~

Now, let us solve the task by to use the Q-Learning algorithm. Meanwhile, we will use the RBF kernel to construct the features and use the SGDRegressor gradient-descent method of scikit-learn package to update $\mathbf{w}$.

First of all, we need to normalized our features to accelerate the convergence speed and use the RBF kernel to reconstruct our feature space. The both methods (Normalization and Reconstruction) need to use some training set to train a model. Then we use this model to transform our raw data.

Next, we define a class named Estimator to simply the gradient descent process:

We also need a $\varepsilon$-greedy policy to select action:

Then we develop the Q-Learning method:

Run this method:

The result is as follows: