
On-policy Prediction with Approximation

The novelty in this post is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$. We will write $\hat{v}(s, \mathbf{w}) \approx v_{\pi}(s)$ for the approximated value of state $s$ given weight vector $\mathbf{w}$. For example, $\hat{v}$ might be a linear function in features of the state, with $\mathbf{w}$ the vector of feature weights. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states.
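
As a minimal sketch (not the representation used in the examples later in this post), a linear parameterization can be written in a few lines; here `features` stands for any hypothetical mapping from states to $\mathbb{R}^d$:

import numpy as np

# a minimal sketch of a linearly parameterized value function
# features(s) is a hypothetical mapping from states to R^d
def v_hat(state, w, features):
    return np.dot(w, features(state))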

The Prediction Objective (MSVE)

In the tabular case a continuous measure of prediction quality was not necessary because the learned value function could come to equal the true value function exactly. But with genuine approximation, an update at one state affects many others, and it is not possible to get all states exactly correct. By assumption we have far more states than weights, so making one state’s estimate more accurate invariably means making others’ less accurate.

By the error in a state $s$ we mean the square of the difference between the approximate value $\hat{v}(s,\mathbf{w})$ and the true value $v_{\pi}(s)$. Weighting this over the state space by the distribution $\mu$, we obtain a natural objective function, the Mean Squared Value Error, or MSVE:
$$
\text{MSVE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \Big[v_{\pi}(s) - \hat{v}(s, \mathbf{w}) \Big]^2.
$$
The square root of this measure, the root MSVE or RMSVE, gives a rough measure of how much the approximate values differ from the true values and is often used in plots. Typically one chooses $\mu(s)$ to be the fraction of time spent in $s$ under the target policy $\pi$. This is called the on-policy distribution.

[Box: the on-policy distribution in episodic tasks]
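
As a quick illustration of what this objective measures, a hypothetical helper for a finite state space might look as follows (the array names are illustrative only):

import numpy as np

# a sketch: MSVE for a finite state space
# mu, true_values, approx_values are hypothetical 1-D arrays indexed by state
def msve(mu, true_values, approx_values):
    return np.sum(mu * (true_values - approx_values) ** 2)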

Stochastic-gradient Methods

We assume that states appear in the training examples with the same distribution, $\mu$, over which we are trying to minimize the MSVE. A good strategy in this case is to try to minimize the error on the observed examples. Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:
$$
\begin{align}
\mathbf{w}_{t+1} &\doteq \mathbf{w}_t - \frac{1}{2} \alpha \nabla \Big[ v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)\Big]^2 \\
&= \mathbf{w}_t + \alpha \Big[ v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t)\Big] \nabla \hat{v}(S_t, \mathbf{w}_t).
\end{align}
$$
Here $\nabla f(\mathbf{w})$ denotes the column vector of partial derivatives:
$$
\nabla f(\mathbf{w}) \doteq \Big( \frac{\partial f(\mathbf{w})}{\partial w_1}, \frac{\partial f(\mathbf{w})}{\partial w_2}, \cdots, \frac{\partial f(\mathbf{w})}{\partial w_d}\Big)^{\top}.
$$
Although $v_{\pi}(S_t)$ is unknown, we can approximate it by substituting a target $U_t$ (the target of the $t$th training example) in place of $v_{\pi}(S_t)$. This yields the following general SGD method for state-value prediction:
$$
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat{v}(S_t, \mathbf{w}_t)\Big] \nabla \hat{v}(S_t, \mathbf{w}_t).
$$
If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t \mid S_t] = v_{\pi}(S_t)$ for each $t$, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions for decreasing $\alpha$.
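
In code, a single SGD update is only a couple of lines. The sketch below is generic; `v_hat` and `grad_v_hat` are placeholders for whatever differentiable approximator is being used, and `u_t` is the training target:

# a minimal sketch of one SGD update for state-value prediction
# u_t: target for state s_t (e.g., a Monte Carlo return)
# v_hat(s, w) and grad_v_hat(s, w): the approximator and its gradient w.r.t. w
def sgd_update(w, s_t, u_t, alpha, v_hat, grad_v_hat):
    return w + alpha * (u_t - v_hat(s_t, w)) * grad_v_hat(s_t, w)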

For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy $\pi$. Because the true value of a state is the expected value of the return following it, the Monte Carlo target $U_t \doteq G_t$ is by definition an unbiased estimate of $v_{\pi}(S_t)$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution. Pseudocode for a complete algorithm
is shown in the box.

[Box: pseudocode for gradient Monte Carlo state-value prediction]

Example: State Aggregation on the 1000-state Random Walk

State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector w) for each group. The value of a state is estimated as its group’s component, and when the state is updated, that component alone is updated. State aggregation is a special case of SGD in which the gradient, $\nabla \hat{v}(S_t, \mathbf{w}_t)$, is 1 for $S_t$’s group’s component and 0 for the other components.
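
Equivalently, state aggregation is linear function approximation with one-hot features: the feature vector of a state has a 1 in its group's component and 0 elsewhere, which is exactly the gradient described above. A hypothetical sketch (states assumed to be numbered from 1):

import numpy as np

# a sketch: the one-hot feature vector implied by state aggregation
def aggregation_features(state, num_groups, group_size):
    x = np.zeros(num_groups)
    x[(state - 1) // group_size] = 1.0
    return x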

Consider a 1000-state version of the random walk task. The states are numbered from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring states to its right, all with equal probability. Of course, if the current state is near an edge, then there may be fewer than 100 neighbors on that side of it. In this case, all the probability that would have gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As usual, termination on the left produces a reward of −1, and termination on the right produces a reward of +1. All other transitions have a reward of zero.

Now let us solve this problem with the gradient Monte Carlo algorithm. First, we define the environment:

import numpy as np
import matplotlib.pyplot as plt

# number of states, excluding the two terminal states
N_STATES = 1000
# initial guess for the true state values (refined by DP below)
trueStateValues = np.arange(-1001, 1003, 2) / 1001.0
# all non-terminal states
states = np.arange(1, N_STATES + 1)
# start from a central state
START_STATE = 500
# terminal states
END_STATES = [0, N_STATES + 1]
# possible actions
ACTION_LEFT = -1
ACTION_RIGHT = 1
ACTIONS = [ACTION_LEFT, ACTION_RIGHT]
# maximum stride for an action
STEP_RANGE = 100

We need the true value of each state for evaluation, so we use dynamic programming to compute these values:

# Dynamic programming to find the true state values, starting from the initial guess above
# Assume all rewards are 0, given that we have already given value -1 and 1 to terminal states
while True:
    oldTrueStateValues = np.copy(trueStateValues)
    for state in states:
        trueStateValues[state] = 0
        for action in ACTIONS:
            for step in range(1, STEP_RANGE + 1):
                step *= action
                newState = state + step
                newState = max(min(newState, N_STATES + 1), 0)
                # asynchronous update for faster convergence
                trueStateValues[state] += 1.0 / (2 * STEP_RANGE) * trueStateValues[newState]
    error = np.sum(np.abs(oldTrueStateValues - trueStateValues))
    print(error)
    if error < 1e-2:
        break
# correct the state value for terminal states to 0
trueStateValues[0] = trueStateValues[-1] = 0

The transition dynamics: taking an action at a state returns the new state and the reward for that transition:

# take @action at @state; return the new state and the reward for this transition
def takeAction(state, action):
    step = np.random.randint(1, STEP_RANGE + 1)
    step *= action
    state += step
    state = max(min(state, N_STATES + 1), 0)
    if state == 0:
        reward = -1
    elif state == N_STATES + 1:
        reward = 1
    else:
        reward = 0
    return state, reward

The random policy used to generate episodes:

# choose an action, following the random policy
def getAction():
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_RIGHT
    return ACTION_LEFT

The value function uses state aggregation, with one weight per group of states:

# a wrapper class for the state-aggregation value function
class ValueFunction:
    # @numOfGroups: # of aggregations
    def __init__(self, numOfGroups):
        self.numOfGroups = numOfGroups
        self.groupSize = N_STATES // numOfGroups
        # one weight (theta) per group
        self.params = np.zeros(numOfGroups)

    # get the value of @state
    def value(self, state):
        if state in END_STATES:
            return 0
        groupIndex = (state - 1) // self.groupSize
        return self.params[groupIndex]

    # update parameters
    # @delta: step size * (target - old estimate)
    # @state: state of the current sample
    def update(self, delta, state):
        groupIndex = (state - 1) // self.groupSize
        self.params[groupIndex] += delta

And the gradient MC algorithm:

# gradient Monte Carlo algorithm
# @valueFunction: an instance of class ValueFunction
# @alpha: step size
# @distribution: array to store the state-visitation counts
def gradientMonteCarlo(valueFunction, alpha, distribution=None):
    currentState = START_STATE
    trajectory = [currentState]
    # gamma = 1 and every non-terminal reward is 0, so the return from
    # every state in the episode equals the final reward
    reward = 0.0
    while currentState not in END_STATES:
        action = getAction()
        newState, reward = takeAction(currentState, action)
        trajectory.append(newState)
        currentState = newState
    # gradient update for each state in this trajectory
    for state in trajectory[:-1]:
        delta = alpha * (reward - valueFunction.value(state))
        valueFunction.update(delta, state)
        if distribution is not None:
            distribution[state] += 1

Finally, let us put it all together and solve the problem:

nEpisodes = int(1e5)
alpha = 2e-5
# we have 10 groups in this example, each with 100 states
valueFunction = ValueFunction(10)
distribution = np.zeros(N_STATES + 2)
for episode in range(0, nEpisodes):
    print('episode:', episode)
    gradientMonteCarlo(valueFunction, alpha, distribution)
distribution /= np.sum(distribution)
stateValues = [valueFunction.value(i) for i in states]

plt.figure(0)
plt.plot(states, stateValues, label='Approximate MC value')
plt.plot(states, trueStateValues[1: -1], label='True value')
plt.xlabel('State')
plt.ylabel('Value')
plt.legend()

plt.figure(1)
plt.plot(states, distribution[1: -1], label='State distribution')
plt.xlabel('State')
plt.ylabel('Distribution')
plt.legend()

Results are as follows:

[Figure: state-visitation distribution under the random policy]

[Figure: approximate MC values (state aggregation) vs. true values]

Semi-gradient Methods

Bootstrapping targets, such as the n-step return $G_{t:t+n}$ or the DP target $\sum_{a, s^{\prime}, r} \pi(a|S_t)p(s^{\prime}, r|S_t, a)\big[r + \gamma \hat{v}(s^{\prime}, \mathbf{w}_t)\big]$, all depend on the current value of the weight vector $\mathbf{w}_t$, which implies that they are biased and do not produce a true gradient-descent method. We call them semi-gradient methods.

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case. Moreover, they offer important advantages that often make them clearly preferable. One reason is that they are typically significantly faster to learn. Another is that they enable learning to be continual and online, without waiting for the end of an episode; this lets them be used on continuing problems and provides computational advantages. A prototypical semi-gradient method is semi-gradient TD(0), which uses $U_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ as its target. Complete pseudocode for this method is given in the box below.

[Box: pseudocode for semi-gradient TD(0)]
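
As a concrete sketch, semi-gradient TD(0) for the random walk can be written directly in terms of the helpers defined in the example above (`getAction`, `takeAction`, and the state-aggregation `ValueFunction`), with $\gamma = 1$:

# a sketch of semi-gradient TD(0) for one episode of the random walk above
# (gamma = 1; reuses getAction, takeAction and the ValueFunction class)
def semiGradientTD0(valueFunction, alpha):
    currentState = START_STATE
    while currentState not in END_STATES:
        action = getAction()
        newState, reward = takeAction(currentState, action)
        # TD(0) target: R + gamma * v_hat(S', w); v_hat is 0 at terminal states
        target = reward + valueFunction.value(newState)
        delta = alpha * (target - valueFunction.value(currentState))
        valueFunction.update(delta, currentState)
        currentState = newState

The n-step implementation used later in this post reduces to this update when $n = 1$.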

Linear Methods

One of the most important special cases of function approximation is that in which the approximate function, $\hat{v}(\cdot, \mathbf{w})$, is a linear function of the weight vector, $\mathbf{w}$. Corresponding to every state $s$, there is a real-valued vector of features $\mathbf{x}(s) \doteq (x_1(s),x_2(s),\cdots,x_d(s))^{\top}$, with the same number of components as $\mathbf{w}$. However the features are constructed, the approximate state-value function is given by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:
$$
\hat{v}(s, \mathbf{w}) \doteq \mathbf{w^{\top}x}(s) \doteq \sum_{i=1}^d w_i x_i(s).
$$
The individual functions $x_i: \mathcal{S} \rightarrow \mathbb{R}$ are called basis functions. The gradient of the approximate value function with respect to $\mathbf{w}$ in this case is
$$
\nabla \hat{v} (s, \mathbf{w}) = \mathbf{x}(s).
$$
For semi-gradient TD(0), the update at each time $t$ is

$$
\begin{align}
\mathbf{w}_{t+1} &\doteq \mathbf{w}_t + \alpha \Big( R_{t+1} + \gamma \mathbf{w}_t^{\top}\mathbf{x}_{t+1} - \mathbf{w}_t^{\top}\mathbf{x}_{t}\Big)\mathbf{x}_t \\
&= \mathbf{w}_t + \alpha \Big( R_{t+1}\mathbf{x}_t - \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^{\top} \mathbf{w}_t\Big),
\end{align}
$$
where we have used the notational shorthand $\mathbf{x}_t = \mathbf{x}(S_t)$. If the system converges, it must converge to the weight vector $\mathbf{w}_{TD}$ at which
$$
\mathbf{w}_{TD} \doteq \mathbf{A^{-1}b},
$$
where
$$
\mathbf{b} \doteq \mathbb{E}[R_{t+1}\mathbf{x}_t] \in \mathbb{R}^d \;\; \text{and} \;\; \mathbf{A} \doteq \mathbb{E}\big[ \mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^{\top} \big] \in \mathbb{R}^{d \times d}.
$$
This quantity is called the TD fixed point. At the TD fixed point, the MSVE is within a bounded expansion of the lowest possible error:
$$
\text{MSVE}(\mathbf{w}_{TD}) \leq \frac{1}{1 - \gamma} \min_{\mathbf{w}} \text{MSVE}(\mathbf{w}).
$$
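
As an aside, $\mathbf{A}$ and $\mathbf{b}$ can also be estimated directly from sample transitions and solved for $\mathbf{w}_{TD}$ (the idea behind least-squares TD). Below is a rough, unoptimized sketch for the random walk with one-hot aggregation features, reusing `getAction` and `takeAction` from the example above with $\gamma = 1$; the function name and defaults are illustrative:

# a rough sketch: estimate A and b from sample transitions, then solve A w = b
# one-hot state-aggregation features; terminal states map to the zero vector
def estimate_td_fixed_point(num_groups=10, num_episodes=1000):
    group_size = N_STATES // num_groups
    def x(state):
        feature = np.zeros(num_groups)
        if state not in END_STATES:
            feature[(state - 1) // group_size] = 1.0
        return feature
    A = np.zeros((num_groups, num_groups))
    b = np.zeros(num_groups)
    for _ in range(num_episodes):
        state = START_STATE
        while state not in END_STATES:
            newState, reward = takeAction(state, getAction())
            A += np.outer(x(state), x(state) - x(newState))  # gamma = 1
            b += reward * x(state)
            state = newState
    # the common 1/count normalization of A and b cancels in the solution
    return np.linalg.solve(A, b)
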
Now we revisit the state-aggregation example, this time using the semi-gradient TD method.

# semi-gradient n-step TD algorithm
# @valueFunction: an instance of class ValueFunction
# @n: # of steps
# @alpha: step size
def semiGradientTemporalDifference(valueFunction, n, alpha):
    # initial starting state
    currentState = START_STATE
    # arrays to store states and rewards for an episode
    # space isn't a major consideration, so I didn't use the mod trick
    states = [currentState]
    rewards = [0]
    # track the time
    time = 0
    # the length of this episode
    T = float('inf')
    while True:
        # go to the next time step
        time += 1
        if time < T:
            # choose an action randomly
            action = getAction()
            newState, reward = takeAction(currentState, action)
            # store the new state and reward
            states.append(newState)
            rewards.append(reward)
            if newState in END_STATES:
                T = time
        # get the time of the state to update
        updateTime = time - n
        if updateTime >= 0:
            returns = 0.0
            # sum the corresponding rewards (gamma = 1, so no discounting)
            for t in range(updateTime + 1, min(T, updateTime + n) + 1):
                returns += rewards[t]
            # add the estimated state value to the return
            if updateTime + n <= T:
                returns += valueFunction.value(states[updateTime + n])
            stateToUpdate = states[updateTime]
            # update the value function
            if stateToUpdate not in END_STATES:
                delta = alpha * (returns - valueFunction.value(stateToUpdate))
                valueFunction.update(delta, stateToUpdate)
        if updateTime == T - 1:
            break
        currentState = newState
nEpisodes = int(1e5)
alpha = 2e-4
valueFunction = ValueFunction(10)
for episode in range(0, nEpisodes):
    print('episode:', episode)
    semiGradientTemporalDifference(valueFunction, 1, alpha)
stateValues = [valueFunction.value(i) for i in states]

plt.figure(2)
plt.plot(states, stateValues, label='Approximate TD value')
plt.plot(states, trueStateValues[1: -1], label='True value')
plt.xlabel('State')
plt.ylabel('Value')
plt.legend()

Results are as follows:

[Figure: approximate TD values (state aggregation) vs. true values]

We can also use the n-step semi-gradient TD method. To obtain results quantitatively similar to the tabular n-step case, we switch the state aggregation to 20 groups of 50 states each; the 20 groups are then quantitatively close to the 19 states of the tabular random-walk problem.

The semi-gradient n-step TD algorithm we used in this example is the natural extension of the tabular n-step TD algorithm. The key equation is
$$
\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \Big[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\Big] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \;\; 0 \leq t < T,
$$
where
$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \;\; 0 \leq t \leq T-n.
$$
Pseudocode for the complete algorithm is given in the box below.

[Box: pseudocode for n-step semi-gradient TD]

Now let us compare the performance of different values of $n$:

# truncate value for better display
truncateValue = 0.55
# all possible steps (n)
steps = np.power(2, np.arange(0, 10))
# all possible alphas
alphas = np.arange(0, 1.1, 0.1)
# each run has 10 episodes
episodes = 10
# perform 100 independent runs
runs = 100
# track the errors for each (step, alpha) combination
errors = np.zeros((len(steps), len(alphas)))
for run in range(0, runs):
    for stepInd, step in enumerate(steps):
        for alphaInd, alpha in enumerate(alphas):
            print('run:', run, 'step:', step, 'alpha:', alpha)
            # we have 20 aggregations in this example
            valueFunction = ValueFunction(20)
            for ep in range(0, episodes):
                semiGradientTemporalDifference(valueFunction, step, alpha)
                # calculate the RMS error over all states
                currentStateValues = np.asarray([valueFunction.value(i) for i in states])
                errors[stepInd, alphaInd] += np.sqrt(np.sum(np.power(currentStateValues - trueStateValues[1: -1], 2)) / N_STATES)
# take the average
errors /= episodes * runs
# truncate the error for better display
errors[errors > truncateValue] = truncateValue

plt.figure(3)
for i in range(0, len(steps)):
    plt.plot(alphas, errors[i, :], label='n = ' + str(steps[i]))
plt.xlabel('alpha')
plt.ylabel('RMS error')
plt.legend()

The results are as follows:

[Figure: RMS error of n-step semi-gradient TD for different n and alpha]

Feature Construction for Linear Methods

Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems. Intuitively, the features should correspond to the natural features of the task, those along which generalization is most appropriate. If we are valuing geometric objects, for example, we might want to have features for each possible shape, color, size, or function. If we are valuing states of a mobile robot, then we might want to have features for locations, degrees of remaining battery power, recent sonar readings, and so on.

Polynomials

[Figure: polynomial basis]
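
The details are omitted here; for the one-dimensional random walk used below, with the state mapped into $[0, 1]$, the order-$n$ polynomial features reduce to simple powers of the normalized state:
$$
x_i(s) = s^i, \qquad i = 0, 1, \dots, n.
$$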

Fourier Basis

[Figure: Fourier basis]
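
Likewise, the one-dimensional order-$n$ Fourier cosine features used below are:
$$
x_i(s) = \cos(i \pi s), \qquad s \in [0, 1], \; i = 0, 1, \dots, n.
$$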

Konidaris et al. (2011) found that when using Fourier cosine basis functions with a learning algorithm such as semi-gradient TD(0) or semi-gradient Sarsa, it is helpful to use a different step-size parameter for each basis function. If $\alpha$ is the basic step-size parameter, they suggest setting the step-size parameter for basis function $x_i$ to $\alpha_i = \alpha/\sqrt{(c_1^i)^2 + \cdots + (c_d^i)^2}$ (except when each $c_j^i = 0$, in which case $\alpha_i = \alpha$).
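
For the one-dimensional Fourier basis used in this post, $c^i$ is simply the integer $i$, so the rule reduces to $\alpha_i = \alpha / i$ for $i \geq 1$ and $\alpha_0 = \alpha$. A hypothetical helper (not used in the experiments below):

# a sketch of the per-feature step sizes suggested by Konidaris et al.
# for a one-dimensional Fourier basis of the given order (c^i = i)
def fourier_step_sizes(alpha, order):
    return np.array([alpha if i == 0 else alpha / i for i in range(order + 1)])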

Now let us compare the Fourier and polynomial bases on the 1000-state random walk example. In general, we do not recommend using the polynomial basis for online learning.

# a wrapper class for polynomial- / Fourier-based value functions
POLYNOMIAL_BASES = 0
FOURIER_BASES = 1

class BasesValueFunction:
    # @order: order of the bases; there is one additional constant (bias) feature
    # @type: polynomial bases or Fourier bases
    def __init__(self, order, type):
        self.order = order
        self.weights = np.zeros(order + 1)
        # set up the basis functions
        self.bases = []
        if type == POLYNOMIAL_BASES:
            for i in range(0, order + 1):
                self.bases.append(lambda s, i=i: pow(s, i))
        elif type == FOURIER_BASES:
            for i in range(0, order + 1):
                self.bases.append(lambda s, i=i: np.cos(i * np.pi * s))

    # get the value of @state
    def value(self, state):
        # map the state space into [0, 1]
        state /= float(N_STATES)
        # get the feature vector
        feature = np.asarray([func(state) for func in self.bases])
        return np.dot(self.weights, feature)

    def update(self, delta, state):
        # map the state space into [0, 1]
        state /= float(N_STATES)
        # the gradient of a linear value function is the feature vector itself
        derivativeValue = np.asarray([func(state) for func in self.bases])
        self.weights += delta * derivativeValue

The class above constructs the features of the states (it maps states to feature vectors) and maintains the corresponding weight vector.

Next, we compare the performance of the two bases for different values of the order hyperparameter:

runs = 1
episodes = 5000
# orders of the bases
orders = [5, 10, 20]
alphas = [1e-4, 5e-5]
labels = [['polynomial basis'] * 3, ['fourier basis'] * 3]
# track errors for each episode
errors = np.zeros((len(alphas), len(orders), episodes))
for run in range(0, runs):
    for i in range(0, len(orders)):
        valueFunctions = [BasesValueFunction(orders[i], POLYNOMIAL_BASES),
                          BasesValueFunction(orders[i], FOURIER_BASES)]
        for j in range(0, len(valueFunctions)):
            for episode in range(0, episodes):
                print('run:', run, 'order:', orders[i], labels[j][i], 'episode:', episode)
                # gradient Monte Carlo algorithm
                gradientMonteCarlo(valueFunctions[j], alphas[j])
                # get state values under the current value function
                stateValues = [valueFunctions[j].value(state) for state in states]
                # accumulate the root mean squared error
                errors[j, i, episode] += np.sqrt(np.mean(np.power(trueStateValues[1: -1] - stateValues, 2)))
# average over independent runs
errors /= runs

plt.figure(5)
for i in range(0, len(alphas)):
    for j in range(0, len(orders)):
        plt.plot(errors[i, j, :], label=labels[i][j] + ' order = ' + str(orders[j]))
plt.xlabel('Episodes')
plt.ylabel('RMSVE')
plt.legend()

Results:

[Figure: Fourier basis vs. polynomial basis on the 1000-state random walk]

TODO: TILE CODING