
Dynamic Programming

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning, both because of their assumption of a perfect model and because of their great computational expense, but they provide an essential foundation for understanding the methods presented later. In fact, all of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. Here we show how DP can be used to compute the value functions defined earlier. As discussed there, we can easily obtain optimal policies once we have found the optimal value functions $v_{\star}$ or $q_{\star}$, which satisfy the Bellman optimality equations:
$$
\begin{align}
v_{\star}(s) &= \max_{a} \mathbb{E} [R_{t+1} + \gamma v_{\star}(S_{t+1}) \ | \ S_{t}=s, A_{t}=a] \\
&= \max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) \left[r + \gamma v_{\star}(s^{\prime})\right]
\end{align}
$$
or
$$
\begin{align}
q_{\star}(s, a) &= \mathbb{E} [R_{t+1} + \gamma \max_{a^{\prime}} q_{\star}(S_{t+1}, a^{\prime}) \ | \ S_{t}=s, A_{t}=a] \\
&= \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma \max_{a^{\prime}} q_{\star}(s^{\prime}, a^{\prime})]
\end{align}
$$
for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}, \; \text{and} \; s^{\prime} \in \mathcal{S^{+}}$. As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.

First we consider how to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. This is called policy evaluation in the DP literature. We also refer to it as the prediction problem. Recall that for all $s \in \mathcal{S}$,
$$
\begin{align}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi} [R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \cdots \ | \ S_{t}=s] \\
&= \mathbb{E}_{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \ | \ S_{t}=s] \\
&= \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})]
\end{align}
$$
If the environment’s dynamics are completely known, then the Bellman equation above is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns (the $v_{\pi}(s), s \in \mathcal{S}$). In principle, its solution is a straightforward, if tedious, computation. For our purposes, iterative solution methods are most suitable. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for $v_{\pi}$ as an update rule:
$$
\begin{align}
v_{k+1}(s) &\doteq \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{k}(S_{t+1}) \ | \ S_{t}=s] \\
&= \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r|s, a) [r + \gamma v_{k} (s^{\prime})]
\end{align}
$$
This algorithm is called iterative policy evaluation.
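Before the pseudocode, it is worth checking the claim above that $v_{\pi}$ is the solution of a linear system. The sketch below is an addition to the text: it uses a made-up two-state MDP under a fixed policy (the transition matrix `P_pi` and reward vector `r_pi` are illustrative assumptions) and compares the direct solution of $(I - \gamma P_{\pi}) v = r_{\pi}$ with the iterative Bellman update.

```python
import numpy as np

# Hypothetical two-state MDP under a fixed policy pi (numbers are illustrative only):
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a),  r_pi[s] = expected one-step reward under pi
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.0])
gamma = 0.9

# direct solution of the linear system (I - gamma * P_pi) v = r_pi
v_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# iterative policy evaluation: repeatedly apply v <- r_pi + gamma * P_pi v
v_iter = np.zeros(2)
for _ in range(500):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_direct, v_iter)  # both are approximately [6.30, 4.93]
```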

Iterative policy evaluation

Input $\pi$, the policy to be evaluated

Initialize an array $V(s) = 0$, for all $s \in \mathcal{S^{+}}$

Repeat

​ $\Delta \leftarrow 0$

​ for each $s \in \mathcal{S}$:

​ $v \leftarrow V(s)$

​ $V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

​ $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$ (a small positive number)

Output $V \approx v_{\pi}$

The algorithm used earlier for the grid-world problem is exactly this iterative policy evaluation.
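To make this concrete, here is a minimal sketch of iterative policy evaluation for the standard 4×4 grid world (equiprobable random policy, reward −1 per move, the two opposite corner cells terminal). The grid layout and reward are assumptions matching that textbook example, not code from this post.

```python
import numpy as np

WORLD_SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
THETA = 1e-4

def is_terminal(i, j):
    # the two opposite corner cells are terminal states
    return (i == 0 and j == 0) or (i == WORLD_SIZE - 1 and j == WORLD_SIZE - 1)

def step(i, j, di, dj):
    # moves that would leave the grid leave the agent where it is
    ni, nj = i + di, j + dj
    if ni < 0 or ni >= WORLD_SIZE or nj < 0 or nj >= WORLD_SIZE:
        ni, nj = i, j
    return ni, nj

V = np.zeros((WORLD_SIZE, WORLD_SIZE))
while True:
    delta = 0.0
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            if is_terminal(i, j):
                continue
            v = V[i, j]
            # equiprobable random policy, reward -1 per move, undiscounted (gamma = 1)
            V[i, j] = sum(0.25 * (-1 + V[step(i, j, di, dj)]) for di, dj in ACTIONS)
            delta = max(delta, abs(v - V[i, j]))
    if delta < THETA:
        break

print(np.round(V, 1))  # roughly 0, -14, -20, -22, ... as in the textbook figure
```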

Our reason for computing the value function for a policy is to help find better policies. Once a policy $\pi$ has been improved using $v_{\pi}$ to yield a better policy $\pi^{\prime}$, we can then compute $v_{\pi^{\prime}}$ and improve it again to yield an even better $\pi^{\prime\prime}$. We can thus obtain a sequence of monotonically improving policies and value functions:
$$
\pi_{0} \stackrel{E}\longrightarrow v_{\pi_{0}} \stackrel{I}\longrightarrow \pi_{1} \stackrel{E}\longrightarrow v_{\pi_{1}} \stackrel{I}\longrightarrow \pi_{2} \stackrel{E}\longrightarrow \cdots \stackrel{I}\longrightarrow \pi_{\star} \stackrel{E}\longrightarrow v_{\star},
$$
where $\stackrel{E}\longrightarrow$ denotes a policy evaluation and $\stackrel{I}\longrightarrow$ denotes a policy improvement. This way of finding an optimal policy is called policy iteration.
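For reference, the improvement step $\stackrel{I}\longrightarrow$ makes the new policy greedy with respect to the value function of the old one:
$$
\pi^{\prime}(s) \doteq \arg\max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) \left[r + \gamma v_{\pi}(s^{\prime})\right],
$$
which is exactly the assignment that appears in step 3 of the pseudocode below.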

Policy iteration (using iterative policy evaluation)

  1. Initialization

    $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A(s)}$ arbitrarily for all $s \in \mathcal{S}$

  2. Policy Evaluation

    Repeat

    ​ $\Delta \leftarrow 0$

    ​ For each $s \in \mathcal{S}$:

    ​ $v \leftarrow V(s)$

    ​ $V(s) \leftarrow \sum_{s^{\prime}, r} p(s^{\prime}, r | s, \pi(s)) [r + \gamma V(s^{\prime})]$

    ​ $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

    until $\Delta < \theta$ (a small positive number)

  3. Policy Improvement

    policy-stable $\leftarrow$ true

    For each $s \in \mathcal{S}$:

    old-action $\leftarrow$ $\pi(s)$

    ​ $\pi (s) \leftarrow \arg\max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

    ​ If old-action $\neq \pi(s)$, then policy-stable $\leftarrow$ false

    If policy-stable, then stop and return $V \approx v_{\star} \; \text{and} \; \pi \approx \pi_{\star}$; else go to 2.

Let us solve a problem using policy iteration. The problem is defined as follows:

Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited \$10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of \$2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number is $n$ is $\frac{\lambda^{n}}{n!}e^{-\lambda}$, where $\lambda$ is the expected number. Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be $\gamma = 0.9$ and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
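For example, with $\lambda = 3$ the probability of exactly three rental requests at the first location on a given day is
$$
\frac{3^{3}}{3!}e^{-3} = 4.5\,e^{-3} \approx 0.224.
$$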

The expected result is as follows:

[Figure 1: car_rental]

First, we define some constants describing the problem:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection used below
from math import exp, factorial

# maximum # of cars in each location
MAX_CARS = 20
# maximum # of cars to move during night
MAX_MOVE_OF_CARS = 5
# expectation for rental requests in first location
RENTAL_REQUEST_FIRST_LOC = 3
# expectation for rental requests in second location
RENTAL_REQUEST_SECOND_LOC = 4
# expectation for # of cars returned in first location
RETURNS_FIRST_LOC = 3
# expectation for # of cars returned in second location
RETURNS_SECOND_LOC = 2
DISCOUNT = 0.9
# credit earned by a car
RENTAL_CREDIT = 10
# cost of moving a car
MOVE_CAR_COST = 2

From the problem definition, we know that in this MDP the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. Each action is an integer: a positive value is the number of cars moved from the first location to the second, and a negative value is the number moved in the opposite direction.

# current policy
policy = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
# current state value
stateValue = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
# all possible states
states = []
# all possible actions
actions = np.arange(-MAX_MOVE_OF_CARS, MAX_MOVE_OF_CARS + 1)

For convenient visualization (Figure 1), we define a helper method:

# axes for printing use
AxisXPrint = []
AxisYPrint = []
for i in range(0, MAX_CARS + 1):
    for j in range(0, MAX_CARS + 1):
        AxisXPrint.append(i)
        AxisYPrint.append(j)
        states.append([i, j])

# plot a policy/state value matrix
figureIndex = 0
def prettyPrint(data, labels):
    global figureIndex
    fig = plt.figure(figureIndex)
    figureIndex += 1
    ax = fig.add_subplot(111, projection='3d')
    AxisZ = []
    for i, j in states:
        AxisZ.append(data[i, j])
    ax.scatter(AxisXPrint, AxisYPrint, AxisZ)
    ax.set_xlabel(labels[0])
    ax.set_ylabel(labels[1])
    ax.set_zlabel(labels[2])

Next, we define a Poisson function that returns the probability of a given count:

# An upper bound for the poisson distribution
# If n is greater than this value, then the probability of getting n is truncated to 0
POISSON_UP_BOUND = 11
# Probability for poisson distribution
# @lam: lambda should be less than 10 for this function
poissonBackup = dict()
def poisson(n, lam):
    global poissonBackup
    key = n * 10 + lam
    if key not in poissonBackup.keys():
        poissonBackup[key] = exp(-lam) * pow(lam, n) / factorial(n)
    return poissonBackup[key]
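As a quick sanity check (this snippet is an addition and assumes SciPy is installed), the memoized values should match SciPy's Poisson PMF:

```python
from scipy.stats import poisson as scipy_poisson

# both should print roughly 0.224 for n = 3, lambda = 3
print(poisson(3, RENTAL_REQUEST_FIRST_LOC))
print(scipy_poisson.pmf(3, RENTAL_REQUEST_FIRST_LOC))
```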

Now, the preparation is done. We’ll implement the policy iteration algorithm as follows:

newStateValue = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
improvePolicy = False
policyImprovementInd = 0
while True:
    if improvePolicy:
        # start policy improvement
        print('Policy improvement', policyImprovementInd)
        policyImprovementInd += 1
        newPolicy = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
        for i, j in states:
            actionReturns = []
            # go through all actions and select the best one
            for action in actions:
                if (action >= 0 and i >= action) or (action < 0 and j >= abs(action)):
                    actionReturns.append(expectedReturn([i, j], action, stateValue))
                else:
                    actionReturns.append(-float('inf'))
            bestAction = np.argmax(actionReturns)
            newPolicy[i, j] = actions[bestAction]
        # if policy is stable
        policyChanges = np.sum(newPolicy != policy)
        print('Policy for', policyChanges, 'states changed')
        if policyChanges == 0:
            policy = newPolicy
            break
        policy = newPolicy
        improvePolicy = False
    # start policy evaluation
    for i, j in states:
        newStateValue[i, j] = expectedReturn([i, j], policy[i, j], stateValue)
    if np.sum(np.abs(newStateValue - stateValue)) < 1e-4:
        stateValue[:] = newStateValue
        improvePolicy = True
        continue
    stateValue[:] = newStateValue

We can see the logic is the same as the pseudocode of the policy iteration algorithm. The core method in the code is expectedReturn(), which calculates the expected return of a state-action pair under the car rental dynamics.

# @state: [# of cars in first location, # of cars in second location]
# @action: positive if moving cars from first location to second location,
#          negative if moving cars from second location to first location
# @stateValue: state value matrix
def expectedReturn(state, action, stateValue):
    # initialize total return
    returns = 0.0
    # cost for moving cars
    returns -= MOVE_CAR_COST * abs(action)
    # go through all possible rental requests
    for rentalRequestFirstLoc in range(0, POISSON_UP_BOUND):
        for rentalRequestSecondLoc in range(0, POISSON_UP_BOUND):
            # moving cars
            numOfCarsFirstLoc = int(min(state[0] - action, MAX_CARS))
            numOfCarsSecondLoc = int(min(state[1] + action, MAX_CARS))
            # valid rental requests should be less than actual # of cars
            realRentalFirstLoc = min(numOfCarsFirstLoc, rentalRequestFirstLoc)
            realRentalSecondLoc = min(numOfCarsSecondLoc, rentalRequestSecondLoc)
            # get credits for renting
            reward = (realRentalFirstLoc + realRentalSecondLoc) * RENTAL_CREDIT
            numOfCarsFirstLoc -= realRentalFirstLoc
            numOfCarsSecondLoc -= realRentalSecondLoc
            # probability for current combination of rental requests
            prob = poisson(rentalRequestFirstLoc, RENTAL_REQUEST_FIRST_LOC) * \
                   poisson(rentalRequestSecondLoc, RENTAL_REQUEST_SECOND_LOC)
            # if set True, model is simplified such that the # of cars returned in daytime becomes constant
            # rather than a random value from poisson distribution, which will reduce calculation time
            # and leave the optimal policy/value state matrix almost the same
            constantReturnedCars = True
            if constantReturnedCars:
                # get returned cars, those cars can be used for renting tomorrow
                returnedCarsFirstLoc = RETURNS_FIRST_LOC
                returnedCarsSecondLoc = RETURNS_SECOND_LOC
                numOfCarsFirstLoc = min(numOfCarsFirstLoc + returnedCarsFirstLoc, MAX_CARS)
                numOfCarsSecondLoc = min(numOfCarsSecondLoc + returnedCarsSecondLoc, MAX_CARS)
                returns += prob * (reward + DISCOUNT * stateValue[numOfCarsFirstLoc, numOfCarsSecondLoc])
            else:
                numOfCarsFirstLoc_ = numOfCarsFirstLoc
                numOfCarsSecondLoc_ = numOfCarsSecondLoc
                prob_ = prob
                for returnedCarsFirstLoc in range(0, POISSON_UP_BOUND):
                    for returnedCarsSecondLoc in range(0, POISSON_UP_BOUND):
                        numOfCarsFirstLoc = numOfCarsFirstLoc_
                        numOfCarsSecondLoc = numOfCarsSecondLoc_
                        prob = prob_
                        numOfCarsFirstLoc = min(numOfCarsFirstLoc + returnedCarsFirstLoc, MAX_CARS)
                        numOfCarsSecondLoc = min(numOfCarsSecondLoc + returnedCarsSecondLoc, MAX_CARS)
                        prob = poisson(returnedCarsFirstLoc, RETURNS_FIRST_LOC) * \
                               poisson(returnedCarsSecondLoc, RETURNS_SECOND_LOC) * prob
                        returns += prob * (reward + DISCOUNT * stateValue[numOfCarsFirstLoc, numOfCarsSecondLoc])
    return returns

The comments in the code explain each step of the calculation. Finally, let us plot the results:

prettyPrint(policy, ['# of cars in first location', '# of cars in second location', '# of cars to move during night'])
prettyPrint(stateValue, ['# of cars in first location', '# of cars in second location', 'expected returns'])
plt.show()

The results are as follows:

Policy improvement 0
Policy for 332 states changed
Policy improvement 1
Policy for 286 states changed
Policy improvement 2
Policy for 83 states changed
Policy improvement 3
Policy for 19 states changed
Policy improvement 4
Policy for 0 states changed

[Figure: car_rental_policy — # of cars to move during the night for each state]

[Figure: car_rental_return — expected returns for each state]

One drawback of policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to $v_{\pi}$ occurs only in the limit. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration. It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:
$$
\begin{align}
v_{k+1}(s) &\doteq \max_{a} \mathbb{E}[R_{t+1} + \gamma v_{k}(S_{t+1}) \ | \ S_{t}=s, A_{t}=a] \\
&= \max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma v_{k}(s^{\prime})],
\end{align}
$$
for all $s \in \mathcal{S}$.

Value iteration

Initialize array $V$ arbitrarily (e.g. $V(s) = 0$ for all $s \in \mathcal{S^{+}}$)

Repeat

​ $\Delta \leftarrow 0$

​ For each $s \in \mathcal{S}$:

​ $v \leftarrow V(s)$

​ $V(s) \leftarrow \max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

​ $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$ (a small positive number)

Output a deterministic policy, $\pi \approx \pi_{\star}$, such that

​ $\pi(s) = \arg\max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

Let us use the value iteration algorithm to solve the Gambler’s Problem, which is defined as follows:

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of \$100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler’s capital, $s \in \{1, 2, \cdots, 99\}$, and the actions are stakes, $a \in \{0, 1, \cdots, \min(s, 100-s)\}$. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let $p_h$ denote the probability of the coin coming up heads. If $p_h$ is known, then the entire problem is known and it can be solved, for instance, by value iteration.

Now let us solve this problem using the value iteration algorithm.

First, we define some constants and auxiliary data structures:

import numpy as np
import matplotlib.pyplot as plt

# goal
GOAL = 100
# all states, including state 0 and state 100
states = np.arange(GOAL + 1)
# probability of head
headProb = 0.4
# optimal policy
policy = np.zeros(GOAL + 1)
# state value
stateValue = np.zeros(GOAL + 1)
stateValue[GOAL] = 1.0

The value iteration step:

# value iteration
while True:
    delta = 0.0
    for state in states[1:GOAL]:
        # get possible actions for current state
        actions = np.arange(min(state, GOAL - state) + 1)
        actionReturns = []
        for action in actions:
            actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])
        newValue = np.max(actionReturns)
        delta += np.abs(stateValue[state] - newValue)
        # update state value
        stateValue[state] = newValue
    if delta < 1e-9:
        break

Calculate the optimal policy:

# calculate the optimal policy
for state in states[1:GOAL]:
    actions = np.arange(min(state, GOAL - state) + 1)
    actionReturns = []
    for action in actions:
        actionReturns.append(headProb * stateValue[state + action] + (1 - headProb) * stateValue[state - action])
    # due to ties and floating-point precision, this may not reproduce the optimal policy in the book exactly
    policy[state] = actions[np.argmax(actionReturns)]
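The comment about ties deserves a note: several stakes can have exactly the same expected return, and tiny floating-point differences then decide which index argmax picks, which is why the spiky policy from the book may not be reproduced exactly. One common workaround, shown below as a hypothetical tweak rather than part of the original code, is to round the returns before breaking ties:

```python
# round away floating-point noise before breaking ties (hypothetical tweak)
policy[state] = actions[np.argmax(np.round(actionReturns, 5))]
```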

Print the results:

plt.figure(1)
plt.xlabel('Capital')
plt.ylabel('Value estimates')
plt.plot(stateValue)
plt.figure(2)
plt.scatter(states, policy)
plt.xlabel('Capital')
plt.ylabel('Final policy (stake)')
plt.show()

The results are as follows:

[Figure: gambler_policy — final policy (stake) as a function of capital]

[Figure: gambler_value — value estimates as a function of capital]