The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning, both because of their assumption of a perfect model and because of their great computational expense, but they provide an essential foundation for understanding the methods presented later. In fact, all of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. Here we show how DP can be used to compute the value functions defined earlier. As discussed there, we can easily obtain optimal policies once we have found the optimal value functions $v_{\star}$ or $q_{\star}$, which satisfy the Bellman optimality equations:
\begin{align} v_{\star}(s) &= \max_{a} \mathbb{E} [R_{t+1} + \gamma v_{\star}(S_{t+1}) \ | \ S_{t}=s, A_{t}=a] \\ &= \max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) \left[r + \gamma v_{\star}(s^{\prime})\right] \end{align}
or
\begin{align} q_{\star}(s, a) &= \mathbb{E} [R_{t+1} + \gamma \max_{a^{\prime}} q_{\star}(S_{t+1}, a^{\prime}) \ | \ S_{t}=s, A_{t}=a] \\ &= \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma \max_{a^{\prime}} q_{\star}(s^{\prime}, a^{\prime})] \end{align}
for all $s \in \mathcal{S}, \; a \in \mathcal{A(s)}, \; \text{and} \; s^{\prime} \in \mathcal{S^{+}}$. As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.

First we consider how to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. This is called policy evaluation in the DP literature. We also refer to it as the prediction problem. Recall that for all $s \in \mathcal{S}$,
\begin{align} v_{\pi}(s) &\doteq \mathbb{E}_{\pi} [R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \cdots \ | \ S_{t}=s] \\ &= \mathbb{E}_{\pi} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \ | \ S_{t}=s] \\ &= \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma v_{\pi}(s^{\prime})] \end{align}
If the environment's dynamics are completely known, then the equation above is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns (the $v_{\pi}(s), s \in \mathcal{S}$). In principle, its solution is a straightforward, if tedious, computation. For our purposes, iterative solution methods are most suitable. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for $v_{\pi}$ as an update rule:
\begin{align} v_{k+1}(s) &\doteq \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{k}(S_{t+1}) \ | \ S_{t}=s] \\ &= \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r|s, a) [r + \gamma v_{k} (s^{\prime})] \end{align}
This algorithm is called iterative policy evaluation.

Iterative policy evaluation

Input $\pi$, the policy to be evaluated

Initialize an array $V(s) = 0$, for all $s \in \mathcal{S^{+}}$

Repeat

$\Delta \leftarrow 0$

For each $s \in \mathcal{S}$:

$v \leftarrow V(s)$

$V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

$\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$ (a small positive number)

Output $V \approx v_{\pi}$

We can see that the algorithm used in the grid-world problem earlier is exactly this iterative policy evaluation.
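As a concrete illustration, here is a minimal Python sketch of iterative policy evaluation. It assumes a 4×4 grid world with an equiprobable random policy, a reward of -1 per step, and terminal states in two opposite corners; those grid-world details are my assumptions for the example, not something fixed by this section.

```python
import numpy as np

WORLD_SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 1.0          # the grid-world task is assumed undiscounted
THETA = 1e-4         # convergence threshold

def is_terminal(i, j):
    # terminal states assumed in two opposite corners
    return (i, j) in [(0, 0), (WORLD_SIZE - 1, WORLD_SIZE - 1)]

def step(i, j, action):
    """Deterministic transition; moving off the grid leaves the state unchanged."""
    ni, nj = i + action[0], j + action[1]
    if not (0 <= ni < WORLD_SIZE and 0 <= nj < WORLD_SIZE):
        ni, nj = i, j
    return ni, nj, -1    # reward of -1 on every transition

def iterative_policy_evaluation():
    V = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        delta = 0.0
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                if is_terminal(i, j):
                    continue
                v = V[i, j]
                # equiprobable random policy: pi(a|s) = 1/4 for every action
                V[i, j] = sum(0.25 * (r + GAMMA * V[ni, nj])
                              for ni, nj, r in (step(i, j, a) for a in ACTIONS))
                delta = max(delta, abs(v - V[i, j]))
        if delta < THETA:
            return V

print(np.round(iterative_policy_evaluation(), 1))
```

Note that this sketch updates $V$ in place (each sweep uses the newest available values), which also converges and is how the pseudocode above is usually implemented.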

Our reason for computing the value function for a policy is to help find better policies. Once a policy $\pi$ has been improved using $v_{\pi}$ to yield a better policy $\pi^{\prime}$, we can then compute $v_{\pi^{\prime}}$ and improve it again to yield an even better $\pi^{\prime\prime}$. We can thus obtain a sequence of monotonically improving policies and value functions:
$$\pi_{0} \stackrel{E}\longrightarrow v_{\pi_{0}} \stackrel{I}\longrightarrow \pi_{1} \stackrel{E}\longrightarrow v_{\pi_{1}} \stackrel{I}\longrightarrow \pi_{2} \stackrel{E}\longrightarrow \cdots \stackrel{I}\longrightarrow \pi_{\star} \stackrel{E}\longrightarrow v_{\star},$$
where $\stackrel{E}\longrightarrow$ denotes a policy evaluation and $\stackrel{I}\longrightarrow$ denotes a policy improvement. This way of finding an optimal policy is called policy iteration.

Policy iteration (using iterative policy evaluation)

1. Initialization

$V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A(s)}$ arbitrarily for all $s \in \mathcal{S}$

2. Policy Evaluation

Repeat

$\Delta \leftarrow 0$

For each $s \in \mathcal{S}$:

$v \leftarrow V(s)$

$V(s) \leftarrow \sum_{s^{\prime}, r} p(s^{\prime}, r | s, \pi(s)) [r + \gamma V(s^{\prime})]$

$\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$ (a small positive number)

3. Policy Improvement

policy-stable $\leftarrow$ true

For each $s \in \mathcal{S}$:

old-action $\leftarrow \pi(s)$

$\pi(s) \leftarrow \arg\max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

If old-action $\neq \pi(s)$, then policy-stable $\leftarrow$ false

If policy-stable, then stop and return $V \approx v_{\star} \; \text{and} \; \pi \approx \pi_{\star}$; else go to 2.

Let us solve a problem using policy iteration. The problem is defined as follows:

Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited \$10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of \$2 per car moved. We assume that the numbers of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number is $n$ is $\frac{\lambda^{n}}{n!}e^{-\lambda}$, where $\lambda$ is the expected number. Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be $\gamma=0.9$ and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.

The expected result is as follows:

Figure 1

First, we define some facts of this problem:

From the problem definition, we know that in this MDP the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. Each action is an integer: a positive number represents cars moved from the first location to the second, and a negative number the reverse.
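A sketch of these facts in Python might look as follows; the constant and variable names are my own choices, taken directly from the problem statement rather than from the original post's code.

```python
import numpy as np

MAX_CARS = 20                 # no more than 20 cars at each location
MAX_MOVE = 5                  # at most 5 cars moved overnight
RENTAL_REWARD = 10            # $10 credited per rented car
MOVE_COST = 2                 # $2 per car moved
GAMMA = 0.9                   # discount rate
LAMBDA_REQUEST_1, LAMBDA_REQUEST_2 = 3, 4   # expected rental requests per day
LAMBDA_RETURN_1, LAMBDA_RETURN_2 = 3, 2     # expected returns per day

# Actions: net number of cars moved from the first location to the second overnight
actions = np.arange(-MAX_MOVE, MAX_MOVE + 1)

# Value function and policy, indexed by (cars at location 1, cars at location 2)
value = np.zeros((MAX_CARS + 1, MAX_CARS + 1))
policy = np.zeros((MAX_CARS + 1, MAX_CARS + 1), dtype=int)
```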

For convenience in visualizing the result (Figure 1), we define a plotting method:
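A simple stand-in, sketched here with matplotlib since the original visualization code is not shown:

```python
import matplotlib.pyplot as plt

def plot_policy(policy, title):
    """Show the policy as a heat map: cell (i, j) holds the number of cars moved
    overnight when there are i cars at the first location and j at the second."""
    fig, ax = plt.subplots()
    im = ax.imshow(policy, origin='lower', cmap='coolwarm')
    ax.set_xlabel('# cars at the second location')
    ax.set_ylabel('# cars at the first location')
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    plt.show()
```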

Next, we define a Poisson function that returns the probability:
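One possible implementation, with a cache since the same probabilities are needed many times (the function name and cache are my own choices):

```python
from math import exp, factorial

poisson_cache = {}

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with expectation lam, cached for speed."""
    key = (n, lam)
    if key not in poisson_cache:
        poisson_cache[key] = lam ** n / factorial(n) * exp(-lam)
    return poisson_cache[key]
```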

Now the preparation is done. We implement the policy iteration algorithm as follows:
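Here is a sketch built on the constants defined above. The helper expected_return() computes the expected return of a state-action pair by summing over rental requests and returns at both locations, truncating each Poisson distribution at a small upper bound; the truncation bound, the helper names, and the thresholds are assumptions of mine, not the author's original code.

```python
POISSON_UPPER_BOUND = 11   # probabilities for n >= 11 are negligible and are truncated
THETA = 1e-4               # policy evaluation convergence threshold

def expected_return(state, action, value):
    """Expected return of moving `action` cars overnight from `state`
    (cars at location 1, cars at location 2), under the current value function."""
    returns = -MOVE_COST * abs(action)
    # cars available for rent at each location after the overnight move
    cars1 = min(state[0] - action, MAX_CARS)
    cars2 = min(state[1] + action, MAX_CARS)
    for req1 in range(POISSON_UPPER_BOUND):
        for req2 in range(POISSON_UPPER_BOUND):
            prob_req = poisson(req1, LAMBDA_REQUEST_1) * poisson(req2, LAMBDA_REQUEST_2)
            rentals1, rentals2 = min(cars1, req1), min(cars2, req2)
            reward = (rentals1 + rentals2) * RENTAL_REWARD
            for ret1 in range(POISSON_UPPER_BOUND):
                for ret2 in range(POISSON_UPPER_BOUND):
                    prob = prob_req * poisson(ret1, LAMBDA_RETURN_1) * poisson(ret2, LAMBDA_RETURN_2)
                    next1 = min(cars1 - rentals1 + ret1, MAX_CARS)
                    next2 = min(cars2 - rentals2 + ret2, MAX_CARS)
                    returns += prob * (reward + GAMMA * value[next1, next2])
    return returns

def policy_iteration(value, policy):
    while True:
        # 2. Policy evaluation: sweep until the value function stabilizes
        while True:
            delta = 0.0
            for i in range(MAX_CARS + 1):
                for j in range(MAX_CARS + 1):
                    v = value[i, j]
                    value[i, j] = expected_return((i, j), policy[i, j], value)
                    delta = max(delta, abs(v - value[i, j]))
            if delta < THETA:
                break
        # 3. Policy improvement: act greedily with respect to the current values
        policy_stable = True
        for i in range(MAX_CARS + 1):
            for j in range(MAX_CARS + 1):
                old_action = policy[i, j]
                valid = [a for a in actions if -j <= a <= i]   # cannot move cars that are not there
                returns = [expected_return((i, j), a, value) for a in valid]
                policy[i, j] = valid[int(np.argmax(returns))]
                if old_action != policy[i, j]:
                    policy_stable = False
        if policy_stable:
            return value, policy
```

If this runs too slowly, a common simplification is to treat the returned cars as constant at their expected numbers (3 and 2), which removes the two innermost loops.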

We can see the logic is the same as the pseudocode of the policy iteration algorithm. The core routine is expected_return(), which calculates the expected return of the car rental business for a given state and action.

The comments in the code explain each step. Finally, let us print the result:
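For example, assuming the sketches above:

```python
value, policy = policy_iteration(value, policy)
plot_policy(policy, 'Optimal policy (net cars moved from the first to the second location)')
print(np.round(value, 1))
```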

The results are as follows:

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to $v_{\pi}$ occurs only in the limit. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration. It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:
\begin{align} v_{k+1}(s) &\doteq \max_{a} \mathbb{E}[R_{t+1} + \gamma v_{k}(S_{t+1}) \ | \ S_{t}=s, A_{t}=a] \\ &= \max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma v_{k}(s^{\prime})], \end{align}
for all $s \in \mathcal{S}$.

Value iteration

Initialize array $V$ arbitrarily (e.g. $V(s) = 0$ for all $s \in \mathcal{S^{+}}$)

Repeat

$\Delta \leftarrow 0$

For each $s \in \mathcal{S}$:

$v \leftarrow V(s)$

$V(s) \leftarrow \max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

$\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$ (a small positive number)

Output a deterministic policy, $\pi \approx \pi_{\star}$, such that

$\pi(s) = \arg\max_{a}\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) [r + \gamma V(s^{\prime})]$

Let us use the value iteration algorithm to solve the Gambler's Problem. The problem is defined as follows:

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on the flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of \$100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler's capital, $s \in \{1, 2, \cdots, 99\}$, and the actions are stakes, $a \in \{0, 1, \cdots, \min(s, 100-s)\}$. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let $p_h$ denote the probability of the coin coming up heads. If $p_h$ is known, then the entire problem is known and it can be solved, for instance, by value iteration.

Now let us solve this problem using the value iteration algorithm.

First, we define some facts and auxiliary data structures:
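A possible set of definitions is sketched below. The problem statement leaves $p_h$ open, so a value of 0.4 is assumed here purely for illustration; the names are likewise my own.

```python
import numpy as np

GOAL = 100                       # the gambler wins on reaching $100
P_HEAD = 0.4                     # probability of heads (assumed value; p_h is not fixed by the text)
STATES = np.arange(GOAL + 1)     # capital 0, 1, ..., 100

# State values; the terminal goal state is given value 1 (the probability of winning from it)
value = np.zeros(GOAL + 1)
value[GOAL] = 1.0
```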

The value iteration step:
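A sketch of the sweep, updating the value array in place until the largest change falls below a small threshold (the threshold is an assumption):

```python
THETA = 1e-9   # stop sweeping when the largest change is below this threshold

def value_iteration():
    while True:
        delta = 0.0
        for s in STATES[1:GOAL]:                       # non-terminal states 1..99
            old_value = value[s]
            stakes = np.arange(1, min(s, GOAL - s) + 1)
            # expected value of each stake: win with probability P_HEAD, lose otherwise
            returns = [P_HEAD * value[s + a] + (1 - P_HEAD) * value[s - a] for a in stakes]
            value[s] = max(returns)
            delta = max(delta, abs(old_value - value[s]))
        if delta < THETA:
            break
```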

Calculate the optimal policy:
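Once the values have converged, the greedy policy can be read off with one more sweep. Rounding before the argmax is a small implementation choice to keep floating-point noise from breaking ties, not part of the algorithm itself.

```python
def optimal_policy():
    policy = np.zeros(GOAL + 1, dtype=int)
    for s in STATES[1:GOAL]:
        stakes = np.arange(1, min(s, GOAL - s) + 1)
        returns = [P_HEAD * value[s + a] + (1 - P_HEAD) * value[s - a] for a in stakes]
        # round so that floating-point noise does not decide ties between equal stakes
        policy[s] = stakes[int(np.argmax(np.round(returns, 5)))]
    return policy
```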

Print the results:
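For example:

```python
value_iteration()
policy = optimal_policy()
print('Value estimates:', np.round(value[1:GOAL], 4))
print('Optimal stakes: ', policy[1:GOAL])
```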

The results are as follows: