A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state $s$ and action $a$, the probability of each possible pair of next state and reward, $s^{\prime}$ and $r$, is denoted
$$p(s^{\prime}, r | s, a) \doteq \mathrm{Pr}\{S_{t+1}=s^{\prime}, R_{t+1}=r \ | \ S_{t}=s, A_{t}=a \}$$
Given that, one can compute anything else one might want to know about the environment, such as the expected rewards of state-action pairs,
$$r(s,a) \doteq \mathbb{E}[R_{t+1} \ | \ S_{t}=s, A_{t}=a] = \sum_{r \in \mathcal{R}}r\sum_{s^{\prime} \in \mathcal{S}}p(s^{\prime},r|s, a)$$
the state-transition probabilities,
$$p(s^{\prime}|s,a) \doteq \mathrm{Pr}\{S_{t+1}=s^{\prime} \ | \ S_{t}=s, A_{t}=a\} = \sum_{r \in \mathcal{R}} p(s^{\prime},r|s, a)$$
and the expected rewards for state-action-next-state triples,
$$r(s, a, s^{\prime}) \doteq \mathbb{E}[R_{t+1} \ | \ S_{t}=s, A_{t}=a, S_{t+1}=s^{\prime}] = \frac{\sum_{r \in \mathcal{R}}rp(s^{\prime},r|s,a)}{p(s^{\prime}|s,a)}$$
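As a quick illustration of these three quantities, they can all be computed directly from a tabular dynamics function. The sketch below uses a tiny made-up dynamics table (the state names, action, and probabilities are invented for illustration):

```python
# Toy dynamics table (invented for illustration):
# p[(s_next, r, s, a)] = Pr{S_{t+1}=s_next, R_{t+1}=r | S_t=s, A_t=a}
p = {
    ('s1', 1.0, 's0', 'a'): 0.7,
    ('s2', 0.0, 's0', 'a'): 0.3,
}

def expected_reward(s, a):
    """r(s, a): sum over r of r times sum over s' of p(s', r | s, a)."""
    return sum(prob * r for (s_next, r, s0, a0), prob in p.items()
               if s0 == s and a0 == a)

def transition_prob(s_next, s, a):
    """p(s' | s, a): sum over r of p(s', r | s, a)."""
    return sum(prob for (sn, r, s0, a0), prob in p.items()
               if sn == s_next and s0 == s and a0 == a)

print(expected_reward('s0', 'a'))        # 0.7
print(transition_prob('s1', 's0', 'a'))  # 0.7
```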
Almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(a|s)$ of taking action $a$ when in state $s$. Informally, the value of a state $s$ under a policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_{\pi}(s)$ formally as
$$v_{\pi}(s) \doteq \mathbb{E_{\pi}}[G_{t} \ | \ S_{t}=s] = \mathbb{E_{\pi}}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \ | \ S_{t}=s \right]$$
Note that the value of the terminal state, if any, is always zero. We call the function $v_{\pi}$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action a in state s under a policy $\pi$, denoted $q_{\pi}(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:
$$q_{\pi}(s,a) \doteq \mathbb{E_{\pi}}[G_{t} \ | \ S_{t}=s, A_{t}=a] = \mathbb{E_{\pi}}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \ | \ S_{t}=s, A_{t}=a\right]$$
We call $q_{\pi}$ the action-value function for policy $\pi$.

A fundamental property of the value functions used in reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:
\begin{align} v_{\pi}(s) &\doteq \mathbb{E_{\pi}}[G_{t} \ | \ S_{t}=s] \\ &= \mathbb{E_{\pi}}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \ | \ S_{t}=s \right] \\ &= \mathbb{E_{\pi}}\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \ | \ S_{t}=s \right] \\ &= \sum_{a} \pi(a|s) \sum_{s^{\prime}}\sum_{r}p(s^{\prime},r|s,a) \left[ r + \gamma \mathbb{E_{\pi}} \left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \ | \ S_{t+1}=s^{\prime} \right] \right] \\ &= \sum_{a} \pi(a|s) \sum_{s^{\prime}, r}p(s^{\prime},r|s,a) \left[ r + \gamma v_{\pi}(s^{\prime}) \right], \;\;\; \forall s \in \mathcal{S}, \end{align}
Equation (11) is the Bellman equation for $v_{\pi}$.

Figure 1 (left) shows a rectangular grid world representation of a simple finite MDP. The cells of the grid correspond to the states of the environment. In each cell, four actions are possible: north, south, east, and west, which deterministically move the agent one cell in the corresponding direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to $\mathrm{A^{\prime}}$. From state B, all actions yield a reward of +5 and take the agent to $\mathrm{B^{\prime}}$.

Figure 1

Suppose the agent selects all four actions with equal probability in all states. Figure 1 (right) shows the value function, $v_{\pi}$, for this policy, for the discounted reward case with $\gamma = 0.9$. This value function was computed by solving the system of linear equations (11).

Now let us solve this problem. First, we need to define the grid world in code.

This world is a 5 by 5 grid of cells, with four special cells: A, A′, B, and B′. Discount represents the $\gamma$ in equation (11). We know that the agent selects all four actions with equal probability in all states (cells). So we have:
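A minimal sketch of this setup in Python; the constant names and the (row, column) coordinates of A, A′, B, and B′ are my assumptions about the layout in Figure 1:

```python
WORLD_SIZE = 5   # the grid is 5 x 5
DISCOUNT = 0.9   # gamma in equation (11)

# (row, col) coordinates of the special cells; an assumed layout.
A_POS, A_PRIME_POS = (0, 1), (4, 1)
B_POS, B_PRIME_POS = (0, 3), (2, 3)

ACTIONS = ['N', 'S', 'W', 'E']

# actionProb[i][j] maps each action to its selection probability in
# cell (i, j); the equiprobable random policy assigns 0.25 to each.
actionProb = [[{a: 0.25 for a in ACTIONS} for _ in range(WORLD_SIZE)]
              for _ in range(WORLD_SIZE)]
```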

actionProb is a list of five items, one per row of the grid; each item is itself a list of five items, one per cell of that row. In every cell (state), each of the four directions can be selected with equal probability 0.25. Next, we define a weighted undirected graph: each node represents a cell of the grid, an edge between two nodes means the agent can move between those two cells, and the edge weight is the reward for making that move.

nextState and actionReward have the same nested structure as the actionProb table explained earlier.
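Here is a sketch of how these two tables might be built; the (row, col) coordinates and helper names are assumptions:

```python
WORLD_SIZE = 5
ACTIONS = ['N', 'S', 'W', 'E']
MOVES = {'N': (-1, 0), 'S': (1, 0), 'W': (0, -1), 'E': (0, 1)}
A_POS, A_PRIME_POS = (0, 1), (4, 1)   # assumed (row, col) layout
B_POS, B_PRIME_POS = (0, 3), (2, 3)

# nextState[i][j][a] is the successor cell; actionReward[i][j][a] the reward.
nextState = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
actionReward = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
for i in range(WORLD_SIZE):
    for j in range(WORLD_SIZE):
        for a in ACTIONS:
            if (i, j) == A_POS:          # every action: +10, teleport to A'
                nextState[i][j][a], actionReward[i][j][a] = A_PRIME_POS, 10.0
            elif (i, j) == B_POS:        # every action: +5, teleport to B'
                nextState[i][j][a], actionReward[i][j][a] = B_PRIME_POS, 5.0
            else:
                ni, nj = i + MOVES[a][0], j + MOVES[a][1]
                if 0 <= ni < WORLD_SIZE and 0 <= nj < WORLD_SIZE:
                    nextState[i][j][a], actionReward[i][j][a] = (ni, nj), 0.0
                else:                    # off the grid: stay put, reward -1
                    nextState[i][j][a], actionReward[i][j][a] = (i, j), -1.0
```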

Now we can solve this problem using equation (11):
\begin{align} v_{\pi}(s) &\doteq \sum_{a} \pi(a|s) \sum_{s^{\prime}, r}p(s^{\prime},r|s,a) \left[ r + \gamma v_{\pi}(s^{\prime}) \right], \;\;\; \forall s \in \mathcal{S}, \end{align}
Let us jump into the implementation detail.

The core code is:
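A self-contained sketch of that core loop follows; it repeats the table setup in condensed form, and all concrete names, coordinates, and the convergence threshold are assumptions:

```python
import numpy as np

WORLD_SIZE, DISCOUNT = 5, 0.9
ACTIONS = ['N', 'S', 'W', 'E']
MOVES = {'N': (-1, 0), 'S': (1, 0), 'W': (0, -1), 'E': (0, 1)}
A_POS, A_PRIME_POS = (0, 1), (4, 1)   # assumed (row, col) layout
B_POS, B_PRIME_POS = (0, 3), (2, 3)

# Rebuild the three tables described earlier (condensed).
actionProb = [[{a: 0.25 for a in ACTIONS} for _ in range(WORLD_SIZE)]
              for _ in range(WORLD_SIZE)]
nextState = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
actionReward = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
for i in range(WORLD_SIZE):
    for j in range(WORLD_SIZE):
        for a in ACTIONS:
            ni, nj = i + MOVES[a][0], j + MOVES[a][1]
            if (i, j) == A_POS:
                nextState[i][j][a], actionReward[i][j][a] = A_PRIME_POS, 10.0
            elif (i, j) == B_POS:
                nextState[i][j][a], actionReward[i][j][a] = B_PRIME_POS, 5.0
            elif 0 <= ni < WORLD_SIZE and 0 <= nj < WORLD_SIZE:
                nextState[i][j][a], actionReward[i][j][a] = (ni, nj), 0.0
            else:
                nextState[i][j][a], actionReward[i][j][a] = (i, j), -1.0

# Iterative policy evaluation: sweep until the values stop changing.
world = np.zeros((WORLD_SIZE, WORLD_SIZE))
while True:
    newWorld = np.zeros((WORLD_SIZE, WORLD_SIZE))
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            for a in ACTIONS:
                ni, nj = nextState[i][j][a]
                # += accumulates the sum over actions in equation (11)
                newWorld[i, j] += actionProb[i][j][a] * \
                    (actionReward[i][j][a] + DISCOUNT * world[ni, nj])
    if np.abs(newWorld - world).sum() < 1e-4:
        break
    world = newWorld
```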

The += implements the first sum in equation (11). Because this world is deterministic, once the current state (cell) and action are fixed, the next state and reward are fixed as well, so the inner sum $\sum_{s^{\prime},r} p(s^{\prime}, r | s, a)$ collapses to a single term with $p(s^{\prime}, r | s, a) = 1$.

The result is as follows:

We can see that the values of all states match Figure 1 (right).

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi^{\prime}$ if its expected return is greater than or equal to that of $\pi^{\prime}$ for all states. In other words, $\pi \ge \pi^{\prime}$ if and only if $v_{\pi}(s) \ge v_{\pi^{\prime}}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi_{\star}$. They share the same state-value function, called the optimal state-value function, denoted $v_{\star}$, and defined as
$$v_{\star}(s) \doteq \max_{\pi} v_{\pi}(s),$$
for all $s \in \mathcal{S}$.

Optimal policies also share the same optimal action-value function, denoted $q_{\star}$, and defined as
$$q_{\star}(s, a) \doteq \max_{\pi} q_{\pi}(s, a)$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$. For the state-action pair $(s, a)$, this function gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we can write $q_{\star}$ in terms of $v_{\star}$ as follows:
$$q_{\star}(s, a) = \mathbb{E}[R_{t+1} + \gamma v_{\star}(S_{t+1}) \ | \ S_{t}=s, A_{t}=a]$$
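Because $v_{\star}$ is the value function for a policy that always picks a best action, it satisfies the Bellman optimality equation, which replaces the policy-weighted average of equation (11) with a maximum over actions:
$$v_{\star}(s) = \max_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r|s, a)\left[r + \gamma v_{\star}(s^{\prime})\right]$$
This is the equation we solve numerically below.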
Suppose we solve the Bellman equation for $v_{\star}$ for the simple grid task introduced earlier and shown again in Figure 2 (left). Recall that state A is followed by a reward of +10 and a transition to state $\mathrm{A^{\prime}}$, while state B is followed by a reward of +5 and a transition to state $\mathrm{B^{\prime}}$. Figure 2 (middle) shows the optimal value function, and Figure 2 (right) shows the corresponding optimal policies. Where there are multiple arrows in a cell, any of the corresponding actions is optimal.

Figure 2

Now, let us solve this problem:

The core code is as follows:
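A self-contained sketch of this step, using the same assumed setup as before; the only change in the sweep is taking the maximum over actions:

```python
import numpy as np

WORLD_SIZE, DISCOUNT = 5, 0.9
ACTIONS = ['N', 'S', 'W', 'E']
MOVES = {'N': (-1, 0), 'S': (1, 0), 'W': (0, -1), 'E': (0, 1)}
A_POS, A_PRIME_POS = (0, 1), (4, 1)   # assumed (row, col) layout
B_POS, B_PRIME_POS = (0, 3), (2, 3)

# Rebuild the transition and reward tables (condensed).
nextState = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
actionReward = [[{} for _ in range(WORLD_SIZE)] for _ in range(WORLD_SIZE)]
for i in range(WORLD_SIZE):
    for j in range(WORLD_SIZE):
        for a in ACTIONS:
            ni, nj = i + MOVES[a][0], j + MOVES[a][1]
            if (i, j) == A_POS:
                nextState[i][j][a], actionReward[i][j][a] = A_PRIME_POS, 10.0
            elif (i, j) == B_POS:
                nextState[i][j][a], actionReward[i][j][a] = B_PRIME_POS, 5.0
            elif 0 <= ni < WORLD_SIZE and 0 <= nj < WORLD_SIZE:
                nextState[i][j][a], actionReward[i][j][a] = (ni, nj), 0.0
            else:
                nextState[i][j][a], actionReward[i][j][a] = (i, j), -1.0

# Value iteration: back up the best action value in each cell.
world = np.zeros((WORLD_SIZE, WORLD_SIZE))
while True:
    newWorld = np.zeros((WORLD_SIZE, WORLD_SIZE))
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            values = []
            for a in ACTIONS:
                ni, nj = nextState[i][j][a]
                values.append(actionReward[i][j][a] + DISCOUNT * world[ni, nj])
            # maximum over actions instead of the policy-weighted average
            newWorld[i, j] = max(values)
    if np.abs(newWorld - world).sum() < 1e-4:
        break
    world = newWorld
```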

The only difference between this code and the earlier code is that the earlier code takes the probability-weighted average over actions, while this code takes the maximum over them.

The result is below:

There is no doubt that the result matches Figure 2 (middle).