Until now almost all the methods have learned the values of actions and then selected actions based on their estimated action values; their policies would not even exist without the action-value estimates. In this post we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection. We use the notation $\boldsymbol{\theta} \in \mathbb{R}^d$ for the policy’s parameter vector. Thus we write $\pi(a|s, \boldsymbol{\theta}) = \text{Pr}(A_t=a | S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta})$ for the probability that action $a$ is taken at time $t$ given that the agent is in state $s$ at time $t$ with parameter $\boldsymbol{\theta}$. If a method uses a learned value function as well, then the value function’s weight vector is denoted $\mathbf{w} \in \mathbb{R}^m$, as in $\hat{v}(s, \mathbf{w})$.

In this chapter we consider methods for learning the policy parameter based on the gradient of some performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameter. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$ :
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}.$$
All methods that follow this general schema we call policy gradient methods, whether or not they also learn an approximate value function. Methods that learn approximations to both policy and value functions are often called actor–critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

### Policy Approximation

The most preferred actions in each state are given the highest probability of being selected, for example, according to an exponential softmax distribution:
$$\pi(a|s, \boldsymbol{\theta}) = \frac{\exp(h(s, a, \boldsymbol{\theta}))}{\sum_b \exp(h(s, b, \boldsymbol{\theta}))}.$$
For example, they might be computed by a deep neural network, where $\boldsymbol{\theta}$ is the vector of all the connection weights of the network. Or the preferences could simply be linear in features,
$$h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^{\top} \mathbf{x}(s, a).$$

We deﬁne the performance measure as the value of the start state of the episode. We can simplify the notation without losing any meaningful generality by assuming that every episode starts in some particular (non-random) state s0. Then, in the episodic case we deﬁne performance as
$$J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0),$$
where $v_{\pi_{\boldsymbol{\theta}}}$ is the true value function for $\pi_{\boldsymbol{\theta}}$, the policy determined by $\boldsymbol{\theta}$.

The policy gradient theorem is that
$$\nabla J(\boldsymbol{\theta}) = \sum_s \mu_{\pi}(s) \sum_a q_{\pi}(s, a) \nabla_{\boldsymbol{\theta}} \pi(a | s, \boldsymbol{\theta}),$$
where $\mu_{\pi}(s)$ we mentioned in earlier.

### REINFORCE: Monte Carlo Policy Gradient

\begin{align} \nabla J(\boldsymbol{\theta}) &= \sum_s \mu_{\pi}(s) \sum_a q_{\pi}(s, a) \nabla_{\boldsymbol{\theta}} \pi(a | s, \boldsymbol{\theta}) \\ &= \mathbb{E}_{\pi} \Bigg[ \gamma^t \sum_a q_{\pi}(S_t, a) \nabla_{\boldsymbol{\theta}} \pi(a | S_t, \boldsymbol{\theta}) \Bigg] \\ &= \mathbb{E}_{\pi} \Bigg[ \gamma^t \sum_a \pi(a|S_t, \boldsymbol{\theta}) q_{\pi}(S_t, a) \frac{\nabla_{\boldsymbol{\theta}} \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})} \Bigg] \\ &= \mathbb{E}_{\pi} \Bigg[ \gamma^t q_{\pi}(S_t, A_t) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \Bigg] \;\;\; \text{(replacing a by the sample } A_t \sim \pi \;) \\ &= \mathbb{E}_{\pi} \Bigg[ \gamma^t G_t \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \Bigg] \;\;\; \;\;\; \;\;\; \;\;\; \;\; (\text{because } \mathbb{E}_{\pi}[G_t|S_t,A_t]=q_{\pi}(S_t,A_t)). \end{align}

So we get
$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_{t} + \alpha \gamma^t G_t \frac{\nabla_{\boldsymbol{\theta}} \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})}.$$
This is shown explicitly in the boxed pseudocode below.

Notice that $\nabla \log x = \frac{\nabla x}{x}$.

### REINFORCE with Baseline

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\boldsymbol{\theta}) = \sum_s \mu_{\pi}(s) \sum_a \big(q_{\pi}(s, a) - b(s)\big) \nabla_{\boldsymbol{\theta}} \pi(a | s, \boldsymbol{\theta}).$$
The baseline can be any function, even a random variable, as long as it does not vary with $a$; the equation remains true, because the subtracted quantity is zero:
$$\sum_a b(s) \nabla_{\boldsymbol{\theta}} \pi(a|s, \boldsymbol{\theta}) = b(s) \nabla_{\boldsymbol{\theta}} \sum_a \pi(a|s, \boldsymbol{\theta}) = b(s) \nabla_{\boldsymbol{\theta}} 1 = 0 \;\;\;\; \forall s \in \mathcal{S}.$$
The update rule that we end up with is a new version of REINFORCE that includes a general baseline:
$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_{t} + \alpha \gamma^t \big(G_t-b(S_t) \big) \frac{\nabla_{\boldsymbol{\theta}} \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})}.$$
One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$, where $\mathbf{w} \in \mathbb{R}^m$ is a weight vector learned by one of the methods presented in previous posts. A complete pseudocode algorithm for REINFROCE with baseline is given in the box (use Monte Carlo method for learning the policy parameter and state-value weights).

### Actor-Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating a state from the estimated values of subsequent states), but only as a baseline for the state being updated. In
order to gain these advantages in the case of policy gradient methods we use actor-critic methods with a true bootstrapping critic.

One-step actor-critic methods replace the full return of REINFORCE with the one-step return (and use a learned state-value function as the baseline) as follow:
\begin{align} \boldsymbol{\theta}_{t+1} &\doteq \boldsymbol{\theta}_{t} + \alpha \gamma^t \Big( G_{t:t+1} - \hat{v}(S_t, \mathbf{w})\Big) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \\ &= \boldsymbol{\theta}_{t} + \alpha \gamma^t \Big( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\Big) \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \\ &= \boldsymbol{\theta}_{t} + \alpha \gamma^t \delta_t \frac{\nabla_{\boldsymbol{\theta}} \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})}. \end{align}
The natural state-value-function learning method to pair with this is semi-gradient TD(0). Pseudocode for the complete algorithm is given in the box below. Note that it is now a fully online, incremental algorithm, with states, actions, and rewards processed as they occur and then never revisited.