Summary of Papers

CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning



  • DenseNet (121 layers)
  • Batch Normalization
  • The weights of the network are randomly initialized
  • Trained end-to-end using Adam with standard pa- rameters (β1 = 0.9 and β2 = 0.999)
  • Batch size = 16
  • Oversample the minority (positive) class Buda et al., 2017
  • Use an initial learning rate of 0.01 that is decayed by a factor of 10 each time the validation loss plateaus after an epoch, and pick the model with the lowest validation loss.

Model Interpretation

To interpret the network predictions, we also produce heatmaps to visualize the areas of the image most in- dicative of the disease using class activation mappings (CAMs) Zhou et al., 2016





###Maximum Entropy Deep Inverse Reinforcement Learning [2016]





Maximum Entropy Inverse Reinforcement Learning Brian [2008][Repost]

this is a summary of Ziebart et al’s 2008 paper: Maximum Entropy Inverse Reinforcement Learning **. I found this is a good way for me to distill the essence of the paper. Since the Maxent algorithm is mostly cited by the later papers in IRL/imitation learning, I would like to look into details of this algorithm. Code is available at github. Hope this post can also help others who are interested in IRL.


The paper points out the limitations of the previous algorithms including LPIRL (Ng & Russell 2000), structured maximum margin prediction (MMP, Ratliff, Bagnell & Zinkevich 2006, Apprenticeship Learning vis IRL (Abbeel & Ng 2004) about feature counts and IRL

  • Both IRL and the matching of feature counts are ambiguous.
    • Each policy can be optimal for many reward functions.
    • many policies lead to the same feature counts.
  • The ambiguity of suboptimality is unresolved.


An MDP is a tuple $(S,A,P_{sa},\gamma,R)$

  • $S$ is a finite set of $N$ states.
  • $A={a_1,..,a_k}$ is a set of $k$ actions.
  • $P_{sa}(s^{\prime})$ is the state transition probability of landing at state $s^{\prime}$: $P(s,a,s^{\prime})$ upon taking the action aa at state $s$.
  • $\gamma \in [0,1)$ is the discount factor.
  • $R: S \rightarrow \mathbf{R}$ is the reward function.


  • $\zeta:\{(s, a)\}$ is a trajectory.
  • $\mathbf{f_s} \in \mathbf{R}^k$ is the feature vector of the state $s$.
  • $\theta \in \mathbf{R}^k$ reward function parameters.
  • $P(\zeta)$ probability of the trajectory $\zeta$ to occur
  • $P(s)$ the probatility of visiting state $s$ (state visitation frequency), $P(\zeta) = \prod_{s\in\zeta} P(s)$.


  • The reward of a trajectory is expressed as a linearly combination with feature counts

R(\zeta) = \theta ^T \mathbf{f}_{\zeta} = \sum_{s\in \zeta} \theta ^T \mathbf{f}_s

  • Principle of maximum entropy (Jaynes 1957): probability of a trajectory demonstrated by the expert is exponentially higher for higher rewards than lower rewards,

P(\zeta) \propto e^{R(\zeta)}


The Maxent algorithm learns from demonstrated expert trajectories with the objective being maximizing the likelihood of the demonstrated trajectories,
\theta^{\star} &= \text{argmax}_{\theta} L(\theta) \\
&= \text{argmax}_{\theta} \frac{1}{M}\text{log} P(\{\zeta\} | \theta)\\
&= \text{argmax}_{\theta} \frac{1}{M}\text{log} \prod_{\zeta} P(\zeta|\theta)\\
&= \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \text{log} P(\zeta|\theta)\\
&= \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \text{log} \frac{e^{R(\zeta)}}{Z}\\
&= \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} R(\zeta) - \text{log} Z\\
Where $M$ is the number of trajectories, $Z$ is the normalization term,
Z = \sum_{\zeta} e^{R(\zeta)}
\theta^{\star} = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} R(\zeta) - \text{log} \sum_{\zeta} e^{R(\zeta)}\\
\theta^{\star} = \text{argmax}_{\theta} \frac{1}{M}\sum_{\zeta} \theta ^T \mathbf{f}_{\zeta} - \text{log} \sum_{\zeta} e^{\theta ^T \mathbf{f}_{\zeta}}\\
And the objective is convex! (with the second term being log-sum-exp). We go ahead to differentiate the objective to find the gradients:
\nabla_{\theta} L &= \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \frac{1}{\sum_\zeta e^{R(\zeta)}} \sum_{\zeta} (e^{R(\zeta)\frac{d R(\zeta)}{d\theta}})\\
&= \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \frac{1}{\sum_\zeta e^{R(\zeta)}} \sum_{\zeta} (e^{R(\zeta)}\frac{d R(\zeta)
&= \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \sum_{\zeta}\frac{e^{R(\zeta)}}{\sum_\zeta e^{R(\zeta)}} \mathbf{f}_{\zeta}\\
&= \frac{1}{M}\sum_\zeta \mathbf{f}_{\zeta} - \sum_{\zeta} P(\zeta | \theta) \mathbf{f}_{\zeta}\\
Since the trajectories {ζ}{ζ} are consist of states,
\nabla_{\theta} L = \frac{1}{M}\sum_s \mathbf{f}_{s} - \sum_{s} P(s | \theta) \mathbf{f}_{s}
Where $\mathbf{f}_s$ is the feature vector for the state $s$. And
is the state visitation frequency for state $s$.

So far the main body of the algorithm is described. The only thing left is to compute the state visitation frequency (SVF) vector. To do so, we can use the following dynamic programming algorithm (for convienience we use $P(s)$ to denote SVF on state $s$).

We use μt(s)μt(s) to denote the prob of visiting $s$ at $t$ (obviously,
P(s) = \sum_t \mu_t(s)

  • solve the MDP using value iteration with the intermediate recovered rewards to get current optimal policy $\{\pi(a,s)\}$.
  • compute $\mu_1(s)$ using sampled trajectories
  • using DP to solve for the rest given optimal policy $\{\pi(a,s)\}$ and the transition dynamics $\{P_{sa}(s’)\}$

For $t = 1,..,T$
\mu_{t+1} (s) = \sum_{a}\sum_{s’} \mu_{t}(s’)\pi(a,s’)P_{sa}(s’)

  • And finally.

P(s) = \sum_t \mu_t(s)

One key things to note is that, the algorithm solves MDP in each iteration of training.

If the transition dynamics $\{P_{sa}(s’)\}$ is unknown, we can actually using Monte Carlo to estimate the SVF with the trajectories. This is much more easier, so the details are omitted. Plugging the SVF back to the gradients, we can use iterative gradient descent to solve for the parameters $\theta$.


As a final summary of the algorithm, here is the slide from UC Berkeley’s CS 294, Deep Reinforcement Learning course,

Maxent IRL

Strengths and Limitations

  • Strengths
    • scales to neural network costs (overcome the drawbacks of linear costs)
    • efficient enough for real robots
  • Limitations
    • requires repeatedly solving the MDP
    • assumes known dynamics


  • Ziebart et al’s 2008 paper: Maximum Entropy Inverse Reinforcement Learning **
  • UCB’s CS 294 DRL course, lecture on IRL

Code Details

def compute_state_visition_freq(P_a, gamma, trajs, policy, deterministic=True):
"""compute the expected states visition frequency p(s| theta, T)
using dynamic programming
P_a NxNxN_ACTIONS matrix - transition dynamics
gamma float - discount factor
trajs list of list of Steps - collected from expert
policy Nx1 vector (or NxN_ACTIONS if deterministic=False) - policy
p Nx1 vector - state visitation frequencies
N_STATES, _, N_ACTIONS = np.shape(P_a)
T = len(trajs[0])
# mu[s, t] is the prob of visiting state s at time t
mu = np.zeros([N_STATES, T])
for traj in trajs:
mu[traj[0].cur_state, 0] += 1
mu[:,0] = mu[:,0]/len(trajs)
for s in range(N_STATES):
for t in range(T-1):
if deterministic:
mu[s, t+1] = sum([mu[pre_s, t]*P_a[pre_s, s, int(policy[pre_s])] for pre_s in range(N_STATES)])
mu[s, t+1] = sum([sum([mu[pre_s, t]*P_a[pre_s, s, a1]*policy[pre_s, a1] for a1 in range(N_ACTIONS)]) for pre_s in range(N_STATES)])
p = np.sum(mu, 1)
return p
def maxent_irl(feat_map, P_a, gamma, trajs, lr, n_iters):
Maximum Entropy Inverse Reinforcement Learning (Maxent IRL)
feat_map NxD matrix - the features for each state
P_a NxNxN_ACTIONS matrix - P_a[s0, s1, a] is the transition prob of
landing at state s1 when taking action
a at state s0
gamma float - RL discount factor
trajs a list of demonstrations
lr float - learning rate
n_iters int - number of optimization steps
rewards Nx1 vector - recoverred state rewards
N_STATES, _, N_ACTIONS = np.shape(P_a)
# init parameters
theta = np.random.uniform(size=(feat_map.shape[1],))
# calc feature expectations
feat_exp = np.zeros([feat_map.shape[1]])
for episode in trajs:
for step in episode:
feat_exp += feat_map[step.cur_state,:]
feat_exp = feat_exp/len(trajs)
# training
for iteration in range(n_iters):
if iteration % (n_iters/20) == 0:
print 'iteration: {}/{}'.format(iteration, n_iters)
# compute reward function
rewards = np.dot(feat_map, theta)
# compute policy
_, policy = value_iteration.value_iteration(P_a, rewards, gamma, error=0.01, deterministic=False)
# compute state visition frequences
svf = compute_state_visition_freq(P_a, gamma, trajs, policy, deterministic=False)
# compute gradients
grad = feat_exp - feat_map.T.dot(svf)
# update params
theta += lr * grad
rewards = np.dot(feat_map, theta)
# return sigmoid(normalize(rewards))
return normalize(rewards)

Apprenticeship Learning via Inverse Reinforcement Learning [ICML 2004]


The reward function can be expressed as a linear combination of known features.

The value of a policy $\pi$:
E_{s_0 \sim D}[V^{\pi}(s_0)] &= E[\sum_{t=0}^{\infty}\gamma^t R(s_t) | \pi] \\
&= E[\sum_{t=0}^{\infty}\gamma^t w \cdot \phi(s_t)| \pi] \\
&= w \cdot E[\sum_{t=0}^{\infty}\gamma^t \phi(s_t)| \pi],
where vector of features $\phi: S \rightarrow [0, 1]^k$ or $\phi: S \times A \ \rightarrow [0, 1]^k$, $w \in \mathbb{R}^k$. In order to ensure that the rewards are bounded by 1, we also assume $| w |_1 \leq 1$.

The feature exceptions:
\mu(\pi) = E[\sum_{t=0}^{\infty}\gamma^t \phi(s_t)| \pi] \in \mathbb{R}^k.
So the value of a policy may be rewritten to
E_{s_0 \sim D}[V^{\pi}(s_0)] = w \cdot \mu(\pi)
We assume the expert policy is $\pi_E$ and we need to estimate the expert’s feature expectations $\mu_E = \mu(\pi_E)$. Specifically, given a set of $m$ trajectories $\{s_0^{(i)}, s_1^{(i)}, \cdots, \}_{i=1}^{m}$ generated by the expert, we denote the empirical estimate for $\mu_E$ by
\hat{\mu}_E = \frac{1}{m}\sum_{i=1}^m\sum_{t=0}^m \gamma^t \phi(s_t^{(i)}).
Furthermore, we could construct a new policy by linear combination some other policies. More specifically, we have
\mu(\pi_3) = \lambda \mu(\pi_1) + (1 - \lambda) \mu(\pi_2) .
Note that the randomization step selecting between $\pi_1$ and $\pi_2$ occurs only once at the start of a trajectory, and not on every step taken in the MDP.


We want to find a policy whose performance is close to that of the expert, that is, for any $w \in \mathbb{R}^k (|w|_1 \leq 1)$,
|w^T\mu(\tilde{\pi}) - w^T\mu_E| \leq |w|_2 |w^T\mu(\tilde{\pi}) - w^T\mu_E|_2 \leq 1 \cdot \epsilon \leq \epsilon.
So the problem is reduced to finding a policy $\tilde{\pi}$ that induces feature expectations $\mu(\tilde{\pi})$ close to $\mu_E$. We find such a policy as follows:

  1. Randomly pick some policy $\pi^{(0)}$. compute (or approximate via Monte Carlo) $\mu^{(0)} = \mu(\pi^{(0)})$, and set $i=1$.
  2. Solve optimization problem (Solver as same as SVM) (*):
  3. If $t^{(i)} \leq \epsilon$, then terminal.
  4. Using the RL algorithm, compute the optimal policy $\pi^{(i)}$ for the MDP using reward $R = (w^{(i)})^T\phi$.
  5. Compute (or estimate) $\mu^{(i)} = \mu(\pi^{(i)})$.
  6. Set $i = i + 1$, and go back to step $2$.

Finally, we can find the point closest to $\mu_E$ in the convex closure of $\mu^{(0)}, \cdots, \mu^{(n)}$ by solving the following QP:
\min| \mu_E - \mu |_2, \; \text{s.t.} \; \mu = \sum_i \lambda_i \mu^{(i)}, \lambda_i \geq 0, \sum_i \lambda_i = 1.
\max_{t, w} &\;t \\
s.t. &\; w^T\mu_E \geq w^T\mu^{(j)} + t, \; j = 0, \cdots, i-1 \\
&\; |w |_2 \leq 1.
Note that, step 2 could replace by a simpler way that no QP solver is needed:

  • Set $\bar{\mu}^{(i-1)} = \bar{\mu}^{(i-2)} + \frac{(\mu^{(i-1)} - \bar{\mu}^{(i-2)})^T(\mu_E - \bar{\mu}^{(i-2)})}{(\mu^{(i-1)} - \bar{\mu}^{(i-2)})^T(\mu^{(i-1)} - \bar{\mu}^{(i-2)})} (\mu^{(i-1)} - \bar{\mu}^{(i-2)})^T$
    • This computes the orthogonal projection of $\mu_E$ onto the line through $\bar{\mu}^{(i-2)}$ and $\mu^{(i-1)}$.
  • Set $w^{(i)} = \mu_E - \bar{\mu}^{(i-1)}$
  • Set $t^{(i)} = | \mu_E - \bar{\mu}^{(i-1)} |_2$

In the first iteration, we also set $w^{(1)} = \mu_E - \mu^{(0)}$ and $\bar{\mu}^{(0)} = \mu^{(0)}$.

Deep Reinforcement Learning with Double Q-learning

The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. In reference to both Double Q-learning and DQN, we refer to the resulting algorithm as Double DQN. Its update is the same as for DQN, but replacing the target $Y_t^{DQN}$ with
Y_t^{\text{DoubleDQN}} \equiv R_{t+1} + \gamma Q(S_{t+1}, {\arg \max}_a Q(S_{t+1},a; \boldsymbol{\theta}_t),\boldsymbol{\theta}_t^{\mathbf{-}}).
In comparison to Double Q-learning
Y_t^{\text{DoubleDQN}} \equiv R_{t+1} + \gamma Q(S_{t+1}, {\arg \max}_a Q(S_{t+1},a; \boldsymbol{\theta}_t),\boldsymbol{\theta}_t^{\boldsymbol{\prime}}),
the weights of the second network $\boldsymbol{\theta_t^{\prime}}$ are replaced with the weights of the target network $\boldsymbol{\theta_t^{-}}$ for the evaluation of the current greedy policy. The update to the target stays unchanged from DQN, and remains a periodic copy of the online network.

Prioritized Experience Replay


where $p_i > 0$ is the priority of transition $i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha=0$ corresponding to the uniform case.

The first we consider is the direct, proportional prioritization where $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD-error of transition $i$ and $\epsilon$ is a small positive constant that prevent the edge-case of transitions not being revisited once their error is zero. The second variant is an indirect, rand-based prioritization where $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}_i$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$

Dueling Network Architectures for Deep Reinforcement Learning


Let us consider the dueling network shown in above, where we make one stream of fully-connected layers output a scalar $V(s;\theta,\beta)$, and the other stream output an $\mathcal{A}$-dimensional vector $A(s, a; \theta, \alpha)$. Here, $\theta$ denotes the parameters of the convolutional layers, while $\alpha$ and $\beta$ are the parameters of the two streams of fully-connected layers. Using the definition of advantage, we might be tempted to construct the aggregating module as follows:
Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s, a; \theta, \alpha),
Note that this expression applies to all $(s, a)$ instances; that is, to express equation above in matrix form we need to replicate the scalar, $V(s;\theta,\beta)$, $|\mathcal{A}|$ times.

Equation above is unidentifiable in the sense that given $Q$ we cannot recover V and A uniquely. To see this, add a constant to $V(s;\theta,\beta)$ and subtract the same constant from $A(s, a; \theta, \alpha)$. This constant cancels out resulting in the same $Q$ value. This lack of identifiability is mirrored by poor practical performance when this equation is used directly.

To address this issue of identifiability, we can replace the equation above to this one:
Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a^{\prime}} A(s, a^{\prime}; \theta, \alpha) \Big).

Learning Tetris Using the Noisy Cross-Entropy Method

the paper works on solving Tetris with modified cross entropy method. original CE method in reinforcement learning usually results in early convergence.

Cross entropy method in reinforcement learning

  • first we start with a random uniform distribution F_0
  • drawn from F_0 and get N samples θ_0, θ_1, …
  • choose the top K samples that get the highest scores and use these selected sample(θ_0, θ_1, …) update distribution and get F_1


  • add noise to the cross-entropy method to prevent early converge
  • if we decrease the noise, which is only depend on time steps, the performance can even be better.
  • noise 👉 prevent early converge


  • can view the noise apply to std as ensure enough exploration

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

We study the off-policy value evaluation problem, where one aims to estimate the value of a policy with data collected by another policy.

There are roughly two classes of approaches to off-policy value evaluation. The first is to fit an MDP model from data via regression, and evaluate the policy against the model; The second class of approaches are based on the idea of importance sampling (IS), which corrects the mismatch between the distributions induces by the target policy ang by the behavior policy.

Importance Sampling Estimators

The basic IS Estimator

The IS estimator provides an unbiased estimate of $\pi_1$’s value by averaging the following function of each trajectory $(s_1, a_1, r_1, \cdots, s_{H+1})$ in the data: define the per-step importance ratio as $\rho_t := \pi_1(a_t|s_t)/\pi_0(a_t|s_t)$, and the cumulative importance ratio $\rho_{1:t}:=\prod_{t^{\prime}=1}^{t}\rho_{t^{\prime}}$; the basic (trajectory-wise) IS estimator, and an improved step-wise version are given as follows:
V_{\text{IS}} := \rho_{1:H} \cdot (\sum_{t=1}^{H} \gamma^{t-1} r_t),

V_{\text{step-IS}} := \sum_{t=1}^{H} \gamma^{t-1}\rho_{1:t}r_t.

Given a dataset $D$, the IS estimator is simply the average estimate over the trajectories, namely $\frac{1}{|D|}\sum_{i=1}V_{\text{IS}}^{(i)}$, where $|D|$ is the number of trajectories in $D$ and $V_{\text{IS}}^{(i)}$ is IS applied to the $i$-th trajectory.

The weighted importance sampling (WIS)

Define $w_t = \sum_{i=1}^{|D|}\rho_{1:t}^{(i)}/|D|$ as the average cumulative importance ratio at horizon $t$ in a dataset $D$, then for each trajectory in $D$, the estimates given by trajectory-wise and step-wise WIS are respectively
V_{\text{WIS}} = \frac{\rho_{1:H}}{w_{H}}(\sum_{t=1}^H\gamma^{t-1}r_t),

V_{\text{step-WIS}} = \sum_{t=1}^H\gamma^{t-1}\frac{\rho_{1:t}}{w_t}r_t.

The doubly robust estimator for contextual bandits

V_{\text{DR}} := \widehat{V}(s) + \rho(r - \widehat{R}(s, a)),

where $\rho := \frac{\pi_1(a|s)}{\pi_0(a|s)}$ and $\widehat{V}(s) := \sum_a \pi_1(a|s)\widehat{R}(s, a)$. If $\widehat{R}(s, a)$ is a good estimator of $r$, the magnitude of $r - \widehat{R}(s, a)$ can be much smaller than of $r$. Consequently, the variance of $\rho(r - \widehat{R}(s, a))$ tends to be smaller than that of $\rho r$, implying that DR often has a lower variance that IS (Dud´ık et al., 2011).

DR Estimator for the Sequential Setting

A key observation is that Eqn.(6) can be written in a recursive form. Define $V_{\text{step-IS}}^0 := 0$, and for $t=1, \cdots, H$,
V_{\text{step-IS}}^{H+1-t} := \rho_t (r_t + \gamma V_{\text{step-IS}}^{H-t}).
We can apply the bandit DR estimator at each horizon, and obtain the following unbiased estimator: define $V_{\text{DR}}^0 := 0$, and
V_{\text{DR}}^{H+1-t} := \widehat{V}(s_t) + \rho_t (r_t + \gamma V_{\text{DR}}^{H-t} - \widehat{Q}(s_t, a_t)).
The DR estimator of the policy value is then $V_{\text{DR}} := V_{\text{DR}}^H$.

One modification of DR that further reduces the variance in state transitions is:
V_{\text{DR-v2}}^{H+1-t} := \widehat{V}(s_t) + \rho_t (r_t + \gamma V_{\text{DR-v2}}^{H-t} - \widehat{Q}(s_t, a_t) - \gamma \widehat{V}(s_{t+1})).

Continuous State-Space Models for Optimal Sepsis Treatment: a Deep Reinforcement Learning Approach


State Autoencoder + Deep Q-Network + Dueling Network + Double Q Learning + Prioritized Experience Replay

Note: loss function have changed:
\mathcal{L}(\theta) = \mathbb{E}[(Q_{\text{double-target}} - Q(s, a;\theta))^2] - \lambda \cdot \max(|Q(s, a;\theta) - R_{\max}|, 0),
where $R_{\max} = \pm15, Q_{\text{double-target}}=r+\gamma Q(s^{\prime}, \arg\max_{a^{\prime}}Q(s^{\prime}, a^{\prime};\theta);\theta)$. $\theta$ are the weights used to parameterize the main network, and $\theta^{\prime}$ are the weights used to parameterize the target network.



Data were aggregated into windows of 4 hours, with the mean or sum being recorded (as appropriate) when several data points were present in one window. Variables with excessive missingness were removed, and any remaining missing values were imputed with k-nearest neighbors, yielding a 47 × 1 feature vector for each patient at each timestep. Values exceeding clinical limits were capped, and capped data were normalized per-feature to zero mean and unit variance.

Feature list

The physiological features used in our model are presented below.

Shock Index, Elixhauser, SIRS, Gender, Re-admission, GCS - Glasgow Coma Scale, SOFA - Sequential Organ Failure Assessment, Age

Lab Values
Albumin, Arterial pH, Calcium, Glucose, Hemoglobin, Magnesium, PTT - Partial Thromboplastin Time, Potassium, SGPT - Serum Glutamic-Pyruvic Transaminase, Arterial Blood Gas, BUN - Blood Urea Nitrogen, Chloride, Bicarbonate, INR -International Normalized Ratio, Sodium, Arterial Lactate, CO2, Creatinine, Ionised Calcium, PT - Prothrombin Time, Platelets Count, SGOT - Serum Glutamic-Oxaloacetic Transaminase, Total bilirubin, White Blood Cell Count

Vital Signs
Diastolic Blood Pressure, Systolic Blood Pressure, Mean Blood Pressure, PaCO2, PaO2, FiO2, PaO/FiO2 ratio, Respiratory Rate, Temperature (Celsius), Weight (kg), Heart Rate, SpO2

Intake and Output Events
Fluid Output - 4 hourly period, Total Fluid Output, Mechanical Ventilation


Discounted Returns vs. Mortality

We bin Q-values obtained via SARSA (baseline) on the test set into discrete buckets, and for each, if it is part of a trajectory where a patient died, we assign it a label of 1; if the patient survived, we assign a label of 0.


State Representation


Doubly Robust off-policy value evaluation


Action differences


Action vs. Mortality


A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units


Preprocessing using Gaussian Processes

Denoting the observations of the vital signs by $v$ and the measurement time $t$, we model
v = f(t) + \varepsilon,
where $\varepsilon$ vector represents i.i.d Gaussian noise, and $f(t)$ is the latent noise-free function we would like to estimate. We put a GP prior on the latent function $f(t)$:
f(t) \sim \mathcal{GP}(m(t), \mathcal{K}(t, t^{\prime})),
where $m(t)$ is the mean function and $\mathcal{K}(t, t^{\prime})$ is the covariance function or kernel. In this work, we use a multi-output GP to account for temporal correlations between physiological signals during interpolation. We adapt the framework in Cheng et al. [2017] to impute the physiological signals jointly by estimating covariance structures between them, excluding the sparse prior settings (please read the paper for more details).

Note: Just for continuous features.


Reward Function



Fitted Q-iteration with sampling (Omitted)


Features importance


The outcome of difference

We divide the 664 test admissions into six groups according to the fraction of
FQI policy actions that differ from the hospital’s policy: $\Delta_0$ comprises admissions in which the true and recommended policies agree perfectly, while those in $\Delta_5$ show the greatest deviation. Plotting the distribution of the number of reintubations and the mean accumulated reward over patient admissions respectively, for all patients in each set.


Deep Reinforcement Learning, Decision Making and Control (ICML 2017 Tutorial)

Model-Free RL

policy gradients

REINFORCE algorithm:

  1. sample ${\tau^i}$ from $\pi_\theta(s_t|a_t)$
  2. $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i))(\sum_t r(s_t^i, a_t^i))$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Reduce variance

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i))(\sum_{t^{\prime}=1}^T r(s_{t^{\prime}}^i, a_{t^{\prime}}^i))$

“reward to go”


one baseline: average reward.

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i\nabla_\theta \log \pi_\theta(\tau) [r(\tau) - b]$

$b = \frac{1}{N} \sum_{i=1}^N r(\tau)$

Control variates (see also: Gu et al. 2016 (Q-Prop))

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)) \Big (\hat{Q}_t^i - b(s_{t^{\prime}}^i, a_{t^{\prime}}^i) \Big ) + \frac{1}{N} \sum_i \sum_t \nabla_\theta E_{a \sim \pi_\theta(a_t | s_t^i)}[b(s_t^i, a_t)]$

covatriant/natural policy gradient

natural gradient: pick $\alpha$

$\theta \leftarrow \theta + \alpha \mathbf{F}^{-1}\nabla_\theta J(\theta)$

$\mathbf{F} = E_{\pi_{\theta}}[\log\pi_{\theta}(a|s)\log\pi_{\theta}(a|s)^T]$

trust region policy optimization: pick $\epsilon$

$\theta^{\prime} \leftarrow \arg\max_{\theta^{\prime}} (\theta^{\prime} - \theta)^T \nabla_\theta J(\theta) \; \text{s.t.} (\theta^{\prime} - \theta)^T\mathbf{F}(\theta^{\prime} - \theta) \leq \epsilon$

Policy gradients suggested readings
Classic papers
​ • Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces REINFORCE algorithm
​ • Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient (not the first paper on this! see actor-critic section later)
​ • Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
Deep reinforcement learning policy gradient papers
​ • L. & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient (unrelated to later discussion of guided policy search)
​ • Schulman, L., Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
​ • Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient

Actor-Critic algorithms

Value function fitting

batch actor-critic algorithm:

  1. sample $\{s_i, a_i\}$ from $\pi_\theta(a|s)$
  2. fit $\hat{V_\phi}(s)$ to sampled reward sums (*)
  3. evaluate $\hat{A^{\theta}}(s_i, a_i) = r(s_i, a_i) + \hat{V_\phi}(s_i^{\prime}) - \hat{V_\phi}(s_i)$
  4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a^i|s^i) \hat{A}(s_i, a_i) $
  5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

(*) $y_{i,t} \approx \sum_{t=1}^T r(s_t^i, a_{t}^i)$

$\mathcal{L}(\phi) = \frac{1}{2}\sum_i | \hat{V_\phi}(s_i) - y_i|^2$

Discount factors

$y_{i,t} \approx r(s_t^i, a_{t}^i) + \gamma \hat{V_\phi}(s_{t+1}^i)$ (0.99 works well)

online actor-critic algorithm:

  1. take action $\mathbf{a} \sim \pi_\theta(a|s)$, get $(s, a, s^{\prime}, r)$
  2. update $\hat{V_\phi^{\pi}}(s)$ using target $r + \gamma \hat{V_\phi^{\pi}}(s^{\prime})$
  3. evaluate $\hat{A^{\theta}}(s_i, a_i) = r(s_i, a_i) + \hat{V_\phi}(s_i^{\prime}) - \hat{V_\phi}(s_i)$
  4. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a^i|s^i) \hat{A}(s_i, a_i) $
  5. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Step 2 and 4 works best with a batch.

We can design better estimators (for both batch and online). See Schulman, Moritz, L. Jordan, Abbeel ‘16: Generalized advantage estimation.

We can use single network for actor and critic.

Actor-critic suggested readings

Classic papers
​ • Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for
reinforcement learning with function approximation: actor-critic algorithms with
value function approximation
Deep reinforcement learning actor-critic papers
​ • Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016).
Asynchronous methods for deep reinforcement learning: A3C – parallel online
​ • Schulman, Moritz, L., Jordan, Abbeel (2016). High-dimensional continuous
control using generalized advantage estimation: batch-mode actor-critic with
blended Monte Carlo and function approximator returns
​ • Gu, Lillicrap, Ghahramani, Turner, L. (2017). Q-Prop: sample-efficient policy-
gradient with an off-policy critic: policy gradient with Q-function control variate

Value Functions


  1. take some action $a_i$ and observe $(s_i, a_i, s^{\prime}_i, r_i)$, add it to $\mathcal{R}$
  2. sample mini-batch $(s_j, a_j, s^{\prime}_j, r_j)$ from $\mathcal{R}$ uniformly
  3. compute $y_j = r_j + \gamma \max_{a^{\prime}_j} Q_{\phi^{\prime}} (s_j^{\prime},a_j^{\prime})$ using target network $Q_{\phi^{\prime}} $
  4. $\phi \leftarrow \phi - \alpha \sum_j \frac{dQ_{\phi}}{d\phi} (Q_\phi(s_j, a_j) - y_j)$
  5. update $\phi^{\prime}$: copy $\phi$ every $N$ steps, or Polyak average $\phi^{\prime} \leftarrow \tau\phi^{\prime} + (1-\tau)\phi$

Q-Learning with continuous actions

Option 1: use function class that is easy to optimize (Gu, Lillicrap, Sutskever, L., ICML 2016)

$Q(s, a | \theta^Q) = A(s, a|\theta^A) + V(x|\theta^V)$

$A(s, a|\theta^A) = -\frac{1}{2}(s-\mu(x|\theta^{\mu}))^TP(s|\theta^P)(s-\mu(x|\theta^{\mu}))$

Option 2: learn an approximate maximizer DDPG (Lillicrap et al., ICLR 2016)

$\mu_\theta(s) = a, \theta \leftarrow \arg\max_\theta Q_{\phi}(s, \mu_{\theta}(s))$

$y_j = r_j + \gamma \max_{a^{\prime}_j} Q_{\phi^{\prime}}(s_j^{\prime}, \mu_{\theta(s_j^{\prime})})$

Q-learning suggested readings
Classic papers
​ • Watkins. (1989). Learning from delayed rewards: introduces Q-learning
​ • Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural
Deep reinforcement learning Q-learning papers
​ • Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement
learning: early image-based Q-learning method using autoencoders to construct
​ • Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-
learning with convolutional networks for playing Atari.
​ • Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
​ • Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
​ • Gu, Lillicrap, Stuskever, L. (2016). Continuous deep Q-learning with model-based
acceleration: continuous Q-learning with action-quadratic value functions.
​ • Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network
architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.

Soft optimality (WIG)

RL inference in a graphical model

soft Q-learning

$\phi \leftarrow \phi + \alpha \nabla_\phi Q_\phi (s, a) (r(s, a) + \gamma V(s^{\prime}) - Q_\phi(s, a))$

target value: $V(s^{\prime}) = \text{soft}\max_{a^{\prime}} Q_\phi(s^{\prime}, a^{\prime}) = \log \int \exp(Q_\phi(s^{\prime}, a^{\prime})da^{\prime}$

$\pi(a|s) = \exp(Q_\phi(s,a)-V(s)) = \exp(A(s, a))$

policy gradient with soft optimality (WIG)

Soft optimality suggested readings
• Todorov. (2006). Linearly solvable Markov decision problems: one framework for
reasoning about soft optimality.
• Todorov. (2008). General duality between optimal control and estimation: primer on the equivalence between inference and control.
• Kappen. (2009). Optimal control as a graphical model inference problem: frames control as an inference problem in a graphical model.
• Ziebart. (2010). Modeling interaction via the principle of maximal causal entropy:
connection between soft optimality and maximum entropy modeling.
• Rawlik, Toussaint, Vijaykumar. (2013). On stochastic optimal control and reinforcement learning by approximate inference: temporal difference style algorithm with soft optimality.
• Haarnoja, Tang, Abbeel, L. (2017). Reinforcement learning with deep energy based
models: soft Q-learning algorithm, deep RL with continuous actions and soft optimality
• Nachum, Norouzi, Xu, Schuurmans. (2017). Bridging the gap between value and policy based reinforcement learning.
• Schulman, Abbeel, Chen. (2017). Equivalence between policy gradients and soft Q-

Inverse RL

Maximum Entropy Inverse RL (Ziebart et al. ’08)
  1. Initialize $\psi$, gather demonstrations $\mathcal{D}$
  2. Solve for optimal policy $\pi(a|s)$ w.r.t. reward $r_\psi$
  3. Solve for state visitation frequencies $p(s|\psi)$
  4. Compute gradient $\nabla_\psi \mathcal{L} = -\frac{1}{|\mathcal{D}|}\sum_{\tau_d \in \mathcal{D}} \frac{dr_\psi}{d\psi}(\tau_d)\sum_{s} \frac{dr_\psi}{d\psi}(s)$
  5. Update $\psi$ with one gradient step using $\nabla_\psi\mathcal{L}$

guided cost learning & generative adversarial imitation algorithm (Finn et al. ICML ’16, Ho & Ermon NIPS ’16)

Suggested Reading on Inverse RL Classic Papers

  • Abbeel & Ng ICML ’04. Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement learning
  • Ziebart et al. AAAI ’08. Maximum Entropy Inverse Reinforcement Learning.
    Introduction of probabilistic method for inverse reinforcement learning
    Modern Papers
  • Wulfmeier et al. arXiv ’16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt IRL using deep reward functions
  • Finn et al. ICML ’16. Guided Cost Learning. Sampling-based method for MaxEnt IRL
    that handles unknown dynamics and deep reward functions
  • Ho & Ermon NIPS ’16. Generative Adversarial Imitation Learning. IRL method
    building on Abbeel & Ng ’04 using generative adversarial networks

Further Reading on Inverse RL

  • MaxEnt-based IRL: Ziebart et al. AAAI ’08, Wulfmeier et al. arXiv ’16, Finn et al. ICML ‘16
  • Adversarial IRL: Ho & Ermon NIPS ’16, Finn, Christiano et al. arXiv ’16, Baram et al. ICML ’17
  • Handling multimodality: Li et al. arXiv ’17, Hausman et al. arXiv ’17, Wang, Merel et al. ‘17
  • Handling domain shift: Stadie et al. ICLR ‘17

Model-bases RL (WIS)

Guided Policy Search (GPS)

Suggested Reading on Model-based RL

  • Tassa et al. IROS ’12. Synthesis and Stabilization of Complex Behaviors. Good

introduction to MPC with a known model

  • Levine, Finn et al. JMLR ’16. End-to-End Learning of Deep Visuomotor Policies.

    Thorough paper on guided policy search for learning real robotic vision-based skills

  • Heess et al. NIPS ’15. Stochastic Value Gradients. Backdrop through dynamics to assist model-free learner

  • Watter et al. NIPS ’15. Embed-to-Control, Learn latent space and use model-baed RL in learned latent space to reach image of goal

  • Finn & Levine ICRA ’17. Deep Visual Foresight for Planning Robot Motion. Plan using learned action-conditioned video prediction model

Further Reading on Model-based RL

  • Use known model: Tassa et al. IROS ’12, Tan et al. TOG ’14, Mordatch et al. TOG ‘14
  • Guided policy search: Levine, Finn et al. JMLR ’16, Mordatch et al. RSS ’14, NIPS ‘15
  • Backprop through model: Deisenroth et al. ICML ’11, Heess et al. NIPS ’15, Mishra et al. ICML ’17, Degrave et al. ’17, Henaff et al. ‘17
  • MBRL in latent space: Watter et al. NIPS ’15, Finn et al. ICRA ‘16
  • MPC with deep models: Lenz et al. RSS ’15, Finn & Levine ICRA ‘17
  • Combining Model-Based & Model-Free:
    • use roll-outs from model as experience: Sutton ’90, Gu et al. ICML ‘16
    • use model as baseline: Chebotar et al. ICML ‘17
    • use model for exploration: Stadie et al. arXiv ’15, Oh et al. NIPS ’16
    • model-free policy with planning capabilities: Tamar et al. NIPS ’16, Pascanu et al. ‘17
    • model-based look-ahead: Guo et al. NIPS ’14, Silver et al. Nature ‘16