# Decision Making & Reinforcement Learning

Supervised Learning: given pairs $(x, y)$, learn the function $y = f(x)$.

Unsupervised Learning: given only $x$, find structure in the data via some $f(x)$ (e.g. clusters).

Reinforcement Learning: still learn $y = f(x)$, but the feedback is a reinforcement signal $z$ (rewards) rather than correct labels.

## Markov Decision Process

States: $S$

Model: $T(s, a, s^{\prime}) = \Pr(s^{\prime} \mid s, a)$

Actions: $A(s)$ (state-dependent) or a fixed set $A$

Reward: $R(s), R(s, a), R(s, a, s^{\prime})$

Policy: $\pi(s) \rightarrow a$; the optimal policy is denoted $\pi^{*}$

### Sequences of Rewards: Assumptions

• Infinite horizons: the agent acts forever, so utility must be defined over infinite state sequences.

• Utility of sequences (stationarity of preferences):

if $U(s_0, s_1, s_2, \cdots) > U(s_0, s^{\prime}_1, s^{\prime}_2, \cdots)$

then $U(s_1, s_2, \cdots) > U(s^{\prime}_1, s^{\prime}_2, \cdots)$

Stationarity forces utility to take the discounted-sum form:

$$U(s_0, s_1, s_2, \cdots)=\sum_{t=0}^{\infty}\gamma^{t}R(s_t), \quad 0 \leq \gamma < 1$$

The discount keeps the infinite sum finite, by the geometric series:

$$U \leq \sum_{t=0}^{\infty}\gamma^{t}R_{max} = \frac{R_{max}}{1 - \gamma}$$
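The discounted sum and its bound can be checked numerically; this is a minimal sketch, where `discounted_utility` is a hypothetical helper evaluating a finite prefix of the reward sequence:

```python
def discounted_utility(rewards, gamma):
    """U = sum_t gamma^t * R_t over a finite prefix of the reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With the constant maximal reward R_max at every step, the partial sums
# approach the geometric-series bound R_max / (1 - gamma) from below.
gamma, R_max = 0.5, 1.0
u = discounted_utility([R_max] * 50, gamma)
bound = R_max / (1 - gamma)  # = 2.0
```

For $\gamma = 0.5$ and $R_{max} = 1$, the 50-term partial sum is already within floating-point noise of the bound $2$.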

• Policies

$$\pi^{\star}=\operatorname*{argmax}_{\pi} E\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_t)\,\middle|\,\pi\right]$$

$$U^{\pi}(s)=E\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_t)\,\middle|\,\pi, s_0=s\right]$$

$$\pi^{\star}(s)=\operatorname*{argmax}_{a}\sum_{s^{\prime}}T(s, a, s^{\prime})U(s^{\prime})$$

$$U(s)=R(s)+\gamma \max_{a}\sum_{s^{\prime}}T(s, a, s^{\prime})U(s^{\prime})$$

The last equation is the Bellman equation: for $n$ states it gives $n$ equations in the $n$ unknowns $U(s)$, nonlinear because of the $\max$, and is typically solved by iteration (value iteration) rather than algebraically.
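Iterating the Bellman equation as an update rule can be sketched as follows; this is a minimal value-iteration example on a hypothetical two-state MDP (states `A`, `B`, actions `stay`/`go`, reward $1$ in `B`), not part of the notes themselves:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s') to a fixed point."""
    U = {s: 0.0 for s in states}
    while True:
        new_U = {
            s: R(s) + gamma * max(
                sum(T(s, a, sp) * U[sp] for sp in states) for a in actions
            )
            for s in states
        }
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U

def greedy_policy(states, actions, T, U):
    """Extract pi*(s) = argmax_a sum_s' T(s,a,s') U(s')."""
    return {
        s: max(actions, key=lambda a: sum(T(s, a, sp) * U[sp] for sp in states))
        for s in states
    }

# Hypothetical two-state MDP: "go" moves deterministically to B, "stay" stays put.
states, actions = ["A", "B"], ["stay", "go"]
T = lambda s, a, sp: 1.0 if sp == ("B" if a == "go" else s) else 0.0
R = lambda s: 1.0 if s == "B" else 0.0

U = value_iteration(states, actions, T, R, gamma=0.9)
pi = greedy_policy(states, actions, T, U)
```

Here the fixed point satisfies $U(B) = 1 + 0.9\,U(B) = 10$ and $U(A) = 0.9\,U(B) = 9$, and the extracted policy chooses `go` in state `A`.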