
Reinforcement Learning (GT) Notes

Decision Making & Reinforcement Learning

Supervised Learning: $y = f(x)$. Given $(x, y)$ pairs, learn $f$ so it maps new inputs to labels.

Unsupervised Learning: $f(x)$. Given only the $x$'s, learn $f$ that gives a compact description (e.g. clusters) of the data.

Reinforcement Learning: $y = f(x)$, given $z$. Learn $f$ (and the $y$'s) from $(x, z)$ pairs, where $z$ is a reinforcement (reward) signal rather than the correct label.

Markov Decision Process

States: $S$

Model: $T(s, a, s^{\prime}) \sim Pr(s^{\prime} | s, a)$

Actions: $A(s), A$

Reward: $R(s), R(s, a), R(s, a, s^{\prime})$


Policy: $\pi(s) \rightarrow a$

Optimal policy: $\pi^{*}$
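To make the MDP pieces above concrete, here is a minimal Python sketch of one way to hold $S$, $A$, $T$, $R$, and a policy $\pi$ in plain dictionaries. Every label and probability below is a made-up illustration, not something from the notes.

```python
# Illustrative-only representation of the MDP components:
# states S, actions A, model T(s, a, s') ~ Pr(s' | s, a),
# reward R(s), and a policy pi(s) -> a.

S = ["s0", "s1", "s2"]            # states
A = ["left", "right"]             # actions (same set in every state here)

# T: for each (state, action), a distribution over next states.
T = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s1", "left"):  {"s0": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
}

# R(s): reward received in each state.
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}

# An arbitrary (not necessarily optimal) policy: one action per state.
pi = {"s0": "right", "s1": "left", "s2": "right"}
```

Dictionaries keep the example readable; for a large state space you would typically switch to index-based arrays.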

Sequences of Rewards: Assumptions

  • Infinite Horizons

  • Utility of sequences (stationarity of preferences):

    if $U(s_0, s_1, s_2, \cdots) > U(s_0, s^{\prime}_1, s^{\prime}_2, \cdots)$

    then $U(s_1, s_2, \cdots) > U(s^{\prime}_1, s^{\prime}_2, \cdots)$

$$U(s_0, s_1, s_2, \cdots)=\sum_{t=0}^{\infty}\gamma^{t}R(s_t), \quad 0 \leq \gamma < 1$$

$$U(s_0, s_1, s_2, \cdots)\leq\sum_{t=0}^{\infty}\gamma^{t}R_{max}=\frac{R_{max}}{1 - \gamma}$$
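As a quick numerical check of the discounted sum and the $\frac{R_{max}}{1-\gamma}$ bound, here is a small sketch; the reward sequence and the value of $\gamma$ are arbitrary choices for illustration.

```python
# Check U = sum_t gamma^t R(s_t) against the bound R_max / (1 - gamma).
gamma = 0.9
rewards = [1.0] * 1000            # R(s_t) = R_max = 1 at every step (long truncated horizon)

U = sum(gamma ** t * r for t, r in enumerate(rewards))

R_max = max(rewards)
bound = R_max / (1 - gamma)       # = 10.0 here

print(U, bound)                   # U approaches 10.0 and stays below the bound
```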

  • Policies

    $$\pi^{\star}=\arg\max_{\pi} E\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_t)\,\middle|\,\pi\right]$$

    $$U^{\pi}(s)=E[\sum_{t=0}^{\infty}\gamma^{t}R(s_t)|\pi,s_0=s]$$

    $$\pi^{\star}(s)=\arg\max_{a}\sum_{s^{\prime}}T(s, a, s^{\prime})U(s^{\prime})$$

    $$U(s)=R(s)+\gamma \max_{a}\sum_{s^{\prime}}T(s, a, s^{\prime})U(s^{\prime})$$

    The equation above is the Bellman equation: a state's utility is its immediate reward plus the discounted utility obtained by acting optimally from then on.
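One standard way to solve the Bellman equation is value iteration: start from arbitrary utilities, repeatedly apply the update above until the values stop changing, then read off $\pi^{\star}$ with the $\arg\max$ rule. The sketch below reuses the illustrative `S`, `A`, `T`, `R` dictionaries from the earlier MDP example and is only an assumed toy setup, not a full implementation.

```python
def expected_utility(s, a, U, T):
    """Sum over s' of T(s, a, s') * U(s') for one state-action pair."""
    return sum(p * U[s_next] for s_next, p in T[(s, a)].items())

def value_iteration(S, A, T, R, gamma=0.9, tol=1e-6):
    """Iterate U(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') U(s') to convergence."""
    U = {s: 0.0 for s in S}                       # arbitrary starting utilities
    while True:
        U_new = {
            s: R[s] + gamma * max(expected_utility(s, a, U, T) for a in A)
            for s in S
        }
        if max(abs(U_new[s] - U[s]) for s in S) < tol:
            return U_new
        U = U_new

def extract_policy(S, A, T, U):
    """pi*(s) = argmax_a of the expected utility of taking a in s."""
    return {s: max(A, key=lambda a: expected_utility(s, a, U, T)) for s in S}

U_star = value_iteration(S, A, T, R)              # S, A, T, R from the earlier sketch
pi_star = extract_policy(S, A, T, U_star)
print(U_star)                                     # utilities are highest near the rewarding state
print(pi_star)                                    # greedy policy: "right" in every state here
```

For this toy model the extracted policy picks "right" in every state, since that is the fastest route to the only rewarding state.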