Definition of the Reinforcement Learning Problem

Policy function \(\pi(s,a)\): the probability that the agent takes action a in state s.

A deterministic policy is one where, for a given state s, exactly one action a has probability \(\pi(s,a)=1\); this action is written \(a=\pi(s)\).
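
As a concrete illustration, here is a minimal Python sketch of the two policy forms (the 3-state, 2-action sizes and all variable names are assumptions, not from the text): a stochastic policy stored as a probability table \(\pi(s,a)\), and a deterministic policy stored as one action index per state.

```python
import numpy as np

# Minimal sketch: a hypothetical MDP with 3 states and 2 actions.
n_states, n_actions = 3, 2

# Stochastic policy: pi[s, a] = probability of taking action a in state s.
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy
assert np.allclose(pi.sum(axis=1), 1.0)                 # each row is a distribution

# Deterministic policy: exactly one action has probability 1 in each state,
# so it can be stored directly as a = pi(s), i.e. one action index per state.
pi_det = np.array([0, 1, 0])

def as_table(det_policy, n_actions):
    """Expand a deterministic policy into its pi(s, a) table form (one-hot rows)."""
    table = np.zeros((len(det_policy), n_actions))
    table[np.arange(len(det_policy)), det_policy] = 1.0
    return table

print(as_table(pi_det, n_actions))
```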

Return \(G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\): the discounted sum of rewards received after time t, where \(\gamma\) is the discount factor.
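
Since \(G_t\) is a discounted sum, it can be computed from an observed reward sequence by accumulating backwards via \(G_t = R_{t+1} + \gamma G_{t+1}\). A minimal sketch (the function name and the example rewards are hypothetical):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for rewards seen after time t."""
    g = 0.0
    for r in reversed(rewards):      # accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```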

Value function \(V_\pi(s)=\mathbb E_\pi[G_t|S_t=s]\): the expected return obtained when the agent is in state s at time t and acts according to policy \(\pi\) thereafter.

Action-value function \(q_\pi(s,a)=\mathbb E_\pi[G_t|S_t=s,A_t=a]\): the expected return obtained when the agent is in state s at time t, takes action a, and then acts according to policy \(\pi\) after time t.
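
Both \(V_\pi\) and \(q_\pi\) are expectations of the return, so they can be approximated by averaging sampled returns. A minimal Monte Carlo sketch, assuming a hypothetical helper `sample_episode(start_state)` that rolls out policy \(\pi\) and returns the observed rewards \([R_{t+1}, R_{t+2}, \dots]\); estimating \(q_\pi(s,a)\) works the same way with the first action fixed to a.

```python
import numpy as np

def mc_value_estimate(sample_episode, start_state, gamma, n_episodes=1000):
    """Monte Carlo estimate of V_pi(start_state): average of sampled discounted returns.

    `sample_episode(start_state)` is a hypothetical helper that runs policy pi
    from `start_state` and returns the observed rewards [R_{t+1}, R_{t+2}, ...].
    """
    returns = []
    for _ in range(n_episodes):
        rewards = sample_episode(start_state)
        g, discount = 0.0, 1.0
        for r in rewards:            # G_t = sum_k gamma^k * R_{t+k+1}
            g += discount * r
            discount *= gamma
        returns.append(g)
    return np.mean(returns)
```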

The reinforcement learning problem can thus be cast as a policy learning problem: given a Markov decision process \(MDP=(S,A,P,R,\gamma)\), learn an optimal policy \(\pi^*\ s.t.\ \forall s\in S,\ V_{\pi^*}(s)=\max_\pi V_\pi(s)\).

Bellman equations. First expand \(V_\pi\) and \(q_\pi\) in terms of each other:

\[ \begin{array}{ll} V_{\pi}(s)&=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots|S_t=s]\\ &=\mathbb E_{a\sim\pi(s,\cdot)}[\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots|S_t=s,A_t=a]]\\ &=\sum\limits_{a\in A}\pi(s,a)q_\pi(s,a) \end{array} \]
\[ \begin{array}{ll} q_\pi(s,a)&=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots|S_t=s,A_t=a]\\ &=\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma\mathbb E_\pi[R_{t+2}+\gamma R_{t+3}+\cdots|S_{t+1}=s']]\\ &=\sum_{s'\in S}P(s'|s,a)[R(s,a,s')+\gamma V_\pi(s')] \end{array} \]

Substituting each expression into the other yields the Bellman equations:

\[ V_\pi(s)=\mathbb E_{a\sim\pi(s,\cdot)}\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma V_\pi(s')] \]
\[ q_\pi(s,a)=\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma\mathbb E_{a'\sim\pi(s',\cdot)}[q_\pi(s',a')]] \]
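
The first Bellman equation is a fixed-point equation in \(V_\pi\); for \(\gamma<1\), repeatedly applying its right-hand side converges to \(V_\pi\) (iterative policy evaluation). A minimal sketch, assuming a small tabular MDP given as arrays `P[s, a, s']`, `R[s, a, s']` and a policy table `pi[s, a]` (all names and the toy numbers below are assumptions):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation equation until V_pi converges.

    P[s, a, s'] : transition probabilities P(s'|s, a)
    R[s, a, s'] : rewards R(s, a, s')
    pi[s, a]    : policy probabilities pi(s, a)
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # q(s, a) = sum_{s'} P(s'|s, a) [ R(s, a, s') + gamma * V(s') ]
        q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        # V(s) = sum_a pi(s, a) q(s, a)
        V_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP used only to exercise the routine.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.ones_like(P)                # constant reward of 1 per step
pi = np.full((2, 2), 0.5)          # uniform random policy
print(policy_evaluation(P, R, pi)) # approx 1 / (1 - gamma) = 10 in every state
```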