MDP
A Markov Decision Process (MDP) is not a reinforcement learning algorithm in itself; it is the mathematical framework that models the interaction between an agent and its environment. As the cornerstone of reinforcement learning theory, the vast majority of reinforcement learning algorithms are built on the basic assumption that the environment satisfies the Markov property.
The Markov Property: the Memory of a Dynamical System
Formal definition
Let $\boldsymbol S_t$ denote the value at time $t$ of the random vector describing some stochastic phenomenon. A stochastic process can then be characterized by the probability of the value of $\boldsymbol S_{t+1}$ at time $t+1$, $P(\boldsymbol S_{t+1} \mid \boldsymbol S_1, \dots, \boldsymbol S_t)$. When $P(\boldsymbol S_{t+1} \mid \boldsymbol S_t) = P(\boldsymbol S_{t+1} \mid \boldsymbol S_1, \dots, \boldsymbol S_t)$, the process is said to satisfy the Markov property.
A Markov decision process is specified by the tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where:
$\mathcal{S}$ is the state set and $\mathcal{A}$ is the action set;
$\gamma$ is the discount factor;
$r(\boldsymbol{s},\boldsymbol{a})$ is the reward function, which degenerates to $r(\boldsymbol{s})$ when the reward depends only on the state;
$P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})$ is the state-transition function, i.e. the probability of reaching state $\boldsymbol{s}'$ after taking action $\boldsymbol{a}$ in state $\boldsymbol{s}$.
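To make the tuple concrete, here is a minimal numpy sketch of a hypothetical 2-state, 2-action MDP; all transition probabilities and rewards are made-up illustration values, not taken from any particular problem.

```python
import numpy as np

# A hypothetical MDP <S, A, P, r, gamma> with |S| = 2 and |A| = 2.
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# State-transition function: P[s, a, s'] = P(s' | s, a).
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.9, 0.1], [0.4, 0.6]],   # transitions from state 1 under actions 0 and 1
])

# Reward function: r[s, a] = r(s, a).
r = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])

# Sanity check: every P(. | s, a) is a probability distribution over S.
assert np.allclose(P.sum(axis=-1), 1.0)
```

The same toy arrays are restated in the later snippets so that each one stays self-contained.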
Value Functions: Evaluating How Good a Decision Is
The state-value function $V^\pi(\boldsymbol{s})$ of an MDP under policy $\pi$ is defined as the expected return (the long-run expected payoff) obtained by starting from state $\boldsymbol{s}$ and following $\pi$:

$$V^\pi(\boldsymbol{s}) = \mathbb{E}_\pi [ G_t \mid \boldsymbol{S}_t=\boldsymbol{s}]$$

The action-value function $Q^\pi(\boldsymbol{s},\boldsymbol{a})$, which quantifies the expected value of taking a particular action, is defined as:

$$Q^\pi(\boldsymbol{s},\boldsymbol{a}) = \mathbb{E}_\pi[G_t \mid \boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a}]$$

Note that the expectation notation customarily used in reinforcement learning leaves a lot implicit: the subscript $\pi$ means that trajectories are sampled using policy $\pi$, not that the expectation is taken over $\pi$ itself, and the random variables being averaged over are omitted. For $Q^\pi$ the expectation is over the next state $\boldsymbol{s}'$, while for $V^\pi$ it is over both the action $\boldsymbol{a}$ chosen under the current policy and the next state $\boldsymbol{s}'$.
The state-value and action-value functions satisfy the relations:

$$\boxed{
\begin{aligned}
V^\pi(\boldsymbol{s}) &= \sum_{\boldsymbol{a}}\pi(\boldsymbol{a}\mid\boldsymbol{s})\, Q^\pi(\boldsymbol{s},\boldsymbol{a}) \\
Q^\pi(\boldsymbol{s},\boldsymbol{a}) &= r(\boldsymbol{s},\boldsymbol{a}) + \gamma\sum_{\boldsymbol{s}'}P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})\, V^\pi(\boldsymbol{s}')
\end{aligned}
}$$
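Substituting the second relation into the first turns policy evaluation into a linear system in $V^\pi$, which can be solved in closed form for a small finite MDP. Below is a minimal numpy sketch that does this for the hypothetical 2-state MDP above (restated so the snippet is self-contained) under an arbitrary made-up policy, and then recovers $Q^\pi$ from $V^\pi$:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s'] = P(s' | s, a)
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi[s, a] = pi(a | s), made up

# Combining the two relations: V^pi = r^pi + gamma * P^pi V^pi, where
# r^pi(s) = sum_a pi(a|s) r(s,a) and P^pi(s,s') = sum_a pi(a|s) P(s'|s,a).
r_pi = (pi * r).sum(axis=1)                # shape (|S|,)
P_pi = np.einsum('sa,sat->st', pi, P)      # shape (|S|, |S|)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Second relation: Q^pi(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) V^pi(s').
Q = r + gamma * P @ V

# First relation as a consistency check: V^pi(s) = sum_a pi(a|s) Q^pi(s,a).
assert np.allclose(V, (pi * Q).sum(axis=1))
print("V^pi:", V)
print("Q^pi:\n", Q)
```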
Bellman Expectation Equations: Recursion Through Time
Bellman equations
Unrolling the action-value function one step in time:
$$\begin{aligned}
Q^\pi(\boldsymbol{s},\boldsymbol{a}) &= \mathbb{E}_\pi[G_t \mid \boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a}]\\
&= \mathbb{E}_\pi[R_t + \gamma G_{t+1} \mid \boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a}]\\
&= \mathbb{E}_{\pi,\, \boldsymbol{s}' \sim P(\cdot\mid\boldsymbol{s},\boldsymbol{a})}[R_t + \gamma Q^\pi(\boldsymbol{s}',\boldsymbol{a}') \mid \boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a}]\\
&= r(\boldsymbol{s},\boldsymbol{a}) + \gamma\sum_{\boldsymbol{s}'}P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})\,\mathbb{E}_{\boldsymbol{a}' \sim \pi(\cdot\mid\boldsymbol{s}')}[Q^\pi(\boldsymbol{s}',\boldsymbol{a}')]\\
&= r(\boldsymbol{s},\boldsymbol{a}) + \gamma\sum_{\boldsymbol{s}'}P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})\, V^\pi(\boldsymbol{s}')\\
&= \boxed{r(\boldsymbol{s},\boldsymbol{a}) + \gamma\sum_{\boldsymbol{s}'}P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a}) \sum_{\boldsymbol{a}'}\pi(\boldsymbol{a}'\mid\boldsymbol{s}')\, Q^\pi(\boldsymbol{s}',\boldsymbol{a}')}
\end{aligned}$$
Unrolling the state-value function one step in time:
$$\begin{aligned}
V^\pi(\boldsymbol{s}) &= \mathbb{E}_\pi [ G_t \mid \boldsymbol{S}_t=\boldsymbol{s}] \\
&= \mathbb{E}_{\boldsymbol{a} \sim \pi(\cdot \mid \boldsymbol{s})}\!\left[ \mathbb{E}_{\boldsymbol{s}' \sim P(\cdot \mid \boldsymbol{s},\boldsymbol{a})}[R_t + \gamma G_{t+1} \mid \boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a}] \right] \\
&= \mathbb{E}_{\boldsymbol{a} \sim \pi(\cdot \mid \boldsymbol{s})}\!\left[ R_t + \gamma\, \mathbb{E}_{\boldsymbol{s}' \sim P(\cdot \mid \boldsymbol{s},\boldsymbol{a})}[ G_{t+1} \mid \boldsymbol{S}_{t+1}=\boldsymbol{s}'] \right] \\
&= \boxed{\sum_{\boldsymbol{a}} \pi(\boldsymbol{a} \mid \boldsymbol{s}) \left( r(\boldsymbol{s},\boldsymbol{a}) + \gamma\sum_{\boldsymbol{s}'}P(\boldsymbol{s}'\mid \boldsymbol{s},\boldsymbol{a})\,V^\pi(\boldsymbol{s}') \right)}
\end{aligned}$$
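The boxed recursions also suggest the standard iterative policy-evaluation scheme: start from an arbitrary estimate and repeatedly apply the right-hand side as a backup operator until it stops changing. A minimal sketch on the same hypothetical MDP (the $\gamma$-discounted backup is a contraction, which guarantees convergence):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s)

V = np.zeros(2)                            # arbitrary initial estimate
for _ in range(1000):
    # Q^pi(s,a) <- r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = r + gamma * P @ V
    # V^pi(s)  <- sum_a pi(a|s) Q^pi(s,a)
    V_new = (pi * Q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop once the backup is (numerically) a fixed point
        break
    V = V_new

print("V^pi ~", V)
```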
Bellman optimality equations:

$$\begin{aligned}
V^*(\boldsymbol{s}) &= \max_{\boldsymbol{a} \in \mathcal{A}} \left\{ r(\boldsymbol{s}, \boldsymbol{a}) + \gamma \sum_{\boldsymbol{s}' \in \mathcal{S}} P(\boldsymbol{s}' \mid \boldsymbol{s}, \boldsymbol{a})\, V^*(\boldsymbol{s}') \right\} \\
Q^*(\boldsymbol{s}, \boldsymbol{a}) &= r(\boldsymbol{s}, \boldsymbol{a}) + \gamma \sum_{\boldsymbol{s}' \in \mathcal{S}} P(\boldsymbol{s}' \mid \boldsymbol{s}, \boldsymbol{a}) \max_{\boldsymbol{a}' \in \mathcal{A}} Q^*(\boldsymbol{s}', \boldsymbol{a}')
\end{aligned}$$

Note that, as their derivation shows, the optimality equations hold only when the Markov property is satisfied.
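Iterating the Bellman optimality equation as a fixed-point update gives value iteration. A minimal sketch on the same hypothetical MDP, reading off a greedy optimal policy at the end:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]

V = np.zeros(2)
for _ in range(1000):
    # Q*(s,a) <- r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = r + gamma * P @ V
    # V*(s)   <- max_a Q*(s,a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=1)                 # greedy (deterministic) optimal policy
print("V* ~", V, " optimal action per state:", pi_star)
```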
Estimation in Practice
Because state transitions in practice are stochastic, we can only sample large numbers of returns $G_t$ and rewards $R_t$, and must use them to estimate $V^\pi$ and $Q^\pi$ in order to guide the policy. Since $V^\pi(\boldsymbol{s})$ and $Q^\pi(\boldsymbol{s},\boldsymbol{a})$ are defined as expectations, Monte Carlo methods are commonly used for this estimation. For example, for $V^\pi(\boldsymbol{s})$:

$$V^\pi(\boldsymbol{s}) = \mathbb{E}_\pi [ G_t \mid \boldsymbol{S}_t=\boldsymbol{s}] \approx \frac{1}{N}\sum_{i=1}^N G_t^{(i)}$$

Note that this requires the policy to be held fixed, because the value of $V^\pi$ depends on the policy.
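A minimal Monte Carlo sketch along these lines for the same hypothetical MDP: roll out many truncated trajectories from a fixed start state under a fixed policy and average the sampled returns. The truncation horizon is an illustrative assumption (long enough that $\gamma^T$ is negligible).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s), held fixed while sampling

def sample_return(s0, horizon=200):
    """Sample one discounted return G_0 starting from state s0 under policy pi."""
    s, G, discount = s0, 0.0, 1.0
    for _ in range(horizon):               # gamma**200 is negligible here
        a = rng.choice(2, p=pi[s])
        G += discount * r[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return G

N = 5000
V0_hat = np.mean([sample_return(0) for _ in range(N)])
print("Monte Carlo estimate of V^pi(s=0):", V0_hat)
```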
Occupancy Measures: the Space that Characterizes Policy Behavior
Occupancy measure
The occupancy measure characterizes the state-action distribution induced by a policy as it interacts with the environment. It comes in two forms:
State occupancy measure: let the initial state distribution be $\nu_0(\boldsymbol{s})$, and let $P^\pi(\boldsymbol{S}_t=\boldsymbol{s})$ be the probability of visiting state $\boldsymbol{s}$ at step $t$ under policy $\pi$. The state occupancy measure is defined as the discounted visitation frequency of state $\boldsymbol{s}$:
$$\nu^\pi(\boldsymbol{s}) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P^\pi(\boldsymbol{S}_t=\boldsymbol{s})$$
The normalization factor here is necessary: to guarantee $\sum_{\boldsymbol{s}} \nu^\pi(\boldsymbol{s}) = 1$, we multiply by $(1-\gamma)$, so that:

$$\sum_{\boldsymbol{s}} \nu^\pi(\boldsymbol{s}) = (1-\gamma)\sum_{t=0}^\infty \gamma^t \underbrace{\sum_{\boldsymbol{s}} P^\pi(\boldsymbol{S}_t=\boldsymbol{s})}_{=1} = 1$$
State-action occupancy measure (which can be viewed as the data distribution of reinforcement learning): building on the state occupancy measure, it additionally accounts for the policy's action choices (analogous to the relationship between $Q^\pi$ and $V^\pi$):
$$\begin{aligned}
\rho^\pi(\boldsymbol{s},\boldsymbol{a}) &= (1-\gamma)\,\mathbb{E}\left[ \sum_{t=0}^\infty \gamma^t\, \mathbb{I}(\boldsymbol{S}_t = \boldsymbol{s}, \boldsymbol{A}_t = \boldsymbol{a}) \,\Big|\, \pi \right] \\
&= (1-\gamma)\sum_{t=0}^\infty \gamma^t P^\pi(\boldsymbol{S}_t=\boldsymbol{s}, \boldsymbol{A}_t=\boldsymbol{a})
\end{aligned}$$
Decomposing by conditional probability (the policy is stationary, so $P^\pi(\boldsymbol{A}_t=\boldsymbol{a}\mid\boldsymbol{S}_t=\boldsymbol{s}) = \pi(\boldsymbol{a}\mid\boldsymbol{s})$ does not depend on $t$ and factors out of the sum):
$$\rho^\pi(\boldsymbol{s},\boldsymbol{a}) = \underbrace{(1-\gamma)\sum_{t=0}^\infty \gamma^t P^\pi(\boldsymbol{S}_t=\boldsymbol{s})}_{\nu^\pi(\boldsymbol{s})} \cdot \underbrace{P^\pi(\boldsymbol{A}_t=\boldsymbol{a}\mid\boldsymbol{S}_t=\boldsymbol{s})}_{\pi(\boldsymbol{a}\mid\boldsymbol{s})}$$
This yields the relation between $\nu^\pi(\boldsymbol{s})$ and $\rho^\pi(\boldsymbol{s},\boldsymbol{a})$ (intuitive enough that it could be written down directly):
$$\boxed{\rho^\pi(\boldsymbol{s},\boldsymbol{a}) = \nu^\pi(\boldsymbol{s})\,\pi(\boldsymbol{a}\mid\boldsymbol{s})}$$
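A minimal sketch that approximates $\nu^\pi$ for the hypothetical MDP by truncating the discounted sum (propagating the state distribution with the transition matrix induced by $\pi$) and then forms $\rho^\pi$ through the boxed factorization; the initial distribution is a made-up assumption:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s)
nu0 = np.array([1.0, 0.0])                 # initial state distribution nu_0 (assumed)

P_pi = np.einsum('sa,sat->st', pi, P)      # P^pi(s' | s) under the policy

# nu^pi(s) ~ (1 - gamma) * sum_{t=0}^{T-1} gamma^t * P^pi(S_t = s)
T = 500
nu, p_t = np.zeros(2), nu0.copy()
for t in range(T):
    nu += (1 - gamma) * gamma**t * p_t
    p_t = p_t @ P_pi                       # distribution of S_{t+1}

rho = nu[:, None] * pi                     # rho^pi(s, a) = nu^pi(s) * pi(a | s)
assert np.isclose(nu.sum(), 1.0) and np.isclose(rho.sum(), 1.0)
print("nu^pi:", nu)
print("rho^pi:\n", rho)
```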
Theorem 1:
If an agent interacts with the same MDP using policy $\pi_1$ and policy $\pi_2$ respectively, the resulting occupancy measures $\rho^{\pi_1}$ and $\rho^{\pi_2}$ satisfy

$$\rho^{\pi_1} = \rho^{\pi_2} \;\Longleftrightarrow\; \pi_1 = \pi_2$$
Theorem 2:
Given a valid occupancy measure $\rho$, the unique policy that generates it is

$$\pi_\rho(\boldsymbol{a}\mid\boldsymbol{s}) = \frac{\rho(\boldsymbol{s}, \boldsymbol{a})}{\sum_{\boldsymbol{a}'} \rho(\boldsymbol{s}, \boldsymbol{a}')}$$
These two theorems reveal a one-to-one correspondence between the policy space and the occupancy-measure space in reinforcement learning theory. Theorem 1 states that policies and occupancy measures are in strict bijection: different policies necessarily induce different state-action distributions ($\rho^{\pi_1} \neq \rho^{\pi_2}$ if and only if $\pi_1 \neq \pi_2$), while Theorem 2 gives a closed-form expression for reconstructing the policy from its occupancy measure, $\pi_\rho(a\mid s)=\rho(s,a)/\sum_{a'}\rho(s,a')$. The significance of this correspondence shows up in two ways:
A change of representation: as a conditional distribution, $\pi(a\mid s)$ often runs into the curse of dimensionality and high-variance policy gradients during optimization. The occupancy measure $\rho(s,a)$, by contrast, is a joint distribution that naturally satisfies the linear constraint $\sum_a\rho(s,a)=(1-\gamma)\nu_0(s)+\gamma\sum_{s',a'}\rho(s',a')P(s\mid s',a')$, which provides a much friendlier structure for convex optimization.
A bridge for reformulating the problem: through this bijection, policy search can be recast as a linear program over the occupancy-measure space:

$$\max_\rho \sum_{s,a} \rho(s,a)\,r(s,a) \quad \text{s.t.} \quad \rho \in \mathcal{D}$$

where the constraint set $\mathcal{D}$ is defined by the Bellman flow equation. This transformation reduces the otherwise intricate policy-gradient computation to a convex optimization problem, as sketched below.
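Below is a minimal sketch of this linear program for the hypothetical 2-state MDP, using `scipy.optimize.linprog`: maximize $\sum_{s,a}\rho(s,a)r(s,a)$ subject to non-negativity and the Bellman flow constraints $\sum_a \rho(s',a) = (1-\gamma)\nu_0(s') + \gamma\sum_{s,a}P(s'\mid s,a)\rho(s,a)$ (the same flow equation derived in the next subsection), then read the optimal policy off the optimal $\rho^*$ via Theorem 2. All numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
nu0 = np.array([1.0, 0.0])                 # initial state distribution (assumed)
nS, nA = 2, 2

# Decision variable: rho flattened as x[s * nA + a]; linprog minimizes, so negate r.
c = -r.flatten()

# One flow constraint per next state s':
#   sum_a rho(s', a) - gamma * sum_{s, a} P(s'|s, a) rho(s, a) = (1 - gamma) * nu0(s')
A_eq = np.zeros((nS, nS * nA))
for s_next in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s_next, s * nA + a] = float(s == s_next) - gamma * P[s, a, s_next]
b_eq = (1 - gamma) * nu0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (nS * nA))
rho_star = res.x.reshape(nS, nA)

# Theorem 2: recover the (optimal) policy from the optimal occupancy measure.
pi_star = rho_star / rho_star.sum(axis=1, keepdims=True)
print("optimal occupancy measure rho*:\n", rho_star)
print("recovered optimal policy pi*:\n", pi_star)
```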
In derivations, one is therefore free to pick whichever mathematical space best fits the problem: use $\pi(a\mid s)$ when an explicit policy is needed, and $\rho(s,a)$ when analyzing state-action distributions.
The Dynamic Equation of the State Occupancy Measure
Analyzing it from the perspective of temporal recursion:
$$\begin{aligned}
\nu^\pi(\boldsymbol{s}') &= (1-\gamma)\sum_{t=0}^\infty \gamma^t P^\pi(\boldsymbol{S}_t=\boldsymbol{s}') \\
&= (1-\gamma)\nu_0(\boldsymbol{s}') + (1-\gamma)\sum_{t=1}^\infty \gamma^t P^\pi(\boldsymbol{S}_t=\boldsymbol{s}')
\end{aligned}$$
For $t \geq 1$, applying the law of total probability:

$$P^\pi(\boldsymbol{S}_t=\boldsymbol{s}') = \sum_{\boldsymbol{s},\boldsymbol{a}} P^\pi(\boldsymbol{S}_{t-1}=\boldsymbol{s})\,\pi(\boldsymbol{a}\mid\boldsymbol{s})\,P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})$$

Substituting this back gives:
$$\begin{aligned}
\nu^\pi(\boldsymbol{s}') &= (1-\gamma)\nu_0(\boldsymbol{s}') + \gamma(1-\gamma)\sum_{t=1}^\infty \gamma^{t-1} \sum_{\boldsymbol{s},\boldsymbol{a}} P^\pi(\boldsymbol{S}_{t-1}=\boldsymbol{s})\,\pi(\boldsymbol{a}\mid\boldsymbol{s})\,P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a}) \\
&= (1-\gamma)\nu_0(\boldsymbol{s}') + \gamma\sum_{\boldsymbol{s},\boldsymbol{a}} \underbrace{(1-\gamma)\sum_{\tau=0}^\infty \gamma^\tau P^\pi(\boldsymbol{S}_\tau=\boldsymbol{s})}_{\nu^\pi(\boldsymbol{s})}\, \pi(\boldsymbol{a}\mid\boldsymbol{s})\,P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})
\end{aligned}$$
Finally we obtain the recursive (flow) equation:

$$\boxed{\nu^\pi(\boldsymbol{s}') = (1-\gamma)\nu_0(\boldsymbol{s}') + \gamma\sum_{\boldsymbol{s},\boldsymbol{a}}\nu^\pi(\boldsymbol{s})\,\pi(\boldsymbol{a}\mid\boldsymbol{s})\,P(\boldsymbol{s}'\mid\boldsymbol{s},\boldsymbol{a})}$$

For continuous state and action spaces, replace the sums with integrals.
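For a finite MDP this recursion is a linear system in $\nu^\pi$ and can be solved directly. A minimal sketch on the hypothetical MDP, cross-checking the solution against the truncated discounted-sum definition:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])   # P[s, a, s']
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s)
nu0 = np.array([1.0, 0.0])                 # initial state distribution (assumed)

P_pi = np.einsum('sa,sat->st', pi, P)      # P^pi(s' | s)

# Flow equation in matrix form: nu = (1-gamma) * nu0 + gamma * P_pi^T nu,
# i.e. (I - gamma * P_pi^T) nu = (1 - gamma) * nu0.
nu_exact = np.linalg.solve(np.eye(2) - gamma * P_pi.T, (1 - gamma) * nu0)

# Compare with the definition nu^pi(s) = (1-gamma) * sum_t gamma^t P^pi(S_t = s).
nu_sum, p_t = np.zeros(2), nu0.copy()
for t in range(500):
    nu_sum += (1 - gamma) * gamma**t * p_t
    p_t = p_t @ P_pi

assert np.allclose(nu_exact, nu_sum)
print("nu^pi from the flow equation:", nu_exact)
```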
Application Scenarios

| Application area | Mechanism | Typical example |
| --- | --- | --- |
| Constrained reinforcement learning | Constrain the state-action distribution via the occupancy measure: $\rho^\pi(s,a) \leq b(s,a)$ | Keeping a robot out of unsafe regions |
| Imitation learning | Minimize the gap between expert and agent occupancy measures: $\min_\pi D_{\mathrm{KL}}(\rho^{\mathrm{expert}} \,\|\, \rho^\pi)$ | Behavior cloning for autonomous driving |
| Transfer learning | Analyze the occupancy-measure gap between source and target domains | Cross-task skill transfer for a robot arm |
| Exploration analysis | Compute the entropy of the occupancy measure: $H(\rho^\pi)=-\sum\rho^\pi\log\rho^\pi$ | Assessing exploration efficiency |
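As a tiny illustration of the imitation-learning and exploration-analysis rows above, a sketch that computes the entropy of an agent's occupancy measure and the KL divergence from a made-up expert occupancy measure; both arrays are hypothetical, normalized $2\times 2$ tables over $(s,a)$:

```python
import numpy as np

# Two hypothetical state-action occupancy measures over a 2 x 2 (s, a) grid.
rho_agent = np.array([[0.30, 0.20],
                      [0.15, 0.35]])
rho_expert = np.array([[0.10, 0.40],
                       [0.05, 0.45]])

# Exploration diagnostic: entropy H(rho) = -sum rho * log rho.
H = -(rho_agent * np.log(rho_agent)).sum()

# Imitation-style objective: D_KL(rho_expert || rho_agent).
kl = (rho_expert * np.log(rho_expert / rho_agent)).sum()

print(f"H(rho_agent) = {H:.4f}, D_KL(expert || agent) = {kl:.4f}")
```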
Going Deeper: Beyond the Markov Property
When facing non-Markovian environments, history dependence can be handled by extending the occupancy measure:

$$\rho^\pi(h,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P^\pi(H_t=h)\,\pi(a\mid h)$$
where $h=(s_0,a_0,\dots,s_t)$ is the history trajectory. This extension provides a unified analytical framework for handling:
POMDP problems with limited sensor information
decision-making scenarios with long-range dependencies
tasks operating over multiple time scales
and thereby demonstrates how far the occupancy-measure machinery extends.
A Visual Summary of the Key Derivations

```mermaid
graph TB
    A["initial distribution ν0"] --> B["state visitation probability P^π(S_t=s)"]
    B --> C["discounted sum ν^π(s)"]
    C --> D["action decomposition ρ^π(s,a) = ν^π(s)·π(a∣s)"]
    D --> E["expected return under ν0: Σρr/(1-γ)"]
    B -.->|temporal recursion| F["recursive flow equation ν(s') = ..."]
    F --> G["dual optimization constraints"]
```