Reinforcement Learning ¶
Basic Concepts ¶
Find the best sequence of actions to maximize the cumulative reward.
Highly mathematical and highly systematic.
Essentially "fancy" trial & error.
If we choose RL, we have to set up:
- the controller structure
- what success is (the reward function)
- how to learn efficiently (RL algorithms)
Model-free RL vs. model-based RL: the model-based variant provides knowledge about the environment.
Hyperparameters:
- learning rate
- discount factor
- exploration vs. exploitation
- \(X\): state
- \(A\): action
- \(P\): transition probability
- \(R\): reward function
- \(V(x)\): state value function
- \(Q(x,a)\): state-action value function
- \(\pi(x,a)\): policy function
Drawbacks ¶
Manual policy tuning ¶
When the environment changes, the policy has to be re-tuned.
Explainable AI¶
RL produces a black-box model; when something goes wrong, it is unclear where.
Robustness¶
How to verify:
With a learned policy, it is hard to predict how the system will behave from one state to another.
As a result, RL requires far more test runs.
Verification is also difficult¶
Formal verification¶
Metrics such as stability margins simply do not exist for a learned policy, so robustness and safety cannot be established.
Approaches ¶
- set up learning to be more robust
- increase safety
- solve a different problem
Exploration & Exploitation¶
Exploration: explore new areas of the environment.
Exploitation: collect the most reward from what you already know.
It is tricky to balance exploration and exploitation.
\(\epsilon\)-greedy Policy ¶
Explore with probability \(\epsilon\) and exploit with probability \(1-\epsilon\); when the reward distributions are wide (high variance), \(\epsilon\) should be set larger.
The estimate can be updated recursively, so only the pull count \(n-1\) and the running average reward \(Q_{n-1}(k)\) need to be remembered; see the update below.
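A minimal statement of that incremental update, assuming \(v_n\) denotes the \(n\)-th reward observed on arm \(k\) (the standard incremental-mean form):

\[
Q_n(k) = Q_{n-1}(k) + \frac{1}{n}\bigl(v_n - Q_{n-1}(k)\bigr)
\]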
Softmax Policy ¶
Based on the Boltzmann distribution: an arm's selection probability grows with its estimated reward.
\(\tau\) is the temperature parameter controlling the strength of exploration: as \(\tau \to 0\), Softmax degenerates to pure exploitation; as \(\tau \to \infty\), it degenerates to pure exploration.
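The corresponding selection probability, written with the \(Q(k)\) estimates above (the standard Boltzmann form, assuming \(K\) arms):

\[
\pi(k) = \frac{e^{Q(k)/\tau}}{\sum_{i=1}^{K} e^{Q(i)/\tau}}
\]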
k-armed Bandit ¶
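A minimal runnable sketch tying the strategies above to the bandit setting: an \(\epsilon\)-greedy agent using the incremental update (all names and parameters here are illustrative, not from the notes):

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=10000):
    """Illustrative k-armed bandit with Gaussian rewards, epsilon-greedy."""
    k = len(true_means)
    counts = [0] * k      # n: number of pulls per arm
    q = [0.0] * k         # Q_n(k): running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:                 # explore
            a = random.randrange(k)
        else:                                         # exploit
            a = max(range(k), key=lambda i: q[i])
        reward = random.gauss(true_means[a], 1.0)
        counts[a] += 1
        # incremental mean: Q_n = Q_{n-1} + (v_n - Q_{n-1}) / n
        q[a] += (reward - q[a]) / counts[a]
        total_reward += reward
    return q, total_reward

estimates, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, total)
```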
Deploy & verify¶
Bellman Equation¶
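For reference, the Bellman expectation equation in the notation defined earlier, with discount factor \(\gamma\) (a standard form; the reward-argument convention \(R(x,a,x')\) is an assumption):

\[
V^{\pi}(x) = \sum_{a} \pi(x,a) \sum_{x'} P(x' \mid x,a)\bigl[R(x,a,x') + \gamma V^{\pi}(x')\bigr]
\]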
DDPG (Deep Deterministic Policy Gradient):
- continuous action spaces
- estimates a deterministic policy, so training is faster
The critic helps train the actor; the actor decides which action to pick.
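A sketch of the deterministic policy gradient underlying DDPG, with actor \(\mu_{\theta}\) and critic \(Q\) (the standard DPG form, supplied here as background):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}\Bigl[\nabla_{a} Q(x,a)\big|_{a=\mu_{\theta}(x)} \,\nabla_{\theta}\mu_{\theta}(x)\Bigr]
\]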
Policy¶
state → action
Q-function
- curse of dimensionality
Designing a function with that many degrees of freedom by hand is hard, so a function-approximation method is needed; a neural network is used as the approximator (see the sketch below).
The network must be complex enough to fit most functions of interest.
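A minimal sketch of such a neural-network Q-function approximator in PyTorch (the architecture and sizes are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(x, a): state and action in, scalar value out."""
    def __init__(self, state_dim, action_dim, hidden=64):  # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar Q-value
        )

    def forward(self, state, action):
        # concatenate state and action into one input vector
        return self.net(torch.cat([state, action], dim=-1))

q = QNetwork(state_dim=4, action_dim=2)
value = q(torch.randn(1, 4), torch.randn(1, 2))
print(value.shape)  # torch.Size([1, 1])
```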
Policy gradient:
- execute the current policy
- collect the reward
- increase the probability of good actions
The gradient estimator behind this loop is shown below.
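A standard statement of that estimator (REINFORCE form, with \(G_t\) the return following the action; written \(\pi_{\theta}(a \mid x)\) to make the parameters \(\theta\) explicit):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\bigl[\nabla_{\theta}\log \pi_{\theta}(a \mid x)\, G_t\bigr]
\]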
Policy-function¶
Value-function¶
Actor-Critic¶
Model-Based Learning ¶
Reward function:
- writing a reward function is easy
- writing a good reward function is hard
With sparse rewards, the agent is unlikely to generate a rewarded action sequence by random exploration.
But if the reward function has a built-in bias, learning may converge directly toward that bias; a small illustration follows.
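A minimal illustration of the trade-off: a sparse reward vs. a shaped reward with a built-in bias (the task and both functions are hypothetical examples, not from the notes):

```python
def sparse_reward(state, goal):
    """Easy to write, hard to learn from: signal only at the goal."""
    return 1.0 if state == goal else 0.0

def shaped_reward(state, goal):
    """Denser signal, but the distance bias may dominate learning:
    the agent can converge to hovering near the goal instead of reaching it."""
    return 1.0 if state == goal else -0.01 * abs(goal - state)

# hypothetical 1-D task: state and goal are integers on a line
print(sparse_reward(3, 10), shaped_reward(3, 10))
```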
Derivation of the policy evaluation formula
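The key step, assuming the notation above: unroll the discounted return one step and take the expectation under \(\pi\),

\[
V^{\pi}(x) = \mathbb{E}_{\pi}\Bigl[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \,\Big|\, x_0 = x\Bigr] = \mathbb{E}_{\pi}\bigl[r_0 + \gamma V^{\pi}(x_1) \,\big|\, x_0 = x\bigr],
\]

which expands into the Bellman expectation equation given earlier.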
Bellman Optimality Equation (BOE) ¶
When the model is known, the reinforcement learning task reduces to an optimization problem solvable by dynamic programming.
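For reference, a standard form of the BOE in the notation above (same reward-argument assumption as before):

\[
V^{*}(x) = \max_{a}\sum_{x'} P(x' \mid x,a)\bigl[R(x,a,x') + \gamma V^{*}(x')\bigr]
\]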
Bellman and dynamic programming: the BOE doubles as a dynamic-programming update rule, as sketched below.
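A minimal sketch of that connection: value iteration applies the Bellman optimality backup until convergence (the toy MDP, data layout, and names are illustrative assumptions):

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[x][a] is a list of (prob, next_state); R[x][a] is a scalar reward.
    Repeatedly applies the Bellman optimality backup until V converges."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = [
            max(
                R[x][a] + gamma * sum(p * V[x2] for p, x2 in P[x][a])
                for a in range(len(P[x]))
            )
            for x in range(n_states)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

# Toy 2-state MDP: action 0 stays, action 1 moves; reward 1 for reaching state 1.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0], [0.0, 0.0]]
print(value_iteration(P, R))
```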