Deep RL: successes and limitations
- Atari games [Mnih et al., 2015]
- AlphaGo / AlphaZero [Silver et al., 2016; 2017]
- Parkour [Heess et al., 2017]
- In simulation = success (computation-constrained)
- In the real world = not applied…? (data-constrained)
- Computation-constrained vs. data-constrained: why robotics?

Recipe for a Good Deep RL Algorithm
- Sample efficiency
- Stability
- Scalability
- State/temporal abstraction
- Exploration
- Reset-free operation
- A universal reward
- Human-free learning
- Transferability / generalization
- Risk awareness
- Interpretability
(grouped on the slide under three headings: Algorithm, Automation, Reliability)
Outline of the talk
- Sample efficiency
  - A good off-policy algorithm: NAF [Gu et al., 2016], Q-Prop / IPG [Gu et al., 2017/2017]
  - A good model-based algorithm: TDM [Pong*, Gu* et al., 2018]
- Human-free learning
  - Safe & reset-free RL: LNT [Eysenbach, Gu et al., 2018]
  - A "universal" reward function: TDM [Pong*, Gu* et al., 2018]
- Temporal abstraction
  - Data-efficient hierarchical RL: HIRO [Nachum, Gu et al., 2018]
Notations & Definitions
- On-policy model-free, e.g. policy search ~ trial and error
- Off-policy model-free, e.g. Q-learning ~ introspection
- Model-based, e.g. MPC ~ imagination
Sample-efficiency & the RL controversy
- Model-based, off-policy, and on-policy methods offer more or less sample efficiency, learning signal, and instability
- "The cherry on the cake"
Toward a Good Off-policy Deep RL Algorithm
- Off-policy actor-critic, e.g. DDPG [Lillicrap et al., 2016]
  - No new samples needed per update!
  - Quite sensitive to hyper-parameters
- On-policy Monte Carlo policy gradient, e.g. TRPO [Schulman et al., 2015]
  - Many new samples needed per update
  - Stable but very sample-intensive
- Actor ~ trial and error; critic ~ introspection, and the critic is imperfect (not all-knowing)
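To make the off-policy actor-critic recipe above concrete, here is a minimal sketch of a DDPG-style update in PyTorch. The network and optimizer objects and the replay-buffer batch layout are assumptions for illustration, not the authors' implementation; the point is that every gradient step reuses stored transitions (no new samples per update), which is also where the hyper-parameter sensitivity comes from.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One off-policy actor-critic update on a replay-buffer batch."""
    # Assumed batch layout: tensors (state, action, reward, next_state, done).
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) toward the bootstrapped target.
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        target = r + gamma * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, pi(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the online networks with the target networks.
    for net, target_net in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target_net.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)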
"Better" DDPG
- NAF [Gu et al., 2016], Double DQN [van Hasselt et al., 2016], Dueling DQN [Wang et al., 2016], Q-Prop / IPG [Gu et al., 2017/2017], ICNN [Amos et al., 2017], SQL / SAC [Haarnoja et al., 2017/2017], GAC [Tangkaratt et al., 2018], MPO [Abdolmaleki et al., 2018], TD3 [Fujimoto et al., 2018], …
Normalized Advantage Functions (NAF) [Gu, Lillicrap, Sutskever, Levine, ICML 2016]
- Related (later) work: Dueling Network [Wang et al., 2016], ICNN [Amos et al., 2017], SQL [Haarnoja et al., 2017]
- Benefit: two objectives (actor-critic) collapse into one objective (Q-learning), roughly halving the number of hyperparameters
- Limitation: expressibility of the Q-function; doesn't work well on locomotion, works well on manipulation
- Tasks: 3-joint peg insertion; JACO arm grasp & reach
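A minimal sketch of the quadratic Q-function parameterization behind NAF, written in PyTorch (layer sizes and the tanh squashing are illustrative assumptions). Because the advantage is a negative quadratic centered at mu(s), the greedy action is mu(s) itself, which is how the actor-critic pair collapses into a single Q-learning objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NAFQFunction(nn.Module):
    """Quadratic-advantage Q-function in the spirit of NAF.

    Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)),
    so argmax_a Q(s, a) = mu(s): value and greedy policy come from one network.
    """

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.act_dim = act_dim
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)                  # V(s)
        self.mu_head = nn.Linear(hidden, act_dim)           # mu(s)
        self.l_head = nn.Linear(hidden, act_dim * act_dim)  # entries of L(s)

    def forward(self, obs, act):
        h = self.body(obs)
        v = self.v_head(h)
        mu = torch.tanh(self.mu_head(h))
        # Lower-triangular L(s) with a positive diagonal, so P = L L^T is PSD.
        l = torch.tril(self.l_head(h).view(-1, self.act_dim, self.act_dim))
        diag = torch.diagonal(l, dim1=-2, dim2=-1)
        l = l - torch.diag_embed(diag) + torch.diag_embed(F.softplus(diag))
        p = l @ l.transpose(-2, -1)
        delta = (act - mu).unsqueeze(-1)
        adv = -0.5 * (delta.transpose(-2, -1) @ p @ delta).squeeze(-1)
        return v + adv  # shape (batch, 1)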
Asynchronous NAF for Simple Manipulation [Gu*, Holly*, Lillicrap, Levine, ICRA 2017]
- Real-robot training in roughly 2.5 hours
- Shown: train-time / exploration, test-time, and disturbance-test runs
Q-Prop & Interpolated Policy Gradient (IPG)
[Gu, Lillicrap, Ghahramani, Turner, Levine, ICLR 2017]
[Gu, Lillicrap, Ghahramani, Turner, Schoelkopf, Levine, NIPS 2017]
- On-policy algorithms are stable; how do we make off-policy methods more on-policy?
- Add one equation that balances the on-policy and off-policy gradients:
  - Mixing in Monte Carlo returns
  - Trust-region policy updates
  - On-policy exploration
  - Bias trade-offs (theoretically bounded)
- Trial & error + critic
- Related concurrent work: PGQ [O'Donoghue et al., 2017], ACER [Wang et al., 2017]
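Schematically (leaving out the control-variate and bias-correction terms that the papers analyze), the interpolated gradient is a convex combination of the on-policy Monte Carlo policy gradient and an off-policy, critic-based gradient, weighted by a mixing coefficient nu:

\nabla_\theta J(\theta) \;\approx\; (1-\nu)\,\mathbb{E}_{\pi}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\hat{A}(s,a)\big] \;+\; \nu\,\mathbb{E}_{\rho}\big[\nabla_\theta Q_w\big(s, \mu_\theta(s)\big)\big]

Setting nu = 0 recovers the stable but sample-hungry on-policy estimator, nu = 1 a DDPG-style off-policy estimator, and intermediate values trade bias for sample efficiency within the theoretically bounded regime mentioned above.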
Toward a Good Model-based Deep RL Algorithm
Rethinking Q-learning
- Q-learning vs. parameterized (e.g. goal-conditioned) Q-learning
- Off-policy learning + the relabeling trick from HER [Andrychowicz et al., 2017] = unlimited relabeling
- Examples: UVF [Schaul et al., 2015], TDM [Pong*, Gu* et al., 2017]
- Introspection (off-policy model-free) + relabeling = imagination (model-based)?
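A minimal sketch of the relabeling trick on a goal-conditioned transition. The dictionary layout and the negative-distance reward are illustrative assumptions; the idea is that any stored transition can be rewritten as if a state actually reached had been the goal, so every trajectory provides learning signal for the off-policy Q-function.

import numpy as np

def relabel_with_achieved_goal(transition, achieved_goal):
    """Rewrite a goal-conditioned transition (s, a, g, s') with a new goal."""
    s, a, s_next = transition['obs'], transition['act'], transition['next_obs']
    # Illustrative goal-reaching reward: negative distance to the new goal.
    new_reward = -np.linalg.norm(s_next - achieved_goal)
    return {'obs': s, 'act': a, 'goal': achieved_goal,
            'next_obs': s_next, 'rew': new_reward}

In practice each stored step is relabeled with goals drawn from the same trajectory (for example its final state), and the Q-function is trained on both the original and the relabeled copies.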
Temporal Difference Models (TDM) [Pong*, Gu*, Dalal, Levine, ICLR 2018]
- A certain parameterized Q-function is a generalization of a dynamics model
- Efficient learning by relabeling
- Novel model-based planning
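A rough sketch of how the horizon-aware, goal-conditioned target can be computed; q_fn, the negative-distance reward, and the explicit maximization over a candidate action set are simplifying assumptions rather than the paper's exact procedure.

import numpy as np

def tdm_target(q_fn, s_next, goal, horizon, action_candidates):
    """Bootstrapped target for a goal- and horizon-conditioned Q-function."""
    if horizon == 0:
        # At the deadline the target is simply how close we actually got,
        # which is what lets the Q-function double as a learned model.
        return -np.linalg.norm(s_next - goal)
    # Otherwise bootstrap with one fewer step remaining.
    return max(q_fn(s_next, a, goal, horizon - 1) for a in action_candidates)

Because goals and horizons are just extra inputs, every transition can be relabeled with many (goal, horizon) pairs, which is where the sample efficiency comes from.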
Toward Human-free Learning
- Autonomous, continual, safe, human-free learning vs. human-administered training, manual resetting, and reward engineering

Leave No Trace (LNT) [Eysenbach, Gu, Ibarz, Levine, ICLR 2018]
- Learn to reset
- Early abort based on how likely you are to be able to get back to the initial state (a reset Q-function)
- Goal: reduce or eliminate manual resets = safe, autonomous, continual learning + curriculum
- Related work: asymmetric self-play [Sukhbaatar et al., 2017], automatic goal generation [Held et al., 2017], reverse curriculum [Florensa et al., 2017]
- Who resets the robot? PhD students.
- "Able to go and come back, leaving no trace"
A "Universal" Reward Function + Off-Policy Learning
- Goal: learn as many useful skills as possible, sample-efficiently, with minimal reward engineering
- Examples:
  - Goal-reaching rewards, e.g. UVF [Schaul et al., 2015] / HER [Andrychowicz et al., 2017], TDM [Pong*, Gu* et al., 2018]
  - Diversity rewards, e.g. SNN4HRL [Florensa et al., 2017], DIAYN [Eysenbach et al., 2018]
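Two concrete shapes such a "universal" reward can take, sketched with illustrative (assumed) inputs: a goal-reaching reward in the UVF/HER/TDM style and a diversity reward in the DIAYN style.

import numpy as np

def goal_reaching_reward(next_obs, goal):
    """Goal-reaching reward: negative distance between the state and the goal."""
    return -np.linalg.norm(next_obs - goal)

def diversity_reward(log_q_skill_given_state, log_p_skill):
    """Diversity-style reward: high when a learned discriminator can tell
    which skill produced the current state (log q(z|s) - log p(z))."""
    return log_q_skill_given_state - log_p_skill

Neither reward needs task-specific engineering, and both are compatible with off-policy learning and relabeling.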
Toward Temporal Abstraction
- When you don't know how to ride a bike… vs. when you know how to ride a bike…
- TDM learns many skills very quickly; how do we efficiently solve other problems with them?
Hierarchical Reinforcement Learning with Off-policy Correction (HIRO) [Nachum, Gu, Lee, Levine, preprint 2018]
- Most recent HRL work is on-policy, e.g. option-critic [Bacon et al., 2015], FuN [Vezhnevets et al., 2017], SNN4HRL [Florensa et al., 2017], MLSH [Frans et al., 2018], and is very data-intensive
- How to correct for off-policyness? Relabel the action: not a rewrite, but a correction.
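A rough sketch of the off-policy correction (the candidate-goal set and the log-probability interface are assumptions for illustration): when replaying an old high-level transition, the stored goal is replaced by the candidate goal under which the current low-level policy would most likely have produced the low-level actions that were actually taken.

import numpy as np

def correct_high_level_goal(candidate_goals, low_level_log_prob, states, actions):
    """Relabel a high-level action (goal) so the stored low-level behaviour
    remains likely under the current low-level policy."""
    def traj_log_prob(goal):
        # low_level_log_prob(s, a, g) is assumed to return log pi_lo(a | s, g).
        return sum(low_level_log_prob(s, a, goal) for s, a in zip(states, actions))
    scores = [traj_log_prob(g) for g in candidate_goals]
    return candidate_goals[int(np.argmax(scores))]

This is the sense in which the action is "corrected" rather than merely relabeled: the stored high-level transition is repaired so that off-policy training of the high-level policy stays consistent with the ever-changing low-level policy.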
HIRO (cont.) [Nachum, Gu, Lee, Levine, preprint 2018]
- Results figure: test rewards at 20,000 episodes on Ant Maze, Ant Push, and Ant Fall, compared against [Vezhnevets et al., 2017], [Florensa et al., 2017], and [Houthooft et al., 2016]
Optimizing for computation alone is not enough; we must also optimize for sample efficiency.