ACM Machine Learning Course: Guest Lecture Slides (Shixiang Gu)

Deep RL: successes and limitations
- Atari games [Mnih et al., 2015]
- AlphaGo / AlphaZero [Silver et al., 2016; 2017]
- Parkour [Heess et al., 2017]

Why Robotics?
- Simulation = success: computation-constrained
- Real world = not applied…? Data-constrained

Recipe for a Good Deep RL Algorithm
- Sample-efficiency
- Stability
- Scalability
- State/temporal abstraction
- Exploration
- Reset-free
- Universal reward
- Human-free learning
- Transferability / generalization
- Risk-awareness
- Interpretability
Three themes: Algorithm, Automation, Reliability.

Outline of the talk
- Sample-efficiency
  - Good off-policy algorithms: NAF [Gu et al., 2016], Q-Prop/IPG [Gu et al., 2017/2017]
  - Good model-based algorithms: TDM [Pong*, Gu* et al., 2018]
- Human-free learning
  - Safe & reset-free RL: LNT [Eysenbach, Gu et al., 2018]
  - "Universal" reward function: TDM [Pong*, Gu* et al., 2018]
- Temporal abstraction
  - Data-efficient hierarchical RL: HIRO [Nachum, Gu et al., 2018]

Notations & Definitions
- On-policy model-free (e.g. policy search): trial and error
- Off-policy model-free (e.g. Q-learning): introspection
- Model-based (e.g. MPC): imagination

Sample-efficiency & RL controversy
- A spectrum from model-based to off-policy to on-policy: sample-efficiency, learning signals, and instability all decrease along it.
- "The cherry on the cake."

Toward a Good Off-policy Deep RL Algorithm
- Off-policy actor-critic, e.g. DDPG [Lillicrap et al., 2016]
  - No new samples needed per update! (sketched below)
  - Quite sensitive to hyper-parameters
  - Actor: trial & error. Critic: introspection, imperfect (not omniscient).
- On-policy Monte Carlo policy gradient, e.g. TRPO [Schulman et al., 2015]
  - Many new samples needed per update
  - Stable but very sample-intensive
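To make the contrast concrete, below is a minimal sketch of one DDPG-style off-policy actor-critic update from a replay batch. Layer sizes, learning rates, and names are illustrative, and target networks and exploration noise are omitted.

```python
# Minimal sketch of one DDPG-style off-policy actor-critic update.
# Illustrative sizes/names; target networks and exploration noise omitted.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    """One gradient step from a replay-buffer batch (s, a, r, s2, done)."""
    s, a, r, s2, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped target ("introspection").
    with torch.no_grad():
        target = r + gamma * (1 - done) * critic(torch.cat([s2, actor(s2)], dim=-1))
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: follow the critic's action-gradient; no new environment samples needed.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Usage with a random stand-in for a replay batch of 32 transitions:
batch = (torch.randn(32, obs_dim), torch.rand(32, act_dim) * 2 - 1,
         torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1))
update(batch)
```

The replay buffer is what removes the need for new environment samples per update, while the bootstrapped critic is a main source of the hyper-parameter sensitivity noted above.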

"Better" DDPG
NAF [Gu et al., 2016], Double DQN [Hasselt et al., 2016], Dueling DQN [Wang et al., 2016], Q-Prop/IPG [Gu et al., 2017/2017], ICNN [Amos et al., 2017], SQL/SAC [Haarnoja et al., 2017/2017], GAC [Tangkaratt et al., 2018], MPO [Abdolmaleki et al., 2018], TD3 [Fujimoto et al., 2018], …

Normalized Advantage Functions (NAF)
[Gu, Lillicrap, Sutskever, Levine, ICML 2016]
- Related (later) work: Dueling Network [Wang et al., 2016], ICNN [Amos et al., 2017], SQL [Haarnoja et al., 2017]
- Benefit: two objectives (actor-critic) become one objective (Q-learning); halves the number of hyperparameters (sketched below)
- Limitation: expressibility of the Q-function
- Doesn't work well on locomotion; works well on manipulation
- 3-joint peg insertion; JACO arm grasp & reach
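A minimal sketch of the NAF parameterization: the Q-function is a state value plus a quadratic advantage centered at mu(s), so argmax_a Q(s, a) = mu(s) is available in closed form and a single Q-learning objective trains both the greedy action and the value. Module names and sizes are illustrative; the original also keeps the diagonal of the Cholesky factor strictly positive, which is omitted here.

```python
# Minimal sketch of a NAF-style Q-function head: Q(s, a) = V(s) + A(s, a),
# with A a quadratic in the action centered at the greedy action mu(s).
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.V = nn.Linear(hidden, 1)                  # state value
        self.mu = nn.Linear(hidden, act_dim)           # greedy action
        self.L = nn.Linear(hidden, act_dim * act_dim)  # Cholesky factor of P(s)
        self.act_dim = act_dim

    def forward(self, s, a):
        h = self.body(s)
        mu = torch.tanh(self.mu(h))
        L = torch.tril(self.L(h).view(-1, self.act_dim, self.act_dim))
        P = L @ L.transpose(1, 2)                      # positive semi-definite matrix
        d = (a - mu).unsqueeze(-1)
        adv = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(-1)  # quadratic advantage <= 0
        return self.V(h) + adv, mu                     # Q(s, a) and argmax_a Q(s, .)

q = NAFHead(obs_dim=8, act_dim=2)
s, a = torch.randn(4, 8), torch.rand(4, 2) * 2 - 1
q_values, greedy_actions = q(s, a)
```

Because the advantage is forced to be quadratic in the action, the Q-function loses expressibility, which is the limitation noted above.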

Asynchronous NAF for Simple Manipulation
[Gu*, Holly*, Lillicrap, Levine, ICRA 2017]
- 2.5 hours of train time / exploration; test time; disturbance test

Q-Prop & Interpolated Policy Gradient (IPG)
[Gu, Lillicrap, Ghahramani, Turner, Levine, ICLR 2017]
[Gu, Lillicrap, Ghahramani, Turner, Schoelkopf, Levine, NIPS 2017]
- On-policy algorithms are stable. How to make off-policy more on-policy?
- Add one equation balancing the on-policy and off-policy gradients (sketched below)
- Mixing Monte Carlo returns
- Trust-region policy update
- On-policy exploration
- Bias trade-offs (theoretically bounded)
- Trial & error + critic
- Related concurrent work: PGQ [O'Donoghue et al., 2017], ACER [Wang et al., 2017]
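Schematically, the "one equation" interpolates an on-policy Monte Carlo policy gradient with an off-policy critic-based gradient through a mixing weight nu in [0, 1]; this is a paraphrase of the idea, not the papers' exact notation.

```latex
\nabla_\theta J(\theta) \;\approx\;
(1-\nu)\,\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\,\hat{A}(s,a)\right]
\;+\; \nu\,\mathbb{E}_{\beta}\!\left[\nabla_\theta Q_w\!\left(s, \mu_\theta(s)\right)\right]
```

The first term uses fresh on-policy samples (trial and error with Monte Carlo returns); the second reuses replay data through the learned critic (introspection). The mixing weight trades bias against sample-efficiency, and the bias is theoretically bounded.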

Toward a Good Model-based Deep RL Algorithm
- Rethinking Q-learning: Q-learning vs. parameterized Q-learning
- Plus "infinite relabeling": off-policy learning + the relabeling trick from HER [Andrychowicz et al., 2017] (sketched below)
- Examples: UVF [Schaul et al., 2015], TDM [Pong*, Gu* et al., 2017]
- Introspection (off-policy model-free) + relabeling = imagination (model-based)?
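A minimal sketch of the relabeling trick with a goal-conditioned Q-function: transitions are duplicated with goals taken from states the agent actually reached later in the same trajectory, so a sparse goal-reaching reward becomes informative. The reward definition, the tolerance, and the k extra goals per transition are illustrative.

```python
# Minimal sketch of HER-style goal relabeling for a goal-conditioned Q-function.
import random
import numpy as np

def relabel(episode, k=4):
    """episode: list of (s, a, s_next) transitions. Returns extra goal-conditioned
    transitions whose goal is a state actually reached later in the trajectory."""
    out = []
    for t, (s, a, s_next) in enumerate(episode):
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)   # pick a future step
            g = episode[future][2]                         # achieved state as new goal
            r = 0.0 if np.allclose(s_next, g, atol=1e-2) else -1.0
            out.append((s, a, g, r, s_next))               # (state, action, goal, reward, next)
    return out

# Usage on a toy 1-D trajectory:
episode = [(np.array([i]), np.array([1.0]), np.array([i + 1])) for i in range(5)]
extra = relabel(episode)
```

Because UVF/TDM-style Q-functions are conditioned on the goal, these relabeled transitions are valid off-policy training data, which is what turns introspection plus relabeling into something resembling imagination.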

Temporal Difference Models (TDM)
[Pong*, Gu*, Dalal, Levine, ICLR 2018]
- A certain parameterized Q-function is a generalization of a dynamics model (see below)
- Efficient learning by relabeling
- Novel model-based planning
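One way to read the first bullet is the TDM-style goal- and horizon-conditioned Bellman target, shown here in schematic form rather than the paper's exact notation: at remaining horizon tau = 0 the Q-value just measures how close the next state lands to the goal, i.e. it predicts dynamics, and for tau > 0 it bootstraps with the horizon decremented.

```latex
Q(s, a, g, \tau) \;=\; \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[
  -\lVert s' - g \rVert \,\mathbb{1}[\tau = 0]
  \;+\; \max_{a'} Q(s', a', g, \tau - 1)\,\mathbb{1}[\tau > 0]
\right]
```

Since g and tau are free arguments, every stored transition can be relabeled with many goals and horizons, which is the source of the efficient learning by relabeling.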

Toward Human-free Learning
- Autonomous, continual, safe, human-free
- vs. human-administered, manual resetting, reward engineering

Leave No Trace (LNT)
[Eysenbach, Gu, Ibarz, Levine, ICLR 2018]
- Learn to reset: early abort based on how likely you are to get back to the initial state (reset Q-function), sketched below
- Goal: reduce/eliminate manual resets = safe, autonomous, continual learning + curriculum
- Related work: asymmetric self-play [Sukhbaatar et al., 2017], automatic goal generation [Held et al., 2017], reverse curriculum [Florensa et al., 2017]
- Who resets the robot? PhD students.
- "Able to go out and come back without leaving a trace."
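A minimal sketch of the early-abort rule: before each forward step, consult the learned reset Q-function, and if the estimated chance of getting back to the initial state falls below a threshold, hand control to the reset policy instead. The probabilistic reading of the reset Q-function, the names, and the threshold are illustrative assumptions.

```python
# Minimal sketch of an LNT-style early abort between a forward and a reset policy.

def choose_policy(state, forward_policy, reset_policy, reset_q, threshold=0.8):
    """Return (action, aborted): abort to the reset policy when the estimated
    probability of successfully resetting from `state` falls below `threshold`."""
    reset_action = reset_policy(state)
    if reset_q(state, reset_action) < threshold:
        return reset_action, True       # early abort: stay in the resettable region
    return forward_policy(state), False

# Toy usage on a 1-D state where resets succeed only near the origin:
reset_q = lambda s, a: 1.0 - min(abs(s) / 10.0, 1.0)   # fake "can I get back?" estimate
forward = lambda s: +1.0                                # forward policy drifts away
reset = lambda s: -1.0 if s > 0 else +1.0               # reset policy moves toward 0
action, aborted = choose_policy(7.5, forward, reset, reset_q)   # aborted == True
```

Early aborts keep the agent inside the region it can recover from (safety, fewer manual resets) and concentrate early experience near the initial state, which acts as an automatic curriculum.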

A "Universal" Reward Function + Off-Policy Learning
- Goal: learn as many useful skills as possible, sample-efficiently, with minimal reward engineering
- Examples:
  - Goal-reaching reward, e.g. UVF [Schaul et al., 2015] / HER [Andrychowicz et al., 2017], TDM [Pong*, Gu* et al., 2018]
  - Diversity reward, e.g. SNN4HRL [Florensa et al., 2017], DIAYN [Eysenbach et al., 2018]

Toward Temporal Abstractions
- When you don't know how to ride a bike… / When you know how to ride a bike…
- TDM learns many skills very quickly. How to efficiently solve other problems?

Hierarchical Reinforcement Learning with Off-policy Correction (HIRO)
[Nachum, Gu, Lee, Levine, preprint 2018]
- Most recent HRL work is on-policy, e.g. option-critic [Bacon et al., 2015], FuN [Vezhnevets et al., 2017], SNN4HRL [Florensa et al., 2017], MLSH [Frans et al., 2018]: VERY data-intensive
- How to correct for being off-policy? Relabel the action (sketched below). It is not relabeling; it is a correction.
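A minimal sketch of the off-policy correction: when replaying an old high-level transition, its goal is replaced by the candidate under which the current low-level policy best explains the low-level actions that were actually executed. The candidate set, the Gaussian-style squared-error log-likelihood, and the absolute-goal toy policy are illustrative assumptions.

```python
# Minimal sketch of HIRO-style off-policy correction of a high-level goal.
import numpy as np

def correct_goal(states, actions, original_goal, low_level_policy, n_candidates=8):
    """states: (T+1, s_dim) low-level states; actions: (T, a_dim) executed actions."""
    candidates = [original_goal, states[-1] - states[0]]      # include the reached offset
    candidates += [original_goal + np.random.randn(*original_goal.shape)
                   for _ in range(n_candidates)]
    def log_prob(goal):
        # Unnormalized Gaussian log-likelihood: -sum_t ||a_t - mu_lo(s_t, goal)||^2.
        return -sum(np.sum((a - low_level_policy(s, goal)) ** 2)
                    for s, a in zip(states[:-1], actions))
    return max(candidates, key=log_prob)

# Toy usage: a low-level policy that walks straight toward its (absolute) goal.
low_level_policy = lambda s, g: np.clip(g - s, -1.0, 1.0)
states = np.array([[0.0], [1.0], [2.0], [3.0]])
actions = np.array([[1.0], [1.0], [1.0]])
relabeled = correct_goal(states, actions, np.array([0.5]), low_level_policy)
```

Because the corrected goal is chosen to explain the stored low-level behaviour, old high-level transitions remain valid training data after the low level changes, which is what lets the hierarchy be trained off-policy and data-efficiently.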

HIRO (cont.)
[Nachum, Gu, Lee, Levine, preprint 2018]
[Results figure: test rewards at 20000 episodes on Ant Maze, Ant Push, and Ant Fall, compared against [Vezhnevets et al., 2017], [Florensa et al., 2017], and [Houthooft et al., 2016].]

Optimizing for computation alone is not enough; we must also optimize for sample-efficiency.
