




Advanced Artificial Intelligence, Chapter 10: Reinforcement Learning
Shi Zhongzhi, Institute of Computing Technology, Chinese Academy of Sciences

Outline: Introduction; The reinforcement learning model; Dynamic programming; Monte Carlo methods; Temporal-difference learning; Q-learning; Function approximation in reinforcement learning; Applications.

Introduction
Humans usually learn from interaction with the environment. Reinforcement learning is the learning of a mapping from environment states to actions, so as to maximize the cumulative reward the system obtains from the environment. In reinforcement learning we design algorithms that turn the state of the external environment into actions that maximize the amount of reward. The agent is not told directly what to do or which action to take; instead it discovers by itself which actions yield the most reward by trying them. An action affects not only the immediate reward but also the subsequent actions and the final reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Introduction
Reinforcement learning techniques grew out of control theory, statistics, psychology, and related disciplines, and can be traced back to Pavlov's conditioning experiments. Not until the late 1980s and early 1990s, however, did reinforcement learning become widely studied and applied in artificial intelligence, machine learning, and automatic control, where it is now regarded as one of the core techniques for designing intelligent systems. In particular, after breakthroughs in its mathematical foundations, research on and applications of reinforcement learning expanded rapidly, making it one of the current hot topics in machine learning.

Introduction
The idea of reinforcement first came from psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (it is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal feel uncomfortable weakens its association with the situation and is less likely to recur when the situation reappears. In other words, which behaviors are "remembered" and associated with a stimulus depends on the effects the behaviors produce. An animal's trial-and-error learning thus has two aspects, selectional and associative, corresponding computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis; in the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence" and has been widely used ever since. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
From 1953 to 1957 Bellman developed an effective method for solving optimal control problems: dynamic programming. In 1957 Bellman also introduced the stochastic, discrete version of the optimal control problem, the well-known Markov decision process (MDP), and in 1960 Howard proposed the policy iteration method for Markov decision processes. These results form the theoretical foundation of modern reinforcement learning. In 1972 Klopf combined trial-and-error learning with temporal differences. From 1978 onward, Sutton, Barto, Moore, Klopf, and others studied this combination in depth. In 1989 Watkins proposed Q-learning [Watkins 1989], tying the three threads of reinforcement learning together. In 1992 Tesauro successfully applied reinforcement learning to backgammon, producing TD-Gammon.

The reinforcement learning model
An agent interacts with its environment (i: input, r: reward, s: state, a: action). At each step the agent observes state s_i, performs action a_i, receives reward r_{i+1}, and the environment moves to state s_{i+1}, producing the trajectory s_0, a_0, s_1, a_1, s_2, a_2, s_3, ...
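The interaction loop just described can be written down directly. The sketch below is illustrative only: the toy Env class, its reset/step interface, and the random policy are assumptions made for the example, not part of the original slides.

    import random

    class Env:
        """A toy two-state environment, assumed for illustration."""
        def reset(self):
            self.state = 0
            return self.state
        def step(self, action):
            # Action 1 moves toward the terminal state 1 and pays a reward.
            self.state = 1 if action == 1 else 0
            reward = 1.0 if self.state == 1 else 0.0
            done = self.state == 1
            return self.state, reward, done

    def run_episode(env, policy, max_steps=100):
        """One episode of the s0, a0, r1, s1, a1, r2, ... interaction loop."""
        s = env.reset()
        total = 0.0
        for _ in range(max_steps):
            a = policy(s)                 # the agent chooses an action from the state
            s, r, done = env.step(a)      # the environment returns next state and reward
            total += r
            if done:
                break
        return total

    print(run_episode(Env(), policy=lambda s: random.choice([0, 1])))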
6、述一個環(huán)境(問題)Accessible vs. inaccessibleDeterministic vs. non-deterministicEpisodic vs. non-episodicStatic vs. dynamicDiscrete vs. continuousThe most complex general class of environments are inaccessible, non-deterministic, non-episodic, dynamic, and continuous.9向陽書屋p強化學習問題Agent-environment interaction
7、States, Actions, RewardsTo define a finite MDPstate and action sets : S and Aone-step “dynamics” defined by transition probabilities (Markov Property):reward probabilities:EnvironmentactionstaterewardRLAgent10向陽書屋p與監(jiān)督學習對比Reinforcement Learning Learn from interactionlearn from its own experience, and
8、 the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.RLSystemInputsOutputs (“actions”)Training Info = evaluations (“rewards” / “penalties”)Supervised Learning Learn from exampl
9、es provided by a knowledgable external supervisor.11向陽書屋p強化學習要素Policy: stochastic rule for selecting actionsReturn/Reward: the function of future rewards agent tries to maximizeValue: what is good because it predicts rewardModel: what follows whatPolicyRewardValueModel ofenvironmentIs unknownIs my g
10、oalIs I can getIs my method12向陽書屋p在策略下的Bellman公式The basic idea: So: Or, without the expectation operator: is the discount rate13向陽書屋pBellman最優(yōu)策略公式其中:V*:狀態(tài)值映射S: 環(huán)境狀態(tài)R: 獎勵函數(shù)P: 狀態(tài)轉移概率函數(shù): 折扣因子14向陽書屋p馬爾可夫決策過程 MARKOV DECISION PROCESS 由四元組定義。 環(huán)境狀態(tài)集S 系統(tǒng)行為集合A 獎勵函數(shù)R:SA 狀態(tài)轉移函數(shù)P:SAPD(S) 記R(s,a,s)為系統(tǒng)在狀態(tài)s采用a動作使環(huán)境
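For a finite MDP the Bellman equation for V^π is a linear system and can be solved directly. The two-state chain, its rewards, and the policy-averaged transition matrix below are made-up numbers for illustration; only the equation itself comes from the slides.

    import numpy as np

    # A made-up 2-state MDP under a fixed policy pi:
    # P_pi[s, s'] = sum_a pi(s,a) P(s,a,s'),  r_pi[s] = expected one-step reward.
    P_pi = np.array([[0.5, 0.5],
                     [0.1, 0.9]])
    r_pi = np.array([1.0, 0.0])
    gamma = 0.9

    # Bellman equation in matrix form: v = r_pi + gamma * P_pi v
    # => (I - gamma * P_pi) v = r_pi
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    print(v)  # state values V^pi(s) for the two states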
11、狀態(tài)轉移到s獲得的瞬時獎勵值;記P(s,a,s)為系統(tǒng)在狀態(tài)s采用a動作使環(huán)境狀態(tài)轉移到s的概率。15向陽書屋p馬爾可夫決策過程 MARKOV DECISION PROCESS馬爾可夫決策過程的本質是:當前狀態(tài)向下一狀態(tài)轉移的概率和獎勵值只取決于當前狀態(tài)和選擇的動作,而與歷史狀態(tài)和歷史動作無關。因此在已知狀態(tài)轉移概率函數(shù)P和獎勵函數(shù)R的環(huán)境模型知識下,可以采用動態(tài)規(guī)劃技術求解最優(yōu)策略。而強化學習著重研究在P函數(shù)和R函數(shù)未知的情況下,系統(tǒng)如何學習最優(yōu)行為策略。16向陽書屋pMARKOV DECISION PROCESSCharacteristics of MDP:a set of states
12、: Sa set of actions : Aa reward function :R : S x A RA state transition function:T: S x A ( S) T (s,a,s): probability of transition from s to s using action a17向陽書屋p馬爾可夫決策過程 MARKOV DECISION PROCESS18向陽書屋pMDP EXAMPLE:TransitionfunctionStates and rewardsBellman Equation:(Greedy policy selection)19向陽書屋
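The four-tuple (S, A, R, P) maps naturally onto a small data structure. The sketch below is a minimal, assumed encoding (dictionaries keyed by state-action pairs); it is not code from the chapter, and the toy numbers are illustrative.

    from dataclasses import dataclass, field

    @dataclass
    class MDP:
        """Finite MDP as a four-tuple (S, A, R, P)."""
        states: list
        actions: list
        # P[(s, a)] is a list of (s_next, probability) pairs; R[(s, a, s_next)] is a reward.
        P: dict = field(default_factory=dict)
        R: dict = field(default_factory=dict)

        def expected_reward(self, s, a):
            # E[r | s, a] = sum_{s'} P(s,a,s') R(s,a,s')
            return sum(p * self.R[(s, a, s2)] for s2, p in self.P[(s, a)])

    # A toy 2-state, 2-action example.
    mdp = MDP(states=[0, 1], actions=["stay", "go"])
    mdp.P[(0, "stay")] = [(0, 1.0)]
    mdp.P[(0, "go")]   = [(1, 0.8), (0, 0.2)]
    mdp.P[(1, "stay")] = [(1, 1.0)]
    mdp.P[(1, "go")]   = [(1, 1.0)]
    mdp.R = {(0, "stay", 0): 0.0, (0, "go", 1): 1.0, (0, "go", 0): 0.0,
             (1, "stay", 1): 0.0, (1, "go", 1): 0.0}
    print(mdp.expected_reward(0, "go"))  # 0.8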
MDP graphical representation
The transition function T(s, action, s') can be drawn as a graph over states; note the similarity to hidden Markov models (HMMs).

Reinforcement learning in a grid world
Deterministic transitions vs. stochastic transitions: P^a_{ij} is the probability of reaching state j when taking action a in state i. A simple 4 × 3 grid environment (start state in the lower-left corner, terminal states with rewards +1 and −1 in the upper-right) presents the agent with a sequential decision problem. Move cost = 0.04. This is a (temporal) credit assignment problem with sparse reinforcement. Offline algorithm: the action sequence is determined ex ante. Online algorithm: the action sequence is conditional on observations along the way, which is important in a stochastic environment (e.g. flying a jet).

Reinforcement learning in the stochastic grid world
Stochastic model M: with probability 0.8 the agent moves in the direction it intends, and with probability 0.2 it moves perpendicular to it (0.1 left, 0.1 right). A policy is a mapping from states to actions. For this stochastic environment an optimal policy can be computed, together with the utilities of the states: 0.705, 0.762, 0.812, 0.868, 0.912, 0.660, 0.655, 0.611, 0.388.
Environment: observable (accessible), where the percept identifies the state, or partially observable. Markov property: the transition probabilities depend only on the state, not on the path to the state; this gives a Markov decision problem (MDP). In a partially observable MDP (POMDP) the percepts do not carry enough information to identify the transition probabilities.
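A sketch of the 4 × 3 stochastic grid world described above (0.8 in the intended direction, 0.1 to each side, 0.04 move cost, terminals +1 and −1). The coordinate convention, the blocked cell at (1, 1), and the helper names are assumptions chosen to match the usual textbook layout, not details given on the slides.

    # 4 x 3 grid world: columns x = 0..3, rows y = 0..2, (1,1) is a wall.
    # Terminals: (3,2) -> +1, (3,1) -> -1.  Step cost 0.04 for non-terminal cells.
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    LEFT_OF = {"up": "left", "left": "down", "down": "right", "right": "up"}
    RIGHT_OF = {v: k for k, v in LEFT_OF.items()}
    TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}
    WALLS = {(1, 1)}

    def move(state, direction):
        """Deterministic move; bumping into a wall or the border leaves the state unchanged."""
        x, y = state
        dx, dy = ACTIONS[direction]
        nxt = (x + dx, y + dy)
        if nxt in WALLS or not (0 <= nxt[0] <= 3 and 0 <= nxt[1] <= 2):
            return state
        return nxt

    def transitions(state, action):
        """P(s' | s, a): 0.8 intended direction, 0.1 each perpendicular direction."""
        if state in TERMINALS:
            return [(state, 1.0)]
        return [(move(state, action), 0.8),
                (move(state, LEFT_OF[action]), 0.1),
                (move(state, RIGHT_OF[action]), 0.1)]

    def reward(state):
        return TERMINALS.get(state, -0.04)

    print(transitions((0, 0), "right"))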
Dynamic programming
Dynamic programming methods compute the value function by backing up values from successor states to predecessor states, updating the states one after another based on a model of the next-state distribution. The dynamic programming approach to reinforcement learning rests on the fact that for any policy π and any state s the iterative consistency equation (10.9) holds:
    V^π(s) = Σ_a π(s, a) Σ_{s'} P(s, a, s') [ R(s, a, s') + γ V^π(s') ],
where π(s, a) is the probability of action a in state s under the stochastic policy π, and P(s, a, s') is the probability of moving from s to s' under action a. This is Bellman's (1957) equation for V^π.

Dynamic programming: the problem
A discrete-time dynamic system with states 1, ..., n plus a termination state 0; controls U(i); transition probabilities p_ij(u); an accumulative cost structure; policies.

Finite-horizon problem; infinite-horizon problem; value iteration as the iterative solution.

Policy iteration and value iteration in dynamic programming
Policy iteration alternates policy evaluation with policy improvement ("greedification"); value iteration folds the improvement step into the backup.

Dynamic programming backups
(Backup diagram omitted: dynamic programming performs full-width backups over all successor states.)

Adaptive dynamic programming (ADP)
Idea: use the constraints (state-transition probabilities) between states to speed up learning. Solving the resulting system of equations is value determination; there is no maximization over actions because the agent is passive, unlike in value iteration. For a large state space this becomes impractical, e.g. backgammon: 10^50 equations in 10^50 variables.

Value iteration algorithm
An alternative iteration (Singh, 1993) is important for model-free learning. Stop the iteration when V(s) changes by less than ε. The policy difference ratio is then at most 2εγ / (1 − γ) (Williams & Baird, 1993b).
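A minimal value iteration sketch over the same dictionary encoding used earlier (P[(s,a)] as (s', prob) pairs, R[(s,a,s')] as rewards); the stopping threshold and the toy MDP are assumptions made for illustration.

    def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
        """V(s) <- max_a sum_{s'} P(s,a,s') [ R(s,a,s') + gamma V(s') ]."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                if not any((s, a) in P for a in actions):
                    continue  # state with no modeled actions (e.g. terminal)
                best = max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                               for s2, p in P[(s, a)])
                           for a in actions if (s, a) in P)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < eps:
                return V

    # Toy 2-state example (illustrative numbers).
    states, actions = [0, 1], ["stay", "go"]
    P = {(0, "stay"): [(0, 1.0)], (0, "go"): [(1, 0.8), (0, 0.2)],
         (1, "stay"): [(1, 1.0)], (1, "go"): [(1, 1.0)]}
    R = {(0, "stay", 0): 0.0, (0, "go", 1): 1.0, (0, "go", 0): 0.0,
         (1, "stay", 1): 0.0, (1, "go", 1): 0.0}
    print(value_iteration(states, actions, P, R))

A greedy policy can then be read off by taking the argmax of the same backup at each state, which is exactly the "greedification" step of policy iteration.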
Policy iteration algorithm
Policies converge faster than values. Why the faster convergence?

Dynamic programming
Classical dynamic programming models are of limited use, because for many problems it is hard to give a complete model of the environment. Simulated robot soccer is such a problem; it can be tackled with real-time dynamic programming, which does not require an environment model in advance but obtains one by continual testing in the real environment. A back-propagation neural network can be used to generalize over states: the network's input units encode the environment state s and its output is the evaluation V(s) of that state.

Model-free methods
The models of the environment are T: S × A → Π(S) and R: S × A → ℝ. Do we know them? Do we have to know them? Model-free methods include Monte Carlo methods, the adaptive heuristic critic, and Q-learning.

Monte Carlo methods
Monte Carlo methods do not require a complete model. Instead they sample entire trajectories of states and update the value function based on the final outcome of each sample. Monte Carlo methods require only experience, i.e. sampled sequences of states, actions, and rewards from online or simulated interaction with the environment. Online experience is interesting because it requires no prior knowledge of the environment yet can still lead to optimal behavior. Learning from simulated experience is also powerful: it requires a model, but the model only needs to be generative rather than analytic, i.e. it must be able to generate trajectories but need not compute explicit probabilities. Hence Monte Carlo methods do not need the complete probability distribution over all possible transitions that dynamic programming requires. (Backup diagram omitted: Monte Carlo backups follow a single sampled trajectory to the end of the episode.)

Monte Carlo methods
Idea: keep statistics about the rewards obtained from each state and take their average; this average is the estimate of V(s). The method is based only on experience, assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected), and is incremental in an episode-by-episode sense, not a step-by-step sense.

Monte Carlo policy evaluation
Goal: learn V^π(s) when P and R are unknown in advance. Given: some number of episodes under π which contain s. Idea: average the returns observed after visits to s. Every-visit MC averages the returns for every time s is visited in an episode; first-visit MC averages the returns only for the first time s is visited in an episode. Both converge asymptotically.
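A first-visit Monte Carlo policy-evaluation sketch. The episode format (a list of (state, reward) pairs) and the sample data are assumptions for illustration; only the averaging rule comes from the slides.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma=1.0):
        """Average the return following the first visit to each state in each episode."""
        returns = defaultdict(list)
        for episode in episodes:               # episode = [(s0, r1), (s1, r2), ...]
            G = 0.0
            rets = []
            for s, r in reversed(episode):     # compute returns backwards: G <- r + gamma G
                G = r + gamma * G
                rets.append((s, G))
            rets.reverse()
            seen = set()
            for s, G in rets:                  # keep only the first visit to each state
                if s not in seen:
                    seen.add(s)
                    returns[s].append(G)
        return {s: sum(v) / len(v) for s, v in returns.items()}

    episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("A", 0.0), ("B", 1.0)]]
    print(first_visit_mc(episodes))  # estimated V(s) for states A and B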
Monte Carlo methods: maintaining exploration
Problem: unvisited state-action pairs (the problem of maintaining exploration). For every (s, a), make sure that P((s, a) is selected as a start state and action) > 0 (the assumption of exploring starts).

Monte Carlo control
How to select policies (similar to policy evaluation): MC policy iteration performs policy evaluation using MC methods followed by policy improvement; the policy improvement step greedifies with respect to the value (or action-value) function.

Temporal-difference learning
Temporal-difference (TD) learning uses no environment model and learns from experience, updating at every step rather than waiting for the task to finish. As a control method built on a predictive model, it judges future inputs and outputs from historical information, emphasizing the function of the model rather than its structure. Like Monte Carlo methods, TD methods sample the immediate reward obtained in each learning cycle; like dynamic programming they bootstrap, estimating a state's value from the estimates of its successor states, and over many iterations they approach the true state-value function. (Backup diagram omitted: TD backups look one sampled step ahead and then bootstrap.)

Temporal-difference learning
Monte Carlo target: the actual return after time t. TD target: an estimate of the return, r_{t+1} + γ V(s_{t+1}).

Temporal-difference learning (TD)
Idea: do ADP-style backups on a per-move basis, not for the whole state space. Theorem: the average value of U(i) converges to the correct value. Theorem: if α is decreased appropriately as a function of the number of times a state is visited (α = α(N_i)), then U(i) itself converges to the correct value.

TD(λ): a forward view
TD(λ) is a method for averaging all n-step backups, weighted by λ^{n−1} (time since visitation). The λ-return is
    R_t^λ = (1 − λ) Σ_{n≥1} λ^{n−1} R_t^{(n)},
and the backup using the λ-return is ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ].

TD(λ) learning algorithm
Idea: update from the whole epoch, not just on a single state transition. Special cases: λ = 1 gives least-mean-square (LMS), i.e. Monte Carlo; λ = 0 gives TD(0). An intermediate choice of λ (between 0 and 1) is best; the choice interacts with α.

Algorithm 10.1 The TD(0) learning algorithm
Initialize V(s) arbitrarily, and let π be the policy to be evaluated.
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using the policy derived from V (e.g. ε-greedy)
        Take action a, observe r, s'
        V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
        s ← s'
    Until s is terminal
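A runnable sketch of Algorithm 10.1. The environment interface (reset/step returning next state, reward, and a termination flag) and the toy random-walk chain are assumptions; the update line is the TD(0) rule above.

    import random

    class RandomWalk:
        """Toy 5-state random walk: terminal at both ends, reward 1 on the right exit."""
        def reset(self):
            self.s = 2
            return self.s
        def step(self, a):          # the action is ignored in this toy chain
            self.s += random.choice([-1, 1])
            done = self.s in (0, 4)
            return self.s, (1.0 if self.s == 4 else 0.0), done

    def td0(env, policy, alpha=0.1, gamma=1.0, episodes=2000):
        """TD(0): V(s) <- V(s) + alpha [ r + gamma V(s') - V(s) ]."""
        V = {}
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = policy(s)
                s2, r, done = env.step(a)
                V.setdefault(s, 0.0)
                V.setdefault(s2, 0.0)
                target = r + (0.0 if done else gamma * V[s2])
                V[s] += alpha * (target - V[s])
                s = s2
        return V

    print(td0(RandomWalk(), policy=lambda s: None))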
Temporal-difference learning algorithm
(Figure omitted.)

Convergence of the TD(λ) learning algorithm
Theorem: TD(λ) converges with probability 1 under certain boundary conditions, provided the step sizes α_i(t) are decreased such that Σ_t α_i(t) = ∞ and Σ_t α_i(t)^2 < ∞. In practice a fixed α is often used for all i and t.

Q-learning (Watkins, 1989)
In Q-learning the backup starts from an action node and maximizes over all possible actions of the next state and their rewards. In the fully recursive definition of Q-learning, the bottom nodes of the backup tree are all the terminal nodes that can be reached by a sequence of actions starting from the root node, together with the rewards of their successor actions. Online Q-learning expands forward from the possible actions and does not need to build a complete world model; Q-learning can also be executed offline. As this shows, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, Q is a function from state-action pairs to learned values; for all states and actions, Q: (state × action) → value. One step of Q-learning is
    Q(s_t, a_t) ← Q(s_t, a_t) + c [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],   (10.15)
where c and γ are both no greater than 1 and r_{t+1} is the reward received on reaching state s_{t+1}.

Q-learning with function approximation
Estimate the Q-function using some approximator (for example, linear regression, neural networks, decision trees, etc.). Derive the estimated policy as an argmax of the estimated Q-function. Allow different parameter vectors at different time points. We illustrate the algorithm with linear regression as the approximator and, of course, squared error as the appropriate loss function.

Q-learning
For Q(a, i), a direct approach (ADP) would require learning a model; Q-learning does not. Perform this update after each state transition:
    Q(a, i) ← Q(a, i) + α [ r + γ max_{a'} Q(a', j) − Q(a, i) ].

Exploration
There is a trade-off between exploitation (control) and exploration (identification); the extremes are greedy and random acting (n-armed bandit models). Q-learning converges to the optimal Q-values if every state is visited infinitely often (due to exploration), the action selection becomes greedy as time approaches infinity, and the learning rate α is decreased fast enough but not too fast (as discussed for TD learning).

Common exploration methods
In value iteration in an ADP agent, use an optimistic estimate of the utility U+(i). The ε-greedy method takes non-greedy actions with probability ε and the greedy action otherwise. Boltzmann exploration selects actions with probabilities proportional to exp(Q/T). An exploration function can also be used, e.g. f(u, n) = R+ if n < N_u, and u otherwise.
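Sketches of two of the action-selection rules named above. The Q-table format and the temperature parameter are assumptions made for the example.

    import math
    import random

    def epsilon_greedy(Q, s, actions, eps=0.1):
        """With probability eps explore uniformly, otherwise act greedily w.r.t. Q."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    def boltzmann(Q, s, actions, temperature=1.0):
        """Sample an action with probability proportional to exp(Q(s,a)/T)."""
        prefs = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
        return random.choices(actions, weights=prefs, k=1)[0]

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
    print(epsilon_greedy(Q, "s0", ["left", "right"]),
          boltzmann(Q, "s0", ["left", "right"]))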
Q-learning algorithm
Initialize Q(s, a) arbitrarily.
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using a policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
        s ← s'
    Until s is terminal
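The tabular algorithm above as a runnable sketch. The environment interface and the toy chain environment are assumptions; the update line is equation (10.15).

    import random

    class Chain:
        """Toy 4-state chain: 'right' moves toward the goal state 3 (reward 1), 'left' moves back."""
        def reset(self):
            self.s = 0
            return self.s
        def step(self, a):
            self.s = min(self.s + 1, 3) if a == "right" else max(self.s - 1, 0)
            done = self.s == 3
            return self.s, (1.0 if done else 0.0), done

    def q_learning(env, actions, alpha=0.5, gamma=0.9, eps=0.1, episodes=500):
        Q = {}
        for _ in range(episodes):
            s, done = env.reset(), False
            for _ in range(200):          # step cap to keep the sketch bounded
                # epsilon-greedy action selection
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q.get((s, x), 0.0))
                s2, r, done = env.step(a)
                best_next = 0.0 if done else max(Q.get((s2, x), 0.0) for x in actions)
                old = Q.get((s, a), 0.0)
                Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
                s = s2
                if done:
                    break
        return Q

    Q = q_learning(Chain(), ["left", "right"])
    print(max(["left", "right"], key=lambda a: Q.get((0, a), 0.0)))  # greedy action in state 0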
Q-learning algorithm (with function approximation)
Fit the approximator for the Q-function at each time point; the estimated policy is the argmax of the estimated Q-function.

What is the intuition?
The Bellman equation gives Q(s_t, a_t) = E[ r_{t+1} + γ max_a Q(s_{t+1}, a) ]. If the training set were infinite, Q-learning would minimize the mean squared difference between the two sides of this equation (the Bellman residual), which is equivalent to minimizing the corresponding expected squared error of the fitted Q-function.

A-learning (Murphy, 2003; Robins, 2004)
Estimate the A-function (the advantages) using some approximator, as in Q-learning. Derive the estimated policy as an argmax of the estimated A-function. Allow different parameter vectors at different time points. We illustrate the algorithm with linear regression as the approximator and, of course, squared error as the appropriate loss function.

A-learning algorithm (inefficient version)
Fit the advantage approximator at each time point; the estimated policy is the argmax of the estimated A-function.

Differences between Q-learning and A-learning
Q-learning: at time t we model the main effects of the history (S_t, A_{t−1}), the action A_t, and their interaction; our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}).
A-learning: at time t we model only the effects of A_t and its interaction with (S_t, A_{t−1}); our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).

Q-learning vs. A-learning
Their relative merits and demerits are not yet completely known. Q-learning has low variance but high bias; A-learning has high variance but low bias. Comparing Q-learning with A-learning therefore involves a bias-variance trade-off.
POMDP: partially observable Markov decision processes
Rather than observing the state, we observe some function of the state. Ob is the observation function, a random variable for each state. Problem: different states may look similar, so the optimal strategy may need to take the history into account.

Framework of a POMDP
A POMDP is defined by a six-tuple (S, A, R, P, Ω, O). Here (S, A, R, P) defines the underlying Markov decision model, Ω is the set of observations, i.e. the set of world states the system can perceive, and O: S × A → PD(Ω) is the observation function. When the system takes action a and the environment moves to state s', the observation function determines a probability distribution over the possible observations, written O(s', a, o). Ω may be a subset of S, or it may be unrelated to S.

POMDPs
What if the state information coming from the sensors is noisy? This is usually the case! Then MDP techniques are suboptimal: two halls that look the same are not the same state.

POMDPs: a solution strategy
SE: a belief-state estimator (which can be based on an HMM); π: MDP techniques applied to the belief state.

POMDP: the belief-state method
Idea: given the history of actions and observed values, compute a posterior distribution over the state we are in (the belief state), and work with the belief-state MDP: its states are distributions over S (the states of the POMDP), its actions are those of the POMDP, and its transitions are given by the posterior distribution (given the observation). Open problem: how to deal with the continuous distribution over belief states?
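A minimal sketch of the belief-state update b'(s') ∝ O(s', a, o) Σ_s P(s, a, s') b(s). The dictionary-based model format and the example numbers are assumptions, not taken from the slides.

    def belief_update(b, a, o, states, P, O):
        """b'(s') proportional to O(s',a,o) * sum_s P(s,a,s') * b(s)."""
        new_b = {}
        for s2 in states:
            pred = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in states)  # prediction step
            new_b[s2] = O[(s2, a)].get(o, 0.0) * pred                  # correction step
        z = sum(new_b.values())                                        # normalization
        return {s: (v / z if z > 0 else 1.0 / len(states)) for s, v in new_b.items()}

    # Two hidden states, one action "look", observations "bright"/"dark" (illustrative numbers).
    states = ["s0", "s1"]
    P = {("s0", "look"): {"s0": 0.9, "s1": 0.1}, ("s1", "look"): {"s0": 0.2, "s1": 0.8}}
    O = {("s0", "look"): {"bright": 0.7, "dark": 0.3}, ("s1", "look"): {"bright": 0.1, "dark": 0.9}}
    b = {"s0": 0.5, "s1": 0.5}
    print(belief_update(b, "look", "dark", states, P, O))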
The learning process of the belief MDP
(Figure omitted.)

Major methods for solving POMDPs
Learning a value function:
- Memoryless policies: apply standard reinforcement learning algorithms directly.
- Simple memory-based approaches: use the last k observations to represent the current state.
- UDM (Utile Distinction Memory): split states and build a finite-state-machine model.
- NSM (Nearest Sequence Memory): store state histories and apply a distance measure over them.
- USM (Utile Suffix Memory): combine the UDM and NSM approaches.
- Recurrent-Q: use a recurrent neural network for state prediction.
Policy search:
- Evolutionary algorithms: use genetic algorithms to search directly in policy space.
- Gradient ascent methods: search by gradient descent (ascent).

Function approximation in reinforcement learning
A subset of states provides value estimates as targets for V(s); the value function is then generalized to the entire state space by composing the TD operator with the function-approximation operator.

Two parallel iterative processes
The value-function iteration process and the value-function approximation process run in parallel. How should the mapping M be constructed: using state clustering, interpolation, decision trees, or a neural network?

Function approximator: V(s) = f(s, w)
Update by gradient descent, e.g. gradient-descent Sarsa:
    w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w),
where w is the weight vector, the bracketed term is the difference between the target value and the estimated value, and ∇_w f is the standard gradient. Open problem: how to design a nonlinear function-approximation system that converges with incremental instances?
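A sketch of the gradient-descent Sarsa update with a linear approximator Q(s, a) = w · φ(s, a). The one-hot feature encoding and the numbers are assumptions made for the example; for a linear approximator the gradient ∇_w Q is simply φ(s, a).

    import numpy as np

    STATES = ["s0", "s1"]
    ACTIONS = ["left", "right"]

    def phi(s, a):
        """One-hot feature vector over (state, action) pairs (assumed encoding)."""
        x = np.zeros(len(STATES) * len(ACTIONS))
        x[STATES.index(s) * len(ACTIONS) + ACTIONS.index(a)] = 1.0
        return x

    def sarsa_update(w, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
        """w <- w + alpha [ r + gamma Q(s',a') - Q(s,a) ] grad_w Q(s,a), with Q = w . phi."""
        q = w @ phi(s, a)
        q_next = w @ phi(s2, a2)
        td_error = r + gamma * q_next - q
        return w + alpha * td_error * phi(s, a)

    w = np.zeros(len(STATES) * len(ACTIONS))
    w = sarsa_update(w, "s0", "right", 1.0, "s1", "left")
    print(w)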
Semi-MDP
Discrete time with homogeneous discounting; continuous time with discrete events and interval-dependent discounting; or discrete time with discrete events and interval-dependent discounting. A discrete-time SMDP overlaid on an MDP can be analyzed at either level; this is one approach to temporal hierarchical RL.

The equations
(Formulas omitted in the source.)

Multi-agent MDP
Distributed RL, Markov games, best response: several RL agents act on the same environment, each receiving the state and a reward and emitting actions.

Three viewpoints on multi-agent reinforcement learning
- Cooperative multi-agent RL: distributed, homogeneous, cooperative environments; main methods: exchanging states, exchanging experience, exchanging policies, exchanging advice; criterion: speeding up the convergence of learning.
- Equilibrium-based multi-agent RL: homogeneous or heterogeneous, cooperative or competitive environments; main methods: minimax-Q, Nash-Q, CE-Q, WoLF; criteria: rationality and convergence.
- Best-response multi-agent RL: heterogeneous, competitive environments; main methods: PHC, IGA, GIGA, GIGA-WoLF; criteria: convergence and no-regret.

Markov games
In a system of n agents, define a discrete state set S (i.e. the set of games G), the joint action set A formed from the agents' action sets A_i, joint reward functions R_i: S × A_1 × ... × A_n → ℝ, and a state-transition function P: S × A_1 × ... × A_n → PD(S).

Reinforcement learning based on equilibrium solutions
Open problem: is a Nash equilibrium, or some other equilibrium, enough? The optimal policy in a single-agent game is a Nash equilibrium.

Applications of RL
Checkers [Samuel 59]; TD-Gammon [Tesauro 92]; the world's best down-peak elevator dispatcher [Crites et al. 95]; inventory management [Bertsekas et al. 95], 10-15% better than the industry standard; dynamic channel assignment [Singh & Bertsekas, Nie & Haykin 95], outperforming the best heuristics in the literature; cart-pole [Michie & Chambers 68-] with bang-bang control; robotic manipulation [Grupen et al. 93-]; path planning; robot docking [Lin 93]; parking; football [Stone 98]; Tetris; multi-agent RL [Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch 95-, and much work since]; combinatorial optimization: maintenance and repair; control of reasoning [Zhang & Dietterich IJCAI-95].

Simulated robot soccer
The Q-learning algorithm is applied to 2-vs-1 training in simulated robot soccer. The aim of the training is for the agent to acquire a kind of strategic awareness so that it can cooperate with its teammate during an attack.
Simulated robot soccer
Forward A has the ball and is inside the shooting area, but no longer has a shooting angle; teammate B is also in the shooting area and has a good shooting angle. If A passes to B and B takes the shot, the attacking combination succeeds. Q-learning is used for this 2-vs-1 shooting drill so that A learns that, in this situation, passing to B is the optimal policy. The agent acquires the policy through a large amount of training (a very large number of states, with the same states visited repeatedly), which makes the behaviour more adaptive.

Simulated robot soccer: state description
The attacking penalty area is divided into small regions, each a 2 m × 2 m square, so that a two-dimensional array can describe the area. The positions of the three agents are used to describe the environment state of the 2-vs-1 attack, and the partition shown in Figure 10.11 is used to generalize states: agents located in the same strategic region are regarded as being in similar states. This description of the state is not precise, but what the design needs is a description at the strategic level; the agents can be assumed to keep moving actively within a strategic region, so the method meets the requirement. A particular state is therefore described by a triple of region numbers, one each for attacker A, attacker B, and the goalkeeper, the region number being computed from the region's two array indices (the formula is omitted in the source). The stored state value is this triple of region numbers.
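A sketch of the region-based state encoding: positions are mapped to 2 m × 2 m grid cells and the state is the triple of region numbers for A, B, and the goalkeeper. The coordinate convention, the grid width, and the row-major numbering formula are assumptions, since the slide's own formula is omitted in the source.

    REGION_SIZE = 2.0   # each region is a 2 m x 2 m square
    GRID_COLS = 8       # assumed width of the attacking area, in regions

    def region_number(x, y):
        """Map a position (in metres, relative to the attacking area) to a region number."""
        col = int(x // REGION_SIZE)
        row = int(y // REGION_SIZE)
        return row * GRID_COLS + col          # assumed row-major numbering

    def soccer_state(pos_a, pos_b, pos_goalie):
        """State = triple of region numbers for attacker A, attacker B and the goalkeeper."""
        return (region_number(*pos_a), region_number(*pos_b), region_number(*pos_goalie))

    print(soccer_state((3.5, 1.0), (7.2, 4.8), (15.0, 5.0)))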