




Foreign-language article: Adaptive Dynamic Programming: An Introduction

Abstract: In this article, we introduce some recent research trends within the field of adaptive/approximate dynamic programming (ADP), including the variations on the structure of ADP schemes, the development of ADP algorithms, and applications of ADP schemes. For ADP algorithms, the point of focus is that iterative algorithms of ADP can be sorted into two classes: one class is the iterative algorithm with an initial stable policy; the other is the one without the requirement of an initial stable policy. It is generally believed that the latter requires less computation at the cost of losing the guarantee of system stability during the iteration process. In addition, many recent papers have provided convergence analysis associated with the algorithms developed. Furthermore, we point out some topics for future studies.

Introduction

As is well known,
there are many methods for designing stable control for nonlinear systems. However, stability is only a bare minimum requirement in a system design; ensuring optimality guarantees the stability of the nonlinear system. Dynamic programming is a very useful tool for solving optimization and optimal control problems by employing the principle of optimality. In [16], the principle of optimality is expressed as: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." There are several spectrums of dynamic programming: one can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc.

We first take a look at nonlinear discrete-time (time-varying) dynamical (deterministic) systems. Time-varying nonlinear systems cover most of the application areas, and discrete time is the basic consideration for digital computation. Suppose that one is given a discrete-time nonlinear (time-varying) dynamical system

x(k+1) = F[x(k), u(k), k],   k = 0, 1, 2, ...,        (1)

where x(k) represents the state vector of the system, u(k) denotes the control action, and F is the system function. Suppose that one associates with this system the performance index (or cost)

J[x(i), i] = Σ_{k=i}^{∞} γ^{k-i} U[x(k), u(k), k],        (2)

where U is called the utility function and γ is the discount factor with 0 < γ ≤ 1. Note that the function J is dependent on the initial time i and the initial state x(i), and it is referred to as the cost-to-go of state x(i). The objective of the dynamic programming problem is to choose a control sequence u(k), k = i, i+1, ..., so that the function J (i.e., the cost) in (2) is minimized. According to Bellman, the optimal cost from time k onward is equal to

J*[x(k)] = min_{u(k)} { U[x(k), u(k), k] + γ J*[x(k+1)] }.        (3)

The optimal control u*(k) at time k is the u(k) which achieves this minimum, i.e.,

u*(k) = arg min_{u(k)} { U[x(k), u(k), k] + γ J*[x(k+1)] }.        (4)

Equation (3) is the principle of optimality for discrete-time systems. Its importance lies in the fact that it allows one to optimize over only one control vector at a time by working backward in time.
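To make the backward-in-time character of (3) concrete, the following minimal Python sketch runs the recursion on a small, hypothetical discretized problem; the grid, dynamics, and utility below are illustrative assumptions, not the article's example.

```python
import numpy as np

# Hypothetical discretized problem, purely for illustration.
states = np.linspace(-1.0, 1.0, 21)      # discretized scalar state x
actions = np.linspace(-0.5, 0.5, 11)     # finite set of candidate controls u
gamma, N = 0.95, 50                      # discount factor and horizon length

def F(x, u):                             # assumed system function, cf. (1)
    return np.clip(0.9 * x + u, -1.0, 1.0)

def U(x, u):                             # assumed utility function, cf. (2)
    return x ** 2 + u ** 2

J = np.zeros_like(states)                # terminal cost-to-go
for k in reversed(range(N)):             # work backward in time, cf. (3)
    J_new = np.empty_like(J)
    for i, x in enumerate(states):
        # cost of each candidate control: U(x, u) + gamma * J*(F(x, u))
        costs = [U(x, u) + gamma * np.interp(F(x, u), states, J) for u in actions]
        J_new[i] = min(costs)            # eq. (3); the argmin gives u*(k) as in (4)
    J = J_new

print("approximate optimal cost-to-go at x = 0:", np.interp(0.0, states, J))
```

With n state variables each discretized into m points, the table J above has m^n entries and the inner loop visits all of them at every stage, which is precisely the "curse of dimensionality" discussed later in this section.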
In the nonlinear continuous-time case, the system can be described by

ẋ(t) = F[x(t), u(t), t],   t ≥ t0.        (5)

The cost in this case is defined as

J[x(t)] = ∫_t^∞ U(x(τ), u(τ)) dτ.        (6)

For continuous-time systems, Bellman's principle of optimality can be applied, too. The optimal cost J*(x0) = min_{u(t)} J(x0, u(t)) will satisfy the Hamilton-Jacobi-Bellman equation

-∂J*(x(t), t)/∂t = min_{u(t)} { U(x(t), u(t), t) + (∂J*(x(t), t)/∂x(t))ᵀ F(x(t), u(t), t) }.        (7)

Equations (3) and (7) are called the optimality equations of dynamic programming and are the basis for the implementation of dynamic programming. In the above, if the function F in (1) or (5) and the cost function J in (2) or (6) are known, the solution for u(k) becomes a simple optimization problem. If the system is modeled by linear dynamics and the cost function to be minimized is quadratic in the state and control, then the optimal control is a linear feedback of the states, where the gains are obtained by solving a standard Riccati equation [47].
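For this linear-quadratic special case, the Riccati-based solution is straightforward to compute numerically; the sketch below uses an arbitrarily chosen discrete-time (A, B, Q, R), assumed here purely for illustration, to obtain the optimal feedback gain.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Assumed example data: x(k+1) = A x(k) + B u(k), cost is the sum of x'Qx + u'Ru.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve the discrete algebraic Riccati equation for P ...
P = solve_discrete_are(A, B, Q, R)
# ... and form the optimal state-feedback gain K, so that u(k) = -K x(k).
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("LQR gain K:", K)
```

Here solve_discrete_are returns the stabilizing solution P of the Riccati equation, and u(k) = -K x(k) is the corresponding optimal state feedback.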
On the other hand, if the system is modeled by nonlinear dynamics or the cost function is nonquadratic, the optimal state feedback control will depend upon solutions to the Hamilton-Jacobi-Bellman (HJB) equation [48], which is generally a nonlinear partial differential equation or difference equation. However, it is often computationally untenable to run true dynamic programming due to the backward numerical process required for its solution, i.e., as a result of the well-known "curse of dimensionality" [16], [28]. In [69], three curses are displayed in resource management and control problems to show that the cost function J, which is the theoretical solution of the Hamilton-Jacobi-Bellman equation, is very difficult to obtain, except for systems satisfying some very good conditions. Over the years, progress has been made to circumvent the "curse of dimensionality" by building a system, called a "critic", to approximate the cost function in dynamic programming (cf. [10], [60], [61], [63], [70], [78], [92], [94], [95]). The idea is to approximate dynamic programming solutions by using a function approximation structure, such as a neural network, to approximate the cost function.

The Basic Structures of ADP
In recent years, adaptive/approximate dynamic programming (ADP) has gained much attention from many researchers as a way to obtain approximate solutions of the HJB equation, cf. [2], [3], [5], [8], [11]-[13], [21], [22], [25], [30], [31], [34], [35], [40], [46], [49], [52], [54], [55], [63], [70], [76], [80], [83], [95], [96], [99], [100]. In 1977, Werbos [91] introduced an approach for ADP that was later called adaptive critic designs (ACDs). ACDs were proposed in [91], [94], [97] as a way of solving dynamic programming problems forward in time. In the literature, there are several synonyms used for "Adaptive Critic Designs" [10], [24], [39], [43], [54], [70], [71], [87], including "Approximate Dynamic Programming" [69], [82], [95], "Asymptotic Dynamic Programming" [75], "Adaptive Dynamic Programming" [63], [64], "Heuristic Dynamic Programming" [46], [93], "Neuro-Dynamic Programming" [17], "Neural Dynamic Programming" [82], [101], and "Reinforcement Learning" [84].

Bertsekas and Tsitsiklis gave an overview of neuro-dynamic programming in their book [17]. They provided the background, gave a detailed introduction to dynamic programming, discussed neural network architectures and methods for training them, and developed general convergence theorems for stochastic approximation methods as the foundation for the analysis of various neuro-dynamic programming algorithms. They provided the core neuro-dynamic programming methodology, including many mathematical results and methodological insights. They suggested many useful methodologies for applications of neuro-dynamic programming, such as Monte Carlo simulation, on-line and off-line temporal difference methods, the Q-learning algorithm, optimistic policy iteration methods, Bellman error methods, approximate linear programming, and approximate dynamic programming with a cost-to-go function. A particularly impressive success that greatly motivated subsequent research was the development of a backgammon-playing program by Tesauro [85]. Here a neural network was trained to approximate the optimal cost-to-go function of the game of backgammon by using simulation, that is, by letting the program play against itself. Unlike chess programs, this program did not use lookahead over many steps, so its success can be attributed primarily to the use of a properly trained approximation of the optimal cost-to-go function.

To implement the ADP algorithm, Werbos [95] proposed a means of getting around this numerical complexity by using "approximate dynamic programming" formulations. His methods approximate the original problem with a discrete formulation. The solution to the ADP formulation is obtained through a neural-network-based adaptive critic approach. The main idea of ADP is shown in Fig. 1. He proposed two basic versions, namely heuristic dynamic programming (HDP) and dual heuristic programming (DHP).
HDP is the most basic and widely applied structure of ADP [13], [38], [72], [79], [90], [93], [104], [106]. The structure of HDP is shown in Fig. 2. HDP is a method for estimating the cost function. Estimating the cost function for a given policy only requires samples from the instantaneous utility function U, while models of the environment and the instantaneous reward are needed to find the cost function corresponding to the optimal policy.

In HDP, the output of the critic network is Ĵ, which is the estimate of J in equation (2). This is done by minimizing the following error measure over time:

||E_h|| = Σ_k E_h(k) = Σ_k [Ĵ(k) - U(k) - γ Ĵ(k+1)]²,        (8)

where Ĵ(k) = Ĵ[x(k), u(k), k, W_C] and W_C represents the parameters of the critic network. When E_h(k) = 0 for all k, (8) implies that

Ĵ(k) = U(k) + γ Ĵ(k+1).        (9)
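As an illustration of how (8)-(9) are used in practice, here is a minimal Python sketch of one training step for an HDP-style critic; the linear-in-features critic, the feature map, and the numbers are illustrative assumptions rather than the article's specific networks.

```python
import numpy as np

def features(x):
    # Hypothetical feature map standing in for the critic network's hidden layer;
    # the article's critic may also take u(k) and k as inputs.
    return np.array([x[0], x[1], x[0] ** 2, x[1] ** 2, x[0] * x[1], 1.0])

def hdp_critic_update(Wc, x_k, x_next, U_k, gamma=0.95, lr=0.01):
    """One semi-gradient step on E_h(k) = [J(k) - U(k) - gamma*J(k+1)]^2, cf. (8)."""
    J_k = Wc @ features(x_k)
    J_next = Wc @ features(x_next)            # x(k+1) would come from the model network
    td_error = J_k - (U_k + gamma * J_next)   # zero exactly when (9) holds
    return Wc - lr * td_error * features(x_k)

# Toy usage on a single fabricated transition.
Wc = np.zeros(6)
Wc = hdp_critic_update(Wc, x_k=np.array([0.5, -0.2]),
                       x_next=np.array([0.45, -0.15]), U_k=0.3)
```

Repeating this update along trajectories drives the temporal-difference error toward zero, i.e., toward the consistency condition (9).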
Dual heuristic programming is a method for estimating the gradient of the cost function, rather than J itself. To do this, a function is needed to describe the gradient of the instantaneous cost function with respect to the state of the system. In the DHP structure, the action network remains the same as the one for HDP, but the second network, which is called the critic network, has the costate as its output and the state variables as its inputs. The critic network's training is more complicated than that in HDP, since we need to take into account all relevant pathways of backpropagation. This is done by minimizing the following error measure over time:

||E_D|| = Σ_k E_D(k) = Σ_k [∂Ĵ(k)/∂x(k) - ∂U(k)/∂x(k) - γ ∂Ĵ(k+1)/∂x(k)]²,        (10)

where ∂Ĵ(k)/∂x(k) = ∂Ĵ[x(k), u(k), k, W_C]/∂x(k) and W_C represents the parameters of the critic network. When E_D(k) = 0 for all k, (10) implies that

∂Ĵ(k)/∂x(k) = ∂U(k)/∂x(k) + γ ∂Ĵ(k+1)/∂x(k).        (11)
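The right-hand side of (11) has to be expanded through both the model and the action network ("all relevant pathways of backpropagation"); the sketch below assembles that target from user-supplied Jacobians. The model, action network, and numerical values are assumptions made only for illustration.

```python
import numpy as np

def dhp_costate_target(dU_dx, dU_du, du_dx, dx1_dx, dx1_du, lam_next, gamma=0.95):
    """Assemble the DHP training target for the costate lambda(k) = dJ(k)/dx(k),
    cf. (10)-(11), including both the direct state pathway and the pathway
    through the action network.

    dU_dx, dU_du   : gradients of the utility U(k) w.r.t. x(k) and u(k)
    du_dx          : Jacobian of the action network, du(k)/dx(k)
    dx1_dx, dx1_du : model Jacobians dx(k+1)/dx(k) and dx(k+1)/du(k)
    lam_next       : critic output at k+1, i.e. dJ(k+1)/dx(k+1)
    """
    # total Jacobian of x(k+1) w.r.t. x(k), including the action pathway
    dx1_dx_total = dx1_dx + dx1_du @ du_dx
    return dU_dx + du_dx.T @ dU_du + gamma * dx1_dx_total.T @ lam_next

# Toy usage with fabricated 2-state / 1-input Jacobians.
target = dhp_costate_target(
    dU_dx=np.array([1.0, -0.4]), dU_du=np.array([0.2]),
    du_dx=np.array([[0.3, 0.1]]), dx1_dx=np.eye(2),
    dx1_du=np.array([[0.0], [0.1]]), lam_next=np.array([0.5, 0.5]))
print(target)
```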
Theoretical Developments

In [82], Si et al. summarize the cross-disciplinary theoretical developments of ADP, give an overview of DP and ADP, and discuss their relations to artificial intelligence, approximation theory, control theory, operations research, and statistics. In [69], Powell shows how ADP, when coupled with mathematical programming, can solve (approximately) deterministic or stochastic optimization problems that are far larger than anything that could be solved using existing techniques, and points out directions for improving ADP.

In [95], Werbos further gave two other versions called "action-dependent critics", namely ADHDP (also known as Q-learning [89]) and ADDHP. In these two ADP structures, the control is also an input of the critic network. In 1997, Prokhorov and Wunsch [70] presented more algorithms within the family of ACDs. They discussed the design families of HDP, DHP, and globalized dual heuristic programming (GDHP). They suggested some new improvements to the original GDHP design, which promise to be useful for many engineering applications in the areas of optimization and optimal control. Based on one of these modifications, they presented a unified approach to all ACDs, which leads to a generalized training procedure for ACDs. In [26], a realization of ADHDP was suggested: a least-squares support vector machine (SVM) regressor is used for generating the control actions, while an SVM-based tree-type neural network (NN) is used as the critic.
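Because ADHDP makes the control an input of the critic, the critic effectively plays the role of a Q-function; the minimal sketch below shows one action-dependent critic update over a finite candidate action set. The feature map, action grid, and data are assumptions for illustration only, not the designs of [26] or [95].

```python
import numpy as np

ACTIONS = np.linspace(-1.0, 1.0, 5)        # assumed finite set of candidate controls

def q_features(x, u):
    # Hypothetical joint feature map for the action-dependent critic Q(x, u).
    return np.array([x[0], x[1], u, x[0] * u, x[1] * u, u ** 2, 1.0])

def adhdp_update(Wc, x_k, u_k, U_k, x_next, gamma=0.95, lr=0.05):
    """One step of action-dependent HDP (Q-learning-style) critic training."""
    q_next = min(Wc @ q_features(x_next, u) for u in ACTIONS)   # greedy cost-to-go
    td_error = Wc @ q_features(x_k, u_k) - (U_k + gamma * q_next)
    return Wc - lr * td_error * q_features(x_k, u_k)

Wc = np.zeros(7)
Wc = adhdp_update(Wc, x_k=np.array([0.4, 0.0]), u_k=0.5, U_k=0.2,
                  x_next=np.array([0.35, 0.05]))
```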
The GDHP or ADGDHP structure minimizes the error with respect to both the cost and its derivatives. While it is more complex to do this simultaneously, the resulting behavior is expected to be superior. Accordingly, in [102], GDHP serves as a reconfigurable controller to deal with both abrupt and incipient changes in the plant dynamics due to faults. A novel fault-tolerant control (FTC) supervisor is combined with GDHP for the purpose of improving the performance of GDHP for fault-tolerant control. When the plant is affected by a known abrupt fault, the new initial conditions of GDHP are loaded from a dynamic model bank (DMB). On the other hand, if the fault is incipient, the reconfigurable controller maintains performance by continuously modifying itself without supervisor intervention. It is noted that the three networks used to implement the GDHP are trained online, utilizing two distinct networks to implement the critic. The first critic network is trained at every iteration, while the second one is updated with a copy of the first one after a given number of iterations.
All of the ADP structures can realize the same function, namely obtaining the optimal control policy, while their computational precision and running time differ from each other. Generally speaking, the computational burden of HDP is low but its precision is also low, while GDHP has better precision at the cost of a longer computation process; a detailed comparison can be found in [70]. In [30], [33] and [83], the schematic of direct heuristic dynamic programming is developed. Using the approach of [83], the model network in Fig. 1 is not needed anymore. Reference [101] makes significant contributions to model-free adaptive critic designs. Several practical examples are included in [101] for demonstration, including the single inverted pendulum and the triple inverted pendulum. A reinforcement-learning-based controller design for nonlinear discrete-time systems with input constraints is presented in [36], where the nonlinear tracking control is implemented with a filtered tracking error using direct HDP designs. Similar work can be found in [37]. Reference [54] is also about model-free adaptive critic designs. Two approaches for the training of the critic network are provided in [54]: a forward-in-time approach and a backward-in-time approach. Fig. 4 shows the diagram of the forward-in-time approach. In this approach, we view Ĵ(k) in (8) as the output of the critic network to be trained and choose U(k) + γ Ĵ(k+1) as the training target. Note that Ĵ(k) and Ĵ(k+1) are obtained using state variables at different time instances. Fig. 5 shows the diagram of the backward-in-time approach. In this approach, we view Ĵ(k+1) in (8) as the output of the critic network to be trained and choose (Ĵ(k) - U(k))/γ as the training target. The training approach of [101] can be considered a backward-in-time approach. In Fig. 4 and Fig. 5, x(k+1) is the output of the model network.
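The two training targets just described differ only in which side of (9) is treated as the critic's prediction; this small sketch, with fabricated numbers, shows both target computations side by side.

```python
gamma = 0.95

# Fabricated values for one transition, purely to illustrate the two targets.
U_k = 0.3          # utility U(k)
J_k = 2.00         # current critic output J(k)
J_k1 = 1.80        # current critic output J(k+1)

# Forward-in-time (Fig. 4): J(k) is the prediction, target built from time k+1.
target_forward = U_k + gamma * J_k1

# Backward-in-time (Fig. 5): J(k+1) is the prediction, target built from time k.
target_backward = (J_k - U_k) / gamma

print(target_forward, target_backward)
```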
An improvement and modification to the two-network architecture, called the "single network adaptive critic (SNAC)", was presented in [65], [66]. This approach eliminates the action network. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, a lower computational load (about half of that of the dual-network algorithms), and no approximation error associated with the action network, since that network is eliminated. The SNAC approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and the costate variables. Most of the problems in aerospace, automobile, robotics, and other engineering disciplines can be characterized by nonlinear control-affine equations that yield such a relation. SNAC-based controllers yield excellent tracking performance in applications to micro-electromechanical systems, chemical reactors, and high-speed reentry problems. Padhi et al. [65] have proved that for linear systems (where the mapping between the costate at stage k+1 and the state at stage k is linear), the solution obtained by the algorithm based on the SNAC structure converges to the solution of the discrete Riccati equation.
Translation: A Survey of Adaptive Dynamic Programming

Abstract: Adaptive dynamic programming (ADP) is an approximately optimal control method that has recently emerged in the field of optimal control and is currently a research focus in the international optimization community. The ADP method uses a function approximation structure to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation and obtains an approximately optimal control policy for the system by offline iteration or online updating, so that it can effectively solve optimal control problems for nonlinear systems. This article introduces the ADP method from three aspects: the evolution of ADP structures, the development of ADP algorithms, and their applications. It summarizes the existing research results on ADP and gives an outlook on the problems that remain to be solved and the future directions of this research field.

Keywords: adaptive dynamic programming, neural networks, nonlinear systems, stability

Introduction

Dynamic systems are ubiquitous in nature. Stability analysis of dynamic systems has long been a research focus, and a series of methods has been proposed for it. However, control scientists and engineers usually also require optimality on top of guaranteed stability of the control system. In the 1950s and 1960s, driven by the development of space technology and the practical use of digital computers, optimization theory for dynamic systems developed rapidly and formed an important branch of the discipline: optimal control. It has found increasingly wide application in many fields such as space technology, systems engineering, economic management and decision making, population control, and the optimization of multistage process equipment. In 1957, Bellman proposed an effective tool for solving optimal control problems: the dynamic programming (DP) method [1]. The core of this method is Bellman's principle of optimality, namely: the optimal policy of a multistage decision process has the property that, whatever the initial state and initial decision are, the remaining decisions must also constitute an optimal policy with regard to the state resulting from the initial decision. This principle can be reduced to a basic recursive formula; to solve a multistage decision problem, one recurses backward in time, starting from the terminal stage and ending at the initial stage. The range of problems to which the principle applies is very wide, for example discrete systems, continuous systems, linear systems, nonlinear systems, deterministic systems, and stochastic systems.

The basic principle of the DP method is explained below for the discrete and the continuous case separately. First consider a discrete nonlinear system, and suppose that the dynamic equation of the system is

x(k+1) = F[x(k), u(k), k],        (1)

where x(k) is the state vector of the system and u(k) is the control input vector. The corresponding cost function (or performance index function) of the system has the form

J[x(i), i] = Σ_{k=i}^{∞} r^{k-i} l(x(k), u(k), k),        (2)

where the initial state x(i) = x_i is given, l(x(k), u(k), k) is the utility function, and r is the discount factor satisfying 0 < r ≤ 1. The control objective is to find an admissible decision (or control) sequence u(k), k = i, i+1, ..., that minimizes the cost function (2). According to Bellman's principle of optimality, the minimal cost starting from an arbitrary state at time k consists of two parts: one part is the minimal cost incurred at time k itself, and the other is the accumulated minimal cost from time k+1 onward, that is,

J*[x(k)] = min_{u(k)} { l(x(k), u(k), k) + r J*[x(k+1)] }.        (3)

The corresponding control policy u(k) at time k is then also optimal, and is expressed as

u*(k) = arg min_{u(k)} { l(x(k), u(k), k) + r J*[x(k+1)] }.        (4)

Next, consider the optimal control problem for continuous nonlinear (time-varying) dynamic (deterministic) systems. Consider the following continuous-time system

ẋ(t) = F(x, u, t),

where F(x, u, t)
is an arbitrary continuous function. The goal is to find an admissible control policy u(t) that minimizes the cost function (or performance index function). We can convert the continuous problem into a discrete one by discretization and then obtain the optimal control by the discrete dynamic programming method; as the discretization interval tends to zero, the two solutions necessarily coincide. By applying Bellman's principle of optimality, the continuous form of DP can be obtained as

-∂J*(x(t), t)/∂t = min_{u(t)} { l(x(t), u(t), t) + (∂J*(x(t), t)/∂x(t))ᵀ F(x(t), u(t), t) }.

It can be seen that the above expression is a first-order nonlinear partial differential equation in J*(x(t), t) with x(t) and t as independent variables; in mathematics it is called the Hamilton-Jacobi-Bellman (HJB) equation. If the system is linear and the cost function is quadratic in the state and the control input, then the optimal control policy takes the form of state feedback and can be obtained by solving a standard Riccati equation. If the system is nonlinear or the cost function is not quadratic in the state and the control input, the optimal control policy has to be obtained by solving the HJB equation. However, solving a partial differential equation of this kind is very difficult. In addition, the DP method has another obvious weakness: as the dimensions of x and u increase, the amount of computation and storage grows dramatically, which is the well-known "curse of dimensionality" [1-2]. To overcome these weaknesses, Werbos first proposed the framework of the adaptive dynamic programming (ADP) method [3], whose main idea is to use a function approximation structure (for example a neural network, a fuzzy model, or a polynomial) to estimate the cost function, so as to solve the DP problem forward in time.

In recent years, the ADP method has received wide attention, and a series of synonyms has appeared, for example adaptive critic designs [4-7], heuristic dynamic programming [8-9], neuro-dynamic programming [10-11], adaptive dynamic programming [12], and reinforcement learning [13]. At the "2006 NSF Workshop and Outreach Tutorials on Approximate Dynamic Programming" organized by the U.S. National Science Foundation, it was suggested that the method be referred to uniformly as "adaptive/approximate dynamic programming". Bertsekas et al. summarized neuro-dynamic programming in [10-11], gave a detailed introduction to dynamic programming and to the structures and training algorithms of neural networks, and proposed many effective methods for applying neuro-dynamic programming. Si et al. summarized the development of the ADP method across disciplines and discussed the relations of the DP and ADP methods to artificial intelligence, approximation theory, control theory, operations research, and statistics [14]. In [15], Powell showed how to use the ADP method to solve deterministic or stochastic optimization problems and pointed out the directions in which the ADP method can develop. In [16], Balakrishnan et al. summarized, from both the model-based and the model-free point of view, previous methods of designing feedback controllers for dynamic systems with ADP. Reference [17] introduced ADP methods from the perspectives of requiring and not requiring an initially stable policy. Based on our own research results and building on these previous studies, this article surveys the latest progress of the ADP method.

Structural Development of ADP

To implement the ADP method, Werbos proposed two basic structures: heuristic dynamic programming (HDP) and dual heuristic programming (DHP), whose structures are shown in Fig. 1 and Fig. 2.