Research on Key Technologies of Large-Scale Lustre Cluster File Systems


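National University of Defense Technology, Doctoral Dissertation
Research on Key Technologies of Large-Scale Lustre Cluster File Systems
Author: Qian Yingjin (錢迎進)
Degree applied for: Doctorate
Major: Computer Science and Technology
Supervisor: Jin Shiyao (金士堯)
2011-03
Graduate School of National University of Defense Technology

Abstract (translated from the Chinese original)

Clusters have become the mainstream architecture of today's high-performance computers, and the cluster file system is the core technology for relieving the I/O bottleneck of high-performance computing (HPC) clusters. As HPC technology keeps developing, the storage demands of many HPC applications keep rising. Lustre is the leading cluster file system. It has become the standard for building HPC storage systems and dominates the HPC market. It scales effectively to large HPC systems with more than ten thousand nodes and has proven aggregate performance and scalability. As HPC keeps adding nodes to raise system performance, future HPC clusters will become extremely large, which confronts Lustre with serious technical challenges in scalability, I/O performance, and availability. The work in this thesis centers on these problems. The research content and contributions are as follows.

1. For the parallel I/O access characteristics of large-scale applications, a novel cross-network, server-side I/O request scheduler framework is designed, and an Object Based Round Robin (OBRR) scheduling algorithm is proposed to optimize performance. By scheduling the execution of upper-layer parallel I/O requests, it presents the back-end storage system with an I/O workload that is easier to optimize. In addition, to avoid starvation and to meet the response time requirements of I/O requests with different urgencies, a novel two-level deadline setting strategy is proposed: a dynamic deadline and a mandatory deadline. A series of simulation tests shows that OBRR improves performance by more than 40%, and that the two-level deadline setting strategy maintains fairness, avoids starvation, and guarantees the response time of I/O requests with different urgencies.

2. As with network congestion, I/O congestion arises when a storage system reaches very large scale. To address this, a dynamic I/O congestion control mechanism is proposed to better support the storage needs of future exascale HPC systems. Under this mechanism, when the server is lightly loaded, clients are allowed to send more I/O requests to the server, which improves the utilization of network and server resources and raises I/O throughput. When the server is overloaded, the mechanism throttles client I/O, limits the number of I/O requests pending on the server, controls I/O latency, and avoids congestion collapse. A series of evaluation experiments on Tianhe-1 (天河一號) demonstrates the effectiveness of the proposed mechanism: it prevents congestion collapse and, on that premise, maximizes the I/O performance of the Lustre file system.

3. Because the traditional fixed timeout mechanism cannot adapt to very large cluster environments, an adaptive, scalable RPC timeout mechanism is proposed that jointly considers network conditions, server load, scalability, and performance. It comprises two strategies: an adaptive timeout strategy and an early reply strategy. In the adaptive timeout strategy, the timeout value set by the client is adjusted dynamically according to the network conditions between client and server and the server's workload, so that it follows changes in the cluster environment and avoids unnecessary timeouts that degrade the performance of the whole system. To distinguish server congestion caused by overload from network or node failure, and to solve the nested timeout problem, an early reply strategy is proposed: when the server knows that it cannot reply to an RPC request within the response time the client expects, it sends a lightweight early reply message to the client in advance, indicating an estimate of the additional service time required. This strategy further reduces timeouts and improves system responsiveness. A series of simulation evaluations shows that, compared with the fixed timeout mechanism, the adaptive timeout strategy lowers the RPC timeout rate from 76% to 13%, and combining it with the early reply strategy lowers the rate to 0%. In very large RPC-based cluster systems, other RPC failure detection mechanisms, such as client-driven polling or probing, generate a large amount of unnecessary network traffic and have scalability problems, whereas the proposed mechanism usually generates only a small amount of network traffic and is a more scalable timeout-based failure detection mechanism.

4. The Lustre distributed lock manager is studied. First, the concurrency control mechanism for Lustre file access is analyzed, together with the client-side directory entry cache and data writeback cache based on lock callbacks. Second, Lustre's metadata operations based on intent locks, its subtree lock mechanism, and its file size acquisition algorithm based on extent locks are studied. Finally, an adaptive I/O locking strategy, an interval-tree-based optimization of extent lock conflict detection, and a lock eviction strategy are proposed, which further improve Lustre's I/O performance and the scalability of its lock service.

5. The transaction-based metadata update algorithm and recovery mechanism of the stateful Lustre are studied. Lustre allows the server to return the result to the client as soon as the in-memory update of a transaction is complete, and the result is immediately visible in the whole namespace. This provides excellent metadata performance, but it causes cascading aborts of transactions during server reboot recovery (or failover), so recovery cannot be transparent and seamless. Lustre's reboot recovery algorithm requires all clients in the cluster to reconnect to the server within a specified recovery time window. Clients resend uncommitted transaction requests, and the server replays all uncommitted transactions strictly in transaction number order. These requirements are too strict. To improve Lustre's recoverability, version based recovery and commit on share algorithms are proposed. They extend Lustre's metadata update algorithm and reboot recovery algorithm respectively, and allow clients to recover and rejoin the cluster under more relaxed conditions. The version based recovery algorithm adds a version check during recovery and allows transactions whose target object versions match to be replayed. In the commit on share algorithm, as soon as the server detects an uncommitted transaction that another client depends on, it commits that transaction to disk so that the data of uncommitted transactions is never read or written. This eliminates recovery dependencies between clients, so each client can recover independently. Experimental evaluation shows that commit on share has some impact on performance, because a disk commit is forced whenever a transaction dependency occurs. Nevertheless, in very large Lustre clusters, commit on share is generally enabled in order to provide a highly reliable, highly available service.

Keywords: Lustre; high-performance computing; I/O scheduling; quality of service; scalability; congestion control; failure detection; distributed lock; concurrency control; recovery; high availability

Abstract

The cluster architecture has matured into the mainstream architecture for high-performance computers. The clustered file system is a key technology for easing the I/O bottleneck of HPC clusters. With the continuing development of HPC technologies, the storage demand of HPC applications keeps increasing. Lustre is the leading clustered file system. It has become the standard for constructing HPC storage systems and holds the largest market share in HPC. Lustre scales effectively to support systems with tens of thousands of compute nodes and has proven aggregate I/O performance and scalability. As HPC systems increase node counts to raise overall performance, future HPC clusters will become extremely large. This brings serious challenges for Lustre, especially in scalability, I/O performance, and availability. The work in this thesis focuses on these problems. The main contributions are as follows.

1. Based on the parallel I/O access characteristics of large-scale applications, this thesis presents a novel server-side network request scheduler framework for a large-scale Lustre storage cluster. On top of it, the thesis proposes an Object Based Round Robin (OBRR) scheduling algorithm that reorders the execution of I/O requests, presenting the backend storage with a workload that can be optimized more easily. Meanwhile, to avoid starvation and to meet the response time requirements of I/O requests with different urgencies, it proposes a novel two-level deadline setting strategy: a dynamic deadline and a mandatory deadline. A series of experiments using the Lustre simulator, scaling up to thousands of nodes, demonstrates that I/O performance increases by as much as 40% with the OBRR algorithm, and that the two-level deadline setting strategy maintains fairness, avoids starvation, and ensures the response time requirements of I/Os with different urgencies.

2. Similar to network congestion, I/O congestion occurs when a storage cluster scales up to extremely large size. This thesis proposes a dynamic I/O congestion control mechanism to support the coming exascale HPC systems. When the server is under light load, clients are allowed to issue more concurrent I/O requests to the server, which optimizes the utilization of network and server resources and improves I/O throughput. When the server is overloaded, the mechanism throttles client I/O and limits the number of I/O requests queued on the server, to control I/O latency and avoid congestive collapse. The results of a series of evaluation experiments on the Tianhe-1 supercomputer demonstrate the effectiveness of this I/O congestion control mechanism. It prevents congestive collapse and, on this premise, takes a best-effort approach and maximizes the I/O throughput of the scalable Lustre file system.

3. To solve the problems of the fixed timeout mechanism that emerge in large-scale HPC cluster systems, this thesis proposes an adaptive, scalable RPC timeout mechanism that considers network conditions, server load, scalability, and performance. The mechanism includes two strategies: an adaptive timeout strategy and an early reply strategy. In the adaptive timeout strategy, the timeout value set by clients is adjusted dynamically according to network conditions and server workload to accommodate changes in the environment, reducing the performance degradation of the entire system caused by unnecessary timeouts. To distinguish server congestion from a failure of the server or network, and to resolve the nested timeout problem, the thesis proposes an early reply strategy: through a lightweight early reply message, the server notifies the client to wait an extra amount of time for the response to an RPC that is about to time out. This further avoids unnecessary timeouts and improves system responsiveness. A series of simulation experiments demonstrates that, compared with the fixed timeout mechanism, the RPC timeout rate drops from 76% to 13% with the adaptive timeout strategy, and drops to 0% when it is combined with the early reply strategy. In large-scale RPC-based clusters, existing mechanisms for RPC failure detection, such as client-driven polling and probing, generate a considerable amount of unnecessary network traffic and have scalability problems, while the proposed mechanism generates much less extra network traffic and is a more scalable failure detection mechanism for RPC models with timeouts.

4. This thesis studies Lustre distributed lock manager technology. First, it analyzes the concurrency control mechanism for file access, and the client-side dentry cache and data writeback cache based on lock callbacks. Second, it studies the metadata operations based on intent locks, the subtree lock mechanism, and the file size acquisition algorithm based on extent locks. Finally, it proposes an adaptive I/O locking strategy, an optimized conflict check strategy for extent locks based on an interval tree, and a lock discarding strategy. These strategies further improve Lustre's I/O performance and the scalability of Lustre's lock service.

5. This thesis studies the transactional metadata update algorithm and recovery mechanism of the stateful Lustre. Lustre allows the server to return the result of a metadata transaction to the client once the in-memory update has finished, and the result is visible in the whole namespace. This provides good metadata performance, but it causes the cascading abort problem during reboot recovery or failover, making transparent recovery impossible. The Lustre reboot recovery algorithm requires all clients to reconnect to the server within a specified recovery time window. Clients then resend uncommitted transactional requests, and the server replays these requests strictly in transaction number order. These recovery conditions are too strict. To improve Lustre's recoverability, this thesis proposes the version based recovery and commit on share algorithms. They extend Lustre's metadata update algorithm and recovery algorithm respectively, and allow clients to rejoin the cluster through recovery under more relaxed conditions. The version based recovery algorithm adds a version check during recovery, and transactions whose versions match are allowed to replay. The commit on share algorithm forces an inter-client dependent transaction to be committed to disk as soon as it is detected, to avoid reading or writing the data of uncommitted transactions. It eliminates inter-client recovery dependencies, so clients can recover independently. Experimental evaluation demonstrates that the commit on share algorithm affects performance because of the mandatory disk commits when inter-client dependencies are detected. However, in a very large Lustre cluster, commit on share is usually enabled to provide a highly reliable, highly available service.

Key words: Lustre, HPC, I/O scheduling, QoS, scalability, congestion control, failure detection, distributed lock, concurrency control, recovery, high availability

List of Tables

Table 3.1 Statistics of I/O request sizes received by the disk driver after merging by the disk scheduler
Table 4.1 Definitions of symbols and terms
Table 4.2 Statistics of each RPC time phase under the static RCC policy with RCC fixed at 8
Table 4.3 I/O latency statistics in the steady phase and total I/O bandwidth for each test case
Table 5.1 Statistics of Rto and Rra for the service time estimation algorithms
Table 5.2 Extra message statistics
Table 6.1 Lock mode compatibility
Table 6.2 Lock mode inclusion
Table 6.3 Range search time, linked list vs. interval tree (unit: seconds)
Table 7.1 COS lock compatibility
Table 7.2 Parallel file creation performance in different directories
Table 7.3 Parallel file creation performance in a shared directory
Table 7.4 Create/delete comparison test

List of Figures

Figure 2.1 Lustre architecture [12]
Figure 2.2 Lustre subsystem interaction diagram
Figure 2.3 Lustre I/O system component diagram
Figure 2.4 Interaction process of Lustre file open and file I/O
Figure 2.5 Layered structure of modular LNET
Figure 2.6 Link-level load balancing and failover
Figure 2.7 Failover configuration of OSS and MDS
Figure 2.8 Lustre simulator component diagram
Figure 2.9 Comparison of test data between the ORNL Jaguar system and the simulator
Figure 3.1 NRS architecture
Figure 3.2 DDN S2A 9550 performance evaluation
Figure 3.3 4M bulk I/O vs. 1M bulk I/O
Figure 3.4 OBRR scheduling algorithm
Figure 3.5 Performance comparison of FCFS and OBRR scheduling
Figure 3.6 Evaluation of the two-level deadline setting strategy
Figure 4.1 Lustre I/O model
Figure 4.2 Lustre write processing flow
Figure 4.3 Lustre congestion control algorithm
Figure 4.4 RC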
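The first contribution in the abstract, object-based round robin (OBRR) scheduling, can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract gives no details, so the per-object queues sorted by offset, the `batch` size, and the single hard `deadline` field (a stand-in for the mandatory deadline) are all assumptions made here for illustration. The dynamic deadline is not modelled.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class IORequest:
    obj_id: int      # storage object the request targets
    offset: int      # byte offset inside the object
    size: int
    deadline: float  # hypothetical hard deadline (absolute time)


class OBRRScheduler:
    """Toy object-based round robin: serve a batch from one object, then rotate."""

    def __init__(self, batch=4):
        self.batch = batch
        self.queues = OrderedDict()  # obj_id -> requests sorted by offset

    def add(self, req):
        queue = self.queues.setdefault(req.obj_id, [])
        queue.append(req)
        queue.sort(key=lambda r: r.offset)  # keep each object's I/O sequential

    def next_batch(self, now):
        # Assumed starvation guard: an expired request is served first.
        expired = [r for q in self.queues.values() for r in q if r.deadline <= now]
        if expired:
            urgent = min(expired, key=lambda r: r.deadline)
            queue = self.queues[urgent.obj_id]
            queue.remove(urgent)
            if not queue:
                del self.queues[urgent.obj_id]
            return [urgent]
        if not self.queues:
            return []
        # Take the object at the head of the rotation.
        obj_id, queue = self.queues.popitem(last=False)
        batch, rest = queue[:self.batch], queue[self.batch:]
        if rest:
            self.queues[obj_id] = rest  # re-queue at the tail: round robin
        return batch
```

Grouping by object and sorting by offset is one plausible way to hand the back-end storage a more sequential workload; the real scheduler may order and batch requests differently.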
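The dynamic I/O congestion control described in the abstract (more client requests when the server is lightly loaded, throttling when it is overloaded) can be sketched as a small control rule. The rule below borrows additive-increase, multiplicative-decrease from network congestion control. That choice, the function name `adjust_credits`, and the watermark parameters are assumptions for illustration, not the mechanism evaluated in the thesis.

```python
def adjust_credits(current, queue_depth, low_water, high_water,
                   min_credits=1, max_credits=32):
    """Return the number of concurrent I/O requests a client may have in flight.

    queue_depth is the number of I/O requests pending on the server.
    All thresholds are hypothetical tuning knobs.
    """
    if queue_depth > high_water:
        # Server overloaded: throttle the client sharply.
        return max(min_credits, current // 2)
    if queue_depth < low_water:
        # Server lightly loaded: let the client send a little more.
        return min(max_credits, current + 1)
    # In between: hold steady.
    return current
```

Halving on overload bounds the server-side queue quickly, while the slow increase probes for spare capacity; other policies would fit the same interface.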
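The adaptive timeout and early reply strategies from the abstract can be sketched together. The estimator below (largest recent sample times a safety margin, clamped to a floor and ceiling) and the class names `AdaptiveTimeout` and `PendingRPC` are assumptions for illustration. The abstract states only that the timeout follows network conditions and server load, and that an early reply carries an estimate of the extra service time.

```python
from collections import deque


class AdaptiveTimeout:
    """Client-side timeout derived from recently observed RPC costs."""

    def __init__(self, initial=10.0, window=8, margin=1.25,
                 floor=1.0, ceiling=600.0):
        self.initial = initial
        self.margin = margin
        self.floor = floor
        self.ceiling = ceiling
        self.samples = deque(maxlen=window)  # sliding window of samples

    def observe(self, service_time, net_latency):
        # One sample = server service time plus a network round trip.
        self.samples.append(service_time + 2 * net_latency)

    def timeout(self):
        if not self.samples:
            return self.initial
        estimate = max(self.samples) * self.margin
        return min(self.ceiling, max(self.floor, estimate))


class PendingRPC:
    """An in-flight request whose deadline an early reply can extend."""

    def __init__(self, sent_at, timeout):
        self.deadline = sent_at + timeout

    def on_early_reply(self, now, extra_service_time):
        # The server is alive but busy: wait for the extra time it asked for.
        self.deadline = max(self.deadline, now + extra_service_time)

    def timed_out(self, now):
        return now > self.deadline
```

Because an early reply arrives only from a live server, a request that still times out afterwards points more strongly to a node or network failure than to congestion.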
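The abstract mentions an interval-tree-based optimization of extent lock conflict detection. A minimal sketch of the underlying idea follows: an unbalanced binary search tree keyed on extent start, where each node also stores the largest end in its subtree, so a query can skip subtrees that cannot overlap. Lock modes and compatibility are not modelled, and the names `ExtentNode`, `insert`, and `overlaps` are illustrative, not Lustre identifiers.

```python
class ExtentNode:
    def __init__(self, start, end):
        self.start, self.end = start, end  # closed byte range [start, end]
        self.max_end = end                 # largest end in this subtree
        self.left = None
        self.right = None


def insert(root, start, end):
    """Insert an extent and return the (possibly new) subtree root."""
    if root is None:
        return ExtentNode(start, end)
    if start < root.start:
        root.left = insert(root.left, start, end)
    else:
        root.right = insert(root.right, start, end)
    root.max_end = max(root.max_end, end)
    return root


def overlaps(root, start, end):
    """Return True if any stored extent overlaps [start, end]."""
    node = root
    while node is not None:
        if node.start <= end and start <= node.end:
            return True
        # If the left subtree reaches at least `start`, an overlap, if one
        # exists at all, can be found there; otherwise only the right can hold one.
        if node.left is not None and node.left.max_end >= start:
            node = node.left
        else:
            node = node.right
    return False
```

A linked list must compare the new extent against every granted lock, whereas this search follows a single root-to-leaf path; a production version would also keep the tree balanced.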
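The version check that the abstract attributes to version based recovery can be sketched as a replay loop: a transaction is replayed only if the target object still has the version the transaction originally saw. The dictionary layout, the field names `transno`, `obj`, and `pre_version`, and the rule that a replayed transaction sets the object's version to its transaction number are assumptions for illustration.

```python
def replay(objects, transactions):
    """Replay transactions whose recorded pre-operation version still matches.

    objects maps object id -> current version (0 if never modified).
    Returns (replayed, skipped) lists of transaction numbers.
    """
    replayed, skipped = [], []
    for txn in sorted(transactions, key=lambda t: t["transno"]):
        obj = txn["obj"]
        if objects.get(obj, 0) == txn["pre_version"]:
            objects[obj] = txn["transno"]  # assumed: new version = transno
            replayed.append(txn["transno"])
        else:
            # The object changed through a transaction that is missing from
            # this replay, so this transaction cannot be applied safely.
            skipped.append(txn["transno"])
    return replayed, skipped
```

For example, if a client that produced transaction 11 on object `a` never reconnects, a later transaction 13 that saw `a` at version 11 is skipped, while transactions on unrelated objects still replay.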
