




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
1、在大規模Kubernetes集群上實現SLOMethods to achieve high SLOs on a large scale Kubernetes clusterWhy SLO?LatencySLIAvailabilityQPSCorrectnessSLOPunishmentSLASLI defines an indicator, which can represent user experience.SLO is the object that try to meets all SLIs in a period of time.SLA = SLO + Punishment.SLO
2、(Service-Level Objective). Within service-level agreements (SLAs), SLOs are the objectives that must be achieved for each service activity, function and process to provide the best opportunity for service recipient success. GartnerSLA/SLO/SLIWhat we concern about Large k8s ClusterIs the cluster heal
3、thy1 Are all software components working fine 2 How many failures occurred on the clusterWhat happened about the cluster1 Is there something unexpected happened in the cluster 2 What end users did in the clusterHow to locate failureWhich component is going wrongWhich component that leads delivery of
4、 the pod to failureSLIs on Large k8s ClusterCluster health stateA combination value indicates the risk in the cluster. Currently Healthy, Warning and Fatal are the possible value.Success rateA rate value indicates the rate of success about creating/upgrading pod.Number of Terminating PodA number val
5、ue indicates the count of pods that can not be deleted in a certain period.Centralized Components AvailabilityA ratio value indicates the time in which the cluster is available. It is used to evaluate the master components.Nodes AvailabilityA number value indicates the number of unhealthy node in th
6、e cluster. Pods scheduled to unhealthy nodes may not be delivered in time, success rate would decrease consequently.The success standard and reason classificationThe success standard:PodFeatureTime limitSuccess conditionPodRestartPolicy=Always1min(example value)the status of .Status.Conditions. “typ
7、e=Ready” is True”Job PodRestartPolicy=Never/OnFailure1min.Status.Phase= Running/Succeeded/FailedTerminating Pod1minpod is removed from etcdUnhealthy NodeTaint/Degrade1minNode has taints or is degradedProcessingBase on the failure reasonUnhealth node is healed or removed.Reason classification:SourceF
8、eatureExampleSystemFailure caused by cluster itselfRuntimeError, ImageFailed, Unscheduled, KubeletDelay.End UsersFailure caused by end usersContainerCrashLoopBackOff, FailedPostStartHook, UnhealthyTrace systemIncrease of SLOData CollectAudit logEventThe unhealthy nodeMonitoringIsolationRecoverDegrad
9、eData AnalysisFailures/MachineFailures/ReasonReportLifecycle of PodFailure ReasonTargetKubeletApiserverSchedulerOperatorRuntimeDaemonsetAlertGray ScaleBug FixSuccess RateSLOThe unhealthy nodeCluster HealthyTerminating Pod NumberDaily ReportValidationHousekeepi ngHigh AvailableFast RecoveryDisplay Bo
10、ardAlertAnalysis PlatformWeekly ReportSLO:Indicate the cluster is healthy or thereis something unexpected happened.Trace system:Collect and analyze logs in cluster. Sowe can known what happened about the cluster.Increase of SLO:Get the weakness of the cluster byanalyzing the failure reasons and effe
11、ctive actions should be taken to increase the success rate.The unhealthy node:Detect the unhealthy machines timely and fix problems automatically。The infrastructureLogEventEnd UserStorageAnalysis PlatformTraceReportWeaknessThe trace systemData Collect:Collect Audit log for the whole cluster.Data ana
12、lysis:Analyze failure reason if pod is failed.Reason analysis:Analyze the failure reasons. Try to find somethingabnormal in the cluster.Trace Result:We can get:It is failed to deliver the pod,and the fail reason isFailedMount.Pod LifecycleFailureReasonTrace System:Node Metricsnodemetricskubelet metr
13、icsdaemonset metricsnode loadslo metricscsi metricsdirty dataWith huge amount of metrics data collected, statistical methods can be used to check whether the node is healthy or not.Besides, node delivery capacity can also be evaluated via historical data.With dirty data metrics which consists ofesca
14、ped/zombie/uninterruptible processorphaned containersorphaned pod directories/volumesorphaned cgroupsorphaned net deviceand so on, node recovery system can cleanup those dirty data or alert cluster admins to process dirty data manually.Unhealthy nodeSLOTraceNPDMetricsEvent SourceruntimeErrorCo ntoll
15、erfailedPodContr ollerDetectorStrategyUnhealthy node listFast TaintWeight AdjustRecoveryManualHandlingImproveAuto Human experienceImprove of strategyCollect data from metrics NPD, Trace system and Log.Analyze the problem of the node, such as DiskRO, critical Daemonset is not ready.Processing unhealt
16、hy node: Heal, Degrade or Isolate. With scoring mechanism and historical operation records, unhealthy nodes can be recovered automatically. Otherwise manual intervention is required.Generate daily report to show what happens, and programmers can improve the system with the reportscontinuously.Daily ReportTips on increasing SLOCase 1: Image DownloadImage lazyload technology provides the ability to run a container without downloading image.Case 2: RetryPod should be recreate when the previous pod is failed. The previous node sho
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2025年企業可持續發展報告:聚焦SDGs目標與員工健康
- 2025年農業與食品行業食品安全監管體系優化報告
- 2025年農業生物技術在種業中的生物技術產品市場拓展與國際化報告
- 2025年農業面源污染治理農業面源污染治理與鄉村振興研究報告
- 腔鏡手術護理新進展培訓班講課件
- 偏執性精神病的護理講課件
- 廣州體育學院《醫學統計學(A)》2023-2024學年第二學期期末試卷
- 黑龍江大學《就業技能提升與實訓》2023-2024學年第二學期期末試卷
- 泰山學院《人體斷面解剖學》2023-2024學年第二學期期末試卷
- 漯河醫學高等專科學?!吨袑W體育教學技能訓練》2023-2024學年第二學期期末試卷
- GB/T 45700-2025物業管理術語
- 【MOOC】土木工程制圖-同濟大學 中國大學慕課MOOC答案
- 創業修煉智慧樹知到期末考試答案2024年
- 八年級道德與法治下冊第一單元堅持憲法至上思維導圖人教部編版
- (完整版)綠色施工管理體系與管理制度
- 報銷明細匯總表
- 塊狀物品推送機機械原理課程設計
- 室內全彩LED屏采購合同
- 鳳仙花的發芽與生長的觀察記錄表
- 入無分別總持經(敦煌本)簡體+入無分別法門經(宋)
- 海綿城市詳解ppt課件
評論
0/150
提交評論