在大規模Kubernetes集群上實現SLO_第1頁
在大規模Kubernetes集群上實現SLO_第2頁
在大規模Kubernetes集群上實現SLO_第3頁
在大規模Kubernetes集群上實現SLO_第4頁
在大規模Kubernetes集群上實現SLO_第5頁
已閱讀5頁,還剩6頁未讀 繼續免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

1、在大規模Kubernetes集群上實現SLOMethods to achieve high SLOs on a large scale Kubernetes clusterWhy SLO?LatencySLIAvailabilityQPSCorrectnessSLOPunishmentSLASLI defines an indicator, which can represent user experience.SLO is the object that try to meets all SLIs in a period of time.SLA = SLO + Punishment.SLO

2、(Service-Level Objective). Within service-level agreements (SLAs), SLOs are the objectives that must be achieved for each service activity, function and process to provide the best opportunity for service recipient success. GartnerSLA/SLO/SLIWhat we concern about Large k8s ClusterIs the cluster heal

3、thy1 Are all software components working fine 2 How many failures occurred on the clusterWhat happened about the cluster1 Is there something unexpected happened in the cluster 2 What end users did in the clusterHow to locate failureWhich component is going wrongWhich component that leads delivery of

4、 the pod to failureSLIs on Large k8s ClusterCluster health stateA combination value indicates the risk in the cluster. Currently Healthy, Warning and Fatal are the possible value.Success rateA rate value indicates the rate of success about creating/upgrading pod.Number of Terminating PodA number val

5、ue indicates the count of pods that can not be deleted in a certain period.Centralized Components AvailabilityA ratio value indicates the time in which the cluster is available. It is used to evaluate the master components.Nodes AvailabilityA number value indicates the number of unhealthy node in th

6、e cluster. Pods scheduled to unhealthy nodes may not be delivered in time, success rate would decrease consequently.The success standard and reason classificationThe success standard:PodFeatureTime limitSuccess conditionPodRestartPolicy=Always1min(example value)the status of .Status.Conditions. “typ

7、e=Ready” is True”Job PodRestartPolicy=Never/OnFailure1min.Status.Phase= Running/Succeeded/FailedTerminating Pod1minpod is removed from etcdUnhealthy NodeTaint/Degrade1minNode has taints or is degradedProcessingBase on the failure reasonUnhealth node is healed or removed.Reason classification:SourceF

8、eatureExampleSystemFailure caused by cluster itselfRuntimeError, ImageFailed, Unscheduled, KubeletDelay.End UsersFailure caused by end usersContainerCrashLoopBackOff, FailedPostStartHook, UnhealthyTrace systemIncrease of SLOData CollectAudit logEventThe unhealthy nodeMonitoringIsolationRecoverDegrad

9、eData AnalysisFailures/MachineFailures/ReasonReportLifecycle of PodFailure ReasonTargetKubeletApiserverSchedulerOperatorRuntimeDaemonsetAlertGray ScaleBug FixSuccess RateSLOThe unhealthy nodeCluster HealthyTerminating Pod NumberDaily ReportValidationHousekeepi ngHigh AvailableFast RecoveryDisplay Bo

10、ardAlertAnalysis PlatformWeekly ReportSLO:Indicate the cluster is healthy or thereis something unexpected happened.Trace system:Collect and analyze logs in cluster. Sowe can known what happened about the cluster.Increase of SLO:Get the weakness of the cluster byanalyzing the failure reasons and effe

11、ctive actions should be taken to increase the success rate.The unhealthy node:Detect the unhealthy machines timely and fix problems automatically。The infrastructureLogEventEnd UserStorageAnalysis PlatformTraceReportWeaknessThe trace systemData Collect:Collect Audit log for the whole cluster.Data ana

12、lysis:Analyze failure reason if pod is failed.Reason analysis:Analyze the failure reasons. Try to find somethingabnormal in the cluster.Trace Result:We can get:It is failed to deliver the pod,and the fail reason isFailedMount.Pod LifecycleFailureReasonTrace System:Node Metricsnodemetricskubelet metr

13、icsdaemonset metricsnode loadslo metricscsi metricsdirty dataWith huge amount of metrics data collected, statistical methods can be used to check whether the node is healthy or not.Besides, node delivery capacity can also be evaluated via historical data.With dirty data metrics which consists ofesca

14、ped/zombie/uninterruptible processorphaned containersorphaned pod directories/volumesorphaned cgroupsorphaned net deviceand so on, node recovery system can cleanup those dirty data or alert cluster admins to process dirty data manually.Unhealthy nodeSLOTraceNPDMetricsEvent SourceruntimeErrorCo ntoll

15、erfailedPodContr ollerDetectorStrategyUnhealthy node listFast TaintWeight AdjustRecoveryManualHandlingImproveAuto Human experienceImprove of strategyCollect data from metrics NPD, Trace system and Log.Analyze the problem of the node, such as DiskRO, critical Daemonset is not ready.Processing unhealt

16、hy node: Heal, Degrade or Isolate. With scoring mechanism and historical operation records, unhealthy nodes can be recovered automatically. Otherwise manual intervention is required.Generate daily report to show what happens, and programmers can improve the system with the reportscontinuously.Daily ReportTips on increasing SLOCase 1: Image DownloadImage lazyload technology provides the ability to run a container without downloading image.Case 2: RetryPod should be recreate when the previous pod is failed. The previous node sho

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
  • 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論