并行程序設計中文課件 10 并行算法及性能分析

上傳人：o*** IP屬地：未知上傳時間：2024-12-31 格式：PPT 頁數：49 大小：1.03MB 積分：6 舉報 版權申訴

已閱讀5頁，還剩44頁未讀，繼續免費閱讀

版權說明：本文檔由用戶提供并上傳，收益歸屬內容提供方，若內容存在侵權，請進行舉報或認領

文檔簡介

ParallelProgrammingInstructor:ZhangWeizhe(張偉哲)ComputerNetworkandInformationSecurityTechniqueResearchCenter,SchoolofComputerScienceandTechnology,HarbinInstituteofTechnologyCommunicationCostsinParallelComputers

并行計算機中的通信成本OutlineMatrix-VectorMultiplication矩陣向量乘法Matrix-MatrixMultiplication矩陣矩陣乘法MatixAlgorithms:IntroductionDuetotheirregularstructure,parallelcomputationsinvolvingmatricesandvectorsreadilylendthemselvestodata-decomposition.Typicalalgorithmsrelyoninput,output,orintermediatedatadecomposition.Mostalgorithmsuseone-andtwo-dimensionalblock,cyclic,andblock-cyclicpartitioning.MatixAlgorithms:Introduction由于它們的規則結構，涉及矩陣和向量的并行計算容易使數據分解。典型的算法依賴于輸入，輸出或中間數據分解。大多數算法使用一維和二維塊，循環和塊循環分區。Matrix-VectorMultiplication6Matrix-VectorMultiplicationWeaimtomultiplyadensenxnmatrixAwithannx1vectorxtoyieldthenx1resultvectory.我們的目的是將密集的n×n矩陣A與n×1向量x相乘以產生n×1個結果向量y。Theserialalgorithmrequiresn2multiplications乘法andadditions加法.Matrix-VectorMultiplication:

Rowwise1-DPartitioningThenxnmatrixispartitionedamongnprocessors,witheachprocessorstoringcompleterowofthematrix.nxn矩陣在n個處理器之間分配，每個處理器存儲矩陣的完整行。Thenx1vectorxisdistributedsuchthateachprocessownsoneofitselements.nx1向量x被分布，使得每個進程擁有其一個元素。Matrix-VectorMultiplication:

Rowwise1-DPartitioningMultiplicationofannxnmatrixwithannx1vectorusingrowwiseblock1-Dpartitioning.Fortheone-row-per-processcase,p=n.Matrix-VectorMultiplication:

Rowwise1-DPartitioningSinceeachprocessstartswithonlyoneelementofx,anall-to-allbroadcastisrequiredtodistributealltheelementstoalltheprocesses.由于每個進程只以x的一個元素開始，因此需要一個全部廣播來將所有元素分發到所有進程。ProcessPinowcomputes.Matrix-VectorMultiplication:

Rowwise1-DPartitioningConsidernowthecasewhenp<nandweuseblock1Dpartitioning.Eachprocessinitiallystoresn=pcompleterowsofthematrixandaportionofthevectorofsizen=p.Theall-to-allbroadcasttakesplaceamongpprocessesandinvolvesmessagesofsizen=p.Thisisfollowedbyn=plocaldotproducts.Thus,theparallelruntimeofthisprocedureisMatrix-VectorMultiplication:

Rowwise1-DPartitioning現在考慮p<n的情況，我們使用塊1D劃分。每個過程最初存儲矩陣的n=p個完整行和大小為n=p的向量的一部分。所有廣播都在p進程之間進行，并且涉及大小為n=p的消息。其次是n=p個本地點產品。因此，此過程的并行運行時間為Matrix-VectorMultiplication:

2-DPartitioningThenxnmatrixispartitionedamongn2processorssuchthateachprocessorownsasingleelement.

n×n矩陣在n2個處理器之間劃分，使得每個處理器擁有單個元素。Thenx1vectorxisdistributedonlyinthelastcolumnofnprocessors.nx1向量x僅分布在n個處理器的最后一列中。Matrix-VectorMultiplication:2-DPartitioningMatrix-VectorMultiplication:

2-DPartitioningWemustfirstalignthevectorwiththematrixappropriately.Thefirstcommunicationstepforthe2-Dpartitioningalignsthevectorxalongtheprincipaldiagonalofthematrix.Thesecondstepcopiesthevectorelementsfromeachdiagonalprocesstoalltheprocessesinthecorrespondingcolumnusingnsimultaneousbroadcastsamongallprocessorsinthecolumn.Finally,theresultvectoriscomputedbyperforminganall-to-onereductionalongthecolumns.Matrix-VectorMultiplication:

2-DPartitioning我們必須首先將向量與矩陣適當地對齊。2-D分區的第一個通信步驟使向量x沿矩陣的主對角線對齊。第二步將向量元素從每個對角線進程復制到相應列中的所有進程，使用列中所有處理器之間的n個同時廣播。最后，通過沿著列執行一對一的減少來計算結果向量。Matrix-VectorMultiplication:

2-DPartitioningThreebasiccommunicationoperationsareusedinthisalgorithm:one-to-onecommunicationtoalignthevectoralongthemaindiagonal,one-to-allbroadcastofeachvectorelementamongthenprocessesofeachcolumnall-to-onereductionineachrow.Matrix-VectorMultiplication:

2-DPartitioning該算法使用三種基本的通信操作：一對一通信，沿著主對角線對齊向量，每列n個進程中的每個向量元素的一對多廣播每行多對一。Matrix-VectorMultiplication:

2-DPartitioningWhenusingfewerthann2processors,eachprocessownsanblockofthematrix.Thevectorisdistributedinportionsofelementsinthelastprocess-columnonly.Inthiscase,themessagesizesforthealignment,broadcast,andreductionareallThecomputationisaproductofansubmatrixwithavectoroflength.Matrix-VectorMultiplication:

2-DPartitioning當使用少于n2個處理器時，每個進程擁有一個

矩陣塊。向量僅在最后一個進程列中

的元素的部分分布。在這種情況下，

對齊，廣播和減少的消息大小都是計算是具有

長度向量的

子矩陣的乘積。Matrix-VectorMultiplication:

2-DPartitioningThefirstalignmentsteptakestimeThebroadcastandreductionstaketimeLocalmatrix-vectorproductstaketimeTotaltimeis

Example:CodeofMVSeeReference22OutlineMatrix-MatrixMultiplicationMatrix-MatrixMultiplication

Matrix-MatrixMultiplicationConsidertheproblemofmultiplyingtwonxndense,squarematricesAandBtoyieldtheproductmatrixC=AxB.TheserialcomplexityisO(n3).Ausefulconceptinthiscaseiscalledblockoperations.Inthisview,annxnmatrixAcanberegardedasaqxqarrayofblocksAi,j(0≤i,j<q)suchthateachblockisan(n/q)x(n/q)submatrix.Inthisview,weperformq3matrixmultiplications,eachinvolving(n/q)x(n/q)matrices.Matrix-MatrixMultiplication考慮將兩個n×n密集矩陣A和B相乘以產生乘積矩陣C=A×B的問題。串行復雜度為O（n3）。在這種情況下，一個有用的概念稱為塊操作。在該視圖中，n×n矩陣A可以被認為是塊Ai，j（0≤i，j<q）的q×q陣列，使得每個塊是（n/q）×（n/q）子矩陣。在這種觀點中，我們執行q3矩陣乘法，每個涉及（n/q）x（n/q）個矩陣。Matrix-MatrixMultiplicationConsidertwonxn

matricesAandBpartitionedintopblocksAi,jandBi,j(0≤i,j<

)ofsizeeach.ProcessPi,jinitiallystoresAi,jandBi,jandcomputesblockCi,joftheresultmatrix.ComputingsubmatrixCi,jrequiresallsubmatricesAi,kandBk,jfor0≤k<.All-to-allbroadcastblocksofAalongrowsandBalongcolumns.Performlocalsubmatrixmultiplication.Matrix-MatrixMultiplication考慮兩個n×n矩陣A和B分成p個

塊Ai，j和Bi，j（0≤i，j< ）。過程Pi，j首先存儲Ai，j和Bi，j并計算結果矩陣的塊Ci，j。計算子矩陣Ci，j要求所有子矩陣Ai，k和Bk，j為0≤k< 。沿著行的A的所有廣播塊，沿列的B。執行局部子矩陣乘法。Matrix-MatrixMultiplicationThetwobroadcaststaketime

Thecomputationrequiresmultiplicationsof sizedsubmatrices.Theparallelruntimeisapproximately

Majordrawbackofthealgorithmisthatitisnotmemoryoptimal.算法的主要缺點是它不是內存最優的。Matrix-MatrixMultiplication:

Cannon'sAlgorithmInthisalgorithm,weschedulethecomputationsofthe processesoftheithrowsuchthat,atanygiventime,eachprocessisusingadifferentblockAi,k.TheseblockscanbesystematicallyrotatedamongtheprocessesaftereverysubmatrixmultiplicationsothateveryprocessgetsafreshAi,kaftereachrotation.Matrix-MatrixMultiplication:

Cannon'sAlgorithm在該算法中，我們調度第i行的

進程的計算，使得在任何給定時間，每個進程使用不同的塊Ai，k。在每個子矩陣乘法之后，這些塊可以在進程之間系統地旋轉，使得每個進程在每次旋轉之后都獲得新的Ai，k。Matrix-MatrixMultiplication:

Cannon'sAlgorithmMatrix-MatrixMultiplication:

Cannon'sAlgorithmPerformlocalblockmultiplication.EachblockofAmovesonestepleftandeachblockofBmovesonestepup(againwithwraparound).Performnextblockmultiplication,addtopartialresult,repeatuntilallblockshavebeenmultiplied.Matrix-MatrixMultiplication:

Cannon'sAlgorithm執行本地塊乘法。A的每個塊移動一步，B的每個塊向上移動一次（再次繞過）。執行下一個塊乘法，添加到部分結果，重復直到所有的

塊都被乘。Matrix-MatrixMultiplication:

Cannon'sAlgorithmInthealignmentstep,sincethemaximumdistanceoverwhichablockshiftsis,thetwoshiftoperationsrequireatotaloftime.Eachofthesingle-stepshiftsinthecompute-and-shiftphaseofthealgorithmtakestime.Thecomputationtimeformultiplyingmatricesofsize is.Theparalleltimeisapproximately:Matrix-MatrixMultiplication:

DNSAlgorithmUsesa3-Dpartitioning.Visualizethematrixmultiplicationalgorithmasacube.matricesAandBcomeintwoorthogonalfacesandresultCcomesouttheotherorthogonalface.Eachinternalnodeinthecuberepresentsasingleadd-multiplyoperation(andthusthecomplexity).DNSalgorithmpartitionsthiscubeusinga3-Dblockscheme.Matrix-MatrixMultiplication:

DNSAlgorithm使用3-D分區。將矩陣乘法算法可視化為立方體。矩陣A和B進入兩個正交面，結果C出來另一個正交面。多維數據集中的每個內部節點都表示單一的加乘操作（因此是復雜的）。DNS算法使用3-D塊方案分區此多維數據集。Matrix-MatrixMultiplication:

DNSAlgorithmAssumeannxnxnmeshofprocessors.MovethecolumnsofAandrowsofBandperformbroadcast.Eachprocessorcomputesasingleadd-multiply.ThisisfollowedbyanaccumulationalongtheCdimension.Sinceeachadd-multiplytakesconstanttimeandaccumulationandbroadcasttakeslogntime,thetotalruntimeislogn.Thisisnotcostoptimal.Itcanbemadecostoptimalbyusingn/lognprocessorsalongthedirectionofaccumulation.Matrix-MatrixMultiplication:

DNSAlgorithm假設一個nxnxn個網格的處理器。移動A列和B列，并執行廣播。每個處理器計算單個加法乘法。其次是沿著C維度的積累。由于每個加乘乘以恒定時間，累加和廣播占用時間，因此總運行時間為logn。這不是最佳的成本。通過沿著積累的方向使用n/logn個處理器可以使成本最優。Matrix-MatrixMultiplication:

DNSAlgorithmMatrix-MatrixMultiplication:

DNSAlgorithm

Usingfewerthann3

processors.Assumethatthenumberofprocessespisequaltoq3forsomeq<n.Thetwomatricesarepartitionedintoblocksofsize(n/q)x(n/q).Eachmatrixcanthusberegardedasaqxqtwo-dimensionalsquarearrayofblocks.Thealgorithmfollowsfromthepreviousone,except,inthiscase,weoperateonblocksratherthanonindividualelements.Matrix-MatrixMultiplication:

DNSAlgorithm使用少于n3個處理器。假設某些q<n，進程數p等于q3。將兩個矩陣劃分成大小（n/q）×（n/q）的塊。因此，每個矩陣可以被認為是塊的q×q二維正方形陣列。算法遵循前一個算法，除了在這種情況下，我們操作塊而不是單個元素。Matrix-MatrixMultiplication:

DNSAlgorithmUsingfewerthann3

processors.Thefirstone-to-onecommunicationstepisperformedforbothAandB,andtakestimeforeachmatrix.Thetwoone-to-allbroadcaststaketimeforeachmatrix.ThereductiontakestimeMultiplicationofsubmatricestakestime.Theparalleltimeisapproximatedby:

Example:CodeofMMSeeReference43BasicDefinitions基本定義Memory(M)—amountofstoragerequired(e.g.,words)forgivenproblem給定問題所需的存儲空間（例如單詞）Work(W)—numberofoperations(e.g.,flops)requiredforgivenproblem,includingloadsandstores給定問題所需的操作數（例如，翻牌），包括加載和存儲Velocity(V)—numberofoperationsperunittime(e.g.,flops/sec)performedbyoneprocessor由一個處理器執行的每單位時間的操作數（例如，觸發器/秒）Time(T)—elapsedwall-clocktime(e.g.,secs)frombeginningtoendofcomputation從開始到結束計算的經過的掛鐘時間（例如秒）Cost(C)—productofnumberofprocessorsandexecutiontime(e.g.,processor-seconds)處理器數量和執行時間的乘積（例如處理器秒數）44BasicDefinitionsSubscriptindicatesnumberofprocessorsused(e.g.,T1isserialexecutio

人人文庫> 全部分類> 教育資料 > 課件下載

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯系上傳者。文件的所有權益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網頁內容里面會有圖紙預覽，若沒有圖紙預覽就沒有圖紙。
4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
5. 人人文庫網僅提供信息存儲空間，僅對用戶上傳內容的表現方式做保護處理，對用戶上傳分享的文檔內容本身不做任何修改或編輯，并不能對任何下載內容負責。
6. 下載文件中如有侵權或不適當內容，請與我們聯系，我們立即糾正。
7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

老太爷的乳妓h开裆裤,久久久久久精品国产三级非禁歌 ,久久久久久久99精品国产片,免费观看交性大片

并行程序設計中文課件 10 并行算法及性能分析

文檔簡介

溫馨提示

最新文檔

評論

老太爷的乳妓h开裆裤,久久久久久精品国产三级非禁歌 ,久久久久久久99精品国产片,免费观看交性大片

并行程序設計 中文課件 10 并行算法及性能分析

文檔簡介

溫馨提示

最新文檔

評論

相關文檔

并行程序設計中文課件 10 并行算法及性能分析