




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
CUDAProgramming
(GPUProgramming)Instructor:ZhangWeizhe(張偉哲)ComputerNetworkandInformationSecurityTechniqueResearchCenter,SchoolofComputerScienceandTechnology,HarbinInstituteofTechnologyMotivation動機(jī)GPUArchitectureGPU架構(gòu)Threewaystoaccelerateapplications三種加速應(yīng)用的方法CUDAProgrammingModelCUDA編程模型CUDAProgrammingBasicsCUDA編程基礎(chǔ)Outline3ASimpleExampleThreeIntegerArrays
A[N]B[N]C[N]Wewanttocalculate
C[i]=A[i]+B[i]
4Traditionally,onthecpu(serial) for(i=0;i<N;++i)
C[i]=A[i]+B[i] T(N)=O(N)5Traditionally,onthecpu(parallel)createNthreads
C[threadid.i]=A[threadid.i]+B[threadid.i]
T(N)=O(1)6GPUComputingButthereisaproblem.ApplicationslikeNeedthousandsofthreadstoexecuteMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline8AsimplecomparisonbetweenCPUandGPU
9AdetaileddescriptionGraphicsProcessingClusters(GPCs)TextureProcessingClusters(TPCs)StreamingMultiprocessors(SM)10pascal-architecture-whitepaperMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutline12ThreemethodsCUDAOptimizedLibrariesProgrammingLanguages13ThreemethodsCUDAOptimizedLibrariesTheselibrariesarewritteninCUDASimplyreplaceyourstandardlibraryfunctionswithcorrespondingCUDAlibrariesItsupportsmanymathlibrariesbutnotallasupportedlistcanbefoundat/gpu-accelerated-libraries
14ThreemethodsItisadirective-basedprogrammingmodelYouneedtoisnertsomedirectivesinyourcodeUseopenacccompilertocompilethecode15ThreemethodsProgrammingLanguagesMotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline18PrerequestforCUDAProgrammingHardwareANvidiaGraphicscard:Itcanbeaspecializedcomputingcard,likeTeslaPascalGP100(tooexpensive),oranormalgamegraphiccard,likeGTorGTX.CheckwheteryourGPUsupportsCUDA:youcancheckoutthiswebsite/cuda-gpus
Clickon19PrerequestforCUDAProgrammingSoftwareCUDAToolkit:It’ssupportedonWindows,Mac,andmoststandardLinuxdistributions.Downloadfromhttps:///cuda-toolkitVisualStudio(ifonwindows):IfyouworkonWindows,forIknow,VSistheonlyIDEthatcanworkwithCUDA.Ifyoudon’twanttoinstallVS,youcanusetheCUDAcompilerNVCCdirectlyfromacommandline.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline21CUDAExecutionFlowCUDAApplicationHost=CPUDevice=GPUHost=CPUDevice=GPUParallelcodeSerialcodeSerialcodeParallelcode22CUDAExecutionFlow1.CopydatafromCPUmemorytoGPUmemory23CUDAExecutionFlowInstructtheGPUtostartcomputing24CUDAExecutionFlowCopytheresultsbacktoCPUmemoryCUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutline26CUDAThreadsParallelportionofanapplicatonfloatx=in[threadIdx.x];floaty=func(x);out[threadIdx.x]=y;
in[i]in[i+1]in[i+2]in[i+3]out[i]out[i+1]out[i+2]out[i+3]AkernelisafunctionexecutedontheGPUasanarrayofthreadsinparallelandcanbecalledfromCPUAllthreadsexecutethesamecode,cantakedifferentpathsEachthreadhasanID27CUDAThreads.....…..WarpWarpBlock28CUDAThreads.....Warp32ThreadsaregroupedintowarpsAwarpinCUDAistheminimumsizeofthedataprocessedinSIMDfashionbyaCUDAmultiprocessor.ThreadIDswithinawarpareconsecutiveandincreasingWarpisunitofthreadschedulinginSMs29CUDAThreads.....WarpOneorMorewarpsaregroupedintoblocksAthreadblockisbatchofthreadsthatcancooperatewitheachotherbysharingdatathroughsharedmemoryandsynchronizingtheirexecution.Ablockcanatmostcontain1024threadsbecasuseofthehardwaresourcelimitThethreadidisuniqueandstartsfromzeroinablockWarp30CUDAThreadsBlockBlockBlockBlockBlockBlockGridAkernelwillbeexecutedasagrid31CUDAThreadsKernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith2SMsSM0SM1Block0Block1Block2Block3Block4Block5Block6Block7CUDAThreads32KernelGridBlock0Block1Block2Block3Block4Block5Block6Block7Devicewith4SMsSM0SM1SM2SM4Block0Block4Block1Block5Block2Block6Block3Block7CUDAThreads33CUDAThreads34Block0CUDAThreads35Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblem:branchdivergenceExamplewithdivergence:If(threadIdx.x>2){…}ThiscreatestwodifferentcontrolpathsforthreadsinablockToavoidthis:If(threadIdx.x/WARP_SIZE>2){…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Sodon’tusethiskindofcode,letthewholewarpdothesamework.CUDAThreads36Allthreadswithinawarpmustexecutethesameinstructionatanygiventime,butthiswillyieldsaproblembranchdivergenceExamplewithdivergence:
If(threadIdx.x%2==0){…} else{…}ThiscreatestwodifferentcontrolpathsforthreadsinablockCUDAThreads37
If(threadIdx.x%2==0){…}else{…}CUDAThreads38If(threadIdx.x/WARP_SIZE==0){…}Else{…}Alsocreatestwodifferentcontrolpathsforthreadsinablock,butthegranularityisawholemultipleofwarpsize;allthreadsinanygivenwarpfollowthesamepath.Letthewholewarpdothesamework.CUDAProgrammingModelPrerequestforCUDAProgrammingCUDAExecutionFlowCUDAThreadsCUDAMemoryModelOutlineGlobalMemory&SyntaxGlobalmemoryisthe“main”memoryoftheGPU.Ithasglobalscopeandlifetimeoftheallocatingprogram(oruntilcudaFreeiscalled)GlobalmemoryissimilartotheheapinaCprogram.GlobalmemorysyntaxAllocatewithcudaMalloc(void**devPtr,size_tsize)FreewithcudaMalloc(void*devPtr)40intblk_sz=64;float*Md;intsize=blk_sz*blk_sz*sizeof(float);cudaMalloc((void**)&Md,size);…cudaFree(Md);Host-DeviceDataTransfercudaMemcpy()MemorydatatransferRequiresfourparametersPointertodestinationPointertosourceNumberofbytescopiedType/DirectionoftransferHosttoHost,HosttoDevice,DevicetoHost,DevicetoDeviceTransfertodeviceisasynchronous41cudaMemcpy(Md,M.elements,size,cudaMemcpyHostToDevice);cudaMemcpy(M.elements,Md,size,cudaMemcpyDeviceToHost);CPUMemoryGPUGPUMemoryCPUPCI-E8GB/sGDDR5190GB/sConstantMemory&SyntaxConstantmemoryisaformofvirtualaddressingofglobalmemory.SpecialpropertiesCached&read-onlySupportsbroadcastingasinglevaluetoalltheelementswithinawarpConstantmemoryisrestrictedto64KB(kernelargumentsarepassedthroughconstantmemory)ConstantmemorysyntaxInglobalscope(outsideofkernel,attoplevelofprogram)__constant__int
foo[2014];InhostcodecudaMemcpyToSymbol(foo,h_src,sizeof(int)*1024);42TextureMemoryComplicatedandonlymarginallyusefulforgeneralpurposecomputationUsefulcharacteristics2Dor3Ddatalocalityforcachingpurposesthrough“CUDAarrays”.GoesintospecialtexturecacheFastinterpolationon1D,2D,or3DarrayConvertingintegersto“unitized”floatingpointnumbers43It’sacomplextopic,youcanlearneverythingyouwanttoknowaboutitfromCUDAHandbookSharedMemory&SyntaxSharedmemoryisusedtoexchangedatabetweenCUDAthreadswithinablock.VeryfastmemorylocatedintheSMOn-chipmemory,low-latency,user-controlledL1cacheSharedmemorysyntaxStaticallocation__shared__floatdata[1024];//declarationinkernel,nothinginhostcodeDynamicallocationHost:
kernel<<<grid,block,numBytesShMem>>>(arg);Device(inkernel):
extern__shared__float
s[];44RememberSM=StreamingmultiprocessorSM≠SharedmemoryComputationalIntensityComputationalintensity=FLOPS/IOMatrixmultiplication:n3/n2=nN-bodysimulation:n2/n=n45Ifcomputationalintensityis>1,thensamedatausedinmorethan1computation.Doasfewgloballoadsandasmanysharedloadsaspossible.Registers&LocalMemoryRegistersFastest“memory”possible,about10xfasterthansharedmemoryMoststackvariablesdeclaredinkernelsarestoredinregisters(example:floatx)StaticallyindexedarraysstoredonthestackaresometimesputinregistersLocalMemoryLocalmemoryiseverythingonthestackthatcan’tfitinregistersThescopeoflocalmemoryisjustthethread.Localmemoryisstoredinglobalmemory(muchslowerthanregisters!)46Non-Programmable!CUDAMemoryModelSummaryMemorySpaceManagedbyPhysicalImplementationScopeonGPUScopeonCPULifetimeRegistersCompilerOn-chipPerThreadNotvisibleLifetimeofathreadLocalCompilerDeviceMemoryPerThreadNotvisibleSharedProgrammerOn-chipBlockNotvisibleBlocklifetimeGlobalProgrammerDeviceMemoryAllThreadsRead/WriteApplicationoruntilexplicitlyfreedConstantProgrammerDeviceMemoryAllThreadsRead-onlyRead/WriteTextureProgrammerDeviceMemoryAllThreadsRead-onlyRead/Write47MotivationGPUArchitectureThreewaystoaccelerateapplicationsCUDAProgrammingModelCUDAProgrammingBasicsOutlineCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline50ParallelProgramminginCUDACThefirstexample:
SummingVectorsImaginehavingtwolistsofnumberswherewewanttosumcorrespondingelementsofeachlistandstoretheresultinathirdlist.51ParallelProgramminginCUDACCPUcore1CPUcore252ParallelProgramminginCUDAC53ParallelProgramminginCUDAC__host____device__cudaError_tcudaMalloc(void**devPtr,size_tsize)__host__and__device__istypequafiler,whichmeasthisfunctioncanbecalledonthedeviceorthehost.AllcudafunctionstakesaerrorcodeasareturnvalueDifferentfromC’sMallocfunction,thisfunctionstakesapointertopointerasaparameter.54ParallelProgramminginCUDACExecutionconfiguration
<<<Dg,Db,Ns,S>>>Dgisoftypedim3andspecifiesthedimensionandsizeofthegrid,suchthatDg.x*Dg.y*Dg.zequalsthenumberofblocksbeinglaunched;Dbisoftypedim3andspecifiesthedimensionandsizeofeachblock,suchthatDb.x*Db.y*Db.zequalsthenumberofthreadsperblock;Nsistypeofsize_tandspecifiesthenumberofbytesinsharedmemorythatisdynamicallyallocatedperblockforthiscallinadditiontothestaticallyallocatedmemory;It’sanoptionalargumentwhichdefaultto0;SisoftypecudaStream_tandspecifiestheassociatedstream;It’sanoptionalargumentwhichdefaultsto055ParallelProgramminginCUDAC56ParallelProgramminginCUDAC__host__cudaError_tcudaMemcpy(void*dst,constvoid*src,size_t count,cudaMemcpyKindkind)CopiesdatabetweenhostanddevicecudaMemcpyKindspecifiesthedirectionofthecopy.ItisoneofcudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost,cudaMemcpyDeviceToDevice57ParallelProgramminginCUDAC__global__functionsmushhavevoidreturntype.Anycalltoa__global__functionmustspecifyitsexecutionconfiguration.Acalltoa__global__functionisasynchronous,meaningitreturnsbeforethedevicehascompleteditsexecution.58ParallelProgramminginCUDACblockIdx.xcontainstheblockindexwithinthegridthreadIdx.xcontainsthethreadindexwithintheblockCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline60SharedMemoryandSynchronizationThreadswithinablockcancommunicatewitheachotherthroughsharedmemory.__shared__isusedtomakethevariableresidentinsharedmemory.Iftwothreadswanttowritethesamesharedvariable,theremustbeasynchronizationbetweentwothreads.EitherAwritesafterB,orBwritesafterA.Now,let’stakealookatanexamplethatusesthesefeatures61SharedMemoryandSynchronizationDOTPRODUCTEachthreadmultipliesapairofcorrespondingentries,andtheneverythreadmovesontoitsnextpair.Becausetheresultneedstobethesumofallthesepairwiseproducts,eachthreadkeepsarunningsumofthepairsithasadded.62SharedMemoryandSynchronizationSupposewehaveNthreads,andthearraysizeisN*M63SharedMemoryandSynchronization64SharedMemoryandSynchronization65SharedMemoryandSynchronization66SharedMemoryandSynchronization67SharedMemoryandSynchronizationCUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline69ConstantMemoryandEventsConstantmemoryusedfordatathatwillnotchangeoverthecourseofakernelexecutionNVIDIAhardwareprovides64KBofconstantmemorythatittreatsdifferentlythanittreatsstandardglobalmemory.70ConstantMemoryandEventsProblemDescription:Produceanimageofathree-dimensionalseene.Asimpleidea:Imaginethelightsgothroughtheobjects,andcastashadowontheplate.Sincelightscancomefromanyplaceatanypointinourscene,itturnsourit’seasiertoworkbackward.Eachpixelintheimagewillshootarayintothescene.Wefigureoutwhatcolorisseenbyeachpixelbytracingarayfromthepixelinquestionthroughthesceneuntilithitsoneofourobjects.Wethensaythatthepixelwould“see”thisobjectandcanassignitscolorbasedonthecoloroftheobjectitsees.Mostofthecomputationrequiredbyraytracingisinhecomputationoftheseintersectionsoftheraywiththeobjextsinthescene.71ConstantMemoryandEvents72ConstantMemoryandEventsWhatwilltheraytracerdoItwillfirearayfromeachpixelComputewhichrayshitwhichspheres,andthedepthofeachofthesehits.Inthecasewherearaypassesthroughmultiplespheres,onlythespherefurthesttotheimagecanbeseen.Wewillmodeloursphereswithadatastructurethatstoresthesphere’scentercoordinateof(x,y,z),itsradius,anditscolorof(r,b,g).73ConstantMemoryandEvents74ConstantMemoryandEvents75ConstantMemoryandEvents76ConstantMemoryandEventsConstantmemoryalwaysdeclaredinFilescope(globalvariable)CUDAProgrammingBasicParallelProgramminginCUDACSharedMemoryandSynchronizationConstantMemoryandEventsTextureMemoryOutline78TextureMemoryLikeconstantmemory,texturememoryisAnothervarietyofread-onlymemoryCachedonchip,soitwillprovidhighereffectivebandwidthDesignedforgraphicsapplicationswherememoryaccesspatternsexhibitagreatdealofspatialocality79TextureMemoryspatialocalityInacomputingapplication,thisroughlyimpliesthatathreadislikelytoreadfromanaddress“near”theaddressthatnearbythreadsread,asshowninthefigure.80TextureMemoryForCPUcachingscheme-------->thefouraddressesshownarenotconsecutiveandwouldnotbecached
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 嵌入式虛擬平臺評測試題及答案
- 姓氏文化創(chuàng)意管理制度
- 農(nóng)村移風(fēng)易俗管理制度
- 婦幼衛(wèi)生監(jiān)測管理制度
- 行政組織理論的精細(xì)管理試題及答案
- 工廠營銷設(shè)備管理制度
- 發(fā)酵工藝菌種管理制度
- 監(jiān)理師考試合作學(xué)習(xí)試題及答案
- 廚具用品倉庫管理制度
- 學(xué)校班規(guī)班級管理制度
- ECMO并發(fā)癥教學(xué)課件
- 胸椎骨折的護(hù)理查房
- 【知識精講精研】高中英語備課組長工作匯報(bào)
- 工程招標(biāo)代理服務(wù)投標(biāo)方案(技術(shù)方案)
- 錯漏混料點(diǎn)檢稽核表空白模板
- 2021城鎮(zhèn)燃?xì)庥枚酌褢?yīng)用技術(shù)規(guī)程
- 地面三維激光掃描作業(yè)技術(shù)規(guī)程
- GB/T 15587-2023能源管理體系分階段實(shí)施指南
- 工程項(xiàng)目部組織機(jī)構(gòu)架構(gòu)
- 【保安服務(wù)】服務(wù)承諾
- 老年醫(yī)學(xué)科臨床營養(yǎng)管理流程
評論
0/150
提交評論