




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
CUDAGPU3D圖像,是一個耗時的過程。在本文中我們將提出一種基于GPUCUDA編寫的盒計數(shù)算法的快速并行版本。優(yōu)化的GPU版本與串行CPU28CPU7倍。我們改進的盒計數(shù)算法已在3D模型下通過不同復雜程度、特征和大小進試,其有效性和算法精度已經(jīng)過已知的FD值模型驗證。在這個研究中,一個3D的包含一些腦組織數(shù)據(jù)的FD分析用于我們的GPU盒計數(shù)算法。1.法主要包括計數(shù)非零像素(2D)或素立體(3D)這類由圖像模型填補,在方框網(wǎng)格中所載形網(wǎng)格大小變化中不斷重復。數(shù)據(jù)中FD的值,根據(jù)選擇體素的類型(白、灰、黑、黑+示盒中該類型體素的數(shù)量N(l。最終得到的FD線性回歸比率為: FD的估計值。為找到盒的尺寸的上下界已進行了許多研究,而事實上這仍是一個開放性問題。有些理論研究[3,4]中,了這些框大小限制的參 研究了FD計算在生物醫(yī)學和醫(yī)學影像學中的應用。Lopes和Betrouni了一篇有關(guān)計算FD方法和他們在外地醫(yī)學圖像分析中的應用的有趣文章。更硬化,并檢測在早產(chǎn)兒和無生長受限變化情況。在[15]的程序,用于計算的MRI數(shù)據(jù)的三維FD呈現(xiàn)。他們在優(yōu)化這個軟件的執(zhí)行時間時需要通過加速3D盒子計數(shù)算近來,GPUGPGPU的使用已成倍增長,主要是由于新的架構(gòu)和編程模型的外觀的出現(xiàn),如NVIDIACUDA技術(shù)[16]Khronos的OpenCL的[17]。使用眾所周知的語言(C語言)進行簡單的擴展。GPUNVIDIACUDA,作者使用GPU來有效地執(zhí)行內(nèi)部患者的生物醫(yī)學圖像彈性配準。在文獻[22]中的一個GPULiebotich等人開發(fā)的。[23][24]。第二種算法[25]由薩卡和喬胡瑞研制,稱為差分盒計數(shù)(DBC。這些算法雖然有效,卻只使用一個線程執(zhí)行。最近,曾等人。利用多CPU[26]完善而且并行化了一個DBC盒計數(shù)算法。然而,據(jù)我們所知,以目前的GPU的高度并行計算能力來實現(xiàn)這些算法的性能改進的在本文接下來的部分,我們首先會在硬件和軟件兩個主要方面描述GPU處理和CUDA。然后,我們描述了我們在并行GPU上的盒子計數(shù)方法。接下來,展示我們FD的有效性。在我們的快速盒計數(shù)算法在生物醫(yī)學方面的測試中,我們展示出三個腦組織三維FD分析。結(jié)果最后歸納在第6部分。2、算法和理CUDA編程模型的特點。其次,我們描述了我們并行的盒計數(shù)算法,討論為什么這種算法被選定為并行GPU上。我們也解釋了當在CPU上執(zhí)行所CUDA編程模GPGPU編程模型是NVIDIA的CUDA和Khronos能,我們選擇了NVIDIA的CUDA開發(fā)我們的軟件執(zhí)行相同的功能,這就是所謂的內(nèi)核。只需要一個傳統(tǒng)語言(通常是C/C++)編程,使支持CUDA的GPU的硬件規(guī)格。每個CUDA設備有幾個多處理器(MPS,每個包含流處理器(SP)SIMD(多數(shù)據(jù)流)架構(gòu)。這些處理器負責以并行的方式執(zhí)行所有線程。CUDA的內(nèi)存架構(gòu)是由程序員顯式管理。每個設備都有大量慢速全局內(nèi)存(CPUGPU均可寫入,一個全局只讀(GPU方面)常量內(nèi)存,和一個特殊的紋理32GF100架構(gòu)[29]的出現(xiàn),一個真正的高1,23維的線程塊。線程塊被GPU的線程可以同時啟動,并在GPU上執(zhí)行。每個線程塊被分配到一個多處理器,所以它的(SIMT中線程數(shù)為warp大小的倍數(shù)[30]時達到最佳的性能。隨著CUDA調(diào)度執(zhí)行,這一結(jié)構(gòu)允許更高水平的擴展,因為線程塊被自動分配到空閑侯氏盒計數(shù)算[23],由侯等人優(yōu)化[24],具有O(N*ln(N))間復雜度。這將是我們在本文中使用在GPU上的盒。其他比較熟知的盒子計數(shù)算法是由薩卡和喬杜里[25]的差本文中,我們假設3D素體化模型(例如,通過一個2DMR圖像的堆棧形成的3D矩 侯氏盒子計數(shù)算法通過k位二進制數(shù)表示唯一確定一素坐標的全局坐標,其中2??為能夠完全包含整個模型邊界的體素個數(shù)。框格以迭代方式生成,每一個的尺寸,2??素,各不相同,其中0≤M≤k。因此,考慮到n為在一個方框內(nèi)網(wǎng)格的非零體素的數(shù)量,如果n=(2?? 1侯氏盒計數(shù)算法及其相關(guān)CUDAD表示m=1標包含k位。的串。位于([?????1…??2??1??0]??????,[?????1…??2??1??0]??????,[?????1…??2??1??0]??????)的體素其插位字符串為[?????1?????1?????1…?? ??]??????。因此插位字符串長度一定為k*3位。從而我們可以獲得一系列的插位字符串,其每一比特串唯一的標識一個3*m02個或以上標記過程反復進行直至m≤k。體素)3D模型,其分辨率k=264(1中僅示出了程中當m1這一算法在CPU上有一些實現(xiàn),例如由Kruger的[31],以及最近的 在這一部分中,介紹侯氏盒計數(shù)算法在NVDIAGPU和CUDA執(zhí)行時如何并行2GPU盒維數(shù)算法的主要模塊和數(shù)據(jù)流。需要注意的是圖二2陰影部分的功能和數(shù)據(jù)是在GPU端執(zhí)行(函數(shù))或的。另一方面,對于進一步代碼。此代碼直接對應的流程圖顯示在圖2。2PCI-Express總線傳輸GPU方面(全局內(nèi)存,在任何GPGPU的應用(1的第5行)GPU1示為圖1的一個立方體。如果它是一個空坐標(圖1中灰色立方體,該線程一個在3 的值(例如(2??),在圖1和圖3中表示為無限大。我像素坐標(X,Y,Z)三個二進制值([?????1…??2??1??0]??????,[?????1…??2??1??0]??????,??210??????3D模型的坐標中運行(18-10行。迭代的數(shù)目和網(wǎng)格尺寸直接依賴步,很多研究,根據(jù)GPU的性能優(yōu)勢提出了一些排序算法,以提高性能。因此,我們研模塊。在文獻[32],Peters等。提出了一種基于Batcher的傳統(tǒng)雙調(diào)排序算法(雙調(diào)排序也是CUDA軟件開發(fā)包中包含的一個樣本)的一個高性能的排序算法。最近Kolonias等為每個插位字符串是一個獨特的值)的數(shù)據(jù)。最后,Merrill等,在基數(shù)排序算法的基礎裝的。如果在未來的一個新的,更好的GPU排序算法被開發(fā),它可以被直接引入到我們標,從而提高了算法的性能,因為較少的線程必須在連續(xù)的內(nèi)核推出。其次,t在盒子計數(shù)0(m0時)t盒被標記為黑色,并且沒有白色或灰色的確定的值的數(shù)目。所以一旦t被計算,我們就可以直接在第一個框中計數(shù)的值(0個白色,0個灰色以及t個黑盒。隨后,進入到主循環(huán)(參見圖2和列表1中的17-30行。請記住,2??是完的=1,且M≤k時我們反復地啟動兩個GPU的核函數(shù),標記位字符串(MaskBit Boxes,算盒增量。兩個內(nèi)核都在相同的CUDA配置環(huán)境下啟動:一維線程塊中每個包含512個線程以及二維網(wǎng)格,包含啟動T個線程必須數(shù)量的線程快。通過啟動t個線程,只有真正需要的工作流程在執(zhí)行,從而忽視了位串數(shù)組的無用的位,如前面提到的。這也示于圖1。 Strings(列表1,20行。每個線程將位串最右邊m*3位標記為0,如第2.2節(jié),步驟D的第一行所示。這個內(nèi)核執(zhí)(MaskedArray Boxes(盒是黑色或灰色。為了更好地理解,檢查盒過程圖解詳見圖3,其內(nèi)核代碼如列表2所線程無法計數(shù)任何黑色或灰色的盒。這些線程在圖3中被表示為“什么也不做”。反之,如果是這樣的v的第一次出現(xiàn),則線程“跳躍”2??3?1坐標(這是在一個方框內(nèi)的體素的最后一個問題涉及到的檢查盒的核函數(shù),這個問題是是如何的黑色和灰色框的數(shù)列表加前一個共享內(nèi)存值,最后它們將此中間值在全局器。列表2顯示了這個內(nèi)核的模型的x和y坐標,因此不得不在每次迭代掩蓋2*m位,而不是3*m位。結(jié)當我們在GPU上實現(xiàn)了并行化的盒計數(shù)算法,對比原始侯氏算法。我們還要性。最后,我們使用我們的GPU盒計數(shù)算法對三個腦組織進行三維FD分析。硬件環(huán)境和測我們使用的GPU為一塊NVIDIAGeGTX580(GTX580),驅(qū)動版本為286.19,CUDA4.1,CUDAcomputingcapability2.0。GTX580FermiMP1536個線MP81024MP最大線程過1536。在內(nèi)存方面,對于GTX580每個MP擁有32K個32位的寄存存,16KB64KB1536MB的DDR5全局內(nèi)存。在CPU方面,我們使用InXeonE5620@2.40GHz四核處理器,可以同時仿真4個線程。我們配備了12GB的內(nèi)存,系統(tǒng)為Windows7Professional64位。使用C++windows.h的PerfTimer類度量運行時間。我們盡量實現(xiàn)了侯氏單線程算法,GPU和CPU的運算效率,我們實現(xiàn)了一個多線程的侯氏盒計數(shù)算法并在多核CPU上運行。這個多線程算法使用C語言實現(xiàn)使用了process.h[36](在VisualStudio中包含。對于CPU并行計算的排序算法,我們選用In的TBB排序算法[37,38]。TBB算法可以改善多核運算的效率,對于測試過程中使用的模型,我們選擇一套包含不同復雜度、特征和大小的3D體素女性骨盆,繩結(jié)環(huán)來自于[40],另外納芙蒂蒂王后像來自于[41],圖4展示了每個模型在而后,盒計數(shù)算法的準確性通過使用兩個熟知的3DFD值分型比較準確性:一個了20年代Aubert-Broche的的正常人腦解剖模型[42,43]。這組模型可以在此實現(xiàn),比較了他們相關(guān)的GPU運行時間,我們執(zhí)行了從128x128x128體素到512x512x512體素的六組模型。測試模型平均GPU時間包括CPU-GPU-CPU的數(shù)據(jù)傳輸。線程,因此多核CPU得到充分利用。本的增長,而GPU運行時間的增長則較慢。事實上,如在圖5中所示,加速度幾乎以一的GPU盒計數(shù)算法縮放很理想。7倍。表2示出了在兩個不同的分辨率下各個模型的加速可以達到的情況。這些結(jié)果表6對每一個大小為2??速比示于圖6,再一次,加速隨著模型尺寸的增大而呈幾乎線性趨勢縮放。勢,即在表2中的3D模型的結(jié)果。7顯示了當執(zhí)行我們改進的盒計數(shù)算法時設備級的使用比例。這個數(shù)字清楚地表明GPUNVIDIAVisualProfiler工具。要注意,這些值與GPU運行時間嚴格相關(guān)。因此,CPU參與的指令,如內(nèi)核啟CUDAGPU實現(xiàn)的盒計數(shù)算法,快速程序,這項工作我們已完成。為了測量運行時間,他們所使用的“門格爾海境下)實現(xiàn)。我們也將在該環(huán)境下執(zhí)行我們的CUDA“門格爾海綿”分形。他們的3D512×512×512的固體立方體(3DFD=3729×729×729(3D形“謝爾賓斯毯”(理論二維FD=1.8928)檢查了我們的算法的精度。點的范圍要考慮到最低數(shù)量,能夠最大化線性回歸的相關(guān)值R以及更符合理論值,每個點2.7212.7268。類似的結(jié)果可以在固體立方體測試得賓斯毯”而獲得的二維FD為1.8629(參照圖10),也非常接近理論1的值,320個人(受試者編號如[40])三個腦組織(血管,骨髓和腦灰質(zhì))的詳細尺FD3DFD的平均值,執(zhí)行3DFDCPU和GPU的運行時間和相應的加速比。每個組織類的平均執(zhí)行時間和加速比如圖12FD2,這是一個邏輯的結(jié)果,因為骨髓形狀類似后,灰質(zhì)3DFD值因偏差較小而脫穎而出。為在三維FD的平均值(方形)SEM(塊程序的可用結(jié)GPU實現(xiàn)的侯氏盒子計數(shù)算法。我們已經(jīng)解釋了每GPU上進行的優(yōu)化。與此相關(guān),我們已經(jīng)簡要地介紹了主流的GPU排序算法,選擇最適合我們的算法。盒計數(shù)算法的有效性是明顯的,因為它是用于執(zhí)行對FD的計算基礎,以及許多生物和醫(yī)學成像研究和應用領(lǐng)域的基本GPU實現(xiàn)的盒子計數(shù)算法。在我們的程序中,最終的平3D2D圖像,并能夠獲得已將我們的快速盒計數(shù)算法應用于腦組織的3DFD分析。致 ERDF。通過研究項目UJA2009/13/04和PI10-TIC-5807支參考文 Russel,J. 1175–1178.1242computermethodsand F.Normant, Dean,V. in Afast program (2011)241–249. Y.T. H.Y. W.Y. 61(2010)Y.T.Wu, T.R.Chen, W.Y.Guo, Sahgal,X. fyingdegenerationofwhitematterinnormalagingusingfractaldimension,NeurobiologyofAging28(2007)Y.T.Wu,K.K.Shyu,C.W.Jao,Z.Y.Wang,B.W.Soong, P.S.Wang, change NeuroImage49 F.J.Esteban, Go?ni, P.Villoslada, Bargalló,P.Villoslada,E.Grós,Fractal-ysisdetectscerebralchangesinpreterminfantswithandwithoutintrauterinegrowthrestriction,NeuroImage43(2010)1225–1232.J.RuizdeMiras,P.Villoslada,J.Navas,F.J.Esteban,UJA- (2011)452–460.NVIDIA OPENCL T.Aach, –possibilitiesand P. andmulti-core . T.Toth, Afast A141 box-countingapproachtocomputefractaldimensionofimage,IEEETransactionsonGeoscienceandRemoteSensing32(1994)1096–1102.Y.C.Tzeng,K.T.Fan,Y.J.Su,K.S.Chen,Aparalleldifferentialboxcountingalgorithmappliedtohyperspectralimageclassifications,in:IEEEIGARSS, (1987)NVIDIACUDALibrary /compute/DevZone/docs/html/C/doc/html/index.html. (2011)50–59. withBitonic V.Kolonias, Voyiatzis,G. High AParallel InThreading In 2011,. 3DVIA Anew computermethodsandprogramsinbiomedicine108(2012)journalhomepage: /journals/cmpb Fastbox-countingalgorithmonJ.Jiménez,J.RuizdeMirasDepartmentofComputerSciences,UniversityofJaén,Jaén,articl inf abstracArticleReceived17December2011Receivedinrevisedform28JuneAccepted30July:BoxcountingFractal
Thebox-countingalgorithmisoneofthemostwidelyusedmethodsforcalculatingthefractaldimension(FD).TheFDhasmanyimage ysisapplicationsinthebiomedical?eld,whereithasbeenusedextensivelytocharacterizeawiderangeofmedicalsignals.However,computingtheFDforlargeimages,especiallyin3D,isatimeconsumingprocess.Inthispaperwepresentafastparallelversionofthebox-countingalgorithm,whichhasbeencodedinCUDAforexecutionontheGraphicProcessingUnit(GPU).TheoptimizedGPUimplementationachievedanaveragespeedupof28times(28×)comparedtoamono-threadedCPUimplementation,andanaveragespeedupof7times(7×)comparedtoamulti-threadedCPUimplementation.Theperformanceofourimprovedbox-countingalgorithmhasbeentestedwith3Dmodelswithdifferentcomplexity,featuresandsizes.Thevalidityandaccuracyofthealgorithmhasbeencon?rmedusingmodelswithwell-knownFDvalues.Asacasestudy,a3DFD ysisofseveralbraintissueshasbeenperformedusingourGPUbox-countingalgorithm.Thebox-countingalgorithm[1]isusedforcalculatinganesti-mateofthefractaldimension(FD)ofanimage,bothintwo(2DFD)andthreedimensions(3DFD).Thebox-countingalgorithmbasicallyconsistsofcountinghowmanynon-zeropixels(2D)orvoxels(3D),thosethatare?lledbytheimagemodel,arecontainedinagridofsquareboxes.Eachoneoftheseboxesislabelledasblack,grayorwhite,dependingonhowmanypixelsorvoxelsitcontains.Thisprocessisrepeatedforgridswithdifferentboxsizes.Fromthesedata,thevalueoftheFD,foraselectedtypeofvoxel(white,gray,black,black+gray),iscalcu-latedthroughalog–loglinearregressioninwhichtheXaxisrepresentstheinverseboxsize,l,andtheYaxisrepresentsthenumberofboxesforthattypeofvoxel,N(l).The?nalvaluefortheFDcorrespondstotheslopeofthelinearregression:FD=?((lnN(l))/(lnl))Severalstudies[2]haveshowedthatthebox-methodisvalidforstatisticallyself-similarmodels.
?2012Elsevier .All locationofthegridineachi tionand,especially,theselectionoftheboxsizeaffectthevalueoftheestimatedFD.Manystudiesweredoneto?ndtheupperandlowerboundsfortheboxsize;infactthisisstillanopenissue.Therearetheoreticalstudies[3,4]thatestablishtheseboxsizelimitsintermsofparameterssuchastheimagesize,itsdimensionorthenumberofpointsitcontains.Thesemethodsaregenerallyapplicableonlyforidealfractals.Mostauthorsselecttheboundsfortheboxsizeaccordingtotheirparticularmethodforcalculatingthebox-countingandtheapplication?eld.Inthesecases,thelinearportionofthelog–logplotusedtoestimatethevalueoftheFDisnormallydeterminedbyyzingdifferentvoxelsizerangeslookingforthosethatizethecorrelationvalue[5,6].Despitethesedraw-backs,thebox-countingalgorithmisboththemostfrequentlyusedandthemostpopularmethodtocalculatetheFD.Inrecentyears,severalpapershavestudiedtheutilityoftheFDcalculationinbiomedicineandmedicalimaging.LopesandBetrounipresentin[7]aninterestingreviewofthemostrelevantmethodstocalculatethefractaldimensionandtheir?Correspondingauthorat:Dpt.ComputerSciences,UniversityofJaén,CampusLasLagunillass/n,23071Jaén,Spain..: E-mailaddress:demiras@ujaen.es(J.Ruizde0169-2607/$–seefrontmatter?2012Elsevier .All computermethodsandprogramsinbiomedicine108(2012)applicationinthe?eldofmedicalimageysis.Inmoredetail,in[8,9],anysisofthecomplexityofthefetalcorti-calsurfaceisperformedbycalculatingthe3DFD.In[5,10,11]thecerebralstructureofwhitematteranditsdegenerationwithageingis?ed,thanksagaintothe3DFD.Wuetal.larstructurecouldbe?edby3DFDcalculation.Finally,in[13,14]3DFDisusedtoyzethegraymatterinmulti-plesclerosisandtodetectingchangesinpreterminfantswithandwithoutintrauterinegrowthrestriction,respectively.In[15]aprogramforcalculatingthe3DFDofMRIdataispre-sented.Theyexposedtheneedtooptimizethissoftwareintermsofexecutiontimebyacceleratingthe3Dbox-countingalgorithm.Duetotheincreasingamountandsizeofdata,forexampleinthe?eldofmedicalimaging,thistypeofimageprocessingalgorithmhastobeoptimized.Recently,theuseoftheGraphicProcessingUnit(GPU)forgeneralpurposes(GPGPU)hasgrownexponentially,prin-cipallyduetotheappearanceofnewarchitecturesandprogrammingmodels,suchasNVIDIACUDA[16]orKhronosOpenCL[17].TheseprogrammingmodelsdonotrequireanypreviousknowledgenoronprogrammingGPU,neitheronthegraphicsvisualizationpipeline,andusesimpleextensionsofwellknownlanguageslikeC.GPUcomputing,andespeciallyNVIDIACUDA,hastakenanimportantroleinresolvingcom-plexproblemsinbiomedicine,asrecentlyexpoundedin[18].Asforrecentexamples,[19,20]presentedstudiesforusingtheGPUparallelcapabilitytoyzefunctionalmagneticresonanceimaging(fMRI)data;in[21]theauthorsusetheGPUtoef?cientlyperformelasticregistrationofintra-patientbiomedicalimages.In[22]aGPUskeletonizationalgorithmispresented.Thisalgorithmhasmanybiomedicalapplications;seeforexample[18].Somestudieshavefocusedonimprovingthebox-algorithm.Therearetwowidelyusedalgorithmsforper-formingbox-counting.The?rstalgorithmwasdevelopedbyLiebotichetal.[23]andwaslaterimprovedbyHouetal.[24].SarkarandChaudhuridevelopedthesecondalgo-rithm[25],knownasdifferentialbox-counting(DBC).Thesealgorithms,althoughef?cient,onlyuseonethreadintheirexecution.Recently,Tzengetal.improvedandparallelizedaDBCbox-countingalgorithmbyusingamulti-coreCPU[26].However,toourknowledge,therearenostudiesexploitingthehighlyparallelcomputingcapabilityofthecurrentGPUsinordertoachieveaperformanceimprovementforthesealgo-Anotherwidelyusedalgorithminmedicalimagevisual-izationistheMarchingCubesalgorithm[27].Thisalgorithmfromaregulargridwhereeachvertexisassociatedwithsomescalarvalue.Thenumberofgridcubeswhosevaluesatthever-intersectedbytheiso-surface.Thus,bysubsamplingthegridwithcubesoflargersizes,thebox-countingoftheiso-surfacecanalsobeobtained.However,afundamentaldisadvantageofdirectlyusingthistechniquetoobtainthebox-countingistheadditionalcomputingcostfortheysisofthevalues
theverticescontainedinthesubsampledcubes,alsorequir-ingcountingthevoxelsoncethecubesareclassi?edasblack,grayorwhite.Intherestofthepaperwe?rstdescribethemainhardwareandsoftwareaspectsofGPUprocessingandCUDA.Thenwedescribethebox-countingmethodthatweparallelizeontheGPU.Next,weshowthekernelsofoursoftware,detailinghowwehaveoptimizedeachofthem.InSection4,weyzethespeedupandef?ciencyofourimplementations.WealsotestitsvalidityforcalculatingtheFD.Withtheproposaloftestingourfastbox-countingalgorithminabiomedicalcontext,a3DFDysisofthreebraintissueshasbeenperformed.Theresultsare?nallysummarizedinSection6.ComputationalmethodsandInthissection,we?rstsummarizethecharacteristicsoftheCUDAprogrammingmodel.Second,wedescribethebox-countingalgorithmthatwehaveparallelized,discussingwhythisalgorithmwaschosenforparallelizationonGPU.Wealsoexinhowtheselectedbox-countingalgorithmworkswhenexecutedonaCPU.CUDAprogrammingAspreviouslymentioned,themostcommonGPGPUpro-grammingmodelsareNVIDIACUDAandKhronosOpenCL.Currently,andmainlyduetoCUDAbeingamoreestablishedtechnology,thereisahighernumberofwell-optimizedAPIsandlibraries(whichimplementfunctionssuchas:parallelreductions,pre?x-sums,sortingalgorithms...)designedforCUDAthanforOpenCL.So,withtheaimofaccessingtheseexistingwell-optimizedmodulesandfunctions,thusobtain-ingthefastestalgorithmaspossible,wechoseNVIDIACUDAfordeveloourTheCUDAprogrammingmodel[16]allowsthemertosimultaneouslylaunchthousandsofGPUthreads.Eachthreadexecutesthesamefunction,whichiscalledkernel,onadataset.Itisonlynecessarytoprogramwithasupportedtradi-tionallanguage(usuallyC/C++),usingtheCUDAAPI[28]toharnesstheGPUresources.InadditiontotheispresentinallCUDA-capableGPUs.EachCUDAdevicehasseveralMultiProcessors(MPs),eachconsistingofsomeStreamingProcessors(SPs)withSIMD(SimpleInstruction–MultipleData)architec-ture.Theseprocessorsareresponsibleforexecutingallthethreadsinaparallelway.TheCUDAmemoryarchitectureisexplicitlymanagedbytheprogrammer.EachdevicehasalargeamountofslowGlobalMemory(writablebyboththeCPUandtheGPU),aglobalread-only(intheGPUside)ConstantMemory,andaspecialTextureMemory.Somesmall-sizeandfast-accessmemorymodules,calledSharedMemory,arepresentineachMP.TheSPscanalsomanageseveral32-bitregisters.Inaddi-tion,sincethearrivalofGF100architecture[29],arealcachehierarchyhasbeenintroduced.
Allthreadsareorganizedintoseverallevels.Individualthreadsaregroupedin1-,2-or3-dimensionalthread-blocks.Thread-blocksaregroupedinamono-orbi-dimensionalgrid,andalso,onlyonGF100GPUs,inthree-dimensionalgrids.Eachthread-blockhasalimitof1024threads(onGF100GPUs),andeachgridsupportsupto65,535blocksineachdimen-sion.Therefore,thousandsofthreadscouldbesimultaneouslycomputermethodsandprogramsinbiomedicine108(2012) launchedandexecutedontheGPU.Eachthread-blockisassignedtoanMP,soitssharedmemoryisonlyaccessibleforthreadswithinthisthread-block.Residentthreadsaresimul-taneouslyexecutedineachSPingroupsof32,calledwarps.Awarpistheminimumschedulingunit.Thedeviceexecutesaninstructionforallthreadswithinawarpbeforelaunchingthenextinstruction.ThisisknownasSingleInstructionMulti-pleThreads(SIMT).Threadswithinawarpphysicallyexecutethesameinstructionineachcycle.Thebestperformanceisachievedwhenthenumberofthreadsinathread-blockisamultipleofthewarpsize[30].Thisorganizationphilosophy,alongwiththeCUDAsched-ulerexecution,allowsahighscalabilitylevel,becausethread-blocksareautomaticallyassignedtoidleMPs,indepen-dentlyofcompiledprogramcanbeexecutedondifferentandheterogeneousCUDAcapabledevices.CUDAoffersspe-cialbarrierinstructionsinordertoestablishsynchronizationpointsbetweenthreadsinthesameblock.However,globalmemoryisneededtosharedatabetweenthread-blocksorsetasynchronizationpointforthreadslocatedindifferentHou’sbox-countingAspreviouslymentioned,therearetwodifferentbox-countingalgorithmsthatarewidelyusedinliture.Liebotich’salgo-rithm[23],improvedin[24]byHouetal.,hasatimecomplexityofO(N*ln(N)).Thisisthebox-countingmethodthatweuseinthispaperandimplementontheGPU.Theotherremark-ablebox-countingalgorithmisthedifferentialbox-counting(DBC)presentedin[25]bySarkarandChaudhuri.ThisDBCalgorithmwasalsorecentlyimprovedbyTzengetal.in[26],whoattemptedtoadjustitforanef?cientandfastexecutiononmulti-coreCPUs.TheDBCalgorithmisbasedongraylevelimages,whichfallsoutsidethescopeofourwork.Inthispaperweassume3Dvoxelizedmodels(forexample,a3Dmatrix
formedbyastackof2DMRimages),sofromnowonwewillreferonlytovoxelsand3Dboxes.Hou’sbox-countingalgorithmuniquelydeterminestheglobalpositionofavoxelthroughthebinaryrepresentationofitscoordinateswithkbits,where2kvoxelsistheedgesizeoftheboxthatcompleycoversthemodel.Gridsofboxesitivelygenerated,eachonewithadifferentedgesizeofvoxels,where0≤m≤k.Soconsideringthatnisthenumberofnon-zerovoxelswithinaboxofthegrid,thisboxwilllabelledasblackifn=(2m)3,asgrayif0<n<(2m)3,andlabelledaswhiteifn=ThedetailedstepsofHou’sbox-countingalgorithmforthe3Dcaseare(seeFig.1):Thevoxelcoordinatesx,yandz,withrange0...2k?1,Thus,eachcoordinatewillbecomposedofkForeachvoxel,itsbinarycoordinatesarethenintercalated,formingalargebitstring.Theinterca-latedbitstringofavoxelsituatedintheposition([xk?1...x2x1x0]bin,[yk?1...y2y1y0]bin,[zk?1...z2z1z0]bin)is(xk?1yk?1zk?1...x2y2z2x1y1x1x0y0z0)bin.Thus,theinterca-latedbitstringmusthaveasizeequaltok*3bits.Asresult,alistofintercalatedbitstringsisobtained,eachbitstringuniquelyidenti?esoneThelistofintercalatedbitstringsissortedaccordingtothevalueofeachintercalatedbitstring.Finally,m*3bitsontherightofeachbitstringaremaskedto0.Iftwoormoremaskedbitstringshavethesamevalue,thentheircorrespondingvoxelsbelongtothesameedgesize2mbox.Asaresultofthismaskingstep,thenumberofdistinctvaluesinthelistgivesusthenumberofboxes(blackboxes+grayboxes)ofedgesize2mnecessarytocompleycoverthemodel,thusperform-ingthebox-countingforanedgesizeof2m.Whiteboxesarethenobtainedtrivially.Wedifferentiatebetweenblackandgrayboxesusingthe?xednumberofvoxelsrequiredFig.1–Hou’sbox-countingalgorithmanditsrelatedCUDAkernels.StepDisrepresentedform= computermethodsandprogramsinbiomedicine108(2012)to?llabox.Thisprocessis tivelyrepeatedm≤beencompleted,yieldingthenumberofblack,grayandboxes,foredgesize2mwith0≤m≤k,tocoverthewhole3Dvoxelizedmodel.TheHou’sbox-countingalgorithmisgraphicallyrepre-sentedinFig.1.Inthiscase,weusea3Dmodelformedby10voxels(redvoxels)insidea3Dgridwitharesolutionofk=2voxelsofthewholegrid).Inthis?gure,weonlyrepresentthebox-countingprocessform=1.TherearesomeimplementationsofthisalgorithmonCPU,liketheoneperformedbyKrugerin[31],orarecentMAT-LABimplementationin[6].But,asmentionedthroughoutthispaper,toourknowledge,therearenoGPUimplementationsofHou’salgorithm,noranyotherbox-countingalgorithm.GPUbox-countingInthissection,wedescribehowHou’sbox-countingalgorithmhasbeenparallelizedinordertoachieveanoptimalperfor-mancewhenexecutingitonNVIDIAGPUswithCUDA.Ontheonehand,Fig.2showsthemainmodulesanddata?owofourGPUbox-countingalgorithm.ItisimportanttonotethatshadedfunctionsanddatainFig.2areexecuted(func-tions)orstored(data)ontheGPUside.Ontheotherhand,andforfurtherunderstanding,Listing1showsthesimpli?ed
mainfunctioncodeofourbox-countingalgorithm.Thisdirectlycorrespondswiththe?owchartshowedinFig.Asaninitializationstep,thedataset(a3Darrayof0’sand1’srepresentingthevoxelizedmodel)istransferredfromthehostside(RAMmemory)totheGPUside(globalmemory)throughthePCI-expressbus,asinanyGPGPUapplication(line5inListing1).OncethedataisontheGPU,kernelscanoperateonit.Forabetterunderstanding,Figs.1and3showhoweachmainkerneloperatesandtransformsthedataalongthewholeThe?rstkerneliscalledBitIntercalate.Eachthreadofthiskernelaccessesonlyonepositionofthe3Dmodel.Eachposi-tionisrepresentedasacubeinFig.1.Ifitisanemptyposition(graycubeinFig.1),thethreadstoresaprede?nedandunreachablevalue(like(2k)3,representedasin?niteinFigs.1and3forclari?cation)inthebitstringsarray.Weben-e?tfromthisoperationinthefollowingkernel.Otherwise,ifthepositioninthemodelisnotempty(redcubeinFig.1)(Forinterpretationofthereferencestocolorinthissentence,theFig.2–Data?owchartfortheparallelbox-countingreaderisreferredtothewebversionofthearticle.),eachthreadinterpretsthevoxelcoordinates(x,y,z)asbinary([xk?1...x1x0]bin,[yk?1...y1y0]bin,[zk?1...z1z0]bin).ThethreadthenintercalatesthesebitsasshowninstepSectionFinallyeachthreadstoresthe?nalintercalatedbitstringinanewglobalmemoryarray(“BitStringsArray”inFig.1).Sowhenallthethreads?nalizetheirexecution,thearrayofinter-calatedcoordinates(bitstringsarray)iscompleygenerated.RegardingtheCUDAlaunchcon?gurationoftheBitInter-calatekernel,weselecta3Dthread-blockwithsize8×88threads,whichistheum3Dsizewhichisofthewarpsize.Also,thisisthethread-blockcon?gurationthatworksbestinourcase,accordingtoourtests.Inaddi-tion,this3Dthread-blockcon?gurationdirectlycorrespondswiththe3Dcoordinatesofthemodel,whichhelptoimprovethekernelperformance.Althoughthekernelworksovera3Dmodel,weselecta2Dgridofblocksinsteadofa3Dgridbecauseweachievehigherspeedupvaluesaccordingtoourtests.Therefore,wehavetoitivelylaunchtheBitIntercalatekernelinordertoexecuteitoverall3Dmodelpositions(lines8–10inListing1).BoththenumberofitionsandthegridFig.3–CheckBoxesprocedure.Itstartsfromanarrayofmaskedbitstrings,andthenitdetectswhichboxesareblackandwhicharegray.ThissamplecorrespondswithFputermethodsandprogramsinbiomedicine108(2012) sizedirectlydependonthe3Dmodelsize,andarecalculatedontheThenextstepconsistsofsortingthebitstringsarray.Sincesortingdataisafundamentalandnecessarystepinmanyalgorithms,therearemanystudiesthatproposesortingmethodswhichtakeadvantageoftheGPUcharacteristics,toimprovetheperformance.Therefore,wehavestudiedsomecurrentparallelsortingalgorithmsinordertodeterminewhichoneisthebest,andthenwehaveincludeditasamoduleofourfastbox-countingalgorithm.In[32],Petersetal.presentahigh-performancesortingalgorithmbasedonBatcher’straditionalbitonicsort(bitonicsortisalsooneofthesamplesincludedintheCUDASDK).MorerecentlyKoloniasetal.havedesignedandimplementedanef?cientcountsort
algorithminCUDAGPUs[33].Thisisoneofthefastestsortingalgorithmswhenalargeamountofasmallrangeofvalueshastobesorted.Unfortunaythisisnotourcase,sincewehavealargeamountofdatawithahighrangeofvalues(sinceeachintercalatedbitstringisauniquevalue).Finally,Merrilletal.presentin[34]anewimplementationbasedontheradixsortalgorithm.Theyachieveaspeedupofupto3.8times,withrespecttoothercurrentGPUsortingalgorithms.ThisCUDAsortingalgorithmiscurrentlytheonethatoffersthebestper-formance,thereforewehaveuseditinourwork.TheauthorsoffertheirsortingalgorithmpackedintheThrustLibrary[35].ThislibraryisautomaticallyinstalledtogetherwiththeCUDAToolkit4.0orgreater.IfinthefutureanewandbetterGPUsortingalgorithmisdeveloped,itcouldbedirectlyintroducedintoouralgorithmtoimprovethegeneralperformance.##GENERALPROCEDURE- log2(modeledge modeledge//CopyinputdatafromCPUtocudaMemcpy(dev3DModel,3DModel,//Bitfor(griddev3DModel,dev10.//Sort-thrust::sort(devintedCoords,devintedcCoords+//Count"Non-ZeroVoxels"- size-thrust::count(devdevintedCoords+size,for(m1;m<k;//MaskBitmask size,mask//Check CheckBoxes<<<nblocks,nthreads>>>(devintedCoords,size,t,pow(2.0.masksize),dblacks,dgrays);//ThrustReduceCounters(BlackBoxesarray&GrayBoxesthrust::inclusive_scan(dblacks,ddthrust::inclusive_scan(dgrays,dd//Saveresults(BlackBoxesvalue&GrayBoxescudaMemcpy(&dBCB[m-1],&dblacks[ncounters-sizeof(int),cudaMemcpy(&dBCG[m-1],&dgrays[ncounters-sizeof(int),31.//CopyoutputdatafromGPUtocudaMemcpy(BCB,dBCB,cudaMemcpy(BCG,dBCG,Listing1.MainprocedureoftheGPUBox-Countingalgorithm computermethodsandprogramsinbiomedicine108(2012)Afterthisstep(line13inListing1),thebitstringsissortedinascendingorder,i.e.“SortedBitStringsArray”in1.Itisimportanttonotethatuselessbitstrings,corre-spondingtoemptyvoxels(representedasin?niteinFig.1),arethenmovedtotheendofthearray.Oncethearrayissorted,wehavetop
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 云南省昆明市西山區(qū)民中2024-2025學年數(shù)學高二下期末教學質(zhì)量檢測模擬試題含解析
- 西南名校2025屆物理高二下期末復習檢測試題含解析
- 浙江省金蘭教育合作組織2025年高二數(shù)學第二學期期末達標測試試題含解析
- 財產(chǎn)保全擔保合同(金融借貸合同中的資產(chǎn)保全協(xié)議)
- 節(jié)能環(huán)保車輛承運合同與綠色運輸服務管理細則
- 風險管理型柴油發(fā)電機組采購合同
- 財務總監(jiān)股權(quán)激勵勞動合同
- 股權(quán)激勵稅務籌劃與咨詢合同
- 車輛掛靠業(yè)務合作經(jīng)營合同
- 股權(quán)收購代理股權(quán)過戶執(zhí)行合同
- 2025-2030中國酸奶冰淇淋市場需求前景預測及投資效益盈利性研究報告
- 2025年高考英語應用文第09講 讀后續(xù)寫分話題萬能結(jié)尾滿分句(講義)
- 新媒體國企面試題及答案
- 寶寶改姓夫妻協(xié)議書
- 央企華潤集團杭州片區(qū)年度品牌傳播策略案
- 《社區(qū)公園》課件
- 2024年海南三亞事業(yè)單位招聘考試真題答案解析
- 互聯(lián)網(wǎng)公司民事起訴狀模板
- 科目一急救考試題及答案
- 2025閩教版英語三年級下冊單詞表
- 兩人合伙開燒烤店協(xié)議
評論
0/150
提交評論