




Shanghai University of Finance and Economics, School of Information Management and Engineering
Classification: Basic Concepts, Decision Tree, and Model Evaluation
Prof. Dr. LIU

What Is Classification?
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of a protein as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision-tree-based methods
- Rule-based methods
- Memory-based reasoning
- Neural networks
- Naive Bayes and Bayesian belief networks
- Support vector machines

Example of a Decision Tree
- The running example is a training set with attributes Refund (Yes/No), Marital Status (Single/Married/Divorced), and Taxable Income, and the class attribute Cheat.
- One tree splits on Refund, then Marital Status, then Taxable Income (< 80K). Another tree fits as well: there could be more than one tree that fits the same data!

Applying the Model to Test Data
- Start from the root of the tree. At each internal node, follow the branch that matches the test record's attribute value.
- When a leaf node is reached, assign its class label to the test record (in the example, Cheat = No for the test record Refund = No, Married, Taxable Income = 80K).
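As a concrete illustration of the traversal just described, here is a minimal Python sketch. The dict-based tree layout and the `classify` helper are my own illustrative construction; the attribute names and thresholds follow the slides' Refund / Marital Status / Taxable Income example.

```python
# Minimal decision-tree traversal for the slides' example tree:
# split on Refund, then Marital Status, then Taxable Income < 80K.
# Internal nodes are dicts; leaves are class labels for "Cheat".

TREE = {
    "attr": "Refund",
    "branches": {
        "Yes": "No",                      # leaf: Cheat = No
        "No": {
            "attr": "Marital Status",
            "branches": {
                "Married": "No",          # leaf: Cheat = No
                "Single": {"attr": "Taxable Income < 80K",
                           "branches": {True: "No", False: "Yes"}},
                "Divorced": {"attr": "Taxable Income < 80K",
                             "branches": {True: "No", False: "Yes"}},
            },
        },
    },
}

def classify(record, node=TREE):
    """Start from the root; follow the branch matching the record's value."""
    while isinstance(node, dict):
        if node["attr"] == "Taxable Income < 80K":
            key = record["Taxable Income"] < 80
        else:
            key = record[node["attr"]]
        node = node["branches"][key]
    return node

# The slides' test record: Refund = No, Married, Taxable Income = 80K
print(classify({"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}))
```

Running the last line prints `No`, matching the slides' "Assign Cheat to No" step.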
Decision Tree Induction
- Many algorithms exist: Hunt's algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues: determine how to split the records (how to specify the attribute test condition, how to determine the best split) and determine when to stop splitting.

How to Specify the Test Condition
- Depends on the attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal or Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning. For ordinal attributes the partitioning must respect the order (e.g., grouping {Small, Large} against {Medium} is not allowed).

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing/percentiles, or clustering).
  - Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. Can be more compute intensive. Splits may be binary or multi-way.

How to Determine the Best Split
- Before splitting: e.g., 10 records of class 0 and 10 records of class 1. Which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
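Hunt's recursion above can be sketched in a few lines of Python. This is an illustrative sketch, not a production learner: for brevity the attribute to split on is taken in a fixed order rather than chosen by an impurity criterion, and the helper names (`hunts`, `classify`) are my own.

```python
from collections import Counter

def hunts(records, attrs, default=None):
    """Hunt's algorithm. records is a list of (attribute-dict, label) pairs."""
    if not records:                         # Dt is empty -> leaf labeled default yd
        return default
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:  # all records in one class -> leaf
        return majority
    attr = attrs[0]                         # stub: real learners pick by Gini/entropy
    children = {}
    for v in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == v]
        children[v] = hunts(subset, attrs[1:], default=majority)
    return {"attr": attr, "branches": children}

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

data = [({"Refund": "Yes", "Status": "Single"}, "No"),
        ({"Refund": "No",  "Status": "Married"}, "No"),
        ({"Refund": "No",  "Status": "Single"}, "Yes")]
tree = hunts(data, ["Refund", "Status"])
print(classify(tree, {"Refund": "No", "Status": "Single"}))   # Yes
```

Note how the three cases of the recursion (empty subset, pure subset, mixed subset) map directly onto the three bullets above.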
- We therefore need a measure of node impurity: a non-homogeneous class distribution means a high degree of impurity; a homogeneous one means a low degree of impurity.

Measures of Node Impurity
- Gini index, entropy, and misclassification error.

How to Find the Best Split
- Compute the impurity M0 of the node before splitting and the weighted impurity M12 of the child nodes after splitting; the gain of the split is Gain = M0 - M12.

Measure of Impurity: GINI
- Gini index for a given node t: GINI(t) = 1 - Σj [p(j|t)]^2, where p(j|t) is the relative frequency of class j at node t.
- Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples of Computing GINI
- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 - 0 - 1 = 0.
- C1 = 1, C2 = 5: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
- C1 = 2, C2 = 4: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = Σi (n_i / n) GINI(i), where n_i is the number of records at child i and n is the number of records at node p.

Binary Attributes: Computing the GINI Index
- The split produces two partitions; the effect of the weighting is that larger and purer partitions are sought.
- Example: N1 = (5, 2): Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408. N2 = (1, 4): Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320. Gini(split) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371.

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather the counts for each class in the data set; use the count matrix to make decisions.
- Two-way split (find the best partition of values) or multi-way split (e.g., {Family}, {Sports}, {Luxury}).

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value; the number of possible splitting values = the number of distinct values.
- Each splitting value v has a count matrix associated with it: class counts in each of the two partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient and repeats work.
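The Gini formulas above are easy to check numerically. A small Python sketch reproducing the slides' node and split examples (the function names are mine):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from per-class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """GINI_split = sum_i (n_i / n) * GINI(i) over the k child partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Node examples from the slides:
print(round(gini([0, 6]), 3))                 # 0.0   (pure node)
print(round(gini([1, 5]), 3))                 # 0.278
print(round(gini([2, 4]), 3))                 # 0.444
# Binary split example: N1 = (5, 2), N2 = (1, 4)
print(round(gini_split([[5, 2], [1, 4]]), 3)) # 0.371
```

The weighting by n_i / n is what makes larger, purer partitions win.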
Continuous Attributes: Computing the Gini Index Efficiently
- For efficient computation, for each attribute: sort the attribute on its values, then linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index (in the Taxable Income example, the candidate positions are midpoints between sorted values, and the scan finds the minimum-Gini split).

Alternative Splitting Criteria Based on INFO
- Entropy at a given node t: Entropy(t) = -Σj p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t.
- Measures the homogeneity of a node: maximum (log2 nc) when records are equally distributed among all classes, implying the least information; minimum (0.0) when all records belong to one class, implying the most information.
- Entropy-based computations are similar to those of the GINI index.

Examples of Computing Entropy
- C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1, Entropy = -0 log2 0 - 1 log2 1 = 0.
- C1 = 1, C2 = 5: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65.
- C1 = 2, C2 = 4: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92.

Splitting Based on Information Gain
- GAIN_split = Entropy(p) - Σi (n_i / n) Entropy(i), where parent node p is split into k partitions and n_i is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on Gain Ratio
- GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = -Σi (n_i / n) log2(n_i / n) and parent node p is split into k partitions with n_i records in partition i.
- Adjusts information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5; designed to overcome the disadvantage of information gain.

Splitting Criteria Based on Classification Error
- Classification error at a node t: Error(t) = 1 - max_j p(j|t).
- Measures the misclassification error made by a node: maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information; minimum (0.0) when all records belong to one class, implying the most interesting information.
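Likewise for entropy, information gain, and gain ratio; a Python sketch reproducing the slides' numbers (function names are mine; 0 · log2 0 is treated as 0 by the usual convention):

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); 0*log2(0) treated as 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i/n) Entropy(i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    """GainRATIO = GAIN_split / SplitINFO, SplitINFO = -sum_i (n_i/n) log2(n_i/n)."""
    n = sum(parent)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return info_gain(parent, partitions) / split_info

print(round(entropy([0, 6]), 3))   # 0.0
print(round(entropy([1, 5]), 3))   # 0.65
print(round(entropy([2, 4]), 3))   # 0.918
```

A perfect binary split of a balanced (3, 3) parent into (3, 0) and (0, 3) yields GAIN = 1.0 and, since SplitINFO = 1.0 there, a gain ratio of 1.0 as well.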
Examples of Computing Error
- C1 = 0, C2 = 6: Error = 1 - max(0, 1) = 1 - 1 = 0.
- C1 = 1, C2 = 5: Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
- C1 = 2, C2 = 4: Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.

Comparison Among Splitting Criteria
- For a 2-class problem, all three measures peak when the classes are evenly mixed (p = 0.5) and vanish for a pure node.
- Misclassification error vs. Gini: for a parent node with (C1 = 7, C2 = 3), Error(parent) = 0.3 and Gini(parent) = 0.42. Splitting into N1 = (3, 0) with Gini = 1 - (3/3)^2 - 0 = 0 and N2 = (4, 3) with Gini = 1 - (4/7)^2 - (3/7)^2 = 0.489 gives Gini(children) = 3/10 × 0 + 7/10 × 0.489 ≈ 0.342: Gini improves, while the weighted misclassification error stays at 0.3.

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Decision-Tree-Based Classification
- Advantages: inexpensive to construct; fast execution; scales well for large data sets; can handle high-dimensional data; extremely fast at classifying unknown records; easy to interpret for small-sized trees; accuracy comparable to other classification techniques for many simple data sets.
- Disadvantages: cannot capture correlations among attributes; considers only axis-parallel splits.

Example: C4.5
- Simple depth-first construction; uses information gain; sorts continuous attributes at each node.
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets (needs out-of-core sorting). The software is available for download online.

Decision Rules
- A decision tree can be written as rules, e.g.: if Outlook = Rain or Outlook = Overcast then Yes; if Outlook = Sunny and Humidity = High then No; if Outlook = Sunny and Humidity = Normal then Yes.
- The number of decision rules in a classification model can be very large; rule pruning can simplify the rule set.

Practical Issues of Classification
- Underfitting and overfitting; missing values; costs of classification.

Underfitting and Overfitting: Example
- 500 circular and 500 triangular data points. Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1; triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1.
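The error-vs-Gini comparison can be verified numerically. A Python sketch of the parent (7, 3) split into (3, 0) and (4, 3) (function names are mine; note that with unrounded intermediates the children's Gini is ≈ 0.343, while the slides round the intermediate 0.489 and report 0.342):

```python
def classification_error(counts):
    """Error(t) = 1 - max_j p(j|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Parent (C1=7, C2=3) split into N1=(3,0) and N2=(4,3), as in the slides:
parent, n1, n2 = [7, 3], [3, 0], [4, 3]
err_parent = classification_error(parent)
err_split = 0.3 * classification_error(n1) + 0.7 * classification_error(n2)
gini_parent = gini(parent)
gini_split = 0.3 * gini(n1) + 0.7 * gini(n2)
print(round(err_parent, 3), round(err_split, 3))    # error unchanged: 0.3 vs 0.3
print(round(gini_parent, 2), round(gini_split, 2))  # Gini improves: 0.42 -> 0.34
```

This is exactly why Gini (or entropy) is preferred for choosing splits: misclassification error can be blind to a split that Gini rewards.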
Underfitting and Overfitting
- Underfitting: when the model is too simple, both training and test errors are large.
- Overfitting: as the tree grows more complex, training error keeps decreasing while test error starts to rise.

Overfitting Due to Noise
- The decision boundary is distorted by noise points.

Overfitting Due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- We need new ways of estimating errors.

Estimating Generalization Errors
- Re-substitution (training) errors: error on the training set, e(t).
- Generalization errors: error on the test set, e'(t).
- Methods for estimating generalization errors:
  - Optimistic approach: e'(t) = e(t).
  - Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5; total errors e'(T) = e(T) + 0.5 N, where N is the number of leaf nodes. For a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances): training error = 10/1000 = 1%, generalization error = (10 + 30 × 0.5)/1000 = 2.5%.
  - Reduced error pruning (REP): uses a validation data set to estimate the generalization error.

Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
- Therefore, one should include model complexity when evaluating a model.

Minimum Description Length (MDL)
- Cost(Model, Data) = Cost(Data | Model) + Cost(Model), where cost is the number of bits needed for encoding; search for the least costly model.
- Cost(Data | Model) encodes the misclassification errors; Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.

How to Address Overfitting: Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully grown tree.
- Typical stopping conditions for a node: stop if all instances belong to the same class.
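The pessimistic estimate is one line of arithmetic; a Python sketch of the 30-leaf example (the function name is mine):

```python
def pessimistic_error(train_errors, num_leaves, n, penalty=0.5):
    """e'(T) = (e(T) + penalty * N) / n, with a 0.5 penalty per leaf node."""
    return (train_errors + penalty * num_leaves) / n

# Slides' example: 30 leaves, 10 training errors, 1000 training instances.
print(10 / 1000)                          # training error = 0.01 (1%)
print(pessimistic_error(10, 30, 1000))    # generalization estimate = 0.025 (2.5%)
```

The per-leaf penalty is what makes a bushier tree look worse even when its raw training error is lower.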
- Stop if all the attribute values are the same.
- More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a chi-squared test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

How to Address Overfitting: Post-Pruning
- Grow the decision tree to its entirety, then trim the nodes of the decision tree in a bottom-up fashion.
- If the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of instances in the sub-tree.
- MDL can also be used for post-pruning.

Example of Post-Pruning
- Training error before splitting = 10/30; pessimistic error = (10 + 0.5)/30 = 10.5/30.
- Training error after splitting = 9/30; pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30.
- The pessimistic error increases after splitting, so the sub-tree (four leaves with class counts (8, 4), (3, 4), (4, 1), (5, 1)) is pruned and replaced by a single leaf.

Examples of Post-Pruning
- Optimistic error (Case 1: 7/20; Case 2: 5/20): don't prune for both cases.
- Pessimistic error (Case 1: (7 + 0.5 × 1)/20; Case 2: (5 + 0.5 × 1)/20, each compared with the corresponding sub-tree estimate): don't prune case 1, prune case 2.
- Reduced error pruning: the decision depends on the validation set.

Handling Missing Attribute Values
- Missing values affect decision-tree construction in three different ways: how impurity measures are computed; how to distribute an instance with a missing value to child nodes; how a test instance with a missing value is classified.

Computing Impurity Measures with a Missing Value
- Before splitting: Entropy(parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813.
- Split on Refund: Entropy(Refund = Yes) = -(0/3) log(0/3) - (3/3) log(3/3) = 0; Entropy(Refund = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183.
- Entropy(children) = 3/10 × 0 + 6/10 × 0.9183 = 0.551.
- Gain = 0.9 × (0.8813 - 0.551) = 0.297, where the factor 0.9 discounts for the one record in ten whose Refund value is missing.

Distribute Instances
- Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
- Assign the record with the missing value to the left child with weight = 3/9 and to the right child with weight = 6/9.
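The post-pruning decision above reduces to comparing two pessimistic estimates. A Python sketch of the 30-record example (the function names and the "prune when the estimate does not improve" rule are my reading of the slides):

```python
def pessimistic(errors, leaves, n, penalty=0.5):
    """Pessimistic error estimate: (errors + penalty * leaves) / n."""
    return (errors + penalty * leaves) / n

def should_prune(err_before, err_after, leaves_after, n, penalty=0.5):
    """Prune the sub-tree if splitting does not improve the pessimistic
    estimate (the collapsed sub-tree counts as a single leaf)."""
    before = pessimistic(err_before, 1, n, penalty)
    after = pessimistic(err_after, leaves_after, n, penalty)
    return after >= before

# Slides' example: 10 errors on 30 records before the split (one leaf),
# 9 errors across 4 leaves after the split.
print(pessimistic(10, 1, 30))      # (10 + 0.5)/30    = 0.35
print(pessimistic(9, 4, 30))       # (9 + 4*0.5)/30  ~= 0.367
print(should_prune(10, 9, 4, 30))  # True -> replace sub-tree by a leaf
```

Even though the split removes one training error, the four-leaf penalty outweighs it.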
Classify a New Instance
- For a test record whose Marital Status is missing, propagate it down both branches with weights: the probability that Marital Status = Married is 3.67/6.67, and the probability that Marital Status = {Single, Divorced} is 3/6.67.

Other Issues
- Data fragmentation, search strategy, expressiveness, tree replication.

Data Fragmentation
- The number of instances gets smaller as you traverse down the tree; the number of instances at the leaf nodes could be too small to make any statistically significant decision.

Search Strategy
- Finding an optimal decision tree is NP-hard.
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.

Expressiveness
- A decision tree provides an expressive representation for learning discrete-valued functions, but decision trees do not generalize well to certain types of Boolean functions.
- Example: the parity function. Class = 1 if there is an even number of Boolean attributes with truth value = True; Class = 0 if there is an odd number of Boolean attributes with truth value = True. For accurate modeling, the tree must be complete.
- Decision trees are not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time.

Decision Boundary
- The borderline between two neighboring regions of different classes is known as the decision boundary.
- The decision boundary is parallel to the axes because each test condition involves a single attribute at a time (e.g., x < 0.43, then y < 0.47).

Oblique Decision Trees
- The test condition may involve multiple attributes (e.g., x + y < 1): a more expressive representation, but finding the optimal test condition is computationally expensive.

Tree Replication
- The same subtree can appear in multiple branches of the tree.

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix (2-class case), with ACTUAL classes as rows and PREDICTED classes as columns:

  ACTUAL \ PREDICTED   Class = Yes              Class = No
  Class = Yes          a (TP, true positive)    b (FN, false negative)
  Class = No           c (FP, false positive)   d (TN, true negative)

- Most widely-used metric: Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN).

Limitation of Accuracy
- Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
- If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading because the model does not detect any Class 1 example.
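The accuracy computation and its failure mode on imbalanced data, in a short Python sketch (the function name is mine):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) = (a + d) / (a + b + c + d)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Slides' imbalanced example: 9990 Class-0 and 10 Class-1 examples.
# A model that predicts everything as Class 0 gets:
#   TP = 0 (no Class 1 detected), TN = 9990, FP = 0, FN = 10
print(accuracy(tp=0, tn=9990, fp=0, fn=10))   # 0.999 -> 99.9%, yet useless for Class 1
```

This is why imbalanced problems call for class-sensitive metrics rather than raw accuracy.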