




Shanghai University of Finance and Economics, School of Information Management and Engineering
Classification: Basic Concepts, Decision Tree, and Model Evaluation
Prof. Dr. LIU

What Is Classification?
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of a protein as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision-tree-based methods
- Rule-based methods
- Memory-based reasoning
- Neural networks
- Naive Bayes and Bayesian belief networks
- Support vector machines

Example of a Decision Tree
- The running example is a training set with attributes Refund (Yes/No), Marital Status (Single/Married/Divorced), and Taxable Income, and the class attribute Cheat.
- One tree splits on Refund, then Marital Status, then Taxable Income (< 80K). Another tree fits as well: there could be more than one tree that fits the same data!

Applying the Model to Test Data
- Start from the root of the tree. At each internal node, follow the branch that matches the test record's attribute value.
- When a leaf node is reached, assign its class label to the test record (in the example, Cheat = No for the test record Refund = No, Married, Taxable Income = 80K).
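As a concrete illustration of the traversal just described, here is a minimal Python sketch. The dict-based tree layout and the `classify` helper are my own illustrative construction; the attribute names and thresholds follow the slides' Refund / Marital Status / Taxable Income example.

```python
# Minimal decision-tree traversal for the slides' example tree:
# split on Refund, then Marital Status, then Taxable Income < 80K.
# Internal nodes are dicts; leaves are class labels for "Cheat".

TREE = {
    "attr": "Refund",
    "branches": {
        "Yes": "No",                      # leaf: Cheat = No
        "No": {
            "attr": "Marital Status",
            "branches": {
                "Married": "No",          # leaf: Cheat = No
                "Single": {"attr": "Taxable Income < 80K",
                           "branches": {True: "No", False: "Yes"}},
                "Divorced": {"attr": "Taxable Income < 80K",
                             "branches": {True: "No", False: "Yes"}},
            },
        },
    },
}

def classify(record, node=TREE):
    """Start from the root; follow the branch matching the record's value."""
    while isinstance(node, dict):
        if node["attr"] == "Taxable Income < 80K":
            key = record["Taxable Income"] < 80
        else:
            key = record[node["attr"]]
        node = node["branches"][key]
    return node

# The slides' test record: Refund = No, Married, Taxable Income = 80K
print(classify({"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}))
```

Running the last line prints `No`, matching the slides' "Assign Cheat to No" step.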
Decision Tree Induction
- Many algorithms exist: Hunt's algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
- If Dt is an empty set, then t is a leaf node labeled by the default class yd.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues: determine how to split the records (how to specify the attribute test condition, how to determine the best split) and determine when to stop splitting.

How to Specify the Test Condition
- Depends on the attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal or Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning. For ordinal attributes the partitioning must respect the order (e.g., grouping {Small, Large} against {Medium} is not allowed).

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing/percentiles, or clustering).
  - Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. Can be more compute intensive. Splits may be binary or multi-way.

How to Determine the Best Split
- Before splitting: e.g., 10 records of class 0 and 10 records of class 1. Which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
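Hunt's recursion above can be sketched in a few lines of Python. This is an illustrative sketch, not a production learner: for brevity the attribute to split on is taken in a fixed order rather than chosen by an impurity criterion, and the helper names (`hunts`, `classify`) are my own.

```python
from collections import Counter

def hunts(records, attrs, default=None):
    """Hunt's algorithm. records is a list of (attribute-dict, label) pairs."""
    if not records:                         # Dt is empty -> leaf labeled default yd
        return default
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:  # all records in one class -> leaf
        return majority
    attr = attrs[0]                         # stub: real learners pick by Gini/entropy
    children = {}
    for v in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == v]
        children[v] = hunts(subset, attrs[1:], default=majority)
    return {"attr": attr, "branches": children}

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

data = [({"Refund": "Yes", "Status": "Single"}, "No"),
        ({"Refund": "No",  "Status": "Married"}, "No"),
        ({"Refund": "No",  "Status": "Single"}, "Yes")]
tree = hunts(data, ["Refund", "Status"])
print(classify(tree, {"Refund": "No", "Status": "Single"}))   # Yes
```

Note how the three cases of the recursion (empty subset, pure subset, mixed subset) map directly onto the three bullets above.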
- We therefore need a measure of node impurity: a non-homogeneous class distribution means a high degree of impurity; a homogeneous one means a low degree of impurity.

Measures of Node Impurity
- Gini index, entropy, and misclassification error.

How to Find the Best Split
- Compute the impurity M0 of the node before splitting and the weighted impurity M12 of the child nodes after splitting; the gain of the split is Gain = M0 - M12.

Measure of Impurity: GINI
- Gini index for a given node t: GINI(t) = 1 - Σj [p(j|t)]^2, where p(j|t) is the relative frequency of class j at node t.
- Maximum (1 - 1/nc) when records are equally distributed among all nc classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples of Computing GINI
- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 - 0 - 1 = 0.
- C1 = 1, C2 = 5: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
- C1 = 2, C2 = 4: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = Σi (n_i / n) GINI(i), where n_i is the number of records at child i and n is the number of records at node p.

Binary Attributes: Computing the GINI Index
- The split produces two partitions; the effect of the weighting is that larger and purer partitions are sought.
- Example: N1 = (5, 2): Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408. N2 = (1, 4): Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320. Gini(split) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371.

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather the counts for each class in the data set; use the count matrix to make decisions.
- Two-way split (find the best partition of values) or multi-way split (e.g., {Family}, {Sports}, {Luxury}).

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value; the number of possible splitting values = the number of distinct values.
- Each splitting value v has a count matrix associated with it: class counts in each of the two partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient and repeats work.
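The Gini formulas above are easy to check numerically. A small Python sketch reproducing the slides' node and split examples (the function names are mine):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from per-class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """GINI_split = sum_i (n_i / n) * GINI(i) over the k child partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Node examples from the slides:
print(round(gini([0, 6]), 3))                 # 0.0   (pure node)
print(round(gini([1, 5]), 3))                 # 0.278
print(round(gini([2, 4]), 3))                 # 0.444
# Binary split example: N1 = (5, 2), N2 = (1, 4)
print(round(gini_split([[5, 2], [1, 4]]), 3)) # 0.371
```

The weighting by n_i / n is what makes larger, purer partitions win.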
Continuous Attributes: Computing the Gini Index Efficiently
- For efficient computation, for each attribute: sort the attribute on its values, then linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index (in the Taxable Income example, the candidate positions are midpoints between sorted values, and the scan finds the minimum-Gini split).

Alternative Splitting Criteria Based on INFO
- Entropy at a given node t: Entropy(t) = -Σj p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t.
- Measures the homogeneity of a node: maximum (log2 nc) when records are equally distributed among all classes, implying the least information; minimum (0.0) when all records belong to one class, implying the most information.
- Entropy-based computations are similar to those of the GINI index.

Examples of Computing Entropy
- C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1, Entropy = -0 log2 0 - 1 log2 1 = 0.
- C1 = 1, C2 = 5: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65.
- C1 = 2, C2 = 4: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92.

Splitting Based on Information Gain
- GAIN_split = Entropy(p) - Σi (n_i / n) Entropy(i), where parent node p is split into k partitions and n_i is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on Gain Ratio
- GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = -Σi (n_i / n) log2(n_i / n) and parent node p is split into k partitions with n_i records in partition i.
- Adjusts information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5; designed to overcome the disadvantage of information gain.

Splitting Criteria Based on Classification Error
- Classification error at a node t: Error(t) = 1 - max_j p(j|t).
- Measures the misclassification error made by a node: maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information; minimum (0.0) when all records belong to one class, implying the most interesting information.
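Likewise for entropy, information gain, and gain ratio; a Python sketch reproducing the slides' numbers (function names are mine; 0 · log2 0 is treated as 0 by the usual convention):

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); 0*log2(0) treated as 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i/n) Entropy(i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    """GainRATIO = GAIN_split / SplitINFO, SplitINFO = -sum_i (n_i/n) log2(n_i/n)."""
    n = sum(parent)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return info_gain(parent, partitions) / split_info

print(round(entropy([0, 6]), 3))   # 0.0
print(round(entropy([1, 5]), 3))   # 0.65
print(round(entropy([2, 4]), 3))   # 0.918
```

A perfect binary split of a balanced (3, 3) parent into (3, 0) and (0, 3) yields GAIN = 1.0 and, since SplitINFO = 1.0 there, a gain ratio of 1.0 as well.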
Examples of Computing Error
- C1 = 0, C2 = 6: Error = 1 - max(0, 1) = 1 - 1 = 0.
- C1 = 1, C2 = 5: Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
- C1 = 2, C2 = 4: Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.

Comparison Among Splitting Criteria
- For a 2-class problem, all three measures peak when the classes are evenly mixed (p = 0.5) and vanish for a pure node.
- Misclassification error vs. Gini: for a parent node with (C1 = 7, C2 = 3), Error(parent) = 0.3 and Gini(parent) = 0.42. Splitting into N1 = (3, 0) with Gini = 1 - (3/3)^2 - 0 = 0 and N2 = (4, 3) with Gini = 1 - (4/7)^2 - (3/7)^2 = 0.489 gives Gini(children) = 3/10 × 0 + 7/10 × 0.489 ≈ 0.342: Gini improves, while the weighted misclassification error stays at 0.3.

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Decision-Tree-Based Classification
- Advantages: inexpensive to construct; fast execution; scales well for large data sets; can handle high-dimensional data; extremely fast at classifying unknown records; easy to interpret for small-sized trees; accuracy comparable to other classification techniques for many simple data sets.
- Disadvantages: cannot capture correlations among attributes; considers only axis-parallel splits.

Example: C4.5
- Simple depth-first construction; uses information gain; sorts continuous attributes at each node.
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets (needs out-of-core sorting). The software is available for download online.

Decision Rules
- A decision tree can be written as rules, e.g.: if Outlook = Rain or Outlook = Overcast then Yes; if Outlook = Sunny and Humidity = High then No; if Outlook = Sunny and Humidity = Normal then Yes.
- The number of decision rules in a classification model can be very large; rule pruning can simplify the rule set.

Practical Issues of Classification
- Underfitting and overfitting; missing values; costs of classification.

Underfitting and Overfitting: Example
- 500 circular and 500 triangular data points. Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1; triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1.
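The error-vs-Gini comparison can be verified numerically. A Python sketch of the parent (7, 3) split into (3, 0) and (4, 3) (function names are mine; note that with unrounded intermediates the children's Gini is ≈ 0.343, while the slides round the intermediate 0.489 and report 0.342):

```python
def classification_error(counts):
    """Error(t) = 1 - max_j p(j|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Parent (C1=7, C2=3) split into N1=(3,0) and N2=(4,3), as in the slides:
parent, n1, n2 = [7, 3], [3, 0], [4, 3]
err_parent = classification_error(parent)
err_split = 0.3 * classification_error(n1) + 0.7 * classification_error(n2)
gini_parent = gini(parent)
gini_split = 0.3 * gini(n1) + 0.7 * gini(n2)
print(round(err_parent, 3), round(err_split, 3))    # error unchanged: 0.3 vs 0.3
print(round(gini_parent, 2), round(gini_split, 2))  # Gini improves: 0.42 -> 0.34
```

This is exactly why Gini (or entropy) is preferred for choosing splits: misclassification error can be blind to a split that Gini rewards.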
Underfitting and Overfitting
- Underfitting: when the model is too simple, both training and test errors are large.
- Overfitting: as the tree grows more complex, training error keeps decreasing while test error starts to rise.

Overfitting Due to Noise
- The decision boundary is distorted by noise points.

Overfitting Due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- We need new ways of estimating errors.

Estimating Generalization Errors
- Re-substitution (training) errors: error on the training set, e(t).
- Generalization errors: error on the test set, e'(t).
- Methods for estimating generalization errors:
  - Optimistic approach: e'(t) = e(t).
  - Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5; total errors e'(T) = e(T) + 0.5 N, where N is the number of leaf nodes. For a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances): training error = 10/1000 = 1%, generalization error = (10 + 30 × 0.5)/1000 = 2.5%.
  - Reduced error pruning (REP): uses a validation data set to estimate the generalization error.

Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
- Therefore, one should include model complexity when evaluating a model.

Minimum Description Length (MDL)
- Cost(Model, Data) = Cost(Data | Model) + Cost(Model), where cost is the number of bits needed for encoding; search for the least costly model.
- Cost(Data | Model) encodes the misclassification errors; Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.

How to Address Overfitting: Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully grown tree.
- Typical stopping conditions for a node: stop if all instances belong to the same class.
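The pessimistic estimate is one line of arithmetic; a Python sketch of the 30-leaf example (the function name is mine):

```python
def pessimistic_error(train_errors, num_leaves, n, penalty=0.5):
    """e'(T) = (e(T) + penalty * N) / n, with a 0.5 penalty per leaf node."""
    return (train_errors + penalty * num_leaves) / n

# Slides' example: 30 leaves, 10 training errors, 1000 training instances.
print(10 / 1000)                          # training error = 0.01 (1%)
print(pessimistic_error(10, 30, 1000))    # generalization estimate = 0.025 (2.5%)
```

The per-leaf penalty is what makes a bushier tree look worse even when its raw training error is lower.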
- Stop if all the attribute values are the same.
- More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a chi-squared test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

How to Address Overfitting: Post-Pruning
- Grow the decision tree to its entirety, then trim the nodes of the decision tree in a bottom-up fashion.
- If the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of instances in the sub-tree.
- MDL can also be used for post-pruning.

Example of Post-Pruning
- Training error before splitting = 10/30; pessimistic error = (10 + 0.5)/30 = 10.5/30.
- Training error after splitting = 9/30; pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30.
- The pessimistic error increases after splitting, so the sub-tree (four leaves with class counts (8, 4), (3, 4), (4, 1), (5, 1)) is pruned and replaced by a single leaf.

Examples of Post-Pruning
- Optimistic error (Case 1: 7/20; Case 2: 5/20): don't prune for both cases.
- Pessimistic error (Case 1: (7 + 0.5 × 1)/20; Case 2: (5 + 0.5 × 1)/20, each compared with the corresponding sub-tree estimate): don't prune case 1, prune case 2.
- Reduced error pruning: the decision depends on the validation set.

Handling Missing Attribute Values
- Missing values affect decision-tree construction in three different ways: how impurity measures are computed; how to distribute an instance with a missing value to child nodes; how a test instance with a missing value is classified.

Computing Impurity Measures with a Missing Value
- Before splitting: Entropy(parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813.
- Split on Refund: Entropy(Refund = Yes) = -(0/3) log(0/3) - (3/3) log(3/3) = 0; Entropy(Refund = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183.
- Entropy(children) = 3/10 × 0 + 6/10 × 0.9183 = 0.551.
- Gain = 0.9 × (0.8813 - 0.551) = 0.297, where the factor 0.9 discounts for the one record in ten whose Refund value is missing.

Distribute Instances
- Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
- Assign the record with the missing value to the left child with weight = 3/9 and to the right child with weight = 6/9.
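The post-pruning decision above reduces to comparing two pessimistic estimates. A Python sketch of the 30-record example (the function names and the "prune when the estimate does not improve" rule are my reading of the slides):

```python
def pessimistic(errors, leaves, n, penalty=0.5):
    """Pessimistic error estimate: (errors + penalty * leaves) / n."""
    return (errors + penalty * leaves) / n

def should_prune(err_before, err_after, leaves_after, n, penalty=0.5):
    """Prune the sub-tree if splitting does not improve the pessimistic
    estimate (the collapsed sub-tree counts as a single leaf)."""
    before = pessimistic(err_before, 1, n, penalty)
    after = pessimistic(err_after, leaves_after, n, penalty)
    return after >= before

# Slides' example: 10 errors on 30 records before the split (one leaf),
# 9 errors across 4 leaves after the split.
print(pessimistic(10, 1, 30))      # (10 + 0.5)/30    = 0.35
print(pessimistic(9, 4, 30))       # (9 + 4*0.5)/30  ~= 0.367
print(should_prune(10, 9, 4, 30))  # True -> replace sub-tree by a leaf
```

Even though the split removes one training error, the four-leaf penalty outweighs it.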
Classify a New Instance
- For a test record whose Marital Status is missing, propagate it down both branches with weights: the probability that Marital Status = Married is 3.67/6.67, and the probability that Marital Status = {Single, Divorced} is 3/6.67.

Other Issues
- Data fragmentation, search strategy, expressiveness, tree replication.

Data Fragmentation
- The number of instances gets smaller as you traverse down the tree; the number of instances at the leaf nodes could be too small to make any statistically significant decision.

Search Strategy
- Finding an optimal decision tree is NP-hard.
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.

Expressiveness
- A decision tree provides an expressive representation for learning discrete-valued functions, but decision trees do not generalize well to certain types of Boolean functions.
- Example: the parity function. Class = 1 if there is an even number of Boolean attributes with truth value = True; Class = 0 if there is an odd number of Boolean attributes with truth value = True. For accurate modeling, the tree must be complete.
- Decision trees are not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time.

Decision Boundary
- The borderline between two neighboring regions of different classes is known as the decision boundary.
- The decision boundary is parallel to the axes because each test condition involves a single attribute at a time (e.g., x < 0.43, then y < 0.47).

Oblique Decision Trees
- The test condition may involve multiple attributes (e.g., x + y < 1): a more expressive representation, but finding the optimal test condition is computationally expensive.

Tree Replication
- The same subtree can appear in multiple branches of the tree.

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix (2-class case), with ACTUAL classes as rows and PREDICTED classes as columns:

  ACTUAL \ PREDICTED   Class = Yes              Class = No
  Class = Yes          a (TP, true positive)    b (FN, false negative)
  Class = No           c (FP, false positive)   d (TN, true negative)

- Most widely-used metric: Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN).

Limitation of Accuracy
- Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
- If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading because the model does not detect any Class 1 example.
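The accuracy computation and its failure mode on imbalanced data, in a short Python sketch (the function name is mine):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) = (a + d) / (a + b + c + d)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Slides' imbalanced example: 9990 Class-0 and 10 Class-1 examples.
# A model that predicts everything as Class 0 gets:
#   TP = 0 (no Class 1 detected), TN = 9990, FP = 0, FN = 10
print(accuracy(tp=0, tn=9990, fp=0, fn=10))   # 0.999 -> 99.9%, yet useless for Class 1
```

This is why imbalanced problems call for class-sensitive metrics rather than raw accuracy.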