




Knowledge discovery & data mining
Tools, methods, and experiences

Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
EDBT2000 tutorial, Konstanz, March 2000

Contributors and acknowledgements
  The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
  The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
  In particular:
    Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
    Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
    Daniel A. KEIM, University of Halle
    Daniel Silver, CogNova Technologies
  The EDBT2000 board who accepted our tutorial proposal

Konstanz, 27-28.3.2000    EDBT2000 tutorial - Intro

Tutorial goals
  Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
  Provide a systematization of the many concepts in this area, along the following lines:
    the process
    the methods applied to paradigmatic cases
    the support environment
    the research challenges
  Important issues that will not be covered in this tutorial:
    methods: time series, exception detection, neural nets
    systems: parallel implementations

Tutorial Outline
  Introduction and basic concepts
    Motivations, applications, the KDD process, the techniques
  Deeper into DM technology
    Decision Trees and Fraud Detection
    Association Rules and Market Basket Analysis
    Clustering and Customer Segmentation
  Trends in technology
    Knowledge Discovery Support Environment
    Tools, Languages and Systems
  Research challenges

Introduction - module outline
  Motivations
  Application Areas
  KDD Decisional Context
  KDD Process
  Architecture of a KDD system
  The KDD steps in short

Evolution of Database Technology: from data management to data analysis
  1960s: Data collection, database creation, IMS and network DBMS.
  1970s: Relational data model, relational DBMS implementation.
  1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
  1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations
"Necessity is the Mother of Invention"
  Data explosion problem:
    Automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
    We are drowning in information, but starving for knowledge! (John Naisbitt)
  Data warehousing and data mining:
    On-line analytical processing
    Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Also referred to as:
  Data dredging, Data harvesting, Data archeology
A multidisciplinary field:
  Database
  Statistics
  Artificial intelligence
    Machine learning, Expert systems and Knowledge Acquisition
  Visualization methods
A rapidly emerging field

Motivations for DM
  Abundance of business and industry data
  Competitive focus - Knowledge Management
  Inexpensive, powerful computing engines
  Strong theoretical/mathematical foundations
    machine learning & logic
    statistics
    database management systems

What is DM useful for?
  [Figure labels: Marketing; Database Marketing; Data Warehousing; KDD & Data Mining]
  Increase knowledge to base decision upon. E.g., impact on marketing

The Value Chain
  Data
    Customer data
    Store data
    Demographical Data
    Geographical data
  Information
    X lives in Z
    S is Y years old
    X and S moved
    W has money in Z
  Knowledge
    A quantity Y of product A is used in region Z
    Customers of class Y use x% of C during period D
  Decision
    Promote product A in region Z.
    Mail ads to families of profile P
    Cross-sell service B to clients C

Application Areas and Opportunities
  Marketing:
    segmentation, customer targeting, ...
  Finance: investment support, portfolio management
  Banking & Insurance: credit and policy approval
  Security: fraud detection
  Science and medicine: hypothesis discovery, prediction, classification, diagnosis
  Manufacturing: process modeling, quality control, resource allocation
  Engineering: simulation and analysis, pattern recognition, signal processing
  Internet: smart search engines, web marketing

Classes of applications
  Market analysis
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
  Risk analysis
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  Fraud detection
  Text (newsgroup, email, documents) and Web analysis.

Market Analysis
  Where are the data sources for analysis?
    Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
  Target marketing
    Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
    Conversion of single to a joint bank account: marriage, etc.
  Cross-market analysis
    Associations/co-relations between product sales
    Prediction based on the association information.

Market Analysis and Management: Market Analysis (2)
  Customer profiling
    data mining can tell you what types of customers buy what products (clustering or classification).
  Identifying customer requirements
    identifying the best products for different customers
    use prediction to find what factors will attract new customers
  Provides summary information
    various multidimensional summary reports;
    statistical summary information (data central tendency and variation)

Risk Analysis
  Finance planning and asset evaluation:
    cash flow analysis and prediction
    contingent claim analysis to evaluate assets
    cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
  Resource planning:
    summarize and compare the resources and spending
  Competition:
    monitor competitors and market directions (CI: competitive intelligence).
    group customers into classes and class-based pricing procedures
    set pricing strategy in a highly competitive market

Fraud Detection
  Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
  Examples:
    auto insurance: detect a group of people who stage accidents to collect on insurance
    money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    medical insurance: detect professional patients and ring of doctors and ring of references

Fraud Detection (2)
  More examples:
  Detecting inappropriate medical treatment:
    Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
  Detecting telephone fraud:
    Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  Retail:
    Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
  Sports
    IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.
  Astronomy
    JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
  Internet Web Surf-Aid
    IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
  Watch for the PRIVACY pitfall!

What is KDD? A process!
  The selection and processing of data for:
    the identification of novel, accurate, and useful patterns, and
    the modeling of real-world phenomena.
  Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
  [Figure labels: Data Sources; Data Consolidation; Warehouse; Consolidated Data; Selection and Preprocessing; Prepared Data; Data Mining; Patterns & Models; p(x)=0.02; Interpretation and Evaluation; Knowledge]

The KDD Process: Core Problems & Approaches
  Problems:
    identification of relevant data
    representation of data
    search for valid pattern or model
  Approaches:
    top-down deduction by expert
    interactive visualization of data/models *
    bottom-up induction from data *
  [Figure labels: * Data Mining; OLAP]

The steps of the KDD process
  Learning the application domain:
    relevant prior knowledge and goals of application
  Data consolidation: Creating a target data set
  Selection and Preprocessing
    Data cleaning: (may take 60% of effort!)
    Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
  Choosing functions of data mining
    summarization, classification, regression, association, clustering.
  Choosing the mining algorithm(s)
  Data mining: search for patterns of interest
  Interpretation and evaluation: analysis of results.
    visualization, transformation, removing redundant patterns, …
  Use of discovered knowledge

The virtuous cycle
  [Figure labels: Identify Problem or Opportunity; Measure effect of Action; Act on Knowledge; Knowledge; Results; Strategy; Problem]

Applications, operations, techniques
  [Figure]

Roles in the KDD process
  [Figure]

Data mining and business intelligence
  [Figure labels, in extraction order: Increasing potential to support business decisions; End User; Business Analyst; Data Analyst; DBA; Making Decisions; Data Presentation; Visualization Techniques; Data Mining; Information Discovery; Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts; Data Sources; Paper, Files, Information Providers, Database Systems, OLTP]

Architecture of a KDD system
  [Figure labels: Graphical User Interface; Data Consolidation; Selection and Preprocessing; Data Mining; Interpretation and Evaluation; Warehouse; Knowledge; Data Sources]

A business intelligence environment
  [Figure]

Data consolidation and preparation
  Garbage in, garbage out
    The quality of results relates directly to quality of the data
    50%-70% of KDD process effort is spent on data consolidation and preparation
    Major justification for a corporate data warehouse

Data consolidation
  From data sources to consolidated data repository
  [Figure labels: RDBMS; Legacy DBMS; Flat Files; Data Consolidation and Cleansing; Warehouse; Object/Relation DBMS; Multidimensional DBMS; Deductive Database; Flat files; External]

Data consolidation
  Determine preliminary list of attributes
  Consolidate data into working database
    Internal and External sources
  Eliminate or estimate missing values
  Remove outliers (obvious exceptions)
  Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
  Generate a set of examples
    choose sampling method
    consider sample complexity
    deal with volume bias issues
  Reduce attribute dimensionality
    remove redundant and/or correlating attributes
    combine attributes (sum, multiply, difference)
  Reduce attribute value ranges
    group symbolic discrete values
    quantize continuous numeric values
  Transform data
    de-correlate and normalize values
    map time-series data to static representation
  OLAP and visualization tools play key role
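
Two of the preprocessing steps listed on the data selection and preprocessing slide, "normalize values" and "quantize continuous numeric values", can be illustrated with a minimal sketch. The slides do not prescribe a method; min-max scaling and equal-width binning are assumptions chosen here for simplicity, and the function names and the `ages` sample are hypothetical.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantize(values: List[float], n_bins: int) -> List[int]:
    """Equal-width binning: one bin index in 0..n_bins-1 per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical attribute values, for illustration only.
ages = [20, 35, 50, 65, 80]
normalized_ages = min_max_normalize(ages)  # [0.0, 0.25, 0.5, 0.75, 1.0]
age_bins = quantize(ages, 3)               # [0, 0, 1, 2, 2]
```

Equal-width bins are easy to explain but sensitive to outliers, which is one reason the consolidation step removes obvious exceptions first.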

Data mining tasks and methods
  Automated Exploration/Discovery
    e.g. discovering new market segments
    clustering analysis
  Prediction/Classification
    e.g. forecasting gross sales given current factors
    regression, neural networks, genetic algorithms, decision trees
  Explanation/Description
    e.g. characterizing customers by demographics and purchase history
    decision trees, association rules
  [Figure: plot with axes x1 and x2; plot of f(x) against x; rule "if age > 35 and income < $35k then ..."]

Automated exploration and discovery
  Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
  Distance-based numerical clustering
    metric grouping of examples (K-NN)
    graphical visualization can be used
  Bayesian clustering
    search for the number of classes which result in best fit of a probability distribution to the data
    AutoClass (NASA) one of best examples

Prediction and classification
  Learning a predictive model
  Classification of a new case/sample
  Many methods:
    Artificial neural networks
    Inductive decision tree and rule systems
    Genetic algorithms
    Nearest neighbor clustering algorithms
    Statistical (parametric, and non-parametric)

Generalization and regression
  The objective of learning is to achieve good generalization to new unseen cases.
  Generalization can be defined as a mathematical interpolation or regression over a set of training points
  Models can be validated with a previously unseen test set or using cross-validation methods
  [Figure: f(x) plotted against x]

Classification and prediction
  Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
  Use obtained model to predict some unknown or missing attribute values based on other information.
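
The generalization slide notes that models can be validated on unseen data or with cross-validation. A minimal sketch of k-fold cross-validation follows, assuming a least-squares straight line as the model and mean squared error as the score. Both choices, the function names, and the sample data are illustrative assumptions, not taken from the tutorial.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx  # assumes the training xs are not all identical
    return a, mean_y - a * mean_x


def cross_validated_mse(xs: List[float], ys: List[float], k: int) -> float:
    """Mean squared error over all points, each predicted by a model
    that was fitted without the fold containing that point."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx])
        total += sum((ys[i] - (a * xs[i] + b)) ** 2 for i in test_idx)
    return total / n


# Hypothetical data: one truly linear target, one quadratic target.
xs = [float(x) for x in range(10)]
linear_ys = [2 * x + 1 for x in xs]
quadratic_ys = [x * x for x in xs]

mse_linear = cross_validated_mse(xs, linear_ys, k=5)
mse_quadratic = cross_validated_mse(xs, quadratic_ys, k=5)
```

The held-out error stays near zero when the model family matches the data and grows when it does not, which is the sense in which cross-validation estimates generalization rather than fit to the training points.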

Summarizing: inductive modeling = learning
  Objective: Develop a general model or hypothesis from specific examples
    Function approximation (curve fitting)
    Classification (concept learning, pattern recognition)
  [Figure: classes A and B in the (x1, x2) plane; f(x) plotted against x]

Explanation and description
  Learn a generalized hypothesis (model) from selected data
  Description/Interpretation of model provides new knowledge
  Methods:
    Inductive decision tree and rule systems
    Association rule systems
    Link Analysis
    …

Exception/deviation detection
  Generate a model of normal activity
  Deviation from model causes alert
  Methods:
    Artificial neural networks
    Inductive decision tree and rule systems
    Statistical methods
    Visualization tools

Outlier and exception data analysis
  Time-series analysis (trend and deviation):
    Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
    Similarity-based pattern-directed analysis
    Full vs. partial periodicity analysis
  Other pattern-directed or statistical analysis
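
The exception/deviation detection idea above (build a model of normal activity, raise an alert on deviation) can be sketched with one of the simplest statistical methods. The model here is just a mean and standard deviation, and the three-sigma threshold is an assumed convention; the call-duration figures are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def fit_normal_model(history: List[float]) -> Tuple[float, float]:
    """Model 'normal activity' as the (mean, sample std dev) of past values."""
    return mean(history), stdev(history)


def is_deviation(value: float, model: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Alert when the value lies more than `threshold` std devs from the mean."""
    mu, sigma = model
    if sigma == 0:
        # No spread in the history: anything different counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical call durations in minutes.
normal_call_minutes = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
model = fit_normal_model(normal_call_minutes)
alerts = [m for m in [5.5, 4.2, 42.0] if is_deviation(m, model)]
```

Only the 42-minute call is flagged. Real fraud models, such as the telephone call model mentioned earlier, would combine several attributes (destination, duration, time of day) rather than one.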

Are all the discovered patterns interesting?
  A data mining system/query may generate thousands of patterns, not all of them are interesting.
  Interestingness measures:
    easily understood by humans
    valid on new or test data with some degree of certainty.
    potentially useful
    novel, or validates some hypothesis that a user seeks to confirm
  Objective vs. subjective interestingness measures
    Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    Subjective: based on user's beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
  Find all the interesting patterns: Completeness.
    Can a data mining system find all the interesting patterns?
  Search for only interesting patterns: Optimization.
    Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones.
    Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
  Evaluation
    Statistical validation and significance testing
    Qualitative review by experts in the field
    Pilot surveys to evaluate model accuracy
  Interpretation
    Inductive tree and rule models can be read directly
    Clustering results can be graphed and tabled
    Code can be automatically generated by some systems (IDTs, Regression models)

Interpretation and evaluation
  Visualization tools can be very helpful
    sensitivity analysis (I/O relationship)
    histograms of value distribution
    time-series plots and animation
    requires training and practice
  [Figure labels: Response; Velocity; Temp]

1989 IJCAI Workshop on KDD
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
1991-1994 Workshops on KDD
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
1995-1998 AAAI Int. Conf. on KDD and DM (KDD'95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD
1999 SIGKDD'99 Co
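
Returning to the objective interestingness measures named earlier (support and confidence), the sketch below computes both for an association rule over a small set of transactions. It uses the standard definitions: support is the fraction of transactions containing an itemset, and confidence is support(antecedent and consequent) divided by support(antecedent). The basket contents and function names are hypothetical.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    ante = set(antecedent)
    ante_support = support(transactions, ante)
    if ante_support == 0:
        return 0.0  # rule never applies in this data
    return support(transactions, ante | set(consequent)) / ante_support


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})            # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})   # 2 of 3 bread baskets
```

Filtering rules by minimum support and confidence corresponds to the "generate all the patterns, then filter" approach listed under completeness vs. optimization.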