2009 Advanced Artificial Intelligence: Syllabus, Calendar, and Lesson Plans, 10: Knowledge Discovery
Advanced Artificial Intelligence: Knowledge Discovery
2022/7/15

Overview
A knowledge discovery system built on top of a database combines methods from statistics, rough sets, fuzzy mathematics, machine learning, and expert systems to distill abstract knowledge from large volumes of data. In doing so it reveals the underlying relationships and regularities of the real world that lie behind the data, and it automates knowledge acquisition. This is a challenging research topic with broad application prospects.

Outline
  - Origins and application areas of KDD
  - Definition of KDD
  - Steps of the KDD process
  - KDD software
  - Conferences and journals in the KDD field

Evolution of Database Technology: from data management to data analysis
  - 1960s: Data collection, database creation, IMS and network DBMS.
  - 1970s: Relational data model, relational DBMS implementation.
  - 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
  - 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations: "Necessity is the Mother of Invention"
  - Data explosion problem: automated data collection tools, mature database technology, and the internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
  - We are drowning in information, but starving for knowledge! (John Naisbitt)
  - Data warehousing and data mining:
      - On-line analytical processing
      - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Important dates of data mining
  - 1989: IJCAI Workshop on KDD
      - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
  - 1991-1994: Workshops on KDD
      - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
  - 1995-1998: AAAI Int. Conf. on KDD and DM (KDD95-98)
      - Journal of Data Mining and Knowledge Discovery (1997)
  - 1998: ACM SIGKDD
  - 1999: SIGKDD99 Conf.

Knowledge discovery in databases
The term appeared in 1989. Fayyad's definition (1996): KDD is "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".

The virtuous cycle
[Figure: cycle diagram. Labels: Identify problem or opportunity; Act on knowledge; Measure effect of action; Problem; Knowledge; Strategy; Results.]

Application Areas and Opportunities
  - Marketing: segmentation, customer targeting, ...
  - Finance: investment support, portfolio management
  - Banking & Insurance: credit and policy approval
  - Security: fraud detection
  - Science and medicine: hypothesis discovery, prediction, classification, diagnosis
  - Manufacturing: process modeling, quality control, resource allocation
  - Engineering: simulation and analysis, pattern recognition, signal processing
  - Internet: smart search engines, web marketing

The KDD process
[Figure: pipeline diagram. Stages: Data consolidation; Selection and preprocessing; Data mining; Interpretation and evaluation. Intermediate products: Data sources; Warehouse; Consolidated data; Prepared data; Patterns & models; Knowledge. Example annotation: p(x)=0.02.]

What is KDD? A process!
Data mining is a major component of the KDD process: the automated discovery of patterns and the development of predictive and explanatory models.

The steps of the KDD process
  - Learning the application domain: relevant prior knowledge and goals of the application
  - Data consolidation: creating a target data set
  - Selection and preprocessing
      - Data cleaning (may take 60% of effort!)
      - Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation
  - Choosing functions of data mining: summarization, classification, regression, association, clustering
  - Choosing the mining algorithm(s)
  - Data mining: search for patterns of interest
  - Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, ...)
  - Use of discovered knowledge

Data consolidation and preparation
  - Garbage in, garbage out
  - The quality of results relates directly to the quality of the data
  - 50%-70% of KDD process effort is spent on data consolidation and preparation
  - Major justification for a corporate data warehouse

Data consolidation
From data sources to consolidated data repository.
[Figure: sources (RDBMS, legacy DBMS, flat files, external) feed data consolidation and cleansing, which fills a warehouse. Listed storage options: object/relational DBMS, multidimensional DBMS, deductive database, flat files.]

Data consolidation
  - Determine preliminary list of attributes
  - Consolidate data into working database
      - Internal and external sources
  - Eliminate or estimate missing values
  - Remove outliers (obvious exceptions)
  - Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
  - Generate a set of examples
      - choose sampling method
      - consider sample complexity
      - deal with volume bias issues
  - Reduce attribute dimensionality
      - remove redundant and/or correlating attributes
      - combine attributes (sum, multiply, difference)
  - Reduce attribute value ranges
      - group symbolic discrete values
      - quantize continuous numeric values
  - Transform data
      - de-correlate and normalize values
      - map time-series data to static representation
  - OLAP and visualization tools play a key role

Data mining tasks and methods
  - Automated exploration/discovery
      - e.g. discovering new market segments
      - clustering analysis
  - Prediction/classification
      - e.g. forecasting gross sales given current factors
      - regression, neural networks, genetic algorithms, decision trees
  - Explanation/description
      - e.g. characterizing customers by demographics and purchase history
      - decision trees, association rules
[Figure: one small illustration per task: a scatter plot on axes x1 and x2; a curve f(x) against x; a rule of the form "if age ... 35 and ... $35k then ..." (comparison operators lost in extraction).]

Automated exploration and discovery
  - Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
  - Distance-based numerical clustering
      - metric grouping of examples (K-NN)
      - graphical visualization can be used
  - Bayesian clustering
      - search for the number of classes that results in the best fit of a probability distribution to the data
      - AutoClass (NASA) is one of the best examples

Prediction and classification
  - Learning a predictive model
  - Classification of a new case/sample
  - Many methods:
      - Artificial neural networks
      - Inductive decision tree and rule systems
      - Genetic algorithms
      - Nearest neighbor clustering algorithms
      - Statistical (parametric and non-parametric)

Generalization and regression
  - The objective of learning is to achieve good generalization to new, unseen cases.
  - Generalization can be defined as a mathematical interpolation or regression over a set of training points.
  - Models can be validated with a previously unseen test set or using cross-validation methods.
[Figure: curve f(x) against x.]

Summarizing: inductive modeling = learning
  - Objective: develop a general model or hypothesis from specific examples
  - Function approximation (curve fitting)
  - Classification (concept learning, pattern recognition)
[Figure: curve f(x) against x; classes A and B on axes x1 and x2.]

Explanation and description
  - Learn a generalized hypothesis (model) from selected data
  - Description/interpretation of the model provides new knowledge
  - Methods:
      - Inductive decision tree and rule systems
      - Association rule systems
      - Link analysis

Exception/deviation detection
  - Generate a model of normal activity
  - Deviation from the model causes an alert
  - Methods:
      - Artificial neural networks
      - Inductive decision tree and rule systems
      - Statistical methods
      - Visualization tools

  - Outlier and exception data analysis
  - Time-series analysis (trend and deviation): regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis
  - Similarity-based pattern-directed analysis
  - Full vs. partial periodicity analysis
  - Other pattern-directed or statistical analysis

Are all the discovered patterns interesting?
  - A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Interestingness measures: a pattern is interesting if it is
      - easily understood by humans
      - valid on new or test data with some degree of certainty
      - potentially useful
      - novel, or validates some hypothesis that a user seeks to confirm
  - Objective vs. subjective interestingness measures
      - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
      - Subjective: based on the user's beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
  - Find all the interesting patterns: completeness.
      - Can a data mining system find all the interesting patterns?
  - Search for only interesting patterns: optimization.
      - Can a data mining system find only the interesting patterns?
  - Approaches
      - First generate all the patterns and then filter out the uninteresting ones.
      - Generate only the interesting patterns: mining query optimization.

Evaluation
  - Statistical validation and significance testing
  - Qualitative review by experts in the field
  - Pilot surveys to evaluate model accuracy
Interpretation
  - In ...
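The objective interestingness measures named in the slides, support and confidence, can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the lecture: the function names and the toy basket data are hypothetical, and it assumes the usual definitions (support is the fraction of transactions containing an itemset; confidence of A -> C is support(A and C) divided by support(A)).

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Confidence of the rule antecedent -> consequent.

    Computed as support(antecedent and consequent) / support(antecedent).
    Returns 0.0 when the antecedent never occurs.
    """
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(transactions, a)
    if sup_a == 0:
        return 0.0
    return support(transactions, both) / sup_a


# Hypothetical toy data: four market baskets.
BASKETS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    print(support(BASKETS, {"bread", "milk"}))       # 2 of 4 baskets
    print(confidence(BASKETS, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A "generate then filter" approach, in the sense of the completeness vs. optimization slide, would compute these values for every candidate rule and discard those below chosen thresholds.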
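Three of the preparation steps listed in the data consolidation and preprocessing slides (estimating missing values, normalizing values, and quantizing continuous numeric values) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the course: mean imputation, min-max scaling, and equal-width binning are swapped in as common simple choices, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence


def fill_missing_with_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Estimate missing entries (None) by the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot estimate missing values: no known entries")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Quantize continuous values into n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    ages = fill_missing_with_mean([20, None, 40, 60])
    print(ages)
    print(min_max_normalize(ages))
    print(equal_width_bins(ages, 4))
```

Equal-width binning is only one way to reduce attribute value ranges; equal-frequency bins or domain-specific cut points may suit skewed data better.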
