Data Mining

DATA MINING- TRENDS AND CHALLENGES Seminar Talk-1 (04.02.2016)

KONGU ARTS AND SCIENCE COLLEGE, ERODE, TAMIL NADU, INDIA

Dr.B.K.Tripathy, Senior Professor School of Computer Science and Engineering VIT University, Vellore-632014,Tamil Nadu, India E-mail: [email protected]

WHY DATA MINING? • The Explosive Growth of Data: From terabytes to petabytes (1 million GB) per day pour into our computer networks • Data collection and data availability are: Through Automated data collection tools, database

systems, Web, computerized society • Major sources of abundant data: Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, …

WHY DATA MINING?  Society and everyone: news, digital cameras, YouTube • WE ARE NOW DROWNING IN DATA, BUT STARVING FOR KNOWLEDGE!

• “Necessity is the mother of invention” • Data mining—Automated analysis of massive data sets

EVOLUTION OF SCIENCES • Before 1600: Empirical Science • 1600-1950s: Theoretical Science – Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. • 1950s-1990s: Computational Science – Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) – Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

EVOLUTION OF SCIENCES • 1990s-Now: Data science • The flood of data from new scientific instruments and simulations • The ability to economically store and manage petabytes of data online • The Internet and computing Grid that makes all these archives universally accessible • Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! • Source: Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

EVOLUTION OF DATABASE TECHNOLOGY • 1960s:Data collection, database creation, IMS and network DBMS • 1970s: Relational data model, relational DBMS implementation • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: Data mining, data warehousing, multimedia databases, and Web databases • 2000s: Stream data management and mining, Data mining and its applications, Web technology (XML, data integration) and global information systems

WHAT IS DATA MINING? • Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data – Data mining: a misnomer? • ALTERNATE NAMES – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything “data mining”? – Simple search and query processing – (Deductive) expert systems

KDD STEPS

DATA MINING AND BUSINESS INTELLIGENCE a

Increasing potential to support business decisions

End User

Decision Making Data Presentation Visualization Techniques

Business Analyst

Data Mining Information Discovery

Data Analyst

Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems

DBA

DATA MINING: CONFLUENCE (AGGREGATION) OF MULTIPLE DISCIPLINES b

DATABASE TECHNOLOGY

MACHINE LEARNING PATTERN RECOGNITION

STATISTICS

DATA MINING

ALGORITHMS

VISUALISATION

OTHER DISCIPLINES

WHY NOT TRADITIONAL DATA ANALYSIS? • Tremendous amount of data: Algorithms must be highly scalable to handle such as terabytes or petabytes of data • High-dimensionality of data: Micro-array may have tens of thousands of dimensions • High complexity of data: – Data streams and sensor data – Time-series data, temporal data, sequence data – Structure data, graphs, social networks and multi-linked data – Heterogeneous databases and legacy databases – Spatial, spatiotemporal, multimedia, text and Web data – Software programs, scientific simulations • New and sophisticated applications

MULTI-DIMENSIONAL VIEW OF DATA MINING • Data to be mined : Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW • Knowledge to be mined: Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. – Multiple/integrated functions and mining at multiple levels • Techniques utilized: Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. • Applications adapted: Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

DATA MINING ON WHAT KIND OF DATA • Database-oriented data sets and applications  Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications  Data streams and sensor data  Time-series data, temporal data, sequential data (including Biosequences)  Structure data, graphs, social networks and multi-linked data  Object-relational databases  Heterogeneous databases and legacy databases  Spatial data and spatiotemporal data

 Multimedia database  Text databases and WWW data

MAJOR ISSUES IN DATA MINING • Mining methodology: – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web – Performance: efficiency, effectiveness, and scalability – Pattern evaluation: the interestingness problem – Incorporation of background knowledge – Handling noise and incomplete data – Parallel, distributed and incremental mining methods – Integration of the discovered knowledge with existing one: knowledge fusion

MAJOR ISSUES IN DATA MINING • User interaction:

– Data mining query languages and ad-hoc mining – Expression and visualization of data mining results – Interactive mining of knowledge at multiple levels of abstraction • Applications and social impacts – Domain-specific data mining – Protection of data security, integrity, and privacy

ARE ALL THE DISCOVERED PATTERNS INTERESTING? • Data mining may generate thousands of patterns: Not all of them are interesting •

Suggested approach: Human-centered, query-based, focused mining

• Interestingness measures  easily understood by humans  valid on new or test data with some degree of certainty  potentially useful

 novel or  validates some hypothesis that a user seeks to confirm

ARE ALL THE DISCOVERED PATTERNS INTERESTING? CONTD… • Objective vs. subjective interestingness measures

 Objective: Based on statistics and structures of patterns, e.g.,  Support

Confidence, etc.  Subjective: Based on user’s belief in the data, e.g., unexpectedness Novelty  Action ability etc.

FIND ALL AND ONLY INTERESTING PATTERNS? • Find all the interesting patterns:



Leads to Completeness



Can a data mining system find all the interesting patterns?



Do we need to find all of the interesting patterns?

Heuristic vs. exhaustive search Association vs. classification vs. clustering

FIND ALL AND ONLY INTERESTING PATTERNS? • Search for only interesting patterns

An optimization problem •

Can a data mining system find only the interesting patterns?

•

Approaches

First general all the patterns and then filter out the uninteresting ones Generate only the interesting patterns—mining query optimization

OTHER PATTERN MINING ISSUES • Precise patterns vs. approximate patterns

 Association and correlation mining: possible find sets of precise patterns But approximate patterns can be more compact and sufficient

How to find high quality approximate patterns?  Gene sequence mining: approximate patterns are inherent

How to derive efficient approximate pattern mining algorithms?

OTHER PATTERN MINING ISSUES • Constrained vs. non-constrained patterns

 Why constraint-based mining?  What are the possible kinds of constraints?  How to push constraints into the mining process?

DATA MINING TERNDS • Data Mining tasks, diversity of data and data mining approaches pose many challenging research issues in data mining • Requirements: (For researchers, system and application developers) • Development of efficient and effective data mining methods • Data Mining systems and services • Integrated and interactive data mining environments • Use of data mining techniques to solve large or sophisticated application problems

DATA MINING TERNDS - APPLICATION EXPLORATION • Exploration of data mining for business continues to expand as e-commerce and e-marketing become mainstream in the retail industry • Exploration of applications in areas like web and text analysis, financial analysis, industry, government, biomedicine and science • Genetic datamining systems dealing with application specific data mining systems and tools have limitations

DATA MINING TERNDS – SCALABLE AND INTERACTIVE DATA MINING METHODS • Data mining must be able to handle huge amount of data efficiently and if possible interactively • As amount of data collected increases rapidly, scalable algorithms for individual and integrated data mining functions has become essential • Users should be provided added control by allowing the specification and use of constraints to guide data mining systems in their search for interesting patterns and knoledge

DATA MINING TERNDS- INTEGRATION OF DATA MINING WITH OTHER SYSTEMS • Integration with other mainstream information processing and computing systems: • Search engines • Database systems • Data warehousing systems • Cloud Computing systems • It should be ensured that data mining acts as an essential data analysis component that can be smoothly integrated into such an information processing environment • The integration should be seamless, uniform framework • LEADING TO: data mining portability, scalability, high performance

DATA MINING TERNDS- MINING SOCIAL AND INFORMATION NETWORKS • Such networks ubiquitous and complex • This leads to development of scalable and effective knowledge discovery methods • Applications for large numbers of network data is essential • The networks under this are: • Social networks, Information networks, Biological networks, Bioinformatics • Major themes are: • Graph pattern mining, Data cleaning, integration and validation by network analysis, clustering and classification of graphs and networks, clustering, ranking and classification of heterogeneous networks

DATA MINING TERNDS- MINING ADVANCED SYSTEMS • • • • • • • • • •

These systems are: Spatiotemporal systems Moving objects Cyber-physical systems These data are mounting because of the use of Cellular phones GPS Sensors Other wireless equipment Requirements are: Realising real-time and knowledge discovery with such data

effective

DATA MINING TERNDS- MINING MULTIMEDIA AND WEB DATA • • • • • • •

It is one of the most recent focus of data mining research These databases store large number of multimedia objects: Image data Video data Audio data Sequence data: Hypertext data- text, text mark-ups, and linkages

DATA MINING TERNDS- WEB DATA • • • • • • • •

Web Data: Global information centre for news Advertisements Consumer Information Financial Management Education Government E-Commerce

DATA MINING TERNDS- BIOLOGICAL AND BIOMEDICAL DATA • • • • • • • • • • •

These data are combination of: Complexity, Richness, Size As a consequence these data require special attention Applications like DNA and protein sequences High dimensional microarray data Biological pathway Network analysis Mining biological literature Heterogeneous biological data Information integration of biological data by data mining

DATA MINING TERNDS- S/W ENGINEERING AND SYSTEM ENGINEERING • Software systems and programs are becoming bulky in size, complex as a result of integration of different components • Ensuring S/W robustness and reliability is a part of data mining • Through data mining methodologies for S/W or System debugging enhance their robustness

DATA MINING TERNDS- VISUAL AND AUDIO DATA MINING • Integration of humans with audio and visual systems to discover knowledge • It is likely to enhance human participation for effective and efficient data analysis

DATA MINING TRENDS- DISTRIBUTED DATA MINING • • • • • • • • •

Presence of distributed computing environments like Internet Intranet LANs High Speed Wireless Networks Sensor Networks Cloud Computing LEADS FOR DEVELOPMENT OF Distributed data mining techniques

DATA MINING TRENDS- REAL TIME DATA STREAM MINING • • • • • • • •

Stream data comprises of E-Commerce Web mining Stock Analysis Intrusion Detection Mobile data mining Data for Counterterrorism ADDITIONAL RESEARCH IS NEEDED IN THIS DIRECTION

DATA MINING TRENDS-PRIVACY PROTECTION AND INFORMATION SECURITY INDATA MINING • Enough of personal and confidential information are available in electronic form • Powerful data mining tools when applied may lead to disclosure of these information • This creates a threat to privacy and security • Database anonymization techniques are to be applied to secure these information before applying data mining tools, or • THE DATA MINING TOOLS ARE TO BE MODIFIED TO ENSURE THESE PROTECTIONS

A BRIEF HISTORY OF DATA MINING SOCIETY • 1989 IJCAI Workshop on Knowledge Discovery in Databases

– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) – Journal of Data Mining and Knowledge Discovery (1997)

A BRIEF HISTORY OF DATA MINING SOCIETY • ACM SIGKDD Explorations

conferences

since

1998

and

SIGKDD

• More conferences on data mining – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. • ACM Transactions on KDD starting in 2007

CONFERENCES AND JOURNALS ON DATA MINING • KDD Conferences – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) – SIAM Data Mining Conf. (SDM) – (IEEE) Int. Conf. on Data Mining (ICDM) – Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD) – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)  Other related conferences  ACM SIGMOD  VLDB

CONFERENCES AND JOURNALS ON DATA MINING (IEEE) ICDE  WWW, SIGIR  ICML, CVPR, NIPS Journals  Data Mining and Knowledge Discovery (DAMI or DMKD)  IEEE Trans. On Knowledge and Data Eng. (TKDE)  KDD Explorations  ACM Trans. on KDD  TDP Transactions on Data Privacy  Many Journal of other publications like Inder Science 



WHERE TO FIND REFERENCES? • Data mining and KDD (SIGKDD: CDROM) – Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. – Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD • Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM) – Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA – Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

WHERE TO FIND REFERENCES? • AI & Machine Learning – Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc. – Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc. • Web and IR – Conferences: SIGIR, WWW, CIKM, etc. – Journals: WWW: Internet and Web Information Systems,

WHERE TO FIND REFERENCES? • Statistics – Conferences: Joint Stat. Meeting, etc. – Journals: Annals of statistics, etc. • Visualization – Conference proceedings: CHI, ACM-SIG Graph, etc. – Journals: IEEE Trans. visualization and computer graphics, etc.

SOME VALUED REFERENCES 1. J. Han, M. Kamber and J. Pei: Data Mining- Concepts and Techniques, Elsevier publications, 3rd edition, (2012) 2. R. Agrawal and R. Srikant: privacy preserving Data Mining, In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data, pp. 439-450 3. C. M. Bishop: Pattern Recognition and Machine Learning, New York: Springer, (2006) 4. M. J. A. Berry and G. Linoff: Mastering Data Mining: The art and Science of Customer relationship Management, John Wiley & Sons, (1999) 5. M. J. A. Berry and G. Linoff: Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, John Wiley & Sons, (2004)

SOME VALUED REFERENCES 6. D. J. Cook and L. B. Holder: Mining Graph Data, John Wiley & Sons, (2007) 7. K. Cios, W. Pedrycz and R. Swiniarski: Data Mining Methods for Knowledge Discovery, Kluwer Academic, (1998). 8. S. Dzeroski and N. Lavrac (eds.): Relational Data Mining, New York, Springer, (2001). 9. P. Domingos and G. Hulten: Mining High Speed Data Streams, In Proc. 2000 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD ‘00), pp.71-80. 10. T. Dasu and T. Johnson: Exploratory Data Mining and Data Cleaning, John Wiley &Sons, (2003).

SOME VALUED REFERENCES • 11. W. Hsu, M. L. Lee and J. Wang: Temporal and SpatioTemporal Data Mining, IGI Publishing, (2007). • 12. H. Kargupta, J. Han, S. Yu, R. Motwani and V. Kumar: Next Generation of Data Mining, Chapman & Hall/CRC, (2008). • 13. H. Kargupta, A. Joshi, K. Sivakumar and Y. Yesha: Data Mining: Next Generation Challenges and Future Directions, Cambridge, MA: AAAI/MIT Press, (2004). • 14. H. Miller and J. Han: Geographic Data Mining and Knowledge Discovery (2nd ed.), Chapman & Hall/CRC, (2009). • 15. J. Vaidya, C. W. Clifton and Y. M. Zhu: privacy Preserving Data Mining, New York, Springer, (2010).