International Journal of Engineering Technology, Management and Applied Sciences
www.ijetmas.com October 2015, Volume 3, Issue 10, ISSN 2349-4476
Understanding Linkage between Data Mining and Statistics
Jaya Srivastava, Dr. Abhay Kumar Srivastava
Department of Computer Science, Jaipur National University, Jaipur, India
Department of Decision Sciences, Jaipuria Institute of Management, Lucknow, India
Abstract: Data Mining and Statistics, though different in origin, exist for a common purpose. Many of us are unable to understand the scope and limitations of the two disciplines and how they are interrelated [6, 8]. This paper highlights these issues and the strength of the linkage between Data Mining and Statistics [12, 13].
Keywords: “Data Mining”, “Statistics”, “knowledge extraction”, “connection”, “linkage”
Introduction
Data Mining (DM) and Statistics are two disciplines commonly used in data analysis and knowledge extraction. Statistics is a traditional branch that has evolved from applied mathematics, while Data Mining is a multidisciplinary branch that has evolved from computer science, yet both are used for the same purpose. Many techniques are common to both disciplines, and some approaches used in statistics can reduce the work of a data miner. In this paper we highlight these issues by introducing both Data Mining and Statistics and identifying the linkage and differences between the two streams.
The growth of data mining has been massive in the past decade. Its application has increased with the increase in data generation, as more and more data are captured through various means of Information Technology such as the internet. There is growing research in the area of databases with the help of data mining. Because data mining can be used in advanced data analysis and is capable of extracting valuable knowledge from large data sets, it has emerged as a new scientific and engineering discipline to meet such requirements. Data Mining is commonly described as “solving problems by analyzing data that already exists in databases”. In addition to the mining of structured and numeric data stored in data warehouses, there is now increasing interest in the mining of unstructured and non-numeric data such as text and web data.
1. Defining Data Mining
DM is a combination of computational and statistical techniques used to perform exploratory data analysis (EDA) on rather large and mostly not very well cleaned data sets (or databases). In recent times, capturing data is no longer considered a major problem; but since a huge amount of data does not by itself convey information, separating useful from non-useful data has become a major challenge.
Most modern applications can electronically process data accumulated over many years [39]. This creates a need to train data miners in statistics, or statistics graduates in data mining. Although DM has a short history, its importance is felt in various domains. It has been defined in different ways by experts. Some of the definitions of DM are as follows:
Data Mining is the analysis of large observational data sets, rather than experimental data sets, to find unsuspected relationships and to present the data in novel ways that are easy to understand and useful to the users.
DM is the extraction of hidden predictive information from large databases. [28]
DM is the process of analyzing data from different perspectives and summarizing it into useful information within a particular context. [4]
DM is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. [2]
DM is finding interesting structure (patterns, statistical models, relationships) in databases. [39]
DM is the application of statistics in the form of exploratory data analysis and predictive models to reveal patterns and trends in very large data sets. (Insightful Miner 3.0 User Guide)
DM is “learning from the data” or “turning data into information”. [24]
DM is the process of knowledge discovery in databases (KDD). [41]
In almost all definitions the focus is on analyzing large data sets (generally in an exploratory manner) and finding hidden information in databases. The extraction of information is automatic or semi-automatic, and the results are presented in a simple manner. Figure 1 shows the evolution of data mining nomenclature. It started with terms used by statisticians, such as data fishing, data dredging and data snooping. Nowadays data mining is often called Knowledge Discovery in Databases.
Figure 1: Evolution of Nomenclature in Data Mining
1.1 Major goals of data mining
The major goals of data mining are of two types:
a. Verification of the user's hypothesis
b. Discovery of new patterns that can be used for prediction and description
Data mining methods seek to discover unexpected and interesting regularities, called patterns, in the data sets presented. Statistical significance testing, also called hypothesis testing, can be applied in these scenarios to select the surprising patterns that do not appear as clearly in random data. As each pattern is tested for significance, a set of statistical hypotheses is considered simultaneously. Such multiple comparisons, in which several hypotheses are tested at once, are common in Data Mining. Prediction involves using some variables or fields in the database to forecast unknown or future values of other variables of interest. Description focuses on finding human-interpretable patterns that describe the data. Various complexities in the stored data (data interrelations) have limited the use of verification-driven data mining in decision-making. It must be complemented with discovery-driven data mining. Furthermore, in
the context of Data Mining, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications, where prediction is often the primary goal.
1.2 Popular Data Mining Methods
There are many different techniques for data mining. The technique to use is often determined by the type of data available and the type of information we are trying to obtain from the data. The most popular data mining methods in current use are classification, clustering, neural networks, association, Bayesian networks, estimation, and visualization. A very brief explanation of each method is given below:

S. No.  D.M. Technique               Description
1.      Classification               Predicting an item class
2.      Clustering                   Finding similar groups and sub-groups in data
3.      Associations                 Determining which things go together, also known as dependency modeling
4.      Description & Visualization  Depicting visual summaries in data and exploring
5.      Summarization                Describing a group
6.      Estimation                   Predicting a continuous value such as income, bank balance etc.
7.      Deviation Detection          Finding changes
8.      Link Analysis                Finding relationships
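Section 1.1 notes that when every discovered pattern is tested for significance, many hypotheses are considered simultaneously. A minimal sketch of why this matters is given below, using the Bonferroni correction as one common (and conservative) adjustment; the paper itself does not prescribe a particular correction, and the p-values are hypothetical numbers chosen only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of hypotheses that remain significant after a
    Bonferroni correction (each test is compared against alpha / m)."""
    m = len(p_values)
    threshold = alpha / m
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values for five candidate patterns (illustrative only).
p_values = [0.001, 0.012, 0.030, 0.045, 0.200]

# Testing each pattern separately at the 5% level.
naive = [i for i, p in enumerate(p_values) if p <= 0.05]

# Testing all five patterns jointly with the Bonferroni threshold 0.05 / 5.
corrected = bonferroni_significant(p_values, alpha=0.05)

print("significant without correction:", naive)
print("significant with Bonferroni:   ", corrected)
```

With these illustrative numbers, four patterns pass the uncorrected test but only one survives the correction, which is why patterns found by searching many candidates need stricter evidence than a single pre-planned test.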
2. Statistics
Statistics deals with the quantification, collection, analysis and interpretation of data, and with drawing conclusions from it. It is considered to be the oldest research stream and has become one of the branches of pure and applied mathematics. Different statisticians have defined statistics in different ways:
Statistics is both an art and a science which examines the principles and methods implemented in collecting, summarizing, analyzing and interpreting numerical data in a research field. [38]
Statistics is the branch of mathematics concerned with collection, classification, analysis, and interpretation of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability). It is subdivided into descriptive statistics and inferential statistics. [31]
The most important science in the whole world: for upon it depends the practical application of every other science and every art: the one science essential to all political and social administration, all education, and all organization based on experience, for it only gives results of our experience. [11]
Statistics is the science of counting and averages. [31]
Statistics is the science of estimates and probabilities. [31]
It is the method of judging collective, natural or social phenomena from the results obtained from the analysis or enumeration or collection of estimates. [31]
Statistics is the numerical statement of facts capable of analysis and interpretation, and the science of statistics is the study of the principles and methods applied in collecting, presenting, analyzing and interpreting numerical data in any field of inquiry.
Statistics consists of two main parts, descriptive and inferential statistics. The methodology for organizing and summarizing the data for a sample is called descriptive statistics. When these summaries are used to draw conclusions about an entire population, the methodology is called statistical inference. [1]
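The distinction between descriptive and inferential statistics drawn above can be illustrated with a short sketch. The sample values are hypothetical, and the confidence interval uses a normal approximation as a simplifying assumption; for a sample this small a t-based interval would normally be preferred.

```python
import math
import statistics

# Hypothetical sample of ten measurements (illustrative only).
sample = [52, 48, 55, 60, 47, 51, 58, 49, 53, 57]

# Descriptive statistics: organise and summarise the sample itself.
mean = statistics.mean(sample)      # central tendency
median = statistics.median(sample)  # central tendency
stdev = statistics.stdev(sample)    # dispersion (sample standard deviation)

# Inferential statistics: use the summaries to say something about the
# population. Approximate 95% confidence interval for the population mean,
# assuming a normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * stdev / math.sqrt(len(sample))
ci = (mean - half_width, mean + half_width)

print(f"mean={mean}, median={median}, stdev={stdev:.3f}")
print(f"approx. 95% CI for the population mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The first three quantities describe only the ten observed values; the interval is a statement about the unobserved population from which they are assumed to have been drawn.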
2.1 Major Approaches in Statistics

S.No.  Statistical Technique                   Description
1.     Descriptive Statistics                  Central Tendency; Dispersion; Shape (Graphical Display)
2.     Regression                              Prediction; Modeling
         -Linear
         -Logistic
         -Non Linear
3.     Correlation Analysis                    Association
         -Pearson correlation
         -Spearman correlation
4.     Probability Theory                      Prediction of the behavior of the system defined
         -Marginal
         -Union
         -Joint
         -Conditional
5.     Probability Distribution                Prediction of the behavior of the system defined
         -Discrete Probability Distribution
         -Continuous Probability Distribution
6.     Bayesian Classification                 Bayes' Theorem and Naïve Bayesian classification
7.     Estimation Theory                       Model Selection; Estimating Confidence interval and significance level; ROC Curves
8.     Analysis of Variance (ANOVA)            Test equality of more than two groups mean
9.     Factor Analysis (FA)                    Reduction of large no. of variables into some general ones, also known as Data reduction Technique
10.    Discriminate Analysis                   Predict a categorical response variable
11.    Time series analysis                    Forecasting trends and seasonality
         -Moving Average Method
         -Exponential smoothing
         -Auto regression method
12.    Quality Control Charts                  Display the spread of individual observation with reference to mean
         -Attributes Charts
         -Variable charts
13.    Principal Component Analysis            Data Reduction
14.    Canonical Correlation Analysis          Data Reduction
15.    Cluster Analysis
         -Hierarchal
         -Non Hierarchal
16.    Sampling
         -Random Sampling
         -Non Random Sampling
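Random sampling, the last technique listed in the table above, can be sketched in a few lines. The example below is only an illustration under stated assumptions: the "population" is synthetic (normally distributed balances with a hypothetical mean of 5000 and standard deviation of 1500), and the sample size of 1,000 is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for a large database column (hypothetical values).
population = [rng.gauss(5000, 1500) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the records.
sample = rng.sample(population, 1_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
relative_error = abs(sample_mean - pop_mean) / pop_mean

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"relative error:  {relative_error:.4%}")
```

The sample mean is computed from one hundredth of the records, yet under these assumptions it typically lands within about one percent of the full-population mean, since the standard error shrinks with the square root of the sample size.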
3. Why Data Mining?
With the large number of data analysis tools available in statistics, one may question the relevance of Data Mining. Several reasons can be stated in support of DM. First, as industry needs solutions to real-life problems, largely in Customer Relationship Management [23, 27], one of the most important issues is problem-solving speed: many data mining methods can deal with very large datasets efficiently, while the algorithmic complexity of statistical methods may turn out to be prohibitive for their use on very large databases. Next, the results of the analysis need to be represented in an appropriate, usually human-understandable way; apart from the analytical languages used in statistics, data mining methods also use other forms of representation, the most popular being decision trees and rule sets. Another important issue is the assumptions made about the type of data. In general, one may claim that data mining deals with all sorts of structured tabular data (e.g., non-numeric, highly unbalanced, unclean data) as well as with non-structured data (e.g., text documents, images, multimedia), and does not make assumptions about the distribution of the data. Finally, while one of the main goals of statistics is hypothesis testing, one of the main goals of data mining is the construction of hypotheses.
3.1 Nature of Data used in Statistics and Data Mining
Most statisticians are concerned with primary data analysis; that is, the data are collected with a particular question or set of questions in mind. In fact, experimental design and survey design have been developed to facilitate the efficient collection of data so as to answer the given questions. Data mining, on the other hand, is entirely concerned with secondary data analysis.
In fact, we might define data mining as the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owners. We see from this that data mining is very much an inductive exercise, as opposed to the hypothetico-deductive approach often seen as the paradigm for how modern science progresses. Statistics is more concerned with learning from data, or turning data into information, which can then be used for making rational decisions. The context of data mining encompasses statistics, but with a somewhat different emphasis. In particular, data mining involves retrospective analysis of data, i.e. if something has occurred, we try to investigate why it occurred; thus experimental design may not be very suitable in data mining [18]. Data miners are often more interested in ease of understanding and interpretation than in accuracy or predictability. Thus, there is a focus on relatively simple, interpretable models involving rules, trees, graphs, and so forth. Applications involving very large numbers of variables and vast numbers of measurements are also common in data mining. Thus computational efficiency and scalability are very important, and issues of statistical significance may be a secondary consideration.
4. Role of Statistics in Data Mining
Both Statistics and Data Mining are data-centered processes, and real-world data is always error prone due to several factors such as ultra-large size, noise, incompleteness, redundancy and dynamism. In Data Mining, data-driven techniques either rely on heuristics to guide their search through the large space of possible relations between combinations of attribute values, or adopt some kind of data-reduction method to make the algorithm more efficient. Statistics provides several algorithms which can also be used for data analysis in data mining.
For the ultra-large size and dynamic nature of data, traditional statistics provides sampling and Bayesian analysis, which can be used effectively to counter these problems [8]. As shown in Figure 2, DM can be viewed as the intersection of artificial intelligence, machine learning, and classical and advanced statistics.
4.1 Algorithms for data analysis in statistics
Computing has always been fundamental to statistics. Some of the important computational tools for data analysis rooted in classical statistics are:
efficient estimation by maximum likelihood, least squares and least absolute deviation estimation, and the EM algorithm;
analysis of variance (ANOVA, MANOVA, ANCOVA), and the analysis of repeated measurements;
nonparametric statistics;
log-linear analysis of categorical data;
linear regression analysis, generalized additive and linear models, logistic regression, survival analysis, and discriminant analysis;
frequency domain (spectrum) and time domain (ARIMA) methods for the analysis of time series;
multivariate analysis tools such as factor analysis, principal component and later independent component analyses, and cluster analysis;
density estimation, smoothing and de-noising, and classification and regression trees (decision trees);
Bayesian networks and the Markov chain Monte Carlo (MCMC) algorithm for Bayesian inference.
An overview of these topics is readily available [12].
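Two of the estimation criteria in the list above, least squares and least absolute deviation, can be contrasted in the simplest possible setting: estimating a single location value for a data set. The sketch below relies on the standard facts that the mean minimises the sum of squared deviations and the median minimises the sum of absolute deviations; the data, including the deliberately extreme last value, are hypothetical.

```python
import statistics


def sum_squared(data, c):
    """Least-squares criterion: sum of squared deviations from c."""
    return sum((x - c) ** 2 for x in data)


def sum_absolute(data, c):
    """Least-absolute-deviation criterion: sum of |x - c|."""
    return sum(abs(x - c) for x in data)


# Hypothetical data; the last value plays the role of an outlier.
data = [10, 12, 11, 13, 12, 95]

ls_estimate = statistics.mean(data)     # minimises sum_squared
lad_estimate = statistics.median(data)  # minimises sum_absolute

print("least squares estimate (mean):             ", ls_estimate)
print("least absolute deviation estimate (median):", lad_estimate)
```

The least-squares estimate is pulled far towards the outlier while the least-absolute-deviation estimate stays with the bulk of the data, which is one reason robust criteria are attractive for the poorly cleaned data sets typical of data mining.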
Figure 2: Data mining as an interdisciplinary branch
4.2 Handling massive data through sampling
While data storage has become cheaper as memory has become increasingly affordable, CPU, throughput, memory management, and network bandwidth continue to be constraints when it comes to processing large quantities of data. Business analysts and data scientists are often so overwhelmed by the sheer volume that they do not know where to start in order to convert data into information. Sampling can be used in a smart way to overcome this problem. Sampling is a very well developed area of statistics, but it is usually used in DM only at a very basic level. Exploring a representative sample is easier, more efficient, and can be as accurate as exploring the entire database. After the initial sample is explored, some preliminary models can be fitted and assessed. If the preliminary models perform well, then perhaps the data mining project can continue to the next phase. However, it is likely that the initial modeling generates additional, more specific questions, and more data exploration is required. A major benefit of sampling is the speed and efficiency of working with a smaller data table that still contains the essence of the entire database. Ideally, one uses enough data to reveal the important findings, and no more. The sufficient quantity depends mostly on the modeling technique, and that in turn depends on the problem. Sampling enables analysts to spend relatively more time fitting models and less time waiting for modeling results.
Visualization of the data and its structure, which facilitates understanding of the data and the drawing of conclusions from it, is another central theme in DM. Visualization of quantitative data as a major activity flourished in the statistics of the 19th century. Both the theory and the practice of visualizing quantitative data have changed dramatically in recent years.
To better understand a variable, univariate plots of the distribution of values are useful. To examine relationships among variables, bar charts and scatter plots (2-dimensional and 3-dimensional) are helpful. Spinning data to gain a 3-dimensional understanding of point clouds, and the use of projection pursuit, are just two examples of visualization technologies that emerged from statistics.
Data cleansing (detecting, investigating, and correcting errors, outliers, missing values, and so on) can be very time-consuming. To cleanse the entire database might be a very difficult and frustrating task. Data
augmentation (adding information to the data such as demographics, credit bureau scores, and so on), like data cleansing, will be less expensive if applied only to a sample.
4.3 Bayesian network: An important Linkage between Statistics and Data Mining
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. The model is well suited to Data Mining. When used in conjunction with statistical techniques, this graphical model has several advantages for data modeling.
1. Since the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Hence it is suited to incomplete and noisy data.
2. A Bayesian network can be used to learn causal (cause-effect) relationships, and hence can be used to gain understanding of a problem domain and to predict the consequences of intervention, especially in exploratory data analysis.
3. As the model has both causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. In a real-world modeling task, one knows the importance of prior or domain knowledge, especially when data is scarce or expensive. Indeed, some commercial systems (i.e., expert systems) can be built from prior knowledge. Thus posterior values can be derived with the help of prior knowledge.
4. Overfitting occurs when a model describes random error or noise instead of the underlying relationship. It generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data (source: Wikipedia).
Bayesian statistical methods, in conjunction with Bayesian networks, offer an efficient and principled approach for avoiding the overfitting of data.
5. Conclusion & Future Work
This paper presents several aspects of Statistics and Data Mining, aiming to provide the background and a basic understanding of the topics and the linkage between them. Statistics is observed to be very useful for verifying relationships among a few parameters when the relationships are linear. It considers every aspect, from planning the collection of data and subsequent data management to end-of-the-line activities such as drawing inferences from numerical facts, called data, and presenting results. It can be viewed as fulfilling a basic human need. Data mining, on the other hand, may be viewed as building many complex, predictive, nonlinear models which are used for predicting behavior affected by many factors. It is used to discover the hidden patterns and relationships in our data that make business decisions more accurate and realistic. Another reason why data mining has a scientific and commercial future was given by Friedman: “Every time the amount of data increases by a factor of 10, we should totally rethink how we analyze it.” [12]
What distinguishes data mining from conventional statistical data analysis is that data mining is usually done for the purpose of 'secondary analysis' aimed at finding unsuspected relationships, perhaps unrelated to the purposes for which the data were originally collected. In other words, data mining is very much an inductive exercise, as opposed to the traditional hypothetico-deductive approach of statistics. Data mining sits at the common frontiers of fields such as Information Systems (Database Management & Data Warehousing), Computer Science (Artificial Intelligence, Machine Learning & Pattern Recognition), and Statistics (Data Visualisation & Modelling) [13].
There is a strong linkage between Statistics and Data Mining because most of the basic functions in DM are covered in statistics using proper algorithms. Since DM focuses on non-parametric data, on which traditional statistics relies less, there remains considerable scope for work on the non-parametric tests of statistics that can be used in DM.
References:
Anders Hald, A history of probability and statistics and their applications before 1750, Wiley IEEE, ISBN 0471471291, (2003).
Berry, M.J.A. and Linoff, G., Data mining techniques - for marketing, sales and customer support, New York, Wiley, (1997).
Berry, M.J.A. and Linoff, G.S., Mastering Data Mining - The Art and Science of Customer Relationship Management, New York, Wiley, (2000).
Bill Palace, Data Mining: What is Data Mining?, (1996), http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Chatfield C., Data Mining, Royal Statistical Society News, Vol. 25, (1997), 1-2.
Clark Glymour et al., Statistical Themes and Lessons for Data Mining, Data Mining and Knowledge Discovery, Vol. 1, Kluwer Academic Publishers, (1997), 11-28.
Data Mining Community's Top Resource, Data Mining, and Knowledge Discovery: An Introduction, (2011), http://www.kdnuggets.com/data_mining_course/x1-intro-to-data-mining-notes.html
David J. Hand, Statistics and Data Mining: Intersecting Disciplines, ACM SIGKDD, Vol. 1, Issue 1, (1999).
David J. Hand, Data Mining: Statistics and More?, The American Statistician, Vol. 52, No. 2, (1998).
Emanuel Parzen, Data Mining, Statistical Methods Mining and History of Statistics, Interface Symposium on Computing Science and Statistics, Proceedings, ed. D. Scott, (1998).
Florence Nightingale, Statistics, (2011), http://jwilson.coe.uga.edu/emt668/EMAT6680.Folders/Brooks/6690stuff/Statistics/Statistics.htm
Friedman J.H., Data Mining and Statistics - What's the Connection, 29th Symposium on the Interface, (1998).
Ganesh, S., Data Mining: Should it be Included in the Statistics Curriculum?, The 6th International Conference on Teaching Statistics (ICOTS 6), Cape Town, South Africa, (2002).
Glymour et al., Statistical Inference and Data Mining, Communications of the ACM, Vol. 39, No. 11, (1996).
Goodman A., Kamath C. and Kumar V., Data Analysis in the Twenty-First Century, Vol. 1, No. 1, Lawrence Livermore National Laboratory (LLNL), Livermore, CA, (2008), 1-3.
Gorunescu F., Data Mining Concepts, Models and Techniques, Vol. 12, Intelligent Systems Reference Library, Springer, (2011).
Hamparsum Bozdogan et al., Statistical Data Mining and Knowledge Discovery, 2nd edition, London, (2004).
Hand, D.J., Data mining - statistics and more?, American Statistician, Vol. 52, (1998), 112-118.
Hand, D.J., Data mining: new challenges for statisticians, Proceedings of the ASC (Association for Survey Computing) International Conference, (1999), 21-26.
Hand, D.J., Statistics and data mining: intersecting disciplines, SIGKDD Explorations, 1, (1999), 16-19.
Hand, D.J., Blunt, G., Kelly, M.G. and Adams, N.M., Data mining for fun and profit, Statistical Science, 15, (2000), 111-131.
Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, Cambridge, MA: MIT Press, (2001).
Hastie, T., Tibshirani R. and Friedman J.H., Elements of statistical learning - data mining, inference and prediction, Springer Verlag, New York, (2001).
I. Krishna Murthy, Data Mining - Statistics Applications: A Key to Managerial Decision Making, article, indiastat.com, (2010).
J. Hosking, E. Pednault and M. Sudan, A Statistical Perspective on Data Mining, Future Generation Computing Systems, special issue on Data Mining, (1997).
Jure Leskovec, Data Mining: Introduction, (2010), http://www.stanford.edu/class/cs345a/slides/01-intro.pdf
Kamber M. and Han J., Data Mining: Concepts and Techniques, 2nd Edition, Elsevier Inc., USA, (2006).
Kuonen, D., Data mining and Statistics: What is the connection?, The Data Administrative Newsletter, Switzerland, (2004).
Kurt Thearling, An Introduction to Data Mining, (2010), http://www.thearling.com/text/dmwhite/dmwhite.htm
Lomax, R.G., An Introduction to Statistical Concepts for Education and Behavioral Sciences (2nd ed.), New York: Routledge, (2007).
Lovleen Kumar Grover and Rajni Mehra, The Lure of Statistics in Data Mining, Journal of Statistics Education, Volume 16, Number 1, (2008), www.amstat.org/publications/jse/v16n1/grover.html/
Math Zone, Definition of Statistics, (2011), http://www.emathzone.com/tutorials/basic-statistics/definition-ofstatistics.html
Neal Leavitt, Data Mining for the Corporate Masses?, (2011), http://www.leavcom.com/ieee_may02.htm
Robert Nisbet et al., The Handbook of Statistical Analysis and Data Mining Applications, Academic Press, ISBN: 0123747651, (2009), www.elsevierdirect.com/datamining
SAS Analytics, Statistics Definitions, http://www.businessdictionary.com/definition/statistics.html
Siva Ganesh, Data Mining: Should It Be Included In The 'Statistics' Curriculum?, ICOTS6, (2002).
SPSS Inc., SPSS Data Mining Tips, ISBN 1-56827-282-0, Printed in the U.S.A., (2005).
STATISTICA Data Analysis Software and Services, StatSoft Electronic Statistics Textbook, Data Mining Techniques, (2011), http://www.statsoft.com/textbook/data-mining-techniques/
Stephen M. Stigler, Statistics on the Table: The History of Statistical Concepts and Methods, Cambridge, Mass: Harvard University Press, (2002).
Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA: Belknap Press of Harvard University Press, (1986).
Tim Menzies and Ying Hu, Computing Practices: Data Mining for Very Busy People, (2009), http://biblioteca.universia.net/html_bura/ficha/params/title/computing-practices-data-mining-for-verybusypeople/id/47808919.html
U. Fayyad, S. Chaudhuri and P. Bradley, Data mining and its role in database systems, Proceedings of the 26th VLDB Conference, Cairo, Egypt, Morgan Kaufmann, (2000), 63-124.
Wiley InterScience, Data Analysis in the 21st Century, (2007), www.interscience.wiley.com/.
Wikipedia, Data Mining, (2011), http://en.wikipedia.org/wiki/Data_mining