Comparing Four-Selected Data Mining Software



Chapter 4.22

Richard S. Segall
Arkansas State University, USA

Qingyu Zhang
Arkansas State University, USA



This chapter discusses four selected data mining software packages that are not available as free open-source software: SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and BioDiscovery GeneSight®, each of which was provided through partnerships with our university. The packages are described and compared by their features, characteristics, and algorithms, and each is applied to a large database of forest cover types with 63,377 rows and 54 attributes. Background on related literature and software is also presented. Screen shots of each of the four packages are presented, as are future directions and conclusions.

Historical Background

Han and Kamber (2006), Kleinberg and Tardos (2005), and Fayyad et al. (1996) each provide extensive discussions of available algorithms for data mining. According to StatSoft (2006b), algorithms are operations or procedures that produce a particular outcome with a completely defined set of steps or operations. Heuristics, by contrast, are according to StatSoft (2006c) general recommendations or guides based upon theoretical reasoning or statistical evidence, such as “data mining can be a useful tool if used appropriately.”

The Data Intelligence Group (1995) defined data mining as the extraction of hidden predictive information from large databases. According to The Data Intelligence Group (1995), “data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.” Brooks (1997) describes rules-based tools as opposed to algorithms. Witten and Frank (2005) describe how data mining algorithms work, including covering algorithms and instance-based learning, and how to use WEKA, an open-source machine learning workbench for data mining.

Segall (2006) presented a chapter in the previous edition of this Encyclopedia on microarray databases for biotechnology, with an extensive background on microarray databases such as that defined by Schena (2003), who described a microarray as “an ordered array of microscopic elements in a planar substrate that allows the specific binding of genes or gene products.” The reader is referred to Segall (2006) for a more complete discussion of microarray databases, including a figure giving an overview of the microarray construction process. Piatetsky-Shapiro and Tamayo (2003) discussed the challenges of data mining specific to microarrays, Grossman et al. (1998) reported on three NSF (National Science Foundation) workshops on mining large, massive, and distributed data, and Kargupta et al. (2005) discussed the general opportunities and challenges of data mining.

Segall and Zhang (2004, 2005) presented funded proposals for research on applications of modern heuristics and data mining techniques in knowledge discovery; the results are presented in Segall and Zhang (2006a, 2006b) and in this chapter.

Software Background

A wealth of software is available today for data mining, such as that presented in American Association for Artificial Intelligence (AAAI) (2002) and Ducatelle (2006) for teaching data mining, Nisbet (2006) for CRM (Customer Relationship Management), and the software review of Deshmukah (1997). StatSoft (2006a) presents screen shots of several software packages used for exploratory data analysis (EDA) and various data mining techniques. Proxeon Bioinformatics (2006) manufactures bioinformatics software for proteomics, the study of protein and sequence information. Lazarevic et al. (2006) discussed a software system for spatial data analysis and modeling. Leung (2004) compares microarray data mining software. The National Center for Biotechnology Information (NCBI) (2006) provides tools for data mining, including tools specific to nucleotide sequence analysis, protein sequence analysis and proteomics, genome analysis, and gene expression. Lawrence Livermore National Laboratory (LLNL) (2005) describes its Center for Applied Scientific Computing (CASC), which is developing computational tools and techniques to help automate the exploration and analysis of large scientific data sets.

MAIN THRUST

Algorithms of Four-Selected Software

This chapter discusses four selected data mining software packages, chosen because their vendors generously offered their services and software to the authors at academic rates or less for use both in the classroom and in support of the two faculty summer research grants awarded as Segall and Zhang (2004, 2005).

SAS Enterprise Miner™ is a product of SAS Institute Inc. of Cary, NC and is based on the SEMMA approach: the process of Sampling (S), Exploring (E), Modifying (M), Modeling (M), and Assessing (A) large amounts of data. SAS Enterprise Miner™ uses a workspace in which data mining models are constructed by dragging and dropping icons, and provides algorithms for decision trees, regression, neural networks, cluster analysis, and association and sequence analysis.

PolyAnalyst® 5 is a product of Megaputer Intelligence, Inc. of Bloomington, IN and contains sixteen (16) advanced knowledge discovery algorithms, as described in Table 1, which was constructed from its User Manual by Megaputer Intelligence Inc. (2004; p. 163, p. 167, p. 173, p. 177, p. 186, p. 196, p. 201, p. 207, p. 214, p. 221, p. 226, p. 231, p. 235, p. 240, p. 263, p. 274).

NeuralWorks Predict® is a product of NeuralWare of Carnegie, PA. This software relies on neural networks. According to NeuralWare (2003, p. 1): “One of the many features that distinguishes Predict® from other empirical modeling and neural computing tools is that it automates much of the painstaking and time-consuming process of selecting and transforming the data needed to build a neural network.”

Table 1. Description of data mining algorithms for PolyAnalyst® 5

Data Mining Algorithm                         Underlying Algorithms
--------------------------------------------  --------------------------------------------
1. Discriminate                               (a) Fuzzy logic for classification
                                              (b) Find Laws, PolyNet Predictor, or Linear Regression
2. Find Dependencies                          ARNAVAC [See Key Terms]
3. Summary Statistics                         Common statistical analysis functions
4. Link Analysis (LA)                         Categorical, textual and Boolean attributes
5. Market and Transactional Basket Analysis   PolyAnalyst Market Basket Analysis
6. Classify                                   Same as that for Discriminate
7. Cluster                                    Localization of Anomalies Algorithm
8. Decision Forest (DF)                       Ensemble of voting decision trees
9. Decision Tree                              (a) Information Gain splitting criteria
                                              (b) Shannon information theory and statistical significance tests
10. Find Laws                                 Symbolic Knowledge Acquisition Technology (SKAT) [See Key Terms]
11. Nearest Neighbor                          PAY Algorithm
12. PolyNet Predictor                         PolyNet Predictor Neural Network
13. Stepwise Linear Regression                Stepwise Linear Regression
14. Link Terms (LT)                           Combination of Text Analysis and Link Analysis algorithms
15. Text Analysis (TA)                        Combination of Text Analysis algorithms augmented by statistical techniques
16. Text Categorization (TC)                  Text Analysis algorithm and multiple subdivision splitting of databases
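The Information Gain splitting criterion listed for the Decision Tree algorithm in Table 1 can be illustrated with a short Python sketch. It shows only the general Shannon-entropy calculation and is not PolyAnalyst®'s implementation; the function names and the toy records are hypothetical and are not taken from the forest cover type database.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one categorical attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy records (assumption: invented for illustration only).
rows = [
    {"wilderness": "A", "soil": "rocky"},
    {"wilderness": "A", "soil": "loam"},
    {"wilderness": "B", "soil": "rocky"},
    {"wilderness": "B", "soil": "loam"},
]
labels = ["Spruce-Fir", "Spruce-Fir", "Aspen", "Aspen"]

gain_wilderness = information_gain(rows, labels, "wilderness")  # separates the classes fully
gain_soil = information_gain(rows, labels, "soil")              # tells nothing about the class
```

In this toy case a tree builder that ranks attributes by information gain would split on the hypothetical `wilderness` attribute first, because it removes all uncertainty about the class while `soil` removes none.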


NeuralWorks Predict® has a direct interface with Microsoft Excel that allows the Predict® commands to be displayed and executed from a drop-down column within Microsoft Excel.

GeneSight™ is a product of BioDiscovery, Inc. of El Segundo, CA that focuses on cluster analysis using two main techniques, hierarchical and partitioning, both of which are discussed in Prakash and Hoff (2002) for data mining of microarray gene expressions.

Both SAS Enterprise Miner™ and PolyAnalyst® 5 offer more algorithms than either GeneSight™ or NeuralWorks Predict®. These two packages have algorithms for statistical analysis, neural networks, decision trees, regression analysis, cluster analysis, self-organized maps (SOM), association (e.g. market-basket) and sequence analysis, and link analysis. GeneSight™ offers mainly cluster analysis, and NeuralWorks Predict® offers mainly neural network applications, using statistical analysis and prediction to support these data mining results. PolyAnalyst® 5 is the only one of these packages that provides link analysis algorithms for both numerical and text data.
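To make the hierarchical clustering technique mentioned above concrete, the following minimal Python sketch performs agglomerative clustering on a handful of points. It assumes single linkage and the Euclidean distance; the chapter does not state which linkage the commercial package uses, and the function names and sample points are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: the distance of their closest pair of points."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, target_clusters):
    """Repeatedly merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical two-attribute records (assumption: invented for illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerate(points, target_clusters=2)
```

Stopping at a chosen number of clusters is one design choice; recording every merge instead would give the full dendrogram that hierarchical clustering tools typically display. A partitioning method, by contrast, fixes the number of clusters up front and reassigns points between them.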

Applications of the Four-Selected Software to Large Database

Each of the four selected packages has been applied to a large database of forest cover types that is available from the Machine Learning Repository at the University of California at Irvine (Newman et al., 1998), the same website that supplied the numerical abalone data and the discrete nominal-valued mushroom data for which results are shown in Segall and Zhang (2006a, 2006b). The forest cover type database consists of 63,377 records, each with 54 attributes that can be used as inputs to predictive models to support the decision-making processes of natural resource managers. The 54 columns of data are composed of 10 quantitative variables, 4 binary variables for wilderness areas, and 40 binary variables for soil types. The forest cover type classes include Spruce-Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-Fir, Krummholz, and other.

The workspace of SAS Enterprise Miner™ differs from that of the other software because it uses user-friendly icons instead of only spreadsheets of data. The workspace is constructed by dragging and dropping icons from the toolbar, which the other software discussed does not offer. Figure 1 shows a screen shot of cluster analysis for the forest cover type data using SAS Enterprise Miner™. From Figure 1 it can be seen, using a slice with a standard deviation measurement, height of frequency, and color of the radius, that this would yield only two distinct clusters: one with a normalized mean of 0 and one with a normalized mean of 1. If a different measurement, a different slice height, and a different key for color were selected, then a different cluster figure would have resulted.

In PolyAnalyst® 5.0, the input forest cover type data appear with attribute columns for elevation, aspect, slope, horizontal distance to hydrology, vertical distance to hydrology, horizontal distance to roadways, hillshade 9AM, hillshade Noon, etc. PolyAnalyst® 5.0 yielded a classification probability of 80.19% for the forest cover type database. Figure 2 shows the six significant classes of clusters corresponding to the six major cover types: Spruce-Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, and Douglas-Fir. One result that can be seen from Figure 2 is that the most significant cluster for the aspect variable is the cover type Douglas-Fir.

NeuralWare Predict® uses a Microsoft Excel spreadsheet interface for all of its input data and many of its outputs of computational results.
Our research using NeuralWare Predict® for the forest cover type data indicates an accuracy of 70.6%, with 70% of the sample used for training and 30% for testing, in 2 minutes and 26 seconds of execution time. Figure 3 is a screen shot of NeuralWare Predict® for the forest cover type data indicating that an improved accuracy of 79.2% can be obtained using 100% of the 63,376 records for both training and testing, in 3 minutes 21 seconds of execution time for evaluating the model. This was done to investigate the maximum accuracy that could be obtained using NeuralWare Predict® on the same forest cover type database, for comparison with the other selected software.

Figure 4 is a screen shot of GeneSight® software by BioDiscovery Incorporated that shows hierarchical clustering of the forest cover data set. As noted earlier, GeneSight® only performs statistical analysis and cluster analysis, and hence it yields no regression results for the forest cover data set that can be compared with those of NeuralWare Predict® and PolyAnalyst®. It should be noted from Figure 4 that the hierarchical clustering performed by GeneSight® for the forest cover type data set produced a multitude of clusters using the Euclidean distance metric.

Figure 1. SAS Enterprise Miner™ screen shot of cluster analysis

Figure 2. PolyAnalyst® 5.0 screen shot of cluster analysis

Figure 3. NeuralWare Predict® screen shot of complete neural network training (100%)

Figure 4. BioDiscovery GeneSight® screen shot of hierarchical clustering
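The two evaluation protocols reported above for NeuralWare Predict® (a 70%/30% training/test split versus using 100% of the records for both training and testing) can be contrasted with a small Python sketch. A one-nearest-neighbor classifier stands in for the neural network purely for brevity, and the records, labels, and function names are hypothetical; nothing here reproduces Predict® or the reported accuracies.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def predict_1nn(train, point):
    """Label of the training record closest to point."""
    nearest = min(train, key=lambda record: dist(record[0], point))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


# Hypothetical one-attribute records: value i, class decided by a threshold,
# with two deliberately "noisy" labels at i = 5 and i = 15.
records = []
for i in range(20):
    label = "Aspen" if i < 10 else "Spruce-Fir"
    if i in (5, 15):
        label = "Spruce-Fir" if label == "Aspen" else "Aspen"
    records.append(((float(i),), label))

# 70/30 hold-out: records with i % 10 in {2, 5, 8} are held back for testing.
test = [r for i, r in enumerate(records) if i % 10 in (2, 5, 8)]
train = [r for i, r in enumerate(records) if i % 10 not in (2, 5, 8)]

holdout_accuracy = accuracy(train, test)               # test records unseen in training
resubstitution_accuracy = accuracy(records, records)   # 100% used for both roles
```

Because every record is its own nearest neighbor when the same data serve both roles, the second figure is perfect here while the hold-out figure is lower. This is consistent with the chapter's use of the 100% run as an indication of the maximum attainable accuracy rather than as an estimate of performance on unseen records.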

FUTURE TRENDS

The four selected software packages will next be applied to a database of different dimensionality that has already been collected. The database presented in this chapter is a forest cover type data set with 63,377 records and 54 attributes. The other database is a microarray database at the genetic level for a type of human lung cancer, consisting of 12,600 records and 156 columns of gene types. Future simulations are to be performed on the human lung cancer data with each of the four selected data mining packages and their respective available algorithms, and the results compared with those obtained for the larger forest cover type database of 63,377 records and 54 attributes.


CONCLUSION

Each of the software packages selected for this research has its own unique characteristics and properties, which are displayed when it is applied to the forest cover type database. As indicated, each package has its own set of algorithm types. NeuralWare Predict® focuses on neural network algorithms, and BioDiscovery GeneSight® focuses on cluster analysis. SAS Enterprise Miner™ and Megaputer PolyAnalyst® employ the same algorithms, except that SAS has a separate package, SAS TextMiner®, for text analysis. The regression results for the forest cover type data set obtained using NeuralWare Predict® and Megaputer PolyAnalyst® are comparable. The cluster analysis results for SAS Enterprise Miner™, Megaputer PolyAnalyst®, and BioDiscovery GeneSight® are each represented in a way unique to the package. SAS Enterprise Miner™ and NeuralWare Predict® both utilize Self-Organizing Maps (SOM), while the other two do not.

The four selected packages can also be compared with respect to their cost of purchase. SAS Enterprise Miner™ is the most expensive and NeuralWare Predict® is the least expensive; Megaputer PolyAnalyst® and BioDiscovery GeneSight® are intermediate in cost. In conclusion, SAS Enterprise Miner™ and Megaputer PolyAnalyst® offer the greatest diversity of data mining algorithms.

ACKNOWLEDGMENT

The authors want to acknowledge the support provided by a 2006 Summer Faculty Research Grant awarded to them by the College of Business of Arkansas State University. The authors also want to acknowledge each of the four software manufacturers, SAS, Megaputer Intelligence, Inc., BioDiscovery, Inc., and NeuralWare, for their support of this research.

REFERENCES

AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction, Software and Data, retrieved from

Brooks, P. (1997), Data mining today, DBMS, February 1997, retrieved from http://www.dbmsmag.com/9702d16.html.

Data Intelligence Group (1995), An overview of data mining at Dun & Bradstreet, DIG White Paper 95/01, retrieved from http://www.thearling.com.text/wp9501/wp9501.htm.

Deshmukah, A. V. (1997), Software review: ModelQuest Expert 1.0, ORMS Today, December 1997, retrieved from orms/orms-12-97/software-review.html.

Ducatelle, F., Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from http://www.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996), From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1–30. Menlo Park, Calif.: AAAI Press.

Grossman, R., Kasif, S., Moore, R., Rocke, D., & Ullman, J. (1998), Data mining research: opportunities and challenges, retrieved from http://www.

Han, J. & Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, San Francisco, CA.

Kargupta, H., Joshi, A., Sivakumar, K., & Yesha, Y. (2005), Data mining: Next generation challenges and future directions, MIT/AAAI Press, retrieved from http:/// Kargupta/ngdmbook.html.

Kleinberg, J. & Tardos, E. (2005), Algorithm Design, Addison-Wesley, Boston, MA.

Lawrence Livermore National Laboratory (LLNL), The Center for Applied Scientific Computing (CASC), Scientific data mining and pattern recognition: Overview, retrieved from http://www.

Lazarevic A., Fiea T., & Obradovic, Z., A software system for spatial data analysis and modeling, retrieved from papers/lazarevic00.pdf.

Leung, Y. F. (2004), My microarray software comparison – Data mining software, September 2004, Chinese University of Hong Kong, retrieved from arraysoft mining specific.html.

Megaputer Intelligence Inc. (2004), PolyAnalyst 5 Users Manual, December 2004, Bloomington, IN 47404.

Megaputer Intelligence Inc. (2006), Machine learning algorithms, retrieved from http://www. php3.

Moore, A., Statistical data mining tutorials, retrieved from

National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health, NCBI tools for data mining, retrieved from http://www.ncbi.nlm.nih.gov/Tools/.

NeuralWare (2003), NeuralWare Predict® The complete solution for neural data modeling: Getting Started Guide for Windows, NeuralWare, Inc., Carnegie, PA 15106.

Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer Science,

Nisbet, R. A. (2006), Data mining tools: Which one is best for CRM? Part 3, DM Review, March 21, 2006, retrieved from http://www. cfm?articleId=1049954.

Piatetsky-Shapiro, G. & Tamayo, P. (2003), Microarray data mining: Facing the challenges, SIGKDD Explorations, v. 5, n. 2, pages 1-5, December, retrieved from cfm?doid=980972.980974 and http://www.broad.

Prakash, P. & Hoff, B. (2002), Microarray gene expression data mining with cluster analysis using GeneSight™, Application Note GS10, BioDiscovery, Inc., El Segundo, CA, retrieved from

Proxeon Bioinformatics, Bioinformatics software for proteomics, from proteomics data to biological sense in minutes, retrieved from http://www.

SAS® Enterprise Miner™, SAS Incorporated, Cary, NC, retrieved from technologies/analytics/datamining/miner.

Schena, M. (2003). Microarray analysis, New York, John Wiley & Sons, Inc.

Segall, R. S. (2006), Microarray databases for biotechnology, Encyclopedia of Data Warehousing and Mining, John Wang, Editor, Idea Group, Inc., pp. 734-739.

Segall, R. S. & Zhang, Q. (2004). Applications of modern heuristics and data mining techniques in knowledge discovery, funded proposal submitted to Arkansas State University College of Business Summer Research Grant Committee.

Segall, R. S. & Zhang, Q. (2005). Continuation of research on applications of modern heuristics and data mining techniques in knowledge discovery, funded proposal submitted to Arkansas State University College of Business Summer Research Grant Committee.

Segall, R. S. & Zhang, Q. (2006a). Applications of neural network and genetic algorithm data mining techniques in bioinformatics knowledge discovery – A preliminary study, Proceedings of the Thirty-seventh Annual Conference of the Southwest Decision Sciences Institute, Oklahoma City, OK, v. 37, n. 1, March 2-4, 2006.

Segall, R. S. & Zhang, Q. (2006b). Data visualization and data mining of continuous numerical and discrete nominal-valued microarray databases for biotechnology, Kybernetes: International Journal of Systems and Cybernetics, v. 35, n. 9/10.

StatSoft, Inc. (2006a). Data mining techniques, retrieved from stdatmin.html.

StatSoft, Inc. (2006b). Electronic textbook, retrieved from glosa.html.

StatSoft, Inc. (2006c). Electronic textbook, retrieved from glosh.html.

Tamayo, P. & Ramaswamy, S. (2002). Cancer genomics and molecular pattern recognition, Cancer Genomics Group, Whitehead Institute, Massachusetts Institute of Technology, retrieved from projects/genomics/Humana_final_Ch_06_23_ 2002%20SR.pdf.

Witten, I. H. & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementation, Morgan Kaufmann.

KEY TERMS

Algorithm: That which produces a particular outcome with a completely defined set of steps or operations.

ARNAVAC: An underlying machine language algorithm used in PolyAnalyst® for the comparison of the target variable distributions in approximately homogeneously equivalent populated multidimensional hyper-cubes.

Association and sequence analysis: A data mining method that first relates a transaction and an item and then examines the order in which the products are purchased.

BioDiscovery GeneSight®: An efficient data mining, visualization, and reporting tool that can analyze massive gene expression data generated by microarray technology.

Data Mining: The extraction of interesting and potentially useful information or patterns from data in large databases; also known as Knowledge Discovery in Data (KDD).

Link Analysis (LA): A technique used in data mining that reveals and visually represents complex patterns between individual values of all categorical and Boolean attributes.

Link Terms (LT): A technique used in text mining that reveals and visually represents complex patterns of relations between terms in textual notes.

Market and transactional basket analysis: Algorithms that examine a long list of transactions to determine which items are most frequently purchased together, as well as analysis of other situations such as identifying those sets of questions of a questionnaire that are frequently answered with the same categorical answer.

Megaputer PolyAnalyst® 5.0: A powerful multi-strategy data mining system that implements a broad variety of mutually complementing methods for automatic data analysis.

Microarray Databases: Databases that store large amounts of complex data generated by microarray experiments (e.g. DNA).

NeuralWare NeuralWorks Predict®: A software package that integrates all the capabilities needed to apply neural computing to a wide variety of problems.

SAS® Enterprise Miner™: Software that uses a drag-and-drop, object-oriented approach within a workspace to perform data mining using a wide variety of algorithms.

Symbolic Knowledge Acquisition Technology (SKAT): An algorithm developed by Megaputer Intelligence and used in PolyAnalyst® 5.0 that uses methods of evolutionary programming for high-degree rational expressions that can efficiently represent nonlinear dependencies.

WEKA: Open-source data mining software that is a machine learning workbench.

This work was previously published in Encyclopedia of Data Warehousing and Mining, Second Edition, edited by J. Wang, pp. 269-277, copyright 2009 by Information Science Reference (an imprint of IGI Global).