Sep 6, 2010
Chapter 9
Open Source Data Mining Tools for Sports
Chapter Overview
Open source development has become more prominent in recent years in a multitude of software areas. In the domain of data mining tools, several solutions, such as Weka and RapidMiner, have gained significant acceptance. Both tools share the same underlying learning algorithms; however, their approaches to displaying results are very different.
1 Introduction
Both Weka and RapidMiner are excellent open source tools for sports data mining. They can leverage multiple algorithms, allowing users to rapidly explore and analyze their sports data however they see fit. This means that users can run their data through one of the built-in algorithms, see what results come out, and then run it through a different algorithm to see if anything different stands out. Because of these programs’ open source nature, users are free to modify the source code, provided that the modifications are made available to others.
2 WEKA
WEKA, an open source collection of data mining algorithms written in Java, is a solid exploratory tool for those interested in mining their collected data (Witten and Frank 2005). Users can either use the Weka-provided interface or incorporate the Java class libraries into their own code. While it is open source and freely distributable, Weka is covered under the GNU General Public License, under which any changes to the software must be made freely available. Weka was developed at the University of Waikato in New Zealand and is primarily aimed at the academic community as a data mining tool. An example screenshot of the Weka tool for selected greyhound racing data is shown in Fig. 9.1.
R.P. Schumaker et al., Sports Data Mining, Integrated Series in Information Systems 26, DOI 10.1007/978-1-4419-6730-5_9, © Springer Science+Business Media, LLC 2010
Fig. 9.1 The Weka tool for Greyhound racing data mining
Weka contains multiple classifier algorithms, including several categories of naïve Bayesian classifiers; numerous fitting algorithms such as least squares, regression, neural networks and support vector machines; a handsome variety of boosting and bagging algorithms; a nice assortment of decision trees; and a collection of rule-based algorithms. Aside from the classifiers, Weka also supports clustering and association rule mining. Beyond the wealth of algorithms at your disposal, Weka also features a plethora of options, such as how to partition the data between training and testing sets, how to filter the results, and how to visualize the testing data.
The steps for using Weka are relatively straightforward:
• Start the Weka program
• Open the file of the dataset to be mined (assuming it is in a form that Weka understands)
• Select the attributes to learn from
• Select a classifier
• Select how to partition the data between training and testing
• Select what attribute to make predictions about
• Start the system
Weka will then display the predictive results to the user, as shown in Fig. 9.2.
Fig. 9.2 Weka’s predictive results using selected stock market data
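The steps above can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not Weka's own interface or API: the inline rows are made-up toy values standing in for a dataset file, the column names (weight_kg, avg_speed_mps, box, placed) are hypothetical, and a simple 1-nearest-neighbour rule stands in for whichever classifier a user would pick.

```python
import csv
import io
import random

# Made-up toy rows standing in for a dataset file (NOT real racing data).
# All column names are hypothetical and chosen only for illustration.
RAW_CSV = """weight_kg,avg_speed_mps,box,placed
30.5,16.8,1,yes
31.2,16.5,2,yes
29.8,17.0,3,yes
30.9,16.9,4,yes
33.4,15.1,5,no
34.0,14.8,6,no
32.9,15.3,7,no
33.7,15.0,8,no
"""


def load(text):
    """Open the dataset (here an inline CSV string instead of a file)."""
    return list(csv.DictReader(io.StringIO(text)))


def split(rows, train_fraction, seed):
    """Partition the rows into a training set and a testing set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def nearest_neighbor(train, features, target):
    """The chosen classifier: predict the label of the closest training row."""
    def predict(row):
        def squared_distance(other):
            return sum((float(row[f]) - float(other[f])) ** 2 for f in features)
        return min(train, key=squared_distance)[target]
    return predict


def accuracy(predict, rows, target):
    """Fraction of rows whose predicted label matches the recorded one."""
    hits = sum(1 for row in rows if predict(row) == row[target])
    return hits / len(rows)


rows = load(RAW_CSV)                                   # open the dataset
features = ["weight_kg", "avg_speed_mps"]              # attributes to learn from
target = "placed"                                      # attribute to predict
train, test = split(rows, train_fraction=0.75, seed=1) # training/testing split
model = nearest_neighbor(train, features, target)      # select a classifier
score = accuracy(model, test, target)                  # run and evaluate
print(f"accuracy on {len(test)} held-out rows: {score:.2f}")
```

Swapping in a different classifier only means replacing the nearest_neighbor function; the loading, partitioning and scoring steps stay the same, which is what makes this kind of exploratory comparison quick.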
3 RapidMiner
RapidMiner is another data mining tool, but this one is somewhat unusual: it is partially open source and partially closed source. This division exists because RapidMiner’s core system uses the Weka algorithms, and Weka’s GNU license requires that the source and any modifications to it remain open source. What sets RapidMiner apart is its focus on the front end, that is, on displaying the results to users. Since this part is not a part of Weka, it can be maintained as closed source.
RapidMiner comes in two varieties. The enterprise version, in which the system will explore multiple alternatives and return the most favorable one, is not free and is aimed primarily at larger corporations that have an extensive amount of data to mine. The community version is available for free and performs much the same as Weka; however, RapidMiner boasts more algorithms and a more user-friendly visualization interface. An example screenshot of the RapidMiner system is shown in Fig. 9.3.
Fig. 9.3 RapidMiner visualization screenshot, courtesy of http://rapid-i.com/content/view/9/25/lang,en/
4 Conclusions
Both Weka and RapidMiner are exceptional open source tools that nearly anyone with some basic computing training can use. They capitalize on an abundance of machine learning algorithms, data manipulation options, and visualization techniques. While these tools still require human direction and experimentation, it is not hard to imagine them one day becoming one-click systems that analyze the data automatically under multiple algorithms and multiple visualization techniques and return only those results that score high on an “interestingness” scale. Both tools would be useful for effective sports data mining.
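To make the idea of a one-click system concrete, the sketch below runs several learners over the same data and returns only those that clear an interestingness threshold. It is an illustration under stated assumptions, not a feature of Weka or RapidMiner: the chapter does not define "interestingness", so the code assumes it means the gain in leave-one-out accuracy over a majority-class baseline, and the data points are made-up toy values.

```python
from collections import Counter

# Made-up toy data: (feature vector, label). Not taken from any real sport.
DATA = [
    ((1.0, 1.2), "a"), ((1.5, 0.8), "a"), ((0.9, 1.9), "a"),
    ((2.0, 1.5), "a"), ((1.2, 1.4), "a"),
    ((6.0, 5.5), "b"), ((5.5, 6.2), "b"), ((6.8, 5.9), "b"),
]


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def majority_class(train):
    """Baseline learner: always predict the most common training label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbor(train):
    """Predict the label of the single closest training point."""
    return lambda x: min(train, key=lambda row: sq_dist(x, row[0]))[1]


def nearest_centroid(train):
    """Predict the label whose class mean is closest to the query point."""
    groups = {}
    for x, lbl in train:
        groups.setdefault(lbl, []).append(x)
    centroids = {
        lbl: tuple(sum(col) / len(xs) for col in zip(*xs))
        for lbl, xs in groups.items()
    }
    return lambda x: min(centroids, key=lambda lbl: sq_dist(x, centroids[lbl]))


def leave_one_out_accuracy(learner, data):
    """Train on all points but one, test on the held-out point, and repeat."""
    hits = 0
    for i, (x, lbl) in enumerate(data):
        model = learner(data[:i] + data[i + 1:])
        hits += model(x) == lbl
    return hits / len(data)


def scan(learners, data, min_lift):
    """Return (name, lift) for learners whose accuracy gain over the
    majority-class baseline is at least min_lift (the ASSUMED
    interestingness measure), highest lift first."""
    baseline = leave_one_out_accuracy(majority_class, data)
    results = []
    for learner in learners:
        lift = leave_one_out_accuracy(learner, data) - baseline
        if lift >= min_lift:
            results.append((learner.__name__, lift))
    return sorted(results, key=lambda r: r[1], reverse=True)


interesting = scan(
    [majority_class, nearest_neighbor, nearest_centroid], DATA, min_lift=0.1
)
for name, lift in interesting:
    print(f"{name}: lift over baseline = {lift:.3f}")
```

Other definitions of interestingness (for example, rule support and confidence, or how surprising a pattern is to a domain expert) would slot into scan in place of the accuracy lift used here.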
5 Questions for Discussion
1. What other open source data mining tools do you have experience with and what are their strengths?
2. How would you adopt either Weka or RapidMiner for sports data mining?
3. If you were tasked to identify “interesting patterns” in different sports data, what criteria would you use?