Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
by Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers, 2000
416 pages, Paper, $49.95
ISBN 1-55860-552-5

Review by: James Geller, New Jersey Institute of Technology, CS Department, 323 Dr. King Blvd., Newark, NJ 07102
[email protected]
http://web.njit.edu/~geller/

Story of the book

Witten and Frank's textbook was one of two books that I used for a data mining class in the Fall of 2001. The book covers all major methods of data mining that produce a knowledge representation as output. Knowledge representation is here understood as a representation that can be studied, understood, and interpreted by human beings, at least in principle. Thus, neural networks and genetic algorithms are excluded from the topics of this textbook. We need to say "can be understood in principle" because a large decision tree or a large rule set may be as hard to interpret as a neural network.

The book first develops the basic machine learning and data mining methods. These include decision trees, classification and association rules, support vector machines, instance-based learning, Naive Bayes classifiers, clustering, and numeric prediction based on linear regression, regression trees, and model trees. It then goes deeper into evaluation and implementation issues. Next it moves on to deeper coverage of issues such as attribute selection, discretization, data cleansing, and combinations of multiple models (bagging, boosting, and stacking). The final chapter deals with advanced topics such as visual machine learning, text mining, and Web mining.


A walk through the contents

The greatest strength of this data mining book lies outside of the book itself. All the algorithms described in this book are implemented and freely available through the WEKA (Waikato Environment for Knowledge Analysis) Web site (www.cs.waikato.ac.nz/ml/weka). Chapter 8 of the book is a tutorial on the implemented algorithms. The integration between the book and the Web site is excellent, and the Web site is alive, thriving, and growing. Thus, the number of data mining algorithms available on the Web site goes far beyond what is described in the book. Indeed, even neural networks have been added to the Web site since the book was first published. While many books offer an associated Web site by now, the close linkage between book and Web site and the rapid growth of the Web site are highly commendable. Another pleasant feature of the WEKA implementation is that it is done in Java. This makes it possible to construct systems, based on Java, that capitalize on the other strengths of Java, such as access to relational databases through JDBC and easy access to Web pages from within Java programs.

Target audience

The book is written for academics and practitioners, and I believe it can be well understood even by undergraduate students. In fact, it is probably the most accessible survey of data mining in print, without sacrificing too much precision and rigor. The book is written in a highly redundant style, which I would like to describe as an exercise in iterative deepening. Basic concepts are repeated in several chapters, but covered at a deeper level in the later chapters. This should make it easy for students to keep reading, without having to refer back to earlier chapters at every step of the way. On the other hand, for a person who is already familiar with the basics of data mining, this makes for boring reading in some places. However, I do not recommend a streamlining of the book. Instead, I recommend that readers with some knowledge of the topic skip paragraphs that sound familiar without any guilty feelings.

SIGMOD Record, Vol. 31, No. 1, March 2002

The book goes to great lengths to avoid "formula shock". Formulas are developed step by step and well explained. Only absolutely necessary formulas are included. In many cases, where the derivation of a complex result is irrelevant to the actual data mining issues, the authors defer to statistics textbooks. While I am greatly in favor of both these approaches in writing textbooks, I feel that they have gone too far in a few places. In a number of places, the authors avoid introducing "one more letter" to keep the text readable. However, the price they pay is that many of their formulas have no equal signs. Thus, a sentence is terminated with a colon and followed by a formula, which is presumably equal to the quantity described by the sentence. This is done on many pages, e.g., 132-135, 137, 196, 207, and 222. Not in my wildest dreams would I have thought that I could ever criticize a book author for having too few formulas and too few variables. But this is exactly what I need to do here. While I do not recommend eliminating the previously mentioned redundancy of description, I do recommend for the next edition (which this book will undoubtedly have) to strengthen the formulas, without necessarily adding new ones.

At a few places, the book could also be improved by adding more explanations to figures. Figure 3.6 is a prime example of this issue. I found myself spending time verifying that the instance counts in two subfigures truly add up to the same total (of 209). They do. The reader could be spared this effort by a better caption or a better description in the body of the text. Similarly, the Apriori algorithm is introduced in a figure, but only in the "Further Reading" subsection (following much later) is the name of the algorithm mentioned. A better figure caption would help the scholarly advancement of students who might not take the "Further Reading" section that seriously.

Reviewer's appreciation

In America we say "Actions speak louder than words". Thus, instead of summarizing the book, I will describe some actions that I intend to take (or that I am already taking). (1) I am using WEKA for my research. (2) If I teach the same course again, I will use Witten and Frank's book again. (3) If the book appears in a second edition, I will acquire it.
