TEACHING WEB MINING IN THE CLASSROOM: WITH AN OVERVIEW OF WEB USAGE MINING

Richard S. Segall
Arkansas State University
Department of Computer & Information Technology, College of Business
State University, AR 72467-0130 USA
Voice: 870-972-3989 Fax: 870-972-3417
E-mail: [email protected]

Qingyu Zhang
Arkansas State University
Department of Computer & Information Technology, College of Business
State University, AR 72467-0130 USA
Voice: 870-680-8076 Fax: 870-972-3417
E-mail: [email protected]
ABSTRACT
Web mining is a new area of research in information technology. This paper addresses its applications to teaching and research. The purpose of the paper is two-fold: (1) to provide insight into current techniques for teaching web mining in the classroom, and (2) to provide an overview of one selected type of web mining, web usage mining. The latter serves as an introduction for the reader; a future direction of this research is to discuss how the other types of web mining apply to teaching web mining in the classroom. References are listed separately for the two topics of (1) web usage mining and (2) teaching web mining. Future directions of the research are presented.

INTRODUCTION
According to Wikipedia (2007), web mining is the application of data mining techniques to discover patterns from the Web, and it can be classified into three types: “web usage mining”, “web content mining”, and “web structure mining”. This paper selects only one of these three types as a basis for introducing the area of web mining and discussing how it can be applied to teaching web mining in the classroom. The type selected is web usage mining, which focuses on knowledge discovery from the usage of individual web sites. A review of current literature on teaching web mining in the classroom is presented, critiqued, and categorized. Figure 1, from Liang (2003), illustrates the purposes and taxonomy of web mining by identifying the three divisions of web mining (i.e., web content mining, web structure mining, and web usage mining) and their respective subdivisions.
Figure 1: Web Mining Taxonomy [Source: Liang (2003)]

Web Mining
  Web Content Mining
    Web Page Content Mining: identifies information within given web pages; distinguishes personal home pages from other web pages
    Search Result Mining: categorizes documents using phrases in titles and snippets
  Web Structure Mining: uses interconnections between web pages to give weight to pages
  Web Usage Mining
    General Access Pattern Tracking: understands access patterns and trends to improve structure
    Customized Usage Tracking: analyzes access patterns of a user to improve response
BACKGROUND ON WEB MINING
Chen and Chau (2004b) wrote an extensive chapter on Web mining that provides a thorough background on this novel area. Chen and Chau (2004b) carefully describe the meaning of Web mining and its relationship to data mining and text mining. Chen and Chau (2004b, p.289) discuss the novelty and origins of web mining by stating: “The Web’s size and its unstructured and dynamic content, as well as its multilingual nature, make the extraction of useful knowledge a challenging research problem. Furthermore, the Web generates a large amount of data in other formats that contain valuable information. For example, Web server logs’ information about user access patterns can be used for information personalization or improving Web page design.”
Chen and Chau (2004b, p. 291) describe techniques unique to web mining as follows: “It is also interesting to note that, although Web mining relies heavily on data mining and text mining techniques, not all techniques applied to Web mining are based on data mining or text mining. Some techniques, such as Web link structure analysis, are unique to Web mining.”
Chen and Chau (2004b, p. 291) further distinguish web mining, data mining, and text mining by stating: “In general, it is reasonable to consider Web mining as a subfield of data mining, but not a subfield of text mining, because some Web data are not textual (e.g. Web log data). … Web mining research is at the intersection of several established research areas, including information retrieval, Web retrieval, machine learning, databases, data mining, and text mining.”
The reader is referred to Gogri et al. (2007), Icfai University Press (2007), and Scime (2005) for a thorough discussion of web mining that also includes substantial discussions of web usage mining. Scime (2005) authored a text that provides a thorough discussion of applications and techniques of web mining. Web data mining has also changed curricula in academic programs, as evidenced by Mobasher (2007), who taught a graduate course entitled “Web Data Mining” in the School of Computer and Information Technology at DePaul University in Chicago, IL. The reader is referred to the extensive web pages for this course by Mobasher (2007) to appreciate the depth of coverage in a single course on web data mining. According to Chen and Chau (2004b, p.316), “Web mining activities are still in their early stages and should continue to develop as the Web evolves. One future direction of Web mining is multimedia data mining.”

BACKGROUND ON WEB USAGE MINING
A PowerPoint presentation by Chen and Chau (2004a) states that “by analyzing the Web usage data, web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests which had various applications”
that could include personalization and collaboration in Web-based systems as well as Web site design and evaluation. As Chen and Chau (2004b) indicate, “One of the major challenges faced by Web usage mining applications is that Web server log data are anonymous, making it difficult to identify users and user sessions from the data. Techniques like Web cookies and user registration have been used in some applications, but each method has its shortcomings.” Web usage mining uses pattern discovery and analysis techniques such as association rule mining, classification, and clustering. Chen and Chau (2004b) also indicate that web usage mining is “an excellent way to learn about users’ interest” and that “the exponential growth of the Web has greatly increased the amount of usage data in server logs.” Chen and Chau (2004a) list applications of Web usage mining to personalization and collaboration in Web-based systems, marketing, web-site design and evaluation, and decision support. Barsagade (2003) provided a survey paper on web usage mining and pattern discovery. The applications of web usage mining cited by Barsagade (2003) include using web usage patterns to gather business intelligence that improves customer attraction and retention, sales, marketing and advertisement, and cross sales. Barsagade (2003, p.5) indicates that
“web usage mining offers users the ability to analyze massive volumes of clickstream or click flow data, integrate the data seamlessly with transaction and demographic data from offline sources and apply sophisticated analytics for web personalization, e-CRM and other interactive marketing programs.”
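Because server log data are anonymous, a preprocessing step usually has to approximate users and sessions before any clickstream can be analyzed. The following is a minimal sketch of that idea, not a method taken from the cited authors. It assumes log lines in Common Log Format, treats each client host as one visitor, and starts a new session after 30 minutes of inactivity; the names `LOG_PATTERN`, `SESSION_TIMEOUT`, `parse_line`, and `sessionize` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Assumed heuristic for illustration; the paper does not prescribe a timeout.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, path) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), when, match.group("path")


def sessionize(lines):
    """Group requests by host and split a host's clicks when the gap exceeds the timeout."""
    by_host = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            host, when, path = parsed
            by_host[host].append((when, path))

    sessions = []
    for host, hits in by_host.items():
        hits.sort()  # order each host's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (when, path) in zip(hits, hits[1:]):
            if when - prev_time > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(path)
        sessions.append((host, current))
    return sessions
```

Treating a host as a visitor is exactly the kind of approximation the quoted passage warns about: proxies and shared machines merge several users into one host, which is why cookies or registration are sometimes used instead.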
One important use of web usage mining, also indicated by Barsagade (2003, p.6), is security: it can be used to detect intrusion, fraud, and attempted break-ins to a system. Web usage mining can be described as consisting of three phases, similar in purpose to those of data mining in general but applied in the context of web pages: (1) preprocessing, (2) pattern discovery, and (3) pattern analysis.

Susac (2002) describes web usage mining with SQL Server 2000 by discussing the design and implementation process for a data mining framework that can be added to any document management framework or dynamic website. SAS has developed software for building Web-based decision support applications, and these are the basis of the in-depth article by Cohen et al. (2001). Jensen and Scacchi (2004) discussed data mining for software process discovery in open source software development (OSSD) communities, including community Web repositories that encode process data in terms of their usage and update patterns. Jespersen et al. (2002) discussed a hybrid approach to web usage mining that focuses on knowledge discovery from the clicks in the web log for a given site (the “clickstream”), and especially on the analysis of sequences of clicks. The hybrid approach presented in Jespersen et al. (2002) is based on a novel combination of the Hypertext Probabilistic Grammar (HPG) and “Click Fact Table” approaches.

Web agents are web mining tools that can be used to gather web usage data useful for website designers. Yao et al. (2002) authored a text devoted to PagePrompter, an intelligent web agent created using data mining techniques. Nasraoui et al. (2006a) summarized the contents and outcomes of the WebKDD 2006 workshop on Web Mining and Web Usage Analysis, which was held in conjunction with the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006).
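The pattern discovery phase can be illustrated with association rule mining, one of the techniques named earlier in this section. The sketch below is a deliberately simplified assumption, not an algorithm from any of the cited works: it considers only rules between single pages (page A implies page B), computes support as the fraction of sessions containing both pages, and computes confidence as that count divided by the number of sessions containing A. The function name `pairwise_rules` and both thresholds are hypothetical.

```python
from collections import Counter
from itertools import permutations


def pairwise_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) rules between single pages."""
    n = len(sessions)
    page_sets = [set(session) for session in sessions]

    # Number of sessions in which each page occurs at least once.
    page_counts = Counter(page for pages in page_sets for page in pages)

    # Number of sessions containing both pages, recorded for each ordered pair.
    pair_counts = Counter(
        pair for pages in page_sets for pair in permutations(sorted(pages), 2)
    )

    rules = []
    for (antecedent, consequent), count in pair_counts.items():
        support = count / n
        confidence = count / page_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)
```

Pattern analysis, the third phase, would then consist of inspecting the surviving rules, for example checking whether a rule from a syllabus page to a course listing page suggests a missing navigation link.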
Nasraoui (2006b) discussed new clustering algorithms based on robust estimation niches with applications to web usage mining. Figures 2 and 3 below, from Nasraoui (2006c), show a demonstration of web usage mining with the ECSAGO software. Figure 3 shows a more detailed scaling of the view in Figure 2.
Figure 2: Web Usage Mining Using ECSAGO software Source: Nasraoui, O. (2006c)
Figure 3: Web Usage Mining Using ECSAGO software with Refinement Source: Nasraoui, O. (2006c)
WEB MINING SOFTWARE
The Web mining software packages selected for comparison in Table 1 are Megaputer WebAnalyst, SAS Web Analytics, WebLog Expert 2.0, and Visitator.

Table 1: Web Mining Software
Packages compared (columns): Megaputer WebAnalyst, SAS Web Analytics, WebLog Expert 2.0, Visitator
Features compared (rows): Web log analysis; Clickstream path analysis; Analytic executive dashboard; Web-based reporting; Web data mart; Analytical scorecard; Analytical visitor segmentation
[Table body: an "x" marks each feature that a package supports.]
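As a rough illustration of the web log analysis feature listed in Table 1, the sketch below computes a few summary figures of the kind log analyzers report, such as total hits and average hits per day. It is not the implementation of any package in the table. It assumes that log lines have already been reduced to hypothetical `(visitor, day, page)` records, and the function name `general_statistics` is likewise an assumption made for this example.

```python
from collections import Counter


def general_statistics(hits):
    """Summarize (visitor, day, page) records into a few log-analyzer style totals."""
    if not hits:
        # Avoid dividing by zero when the log is empty.
        return {
            "total_hits": 0,
            "average_hits_per_day": 0.0,
            "average_hits_per_visitor": 0.0,
            "most_popular_pages": [],
        }

    total_hits = len(hits)
    days = {day for _, day, _ in hits}
    visitors = {visitor for visitor, _, _ in hits}
    page_counts = Counter(page for _, _, page in hits)

    return {
        "total_hits": total_hits,
        "average_hits_per_day": total_hits / len(days),
        "average_hits_per_visitor": total_hits / len(visitors),
        "most_popular_pages": page_counts.most_common(3),
    }
```

In a classroom setting, an exercise of this size lets students compare their own counts with the figures a commercial log analyzer reports for the same log file.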
Megaputer WebAnalyst is web mining software. According to Slynko and Ananyan (2007), there are four distinct groups of WebAnalyst users inside an organization: webmasters and database administrators, data analysts, executive managers and marketing analysts, and website visitors.

SAS Web Analytics delivers up-to-date information on an organization's entire Web presence, answering questions such as: Who is visiting the site? Where do they go once they arrive? How long do they stay? Are they buying? Which marketing campaigns are getting the most responses? SAS Web Analytics can turn Web data into key metrics specific to a business, enabling it to refine its business strategies as needed and to identify the drivers that influence bottom-line results.

WebLog Expert 2.0 provides general statistics, activity statistics by day and by hour of the day, and access statistics by pages, files, images, directories, and entry pages. It thus provides visual graphs and plots for daily page access, most popular pages, most downloaded files, most requested images, most requested directories, and top entry pages. It also provides referrers by sites, URLs, search engines, and search processes. The general statistics provided by WebLog Expert 2.0 include total hits, average hits per day, average hits per visitor, total page views, average page views per day, average visitors per day, average bandwidth per day, and average bandwidth per visitor.

Visitator is web mining software that clusters and visually presents visitor groups based on access patterns. The analysis of web server logfiles can reveal information about site visitors’ behaviors. The program can show the number of times a certain page has been requested and sometimes the clickstreams as well. There is much more that can be mined in logfiles.

TEACHING WEB MINING
Possible texts that could be adopted for teaching a course in web mining include Scime (2005) and Liu (2007), both of which discuss applications and techniques of web mining. Piatetsky-Shapiro (2006a) provided web postings of Web usage mining teaching materials that are freely available. Piatetsky-Shapiro (2006b) and Chen (2007) provide web pages for teaching a web mining course. Chen (2007) gives the web address of the pages for a course in Data and Web Mining as taught at the University of Arizona. Helian (2007) provides an on-line module on web mining and web search as used in a graduate-level course in the United Kingdom. Berendt et al. (2004) provide a tutorial on evaluation in web mining using statistical approaches. Nachmias and Hershkovitz (2006) of Tel Aviv University discussed using web mining for learning about the online learner. Their main focus was to establish a research framework, both theoretical and empirical, for employing Web mining techniques on Web-based learning environments in order to understand teaching and learning behaviors in such systems. Lau and Fong (2003) investigated the effectiveness of web-based learning using a web-mining approach and presented their work at an international workshop on database and expert systems applications. Pahl and Donnellan (2002) used data mining technology to evaluate the effectiveness of web-based teaching and learning systems; their paper presents and illustrates different data mining techniques for this evaluation. Sung et al. (2000) wrote an article on Web mining for distance education and show that the use of Web mining for education is of great interest. Ai and Laffey (2007) discussed the use of Web mining as a tool for understanding online learning.
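Several of the studies above apply web usage mining to logs from Web-based learning environments. As a purely illustrative sketch, and not the method of any cited study, the example below summarizes how often each student requested each type of course resource and lists students with little recorded activity. The `(student, resource_type)` record layout, the function names, and the threshold of three requests are all assumptions made for this example.

```python
from collections import defaultdict


def activity_by_student(events):
    """Count requests per resource type for each student from (student, resource_type) pairs."""
    summary = defaultdict(lambda: defaultdict(int))
    for student, resource_type in events:
        summary[student][resource_type] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {student: dict(counts) for student, counts in summary.items()}


def low_activity_students(summary, minimum_requests=3):
    """List students whose total request count falls below the chosen threshold."""
    return sorted(
        student
        for student, counts in summary.items()
        if sum(counts.values()) < minimum_requests
    )
```

A count of requests is only an indicator; it says nothing by itself about whether learning took place, which is why the cited studies pair log data with an evaluation framework.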
Ai and Laffey (2007) explained the use of Web mining in Course Management Systems (CMS) and identified some illustrative learning patterns that can be found by using Web-mining approaches. Ai and Laffey (2007) provided examples in three areas to show how Web mining could potentially benefit e-learning: (1) understanding learner behavior, (2) determining e-learning system effectiveness, and (3) measuring the success of instructional efforts. Chen et al. (2001) investigated the use of open Web APIs (Application Programming Interfaces) to teach data mining in classrooms. Chen et al. (2001) concluded that “students acquired valuable experience in leveraging the power of the APIs to build important and interesting Web mining applications”. Chen and Chau (2004a, 2004b) discussed how machine learning techniques can be applied to Web mining. Chen and Chau (2004a, 2004b) also note that “web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and web retrieval”, and present a classification of retrieval and mining techniques and applications. Chen and Chau (2004a, 2004b) concluded that two of the major limitations of Web mining research that likewise affect teaching are (1) the lack of suitable text collections that can be used by researchers, and (2) the difficulty of collecting Web usage data across different Web sites. The reader is referred to Chen and Chau (2004a, 2004b) for a detailed discussion of Web content mining and Web structure mining. Lei et al. (2003) developed an evaluation technique for content interaction in Web-based teaching and learning environments. Lei et al. (2003) proposed Web usage mining in conjunction with an analytic model as the evaluation approach. Ravid et al. (2002) presented a paper on Web
mining in education that uses students’ log files as an indicator of on-line learning and as a tool for improving on-line instruction, illustrated with an on-line learning environment at the Open University of Israel.

CONCLUSIONS AND FUTURE DIRECTIONS
This paper provides a brief introduction to web mining as well as to one of its types, web usage mining. Web mining has created a new dimension in the teaching of data and text mining, with universities starting to offer courses in web mining. The future directions of this research are to compare other selected software that can be used in the teaching of web mining. The software selected for future investigation in the context of web mining includes SPSS Clementine and Megaputer PolyAnalyst for general web mining, and ClickTracks by Web Analytics for web usage mining.

ACKNOWLEDGEMENTS
The authors acknowledge that this research would not have been possible without the support provided by a 2007 Summer Faculty Research Grant as awarded by the College of Business of Arkansas State University for a research proposal by Zhang and Segall (2006).

REFERENCES

A.) Web Usage Mining:
Barsagade, N. (2003), Web Usage Mining and Pattern Discovery: A Survey Paper, Computer Science and Engineering Dept. CSE Tech Report 8331, SMU Southern Methodist University, Dallas, TX, http://engr.smu.edu/%7Emhd/8331f04/barsagada.doc
Cohen, M-D, Kelly, C.B., and Medaglia, A.L. (2001), Decision Support with Web-Enabled Software, INTERFACES, v. 31, n. 2, March-April, pp. 109-129
Gogri, R., Chawla, D., and Aparadh, A. (2007), Introduction to Web Mining, PowerPoint presentation, www.cs.sunysb.edu/~cse634/spring 2007/group3_final.ppt
Icfai University Press (2007), Web Mining – An Overview, www.icfaiuniversitypress.org/Books/WebMining_ovw.asp
Jensen, C. and Scacchi, W. (2004), Data Mining for Software Process Discovery in Open Source Software Development Communities, International Workshop on Mining Software Repositories (MSR 2004), W17S Workshop - 26th International Conference on Software Engineering (2004/917), pp. 96-100, Edinburgh, Scotland, UK, 25 May 2004, ISBN: 0 86341 432 X, www.ics.uci.edu/~wscacchi/Papers/New/Jensen-Scacchi-MSR04.pdf
Jespersen, S.E., Thorhauge, J., Pedersen, T.B. (2002), A Hybrid Approach to Web Usage Mining, Technical Report, Department of Computer Science, Aalborg University
Liang, J.W. (2003), Introduction to Text and Web Mining, Seminar at North Carolina Technical University, http://www.database.cis.nctu.edu.tw/seminars/2003F/TWM/slides/p.ppt
Mobasher, B. (2007), ECT 584 – Web Data Mining, School of CIT, DePaul University, Chicago, IL, http://maya.cs.depaul.edu/~classes/ect584
Nasraoui, O., Spiliopoulou, J., Srivastava, J., Mobasher, B., and Masand, B. (2006a), WebKDD 2006 – Web Mining and Web Usage Analysis Post-Workshop Report, SIGKDD Explorations, v. 8, n. 2, pp. 84-89.
Nasraoui, O. (2006b), New Clustering Algorithms Based on Robust Estimation Niches with Applications to Web Usage, http://webmning.spd.louisville.edu/NSF_Career
Nasraoui, O. (2006c), Web Mining Usage ECSAGO software demonstration, http://webmining.spd.louisville.edu/NSF_Career/software/clustering/ECSAGO/demo/
Scime, A. (2005), Web Mining: Applications and Techniques, IGI Global Publishing, Hershey, PA, www.igi-pub.com/books/additional.asp?id=4383&title=Preface&col=preface
Susac, D. (2003), Web Usage Mining and SQL Server 2000, ASP Today, ftp://ftp.asptoday.com/AspToday/Articles_20020923_01_1.zip
Yao, Y. Y., Hamilton, H. J., Wang, X. (2002), PagePrompter: An Intelligent Web Agent Created using Data Mining Techniques, Springer Berlin/Heidelberg
Zhang, Q. and Segall, R. S. (2006), “Further continuation of research on applications of data mining techniques in knowledge discovery: an In-depth investigation on algorithms and heuristics”, Proposal submitted and funded by the Arkansas State University College of Business Summer Faculty Development Grant, State University, AR.

B.) Teaching Web Mining:
Ai, J. and Laffey, J. (2007), Web Mining as a Tool for Understanding Online Teaching, MERLOT Journal of OnLine Learning and Teaching, v. 3, n. 2, June, pp. 160-169, http://jolt.merlot.org/vol3no2/ai.htm
Berendt, B., Spiliopoulou, M., Menasalvas, E. (2004), Evaluation in web mining, Tutorial at ECML/PKDD 2004, Workshop on Statistical Approaches for Web Mining (SAWM 2004), Pisa, Italy, 20 September, http://www.wiwi.hu-berlin.de/~berendt/evaluation04
Chen, H. (2007), MIS 510 Data and Web Mining, University of Arizona, Artificial Intelligence Lab, Eller College of Management, http://ai.arizona.edu/hchen/class510.htm
Chen, H. and Chau, M. (2004a), Web Mining: Machine Learning for Web Applications, PowerPoint Presentation, http://ai.arizona.edu/hchen/classs510.htm
Chen, H. and Chau, M. (2004b), Web Mining: Machine Learning for Web Applications, Chapter 6, Annual Review of Information Science and Technology (ARIST), v. 38, pp. 289-329, http://ai.bpa.arizona.edu/go/intranet/papers/WebMining,pdf
Chen, H., Li, X., Chau, M., Ho, Y-J, Tseng, C. (2001), Using Open Web APIs in Teaching Web Mining, http://ai.arizona.edu/hchen/chencourse/webapi.pdf
Helian, N. (2007), DBP0003 Web mining and web search, https://intranet.londonmet.ac.uk/progplan/postgrad-line/modules/db/dbp003.cfm
Lau, I. K., and Fong, J. (2003), Investigation on the effectiveness on web-based learning using web-mining approach, 14th International Workshop on Database and Expert Systems Applications (DEXA’03), http://doi.ieeecomputersociety.org/10.1109/DEXA.2003.1232040
Lei, X., Pahl, C., Donnellan, D. (2003), An evaluation technique for content interaction in Web-based teaching and learning environments, Proceedings of the 3rd IEEE International Conference on Advanced Learning Technologies, July 9-11, pp. 294-295.
Liu, B. (2007), Web Data Mining, Springer-Verlag, New York, ISBN-13: 9783540378815
Nachmias, R. and Hershkovitz, A. (2006), Using web mining for learning about the online learner, Tel Aviv University, School of Education, Science and Technology Education Center, http://132.66.30.63/virtual/EU-FP7/web-mining.doc
Pahl, C. and Donnellan, D. (2002), Data mining technology for the evaluation of web-based teaching and learning systems, Proceedings of the E-Learn 2002 World Conference on E-Learning in Corporate, Government, Healthcare, & Higher Education, Montreal, Quebec, Canada, October 15-19, 2002, also available as item ED479591, Education Resources Information System (ERIC), Association for the Advancement of Computing in Education (AACE), P.O. Box 3728, Norfolk, VA 23514, http://www.eric.ed.gov/ERICWebPortal/recordDetail?accno=ED479591
Piatetsky-Shapiro, G. (2006a), Web usage mining teaching materials (freely available), KDnuggets: News, n. 13, item 3, http://www.kdnuggets.com/news/2006/n13/3i.html
Piatetsky-Shapiro, G. (2006b), Web mining course, http://www.kdnuggets.com.web_mining_course/
Ravid, G., Yaffe, E., Tal, E. (2002), Web mining in education, Using students’ log files as an indicator of on-line learning and as a tool for improving on-line instruction, The First Annual Doctoral Consortium in Israel About Computer-Mediated Communication, the Internet, and Social Aspects Thereof, The Center for the Study of the Information Society, University of Haifa, Israel, September 19, http://infosoc.haifa.ac.il/kennes/Golad3.doc
Scime, A. (2005), Web Mining: Applications and Techniques, IGI Publishing, Hershey, PA.
Sung, H. H., Sung M.B., Sang C.P. (2000), Web mining for distance education, ICMIT 2000, Proceedings of the 2000 IEEE Conference on Management of Innovation and Technology, v. 2, Issue 2000, pp. 715-719.
Wikipedia (2007), Web mining, http://en.wikipedia.org/wiki/Web_mining