Knowledge Mining with ELM System Ilona Bluemke and Agnieszka Orlewicz Institute of Computer Science,Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected]
Abstract. The problem of knowledge extraction from the data left by web users during their interactions is a very attractive research task. The extracted knowledge can be used for different goals such as service personalization, site structure simplification, web server performance improvement or even for studying the human behavior. We constructed a system, called ELM (Event Logger Manager), able to register and analyze data from different applications. The registered data can be specified in an experiment. ELM provides several knowledge mining algorithms i.e. Apriori, ID3, C4.5. The objective of this paper is to present knowledge mining in data from interactions between user and a simple application conducted with ELM system. Keywords: web usage mining system, knowledge mining.
1 Introduction The time spent by people in front of computers still increases so the problem of knowledge extraction from the enormous amount of data left by web users during their interactions is a research task that has increasingly gained attention in the last years. The analysis of such data can be used to understand users preferences and behaviors in a process commonly referred to as Web Usage Mining (WUM). The extracted knowledge can make Web site management much more effective by enabling service personalization, site structure simplification, web server performance improvement or even for studying the human behavior. Leveraging these techniques we can create adaptive Web sites which better meet user requirements. In the past, several WUM systems have been implemented, some of them are presented in section 2. At the Department of Electronics and Information Systems Warsaw University of Technology a WUM system, called ELM (Event Logger Manager) was designed and implemented [1]. Our system significantly differs from existing WUM systems. The majority of WUM systems is dedicated to one web server or one application. ELM is flexible and easy to integrate with any Java application. We use aspect modification, as proposed in [2], to add code responsible for registering required data. ELM is able to collect and store data from users interactions from many applications in one data base, such feature is not provided by other systems. ELM system contains several knowledge mining algorithms e.g. Apriori, ID3 and C4.5 taken from weka library [3], but without any difficulty other algorithms can be added. In our system a user can also decide on the kind of data registered and filtered and, to R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 82–92, 2010. © Springer-Verlag Berlin Heidelberg 2010
Knowledge Mining with ELM System
83
our best knowledge, this property is not available in other systems. In section 3 ELM system is briefly presented. The objective of this paper is to present how web usage mining can be conducted with ELM system. The experiment is described in section 4 and conclusions are given in section 5.
2 Web Usage Mining Systems The information placed in Internet is still increasing. During navigation web users also leave many records of their activity. This enormous amount of data can be a useful source of knowledge but sophisticated processes are need for the analysis of these data. Data mining algorithms can be applied to extract, understood and use knowledge from these data and all these activities are called web mining. Depending on the source of input data web mining can be divided into three types: • • •
contents of internet documents are analyzed in Web Content Mining, structure of internet portals are analyzed in Web Structure Mining, the analysis of data left by users can be used to understand users preferences and behavior in a process commonly referred to as Web Usage Mining (WUM).
Overviews of web mining with long lists of references can be found in [4, 5, 6]. The web usage mining process consists of following phases: data acquisition, data preprocessing, pattern discovery and pattern analysis. Data preparation and pattern discovery processes are described e.g. in [7]. Often the results of pattern analysis are feedback to pattern discovery activity. A brief overview of pattern discovery methods is given in [8]. Pattern discovery methods can be divided into several groups: 1. statistical analysis (like frequency, mean), 2. association rules discover correlation among pages accessed, 3. clustering of users enable to discover groups of users with similar navigational patterns, 4. sequential patterns find inter session patterns presence of a set of items followed by another item in time order, 5. classification maps data item into one of predefined classes, 6. dependency modeling determines if there are significant dependencies among variables. Effective data acquisition phase is crucial for web usage mining. The sources of data for web usage mining are following: 1. Web servers – the most common source of data. Web logs usually contain fields like date, time, user IP address, bytes sent, request line, etc. The information is usually expressed in one of standard file format as Common Log Format, Extended Log Format and Log ML. 2. Proxy side – data are similar to data collected at server side but they contain data for groups of users accessing groups of web servers. 3. Client side -data can be obtained by modified browsers, Java applets, Java scripts. This approach avoids the problem of session identification and problems caused by caching. Detailed information about users behavior can be provided.
84
I. Bluemke and A. Orlewicz
The analysis of data from users’ interactions can be used to understand users preferences and behavior in a Web Usage Mining (WUM) [9, 10]. The knowledge extracted can be used for different goals such as service personalization, site structure simplification, and web server performance improvement e.g. [11,12]. There are different WUM systems: small systems performing fixed number of analysis and there are also systems dedicated to large internet services. In the past, several WUM projects have been proposed e.g. [13, 14, 15, 16, 17, 18]. In Analog system [13] users activity is recorded in server log files and processed to form clusters of user sessions. The online component builds active user sessions which are then classified into one of the clusters found by the offline component. The classification allows to identify pages related to the ones in the active session and to return the requested page with a list of related documents. Analog was one of the first project of WUM. The geometrical approach used for clustering is affected by several limitations, related to scalability and to the effectiveness of the results found. Nevertheless, the architectural solution introduced was maintained in several other more recent projects. Web Usage Mining (WUM) system, called SUGGEST [14], was designed to efficiently integrate the WUM process with the ordinary web server functionalities. It can provide useful information to make easier the web user navigation and to optimize the web server performance. Many Web usage mining systems are collaborating with an internet portal or service. The goal is to refine the service, modify its structure or make it more efficient. In this group are systems SUGGEST [14], WUM [16], WebMe [17], OLAM [18]. Data are extracted from www servers’ logs. Some examples of patterns extracted from server logs are presented in [7]. In [8] a method of extracting patterns from Web logs based on Kohonen’s Self Organizing Maps (SOM) is presented. This method is simple and fast comparing to clustering algorithms. It also enables to extract patterns of any length. Other advantage of this approach is that the number of clusters do not need to be known. Widely known big portals and companies are using web mining for different purposes. In Amazon techniques like associations between pages visited, click-path analysis are used to adapt the store for individual customer. Knowledge obtained from Web mining is used for instant recommendations or ‘wishes –lists’ [19]. Google [20], very popular and widely used search machine, was probably the first to identify the importance of link structure in mining information from the Web. PageRank algorithm measures the importance of a page using structural information of the Web graph. The relevance of a page is expressed in how many other pages cite it. PageRank is used in all Google search products. Google also introduced the Advertising Program providing advertisements, that are relevant to a search query. Other service offered by Google is “Google News” [21] integrates the latest news from online versions of newspapers and organizes the categorically to make it easier for users to read the most relevant news. In e-Bay web mining techniques are used by to analyze the bidding behavior to determine if a bid is a fraud [25]. Yahoo [26] introduced the concept of “personalized portal”. Mining usage logs provides valuable information about usage habits thus enabling providing personalized contents. An example of a system using data from client side is Archcollect [16]. This system is registering data by modified explorer. Archcollect is not able to perform
Knowledge Mining with ELM System
85
complex analysis. In [24] a method which combines graph based web usage mining with statistical analysis of data from client side is presented. Authors ([24]) show also that the results of the mining process are helpful for administrators. We haven't found a system registering data on both client and server sides.
3 ELM Web Usage Mining System In ELM system all phases of web usage mining process are performed. ELM is responsible for data storage, data preparation and data analysis. System provides predefined set of data mining algorithms and data preparation filters, but without difficulties new algorithms and custom filters can be added. System is able to store data in local or remote data base. ELM user is able to define what events should be stored and when. Some basic types of events are predefined but user can also specify its own logical events. In ELM system data are preprocessed, data mining algorithms can be executed and its results may be observed.
Fig. 1. ELM architecture
In Fig.1 the main parts of ELM system are presented. Data acquisition is performed by EventLogger and analysis by EventManager. Both modules are using eventDB, relational data base, containing stored events. EventLogger API is written to facilitate the access to data base for data registering and data analyzing modules. In this module logical types of events may be defined. For defined events also the parameters may be created. EventLogger API writes or reads events with parameters values. In order to avoid dependencies to third party libraries only JDBC (Java Data Base Connectivity) is used to access the database. JDBC provides methods for querying and updating data in a relational database. EventLogger is able to register events handled at the server side. We use aspect modification proposed in [2] to log events. An aspect, written in AspectJ [25], is responsible for storing appropriate information describing the event. This aspect can be woven to the code of Java application, even if only compiled code of this application is available. Registering data in aspects, instead of using default web logs, enables more flexibility in the content and format of data thus making the data preparation phase much easier. The structure of ELM repository is shown in Fig. 2. More information concerning ELM system can be found in [1].
86
I. Bluemke and A. Orlewicz
Fig. 2. ELM repository
4 Example To find the advantages and disadvantages of ELM system and its ability to discover knowledge an experiment with a real application should be conducted. We had problems to find a real application which could be monitored by ELM system. It is natural, that the owners will not let to associate with theirs applications “an unknown” system. We had to find an exemplary application having most, regular functionalities of a web application. jpetstore [26] – an application in JavaEE technology, available at sourceforge.net, was taken as a “case study”. This application is available in two versions, one using Spring MVC and the other Struts. In jpetstore an animal in one of following categories: cats, dogs, birds, fish or reptiles may be bought. When a user enters appropriate category the species in his category are displayed. Each species is assigned with some individuals presented on a picture with detailed information about the price and the provider. During registration process each user should give her/his personal data and preferences e.g. preferable animal’s category. Banner will remind the user to buy products for her/his pet. The goal of our investigation was to reduce the amount of data describing the preferences during the registration process and to generate the hints what to buy based on user’s interactions. Data registering was accomplished by aspect modification [2]. An aspect in AspectJ [25] was written. In this aspect, using EventLoggerAPI, several events were defined e.g. viewing products, adding a product into basket, checkout, signoff. Project containing EventLoggerAPI (in .war file) and the aspect were deployed on Tomcat application server. Load time weaving was used to combine application with its aspect. Several experiments on registered events and data were performed. Data left by a user were multiplied and randomly interleaved to simulate many users. In interaction data, on purpose some regularities were inserted e.g. clients buying cats are also buying fishes, because they find it cheaper than buying special food. We wanted to check if ELM will be able to detect such regularities using data mining algorithms. From the beginning of our experiments the events entering an application page and filling formats were registered.
Knowledge Mining with ELM System
87
4.1 Knowledge Mining In this section the knowledge mining process with ELM system working on jpetstore application is briefly described. Association and classification learning were explored. 4.1.1 Classification Learning The classification learning process was used to find a simple classifier giving the client’s favorite category based on the contents of his/her basket and the price of all products. A special event, named PET_STORE_CHECKOUT, a kind of CartEvent was defined. The definition is given in Fig. 3.
Fig. 3. Definition of event PET_STORE_CHECKOUT and its parameters categories and prices instead of product names
The k-fold cross-validation [10] technique for used for splitting the data into training and testing sets. The data is split into k approximately equal partitions. Each partition is in turn used for testing and the remainder for training. Learning procedure is executed k times on different training sets. k error estimates are averaged to yield an overall error estimate. Several classification learning algorithms were executed. We started with simple ZeroR algorithm which was able to find Reptiles as the majority category. Next, several algorithms building decision trees i.e. ID3, C4.5, PART were executed with different options. Filtering of data e.g. mergetwovalues were also applied. The results are presented in Table 1. The best results can be observed for C4.5 and PART algorithms. Table 1. Comparison of results of classification algorithms. Columns contain the number of data, percentage of correctly and incorrectly classified cases. data products categories categories categories categories categories products categories
no 51 21 21 67 67 67 51 67
Alg. ID3 ID3 C4.5 C4.5 C4.5 C4.5 C4.5 PART
Correct 68.6% 68.7% 72.4% 79.1% 83.5% 77.6% 80.3% 80.5%
Incorrect 5.8% 17.2% 27.5% 20% 16.5% 22.3% 19.6% 19.4%
Not classified 25.4% 13.7% 0 0 0 0 0 0
option useLaplace -M3 binarySplit -
In Fig. 4 the tree obtained by ID3 algorithm on data with reduced number of attributes is presented. On this tree a rule that: “for clients taking the first product from CATS and third - FISH the favorite category is CATS”. Other rule which can be seen on this tree says, that if they are buying more CATS (C1=CATS and C3=CATS)
88
I. Bluemke and A. Orlewicz
Fig. 3. Decision tree obtained by ID3 algorithm on data with reduced number of attributes
theirs favorite animal is REPTILE. These rule are the result of the phenomenon on purpose introduced in interactions that users are buying fishes as food for cats and cats as food for reptiles. In Fig. 5 the tree generated by C4.5 algorithm with binary test are presented.
Fig. 4. Decision tree obtained by C4.5 algorithm on data with full number of attributes and binary tests
4.1.2 Association Learning Three association-rule learners from weka library were examined: Apriori, Tertius and PredictiveApriori [10, 3]. A PET_STORE_VIEW event was defined in ELM system. This event was registered if a user entered a system page to see the description of a product or an exemplary. This event contains an additional parameter the name of a product or an exemplary. The data obtained from logging PET_STORE_VIEW events were grouped with the session identifier. The goal was to obtain records describing all categories, products and exemplary viewed by a user during one session. These data were filtered and the association-rule learners were executed by appropriate scripts in ELM. Some problems were observed. PredicitiveApriori ended with an error “insufficient stack”. Tertius was working for a long time and was not able to produce
Knowledge Mining with ELM System
89
reasonable results. Apriori also doesn’t generate reasonable results. To solve this problem the set of attributes was reduced. The generated rules are presented in Fig. 6 - 8. These rules concern some regularities on “not viewing some products” and can be translated into clients behavior in “real” world.
Fig. 5. Rules obtained by Apriori algorithm on data concerning event PET_STORE_VIEW with some attributes removed
Fig. 6. Rules obtained by Tertius algorithm on data concerning event PET_STORE_VIEW
Fig. 7. Rules obtained by PredictiveApriori algorithm on data concerning event PET_STORE_VIEW with some attributes removed
90
I. Bluemke and A. Orlewicz
Algorithm Apriori (Fig. 8) found a rule: “clients not buying reptiles are not buying roof skeeper cats as well”. This rule is a consequence of the regularity on purpose introduced in data that “clients buying reptiles are also buying cats as a food for reptiles”. In the results generated by Tertius it can be seen that “clients not buying food for kittens are also not buying cats” which is obvious. PredictiveApriori also generated obvious rule that: ”clients not buying reptiles and not buying food for them are also not buying cats” 4.2 Results of Experiment However in the above described case study a simple internet application was examined the obtained results are very promising and close to “real world”. In the knowledge mining process typical activities were practiced. Such activities will also appear in more complex internet applications. This case study revealed advantages and disadvantages of ELM system. Grace to aspect modification the ELM system can be easily “joined” with any Java application, even if the source code of this application is not available. The amount of AspectJ code which must be prepared is limited (aspect, pointcut). ELM system is resilient, different types of events, even some defined by user, can be logged. The attributes of the registered data and its number can be easily changed. User is able to monitor the knowledge mining process and tune it. Other advantage of ELM system is the support in data processing phase: available different filters and especially the ability to group data. Currently the system is not able to apply a sequence of filters and such feature could facilitate the data processing phase. Other functionality, very convenient in data processing phase, which should be considered in future version of ELM could be splitting the data into training and testing sets (currently cross validation is used [10]).
5 Conclusions Knowledge about users and understanding user needs is essential for many web applications. Using algorithms for knowledge discovery it is possible to find many interesting trends, obtain pattern of user behavior, better understand user needs. We presented briefly in section 3 general system for web usage mining and business intelligence reporting. We can not compare our small system to giants like Google, Yahoo or Amazon. To our best knowledge, the details about web usage mining processes in these systems are not published so we can not make more thorough references. ELM system is flexible and easy to integrate with Java web applications by aspect modification. It is able to collect and store data from users interactions from many applications in one data base and analyze them using different knowledge discovery algorithms. A user of our system may add own algorithms or tune implemented ones, as well as decide what events should be captured, filtered and which pattern discovery method should be used. Others WUM systems are usually strictly associated with a portal or one application, use server log files with “fixed contents” and contain one group of pattern discovery methods.
Knowledge Mining with ELM System
91
Simple example, described in section 4, shows, that using ELM system containing data mining algorithms some useful information from user interactions can be deduced. To obtain more convincing results similar experiments but on a “real” application should be conducted.
References 1. Bluemke, I., Orlewicz, A.: System for knowledge mining in data from interaction between user and application. In: Man-Machine Interactions, Advances in Intelligent and Soft Computing, vol. 59, pp. 103–110. Springer, Heidelberg (2009) 2. Bluemke, I., Billewicz, K.: Aspects in the maintenance of compiled programs. In: Proceedings of Third International Conference on Dependability of Computer Systems, DepCoSRELCOMEX 2008, pp. 253–260 (2008) 3. weka, http://www.cs.waikato.ac.nz/ml/wek 4. Srivastava, J., et al.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2), 12–23 (2000) 5. Srivastava, J., Desikan, P., Kumar, V.: Web mining – concepts, applications and research directions. In: Data Mining: Next Generations, Challenges and Future Directions. AAAI/MIT Press, Boston (2003) 6. Berendt, B., Stumme, G., Hotho, A.: Usage Mining for and on the Semantic Web. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 264–278. Springer, Heidelberg (2002) 7. Nina, S.P., et al.: Pattern discovery of web usage mining. In: Int. Conf. on Computer Tech. and Devel. IEEE Xplore (2009), 978-0-7695-3892-1/09 8. Etminani, K., Delui, A.R., Yaneshsari, N.R., Rouhani, M.: Web usage mining: discovery of the users’ navigational pattern using SOM. IEEE Xplore (2009), 978-1-4244-4615-5/09 9. Chen, J., Wei, L.: Research for web usage mining model. In: Computational Intelligence for Modelling. Control and Automation, vol. 8 (2006), ISBN: 0-7695-2731-0 10. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005) 11. Clark, L., et al.: Combining ethnographic and clickstream data to identify user web browsing strategies. Information Research 11(2) (2006) 12. Nasraoui, O.: World wide web personalization. In: Wang, J. (ed.) Encyclopedia of Data Mining and Data Warehousing. Idea Group, USA (2005) 13. Analog (04, 2009), http://www.analog 14. Baraglia, R., Palmerini, P.: Suggest: A web usage mining system. In: Proceedings International Conference on Information Technology: Coding and Computing, pp. 282–287 (2002), ISBN:0-7695-1503-1 15. Botia, J.A., Hernansaez, J.M., Gomez-Skarmeta, A.: METALA: a distributed system for Web usage mining. In: Mira, J., Álvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 703–710. Springer, Heidelberg (2003) 16. de Castro Lima, J., et al.: Archcollect front-end: A web usage data mining knowledge acquisition mechanism focused on static or dynamic contenting applications. In: ICEIS, vol. 4, pp. 258–262 (2004) 17. Lu, M., Pang, S., Wang, Y., Zhou, L.: WebME - web mining environment. In: International Conference on Systems, Man and Cybernetic, vol. (7) (2002)
92
I. Bluemke and A. Orlewicz
18. Cercone, X.H.: An OLAM framework for Web usage mining and business intelligence reporting. In: Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, vol. (2), pp. 950–955 (2002) 19. Amazon, http://www.amazon.com 20. Google, http://google.com 21. Google news, http://news.google.com 22. Colet, E.: Using data mining to detect fraud in auctions, DSstar (2002) 23. Yahoo, http://www.yahoo.com 24. Heydari, M., et al.: A Graph –Based Web usage mining method considering client side data. In: Int. Conf. on Electrical Eng. Informatics, Malaysia. IEEE Xplore (2009), 978014244-4913-2/09/ 25. AspectJ, http://www.eclipse.org/aspectj/ 26. JPetStore, http://sourceforge.net/projects/ibatisjpetstore/