Valsamidis, S., Kontogiannis, S., Kazanidis, I., Theodosiou, T., & Karakos, A. (2012). A Clustering Methodology of Web Log Data for Learning Management Systems. Educational Technology & Society, 15 (2), 154–167.

A Clustering Methodology of Web Log Data for Learning Management Systems

Stavros Valsamidis1, Sotirios Kontogiannis1, Ioannis Kazanidis2, Theodosios Theodosiou2 and Alexandros Karakos1

1 Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece // 2 Accounting Department, Kavala Institute of Technology, Agios Loukas, 65404, Kavala, Greece // [email protected] // [email protected] // [email protected] // [email protected] // [email protected]

(Submitted July 2, 2010; Revised January 14, 2011; Accepted April 1, 2011)

ABSTRACT
Learning Management Systems (LMS) collect large amounts of data. Data mining techniques can be applied to analyse their web log data. Instructors may use these data to assess and measure their courses. In this respect, we have proposed a methodology for analysing LMS courses and students' activity. This methodology uses the Markov CLustering (MCL) algorithm for clustering the students' activity and the SimpleKMeans algorithm for clustering the courses. Additionally, we provide a visualisation of the results using scatter plots and 3D graphs. We propose specific metrics for the assessment of courses based on course usage. These metrics were applied to data originating from the LMS log files of the Information Management Department of the TEI of Kavala. The results show that these metrics, if combined properly, can quantify quality characteristics of the courses. Furthermore, the application of the MCL algorithm to students' activities provides useful insights into their usage of the LMS platform.

Keywords
E-learning, Web mining, Clustering, Metrics

Introduction

Learning Management Systems (LMSs) offer many methods for the distribution of information and for communication between the participants in a course. They allow instructors to deliver assignments to students, produce and publish educational material, prepare assessments and tests, tutor distant classes, and provide archive storage, news feeds and student interaction with multimedia. They also enhance collaborative learning with discussion forums, chats and wikis (Romero et al., 2008a). Some of the most well-known commercial LMSs are Blackboard, Virtual-U, WebCT and TopClass, while Moodle, Ilias, Claroline and aTutor are open source, freely distributed LMSs (Romero et al., 2008a). In Greece, the Greek University Network (GUNet) uses the platform Open eClass (GUNet, 2009), which is an evolution of Claroline (Claroline, 2009). This system is an asynchronous distance education platform which uses Apache as its web server and MySQL as its database server, and has been implemented in PHP. Open eClass is open source software under the General Public Licence (GPL).

Due to the volume of data, one of the main problems of any LMS is the lack of exploitation of the acquired information. Most of the time, these systems produce reports with statistical data which, however, do not help instructors draw useful conclusions either about the course or about the students; they are useful only for the administrative purposes of each platform. Moreover, the existing e-learning platforms do not offer concrete tools for the assessment of user actions and course educational content.

Data and web mining

Data mining is the search for relationships and patterns that exist in large databases but are 'hidden' among the vast amounts of data. It is part of the whole Knowledge Data Discovery (KDD) process. KDD is the complete set of processes for knowledge discovery in databases that aims at the detection of valid information and pattern recognition in raw data (Kantardzic, 2003). The classical KDD process includes five phases: data pre-processing, data transformation, data mining, data visualization and data interpretation. The first two phases select and "clean" a given dataset. The next phase, data mining, is essential in the whole KDD process; through it, non-trivial patterns in the data are found with the use of algorithms. Data mining consists of tasks such as classification, clustering, time series discovery or prediction, and association rule mining (Witten and Eibe, 2000).

Web mining (Srivastava et al., 2000) is a sub-category of data mining in which data mining techniques are applied to extract knowledge from web data. There are three main web mining categories from the viewpoint of the data used: Web content mining, Web structure mining and Web usage mining (Spiliopoulou, 1999; Kosala and Blockeel, 2000; Bing, 2007). Web content mining is the process used to discover useful information from text, image, audio or video data on the web. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. Web Usage Mining (WUM) is the application of data mining to analyze and discover interesting patterns in user data on the web. The usage data record the user's behavior when he or she browses or makes transactions on the web site.

The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it is possible to determine information such as the number of accesses to the server, the times or time intervals of visits, as well as the domain names and the URLs of users of the web server. However, in general, these tools provide little or no analysis of data relationships between the accessed files and directories within the web space.
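To make the kind of reporting offered by these early tools concrete, the following is a minimal sketch (not taken from the paper) that counts page accesses per URL in an Apache Common Log Format file; the file name and the simple success-only filter are illustrative assumptions.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident user [time] "METHOD path PROTO" status bytes
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def access_counts(log_path):
    """Count successful (2xx) page requests per URL, a basic 'user activity' report."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m and m.group("status").startswith("2"):
                counts[m.group("path")] += 1
    return counts

if __name__ == "__main__":
    # "access.log" is a hypothetical file name used only for illustration
    for url, hits in access_counts("access.log").most_common(10):
        print(f"{hits:6d}  {url}")
```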

Data mining in e-learning

Data mining techniques have been used to discover sequential patterns of students' web usage through the analysis of log file data (Romero et al., 2007). Server log files store information about the page requests of each individual user. After the pre-processing phase of data analysis, this information can be considered as a per-user ordered set of web page requests from which it is possible to infer user navigation sessions (a minimal illustration of such session inference is sketched at the end of this section). The extraction of sequential patterns has proven to be particularly useful and has been applied to many different educational tasks (Romero et al., 2008b).

In this work, a methodology is proposed for the creation of a software tool to be incorporated in e-learning platforms. With it, analyses of students' usage can be made in order to motivate instructors to increase the use of the platform for the needs of their courses. Instructors can benefit from the evaluations resulting from the proposed methodology when trying to achieve a good place in the ranking of course usage and to improve their courses according to the tool's indications on the relevant units of the content. This improvement in the educational content will allow students to profit from the asynchronous study of courses by using up-to-date and optimal educational material.

Next, we describe work related to our research. Then our methodology is presented in detail, followed by the experiments and the results. Finally, we discuss the proposed methodology and describe future work.
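As the illustration promised above, the sketch below groups each user's ordered page requests into navigation sessions using an inactivity timeout; the paper does not prescribe a specific algorithm, and the 30-minute threshold and record layout are assumptions.

```python
from datetime import timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not specified in the paper

def infer_sessions(requests):
    """Split (user, timestamp, url) records into per-user navigation sessions.

    `requests` must be an iterable of tuples sorted by (user, timestamp).
    Returns a list of sessions, each a list of URLs in visit order.
    """
    sessions = []
    for _, user_requests in groupby(requests, key=lambda r: r[0]):
        current, last_time = [], None
        for _, ts, url in user_requests:
            if last_time is not None and ts - last_time > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(url)
            last_time = ts
        if current:
            sessions.append(current)
    return sessions
```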

Related work

Data mining may be applied to (i) traditional educational systems, (ii) web-based courses, (iii) LMSs like Moodle, Claroline, WebCT, etc., as well as (iv) adaptive and intelligent educational systems. Romero and Ventura (2007) surveyed several research studies in e-learning environments. One of their conclusions was that there is still no standardized methodology for applying data mining techniques in this field. This lack of a methodology motivated us to propose a new approach that uses existing techniques in a different way.

Traditional educational data sets are normally small (Hamalainen et al., 2006) compared to files used in other data mining fields, such as e-commerce applications, which may involve thousands of clients (Srinivasa, 2005). This is due to the typically small size of the classroom, although it varies depending on the type of the course (elementary, primary, adult, higher, tertiary, academic and special education); the corresponding transactions are therefore also fewer. The user model is also different in the two kinds of system (Romero & Ventura, 2007).

There are several commercial tools, such as DB Miner, Speed Tracer, Commerce Trends, Clementine, etc. (Galeas, 2009), and several free tools for WUM, like Analog, Page Gather and SUGGEST.


The Analog system (Yan et al., 1996) consists of two main components, performing online and offline processing of web server activity data. Past user activity is recorded in server log files, which are processed to form clusters of user sessions. The online component builds active user sessions, which are then classified into one of the clusters found by the offline component. Perkowitz and Etzioni (1999) proposed Page Gather, a WUM system that builds index pages containing links to similar pages. Its main hypothesis is that users behave coherently during their navigation. It deals with page clusters instead of session clusters and bases them on this assumption, called visit coherence, i.e. pages within the same session are in general conceptually related. The SUGGEST WUM system (Baraglia and Palmerini, 2002) was designed to produce links to pages of potential user interest. It can provide useful information to make web user navigation easier and to optimize web server performance. It was implemented as a module for the Apache web server.

In addition to the above-mentioned general-purpose WUM tools, there are also several specialized ones used in e-learning platforms. CourseVis (Mazza and Dimitrova, 2007) is a visualization tool that tracks web log data from an LMS. By transforming these data, it generates graphical representations that keep instructors well informed about what precisely is happening in distance learning classes. GISMO (Mazza and Milani, 2004) is a tool similar to CourseVis, but provides different information to instructors, such as details of students' use of the course material. Sinergo/ColAT (Avouris et al., 2005) is a tool that acts as an interpreter of students' activity in an LMS, offering interpretative views of the activity developed by students in a group learning collaborative environment. Mostow et al. (2005) describe a tool that shows a hierarchical representation of tutor-student interaction taken from log files. MATEP (Zorrilla and Álvarez, 2008) is another tool acting on two levels. First, it combines data from different sources, suitably processed and integrated; these data originate from e-learning platform log files, virtual courses, and academic and demographic data. Second, it feeds them to a data webhouse which provides static and dynamic reports. An automatic personalization approach is also proposed by Khribi et al. (2009). It provides online automatic recommendations for active learners, without requiring their explicit feedback, through two modules: an off-line module which pre-processes data to build learner and content models, and an online module which uses these models on the fly to recognize the students' needs and goals and predict a recommendation list.

All these tools are based on the analysis of log files, as is our methodology. In particular, the Analog system (Yan et al., 1996) and the approach proposed by Khribi et al. (2009) seeded the idea for a final tool acting on two levels: online and off-line. However, none of the aforementioned tools proposes and uses indexes calculated from the pages and sessions accessed by the users. These indexes are derived after the pre-processing of the raw data contained in the log files.

Methodology

The proposed methodology consists of three main steps: logging, pre-processing and clustering. These steps are based on the framework described in detail in a study by Kazanidis et al. (2009) and facilitate the extraction of useful information from the data logged by a web server running an LMS. Instructors can benefit from the methodology's course evaluation indexes. The main advantages of the proposed methodology are that: (i) it uses data mining techniques for user and course evaluation; (ii) it proposes new indexes and metrics to be used with data mining algorithms; (iii) it can be easily adapted to any LMS; and (iv) it visualizes the results in a user-friendly environment and allows interactive exploitation of the data.
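The clustering step itself is detailed later in the paper; as a rough illustration of the kind of course clustering the abstract describes (SimpleKMeans over per-course usage indexes), the sketch below uses scikit-learn's KMeans as a stand-in for Weka's SimpleKMeans. The library choice, the feature set and the number of clusters are all assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for Weka's SimpleKMeans

def cluster_courses(index_table, n_clusters=3):
    """Cluster courses by their usage indexes (e.g., Sessions, Pages, UPCS, Enrichment).

    `index_table` maps course_id -> dict of numeric index values; n_clusters is an assumption.
    """
    courses = sorted(index_table)
    features = sorted(next(iter(index_table.values())))
    X = np.array([[index_table[c][f] for f in features] for c in courses], dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return dict(zip(courses, labels))
```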


Logging the data

This step involves the logging of specific data from e-learning platforms. In more detail, the data recording module is embedded in the web server of the e-learning platform and records specific e-learning platform fields. Specifically, eleven (11) fields (request_time_event, remote_host, request_uri, remote_logname, remote_user, request_method, request_time, request_protocol, status, bytes_sent, referer, and agent) and user requests from different courses are recorded with the use of an Apache module, developed in the Perl programming language, as a first step.

The Apache web server supports the following log file configurations: Common Log Format, Extended Log Format, Cookie Log Format and Forensic Log Format. We used the last of these because it has the advantage of storing server requests both before and after server processing. Thus, there are two records per client request in the log file. For the recognition of each request, a unique ID is assigned to it, together with a pair of signs (+/-) that signals whether a record is the first or the second record of the request. The development of such a module has the following advantages: (i) rapid storage of user information, since it is executed straight from the server API and not by the e-learning application, and (ii) the produced data are independent of the specific formats used by the e-learning platform.
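A minimal sketch of how the paired forensic-log records could be matched up, assuming the standard mod_log_forensic layout (a '+' line carrying the ID, request line and headers separated by '|', and a matching '-' line when the request completes). The file name and field handling are illustrative; this is not the paper's Perl module.

```python
def completed_requests(log_path):
    """Yield the request line of every request that has both a '+' (begin) and '-' (end) record."""
    pending = {}  # forensic ID -> request line
    with open(log_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("+"):
                ident, _, rest = line[1:].partition("|")
                pending[ident] = rest.split("|")[0]   # first field is the request line
            elif line.startswith("-"):
                request = pending.pop(line[1:], None)
                if request is not None:
                    yield request

if __name__ == "__main__":
    for req in completed_requests("forensic.log"):   # hypothetical file name
        print(req)
```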

Data pre-processing

The data in the log file contain noise such as missing values, outliers, etc. These values have to be pre-processed in order to prepare them for data mining analysis. Specifically, this step filters the recorded data delivered from step 1. It uses outlier detection and removes extreme values. This step is not performed by the e-learning platform and thus can be embedded into a variety of LMSs. The log file produced in the previous step is filtered so that it includes only the following fields: (i) courseID, which is the identification string of each course; (ii) sessionID, which is the identification string of each session; (iii) page Uniform Resource Locator (URL), which contains the requests for each page of the platform that the user visited. A sketch of this filtering step is given after Table 1.

Although these fields contain information about the e-learning process, more indexes and metrics (Table 1) are proposed in order to adequately facilitate the evaluation of course usage. While there are many metrics in web usage analysis for e-commerce (Lee et al., 1999), there is a lack of corresponding metrics in e-learning. A simple approach was taken by Nagi et al. (2008) with "Reports", which was embedded into the Moodle LMS for evaluating courseware quality and student interaction with the system. The ratio views/posts indicates the quality of the activity of a course, where a "view" means that data about access to an object is not saved into the database, whilst a "post" means that anything newly created and uploaded is saved in the database. Usability metrics for e-learning were also proposed by Wong et al. (2003), where fuzzy systems were used to model each of several factors and to reflect how each affects the overall value of e-learning usability.

Table 1. Indexes name and description
Index name - Description of the index
Sessions - The number of sessions per course viewed by users
Pages - The number of pages per course viewed by users
Unique pages - The number of unique pages per course viewed by users
Unique Pages per CourseID per Session (UPCS) - The number of unique pages per course viewed by users per session
Enrichment - The enrichment of courses
Disappointment - The disappointment of users when they view pages of the courses
Interest - The one's complement of the disappointment
Quality index - The quality of the course, combining Enrichment and Interest
Final score - The final score of the course
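As announced above, here is a minimal sketch of the filtering step, assuming the logged records are available as rows with courseID, sessionID and URL columns. The column names and the simple IQR-based outlier rule are assumptions, not the paper's implementation.

```python
import csv
from collections import Counter
from statistics import quantiles

def filter_fields(rows):
    """Keep only the three fields used downstream: courseID, sessionID, page URL."""
    return [(r["course_id"], r["session_id"], r["request_uri"]) for r in rows]

def drop_outlier_courses(records, k=1.5):
    """Drop records of courses whose total page views fall outside Tukey's IQR fences.

    The IQR rule and the fence factor k are illustrative assumptions.
    """
    views = Counter(course for course, _, _ in records)
    q1, _, q3 = quantiles(list(views.values()), n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    keep = {c for c, v in views.items() if low <= v <= high}
    return [r for r in records if r[0] in keep]

if __name__ == "__main__":
    with open("filtered_log.csv") as f:            # hypothetical output of the logging step
        rows = list(csv.DictReader(f))
    records = drop_outlier_courses(filter_fields(rows))
```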


First, the number of sessions and the number of pages were counted in order to calculate course activity. The unique pages index measures the total number of unique pages per course viewed by all users. Unique pages are the set of pages uniquely identified across user sessions per LMS course. The Unique Pages per Course per Session (UPCS) index expresses the unique user visits per course and per session; it is used in order to calculate activity in an objective manner. For example, some novice users may navigate in a course visiting a page more than once. UPCS eliminates duplicate page visits, counting the visits of the same user in a session only once. UPCS was first introduced in the framework (Kazanidis et al., 2009) under a slightly different name, UniquePCSession.

Enrichment is a new metric which is proposed in order to express the "enrichment" of each course in educational material. We defined the LMS enrichment index as the complement of the division of the unique pages per LMS course by the total number of course web pages:

Enrichment = 1 - (Unique Pages / Total Pages)    (1)

where Unique Pages is the number of unique pages per course viewed by users.
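A minimal sketch of how the activity indexes and the Enrichment metric of Eq. (1) could be computed from the filtered (courseID, sessionID, URL) records. Treating "Total Pages" as the total number of page views per course is my reading of the text, so take that, and the field layout, as assumptions.

```python
from collections import defaultdict

def course_indexes(records):
    """Compute Sessions, Pages, Unique pages, UPCS and Enrichment per course.

    `records` is an iterable of (course_id, session_id, url) tuples.
    """
    pages = defaultdict(int)                 # total page views per course
    sessions = defaultdict(set)              # distinct sessions per course
    unique_pages = defaultdict(set)          # distinct URLs per course
    session_pages = defaultdict(set)         # distinct (session, URL) pairs per course

    for course, session, url in records:
        pages[course] += 1
        sessions[course].add(session)
        unique_pages[course].add(url)
        session_pages[course].add((session, url))

    result = {}
    for course in pages:
        result[course] = {
            "Sessions": len(sessions[course]),
            "Pages": pages[course],
            "Unique pages": len(unique_pages[course]),
            "UPCS": len(session_pages[course]),
            # Eq. (1): Enrichment = 1 - Unique Pages / Total Pages (Total Pages assumed = Pages)
            "Enrichment": 1 - len(unique_pages[course]) / pages[course],
        }
    return result
```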
