Types of Online Hierarchical Repository Structures


Arnon Hershkovitz, Ronit Azran, Sharon Hardof-Jaffe, Rafi Nachmias Knowledge Technology Lab, School of Education, Tel Aviv University {arnonher, ronit122, sharonh2, nachmias}@post.tau.ac.il

Abstract The main purpose of this study is to empirically research online hierarchical repositories of items presented to university students in Web-supported course websites, using Web mining methods. Data from 1,747 courses was collected, and the use of online repositories of content items in these courses was analyzed using Cluster Analysis to answer the three research questions: 1) What is the extent of content items presented? 2) Which types of hierarchical structures of content items are empirically revealed? and 3) What are the associations between the different types of structures and repository size (number of items), course size, and academic discipline? Results suggest five types of repository structures: Main-folder, Extensive Filing, Flat Small Folders, Pile in Hierarchy Filing, and Pile in Flat Filing. Furthermore, associations between structure type and repository size, course size, and academic discipline were found. Discussion of these results is provided.

Keywords: Knowledge repository; Hierarchical repository; Learning Management Systems; Data Mining.

1. Introduction

Nowadays the Internet is an integral part of the teaching/learning process at many universities and leading academic institutions worldwide. Of the many applications enabled by new technologies, the most commonly used in higher education are Learning Management Systems (LMS), e.g., Moodle, BlackBoard, which enable a wide range of Web-supported courses. An LMS provides the instructor with tools for uploading content items, facilitating communication, and managing the learning. As suggested by previous research, most lecturers perceive LMS’s as content providers rather than communication and learning management facilitators (Shemla and Nachmias, 2007). In particular, instructors can upload content items into folders and subfolders, thus enabling a variety of hierarchical repository structures which are presented to the students in the online courses. This study examines to what extent the LMS repositories are used, and which of the various repository structures are presented to the students. Revealing the types of repository structures presented to students in Web-supported courses is of importance due to their implications for content consumption. Furthermore, this understanding might assist with evaluating the actual use of content delivery tools available in LMS’s. However, research identifying the different types of these hierarchical structures is lacking. In order to empirically reveal the types of hierarchical structures presented to students in Web-supported academic courses at campus level, we used data mining methods. Data mining is a general term for a large set of techniques that focus on finding interesting patterns in large databases, and is an emerging methodology in education research (Romero and Ventura, 2007).


2. Background

2.1. Online Repositories in Web-supported Courses

Many higher-education institutions are currently integrating Web applications to support their traditional teaching frameworks. Blended learning, i.e., the blending of new and innovative Web-based modules into the traditional learning process, has become more and more widespread in the academic world, and many in the learning community, both instructors and students, have started using the Web on a daily basis according to their needs (Mioduser and Nachmias, 2002; Bonk and Graham, 2006; Allen and Seaman, 2008). Learning Management Systems (LMS), e.g., Moodle, BlackBoard, enable the instructors to develop websites for their courses to support face-to-face teaching by means of different tools. These systems supply the instructors with easy-to-use ready-made components which enable three major modules: a) Content delivery, mainly used for presenting and organizing course material (e.g., relevant articles, instructional presentations, multimedia items); b) Communication tools, facilitating synchronous or asynchronous interaction between students and instructors or peers, used for collaborative assignments and learning discussions; and c) Course management, which supplies the instructor with administrative tools, such as announcements, syllabi, and attendance and grade tracking (Simonson, Smaldino, Albright & Zvacek, 2006; Romero, Ventura & Garcia, 2008). Although most of the LMS’s offer an enriched environment that goes beyond the usual content management tools, in practice the most prominent use of these systems is for transferring information and increasing accessibility of learning materials, and instructors consider the course website to be a platform for storing and sharing content items (Bonk, 2001; Nachmias and Segev, 2003; Shemla and Nachmias, 2007; Roqueta, 2008; Lonn and Teasley, 2009). Usually the content modules in these systems enable the construction of a hierarchical repository of information items; consequently the instructor is able to create folders and upload files. The large variety of possible hierarchical structures of these repositories constitutes the core of this research.

2.2. Hierarchical Repositories

Hierarchical organization tools are widely used in many applications designed for novice users to build their own knowledge archives; the most common example is desktop systems, which allow users to create personal information archives. The traditional design of hierarchical digital archiving tools was derived from the belief that the best way to present computers to a novice user is by drawing an analogy to a situation which is familiar to the user, namely the organization of the office filing cabinet. Therefore, virtual folders are presented as the place to keep documents, just like the physical folders in the office (Carroll and Thomas, 1980; Rumelhart and Norman, 1981). Folders and files are arranged in a hierarchical tree structure (i.e., each item, except for the root, has a link to one parent item) with branches (folders) and leaves/nodes (files); visualizing this structure reveals the relationships between elements and groups within the repository (Mullet and Sano, 1995; Shneiderman, 1997; Marsden and Cairns, 2004). Although hierarchical information structures have been criticized, mainly because of their single-inheritance principle, they have two main advantages: a) Information items are categorized into meaningful groups; b) Retrieval is easier, since the user is able to track his or her location in the archive during navigation (Dourish et al., 2000; Nielsen, 2000). Today, the strengths of the hierarchical structure are once again appreciated, and new methods of automatic organization of information items highlight the benefits of categorization and convenient navigation of search results compared with other information structures (Yee, Swearingen, & Hearst, 2003; Kaki, 2005; Hearst, 2006; Xing et al., 2008).

Hierarchical organizational paradigms are commonly used in personal collections of documents, emails and favorite links (Dourish, Edwards, LaMarca, & Salisbury, 1999). Therefore, personal information management studies have examined the strategies by which users organize information with hierarchical tools, and it has been shown that there are two main organization strategies: piling and filing (Malone, 1983; Whittaker and Sidner, 1996; Abrams, Baecker, & Chignell, 1998; Fisher, Brush, Gleave, & Smith, 2006). The piling strategy traditionally describes users who keep information items in one pre-labeled directory (e.g., 'My Documents', 'Inbox'); hence a pile of information items is created, from which it could be difficult to retrieve items. However, as recent research suggests, a hidden pile might also be created in a folder within the hierarchy (Hardof-Jaffe, Hershkovitz, Abu-Kishk, Bergman, & Nachmias, 2009). Filing is a strategy in which the user divides the information items into many labeled folders at different depths within the hierarchy. Previous studies have shown that most users employ a mixture of these two strategies, creating various hierarchical structures which differ according to the depth and width of the tree, folder size and pile size (Whittaker and Hirschberg, 2001). Most of the studies that have investigated structures of hierarchical repositories built by computer users have used traditional methodologies (e.g., interviews and screenshots). Our approach to this task is different: we aim to automatically collect data, via the Web, that describes the online hierarchical repositories, and to analyze it using data mining techniques. This process is called Web mining.

2.3. Web Mining

Web mining is the application of data mining techniques to large datasets which originate from the Web, in order to nontrivially identify valid, novel, potentially useful and ultimately understandable patterns in them (Etzioni, 1996; Fayyad, Piatetsky-Shapiro, & Smyth, 1996). The most common category of Web mining is Web usage mining, the main purpose of which is to discover patterns of usage of Websites by analyzing log files that document every user's access to the site (Cooley, Mobasher, & Srivastava, 1997). Massively used in e-commerce (e.g., by amazon.com to reveal patterns of utilization of their huge online store), where it aims to eventually increase sales and profit, Web mining is an emerging methodology in education, aiming to improve learning and teaching processes (Zaiane and Luo, 2001; Nachmias and Hershkovitz, 2006; Castro, Vellido, Nebot, & Mugica, 2007; Romero and Ventura, 2007). Web mining has been applied in education research for a variety of purposes and with a wide range of techniques. This comes as no surprise: on the one hand, the use of online learning environments has been growing at a rapid pace at universities worldwide, while on the other hand, traditional assessment and evaluation methods are not useful for such systems, in which students often study from a distance and large-scale examination is needed. From a data mining point of view, the analysis of data retrieved from LMS is relatively convenient because: a) Users (due to most of the institutional policies) are identified by username (and not only by IP address), a fact that dramatically simplifies the analysis of this data and enables a cross-course examination and triangulation with other external data (e.g., grades); and b) Detailed log files of these systems are usually easily and conveniently available to instructors, administrators and researchers and can be collected on different levels, e.g., student level, course level, faculty level or institution level.
Previous studies have examined different aspects of using LMS in the teaching/learning process, such as investigating the actual usage of the system (Nachmias and Segev, 2003; Zorrilla, Menasalvas, Marín, Mora, & Segovia, 2005), analyzing student behavior (Talavera and Gaudioso, 2004; Zafra and Ventura, 2009), examining cost-effectiveness of Web-supported instruction (Cohen and Nachmias, 2009), exploring personal information organization by students (Hardof-Jaffe et al., 2009), and developing tools that aid instructors with improving their teaching by using the system (Mazza and Milani, 2005; Garcia, Romero, Ventura, & de Castro, 2009). Besides Web usage mining, there are two other categories of Web mining: Web content mining, which focuses on discovering useful information from the Web contents/data/documents, and Web structure mining, which aims to model the link structure of the Web (Kosala and Blockeel, 2000; Zaiane and Luo, 2001). We think of the current research as applying a combination of Web content mining and Web structure mining, as it examines documents (and folders in which these documents are stored) presented to students online, with the aim of revealing the underlying structure of a Website.

3. The Study

The main purpose of this research is to empirically investigate the types of online hierarchical structures of content items presented to university students in Web-supported courses. Three research questions are addressed in the study:

1. What is the extent of content items presented to university students in online repositories within Web-supported courses?
2. Which types of hierarchical structures of content items are empirically revealed?
3. What are the associations between the different types of structures and repository size (Number of Items), Course Size and Academic Discipline?


3.1. Research Variables

In order to study the extent and the types of online repositories, three groups of variables were defined, describing the size of the repositories, their structure, and the characteristics of the courses (independent variables); the variables are detailed in Table 1. An example repository is presented in Figure 1, and it is used to provide numerical examples of the variable values.

---------------------- Insert Table 1. about here --------------------

In Figure 1, a schematic hierarchical repository is presented. This repository contains 14 content items (Number of Items) and 8 folders (Number of Folders), hence the Average Folder Size is 14/8 = 1.75. The largest folder contains 5 items, so the value of Largest Folder Share is 5/14 ≈ 0.36. Overall, the Hierarchical Depth is 3, as this is the length of the longest branch (starting from the second folder from the left). When students enter this repository, they find 4 folders in the first hierarchical level (immediately under the root), therefore this is the value of the Visible Width. Taking the two previous variables, the Width-depth Proportion in this repository is 4/3 ≈ 1.33.

---------------------- Insert Figure 1. about here --------------------
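The arithmetic of this example can be reproduced with a short sketch. The folder layout below is hypothetical: Figure 1 is not reproduced here, so the layout was chosen only to match the counts quoted in the text (14 items, 8 folders, a largest folder of 5 items, depth 3, and 4 first-level folders). The function name and the dictionary representation are likewise assumptions made for illustration, and items are assumed to be counted in the folder that directly contains them.

```python
from typing import Dict, Tuple

FolderPath = Tuple[str, ...]


def repository_variables(folders: Dict[FolderPath, int]) -> Dict[str, float]:
    """Compute the size and structure variables for one repository.

    `folders` maps each folder path (a tuple of folder names, starting
    just under the root) to the number of items stored directly in it.
    """
    number_of_items = sum(folders.values())
    number_of_folders = len(folders)
    largest_folder = max(folders.values())
    # Depth is the length of the longest branch of folders.
    hierarchical_depth = max(len(path) for path in folders)
    # Visible width counts the folders immediately under the root.
    visible_width = sum(1 for path in folders if len(path) == 1)
    return {
        "number_of_items": number_of_items,
        "number_of_folders": number_of_folders,
        "average_folder_size": number_of_items / number_of_folders,
        "largest_folder_share": largest_folder / number_of_items,
        "hierarchical_depth": hierarchical_depth,
        "visible_width": visible_width,
        "width_depth_proportion": visible_width / hierarchical_depth,
    }


# Hypothetical layout consistent with the numbers quoted for Figure 1.
example = {
    ("A",): 2,
    ("B",): 1,
    ("B", "B1"): 1,
    ("B", "B1", "B1a"): 2,
    ("C",): 5,
    ("C", "C1"): 1,
    ("D",): 1,
    ("D", "D1"): 1,
}

values = repository_variables(example)
for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

Representing each folder by its full path keeps depth and first-level width easy to derive; a repository with items stored directly under the root would need an extra convention that this sketch does not cover.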

4. Methodology

4.1. Research Field

The research was carried out on all the Winter semester (2008/9) courses at Tel Aviv University which were accompanied by a Website within the HighLearn LMS (by Britannica Knowledge, http://www.britannica-ks.com/index.asp), N=1,747.

4.2. Data File

Raw data describing the contents of the online repositories in all courses (i.e., campus-wide) was extracted using SQL queries on the HighLearn databases. In this data file, each row corresponds to a single content item within the system, and documents the unique ID of the course to which the item belongs and its full path within the repository. The data file consisted of 72,753 rows (i.e., content items) from 1,747 courses.
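A minimal sketch of how such a row-per-item file could be grouped by course is given below. The row layout (a course ID followed by a slash-separated path ending in the file name), the sample rows, and the function names are all assumptions made for illustration; the actual HighLearn schema and path format are not described in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_by_course(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect the item paths of each course from (course_id, path) rows."""
    by_course: Dict[str, List[str]] = defaultdict(list)
    for course_id, path in rows:
        by_course[course_id].append(path)
    return dict(by_course)


def folders_of(paths: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Derive the set of folders (including parent folders) from item paths."""
    folders: Set[Tuple[str, ...]] = set()
    for path in paths:
        parts = path.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            folders.add(tuple(parts[:depth]))
    return folders


# Hypothetical rows: one row per content item.
rows = [
    ("course-1", "Lectures/Week1/slides.ppt"),
    ("course-1", "Lectures/Week1/notes.pdf"),
    ("course-1", "Readings/paper.pdf"),
    ("course-2", "Syllabus/syllabus.doc"),
]

courses = group_by_course(rows)
summary = {
    course_id: (len(paths), len(folders_of(paths)))
    for course_id, paths in courses.items()
}
print(summary)  # (number of items, number of folders) per course
```

Because folders are inferred from item paths, empty folders would not be counted by this sketch.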

4.3. Procedure

To describe the size and structure of the repositories at the course level (variables A-H), data from the data file was aggregated and the research variables were computed. A new table was formed, holding 1,747 rows (courses) and 8 columns (variables). In addition, two independent variables describing the Course Size and Academic Discipline (variables I, J) were added to this table.

The research was carried out in three stages, corresponding with the three research questions: 1) Examining the extent of content items in the repositories; 2) Revealing types of repository structures, using Two-step Cluster Analysis; and 3) Looking for associations between structure types and three other variables: Repository Size, Course Size and Academic Discipline, using ANOVA tests. In the second and third stages, only courses whose repositories consisted of 15 content items or more were considered, N=1,203.

In order to identify different types of structures of the repositories (stage 2), we used Cluster Analysis, an exploratory and discovery data analysis tool for classifying cases (in this study, courses) into groups or clusters, so that members of the same cluster are "similar" to each other and members of different clusters are "dissimilar" to each other (Aldenderfer & Blashfield, 1984). The similarity/dissimilarity is determined by the values of the variables according to which the Cluster Analysis was done. Clustering by two variables might be understood as finding groups of nearby points on a two-axis scatter plot; taking three variables into consideration, these groups are more like clouds on a three-axis scatter plot. Clustering by more than three variables is clearly hard to visualize. For the Cluster Analysis we used SPSS, and as a proximity method we used log-likelihood (Heckerman and Meila, 1998).
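The intuition of "groups of nearby points on a two-axis scatter plot" can be illustrated with a small sketch. Note that this is not the procedure used in the study: it swaps in plain k-means with Euclidean distance in place of the SPSS Two-step algorithm with log-likelihood distance, and the six points, which stand for hypothetical (Hierarchical Depth, Largest Folder Share) pairs, are invented for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], initial: Sequence[Point],
           iterations: int = 10) -> List[int]:
    """Plain k-means: return the cluster index assigned to each point."""
    centroids = list(initial)  # copy so the caller's list is not modified
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Hypothetical (depth, largest folder share) pairs: three flat,
# pile-like repositories and three deep, evenly filed ones.
points = [(1, 0.90), (1, 0.95), (1, 0.85), (3, 0.15), (4, 0.20), (3, 0.10)]
labels = kmeans(points, initial=[points[0], points[3]])
print(labels)
```

In practice, variables measured on different scales are usually standardized before distances are computed; that step is omitted here to keep the sketch short.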

5. Results

5.1. Extent of Content Items in the Repositories

The two size variables (variables A, B) were examined in order to determine the extent of the online repositories in terms of files and folders; results are summarized in Table 2. In the analysis of Number of Items, it was found that the range of values is wide, between 1 and 1,029, with an average of 41.64 files (SD=69.10). The distribution of this variable is far from normal, as seen by the quartiles: the first quartile is 11, the median is 24, and the third quartile is 45. Only 27.2% of the courses had repositories larger than the average. When looking at the very large repositories formed, we see that 388 of the courses (21.6%) have repositories of 50 files or more, and 133 of the courses (7.6%) have repositories of 100 files or more. A similar distribution is also observed in the variable Number of Folders, the values of which range between 1 and 185. The average of this variable is 10.69 (SD=16.78), and its three quartiles are: 3 (first), 6 (median), 13 (third). Only 31.1% of the course repositories have more folders than the average. It is interesting to note that most of the repositories barely use folders and have only 6 or fewer (943 courses, 54%), while there are a few courses that make extensive use of folders: 40 courses (2.3%) with 50 folders or more, 23 courses (1.3%) with 100 folders or more.

---------------------- Insert Table 2. about here --------------------

5.2. Types of Repository Structures

Descriptive statistics of the repository structure variables (C-H) are given in Table 3. The mean of Average Folder Size (C), which is an average of averages, is 5.87 (SD=5.13), with first, second and third quartiles of 2.82, 4.63, and 7.63, respectively; the mean exceeds the median, hence this variable is positively skewed. There is a small portion of repositories with an Average Folder Size of 20 or more (21 of 1,203; less than 2% of the population). Largest Folder (D) has an average of 16.69 (SD=9.85). While 7.5% of the repositories (90 of 1,203) have a largest folder of 3 files or fewer, there are a few with 50 files or more (8 of 1,203; 0.67%). When looking at the Largest Folder Share (E), i.e., the ratio of the largest folder to the repository size, we have an average of 0.34 (SD=0.24) with a third quartile of 0.5. That means that in 25% of the repositories, at least half of the repository items are kept in one folder. Regarding Hierarchical Depth (F), it was found that on average, repositories in the population have a depth of 1.6 (SD=0.89), with the first quartile and median both equal to 1, and the third quartile equal to 2. Examining the data, it was found that only 9.98% of the repositories (120 of 1,203) have a depth of 3 or more. Visible Width (G) represents the number of folders to which the student is exposed upon browsing to the repository. From these folders, students might keep browsing to see the content of the repository. The mean value of that variable is 7.89 (SD=6.19), and it is positively skewed with first, second, and third quartiles of 4, 6, and 10, respectively. It is interesting to point out that 6.15% of the repositories have 20 folders or more in the first level (74 of 1,203), and the maximum value of that variable (held by one repository) was 52. When taking the ratio of Visible Width to Hierarchical Depth, we get the variable Width-depth Proportion (H), which is also positively skewed with an average of 5.82 (SD=5.25).

---------------------- Insert Table 3. about here --------------------

For revealing different types of repository structures, we used a Two-step Cluster Analysis. After a few iterations for examining the resulting clusters, we chose four of the structure variables to participate in the clustering procedure. The clustering variables are: Average Folder Size, Largest Folder Share, Hierarchical Depth, and Width-depth Proportion. A Two-step clustering algorithm with k=5 (number of clusters) was used in order to best represent the different types of structures in the research population. For this stage, we used the reduced population of courses with repositories of 15 files or more (N=1,203). Descriptive statistics of the variables according to which the clusters were formed are presented in Table 4.

---------------------- Insert Table 4. about here --------------------

We will now describe the clusters which were formed and label them according to the variable values within them. In the first cluster, the averages of all four variables take their extreme values in comparison with the other clusters: Largest Folder Share (0.93) and Average Folder Size (19.07) are maximal, indicating the existence of one folder which holds most of the files in the repository, i.e., a pile; Hierarchical Depth (1.01) is minimal, indicating a low depth of the hierarchy structure. Hence, the repositories in the first cluster are of a Main-folder structure. This is the smallest cluster with 67 courses (5.6%).

The second cluster has the highest average value of Hierarchical Depth (3.66) compared with the other clusters, and the lowest average value of Largest Folder Share (0.16). These values indicate that this structure is characterized by filing without any pile creation and relatively high depth. The average value of Average Folder Size (3.55) demonstrates the relatively small folders in this structure. Hence, the repositories within this cluster represent an Extensive Filing structure. This cluster holds 120 courses (10%).

The third cluster has the highest average value of Width-depth Proportion (14.61) compared with the other clusters, and a low value of the average Hierarchical Depth (0.19). These two parameters indicate hierarchy flatness. The average value of Average Folder Size (3.27), which is minimal in this cluster, and the relatively low average value of Largest Folder Share (0.19) indicate that the folders are small and that there is no pile at all. Consequently, the repositories within this cluster are labeled as having a Flat Small Folders structure. The third cluster holds 222 courses (18.4%).

In the fourth cluster, we can observe two main characteristics: the existence of a pile, illustrated by the average value of Largest Folder Share (0.25); and the use of filing, indicated by the average value of Hierarchical Depth (2.00) and the average value of Average Folder Size (4.21). This cluster represents a Pile in Hierarchy Filing structure. It is one of the two largest clusters, with 354 courses (29.4%).

The repositories in the fifth cluster are characterized by a flat structure, as indicated by the average values of Width-depth Proportion (4.59) and Hierarchical Depth (1.07). In addition, the existence of a large pile, holding on average more than half of the files in the repository, is indicated by the average value of Largest Folder Share (0.53); however, the rest of the items are filed into folders, as indicated by the average value of Average Folder Size (7.14). Therefore, the repositories in this cluster are of a Pile in Flat Filing structure. This cluster is the largest, with 440 courses (36.6%). These results are summarized in Table 5.

---------------------- Insert Table 5. about here --------------------
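As a quick consistency check, the five cluster sizes reported above can be summed and converted to shares of the reduced population. The sketch below uses only the counts given in the text; the variable names are illustrative, and shares computed this way may differ from the rounded percentages in the text in the last digit.

```python
# Cluster sizes as reported in the text (courses per cluster).
cluster_sizes = {
    "Main-folder": 67,
    "Extensive Filing": 120,
    "Flat Small Folders": 222,
    "Pile in Hierarchy Filing": 354,
    "Pile in Flat Filing": 440,
}

# The clusters should partition the reduced population (N=1,203).
total = sum(cluster_sizes.values())

# Share of each cluster, in percent of the reduced population.
shares = {name: 100 * size / total for name, size in cluster_sizes.items()}

print(f"total courses: {total}")
for name, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{name}: {share:.1f}%")
```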


5.3. Associations Between Repository Structures and Course Characteristics

A one-way ANOVA test was applied to the variable Number of Items (A), in order to determine whether repository size is associated with the structure type (cluster). There was an overall significant difference between the clusters, with F(4,1198)=105.13, significant at p
