Using Big Data Technology for Prediction of Quiz Difficulty Level in E-learning Systems

Hiba A. Abu-Alsaad

Rana Riad K. AL-Taie

Computer Engineering Department / Mustansiryah University [email protected]/ [email protected]/

2018

Abstract

In recent years, big data has received a great deal of attention from many researchers. Big data concepts have various applications: they can be used in healthcare and medicine, education, finance and fraud detection, industry sectors, etc. In the e-learning environment, a large amount of data can be generated as a result of different e-learning aspects, which is called Big Data in e-learning. Analyzing big data across the educational organization has the potential to enhance the future of e-learning content and students' performance. The aim of this paper is to develop a method for analyzing the difficulty level of test questions for students who are tested through e-learning software. In addition, it is expected to help instructors determine the strengths and weaknesses of students in exams, and to recognize the hardest and easiest questions for students based on their answers. The emerging open source tool Apache Spark was used to facilitate the analysis of large data by linking it to the database of e-learning systems.

Keywords: Big data; E-learning; Big data analysis; Apache Spark; Framework.

1. Introduction

Nowadays, data on a larger scale is available on the internet, provided by many sources (programs, humans, or services) [1]. Such massive amounts of data, known as Big Data, have high velocity and high volume, with a varied and complex nature. These characteristics introduce new challenges for storing, processing, and analyzing big data [2]. Big Data Analytics can be defined as a tool for collecting data from distinct sources and organizing it in a meaningful form; these considerable data sets are then analyzed in order to extract meaningful values and produce more effective graphical representations. In short: "big data analysis is the process of finding knowledge from bulk variety of data" [3].

Currently, researchers focus heavily on accelerating analysis algorithms to keep up with the increasing amount of data and with processor speed-ups in line with Moore's Law. This means that Big Data analysis faces many challenges and requires changes not only in hardware components but also in software components. Thus, powerful computers with modern techniques are required to deal with such data [4]. Big data analytics plays a vital role in various aspects of life, including agriculture, banking, chemistry, administration, data mining, cloud computing, marketing, learning, finance, and medicine [5].

Learning Analytics (LA) is a new trend in analysis techniques that is currently applied in eLearning to educational data collected from students' interaction with digital learning resources [6]. Most e-learning data are collected from learning platforms established by the educational organization. The e-learning platform is filled with a diverse set of data, such as files in different formats, quizzes, course materials, test questions, etc. There are several significant advantages to gathering and analyzing educational data, such as determining methods in the learning process, identifying difficulties in student performance, and allowing learners to make their training experiments and practices more effective with online resources and distance learning platforms [1]. In addition, it provides an overview of which eLearning modules are the most visited, while in social learning, professionals can determine which eLearning modules or links are the most shared with other learners, and receive this data instantly. It can help professionals know how information is being digested by learners and which learning needs require better ways of clarification [7].

2. Literature Review

In the last few decades, the crucial role of Big Data analysis has presented opportunities for researchers and scientists to establish many projects related to the challenges, tools, and techniques of analytics and its applications in many fields. This was explained by the authors in [5], who introduced many of the areas where Big Data can be applied, such as data mining, cloud computing, marketing, healthcare, banking, and finance. Shoro and Soomro [3] deployed the Apache Spark tool to analyze a set of meaningful information collected from the Twitter streaming API and used as a sample big data source. They then presented the results of their experiment in tabular and graphical format.

E-learning is one of the fields that has benefited greatly from big data analytics. For example, the authors in [1] developed a tool called "Big-Learn", which aims to facilitate the search process for learners in e-learning environments and to present more relevant results by analyzing both structured and unstructured data in one data layer. The authors in [8] proposed a novel system named "inVideo", which analyzes data in video format automatically, without viewers having to watch its content in advance. The proposed system is expected to be an effective tool for learning technology research that increases interaction in online learning environments.

Due to the important role played by tests and quizzes in e-learning platforms and their benefits for both learner and instructor, this paper presents an experiment in analyzing the contents of tests and quizzes (as a big data source) collected from a virtual learning platform, in order to measure the difficulty level of questions based on the answers of the students who participate in an exam. To make the analysis easier, a graphical representation is used to display the results. This gives the instructor an overview of test scores and of which questions were hardest or easiest for most of the students; moreover, it gives learners a chance to discover their weaknesses in certain areas. The experiment is applied with one of the industry's emerging tools, Apache Spark. (More details about the Spark tool are given in the following section.)

3. Big Data Analysis Tools

Many kinds of analysis tools are required to analyze such huge data, and scientists have developed various tools for this purpose. In the following, we give a brief overview of some big data analysis tools. Most of the focus is on the open source tool Apache Spark, along with the reasons for selecting it [3].

a. Apache Hive: data warehousing software that runs on top of Apache Hadoop and provides data summarization, query, and analysis. Apache Hive queries stored data using its own language, HiveQL (HQL), which is similar to SQL [3].

b. Apache Pig: It can be considered a platform rather than a tool, used for analyzing massive sets of data. The language for this platform is called Pig Latin. Compared with Hive in terms of performance, Pig is considered better for the data preparation phase of data processing, while Hive is more appropriate for data warehousing and presentation scenarios [3].

c. Apache HBase: another database engine, built to run on top of Hadoop, designed for tables with high update rates and modeled after Google's Bigtable. HBase is more appropriate for real-time access to data in very large tables (tables with billions of rows and millions of columns) [9].

d. Apache Storm: an open source tool for processing huge amounts of data that is easy to set up and use. Storm is applied to real-time computation on streams of data, much as Hadoop is applied to batch processing. Storm was initially used for processing Twitter's data streams; currently, many organizations use it as a stream processing tool [3].

e. Apache Spark: a fast and dependable general-purpose cluster computing engine from Apache. It provides application programming interfaces in several programming languages, such as Scala, Java, and Python. The Spark engine is more advanced than Hadoop and quite distinct from it for cutting-edge analysis. It is faster both in running analyses and in writing data. It can be used for disk-based processing as well as in-memory computing, where data can be queried far more quickly than with disk-based engines (like Hadoop) [3].

Some experts report that Spark running in memory can be up to 100 times faster than Hadoop MapReduce, and up to 10 times faster when processing disk-based data. Spark is frequently used for machine learning tasks, in addition to an enormous range of other tasks such as processing streaming data from sensors or financial systems and running interactive queries across large data sets. Developers can also benefit from Spark's extensive set of developer libraries and APIs. From the above, we can conclude that simplicity, speed, and support are the main reasons to choose Spark [9]:

• Simplicity: its capabilities are accessible through a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are documented and structured in a way that makes it simple and quick for data scientists and application developers to put Spark to work [9].

• Speed: as mentioned above, Spark is designed especially for speed and operates both in memory and on disk. In 2014, Spark was used to process 100 terabytes of data stored on solid-state drives in only 23 minutes (during the Daytona GraySort benchmark challenge) [9].

• Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Spark is often closely associated with Hadoop [9].
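As a rough sense of scale for the benchmark figure quoted under Speed, the average sort rate implied by 100 terabytes in 23 minutes can be worked out directly. Decimal units (1 TB = 10^{12} bytes) are assumed here; the unit convention is not stated in [9].

\[
\frac{100\ \text{TB}}{23\ \text{min}} \approx 4.35\ \text{TB/min},
\qquad
\frac{100 \times 10^{12}\ \text{B}}{23 \times 60\ \text{s}} \approx 7.25 \times 10^{10}\ \text{B/s} \approx 72.5\ \text{GB/s}.
\]

This is an aggregate average for the whole job, not the rate of a single machine.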

There are various examples of websites and electronic applications that combine efforts to develop the future of the e-learning industry by means of learning analytics. The goal of this paper is, in general, to enhance the level of e-learning and, in particular, to improve the performance of students by providing a method for instructors to get feedback from the actions performed by their students on an educational platform. Instructors will be able to assess the difficulty of the tests and quizzes they set. Hence, three important factors exist in this experiment:

a. Educational platform (a website where students can take their assignments).

b. Collected data: students' answers (the data will be collected in a higher education institute).

c. Analysis tool (the open source Apache Spark tool).

The educational platform is provided by deploying Moodle, an open source virtual platform for dynamic learning designed to be used alongside face-to-face classroom teaching, with the possibility of uploading course resources.

Reason for Moodle selection: Moodle is easy to use, so students and academics can readily become familiar with using it for learning and delivering courses. Students can log in to Moodle after registration and take an exam for a specific course. Their answers are then collected and analyzed by connecting the Spark engine to the Moodle database. To test our work without any access to the Internet, Moodle and its database were run on a single machine using a XAMPP localhost installation on the Linux operating system. The outside view of the proposed system is illustrated in figure (1).

Figure 1. General system description.
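The data collection step, in which students' graded answers are pulled from the platform database for analysis, might look roughly like the sketch below. It is an illustration under stated assumptions, not the system's actual code: Python's built-in sqlite3 module with an in-memory database stands in for the Moodle database served by XAMPP, and the table name `question_attempts_demo` and its columns are hypothetical rather than Moodle's real schema. In the described system, the Spark engine reads the corresponding rows from the Moodle database.

```python
import sqlite3

# In-memory stand-in for the platform database (hypothetical schema,
# not Moodle's actual tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE question_attempts_demo ("
    "student_id INTEGER, question TEXT, is_correct INTEGER)"
)
conn.executemany(
    "INSERT INTO question_attempts_demo VALUES (?, ?, ?)",
    [(1, "Q1", 0), (2, "Q1", 1), (3, "Q1", 0), (1, "Q2", 1), (2, "Q2", 1)],
)

# Collect, per question, the number of attempts and how many were wrong.
rows = conn.execute(
    "SELECT question, COUNT(*) AS attempts, "
    "SUM(CASE WHEN is_correct = 0 THEN 1 ELSE 0 END) AS wrong "
    "FROM question_attempts_demo GROUP BY question ORDER BY question"
).fetchall()
conn.close()

for question, attempts, wrong in rows:
    print(question, attempts, wrong)
```

The per-question counts of attempts and wrong answers are the only inputs the later difficulty analysis needs, so the query aggregates in the database instead of shipping every row.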

Analysis results are displayed in the form of bar charts, from which the instructor can measure the difficulty of the given questions. The proposed system, named Flipped Learning Website, is designed to work in the internet environment, and users can access its portal through a web browser. Figures (2) and (3) give examples of the admin interfaces as the starting point for managing the experiment. Figures (4), (5), and (6) give examples of the interfaces when a student starts an online assignment, while figure (7) shows how a student can review all his attempts on an assignment (in this example, an assignment on the C++ subject).

Figure 2. Admin Login Interface.

Figure 3. Admin Home Page.

Figure 4. A student reviews all his attempts on an exam.


Figure 5. An online assignment.

Figure 6. The status of an attempt before submitting.



Figure 7. All attempts along with grades.

5. Experimental Results

This section illustrates the analysis of the data collected for this experiment. As mentioned before, the average difficulty of a question is given in graphical format based on the following equation:

\[
\text{Difficulty Average Level} = \frac{\text{No. of wrong attempts}}{\text{No. of all attempts}} \times \text{Degree of each question} \tag{1}
\]

For example, random values have been selected below to determine the average difficulty of questions one and two.

\[
Q_1 = \frac{\ldots}{14} \times 10 \cong 7
\]

\[
Q_2 = \frac{5.0}{13} \times 10 \cong 4
\]
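Equation (1) can be sketched as a small function. This is a minimal illustration in plain Python, not the system's Spark code; the function name `difficulty_level` and its argument checks are assumptions made here, and the "degree" of a question is read as the marks assigned to it (10 in the example).

```python
def difficulty_level(wrong_attempts: int, all_attempts: int, degree: float) -> float:
    """Equation (1): share of wrong attempts scaled by the question's degree."""
    if all_attempts <= 0:
        raise ValueError("a question needs at least one attempt")
    if not 0 <= wrong_attempts <= all_attempts:
        raise ValueError("wrong attempts must lie between 0 and all attempts")
    return wrong_attempts / all_attempts * degree


# Q2 from the example: 5 wrong attempts out of 13, degree 10.
q2 = difficulty_level(5, 13, 10)
print(f"Q2 difficulty: {q2:.2f} (rounded: {round(q2)})")
# prints "Q2 difficulty: 3.85 (rounded: 4)"
```

A question that nobody answers wrongly scores 0, and one that everybody answers wrongly scores its full degree, so the value is directly comparable across questions with the same degree.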

In the following example, we explain the function of the proposed system. An online assignment was given on the C++ subject; after the answers to the C++ assignment were corrected, Spark started to collect and analyze the given data. The y-axis of the bar chart in figure (8) represents the number of wrong answers from all attempts for each question, whereas the x-axis represents the question numbers in the assessment. For example, the total number of attempts for Q1 is 14: 9 attempts had incorrect answers and 5 had correct answers. In figure (9), feedback is forwarded to the instructor in the form of a bar chart in which the x-axis shows the assignment questions (Q1 to Q11), while the y-axis gives the average difficulty level of each question. This helps the instructor find the most difficult questions, which are represented by the highest average difficulty.
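The counting behind figures (8) and (9) can be sketched as follows. This is a plain Python illustration under stated assumptions, used only to keep the example self-contained; in the described system the analysis is performed by Spark. The input format (pairs of question label and correctness flag), the function name `summarize_attempts`, and the degree of 10 are assumptions. The Q1 counts come from the text (14 attempts, 9 wrong), and the Q2 counts reuse the earlier worked example (13 attempts, 5 wrong).

```python
from collections import Counter


def summarize_attempts(attempts, degree=10):
    """Return {question: (wrong, right, difficulty)} from (question, is_correct) pairs."""
    totals = Counter()
    wrong = Counter()
    for question, is_correct in attempts:
        totals[question] += 1
        if not is_correct:
            wrong[question] += 1
    return {
        q: (wrong[q], totals[q] - wrong[q], wrong[q] / totals[q] * degree)
        for q in totals
    }


# Hypothetical attempt records matching the counts quoted in the text.
attempts = (
    [("Q1", False)] * 9 + [("Q1", True)] * 5
    + [("Q2", False)] * 5 + [("Q2", True)] * 8
)
summary = summarize_attempts(attempts)

# Rank questions from hardest to easiest, as an instructor would read figure (9).
hardest_first = sorted(summary, key=lambda q: summary[q][2], reverse=True)
for q in hardest_first:
    w, r, d = summary[q]
    print(f"{q}: wrong={w} right={r} difficulty={d:.2f}")
```

The wrong and right counts correspond to the bars of figure (8), and the difficulty values to those of figure (9).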

Figure 8. Number of Wrong and Right Answers from all attempts.

Figure 9. The Average difficulty level of questions.

6. Conclusions and Future Work

This paper has clarified the benefit of big data analysis for improving the educational environment by tracking and analyzing the outcomes of teaching systems and then visualizing the results to inform instructors' decisions. This model provides a strategy for educators to understand and enhance the educational level of students in classroom learning at the higher education level. The bulk data source generated from online exams is connected to Apache Spark, which can analyze a stream of students' outcomes within a few seconds. The paper also explores the role of Big Data analysis in the education field. Big Data is a powerful tool that makes things easier in various fields; this system provides new tools and


