Distributed computing

Applications of Distributed Computing for the Next Generation of On-line Education: Real World Use Cases Prof. Santi Caballe cv.uoc.edu/~scaballe/

Faculty of Computer Science, Multimedia, and Telecommunications Open University of Catalonia (UOC) Barcelona, SPAIN

SMARTLEARN · UOC

Open University of Catalonia Barcelona, Spain

www.uoc.edu


Mission and purpose
“Since 1995, the mission of the Open University of Catalonia (UOC) is to provide people with lifelong learning and education opportunities. The aim is to help individuals meet their learning needs and provide them with full access to knowledge, above and beyond the usual scheduling and location constraints.”
The UOC engages people who offer quality full online university education and promote:
• innovative education enabling personalised learning,
• technological leadership that facilitates interaction and collaborative work,
• academic research on the information society and e-learning,
• dissemination of knowledge.

Facts & figures


Facts & figures The Networked UOC


Social networks


Research & Innovation


Research group

SMARTLEARN Research group smartlearn.uoc.edu @smartlearn_uoc

Mission and purpose
The SMARTLEARN research group is devoted to the intensive use of ICT to improve and enhance any form of e-Learning (e-Training, m-Learning, CSCL, etc.) from a multidisciplinary perspective.
• The ultimate goal is to meet the demanding and changing requirements of the next generation of eLearning systems and services.
SMARTLEARN's purpose is to tackle the full picture of the e-Learning domain by promoting:
• pedagogical conceptualization of e-Learning systems
• technological and engineering methodologies
• developments to be piloted and integrated in LMSs
• dissemination of knowledge
• exploitation of resulting tools and services

Research group members

• Dr. Santi Caballé (Leader), UOC researcher: eLearning, Distributed Computing, Learning Analytics, E-Assessment, Security, Software Engineering.
• Dr. Jordi Conesa, UOC researcher: Ontologies, eLearning Analytics, Software Engineering.
• Dr. Fatos Xhafa, Research associate: Distributed Computing, eLearning, Security.
• Dr. David Gañan, Technical staff, CEO DeltaDev (2010): Learning Analytics, Software Engineering.
• Dr. Jorge Miguel, Postdoc fellow, CIO IT Systems (2005): eLearning, Distributed computing, Security.
• Dr. Modesta Pousada, UOC researcher: Social Sciences, Emotional eLearning.
• PhD students: eLearning, Learning Analytics, Distributed technologies.

Research line: Distributed technologies for eLearning
• Goal: Use Grid, Cloud, Cluster and P2P technologies to provide eLearning systems with (1) transparent, ubiquitous and on-demand access to distributed learning facilities such as Virtual Labs, content repositories and video-conferencing (efficient learning), and (2) the means to build user models, monitor learning progress, and design secure online learning activities (effective learning).
• Challenge: Support non-functional requirements (QoS, scalability, fault-tolerance, efficiency) and manage large data sets/streams (log file processing and analysis).
• SMARTLEARN researchers: Dr. Santi Caballé, Dr. Fatos Xhafa.
• Related projects: MOSAIC, COLE. Funded by the Spanish Government. Funds: 90,000€.
• Recent publications:

Caballé, S., Miguel, J., Xhafa, F., Capuano, N., Conesa, J. (2016). Using trustworthy web services for secure e-assessment in collaborative learning Grids. International Journal of Web and Grid Services. Accepted. ISI-JCR.



Miguel, J., Caballé, S., Xhafa, F., Prieto, J. (2014). Massive Data Processing Approach for Effective Trustworthiness in Online Learning Groups. Concurrency and Computation: Practice and Experience , 27(8), 1988–2003. ISI-JCR.



Caballé, S., Xhafa, F. (2013). Distributed-based Massive Processing of Activity Logs for Efficient User Modeling in a Virtual Campus. Cluster Computing. 16(4), 829-844.


Research map

SMARTLEARN history on distributed computing for eLearning

[Figure: timeline of the group's work on distributed computing for eLearning. Periods shown: 2004-05, 2006-07, 2008-09, 2010-11, 2012-13, 2014-15, 2016. Technologies named: Grid (GT4), P2P (JXTA, JXTA-Overlay), PlanetLab, contributory computing, MW paradigm, Cluster/Cloud, Hadoop MapReduce, Yahoo S4, Mahout, Spark. Topics named: replication, data management, data mining, user modeling, CSCL knowledge management, P2P learning, learning analytics, secure learning, machine learning, big data streams. Venues named: CCGrid’04, GADA’04, PDPTA’05, GADA’07, UM’07, 3PGIC’07, WSKS’08, ICDIM’08, ICICTE’09, 3PGCIC-10, EIDWT’11, CISIS’12, INCOS’12, CISIS’14, INCOS’14, AINA’15, INCOS’16, 3PGCIC’16.]

Main research activities
• Editor-in-Chief of two International Scientific Journals:
– International Journal of Grid and Utility Computing, Inderscience.
– International Journal of Space-Based and Situated Computing, Inderscience.
• Editor of Book Series:
– Intelligent Data-Centric Systems, Elsevier.
– Lecture Notes in Data Engineering and Communication Systems, Springer.
• Guest Editor of 40 special issues of International Journals.
• Proceedings Editor of 35 International Conferences.
• Leading and participating in 30 research projects (Spain, EU, US).
• Steering Committee chair and founder of 6 conferences: 3PGCIC, CISIS, INCOS, BWCCA, NBiS, EIDWT.
• General and PC Co-chair of 12 conferences.
• PC member of 20+ International conferences and workshops.

Next research activities

http://smartlearn.uoc.edu/1st-callfor-papers-of-3pgcic-bwcca-2017/

SMARTLEARN Research group

Visit us! smartlearn.uoc.edu @smartlearn_uoc


Current needs of eLearning
• Engaging learners has become one of the most significant problems faced by modern e-learning.
– A lack of engagement causes early drop-outs.
• The lack of engagement can be attributed to several issues:
– Interaction
– Challenge
– Empowerment
– Social identity
To help overcome the lack of engagement, current eLearning tools need to combine personalization, collaboration, assessment, simulation, etc.
However, there is a strong need for computational power to run these advanced tools.

Current needs of eLearning
Collaborative Learning
• Purpose: to reuse, in an engaging way, the information exchanged and the knowledge elicited during collaborative learning activities.

Current needs of eLearning
Virtual Laboratories
Three available types of virtual laboratories for hands-on student experimentation:
• Simulation Lab: Use of software (including license management) and simulations.
• Home Lab: Hardware (and its associated software) is physically sent to students.
• Remote Lab: 24/7 remote hardware and software access. Resource access is controlled in terms of scheduling (i.e. booking system) and connection and use (i.e. remote control).

Current needs of eLearning
Serious Games
• Purpose: to enhance the learning experience with highly interactive and engaging simulations like immersive 3D worlds.

Current needs of eLearning
Can we meet the demanding computational needs of current eLearning?

Current needs of eLearning
Specific needs for current eLearning (short list):
• wide geographical distribution of learners and tutors belonging to different institutions,
• access from anywhere, on any learner's computer platform and any software,
• support for a growing load of learning resources, such as Virtual Labs, content repositories and on-demand video services, and of users who access these resources,
• transparent access to and sharing of a huge variety of software and hardware learning resources,
• processing of huge amounts of user interaction data from online learning (usually in log files),
• provision of advanced knowledge-based services to eLearning applications in terms of learning analytics, user modeling, monitoring, prediction, e-assessment and security,
• multiple administrations from different departments/organizations with specific policies,
• inherent dynamism of learners' and tutors' needs and changing learning resources,
• timely move to highly customized pedagogical models, such as collaborative learning, each targeting a specific learning goal and incorporating its specific resources,
• personalization and updating of learning resources by instructors and learners without technology skills,
• avoidance of a central point of failure in the LMS (decentralization),
• students using their own computational resources for task accomplishment during eLearning,
• eLearning applications that are naturally scalable, fault-tolerant and high-performance,
• customization of the learning environments to the needs of groups of learners,
• etc.

Distributed computing is the most suitable paradigm to meet these demanding needs.

Distributed Computing for eLearning (1)

Grid and Web services for eLearning: Meeting compute-intensive educational needs
Technologies: SOA/OGSA, AXIS2, GT4, WSRF
Frameworks: OKI, ELF, IAF

Distributed Computing for eLearning (1)

Grid and Web services for eLearning: Meeting compute-intensive educational needs
Specific eLearning needs to be met:
• wide geographical distribution of learners and tutors belonging to different institutions,
• access from anywhere, on any learner's computer platform and any software,
• support for a growing load of learning resources, such as Virtual Labs, content repositories and on-demand video services, and of users who access these resources,
• transparent access to and sharing of a huge variety of software and hardware learning resources,
• multiple administrations from different departments/organizations with specific policies,
• inherent dynamism of learners' and tutors' needs and changing learning resources,
• timely move to highly customized pedagogical models, such as collaborative learning, each targeting a specific learning goal and incorporating its specific resources,
• personalization and updating of learning resources by instructors and learners without technology skills,
• customization of the learning environments to the needs of groups of learners.

Bring together personalized learning experiences with transparent, ubiquitous and on-demand access to distributed learning resources, all according to pedagogically sound approaches.

Distributed Computing for eLearning (2)

P2P technologies for eLearning: Building decentralized and scalable systems
Technologies: JXTA, JXTA-Overlay
[Figure labels: Broker peer, Client peer, Primitives]
Primitives:
• Peer discovery
• Peer's resources discovery
• Resource allocation
• Task submission and execution
• Peer group functionalities (groups, rooms, etc.)
• Monitoring of peers, groups, tasks, etc.

Distributed Computing for eLearning (2)

P2P technologies for eLearning: Building decentralized and scalable systems
Specific eLearning needs to be met:
• avoid a central point of failure and bottlenecks,
• free communication and sharing (with no institutional rules on who can share and how),
• eLearning applications that are naturally scalable, fault-tolerant and high-performance,
• inherent dynamism of learners' and tutors' needs and changing learning resources,
• customize the learning environments to the needs of groups of learners,
• students can use their own computational resources (contributory systems),
• alleviate the high maintenance cost of web-based centralized systems,
• increase performance and fault tolerance (in comparison to centralized systems).

Direct P2P communication and sharing (with no institutional rules) increases the interaction among peers while fostering social networking, thus overcoming the limitations of reduced interaction found in centralized eLearning systems.

Distributed Computing for eLearning (3)

Cloud/Cluster technologies for eLearning: Meeting data-intensive educational needs
Technologies: Hadoop MapReduce, Yahoo! S4, Spark, PlanetLab, Cluster-RDlab
[Figure labels: UOC (2 nodes); Cluster RDLab, rdlab.upc.edu]

Distributed Computing for eLearning (3)

Cloud/Cluster technologies for eLearning: Meeting data-intensive educational needs
Specific eLearning needs to be met:
• massive processing of large amounts of activity logs from online learning
• real-time processing of group activity log data
• provision of effective feedback and awareness to online learning groups
• user modeling in virtual campus
• mining navigation patterns in virtual campus
• efficient interaction analysis for provision of knowledge about the online discussion process
• massive data processing for effective trustworthiness in online learning groups
• prediction of trustworthiness behavior to enhance security in online assessment.

Processing and analyzing the information captured from students' actions is a core function for modeling both the students' behavior during the learning process and the learning process itself.

Applications: Real world use cases (1)

Real-time user modeling in a virtual campus


Motivation
PURPOSE
• Describing and predicting students' actions and intentions, and adapting the learning system to students' features, habits, interests and preferences, greatly stimulates the learning experience.
PROBLEM
• The complex processes involved in Web-based learning generate a great variety of types and formats of data stored in log files:
– the data is found in an ill-structured, highly redundant form,
– the data needs to be reduced and structured for later analysis and knowledge extraction.
• Applications are characterized by a high degree of user-user and user-system interaction, which increases the amount of interaction data generated.
• Data needs to be processed in real time to have more impact on the learning experience.
• Computational cost is the main obstacle to processing a great amount and variety of log data in real time.

Local processing of log data
• A PC with a standard configuration is used for local sequential processing of the log files.
• Daily log files (up to 15 GB) are used for massive processing.
• As expected, results with one node are linear over time.
• The time spent processing large log files is too long for our purpose of processing data in real time.

Efficient processing of log data
• Distributed approach using the MW (Master-Worker) paradigm to parallelize the processing of log files.
• Distributed infrastructure made up of master and worker peers.
• Master peers create and submit their requests.
• Worker peers assign the Master's requests to the available nodes, then notify the Master and send it the results.
• Deployment in a large-scale, distributed network using nodes from the PlanetLab platform: up to 32 nodes are used for parallel processing.

Master-Worker algorithm
1. [Pre-processing phase]: UOCLogsProcessing counts the total number of lines of the log file, totalNbLines. Knowing the number of parts into which the file is split, nbParts, each peer node will receive and process totalNbLines/nbParts lines of the file.
2. [Master Loop]: Repeat
   a. Read totalNbLines/nbParts lines from the original file and create a file with them.
   b. Create a request and submit it to the distributed infrastructure.
   c. [Parallel processing]:
      i. The request is assigned to a peer node (slave node) of the distributed infrastructure.
      ii. Upon receiving a request, the peer node reads its part of the file via HTTP. The peer runs the UOCLogProcessing functionality to process the lines of the file, one at a time, and stores the results in a buffer.
      iii. Once the request has been processed, the peer node sends the content of the buffer back to the master node.
   Until the original log file has been completely scanned.
3. [Master’s final phase]: Receive messages (partial files) from the slave nodes and append each newly received result file, in the correct order, to the final file containing the information extracted from the original log file.
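The three phases above can be sketched on a single machine. This is a minimal illustration, not the UOCLogsProcessing code: local threads stand in for remote peer nodes, no HTTP transfer takes place, and `split_lines`, `process_chunk` and `master` are hypothetical names. The per-line "processing" simply keeps the first field of each line.

```python
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, nb_parts):
    """Pre-processing: cut the log into nb_parts contiguous chunks."""
    if not lines:
        return []
    size = -(-len(lines) // nb_parts)  # ceiling of totalNbLines / nbParts
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def process_chunk(chunk):
    """Worker: process lines one at a time and buffer the results.

    Placeholder logic (assumption): keep only the first field of each line.
    """
    buffer = []
    for line in chunk:
        fields = line.split()
        if fields:
            buffer.append(fields[0])
    return buffer


def master(lines, nb_parts):
    """Master loop and final phase: submit chunks, merge results in order."""
    chunks = split_lines(lines, nb_parts)
    with ThreadPoolExecutor(max_workers=nb_parts) as pool:
        # Executor.map yields results in submission order, which preserves
        # the original line order when the partial results are appended.
        partials = list(pool.map(process_chunk, chunks))
    return [record for part in partials for record in part]


if __name__ == "__main__":
    log = [f"user{i} GET /page{i}" for i in range(10)]
    print(master(log, nb_parts=3))
```

Because the partial results are collected in submission order, the master does not need sequence numbers here; a deployment over independent peers would need some explicit ordering key.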


PlanetLab platform
PlanetLab is an open distributed computing infrastructure, currently with 1353 nodes distributed across 717 different sites among universities, research centers or homes. The nodes are located outside a firewall and have to be visible from anywhere in more than one DNS. UOC contributes 2 nodes. Currently, a node must have the following characteristics:
• 4 GB RAM
• At least 500 GB hard disk
• At least 1 MB/sec connection to the Internet
• 4× Intel CPU cores (e.g., 2× dual core or quad core)
• External or built-in remote-access power-reset capability, accessible from PLE, such as Intel AMT, HP iLO, Dell RAC, IPMI v2, etc.

PlanetLab processing
• By using a distributed infrastructure of up to 16 nodes, a considerable speed-up is achieved in processing large log file data.
• For truly geographically distributed infrastructures larger than 16 nodes, such as PlanetLab, the speed-up decreases as the number of nodes increases, due to the significant communication time spent receiving the chunk files and sending the results back to the master node.
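The trade-off between parallel computation and communication time can be illustrated with a toy cost model. Everything here is an assumption made for illustration: the model supposes that communication cost grows linearly with the number of nodes, and the default parameter values are hypothetical, picked only so that the curve peaks near 16 nodes. They are not measurements from the PlanetLab experiments.

```python
def speedup(nodes, compute=1000.0, transfer_per_node=4.0):
    """Speedup of a parallel run over a sequential one (toy model).

    compute: sequential processing time (hypothetical units).
    transfer_per_node: communication time added per node (assumption).
    """
    parallel_time = compute / nodes + transfer_per_node * nodes
    return compute / parallel_time


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(n, round(speedup(n), 2))
```

Under this model the compute term shrinks as nodes are added while the communication term grows, so the speedup rises, flattens, and then falls.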


Parallel processing results
• Once the original log files have been processed and the results merged into a well-structured form, the overall data volume is reduced by up to 95%: from a 15 GB log file to a resulting 800–900 MB.

Original log line:
82.35.62.84 2014_3_18_08_00_19 GET /UOC/templates/Agenda/agBlank.html http://cv.uoc.edu/UOC/a/cgibin/agLogin?s=bd4c1731917cbf3e9f03631f4acb67bbca5aedc79eda4c3b55c84729c9251dcf958b3808975429c631b0daa7dd7c0b980ab30e24d503e7a257f49a3414823fe5

Processed record:
82.35.62.84 1322221652 18/03/2014 08:00:19 agenda

• This allows the processing results to be feasibly stored in a database for further analysis and knowledge extraction.
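A reduction of this kind can be sketched as a per-line normalization step. The field layout is inferred from the single sample line and is an assumption, as is the rule that the campus area is the directory after `/UOC/templates/`. The numeric second field of the processed record is left out because the source does not say how it is derived. `normalize` is a hypothetical name.

```python
from datetime import datetime


def normalize(raw_line):
    """Reduce one raw access-log line to 'ip date time area'."""
    fields = raw_line.split()
    ip, stamp, _method, path = fields[:4]
    # The sample timestamp has the form year_month_day_hour_minute_second.
    when = datetime.strptime(stamp, "%Y_%m_%d_%H_%M_%S")
    # Assumption: the third path component names the campus area.
    parts = [p for p in path.split("/") if p]
    area = parts[2].lower() if len(parts) > 2 else "unknown"
    return f"{ip} {when:%d/%m/%Y %H:%M:%S} {area}"


if __name__ == "__main__":
    raw = ("82.35.62.84 2014_3_18_08_00_19 GET "
           "/UOC/templates/Agenda/agBlank.html http://cv.uoc.edu/")
    print(normalize(raw))
```

Dropping the referrer URL and its long session token is where most of the size reduction would come from in a line like the sample.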


Extracting navigation patterns
• Once data is processed, it is analyzed with data mining methods suitable for extracting navigation patterns:
– K-means, Apriori and FPGrowth data mining algorithms
– Use of the WEKA framework
• Association rules:
1. Students who have accessed a classroom have also accessed classroom spaces (forums, materials, etc.).
2. Students who have accessed the mailbox have not accessed the teaching plan.
3. Students who have accessed teaching activities have navigated through classrooms.
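Rules such as these are usually judged by their support and confidence. The sketch below computes both measures in plain Python; it does not use WEKA, and the session data and the name `rule_metrics` are made up for illustration.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    sessions: list of sets, each holding the campus areas one student visited.
    """
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for s in sessions if a <= s)
    n_both = sum(1 for s in sessions if both <= s)
    support = n_both / len(sessions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


if __name__ == "__main__":
    # Hypothetical sessions, not data from the UOC campus.
    sessions = [
        {"classroom", "forum"},
        {"classroom", "materials", "forum"},
        {"mailbox"},
        {"classroom"},
    ]
    print(rule_metrics(sessions, {"classroom"}, {"forum"}))
```

Algorithms such as Apriori and FPGrowth search for all rules whose support and confidence exceed chosen thresholds; this function only scores one given rule.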


Conclusion
• The Open University of Catalonia offers distance education through the Internet:
– 60,000 students, lecturers and tutors
– 600 on-line official courses from 23 official degrees and other PhD and post-graduate programs.
• On-line campus activity:
– 15-20 GB/day of interaction data stored in log files
– Information found in an ill-structured, highly redundant form

Benefits from efficient processing of the campus's log data:
1. Monitor and adapt the campus navigation to the students’ needs.
2. Predict future actions of students based on navigation patterns.

Applications: Real world use cases (2)

Security in online learning based on trustworthiness


Motivation


Trustworthiness model


Trustworthiness methodology
1. The first activity is to study module M and answer a questionnaire Q1 with tasks about the module M.
2. The second activity is a forum F, a collaborative framework devoted to improving the responses to Q1.
3. A questionnaire Q2 is sent with the set of responses to Q1. Through Q2, students evaluate their classmates’ responses and their participation in F (P2P evaluation).

Log data: 15 GB/day

Hadoop MapReduce processing approach
MapReduce application:
• The Map phase takes a record.
• The Map function receives the record, which is processed following the normalization process.
• The output of the Map function is sent as the input for the Reduce function.
• Without Reduce function: if we only want to store normalized data for further analysis, the Reduce task does not perform additional work; it only stores the output.
• With Reduce function: the Reduce function is used to compute relevant information related to the students' activity data.

[Figure: LMS activity logs (Log 1 ... Log N) from BSCW, Moodle and the UOC Virtual Campus are normalized with Hadoop MapReduce into a unified log (Log U) with the fields user, time, action, [value]*. A second Hadoop MapReduce stage (Analysis / Trustworthiness) computes user action instances. Values shown: User 1: 1 → 900; User 2: 1 → 15; User 3: 1 → 7,5.]
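The Map and Reduce roles described above can be mimicked in a few lines of plain Python. This is a single-process sketch, not the Hadoop application: it assumes normalized records are whitespace-separated as `user time action [value]*`, and it takes "relevant information" to mean a count of action instances per user. `map_record`, `reduce_counts` and `run_job` are hypothetical names.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one normalized record and emit (user, 1)."""
    fields = line.split()
    if len(fields) < 3:
        return []  # skip malformed records
    user, _time, _action, *_values = fields
    return [(user, 1)]


def reduce_counts(user, ones):
    """Reduce: total number of action instances for one user."""
    return user, sum(ones)


def run_job(lines):
    """Run map, group by key (the shuffle step), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    records = ["u1 08:00 post", "u2 08:01 read doc1", "u1 08:02 read", "bad"]
    print(run_job(records))
```

In the "without Reduce" variant the job would simply write out the mapped records; only the grouping and `reduce_counts` steps belong to the analysis stage.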

Hadoop MapReduce processing approach
Hadoop Development Framework
• Cloudera CDH4 QuickStart VM is used as the development environment for the MapReduce application.
– This virtual machine contains a standalone Apache Hadoop framework with everything we needed to test our model.

Deployment
• The cluster support for our MapReduce implementation was provided by the RDlab.
• The RDlab is focused on promoting research and development of computing projects at the departments of the Polytechnic University of Catalonia.
• The aggregated hardware resources are 160 physical servers, 1000 CPU cores, more than 3 TB of RAM, 130 TB of disk space, and a high-speed network at 10 Gbit.
• The RDlab’s cluster offers the possibility of executing a Hadoop environment.
• The cluster integrates with the parallel work management queue system.
• HDFS is directly integrated with the Lustre file system.

Hadoop MapReduce processing results

[Figure: processing time in seconds (0 to 400) against log file size in MB (0 to 10000), with one line per number of nodes: 0, 2, 4, 6, 8 and 10.]

• The figure shows comparative results with multiple Hadoop nodes (i.e. 0, 2, 4, 6, 8 and 10 workers) in the RDLab cluster.
• The 0-node line shows the results of local sequential processing, which grows linearly.
• The lines for 2 nodes and above show that considerable speed-up is achieved: >50% for more than 4 nodes and >75% for 10 nodes.
• Although the overhead introduced by the MapReduce framework is noticeable for small task sizes, when the task size is >1000 MB the overhead time diminishes considerably in comparison with the total processing time.
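If the percentages above are read as reductions in processing time relative to the sequential run (an interpretation, since the slide does not define them), they convert to speed-up factors as follows. `speedup_from_reduction` is a hypothetical helper.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional time reduction into a speed-up factor.

    A reduction r means the parallel run takes (1 - r) of the sequential
    time, so the speed-up is 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    print(speedup_from_reduction(0.50))  # 2.0
    print(speedup_from_reduction(0.75))  # 4.0
```

Under this reading, a reduction above 50% corresponds to more than a 2× speed-up and a reduction above 75% to more than 4×.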


Trustworthiness analysis

Difference between manual evaluation by the tutor and automatic evaluation by the students (P2P evaluation):
• The mean difference (manual - automatic) is 0.81 (scale, 0 to 10).
• The minimum and maximum differences are 0.03 and 2.82.
• In 77% of cases, the difference (manual - automatic) is < 1.
• In 92% of cases, the difference (manual - automatic) is < 2.

Students whose deviation is >= 2 (8%) are considered anomalous and require further investigation for potential cheating.
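The flagging rule can be sketched as a simple threshold check. The scores below are invented for illustration, `flag_anomalies` is a hypothetical name, and the use of the absolute difference is an assumption, since the slide does not say whether negative deviations are treated the same way.

```python
def flag_anomalies(scores, threshold=2.0):
    """Return the students whose tutor and peer scores deviate too much.

    scores: {student: (manual_score, automatic_score)} on a 0 to 10 scale.
    """
    return sorted(
        student
        for student, (manual, automatic) in scores.items()
        if abs(manual - automatic) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical scores, not data from the study.
    scores = {"s1": (7.0, 6.5), "s2": (8.0, 5.0), "s3": (5.0, 7.0)}
    print(flag_anomalies(scores))
```

A flagged student is only a candidate for review; the threshold separates cases worth a closer look from those where tutor and peers broadly agree.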


Conclusion
• Innovative approach for modelling trustworthiness for secure assessment in online learning groups.
• The model is based on trustworthiness factors, indicators and levels, which allow for discovering how trustworthiness evolves in the learning system.
• We motivated the need to process large amounts of LMS log data for the analysis of students' activity data in collaborative learning models, with the aim of provisioning security in eLearning through trustworthiness models.
• We justified the feasibility of a parallel computing approach, and then we proposed a complete MapReduce and Hadoop application for massive processing of LMS log file data.

The analysis of the data processed by Hadoop MapReduce indicates that our trustworthiness model can tackle security breaches and enhance information security in e-Learning.

Summary
• Engagement is fundamental in current complex online education. However, advanced eLearning tools face many challenges:
– wide geographical distribution of learners and tutors belonging to different institutions,
– access from anywhere, on any learner's computer platform and any software,
– support for a growing load of learning resources, such as Virtual Labs, content repositories and on-demand video services, and of users who access these resources,
– transparent access to and sharing of a huge variety of software and hardware learning resources,
– processing of large amounts of information for user modeling, monitoring and feedback,
– etc.
• Distributed computing comes to play a key role in eLearning:
– Grid computing supports compute-intensive eLearning processes (simulation in Labs, on-demand video, etc.).
– P2P computing supports informal and social learning by building naturally decentralized and scalable systems.
– Cloud/Cluster computing supports data-intensive tasks (large data sets processing) to make eLearning more effective.

Selected publications
1. Caballé, S., Miguel, J., Xhafa, F., Capuano, N., Conesa, J. (2016). Using trustworthy web services for secure e-assessment in collaborative learning Grids. International Journal of Web and Grid Services. Inderscience Publishers. Accepted.
2. Xhafa, F., Garcia, D., Ramirez, D., Caballé, S. (2015). Performance Evaluation of a MapReduce Hadoop-based Implementation for Processing Large Virtual Campus Log Files. In proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 200-206. IEEE Computer Society.
3. Miguel, J., Caballé, S., Xhafa, F., Prieto, J. (2014). Massive Data Processing Approach for Effective Trustworthiness in Online Learning Groups. Concurrency and Computation: Practice and Experience, 27(8), 1988–2003. Wiley.
4. Caballé, S., Xhafa, F. (2013). Distributed-based Massive Processing of Activity Logs for Efficient User Modeling in a Virtual Campus. Cluster Computing, 16(4), 829-844. Springer.
5. Xhafa, F., Caballé, S., Ruiz, J. J., Spaho, E., Barolli, L., Miho, R. (2012). Massive processing of activity logs of a Virtual Campus. In proceedings of the 3rd International Conference on Emerging Intelligent Data and Web Technologies, pp. 104-110. IEEE Computer Society.
6. Xhafa, F., Paniagua, C., Barolli, L., Caballé, S. (2011). Using Grid Services to Parallelize IBM's Generic Log Adapter. Journal of Systems and Software, 84(1), 55-62. Elsevier.
7. Xhafa, F., Barolli, L., Caballé, S., Fernández, R. (2010). Efficient PeerGroup Management in JXTA-Overlay P2P System for Developing Groupware Tools. Journal of Supercomputing, 53(1), 45-65. Springer.
8. Xhafa, F., Paniagua, C., Barolli, L., Caballé, S. (2010). A Parallel Grid-based Implementation for Real Time Processing of Event Log Data in Collaborative Applications. International Journal of Web and Grid Services, 6(2), 124-140. Inderscience Publishers.
