Teaching Big Data Curriculum: Methods, Practices, and Lessons

Jawwad A. Shamsi, Atika Burney, Bilal Butt, Furqan Khan
FAST-National University of Computer and Emerging Sciences, Karachi

Abstract

Big data systems encompass massive amounts of data, which can be utilized for analysis and extraction of information in many application areas such as financial forecasting, social networking, and DNA analysis. This paper describes the big data curriculum at FAST-NUCES, Karachi. The approach is motivated by the need to impart up-to-date knowledge to both undergraduate and graduate students. The curriculum aims to prepare data scientists who are ready to take on challenges in all the leading domains of big data, such as data analysis, programming, and system design. Our efforts span five elective courses, which are offered regularly. We highlight the goals of our curriculum and describe our approach to meeting them. We also discuss various lessons learned while imparting knowledge to students.

Keywords: Big data systems, Data-intensive Computing, Information Retrieval, Visualization, Security, High Performance Computing

Introduction

Recent technological advancements allow massive collection of data for computational analysis in many computer systems such as social networks, web search, weather analysis, financial forecasting, and geological updates. Consequently, big data computing, or data-intensive computing, has emerged as a leading branch of computer science. Data itself is like crude oil: its value is insignificant unless it is processed and analyzed to uncover underlying patterns and make future predictions. Therefore, big data systems have immense potential for unleashing information through meticulous and extensive analysis of data [1]. At the same time, they pose significant challenges such as scalability, networking, security and privacy, information extraction, mining, and visualization. It is also important that new tools are studied and explored. Realizing this emergence and significance, it is critical for teachers and educationists to update the curriculum [2] and impart the necessary knowledge to students.

The purpose of this paper is to provide a detailed description of the big data curriculum at a leading computer science university in Pakistan. The curriculum has been designed for both undergraduate and graduate students. The paper is intended to help the academic community in updating curricula and in assessing their strengths and weaknesses.

Background Information: The curriculum described in this paper has been adopted at FAST-National University of Computer and Emerging Sciences. The university is a leading computer science university in Pakistan with campuses in five major cities. This paper focuses on the curriculum followed at the Karachi campus.

About the department: The CS department at the Karachi campus has a diverse student population. There are approximately 1200 BS students and 150 graduate students in the department. Around 15-20% of them are from rural areas, and approximately 10% of the students are female. The department is selective; around 20% of applicants seeking admission are admitted each fall.

Considerations: The above demographics highlight the diversity of the students. Further, students have a strong desire for up-to-date knowledge in the curriculum.

Pedagogical Goals

Realizing the significance of big data, its potential impact on the job market, and the need for imparting quality education, the CS department has designed an extensive curriculum consisting of five courses for graduate and undergraduate students.

Table 1 - Pedagogical Goals

Goal | Description | Approach and Method
G1 | To prepare system designers for big data. | Impart platform-related knowledge, such as scalability and networking, to undergraduate and graduate students.
G2 | To prepare students with a strong foundation in data analytics. | Equip students with big data algorithms and concepts.
G3 | To motivate students about the potential and impact of data science. | Map big data algorithms to real-world problems.
G4 | To prepare data scientists and analysts who can meet growing requirements. | Enrich students with the latest tools and techniques that are useful for big data.

Table 1 lists the pedagogical goals for our curriculum, along with our approach and methods for accomplishing them. All four goals are focused on meeting our overall aim of preparing better data scientists and system designers. Goal 1 is necessary in order to build efficient systems for big data. Goal 2 helps us in building a strong foundation for the curriculum. Goal 3 is significant for enhanced learning, whereas Goal 4 is important for improving student proficiency.

Courses

We have designed courses at both the undergraduate and graduate level to accomplish these goals. The courses are offered as electives to BS (CS) and MS (CS) students. Considering that a big data curriculum involves many topics, such as scalability, networking, parallel computing, replication, mining, information extraction, visualization, algorithms, and tools, we felt it necessary to offer a diverse set of electives. We now briefly describe our courses.

1) High Performance Computing (BS)

The course covers parallel computing concepts, algorithms, and platform-related issues [3], with a special focus on MapReduce and Hadoop. Since parallel computing provides an important direction for solving big data problems, this course is significant in the big data curriculum. Further, teaching data-intensive computing requires special consideration of issues such as scalability and networking, and coverage of these topics provides enhanced learning. For instance, Hadoop's concept of data locality is important in reducing data transfer time.

2) Introduction to Cloud Computing (BS)

With the increase in computational resources required to store and process huge amounts of data, cloud computing has come to the forefront. It offers computing as a service, be it storage or processing power. The main focus of this course is to introduce the cloud platform to students. The discussion covers the methods used to implement cloud services, including virtualization, networking, and security and privacy in the cloud, as well as the use of these services for data- and compute-intensive tasks using the MapReduce framework. Specifically, Hadoop is used for computation and HBase as the data store, along with files.
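Both of these courses center on Hadoop's MapReduce model. As a minimal sketch of the programming style students practice (illustrative only; the actual course assignments and scripts may differ), the classic word-count job can be written as a Hadoop Streaming mapper and reducer in Python:

```python
#!/usr/bin/env python3
# Illustrative word-count sketch in the Hadoop Streaming style; not the exact
# assignment used in the courses. Run with "map" or "reduce" as the argument.
import sys

def mapper():
    # Emit <word, 1> for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so counts for the
    # same word arrive consecutively and can be summed in a single pass.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The same script can be tested locally without a cluster (for example, `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`), which also illustrates why MapReduce relies on a sort-and-shuffle phase between the two stages.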

Table 2 - Accomplishments of Goals

Goal | Topics / How Covered | Courses
G1 | Scalability, Fault Tolerance, Data Locality, Virtualization, Networking, Security and Privacy, Hadoop/MapReduce | HPC (Undergraduate) + Introduction to Cloud (Undergraduate) + Cloud Computing (Graduate)
G2 | Dimensionality Reduction, Efficient Data Structures, Mapping Data Mining Algorithms to Big Data, Trade-offs among Space, Time, and Accuracy of Algorithms, Parallel and Distributed Algorithms | Big Data Analytics (Graduate) + HPC (Undergraduate) + Graph Algorithms (Undergraduate + Graduate)
G3 | Assessments (Quizzes, Assignments, Exams), Course Projects, Theses, PhD Dissertations (in preparation) | Big Data Analytics (Graduate) + HPC (Undergraduate) + Graph Algorithms (Undergraduate + Graduate) + Introduction to Cloud (Undergraduate)
G4 | R Language, Hadoop/MapReduce, RapidMiner, Apache Mahout, MATLAB, HBase | Big Data Analytics (Graduate) + HPC (Undergraduate) + Graph Algorithms (Undergraduate + Graduate) + Introduction to Cloud (Undergraduate)

3) Graph Algorithms (BS/MS)

The course covers topics in graph theory, such as the Marriage Theorem, network flows, and Eulerian and Hamiltonian graphs, which are useful for social networking problems. Students are introduced to real-world problems such as the travelling salesman problem, the transportation problem, the web as a graph, and social network analysis. Course assignments involve applying graph algorithms in MATLAB. Although the main focus of the course is to introduce the fundamentals of graph theory, students also learn about data representation and visualization using different tools, namely MATLAB, R, and Gephi. For undergraduate students, familiarity with visualization and these tools is a step toward data science.
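The network-flow material lends itself to short programming exercises. As an illustration only (the course assignments use MATLAB; the library and toy example below are assumptions), a maximum-flow computation on a small network can be expressed in Python with NetworkX:

```python
# Illustrative sketch only: the course assignments use MATLAB, but the same
# network-flow exercise can be expressed with Python's NetworkX library.
import networkx as nx

# A small directed network with edge capacities (a toy transportation problem).
G = nx.DiGraph()
G.add_edge("source", "a", capacity=4)
G.add_edge("source", "b", capacity=3)
G.add_edge("a", "c", capacity=3)
G.add_edge("b", "c", capacity=2)
G.add_edge("a", "sink", capacity=2)
G.add_edge("c", "sink", capacity=4)

# Maximum flow from source to sink (max-flow/min-cut, covered under network flows).
flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
print("Maximum flow:", flow_value)
print("Flow on each edge:", flow_dict)
```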

4) Big Data Predictive Analytics (MS/PhD)

The main focus of the course is the algorithms and tools used for data analysis in a big data context. To enable big data processing, associated technologies related to data management and visualization are also discussed. Specifically, the course introduces graduate students to the Hadoop and MapReduce paradigms for big data storage and processing. In addition, students learn R to carry out data mining and data visualization tasks. Students are also expected to complete a project related to big data analytics to put the theory into a practical perspective. Class assignments focus on analyzing data with WEKA and RapidMiner, tools that are freely available. Exposure to such tools helps students become familiar with what the job market requires.
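The assignments themselves are carried out in R, WEKA, and RapidMiner; purely as an illustrative stand-in, the same kind of train-and-evaluate data mining task can be sketched in Python with scikit-learn:

```python
# Illustrative stand-in only: the course's assignments use R, WEKA, and
# RapidMiner. This Python/scikit-learn sketch shows the same kind of workflow:
# split a small dataset, fit a classifier, and report held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```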

5) Cloud Computing (MS/PhD)

This course focuses on teaching data-intensive platforms. Cloud computing serves as a major platform for solving big data problems. The course also covers networking issues; for instance, in a resource-constrained environment for big data, software-defined networking can be very significant.

Student Projects and Theses

The adopted curriculum has led to increased uptake of big data related projects by students. For instance, a project in the Big Data Analytics course predicted election results in Pakistan using Twitter data [4]. Similarly, one of the interesting projects in the Graph Algorithms course was "Exploratory Analysis of Social Media using Political Tweets". In this project, a social network of 37 Twitter accounts connected by roughly 98K tweets was analyzed using Python and Gephi. A number of students have opted for Hadoop and HBase in their senior projects to process large data sets. Computing resources available on Amazon AWS and Microsoft Azure were utilized, with students setting up their own Hadoop clusters for processing and using the available data stores. The data required for analysis was obtained from readily available sources such as Wikipedia and Reuters. Some projects and MS theses have also been offered on data analysis, with a focus on text mining, predictive analytics, cloud computing, and related areas. In addition, several PhD students are working in the area of big data.
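To give a flavor of how such a project can be assembled (the file layout, field names, and script below are hypothetical and not taken from the actual student project), tweets can be turned into a mention graph in Python and exported for visual exploration in Gephi:

```python
# Hypothetical sketch: build a Twitter mention graph and export it for Gephi.
# The input format and field names are assumptions for illustration; they are
# not taken from the actual "Exploratory Analysis of Social Media" project.
import json
import networkx as nx

G = nx.DiGraph()
with open("political_tweets.json") as fh:   # assumed: one JSON tweet per line
    for line in fh:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            target = mention["screen_name"]
            # Weight each edge by how often one account mentions another.
            weight = G.get_edge_data(author, target, default={}).get("weight", 0)
            G.add_edge(author, target, weight=weight + 1)

print(G.number_of_nodes(), "accounts,", G.number_of_edges(), "mention edges")
# GEXF is one of the formats Gephi reads directly for layout and exploration.
nx.write_gexf(G, "mention_graph.gexf")
```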

Challenges and Lessons Learned

Our courses have been designed carefully to meet our goals. However, there are a few important points which we would like to share with the community.

1) While big data has massive potential for solving local problems, the availability of big data suited to the local context remains a concern. Due to privacy concerns and the absence of relevant laws, many organizations in developing countries such as Pakistan are reluctant to share data that fits the local context. We faced this challenge on numerous occasions while offering the courses.

2) Massive resources, including computational power, networking, and storage, are required to store, retrieve, compute, and analyze big data. Due to resource limitations, we relied on cloud platforms (Amazon AWS and Microsoft Azure) to address this problem. However, access to the cloud introduces networking delays.

3) We observed that students have been quite eager to learn about big data. They have been motivated to build their senior projects and course projects around big data. In this context, programming and exposure to tools serve as a major platform for learning.

Conclusion

Teaching a curriculum for data-intensive computing requires extensive effort. We believe that global cooperation is required to meet these challenges. Further, educationists should emphasize comprehensive coverage in the curriculum; topics should include programming methods, tools, platforms, networks, and algorithms.

Acknowledgements

This work is supported in part by the IEEE TCPP Early Adopter Award 2012 [5], an Amazon AWS Educational Grant [6], and a Microsoft Azure Grant [7].

References

1. Shamsi, Jawwad, Muhammad Ali Khojaye, and Mohammad Ali Qasmi. "Data-intensive cloud computing: requirements, expectations, challenges, and solutions." Journal of Grid Computing 11.2 (2013): 281-310.

2. USC - Department of Computer Science, Viterbi School of Engineering - Data Science (http://www.cs.usc.edu/academics/masters/msdata.htm)

3. Shamsi, Jawwad, Nouman Durrani, and Nadeem Kafi. "Teaching High Performance Computing with a Practical Approach." Poster, EduPar Workshop, IPDPS 2014.

4. Mahmood, Tariq, Tasmiyah Iqbal, Farnaz Amin, Wajeeta Lohanna, and Atika Mustafa. "Department of Computer Science National University of Computer & Emerging Sciences Karachi, Pakistan." In Multi Topic Conference (INMIC), 2013 16th International, pp. 49-54. IEEE, 2013.

5. Shamsi, Jawwad. NSF Early Adopter Award for Teaching High Performance Computing. Fall 2012.

6. Shamsi, Jawwad and Atika Burney. AWS Educational Grant, 2011-2014.

7. Shamsi, Jawwad and Atika Burney. Microsoft Azure Educational Grant, 2012-2014.
