Assistance in computer programming learning using educational data mining and learning analytics

Fernández-Medina, Carlos; Pérez-Pérez, Juan Ramón; Álvarez-García, Víctor M.; Paule-Ruiz, M. del Puerto
Department of Computer Science, University of Oviedo, Spain
carlosfernandezmedina@gmail.com, [email protected], [email protected], [email protected]

ABSTRACT
The learning of programming presents many difficulties for students. Nowadays, a number of software tools enable students in programming courses to develop and exercise their knowledge and skills. However, these tools do not examine students' work or give them indications on their learning process. In this paper we introduce a learning approach for programming based on the analysis of students' mistakes during practical lessons in programming subjects. The approach uses compiler messages, analysing their quantity and semantic value to report individual and comparative learning progress. It is illustrated in practice by a case study conducted in a class of undergraduate computer science students. The study provides an analytic representation of reflective learning practice, giving us a better understanding of programming learning processes.

Categories and Subject Descriptors K.3.2 [Computer and Information Science Education]: Computer science education; D.3.3 [Programming Languages]: Language Constructs and Features; H.2.8 [Database Applications]: Data mining.

General Terms Algorithms, Design, Experimentation, Languages.

Keywords Learning analytics, programming language, programming errors, Integrated Development Environment, Eclipse plug-ins, e-learning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’10, Month 1–2, 2010, City, State, Country. Copyright 2010 ACM 1-58113-000-0/00/0010…$10.00.

1. INTRODUCTION
The learning of programming presents many difficulties for novice programmers [15], since it requires students to understand abstract concepts and master the lexicon, syntax and semantics of programming languages. Some studies suggest, as an option to

overcome these difficulties, that students learn programming concepts by solving problems [14] or by making learning more active [16]. Another interesting approach is the use of learning indicators that provide visual feedback to students in an easier way [9]. These approaches promote more interaction during classes and aim at fostering students' engagement in the learning activities.

The feedback provided to students plays a crucial role in achieving these goals [18]. In the purely instructional sense, feedback can be said to describe any communication or procedure given to inform a learner of the accuracy of a response, usually to an instructional question [5, 13, 22]. According to this definition, the main purpose of feedback is to confirm or change a student's knowledge as represented by answers to practice or test questions. However, some researchers [3] have suggested that viewing feedback in such a unilateral context fails to take into account variations in behaviour that might be the result of self-regulation and student engagement. Tucker [24] points out that feedback is particularly important when evaluating dynamic instructional programs because its presence or absence can “dramatically affect the accuracy required of human judgment and decision making” (p. 303). Altogether, these considerations underline the relevance of feedback in settings where learning takes place.

In the particular case of programming students, they normally make use of integrated development environments (IDEs) for coding activities. IDEs facilitate the resolution of tasks assigned by teachers during programming laboratory classes. However, in our experience, they do not adequately support the learning of programming, mainly because of the insufficient information they provide.
IDEs represent the mistakes made in coding activities as warning and error messages, which are sometimes complex and hardly understandable for novice programmers. As a result, development environments provide limited feedback, more suitable for experienced programmers than for students, who are not able to properly connect the error messages with their associated learning concepts [12]. In order to overcome the pedagogical limits of IDEs and enhance the assistance given to students while they learn programming, we propose a model, called Colmena, that makes use of data mining [21] and learning analytics techniques [7] to gather the data provided by compilers and generate more complete feedback and guidance on the learning process. Thus, our approach uses the errors and warnings in the source code to

generate a complete report on students' individual and group development and progress over a period of time. This approach is illustrated in practice by a case study conducted in a class of undergraduate computer science students. A software prototype based on the model makes it possible to provide a statistical analysis of the students' individual and group performance in each learning cycle, giving an analytic representation of reflective learning practice and a better understanding of programming learning processes.

The remainder of this paper is organised as follows: Section 2 reviews the literature on related topics such as programming learning, data mining and learning analytics. Section 3 describes the model in detail, explaining the theoretical basis of the system. Section 4 describes the software and discusses the use case scenario with students. Finally, conclusions and future work are outlined in Section 5.

2. RELATED WORK
The learning of a programming language can be an arduous task for students. Several authors have studied the problematic concepts and skills that students have to learn in programming subjects [15, 20]. From the teachers' perspective, programming learning requires constant assistance. For this purpose, code revision is a valuable tool for teachers, and it is also important that students can review the code by themselves, as Zeller remarks [27]. Whereas the individual analysis of source code requires a lot of effort, the application of data mining techniques makes it possible to investigate automatic revision processes. Data mining offers different techniques for gathering the information necessary to review source code. Word frequency analysis and keyword classification of log messages in CVS version archives can identify the purpose of changes [17], and association rule mining [26, 28] makes it possible to observe programming processes and programmers' behaviour. Other studies focus specifically on defect analysis: using logs to find new errors [25], predicting fault-prone files [19] or searching for programming error patterns and providing marks for misunderstood language features [10]. Jadud [11] applies statistical techniques in the programming learning domain, using information collected at compilation time to explore novice students' compilation behaviour.

To ensure that the information obtained through these techniques is properly presented to users, some authors have suggested different learning analytics techniques. The Society for Learning Analytics Research (SoLAR, http://solaresearch.org/) defines learning analytics as “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs” [1].
Some years before, in a similar way, the field of educational data mining proposed research on techniques and approaches to understand the learning process through analytics and data. Recently, these disciplines have been converging towards similar approaches [23] and are getting more support from the research community. The dashboards proposed by Duval [7] or the learning indicators described by Glahn [9] are examples of visualisation techniques provided by learning analytics. Learning indicators make it easier to monitor the learning process, providing a simplified representation of the

information. These visualisation techniques are frequently applied to data extracted from traditional learning management systems. However, we consider that they could also be transferred to other learning environments, such as the integrated development environments that usually support the learning of programming in higher education. Programming IDEs usually have specific panels or views that depict compiler messages, but learning activities make it necessary to capture and provide more relevant information. In our approach we combine educational data mining and learning analytics, and we apply them to integrated development environments in order to create an extension of the programming environment that can be used to improve the learning process.

3. COLMENA: ASSISTANCE MODEL FOR PROGRAMMING LEARNING SUPPORT
Learning environments should be capable of collecting and providing quantitative and qualitative feedback on students' learning experiences. This reasoning has led us to propose an assistance model, named “Colmena”, that offers support and guidance to teachers and students involved in programming learning activities. The strategy for achieving this goal consists in extracting and analysing information on students' coding practices by applying data mining techniques, and in transforming the information and displaying the feedback by means of learning analytics. On the basis of the work by Glahn [9], Colmena has been envisaged as a layered model, where each layer has well-defined responsibilities for extracting, processing, analysing and displaying the information (Figure 1).

Figure 1. Diagram of Colmena, an assistance model for programming learning support.

The four-layered model has been designed to cover all the aspects necessary for gathering and reflecting students' learning progress. The layers communicate through interfaces designed to link different types of information while keeping the layers independent of each other. As a result, each layer has its own mechanisms, which can be modified or replaced without altering the behaviour of the rest of the system.

Data extraction layer
The first step of our model consists in extracting the compiler messages (warnings and errors) detected during programming exercises. This information can be recovered directly from the messages generated in integrated development environments and from log files. Whenever a resource belonging to a programming project is modified, the IDE stores messages, which can subsequently be persisted and used as sources for further analysis.
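The prototype described later reads these messages through the Eclipse marker API, which cannot run standalone; as a self-contained illustration of the same extraction step, the sketch below parses javac-style diagnostic lines (`file:line: severity: text`) into records ready for persistence. The input format, class and field names here are our own assumptions, not the paper's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extraction-layer sketch: turns raw compiler diagnostics into
 *  structured records that can later be persisted and analysed. */
public class MessageExtractor {

    public record CompilerMessage(String file, int line, String severity, String text) {}

    // javac-style diagnostic: "Main.java:12: error: ';' expected"
    private static final Pattern DIAGNOSTIC =
        Pattern.compile("^(.+?):(\\d+):\\s*(error|warning):\\s*(.+)$");

    /** Keeps only the lines that match the diagnostic pattern. */
    public static List<CompilerMessage> extract(List<String> rawLines) {
        List<CompilerMessage> messages = new ArrayList<>();
        for (String line : rawLines) {
            Matcher m = DIAGNOSTIC.matcher(line);
            if (m.matches()) {
                messages.add(new CompilerMessage(
                    m.group(1), Integer.parseInt(m.group(2)), m.group(3), m.group(4)));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<CompilerMessage> msgs = extract(List.of(
            "Main.java:12: error: ';' expected",
            "  int x = 3",  // continuation line, ignored
            "Main.java:20: warning: variable y is never used"));
        // prints: 2 messages, first severity: error
        System.out.println(msgs.size() + " messages, first severity: " + msgs.get(0).severity());
    }
}
```

In the real plug-in, the equivalent records would additionally carry the student, group and session metadata described in Section 4.1.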

Java programming language and the Eclipse IDE, both of which are well known by second-year students. The purpose of this study is to achieve a better understanding of the common mistakes made by students, as well as analysing their results from developing real software projects.

Table 1: Topics and programming concepts of each practical session.

Session | Topic | Programming concepts
S1 | Dynamic Programming | Simple and array variables usage
S2 | Dynamic Programming | Simple and array variables usage
S3 | Backtracking | Variables usage and recursive methods
S4 | Branch and Bound | Genericity
S5 | Branch and Bound | Genericity
S6 | Parallel divide and conquer | Object variables usage and parallel structure

Data processing layer
This layer is responsible for filtering and processing the messages collected from the extraction layer. Filtering takes the compiler messages, removes redundant data, and launches a classification analysis in which errors and warnings are grouped in clusters or families. These families of errors refer to different code elements, such as methods, constructors, types, internals, etc., which can be obtained beforehand from programming language taxonomies [2]. Once compiler messages have been filtered and grouped by families, this layer performs a statistical study of students' individual, group and session performance, obtaining quantitative measures of the total number and frequency of errors.

Information analysis layer
Teachers can use their expertise in the subject to determine the match between compiler messages and their associated semantic concepts; more specifically, to establish the link between errors/warnings, their associated family of errors, and the programming concept they represent. Furthermore, it is also possible to determine techniques and best programming practices to provide learning support that can be reused in similar situations in the future.

Information visualisation layer
The dashboard can be integrated with the IDE to provide a visualisation tool for teachers and students. Teachers can consult a comprehensive and well-organised presentation of the learning process and track the progress made by students, both from an individual and from a group perspective. At the same time, students can visualise their individual information to increase awareness of their study and learning outcomes. One example is a selected list of programming concepts that have been misunderstood and require revision. This final step aims at a better understanding of the learning needs and further helps professors and students to make decisions.
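A minimal sketch of the statistical step of the data processing layer, assuming each filtered message has already been labelled with a family name: it computes the absolute total and the relative frequency of each family for one student, group or session. Class and method names are illustrative, not taken from the prototype.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Processing-layer sketch: per-family totals and frequencies. */
public class ErrorStatistics {

    /** Absolute count of messages per family. */
    public static Map<String, Integer> totals(List<String> families) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String f : families) counts.merge(f, 1, Integer::sum);
        return counts;
    }

    /** Relative frequency (0..1) of each family. */
    public static Map<String, Double> frequencies(List<String> families) {
        Map<String, Double> freq = new TreeMap<>();
        int n = families.size();
        totals(families).forEach((f, c) -> freq.put(f, (double) c / n));
        return freq;
    }

    public static void main(String[] args) {
        // One invented session: family label of each recorded message.
        List<String> session = List.of("Syntax", "Type", "Syntax", "Internal", "Syntax");
        System.out.println(totals(session));       // {Internal=1, Syntax=3, Type=1}
        System.out.println(frequencies(session));  // {Internal=0.2, Syntax=0.6, Type=0.2}
    }
}
```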

4.1 Monitoring. Data extraction and processing layers
We have developed a prototype that captures and analyses compiler messages retrieved from the IDE. Specifically, we have implemented a plug-in for Eclipse [4] that extracts and analyses the information about the errors and warnings that each student generates individually while programming. Although this study uses Java and Eclipse, the proposed model can be applied to other programming languages and environments. One of the advantages of this approach is its portability: students can install and use the plug-in in other locations besides the university laboratories, and data is automatically uploaded to the server, where updated information can be tracked and analysed at any moment.

4. ASSISTANCE MODEL IN PRACTICE
In this work we introduce a general and flexible model that can find application in different learning environments. Based on our experience as Computer Science lecturers, we have applied the model to programming learning in higher education, particularly in a second-year subject of the degree in Computer Science. The main competences of the subject are aimed at having students learn and apply algorithmic techniques such as dynamic programming, backtracking, branch and bound, and parallel divide and conquer, as well as relevant program elements and transversal programming language concepts such as classes, methods, fields, expressions, type conversions, libraries, modularity and generic programming (Table 1). For this study, we have chosen to use the

Figure 2: Flow diagram of the data extraction and processing layers.

Table 2. Ranking of the most frequent errors in relation to their families and learning concepts.

Error/Warning | Family | Classification | Concept | Concept Explanation
Value of field/local variable X is not used | Internal | Warning | Variables Definition, Scope of Variables | A variable or field was declared but never used.
Syntax error, insert X to complete statement | Syntax | Syntax Error | Code Blocks usage | Incomplete code structure: a final semicolon, bracket, etc. is missing.
X cannot be resolved to a variable | Field | Syntax Error | Variables Definition, Scope of Variables | A variable is used but never declared.
X cannot be resolved to a type | Type | Syntax Error | Variables Definition, Import Usage | The code tries to use a variable of a type that does not exist.
Type mismatch: cannot convert from X to X | Type | Semantic Error | Types, Conversion of Types | Possible loss of precision between variables.
X is a raw type. References to generic type X should be parameterized | Type | Genericity Warning | Genericity, Types, Type Checking | Interface X is used but never parameterized with the type of the object that implements it.
The import X is never used | Import | Import Warning | Language Modularity | A class not existent in the package or project was imported.
References to generic type Y should be parameterized | Type | Genericity Warning | Genericity, Types, Type Checking | Method X returns a generic type; its return type should be parameterized.
The method X(parameters) is undefined for the type Y | Method | Semantic Error | Method declaration, Language Modularity | Method X is called, but it is not declared in class Y.

Eclipse has a specific panel for problems, where each error and warning is represented as a “marker”. The prototype retrieves all markers from the IDE [8], runs a filter to remove redundant data, and focuses on the new errors and warnings produced in the code section that has been modified most recently (Figure 2). This procedure avoids repeatedly filtering messages that have already been analysed and intentionally left behind by their authors. Integrated development environments such as Eclipse provide students with the file and line number where the error/warning was detected, as well as the compiler message. This data is originally volatile and not stored. In our prototype, besides filtering the data generated by the IDE, compiler messages are associated with the student/programmer, group and programming session, and all the information is persisted in the system to make it available for further analysis. Data processing applies data mining techniques to the filtered errors/warnings in order to cluster them into families of errors. We use a classification taxonomy based on the following sources:

● Conceptual family [2]. There are three main groups of compile-time errors:
  ○ Syntax: belonging to the grammar of the programming language.
  ○ Semantic: related to the semantic rules of the programming language.
  ○ Structural: referring to the structure of the programming language.
In addition to Ben-Ari's groups, we propose adding a new group for warning messages, which relates to bad programming practices that may result in execution-time errors.

● Compiler error family. Classification of implementation errors according to the IDE. For instance, Eclipse internally uses the following classification for compile-time errors:
  ○ Field: declaration and usage of variables.
  ○ Import: modules added to the code or packages used for development, additional classes, etc.
  ○ Syntax: structure of the programming language.
  ○ Type: types of the variables and the objects declared.
  ○ Method: structure of methods, considering the parameters, return statements, etc.
  ○ Constructor: class constructor method.
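As an illustration of how these two taxonomies could be applied automatically, the sketch below maps a compiler message to a compiler family and a conceptual group using simple substring rules covering a few messages from Table 2. The rules and class names are our own assumptions; the paper does not describe the prototype's matching logic at this level of detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hedged sketch of rule-based message classification into the two
 *  taxonomies: Eclipse compiler family and conceptual group. */
public class FamilyClassifier {

    public record Classification(String compilerFamily, String conceptualGroup) {}

    // Insertion-ordered so more specific patterns are tried first.
    private static final Map<String, Classification> RULES = new LinkedHashMap<>();
    static {
        RULES.put("cannot be resolved to a variable", new Classification("Field", "Syntax"));
        RULES.put("cannot be resolved to a type", new Classification("Type", "Syntax"));
        RULES.put("Type mismatch", new Classification("Type", "Semantic"));
        RULES.put("is a raw type", new Classification("Type", "Warning"));
        RULES.put("is never used", new Classification("Import", "Warning"));
        RULES.put("Syntax error", new Classification("Syntax", "Syntax"));
    }

    public static Classification classify(String message) {
        for (Map.Entry<String, Classification> rule : RULES.entrySet()) {
            if (message.contains(rule.getKey())) return rule.getValue();
        }
        return new Classification("Unknown", "Unknown");
    }

    public static void main(String[] args) {
        // Classification[compilerFamily=Field, conceptualGroup=Syntax]
        System.out.println(classify("x cannot be resolved to a variable"));
        // Classification[compilerFamily=Type, conceptualGroup=Warning]
        System.out.println(classify("List is a raw type. References to generic type List<E> should be parameterized"));
    }
}
```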

4.2 Response. Analysis and visualisation layers
Beyond the statistical processing of the data, the analysis level relies on the expertise of teachers to determine the theoretical programming concepts associated with the errors made by students, and adds this information to the knowledge already acquired by other methods (Table 2). Furthermore, as facilitators of the learning process, teachers are able to provide explanations, guidance and direction to students based on the information held in the model of each student. Smart indicators enable us to represent and visualise programming concepts and families of errors. This information can be used by lecturers to monitor the errors made by students in each session, as well as a source of appropriate feedback on the individual and group learning progress (Figure 3).

challenges. Programming exercises and coding activities are currently supported by integrated development environments (IDEs), but in our experience IDEs provide insufficient information to adequately support the learning of programming. Teachers can use their knowledge to facilitate the learning process, but both teachers and students can benefit from a more effective use of IDEs. Technologies themselves do not directly cause learning to occur, but they can afford certain tasks that may result in learning or give rise to certain learning benefits [6]. To accomplish this objective, we propose a layered and flexible assistance model for programming learning support, called Colmena, where each layer has well-defined responsibilities for extracting, processing, analysing and displaying the information. Colmena extends the capabilities of current IDEs, collecting and providing quantitative and qualitative feedback on students' programming learning experiences. In this research we suggest the use of data mining techniques to extract and process data on students' coding practices, as well as learning analytics to analyse and transform the information and display the feedback. The flexibility of the model allows it to be used in a wide range of learning environments. In this study, we have applied the model to programming learning in higher education, particularly in a second-year subject of the degree in Computer Science. The purpose of the study is to achieve a better understanding of the common mistakes made by students, as well as to analyse their results from developing real software projects. Statistical analysis of this data has allowed us to identify error patterns, classify them into different families, and represent the information through learning indicators.
The use of indicators allows lecturers to visualise the current state of their students in each learning cycle and further helps professors and students to make decisions.

Figure 3. Indicator of families of errors and associated percentages in each session.

The analysis and visualisation of errors presented by the system reveals patterns of common errors. For example, the experience described in this section suggested the following:

● The frequency of syntax errors decreases along programming sessions. Language syntax is a key programming concept, and it is present in all the exercises. Reducing errors of this family allows students to focus on other programming aspects.
● Internal errors, related to the declaration and scope of variables, are very common; they accounted for at least 20% of all errors in every session.
● For other families, the frequency of errors is mainly determined by the particular topic and the pedagogy used in the exercise. For instance, typing errors are more frequent when new programming elements related to data structures and algorithms are introduced (sessions 4 and 5).
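The per-session percentages behind an indicator like the one in Figure 3 can be sketched as follows. The session data below is invented for the example; only the computation (each family's share of a session's total) mirrors the described indicator.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Visualisation-layer sketch: per-session percentage of each error family. */
public class SessionIndicator {

    /** families: one family label per recorded message in the session. */
    public static Map<String, Double> percentages(List<String> families) {
        Map<String, Double> pct = new TreeMap<>();
        for (String f : families) pct.merge(f, 1.0, Double::sum);
        pct.replaceAll((f, c) -> 100.0 * c / families.size());
        return pct;
    }

    public static void main(String[] args) {
        // Invented data for two sessions.
        Map<String, List<String>> sessions = new LinkedHashMap<>();
        sessions.put("S1", List.of("Syntax", "Syntax", "Internal", "Type"));
        sessions.put("S4", List.of("Type", "Type", "Type", "Internal", "Syntax"));
        // S1 -> {Internal=25.0, Syntax=50.0, Type=25.0}
        // S4 -> {Internal=20.0, Syntax=20.0, Type=60.0}
        sessions.forEach((s, fams) -> System.out.println(s + " -> " + percentages(fams)));
    }
}
```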

5. CONCLUSIONS AND FUTURE WORK
The learning of programming presents many difficulties for students, as it involves a series of conceptual and technical

In our further research, we will apply the model in a variety of learning domains. We also propose to assess the degree of accomplishment of the learning objectives, that is, to determine to what extent information about the learning process is adequately provided and used as a means to enhance learning. At the same time, we consider that the capture and analysis of students' behavioural patterns, along with the use of indicators, would allow for a better understanding of the learning progress and enable IDEs to be adapted to groups of students and individual learning contexts.

6. ACKNOWLEDGMENTS This work has been funded by the Department of Science and Innovation (Spain) under the National Program for Research, Development and Innovation: project TIN2011-25978 entitled Obtaining Adaptable, Robust and Efficient Software by including Structural Reflection to Statically Typed Programming Languages.

7. REFERENCES
[1] About | Society for Learning Analytics Research: http://www.solaresearch.org/mission/about/. Accessed: 2012-12-27.

[2] Ben-Ari, M.M. 2007. Compile and Runtime Errors in Java.

[3] Butler, D.L. and Winne, P.H. 1995. Feedback and selfregulated learning: A theoretical synthesis. Review of educational research. 65, 3 (1995), 245–281.

[4] Clayberg, E. and Rubel, D. 2008. Eclipse Plug-ins. Addison-Wesley Professional.


[5] Cohen, V.B. 1985. A Reexamination of Feedback in Computer-Based Instruction: Implications for Instructional Design. Educational Technology. 25, 1 (1985), 33–37.

[17] Mockus, A. and Herbsleb, J.D. 2002. Expertise browser: A quantitative approach to identifying expertise. Proceedings - International Conference on Software Engineering (2002), 503-512.

[6] Dalgarno, B. and Lee, M.J.W. 2010. What are the learning affordances of 3-D virtual environments? British Journal of Educational Technology. 41, 1 (2010), 10–32.

[7] Duval, E. 2011. Attention please!: learning analytics for visualization and recommendation. Proceedings of the 1st International Conference on Learning Analytics and Knowledge (New York, NY, USA, 2011), 9–17.

[18] Mory, E.H. 2004. Feedback research revisited. Handbook of Research on Educational Communications and Technology. (2004), 745–783.

[19] Ostrand, T.J. and Weyuker, E.J. 2004. A tool for mining defect-tracking systems to predict fault-prone files. IEE Seminar Digests (2004), 85–89.

[8] Fernández-Medina, C., Pérez-Pérez, J.R., Paule-Ruiz, M.P. and Álvarez-García, V.M. 2011. COLMENA: Collaborative knowledge and user classification environment based on programming experience. Proceedings of the VIII Multidisciplinary Symposium on Design and Evaluation of Digital Content for Education (Ciudad Real, Spain, Jul. 2011), 50–58.

[20] Ragonis, N. and Ben-Ari, M. 2005. A long-term investigation of the comprehension of OOP concepts by novices. Computer Science Education. 15, 3 (2005), 203-221.

[9] Glahn, C., Specht, M. and Koper, R. 2008. Smart indicators to support the learning interaction cycle. International Journal of Continuing Engineering Education and Life-Long Learning. 18, 1 (2008), 98-117.

[22] Sales, G.C. 1993. Adapted and adaptive feedback in technology-based instruction. Interactive instruction and feedback. (1993), 159–175.

[10] Hovemeyer, D. and Pugh, W. 2004. Finding bugs is easy. SIGPLAN Not. 39, 12 (Dec. 2004), 92–106.

[11] Jadud, M.C. 2006. Methods and tools for exploring novice compilation behaviour. Proceedings of the Second International Workshop on Computing Education Research (2006), 73–84.

[12] Kölling, M., Quig, B., Patterson, A. and Rosenberg, J. 2003. The BlueJ System and its Pedagogy. Computer Science Education. 13, 4 (2003), 249–268.

[13] Kulhavy, R.W. 1977. Feedback in written instruction. Review of Educational Research. (1977), 211–232.

[14] Kumar, A. 2003. Learning programming by solving problems. Informatics Curricula and Teaching Methods. 117, (2003), 29–39.

[15] Lahtinen, E., Ala-Mutka, K. and Järvinen, H.-M. 2005. A study of the difficulties of novice programmers. Proceedings of the 10th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (New York, NY, USA, 2005), 14–18.

[16] McConnell, J.J. 1996. Active learning and its use in Computer Science. SIGCSE Bulletin (Association for Computing Machinery, Special Interest Group on Computer Science Education). 28, Special Issue (Jan. 1996), 52–54.

[21] Romero, C. and Ventura, S. 2010. Educational Data Mining: A Review of the State of the Art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 40, 6 (Nov. 2010), 601–618.

[23] Siemens, G. and Baker, R.S.J.D. 2012. Learning analytics and educational data mining: Towards communication and collaboration. ACM International Conference Proceeding Series (2012), 252–254.

[24] Tucker, S.A. 1993. Evaluation as feedback in instructional technology: The role of feedback in program evaluation. Interactive Instruction and Feedback. Englewood Cliffs, NJ: Educational Technology Publications. (1993), 301–342.

[25] Williams, C.C. and Hollingsworth, J.K. 2005. Automatic mining of source code repositories to improve bug finding techniques. IEEE Transactions on Software Engineering. 31, 6 (Jun. 2005), 466–480.

[26] Ying, A.T.T., Murphy, G.C., Ng, R. and Chu-Carroll, M.C. 2004. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering. 30, 9 (Sep. 2004), 574–586.

[27] Zeller, A. 2000. Making students read and review code. ACM SIGCSE Bulletin (2000), 89–92.

[28] Zimmermann, T., Weißgerber, P., Diehl, S. and Zeller, A. 2004. Mining version histories to guide software changes. Proceedings - International Conference on Software Engineering (2004), 563–572.
