Increasing the Effectiveness of Automated Assessment by Increasing Marking Granularity and Feedback Units

Nickolas Falkner, Rebecca Vivian, David Piper and Katrina Falkner
School of Computer Science, The University of Adelaide
Adelaide, South Australia, Australia, 5005
[email protected]

ABSTRACT

Computer-based assessment is a useful tool for handling large-scale classes and is extensively used in the automated assessment of student programming assignments in Computer Science. The forms that this assessment takes, however, can vary widely, from simple acknowledgement to a detailed analysis of output, structure and code. This study focusses on output analysis of submitted student assignment code and the degree to which changes in automated feedback influence student marks and persistence in submission. Data was collected over a four-year period across 22 courses, but we focus on one course in this paper. Assignments were grouped by the number of different units of automated feedback delivered per assignment, to investigate whether students changed their submission behaviour or performance as the possible set of marks that a student could achieve changed. We discovered that pre-deadline results improved as the number of feedback units increased and that post-deadline activity also increased as more feedback units were available.

Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information Science Education - Computer Science Education

General Terms
Human Factors

Keywords
automated assessment, feedback, student performance

1. INTRODUCTION

Automated assessment is used for many different assignment types, including quizzes and computer programs. Marking student programs is a substantial source of workload in Computer Science courses and is often carried out automatically. Timely feedback is essential to students and, where an automated marking system is in use, we have the opportunity to provide feedback on a work-in-progress, prior to final submission, if we choose to do so. While this raises the spectre of students attempting to 'hack' their solution to match the testing harness [9], a practice Ben-Ari referred to as bricolage [3], this must be balanced against any benefit of providing this early feedback [16]. Automated feedback can feel mechanical and inauthentic and, as noted, can provoke mindless repetitive attempts to second-guess the mechanism, but we will focus on how we can make the most beneficial use of what is a growing necessity.

As we have seen in project-based learning [4], student development is greatly assisted by revision and review to allow steady improvement over time. We therefore have an obvious motivation to provide continuous feedback as students make progress, to encourage their development. An automated system can provide students with a measure of their progress towards achieving a 'successful' program by awarding a response, possibly including a grade, which we define as a feedback unit. The range of feedback units available to most automated systems is a set of scripted responses, with associated marks, awarded on the basis of the comparison and analysis of program output relative to a set of known inputs, syntactic analysis of key structures in the program, limited semantic analysis, or graph analysis to determine correspondence to a known algorithm.

The term grain is often used in the context of fine-grained or coarse-grained marking schemes [18, 17]. We focus on the granularity of marks awarded as an issue of assessment reliability and consistency [7, 18] and the extent to which differences in achievement are reflected in a single mark. We would like marks to reflect progress: a higher mark should indicate that the work being assessed is a genuine improvement upon a previous piece of work that received a lower mark, accompanied by any useful textual feedback. When used as part of pre-submission feedback, this requirement is even more important. However, the marks apparently available to students may not all actually be attainable, and the corresponding textual feedback may thus never be displayed. A range of 0–100 implies a fine-grained scheme, but it is not fine-grained if we only ever award the marks 0, 50 and 100. As we will explore, we use the term granularity in this paper to reflect the actual marking granularity that students have achieved after attempting the assignment, which is the only perception of granularity that is truly meaningful to students. A granularity of one would mean that the same single mark was awarded to all students, whereas a granularity of six would indicate that there are six possible marks available, such as {0, 20, 40, 60, 80, 100} or {0, 1, 2, 4, 7, 10}. The granularity that we make available to students is a combination of the granularity we initially define, modified by the feedback units available through an automated marking system. As we will discuss, what appears to be fine-grained can easily become coarse in implementation.

In this paper, we will discuss the effects of granularity and then assess the impact of changes in the number of feedback units on students' submission times and automatically assessed marks. As part of this, we will identify outstanding questions about our theories regarding student interpretation of marking granularity and discuss follow-up work and our conclusions. While a mark, by itself, does not constitute constructive feedback for improvement, it does provide useful feedback to the student on the degree of progress made, even in the absence of additional text. However, in the majority of cases in this study, each mark awarded was also presented with associated textual feedback identifying which tests had been passed, what this meant and, in some cases, possible places to look for improvement.
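To make the definition concrete, the achieved granularity of an assignment is simply the number of distinct marks actually awarded. The short R sketch below, which uses a hypothetical vector of awarded marks rather than our data, shows how a nominally fine-grained 0–100 scheme can collapse to a granularity of three.

# A minimal sketch of the granularity definition used in this paper.
# 'awarded' is a hypothetical vector of final marks for one assignment,
# nominally marked out of 100.
awarded <- c(0, 50, 50, 100, 50, 0, 100, 100)

# Achieved granularity: the number of distinct marks actually awarded.
length(unique(awarded))   # 3 -- the scheme behaves as {0, 50, 100}, not 0-100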

2. EXISTING WORK

The role of the marking system is to encourage the student to produce a working piece of software, as building something that is complete and functional is of great assistance in inspiring and developing student confidence [5]. The advantages of automated assessment for programming tasks are the same as for any computer-based assessment scheme: objective assessment, removal of halo effects, and a reduction in the manual, error-prone marking burden [11]. Kay et al. [13] further identify the benefits of automated marking in efficiently assessing multiple execution paths in computer programs. A detailed review of existing automated assessment tools within Computer Science is presented in [1], identifying that, in Computer Science, automated assessment is used more frequently than manual assessment methods for practical work. Automated assessment is frequently applied to the areas of functionality, program efficiency, and the development of testing skills, with a small number of specialised automated systems developed to explore assessment criteria more typically handled by manual marking, including structural aspects of design and software metric analysis. Although we may not all agree on a specific numeric grade, the majority of instructors can agree upon whether programs are "very good" or "very bad" [8], supporting the derivation of automated marking scripts that are able to clearly identify both ends of the spectrum. While automated assessment may not be able to assess all criteria of programming assignments to the same degree of quality [1], the additional features that automated marking systems provide, such as repeatability of assessment, timely feedback and formative reassessment, offer alternative pedagogical benefits.

Providing constructive feedback to students improves their ability to correct their mistakes. As observed in [11], multiple-choice questions did not show any difference in performance between automated and manual assessment, while programming assignments did benefit from in-progress feedback, and the automated assessment of these assignments achieved a better result than having the same work coded in a paper-and-pencil environment. Students who have the opportunity to resubmit their work, with feedback, also improve outside of the programming structure, where compiler and execution feedback is inherently more objective.

Considerable existing work explores the issue of assessment granularity and its impact upon student learning. Fincher et al. [7] identify that the granularity of the mark awarded is most easily understood and effective when it matches the weight of the assessment item to which it is attached. However, the granularity of the marking scheme applied can have further impact depending on whether it is coarse- or fine-grained. Sadler [18] illustrates the problem of becoming too fine-grained in our assessment scheme overall, which reinforces [7], but there is a tension between too much detail and insufficient detail. Where too much granularity distorts the initial intent of the learning objective, we risk losing the coherency of the overall assessment scheme. Sadler notes [18] that increasing the granularity of the marking scheme risks the assignment being seen as a set of atomic challenges, unified administratively rather than semantically, reducing a student's ability to integrate understanding in this area of knowledge. [7] and [18] agree that a clear alignment between objectives and learning activities is essential and that finer granularity may be justified providing that a reasonable correspondence is maintained between the mark returned and the actual value of the mark. Sadler [18] explores exemplars as part of a grading scheme, noting that it is rare to find one exemplar that provides all of the desirable learning activities. Automatic marking systems effectively provide comparison to an exemplar and, hence, may not capture the richness needed for fair and equitable assessment. Naudé [14] explores this issue in the development of an automated marking system that exploits graph similarity to avoid simplistic approaches to automated program assessment.

Panadero et al. [16] explore the distinction between self-assessment scripts and rubrics in the development of self-regulated learning behaviour. Rubrics provide criteria for assessing goals, scales for grading achievement levels and a clear description of the different qualitative levels that allows discrimination. Scripts are a step-structured set of questions that organise the task in the sequence an expert might employ, allowing us to assess process as well as, potentially, outcome. The issue of granularity, when applied in the context of the prescriptive nature of the script, can potentially lead students to focus on successfully completing the script steps rather than learning the lesson associated with the activity. With the rubric, which is more outcome-focussed and less process-locked, we reduce this risk. A marks scheme from an electronic system can potentially function as a rubric, but we are aware that students are not guaranteed to read all or any of a rubric, and that the perception of even the most helpful rubric could be of a prescriptive statement of teacher intent, rather than a supportive document to encourage participation and achievement of learning outcomes [2]. Rubrics make implicit requirements explicit and clearly allocate marks to the successful attainment of certain goals, facilitating feedback and self-assessment [10]. Clearly, a reliable and correctly constructed electronic marking scheme, even one that is scripted, can meet this requirement and has equal validity to a rubric, which is also rarely customised to the individual student level.

Ihantola et al. [9] identify that the iterative nature of automated marking, where students are able to resubmit their work multiple times in order to realise improvements upon their mark, can lead to a stress on the summative nature of automated marking and an attempt by students to "hack" their way to a correct solution. Nikula et al. [15] present a longitudinal study of a first-year programming course restructure, where they introduced an automated marking system, the Virtual Learning Environment, which tested student submissions for correctness as well as tracking student performance metrics. Nikula et al. identify student frustration upon receiving feedback that they were not able to act upon once the deadline had passed. Edwards et al. [6] examine the effect of a student's starting time for an assignment on their mark, when presented with feedback from the Web-CAT system. Their work confirmed the overall result that starting earlier is usually correlated with better grades, but also identified that average student marks, as measured by Web-CAT, appear to drop 1-2 days before the assignment is due and then start to rise again following submission, despite mark penalties preventing students from benefiting from these marks. Where students are allowed to carry out an unrestricted number of submissions, we can see students either using this to best effect or practising bricolage, where progress is haphazard and accidental, constructed from the random changes the student puts together rather than from intent. Karavirta et al. [12] studied this as part of an investigation that clustered students based on the way in which they employed the resubmission mechanism for automatically assessed assignments, noting that it was possible to use semi-random mutation through iteration to solve a problem without necessarily understanding the source of error. Although both ambitious and iterative students made great use of resubmission, [12] noted that only the ambitious had a corresponding improvement in their final examination grade. As noted in [16], and regularly observed in the classroom, intervention and feedback provided during an assignment have a noticeably different effect from post-hoc feedback once no further improvement is possible, and we use this as motivation to observe what students do when we provide immediate feedback during the assignment process.

3. METHODOLOGY

This is ongoing work on an investigation of student behaviour through the analysis of a data corpus, collected as part of an automated marking system for a school of Computer Science at a mid-scale university. The main corpus spans the years 2007 to 2010, over eight semesters, with over 200,000 recorded (valid) submissions and over 20,000 individual students across 22 courses. The data were collected by an automated marking system that required students to commit their work to a repository and then accept boilerplate text describing University policies, which then resulted in the student's work being sent to the web submission system. Student submissions were automatically marked on submission, with any feedback and marks returned immediately. In 2007, school policy was to operate a 25% cumulative mark cap for each day, or part of a day, late, but this was not strictly enforced. In 2008, staff conformance to the cumulative mark cap was mandated. However, students could not receive a lower mark than one previously awarded; hence a student who submitted on time for 80% would not be capped at 75% were they to submit a superior version of their code after the due date. The assignments allowed for unlimited resubmission.

The dataset was constructed from the student submission data through automated extraction of the course, student, submission time, automatically assessed mark and on-time or late status, with each record being normalised to provide a stand-alone entry containing the mark received, the mark awarded after any late penalties and whether the work was on time and, if not, by how much it was late. This data was then analysed using R-64 statistical software, under RStudio, with coordinating packages written in Ruby. An initial sweep of the data was carried out to produce graphs, using ggplot2, to identify key areas of interest and, following this, the investigation was refined to specific courses and the extraction of features in numerical form for statistical analysis.

The observed activity period for each assignment is the 120 hours preceding the established submission deadline and the 120 hours following it. The activity in this interval dominates the activity in the surrounding weeks by orders of magnitude. For statistical reasons, we have not carried out any further analysis on the data in the weeks outside of this window; in effect, the number of students active more than a week before a deadline in an undergraduate course is, from our observations, so small as to be considered negligible. Some components of the original data could not be captured in the final dataset due to ambiguous dates or non-standard or external automated marking, and these data have been excluded. The original data capture was a by-product of an assessment system and, while individual instructors could determine the marks obtained under non-standard schemes, a decision was made to exclude them at this time, pending more work on the automated harvesting. The total number excluded for non-standard marking schemes is small, less than 0.1% of total submissions. The data from our investigation is noisy and, where smoothing has been used, we have uniformly used LOESS smoothing, with the default parameters, as part of the ggplot2 package. Numerical data is provided later in the paper to confirm that interesting facets in the graphs are not artefacts introduced by the LOESS smoothing.
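As an illustration of the mark rules described above, the following R sketch combines the cumulative 25% per-day late cap with the rule that a resubmission can never lower a previously awarded mark. The function and argument names are ours, for illustration only; this is not the code of the submission system itself.

# A sketch (not the production marking system) of the rules described above:
# a cumulative 25% mark cap for each day, or part of a day, late, and the
# rule that a resubmission can never lower a previously awarded mark.
capped_mark <- function(raw_mark, hours_late, previous_best = 0, max_mark = 100) {
  days_late <- ceiling(max(hours_late, 0) / 24)         # part days count as whole days
  cap       <- max_mark * max(1 - 0.25 * days_late, 0)  # 100%, 75%, 50%, 25%, then 0%
  max(min(raw_mark, cap), previous_best)                # never below an earlier mark
}

capped_mark(raw_mark = 95, hours_late = 30, previous_best = 80)  # returns 80
capped_mark(raw_mark = 95, hours_late = 0,  previous_best = 80)  # returns 95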

4. RESULTS

The initial discussion of granularity identifies one immediate question: given that increased granularity increases the complexity of both the assignment and the automated marking system, what granularity should we choose to provide student benefit without overloading the teaching staff?

4.1 The Impact of Granularity

Analysis of the different granularities across all schemes in use in the study showed that most courses had perceived granularities of 5 or less, with a spike at granularity 11. For the four years under study, over 150 courses had granularities less than or equal to 5, out of an observed range of granularities from 2 to 46. Unsurprisingly, given the complexity of establishing automated marking as the number of handled cases increases, the count of courses decreases as the number of different awarded marks increases. The spike at 11 reflects a 0–10 scheme with unit steps or a 0–100 scheme in steps of 10.
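The tabulation behind this distribution is straightforward; the R sketch below computes the perceived granularity of each course as the number of distinct marks actually awarded, using a small hypothetical data frame with invented course codes in place of the real corpus.

# Sketch: perceived granularity per course, i.e. the number of distinct
# marks actually awarded, tabulated across the corpus. 'finals' stands in
# for the real set of final submission records (hypothetical rows here).
finals <- data.frame(course = c("c1", "c1", "c1", "c1", "c2", "c2"),
                     mark   = c(0, 50, 50, 100, 100, 100))

perceived <- tapply(finals$mark, finals$course, function(m) length(unique(m)))
perceived          # c1: 3, c2: 1
table(perceived)   # distribution of perceived granularities across courses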

4.2 Student Performance

Although we can look at student performance across the entire school, the range of entry points, the differing assignment types found across courses and the risk of the artificial dominance of larger enrolments in less advanced courses mean that such analysis will only give us a very broad indication of trends. What about the performance within a given course, where the content, student body and overall assessment consistency allow us to isolate given treatments? We have identified a major contributor to the data set that has a similar distribution of assessment granularities under automated marking to that found in our initial assessment of the school data, but is a well-defined course with no major structural changes across the four-year period, taught in the second year of a Bachelor's degree with a computer science major. This course has been de-identified and will be labelled as "bee3".

After consolidation, the summary data for the whole school consists of 27,744 'final' records, showing the last time a student submitted an assignment, across 1,917 unique students. bee3 is a significant contributor to the school data, with 7,010 of these records (the most for any course) and 971 unique students represented. This course was the major programming course for the degree and a pre-requisite for further study in Computer Science. All students in our cohort are required to take this course, resulting in a representative range of students who had successfully completed first-year foundation programming courses. Annual enrolments in bee3 ranged from 210 to 260 students.

All assignments were coding activities that were designed to take at most one week to solve and were available to students for at least two weeks prior to the deadline. All assignments allowed an unlimited number of resubmissions, including beyond the due date, and were worth between 3% and 4% of the final mark for the course. Student submission behaviour did not vary significantly when the assignment was worth 4% rather than 3%. Each of the assignments consisted of three to six questions of increasing difficulty. The goal for each student was to submit code for each problem in a week's question set. Feedback was provided in addition to marks, at least informing the students of the nature of the test being performed, the presented input, the expected output and any variation. In some cases, individual goal statements were also presented to identify when students had met specific stages in a development process or to provide guidance as to how to move forward.

Obviously, a granularity of 2 (binary grading) does not provide a great range of feedback, nor does a granularity of 3. Although assignments were used with these granularities, these were frequently early assignments that encouraged behaviour such as handing up files successfully or producing single lines of output. The point at which the assignments started to provide more detailed feedback was, unsurprisingly, the point where there were 4 or more possible marks available. Hence, we will focus on student performance and mark achievement for granularities of 4, 5, 6 and 11. Each graph is LOESS smoothed and shows the standard deviation as a grey shadow below each line. As expected, the deviation is higher for the lower groupings, because of the artificially high 100% grades introduced by granularity 2 and 3 mark schemes. For each graph, where we state "granularity split at x", the solid blue line represents all assignment marking schemes that achieved at least granularity x and the dotted red line represents all assignments that had less than this granularity. Hence, in Figure 1 (granularity split at 4), the solid blue line represents all marking schemes that had a granularity of 4 or higher in awarded marks, while the dotted red line represents marking schemes with a granularity of 2 or 3. (Obviously, a granularity of 1 can provide no mark-based feedback, and a granularity of less than 1, or a non-integer granularity, is not possible.)

As can be seen from the graphs, students achieve noticeably greater average marks when the marking granularity increases and, while this effect appears to be temporarily neutralised at the major deadline, they continue to receive higher marks after the deadline. Investigating the mean mark and standard deviation for each hour confirms that the separations of average marks are greater than one standard deviation for periods up to two days prior to the submission deadline. Once we proceed beyond a granularity of 15, while the graphs start to converge before the deadline, we see a clear separation of marks after submission, indicating that the average student mark (for late submissions) increases as we increase the granularity of marking, as shown in Figure 5 for a granularity split at 15. While our late submission penalty scheme would, in theory, reduce the benefit of completing the work, as students would not receive many, if any, of the additionally earned marks, the high-granularity schemes see students continuing to work for longer and achieving higher marks.

By aligning the average marks received from automatic marking with the hour of activity, we immediately observe that the peak average mark returned by automated marking occurs well before the submission date, in the range of 48 to 72 hours ahead of the no-penalty submission date, which is also seen in [6]. While, as has been documented in the literature previously, we expect to see a concentration of activity immediately prior to the due date, assessing the automated mark that is assigned clearly identifies a downward trend in average mark and, in most cases, the average mark (for an on-time submission) starts to approach the mark that would be received if the work were handed in one day late. Across the school, the marks start a downward trend over the final 48 hours prior to submission, dropping steeply in the four hours prior to submission. This is also the time at which the greatest number of per-hour submissions occur, with a very high peak in the final hour. [6] also noted the existence of outliers who continued working after the due date, even at a stage where further work would attract no additional marks. Our work confirms this effect and, if late penalties are disregarded, there is a clear climb back towards the highest average mark achieved prior to the submission deadline.

Figure 1: Granularity split at 4

Figure 2: Granularity split at 5

Figure 3: Granularity split at 6

Figure 4: Granularity split at 11

Figure 5: Granularity split at 15
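For reference, the ggplot2 sketch below shows how curves of this "granularity split at x" form can be produced with LOESS smoothing over the ±120 hour window described in Section 3. The data frame subs and its column names (hour, mark, granularity) are assumptions for illustration, not our actual analysis scripts.

library(ggplot2)

# Sketch of a "granularity split at x" plot in the style of Figures 1-5.
# 'subs' is an assumed data frame of submission records with columns:
#   hour        - submission time in hours relative to the deadline
#   mark        - automatically assessed mark
#   granularity - achieved granularity of the assignment's marking scheme
granularity_split_plot <- function(subs, split_at = 4) {
  subs$group <- ifelse(subs$granularity >= split_at,
                       paste("granularity >=", split_at),
                       paste("granularity <", split_at))
  ggplot(subset(subs, hour >= -120 & hour <= 120),
         aes(x = hour, y = mark, colour = group, linetype = group)) +
    geom_smooth(method = "loess") +   # default LOESS parameters, as in Section 3
    labs(x = "Hours relative to deadline", y = "Average automated mark",
         title = paste("Granularity split at", split_at))
}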

4.3 Numerical Analysis

Because of the large number of possible hours available for submission, we have 240 possible submission points and a great deal of noise due to conflating factors, such as low-granularity schemes that introduce binary marking. After cleaning the data to remove single entries, which would return an invalid standard deviation, we determined which hours saw an increase in mean mark and which hours saw an increase in mean mark that was beyond one standard deviation. These results for granularities below 5, of at least 5, and of at least 15 are shown in Table 1. The first two columns compare the mean mark for everything below granularity 5 with granularities of 5 and greater, then 15 and greater, where we note that the column for granularity 5 or greater subsumes the following column but has been added to clearly isolate benefit at higher levels of granularity. Columns three and four then determine the mark increase, marked with an asterisk, but require that the difference in mark be greater than the standard deviation in marks. The final column is the total number of students across each hour band.

Table 1: Percentile increase with granularity. (Columns: hour relative to the deadline; mean mark increase for granularity >= 5 over lower granularities; the same for granularity >= 15; each of these restricted to increases greater than one standard deviation, marked *; and n, the number of students in that hour band.)

It is immediately apparent that, as identified in the graphs, students are gaining higher marks as we improve granularity and, despite the inherent noise of this data, we are able to clearly identify points before and after a given deadline where students appear to be deriving benefit from the additional effort invested in making the automated marking scheme more granular. To summarise this data: within this course, using a granularity of 5 increased the automated assessment mark received for 30 of a possible 120 hours, with 15 of these 30 hours showing a mark increase that was more than one standard deviation above that for granularity 4 and below. Similarly, using a granularity of 15 led to 28 of the 120 hours before submission having a better average mark than for all of the granularities 14 and below (which includes granularity 5), with 16 being significantly larger (greater than one standard deviation). An additional 14 hours in the span of 44-88 hours showed substantially higher marks for 60 students across this period when higher granularity was used, but had to be eliminated because we had no comparator in the low-granularity data set: the students submitted in an additional 14 hours in the area under study. In other words, there were at least 14 additional hours in which students submitted work, provided the teacher had offered at least 5 marking intervals. The students undertaking assignments with higher-granularity schemes are working over more of the post-deadline period, for longer, well after any on-time benefit of additional granularity has stopped increasing, which we interpret as an overall increase in persistence. This, again, confirms results observed in [6], which clearly shows an increasing average grade after the due date, with a minimum mark granularity of 10 marks.
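The per-hour comparison behind Table 1 can be reproduced in outline with the R sketch below, which, for each hour band, compares the mean mark of the higher-granularity group against the mean and standard deviation of the lower-granularity group. As before, subs and its column names are assumed for illustration, and the handling of single-entry hours follows the cleaning step described above.

# Sketch of the per-hour analysis behind Table 1: for each hour band, compare
# the mean mark at or above a granularity threshold with the mean (and standard
# deviation) of marks below that threshold.
compare_by_hour <- function(subs, threshold = 5) {
  hours <- sort(unique(subs$hour))
  rows <- lapply(hours, function(h) {
    lo <- subs$mark[subs$hour == h & subs$granularity <  threshold]
    hi <- subs$mark[subs$hour == h & subs$granularity >= threshold]
    # drop hours with no comparator or a single entry (invalid standard deviation)
    if (length(lo) < 2 || length(hi) < 1) return(NULL)
    data.frame(hour        = h,
               increase    = mean(hi) - mean(lo),
               significant = (mean(hi) - mean(lo)) > sd(lo),
               n           = length(lo) + length(hi))
  })
  do.call(rbind, rows)
}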

5. CONCLUSIONS

This study suggests that marking schemes and the granularity of marks are important considerations when using and designing automated assessment of programming assignments. Our study shows that students perform differently as we vary the granularity that they can perceive. There is, however, a need for additional investigation into the behaviour of the individual student in response to these factors. Does a more detailed marking scheme indicate the presence of a more methodical and pedagogical approach to assignment formation or material delivery? We have all of the source assignments and the vast majority of the teaching materials so, over time, we plan to carry out a quality assessment of these works to complete the isolation of the marking scheme from the materials to which it is attached. An assignment with 100 possible marks should produce a similarly enumerated set of actual marks but, across four years of assignments, there is only one assignment that has 46 different marks awarded, yet there are many assignments that notionally have 100 marks available and only offer 11 different marks. By concentrating on a single course with a reasonable range of assignment mark granularities, we have started to reduce the potential impact of the teacher in terms of motivation and content delivery outside of the assignments, but there is work to be done on the quality and nature of the assignments themselves.

We need to find reliable, high-quality alternatives to manual marking so that we can handle the increasing number of students with a decreasing, and steadily more over-worked, academic faculty. What is apparent from our analysis is that automated feedback, even when it is confined to marks, does provide some impetus to continue working, even when the marks returned are low: if a student can see that they are making progress then, unsurprisingly, they may strive for longer. There has never really been a question about the value of formative feedback delivered during a project, but automated marking scripts have long been seen as a very poor cousin to their manual and face-to-face alternatives, being neither personalised nor human! Despite the occasional issue with an overly pedantic comparison or a case that was not foreseen by the lecturer, it appears that students will perform better, earlier, when we take the time to provide a more detailed and granular marking scheme for our automatically marked assignments.

6. REFERENCES

[1] K. M. Ala-Mutka. A survey of automated assessment approaches for programming assignments. Computer Science Education, 15(2):83–102, 2005.
[2] H. Andrade and Y. Du. Student perspectives on rubric-referenced assessment. Practical Assessment, Research & Evaluation, 10(3), Apr. 2005.
[3] M. Ben-Ari. Constructivism in computer science education. In Proceedings of the twenty-ninth SIGCSE technical symposium on Computer science education, SIGCSE '98, pages 257–261, New York, NY, USA, 1998. ACM.
[4] P. C. Blumenfeld, E. Soloway, R. W. Marx, J. S. Krajcik, M. Guzdial, and A. Palincsar. Motivating Project-Based Learning: Sustaining the Doing, Supporting the Learning. Educational Psychologist, 26(3):369–398, 1991.
[5] C. Douce, D. Livingstone, and J. Orwell. Automatic test-based assessment of programming: A review. J. Educ. Resour. Comput., 5(3), Sept. 2005.
[6] S. H. Edwards, J. Snyder, M. A. Pérez-Quiñones, A. Allevato, D. Kim, and B. Tretola. Comparing effective and ineffective behaviors of student programmers. In Proceedings of the fifth international workshop on Computing education research, ICER '09, New York, NY, USA, 2009. ACM.
[7] S. Fincher, M. Petre, and M. Clark, editors. Computer Science Project Work: Principles and Pragmatics. Springer, 2001.
[8] S. Fitzgerald, B. Hanks, R. Lister, R. McCauley, and L. Murphy. What are we thinking when we grade programs? In Proceedings of the 44th ACM technical symposium on Computer science education, SIGCSE '13, pages 471–476, New York, NY, USA, 2013. ACM.
[9] P. Ihantola, T. Ahoniemi, V. Karavirta, and O. Seppälä. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, Koli Calling '10, pages 86–93, New York, NY, USA, 2010. ACM.
[10] A. Jonsson and G. Svingby. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2):130–144, 2007.
[11] N. Kalogeropoulos, I. Tzigounakis, E. A. Pavlatou, and A. G. Boudouvis. Computer-based assessment of student performance in programing courses. Computer Applications in Engineering Education, 2011.
[12] V. Karavirta, A. Korhonen, and L. Malmi. On the use of resubmissions in automatic assessment systems. Computer Science Education, 16(3):229–240, 2006.
[13] D. Kay, T. Scott, P. Isaacson, and K. Reek. Automated grading assistance for student programs. In Proceedings of the 25th SIGCSE technical symposium on Computer science education, pages 381–382, 1994.
[14] K. A. Naudé, J. H. Greyling, and D. Vogts. Marking student programs using graph similarity. Comput. Educ., 54(2):545–561, Feb. 2010.
[15] U. Nikula, O. Gotel, and J. Kasurinen. A motivation guided holistic rehabilitation of the first programming course. Trans. Comput. Educ., 11(4):24:1–24:38, Nov. 2011.
[16] E. Panadero, J. A. Tapia, and J. A. Huertas. Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learning and Individual Differences, 22(6):806–813, 2012.
[17] D. R. Sadler. Interpretations of criteria-based assessment and grading in higher education. Assessment & Evaluation in Higher Education, 30(2):175–194, 2005.
[18] D. R. Sadler. Perils in the meticulous specification of goals and assessment criteria. Assessment in Education: Principles, Policy & Practice, 14(3):387–392, 2007.