A controlled experiment on Python vs C for an Introductory Programming course: students’ outcomes

JACQUES WAINER and EDUARDO C. XAVIER, University of Campinas, Brazil

We performed a controlled experiment comparing a C and a Python Introductory Programming course. Three faculty at the University of Campinas, Brazil, taught the same CS1 course for the same majors in two different semesters, one version in Python and one in C, with a total of 391 students involved in the experiment. We measured the dropout rate, the failure rate, the grades on the two exams, the proportion of completed lab assignments, and the number of submissions per completed assignment. There was no difference in the dropout rate. The failure rate for Python was 16.9% against 23.1% for C. The effect size (Cohen's d) of the comparison of Python against C was 0.27 on the midterm exam and 0.38 on the final exam. The effect size for the proportion of completed assignments was 0.39, and the effect size for the number of submissions per assignment was -0.61 (Python had fewer submissions per completed assignment). Thus for all measures, with the exception of the dropout rate, the version of the course in Python yielded better student outcomes than the version in C, and all differences are significant (with 95% confidence) with the exception of the failure rate (p-value = 0.12).

CCS Concepts: • Social and professional topics → Computational science and engineering education

Additional Key Words and Phrases: Introductory programming, Python, C, CS1, Controlled experiment

ACM Reference Format:
Jacques Wainer and Eduardo C. Xavier. 2017. A controlled experiment on Python vs C for an Introductory Programming course. ACM Trans. Comput. Educ. 1, 1, Article 1 (March 2017), 16 pages. DOI: 0000001.0000001

1. INTRODUCTION

In this work we are interested in the problem of choosing a programming language for an introductory computer programming course, considering students' learning outcomes. Although C/C++ and Java have been popular choices for a first programming course in recent years, according to Guo [2014] Python is the most popular programming language for introductory programming courses at the top US universities. On the other hand, wider surveys of language choices in introductory programming courses do not clearly point to Python as the most frequent choice [Murphy et al. 2016; Siegfried et al. 2016]. We will abbreviate "introductory programming course" as CS1 in this paper, but we warn the reader that CS1 has a connotation of being an introductory programming course for computer science majors, which is not necessarily true in this paper - we use CS1 to refer to any introductory programming course.

This paper reports on a controlled experiment of using Python and C as the programming language of an introductory programming course. The experiment was done at a large Brazilian university (University of Campinas – Unicamp) in the course Introduction to Computer Programming (MC102).

Author's addresses: E. Xavier and J. Wainer, Computing Institute, University of Campinas, Campinas, SP 13083-852, Brazil.



The course is offered every semester to Computer Science and Computer Engineering majors, and also to a dozen other Science/Math/Engineering majors. For CS and non-CS majors alike, MC102 is a required course. Every semester, the course is taught to approximately 800 students in 12 different classes. For the last 16 years the course has been taught in C. The course is further discussed in Section 1.2.

In this experiment, three faculty taught MC102 in two semesters, the second semester of 2015 and the second semester of 2016, alternating between C and Python. Thus a professor may have taught the course in C in 2015 and in Python in 2016, with both instances offered to students of the same major, say Mechanical Engineering. The experiment is described in detail in Section 2.1. The results show that the average grade on each of the two exams was higher in the Python course, and the number of students that failed the course was lower, compared to the C version. Also, students in the Python course completed more lab assignments, while submitting fewer incorrect solutions before submitting a final correct one.

1.1. Related Works

A discussion comparing several programming languages for use in a CS1 course was presented by Giangrande Jr [2007], who provided a historical perspective of the development of programming languages and their use in introductory courses. It is interesting to notice that by the end of the 90s and the beginning of the 2000s several introductory courses were redesigned to introduce the object-oriented paradigm as early as possible in the course. The ACM 2001 Computing Curricula included Objects First (OO-first) as one of the possible approaches in introductory courses, where the notions of Object and Inheritance are covered early in the course.

Ehlert and Schulte [2009] did a controlled experiment comparing the OO-first and OO-later options while teaching programming to high school students in Germany. The experiment was designed to control several variables. Students of the same school were divided into two groups and were taught programming in Java, where the main difference was the order in which material was covered (OO-first vs OO-later). The results show that there is no significant difference in learning outcomes; however, students rated OO-later as easier to follow than OO-first. On the other hand, Uysal [2012] found that when the outcome was how well the students learned object-oriented concepts, an OO-first approach was statistically significantly better than an OO-later approach, but the experiment did not test different programming languages, only different orderings of the presentation of programming concepts. Also, Cooper et al. [2003] compared a group of students that took a preliminary (before CS1) OO-first course to a control group that did not take it, and they found that the experimental group had both higher grades in a following CS1 course, and more of them went on to enroll in the CS2 course.

Reges [2006] reported on a redesign of the introductory courses at the University of Washington using Java, which focused on teaching the basics of procedural programming, with basic problem solving skills using loops, conditions, and arrays. Before this redesign, the course in Java was focused on object-oriented concepts, and some problems were reported, such as a decline in student satisfaction and enrollment, as well as a lack of basic programming skills reported by instructors of upper-division courses. After the redesign, focusing on the basic constructions and procedural programming, student evaluations improved and there was an increase in enrollment in the course.

In another study [Ramalingam and Wiedenbeck 1997], students with little programming experience studied and answered questions about three imperative and three object-oriented programs. The authors found that students had a better comprehension of the programming aspects of the code for the imperative programs, while for the object-oriented programs students had a better understanding of the domain aspects.

Several works have discussed local experiences in transitioning introductory courses from other languages, such as C, to Python [Patterson-McNeill 2006; Miller and Ranum 2005; Oldham 2005; Goldwasser and Letscher 2008; Shannon 2003]. Hunt [2015] reports an experience of changing a CS1 course from Java to Python, and then back to Java, arguing against the use of Python in CS1. Hunt's main argument is the lack of arrays, which he believes resulted in problems in a subsequent course. On the other hand, controlled studies such as Koulouri et al. [2015] found Python to be beneficial to novices' learning, compared to Java. Most of these studies provide experiences and anecdotal stories of the transition from another language to Python, and have instructive examples, but they lack a more rigorous analysis of the results.

Considering the choice of a programming language for an introductory course, Pears and Malmi [2009] analyzed a sample of 352 articles from a population of 1,306 computer science education articles published between 2000 and 2005. They found that one-third of the articles did not report research on human participants, about 40% of the articles that dealt with human participants only provided anecdotal evidence for their claims, and the articles that used experiments did so mainly using questionnaires. Below we describe some of the relevant works we found that did some empirical evaluation.

Mannila et al. [2006] reported an experiment comparing Python and Java as the choice for a high school introductory programming course. Two groups of high school students were taught the basics of computer programming by the same professor, one group in Python and the other in Java. The study first analyzed 60 programs developed by students, 30 in each language, and the results indicated that students implementing in Python produced programs with fewer syntax and logical errors, as well as better structured programs.

Enbody et al. [2009] reported on a study of the impact of CS1 given in Python versus C++ on the subsequent CS2 course, whose primary language was C++. In 2007 their department started using Python in the CS1 course, and in the following year the CS2 course had both students that took CS1 in C++ (21 students) and in Python (62 students). The authors compared the two groups of students on three outcomes: final exam grade, programming project scores, and final grade for the course. The main result was that there were no significant differences between the two groups for all three outcomes, and that the best indicator of final grades in CS2 was the overall GPA of the students (and not the programming language used in the introductory course). In another study, Enbody and Punch [2010] continued tracking the difference in performance of the students in subsequent courses, and again found that there was no statistically significant difference between the two groups of students.

Kasurinen and Nikula [2007] presented a study of the redesign of a CS1 course which, among other things, changed the programming language from C to Python. The authors concluded that the change led to higher grades, improved student satisfaction, and a decline in the dropout and failure rates.
A controlled study was done by Koulouri et al. [2015] to test three factors when teaching a CS1 course: the choice of programming language (Java vs Python), problem-solving training at the beginning of the course before mastering the material on programming, and the use of informative assessment to give frequent feedback to students. The results suggested that the use of Python (when compared to Java) and problem-solving training resulted in an improvement in the students' programming ability. However, informative feedback did not provide an observable benefit.

Stefik and Siebert [2013] studied the impact of the syntax of programming languages on programmers and non-programmers via questionnaires. In this study, programming language constructs (variable declarations, assignment, loops, conditions, etc.) of a set of different programming languages were presented to novices and experienced programmers, who had to rate how intuitive the syntax constructions were. The general conclusion is that novices find many of the choices made by language designers to be unintuitive. For experienced programmers, constructions similar to the ones they already know are, as expected, more intuitive.

Pears et al. [2007] is an important survey of the introduction to programming literature, with a whole section discussing works on language choice for a CS1 course.

1.2. The MC102 course

MC102 is an introductory programming course offered by the Computer Science Department of the University of Campinas. All students from Computer Science, Computer Engineering, Science, Math, Statistics, and all Engineering majors are required to take this course. Every semester, the course is offered to approximately 800 students, divided into 12 classes. MC102 covers variable declarations, assignments, conditional and loop controls, arrays, strings, functions, sorting (including the O(n log n) algorithms), search algorithms, dynamic memory, files, and recursion.

The course has three 2-hour sessions each week. The lab sessions (2 hours a week) are managed by the teaching assistants. The standard lectures (4 hours a week) are taught by faculty from the Computer Science Department. The faculty follow a predetermined schedule of topics for each day of class and use a prepared set of slides. The course has two exams; the midterm exam covers up to looping structures and arrays, and the final exam covers all topics. All exams are prepared by a single professor, after discussion with the faculty teaching in the current semester. For each exam, three versions are prepared: one to be taken by the morning classes, one for the afternoon classes, and a third for the evening classes. Each faculty member is responsible for grading the exams of his/her own class.

The course also has from 16 to 18 laboratory assignments, and the students have a week to submit their solutions. The students can work on these assignments during the two-hour lab session each week, where they receive help from the teaching assistant, or work from home on their own computers. The students' solutions are electronically submitted and automatically graded. A student can resubmit his/her solution files until the program passes all tests. The lab assignments are the same for all classes in a semester. The same professor that is responsible for preparing the exams is also responsible for creating the lab assignments.

The final grade is a weighted average of the midterm, the final exam, and the lab assignments. If a student has submitted a minimum number of lab assignments but did not receive a passing grade, he or she can still take a make-up exam, and the final grade will be the average of the grade during the semester and that of the make-up exam. Usually 8% to 10% of the students in a class have to take the make-up exam.
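As a small illustration of the grading rule just described, the sketch below combines exam and lab grades into a course grade. It is only a sketch under assumed values: the actual weights used in MC102 are not given in this paper (and differed between the two years), so the weights, the function name, and the example grades are hypothetical.

def final_grade(midterm, final_exam, labs, makeup=None,
                w_mid=0.3, w_final=0.4, w_labs=0.3, passing=5.0):
    # All grades are on a 0-10 scale; the weights here are hypothetical,
    # not the ones actually used in MC102.
    semester = w_mid * midterm + w_final * final_exam + w_labs * labs
    if semester >= passing or makeup is None:
        return semester
    # A student below the passing grade who submitted the minimum number of
    # labs may take a make-up exam; the course grade is then the average of
    # the semester grade and the make-up grade.
    return (semester + makeup) / 2.0

print(round(final_grade(6.0, 4.0, 8.0), 2))              # 5.8: passes without a make-up
print(round(final_grade(3.0, 4.0, 6.0, makeup=7.0), 2))  # 4.3 averaged with 7.0 -> 5.65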

2. METHODS

2.1. The Experiment

In the second semester of 2015, three CS faculty teaching MC102 to three different classes X, Y, and Z, with a total of 194 students, participated in the experiment. Classes X and Y were taught using Python, while class Z was taught using C. In the second semester of 2016 the same three CS faculty taught the same classes as in the previous year but in the other language, so classes X and Y were taught in C while class Z was taught in Python. A total of 197 students were in these three classes in the second semester of 2016. Specifically:


— JW taught MC102 in Python in the second semester of 2015, and in C in 2016, for Physics and Math majors. This was a morning class.
— HH taught using C in 2015 and Python in 2016 for Statistics majors. This was an afternoon class.
— DA taught using Python in 2015 and C in 2016 for Food Engineering majors. This was an evening class.

For the six classes, we collected the students' grades for the midterm and final exams, and the final grade (including the make-up exam). We also collected, for each lab exercise, whether the student completed the lab and the number of submissions for that lab exercise. From this raw data we computed the measures we analyze in this paper:

— total is the total number of students that registered for any of the three C classes or any of the three Python classes.
— dropout is the number of students that took neither the midterm nor the final exam. These students may have decided early that they would not perform well in the course and stopped attending it. Unfortunately we did not collect attendance data for the classes, so we do not have a true count of students that stopped attending the classes - we estimate that number as the number of students that did not attend any of the exams.
— fail student is a student that did not drop out but did not achieve a passing grade (5.0) as the course final grade. The overall final grade is a weighted combination of the grades of the midterm, the final exam, the lab exercises, and possibly the make-up exam. Unfortunately the formula for computing the overall final grade is not the same in both years (but it is the same for the C and Python versions within a year). Thus we did not perform any analysis on the overall final grade itself, only on whether or not the student passed the course.
— midterm and final exam grades are the raw grades the student received in the two exams, out of a maximum grade of 10.0.
— proportion of completed assignments is the proportion of the assignments that the student submitted and that passed all the automatic tests. This was only computed for students with a passing final grade. The number of assignments was different in the two years, so instead of using just the number of completed assignments, we normalized it by dividing by the total number of assignments in that semester.
— submissions per completed assignment is the average number of submissions made by the student for each completed assignment. This was only computed for students with a passing final grade. This figure does not include submissions to assignments that were not completed.

The material covered in the Python version of the course was essentially the same as that covered in the C course, with the exception of pointers and dynamic memory allocation, which were taught only in C. Most of the slides of the C course were rewritten, substituting the C code with equivalent Python code. Therefore the Python version of MC102 was not really a “pythonic” CS1; it was a C-based CS1 course taught in a different programming language. We do not believe we tested the full potential of a Python CS1 course, only a CS1 course using a programming language that is, in some aspects, simpler than C. In particular, we believe that only a few aspects of Python as an introductory programming language were tested in this experiment: the simpler syntax, the simpler semantics of using lists in place of C arrays, the interactive environment, and clear feedback for some types of memory errors such as array/list accesses out of bounds.
Python lists were used in place of C arrays, and thus most of the examples discussed in class were of homogeneous lists (lists of elements of the same type). The append list method was the only method described in class to include a new item into a list. The for command was mostly used in the format for var in range(len(list)): instead of the format for var in list:, since the former is closer to C's for loop. Python classes, and in particular class and instance variables, were taught in correspondence to C's struct. In this last case, little emphasis was placed on class methods, and inheritance was not taught at all. Numpy arrays were used as a substitute for C multidimensional arrays, but only standard indexing of Numpy arrays was mentioned in class. Finally, for the three classes where pointers and memory management were taught in C, the Python students had regular expressions and dictionaries, but neither of these two concepts was necessary for the assignments, nor were they included in the final exam questions. Table I is the daily class plan for both the C and the Python versions of MC102.

Table I. Class plans for the C and Python versions of the MC102 course (one row per class, in chronological order).

C topic | Python topic
information about the course | information about the course
computer organization basics | computer organization basics
variables, basic types (int, float, double, char, long), assignment | variables, basic types (int, float, string), assignment
math expressions, input/output | math expressions, input/output
logic expressions | logic expressions
if and if-else | if and if-else
if, if-else and nested ifs | if, if-else and nested ifs
loops | loops
loops | loops
loops | loops
loops | loops
functions/global and local variables | mutable types/functions
functions/global and local variables | functions/global and local variables
functions | functions
arrays | lists and tuples
arrays/sorting (selection and bubble) | lists/sorting (selection and bubble)
arrays/sorting (insertion) | lists/sorting (insertion)
linear and binary search | linear and binary search
exercises | exercises
midterm | midterm
strings | string operations
2D arrays | 2D numpy arrays
multidimensional arrays | numpy multidimensional arrays
structs | classes and instance variables
pointers/call by reference | regular expressions
pointers (arrays and structs) | regular expressions
memory allocation | dictionary and sets
text files | text files
byte files | byte files
recursion | recursion
recursion (mergesort) | recursion (mergesort)
recursion (quicksort) | recursion (quicksort)
exercises | exercises
final exam | final exam
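To make the "C-based CS1 taught in Python" style concrete, the fragment below is an illustrative sketch written by us (it is not taken from the actual course material) of how the class examples looked: index-based loops over range(len(...)), append as the only way to grow a list, and a class used purely as a record, in correspondence to a C struct.

# Index-based iteration over a list, mirroring C's for loop,
# instead of the more idiomatic "for g in grades:".
grades = [7.5, 4.0, 9.0, 6.5]
total = 0.0
for i in range(len(grades)):
    total = total + grades[i]
print("average:", total / len(grades))

# append() was the only method shown in class for adding an item to a list.
approved = []
for i in range(len(grades)):
    if grades[i] >= 5.0:
        approved.append(grades[i])
print("approved grades:", approved)

# A class used as a plain record, in correspondence to a C struct;
# class methods were de-emphasized and inheritance was not taught.
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

s = Student("Ana", 8.0)
print(s.name, s.grade)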

Thus we believe that only a few aspects of Python as a CS1 language were really tested in this experiment. Below is a list of Python aspects that we believe are important for learning programming and that we believe were exercised in this experiment:

— simpler syntax. We do not know of any research that measures how much simpler or more intuitive Python syntax is than C syntax, but we believe that the claim of a simpler syntax is unproblematic.


— dynamic typing. In particular, we believe that the important aspect of dynamic typing that was tested was the fact that variables do not need to be declared. None of the programs shown in class had examples where a variable was assigned values of different types (because the examples shown in class were “translated” from C programs).
— Python lists as a container type with simpler semantics than C arrays (see the short sketch after this list); in particular:
  — Python checks for out-of-bounds access to list elements.
  — a whole list has a syntactic representation in Python, so lists can easily be set and inspected, as opposed to C arrays, which need a for loop to set or inspect their elements.
  — Python lists can grow or shrink, as opposed to C arrays, which are preallocated, forcing the distinction between the “used part” of the array and the “allocated space” of the array.
— the interactive environment already available in Python, which allows one to inspect variables and show the effect of a single command.
— the lack of call by reference.
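The sketch below, written by us as an illustration (it is not course material), shows some of the list and environment properties listed above as a novice would see them when typing commands interactively:

xs = [10, 20, 30]       # a whole list can be written down and inspected directly,
print(xs)               # with no loop needed to set or print its elements

xs.append(40)           # lists grow and shrink on demand, so there is no
xs.pop(0)               # "used part" versus "allocated space" distinction
print(xs)               # [20, 30, 40]

try:
    print(xs[10])         # out-of-bounds access raises a clear IndexError ...
except IndexError as err:
    print("error:", err)  # ... "list index out of range", instead of the silent
                          # memory corruption or segmentation fault of a C array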

2.2. Statistical Methods

We computed the p-value and the effect size (with the 95% confidence interval for the effect size) for the Dropout rate, the Failure rate, the Midterm grades, the Final exam grades, the Proportion of completed labs, and the Submissions per completed lab. "Dropout rate" is the ratio of the number of dropout students to the total number of students; "Failure rate" is the ratio of failing students to the students that did not drop out of the course. The average Midterm grade, the average Final exam grade, and the average Proportion of completed assignments were calculated only for the students that did not drop out of or fail the course. For the Midterm, the Final exam, the Proportion of completed labs, and the Submissions per completed lab, we computed the p-value of the t test between the two groups of data. For the Dropout rate and the Failure rate we computed the p-value of the test of equal proportions (as implemented by the R function prop.test).

Effect sizes [Kelley and Preacher 2012] are measures of the magnitude of the difference between two or more sets of measures. In this paper we report two different effect size measures: Cohen's d for interval data (for grades, the proportion of completed assignments, and the number of submissions per completed assignment) and percentage of increase for rates (for the dropout and failure rates). Effect sizes are comparable across different experiments, and thus can be contrasted and aggregated across different experiments into meta-analyses. For example, Salleh et al. [2011] presented a systematic review of the literature on pair programming applied to teaching; one of the results is a meta-analysis of experiments that measure the effect size (Cohen's d) of pair programming in comparison to solo programming on two measures: final exam scores and assignment scores. The meta-analysis concluded that the effect size of pair programming on the final exam scores is 0.16, and that the effect size on assignment scores is much higher, 0.67.

Cohen's d [Cohen 1988] is the difference between the means of the two groups divided by a measure of the standard deviation of the two groups (there are different ways of defining the standard deviation of two groups). In this paper we use the pooled standard deviation (Equation 2). Formally:

\[ \text{Cohen's } d = \frac{m_1 - m_2}{s} \tag{1} \]


where $m_1$ and $m_2$ are the means of the two groups, and $s$ is the pooled standard deviation

\[ s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \tag{2} \]

where $n_1$ and $n_2$ are the sizes of the two groups and $s_1$ and $s_2$ are the standard deviations of the two groups.

Percentage of increase assumes that there is a control and an experimental group; it is the difference between a ratio or proportion for the experimental group and that for the control group, divided by the proportion of the control group. In this paper the Python version of the course is considered the experimental group.

\[ \text{percentage of increase} = 100 \times \frac{p_e - p_c}{p_c} \tag{3} \]

where $p_e$ and $p_c$ are the proportions of the experimental and control groups, respectively.

Finally, we also report the 95% confidence interval for the effect sizes. The confidence interval for the percentage increase is computed using [Bross 1954]. The confidence interval for Cohen's d is computed using the non-central Student distribution [Howell 2011] as implemented in the R function cohen.d from the R package effsize.
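The effect size computations of Equations (1)-(3) are straightforward to reproduce. The sketch below is ours (the analysis in this paper itself was done in R, with prop.test and the effsize package) and uses only the Python standard library; the example values at the end are made up for illustration and are not the experiment's data.

import math

def cohens_d(group1, group2):
    # Cohen's d with the pooled standard deviation of Equations (1) and (2).
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def percentage_increase(p_experimental, p_control):
    # Equation (3): relative change of the experimental rate over the control rate.
    return 100.0 * (p_experimental - p_control) / p_control

# Hypothetical exam grades for two groups (not the data of this experiment):
python_grades = [7.0, 8.5, 6.0, 9.0, 7.5]
c_grades = [6.0, 7.0, 5.5, 8.0, 6.5]
print(round(cohens_d(python_grades, c_grades), 2))
print(round(percentage_increase(0.16, 0.23), 1))   # about -30, i.e. a decrease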

2.3. Complementary materials

The data used in this research is available at https://doi.org/10.6084/m9.figshare.5284483.v1. The files in the link include the anonymized student data, the program to run the statistical analysis performed in this paper, the translated versions of all the exams (a total of 24 exams: 2 languages x 2 semesters x 2 exams per semester x 3 periods), and the non-translated text of the lab assignments.

3. RESULTS

Table II presents the main results of this research. "Total" is the total number of students enrolled in the three Python and the three C classes. "Fail" is the number of students that did not receive a passing grade at the end of the course, but that attended at least one of the exams. "Midterm" is the average grade (out of 10.0) on the midterm for students that attended that exam. "Final exam" is the average grade on the final exam, again for students that attended that exam. "Proportion of completed labs" is the average proportion of completed labs for the students that received a passing grade. We did not include in these calculations the students that received a failing grade, because we believe that would add a negative bias to the C data, since more students failed that version of the course. "Submissions per completed lab" is the average number of times a student with a passing grade submitted each lab exercise until it passed all test cases.

Table II. Main results of the Python and C comparisons (see text for the meaning of the rows).

                                Python   C      Units
Total                           194      197    students
Dropout                         24       24     students
Fail                            27       40     students
Midterm                         7.08     6.41   grade (maximum grade is 10.0)
Final exam                      7.55     6.72   grade (maximum grade is 10.0)
Proportion of completed labs    82.3%    76.0%  ratio
Submissions per completed lab   2.46     2.91   number of submissions


Table III contains the statistical analysis of the data.

Table III. Statistical analysis of results of the Python and C comparisons.

                                Python  C      p-value   effect size  measure               95% CI on the effect size
Dropout rate                    12%     12%    1.0
Failure rate                    16%     23%    0.12      -31%         percentage increase   -153% to 10%
Midterm                         7.08    6.41   0.014     0.27         Cohen d               0.05 to 0.48
Final exam                      7.55    6.72   0.001     0.36         Cohen d               0.13 to 0.59
Proportion of completed labs    82.3%   76.0%  0.002     0.39         Cohen d               0.14 to 0.62
Submissions per completed lab   2.46    2.91   1.3e-05   -0.54        Cohen d               -0.79 to -0.29
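As a consistency check, the rates in Table III follow directly from the counts in Table II; for example, for the failure rate (computed over the students that did not drop out) and its percentage-increase effect size:

\[ \text{failure rate}_{\text{Python}} = \frac{27}{194 - 24} \approx 15.9\%, \qquad \text{failure rate}_{\text{C}} = \frac{40}{197 - 24} \approx 23.1\%, \]
\[ \text{percentage of increase} = 100 \times \frac{0.159 - 0.231}{0.231} \approx -31\%. \]

Similarly, the dropout rates are 24/194 ≈ 12.4% and 24/197 ≈ 12.2%, both reported as 12% in the table.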

All the measured results, with the exception of the dropout rate, are better for the students that took the course in Python. The dropout rate is the same for both languages. Although there is a 31% decrease in the failure rate, this result is not significant with 95% confidence (p-value = 0.12). Nevertheless we feel that the result provides some evidence that there was a real drop in the failure rate. The differences in the results for the midterm grade (p-value = 0.014, Cohen's d = 0.27), final exam grade (p-value = 0.001, Cohen's d = 0.36), proportion of completed assignments (p-value = 0.002, Cohen's d = 0.39), and submissions per completed assignment (p-value < 0.001, Cohen's d = -0.54) are all significant with 95% confidence.

4. DISCUSSION

The effect sizes of the differences are somewhat compatible with previously published experiments. Salleh et al. [2011], already mentioned above, report an average effect size of 0.16 on exam grades for pair programming versus solo programming in CS1 courses. The Python effect size on exam grades that we encountered is larger than that. Salleh et al. [2011] also report an average effect size of 0.67 when measuring grades on the assignments themselves. We do not have a measure of the quality of the assignments in this experiment, but the effect size measures related to the assignments, completion rate and number of submissions per assignment (0.38 and -0.61, respectively), are compatible with that result. Thus we believe we can make the claim that the use of Python in an introductory programming course would have a higher impact on exam grades than the use of pair programming, and both would have a similar impact on the quality of the assignments.

The effect sizes reported in this paper are also comparable to other interventions in teaching science, technology, engineering, and mathematics topics. Freeman et al. [2014] summarized 158 experiments on using active learning instead of traditional teaching, and the average effect size (in all areas of STEM) is 0.47. There were 8 experiments in that set that were specific to Computer Science; for that subset the average effect size of active learning was 0.3 [Freeman et al. 2014, Fig. 2].

As for the failure rate, Yadin [2011] did not see any reduction in the failure rate just by changing the programming language from Java to Python, but we must point out that the failure rates for those courses were in the 40% range, higher than the ones in this paper. Silva-Maceda et al. [2016] compared two versions of a C-based sequence of courses (two and three semesters) and a three-semester version with a visualization tool, and found a 20% reduction when moving to a three-semester course. But again it seems unclear whether the results can be compared, since the failure rates in those experiments were in the range of 70% to 90%, much higher than ours. Porter et al. [2013] report an average reduction of 61% in failure rates by adopting peer instruction in four different courses (16 offerings) over a 10-year period. One of the courses in that report is CS1, and the average percentage decrease of the failure rate by adopting peer instruction was 60% [Porter et al. 2013, table 2]. Finally, our results are consistent with Watson and Li [2014], who find, as a general rule, that Python-based courses have, on average, lower failure rates than courses based on C or other languages.

Finally, it could be the case that the dropout rates are, in our particular university, only weakly related to the teaching itself, and thus almost unrelated to the programming language chosen. Our university is free, and thus there is no cost for a student to enroll in more courses than she is planning to take. The student would attend the different classes and eventually choose the ones that she will really attend, officially or even unofficially dropping out of the others. If that is the case, then courses with attractive first classes, or courses with a reputation among students of being interesting (or easy), would be more likely to be selected. But both versions of the MC102 course had the same first classes, which covered, among other things, general aspects of computer architecture, numeric representation in computers, and so on. Both versions would likely have the same level of interestingness, and the same reputation among the students.

As we discussed, we believe that only a few aspects of Python were tested in this experiment: simpler syntax, dynamic typing (in particular no declaration of variables), simpler semantics of lists, the availability of an interactive environment, and the lack of call by reference. There is some literature from the area of programming language design showing that some of these aspects are indeed more intuitive for novice programmers. We assume that students are the quintessential novice programmers.

In respect to a simpler syntax, several studies show that mastering the syntax of a programming language is a challenge for novice programmers. Denny et al. [2011] did an empirical study on that matter, and showed that syntax errors are by far the most common type of error students struggle with. This is critical to learning, since other studies show that a significant amount of time is consumed by novices correcting syntax errors [Denny et al. 2012], and the difficulty of recognizing these errors (due, for example, to unclear compiler error messages) is a source of frustration [Kummerfeld and Kay 2003]. Stefik and Siebert [2013] tested the accuracy of non-programmers in solving simple tasks in six programming languages: Quorum, Perl, Java, Ruby, Python, and Randomo (a fake language created by randomly generating some keywords). They found that for languages with a C-like syntax (Java and Perl) the accuracy results were not statistically significantly different from those of Randomo (which plays the role of a placebo programming language), while the accuracy for Python, Ruby, and Quorum was significantly higher than that of Randomo. Other studies found that the way errors are presented and structured is important in learning [Nienaltowski et al. 2008; Marceau et al. 2011]. So it is no surprise that errors that are hard for novice programmers to catch in C, such as an index out of bounds that eventually results in a generic segmentation fault message, are much harder to correct than the same error in Python, which results in a clear out-of-bounds error message.
Regarding dynamic typing, the literature focuses more on non-novice programmers (i.e., developers), and the results are in the opposite direction from our result: static types have better outcomes for complex tasks than dynamic types. A series of controlled experiments, with developers divided into different groups, has been done to assess whether a statically typed language is better than a dynamically typed language for performing some programming task, such as correcting certain types of errors, using undocumented APIs, and software maintenance [Kleinschmager et al. 2012; Mayer et al. 2012; Hanenberg et al. 2014]. These experiments show that static type systems are better by several criteria, and even for problems where dynamically typed languages seem to have benefits, a controlled experiment showed that this is not the case [Okon and Hanenberg 2016]. We are not aware of any experimental results that test specifically the impact of the differences between list and array semantics on programmer efficiency.

4.1. Why do these metrics matter?

We reported six measures: dropout rate, failure rate, midterm grades, final exam grades, proportion of completed assignments, and number of submissions per completed assignment. How are these measures related to the quality of learning?

The importance of reducing failure rates is discussed, for example, by Porter et al. [2013]. Failing a class increases student frustration and may contribute to the student abandoning their major, or higher education altogether, in the future; higher failure rates increase future class sizes, since the student will likely re-take the class in the future, and they add to the instructors' frustration. In the case described in this research, the university is fully supported by government funds, and thus a failing student is also an extra expense of public funds.

The importance of dropout rates for learning/teaching is less clear. In some way a student that dropped out of a class (if that was done soon enough) is not a burden on the system: he or she was not in class, their exams were not graded, their questions were not answered by the instructors or the teaching assistants. The only "cost" of a dropout student is if he or she did not unregister from the class soon enough that the class opening could have been filled by another student.

The final exam grade is a good indication of the quality of the student's learning: if a student succeeded on the final exam, it is likely that the student will at least be able to create small programs to solve small problems. Let us assume for the moment that all final exams for the two semesters of this experiment were "equivalent" for the purpose of evaluating the ability of students to solve small programming problems (we will discuss this point in further detail below). Then the combination of a higher grade on the Python final exams and a lower failure rate is a true indication of better learning: more students learned more programming in the Python version than in the C version.

By itself, the metric of proportion of completed labs can be an indication of better learning or of the opposite. In the opposite direction, maybe the students had to complete more labs because they had worse grades on the exams and needed the labs to achieve a passing grade. But given that the exam grades were higher for the Python students, the more plausible explanation is that these students had less of a problem solving the exercises using Python than their peers using C. The higher proportion of completed lab assignments and the lower number of submissions per completed assignment are indications that novice programmers needed less effort to solve simple problems in Python than in C.

4.2. Threats to validity

For classroom-based interventions, it is very difficult to perform a randomized controlled trial (RCT), since the students are assigned to classrooms not randomly but based on variables that are likely relevant for the experiment. In this experiment, students are assigned to classrooms based on their majors. Since this experiment was not an RCT, there are many threats to the validity of the results, but we believe that the switching design, where the same instructor taught both a C and a Python version of the course to students of the same major, should control for some of the threats. We discuss the major threats to the validity of the results below.

(1) Differences among the instructors. Of course instructors have different teaching abilities, but we assume that the variability among instructors is preserved across programming languages. In this case the variability among instructors is canceled out by the switching design. For example, if a faculty member has a particular ability to motivate students that are performing worse in the course, we assume that that ability had the same effect of improving the grades of failing students in both versions of the course. If a faculty member is particularly severe in grading the exams, then the negative bias he/she introduces to the grades is the same in both versions of the course. We must also point out that all instructors followed the same class plan and used the same set of instruction slides.

(2) Variability among students. Students have different background knowledge and interest in a CS1 course. We assume that if there was different background knowledge or different motivation and interest in the CS1 course, it was a major-specific characteristic. Thus, if, for example, Math and Physics majors do not find CS1 of particular interest for their education, the negative effect of this relative lack of interest would balance out in both versions of the course. We also believe that students of the same major, but of different years, have the same general knowledge background, due to the way students are admitted to the university. Admission to Unicamp is through an entrance exam, and candidates are ranked within each major. That is, the student must choose a major before enrolling and compete with the other candidates for that major. For all majors in this study the ratio of candidates per opening was well above 10 to 1, so one can consider that all students in the same major came from a similar pool of well qualified high school students.

(3) Instrumentation threat within a single semester. Instrumentation threat, in general, refers to the potential problem that the tools used to measure the outcome can change, and the change in outcome may be due to the change in the instruments used to measure it. In this case, there could have been differences in the Python and C exams or lab assignments within a single semester. In the case of the assignments, they were the same for both versions of the course in a single semester. For the exams, since the classes involved in the experiment were in different periods (morning, afternoon, and evening; this balancing across periods was an accident, not part of the experiment design), the exams for the classes were not exactly the same. The professor responsible for the exams and lab assignments has to create three slightly different exams, one for each of the three periods. Of course the professor takes care to create exams of similar difficulty. In fact the exams have the same “four types” of questions, and the details of the questions change for each period. A translation of the exams, for both years, both programming languages, and the three periods is available at https://doi.org/10.6084/m9.figshare.5284483.v1. The interested reader can evaluate for themselves our claim that the three exams are isomorphic. Finally, the final exam was not exactly the same for the Python and C versions within a single semester. There was always one question (out of 4) in the C exam that asked the students to use call by reference to define a function that returns more than one value, say a function that receives a vector of numbers and returns the maximum and minimum elements in the vector. The equivalent Python question on the exam did not use call by reference, but asked the student to implement the same function returning a tuple of values.
In some way the Python question is “easier”, since both compute the same values, but for the C version the student would also have to understand call by reference.

(4) Instrumentation threat across the two semesters. There were more Python students in 2015 than in 2016. (The original experiment design was balanced with respect to the number of students in each language version in both semesters: a fourth faculty member was planned to follow the C-in-2015 and Python-in-2016 schedule, and he did teach the first C course, but for reasons not related to the experiment he could not teach the corresponding Python course in 2016. We did not include the data related to the C class he taught in the results.) Thus, if the exams or exercises in 2015 were “easier”, that could explain the overall positive outcome of the Python students. In both semesters the professor in charge of creating the exams and exercises was the same, and there is always an important goal of not making a semester particularly hard or easy on the students. As mentioned above, the supplemental material of this paper contains the translation of the exams, and the reader can evaluate for themselves whether the exams in the two semesters are isomorphic. The supplemental material also contains the untranslated set of lab assignments for both semesters. Although we cannot provide evidence of a non-difference between the two semesters, we can show that there is no strong evidence that there was a difference. Table IV displays the average grades for the C and Python midterm and final exams for each semester separately, and the p-value of the comparison. We also display the mean lab completion proportion for both languages and semesters. For the instrumentation across semesters to be a threat to the conclusions of this paper, the 2015 semester would have to have been “easier” than the 2016 semester, that is, in general all grades should have been higher in 2015. For the midterms they indeed were, but the differences are not significant; for the final exam the reverse seems true, but only one of the differences is significant. Finally, for the lab completion proportion, both results are significant but point in opposite directions. There is no strong and consistent evidence that the grades in the 2015 semester were higher than the grades in the 2016 semester.

Table IV. Comparison of the exams and completed lab assignments across the two different semesters.

Measure                        | 2015 | 2016 | p-value
Midterm C                      | 6.51 | 6.37 | 0.72
Midterm Python                 | 7.18 | 6.85 | 0.42
Final C                        | 4.89 | 6.29 | 0.00
Final Python                   | 6.63 | 7.01 | 0.45
Completed assignments C        | 0.62 | 0.70 | 0.02
Completed assignments Python   | 0.81 | 0.67 | 0.00

A different form of evidence for the non-difference between the two semesters is to consider the results for each semester separately. One could assume that the experimental comparison of the C and the Python versions of MC102 was executed in a single semester. In this case the experiment would suffer from the threats of differences among the instructors and variability among the students discussed above, but that experimental design would not suffer from the difference between the two semesters. Table V shows the main results of this paper if the experiment had been run exclusively in one or the other semester. The table reports the values of the corresponding measures for 2015 and for 2016, and the p-value of the corresponding comparison. The results in Table V show that all but one (the lab completion rate for 2016) of the results of this paper (lower failure rate, higher midterm and final exam grades, higher completion rates, and lower submissions per lab for Python) would be reproduced, although, as expected, some would not be statistically significant. We believe that the results in Tables IV and V, together with the availability of the exams so that readers can evaluate them for themselves, provide enough evidence that the results reported in the experiment were not due to instrumentation threats across the two semesters.

Table V. Results if the experiment were run in each of the two semesters independently.

                  |            2015            |            2016
Measure           | Python | C    | p-value    | Python | C    | p-value
Fail rate         | 0.16   | 0.26 | 0.15       | 0.15   | 0.21 | 0.45
Midterm           | 7.18   | 6.51 | 0.09       | 6.86   | 6.36 | 0.24
Final             | 7.46   | 5.41 | 0.00       | 7.74   | 7.36 | 0.31
Lab completion    | 0.86   | 0.69 | 0.00       | 0.73   | 0.78 | 0.07
Lab submissions   | 2.35   | 2.61 | 0.05       | 2.70   | 3.03 | 0.02

(5) Dependency between the results. This is a more subtle threat. This paper made claims that there were changes in different outcomes due to the choice of programming language: in the grades of both exams, in the failure rate, in the proportion of completed assignments, and in the number of submissions per assignment. But depending on how the measures were calculated, these claims could be explained as derived from a single outcome change. For example, it could be the case that the choice of Python only reduces the failure rate; if we calculated the final exam mean as the mean over all students, then there could be many grades of 0 on the C version of the final exam, because a student, knowing that he or she would fail, decided not to take that exam. These 0 grades could explain the lower mean grade for the final exam in the C version. The two claims, that Python decreased the failure rate and that it increased the final exam grades, would become “entangled”, and would not provide independent evidence for the better outcomes of Python. We attempted to decrease these dependencies among the different claims: we calculated the mean grades of the exams only for the students who had a greater-than-zero grade on that exam. Similarly, we computed the proportion of submitted assignments only for the students that passed the course, and we computed the number of submissions only for assignments that were completed. In this way we believe we are decoupling the outcomes measured in this research, and the claims can be seen as more independent of each other.

(6) Threats to generalization. There are some potential limits on generalizing the results in this paper to all students. As mentioned above, the university is fully funded with public funds, and so it is free for the students. Also, the university is very highly ranked among the country's universities, so to be accepted the student has to pass an entrance exam which is highly competitive. Therefore the students involved in this experiment are probably more motivated and have more knowledge related to high school topics than one would expect from the “average student”.

5. CONCLUSIONS

In this research we compared the use of C and the use of Python as the programming language in a CS1 course. The use of Python was somewhat limited, in the sense that only a few aspects/concepts of Python were used in the course. In fact, we believe that the only aspects of Python that were relevant in this experiment were the simpler syntax (in comparison to C), dynamic types, the simpler semantics of lists (as opposed to C arrays), and the interactive environment. Under this limited use of Python, we found a positive effect of using Python on a) the failure rate, b) the exam grades, c) the ratio of completed lab assignments, and d) the number of submissions per completed lab assignment. All these positive effects, with the exception of the drop in failure rates, are statistically significant (with 95% confidence). The effect sizes are comparable to published results on interventions in a CS1 course.

Although there seem to be no negative consequences of moving a C-based introductory programming course to Python, we must point out that we have not tested the impact of this change on future CS courses. But we remind the reader of Enbody et al. [2009] and Enbody and Punch [2010], who present evidence of no negative consequence of using Python as the CS1 programming language on future CS courses, while Hunt [2015] found a negative impact.

ACKNOWLEDGMENTS

The authors would like to thank Professors Diego F. Aranha and Heiko Hornung for participating in the experiment. The authors would also like to thank the associate editor and the three reviewers for their very generous comments and suggestions on the previous version of this paper. The work is supported by CNPq (grants No. 306358/2014-0 and 425340/2016-3) and Fapesp (grants No. 2015/11937-9 and 2016/23552-7).

REFERENCES

Irwin Bross. 1954. A confidence interval for a percentage increase. Biometrics 10, 2 (1954), 245–250.
Jacob Cohen. 1988. Statistical power analysis for the behavioral sciences. Lawrence Erlbaum Associates.
Stephen Cooper, Wanda Dann, and Randy Pausch. 2003. Teaching objects-first in introductory computer science. ACM SIGCSE Bulletin 35, 1 (2003), 191–195.
Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All syntax errors are not equal. In 17th ACM Conference on Innovation and Technology in Computer Science Education. 75–80.
Paul Denny, Andrew Luxton-Reilly, Ewan Tempero, and Jacob Hendrickx. 2011. Understanding the syntax barrier for novices. In 16th Annual Conference on Innovation and Technology in Computer Science Education. 208–212.
Albrecht Ehlert and Carsten Schulte. 2009. Empirical comparison of objects-first and objects-later. In 5th International Workshop on Computing Education Research. 15–26.
Richard J. Enbody and William F. Punch. 2010. Performance of Python CS1 students in mid-level non-Python CS courses. In 41st ACM Technical Symposium on Computer Science Education. ACM, 520–523.
Richard J. Enbody, William F. Punch, and Mark McCullen. 2009. Python CS1 as preparation for C++ CS2. ACM SIGCSE Bulletin 41, 1 (2009), 116–120.
Scott Freeman, Sarah L. Eddy, Miles McDonough, Michelle K. Smith, Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wenderoth. 2014. Active learning increases student performance in Science, Engineering, and Mathematics. Proceedings of the National Academy of Sciences 111, 23 (2014), 8410–8415.
Ernie Giangrande Jr. 2007. CS1 programming language options. Journal of Computing Sciences in Colleges 22, 3 (2007), 153–160.
Michael H. Goldwasser and David Letscher. 2008. Teaching an object-oriented CS1 - with Python. ACM SIGCSE Bulletin 40, 3 (2008), 42–46.
Philip Guo. 2014. Python is now the most popular introductory teaching language at top U.S. universities. BLOG@CACM. http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-u-s-universities/fulltext. (2014).
Stefan Hanenberg, Sebastian Kleinschmager, Romain Robbes, Éric Tanter, and Andreas Stefik. 2014. An empirical study on the impact of static typing on software maintainability. Empirical Software Engineering 19, 5 (2014), 1335–1382.
David C. Howell. 2011. Confidence intervals on effect size. Lecture handout, Psychology, University of Vermont. (2011).
John M. Hunt. 2015. Python in CS1 - not. Journal of Computing Sciences in Colleges 31, 2 (2015), 172–179.
Jussi Kasurinen and Uolevi Nikula. 2007. Lower dropout rates and better grades through revised course infrastructure. In 10th IASTED International Conference on Computers and Advanced Technology in Education. 152–157.
Ken Kelley and Kristopher J. Preacher. 2012. On effect size. Psychological Methods 17, 2 (2012), 137.
Sebastian Kleinschmager, Romain Robbes, Andreas Stefik, Stefan Hanenberg, and Éric Tanter. 2012. Do static type systems improve the maintainability of software systems? An empirical study. In 20th IEEE International Conference on Program Comprehension. 153–162.
Theodora Koulouri, Stanislao Lauria, and Robert D. Macredie. 2015. Teaching introductory programming: a quantitative evaluation of different approaches. ACM Transactions on Computing Education (TOCE) 14, 4 (2015), 26.
Sarah K. Kummerfeld and Judy Kay. 2003. The neglected battle fields of syntax errors. In 5th Australasian Conference on Computing Education. 105–111.
Linda Mannila, Mia Peltomäki, and Tapio Salakoski. 2006. What about a simple language? Analyzing the difficulties in learning to program. Computer Science Education 16, 3 (2006), 211–227.
Guillaume Marceau, Kathi Fisler, and Shriram Krishnamurthi. 2011. Measuring the effectiveness of error messages designed for novice programmers. In 42nd ACM Technical Symposium on Computer Science Education. 499–504.
Clemens Mayer, Stefan Hanenberg, Romain Robbes, Éric Tanter, and Andreas Stefik. 2012. An empirical study of the influence of static type systems on the usability of undocumented software. ACM SIGPLAN Notices 47, 10 (2012), 683–702.
Bradley N. Miller and David L. Ranum. 2005. Teaching an introductory computer science sequence with Python. In 38th Midwest Instructional and Computing Symposium.
Ellen Murphy, Tom Crick, and James H. Davenport. 2016. An analysis of introductory university programming courses in the UK. arXiv preprint arXiv:1609.06622 (2016).
Marie-Hélène Nienaltowski, Michela Pedroni, and Bertrand Meyer. 2008. Compiler error messages: What can help novices? ACM SIGCSE Bulletin 40, 1 (2008), 168–172.
Sebastian Okon and Stefan Hanenberg. 2016. Can we enforce a benefit for dynamically typed languages in comparison to statically typed ones? A controlled experiment. In 24th IEEE International Conference on Program Comprehension. 1–10.
Joseph D. Oldham. 2005. What happens after Python in CS1? Journal of Computing Sciences in Colleges 20, 6 (2005), 7–13.
Holly Patterson-McNeill. 2006. Experience: from C++ to Python in 3 easy steps. Journal of Computing Sciences in Colleges 22, 2 (2006), 92–96.
Arnold Pears and Lauri Malmi. 2009. Values and objectives in computing education research. ACM Transactions on Computing Education (TOCE) 9, 3 (2009), 15.
Arnold Pears, Stephen Seidman, Lauri Malmi, Linda Mannila, Elizabeth Adams, Jens Bennedsen, Marie Devlin, and James Paterson. 2007. A survey of literature on the teaching of introductory programming. ACM SIGCSE Bulletin 39, 4 (2007), 204–223.
Leo Porter, Cynthia Bailey Lee, and Beth Simon. 2013. Halving fail rates using peer instruction: a study of four computer science courses. In 44th ACM Technical Symposium on Computer Science Education. 177–182.
Vennila Ramalingam and Susan Wiedenbeck. 1997. An empirical study of novice program comprehension in the imperative and object-oriented styles. In 7th Workshop on Empirical Studies of Programmers. 124–139.
Stuart Reges. 2006. Back to basics in CS1 and CS2. ACM SIGCSE Bulletin 38, 1 (2006), 293–297.
Norsaremah Salleh, Emilia Mendes, and John Grundy. 2011. Empirical studies of pair programming for CS/SE teaching in higher education: A systematic literature review. IEEE Transactions on Software Engineering 37, 4 (2011), 509–525.
Christine Shannon. 2003. Another breadth-first approach to CS1 using Python. ACM SIGCSE Bulletin 35, 1 (2003), 248–251.
Robert Michael Siegfried, Jason Siegfried, and Gina Alexandro. 2016. A longitudinal analysis of the Reid list of first programming languages. Information Systems Education Journal 14, 6 (2016), 47.
Gabriela Silva-Maceda, P. David Arjona-Villicaña, and F. Edgar Castillo-Barrera. 2016. More time or better tools? A large-scale retrospective comparison of pedagogical approaches to teach programming. IEEE Transactions on Education 59, 4 (2016), 274–281.
Andreas Stefik and Susanna Siebert. 2013. An empirical investigation into programming language syntax. ACM Transactions on Computing Education (TOCE) 13, 4 (2013), 19.
Murat P. Uysal. 2012. The Effects of Objects-First and Objects-Late Methods on Achievements of OOP Learners. Journal of Software Engineering and Applications 5, 10 (2012), 816.
Christopher Watson and Frederick W.B. Li. 2014. Failure rates in introductory programming revisited. In Conference on Innovation & Technology in Computer Science Education. 39–44.
Aharon Yadin. 2011. Reducing the dropout rate in an introductory programming course. ACM Inroads 2, 4 (2011), 71–76.

Received March 2017; revised XXX; accepted XXX
