Commonsense Computing (episode 5): Algorithm Efficiency and Balloon Testing

Robert McCartney∗, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA

Dennis J. Bouvier, Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL 62026 USA

Tzu-Yi Chen, Department of Computer Science, Pomona College, Claremont, CA 91711 USA

Gary Lewandowski, Dept. of Mathematics and Computer Science, Xavier University, Cincinnati, OH 45207 USA

Kate Sanders∗, Dept. of Mathematics and Computer Science, Rhode Island College, Providence, RI 02908 USA

Beth Simon, Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093 USA

Tammy VanDeGrift, Department of Electrical Engineering and Computer Science, University of Portland, Portland, OR 97203 USA

∗ Visiting researcher, Teknisk Databehandling, Inst. för Informationsteknologi, Uppsala Universitet, Spring 2009

ABSTRACT

This paper investigates what students understand about algorithm efficiency before receiving any formal instruction on the topic. We gave students a challenging search problem and two solutions, then asked them to identify the more efficient solution and to justify their choice. Many students did not use the standard worst-case analysis of algorithms; rather they chose other metrics, including average-case, better for more cases, better in all cases, one algorithm being more correct, and better for real-world scenarios. Students were much more likely to choose the correct algorithm when they were asked to trace the algorithms on specific examples; this was true even if they traced the algorithms incorrectly.

Categories and Subject Descriptors

K.3.2 [Computers and Education]: Computers and Information Science Education—Computer Science Education

General Terms

Algorithms, Experimentation

Keywords

commonsense, preconceptions, constructivism, water balloons, algorithm analysis

1. INTRODUCTION

The ability to analyze the efficiency of algorithms is central to computer science. From the earliest courses, students are taught to consider the effects of the algorithms and data structures that they choose on the performance of their programs. As they continue in the computer science major, students learn more formal techniques and are taught how to apply these to both theoretical and practical problems.

In order to effectively teach algorithm analysis, it is important to first establish what students already understand about the topic. The motivation is the constructivist theory about how people learn: they start with what they already know and build knowledge on that foundation, rather than receiving it passively from an instructor. Each learner's background, culture, and previous knowledge define his or her starting point. Thus, Bransford et al. [4] argue that learning must engage students' preconceptions to be effective. This paper investigates algorithmic preconceptions by addressing the following questions:


• How do preconceptions affect the understanding of an English description of an algorithm?

• What does "efficiency" mean to beginning students – what sort of efficiency measures do students use?

• Can students choose between two alternative algorithms based on their efficiency?

• Once students choose between algorithms, how do they support their choice?

• How do students use examples, and what effect does this have on their performance?

The answers to these questions provide a basis for understanding how students approach algorithmic problems and their analysis, as well as insight into the variety of preconceptions an instructor should consider when designing lessons on algorithm analysis.

This paper is structured as follows. In Section 2 we discuss related work on preconceptions. We describe our methodology in Section 3. In Section 4 we present our results. In Section 5 we discuss the results in general, effects that seem to follow from the wording of the questions asked, and implications of our findings. We close with some conclusions and directions for future work.

2. RELATED WORK

The importance of constructivism is recognized in the computing education community. Ben-Ari [1] compared constructivism in computing education with other fields, and pointed out some differences: in computer science education, it is necessary that students have a model of the computer, and the computer can provide an "effective ontological reality", i.e., a verification of whether a program works or not.

Several researchers have studied student preconceptions:

• Miller [13] analyzed "natural language" programs by students who had not had a formal programming course, with the purpose of exploring the idea of writing computer programs in natural language. He found that a number of standard programming concepts showed up in these natural language descriptions.

• Onorato and Schvaneveldt [14] also looked at natural language descriptions of a programming task, comparing subjects drawn from different pools: naïve (students with no programming experience), beginner (students currently taking their first programming course), and expert (students with a good deal of programming experience). They found differences between the experts and the others, and also between the naïves and the beginners, even though neither of those groups had programming experience.

• While studying misconceptions of novice programmers, Bonar and Soloway [3] specifically considered preprogramming knowledge, which they call "step-by-step natural language programming knowledge." They distinguish this preprogramming knowledge from knowledge of the programming language Pascal, which the students were learning in their introductory course. They found that many of the observed bugs could be explained by a mismatch between students' knowledge in these two different domains.

• Gibson and O'Kelly [8] looked at a variety of search problems (with pre-college students) and Towers-of-Hanoi problems (with beginners), and found that both groups showed "algorithmic understanding" of how to solve these problems.

Earlier work on commonsense computing has sought to identify student preconceptions that could be leveraged in teaching beginning computing concepts. To investigate student preconceptions about sorting, Simon et al. asked beginners to describe in words how they would sort a list of numbers into ascending order [16]. A majority described a coherent algorithm to solve the problem, and many gave versions of selection or insertion sort. Most treated numbers as strings, however, and manipulated them digit by digit. Many students used iteration. Surprisingly, most iteration involved post-test loops.

Lewandowski et al. examined students' commonsense understanding of concurrency [10]. They found that 97% of the students could identify a race condition and that 71% of the students provided a reasonable solution. The most common technique for avoiding the race condition was subdividing the resources. The question given to the students was based on Ben-David Kolikant's work [2] on experienced students entering an introductory concurrency course. In terms of both correct and incorrect preconceptions, the students in the later study appeared to enter their first computing class with essentially the same level of intuition that Ben-David Kolikant's students brought to the advanced course on concurrency.

Simon et al. described an investigation of students' commonsense knowledge of debugging [15]. They gave beginners four different questions designed to elicit knowledge of debugging strategies. The questions asked students to describe (1) the advice they would give someone whose light did not turn on when he flipped the light switch; (2) how they would locate the moment when things go wrong in the children's "telephone game"; (3) how they would find a Starbucks, if they found themselves suddenly in a strange city where they did not speak the language; or (4) an experience of their own that involved troubleshooting. (Different students received different questions.) In general, beginners were found to have less commonsense knowledge of debugging than of sorting, and some of their pre-existing knowledge did not serve them well. For example, real-world fixes can be easy to undo, unlike programming changes. Similarly, the real world is nondeterministic in ways that CS1 programs generally are not – if your car didn't start, wouldn't you turn the key a second time?

There is a substantial body of work both in computing and other disciplines on misconceptions: incorrect concept understandings that need to be replaced with correct models. Gal-Ezer and Zur [7] looked at misconceptions about algorithm efficiency among high school students before and after taking a computing course. They asked students about the relative efficiency of pairs of programs that perform the same task, and found that students often believe (incorrectly) that a shorter program is more efficient than a longer one, and that a program with fewer variables is more efficient than one with more variables. Clancy [5] provides a survey of the misconceptions work in computer science; the National Academy's Committee on Undergraduate Science Education [6] (Ch. 4) gives a more general overview. Smith et al. [17] challenge this view in the context of math and science education, arguing that misconceptions are limited mental models that can be built upon to gain correct understanding.


3. METHODOLOGY

The data collected for this study were student answers to an algorithmic problem: which is the better of two search algorithms? We collected data using five variations of the problem: two in the fall at five schools, three in the spring at three schools each. The data were all collected electronically, though the exact collection method varied as described in Section 3.2 below.

3.1 The problem

This study uses a variation on Ginat's "The Tower and the Glass Balls" problem, which was stated as follows in [9]:

The Tower and the Glass Balls. Given a tower of N floors, one wants to find the lowest floor from which a glass ball will break once it falls. If there is only one such ball, then the search for the "breaking floor" must be linear. That is: throw the ball from the first floor, then from the second, and so on, until the ball will break. But, there are two identical balls. Thus, one ball can be used for a "coarse-grain" search until it breaks. Then, the other ball can be used for a "fine-grain" search. Develop an efficient algorithm for performing the search with minimal number of throws.

Ginat's problem was designed for group discussion: students solved the problem, then discussed the solutions with their peers. In the discussion phase, two different classes of solutions could be compared: those that did the coarse-grained search in a binary-search pattern (N/2, 3N/4, 7N/8, and so on), and those that did the coarse-grained search by checking every kth floor (k, 2k, 3k, and so on); for each, the fine-grained search is a linear search of the floors in the interval found upon the first balloon breaking. Ginat concluded that discussion and instructor guidance could be used to help students reach conclusions about the most efficient algorithm. This suggested that the glass balls problem could allow us to examine the preconceptions about algorithm analysis that students might bring to such a classroom discussion.

To support the use of individual student answers without the subsequent discussion format, we provided two solutions to this problem, corresponding to the classes of Ginat's solutions [9]. One was a binary-search approach; the other was an "every kth floor" approach, where k = √N. To make the problem more concrete and calculations simpler, we stipulated the building to have exactly 256 floors (256 is both a perfect square and a power of 2), and we provided a "frame story" about water balloon testing. Our basic statement of the problem is given in Figure 1.

The Tower and Water Balloon Problem

WaterBalloons Inc. has asked you to test the strength of a new fabric for water balloons. You are given some water balloons, sent to a tower with 256 floors, and asked to determine the highest floor from which the balloons can be dropped without breaking. In other words, there is some floor such that any water balloon dropped from that floor or any lower floor will not break, but any water balloon dropped from a higher floor will break. Fortunately you're allowed to break the balloons you're given in order to determine this floor. In addition, the balloons do not weaken if they are dropped and do not break.

The following are two ways to determine the strength of the fabric if you are given 2 equivalent balloons, one red and one blue. Which is more efficient, in the sense that it requires the fewest balloon drops? Explain your answer in complete English sentences.

"By Half" Solution: Drop the red balloon from the 128th floor of the tower. If it does not break, drop it from the 192nd floor, and then 224th, etc., increasing the floor number by half of the remaining floors of the tower each time. When the red balloon breaks you begin dropping the blue balloon from the highest floor from which the red balloon did not break, working your way to the floor on which the red balloon broke.

"Square Root" Solution: Drop the red balloon from the 16th floor of the tower. If it does not break, drop it from the 32nd, then 48th, and so on, increasing the floor number by 16 (the square root of the total number of floors in the tower) each time. When the red balloon breaks you begin dropping the blue balloon from the highest floor from which the red balloon did not break, working your way to the floor on which the red balloon broke.

Figure 1: The algorithm description, with the form of the question given to the students in Fall, 2008.

This problem has a number of features that make it suitable as an instrument for studying preconceptions about efficiency and algorithm analysis. Most notably, neither algorithm is best for all cases: the Square Root solution is better in 177 cases, the By Half solution is better in 65 cases, and they are the same in the remaining 15. On the lowest 16 floors both algorithms require the same number of drops to find the breaking floor. If the balloons first break on floors 16–128, 145–192, or 209–224, the Square Root solution requires fewer drops than the By Half method. In the remaining 65 cases (i.e., if the balloons first break on floors 129–144, 193–208, 225–256, or not at all) the By Half solution requires fewer drops.

In addition, the worst-case performance of the two algorithms occurs at different floors. The Square Root algorithm does worst when the balloon first breaks at floor 256; this requires 32 drops if one strictly follows the algorithm in Figure 1, though one drop is redundant so 31 drops suffices. The By Half algorithm does worst when the balloon first breaks at floor 128; this requires 128 drops. The Square Root algorithm is better than the By Half algorithm in terms of worst case and average case, and it is better for more than half of the possible answers.
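These case counts can be checked by simulating both procedures over the 257-point outcome space. The minimal sketch below (not part of the original study) is one way to do so in Python; the helper names and the handling of the edge cases that the problem statement leaves open – where the blue balloon starts if the red one breaks on its very first drop, and how the By Half sequence behaves near the top of the tower (see Section 5.1.1) – are our own assumptions. Counting the redundant first drop of the blue balloon, as the descriptions in Figure 1 literally require, this sketch reproduces the counts quoted above.

def drops_square_root(b, top=256, step=16):
    """Drops needed when the balloons first break at floor b (None = never break)."""
    drops = 0
    last_safe = 0                              # highest floor the red balloon survived
    for floor in range(step, top + 1, step):   # red balloon: 16, 32, ..., 256
        drops += 1
        if b is not None and floor >= b:       # red balloon breaks here
            # Blue balloon: start at the highest floor the red survived
            # (floor 1 if the red broke on its very first drop -- an edge
            # case the problem statement leaves open), and walk upward.
            start = last_safe if last_safe else 1
            for f in range(start, floor):
                drops += 1
                if f >= b:                     # blue balloon breaks
                    break
            return drops
        last_safe = floor
    return drops                               # red never broke within the tower

def drops_by_half(b, top=256):
    drops = 0
    last_safe = 0
    floor = top // 2                           # red balloon: 128, 192, 224, ..., 255, 256
    while floor <= top:
        drops += 1
        if b is not None and floor >= b:
            start = last_safe if last_safe else 1
            for f in range(start, floor):
                drops += 1
                if f >= b:
                    break
            return drops
        last_safe = floor
        floor += max(1, (top - floor) // 2)    # always advance by at least one floor
    return drops

outcomes = list(range(1, 257)) + [None]        # the 257-point outcome space
sr = [drops_square_root(b) for b in outcomes]
bh = [drops_by_half(b) for b in outcomes]
print(sum(s < h for s, h in zip(sr, bh)),      # 177 cases: Square Root better
      sum(h < s for s, h in zip(sr, bh)),      # 65 cases: By Half better
      sum(s == h for s, h in zip(sr, bh)))     # 15 cases: the same
print(max(sr), max(bh))                        # worst cases: 32 and 128 drops
print(sum(sr) / len(sr), sum(bh) / len(bh))    # average drops for each method

Running the sketch also prints the average number of drops over the outcome space for each method, which is the quantity some students estimated in their average-case arguments (Section 3.3.1).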

3.2 Data Collection

Data were collected from seven schools overall. In the fall of 2008 we asked two versions of the question in Figure 1: in one version the By Half solution was given before the Square Root solution, and in the other the two solutions were reversed. The order of the solutions did not have a significant effect on which algorithm the students identified as better (χ²(1, N = 100) = 0.503, p = .45), so for the rest of the paper we group the responses to the two versions. In the spring of 2009 we used three versions of the question (which we label in the rest of the paper as Sp.V1, Sp.V2, and Sp.V3):


Sp.V1: In this version we asked students to calculate the number of balloon drops needed by each algorithm to determine that the balloon first broke at floors 10, 65, and 130. The second part of the question then asked them to choose the best algorithm, and to explain their choice.

Sp.V2: This variation asked students to choose the algorithm that "performs best in the worst case," without further explanation of "worst case."

Sp.V3: This variation asked students to choose the algorithm that performs best in the worst case, but we also explained that "for each of the two methods there is some worst floor in the sense that it would require the most number of drops to determine that the balloons break at that floor."

These variations give insight into students' understanding of, and ability to use the definition for, worst-case analysis. The questions also help illuminate whether scaffolding (e.g., being required to trace the algorithm on concrete examples or being given the definition of worst case) affects students' responses.

The number of answers collected, by version and school, is given in Table 1. These institutions include large public research schools (A and G), medium-sized public liberal arts schools (C and F), a medium-sized private liberal arts school (D), and small private liberal arts schools (B and E). While most responses were collected from students in a CS1 course, there were also some responses collected from students in a freshman writing course, a CS0 course, and a CS1.5 course. The CS0 course is a survey of computing topics; the CS1.5 course is taken after a 10-week CS1 class with standard programming concepts in Java.

Inst.   Course       Fall     Sp.V1    Sp.V2    Sp.V3
A       CS1          16       –        –        –
B       CS1/FrWr     19/13    –        –        –
C       CS1          20       –        –        –
D       CS1          33       –        –        –
E       CS1          28       14       15       12
F       CS0/CS1      –        13/20    18/25    18/26
G       CS1/CS1.5    –        27/29    45/13    45/20
total                129      103      116      121

Table 1: Number of students for each institution and version.

3.3 Analysis

We coded student explanations for the following attributes:

• the type of analysis used
• whether they mentioned concrete examples
• whether they discussed intervals

For each of these attributes, two researchers developed criteria and then independently tagged all of the student responses. They then discussed their differences and refined their criteria. For a handful of cases a third researcher was asked to "break the tie."

3.3.1 Type of analysis

The problem statement in Figure 1 asks which of the two algorithms is most efficient "in the sense that it requires the fewest balloon drops." Student answers showed a number of different interpretations of what this meant:

1. better worst-case performance – the algorithm whose maximum number of drops was less than the other's.

2. better performance on more of the floors – the algorithm that is better than the other for the majority of the points in the outcome space.

3. better average-case performance – a lower average number of drops over the outcome space.

4. a number of other measures, based on things like the number of first or second balloon drops only, better performance for certain floors, the correctness of the algorithms over the outcome space, and which algorithm is better if the balloon is likely to break on a lower (or higher) floor.

The outcome space here is the set of 257 possible highest-safe-floor values: the 256 floors on which the balloon might first break, plus the possibility that it does not break even when dropped from the top floor. Students did not always identify the more efficient algorithm, and it was not always possible to determine how a student interpreted efficiency, due to unclear or contradictory explanations. Answers were assigned to categories. Four of these – worst-case, best on most floors, average-case, and other – were based on the interpretations above. Three more categories – unclear, no explanation, and always better – were used, respectively, for answers that were too unclear or contradictory to categorize, for students who provided no explanation, and for students who stated that one algorithm was superior in all cases.

The following is an example of a worst-case response:

The square root method would on average require less balloon drops because the maximum number of drops would be about 32, as opposed to around 128 for the half method. Even though the half method decreases the interval for which the balloon will break more quickly, it leaves the interval from 0 to 128, which will more often than not end up taking longer than using the square root method. (Subject B027, Fall)

An example of a best on most floors solution is the following:

The square root solution is the faster solution to this problem. Once the balloon breaks, the tester has to go through fewer floors in order to find the highest floor that the balloon can be dropped from. For example, if the balloon breaks on the 32nd floor and the tester is using the square root solution, he would only have to go back to the 16th floor. If the balloon breaks on the same floor, the person using the half solution would have to go back to the first floor. The half solution would be faster in some cases, but for the majority of the floors, the square root solution is faster. (Subject D03, Fall)


An example of a response that uses an average-case metric is the following:

The square root solution is the more efficient method. I approached this problem by drawing a tree diagram that showed the different outcomes and used it to figure out the number of of trials involved for each floor (the diagram covered ranges of floors rather than each individual floor). I then used a calculator to find the total number of trials for each tree and divided it by the number of floors. I found that, on average, the "by half" solution required about 44 trials to find the correct floor, whereas the "square root" solution took only 16. (Subject B015, Fall)

This answer is an example of an Other answer: it uses "cover a larger amount of floors" as a metric, and seems to count only the first balloon:

I would choose the "By Half" method because it can cover a larger amount of floors using few balloons and also decreases the amount of floors you would have to search by half each time you drop a red balloon. For example, if the red balloon broke on the 256th floor, you would have only used about eight balloons starting from the 128th floor. If you use the "Square Root" method, you would have used 16 balloons by the time you reached the 256th floor, starting from the 16th floor. (Subject E017, Fall)

Some of the Other answers involve correctness arguments; see, for example, the quotes in Section 5.1.1.

An example of an Unclear answer is the following: the explanation is unclear, and seems to be inconsistent with the choice of By Half as more efficient:

Because if it does break then you have fewer windows in between to see which floor is the highest. Also since they don't weaken if they don't break then even though you are dropping it more times you can use the same balloon till you have gotten too high. (Subject F343, Sp.V2)

The following is an Always Better answer: the student identifies Square Root as better at all floors:

In all cases it takes less steps to find the floor from which the balloon breaks. (Subject F346, Sp.V1)

This answer was marked as No Explanation: the explanation added nothing to the student's selection of By Half as the better algorithm:

if we do by square root, we need to try much more time than the by half (Subject G047, Sp.V2)

3.3.2 Concrete examples

One way to justify a choice of algorithm is by appealing to some examples. The following response is one that uses an example:

After the initial drops and finding the floor that the balloon drops on, there will be less floors to determine the wore (sic) case than if using the by half method. For example if using the by half method and the balloon broke on floor 192 but not on floor 128, one would have to do up to 63 drops to determine the worst case floor. When using the square root method, if the balloon broke on the maximum amount of drops that would have to be done to determine the worst case floor is 31. If using the by half method the minimun number drops needed to determine the worst case floor would be more than 31. (Subject F104, Sp.V3)

We also tagged data that implicitly used an example, such as:

The greatest number of drops needed in the worst-case scenario using the "by half" method is 129 drops, or in that neighborhood, and the greatest number of drops needed in the worst-case scenario using the "square root" method is 32, or in that neighborhood, which leads me to assert that the "square root" method is the more efficient method, at least by looking at the worst-case scenarios. (Subject F209, Sp.V3)

3.3.3 Use of intervals

We also noted that many of the explanations appealed to intervals. When a response identified the Square Root algorithm as best, the explanations often mentioned "more accurate" or "precise" intervals. When a response identified the By Half algorithm as best, the explanations often mentioned intervals that became smaller and could therefore pinpoint the floor more quickly. An example of the former is:

The "Square Root" solution is more efficient because it has smaller increments to get to the red ballon breaking point. So, when you have to go back with the blue balloon, you don't have to go as far as the other method and are therfore closer to the worst floor, requiring fewer drops to get there. (Subject E86, Sp.V3)

An example of the latter is:

In my opinion, I would say that the "by half" solution would be more sensible. The "by half" way covers a lot more floors in less tries. Breaking the remaining floors in half is very efficient because it cuts the number of possible floors which the solution might be found on by half every time a balloon is dropped, no matter how many floors are left. In total, you will only have to drop the balloon eight times, no matter which floor the balloon is on. (Subject C06, Fall)

4. RESULTS

4.1 Algorithm Selection

The number of students choosing the Square Root versus the By Half solution is shown in Table 2. Overall, 70.1% of students chose the Square Root solution (the better algorithm for most metrics) and 27.7% of students chose the By Half solution.


Algorithm chosen   Fall          Sp.V1         Sp.V2         Sp.V3         All versions
Square Root        89 (69.0%)    90 (87.4%)    71 (61.2%)    79 (65.3%)    329 (70.1%)
By Half            31 (24.0%)    12 (11.7%)    45 (38.8%)    42 (34.7%)    130 (27.7%)
Undecided           9  (7.0%)     0  (0%)       0  (0%)       0  (0%)        9  (2.9%)
No Answer           0  (0%)       1  (1.0%)     0  (0%)       0  (0%)        1  (0.2%)
Total students    129           103           116           121            469

Table 2: Number of students selecting each algorithm as best, given as a function of the version asked.

Students who were asked the first of the three Spring versions of the question were clearly more likely to choose the correct Square Root solution than were students asked other versions of the question. The effect of the version on the algorithm selected as better is significant (χ²(3, N = 459) = 22.84, p < .0001); the 10 students who were unable to select a better algorithm were not included.
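As a sanity check (not part of the original paper), this kind of test can be reproduced from the Table 2 counts with a few lines of Python; the snippet below assumes SciPy is available and, as in the text, excludes the 10 Undecided/No Answer students.

from scipy.stats import chi2_contingency

# Square Root vs. By Half counts per version, taken from Table 2
# (rows: Fall, Sp.V1, Sp.V2, Sp.V3).
counts = [[89, 31],
          [90, 12],
          [71, 45],
          [79, 42]]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}, N={sum(map(sum, counts))}) = {chi2:.2f}, p = {p:.2g}")
# Prints a statistic of about 22.84 on 3 degrees of freedom with p well
# below .0001, in line with the value reported above.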

4.2 Analysis types

By reading through their explanations, we determined the metric that each student used to identify the better algorithm. Table 3 shows the metrics used for "best", again as a function of the four versions of the question that were asked. Even though we explicitly asked students to use worst-case analysis in Sp.V2 and Sp.V3, most students used other metrics for their evaluation of the algorithms. Only about 30% of students used worst-case analysis overall, with not much difference in the Sp.V2 and Sp.V3 versions. This could be because they had difficulty understanding what was meant by "worst case," or because they had difficulty understanding the algorithms, or because they could not determine the floor on which the worst case occurred for the algorithms given.

More broadly, if we consider only whether the students used worst-case analysis alone or whether they used some other metric, students were most likely to use worst-case analysis in the fall and least likely to use it in Sp.V1. The difference in the percentages is significant (χ²(3, N = 469) = 26.43, p < .00001).

Tables 4 and 5 break down the metrics used as a function of whether the students chose the Square Root or the By Half method as better overall. Clearly the students who correctly identified Square Root as the better algorithm were also more likely to use worst-case analysis.

4.3 Use of examples

Table 6 shows the number of subjects who used examples even when they were not specifically required to do so, both as a function of the algorithm they chose as best and for all algorithms. Nothing is given for the Sp.V1 column because there the subjects were explicitly asked to calculate the exact number of drops needed to determine the lowest floor on which the balloons would break for three specific scenarios.

In general (Used example and any answer), students were far more likely to use an example in the fall than in the spring. The difference in the percentage of students who used an example as a function of the version of the question asked is significant (χ²(2, N = 366) = 47.3, p < 10^-10). The students in the fall who identified the Square Root algorithm as being better were also more likely to use an example in their explanation. The difference in percentages is significant (χ²(1, N = 120) = 18.20, p < .0001).

4.4 Use of Intervals

Instead of, or sometimes in addition to, justifying their answers through examples, some students appealed to logic about the intervals at which the balloons were dropped. These were answers that one might use in class to help students come to a solid understanding of the advantages of the Square Root algorithm. Table 7 shows the number of students who used particular interval arguments: more precise intervals to support Square Root, and larger intervals to support By Half. It does not consider the arguments that used smaller intervals to support By Half (2 students) or larger intervals to support Square Root.

4.5 Tracing the Algorithms

The Sp.V1 version asked students not only to identify the better algorithm, but also to explicitly calculate the exact number of balloon drops required for each algorithm under the assumption that the balloons first broke at floor 10 (and likewise for floors 65 and 130). Very few students answered the question correctly, which suggests they had trouble tracing algorithms. Some students had simple "off-by-1" errors, but 22 students showed a significant misunderstanding of the algorithm in that they counted the number of drops under the assumption that the second balloon was dropped by decreasing floor numbers instead of by increasing floor numbers (labelled "reverse" in Table 8). Of the 12 who chose By Half as the better solution, just one person demonstrated a clear understanding of the algorithms through their calculations. The other 11 gave numbers that seemed random (i.e., numbers which defied simple explanation, labelled "unclear" in Table 8). In fact, the one student who chose By Half as better and demonstrated a clear understanding of the algorithms stated in his or her explanation that the Square Root solution is the better solution (and most likely checked the wrong box).

Table 9 looks at whether the students' selection of the better algorithm was consistent with the results of their tracing. Note that the examples were selected so that the trace should have been inconclusive (i.e., Square Root did better on one, By Half did better on another, and they required the same number of drops on the third). Table 10 looks at whether the students' selection of the better algorithm was consistent with the results of their tracing in a worst-case sense: across the examples tried, does the largest number of drops happen with the By Half or the Square Root algorithm, and does the student identify the corresponding algorithm as better?
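For reference, the three Sp.V1 scenarios can be traced mechanically with the drop-counting helpers from the sketch in Section 3.1 (again under our assumed counting conventions, so the exact totals are illustrative rather than authoritative):

# Assumes drops_square_root and drops_by_half from the Section 3.1 sketch
# are in scope.
for b in (10, 65, 130):
    print(f"floor {b}: Square Root = {drops_square_root(b)}, "
          f"By Half = {drops_by_half(b)}")
# Floor 10 is a tie, Square Root needs fewer drops for floor 65, and By Half
# needs fewer for floor 130 -- which is why a correct trace alone could not
# settle which algorithm is better.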


Analysis type           Fall          Sp.V1         Sp.V2         Sp.V3         All versions
Worst case              54 (45.8%)    12 (12.6%)    34 (29.3%)    31 (25.6%)    131 (27.9%)
Better on most floors   14 (10.9%)     5  (4.9%)     0  (0.0%)     3  (2.5%)     22  (4.9%)
Average case             6  (4.7%)     6  (5.8%)     0  (0.0%)     0  (0.0%)     12  (2.6%)
Other                   27 (20.9%)    25 (24.3%)    45 (38.8%)    45 (37.2%)    142 (30.3%)
Always better            1  (0.8%)     8  (7.8%)     6  (5.2%)     4  (3.3%)     19  (4.1%)
Unclear                 17 (13.2%)    20 (19.4%)    21 (18.1%)    20 (16.5%)     78 (16.6%)
No explanation           0  (0.0%)    24 (23.3%)    10  (8.6%)    18 (14.9%)     52 (11.1%)
Two or more metrics     10  (7.8%)     3  (2.9%)     0  (0.0%)     0  (0.0%)     13  (2.8%)
Total students         129           103           116           121            469

Table 3: Number of students using each analysis type, by question type. The fall and Sp.V1 questions asked the student to identify the better algorithm; the Sp.V2 and Sp.V3 questions specifically asked students to use worst case analysis.

Analysis type           Fall          Sp.V1         Sp.V2         Sp.V3         All versions
Worst case              50 (56.2%)    11 (12.2%)    32 (45.1%)    29 (36.7%)    122 (37.1%)
Better on most floors   10 (11.2%)     5  (5.6%)     0  (0.0%)     2  (2.5%)     17  (5.2%)
Average case             5  (3.2%)     6  (6.7%)     0  (0.0%)     0  (0.0%)     11  (3.3%)
Other metric            11 (12.4%)    22 (24.4%)    23 (32.4%)    27 (34.2%)     83 (25.2%)
Always better            0  (0.0%)     8  (8.9%)     3  (4.2%)     3  (3.8%)     14  (4.25%)
Unclear                  5  (5.6%)    17 (18.9%)     8 (11.3%)    10 (12.7%)     40 (12.2%)
No explanation           0  (0.0%)    18 (20.0%)     5  (7.0%)     8 (10.1%)     31  (9.4%)
Two or more metrics      8  (9.0%)     3  (3.3%)     0  (0.0%)     0  (0.0%)     11  (3.3%)
Total students          89            90            71            79            329

Table 4: Number of students who used each metric type, for the set of students who chose Square Root as the better algorithm.

Analysis type           Fall          Sp.V1         Sp.V2         Sp.V3         All versions
Worst case               4 (12.9%)     1  (8.3%)     2  (4.4%)     2  (4.8%)      9  (6.9%)
Better on most floors    1  (3.2%)     0  (0.0%)     0  (0.0%)     1  (2.4%)      2  (1.5%)
Average case             1  (3.2%)     0  (0.0%)     0  (0.0%)     0  (0.0%)      1  (0.8%)
Other                   14 (45.2%)     3 (25.0%)    22 (48.9%)    18 (42.9%)     57 (43.8%)
Always better            1  (3.2%)     0  (0.0%)     3  (6.7%)     1  (2.4%)      5  (3.8%)
Unclear                  9 (29.0%)     3 (25.0%)    13 (28.9%)    10 (23.8%)     35 (26.9%)
No explanation           0  (0.0%)     5 (41.7%)     5 (11.1%)    10 (23.8%)     20 (15.3%)
Two or more metrics      1  (3.2%)     0  (0.0%)     0  (0.0%)     0  (0.0%)      1  (0.8%)
Total students          31            12            45            42            130

Table 5: Number of students who used each metric type, for the set of students who chose By Half as the better algorithm.

                                      Fall             Sp.V1   Sp.V2            Sp.V3            All versions
Used example and Square Root better   73/89 (82.0%)    –       32/71 (45.1%)    33/79 (41.8%)    138/239 (57.7%)
Used example and By Half better       13/31 (42.9%)    –       10/45 (22.2%)     7/42 (16.7%)     30/118 (25.4%)
Used example and Undecided             7/9  (77.8%)    –       –                –                  7/9  (77.8%)
Used example and any answer           93/129 (72.1%)   –       42/116 (36.2%)   40/121 (33%)     175/366 (47.8%)

Table 6: Number of subjects whose explanations included evaluating an algorithm on an example, broken down by the algorithm they selected as best.


Square Root is better because the intervals are more precise, smaller, limited:
  Fall 41/89 (46.1%), Sp.V1 37/90 (41.4%), Sp.V2 33/71 (46.5%), Sp.V3 38/79 (48.1%), all versions 149/329 (45.3%)

By Half is better because it eliminates more floors all at once (has a larger interval):
  Fall 17/31 (54.8%), Sp.V1 1/12 (8.3%), Sp.V2 22/45 (48.9%), Sp.V3 16/42 (38.1%), all versions 56/130 (43.1%)

Table 7: Number of students choosing interval arguments specific to each answer type.

                      Understands   Reverse   Unclear
Square Root better    33            22        35
By Half better         1             0        11

Table 8: Number of students whose tracing results in the Sp.V1 version indicated an understanding of the algorithm, a backwards understanding, or neither. The data are broken down by the algorithm the student specified as better.

                      Traces find Square Root better   Traces find By Half better   Trace inconclusive
Chose Square Root     28                               21                           41
Chose By Half          3                                6                            3

Table 9: Number of students whose tracing results in the Sp.V1 version were consistent with their choice of the better algorithm. Note that the correct answer should have been that the trace was inconclusive, and that the Square Root algorithm is better.

                      Traces find Square Root better   Traces find By Half better   Trace inconclusive
Chose Square Root     65                               20                           5
Chose By Half          5                                5                           2

Table 10: Number of students whose tracing results in the Sp.V1 version were consistent with their choice of the better algorithm, in the sense that the largest number of drops in the three scenarios occurred with the algorithm that they considered worse.

4.6 Quality of argument

As noted above in Section 4.1, 70.1% of the students correctly chose the Square Root solution. In addition to whether the students got the right answer, we were also interested in how well they could argue in support of their conclusions. We restricted our analysis to the fall data, since the spring responses were so short that they provided much less insight into the students' thought processes.

In analyzing these arguments, we took the students' answers – By Half or Square Root – and the metrics as our starting points. (For the list of metrics, see Table 3.) For example, if the student concluded that Square Root was the right answer using the metric "best on most floors", we wanted to know how well he or she justified that conclusion. We excluded the fall responses whose metric was U ("unclear"), since we couldn't evaluate an argument without knowing what the student was trying to prove. That left us with 112 responses.

Our analysis was influenced by Marrades and Gutierrez's work on student proofs in geometry [11]. Their focus on whether the students chose a deductive or an empirical strategy was not relevant in our context, as 84% of our answers (94 of 112) used a combination of examples and deductive reasoning. Their classification of the ways in which students can use examples, however, was the basis for our analysis. We categorized the degree to which students' examples supported their arguments according to the following scale, based on the categories suggested by Marrades and Gutierrez:

Crucial – they chose all or some of the key examples for what they were trying to prove.

Partial – they chose examples that were relevant, but not the key examples.

No Support – they chose examples that were irrelevant or actually undermined their case.

None – no examples were included.

A key example supporting the argument that Square Root is better in the worst case, for example, would be the 128th floor, since that is the worst case for the By Half algorithm (and considerably worse than any of the results for Square Root). A partial example would be floor 33, since Square Root is better for that floor. An example that provides no support would be floor 14, since the number of drops to determine that the balloon breaks on the 14th floor is the same for both Square Root and By Half.

Surprisingly, Marrades and Gutierrez do not suggest a scale for evaluating the deductive portion of arguments. So we developed the following set of categories based on the categories they used for examples:

Convincing – an argument that logically supports their conclusion, given their interpretation of the question.

Partial – an argument that provides some support for their conclusion.

No Support – an argument that provides no support for their conclusion.

No Argument Made – no deductive argument was included, or an argument that simply restated part of the question, such as "It is more efficient."

The students did surprisingly well. Of the 112 answers we analyzed, 84% used examples, and all of those examples provided at least partial support for their conclusions. Almost two-thirds (63%) of the total answers included key examples for what they were trying to prove. The overwhelming majority of students (107/112, or 96%) made arguments that provided at least partial support for their conclusions. More than a third (38%) made arguments that we considered to be convincing.

The students who chose Square Root did better than those who chose By Half. Of the arguments in support of Square Root, 45% were convincing (38 of 84) and 99% (83 of 84) provided at least partial support for their conclusions. The students who chose By Half, however, still demonstrated the ability to marshal at least some arguments in favor of their position. Of their arguments, only 9% (2 out of 23) were convincing, but 83% (19 of 23) provided at least partial support. Of the five students who did not choose an answer, two gave convincing arguments and three gave partial support for their lack of a conclusion.

4.7 Length of responses

The four different versions of the question showed large differences in the length of student responses. In the fall, all students submitted responses on paper or electronically, either as a homework assignment or during a closed lab session. Students were given the water balloon question on paper or as a pdf document. In the fall, the average number of words in student responses was 133.5. In the spring versions, questions were provided and answers submitted using web forms from a web survey site. The average word counts for the three spring versions were 28.5, 50.3, and 47.2, respectively.

Comparing the word counts using statistical tests confirmed the observed differences. Given the non-normal distribution of the word counts, we used nonparametric tests based on ranks [12]. Running a Kruskal-Wallis test comparing all four groups showed significant differences (H(3, N = 469) = 209.4, p < 10^-44). Running subsequent tests a posteriori, we found significant differences when comparing Fall, Sp.V1, and Sp.V2 (H(2, N = 366) = 153.5, p < 10^-33), and when comparing the three spring groups (H(2, N = 340) = 30.5, p < 10^-6). Comparing Sp.V2 and Sp.V3 showed no significant difference (using a Mann-Whitney test, since there were two groups): U = 6543, N = 237, p = 0.184. All of these tests assume α = 0.05 for significance testing.

The differences in the lengths of explanations could result from one or more of the following factors: fall student cohort versus spring student cohort, questions presented on paper (or pdf documents) versus questions presented as a survey, answers submitted as uploaded files versus answers submitted in a web form, and differing institutions. Data were collected from several institutions during both semesters, but just one institution was common between fall and spring. Even looking at the word counts for the common institution, there is a difference in explanation lengths: 120.6 in fall and 49.8, 70.3, and 50.5 in the spring versions (significant difference using a Kruskal-Wallis test for all four versions, H(3, N = 69) = 26.41, p < 0.00001; no significant difference among the three spring groups, H(2, N = 41) = 2.88, p = 0.236). More targeted studies are necessary to determine the factors influencing the response lengths to the questions.
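For readers who want to run the same kind of comparison on their own response data, the sketch below shows how such rank-based tests can be computed with SciPy. The word counts here are placeholder values (the per-response counts from the study are not reproduced in the paper), so the printed statistics are purely illustrative.

from scipy.stats import kruskal, mannwhitneyu

# Placeholder word counts per version -- substitute the real per-response
# counts to reproduce the analysis in Section 4.7.
word_counts = {
    "Fall":  [180, 95, 210, 130, 88, 152],
    "Sp.V1": [30, 25, 40, 22, 35, 19],
    "Sp.V2": [55, 48, 60, 41, 52, 44],
    "Sp.V3": [50, 39, 52, 47, 36, 58],
}

# Kruskal-Wallis compares all four versions using ranks, so the skewed
# distribution of word counts is not a problem.
H, p = kruskal(*word_counts.values())
print(f"Kruskal-Wallis: H = {H:.1f}, p = {p:.3g}")

# A pairwise follow-up (e.g., Sp.V2 vs. Sp.V3) uses a Mann-Whitney test.
U, p = mannwhitneyu(word_counts["Sp.V2"], word_counts["Sp.V3"],
                    alternative="two-sided")
print(f"Mann-Whitney: U = {U:.0f}, p = {p:.3f}")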

5. DISCUSSION

Even when the question explicitly asked students to use worst-case analysis in evaluating the two algorithms, Table 3 shows that most of the students used some other metric. This suggests that the computer science concept of worst case may not be the same as students' preconceived notion of worst case; alternatively, the concept of worst case may be particularly confusing or difficult to explain.

As shown in the Fall, Sp.V2, and Sp.V3 data in Table 2, about two-thirds of the students given those versions chose the Square Root algorithm as being the better algorithm. This is better than random, but not by as much as one might hope. However, requiring students to work examples as in the Sp.V1 case, even if their answers on those examples were not always correct, resulted in a much larger percentage of students selecting the Square Root algorithm as being better. This was true despite the fact that the metric they used to come to their answer was often unclear (as shown in Table 3). Table 10 does suggest, however, that many of them may have calculated the number of drops needed on each of the three scenarios, determined that the By Half algorithm gave the largest number of drops across those three examples, and as a result chose the Square Root as the better algorithm. While this perhaps does not show a real understanding of the algorithms, in this case it does lead to the correct answer.

In the three other versions, where students were not explicitly asked to trace the algorithms on specific examples, students who referred to an example in their explanation were significantly more likely to choose the Square Root algorithm as better, even if they did not use a worst-case metric. While some of these correct answers may be attributable to good luck (similar to what we saw in the Sp.V1 data), it also suggests that the act of working an example is an important step in finding the correct answer.

5.1 Wording of the question

In coding the data it became clear that the wording of the question was problematic for some students. Some students found fault with the algorithms themselves, and others had issues separating their analysis from their prior real-world experiences. In addition, there were students who either misunderstood the problem or who changed it for an unspecified reason.


5.1.1 Addressing the algorithm

Ten students correctly observed a lack of precision in the algorithms. In both algorithms, when the red balloon breaks, the procedure is to "begin dropping the blue balloon from the highest floor from which the red balloon did not break . . . " If the red balloon breaks on its first drop, however, there is no "floor from which it did not break."

can't find another floor if first balloon breaks (Subject G84, Sp.V3)

However, neither method gives you a process for going back down in floors if one of the balloons does break, so it is difficult to say conclusively which would be more efficient. (Subject B016, Fall)

Four other students made a "Zeno's Paradox" argument regarding the math in the By Half solution, arguing that the floors tested, in order, would be 128, 192, 224, 240, 248, 252, 254, 255, 255.5, 255.75, . . .

[T]he "by half" solution would be better until you get to the 255th floor. Once the 255th floor is reached there is an infinite amount of balloon drops before you can move up to the 256th floor. (Subject B007, Fall)

5.1.2 Real-world knowledge

Some students (15 of the 469 total) applied real-world knowledge of the problem. Fourteen cited pre-existing knowledge that water balloons will break when dropped from a relatively low floor; one referred to terminal velocity:

Most likely the balloon with [will] reach terminal velocity by the 15 floor or such making the "By Half Method" completely overzealous . . . Therefore I would chose [choose] the "Square Root" method . . . (Subject E13, Fall)

Another student made an argument based on the psychology of the client, inferring that the tests wouldn't be done on a 256-story building without some reason:

However, as . . . we are given 256 floors to test the balloons on, it is reasonable to assume that the balloon rubber used is reasonably strong . . . (Subject E20, Fall)

5.1.3 Changing the problem

Finally, twenty-three students analyzed modified versions of the algorithms. This subject did the analysis using just one balloon:

The "By Half" solution to the program is more efficient because the maximum number of drops would be 9 if the balloon had not broken at all compared to the 16 drops of the "Square Root" solution. . . . (Subject A110, Fall)

Other students, including the following, changed the problem statement to allow an unlimited number of balloons:

In my opinion, the better more efficient solution would be the "by half" solution. . . . Let's say that the balloon breaks on the 132nd floor. We would start off on the 128th floor and it wouldn't break so we would know that the floor it would break on would be higher. So then, we try the 192nd floor and it would break. So then we would drop it from the 160th floor and it would break and also the 144th floor and it would break as well as the 136th floor. We would do the step one more time and find out that the balloon would break on the 132nd floor. Now we know that it breaks on any floor between 128th floor and 132nd floor. We would then try the 130th floor and it would not break and then the 131st floor and it would not break. We would now know that it breaks on the 132nd floor. Required attempts: 8 (Subject A102, Fall)

5.2 Implications for Teaching

5.2.1 The meaning of efficiency

The wide range of answers regarding the notion of efficiency suggests that asking what it means to be efficient would be an excellent starting point for discussion. Bringing out the many preconceptions regarding the term allows the instructor to explain the computer science definition, why we use it, and how it differs from the preconceptions.

5.2.2 Worst-case analysis is not intuitive

As shown in Table 3, even when worst case was defined for them, relatively few students used it. A few students stated their assumption that the worst case is when the balloon first breaks at the top floor.

5.2.3 Concrete examples are good

Even though very few of the students displayed a solid understanding of the two algorithms in their analyses of the concrete examples in Sp.V1 (Table 8), the students who were asked that version were overwhelmingly more likely to correctly identify the Square Root algorithm as better. It may be that doing examples is a way to force students to think rather than applying intuition. At the same time, few students did all of the examples correctly, suggesting we must deal with the reality that tracing an algorithm on concrete data is not a skill students arrive with. Instructors in early courses should stress this skill development.

5.2.4 Abstraction is hard

To solve this problem correctly, students needed to decide which details of the problem were essential and which could be ignored – they needed to abstract, both from the problem itself and from their real-world knowledge of water balloons. It is clear to us that the fact that there are exactly two water balloons is essential, while the fact that water balloons usually break when dropped from the second or third floor of a building is not. But the students need guidance in making these choices.

We chose to make Ginat's question more concrete by providing a specific number of floors and by using water balloons, which appeared to us to be more realistic to drop than glass balls. While providing students an easier way to conceptualize the problem, this introduced additional issues as they interpreted the problem statement through a more realistic lens. This suggests that, particularly when one provides problems with a context, a discussion is needed to refine understanding of the question. Discussion is needed regarding the aspects of reality we are ignoring (the height of 256 floors), assumptions we make to begin (what do we do with fractional values?), and simplifications from reality. See Section 5.1.2 for specific examples.


5.2.5 The medium may affect the message

We noticed a significant difference in the responses between the fall and spring data collections. One possible factor is that the spring data were collected using a web-based survey site. Students may have thought that the exercises would be quick, or the exercises may have seemed less like real academic work, and the students gave correspondingly less effort. Another factor is that students at some institutions in the spring were given the questions as mandatory assignments, while others responded voluntarily.

6. CONCLUSIONS AND FUTURE WORK

The ability to analyze algorithms is an essential skill for a computer scientist. This study found that, given a challenging problem and a choice between two algorithms, 70% of the students asked were able to correctly choose the more efficient algorithm. However, an exploration of their justifications for doing so demonstrates that, for many students, their notion of efficiency is not the same as the worst-case notion typically used in computer science.

Even allowing for the students' different definitions of efficiency, we see a range in their ability to construct arguments justifying their choice of the more efficient algorithm. Students who used examples were more likely to choose the correct answer than those who did not. In fact, we found that when we scaffolded the problem with examples, 87% of students correctly answered the question, as opposed to the 65% who answered correctly when examples were not required. This was true despite the fact that those students frequently computed the examples incorrectly. The results among those who abstracted from individual floors to intervals were not so striking – there was little or no difference in results among those who used intervals – but there were two different interval arguments that led to different conclusions.

We observe that students generally can do some form of algorithm analysis, at least at the level of choosing between two alternatives. They can justify their choice, but may be unable to use worst-case analysis to do so. They can work from concrete examples, but may struggle with understanding algorithm descriptions. Our efforts to "make it real" by casting problems in real-world scenarios (such as water balloons) can distract students from separating the algorithm analysis from the "artificial" context.

These results suggest a number of directions for future work. First, while we have rich data in the form of student responses, we could not ask follow-up questions to better understand their ideas. Acquiring complementary data through interviews with students could give researchers and educators more insight regarding students' ability to understand algorithms, trace algorithms, use metrics, and choose the best algorithm. Second, given that most students did not use the worst-case metric we had in mind, it would be useful to ask more targeted questions to understand how students define worst case. For example, some responses indicated that students defined worst case as the top floor for both algorithms; it would be interesting to find out what they were thinking. Finally, since asking students to work out examples led to a higher percentage of students correctly identifying the more efficient algorithm, it would be useful to study how tracing algorithms is related to students' ability to understand and to evaluate algorithms.

Acknowledgments

Thanks to our students for their insight and data. This work was supported in part by the National Science Foundation under grants DUE-0736343, DUE-0736700, DUE-0736572, DUE-0736738, DUE-0736859, and DUE-0736958. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation.

7. REFERENCES

[1] M. Ben-Ari. Constructivism in computer science education. Journal of Computers in Mathematics and Science Teaching, 20(1):45–73, 2001.
[2] Y. Ben-David Kolikant. Gardeners and cinema tickets: High school students' preconceptions of concurrency. Computer Science Education, 11(3):221–245, 2001.
[3] J. Bonar and E. Soloway. Preprogramming knowledge: A major source of misconceptions in novice programmers. In E. Soloway and J. Spohrer, editors, Studying the Novice Programmer. Lawrence Erlbaum Associates, Hillsdale, NJ, 1989.
[4] J. D. Bransford, A. L. Brown, and R. R. Cocking, editors. How People Learn: Brain, Mind, Experience, and School. National Academy Press, Washington, DC, expanded edition, 2000.
[5] M. Clancy. Misconceptions and attitudes that interfere with learning to program. In S. Fincher and M. Petre, editors, Computer Science Education Research. Taylor and Francis Group, London, 2004.
[6] Committee on Undergraduate Science Education. Science Teaching Reconsidered: A Handbook. National Academy Press, Washington, DC, 1997.
[7] J. Gal-Ezer and E. Zur. The efficiency of algorithms: misconceptions. Computers and Education, 42(3):215–226, 2004.
[8] J. P. Gibson and J. O'Kelly. Software engineering as a model of understanding for learning and problem solving. In Proc. of the 2005 International Workshop on Computing Education Research (ICER '05), pages 87–97, 2005.
[9] D. Ginat. Efficiency of algorithms for programming beginners. In SIGCSE '96: Proceedings of the Twenty-Seventh SIGCSE Technical Symposium on Computer Science Education, pages 256–260, New York, NY, USA, 1996. ACM.
[10] G. Lewandowski, D. Bouvier, R. McCartney, K. Sanders, and B. Simon. Commonsense computing (episode 3): Concurrency and concert tickets. In Proc. of the 2007 International Workshop on Computing Education Research (ICER '07), pages 133–144, Atlanta, GA, 2007. ACM Press.
[11] R. Marrades and A. Gutiérrez. Proofs produced by secondary school students learning geometry in a dynamic computer environment. Educational Studies in Mathematics, 44:87–125, 2000.
[12] J. H. McDonald. Handbook of Biological Statistics (online). Sparky House, Baltimore, 2008. Accessed at http://udel.edu/~mcdonald/statintro.html, March 22, 2009.
[13] L. Miller. Natural language programming: Styles, strategies, and contrasts. IBM Systems Journal, 20(2):184–215, 1981.
[14] L. Onorato and R. Schvaneveldt. Programmer/nonprogrammer differences in specifying procedures to people and computers. In E. Soloway and S. Iyengar, editors, Empirical Studies of Programmers, pages 128–137, 1986.
[15] B. Simon, D. Bouvier, T.-Y. Chen, G. Lewandowski, R. McCartney, and K. Sanders. Commonsense computing (episode 4): Debugging. Computer Science Education, 18(2):117–133, 2008.
[16] B. Simon, T.-Y. Chen, G. Lewandowski, R. McCartney, and K. Sanders. Commonsense computing: What students know before we teach (episode 1: Sorting). In Proc. of the 2006 International Workshop on Computing Education Research (ICER '06), pages 29–40, 2006.
[17] J. Smith, A. diSessa, and J. Roschelle. Misconceptions reconceived: A constructivist analysis of knowledge in transition. Journal of the Learning Sciences, 3(2):115–163, 1993.


Appendix: Questions used

We used four versions of the question, one in the fall and three in the spring. Each included the following description of the problem and the two solutions (though the order of the solutions was reversed in some of the fall questions, as discussed above):

Waterballoons Inc. has asked you to test the strength of a new ["fabric" in the fall, "material" in the spring] for water balloons. You are given some water balloons, sent to a tower with 256 floors, and asked to determine the highest floor from which the balloons can be dropped without breaking. In other words, there is some floor such that any water balloon dropped from that floor or any lower floor will not break, but any water balloon dropped from a higher floor will break. Fortunately, you're allowed to break the balloons you're given in order to determine this floor. In addition, the balloons do not weaken if they are dropped and do not break.

The following are two ways to determine the strength of the fabric if you are given 2 equivalent balloons, one red and one blue.

"Square Root" Solution: Drop the red balloon from the 16th floor of the tower. If it does not break, drop it from the 32nd, then 48th, and so on, increasing the floor number by the square root of the total number of floors in the tower each time. When the red balloon breaks, you begin dropping the blue balloon from the highest floor from which the red balloon did not break, working your way to the floor on which the red balloon broke.

"By Half" Solution: Drop the red balloon from the 128th floor of the tower. If it does not break, drop it from the 192nd floor, and then 224th, etc., increasing the floor number by half of the remaining floors of the tower each time. When the red balloon breaks, you begin dropping the blue balloon from the highest floor from which the red balloon did not break, working your way to the floor on which the red balloon broke.

The four versions differed in the instructions they gave for selecting the better algorithm.

Fall: Which is more efficient, in the sense that it requires the fewest balloon drops? Explain your answer in complete English sentences.

Spring V1: A. How many drops do each of the methods require if the balloons break at the floor indicated below?
• Square Root - Floor 10:
• By Half - Floor 10:
• Square Root - Floor 130:
• By Half - Floor 130:
• Square Root - Floor 65:
• By Half - Floor 65:
B. If you had to declare one method the best one, which would you choose and why? Please explain your answer.

Spring V2: Which method is more efficient, in the sense that it requires the fewest balloon drops in the worst case? Explain your answer.

Spring V3: For each of the two methods, there is some worst floor in the sense that it would require the most number of drops to determine that the balloons break at that floor. Which is more efficient, in the sense that it requires the fewest balloon drops in the worst case? Explain your answer.
