Multistrategy Discovery and Detection of Novice Programmer Errors

Machine Learning, 38, 157–180, 2000.
© 2000 Kluwer Academic Publishers. Printed in The Netherlands.

RAYMUND C. SISON
Department of Computer Science, De La Salle University, Manila, Philippines
[email protected]

MASAYUKI NUMAO
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
[email protected]

MASAMICHI SHIMURA
Department of Industrial Administration, Science University of Tokyo, Chiba, Japan
[email protected]

Editors: Floriana Esposito, Ryszard Michalski, & Lorenza Saitta

Abstract. Detecting and diagnosing errors in novice behavior is an important student modeling task. In this paper, we describe MEDD, an unsupervised incremental multistrategy system for the discovery of classes of errors from, and their detection in, novice programs. Experimental results show that MEDD can effectively detect and discover misconceptions and other knowledge-level errors that underlie novice Prolog programs, even when multiple errors are enmeshed together in a single program, and when the programs are presented to MEDD in a different order.

Keywords: multistrategy learning, unsupervised learning, conceptual clustering, student modeling

1. Introduction

An intelligent tutoring system (ITS) is a computer program for educational support that can diagnose problems of individual learners. This diagnostic capability enables it to adapt instruction or remediation to the needs of individuals. The diagnostic and adaptive capabilities of an ITS are due primarily to its use of a student model (McCalla, 1992), which is an approximate, qualitative representation of student knowledge about a particular domain that fully or partially accounts for student behavior.

Student modeling, i.e., the construction of a student model, is a complex task that involves the examination of student behavior in light of a system's background knowledge. Constructing the background knowledge of a student modeling system entails not only analyzing the facts, rules, and schemas of a domain (called the theory of that domain), but, more critically, analyzing and categorizing the erroneous solutions of a population of students in that domain, and encoding this knowledge of novice errors in a bug library. Constructing or extending the background knowledge, particularly the bug library, is laborious, time-consuming, and requires a certain amount of knowledge of the domain. The combined problems of automatically constructing the bug library, as well as the student model, make student modeling a very difficult task.


Machine learning, which deals with the automatic induction of new knowledge or compilation of existing knowledge, can be used to expedite, if not automate, the task of bug library extension or construction, as is done in PIXIE (Sleeman et al., 1990) and ASSERT (Baffes & Mooney, 1996). The domains of these systems are sufficiently understood so that their domain theories are completely specifiable, and the tasks of the students they model are therefore readily automatable.1 Our work deals with automatic bug library and student model construction in a richer domain, namely, computer programming, in which problem solving operators are not as readily and completely specifiable. How, then, can student models as well as bug libraries in a programming domain be automatically constructed? More specifically, how can the discovery and subsequent detection of novice programmer errors be automated?

Before we answer the question, we first clarify what we mean by novice errors. Errors can be analyzed at various levels (Sison & Shimura, 1998). For example, behavior-level analysis, in a programming domain, looks at which portions of a student's program are erroneous. Knowledge-level analysis, on the other hand, looks at which behavior-level errors are related, as these relationships might indicate: (a) knowledge errors, i.e., errors in the student's knowledge about the domain (e.g., misconceptions, i.e., incorrect knowledge; insufficient knowledge), or (b) slips due, for example, to fatigue, boredom or depressed affect. These knowledge-level errors can, in turn, account for the errors that occur at the behavior level. Identifying the knowledge-level errors (if any) underlying a novice program can be viewed as a classification task:

Given:
- novice program
- definitions of knowledge-level errors

Find:
- classification of novice program with respect to the knowledge-level error(s) it manifests, if any

Note, however, that our classification task is further complicated by several factors:

1. A complete and universal set of knowledge-level errors that can be manifested in incorrect programs is difficult, if not impossible, to have;
2. Several knowledge-level errors can underlie the behavior-level errors exhibited by a single program;
3. A program that looks unfamiliar may be correct; that is, there can be more than one correct approach to solving a programming problem; and
4. There can be many variants, correct as well as incorrect, per approach.

The first factor indicates the need for unsupervised learning, i.e., learning in which classes (knowledge-level errors) are not known beforehand. The second factor implies that multiple classifications of a single object (incorrect program) should be allowed. The third and fourth factors characterize the variability inherent in programming and signify the need for intention-based diagnosis (Johnson, 1990), i.e., the diagnosis of program errors with respect to a program's intention or intended approach. Finally, these factors collectively
suggest a two-stage classification. Specifically, the last two factors suggest classification of a novice program with respect to its intention, and the first two factors suggest further classification of an incorrect program with respect to the knowledge error(s) that it exhibits.

All these are accomplished by a learning student modeling system called MEDD (Multistrategy Error Detection and Discovery). The incremental nature of MEDD (pronounced |med|) enables it to construct and extend a bug library (MEDD's main learning task: error discovery) at the same time that it infers student models (MEDD's main performance task: error detection). The difficulty of MEDD's general learning task—the unsupervised learning of classes of correct and incorrect logic programs—necessitates the use of multiple strategies. To learn classes of correct programs, where each class represents a different intention, MEDD uses a technique that is similar to that of exemplar-based learners such as PROTOS (Bareiss, Porter, & Wier, 1987) in that similarity between exemplars is determined by matching surface as well as abstract features, and no intensional definitions of concepts are stored; instead, concepts (intentions) in MEDD are defined by specific exemplars (programs). On the other hand, to learn classes of incorrect programs, or more specifically, to learn classes of errors that incorrect programs exhibit, where each class represents a different knowledge-level error, MEDD uses an incremental relational clustering technique that considers causal relationships in the background knowledge as well as regularities in the data. Discrepancies between the incorrect novice programs and their associated intentions (rather than the incorrect programs themselves) are the ones that are clustered by MEDD because of the need to characterize knowledge-level errors and detect multiple knowledge errors in a single program.

In the rest of this paper, we first provide an overview of MEDD, after which we provide further detail about the two modules that MEDD uses to carry out its two-stage learning and classification tasks. We then report and discuss experimental results that show the viability and potential of the approach. Finally, we describe related work and give our concluding remarks.

2. MEDD: an overview

MEDD is a learning student modeling system whose target and implementation language is Prolog. In this section, we provide an overview of MEDD in terms of its inputs, background knowledge, outputs, and the two modules it uses to accomplish its tasks, namely IDD/LP and MMD. We also identify ways of putting MEDD’s output to practical use.

2.1. Inputs

The primary inputs to MEDD are Prolog programs written by students learning the language. Student programs would typically include reverse/2 programs (where 2 is the number of arguments of a predicate named reverse) for reversing the order of a list of elements, and sumlist/2 programs for summing the elements of a list of integers.
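For reference, the paper does not reproduce a sumlist/2 reference program; the standard recursive formulation below is our assumption of what such a reference program looks like:

% a correct sumlist/2 program (a standard formulation; our assumption,
% not taken from the paper)
sumlist([],0).
sumlist([H|T],Sum) :-
    sumlist(T,Sum1),
    Sum is Sum1 + H.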


An example of a buggy student program for reversing a list is shown below, whose errors we discuss shortly:

% buggy program 1
reverse([],[]).
reverse([H|T],[T1|H]) :-
    reverse(T,T1).
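Running the buggy program on a small list makes the behavior-level symptom visible (the query and output below are ours, not from the paper): the [|] operator nests the partial results instead of concatenating them.

% illustrative query against buggy program 1
?- reverse([a,b,c], R).
R = [[[[]|c]|b]|a].       % an ill-formed, nested structure instead of [c,b,a]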

2.2. Background knowledge

The main components of the background knowledge of MEDD are sets of reference programs, and associated error hierarchies. We describe these components with reference to the reverse/2 problem.

There are at least four kinds of solutions to the problem of list reversal in Prolog, namely, the naive reverse, accumulator, inverse naive, and railway shunt algorithms. The naive reverse solution shown below is the most intuitive of the four, as it is the logical equivalent of the recursive formulation of the reverse task in any language (Sterling & Shapiro, 1986).

% a correct program (naive reverse)
reverse([],[]).
reverse([H|T],R) :-
    reverse(T,T1),
    append(T1,[H],R).

append([],L,L).
append([H|T],L,[H|T1]) :-
    append(T,L,T1).

In declarative terms, the correct program above states that the reverse of a list [H|T], whose first element is H and whose remaining elements form a list called T, is the concatenation of the reverse of T and the single-element list [H]. Procedurally, this means that to get the reverse of a list [H|T], one must accomplish two subgoals. First, one must get the reverse of T and call this T1. Then one has to append T1 and [H] to form one list, R, which is the reversed list. Of course, if the given list is empty, then its reverse is just the empty list.

We call the correct program above, as well as the programs for the other reverse/2 algorithms, reference programs for reverse/2. The reference program above has two predicates, namely, reverse/2 and append/3. We call reverse/2 the main predicate. The predicate that it calls other than itself, i.e., append/3, is called an auxiliary predicate.

Reference programs embody the intention of a programmer. An intention can be expressed as a partially ordered set of goals and plans (where plans are implementations of goals, and can either be subgoals or code) for solving problems (Johnson, 1990). Since, in Prolog, the reference program and intention are one, this paper treats the two terms synonymously. In MEDD, reference programs are stored in frames, as illustrated in figure 1.

Figure 1. Representation of problems and reference programs.

Though not shown in the figure, reference program frames can also include abstract properties such as:

• relations between clauses and literals (e.g., ordering constraints), used when searching for the reference program that is least dissimilar to the novice program;
• relations between components of a clause (e.g., causal relationships), used when clustering discrepancies between student and reference programs;
• test cases, used when determining whether discrepancies are indeed behavior-level errors;
• correct stylistic variants of the reference program, and a counter for student programs that perfectly match the reference program or its variants; and
• natural language snippets explaining the function and type of each component of the reference program, which can be used when remediating students.

While MEDD can learn reference programs, it is preferred that as many reference programs per problem be provided to MEDD as possible. We explain this in Section 3.2.

The other main components of the background knowledge of MEDD are OR trees called error hierarchies. Each subtree under the root node of an error hierarchy corresponds to a class of errors committed by students, and each such class denotes an error at the knowledge level, of which the subtree forms an intensional definition. Each reference program has one error hierarchy associated with it. We will see examples of error hierarchies for naive reverse in Section 4.

2.3. Outputs

The primary outputs of MEDD are the intentions and knowledge-level errors, if any, underlying the input programs. For example, MEDD would determine: (1) the naive reverse solution as the intention underlying buggy program 1 above, and (2) the invalid append error below as the knowledge-level error that the buggy program manifests:


invalid_append :-
    remove(subgoal2),
    (   replace(head,R,[X?|H])
    ;   replace(subgoal1,T1,[X?|H])
    ).

The operational definition of the invalid append error above states that this error is identified by the existence of two discrepancies: (1) the omission of the append/3 subgoal (of the correct reverse/2 clause), and (2) the replacement of an output variable in either the head or the first subgoal (of the correct reverse/2 clause) with a list which has a tail H and which is formed using the [|] operator. This can, in turn, be interpreted as characterizing a misconception in which the append/3 predicate and the [|] operator are thought to be functionally the same, at least as far as concatenating two lists is concerned. This is in fact the misconception that underlies the buggy program given earlier.

MEDD's outputs can be used in several ways. For example, the intensional definition of a knowledge-level error detected in a student program can be used to provide graded hints and explanations to the student, select or construct (using case-based techniques) similar buggy programs for the student to analyze, or select similar problems (again using case-based techniques) for the student to solve. MEDD's error hierarchy can also be used to differentiate severe and common errors from minor or infrequently occurring errors and slips. This knowledge can be used to direct the course of remediation, or to design a course in such a way that severe and common errors are minimized if not avoided. Further details regarding the use of MEDD's output are found in (Sison, 1998).

2.4. IDD and MMD

To accomplish its learning and classification tasks, MEDD uses two modules, namely, IDD/LP (Intention and Discrepancy Detection for Logic Programs), or IDD for short, and MMD (Multistrategy Misconception Discovery).2 IDD determines the intention underlying a student program, computes the discrepancies between the student's actual and intended programs, and then checks whether these discrepancies are associated with known knowledge-level errors or not. If not, it tests the student program to help confirm whether the syntactic discrepancies between the student and reference programs are indeed behavior-level errors, or are stylistic differences that result in correct program variants. In the process of detecting intentions and discrepancies, IDD learns new intentions and variants.

Behavioral discrepancies are expressed naturally as atomic formulas in the function-free first order logic. Consider again the buggy and correct programs presented earlier, whose recursive reverse/2 clauses we reproduce below.

% buggy clause 1
reverse([H|T],[T1|H]) :-
    reverse(T,T1).

% correct clause
reverse([H|T],R) :-
    reverse(T,T1),
    append(T1,[H],R).

The buggy clause differs from the correct clause in two ways. First, in place of the variable R in the head of the correct clause, the buggy clause has the list [T1|H]. This discrepancy can be expressed in relational terms as replace(head,R,[T1|H]). Second, the second subgoal in the correct clause is omitted in the buggy clause. This discrepancy can be expressed in relational terms as remove(subgoal2). The two discrepancies just described form a discrepancy set.

The second module, MMD, then takes such a discrepancy set as one input object, and clusters the discrepancy set into the error hierarchy associated with the student program's underlying intention, revising the error hierarchy in the process. As stated earlier, each subtree under the root node of an MMD-induced error hierarchy operationally defines a knowledge-level error.
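The paper does not show the concrete data structure, but a discrepancy set can be pictured as a simple list of ground Prolog terms, with the programs' variable names reified as constants; this representation is our assumption and is reused in the sketches that follow.

% assumed representation of buggy program 1's discrepancy set
discrepancy_set(buggy1, naive_reverse,
                [ replace(head, r, [t1|h]),
                  remove(subgoal2) ]).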

3. Detecting intentions and discrepancies

We now provide more detail about IDD/LP. IDD/LP (whose outline is shown in Table 1) detects intentions, computes behavioral discrepancies, determines behavior-level errors, and learns, in the process, new intentions and variants.

3.1. Detecting intentions

IDD/LP begins by canonicalizing the student program, i.e., by renaming predicates and variables, and performing simple equivalence-preserving transformations using unfold/fold rules (Tamaki & Sato, 1984) for logic programs (Table 1, step 1). IDD then proceeds to determine the reference program that reflects the student program's underlying intention (step 2).

3.1.1. Focal discrepancies and dissimilarity function. Determining the intention that underlies the student program is accomplished via an A* search (Hart, Nilsson, & Raphael, 1968) through a space of partial mappings and discrepancies, at each level of which abstract properties and specific literals of student and reference programs are compared and discrepancies computed. Perfectly matching literals, and literals that are equivalent according to a set of equiv operators (e.g., equiv(X,X), equiv(X+Y,Y+X)), do not have any discrepancies. The literals that are matched at each level are called the focus (also called focal literals) of that level. The focus of the first level is nil, since the task of IDD at that level is simply to retrieve the reference programs; no matching is done yet. The focus of the second level are the heads of the clauses of the main predicate. The focus of the third level are the subgoals of the clauses of the main predicate. The focus of the fourth level are the heads of the clauses of the first auxiliary predicate, while the focus of the fifth level are the subgoals of the clauses of the first auxiliary predicate, and so on. Nodes are scored according to the dissimilarity function:

Dif(n) = g(n) + h(n)

where g(n) is the cumulative weighted number of discrepancies so far, excluding the current focal discrepancies; and h(n) = 2^(2−d) · discs(n), where d is the current depth or level, and discs(n) is the number of current focal discrepancies. Note that g(n) is just the Dif score of the current node's parent. Note also that since h(n) is a lower bound on the number of discrepancies remaining, including the current ones, the search algorithm is admissible, i.e., it always terminates in an optimal path whenever one exists.

Table 1.  Outline of IDD/LP.

1. NORMALIZE PROGRAM. Canonicalize predicate and variable names in the input program, then simplify using unfold/fold rules, producing the program π.
2. DETECT INTENTION AND COMPUTE DISCREPANCIES.
   2.1 Perform an A* search for the reference program that is least dissimilar to π (for which a dissimilarity measure Dif is minimal).
   2.2 If a goal node n is found, and Dif(n) = 0, stop and return the reference program, ρ, associated with n. Else, if Dif(n) > 0, pass to the next step: the reference program ρ, and the union, δ, of discrepancy sets from the root of the search tree to n. If no n (and therefore no ρ) is found, go to step 4(.2).
3. TEST DISCREPANCIES.
   3.1 If δ is in the library of known knowledge-level errors (computed by MMD), stop and output the corresponding knowledge-level error and program error (PD/CI/NC, see below) along with ρ.
   3.2 Otherwise, test π using the test cases and ρ as oracle, if possible.
       3.2.1 Simulate π on input x (x are the values of the input variables).
             3.2.1.1 Possible divergence. When π's computation depth exceeds that of ρ on x (and the current stack of procedure calls contains a discrepancy δi in δ), return the error code PD (and δi).
             3.2.1.2 Coverage of an incorrect instance. Whenever a procedure call returns an output, check whether this output is correct using ρ. If it is not correct, return the error code CI and the incorrect procedure call.
       3.2.2 Noncoverage of a correct instance. Simulate π on input x and output y (x and y are the values of the input and output variables, respectively). If a particular ground instance of ρ cannot be covered by π, return the error code NC and the uncovered instance.
4. OUTPUT BEHAVIOR-LEVEL ERRORS OR LEARN NEW INTENTIONS/VARIANTS.
   4.1 If ρ could serve as an oracle for testing π:
       4.1.1 and π tests correctly, then learn (i.e., store) π as a correct variant of ρ;
       4.1.2 but if π does not test correctly, then output (to MMD) the intention ρ, the discrepancies δ (as behavior-level errors), and the effect (PD/CI/NC) of the behavioral errors.
   4.2 If there is no ρ, or ρ could not serve as an oracle,
       4.2.1 and if π runs correctly at the top level (i.e., using the test cases alone), then learn π as a new intention;
       4.2.2 otherwise (if π does not run correctly given the test cases), attempt to debug π interactively and learn (i.e., store) the debugged program as a new intention.

3.1.2. Intention search. The search for a program's intention (step 2.1) begins with the retrieval of the reference programs (and their correct variants, if any) associated with the problem that the student program is supposed to solve, and the creation of a node for each reference program (figure 2, level 1). Each node is scored according to the dissimilarity function Dif. From all end nodes of the search tree, IDD then selects the node whose Dif score is minimal. If this node is not yet a goal node (i.e., there still remain reference program literals to be examined), IDD expands the node by determining the focal literals, focal discrepancies, and remaining literals of the reference program associated with the current node in the next level, and the same process of scoring and selection continues. However, if the selected node is a goal node, the search stops and returns this node's associated reference program, and the union of the focal discrepancies on the path from the start node to this goal node, if any (Table 1, step 2.2).

Figure 2.  Intention recognition.

Figure 2 shows the sequence of nodes created by MEDD given buggy program 1 and the four reference programs mentioned in Section 2.2. (Though not shown in the figure, IDD examines all alternative mappings between the focal literals of the reference and student programs given the ordering constraints associated with the reference program literals.) Since a goal node (node 7) is reached in the subtree rooted in the node for naive reverse, the naive reverse algorithm is assumed to be the intention behind the buggy program.

3.2. Detecting errors and learning new intentions/variants

The set of discrepancies between the student program and the chosen reference program is then checked to determine whether it is symptomatic of a knowledge-level error. This is done in two steps. First, the discrepancies are looked up in a rule base or hierarchy of errors (generated by MMD); if found, IDD considers the behavioral discrepancies as behavioral errors, and immediately outputs the intention together with the associated knowledge-level error (Table 1, step 3.1). If this step fails, the student program is tested using Shapiro's (1983) diagnosis routines, slightly modified to use the reference program as oracle (step 3.2). If the program terminates correctly given the test cases in the background knowledge and the reference program as oracle, the discrepancies will not be treated as errors; instead, the student program will simply be learned, i.e., stored, as a correct variant of the current intention (step 4.1.1).3 However, if the student program does not terminate correctly (step 4.1.2), the discrepancies detected earlier will be treated as behavior-level errors and will be output (into MMD), unless the associated reference program could not be used as an oracle, in which case the reference program will not be considered the student's intention for obvious reasons; instead, a debugging substep will be triggered, during which the student program will be revised to make it correct (step 4.2.2). Currently, we use a prototype PDS6-based (Shapiro, 1983) inductive debugger, which we modify to allow it to reuse previously obtained information (from step 3) about the program's errors.

It will be noted, however, that student programs can become quite buggy, thus violating the assumption of first-order minimal theory revision systems such as PDS6 and FORTE (Richards & Mooney, 1995) that the initial theory be approximately correct. Hence our statement in Section 2.2 that we prefer as many reference programs to be provided to MEDD as possible, because if a reference program could not be found, MEDD would only be able to learn new intentions from student programs when the latter are either correct (step 4.2.1), or almost correct (step 4.2.2).

4. Discovering classes of errors

We now turn to the other module used by MEDD, namely, MMD, whose function is the discovery of knowledge-level errors underlying behavior-level discrepancies of buggy programs via multistrategy conceptual clustering.

4.1. Error hierarchy formation via incremental relational clustering

Conceptual clustering (Michalski & Stepp, 1983) is the grouping of unlabeled objects into categories for which conceptual descriptions are formed. Basically, an object O is classified into a category with concept description C with which it is quite similar (or not quite dissimilar).

4.1.1. Similarity measure. To measure this degree of similarity/dissimilarity, we adapt Tversky's (1977) contrast model:

Sim(C, O) = θ·f(C ∩ O) − α·f(C − O) − β·f(O − C)
which expresses the similarity between two objects C and O that are each expressed as a set of features, as a function of the weighted measures of their commonalities (C ∩ O) and differences (C − O, O − C). In our system, θ, α, β and f(X) are user-redefinable. Since the objects we deal with are sets of behavioral discrepancies represented as atomic formulas, we do not compute the commonalities between an input set of discrepancies O and a node C in an error hierarchy using mere set intersection, but as follows:

(C ∩ O) = Com(C, O) = ∪(i=1..m) ∪(j=1..n) lgg(Ci, Oj)

where lgg(Ci, Oj) is the least general generalization (Plotkin, 1970) of two atomic formulas in the function-free first order logic, and m and n are the number of discrepancies in C and O, respectively. Specifically, the lgg of two function-free terms, x and y, is x if x = y; otherwise, the lgg is a variable V (which we call a pattern variable), where the same variable is used everywhere as the lgg of these terms. The lgg of two function-free atomic formulas p(s1, s2, ..., sn) and p(t1, t2, ..., tn) is p(lgg(s1,t1), lgg(s2,t2), ..., lgg(sn,tn)). We call a generalization that contains a pattern variable a variabilization generalization; one that does not contain a pattern variable is called an intersection generalization.
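The following is a minimal sketch (ours, not MEDD's implementation) of this lgg for two ground atomic formulas, assuming, as above, that the programs' variable names are reified as constants (e.g., r, t1, h) so that only the pattern variables introduced by the lgg are real Prolog variables. It also recurses into list terms such as [T1|H], which the paper's discrepancies contain. memberchk/2 comes from library(lists).

% lgg of two ground terms; a differing pair of terms is consistently mapped
% to the same fresh Prolog variable (the "pattern variable").
lgg_term(X, Y, X, Map, Map) :-
    X == Y, !.                             % identical terms are kept as-is
lgg_term(X, Y, G, Map0, Map) :-
    compound(X), compound(Y),
    X =.. [F|Xs], Y =.. [F|Ys],
    length(Xs, N), length(Ys, N), !,       % same functor and arity: recurse
    lgg_args(Xs, Ys, Gs, Map0, Map),
    G =.. [F|Gs].
lgg_term(X, Y, V, Map, Map) :-
    memberchk(X-Y-V, Map), !.              % reuse the pattern variable for this pair
lgg_term(X, Y, V, Map, [X-Y-V|Map]).       % otherwise introduce a new pattern variable

lgg_args([], [], [], Map, Map).
lgg_args([X|Xs], [Y|Ys], [G|Gs], Map0, Map) :-
    lgg_term(X, Y, G, Map0, Map1),
    lgg_args(Xs, Ys, Gs, Map1, Map).

% ?- lgg_term(replace(head,r,[t1|h]), replace(head,r,[t|h]), G, [], _).
%    G = replace(head, r, [_X|h])          % i.e., the paper's replace(head,R,[X?|H])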

4.1.2. Similarity-based relational clustering. Since it is desirable that an error hierarchy be revised rather than reconstructed from scratch each time new data arrives, the basic similarity-based clustering algorithm (Table 2), which we call SMD (Similarity-based Misconception Discovery), is incremental, and therefore takes one discrepancy set at a time and classifies this recursively into the nodes of the error hierarchy that match the discrepancy set to a certain degree (Table 2, steps 1 and 2).

Table 2.  Outline of SMD.

1. MATCH DISCREPANCIES. From the children N1, ..., Nm of a given node N of the error hierarchy, determine those that match the set of input discrepancies, D. Specifically, compute, for every child node Ni:
   • the set of commonalities, Com(Ni, D), between a node, Ni, and D, and
   • the degree of similarity, Sim(Ni, D), between Ni and D
   and determine whether Sim exceeds a system threshold, γ.
2. POSITION DISCREPANCIES. If no match is found, place D under N. Otherwise, for every Ni that matches D, perform one of the following and increase weight counters:
   2.1 If both (D − Com(Ni, D)) and (Ni − Com(Ni, D)) are empty, simply increase the weight counter of Ni;
   2.2 If (D − Com(Ni, D)) is empty but (Ni − Com(Ni, D)) is not, replace Ni with D and insert (Ni − D) under D;
   2.3 If (Ni − Com(Ni, D)) is empty but (D − Com(Ni, D)) is not, cluster (D − Ni) under Ni, i.e., repeat the procedure, this time matching (D − Ni) with the children of Ni;
   2.4 Otherwise (if both (D − Com(Ni, D)) and (Ni − Com(Ni, D)) are nonempty), create a new node, Com(Ni, D), under N, representing the commonalities of D and Ni, and place their differences under this node.
3. FORGET OCCASIONAL SLIPS. Nodes whose weights fall below a system parameter may be discarded on a regular or demand basis. (No nodes are discarded in the current implementation.)
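To make step 1 concrete, here is a worked computation (ours) for the match illustrated in figure 3 below, using the parameter settings reported in Section 5.1 (θ = 1, α = β = 0.5, γ = 0, with f as set cardinality) and taking the difference terms relative to Com, as step 2 does. With the input discrepancy set D = {replace(head,R,[T1|H]), remove(subgoal2)} and the node Ni = [remove(subgoal2)]:

  Com(Ni, D) = {remove(subgoal2)}
  Sim(Ni, D) = 1·|Com| − 0.5·|Ni − Com| − 0.5·|D − Com|
             = 1·1 − 0.5·0 − 0.5·1 = 0.5 > γ = 0

so D matches Ni, and the remainder of D is clustered under that subtree (step 2.3).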

Figure 3.  Clustering a set of behavioral discrepancies into an error hierarchy.

To illustrate, recall buggy program 1 from Section 2.1, or more specifically, the discrepancies of buggy program 1 vis-a-vis the correct program for naive reverse, which were computed by IDD, and which we reproduce in figure 3A. Now consider the error hierarchy in the same figure, and note that the single discrepancy in the node [remove(subgoal2)] below the root of the error hierarchy is actually one of the discrepancies in the input discrepancy set. Thus, an intersection generalization is found. Assuming that the γ parameter of step 1 indicates that this is a match, the input discrepancy set will therefore be classified into this subtree, among possible others (Table 2, step 2.3). Next, the remaining discrepancy replace(head,R,[T1|H]) of the input discrepancy set is compared against the child node [replace(head,R,[T|H]), remove(subgoal1)]. This time, a variabilization generalization (replace(head,R,[X?|H])) occurs, so a node for this variabilization is created, and the instantiations of the pattern variable X? (i.e., X?=T and X?=T1), among others, are pushed down to the next level (Table 2, step 2.4). Figure 3B shows the revised hierarchy. We have already seen this subtree's clausal form (invalid append) in Section 2.3.

4.2. Strengthening the coherence of concept (knowledge-level error) descriptions

Similarity measures like the one above ignore qualitative relationships among features. However, the presence of qualitative, e.g., causal, relationships between the features that make up a conceptual description can strengthen the coherence of the concept and provide an explanation for the co-occurrence of the concept’s features. MMD uses causal relationships in the background knowledge and causality heuristics in order to determine the presence and direction of causal relationships within and among conceptual descriptions. One such heuristic that we use is the component-level causality heuristic, which simply states that causal (or enabling or determination) relationships among the components of the reference behavior/program that are present in a set of discrepancies suggest causal relationships among these discrepancies. To illustrate, consider again the correct recursive clause for naive reverse, reproduced below. The correct clause states that the reverse of a list is the concatenation of the reverse of its tail, and its head. Note that this can be viewed as describing relationships among four objects, namely, the head H and tail T of the list to be reversed, the reversed list R, and a temporary entity T1 representing the reverse of T; and the relations reverse/2 and append/3.

% correct clause
reverse([H|T],R) :-
    reverse(T,T1),
    append(T1,[H],R).

% buggy program 1's discrepancies
replace(head,R,[T1|H]), remove(subgoal2)

Now consider again the discrepancies between buggy clause 1 and the correct clause above. Note that the reference program components of the two discrepancies of buggy program 1 both involve the user variable R, and that the R involved in the second discrepancy can be viewed as causing the R involved in the first. (Of course, the R in the head and the R in the second subgoal of the correct clause refer to one and the same thing. To be more precise, therefore, what we mean by the R in the subgoal ‘causing’ the R in the head is this: given that both R’s are output variables, the subgoal, rather than the head, is responsible for binding R to a value.) Thus, the component-level causality heuristic suggests that there exists a causal relationship between the two discrepancies. Although the above heuristic can help determine the presence of a causal relationship between two discrepancies, it cannot determine the direction of causality. It cannot determine, for example, that the use of the [|] operator to concatenate, albeit incorrectly, some list with H causes or results in the omission of an append/3 subgoal. Causative discrepancies are inferred using other causality heuristics. However, while knowing which discrepancies are causative is useful when remediating a student, it does not affect MMD’s classification and clustering abilities. Hence, this paper focuses only on the component-level causality heuristic and its effect on hierarchy reorganization, to be discussed next. Interested readers are referred to (Sison, Numao, & Shimura, 1997; Sison, 1998) for the other causality heuristics.
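A rough, simplified sketch of this heuristic (ours, not MEDD's background knowledge) associates each discrepancy with the reference-clause variables it touches: a removed component touches every variable occurring in that component, and a replacement touches the variables of the replaced term. Two discrepancies are then taken to be related when they touch a common variable; as in the paper, this captures only the presence, not the direction, of a relationship. The component_vars/2 facts below are our own encoding of the correct naive reverse clause, with variable names reified as constants; append/3 and member/2 come from library(lists).

component_vars(head,     [h, t, r]).      % reverse([H|T],R)
component_vars(subgoal1, [t, t1]).        % reverse(T,T1)
component_vars(subgoal2, [t1, h, r]).     % append(T1,[H],R)

term_vars([], [])    :- !.
term_vars(V, [V])    :- atom(V), !.       % a reified variable name
term_vars([X|Y], Vs) :- term_vars(X, Xs), term_vars(Y, Ys), append(Xs, Ys, Vs).

touches(remove(C), V)          :- component_vars(C, Vs), member(V, Vs).
touches(replace(_, Old, _), V) :- term_vars(Old, Vs), member(V, Vs).

causally_related(D1, D2) :-
    D1 \== D2,
    touches(D1, V),
    touches(D2, V).

% ?- causally_related(remove(subgoal2), replace(head,r,[t1|h])).    % share R
%    true.
% ?- causally_related(remove(subgoal1), replace(subgoal2,[h],h)).   % no shared variable
%    false.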


Table 3.  Outline of MMD.

1. MATCH DISCREPANCIES. From the children N1, ..., Nm of a given node N of the error hierarchy, determine those that match the set of input discrepancies, D. Specifically, compute, for every child node Ni:
   • the set of commonalities, Com(Ni, D), between a node, Ni, and D, and
   • the degree of similarity, Sim(Ni, D), between Ni and D
   and determine whether Sim exceeds a system threshold, γ. In addition, determine causal relationships among discrepancies in D using the component-level causality heuristic.
2. POSITION DISCREPANCIES. (same as SMD's step 2)
3. SEVER FALSE TIES. For every new node created in step 2, determine, using the causality heuristics, concept- and subconcept-level causalities. Then,
   3.1 if no component- or concept-level causality exists among discrepancies in this node, retain the node nevertheless. (If the node needs to be split, it will be split in step 2 on another occasion.)
   3.2 if this node is a leaf node, and no (subconcept-level) causality exists between components of this node (that are not mere pattern variable instantiations) and its parent, sever the link between this child and its parent, and reclassify it (and the current subtree) into the hierarchy.
4. FORGET OCCASIONAL SLIPS. (same as SMD's step 3)
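A sketch (ours) of the test behind step 3.2: a leaf child is cut from its parent when none of its discrepancies is causally related to any discrepancy of the parent. The refinement concerning pattern-variable instantiations is omitted here; causally_related/2 is the sketch from Section 4.2, and member/2 comes from library(lists).

should_sever(ParentDiscs, ChildDiscs) :-
    \+ ( member(P, ParentDiscs),
         member(C, ChildDiscs),
         causally_related(P, C) ).

% ?- should_sever([remove(subgoal1), replace(subgoal2,t1,t)],
%                 [replace(subgoal2,[h],h)]).
%    true.    % the bracket-omission slip is unrelated to its parent (cf. figure 4)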

4.3. Similarity and causality-based relational clustering

Existing multistrategy approaches (e.g., Lebowitz, 1986; Pazzani, 1994) to using similarity and causality in the formation of concepts with causative features use separate similarity and theory (i.e., causality) driven learning components one after the other. However, Wisniewski and Medin (1994) argue cogently that such loosely coupled approaches to using data and theory, while undoubtedly useful, still remain inadequate as models of concept formation.

In MMD, the similarity and causality-based components are tightly coupled in the concept (i.e., knowledge-level error) formation process. This tight coupling entails three revisions to the basic SMD algorithm above (instead of a separate algorithm altogether): (1) causal relationships are determined using the component-level causality heuristic, (2) the directions of causalities are determined whenever possible using the other (concept- and subconcept-level) heuristics, and (3) the link between a parent node and its child is severed if the two are found to be (causally) unrelated. These revisions to the SMD procedure are shown in slanted type style in Table 3, which summarizes the procedure of MMD. Note that step 3.2 (SEVER) effectively functions as a reorganization operator that is causality-based. Step 4 likewise reorganizes (prunes) the hierarchy, though it does this based on frequencies. Reorganization operators are especially important for incremental learners in mitigating so-called ordering effects. We shall return to the issue of ordering effects later in Section 5.2. For now, we discuss the mechanism of the SEVER operator.

In addition to the hierarchical reorganization that step 2 entails, the absence of causal relationships between a parent and a child in a hierarchy can cause further reorganization: said child can be severed from its parent and then reclassified (step 3.2 (SEVER)). This allows MMD to separate unrelated discrepancies so that the generalized discrepancies that make up an intensional definition of an error class form a coherent set. The effect of this is that MMD can disentangle the multiple misconceptions or knowledge errors that may have produced the discrepancies in a student's behavior.

Figure 4.  Clustering a set of behavioral discrepancies into an error hierarchy, II.

Consider the following buggy clause:

% buggy clause 2
reverse([H|T],R) :-
    append(T,[H],R).

The discrepancies between the above clause and the correct naive reverse/2 clause presented earlier are remove(subgoal1) and replace(subgoal2,T1,T). Clustering the discrepancies of buggy clause 2 into the error hierarchy in figure 4A produces the revised error hierarchy in figure 4B. Now note that the node [replace(subgoal2,[H],H)] in the right subtree of figure 4B is not at all (causally) related to its parent ([remove(subgoal1), replace(subgoal2,T1,T)]). Since the child is not related to its parent, MMD severs the child from the parent and then reclassifies the child into the hierarchy. This particular severing operation is consistent with the fact that omitting or forgetting to put the list brackets [] around a variable (denoted
by the discrepancy replace(subgoal2,[H],H)) has nothing to do with omitting the first subgoal of the correct clause (i.e., the discrepancy remove(subgoal1)), and the natural consequence of the latter, which is using some other variable in place of T1 in the second subgoal of the correct clause (i.e., the discrepancy replace(subgoal2,T1,T)).

Note carefully that the SEVER operator only applies between a child and an unrelated parent; i.e., a subset of discrepancies will not be severed from its original set, S, even if no causal relationships can be found between it and S, unless it has already been 'gently pushed out of' S, only connected to the rest of S by a parent-child link. In other words, the SEVER operator will not split a node even though its contents are unrelated with respect to the background knowledge. This conservative approach of retaining nodes despite the absence of component-level causalities thus takes into account the possibility that the background knowledge may be incomplete.

5. Experiments

We now examine the performance of MEDD in terms of its ability to recognize and classify novice programs, discover knowledge-level errors, and mitigate ordering effects.

5.1. Classification and discovery: MEDD with SMD, CMD, and MMD

5.1.1. Experiment. To empirically evaluate the ability of MEDD to discover and detect, from discrepancies in real-world behavior, error classes that represent knowledge-level errors, sufficiently large corpora of incorrect novice behavior, here in the form of 64 buggy naive reverse/2 and 56 buggy sumlist/2 programs, were obtained.4,5 The buggy naive reverse and sumlist programs were then given to a team of experts (Prolog teachers) who were asked to identify the programs' knowledge-level errors. Simultaneously, the discrepancies between the buggy programs and their associated intentions were computed via IDD/LP and then fed in random order into MMD. The error hierarchies generated by MEDD were then also given to the experts, who were then asked to produce two lists of knowledge-level errors (see appendix) and to identify the correspondence between the knowledge-level errors they have listed and the error classes generated by MEDD.

MEDD's performance is examined from two perspectives: (1) program classification and (2) error discovery. A buggy program is said to be correctly classified if its computed error class corresponds to a knowledge-level error that has been identified as being exhibited by the buggy program. Similarly, an error class is said to be correctly discovered if it corresponds to a knowledge-level error that has been identified as being exhibited by certain buggy programs in the dataset. Furthermore, since buggy programs may exhibit more than one knowledge-level error, we consider a buggy program as fully classified only if all the knowledge-level errors underlying it are detected; otherwise, it is only partially classified, and receives only a partial point (fraction of the total number of knowledge-level errors underlying the program). Similarly, an error class is considered fully discovered (with respect to past data) only if the class contains all the manifestations that it can assume in a buggy program; otherwise, it is only partially discovered, and receives only a partial point (fraction of the total number of possible ways that it can appear in a buggy program).
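To make the scoring concrete (this worked reading of the rule is ours): under P-P, a buggy program that exhibits two knowledge-level errors of which MEDD detects only one contributes 0.5 of a point to the classification rate, whereas under P-F the same partially classified program is awarded the full point; error-class discovery is scored analogously.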

Table 4.  Classification and discovery results for the reverse/2 and sumlist/2 datasets.

                           Buggy programs              Error classes
                           correctly classified (%)    correctly discovered (%)
Dataset      Algorithm       P-P        P-F              P-P        P-F
reverse/2    SMD              61         79               34         38
             MMD              84         94               70         88
             CMD              68         81               44         50
sumlist/2    SMD              76         81               55         58
             MMD              95         97               75         75
             CMD              44         45               63         63

P-P: Partial classification/discovery awarded Partial point.
P-F: Partial classification/discovery awarded Full point.

The results of MMD's (and SMD's) performance are shown in Table 4. (The table also shows the results of an algorithm called CMD, which we shall discuss shortly.) The values in the table are average performance rates given five random orderings of the datasets. Parameters for Sim were set as follows: θ = 1, α = 0.5, β = 0.5, and γ = 0, with f(X) implemented as the cardinality of X. Other values for θ, α, β, and γ are less intuitive and did not yield better results. The first causality heuristic was implemented using causal relationships among variables in the reference program, as described in Section 4.2.

5.1.2. Results and discussion. The results are very encouraging. As Table 4 shows, MMD was able to correctly classify most (reverse (P-P): 84%, sumlist (P-P): 95%) of the buggy programs and correctly discover most (reverse (P-P): 70%, sumlist (P-P): 75%) of the knowledge-level errors in both datasets. The values were even higher when partially classified programs or partially discovered knowledge-level errors were awarded full points (P-F). Although MMD was not able to discover all error classes, it nevertheless succeeded in discovering the more common ones. This explains why the classification rates (84%, 95%) were higher than the discovery rates (70%, 75%).

Compared with SMD, MMD's performance in classifying buggy programs was better in both datasets (reverse: better by 38%, sumlist: better by 25%). The same is true for its performance in discovering error classes (reverse: better by 105%, sumlist: better by 36%). MMD's superior performance is due to the causality-based SEVER operator, which allows discrepancy sets to be multiply classified and multiple knowledge-level errors to be distinguished.

A limitation of the current implementation has to do with the vocabulary in terms of which the discrepancies are expressed, i.e., the language bias. Currently, the system uses only four predicates (remove, add, replace, switch) to express behavioral discrepancies, and these operators are given equal weights. It is of course possible to use a richer set of discrepancy operators. One could also assign different weights to the discrepancy operators. However, we shall soon see that feature weighting is not always a good idea.


5.1.3. Causality-based relational clustering. Since MMD's superior performance is due to the causality-based SEVER operator, the question now arises as to the extent to which causality can be used without being abused. Consider a variant of MMD which requires that, for two sets of discrepancies to match, their commonalities must be causally related. This implies that every discrepancy in every concept node thus formed will be causally related with some other discrepancy in the same node. Let us call this variant CMD (Causality-based Misconception Discovery). Note that CMD's procedure would be similar to that for MMD (recall Table 3) except for two things. First, the last part of step 1 of CMD's procedure would read ". . . system threshold, γ, and causal relationships exist among the discrepancies in Com." Second, steps 3.1 and 3.2 of MMD's procedure would be omitted (because they would never happen).

Our experimental procedure and parameter settings for evaluating CMD's classification and discovery performance follow those in Section 5.1.1. Results (Table 4) show that CMD's performance was not so good. MMD's classification performance (P-P) was 24% better than CMD's on the reverse dataset, and more than 100% better than CMD's on the sumlist dataset. As far as discovery performance is concerned, MMD (P-P) was almost 50% better than CMD on the reverse dataset, and 19% better than CMD on the sumlist dataset. The differences were even greater when partially classified programs or partially discovered error classes were given full points (P-F).

This lackluster performance can be traced to the fact that CMD requires that two similar discrepancy sets have (the same) inter-discrepancy causal relationships. Thus, for example, CMD was not able to learn the error class [replace(subgoal2,T1,H), replace(subgoal2,[H],X?)] because the discrepancies in this node are not causally related according to the first causality heuristic and the background knowledge of causal relationships among the variables in the correct program. (Further details may be found in Sison, 1998.)

Unfortunately, not all co-occurring discrepancies in the above experiments had causal relationships that could be found in or explained by the background knowledge. This is not to say that there are no causal relationships between these co-occurring discrepancies; this only says that the background knowledge of causal relationships might be incomplete. By retaining such robust (i.e., frequently co-occurring but not separable into parent and child) groups of discrepancies in spite of the absence of causal relationships in the background knowledge, MMD can in fact correctly discover more errors and classify more programs than CMD. (In addition, such as-yet-not-fully-explainable but apparently robust co-occurrences can be used to trigger a new kind of discovery process involving the search for new causal relationships or heuristics.)

5.2. Mitigating ordering effects: feature weighting, reclassification, and MEDD

5.2.1. Mitigating ordering effects. Heuristic incremental systems that summarize data generally suffer from ordering effects: different orderings of the input data may yield different classifications. Though there may be a number of ways of dealing with ordering problems (see e.g., Fisher, Xu, & Zard, 1992), what seems to be a generally preferred approach when no significant amount of data is known a priori would be to introduce control operators that reorganize the classification hierarchies. One such approach is to use
a node splitting (or merging) operator whenever a node is considered to be 'misplaced' (e.g., COBWEB, Fisher, 1987; UNIMEM, Lebowitz, 1987). Another approach is to recluster a classification hierarchy until the hierarchy has 'stabilized' (e.g., ITERATE (Biswas et al., 1991); experiments by Fisher, 1987). It will be recalled that MMD's SEVER operator enables the separation of multiple classes of errors that are manifested in a single complex behavior. Actually, this operator can also serve to mitigate ordering effects in a manner similar to the splitting operators of COBWEB or UNIMEM, and to the reclassification approach of ITERATE, though it is more powerful than previous splitting operators in the sense that it can split not only categories but objects (the discrepancy sets) as well, and is more informed and efficient than the general reclassification approach in the sense that only subtrees containing causally unrelated parents and children are reclustered.

5.2.2. Experiment. To evaluate the effectiveness of MEDD/MMD's causality-based SEVER operator in mitigating ordering effects, we compare the performance of MMD against that of a general reclassification scheme and a feature weighting scheme, on the discrepancy sets of 64 buggy reverse/2 and 56 buggy sumlist/2 Prolog programs. The general reclassification scheme is one in which an entire hierarchy is repeatedly reclassified ρ number of times while ρ is less than some user-defined parameter, or until no more reclassifications are possible (i.e., further reclassifications do not change the structure of the error hierarchy). In the feature weighting scheme, more 'severe' discrepancies are given greater weights (e.g., the omission of an entire subgoal in a buggy program is given more weight than the use of a different argument, say T instead of T1, in some subgoal). As in the previous experiments, performance is examined from the perspective of program classification and error discovery, and results for partial as well as full classification/discovery were taken.

The results (mean X̄ and standard deviation σ) of SMD, MMD, and SMD or MMD combined with the general reclassification and feature weighting schemes for 5 random orderings of the input datasets are shown in Table 5. For the MMD-based algorithms, Sim parameters were set as before. The first causality heuristic was also implemented as before, using causal relationships between variables of the reference program. For the algorithms that used a general reclassification scheme, ρ = 1. This value will be explained shortly. For the algorithms that used a feature weighting scheme, discrepancies that involved the omission of entire subgoals or the incorrect switching of the order of two subgoals were each given a weight of 5; all other discrepancies were given a weight of 1. Experiments on weights 5 ± 3 for the 'severe' discrepancies yielded the same results; weights below or above this range did not yield better results.

5.2.3. Results and discussion. The main result is that MMD can effectively mitigate ordering effects while still achieving high performance rates. MMD's overall performance given different orderings of the input data varied only slightly at σ = 0 and σ < 1 for the reverse and sumlist datasets, respectively. On the other hand, SMD's performance in buggy program classification and error class discovery was much more unstable, with its σ values going as high as 20 times those of MMD.
The results also confirm Fisher's (1987) and Biswas and colleagues' (1991) finding that general reclassification is a useful mechanism for mitigating ordering effects, although by
itself and in our domain, general reclassification is not as powerful as MMD's SEVER operator, which can be viewed as incorporating a knowledge-driven, local (subtree) reclassification. Thus, although general reclassification consistently brought down the standard deviations on SMD and MMD's average performances (with ρ = 1 sufficient to stabilize MMD's performance on the sumlist dataset, and reclassification completely unnecessary for MMD working on the reverse dataset), the classification and discovery rates of SMD+R only amounted to about half of those achieved by MMD in the reverse dataset. Moreover, general reclassification did not significantly improve MMD's performance (X̄) in both datasets. This can be explained by the fact that general reclassification does not introduce a strong enough bias when clustering discrepancies. Nevertheless, said mechanism can easily be incorporated into MEDD and turned off when memory space is limited.

In contrast to the results for general reclassification, the results for feature weighting indicate that the latter is not always helpful in mitigating ordering effects. In the sumlist dataset, for example, feature weighting ("+W" in Table 5) was able to decrease all σ's. In the reverse dataset, however, the σ value when feature weighting was used together with MMD (i.e., MMD+W) for error discovery shot up to σ = 7.2, over seven points higher than that obtained when MMD was used alone (σ = 0). This can be explained by the fact that feature weighting introduces too strong a bias when classifying discrepancies: each main subtree will tend to have at least one severe (i.e., heavily weighted) discrepancy. However, not all error classes involve severe discrepancies.

Table 5.  Mitigation results for the reverse/2 and sumlist/2 datasets.

                           Buggy programs correctly        Error classes correctly
                           classified (%)                  discovered (%)
                             P-P          P-F                P-P          P-F
Dataset      Algorithm     X̄     σ      X̄     σ            X̄     σ      X̄     σ
reverse/2    SMD           61   17.6    79   19.2           34    8.8     38   12.5
             SMD+R         44    4.1    59    4.3           28    4.3     35    5.0
             SMD+W         38    5.0    53    7.2           19    6.3     20    6.8
             SMD+R+W       41    0.0    58    0.0           25    0.0     25    0.0
             MMD           84    0.0    94    0.0           70    0.0     88    0.0
             MMD+R         84    0.0    94    0.0           70    0.0     88    0.0
             MMD+W         73    2.9    91    3.4           58    7.2     75    0.0
             MMD+R+W       84    0.0    94    0.0           70    0.0     88    0.0
sumlist/2    SMD           76   17.1    81   18.3           55   10.8     58   18.3
             SMD+R         88    0.0    95    0.0           61    0.0     63    0.0
             SMD+W         90    0.0    98    0.0           63    0.0     63    0.0
             SMD+R+W       90    0.0    98    0.0           63    0.0     63    0.0
             MMD           95    1.0    97    0.6           75    0.2     75    0.0
             MMD+R         96    0.0    98    0.0           75    0.0     75    0.0
             MMD+W         96    0.0    98    0.0           75    0.0     75    0.0
             MMD+R+W       96    0.0    98    0.0           75    0.0     75    0.0

P-P/P-F: as in Table 4. R: General Reclassification. W: Feature Weighting.

6. Related work

MEDD’s IDD/LP, when viewed as an intention-based program diagnoser, would be related to the PROUST (Johnson, 1990) and APROPOS2 (Looi, 1991) debuggers, which are also intention-based. IDD/LP is closer to APROPOS2 in the sense that dynamic analysis (i.e., program testing) a` la Shapiro (1983) is used after static analysis (i.e., syntactic matching) in both. However, MEDD further adapts Shapiro’s routines to take full advantage of the availability of (1) the reference program and (2) the discrepancies computed between the student and reference programs, in order to more fully automate this task. Of course, the most important difference between IDD/LP and other intensional diagnosers is that the others rely on hand-coded, static libraries of bugs, whereas IDD/LP has the ability to: (1) extend its background knowledge of intentions and correct variants, and (2) use and support the construction of an automatically constructed and extendable bug library. The similarity component of MEDD’s MMD is similar to UNIMEM (Lebowitz, 1987) and COBWEB (Fisher, 1987), both of which are also incremental, hierarchical conceptual clusterers. However, UNIMEM and COBWEB deal with attribute-value descriptions, whereas MMD deals with relational descriptions. More importantly, MMD considers causalities in the background knowledge as well as similarities in the data in the formation of concepts. To determine causal relationships, particularly the causative features, among the features that make up an induced concept definition, the multistrategy UNIMEM extension proposed in (Lebowitz, 1986) uses the frequency of occurrence of a feature. In MMD, causative discrepancies are determined by whether a generalized discrepancy is an intersection or a variabilization, or if it is a parent or a child. In (Pazzani, 1994), which likewise uses a UNIMEMlike clusterer, there are two kinds of features, namely, actions and state changes, and actions are always the causative features. In MMD, any discrepancy is potentially causative.

7. Conclusion

We have presented a system, MEDD, that learns, without supervision, how to classify logic programs written by novices. This is accomplished in a multistrategic manner by the exemplar-based classification and learning of novice and reference programs, and the similarity- and causality-based clustering of discrepancies between student and reference programs into an error hierarchy, whose main subtrees form intensional definitions of error classes, which, in turn, denote knowledge-level errors. We have seen how well MEDD performs (on novice Prolog programs) in terms of both classification and discovery, and how MEDD can correctly detect the occurrence of knowledge-level errors in these programs, even when multiple errors are enmeshed together in a single program, and when the programs are presented to MEDD in a different order. The applicability of MEDD’s approach


should extend naturally to other languages or domains in which the internal workings of behavioral components (e.g., clauses, subgoals) are abstracted away and do not interfere with those of other components.

Appendix: Knowledge-level errors listed by experts

Note: The bracketed number alongside each knowledge-level error indicates the number of buggy programs that exhibited the error.

Knowledge-level errors in the reverse/2 dataset

• Slip or misconception: [X] is the same as X [19]
• Misconception: [|] is the same as append/3 [26]
• Lack of background knowledge: Introducing variables in the body of a clause [11]
• Lack of background knowledge: Output variables [2]
• Lack of background knowledge: Recursion [4]
• Lack of task knowledge: Reversing [17]
• Repair: Incorrect analogy with a solution in which recursion comes last [4]
• Repair: Making arguments in a subgoal consistent with previously (but erroneously) specified arguments [6]
• Lack of background knowledge: Termination [9]

Knowledge-level errors in the sumlist/2 dataset

• Slip or misconception: [X] is the same as X [6]
• Misconception: Arithmetic expression as output variable [1]
• Lack of background knowledge: Introducing variables in the body of a clause [39]
• Lack of background knowledge (or misconception): Recursion in Prolog [6]
• Lack of background knowledge: Recursion: No recursion [2]
• Lack of background knowledge: Recursion: “Can’t wait” [3]
• Repair: Incorrect analogy with a solution in which recursion comes last; Making arguments in a subgoal consistent with previously (but erroneously) specified arguments [30]
• Lack of background knowledge: Termination [8]

Acknowledgments

We would like to thank Lorenza Saitta, Katharina Morik, and the editors and reviewers for their invaluable comments; Rhodora Reyes, Ethel Chua Joy, and Philip Chan for their help with the buggy programs; Tsuyoshi Murata, Oscar Ortega Lobo, Ryutaro Ichise, and Jefferson Tan for their technical assistance; and the Ministry of Education (Monbusho) of Japan for its generous financial support.


Notes

1. Problem solving (solving elementary algebraic equations) was the task of the students modeled by PIXIE. On the other hand, classification (of given C++ programs as correct or incorrect) was the task of the students modeled by ASSERT. In addition to the use of machine learning to automate background knowledge construction, as in PIXIE and ASSERT, machine learning techniques can also be used to infer student models, e.g., ACM (Langley & Ohlsson, 1984), SIERRA (VanLehn, 1987), ASSERT, and (Neri, 1999). Sison & Shimura (1998) provide a review and discussion of the application of machine learning and machine learning-like techniques in student modeling.

2. IDD is called MDD (Multistrategy Discrepancy Detection) in (Sison, 1998). However, since what is multistrategic in MDD is its approach to program diagnosis—it performs both static and dynamic analysis of programs—rather than its approach to learning (in the current implementation), we do not use the term 'multistrategy' here to avoid confusion.

3. For speedup, the discrepancy set can also be learned as a transformation rule via explanation-based generalization (Mitchell, Keller, & Kedar-Cabelli, 1986), by showing that each element in the discrepancy set is part of a successful application of a sequence of proven correct transformation rules. However, the learning of transformation rules is not currently implemented, as it does not affect MEDD's accuracy, which is our current concern. Moreover, note that the learning of correct variants of reference programs is already a form of speedup learning.

4. From a total of 73 reverse/2 and 90 sumlist/2 programs written by third-year undergraduate students who were learning Prolog, 64 buggy naive reverse and 56 buggy sumlist programs were obtained. Three of the 73 reverse programs were rightly found by IDD/LP to be correct, while 6 used an accumulator, albeit incorrectly. We did not perform any experiments on the buggy accumulator reverse programs, as these were too sparse. Thirty-four of the 90 sumlist programs were rightly determined by IDD/LP to be correct.

5. It is important to note that although the reverse/2 and sumlist/2 problems are simple from an expert's point of view, they are not that simple from the point of view of novices, as evidenced by the buggy programs. The reverse/2 and sumlist/2 programs form part of a suite of problems which Gegg-Harrison (1994) has empirically shown to be sufficient for novice Prolog programmers. The rest of the programs in this suite are length/2, product/2, count/2, enqueue/3, and append/3. The reverse/2 problem is the most difficult in this suite. MMD has only been tested on novice programs, specifically, on programs written by students for Gegg-Harrison's suite of problems. However, the approach and results should extend naturally to programs written by more advanced programmers as long as the so-called competent programmer assumption (Shapiro, 1983) holds, i.e., that a program is written with the intention of being correct, and that if it is not correct, then a close variant of it is.
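For concreteness, the two exercises referred to in Notes 4 and 5 admit the following standard definitions (cf. Sterling & Shapiro, 1986, for naive reverse). The buggy clause in the final comment is a hypothetical illustration of the "[|] is the same as append/3" misconception listed in the Appendix, not an actual student program from the datasets.

```prolog
% Standard naive reverse: reverse the tail, then append the head at the end.
reverse([], []).
reverse([H|T], R) :-
    reverse(T, RT),
    append(RT, [H], R).

% Standard sum of a list of numbers.
sumlist([], 0).
sumlist([H|T], S) :-
    sumlist(T, ST),
    S is H + ST.

% Hypothetical buggy variant: the cons operator is used as if it appended
% H to the *end* of the reversed tail (it does not; [RT|H] is not even a
% proper list), e.g.:
%   reverse([H|T], [RT|H]) :- reverse(T, RT).
```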

References

Baffes, P. & Mooney, R. (1996). Refinement-based student modeling and automated bug library construction. Journal of Artificial Intelligence in Education, 7, 75–116.
Bareiss, E., Porter, B., & Wier, C. (1987). PROTOS: An exemplar-based learning apprentice. Proceedings of the Fourth International Workshop on Machine Learning.
Biswas, G., Weinberg, J., Yang, Q., & Koller, G. (1991). Conceptual clustering and exploratory data analysis. Proceedings of the Eighth International Workshop on Machine Learning (pp. 591–595).
Fisher, D. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139–172.
Fisher, D., Xu, L., & Zard, N. (1992). Ordering effects in clustering. Proceedings of the Ninth International Workshop on Machine Learning (pp. 163–168).
Gegg-Harrison, T. (1994). Exploiting program schemata in an automated program debugger. Journal of Artificial Intelligence in Education, 5, 255–278.
Hart, P., Nilsson, N., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4, 100–107.
Johnson, W. L. (1990). Understanding and debugging novice programs. Artificial Intelligence, 42, 51–97.


Langley, P. & Ohlsson, S. (1984). Automated cognitive modeling. Proceedings of the Second National Conference on Artificial Intelligence (pp. 193–197).
Lebowitz, M. (1986). Integrated learning: Controlling explanation. Cognitive Science, 10, 219–240.
Lebowitz, M. (1987). Experiments with incremental concept formation: UNIMEM. Machine Learning, 2, 103–138.
Looi, C. (1991). Automatic debugging of Prolog programs in a Prolog intelligent tutoring system. Instructional Science, 20, 215–263.
McCalla, G. (1992). The central importance of student modeling to intelligent tutoring. In E. Costa (Ed.), New directions for intelligent tutoring systems. Berlin: Springer-Verlag.
Michalski, R. & Stepp, R. (1983). Learning from observation: Conceptual clustering. In R. Michalski, J. Carbonell, & T. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga.
Mitchell, T., Keller, R., & Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47–80.
Neri, F. (1999). The role of explanation in elementary physics learning. Machine Learning, this issue.
Pazzani, M. (1994). Learning causal patterns: Making a transition from data-driven to theory-driven learning. In R. Michalski & G. Tecuci (Eds.), Machine learning: A multistrategy approach (Vol. IV, pp. 267–293). San Francisco, CA: Morgan Kaufmann.
Plotkin, G. (1970). A note on inductive generalization. Machine Intelligence, 5, 153–163.
Richards, B. & Mooney, R. (1995). Automated refinement of first-order Horn-clause domain theories. Machine Learning, 19, 95–131.
Shapiro, E. (1983). Algorithmic program debugging. Cambridge, MA: MIT Press.
Sison, R. (1998). Multistrategy discovery and detection of novice programmer errors. Ph.D. Thesis, Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan.
Sison, R., Numao, M., & Shimura, M. (1997). Using data and theory in multistrategy (mis)concept(ion) discovery. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (pp. 274–279).
Sison, R. & Shimura, M. (1998). Student modeling and machine learning. International Journal of Artificial Intelligence in Education, 9, 128–158.
Sleeman, D., Hirsh, H., Ellery, I., & Kim, I. (1990). Extending domain theories: Two case studies in student modeling. Machine Learning, 5, 11–37.
Sterling, L. & Shapiro, E. (1986). The art of Prolog. Cambridge, MA: MIT Press.
Tamaki, H. & Sato, T. (1984). Unfold/fold transformation of logic programs. Proceedings of the Second International Logic Programming Conference (pp. 127–138).
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.
VanLehn, K. (1987). Learning one procedure per lesson. Artificial Intelligence, 31, 1–40.
Wisniewski, E. & Medin, D. (1994). On the interaction of theory and data in concept learning. Cognitive Science, 18, 221–281.

Received December 22, 1998
Accepted June 18, 1999
Final manuscript June 18, 1999
