Int J of Sci and Math Educ (2017) 15:19–38 DOI 10.1007/s10763-015-9678-6
Format Effects of Empirically Derived Multiple-Choice Versus Free-Response Instruments When Assessing Graphing Abilities
Craig Berg 1 & Stacy Boote 2
Received: 27 May 2015 / Accepted: 16 August 2015 / Published online: 8 September 2015
© Ministry of Science and Technology, Taiwan 2015
Abstract Prior graphing research has demonstrated that clinical interviews and free-response instruments produce very different results than multiple-choice instruments, indicating potential validity problems when using multiple-choice instruments to assess graphing skills (Berg & Smith in Science Education, 78(6), 527–554, 1994). Extending this inquiry, we studied whether empirically derived, participant-generated graphs used as choices on a multiple-choice graphing instrument produced results that corresponded to participants' responses on free-response instruments. The 5–8 choices on the multiple-choice instrument came from graphs drawn by 770 participants in prior research on graphing (Berg, 1989; Berg & Phillips in Journal of Research in Science Teaching, 31(4), 323–344, 1994; Berg & Smith in Science Education, 78(6), 527–554, 1994). Statistical analysis of responses from the 736 7th–12th grade participants indicates that the empirically derived multiple-choice format still produced significantly more "picture-of-the-event" responses than did the free-response format for all three graphing questions. For two of the questions, participants who drew graphs on the free-response instruments produced significantly more correct responses than those who answered multiple-choice items. In addition, participants having "low classroom performance" were affected more significantly and negatively by the multiple-choice format than participants having "medium" or "high classroom performance." In some cases, prior research findings indicating the prevalence of "picture-of-the-event" responses and graphing treatment effects may be spurious, a product of the multiple-choice item format rather than a valid measure of graphing abilities. We also examined how including a picture of the scenario on the instrument versus only a written description affected responses, and whether asking participants to add marker points to their constructed or chosen graph would overcome the short-circuited thinking that multiple-choice items seem to produce.

Keywords Assessing · Construction · Graphing · Graphs · Interpretation · Validity

* Craig Berg
[email protected]
Stacy Boote
[email protected]
1 University of Wisconsin-Milwaukee, Milwaukee, WI, USA
2 University of North Florida, Jacksonville, FL, USA
Introduction

Graphs are a significant component of learning science content, a critical tool for analyzing data when doing science, and an important visual aid when communicating and developing an understanding of the many scientific and economic factors in daily life. Graphs are a primary means of identifying patterns, and Kimbal (1967) argued that discovering patterns fitting a mathematical model is the ultimate goal of science. Constructing and interpreting graphs are essential tools, often working in tandem, for understanding and communicating ideas in science (Barclay, 1986; Leinhardt, Zaslavsky & Stein, 1990; Linn, Layman & Nachmias, 1987; Macdonald-Ross, 1977; McKenzie & Padilla, 1984, 1986; Mokros, 1986; Weintraub, 1967). While graph interpretation and creation are not mutually exclusive, they rely on different cognitive processes (Leinhardt et al., 1990). The former includes "point-by-point" local processes, whereas the latter includes the more global identification of "trend direction(s)" (p. 9). Researchers and teachers rely on valid assessments to understand students' development of both graph creation and interpretation skills.

Previous research on graph creation and interpretation with middle and high school students has shown graphing to be a weak area for many of our science and mathematics students (Berg, 1989; Berg & Phillips, 1994; Berg & Smith, 1994; Boote, 2014; Brasell, 1987, 1990; Keller, 2008; Kerslake, 1977; McDermott, Rosenquist, Popp & van Zee, 1983; Shaw, Padilla & McKenzie, 1983). Specifically, transferring graph knowledge learned in one class to another is very challenging for students (Friel, Curcio & Bright, 2001; Glazer, 2011; Leinhardt et al., 1990).

Students' lack of fluency with graphing is one reason for the emphasis on graph interpretation in recent US standards documents. In the Next Generation Science Standards (NGSS), a central theme is to make sure our students graduate with "an understanding of the enterprise of science as a whole—the wondering, investigating, questioning, data collecting, and analyzing" (NGSS Lead States, 2013, Appendix H, p. 1). This emphasis is also echoed across the Common Core State Standards: both the Mathematics and the English Language Arts Standards emphasize graphic literacy (Common Core State Standards Initiative, 2010). Graph interpretation and creation are a common thread throughout core academic content areas and are instrumental in tying the three major dimensions of the NGSS together (NGSS Lead States, 2013).

These standards stipulate that students must learn a variety of skills and concepts related to graph creation and interpretation: reading and plotting points in various graph formats, the relationship between independent and dependent variables, etc. (Common Core State Standards Initiative, 2010; National Research Council, 2012; NGSS Lead States, 2013). These standards also stipulate that students must learn to apply these graph creation and interpretation skills within the context of various science fields: plotting the movement of objects in physics, graphing pollution curves in biology, charting pH over time, etc. Importantly, students must learn field-specific concepts (e.g. kinematics, population dynamics, and limnology, respectively) to be able to use graphs in each field of science. Understandably, creating and interpreting graphs is often used in school as an important means of learning these concepts (National Research Council, 2012).
Considering the diversity of graphing skills and concepts within the mathematics and science standards, it seems prudent for researchers and teachers to question whether assessment formats measure what we need them to measure. If a test item format such as multiple-choice (M-C) is unable to differentiate knowledge of science phenomena from knowledge of graphing conventions, then M-C is not an appropriate item format. Simply put, if the format of the question impedes a test's ability to differentiate degrees of knowledge mastery, or to assess graphing skills, then the format of the question needs to change. In addition, if the M-C format of the instrument unduly influences participants' thinking and actions, producing responses unlike those they would construct on a free-response (F-R) format, then serious concerns exist regarding the validity of the M-C instrument. Yet, to question the validity of the M-C instrument is to question the foundation of most of the assessment industry.

Significant to this issue of format validity is that M-C graphing questions are commonly used in educational research and on college entrance exams. In the former, they are used to measure treatment effects from graphing instruction using technologies such as Microcomputer-Based Laboratory (MBL) (Barclay, 1986; Brasell, 1987; Linn et al., 1987), Calculator-Based Laboratory (CBL) (Kwon, Kim & Woo, 2015; Lapp & Cyrus, 2000), and composite video and graphing software (Dori & Sasson, 2008; Ploetzner, Lippitsch, Galmbacher, Heuer & Scherrer, 2009; Sasson & Dori, 2012; Wu, Shah & Davidson, 2003). In the latter, many colleges rely on M-C tests such as the Scholastic Aptitude Test (SAT), ACT, Graduate Record Examination (GRE), Graduate Management Admission Test (GMAT), and Medical College Admission Test (MCAT) to confirm that students are equipped with the requisite knowledge and skills to be successful within their programs of study. However, the M-C format used within these high-stakes assessments may not provide a valid measure of what these test items claim to be assessing.

As researchers have established that M-C formats affect student responses, the purpose of this study was to investigate the validity of student-generated, empirically derived M-C test items assessing participants' abilities to create kinematics graphs. To evaluate the validity of these M-C items, participants' M-C responses were compared with F-R answers to the same graph creation task. Two main research questions examined whether the test item format used to assess graph creation abilities (1) differentially affected correct responses, and (2) differentially affected participants' attraction toward picture-of-the-event responses. Two additional research questions examined (3) how including a picture of the scenario on the instrument versus only a written description differentially affected participant responses, and (4) whether asking participants to add marker points to their constructed or chosen graph would overcome the short-circuited thinking that M-C items seem to produce.
Literature Review and Conceptual Framework

Prior to the publication of the 1999 edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 1999), it was common to refer to several types of validity. Since the publication of these Standards, validity is instead understood as a unitary concept: "It is the degree to which all the accumulated evidence supports the intended interpretation of the test scores for the
proposed purpose" (American Educational Research Association et al., 1999, p. 11). From this perspective, it is then appropriate to evaluate different types of evidence of validity rather than distinct types of validity. Such evidence might include data related to the content of the test, response processes, the internal structure of the test, and relations to other variables. These multiple forms of evidence should then be integrated to understand the kinds of interpretations that are appropriately supported for particular uses of test scores. Examining the evidence may also suggest additional or alternative factors that might be influencing scores and, therefore, invalidating the items for particular purposes (American Educational Research Association et al., 1999).

In an effort to better understand the effects of instructional and curricular methods used to improve graphing abilities, researchers and teachers often use M-C tests (Adams & Shrum, 1990; Ates & Stevens, 2003; Beichner, 1990; Brasell, 1987; Culbertson & Powers, 1959; Friedler, Nachmias & Linn, 1990; Linn et al., 1987; McKenzie & Padilla, 1984, 1986; Mokros & Tinker, 1987; Rowland & Stuessy, 1988; Svec, 1995; Tairab & Khalaf Al-Naqbi, 2004). In core academic disciplines where context is key, the main problem with relying on M-C items to measure students' mastery of content knowledge is that "a few multiple-choice questions are not enough for drawing definite conclusions about students' contextual coherence" (Savinainen & Viiri, 2008, p. 727). Attempts to identify root causes of graphing difficulties have uncovered what appear to be validity problems inherent in some of these M-C graphing assessments (Berg, 1989; Berg & Phillips, 1994; Berg & Smith, 1994).

Several researchers have identified significant discrepancies between how students respond to M-C and F-R formats of the same graphing questions. Ward, Frederiksen & Carlson (1980) concluded that "free-response and machine-scorable versions of formulating hypothesis (FH) clearly cannot be considered alternate forms of the same test" (p. 26). It is clear from research that participants respond differently to M-C questions than to F-R questions in which they generate their own responses. Apparently, M-C distracters do their job too well and attract participants to answers that they would not draw when asked to construct a graph for the same scenario.

One common interpretation challenge that learners of all ages encounter is seeing the graph as a picture of the physical event rather than as a graph, referred to as a picture-of-the-event graph difficulty or iconic graph difficulty (Berg, 1989; Berg & Smith, 1994; Barclay, 1986; Beichner, 1990; Clement, 1986; Friel et al., 2001; Glazer, 2011; Kerslake, 1977; Leinhardt et al., 1990; Schultz, Clement & Mokros, 1986). Although the picture-of-the-event difficulty is common, M-C tests greatly accentuate this problem, significantly affecting correct responses (Berg & Phillips, 1994; Berg & Smith, 1994). In these earlier studies, there was as much as a 19 % difference in correct responses, three times as many "picture-of-the-event" responses from M-C instruments, and significant differences in how M-C and F-R affected students of various ability and grade levels. Other studies have demonstrated these discrepancies in student performance across item formats.
When DeMars (1998) analyzed scores from the science and mathematics sections of a high school proficiency test and compared M-C to F-R, high-ability male students scored higher on the M-C section, whereas female students either scored higher on constructed-response (C-R) items or the gap between males and females was narrowed. In another study using computer-delivered vocabulary lessons with M-C or C-R study tasks followed by cued-recall and recognition post-tests, Clariana (2003) found that the C-R study task was significantly more effective.
The differences in student performance identified in earlier studies suggest that item format affects test content. That is, different test question formats measure different objectives. These differences by themselves do not help us resolve which item format is more valid; validity must always be judged against the intended purpose(s) of the test. In this regard, F-R question formats are inherently more valid. Ward et al. (1980) argued that F-R formats have greater "ecological validity" because "real problems, in science and in life, rarely present themselves in multiple-choice format" (p. 27). Similarly, Lissitz, Hou & Slater (2012) asserted that C-R assessments require active construction in the sense that the learner provides the response. This active construction of knowledge is more in line with the process of learning and development as defined by Piaget and Vygotsky. Discrepancies in student performance between F-R and M-C item formats therefore suggest problems with the validity of M-C item formats.

On graphing assessments, if M-C questions produce different results compared to F-R questions and affect males and females, or students with varying ability levels, differently, researchers and educators must take notice. For an assessment to provide a valid indicator of students' knowledge, understanding, or skill development, the test item format must actually measure what it claims to be measuring. As seen in prior research and in our current study, there are strong validity concerns when M-C questions attempt to measure what only F-R formats are able to measure.

Not all researchers agree that the M-C format poses a validity problem. After examining 56 studies, Rodriguez (2003) concluded that construct equivalence "appears to be a function of the item design method or the item writer's intent" (p. 163). Yet, efforts to validate M-C instruments range from pilot-study interviews that gather a sample of participants' responses to "expert panel agreement" on suitable questions and choices. In addition, M-C instruments are at times not validated by determining whether participants would really respond in the manner presented or whether the distracters are even plausible from the participants' novice-graphing point of view. Furthermore, the M-C format precludes participants from explaining their thinking, and therefore shields researchers or teachers from further insight into the thinking behind their choices. When formats allow researchers to hear participants' unforeseen explanations, perhaps logical and correct from the child's point of view, it helps to alleviate assumptions made from an adult researcher's perspective (Berg & Smith, 1994).

When formats provide both an answer and explanations or insights into participants' thinking, the evidence overwhelmingly suggests that some M-C responses scored as "incorrect" are actually found to be "correct." The reverse may also be true: M-C choices scored as correct may simply be guesses, and there is no way to know for certain whether a response is a guess. Lissitz et al. (2012) note that C-R items reduce the probability of guessing to essentially zero, while M-C formats allow a higher probability of guessing correctly, lowering test reliabilities for students who are less able. Munby (1982) defined this particular shortcoming of M-C as the "doctrine of immaculate perception"—the assumption that the student, test-maker, and scorer all perceive the same meaning in the question and choices.
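To make the guessing contrast concrete for instruments like those used in this study (our own back-of-the-envelope illustration, not a calculation reported by Lissitz et al.): with k answer choices per item, blind guessing alone yields an expected proportion correct of

$$P(\text{correct} \mid \text{guess}) = \frac{1}{k}, \qquad k = 5, \ldots, 8 \;\Rightarrow\; P \approx 0.125 \text{ to } 0.20,$$

whereas the chance of drawing the correct graph by accident on a F-R item is effectively zero.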
These problems remain and, as such, severely challenge the validity of M-C instruments for assessing certain aspects of graphing. Recognizing these problems, researchers and test makers have developed M-C tests containing plausible distracters produced by and gleaned from participants during interviews or other open-ended pilot studies. These empirically derived M-C tests
(Aikenhead, 1988) attempt to improve assessments by containing student-generated choices "grounded in the empirical data of student viewpoints rather than relying solely on instruments structured by the philosophical stances of science educators" (p. 607). These student-generated choices help narrow the discrepancy between the variations constructed by participants and those provided by researchers and/or test-makers. Aikenhead also reported that empirically derived M-C tests greatly reduced ambiguity in student responses, noting that semi-structured interviews were the least ambiguous of all measures used in the study. Additionally, Howe (1985) encouraged using data compiled from interviews to help establish the validity of quantitative measurements. Many past and current studies have used M-C to arrive at conclusions about students' graphing abilities and about the effects of using various curricula or technologies to influence graphing abilities.

As described in the next section, our study uses graphs that participants constructed during interviews and F-R tasks in previous studies as the answer choices on the M-C items in our current study. We are interested in whether this methodology for assessing graphing abilities (1) differentially affects correct responses, and (2) differentially affects participants' attraction toward picture-of-the-event answer choices. We also examine (3) the effects of including a picture of the scenario on the test instrument versus including only a written description. Finally, we measure (4) the effects of asking participants to add marker points to the constructed graph (F-R) or chosen graph (M-C) in an attempt to overcome the short-circuited thinking often produced by M-C questions.
Methods

Instruments

Six versions of the instrument were used in this study, each consisting of a set of three graphing questions. Utilizing insights from prior research and student-generated graphs (Berg, 1989; Berg & Phillips, 1994; Berg & Smith, 1994), we constructed F-R and M-C instruments. The choices used on the M-C instrument were empirically derived from the 700+ participant-constructed graphs from the prior clinical-interview and F-R studies. In other words, instead of using choices on the M-C version that were possible distractors based on adult thinking, the choices provided were based on student-constructed graphs—what children tend to draw as they respond to the provided graphing question/scenario.

Version 1 contained a drawing of the scenario and a F-R blank graph (see Fig. 1). Version 2 contained a drawing of the scenario and M-C responses (see Fig. 2). Version 3 contained a written description of the scenario and F-R blank graphs—identical to version 1 except that the drawing of the scenario was absent. Version 4 contained a written description of the scenario and M-C responses—identical to version 2 except that the drawing of the scenario was absent. Version 5 contained a drawing of the scenario, F-R blank graphs, and additional instructions to add marks to the graph to indicate when the moving participant or object was at a particular place(s). Specifically, the following instructions were added: "Then, on the graph place a B to show where the ball is at the bottom of the large incline and a T for the top of the small incline."
Fig. 1 Version 1—drawing of the scenario and F-R blank graph
The drawing of the scenario also had a B and a T that corresponded with the instructions (see Fig. 3). Version 6 contained a drawing of the scenario, M-C responses, and additional instructions to add marks to the graph corresponding to when the participant or object was at a particular place(s).

Versions were constructed using stem-equivalent items to control for content differences and to isolate the format effect (Ackerman & Smith, 1988). Each version contained a set of three graphing questions used in prior research studies. Graphing question 1 (Ball-Hill) (Berg & Smith, 1994; Mokros & Tinker, 1987) asked participants to think about and graph the speed of a ball as it travels down a hill, up a small incline, and onto a flat surface. Graphing question 2 (Walk-Wall) asked participants to think about and graph their walk to a wall and back in terms of distance away from the wall. Graphing question 3 (Bike-Hill) asked participants to think about and graph their speed over time as they ride a bicycle up a hill and down the other side.

High validity was maintained by the very nature of the student-generated responses on the F-R instrument.
Fig. 2 Version 2—drawing of the scenario and M-C responses
The empirically derived choices on the M-C questions provided naturally higher validity than regular M-C items because of the student-generated choices. As explained earlier, the choices used on the M-C version were not based on adult speculation about what might make suitable distractors; rather, the choices came from what students would normally draw on their own in response to the graphing questions. The F-R, participant-constructed graphs were scored by one individual, who matched each constructed graph with preset categories corresponding to the various distracters on the M-C version.
Fig. 3 Version 5—example of instructions to place marker points
Using the F-R format established a high level of reliability because participants constructed their own graphs. Reliability of scoring the participants' responses was determined by having a second scorer examine and categorize a sample of the participants' responses and then comparing the scores. Since M-C responses were not open to interpretation by the scorer, only the F-R formats (versions 1, 3, and 5) were scored by the second scorer. Scorers matched the F-R drawn graphs with the empirically derived graphs used as correct, incorrect, and picture-of-the-event choices on the M-C. Version 5 (F-R), in which participants added marks to their graph (such as B and T), minimized the scorer's interpretation of responses and was helpful in a manner similar to asking participants to explain their graphs. For the approximately one-third of F-R participants' graphs scored by a second scorer, the inter-rater reliability was 96 %.

Participants and Procedures

Each participant answered only one version of the graphing questions. The six versions were distributed across the 736 participants, with approximately 122 participants per version. The 736 participants in grades 7–12 included 53 % males and 47 % females. Participants' teachers provided a classroom performance ranking for each participant based on a full school year of classroom activity and work. Participants were ranked by the teacher as "high classroom performance" if, in general, they were in the top 20 % of students in their class, and "low" if performing in the lower 20 % of students. All other participants were ranked as having a "medium" level of classroom performance. This method of categorizing participants by ability level was used because the classroom teachers had spent the previous 7 months teaching and interacting with these students. Consequently, they had consistent access to a large amount of data for each individual (e.g. classroom assignments, quizzes, and tests). We believe the prolonged time and extensive data enabled the teachers to accurately place each participant into a low, medium, or high category. Final tallies indicate that the performance groupings comprised low (26 %), medium (51 %), and high (22 %) ability levels.
Data Analysis

Statistical analysis was completed using the log-linear procedure in SPSS/PC, resulting in parameter estimates composed of the value of the coefficient, the standard error of the coefficient, the standardized value (labeled z-score) of the coefficient, and the confidence interval for the coefficient. The significant effects reported here are designated in terms of z-scores and levels of significance.
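For readers who want to run a comparable analysis outside SPSS/PC, the sketch below shows one way to fit a log-linear model to a format-by-response contingency table in Python with statsmodels. It is a minimal illustration under assumed data: the cell counts, variable names, and model formula are hypothetical stand-ins, not the study's data or the authors' exact SPSS specification.

```python
# Minimal sketch (assumed setup, not the authors' SPSS/PC analysis): fit a
# log-linear (Poisson) model to a hypothetical instrument-format-by-response
# contingency table and read off the coefficient, standard error, z value,
# and confidence interval for each parameter.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cell counts: format (F-R vs. M-C) crossed with response
# category (correct, picture-of-the-event, other). Illustrative numbers only.
cells = pd.DataFrame({
    "fmt":      ["F-R", "F-R", "F-R", "M-C", "M-C", "M-C"],
    "response": ["correct", "picture", "other"] * 2,
    "count":    [93, 91, 61, 69, 169, 7],
})

# Saturated log-linear model: the fmt:response interaction terms capture
# whether the distribution of response categories depends on item format.
model = smf.glm("count ~ fmt * response",
                data=cells,
                family=sm.families.Poisson()).fit()

# summary() lists, for each coefficient: the estimate, its standard error,
# the z value, and the confidence interval -- the quantities described in
# the Data Analysis section above.
print(model.summary())
```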
Results

This study examined whether empirically derived M-C questions produced results similar to those produced by F-R graphing questions. The results were compared on two main indicators: correct responses and picture-of-the-event responses. Additionally, the following factors were analyzed in order to better understand the strengths and weaknesses of using M-C or F-R instruments: the effect of having a picture of the scenario present (or not), and the effect of having participants note specific marker points on the graph that corresponded to specific points in time in the scenario being graphed, in order to better understand their constructed or chosen graph.

Does the Format Affect Correct Responses?

The results of this empirically derived M-C and F-R comparison study reinforced findings from an earlier study indicating that M-C and F-R produced significantly different results for two of the three graphing scenarios. For example, participants scored significantly more correct responses on the F-R instrument for both the Ball-Hill and Walk-Wall graphs (see Table 1).

Table 1 Does the format affect the number of correct responses?

Graph        Percent correct, F-R (versions 1 and 3)    Percent correct, M-C (versions 2 and 4)    z-score
Walk-Wall    38                                          28                                          2.19**
Ball-Hill    41                                          34                                          1.82*
Bike-Hill    38                                          45                                          1.44

*p < .10; **p < .05

Does the Format Affect Picture-of-the-Event Responses?

For the question "Does the format differentially affect picture-of-the-event responses?", M-C instruments produced significantly more "picture-of-the-event" responses than F-R: as much as 24 percentage points more for the ball graph, 32 points more for the walk graph, and 10 points more for the bike graph (see Table 2).

Table 2 Does the format affect the number of "picture-of-the-event" responses?

Graph        Percent "picture", F-R (versions 1 and 3)    Percent "picture", M-C (versions 2 and 4)    z-score
Walk-Wall    37                                            69                                            6.75****
Ball-Hill    30                                            54                                            5.12****
Bike-Hill    36                                            46                                            2.11**

**p < .05; ****p < .001

For the question as to whether format differentially affected student performance levels and grade levels, the results indicate that both performance levels and grade levels were differentially affected by the two formats (see Tables 3 and 4). Even though M-C participants always produced greater numbers of "pictures" than F-R participants for all three graphing questions, participants having "low classroom performance" responded with significantly more "pictures" on the M-C than participants having "high" or "medium" performance.
In addition, for two of the graphing questions (Ball-Hill and Walk-Wall), participants in the higher grades (10th–12th) responded with significantly more pictures on the M-C than participants in the lower grades (7th–9th).

Table 3 Interaction between instrument type and ability level regarding "picture" responses

Graph        z-score
Walk-Wall    2.47**a
Ball-Hill    3.35***b
Bike-Hill    2.55**c

**p < .05; ***p < .01
a Low-ability participants produced significantly more pictures on M-C than medium-ability participants
b Low-ability participants produced significantly more pictures on M-C than high- or medium-ability participants
c Low-ability participants produced significantly more pictures on M-C than high-ability participants

Table 4 Interaction between instrument type and grade level regarding "picture" responses

Graph        z-score
Walk-Wall    2.39**a
Ball-Hill    3.04***b
Bike-Hill    1.54

**p < .05; ***p < .01
a Grade 10–12 participants produced significantly more pictures than grade 7–9 participants on M-C
b Grade 10–12 participants produced significantly more pictures than grade 7–9 participants on M-C

Does Presence or Absence of the Drawing of the Scenario Affect Responses?

Results indicated that the presence of the drawing positively affected correct responses (see Table 5). This was true for two of the three M-C questions and five of the six F-R questions. When there were significant differences between versions, it was always the version lacking the drawing of the scenario that produced more picture-of-the-event responses and fewer correct responses; the lower number of correct responses was directly attributable to the significantly greater number of picture-of-the-event responses.

Does Adding Marks to Indicate Transition Points Affect Responses?

Results indicate that adding marks produced significantly more correct responses on F-R than on M-C in all three graphing scenarios (see Table 5). In version 5 (F-R) and in version 6 (M-C), participants were asked to add marks to indicate transition points on the graph. For example, on the Ball-Hill graph, when the ball reached the bottom of the hill, the participant would label that point on their graph with a B.
The purpose was to determine whether adding a mark would focus the participant's attention when drawing or choosing a graph, resulting in a higher correct-response rate. For the Walk-Wall graph, participants were asked to place marks on the graph corresponding to reaching the opposite wall and then being back at the starting point.

Does Plotting Points Affect Participant Responses?

As shown in Table 6, the results indicate that, as a group, participants who visibly plotted points experienced greater success in terms of correct responses on two of the three graphing scenarios. Participants who physically plotted points on F-R experienced greater success compared to those who did not show evidence of plotting points.
Discussion

In this study, we examined whether empirically derived, participant-generated graphs used as choices on M-C graphing instruments produced results that corresponded to participants' responses on F-R instruments. The final tallies and analysis of responses indicate that M-C and F-R produced quite different results.
Table 5 Including/not including the drawing of the scenario: effect on correct responses (z-scores)

Graph        Versions 1 vs 3        Versions 2 vs 4        Versions 5 vs 6
             F-R: pic vs no pic     M-C: pic vs no pic     F-R vs M-C: marks
Walk-Wall    0.34                   1.54*                  2.70***a,b
Ball-Hill    1.92*a                 1.91*a                 3.07***c
Bike-Hill    2.42**a                4.57****               1.70*

*p < .10; **p < .05; ***p < .01; ****p < .001
a Indicates significantly more pictures (from the version without the drawing of the scenario)
b Indicates significantly more pictures on the M-C version and significantly more correct responses on the F-R version
c Indicates the M-C version produced significantly more correct responses
Table 6 Participants' percent correct responses when plotting points versus not plotting points

Graph        Plotted pts. (%)    No pts. (%)    z-score
Walk-Wall    50                  36             1.93*
Ball-Hill    55                  38             2.47**
Bike-Hill    41                  38             0.48

*p < .10; **p < .05
Utilizing empirically derived M-C produced results similar to regular M-C, consequently bringing into question the validity of such instruments as measures of graphing ability.

Correct Responses and Picture-of-the-Event Responses

Prior studies have demonstrated that in some instances, M-C questions are not a valid indicator of graphing abilities (Berg, 1989; DeMars, 1998; Ward et al., 1980). Multiple sources of evidence have questioned the validity of M-C items: up to 19 % more correct responses when using F-R, three times as many picture-of-the-event responses from M-C instruments, and significant differences in how M-C and F-R affect students of various ability and grade levels (Berg & Phillips, 1994; Berg & Smith, 1994). Beichner (1990) also recognized M-C items as having a problem with validity when stating that in "most cases, students would pick distracters suggested by commonly seen graphing misconceptions" (p. 808). If this disparity in percentage of correct responses were part of a study designed to determine the effects of some particular technological treatment such as MBL, CBL, or video simulation, the conclusions would be significantly flawed.

The discrepancies between the results of F-R and empirically derived M-C appear to result from M-C participants being significantly affected by picture-of-the-event distractors, which in turn reduced their correct responses. The M-C format apparently did not stimulate the use of strategies that would have enabled participants to overcome their attraction to picture-of-the-event choices. More alarming was that this negative effect was greater on participants having "low classroom performance."

Taken together, our results suggest that certain distractors had a "priming effect" (Kahneman, 2011) that led participants to unknowingly select a choice that represented their intuitive rather than considered response. The M-C distractors acted as subtle environmental cues, which short-circuited participants' conscious considerations. Instead of validly assessing participants' graphing abilities, the picture-of-the-event distractors were more likely to tap participants' intuitive tendencies, yielding an invalid measure. By knowingly cueing intuitive responses with picture-of-the-event distractors when the objective of the measure is to assess graphing abilities, we might be intellectually entrapping our students. Especially for low-ability students, the consequences of these unethical assessment practices can be deleterious (American Educational Research Association, 2014).
Does Presence or Absence of the Drawing of the Scenario Affect Responses?

A prior study (Berg & Smith, 1994) investigated the picture-of-the-event phenomenon and statements by Mokros & Tinker (1987) suggesting that perhaps "students were confused and chose the 'picture' simply because of the strong response set created by the visual depiction of the ball going down the hill" (p. 377). Therefore, versions were created to answer the question: Do strong visual depictions in the scenarios, the person walking back and forth across the stage (Walk-Wall), the bike going over the hill (Bike-Hill), or the ball going down the hill and back up (Ball-Hill), significantly influence the learner and affect the response? As such, for all three scenarios, one-half of the M-C and F-R instruments were modified to contain only the written description of the event; the drawing of the scenario was not included.

The results reinforce earlier findings (Berg & Smith, 1994) that (1) participants provided significantly more correct responses when the drawing of the scenario was present on the instrument, (2) participants often did significantly better in answering questions correctly on the forms that included the drawing of the scenario, and (3) more pictures came from versions lacking the drawing of the scenario. It seems that having a picture of the scenario helps participants achieve more correct responses (see Table 5), the opposite of Mokros and Tinker's (1987) prediction, and perhaps contrary to what one might expect based on intuition.

Does Asking Participants to Add Marks Indicating Transition Points Affect Responses?

As discussed in Berg & Smith (1994), constructing a graph or interpreting a graph (i.e. choosing the best graph) would seem to involve a similar process of point-by-point analysis until enough points are established for the graph to be constructed on the F-R or for choices to be eliminated on the M-C. This is readily apparent when students first plot points and then draw a line on a F-R instrument. We proposed that for participants to be successful on M-C instruments, they would have to plot, physically or mentally, enough points to locate a trend, interpolate, or extrapolate points in order to choose a correct response. As such, perhaps forcing or encouraging participants to focus on transition points (key aspects of the scenario or graph) would assist them in constructing an accurate graph, or perhaps provide an extra buffer against the influence of the M-C choices toward which they are unduly drawn and which result in picture-of-the-event answers. In addition, asking participants to add marks on M-C or F-R might cause them to examine or re-examine their graph, and at the same time provide additional cues to help delineate correctness.

For example, one subset of the F-R instrument asked participants to place a B on the graph just constructed, where the ball would be at the bottom of the large incline, and a T on the graph indicating the position where the ball would be at the top of the small incline (see Fig. 3). This process greatly helped us score the constructed graphs (see Fig. 4). First, we should note that without the B and T marks, some responses might have been incorrectly categorized. Yet, for every Ball-Hill and Bike-Hill M-C response, there was no way of knowing which responses were complete guesses, and there were usually no indicators to aid the scorers.
Fig. 4 Examples of how marks B and T aided in scoring correctly
The Walk-Wall M-C choices had marks labeled (see Fig. 5), as prior research (Berg & Smith, 1994) had determined that without these marks, substantial mis-scoring would have resulted. It is interesting to note that even with marks that might have helped cue the participant, the Walk-Wall graph still produced the most picture-of-the-event responses and the largest disparity in correct responses between F-R and M-C. However, the results were mixed with regard to the benefit to participants. In one case (Walk-Wall), F-R participants who were asked to add these indicator points did significantly better than F-R participants who did not add marks. Conversely, in another case (Ball-Hill), F-R participants who added marks did significantly worse than F-R participants who did not add marks.

Does Plotting Points Affect Participant Responses?

When scoring the responses on the F-R instrument, we also recorded when participants used visible points to aid in plotting a graph. First, it should be noted that almost no M-C participants used the strategy of plotting visible points on the choices as they attempted to eliminate distracters (although this may have been done mentally). Conversely, for each graphing question, approximately 25 % of the F-R participants chose, on their own initiative, to plot points as they constructed a line graph. Based upon results from the prior study, it seemed that plotting points, either mentally or physically, is a requisite starting point for participants to experience graphing success (Berg & Smith, 1994). Participants mentally plotting the points sometimes drew the whole line in one fluid motion, while other participants drew the line in spurts that still resulted in one fluid line in which the segments were not apparent (from observations of participants). We must assume that some participants mentally plotted points on both F-R and M-C. Although we cannot know whether one version stimulated more mental point plotting, or how that affected graphing success, we do know that F-R stimulated much more visible point plotting. This led us to question whether this point-plotting strategy translated into greater graph interpretation success.
Fig. 5 Example of Walk-Wall M-C with marks
For the question "Did participants who physically plotted points on F-R experience any greater success than those who did not show evidence of plotting points?", the results in Table 6 indicate that in two of the three graphing scenarios, participants who visibly plotted points experienced greater success in terms of correct responses. The gains ranged from 3 to 17 percentage points more correct responses for participants who plotted points. We must assume that the sample of participants who used M-C was equally capable of using this helpful strategy. M-C instruments, however, did not stimulate participants to plot points during the process of choosing an answer, as very few plotted points on the M-C instruments. Potentially, M-C items short-circuited participants' basic graphing strategies that lead to success, as indicated in part by the almost complete lack of M-C participants who plotted points. This negative effect on participants' success contributes to the conclusion that an M-C instrument is a poor choice as a valid indicator of graphing abilities.

Limitations of the Study

While participants could have been assessed on many different types of graphing questions, with varying complexity and subject matter, earlier research had provided extensive insights into student thinking on the three graphing scenarios used in this
study and, more importantly, the empirically derived graph choices that served as the M-C distractors. One limitation of this study pertains to using only speed-time graphs, leaving open questions with regard to how empirically derived M-C choices for other content compare with F-R format responses.

A second limitation pertains to one of the specific graphs used in this study. After using these three graphing scenarios in multiple studies, the authors have less confidence in the Bike-Hill scenario compared to the other scenarios used in this study. For the Bike-Hill scenario, participants were instructed to graph their speed when riding a bike toward and up a hill, then down the other side. Readers might have noticed that there were both more correct responses and more picture-of-the-event responses on the M-C version of this scenario than on the F-R version, a different result from the other two graphing scenarios. The authors suggest that the nature of the Bike-Hill scenario allows for a wide variety of personalized interpretations. For example, one participant might pedal hard and fast to the bottom of the hill so that their speed carries them partway up the hill before they have to work hard with slow and steady pedaling to get to the top. A different participant might pedal steadily and slowly all the way from the beginning of the ride to the top of the hill before letting gravity bring them down. Finally, another participant might pedal hard all the way down the hill, resulting in greater speed. The other two scenarios would seem to have less potential for personalized interpretation. As such, the authors have less confidence in the Bike-Hill results because of these potential personalized variations and would exclude this scenario from future studies.
Conclusion

Prior to the publication of the 1999 edition of the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999), it was common to refer to several types of validity. Since the publication of these Standards, validity is instead understood as a unitary concept: "It is the degree to which all the accumulated evidence supports the intended interpretation of the test scores for the proposed purpose" (American Educational Research Association et al., 1999, p. 11). From this perspective, it is then appropriate to evaluate different types of evidence of validity rather than distinct types of validity. Such evidence might include data related to the content of the test, response processes, the internal structure of the test, and relations to other variables. These multiple forms of evidence should then be integrated to understand the kinds of interpretations that are appropriately supported for particular uses of test scores. Examining the evidence may also suggest additional or alternative factors that might be influencing scores and, therefore, invalidating the items for particular purposes (American Educational Research Association et al., 1999).

This research has further demonstrated that the format of a graphing instrument can significantly affect participant responses and, therefore, calls into question the validity of the instrument. The M-C format continues to fulfill the assessment industry's criteria of a simple, fast, and economical means of collecting student responses. This study, however, presented several sources of evidence that question the validity of using M-C questions to measure students' graphing abilities. Assessing students' graphing abilities with F-R questions that ask them to construct their own graphs provides a more valid
measure of graphing ability than M-C questions that only ask them to select "the best" choice out of four options on an M-C test. In addition to providing a more valid measure of overall graphing ability, F-R questions also provide opportunities to identify the background knowledge our students bring to graphing tasks and their understanding of discipline-specific knowledge. The latter, heavily affected and influenced by the instrument itself, too easily permits the omission of the knowledge the question is trying to assess. Unfortunately, the questionable validity of M-C tests makes it impossible to have confidence in any inferences based on those measures.

In addition, this research and earlier research (Berg & Smith, 1994) demonstrate that testing fairness, or in this case unfairness, is an issue when using M-C, defined in Pellegrino, Chudowsky & Glaser (2001, p. 39) as occurring when "test scores underestimate or overestimate the competencies of members of a particular group." Test fairness appears to be a serious problem associated with using M-C or empirically derived M-C to assess graphing. Therefore, F-R graphing assessments are a better option. In particular, having students plot data points before constructing their F-R graph seems to avoid the priming effect (Kahneman, 2011) and yields a more valid measure of their graphing ability.

Prior studies that used M-C to measure treatment effects should be re-examined for the validity of student responses. Results of this study indicate that using more C-R or F-R instruments increases the validity of the assessment and provides a means of assessing graphing abilities. Using valid measures will help researchers better understand the root causes of graphing difficulties, provide a true measure of learners' graphing abilities, and determine with more confidence the effects that technology such as MBL has on learners' abilities to construct and interpret graphs.
References

Ackerman, T. A. & Smith, P. L. (1988). A comparison of the information provided by essay, multiple-choice, and free-response writing tests. Applied Psychological Measurement, 12(2), 117–128.
Adams, D. D. & Shrum, J. W. (1990). The effects of microcomputer-based laboratory exercises on the acquisition of line graph construction and interpretation skills by high school biology students. Journal of Research in Science Teaching, 27(8), 777–787.
Aikenhead, G. (1988). An analysis of four ways of assessing student beliefs about STS topics. Journal of Research in Science Teaching, 25(8), 607–629.
American Educational Research Association (2014). Standards for educational and psychological testing. Washington, DC: AERA.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Ates, S. & Stevens, J. T. (2003). Teaching line graphs to tenth grade students having different cognitive developmental levels by using two different instructional modules. Research in Science & Technological Education, 21(1), 55–66.
Barclay, W. (1986). Graphing misconceptions and possible remedies using microcomputer based labs. Paper presented at the 7th National Educational Computing Conference, University of San Diego, CA.
Beichner, R. J. (1990). The effect of simultaneous motion presentation and graph generation in a kinematics lab. Journal of Research in Science Teaching, 27(8), 803–815.
Berg, C. (1989). An investigation of the relationship between logical thinking structures and the ability to construct and interpret line graphs. (Unpublished doctoral dissertation). Iowa City, IA: The University of Iowa.
Berg, C. & Phillips, D. (1994). An investigation of the relationship between logical thinking structures and the ability to construct and interpret line graphs. Journal of Research in Science Teaching, 31(4), 323–344.
Berg, C. & Smith, P. (1994). Assessing students' abilities to construct and interpret line graphs: Disparities between multiple-choice and free-response graphs. Science Education, 78(6), 527–554.
Boote, S. K. (2014). Assessing and understanding line graph interpretations using a scoring rubric of organized cited factors. Journal of Science Teacher Education, 25(3), 333–354.
Brasell, H. M. (1987). Effect of real time laboratory graphing on learning graphic representations of distance and velocity. Journal of Research in Science Teaching, 24, 385–395.
Brasell, H. M. (1990). Graphs, graphing, graphers. In M. B. Rowe (Ed.), What research says to the science teacher (Vol. Six). Washington, DC: The National Science Teachers Association.
Clariana, R. B. (2003). The effectiveness of constructed-response and multiple-choice study tasks in computer-aided learning. Journal of Educational Computing Research, 28(4), 395–406.
Clement, J. (1986). The concept of variation and misconception in Cartesian graphing. Paper presented at the meeting of the American Educational Research Association, San Francisco, CA.
Common Core State Standards Initiative (2010). Common Core State Standards for Mathematics (CCSSM). Washington, DC: National Governors Association Center for Best Practices and the Council of Chief State School Officers.
Culbertson, H. M. & Powers, R. D. (1959). A study of graph comprehension difficulties. Educational Technology Research and Development, 7(3), 97–110.
DeMars, C. E. (1998). Gender differences in mathematics and science on a high school proficiency exam: The role of response format. Applied Measurement in Education, 11(3), 279–299.
Dori, Y. J. & Sasson, I. (2008). Chemical understanding and graphing skills in an honors case-based computerized chemistry laboratory environment: The value of bidirectional visual and textual representations. Journal of Research in Science Teaching, 45(2), 219–250.
Friedler, Y., Nachmias, R. & Linn, M. C. (1990). Learning scientific reasoning skills in microcomputer-based laboratories. Journal of Research in Science Teaching, 27(2), 173–192.
Friel, S. N., Curcio, F. R. & Bright, G. W. (2001). Making sense of graphs: Critical factors influencing comprehension and instructional implications. Journal for Research in Mathematics Education, 32(2), 124–158.
Glazer, N. (2011). Challenges with graph interpretation: A review of the literature. Studies in Science Education, 47(2), 183–210.
Howe, K. R. (1985). Two dogmas of educational research. Educational Researcher, 14(8), 10–18.
Kahneman, D. (2011). Thinking, fast and slow. New York: Farrar, Straus and Giroux.
Keller, S. K. (2008). Levels of line graph question interpretation with intermediate elementary students of varying scientific and mathematical knowledge and ability: A think aloud study. (Unpublished doctoral dissertation). Retrieved from ProQuest LLC. (UMI Microform 3340991).
Kerslake, D. (1977). The understanding of graphs. Mathematics in Schools, 6(2), 22–25.
Kimbal, M. (1967). Understanding the nature of science: A comparison of scientists and science teachers. Journal of Research in Science Teaching, 5, 110–120.
Kwon, C., Kim, Y. & Woo, T. (2015). Digital–physical reality game: Mapping of physical space with fantasy in context-based learning games. Games and Culture. doi:10.1177/1555412014568789.
Lapp, D. A. & Cyrus, V. F. (2000). Using data-collection devices to enhance students' understanding. Mathematics Teacher, 93(6), 504–510.
Leinhardt, G., Zaslavsky, O. & Stein, M. K. (1990). Functions, graphs, and graphing: Tasks, learning, and teaching. Review of Educational Research, 60(1), 1–64.
Linn, M. C., Layman, J. & Nachmias, R. (1987). Cognitive consequences of microcomputer based laboratories: Graphing skills development. Contemporary Educational Psychology, 12, 244–253.
Lissitz, R. W., Hou, X. & Slater, S. C. (2012). The contribution of constructed response items to large scale assessment: Measuring and understanding their impact. Journal of Applied Testing Technology, 13(3), 1–50.
Macdonald-Ross, M. (1977). How numbers are shown. AV Communication Review, 25(4), 359–409.
McDermott, L., Rosenquist, M., Popp, B. & van Zee, E. (1983). Student difficulties in connecting graphs, concepts and physical phenomena. Paper presented at the meeting of the American Educational Research Association, Montreal, Canada.
McKenzie, D. L. & Padilla, M. J. (1984). Effects of laboratory activities and written simulations on the acquisition of graphing skills by eighth grade students. Paper presented at the National Association for Research in Science Teaching, New Orleans, LA.
McKenzie, D. L. & Padilla, M. J. (1986). The construction and validation of the test of graphing in science (TOGS). Journal of Research in Science Teaching, 23(7), 571–579.
Mokros, J. R. (1986). The impact of MBL on children's use of symbol systems. Paper presented at the meeting of the American Educational Research Association, San Francisco, CA.
Mokros, J. R. & Tinker, R. F. (1987). The impact of microcomputer based science labs on children's ability to interpret graphs. Journal of Research in Science Teaching, 24, 369–383.
Munby, H. (1982). The place of teachers' beliefs in research on teacher thinking and decision making, and an alternative methodology. Instructional Science, 11, 201–225.
National Research Council (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. Washington, DC: The National Academies Press.
NGSS Lead States (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press.
Pellegrino, J. W., Chudowsky, N. & Glaser, R. (Eds.) (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.
Ploetzner, R., Lippitsch, S., Galmbacher, M., Heuer, D. & Scherrer, S. (2009). Students' difficulties in learning from dynamic visualisations and how they may be overcome. Computers in Human Behavior, 25(1), 56–65.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184.
Rowland, P. & Stuessy, C. L. (1988). Matching mode of CAI to cognitive style: An exploratory study. Journal of Computers in Mathematics and Science Teaching, 7(4), 36–40, 55.
Sasson, I. & Dori, Y. J. (2012). Transfer skills and their case-based assessment. In B. J. Fraser, K. Tobin & C. J. McRobbie (Eds.), Second international handbook of science education (pp. 691–709). Netherlands: Springer.
Savinainen, A. & Viiri, J. (2008). The force concept inventory as a measure of students' conceptual coherence. International Journal of Science and Mathematics Education, 6(4), 719–740.
Schultz, K., Clement, J. & Mokros, J. (1986). Adolescent graphing skills: A descriptive analysis. Paper presented at the meeting of the American Educational Research Association, San Francisco.
Shaw, E. L., Padilla, M. J. & McKenzie, D. L. (1983, April). An examination of the graphing abilities of students in grades seven through twelve. Paper presented at the meeting of the National Association for Research in Science Teaching, Dallas, TX.
Svec, M. T. (1995). Effect of micro-computer based laboratory on graphing interpretation skills and understanding of motion. Paper presented at the annual meeting of the National Association for Research in Science Teaching, San Francisco, CA.
Tairab, H. H. & Khalaf Al-Naqbi, A. K. (2004). How secondary school science students interpret and construct scientific graphs. Journal of Biological Education, 38(2), 119–124.
Ward, W. C., Frederiksen, N. & Carlson, S. B. (1980). Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement, 17(1), 11–28.
Weintraub, S. (1967). Reading graphs, charts and diagrams. Reading Teacher, 20, 345–349.
Wu, Y., Shah, J. J. & Davidson, J. K. (2003). Computer modeling of geometric variations in mechanical parts and assemblies. Journal of Computing and Information Science in Engineering, 3(1), 54–63.