J Sci Educ Technol (2014) 23:641–657 DOI 10.1007/s10956-014-9491-y
Development of a Short-Form Measure of Science and Technology Self-efficacy Using Rasch Analysis Richard L. Lamb • David Vallett • Leonard Annetta
Published online: 23 March 2014 © Springer Science+Business Media New York 2014
Abstract Despite an increased focus on science, technology, engineering, and mathematics (STEM) in U.S. schools, today's students often struggle to maintain adequate performance in these fields compared with students in other countries (Cheek in Thinking constructively about science, technology, and society education. State University of New York, Albany, 1992; Enyedy and Goldberg 2004; Mandinach and Lewis 2006). In addition, despite considerable pressure to promote the placement of students into STEM career fields, U.S. placement is relatively low (Sadler et al. in Sci Educ 96(3):411–427, 2012; Subotnik et al. in Identifying and developing talent in science, technology, engineering, and mathematics (STEM): an agenda for research, policy and practice. International handbook, part XII, pp 1313–1326, 2009). One explanation for the decline of STEM career placement in the U.S. rests with low student affect concerning STEM concepts and related content, especially in terms of self-efficacy. Researchers define self-efficacy as the internal belief that a student can succeed in learning, and hold that understanding of student success lies in students' externalized actions or behaviors (Bandura in Psychol Rev 84(2):191–215, 1977). Evidence suggests that high self-efficacy in STEM can result in student selection of STEM in later educational endeavors, culminating in STEM career selection (Zeldin et al. in J Res Sci Teach 45(9):1036–1058, 2007). However, other factors such as proficiency play a role as well. The lack of appropriate measures of self-efficacy can greatly affect STEM career selection due to inadequate targeting of this affective trait and loss of opportunity for early intervention by educators. Lack of early intervention decreases selection of STEM courses and careers (Valla and Williams in J Women Minor Sci Eng 18(1), 2012; Lent et al. in J Couns Psychol 38(4), 1991). Therefore, this study developed a short-form measure of self-efficacy to help identify students in need of intervention.

Keywords Self-efficacy · Rasch analysis · General science · Technology use · Instrumentation

R. L. Lamb (✉)
Department of Teaching and Learning, Washington State University, 321 Cleveland Hall, Pullman, WA 99164, USA
e-mail: [email protected]

D. Vallett
Department of Teaching and Learning, University of Nevada Las Vegas, Las Vegas, NV 89109, USA
e-mail: [email protected]

L. Annetta
College of Education and Human Development, George Mason University, 4400 University Dr., Fairfax, VA 22036, USA
e-mail: [email protected]
Introduction

Despite an increased focus on science, technology, engineering, and mathematics (STEM) in U.S. schools, today's students often struggle to maintain adequate performance in these fields compared with students in other countries (Cheek 1992; Enyedy and Goldberg 2004; Mandinach and Lewis 2006). In addition, despite considerable pressure to promote the placement of students into STEM career fields, U.S. placement is relatively low (Sadler et al. 2012; Subotnik et al. 2009). One explanation for the decline of STEM career placement in the U.S. rests with low student affect concerning STEM concepts and related content, especially in terms of self-efficacy. Researchers define self-efficacy as the internal belief that a student can succeed in
learning, and hold that understanding of student success lies in students' externalized actions or behaviors (Bandura 1977). Evidence suggests that high self-efficacy in STEM can result in student selection of STEM in later educational endeavors, culminating in STEM career selection (Zeldin et al. 2007). However, other factors such as proficiency play a role as well. The lack of appropriate measures of self-efficacy can greatly affect STEM career selection due to inadequate targeting of this affective trait and loss of opportunity for early intervention by educators. Lack of early intervention decreases selection of STEM courses and careers (Valla and Williams 2012; Lent et al. 1991). Therefore, this study developed a short-form measure of self-efficacy to help identify students in need of intervention.

Theoretical Framework

Measuring self-efficacy is difficult due to the latent nature of the response variable (Scherbaum et al. 2006). Ketelhut (2010) developed a context-specific measure of self-efficacy for technology and science called the self-efficacy in technology and science survey (SETS). The original survey was developed from previously published survey questions found in peer-reviewed science and technology education journals. The original SETS survey contains 67 questions, coupled with 10 demographic questions, for a total of 77 questions. Because of the large number of questions, respondents can experience survey fatigue and therefore fail to complete the survey or fill in nonsense answers (Porter et al. 2004; Savage and Waldman 2008). By reducing the number of questions, we addressed this issue in order to produce higher-quality data and a greater likelihood of accurate survey completion. By using the Rasch model [under item response theory (IRT)], we developed an instrument with fewer items while maintaining the psychometric properties of the original measure. IRT analyses such as Rasch analysis provide psychometric information about an instrument to facilitate logical, substantiated modifications, resulting in improved precision and efficacy in measurement. These modifications are often necessary when converting item responses to the common unit of logits and measuring a common underlying attribute, such as self-efficacy, with theta (Θ). A common perception based on classical test theory (CTT) is that shorter surveys produce less consistent and less reliable results (Hays et al. 2008). However, these differences in perception arise from CTT's fundamental approach to the measurement of traits, which is survey specific. Rasch modeling allows for a smaller number of survey items due to its focus on item responses rather than true-score approximations that are survey specific.
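To make the logit-based approach concrete, the sketch below computes category endorsement probabilities under Andrich's rating scale model, the polytomous form of the Rasch model applied later in this study. The person measure, item endorsability, and threshold values are made-up illustrations, not calibrations from this paper.

```python
import math

def rating_scale_probs(theta, delta, thresholds):
    """Probability of endorsing each Likert category under the
    Andrich rating scale model (a polytomous Rasch model).

    theta      -- person measure in logits (self-efficacy, theta)
    delta      -- item endorsability in logits
    thresholds -- K step parameters (tau) shared by all items,
                  giving K + 1 response categories
    """
    # Cumulative kernel: running sum of (theta - delta - tau_j),
    # with the kernel for the lowest category fixed at 0.
    kernels = [0.0]
    for tau in thresholds:
        kernels.append(kernels[-1] + (theta - delta - tau))
    exps = [math.exp(k) for k in kernels]
    total = sum(exps)
    return [e / total for e in exps]

# A respondent 1 logit above an item's endorsability on a 5-point scale.
probs = rating_scale_probs(theta=1.0, delta=0.0,
                           thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])  # category probabilities sum to 1.0
```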
The need for fewer items results from the parameterization of response-pattern likelihoods and assumed ability distributions (Stone 2005). Development of a screening short-form version of the SETS survey allows researchers and teachers to quickly identify "at-risk" students who show lower levels of self-efficacy in science or technology. This provides an important means of addressing our nation's crisis in STEM, since students exhibiting lower levels of self-efficacy are more likely to be unsuccessful in science classes mediated by technology (Lamb and Annetta 2012a, b). Educators can employ attributional retraining (AR) to help prevent student learning failures due to poor affect (McGrath and Braunstein 1997). With the increased reliance on technology (computer use, serious educational games, and online learning) for science education, identifying students with low self-efficacy in technology is important. Self-efficacy is an affective attributional characteristic that can be improved with interventions via attributional retraining. This suggests that several external stimuli affect students' self-efficacy, rather than the responsibility lying solely with the student. AR allows developers of an intervention to raise self-efficacy by adjusting these external stimuli. Attributional theory suggests that individuals are strongly inclined to seek resolution to unexpected outcomes resulting from external stimuli; this attempt to bring about resolution is the key to raising or adjusting self-efficacy on an individual basis. AR is an effective means of improving academic motivation in students and reduces the likelihood of failure in academic settings (Hall et al. 2004; Haynes et al. 2011; Perry et al. 2010). Given the internal and latent nature of self-efficacy, there is little opportunity to measure the construct directly. Thus, a self-reporting measure is necessary, and changes in the self-reporting measure can equate with changes in self-efficacy. Through AR interventions, respondents change their level of self-efficacy, rather than just the way they respond to items. In this study, we assume that within the self-efficacy framework, changes in item ratings truly reflect self-reported changes in self-efficacy.

Self-efficacy Framework

Our theoretical framework is based on Bandura's (1977) theory of self-efficacy and Schunk's (1985) classroom learning model. Self-efficacy is the belief held by an individual about his or her ability to perform a particular behavior or task successfully (Cassidy and Eachus 2002; Bandura 1977). Bandura states, "Cognitive events are induced and altered most readily by experiences of mastery arising from effective performance" (Bandura 1977, p. 191). However, traditional psychological research presents a stimulus-response outcome for learning that is directly linked to the response to stimuli (Walter 1973).
The external linkage between antecedent and subsequent behavior did not effectively describe the modes of change seen in students (Strecher et al. 1986). By understanding the internal link between stimulus and behavior, along with modal change, one can account for behavior arising from the review of contingencies related to affect (Cox and Smitsman 2008; Henderson and Peterson 2004). Self-efficacy, in this vein, is a key construct that acts as a type of expectancy derived from contingencies (Vancouver et al. 2008; Pajares 1996). The magnitude of this construct is established early in a person's life and does not appear to vary with age (Caprara et al. 2011). The key sources of self-efficacy are (a) previous performance on similar tasks, (b) successful completion of tasks by a peer, (c) verbal persuasion by an authority or peer, and (d) emotional arousal. Britner and Pajares (2006) found significant relationships among these key sources of self-efficacy that support and extend Bandura's (1997) theoretical framework. Self-efficacy ties to four areas: task choice, motivation, effort, and perseverance (Bandura 1982; Bong and Skaalvik 2003; Schunk 1989; Usher 2009; Zimmerman 1997). While these areas may seem specific in their effect, self-efficacy is not explicitly associated with one domain or area of effect (Bandura 1986, 2006; Bandura et al. 1994). In other words, self-efficacy outcomes generalize only as part of a larger construction identified as attitudes (Saks 1997). Recent studies contradict the traditional view of self-efficacy as a context-specific construct: Paunonen and Hong (2010) suggest that self-efficacy is tied to specific cognitive abilities, which allow for generalization from these abilities through transference. Subsequent studies corroborate these findings (Caprara et al. 2011; Geary 2010; Tierney and Farmer 2011). Relatively few scales are designed to measure self-efficacy related to scientific reasoning, computer use, and video games. An online search using the key terms self-efficacy, self-efficacy scales, and measures of self-efficacy in science, technology, and video games yields few results for this particular variation of the construct. While several scales measure self-efficacy in technology, few tie together the larger constructs of science and technology (Beckers and Schmidt 2001; Colley et al. 1994). Self-efficacy is a latent trait construct that must be measured indirectly due to its internal (latent) nature (Judge 2009; Skaalvik and Skaalvik 2007). As such, in keeping with Bandura's conceptualization of self-efficacy, instruments measuring this construct are self-report surveys (Cassidy and Eachus 2002).
Construct Definition

Self-efficacy in Technology and Science Reasoning

Computer self-efficacy refers to the assessment of one's ability to use a computer. Scientific reasoning self-efficacy in this context measures self-efficacy in terms of one's ability to engage in scientific reasoning (Lawson 2004). In this case, self-efficacy as a construct is assessed as the ability to apply cognitive skills to the overall tasks of computer and science use. Thus, self-efficacy can be linked to cognitive attributes related to science and technology. In this work, we combine each of these individual components under the unidimensional, general construct of self-efficacy in science and technology. This facet of self-efficacy is the construct measured by the proposed instrument, the SETS-SF. In this study, we define self-efficacy in science and technology as an affective trait with four factors, as proposed by Bandura (1977). These factors influence actualization and expectancy outcomes in the specific domains of computer use, video game use, and science reasoning. Britner (2008) suggests that science self-efficacy is a strong predictor of science grades across fields, regardless of gender; this relationship is of key interest to science educators. However, female students have slightly higher self-efficacy in science than males. While the relationship among self-efficacy, career selection, and achievement is significant, proficiency plays a key role in these outcomes and reduces actualization for students with high self-efficacy. The predictive factors associated with self-efficacy also differ between genders. Britner (2008) suggests that mastery experiences are the "only significant predictor of self-efficacy, for males, while for females factors such as social persuasion, vicarious experience and physiological states were better predictors of self-efficacy in life and physical science classes" (p. 963). Researchers claim that quantified self-efficacy reflects expected outcomes in a given subject in terms of multiple contingencies. Virtual or digital world parallels can reflect expectancy actions within the real world; thus, interactions within the virtual world of video games (technology) can affect the emotional states of players in the "real" world (Annetta et al. 2009). Successful completion of game situations leads to positive effects that are linked to higher arousal states (Vorderer et al. 2006). In other words, individuals are aware of the areas where they excel in virtual environments, and they report this perceived ability on self-efficacy instruments (Pajares and Miller 1995). However, when measuring aspects of a complex construct such as self-efficacy, one must be aware of an instrument's ability to measure it.

Validity Framework

The validity of an instrument is the extent to which the instrument measures the construct it is meant to measure. In this case, we consider validity in terms of purpose: the construct of self-efficacy. Validation depends on the evaluation of multiple variables, including empirical evidence,
continuous investigation, and contextual factors. By examining these multiple sources of evidence, one can create an argument for validity under the Messick framework. Therefore, we next discuss the integration of the Rasch validity model and the Messick framework for validity, and we identify components and results that support the various aspects of validity. As part of the Rasch measurement framework established by Wright and Stone (2004), we integrated the Rasch validity model into Messick's larger framework of validity (Messick 1989, 1996a, b). Within this framework, Messick argues that multiple aspects of validity can be seen as interdependent and complementary and that one type of validity cannot be substituted for another. The components of validity identified by Messick are content validity, substantive validity, structural validity, generalizability, external aspects of validity, and consequential validity. We accomplished this integration by examining multiple evidence-based aspects of Rasch item analysis and traditional measurement analysis (Embretson and Gorin 2001; Lamb et al. 2012). This allows for a multifaceted examination of evidence and multiple sources of evidence for the aspects of validity. Within the Rasch framework, evidence for structural validity is developed through item fit and order examination and is referred to as fit and order validity evidence; it derives from examination of the consistency of item responses, such as infit, outfit, and residuals. External evidence of validity develops from comparisons with other measures and criteria of the related domain or construct, as well as the degree to which scores on the target instrument align with and mirror scoring on the outside instruments and criteria; within this analysis, this evidence corresponds to the examination of other self-efficacy measures. Evidence of validity tied to generalizability arises from the development of scores that can be interpreted as broadly generalizable scores related to the specified construct within the groups sampled, via appropriate sampling technique and representativeness. Substantive validity evidence addresses the domain processes examined in the instrument items; it derives from appropriate coverage of the domain content and accumulation of empirical evidence over time. Evidence of content validity is often developed from an examination of the logical procedure for development of the instrument and framing of the domain of examination using appropriate standards. Consequential validity evidence arises from examination of the intended and unintended consequences of the measure. By examining multiple aspects of validity, one can gain insight into the validity of an instrument.

Purpose, Research Question, and Hypotheses

The purpose of this study was to develop a short-form, diagnostic screening instrument called the self-efficacy in
technology and science short-form survey (SETS-SF). Specifically, this study develops a measure of the unidimensional construct known as self-efficacy as it relates to video games and science. This instrument can be used to assess student self-efficacy within the domains of science process knowledge and technology use (i.e., computers and video games) (Ketelhut 2010). This study discusses the underlying psychometric properties of the SETS-SF and evaluates these properties for appropriate functioning of the instrument under the formal requirements of the Rasch model. The research questions are:

1. Do the results of the study meet the formal requirements for fitting the data to the Rasch model?
2. Is the construct of self-efficacy domain specific or generalized?
3. Is the SETS-SF internally and structurally valid when applied to the Rasch model?
Consideration of the research questions and literature supports the following hypotheses: (a) the SETS-SF data provide a proper fit to the Rasch model; (b) the construct of self-efficacy is domain specific; and (c) psychometric analysis using Rasch modeling results in a structurally valid measure of self-efficacy through self-reporting of extrinsic factors. Development of the SETS-SF is based on the previously validated and published measure by Ketelhut (2010) using classical test theory.
Methods

Study Design

The overall study design was a non-randomized, intact-group, posttest-only design. However, the survey validation design took a multiple-administration, evaluative-reductive approach (Lamb et al. 2012). The lack of a pretest makes it difficult, if not impossible, to attribute changes in self-efficacy to the treatment effect; however, this design eliminates threats to internal validity due to pretesting effects. The elimination of carryover is important in terms of item development in an iterative process, and reducing familiarity and carryover effects increases the reliability of the data. Another advantage of the intact-group design is that the intervention is embedded in the natural classroom environment, creating more realistic data outcomes and better data quality. A posttest-only design is acceptable for exploratory studies such as this one to establish instrument reliability and validity and to inform the development of future research questions and hypotheses (Lamb et al. 2014). Study participants completed an initial 5-day workshop on video game use and design within the science classroom.
During a 3-month period, students designed science-based video games that exemplified science concepts they had learned throughout the school year. During the workshop, all students were administered the original long-form SETS instrument. Each student worked individually to answer the survey questions online and did not discuss responses. Students were required to complete each section of the survey before moving to the next section, overseen by the workshop organizers. Surveys were completed on the same day that the video game interventions were administered. Survey results were then analyzed, and items were removed based upon the analysis. The short-form self-efficacy in technology and science survey (SETS-SF) was administered 3 months later to minimize carryover effects. Results of the SETS-SF were compared with the original long-form SETS to develop and characterize the psychometric properties of the SETS-SF and ensure equivalency of the measure. It is important to note that both versions were administered after the intervention to eliminate self-efficacy changes resulting from the workshop.
Table 1 Demographic composition of the study sample (N = 606)

Demographic category    N (606)    Percentage
Age
  10–12                 151        24.9 %
  13–15                 201        33.2 %
  16–18                 231        38.1 %
  19+                   23         3.7 %
Gender
  Female                242        40 % (51 %)
  Male                  357        59 % (49 %)
Ethnicity
  Hispanic              93         15.3 % (16 %)
  Asian                 40         6.6 % (5 %)
  Caucasian             351        57.9 % (63 %)
  African American      109        17.9 % (13 %)
  Mixed ethnicity       12         1.9 % (2 %)
  Other                 6          1.0 % (1 %)

Numbers in parentheses represent national statistics from 2012 census figures (United States Census Bureau 2012)
Participants

We obtained data from 651 students in 15 schools (seven comparison and eight intervention), ranging from grades 5 to 12. We collected data for validation and item-reduction purposes during workshops attended by the study participants over a 3-month period, which allowed sufficient time to pass between administration of the long-form and short-form surveys. We developed items using an iterative process suggested by Liu (2010). Next, we selected schools from a pool within each participating district in the southeastern and midwestern U.S. Schools were assigned by district superintendent selection, with the agreement of principals and teachers. We randomly assigned classrooms to comparison and intervention groups using block assignments. Ninety-four percent of all students in the sample completed the measure sufficiently for analysis purposes. The use of block assignments and presentation of the measure to all students increases generalizability by reducing student self-selection for intervention purposes. Notably, participating teachers showed interest in the use of technology in the classroom, which may have predisposed students to being open to the use of technology in the educational environment. However, this is not a concern, since students did not self-select into the study. Workshop participants developed their own serious educational games on a personal computer platform focused on science content. Serious educational games are video games that are not solely for entertainment purposes; they incorporate an educational component using pedagogical approaches and promote problem solving related to content. Our goals for the workshop were for students to
develop positive attitudinal outcomes, including self-efficacy, while engaging meaningfully with technology (using video games with computers) and science education activities. Forty-five subjects were removed from the analysis because they failed to complete all survey questions, resulting in a final n of 606. Non-responding subjects did not differ significantly in characteristics and traits from the larger sample. Sampled subjects were 59 % male (357) and 40 % female (242); the remaining 1 % (6) did not provide this information. Ages ranged from 10 to 19. Table 1 provides a breakdown of the age, gender, and ethnic composition of the study sample. An initial comparison of demographic characteristics to national statistics seems to indicate overrepresentation of certain groups within the study. However, examination of the sample using a hypergeometric distribution test at an alpha of .05 does not indicate overrepresentation by any particular ethnicity or gender, p = .066 (Harkness 1965). The representativeness of the age groups was not examined in this manner, as census data for these age bands were not available; this presents a limitation to the generalizability of the study results related to age. Examination of the age bands suggests that the most representative portions of the sample are for 13- to 18-year-olds. Therefore, more work is needed to determine the effectiveness of the measure for subjects under 13 and over 18 years of age.
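The representativeness check described above can be reproduced in outline with a hypergeometric test. The population figures below are placeholders chosen for illustration, not the census counts the authors used.

```python
from scipy.stats import hypergeom

# Hypothetical check for one group: if 16 % of a reference population
# of 50,000 students is Hispanic, how surprising are 93 Hispanic
# students in a random sample of 606?
pop_size, group_size, sample_size, observed = 50_000, 8_000, 606, 93

# Upper-tail probability P(X >= observed); a value above alpha = .05
# indicates no evidence of overrepresentation.
p_upper = hypergeom.sf(observed - 1, pop_size, group_size, sample_size)
print(f"P(X >= {observed}) = {p_upper:.3f}")
```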
Table 2 Item and subscale assignments for the SETS-SF

Item number (original survey item number)    Item                                                                                             Facet
1 (1)              No matter how hard I try, I cannot figure out the main idea in science class                    SP
2 (25) reverse     It is easy for me to look at the results of an experiment and tell what they mean               SP
3 (44)             Once I have a question, it is hard for me to design an experiment to test it                    SP
4 (48)             I have trouble figuring out the main ideas of what my science teacher is teaching us            SP
5 (10)             No matter how hard I try, video games are too difficult for me to learn                         VG
6 (13)             Even when I try hard, I don't do well at playing video games                                    VG
7 (18)             I can only succeed at the easiest video game                                                    VG
8 (22)             No matter how hard I try, I do not do well when playing computer games                          VG
9 (32) reverse     Video games are easy to figure out for me even when I do not try                                VG
10 (33)            Even when I try hard, learning how to play a new video game is complicated for me               VG
11 (36)            When playing in a simulation game, I only do well for short periods of time                     VG
12 (26)            No matter how hard I try, I cannot use computers                                                CU
13 (28)            I find it difficult to learn how to use the computer                                            CU
14 (42)            Whenever I can, I avoid using computers                                                         CU
15 (45) reverse    When using the Internet, I usually do not have trouble finding the answers I am looking for     CU
16 (51)            Even if I try very hard, I cannot use the computer as easily as paper and pencil                CU
Instruments

Ketelhut (2010) designed and established the psychometric properties of the original 67-question instrument. From this original survey, we retained 16 items for re-administration to the sample. We selected these items by reviewing their psychometric properties and fitting them to the Rasch model, along with substantive considerations; specific guidelines for the retention of items are discussed in subsequent sections. Data obtained after re-administration indicate that the 16 selected items should be retained in the SETS-SF. The SETS-SF is a 16-item (with an additional 10 demographic responses), freestanding, self-reporting
survey based on the SETS survey (Ketelhut 2010). The SETS-SF survey measures three component subscales: (a) science reasoning (4 items, α = .92), (b) computer use (5 items), and (c) video gaming (7 items). Together these represent the latent construct of science and technology self-efficacy, and they correspond to the three areas of interaction responsible for the processing effects seen in science education within a technology-integrated learning experience. Response categories are expressed through a 5-point Likert scale: (1) Strongly Disagree, (2) Disagree, (3) Neutral (acting as an anchor), (4) Agree, and (5) Strongly Agree. Subjects endorse the responses that best describe their levels of experience with computer technology, video games, and science process. Table 2 shows each item and its associated subscale.
Rasch Analysis of an Affective Survey

Rasch measurement provides a theoretical model for equal-step measure construction of a self-efficacy instrument (Boone et al. 2011). The model is probabilistic, based upon logits (Stewart et al. 2009), which allows for an adequate measure of items that participants are less likely to endorse (affirm). Individuals who show a greater likelihood of exhibiting a high level of endorsement are more likely to demonstrate an increased level of self-efficacy. Consequently, when a high-measuring subject does not endorse items ranked lower than his or her other endorsements, the result is considered unexpected. This reduces model fit, which in turn reduces validity. The Rasch model is therefore only applicable to the characterization of a single-trait construct such as self-efficacy in science and technology. A key to development of the self-efficacy scale is to establish the construct in sufficient detail to allow for estimation of the relative location of each item along a single-vector construct (Boone et al. 2011; Wright and Stone 1979). Establishment of the single vector occurs qualitatively through rater review of items and their placement on a linear scale in the order of likelihood of endorsement. Rasch analysis is readily applicable to the assessment of cognitive and physical skills; however, it is difficult to apply the model to attitudes, beliefs, and motivation without sufficient construct development. While the theoretical postulates of the Rasch model allow it to be applied to measures of self-efficacy, the difficulty lies in the construction of unidimensional sequences that increase in endorsement as levels of the construct increase. Given this concern, Boone et al. (2011) illustrated that self-efficacy is a belief variable that lends itself to psychometric analysis using the Rasch model, with the STEBI-B as a test case to develop the theoretical underpinnings.
Instruments developed with Rasch analysis contain items that remain fixed (scaled), allowing for calibration and comparison across differing samples because of the measurement of the underlying trait, theta (Θ). This study examined the quality of the rating scale and assessed item quality in defining the science and technology self-efficacy dimensions. Rasch model analysis allows us to describe how well the items represent the self-efficacy range of the evaluated subjects, examined through differential item functioning (DIF) and item characteristic curves (ICCs). DIF occurs when subjects from different demographic groups (minority and majority) with the same ability have different probabilities of endorsing responses on a questionnaire. Item characteristic curves taken in aggregate, in the form of test information functions, are the most common means of assessing DIF; the DIF contrast is a log-odds estimate of the difference between DIF sizes. This is equivalent to a Mantel–Haenszel t test but has the advantage of allowing for missing data.
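As a sketch of the DIF contrast just described, the snippet below contrasts an item's calibration estimated separately in two demographic groups; the measures and standard errors are hypothetical, not values from this study.

```python
import math

def dif_contrast(measure_g1, se_g1, measure_g2, se_g2):
    """DIF contrast for one item between two groups: the difference
    in group-specific item calibrations (a log-odds estimate, in
    logits) and an approximate t statistic on that difference.
    """
    contrast = measure_g1 - measure_g2             # DIF size in logits
    joint_se = math.sqrt(se_g1 ** 2 + se_g2 ** 2)  # pooled standard error
    return contrast, contrast / joint_se

# Hypothetical group calibrations for a single item.
size, t = dif_contrast(0.42, 0.11, 0.15, 0.12)
print(f"DIF contrast = {size:.2f} logits, t = {t:.2f}")
```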
Psychometricians and measurement researchers use Rasch modeling to paint a clearer picture of ways in which to modify and evaluate an existing instrument. Through an application of the techniques suggested by Boone et al. and others (Rasch 1960; Wright 1968, 1984; Wright and Stone 1979; Linacre 2009a, b; Bond and Fox 2007), one can effectively create psychometrically sound short-form instruments with appropriate construct coverage.

Analysis

To address research questions 1 and 3, we analyzed the measure with WINSTEPS 3.80 (Linacre 2009a) and SAS JMP 11.0 Statistical Discovery Software. We used the Rasch rating scale model (Rasch model) to evaluate the underlying psychometric properties of the original 67-item SETS instrument. The results of the initial SETS Rasch analysis informed the development of the 16-item SETS-SF instrument. Specifically, Rasch analysis determined the retention or deletion of items using fit analysis. Comparison between each iteration of the survey model assisted in the determination of item functioning and removal for creation of the short-form SETS. We then analyzed the SETS-SF to develop its underlying psychometric properties. Stability of measures to the 99 % (±1/2 logit) confidence interval is provided at a threshold of 150 subjects (Linacre 1994), a number this study exceeded. The model assumes the likelihood of endorsing a particular subscale response increases as the subject's self-efficacy exceeds a critical level of theta (Θ) with respect to technology and science exposure. Item misfit indicates that there is no relationship between an item and the other items on the scale, examined via the item characteristic curves. The lack of fit within this analysis frame is interpreted as statistical noise, reducing fidelity within the measured construct and resulting in a poor measure of theta (self-efficacy). Mitigation of this effect is accomplished through item removal, which also assists in overall model fit and adjustment of chi-square. Outfit residuals greater than 1.5 were identified as not conforming to model fit (Kyngdon 2009). An increase in positive student endorsements indicates an increase in student self-efficacy. By estimating the expected ordering of positive endorsement responses, the analysis also provides information on the degree to which each individual's response pattern conforms to the most typical order for total self-efficacy. Separation indices show the level of step difference that can be identified across the sample. The level of separation is an indicator of item and individual endorsement discrimination, which helps establish measure reliability and discrimination between respondents with high self-efficacy and those with low self-efficacy. Chi-square ratios were used to determine Rasch model fit. The most commonly used fit statistics are the infit and outfit statistics (Wright 1996). The outfit statistic is based on the sum of squared residuals and is affected by outlying responses (e.g., negative endorsements from individuals with high cumulative self-efficacy scores). Infit statistics are influenced by unexplained patterns among observations when the level of positive endorsement is similar to the level of self-efficacy expected for the responding individual. When response patterns of subjects fit the Rasch model, infit and outfit will fall into a range of .6–1.4 (Linacre 1999). Lower infit and outfit statistics are associated with greater discrimination between responses above and below the threshold. Acceptance of model fit was linked to this previously established range of .6–1.4, indicating a good fit to the model. The acceptable range of fit ensures that differences among response slopes are not a factor in the resultant precision of the measure. Decreases in misfit due to item removal resulted in greater model fit while maintaining measurement precision.
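The infit and outfit mean-square statistics referenced throughout this section reduce to simple functions of the model residuals. A minimal sketch, with invented observed scores, expectations, and variances for one item:

```python
def fit_mean_squares(observed, expected, variance):
    """Rasch infit and outfit mean squares for one item.

    Outfit is the unweighted mean of squared standardized residuals,
    so it is sensitive to outlying responses; infit weights each
    squared residual by its model variance (information), so it is
    sensitive to unexpected inlying patterns. Values near 1.0 fit;
    the .6-1.4 band above is the acceptance range used here.
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit

# Invented responses from four persons to one 5-category item.
infit, outfit = fit_mean_squares(observed=[4, 2, 5, 1],
                                 expected=[3.4, 2.2, 4.1, 1.8],
                                 variance=[0.9, 1.1, 0.8, 0.7])
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")
```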
The use of the Rasch model assumes that responses and endorsements on the survey questions are due to individual variation along a single underlying construct, self-efficacy in this study (Liu 2010). An important consideration when creating a short-form screening instrument from a long-form version is construct representativeness, which we established in two ways. The first is through analysis of the relationship of items to the whole on the original measure: in this case, examination of the items selected for retention and their role as part of the SETS instrument. The second is through a substantive analysis of items, as suggested by Messick (1989). We conducted the first part of the analysis by examining the ratio of items to the construct. While not all items are as representative of the construct as others, Rasch analysis allows for selection of the most representative items via item calibration. We calculated a ratio between the retained items and the whole on the original measure, then calculated a similar ratio for the short-form measure. Next, we conducted a chi-square analysis to determine whether there was a significant difference between the actual and expected numbers of representative items relative to the whole; a lack of significance indicates sufficient coverage of items. The second and more important aspect of construct coverage is a substantive analysis of items to ensure that the items and subscales in the new measure cover concepts and content similar to the original measure (Messick 1998). These results assist in the development of argument-based validity.
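A sketch of the ratio-based coverage check follows. The facet counts assumed for the original 67-item SETS are illustrative placeholders (only the 16-item assignments in Table 2 are given in this paper); the expected counts scale the original facet proportions to a 16-item form.

```python
from scipy.stats import chisquare

# Assumed facet counts on the original SETS (illustrative only).
sets_counts = {"science": 12, "video games": 15, "computer use": 6}
# Actual facet counts retained on the 16-item SETS-SF (Table 2).
setssf_counts = {"science": 4, "video games": 7, "computer use": 5}

total = sum(sets_counts.values())
expected = [16 * n / total for n in sets_counts.values()]
observed = list(setssf_counts.values())

# A non-significant result indicates the short form covers each
# facet in roughly the original proportions.
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```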
Discriminant and Convergent Evidence of Validity

Evidence for discriminant validity was established using an exploratory factor analysis (EFA) (Allen and Yen 1979). The EFA provides this evidence through comparison of factor loadings between the self-efficacy construct discussed in this study and an unrelated construct: in this case, 16 randomly selected items from the Academic Curiosity Scale developed by Vidler and Rawan (1974). Data for the two scales are submitted together, and ideally two clear factors emerge. Convergent evidence is demonstrated if all of the targeted and marker self-efficacy items converge on the self-efficacy factor within the EFA (Allen and Yen 1979).

Results

Descriptive Statistics

Data from the sample suggest that the Rasch model describes the internal structure of the measure and the associated items (research questions 1–3). Table 3 illustrates that there are relatively few missing responses (<1 %) on the SETS-SF. For descriptive purposes only, we collapsed the response categories strongly agree and agree into a single category, and did the same for strongly disagree and disagree, in order to cast the data as a more dichotomous response set. The most frequently endorsed response was disagree. Some items are reversed in order to increase measure reliability and are listed with an (R). Table 4 displays the mean, standard deviation, minimum, and maximum for each of the measured subscales. The subscale with the largest mean is video games (VG), at 2.05, and the subscale with the lowest mean is science process (SP), at 1.46.
Table 4 Subscale means and standard deviations

Subscale    Mean    Std dev    Min    Max
SP          1.46    .79        .00    5.00
VG          2.05    .89        .00    4.25
CU          1.69    .89        .00    4.86
Table 3 Frequency of responses to the SETS-SF

Item         N      No response    1 strongly disagree    2 disagree    3 neutral    4 agree    5 strongly agree
1 (1)        535    33             151                    219           93           22         17
2 (25) R     535    34             275                    148           55           11         13
3 (44)       535    35             251                    147           58           26         19
4 (48)       535    37             252                    153           45           32         17
5 (10)       535    56             95                     180           156          36         12
6 (13)       535    58             206                    173           64           21         14
7 (18)       535    61             96                     200           140          30         8
8 (22)       535    59             320                    123           26           4          3
9 (32) R     535    59             305                    126           32           9          4
10 (33)      535    59             259                    135           54           21         8
11 (36)      535    59             228                    142           70           24         13
12 (26)      535    59             130                    163           137          33         14
13 (28)      535    61             282                    123           46           16         7
14 (42)      535    67             86                     151           183          35         13
15 (45) R    535    60             193                    168           77           24         13
16 (51)      535    62             244                    140           64           18         7
Instrument Reliability

Our estimation of reliability for the measured constructs followed the latent trait reliability method (LTRM) embedded in an IRT approach (Dimitrov 2012; Raykov 2009; Raykov et al. 2010). The LTRM provides a superior estimation of internal reliability, as it does not rely on the assumptions associated with more common reliability methods such as Cronbach's alpha; specifically, Cronbach's alpha requires essential tau-equivalence and no correlated errors. Within the framework of latent variable modeling, score reliability was developed as the ratio of the true-score variance to the observed variance (Dimitrov 2012). The reliability of the measured constructs is estimated at REL = .92, CI 5 % [.76–.80], SEM 2.32, CI 5 % [2.27–2.37]. The computed level of reliability is adequate for this type of measure and provides evidence of structural validity. To estimate item and person reliability, researchers often use metrics ranging from 0 to 1 in order to account for ordinal data.

Item Separation Index

Rasch analysis makes use of the item separation index and person separation index, which have no upper bounds and account for the nature of ordinal data. Comparison of psi (Ψ) coefficients for the SETS and the SETS-SF reveals that the short-form survey has greater discrimination: Ψ(SETS-SF) is 5.69, while Ψ(SETS) is 1.09. Interpretation of Ψ(SETS-SF) = 5.69 identifies six levels of strata, while the original SETS indicates one level of stratification. This stratification of the items discriminates between item choices and assists in the scaling of the responses. Thus, we suggest that the SETS-SF is a more psychometrically sound instrument than the SETS. Discrimination here is the overall ability of the instrument to effectively index (not parameterize, as in 2PL and 3PL models) between the levels of affirmation for students with high and low levels of self-efficacy.
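The separation and strata interpretation referenced above follows directly from a reliability coefficient via the standard Rasch conversion formulas; a minimal sketch (the numeric input is the reliability reported above, the conversion itself is generic):

```python
import math

def separation_and_strata(reliability):
    """Convert a Rasch reliability coefficient into a separation
    index (unbounded above, unlike the 0-1 reliability metric) and
    the number of statistically distinct strata the instrument can
    resolve, using the conventional (4G + 1) / 3 rule.
    """
    g = math.sqrt(reliability / (1.0 - reliability))  # separation index
    strata = (4.0 * g + 1.0) / 3.0
    return g, strata

g, strata = separation_and_strata(0.92)
print(f"separation = {g:.2f}, strata = {strata:.2f}")
# -> separation ~3.39, strata ~4.86 for a reliability of .92
```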
Measure Construction

Construct Validity

A construct is an attribute reflected in measure outcomes. The measured construct is denoted with the Greek letter Θ (theta). For the purposes of this study, Θ is the individual's level of self-efficacy in the domains of science, computer use, and video games. Construct validity is the degree to which a scale measures the theoretical, psychological construct, or Θ. When a survey (an affective test) measures a trait (i.e., self-efficacy) that is difficult to define, multiple expert reviewers may rate the individual items' relevance to the construct Θ. Table 5 illustrates the independent relevance rating of each reviewer for each item on the SETS-SF survey.

Table 5 Relevance rating for each item on the SETS-SF survey

                     Reviewer 2
Reviewer 1           Weakly relevant    Strongly relevant
Weakly relevant      –                  1, 2, 4
Strongly relevant    –                  3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

Numbers correspond to individual items on the SETS-SF survey. Letters correspond to relevance groupings: A = WRWR, B = WRSR, C = SRWR, and D = SRSR

We calculated the construct validity (and qualitative item fit) coefficient with the equation d = Σ(I=1 to n) I_D / I_AD (Lamb et al. 2012), the ratio of items rated strongly relevant by both reviewers (I_D) to all rated items (I_AD). An agreement analysis revealed that 13 items carry a strong relevance to the construct of self-efficacy; this agreement analysis provides evidence of content and substantive validity. This percentage corresponds to a construct validity coefficient of .813. This level of construct relevance is adequate for an affective measure, as the accepted cutoff for an affective measure is a construct validity coefficient of .70. The remaining items show mixed relevance (i.e., one reviewer rated the item as strongly relevant and the other as weakly relevant, or both reviewers rated the item as weakly relevant). This does not necessarily mean that the experts felt the item did not measure any aspect of Θ; rather, the item did not measure as much of Θ as the strongly relevant items did (Lamb et al. 2012). We selected expert reviewers based upon their unique understanding of the measured items of interest as they relate to efficacy. Each selected expert held a Ph.D. in Science Education and/or Educational Psychology and had an extensive background in affective measures related to self-efficacy.
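The agreement-based coefficient reduces to the share of items both reviewers placed in the strongly relevant (D) cell; a small sketch reproducing the .813 figure from the Table 5 ratings:

```python
def construct_validity_coefficient(ratings_r1, ratings_r2):
    """Share of items rated strongly relevant ("SR") by both expert
    reviewers; .70 or above is treated as adequate for an affective
    measure, per the cutoff cited above.
    """
    both = sum(1 for a, b in zip(ratings_r1, ratings_r2)
               if a == "SR" and b == "SR")
    return both / len(ratings_r1)

# Table 5: reviewer 1 rated items 1, 2, and 4 weakly relevant;
# reviewer 2 rated all 16 items strongly relevant.
r1 = ["WR", "WR", "SR", "WR"] + ["SR"] * 12
r2 = ["SR"] * 16
print(construct_validity_coefficient(r1, r2))  # 13/16 = 0.8125
```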
Construct Representativeness

Table 6 provides an overview of the individual items aligned to their subscales for the SETS-SF measure. Our examination of the ratio differences between the original measure and the short-form measure suggests adequate construct coverage for the science process (SP) and video game (VG) facets. The ratio method indicates that the computer use (CU) facet is over-represented in the new measure. However, our examination of content suggests that this is not a substantive problem, as each question captures a slightly different aspect of computer use, and any over-representation only marginally affects instrument function at a substantive level.
Table 6 Construct coverage based upon the item-to-construct ratio comparison

Item                                                                                           Facet              Item number        Subscale ratio       Ratio change    Representation
                                                                                                                  SETS    SETS-SF    SETS    SETS-SF
No matter how hard I try, I cannot figure out the main idea in science class                   Science process    1       1          1       1.39         +.39            Adequate representation, χ²(3) = .870, p = .832, not significant
It is easy for me to look at the results of an experiment and tell what they mean              Science process    25      2
Once I have a question, it is hard for me to design an experiment to test it                   Science process    44      3
I have trouble figuring out the main ideas of what my science teacher is teaching us           Science process    48      4
No matter how hard I try, video games are too difficult for me to learn                        Video game use     10      5          1       1.88         +.88            Adequate representation, χ²(6) = 11.30, p = .079, not significant
Even when I try hard, I don't do well at playing video games                                   Video game use     13      6
I can only succeed at the easiest video game                                                   Video game use     18      7
No matter how hard I try, I do not do well when playing computer games                         Video game use     22      8
Video games are easy to figure out for me even when I do not try                               Video game use     32      9
Even when I try hard, learning how to play a new video game is complicated for me              Video game use     33      10
When playing in a simulation game, I only do well for short periods of time                    Video game use     36      11
No matter how hard I try, I cannot use computers                                               Computer use       26      12         1       3.61         +2.61           Overrepresented, χ²(4) = 9.95, p = .041, significant
I find it difficult to learn how to use the computer                                           Computer use       28      13
Whenever I can, I avoid using computers                                                        Computer use       42      14
When using the Internet, I usually do not have trouble finding the answers I am looking for    Computer use       45      15
Even if I try very hard, I cannot use the computer as easily as paper and pencil               Computer use       51      16
The results of this analysis provide evidence of content validity and generalizability.

Discriminant and Convergent Analysis

Table 7 illustrates the results of the EFA used to examine the discriminant and convergent evidence for the SETS-SF. These analyses provide evidence of external factor validity and validity related to generalizability. There is a high degree of discriminant evidence, as items from the SETS-SF load on one factor and items from the Academic Curiosity Scale (ACS) load on a second factor, as seen in Table 7. This suggests that the construct measured in the SETS-SF is separate from the ACS construct.
Table 7 EFA demonstrating evidence of convergent and discriminant validity

            Factor 1    Factor 2
SETSSF3     .531        .274
SETSSF14    .594        .139
SETSSF16    .598        .234
SETSSF5     .529        .237
SETSSF12    .550        .144
SETSSF11    .553        .234
SETSSF6     .531        -.064
SETSSF7     .546        -.072
SETSSF2     .738        .223
SETSSF1     .558        .134
SETSSF13    .571        -.124
SETSSF4     .527        .237
SETSSF10    .646        .118
SETSSF8     .636        .064
SETSSF9     .693        .373
SETSSF15    .707        .234
ACS7        .383        .223
ACS11       .137        .725
ACS9        .139        .503
ACS13       -.237       .634
ACS3        -.244       .688
ACS12       .234        .600
ACS14       -.064       .585
ACS4        -.173       .723
ACS1        -.213       .578
ACS15       -.118       .639
ACS5        -.104       .708
ACS6        .354        .671
ACS16       .012        .624
ACS10       .369        .554
ACS8        -.194       .636
ACS2        .060        .558
A convergent analysis resulting in a high correlation coefficient (.86) between the SETS-SF and other measures of self-efficacy suggests evidence of external factor validity. It should be noted that the ACS itself illustrates high levels of discriminant and convergent evidence and thus is a good measure for comparison within the EFA method.

Summary of Validity Evidence

Table 8 summarizes each aspect of validity as a unified concept focused on the use of the instrument, identifying the validity components and the sources of the evidence discussed within the paper. The six aspects are seen from the modern validity framework as interdependent and complementary.

Table 8 Summary of validity evidence

Type of validity    Sources of evidence
Content             Literature review; examination and rating by expert reviewers of item-construct appropriateness; examination of construct representativeness of the domain
Substantive         Examination and rating by expert reviewers of item-construct appropriateness; Rasch analysis of item behavior; analysis of construct representativeness of the domain
Structure           Fit to the Rasch model; infit and outfit statistics; internal consistency reliability
Generalizability    Examination of content representativeness; examination of instrument outcomes compared with other instruments measuring the same construct; discussion of discriminant and convergent analysis
External factors    Rasch item functioning; discriminant and convergent analysis
Consequential       Differential item functioning
Through examination of the results of the Rasch analysis, the theoretical framework developed within the literature around self-efficacy, and the intended and unintended consequences of the measure, one can establish evidence of validity. Due to the interdependent and complementary nature of the validity construct, sources of evidence apply to multiple aspects of the validity framework in the table.
Psychometric Characteristics of the Measure

Rasch Analysis

The item–person map in Fig. 1 shows the relative levels of affirmation on the Rasch-calibrated scale in logits. Participants with the highest level of self-efficacy show the highest outcome score. The even distribution of the respondent scores suggests effective targeting of the latent trait by the SETS-SF. Notably, the items show a marginally higher level of affirmative endorsement compared with the participant likelihood of endorsement. For the purposes of this study, item difficulty refers to the willingness of the participant to agree with the statements in the survey (Linacre 1999), as indicated by scale items at or above 0 and denoted by M. The five most difficult items to affirm, in order, are items 1, 2, 3, 5, and 14. The four least difficult items to endorse, in order, are items 9, 7, 8, and 16. There is a floor effect associated with the measure; however, this is not a significant concern, since levels of efficacy at these points are extremely low and the null space of the measure is relatively small. The mean outfit and infit statistics for the SETS-SF measure are 1.08 and
1.05, respectively. These results suggest proper functioning and scaling of the overall measure. The chi-square result for model fit suggests that the SETS-SF conforms to the Rasch model, as there is no significant difference between the observed and expected item response outcomes (χ² = 624.11, df = 650, p = .761). Table 9 summarizes all items that were removed due to improper functioning and/or improper fit to the Rasch model. Items 11, 16, 17, 27, 37, 47, 57, 64, and 67 were removed from the original survey for improper functioning. Analysis of the original 67-question survey suggests poor model fit (χ² = 718.33, df = 650, p = .0321, infit = 2.3, outfit = 2.1). The infit and outfit statistics suggest improper functioning of the original SETS items with respect to the Rasch model, and the significant χ² statistic suggests a lack of fit between actual and expected results for the SETS. This is not to say that the original SETS survey is not valid and reliable under the CTT model; Ketelhut established the psychometric properties of the original SETS under that model. However, the original SETS instrument does not meet the formal requirements of the Rasch model and is unsuitable for examination or development in the Rasch context without the removal of items. Items from the original survey that fell outside of the fit range were items 2, 5, 9, 14, 19, 23, 24, 29, 35, 38, 39, 43, 46, 49, 50, 52, 53, 54, 55, 56, 58, 59, 60, 62, and 65. Removal of the listed items reduced the χ² value; however, the value remained significant. While the unidimensional construct (self-efficacy) is exhibited in the 26-item version of the SETS-SF, values of cross-loaded factor loadings suggest that other constructs were being measured. Therefore, we removed the cross-loaded items. In all, ten more items were removed from the survey due to this improper loading. Item removal resulted in a 16-question survey, a further reduction in the χ² value, and a nonsignificant p value.

Fig. 1 Item–person map of the SETS-SF

An important consideration when examining the psychometric properties of an instrument is the functioning of the item scales. An essential property of the Rasch analysis approach is that the probability of responding in higher categories of the rating scale increases as respondents' scores on the measured trait increase. Figure 2 displays the category probability curves for the five-category Likert items. Review of the figure demonstrates that there is no threshold disordering and that advancement across the categories from left to right is monotonic. This indicates proper scale functioning without the need to reduce item scale choices.

Fig. 2 Category probability curves
Differential Item Functioning (DIF)

Results of the DIF analysis support the removal of items from the original SETS instrument due to a lack of invariance across groups. Items removed due to poor DIF are items 5, 9, 14, 19, and 60. The SETS-SF, with 16 retained items, illustrates invariance across groups. Table 10 presents DIF contrasts across the demographic characteristics of age, gender, and race for the remaining 16 items. Examination of Table 10 reveals that the remaining 16 items do not vary significantly across age, gender, and ethnicity in terms of item functioning, which provides evidence of consequential validity.

Scoring the SETS Survey

Researchers using the SETS-SF can convert raw scores to Rasch scores without running a Rasch analysis by employing the validated data from this study. This conversion holds mostly for subjects who endorse a majority of items (98 %) in order to meet the psychological clinical cutoff level of the self-efficacy construct. The raw score is related to the Rasch person measure, and the relationship can be described using a logit equation.
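The study's own raw-score-to-measure table is not printed here, but the general shape of such a logit conversion is sketched below. The scale and location constants are placeholders that would be replaced by this study's calibration, and extreme raw scores require the usual small adjustment before converting.

```python
import math

def raw_to_logit(raw, r_min=16, r_max=80, scale=1.0, location=0.0):
    """Approximate raw-to-Rasch-measure conversion via the logit of
    the proportion of the possible score range earned.

    r_min, r_max -- minimum and maximum possible raw scores
                    (16 items scored 1-5 gives 16 and 80)
    scale, location -- calibration constants; placeholders here,
                       not the values validated in this study
    """
    p = (raw - r_min) / (r_max - r_min)  # proportion of score range
    return location + scale * math.log(p / (1.0 - p))

print(f"{raw_to_logit(52):.2f} logits")  # a mid-range raw score, ~0.25
```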
Discussion

The purpose of this study was to develop a freestanding, self-reporting survey measuring the three subscales associated with the global construct understood as self-efficacy in technology and science. We examined the underlying factors that make up the construct of self-efficacy. Confirmation of the study hypotheses would result in a
Table 9 Item removal table

Item    Infit/outfit difference    Original subscale           Reason for removal
57      .58                        Generic science learning    Dropped, blended into inquiry science process
37      .88                        Synchronous chat            Dropped, blended into generic computer
64      .97                        Synchronous chat            Dropped, blended into generic computer
27      .98                        Synchronous chat            Dropped, blended into generic computer
67      1.09                       Synchronous chat            Dropped, blended into generic computer
47      .75                        Synchronous chat            Dropped, blended into generic computer
17      .46                        Synchronous chat            Dropped because not consistent with EFA
16      .83                        Generic computer use        Dropped because not consistent with EFA
11      .59                        Inquiry science process     Dropped because not consistent with EFA
65      -.2                        Generic science learning    Dropped because below logit threshold
49      -.78                       Generic science learning    Dropped because below logit threshold
29      -.51                       Generic science learning    Dropped because below logit threshold
38      -.28                       Generic science learning    Dropped because below logit threshold
43      -.4                        Inquiry science process     Dropped because below logit threshold
23      -.32                       Inquiry science process     Dropped because below logit threshold
46      -.37                       Inquiry science process     Dropped because below logit threshold
9       -.62                       Synchronous chat            Dropped because below logit threshold
19      -.57                       Synchronous chat            Dropped because below logit threshold
24      -.83                       Synchronous chat            Dropped because below logit threshold
50      -.34                       Synchronous chat            Dropped because below logit threshold
14      -.44                       Synchronous chat            Dropped because below logit threshold
15      -.54                       Synchronous chat            Dropped because below logit threshold
55      -.54                       Synchronous chat            Dropped because below logit threshold
52      -.71                       Generic computer use        Dropped because below logit threshold
2       -.79                       Generic computer use        Dropped because below logit threshold
59      -.92                       Generic computer use        Dropped because below logit threshold
5       -.73                       Generic computer use        Dropped because below logit threshold
35      -.73                       Generic computer use        Dropped because below logit threshold
60      -.92                       Generic computer use        Dropped because below logit threshold
62      -1.18                      Video gaming                Dropped because below logit threshold
53      -.44                       Video gaming                Dropped because below logit threshold
56      -.39                       Video gaming                Dropped because below logit threshold
58      -.72                       Video gaming                Dropped because below logit threshold
54      -.2                        Video gaming                Dropped because below logit threshold
39      -.72                       Video gaming                Dropped because below logit threshold

Logit threshold values are N > ±2.0 logits (Linacre 1999)
Confirmation of the study hypotheses would result in a psychometrically sound measure of self-efficacy mediated through review of extrinsic factors. Results of the Rasch rating scale model analysis of the SETS-SF items for self-efficacy provide strong support for a single latent dimension underlying the responses. A primary assumption of the Rasch model is measure unidimensionality: one dominant dimension underlies a person's responses to a survey. First, we used principal component analysis of the residuals to examine component loading.
Eigenvalues suggest one underlying dimension, with positive residuals exhibited within the contrast plot. Analysis of the patterns of residuals shows that the residuals loaded in one direction on the original subscales. Unidirectional loading is indicative of the unidimensional nature of the construct of self-efficacy in science and technology. These three subsets of items, defined by positive loadings on the first residual component, were fit to the Rasch model and person estimates were obtained.
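As an illustration of this dimensionality check, the sketch below simulates dichotomous Rasch data and extracts principal components from the standardized residuals. All names and data are illustrative; the SETS-SF analysis itself used polytomous responses in dedicated Rasch software, but the logic of the check is the same.

    import numpy as np

    # Minimal sketch of a principal components analysis of Rasch residuals,
    # using simulated dichotomous data for brevity.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(200, 1))          # person measures (logits)
    delta = np.linspace(-2, 2, 16)             # item difficulties (logits)
    p = 1 / (1 + np.exp(-(theta - delta)))     # Rasch expected responses
    x = rng.binomial(1, p)                     # simulated observed responses

    # Standardized residuals: (observed - expected) / model standard deviation.
    resid = (x - p) / np.sqrt(p * (1 - p))
    eigvals = np.linalg.eigvalsh(np.cov(resid, rowvar=False))

    # Under unidimensionality the residual eigenvalues sit near 1; a first
    # contrast well above ~2 would suggest a secondary dimension.
    print(np.sort(eigvals)[::-1][:3])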
Fig. 2 Category probability curves
Differences in person estimates derived from these analyses were trivial between the VG, CU, and SP subscales and the overall scale, at .03 logits. None of the item subscales showed a significant difference in person estimates relative to the original measure. There is clear evidence of the unidimensionality of the self-efficacy domain when the SETS-SF is developed as a bifactor model. We examined the construct of self-efficacy as a bifactor model: the general factor (self-efficacy) and the domain-specific factors (subscales) each account for a unique contribution; however, all measure the construct of self-efficacy. This contrasts with a second-order factor model, in which the higher-order factor is a qualitatively different (superordinate) dimension from the lower-level factors. Therefore, the structure of the SETS-SF is based upon a relatively broad underlying
construct with diverse indicators arranged into subdomains. These subdomains are uncorrelated; thus, unidimensional IRT analysis with D = 1.00 (Rasch) is appropriate rather than multidimensional IRT. The goal of this Rasch analysis was to establish the validity and reliability of the SETS-SF questionnaire and to determine whether the survey meets the formal assumptions of measurement defined within the Rasch model. Item response functioning is compliant with Rasch requirements and does not require collapsing the neutral response category. Rasch model fit suggested the removal of the items listed in Table 9. The resulting fit statistics suggest an adequate model fit for the 16-item SETS-SF and demonstrate a justified scale for the measurement of self-efficacy. Survey results show a high degree of reliability, validity, and construct representativeness. The person–item map of the Rasch-scaled SETS-SF shows adequate targeting of the scale, with minimal floor effects and no ceiling effects. The adequate targeting of the survey to subject responses suggests that respondents interpret item wording in a uniform way. Multi-item targeting at the same discrimination level suggests that the additional items may, in fact, allow for slightly more efficient and reliable targeting. Rasch analysis provides evidence to support the construct of self-efficacy related to science and technology, as proposed by the authors. This model is consistent with previously formulated models by Bandura and others. The analysis of the SETS-SF reveals the underlying psychometric structure of the measure.
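Returning to the bifactor structure described above, one conventional way to write it (an illustrative formulation, not notation taken from the study) is

\[ X_{ij} = \lambda^{g}_{i}\,G_{j} + \lambda^{s}_{i}\,S_{d(i)j} + \varepsilon_{ij}, \qquad \operatorname{Cov}\!\left(G, S_{d}\right) = 0, \]

where X_ij is person j's response to item i, G_j is the general self-efficacy factor, S_{d(i)j} is the domain-specific factor (video gaming, computer use, or science process) for item i's subdomain, and the orthogonality constraint is what distinguishes the bifactor form from a second-order model.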
Table 10 Differential item function contrasts for age, gender, and race

Item    Likelihood of   Standard   DIF contrast                    p tDIF
label   endorsement     error      Age      Gender    Ethnicity
1        .62            .02         .80      .46       .32         .07
2       -.80            .03        -.94      .22      -.23         .12
3        .51            .02         .33      .24       .08         .06
4        .51            .02         .41     -.26      -.50         .06
5        .90            .03         .09     -.29      -.19         .07
6       -.73            .03        -.07      .07       .01         .10
7       -.46            .02         .19      .54       .13         .08
8       -.11            .02        -.24     -.37       .01         .08
9       -.38            .03         .23     -.10       .14         .10
10      1.05            .02         .10     -.18       .32         .06
11     -1.09            .02        -.07      .13       .40         .07
12       .45            .03         .92     -.21      -.66         .07
13      -.38            .02         .42      .02       .31         .09
14       .32            .02        -.03      .17       .09         .06
15       .75            .03         .78     -.50       .44         .06
16      -.80            .03         .13      .17      -.79         .07
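Conceptually, each DIF contrast in Table 10 is the difference between an item's difficulty estimated separately in two groups, with person measures held fixed. The sketch below illustrates that logic for a single dichotomous item on simulated data; it is a minimal approximation, not the procedure used by the study's Rasch software, and all names and values are illustrative.

    import numpy as np

    def item_difficulty(x, theta):
        """Crude anchored estimate of one dichotomous item's difficulty:
        bisect for the delta at which the Rasch-expected total score
        matches the observed total score."""
        lo, hi = -6.0, 6.0
        target = x.sum()
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            expected = (1.0 / (1.0 + np.exp(-(theta - mid)))).sum()
            if expected > target:   # modeled too easy: raise difficulty
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    rng = np.random.default_rng(1)
    theta = rng.normal(size=400)           # anchored person measures (logits)
    group = rng.integers(0, 2, size=400)   # illustrative grouping, e.g. gender
    x = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta - 0.3))))  # true delta = 0.3

    # DIF contrast: between-group difference in anchored item difficulty.
    contrast = (item_difficulty(x[group == 0], theta[group == 0])
                - item_difficulty(x[group == 1], theta[group == 1]))
    print(f"DIF contrast: {contrast:.2f} logits")  # commonly flagged above ~0.5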
SETS-SF scoring shows three subscale components, each accounting for a unique contribution, which can effect change using an AR model for intervention in students' self-efficacy in science and computer use. This study demonstrates that the application of the Rasch model supports the 16-item, five-response SETS-SF as a valid scale for assessing self-efficacy under the Rasch model.

The greater level of self-efficacy within the scientific reasoning subscale does not carry over to the other subscales found within the measure. Males show statistically significant differences in self-efficacy from females on the video game self-efficacy subscale. Results suggest a relationship between high self-efficacy and expectancy, exemplified by the number of males reporting playing video games and their self-efficacy with respect to video game play. However, further examination suggests that this relationship was not as strong for females in the study. This difference may be due to the sociocultural contextualization of video games during females' early formation of self-efficacy. This sociocultural contextualization of video games seems to encourage traditional gender differences across age groups.

Using grade level as a proxy for age, self-efficacy with respect to the three represented subscales seems to be invariant. This supports literature suggesting that self-efficacy is set early in the individual's experiences. However, there also seems to be a threshold level of experience necessary to create a change in self-efficacy. Analysis of self-efficacy in science reasoning across gender suggests significantly higher self-efficacy within the female portion of the sample, supporting findings noted in the previous literature. The cutoff time for video game use was 7 h, while the cutoff to raise computer self-efficacy was 4 h. This suggests that a similar time threshold is necessary for scientific reasoning experiences to affect self-efficacy; further study is warranted to quantify the amount of time needed.

Domain Generality

The bifactor model illustrates how self-efficacy can act as a domain-general construct while maintaining distinct subscales that appear domain specific. The lack of correlation between the subscales, as well as the orthogonal nature of the constructs demonstrated in the analysis, raises questions about Bandura's initial conception of self-efficacy as domain specific (Bandura 1977). Because the separation between domains is not complete, we interpret these results as implying that while a portion of self-efficacy is specific to a given domain or task, there are carryover effects between related tasks or tasks displaying similar features and patterns. This accords with the literature on transfer. Traditional conceptions of transfer describe the transport of heuristics and procedural knowledge between tasks and problems that display similarity (Gentner 2010). In addition, recent
research in transfer notes this phenomenon in terms of affect and emotion, with the effects exaggerated by achievement experiences (Oikawa et al. 2011). In this light, one can conclude that success at various activities within a science classroom, or at activities incorporating instructional technology, may have carryover effects on other activities.

Interventions in Self-efficacy

The discussion of a domain-general portion of self-efficacy, or of the potential for self-efficacious affect to transfer between similar situations and to be bolstered by mastery experiences and achievement, implies that the measurement of self-efficacy is salient to education in another manner. Specifically, it lends itself to the creation of interventions designed to target latent affective traits, including self-efficacy, with the goal of improving those traits for their direct effects on cognitive abilities. This is commensurate with contemporary studies aimed at precisely this mark, which demonstrate promise in raising self-efficacy in science students (Hall et al. 2004; Haynes et al. 2011; Perry et al. 2010). Furthermore, our findings suggest that efforts to improve self-efficacy in generalized science process tasks and reasoning may extend to other tasks within a science education setting.
Conclusion

Results of this study indicate that the SETS-SF is a valid and reliable measure of the single construct known as self-efficacy related to scientific reasoning, computer technology, and video gaming in adolescent students ages 13 through 18. Our analysis of differences in video gaming self-efficacy across genders suggests that self-efficacy is domain specific rather than general. These findings support those of other recent studies (Paunonen and Hong 2010; Caprara et al. 2011; Geary 2010; Tierney and Farmer 2011). Finally, the data obtained from the SETS-SF display a high degree of fit to the Rasch model. Our results highlight the unidimensional nature of the SETS-SF and the difficulty of creating reliable, unbiased self-efficacy-related items. Information from the SETS-SF will enable field practitioners and researchers to measure self-efficacy with a high level of reliability and validity. The measurement of self-efficacy will allow educators to develop interventions targeting a change in the self-efficacy of students. Through proper targeting and measurement, it may be possible to assess the dosing of interventions related to self-efficacy. By successfully tracking adjustments in student self-efficacy, we can lead more students to choose STEM careers and assist them in
positive achievement outcomes. Further implications suggest that current technology-based science education interventions should be carefully considered, as males and females require different mechanisms to stimulate change through self-efficacy. The use of non-content-based measures also suggests the use of assessments to measure cognitive and affective traits that assist in the conceptualization of science. Future research should focus on the use of DIF to assess the behavior of items under various conditions. In addition, neural network analysis may help identify key nodes at which to effect adjustments in self-efficacy within an ever-developing tripartite model of learning.
References

Allen M, Yen W (1979) Introduction to measurement theory. Brooks/Cole, Monterey
Annetta LA, Minogue J, Holmes SY, Cheng MT (2009) Investigating the impact of video games on high school students' engagement and learning about genetics. Comput Educ 53(1):74–85
Bandura A (1977) Self-efficacy: toward a unifying theory of behavioral change. Psychol Rev 84(2):191–215
Bandura A (1982) Self-efficacy mechanism in human agency. Am Psychol 37(2):122–147
Bandura A (1986) The explanatory and predictive scope of self-efficacy theory. J Soc Clin Psychol 4(3):359–373
Bandura A (2006) Guide for constructing self-efficacy scales. In: Pajares F, Urdan T (eds) Self-efficacy beliefs of adolescents. Information Age, New York
Bandura A et al (1994) Multifaceted impact of self-efficacy beliefs on academic functioning. Child Dev 67(3):1206–1222
Beckers J, Schmidt H (2001) The structure of computer anxiety: a six-factor model. Comput Hum Behav 17(1):35–49
Bond TG, Fox CM (2007) Applying the Rasch model: fundamental measurement in the human sciences, 2nd edn. Erlbaum, Mahwah
Bong M, Skaalvik E (2003) Academic self-concept and self-efficacy: how different are they really? Educ Psychol Rev 15(1):1–40
Boone WJ, Townsend JS, Staver J (2011) Using Rasch theory to guide the practice of survey development and survey data analysis in science education and to inform science reform efforts: an exemplar utilizing STEBI self-efficacy data. Sci Educ 95:258–280
Britner SL (2008) Motivation in high school science students: a comparison of gender differences in life, physical, and earth science classes. J Res Sci Teach 45:955–970
Britner SL, Pajares F (2006) Sources of science self-efficacy beliefs of middle school students. J Res Sci Teach 43:485–499
Caprara GV, Vecchione M, Alessandri G, Gerbino M, Barbaranelli C (2011) The contribution of personality traits and self-efficacy beliefs to academic achievement: a longitudinal study. Br J Educ Psychol 81(1):78–96
Cassidy A, Eachus P (2002) Developing the computer user self-efficacy (CUSE) scale: investigating the relationship between computer self-efficacy, gender and experience with computers. J Educ Comput Res 26(2):133–153
Cheek DW (1992) Thinking constructively about science, technology, and society education. State University of New York, Albany
Colley A, Gale M, Harris T (1994) Effects of gender role identity and experience on computer attitude components. J Educ Comput Res 10(2):129–137
Cox R, Smitsman A (2008) Special section: towards an embodiment of goals. Theory Psychol 18(3):317–339
Dimitrov D (2012) Statistical methods for validation of assessment scale data in counseling and related fields. American Counseling Association, Alexandria
Embretson S, Gorin J (2001) Improving construct validity with cognitive psychology principles. J Educ Meas 38(4):343–368
Enyedy N, Goldberg J (2004) Inquiry in interaction: how local adaptations of curricula shape classroom communities. J Res Sci Teach 41(9):905–935
Geary D (2010) Evolution and education. Psicothema 22(1):35–40
Gentner D (2010) Bootstrapping the mind: analogical processes and symbol systems. Cognit Sci 34(5):752–775
Hall N, Hladkyj S, Perry R, Ruthig J (2004) The role of attributional retraining and elaborative learning in college students' academic development. J Soc Psychol 144(6):591–612
Harkness W (1965) Properties of the extended hypergeometric distribution. Ann Math Stat 36(3):938–945
Haynes T, Clifton R, Daniels L, Perry R, Chipperfield J, Ruthig J (2011) Attributional retraining: reducing the likelihood of failure. Soc Psychol Educ 14(1):75–92
Hays R, Brown J, Brown L, Spritzer K, Crall J (2008) Classical test theory and item response theory analyses of multi-item scales assessing parents' perceptions of their children's dental care. Med Care 44(11):S60–S68
Henderson P, Peterson R (2004) Mental accounting and categorization. Organ Behav Hum Decis Process 51(1):92–117
Judge T (2009) Core self-evaluations and work success. Curr Dir Psychol Sci 18(1):58–62
Ketelhut DJ (2010) Assessing gaming, computer and scientific inquiry self-efficacy in a virtual environment. In: Annetta L, Bronsak S (eds) Serious educational games assessment: practical methods and models for educational games, simulations and virtual worlds. Sense, New York
Kyngdon A (2009) The Rasch model from the perspective of the representation theory of measurement. Theory Psychol 18(1):89–109
Lamb RL, Annetta L (2012a) The use of online modules and the effect on student outcomes in a high school chemistry class. J Sci Educ Technol 22(5):603–613
Lamb R, Annetta L (2012b) Influences of gender on computer simulation outcomes. Meridian 13(1):1–4
Lamb RL, Annetta L, Meldrum J, Vallett D (2012) Measuring science interest: Rasch validation of the science interest survey. Int J Sci Math Educ 10(3):643–668
Lamb R, Annetta L, Vallett D, Sadler T (2014) Cognitive diagnostic like approaches using neural network analysis of serious educational video games. Comput Educ 70:92–104
Lawson A (2004) The nature and development of scientific reasoning: a synthetic view. Int J Sci Math Educ 2(3):307–338
Lent R, Lopez F, Bieschke K (1991) Mathematics self-efficacy: sources and relation to science-based career choice. J Couns Psychol 38(4):424–430
Linacre JM (1994) Sample size and item calibration stability. Rasch Meas Trans 7(4):324
Linacre JM (1999) Investigating rating scale category utility. J Outcome Meas 3(2):103–122
Linacre JM (2009a) Practical Rasch measurement—core topics (online course)
Linacre JM (2009b) WINSTEPS (software and user's guide). Winsteps, Chicago
Liu X (2010) Using and developing measurement instruments in science education: a Rasch modeling approach. Information Age, Charlotte
Mandinach EB, Lewis A (2006) The current context of research: seeking a balance between rigor and relevance. Paper presented at the annual meeting of the American Educational Research Association (AERA), San Francisco, CA
McGrath M, Braunstein A (1997) The prediction of freshmen attrition: an examination of the importance of certain background, academic, financial, and social factors. Univ Stud J 31:396–408
Messick S (1989) Validity. In: Linn RL (ed) Educational measurement, 3rd edn. Macmillan, New York, pp 13–103
Messick S (1996a) Standards-based score interpretation: establishing valid grounds for valid inferences. In: Proceedings of the joint conference on standard setting for large-scale assessments, sponsored by the National Assessment Governing Board and the National Center for Education Statistics. Government Printing Office, Washington, DC
Messick S (1996b) Validity of performance assessment. In: Phillips G (ed) Technical issues in large-scale performance assessment. National Center for Education Statistics, Washington, DC
Messick S (1998) Test validity: a matter of consequence. Soc Indic Res 45(1–3):35–44
Mischel W (1973) Toward a cognitive social learning reconceptualization of personality. Psychol Rev 80(4):252–283
Oikawa M, Aarts H, Oikawa H (2011) There is a fire burning in my heart: the role of causal attribution in affect transfer. Cogn Emot 25(1):156–163
Pajares F (1996) Self-efficacy beliefs in academic settings. Rev Educ Res 66(4):543–578
Pajares F, Miller DM (1995) Mathematics self-efficacy and mathematics performance: the need for specificity of assessment. J Couns Psychol 42:190–198
Paunonen S, Hong R (2010) Self-efficacy and the prediction of domain-specific cognitive abilities. J Pers 78(1):339–360
Perry R, Stupnisky R, Hall N, Chipperfield J, Weiner B (2010) Bad starts and better finishes: attributional retraining and initial performance in competitive achievement settings. J Soc Clin Psychol 29(6):668–700
Porter S, Whitcomb M, Weitzer W (2004) Multiple surveys of students and survey fatigue. New Dir Inst Res 121:63–73
Rasch G (1960) Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institute, Copenhagen
Raykov T (2009) Evaluation of scale reliability for unidimensional measures using latent variable modeling. Meas Eval Couns Dev 42(3):223–232
Raykov T, Dimitrov D, Asparouhov T (2010) Evaluation of scale reliability with binary measures using latent variable modeling. Struct Equ Model Multidiscip J 17(2):265–279
Sadler PM, Sonnert G, Hazari Z, Tai R (2012) Stability and volatility of STEM career interest in high school: a gender study. Sci Educ 96(3):411–427
Saks A (1997) Transfer of training and self-efficacy: what is the dilemma? Appl Psychol 46(4):365–370
Savage S, Waldman D (2008) Learning and fatigue during choice experiments: a comparison of online and mail survey modes. J Appl Econom 23(3):351–371
Scherbaum C, Cohen-Charash Y, Kern M (2006) Measuring general self-efficacy: a comparison of three measures using item response theory. Educ Psychol Meas 66(6):1047–1063
Schunk D (1985) Self-efficacy and classroom learning. Psychol Sch 22(2):208–223
Schunk D (1989) Self-efficacy and achievement behaviors. Educ Psychol Rev 1(3):173–208
Skaalvik E, Skaalvik S (2007) Dimensions of teacher self-efficacy and relations with strain factors, perceived collective teacher efficacy, and teacher burnout. J Educ Psychol 99(3):611–625
Stewart-Brown S, Tennant A, Tennant R, Platt S, Parkinson J, Weich S (2009) Internal construct validity of the Warwick-Edinburgh Mental Well-being Scale: a Rasch analysis using data from the Scottish Health Education Population Survey. Health Qual Life Outcomes 7(15):1–8
Stone C (2005) Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. J Educ Meas 37(1):58–75
Strecher V, DeVellis B, Marshall B, Rosenstock I (1986) The role of self-efficacy in achieving health behavior change. Health Educ Behav 13(1):73–92
Subotnik R, Orland M, Rayhack K, Schuck J (2009) Identifying and developing talent in science, technology, engineering, and mathematics (STEM): an agenda for research, policy and practice. In: Shavinina LV (ed) International handbook, part XII, pp 1313–1326
Tierney P, Farmer S (2011) Creative self-efficacy development and creative performance over time. J Appl Psychol 96(2):277–293
United States Census Bureau (2012) QuickFacts. Retrieved 27 June 2013, from http://quickfacts.census.gov
Usher E (2009) Sources of middle school students' self-efficacy in mathematics: a qualitative investigation. Am Educ Res J 46(1):275–314
Valla JM, Williams WM (2012) Increasing achievement and higher-education representation of under-represented groups in science, technology, engineering, and mathematics fields: a review of current K-12 intervention programs. J Women Minor Sci Eng 18(1):21–53
Vancouver J, More K, Yoder R (2008) Self-efficacy and resource allocation: support for a nonmonotonic, discontinuous model. J Appl Psychol 93(1):35–47
Vidler DC, Rawan HR (1974) Construct validation of a scale of academic curiosity. Psychol Rep 35(1):263–266
Vorderer P, Klimmt C, Ritterfeld U (2006) Enjoyment: at the heart of media entertainment. Commun Theory 14(4):388–408
Wright BD (1968) Sample-free test calibration and person measurement. Paper presented at the National Seminar on Adult Education Research, Chicago, IL
Wright BD (1984) Despair and hope for educational measurement. Contemp Educ Rev 3(1):281–288
Wright BD (1996) Reliability and separation. Rasch Meas Trans 9(4):472
Wright BD, Stone MH (1979) Best test design. Mesa Press, Chicago
Wright BD, Stone MH (2004) Making measures. Phaneron Press, Chicago
Zeldin A, Britner S, Pajares F (2007) A comparative study of the self-efficacy beliefs of successful men and women in mathematics, science and technology careers. J Res Sci Teach 45(9):1036–1058
Zimmerman B (1997) Social origins of self-regulatory competence. Educ Psychol 32(4):195–208