Insights into Learning Disabilities 12(1), 73-90, 2015
Copyright @ by LDW 2015
Behavior Assessment Using Direct Behavior Rating (DBR): A Study on the Criterion Validity of DBR Single-Item Scales
Christian Huber, University of Potsdam, Germany
Christian Rietz, University of Cologne, Germany

Progress monitoring assessments play an important role in evaluating interventions in educational settings and psychology. This article introduces the method of "Direct Behavior Rating" (DBR) in a study with 133 participants. It evaluated to what extent DBR raters, observing the behavior of a target student in eight experimental videos, were able to produce results similar to those of two experienced school psychologists using the method of systematic direct behavior observation. Results suggest that DBR ratings possess high structural conformity with the results of systematic direct behavior observation, but only moderate numerical conformity with the absolute scores recorded. Findings are critically discussed, along with DBR's possible implementation in school settings.
Keywords: Behavioral Assessment; Curriculum-Based Measurement; Developmental Assessment; State Assessment; Direct Behavior Rating.

Introduction

Relevance of Means to Regularly Monitor Behavior
The challenges that arise as a result of the current trend towards inclusive education also require those working in the German school system to think about new approaches to supporting children and young adults with special educational needs. Numerous earlier studies have shown that teachers are better able to implement intervention measures when they receive regular feedback on the effectiveness of the measures introduced (Hattie, 2008; Hunt, Alwell, Farron-Davis, & Goetz, 1996). This is true for learning disorders and behavioral disorders alike. In everyday school life both phenomena, learning and behavioral difficulties, are strongly connected (Hölling, Schlack, Petermann, Ravens-Sieberer, & Mauz, 2014). Results from numerous studies suggest that learning problems can be caused by behavioral problems and vice versa. Professionals such as teachers or school psychologists are therefore often
asked to stabilize the classroom behavior of a struggling student first in order to foster his academic competences (e.g. in maths or spelling). Building on these insights, it is apparent that regular monitoring of the measures introduced (e.g. Response to Intervention) is required in staged prevention-orientated support programs (Huber & Grosche, 2012; Johnson, Fuchs, & McKnight, 2006). Although an increasing number of instruments that can be used in the evaluation of learning progress, known as curriculum-based measurements, is already available for the area of cognitive competences and skills (Klauer, 2006; Strathmann & Klauer, 2012), the question arises of how a system that allows the assessment of changes over time could be implemented in order to monitor behavior. Here, too, close evaluation is an important building block, as it makes it possible to separate effective from ineffective interventions. A review of the existing studies on behavioral assessment shows that the discussion focuses on the role of behavior observation as a means of precisely capturing and describing behavioral problems.

Methods of Behavioral Observation
Traditionally, two main methods of behavioral assessment are recognized, known as "systematic direct observation" and "behavioral assessment" (Amelang & Schmidt-Atzert, 2006; Christ, Riley Tillman, & Chafouleas, 2009; Schmidt-Atzert, 2012; Wittchen & Hoyer, 2011). In the following, both approaches are briefly summarized.

Systematic Direct Observation (SDO). This form represents the hitherto most complex and most precise form of behavioral observation. During a systematic direct behavioral assessment, the behavior of a person is observed in a differentiated and quantifiable manner (Schmidt-Atzert, 2012). The data acquisition usually takes the form of counting or making time measurements. To improve accuracy, this method frequently relies on video recordings, which also allow repeated analysis of a complex situation. The validity, reliability, and objectivity of such systematic direct behavioral observations have been questioned by many authors, since systematic and random assessment errors can distort the results of such behavioral observations (Schmidt-Atzert, 2010). For this reason, Amelang and Schmidt-Atzert (2006) have specified that the actual quality of observation needs to be re-evaluated for every behavioral observation. While it is true that systematic direct observation can therefore be considered a comparatively precise form of behavioral assessment, it is not a viable option for use in the school environment because of the time it requires. The time requirement increases considerably if the method of systematic direct behavioral observation is to be employed in order to assess progress and changes.

Behavioral Assessment. This form of evaluating behavior is much less complex but, in comparison with systematic direct observation, it delivers
more abstract observational data (Schmidt-Atzert, 2012). In this method, the observer evaluates the behavior of a person over a particular period of time using a single- or multi-item scale. Each area of behavior usually requires several items to be assessed. Data acquisition is normally effected with the help of a standardized questionnaire. Many procedures moreover offer the advantage of standardization, so that the statements can be assessed in relation to a defined (age-related) criterion. While systematic direct observation produces absolute and very precise observational data which exhibit a close correlation with the observational situation, the behavioral assessment method, by contrast, delivers coarse observation data portraying several observational situations summarized over a longer time interval. When both methods are considered with regard to their suitability for the measurement of a sequence of behavioral events in the school routine, the time required to carry out the two - systematic direct observation and behavioral assessment - means that they are unworkable in this context. Christ et al. (2009) and S. M. Chafouleas (2011) therefore propose direct behavior rating as a practical alternative method of collecting behavioral development data in the school situation.

Direct Behavior Rating (DBR). DBR is a method of behavioral progress monitoring that was developed by a working group around Sandra Chafouleas in 2002 (Chafouleas, Riley Tillman, & Sassu, 2002). According to Christ et al. (2009), DBR is a combination of systematic direct observation and behavior assessment. It is defined as a targeted form of behavioral assessment of a previously defined behavioral sequence (e.g. academic engagement) immediately following an observational situation (e.g. a lesson) with the help of a multi-item scale. In contrast to behavioral assessments, DBR measurements are high-frequency and continuous. Thus, the assessments can be repeated several times per day (e.g. after every school lesson). The high frequency of the measurements, combined with the temporal proximity between the observation situation and its assessment, is intended to reduce assessment errors. Due to the highly efficient nature of the data acquisition technique, it is also possible to carry out a measurement for every lesson during the school day. This means that a large number of measurements can be compiled in a short time, and these can be converted to graphs showing changes over time. Teachers would therefore be able to document the behavioral development of their students without excessive effort. By analogy with the debate concerning the relative quality of observational ratings obtained using systematic direct observation and behavioral assessment, the same question must be posed regarding the use of DBR scales - namely, to what extent and under which circumstances the data thus collected are objective, reliable, and valid. The current study focuses on the aspect of validity, or more precisely, on criterion validity.
Current Status of Research
In recent decades there has been a broad discussion of the test quality criteria that the technique of systematic direct observation should meet. Research on SDO showed that the quality of the observation results is strongly influenced by errors in perception and appraisal (Schmidt-Atzert, 2010; Schmidt-Atzert, 2012; Westhoff, 2010). Thus, there is an intensive discussion of the extent to which criteria of test quality can be applied to the field of SDO. Some authors suggest that scientists should develop new criteria for SDO. Cone (1988) argues that there are at most conceptual but no methodical differences between SDO and psychometric tests. This means that objectivity, validity, and reliability are also relevant criteria for SDO. Amelang (2006) derives from this that test quality must be verified for every single observation. Following this point of view, the discussion of the test quality of SDO has shifted in recent years from a general to a more methodological debate. In the past, most authors agreed that criterion-related validity, interrater reliability, and test-retest reliability are also important measures for SDO, and have to be analyzed for every single observation. In sum, SDO is regarded as the gold standard for behavioral assessment on the one hand. On the other hand, it is an uneconomical method that is hard to implement in everyday school life. DBR as a method draws on this criticism.

To date, the observation quality of DBR as a method has been the subject of 17 US studies and two reviews (Huber & Rietz). Apart from two exceptions, all studies were carried out by a research group led by Sandra Chafouleas (S. M. Chafouleas, 2011b). The majority of the studies had the objective of obtaining information relating to differentiation capacity, objectivity, and/or inter-rater reliability with the aid of generalisation studies. On the basis of the few currently available findings, it can be cautiously concluded that the differentiation capacity, objectivity, and/or inter-rater reliability of DBR ratings are most adequate in connection with the evaluation of "academic engagement" and "disruptive behavior". Thus, there was only minor variance between raters with regard to their assessment of academic engagement and lesson disruption, which leads to the conclusion that the observed behavioral sequences were perceived by the raters in a comparable manner (Chafouleas, Christ, Riley Tillman, Briesch, & Chanese, 2007; Christ, Riley Tillman, Chafouleas, & Boice, 2010). The number of measurement intervals that were required to achieve a satisfactory measurement accuracy varied markedly across the individual studies. The findings range from three measurements (Briesch, Chafouleas, & Riley Tillman, 2010; S. M. Chafouleas, 2011b) up to more than 20 measurements (Volpe & Briesch, 2012). There are only limited data relating to the stability of the measurements and therefore to test-retest reliability (cf. Riley Tillman, Christ, Chafouleas, Mallach, & Briesch, 2011). Furthermore, Riley Tillman, Chafouleas,
Christ, Briesch, and LeBel (2009) were able to show that the accuracy of observation increased if the definition of the observed behavior was formulated globally and less specifically. However, this finding is contradicted by Volpe and Briesch (2012). In their study, these authors compared single-item scales (SISs, on which only a single global behavior is assessed) with multiple-item scales (MISs, on which several specific types of behavior are assessed) and came to the conclusion that when MISs are used, fewer measurement points are required in comparison with SISs in order to achieve satisfactory measurement accuracy. A problematic aspect of the majority of the available studies is that very small rater sample sizes (from N = 2 to 11) were used. The rationale behind DBR initially suggests that this type of methodology should be particularly useful for describing changes in behavior over time. Yet individual research groups attach a broader significance to the current findings (S. M. Chafouleas, 2011a; Chafouleas, Kilgus, Riley Tillman, Jaffery, & Harrison; Christ et al., 2009; Volpe & Briesch, 2012). These authors assume that, if a certain critical number of measurement points is exceeded, the inter-rater reliability of DBR raters also allows status-related diagnostic statements to be made about the "normality" of a behavior. In the opinion of the current authors, however, the previously published studies do not allow such a clear conclusion to be drawn.

Research Question
The present study, as was the case for the previous US-based studies, is aimed at understanding the extent to which reliable state and developmental information can be obtained using DBR. The main focus of this paper is the criterion validity of two DBR-SI scales. The notion of criterion validity holds that, in an "ideal world", different raters using systems to evaluate status and changes over time would assess the same situation identically on a scale of 0 to 10, independently of one another (Figure 1a). In statistical terms, this means that the raters assign identical scores. In the following, we refer to this aspect as "numeric invariance". However, this conformity of absolute scores is not a mandatory precondition if the method is to be employed to evaluate changes over time. In this case, the most important criterion is correspondence between the profiles documented by raters, and therefore a relative conformity with regard to the scores recorded by raters. Statistically, this corresponds to a correlation of 1.0, while, at the same time, the scores recorded by the individual raters can differ. In the following, we use the term "structural invariance" for this aspect. This means that the individual observation measurements recorded by different raters do not have to agree exactly, but the changes occurring between different observation time points must be perceived as the same. To give an example: all observers should rate any behavioral changes shown by a single person that occur between situation 1 and situation 2 as improvements (Figure 1b).
The baseline score for behavior can vary across the raters, however (e.g. rater 1: from 4 to 7; rater 2: from 5 to 8; rater 3: from 6 to 9).

Figure 1. Comparison of numerical invariance and structural invariance
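To make the distinction concrete, the following minimal sketch (with hypothetical scores, not data from this study) shows two rater profiles that differ only by a constant offset: their correlation is 1.0 (structural invariance), while their absolute scores differ, so numerical invariance does not hold.

```python
# Hypothetical example: two raters observing the same eight situations.
# Rater 2 systematically scores one point higher than rater 1.
import numpy as np

rater1 = np.array([2.0, 5.0, 3.0, 6.0, 8.0, 4.0, 7.0, 9.0])  # eight observation situations
rater2 = rater1 + 1.0                                         # same profile, shifted upward

structural = np.corrcoef(rater1, rater2)[0, 1]  # 1.0 -> identical profile shape
numerical = np.mean(rater2 - rater1)            # 1.0 -> constant offset, absolute scores differ

print(f"profile correlation: {structural:.2f}, mean score difference: {numerical:.2f}")
```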
When it comes to the verification of criterion validity, we can thus distinguish a variant A with a state assessment orientation (conformity of measurements, or numeric invariance) and a variant B with a developmental assessment orientation (conformity of rater profiles, or structural invariance). If DBR is to be shown to be suitable for assessing status, it is essential that the collated measurements exhibit a high level of correspondence with the actual behavior of the observed person. However, it is difficult to determine what the "true behavior" is, because this is not directly measured but can only be determined on the basis of derived data. Following the research design of Riley Tillman, Christ, Chafouleas, Mallach, and Briesch (2011), in the present study the "true behavior" was operationalized with the help of the ratings documented during systematic direct behavior observation (henceforth SDO) by two experienced raters. With regard to our research objective, the question to be answered was to what extent the DBR ratings and the ratings of the behavioral observers ("true score") correspond.

Problem statement A (criterion validity, numerical invariance): To what extent do the ratings documented by DBR raters correspond to the "true" behavior of a target person?

Problem statement B (criterion validity, structural invariance): To what extent do the profiles documented by DBR raters correspond to the "true" behavior profile of a target person?

Methodology

Design
It was necessary to incorporate in the study design a "true" behavior score against which the accuracy of the DBR scales could be tested in order to determine the criterion validity. For this purpose, we employed a (randomized)
design involving the use of video material. We asked 133 study subjects to observe a previously defined target behavior of a target student in eight separate experimental videos; in each of the eight situations, they were required to assess the behavior with the help of a behavioral definition and a DBR single-item scale. Both the target behavior and the target student were the same in all eight experimental videos. In order to investigate criterion validity, the assessments of the DBR raters were compared with the "true behavior score" of the student. Following Briesch et al. (2010), this assumed "true behavior score" was operationalized with the help of systematic direct behavior observation by two experienced school psychologists. Figure 2 shows a summary of the design of the study.

Figure 2. Design of study
The study subjects were randomly assigned to two trial groups. The first trial group (henceforth TG_Q) was given a qualitative rating scale with which to assess the target student; they were required to rate the quality of the target behavior on a scale from 0 (low quality) to 10 (high quality). The second trial group (henceforth TG_T) was given a quantitative rating scale; they were to use this to document the percentage of time over which the target behavior occurred, on a scale of 0% (target behavior did not occur) to 100% (target behavior was observable over the entire observation period) (see below).
Test Subjects
In order to be able to investigate as many subjects as possible under the same conditions simultaneously, we used a random sample of trainee teachers (primary level with 'inclusive education' as main subject). This approach meant that it was possible to recruit the same random sample at two separate times, without having to inform them of the second experimental session at the time of the first experiment. Our random sample consisted of a total of 133 trainee teachers. All trainees were in the first subject-related semester of their course, and at the time of the experiment they thus had no practical school-related experience beyond what they knew from their own time at school. Both experiments were conducted at the end of the winter semester, in mid- and late January 2014, within the context of a lecture. Table 1 shows subjects ordered by trial group, gender, and age.
Table 1. The sample group

TG          n (%)          Sex: female   male   n.d.   Age: M (s)    n.d.
1 (TG_Q)    62 (46.6%)     55            7      -      21.9 (4.2)    7
2 (TG_T)    71 (53.4%)     61            9      1      22.4 (5.3)    3
Total       133 (100%)     116           16     1      22.2 (4.8)    10

Notes: n.d. = no data, M = mean score, s = standard deviation
Trainees were allocated to trial groups on the basis of their dates of birth; this was thus a form of random assignment. Gender distribution (measurement time point 1: 87.8% female participants; measurement time point 2: 90.0% female participants) corresponded to the normal gender ratio among primary school level trainee teachers.

Material
Video Material. A total of eight two-minute experimental videos were produced for the study. The videos show a group of seven students and a year 5 teacher (special needs school; emotional and social development). One of the seven students was selected as the target student and was displayed in a window in the upper right corner in all videos. The subject of all eight videos was a series of English lessons. All members of the class had German as their mother tongue and learned English as a second language. In the lessons shown, students were required to learn the meaning of everyday English. Before the lessons were recorded, a target behavior (academic engagement) and a target student were selected. The student selected was known to have a highly variable level of active
participation in lessons. Only video material in which the target student was clearly visible and easy to understand acoustically was used for the experiment. From this material, eight passages of the recordings were selected at random. Then, from these passages, two-minute behavioral sequences were selected that were easily comprehensible in terms of content for an external observer.

DBR Single-Item Scale (SIS). The target behavior was assessed by the test subjects using an SIS. In studies by Chafouleas (2011) and Briesch, Kilgus, Chafouleas, Riley Tillman, and Christ (2012), a percentage time scale divided into eleven stages (graduated in steps of 10 from 0% to 100%) was found to be a reliable and stable indicator. Volpe and Briesch (2012), on the other hand, used an eight-stage scale to assess the quality of behavior. These authors conclude that this scale provides a high level of observational accuracy. In order to avoid distortion of the results due to the different graduations of the two scales, the quality scale was extended from eight to eleven stages. Figure 3 shows the scales used.

Figure 3. The two DBR scales for the two test groups
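As an illustration of the two resulting eleven-stage scales, the sketch below (an assumption for illustration, not material from the study) defines the quality scale used by TG_Q and the percentage time scale used by TG_T, together with a small helper that snaps a rater's continuous estimate to the nearest scale step.

```python
# Illustrative only: the two eleven-stage single-item scales, and a helper that
# maps a continuous estimate onto the nearest available scale step.
QUALITY_SCALE = list(range(0, 11))        # TG_Q: 0 (low quality) ... 10 (high quality)
TIME_SCALE = list(range(0, 101, 10))      # TG_T: 0%, 10%, ..., 100% of observed time

def snap_to_scale(estimate, scale):
    """Return the scale step closest to the rater's continuous estimate."""
    return min(scale, key=lambda step: abs(step - estimate))

# A rater who judges the target student to be engaged for roughly 62% of the
# observation period would mark the 60% step on the TG_T scale.
print(snap_to_scale(62.0, TIME_SCALE))     # 60
print(snap_to_scale(7.4, QUALITY_SCALE))   # 7
```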
Target Behavior. The target behavior to be observed was operationalized using the systems employed by Chafouleas (2011) and Volpe and Briesch (2012). These authors found that their scales allowed observational ratings with a high level of accuracy to be documented for the areas of "academic engagement", "disruptive behavior", and "respectful behavior". Since all our test subjects were in the first semester of their teacher training course at the time of participating in the study (see above) and therefore had no specialist knowledge of social and emotional development disorders, "academic engagement" was selected as the observation target. In order to ensure that all test subjects had the same perception of what target behavior was to be observed, the following definition of the observation target was made available to them in writing (taken from Chafouleas, Jaffery, Riley Tillman, Christ, & Sen, 2013).
“Actively or passively participating in the classroom activity. Examples: writing, raising his or her hand, answering a question, talking about a lesson, listening to the teacher, reading silently, or looking at instructional materials)” (Chafouleas et al., 2013, p. 43). “Academic engagement” was therefore defined as all types of behavior that could be interpreted either as active participation in the events occurring during the lesson (e.g. in the form of writing, reporting, giving answers, talking about the subject) or as passive participation in the events occurring during the lesson (e.g. looking at the teacher, reading, looking at the material).
Procedure
In the experimental design selected, eight two-minute videos were presented in sequence to the test subjects. Each of the 133 test subjects was randomly assigned to one of the two test groups (TG_Q and TG_T) on the basis of their date of birth. Before the start of the study, all test subjects were informed of the procedure to be used in the experiment. After being instructed with regard to the procedure, the test subjects received written information about the assessment scale of their respective test group. They were not informed about the assessment scale to be used by the other test group. The target student was identified for all test subjects before they were separated into their groups. The test subjects of the two test groups therefore had different assessment scales, but the student and the experimental conditions were identical for both groups. So that all test subjects could familiarise themselves with the experimental situation, the target student to be assessed, and the use of the assessment scale, they were first instructed in how to undertake behavioral assessment in a three-minute training session (Chafouleas, Kilgus, Riley Tillman, Jaffery, & Harrison, 2012). After any queries relating to the material and experimental procedure had been answered, the experimental phase began. After each one of the eight video sequences, the test subjects were given approximately one minute to assess the academic engagement of the target student. The systematic direct observation of the sequences was performed in a similar fashion. This behavioral observation was performed by two experienced school psychologists (male and female) with additional qualifications in behavioral observation. Unlike the DBR raters, however, the school psychologists were allowed to view the video material repeatedly. In order to determine the relevant time percentages, the school psychologists used a stopwatch to measure the exact duration of participation. In order to assess the quality of the target behavior, the school psychologists used an observation matrix in which the various categories of the observation definition were first selectively assessed; a global assessment was provided at the end of each video. The
school psychologists and the DBR raters analyzed the eight video sequences in the same order and using the same operationalization of the target behavior. Where there were discrepancies in their assessments, the school psychologists were asked to come to a consensus on which of the behavioral sequences were to be assessed as academic engagement. In this way, eight percentage time ratings and eight qualitative ratings were documented using methods analogous to those employed by the two test groups; these represented the "true" academic engagement of the target student.

Hypotheses
Available for data analysis were the eight scores documented by the DBR raters of the two test groups and the analogous sets of scores recorded by the SDO raters. Table 2 provides an overview of the variables and their group-specific designations.

Table 2. Ratings

                                    TG_Q       TG_T
DBR raters                          Quality    Time
                                    r_ig1      r_ig2
School psychologists (criterion)    Quality    Time
                                    r_cg1      r_cg2

Notes: MTP = measurement time point, TG = test group, r_i = rating 1 to rating 8, t1 = first measurement time point, g1 = TG_Q, g2 = TG_T, c = criterion
Table 3 and Table 4 summarize the descriptive statistics for the DBR and SDO scores.

Table 3. Descriptive statistics (DBR raters)

TG            Rating:     1     2     3     4     5     6     7     8     M
TG_Q (n=62)   M         3.4   7.5   4.0   1.3   8.9   4.0   0.5   2.1   4.0
              s         1.6   1.5   1.8   1.3   1.3   1.9   0.8   1.3   0.8
TG_T (n=71)   M         4.5   8.2   4.8   1.3   9.2   5.2   0.9   2.3   4.6
              s         2.1   1.4   1.9   1.2   1.4   2.3   1.3   1.6   0.9

Notes: TG = test group, M = mean, s = standard deviation
Table 4. Descriptive statistics (SDO raters)

CV     Rater   Rating:      1      2      3      4      5      6      7      8      M
CV1    R1                 3.0    7.0    4.0    1.0    9.0    5.0    0.0    2.0    3.9
       R2                 1.0    9.0    5.0    3.0   10.0    7.0    1.0    2.0    4.8
       M                  2.0    8.0    4.5    2.0    9.5    6.0    0.5    2.0    4.3
CV2    R1                75.9   92.6   69.3   37.4   96.2   88.9   11.6   61.0   66.6
       R2                69.4   78.9   59.4   16.5   94.3   42.4    5.3   49.0   51.9
       M                 72.7   85.8   64.4   27.0   95.3   65.7    8.5   55.0   59.3

Notes: CV = criterion variable, R = rater, M = mean
Because all eight ratings were collected for the same target student and the same target behavior in each case, the eight scores can be aggregated per cell to provide a common mean score (thus, for example, the mean DBR rating of test group TG_Q across the eight videos). Problem statement A (criterion validity, numerical invariance) can be analyzed by comparing the scores r_ig1 with the "criterion scores" kr_ig1, or the scores r_ig2 with the SDO ratings kr_ig2. Assuming that there is strict invariance, the eight DBR ratings can be compared with the SDO ratings of the school psychologists to determine numerical differences. In both test groups, the null hypothesis ("desired hypothesis"¹)

H0: μi(gi) = μi(kgi)

was tested using the multivariate Hotelling T² test (cf. Bortz & Schuster, 2010, pp. 472-473), while the null hypothesis for the mean scores,

H0: μ̄(gi) = μ̄(kgi),

was tested using a one-sample t test.

¹ With regard to hypotheses relating to numerical invariance, the null hypothesis is the so-called "desired hypothesis" and is thus relevant to the test for β error. The α-level was therefore specified as 0.10. Accordingly, no effect sizes were calculated.

In the case of problem statement B (criterion validity, structural invariance), the relevant issue is the correlation between the ratings documented by the DBR and SDO raters under the two experimental conditions. In order to evaluate this, all correlations between the individual ratings of the DBR raters and the corresponding ratings of the SDO raters ("criterion") were determined, subjected to Fisher transformation, averaged, and retransformed to provide a mean correlation coefficient. This results in the null hypothesis

H0: ρ(r_igi, kr_gi) ≤ 0,

which was tested using a t test for both test groups.
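For readers who want to retrace these analyses, the following sketch (not the authors' code; the DBR rater data are simulated and only the criterion vector is borrowed from Table 4) runs a one-sample Hotelling T² test of a group's eight DBR means against the fixed SDO criterion vector, a one-sample t test at the level of the mean scores, and the Fisher-z averaging of the per-rater correlations with the criterion.

```python
# Minimal sketch with simulated rater data; the criterion vector is the CV1 "M"
# row from Table 4. With n = 62 raters and p = 8 ratings, the degrees of freedom
# of the equivalent F test are (8, 54), as reported for TG_Q.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

criterion = np.array([2.0, 8.0, 4.5, 2.0, 9.5, 6.0, 0.5, 2.0])        # SDO "true" scores (CV1)
dbr = np.clip(criterion + rng.normal(0.0, 1.5, size=(62, 8)), 0, 10)  # 62 simulated DBR raters

# (a) Numerical invariance: one-sample Hotelling T^2 test, H0: mu = criterion.
n, p = dbr.shape
diff = dbr.mean(axis=0) - criterion
S = np.cov(dbr, rowvar=False)                    # 8 x 8 sample covariance matrix
t2 = n * diff @ np.linalg.solve(S, diff)         # Hotelling T^2 statistic
f_stat = (n - p) / (p * (n - 1)) * t2            # equivalent F statistic
p_multivariate = stats.f.sf(f_stat, p, n - p)    # df1 = 8, df2 = n - 8

# Univariate follow-up at the level of the mean scores (one-sample t test).
t_mean, p_mean = stats.ttest_1samp(dbr.mean(axis=1), criterion.mean())

# (b) Structural invariance: correlate each rater's profile with the criterion,
# average via Fisher's z transformation, and transform back.
r_per_rater = np.array([np.corrcoef(row, criterion)[0, 1] for row in dbr])
r_mean = np.tanh(np.arctanh(r_per_rater).mean())

print(f"Hotelling T^2 = {t2:.2f}, F({p}, {n - p}) = {f_stat:.2f}, p = {p_multivariate:.4f}")
print(f"mean-score t test: t = {t_mean:.2f}, p = {p_mean:.4f}")
print(f"Fisher-averaged correlation with the criterion: r = {r_mean:.2f}")
```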
Results

Problem statement A (Criterion Validity, Numerical Invariance)
In connection with problem statement A, it was necessary to determine the extent to which the average DBR ratings for each video deviate numerically from the ratings of the SDO raters (i.e. the criterion), both with regard to quality (TG_Q) and the estimated time (TG_T). Figure 4 shows the ratings of the SDO raters (dashed line) and the average ratings of the DBR raters (solid line) for both test groups. As can be seen from the graph, the DBR raters showed a tendency to slightly underestimate the "true score" (as recorded by the experts).

Figure 4. Mean ratings of the two test groups - comparison of SDO raters and DBR raters
In both the TG_Q and the TG_T group, it appears at first glance as if there are no major differences between the DBR and the SDO ratings, i.e. between the assessments of the students and those of the "experts". However, the results of the Hotelling T² hypothesis tests show that the DBR and the SDO ratings do not correspond: in both test groups (TG_Q and TG_T) there was a significant difference between the students and the experts (TG_Q: T² = 45.02, df1 = 8, df2 = 54, p < .001; TG_T: T² = 44.59, df1 = 8, df2 = 63, p < .001).
Hence, the null hypothesis must be rejected in the light of the result of the multivariate comparison of mean scores. A univariate test at the level of the mean scores shows that in the TG_Q group there is a less marked difference between DBR and SDO ratings at MTP1 (TG_Q: t = -3.60, df = 62, p < .001). In the TG_T group, on the other hand, the differences are highly significant (TG_T: t = -12.72, df = 70, p < .001). In summary, the findings must be interpreted as indicating a lack of numeric invariance.

Problem Statement B (Criterion Validity, Structural Invariance)
In the calculations relevant to this hypothesis, all individual observers were correlated with the criterion. The mean correlation score in the TG_Q group was r = 0.91 (t = 18.23, df = 60, p