INTERNATIONAL JOURNAL OF TESTING, 7(1), 39–52 Copyright © 2007, Lawrence Erlbaum Associates, Inc.
Potential Sources of Differential Item Functioning in the Adaptation of Tests

Paula Elosua and Alicia López-Jaúregui
Department of Psychology, University of the Basque Country, Spain
This report presents a classification of differential item functioning (DIF) sources that affect the adaptation of tests. The classification is based on linguistic and cultural criteria. Four general DIF sources are distinguished: cultural relevance, translation problems, morphosyntactic differences, and semantic differences. Their influence on the adaptation of tests is greater between languages belonging to different linguistic families, and it grows as the cultural distance between the groups increases. The model is contrasted empirically through the analysis of 53 verbal items administered to a Spanish reference sample and a Basque focal sample. DIF was assessed with the Mantel–Haenszel procedure. In addition, a group of experts cataloged each of the items in terms of the DIF sources defined by the model. The level of agreement between the two procedures, judgmental and statistical, stands at 69.8%.

Keywords: DIF, test adaptation, Mantel–Haenszel, causes of DIF
There are 20 official languages and 30 minority languages recognized by the European Charter of Fundamental Rights. Within this multilingual space, the practice of adapting tests has intensified during the last few years. One of the reasons for this intensification is the implementation of international programs for the assessment of educational systems. The fundamental characteristic of the adaptation of a test, in comparison with any other type of pragmatic, aesthetic, linguistic, or ethnographic translation (Casagrande, 1954), is that it is governed by the principles of psychometric equivalence, as set forth in the Guidelines for the Adaptation of Tests written by the International Test Commission (Elosua, 2003; Hambleton, 2001).

Correspondence should be addressed to Paula Elosua, Department of Psychology, University of the Basque Country, Avda. Tolosa, 70, 20018, San Sebastian, Spain. E-mail:
[email protected]
40
ELOSUA AND LÓPEZ-JAÚREGUI
Psychometric equivalence focuses on metric equivalence (Drasgow, 1984). From this perspective, two items are equivalent in the source language (L1) and in the target language (L2) when their item characteristic curves are equivalent (Mellenbergh, 1989). In these circumstances, we speak of metric invariance. Invariance analysis is usually carried out by detecting possible differential item functioning (DIF; Lord, 1980).

The study of DIF in the adaptation of tests has produced a great many empirical reports aimed at assessing psychometric equivalence between versions. Nevertheless, there are few reports whose aim is to search for and study the origin of DIF. Several authors who propose logical procedures for the study of DIF can be mentioned along this line of research (Berk, 1982; Cole, 1981; Hambleton & Jones, 1994; Scheuneman, 1987; Shepard, 1982; Title, 1982). They all recommend that aspects related to the composition and format of the items be taken into consideration so as to avoid bias. This line of work focuses on the analysis of content during the construction of the test so as to avoid, among other things, a discriminatory and stereotyped use of language with regard to minority groups, or the effect on performance of differential familiarity with the format and type of test (van de Vijver & Poortinga, 1992). All these studies warn of the importance of content and format in preventing DIF, although they do not search for "theoretical models" to explain it.

Within a second line of work we can cite those studies that, in more or less detail, attempt to systematize the sources of DIF. Hulin (1987; Hulin & Mayer, 1986), for instance, differentiates two DIF sources: the first is directly related to the process of translation, and the second refers to the cultural relevance of the content of the item.
Although this hypothesis of a direct relation between the source of DIF and the parameters of the item was later refuted by Ellis (1995), it can be considered the first attempt to tackle the problem of the DIF source. In a later study, van de Vijver and Tanzer (1997) differentiated three DIF sources: adaptation problems and/or ambiguity of the item; contaminating factors related to the unintended inclusion of additional constructs; and cultural specificity, which the authors associated with incidental differences in the connotative meanings of the item. That is, these authors add to the earlier proposal a factor related to construct-irrelevant variance that may disturb performance on an item. Allalouf, Hambleton, and Sireci (1999) examined the previous studies in greater depth and proposed four DIF causes: changes in the level of difficulty, changes in content arising from a deficient translation, changes in format (length, relation between statement and answer alternatives), and differences in cultural relevance. All these proposals refer to cultural relevance and problems associated with translation as potential DIF sources, yet they do not make a thorough linguistic
analysis in order to determine the causes of DIF. Within this context, the objective of this research is to investigate in greater depth the potential DIF sources that may alter the psychometric properties of an item between L1 and L2 (difficulty, discrimination, guessing). It aims to show a way to study and prevent DIF in the adaptation of tests, one that will have to be particularized in each case depending on the linguistic characteristics of L1 and L2. Together with this main objective, the causes of DIF that are characteristic of the adaptation process between Spanish and Basque are defined in this study.

Basque or "Euskera" (the language's name in Basque) is an isolated minority language spoken in the northern part of the Iberian Peninsula and in the southwest of France; to be more precise, in the Autonomous Community of Navarre, the Autonomous Community of the Basque Country, and the southern region of the Atlantic Pyrenees in France. The number of speakers stands at around 600,000 according to the Basque Language Academy (Euskaltzaindia). It is, together with Spanish, the official language of the Autonomous Community of the Basque Country. Its legal framework is regulated by the 1982 Act of Normalisation of the Use of Euskera, according to which every inhabitant of the Autonomous Community of the Basque Country has the right to know and use it together with Spanish.

METHOD

Instrument

The analyzed data correspond to the verbal items of the Battery of Differential and General Aptitudes-E (Yuste, 1988). There are 53 dichotomously scored maximum-performance items with five answer alternatives, of which only one is correct. The classification of the items according to the type of task they require is shown in Table 1.

Adaptation

The test, originally constructed in Spanish, was adapted to Basque following the steps specified by the back-translation procedure.
The linguistic quality of the adaptation was analyzed with the aid of a scale constructed for that purpose. Each of the items making up the test (statement and alternatives) was assessed by two philologists on a three-point scale (1 = bad translation, 2 = acceptable translation, 3 = good translation). The average rating of the items was good, with an arithmetic mean of 2.5.

Participants

The sample comprised 1,048 students between the ages of 9 and 11. The original test in Spanish was taken by 498 children, who made up the reference group. The focal sample, which completed the adapted test, comprised 550 students.
TABLE 1
Classification of DIF: Item Type, Direction, and DIF Category

| Item Type | No. of Items | % of Items | No. of DIF Items | % of DIF Items | Moderate DIF | Large DIF | Better Spanish: n | Better Spanish: % | Better Basque: n | Better Basque: % |
|---|---|---|---|---|---|---|---|---|---|---|
| Synonyms | 6 | 11.32 | 6 | 100.00 | 1 | 5 | 4 | 66.6 | 2 | 33.3 |
| Antonyms | 4 | 7.54 | 3 | 75.00 | 1 | 2 | 2 | 66.6 | 1 | 33.3 |
| Purpose or common use | 2 | 3.77 | 1 | 50.00 | | 1 | 1 | 100.0 | | |
| Analogies | 13 | 24.52 | 8 | 61.53 | 3 | 5 | 4 | 50.0 | 4 | 50.0 |
| Definitions | 5 | 9.43 | 3 | 60.00 | | 3 | 3 | 100.0 | | |
| Word orderings | 6 | 11.32 | 3 | 50.00 | 1 | 2 | 2 | 66.6 | 1 | 33.3 |
| Classification | 8 | 15.09 | 6 | 75.00 | 3 | 3 | | | 6 | 100.0 |
| Constancy of a characteristic | 9 | 16.98 | 2 | 22.22 | 2 | | 1 | 50.0 | 1 | 50.0 |
| Total | 53 | 100.00 | 32 | 60.37 | 11 | 21 | 17 | | 15 | |

Note. DIF = differential item functioning. The "Better Spanish"/"Better Basque" columns give the number and percentage of DIF items on which Spanish or Basque examinees, respectively, performed better.
DIF

DIF was assessed with the two-stage Mantel–Haenszel procedure (Holland & Thayer, 1988), implemented by the authors in S-Plus. The procedure yields the chi-square statistic and the MH-Delta statistic; the latter is the delta metric used by the Educational Testing Service (Dorans & Holland, 1993). DIF may be regarded as moderate or severe according to the absolute value of MH-Delta and the significance of the differential functioning test. Absolute values over 1.5 together with a statistically significant test (at the .01 level) indicate severe DIF, whereas absolute values between 1 and 1.5 together with a statistically significant test (at the .01 level) catalog the DIF as moderate.
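The computation and classification rule just described can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' S-Plus code; the function names and the example strata are invented for the example.

```python
# Illustrative sketch of the Mantel-Haenszel DIF statistics described above.
# NOT the authors' S-Plus implementation; names and example data are invented.
import math

def mantel_haenszel_delta(strata):
    """Compute the MH common odds ratio and the ETS delta metric.

    strata: list of 2x2 tables, one per ability (total-score) level, each
    given as (ref_correct, ref_wrong, focal_correct, focal_wrong).
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n  # reference correct * focal wrong
        den += b * c / n  # reference wrong * focal correct
    alpha = num / den                      # common odds ratio
    return alpha, -2.35 * math.log(alpha)  # MH-Delta (ETS delta metric)

def classify_dif(delta, significant_at_01):
    """Severity rule used in the text: |delta| > 1.5 plus a significant test
    (.01 level) -> severe; 1 <= |delta| <= 1.5 plus a significant test ->
    moderate; otherwise no DIF is flagged."""
    if not significant_at_01:
        return "none"
    if abs(delta) > 1.5:
        return "severe"
    if abs(delta) >= 1.0:
        return "moderate"
    return "none"
```

For instance, a single balanced stratum (10, 10, 10, 10) gives a common odds ratio of 1 and a delta of 0, whereas a one-sided stratum such as (30, 10, 10, 30) gives an odds ratio of 9 and a delta of about −5.16, which a significant test would classify as severe.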
Causes of DIF in the Adaptation of Tests

A committee of experts was created in order to define and assess the points that would have to be taken into consideration so that adaptations comply with the condition of metric equivalence. The committee was a multidisciplinary group that assessed the different aspects involved in the process of answering an item: philologist-translators, primary school teachers who teach in Spanish and Basque, and psychometric specialists. The committee had six members altogether: 2 philologists specialized in the adaptation of texts, 2 primary school teachers, and 2 bilingual specialists in psychometrics and data analysis. The philologists assessed the aspects related to the correctness and linguistic equivalence between the original version and the adapted one; the teachers analyzed, in each case, the adaptation and equivalence of the difficulty levels of the task to be completed; and the psychometric specialists advised on the complexity of the DIF concept. Three items from each of the tasks of which the test is composed were presented to the committee, and its members were asked to search for those aspects of the translation/adaptation process that might give students of the same verbal competence in each of the languages a different probability of correctly answering the item. Special emphasis was placed on the revision of both statements and answer alternatives.

Relations Between Theoretical Criteria and Empirical Results

To assess the concordance between the hypotheses created for the substantive assessment of items with DIF in the Spanish/Basque adaptation and the results of an empirical DIF detection study, a second committee of experts was created; this one comprised philologists and teachers who worked independently from the first committee in the analysis of the 53 items of the scale.
Their task involved assessing each of the items with regard to each of the potential
bias sources described by the first team and, where appropriate, indicating which group (reference or focal) would benefit from the item in its current form. The instructions given to the team focused on the need to indicate those aspects that might alter the difficulty of the items in either of the two languages.
RESULTS

Preliminary Analyses
The average performance of the original sample (X̄R = 36.65, SR = 5.66) is superior to the average performance of the focal sample (X̄F = 27.31, SF = 4.25). The difference between the two means is statistically significant, t(1048) = 29.73, p ≤ .0001. These results are consistent with previous studies: the mean performance of Spanish student samples on aptitude tests is better than that of Basque samples (Elosua & López, 1999; Elosua, López, Egaña, Artamendi, & Yenes, 2000).

The unidimensionality of the test was analyzed with a nonparametric procedure. With the assistance of DIMTEST (Nandakumar & Yu, 1996), it was found that the two scales (original and adapted) are essentially unidimensional; in both samples the hypothesis of multidimensionality was rejected (tR = 0.842, p = .199; tF = −0.397, p = .654). This condition is indispensable before assessing the metric equivalence between items.

Detection of DIF

Table 2 shows the MH-Delta and chi-square statistics for each of the analyzed items. Summarizing the values obtained, 32 items, that is to say 60.36%, present differential functioning. Their distribution in terms of tasks can be seen in Table 1, which shows that in almost every category the proportion of DIF items exceeds 50%; the most affected category is the one that requires finding a synonym among the alternatives presented, in which DIF reaches 100%. The classification of the severity of the DIF (Table 1) shows that 11 of the items present moderate DIF, whereas in 21 items it is severe. The direction of differential functioning is also presented in Table 1. Out of all the items with DIF, 17 favor the reference group, whereas 15 favor the focal group. The close relation (χ² = 8.21, p = .004) between the level of differential functioning and its direction ought to be highlighted.
In fact, 15 of the 17 items that favor the reference group (88.24%) have been cataloged as severe, whereas of the 15 items that favor the focal group, only 40% are severe.
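The association reported above can be reproduced from the published counts. The sketch below (illustrative Python, not part of the original study) computes the Pearson chi-square for the 2 × 2 severity-by-direction table (15 severe and 2 moderate items favoring the reference group; 6 severe and 9 moderate favoring the focal group) and Pearson's contingency coefficient:

```python
# Reproducing the severity-by-direction association from the reported counts.
# Cell values are read off the text; this is an illustrative sketch, not the
# authors' code.
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square, no continuity correction, for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def contingency_coefficient(chi2, n):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi2 / (chi2 + n))

# Severity (severe, moderate) by direction (reference, focal):
chi2 = chi2_2x2(15, 2, 6, 9)   # ≈ 8.22, matching the reported 8.21 up to rounding
```

Applied to the statistical-by-judgmental cross-classification of Table 4 ([[15, 6], [10, 22]]), the same functions give χ² ≈ 8.21 and C ≈ 0.366, the coefficient reported later in the text.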
TABLE 2
Differential Item Functioning

| Item | MH Delta | MH χ² | Item | MH Delta | MH χ² |
|---|---|---|---|---|---|
| 1 | –2.68 | 18.85 | 27 | –3.40 | 68.31 |
| 2 | –3.80 | 62.12 | 28 | –2.69 | 42.10 |
| 3 | –1.88 | 20.65 | 29 | 1.91 | 21.04 |
| 4 | 0.70 | 2.19 | 30 | 1.16 | 8.15 |
| 5 | –0.03 | 0.00 | 31 | 0.37 | 0.32 |
| 6 | –0.62 | 1.86 | 32 | 0.71 | 1.79 |
| 7 | 1.35 | 12.52 | 33 | 0.99 | 2.88 |
| 8 | –4.95 | 69.73 | 34 | 0.81 | 3.32 |
| 9 | –2.59 | 38.11 | 35 | 1.08 | 1.42 |
| 10 | 1.80 | 19.23 | 36 | 2.47 | 16.69 |
| 11 | –2.08 | 30.49 | 37 | –0.18 | 0.14 |
| 12 | –2.58 | 35.19 | 38 | 2.12 | 15.38 |
| 13 | 0.08 | 0.01 | 39 | 1.41 | 9.33 |
| 14 | 0.28 | 0.33 | 40 | –1.30 | 11.91 |
| 15 | –5.33 | 147.40 | 41 | 0.05 | 0.00 |
| 16 | –0.65 | 2.88 | 42 | 0.59 | 1.30 |
| 17 | –2.17 | 22.12 | 43 | –0.65 | 2.83 |
| 18 | 1.30 | 9.09 | 44 | 1.40 | 12.33 |
| 19 | 1.22 | 10.27 | 45 | 1.05 | 7.75 |
| 20 | –8.18 | 246.67 | 46 | 2.41 | 37.64 |
| 21 | 0.09 | 0.02 | 47 | 1.24 | 8.23 |
| 22 | 1.04 | 7.08 | 48 | –1.03 | 7.12 |
| 23 | 0.58 | 2.12 | 49 | 0.80 | 3.72 |
| 24 | –1.53 | 16.88 | 50 | –2.50 | 22.28 |
| 25 | –2.44 | 35.73 | 51 | 0.75 | 2.74 |
| 26 | –0.59 | 2.35 | 52 | 2.36 | 16.02 |
| | | | 53 | 0.85 | 4.04 |

Note. Numbers in bold represent significant DIF.
Classification of DIF Sources in the Adaptation of Tests

In the search for sources of DIF in the adaptation of tests, we can basically differentiate four foci: deficiencies in translation, grammatical differences between the languages, semantic differences, and cultural specificity. This classification is not exclusive; any item may have problems from more than one source.
Translation. This is a source commonly recognized in all the studies carried out. It includes the errors made by a deficient translation between L1 and L2.

Grammatical differences. Each language has its own grammatical structures and characteristics, and these may not necessarily have an equivalent in the
language of destination; hence the classification of languages according to linguistic typology. Languages may be classified according to the syntactic order of sentence constituents (SVO1 in English, Spanish, or German; SOV in, e.g., Basque, Japanese, Persian, or Latin; VSO in, e.g., Welsh or Arabic; VOS in Fijian; or OSV in Brazilian Xavante) or as belonging to the accusative or ergative categories; ergative–absolutive languages, for instance, mark the subject of a transitive verb through case inflection, an inflection that Romance languages have practically lost. Furthermore, languages can be differentiated into inflecting or synthetic languages, as opposed to analytic ones. The former are characterized by a tendency to pack a great deal of information into suffixes and prefixes through the inflection of certain words; inflection is often used to distinguish the different cases accepted by the language.

Translation between languages belonging to different typologies (syntactic order, ergative vs. accusative) creates, at the least, a change in format (e.g., the order of the elements of the sentence, concordance between statement and alternatives, length of the sentence) that may sometimes produce changes in the difficulty level of the item in one of the groups. In addition, the differences between synthetic and analytic languages, among which we can mention suffixing as an example, may open new, unintended routes to the solution of the item. All in all, we include two general categories in this section: the first related to changes in the format of the item caused by the syntactic differences between the languages (more or less complexity in the statement, placement of the correct option with respect to the statement, etc.), and the second related to the morphological aspects characteristic of each language.
In our case, we are dealing with translation between two languages that belong to different families. Spanish is SVO whereas Basque is SOV; at the same time, Spanish is an accusative language, whereas Basque is ergative. Last but not least, Spanish is an analytic language, and Basque is synthetic. We have two languages that are linguistically different and that do not share the same morphological characteristics; we must therefore pay attention to suffixing and to format differences. In order to clarify the influence of grammatical rules on the solution of an item, we offer an example from the analyzed items. The item requires the student to recognize, among a group of words, the only one that is not a tree. In Basque, the majority of tree names carry the suffix "-ONDO." As a result of this linguistic peculiarity, the Basque student recognizes the correct answer more easily than the Spanish one: the Basque student has two routes to the solution, first, knowledge of the task presented, and second, recognition of the only term that does not carry the suffix "-ONDO." The MH-Delta value for this item is 2.12.

1S = subject; V = verb; O = object.
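The shortcut that the "-ONDO" suffix opens up can be sketched schematically. The word list below is invented for illustration (suffix-bearing tree-name-style alternatives plus one non-tree word); the point is only that the odd one out is identifiable from surface form alone, without knowing the vocabulary:

```python
# Schematic illustration of the surface-form shortcut described above: a
# Basque-speaking examinee can reject the single alternative that lacks the
# tree-forming suffix "-ondo" without knowing every word in the list.
def odd_one_out_by_suffix(words, suffix="ondo"):
    """Return the alternatives that do NOT carry the given suffix."""
    return [w for w in words if not w.endswith(suffix)]

# Hypothetical answer alternatives: three suffix-bearing tree names and one
# non-tree word ("mahaia", table).
alternatives = ["sagarrondo", "aranondo", "intxaurrondo", "mahaia"]
```

Here `odd_one_out_by_suffix(alternatives)` returns `["mahaia"]`, mirroring the second solution route described in the text.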
Semantic equivalence. Semantic equivalence refers to the equivalence between the connotative meanings of a term in L1 and L2. One of the problems in translation is a possible failure of adaptation and lack of correspondence between the word in the first language and the term chosen in the language of destination. A term in a language possesses values of familiarity, concreteness, or emotional valence that may not coincide with those of its literal translation in another language. A linguistically and grammatically correct translation may therefore produce vocabulary with different values in these categories in the language of destination, which may produce a lack of psychometric equivalence. In addition, we include in this section the possible polysemy of some words in one of the languages.

Cultural relevance. Cultural relevance refers to the characterization of an item as etic or emic. There are terms that cannot be directly translated into another language owing to the lack of an equivalent referent. From a broader perspective, this embraces local customs, geographical characteristics, and cultural representations (as well as symbols). Depending on the uses intended for the test, the pertinence of highly culture-bound items and their adaptation to L2 would have to be analyzed in each case.

Detection of DIF by the Group of Experts

Once the DIF sources were defined, the second team of experts analyzed the equivalence of all the items that make up the scale. As a result of the analysis, they determined that in 28 items the differences between the two languages, Spanish and Basque, may affect the answers given by the students. Their distribution in terms of tasks can be seen in Table 3. Of the items detected, 20 favor the reference group and the rest (8) favor the adapted version of the scale.

Relation Between the Two Detection Procedures

The concordance between the two analysis procedures, Mantel–Haenszel and expert judgment, is shown in Table 4.
The number of concordances in the DIF/non-DIF classification of items is 37; the two detection methods coincide on 69.81% of the items. The contingency coefficient for this table is 0.366 (p = .004). If we focus on the items with DIF, the judgmental procedure flags 28 items and the Mantel–Haenszel procedure detects 32; the two procedures agree on 22 items. That is, 78.57% of the DIF items detected by the experts are also detected by Mantel–Haenszel, and 68.75% of the DIF items detected by Mantel–Haenszel are also detected by the experts. These ratios improve on previous findings. In a similar study, Hambleton and Jones (1994) analyzed a test of 75 items, using Mantel–Haenszel and judgmental procedures to evaluate DIF. The empirical analysis detected 15 items with DIF, and the judgmental procedure
TABLE 3
Classification of DIF: Judgment of Experts

| Item Type | No. of DIF Items | % of DIF Items | Better Spanish: n | Better Spanish: % | Better Basque: n | Better Basque: % |
|---|---|---|---|---|---|---|
| Synonyms | 5 | 83.30 | 3 | 60.0 | 2 | 40.0 |
| Antonyms | 2 | 50.00 | 1 | 50.0 | 1 | 50.0 |
| Purpose or common use | 1 | 50.00 | 1 | 100.0 | | |
| Analogies | 6 | 46.15 | 4 | 66.6 | 2 | 33.3 |
| Definitions | 3 | 60.00 | 3 | 100.0 | | |
| Word orderings | 4 | 66.60 | 3 | 75.0 | 1 | 25.0 |
| Classification | 3 | 37.50 | 1 | 33.3 | 2 | 66.6 |
| Constancy of a characteristic | 4 | 44.40 | 4 | 100.0 | | |
| Total | 28 | | 20 | | 8 | |

Note. DIF = differential item functioning. The "Better Spanish"/"Better Basque" columns give the number and percentage of DIF items on which Spanish or Basque examinees, respectively, performed better.
TABLE 4
Relation Between Statistical and Judgmental Procedures

| Statistical Procedure | Judgmental: Non-DIF | Judgmental: DIF | Total |
|---|---|---|---|
| Non-DIF | 15 | 6 | 21 |
| DIF | 10 | 22 | 32 |
| Total | 25 | 28 | 53 |

Note. DIF = differential item functioning.
detected 11 items with DIF; the two procedures agreed on only 5 DIF items. The level of agreement with regard to the linguistic group favored by the items is also high: out of the DIF items detected by both procedures (22), 13 items favor the reference group and 7 the focal group. The contingency coefficient for this double classification is 0.635 (p < .001). The total number of DIF sources detected is 29 (Table 5), whereas the number of items with DIF is 22; in some items more than one source coincides. There are 21 items affected by only one source, 7 items in which 2 sources are detected, and 1 item in which there are 3 sources of error. The distribution of the sources according to the proposed model shows that the grammatical factor has an influence on 12 occasions, whereas the semantic factor appears on 15. Translation problems affect only 2 of the items, which could be foreseen given the analysis of adaptation quality. Finally,
TABLE 5
Judgmental Detection of DIF: Sources of DIF

| Item Type | No. of DIF Items | Grammar: Suffix | Grammar: Format | Semantic: Familiarity | Semantic: Polysemy | Translation | Cultural Relevance |
|---|---|---|---|---|---|---|---|
| Synonyms | 5 | | | | 5 | | |
| Antonyms | 2 | | | | 2 | | |
| Purpose or common use | 1 | | | | 1 | | |
| Analogies | 5 | | | | 1 | | |
| Definitions | 2 | | | | 1 | | |
| Word orderings | 3 | | | | 2 | | |
| Classification | 2 | | | | 1 | | |
| Constancy of a characteristic | 2 | | | | 1 | | |
| Total | 22 | 2 | 10 | 1 | 14 | 2 | 0 |

Note. DIF = differential item functioning.
it ought to be highlighted that the factor most often cited in the literature on this subject, namely the cultural factor, does not affect the adaptation of this test.

CONCLUSION

The existence of distributional differences between the reference and focal groups, as well as the high percentage of items with DIF, defines conditions that are not the most favorable for the effectiveness of the Mantel–Haenszel procedure: analytical and simulation studies have shown that the Mantel–Haenszel procedure yields inflated Type I error rates when the means of the matching variable differ between groups (Mazor, Clauser, & Hambleton, 1994; Rogers & Swaminathan, 1993). Nevertheless, those conditions are constant in the adaptation of children's aptitude tests between Spanish and Basque (Elosua & López, 1999; Elosua, López, & Torres, 1999). Although this caveat about the effectiveness of the Mantel–Haenszel procedure should temper our conclusions, the level of agreement between the two procedures supports the theoretical proposal formulated in this study.

As a result of this research, we offer a general framework of reference that classifies the DIF sources in the adaptation of tests from a linguistic and cultural perspective. All studies of DIF sources consistently point to differences in the cultural relevance of the content of the item. However, in many cases this source does not explain the presence of DIF. Just as the differences related to the representation of psychological constructs, or familiarity with aspects related to the method, exert a stronger influence on structural bias as the difference between cultures increases (van de Vijver & Tanzer, 1997), the importance of cultural relevance as a source of DIF will increase with the distance between the groups to be assessed.
In our case, for instance, despite the high number of problematic items, none of the items with DIF fall into this category. It was therefore necessary to widen the set of sources that define the origin of DIF. This widening is founded on linguistic analysis, on the basis of two general content areas of that discipline: semantics and morphosyntax. Phonology and phonetics have been left aside in this study, as they do not affect the scale analyzed. This study therefore offers a general classification of DIF sources that may be used in adaptation between any pair of languages and that takes into consideration the morphosyntactic and semantic characteristics that may create a variation in the difficulty of solving the items in L1 or L2. This general classification will have to be particularized with regard to the morphosyntactic peculiarities of the languages involved in each adaptation process. In our case, having two languages belonging to different linguistic families with different morphosyntactic characteristics (synthetic vs. analytic), as well as general factors regarding format, we have made explicit reference to the influence of suffixes.

On the other hand, it is clear that a literal translation does not necessarily involve semantic correspondence between the languages. The terms or concepts of a language possess semantic dimensions that do not automatically coincide with those of their literal translation in another language. In order to reach semantic equivalence, it would be advisable to make use of normative studies that assess the semantic dimensions of words, so that the adaptation of tests could advance beyond literal equivalence. The objective would be a semantic concordance that guaranteed equality in the dimensions of familiarity and signification of the terms between L1 and L2. In order to achieve metric equivalence, it is not enough to have a good knowledge of both cultures.
Unless normative guidelines based on the frequency of word use are established, it will be difficult to reach metric equivalence, especially in aptitude tests intended for children, in which the verbal component is substantial. The influence of the sources described increases with the weight of the verbal component of the scale. In this sense, DIF sources will potentially have a greater effect on performance or aptitude scales than on personality or attitude scales. The reason is simple: in ability tests that use a multiple-choice format and include tasks such as the ones described in this study (e.g., analogies, synonyms, comprehension), the factors mentioned have a direct influence on each item (statement plus alternatives; Bejar, Chaffin, & Embretson, 1991). In the synonyms task, for instance, there are several words to be translated for each statement, which entails greater difficulty when looking for terms with equivalent semantic values. In typical-performance tests with a graded response format, this problem does not exist. The variable influence of DIF sources on the type of scale can be seen in the adaptation of tests between Spanish and Basque. The results obtained in empirical studies show a big difference in the number of items with DIF between the two types of
tests. Although the DIF exceeds 60% in the ability scales, it is very low in the personality tests, reaching less than 7% (Elosua, 2005; Elosua et al., 2000).

In conclusion, we present a proposal for a hypothesis that must still be consolidated through many more studies and analyses of a greater number of items. The total number of items analyzed does not allow us to validate the model empirically through more robust statistical analysis. Although the initial results are promising, we will have to look at this line of analysis in greater depth.
ACKNOWLEDGMENT

This research was supported, in part, by grants from the Ministerio de Ciencia y Tecnología (BSO2002–00490) and the Ministerio de Educación y Ciencia (SEJ2005–01694).
REFERENCES

Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36, 185–198.
Bejar, I. I., Chaffin, R., & Embretson, S. (1991). Cognitive and psychometric analysis of analogical problem solving. New York: Springer.
Berk, R. A. (Ed.). (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Casagrande, J. B. (1954). The ends of translation. International Journal of American Linguistics, 20, 335–340.
Cole, N. S. (1981). Bias in testing. American Psychologist, 36, 1067–1077.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel–Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Mahwah, NJ: Lawrence Erlbaum Associates.
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are the central issues. Psychological Bulletin, 95, 134–135.
Elosua, P. (2003). Testen euskaratzea. Balizko alborapenaren iturriak [Adapting tests to Basque: Potential sources of bias]. Tantak, 30, 17–38.
Elosua, P. (2005). Evaluación progresiva de la invarianza factorial entre las versiones original y adaptada de un cuestionario sobre autoconcepto [Progressive analysis of the factorial invariance between the original and adapted versions of a self-concept scale]. Psicothema, 17(2), 356–362.
Elosua, P., & López, A. (1999). Funcionamiento diferencial de los ítems y sesgo en la adaptación de dos pruebas verbales [Differential item functioning and bias in the adaptation of two verbal tests]. Psicológica, 20, 23–40.
Elosua, P., López, A., & Torres, E. (1999). Adaptación al euskera de una prueba de inteligencia verbal [Adaptation to Basque of a verbal intelligence test]. Psicothema, 11(1), 151–161.
Elosua, P., López, A., Egaña, J., Artamendi, J. A., & Yenes, F. (2000). Funcionamiento diferencial de los ítems en la aplicación de pruebas psicológicas en entornos bilingües [Differential item functioning related to the use of tests in bilingual contexts]. Revista de Metodología de las Ciencias del Comportamiento, 2(1), 17–33.
Ellis, B. B. (1995). A partial test of Hulin's psychometric theory of measurement equivalence in translated tests. European Journal of Psychological Assessment, 11, 184–193.
Hambleton, R. K. (2001). The next generation of the ITC test translation and adaptation guidelines. European Journal of Psychological Assessment, 17, 164–172.
Hambleton, R. K., & Jones, R. W. (1994). Comparison of empirical and judgmental procedures for detecting differential item functioning. Educational Research Quarterly, 18, 21–37.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel–Haenszel procedure. In H. Wainer & H. J. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hulin, C. L. (1987). A psychometric theory of evaluations of item and scale translations. Journal of Cross-Cultural Psychology, 18, 115–142.
Hulin, C. L., & Mayer, L. (1986). Psychometric equivalence of a translation of the Job Descriptive Index into Hebrew. Journal of Applied Psychology, 71, 83–94.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Mazor, K. M., Clauser, B. E., & Hambleton, R. K. (1994). Identification of nonuniform differential item functioning using a variation of the Mantel–Haenszel procedure. Educational and Psychological Measurement, 54, 284–291.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143.
Nandakumar, R., & Yu, F. (1996). Empirical validation of DIMTEST on non-normal ability distributions. Journal of Educational Measurement, 33, 355–368.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of the logistic regression and Mantel–Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105–117.
Scheuneman, J. D. (1987). An experimental exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97–118.
Shepard, L. A. (1982). Definitions of bias. In R. A. Berk (Ed.), Handbook of methods for detecting test bias (pp. 9–30). Baltimore: Johns Hopkins University Press.
Title, C. K. (1982). Test bias. In J. K. Keeves (Ed.), Educational research, methodology and measurement: An international handbook. Oxford, England: Pergamon Press.
van de Vijver, F. J. R., & Poortinga, Y. H. (1992). Testing across cultures. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications (pp. 277–308). Boston: Kluwer Academic.
van de Vijver, F. J. R., & Tanzer, N. K. (1997). Bias and equivalence in cross-cultural assessment: An overview. European Review of Applied Psychology, 47, 263–279.
Yuste, C. (1988). BADYG-E. Madrid: Ciencias de la Educación Preescolar y Especial.