Revista Electrónica de Metodología Aplicada 2011, Vol. 16 nº 1, pp. 50-65

Development, psychometric properties and new validity evidences of the web-based computerized adaptive test of English eCat

Julio Olea¹, Francisco José Abad¹, Vicente Ponsoda¹, David Aguado² and Julia Díaz²

¹ Facultad de Psicología, Universidad Autónoma de Madrid
² Instituto de Ingeniería del Conocimiento

RESUMEN

The construction of the eCat test is described, and the consequences that its operational use has had on its psychometric properties are reported. The results obtained in a sample of 3224 students showed that the bank has adequate reliability and convergent validity with respect to self-reports of English proficiency. The simulations carried out showed that the estimates were free of bias and that the test must have at least 15 items to reach a reasonable level of precision. From the 7254 available administrations of the test, the psychometric properties have been obtained again and compared with those predicted before the operational phase. New validity evidence is provided.

Palabras clave: computerized adaptive testing, Internet testing, English assessment, personnel selection.

ABSTRACT

This paper describes the process of constructing the eCat test and provides information on the effects its normal operation has had on its psychometric properties. Results obtained from 3224 students revealed that the 225-item pool has adequate reliability and good convergent validity with respect to self-reported English proficiency. Simulations showed that ability estimates are essentially unbiased and that a stopping criterion of over 15 items is required to achieve a reasonable level of precision. From 7254 eCat administrations, the test's psychometric properties have been computed again and compared with those of the preoperational eCat. Additional validity evidence is also presented.

Keywords: Computerized adaptive testing, testing by the Internet, English assessment, e-recruitment.

Contact: Vicente Ponsoda, e-mail: [email protected], tel: 914975203, fax: 914975215

This research has been funded by the Ministerio de Ciencia e Innovación (grants PSI2008-01685 and PSI2009-10341) and by the Chair "Psychometric models and applications" sponsored by the Instituto de Ingeniería del Conocimiento.



1.- Introduction

One of the most important advances in the theory and practice of personnel recruitment has very likely been driven by the arrival of computers and related technologies (Viswesvaran, 2003). Computerized testing via the Internet for personnel selection has become increasingly popular in recent years and is one of the preferred organizational practices (Bartram & Hambleton, 2006; Nye, Do, Drasgow & Fine, 2008). Internet testing has important advantages, such as quicker and cheaper assessment processes (Naglieri, Drasgow, Schmit, Handler, Prifitera, Margolis & Velasquez, 2004), and it makes it possible for candidates anywhere in the world to easily apply and be tested, even at their own homes. However, it also presents some difficulties, related mainly to the unsupervised nature of most assessments and the lack of control over the examinee's behavior when responding to the test (Tippins, Beaty, Drasgow, Gibson, Pearlman, Segall & Shepherd, 2006).

Different test types, such as linear tests, questionnaires and computerized adaptive tests (CATs), can be and usually are administered via the Internet. Advances in CAT have been made possible by advances in both item response theory (IRT) and computer hardware (Ponsoda & Olea, 2003). The main idea of a CAT is to estimate the respondent's ability efficiently by administering items matched to the level of proficiency he or she demonstrates throughout the test. The respondent does not have to answer items that he or she would very likely pass or fail because they are too easy or too difficult. A CAT generally administers a reduced number of items, but its precision can be even higher than that of a longer linear test. The use of CATs is increasingly common in large-scale psychological, educational, certification and licensure programs, where the fundamental objective is to obtain a precise estimate of ability through the application of a reduced number of items.
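In outline, this adaptive cycle — select the most informative item for the current estimate, score the response, re-estimate — can be sketched in a few lines. The following is a minimal illustration under assumed simplifications, not eCat's actual algorithm: it uses a Rasch (1PL) model, a fixed 15-item length, and a crude bounded one-step Newton update of the ability estimate after each response; the function and variable names are invented for the example.

```python
import math
import random

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: p(1 - p)."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def run_cat(bank, true_theta, n_items=15, rng=None):
    """Administer a fixed-length CAT against a simulated examinee:
    pick the most informative unused item, score it, and update the
    ability estimate with a one-step Newton correction."""
    rng = rng or random.Random(0)
    theta = 0.0                     # neutral starting estimate
    used = set()
    for _ in range(n_items):
        # Select the unused item with maximum information at theta
        item = max((i for i in range(len(bank)) if i not in used),
                   key=lambda i: item_information(theta, bank[i]))
        used.add(item)
        # Simulate the examinee's (probabilistic) response
        correct = rng.random() < rasch_p(true_theta, bank[item])
        # One Newton-Raphson step on the log-likelihood
        p = rasch_p(theta, bank[item])
        info = max(item_information(theta, bank[item]), 1e-6)
        theta += ((1.0 if correct else 0.0) - p) / info
        theta = max(-4.0, min(4.0, theta))  # keep the estimate bounded
    return theta
```

Under the Rasch model, maximum-information selection reduces to picking the item whose difficulty is closest to the current ability estimate; operational CATs refine every piece of this loop (calibrated multi-parameter models, Bayesian estimators, exposure control).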
There are currently adaptive versions of several important knowledge and aptitude tests (such as the ASVAB, the GRE, the SAT and the TOEFL). The "move to CAT" is a worldwide phenomenon: there are more than 30 operational CAT programs¹ all over the world that evaluate four to six million men, women, and children each year (Fetzer, Dainis, Lambert & Meade, 2008).

CATs have the potential advantages of any computerized test (Carlson & Harvey, 2004), such as those relating to data storage and recovery, homogeneity of the testing conditions, establishment of controls to preserve the security of the test, speed of data processing, assessment through new item formats, more flexible grading procedures, preparation of automatic reports, etc. In addition to these potential advantages, CATs hold other, more specific ones: enhanced test security, reduced administration time, more precise ability estimates for tests of the same length, the possibility of applying different tests to the same examinee on different occasions, and, in the long run, lower testing costs when large samples must be assessed.

There are, however, specific difficulties related to CAT implementation, as it raises the technical issue of the dynamic transmission of information between the computer the examinee responds on and a server, which carries out the ability estimation process, the selection of new items, their display to the examinee, data recording, and the preparation of the final report. For the examinee, the time interval between a response and the display of the next item should not be perceptible. These issues become especially important when a large number of test-takers take the adaptive test simultaneously. In addition, the issues of

¹ For a list of more than 20 programs, see http://www.psych.umn.edu/psylabs/catcentral/


preservation of the security of the item pool and the anonymity of the test-takers require the establishment of specific access controls to the system.

In spite of these difficulties, web-based CATs may represent significant progress towards the achievement of some objectives pursued by computerized assessment, especially for high-stakes testing situations with large samples and several administrations every year. A web-based CAT supplies test-takers with the most convenient and flexible conditions in which to take a test. It also provides the contracting company with advantages, such as simplified procedures, an enhanced company image, and an improved on-demand service.

This paper describes eCat, one of the web-based CAT programs referred to above, developed by psychometricians and computer scientists from the Universidad Autónoma de Madrid and the Instituto de Ingeniería del Conocimiento, which assesses English proficiency. The test has been operational since 2006 and is mostly applied in unsupervised personnel selection processes. We will describe the construction and calibration process of the item pool, the operation of the management system, the adaptive algorithms implemented, and its psychometric properties. Some recently gathered validity evidence and the results of a recalibration conducted on the responses given to the operational test will also be presented, to ascertain the effects that the test's normal operation has had on its psychometric properties.
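The client–server division of labor described above can be sketched as a simple stateful exchange. The snippet below is a toy, single-process simulation under invented names (SESSIONS, ITEM_BANK, start_session, submit_response are all illustrative assumptions, not eCat's API): a real deployment would carry each round trip over authenticated web requests, with its own item bank, model, and access controls.

```python
import math

# Hypothetical server-side state: session id -> ability estimate + history
SESSIONS = {}
# Hypothetical item bank: item id -> Rasch difficulty
ITEM_BANK = {i: {"difficulty": d}
             for i, d in enumerate([-1.5, -0.5, 0.0, 0.5, 1.5])}

def _prob(theta, item_id):
    """Rasch probability of a correct answer to item_id at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - ITEM_BANK[item_id]["difficulty"])))

def select_next_item(session_id):
    """Server: return the unused item most informative at the current
    ability estimate, or None when the pool is exhausted."""
    s = SESSIONS[session_id]
    candidates = [i for i in ITEM_BANK if i not in s["asked"]]
    if not candidates:
        return None
    def info(i):
        p = _prob(s["theta"], i)
        return p * (1.0 - p)
    return max(candidates, key=info)

def start_session(session_id):
    """Server: open a session with a neutral starting ability estimate."""
    SESSIONS[session_id] = {"theta": 0.0, "asked": []}
    return select_next_item(session_id)

def submit_response(session_id, item_id, correct):
    """Server: record the answer, update theta with a one-step Newton
    correction, and send back the next item -- the whole round trip is
    what must feel instantaneous to the examinee."""
    s = SESSIONS[session_id]
    s["asked"].append(item_id)
    p = _prob(s["theta"], item_id)
    s["theta"] += ((1.0 if correct else 0.0) - p) / max(p * (1.0 - p), 1e-6)
    return select_next_item(session_id)
```

Keeping all estimation state on the server, as in this sketch, is what preserves item-pool security: the client only ever sees one item at a time and never the bank or the scoring rule.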

2.- eCat Development

2.1.- Design of the item bank

The efficiency of a CAT depends heavily on the quality of the item bank from which the items are selected. There are a few important issues related to item pool development and maintenance (Wise & Kingsbury, 2000). In the eCat development process, much attention was paid to creating an appropriate item pool.

Two specialists in English Philology, in collaboration with the group of psychometricians, designed a pool containing 635 items. The psychometricians guided the item development process. They proposed a multiple-choice format with 4 response options per item, and provided guidelines for the wording of incorrect options and recommendations on the content validity of the pool and on the desirable difficulty of the items. The philologists designed a cognitive-functional model of English proficiency that included 7 grammar categories and 46 more specific subcategories: form-related aspects (2 subcategories, 17 items in total), morphology (17 subcategories, 222 items), morphosyntax (1 subcategory, 7 items), pragmatics (2 subcategories, 20 items), lexicon (7 subcategories, 177 items), syntax (14 subcategories, 82 items), and compound categories (3 subcategories, 110 items).

For the calibration of the items, it was not possible to administer the entire bank to each examinee. Therefore, a linking design was used in order to administer different subsets of items to different samples, with linking items common to the different forms. In all, 15 forms were designed, each made up of 61 items: 20 common to all forms (the linking test) and 41 specific to each form. Both the items of the linking test and those belonging to each specific form were


chosen to be adequate samples of the difficulty of the pool and of the proportion of items in each of the 7 competency categories.

2.2.- Psychometric properties of the item bank

Five of the forms (a total of 225 items) were administered to a sample of 3224 first-year students of the Pontificia Universidad Católica de Chile, who had varying levels of training in the English language. With the objective of carrying out predictive validity studies, a brief questionnaire was included with the corresponding form. It was designed to obtain information on the type of secondary school attended (bilingual or Spanish) and on the type of English language training received (formal education, language academies, family, stays in English-speaking countries, etc.). A self-assessment of reading, writing and speaking proficiency was also requested.

Several studies were carried out to ascertain the psychometric properties of the 20 items comprising the linking test and of the items of the different forms. These studies' specific aims were to a) estimate the item-test correlations and remove the items negatively affecting the form's reliability, b) estimate the internal consistency and mean difficulty of the five forms, to see whether equating procedures were needed, and c) verify the unidimensionality assumption required by the IRT model used in the calibration. Table 1 provides the means and standard deviations, as well as the coefficient α, the average of the item-test biserial correlations, and the RMSEA index of each of the forms.

Form    | No. of items | Average correct answers | Standard deviation | α     | Biserial average | RMSEA
1       | 61           | 30.589                  | 13.480             | 0.947 | 0.643            | 0.00702
2       | 61           | 28.443                  | 13.037             | 0.944 | 0.615            | 0.00683
3       | 61           | 30.174                  | 13.492             | 0.948 | 0.642            | 0.00728
4       | 61           | 31.773                  | 14.516             | 0.955 | 0.668            | 0.00753
5       | 61           | 32.161                  | 14.454             | 0.953 | 0.661            | 0.00750
Linking | 20           | 9.912                   | 4.956              | 0.871 | 0.691            | 0.00473

Table 1. Descriptive data, internal consistency and fit to the unidimensional solution indices for the 5 forms and the linking test
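The classical indices of the kind reported in Table 1 are straightforward to compute from a persons × items matrix of 0/1 responses. The sketch below, using only the Python standard library, computes coefficient α and an (uncorrected) item-total point-biserial correlation; note that the paper reports biserial correlations, a related index that additionally assumes a normal latent variable behind each dichotomous item, so the function shown is a simplified stand-in.

```python
import statistics

def cronbach_alpha(scores):
    """Coefficient alpha from a persons x items 0/1 response matrix:
    (k / (k - 1)) * (1 - sum of item variances / total-score variance)."""
    k = len(scores[0])
    item_vars = [statistics.pvariance([row[j] for row in scores])
                 for j in range(k)]
    total_var = statistics.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def point_biserial(scores, j):
    """Correlation between item j and the (uncorrected) total score."""
    totals = [sum(row) for row in scores]
    item = [row[j] for row in scores]
    mi, mt = statistics.mean(item), statistics.mean(totals)
    cov = statistics.mean([(a - mi) * (b - mt)
                           for a, b in zip(item, totals)])
    return cov / (statistics.pstdev(item) * statistics.pstdev(totals))
```

In practice the item-test correlation is often computed against the total score with the item removed, to avoid inflating the index for short tests such as the 20-item linking test.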

In spite of the random assignment of subjects to the 5 forms, the average number of correct responses significantly differed (p