Validity and Reliability of Tests in Web-PSI Courses Jana Jacková University of Žilina, Faculty of Informatics and Management Science Bystrická cesta 21, 034 01 Ružomberok, Slovakia, Tel.: + 421 44 432 17 10, Fax: + 421 44 432 17 46
[email protected]
Abstract
"Test reliability and validity are the two most important features of a test. They indicate the quality and usefulness of the test" [1]. The article deals with the problems of evaluating students' knowledge in higher education e-courses. It describes the stages of the testing process needed to prepare a computer science course based on the mastery learning philosophy. The procedure is content-independent and is built on didactic principles of test construction, which are not widely known among teachers with a technical background (the master's degree "Ing.") who lack training in pedagogy and psychology. To clarify the test assessment procedure, the author focuses on the steps needed before and after testing. Worldwide use of PSI [2] (Personalized System of Instruction / Keller Plan) courses shows higher efficiency than traditional ones. "Student feedback from the PSI courses was almost 100% positive -- almost everyone -- students, proctors, and I -- enjoyed the experience" [3].
Keywords: teacher qualification, test construction, specific aims, reliability, validity, PSI, mastery, criterion-referenced assessment
1. Introduction
Secondary school teachers who graduate from technical faculties have to complete a specialized teacher training programme. University teachers, on the other hand, are not required to hold a teacher qualification. Yet tests are a common element of the educational process at almost all kinds of schools and courses. The question is: Are the tests we construct correct? And if so, what do they really measure? Teachers of technical subjects at both secondary schools and universities can obtain a teacher qualification by completing specialized teacher studies at the Faculty of Materials Science and Technology of the Slovak University of Technology, where various teacher training subjects, including test construction, are taught. Another example is the Faculty of Mathematics, Physics and Informatics of Comenius University, whose branch of study Secondary and Post-Secondary Teaching offers a School Tests course [4].
The syllabus of these courses covers the fundamentals of educational measurement theory, didactic principles of test construction, test usage in practice, current trends, etc. At present I am preparing a web-PSI course in basic programming skills as an alternative to a traditional one. PSI is based on the mastery learning philosophy and criterion-referenced assessment. Mastery tests "assess whether or not the person can attain a prespecified mastery level of performance" [5]. I have applied the school test construction theory from [6] [7] [8] [9] to the procedure for constructing PSI course tests. It does not matter whether the course is a traditional or a web-based one: with or without technology, the procedure has to be done step by step. Of course, technology, when used correctly, can multiply the effects of both the testing procedure and the course. "For example, computerized quizzes that give students immediate feedback on their content knowledge and refer them to their textbook or other source of information to restudy amplify two important student characteristics. First, students learn the information better and come to appreciate the value of feedback and review. Second, students learn something about the effectiveness of their learning strategies and how to improve them" [10]. Obtaining reliable test results about students' real knowledge is the aim of every teacher. Designing reliable and valid tests is serious work, particularly when students' results are used for important decisions, e.g. passing or failing the final course examination or the final course grade.
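The mechanism described in the quotation is simple to implement. Below is a minimal Python sketch of a quiz item that gives immediate feedback and refers the student to a restudy source; the question, options, and textbook reference are hypothetical placeholders, not taken from any actual course.

```python
# Minimal sketch of an immediate-feedback quiz item (hypothetical data).
from dataclasses import dataclass

@dataclass
class QuizItem:
    question: str
    options: list[str]
    correct: int          # index of the correct option
    restudy: str          # where to restudy if the answer is wrong

def check_answer(item: QuizItem, chosen: int) -> str:
    """Return immediate feedback, including a restudy reference on failure."""
    if chosen == item.correct:
        return "Correct."
    return f"Incorrect. Restudy: {item.restudy}"

item = QuizItem(
    question="Which loop is guaranteed to run at least once?",
    options=["for", "while", "do-while"],
    correct=2,
    restudy="textbook, chapter 4.2 (loops)",
)
print(check_answer(item, 1))  # -> Incorrect. Restudy: textbook, chapter 4.2 (loops)
```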
2. Validity and Reliability
A good test is reliable and valid. Reliability is not enough; a test must also be valid for its use. If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, tests can be highly reliable and still not be valid for a particular purpose [11].
Tests themselves are not valid or invalid. Instead, we validate the use of a test score. Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school [11]. Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes [11].
Methods for establishing the validity of a test's use:
• content validity (face validity and curricular validity)
• criterion-related validity (predictive validity or concurrent validity)
• construct validity (convergent validity and/or discriminant validity, or evidence from content validity or criterion-related validity studies)
• consequential validity (an inquiry into the social consequences)
Factors affecting the reliability coefficient: any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests where examinees' scores are affected by guessing (e.g. true-false) have lowered reliability coefficients [5].
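The claim that short tests are less reliable than long ones can be quantified with the Spearman-Brown prophecy formula, a standard result not named in the sources above: lengthening a test n times with comparable tasks changes its reliability r to nr / (1 + (n - 1)r). A small Python sketch:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test
# lengthened n times (n > 1) or shortened (n < 1).
def spearman_brown(r: float, n: float) -> float:
    return n * r / (1 + (n - 1) * r)

# A test with reliability 0.5, doubled in length with comparable tasks:
print(round(spearman_brown(0.5, 2), 2))   # 0.67 -- longer test, higher reliability
print(round(spearman_brown(0.5, 0.5), 2)) # 0.33 -- halved test, lower reliability
```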
3. Test construction
This algorithm describes the "manual" steps. Of course, some of them can be automated by special software tools, especially in web courses (see above).
O outset: course curriculum
A construction (planning and design)
B administration (completion by learners)
C correction
D classification (evaluation, assessment)
E analysis
O/ Outset: course curriculum
A course is well prepared for testing/examining if its curriculum states:
1. identification data (name of subject, number of lessons, number of credits, teachers, ...)
2. general educational aims of the subject
3. specific educational aims with the basic curriculum highlighted (specific aims that represent the minimal norm), classified e.g. by Bloom's taxonomy in the cognitive domain:
level 1: Rote knowledge (simple memorization)
level 2: Comprehension (using one's own words)
level 3: Application (using concepts or principles to solve new problems)
level 4: Analysis (similarities and differences are identified)
level 5: Synthesis (novel combinations of concepts or principles)
level 6: Evaluation (taking and defending a position using sound reasoning) [12]
4. timetable of education (topics/modules + methods, forms and material means)
5. requirements for passing (specification of projects, points)
6. methods of examination
7. recommended study literature
Having designed specific educational aims (O/3) is an important requirement for the content validity of course tests; a machine-readable sketch of such aims follows this list. In most higher education courses in Slovakia, specific aims are very rarely defined. Students usually know only the syllabus, i.e. the content of a course. In this case, comparing the knowledge of students taught the same subject by different teachers is very difficult: an A grade in one group is not the same as an A grade in another group.
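As noted above, one way to make specific aims explicit is to record them as structured data, so that mastery tests can draw on the basic curriculum automatically. A minimal Python sketch with hypothetical aims for a programming course; the Bloom levels refer to the list above:

```python
# Hypothetical specific educational aims tagged with a Bloom level (1-6)
# and a flag marking the basic (minimal-norm) curriculum.
from dataclasses import dataclass

@dataclass
class SpecificAim:
    text: str
    bloom_level: int   # 1 = rote knowledge ... 6 = evaluation
    basic: bool        # True = part of the mastery (minimal norm) curriculum

aims = [
    SpecificAim("Define the term 'variable'", bloom_level=1, basic=True),
    SpecificAim("Explain a given loop in your own words", bloom_level=2, basic=True),
    SpecificAim("Write a program that sorts a list", bloom_level=3, basic=False),
]

# Mastery tests draw only on the basic curriculum:
mastery_pool = [a for a in aims if a.basic]
print(len(mastery_pool))  # -> 2
```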
A/ Construction (planning and design)
1. to specify the purpose (WHY, FOR WHOM) and to choose the correct type of didactic test (DT):
• non-standardized (made by teachers of subjects, test non-professionals)
• cognitive (to measure knowledge and intellectual skills)
• module output / final course test
• criterion-referenced (C-R)
• objectively scored
Because PSI courses need C-R assessment, we are going to discuss non-standardized, C-R didactic tests (DT) for module output and final course assessment.
2. to allocate the content of the DT: which topics were taught and how much time was devoted to them
3. content analysis, to keep content validity (whether the tasks represent the curriculum) = didactic analysis of the curriculum (topics, items - importance / time)
• specific aims for a module output DT: to mark importance (basic curriculum items in mastery tests), to indicate suitability for testing (no / suitable / very suitable), to specify the level of knowledge
• specification table for a final DT (for its construction see [6]): topics / modules + number of lessons OR chapters + number of pages; a sketch of such an allocation follows below
4. to specify the adequate form of tasks in the DT
for the most suitable form of tasks see [6] (independent tasks, clear task statements, no tricky questions, avoiding negative task statements, easy correction for objectively scored tasks /closed OR open with a short answer/, 5 answers are the optimum for polytomous tasks)
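For the specification table, the number of tasks per topic can be allocated in proportion to the lessons taught, which is one straightforward reading of the table; the topics and lesson counts below are hypothetical:

```python
# Proportional allocation of test tasks to topics by number of lessons
# (a simple reading of the specification table; hypothetical data).
topics = {"variables": 4, "conditions": 6, "loops": 10}  # topic -> lessons taught
total_tasks = 20

lessons_total = sum(topics.values())
allocation = {t: round(total_tasks * n / lessons_total) for t, n in topics.items()}
# Rounding may require a manual adjustment so the counts sum to total_tasks.
print(allocation)  # -> {'variables': 4, 'conditions': 6, 'loops': 10}
```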
5. to formulate the tasks of the DT, to build a task bank
= specific aims are converted into tasks
6. to determine the testing time
7. to specify the number of tasks (reliability increases with the number of tasks), to calculate the supposed time for each task and to specify the testing time
8. to determine the form and number of variants
the number of variants depends on the way of seating and testing; equivalent variants are required (only modifications of the tasks from the first variant: changing the task order / the values of variables in a task / the order of task answers / mirroring task pictures; or a combination of these possibilities)
9. to design the first draft of the DT
10. (only if there are fewer than 21 tasks) to assign task importance weights
with more than 20 tasks weighting is useless, because it increases neither reliability nor validity
11. to assign the scoring of tasks (scoring key: points per task); a sketch follows below
• for more than 20 tasks binary scoring (1, 0) is used
• to calculate the DT score
• after the didactic analysis of the curriculum items presented in the DT, to specify the marginal score (the minimum score for a successful result): 80-90% for C-R tests
• transformation key for marking in C-R tests: transformation of the DT score (number of points) into marks / grades
• mastery tests in PSI courses: 0%-80% ..... failed, 81%-100% ..... passed
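A minimal Python sketch of step 11 for a mastery test: binary scoring and the transformation key with the pass boundary given above; the answer data are hypothetical:

```python
# Binary scoring and the mastery transformation key (0-80% failed,
# 81-100% passed), following step 11; answers and key are hypothetical.
def score_test(answers: list[int], key: list[int]) -> int:
    """Binary scoring: 1 point per correct task, 0 otherwise."""
    return sum(1 for a, k in zip(answers, key) if a == k)

def transform(score: int, max_score: int) -> str:
    """Transformation key for a PSI mastery test."""
    percent = 100 * score / max_score
    return "passed" if percent > 80 else "failed"

key     = [2, 0, 1, 3, 2, 1, 0, 3, 1, 2]   # correct option per task
answers = [2, 0, 1, 3, 2, 1, 0, 3, 1, 0]   # one mistake -> 9/10 = 90%
s = score_test(answers, key)
print(s, transform(s, len(key)))           # -> 9 passed
```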
12. to have the DT reviewed by competent persons (e.g. competent colleagues)
13. trial testing (piloting) with at least one group to check suitability (important if serious decisions about students are made on the basis of the DT)
14. amendment after analysing the results of (12) and (13), final test assembly and specification of the rules and conditions for test-takers: to determine the optimal environment (physical and psychological conditions), time, schedule, seating
Test construction is serious and time-consuming work. An example timescale and resource requirements for test construction show that this stage can take up to 20 weeks [9], especially for final tests.
B/ Administration (completion by learners)
testing of the learners
C/ Correction of the DT, results
quick feedback for students is more effective
D/ Classification (evaluation, assessment)
according to the transformation key (see A/11): a table of results with evaluation; a non-weighted score (average task success) if the DT is non-weighted (see A/10)
E/ Analysis of DT results
feedback information for increasing the quality of the course (the teacher's and the students' work); to measure the reliability of the results and to increase the DT reliability for the next testing
1. to do a statistical analysis of the DT results (see the sketch after this list):
a/ measures of central tendency (mean values of DT scores): mean (1) (average score), median (2)
b/ measures of variability (distribution of DT scores around the mean values): variance (3) (from (1) and the number of students), standard deviation (4) (from (3)), range (5), coefficient of variation (6) (from (4) and (1))
c/ DT results in graphs (frequency of scores): polygon / histogram
2. to consider the adequacy of the testing time
• at least 80% of students have to finish in time
• each test-taker has to solve at least 75% of the tasks
3. to identify suspicious tasks
• logical analysis of task results in C-R tests
• incorrect alternative answers are chosen by: almost no student / more "good" students than "weaker" students
• questions / a questionnaire for students about the DT quality (unclear tasks, tasks with more than one correct answer, tasks with no correct answer, unclear words / sentences / unclear instructions)
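A minimal sketch of step 1 using Python's standard library; the quantities correspond to the formula numbers (1)-(6) above, and the score list is hypothetical:

```python
# Descriptive statistics of DT scores, formulas (1)-(6) (hypothetical scores).
import statistics

scores = [18, 15, 19, 12, 17, 20, 14, 16, 18, 13]

mean = statistics.mean(scores)            # (1) average score
median = statistics.median(scores)        # (2)
variance = statistics.pvariance(scores)   # (3) population variance
stdev = statistics.pstdev(scores)         # (4) from (3)
score_range = max(scores) - min(scores)   # (5)
cv = stdev / mean                         # (6) coefficient of variation, from (4) and (1)

print(mean, median, variance, round(stdev, 2), score_range, round(cv, 3))
```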
4. to examine the relationship between DT scores and testing time
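The article does not prescribe a method for step 4; one plain option is the correlation between each student's DT score and the time they needed, sketched below with hypothetical data:

```python
# Pearson correlation between DT scores and testing times (hypothetical data).
# statistics.correlation requires Python 3.10+.
import statistics

scores = [18, 15, 19, 12, 17, 20, 14, 16]
minutes = [35, 42, 30, 45, 38, 28, 44, 40]

r = statistics.correlation(scores, minutes)
print(round(r, 2))  # negative r would mean better students finish faster
```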
5. to calculate the DT reliability and criterion-related validity
a/ reliability coefficient (7): more than 0.6 is required for non-standardized tests / 0.9 for standardized ones
b/ error of measurement: standard error of estimate (8) (from (4) and (7)) = the error of students' scores; standard error of average performance (9) (from (8) and the number of students)
c/ concurrent validity coefficient (10) (a correlation coefficient, e.g. between this test and another valid test)
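The sources above do not name a specific reliability formula (7). For binary-scored tasks a common choice is Kuder-Richardson 20 (KR-20), sketched below with a hypothetical item-response matrix; the standard error of measurement (8) is derived as s * sqrt(1 - r):

```python
# KR-20 reliability for binary-scored tasks and the standard error of
# measurement; rows = students, columns = tasks (hypothetical 0/1 data).
import math
import statistics

responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
]

k = len(responses[0])                         # number of tasks
totals = [sum(row) for row in responses]      # DT score per student
var_total = statistics.pvariance(totals)      # variance of total scores

pq = 0.0                                      # sum of item variances p*(1-p)
for j in range(k):
    p = sum(row[j] for row in responses) / len(responses)
    pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq / var_total)   # reliability coefficient (7)
sem = statistics.pstdev(totals) * math.sqrt(1 - kr20)  # standard error (8)
print(round(kr20, 2), round(sem, 2))          # kr20 > 0.6 is acceptable here
```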
6. to do a task factor analysis (success rates in solving curriculum items)
• curriculum items with less than 60% average success have to be repeated with the whole study group
• possible reasons: the item was not taught, too many curriculum items, too few lessons (too little time), incorrect educational means, low ability of the students, insufficient input knowledge of the students
7. suggestions for test correction, for the next test run

4. Conclusion
A good test is valid. "Professionally developed tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted. If you develop your own tests or procedures, you will need to conduct your own validation studies. As the test user, you have the ultimate responsibility for making sure that validity evidence exists for the conclusions you reach using the tests. When properly applied, the use of valid and reliable assessment instruments will help you make better decisions" [1]. The procedure for test construction can be used well in any course, with or without web technology and the PSI philosophy. Using valid tests contributes to improving the quality of the teaching/learning process.

5. References
[1] Understanding Test Quality - Concepts of Reliability and Validity [online], http://www.hr-guide.com/data/G362.htm
[2] Jacková, J., "Mastery Learning in Higher Education", eLearn 2006. Zborník z medzinárodného seminára 8.-9. februára 2006. Žilina: EDIS - vydavateľstvo ŽU, 2006, s. 94-99. ISBN 80-8070-505-4. eLearn 2006. Zborník príspevkov. [CD-ROM]. ISBN 80-8070-506-2.
[3] Gallup, H., "Personalized System of Instruction: Behavior Modification in Education", 1995 [online], http://ww2.lafayette.edu/%7Eallanr/gallup.html
[4] School Tests course [online], www.fmph.uniba.sk/mffuk/studium/bc_mgr/ILA/1-UIN328a.doc
[5] Tests and Test Validity [online], http://www.mathcs.duq.edu/~packer/Courses/Psy624/test.html
[6] Turek, I., Úvod do didaktiky vysokej školy. Košice: Technická univerzita, 2005. ISBN 80-8073-301-5.
[7] Koláriková, H., Kostelník, J., Tináková, K., Praktikum z inžinierskej pedagogiky. Bratislava: Slovenská technická univerzita, 1998. ISBN 80-227-1087-3.
[8] WebCAPSI - Web-based Computer-Aided Personalized System of Instruction [online], http://home.cc.umanitoba.ca/~capsi/capsislideshow/slideshow1.htm
[9] Izard, J., Overview of test construction [online], Paris: UNESCO, 2005, http://www.sacmeq.org/modules/module6.pdf
[10] Brothen, T., "Transforming instruction with technology for developmental students", reprinted from the Journal of Developmental Education, 21(3), 1998 [online].
[11] ACES: Validity Handbook: What Is Test Validity? [online]
[12] Modified Bloom's Taxonomy in Cognitive Domain [online], http://home.cc.umanitoba.ca/~capsi/capsislideshow/slideshow8.htm