PLACEMENT TESTING AND AUDIO QUIZ-MAKING WITH OPEN SOURCE SOFTWARE

Don Hinkelman and Timothy Grose
Sapporo Gakuin University, Bunkyodai 11, Ebetsu-shi, Japan
[email protected], [email protected]
Abstract

Commercial placement testing services are often cost-prohibitive for schools teaching foreign languages. Open source software provides a low-cost alternative that may eventually produce superior testing results. This paper reports the results of a pilot listening/reading comprehension placement test given in April 2004 to 230 freshmen entering a university General English program in Japan. The process of designing and programming an audio-based listening test or quiz is described, and the criteria of practicality and reliability are analyzed. The test design is discussed, a complete transcript of questions is displayed, and practical logistics are detailed in a handout. An item analysis of the 50-question test, showing item facility and item discrimination, determines which questions were most useful in dividing students into three levels of English ability. Examples of problems that need improvement for a proposed 2005 version are identified. Finally, the purpose and operation of a statistical report module (developed by T. Robb) included within the open source software package called Moodle are discussed. By improving item quality year by year, the authors conclude that a self-created placement test using open source software could, over several years of development, prove equal or superior to generic commercial products in reliability for closed-population placement testing.
1 Introduction
Placement testing, while common in intensive language programs, has often been neglected in university-level foreign language curricula in many parts of Asia. This is largely due to the high cost in personnel and financial resources that schools must invest if they are to implement an initial diagnostic evaluation of a student's language level prior to enrolment in a compulsory foreign language course. In addition, there is reluctance amongst faculty to "penalize" students who may place at a lower level. In Asia particularly, English language education at the secondary and tertiary level is compulsory and characterized by large class sizes and mixed levels of students in the same classroom. The result is confusion for teachers, who must design lesson plans that span a variety of language levels, and lower motivation for students, who are either over-challenged with difficult tasks or bored with easy activities.

To overcome this problem, one university in northern Japan attempted to stream its students into levels. With limited time and financial resources, it experimented with an open source software approach to placement testing. Commercial testing packages had been considered but were rejected due to the high fees proposed. Similarly, in describing the negative attributes of computer-assisted language tests, Taylor et al. (1999) found that the tests in their study failed the criteria of practicality, as they were so expensive that many examinees and testing programs were not able to adopt them. If this practical problem cannot be overcome, then the positive attributes of computer-based testing may never be realized.
Secondly, the time-consuming nature of test marking and test-item analysis means that test revision is rarely done, despite the consensus of language testing authorities (Cohen, 1994; Bachman & Palmer, 1996; Brown, 1996) who emphasize the importance of statistical analysis to continually assess and improve testing results. To this end, this paper documents a single case study of a pilot placement test, including descriptions of the testing approach, test design, hardware/software selection, test administration, and statistical evaluation. Chapelle (2000) lists six qualities to consider in evaluating the usefulness of a computer-assisted test: reliability, construct validity, authenticity, interactiveness, positive impact, and practicality. In this pilot phase, the study focuses particularly on two of these qualities, the practicality and reliability of applying computer-based placement testing to large numbers of students in a single teaching institution, through two research questions:

1. What is the feasibility of using open source software for in-school placement testing?
2. How can test item reliability be improved with this software?
2 Testing Approach
A placement test is a form of diagnostic evaluation for students already selected to join a language learning program. Its purpose is to separate students into levels so that tasks and activities can be tailored to fit their level or a level slightly higher. This is different from a formative evaluation, which judges progress within a course, or a summative evaluation, which compares students at the end of a course. Entrance or admissions examinations are a form of summative evaluation intended to judge students from a variety of schools and determine their relative competence in the language. This placement test, however, only compares students within the group already accepted into the language learning program. In this sense, the students experience less pressure in taking the test and no fear of failure. The test is given during the freshman orientation period with no advance warning, so students cannot prepare for it. The incoming student body to be tested ranges from 1000 to 1100 students per year.

In Japan, English is nearly always a compulsory subject for high school students. Universities thus include English as a subject in almost all entrance examinations. As four or five subjects may be examined and considered in accepting a new student, an entering class of first-year students will show high diversity in actual language levels. Without diagnostic placement tests, a teacher may find students in the same class who cannot hear or speak simple greetings alongside students who are fluent in everyday English conversation, perhaps even with study abroad experience during their high school years. Yet surprisingly, diagnostic language testing has not become a mainstream practice at universities in Japan. Amano (1990) suggests that this is due to group-oriented conventions in Japan that discourage the differentiation and demarcation of individuals' ability levels.

The history of testing at an institution also affects the approach to testing. For almost ten years, a progress test had been administered to all general English students at Sapporo Gakuin University to stream them into levels for their second and final year of compulsory foreign language courses. It comprised 50 items: 20 reading questions and 30 listening questions. This same format was adopted as the basis for the design of the freshman placement test piloted in April 2004. The validity of this format has not been examined and remains an important question to be researched.
However, by eliminating grammar-based test items and devoting 60% of the test to a variety of listening tasks, the design was intended to place less emphasis on grammar-translation than the traditional testing methods common in Japan.
3 Test Design
A multiple-choice question format was selected for convenience of administration and because of its integration with the statistical analysis module included in the software package. The test consists of fifty multiple-choice questions with four choices each (one correct answer and three distractors). The test is displayed to look like a paper-based test, showing all questions on a single web page and allowing students to scroll up and down and to go back and change answers before final submission. As most students are unfamiliar with computer-based testing, instructions and navigation aids were written in the students' native language (Japanese). Students were normally given forty minutes to complete all questions in the paper-based version. For the computerized pilot, an extra ten minutes was added to allow students time to familiarize themselves with the equipment. No visual aids, photos, or videos were added to the test questions, although the software was capable of including them.

The reading section consisted of three parts: 1) ten sentences with vocabulary and usage problems, 2) a short reading with five follow-up questions, and 3) a longer reading with five follow-up questions. The thirty listening comprehension questions were recorded as audio files in mp3 format. The resulting files are very small (5-100 KB), allowing fast playback on the web with little wait time for downloading. This section was also divided into three parts. The first part was short listening: one to three sentences were recorded for each item, and a single question was asked about each recording. The second part included short dialogues, with a single question asked after each dialogue. The third part consisted of two longer listening recordings; after each recording, five questions were asked. Students could play each recording as many times as they wished, keeping in mind the overall time limit for the test.
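As an illustration of how such items might be authored in bulk, the following sketch writes multiple-choice questions to a file in Moodle's GIFT import format, embedding a link to the mp3 recording in the question text so that the audio filter described in the next section can render it as a play button. The question content and file names are invented for the example, and the GIFT-import workflow is an assumption rather than the procedure actually followed by the authors.

```python
# Sketch: generate multiple-choice items in Moodle's GIFT import format.
# Question text, answers, and mp3 file names below are illustrative only.

listening_items = [
    {
        "name": "Listening01",
        "audio": "listening01.mp3",   # assumed to be served from the course files area
        "prompt": "What did the speaker decide to do?",
        "correct": "Take the earlier train",
        "distractors": ["Cancel the trip", "Call a taxi", "Wait at the hotel"],
    },
]

def to_gift(item):
    # A plain mp3 link is embedded in the question text; a multimedia filter
    # (if enabled) can replace it with an inline play button.
    prompt = '<a href="{0}">Play</a> {1}'.format(item["audio"], item["prompt"])
    answers = ["={0}".format(item["correct"])]
    answers += ["~{0}".format(d) for d in item["distractors"]]
    return "::{0}:: {1} {{\n  {2}\n}}\n".format(item["name"], prompt, "\n  ".join(answers))

with open("placement_listening.gift", "w") as f:
    for item in listening_items:
        f.write(to_gift(item) + "\n")
```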
4 Hardware/Software Selection
One popular placement testing service is the CASEC Test, which offers a TOEFL-style test hosted on the provider's own servers and accessed via the Internet. The service provides a TOEFL-equivalent scoring system and allows a student to take unlimited tests during a one-year period. The cost is 3,000 yen per student per year (about USD 27). Testing an entire first-year student body of 1,100 students would then exceed three million yen (over USD 30,000) per year. Open source alternatives were also examined, and Moodle was selected according to the following criteria:

- low cost: zero cost for open source software
- standard hardware and OS compatibility: Windows, Linux, Mac
- quiz module for creating multiple-choice questions
- integrated statistical analysis of test items
- audio file playback
- easy-to-use user interface
- multi-lingual capability (Japanese and English)
- exportable test results
- test item revision capacity
The one requirement not provided by Moodle in late 2003 was a user interface that would
show a start/stop button for each audio quiz question. It was necessary to avoid a separate window appearing to play a recording, a needlessly confusing operation. As Moodle is open source, modifications are easy to make, and new source code can be added to new releases, allowing many schools to benefit from improvements. Sapporo Gakuin University therefore donated 120,000 yen (about USD 1,100) to a Moodle association of programmers to modify the code in time for April 2004 placement testing. The resulting modification allowed audio streaming in any part of the Moodle activities, including quiz questions. In this design, a multimedia filter places a small, clearly identifiable button (programmed in Flash) next to the quiz question with start and stop functions. Although Moodle offered a low-cost solution to our placement test needs and an easy-to-use interface that teachers were ready to use for test authoring, some members of the Information Science Faculty resisted this selection by the English Department. They recommended a test-making package that had already been purchased, as a further way to economize. However, that software proved difficult to use (HTML programming skills were required) and had no statistical analysis functions. Despite demonstrations of the ease of use and analytic features of the Moodle application, a consensus to use this approach for the following year exists only among the English Department faculty.

5 Test Administration

Test administration is cited as one of the main problems in computer-based testing (Gorsuch & Cox, 2000). To eventually handle over 1,000 students in one day of testing, careful procedures and staff training were necessary. In this pilot phase, four computer rooms with sixty workstations each (240-student capacity) were available for the testing. Two departments were tested at separate times: the Commerce Department with 130 students went first, followed by the English Department with 80 students. Three faculty members were assigned to each room to assist students during the tests. These staff prepared the rooms by placing headphones at each station and putting an instruction sheet on each keyboard. Students had assigned seating according to their student number. The staff ensured that students sat in their correct, allocated seats and that all students were correctly logged in. The staff then monitored the rooms to ensure that the students understood what to do and to check that all computers were functioning. To prevent students from copying answers directly from adjacently seated students, the answer choices were randomly shuffled within each test question. Test questions could also have been shuffled, but because several questions were connected to a single reading or listening passage, this was not practical.

When the test began, one teacher read out instructions in Japanese to the classroom of 50-60 students and two teachers roamed, assisting students with questions. Students in the Commerce Department then attempted to log in to the test but were unable to due to over-demand on the server. The test was abandoned after 20 minutes of attempts to revive the server. The second group of 80 students came in, and this time the login was successful, though slow. All students were able to complete the test within the allotted one hour.

6 Placement Procedures

Test results of the 80 students in the English Department were calculated immediately by the software and exported to a spreadsheet file.
The 130 students in the Commerce Department, who took a paper version of the test, were marked one week later. The total of 230 students in the pilot test phase were then divided into three levels according to their raw scores. The upper third were titled "fast track" or "F" level, the middle level was called the "regular" or "R" track, and the lower level were placed in the "novice" or "N" track. Students subsequently began attending classes based on these groupings (two semesters, 14 classes x 2 = 28 total classes).
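As an illustration of this raw-score split, the sketch below divides an exported score list into thirds. The file name and column names are assumptions about the spreadsheet export, not a description of the file actually produced by the software.

```python
# Sketch: split exported raw scores into three placement tracks (F, R, N).
# The CSV file name and column names are assumed for illustration.
import csv

with open("placement_scores.csv") as f:
    rows = list(csv.DictReader(f))  # expected columns: student_id, score

rows.sort(key=lambda r: int(r["score"]), reverse=True)

third = len(rows) // 3
tracks = {
    "F": rows[:third],            # fast track: upper third
    "R": rows[third:2 * third],   # regular track: middle third
    "N": rows[2 * third:],        # novice track: lower third (plus any remainder)
}

for track in ("F", "R", "N"):
    print(track, [r["student_id"] for r in tracks[track]])
```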
7 Evaluation of Results

Test item reliability, technical operation, and teacher impressions were evaluated following the pilot placement test. The test item results were analyzed immediately with a single-click button, "Detailed Test Results", within the teacher's administration section of the Moodle quiz module. Individual student answers were compiled into a master sheet viewable onscreen or downloadable into a spreadsheet. The table showed the content of each answer as well as the scores. In the summary scoring section, the percentage of correct answers for each test item was indicated (item facility), as well as the ability of the test item to separate better-performing students from poorer-performing ones (item discrimination). This summary table is shown in Appendix A. Item facility and item discrimination were judged by the criteria shown in Table 1.

Item Facility                             Item Discrimination
Good:                  40~60%             Good:        > 2.0
Acceptable:            30~40%, 60~70%     Acceptable:  1.3 ~ 2.0
Poor (too difficult):  0~30%              Poor:        < 1.3
Poor (too easy):       70~100%

Table 1: Criteria for Item Analysis
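For readers who wish to reproduce this analysis outside Moodle, the sketch below computes item facility and an upper-versus-lower-group discrimination measure from a 0/1 matrix of item scores, then labels each item by the facility criteria in Table 1. The input layout is assumed, and the ratio-based discrimination measure is only one possible reading of the Table 1 thresholds; the statistics module by Robb (2003) may calculate discrimination differently.

```python
# Sketch: item facility (IF) and a discrimination measure from 0/1 item scores.
# "responses" maps each student id to a list of 0/1 scores, one per item (assumed layout).
def item_analysis(responses, n_items):
    students = sorted(responses, key=lambda s: sum(responses[s]), reverse=True)
    third = max(1, len(students) // 3)
    upper, lower = students[:third], students[-third:]

    report = []
    for i in range(n_items):
        facility = sum(responses[s][i] for s in students) / float(len(students))
        p_upper = sum(responses[s][i] for s in upper) / float(len(upper))
        p_lower = sum(responses[s][i] for s in lower) / float(len(lower))
        # Ratio of upper-group to lower-group success; chosen so its scale roughly
        # matches the Table 1 thresholds, but this formula is an assumption.
        discrimination = p_upper / p_lower if p_lower > 0 else float("inf")

        if 0.40 <= facility <= 0.60:
            label = "good"
        elif 0.30 <= facility < 0.40 or 0.60 < facility <= 0.70:
            label = "acceptable"
        elif facility < 0.30:
            label = "poor: too difficult"
        else:
            label = "poor: too easy"
        report.append((i + 1, round(facility, 2), label, round(discrimination, 2)))
    return report
```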
These criteria were used to identify test items that needed revision or elimination. Table 2 shows the results for the 50 test items in the pilot placement test taken by 80 students.

Item Facility                              Item Discrimination
Performance Level      Number of Items     Performance Level    Number of Items
Good                   12                  Good                 20
Acceptable             16                  Acceptable           18
Poor: too difficult     6                  Poor                 12
Poor: too easy         16

Table 2: Test Item Results According to Item Facility and Item Discrimination
From these figures we identified 24 test items that needed attention, requiring either a new question or new answer choices. The remaining 26 items were considered useful for the purpose of placing our students into three levels. This information was then applied to the revision of questions for the following year's placement test.
Note that test item revision is not an option in commercial testing services. Individual schools or teachers are not allowed to create, copy, review, or revise any content on the copyrighted system. Thus a test cannot be tailored efficiently for a particular school and its population of students. These commercial products are aimed at a national market and thus must test all levels of students. When computer-adaptive features are added, the test may come closer to fitting the level of an individual, yet there is still no way for a single institution to statistically check the items to ensure their usability and effectiveness.

Network logs were examined to determine the cause of the server stoppage that occurred in the first of the two testing periods. The database, the testing software, the webserver software, and the LAN all performed adequately. Moodle has been tested in university environments of over 20,000 students and performed with excellent speed when set up with sufficient hardware capacity. On the day of our pilot test, network traffic only approached 10% of capacity, approximately 1MB of data throughput at peak. All problems were due to insufficient memory size and processor speed in the hardware. The server (a single-processor 933 MHz Mac desktop) reached its capacity at 80 students simultaneously accessing the server; significant slowdowns began with 40-50 students. The MySQL database queries were too many for the processor to handle, and when over 130 students began logging in, the database stopped with error messages. In the second test period, with only 80 students, students were able to log in and begin the test, though with significant slowdown. All 80 students could complete the test, including hearing the audio portions.

The teachers were not surveyed or interviewed formally. However, their impressions reported after the first two weeks of class were highly enthusiastic. They reported that after switching to streamed class rosters, class cohesion was higher and there was greater initial cooperation amongst the students than in previous years. These are promising comments that need to be investigated in the future.

8 Future Directions

Plans and budget have been authorized to purchase new hardware with sufficient speed and capacity to handle up to 250 students simultaneously. This hardware is being installed in time for the April 2005 orientation of new students. Despite the initial hardware failure, teachers in the English Department were positive about the potential of the placement tests and encouraged further development. In 2005, two versions of the placement test are planned: one version for departments with typically higher levels of English (Psychology, Human Science, English) and one version for departments with generally lower levels (Economics, Commerce, Law, Information Science).
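Given the server overload in the first testing period, a rough concurrency check of the new hardware before the 2005 administration would be one way to confirm that it can handle the target load. The sketch below outlines such a check; the server URL and the number of simulated users are assumptions for illustration, not part of the pilot described above.

```python
# Sketch: rough concurrency check of a test server before placement-test day.
# The URL and user count below are assumptions for illustration only.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
import time

TEST_URL = "http://testserver.example.ac.jp/moodle/login/index.php"  # hypothetical address
SIMULATED_USERS = 250  # target capacity for the 2005 administration

def simulated_user(_):
    """Request the login page once; return the response time in seconds, or None on failure."""
    start = time.time()
    try:
        urlopen(TEST_URL, timeout=30).read()
        return time.time() - start
    except Exception:
        return None

with ThreadPoolExecutor(max_workers=SIMULATED_USERS) as pool:
    results = list(pool.map(simulated_user, range(SIMULATED_USERS)))

times = [t for t in results if t is not None]
print("failed requests:", results.count(None))
if times:
    print("slowest successful response: %.1f seconds" % max(times))
```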
9 Conclusion

With sufficient hardware resources, open source software was successful in providing a practical technical platform for administering placement tests to large numbers of students in a short time. Administration time was equal to that of comparable paper-based assessments. Time savings, however, were dramatic in the marking and analysis of test results, where computer-based scoring made these tasks almost instantaneous. Financial costs were very low for hardware, very low for software, and moderately high for personnel in the first year of development. It is anticipated that personnel time for setup will be very low in subsequent years in which the placement test is employed.

Concerning the second research question, the reliability of the test items, twenty-four of the fifty test items were marked for revision or removal after item facility and item discrimination analysis. Typically, the poorly performing questions were either too easy (16 items with high IF) or too difficult (6 items with low IF). An additional two questions with acceptable IF were found to have poor discrimination between high- and low-scoring students. The statistical module included in the software thus made it very easy to identify problems in test item construction. With such a highly cost-effective platform, universities with large intakes of students studying foreign languages will find open source software suitable for placement testing programs.

References

Amano, I. (1990). Education and examination in modern Japan (W. K. Cummings & F. Cummings, Trans.). Tokyo: University of Tokyo Press.

Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford, UK: Oxford University Press.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Chapelle, C. (2000). Computer applications in second language acquisition: Foundations for teaching, testing, and research. Cambridge, UK: Cambridge University Press.

Cohen, A. (1994). Assessing language ability in the classroom (2nd ed.). Boston: Heinle & Heinle Publishers.

Gorsuch, G., & Cox, T. (2000). Something old, something new, something borrowed, something...: Piloting a computer mediated version of the Michigan Listening Comprehension Test. TESL-EJ [On-line], 4(4), 1-19. Available: http://www-writing.berkeley.edu/TESL-EJ/ej16/a2.html

Robb, T. (2003). Detailed statistics for the quiz module. Moodle open source course management system. Available: http://moodle.org

Taylor, C., Kirsch, I., Jamieson, J., & Eignor, D. (1999). Examining the relationship between computer familiarity and performance on computer-based language tasks. Language Learning, 49(2), 219-274.
Appendix A: Item Response Analysis
Item Facility (IF) and Item Discrimination (ID), n=80
Sapporo Gakuin University, Placement Test for First Year English Majors, 2004.4.5
Software Source: Moodle http://moodle.org
Statistical Module: Robb, T. (2003)