The Development and Validation of a Simulated Oral Proficiency Interview

CHARLES W. STANSFIELD
Center for Applied Linguistics
1118 22nd Street, NW
Washington, DC 20037
E-Mail: CAL@GUVAX

and

DORRY MANN KENYON
Center for Applied Linguistics
1118 22nd Street, NW
Washington, DC 20037
E-Mail: CAL@GUVAX

SINCE THE PUBLICATION AND DISSEMINATION of the Proficiency Guidelines of the American Council on the Teaching of Foreign Languages (ACTFL) (1; 2), proficiency instruction and testing have become household words in foreign language education. The Guidelines, originally developed in 1982, were the subject of over 400 articles in professional journals by 1988 (8; 18) and represent the efforts of ACTFL, with assistance from the Educational Testing Service (ETS) and the Federal Interagency Language Roundtable (FILR) (9). As a method of assessing global speaking proficiency, the oral proficiency interview (OPI) has also been widely disseminated. During the 1980s, ACTFL trained approximately 2000 oral proficiency interviewers and raters in Spanish, French, German, and Russian. This article reports on an alternative method to the face-to-face procedure employed by the OPI for eliciting speech samples that may be rated according to the ACTFL Proficiency Guidelines.

While many professionals in the commonly taught languages were being trained in the OPI procedures, it became clear to staff at the Center for Applied Linguistics (CAL) that in the less commonly taught languages problems of manpower and economics would limit the accessibility of the benefits of comprehensive oral proficiency testing offered by the OPI. Thus CAL sought to explore the use of a tape-mediated procedure for assessing oral proficiency. At the same time, CAL was anxious to ensure that the ACTFL/FILR Proficiency Guidelines would be used as the scoring scale for the new procedure.

In other words, the new tape-mediated procedure would have to collect examinee speech samples containing the necessary breadth to be rated on the ACTFL scale. Through its research and development projects, CAL has developed what has come to be called the simulated oral proficiency interview (SOPI) to achieve these ends (13). The SOPI is distinguished from earlier tape-mediated assessments of speaking ability, such as the Recorded Oral Proficiency Exam (ROPE) (10) or the Test of Spoken English (TSE) (6), by its combination of each of these three characteristics: 1) in format it is similar to the OPI, beginning with a warm-up and using a variety of speaking tasks at different levels on the ACTFL scale to probe speaking proficiency; 2) it uses both aural and visual stimuli to elicit the speech sample; and 3) it is scored according to the ACTFL Guidelines. Recognizing the limited abilities of examinees at the Novice level, the SOPI is designed for examinees in the range of Intermediate-Low to Superior on the ACTFL scale (1 to 3+ on the FILR scale).

CAL completed the development of its first SOPI, the Chinese Speaking Test (CST) (5), in 1986. Like its predecessors developed by the federal government (10), the CST was a response to the need for oral proficiency testing in situations where the administration of a face-to-face test is either impractical or impossible due to the unavailability of a trained interviewer. Since the CST appeared, CAL has made further refinements on the format and has developed similar tests in Portuguese (15), Hebrew (11; 12), Hausa, and Indonesian (17) under grants from the US Department of Education. In addition, CAL has developed such tests in French and Spanish for the Texas Education Agency for use in its teacher certification program. The practical advantages of the tape-mediated SOPI format have led to a growing interest in using it in both low-volume, less commonly taught languages and large-scale testing programs.


Other institutions have used the SOPI model to develop similar tests in other languages: the University of Pennsylvania (Hindi, Urdu, Bengali); the University of Virginia (Tibetan); and the University of Michigan (Arabic). In addition, CAL staff have made a number of presentations on the SOPI format to foreign language supervisors and others who are interested in considering it as an option for their oral proficiency testing needs.

As a semi-direct procedure, the SOPI uses recorded and printed stimuli to elicit a speech sample from the examinee rather than face-to-face (i.e., direct) interaction. Thus it eliminates the need for the on-site interviewer and offers the most efficient and feasible approach to oral proficiency testing in low-volume languages, where there may be only a few individuals in the nation trained to give and score the OPI. Tapes recorded at any institution can be sent to the trained individual for scoring, without that individual needing to see the examinee. But because the SOPI is scored using the ACTFL Guidelines, it can provide the benefits derived from a continual assessment program and promote proficiency-based goals which students of the low-volume languages can attempt to attain.

On the other hand, in certain situations there is a very high volume of examinees to be tested in a relatively short period of time. Since the SOPI can be group-administered in a language laboratory setting, many examinees can be given the test simultaneously (though their taped responses will still need to be scored separately). Also, in many high-volume situations, standardization of the test can play a crucial role (14). Although the quality of an OPI may vary depending on a variety of factors (such as the interviewer's experience, fatigue, or personality characteristics), the SOPI does not vary in quality, but offers the exact same caliber of test to all examinees. In a large-volume testing situation, raters can be carefully trained and monitored, as in essay scoring, to ensure reliable scoring.

This article reports on one of CAL's SOPIs, the Indonesian Speaking Test (IST), which represents a typical SOPI. We first give some background on the format of the IST. Then we describe its development. The last section of the paper deals with the outcome of a study investigating the reliability of the IST and its validity as a surrogate for the OPI in measuring oral proficiency.

FORMAT OF THE IST

Like all SOPIs, the IST has three components: a Master Tape containing the test directions and questions, a Test Booklet containing pictures and other materials used in responding, and an Examinee Response Tape, upon which the examinee's responses are recorded and which is scored after the test is administered. The test can be administered to a group in a language laboratory setting or individually using two tape recorders.

There are five parts to the IST. Each part requires the examinee to accomplish different tasks with the language. Various types of questions are used to probe the depth of, and elicit a representative sample of, the examinee's ability to speak and use Indonesian. Before beginning the test, the examinee listens to general directions from the Master Tape. These directions are also printed on the front cover of the Test Booklet. Following is a brief description of the various types of items found in each part of the test.

Part 1 (Personal Conversation). In this part of the test, the examinee is asked to respond to several short questions about his or her family, education, hobbies, etc. This is the only part in which the target language is used on the tape. After each question is asked, the examinee has between three and twenty seconds to respond, depending on the information requested by the question. Below are some examples (though not actual test questions) of the type of questions found in this part of the test, followed by English translations.

Apa anda mempunyai kakak dan adik? (Do you have brothers and sisters?)

Apa yang biasanya anda lakukan pada waktu liburan? (What do you usually do on your vacation?)

Anda suka olah raga apa, dan mengapa? (What type of sports do you like, and why?)

For each item in parts two through five, examinees hear the task on the Master Tape. The task is also written in the Test Booklet. Following the reading of the task, examinees are given between fifteen and thirty seconds to review it and think about their response before they speak. Pauses on the tape for giving an answer vary from forty-five seconds to two minutes, depending on the complexity of each item.

Part 2 (Giving Directions). In this item the examinee is shown a pictorial map in the Test Booklet and is instructed to give directions between two points on the map. Examinees are told to whom to give the directions and why the directions are needed.

Part 3 (Picture Sequence Narration). The examinee is instructed to speak in a narrative fashion about a sequence of four or five pictures shown on a single page in the Test Booklet. There are three picture-sequence items focusing on present, past, and future time narration respectively. The context in which the examinee gives the narration and the person to whom he or she is giving it are specified in the directions to the item.

Part 4 (Topical Discourse). In this part, the examinee is instructed to talk to various Indonesians about six topics. Examples of such topics follow.

Describe your favorite outdoor activities to Dewi, an Indonesian friend about your age.

Describe for Dr. Amir, a visiting Indonesian professor, some of the advantages and disadvantages of using public transportation in America.

Explain to Fatimah, a new Indonesian employee in your company, how one would go about buying a used car in the United States.

Some Americans feel that foreign language education should begin in the first grade or even kindergarten. Do you agree or disagree? Mohamad, a graduate student from Indonesia, has asked you about this matter. Explain to him your own position on this matter, giving him clear reasons to support your views.

Akhmad, an Indonesian acquaintance about your age, has asked you what you would do and where you would go if you were financially able to spend one year traveling, free of any other responsibilities. Explain to him your answer.

Part 5 (Situational Discourse). In this part, the examinee hears and reads five printed descriptions of real-life situations in which a specified audience and communicative task are identified. The examinee is then instructed to carry out the specified task. Examples of such situations are given below.

You arrive at a hotel in Jakarta. Tell the young desk clerk you need a room with private bath for four nights. Ask him about room rates and if you can pay by traveler's check.

You are at a clothing store in Bandung. Ask the young female shop assistant if you can return a shirt you bought there yesterday. You discovered it had a stain on it when you brought it home.

You want to buy a batik at a stand at an open-air market in Indonesia. However, you feel the asking price is too high. Try to convince the vendor, an elderly man, to lower the price.

You are studying for a semester in Indonesia. Hassan, an Indonesian student you have gotten to know, asks you to tutor him on a regular basis in English. This is not something you want to do during your stay in Indonesia. Give your response now to Hassan's request.

At the end of a one-year stay in Indonesia, the family you have been living with holds a farewell party in your honor. During the party, you make a few remarks of appreciation of your host family to the group assembled.

DEVELOPMENT OF THE IST

The day-to-day work of the project was conducted at the Center for Applied Linguistics (CAL) in Washington, D.C. A local test development committee was formed, which included CAL staff and consultants experienced in SOPI test development and two experienced instructors of Indonesian with training in using the FILR oral proficiency testing procedures and rating scales.¹ An external review board composed of three professors of Indonesian was also formed.² Input from these individuals, received as they reviewed draft forms of the test items, played an important role in determining the final format of the test.

The local test development committee met on a regular basis for three months to develop the specific items for the two forms of the pilot version of the test. These items were based on the item types used in the Chinese and Portuguese SOPIs, which had been developed earlier. After review by the external board members and subsequent revisions, the two forms of the IST were trialed on twelve individuals from the Washington, D.C. area who had learned Indonesian in a variety of ways, some with resident experience in Indonesia. The purpose of the trialing was to ensure that the items were clear, understandable, and working as intended, as well as to check the appropriateness of the pause times allotted on the tape for examinee responses. Each subject in the trialing took the test on an individual basis using two tape recorders. In each case, an Indonesian-speaking member of the local test development committee

observed the examinee taking the test and made notes on his or her performance on a specially prepared questionnaire. Upon completion of the test, examinees also responded to a detailed questionnaire about it. In most cases, they were also debriefed on their testing experience in person.

The project coordinator took the tapes made during the trialing to the spring 1989 meeting of Indonesian instructors working on developing the ACTFL guidelines for Indonesian, at which meeting the three members of the external review board were also present. The trialing form of the IST was discussed, and one of the trialing tapes was listened to and commented on by the entire group. Many comments were offered for improving the pilot version of the test, most notably the need to contextualize the test to an even greater degree than had been done on previous SOPIs, giving specific information on the age and social status of every interlocutor presented in the test. This was deemed more crucial in the IST than in the PST, as one of the major characteristics distinguishing Indonesian from Western languages is the importance of using correct terms of address and modifying speech depending on the social status and relationships of the speakers.

On the basis of data collected during the trialing and of comments from the external review board members, the final format of the test was modified from that of the trialing version in three ways. First, the Personal Conversation section of the IST was completely contextualized into a single cohesive role-play. In one form the examinee is being interviewed by a member of a scholarship selection committee; in the other, the examinee is talking to an Indonesian friend's aunt. Questions in this part remained, however, of a "warm-up" nature, focusing on the examinee's personal background, education, interest in the Indonesian language, etc. Second, for the rest of the test, more information on the one being spoken to is given in the IST than in the CST or PST. In narrative form, information on the person's sex, age, social status, and name (when applicable) is given in each item. Third, it was decided to leave out the second picture item from the trialing form. This item asked examinees to give a detailed description of things and activities presented in a picture. The external review committee perceived this item to be more of a vocabulary exercise and not helpful in rating examinees above the Novice level in Indonesian.

In its place, an extra topic item was included on the final form of the IST. Once the local test development committee completed revising the two forms of the IST, the forms were again reviewed by the members of the external review board. After final revisions were made, test booklets and tapes for the validation study were prepared.

VALIDATION STUDY

Similar to studies conducted for the CST and PST (reported on in 5 and 13), a research study was designed to investigate the extent to which the IST had reached its goals of providing a test that can be reliably rated using the ACTFL/ FILR scale and can be used as a surrogate for the OPI in cases where giving a face-to-face interview is impractical or impossible. Thus, the study was designed to answer three research questions: 1) Can this test be scored reliably by different raters? 2) Are the two separate forms of the IST interchangeable, i.e., do they produce similar examinee results independently of the particular form administered? 3) Do the simulated oral proficiency interviews produce the same score as a face-to-face interview for any given examinee? This study involved sixteen subjects. Eight were students in the intensive and regular Indonesian program at Cornell University, and eight were from the Washington, D.C. area. The latter group learned Indonesian through a variety of means. Each subject was first administered the OPI by the same individual, a certified tester from the Foreign Service Institute. Following the interview, the subjects at Cornell took the taped tests at Cornell’s language lab within two weeks after the live interview was administered. Subjects in Washington took the two SOPIs at the Center for Applied Linguistics, directly following the live interview. In one case, however, the subject returned a week later to take the taped tests. The design controlled for order of administration, with half of the subjects in each group (Cornell and Washington) receiving Form A first and Form B second, and the other half in reverse order. The design also attempted to select subjects representing a variety of proficiency levels. Thus, participants were selected on the basis of their amount of exposure to Indonesian. Their responses to the OPI were recorded on tape for later scoring. The individual administering the OPI served


as one of the two raters for the study. A second Indonesian examiner at the Foreign Service Institute served as the second rater.³ All the tapes were rated independently, anonymously, and in random order. Raters, however, scored all of the OPIs before proceeding to rate any of the SOPIs. After all the ratings were completed, subjects were mailed their test results: the scores of the two raters on the live interview and on each of the IST versions.

To proceed with the empirical analysis of the ratings, scores on both the live interview and the tape-mediated semi-direct tests were converted to a simple scale combining both the ACTFL and FILR rating scales, with weights assigned as follows:

ACTFL/FILR Level                      Coded as
Novice-Low/Level 0                    0.2
Novice-Mid                            0.5
Novice-High/Level 0+                  0.8
Intermediate-Low/Level 1              1.0
Intermediate-Mid                      1.5
Intermediate-High/Level 1+            1.8
Advanced/Level 2                      2.0
Advanced-Plus/Level 2+                2.8
Superior/Level 3                      3.0
High Superior/Level 3+ and above      3.8
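Expressed as data, the conversion is a simple lookup from level label to numeric weight. The short Python sketch below is our own illustration of that mapping (the label strings and the function name are assumptions, not part of the test materials); the rationale for the particular weights follows.

```python
# Illustrative sketch only: numeric weights for the combined ACTFL/FILR scale.
# The dictionary keys and function name are our own, not the authors'.
ACTFL_FILR_WEIGHTS = {
    "Novice-Low/0": 0.2,
    "Novice-Mid": 0.5,
    "Novice-High/0+": 0.8,
    "Intermediate-Low/1": 1.0,
    "Intermediate-Mid": 1.5,
    "Intermediate-High/1+": 1.8,
    "Advanced/2": 2.0,
    "Advanced-Plus/2+": 2.8,
    "Superior/3": 3.0,
    "High Superior/3+": 3.8,
}

def code_rating(level_label: str) -> float:
    """Convert an ACTFL/FILR proficiency label to its numeric code."""
    return ACTFL_FILR_WEIGHTS[level_label]
```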

The system of score coding above is derived from the FILR 0 to 5 rating scale and is intended to assign an appropriate numerical value to the proficiency level descriptions. For example, proficiency at an Advanced-Plus level is characterized by many of the same features as at the Superior/3 level, though the examinee cannot sustain the performance. Thus, the numerical interpretation falls closer to 3.0 than midway between Advanced and Superior. The several tables below provide descriptive statistics, inter-rater reliabilities, and parallel form reliability data obtained in the study.

Table I gives the descriptive statistics for the scores assigned in this study. Table I indicates that although the range of scores awarded to examinees was wide (from Novice-Mid to High Superior), as a group the examinees were actually quite proficient in Indonesian. The average ratings assigned by both raters, between an Advanced and an Advanced-Plus, reflect this fact. The mean scores for each rater were very similar, indicating that the raters were almost equal in their degree of severity. Mean scores awarded for the live interview, however, were slightly higher than those awarded on the taped test.

The frequencies of the scores awarded to each person are presented in the following three cross-tab diagrams. These diagrams also show the number of agreements between the absolute ratings of the two raters (in bold print along the diagonal). First, Table II presents the ratings of Rater 1 (down) against the ratings of Rater 2 (across) for the live interview. For the live interview, there was total agreement in 81.25% of the ratings. In the three cases where there was disagreement, none was more than one step away on the rating scale. The total columns indicate the high proficiency of this group. Eleven of the sixteen examinees were assigned an Advanced-Plus or better by both raters for the live interview.

Table III presents the ratings of Rater 1 (down) against Rater 2 (across) for IST Form A. From Table III we see that the agreement of the absolute ratings was again extremely high. There was total agreement in 87.5% of the sixteen Form A ratings. For the two cases of disagreement, neither was more than one step away on the rating scale. Nine of the sixteen examinees were awarded an Advanced-Plus or better by both raters on IST Form A.

TABLE I
Descriptive Statistics for Score Levels Assigned by Raters to SOPI and OPI Tests

Test Form             Rater      Minimum Score    Maximum Score    Mean    Standard Deviation
OPI (N = 16)          Rater 1    0.8              3.8              2.64    0.96
                      Rater 2    0.8              3.8              2.63    0.90
IST Form A (N = 16)   Rater 1    0.8              3.8              2.47    0.94
                      Rater 2    0.8              3.8              2.50    0.92
IST Form B (N = 16)   Rater 1    0.5              3.8              2.58    1.03
                      Rater 2    0.5              3.8              2.44    1.00


TABLE II
Crosstabulations of OPI Ratings (N = 16): Rater 1 (down) by Rater 2 (across)

TABLE III
Crosstabulations of IST Form A Ratings (N = 16)

TABLE IV
Crosstabulations of IST Form B Ratings (N = 16)


Table IV presents the ratings of Rater 1 (down) against Rater 2 (across) for IST Form B. From Table IV we see that the agreement of the absolute ratings was again moderately high (62.5%). Where there was disagreement, it was no more than one step away on the rating scale. Again, nine of the sixteen examinees received ratings of Advanced-Plus or above, indicating a very high level of proficiency for the group involved in this study.

Together, Tables II through IV indicate relatively high consistency between the two raters. No consistent trend is apparent in either rater in terms of rater severity, though Rater 1 was occasionally more lenient than Rater 2 on Form B.

THE RELIABILITY OF THE IST

There are various approaches to determining the reliability of test scores. Two will be presented here: the more traditional approach of using correlations, and then a more modern and informative approach using generalizability theory. Table V below presents the inter-rater reliabilities (Pearson product-moment correlations) between the ratings assigned by Rater 1 and those assigned by Rater 2 for the two semi-direct test forms and for the live interview.

TABLE V
Inter-rater Reliabilities

Test Form               Correlation
Interview (N = 16)      .97
IST Form A (N = 16)     .99
IST Form B (N = 16)     .96

These inter-rater reliabilities are all uniformly high across the two IST forms and the live interview. These correlations show that inter-rater reliability was not adversely affected by the SOPI format. This outcome suggests that the IST elicits a sample of speech as ratable on the combined ACTFL/FILR scale as the live interview.

On performance-based tests such as the IST, there is an increased concern for test-retest reliability. This form of reliability measures the degree of inconsistency in examinee performance on two separate administrations of the same test. The amount of inconsistency reflects the degree to which the test score may be confounded by such inconsistency. Therefore, it is important to examine this factor. However, on a test with a limited number of items such as the IST, it is not wise to administer the same test twice, since the first sitting can affect performance on the second sitting in a number of ways. (For a thorough discussion of this phenomenon, which has been referred to as the "reactivity effect," see 16: p. 174.) Under such circumstances, it is preferable to administer different forms of the test while still using the same rater to score the performance. This type of reliability is known as parallel form reliability, which is the degree of correlation between scores on two forms of the test. Parallel form reliabilities for the same subject taking two different test forms, with the same rater scoring both forms, are shown in Table VI.

TABLE VI
Parallel Form Reliabilities (Same Rater)

                               Rater 1    Rater 2
IST Forms A and B (N = 16)     .92        .95

The statistics indicate that the parallel form reliability of the IST is very high. With the first rater, the parallel form reliability was .92, while with Rater 2 it was even higher (.95). Such favorable statistics provide strong support for the claim that each form of the IST elicits a sample of speech that is uniformly challenging to the examinee. The fact that the parallel form reliability was high for two different raters supports the claim that the sample of speech elicited by different forms is equally ratable. In summary, the evidence from Table VI warrants the conclusion that natural variations in examinee oral language performance are adequately controlled for by the IST format.

Table VII shows parallel form reliabilities for subjects taking two different test forms, with each form scored by a different rater.

TABLE VII
Parallel Form Reliabilities (Different Forms and Raters)

Rater/Form Combination                             Correlation
Rater 1/Form A (N = 16) - Rater 2/Form B           .90
Rater 1/Form B (N = 16) - Rater 2/Form A           .91

This type of parallel form reliability involves error that can be attributed to natural variation in examinee speech, error that can be attributed to differences in test form, and error that can be attributed to differences in raters. Thus, it may be viewed as a lower-bound estimate of the reliability of an IST score. Again the reliabilities here are high, even under these severe conditions (different forms and different raters).
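For readers wishing to replicate these figures, each value in Tables V through VII is an ordinary Pearson product-moment correlation over the sixteen coded scores. The following minimal Python sketch is our own illustration, not part of the study's analysis; the function names, array layout, and example data are assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

# Inter-rater reliability (Table V): correlate Rater 1's and Rater 2's scores
# on the same test across the sixteen examinees.
# Parallel form reliability (Tables VI and VII): correlate scores on Form A
# with scores on Form B, for the same rater (Table VI) or different raters
# (Table VII). Example with made-up data on the coded 0.2-3.8 scale:
rater1_form_a = np.array([2.0, 2.8, 3.0, 1.5, 3.8, 2.8, 1.0, 2.0,
                          3.0, 2.8, 0.8, 3.8, 2.0, 1.8, 2.8, 3.0])
rater2_form_a = np.array([2.0, 2.8, 2.8, 1.5, 3.8, 2.8, 1.0, 1.8,
                          3.0, 2.8, 0.8, 3.8, 2.0, 2.0, 2.8, 3.0])
print(pearson_r(rater1_form_a, rater2_form_a))
```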

We will now look at the same data using an approach based on generalizability theory (G theory). (For an introduction to generalizability theory in the context of foreign language research, see 3.) In contrast to more traditional approaches, in which sources of error in measurement are addressed separately, G theory provides a framework whereby contributions of different sources of measurement error may be simultaneously estimated and examined. In the present case, sixteen examinees took two forms of the IST and each form was rated by two raters. Each examinee received four scores but, in a psychometric sense, possesses only one "true level of proficiency" for his or her performance on the SOPI. Inconsistencies between the two raters and/or inconsistencies in the two forms may contribute to error in the measurement of that proficiency.

In G theory terminology, the subjects are the objects of measurement; i.e., the goal of the measurement is to capture true differences between the subjects' differing levels of oral proficiency. The sources of measurement error are termed facets. In the present case there are two facets: raters and IST forms. Each of these facets has two levels (i.e., two raters and two IST forms). There are thus three components in our G study: subjects, raters, and forms. Measurement error may also be due to interactions between these components. For example, if a rater scores one form more leniently than another, there would be a rater-by-form interaction effect. In the traditional correlational approach used above, errors of measurement caused by using different raters and different forms are examined separately in two separate correlational studies. Only Table VII above, which presents correlations using different forms and raters, comes closest to presenting an estimate of reliability when both raters and forms are considered.

G theory decomposes the total amount of variance in the test scores into component parts in a manner somewhat analogous to the Analysis of Variance (ANOVA) technique, but here each person is in a unique cell rather than belonging to a group.
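As an illustration of the kind of estimation that the GENOVA program (7) carries out for this fully crossed design (the results appear in Table VIII below), a minimal sketch in Python follows. The array layout, function name, and use of expected mean squares are our own assumptions for exposition; they are not a description of GENOVA itself.

```python
import numpy as np

def variance_components_pxfxr(scores):
    """Estimate G-study variance components for a fully crossed
    persons x forms x raters design with one score per cell.

    scores: array of shape (n_persons, n_forms, n_raters)."""
    n_p, n_f, n_r = scores.shape
    grand = scores.mean()

    # Marginal means for each main effect and two-way combination.
    m_p = scores.mean(axis=(1, 2))
    m_f = scores.mean(axis=(0, 2))
    m_r = scores.mean(axis=(0, 1))
    m_pf = scores.mean(axis=2)
    m_pr = scores.mean(axis=1)
    m_fr = scores.mean(axis=0)

    # Sums of squares for main effects, two-way interactions, and residual.
    ss_p = n_f * n_r * ((m_p - grand) ** 2).sum()
    ss_f = n_p * n_r * ((m_f - grand) ** 2).sum()
    ss_r = n_p * n_f * ((m_r - grand) ** 2).sum()
    ss_pf = n_r * ((m_pf - m_p[:, None] - m_f[None, :] + grand) ** 2).sum()
    ss_pr = n_f * ((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2).sum()
    ss_fr = n_p * ((m_fr - m_f[:, None] - m_r[None, :] + grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - (ss_p + ss_f + ss_r
                                              + ss_pf + ss_pr + ss_fr)

    # Mean squares.
    ms_p = ss_p / (n_p - 1)
    ms_f = ss_f / (n_f - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pf = ss_pf / ((n_p - 1) * (n_f - 1))
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    ms_fr = ss_fr / ((n_f - 1) * (n_r - 1))
    ms_res = ss_res / ((n_p - 1) * (n_f - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for the variance components.
    return {
        "persons": (ms_p - ms_pf - ms_pr + ms_res) / (n_f * n_r),
        "forms": (ms_f - ms_pf - ms_fr + ms_res) / (n_p * n_r),
        "raters": (ms_r - ms_pr - ms_fr + ms_res) / (n_p * n_f),
        "persons x forms": (ms_pf - ms_res) / n_r,
        "persons x raters": (ms_pr - ms_res) / n_f,
        "forms x raters": (ms_fr - ms_res) / n_p,
        "residual (p x f x r, e)": ms_res,
    }
```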


This method breaks down the total variance into separate components depending on the design of the study, which determines which variances can be estimated. As in any model, there is some residual variance, i.e., variance unexplained by any component. In our case, every subject (the object being measured) was rated on every test (two tests) by every rater (two raters), so we have a complete generalizability design and the variance of all facets can be estimated. In all studies presented here, the GENOVA computer program (7) was used for estimating variance components.

Table VIII shows the amount of total variance contributed by the separate components and their interactions for the two SOPI forms and the two raters in the validation study. The estimated absolute amount of variance is given in the first column. The second column contains the standard error of the estimate. The final column shows what percent of the total variance each element has contributed. Although the amount of total variance will always vary across studies, looking at the percent of total variance contributed is a helpful way of making comparisons across studies.

Table VIII shows that subjects contribute the vast amount of variance. This is as it should be, since the true differences between subjects are what we desire to measure. Moving down the column, no variance is contributed by either forms or raters as a main effect. This lack of variance means that there were no consistent differences caused by these two sources. Nor is there any real variance from the subjects-by-raters or the forms-by-raters interactions, since the estimated variance is equal to or less than the standard error. The lack of variance in interactions indicates that raters scored consistently across individual subjects and across forms. There is, however, some variance due to the subjects-by-forms interaction. This variance is interpreted as an indication of the degree to which subjects performed differently across the different forms. There may be a slight three-way interaction, but it is impossible to determine its magnitude since residual error cannot be separated from it.

In summary, the G study results indicate that the IST raters were able to rate both forms and subjects consistently, but there is some indication that subjects may not have performed in an entirely consistent way across the two forms. This outcome is not surprising, as both forms were administered back-to-back in the context of this study. Subjects may have "warmed-up" on the first form they took and performed somewhat differently on the second.


TABLE VIII
Amount of Variance Contributed by Different Elements (IST SOPI A/SOPI B)

Effect                                        Variance    Standard Error    Percent Total Variance
Subjects                                      .8721       .3098             91.9%
Forms                                         -.0060      .0030             0.0%*
Raters                                        -.0022      .0029             0.0%*
Subjects × Forms                              .0489       .0205             5.2%
Subjects × Raters                             .0022       .0054             0.2%
Forms × Raters                                .0054       .0054             0.6%
Subjects × Forms × Raters, residual error     .0200       .0068             2.1%
TOTAL                                         .9486

*Negative variance due to sampling. May be regarded as 0 variance (4: pp. 47-48).

Or alternatively, they may have tried to do their best on the first form, but may have lost interest in responding to the second, parallel form. Such inconsistency is typical of the reactivity effect referred to earlier. In either case, the G study variances indicate that the raters rated consistently and in a similar way across both the subjects and the forms.

In addition to the above information, a G study also produces two reliability-like coefficients: a G coefficient and a Phi coefficient. Although the G coefficient is most commonly reported, it is really appropriate only for norm-based scoring, i.e., where the ranking of the examinees relative to one another is of interest. The Phi coefficient (which is normally less than and can never be greater than the G coefficient) is appropriate for criterion-referenced tests involving measurement along an absolute scale. Thus, the Phi coefficient is appropriate for this study. In mathematical terms, the Phi coefficient can be viewed as the amount of variance of the object of measurement divided by the total amount of variance (i.e., variance contributed by all components). G study coefficients are estimated under different testing conditions. In the present study, if an examinee took one form of the IST and was rated by a single rater, the Phi coefficient is .919; when two raters scored the test, it is .933. These very high Phi coefficients indicate that when measurement error due to both forms and raters is accounted for simultaneously, the IST remains a very reliable test.
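As a check on these figures, the Phi coefficient can be written out using the standard generalizability-theory expression for absolute decisions; the notation below is ours, and negative variance estimates from Table VIII are set to zero, as that table's footnote indicates.

\[
\Phi = \frac{\sigma^2_{p}}{\sigma^2_{p} + \frac{\sigma^2_{f}}{n'_f} + \frac{\sigma^2_{r}}{n'_r} + \frac{\sigma^2_{pf}}{n'_f} + \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{fr}}{n'_f n'_r} + \frac{\sigma^2_{pfr,e}}{n'_f n'_r}}
\]

With one form and one rater (\(n'_f = n'_r = 1\)), the Table VIII estimates give

\[
\Phi = \frac{.8721}{.8721 + .0489 + .0022 + .0054 + .0200} = \frac{.8721}{.9486} \approx .919,
\]

and with two raters (\(n'_r = 2\)),

\[
\Phi = \frac{.8721}{.8721 + .0489 + .0022/2 + .0054/2 + .0200/2} \approx .933.
\]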

COMPARISON WITH THE OPI

To examine the extent to which scores on the IST are valid surrogates for scores on the OPI,

it is necessary to examine the degree to which subjects are awarded the same score on both tests. Correlations of IST scores with those awarded on the live face-to-face interview are given in Table IX. Again, the correlations are all high. The average correlation based on sixty-four pairs of ratings (16 subjects x 2 IST forms x 2 ratings, correlated with the score assigned for the live interview) was .95. Such results support the claim that the IST is a valid measure of oral language proficiency that can be substituted for a live interview. The degree of agreement in absolute ratings given on the live interview with ratings given on the same examinee's IST may be seen from the cross-tab diagram. In Table X, all sixty-four pairs of interview ratings (down) with IST ratings (across) are presented. From Table X we see that in sixty-four percent of the cases there was absolute agreement between the SOPI and the OPI ratings. For all of the remaining ratings except one, the difference was only one step away on the rating scale. In one case, an examinee was awarded a 2.0 on an interview, but received a 1.5 by the same rater on one of the IST forms.

TABLE IX
Correlations with Live Interview

Rater/IST Form                                         Rater 1/Interview    Rater 2/Interview
Rater 1/Form A (N = 16)                                .96                  .90
Rater 1/Form B (N = 16)                                .95                  .96
Rater 2/Form A (N = 16)                                .93                  .96
Rater 2/Form B (N = 16)                                .93                  .95
All Matched Interviews/Forms (64 pairs of ratings)     .94


TABLE X
Crosstabulations of Interview Ratings by IST Ratings

Thus, for ninety-eight percent of the ratings, the rating on the live interview and the rating on the IST were equal or differed by no more than one step on the rating scale. Thus, besides the high correlations documented, the absolute values given to examinees on both the live interview and the IST were extremely close.

Table X also shows, however, that when there was a disagreement between the rating on the taped test and the rating on the interview, in eighty-three percent of the cases the score on the live interview was higher than the score on the taped test. There are several possible explanations for this. One is that the examinees actually performed differently on the two tests but that raters did rate them both reliably. Another is the unfamiliarity of the raters with the taped test. When in doubt about a rating on the IST, the raters may have erred on the side of being conservative, while they knew better what to look for in the live interview. Another explanation may lie in the fact that the vast majority of the examinees had a high level of proficiency. Eighty-one percent were at the Advanced level or above, and thirty-eight percent were at the Superior level or above. Indeed, twenty-five percent of the examinees (or four of the sixteen) were rated at the High Superior level by at least one rater on one test. While the OPI (as given by the Foreign Service Institute) can accommodate all the higher levels up to level 5, the taped tests were designed for the range beginning at Intermediate and going up to High Superior (3.8) as a ceiling. Thus, a quarter of the sample population in this study might be considered to be at or beyond the range of the taped test.


These high-level examinees may not at times have had the opportunity to fully show what they could do on the taped test. Further evidence for this interpretation is found in the fact that, on an examinee evaluation of the test, six of the sixteen examinees felt that the tape's pause times for giving an answer were in general too short, while none of the examinees felt that the pause times were too long. In other words, examinees may have felt that the IST did not give them adequate opportunity to complete the speaking tasks in a way that allowed them to fully demonstrate their level of ability.

The validity of the SOPI as a surrogate for the OPI may also be examined through the use of G theory. In this case, we can compare examinee performances using raters and test method (OPI and SOPI) as sources of error, and see if there is any indication of differential performance. To more accurately determine effects, the OPI and the two IST forms (A and B) were examined separately. Results of both are presented in Table XI. Although the magnitudes in terms of the percentage of total variance differ slightly, there is a high degree of consistency in the variances found in comparing the OPI with IST Form A and with IST Form B.

Table XI shows many similarities to Table VIII. Again subjects contribute the vast amount of variance, and neither raters nor test methods contribute to any consistent differences in scores (note that the standard error of the estimate for method variance in both cases is greater than the actual estimated variance). This outcome indicates that raters scored consistently with respect to each other, and scores on the two test methods were consistent with respect to each other.


TABLE XI
Results of G Studies on the OPI and Each of the Two SOPI Forms to Examine Variance Due to Raters and Test Methods

IST FORM A

Effect                                          Variance    Standard Error    Percent
Subjects                                        .8174       .2879             93.5%
Methods                                         .0088       .0092             1.0%
Raters                                          -.0006      .0003             0.0%*
Subjects × Methods                              .0337       .0134             3.9%
Subjects × Raters                               .0046       .0037             0.5%
Methods × Raters                                .0000       .0005             0.0%
Subjects × Methods × Raters, residual error     .0100       .0034             1.1%
TOTAL                                           .8745

IST FORM B

Effect                                          Variance    Standard Error    Percent
Subjects                                        .8690       .3100             91.4%
Methods                                         .0036       .0066             0.4%
Raters                                          .0000       .0027             0.0%
Subjects × Methods                              .0424       .0169             4.5%
Subjects × Raters                               .0200       .0094             2.1%
Methods × Raters                                .0024       .0026             0.3%
Subjects × Methods × Raters, residual error     .0130       .0044             1.4%
TOTAL                                           .9504

*Negative variance due to sampling. May be regarded as 0 variance (4: pp. 47-48).

As for interaction effects, Table XI (as did Table VIII) indicates that the largest amount of variance was due to the subject-by-method interaction (subject-by-form in Table VIII). This indicates that there was some tendency for certain examinees to perform differently on the OPI and the SOPI. There is also a little subject-by-rater interaction (in both cases greater than the standard error), indicating that raters scored certain individual examinees differently. Again, as in Table VIII, there is a minor three-way interaction.

In summary, Table XI shows that raters scored consistently with respect to themselves and the two methods. More prevalent was the tendency for certain examinees to score differently on the two tests. It must be remembered that in this research design, examinees received three rather thorough and lengthy tests. It would not be surprising to see evidence of fatigue and disinterest come into play. In particular, we should remember that in all cases the OPI was given first. Examinees may have been fresher or more motivated to perform their best on this test.

The reliability-like coefficients obtained using G studies may be considered similar to the correlational coefficients that were presented in Table IX, though the Phi coefficient examines the effects of both different raters and different forms simultaneously.


In this study, using either the OPI or IST Form A, the Phi coefficient when the test was scored by one rater is .935; when using either the OPI or IST Form B, it is .914. These very high coefficients indicate that differences between the two methods are slight. Overall, unwanted measurement error from both raters and method components is negligible.

One way to examine whether the IST is an appropriate surrogate for the OPI is to study whether differences between the IST and the OPI are similar to differences between two parallel forms of the IST. If they are similar, then we would have evidence to consider the OPI and the IST as parallel forms; if they are different, this evidence would not be there. Table XII contrasts the results of the G studies on the two forms of the IST with those of the OPI and each IST form, in terms of the percentage of variance due to each component in the studies.

A comparison down each column shows remarkably consistent results across the three studies. The percentage of variance due to subjects is uniformly high. In no study were there consistent differences due to raters.

TABLE XII
G Study Results in Terms of Percentage of Total Variance

Comparison                  S       T       R       S × T    S × R    T × R    S × T × R    PHI
IST Form A/IST Form B       91.9    0.0*    0.0*    5.2      0.2      0.6      2.1          .919
OPI/IST Form A              93.5    1.0     0.0*    3.9      0.5      0.0      1.1          .933
OPI/IST Form B              91.4    0.4     0.0     4.5      2.1      0.3      1.4          .914

Key: S = Subjects; T = Tests; R = Raters
*Negative variance due to sampling. May be regarded as 0 variance (4: pp. 47-48).

The largest amount of variance in each case was due to the subject-by-test interaction. However, we see that this variance is higher in the comparison of two forms of the IST than in the comparisons between the OPI and the IST. Thus it would be safe to conclude that some subjects involved in this research actually performed differently on all three of the speaking tests they took. Had this effect occurred only in the OPI and the IST comparisons, there would have been evidence that the two different testing methods caused this effect. However, due to its occurrence between parallel forms of the same testing method, the most reasonable interpretation is that this effect is unique to the individual subjects and not to any of the tests. In any case, the effect is very small.

Besides the subject-by-test interaction, the only other non-negligible variance occurs in the subject-by-rater column, where there seems to be some interaction in the OPI/IST B comparison that is not present in the other test comparisons. This interaction effect was most likely due to the one anomalous case mentioned above where an examinee was awarded a 2.0 on an interview, but received a 1.5 (a difference of two steps on the scale) by the same rater on IST Form B.

Across the board, the Phi coefficients are very high, indicating that in this research study any of the tests given was very reliable even when scored by only one rater. Since the OPI and the two forms of the IST seem to be functioning similarly, it is valid to submit all three to one G study, viewing the OPI and the SOPIs as three levels of one facet, i.e., a facet called "tests" with three levels: OPI, IST Form A, and IST Form B. The results of this analysis, presented in Table XIII, may be viewed as a summary of all the above discussion.

These figures show that only for the subject-by-test interaction effect was the contribution of any source of variance from any main effect or two-way interaction effect greater than one percent. The percentage contribution of that factor was 4.5%, indicating that there was a slight tendency for examinees to perform inconsistently on the three tests. This factor is not due to inconsistencies between the raters; there were no real consistent differences in raters or in raters' interaction with the three tests.

TABLE XIII
G Study Using OPI and IST as Three Levels of Tests

Effect                                        Variance    Standard Error    Percent
Subjects                                      .8528       .2997             92.4%
Tests                                         .0021       .0048             0.2%
Raters                                        -.0009      .0011             0.0%*
Subjects × Tests                              .0416       .0123             4.5%
Subjects × Raters                             .0090       .0049             1.0%
Tests × Raters                                .0026       .0025             0.3%
Subjects × Tests × Raters, residual error     .0143       .0036             1.6%
TOTAL                                         .9224

*Negative variance due to sampling. May be regarded as 0 variance (4: pp. 47-48).


Raters consistently applied the same standards across the tests. The one percent of total variance contributed by the subject-by-rater interaction is really negligible, indicating that raters judged consistently across the individuals in the study. It appears that some subjects truly performed differently on the three tests, whether they were the OPI or the SOPIs. Again, in a research design such as this one, this outcome is not surprising, since three rather lengthy tests were given, and a variety of differing factors could affect examinee performance across the tests (e.g., interest, fatigue, motivation). However, these results indicate that the SOPIs can be rated reliably and consistently, and that, given the same actual performance, examinees would get the same score on any of the three tests. Thus, there is a great deal of evidence that the IST may be confidently given as a surrogate measure in place of the OPI.

NOTES

¹The FILR consultants were Jijis Chadran of the Foreign Service Institute, and Kadir Noor of the US Government Language School.

²These were James T. Collins, University of Hawaii at Manoa; Ellen Rafferty, University of Wisconsin, Madison; and John Wolff, Cornell University.

³Jijis Chadran of the Foreign Service Institute served as the interviewer and as one of the raters; Andang Poeraatmadja of the Foreign Service Institute served as the other rater.

BIBLIOGRAPHY

1. American Council on the Teaching of Foreign Languages. ACTFL Provisional Proficiency Guidelines. Hastings-on-Hudson, NY: ACTFL, 1982.
2. ———. ACTFL Proficiency Guidelines. Hastings-on-Hudson, NY: ACTFL, 1986.
3. Bolus, Roger E., Frances B. Hinofotis & Kathleen M. Bailey. "An Introduction to Generalizability Theory in Second Language Research." Language Learning 32 (1982): 245-58.
4. Brennan, Robert L. Elements of Generalizability Theory. Iowa City, IA: American College Testing Program, 1983.
5. Clark, John L. D. & Ying-chi Li. Development, Validation, and Dissemination of a Proficiency-based Test of Speaking Ability in Chinese and an Associated Assessment Model for Other Less Commonly Taught Languages. Washington: CAL, 1986 [ERIC Document ED 278 264].
6. ——— & Spencer Swinton. The Test of Spoken English as a Measure of Communicative Ability in English-medium Instructional Settings. TOEFL Research Report 7. Princeton, NJ: Educational Testing Service, 1980.
7. Crick, Joe E. & Robert L. Brennan. Manual for GENOVA: A Generalized Analysis of Variance System. Iowa City: American College Testing Program, 1983.
8. Galloway, Vicki, Charles W. Stansfield & Lynn E. Thompson. "Topical Bibliography of Proficiency-Related Issues." ACTFL Proficiency Guidelines for the Less Commonly Taught Languages. Ed. Charles W. Stansfield & Chip Harmon. Washington: CAL & ACTFL, 1987 [ERIC Document ED 289 345].
9. Lowe, Pardee, Jr. "The Unassimilated History." Second Language Proficiency Assessment: Current Issues. Ed. Pardee Lowe, Jr. & Charles W. Stansfield. Englewood Cliffs: Prentice Hall-Regents, 1988: 11-51.
10. ——— & Ray T. Clifford. "Developing an Indirect Measure of Overall Oral Proficiency." Measuring Spoken Language Proficiency. Ed. James R. Frith. Washington: Georgetown Univ. Press, 1980: 31-39.
11. Shohamy, Elana, Clair Gordon, Dorry M. Kenyon & Charles W. Stansfield. "The Development and Validation of a Semi-direct Test for Assessing Oral Proficiency in Hebrew." Bulletin of Hebrew Higher Education 4 (1989): 4-9.
12. ——— & Charles W. Stansfield. "The Hebrew Speaking Test: An Example of International Cooperation in Test Development and Cooperation." AILA Review 7 (1990): 79-90.
13. Stansfield, Charles W. Simulated Oral Proficiency Interviews. ERIC Digest. Washington: ERIC Clearinghouse on Languages & Linguistics, 1989.
14. ———. "A Comparative Analysis of Simulated and Direct Oral Proficiency Interviews." Current Developments in Language Testing. Ed. Sarinee Anivan. Singapore: Regional Language Center, 1991: 199-209.
15. ——— & Dorry Mann Kenyon. Development of the Portuguese Speaking Test. Final Report to the US Department of Education. Washington: CAL, 1988 [ERIC Document ED 296 586].
16. ——— & Jacqueline Ross. "A Long-term Research Agenda for the Test of Written English." Language Testing 5 (1988): 160-86.
17. ——— & Dorry Mann Kenyon. Development of the Hausa, Hebrew, and Indonesian Speaking Tests. Final Report to the US Department of Education. Washington: CAL, 1989 [ERIC Document ED 329 100].
18. ——— & Lynn E. Thompson. "Topical Bibliography of Proficiency-Related Publications: 1987-1988." The ACTFL Oral Proficiency Interviewer Tester Training Manual. Ed. Katherine Buck. Yonkers, NY: ACTFL, 1989.