Native- and nonnative-speaking EFL teachers’ evaluation of Chinese students’ English writing

Ling Shi
University of British Columbia

This study examined differences between native and nonnative EFL (English as a Foreign Language) teachers’ ratings of the English writing of Chinese university students. I explored whether two groups of teachers – expatriates who typically speak English as their first language and ethnic Chinese with proficiency in English – gave similar scores to the same writing task and used the same criteria in their judgements. Forty-six teachers – 23 Chinese and 23 English-background – rated 10 expository essays using a 10-point scale, then wrote and ranked three reasons for their ratings. I coded their reported reasons as positive or negative criteria under five major categories: general, content, organization, language and length. MANOVA showed no significant differences between the two groups in their scores for the 10 essays. Chi-square tests, however, showed that the English-background teachers attended more positively in their criteria to the content and language, whereas the Chinese teachers attended more negatively to the organization and length of the essays. The Chinese teachers were also more concerned with content and organization in their first criteria, whereas the English-background teachers focused more on language in their third criteria. The results raise questions about the validity of holistic ratings as well as the underlying differences between native and nonnative EFL teachers in their instructional goals for second language (L2) writing.

I Introduction

The fairness and construct validity of the ratings of different groups of raters using the same scale in assessing ESL/EFL (English as a Second/Foreign Language) writing is a major concern in second language (L2) writing instruction and testing (Hamp-Lyons, 1991a; Connor-Linton, 1995a; Silva, 1997). Exploring whether L2 writers’ texts meet the expectations of native English readers, a growing number of studies have investigated whether native English speakers (NES) and nonnative English speakers (NNS) share the same judgement of students’ writing (James, 1977; Hughes and Lascaratou, 1982; Machi, 1988; Santos, 1988; Kobayashi, 1992; Hinkel, 1994; Connor-Linton, 1995b; Kobayashi and Rinnert, 1996; Zhang, 1999; Hamp-Lyons and Zhang, 2001). Most studies on the evaluation of L2 writing have focused either on heterogeneous ESL students in British (Hamp-Lyons, 1989) and North American universities (Santos, 1988; Cumming, 1990; Brown, 1991; Vaughan, 1991; Hinkel, 1994) or on a homogeneous EFL group, such as Arab university students (Khalil, 1985), Greek high school students (Hughes and Lascaratou, 1982), Israeli high school students (Shohamy et al., 1992), Chinese university students (Zhang, 1999; Hamp-Lyons and Zhang, 2001) or Japanese adult students (Machi, 1988; Kobayashi, 1992; Connor-Linton, 1995b; Kobayashi and Rinnert, 1996). Following this tradition, the present study investigates differences between NES and NNS EFL teachers in their judgements of the English writing of Chinese university English majors. It aims not only to verify previous findings but also to explore the issue of cooperation between NNS and NES EFL teachers in countries like China, where writing programs are commonly taught jointly by both groups of teachers.

Previous research offers little evidence on whether NES and NNS raters/teachers give similar holistic ratings and qualitative evaluations to the same ESL/EFL essays. In a study comparing the evaluative criteria of 26 American ESL and 29 Japanese EFL instructors rating 10 compositions written by adult L1 Japanese EFL students, Connor-Linton (1995b) asked half of the raters from each group to rate the compositions holistically and the other half to rate them with an analytical evaluation profile, and then to state three reasons for their ratings. The results suggested similarities in ratings between the two groups, although the NES teachers tended to focus on intersentential discourse features while the NNS teachers focused on accuracy in their qualitative judgements (Connor-Linton, 1995b).

Address for correspondence: Ling Shi, Department of Language and Literacy Education, 2034 Lower Mall Road, UBC, Vancouver, BC, Canada V6T 1Z2; email: Ling.Shi@ubc.ca

Language Testing 2001 18 (3) 303–325

0265-5322(01)LT206OA © 2001 Arnold
Unlike Connor-Linton (1995b), other researchers have used analytical scoring to investigate the extent to which NES and NNS teachers/raters attend to the same features and value the same qualities in student writing. These studies found that NES teachers favoured American English rhetorical patterns (Kobayashi and Rinnert, 1996) and clarity of meaning and organization (Kobayashi, 1992). NES raters also differed systematically from NNS raters in their judgements of students’ paragraph structuring and political/social stance (Hamp-Lyons and Zhang, 2001), of ‘purpose and audience, specificity, clarity and adequate support’ (Hinkel, 1994), and of ‘overall organization, supporting evidence, use of conjunctions, register, objectivity and persuasiveness’ (Zhang, 1999). Furthermore, differences were found in the raters’ judgements of error gravity. In general, NES teachers were found to be more tolerant of students’

errors than NNS teachers (James, 1977; Hughes and Lascaratou, 1982; Santos, 1988). Kobayashi (1992), however, observed that NNS instructors would accept grammatically correct but awkward sentences that NESs would not. These findings were all based on pre-determined categorical evaluations, which might have restricted or dictated the teachers’/raters’ judgements. Some studies, for example, used decontextualized or edited student writing to direct the raters’ attention (Santos, 1988; Hinkel, 1994; Kobayashi and Rinnert, 1996). Research is therefore needed, using authentic writing samples and no pre-determined evaluation criteria, to verify whether NES and NNS teachers score L2 essays for different reasons. In addition, as the studies indicating differences in teachers’/raters’ qualitative evaluations did not compare these analytic judgements with holistic scoring, I thought it worth verifying the findings using both qualitative and quantitative judgements.

Set in the context of Mainland China, the present study parallels Connor-Linton’s (1995b) study in comparing holistic scores and self-reported reasons from teachers with different ethnolinguistic backgrounds. The study had a two-fold purpose: to verify whether NES and NNS teacher raters differ in (1) their holistic ratings and (2) their analytical reasons or qualitative judgements in evaluating EFL students’ writing.

II Method

1 Written samples

Ten written samples were selected randomly in the fall of 1998 from the writing of third-year students in the English department of a university in Mainland China. From a pool of 86 in-class written assignments gathered in no particular order, every eighth essay was selected for the study. Students had been asked to write, within a 50-minute session (a common length of a lesson period in Chinese universities), a 250-word expository essay on the topic of TV and newspapers.
The 250-word length was based on the suggestion of the three class teachers who administered the task, in an attempt to match the present task with other existing writing tasks in the program. Most students, however, produced much longer essays; the 10 essays selected averaged 292 words. (For Essay 1 as an example of an essay in the middle range of scores, see Appendix 1.) Apart from length, the writing prompt was also a result of negotiation with the class teachers to incorporate the task into their teaching routines. Here is the prompt that was used in this study:

Nowadays with the popularity of televisions people gain daily news more conveniently. Some people even begin to play down the advantage of newspapers,

arguing that it is time that they were replaced by television. To what extent do you agree or disagree with this statement? Give support for your argument.

2 Teacher raters

The 10 written samples were first sent, together with a letter of invitation to participate in the study and a background questionnaire, to 70 English expatriates teaching in tertiary institutions in all parts of China. All of them had by then been teaching in China for a minimum of six months and a maximum of about a year. The name list was provided by the Amity Foundation, a Christian organization that helps Chinese universities recruit native English teachers. Twenty-four of the NES EFL teachers responded to the letter and sent back the completed questionnaires and their quantitative and qualitative evaluations of the essays. I then invited 24 NNS EFL teachers, who were either working or taking an in-service teacher training program in the university where the data were collected, to evaluate the same set of essays. In each group, one rater missed ratings for several of the essays, which left an equal number of 23 raters in each group for the study. The information summarized from the questionnaire suggested that the participating raters were mostly experienced teachers with some professional training. At the time of data collection, the 46 raters, as they reported in the questionnaires, were teaching at 23 tertiary institutes in 12 cities of China. Of the 23 NES teachers, 14 were US citizens, 5 British, 2 Canadian, and 2 Norwegian. (The two Norwegians are considered NESs in this study because they were expatriates who had learnt and used English since elementary school; with the emphasis on multilingualism in the Norwegian school system and English being the second most important language, many educated Norwegians are bilingual.) Summarizing the rater profiles, Table 1 shows that more than half of the raters (frequency of 31, 67%) were female and 15 (33%) were male. In terms of educational background, about half (frequency of 23, 50%) had a master’s degree and 17 (37%) had an undergraduate degree.
Most raters (frequency of 34, 74%) said they had had teacher training, whereas 11 (8 NNSs and 3 NESs) reported no such training. Of the 46 raters, 18 (39.1%) had taught English for 1 to 5 years, 11 (23.9%) for 6 to 10 years, and 17 (37%) for over 10 years. In general, the NES group appeared to be more educated than the NNS group (4 NES teachers had a PhD, whereas 1 NNS had only a high school diploma), although the latter seemed to have more teaching experience than the former (12 NNSs had taught for over 10 years, compared with only 5 NESs).

Table 1 Rater profiles

Variables           Categories       English  Chinese  Total  Percentage  Missing
Gender              Male                   6        9     15        32.6
                    Female                17       14     31        67.4
                    Total                 23       23     46       100.0
Degrees             High School            –        1      1         2.2
                    Graduate               7       10     17        37.0
                    Master                12       11     23        50.0
                    PhD                    4        –      4         8.7
                    Total                 23       22     45        97.9        1
Teacher training    Yes                   20       14     34        73.9
                    No                     3        8     11        23.9
                    Total                 23       22     45        97.8        1
Years of teaching   1–5 years             10        8     18        39.1
English             6–10 years             8        3     11        23.9
                    Over 10 years          5       12     17        37.0
                    Total                 23       23     46       100.0

Note: ‘Missing’ = number of missing cases.
3 Procedures

I asked the participating teachers to rate the 10 compositions holistically using a 10-point scale. The teachers were told that the purpose of the research was to compare how teachers evaluated the English writing of Chinese university students. Following previous research (Cumming, 1990; Connor-Linton, 1995b; Kobayashi and Rinnert, 1996), no criteria or analytical categories for the rating scale were provided, the purpose being to find out how these teachers defined the criteria themselves. In addition, to find out whether these teachers made different qualitative judgements, the teachers were asked to state three reasons or comments, in order of importance, for their ratings of each essay. (For the evaluation instructions, see Appendix 2.) With no pre-determined criteria, the teachers were expected to make and weight qualitative judgements based on their own beliefs about how salient certain writing elements were in L2 writing evaluation. Finally, the raters completed a questionnaire about their teaching experiences and educational backgrounds. Although similar in its basic design to Connor-Linton’s (1995b) study in asking two groups of teachers to rate 10 essays and state three reasons (for justifications of these methodological choices, see Connor-Linton, 1995b: 102), the present study, apart from subject identities and context, differed from the previous study in that all the teachers rated the essays holistically on a 10-point scale, rather than half using the 4-point scale and half the analytical evaluation profile that Connor-Linton (1995b) used. In addition, the teachers in the present study were asked to rank order their reasons rather than listing them randomly.

A coding scheme was developed for analysing the teachers’ comments. Previous researchers have used different methods to categorize verbal protocols or self-reports of raters (Hamp-Lyons, 1989; 1991b; Cumming, 1990; Vaughan, 1991; Connor-Linton, 1995b; Zhang, 1999). The present study used key-word analysis, based on an initial observation that the comments typically contained short phrases including an adjective and a content word, such as ‘good idea’ or ‘poor argument’, indicating the rater’s positive or negative attitude to a chosen feature. First, one researcher, a doctoral student in the area of L2 writing, went through all the data and identified key words, or words that had been used repeatedly in the teachers’ comments, such as ‘thesis’, ‘argument’, ‘ideas’, ‘content’, ‘logic’, ‘organization’, ‘grammar’, ‘vocabulary’, ‘paragraph’, ‘clarity’ and ‘support’, as well as words indicating whether the comments were positive or negative, such as ‘good’, ‘poor’, ‘clear’, ‘unclear’, ‘balanced’, ‘unbalanced’, etc. Based on the key words, five major categories were generated in terms of general comments on the essay as a whole and specific comments addressing the content, organization, language or length of the essays.
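The logic of such a key-word analysis can be sketched as a small Python routine. This is a hypothetical illustration only: the keyword and polarity lists below are invented stand-ins, and the function `code_comment` is not part of the study (the actual category definitions appear in Appendix 3):

```python
# A toy version of the key-word coding step. The keyword and polarity
# lists are invented for illustration, not the study's coding scheme.
CATEGORY_KEYWORDS = {
    "content": ["idea", "argument", "thesis", "content", "logic"],
    "organization": ["organization", "paragraph", "transition", "coherence"],
    "language": ["grammar", "vocabulary", "sentence", "clarity"],
    "length": ["length", "short", "long"],
}
POSITIVE_WORDS = {"good", "clear", "balanced", "strong"}
NEGATIVE_WORDS = {"poor", "unclear", "unbalanced", "weak"}

def code_comment(comment):
    """Assign a rater's short comment to a category and a polarity."""
    words = comment.lower().replace(",", " ").split()
    # First matching category wins; anything unmatched is 'general'.
    category = next(
        (cat for cat, keys in CATEGORY_KEYWORDS.items()
         if any(key in word for word in words for key in keys)),
        "general",
    )
    if any(word in POSITIVE_WORDS for word in words):
        polarity = "positive"
    elif any(word in NEGATIVE_WORDS for word in words):
        polarity = "negative"
    else:
        polarity = "neutral"
    return category, polarity
```

In the study itself the coding was done by hand and checked for inter-coder agreement; an automatic coder like this would only approximate that process.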
Then the comments on content, organization and language were further divided into 10 subcategories:

• comments on content were further distinguished as to whether they were (1) general comments or specific comments on the quality of (2) ideas or (3) arguments;
• comments on organization as to whether they were (4) general comments or specific comments focusing on (5) paragraph organization or (6) transitions, including coherence or cohesion;
• comments on language as to whether they were (7) general comments or specific comments on (8) intelligibility, (9) accuracy or (10) fluency of language use.

Before coding the data, the doctoral student and I each independently coded the comments from 10 teachers, 5 NES and 5 NNS, and reached 95% agreement. In all, a total of 1299 reasons (a few teachers gave fewer than three reasons for some essays) were identified as positive or negative comments under the 12 categories. (For definitions and examples of each category, see Appendix 3.)

Three types of statistical analysis were used to identify group differences in the raters’ evaluation of the 10 essays. First, reliability tests were run to check the intra-group reliability of the NES and NNS teachers. Then MANOVA assessed differences between the two

groups in their holistic scoring of the 10 essays. Finally, chi-square values on the frequencies of the comments were computed to explore the differences between the two groups, first in terms of whether the 12 categorical comments were positive or negative, and then in terms of the ordering of the three comments for each essay.

III Findings and discussion

1 Reliability

Consistency of the ratings of each group – NES and NNS teachers – was estimated by Cronbach’s coefficient alpha, which reflects the level of agreement of each group as a whole across the package of 10 essays. The ratings of the NES group showed greater reliability (coefficient α = .88) than those of the NNS group (coefficient α = .71), indicating that the NES group was more consistent than the NNS group in assessing the 10 essays. The level of reliability achieved by the NNS teachers would be more or less predictable in the given context. A similar level of reliability for both NES and NNS raters (.75) was reported by Connor-Linton (1995b). However, the much higher reliability of the NES teachers in the present study was unusual, since such high reliability is usually achieved only through training. Nevertheless, Shohamy et al. (1992) also observed high reliability coefficients in the range of .80 to .90 from untrained NES raters with the aid of indicators for a rating scale.

2 Comparison of scores

Table 2 summarizes and compares the means and standard deviations of the two groups of teachers’ ratings of the 10 students’ essays. MANOVA indicated no significant differences between the two groups on the ratings (F = 1.940, p > 0.05). Similarities between the two groups were also shown in the positive rank orderings of the average scores of the 10 essays (Kendall correlation coefficient .73, 2-tailed, p < 0.01).
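For readers who want to see how these two statistics work, both can be computed with a short, self-contained Python sketch. The implementations and the ratings matrix below are my own minimal illustrations, not the study’s data or analysis code:

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's coefficient alpha for one group of raters: 'ratings' is
    a list of rows, one per rater, each holding that rater's scores for
    the same sequence of essays (the items)."""
    k = len(ratings[0])                                  # number of essays
    item_vars = [variance(col) for col in zip(*ratings)]
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists
    (simple version, no correction for ties)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented example: four perfectly parallel raters scoring five essays,
# which yields an alpha of exactly 1.0; real rater groups fall below this.
ratings = [[5, 6, 7, 8, 9],
           [6, 7, 8, 9, 10],
           [5, 6, 7, 8, 9],
           [4, 5, 6, 7, 8]]
alpha = cronbach_alpha(ratings)
```

Alpha here treats the essays as items and the raters as observations; it can even fall below zero for raters who systematically disagree, so the reported .88 and .71 indicate high but imperfect within-group agreement.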
Table 2 Comparisons of NES and NNS teachers’ holistic scores on the 10 essays

              NES (n = 23)        NNS (n = 23)        Ranking of essays
Essays        M        sd         M        sd         NES      NNS
1             6.69     2.27       6.83     1.40       6        7
2             6.29     1.20       7.14     1.14       7        5
3             6.96     1.52       7.35     1.17       5        3
4             5.14     1.77       6.07     1.54       10       9
5             7.50     1.20       7.65     1.15       2        2
6             5.47     1.86       5.35     0.92       8        10
7             7.21     1.99       7.31     1.78       3        4
8             7.06     1.51       6.91     1.01       4        6
9             8.14     1.25       7.87     1.04       1        1
10            5.64     2.20       6.59     1.47       9        8
Group M       5.98     1.68       6.19     1.26

The mean scores of the two groups indicate that the 10 essays ranged from a low of 5 to a high of 8 on the 10-point scale, with slightly lower scores from the NES teachers (Group M of 5.98 vs. 6.19). Both groups of teachers agreed that Essay 9 was the best and Essay 5 the second best. The rest of the essays showed a difference of one to two ranks between the NES and NNS teachers. It is also interesting to note a smaller group mean of standard deviations for the NNS group compared with that for the NES group

(Group M of SD 1.26 vs. 1.68). This indicates that the scores given by the NNS teachers were more homogeneous, or less spread out, than those given by the NES teachers, although individual members within the NNS group were less consistent, as suggested by the lower group reliability reported earlier. In contrast, the NES group, as the present data suggest, while maintaining a high reliability, also used a wider range of average scores than the NNSs (from 5.14 to 8.14 for NESs compared with 5.35 to 7.87 for NNSs). This tendency towards greater variation among NES raters has also been documented by previous researchers (e.g., Hamp-Lyons, 1989; Vaughan, 1991; Zhang, 1999). Together with previous studies, the present study suggests that NES raters were perhaps either more willing to take risks in awarding scores at the endpoints of the scale or better able to make finer distinctions among levels of ability in rating ESL/EFL essays than their NNS colleagues.

3 Comparison of qualitative judgements: raters’ positive and negative comments

Figure 1 compares the raters’ self-reported positive and negative comments, grouped into 12 categories.

[Figure 1 NES and NNS teachers’ positive and negative comments: frequencies of NES and NNS positive and negative comments across the 12 evaluation categories. *(+/−) indicates a chi-square value showing a significant difference at the .05 level (** at the .01 level) between the NES and NNS teachers in their positive/negative comments.]

The overall statistically significant chi-square values for the frequencies of the 12 types of positive and negative comments indicate that the NES and NNS teachers were systematically influenced by different qualitative judgements in their ratings (chi-square = 54.03, 51.01; d.f. = 11; p < 0.001). In line with previous research (Brown, 1991; Connor-Linton, 1995b), the present study implies that these two groups of teacher raters arrived at their holistic ratings based on somewhat different criteria or qualitative judgements.

A preliminary perspective on the trends in Figure 1 is that content, particularly the argument of the essay, was identified more often as a positive feature by both NES and NNS raters, whereas language, particularly its intelligibility, shows the reverse pattern, with both groups identifying it more as a negative feature in these students’ writing. This suggests a biased attitude among these EFL teachers, the NESs even more so, as they gave significantly more positive comments on general content quality than the NNSs (chi-square = 14.44, d.f. = 1, p < 0.01). Apart from an unbalanced perception of the writing performance of these L2 students, this tendency implies a distinction, implicitly made by these teacher raters, between the writing skill of content presentation and language proficiency in L2 writing. According to Cumming (1990), both experienced and inexperienced ESL teachers in his study distinguished students’ writing expertise from their L2 proficiency while evaluating students’ essays. The present finding suggests that the participating teacher raters seemed to make this distinction implicitly in their ratings.

Despite the general tendency of both NES and NNS teachers in the present study to be positive about content but negative about the language qualities of students’ writing, the NES teachers were generally more positive than the NNS teachers in their qualitative evaluations (365 vs. 280 positive comments). While the NES teachers tended to give lower marks than the NNS teachers (Group M of 5.98 vs. 6.19), they nevertheless gave more positive qualitative reasons or comments than did the NNS teachers. Kobayashi (1992) also reported that NES

raters were more positive about certain qualities of students’ writing. One possible reason for NES raters being stricter scorers yet giving more positive comments was perhaps best explained by an NES rater in Zhang’s (1999: 250) study, who reported that she used different standards in assessing EFL students’ essays while playing the double role of a strict native speaker and a lenient EFL teacher. The mismatch between the NESs’ positive comments and lower scores suggests not only a gap between teaching criteria and testing standards, but also a question of whether NES teachers should be advised or trained to play consistently the role of a stricter rater, or whether NESs other than NES teachers should be used for testing purposes.

The chi-square values indicate divergent tendencies between the two groups of teachers in the frequencies of their positive and negative comments across the 12 evaluation categories. As Figure 1 shows, apart from general content quality, the NESs gave significantly more positive comments on the intelligibility of the language (chi-square = 15.81, d.f. = 1, p < 0.01). In addition, they also gave more comments, both positive and negative, on language accuracy than did the NNS teachers (chi-square = 13.50, 11.77; d.f. = 1; p < 0.01). Previous researchers (such as Hughes and Lascaratou, 1982; Kobayashi, 1992; Connor-Linton, 1995b) have also reported that NNS raters/teachers appeared less attentive than NES raters/teachers to the language quality of students’ essays. In line with previous findings, the present study confirms Kobayashi’s (1992) observation of a lack of confidence among NNS raters. Put differently, being nonnative speakers, the Chinese teachers might shy away from making qualitative judgements or comments about the English language of their students.
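Each of these comparisons is a Pearson chi-square test on a small contingency table of comment counts. A generic version can be sketched as follows; the counts are invented for illustration, not taken from the study’s data:

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Invented 2 x 2 example: positive vs. negative comments on one
# category from two groups of raters (counts are made up).
stat, df = chi_square([[40, 25],
                       [15, 30]])   # stat is about 8.46 with df = 1
```

For a 2 × 2 table, d.f. = 1, and the statistic is compared against the chi-square critical value (3.84 at the .05 level), so the invented counts above would indicate a significant group difference.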
It is interesting to note that these differences in the qualitative comments were not reflected in the holistic ratings, which showed no significant group differences on the 10 essays. As Connor-Linton (1995a: 763) put it, holistic rating scales ‘risk forcing potentially multidimensional rater responses into a single dimension of variation’. The discrepancies between the NES and NNS teachers found in the present study confirm this disadvantage of holistic rating: its inability to distinguish accurately among various characteristics of students’ writing. In contrast with the higher frequencies of positive comments from the NES teachers, significantly higher frequencies of negative comments were produced by the NNS teachers in their general comments on the whole essay (chi-square = 8.07, d.f. = 1, p < 0.05), their comments on general organization (chi-square = 8.76, d.f. = 1, p < 0.05) and their comments on the length of the essays (chi-square = 12.52, d.f. = 1, p < 0.01). These differences suggest an underlying disagreement between the NES and NNS teachers on how students’ writing should be evaluated. The NNS teachers in the present study appeared more critical

of the general quality and the structure of these students’ essays than were the NES raters. As one Chinese rater in Hamp-Lyons and Zhang’s (2001) study explained, this tendency could result from an awareness of a heavy influence of the L1 on their students’ writing. In contrast to the present finding, previous studies – based on pre-determined criteria or edited student essays – have reported negative cultural biases of many NES raters towards the rhetorical writing styles of NNS students (Land and Whitley, 1989; Basham and Kwachka, 1991; Hinkel, 1994; Tedick and Mathison, 1995; Kobayashi and Rinnert, 1996; Silva, 1997; Zhang, 1999; Hamp-Lyons and Zhang, 2001). Following the assumption that NES raters might be positively affected by exposure to the particular culture (Hamp-Lyons, 1989), the present study suggests that the participating NES raters may have been influenced by their experiences in China. Unlike the NESs in most previous studies, the NES instructors in the present study had all been teaching in Mainland China for at least six months, a period which might be long enough for them to become familiar with and, consequently, tolerant of the Chinese rhetorical style. In sum, the present finding implies that the biased attitudes of NESs towards non-English rhetorical styles reported by previous researchers might not apply to evaluations of EFL students’ writing at the proficiency level examined in the present study.

4 Comparison of qualitative judgements: self-reported importance of raters’ comments

A slightly different picture emerges in Figures 2 to 4, which document the differences in the qualitative comments between the two groups of teachers in terms of the ordering of their three comments for each of the 10 essays.

[Figure 2 First comments: frequencies of NES and NNS first-ranked comments across the 12 categories. * indicates a chi-square value significant at the 0.05 level; overall chi-square = 51.14, d.f. = 11, p < 0.001.]

[Figure 3 Second comments: frequencies of NES and NNS second-ranked comments across the 12 categories. * indicates a chi-square value significant at the 0.05 level; overall chi-square = 19.99, d.f. = 11, p < 0.001.]

[Figure 4 Third comments: frequencies of NES and NNS third-ranked comments across the 12 categories. * indicates a chi-square value significant at the 0.05 level (** at the 0.01 level); overall chi-square = 58.42, d.f. = 11, p < 0.001.]

Since the teachers were asked to record their three reasons or comments in order of importance, I expected a comparison of the ranking of the comments to reveal how these teachers weighted various writing elements in their evaluations. An initial inspection of the data suggests that, in their rank ordering of the 12 categorical comments, the two groups of teachers moved from a primary focus on the quality of content and organization to a lesser focus on language. As Figure 2 illustrates, in the comments they ranked first in importance, both groups of teachers focused primarily on the content, particularly the ideas and arguments of the essays, although the NNS teachers attended more than the NES teachers to ideas (chi-square = 5.15, 4.19, d.f. = 1, p < 0.05). There was also a fair amount of attention among both groups of teachers to the general organization of the essays in their most important comments, although

the NNS teachers attended more to such comments than did the NES teachers (chi-square = 4.19, d.f. = 1, p < 0.05). Figure 3 shows a change of tendencies in the comments ranked second in importance: while both groups remained focused on the arguments of the essays, the teachers also turned their attention to language intelligibility. In the comments ranked third in importance, as Figure 4 shows, the majority of comments from these teacher raters remained focused on language intelligibility, with markedly more such attention from the NES teachers (chi-square = 17.02, d.f. = 1, p < 0.01). This shifting of focus suggests that these teachers consciously placed different weightings on the various components. Recall that these comments were primarily positive on content but negative on language, so the shift of focus from content to language suggests that the participating teachers also tended to address the positive elements of the essays before moving to the negative ones. As these raters were teachers, who might have emphasized positive feedback because they believed in encouraging students by finding their strengths, this finding needs to be verified in future research to determine whether the movement from positive to negative evaluation is typical of EFL teachers or of raters in general. Despite the general tendencies shared by the two groups of teachers in the ordering of the three comments, chi-square tests showed a significant overall difference between the two groups in the frequencies of the 12 categorical evaluations mentioned as the first, second and third most important comments (chi-square = 51.14, 19.99, 58.42; d.f. = 11; p < 0.001). This further suggests that certain elements were more or less salient to the two groups of teachers.
Aside from the differences mentioned above, whereby the NNS teachers were more concerned with ideas and general organization in their most important comments and the NES teachers more with language intelligibility in their third-ranked comments, the NES teachers were also found to focus significantly more than the NNS raters on the intelligibility and accuracy of language use in their most important comments (chi-square = 4.00, 16.33, d.f. = 1, p < 0.05), again on accuracy in their second-ranked comments (chi-square = 8.17, d.f. = 1, p < 0.05), and on ideas in their third-ranked comments (chi-square = 18.18, d.f. = 1, p < 0.01). These findings suggest that the NES teachers were not only more aware of the language ability of these EFL students than their NNS colleagues were, but also saw intelligibility as more salient than accuracy in evaluation. In contrast, the NNS raters were found to focus significantly more than the NES raters on general organization in their second-ranked comments (chi-square = 4.33, d.f. = 1, p < 0.05) and on the length of the essays in their third-ranked comments (chi-square = 18.67, d.f. = 1, p < 0.01). As I reported earlier, these comments from the NNS teachers were

mostly negative. They seemed to emphasize the weakness of organization more than the problem of length in the students' writing. If the rankings of these reasons or comments reflected the general characteristics of the writing samples that these teacher raters chose to attend to, the present finding suggests that NES and NNS teachers differ in their beliefs about, and weightings of, these writing elements in L2 writing evaluation. This indicates that the process of L2 writing evaluation is as important as, and perhaps more important than, the resulting scores in identifying the differences between NES and NNS teacher raters.

IV Conclusions

The present study investigated whether evaluation standards for written compositions differ between NES and NNS English teachers in an EFL setting such as China. I examined not only how NES and NNS teachers holistically evaluated students' writing, but also how the two groups of teachers consciously attended to and weighted various writing features. Results indicate that NES and NNS teachers gave similar scores to the EFL students' writing, as no significant differences were found in the scoring of the 10 essays. However, the two groups of teachers differed markedly in the frequencies of certain types of comments or criteria they mentioned to justify their ratings. The analysis of the qualitative comments suggests that NES teachers attended more positively to content and language, whereas NNS teachers attended more negatively to organization and length, although both groups appeared more positive towards the content and more negative towards the intelligibility of the language of the essays.
Furthermore, while both groups seemed to move their attention from content to language in the rank ordering of their comments for each essay, the NNS teachers appeared more concerned with ideas and general organization in their first comments, and the NES teachers focused more on intelligibility in their third comments. The above findings raise at least two important issues for language testing. The first concerns the extent to which holistic ratings reflect analytical or qualitative judgements. As the present findings suggest, although the writing of EFL students might be rated equally under holistic scoring, it might still be evaluated from different perspectives by NES and NNS teachers. Following Hamp-Lyons (1995: 760), who noted that ‘the writing of second language English users is particularly likely to show varied performance on different traits’, the present study suggests that holistic scoring was not effective in distinguishing subtle differences in students' writing performances, nor was it an effective way to detect differences between NES and NNS

teacher raters. Analytic ratings or qualitative evaluations, as the present study implies, might be preferable to holistic scoring for accurately assessing the quality of L2 writing for purposes such as research, high-stakes testing or diagnostic assessment, where the quality of the information yielded by the evaluation is important. Related to the first issue is the second: the relationship between reliability and validity in language testing. Both the NES and the NNS teachers in the present study achieved relatively high reliability in their ratings of the 10 essays, with the former obtaining even higher reliability. However, the fact that these raters differed in their qualitative comments, reflecting differing understandings of what constitutes good writing, raises questions about the construct validity of any inferences that might be made solely on the basis of the holistic ratings. For example, Essay 1 was awarded a score of 7 on the scale by both a NNS and a NES teacher, but for different reasons. In recording their most important reasons for giving this mark, the NNS teacher commented positively on its ‘well presented ideas’, whereas the NES teacher commented negatively on its ‘many simple grammar mistakes’. This raises questions about whether these teachers even agreed on the terms or the construct definitions they used. The concept of ‘argument’, for instance, seems to mean different things to different raters. The same argument in Essay 1 was rated as low as 1 on the scale by a NES teacher because the writer ‘tried to argue on both sides’, whereas it was awarded as high as 8.5 by a NNS teacher because of its ‘good argument’. Given these disagreements among teachers, there is no doubt that students would be confused if they were presented with these comments. There is a great need for research that pursues construct validity in testing by comparing differences not only among teachers/raters but also between teachers/raters and students.
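Inter-rater reliability of the kind reported here is commonly estimated by treating each rater as an "item" and computing Cronbach's alpha over the essay scores. The study does not specify which coefficient it used, and the scores below are invented for illustration; a minimal sketch, under those assumptions:

```python
def cronbach_alpha(ratings):
    """Cronbach's alpha, with raters treated as items.
    ratings[i][j] = score given by rater j to essay i."""
    n_raters = len(ratings[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Variance of each rater's scores across essays, and of the essay totals.
    rater_vars = [var([row[j] for row in ratings]) for j in range(n_raters)]
    total_var = var([sum(row) for row in ratings])
    return (n_raters / (n_raters - 1)) * (1 - sum(rater_vars) / total_var)

# Invented scores for 5 essays by 3 raters on a 10-point scale.
scores = [[7, 6, 7],
          [4, 5, 4],
          [8, 8, 9],
          [3, 2, 3],
          [6, 6, 5]]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # alpha = 0.97
```

High agreement of this kind is exactly what can coexist with the divergent qualitative criteria described above, which is why reliability alone does not guarantee construct validity.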
The present study suggests that reliability alone is insufficient and points to the importance of strengthening validity in L2 writing assessment. Rater training is highly recommended so that raters reach consensus on the criteria before evaluation. The process of rating becomes more complicated if one takes into consideration other related issues, such as the tolerance that comes with familiarity (e.g., Santos, 1988) or the washback effects of tests on instruction (e.g., Wall and Alderson, 1993). These issues lead to the pedagogical implications of the present study. For example, there is a real concern about how EFL students can acquire English academic rhetorical patterns if their NES teachers, as the present study suggests, become tolerant of their rhetorical problems after being exposed to their culture. Furthermore, if we assume a correlation between how teachers evaluate students' essays and teachers' instructional goals, the present findings call for further inquiry to verify whether there is a divergence

of instructional emphasis underlying these teachers' classroom teaching of EFL writing. Since the NES teachers in the present study emphasized the development of ideas and the intelligibility of language use, while the NNS teachers emphasized the general organization and the length of essays, they may have similarly divergent instructional focuses in their writing classes. Researchers have indicated that teaching emphases can differ when writing is taught in places where teachers are predominantly either NES or NNS (Mohan and Lo, 1985). The present study suggests that such differences could also exist in EFL programs taught either separately or jointly by NES and NNS teachers. Further research is also needed to investigate whether EFL students in China are receiving different instruction from NES and NNS teachers, as this is important information for curriculum designers and policy makers. In an exam-driven school system, Chinese students are continually assessed on their writing, and the results influence decisions that affect their future. Chinese students need to pass exams rated by Chinese EFL teachers in order to graduate from university and, at the same time, to pass tests such as the TWE (Test of Written English), rated by NESs, if they plan to study abroad. Many Chinese universities, aware of their students' needs, currently have their English writing programs jointly taught by NES and local Chinese teachers. This reflects a general concern about the qualifications of NNSs for teaching writing and a call for favouring NESs to correct and guide students' writing (Takashima, 1987; Kobayashi, 1992), despite recognition that knowledge of the target language as a second or foreign language should ‘constitute the basis for non-native teachers’ confidence, not their insecurity’ (Seidlhofer, 1999: 238) or ‘threatened . . . authority’ (Tang, 1997: 578).
If students receive different directives from the two groups of teachers, as the present study suggests may happen, there is a need for closer cooperation between NES and NNS teachers. NES teachers need to understand the background training of their Chinese EFL students, just as NNS teachers need to incorporate the standards of their NES counterparts into their teaching and testing so as to socialize their students into English academic discourse. On the one hand, NESs should be aware of the different variety of English writing being cultivated by NNS teachers, and should adjust their teaching accordingly in a setting where most students are studying to pass exams rated by local NNS teachers. Based on similar findings, Kobayashi and Rinnert (1996) suggested that NES teachers in an EFL context should be aware of local students' need to meet the standards of NNS teachers. On the other hand, NNS teachers should be aware of the diverse instructional emphases of their NES

colleagues. Both NES and NNS teachers should also help their students understand how readers' expectations can vary with their cultural backgrounds, and how this should affect the way the students write. Seminars for teachers and training workshops for raters should be organized to facilitate communication between NES and NNS teachers so that each group gains access to the shared background and beliefs of the other. Such cooperation and sharing of expertise could lead to more effective writing instruction for EFL students in China. Finally, the present study has at least two limitations. One limitation is that it used various statistical procedures to explore group differences in the scoring of a set of 10 essays rather than individual essay scores from each rater. Qualitative case studies need to be conducted to explore how individual raters vary their judgements from essay to essay. The other limitation is that, although a contrast between NES and NNS teachers is the main point of the research, the distinction between the two groups can be problematic given the lack of precise population definitions or sampling techniques (for various discussions on the topic, see Braine, 1999). I used the terms NES and NNS only to follow previous studies in distinguishing educators in English language teaching according to whether English is their first or second language. It is important to caution that neither the present study nor any other published research has utilized populations of English instructors or raters that are truly representative of larger populations. A great deal of further research is needed before generalizations can be made about the values or practices of NESs or NNSs.

Acknowledgements

This project was funded first by the Committee of Research and Conference Grants of the University of Hong Kong, and then by the Humanities and Social Sciences Research Grant of the University of British Columbia.
An earlier version of the paper was presented at the 22nd Annual Language Testing Research Colloquium, 2000. I thank the participating students and teachers, Quiofang Wen and Wenyu Wong for assistance in data collection and analyses, and Maria Trache for statistical advice. I also thank Alister Cumming, Monique Bournot-Trites, Bonny Norton, Lee Gunderson, Lyle F. Bachman, and three anonymous reviewers for their valuable comments on earlier drafts of the paper.

V References

Basham, C. and Kwachka, P.E. 1991: Reading the world differently: a cross-cultural approach to writing assessment. In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 37–49.

Braine, G. 1999: Non-native educators in English language teaching. Mahwah, NJ: Lawrence Erlbaum.
Brown, J.D. 1991: Do English and ESL faculties rate writing samples differently? TESOL Quarterly 25, 587–603.
Connor-Linton, J. 1995a: Looking behind the curtain: what do L2 composition ratings really mean? TESOL Quarterly 29, 762–65.
—— 1995b: Crosscultural comparison of writing standards: American ESL and Japanese EFL. World Englishes 14, 99–115.
Cumming, A. 1990: Expertise in evaluating second language compositions. Language Testing 7, 31–51.
Hamp-Lyons, L. 1989: Raters respond to rhetoric in writing. In Dechert, H.W. and Raupach, M., editors, Interlingual processes. Tübingen: Gunter Narr, 229–44.
—— 1991a: Issues and directions in assessing second language writing in academic contexts. In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 323–29.
—— 1991b: Reconstructing “Academic writing proficiency.” In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 127–53.
—— 1995: Rating nonnative writing: the trouble with holistic scoring. TESOL Quarterly 29, 759–62.
Hamp-Lyons, L. and Zhang, B.W. 2001: World Englishes: issues in and from academic writing assessment. In Flowerdew, J. and Peacock, M., editors, Research perspectives on English for academic purposes. Cambridge: Cambridge University Press, 101–16.
Hinkel, E. 1994: Native and nonnative speakers’ pragmatic interpretations of English texts. TESOL Quarterly 28, 353–76.
Hughes, A. and Lascaratou, C. 1982: Competing criteria for error gravity. ELT Journal 36, 175–82.
James, C. 1977: Judgments of error gravities. ELT Journal 31, 116–24.
Khalil, A. 1985: Communicative error evaluation: native speakers’ evaluation and interpretation of written errors of Arab EFL learners. TESOL Quarterly 19, 335–51.
Kobayashi, H. and Rinnert, C.
1996: Factors affecting composition evaluation in an EFL context: cultural rhetorical pattern and readers’ background. Language Learning 46, 397–437.
Kobayashi, T. 1992: Native and nonnative reactions to ESL compositions. TESOL Quarterly 26, 81–112.
Land, R. and Whitley, C. 1989: Evaluating second language essays in regular composition classes: toward a pluralistic U.S. rhetoric. In Johnson, D. and Roen, D., editors, Richness in writing. New York: Longman, 289–93.
Machi, E. 1988: An exploratory study on essay-grading behavior of native speakers and Japanese teachers of English. Paper presented at the 27th Annual Japan Association of College English Teachers Convention, Tokyo.
Mohan, B.A. and Lo, W.A.Y. 1985: Academic writing and Chinese

students: transfer and developmental factors. TESOL Quarterly 19, 515–34.
Santos, T. 1988: Professors’ reactions to the academic writing of nonnative-speaking students. TESOL Quarterly 22, 69–90.
Seidlhofer, B. 1999: Double standards: teacher education in the expanding circle. World Englishes 18, 233–45.
Shohamy, E., Gordon, C.M. and Kraemer, R. 1992: The effect of raters’ background and training on the reliability of direct writing tests. Modern Language Journal 76, 27–33.
Silva, T. 1997: On the ethical treatment of ESL writers. TESOL Quarterly 31, 350–63.
Takashima, H. 1987: To what extent are non-native speakers qualified to correct free compositions: a case study. British Journal of Language Teaching 25, 43–48.
Tang, C. 1997: On the power and status of nonnative ESL teachers. TESOL Quarterly 31, 577–80.
Tedick, D.J. and Mathison, M.A. 1995: Holistic scoring in ESL writing assessment: what does an analysis of rhetorical features reveal? In Belcher, D. and Braine, G., editors, Academic writing in a second language: essays on research and pedagogy. Norwood, NJ: Ablex, 205–30.
Vaughan, C. 1991: Holistic assessment: what goes on in the rater’s mind? In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 111–25.
Wall, D. and Alderson, J.C. 1993: Examining washback: the Sri Lankan impact study. Language Testing 10, 41–69.
Zhang, W.X. 1999: The rhetorical patterns found in Chinese EFL student writers’ examination essays in English and the influence of these patterns on rater response. PhD thesis, Hong Kong Polytechnic University.

Appendix 1 Writing sample*

Essay 1

Since the invention of TV, newspaper business has deciding. Fewer and fewer people are reading newspaper and someone even declares that newspaper is dying out soon. Judging from the means of providing information, TV has its unprevailable advantage over the newspaper. It provides visions and sounds. It was a great event that the first day people could see with their eyes what had happened in the other part of the world. Thus the information on TV became powerful for pictures are different from printed words—seeing is believing. In this sense, TV can present the news, the events more vividly and carry more information from several angles. It should be a better source.

However, the truth is that newspaper is more reliable simply because of the commercialization of TV. TV is powerful and people abuse its power. Most of what we get from TV is not information, but a kind of entertainment. The news reported on TV are usually in terse, popular language with impressive images, but without much profounding critics. We find more good critics on newspapers. The TV people don’t give critical opinions, they are busy making questionaires about masses’ taste and fussy with soapy shows. And if they have an attitude, (I feel happy as well as worried about it.) they can use, manipulate their powerful weapon—TV without letting us know. Pictures clipped, words omitted, repetitions on certain facts, all these can give us a totally wrong idea. Yet, the worst thing is we believe the news, for seeing is believing. One Hong Kong banker was interviewed by foreign journalists about his opinion on Hong Kong’s future. He said “There is difficulty ahead, but I have the full confidence.” On screen, you just hear “There is difficulty ahead” and see a shot of his lowered head. What is the truth? I remember one film named “making city” starred by Dustin Huffman. The protagonist doesn’t tend to harm anyone, but is made the image as a terrorist by the media and shoots himself at last. When Dustin ories “ We killed him” in the end of the film, I feel there’s something much worse than entertainment that TV can provide—the false. Of course, TV itself is not bad. It’s like money and depends on how we use it. The problem with the masses is that we are easily taken in by what we see and by indulging too much in its entertainment, we don’t think and become slave of it. In this sense, the “ancient” means of data leaves us more space for thinking. Note: *Errors and mistakes are retained verbatim from the original student’s essay.
Appendix 2 Evaluation of students’ writing

This project aims to find out how teachers of English rate university students’ essays. Please read and rate the 10 essays provided using a 10-point scale (point 10 being the highest on the scale) and then state, in order of importance, three reasons or characteristics in each essay that you think most influenced your rating of that essay (reason 1 being the most important).

Essays     1   2   3   4   5   6   7   8   9   10
Ratings
Reason 1
Reason 2
Reason 3

Appendix 3 Coding scheme

General
- General: General comments on overall quality of writing.
  Positive: Well written. Negative: It fails to complete the task.

Content
- General: General comments on content.
  Positive: Good contents. Negative: Content shallow.
- Ideas: General or specific comments on ideas and thesis.
  Positive: Good ideas. Thesis clearly stated. Negative: Poor idea. No thesis statement.
- Arguments: General or specific comments on aspects of arguments such as balance, use of comparison, counter arguments, support, use of details or examples, clarity, unity, maturity, originality, relevance, logic, depth, objectivity, conciseness, development and expression.
  Positive: Good argument. Arguments balanced. Arguments well supported. Logical argument. Argument well developed. Negative: Poor argument. Lack of arguments on the newspaper issue. Arguments not very well supported. No logic in reasoning. Limited development of argument.

Organization
- General: General comments on organization.
  Positive: Excellent organization. Negative: Weak organization.
- Paragraphs: Comments on the macro level concerning paragraphs, introductions, and conclusions.
  Positive: Paragraphs are well arranged. Good introduction. Unique conclusion. Negative: Paragraphs are poorly organized. Poor introduction. No conclusion.
- Transitions: Comments on the micro level concerning transitions, coherence, and cohesion.
  Positive: Good transitions. Coherent. Cohesive. Negative: Bad use of transition words. Lacks coherence. Lack of connectives.

Language
- General: General comments on language.
  Positive: Language good. Negative: Poor English.
- Intelligibility: Comments on whether the language is clear or easy to understand.
  Positive: Easy to read and follow. Negative: Meaning unclear.
- Accuracy: General comments on accuracy or specific comments on word use, grammar and mechanics.
  Positive: Accurate language. Good use of words. Good grammar. Few errors of spelling and punctuation. Negative: Too many errors. Bad diction. Poor grammar. Too many errors in mechanics.
- Fluency: Comments on fluency, conciseness, maturity, naturalness, appropriateness, and vividness of language.
  Positive: Fluent language. Concise language. Complex sentences. Idiomatic English. Vivid language. Negative: Language not smooth. Wordy. Use of language immature. Chinglish. Voice not appropriate.

Length
- Length: Comments on whether the writer has fulfilled the word limit.
  Positive: About 250 words. Negative: Too short.
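The coding scheme lends itself to a simple machine-readable representation. The study coded comments by hand; the keyword lists below are a hypothetical reduction of the scheme for illustration only, not the study's actual coding procedure. A sketch in Python:

```python
# Hypothetical keyword-based coder loosely following the scheme above.
# The actual study coded rater comments by hand; keywords are illustrative.
SCHEME = {
    ("Content", "Ideas"): ["idea", "thesis"],
    ("Content", "Arguments"): ["argument", "logic", "support"],
    ("Organization", "General"): ["organization", "organized"],
    ("Language", "Accuracy"): ["grammar", "error", "spelling", "diction"],
    ("Language", "Intelligibility"): ["unclear", "understand", "follow"],
    ("Length", "Length"): ["short", "word limit", "length"],
}
NEGATIVE_CUES = ["poor", "weak", "no ", "not ", "lack", "too many", "unclear", "too short"]

def code_comment(comment):
    """Return (major category, sub-category, polarity), or None if unmatched."""
    text = comment.lower()
    polarity = "negative" if any(cue in text for cue in NEGATIVE_CUES) else "positive"
    for (major, sub), keywords in SCHEME.items():
        if any(kw in text for kw in keywords):
            return major, sub, polarity
    return None

print(code_comment("Thesis clearly stated."))      # ('Content', 'Ideas', 'positive')
print(code_comment("Too many grammar mistakes."))  # ('Language', 'Accuracy', 'negative')
```

A real coder would need to resolve comments matching several sub-categories and handle the "General" sub-categories as fallbacks, which is why hand coding remains the more defensible choice for data like these.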

• • • • • • • • • • • •