Journal of Retailing 80 (2004) 37–52
Scale modification: alternative approaches and their consequences

Adam Finn (University of Alberta School of Business, Edmonton, Alta., Canada T6G 2R6)
Ujwal Kayande (Smeal College of Business, Pennsylvania State University, University Park, PA 16802-3007, USA)
Abstract

Important marketing scales such as SERVQUAL are often adapted for use in a particular applied or theory-testing context and/or refined using some statistical method, assuming the resultant scale will have improved psychometric properties for a particular application. However, little attention has been paid to the consequences of how scale modifications are made, the criteria that are used to assess how well a modified scale performs, and whether scale modification is in fact worthwhile. To investigate these issues, we review approaches to scale modification, select an influential marketing scale and use context, and then examine the effects of the different scale modification approaches on which items are included in the refined scale. We then cross-validate the psychometric performance of the resulting scales using criteria that reflect the multiple purposes to which a scale can be applied. The results show that approaches that lead to a more reliable scale for one purpose (e.g., segmentation) can be far less adequate for another purpose (e.g., benchmarking).

© 2004 by New York University. Published by Elsevier. All rights reserved.

Keywords: Scale modification; Scale adaptation; Scale refinement; Scale performance; Service quality
An earlier version of this paper was presented at the American Marketing Association Winter Educators' Conference, San Antonio, 2000. Corresponding author: Adam Finn, tel.: +1-780-492-5369; fax: +1-780-492-3325; e-mail: [email protected]. Ujwal Kayande: tel.: +1-814-863-4250; e-mail: [email protected].

Introduction

Important marketing scales such as SERVQUAL (Parasuraman, Zeithaml, & Berry, 1988) are often modified for particular applications by changing the wording of items, adding items, or dropping items from the original scale to suit the specific context of scale usage. Scales are typically developed using the influential multi-item scale development paradigm presented by Churchill (1979). With important confirmatory refinements to the assessment of dimensionality (Gerbing & Anderson, 1988) and construct validity (Steenkamp & van Trijp, 1991), this heavily cited paradigm has considerably improved the reliability and validity of marketing scales (see Bearden & Netemeyer, 1999). However, less attention has been given to further scale modification. Both scale adaptation to a particular context and scale refinement to improve psychometric properties are common, but little has been written about how they should be done and what effect they have on scale performance. Therefore,
the purpose of this paper is to review and compare the scale modification approaches commonly used in marketing, when modifying SERVQUAL for a retail context. To begin, we briefly describe scale adaptation, scale refinement, and scale performance, giving examples from the marketing literature.

Scale adaptation

Scale adaptation refers to the addition or deletion of items based on their supposed suitability for a particular research context. Item suitability is typically asserted a priori, on the basis of face validity, or based on exploratory research. The literature on service quality measurement provides many examples. Parasuraman et al. (1988) describe their 22-item SERVQUAL scale as providing "a basic skeleton, which when necessary, can be adapted or supplemented to fit the characteristics of specific research needs of a particular organization" (p. 31). Thus, the scale's originators themselves adapted the scale, replacing two items and reversing the negative items (Parasuraman, Berry, & Zeithaml, 1991). Table 1 reports more instances of the a priori adaptation of SERVQUAL. Column 2 identifies the service application context, and Column 3 identifies the number of a priori additions and omissions of items. Column 4 identifies whether the negative items were reversed, and Column 5 identifies whether any other rewording occurred.
Scale refinement

Scale refinement refers to the steps taken to improve the psychometric performance of a scale after it is first reported in the academic literature. Smith and McCarthy (1995) identify several reasons for refining a scale, including the establishment of better internal consistency, determination of content homogeneity of unidimensional facets, and inclusion of items that discriminate at the desired level of attribute intensity. Published marketing scales are often refined on the basis of new data about their psychometric properties. SERVQUAL again provides examples. The scale's originators refined their own scale, eliminating one item and reassigning two others to different dimensions (Parasuraman, Zeithaml, & Berry, 1994). The right-hand columns of Table 1 identify instances where items were deleted during refinement of SERVQUAL-based scales. The frequent adaptation and refinement of SERVQUAL make it a prime example of the prevalence of scale modification in marketing.

Scale performance

When a multi-item scale is modified, the change is presumably intended to improve scale performance. We now discuss two important considerations in assessing scale performance, namely the degree of cross-validation and the assumed purpose of measurement.

Cross-validation requires that new data be used to assess the performance of the modified scale. In the past, modified scales were sometimes assessed on the data that were used to refine the scale (e.g., Deshpande, Farley, & Webster, 1993; Spiro & Weitz, 1990), thus positively biasing the assessed scale performance. Now, new data are usually collected. But whether the refined scale is still superior to the original scale is rarely investigated. Often, the cross-validation data collection is only partial, including only the items remaining after refinement (for examples, see Babin, Darden, & Griffin, 1994; Kohli, Jaworski, & Kumar, 1993; Parasuraman et al., 1988).

Second, the scales are almost always evaluated without carefully specifying the purpose of measurement. Services measurement applications can require the scaling of objects such as firms (benchmarking their service quality), advertisements (pretesting ads), products (testing interest in concepts) or brands (comparing brand image), rather than just a scaling of respondents (segmenting respondents). Thus, it is important to know the object of measurement for which a modified scale is cross-validated. Scale adaptation or scale refinement, based on results expected or obtained when scaling respondents, presumably generates a scale that will do a better job when scaling respondents. However, it is not clear whether the modified scale will also do a better job when scaling other objects, such as firms, brands, and ads. Here we compare the performance of different multi-item scales using a generalizability coefficient (hereafter called GC) for each of the relevant objects of measurement
(Cronbach, Gleser, Nanda, & Rajaratnam, 1972). Generalizability theory has long been identified as the best approach to assessing reliability in marketing (Peter, 1979; Rentz, 1987). However, only recently has evidence been provided of how far generalizability when scaling respondents can diverge from generalizability for other objects of measurement (Finn & Kayande, 1997, 1999).

Generalizability theory and the generalizability coefficient

In generalizability theory, the variation in test scores or survey ratings can be attributed to the systematic variation across levels of multiple factors that constitute the design of a measurement study. Consider benchmarking the quality of service provided by retailers by asking respondents to rate a number of retailers on a number of items. Respondents, items, and retailers are the three factors in the measurement study. We can now decompose the variance in ratings into a variance attributable to each one of these factors and their associated interactions,² using the variance components model, as follows:

$$\sigma_y^2 = \sigma_{retailers}^2 + \sigma_{respondents}^2 + \sigma_{items}^2 + \sigma_{retailers \times respondents}^2 + \sigma_{retailers \times items}^2 + \sigma_{respondents \times items}^2 + \sigma_{error}^2 \quad (1)$$
where $\sigma_{retailers}^2$, $\sigma_{respondents}^2$ and $\sigma_{items}^2$ are the variance components associated with each factor, $\sigma_{retailers \times respondents}^2$, $\sigma_{retailers \times items}^2$ and $\sigma_{respondents \times items}^2$ are the variance components associated with the two-way interactions, and $\sigma_{error}^2$ is the variance component associated with the three-way interaction and random error. The variance components model in (1) is basically a general version of the true score model used in traditional psychometric theory. The GC for retailers is an indicator of the overall adequacy of the measurement study to provide a relative standing of retailers. It is expressed as

$$GC_{retailers} = \frac{\sigma_{retailers}^2}{\sigma_{retailers}^2 + \sigma_{relative\ error}^2} \quad (2)$$
The relative error variance component in a study with $n_{respondents}$ respondents and $n_{items}$ items is

$$\sigma_{relative\ error}^2 = \frac{\sigma_{retailers \times respondents}^2}{n_{respondents}} + \frac{\sigma_{retailers \times items}^2}{n_{items}} + \frac{\sigma_{error}^2}{n_{respondents} \times n_{items}} \quad (3)$$
2 Such a 3-factor design is commonly used in market research surveys. In practice, however, respondents in such surveys are more commonly nested within retailers, i.e., each respondent rates only the retailer at which they frequently shop. Our assumption of a crossed design (each respondent rates each retailer on each item) is made to simplify the exposition, but the methodology generalizes to any nested or crossed design.
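To make Eqs. (1)-(3) concrete, the sketch below simulates a balanced, fully crossed retailers × shoppers × items design (the simplifying assumption described in footnote 2), estimates the variance components from the ANOVA mean squares via their expected values, and then computes the GC for scaling retailers. All data are simulated, and the expected-mean-squares estimator is only one of several options; the study itself uses REML for its unbalanced data, so this illustrates the logic rather than the authors' procedure.

```python
# A G-study sketch under the assumptions of footnote 2: a balanced, fully
# crossed retailers x shoppers x items design with one rating per cell.
import numpy as np

rng = np.random.default_rng(0)
n_r, n_s, n_i = 10, 15, 24              # retailers, shoppers (respondents), items

# Simulate ratings y[r, s, i] from the variance components model in Eq. (1).
y = (rng.normal(0, 0.4, (n_r, 1, 1))        # retailer main effect
     + rng.normal(0, 0.8, (1, n_s, 1))      # shopper main effect
     + rng.normal(0, 0.3, (1, 1, n_i))      # item main effect
     + rng.normal(0, 0.8, (n_r, n_s, 1))    # retailer x shopper interaction
     + rng.normal(0, 0.2, (n_r, 1, n_i))    # retailer x item interaction
     + rng.normal(0, 0.3, (1, n_s, n_i))    # shopper x item interaction
     + rng.normal(0, 0.8, (n_r, n_s, n_i))) # three-way interaction + error

g = y.mean()
m_r, m_s, m_i = y.mean((1, 2)), y.mean((0, 2)), y.mean((0, 1))
m_rs, m_ri, m_si = y.mean(2), y.mean(1), y.mean(0)

# ANOVA mean squares for the random-effects model.
ms_r = n_s * n_i * ((m_r - g) ** 2).sum() / (n_r - 1)
ms_rs = n_i * ((m_rs - m_r[:, None] - m_s[None, :] + g) ** 2).sum() / ((n_r - 1) * (n_s - 1))
ms_ri = n_s * ((m_ri - m_r[:, None] - m_i[None, :] + g) ** 2).sum() / ((n_r - 1) * (n_i - 1))
resid = (y - m_rs[:, :, None] - m_ri[:, None, :] - m_si[None, :, :]
         + m_r[:, None, None] + m_s[None, :, None] + m_i[None, None, :] - g)
ms_e = (resid ** 2).sum() / ((n_r - 1) * (n_s - 1) * (n_i - 1))

# Solve the expected-mean-square equations for the needed variance components.
v_e = ms_e                                        # three-way interaction + error
v_rs = max((ms_rs - ms_e) / n_i, 0.0)             # retailers x shoppers
v_ri = max((ms_ri - ms_e) / n_s, 0.0)             # retailers x items
v_r = max((ms_r - ms_rs - ms_ri + ms_e) / (n_s * n_i), 0.0)  # retailers

# Eq. (3): relative error variance; Eq. (2): GC for scaling retailers.
rel_err = v_rs / n_s + v_ri / n_i + v_e / (n_s * n_i)
print(f"GC(retailers) = {v_r / (v_r + rel_err):.3f}")
```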
Table 1
Studies that have modified SERVQUAL for specific application contexts

[For each study, Table 1 tabulates: the application context and form of data; the a priori scale adaptation (items and dimensions added or dropped); the scale refinement undertaken (whether SERVQUAL items were reworded to be all positive, and whether other rewording occurred); the number of items deleted; the criteria for removing items (e.g., coefficient alpha, item-to-total correlations, principal components or principal axis loadings, concurrent validity correlations, CFA); and the number of final items. The studies tabulated are: Parasuraman, Berry, and Zeithaml (1991); Parasuraman, Zeithaml, and Berry (1994); Carman (1990); Bojanic (1991); Freeman and Dart (1993); Weekes, Scott, and Tidwell (1996); Vandamme and Leunis (1993); Khatabi, Ismail, and Thyagarajan (2002); Kettinger and Lee (1994); Lapierre and Filiatrault (1994); Johns and Tyas (1996); de Ruyter and Wetzels (1997); Lam and Zhang (1999); Frochot and Hughes (2000); Webster and Hung (1994); Wong, Dean, and White (1999); Bouman and van der Wiele (1992); Lim and Tang (2000); Oldfield and Baron (2000); Brown and Bell (1998); Akan (1995); Cook and Thompson (2000); Hoxley (2000); Nelson and Nelson (1995); Dabholkar, Thorpe, and Rentz (1996); Mehta, Lalwani, and Han (2000); Sureshchander, Rajendran, and Kamalanabhan (2001); Philip and Stewart (1999); Philip and Hazlett (2001); Barnes and Vigden (2001); Vazquez, Rodriguez-Del Bosque, Diaz, and Ruiz (2000); Engelland, Workman, and Singh (2000); Othman and Owen (2001); and Richard and Alloway (1993). Application contexts include accounting, hospitals and healthcare, hotels, travel agents, foodservice, information services, engineering consulting, telecommunications, construction, real estate, libraries, business schools, web retail, supermarkets, and retail stores.]

Notes. n.r.: not reported, P: performance measures, Q: performance-expectation measures, D: disconfirmation measures, S: satisfaction measures, I: importance measures.
On the other hand, the GC for scaling respondents is

$$GC_{respondents} = \frac{\sigma_{respondents}^2}{\sigma_{respondents}^2 + \sigma_{relative\ error}^2} \quad (4)$$

where the relative error variance component is

$$\sigma_{relative\ error}^2 = \frac{\sigma_{retailers \times respondents}^2}{n_{retailers}} + \frac{\sigma_{respondents \times items}^2}{n_{items}} + \frac{\sigma_{error}^2}{n_{retailers} \times n_{items}} \quad (5)$$
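The divergence between Eqs. (2)-(3) and Eqs. (4)-(5) is easy to see numerically: the same variance components imply different generalizability coefficients for different objects of measurement, because different interactions enter the relative error and are divided by different sample sizes. A minimal sketch, using hypothetical component values chosen only for illustration:

```python
# Eqs. (2)-(5): identical variance components, different G-coefficients
# per object of measurement. Component values below are hypothetical.
def gc_retailers(v, n_respondents, n_items):
    rel = (v["rs"] / n_respondents + v["ri"] / n_items
           + v["e"] / (n_respondents * n_items))          # Eq. (3)
    return v["r"] / (v["r"] + rel)                        # Eq. (2)

def gc_respondents(v, n_retailers, n_items):
    rel = (v["rs"] / n_retailers + v["si"] / n_items
           + v["e"] / (n_retailers * n_items))            # Eq. (5)
    return v["s"] / (v["s"] + rel)                        # Eq. (4)

# r: retailers, s: respondents, rs/ri/si: interactions, e: residual error.
v = {"r": 0.18, "s": 0.65, "rs": 0.64, "ri": 0.03, "si": 0.09, "e": 0.70}
print(gc_respondents(v, n_retailers=16, n_items=24))  # about .93
print(gc_retailers(v, n_respondents=6, n_items=24))   # about .62: far lower
```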
Traditional scale refinement methods are designed to increase the GC for scaling respondents, but this cannot give any indication of the refined scale's ability to scale retailers (reflected in GCretailers). This is undesirable if the context requires a scaling of retailers rather than respondents. Thus, we should expect that state-of-the-art scale refinement methods such as confirmatory factor analysis (CFA) will lead to scales that perform best when the purpose of measurement is to scale respondents. However, there is little reason to expect the same result when the purpose is to scale retailers. What can be expected is that the multiple methods will not systematically perform better or worse when the purpose is to scale retailers.

In summary, we conduct a full cross-validation and use the generalizability coefficient as a scale performance measure. From (2) and (3), it can be seen that the number of items affects the estimate of the generalizability coefficient. We compare the generalizability coefficients based on each scale's own number of items and dimensions. Because scale performance varies across the objects of measurement, we apply the criterion for three different objects of measurement.

We emphasize that our purpose is to compare the scale modification approaches commonly used in marketing. To do so, we first discuss the common forms of adapting a scale by adding or deleting items to modify the content domain encompassed in a construct. Next, we discuss four criteria that are commonly employed when refining a multi-item scale. These criteria are briefly contrasted with a generalizability theory perspective. We test the alternative approaches when modifying SERVQUAL for the measurement of retail service quality using mystery shoppers. Subsequently, we compare the data quality obtained when mystery shoppers use the resulting scale variants for different objects of measurement. Finally, we discuss these results and conclude with the implications for scale modification in marketing.
Criteria for scale item modification

Adaptation of the scale content domain

Many marketing constructs can be conceptualized at varying levels in a hierarchy, from abstract and general at the top, to concrete and specific at the bottom. Such constructs often use reflective items for multiple content areas or dimensions nested within the higher order construct. As a result, the most common form of scale adaptation is a reassessment of which content areas rightfully fall within the construct domain, with groups of items for a content area added to or dropped from the scale. Richard and Alloway (1993), for example, add six product outcome related items for their assessment of the predictive validity of SERVQUAL. Other SERVQUAL examples are presented in Table 1. Rossiter (2002), who advocates a context specific view of construct definition, implicitly encourages such scale adaptation to the specific context of measurement. The priority he places on content validity over empirical evaluation suggests we should expect scale adaptation to have a greater impact than scale refinement on scale performance.

Refinement of the multi-item scale

In marketing, multi-item scales are typically refined using criteria based on item correlations. However, the procedures used are inconsistent and rarely specified precisely enough for an exact replication. Therefore, for expository purposes we consider each of the commonly applied criteria in isolation. Generalizability theory provides a contrasting perspective.

Best coefficient alpha for dimension

Churchill (1979) suggests that items with item-to-total score correlations that are near zero or are substantially lower than those for other items be dropped (p. 68). For multidimensional scales, these criteria are applied separately to the items measuring each dimension. Kopalle and Lehmann (1997) show that eliminating such poor items can have a sizable positive impact on reported alpha. However, their research only considered the scaling of respondents, not the effect for other objects of measurement.

Factor analysis

As Churchill (1979) notes, the dimensionality of a pool of items generated to tap a construct is often uncertain. As a result, exploratory factor analysis is often used during the item pool purification process, to identify both the underlying dimensions and which items load cleanly on each dimension. As a corollary, it is used to eliminate items that fail to load strongly on any factor or that load approximately equally on two or more factors (Floyd & Widaman, 1995). The term "factor analysis" is often used loosely to refer to a range of procedures, which can produce different outcomes. Principal components analysis, with orthogonal rotation, is most appropriately used for data reduction. Common factor analysis is better for understanding the relationships amongst a set of underlying dimensions, making oblique rotation the logical choice with this procedure. It was once thought that the choice between principal component analysis and common factor analysis had little effect on results (Velicer, 1977; Velicer & Jackson, 1990). However, if a small number of items are expected to load on each dimension and if the items have only modest communalities, the results of component and common factor analysis can diverge markedly (Widaman, 1993). Hence, we investigate both forms of analysis.

Confirmatory factor analysis

Gerbing and Anderson (1988) demonstrate how CFA provides a rigorous assessment of dimensionality, testing for both internal and external consistency. It can also be used for item selection if a measurement model fails to fit. The location of large standardized residuals can be used to identify the item most inconsistent with the measurement model. However, it is not clear whether stepwise application of this criterion leads to the same modifications as other criteria.

Generalizability theory

In contrast, generalizability theory takes a domain sampling approach to measurement. The items that are expected to reflect a particular construct dimension are viewed as exchangeable. An observed difference in their performance in a set of data is presumed to be due to sampling error; it is not seen as substantive, and so does not in itself provide grounds for adding or dropping items.
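As a concrete contrast with the generalizability position, the following sketch implements the alpha-based criterion in its common alpha-if-item-deleted form for a single dimension. The data are simulated, and the stopping rule (stop when no deletion improves alpha) is an assumption for illustration, not the exact procedure of any study reviewed here.

```python
# The alpha-if-item-deleted rule for one dimension, as an assumed common variant
# of the Churchill (1979) criterion: repeatedly drop the item whose removal most
# improves coefficient alpha, stopping when no deletion helps. Data are simulated.
import numpy as np

def cronbach_alpha(x):
    """x: respondents-by-items matrix for a single dimension."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def refine_by_alpha(x):
    items = list(range(x.shape[1]))
    while len(items) > 2:
        base = cronbach_alpha(x[:, items])
        trials = [(cronbach_alpha(x[:, [j for j in items if j != i]]), i) for i in items]
        best_alpha, worst_item = max(trials)
        if best_alpha <= base:
            break                         # no deletion improves alpha: stop
        items.remove(worst_item)
    return items

rng = np.random.default_rng(1)
true_score = rng.normal(size=(150, 1))            # 150 respondents, one dimension
x = true_score + rng.normal(0, 0.7, size=(150, 4))
x[:, 3] = rng.normal(size=150)                    # a weak item, unrelated to the rest
print(refine_by_alpha(x))                         # typically retains items 0, 1, 2
```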
Current study

The study objectives are to compare the effects, if any, of adapting a scale for a particular context and of employing different item selection criteria for scale refinement on the psychometric quality of data collected using the resulting scales. The context chosen is the modification of SERVQUAL for use in a mystery shopper assessment of retailer performance. SERVQUAL has received considerable academic attention (Carman, 1990; Cronin & Taylor, 1992, 1994; Teas, 1994). We chose SERVQUAL because, as shown in Table 1, it has frequently been modified. Moreover, multiple objects of measurement are relevant in service quality research, so our investigation is not limited to any single purpose of measurement.

In particular, we conduct a two-phase study to investigate (i) the effect of adding a set of items to measure a dimension of retailer performance that is not captured by the generic service quality scale, and (ii) the effect of eliminating a set of items to narrow the domain embraced by the construct, and then to compare (iii) the effect of using four widely adopted item selection criteria, namely coefficient alpha, principal components, principal axis factor analysis, and CFA, with (iv) a generalizability approach, which drops or selects items at random because they are viewed as exchangeable. The measure of scale performance is the generalizability coefficient for each of the relevant objects of measurement.
Phase I: scale adaptation and refinement

Changes to the domain covered by the scale

We examine both a broadening and a narrowing of the content domain covered by the scale. First, SERVQUAL does not include items to assess the product assortment carried by a retailer, even though an outlet's product assortment is an important limitation on the service it can provide for customers. Therefore, we broaden the construct domain and scale item pool by adding four product related items to 20 current SERVQUAL items (Parasuraman et al., 1994). Second, we narrow the construct domain and scale item pool by eliminating the four tangibles items, as these might just as easily be considered measures of store atmosphere rather than service quality. Therefore, our narrower scale consists of the 16 SERVQUAL items for the dimensions of responsiveness, assurance, reliability and empathy. This 'intangibles-only' scale might be a purer measure of service quality than the full 20-item scale, and so might provide higher quality data for some purposes.

Data for scale refinement

Students enrolled in marketing classes at a large North American university provided data in return for receiving class credit. They were asked to evaluate the retail outlet in an on-campus convenience-oriented shopping mall they had shopped at most recently. Students were appropriate respondents because they comprise the majority of the mall's business. Responses for 24 perceptions-of-performance items (see Table 2) were collected using a 0-10 rating scale, with anchor labels of terrible and excellent. The data collection approach provided a total of 150 reports, spread over ten different retailers. Because SERVQUAL is a multidimensional scale, we used omega as an indicator of internal consistency reliability (Heise & Bohrnstedt, 1970). Omega for the 20-item SERVQUAL scale was .963. After adaptation, omega is .964 for the augmented 24-item scale and .955 for the 16-item intangibles-only scale.

Scale refinement results

To help identify any differential effects, we apply each item selection criterion (i.e., coefficient alpha, principal components, principal axis factor analysis, confirmatory factor analysis and random omission) to the full pool of 24 items. Table 2 provides a summary of which items are retained for each of the refinement approaches described next.

Best coefficient alpha for dimension

Application of the coefficient alpha criterion identified two items whose item-to-dimension-total correlation is low enough to reduce the internal consistency at a dimension level. The remaining 22 items for the coefficient alpha modified scale are identified in the second column in Table 2.
Table 2
Applying scale refinement approaches to the 24-item pool

[For each of the 24 items in the pool, Table 2 reports the a priori dimension assignment (A1-A6) and whether the item is retained, and on which factor it loads, under each refinement approach: best alpha for dimension (a), principal components with varimax rotation (b), principal axis factoring with oblimin rotation (c), and confirmatory factor analysis with the moderate (d) and aggressive (e) stopping criteria. The item pool, by a priori dimension, is:]

Tangibles (A1): modernness of equipment; attractiveness of physical facilities; staff neatness and professional look; attractive communication materials.
Responsiveness (A2): keeping customers informed; promptness of service; willingness to help; responsiveness to requests.
Assurance (A3): trustworthiness of staff; confidence instilled in customers; courteousness of staff; product knowledge of staff.
Reliability (A4): delivering on promises; dependability of services provided; accuracy of service delivered; service provision when promised.
Empathy (A5): individualized customer attention; care about customers; respect for the customer's interests; understanding of each customer's needs.
Products (A6): quality of products carried; selection of products carried; range of brands offered; variety of alternatives available.

a Items eliminated if their inclusion reduced coefficient alpha for the dimension.
b Items eliminated if loadings failed to reach a threshold of .707 after rotation.
c Items retained if loading exceeded .64 or more than twice the magnitude of the largest cross-loading.
d Items eliminated based on the sum of absolute standardized residuals until an insignificant test of close fit.
e Items eliminated based on the sum of absolute standardized residuals until an insignificant Chi-squared.
For this scale as a whole, coefficient omega is .963, equal to the value for the 24 items.

Principal components

We identified four components in the data, applying the most frequently used 'number of eigenvalues greater than one' criterion. After varimax rotation, most of the responsiveness, assurance and empathy items loaded on the first component. The three remaining components could be identified with the tangibles, reliability, and products dimensions. For this approach we retained all items with a loading of greater than .707 and with no cross-loading exceeding half the primary component loading. As shown in the third column in Table 2, of the 15 items that satisfy these requirements, six load on the general component, with three on each of the other components. Coefficient omega for this set of 15 items is a little lower, at .950.
Principal axis factor analysis

Using the scree test, principal axis factoring with oblimin rotation identified five factors, including separate responsiveness/assurance and empathy factors. For the principal axis approach we retain all items with primary loadings greater than .64 or twice the size of their largest cross-loadings. As shown in the fourth column in Table 2, 14 items satisfy these conditions, with four items loading on responsiveness/assurance, three each on tangibles and products, and two each on empathy and reliability. Coefficient omega for this set of 14 items is .944.

Confirmatory factor analysis

As might be expected given the results reported above, a six-factor measurement model failed to fit the 24 items (χ²(237) = 556.4, RMSEA = .095, CFI = .87). Better fitting measurement models were identified using a stepwise process, eliminating the item contributing most to lack of fit, as indicated by the sum of its absolute standardized residuals. Two stopping criteria were chosen to reflect alternative views as to what constitutes adequate fit. The more moderate criterion was the first failure to reject in a test of close fit (Browne & Cudeck, 1993). This occurs after seven items are eliminated from the six-factor measure. Given the exploratory nature of this process, the fit statistics for the six-factor, 17-item model (χ²(104) = 163.0, p = .0002, RMSEA = .062, p(close fit) = .14, CFI = .964) must be treated with caution. Coefficient omega for this scale is .945. Continuing the stepwise elimination until the first failure to reject the measurement model eliminates all items for the responsiveness factor. Fit occurs after eleven items are eliminated, leaving a five-factor, 13-item scale (χ²(55) = 69.96, p = .148, RMSEA = .036, CFI = .989). Coefficient omega for this smaller set of items is .927. The final column in Table 2 shows the measurement models for these two stopping criteria.

Generalizability theory

From a generalizability perspective, the four items per dimension of retailer performance should be exchangeable. Items capturing the same dimension should engender the same pattern of responses across all facets, which may be objects of measurement. Thus, any random subset of items drawn from an item pool should be equally effective for measurement purposes. To test this position, we created four scale variants from the 24-item scale by dropping one item per dimension. Using random assignment, a different one of the four items per dimension was dropped for each of the four scales. The omega coefficients for the resulting 18-item scales are .951, .951, .949, and .954.
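A minimal sketch of how such random variants can be generated, treating the four items per dimension as exchangeable; the item labels are hypothetical placeholders rather than the SERVQUAL wordings:

```python
# Random-deletion variants: items within a dimension are treated as exchangeable,
# so one of the four items per dimension is dropped at random to form each
# 18-item scale. Labels are placeholders, not the actual item wordings.
import numpy as np

rng = np.random.default_rng(2004)
pool = {
    "tangibles": ["T1", "T2", "T3", "T4"],
    "responsiveness": ["R1", "R2", "R3", "R4"],
    "assurance": ["S1", "S2", "S3", "S4"],
    "reliability": ["L1", "L2", "L3", "L4"],
    "empathy": ["E1", "E2", "E3", "E4"],
    "products": ["P1", "P2", "P3", "P4"],
}

def random_variant(pool, rng):
    """Drop one randomly chosen item per dimension, keeping three of four."""
    return {dim: sorted(set(items) - {rng.choice(items)}) for dim, items in pool.items()}

variants = [random_variant(pool, rng) for _ in range(4)]  # four 18-item scales
print(variants[0])
```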
Table 3 provides a concise summary of the results for coefficient omega.

Table 3
Comparison of scale modification approaches on consistency coefficients

Scale modification approach     Number of items   Coefficient omega   Coefficient alpha
Scale adaptation
  Product and SERVQUAL          24-item           .964                .942
  SERVQUAL only                 20-item           .963                .950
  Intangibles only              16-item           .955                .946
Scale refinement
  Best alpha for dimension      22-item           .963                .939
  Principal components          15-item           .950                .904
  Principal axis factor         14-item           .944                .892
  CFA (1)                       17-item           .945                .916
  CFA (2)                       13-item           .927                .879
Generalizability theory
  Random deletion (1)           18-item           .951                .925
  Random deletion (2)           18-item           .951                .922
  Random deletion (3)           18-item           .949                .919
  Random deletion (4)           18-item           .954                .925
  G-theory average              18-item           .951                .923
For comparison purposes, Table 3 also includes values for the more commonly reported coefficient alpha, recognizing it is not an appropriate measure for multidimensional scales (Cortina, 1993). Note that both coefficients are for the scaling of respondents and are not for cross-validation data. The various modifications have only minor effects on omega. All of the modified scales appear highly reliable, although omega generally increases with the number of retained items.
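For reference, a hedged sketch of omega in the Heise and Bohrnstedt (1970) sense, as the proportion of composite variance that is not unique item variance. The communalities would normally come from a factor analysis of the items; here the covariance matrix and communalities are hypothetical inputs, and this is one common formulation rather than the exact computation used in the study.

```python
# One common formulation of omega: the share of composite variance that is not
# unique item variance. Communalities (h^2) would come from a factor analysis;
# the covariance matrix and communalities below are hypothetical inputs.
import numpy as np

def omega(cov, communalities):
    cov = np.asarray(cov, dtype=float)
    h2 = np.asarray(communalities, dtype=float)
    unique_var = (np.diag(cov) * (1.0 - h2)).sum()  # unique (non-common) variance
    return 1.0 - unique_var / cov.sum()             # cov.sum() = composite variance

cov = [[1.00, 0.50, 0.40],
       [0.50, 1.00, 0.45],
       [0.40, 0.45, 1.00]]
print(round(omega(cov, communalities=[0.55, 0.60, 0.50]), 3))  # 0.763
```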
Phase II: cross-validation

Design

Our cross-validation data consist of evaluations of 16 retail stores conducted by six research assistants chosen from the mall's primary target market of students. The stores were located in the same on-campus mall used in the scale refinement phase. The four types of stores selected were coffee shops, takeaway food service stores without private seating, food service stores with private seating areas, and non-food convenience stores. The six 'mystery shoppers' first visited and evaluated two other stores to confirm their suitability for the observational task. They were then given individually counterbalanced schedules, such that their store visits were scheduled to occur at four different times of the day and on five different days of the week. Moreover, they followed a standardized procedure for each visit. The latter required them to observe the service provided to a minimum of three customers, then to approach and complete a transaction, including making a specific request for some additional service, such as requesting extra condiments for a sandwich, before completing their product and service quality evaluations. To be able to cross-validate all of the scales, the evaluations included all four items for each of the five SERVQUAL dimensions and four product items. Single-item measures of overall performance and of purchase intention were also collected as criteria for tests of predictive validity.

Analysis

The data available for analysis consist of ratings for all 24 items on each of 96 mystery shopping reports. The retailers and mystery shoppers are treated as random factors, as they are only a sample of the populations to which we wish to generalize and might examine in such studies. In an experimental design sense, retailers are crossed with shoppers and items. However, the items are nested within dimensions. Because the scale modification criteria resulted in an unequal number of items for each dimension, variance components are estimated using the restricted maximum likelihood (REML) method advocated for unbalanced data (Searle, Casella, & McCulloch, 1992). The eleven sources for which variance components can be estimated are shown in Table 4. Variance component estimates obtained for the full 24-item scale are shown in the first column of data in Table 4.
Table 4
Effect of scale modification approaches on variance components

Panel A scales (number of items): PSQ = product and SERVQUAL (24); SQ = SERVQUAL only (20); Int = intangibles only (16); Alpha = best alpha for dimension (22); PC = principal components (15); PAF = principal axis factor (14). Each cell reports the variance component estimate, with the percentage of total variance in parentheses.

Source of variance                  PSQ            SQ             Int            Alpha          PC             PAF
Retailers                           0.181 (5.96)   0.247 (7.97)   0.182 (6.71)   0.166 (5.46)   0.195 (6.25)   0.188 (6.10)
Shoppers                            0.651 (21.42)  0.705 (22.73)  0.782 (28.78)  0.654 (21.57)  0.639 (20.48)  0.666 (21.66)
Dimensions                          0.078 (2.55)   0.122 (3.94)   0.046 (1.68)   0.099 (3.27)   0.100 (3.21)   0.089 (2.91)
Items (dimensions)                  0.052 (1.73)   0.059 (1.88)   0.077 (2.85)   0.037 (1.20)   0.027 (0.86)   0.009 (0.30)
Retailers × shoppers                0.635 (20.89)  0.729 (23.50)  0.766 (28.21)  0.625 (20.60)  0.510 (16.35)  0.593 (19.28)
Retailers × dimensions              0.227 (7.46)   0.138 (4.44)   0.000 (0.01)   0.247 (8.14)   0.434 (13.91)  0.330 (10.73)
Shoppers × dimensions               0.091 (2.99)   0.042 (1.36)   0.000 (0.00)   0.103 (3.38)   0.124 (3.97)   0.134 (4.34)
Retailers × items (dimensions)      0.028 (0.92)   0.029 (0.93)   0.000 (0.00)   0.029 (0.97)   0.011 (0.35)   0.031 (1.01)
Shoppers × items (dimensions)       0.091 (2.99)   0.107 (3.44)   0.100 (3.67)   0.085 (2.79)   0.042 (1.36)   0.048 (1.56)
Retailers × shoppers × dimensions   0.305 (10.03)  0.186 (5.99)   0.084 (3.08)   0.321 (10.56)  0.486 (15.58)  0.429 (13.93)
Error                               0.701 (23.07)  0.739 (23.82)  0.680 (25.02)  0.670 (22.07)  0.552 (17.67)  0.559 (18.18)
Total                               3.038 (100)    3.103 (100)    2.717 (100)    3.034 (100)    3.121 (100)    3.077 (100)

Panel B scales (number of items): CFA1 = confirmatory factor analysis (1) (17); CFA2 = confirmatory factor analysis (2) (13); G1-G4 = generalizability theory random deletion variants (18 each).

Source of variance                  CFA1           CFA2           G1             G2             G3             G4
Retailers                           0.160 (5.46)   0.158 (5.48)   0.171 (5.54)   0.168 (5.54)   0.184 (6.21)   0.201 (6.56)
Shoppers                            0.683 (23.35)  0.681 (23.66)  0.616 (19.93)  0.662 (21.84)  0.674 (22.73)  0.649 (21.17)
Dimensions                          0.097 (3.31)   0.137 (4.75)   0.102 (3.31)   0.091 (2.99)   0.043 (1.46)   0.075 (2.46)
Items (dimensions)                  0.058 (1.98)   0.025 (0.87)   0.047 (1.51)   0.031 (1.01)   0.071 (2.41)   0.061 (1.99)
Retailers × shoppers                0.535 (18.28)  0.430 (14.93)  0.633 (20.48)  0.658 (21.71)  0.597 (20.12)  0.649 (21.18)
Retailers × dimensions              0.236 (8.05)   0.291 (10.09)  0.259 (8.39)   0.229 (7.57)   0.233 (7.84)   0.186 (6.06)
Shoppers × dimensions               0.096 (3.30)   0.089 (3.08)   0.124 (4.00)   0.098 (3.25)   0.090 (3.02)   0.053 (1.74)
Retailers × items (dimensions)      0.034 (1.14)   0.035 (1.22)   0.039 (1.24)   0.013 (0.42)   0.033 (1.12)   0.027 (0.88)
Shoppers × items (dimensions)       0.074 (2.54)   0.058 (2.01)   0.082 (2.66)   0.078 (2.58)   0.080 (2.71)   0.123 (4.00)
Retailers × shoppers × dimensions   0.365 (12.46)  0.393 (13.66)  0.330 (10.69)  0.267 (8.79)   0.284 (9.57)   0.340 (11.08)
Error                               0.589 (20.14)  0.583 (20.24)  0.689 (22.27)  0.736 (24.30)  0.677 (22.81)  0.702 (22.90)
Total                               2.925 (100)    2.879 (100)    3.092 (100)    3.030 (100)    2.967 (100)    3.064 (100)

Note. The number in parentheses after each scale name is the total number of items.
Table 5
Comparison of scale modification approaches on scaling shoppers, retailers, and retailers by dimension

Scale abbreviations as in Table 4. Each entry is the generalizability coefficient for the actual number of items and dimensions included in the scale.

By object of measurement       PSQ    SQ     Int    Alpha  PC     PAF    CFA1   CFA2   G1     G2     G3     G4
(A) GCshoppers                 .470   .463   .478   .474   .476   .471   .512   .542   .451   .464   .490   .460
(B) GCretailers                .194   .229   .178   .181   .200   .192   .192   .204   .181   .177   .203   .206
(C) GCretailers×dimensions     .318   .267   .001   .325   .406   .380   .287   .316   .312   .307   .309   .242
The associated percentage of variance for each source is shown in the second column. These variance components are then used to estimate the generalizability coefficients for three different objects of measurement. Mystery shoppers are the first objects of measurement. This is not because we expect managers to be likely to want to scale mystery shoppers. Rather, we do this because current academic practice routinely assesses the reliability of multi-item scales when scaling respondents, who in this application are the mystery shoppers. The two most obvious reasons for conducting mystery shopping are to compare a retailer's overall quality with that of other retailers, and to determine how a retailer's quality compared to other retailers varies by dimension of quality (see Finn & Kayande, 1997). Thus, we estimated generalizability coefficients for scaling shoppers, scaling retailers, and scaling retailers by dimensions. All generalizability coefficients are estimated for the number of items and dimensions in each scale and are reported in Table 5. The smaller these generalizability coefficients, the greater the additional sampling required, and the more costly it is, to produce a desired reliability of, say, .9.

Results

We first compare the variance components and generalizability coefficients³ for each group of modified scales. Then, we report a comparison of the managerially salient rank ordering of the retailers, and conclude with tests of the predictive validity of each scale.

³ We also used the monetary cost criterion suggested by Finn and Kayande (1997) for each of the refined scales, but found results similar to those presented here. In the interest of space, we decided not to include the cost results, but they can be obtained from the authors.

Adaptation scales

Adding and dropping dimensions generates substantial changes in variance components for several sources, including the main effect of shoppers (from 21.42 to 28.78%) and interactions involving dimensions, such as retailers by dimensions (from 7.46 to 0.01%) and retailers by shoppers by dimensions (from 10.03 to 3.08%). Both changes produce small increases in GCshoppers (.470 and .478, cf. .463) relative to the 20-item SERVQUAL scale, but result in decreases in GCretailers (.194 and .178, cf. .229). The effects on GCretailers×dimensions are even greater, such that it is impossible to reliably differentiate retailers' performance on dimensions after dropping the tangibles.

Refinement scales

By comparison, the variance components for the five refined scales look quite similar to each other, and to the full 24-item scale. All five refined scales have a smaller proportion of error variance (confounded with the highest order interaction) relative to the other scales. The differences in variance components are greater for the principal components scale and for the aggressive confirmatory factor analysis scale. These variance components translate into at least marginally larger GCshoppers for all five refined scales relative to the 24-item scale. The most reliable is the more aggressively modified confirmatory factor analysis scale, at .542 compared with .470 for the 24-item base case scale. The results for GCretailers are different, with lower values for three out of five of the refined scales compared to the 24-item scale. Two scales are a marginal improvement, with the aggressively modified confirmatory factor analysis scale the better of the two on GCretailers (.204, cf. .194). Three refined scales produce larger GCretailers×dimensions. The principal components scale performs best, reaching .406 compared with .318 for the original 24-item scale. Thus, retaining the items that loaded strongly on four orthogonal principal components cross-validates as a scale with substantially more distinct dimensions.

Generalizability theory scales

Table 4 shows the random deletion of one item per dimension has only relatively minor effects on the estimated variance components for the mystery shopping data. Indeed, these results provide evidence of the stability of variance components when estimated with varying amounts of data. The resulting scale generalizability coefficients indicate the variability to be expected at random, with some of the shorter 18-item scales more generalizable than the original 24-item scale for each of the objects of measurement. Interestingly enough, the GCretailers for one of these scales (.206) is even slightly better than the best of the refined scales (.204).
Table 6
Effect of modification and refinement item selection criteria on retailers' performance means and rankings

Each cell reports the store mean µ on the 0-10 scale, with the store's rank in parentheses. Scales (number of items): PSQ = product and SERVQUAL (24); SQ = SERVQUAL only (20); Int = intangibles only (16); PC = principal components (15); PAF = principal axis factor (14); Alpha = best alpha for dimension (22); CFA17 = CFA (17); CFA13 = CFA (13); RD1-RD4 = random deletion (18 each).

Store        PSQ        SQ         Int        PC         PAF        Alpha      CFA17      CFA13      RD1        RD2        RD3        RD4
Nonfood 1    8.74 (1)   8.66 (1)   8.58 (2)   8.91 (1)   8.85 (1)   8.75 (1)   8.78 (1)   8.82 (1)   8.75 (1)   8.73 (1)   8.75 (1)   8.71 (1)
Coffee 1     8.53 (2)   8.58 (2)   8.56 (3)   8.64 (2)   8.63 (2)   8.57 (2)   8.65 (2)   8.71 (2)   8.55 (2)   8.55 (2)   8.56 (2)   8.46 (2)
Coffee 2     8.40 (3)   8.37 (4)   8.27 (6)   8.51 (3)   8.46 (3)   8.40 (3)   8.42 (3)   8.45 (3)   8.40 (3)   8.40 (3)   8.44 (3)   8.38 (4)
Sitdown 1    8.34 (4)   8.51 (3)   8.69 (1)   8.06 (8)   8.17 (8)   8.33 (4)   8.32 (4)   8.21 (6)   8.31 (4)   8.39 (4)   8.27 (5)   8.40 (3)
Coffee 3     8.26 (5)   8.20 (5)   8.17 (9)   8.38 (4)   8.35 (4)   8.26 (5)   8.27 (5)   8.33 (4)   8.27 (5)   8.30 (5)   8.30 (4)   8.19 (5)
Coffee 4     8.18 (6)   8.17 (6)   8.26 (7)   8.24 (5)   8.28 (5)   8.20 (7)   8.22 (6)   8.20 (7)   8.19 (7)   8.25 (6)   8.17 (6)   8.12 (7)
Nonfood 2    8.17 (7)   8.06 (7)   8.41 (4)   8.15 (7)   8.23 (6)   8.22 (6)   8.22 (7)   8.12 (8)   8.19 (6)   8.22 (7)   8.11 (8)   8.17 (6)
Takeaway 1   8.12 (8)   8.05 (8)   7.99 (11)  8.15 (6)   8.18 (7)   8.12 (8)   8.20 (8)   8.29 (5)   8.08 (8)   8.21 (8)   8.13 (7)   8.07 (8)
Takeaway 2   7.82 (9)   7.84 (11)  8.15 (10)  7.65 (10)  7.82 (9)   7.81 (9)   7.75 (10)  7.68 (10)  7.84 (9)   7.87 (9)   7.81 (9)   7.75 (10)
Takeaway 3   7.76 (10)  8.02 (9)   8.31 (5)   7.45 (11)  7.64 (11)  7.78 (10)  7.75 (11)  7.67 (11)  7.74 (10)  7.85 (10)  7.69 (11)  7.78 (9)
Sitdown 2    7.75 (11)  7.92 (10)  7.86 (12)  7.70 (9)   7.73 (10)  7.74 (11)  7.76 (9)   7.73 (9)   7.70 (11)  7.81 (11)  7.73 (10)  7.75 (11)
Sitdown 3    7.51 (12)  7.81 (12)  8.21 (8)   7.19 (14)  7.36 (13)  7.53 (12)  7.65 (12)  7.55 (12)  7.51 (12)  7.52 (12)  7.41 (12)  7.62 (12)
Sitdown 4    7.40 (13)  7.40 (13)  7.75 (14)  7.29 (13)  7.35 (14)  7.45 (13)  7.48 (13)  7.45 (13)  7.44 (13)  7.43 (14)  7.37 (13)  7.38 (13)
Takeaway 4   7.37 (14)  7.27 (15)  7.35 (15)  7.35 (12)  7.40 (12)  7.37 (15)  7.41 (15)  7.44 (14)  7.37 (15)  7.46 (13)  7.37 (14)  7.27 (15)
Nonfood 3    7.33 (15)  7.31 (14)  7.76 (13)  7.19 (15)  7.28 (15)  7.42 (14)  7.41 (14)  7.39 (15)  7.39 (14)  7.34 (15)  7.27 (15)  7.31 (14)
Nonfood 4    6.43 (16)  6.08 (16)  6.35 (16)  6.52 (16)  6.53 (16)  6.49 (16)  6.60 (16)  6.71 (16)  6.42 (16)  6.48 (16)  6.52 (16)  6.31 (16)
Mean         7.88       7.89       8.04       7.87       7.87       7.89       7.93       7.91       7.88       7.93       7.87       7.85

Note. Stores are listed in order of overall mean score when scaled using all 24 items (product and service quality).
In summary, scale adaptation has more substantial effects on variance components and G-coefficients. Both scale adaptation and scale refinement have inconsistent effects across objects of measurement. For scale refinement, the aggressively modified confirmatory factor analysis scale, which is most effective for scaling shoppers, is marginally more effective for scaling retailers but not better than a simple random selection of items, and is worse for scaling retailers by dimensions.

Consistency of rank ordering of retailer performance

Managers who use mystery shopping are typically most interested in comparing the performance of different retail outlets. Therefore, an additional issue investigated is whether their conclusions depend on the scale modification approach. Table 6 shows the rank ordering of the retailers with each of the scales examined in this study. Comparison of the store rankings across the different scales shows that the modification approaches generally have a limited effect on the conclusions that would be made about the relative performance of the 16 stores. All but one of the scales produce a similar ordering of the stores, as evidenced by Spearman's rank-order correlations of above .9 for almost all pairs of scales (often closer to 1). The notable exception is the intangibles-only scale, where the narrower service domain gives substantially different rankings for a few retailers. In particular, a sit-down fast food outlet moves up from fourth for almost all other scales to be the top performer on intangibles. Its generally favorable overall performance rating is generated by exceptional personal service, but it is held back somewhat by relatively poor tangibles. Two other stores also move up in ranking on this scale, with similar implications. On the other hand, a coffee outlet and a fast-food outlet drop down in ranking on this scale, suggesting that it is their tangibles, more than their service itself, that favorably impress the mystery shoppers.

Predicting overall retailer performance and purchase intentions

Managers would prefer modifications that generate scales that can better predict overall retailer performance and purchase intentions. As tests of predictive validity, Table 7 reports the correlations between the retailer effect means on the various scales and the corresponding means for the single-item measures of overall retailer performance and purchase intention. Correlations are consistently stronger for overall retailer performance than for purchase intentions. The principal axis factor refined scale produces the strongest correlations with overall retailer performance and with purchase intentions. Random deletion again performs quite well, with one case producing the second strongest correlations. The intangibles-only and SERVQUAL scales fare worst.
Table 7
Correlation of retailer effect means with overall retailer performance and purchase intention

Scale modification approach      Overall performance   Purchase intention
Scale adaptation
  Product and SERVQUAL           .926                  .762
  SERVQUAL only                  .853                  .643
  Intangibles only               .739                  .536
Scale refinement
  Best alpha for dimension       .925                  .752
  Principal components           .936                  .791
  Principal axis factor          .948                  .816
  CFA (1)                        .919                  .755
  CFA (2)                        .941                  .789
Generalizability theory
  Random deletion (1)            .930                  .773
  Random deletion (2)            .930                  .766
  Random deletion (3)            .944                  .799
  Random deletion (4)            .913                  .736
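The two comparisons reported in this section reduce to familiar statistics, sketched below with SciPy: a Spearman rank-order correlation between the store means of two scale variants, and a Pearson correlation of retailer means with a single-item criterion. The first two vectors are the first eight store means from the PSQ and SQ columns of Table 6; the overall-performance criterion vector is hypothetical.

```python
# Spearman's rank-order correlation between store means from two scale variants,
# and the Pearson correlation of retailer means with a single-item criterion.
from scipy.stats import pearsonr, spearmanr

psq = [8.74, 8.53, 8.40, 8.34, 8.26, 8.18, 8.17, 8.12]  # Table 6, PSQ column
sq = [8.66, 8.58, 8.37, 8.51, 8.20, 8.17, 8.06, 8.05]   # Table 6, SQ column
overall = [8.9, 8.6, 8.4, 8.5, 8.2, 8.1, 8.0, 8.1]      # hypothetical criterion

rho, _ = spearmanr(psq, sq)        # agreement of the two store rankings
r, _ = pearsonr(psq, overall)      # predictive-validity style correlation
print(f"Spearman rho = {rho:.3f}; Pearson r = {r:.3f}")
```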
In summary, scale adaptation has far more substantial effects on predictive validity than does scale refinement, with principal axis factoring the best of the scale refinement methods.
Discussion

Scale modification is an important issue in the use of multi-item measures. Multi-item scales are often modified after development, being adapted for use in different contexts and refined to improve their psychometric performance. However, there has been no systematic investigation of the effects of scale adaptation and refinement. To cast some light on such scale modification, we first compared the effects of broadening and narrowing the domain sampled by the scale items, and then compared the commonly cited scale refinement criteria, namely maximizing coefficient alpha for each dimension, principal components with varimax rotation, principal axis factoring with oblique rotation, and confirmatory factor analysis. Random deletion, suggested by taking a generalizability theory perspective, is used as a comparison standard. Our data provide some interesting results that suggest divergence in the outcomes of these methods.

Scale adaptation for use in a specific application area is commonplace in marketing, with items being added to or dropped from a scale to reflect a context specific view of the dimensions of a construct, such as products and tangibles when measuring retailer service quality. Our examination of such adaptation found that it generated large differences. First, and perhaps contrary to naive expectations, broadening the construct domain to include product related items, because they are important for retailers, reduced the GCretailers. But, as a distinct dimension, it increased the GCretailers×dimensions, and improved predictive validity.
Second, narrowing the performance construct domain to intangibles increased the GCshoppers but reduced both GCretailers and GCretailers×dimensions. These results demonstrate how adaptation of a scale to improve performance when scaling respondents can adversely affect performance when scaling other objects of measurement. Similarly, the common practice of scale refinement based on an item's impact when scaling respondents (shoppers in our case) often results in changes to the scale that are counter-productive when the application area requires the scaling of other objects of measurement (here, retailers and retailers by dimensions). This is an important finding of our cross-validation research.

All of the scale refinement methods result in scales that performed better than the base 24-item scale when mystery shoppers are treated as the object of measurement. Not surprisingly, selecting items based on methods that make use of item intercorrelations across shoppers improves cross-validation scale performance when shoppers are again the objects of measurement. However, even in this case, three criteria perform marginally worse than the best of the random deletion scales. Only the two confirmatory factor scales perform better than all four of the random deletion scales. Thus, the most effective of the scale refinement methods, in terms of scaling shoppers, is confirmatory factor analysis with the more stringent, test-of-exact-fit stopping criterion. This is also the most rigorous of the extant scale refinement methods.

However, this stringent confirmatory factor analysis scale does not cross-validate as the best scale for other objects of measurement. When scaling retailers, the aggressive confirmatory factor analysis scale is the best of the five refined scales. But even it is marginally worse than one of the scales generated by random deletion of items, raising questions as to the likely reproducibility of the modest improvement in performance. When scaling retailers by dimensions, the principal components scale performs best, in this instance substantially ahead of all the other scales considered in this study. The method that emphasizes the identification of orthogonal dimensions generates scale dimensions that cross-validated with the greatest degree of independence.

Regardless of scale variant, the more managerially important GCretailers and GCretailers×dimensions are always substantially smaller than the corresponding generalizability coefficients for the managerially less important scaling of mystery shoppers. The typical scale modification analysis that is justified by a higher internal consistency coefficient when scaling respondents may provide no guide to the performance of the scale for more important objects of measurement. Moreover, the high values of omega (and Cronbach's alpha) that were reported in Table 3 are obviously misleading, as none of the cross-validated generalizability coefficients comes anywhere near those values, even when scaling shoppers.

Because generalizability theory 'blurs' the distinction between reliability and validity coefficients when sources of variation are random (see Brennan, 2001, p. 135), generalizability coefficients reflect some aspects of validity, and our tests of predictive validity provide evidence of nomological validity. However, our work does not address all aspects of construct validity, such as the effects on discriminant validity. Our results are also limited in the sense that they only provide evidence for the adaptation and refinement of one scale in one applied research context. Further research is needed to establish whether our results are peculiar to this scale and context or whether what we find is a generalizable conclusion, and to investigate the impact on other aspects of construct validity.

Finally, it is important to recognize that we are not advocating a random selection of items, even though a scale with randomly deleted items performed marginally better than the best of the refined scales for scaling retailers. What we are saying, however, is that (1) modified scales need to be assessed for the purpose to which they will be put, and (2) scale adaptation (some change of wording or a substantive deletion/addition of an item) seems to have a greater impact on scale performance than traditional scale refinement methods. As we stated previously, random deletion serves as a comparison standard, and that is how it should be viewed. It is not difficult to understand why scale refinement methods cross-validate no better than some random deletions: a published scale has typically gone through a purification process, selecting items that are more similar to each other. This is certainly true in terms of the ability of the SERVQUAL items to scale retailers, as evidenced by the small interaction variance components for retailers by items across all modification methods.
Conclusions

Multi-item marketing scales are often modified after development, being adapted for use in a specific context and refined to improve psychometric properties. To investigate the effects of such scale modifications, we adapted the influential SERVQUAL scale for use in the measurement of retail service quality by mystery shoppers and examined the effects of commonly used scale refinement criteria. Taking a generalizability theory approach to the assessment of the psychometric properties of each of the resulting scales when used by mystery shoppers to assess retail stores, we find:

i. The scale adaptation modifications we investigated have a far greater impact on scale performance than the scale refinement modifications.

ii. Contrary to naive expectations, simply adding items to measure product assortment, a new content area specific to retailers that was not included in the original scale, did not improve scale performance when scaling retailers. It did make a reliable scaling of retailers by dimensions easier, because it adds to the heterogeneity of the dimensions.
iii. Removing items from the tangibles content area, which might be considered inappropriate on conceptual grounds, improves scale performance when scaling mystery shoppers. However, the resulting more homogeneous scale performs more poorly when scaling retailers and could not scale retailers by dimensions.

iv. All of the refinement criteria we examined (coefficient alpha, principal components, principal axis factoring, and moderate or aggressive confirmatory factor analysis) produce scales with improved performance when scaling mystery shoppers. Because confirmatory factor analysis is by far the most effective, it is recommended for refining a scale intended for scaling shoppers.

v. No such clear result or recommendation is possible when refining the scale for scaling retailers. The more aggressive confirmatory factor analysis performs best of the refinement methods, but it does not improve performance when scaling retailers beyond what a random deletion process achieves, so there appears to be little point in using current refinement methods for this purpose. The principal components criterion produces the scale that performed best when scaling retailers by dimensions.

vi. Cross-validating using a generalizability coefficient criterion leads to quite different conclusions about the value of scale modifications, whether scaling retailers or mystery shoppers. All of the scales give coefficient omega values high enough for applied research, yet the generalizability coefficients for all of the scales are so small that an extensive sampling of items, dimensions, and retailers or shoppers is needed for reliable scaling.

In summary, our contribution is two-fold. First, we review and discuss commonly used approaches to the modification of multi-item reflective scales in marketing. To our knowledge, this paper is the first to focus marketing attention on the important issue of scale modification. We also discuss scale performance measures and suggest a measure of scale performance that takes into account the managerial purpose of measurement. We find that scale adaptation, which is based only on face validity claims, can have a far greater impact than more formal scale refinement methods, which are often no better than random deletion for some purposes. We also find that the most rigorous scale refinement method, aggressive CFA, produces the best refinement outcome, but only for the purpose of scaling respondents. Although aggressive CFA is also the best refinement method for scaling retailers, it would be misleading to recommend it strongly for that purpose, because the modified scale performed no better than random deletion; nor does CFA produce the best scale for scaling retailers by dimension, an important applied problem. Our second contribution is to provide the first comprehensive empirical comparison of scale modification approaches, using cross-validation with all items. Our findings suggest serious problems with assuming the multi-purpose adequacy of scale modification approaches: an approach that produces a more reliable scale for one purpose might produce a less reliable scale for another. Given the nature of our findings, the methods used for modifying scales warrant further investigation, especially when the purpose of measurement is other than scaling respondents. Assessing the performance of modified scales without taking the different purposes of measurement into account can be misleading; a modified scale must therefore be assessed for the purpose to which it will be put.
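To make concrete how one design can be reliable for one purpose and unreliable for another (findings v and vi), the sketch below computes generalizability coefficients for two objects of measurement in a crossed retailers × shoppers × items design. This is a minimal illustration, not our estimation code: the variance components, facet sizes, and function names are hypothetical, and in an actual G-study the components would be estimated from data (e.g., by ANOVA or REML; Cronbach et al., 1972; Brennan, 2001).

```python
def g_coefficient(universe_var, error_terms):
    """Generalizability coefficient: universe-score variance divided by
    universe-score variance plus relative error, where each error
    component is divided by the number of conditions averaged over."""
    rel_error = sum(var / n for var, n in error_terms)
    return universe_var / (universe_var + rel_error)

# Hypothetical variance components for a retailers x shoppers x items design
var_retailer = 0.05   # retailer universe-score variance
var_shopper  = 0.40   # shopper universe-score variance
var_ri       = 0.10   # retailer x item interaction
var_rs       = 0.20   # retailer x shopper interaction
var_si       = 0.15   # shopper x item interaction
var_resid    = 0.60   # residual (three-way confound plus error)

n_items, n_shoppers, n_retailers = 22, 30, 20

# Scaling retailers: shoppers and items are the facets averaged over
g_retailers = g_coefficient(var_retailer, [
    (var_ri, n_items),
    (var_rs, n_shoppers),
    (var_resid, n_items * n_shoppers),
])

# Scaling shoppers: retailers and items are the facets averaged over
g_shoppers = g_coefficient(var_shopper, [
    (var_si, n_items),
    (var_rs, n_retailers),
    (var_resid, n_items * n_retailers),
])

print(f"G, scaling retailers: {g_retailers:.2f}")  # ~0.80 with these values
print(f"G, scaling shoppers:  {g_shoppers:.2f}")   # ~0.96 with these values
```

With a small retailer universe-score variance, as in this illustration, reliable scaling of retailers depends on extensive sampling of shoppers and items, which is the substance of finding vi.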
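Finding iv refers to drop-item refinement driven by coefficient alpha. The sketch below, again hypothetical rather than our procedure, implements the common greedy variant (repeatedly drop the item whose removal most increases alpha; cf. Kopalle & Lehmann, 1997) and compares it against random deletion on simulated data. The alpha it maximizes speaks only to the scaling of respondents, which is why such refinement can be no better than random deletion for other purposes.

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha for an (observations x items) score matrix."""
    k = X.shape[1]
    item_var_sum = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def refine_by_alpha(X, n_drop):
    """Greedy drop-item refinement: at each step, drop the item whose
    removal most increases coefficient alpha."""
    keep = list(range(X.shape[1]))
    for _ in range(n_drop):
        weakest = max(keep, key=lambda j: cronbach_alpha(
            X[:, [i for i in keep if i != j]]))
        keep.remove(weakest)
    return keep

# Simulated data: one true score, items of varying quality (hypothetical)
rng = np.random.default_rng(1)
n_obs, n_items = 200, 10
true_score = rng.normal(size=(n_obs, 1))
loadings = np.linspace(0.3, 1.0, n_items)      # weak to strong items
X = true_score * loadings + rng.normal(size=(n_obs, n_items))

kept = refine_by_alpha(X, n_drop=4)
random_kept = sorted(rng.choice(n_items, size=len(kept), replace=False))

print(f"alpha, full scale:      {cronbach_alpha(X):.3f}")
print(f"alpha, alpha-refined:   {cronbach_alpha(X[:, kept]):.3f}")
print(f"alpha, random deletion: {cronbach_alpha(X[:, random_kept]):.3f}")
```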
References

Akan, Perran. (1995). Dimensions of service quality: A study in Istanbul. Managing Service Quality, 5(6), 39–43.
Babin, Barry J., Darden, William R., & Griffin, Mitch. (1994). Work and/or fun: Measuring hedonic and utilitarian shopping value. Journal of Consumer Research, 20(March), 644–656.
Barnes, S. J., & Vidgen, Richard. (2001). An evaluation of cyberbookshops: The WebQual method. International Journal of Electronic Commerce, 6(1), 11–30.
Bearden, William O., & Netemeyer, Richard G. (1999). Handbook of marketing scales: Multi-item measures for marketing and consumer behavior research (2nd ed.). Newbury Park, CA: Sage.
Bojanic, David C. (1991). Quality measurement in professional service firms. Journal of Professional Services Marketing, 7(2), 27–36.
Bouman, Marcel, & van der Wiele, Ton. (1992). Measuring service quality in the car service industry: Building and testing an instrument. International Journal of Service Industry Management, 3(4), 4–16.
Brennan, Robert L. (2001). Generalizability theory. New York: Springer-Verlag.
Brown, Reva Berman, & Bell, Louise. (1998). Patient-centred audit: A users' quality model. Managing Service Quality, 8(2), 88–96.
Browne, Michael W., & Cudeck, Robert. (1993). Alternative ways of assessing model fit. In Kenneth A. Bollen & J. Scott Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Carman, James M. (1990). Consumer perceptions of service quality: An assessment of the SERVQUAL dimensions. Journal of Retailing, 66(Spring), 33–55.
Churchill, Gilbert A., Jr. (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16, 64–73.
Cook, Colleen, & Thompson, Bruce. (2000). Psychometric properties of scores from the Web-based LibQUAL+ study of perceptions of library service quality. Library Trends, 49(Spring), 585–604.
Cortina, Jose M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(February), 98–104.
Cronbach, Lee J., Gleser, Goldine C., Nanda, Harinder, & Rajaratnam, Nageswari. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley & Sons.
Cronin, J. Joseph, & Taylor, Steven A. (1992). Measuring service quality: A reexamination and extension. Journal of Marketing, 56(July), 55–68.
Cronin, J. Joseph, & Taylor, Steven A. (1994). SERVPERF versus SERVQUAL: Reconciling performance-based and perceptions-minus-expectations measurement of service quality. Journal of Marketing, 58(January), 125–131.
Dabholkar, Pratibha A., Thorpe, Dayle I., & Rentz, Joseph O. (1996). A measure of service quality for retail stores: Scale development and validation. Journal of the Academy of Marketing Science, 24(Winter), 3–16.
de Ruyter, Ko, & Wetzels, Martin. (1997). On the perceived dynamics of retail service quality. Journal of Retailing and Consumer Services, 4(April), 83–88.
Deshpande, Rohit, Farley, John U., & Webster, Frederick E., Jr. (1993). Corporate culture, customer orientation, and innovativeness. Journal of Marketing, 57(January), 23–37.
Engelland, B. T., Workman, L., & Singh, M. (2000). Ensuring service quality for campus career services centers: A modified SERVQUAL scale. Journal of Marketing Education, 22(December), 236–245.
Finn, Adam, & Kayande, Ujwal. (1997). Reliability assessment and optimization of marketing measurement. Journal of Marketing Research, 34(May), 262–275.
Finn, Adam, & Kayande, Ujwal. (1999). Unmasking a phantom: A psychometric assessment of mystery shopping. Journal of Retailing, 75(Summer), 195–215.
Floyd, Frank J., & Widaman, Keith F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286–299.
Freeman, Kim D., & Dart, Jack. (1993). Measuring the perceived quality of professional business services. Journal of Professional Services Marketing, 9(1), 27–47.
Frochot, Isabelle, & Hughes, Howard. (2000). HISTOQUAL: The development of a historic houses assessment scale. Tourism Management, 21, 157–167.
Gerbing, David W., & Anderson, James C. (1988). An updated paradigm for scale development incorporating unidimensionality and its assessment. Journal of Marketing Research, 25, 186–192.
Heise, D. R., & Bohrnstedt, G. W. (1970). Validity, invalidity and reliability. In E. F. Borgatta & G. W. Bohrnstedt (Eds.), Sociological methodology (pp. 104–129). San Francisco: Jossey-Bass.
Hoxley, M. (2000). Measuring UK construction professional service quality: The what, how, when and who. International Journal of Quality & Reliability Management, 17(4/5), 511–526.
Johns, Nick, & Tyas, Phil. (1996). Use of service quality gap theory to differentiate between foodservice outlets. The Service Industries Journal, 16(July), 321–346.
Kettinger, William J., & Lee, Choong C. (1994). Perceived service quality and user satisfaction with the information services function. Decision Sciences, 25(5/6), 737–766.
Khatabi, Abod Ali, Ismail, Hishamuddin, & Thyagarajan, Venu. (2002). What drives customer loyalty: An analysis from the telecommunications industry. Journal of Targeting, Measurement and Analysis for Marketing, 11(September), 34–44.
Kohli, Ajay K., Jaworski, Bernard J., & Kumar, Ajith. (1993). MARKOR: A measure of market orientation. Journal of Marketing Research, 30(November), 467–477.
Kopalle, Praveen K., & Lehmann, Donald R. (1997). Alpha inflation? The impact of eliminating scale items on Cronbach's alpha. Organizational Behavior and Human Decision Processes, 70(October), 189–197.
Lam, Terry, & Zhang, Hanqin Qiu. (1999). Service quality of travel agents: The case of travel agents in Hong Kong. Tourism Management, 20(June), 341–349.
Lim, Puay Cheng, & Tang, Nelson K. H. (2000). A study of patients' expectations and satisfaction in Singapore hospitals. International Journal of Health Care Quality Assurance, 13(7), 290–299.
Mehta, Subhash C., Lalwani, Ashok K., & Han, Soon Li. (2000). Service quality in retailing: Relative efficiency of alternative scales for different product-service environments. International Journal of Retail & Distribution Management, 28(2), 62–72.
Nelson, Susan Logan, & Nelson, Theron R. (1995). RESERV: An instrument for measuring real estate brokerage service quality. Journal of Real Estate Research, 10(1), 99–113.
Oldfield, Brenda M., & Baron, Steve. (2000). Student perceptions of service quality in a UK university business and management faculty. Quality Assurance in Education, 8(1), 85–95.
Othman, Abdul Qawi, & Owen, Lynn. (2001). Adopting and measuring customer service quality (SQ) in Islamic Banks: A case study in Kuwait Finance House. International Journal of Islamic Financial Services, 3(April–June), 1–26.
Parasuraman, A., Berry, Leonard L., & Zeithaml, Valarie A. (1991). Refinement and reassessment of the SERVQUAL scale. Journal of Retailing, 67(Winter), 420–450.
Parasuraman, A., Zeithaml, Valarie A., & Berry, Leonard L. (1988). SERVQUAL: A multiple-item scale for measuring consumer perceptions of service quality. Journal of Retailing, 64(Spring), 12–40.
Parasuraman, A., Zeithaml, Valarie A., & Berry, Leonard L. (1994). Alternative scales for measuring service quality: A comparative assessment based on psychometric and diagnostic criteria. Journal of Retailing, 70(January), 201–230.
Peter, J. Paul. (1979). Reliability: A review of psychometric basics and recent marketing practices. Journal of Marketing Research, 16, 6–17.
Philip, George, & Hazlett, Shirley-Ann. (2001). Evaluating the service quality of information services using a new "P-C-P" attributes model. The International Journal of Quality & Reliability Management, 18(8/9), 900–916.
Philip, George, & Stewart, Jonathan. (1999). Involving mental health service users in evaluating service quality. International Journal of Health Care Quality Assurance, 12(5), 199–209.
Philip, George, & Stewart, Jonathan. (1999). Assessment of the service quality of a cancer information service using a new P-C-P attributes model. Managing Service Quality, 9(3), 167–179.
Rentz, Joseph O. (1987). Generalizability theory: A comprehensive method for assessing and improving the dependability of marketing measures. Journal of Marketing Research, 24, 19–28.
Richard, M. D., & Alloway, A. W. (1993). Service quality attributes and choice behavior. Journal of Services Marketing, 7(1), 28–36.
Rossiter, John R. (2002). The C-OAR-SE procedure for scale development in marketing. International Journal of Research in Marketing, 19(December), 305–335.
Searle, Shayle R., Casella, George, & McCulloch, Charles E. (1992). Variance components. New York: Wiley-Interscience.
Smith, Gregory T., & McCarthy, Denis M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7(September), 300–308.
Spiro, Rosann L., & Weitz, Barton A. (1990). Adaptive selling: Conceptualization, measurement, and nomological validity. Journal of Marketing Research, 27(February), 61–70.
Steenkamp, Jan-Benedict E. M., & van Trijp, Hans C. M. (1991). The use of LISREL in validating marketing constructs. International Journal of Research in Marketing, 8(November), 283–299.
Teas, Kenneth R. (1994). Expectations as a comparison standard in measuring service quality: An assessment of a reassessment. Journal of Marketing, 58(January), 132–139.
Vandamme, R., & Leunis, J. (1993). Development of a multiple-item scale for measuring hospital service quality. International Journal of Service Industry Management, 4(3), 30–49.
Vazquez, Rodolfo, Rodriguez-Del Bosque, Ignacio A., Diaz, Ana Ma, & Ruiz, Agustin V. (2000). Service quality in supermarket retailing: Identifying critical service experiences. Journal of Retailing and Consumer Services, 8(January), 1–14.
Velicer, Wayne F. (1977). An empirical comparison of the similarity of principal components, image and factor patterns. Multivariate Behavioral Research, 12(1), 3–22.
Velicer, Wayne F., & Jackson, Douglas N. (1990). Component analysis versus common factor analysis: Issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25(1), 1–28.
Webster, Calum, & Hung, Li-Chu. (1994). Measuring service quality and promoting decentring. The TQM Magazine, 6(5), 50–55.
Weekes, David W., Scott, Mark E., & Tidwell, Paula M. (1996). Measuring quality and client satisfaction in professional business services. Journal of Professional Services Marketing, 14(2), 25–37.
Widaman, K. F. (1993). Common factor analysis versus principal components analysis: Differential bias in representing model parameters? Multivariate Behavioral Research, 28(3), 263–311.
Wong, O. M., Dean, A. A. M., & White, C. J. (1999). Analysing service quality in the hospitality industry. Managing Service Quality, 9(2), 136–143.
Further reading

Lapierre, Jozée, & Filiatrault, Pierre. (1994). Professional services as perceived by organizational customers. Paper presented at the Recent Advances in Retailing and Services Science Conference, Lake Louise, Canada.
Sureshchandar, G. S., Rajendran, Chandrasekharan, & Anantharaman, R. N. (2002). Determinants of consumer-perceived service quality: A confirmatory factor analysis approach. Journal of Services Marketing, 16(1), 9–34.