Evaluation and Program Planning, Vol. 6, pp. 247-263, 1983
Printed in the USA. All rights reserved.
0149-7189/83 $3.00 + .00
Copyright © 1984 Pergamon Press Ltd

DEFINING AND MEASURING PATIENT SATISFACTION WITH MEDICAL CARE

JOHN E. WARE, JR., MARY K. SNYDER, W. RUSSELL WRIGHT, AND ALLYSON R. DAVIES

The Rand Corporation
ABSTRACT

This paper describes the development of Form II of the Patient Satisfaction Questionnaire (PSQ), a self-administered survey instrument designed for use in general population studies. The PSQ contains 55 Likert-type items that measure attitudes toward the more salient characteristics of doctors and medical care services (technical and interpersonal skills of providers, waiting time for appointments, office waits, emergency care, costs of care, insurance coverage, availability of hospitals, and other resources) and satisfaction with care in general. Scales are balanced to control for acquiescent response set. Scoring rules for 18 multi-item subscales and eight global scales were standardized following replication of item analyses in four field tests. Internal-consistency and test-retest estimates indicate satisfactory reliability for studies involving group comparisons. The PSQ well represents the content of characteristics of providers and services described most often in the literature and in response to open-ended questions. Empirical tests of validity have also produced generally favorable results.
The Patient Satisfaction Questionnaire (PSQ) was developed at Southern Illinois University (SIU) School of Medicine during a study funded by the National Center for Health Services Research and Development. The major goals of the SIU project were to develop a short, self-administered satisfaction survey that would be applicable in general population studies and would yield reliable and valid measures of concepts that had both theoretical and practical importance to the planning, administration, and evaluation of health services delivery programs. The SIU work led to the development and testing of numerous instruments, including several patient satisfaction questionnaires as well as measures of the importance placed on different features of medical care services. We summarize here the conceptual work and empirical results from the SIU studies that have been available only in technical reports (Ware, Snyder, & Wright, 1976a, 1976b). We focus on Form II, which has proven to be the most comprehensive and reliable version of the PSQ.

CONCEPTUALIZING PATIENT SATISFACTION

In theory, a patient satisfaction rating is a personal evaluation of health care services and providers. It is wrong to equate all information derived from patient surveys with patient satisfaction (Ware, 1981). For example, patient satisfaction ratings are distinct from reports about providers and care. Reports are intentionally more factual and objective. Satisfaction ratings are intentionally more subjective; they attempt to capture a personal evaluation of care that cannot be known by observing care directly. For example, patients can be asked to report the length of time spent with their provider or to rate whether they were given enough time. Although satisfaction ratings are sometimes criticized because they do not correspond perfectly with objective reality or with the perceptions of providers or administrators of care, this is their unique strength. They bring new information to the satisfaction equation. We believe that differences in satisfaction mirror the realities of care to a substantial extent; these differences also reflect personal preferences as well as expectations (see Ware et al., 1976b, pp. 433-463, 607-622).
This research and preparation of this manuscript were supported by the National Center for Health Services Research and Development and by the Health Insurance Study grant from the Department of Health and Human Services. Reprint requests and inquiries should be sent to John E. Ware, Jr., Behavioral Sciences Department, The Rand Corporation, 1700 Main Street, Santa Monica, CA 90406.
Thus, a patient satisfaction rating is both a measure of care and a measure of the patient who provides the rating. During the development of the PSQ, we attempted to determine what satisfaction ratings measure: features of the care or features of the patient. This distinction is important for studies that attempt to use satisfaction ratings as a source of information about specific aspects of care. Specifically, when dissatisfaction is detected, should care be changed or should patients be changed (i.e., their expectations, preferences, and standards) to increase satisfaction? During field tests of Form I of the PSQ, we measured separately the importance placed on each characteristic of doctors and services described by PSQ items. We also measured independently how often each characteristic was observed or experienced. We noted significant effects of patient expectations and value preferences on satisfaction ratings. These effects, however, proved to be of more theoretical than practical interest because they were small relative to the impact of experiences reported by patients. For example, the length of time a patient had to wait to see a doctor determined satisfaction with office waits substantially more than expectations or preferences for short and long office waits. Hence, a satisfaction rating seems to be much more a measure of care than it is a measure of the patient, although the latter is a part of the message.

Another important conceptual issue is the nature and number of dimensions of patient satisfaction. As described below, we attempted to build a taxonomy of these characteristics that would provide a framework for classifying the content of satisfaction measures and for evaluating the content validity of the PSQ. The taxonomy we have derived during studies of the PSQ posits that several different characteristics of providers and medical care services influence patient satisfaction, and that patients develop distinct attitudes toward each of these characteristics. Brief definitions of each dimension appear below, along with examples of item content:

Interpersonal manner: features of the way in which providers interact personally with patients (e.g., concern, friendliness, courtesy, disrespect, rudeness).

Technical quality: competence of providers and adherence to high standards of diagnosis and treatment (e.g., thoroughness, accuracy, unnecessary risks, making mistakes).

Accessibility/convenience: factors involved in arranging to receive medical care (e.g., time and effort required to get an appointment, waiting time at office, ease of reaching care location).

Finances: factors involved in paying for medical services (e.g., reasonable costs, alternative payment arrangements, comprehensiveness of insurance coverage).

Efficacy/outcomes: the results of medical care encounters (e.g., helpfulness of medical care providers in improving or maintaining health).

Continuity: sameness of provider and/or location of care (e.g., see same physician).

Physical environment: features of setting in which care is delivered (e.g., orderly facilities and equipment, pleasantness of atmosphere, clarity of signs and directions).

Availability: presence of medical care resources (e.g., enough hospital facilities and providers in area).

The preceding order of these dimensions reflects the relative frequency of their inclusion in studies of patient satisfaction before the PSQ. The first four (interpersonal manner, technical quality, accessibility/convenience, and finances) were by far the most commonly measured features of care in patient satisfaction studies.

RESEARCH STRATEGY AND DATA SOURCES

The strategy for developing and testing the PSQ focused on improving the reliability and validity of items and multi-item scales and reducing the costs (dollar and time) required for their administration. That process began with a survey (the Seven-County Study) that included over 900 items administered in person by trained interviewers (Chu, Ware, & Wright, 1973; Ware, Wright, Snyder, & Chu, 1975). Ultimately, Form II of the PSQ was much shorter and was self-administered with success (Ware et al., 1976a).

Of necessity, the research began without an agreed-upon conceptual framework for defining and measuring patient satisfaction and with many unanswered questions about methodological issues. Hence, the instruments were field tested over a 4-year period in an iterative process that included formulations of models of the dimensions of patient satisfaction, construction of measures of those dimensions, empirical tests of the measures and models, and refinements in both. This iterative process included 12 studies of patient satisfaction; some involved secondary analyses of data provided by others.¹

¹We gratefully acknowledge the cooperation of individuals who provided satisfaction questionnaire data for analysis, including: Barbara Hulka and John Cassel at the University of North Carolina, James Greenley and Richard Schoenherr at the University of Wisconsin, and LuAnn Aday and Ronald Andersen at the University of Chicago.
Studies of Form II of the PSQ were replicated in four independent field tests, including three general population household surveys (East St. Louis, Illinois; Sangamon County, Illinois; and Los Angeles County, California) and a survey of patients enrolled in a family practice center (Springfield, Illinois). Sample sizes in these four sites ranged from 323 to 640, and their sociodemographic characteristics varied considerably. Two thirds to four fifths of the respondents in the East St. Louis, Sangamon County, and Family Practice samples were women; in Los Angeles County, almost two thirds were men. In East St. Louis, 90% of respondents were nonwhite; in Sangamon County, 3% were nonwhite, and in Los Angeles County, 35%. (Data on race were not obtained for the Family Practice sample.) Median age in the three household surveys was about 45 years; the Family Practice sample was younger, with a median age of 32 years. Median annual family incomes (in 1974 dollars) ranged from a low of $5,400 in East St. Louis to $9,500 in Los Angeles and approximately $12,000 in the Sangamon County and Family Practice samples. Median educational levels were close to 12 years in three samples; the median was 14 years in the Family Practice sample. In summary, the samples ranged from a chiefly nonwhite and socioeconomically disadvantaged sample in East St. Louis to predominantly white, middle-class samples in Sangamon County and the Family Practice center.
Content of the PSQ

Our research began by formulating hypotheses about the nature and number of specific characteristics of providers and medical care services that should be represented by PSQ items to achieve content validity. An
outline of satisfaction constructs was developed from the content of available instruments, published books and articles from the health services research literature, and the responses of convenience samples of persons to open-ended questions about their experiences with doctors and medical care services. The latter studies were designed to generate new items. We sought to achieve a comprehensive specification of patient satisfaction constructs and a good understanding of the words people actually use when they talk about medical care services. This knowledge helped in choosing the specific vernacular used to construct PSQ items.

The item-generation studies consisted of three tasks: (a) making sentence fragments into statements of opinion about medical care (e.g., write favorable and unfavorable opinions using the words cost of care); (b) writing comments about the most- and least-liked aspects of medical care; and (c) composing and discussing, in group sessions, statements of opinion that reflected favorable and unfavorable sentiments about medical care. These three tasks yielded a pool of approximately 2,300 items, which were sorted into content
categories by independent judges. The resulting content outline and constructs identified from other instruments and the literature were integrated into a taxonomy on which we based initial hypotheses about the nature and number of satisfaction constructs. Redundancies and ambiguities were identified, and the item pool was reduced to about 500 edited items, each describing only one characteristic of medical care services.

Data-Gathering and Other Methodological Considerations
A number of methodological studies addressed questions about data-gathering methods, the structure of PSQ items, instructions to respondents, and other procedural issues. Some decisions were made after reviewing the literature and consulting with experts; other decisions were made after formal study. These decisions are explained and selected methodological results are summarized in the following paragraphs. References are provided to the more complete documentation of results from our studies of methodological issues.

Choice of Likert-Type Items. A standardized patient satisfaction item has two parts: the item stem and the response scale. The item stem describes a specific feature of care or care in general. The response scale defines the choices used to evaluate that feature. PSQ item stems, response choices, and scoring rules were standardized to facilitate administration and to maximize reliability and validity. We chose the traditional approach to attitude measurement in which the item is structured as a statement of opinion, such as "It's hard to get an appointment for medical care right away," and response choices range from strongly agree to strongly disagree. Several different questionnaire formats were tested. The format we recommend places the precoded responses to the right of the items, and labels these responses at the head of each page, as shown in Figure 1.

In general population studies designed to measure satisfaction with the respondent's total medical care experience, instructions are offered as follows:

    On the following pages are some statements about medical care. Please read each one carefully, keeping in mind the medical care you are receiving now. If you have not received medical care recently, think about what you would expect if you needed care today. On the line next to each statement circle the number for the opinion which is closest to your own view.

These instructions are followed by an example and further explanation of how to use the response scale. The instructions end with the following:
~1 I’m very satisfied with the medical care I receive
Figure 1. PSQ Item Format.
    Some statements look similar to others, but each statement is different. You should answer each statement by itself. This is not a test of what you know. There are no right or wrong answers. We are only interested in your opinions or best impression. Please circle only one number for each statement.

This traditional Likert-type approach has several advantages. First, use of identical response scales for all items facilitates the task of completing a survey. Once respondents become familiar with the response choices, they can listen to or read each item stem and quickly indicate their response. When choices differ from item to item, more time and effort is involved. Second, it is usually easier to format a questionnaire when the same response choices are used for each item. Such questionnaires can often be printed on fewer pages. Third, we found it easier to revise items with the goal of changing the distribution of item responses (e.g., reduce skewness) when item stems were structured as statements of opinion. Examples of how PSQ items were reworded in more favorable or more unfavorable terms to manipulate response distributions are reported in detail elsewhere (Ware et al., 1976a, pp. 171-179). This manipulation was also done for items structured as questions about satisfaction with response choices that defined levels of satisfaction, although it was more difficult and frequently required awkward wording.

Number of Response Choices. A key assumption underlying our work was that satisfaction itself is a continuum. Our goal in choosing an item response scale was, therefore, that the responses should place people as precisely as possible along that continuum in terms of their attitudes toward services and providers. The better each item performed in this regard, the fewer the items required per scale. A response scale with only two choices (agree versus disagree or satisfied versus dissatisfied) was judged to be too coarse. Published studies and analyses of pretest data suggested that five choices yielded more information and more reliable responses than did two or three. Any further increase in reliability with seven response choices did not seem to warrant the resulting increase in questionnaire length and the additional complexity of formatting items. Thus, the response scale chosen for the PSQ asks the respondent to select one of five choices to report strength of agreement or disagreement (strongly agree, agree, not sure, disagree, strongly disagree).
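The tradeoff just described (more response choices capture more of the satisfaction continuum, with diminishing returns past five) can be illustrated with a small simulation. This is a hypothetical sketch, not an analysis from the PSQ field tests; the latent-score model and noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: each respondent has a latent satisfaction level,
# observed through a noisy continuous response (not PSQ data).
latent = rng.normal(size=10_000)
observed = latent + rng.normal(scale=0.8, size=latent.shape)

def discretize(x, n_choices):
    """Cut continuous responses into n_choices ordered categories."""
    cuts = np.quantile(x, np.linspace(0, 1, n_choices + 1)[1:-1])
    return np.digitize(x, cuts)

for k in (2, 3, 5, 7):
    scores = discretize(observed, k)
    r = np.corrcoef(scores, latent)[0, 1]
    print(f"{k} choices: correlation with latent satisfaction = {r:.3f}")
```

Under these assumptions, the jump from 2 to 5 categories recovers most of the lost precision, while the gain from 5 to 7 is marginal, which is consistent with the choice of a 5-point scale.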
Focus on Personal Versus General Care Experiences. Another important characteristic of patient satisfaction rating items is whether they focus on the respondent's personal care experiences or those of people in general. An example of an item with a general referent is "It takes most people a long time to get to the place where they receive medical care." The same item can be structured to focus on the respondent's personal experience: "It takes me a long time to get to the place where I receive medical care." Both kinds of items have been used widely in patient satisfaction surveys (Snyder & Ware, 1975). The main reason for being interested in items with a more general referent was to reduce the number of items left unanswered because of inapplicability. The validity and other psychometric characteristics of these general items, however, had not been studied systematically.

To examine these characteristics, we studied 10 pairs of items that measured satisfaction constructs in three general categories: access to care (2 pairs), finances (2 pairs), and quality of care (6 pairs). Items in each pair differed only in terms of whether they asked about the respondent's own care or care received by people in general (as in the examples above). These item pairs were interspersed throughout a special 78-item version of the PSQ fielded in the Los Angeles and Sangamon County field tests (total n = 952). Paired items were compared in terms of test-retest reliability (6-week interval), factorial validity (similarity of correlations across derived satisfaction factors), predictive validity (in relation to five health and illness behaviors), and differences in mean scores and variances.

Results from both field tests supported the same conclusions. No noteworthy differences in reliability or validity coefficients were observed between items in the same pair. Mean scores for items evaluating personal care experiences were consistently and significantly more favorable than mean scores for items that described the experiences of people in general. Explanations for differences in mean scores are discussed elsewhere (Snyder et al., 1975). The practical implication is that the difference in item referent (personal vs. general) has little or no impact on reliability or validity. Hence, the choice between the two
kinds of items was made with other considerations in mind.
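For readers who want to run this kind of paired-item comparison, the sketch below shows two of the tests described above: a paired comparison of mean favorability between referents and a check on variances. The data here are simulated stand-ins; the real analyses used the 10 interspersed item pairs from the Los Angeles and Sangamon County field tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 952  # combined n reported for the two field tests

# Simulated 5-point scores for one personal/general item pair
# (hypothetical values, not the PSQ field-test data).
general = rng.integers(1, 6, size=n).astype(float)
personal = np.clip(general + rng.choice([0, 0, 1], size=n), 1, 5)

# Paired comparison of mean favorability between the two referents.
t, p = stats.ttest_rel(personal, general)
print(f"personal mean = {personal.mean():.2f}, general mean = {general.mean():.2f}")
print(f"paired t = {t:.2f}, p = {p:.4f}")

# Differences in variances were also compared across pair members.
print(f"variances: personal = {personal.var(ddof=1):.2f}, "
      f"general = {general.var(ddof=1):.2f}")
```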
Administration Methods. Development and validation of the PSQ required the design of oral interview schedules and various self-administered questionnaires. Our analyses of administration methods examined their effects on response rates, completeness of data, data-gathering costs, characteristics of respondents and nonrespondents, and satisfaction levels. We also examined the effects of asking other questions before administration of the PSQ.

Response rates were not completely determined by administration method. For example, in a randomized-groups experiment during the Los Angeles County field test, approximately 69% of those who were asked to self-administer and return a questionnaire booklet by mail returned the booklet, as compared with a 95% completion rate for those whose self-administration was supervised by a trained interviewer. Other field tests showed no difference between return rates for groups who self-administered the PSQ with and without supervision (the latter with mail-back). The vigorousness of follow-up seemed to be the more important factor in determining completion rates when mail-back was relied upon. Further, we detected no difference in data quality between supervised and unsupervised self-administration of the PSQ. Supervision by a trained interviewer increased data-gathering costs about 5-fold.

In the Los Angeles County field test, characteristics of respondents and nonrespondents to a mail-back survey were compared. These characteristics were documented during an interview before the questionnaire was either dropped off for self-administration and mailed return or completed under supervision. The drop-off/mail-back method resulted in significant underrepresentation of persons aged 40 and younger, nonwhites, and low-income persons. Comparison of satisfaction scores for persons in the mail-back and hand-back groups suggested that those who were more satisfied with the quality of their care are less likely to return questionnaires.

We also examined whether differences in satisfaction levels might be caused by differences in when the PSQ was administered during a longer interview schedule. We randomly varied whether the PSQ was self-administered before or after a series of questions that asked about use of health care services and compliance with medical regimens. We hypothesized that questions about health care experiences might increase the salience of attitudes toward those experiences (i.e., medical care satisfaction). For 14 of the 18 PSQ scales, scores tended to be lower for those who answered questions about health care experiences first. Scores on scales measuring access to care in emergencies,
costs of care, and payment mechanisms were significantly lower. These results suggest that administration procedures (and particularly the placement of satisfaction questions) in a longer survey should be standardized. Further research is necessary to determine whether satisfaction ratings are more or less valid if obtained after a review of recent health care experiences.

The length of time required to complete the PSQ was systematically measured. Considerable variability was observed across respondents. On average, respondents took about 11-13 seconds to complete each PSQ item. Thus, the 55 items used to score scales constructed from Form II of the PSQ take about 11 minutes to complete. The 43-item PSQ short form takes 8-9 minutes on average.² Administration times tend to be somewhat longer for disadvantaged respondents (i.e., low education, low income).

Response Set Effects. Beginning with Form I, all versions of the PSQ contained a balance of favorably and unfavorably worded items to control for bias due to acquiescent response set (ARS), a tendency to agree with statements of opinion regardless of content. ARS bias was a noteworthy problem that became apparent at several stages during development of the PSQ, including empirical studies of item groupings (i.e., factor analyses of items), estimation of internal-consistency reliability, and comparisons of group differences in satisfaction (Ware, 1978). More recently, similar problems have surfaced in other studies of health care attitudes (Winkler, Kanouse, & Ware, 1982).

During tests of the PSQ, 40% to 60% of respondents manifested some degree of ARS, and 2% to 10% demonstrated substantial ARS tendencies. (We used 11 matched pairs of favorably and unfavorably worded items that measured the same feature of care and extremely worded validity check items to identify such respondents.) Effects of ARS bias included: appearance of method rather than trait factors in item analyses (i.e., factors defined by differences in the direction of item wording, not differences in characteristics of medical care); inflated reliability estimates for unbalanced multi-item scales; and seriously biased comparisons of mean differences between groups of respondents differing in educational attainment, income, and age. (These effects were also observed in analyses of responses to the Thurstone scales constructed by
Hulka and her colleagues [Hulka, Zyzanski, Cassel, & Thompson, 1970].) For example, differences in satisfaction with quality of care between education groups were substantially overestimated by PSQ scales constructed entirely from favorably worded items, and were missed entirely by scales constructed entirely from unfavorably worded items. The balanced PSQ Technical Quality satisfaction subscale, which was not correlated with ARS, detected significant differences in satisfaction between education groups (Ware, 1978).

We also studied two other types of response set that might bias patient satisfaction ratings: opposition response set (ORS, a tendency to disagree with statements regardless of content) and socially desirable response set (SDRS). ORS proved to be very rare and thus of little concern. SDRS was common but did not correlate with ratings of satisfaction with medical care (see Ware et al., 1976b, pp. 537-588).

²The 43-item PSQ short form was developed for use in Rand's Health Insurance Experiment, a randomized controlled trial designed to estimate the effects of different health care financing arrangements and organizations on patient satisfaction. This short form was also fielded in a national study of access to health care services (Aday, Andersen, & Fleming, 1980). The short form questionnaire and scoring instructions are available from the authors.
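The matched-pair logic used to flag acquiescent respondents can be made concrete in a few lines. The sketch below assumes the raw precoding described later in this paper (1 = strongly agree through 5 = strongly disagree) and uses one real Form II pair (items 50 and 30) for illustration; the full analysis used 11 such pairs plus extremely worded validity-check items, and the flagging rule shown here is our own simplified version.

```python
# Matched favorable/unfavorable wordings of the same feature of care.
# Example pair from Form II: item 50 "Doctors are very thorough." vs.
# item 30 "Doctors aren't as thorough as they should be."
MATCHED_PAIRS = [(50, 30)]  # the full analysis used 11 such pairs

AGREE = {1, 2}  # raw precoding: 1 = strongly agree ... 5 = strongly disagree

def acquiescent_pairs(raw_responses, pairs=MATCHED_PAIRS):
    """Count pairs where a respondent agrees with BOTH the favorable
    and the unfavorable wording, a signature of acquiescent response set."""
    return sum(
        raw_responses[fav] in AGREE and raw_responses[unfav] in AGREE
        for fav, unfav in pairs
    )

# Hypothetical respondent who agrees with both contradictory statements:
print(acquiescent_pairs({50: 1, 30: 2}))  # -> 1
```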
PSQ Items and Descriptive Statistics

Following the Seven-County Study and several small-sample pretests of instructions and instrument format, 80 Likert-type items were self-administered in Form I of the PSQ during a survey of households in three southern Illinois counties, the Tri-County Study (Ware & Snyder, 1975). Each item was worded as a statement of opinion, and items were evenly divided between favorable and unfavorable statements. Analyses of items in Form I led to substantial revisions and to construction of Form II of the PSQ. Only 4 items from Form I were retained without revision; 59 were revised and retained, and 5 new items were written for Form II. The verbatim content of all 68 PSQ Form II items appears in Table 1; items are listed in the order of their administration.

TABLE 1
ITEMS IN FORM II OF THE PSQ

 1.* I'm very satisfied with the medical care I receive.
 2.  Doctors let their patients tell them everything that the patient thinks is important.
 3.* Doctors ask what foods patients eat and explain why certain foods are best.
 4.* I think you can get medical care easily even if you don't have money with you.
 5.* I hardly ever see the same doctor when I go for medical care.
 6.* Doctors are very careful to check everything when examining their patients.
 7.  We need more doctors in this area who specialize.
 8.* If more than one family member needs medical care, we have to go to different doctors.
 9.* Medical insurance coverage should pay for more expenses than it does.
10.* I think my doctor's office has everything needed to provide complete medical care.
11.  Doctors never keep their patients waiting, even for a minute.
12.* Places where you can get medical care are very conveniently located.
13.  Doctors act like they are doing their patients a favor by treating them.
14.* The amount charged for medical care services is reasonable.
15.  Doctors always tell their patients what to expect during treatment.
16.* Most people receive medical care that could be better.
17.  Most people are not encouraged to get a yearly exam when they go for medical care.
18.* If I have a medical question, I can reach someone for help without any problem.
19.* In an emergency, it's very hard to get medical care quickly.
20.  I can arrange for payment of medical bills later if I'm short of money now.
21.* I am happy with the coverage provided by medical insurance plans.
22.* Doctors always treat their patients with respect.
23.* I see the same doctor just about every time I go for medical care.
24.  The amount charged for lab tests and x-rays is extremely high.
25.* Doctors don't advise patients about ways to avoid illness or injury.
26.* Doctors never recommend surgery (an operation) unless there is no other way to solve the problem.
27.  Doctors hurt many more people than they help.
28.* Doctors hardly ever explain the patient's medical problems to him.
29.* Doctors always do their best to keep the patient from worrying.
30.* Doctors aren't as thorough as they should be.
31.* It's hard to get an appointment for medical care right away.
32.* There are enough doctors in this area who specialize.
33.* Doctors always avoid unnecessary patient expenses.
34.* Most people are encouraged to get a yearly exam when they go for medical care.
35.* Office hours when you can get medical care are good for most people.
36.* Without proof that you can pay, it's almost impossible to get admitted to the hospital.
37.  People have to wait too long for emergency care.
38.  Medical insurance plans pay for most medical expenses a person might have.
39.* Sometimes doctors make the patient feel foolish.
40.* My doctor's office lacks some things needed to provide complete medical care.
41.  Doctors always explain the side effects of the medicine they prescribe.
42.* There are enough hospitals in this area.
43.* It takes me a long time to get to the place where I receive medical care.
44.  Just about all doctors make house calls.
45.* The care I have received from doctors in the last few years is just about perfect.
46.  Doctors don't care if their patients worry.
47.* Sometimes doctors take unnecessary risks in treating their patients.
48.  In an emergency, you can always get medical care.
49.* The fees doctors charge are too high.
50.  Doctors are very thorough.
51.* The medical problems I've had in the past are ignored when I seek care for a new medical problem.
52.* Parking is a problem when you have to get medical care.
53.* There are enough family doctors around here.
54.  Doctors never expose their patients to unnecessary risk.
55.* Doctors respect their patient's feelings.
56.  It's cash in advance when you need medical care.
57.  Doctors never look at their patient's medical records.
58.* There are things about the medical care I receive that could be better.
59.  When doctors are unsure of what's wrong with you, they always call in a specialist.
60.  When I seek care for a new medical problem, they always check up on the problems I've had before.
61.* More hospitals are needed in this area.
62.  Doctors seldom explain why they order lab tests and x-rays.
63.  I think the amount charged for emergency room service is reasonable.
64.  Sometimes doctors miss important information which their patients give them.
65.  My doctor treats everyone in my family when they need care.
66.* Doctors cause some people to worry a lot because they don't explain medical problems to patients.
67.* There is a big shortage of family doctors around here.
68.  Sometimes doctors cause their patients unnecessary medical expenses.
 -.* People are usually kept waiting a long time when they are at the doctor's office.

Note. Items marked with an asterisk are included in the 43-item short form of the PSQ; one item in that form (the last listed) does not appear in Form II. In addition, four items (11, 27, 44, and 57) were used only as validity checks.

Before testing multi-item PSQ scales, we evaluated item descriptive statistics. Specifically, we checked distributions of item scores to determine whether revisions in item wording would be necessary to achieve roughly symmetrical (if not normal) response distributions. These characteristics are desirable for items to be used in simple summated ratings scales. Because questionnaire responses for all PSQ items were precoded so that "strongly agree" equaled 1 and "strongly disagree" equaled 5, responses to the favorably worded items were recoded as shown in Table 2. Means and standard deviations for the 68 PSQ Form II items in the field tests appear in Table 3. All items are scored so that a higher number indicates a more favorable evaluation of medical care.
TABLE 2
ITEM SCORING RULES FOR FORM II OF THE PSQ

Scoring: 1 = Strongly disagree; 2 = Disagree; 3 = Not sure; 4 = Agree; 5 = Strongly agree
Item numbers:(a) 1, 2, 3, 4, 6, 10, 12, 14, 15, 18, 20, 21, 22, 23, 26, 29, 32, 33, 34, 35, 38, 41, 42, 45, 48, 50, 53, 55, 59, 60, 63, 65

Scoring: 5 = Strongly disagree; 4 = Disagree; 3 = Not sure; 2 = Agree; 1 = Strongly agree
Item numbers: 5, 7, 8, 9, 13, 16, 17, 19, 24, 25, 28, 30, 31, 36, 37, 39, 40, 43, 46, 47, 49, 51, 52, 54, 56, 58, 61, 62, 64, 66, 67, 68

(a) The four validity-check items (numbers 11, 27, 44, and 57) are not included (see text).
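In code, the Table 2 rules reduce to a single reversal for favorably worded items. The sketch below transcribes the item lists as reconstructed above; it is an illustration of the scoring convention, not an official scoring program.

```python
# Item lists transcribed from Table 2. Raw precoding for every item:
# 1 = strongly agree ... 5 = strongly disagree.
FAVORABLE = {1, 2, 3, 4, 6, 10, 12, 14, 15, 18, 20, 21, 22, 23, 26, 29, 32,
             33, 34, 35, 38, 41, 42, 45, 48, 50, 53, 55, 59, 60, 63, 65}
UNFAVORABLE = {5, 7, 8, 9, 13, 16, 17, 19, 24, 25, 28, 30, 31, 36, 37, 39, 40,
               43, 46, 47, 49, 51, 52, 54, 56, 58, 61, 62, 64, 66, 67, 68}
VALIDITY_CHECKS = {11, 27, 44, 57}  # never scored into scales

def score_item(item_no, raw_code):
    """Return the item score; higher = more favorable evaluation of care."""
    if item_no in VALIDITY_CHECKS:
        raise ValueError("validity-check items are excluded from scoring")
    # Favorable items are reversed so that strong agreement scores 5;
    # unfavorable items keep the raw code (strong disagreement scores 5).
    return 6 - raw_code if item_no in FAVORABLE else raw_code

assert score_item(1, 1) == 5   # strongly agreeing with a favorable item
assert score_item(19, 5) == 5  # strongly disagreeing with an unfavorable item
```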
TABLE 3
ITEM DESCRIPTIVE STATISTICS, PSQ FORM II

              East St. Louis   Sangamon County   Los Angeles County
Item No.(a)    Mean    SD       Mean    SD         Mean    SD
  1            3.50   1.21      3.67   1.05        3.60   1.08
  2            3.52   1.13      3.43   1.07        3.49   1.04
  3            3.23   1.21      3.01   1.08        3.09   1.11
  4            2.61   1.25      3.14   1.11        2.43   1.18
  5*           3.44   1.28      3.89   1.03        3.62   1.16
  6            3.03   1.29      3.00   1.15        3.03   1.16
  7*           1.81   0.98      3.02   1.13        2.83   1.08
  8*           2.98   1.22      2.74   1.18        3.01   1.13
  9*           1.93   0.94      2.27   1.08        2.02   0.94
 10            3.08   1.71      3.50   0.98        3.45   1.06
 11            1.80   1.05      1.50   0.80        1.71   0.87
 12            2.98   1.22      3.48   0.93        3.24   1.10
 13*           2.91   1.17      3.28   1.08        3.25   1.11
 14            2.58   1.11      2.53   1.13        2.25   1.08
 15            2.89   1.20      3.04   1.09        2.93   1.09
 16*           2.29   0.93      2.59   0.86        2.45   0.93
 17*           2.64   1.17      3.02   1.08        2.92   1.09
 18            2.83   1.24      3.28   1.08        3.16   1.16
 19*           2.40   1.24      3.09   1.10        2.90   1.19
 20            3.36   1.08      3.71   0.76        3.36   1.01
 21            2.89   1.20      2.95   1.11        2.88   1.15
 22            3.48   1.07      3.40   1.00        3.38   1.00
 23            3.63   1.12      3.78   0.90        3.60   1.08
 24*           2.07   0.91      2.17   0.92        2.12   0.96
 25*           2.84   1.17      2.92   1.05        2.99   1.12
 26            3.40   1.11      3.28   0.96        3.13   1.04
 27*           3.58   0.92      4.09   0.70        3.90   0.83
 28*           2.89   1.20      3.42   1.04        3.38   1.03
 29            3.52   1.03      3.41   0.93        3.46   0.88
 30*           2.55   1.04      2.84   1.01        2.74   0.98
 31*           2.42   1.12      2.37   1.13        2.70   1.13
 32            2.25   1.13      3.19   1.00        3.08   0.96
 33            2.66   1.02      2.66   0.95        2.48   0.95
 34            3.11   1.09      3.26   0.96        3.14   1.01
 35            3.17   1.03      3.30   0.97        3.06   1.04
 36*           2.27   1.18      2.68   1.02        2.17   1.01
 37*           2.03   1.10      2.55   1.07        2.48   1.04
 38            3.15   1.06      2.95   1.02        2.90   1.08
 39*           2.79   1.07      3.07   1.02        3.06   1.02
 40*           2.72   1.05      3.31   0.91        3.25   0.96
 41            2.96   1.16      2.77   1.03        2.90   1.06
 42            2.28   1.18      3.06   1.06        3.26   0.99
 43*           3.30   1.16      3.72   0.83        3.53   0.98
 44            1.56   0.82      1.49   0.67        1.59   0.74
 45            3.02   1.18      3.20   1.07        3.10   1.10
 46*           3.48   0.94      3.46   0.91        3.42   0.89
 47*           2.98   0.97      3.37   0.86        3.21   0.88
 48            2.91   1.21      3.41   0.91        3.16   1.04
 49*           2.22   0.97      2.40   1.04        2.05   0.89
 50            2.87   1.05      2.96   0.95        2.98   0.95
 51*           3.32   1.08      3.48   0.87        3.46   0.90
 52*           2.88   1.11      3.14   1.12        3.17   1.04
 53            2.15   1.02      2.19   0.89        2.72   0.97
 54            2.98   0.97      3.10   0.83        3.12   0.84
 55            3.45   0.98      3.46   0.90        3.50   0.80
 56*           3.21   1.07      3.89   0.67        3.22   1.04
 57            2.26   0.92      2.06   0.66        2.15   0.69
 58*           2.38   1.04      2.66   1.00        2.58   0.99
 59            3.45   1.11      3.43   0.89        3.42   0.94
 60            3.55   1.00      3.42   0.85        3.51   0.86
 61*           1.93   0.97      3.01   1.04        3.08   0.96
 62*           2.90   1.16      3.27   1.01        3.22   1.03
 63            2.45   1.04      2.54   1.01        2.46   0.97
 64*           2.58   0.98      2.85   0.85        2.83   0.86
 65            3.19   1.19      2.96   1.17        3.15   1.10
 66*           2.66   1.08      2.98   1.00        3.02   0.98
 67*           2.03   0.98      2.28   0.88        2.69   0.93
 68*           2.48   1.00      2.78   0.93        2.59   0.89

Note. (a) Items are listed in the order they appear in Form II of the PSQ; see Table 1 for content. *These items define unfavorable attitudes; their raw scores have been recoded here following the item scoring rules in Table 2.
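Checking item distributions for rough symmetry, as described above, amounts to computing a skewness statistic per item. A minimal sketch with made-up responses (the actual field-test distributions are summarized in Table 3):

```python
import numpy as np
from scipy import stats

# Hypothetical recoded scores (1-5) for one item.
item_scores = np.array([1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5])

print(f"mean = {item_scores.mean():.2f}, sd = {item_scores.std(ddof=1):.2f}, "
      f"skewness = {stats.skew(item_scores):.2f}")
# Strongly skewed items were candidates for rewording in more favorable
# or more unfavorable terms to pull responses toward symmetry.
```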
Constructing Multi-Item Subscales

Our experiences in analyzing 87 items from the Seven-County Study (Chu et al., 1973; Ware, Miller, & Snyder, 1973) convinced us that an individual questionnaire item is not a very satisfactory unit of analysis for a study of the structure of patient attitudes about doctors and medical care services. An item score is coarse, less reliable, and substantially influenced by the direction of item wording and other methodological features in addition to the construct(s) being measured. Although the Seven-County Study gave us our first "picture" of the structure of patient satisfaction, the picture was not very clear because of these methodological problems.

A major goal in our studies of Form I was to test empirically our taxonomy of patient satisfaction constructs. If supported, this taxonomy would provide the "blueprint" for Form II. Our progress toward this goal would be limited by the adequacy of the measures available for model testing. To increase chances for success, we adopted the concept of a Factored Homogeneous Item Dimension (FHID) developed by Comrey (1961). He used this technique because of its advantages in solving various measurement problems in personality research; we discuss these advantages in reference to the PSQ elsewhere (Ware & Snyder, 1975; Ware et al., 1976a).

Simply stated, a FHID is a group of items that has satisfied both logical and statistical criteria. The logical criterion is that the items have very similar content (appear highly conceptually related). Abbreviated examples of the content of items from a FHID measuring attitude toward the interpersonal manner of providers are: Doctors treat their patients with respect, doctors make patients feel foolish, and doctors act like they are doing patients a favor by treating them. We labeled this FHID Respect. Empirically, items in the same FHID must share substantially more variance with each other than with items in other FHIDs. Items that fulfill these criteria are combined to yield a single score that serves as the unit of analysis in subsequent analyses. The FHID strategy, which is in contrast to the common practice of using a single questionnaire item as the unit of analysis, was employed extensively in evaluations of Forms I and II of the PSQ. Results for Form I are reported elsewhere (Ware & Snyder, 1975; Ware et al., 1976a, pp. 167-179).

Our evaluation of item groupings hypothesized for Form II of the PSQ was conducted in two phases. First, 20 hypothesized FHIDs were tested with data from the Sangamon County field test (n = 432). Seven matrices of inter-item correlations, each containing
five or six FHIDs, were factor analyzed. Inspection of seven factor matrices had several advantages. The number of PSQ items per matrix ranged from only 14 to 16, for a subjects/variables ratio of greater than 25/1 in all matrices. Further, each FHID could be tested against more than one combination of other FHIDs. (It is much more difficult to validate a FHID against other FHIDs that measure conceptually similar as opposed to dissimilar constructs.) In addition to testing specific hypotheses about Form II item groupings, analyses of the seven matrices also provided a thorough test for unhypothesized satisfaction factors.

Results from the FHID validation studies in Sangamon County confirmed 17 of the 20 FHIDs hypothesized to measure specific dimensions of patient satisfaction with doctors and medical care services. These FHIDs included 51 items. Results also confirmed an 18th FHID of four items that measured satisfaction with medical care in general. These 18 item groupings (FHIDs) and higher-order factors (global scales), identified in Table 4, were subjected to multitrait scaling tests during the second step of our item analyses. The multitrait analyses were performed independently in each of the four field tests.

The multitrait analyses involved inspection of item-scale correlation matrices to evaluate each item in relation to two criteria: first, based on the logic of Likert (1932) scaling, whether each item had a substantial linear relationship with the total score for its hypothesized scale; and second, based on the logic of discriminant validity, whether each item correlated higher with its hypothesized scale than with other scales. (In these analyses, we used a modified version of the Analysis of Item-Test Homogeneity (ANLITH) program developed by Thomas Gronek at IBM and Thomas Tyler at the Academic Computing Facility, Southern Illinois University.) Additional details regarding specific criteria for scaling "successes" and "failures" are reported elsewhere (Ware et al., 1976a, pp. 179-210). Item-scale correlations were corrected for overlap using the technique recommended by Howard and
Forehand (1962). This correction provided more stringent tests of scaling criteria by removing the effect of the item being evaluated from the total scale score. Because the scales were short, each item had a considerable influence on the total scale score.

Multitrait scaling is not as complete as convergent-discriminant validation with the multitrait-multimethod (MTMM) approach described by Campbell and Fiske (1959). Only one measurement method is represented in the matrix in multitrait scaling. We would argue, however, that our approach is a "cousin" of the MTMM approach and that it is superior to traditional analyses of item internal consistency because it provides discriminant tests of item validity across traits (in this case, satisfaction constructs) that are measured by the same method.

Results of the multitrait scaling analyses were more than satisfactory for all 18 subscales in all four sites. Only 11 correlations (corrected for overlap) between items and their hypothesized scales were below 0.30 in 220 tests across four sites. Of 3,740 tests of the item discriminant validity criterion (the second and more stringent criterion just defined), approximately 98% were favorable. Items in six of the hypothesized scales (the three Availability scales, Cost of Care, Insurance Coverage, and Doctor's Facilities) passed the criterion in 100% of the item-discriminant validity tests in all four field tests. The largest number of discrepancies were observed for items in the Access to Care, Prudence-Risks, and Prudence-Expenses subscales; most of these discrepancies were noted in data from the East St. Louis field test, which provided the most economically disadvantaged sample. Thus, with relatively few exceptions, the internal consistency of hypothesized subscales and the discriminant validity of the 55 PSQ item scores were demonstrated successfully in four independent studies. (The four general satisfaction items proved to be substantially internally consistent. They were not expected to show high discriminant validity and, in fact, correlated significantly, if not substantially, with the 17 other subscales.)

On the strength of these findings, the 18 PSQ Form II subscales were scored using the item groupings shown in Table 4. Scale scores were calculated by computing the simple algebraic sum of the items in the scale, after scoring the items as shown in Table 2. Items were constructed and modified, as necessary, to achieve nearly equal (unit) variances; item content was modified, as necessary, so that items in the same scale (FHID) would have approximately the same correlations with their primary factor, and no other substantial correlations. These goals were generally met, and it was not necessary, therefore, to standardize items or to use factor coefficients to weight them differently. Higher scores on all scales indicate more favorable attitudes.

TABLE 4
VALIDATED ITEM GROUPINGS FOR PSQ SUBSCALES

Dimension / Item Grouping: Item Numbers

Access to care (nonfinancial)
  1. Emergency care: 19, 37, 48
  2. Convenience of services: 12, 43
  3. Access to care: 18, 31
Financial aspects
  4. Cost of care: 14, 24, 49, 63
  5. Payment mechanisms: 4, 20, 36, 56
  6. Insurance coverage: 9, 21, 38
Availability of resources
  7. Family doctors: 53, 67
  8. Specialists: 7, 32
  9. Hospitals: 42, 61
Continuity of care
  10. Family: 8, 65
  11. Self: 5, 23
Technical quality
  12. Quality/competence: 3, 6, 17, 25, 30, 34, 50, 51, 60
  13. Prudence-risks: 47, 54
  14. Doctor's facilities: 10, 40
Interpersonal manner
  15. Explanations: 28, 62, 66
  16. Consideration: 22, 26, 29, 39, 55
  17. Prudence-expenses: 33, 68
Overall satisfaction
  18. General satisfaction: 1, 16, 45, 58

Note. Source: Adapted from Figure 21 in Ware, Snyder, and Wright (1976a), p. 198.
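The multitrait scaling tests described above can be sketched directly: compute each item's correlation with every scale total, subtracting the item from its own scale total to correct for overlap, then apply the two criteria (own-scale correlation above 0.30, and higher than the correlation with any other scale). This is an illustrative reimplementation under those stated rules, not the ANLITH program itself, and the example data are random stand-ins.

```python
import numpy as np

def multitrait_scaling(items, scales):
    """items: dict item_no -> array of scored responses across respondents.
    scales: dict scale_name -> list of item numbers (as in Table 4).
    Returns dict item_no -> {scale_name: item-scale correlation},
    with the item's own scale total corrected for overlap."""
    totals = {name: sum(items[i] for i in nums) for name, nums in scales.items()}
    out = {}
    for name, nums in scales.items():
        for i in nums:
            out[i] = {
                other: np.corrcoef(items[i],
                                   total - items[i] if other == name else total)[0, 1]
                for other, total in totals.items()
            }
    return out

# Usage with random stand-in data for two subscales:
rng = np.random.default_rng(2)
items = {i: rng.integers(1, 6, size=200).astype(float) for i in (19, 37, 48, 12, 43)}
scales = {"Emergency care": [19, 37, 48], "Convenience of services": [12, 43]}
for item_no, corrs in multitrait_scaling(items, scales).items():
    print(item_no, {k: round(v, 2) for k, v in corrs.items()})
```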
Logical and empirically verified groupings of PSQ subscales were used to compute global satisfaction scores. The item groupings for global scales were hypothesized from the taxonomy of satisfaction constructs and the higher-order factor structure of PSQ subscales (discussed later). Scoring rules for six global PSQ Form II scales are defined in Table 5. The global scales are computed after scoring the items as shown in Table 2 and the subscales as shown in Table 4.

TABLE 5
GLOBAL SATISFACTION SCALES SCORED FROM FORM II OF THE PSQ

Scale: Combine These Items/Subscales

Access to care: (19 + 37 + 48) + (12 + 43) + (18 + 31) [Emergency care + Convenience of services + Access to care]
Availability: (53 + 67) + (42 + 61) + (7 + 32) [Availability/family doctors + Availability/hospitals + Availability/specialists]
Finances: (14 + 24 + 49 + 63) + (9 + 21 + 38) + (4 + 20 + 36 + 56) [Cost of care + Insurance coverage + Payment mechanisms]
Continuity: (8 + 65) + (5 + 23) [Continuity of care/family + Continuity of care/self]
Interpersonal manner: (22 + 26 + 29 + 39 + 55) + (28 + 62 + 66) [Consideration + Explanations]
Quality total: (10 + 40) + (47 + 54) + (33 + 68) + (3 + 6 + 17 + 25 + 30 + 34 + 50 + 51 + 60) [Doctor's facilities + Prudence/risks + Prudence/expenses + Quality/competence]
Access total: Access to care + Finances
Doctor conduct total: Interpersonal manner + Quality total
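Because every subscale and global scale is a simple algebraic sum, the full scoring procedure fits in a short function. The groupings below are transcribed from Tables 4 and 5 as reconstructed above; the helper names and the dictionary layout are ours, offered as a sketch rather than an official scoring program.

```python
# Subscale item groupings per Table 4 (items already scored per Table 2,
# so higher = more favorable).
SUBSCALES = {
    "Emergency care": [19, 37, 48],
    "Convenience of services": [12, 43],
    "Access to care": [18, 31],
    "Cost of care": [14, 24, 49, 63],
    "Payment mechanisms": [4, 20, 36, 56],
    "Insurance coverage": [9, 21, 38],
    "Family doctors": [53, 67],
    "Specialists": [7, 32],
    "Hospitals": [42, 61],
    "Continuity/family": [8, 65],
    "Continuity/self": [5, 23],
    "Quality/competence": [3, 6, 17, 25, 30, 34, 50, 51, 60],
    "Prudence-risks": [47, 54],
    "Doctor's facilities": [10, 40],
    "Explanations": [28, 62, 66],
    "Consideration": [22, 26, 29, 39, 55],
    "Prudence-expenses": [33, 68],
    "General satisfaction": [1, 16, 45, 58],
}

# Global groupings per Table 5 (values are SUBSCALES keys).
GLOBALS = {
    "Access to care (global)": ["Emergency care", "Convenience of services",
                                "Access to care"],
    "Availability": ["Family doctors", "Hospitals", "Specialists"],
    "Finances": ["Cost of care", "Insurance coverage", "Payment mechanisms"],
    "Continuity": ["Continuity/family", "Continuity/self"],
    "Interpersonal manner": ["Consideration", "Explanations"],
    "Quality total": ["Doctor's facilities", "Prudence-risks",
                      "Prudence-expenses", "Quality/competence"],
}

def score_scales(scored_items):
    """scored_items: dict item_no -> score after the Table 2 recoding."""
    subs = {name: sum(scored_items[i] for i in nums)
            for name, nums in SUBSCALES.items()}
    globs = {name: sum(subs[s] for s in parts) for name, parts in GLOBALS.items()}
    globs["Access total"] = globs["Access to care (global)"] + globs["Finances"]
    globs["Doctor conduct"] = globs["Interpersonal manner"] + globs["Quality total"]
    return subs, globs

# Example: a respondent answering "not sure" (3) to every scored item.
neutral = {i: 3 for i in range(1, 69) if i not in (11, 27, 44, 57)}
subs, globs = score_scales(neutral)
print(globs["Doctor conduct"])  # 23 items x 3 = 69
```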
Descriptive statistics (means and standard deviations) for the subscales and global scales in the four field tests appear in Table 6.

TABLE 6
MEANS AND STANDARD DEVIATIONS FOR PSQ FORM II SCALES

                        No. of  Highest     SAC          ESL          FP           LAC
Scale                   Items   Possible  Mean   SD    Mean   SD    Mean   SD    Mean   SD
Nonfinancial access
  Emergency care           3      15       7.3   2.7    9.1   2.4    9.2   2.6    8.6   2.7
  Convenience              2      10       6.3   1.9    7.2   1.5    7.2   1.6    6.8   1.8
  Access to care           2      10       5.2   1.9    5.6   1.8    6.3   2.0    5.9   1.9
Financial access
  Cost of care             4      20       9.2   2.7    9.6   3.1   11.0   3.1    8.9   2.9
  Insurance coverage       3      15       7.9   2.3    8.2   2.6    7.6   2.6    7.9   2.7
  Payment mechanisms       4      20      11.4   2.9   13.4   2.3   13.9   2.5   11.2   3.1
Availability
  Family doctors           2      10       4.1   1.7    4.5   1.6    4.7   1.6    5.4   1.7
  Specialists              2      10       4.0   1.8    6.2   1.9    6.6   1.9    5.9   1.8
  Hospitals                2      10       4.1   1.9    6.1   2.0    6.6   2.1    6.4   1.8
Continuity of care
  Family                   2      10       6.2   2.0    5.7   2.1    6.7   2.2    6.4   2.2
  Self                     2      10       7.1   2.0    7.7   1.6    7.6   2.0    7.2   2.0
Humaneness
  Consideration            5      25      16.6   3.8   16.6   3.6   16.3   4.2   16.6   3.5
  Explanations             3      15       8.5   2.6    9.7   2.4   10.0   2.6    9.6   2.4
Technical quality
  Doctors' facilities      2      10       5.8   2.0    6.8   1.7    6.6   1.8    6.7   1.9
  Prudence/risks           2      10       6.0   1.4    6.5   1.4    6.4   1.6    6.3   1.5
  Quality/competence       9      45      27.0   6.0   27.9   5.9   28.4   6.8   27.9   5.6
  Prudence/expenses        2      10       5.1   1.6    5.4   1.6    5.6   1.8    5.1   1.7
Overall satisfaction
  General satisfaction     4      20      11.2   3.0   12.1   3.1   12.1   3.2   11.8   3.1

Note. Field tests: East St. Louis (ESL), Sangamon County (SAC), Family Practice Center (FP), and Los Angeles County (LAC).
RELIABILITY AND STABILITY ANALYSES

We placed considerable emphasis on evaluating the reliability of the PSQ. This emphasis stemmed from several considerations. First, reliability estimates had been published rarely for the satisfaction measures developed before the PSQ, and we found no published estimates of test-retest reliability or intertemporal stability of such measures (Ware, Davies-Avery, & Stewart, 1978). Second, reliability estimates were essential to interpret results of validity studies (e.g., an MTMM matrix). Finally, because internal consistency reliability estimates are a direct function of item homogeneity, these analyses provided further evidence regarding the appropriateness of the PSQ item groupings. (Homogeneity estimates, or average inter-item correlations, for the subscales and global scales appear in Ware et al., 1976b, pp. 299-321.)

Both internal consistency and test-retest methods of estimating reliability were used. Internal consistency reliability was estimated, using coefficient alpha (Cronbach, 1951), from data obtained during a single administration of the PSQ in each of four field tests. Estimates of test-retest reliability were obtained by computing product-moment correlations between scores for the same respondents on two administrations of the PSQ approximately 6 weeks apart in two field tests (East St. Louis and Sangamon County).
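Both estimators are compact enough to state exactly. The sketch below implements coefficient alpha from the usual variance decomposition and the test-retest estimate as a product-moment correlation; the simulated data are stand-ins, not field-test scores, and the retest noise level is an arbitrary assumption.

```python
import numpy as np

def cronbach_alpha(item_matrix):
    """item_matrix: respondents x items array of scored item responses."""
    m = np.asarray(item_matrix, dtype=float)
    k = m.shape[1]
    item_variances = m.var(axis=0, ddof=1).sum()
    total_variance = m.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def test_retest(scores_time1, scores_time2):
    """Product-moment correlation between two administrations."""
    return np.corrcoef(scores_time1, scores_time2)[0, 1]

# Simulated 4-item subscale for 300 respondents sharing a true score.
rng = np.random.default_rng(3)
true_score = rng.normal(size=300)
items = true_score[:, None] + rng.normal(scale=1.0, size=(300, 4))

scale_t1 = items.sum(axis=1)
scale_t2 = scale_t1 + rng.normal(scale=1.5, size=300)  # assumed retest noise

print(f"coefficient alpha = {cronbach_alpha(items):.2f}")
print(f"test-retest r     = {test_retest(scale_t1, scale_t2):.2f}")
```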
Internal consistency (ICR) and test-retest (TRT) reliability estimates for the PSQ subscales and global scales appear in Table 7. For the 18 subscales, 68 of the 72 ICR estimates exceeded the 0.50 standard recommended for studies that involve group comparisons (Helmstadter, 1964). For 17 subscales administered twice, 28 of the 34 TRT estimates equaled or exceeded that criterion (such estimates were not available for General Satisfaction). These results were encouraging, particularly because more than half of the Form II subscales were each constructed from only two questionnaire items.

Test-retest coefficients for single-item measures were much less favorable in the two field tests that repeated administrations of the PSQ. Approximately 75% of the 55 items failed to achieve the 0.50 standard for test-retest reliability in East St. Louis, and approximately one third failed to meet that standard in Sangamon County. Thus, multi-item PSQ subscales represent a substantial improvement in reliability over single-item measures. These gains over single-item measures are particularly important in studies of disadvantaged respondents.

The reliability of PSQ scores improved further, even for disadvantaged respondents, when the 18 subscales were aggregated into global satisfaction scores (see the lower part of Table 7). The highest reliability coefficients were observed for the global Quality of Care scale, because it is the longest and most homogeneous scale.

TABLE 7
SUMMARY OF RELIABILITY ESTIMATES FOR SATISFACTION SCALES

                                       ESL         SAC         FP     LAC
Scale Name                     k    ICR   TRT   ICR   TRT    ICR     ICR
Access to care                 2    .56   .71   .53   .52    .65     .49
Convenience of services        2    .57   .58   .48   .44    .47     .58
Emergency care                 3    .68   .66   .63   .56    .72     .70
Availability-family doctors    2    .72   .46   .68   .52    .78     .62
Availability-hospitals         2    .91   .87   .80   .66    .93     .80
Availability-specialists       2    .74   .74   .71   .52    .80     .67
Continuity of care-family      2    .73   .79   .54   .64    .68     .32
Continuity of care-self        2    .51   .52   .52   .59    .83     .66
Cost of care                   4    .73   .73   .60   .63    .70     .70
Insurance coverage             3    .71   .73   .51   .48    .76     .64
Payment mechanisms             4    .50   .51   .51   .59    .57     .63
Consideration                  5    .81   .74   .77   .68    .84     .74
Explanations                   3    .70   .74   .64   .48    .75     .71
Prudence-expenses              2    .66   .58   .47   .50    .78     .57
Doctor's facilities            2    .82   .73   .73   .72    .84     .75
Prudence-risks                 2    .60   .46   .23   .39    .69     .54
Quality/competence             9    .83   .74   .77   .70    .87     .79
General satisfaction           4    .77   NA    .62   NA     .73     .70

Global scales
Access to care                 7    .72   .72   .73   .62    .74     .77
Financial aspects             10    .66   .75   .60   .69    .70     .76
Access total                  17    .79   .79   .78   .71    .81     .84
Availability                   6    .66   .75   .74   .62    .57     .73
Continuity of care             4    .59   .74   .43   .63    .73     .52
Doctor conduct                23    .92   .82   .88   .78    .94     .90

Note. Source: based on Tables 54, 56, 59, and 60 in Ware, Snyder, and Wright (1976a). Field tests: East St. Louis (ESL), Sangamon County (SAC), Family Practice Center (FP), and Los Angeles County (LAC).
Although correlations among subscales in the same global scale should be substantial and positive, this standard was not always met. The poor reliability of the global Availability of Resources scale in the Family Practice study was traced to a negative correlation between two of its component subscales. The access subscales also tended to be less highly intercorrelated than subscales used to construct other global scales. Given such results, the interpretation of the aggregate measures may be problematic.

The PSQ subscales and global scales tended to be less reliable in East St. Louis than in other field tests. Consistent with this finding, comparisons of scale reliabilities within field tests for groups formed on the basis of demographic and socioeconomic variables
(age, gender, education, and income) indicated that satisfaction ratings tend to be less reliable for persons reporting less income or education.

Results (data not presented) regarding the stability of satisfaction levels over a 2-year interval came from a follow-up study of respondents in a field test of Form I of the PSQ. Correlations between scores for scales administered approximately 2 years apart ranged from 0.34 for Availability to 0.60 for Nonfinancial Access and 0.61 for Doctor Conduct. (These are lower-bound stability estimates, because the PSQ forms were not identical on both administrations.) The results suggest that satisfaction is relatively stable over time. Therefore, precision in hypothesis-testing is likely to improve significantly with a repeated-measures design and covariation on initial satisfaction levels.

VALIDITY OF THE PSQ

Validation, or determining the meaning of scores and how to interpret a difference of a particular size, is an ongoing process for the PSQ and looms as the greatest challenge for satisfaction measurement in general. This process proceeds in the absence of direct measures of patient satisfaction or of agreed-upon satisfaction "criteria" that can be used to evaluate validity. This problem is common in psychological measurement. A solution that is becoming standard is the strategy of construct validation. This approach examines a wide range of variables to determine the extent to which an instrument produces results that are consistent with what would be expected for the construct to be measured (APA, 1974).

A major difficulty in applying the construct validation method to patient satisfaction measures is the lack of well-specified theory. Specifically, what results should one expect for a valid measure of patient satisfaction? In the face of this dilemma, several approaches were used to test the validity of the PSQ: (a) a systematic review of content validity; (b) factor analytic studies of the structure of items and subscales; (c) studies of convergent-discriminant validity that compared results across alternative methods of measuring patient satisfaction; and (d) studies of the predictive validity of PSQ scales in relation to health and illness behaviors thought to be influenced by individual differences in patient satisfaction. Our experiences with the first three kinds of validity studies are documented in detail elsewhere (Ware et al., 1976b, pp. 323-588) and are summarized briefly here. Studies of predictive validity are discussed in a companion paper in this issue (Ware & Davies, 1983).

In developing the PSQ, we sought to capture the most salient characteristics of services and providers that might influence patient satisfaction with care. Given this goal, content validity is a relevant standard and has been systematically investigated for the PSQ. The match between PSQ items and the taxonomy of characteristics of services and providers that has evolved using information from a variety of sources is quite good. (The PSQ is systematically compared with this taxonomy by Ware et al., 1976b, pp. 373-378; and by Ware et al., 1978.) However, potential areas of improvement in the content of PSQ items have been identified (particularly in the areas of quality of care and finances, as noted below).

Although the PSQ is more comprehensive than its predecessors, there are still more distinguishable features of medical care services than PSQ subscales. The great majority of these features are assessed by one or more PSQ items. However, for many if not most studies of patient satisfaction, a single-item measure is not a very desirable unit of analysis. Thus, the PSQ subscales represent a deliberate compromise between respondent burden and content validity and other psychometric standards. Specifically, the PSQ attempts to strike a balance between the number of different satisfaction constructs measured and how well each construct is measured, while holding administration time well below 15 minutes. All but 2 of the 18 subscales contain two to four items each. The two subscales measuring satisfaction with the technical and interpersonal skills of providers are longer, because these features of care seem most influential in determining patient satisfaction and are more difficult to distinguish.

Standards of empirical validity derive from the intended uses of an instrument. The PSQ was designed with the diverse goals of several types of study in mind. First, it was designed to measure patient satisfaction as an outcome of care. For this application, the PSQ must detect the amount of satisfaction and dissatisfaction produced by different systems of care (e.g., fee-for-service vs. prepaid group practice) as well as by different facilities. Because competing systems of
Standards of empirical validity derive from the intended uses of an instrument. The PSQ was designed with the diverse goals of several types of study in mind. First, it was designed to measure patient satisfaction as an outcome of care. For this application, the PSQ must detect the amount of satisfaction and dissatisfaction produced by different systems of care (e.g., fee-for-service vs. prepaid group practice) as well as by different facilities. Because competing systems of care might involve different tradeoffs (e.g., increased access versus provider continuity), an overall satisfaction score is particularly useful in summarizing satisfaction outcomes. Second, the PSQ was designed to provide programmatically useful information about the major sources of satisfaction and dissatisfaction. For this use, the information it provides about satisfaction must relate to the distinct features of care. The validity issue most relevant to this application is whether PSQ subscales measure different dimensions of satisfaction and how each subscale should be interpreted with regard to a specific feature of care. Finally, the PSQ was designed to be useful in studies of patient behavior. This application requires that its predictive validity be established.

A major feature of the PSQ that is important for several of its intended applications is its structure. If there are distinct features of medical care services that cause differences in patient satisfaction, then a valid satisfaction measure should be multidimensional. The validity of the PSQ in this regard rests on a rather substantial body of empirical evidence. First, the scaling studies involved many tests of item discriminant validity. Results showed that groupings of items corresponding to the PSQ subscales measure different things. These tests were repeated using subscales as the unit of analysis, and findings were notably consistent across four independent field tests in diverse populations. Specifically, four higher-order factors (quality of care, access to care, availability of resources, and continuity of care) were observed and replicated. The pattern of correlations for each subscale across factors also showed little variance across field tests and between groups who had and had not used medical care services recently. These patterns were evaluated empirically by estimating similarity coefficients using methods described by Kaiser, Hunka, and Bianchini (1971).

The weight of empirical evidence regarding the generalizability of the item and higher-order factor analyses clearly indicates that PSQ items and subscales measure distinct dimensions. Differences in the face validity of items in each subscale also support this conclusion. Further, the higher-order factor structure of PSQ subscales is strikingly similar to the major features of health care services that are distinguished in the published literature. This evidence strongly suggests that the PSQ measures the same things that are written about in this literature. Only the most ardent supporter of construct validation by factor analysis, however, is likely to accept from this evidence alone that the PSQ measures distinct dimensions of patient satisfaction.
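The Kaiser-Hunka-Bianchini procedure relates factor solutions from different samples through an orthogonal transformation. As a simpler stand-in that conveys the same idea, the sketch below computes Tucker congruence coefficients between two hypothetical loading matrices; the data, dimensions, and noise levels are all invented.

```python
# Sketch (hypothetical loadings): indexing the similarity of higher-order
# factor solutions across two field tests. Tucker's congruence coefficient
# is used here as a simpler stand-in for the Kaiser-Hunka-Bianchini method.
import numpy as np

def congruence(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Matrix of Tucker congruence coefficients between columns of A and B."""
    A = A / np.linalg.norm(A, axis=0)   # normalize each factor's loadings
    B = B / np.linalg.norm(B, axis=0)
    return A.T @ B                      # entry [i, j] compares factor i with j

rng = np.random.default_rng(1)
true = rng.normal(0, 0.5, (18, 4))               # 18 subscales, 4 factors
sample1 = true + rng.normal(0, 0.1, true.shape)  # field test 1 loadings
sample2 = true + rng.normal(0, 0.1, true.shape)  # field test 2 loadings

phi = congruence(sample1, sample2)
# Large diagonal values (near 1.0) relative to off-diagonal values indicate
# that corresponding factors were recovered in both field tests.
print(np.round(phi, 2))
```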
The evidence summarized above constitutes a sound psychometric basis for scoring and interpreting distinct factors defined by PSQ items. The content of these factors suggests that item responses reflect differences in satisfaction with the specific characteristics of doctors and medical care services described by the scale labels (e.g., finances, interpersonal manner). We conducted a number of empirical studies to test the appropriateness of this conclusion. These studies focused on how well the PSQ agrees with the results of other methods of measuring patient satisfaction. These studies, described in detail elsewhere (Ware et al., 1976b, pp. 379-463), are summarized briefly here.

Every field test of the PSQ included open-ended questions about recent care experiences and other events that may have changed sentiments regarding doctors and medical care services. These questions were included to test for previously unidentified satisfaction constructs and to validate PSQ scores. In two field tests (East St. Louis, Sangamon County) of Form II, these responses were formally analyzed to test hypotheses about the validity of the PSQ. Two questions were addressed: (a) Does the PSQ discriminate between persons who describe negative health care experiences and those who report positive experiences or no events affecting their sentiments? (b) Do the PSQ subscales predict the specific sources of satisfaction and dissatisfaction reported in descriptions of these experiences? For example, are responses to items in the Technical Quality subscale more sensitive to problems with technical quality than to problems with finances?

For several reasons, we were not able to perform all of the planned analyses of responses to open-ended questions. In both field tests, the majority of respondents preferred not to discuss their experiences verbally. A practical implication of this result is that, in addition to costing less than personal interviews, completion rates using a standardized self-administered satisfaction survey are much higher than with unstructured interviews. In East St. Louis, only 3 of 323 respondents volunteered a favorable statement about doctors or medical care services in response to open-ended questions. Hence, a traditional sensitivity-specificity analysis was not possible. In both field tests, some features of services were not mentioned frequently enough to test the sensitivity of the corresponding PSQ subscale. Only four dimensions of care (technical quality, access, finances, and interpersonal manner) were mentioned frequently enough to permit any kind of empirical analysis. Hence, we compared responses to open-ended questions against the four PSQ global scales corresponding to these four problem areas.

In East St. Louis, complaints were expressed about (in order of prevalence): technical quality, access, finances, and interpersonal manner of providers.
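The sensitivity-specificity logic referred to above can be sketched as follows; all counts, cutoffs, and score distributions are invented for illustration, since the actual analysis could not be carried out.

```python
# Sketch (hypothetical data): a subscale is "sensitive" if persons who
# complained about its feature score low on it, and "specific" if persons
# who did not complain are not flagged as low scorers.
import numpy as np

rng = np.random.default_rng(2)
complained = rng.random(300) < 0.1            # flags from open-ended questions
scores = np.where(complained,
                  rng.normal(35, 10, 300),    # complainers tend to score lower
                  rng.normal(55, 10, 300))    # others tend to score higher

flagged_low = scores < 45                     # illustrative cutoff
sensitivity = (flagged_low & complained).sum() / complained.sum()
specificity = (~flagged_low & ~complained).sum() / (~complained).sum()
print(f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```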
With one exception, the PSQ scales showed good convergent and discriminant validity in identifying persons who made these complaints. Respondents who voiced complaints tended to score lower than noncomplaining respondents (approximately 35th-19th percentiles, on average, across subscales corresponding to the subject matter of the complaints). This supports the convergent validity of PSQ subscales. Further, persons who complained about technical quality (n = 30) scored lower on the Technical Quality subscale (at the 25th percentile, on average) than on the other three PSQ subscales studied (access, finances, and interpersonal manner). The other three PSQ scales showed a similar pattern of results in support of their sensitivity and specificity in detecting specific problems with care.

We encountered one noteworthy exception to this pattern of favorable discriminant validity results across the four complaint groups and corresponding PSQ global scales in East St. Louis. The interpersonal manner of providers was rated very unfavorably (14th-20th percentiles of the Humaneness Scale distribution) by all groups who complained, regardless of what was complained about. This pattern of results, which was also apparent in other tests of discriminant validity, raises interesting questions about the dynamics of patient satisfaction in the area of provider "caring." Are practices that have problems with access, finances, and other features of care also more likely to produce unsatisfactory doctor-patient relationships? Are patients inclined to blame their doctor(s) for long waits in the waiting room, financial difficulties, and so on? Further research is necessary to determine the extent to which dimensions of service satisfaction are not orthogonal.

The pattern of results observed in the Sangamon County study, where the number of positive and negative comments was large enough to permit a more traditional correlational analysis, also supported both the convergent and discriminant validity of the PSQ subscales. However, the fact that many respondents commented about more than one feature of care complicated interpretations. In general, for respondents making only one complaint about their care, scores for the PSQ scale that corresponded to the content of the complaint tended to be lowest.

Other validity studies of PSQ subscales focused on access variables and compared PSQ subscales with patient reports using standardized questions about objective features of services, including: distance to care facilities (in miles, travel time); availability of emergency care; and proportion of costs paid by outside sources (e.g., insurance). These analyses were replicated in three field tests (East St. Louis, Sangamon County, and Family Practice). We also tested whether satisfied patients were more likely to report a regular source of care and (in the family practice center study) to claim a particular facility as their regular source of care.
Tests based on these criteria support the discriminant validity of PSQ access-related subscales. For example, the access criteria correlated higher with the PSQ access-related subscales (Access to Care, Emergency Care, Convenience) than with other PSQ subscales; in fact, most correlations with other PSQ subscales were not statistically significant. Further, analyses comparing the PSQ with other standardized report and rating measures provided support for the interpretation of PSQ access-related subscales as evaluations. For example, although the access-related subscales (particularly the Convenience subscale) correlated significantly with reported miles traveled and travel time, their correlations with a standardized evaluative rating of travel time were consistently much higher.

The validity of the PSQ subscales that measure availability of resources could not be evaluated in our field tests, because such a study requires a geographic area as the unit of analysis. Results relevant to this validity issue have been reported by Aday, Andersen, and Fleming (1980), who have linked the PSQ Availability of Resources subscales convincingly to independent measures of medical resources per capita.

We also examined several multitrait-multimethod matrices that correlated PSQ subscales and global scales with measures based on other methods, including ratings of care on a satisfaction continuum (very satisfied vs. very dissatisfied) and a method that combined measures of the frequency of health care events with the importance placed on those events. These analyses provide strong support for the convergent and discriminant validity of PSQ subscales and global scales as measures of patient satisfaction. Noteworthy exceptions, however, included results for the technical and interpersonal subscales, as discussed later. Some problems with the use of satisfaction rating scales (i.e., scales using a very satisfied vs. very dissatisfied response continuum) were also noted in comparisons with the PSQ. For example, correlations among satisfaction ratings seem to be high relative to correlations among PSQ subscales, despite the higher reliability of the latter. This result suggests a strong halo or method effect of ratings on a satisfaction continuum, or a lack of discriminant validity.

We conclude our discussion of validity with comments regarding unanswered questions about PSQ measures of the quality of care. Throughout our studies of the PSQ, we have observed substantial correlations (in the .60-.70 range) among PSQ quality of care subscales and global scales (measures of the technical and interpersonal skills of providers).
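To make the multitrait-multimethod logic concrete, the sketch below applies a simplified version of the Campbell and Fiske (1959) criteria to a hypothetical two-trait, two-method fragment of such a matrix; all correlations are invented.

```python
# Sketch (hypothetical correlations): simplified Campbell-Fiske checks on a
# 2-trait x 2-method fragment. Traits: access (A), finances (F);
# methods: PSQ Likert items (1), satisfaction ratings (2).
import itertools

r = {
    ("A1", "A2"): 0.55,  # same trait, different methods (convergent)
    ("F1", "F2"): 0.50,
    ("A1", "F1"): 0.30,  # different traits, same method
    ("A2", "F2"): 0.45,
    ("A1", "F2"): 0.20,  # different traits, different methods
    ("A2", "F1"): 0.25,
}

convergent = [r[("A1", "A2")], r[("F1", "F2")]]
heterotrait = [v for k, v in r.items() if k[0][0] != k[1][0]]

# Simplified discriminant criterion: every convergent (monotrait-heteromethod)
# value must exceed every heterotrait correlation in the fragment.
ok = all(c > h for c, h in itertools.product(convergent, heterotrait))
print("convergent:", convergent)
print("discriminant pattern holds:", ok)
```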
Some access measures also correlate substantially with these quality of care subscales. When this pattern of results was first observed in tests of Form I, we attributed it to the references to "doctor" in many PSQ items. Item revisions in Form II deleted most such references to focus attention on specific features of care rather than on doctors in general. Substantial correlations among quality of care items have persisted, however, although PSQ item analyses and analyses of other items clearly indicate that patients can distinguish among specific quality of care features (Ware et al., 1976a; Ware et al., 1975).

In convergent-discriminant tests of PSQ measures (using the rigorous multitrait-multimethod method), we sometimes encountered problems with the discriminant validity of scales assessing the technical and interpersonal skills of providers, as well as access to providers. According to the logic of convergent-discriminant validation, one should be able to measure a trait well enough that measures of the same trait using different methods correlate more highly than measures of different traits based on the same method. The opposite pattern of results is encountered all too often.

Despite these reasons to reserve judgment regarding the discriminant validity of PSQ subscales measuring the interpersonal and technical skills of providers, we have little doubt that they measure patient satisfaction. These scales perform very well in relation to a wide range of criterion variables, and are consistently (across studies) the best predictors of satisfaction with care in general and of continuity of care.
At issue is their discriminant validity in relation to the particular quality of care attributes they are supposed to measure (interpersonal manner versus technical skills) and in relation to the provider's accessibility to patients. An analysis of correlations among the subscales in question, taking into account the reliability of each subscale, leaves no doubt that each subscale measures something not measured by the others. Unfortunately, we have no basis for evaluating the size of these interscale correlations, because we do not know the extent to which the attributes of providers in question are correlated in the real world. Are friendlier doctors likely to be more thorough in examining their patients? Are doctors who show more courtesy and respect when they see their patients also more likely to return their patients' phone calls in a timely manner? If so, substantial correlations among measures of the technical and interpersonal skills of providers and general access reflect favorably on their validity. We believe that there is something to this argument, although we also suspect that PSQ items can be constructed to better discriminate between the interpersonal and technical skills of providers. These hypotheses are now being tested (Ware, Kane, Davies, & Brook, in press).
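The adjustment implied by "taking into account the reliability of each subscale" is the standard correction for attenuation, sketched below with hypothetical values rather than actual PSQ estimates.

```python
# Sketch: correcting an observed interscale correlation for unreliability.
# A disattenuated correlation well below 1.0 indicates that each subscale
# measures something the other does not.
def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimated true-score correlation given observed r and reliabilities."""
    return r_xy / (rel_x * rel_y) ** 0.5

# e.g., observed r = .65 between technical and interpersonal subscales,
# with internal-consistency reliabilities of .85 and .88 (all hypothetical):
print(f"disattenuated r = {disattenuate(0.65, 0.85, 0.88):.2f}")  # ~0.75
```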
CONCLUSIONS

Our experience in developing the Patient Satisfaction Questionnaire (PSQ) and testing it in the field has led us to a number of conclusions about the nature of the patient satisfaction concept and important methodological considerations in its measurement. Although much empirical work remains to be done before a complete model of patient satisfaction can be specified, we are convinced of the importance of several features of that model. First, patient satisfaction with medical care is a multidimensional concept, with dimensions that correspond to the major characteristics of providers and services. Second, the realities of care are reflected in patients' satisfaction ratings. Finally, the influence of patients' expectations, preferences for specific features of care, and other hypothetical constructs on patient satisfaction remains to be determined.

Consistent with this preliminary model of the patient satisfaction concept, the PSQ was constructed to measure patient satisfaction in general as well as satisfaction with specific features of care. This permits testing of more focused hypotheses and makes results more useful from a programmatic point of view. The PSQ also reflects our solutions to a number of methodological problems, namely: relying on self-administration to reduce data-gathering costs and increase confidentiality; structuring items as statements of opinion with an agree-disagree response format to reduce the skewness of response distributions; balancing scales to control for acquiescence; scoring multi-item scales to achieve minimum standards of reliability; and validating scales using the logic of construct validity in the absence of agreed-upon criteria. These solutions have served us well, and we recommend them and the PSQ to others.
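A minimal sketch of the balancing and scoring conventions just listed: the unfavorably worded half of a hypothetical five-point Likert subscale is reverse-scored before summing, so an acquiescent "agree with everything" respondent does not inflate the total. The items and responses are invented.

```python
# Sketch (hypothetical items): scoring a balanced Likert subscale.
# Response scale: 1 = strongly disagree ... 5 = strongly agree.
favorable = [4, 5]     # e.g., "My doctor is thorough."
unfavorable = [2, 1]   # e.g., "My doctor ignores what I tell him."

# Reverse-score the unfavorably worded items (6 - r on a 5-point scale),
# so that higher always means more satisfied; then sum the items.
reversed_items = [6 - r for r in unfavorable]
score = sum(favorable + reversed_items)
print(score)  # 4 + 5 + 4 + 5 = 18 of a possible 20
```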
REFERENCES

ADAY, L. A., ANDERSEN, R., & FLEMING, G. V. Health care in the U.S.: Equitable for whom? Beverly Hills: Sage Publications, 1980.

AMERICAN PSYCHOLOGICAL ASSOCIATION. Standards for educational and psychological tests. Washington, DC: American Psychological Association, 1974.

CAMPBELL, D. T., & FISKE, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

CHU, G. C., WARE, J. E., JR., & WRIGHT, W. R. Health related research in southernmost Illinois: A preliminary report. (Tech. Rep. No. HCP-73-6.) Springfield, IL: Southern Illinois University, School of Medicine, 1973.

COMREY, A. L. Factored homogeneous item dimensions in personality research. Educational and Psychological Measurement, 1961, 21, 417-431.

CRONBACH, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.

HELMSTADTER, G. C. Principles of psychological measurement. New York: Appleton-Century-Crofts, 1964.

HOWARD, K. I., & FOREHAND, G. G. A method for correcting item-total correlations for the effect of relevant item inclusion. Educational and Psychological Measurement, 1962, 22, 731-735.

HULKA, B. S., ZYZANSKI, S. J., CASSEL, J. C., & THOMPSON, S. J. Scale for the measurement of attitudes toward physicians and primary medical care. Medical Care, 1970, 8, 429-436.

KAISER, H. F., HUNKA, S., & BIANCHINI, J. C. Relating factors between studies based upon different individuals. Multivariate Behavioral Research, 1971, 6, 409-422.

LIKERT, R. A technique for the measurement of attitudes. Archives of Psychology, 1932, (No. 140), 1-55.

SNYDER, M. K., & WARE, J. E., JR. Differences in satisfaction with health care services as a function of recipient: Self or others. (P-5488.) Santa Monica, CA: The Rand Corporation, 1975.

WARE, J. E., JR. Effects of acquiescent response set on patient satisfaction ratings. Medical Care, 1978, 16, 327-336.

WARE, J. E., JR. How to survey patient satisfaction. Drug Intelligence and Clinical Pharmacy, 1981, 15, 892-899.

WARE, J. E., JR., & DAVIES, A. R. Behavioral consequences of consumer dissatisfaction with medical care. Evaluation and Program Planning, 1983, 6, 291-297.

WARE, J. E., JR., DAVIES-AVERY, A., & STEWART, A. L. The measurement and meaning of patient satisfaction. Health and Medical Care Services Review, 1978, 1, 1-15.

WARE, J. E., JR., KANE, R. L., DAVIES, A. R., & BROOK, R. H. The patient role in assessing medical care process. Santa Monica, CA: The Rand Corporation, in press.

WARE, J. E., JR., MILLER, W. G., & SNYDER, M. K. Comparison of factor analytic methods in the development of health-related indexes from questionnaire data. (NTIS No. PB 239-517/AS.) Springfield, VA: National Technical Information Service, 1973.

WARE, J. E., JR., & SNYDER, M. K. Dimensions of patient attitudes regarding doctors and medical care services. Medical Care, 1975, 13, 669-682.

WARE, J. E., JR., SNYDER, M. K., & WRIGHT, W. R. Development and validation of scales to measure patient satisfaction with health care services: Volume I of a final report. Part A: Review of literature, overview of methods, and results regarding construction of scales. (NTIS No. PB 288-329.) Springfield, VA: National Technical Information Service, 1976. (a)

WARE, J. E., JR., SNYDER, M. K., & WRIGHT, W. R. Development and validation of scales to measure patient satisfaction with health care services: Volume I of a final report. Part B: Results regarding scales constructed from the patient satisfaction questionnaire and measures of other health care perceptions. (NTIS No. PB 288-330.) Springfield, VA: National Technical Information Service, 1976. (b)

WARE, J. E., JR., WRIGHT, W. R., SNYDER, M. K., & CHU, G. C. Consumer perceptions of health care services: Implications for academic medicine. Journal of Medical Education, 1975, 50, 839-848.

WINKLER, J. D., KANOUSE, D. E., & WARE, J. E., JR. Controlling for acquiescence response set in scale development. Journal of Applied Psychology, 1982, 67, 555-561.