Cultural Diversity Issues in the Development of Valid and Reliable Measures of Health Status

Judith González-Calvo, Virginia M. González, and Kate Lorig

Supported in part by State of California Department of Health grant no. 94-20403-AZ and National Institute of Nursing Research grant no. 1-R01-NR-03146. Judith González-Calvo, PhD, Professor of Women’s Studies, California State University, Fresno, and Maternal, Child, and Adolescent Health Research and Evaluation Consultant; Virginia M. González, MPH, and Kate Lorig, DrPH, Stanford Patient Education Research Center, Stanford University School of Medicine, Palo Alto, California. Address correspondence to Judith González-Calvo, PhD, Women’s Studies Program, 5340 North Campus Drive, MS #78, California State University, Fresno, Fresno, CA 93740-0078.

Submitted for publication June 7, 1997; accepted in revised form August 11, 1997. © 1997 by the American College of Rheumatology.

Introduction

A rapid growth of the United States Hispanic population is expected in the next decade. By the year 2000, Hispanics will constitute the largest minority group in the US, and already comprise 40% of the population in certain parts of California (1). The Southeast Asian population, while not growing at the rate of the Hispanic population, is very diverse both in culture and language. African-Americans are also a sizable group with unique health care needs. The ever-increasing cultural and ethnic diversity of the US necessitates a culturally competent approach to the delivery of health care. No longer can the assumptions of the Western medical model be applied across the board to all groups. In order to provide effective care, providers working among diverse groups must understand how to measure and assess health status, health care utilization, health behaviors, beliefs, and attitudes in these diverse groups. Culturally competent health care requires sensitivity to the differences that exist among groups, not only in outward behavior, but also in attitudes and in the meanings that one attaches to events such as pain, depression, disability, stress, and other health conditions. Culturally competent health care includes and respects such differences and incorporates these into a care plan that will produce well-being. Understanding how to measure and assess behaviors, attitudes, health status, and utilization among diverse groups ensures that health care and health research are more appropriate and valid. Addressing the needs of medically underserved groups necessitates attention to racial, ethnic, and cultural differences, to the socioeconomic and educational differences that exist between and within groups, and to differences between rural and urban settings. Even when controlling for race, there are substantial differences in health behavior and perceptions of healthiness between urban, middle-class whites and rural, poor whites. The development of instruments to measure such phenomena must take the milieu and educational level of the respondents into consideration. The measurement of physical health, mental state, and emotional health varies greatly by the ability of the individual to think in complex and abstract terms. In this era of managed care and decreased funding available to the public sector, research and health care delivery must be cost-effective. This requires that health care delivery be evidence-based, with measurable outcomes and processes, underscoring the importance of valid and reliable measurement tools that will elicit a true representation of the health care behaviors, therapeutic processes, and outcomes of the populations served. Researchers must pay increased attention to the cultural differences that underlie measurement and assessment. In this article, we discuss issues of cultural diversity that directly affect measurement and assessment.
Arthritis Care and Research

Instrument content

The assumption of cultural universality often leads to substantial bias and error in the construction of instruments and the interpretation of research findings. Careful attention must be paid to the purpose of the study and to the functioning not only of the instrument as a whole, but also of its individual items. To avoid errors made on the assumption of universality, it is important to conduct preliminary qualitative studies that involve collaborative work with individuals from the groups to be studied. Researchers should plan interviews and focus groups with the target group, during which individuals representing the culture or social class under study participate in a process of producing words and expressions that can be equated with the concept to be measured. In this way, items and scales are constructed that have content validity for that culture or group. Pools of items and appropriate wording can be produced using such strategies as free listing, frame technique, and card sorting to elicit wording and conceptual frameworks for new instruments (2). Free listing is a type of brainstorming in which the researcher begins by asking respondents to list all the items they believe are included in a group-recognized domain, such as symptoms of arthritis, descriptors of pain, signs of depression, or some other concept. These lists can also be rank-ordered from most to least along dimensions such as distressing, intense, common, or acceptable. The data generated by the free listing technique can be compiled and analyzed to determine the consistency with which respondents list items. The relative rankings of the items on the list are then analyzed. A coherent domain is one in which the ranking is statistically consistent across respondents and tends to be revealed by a relatively small number (e.g., 20-30) of informants (3,4). Frame techniques link items and provide the detailed information needed to construct relevant content domains (5). Specific to arthritis, one could ask a series of questions in either a closed or open-ended format.
An example of a closed-ended question might be: “Can ___ come from ___?” In the case of arthritis this may be: “Can arthritis come from eating certain foods, exposure to cold, fever, anger, etc.?” An open-ended question might be: “What do you think caused your arthritis (or any other illness)?” Frame tests are easy to design and can be used with a large number of respondents. Consultation with key informants or focus groups can help provide culturally relevant framing alternatives. Card sorting follows the free listing and frame techniques. Respondents are given a pack of cards with terms that have been generated from the free listing and frame techniques or through information from focus groups or key informants. These terms are written in the respondents’ native language. The respondents sort the cards into smaller piles according to whatever criteria make sense to them. At each sorting level, the criteria for association and subdivision are recorded. To summarize the information, a tree diagram can be constructed and questions can be asked to elicit words or phrases to explain the relationships between each level of sorting. These methods have been used extensively in cross-cultural research and instrument development. A more extensive discussion of these techniques can be found in Bernard (6).
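The claim that a coherent domain shows rankings that are "statistically consistent across respondents" can be checked with an ordinary concordance statistic. The sketch below uses Kendall's coefficient of concordance W as an illustration; it is not the specific consensus-analysis method of refs. 3 and 4, and it ignores tied ranks.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance across respondents' rankings.

    rankings: one list per respondent, giving each item's rank (1 = first).
    Returns W between 0 (no agreement) and 1 (identical rankings).
    Ties are not corrected for in this sketch.
    """
    m = len(rankings)            # number of respondents
    n = len(rankings[0])         # number of free-listed items
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three informants ranking four arthritis symptoms identically → W = 1.0
w = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
```

A W near 1 across a modest panel of informants is the kind of evidence of a coherent domain the text describes; a W near 0 suggests the items do not form a shared domain.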
Item structure

Questionnaires must be “user-friendly” in that items are worded to meet the educational or literacy level and colloquial style of the client. Bracken and Barona (7) provide a set of guidelines for item construction that applies both to original instruments and translated target instruments. Similar guidelines are also delineated in González et al (8). Suggested guidelines are:

1. Scale items should consist of simple sentences.
2. Pronouns should be avoided when they might obscure meaning and the validity of the response. For example, an individual might be asked whether or not he or she agrees with this statement: “Doctors should avoid giving drugs to arthritis patients because they are dangerous.” This could be interpreted in two ways: doctors are dangerous or drugs are dangerous. The individual may interpret it incorrectly, making the response invalid.
3. Items should not contain metaphors, idiomatic expressions, slang, or colloquialisms that might not be understood. For example, one would not use the term “guagua” for bus when translating bus into Spanish, because it means bus to Puerto Rican and Cuban Spanish speakers and a baby to Chileans. Thus, the term “autobús” would be preferable unless your sample was comprised entirely of Puerto Ricans or Cubans. Conversely, when designing an instrument to be used among various Spanish-speaking groups, one might offer options to the subject in order to be sure that all groups understand the meaning of the item (e.g., the use of “tina,” with the words “bañera” and “baño” in parentheses to signify bathtub). Similar dialectal differences should be addressed in any language. Informants who are native speakers of the target language should be consulted.
4. The passive voice and double negative structures that are common in other languages should be avoided. If some items need to be reversed in order to tap the underlying domain, write them as simple statements and then recode responses later to make these items directionally equivalent. For example, “How much pain was experienced?” is less understandable than “How much pain did you have?”
5. Hypothetical phrasing and the subjunctive mood should be avoided. These can obscure meaning. The respondents may guess and fail to convey their true feelings or attitudes. For example, an item worded as: “Given all of the days you might have felt pain, what would have been the worst intensity of this pain?” is better said as “Think of the past 3 days. How intense was the worst pain you felt during this time?”
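The recoding step mentioned in guideline 4 is mechanical. As a minimal sketch, assuming a 1-5 agreement scale, a reversed item is made directionally equivalent by subtracting each response from one more than the scale maximum:

```python
def recode_reversed(response, scale_max=5):
    # On a 1-5 scale, a 5 ("strongly agree") on a reversed item becomes 1, and so on.
    return scale_max + 1 - response

responses = [1, 2, 3, 4, 5]
recoded = [recode_reversed(r) for r in responses]  # [5, 4, 3, 2, 1]
```

Doing the reversal in analysis code, rather than in the item wording, keeps the item itself a simple statement as the guideline recommends.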
In essence, the principle of “simple and direct” should apply both to source and target language versions of your instrument. Comprehension, clarity, and communication are essential. Quality of the source language instrument is critical. Because translation is expensive, time-consuming, difficult, and prone to error, the source document must be of highest quality before translating it into another language or dialect. Reliability is dependent upon items that maximize true variance and minimize error. Translation of a poorly constructed source instrument will only exacerbate errors. Most current instrumentation requires command of standard English and the understanding of abstract concepts such as coping, self-esteem, and the like. Even when items are translated into the native language of the respondents, the difficulty of the item may confuse the respondent. Therefore, to achieve conceptual equivalence, it is important to employ back translation (9). This involves at least two competently bilingual translators, one whose native language is the source language and one whose native language is the target language. The instrument is first translated into the target language by the translator whose native language is the target language and then translated back into English (usually the source language) by the other translator. The first and third versions, both in the source language, must be conceptually equivalent. Because most languages, including English, have several dialects and uses of slang, it is important that back translation be done in “standard language” such as that used in the media. The exception to that dictum might exist when using instruments within African-American populations, where some of the words used in standard English have different or augmented meaning.
Vol. 10, No. 6, December 1997

For example, the term “attitude” has a varied and complex meaning among African-Americans that is not readily understood by whites (10). Among African-Americans, “attitude” conveys several things, such as proud presentation to the public, toughness, or coolness, as well as its negative connotation of pessimistic thinking or a brash, “chip on the shoulder” presentation. The tone of the word in conversation and context changes the meaning. Thus, items using the word “attitude” might be misinterpreted by African-Americans to mean something more than the researcher had intended. A high-quality translation that will be conceptually equivalent should include the following steps proposed by Bracken and Barona (7):

1. Source to target language translation: a preliminary translation by a thoroughly bilingual and trained translator who is familiar with the concepts being measured. Translators should come from the community for whom the instrument is being developed.
2. Blind back translation: the use of another translator who is not familiar with the instrument in its source language. The back translation should be compared to the original version of the instrument in grammatical structure, comparability of concepts, level of word complexity, and overall similarity in meaning, wording, and format.
3. Translation and back translation should be repeated as necessary to reduce the discrepancies that exist between the original version and the back translation.
4. The instrument should be subjected to examination by a multinational or multiregional bilingual review committee to ensure that the translation is sensible for all respondents who might be expected to participate in the study.
5. Pilot testing involves use of a small sample of respondents from the target population to elicit their subjective reactions to the instrument. Particular attention should be paid to words or phrases that systematically fail to elicit the appropriate response. Looks of puzzlement, embarrassment, laughing, disbelief, anger, resistance, confusion, or any other emotional or verbal response that indicates the inappropriateness of an item should be noted. These notes should be taken back to the bilingual review committee to explore what is occurring and to modify the translation as needed.
6. Field-testing is simply another pilot test with a larger sample. Assuming a sufficiently large sample, formal item analyses can be conducted to determine if an item functions differently in various linguistic and cultural settings.
Another important and often overlooked aspect of conceptual equivalence is the level of syntactic complexity and vocabulary. If the back translation reveals a much more complex rendering, or if the target language version appears to be too complex, an additional step should be taken to simplify meaning. Idioms and slang cannot be translated and should be avoided in item construction.
The educational level of the translator can affect the complexity and abstractness of item translations. For example, college-educated native speakers of another language who are asked to translate English items into their native language may do so at a level of complexity and vocabulary that is difficult for others with limited formal education to understand. The resulting instrument may retain conceptual equivalence to the original English instrument, but will be too complex in terms of structure and vocabulary. Therefore, in working with translators, it is important to have the translation examined by other native speakers of the target language who are familiar with the different ways of speaking within the target group. These could be paraprofessional outreach workers or service providers who have frequent contact with the population and are conversant in its style. These individuals should be part of a bilingual review committee. This translation process can be applied with any language group, provided that you have resources to employ at least two translators and can ask for input from others, such as outreach workers, service providers, or helpful members of the target community. In working with a scale of depression, social support, or any other concept, the items must be analyzed using samples representing all groups with whom the researcher will be working to determine if the items in fact tap the underlying domain. Item statistics that vary among groups indicate that the item in question is actually measuring different constructs in the different groups. There are some examples from research that illustrate this divergence across cultures. For instance, in developing a Spanish language version of the Center for Epidemiological Studies Depression Scale (CES-D), investigators identified a few items that did not appear to represent the underlying domain of depression.
They found that asking respondents to rate how frequently they felt they were “as good as others” had a negative correlation with the overall scale for Spanish speakers, but not for English speakers (8). Low feelings of self-worth are considered clinically symptomatic of depression, but the self-worth item did not function as expected with Spanish speakers. It is possible that this almost idiomatic expression of self-worth in English has a different meaning unrelated to depression among Spanish speakers. A negative response in English is related to self-esteem, while a negative response in Spanish may be a culturally approved avoidance of superfluous insight or self-aggrandizement. Other studies of depression among Hispanics (2) and social support and health status among Navajo women (1) have also demonstrated a divergence in how these
concepts are perceived and expressed between these groups and non-Hispanic whites.
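A simple numerical screen for items of this kind is the corrected item-total correlation, computed separately for each group. The sketch below uses invented three-item responses; a near-zero or negative correlation in one group, as with the "as good as others" CES-D item, flags the item for review rather than proving cultural divergence.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def corrected_item_total(responses, item):
    """Correlate one item with the sum of the remaining items."""
    item_scores = [r[item] for r in responses]
    rest_scores = [sum(r) - r[item] for r in responses]
    return pearson(item_scores, rest_scores)

# Invented 3-item responses for two language groups (illustration only).
group_a = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
group_b = [[4, 1, 1], [3, 2, 2], [2, 3, 3], [1, 4, 4]]  # item 0 runs against the rest

flagged = corrected_item_total(group_b, 0) < 0  # item 0 warrants review in group B
```

An item whose item-total correlation is strongly positive in one group and negative in another is behaving as two different constructs, which is exactly the situation described above.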
Item response categories

An assumption of scaling is that there should be at least 4-5 response categories, requiring the individual to make fine distinctions along a continuum. When individuals have not had the chance to think about the item, it is difficult to make fine distinctions of agreement, frequency, or intensity. Even within groups that speak the source language, rating scales that are excessively complex can lead to reliability problems. This is especially apparent in studies using groups with limited educational attainment, such as low-income inner-city women or semi-literate adults. Our research among African-American women illustrates the tendency to rate items either at the extremes or at the neutral or center point when answering questions about stressors, coping styles, and dimensions of social support. There is considerable disagreement on the optimal number of response choices. Matell and Jacoby (12) found that conversion of response categories to dichotomous or trichotomous measures does not result in any significant decrement in validity and reliability. Other researchers (13-16) contend that test-retest and internal consistency are independent of the number of scale points. However, for outcome measures such as pain intensity, depression, or self-efficacy, too few response categories may result in problems with the sensitivity of the instrument to measure theoretically or clinically important variations. When fine gradations of depth, frequency, or intensity are necessary, the item should be tied to a concrete system, asking “how many days” instead of “how often” you have felt ___. There are clear instances when finer distinctions must be made, such as how much and how often one is feeling severe, moderate, or mild pain, or how much time one spends per week doing different types of exercise.
We recommend that the number of response categories be reduced only when it will increase clarity of meaning and, by extension, reliability, provided that the sensitivity of the measurement is not sacrificed. Because there are statistical methods that can deal with categorical data, it is preferable to opt for reliability and accuracy rather than proliferation of inaccurate responses. Also, when items are combined to create scales, the resultant summed scores yield interval level data. Thus, statistical methods that assume interval or ordinal level data can be used. The trade-off between number of response categories and reliability of data is especially critical in research or diagnostic settings where it is important to retain fine distinctions between levels of a variable or risk factor. Translation of rating
scales introduces another element of concern, that is, the equivalency of terms from language to language. Jones and Kay (17) discuss the semantic difference between “moderately” as a descriptor in English and “moderadamente” in Spanish. “Bastante” (enough) is actually closer in meaning to “moderately” than the readily apparent cognate “moderadamente.” Inaccuracies in translation will misrepresent the semantic distance between categories in the original instrument. Often, a trade-off must be made between gradations of meaning and accurate responses when creating rating scales that will reliably measure intensity, frequency, or strength. For example, if one knows from previous experience that individuals cannot make distinctions between responses such as “strongly agree” and “agree,” then an alternative set of choices must be found. If your purpose is to diagnose and assess for intervention, then a “yes” or “no” response may be preferable to vague middle categories or to inaccurate responses on a more complex rating scale. Reductions of 5-point scales to 3 or 4 points may also be preferable. The major drawbacks to less complexity are reduced sensitivity of the measure and limitations on the statistical methods that can be used with categorical versus ordinal or interval level data.
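When a 5-point scale is reduced to 3 points, an explicit mapping keeps the decision documented and reversible in the analysis code. A sketch follows; the particular grouping shown is one plausible choice, not a prescription.

```python
# Collapse 1-5 agreement responses into disagree / neutral / agree.
COLLAPSE_5_TO_3 = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def collapse(responses):
    """Map a list of 1-5 responses onto the coarser 1-3 scheme."""
    return [COLLAPSE_5_TO_3[r] for r in responses]

coarse = collapse([1, 2, 3, 4, 5])  # [1, 1, 2, 3, 3]
```

Collapsing in code rather than on the printed form preserves the option of re-analyzing the original fine-grained responses if sensitivity turns out to matter.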
Validity

Several types of validity must be considered in the development of useful measures. The most important are content, construct, and criterion validity. Criterion validity is often subdivided into predictive, concurrent, and postdictive validity to designate the temporal ordering of the variables used as measures and those used as criterion references. This article only deals with validity specifically related to cross-cultural issues. Complete discussions of validity are provided by DeVellis (18) and Nunnally and Bernstein (19). Content validity. Content validity concerns the extent to which a set of items reflects an underlying domain of content or attribute. Given a concept, such as self-esteem, self-efficacy, stress, or pain, each is comprised of potentially countless items within a content domain. A measure that possesses content validity will be comprised of items that have been adequately sampled and are representative of the underlying variable being measured. For example, in attempting to create a measure of arthritis health beliefs and behaviors, it would be important to include items that tap culturally specific beliefs and behaviors with
regard to arthritis pain management, etiology, and beliefs about its cure or treatment. Construct validity. Construct validity concerns the degree to which a set of items “behaves” as expected in relation to other variables that are theoretically supposed to relate to these items. Based on the theory that a researcher or psychometrician has developed, one may build hypotheses regarding relationships among latent constructs or variables, such as coping, self-efficacy, stress, or shoe size. Let us suppose, for example, that we expected Construct A (problem-solving coping strategies) to be negatively related to Construct B (stress levels), positively related to Construct C (self-esteem), and unrelated to Construct X (shoe size). This set of expectations would emerge from theory and be set a priori. Now, given a scale with a set of items, we should find a positive correlation between Construct A’s and C’s measures and a negative correlation between Construct A’s and B’s measures. We should find no correlation between Construct A’s and Construct X’s measures. The construct validity of a measure, however, may be affected by cultural differences in how variables relate to each other. Our Western, European-based society places a high value on logical, critical thinking and problem-solving strategies; therefore, the dominant coping style among middle-class Americans might be to solve problems, and the greater the ability to do so, the more valued one might feel. If we were to test a group of educated, middle-class professionals, we might find a positive correlation between high scores on this coping dimension and high self-esteem scores. Conversely, we might expect to find a negative correlation between problem-solving coping and stress measures. This relationship, however, even if the scales have satisfactory construct validity in this setting, might not hold up in different cultural settings.
For example, in a society that values prayer, resignation, or deflection strategies of coping, there might be no correlation between problem-solving coping and self-esteem, especially if such skills had little to do with one’s perceived self-image. We might also find that the problem-solving coping scale would not relate to stress. Laungani (20) discusses this difference in his research on stress management among Britons and East Indians. He alludes to the very different styles of coping that emerge between Western Europeans and East Indians, one relying on active problem solving and the other on bearing suffering and seeing problems as part of the vicissitudes of life. In a study examining coping among Hispanic women with arthritis, Abraido-Lanza et al (21) also found some cultural differences in coping styles. Hispanic women tended to use religion or prayer more when compared to a sample of non-Hispanic women, and they relied more on family than on friends for support. Whether or not these differences cast doubt on the construct validity of a coping scale is not certain. However, such differences do necessitate the redesign and expansion of what is considered coping.
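Because the expected pattern of correlations is set a priori, it can be checked mechanically once data are collected. The sketch below uses invented scores, and the 0.3 threshold separating "related" from "unrelated" is an arbitrary illustration, not a standard.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Invented scale scores for five respondents (illustration only).
scores = {
    "A": [1, 2, 3, 4, 5],   # problem-solving coping
    "B": [5, 4, 3, 2, 1],   # stress
    "C": [2, 3, 4, 5, 6],   # self-esteem
    "X": [1, 3, 1, 3, 1],   # shoe size (should be unrelated)
}

# Hypothesized pattern, fixed before looking at the data.
expected = {("A", "B"): "negative", ("A", "C"): "positive", ("A", "X"): "none"}

def sign_ok(r, expectation, tol=0.3):
    # Does the observed correlation match the hypothesized direction?
    if expectation == "positive":
        return r > tol
    if expectation == "negative":
        return r < -tol
    return abs(r) <= tol

results = {(a, b): sign_ok(pearson(scores[a], scores[b]), exp)
           for (a, b), exp in expected.items()}
```

A sign reversal or a vanished correlation in a new cultural group would prompt the kind of re-examination of the construct described above, rather than an automatic conclusion that the scale is invalid.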
Reliability

Reliability is a crucial issue in psychological and social measurement (18,19,22). Reliability is a characteristic of multi-item scales, and as such cannot be applied to single items measuring a given construct. Scale reliability is the proportion of variance attributable to the variable being measured. The higher the proportion of variance attributable to actual variation in the variable, the less is attributable to measurement error; thus, the scale in question becomes more reliable as this proportion increases. There are several methods available to compute reliability, and most are contained in statistical software packages such as SPSS, BMDP, or SAS. Reliability must be considered at all levels of instrument design and testing. Once reliability is established for an instrument in a source language, the same iterative process of pilot testing and re-testing must be applied to the translated scale. It is a common mistake to assume that the reliability of the source version carries over automatically to the target instrument.
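Coefficient alpha is one standard estimate of this variance proportion. The packages named above compute it; the sketch below exists only to make the formula concrete, using population variances and invented responses.

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(responses):
    """Coefficient alpha for a multi-item scale.

    responses: one list per respondent, one score per scale item.
    """
    k = len(responses[0])                                        # number of items
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    total_var = variance([sum(r) for r in responses])            # variance of scale totals
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])  # perfectly parallel items → alpha ≈ 1.0
```

When item covariances shrink, as when an item taps a different construct in a different group, the total variance falls toward the sum of the item variances and alpha drops accordingly.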
Test-retest reliability. Test-retest reliability involves administration of the same instrument at two points in time to the same sample. For example, one might design a scale to measure pain or depression and readminister the same scale 2 to 4 weeks later to the same group. The scores on the first occasion are correlated with the scores on the second administration. The correlation of scores obtained across two administrations to the same individuals should represent the degree to which the latent variable determines observed scores. There are problems inherent in test-retest reliability measures. The stability of scores over time may be confounded by other factors, such as: 1) real change in the group in the construct of interest; 2) systematic cycling as a function of some outside factor (such as anxiety being affected by study setting, person administering the test, time of day, or other outside factors); 3) changes attributable to subjects, such as increased fatigue causing items to be misread; or 4) temporal instability due to inherent unreliability of the instrument. Only item 4 is evidence of poor reliability. Thus, it is important to use various methods of assessing reliability, since
test-retest correlations only tell us something about reliability when we are highly confident that the latent construct is truly invariant or stable over time. The assumption of invariance and stability is often not attainable in practice; this is made more complex when cultural diversity is introduced into the study. The underlying construct may be more or less invariant over time between cultures. If the researcher is not familiar with the culture, it is unlikely that he or she will have complete confidence in the temporal stability of the variable being measured.
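The test-retest correlation itself is a simple computation; the interpretive cautions above, not the arithmetic, are the hard part. A sketch with hypothetical scores from two administrations:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

week0 = [12, 18, 9, 22, 15]   # hypothetical pain scores, first administration
week3 = [13, 17, 10, 21, 16]  # same respondents, three weeks later
stability = pearson(week0, week3)
```

A high value here is only evidence of reliability under the assumption, discussed above, that the construct itself was stable over the interval for this cultural group.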
Measurement sensitivity and statistical power

Much of the research on treatment effectiveness fails to detect intervention change because of the lack of sensitivity of the measures used to capture variability and change (23). When the measure of interest is designed with careful attention to validity and reliability, it is also necessary to determine that the measure has adequate sensitivity. (A complete discussion of the issue of statistical power and its relationship to treatment effect can be found in Lipsey [ref. 23].) Sensitivity refers to an instrument’s ability to detect fine changes or variability in the variable being measured. Finer gradations in response categories allow for greater sensitivity and increase the ratio of true variability to measurement error. For example, a categorical measure of pain, either no pain or pain present, is far too blunt and allows for considerable measurement error and failure to detect important change over time or between treatment and control groups. With such a crude measure, a study has limited power to detect change, leading to substantial errors in hypothesis testing. If, on the other hand, one employs a visual analog scale or uses a 1-10 scale rating of pain, the instrument is much more sensitive and allows for greater precision and power to detect change due to an intervention. When it is imperative to detect small magnitudes of change, it is important to work with respondents to ensure that they are able to understand the rating scale. The response choices must be translated in a way that is an accurate representation of the English language set of responses. Maximizing validity, reliability, and sensitivity has a very important benefit: increased statistical power for a given sample size, or equivalent statistical power with a smaller sample size. Thus, if careful attention is given to these issues, research may be more cost-effective in the long run.
A more reliable and sensitive instrument contributes less error in statistical analysis.
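Classical test theory makes this reliability-power link explicit through Spearman's attenuation formula, which is background to the discussion rather than from the article itself: the observable correlation between two measures is the true correlation shrunk by the square root of the product of the measures' reliabilities.

```python
def observed_r(true_r, rel_x, rel_y):
    # Spearman's attenuation formula: unreliability in either measure
    # shrinks the correlation that can actually be observed.
    return true_r * (rel_x * rel_y) ** 0.5

r_obs = observed_r(0.50, 0.70, 0.70)  # ≈ 0.35, a markedly weaker observable effect
```

With a true correlation of .50 but reliabilities of only .70, the observable correlation drops to about .35, so a larger sample is needed to detect the same underlying effect; this is the sense in which a more reliable instrument buys statistical power.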
Sociocultural setting

An often overlooked dimension of measurement and assessment is the setting in which a respondent is situated. The larger systems represented by the subject and the administrator or researcher either interlock or are mismatched. A health care provider or researcher working with a culturally or socially different group may encounter resistance, mistrust, or excessive acquiescence in responses. In many cultures, the concept of “host” versus “guest” affects the responses of the person who receives an interviewer into his or her home: they may wish to appear gracious, generous, or in control. This may increase socially desirable responses to interview questions. The relationship between the subject and the researcher is a microcosm of the systems represented by each. Research among populations that are marginalized, medically underserved, or transient may be extremely difficult because of the social status or cultural divergence between the researcher and subject. A public health nurse interviewing a homeless teen may find it impossible to proceed until some rapport can be established. The status differences between subject and researcher or clinician may be the most salient factor, obscuring even the cultural barriers. Gender and age differences between researcher and subject are also important. For example, a young pregnant woman with rheumatoid arthritis may feel more comfortable answering questions about her health condition with a female interviewer, while an older person with osteoarthritis might prefer someone who is considered a peer. If the researcher or health care provider is viewed as a representative of a hated or mistrusted system, it is likely that responses will either be missing or deliberately inaccurate. In one of our experiences with public health nurses, we found that only when the nurses had established trust with their clients could they begin to administer an assessment of psychosocial risk factors among African-American and Hispanic pregnant women. The manifestation of mistrust, however, was very different between the two groups.
Among the African-American women, initial answers to self-esteem inventories netted overly confident and inflated responses, whereas the Hispanic women responded with modest and cautious underestimates of self-esteem. Only after considerable time and acquaintance with the women were nurses able to assess self-esteem accurately. In another study, measuring self-efficacy to change behaviors among high-risk African-American women, this same tendency to respond at the high end of the scale was found (24). When the respondent is of the same ethnicity and social class as the interviewer, or the organization sponsoring the research is trusted, there is an interlocking of systems that facilitates more accurate disclosure of information. Therefore, to reduce the mismatch between the respondent's and the researcher's "worlds," it is preferable to train paraprofessionals or community workers with the same ethnicity, gender, and educational levels as the respondents to gather data. These individuals, however, should be chosen carefully, and not solely on the basis of race, social class, or gender; coming from the same cultural or social group as the respondents does not ensure that a person is culturally competent.

Vol. 10, No. 6, December 1997

The immediate setting in which an instrument is administered can also affect reliability. The presence of children, other family members, or even the interviewer can affect the respondent's feelings, ability to concentrate, and willingness to disclose information. For example, information about frequency of substance use, sexual behavior, or other intimately personal items may be unduly affected by the presence of one's spouse, parents, or children in an interview setting.

Timing also plays a part in the administration of an instrument. When the subject matter is intrusive or emotionally charged, two dimensions of time must be considered: 1) the response time needed to answer questions, and 2) the sequencing and timing of questions. Sufficient time must be given to answer questions, and questions should be sequenced from lesser to greater levels of intrusiveness. Demographic information of a general nature should precede information that might be considered more private or risky to disclose. This sequencing must be balanced so that the most important information is gathered before the respondent becomes fatigued, especially when there is only one chance to interview him or her, as in telephone surveys. The fatigue factor is less of a concern when an ongoing therapeutic or research relationship is established over several interview sessions; one can take a break when fatigue sets in and pursue the next set of questions at a later meeting.
Arthritis Care and Research

Physical setting of instrument administration
The physical setting contributes to the level of comfort achieved in a research setting. Whenever possible, the subject should have the choice of setting: at home, at a clinic, at another location chosen by the respondent, or by phone. In the home, privacy, quiet, comfortable seating, and good lighting are essential. It is difficult to administer a scale or conduct an interview when children are actively playing in the same room, when one's spouse is present, or when the television is on. Often, the clinic is a better setting because other family members are not immediately present; in a clinic there is also the expectation that confidentiality will be protected and that it is appropriate to ask health-related questions. It is imperative to inquire how the individual feels about a particular setting and to ensure respect, privacy, confidentiality, and the choice of whether or not to answer any question. This is standard procedure in any research or diagnostic setting and is part and parcel of all informed consent.

Many times, audiovisual recording equipment is used to gather data. This may be frightening or offensive to some subjects who are culturally different from the researcher or who do not trust recording instruments. It may be particularly true among some refugee groups who have experienced government surveillance and persecution. If there is apparent fear, it is important either to remove the recording equipment or to reassure the informant that the researcher does not represent any government entity and that the tapes will be destroyed as soon as the data are extracted. In addition to fear, some subject matter, such as sexual behavior or important cultural values, may be embarrassing to discuss. Recording devices may intensify this embarrassment, adversely affecting disclosure and the ability to concentrate.
Form of administration
The form of administration affects two primary facets of data gathering: the ability to cloak one's identity and the ease with which the items may be understood and answered by the subject. This section discusses the major forms of administration.

Face-to-face interviewing. Face-to-face administration is most common in assessment, since its diagnostic purpose is often achieved in a clinical setting where the provider and the client are well known to each other and clearly identified. Because of the intimacy involved in face-to-face interviewing, caution must be exercised when eliciting responses. Body language, coaxing, and other forms of nonverbal communication can affect the reliability of the data.

Telephone interviewing. Telephone interviewing adds a layer of anonymity, but often is not feasible among medically underserved, marginal, or transient groups who may lack access to a telephone, although this is becoming much less of a problem as more people gain access to phones. The most common application of telephone interviewing is in marketing research, although it has been used in health research as well (25,26). When a subject is interviewed by telephone, there may be problems with comprehension, because the individual cannot see the instrument and must remember the entire question and the response categories. This may be unrealistic among groups who have limited vocabulary skills or who are not native speakers of the language being used in the interview. There is some evidence, however, that telephone interviewing is effective among low-income populations with less than a high school education when the interview is conducted in the interviewee's native language (25).

Cultural Diversity and Measurement of Health Status

Paper and pencil instruments. Paper and pencil instruments are common in measurement and assessment. They are essentially of two kinds: 1) self-administered, and 2) administered by an interviewer. Among samples of subjects with low educational attainment and/or limited literacy skills, interviewer administration may be the method of choice, because it enables clarification of concepts and verbal or visual repetition of the appropriate response categories. It is not, however, without limitations. Many disenfranchised social groups may not trust the process and are reluctant to see their responses committed to paper. In work with public health nurses, one of the authors found that many clients refused to answer questions because they thought the responses would be written down and used against them later. The comment "I don't do paperwork" from one respondent illustrates this level of distrust, especially since this same woman later revealed a history of physical abuse during pregnancy in a casual conversation with a nurse, a history that had not been disclosed during the formal assessment. The inability to write responses down creates reliability problems: the interviewer may be able to jot down only sketchy notes while trying to remember how the respondent answered particular questions. With rating scales, this creates almost insurmountable reliability problems, because it is not possible to remember how a respondent rated each item. When there is an absolute refusal to allow a response to be recorded, there are two choices: leave the question blank, or wait until trust is developed and try again.
If the respondent refuses to answer numerous questions, you may have to walk away and count this person among your “nonrespondent” group. In a therapeutic setting, you may have another opportunity to gather data as the relationship with the client becomes more established. In some research settings, such as in clinical trials and intervention studies, the researcher may be able to establish a higher level of trust and rapport than a survey researcher. Most clinical trials and interventions involve repeated contact with subjects and repeated measures of the predictor and outcome variables. In these cases, the similarity to the clinical setting is greater than for those conducting survey research.
Mail surveys and self-administered instruments. Mail surveys are also commonly used, but certain cautions are in order for groups with limited English skills or low literacy levels: instruments must be carefully designed to be simple, with straightforward response categories and little need for clarification. Recent evidence suggests that the effectiveness of self-administered instruments among low-income minority groups with limited education can be increased with vigorous follow-up to boost mail survey response rates and to capture missing data (25).
Summary
The development of instruments for use in culturally diverse settings and populations involves much more than mere translation. Measurements must be tested for content validity and appropriate meaning among members of the group to be studied. Attention to issues of validity, reliability, and cross-cultural differences will lead to effective assessment, culturally competent health care, and the enhancement of the client/provider relationship. The concerns surrounding the use of quantitative measurement in diverse cultural groups are substantial. While the refinement of scales to meet the needs of various groups is a challenging task, such effort is essential to the diagnosis of disease, the determination of health status, and the measurement of health outcomes in the diverse subgroups of this country's population.
REFERENCES
1. California State Department of Health. Race/ethnic estimates by county, 1990-1995. Sacramento (CA): State Department of Finance; 1996.
2. Hines AM. Linking qualitative and quantitative methods in cross-cultural survey research: techniques from cognitive science. Am J Community Psychol 1993;21:729-46.
3. Weller S, Romney A. Metric scaling: correspondence analysis. Newbury Park (CA): Sage Publications; 1990.
4. Weller S, Romney A. Systematic data collection. Beverly Hills (CA): Sage Publications; 1988.
5. Garro LC. Intracultural variation in folk medicine knowledge: a comparison between curers and noncurers. Am Anthropologist 1986;88:351-70.
6. Bernard HR. Research methods in cultural anthropology. Beverly Hills (CA): Sage Publications; 1988.
7. Bracken BA, Barona A. State of the art procedures for translating, validating and using psychoeducational tests in cross-cultural assessment. School Psychol Int 1991;12:119-32.
8. Gonzalez VM, Stewart A, Ritter PL, Lorig K. Translation and validation of arthritis outcome measures into Spanish. Arthritis Rheum 1995;38:1429-46.
9. Brislin RW, Lonner WJ, Thorndike RM. Cross-cultural research methods. New York: Wiley; 1973.
10. San Joaquin County Public Health Black Women's Initiative. Use of the word "attitude" among African-Americans. Stockton (CA): San Joaquin County Public Health Black Women's Initiative; 1997.
11. Higgins PG, Dicharry EK. Measurement issues addressing social support with Navajo women. West J Nurs Res 1991;13:242-55.
12. Matell MS, Jacoby J. Is there an optimal number of alternatives for Likert scale test items? Study I: reliability and validity. Educ Psychol Meas 1971;31:657-74.
13. Bendig AW. Reliability and the number of rating scale categories. J Appl Psychol 1954;38:38-40.
14. Cronbach LJ. Further evidence on response sets and test design. Educ Psychol Meas 1950;10:3-31.
15. Komorita SS, Graham WK. Number of scale points and the reliability of scales. Educ Psychol Meas 1965;25:987-95.
16. Peabody D. Two components in bipolar scales: direction and extremeness. Psychol Rev 1962;69:65-73.
17. Jones EG, Kay M. Instrumentation in cross-cultural research. Nurs Res 1992;41:186-8.
18. DeVellis RF. Scale development: theory and applications. Newbury Park (CA): Sage Publications; 1991.
19. Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
20. Laungani P. Cultural differences in stress and its management. Stress Med 1993;9:37-43.
21. Abraido-Lanza AF, Guier C, Revenson TA. Coping and social support resources among Latinas with arthritis. Arthritis Care Res 1996;9:501-8.
22. Ghiselli EE, Campbell JP, Zedeck S. Measurement theory for the behavioral sciences. San Francisco: Freeman; 1981.
23. Lipsey MW. Design sensitivity: statistical power for experimental research. Newbury Park (CA): Sage Publications; 1990.
24. Gonzalez-Calvo J. Report on initial psychometric evaluation of self-efficacy scales measuring health care among high-risk African American pregnant and postpartum women. Stockton (CA): Black Women's Initiative, County of San Joaquin Public Health Department; 1997.
25. Lorig K, Chastain RL, Ung E, Shoor S, Holman HR. Development and evaluation of a scale to measure perceived self-efficacy in people with arthritis. Arthritis Rheum 1989;32:37-44.
26. Marin G, van Oss-Marin B, Perez-Stable EJ. Feasibility of a telephone survey to study a minority community: Hispanics in San Francisco. Am J Public Health 1990;80:323-6.