
International Journal for Quality in Health Care 1999; Volume 11, Number 1: pp. 21–28

Development and application of a generic methodology to assess the quality of clinical guidelines

FRANÇOISE A. CLUZEAU1, PETER LITTLEJOHNS1, JEREMY M. GRIMSHAW2, GENE FEDER3 AND SARAH E. MORAN1

1Health Care Evaluation Unit, St George's Hospital Medical School, London, 2Health Services Research Unit, University of Aberdeen and 3Department of General Practice and Primary Care, St Bartholomew's and the Royal London School of Medicine and Dentistry, London, UK

Abstract

Background. Despite clinical guidelines penetrating every aspect of clinical practice and health policy, doubts persist over their ability to improve patient care. We have designed and tested a generic critical appraisal instrument that assesses whether developers have minimized the biases inherent in creating guidelines and addressed the requirements for effective implementation.

Design. Thirty-seven items describing suggested predictors of guideline quality were grouped into three dimensions covering the rigour of development, clarity of presentation (including the context and content) and implementation issues. The ease of use, reliability and validity of the instrument were tested on a national sample of guidelines for the management of asthma, breast cancer, depression and coronary heart disease, with 120 appraisers. A numerical score was derived to allow comparison of guidelines within and between diseases.

Results. The instrument has acceptable reliability (Cronbach's α coefficient, 0.68–0.84; intra-class correlation coefficient, 0.82–0.90). The results provided some evidence of validity (Pearson's correlation coefficient between appraisers' dimension scores and their global assessment was 0.49 for dimension one, 0.63 for dimension two and 0.40 for dimension three). The instrument could differentiate between national and local guidelines and was easy to apply. Performance varied, with most guidelines not achieving a majority of the criteria in each dimension.

Conclusions. Use of this instrument should encourage developers to create guidelines that reflect relevant research evidence more accurately. Potential users, or groups adapting guidelines for local use, could apply the instrument to help decide which one to follow. The National Health Service Executive is using the instrument to assist in deciding which guidelines to recommend to the UK National Health Service. This methodology forms the basis of a common approach to assessing guideline quality in Europe.

Keywords: appraisal, clinical guidelines, instrument, quality, reliability, validity

Clinical guidelines are now ubiquitous in every aspect of clinical practice and health policy. They are expected to fulfil a myriad of roles from increasing the uptake of research findings [1] to facilitating the rationing of health care [2]. Whilst there is evidence that guidelines can improve clinical practice, their successful introduction is dependent on many factors, including the clinical context, methods of development, dissemination and implementation [3]. Successfully addressing all of these issues in routine practice can prove difficult [4], but is necessary if guidelines are to improve the quality of health care [5].

An increasing concern is the number of disease-specific guidelines that offer inconsistent advice [6,7]. Many reasons have been put forward to explain this variability, ranging from a lack (or differing interpretations) of underlying research findings and different values attached to anticipated outcomes (for example, clinical versus economic) to dubious achievement of consensus and possible bias introduced through conflicts of interest. Faced with this diversity, potential users will want to make an informed choice. However, the information on which to base this judgement is often lacking [8,9]. Ideally, data from a formal evaluation of the ability of the guidelines

Address correspondence to Françoise Cluzeau, Health Care Evaluation Unit, St George's Hospital Medical School, Cranmer Terrace, London, SW17 0RE, UK. Tel: +44 181 725 2771. Fax: +44 181 725 3584. E-mail: [email protected]

© 1999 International Society for Quality in Health Care and Oxford University Press



to bring about the anticipated health outcomes when adhered to (defined as validity [10]) would be available; in reality, there is a virtual absence of this type of outcome data for most guidelines. Moreover, when results from carefully controlled randomized trials of guideline implementation strategies are available, they may not necessarily be generalizable to a routine clinical setting [11]. In the absence of appropriate outcome indicators on which to judge effectiveness, most assessments of clinical quality substitute process and structural criteria [12]. Indeed, this is often the most practical way to assess quality of care on a routine basis [13]. Using this approach to the assessment of guidelines requires determining whether guideline developers have been rigorous in minimizing the potential biases in creating the guideline [14], in essence, critically appraising guidelines.

There is increasing published work on how to critically appraise primary research and reviews [15–17], stimulated by the Cochrane Collaboration [18]. However, the application of this approach to guidelines is in its infancy. In 1992 the Institute of Medicine (IOM) started the process by developing a provisional, if unwieldy, appraisal instrument based on 'desirable attributes' of good guidelines [19]. Subsequently, shorter checklists were produced in Canada [20] and Australia [21], but their usefulness has never been formally assessed.

In June 1993, the UK National Health Service Management Executive organized a workshop to explore the issues around assessing the quality of guidelines. A research programme was initiated to produce a generic instrument to appraise guidelines. The instrument should be capable of being applied by anyone interested in assessing guidelines (general or specialist clinicians, health care managers, and researchers) and should allow comparison between guidelines.
This paper describes the creation of the instrument, an assessment of its validity and reliability, and a description of the quantity and quality of UK guidelines for the management of coronary heart disease, depression, breast cancer and asthma.

Methods

Appraisal instrument

The purpose of the appraisal instrument is to assess the extent to which clinical guidelines are 'systematically developed' [22] and take into account known determinants of effective strategies for dissemination and implementation. Initially, the reliability and face validity of the IOM instrument were tested on five UK guidelines with seven appraisers in a pilot study [7]. Based on these results, potential questions for a simplified appraisal tool were circulated for comment to individuals interested in guideline development. The revised list contained 37 items (see Appendix). These address different aspects and are categorized into three conceptual dimensions which could be mapped to the IOM attributes.

The first dimension, rigour of development, reflects the attributes necessary to enhance guideline validity and reproducibility. It contains 20 items and assesses responsibility and endorsement for the guidelines, the composition of the development group, identification and interpretation of evidence, the link between evidence and main recommendations, peer review and updating. The second dimension, context and content, contains 12 items addressing the attributes of guideline reliability, applicability, flexibility and clarity. It assesses the aims of the guidelines, the target population, circumstances for applying the recommendations, presentation and format of the guidelines, and estimated benefits, harms and costs. The third dimension, application, contains five items addressing the implementation, dissemination and monitoring strategies. All three dimensions assess the adequacy of documentation.

Each item inquires whether information is present and then requires a judgement about the quality of the information. The specific questions demand 'yes', 'no' or 'not sure' answers; an option for 'not applicable' is available for some items. To ensure that the questions were interpreted consistently and to minimize the need for judgement, a user manual was designed; this contained a detailed explanation of the meaning of each question [23] and suggested circumstances where a 'yes' answer may be appropriate. In the study, a global assessment of each guideline was also asked for as a measure of overall quality: 'strongly recommended' (for use in practice without modifications); 'recommended' (for use in practice on condition of some alterations or with provisos); or 'not recommended' (not suitable for use in practice).

Selection of guidelines for appraisal

Sixty guidelines were selected from a national survey of UK guidelines published between January 1991 and January 1996 on coronary heart disease, asthma, breast cancer and depression (15 guidelines per disease group) [24].
The size of the sample was based on Nunnally's recommendation that at least 300 observations are needed for inter-rater tests of reliability [25]. We hypothesized that national guidelines would be more systematically developed than local ones. All 12 guidelines produced by nationally recognized organizations or commissioned by the NHS Executive were selected. Forty-eight local guidelines were drawn as a random sample. Guideline authors were asked to provide copies of their guidelines and information on how they had been developed.

Appraisers

Each guideline was assessed independently by six appraisers (120 in total), each of whom assessed three guidelines. Each block of three guidelines (20 blocks altogether) was assessed by the same six appraisers (Figure 1). These included a national expert in the disease area, a general practitioner, a public health physician, a hospital consultant physician, a nurse specializing in the disease area, and a researcher on guideline methodology. They were recruited through UK cardiac units, asthma centres, the Royal College of General Practitioners, respondents to the survey, the Royal College of Nursing and research institutions, and were randomly allocated guidelines.
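The blocked allocation described above can be sketched in a few lines of Python (an illustrative sketch of the design, not the authors' code; the `block_design` helper and its zero-based numbering are our own). It confirms that 20 blocks of three guidelines, each with its own panel of six appraisers, yield 120 appraisers and 360 observations, exceeding Nunnally's threshold of 300.

```python
# Sketch of the appraisal design: 60 guidelines in 20 blocks of 3;
# each block is assessed by its own panel of 6 appraisers, so every
# appraiser rates exactly 3 guidelines.

def block_design(n_guidelines=60, per_block=3, panel_size=6):
    """Return a list of (guideline_ids, appraiser_ids) blocks."""
    blocks = []
    for b in range(n_guidelines // per_block):
        guidelines = list(range(b * per_block, (b + 1) * per_block))
        appraisers = list(range(b * panel_size, (b + 1) * panel_size))
        blocks.append((guidelines, appraisers))
    return blocks

blocks = block_design()
n_appraisers = max(a for _, panel in blocks for a in panel) + 1
n_assessments = sum(len(g) * len(p) for g, p in blocks)
print(len(blocks), n_appraisers, n_assessments)  # 20 120 360
```

With 360 guideline-by-appraiser observations, the design comfortably meets the 300-observation minimum cited from Nunnally.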


Guidelines    Appraisers
1–3           1–6
4–6           7–12
7–9           13–18
10–12         19–24
13–15         25–30

Figure 1 Design for the assessment of coronary heart disease guidelines (design repeated for the other three disease areas: asthma, breast cancer and depression).

Analysis

In order to allow comparison of guideline performance, dimension scores for each guideline were calculated. A 'yes' response was given a value of 1 and other responses ('no', 'not sure' and 'not applicable') a value of zero. Individual appraisers' dimension scores were calculated by summing their scores for each item within a dimension. A guideline dimension score was obtained by calculating the mean of the appraisers' scores. This was then expressed as a percentage of the maximum possible score for that dimension in order to compare scores across the three dimensions.

Item dimension

We calculated Pearson's correlation coefficients between each item and the dimension scores, omitting the index item, to check that each item was in the appropriate dimension [26].

Reliability

Reliability of the instrument was assessed in two ways. First, internal consistency was measured by calculating the correlation between all items within a dimension, using Cronbach's α coefficient [27], to test the extent to which they measured the same underlying concept. Second, inter-rater agreement was measured by calculating the intra-class correlation coefficient (ICC) for the dimension scores according to the criteria of Shrout and Fleiss [28]. Calculations were based on the assumption that each guideline was assessed by a different set of appraisers.
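The scoring rule just described can be written as a short sketch (ours, not the authors' code; the `dimension_score` helper and the sample responses are invented for illustration). Only 'yes' answers score; the guideline's score is the mean across appraisers, rescaled to a percentage of the dimension maximum.

```python
# Illustrative implementation of the paper's scoring rule:
# 'yes' = 1; 'no', 'not sure' and 'not applicable' = 0.

def dimension_score(responses_by_appraiser, n_items):
    """Mean appraiser score for one dimension, as a percentage of the
    maximum possible score (n_items, since each 'yes' is worth 1)."""
    appraiser_scores = [
        sum(1 for answer in responses if answer == "yes")
        for responses in responses_by_appraiser
    ]
    mean_score = sum(appraiser_scores) / len(appraiser_scores)
    return 100 * mean_score / n_items

# Six hypothetical appraisers answering the five items of dimension three:
responses = [
    ["yes", "yes", "no", "not sure", "yes"],       # 3 'yes' answers
    ["yes", "no", "no", "not applicable", "yes"],  # 2
    ["yes", "yes", "yes", "no", "yes"],            # 4
    ["yes", "yes", "no", "no", "yes"],             # 3
    ["no", "yes", "not sure", "no", "yes"],        # 2
    ["yes", "yes", "yes", "no", "yes"],            # 4
]
print(round(dimension_score(responses, n_items=5), 1))  # 60.0
```

The mean appraiser score here is 3.0 of a possible 5, giving a dimension score of 60%, which is how guideline scores on dimensions of different sizes become comparable.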

Validity

In the absence of a gold standard or a validated measure of guideline quality, evidence of criterion validity was sought by calculating Pearson's correlation coefficients between appraisers' dimension scores and their global assessment of a guideline. We predicted that dimension scores for national guidelines would be higher than those for local guidelines; in an attempt to investigate validity further, analysis of variance (ANOVA) was used to test this hypothesis. ANOVA was also used to examine the effect of year of publication, disease area and level of background information on guideline dimension scores. Year of publication was classified into three categories: pre-1994, 1994–1996 and unknown. These were chosen because a number of influential papers and recommendations about the development of guidelines had been published in 1993 [29,30]. A zero-skewness log transformation was used in the ANOVA for dimensions one and three because the scores were not normally distributed. Mann–Whitney tests were used on individual appraisers' scores to examine differences between professional groups. Appraisers who omitted at least one question in a dimension were excluded from calculations of the ICCs and Pearson's correlation coefficients for that dimension.

Results

Background information was received for 53 guidelines. Five guidelines (three national and two local) had a background document with details of their development process. Completed structured questionnaires were available for 46 guidelines, and two authors provided information in a letter. No additional information was available for seven local guidelines. Thirty-eight guidelines had been published between 1994 and 1996, 14 between 1992 and 1993, and eight documents were undated. One appraiser had been closely associated with the development of one of the guidelines and therefore assessed only two guidelines.

Figure 2 shows the distribution of guideline scores for each dimension. Over two-thirds of guidelines scored less than 50 on dimension one, which means that less than 50% of the criteria for rigorous development were met. The median for dimension one was 30.4, with a wide range of 0.8–85. The median score was higher for dimension two (47.9). Performance was poorest on dimension three (median 24.2); the distribution for this dimension was very skewed, with scores ranging from 0 to 95.

Item dimension

Items were in the appropriate dimension, as all but two correlated more highly with their own dimension scores than with the other two dimensions' scores (table of results available from the authors).

Reliability


All three dimensions had good internal consistency (Cronbach's α, 0.68–0.84) and excellent inter-rater agreement (ICCs, 0.82–0.90), with narrow confidence intervals [31] (Table 1).
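The internal-consistency statistic used here, Cronbach's α, can be computed from the 0/1 item scores with the standard library alone (a minimal sketch under our own assumptions about data layout, not the authors' analysis code; they would have used a statistics package).

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for one dimension.

    item_scores: rows = appraisers, columns = items, coded 1 for 'yes'
    and 0 otherwise. Uses sample variances throughout:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals).
    """
    k = len(item_scores[0])            # number of items in the dimension
    columns = list(zip(*item_scores))  # one tuple of scores per item
    item_vars = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Four hypothetical appraisers on a three-item dimension; mostly
# consistent answers give a high alpha:
rows = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(round(cronbach_alpha(rows), 2))  # 0.89
```

When appraisers' item responses rise and fall together, the variance of the row totals dominates the summed item variances and α approaches 1; values in the paper's 0.68–0.84 range indicate that the items within each dimension largely measure the same underlying concept.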



Validity

The Pearson's correlation coefficients between appraisers' dimension scores and their global assessment were 0.49 (n = 311) for dimension one, 0.63 (n = 319) for dimension two and 0.40 (n = 315) for dimension three. All coefficients were highly significant (P