STATISTICS IN MEDICINE Statist. Med. 2002; 21:2797–2805 (DOI: 10.1002/sim.1283)

Advanced issues in the design and conduct of randomized clinical trials: the bigger the better?

Charles Warlow∗†

University of Edinburgh, Department of Clinical Neurosciences, Western General Hospital, Edinburgh, Scotland, U.K.

SUMMARY

Large trials provide more precise estimates of treatment effects than small trials (how good is the treatment?) and they may allow a few sensible and pre-defined subgroup analyses (for whom does the treatment work best?). But large trials will not get done unless they are simple and as much a part of normal clinical practice as possible. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS: randomized trial; design; analysis; sample size

WHY LARGE?

The purpose in making a trial as large as possible is to maximize the chance of detecting a treatment effect, particularly if that effect is not very big, and to provide a precise estimate of the size of the treatment effect. Small trials risk producing a false negative result because of bad luck (the familiar type II error), and will never provide a precise result because the confidence intervals around the estimate of the treatment effect will be wide.

In a large trial with a definite overall treatment effect, it may be possible to explore whether there is variation in the size of that effect in particular types of patients, at a particular stage of their illness, that is, subgroup analysis. One can then make appropriate inferences: to use the treatment in certain sorts of patient but not in others, or to explore the treatment in a particular subgroup in further randomized trials. Essentially we are trying to answer the question: ‘for whom does the treatment work best?’ If possible, we want to offer the treatment to the people who really need it, not to the larger number who might need it, especially if the treatment is expensive, risky or difficult to use.

Rather than one very large trial, there may be ‘political’ and logistical advantages in having two or more fairly large trials to achieve the necessary sample size. The individual patient data can then be combined in a meta-analysis, provided the investigators use similar baseline and outcome definitions, and are prepared to collaborate and share their data. This may be

Correspondence to: Charles Warlow, University of Edinburgh, Department of Clinical Neurosciences, Western General Hospital, Edinburgh, Scotland, U.K. † E-mail: [email protected]


difficult if commercial sensitivities are involved. For example, when the North American Symptomatic Carotid Endarterectomy Trial (NASCET) [1] started, a decision had been made to organize an independent trial in North America because it was unlikely that American centres would join the ongoing European Carotid Surgery Trial (ECST) [2]. Similarly, the Chinese Acute Stroke Trial (CAST) [3] of aspirin was planned partly because it would have been impossible to co-ordinate any Chinese centres in the ongoing International Stroke Trial (IST) [4] due to language and cultural differences. In both cases, the trials have been put together in an individual patient data meta-analysis to increase the precision of the overall treatment estimates, and to explore sensible subgroups.

An illustration of the problem of an insufficient sample size comes from over 20 years ago, when Bryan Matthews and others in Oxford did a randomized trial of Dextran versus control in 100 patients with acute ischaemic stroke [5], at the time a ‘large’ trial. They observed about a 50 per cent reduction in the odds of death, but with a 95 per cent confidence interval running from something better than penicillin (a 90 per cent relative reduction in the odds of death) to something extremely dangerous (double the odds): a hugely clinically relevant treatment effect, but quite probably all due to chance. However, suppose for the sake of argument they had done the trial with 1000 patients and got the same relative odds reduction: there would have been a much narrower confidence interval (from about a 30 to 70 per cent relative reduction in the odds of death), a hugely statistically significant result. This shows that with increasing sample size one obtains a better estimate of the treatment effect and, in this theoretical case, a very statistically and clinically significant treatment effect.

The treatment effect might not have been the same with 1000 patients; after all, the effect was so large that it was probably too good to be true. But if it had been as large in a larger trial, we would all have been using Dextran for 20 years instead of consigning it to the history books.
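The narrowing of the confidence interval with sample size can be sketched with a standard 2 × 2 odds-ratio calculation. The counts below are illustrative only, chosen to give roughly the 50 per cent odds reduction described above; they are not the Dextran trial's actual data:

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table:
    a/b = deaths/survivors on treatment, c/d = deaths/survivors on control."""
    or_ = (a / b) / (c / d)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# ~50% odds reduction with 50 patients per arm: CI spans 1 (inconclusive)
print(odds_ratio_ci(12, 38, 20, 30))
# the same proportions at ten times the size: the CI is sqrt(10) narrower
# on the log scale and excludes 1 (conventionally 'significant')
print(odds_ratio_ci(120, 380, 200, 300))
```

The point estimate is identical in both cases; only the precision changes, which is exactly the 100-versus-1000-patient contrast drawn above.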

IN LARGE TRIALS THERE MAY BE LARGE ENOUGH SUBGROUPS TO EXPLORE WHETHER TREATMENT IS MORE OR LESS EFFECTIVE FOR PARTICULAR SORTS OF PATIENTS

In a large trial there are likely to be many different sorts of patients recruited, and in reasonably large numbers (old versus young, male versus female, good baseline prognosis versus bad, etc.). Such diversity is to be encouraged, for it means that one has a reasonable chance of exploring whether the treatment has a substantially different effect in some, but not too many, large subgroups. It is, therefore, a mistake to insist that all randomized patients should be much the same, because one is then limited to looking at the treatment effect in just one particular type of patient (females or males, etc.). It is difficult to know whether one can generalize the result to different sorts of patients in future clinical practice. Will males respond in the same way as females; does the treatment work better in patients with severe rather than mild disease? Doing a string of small trials in a number of very tightly defined groups of patients does not seem very cost-effective research. It is better to conduct a large trial with predefined and sensible subgroups of interest which are themselves large enough for reliable analysis.

The whole point of subgroup analysis is to explore differences in relative treatment effect between various types of patients who might be expected to respond to treatment differently (absolute treatment effects may be very different because they depend on the absolute risk of a poor outcome in the control patients). This might be a qualitative difference, but much more

Statist. Med. 2002; 21:2797–2805

DESIGN AND CONDUCT OF RANDOMIZED CLINICAL TRIALS

2799

likely it will be a quantitative difference. Unexpected qualitative differences are most unusual. For example, thrombolysis is likely to be more effective if it is given sooner rather than later in acute ischaemic stroke (a quantitative difference), and carotid surgery to prevent stroke is more likely to be effective the greater the degree of arterial stenosis (another quantitative difference). However, a trial of thrombolysis in acute stroke that did not use prior CT brain scanning to exclude cerebral haemorrhage might have a null result, because thrombolysis makes cerebral haemorrhage worse even though it may provide net benefit in ischaemic stroke. This is a qualitative difference in treatment effect, since thrombolysis is already known to cause haemorrhage.

AN EXAMPLE OF A SENSIBLE SUBGROUP ANALYSIS FROM THE EUROPEAN TRIAL OF CAROTID SURGERY

Baseline diversity is exactly what we exploited in the European Carotid Surgery Trial. We asked the collaborators to randomize eligible patients with any degree of carotid stenosis they felt comfortable with. This use of ‘the uncertainty principle’ not only ensured a wide range of patients were entered but it solved an ethical dilemma. It allowed the collaborators to operate on some patients if they felt, for whatever reason, that it was unethical to randomize them to the possibility of no surgery, to avoid operation in others if they felt that surgery was contraindicated, and only to randomize those patients where they were uncertain whether surgery was better than no surgery. Some collaborators randomized patients with severe stenosis, some moderate, some very mild stenosis, and others a wide range of stenosis (Figure 1). In the whole trial there were a large number of patients across the whole range of stenosis severity, although all the patients had recently had a mild cerebral ischaemic event (Figure 2). We could then find out whether surgery was better than no surgery for various severities of stenosis.

If we had not had this heterogeneity and reasonably large numbers, we would have had no information at all for moderate and mild stenosis. Indeed, NASCET did not randomize patients with mild stenosis, so it could not provide information about this patient group; it had imposed more homogeneity on its selection of patients. In general, heterogeneity of patients in a large trial is a good thing, provided one can identify the different subgroups of interest from the baseline data, the subgroups are large and prespecified, and the results are not reported just on the basis of retrospective analysis.
There are too many examples of ‘subgroup torture’, where the results in one – almost always positive – subgroup are presented in isolation with no attempt to compare them with other subgroups or the overall trial result. In a large trial it may be too expensive to collect comprehensive baseline data on every patient, but there only have to be enough to identify the subgroups which may be of interest: males/females; old/young; severe stenosis/not so severe stenosis; lacunar stroke/non-lacunar stroke; and so on. All this can be done within the context of a large trial, and a surprising amount of useful data can be collected simply on the telephone before a patient is randomized. It is absolutely not necessary to do a series of rather small complicated trials, each with very homogeneous patients – males and then females, the young and then the old, mildly diseased patients and then severely diseased patients, and so on.


Figure 1. The variation in the distribution of symptomatic carotid stenosis severity in patients randomized by six collaborators in the European Carotid Surgery Trial.

Figure 2. The distribution of carotid stenosis severity of the symptomatic and contralateral asymptomatic carotid arteries in the European Carotid Surgery Trial (3024 patients).

AN EXAMPLE OF SENSIBLE SUBGROUP ANALYSES IN THE TREATMENT OF ACUTE STROKE WITH ASPIRIN

What might be the sensible subgroups in a randomized trial of early treatment in acute stroke, or in a meta-analysis of such trials? Certainly one would like to know if a treatment effect is different for old versus young, male versus female, early versus late, lacunar versus non-lacunar territorial infarction, and if the brain CT scan is normal or shows an appropriate infarct. Patients in atrial fibrillation may have a different treatment effect compared with those in sinus rhythm, and the treatment effect may be different again for various grades of risk of


poor outcome at baseline, and the risks and benefits of the treatment may be different to the overall treatment effect. All the baseline variables to address these questions were collected in the International Stroke Trial (IST) and the Chinese Acute Stroke Trial (CAST). The IST was a trial in nearly 20 000 ischaemic stroke patients less than 48 hours from onset, with no clear indications for or contraindications to the two treatments (aspirin and heparin in a factorial design). Baseline data came in by telephone. For follow-up there was just a one-page form to complete at two weeks, and another at six months, to make a large trial practicable without huge expense, since we were dealing with aspirin and heparin, with no sponsorship from pharmaceutical companies. CAST was very similar except that it looked only at aspirin and was double-blind.

Combining the two trials and then looking at some sensible subgroups was interesting [6]; there was no statistically significant heterogeneity, in other words the treatment effect was not statistically different in any of the subgroups examined (Figure 3). Some of these subgroups are worth looking at in detail because the results were surprising.
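The heterogeneity test mentioned here can be sketched as follows. With each subgroup summarized by its observed-minus-expected event count (O − E) and the variance V of that quantity (the summaries plotted in Figure 3), the usual chi-square statistic measures how far the subgroup log odds ratios scatter around the pooled one. The numbers below are made up for illustration, not taken from CAST/IST:

```python
def heterogeneity_chi2(o_minus_e, variances):
    """Chi-square statistic (df = k - 1) for heterogeneity of k subgroup
    treatment effects, each summarized as O - E and its variance."""
    pooled = sum(o_minus_e) / sum(variances)   # pooled log odds ratio
    return sum((oe - v * pooled) ** 2 / v
               for oe, v in zip(o_minus_e, variances))

# three hypothetical subgroups, all pointing the same way: the statistic
# comes out well below the 5% critical value of 5.99 on 2 df
chi2 = heterogeneity_chi2([-8.0, -3.0, -6.0], [50.0, 40.0, 45.0])
print(round(chi2, 2))
```

A small statistic, as here, is what "no statistically significant heterogeneity" means in practice: the subgroup estimates are compatible with one common treatment effect.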

ATRIAL FIBRILLATION VERSUS SINUS RHYTHM

Some believe that an antiplatelet drug like aspirin should have little or no effect on early recurrent stroke if the brain artery is likely to have been occluded by a fibrin-containing embolus from the heart (as is often the case if there is atrial fibrillation) rather than a platelet-containing embolus from an atheromatous plaque. In CAST and IST put together, the point estimate of the treatment effect on early recurrent stroke or early death was less in the fibrillators than in those in sinus rhythm. This finding might well have been a chance effect (Figure 3), particularly as there was no difference at all if recurrent ischaemic stroke was used as the outcome event [6]. There was a wide confidence interval around the result in the 4500 fibrillators, but this is still the largest sample available on the effects of early aspirin in ischaemic stroke patients with atrial fibrillation.

The sample size is large, but even so it is difficult to find out whether a treatment effect really is different in different sorts of patients. Aspirin might still be less effective in fibrillating stroke patients, but we cannot be sure, because 4500 patients was probably not enough. Since the effect of aspirin is not definitely different for fibrillating stroke patients compared with sinus rhythm stroke patients, one should take the overall trial result – not the subgroup result – to help decide on the most appropriate treatment in the next acute stroke patient coming to the hospital, whether or not the patient is in sinus rhythm. This illustrates that with two large trials put together we helped to answer the clinically relevant question about aspirin versus no aspirin in presumed cardioembolic stroke. At the present time we recommend using aspirin in both sorts of patients.
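The comparison above can be formalized as a test of interaction: compare the two subgroup log odds ratios directly, rather than noting that one subgroup's result is 'significant' and the other's is not. The odds ratios and standard errors below are invented for illustration; they are not the CAST/IST estimates:

```python
from math import erf, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def interaction_test(or_a, se_log_a, or_b, se_log_b):
    """Z-test for a difference between two subgroup odds ratios, given
    each subgroup's OR and the standard error of its log(OR)."""
    z = (log(or_a) - log(or_b)) / sqrt(se_log_a**2 + se_log_b**2)
    p = 2.0 * (1.0 - norm_cdf(abs(z)))
    return z, p

# hypothetical: fibrillators OR 0.95 (SE 0.08) vs sinus rhythm OR 0.88 (SE 0.03)
z, p = interaction_test(0.95, 0.08, 0.88, 0.03)
print(round(z, 2), round(p, 2))  # no good evidence the effects differ
```

With numbers like these the interaction p-value is far from significance, which is the article's point: a wide subgroup interval cannot establish that the subgroup effect truly differs from the rest.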

TIME FROM STROKE ONSET

Because IST was large and simple, it came in for a lot of criticism. An anonymous referee of the IST pilot study report said that it ‘has little to do with an acute stroke trial in which patients are currently being randomized within six hours or less’. This was quite incorrect. In IST there were about 3000 patients randomized within six hours of symptom onset, more than in any other trial of ‘early’ treatment, and in both IST and CAST together there were more


than 5000. With this large data set covering a variety of times from stroke onset, we could look at the effect of treatment given at 0–6 hours, 7–12 hours, right up to 48 hours after onset (Figure 3). We deliberately encouraged the heterogeneous nature of our randomized patients, and we did not want only patients randomized at one particular time. In CAST and IST we found that treatment given within six hours of stroke onset was not detectably different in effect compared with other times, or the overall treatment effect; nor was the effect at 7–12 hours, even though it seemed to be going the ‘wrong’ way against aspirin. Between 13 and 24 hours, and 25 and 48 hours, the relative reduction in the odds of stroke and death was nearer the overall treatment effect again. Presumably this ‘zig’ and ‘zag’ was just a chance effect due to relatively small numbers – ‘only’ about 7000 patients randomized between 7 and 12 hours post-stroke. Imagine how chance could have influenced the result of a trial of just 500 patients! Of course, if we had not randomized patients beyond 12 hours we would not have been able to say anything about the effect of aspirin at 13–48 hours. Aspirin seemed to work just as well early as late, probably because its main effect is to reduce early recurrence rather than to limit the effect of the stroke itself.

Since we randomized a large number of patients across a wide range of time windows, and then looked at subgroups defined a priori, we could make more appropriate inferences about the interaction between the effect of aspirin in acute ischaemic stroke and the time from stroke onset to treatment. We could only do this because the two trials were large enough to have reasonably sized subgroups. Even so, these subgroups were still too small to exclude a quantitative difference in treatment effects at different times post-stroke onset.

LACUNAR STROKE

We might like to know if aspirin is particularly effective in lacunar stroke versus other sorts of stroke. In the former, a small artery in the brain is occluded by an intrinsic disorder of those small arteries. In non-lacunar strokes, larger brain arteries are occluded by emboli from the heart or from atheromatous plaques affecting the large arteries in the neck. What do IST and CAST tell us about lacunar stroke, defined at baseline by asking a few simple clinical questions over the telephone? Within the two trials there was a subgroup of about 11 000 lacunar patients. There has never been a trial in lacunar stroke of this size before, but we could not discern any definite difference between the effect of aspirin in lacunar and non-lacunar stroke patients (Figure 3).

Figure 3. Meta-analysis of the CAST and IST trials of aspirin in acute ischaemic stroke, in various clinically defined subgroups; proportional effects on recurrent stroke or death. For each particular subgroup, the observed minus the expected (O − E) number of events among aspirin-allocated patients, its variance, and the odds ratio (OR) (square with area proportional to the total number of patients with an event) are given. A square to the left of the solid vertical line suggests benefit, significant at 2P < 0.01 only if the entire 99 per cent CI (horizontal line) is to the left of the solid vertical line. The open diamond indicates the overall result and its 95 per cent CI. Summation of the 10 separate χ² heterogeneity test statistics (one for each baseline characteristic but not for ‘days’) yields the global test for heterogeneity between the 28 subgroups: χ²₁₈ = 16.5, NS (reproduced with permission from Chen et al. [6]).
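The odds ratios and confidence intervals plotted in Figure 3 follow from the (O − E, variance) summaries by the standard one-step ('Peto') approximation. A sketch, with a made-up subgroup rather than any figure from the trials:

```python
from math import exp, sqrt

def peto_or(o_minus_e, variance, z=2.576):
    """One-step (Peto) odds ratio and CI from the observed-minus-expected
    number of events and its variance; z = 2.576 gives the 99% interval
    used for the subgroup squares in Figure 3."""
    log_or = o_minus_e / variance          # approximate log odds ratio
    half_width = z / sqrt(variance)        # CI half-width on the log scale
    return exp(log_or), exp(log_or - half_width), exp(log_or + half_width)

# a hypothetical subgroup with O - E = -8 and variance 50:
or_, lo, hi = peto_or(-8.0, 50.0)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

Fewer events on aspirin than expected (negative O − E) puts the square to the left of the vertical line; whether the interval crosses that line is what decides 'significance' in the figure.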


WHICH SUBGROUPS ARE NOT SENSIBLE?

We might also like to know if aspirin, or thrombolysis, is more or less effective in patients who still have an occluded artery in the brain rather than in those in whom the occluded artery has recanalized, but that would require transcranial Doppler or another technology which is not generally available, and which takes time (wasted time if the brain cells are dying as time passes). Transcranial Doppler also has a limited ability to detect occlusion of arteries other than the middle cerebral artery. We might also like to know whether aspirin works better or worse in a patient with a large ischaemic penumbra compared with one with a small ischaemic penumbra – patients who still have viable brain cells. This is not practical except in very small numbers, because so few centres have the magnetic resonance imaging capability to define this subgroup.

To be even more extreme, and yet to ask a perfectly legitimate question as doctors who know about stroke, we might like to know whether thrombolysis is particularly effective for patients who have a large middle cerebral artery infarct, a large ischaemic penumbra, a still occluded middle cerebral artery, and who are in atrial fibrillation. This is a perfectly good question but an unanswerable one, because there will never be enough patients randomized with this combination, and the technology to identify them is not widely available. Even if we could answer that question, why not answer it within the context of a large trial which included at least a few centres with the relevant technology? What we definitely do not need are small trials in very small and tightly defined groups of patients. The results tend to be imprecise and therefore unreliable. This is not a cost-effective approach. It is better to do one, two or three large trials, within which there are enough patients in a priori defined subgroups, to get answers for them as well as the overall treatment effect.

CAN A TRIAL BE TOO LARGE?

Some fear that if a trial is planned to be too large, there is a risk that far too many patients will be randomized long after a treatment effect – or hazard – becomes clear. This wastes time, is costly, and may be unethical. However, a trial can always be stopped early, provided the unblinded results are reviewed by a competent data monitoring and ethics committee. In any event, there has probably been much more damage done by doing trials that were far too small than far too large; that is, patients have not been given treatments with real but modest effects because these effects were missed in the small trials.
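The role of the monitoring committee can be illustrated by simulation. Under a null treatment effect, repeatedly testing the accumulating data at the conventional 5 per cent level inflates the false positive rate, whereas a stringent interim boundary (a Haybittle–Peto style rule of roughly |z| > 3) keeps it close to nominal. This is a Monte Carlo sketch of the general principle, not any trial's actual monitoring plan:

```python
import random
from math import sqrt

def false_positive_rate(z_boundary, n_looks=5, n_per_look=100,
                        n_sims=2000, seed=1):
    """Under no treatment effect, the fraction of simulated trials that
    cross |z| > z_boundary at any of several interim looks."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(n_per_look))
            n += n_per_look
            if abs(total / sqrt(n)) > z_boundary:  # cumulative z statistic
                crossings += 1
                break
    return crossings / n_sims

print(false_positive_rate(1.96))  # naive repeated testing: well above 0.05
print(false_positive_rate(3.0))   # stringent interim boundary: far below 0.05
```

This is why interim results are reviewed confidentially by a committee applying a stringent rule, rather than being tested repeatedly at the nominal level by the investigators.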

AN IMPERATIVE: LARGE TRIALS WILL NOT GET DONE UNLESS THEY ARE SIMPLE

Large trials will give us precise answers to a lot of questions, but they will never get done unless they are simple – though not too simple. Subgroups have to be identifiable from the baseline data – male or female, old or young, severe or mild, etc. – but these data can be easily collected over the telephone. They certainly will not get collected in large numbers of patients unless they are simple; neither will they get collected unless it is inexpensive to do so, and they are part of routine clinical practice without demanding a lot of expensive technology. Outcome events must (of course) be sensible and valid, and they too must be collected simply if the


trial is to be large. If these trials are part of routine clinical practice, then their results will apply to routine clinical practice, which is, after all, the whole point of doing the trials in the first place.

REFERENCES

1. Barnett HJ, Taylor DW, Eliasziw M, Fox AJ, Ferguson GG, Haynes RB, Rankin RN, Clagett GP, Hachinski VC, Sackett DL, Thorpe KE, Meldrum HE. Benefit of carotid endarterectomy in patients with symptomatic moderate or severe stenosis. North American Symptomatic Carotid Endarterectomy Trial Collaborators. New England Journal of Medicine 1998; 339:1415–1425.
2. European Carotid Surgery Trialists' Collaborative Group. Endarterectomy for recently symptomatic carotid stenosis: final results of the MRC European Carotid Surgery Trial (ECST). Lancet 1998; 351:1379–1387.
3. CAST (Chinese Acute Stroke Trial) Collaborative Group. CAST: randomised placebo-controlled trial of early aspirin use in 20,000 patients with acute ischaemic stroke. Lancet 1997; 349:1641–1649.
4. International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19,435 patients with acute ischaemic stroke. Lancet 1997; 349:1569–1581.
5. Matthews WB, Oxbury JM, Grainger KM, Greenhall RC. A blind controlled trial of dextran 40 in the treatment of ischaemic stroke. Brain 1976; 99:193–206.
6. Chen Z-M, Sandercock P, Pan H-C, Counsell C, Collins R, Liu L-S, Xie J-X, Warlow C, Peto R, on behalf of the CAST and IST Collaborative Groups. Indications for early aspirin use in acute ischaemic stroke: a combined analysis of 40,000 randomised patients from the Chinese Acute Stroke Trial and the International Stroke Trial. Stroke 2000; 31:1240–1249.
