1. The epidemiological theory: principles of biomarker validation

Paolo Vineis and Valentina Gallo
Imperial College London, United Kingdom
1.1. Validity and reliability

To achieve an accurate estimate of the association between any marker and disease, in epidemiology we need reliable and valid measurements of exposure, covariates (potential confounders and effect modifiers), and outcomes. Causal inferences cannot be drawn in the absence of such measurements. In what follows we will distinguish between a marker (any variable that can be measured and is informative for the purposes of the study), an assay (a specific laboratory test which aims at measuring that marker), and a measurement (the concrete act of measuring the value of a marker in an individual, by a specific assay). For example, PAH-DNA adducts are a type of marker, 32P-postlabelling is a type of assay, and the actual data are the measurements.

Validity is defined as the (relative) lack of systematic measurement error when comparing the actual observation with a standard, that is, a reference method which represents the “truth”. Such “truth” is in fact an abstract concept we are interested in, for example “cancer” as defined through a histologic characterization. While validity entails a “standard”, reliability concerns the extent to which an experiment or any measuring procedure yields the same results on repeated trials (1). By “the same results” we do not mean an absolute correspondence, but a relative concept, i.e. “a tendency towards consistency of repeated measurements of the same phenomenon”. Reliability is relative to the type and purpose of the measurement: for some purposes we may accept a level of unreliability that is unacceptable for other purposes.

In addition to being reliable, the marker must be valid, i.e. provide an accurate representation of some abstract concept. Validity and reliability are independent: a measurement may be perfectly reliable (reproducible in different laboratories and repeatable at different times), but consistently wrong, i.e. far away from the true value. For example, a gun is completely reliable if all the shots fall within a small area, but seriously biased if that area is far away from the target; conversely, the gun is unbiased but unreliable if the shots are distributed on average around the centre of the target but are dispersed over a large area. We are interested in both validity and reliability; however, since validity is often not measurable, reliability is sometimes used (incorrectly) as a surrogate. An aspect that is clearly relevant to the discussion of measurement error is timing: any inference about the meaning of biomarker measures should be strictly time-specific, since time influences the results in several different ways.
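The target analogy can be made concrete with a small numeric sketch. The following Python snippet (illustrative only; the two “assays” and all their parameters are hypothetical) simulates a reliable-but-invalid measurement and a valid-but-unreliable one against a known standard: the mean error reflects (lack of) validity, while the standard deviation reflects (lack of) reliability.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0  # the "standard" that represents the truth

# hypothetical assays: one reliable but biased, one unbiased but noisy
biased_precise = rng.normal(loc=120.0, scale=1.0, size=1000)
unbiased_noisy = rng.normal(loc=100.0, scale=15.0, size=1000)

for name, x in [("reliable but invalid", biased_precise),
                ("valid but unreliable", unbiased_noisy)]:
    print(f"{name}: mean error = {x.mean() - true_value:+6.1f}, "
          f"spread (SD) = {x.std(ddof=1):5.1f}")
```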
The major components of biomarker variability that affect the design of epidemiologic studies are variability between subjects (inter-subject), within subjects (intra-subject), and variability due to measurement error. The impact of the three categories of variability on the biomarker response can be represented by a linear model of this form (2):

$$Y_{ijk} = \mu + a_i + b_j + e_{ijk} \qquad \text{[1.1.]}$$

where:
$Y_{ijk}$ — the marker response for subject $i$ at time $j$ and replicate measurement $k$;
$\mu$ — the true population mean response;
$a_i$ — the offset in mean response for subject $i$ (assumed to be normally distributed with mean 0 and variance $\sigma_i^2$; this variance represents the extent of inter-subject variability);
$b_j$ — the offset in response at time $j$ (assumed to be normally distributed with mean 0 and variance $\sigma_j^2$; this variance represents the extent of intra-subject variability);
$e_{ijk}$ — the assay measurement error (normally distributed with mean 0 and variance $\sigma_{ijk}^2$) (2).
The normality of distribution assumed in the model must be verified. In fact, many biomarkers have distributions that are far from normal; normalization can often be achieved through an appropriate transformation, for example a log transformation. The model is also based on a linear (additive) assumption, which implies that measurement errors are independent of average measurements. This assumption must be verified case by case, for example by checking whether errors correlate with the mean.
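As an illustration, the sketch below (all distribution parameters hypothetical) checks normality before and after a log transformation, and probes the additivity assumption by testing whether replicate scatter grows with the subject mean; a clearly positive correlation argues for analysing the marker on the log scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# hypothetical right-skewed biomarker: normality before/after log transform
raw = rng.lognormal(mean=2.0, sigma=0.8, size=200)
for label, x in [("raw", raw), ("log-transformed", np.log(raw))]:
    w, p = stats.shapiro(x)
    print(f"{label}: Shapiro-Wilk W = {w:.3f}, p = {p:.4f}")

# additivity check: with multiplicative assay error, replicate scatter
# increases with the subject mean
subj = rng.lognormal(2.0, 0.8, size=100)[:, None]
reps = subj * rng.lognormal(0.0, 0.2, size=(100, 3))
rho, p = stats.spearmanr(reps.mean(axis=1), reps.std(axis=1, ddof=1))
print(f"replicate SD vs mean: Spearman rho = {rho:.2f} (p = {p:.4f})")
```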
1.2. Impact of measurement error: random and systematic error

Errors of marker measurement may have a different impact depending on the error distribution. If the epidemiological study has been conducted blindly, i.e. the laboratory analyses have been done with no knowledge of the exposed/unexposed or diseased/healthy status of the subjects, we expect the measurement error to be evenly distributed across strata of exposure or disease. (This is true only if the error is equally distributed across the scale of the exposure: non-smokers, for instance, may be more difficult to characterize than smokers. Also, the problem of undetectable biomarker levels may affect controls, or the unexposed subjects, more than cases, or the exposed subjects: more controls than cases may have undetectable levels, so that the measurement error influences cases and controls differentially.) This kind of misclassification leads to underestimation of the risk ratio due to a “blurring” of the relationship between exposure and disease. Both underestimation and overestimation of the association of interest may occur when misclassification is not evenly distributed across the study variables. We may have a more general distortion of the etiologic relationship, and not only a “blurring” of the association, if the classification of exposure depends on the outcome (diseased/healthy status).
Blurring is “bias towards the null”, while distortion as a consequence of an uneven distribution of misclassification can be in either direction, both towards and away from the null hypothesis. A realistic example of bias depending on knowledge of the disease status by the researcher relates to the degradation of analytes when biological samples are stored for a long time. If the samples from the cases affected by the disease of interest and those from controls (within a cohort design) are analyzed at different times, bias can arise from differential degradation in the two series. For example, the researcher may decide (incorrectly) to analyze the samples from the cases as soon as these arise in the cohort, while the controls are analyzed at the end of the study. Since the levels of, say, vitamin C decrease rapidly with time, serious bias may arise from differential timing of measurement in the two series. For this reason, biochemical analyses should be made after matching cases and controls for time since sample collection.
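The “bias towards the null” produced by nondifferential misclassification can be demonstrated with a simulation. The following sketch (a hypothetical cohort; the 30% exposure prevalence, doubled risk and 20% misclassification rate are all invented for illustration) flips exposure labels independently of disease status and shows the odds ratio shrinking towards 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# hypothetical cohort: 30% exposed; exposure doubles the risk (10% vs 5%)
exposed = rng.random(n) < 0.30
disease = rng.random(n) < np.where(exposed, 0.10, 0.05)

def odds_ratio(e, d):
    a, b = (e & d).sum(), (e & ~d).sum()      # exposed cases / non-cases
    c, dd = (~e & d).sum(), (~e & ~d).sum()   # unexposed cases / non-cases
    return (a * dd) / (b * c)

# nondifferential misclassification: flip 20% of exposure labels,
# independently of disease status
flipped = exposed ^ (rng.random(n) < 0.20)

print(f"OR with true exposure:      {odds_ratio(exposed, disease):.2f}")
print(f"OR with misclassification:  {odds_ratio(flipped, disease):.2f}")
```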
1.3. Sources of variability: inter-subject, intra-subject, laboratory

Variation in biomarkers includes inter-individual (inter-subject) variation, intra-subject variation (i.e. variation in the marker over a particular time period), biological sampling variation (i.e. variation depending on how and where the biologic sample is collected) and laboratory variation. Sometimes the intra-individual and/or sampling variations are so large that laboratory measurement variation makes only a marginal contribution to overall variation. A particular example of intra-subject variation is the error due to the handling, processing, and storing of specimens; such variability can be measured only if repeated samples from the same individual are collected.

Inter-subject variability in marker response may derive from factors such as ethnic group, gender, diet or other characteristics. Similarly, the marker response may vary within the same subject over time due to changes in diet, health status, variation in exposure to the compound of interest (for dietary items, season is an important variable), and variation in exposure to other compounds that influence the marker response. Biological sampling variation is related to the circumstances of biological sample collection. For example, hyperproliferation of colonic cells is extremely variable at different points of the colonic mucosa. Therefore, not only is intra-subject variation over time important, due to varying exposure to agents that induce cell proliferation, but the measurements are also strongly influenced by how and where the mucosa is sampled. For example, a study (3) estimated that 20% of the variability of the rectal mucosa proliferation index (measured by nuclear antigen immunohistochemistry) is due to the subject, 30% to the biopsy within the subject, and 50% to crypts within a biopsy. In other words, as much as 80% of the variation is related to sampling.
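A decomposition of this kind can be estimated from a balanced nested design. The sketch below (purely illustrative: the sample sizes and variance shares are hypothetical, chosen only to echo the 20/30/50 split above) simulates a subject/biopsy/crypt hierarchy and recovers the variance components with simple method-of-moments formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical balanced nested design: subject -> biopsy -> crypt,
# with variance shares chosen to echo the cited 20/30/50 split
n_s, n_b, n_c = 50, 4, 10
sd_s, sd_b, sd_c = np.sqrt([0.2, 0.3, 0.5])
y = (5.0
     + rng.normal(0, sd_s, (n_s, 1, 1))        # subject effects
     + rng.normal(0, sd_b, (n_s, n_b, 1))      # biopsy-within-subject effects
     + rng.normal(0, sd_c, (n_s, n_b, n_c)))   # crypt-level variation

# method-of-moments estimates for a balanced nested design
var_c = y.var(axis=2, ddof=1).mean()
var_b = y.mean(axis=2).var(axis=1, ddof=1).mean() - var_c / n_c
var_s = y.mean(axis=(1, 2)).var(ddof=1) - var_b / n_b - var_c / (n_b * n_c)
total = var_s + var_b + var_c
print(f"subject {var_s/total:.0%}, biopsy {var_b/total:.0%}, "
      f"crypt {var_c/total:.0%}")
```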
Laboratory measurements can have many sources of error; in particular, there are two general classes of laboratory errors: those that occur between analytical batches and those that occur within batches. An example of a study designed to assess the different sources of laboratory variation is reported by Taioli et al. (2), using the model described above. In one experiment, they drew blood from five subjects three times, in three different weeks, in order to measure DNA-protein cross-links. The results indicated that the variation between batches was quite important, and larger than the variation between subjects.

Methodological issues should be discussed, as much as possible, within biomarker categories, due to the specificities of each category. The following table shows how methodological data can be organized according to biomarker type. Intra-individual and sampling variations are considered because of the extent of their influence on actual measurements for most markers.

Table 1.1. Methodological data organisation according to biomarker type
Biomarker category                               Intra-individual variation         Biological sampling variation
Internal dose
  Hormones                                       Yes (diurnal variation)            No
  Water-soluble nutrients                        Yes (short half-life)              No
  Organochlorines                                No (longer half-life)              No
Biologically effective dose
  Peripheral white blood cell DNA adducts        Yes (half-life: weeks to months)   No
  Exfoliated urothelial cell DNA adducts         Yes (half-life: months)            Yes
Early biological effect
  Lymphocyte metaphase chromosome aberrations    More or less stable                ?
  Somatic cell mutations (glycophorin A)         Probably low                       No (?)
Intermediate markers
  Cervical dysplasia                             Yes                                Yes
  Colonic hyperproliferation                     Yes                                Yes
Genetic susceptibility
  Genotype assay                                 No                                 No
  Noninducible phenotype                         No                                 No
  Inducible phenotype                            Yes                                No
Tumour markers                                   Yes                                Yes
1.4. Measurement of variation

The extent of variability in measurements can itself be quantified in several ways. Let us distinguish between continuous measurements and categorical measurements. A general measure of the extent of variation for continuous measurements is the coefficient of variation (CV — standard deviation/mean, expressed as a percentage). A more useful measure is the ratio between CVb and CVw: CVw measures the extent of laboratory variation within the same sample in the same assay, CVb measures the between-subject variation, and the CVb/CVw ratio indicates the extent of between-subject variation relative to the laboratory error. Large degrees of laboratory error can be tolerated if between-person differences in the parameter to be measured are large.

A frequently used measure of reliability for continuous measurements is the intra-class correlation coefficient, i.e. the between-person variance divided by the total (between- plus within-subject) variance. The intra-class coefficient is equal to 1.0 if there is exact agreement between the two measures on each subject (thus differing from the Pearson correlation coefficient, which takes the value 1.0 whenever one measure is a linear function of the other, not only when the two exactly agree). A coefficient of 1.0 occurs when within-subject variation is null, i.e. laboratory measurements are totally reliable. The intra-class correlation coefficient can then be used to correct measures of association (e.g. relative risks) in order to allow for laboratory error. The intra-class correlation coefficient can be used to estimate the extent of between-subject variability in relation to total variability; the latter includes variation due to different sources (reproducibility, repeatability, and sampling variation).

To measure reproducibility, i.e. the ability of two laboratories to agree when measuring the same analyte in the same sample, the mean difference between observers (and the corresponding confidence interval) has been proposed (4). In addition to reproducibility, i.e. agreement between readers on the same set of observations, with similar techniques we can measure repeatability, i.e. agreement within the same observer at different times (repeat observations).

Another concept used in biomarker validation is inter-observer concordance in the classification of binary outcomes. It should be borne in mind that concordance between two observers can arise by pure chance; therefore, what is measured is agreement beyond chance. The total attainable agreement beyond chance is not 100%: to be fair, we must subtract chance agreement from 100% to obtain an estimate of the total attainable agreement. The final measure is the difference between observed agreement and chance agreement, divided by the total possible agreement beyond chance; this measure is called the kappa index. There are other measures of agreement beyond chance, and kappa has to be used cautiously since there are some methodological pitfalls; for example, the value of kappa strictly depends on the prevalence of the condition under study: with a high underlying prevalence we expect a high level of agreement (4).

Until now we have considered reliability as a property of the assay in the hands of different readers (reproducibility) or at repeat measurements (repeatability).
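Both quantities are straightforward to compute. The sketch below gives minimal, illustrative implementations (not validated statistical routines): a one-way intra-class correlation coefficient for a balanced design of n subjects by k replicates, and Cohen's kappa for two raters; the simulated data and the small rating example are hypothetical.

```python
import numpy as np

# one-way ICC: between-subject variance over total variance,
# assuming a balanced design y of shape (n_subjects, k_replicates)
def icc_oneway(y):
    n, k = y.shape
    grand = y.mean()
    ms_between = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    var_between = (ms_between - ms_within) / k
    return var_between / (var_between + ms_within)

# Cohen's kappa: agreement beyond chance for two raters on the same samples
def cohens_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    p_obs = (a == b).mean()
    p_chance = sum(np.mean(a == c) * np.mean(b == c)
                   for c in np.unique(np.r_[a, b]))
    return (p_obs - p_chance) / (1 - p_chance)

rng = np.random.default_rng(3)
subj = rng.normal(10, 3, (40, 1))         # between-subject SD = 3
y = subj + rng.normal(0, 1, (40, 3))      # 3 replicates, lab error SD = 1
print(f"ICC ~ {icc_oneway(y):.2f}")       # expect ~ 9 / (9 + 1) = 0.9
print(f"kappa = {cohens_kappa([1,1,0,0,1,0,1,1], [1,0,0,0,1,0,1,1]):.2f}")
```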
Let us now consider validity of assessment, i.e. correspondence with a standard. It is essential to bear in mind that two readers may show very high levels of agreement, as measured e.g. by the Pearson correlation coefficient (e.g. r = 0.9), even if the first consistently records twice the value of the second observer. Alternatively (for example, when using the intra-class correlation coefficient), two readers could show high levels of agreement (e.g. ICC = 0.9) but poor validity, if the same errors repeat themselves for both raters. Now we are interested in the correspondence of the measurement with a conceptual entity, for example accumulation of the p53 protein as a consequence of gene mutation (in fact, without a mutation the protein has a very short half-life and rapidly disappears from the cells).

Table 1.2. below shows data on the correspondence between immunohistochemistry and p53 mutations. Sensitivity of immunohistochemistry is estimated as 85%, i.e. false negatives are 15% of all samples containing mutations; specificity is estimated as 71%, i.e. 29% of samples not containing mutations are falsely positive at immunohistochemistry. A combined estimate of sensitivity and specificity is the area under the receiver operating characteristic (ROC) curve, which represents graphically the relationship between sensitivity and (1 − specificity). It is usually believed (5) that sensitivity and specificity indicate properties of a test irrespective of the frequency of the condition to be detected (this, however, is an assumption that needs to be verified). In the example in the Table, the proportion of samples showing a mutation is high (32/73 = 44%); it would be much lower, for example, in patients with benign bladder conditions or in healthy subjects. A measure which is useful to predict how many subjects, among those testing positive, are really affected by the condition we aim to detect is the positive predictive value. In the example, among 39 patients testing positive at immunohistochemistry, 27 actually have mutations, i.e. immunohistochemistry correctly predicts mutations in 69% of the positive cases. Let us suppose, however, that the prevalence of mutations is not 44%, but 4.4% (32/730). With the same sensitivity and specificity values (85% and 71%, respectively) we would have a positive predictive value of 11.8%, i.e. much lower. The predictive value is a very useful measure, because it indicates how many true positive cases we will obtain within a population of subjects who test positive with the assay we are applying.

Table 1.2. Validity of p53 immunohistochemistry as compared to mutations in the p53 gene (bladder cancer patients) (15)
                 p53 nuclear reactivity (immunohistochemistry)
                 –      +      ++     Total
No mutations     29     7      5      41
All mutations    5      8      19     32
Total            34     15     24     73

Sensitivity of immunohistochemistry (+ and ++) = 27/32 = 85%. Specificity = 29/41 = 71%. Positive predictive value = (8+19)/(15+24) = 27/39 = 69%.
However, we must bear in mind that the predictive value is strongly influenced by the prevalence of the condition: a very low predictive value may simply indicate that we are studying a population in which very few subjects actually have the condition we want to identify.
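The dependence of the predictive value on prevalence follows directly from Bayes' theorem. The short sketch below reproduces the two predictive values quoted above from the rounded sensitivity and specificity (85% and 71%); small discrepancies against the table reflect that rounding.

```python
# PPV from sensitivity, specificity and prevalence (Bayes' theorem)
def ppv(sens, spec, prev):
    true_pos = sens * prev                 # P(test+, condition present)
    false_pos = (1 - spec) * (1 - prev)    # P(test+, condition absent)
    return true_pos / (true_pos + false_pos)

for prev in (0.44, 0.044):
    print(f"prevalence {prev:.1%}: PPV = {ppv(0.85, 0.71, prev):.1%}")
```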
1.5. Publication bias

It has been suggested that publication bias is an important issue in epidemiological research, especially when performing meta-analyses or pooled analyses of epidemiological studies (6,7). The hypothesis behind publication bias is that studies with statistically significant outcomes are generally more likely to be published than nonsignificant studies (8). Publication bias can be investigated by a specific statistical analysis involving funnel plots. The presence of asymmetry in the plot of study precision versus the logarithm of the point estimate (usually the odds ratio, OR) suggests publication bias, which can be tested statistically (9). A significant asymmetry indicates the presence of bias: if publication bias existed, the funnel plot of the published data would show a certain degree of asymmetry because of the lack of small published studies with negative results. It has been suggested that time-lag bias could be an alternative to publication bias (10), with the first studies giving more favourable results compared with subsequent studies (Fig. 1.1.).
Fig. 1.1. The strength of the association is shown as an estimate of the odds ratio (OR), without confidence intervals, for eight topics in which the results of the first study or studies differed beyond chance (P < 0.05) from the results of the subsequent studies. Adapted by permission from Macmillan Publishers Ltd: Nature Genetics (29(3):306–309), copyright (2001).
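The asymmetry test mentioned above (9) can be sketched as a regression of the standardized effect on precision (Egger's approach): an intercept far from zero suggests funnel-plot asymmetry. In the simulation below the true effect is null and the selection rule (small studies published mainly when significantly positive) is invented for illustration; the intercept test assumes SciPy 1.7 or later for `intercept_stderr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# hypothetical meta-analysis of log(OR) estimates with true effect = 0
n = 200
se = rng.uniform(0.05, 0.5, size=n)           # study standard errors
log_or = rng.normal(0.0, se)
# selective publication: significantly positive results, plus a random 30%
published = (log_or / se > 1.96) | (rng.random(n) < 0.3)
y, s = log_or[published], se[published]

# Egger regression: standardized effect vs precision
res = stats.linregress(1.0 / s, y / s)
t = res.intercept / res.intercept_stderr
p = 2 * stats.t.sf(abs(t), df=len(y) - 2)
print(f"Egger intercept = {res.intercept:.2f} (p = {p:.4f})")
```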
1.6. Laboratory drift; study design; quality control

When we organize and analyze an epidemiologic study employing biomarkers, we want to minimize total intra-group variability, in order to identify inter-group differences (e.g. between exposed and unexposed, or between diseased and healthy subjects), if they exist. Total intra-group variation is the weighted sum of inter-subject, intra-subject, sampling and laboratory variation, with weights that are inversely related to the numbers of subjects, measurements per subject, and analytical replicates used in the study design, respectively. Obviously, if we do not have detailed information we cannot adjust for intra-group variation. This is the reason why in epidemiologic studies employing biomarkers it is important to collect, whenever possible: a) repeat samples (day-to-day, month-to-month or year-to-year variation may be relevant depending on the marker); b) potentially relevant information on subject characteristics that may influence inter-subject variation; c) the conditions under which samples have been collected and laboratory analyses have been conducted (batch, assay, specific procedures).

Concerning item c), measurement variation may arise from many different aspects that are related not only to the choice of the assay, but also to:
— collection of the sample (how and when a blood sample was drawn; the type of test tube utilized; the amount of biological material collected; for some measurements, whether the subject was fasting; avoidance of exposure to light if we are interested in vitamin C);
— processing of the sample (e.g. speed of centrifuging to separate different blood components; use of a gradient to separate lymphocytes);
— storage (in a simple refrigerator at –20°C; at –70°C; in liquid nitrogen at –196°C; and for how long);
— laboratory analyses (inter-laboratory variation; assay; technician performing the assay; batch; accidental contamination of the sample).

Therefore, in order to minimize intra-group variation, technical details should be considered. As an example, for blood collection the following variables need to be controlled (11–12):
1. Contamination of collection tubes; for example, in the case of trace metals all materials (needles, tubes, pipettes, etc.) should be of a type which does not release metals.
2. Type of additives; for example, collection of plasma entails the use of heparin.
3. Order of collection tubes; to avoid carry-over of trace additives, tubes without additives should be processed first.
4. Time of venipuncture; for example, measurement of compounds that undergo substantial changes during the day, like hormones, requires very accurate timing.
5. Subject posture; physiological compounds like proteins, iron and cholesterol can be increased by 5–15% in the standing position in comparison with the supine position.
6. Hemolysis, which may occur as a consequence of tube transport and manipulation.
7. Storage conditions.

Laboratory drift is a special problem which, however, is not peculiar to laboratory analyses (for example, drift in the quality of interviews typically occurs during longitudinal epidemiological studies). Laboratory drift is a consequence of changes in procedures
and accuracy over the course of time, so that the first samples analyzed tend to differ from subsequent samples. Avoiding laboratory drift requires a monitoring programme consisting of repeated quality controls; for example, measurements may be compared with a standard at different points in time. Another source of “drift”, which cannot be technically avoided, is the degradation of analytes when they are stored for a long time. For more on how variability in laboratory measurements influences study design decisions, see Rundle et al. (13).
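Such a monitoring programme can be as simple as re-assaying a reference standard with every batch and applying control-chart rules. The sketch below is a minimal illustration (the target value, assay SD and drift pattern are all hypothetical): it flags QC values outside 3 standard deviations of the target and estimates the overall trend, which should be close to zero in a stable laboratory.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical QC series: a reference standard re-assayed with every batch;
# a slow upward drift sets in after batch 30
target, sd = 50.0, 2.0
batch = np.arange(60)
qc = rng.normal(target, sd, size=60) + np.where(batch > 30,
                                                0.3 * (batch - 30), 0.0)

# Shewhart-style rule: flag any QC value outside target +/- 3 SD
flagged = np.abs(qc - target) > 3 * sd
print("flagged batches:", batch[flagged])

# drift check: the slope of QC values over time should be ~ 0
slope = np.polyfit(batch, qc, 1)[0]
print(f"estimated drift: {slope:+.2f} units per batch")
```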
1.7. Overall evaluation: ACCE

ACCE is an evaluation framework that takes its name from the four components of evaluation: analytical validity; clinical validity; clinical utility; and ethical, legal, and social implications and safeguards. The effort builds on a methodology previously described for evaluating screening and diagnostic tests. The ACCE process includes collecting, evaluating, interpreting, and reporting data about DNA (and related) testing for disorders with a genetic component, in a format that allows policymakers to have access to up-to-date and reliable information for decision-making. The ACCE model contains a list of 44 questions, targeting the four areas of ACCE, to develop a comprehensive review of a candidate test for potential use (14).
Analytical validity

Analytical validity focuses on the ability of the test to measure the marker/genotype of interest accurately and reliably. The components of analytical validity are sensitivity, specificity, and test reliability. Sensitivity evaluates how well the test measures the marker/genotype when it is present. Specificity, on the other hand, evaluates how well the test measures the marker/genotype when it is not present. The reliability of a test measures how often the same results are obtained when a sample is retested.
Clinical validity

Clinical validity focuses on the ability of the genetic test to detect or predict the associated disorder (phenotype). Clinical validity can also be expressed as the positive predictive value (PPV), that is, the proportion of individuals who develop the disease among those who have the marker/genotype.
Clinical utility

Clinical utility addresses the elements that need to be considered when evaluating the risks and benefits associated with the introduction of the test into routine clinical practice. A test that has clinical utility, such as blood cholesterol, provides the individual
with valuable information that can be used for prevention, treatment, or life planning, regardless of the result.
References

1. Carmines EG, Zeller RA. Reliability and validity assessment. London: Sage Publications; 1979.
2. Taioli E, Kinney P, Zhitkovich A, Fulton H, Voitkun V, Cosma G, et al. Application of reliability models to studies of biomarker validation. Environ Health Perspect 1994;102:306–9.
3. Lyles CM, Sandler RS, Keku TO, Kupper LL, Millikan RC, Murray SC, et al. Reproducibility and variability of the rectal mucosal proliferation index using proliferating cell nuclear antigen immunohistochemistry. Cancer Epidemiol Biomarkers Prev 1994;3:597–605.
4. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. Br Med J 1992;304:1491–4.
5. Fletcher RH, Fletcher SW, Wagner EH. Clinical epidemiology — the essentials. 2nd edition. Baltimore: Williams and Wilkins; 1988.
6. Friedenreich CM. Methods for pooled analyses of epidemiologic studies. Epidemiology 1993;4:295–302.
7. Colhoun HM, McKeigue PM, Davey Smith G. Problems of reporting genetic associations with complex outcomes. Lancet 2003;361:865–72.
8. Stern JM, Simes RJ. Publication bias: evidence of delayed publication in a cohort study of clinical research projects. BMJ 1997;315(7109):640–5.
9. Egger M, Smith GD, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. Br Med J 1997;315:629–34.
10. Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. Replication validity of genetic association studies. Nat Genet 2001;29:306–9.
11. Young DS, Bermes EW. Specimen collection and processing: sources of biological variation. In: Tietz NW [ed.]. Textbook of clinical chemistry. Philadelphia: W.B. Saunders Co.; 1986.
12. Pickard NA. Collection and handling of patient specimens. In: Kaplan LA, Pesce AJ [eds.]. Clinical chemistry: theory, analysis and correlation. 2nd edition. St. Louis: C.V. Mosby Co.; 1989.
13. Rundle AG, Vineis P, Ahsan H. Design options for molecular epidemiology research within cohort studies. Cancer Epidemiol Biomarkers Prev 2005;14(8):1899–907.
14. Sanderson S, Zimmern R, Kroese M, Higgins J, Patch C, Emery J. How can the evaluation of genetic tests be enhanced? Lessons learned from the ACCE framework and evaluating genetic tests in the United Kingdom. Genet Med 2005;7(7):495–500.
15. Esrig D, Spruck CH III, Nichols PW. p53 nuclear protein accumulation correlates with mutations in the p53 gene, tumor grade and stage in bladder cancer. Am J Pathol 1993;143:1389–97.