Clinical Cancer Research, Vol. 6, 1854–1864, May 2000

Reproducibility of p53 Immunohistochemistry in Bladder Tumors1

Lisa M. McShane,2 Roger Aamodt, Carlos Cordon-Cardo, Richard Cote, David Faraggi, Yves Fradet, H. Barton Grossman, Amy Peng, Sheila E. Taube, Frederic M. Waldman, and the National Cancer Institute Bladder Tumor Marker Network3

National Cancer Institute, Bethesda, Maryland 20892 [L. M. M., R. A., S. E. T.]; Memorial Sloan-Kettering Cancer Center, New York, New York 10021 [C. C-C.]; University of Southern California School of Medicine/Norris Comprehensive Cancer Center, Los Angeles, California 90033 [R. C.]; University of Haifa, 31905 Haifa, Israel [D. F.]; Laval University, Quebec City, Quebec G1R 2J6, Canada [Y. F.]; University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030 [H. B. G.]; The Emmes Corporation, Potomac, Maryland 20854 [A. P.]; and University of California at San Francisco, San Francisco, California 94143 [F. M. W.]

ABSTRACT

The National Cancer Institute Bladder Tumor Marker Network conducted a study to evaluate the reproducibility of immunohistochemistry for measuring p53 expression in bladder tumors. Fifty paraffin blocks (10 from each of the five network institutions) were chosen at random from among high-grade invasive primary bladder tumors. Two sections from each block were sent to each laboratory for staining and scoring, and then all sections were randomly redistributed among the laboratories for a second scoring. Intra- and interlaboratory reproducibility was assessed with regard to both staining and scoring. For overall assessments of p53 positivity, the results demonstrated that intralaboratory reproducibility was quite good. Concordance across the five participating laboratories was high for specimens exhibiting either no or minimal nuclear immunostaining of tumor cells or high percentages of tumor cells with nuclear immunoreactivity. However, there was a reduced level of concordance on specimens with percentages of stained tumor cells in an intermediate range. The discordances were due mainly to staining differences in one of the five laboratories and scoring differences in another laboratory. These results indicate that some caution must be used in comparing results across studies from different groups. Standardization of staining protocols and selection of a uniform threshold for binary interpretation of results may improve assay reproducibility between laboratories.

Received 10/4/99; revised 2/4/00; accepted 2/7/00. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
1 This work was supported by the National Cancer Institute Bladder Tumor Marker Network, Grants CA47538 (to C. C-C.), CA70903 (to R. C.), CA47526 (to Y. F.), CA56973 (to H. B. G.), and CA47537 (to F. M. W.).
2 To whom requests for reprints should be addressed, at National Cancer Institute, Biometric Research Branch, Room 739, Executive Plaza North, MSC 7434, 6130 Executive Boulevard, Bethesda, MD 20892-7434. Phone: (301) 402-0636; Fax: (301) 402-0560; E-mail: [email protected].
3 Members of the Bladder Tumor Marker Network are Roger Aamodt (National Cancer Institute, Bethesda, Maryland), Carlos Cordon-Cardo (Memorial Hospital, New York, New York), Richard Cote (University of Southern California School of Medicine/Norris Comprehensive Cancer Center, Los Angeles, California), Yves Fradet (Laval University, Quebec City, Quebec, Canada), H. Barton Grossman (University of Texas M. D. Anderson Cancer Center, Houston, Texas), and Frederic M. Waldman (University of California, San Francisco, California).

INTRODUCTION

Mutations of the p53 gene and altered p53 expression are common and potentially important prognostic events in bladder cancer and other tumors (1–8). In bladder cancer, a number of studies have shown a high correlation between detection of p53 nuclear accumulation by IHC4 and mutation detection by DNA sequencing (9–13). A significant advantage of IHC over sequencing is that this technique is already commonly used for the assessment of other antigens as tumor markers in many pathology laboratories. Furthermore, p53 nuclear accumulation in tumor cells has been identified in the absence of gene mutation, indicating alternative disruption of this pathway (14–16).

Although some groups have shown that p53 is a significant predictor of bladder cancer progression (6–8), others have concluded that this marker provides no prognostic information (17–20). There are a number of possible explanations for these inconsistencies. First, some of these studies are based on small numbers of patients drawn from patient populations with varying treatments and clinical characteristics; the power of such studies to address the significance of p53 alterations is questionable. Second, assay differences between laboratories, including the choice of anti-p53 antibody, IHC protocol, and scoring criteria, may lead to significant variation in results (21).

To address variation introduced by methodological differences, we performed a study to examine the reproducibility of the p53 IHC assay. This study was undertaken by the five institutions comprising the National Cancer Institute Bladder Tumor Marker Network to evaluate the reproducibility of p53 expression as measured by IHC. We describe the intra- and interlaboratory reproducibility, and the differences due to staining protocol and scoring criteria, that one might experience for the p53 IHC assay in laboratories using the same primary antibody. Although we show that the results are generally reproducible, some variability did exist. Such results should be considered in decisions regarding combining study results across laboratories and in decisions regarding centralized versus noncentralized assays for large prospective diagnostic marker studies.

4 The abbreviation used is: IHC, immunohistochemistry.


Fig. 1 Flow diagram of the study design.

MATERIALS AND METHODS

Tissue Sectioning and Slide Distribution. The study design is diagrammed in Fig. 1. Ten paraffin blocks were chosen at random from high-grade invasive primary bladder tumors collected after 1993 at each of the five institutions (Laval University, M. D. Anderson Cancer Center, Memorial Sloan-Kettering Cancer Center, University of California San Francisco, and University of Southern California). The labels A, B, C, D, and E were randomly assigned to these five laboratories. Eighteen 4- to 5-μm sections were cut from each tumor: 10 for IHC staining, 3 for H&E evaluation (first, middle, and last sections), and 5 were reserved for repeat staining if needed. Sections were cut consecutively until 18 usable sections were successfully transferred onto Probe-On Plus (Fisher Scientific Co., Pittsburgh, PA) positively charged slides. Slides were not baked prior to distribution. The institution contributing a block evaluated the three H&E sections from that block to ensure that there was tumor throughout. The remaining 15 sections were coded and randomly numbered according to a randomization table produced by the National Cancer Institute statistical office. Each institution designated an individual not involved in the staining and scoring of slides to receive the randomization tables for their laboratory and to apply the identification number labels to the slides. Laboratories' copies of the initial randomization tables were destroyed after slide labeling. Each institution then shipped 2 slides from each case to each of the other four institutions (20 slides to each institution) according to the distribution list provided by the statistical office. The distribution list was randomized to ensure that there were no systematic patterns in the physical proximity of slides within a block sent to the same or different institutions. Each laboratory thus had 100 slides to stain and score. After staining and scoring, all laboratories again shipped 20 slides to each of the other institutions for rescoring according to a second randomization list generated by the statistical office. For each case, the pair of slides stained at a particular laboratory was randomly assigned as a unit to one of the five laboratories for rescoring. Staining batches were formed according to each laboratory's usual protocol.

The design, consisting of 50 blocks, with two sections from every block stained and initially scored at each of the five laboratories, allowed for estimation of within-laboratory agreement based on initial scorings to within ±11% (with 95% confidence), assuming a true agreement rate of at least 80%. The maximum (95% confidence) margin of error would be ±14%, attained if the true rate was 50%. Smaller margins of error were possible for estimation of between-laboratory agreement based on initial scorings, for comparisons using second scorings in addition to first, and for estimation of overall agreement rates averaged over the five laboratories.
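As a quick arithmetic check on the stated precision, the half-width of a normal-approximation 95% confidence interval for an agreement rate can be computed directly. The sketch below (Python; it assumes roughly 50 independent within-laboratory slide pairs per estimate, which is a simplification of the actual clustered design) reproduces the ±11% and ±14% figures quoted above.

```python
import math

def margin_of_error(p, n_pairs, z=1.96):
    """Approximate 95% CI half-width for an agreement rate p based on n_pairs paired scorings."""
    return z * math.sqrt(p * (1.0 - p) / n_pairs)

# 50 tumor blocks contribute roughly 50 within-laboratory pairs per laboratory.
for true_rate in (0.80, 0.50):
    moe = margin_of_error(true_rate, 50)
    print(f"true agreement {true_rate:.0%}: margin of error about ±{moe:.1%}")
# Prints roughly ±11% at 80% agreement and ±14% at 50% agreement.
```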

p53 IHC. Each institution stained the 100 slides using antigen retrieval, according to its usual protocol. All institutions used the same lot of anti-p53 antibody (PAb1801 Ab-2; Oncogene Research Products, Cambridge, MA, kindly provided by CalBiochem). Staining was performed in a timely manner, not more than 30 days after receipt of the slides. Each institution used its own positive and negative controls. The avidin-biotin peroxidase method was used by all laboratories, with modifications. Sections were deparaffinized and treated with 3% H2O2 to block endogenous peroxidase activity. They were then immersed in 0.01 M citric acid (pH 6.0) in a microwave oven for 15 min to enhance antigen retrieval (2). After cooling, slides were incubated with normal horse serum for 10 min to block nonspecific staining, followed by an overnight incubation at 4°C with primary antibody at 2 μg/ml (3). After extensive washing, sections were incubated at room temperature for 30 min with biotinylated horse antimouse antibody (1:200 final dilution; Vector Laboratories, Burlingame, CA), and then for 30 min with avidin-biotin peroxidase complexes (1:25 final dilution; Vector Laboratories). Diaminobenzidine (0.06%) was used as the final chromogen, and hematoxylin was used as the nuclear counterstain.

Each of the laboratories used variations on this general protocol. Laboratory A used 10% H2O2 for blocking, and sections were incubated with normal horse serum for 30 min. Laboratory B used a final concentration of 1 μg/ml primary antibody, a 1:500 dilution of biotinylated horse antimouse IgG, incubation with streptavidin-horseradish peroxidase (1:200; Zymed Labs, South San Francisco, CA) instead of avidin-biotin peroxidase complexes, and 0.05% rather than 0.06% diaminobenzidine. Laboratory C microwaved for 17 min, blocked in serum for 20 min, and incubated with primary antibody at room temperature; in addition, laboratory C used the Ultra Streptavidin Detection System (Signet Laboratories, Inc., Dedham, MA). Laboratory D incubated the primary antibody at room temperature and at a final concentration of 2.5 μg/ml, used a 1:100 final dilution of avidin-biotin peroxidase complexes, and used 0.03% rather than 0.06% diaminobenzidine. Laboratory E differed only in using a 15-min microwave treatment.

Scoring. The entire tissue section was screened for positive tumor cells, defined as cells with nuclear staining. The estimated percentage of tumor cells that stained positively, the intensity of the staining (1+ to 4+), and an overall assessment of positive or negative for the slide were recorded along with identifiers and comments on background and cytoplasmic staining. The percentage of cells staining positively for p53 was recorded on an ordered categorical scale: 0 = zero, 1 = 1–5%, 2 = 6–10%, 3 = 11–20%, 4 = 21–30%, 5 = 31–40%, 6 = 41–50%, 7 = 51–70%, 8 = 71–100%, or "cannot be assessed." The overall (binary) assessment for a slide involved a judgment of the percentage of stained cells subjected to a "positivity" threshold, and the result was recorded as positive, negative, "equivocal," or "cannot be assessed." The thresholds applied by the laboratories involved in this study were ≥10% (laboratory A), ≥10% (laboratory B), >5% (laboratory C), ≥10% (laboratory D), and ≥20% (laboratory E). All results were recorded on a standardized score sheet. A single individual in each laboratory performed all scoring for this study.

Rescoring. Each institution received 20 slides from each of the other institutions for rescoring. These were combined with 20 slides that were originally stained and scored at that institution, making a total of 100 slides. Rescoring criteria were identical to those for the primary scoring.

Completeness of Data. In total, 500 slides (10 sections from each of 50 tumor blocks) were prepared for analysis in the study. One slide was lost in shipping, leaving 499 slides on which scoring was attempted. For purposes of analysis, percentage of staining scores coded as "cannot be assessed" and binary scores coded as either "cannot be assessed" or "equivocal" were treated as missing values. The result was that all but 36 of the 499 slides had both a first and a second percentage of staining score, and all but 38 slides had usable (non-missing) values in both a first and a second binary scoring. Five slides having neither a first nor a second binary and percentage of staining score were later found to have all come from the same tumor block, in which the portion of the block from which the sections were cut contained minimal or no tumor. Results from the 10 sections cut and stained from that tumor block were excluded from all subsequent analyses even if some laboratories had reported scores for them. Because of the small percentage of measurements (<3%) excluded from the analyses, it was not expected that ignoring the missing values would introduce significant bias into the results.

Statistical Analyses. The two measures of p53 status evaluated for their reproducibility in this study are the percentage of stained cells, recorded on the categorical scale described in the "Scoring" section above, and the overall (binary) assessment of p53 positivity. Reproducibility was assessed both by computing the simple percentage of agreement between pairs of scores and by examining differences in average staining percentage levels and rates of overall positivity across laboratories.

Percentage of Agreement. Estimates of the percentage of agreement were calculated as the percentage of matching scorings in all pairs of scorings made on the same slide or on two different slides from the same block, depending on the comparison of interest. Standard errors for these estimates were calculated using a stratified jackknife procedure for clustered data in which the blocks were the independent (primary) sampling units and the strata were the contributing institutions. The stratified jackknife variance estimator used here is similar to the usual jackknife estimator except that rather than iteratively deleting single observations, all observations from a given block (primary sampling unit) are deleted at each iteration, and the "deleted" estimate is computed by reweighting observations from blocks in the same stratum from which the given block was deleted. A precise definition is given by Korn and Graubard (22).
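The study computed these standard errors in SUDAAN with a stratified jackknife. Purely as an illustration of the delete-a-block idea, the sketch below implements a simplified, unstratified delete-one-block jackknife for a clustered agreement rate in Python; the data layout and function names are hypothetical, and the published analysis additionally stratified blocks by contributing institution.

```python
from typing import Dict, List, Tuple

def agreement_rate(pairs_by_block: Dict[str, List[Tuple[int, int]]]) -> float:
    """Percentage of matching scorings over all pairs, pooled across blocks."""
    matches = total = 0
    for pairs in pairs_by_block.values():
        for a, b in pairs:
            matches += int(a == b)
            total += 1
    return matches / total

def jackknife_se(pairs_by_block: Dict[str, List[Tuple[int, int]]]) -> float:
    """Delete-one-block jackknife SE, treating blocks as primary sampling units."""
    blocks = list(pairs_by_block)
    n = len(blocks)
    leave_one_out = []
    for b in blocks:
        reduced = {k: v for k, v in pairs_by_block.items() if k != b}
        leave_one_out.append(agreement_rate(reduced))
    mean_est = sum(leave_one_out) / n
    var = (n - 1) / n * sum((e - mean_est) ** 2 for e in leave_one_out)
    return var ** 0.5

# Toy example: (first scoring, second scoring) binary calls for a few slides per block.
toy = {"block1": [(1, 1), (1, 0)], "block2": [(0, 0), (0, 0)], "block3": [(1, 1)]}
print(agreement_rate(toy), jackknife_se(toy))
```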

Fig. 2 Overall p53 staining positivity. Each column represents a single tumor block, with height representing the total number of slides on which first scorings were available. The black portion of each column indicates the number of slides scored as positive, and the white portion denotes the number of slides scored as negative.

Staining Percentage Levels and Positivity Rates. Descriptive statistics for staining percentage levels and staining positivity rates were obtained by calculating means for all staining-by-scoring laboratory combinations. For the percentage of cell staining measurement, each of the response categories 0–8 was scored by its category midpoint. For example, measurements falling into category 2 (6–10%) were scored as 8%. The mean value of the midpoint scorings was then computed for each staining-by-scoring laboratory combination. The mean of the overall binary assessments recorded by a given staining-by-scoring laboratory combination is equal to the proportion of positively scored slides for that combination. The standardized means for scoring laboratories and for staining laboratories were each balanced over the effects of the other factor, i.e., scoring laboratory means were standardized to a hypothetical situation in which equal numbers of slides were stained at each of the five laboratories, and staining laboratory means were standardized to a situation in which equal numbers of slides were scored at each of the five laboratories. This removes the imbalance resulting from the fact that in this study all slides stained in a particular laboratory received their first scoring from that same laboratory. The means and their SEs were calculated by SUDAAN PROC DESCRIPT (23), using a stratified jackknife procedure similar to that described above for the percentage of agreement estimates.
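To make the midpoint scoring concrete, the snippet below maps the 0–8 category codes to percentages and averages them for one staining-by-scoring combination. Only the category-2 value (6–10% scored as 8%) is stated in the text; the remaining midpoints are assumed here to be the arithmetic midpoints of the category bounds.

```python
# Category midpoints for the ordered cell-staining scale; only the category-2
# value (6-10% -> 8%) is stated explicitly in the text, the rest are assumed
# arithmetic midpoints of the category bounds.
CATEGORY_MIDPOINT = {
    0: 0.0,    # 0%
    1: 3.0,    # 1-5%
    2: 8.0,    # 6-10%
    3: 15.5,   # 11-20%
    4: 25.5,   # 21-30%
    5: 35.5,   # 31-40%
    6: 45.5,   # 41-50%
    7: 60.5,   # 51-70%
    8: 85.5,   # 71-100%
}

def mean_midpoint(category_scores):
    """Mean cell-staining percentage for one staining-by-scoring laboratory
    combination, ignoring 'cannot be assessed' slides (coded as None)."""
    usable = [CATEGORY_MIDPOINT[c] for c in category_scores if c is not None]
    return sum(usable) / len(usable)

print(mean_midpoint([2, 4, 7, None, 0]))  # -> 23.5
```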

Cumulative logit proportional odds models (24) for correlated ordinal data were used to analyze p53 cell staining percentages, and logistic regression models for correlated binary data were used to analyze overall p53 staining positivity assessments. Both models contained additive terms representing staining and scoring effects. Staining-by-scoring interaction terms were initially considered in each model but were not significant and were dropped from the model. These analyses were implemented in SUDAAN PROC MULTILOG (23) using generalized estimating equations methods (25) with a “working” independence correlation matrix. Blocks were designated as primary sampling units, and contributing institutions formed the strata.
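The models were fit in SUDAAN PROC MULTILOG. For readers who want to approximate the binary-endpoint analysis in open-source software, a roughly equivalent GEE logistic fit could be sketched with Python's statsmodels as below; the file and column names are hypothetical, and the sketch ignores the stratification by contributing institution used in the published analysis.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per scoring of a slide, with the overall
# binary call and the staining laboratory, scoring laboratory, and tumor block.
df = pd.read_csv("p53_scorings.csv")  # columns: positive, stain_lab, score_lab, block

model = smf.gee(
    "positive ~ C(stain_lab) + C(score_lab)",   # additive staining and scoring effects
    groups="block",                              # tumor blocks as primary sampling units
    data=df,
    family=sm.families.Binomial(),               # logistic model for the binary positivity call
    cov_struct=sm.cov_struct.Independence(),     # "working" independence correlation
)
result = model.fit()
print(result.summary())
```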

RESULTS

Sources of Variability. Figs. 2 and 3 show the results of the overall p53 staining positivity assessments and the cell staining percentages obtained on first scorings of all slides. The use of second scorings rather than first scorings produced essentially the same results (not shown). The variations observed in Figs. 2 and 3 reflect multiple sources of variability. First, there is intraslide variation, which may consist of intrascorer variation (the same scorer reads the same slide twice) and interscorer variation (two different scorers read the same slide). Interslide variation may include, in addition, variation due to staining laboratory and true biological variation in p53 expression between sections.

There was generally more uniformity in binary scorings within a block (Fig. 2) than in cell staining percentages (Fig. 3). However, there were several cases in which scorings of two slides may have agreed for cell staining percentage but not for the binary assessments because of the different positivity thresholds applied by different laboratories. For example, the sections in block 25 showed cell staining percentages approximately in the 0–40% range, but the binary assessments were approximately split between positive and negative, with three missing scores. This block presented difficulties because many of its sections had cell staining percentages hovering in the positivity threshold range. Block 42 demonstrated the reverse situation: cell staining percentages fell into three categories, but all staining percentages varied within the 40–80% range, so all were above the positivity thresholds set by the five laboratories.

Fig. 3 Percentage of cells staining positively for p53. Each column represents a single tumor block, with height representing the total number of slides on which first scorings were available. The columns are subdivided according to the number of scorings falling into each cell staining percentage category, with the degree of shading indicating the midpoint of the category.

Percentage of Agreement between Binary Scores. Fig. 4 displays intraslide variation measured as the percentage of slides in which first and second binary scorings agreed. Agreement of replicate scorings was not, on average, affected by which laboratory performed the staining, and scoring laboratories did not consistently agree significantly more with themselves than with other laboratories. The main exception was that the binary scoring agreement of each laboratory with laboratory C was generally lower than agreement between other pairs of laboratories. When a laboratory’s scoring with itself (intrascorer variation) was compared, the percentage of agreement averaged over the five laboratories was 93% (SE, 3.4%). This value is comparable to the 91% agreement (SE, 2.0%) achieved when two different scorers scored the same slide (interscorer variation). Fig. 5 examines whether intraslide agreement was a function of cell staining percentage. It suggests that agreement of replicate binary scores of the same slide was highest at the lower and upper extremes of the cell staining percentage scale. We also found that the cell staining percentage assessments themselves were more reproducible at the extremes (results not shown).


Fig. 4 Intraslide percentage of agreement for overall p53 staining positivity. Pairs of letters labeling each point denote the two laboratories for which scoring is being compared. When the two letters in the point label agree (e.g., A vs. A), intrascorer variation is represented; when they differ, interscorer variation is represented. The X axis indicates the staining laboratory. The estimates represented by the plotted points are each based on 14–20 paired scorings (slides).

Different thresholds were considered to determine whether the interscorer agreement on binary assessments could be improved by enforcement of a uniform threshold for declaring a slide positive rather than allowing each laboratory to choose its own threshold. Uniform thresholds of typical levels of >0%, >5%, >10%, and >20% were considered, and each threshold was applied to the percentage of cell staining measurements reported by the laboratories. The estimated percentages of interscorer agreement were 87% (SE, 2.5%) using the >0% threshold, 94% (SE, 1.7%) using the >5% threshold, 93% (SE, 1.9%) using the >10% threshold, and 90% (SE, 2.9%) using the >20% threshold. Compared with the 91% interscorer agreement achieved when each laboratory used its own threshold, the >5% and >10% thresholds resulted in modest observed improvement.
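Because these candidate thresholds coincide with boundaries of the cell staining categories (>0%, >5%, >10%, and >20% correspond to category codes of at least 1, 2, 3, and 4, respectively), dichotomizing the categorical scores and recomputing agreement is straightforward. The sketch below (Python, with a hypothetical data layout) illustrates one way to do it.

```python
# Thresholds of >0%, >5%, >10%, and >20% map onto minimum category codes on the
# 0-8 cell-staining scale (0, 1-5, 6-10, 11-20, 21-30, ...).
THRESHOLD_TO_MIN_CATEGORY = {0: 1, 5: 2, 10: 3, 20: 4}

def interscorer_agreement(paired_categories, threshold_pct):
    """Percentage of slides on which two scorers' dichotomized calls match.

    paired_categories: list of (scorer1_category, scorer2_category) per slide.
    """
    min_cat = THRESHOLD_TO_MIN_CATEGORY[threshold_pct]
    matches = sum((a >= min_cat) == (b >= min_cat) for a, b in paired_categories)
    return 100.0 * matches / len(paired_categories)

pairs = [(0, 1), (2, 2), (3, 2), (5, 6), (8, 8)]  # toy example
for t in (0, 5, 10, 20):
    print(f">{t}% threshold: {interscorer_agreement(pairs, t):.0f}% agreement")
```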

The percentage of agreement between scorings on two different slides from the same block was computed as a measure of interslide variability; it may reflect staining variability and biological variability as well as scoring variability. Fig. 6 shows the percentage of agreement of binary scores on sections from the same block stained in different laboratories but scored by the same laboratory (the staining laboratory was not necessarily the same as the scoring laboratory). The average overall estimated percentage of agreement in Fig. 6 is 86% (SE, 2.8%). If the slides were scored in different laboratories in addition to being stained at different laboratories, the estimated percentage of agreement between binary scorings of two slides from the same block decreased slightly, to 83% (SE, 2.6%).

The percentage of agreement between pairs of slides in which staining and scoring were performed in one laboratory was examined both to assess intralaboratory variability and to examine potential biological variation or tumor heterogeneity. The percentage of agreement estimates for binary scorings when all staining and scoring was performed in a single laboratory were 94% (laboratory A), 93% (laboratory B), 87% (laboratory C), 100% (laboratory D), and 100% (laboratory E). When these values were averaged over the five laboratories, the overall estimate was 95% (SE, 1.6).

If the staining laboratory was not necessarily the same as the scoring laboratory, but a single staining laboratory and a single scoring laboratory were used, then averaging over all staining-by-scoring laboratory combinations still gave an overall percentage of agreement estimate of 95% (SE, 1.3).

Fig. 5 Intraslide agreement as a function of average p53 cell staining percentage. The plot was constructed by computing the mean of the two scorings on each slide, categorizing those means using the same set of categories as the original cell staining percentage scores, and plotting the percentage of agreement on p53 positivity in each category. Bars, ± SE. The number on top of each bar indicates the number of slides involved in the calculation for that point.

To investigate the potential contribution of biological variation to the interslide variation, interslide percentage of agreement estimates in which the staining laboratory and the scoring laboratory were held fixed were examined for association with the distance between slides. If there were substantial biological variation in p53 status on the scale represented by the amount of specimen encompassed by the collection of sections cut from each tumor block, then one might expect the percentage of agreement to decrease for more widely separated sections. No such association was found, and therefore it could not be concluded that there was evidence for biological heterogeneity on this scale.

The percentage of agreement estimates for binary scoring, organized by the sources of variability represented, are summarized in Table 1. The estimates in Table 1 suggest that, for the five laboratories involved in this study, staining differences contributed more to the total variability than scoring or biological variability.

Staining Percentage Levels and Positivity Rates. As a descriptive measure of whether there were systematic patterns of lower or higher p53 staining assessments among the laboratories that might explain some of the disagreements of scores within and between slides, mean scores cross-classified by staining and scoring laboratory were examined. The means for cell staining percentage category midpoints are presented in Table 2, and the staining positivity rates based on the binary assessments are given in Table 3. Table 2 suggests lower average cell staining percentages for slides stained in laboratory A. Table 3 suggests higher positivity rates for binary scoring made by laboratory C and lower positivity rates for slides stained in laboratory A.

To test the statistical significance of these observed differences between laboratories, additive models containing laboratory staining and scoring effects were fit. For cell staining percentages, there were significant overall laboratory staining (P < 0.0001) and scoring (P = 0.0255) effects (Table 4).

Pairwise tests for staining differences for each of the 10 possible pairs of laboratories were considered statistically significant at the 0.005 level (0.005 = 0.05/10, adjusting for multiple comparisons). By this criterion, there were significant differences between laboratory A and each of the other laboratories and between laboratory C and laboratory D. Staining in laboratory A resulted in significantly lower average cell staining percentages than staining in all other laboratories. Slides stained by laboratory C had average cell staining percentages significantly lower than slides stained by laboratory D. No pairwise scoring difference reached significance at 0.005. The standardized means presented in the last row of Table 2 suggest that the overall significance of the scoring laboratory effects is likely due to an accumulation of fairly minor differences rather than any striking difference among laboratories.

For overall p53 staining positivity assessments, there also were significant laboratory staining (P = 0.0006) and scoring (P = 0.0042) effects (Table 5). Pairwise tests (significance at 0.005) revealed significantly lower rates of positive results for slides stained by laboratory A compared with all other laboratories except laboratory C (P = 0.0240, staining by laboratory A versus laboratory C). When laboratory C scored, the average positivity rates were significantly higher than when any of the other laboratories scored.

To investigate whether the apparent lower staining levels for slides stained in laboratory A could be attributed to just one or two aberrant staining batches, the average cell staining percentage category midpoint and the proportion of overall positives were computed for every staining batch in every laboratory. The batches of slides stained by laboratory A generally had lower cell staining rates, and the lower staining rates from laboratory A in this study therefore did not appear to be simply an artifact of one or two aberrant staining batches.

Fig. 6 Interslide percentage of agreement for overall p53 staining positivity. The estimates are based on pairs of slides scored by the same laboratory but stained by different laboratories. Pairs of letters labeling each point denote the two staining laboratories being compared. The X axis indicates the scoring laboratory. The estimates represented by the plotted points are each based on 28–40 paired scorings such that, within any pair, the two scorings were made on different slides.

Table 1  Percentage of agreement (SE) according to sources of variability represented

Variability sources represented                                                            % agreement for overall (binary) staining positivity (SE)
Intrascorer (1 scorer/1 slide)a                                                            93 (3.4)
Intra- + interscorer (2 scorers/1 slide)                                                   91 (2.0)
Intrascorer + biological (1 scorer/2 slides/1 staining lab)                                95 (1.3)
Intrascorer + biological + staining (1 scorer/2 slides/2 staining labs)                    86 (2.8)
Intra- + interscorer + biological + staining (2 scorers/2 slides/2 staining labs)          83 (2.6)

a A single scorer scores the same slide twice.

DISCUSSION

An inherent difficulty in conducting prognostic marker studies is that the prognostic value and reproducibility aspects usually are addressed simultaneously. If the underlying prognostic value of a marker is very good but the marker cannot be measured reproducibly, its strength may be masked. More importantly, its potential for widespread clinical use is greatly diminished if the marker measurements cannot be relied on for making clinical management decisions for individual patients. Therefore, an assessment of assay reproducibility is essential in the development of any prognostic marker.

In this report, we were able to identify both considerable concordance and some discordance among the p53 assays in these five experienced laboratories. Intralaboratory reproducibility was generally quite good. Regarding interlaboratory reproducibility, agreement of binary assessments was good at the extremes of low or high nuclear tumor cell staining percentages. Considering the full range of cell staining percentages, agreement of binary assessments was reduced, largely because of differences exhibited by two of the five laboratories. There were significant differences in staining by one of the laboratories (laboratory A) compared with the other four, and significant differences in overall (binary) scoring by another laboratory (laboratory C) compared with the rest. Slides stained by laboratory A showed significantly less p53 cell staining. Laboratory C used a lower cell staining percentage threshold for scoring slides as overall positive and, as a result, scored significantly more slides as overall positive than the other four laboratories did. The remaining three laboratories exhibited good agreement with one another in both staining and scoring.


Table 2  p53 cell staining percentage category midpoint by staining laboratory and scoring laboratorya

                     Scoring lab
Staining lab         A               B               C               D               E               Standardized mean, % (SE)
A                    17.64 (4.29)    9.74 (9.30)     34.56 (12.82)   17.53 (9.49)    15.84 (9.49)    19.06 (4.73)
B                    18.23 (9.57)    36.10 (5.64)    45.35 (12.95)   37.60 (11.80)   37.70 (14.25)   35.00 (5.75)
C                    40.11 (7.54)    49.68 (12.80)   30.63 (5.07)    27.03 (11.13)   14.85 (7.59)    32.46 (4.71)
D                    57.35 (13.60)   32.25 (13.56)   28.10 (12.60)   36.58 (5.63)    47.74 (13.79)   40.40 (6.09)
E                    27.68 (10.54)   27.00 (11.91)   40.18 (13.05)   23.19 (10.58)   41.40 (6.00)    31.89 (5.52)
Standardized mean    32.20 (4.59)    30.95 (5.53)    35.76 (5.79)    28.39 (5.17)    31.51 (5.56)    31.76 (4.65)

a Results given as mean (SE) %.

Table 3  p53 positivity by staining laboratory and scoring laboratorya

                     Scoring lab
Staining lab         A          B          C          D          E          Standardized mean, % (SE)
A                    35 (7)     11 (7)     78 (16)    32 (17)    32 (17)    37 (7)
B                    35 (16)    56 (7)     70 (16)    70 (16)    50 (17)    56 (7)
C                    78 (13)    63 (16)    61 (7)     42 (18)    30 (16)    55 (7)
D                    70 (16)    50 (19)    70 (16)    57 (8)     76 (16)    65 (8)
E                    53 (19)    50 (17)    89 (11)    57 (20)    59 (7)     62 (8)
Standardized mean    54 (7)     46 (7)     74 (7)     52 (8)     50 (8)     55 (6)

a Results given as % (SE).

To improve scoring agreement, one must first determine whether disagreements are occurring in the estimation of the percentage of cells staining positive, at the binary (thresholded) level, or both. For the five laboratories studied here, it appeared that the major scoring differences resulted from the use of different thresholds. The observed intraslide agreement on the binary scorings was improved by use of a uniform threshold of >5% or >10%, largely by bringing laboratory C's threshold in line with the other laboratories. If there had been large differences at the cell staining percentage level, then allowing each laboratory to adjust for its staining and scoring patterns by use of its individually selected threshold might be most appropriate. Admittedly, this study was not designed to evaluate all possible thresholds because the cell staining percentage data were collected as grouped (categorical) data. In addition, any "optimal" threshold suggested by the data would require validation on an independent data set to determine not only whether the revised threshold(s) improve reproducibility but also whether the new measurements yield overall more accurate prognostic information.

The most obvious way to address staining differences is to standardize the laboratory staining protocols. The present study was intentionally designed without a standardized staining protocol to assess the reproducibility of results under existing practices and to address the question of whether it would be valid to retrospectively combine assay results from different laboratories. The answer to that question seems to be that it would not be advisable to combine assay results from different laboratories for use in retrospective studies without consideration of potential differences in assay results between laboratories. Even with standardized protocols, it is possible that some subtle differences would remain; therefore, reproducibility should be reassessed after standardization. Just as in the case of optimizing scoring criteria, final selection of a standardized staining protocol should be made with consideration of prognostic value.

The diminished reproducibility observed in binary scoring when cell staining percentages were in the low to mid range may be explained by multiple factors. The first, and most obvious, is that the laboratories' thresholds fell in that range. Even if the laboratories were in perfect agreement on the cell staining percentage assessments, their binary assessments may have disagreed because they used different positivity thresholds. The second factor is that we found proportionately less variability in the cell staining percentage assessments within and between laboratories at the two extremes of the cell staining percentage scale. It would make sense that if a specimen was virtually devoid of p53-overexpressing tumor cells or was very densely populated with overexpressing tumor cells, there would be little room for subjective interpretation. These results suggest that for specimens with cell staining percentages falling in this intermediate range, analyses of additional sections, or additional analyses such as mutation analysis of the same section, might be advisable. Duplicate staining and assessment of such borderline staining usually is performed by the participating laboratories; however, the study design did not permit these additional analyses. Whether such a reassessment would decrease the observed variability in these low-to-mid-range slides is uncertain.

Some issues that could not be addressed by this study are how well laboratories might agree on p53 assessments of lower stage or grade tumor specimens and how well results might agree if staining were performed on sections cut from different blocks of the same tumor. In addition, variability due to different lots of antibodies could not be addressed. This study used a common lot of primary antibody and might therefore be expected to show a better level of reproducibility than if multiple lots of antibody or different primary antibodies had been used.


Table 4  ANOVA for staining and scoring effects in p53 cell staining percentage category assessments

Factor                                    Degrees of freedom    Adjusted Wald F test    P
Full model vs. intercepts-only model      8                     6.63                    <0.0001
Staining labs                             4                     9.66                    <0.0001
Scoring labs                              4                     3.10                    0.0255

Table 5  ANOVA for staining and scoring effects in overall (binary) p53 staining positivity assessments

Factor                                    Degrees of freedom    Adjusted Wald F test    P
Full model vs. intercepts-only model      8                     4.67                    0.0005
Staining labs                             4                     6.07                    0.0006
Scoring labs                              4                     4.49                    0.0042

There are several implications of these results for the conduct of future studies attempting to resolve the issue of the prognostic value of p53 in bladder cancer. Great care must be exercised in drawing conclusions across studies using different laboratories because there can be substantial variation in assay results. Some assay methods (or even a standardized assay carried out in some hands) may ultimately prove more valid than others in the sense of more accurately measuring something of biological or prognostic significance. To formally conduct a retrospective study by combining preexisting assay results from different laboratories would, at minimum, require that statistical adjustments for the assaying laboratories be attempted, and it would likely require that assay measurements be available in their most "raw" form rather than preprocessed into dichotomous values. However, such a solution might be only approximate and might require additional data comparing the laboratories' assays with one another or with a common reference. For large-scale prospective studies, centralization of all assays in a single laboratory using the current best candidate assay procedure would be desirable; if multiple laboratories are to be involved, then verification of interlaboratory reproducibility must occur before embarking on the study. If a best candidate assay method has not yet emerged, a prospective study involving multiple assays for the same marker per specimen provides an extraordinary opportunity to address issues of prognostic value and reproducibility simultaneously. Alternatively, such a study may be conducted on a retrospective specimen collection by performing multiple assays on each specimen.

ACKNOWLEDGMENTS

We thank G. Gherman and S. Long for data management, and B. Graubard, E. Korn, and F. Meyer for statistical advice and helpful comments on earlier drafts of this report. We also acknowledge the expert assistance and advice of G. Beaudry, W. Benedict, E. Charytonowicz, K. Chew, H. LaRue, D. Murray, S. Shi, B. Tetu, and S. Thu.

REFERENCES

1. Hollstein, M., Sidransky, D., Vogelstein, B., and Harris, C. C. p53 mutations in human cancers. Science (Washington DC), 253: 49–53, 1991.
2. Levine, A. J., Perry, M. E., Chang, A., Silver, A., Dittmer, D., Wu, M., and Welsh, D. The 1993 Walter Hubert Lecture: the role of the p53 tumour-suppressor gene in tumorigenesis. Br. J. Cancer, 69: 409–416, 1994.
3. Boyle, J. O., Hakim, J., Koch, W., van der Riet, P., Hruban, R. H., Roa, R. A., Correo, R., Eby, Y. J., Ruppert, J. M., and Sidransky, D. The incidence of p53 mutations increases with progression of head and neck cancer. Cancer Res., 53: 4477–4480, 1993.
4. Miller, C. W., Simon, K., Aslo, A., Kok, K., Yokota, J., Buys, C. H., Terada, M., and Koeffler, H. P. p53 mutations in human lung tumors. Cancer Res., 52: 1695–1698, 1992.
5. Spruck, C. H. III, Ohneseit, P. F., Gonzalez-Zulueta, M., Esrig, D., Miyao, N., Tsai, Y. C., Lerner, S. P., Schmutte, C., Yang, A. S., Cote, R., Dubeau, L., Nichols, P. W., Hermann, G. G., Steven, K., Horn, T., Skinner, D. G., and Jones, P. A. Two molecular pathways to transitional cell carcinoma of the bladder. Cancer Res., 54: 784–788, 1994.
6. Sarkis, A. S., Zhang, Z., and Cordon-Cardo, C. p53 nuclear overexpression and disease progression in Ta bladder cancer. Int. J. Oncol., 3: 355–360, 1993.
7. Esrig, D., Elmajian, D., Groshen, S., Freeman, J. A., Stein, J. P., Chen, S. C., Nichols, P. W., Skinner, D. G., Jones, P. A., and Cote, R. J. Accumulation of nuclear p53 and tumor progression in bladder cancer. N. Engl. J. Med., 331: 1259–1264, 1994.
8. Sarkis, A. S., Dalbagni, G., Cordon-Cardo, C., Zhang, Z-F., Sheinfeld, J., Fair, W. R., Herr, H. W., and Reuter, V. E. Nuclear overexpression of p53 protein in transitional cell bladder carcinoma: a marker for disease progression. J. Natl. Cancer Inst., 85: 53–59, 1993.
9. Finlay, C. A., Hinds, P. W., Tan, T. H., Eliyahu, D., Oren, M., and Levine, M. J. Activating mutations for transformation by p53 produce a gene product that forms an hsc70-p53 complex with an altered half-life. Mol. Cell. Biol., 8: 531–539, 1988.
10. Banks, L., Matlashewski, G., and Crawford, L. Isolation of human TP53 specific monoclonal antibodies and their use in the studies of human TP53 expression. Eur. J. Biochem., 159: 529–534, 1986.
11. Esrig, D., Spruck, C. H. III, Nichols, P. W., Chaiwun, B., Steven, K., Groshen, S., Chen, S. C., Skinner, D. G., Jones, P. A., and Cote, R. J. p53 nuclear accumulation correlates with mutations in the p53 gene, tumor grade and stage in bladder cancer. Am. J. Pathol., 143: 1389–1397, 1993.
12. Cordon-Cardo, C., Dalbagni, D., Saez, G. T., Oliva, M. R., Zhang, Z-F., Rosai, J., Reuter, V. E., and Pellicer, A. p53 mutations in human bladder cancer: genotypic versus phenotypic patterns. Int. J. Cancer, 56: 347–353, 1994.
13. Dalbagni, G., Cordon-Cardo, C., Reuter, V., and Fair, W. Tumor suppressor alterations in bladder cancer. Surg. Oncol. Clin. N. Am., 4: 231–240, 1995.
14. Lutzker, S. G., and Levine, A. J. A functionally inactive p53 protein in teratocarcinoma cells is activated by either DNA damage or cellular differentiation. Nat. Med., 2: 804–810, 1996.
15. Stein, J. P., Ginsberg, D. A., Grossfeld, G. D., Chatterjee, S. J., Esrig, D., Dickinson, M. G., Groshen, S., Taylor, C. R., Jones, P. A., Skinner, D. G., and Cote, R. J. Effect of p21WAF1/CIP1 expression on tumor progression in bladder cancer. J. Natl. Cancer Inst., 90: 1072–1079, 1998.
16. Cote, R. J., and Chatterjee, S. J. Molecular determinants of outcome in bladder cancer. Cancer J. Sci. Am., 5: 2–15, 1999.
17. Lipponen, P. K. Overexpression of p53 nuclear oncoprotein in transitional cell bladder cancer and its prognostic value. Int. J. Cancer, 53: 365–370, 1993.
18. Underwood, M. A., Reeves, J., Smith, G., Gardiner, D. S., Scott, R., Bartlett, J., and Cooke, T. G. Overexpression of p53 protein and its significance for recurrent progressive bladder tumours. Br. J. Urol., 77: 659–666, 1996.
19. Lianes, P., Charytonowicz, E., Cordon-Cardo, C., Fradet, Y., Grossman, H. B., Hemstreet, G. P., Waldman, F. M., Chew, K., Wheeless, L. L., and Faraggi, D. Biomarker study of primary nonmetastatic versus metastatic invasive bladder cancer. National Cancer Institute Bladder Tumor Marker Network. Clin. Cancer Res., 4: 1267–1271, 1996.
20. Jahnson, S., and Karlsson, M. G. Predictive value of p53 and pRb immunostaining in locally advanced bladder cancer treated with cystectomy. J. Urol., 160: 1291–1296, 1998.
21. Shi, S. R., Cote, R. J., and Taylor, C. R. Antigen retrieval: past, present, and future. J. Histochem. Cytochem., 45: 327–343, 1997.
22. Korn, E. L., and Graubard, B. I. Analysis of Health Surveys, p. 30. New York: John Wiley & Sons, 1999.
23. Shah, B. V., Barnwell, B. G., and Bieler, G. S. SUDAAN User's Manual, Release 7.5. Research Triangle Park, NC: Research Triangle Institute, 1997.
24. McCullagh, P. Regression models for ordinal data. J. R. Stat. Soc. B, 42: 109–142, 1980.
25. Zeger, S. L., and Liang, K. Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42: 121–130, 1986.