
CLIN. CHEM. 40/12, 2216-2222 (1994) • Laboratory Management and Utilization

Verification of Reference Ranges by Using a Monte Carlo Sampling Technique

Earle W. Holmes,¹ Stephen E. Kahn, Peter A. Molnar, and Edward W. Bermes, Jr.

We have investigated the application of Monte Carlo significance tests to the verification of reference ranges in the context of the transfer of an established range from one laboratory to another. Here we present an introduction to the Monte Carlo technique, outline a procedure for performing these tests using a commercially available software program, and demonstrate some of the operating characteristics of the tests when they are used to compare samples of different sizes and variances.

Indexing Terms: hypothesis testing/statistics/reference intervals

Department of Pathology, Loyola University Stritch School of Medicine, 2160 S. First Ave., Maywood, IL 60153. Fax 708-2164146.
¹Author for correspondence.
Received July 5, 1994; accepted August 29, 1994.

The establishment of reference ranges is of critical importance to the laboratory scientist. The ideal approach to this task calls for each laboratory to establish reference values locally by sampling healthy individuals from the population it serves. Although the theory and the methods that apply to this process are relatively straightforward (1-4), practical problems related to the definition of the reference population, the recruitment of sufficient numbers of subjects, the rigorous exclusion of disease or other conditions that might affect the analyte of interest, and the random and independent sampling of a population can easily compromise the final result. The many pitfalls and the considerable expense associated with the establishment of reference ranges have prompted many laboratories to “adopt” or “transfer” reference values from the literature or from manufacturers’ package inserts. In fact, the transfer of reference ranges for the same or an acceptably comparable analytical system probably accounts for most of the present reference interval assignments in clinical laboratories (4).

Standards in the recently enacted regulations (Clinical Laboratory Improvement Amendments of 1988) appear to allow for the transfer of reference ranges but clearly state that each laboratory must verify the appropriateness of its reference ranges, regardless of whether they were determined in-house or by a manufacturer (5). The standards imply that, in either case, the laboratory must have data to demonstrate the applicability of a reference range to its own patient population.

Although statistical methods for the de novo establishment of reference ranges are well described (4, 6), the application of statistics to verifying the appropriateness of a previously established range has received little attention. If one assumes that the general approach to this problem would involve sampling a limited number of normal individuals and comparing the distribution of the variable in this locally drawn test sample with its distribution in a larger, remotely drawn reference sample, how might one proceed with the evaluation of the data? Several statistical methods, including the t-test, the F-test, the Mann-Whitney test, the χ² test, and the Kolmogorov-Smirnov (KS) two-sample test, might be applied, but the specific problem at hand poses certain challenges to each of these more-traditional statistical approaches.² For example, the F-test assumes that the variable of interest is normally distributed in both samples, and the t-test is not applicable when the underlying distribution of a response variable is badly skewed (7). It is often the case that reference range data do not conform to a normal distribution (8), and this potential problem can be difficult to address because statistical methods for the evaluation of normality typically have low power at small sample sizes. Rank tests based on the median and comparisons of sample distributions by using the χ² or the KS test are applicable to nongaussian data. However, these alternatives are generally less powerful than the parametric approaches and are further limited by the fact that the values of χ² or D (the statistic calculated in the KS test) do little to illuminate the precise nature of the differences between the distribution of a variable in two samples.

The Monte Carlo (MC) tests (9, 10) represent useful alternatives to the more traditional methods of significance testing. In an MC test, a test criterion (or statistic) calculated from the observed data is compared with a group of values calculated from simulated samples generated randomly and in accordance with the hypothesis being tested (11).
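As an illustration only, the logic of such a test can be sketched in Python; this is not the program used in the study. The sketch assumes that simulated samples are drawn without replacement from the reference sample, that the test statistic is the absolute difference between a sample mean and the reference-sample mean, and that the significance level is estimated as (NGE + 1)/(NS + 1). The function name mc_test, its parameters, and the default of 999 simulated samples are hypothetical choices made for this example.

```python
import random
from statistics import mean


def mc_test(reference, test, ns=999, seed=0):
    """Monte Carlo test of the null hypothesis that `test` was drawn
    at random from `reference`.

    Returns (benchmark, nge, p), where benchmark is the statistic for the
    observed data, nge is the number of simulated samples whose statistic
    is >= benchmark, and p is the estimated significance level.
    """
    n = len(test)
    if n == 0 or n > len(reference):
        raise ValueError("test sample must be non-empty and no larger than the reference sample")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ref_mean = mean(reference)

    # Assumed test statistic: absolute difference between the sample mean
    # and the mean of the full reference sample.
    benchmark = abs(mean(test) - ref_mean)

    nge = 0
    for _ in range(ns):
        # Equivalent to shuffling the reference sample and taking n values.
        simulated = rng.sample(reference, n)
        if abs(mean(simulated) - ref_mean) >= benchmark:
            nge += 1

    p = (nge + 1) / (ns + 1)
    return benchmark, nge, p


if __name__ == "__main__":
    # Invented data for demonstration; not taken from the study.
    reference = [i / 100 for i in range(100)]
    test = reference[::10]
    print(mc_test(reference, test))
```

With ns = 999, the smallest significance level this sketch can report is 1/1000, which occurs when no simulated sample equals or exceeds the benchmark value.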
In practice, an MC test is carried out by repeatedly shuffling the observations in the reference sample, drawing from these observations simulated samples that are the same size as the test sample, and calculating the value of the test statistic for each of the simulated samples. The outcome of the test is determined by ranking the test statistic for the actual observed data (the “benchmark” value of the test statistic) among the values of the test statistic calculated for the simulated samples generated from the reference set. MC tests are nonparametric (12); however, in contrast to rank tests, they preserve the information provided by the cardinal relationships among the data (13). Furthermore, MC tests can be based on practically any test criterion, even those for which exact sampling distributions have not been derived. Thus, comparisons can be based on test criteria that evaluate specific aspects of the differences between two samples.

The present study was undertaken to examine the application of MC tests to the verification of reference ranges. The particular model of interest involves the comparison of a small sample of “normal range” results (test sample) with a larger sample (reference sample) for which a reference range has been established. The null hypothesis of the MC test is that the test sample was randomly drawn from the reference sample. This implies that any difference between the two samples is due only to sampling error (i.e., that the test sample is small). The rejection of the null hypothesis at a significant probability level constitutes evidence that the two samples are different, and that reference ranges based on the reference sample are not appropriate for the population from which the test sample was drawn. Here, we outline a procedure for setting up MC calculations; present a computer program that rapidly performs these calculations, using the language and utilities of a commercially available software package; demonstrate the operating characteristics of the MC approach when it is applied to test and reference samples of different sizes and variances; and examine the use of different test statistics in the MC format.

²Nonstandard abbreviations: KS, Kolmogorov-Smirnov; MC, Monte Carlo; DIFF, absolute difference between two means; NS, the number of simulated samples in a Monte Carlo test; N, the size of a simulated sample in a Monte Carlo test; NGE, the number of Monte Carlo samples that have a value for a test statistic greater than or equal to the benchmark value; and NCCLS, National Committee for Clinical Laboratory Standards.

Materials and Methods

Selection of Reference and Test Samples

Samples of different sizes and male/female compositions were randomly selected from uric acid results determined in a reference range study comprising 282 ostensibly healthy men (n = 107) and women (n = 175). The study was carried out in accordance with the guidelines established by the Institutional Review Board for the Protection of Human Subjects of Loyola University Medical Center.
The original data demonstrated a highly significant bias between the group means (mmol/L) by sex (women: mean = 0.26, SD = 0.07; men: mean = 0.33, SD = 0.07; t = -8.54, P < 0.0001, t-test). The composite reference and test samples were of three types: all men; all women; mixed men and women. Pairwise comparisons between these three types of samples were used to test the ability of the MC test to accept like-to-like comparisons and to reject like-to-unlike comparisons. The reference and test samples were mutually exclusive. The distribution of uric acid concentrations in each sample was tested for normality by the KS goodness-of-fit test (14). The significance level of this test was determined by reference to a table of critical values appropriate for testing data from a normally distributed population for which the mean and variance are not specified (15).
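The KS statistic for a normality check of this kind can be sketched as follows. This is an illustrative Python sketch rather than the procedure used in the study: it assumes that the mean and SD are estimated from the sample itself, and it returns only the statistic D, which would still have to be compared with tabulated critical values for the case of unspecified mean and variance (15). The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_normal_statistic(values):
    """Return the KS statistic D: the largest vertical distance between the
    empirical distribution of `values` and a normal distribution whose mean
    and SD are estimated from the same values.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    mu, sigma = mean(xs), stdev(xs)
    if sigma == 0:
        raise ValueError("all observations are identical")

    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mu, sigma)
        # The empirical distribution steps from (i - 1)/n to i/n at x,
        # so both sides of the step are checked.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

No significance level is computed here because, when the parameters are estimated from the data, the critical values differ from those of the standard KS test and must be taken from an appropriate table.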

Monte Carlo Sampling

MC tests were performed with a program written for Resampling Stats, IBM version 3.14 (Resampling Stats, Arlington, VA). This software is designed to rapidly draw simulated samples, calculate statistical parameters on the samples, and tabulate the distribution of the statistics. A program for the comparison of two samples by using the absolute value of the mean difference (DIFF) as the test statistic is presented in the Appendix. The input for the program consists of (a) ASCII files containing data for the reference and test samples and (b) specification by the user of the number of simulated samples (NS) to draw and the sample size (N), the latter being equal to the number of observations in the test sample. The output of the program includes the benchmark value of the test statistic (as calculated from the original reference and test samples); the number of simulated samples in which the value of the test statistic was greater than or equal to the benchmark value (a quantity known as NGE); and the distribution of the test statistic in the simulated samples, displayed in both histogram and tabular formats. The significance level of a single MC test was calculated as (NGE + 1)/(NS + 1) (12). The observed significance levels were adjusted for the effect of NS by use of a statistical table presented in ref. 12 (Appendix D, Table A: This table presents, for different values of NS, the probability that an observed significance level would be