Stochastic Modeling of Early Hematopoiesis Michael A. Newton
Peter Guttorp Sandra Catlin Janis L. Abkowitz
Renato Assunca~o
March, 1995 Abstract Hematopoiesis is the body's way of making the cellular constituents of blood. Oxygen transport, response to infections, and control of bleeding are among the functions of dierent mature blood cells. These speci c functions are acquired as cells mature in the bone marrow. Stem cells are the \master cells" at the top of this pedigree, having within them the capacity to reconstitute the entire system. While the latter stages of hematopoiesis are fairly well understood, the functioning of stem cells and other multi-potential cells is currently a matter of intense research. This paper presents a statistical analysis providing support for the clonal succession model of early hematopoiesis. J. L. Abkowitz and colleagues at the University of Washington have developed an experimental method for studying the kinetics of early hematopoiesis in a hybrid cat. The essence of the method is to analyze G6PD, an enzyme linked to the X-chromosome. The G6PD type of a cell forms a binary marker that is passed down to all its descendant cells. Data record time series of proportions of one G6PD type in cells from the bone marrow, providing information about the number and lifetime of unobservable stem cells. Studies were performed after the autologous transplantation of G6PD heterozygous cats with limited numbers of hematopoietic stem cells. Preliminary analysis of the observed proportions indicates that under these circumstances the proportion of cells with one type of G6PD is not constant over time. A simple stochastic model is used to quantify the relationship between observed proportions and unobserved stem cell populations. The model has a hidden-Markov structure. We develop parameter estimates, con dence sets, and goodness-of- t tests for this model. For our simple model, a recursive updating algorithm allows computation of the multi-modal likelihood functions. A similar algorithm produces estimates of the realized Markov process. The parametric bootstrap is used to calibrate likelihood-based con dence sets and to perform simple
Michael A. Newton is Assistant Professor of Statistics and Biostatistics at the University of Wisconsin-Madison.
Peter Guttorp is Professor of Statistics and Sandra Catlin is a graduate student in the Statistics Department at the University of Washington, Seattle. Renato Assunca~o is Assistant Professor at Universidade Federal de Minas Gerais, Brazil. Janis L. Abkowitz is Professor of Hematology at the University of Washington, Seattle.
1
goodness-of- t tests. We address the question of whether stem cells have a constant proliferative potential between cats, and we discuss criticisms of the simple model.
stem cells, clonal succession, glucose phosphate dehydrogenase, hidden Markov model, recursive updating, parametric bootstrap, overdispersion. Keywords:
1 HEMATOPOIESIS: THE PRODUCTION OF BLOOD CELLS The bone marrow of humans and other vertebrata contains a relatively small number of remarkable cells, the hematopoietic stem cells. Through replication and dierentiation, these cells produce all the dierent kinds of blood cells: red cells which transport oxygen throughout the body; white cells which form the immune defense; and platelets which initiate clotting. A human creates 3-10 billion platelets, red cells, and neutrophils (one kind of white cell) every hour, and in an emergency the rate of production can increase by an order of magnitude (Golde, 1991). Neutrophils are active for eight hours in humans, while platelets and red cells circulate for seven days and four months, respectively. With stem cells at the source, hematopoiesis is the complex dynamic process which maintains this vast and varied blood cell population. See Brecher, Beal, and Schneiderman (1993) or Lemischka (1992) for reviews. Although it has proven very hard to isolate stem cells from the bone marrow, they must have certain functional properties. To maintain a reserve for the lifetime of the animal, it is essential that stem cells are able to self-replicate. Also, they must be able to supply cells for dierentiation and development into mature blood cells. As cells progress along a pathway from stem cell to mature blood cell, there is a sequential loss of potential to become any one of the mature blood cell types. That is, cells become specialized as they divide and dierentiate. Stem cells at the beginning of these pathways are totipotent cells, i.e., they can produce all the dierent types of mature blood cells. Further along the path a partially committed cell (called a progenitor cell) may still be able to produce several dierent types of mature cells. It appears that the capacity to dierentiate along individual pathways is lost randomly, until only one lineage remains (Ogawa, 1993). When a progenitor cell divides it may replicate itself and/or dierentiate and become the head of a sequence of cell divisions leading to mature blood cells. It has not been possible to distinguish stem cells from progenitor cells using morphological characteristics. There are two dierent theories regarding the kinetics of early hematopoiesis. In both, a large supply of stem cells is postulated. These stem cells represent the only source for mature blood cells, and must last the lifetime of the animal. One theory, which we will call the standard theory, is that the entire supply of stem cells is proliferating (actively dividing), perhaps with a very slow rate of division. Thus, at any time all stem cells contribute to hematopoiesis. The other, Kay's 2
(1965) theory of clonal succession, hypothesizes that most stem cells are inactive, and that at any time a small proportion are proliferating. This theory postulates that the stem cells have a nite lifespan, at the end of which they are replaced by previously dormant stem cells (see Brecher et al., 1986, for a discussion). Evidence of clonal succession may have important implications for research on cancer treatments, bone marrow transplantation, and gene transfer methods. In the next section we describe an experiment that provides some information about early hematopoiesis. Section 3 contains some preliminary data analytic discussions, while Section 4 describes a simple stochastic model. Inference for this model is described in Sections 5 and 6. The hypothesis of constant proliferative potential of stem cells between dierent individuals is assessed in Section 7. In Section 8 we critically evaluate assumptions of the simple model from Section 4.
2 A WINDOW TO EARLY HEMATOPOIESIS Much of the work on stem cells, particularly the development of the standard theory of hematopoiesis, has been done using experiments on mice. Since the lifetime of mice is relatively short, this evidence may be spurious. Single cells (stem cells) from one mouse can maintain hematopoiesis in another throughout its lifetime. Retrovirally marked stem cells can contribute to hematopoiesis through 2-3 serial transplants. However, this does not by itself prove that stem cells have very long life times, only that these life times are long compared to the life time of a mouse. For this reason, Abkowitz et al. (1988,1990,1992) embarked on a series of experiments using a larger animal species{the Safari cat. A Safari cat is a cross between the domestic cat (Felis catus) and the South American Georoy wild cat (Leopardus georoyi). Data on ten Safari cats are analyzed in this paper. Every somatic cell (i.e. non-sex cell) in a female Safari cat can be classi ed into one of two groups according to the expression of the enzyme glucose phosphate dehydrogenase (G6PD) coded by a gene on the X chromosome. Binary classi cation is possible because only one of the two copies of the gene actually produces the enzyme. Early in embyrogenesis, and apparently at random, either the maternal or the paternal X chromosome is inactivated. The single G6PD type (Georoy or domestic) of each cell is conserved through replication and dierentiation after X chromosome inactivation, and hence provides a binary marker of each cell and its clone (the set of descendants of a given cell). The G6PD type of a cell can be determined by electrophoresis. Preliminary work has indicated that G6PD is a neutral marker. Although stem cells have not been isolated in the laboratory, it is possible to analyze dierent types of progenitor cells from the cats under study. Colonies can be derived from both erythroid progenitors (red cell progenitors) and granulocyte/macrophage progenitors (progenitors of two kinds of white cells). At a number of sampling times (which varied between cats), colonies were 3
grown from progenitor cells sampled from the marrow of each cat. Typically there were 50-100 such colonies grown, and the the G6PD type of the corresponding 50-100 progenitors was determined at each sampling time. There was little dierence in the G6PD phenotypes between the two progenitor types at each sampling time, and so the data were combined. (This observation is consistent with a stochastic mechanism for early dierentiation and commitment.) Data from each cat form a time series of proportions of domestic-type progenitors. The analysis of 50 to 100 progenitors at each sampling time allows us to study moderate, but not particularly small, uctuations of G6PD type within the progenitor cell population. To obtain a more focused view of hematopoiesis, autologous bone marrow transplantations were done on the other ve cats. Some bone marrow cells were extracted, each subject was irradiated to kill all remaining marrow, and then only 1 ? 2 107 cells = kg were returned. One cat (number 40005) was retransplanted after about a year. As before the transplantations, percentages of domestictype progenitor cells were recorded every 2-3 weeks following the recovery of normal blood cell counts. Hematopoiesis was operating to maintain normal blood production, but many fewer stem and progenitor cells must have been involved in the process. The rising and falling of clones implied by the clonal succession hypothesis suggests that at some resolution, uctuations in the G6PD type can be detected. Thus autologous marrow transplantation eectively improves the resolution of measurement. The data for ve experimental (one with two transplants) and ve control cats are summarized in Table 1, and are available in full from statlib1 .
3 PRELIMINARY DATA ANALYSIS For each cat in the experiment, raw data form a time-series of counts fYi g of domestic-type colonies in samples of size fni g at sampling times fti g, for i = 1; 2; : : : ; m. The nature of sampling suggests that counts are conditionally independent with (Yi jpt ) Binomial (ni; pt ) i
i
where pt is the proportion of domestic-type cells in the progenitor pool at time t. From considerations of cell kinetics, models can be developed for the proportions fpt g. But before we attempt such modeling, it is necessary to study the gross features of the time-series. Two important questions must be addressed: 1. Is there evidence that the pt are not equal? i
i
2. If they are not equal, is there evidence that they follow some pattern? For example, do they exhibit positive autocorrelation? Send electronic mail to
[email protected] with the one-line message send newton hema from datasets 1
4
If the answer to question 1 is no, then there is no need to explore models for uctuating fpt g. If the answer to question 2 is yes, then there is structure in the time series which may provide insight into the workings of early hematopoiesis. The observed variance of the Yi can be used to test the null hypothesis that the answer to question 1 is no. Letting p equal the common value of the fpt g under the null hypothesis, it is convenient to introduce a standardized count: i
i
Zi = pni (Yi =ni ? p^)
P
P
where p^ = i Yi = i ni is the MLE of p. For large ni , which we have in this experiment, the Zi are well approximated by a normal distribution with mean 0 and variance p(1 ? p), under the null hypothesis. By a standard approximation, the test statistic (Cox and Hinkley, 1978, p.29)
Q=
m X i=1
Zi2 = (^p(1 ? p^)) 2m?1
under the null hypothesis of constant p. Signi cant extra-binomial variation exists in all but one of the experimental cats (Table 1). No overdispersion is detected in control cats. To address the second question, we inspect lag-1 autocorrelations of the Zi . Signi cant positive correlation exists in these standardized counts for four of the six experimental cats, as measured by a two-sided t test. No departures from independence are detected in series from the control cats. Sample sizes from the control cats are relatively small, and hence the tests have less power against the overdispersion and dependence alternatives in these series.
4 MODELING HEMATOPOIESIS 4.1 General Considerations The complex processes that control the dierentiation and ampli cation of hematopoietic cells cannot be determined from the G6PD data alone. Any model designed to learn something from these data must necessarily ignore details while respecting basic structure of early hematopoiesis. The rst simplifying measure is to consider the cells of interest as falling into one of two compartments: a stem-cell compartment, denoted SC , and a progenitor-cell compartment, denoted PC . The essential modeling problems are illustrated in Figure 1 which is an idealized view of the progenitor-cell compartment at two time points. (Note that in a real system, many thousands of cells inhabit PC , although only about 30 or so are shown in this idealization.) At time 1, PC is composed of three clones; that is the ospring from three stem cells released into PC from SC . In this schematic, clone 2 dominates the pool, having more cells than clones 1 and 3 combined. For 5
the cells sampled at time 1, the G6PD type is recorded. Later, at time 2, the composition of PC has changed{this evolution being driven by several factors:
expansion of a clone (e.g., clones 1 and 3) terminal dierentiation (e.g., most of clone 2) release of new stem cells (e.g., clone 4) We refer to the lifetime of a clone as the time from release of its stem-cell ancestor until terminal dierentiation of its constituent cells. One limitation of the data is that the clone of a cell cannot be identi ed. For example, cells from clone 2 cannot be distinguished from cells of clone 3, given the binary nature of the marker. If clonal succession explains cellular development, then we expect the clones to be relatively few in number, and to last a relatively short amount of time. If the standard theory is more appropriate, then clones are many and have a relatively longer lifetime. We present, in the next section, a hidden Markov model which describes the transfer of cells from SC to PC , the changes in PC driven by the three factors above, and the observation process. Parameters in the model describe behavior of stem cell clones, and can represent either clonal succession or the standard theory.
4.2 A simple model If the stem-cell compartment, SC , has a large number of dormant or self-replicating cells, it is reasonable to assume that the proportion of domestic-type cells in this compartment is constant over time. We denote this proportion by pd . As a dynamic process, the number of clones composing PC may uctuate because of terminal dierentiation and new stem-cell release. A balance is expected if the process is stable. If the
uctuations are not too large, it is reasonable to assume (at least as a rst approximation) that the total number of clones, denoted N , stays constant over time. A number X (t) of these N clones are of the domestic type at time t. The fraction X (t)=N uctuates between 0 and 1 because a depleted clone may be replaced by a clone having a dierent G6PD type. Strictly speaking in this model, a new stem cell is released if and only if an existing clone dies; that is, terminally dierentiates. If the lifetimes of dierent clones are independent, and follow an exponential distribution with rate parameter , then X (t) is a continuous-time, nite-state, Markov process. This state X (t) stays constant except when a clone dies. At such a time, the state can either go up by one, or drop by one, or stay the same, depending on the G6PD type of the clone and the newly released stem cell. The transition intensities of such changes are
x ! x + 1 with intensity (N ? x)pd 6
x ! x ? 1 with intensity x(1 ? pd ) This continuous-time process induces a nite Markov chain fX1 ; X2 ; : : : ; Xn g by restriction to the sampling times. Because sampling times are unequal, the transition probabilities for this chain are not stationary. They can be written
P (Xi = kjXi?1 = j ) =
min( j;k) X
j l
l=max(0;j +k?N )
!
!
N ? j pl pj?l pk?l pN ?j?k+l k ? l 11 10 01 00
(1)
where
p00 p01 p10 p11
= = = =
e?(t ?t ?1 ) + (1 ? pd)(1 ? e?(t ?t ?1 ) ) pd (1 ? e?(t ?t ?1 ) ) (1 ? pd )(1 ? e?(t ?t ?1 ) ) e?(t ?t ?1 ) + pd(1 ? e?(t ?t ?1 ) ): i
i
i
i
(2)
i
i
i
i
i
i
i
i
To establish (1), consider an equivalent coin tossing experiment. A room contains N people who are independently ipping coins at random exponential time intervals. The probability of heads for each coin is pd , and the mean time between ips for each person is 1=. The number of facing heads at time t, denoted X (t), has the same probability structure as the number of active, domestic-type clones described above. For a given coin, let puv be the probability of being in state v at time ti given it was in state u at time ti?1 , where 0 and 1 indicate tails and heads respectively. The probability p01 , for example, is obtained by noting that a change from 0 to 1 is equivalent to having at least one ip of the coin and having the last ip come up heads. The number of ips between ti?1 and ti is Poisson with mean (ti ? ti?1 ) and is independent of the outcome of these
ips{hence the form in (2). Continuing, consider a particular transition from time ti?1 to time ti in which l of the j heads remain heads and the rest become tails. Thus j ? l of the heads become tails, k ? l of the tails become heads, and N ? j ? k + l of the tails remain as !tails. The probability of this ! j ?l pk?l pN ?j ?k+l and there are precisely j N ? j such particular transitions transition is pl11 p10 01 00 l k?l xing l heads to stay heads. Summing over the unknown l establishes (1). The number X (t) of domestic-type clones in PC in uences the number of domestic-type cells in this progenitor pool. The proportion pt of domestic-type cells in PC diers from X (t)=N because all the clones do not have exactly the same number of cells at every point in time. Intuitively, we expect pt to equal X (t)=N on average (see Section 8.2). A rst approximation is to assume that these uctuations are negligible, and thus the observation distribution is: (3) (Y jX (t ); N ) Binomial n ; X (ti ) ; i
i
i
7
N
where Yi ; ni , and ti are as de ned in Section 3. In the coin analogy, at each of m sampling times, a sample with replacement from the faces of the N coins is shown to an observer, who records the total number ni and the number of heads Yi . From the observed series, the observer is asked to estimate N , pd , and . Intuitively, the observed average should be near pd , and the variance and particularly the autocorrelation inform about the size and frequency of shifts in X (t)=N and hence about N and .
5 LIKELIHOOD CALCULATIONS The likelihood function L forms the basis of our inference procedures, however, evaluation of L at a given parameter triple = (N; ; p) is nontrivial because the model probabilities are speci ed in terms of an unobservable process. As is somewhat standard for hidden Markov models, a recursive algorithm for evaluating L can be designed which takes advantage of the Markov structure and the conditional independence of observations given the state. The likelihood function is
L(; y) = P (Y1 = y1 ; Y2 = y2 ; : : : ; Ym = ym ) = P (Y = y)
(4)
where y = (y1 ; y2 ; : : : ; ym ) are the observed counts from one series. Because the model is speci ed in two stages, it is natural to rewrite the likelihood
L(; y) = =
X
P (Y = yjX = x)P (X = x) (5) x2S "( ) ( )# m m X Y Y P (Yi = yijXi = xi) P (X1 = x1 ) P (Xi = xi jXi?1 = xi?1 ) : i=2 x2S i=1
The summation in (5) is over the set of all possible state values at all sampling times
S = fx = (x1 ; x2 ; : : : ; xm ) : 0 xi N for all ig ; again where each xi indicates the number of domestic-type clones in the progenitor-cell compartment at time ti in a single cat. The cardinality of S is (N + 1)m , making computation based on (5) unviable. The likelihood function can be computed eciently by taking advantage of the Markovian structure of the hidden process. In contrast to (5),
L(; y) = P (Y1 = y1 )
m Y i=2
P (Yi = yi jY1 = y1 ; : : : ; Yi?1 = yi?1 )
8
where each factor in this product can be expanded into a simple sum. Writing y1k for (y1 : : : yk ),
P Yi = yi jY1i?1 = y1i?1
= =:
N X j =0 N X j =0
P (Yi = yijXi = j ) P Xi = j jY1i?1 = y1i?1
ui(j ) vi(j )
for i > 1. The observation probabilities ui (j ) are readily computed from assumption (3). In the literature on hidden Markov models, the vi (j ) are called predictive probabilities because they describe the distribution of the unknown state conditional on previous observations. Fortuitously, for i > 2 these predictive probabilities satisfy
vi (j ) /
N X k=0
P (Xi = j jXi?1 = k) ui?1 (k) vi?1 (k)
and thus can be calculated recursively from i = 1 to i = m using (1). To initiate the recursion, put v1 (j ) = P (X1 = j ) computed under the assumption that the process starts in equilibrium (a binomial distribution), and note that the rst factor P (Y1 = y1 ) can be computed from v1 (j ) and u1 (j ) for j = 0; 1; : : : ; N . This development is related to, but distinct from, the forward component of Baum's (1972) forward-backward algorithm. Baum's algorithm computes the conditional probability of the hidden state given the observed data rather than the marginal probability of the observed data. In either case, the calculation is at a xed parameter value . To maximize each likelihood over the three-dimensional parameter space, we use a modi ed Hit-and-Run method of stochastic search (Zabinsky et al., 1993). From a point k , we choose a random direction dk uniformly on the unit hypersphere around the point, and a step size k chosen uniformly from f : k + dk 2 C g where C is a convex bounded set of possible points. The new point is kept provided the likelihood increases, and otherwise a new try is made. An advantage with this method is that it does not require dierentiability of the likelihood function. In order to assess the robustness of this approach, ten runs, each with 2,000 attempted improvements, were computed. Generally the estimate of N was stable between runs, and the estimates of and pd were similar. A nal optimization consisted of a regular Newton-type search at the given value of N , optimizing over and pd . Figure 2 shows likelihood contours for the parameters N and T = 1= (weeks) for the experimental cats. The third parameter, pd , is xed at its MLE in each case. The likelihoods are multimodal, highly non-elliptical, and can be quite at in direction of the mean clonal lifetime T . Data provide information about the level of X (t)=N at the sampling times. If relatively large changes occur (i.e. when both N and T are small), the location and extent of the changes informs about the unknown parameters. Given the background binomial noise, it is impossible to detect 9
very small changes in X (t)=N . The maximum likelihood estimates of N tend to be small for the experimental series, suggesting a small number of active stem cell clones. The estimates of mean clonal lifetime T are small for three series, but are poorly de ned for the others because few changes in X (t)=N apparently occurred during the sampling frame. Data from 40004, 40005b, and 40665 appear to support the clonal succession hypothesis The likelihood surfaces (not shown) are very at for data from the ve untreated Safari cats in this study, and it is possible that a fairly large number of stem cells are active. This observation is consistent with that of O'Quigley (1992) using data from Nash, Storb, and Neiman (1988) on human bone marrow transplant recipients suggesting that about 100 stem cells are active post transplant in this much larger animal. Baum's (1972) algorithm can be used to estimate the unobserved state X (t). This smoothing algorithm uses both forward and backward recursions to compute
P (Xi = j jY1 = y1 ; : : : ; Ym = ym )
(6)
for any i and j , and for a given value of . (See also Devijver, 1985.) The precise recursions for the model discussed here are given in Guttorp, Newton, and Abkowitz (1990). With a maximum likelihood estimate of the parameter plugged into (6), a reconstruction of the state is the sequence x^ = (^x1 ; : : : ; x^m ) for which x^i is the mode of the distribution in (6). Figure 3(d) shows such a reconstruction for data from cat 40005b, compared to simulations of the hidden state used in a bootstrap algorithm. The reconstruction shows us the extent of uctuations in the underlying progenitor-cell population.
6 FREQUENCY CALIBRATION The set of parameters whose likelihood is at least 100(1 ? r)% of the maximum L(^(y); y) is called a 100(1 ? r)% likelihood region for (see Kalb eisch, 1978, for example). The likelihood ratio exceeds r in this set, or equivalently, the approximate pivot R(y; ) = ?2 log L^(; y) L((y); y) is smaller than ?2 log(r). Although nothing in the de nition relates to coverage probability, this set can be calibrated approximately in smooth models using Wilks' theorem (Wilks, 1938) which ensures that R(y; ) is asymptotically 2 on dim() degrees of freedom. This translates to dropping about 3 units of log-likelihood to de ne a 95% likelihood-based con dence set in a two-parameter problem. Since N is an integer-valued parameter, the regularity conditions for Wilks' theorem do not apply. It is not clear that the theorem should apply even approximately, since N determines 10
the support of the underlying Markov process. To empirically validate Wilks' result, we apply a parametric bootstrap. The bootstrap distribution of R(y; ) is the distribution induced by the model assuming parameters equal their estimated values. We use the methods of Section 5 to compute each MLE ^(y). In a two-step sampling scheme, nboot time series are generated: z1 ; z2 ; : : : ; znboot . In the rst step, a hidden state X is generated by running continuous time Markov process with parameters determined by the tted model. Then, restriction to the xed sampling times gives us the binomial success probabilities of the observation distribution which are used to get zj . This twostage sampling procedure is illustrated in Figures 3 and 4. Figure 3 shows three realized X =N^ 's simulated from the model t to the data for cat 40005b, compared to the reconstruction of X=N^ using the smoothing algorithm of Section 5 (levels plotted correspond to the sampling times only). Figure 4 shows three zj 's derived by adding binomial noise to the simulated X =N 's of Figure 3. The fourth panel shows the actual observed series for cat 40005b. The vertical bars at the bottom of each graph represent the sample sizes ni at the sampling times ti . These sizes are the same for every bootstrap replication, equal to about 100 on average over i. Note that the simulated series in Figure 4 are qualitatively similar to the observed series, suggesting that the model ts the data reasonably well. For each bootstrapped series zj , we approximate the loglikelihood ratio statistic Rj := R(zj ; ^(y)) by grid search to obtain the bootstrap MLE. The empirical distribution of the Rj 's converges to the bootstrap distribution of R(y; ) as nboot increases. The bootstrap distribution for 40005b indicates that the likelihood ratio is slightly stochastically larger than the 2 limit, and so con dence sets are slightly larger than obtained by Wilks' theorem. In Newton and Geyer (1994), the accuracy of the bootstrap is checked using a double bootstrap procedure. This computationally intensive diagnostic suggests that the bootstrap calibration is accurate in this case. We conclude that data from three of the six experimental series (40004, 40005b, 40665) have both small N and short mean clonal lifetime, supporting the clonal succession model. Bootstrapping is applicable to the other three series, however MLEs readily get estimated on the boundary of the grid. Noting that Wilks' theorem is slightly liberal, we cannot conclude that mean clonal lifetime is small in the other series. Bootstrapping also provides one way to assess the t of the simple model. In the spirit of Monte Carlo testing (e.g. Besag and Diggle, 1977) we select a test statistic, like the autocorrelation of the standardized counts, and simulate the distribution of that test statistic under the tted model. Since the null hypothesis is composite, this procedure is actually a bootstrap test (Efron and Tibshirani, 1993). Applying such tests to the autocorrelation statistics, we see that the observed autocorrelations fall in the middle of the bootstrap distributions (data not shown). This suggests that our simple model is suciently exible to explain the range of observed autocorrelations in all 11
series.
7 PROLIFERATIVE POTENTIAL Abkowitz et al. (1990) suggested, on the basis of only three experimental series, that stem cell clones may have a de ned proliferative potential, so that the number of divisions or number of ospring per stem cell lifetime may be constant. Based on these data there seemed to be a tendency for short average lifetimes to be associated with small numbers of stem cell clones, while the control cats exhibited data consistent with long life times and large numbers of stem cells, but with likelihood surfaces that were too at to make any reasonable inference. A simple measure of the proliferative potential (number of progeny per stem cell life span) of a stem cell is the ratio T=N , where T = 1= is the expected clonal lifetime. For example, in cat 40005b an individual cell may support 1/10 of hematopoiesis for 10 weeks, with a proliferative potential of .95. In a study of cats subjected to chemotherapy instead of autologous transplantation (Abkowitz et al., 1993) the analysis of Section 5 yielded a very dierent picture. For one cat, followed for six years after rst therapy, the estimated proliferative potential was less than 0.05. Separate analysis of the early and late parts of the data, using the model of Section 4, indicated that as time goes on, the hematopoietic stem cell behavior becomes more and more abnormal. This suggests that initially cells with minimal damage supported hematopoiesis, and that as these cells were exhausted through terminal dierentiation, more damaged cells supported blood cell production. With the currently available data it is feasible to assess the hypothesis of constant proliferative potential among cats, at least in a simple form. The tted value of T=N was .85, holding the pd for various cats at their respective maximum likelihood estimates, and the log likelihood decreased by 4.8. If standard asymptotic theory applies, this corresponds to a p-value of 0.08, not strong evidence against this model. Judging from the individual MLEs (Figure 5), it appears that average lifetime is decreasing (rather than increasing) with the number of active cells. We therefore tted a model in which TN is a constant (although N and T can vary from cat to cat), and compared the resulting likelihood to that obtained when both T and N varied freely. The resulting log likelihood dierence was 3.74. Standard asymptotics produce a p-value of 0.19, so the model with constant proliferative potential, as measured by TN , provides a slightly better description of possibly constant proliferative potential. The MLE of the TN is 130 clone-weeks, and the tted curve is shown in Figure 5. The cats with relatively at likelihood surfaces clearly have only a limited in uence on the tted model.
12
8 MODEL CRITICISM The assumptions in our simple model of early hematopoiesis require further scrutiny. We look at three dierent areas of concern: N may be varying over time; the clones may contribute unequally to hematopoiesis; and the process may be non-stationary.
8.1 Non-constant N
Biological considerations suggest that the number of active clones may vary over time. Births and deaths of clones are coupled in our simple model forcing N to remain constant. A simple extension has N following a birth-death process. Stem cells are activated at some constant rate b , with a fraction pd of the births being of the domestic type. As before, each active stem cell has a clone surviving an exponentially distributed amount of time, and so the rate of deaths, per clone, is from Section 4.2. Properties of this process are readily established (e.g. Smith, 1985). For example, the equilibrium distribution of N is Poisson with parameter b = (this ratio is thus the expectation of N if the process is stationary), and the conditional distribution of the number of domestic-type clones given N is binomial with parameters N and pd . Since nite Markov processes are no longer involved, the recursive updating technology does not apply, and likelihood theory may be quite dicult. However, simulation of the process given xed parameters b , and pd is straightforward. To check if our conclusions about clonal succession are robust to assumptions on N , we performed a simulation of this birth-death process model. Figure 6 summarizes the results of a simulation matching the conditions producing data from cat 40004. The reserve proportion of domestics is xed at pd = :378 as estimated in Table 1. For all parameter pairs on a grid (as in Figure 2), the birth-death process N along with the domestic process X were simulated, starting at equilibrium, for one year of experiment time. To maintain a nondegenerate progenitor-cell compartment, we condition on N > 0. At the m = 17 sampling times equal to those used for 40004, the observation process was simulated conditionally on the unobserved state as in the bootstrap calculation of Section 6. Summary statistics (Section 3) were computed from each of the 500 simulated series and compared to the observed statistics for cat 40004. The results clearly indicate that in this varying-N model, neither large average N nor long mean lifetime T = 1= produce variability and autocorrelation consistent with the observed series, supporting the clonal succession hypothesis. The simulation was repeated for each experimental series with similar results.
8.2 Unequal contribution of clones If all stem cell clones contribute equally to the progenitor-cell compartment, then the observed number of domestic colonies has a binomial distribution with success probability pt = X (t)=N at time t. As suggested in Figure 1, such equal contributions may not be expected, and a more 13
adequate model might therefore allow extra-binomial variation in this conditional distribution. If the contributions from each clone are at least identically distributed, then E fpt jX (t); N g = X (t)=N , although it may be that var fpt jX (t); N g > 0. Signi cant positive autocorrelation was observed in four of the six experimental series (Section 3). In combination with the overdispersion, this autocorrelation provides critical information about the number and lifetime of clones in our simple model by transmitting some information about the covariance structure of X (t)=N . Consider observed counts Ys and Yt made from samples of equal size at times s and t. Having uneven contributions to the progenitor-cell compartment increases the marginal variance of Ys over the variance in our simple model. However, (at least if N is constant), the covariance between Ys and Yt is equal to that of the simple model in any model where ps and pt are conditionally independent given X . Therefore, the autocorrelation of observed series will tend to be smaller in a model where unequal contributions are allowed in comparison to our simple model where all clones contribute equally. The extra variation introduced by allowing unequal contributions smears out strong positive autocorrelation. Since signi cant positive correlation is present in the observed series, we do not expect our general conclusions about clonal succession to be aected by allowing unequal contributions of clones. Branching process models can be developed to explicitly represent the growth of clones (Guttorp et al. 1990). However, these models present further challenges to computation and inference, and thus will be pursued elsewhere.
8.3 Other points With existing data, it is dicult to verify that clones have independent and identically exponentially distributed lifetimes. This assumption is taken to make computation feasible, and it parsimoniously explains the uctuating proportions of domestic-type clones in the progenitor-cell compartment. From both a hematological and a statistical point of view, a better understanding of these dynamics will come from the unique identi cation clones. The binary G6PD marker does not allow this. Our calculations have assumed that hematopoiesis is a stationary process. Given the relatively short time series, there is little power to test for nonstationarity. We are continuing to observe several of the experimental cats, and this will allow us to assess more precisely the kinetics of hematopoiesis after marrow transplantation. For example, we should be able to test hypotheses regarding the self-replicating potential of stem cells, provided that the longer series exhibit slowly changing proportions of domestic cells. Rather than focus on the size of dierent compartments, an alternative approach is to describe transfer rates between the stem-cell compartment SC and the progenitor-cell compartment PC . Preliminary calculations show that the logit of pt , the domestic proportion in PC thus follows an autoregressive process. Parameters relate to transfer rates rather than clonal lifetime or the number 14
of clones. The model is attractive since it allows for changes within SC , a feature suggested by Jordan and Lemischka (1990). Computational methods for the simple hidden Markov model and its extensions require further study. For example, direct methods of model comparison remain unavailable. There may be many simple hidden Markov structures which present similar distributions for data, and hence identi ability becomes an important issue. Problems also arise in model tting when the hidden state space is not nite. Markov chain Monte Carlo methods may be useful, however the analysis in Newton, Guttorp, and Abkowitz (1992) suggests that this is in fact a challenging problem. Recent work on estimation in hidden Markov processes (Bickel and Ritov, 1993) indicates that a variant of pseudolikelihood may be a pro table alternative.
ACKNOWLEDGEMENTS We thank Sudipto Neogi and Birna Kristinsdottir of the University of Washington School of Engineering for assistance with the optimization problem. Suggestions from the editor and two referees have greatly improved the manuscript. In particular we thank Dr. Kenneth Lange for suggesting a better way to compute transition probabilities. Partial support from NIH grant HL31823 is gratefully acknowledged.
References
[1] Abkowitz, J. L., Ott, R. M., Holly, R. D., and Adamson, J. W. (1988), \Clonal evolution following chemotherapy-induced stem cell depletion in cats heterozygous for glucose-6phosphate dehydrogenase," Blood, 71, 1687{1692. [2] Abkowitz, J. L., Linenberger, M., Newton, M. A., Shelton, G. H., Ott, R. L., and Guttorp, P. (1990), \Evidence for maintenance of hematopoiesis in a large animal by the sequential activation of stem cell clones," Proc. Nat. Acad. Sci., 87, 9062{9066. [3] Abkowitz, J. L., Linenberger, M., Persik, M., Newton, M. A., and Guttorp, P. (1993), \Behavior of feline hematopoietic stem cells years after busulfan exposure," Blood, 82, 2096{ 2103. [4] Baum, L. E. (1972), \An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, 3, 1{8. [5] Besag, J., and Diggle, P. J. (1977), \Simple Monte Carlo tests for spatial pattern," Applied Statistics, 26, 327{333. 15
[6] Bickel, P. J., and Ritov, Y. (1993), \Inference in hidden Markov models I: Local asymptotic normality in the stationary case," submitted to Annals of Statistics. [7] Brecher, G., Beal, S. L., and Schneiderman, M. (1986), \Renewal and release of hemopoietic stem cells: Does clonal succession exist?" Blood Cells, 12, 103{112. [8] Brecher, G., Bookstein, N., Redfearn, W., Necas, E., Pallavicini, M. G., and Cronkite, E. P. (1993), \Self-renewal of the long-term repopulating stem cell," [9] Cox, D. R., and Hinkley, D. V. (1978), Problems and solutions in theoretical statistics, London: Chapman and Hall. [10] Devijver, P. A. (1985), \Baum's forward{backward algorithm revisited," Pattern Recognition Letters, 3, 369{373. [11] Efron, B. and Tibshirani, R. J. (1993), An introduction to the bootstrap, London: Chapman and Hall. [12] Golde, D. W. (1991), \The stem cell," Scienti c American, 265(6), 86{93. [13] Guttorp, P., Newton, M. A., Abkowitz, J. L. (1990), \A stochastic model for haematopoiesis in cats," IMA Journal of Mathematics Applied in Medicine and Biology, 7, 125{143. [14] Jordan, C. T., and Lemischka, I. R. (1990), \Clonal and systemic analysis of long-term hematopoiesis in the mouse," Genes & Development, 4, 220{232. [15] Kalb eisch, J. G. (1978), Probability and Statistical Inference II, Springer, New York. [16] Kay, H. G. M. (1965), \How many cell generations?" Lancet, 2, 418. [17] Lemischka, I. R. (1992), \The hematopoietic stem cell and its clonal progeny: mechanisms regulating the hierarchy of primitive haematopoietic cells," Cancer Surveys, 15, 3{18. [18] Nash, R., Storb, R., and Neiman, P. (1988), \Polyclonal reconstitution of human marrow after allogenic bone marrow transplantation," Blood, 45, 917-927. [19] Newton, M. A. and Geyer, C. J. (1994), \Bootstrap recycling: A Monte Carlo alternative to the nested bootstrap," J. Amer. Statist. Assoc., to appear. [20] Newton, M. A., Guttorp, P., and Abkowitz, J. L. (1992), \Bayesian inference by simulation in a stochastic model from hematology," Proceedings of the 24th Symposium on the Interface: Computing Science and Statistics. 16
[21] Ogawa, M. (1993), \Dierentiation and Proliferation of Hematopoietic Stem Cells," Blood, 81, 2844{2853. [22] O'Quigley, J. (1992), \Estimating the binomial parameter n on the basis of pairs of known and observed proportions," Applied Statistics, 41, 173{180. [23] Smith, W. A. (1985), \Birth-and-death processes," in Encyclopedia of statistical sciences, 6, S. Kotz and N. L. Johnson eds., John Wiley & Sons. [24] Spangrude, G. J., Smith, L., Uchida, N., Ikuta, K., Heimfeld, S., Friedman, J., and Weissman, I. L. (1991), \Mouse hematopoietic stem cells," Blood, 78, 1395{1402. [25] Wilks, S. S. (1938), \The large sample distribution of the likelihood ratio for testing composite hypotheses," Ann. Math. Statist. 9, 60{62. [26] Zabinsky, Z. B., Smith, R. L., McDonald, J. F., Romeijn, H. F., Kaufman, D., (1993). \Improving Hit-and-Run for Global Optimization," Journal of Global Optimization 3, 171{ 192.
17
Table 1: A summary of the data: The rst 6 rows correspond to data from cats having autologous marrow transplants, and the data record measurements more than 10 weeks after transplant. The last ve rows correspond to the control cats. The columns are: id number, number of sampling times, mean number of progenitor cells per sample, mean percentage of domestic-type progenitors per sample, mean intersampling time (weeks), the overdispersion statistic Q distributed 2m?1 if no overdispersion, the p-value corresponding to Q, the rst order autocorrelation r between the ? rescaled proportions, the p-value noting that r (m ? 2)=(1 ? r2 ) :5 is nominally tm?2 distributed under independence. id 40004 40005a 40005b 40006 40665 40823 63848 63132 63133 63134 65044
m mean ni mean % dom. mean dti
17 14 16 18 18 15 15 7 9 7 4
76.5 71.9 107.5 74.3 77.2 74.7 66.5 34.3 33.3 31.3 42.5
37.8 19.1 50.6 15.4 52.1 18.7 46.9 62.9 38.0 33.8 51.8
Q p-value
3.2 3.1 18.4 .14 2.0 393 < 10?9 2.3 36.3 .004 3.0 158.4 < 10?9 3.7 83.7 < 10?9 19.9 18.8 .17 23.7 9.3 .16 17.8 6.6 .58 23.7 7.4 .29 31.7 1.1 .78
18
r
p-value
.23 .61 .47 .47 .54 .37 .37 .28 .17 .15
.43 .013 .048 .047 .037 .17 .42 .46 .72 .85
74.3 2 10?9 .40 .12
Figure 1: An Idealization of the Progenitor-Cell Compartment: Each number corresponds to a progenitor cell. Cells with the same number are in the same clone. Boxes indicate cells having the domestic-type G6PD, and hexagons are cells having the Georoy-type G6PD.
time 1
time 2
1 1 1 1 1 1 1 1
1 1 1 1 11 1 1 1 11 11 1 1 1 1 1 1 1 1 1
2 2 2 22 2 2 2 2 2 2 2 2 2 2 2 222 2 22 2 2 2 2
3 3 3
3 3 3 3 3 3 3 33 3
2 22 2 2
19
44
150
150
Figure 2: Likelihood of N (horizontal axis) and 1= (vertical axis) for the Data from Six Series Fixing pd at its MLE (indicated after the id number). Contours are in units of log likelihood from the maximum. -1-2
-1
cat 40005a: .22
100
100
cat 40004: .39
-1-2 -3
50
50
-1
-2 -2 30
-1 -2 -3
40
10
150
20
150
10
-1
0
0
-3-2
20
-2 -3
-2
30
40
-2 -2
100
cat 40006: .16
50
50
100
cat 40005b: .61
-1
30
40
10
150
20
150
10
20
30
40
-1-2-3
100
cat 40823: .21
50
50
100
cat 40665: .46
-2 -1
-3
0
0
-1 -2-3
-3
-2
-2 -3
-3
0
-2
0
-1
10
20
30
40
20
10
20
30
40
Figure 3: Hidden States. The lower right panel shows a reconstruction of the hidden state for cat 40005b. The remaining panels show the rst three state vectors simulated from the tted model in our bootstrap analysis. 1.0
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
10
15
20
25
30
35
40
1.0
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
10
15
20
25
30
35
40
21
10
15
20
25
30
35
40
10
15
20
25
30
35
40
Figure 4: Observation Process. The lower right panel shows the observed proportions for cat 40005b. The remaining panels show simulated bootstrap data sets. 1.0
1.0
0.8
0.8
0.6
*
*
* * *
* *
*
0.6
** * *
* *
0.4
*
*
*
* * *
*
** *
0.4
* * 0.2
*
0.2
*
* *
* 0.0
0.0
* 10
15
20
25
30
35
40
1.0
15
20
25
30
35
40
1.0
0.8
*
0.6 0.4
10
*
0.8
**
* *
*
* *
* *
*
* * *
0.2
0.0
0.0
10
*
0.4
15
*
* * *
*
* *
*
*
0.2
**
0.6
*
20
25
30
35
40
22
* * *
10
15
*
20
25
30
35
40
80
100
Figure 5: Estimated Proliferative Potential Curve and Individual Estimates for the Six Experimental Animals.
20
MLE cell lifetimes (L) 40 60
005a
unrestricted mle restricted mle 823
005a 665, 823 665 005b 005b 004
006 006
0
004
10
20 30 MLE active clones (N)
23
40
150
Figure 6: Simulation Results for Varying-N Model. In both panels, the vertical axis is expected clonal lifetime T = 1= (weeks) and the horizontal axis is expected number of clones b =. Plotted at each grid point is the proportion of the 500 simulated series in which the simulated summary statistic exceeds the observed summary statistic for cat 40004. The upper panel corresponds to the overdispersion statistic Q (Section 3) and the lower panel corresponds to the lag-1 autocorrelation of the normalized counts Zi . Contour lines are at probabilities :2 and :4. 1.0 0.8
100
0.6 0.4 0.2
50
0.0 grey scale
0
10
20
30
40
0.4
100
150
0.2
0.4
0
50
0.2
0.2
10
20
30
24
40