
A covariate-based coefficient of source dependence for capture-recapture models

Alain C. Vandal

Clinical Trials Research Unit and Department of Statistics, University of Auckland

John Pearson

Department of Statistics, University of Auckland

Natalie Walker

Clinical Trials Research Unit, University of Auckland

February 2000 - Final draft

Abstract

In capture-recapture modelling, it is necessary to account for possible dependence between sources. Along with nominal information, epidemiological capture-recapture studies often involve the collection of a variety of individual covariates. Under the assumption that we can estimate the covariate distribution in the population of interest and that sources are conditionally independent given the covariates, we can estimate a coefficient of source dependence (CSD) for every set of sources. The CSD estimates are based on the estimated conditional covariate distribution given the sources. Such CSD estimates can be used as guides to fit joint log-linear models and can also be introduced in a marginal log-linear model to produce a conditional population estimate. We illustrate these applications on data from the Auckland Leg Ulcer Study, 1997-1998.

Keywords: capture-recapture, source dependence, conditional independence, log-linear models, marginal models, leg ulcers

1 Introduction

Capture-recapture modelling aims at estimating a population size using several possibly incomplete lists of population members. It originated in animal abundance estimation, but is now increasingly used in epidemiology to estimate disease prevalence and incidence. Several issues in capture-recapture modelling are of special importance in epidemiological research. Heterogeneity of capture probabilities, population openness, misclassification and reporting failure ("tag loss") are some of the more critical among these. Our purpose here is not to address these concerns but to propose a possible solution to a problem raised by Regal & Hook (1998), namely the need to "give more attention to modelling the underlying processes that give rise to the source interrelationships".

Assessment of source dependence lies at the heart of model validation in capture-recapture studies. The method we propose does not deal with the causal and structural aspects of source dependencies, but instead assumes that these aspects can be subsumed in the characteristics of individual population members. As a result, we obtain an original method to account for individual covariates in capture-recapture analysis. A different approach to covariate use in capture-recapture was recently proposed by Tilling & Stern (1999).

Specifically, this article proposes a method to quantify source dependence using the estimated covariate distribution of listed individuals. This estimate can be used for model building and exploratory analysis related to regression model selection. We consider some natural applications of source dependence estimates in log-linear modelling, and illustrate these applications using the data from the Auckland Leg Ulcer Study (ALUS), 1997-1998.

Section 2 describes the setting and purpose of capture-recapture studies as well as the particulars of the ALUS data. Section 3 introduces the coefficient of source dependence and proposes a covariate-based estimate of this coefficient. Section 4 describes log-linear modelling of capture-recapture data, demonstrates model validation techniques based on prior knowledge of source dependence and presents a marginal log-linear model based on the coefficients of source dependence. Section 5 concludes with a short discussion.

2 Capture-recapture studies

2.1 Setting and notation

In an epidemiological context, the various "captures" of capture-recapture studies take the form of lists of individuals belonging to a well-defined population. Such lists usually result from attempts to enumerate the population or a well-defined subgroup of the population. Examples of sources for such lists are surveillance registries, civil records, active census by a group of individuals or self-notification.

In this paper, the term list refers to a set of individuals observed through some process, while source refers to a random process which yields a list. Thus, the relationship between source and list is similar to the relationship between random variable and observed value. We will denote both lists and sources by upright uppercase characters, relying on context to allay any confusion. The source symbol will also represent the event that an individual randomly chosen from the population of interest is captured by the source. Thus we can use the concept of source probability, denoted by $p_L = \mathrm{Prob}[L]$ for source L.

We will use nonstandard notation to express observed counts in a manner that is amenable to modelling both mutually exclusive cell counts and marginal cell totals in a contingency table. We let $U$ be the set of available sources, with $k = |U|$. If $Q, R \subseteq U$, we will write

$$ n_{Q,R} = \Bigl|\, \bigcap_{L \in Q} L \;\cap\; \bigcap_{L \in R} \bar{L} \,\Bigr| \tag{2.1} $$

to denote the observed number of individuals simultaneously belonging to the lists in $Q$ while not belonging to lists in $R$. Most often we will write such a quantity using the list or source symbols themselves, e.g. $n_{AB\bar{C}} = |A \cap B \cap \bar{C}|$. The symbol $m_{Q,R}$ represents the expected value of $n_{Q,R}$. We also define $N = m_{\emptyset}$ to be the unknown total population size, and $n = m_{\bar{U}}$ to be the unknown number of individuals not appearing on any list. Finally, we write

$$ p_{Q,R} = \mathrm{Prob}\Bigl[\, \bigcap_{L \in Q} L \;\cap\; \bigcap_{L \in R} \bar{L} \,\Bigr] = \frac{m_{Q,R}}{N}. \tag{2.2} $$
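The count in (2.1) can be sketched with ordinary set operations. The lists below are hypothetical identifiers, not ALUS data, and the helper name `n_qr` is ours; the sketch assumes Q is non-empty, since the complement of a list is only defined relative to the (unknown) population.

```python
from functools import reduce

def n_qr(lists, Q, R):
    """Number of individuals on every list in Q and on no list in R (Q non-empty)."""
    on_all_q = reduce(set.intersection, (lists[name] for name in Q))
    on_any_r = set().union(*(lists[name] for name in R))
    return len(on_all_q - on_any_r)

# Hypothetical lists of individual identifiers (illustration only).
lists = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 6},
}

n_ab_not_c = n_qr(lists, Q=["A", "B"], R=["C"])  # cell count: on A and B, not on C
n_a = n_qr(lists, Q=["A"], R=[])                 # marginal total for list A
print(n_ab_not_c, n_a)
```

Taking R empty gives a marginal total, while taking R equal to all sources outside Q gives a mutually exclusive cell count; this is the double use the notation is designed for.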

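Table 1 displays the notation for the observed counts of a four-source setting. In this notation, observed marginal totals are given by, e.g., $n_{ABD} = n_{ABCD} + n_{AB\bar{C}D}$ and $n_{BC} = n_{ABCD} + n_{ABC\bar{D}} + n_{\bar{A}BCD} + n_{\bar{A}BC\bar{D}}$.

| | | A Yes, B Yes | A Yes, B No | A No, B Yes | A No, B No |
|---|---|---|---|---|---|
| C Yes | D Yes | $n_{ABCD}$ | $n_{A\bar{B}CD}$ | $n_{\bar{A}BCD}$ | $n_{\bar{A}\bar{B}CD}$ |
| C Yes | D No | $n_{ABC\bar{D}}$ | $n_{A\bar{B}C\bar{D}}$ | $n_{\bar{A}BC\bar{D}}$ | $n_{\bar{A}\bar{B}C\bar{D}}$ |
| C No | D Yes | $n_{AB\bar{C}D}$ | $n_{A\bar{B}\bar{C}D}$ | $n_{\bar{A}B\bar{C}D}$ | $n_{\bar{A}\bar{B}\bar{C}D}$ |
| C No | D No | $n_{AB\bar{C}\bar{D}}$ | $n_{A\bar{B}\bar{C}\bar{D}}$ | $n_{\bar{A}B\bar{C}\bar{D}}$ | $n_{\bar{A}\bar{B}\bar{C}\bar{D}}$ |

Table 1: A contingency table for a four-source capture-recapture study. All cells are observed except for the bottom right cell, which is the number of individuals appearing on none of the lists.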

2.2 The Auckland Leg Ulcer Study (ALUS)

To illustrate our proposed capture-recapture technique we will discuss its application to the Auckland Leg Ulcer Study (ALUS). This was a community-based cross-sectional study aimed at identifying all individuals who had or who developed a leg ulcer in the New Zealand North and Central Auckland health districts, over the 12-month period from 1 November 1997 to 1 November 1998. A leg ulcer is generally considered to be any break in the skin on the lower leg (below the knee) or on the foot, which has been present for more than six weeks. Typically, the condition is a consequence of disease of the circulatory system, and can cause considerable disability. Leg ulcers are more common in people aged 65 years and over, usually take more than six months to heal and have a high rate of recurrence.

The period-prevalence of a condition is of interest in the study of the epidemiology of that condition. This is defined as the total number of existing and new cases found during the study period, divided by the total population in the study area at that time. For the ALUS, a case consists of a person with a leg ulcer, and does not relate to the number of ulcers developed during the period.

Cases were identified through notifications from health professionals participating in the study and by self-notification. Of the 426 health professional practices approached, 398 agreed to participate in the study. Notifications for cases were divided into four main sources: general practitioners (G), district nurses (D), self-notified cases (S) and others (O). The self-notification source pathway consisted of advertisements placed in free community newspapers and posters placed at pharmacies, supermarkets and in the waiting rooms of non-participating health professionals. A 24-hour, free telephone number was available for people wishing to enrol in the study. The O source included podiatrists, rest home and retirement village staff, medical specialists and hospital staff.

During the study, a total of 423 individuals with current leg ulcers were identified, twenty by more than one source. Covariate information, including gender, age and quarter of first notification, was recorded for each individual. Considerable variation between sources was noticed in the distributions of these covariates. An example of this variability is shown in Figure 1; in particular, the age distribution differs markedly across sources in the lowest and highest age groups. Quarter of first notification and gender display similar differences. Such differences in covariate distribution, it was thought, should provide a way to quantify source dependence.

| | | G Yes, D Yes | G Yes, D No | G No, D Yes | G No, D No |
|---|---|---|---|---|---|
| S Yes | O Yes | 0 | 0 | 0 | 1 |
| S Yes | O No | 0 | 1 | 7 | 112 |
| S No | O Yes | 0 | 0 | 5 | 42 |
| S No | O No | 6 | 69 | 180 | - |

Table 2: Observed one-year period-prevalences of leg ulcer in the Auckland Leg Ulcer Study.
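As a check on the notation of Section 2.1, the marginal list totals can be recovered from the mutually exclusive cells of Table 2. The sketch below keys each non-zero cell by the set of lists on which the individuals appear; the dictionary layout and the helper name `marginal` are our own choices.

```python
# Non-zero observed cells of Table 2, keyed by the exact set of notifying sources.
cells = {
    frozenset("S"): 112,
    frozenset("D"): 180,
    frozenset("G"): 69,
    frozenset("O"): 42,
    frozenset("DS"): 7,
    frozenset("GD"): 6,
    frozenset("DO"): 5,
    frozenset("GS"): 1,
    frozenset("OS"): 1,
}

def marginal(sources):
    """Marginal total n_Q: individuals appearing on at least the lists in `sources`."""
    wanted = set(sources)
    return sum(count for key, count in cells.items() if wanted <= key)

total_observed = sum(cells.values())
multi_source = sum(count for key, count in cells.items() if len(key) > 1)

for name in "GDOS":
    print(name, marginal(name))
print("observed:", total_observed, "on more than one list:", multi_source)
```

The totals agree with the 423 observed individuals, twenty of them notified by more than one source, and with the observed list sizes reported later alongside the marginal model.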

3 Source dependence

3.1 A covariate-based coefficient of source dependence

For illustration purposes, we assume initially that only two sources A and B are available. We define the coefficient of source dependence (CSD) between A and B by

$$ C_{AB} = \frac{\mathrm{Prob}[A \cap B]}{\mathrm{Prob}[A]\,\mathrm{Prob}[B]} = \frac{p_{AB}}{p_A\, p_B}. \tag{3.3} $$

[Figure 1: Observed age distribution among the ALUS lists (proportion of cases by age group, for lists G, D, O and S and for the total observed).]

When the denominator is non-zero, A and B are independent if and only if $C_{AB} = 1$. Using (2.2), we can rewrite (3.3):

$$ C_{AB} = \frac{m_{AB}}{m_A\, m_B}\, N. \tag{3.4} $$

We can use extra information concerning list members to determine source dependence. Covariate information is usually collected along with identifying information for the individuals appearing on the list. We let $Z$ represent a possibly multidimensional covariate, which we will assume discrete for convenience. In practice, this covariate vector will often consist of such data as gender, age group, region of residence, etc. We then make the crucial assumption of conditional source independence given the covariate value; for the two-source situation, this assumption amounts to

$$ \mathrm{Prob}[A \cap B \mid Z = z] = \mathrm{Prob}[A \mid Z = z]\;\mathrm{Prob}[B \mid Z = z]. \tag{3.5} $$

Although this form is useful from an algebraic point of view, it is more easily interpreted if rewritten as $\mathrm{Prob}[A \mid B, Z = z] = \mathrm{Prob}[A \mid Z = z]$. Under (covariate-)conditional source independence, sources can be dependent but the covariate values ultimately drive list membership. Conditional source independence is further discussed in §5. Under assumption (3.5), Bayes' theorem can be used to show that

$$ C_{AB} = \sum_{\text{all } z} \frac{\mathrm{Prob}[Z = z \mid A]\;\mathrm{Prob}[Z = z \mid B]}{\mathrm{Prob}[Z = z]}. $$
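The equivalence between definition (3.3) and the covariate form above can be checked numerically. The sketch below uses a hypothetical binary covariate with made-up capture probabilities; none of the numbers come from the ALUS.

```python
# Hypothetical two-level covariate and conditional capture probabilities.
p_z = [0.6, 0.4]            # Prob[Z = z]
p_a_given_z = [0.1, 0.5]    # Prob[A | Z = z]
p_b_given_z = [0.2, 0.6]    # Prob[B | Z = z]

# Marginal probabilities, using conditional independence (3.5) for the joint term.
p_a = sum(pa * pz for pa, pz in zip(p_a_given_z, p_z))
p_b = sum(pb * pz for pb, pz in zip(p_b_given_z, p_z))
p_ab = sum(pa * pb * pz for pa, pb, pz in zip(p_a_given_z, p_b_given_z, p_z))

# Definition (3.3).
csd_definition = p_ab / (p_a * p_b)

# Covariate form: sum over z of Prob[Z=z|A] Prob[Z=z|B] / Prob[Z=z].
p_z_given_a = [pa * pz / p_a for pa, pz in zip(p_a_given_z, p_z)]
p_z_given_b = [pb * pz / p_b for pb, pz in zip(p_b_given_z, p_z)]
csd_covariate = sum(
    za * zb / pz for za, zb, pz in zip(p_z_given_a, p_z_given_b, p_z)
)

print(csd_definition, csd_covariate)
```

Both expressions give the same value, which exceeds 1 here: the two sources are positively associated marginally even though they are independent within each covariate level.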

Similarly to (3.3), the coefficient of source dependence for $r$ sources $A_1, \ldots, A_r$ is defined by

$$ C_{A_1 \ldots A_r} = \frac{p_{A_1 \ldots A_r}}{\prod_{i=1}^{r} p_{A_i}} = \frac{m_{A_1 \ldots A_r}}{\prod_{i=1}^{r} m_{A_i}}\, N^{r-1}. \tag{3.6} $$

The assumption of conditional independence can also be generalized to $r$ sources:

$$ \mathrm{Prob}\Bigl[\, \bigcap_{i=1}^{r} A_i \Bigm| Z = z \Bigr] = \prod_{i=1}^{r} \mathrm{Prob}[A_i \mid Z = z]. \tag{3.7} $$

Assumption (3.7) is in fact weaker than true conditional independence, as it is not assumed that the same equation holds for any proper subset of the $r$ sources. Under (3.7), the CSD has the alternative form:

$$ C_{A_1 \ldots A_r} = \sum_{\text{all } z} \frac{\prod_{i=1}^{r} \mathrm{Prob}[Z = z \mid A_i]}{\mathrm{Prob}[Z = z]^{\,r-1}}. \tag{3.8} $$

See Appendix A for a derivation. If $|Q| = r = 1$, we apply (3.8) to define $C_Q = 1$. The quantity $\mathrm{Prob}[Z = z \mid A_i]$ is easily estimable from the data. We write $n_L(z)$ for the number of individuals with $Z = z$ in list L and $n(z)$ for the number of individuals with $Z = z$ amongst observed individuals. Then

$$ \widehat{\mathrm{Prob}}[Z = z \mid A_i] = \frac{n_{A_i}(z)}{n_{A_i}} \tag{3.9} $$

is a natural estimate. The situation is different with $\mathrm{Prob}[Z = z]$, which denotes the covariate distribution over the population of interest, since the observed population is truncated. In general, the estimator

$$ \widehat{\mathrm{Prob}}[Z = z] = \frac{n(z)}{N - n} \tag{3.10} $$

will be biased, with the denominator simply denoting the total number of individuals observed in any list. However, there may be reasons to believe that the bias is small, for instance if the unobserved segment of the population is relatively small. It may also be possible to produce a list-weighted estimator using external information such as census data. We return to this topic in §5.

For a set $Q$ of sources, the CSD estimator $\tilde{C}_Q$ is formed by substituting the estimators (3.9) and (3.10) for their corresponding estimands in (3.8), yielding

$$ \tilde{C}_Q = \frac{(N - n)^{|Q|-1}}{\prod_{L \in Q} n_L} \sum_{\text{all } z} \frac{\prod_{L \in Q} n_L(z)}{n(z)^{|Q|-1}}. \tag{3.11} $$

Although an approximate distribution can be derived for $\tilde{C}_Q$ under reasonable assumptions on the sources, all discussions in the sequel will be held conditionally on the observed values of $\tilde{C}_Q$. We wish in so doing to avoid mathematical detail and to reserve discussions of covariate-based CSD variability for later study.

Similarly to a log-odds ratio, the log-CSD is often more easily interpretable. A log-CSD value of 0 indicates independence of the sources, while positive and negative values represent positive and negative source associations, respectively.
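A minimal sketch of the estimator (3.11) follows. The function name, the data layout and the counts are all hypothetical; in particular the sketch assumes that the covariate counts of the observed individuals, $n(z)$, have already been tabulated with each individual counted once, and that the exponent $|Q| - 1$ applies to both $N - n$ and $n(z)$, as substitution of (3.9) and (3.10) into (3.8) gives.

```python
from math import log, prod

def csd_estimate(list_counts, observed_counts):
    """Covariate-based CSD estimate for the lists in `list_counts`.

    list_counts:     {list name: {covariate value z: n_L(z)}}
    observed_counts: {covariate value z: n(z)}, each observed individual counted once
    """
    r = len(list_counts)
    n_observed = sum(observed_counts.values())  # N - n in the text
    list_totals = prod(sum(counts.values()) for counts in list_counts.values())
    total = 0.0
    for z, n_z in observed_counts.items():
        if n_z == 0:
            continue  # no observed individual has this covariate value
        joint = prod(counts.get(z, 0) for counts in list_counts.values())
        total += joint / n_z ** (r - 1)
    return n_observed ** (r - 1) / list_totals * total

# Hypothetical counts by age band (illustration only, not ALUS data).
observed = {"young": 50, "old": 50}
list_a = {"young": 30, "old": 10}
list_b = {"young": 5, "old": 20}

c_ab = csd_estimate({"A": list_a, "B": list_b}, observed)
print(c_ab, log(c_ab))  # a negative log-CSD indicates negative association
```

In this toy example list A over-represents the young and list B the old, so the estimate falls below 1. With a single list the function returns 1, matching the convention $C_Q = 1$ for $|Q| = 1$.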

Covariate-based CSD estimates for the ALUS data are presented in Table 3. In this case, the three-dimensional covariate $Z$ consisted, for each case, of quarter of first notification, of age group at first capture (10 age groups: less than 35 y-o, greater than 85 y-o, and eight 5-year age groups in between) and of gender. Partial justification for this choice is found in §5. A full justification of the choice of covariates to include in CSD estimate computation and of the nature of the discretization involved will require a description of the distributional properties of the estimate.

| | | G Yes, D Yes | G Yes, D No | G No, D Yes | G No, D No |
|---|---|---|---|---|---|
| S Yes | O Yes | -1.061 | -0.901 | -0.969 | -0.841 |
| S Yes | O No | -1.229 | -0.943 | -1.055 | 0 |
| S No | O Yes | 0.322 | 0.031 | 0.287 | 0 |
| S No | O No | 0.160 | 0 | 0 | - |

Table 3: Covariate-based log-CSD estimates for the ALUS data, based on first quarter of notification, age group at first capture and gender. The zero-valued log-CSD estimates are structural.

3.2 A simple application of CSD estimates

To illustrate the structure of the CSDs, we describe how an estimate of total population size can be obtained from them. In the case of two sources, (3.4) yields the natural population estimate

$$ \tilde{N} = \frac{n_A\, n_B}{n_{AB}}\, \tilde{C}_{AB}, \tag{3.12} $$

which consists in a simple adjustment of the Petersen estimate (Seber, 1982, Ch. 3). The adjustment attempts to account for the dependence between the sources. We can adapt a result of Sekar & Deming (1949) to show that, conditionally on $\tilde{C}_{AB}$, $\tilde{N}$ has variance estimate

$$ \widehat{\mathrm{Var}}(\tilde{N}) = \frac{n_A\, n_B\, n_{A\bar{B}}\, n_{\bar{A}B}}{n_{AB}^{3}}\; \tilde{C}_{AB}^{\,2}. $$

When more than two sources, and therefore more than one intersection of sources, are available, the above technique does not lend itself as well to estimation. Attempting to combine estimates from several combinations of lists leads naturally to the consideration of marginal log-linear models. We return to these in §4.4.
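The arithmetic of (3.12) and its conditional variance can be sketched as follows. As an illustration only, the sketch plugs in the D and O marginal totals from Table 2 together with the D:O log-CSD estimate from Table 3; the paper does not report this two-source estimate, and the function name is ours.

```python
from math import exp, sqrt

def adjusted_petersen(n_a, n_b, n_ab, csd):
    """CSD-adjusted Petersen estimate and its standard error, conditional on `csd`."""
    estimate = n_a * n_b / n_ab * csd
    variance = n_a * n_b * (n_a - n_ab) * (n_b - n_ab) / n_ab ** 3 * csd ** 2
    return estimate, sqrt(variance)

# Marginal totals for D and O (Table 2) and covariate-based log-CSD for D:O (Table 3).
n_d, n_o, n_do = 198, 48, 5
csd_do = exp(0.287)

petersen, _ = adjusted_petersen(n_d, n_o, n_do, csd=1.0)   # unadjusted Petersen
adjusted, se = adjusted_petersen(n_d, n_o, n_do, csd=csd_do)
print(round(petersen, 1), round(adjusted, 1), round(se, 1))
```

With only five individuals common to both lists, the standard error is large relative to the estimate, which illustrates why the two-source adjustment is presented as a device for understanding the CSD rather than as the recommended estimator.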

4 Log-linear models and CSDs

4.1 Joint log-linear models

Log-linear models can be used to estimate the log-expected values of cells in an incomplete table such as Table 1 by allowing for estimation of the dependence between cells. Cormack (1989) provides a discussion of modifications to log-linear models which allow some assumptions to be relaxed. We do not touch on these topics here.

We briefly illustrate the construction of a log-linear model of the cell counts in a contingency table such as Table 2. These are the most common models for contingency tables and are called joint models as they attempt to model the joint distribution of the cell values. Under the assumption of complete independence between all sources in Table 1, we can fit a model containing only the main effects due to the sources:

$$ \log m_{Q,R} = \lambda + \sum_{L \in Q} \lambda_L \tag{4.13} $$

for each $Q \subseteq U$ with $Q \neq \emptyset$ and $R = U \setminus Q$. Using the ALUS setup from Table 2 as an example, the following are three of the fifteen equations which compose the model:

$$ \begin{aligned} \log m_{G\bar{D}\bar{O}\bar{S}} &= \lambda + \lambda_G \\ \log m_{G\bar{D}\bar{O}S} &= \lambda + \lambda_G + \lambda_S \\ \log m_{GDOS} &= \lambda + \lambda_G + \lambda_D + \lambda_O + \lambda_S \end{aligned} $$

The unobserved bottom right cell has expectation $n = m_{\bar{G}\bar{D}\bar{O}\bar{S}} = \exp(\lambda)$. The corresponding prediction for $n$ is therefore $\exp(\hat{\lambda})$; approximate standard errors for the prediction $\hat{\lambda}$ are provided by likelihood theory.

An interaction term must be fitted for each set of sources assumed to be dependent. For example, to represent dependence between sources G, O and S in ALUS, we would need to modify as follows these four terms in the independence model:

$$ \begin{aligned} \log m_{G\bar{D}O\bar{S}} &= \lambda + \lambda_G + \lambda_O + \lambda_{GO} \\ \log m_{G\bar{D}\bar{O}S} &= \lambda + \lambda_G + \lambda_S + \lambda_{GS} \\ \log m_{\bar{G}\bar{D}OS} &= \lambda + \lambda_O + \lambda_S + \lambda_{OS} \\ \log m_{G\bar{D}OS} &= \lambda + \lambda_G + \lambda_O + \lambda_S + \lambda_{GO} + \lambda_{GS} + \lambda_{OS} + \lambda_{GOS}. \end{aligned} $$

All other terms remain as before. High order interactions can only be fitted along with all corresponding lower order ones to satisfy the definition of dependence for several sources. The estimate of $n$ remains $\exp(\hat{\lambda})$.

A large number of joint models are usually fitted on the basis of various source dependency assumptions. Then either a particular model is chosen according to some goodness of fit criterion such as AIC (Hook & Regal, 1997), or the unknown cell size estimates produced by the various models are averaged according to some weighting scheme (Madigan & York, 1997; Buckland et al., 1997). Selection and averaging not only create special problems in error assessment, but can also become impractical as the number of sources grows. The Auckland Stroke Study (Bonita et al., 1995), for instance, effected a stroke incidence survey involving 8 sources: over $10^{21}$ models can be fitted to such data, with more than a quarter of a billion models involving only two-way interactions.

Another feature of joint log-linear modelling is that observed zero counts determine model availability. This occurs because the log-likelihood will not attain its supremum if interactions corresponding to zero-valued cells or marginal totals are included in the models. Christensen (1997, Ch. 8) suggests a way of dealing with the problem of such "random zeros" that is conditional on the observed ones. To avoid conditional inference, however, we must avoid such interaction terms. In the case of the ALUS data, the cells corresponding to third- and fourth-order interactions are zero, as well as the cell corresponding to the G:O interaction. We therefore only consider arrangements of the five remaining second-order interactions, for a total of 32 models.

If we accept the covariate-based CSD estimates $\tilde{C}_Q$ as good reflections of the source dependence structure, we can use them to assess a given fitted log-linear model. In the rest of this section, we consider two log-linear model evaluation techniques based on covariate-based CSDs. We also formulate a marginal log-linear model where covariate-based CSDs can be used directly to estimate the total underlying population.
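For the independence model (4.13), the fit can be sketched without a general-purpose GLM routine. Under the main-effects model the likelihood equations match each fitted list total to its observed value and the fitted observed total to the number of individuals seen, which reduces to a single equation in the total: the expected number of individuals on at least one list must equal the observed count. This reduction is our own shortcut for the sketch, not the fitting procedure described in the paper; the bisection solver and function name are likewise assumptions.

```python
from math import log, prod

def independence_total(list_totals, n_observed, upper=1e7, tol=1e-6):
    """Total population size under full independence of the sources."""
    def excess(total):
        # Expected number seen on at least one list, minus the number actually seen.
        unseen = total * prod(1 - n_l / total for n_l in list_totals)
        return total - unseen - n_observed

    lo, hi = float(max(max(list_totals), n_observed)), upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# ALUS list totals (G, D, O, S) and number of distinct individuals observed.
n_hat = independence_total([76, 198, 48, 121], 423)
unobserved = n_hat - 423
print(round(n_hat), round(log(unobserved), 2))
```

The result is close to 3300, in line with the independence-model entry of Table 4, where estimates are reported to the nearest 50.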

4.2 Interaction selection using the CSDs

We can explore the suitability of including some interaction terms in a joint log-linear model using CSD estimates. We use the ALUS data for illustrative purposes. At first glance, Table 3 shows that the S source differs markedly from the other three, and seems to underlie the negative associations appearing in the table. Interaction terms involving S should therefore be expected to appear in an appropriate model. The near independence between G and O indicates that an interaction term for these sources is in fact not necessary, making its omission for numerical reasons from the models considered more justifiable.

In selecting models, we can also make use of the following fact: if $Q$ and $R$ are sets of sources, then $C_Q = C_{Q \cup R}$ if and only if sources in $Q$ are independent of sources in $R$. Moreover, $\log C_{Q \cup R} - \log C_Q$ measures approximately the relative change from $C_{Q \cup R}$ to $C_Q$ when that change is small. If the relative change is small, the assumption of independence between the sources in $Q$ and the sources in $R$ is plausible, and the inclusion of a $Q \cup R$ interaction term in addition to a $Q$ interaction term is unnecessary.

These rules are useful to compare $k$th-order with $(k-1)$th-order interactions. For instance, we might wonder how well the strongest interaction, that between G, D and S, can be modelled using only second-order interactions. The approximate relative change between $C_{GDS}$ and $C_{GS}$ is $-1.229 - (-0.943) = -0.286$ or -29%, and between $C_{GDS}$ and $C_{DS}$ is $-1.229 - (-1.055) = -0.174$ or -17%. Thus the G:D:S interaction is moderately well represented by second-order interactions. Similarly, the approximate relative change between $C_{GDO}$ (corresponding to the strongest positive interaction) and $C_{DO}$ is only 3.5%, indicating that second-order interaction accounts well for the third-order one.

These considerations can guide us in model selection, albeit in no more than an exploratory way. For example, we should consider more closely models containing a D:O and either G:S or D:S interaction terms, as these would account for the strongest positive and negative interactions detected.
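The log-CSD differences used in this section can be recomputed from Table 3, and compared with the exact relative change $C_{Q \cup R} / C_Q - 1$ that they approximate. The sketch is ours; only the log-CSD values come from Table 3.

```python
from math import exp

# Covariate-based log-CSD estimates taken from Table 3.
log_csd = {"GS": -0.943, "DS": -1.055, "GDS": -1.229, "DO": 0.287, "GDO": 0.322}

def changes(larger, smaller):
    """Log-CSD difference and the exact relative change it approximates."""
    diff = log_csd[larger] - log_csd[smaller]
    return diff, exp(diff) - 1

for larger, smaller in [("GDS", "GS"), ("GDS", "DS"), ("GDO", "DO")]:
    diff, exact = changes(larger, smaller)
    print(f"{larger} vs {smaller}: log difference {diff:+.3f}, exact change {exact:+.1%}")
```

The approximation is tight for the small D:O comparison and looser for the larger differences, where the exact relative changes are somewhat smaller in magnitude than the log differences suggest.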

4.3 Model validation using CSDs

A second technique involves the comparison of model-based and covariate-based CSD estimates. We will use the CSDs to construct an alternative to AIC as a model criterion. Under a log-linear model and with $|Q| > 1$, we can produce maximum likelihood log-CSD estimates:

$$ \log \hat{C}_Q = \log \hat{m}_Q + (|Q| - 1) \log \hat{N} - \sum_{L \in Q} \log \hat{m}_L \tag{4.14} $$

where the $\hat{m}_L$ are appropriate marginal sums of estimated cell counts and $\hat{N}$ is the fitted marginal table total. A two-source setting, for example, would yield the log-CSD maximum likelihood estimates

$$ \begin{aligned} \log \hat{C}_{AB} &= \log \hat{m}_{AB} + \log \hat{N} - \log \hat{m}_A - \log \hat{m}_B \\ &= \log \hat{m}_{AB} + \log\bigl(\hat{m}_{AB} + \hat{m}_{A\bar{B}} + \hat{m}_{\bar{A}B} + \hat{m}_{\bar{A}\bar{B}}\bigr) - \log\bigl(\hat{m}_{AB} + \hat{m}_{A\bar{B}}\bigr) - \log\bigl(\hat{m}_{AB} + \hat{m}_{\bar{A}B}\bigr). \end{aligned} \tag{4.15} $$

A model selection criterion can then take the form of a distance measure between model-based and covariate-based CSD estimates. Here, we chose the distance function

$$ d = \sum_{Q \subseteq U} \bigl( \log \tilde{C}_Q - \log \hat{C}_Q \bigr)^2. $$

A better choice of distance could be motivated by a study of the distributional properties of the $\tilde{C}_Q$.

Table 4 displays these distances for selected models. The AIC criterion favours the full independence model while the CSD distance criterion favours the model including all available two-way interactions except G:D. In this case, the 10% difference between the two point estimates is much smaller than the standard errors of either estimate, which run to roughly 20% for the independence model and roughly 35% for the CSD distance-optimal model.

Figure 2 shows the greater specificity of CSD distance as compared with AIC with respect to the estimated total population, in the sense that aberrantly large population estimates are more easily identified using CSD distance than AIC. This phenomenon is also apparent in Table 4 (consider models 3, 5, 10 and 13), and should apply to small population values as well; there is of course more leeway in which to err above than below the true population value $N$. The greater specificity of the CSD distance suggests investigating it as a weight for model averaging as described in Buckland et al. (1997).

Conversely, we note that the fitted models with low AIC also correspond to low CSD distances. Joint modelling thus provides independent evidence that the covariate-based CSD estimates are plausible.
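The model-based log-CSDs (4.14) and the distance $d$ can be sketched for the independence model, for which every model-based log-CSD is zero. The covariate-based values are the rounded ones printed at the top of Table 4; the fitted total of 3296 is an assumed value close to the independence fit (Table 4 reports 3300 to the nearest 50), and the function and variable names are ours.

```python
from itertools import combinations, product
from math import log, prod

SOURCES = ("G", "D", "O", "S")

def model_log_csd(cells, sources=SOURCES):
    """Log-CSDs implied by fitted cell means, keyed by e.g. 'GD' or 'GDOS'.

    cells: {tuple of 0/1 flags per source: fitted mean}, including the all-zero cell.
    """
    total = sum(cells.values())

    def marginal(subset):
        idx = [sources.index(s) for s in subset]
        return sum(m for key, m in cells.items() if all(key[i] for i in idx))

    result = {}
    for r in range(2, len(sources) + 1):
        for subset in combinations(sources, r):
            result["".join(subset)] = (
                log(marginal(subset))
                + (r - 1) * log(total)
                - sum(log(marginal((s,))) for s in subset)
            )
    return result

# Fitted cells under independence; N_FIT is an assumed value near the model 1 fit.
N_FIT = 3296.0
p = [n_l / N_FIT for n_l in (76, 198, 48, 121)]
independence_cells = {
    key: N_FIT * prod(p_i if on else 1 - p_i for p_i, on in zip(p, key))
    for key in product((0, 1), repeat=len(SOURCES))
}

# Covariate-based log-CSDs as printed (rounded) in the header of Table 4.
cov_log_csd = {
    "GD": 0.18, "GO": 0.03, "GS": -0.86, "DO": 0.31, "DS": -0.96, "OS": -0.77,
    "GDO": 0.32, "GDS": -1.23, "GOS": -0.90, "DOS": -0.97, "GDOS": -1.06,
}

fitted = model_log_csd(independence_cells)
distance = sum((cov_log_csd[q] - fitted[q]) ** 2 for q in cov_log_csd)
print(round(distance, 2))
```

The distance obtained this way is about 6.87, against 6.86 reported for model 1 in Table 4; the small gap is consistent with the covariate-based values having been rounded to two decimals in the table.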

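| Model | Term GD | Term GS | Term DO | Term DS | Term OS | GD | GO | GS | DO | DS | OS | GDO | GDS | GOS | DOS | GDOS | $\hat{N}$ | $\mathrm{se}(\hat{\lambda})$ | CSD dist. | AIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cov. based log CSD | | | | | | 0.18 | 0.03 | -0.86 | 0.31 | -0.96 | -0.77 | 0.32 | -1.23 | -0.9 | -0.97 | -1.06 | | | | |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3300 | 0.24 | 6.86 | 56.75 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.60 | 0 | 0 | -0.60 | -0.60 | -0.60 | 3150 | 0.25 | 3.85 | 58.31 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.68 | 0 | 0 | 0.68 | 0 | 0 | 0.68 | 0.68 | 3750 | 0.28 | 10.59 | 57.07 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -1.12 | 0 | 0 | 0 | 0 | -1.12 | -1.12 | 0 | -1.12 | 3000 | 0.25 | 2.81 | 56.97 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0.37 | 0 | 0 | 0 | 0 | 0 | 0.37 | 0.37 | 0 | 0 | 0.37 | 3600 | 0.29 | 8.72 | 58.15 |
| 6 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0.00 | -0.12 | -0.65 | 0.00 | -0.12 | -0.65 | -0.77 | -0.77 | 3050 | 0.31 | 3.09 | 60.23 |
| 7 | 0 | 1 | 0 | 0 | 1 | 0 | 0.02 | -1.17 | 0 | 0 | -0.71 | 0.02 | -1.17 | -1.87 | -0.71 | -1.87 | 2850 | 0.25 | 2.91 | 58.35 |
| 8 | 0 | 1 | 0 | 1 | 0 | 0.01 | 0 | -1.2 | 0 | -0.21 | 0 | 0.01 | -1.41 | -1.20 | -0.21 | -1.41 | 2750 | 0.31 | 2.31 | 58.74 |
| 9 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | -1.00 | 0.57 | 0 | 0 | 0.57 | -1.00 | -1.00 | 0.57 | -0.43 | 3350 | 0.29 | 4.54 | 57.74 |
| 10 | 1 | 0 | 1 | 0 | 0 | 0.59 | 0.05 | 0 | 0.87 | 0 | 0 | 1.46 | 0.59 | 0.05 | 0.87 | 1.46 | 4500 | 0.35 | 17.93 | 57.77 |
| 11 | 1 | 1 | 0 | 0 | 0 | 0.25 | 0 | -1.05 | 0 | 0.00 | 0 | 0.25 | -0.80 | -1.05 | 0.00 | -0.80 | 3200 | 0.30 | 2.85 | 58.69 |
| 12 | 0 | 1 | 1 | 0 | 1 | 0.00 | 0.01 | -1.05 | 0.52 | 0.00 | -0.59 | 0.53 | -1.06 | -1.65 | -0.07 | -1.13 | 3200 | 0.30 | 2.49 | 59.33 |
| 13 | 1 | 0 | 1 | 1 | 0 | 1.18 | 0.17 | 0.07 | 1.45 | 0.86 | 0.11 | 2.63 | 2.04 | 0.48 | 2.32 | 3.49 | 8100 | 0.72 | 56.71 | 58.28 |
| 14 | 1 | 1 | 0 | 0 | 1 | 0.19 | 0.01 | -1.11 | 0 | 0 | -0.65 | 0.20 | -0.92 | -1.76 | -0.66 | -1.57 | 3050 | 0.31 | 2.29 | 60.19 |
| 15 | 1 | 1 | 0 | 1 | 0 | 0.18 | 0 | -1.12 | 0 | -0.13 | 0 | 0.18 | -1.06 | -1.12 | -0.13 | -1.06 | 3000 | 0.43 | 2.25 | 60.63 |
| 16 | 0 | 1 | 1 | 1 | 1 | 0.00 | 0.01 | -1.12 | 0.45 | -0.13 | -0.66 | 0.47 | -1.26 | -1.79 | -0.34 | -1.46 | 3000 | 0.42 | 2.17 | 61.27 |

Table 4: Model-based CSDs for selected log-linear models of the ALUS data. The "Term" columns indicate which interaction terms are in the model; the columns GD to GDOS give the model-based log-CSD estimates. $\hat{N}$ is the estimated total number of patients to the nearest 50. The selected models display either one of the ten best AIC values, one of the ten best CSD distance values or both. Best AIC: model 1; best CSD distance: model 16. Values of "0" are exact, values of "0.00" are rounded. The covariate-based log-CSD estimates are shown at the top of the table.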

[Figure 2: AIC and CSD distance as a function of the population estimate for the 32 log-linear models fitted. Two panels, AIC and CSD distance, each plotted against the log of the population estimate. (See Table 4.)]

4.4 Population estimation conditional on CSD estimates

Although it is an easy matter to determine a CSD estimate from a fitted log-linear model, in general, the value of the unknown cell will appear in equations relating CSDs to interaction terms. Nevertheless, by modifying log-linear models so that they model marginal, as opposed to cell, probabilities, we can use CSDs to estimate the total population size from within a regression model. We will call such models "marginal log-linear models". To do so, we can modify an approach due to Haber (1985) to simultaneously model all source marginal totals as well as all source intersection totals.

Consider, for instance, a three-source example with sources A, B and C. The marginal list totals will be modelled by

$$ \log m_A = \alpha_A, \qquad \log m_B = \alpha_B \qquad \text{and} \qquad \log m_C = \alpha_C. $$

Taking logs in (3.6) and setting $\gamma = \log N$, the other terms in the model take the following form:

$$ \begin{aligned} \log m_{AB} &= -\gamma + \alpha_A + \alpha_B + \log C_{AB} \\ \log m_{AC} &= -\gamma + \alpha_A + \alpha_C + \log C_{AC} \\ \log m_{BC} &= -\gamma + \alpha_B + \alpha_C + \log C_{BC} \\ \log m_{ABC} &= -2\gamma + \alpha_A + \alpha_B + \alpha_C + \log C_{ABC} \end{aligned} $$

In general, for a set $Q$ of sources, we model $m_Q$ by

$$ \log m_Q = -(|Q| - 1)\,\gamma + \sum_{L \in Q} \alpha_L + \log C_Q. $$

(Recall the definition that $C_Q = 1$ if $|Q| = 1$.) The term $m_Q$ can always be expressed as a sum of some cells in the incomplete table of counts regardless of $Q$. In matrix form, the model for the expected marginal counts is

$$ \log(Am) = X\beta + c \tag{4.16} $$

where $A$ transforms the cell counts into marginal counts, $m$ is the vector of expected cell counts, $X$ is the design matrix, $\beta$ the effects to estimate, and $c$ an offset consisting of the logarithm of the covariate-based CSD estimates. In the case of the three-source example, the components of this system could take the form

$$ A = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad m = \begin{pmatrix} m_{A\bar{B}\bar{C}} \\ m_{\bar{A}B\bar{C}} \\ m_{\bar{A}\bar{B}C} \\ m_{AB\bar{C}} \\ m_{A\bar{B}C} \\ m_{\bar{A}BC} \\ m_{ABC} \end{pmatrix}, \quad X = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -1 & 1 & 1 & 0 \\ -1 & 1 & 0 & 1 \\ -1 & 0 & 1 & 1 \\ -2 & 1 & 1 & 1 \end{pmatrix}, \quad \beta = \begin{pmatrix} \gamma \\ \alpha_A \\ \alpha_B \\ \alpha_C \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \\ 0 \\ \log \tilde{C}_{AB} \\ \log \tilde{C}_{AC} \\ \log \tilde{C}_{BC} \\ \log \tilde{C}_{ABC} \end{pmatrix}. $$

We note that $A$ satisfies $A = Y'$, where $Y$ is the saturated design matrix of a joint log-linear model of $m$. The design matrix can be obtained from $A$ by

$$ X = \bigl[\, e_n - A'_{1:k}\, e_k \;\big|\; A'_{1:k} \,\bigr], $$

where $e_a$ is the vector containing $a$ 1's and $A'_{1:k}$ consists of the first $k$ columns of matrix $A'$, where $k$ is the number of sources. The expression in the first column computes $1 - |Q|$ for each row of model (4.16). We note the similarity of model (4.16) with simultaneous joint and marginal models of Lang & Agresti (1994). Because $A$ is non-singular, the log-likelihood can be written down using

$$ m = m(\beta) = A^{-1} \exp(X\beta + c) $$

where exponentiation is applied entry-wise. From a computational point of view, $m$ can always be ordered so that $A$ is an upper triangular matrix: computing $A^{-1}v$ for some vector $v$ is therefore not an expensive task. Using a Poisson model for the cell counts, we obtain the log-likelihood

$$ \ell(\beta; n) = n' \log m(\beta) - e' m(\beta) = n' \log\bigl[ A^{-1} \exp(X\beta + c) \bigr] - e' A^{-1} \exp(X\beta + c) \tag{4.17} $$

where $n$ is the appropriately ordered vector of observed counts and $e$ is the vector consisting of all ones. The maximum likelihood estimate for $\beta$ can be obtained from a Newton-Raphson iteration based on the score and information; computational details are provided in Appendix B.

We obtained estimates for the period-prevalence of leg ulcers in the Auckland region for the duration of the ALUS study, using the covariate-based CSDs as offsets. The initial value $\beta^{(0)}$ was set using a whole population estimate based on a joint log-linear model of independence between all sources, for $\gamma^{(0)} = 8.100$, and with $\alpha_L^{(0)} = \log n_L$ for L = G, D, O and S. The results are shown in Table 5. We note that the overall period-prevalence estimate from this model is lower than all those obtained from the joint log-linear models. While the relative standard error of roughly 21% appears smaller than those of the joint models, we must remember that the fit is performed conditionally on the observed values of the covariate-based CSDs. While there is no way of independently validating the point estimate, it is close to the working estimate of 2600 leg ulcer cases which was used in the planning stage of the study, based on the literature concerning international leg ulcer epidemiology.

| $Q$ | Estimated $\log \hat{m}_Q$ | s.e. of $\log \hat{m}_Q$ | Estimated $\hat{m}_Q$ | Observed $n_Q$ |
|---|---|---|---|---|
| G | 4.33 | 0.11 | 76.0 | 76 |
| D | 5.29 | 0.07 | 197.8 | 198 |
| O | 3.87 | 0.14 | 48.0 | 48 |
| S | 4.80 | 0.09 | 121.0 | 121 |
| $\emptyset$ (tot. pop.) | 7.83 | 0.21 | 2508.9 | - |

Table 5: Estimates from the marginal model with covariate-based CSD estimates as offsets.
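The matrices of model (4.16) can be generated mechanically for any number of sources. The sketch below assumes NumPy and orders both cells and marginals by subset size and then lexicographically, which reproduces the three-source layout given above; the function names and the numerical example (a population of 1000 with list totals 100, 200 and 300 and all CSDs equal to 1) are ours.

```python
from itertools import combinations
import numpy as np

def marginal_model_matrices(k):
    """Return (subsets, A, X) for k sources, ordered by size then lexicographically."""
    subsets = [frozenset(c) for r in range(1, k + 1) for c in combinations(range(k), r)]
    # A[i, j] = 1 when the cell "exactly the lists in subsets[j]" counts towards
    # the marginal total of subsets[i], i.e. when subsets[i] is contained in it.
    A = np.array([[1.0 if q <= cell else 0.0 for cell in subsets] for q in subsets])
    # Source membership of each row; this equals the first k columns of A transposed.
    membership = np.array([[1.0 if i in q else 0.0 for i in range(k)] for q in subsets])
    # First column: 1 - |Q|, the coefficient of log N in each marginal equation.
    X = np.column_stack([1.0 - membership.sum(axis=1), membership])
    return subsets, A, X

def expected_cells(A, X, beta, c):
    """Expected cell counts m(beta) = A^{-1} exp(X beta + c)."""
    return np.linalg.solve(A, np.exp(X @ beta + c))

subsets, A, X = marginal_model_matrices(3)

# With zero offsets (all CSDs equal to 1) the cells follow the independence pattern.
beta = np.log([1000.0, 100.0, 200.0, 300.0])  # (log N, log m_A, log m_B, log m_C)
m = expected_cells(A, X, beta, np.zeros(len(subsets)))
print(np.round(m, 1))
```

Because `A` is upper triangular in this ordering, a triangular solver could replace `np.linalg.solve`; the general solver is used here only to keep the sketch short. Note that nothing in the parametrisation forces the solved cell means to be positive for arbitrary offsets, so an implementation would need to check this during iteration.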

5 Discussion

Covariate-based CSD estimates can be used as exploratory and model validation tools in joint capture-recapture log-linear modelling. Marginal log-linear modelling with CSD offsets provides a way to account for all forms of source dependence, regardless of observed zero counts and regardless of the number of joint log-linear models to consider. It obviates the task of model selection, which complicates error assessment, at the cost of providing results which are conditional on the CSD estimates. The possibility of misleading inference from a model conditioned on CSD estimates must be weighed against these advantages.

Covariate-based CSD estimation is predicated on two assumptions. The first is that the covariate distribution estimate (3.10) is unbiased. This assumption can easily be violated in a health research context if covariates are associated with reporting bias, as often happens with age and gender. It may be that a Bayes estimate of the population covariate distribution using an informative prior is more appropriate than a simple averaging of the observations. Even a suitable weighting of the lists might be useful in that respect.

The second assumption is that of conditional independence of sources given the individual covariates. Conditional independence can be viewed as an operational model of the structural source relationship described by Regal & Hook (1998). For this view to hold, we must ensure that the causal and other dependencies between sources are strongly associated with the chosen covariates. A structural description of the ALUS sources, for instance, might involve frequent referral of highly disabled or immobilized patients by general practitioners to district nurses (though it is suspected that such referrals tend to appear on a single, rather than both, lists). In turn, high disability and immobility are associated with age. As well, self-notification will tend to occur in more self-reliant, and therefore younger, individuals. Advertising and reinforcement of the sourcing processes might occur simultaneously for health practitioners but at different times for self-notification, a process associated with the quarter of first notification. District nurses and rest homes (which are part of the "Others" source) will tend to capture the same individuals.

We can plausibly use covariate-conditional source independence as a surrogate for a fuller structural exploration of source dependency if these covariates either explain the dependency or are associated with the relation between the sources. Here, age, gender and quarter of first notification appear to provide a good explanatory model for many source dependencies. CSD estimates should be based on covariates known to be associated with the source dependency structure in such ways.

Covariate-based CSD estimates can provide a useful adjunct to currently accepted practice and in many cases can provide a valid alternative modelling tool. A possible extension of the associated marginal log-linear model lies in the realm of Bayesian inference, for instance by placing priors on the CSDs.

Acknowledgments

We wish to thank Professor G. A. F. Seber for his useful comments in the development of the manuscript. One of the authors (Walker) undertook this research during the tenure of a Training Fellowship from the Health Research Council of New Zealand (HRCNZ). Special thanks should be given to the following people involved in ALUS: A. Rodgers, R. Norton, N. Birchall and S. MacMahon. Funding for ALUS was provided in part by the HRCNZ and ConvaTec Ltd. Smith & Nephew Ltd, Fisher & Paykel HealthCare Ltd, and Huntleigh Healthcare Pty Ltd provided additional support.


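A Derivation of the estimator (3.8)

With $k$ sources $A_1, \ldots, A_k$, and under the assumption of conditional independence (3.7),

\[
\begin{aligned}
\mathrm{Prob}\Bigl[\,\bigcap_{i=1}^{k} A_i\Bigr]
&= \sum_{\text{all } z} \mathrm{Prob}\Bigl[\,\bigcap_{i=1}^{k} A_i \Bigm| Z = z\Bigr]\, \mathrm{Prob}[Z = z] \\
&= \sum_{\text{all } z} \Bigl(\prod_{i=1}^{k} \mathrm{Prob}[A_i \mid Z = z]\Bigr)\, \mathrm{Prob}[Z = z] \\
&= \sum_{\text{all } z} \Bigl(\prod_{i=1}^{k} \frac{\mathrm{Prob}[Z = z \mid A_i]\, \mathrm{Prob}[A_i]}{\mathrm{Prob}[Z = z]}\Bigr)\, \mathrm{Prob}[Z = z] \\
&= \Bigl(\prod_{i=1}^{k} \mathrm{Prob}[A_i]\Bigr) \Bigl(\sum_{\text{all } z} \frac{\prod_{i=1}^{k} \mathrm{Prob}[Z = z \mid A_i]}{\mathrm{Prob}[Z = z]^{\,k-1}}\Bigr).
\end{aligned}
\]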
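The identity derived in this appendix can be checked numerically on a small discrete example. The sketch below is illustrative only: the function name `dependence_factor`, the three-level covariate and the capture probabilities are hypothetical, and the second factor of the final line of the derivation is taken here to be the quantity that the CSD estimator targets (an assumption, since (3.8) itself is not reproduced in this appendix).

```python
import numpy as np

def dependence_factor(p_z_given_source, p_z):
    """Return sum_z prod_i P[Z=z | A_i] / P[Z=z]^(k-1).

    p_z_given_source: array of shape (k, n_levels), row i is P[Z=z | A_i].
    p_z: array of shape (n_levels,), the population covariate distribution.
    """
    cond = np.asarray(p_z_given_source, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    k = cond.shape[0]
    return float(np.sum(np.prod(cond, axis=0) / p_z ** (k - 1)))

# Hypothetical population: one covariate with 3 levels, 2 sources.
p_z = np.array([0.5, 0.3, 0.2])
# Hypothetical capture probabilities P[A_i | Z=z]; sources are independent given Z.
p_a_given_z = np.array([[0.2, 0.5, 0.8],
                        [0.1, 0.4, 0.6]])

# Marginal capture probabilities P[A_i] and, by Bayes' rule, P[Z=z | A_i].
p_a = p_a_given_z @ p_z
p_z_given_a = p_a_given_z * p_z / p_a[:, None]

# Left-hand side: joint capture probability computed directly.
joint_direct = float(np.sum(np.prod(p_a_given_z, axis=0) * p_z))
# Right-hand side: product of marginals times the dependence factor.
factor = dependence_factor(p_z_given_a, p_z)
joint_via_factor = float(np.prod(p_a) * factor)

print(joint_direct, joint_via_factor, factor)
```

In this toy setting both sources favour the same covariate levels, so the factor exceeds 1 (positive dependence). When every conditional distribution equals the population distribution, the factor reduces to 1, which corresponds to marginal independence.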

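B Score, information and Newton-Raphson procedure for (4.17)

Let $\beta$ be the $q \times 1$ vector of parameters to estimate. The vector $e$ represents a vector of all ones of the appropriate dimension; the $n \times 1$ vector $n$ contains the observed cell counts; the notation $\mathrm{diag}[x]$ is used to denote the diagonal matrix with diagonal $x$. Using this notation, the $q \times 1$ score vector for (4.17) is given by

\[
U(\beta) = \frac{\partial}{\partial \beta}\, \ell(\beta; n)
= X' \,\mathrm{diag}[\exp(\eta(\beta))]\, {A^{-1}}' \bigl(\mathrm{diag}[m(\beta)]^{-1} n - e\bigr),
\]

while the negative of the $q \times q$ information matrix, obtained by differentiating $U(\beta)$ with respect to $\beta'$, is given by

\[
\begin{aligned}
-I(\beta) &= \frac{\partial^2}{\partial \beta\, \partial \beta'}\, \ell(\beta; n) \\
&= X' \,\mathrm{diag}[\exp(\eta(\beta))] \Bigl\{ \mathrm{diag}\bigl[{A^{-1}}' \bigl(n \odot m(\beta)^{-1}_{I} - e\bigr)\bigr]
- {A^{-1}}' \,\mathrm{diag}\bigl[n \odot m(\beta)^{-2}_{I}\bigr]\, A^{-1} \,\mathrm{diag}[\exp(\eta(\beta))] \Bigr\} X,
\end{aligned}
\]

where

\[
\eta(\beta) = X\beta + c, \qquad m(\beta) = A^{-1} \exp[\eta(\beta)],
\]

\[
n \odot m(\beta)^{-a}_{I} = \bigl[\, n_A m_A^{-a},\; n_B m_B^{-a},\; \ldots,\; n_{\{1,\ldots,k\}}\, m_{\{1,\ldots,k\}}^{-a} \,\bigr]'.
\]

From an initial guess $\beta^{(0)}$ we can obtain the maximum likelihood estimate of $\beta$ by iterating the following Newton-Raphson update until convergence:

\[
\beta^{(s+1)} = \beta^{(s)} + I\bigl(\beta^{(s)}\bigr)^{-1}\, U\bigl(\beta^{(s)}\bigr).
\]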
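The iteration in this appendix can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-source design matrix `X`, the aggregation matrix `A`, the offset value `log(1.2)` and the counts are all hypothetical, the parameter vector is written `beta`, and the log-likelihood is taken to be the Poisson kernel `sum(n * log(m) - m)` with `m = A^{-1} exp(X beta + c)`. The second derivative is computed by differentiating the score directly.

```python
import numpy as np

def loglik(beta, X, A_inv, c, n):
    """Assumed Poisson log-likelihood kernel: sum_j n_j log m_j - m_j."""
    m = A_inv @ np.exp(X @ beta + c)
    return float(np.sum(n * np.log(m) - m))

def score_and_hessian(beta, X, A_inv, c, n):
    """Score U(beta) and second-derivative matrix H(beta) = -I(beta)."""
    w = np.exp(X @ beta + c)               # exp(eta(beta))
    m = A_inv @ w                          # m(beta) = A^{-1} exp(eta(beta))
    v = A_inv.T @ (n / m - 1.0)            # A^{-1}' (diag[m]^{-1} n - e)
    U = X.T @ (w * v)
    inner = np.diag(v) - A_inv.T @ np.diag(n / m**2) @ A_inv @ np.diag(w)
    H = X.T @ np.diag(w) @ inner @ X
    return U, H

def newton_raphson(beta0, X, A, c, n, tol=1e-10, max_iter=50):
    """Iterate beta <- beta + I^{-1} U, with I = -H, until the step is tiny."""
    A_inv = np.linalg.inv(A)
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(max_iter):
        U, H = score_and_hessian(beta, X, A_inv, c, n)
        step = np.linalg.solve(-H, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical two-source example. Joint cells: (both, A only, B only).
n = np.array([30.0, 50.0, 40.0])
# A maps joint-cell means to the margins (m_A, m_B, m_AB).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
# Columns: log N, log p_A, log p_B; the offset carries a log-CSD of log(1.2).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
c = np.array([0.0, 0.0, np.log(1.2)])

# Starting value picked by hand so that all fitted joint counts stay positive.
beta_start = np.array([5.3, -1.0, -1.1])
beta_hat = newton_raphson(beta_start, X, A, c, n)
N_hat = float(np.exp(beta_hat[0]))
print(beta_hat, N_hat)
```

Because this toy model is saturated, the fitted margins should reproduce the observed ones, so `N_hat` can be compared with the closed form `n_A * n_B * CSD / n_AB`. The plain update has no step-size control, so a poor starting value could push some fitted counts negative; a step-halving safeguard would be a natural addition in practice.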


References

Bonita, R., Broad, J. B., Anderson, N. E. & Beaglehole, R. (1995). Approaches to the problems of measuring the incidence of stroke: the Auckland Stroke Study, 1991-1992. Int. J. Epidemiol. 24, 535-542.

Buckland, S. T., Burnham, K. P. & Augustin, N. H. (1997). Model selection: An integral part of inference. Biometrics 53, 603-618.

Christensen, R. (1997). Log-linear models and logistic regression. Springer, New York, 2nd edition.

Cormack, R. M. (1989). Log-linear models for capture-recapture. Biometrics 45, 395-413.

Haber, M. (1985). Log-linear models for correlated marginal totals of a contingency table. Comm. Statist.-Theor. Meth. 14, 2845-2856.

Hook, E. B. & Regal, R. R. (1997). Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capture-recapture estimation. Amer. J. Epi. 145, 1138-1144.

Lang, J. B. & Agresti, A. (1994). Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Amer. Statist. Assoc. 89, 625-632.

Madigan, D. & York, J. C. (1997). Bayesian methods for estimation of the size of a closed population. Biometrika 84, 19-31.

Regal, R. R. & Hook, E. B. (1998). Marginal versus conditional versus `structural source' models: a rationale for an alternative to log-linear methods for capture-recapture estimates. Statist. Med. 17, 69-74.

Seber, G. A. F. (1982). The estimation of animal abundance and related parameters. Macmillan, Inc., New York, 2nd edition.

Sekar, C. C. & Deming, W. E. (1949). On a method of estimating birth and death rates and the extent of registration. J. Amer. Statist. Assoc. 44, 101-115.

Tilling, K. & Sterne, J. A. C. (1999). Capture-recapture models including covariate effects. Amer. J. Epi. 149, 392-400.

Contact address

Alain C. Vandal
Clinical Trials Research Unit
University of Auckland
Private bag 92019
Auckland, New Zealand
phone: 64.9 373-7599 x4729
fax: 64.9 373-1710
email: [email protected]
