Spatial Cluster Detection

A project work on Data Mining & A.I. Techniques

Submitted by
Debamita Ghosh
M.Sc. in Statistics, Session: 2010 – 2012

Supervisor: Prof. Dr. Amit Mitra

Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
Kalyanpur – 208016, Uttar Pradesh, India
November, 2011

ABSTRACT

We frame our project as a theoretical study of one area of data mining. We focus on the task of spatial cluster detection: finding spatial regions where some quantity is significantly higher than expected. For example, our goal may be to detect clusters of disease cases, which may be indicative of a naturally occurring epidemic (e.g. influenza), a bioterrorist attack (e.g. anthrax release), or an environmental hazard (e.g. radiation leak). In all of these applications, we have two main goals: to identify the locations, shapes, and sizes of potential clusters, and to determine whether each potential cluster is more likely to be a “true” cluster or simply a chance occurrence. Thus we compare the null hypothesis H0 of no clusters against a set of alternative hypotheses H1(S), each representing a cluster in some region or regions S. Our primary motivating application is prospective disease surveillance: detecting spatial clusters of disease cases resulting from a disease outbreak. In this application, we perform surveillance on a daily basis, with the goal of finding emerging epidemics as quickly as possible. This is called bio-surveillance of disease.

INTRODUCTION

SVERDLOVSK (Former U.S.S.R): During April & May 1979 there were 77 confirmed cases of inhalational anthrax …..

In such a situation the disease rate is significantly higher than expected. The most obvious questions in this context are:
• Is it an epidemic?
• What if a similar situation arises across a large number of regions spread over a vast geographical area?
• How do we detect the regions worst hit by such a calamity?

PRACTICAL POSSIBILITIES: Situations like this may arise in practice from:
• A bio-terrorist attack (e.g. the 2001 anthrax letters in the U.S.)
• Environmental hazards (e.g. the nuclear radiation leak in Japan in 2011)
• Infectious disease outbreaks (e.g. bird flu and SARS cases in many countries around the world)

GOALS OF SPATIAL CLUSTER DETECTION:
• To identify the locations, shapes, and sizes of potentially anomalous spatial regions.
• To determine whether each of these potential clusters is more likely to be a “true” cluster or a chance occurrence. In other words: is anything unexpected going on, and if so, where?

DISEASE SURVEILLANCE: Given: a count for each spatial location (e.g. the number of Emergency Department visits, or over-the-counter drug sales, of a specific type).

Do any regions have sufficiently high counts to be indicative of an emerging disease epidemic in that area?

How many cases do we expect to see in each area?

Are there any regions with significantly more cases than expected?

Our Approach:
• In this spatial surveillance setting, each day we have data collected for a set of discrete spatial locations si.
• For each location si, we have a count ci (e.g. the number of disease cases) and an underlying baseline bi (the underlying population at risk in si).
• We have to find whether there is any spatial region S (a set of locations si) for which the counts are significantly higher than expected, given the baselines.
• For simplicity, we assume here that the locations si are aggregated to a uniform, two-dimensional N×N grid G, and we search over the set of square regions S, which are subsets of G.

Hypothetical Site Map:

[Figure: the entire area (G) being scanned, with the current region (S) being considered marked inside it.]

Frequentist Approach: Our hypotheses are H0: no cluster (no attack in any region) vs. H1(S): attack in region S.
• We assume that ci follows Poisson(q bi), where bi represents the (known) census population of cell si and q is the (unknown) underlying disease rate.
• The goal of the frequentist scan statistic is then to find regions where the disease rate is higher inside the region than outside.

Statistical Frame-work: The statistic used for this is the likelihood ratio

$$F(S) = \frac{P(\text{Data} \mid H_1(S))}{P(\text{Data} \mid H_0)}.$$

Assumptions:
• The null hypothesis H0 assumes a uniform disease rate q = qall for every cell si ∈ G.
• Under H1(S), we assume that q = qin for all si ∈ S and q = qout for all si ∈ G − S, for some constants qin > qout.
• We are now in a position to derive an expression for F(S) using the maximum likelihood estimates of qin, qout and qall.
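As a worked sketch of this step (the standard Poisson maximum likelihood calculation, written out here under the assumptions just listed, not taken from an appendix of this report), let T stand for any set of cells sharing one rate q, so that T is S, G − S or G. The symbols C_T and B_T are shorthand introduced only for this sketch; for T = S, G − S and G they are the totals written Cin, Bin, Cout, Bout, Call and Ball in the text.

```latex
% Likelihood of the counts in a set of cells T that share a single rate q:
L(q; T) = \prod_{s_i \in T} \frac{(q b_i)^{c_i} e^{-q b_i}}{c_i!}
        \;\propto\; q^{C_T} e^{-q B_T},
\qquad C_T = \sum_{s_i \in T} c_i, \quad B_T = \sum_{s_i \in T} b_i .

% Setting the derivative of log L to zero, C_T / q - B_T = 0, gives the MLE:
\hat{q}_T = \frac{C_T}{B_T} .

% Hence \hat{q}_{all} = C_{all}/B_{all} and, whenever C_{in}/B_{in} > C_{out}/B_{out},
% \hat{q}_{in} = C_{in}/B_{in} and \hat{q}_{out} = C_{out}/B_{out}.
% The factors b_i^{c_i}/c_i! cancel in the ratio, and the exponentials cancel
% because C_{in} + C_{out} = C_{all}:
\frac{\hat{q}_{in}^{\,C_{in}} e^{-C_{in}} \; \hat{q}_{out}^{\,C_{out}} e^{-C_{out}}}
     {\hat{q}_{all}^{\,C_{all}} e^{-C_{all}}}
= \frac{\hat{q}_{in}^{\,C_{in}} \; \hat{q}_{out}^{\,C_{out}}}{\hat{q}_{all}^{\,C_{all}}} .
```

When C_in/B_in is not larger than C_out/B_out, the constraint qin > qout prevents the alternative from fitting better than the null, so the ratio is taken to be 1.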

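• Hence we get the estimated F(S) as

$$\hat{F}(S) = \begin{cases} \dfrac{\left(\dfrac{C_{in}}{B_{in}}\right)^{C_{in}} \left(\dfrac{C_{out}}{B_{out}}\right)^{C_{out}}}{\left(\dfrac{C_{all}}{B_{all}}\right)^{C_{all}}} & \text{if } \dfrac{C_{in}}{B_{in}} > \dfrac{C_{out}}{B_{out}}, \\[2ex] 1 & \text{otherwise,} \end{cases} \qquad \text{(A1)}$$

where

$$C_{in} = \sum_{s_i \in S} c_i, \quad B_{in} = \sum_{s_i \in S} b_i, \quad C_{out} = \sum_{s_i \in G - S} c_i, \quad B_{out} = \sum_{s_i \in G - S} b_i, \quad C_{all} = \sum_{s_i \in G} c_i, \quad B_{all} = \sum_{s_i \in G} b_i.$$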
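The score of a single region can be sketched in a few lines of Python. This is a minimal illustration, not the report's own implementation: the function name `log_score` and the example numbers are hypothetical, and the function works with log F(S) instead of F(S) only to avoid numerical overflow for large counts.

```python
import math


def log_score(c_in, b_in, c_all, b_all):
    """Return log F(S) for the Poisson likelihood-ratio statistic.

    Returns 0.0 (that is, F(S) = 1) when the rate inside the region is not
    higher than the rate outside, or when either side has no baseline.
    """
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_in <= 0 or b_out <= 0 or c_in / b_in <= c_out / b_out:
        return 0.0

    def term(count, baseline):
        # count * log(count / baseline), with the convention 0 * log(0) = 0
        return count * math.log(count / baseline) if count > 0 else 0.0

    return term(c_in, b_in) + term(c_out, b_out) - term(c_all, b_all)


# Hypothetical example: 20 cases among 100 people inside S,
# 100 cases among 1000 people in the whole grid G.
print(log_score(20, 100, 100, 1000))  # positive: rate inside exceeds rate outside
print(log_score(10, 100, 100, 1000))  # 0.0: the two rates are equal, so F(S) = 1
```

Only the four totals are needed, so the same function applies to a region of any shape once its counts and baselines have been summed.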

Now it is easy to find the highest scoring region of grid G, $S^* = \arg\max_{S} F(S)$, and its score F* = F(S*). In this way we can identify a set of potentially anomalous clusters. This set may be very large.
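A brute-force version of this search over all square regions might look as follows. It is a sketch under stated assumptions: the helper names (`prefix_sums`, `square_total`, `best_square`) and the 4×4 example grid are hypothetical, only squares strictly smaller than the grid are scored, and cumulative sums are used so that each region total is obtained in constant time.

```python
import math


def log_score(c_in, b_in, c_all, b_all):
    """log F(S); 0.0 when the inside rate does not exceed the outside rate."""
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_in <= 0 or b_out <= 0 or c_in / b_in <= c_out / b_out:
        return 0.0
    term = lambda c, b: c * math.log(c / b) if c > 0 else 0.0
    return term(c_in, b_in) + term(c_out, b_out) - term(c_all, b_all)


def prefix_sums(grid):
    """2-D cumulative sums, so that any rectangle total costs O(1)."""
    n = len(grid)
    p = [[0] * (n + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(n):
            p[r + 1][c + 1] = grid[r][c] + p[r][c + 1] + p[r + 1][c] - p[r][c]
    return p


def square_total(p, top, left, size):
    """Total over the size x size square whose top-left cell is (top, left)."""
    bottom, right = top + size, left + size
    return p[bottom][right] - p[top][right] - p[bottom][left] + p[top][left]


def best_square(counts, baselines):
    """Return (log F*, (top, left, size)) over all proper square regions."""
    n = len(counts)
    pc, pb = prefix_sums(counts), prefix_sums(baselines)
    c_all, b_all = pc[n][n], pb[n][n]
    best_score, best_region = 0.0, None
    for size in range(1, n):  # size n would be the whole grid, with nothing outside
        for top in range(n - size + 1):
            for left in range(n - size + 1):
                score = log_score(square_total(pc, top, left, size),
                                  square_total(pb, top, left, size),
                                  c_all, b_all)
                if score > best_score:
                    best_score, best_region = score, (top, left, size)
    return best_score, best_region


# Hypothetical 4 x 4 grid: 5 cases per cell, except a 2 x 2 block with 20 each.
counts = [[5] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        counts[r][c] = 20
baselines = [[100] * 4 for _ in range(4)]
print(best_square(counts, baselines))
```

For an N×N grid this examines on the order of N³ squares, which is why the set of candidate regions can become very large.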

We must still determine the statistical significance of this region by randomization testing.

To determine whether each of these potential clusters is actually an anomalous cluster:

Frequentist approach: calculate the statistical significance of each region by randomization testing.

Test Procedure (Randomization):

[Figure: randomization testing. The original grid is shown with example region scores F(S) = 15.1, 21.4 and 18.9; the replica grids are shown with scores F(Sr) = 16.7, 18.7 and 6.9.]

1. Create R = 999 (say) replica grids by sampling under H0, using maximum likelihood estimates of any free parameters.

2. Find the maximum region score F* for each replica.

3. For each potential cluster S, count Rbeat = the number of replica grids G’ with F*(G’) (the score of the highest scoring region of grid G’) higher than F(S).

4. p-value of region S = (Rbeat + 1) / (R + 1).

Result: All regions with p-value < α are significant at level α.
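The four steps above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function names, the seed and the small example grid are hypothetical, replica counts are drawn as Poisson(q̂all · bi) with q̂all = Call/Ball, and the simple Poisson sampler used here is only adequate for small means.

```python
import math
import random


def log_score(c_in, b_in, c_all, b_all):
    """log F(S); 0.0 when the inside rate does not exceed the outside rate."""
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_in <= 0 or b_out <= 0 or c_in / b_in <= c_out / b_out:
        return 0.0
    term = lambda c, b: c * math.log(c / b) if c > 0 else 0.0
    return term(c_in, b_in) + term(c_out, b_out) - term(c_all, b_all)


def region_scores(counts, baselines):
    """Yield ((top, left, size), log F) for every proper square region."""
    n = len(counts)
    c_all, b_all = sum(map(sum, counts)), sum(map(sum, baselines))
    for size in range(1, n):
        for top in range(n - size + 1):
            for left in range(n - size + 1):
                cells = [(r, c) for r in range(top, top + size)
                         for c in range(left, left + size)]
                c_in = sum(counts[r][c] for r, c in cells)
                b_in = sum(baselines[r][c] for r, c in cells)
                yield (top, left, size), log_score(c_in, b_in, c_all, b_all)


def poisson(rng, lam):
    """Knuth's multiplication method; suitable only for small means."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def randomization_p_values(counts, baselines, n_replicas=999, seed=0):
    """Return {region: p-value} with p = (R_beat + 1) / (R + 1)."""
    rng = random.Random(seed)
    # Step 1: the only free parameter under H0 is q_all, estimated by C_all / B_all.
    q_all = sum(map(sum, counts)) / sum(map(sum, baselines))
    replica_max = []
    for _ in range(n_replicas):
        replica = [[poisson(rng, q_all * b) for b in row] for row in baselines]
        # Step 2: maximum region score F* of this replica.
        replica_max.append(max(s for _, s in region_scores(replica, baselines)))
    p_values = {}
    for region, score in region_scores(counts, baselines):
        # Step 3: replicas whose best region beats this region's score.
        r_beat = sum(1 for m in replica_max if m > score)
        # Step 4: randomization p-value.
        p_values[region] = (r_beat + 1) / (n_replicas + 1)
    return p_values


# Hypothetical 4 x 4 grid with an injected 2 x 2 cluster.
counts = [[5] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        counts[r][c] = 20
baselines = [[100] * 4 for _ in range(4)]
p = randomization_p_values(counts, baselines, n_replicas=199)
print(p[(1, 1, 2)])
```

Because every region is compared against the maximum score of each replica, the resulting p-values already account for the many regions that were searched.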

Bayesian Approach: We now move from a Poisson model to a conjugate Gamma-Poisson model. Bayesian Gamma-Poisson models are a common representation for count data in epidemiology. Our hypotheses are the same as before and, as before, we assume Poisson likelihoods, ci ~ Po(q bi). The difference is that we assume a Bayesian model in which the disease rates qin, qout, and qall are themselves drawn from Gamma distributions.

• Thus under the null hypothesis H0, we have q = qall for all si ∈ G, where qall ~ Ga(αall, βall).

• Under the alternative hypothesis H1(S), we have q = qin for all si ∈ S and q = qout for all si ∈ G − S, where we independently draw qin ~ Ga(αin, βin) and qout ~ Ga(αout, βout).

Statistical Frame-work: From this model, we can compute the posterior probability P(H1(S) | D) of an outbreak in each region S, and the probability P(H0 | D) that no outbreak has occurred, given dataset D:

$$P(H_0 \mid D) = \frac{P(D \mid H_0)\,P(H_0)}{P(D)} \quad \text{and} \quad P(H_1(S) \mid D) = \frac{P(D \mid H_1(S))\,P(H_1(S))}{P(D)},$$

where

$$P(D) = P(D \mid H_0)\,P(H_0) + \sum_{S} P(D \mid H_1(S))\,P(H_1(S)).$$

[Here H1 denotes an attack in region S; hence we write H1 as H1(S).]

• First we calculate the probabilities P(D | H0) and P(D | H1(S)). (A2)
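These marginal likelihoods have a closed form because the Gamma prior is conjugate to the Poisson likelihood. The sketch below is the standard conjugate calculation, written under the assumption that Ga(α, β) is parameterized by shape α and rate β, so that its mean is α/β; the text does not state the parameterization explicitly. T, C and B are shorthand introduced for this sketch: T is a set of cells sharing one rate q, with total count C and total baseline B.

```latex
% Marginal likelihood of the counts in T, integrating the shared rate q
% over its Ga(\alpha, \beta) prior:
P(D_T) = \int_0^\infty \prod_{s_i \in T} \frac{(q b_i)^{c_i} e^{-q b_i}}{c_i!}
         \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, q^{\alpha - 1} e^{-\beta q} \, dq
       = \left[ \prod_{s_i \in T} \frac{b_i^{c_i}}{c_i!} \right]
         \frac{\beta^{\alpha} \, \Gamma(\alpha + C)}{\Gamma(\alpha) \, (\beta + B)^{\alpha + C}} .

% Applying this to T = G under H_0, and to T = S and T = G - S under H_1(S):
P(D \mid H_0) \propto
  \frac{\beta_{all}^{\alpha_{all}} \, \Gamma(\alpha_{all} + C_{all})}
       {\Gamma(\alpha_{all}) \, (\beta_{all} + B_{all})^{\alpha_{all} + C_{all}}},

P(D \mid H_1(S)) \propto
  \frac{\beta_{in}^{\alpha_{in}} \, \Gamma(\alpha_{in} + C_{in})}
       {\Gamma(\alpha_{in}) \, (\beta_{in} + B_{in})^{\alpha_{in} + C_{in}}}
  \cdot
  \frac{\beta_{out}^{\alpha_{out}} \, \Gamma(\alpha_{out} + C_{out})}
       {\Gamma(\alpha_{out}) \, (\beta_{out} + B_{out})^{\alpha_{out} + C_{out}}} .
```

The bracketed product over cells is the same under every hypothesis (the cells of S and G − S together make up G), so it cancels when the posterior probabilities are formed.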

• After that, the Bayesian spatial scan statistic can be computed simply by first calculating the score P(D | H1(S)) P(H1(S)) for each spatial region S, maintaining a list of regions ordered by score. We then calculate P(D | H0) P(H0), and add this to the sum of all region scores, obtaining the probability of the data P(D).

• Finally, we can compute the posterior probability $P(H_1(S) \mid D) = \dfrac{P(D \mid H_1(S))\,P(H_1(S))}{P(D)}$ for each region, as well as $P(H_0 \mid D) = \dfrac{P(D \mid H_0)\,P(H_0)}{P(D)}$.

• Then we can return all regions with non-negligible posterior probabilities, the posterior probability of each, and the overall probability of an outbreak.
• Note that no randomization testing is necessary, and thus the overall complexity is proportional to the number of regions searched.
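The whole Bayesian computation can be sketched compactly in Python. This is a minimal sketch under stated assumptions: it uses the closed-form Gamma-Poisson marginal with Ga(α, β) taken as shape α and rate β, it drops the factor common to all hypotheses, it sets P(H1(S)) = P1/Nreg as described under Choosing Priors, and the names `log_marginal` and `bayesian_scan`, the region labels and all numeric values are hypothetical.

```python
import math


def log_marginal(c, b, alpha, beta):
    """log of beta^alpha * Gamma(alpha + c) / (Gamma(alpha) * (beta + b)^(alpha + c)).

    This is the Gamma-Poisson marginal likelihood of total count c over total
    baseline b, up to a factor that is identical for every hypothesis.
    """
    return (alpha * math.log(beta) + math.lgamma(alpha + c)
            - math.lgamma(alpha) - (alpha + c) * math.log(beta + b))


def bayesian_scan(regions, c_all, b_all, prior_all, p1):
    """Return posterior probabilities for "H0" and for every region.

    regions maps a label to (c_in, b_in, (alpha_in, beta_in), (alpha_out, beta_out)).
    p1 is the prior probability of an outbreak, spread evenly over the regions.
    """
    log_joint = {"H0": math.log(1 - p1) + log_marginal(c_all, b_all, *prior_all)}
    for name, (c_in, b_in, prior_in, prior_out) in regions.items():
        log_joint[name] = (math.log(p1 / len(regions))
                           + log_marginal(c_in, b_in, *prior_in)
                           + log_marginal(c_all - c_in, b_all - b_in, *prior_out))
    # log P(D) via log-sum-exp, to stay numerically stable
    top = max(log_joint.values())
    log_p_data = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {name: math.exp(v - log_p_data) for name, v in log_joint.items()}


# Hypothetical example with two candidate regions on a grid holding
# 140 cases over a total baseline of 1600.
regions = {
    "A": (80, 400, (10.0, 100.0), (5.0, 100.0)),
    "B": (20, 400, (10.0, 100.0), (5.0, 100.0)),
}
posterior = bayesian_scan(regions, c_all=140, b_all=1600,
                          prior_all=(5.0, 100.0), p1=0.1)
for name, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(name, prob)
```

Each region is scored once and the scores are normalized at the end, which reflects the point that no replica grids are needed.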

Choosing Priors:

For any region S that we examine, we must have values of the parameter priors αin(S), βin(S), αout(S), and βout(S), as well as the region prior probability P(H1(S)). We must also choose the global parameter priors αall and βall, as well as the “no outbreak” prior P(H0).

Our assumption: If there is a disease outbreak, it is equally likely to occur in any spatial region. Thus we have P(H0) = 1 − P1 and P(H1(S)) = P1/Nreg, where Nreg is the total number of regions searched. The parameter P1 can be obtained from historical data.

• For the parameter priors, we assume that we have access to a large number of days of past data, during which no outbreaks are known to have occurred. We set the expectation and variance of the Gamma distribution Ga(αall, βall) to the sample expectation and variance of Call/Ball. From there we can solve for αall and βall. (A3)
• The calculation of the priors αin(S), βin(S), αout(S), and βout(S) is identical except for two differences: first, we must condition on the region S, and second, we must assume the alternative hypothesis H1(S).
• The effect of an outbreak inside region S must be taken into account when computing αin(S) and βin(S): since we assume that no outbreak has occurred in the past data, we cannot just use the sample mean and variance, but must consider what we expect these quantities to be in the event of an outbreak. We assume that the outbreak will increase qin by a multiplicative factor m, thus multiplying the mean and variance of Cin/Bin by m. To account for this in the Gamma distribution Ga(αin, βin), we multiply αin by m while leaving βin unchanged.
• Since we typically do not know the exact value of m, we assume that a similar situation (epidemic) took place at some point in the past in some region. We can use the daily data for that event to approximate the value of m and use it in our case.
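Solving for the Gamma parameters is a moment-matching step, which can be sketched as below. The sketch assumes Ga(α, β) has mean α/β and variance α/β² (shape α, rate β), which is the parameterization under which multiplying α by m multiplies both the mean and the variance by m. The function names and the daily rates are hypothetical.

```python
import statistics


def gamma_prior_from_history(daily_rates):
    """Match Ga(alpha, beta) to the sample mean and variance of past daily rates.

    With mean = alpha / beta and variance = alpha / beta**2, we get
    beta = mean / variance and alpha = mean * beta.
    """
    mean = statistics.mean(daily_rates)
    variance = statistics.variance(daily_rates)  # sample variance
    beta = mean / variance
    alpha = mean * beta
    return alpha, beta


def outbreak_prior(alpha, beta, m):
    """Scale the prior for an outbreak: alpha is multiplied by m, beta is unchanged."""
    return m * alpha, beta


# Hypothetical outbreak-free history of the daily ratio C_all / B_all.
history = [0.04, 0.05, 0.06, 0.05]
alpha_all, beta_all = gamma_prior_from_history(history)
print(alpha_all, beta_all)

# Hypothetical multiplicative effect m = 2 for the rate inside a region.
print(outbreak_prior(alpha_all, beta_all, m=2))
```

For the region-specific priors the same function would be applied to the past daily values of Cin/Bin and Cout/Bout for that region, with `outbreak_prior` applied to the inside prior only.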

Another Problem (Frequentist Approach):
• What if, after scanning millions of regions, we obtain thousands of potentially anomalous regions?
• What if the resources at hand are smaller than the scale of the calamity? How do we know whom to attend to first?
• Can we give a rough prediction of which clusters may be hit by the calamity soon?

A possible way out:

We shall use variable-size clusters for this purpose, i.e. we shall go on decreasing the size of the initially chosen square cluster (S) as long as the corresponding p-value comes out to be less than α. In this way we go on calculating the p-values. The lower the p-value, the darker we shade the region, so the darker shaded regions can be considered to be at higher risk.

A hypothetical situation:

[Figure: cluster S with p-value = 0.036 (say) and a smaller cluster S’ with p-value = 0.001 (say).]

Cluster S’ is at higher risk than cluster S, as we can see by comparing the p-values: the lower the p-value, the stronger the rejection of the null hypothesis. By varying the size of the clusters in this way, we can locate more precisely the regions that are at higher risk, so that help can be sent to those areas first.
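One possible reading of this shading rule, sketched in Python: every cell is assigned the smallest p-value among the significant square regions that cover it, so that a lower value corresponds to a darker shade. The function name `risk_map`, the region encoding (top, left, size) and the example layout are assumptions made for illustration; the p-values 0.036 and 0.001 are the ones from the hypothetical situation above.

```python
def risk_map(n, p_values, alpha=0.05):
    """Return an n x n grid of per-cell risk levels.

    Each cell holds the smallest p-value among the significant regions
    (p < alpha) that cover it, or None if no significant region covers it.
    Regions are given as (top, left, size) squares.
    """
    grid = [[None] * n for _ in range(n)]
    for (top, left, size), p in p_values.items():
        if p >= alpha:
            continue  # not significant at level alpha, so it is not shaded
        for r in range(top, top + size):
            for c in range(left, left + size):
                if grid[r][c] is None or p < grid[r][c]:
                    grid[r][c] = p
    return grid


# Hypothetical layout: cluster S is a 3 x 3 square, S' a single cell inside it,
# and a third region that is not significant.
p_values = {(0, 0, 3): 0.036, (1, 1, 1): 0.001, (3, 3, 1): 0.4}
for row in risk_map(4, p_values):
    print(row)
```

Sorting the cells by this value gives one simple ordering for deciding where help is sent first.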

A broader picture:

[Figure: grid G with clusters shaded light to dark, including cluster S.]

In the above grid, the light and dark shades indicate the severity of the epidemic in the various clusters. This is far easier to interpret than a list of a thousand anomalous clusters. Moreover, pictures of the same grid over successive time periods will also give a rough idea of the rate and direction of dispersion of the pathogens causing the epidemic. Hence, many clusters that are at risk can be discovered easily.

References:

1) D. B. Neill and A. W. Moore (2004), "Rapid detection of significant spatial clusters", In Proc. 10th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 256-265.

2) M. Kulldorff (1999), "Spatial scan statistics: models, calculations, and applications", In J. Glaz and N. Balakrishnan, eds., Scan Statistics and Applications, Birkhauser, 303-322.

3) M. Kulldorff (1997), "A spatial scan statistic", Communications in Statistics: Theory and Methods 26(6), 1481-1496.

4) M. Kulldorff and N. Nagarwalla (1995), "Spatial disease clusters: detection and inference", Statistics in Medicine 14, 799-810.
