JOURNAL OF CLIMATE, VOLUME 25, 15 DECEMBER 2012

A Bayes Factor Model for Detecting Artificial Discontinuities via Pairwise Comparisons

JUN ZHANG
CICS-NC, North Carolina State University, Raleigh, and Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, North Carolina

WEI ZHENG
Sanofi-Aventis, Boston, Massachusetts

MATTHEW J. MENNE
NOAA/National Climatic Data Center, Asheville, North Carolina

(Manuscript received 18 January 2012, in final form 31 May 2012)

ABSTRACT

In this paper, the authors present a Bayes factor model for detecting undocumented artificial discontinuities in a network of temperature series. First, they generate multiple difference series for each station with the pairwise comparison approach. Next, they treat the detection problem as a Bayesian model selection problem and use Bayes factors to calculate the posterior probabilities of the discontinuities and estimate their locations in time and space. The model can be applied to large climate networks and realistic temperature series with missing data. The effectiveness of the model is illustrated with two realistic large-scale simulations and four sensitivity analyses. Results from applying the algorithm to observed monthly temperature data from the conterminous United States are also briefly discussed in the context of what is currently known about the nature of biases in the U.S. surface temperature record.

1. Introduction

It is well known that temperature series may contain unknown artificial discontinuities (Peterson et al. 1998). Such discontinuities are typically caused by station moves, instrument changes, and/or microclimate changes surrounding a station. If left undetected, these artificial signals can bias attempts to estimate true climate signals (Menne et al. 2009). Consequently, many algorithms have been developed to detect the discontinuities. A list of representative publications includes Alexandersson (1986), Vincent (1998), Lund and Reeves (2002), Caussinus and Mestre (2004), Della-Marta and Wanner (2006), Lund et al. (2007), Reeves et al. (2007), Wang et al. (2007), Wang (2008a,b), Menne and Williams (2009), Hannart and Naveau (2009), Beaulieu et al. (2010), and Lu et al. (2010).

Corresponding author address: Jun Zhang, NOAA/National Climatic Data Center, 151 Patton Ave., Asheville, NC 28801. E-mail: [email protected]

DOI: 10.1175/JCLI-D-12-00052.1

© 2012 American Meteorological Society

Caussinus and Mestre (2004) and Menne and Williams (2009) adopt a pairwise comparison approach that is argued to have advantages in terms of avoiding the detection of true climate signals and utilizing difference series to increase signal-to-noise ratios (SNR) and improve hit rates (HRs). In Menne and Williams (2009), a semihierarchical splitting algorithm is applied to each difference series to identify all potential discontinuities and a rule-based algorithm is used to automatically assign discontinuities to the corresponding stations. However, for a specific target discontinuity, the estimated locations based on different target–neighbor difference series may not agree with each other. Menne and Williams (2009) solve the location uncertainty issue empirically. Hannart and Naveau (2009) and Beaulieu et al. (2010) address the location uncertainty issue from a Bayesian perspective. Hannart and Naveau (2009) propose a method based on Bayesian decision theory. Their method identifies subsequences containing a unique discontinuity by minimizing average posterior cost functions recursively. Beaulieu et al. (2010) develop a framework based on


Bayesian normal homogeneity test (BNHT) and apply BNHT recursively on the series to detect multiple discontinuities. However, such uncertainty can also be addressed from a Bayesian model selection perspective. Here, we describe a Bayes factor model selection procedure for the automatic detection of temperature series changepoints using pairwise comparisons. In the procedure, the Bayes factor or the evidence of discontinuities at each time step is first computed via a sliding sample window. Then, after the Bayes factors are obtained, we identify potential discontinuities by comparing the Bayes factors with an appropriate threshold and calculate the posterior probabilities of the discontinuities for each time step. Finally, we obtain the estimated locations for the discontinuities by computing the posterior mean for each location. In section 2, we describe the details of the Bayes factor model. In section 3, we discuss how to select model parameters. Some results based on simulations and real observations are discussed in section 4. Also, sensitivity analyses with respect to different model parameters are presented in section 4. The conclusions are in section 5.
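The sliding-window procedure just outlined can be sketched in code. This is an illustrative sketch only, not the authors' implementation: all function and variable names, the `bayes_factor` callable, and the simple grouping of contiguous above-threshold windows are our own assumptions.

```python
import numpy as np

def detect_breaks(diff_series, bayes_factor, threshold=4.0, half_window=30):
    """Sketch of the detection procedure: slide a window over the K
    target-neighbor difference series, combine the per-series log Bayes
    factors (median, as recommended in section 2d), threshold 2*log_e(BF),
    and return posterior-mean break locations.
    `bayes_factor(before, after)` is a placeholder for the Eq. (9) routine."""
    n_time = diff_series.shape[1]
    two_log_bf = np.full(n_time, -np.inf)
    for t in range(half_window, n_time - half_window):
        log_bfs = [np.log(bayes_factor(s[t - half_window:t], s[t:t + half_window]))
                   for s in diff_series]
        two_log_bf[t] = 2.0 * np.median(log_bfs)
    above = two_log_bf > threshold
    breaks = []
    t = 0
    while t < n_time:
        if above[t]:
            start = t
            while t < n_time and above[t]:
                t += 1
            window = np.arange(start, t)
            bf = np.exp(two_log_bf[window] / 2.0)
            post = bf / (1.0 + bf.sum())      # Eq.-(13)-style posterior, prior odds 1
            if post.sum() > 0.5:              # P(A | Y) > 0.5: a break is present
                loc = float((window * post).sum() / post.sum())
                breaks.append(int(round(loc)))
        else:
            t += 1
    return breaks
```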

2. Description of the Bayes factor model

a. Difference series

Before we describe the details of the model, we first define the difference series following Menne and Williams (2009). Suppose T(t) is the monthly temperature anomaly[1] at a station,

T(t) = C_T(t) + J_T(t) + ε_T(t),   (1)

where t is the monthly index, C_T(t) is the climate signal, J_T(t) is the artificial changepoint signal, and ε_T(t) is the noise. Here, T(t) has a series of correlated neighbors N_j(t), where j = 1, ..., n. The difference series can be expressed as

DT_j(t) = C_T(t) − C_j(t) + J_T(t) − J_j(t) + ε_T(t) − ε_j(t).   (2)

Because of the high correlation between T(t) and N_j(t), ε_T(t) − ε_j(t) typically has a smaller variance than does ε_T(t) or ε_j(t). Here, J_T(t) − J_j(t) contains the changepoint signals from either T(t) or its neighbor N_j(t). Because of multidecadal variations and trends, C_T and C_j are not stationary in time. Rather, it is assumed that the high spatial correlation inherent in temperature fields means that C_T and C_j are approximately equal. As discussed below, a rather narrow moving time window is used to identify local discontinuities; therefore, low-frequency "creeping" inhomogeneities are not likely to be efficiently identified by the Bayes factor approach.

[1] We calculate the mean monthly temperature for each month and obtain the monthly temperature anomaly by subtracting the corresponding monthly-mean temperature from the actual monthly temperature.

b. Bayes factors

In this case, we want to pick a good model for the difference series. As in other applications, Bayes factors are useful tools for selecting a "winner" among competing models. Suppose, for example, that we have M_0 and M_1, where M_0 means that there are no changepoints in DT_j(t) and M_1 means that there is a changepoint at month t in DT_j(t); we can compute the posterior probability for each model using the Bayes theorem,

P(M_i | Y) = P(Y | M_i) P(M_i) / [P(Y | M_0) P(M_0) + P(Y | M_1) P(M_1)],   (3)

where i = 0, 1 and Y is the observation. We obtain the posterior odds by

P(M_1 | Y) / P(M_0 | Y) = [P(Y | M_1) / P(Y | M_0)] × [P(M_1) / P(M_0)].   (4)

The Bayes factor is defined as

BF_10 = P(Y | M_1) / P(Y | M_0).   (5)

To obtain P(Y | M_i), we integrate out the parameters,

P(Y | M_i) = ∫_{Ω_i} P(Y | ω_i, M_i) ψ(ω_i | M_i) dω_i,   (6)

where P(Y | ω_i, M_i) is the probability density function with parameter ω_i under M_i and ψ(ω_i | M_i) is the prior density for ω_i under M_i. After obtaining the Bayes factor and assuming prior odds equal to 1, we often use the value of 2 log_e(BF_10) to evaluate the evidence against M_0 (Kass and Raftery 1995). For example, when the value of 2 log_e(BF_10) is between 2 and 6, there is positive evidence against M_0 (Kass and Raftery 1995). The definition in (5) can be extended to cases with more than two models. More details on Bayes factors and Bayesian model selection can be found in Kass and Raftery (1995) and MacKay (2003).
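The relationship between the Bayes factor, the prior odds, and the posterior model probability in Eqs. (3)-(5), together with the Kass and Raftery (1995) reading of 2 log_e(BF_10), can be sketched as follows (function names are ours, not from the paper):

```python
import math

def posterior_prob_m1(bf10, prior_odds=1.0):
    """Posterior probability of M_1 from the Bayes factor BF_10 and the
    prior odds P(M_1)/P(M_0); posterior odds = BF_10 * prior odds (Eq. (4))."""
    post_odds = bf10 * prior_odds
    return post_odds / (1.0 + post_odds)

def evidence_against_m0(bf10):
    """Kass and Raftery (1995) categories for 2*log_e(BF_10)."""
    two_log_bf = 2.0 * math.log(bf10)
    if two_log_bf < 2:
        return "not worth more than a bare mention"
    if two_log_bf < 6:
        return "positive"
    if two_log_bf < 10:
        return "strong"
    return "very strong"
```

With prior odds of 1, a Bayes factor of 1 leaves the two models equally probable, and 2 log_e(BF_10) between 2 and 6 corresponds to positive evidence against M_0.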

c. Bayes factors for one difference series

Suppose that for a short time window τ = {s, s+1, ..., s+l}, where s, s+1, ..., s+l are consecutive time indexes in a monthly temperature anomaly series, we have a set of hypotheses H = {M_t: t = 0, s, s+1, ..., s+l}, where M_t = {there is a discontinuity at month t in DT(t)} and M_0 = {no discontinuities}. To limit the potential hypotheses, we assume that there is at most one discontinuity in τ. According to Menne et al. (2009), the average distance between two detected discontinuities for the U.S. Historical Climatology Network (USHCN) monthly temperature data version 2 is about 180-240 months. Although the actual frequency of the discontinuities must be higher, we expect that most time windows will contain at most one discontinuity, especially when l ≪ 180. For each M_t, we write the competing models as

M_t: {μ_1,t ≠ μ_2,t}, t ≠ 0   (7)

and

M_0: {μ_1,t = μ_2,t = μ_t},   (8)

where μ_1,t is the mean before month t and μ_2,t is the mean after month t. With the above assumptions, we can compare M_s, ..., M_{s+l} with M_0 and produce the Bayes factors BF_s0, ..., BF_{s+l,0}. Then, after obtaining BF_s0, ..., BF_{s+l,0}, we can compute the posterior probabilities for all models. To calculate the probabilities, we apply the Bayesian two-sample t-test framework proposed by Gönen et al. (2005). Suppose the x_1,j's are observations before t and the x_2,j's are observations after t, for M_t with t = s, ..., s+l; we assume the observations are from two normal distributions N(x_1,j | μ_1,t, σ_t²) and N(x_2,j | μ_2,t, σ_t²) and the prior distribution is N[(μ_1,t − μ_2,t)/σ_t | μ_0, σ_0²] × 1/σ_t². For M_0, we assume the observations are from one normal distribution N(x_j | μ_t, σ_t²) and the prior distribution is 1/σ_t². After we observe the dataset Y for τ, we compute the Bayes factor for M_t by integrating out all parameters. Gönen et al. (2005) obtain a closed form for the Bayes factor,

BF_t0 = P(Y | M_t) / P(Y | M_0) = Υ_ν(z | n_p^{1/2} μ_0, 1 + n_p σ_0²) / Υ_ν(z | 0, 1),   (9)

where z is the usual two-sample t statistic; μ_0 and σ_0² are the prior mean and prior variance of (μ_1,t − μ_2,t)/σ_t; n_p is the pooled sample size; and Υ_ν(· | ξ, κ) is the noncentral t distribution with location parameter ξ, scale parameter κ^{1/2}, and ν degrees of freedom. It is possible to use different priors, such as a Cauchy prior (Rouder et al. 2009). However, we have found via simulations that, while we can achieve similar results with normal and Cauchy priors, numerical integration is required with the latter, which increases computational cost. So we choose to use normal priors in our calculations.
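A minimal sketch of Eq. (9), assuming SciPy (function and variable names are ours). The scaled noncentral t density Υ_ν(z | a, b) is evaluated by rescaling SciPy's noncentral t: if Δ ~ N(a, b−1) is the noncentrality, then z has density f_nct(z/√b; ν, a/√b)/√b.

```python
import numpy as np
from scipy import stats

def bf_t0(x1, x2, mu0=0.0, sigma0_sq=0.3696):
    """Sketch of the Gonen et al. (2005) two-sample Bayes factor, Eq. (9).
    x1, x2: observations before/after the candidate month."""
    n1, n2 = len(x1), len(x2)
    nu = n1 + n2 - 2
    n_pooled = n1 * n2 / (n1 + n2)            # "pooled sample size" n_p
    # usual two-sample t statistic with pooled variance
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / nu
    z = (np.mean(x1) - np.mean(x2)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    # marginal density of z under M_t: scaled noncentral t
    scale = np.sqrt(1.0 + n_pooled * sigma0_sq)
    nc = np.sqrt(n_pooled) * mu0 / scale
    num = stats.nct.pdf(z / scale, nu, nc) / scale
    den = stats.t.pdf(z, nu)                  # central t under M_0
    return num / den
```

A large mean shift between the two sides drives the Bayes factor far above 1; identical samples (z = 0) yield a Bayes factor below 1, i.e., evidence for M_0.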


d. Bayes factors for K difference series

Suppose now that we observe multiple sets {Y_1, Y_2, ..., Y_K} of observations from K difference series for a time window τ; we want to compute the Bayes factor for the set of target-neighbor differences. Because the Y_i's are not independent, we cannot combine them in one t statistic. Nevertheless, theoretically the Bayes factor for a model M_t can be obtained with

BF_t0 = P(Y_1 Y_2 ... Y_K | M_t) / P(Y_1 Y_2 ... Y_K | M_0).   (10)

However, it is difficult to define a parametric model for P(Y_1 Y_2 ... Y_K | M_t) and to integrate out the parameters directly, so in this case we apply the single-series Bayes factor model to each difference series and use the following formula to approximate the Bayes factor for K difference series:

log_e(BF_t0) ≈ (1/K){log_e[P(Y_1 | M_t)/P(Y_1 | M_0)] + log_e[P(Y_2 | M_t)/P(Y_2 | M_0)] + ... + log_e[P(Y_K | M_t)/P(Y_K | M_0)]}   (11)

= (1/K)[log_e(BF_t0,1) + log_e(BF_t0,2) + ... + log_e(BF_t0,K)],   (12)

where the BF_t0,i's are computed with Eq. (9). The rationale behind this approximation is that log_e(BF_t0,i) can be viewed as the weight of evidence from each dataset Y_i, and the mean of the log Bayes factors can be viewed as the average weight of evidence from the dataset {Y_1, Y_2, ..., Y_K} (Good 1985). For real applications, the median instead of the mean is recommended to mitigate the impact of outliers.
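The averaging in Eqs. (11)-(12), and the recommended median variant, amount to a one-line combination of the per-neighbor log Bayes factors (a sketch; the function name is ours):

```python
import numpy as np

def combined_log_bf(log_bfs, robust=True):
    """Approximate log_e(BF_t0) for K difference series (Eqs. (11)-(12)).
    The paper averages the per-series log Bayes factors; the median
    is recommended in practice to damp outlying neighbors."""
    log_bfs = np.asarray(log_bfs, dtype=float)
    return float(np.median(log_bfs)) if robust else float(np.mean(log_bfs))
```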

e. Estimating break locations

After we obtain the approximate Bayes factor at each time point of the set of difference series, we notice that the Bayes factors increase whenever we approach a potential discontinuity at the target. By selecting an appropriate threshold, a time window τ_m = {s_m, s_m+1, ..., s_m+l_m} with 2 log_e(BF) above the threshold will be identified to form the model set H_m.[2] Depending on the magnitude of the threshold and the sizes of the discontinuities, we may have several τ_m's for a station with multiple discontinuities. Each τ_m contains a potential discontinuity, and the number of τ_m's corresponds to the

[2] We focus on time windows with 2 log_e(BF) greater than zero, since such windows contain evidence that favors discontinuities.

number of potential discontinuities. Since we have the approximate Bayes factors BF_{t_m}0 for each M_{t_m} in H_m, we compute the posterior probability of M_{t_m} using the formula in Kass and Raftery (1995),

P(M_{t_m} | Y_1 Y_2 ... Y_K) = BF_{t_m}0 × c_{t_m} / [BF_00 c_0 + Σ_{j=s_m}^{s_m+l_m} BF_j0 × c_j],   (13)

where c_{t_m} = P(M_{t_m})/P(M_0) and BF_00 = c_0 = 1. If we define A = {there is a discontinuity in τ_m}, then P(A | Y_1, Y_2, ..., Y_K) > 0.5 indicates that there is a discontinuity in the time window. For time windows with P(A | Y_1, Y_2, ..., Y_K) > 0.5, we estimate the expected location E(L_m | Y_1 Y_2 ... Y_K, A) of the discontinuity. We compute the probability when there is a discontinuity by

P(M_{t_m} | Y_1, Y_2, ..., Y_K, A) = P(M_{t_m}, A, Y_1, Y_2, ..., Y_K) / P(A, Y_1, Y_2, ..., Y_K)   (14)

= P(M_{t_m}, Y_1, Y_2, ..., Y_K) / P(A, Y_1, Y_2, ..., Y_K)   (15)

= P(M_{t_m} | Y_1, Y_2, ..., Y_K) / P(A | Y_1, Y_2, ..., Y_K)   (16)

= P(M_{t_m} | Y_1, Y_2, ..., Y_K) / Σ_{t_m=s_m}^{s_m+l_m} P(M_{t_m} | Y_1, Y_2, ..., Y_K).   (17)

For time window τ_m, the posterior mean of the location of the discontinuity is

E(L_m | Y_1 Y_2 ... Y_K, A) = Σ_{t_m=s_m}^{s_m+l_m} E(L_m | Y_1 Y_2 ... Y_K, M_{t_m}, A) × P(M_{t_m} | Y_1 Y_2 ... Y_K, A)   (18)

= Σ_{t_m=s_m}^{s_m+l_m} t_m × P(M_{t_m} | Y_1 Y_2 ... Y_K, A).   (19)

We round E(L_m | Y_1 Y_2 ... Y_K, A) to the closest integer and obtain the final estimate of the location of the discontinuity. We could compute the variance of the location of the discontinuity with

Var(L_m | Y_1 Y_2 ... Y_K, A) = Σ_{t_m=s_m}^{s_m+l_m} t_m² P(M_{t_m} | Y_1 Y_2 ... Y_K, A) − [E(L_m | Y_1 Y_2 ... Y_K, A)]².   (20)

The plot of 2 log_e(BF) for a station based on simulated data is shown in Fig. 1 (details of the simulations are provided in section 4). In Fig. 1, there are three time windows containing 2 log_e(BF) above the threshold. The threshold value is 4, and the indexing refers to months beginning with January 1900 (i = 1) and going through December 1999 (i = 1200). The break around month 850 in the difference series comes from the neighbor series, and we notice that 2 log_e(BF) for this break is not above the threshold. The estimated locations for the three detected discontinuities are 562, 933, and 1007, and the estimated variances are 47.6, 7.5, and 14.7. For the first detected break, the true location is 570; for the second, the true location is 932; for the third, two true breaks are located at 1002 and 1011. We notice that all true breaks are within two estimated standard deviations of the estimated location and that the distance between the estimated location and the true location is approximately proportional to the estimated standard deviation. Thus, the value of the estimated variance could be used to measure the relative accuracy of the estimated location and to select temperature series when the accuracy of the time location is important.

3. Selection of parameters

Since the prior distribution has the form N[(μ_1 − μ_2)/σ | μ_0, σ_0²] × 1/σ², we need to specify the values of the prior mean μ_0 and the prior variance σ_0². We set the prior mean μ_0 to zero because we do not know whether the discontinuities will be positive or negative. A reasonable guess about the prior variance is that 90% of the discontinuities have a standardized size less than 1 (the accuracy of this guess will not significantly impact the results, as we will discuss later in the sensitivity analysis with respect to the prior variance in the results section). So the prior variance σ_0² can be decided by

P(|Δμ/σ| ≥ 1 | Δμ/σ ≠ 0) = 0.1,   (21)

where Δμ = μ_1 − μ_2, and the prior variance σ_0² is equal to 0.3696, since

σ_0² = (1/z_0.95)² = (1/1.6449)² = 0.3696.   (22)
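The choice in Eqs. (21)-(22) can be checked numerically (a sketch, assuming SciPy for the normal quantile and survival functions):

```python
from scipy.stats import norm

# With sigma_0^2 = (1/z_0.95)^2, a N(0, sigma_0^2) standardized break size
# falls outside (-1, 1) with probability 0.10, matching Eq. (21).
z95 = norm.ppf(0.95)                             # ~ 1.6449
sigma0_sq = (1.0 / z95) ** 2                     # ~ 0.3696, Eq. (22)
p_outside = 2 * norm.sf(1.0 / sigma0_sq ** 0.5)  # P(|X| >= 1), X ~ N(0, sigma0_sq)
```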

We also need to determine a reasonable sample window size n: that is, how many observations from each side of


FIG. 1. The 2 log_e(BF) plot for simulated station USC00026486 from the "clustering and sign bias C20C1" simulation (Williams et al. 2012): (top) temperature anomalies, (middle) a difference series between USC00026486 and one of its neighbors, and (bottom) 2 log_e(BF) for station USC00026486.

the potential break point will be included in the t statistic. Including too many observations will increase undesired biases (i.e., the window may encompass more than one break), and including too few will lead to large uncertainty. We select the sample window size through a series of sensitivity analyses. As discussed further in the results section, letting the window size n equal 30 months (on each side of the potential break) achieves good results on the simulated datasets. For the prior odds P(M_t)/P(M_0), the noninformative choice P(M_t)/P(M_0) = 1 is typically used in the calculation, although other choices are possible, as we will see later in the results section. To find the potential time windows τ_m, a threshold for 2 log_e(BF) needs to be specified. We follow the recommendation in Kass and Raftery (1995) and use the moderately positive evidence level of 4 as the threshold for 2 log_e(BF). This threshold seems to work well for various simulations. Finally, since we could potentially have hundreds of neighbors, we need to set an upper limit on the number of neighbors included in the computation. To avoid including neighbors with very low correlation,[3] we must also set a lowest correlation limit.

[3] The interstation correlation is estimated from the first-difference series.

Based on our simulations, the performance of the algorithm is not very sensitive to these two parameters. Therefore, we use 40 as the upper limit for the number of neighbors and 0.5 as the lowest correlation limit for the neighbors as in Menne and Williams (2009).
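The parameter choices of this section can be collected in one place (a sketch; the dictionary and its key names are our own, not from the paper's code):

```python
# Default settings from section 3 (names are illustrative).
PARAMS = {
    "prior_mean": 0.0,          # mu_0
    "prior_variance": 0.3696,   # sigma_0^2 from Eq. (22)
    "window_half_size": 30,     # months on each side of a candidate break
    "prior_odds": 1.0,          # P(M_t)/P(M_0)
    "bf_threshold": 4.0,        # threshold on 2*log_e(BF)
    "max_neighbors": 40,        # upper limit on neighbors used
    "min_correlation": 0.5,     # lowest correlation limit (first differences)
}
```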

4. Results

a. Results using simulated and real-world observations

We used two large simulated datasets to evaluate the effectiveness of our algorithm. Simulated datasets have been used to benchmark algorithms in the homogenization of radiosonde records (e.g., Titchner et al. 2009); paleoclimate reconstructions (e.g., Mann et al. 2005); and, more recently, surface temperature records (Venema et al. 2012; Williams et al. 2012). Here we used two of the simulated datasets described in Williams et al. (2012). Each of these synthetic datasets contains about 7700 station series, and each station has a maximum record length of 100 yr; however, many of the stations have much shorter records and are characterized by missing periods of varying length. The simulated temperature series are based on climate model output and contain correlated errors. As described in detail in Williams et al. (2012), the missing data patterns mimic the data

record and geographic distribution of stations in the U.S. Cooperative Observer Network. The simulations contain only step-function discontinuities, which are arguably the most prevalent type of artificial discontinuity. We have two simulated datasets, which we call simulations 1 and 2. In Williams et al. (2012), simulation 1 is referred to as "clustering and sign bias C20C1," and simulation 2 is referred to as "many small breaks with sign bias." For simulation 1, there are on average 7 breaks per series. The breaks are not randomly spaced through time but rather are "clustered," with most stations having a break within 30 yr of 1945 and during the 1980s to reflect the changes that occurred in the real-world network (Menne et al. 2009). For simulation 2, there are on average 10 breaks per series. Also, for both simulations, there is a sign bias to reflect what is known about the errors in the USHCN (Menne et al. 2009), which means that the imposed errors do not have a frequency distribution that is symmetric about zero. Rather, there is a preference for positive errors in simulation 1 and negative errors in simulation 2. Overall, simulation 1 contains a mixture of large, medium, and small discontinuities and resembles the type of errors thought to be present in USHCN temperature data. Simulation 2 has predominantly small breaks and represents a very challenging situation. In addition, for simulation 2, breaks are very close to each other, which makes the detection even more challenging. More details of the two simulations can be found in Table 1.

TABLE 1. The characteristics of the two simulations, as in Williams et al. (2012).

Analog world: Clustering and sign bias C20C1
  Model: MIROC3.2(hires) (K-1 Model Developers 2004)
  Forcings and period in model years: twentieth-century forcings, 1900-99, run 1
  Break and metadata structure imparted: 70% of stations within 7 yr in the 1980s, with metadata (σ = 0.7; avg = 0.35); 70% of stations within 30 yr from 1945, with metadata (σ = 0.4; avg = −0.2); average of one per station in the latter half of the record, with metadata (σ = 0.5; avg = 0.8); average of 2 breaks per station associated with metadata (σ = 0.8; avg = 0); no metadata, more prevalent early in the record, 4 per station on average (σ = 0.8; avg = 0); average of 2 metadata events not associated with a break

Analog world: Very many mainly small breaks
  Model: NCAR PCM (Washington et al. 2000)
  Forcings and period in model years: CO2 +1% yr⁻¹ to 2 × CO2, 0071-0170
  Break and metadata structure imparted: 2 breaks on average per station seeded randomly throughout the network and over time, with metadata (σ = 1; avg = 0); 2 breaks on average per station but twice as prevalent later in the record and sign biased, with metadata (σ = 0.25; avg = −0.2); 2 breaks on average per station but twice as prevalent later in the record, with metadata (σ = 0.25; avg = 0); 4 breaks per station unassociated with metadata, more prevalent early, slight sign bias (σ = 0.2; avg = −0.075)

We used the parameter settings described in section 3 to identify breaks in the two datasets. The result for simulation 1 is listed in Table 2. Our algorithm detects

84.50% of the true large discontinuities.[4] For detected large discontinuities, the false detection rate (FDR) is only 1.11%. The overall hit rate is 47.11%, and the overall false detection rate is 11.82%. The result for simulation 2 is listed in Table 3. We detected 11.17% of the total breaks, and the false detection rate is 9.55%. The median of the estimated SNR (SNR-hat) is 0.78 for simulation 1 and 0.19 for simulation 2 (Table 4). Although the overall hit rates and false detection rates may not be impressive at first glance, we should recognize that the simulated temperature series are quite realistic and many of the imposed breaks are small (near zero). Thus, the hit rates and false detection rates achieved by our algorithm are reasonable and comparable to results using the Menne and Williams (2009) pairwise homogenization algorithm (PHA; Tables 2, 3). Particularly for simulation 2, "many small breaks with sign bias," most of the breaks are smaller than 0.5°C, and the median SNR-hat for small discontinuities is only 0.16. Thus, any algorithm will likely have difficulty boosting the hit rate without increasing the false detection rate. The computations for the examples were carried out in the R language on a 2.66-GHz CPU. Computation time is roughly 2.7 s per station, or 5.8 h for a 7700-station network. To further evaluate the efficiency of the Bayes algorithm, a simple adjustment factor was calculated for

[4] We classify the discontinuities into three categories in terms of their actual sizes to help readers understand the performance of the model. The three categories are defined as follows: large, δ ≥ 1.0°C; medium, 0.5°C ≤ δ < 1.0°C; and small, δ < 0.5°C, where δ is the size of a discontinuity.


TABLE 2. Results obtained in simulation 1, "clustering and sign bias C20C1."

Size[a]   Hit rates[b]       False detection rates[c]   Tot breaks   Tot detections
Large     84.32% | 73.36%    0.79% | 7.4%               8452         7387 | 7546
Medium    65.21% | 63.36%    9.72% | 12.76%             12 567       9517 | 9011
Small     17.35% | 25.75%    36.94% | 41.63%            18 218       4063 | 6865
All       47.11% | 48.05%    11.85% | 19.50%            39 237       20 967 | 23 422

[a] Large is δ ≥ 1.0°C, medium is 0.5°C ≤ δ < 1.0°C, and small is δ < 0.5°C, where δ is the size of a discontinuity.
[b] The hit rate is equal to the number of detected breaks divided by the number of total breaks. In each pair of values, the first is from the algorithm in this paper without the use of metadata, and the second is from the Menne and Williams (2009) PHA, version 52i, algorithm using metadata. The metadata describing the change dates are incomplete and not always accurate (see Table 1).
[c] The false detection rate is equal to the number of false detections divided by the number of the total detections.

TABLE 3. Results obtained in simulation 2, "many small breaks with sign bias."

Size[a]   Hit rates[b]       False detection rates[c]   Tot breaks   Tot detections
Large     86.14% | 75.17%    2.22% | 6.65%              2972         3058 | 3085
Medium    49.88% | 57.72%    11.12% | 14.55%            5587         5458 | 6611
Small     5.53% | 10.80%     15.12% | 27.15%            77 581       2116 | 7288
All       11.19% | 16.06%    9.36% | 18.52%             86 140       10 632 | 16 984

[a] Large is δ ≥ 1.0°C, medium is 0.5°C ≤ δ < 1.0°C, and small is δ < 0.5°C, where δ is the size of a discontinuity.
[b] The hit rate is equal to the number of detected breaks divided by the number of total breaks. In each pair of values, the first is from the algorithm in this paper without the use of metadata, and the second is from the Menne and Williams (2009) PHA, version 52i, algorithm using metadata. The metadata describing the change dates are incomplete and not always accurate (see Table 1).
[c] The false detection rate is equal to the number of false detections divided by the number of the total detections.

each of the break dates identified. For each detected break, multiple adjustments were first calculated using a 30-month window (on each side of the break) on each difference series, and the median of the adjustments was used as the final adjustment factor. These adjustments were then applied to the 1218 simulated series that are corollaries to the real USHCN station temperature series (Menne et al. 2009). A conterminous U.S. (CONUS) average time series was then computed as in Williams et al. (2012) using the 1218 adjusted series as well as the raw input unadjusted series (i.e., with errors) and the underlying series with no seeded errors. As shown in Fig. 2, applying the Bayes factor adjustments moves the CONUS average trends closer to their true "homogeneous" values. In the case of simulation 1, the adjusted trends are smaller than the raw input, indicating that the adjustments are accounting for the input data errors, which have a positive sign bias. In simulation 2, the errors have a negative bias, and the adjusted trends are therefore larger than the raw input. Not surprisingly, the adjustments do not move the CONUS average trend too far; rather, they do not move it far enough. This is an indication that the adjustments are incomplete rather than overly aggressive, especially in the case of simulation 2, where the detection rate is relatively low. As discussed in Williams et al. (2012), PHA-based adjustments behave similarly (results using the operational configuration of the PHA, version 52i, are also shown in Fig. 2a). Notably, the Bayes factor algorithm moves the trend nearly as far as the operational PHA algorithm in simulation 2 but

not in simulation 1. Because the detection rates are comparable between the two algorithms, the reason for the differential adjustments may be related to the way in which the Bayes factor adjustments are calculated (i.e., using a very limited time window) compared to the PHA and/or to the fact that the Bayes factor algorithm, unlike the PHA, does not currently use metadata as a prior. As mentioned in the conclusions, the adjustment method and exploiting available metadata are both logical options for future Bayes factor algorithm improvement. The Bayes factor algorithm was also applied to the mean monthly maximum and mean monthly minimum temperature series from the full 7000+ stations in the U.S. Cooperative Observer Program network with the parameters described in section 3. Similar to the above, a CONUS-wide average was computed from the 1218-station USHCN subset of stations using both the raw input series and the adjusted series. The time series and trend values for maximum and minimum temperatures

TABLE 4. Median of the estimated SNR (SNR-hat).* The sizes are as follows: large is δ ≥ 1.0°C, medium is 0.5°C ≤ δ < 1.0°C, and small is δ < 0.5°C, where δ is the size of a discontinuity.

Median of SNR-hat   All    Large   Medium   Small
Simulation 1        0.78   1.94    1.06     0.33
Simulation 2        0.19   1.59    0.70     0.16

* SNR-hat is defined as SNR-hat = δ/σ̂, where δ is the true size of the discontinuity and σ̂ is the estimated standard deviation of the corresponding difference series.

FIG. 2. (top) Annual average CONUS temperature series calculated using the USHCN monthly temperature series from the simulation-1 dataset. Spatial averages are based on adjustments calculated from the Bayes factor algorithm (black) and the Menne and Williams (2009) PHA (orange). CONUS averages for the nonhomogenized (raw) input values with the seeded errors are shown in red. Averages based on the true data series without errors are shown in green. (bottom) As in (top), but for simulation 2.

are shown in Fig. 3. As in the case of the PHA adjustments (also shown), the adjusted maximum temperature trends based on the Bayes factor algorithm are larger than the raw, unadjusted trends. This is consistent with the present understanding that maximum temperatures in the United States contain pervasive negative biases, especially since 1950. These biases are primarily related to changes in the time of observation

and a widespread change from liquid-in-glass thermometers to electronic thermistors (see Menne et al. 2009; Williams et al. 2012). For minimum temperatures, there are apparent conflicting biases in the USHCN temperature measurements, with a negative time of observation bias dominating since 1950 and a positive bias associated with the change to electronic thermistors that occurs largely in the mid-1980s. The Bayes factor adjustments


FIG. 3. As in Fig. 2, but for real-world monthly-mean (top) maximum and (bottom) minimum temperatures.

to minimum temperature trends are also broadly consistent with this understanding.

b. Evaluation of parameter sensitivity

For simulation 1, we randomly selected 5% of the stations and performed sensitivity analyses of the HRs and FDRs with respect to the prior variance, the prior odds, the threshold value of 2 log_e(BF), and the sample window size. Figure 4a shows the sensitivity analysis of the HRs and FDRs with respect to the prior variance. We

observe that the HRs and FDRs are not overly sensitive to the choice of the prior variance unless the prior variance is unreasonably small. This means that the selection of the prior variance is not a concern for the model. Figure 4b contains the sensitivity analysis of the HRs and FDRs with respect to the log10(prior odds) of P(M_t)/P(M_0). The HRs and FDRs are not very sensitive to the choice of prior odds when the log10(prior odds) is greater than −2. The sensitivity analysis of the HRs and FDRs with respect to the sample window size is shown

15 DECEMBER 2012 · ZHANG ET AL. · 8471

FIG. 4. The sensitivity analysis of HRs and FDRs with respect to (a) the prior variance, (b) the log_10(prior odds), (c) the sample window size, and (d) the threshold value of 2 log_e(BF).

in Fig. 4c. Very large or very small sample window sizes lower the HRs. Also, very large sample window sizes cause the FDRs to increase, perhaps because nearby discontinuities are included in the sample window. Figure 4d shows the sensitivity analysis of the HRs and FDRs with respect to the threshold value of 2 log_e(BF). The HRs and FDRs are sensitive to the threshold value. Fortunately, the FDRs decrease more rapidly than the HRs as the threshold value increases. From Eq. (9), we know that 2 log_e(BF) is a function of the SNR, the sample window size n, and the prior variance σ₀² if the prior mean μ₀ is equal to 0.⁵

⁵ If the SNR is known, then we can replace the t statistic z in Eq. (9) with SNR × √n.
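The HR/FDR trade-off against the threshold can be illustrated with a toy calculation (the scores and labels below are hypothetical, not values from simulation 1):

```python
def hit_rate_and_fdr(scores, labels, threshold):
    """Hit rate and false detection rate when a break is declared whenever
    its 2*log_e(BF) score exceeds the threshold; labels mark real breaks."""
    flagged = [lab for s, lab in zip(scores, labels) if s > threshold]
    hits = sum(flagged)
    hr = hits / sum(labels)
    fdr = (len(flagged) - hits) / max(len(flagged), 1)
    return hr, fdr

scores = [6.0, 3.5, 1.2, 8.0, 0.5]          # hypothetical 2*log_e(BF) values
labels = [True, True, False, True, False]   # True = seeded (real) break
print(hit_rate_and_fdr(scores, labels, 1.0))  # (1.0, 0.25): all breaks found, one false alarm
print(hit_rate_and_fdr(scores, labels, 2.0))  # (1.0, 0.0): raising the threshold drops the FDR first
```

In this toy case, raising the threshold from 1 to 2 removes the false detection before any hit is lost, mirroring the behavior described for Fig. 4d.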

From Fig. 5, we know that 2 log_e(BF) is not sensitive to the change of the prior variance σ₀², and we can follow the procedure in section 3 to select a prior variance. For breaks with relatively large SNR values, Fig. 5 shows that 2 log_e(BF) is sensitive to the change of the sample window size n. However, a large sample window may include nearby discontinuities. The choice of the sample window size n therefore depends on prior information about the density of the discontinuities and the level of the SNR. To apply the model to real observations, we could start from a relatively small window and gradually increase the size of the window until the HRs decrease. From Fig. 5, we also notice that increasing the threshold for 2 log_e(BF) will effectively eliminate false detections with small SNR values and lower the FDRs. Choosing a different prior odds will also affect the FDRs.
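To make the dependence on n, the SNR, and the prior variance concrete, here is a minimal sketch of a 2 log_e(BF) statistic for a mean shift, assuming a simplified known-noise-variance model rather than the paper's exact Eq. (9), which is not reproduced in this section; the default prior variance 0.3696 echoes the value quoted in Fig. 5:

```python
import math

def two_loge_bf(z, n, mu0=0.0, var0=0.3696):
    """2*log_e(BF) for a mean shift in a difference-series window.

    Sketch under a known-noise-variance assumption (not the paper's Eq. (9)):
    the standardized window mean m = z/sqrt(n) is N(0, 1/n) under the no-break
    model and N(mu0, 1/n + var0) under the break model, with shift prior
    delta ~ N(mu0, var0).
    """
    m = z / math.sqrt(n)                 # standardized sample mean
    var1 = 1.0 / n + var0
    log_f1 = -0.5 * math.log(2 * math.pi * var1) - (m - mu0) ** 2 / (2 * var1)
    log_f0 = -0.5 * math.log(2 * math.pi / n) - n * m ** 2 / 2
    return 2.0 * (log_f1 - log_f0)

# Footnote 5: with known SNR, the t-like statistic z can be replaced by SNR * sqrt(n)
for n in (24, 60, 120):
    print(n, round(two_loge_bf(1.0 * math.sqrt(n), n), 2))
```

In this sketch, the evidence grows with the window size n for a fixed SNR, while moderate changes in var0 shift it only mildly, in qualitative agreement with the behavior described for Fig. 5.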


FIG. 5. (a) 2 log_e(BF) as a function of the sample window size and SNR value when the prior variance equals 0.3696 (each curve has the same SNR value), (b) 2 log_e(BF) as a function of the sample window size and SNR value when the prior variance equals 5.3696, (c) 2 log_e(BF) as a function of the sample window size and SNR value when the prior variance equals 10.3696, and (d) the box plot of SNR values for each size category in simulation 1.

In the next paragraph, we discuss the choice of prior odds. If we use flat priors, that is, $c_j = c$ for all $j \in \{s_m, \ldots, s_m + l_m\}$, then from Eq. (13) we know

$$P(A \mid Y_1, Y_2, \ldots, Y_K) = \sum_{j=s_m}^{s_m+l_m} P(M_j \mid Y_1, Y_2, \ldots, Y_K) = \frac{\sum_{j=s_m}^{s_m+l_m} (\mathrm{BF}_{j0} \times c)}{1 + \sum_{j=s_m}^{s_m+l_m} (\mathrm{BF}_{j0} \times c)}, \qquad (23)$$

where $A = \{\text{there is a discontinuity in } t_m\}$. Because the condition of having a break is $P(A \mid Y_1, Y_2, \ldots, Y_K) > 0.5$, from Eq. (23) we know

$$c \sum_{j=s_m}^{s_m+l_m} \mathrm{BF}_{j0} > 1. \qquad (24)$$

Since

$$c \sum_{j=s_m}^{s_m+l_m} \mathrm{BF}_{j0} > 1 \iff \log_e\!\left[\sum_{j=s_m}^{s_m+l_m} \mathrm{BF}_{j0} \Big/ (l_m + 1)\right] > \log_e[c^{-1}/(l_m + 1)], \qquad (25)$$

the sufficient and necessary condition for $P(A \mid Y_1, Y_2, \ldots, Y_K) > 0.5$ is

$$\log_e\!\left[\sum_{j=s_m}^{s_m+l_m} \mathrm{BF}_{j0} \Big/ (l_m + 1)\right] > \log_e[c^{-1}/(l_m + 1)]. \qquad (26)$$
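Equations (23), (24), and (26) can be checked numerically with a small sketch (the BF_j0 values below are hypothetical):

```python
import math

def posterior_break_prob(bfs, c=1.0):
    """Eq. (23): posterior probability of a discontinuity in the window,
    under flat prior odds c for every candidate position j = s_m..s_m+l_m."""
    s = sum(bf * c for bf in bfs)
    return s / (1.0 + s)

bfs = [0.2, 3.0, 0.4]                      # hypothetical BF_j0 values (l_m + 1 = 3)
p = posterior_break_prob(bfs, c=1.0)

# Eq. (24): with flat priors, P(A|Y) > 0.5 iff c * sum(BF_j0) > 1
assert (p > 0.5) == (sum(bfs) > 1.0)

# Eq. (26): equivalently, log of the average BF must exceed log[c^{-1}/(l_m + 1)]
lhs = math.log(sum(bfs) / len(bfs))
rhs = math.log(1.0 / len(bfs))             # c = 1
assert (lhs > rhs) == (p > 0.5)
```

For these values the posterior probability is 3.6/4.6 ≈ 0.78, so a break would be declared, and both equivalent conditions agree.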

For the usual noninformative prior $c = 1$, Eq. (26) is always valid. If we want to achieve the maximum hit rate for a certain threshold of $2\log_e(\mathrm{BF})$, then we should use the noninformative prior. If we want to lower the FDR and the threshold for $2\log_e(\mathrm{BF})$ is $T$, we can choose a different $c$ with

$$2\log_e[c^{-1}/(\hat{l}_1 + 1)] = Td, \qquad (27)$$

where $d > 0$ and $\hat{l}_1$ is the estimated average time window length.
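A sketch of this prior choice, solving Eq. (27) for c (reading Td as the product T × d is an assumption about the paper's notation):

```python
import math

def prior_weight_for_threshold(T, d, l_hat):
    """Solve Eq. (27), 2*log_e[c^{-1}/(l_hat + 1)] = T*d, for the flat prior
    weight c. T is the 2*log_e(BF) threshold, d > 0 is a tuning factor, and
    l_hat stands in for the estimated average time window length."""
    return 1.0 / ((l_hat + 1) * math.exp(T * d / 2.0))

c = prior_weight_for_threshold(T=4.0, d=1.0, l_hat=5)
assert 0.0 < c < 1.0   # below the noninformative c = 1, trading hit rate for a lower FDR
```

Choosing c below 1 in this way raises the right-hand side of Eq. (26), making the break condition harder to satisfy and thus lowering the FDR.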

5. Conclusions

Detecting artificial discontinuities in real temperature series usually carries uncertainty. For example, a large fraction of the breaks in most surface temperature networks are probably undocumented, and we may have more than one plausible location for a discontinuity. Because the "true" climate signal in these series is unknown, the best that we can hope to do is to quantify the uncertainty, and one way to do this is to approach the changepoint detection problem in multiple different ways (Thorne et al. 2011). With the model in this paper, we can quantify the uncertainty of the location of a break and estimate its most likely location in a probabilistic framework. We have shown in the examples that the proposed model achieves reasonable results on simulated large-scale, realistically noisy temperature series. The results of the sensitivity analyses also provide evidence that the model is useful for real applications. In the future, we plan to use available metadata as a prior in the Bayes factor algorithm, as well as to test alternative ways to calculate adjustments for the identified breaks. These future algorithm enhancements will allow for a more comprehensive comparison with other homogenization algorithms and help quantify the structural uncertainty associated with surface temperature homogenization.

Acknowledgments. We are grateful to Dr. Peter Thorne for his assistance with the data preparation and to Dr. Murray Clayton for his comments on our model. The comments of three anonymous reviewers also greatly improved the manuscript.

REFERENCES

Alexandersson, H., 1986: A homogeneity test applied to precipitation data. J. Climatol., 6, 661–675.
Beaulieu, C., T. Ouarda, and O. Seidou, 2010: A Bayesian normal homogeneity test for the detection of artificial discontinuities in climatic series. Int. J. Climatol., 30, 2342–2357.
Caussinus, H., and O. Mestre, 2004: Detection and correction of artificial shifts in climate series. J. Roy. Stat. Soc., 53C, 405–425.
Della-Marta, P., and H. Wanner, 2006: A method of homogenizing the extremes and mean of daily temperature measurements. J. Climate, 19, 4179–4197.
Gönen, M., W. Johnson, Y. Lu, and P. Westfall, 2005: The Bayesian two-sample t test. Amer. Stat., 59, 252–257.
Good, I., 1985: Weight of evidence: A brief survey. Bayesian Statistics, J. Bernardo et al., Eds., Elsevier, 249–269.
Hannart, A., and P. Naveau, 2009: Bayesian multiple change points and segmentation: Application to homogenization of climatic series. Water Resour. Res., 45, W10444, doi:10.1029/2008WR007689.
K-1 Model Developers, 2004: K-1 coupled GCM (MIROC) description. K-1 Tech. Rep. 1, 39 pp.
Kass, R., and A. Raftery, 1995: Bayes factors. J. Amer. Stat. Assoc., 90, 773–795.
Lu, Q., R. Lund, and T. Lee, 2010: An MDL approach to the climate segmentation problem. Ann. Appl. Stat., 4, 299–319.
Lund, R., and J. Reeves, 2002: Detection of undocumented changepoints: A revision of the two-phase regression model. J. Climate, 15, 2547–2554.
——, X. Wang, Q. Lu, J. Reeves, C. Gallagher, and Y. Feng, 2007: Changepoint detection in periodic and autocorrelated time series. J. Climate, 20, 5178–5190.
MacKay, D. J. C., 2003: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 628 pp.
Mann, M., S. Rutherford, E. Wahl, and C. Ammann, 2005: Testing the fidelity of methods used in proxy-based reconstructions of past climate. J. Climate, 18, 4097–4107.
Menne, M. J., and C. N. Williams, 2009: Homogenization of temperature series via pairwise comparisons. J. Climate, 22, 1700–1717.
——, ——, and R. S. Vose, 2009: The U.S. Historical Climatology Network monthly temperature data, version 2. Bull. Amer. Meteor. Soc., 90, 993–1007.
Peterson, T. C., and Coauthors, 1998: Homogeneity adjustment of in situ atmospheric climate data: A review. Int. J. Climatol., 18, 1493–1517.
Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Q. Lu, 2007: Comparison of techniques for detection of discontinuities in temperature series. J. Appl. Meteor. Climatol., 46, 900–914.
Rouder, J. N., P. L. Speckman, D. Sun, R. D. Morey, and G. Iverson, 2009: Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev., 16, 225–237.
Thorne, P. W., and Coauthors, 2011: Guiding the creation of a comprehensive surface temperature resource for twenty-first-century climate science. Bull. Amer. Meteor. Soc., 92, ES40–ES47.
Titchner, H., P. W. Thorne, M. P. McCarthy, S. F. B. Tett, L. Haimberger, and D. E. Parker, 2009: Critically assessing tropospheric temperature trends from radiosondes using realistic validation experiments. J. Climate, 22, 465–485.
Venema, V. K. C., and Coauthors, 2012: Benchmarking monthly homogenization algorithms. Clim. Past, 8, 89–115, doi:10.5194/cp-8-89-2012.
Vincent, L., 1998: A technique for the identification of inhomogeneities in Canadian temperature series. J. Climate, 11, 1094–1104.
Wang, X. L., 2008a: Accounting for autocorrelation in detecting mean shifts in climate data series using the penalized maximal t or F test. J. Appl. Meteor. Climatol., 47, 2423–2444.
——, 2008b: Penalized maximal F test for detecting undocumented mean shift without trend change. J. Atmos. Oceanic Technol., 25, 368–384.
——, Q. H. Wen, and Y. Wu, 2007: Penalized maximal t test for detecting undocumented mean change in climate data series. J. Appl. Meteor. Climatol., 46, 916–931.
Washington, W., and Coauthors, 2000: Parallel climate model (PCM) control and transient simulations. Climate Dyn., 16, 755–774.
Williams, C. N., M. J. Menne, and P. W. Thorne, 2012: Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. J. Geophys. Res., 117, D05116, doi:10.1029/2011JD016761.