Climatic Change (2010) 98:471–491 DOI 10.1007/s10584-009-9741-9

A procedure for automated quality control and homogenization of historical daily temperature and precipitation data (APACH): part 1: quality control and application to the Argentine weather service stations Jean-Philippe Boulanger · J. Aizpuru · L. Leggieri · M. Marino

Received: 23 June 2008 / Accepted: 3 August 2009 / Published online: 1 October 2009 © Springer Science + Business Media B.V. 2009

Abstract The present paper describes the quality-control component of an automatic procedure (APACH: A Procedure for Automated Quality Control and Homogenization of Weather Station Data) developed to control the quality of and homogenize historical daily temperature and precipitation data from meteorological stations. The quality-control method is based on a set of decision-tree algorithms analyzing precipitation and minimum and maximum temperature separately. All our tests are non-parametric and are therefore potentially useful in regions or countries presenting climates different from those observed in Argentina. The method is applied to the 1959–2005 historical daily database of the Argentine National Weather Service. Our results are coherent with the history of the Weather Service and, more specifically, with the history of implementation of systematized quality control processes. In temperature, our method detects a larger number of suspect values before 1967 (when there was no quality control) and after 1997 (when only real-time quality control had been applied). In precipitation, the detection of errors in extreme precipitation is complex, but our method clearly detected a strong decrease in the number of potential outliers after 1976, when the National Weather Service was militarized and the network was strongly reduced, focusing more on airport weather stations. Also in precipitation, we analyze the long dry sequences in detail and are able to identify potentially erroneous long sequences. This is important for the use of the data in hydrological or agricultural impact studies. Finally, all the data are flagged with codes representing the path followed by the record in our decision-tree algorithms. While each code is associated with one of the categories ("Useful", "Need-Check", "Doubtful" or "Suspect"), the final user is free to redefine this category assignment.

J.-P. Boulanger (B)
LOCEAN, UMR CNRS/IRD/UPMC, Tour 45–55/Etage 4/Case 100, UPMC, 4 Place Jussieu, 75252 Paris Cedex 05, France
e-mail: [email protected]

J. Aizpuru · L. Leggieri
Departamento de Ciencias de la Computación, Facultad de Ciencias Exactas y Naturales, University of Buenos Aires, Buenos Aires, Argentina

M. Marino
Servicio Meteorológico Nacional, 25 de Mayo 658-(C1002ABN), Buenos Aires, Argentina

Present Address:
J.-P. Boulanger
Departamento de Ciencias de la Atmosfera y los Oceanos, Facultad de Ciencias Exactas y Naturales, University of Buenos Aires, Buenos Aires, Argentina

1 Introduction

In the context of climate change, rescuing, quality-controlling and homogenizing daily weather data has become a crucial task for climate research teams around the world (Alexandersson and Moberg 1997; Moberg and Alexandersson 1997; Wijngaard et al. 2003; Caussinus and Mestre 2004; Rusticucci and Barrucand 2004; Brandsma and Können 2006; Rusticucci and Renom 2007). Understanding past daily climate variability and the relationship between extreme events (warm or cold spells, floods, droughts, ...) and large-scale variability is very important for evaluating future climate change scenarios and, more specifically, their impacts on vulnerable components of society such as agriculture, hydrology or health.

In the CLARIS project, various teams have performed their own quality control on historical daily data in Argentina (Rusticucci and Barrucand 2004), Uruguay (Rusticucci and Renom 2007), Brazil or Chile. In some cases, homogenization methods have also been applied in order to correct the data (Rusticucci and Renom 2007). The present work came at the end of the project and was not originally scheduled in the CLARIS project. Our goal is to create an automated procedure consistent with previous CLARIS and National Weather Service works, and to provide the CLARIS LPB Project (http://www.claris-eu.org) with an automated procedure to check the thousands of stations of the La Plata Basin. For that reason, no paper in this issue has used these data. However, since we made the dataset available, it has become a reference for the Argentine National Weather Service and the CLARIS LPB colleagues. This objective is an important step toward the development of an extended weather station database compiling hundreds of stations, as scheduled in the CLARIS LPB (La Plata Basin) FP7 European project (2008–2012). A homogenization procedure will soon be implemented in order to provide homogenized daily data for secular trend and extreme event analyses.

In view of the importance of daily historical data for each country, we believed it crucial to work closely with the Argentine National Weather Service in the application and evaluation of our procedure. This collaboration allows access to the most complete database available for the country and provides the National Weather Service with a list of potentially suspect data in their historical database. In the longer term, this collaboration will lead to a national homogenized database, crucial to evaluate changes in secular trends and extreme events in the past and their possible evolution in the context of climate change scenarios.


This paper is organized as follows: Section 2 presents the historical daily dataset used to test the automated procedure; Sections 3 and 4 describe the major tests applied to the temperature and precipitation data, respectively, and discuss the decision trees used in the automated procedure; Section 5 presents the quality control results and discusses them in relation to the National Weather Service history; finally, Section 6 summarizes the method and gives some wider perspectives.

2 Data

The daily dataset covers the 1959–2005 period for stations located all over Argentina. However, as clearly displayed in Fig. 1, the station density is poor south of 40° S (Patagonia). The best station density is found in the Province of Buenos Aires and in the regions of important agricultural activity.

Fig. 1 Spatial location of the Argentine weather stations providing daily temperature and precipitation records during all or part of the 1959–2005 period


Many stations are located at airports. If we focus on stations with less than 10% of missing data in each year (Fig. 2), their number increases from 1959 to 1973–1975; since then (Military Coup in 1976), the number of stations with less than 10% of missing data has steadily decreased, in parallel with the decrease in the total number of stations belonging to the network. Moreover, during the first democratic government (1983–1989), hyperinflation strongly affected the economy, and further reductions in the observational network have occurred since 1999. This result is confirmed by the histograms displayed in Fig. 3, which show that most of the stations were created in the 1960s or around 1990. Moreover, most of the stations that stopped recording did so during the 1970s, 1990s or 2000s. After the Military Coup in 1976, most Argentine institutions were militarized, including the National Weather Service (NWS). Most of the NWS station observers were replaced by military personnel lacking training and experience. Overall, more than 40% of the network has been lost since 1970. Fortunately, more than 70 stations (still operational) have records going back more than 40 years (Fig. 3b). Finally, the percentage of data availability in the entire database (on a yearly basis) shows that much of the information now missing had in fact been recorded before the end of the 1970s. Indeed, at that time the NWS started systematizing its data collection and copying data from cards to tapes.


Fig. 2 Evolution of the weather station network during the 1959–2005 period. The number of stations with less than 10% of missing data is shown, with similar results for minimum temperature (dashed), maximum temperature (dash-dot) and precipitation (plain)


Fig. 3 Histograms of number of stations per (top) year of initiation, (middle) year of termination and (bottom) duration. For the sake of clarity, 1959, 2005 and 2006 are not displayed as their values are “artificial”. Indeed, in our dataset, 158 stations start in 1959, 16 end in 2005 (on 31/12 meaning that the 2006 data are not provided) and 107 end at the dataset final date. Most of the station data with long records (bottom) are still operational

Unfortunately, many of the paper cards were damaged by dust and humidity, and many data were not correctly read. Although such data could be recovered from the paper records, the lack of human resources (especially at the time of the digitization in 1978 and after 1985) did not allow the missing data to be filled in. The present data were manually controlled in two steps: first, an operator controlled the data in real time; then, at the end of each decade, the data were manually checked for temporal and spatial coherence. At the time we accessed the data, the 1990–1999 decade was in the process of quality control, and thus only the period October 1967 to 1997 had been controlled twice. These data (period 1960–2000) have been used for extreme event detection in Vincent et al. (2005) and in Haylock et al. (2006).

3 Tests and decision-tree for daily temperature records

Quality control methods applied to daily temperature data fall into two fundamentally different groups: (1) single-station methods, where data from one station at a time are analysed, and (2) spatial methods, where data from several neighbouring stations are analysed. Spatial methods are much more reliable in detecting or confirming potential errors because they can estimate the probability of error detection based on the coherence between neighbouring observations.

[Fig. 4: decision tree for daily temperature (Outlier test, DIP test, Identical Sequence test and Neighbour test), together with the table assigning the neighbour-test codes N0–N4 from the normalized distance classes 0–1, 1–1.5, 1.5–2, 2–3 and >3.]

Diurnal range test We use the diurnal range values to detect days on which the minimum temperature was warmer than the maximum temperature. Such conditions can occur in the Argentine data for two reasons: (1) some historical observations of minimum and maximum temperature were made over different periods of the day: minimum temperature was the absolute minimum between 9 pm and 9 am, while maximum temperature was the absolute maximum between 9 am and 9 pm. As the reference periods were different, a minimum temperature could indeed be larger than the maximum temperature when recorded as a daily value. Fortunately, such cases disappeared as the observation procedures became more systematized. (2) The record was wrong or was badly read from old punched cards. Most of the errors found in the dataset seem to be related to the latter. This check was applied first. In order not to flag both the minimum and the maximum temperature value, the result was stored and compared to the results of the other tests before deciding whether only one or both temperature values had to be flagged as erroneous (confidence level 3).

Series of constant temperature A special flag is created when temperature values are constant over 2 days or more. The code used to flag constant temperature series is either SL, for long series of three consecutive days or more, or SS, for short series of two consecutive days.
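To make these two checks concrete, here is a minimal Python sketch; it is not the authors' code, and the array layout, missing-value convention and function names are assumptions:

```python
import numpy as np

def diurnal_range_flags(tmin, tmax):
    """Flag days on which the recorded minimum exceeds the recorded maximum.
    tmin, tmax: 1-D arrays of daily values with NaN for missing data."""
    tmin = np.asarray(tmin, dtype=float)
    tmax = np.asarray(tmax, dtype=float)
    return (tmin > tmax) & ~np.isnan(tmin) & ~np.isnan(tmax)

def constant_series_codes(t):
    """Code runs of identical consecutive values: 'SS' for exactly 2 days,
    'SL' for 3 days or more; empty string otherwise."""
    t = np.asarray(t, dtype=float)
    codes = [""] * len(t)
    start = 0
    for i in range(1, len(t) + 1):
        end_of_run = i == len(t) or np.isnan(t[i]) or t[i] != t[start]
        if end_of_run:
            run = i - start
            if run >= 2 and not np.isnan(t[start]):
                codes[start:i] = ["SL" if run >= 3 else "SS"] * run
            start = i
    return codes
```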

Range and step tests Our objective was to design a procedure as universally applicable as possible, so we decided not to use traditional tests based on specific absolute thresholds, which are usually appropriate only to a specific region. Instead, we rescaled the data into a non-dimensional distance to the 25th (for lower values) or 75th (for upper values) percentile (P25 or P75) as follows:

Per(X) = (X − P25) / (P50 − P25) if X ≤ P25 (below the 25th percentile)
Per(X) = (X − P75) / (P75 − P50) if X ≥ P75 (above the 75th percentile)
Per(X) = 0 otherwise.

Other percentiles could be used without modifying the actual detection of outliers in the distributions. Our choice follows the method used by most mathematical software packages to draw box-and-whisker plots: any data point whose distance from the median is larger than 1.5 times the distance between the median and the third (resp. first) quartile is considered an outlier. The percentiles are computed for each month of the year (January to December) from the daily data of that month. Our method is then based on two classes of tests applied directly to the rescaled temperature time series:

– Outlier test: quality control methods commonly use outlier tests to identify very large values (larger than 3 or 4 standard deviations from the mean). Here we simply use the non-dimensional distance to the 25th or 75th percentile:

OUT(Tt) = Per(Tt)

where T is either the minimum or the maximum temperature. The outlier test is not the core test of our procedure; it is used as a complementary test when the DIP test (see below) cannot be computed. If the outlier test is larger than 1.5, a code related to its amplitude is associated with the data: O1 (for values between 1.5 and 3), O2 (for values between 3 and 5) and O3 (for values larger than 5). The choice of these thresholds was based on the box-and-whisker convention and on the analysis of the results. In any case, the final results are not significantly affected by small changes in these thresholds, as the most important test is the Neighbour test (spatial checking; see below). If the outlier test cannot be computed (because of missing data), an ON code is associated.

– Step test: the step test measures how large the difference between two consecutive days is:

STEP(Tt) = Per(Tt − Tt−1)

where T is either the minimum or the maximum temperature.

– DIP test (equivalent to the test described in Vejen et al. 2002): the DIP test we developed is defined as:

DIP(Tt) = −STEP(Tt) × STEP(Tt+1) if (Tt − Tt−1) × (Tt+1 − Tt) < 0
DIP(Tt) = 0 otherwise.

Basically, the DIP test detects very large 1-day peaks (upward or downward) contrasting with the surrounding variability. According to the definition of an outlier (distance greater than 1.5), the lowest non-zero value of a possible outlier DIP is the square of 1.5 (2.25). Thus, if the DIP test is larger than 2.25, a code related to its amplitude is associated with the data: DM for medium values (between 2.25 and 3.5), DL for larger values (between 3.5 and 5.5), and DX for values larger than 5.5. The choice of these thresholds was based on the analysis of the results. In any case, the final results are not significantly affected by small changes in these thresholds, as the most important test is the Neighbour test (spatial checking; see below). If the DIP test cannot be computed (because of missing data on one of the 3 days needed to compute the test), a DN code is associated.

– Local Extreme test: in case of a low DIP value (DIP < 2.25), we first check whether the value is larger or lower than both daily values observed 1 day before and 1 day after the day being checked. If it is a local maximum or minimum, it is coded LE, otherwise LN.
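The following minimal sketch illustrates the rescaling and the single-station tests above for one calendar month of daily values. It is a hedged reconstruction, not the APACH code: in particular, the text leaves open which percentiles the STEP rescaling uses, and here the monthly percentiles of the daily values are simply reused, as the formula STEP(Tt) = Per(Tt − Tt−1) literally suggests.

```python
import numpy as np

def month_percentiles(values):
    """25th, 50th and 75th percentiles of the daily values of one calendar month."""
    v = np.asarray(values, dtype=float)
    return np.percentile(v[~np.isnan(v)], [25, 50, 75])

def per(x, p25, p50, p75):
    """Non-dimensional distance to the 25th (low values) or 75th (high values) percentile."""
    if x <= p25:
        return (x - p25) / (p50 - p25)   # negative for low values
    if x >= p75:
        return (x - p75) / (p75 - p50)   # positive for high values
    return 0.0

def outlier_code(out_value):
    """Codes used in the text: O1 (1.5-3), O2 (3-5), O3 (>5); empty if not an outlier."""
    d = abs(out_value)
    if d > 5:
        return "O3"
    if d > 3:
        return "O2"
    if d > 1.5:
        return "O1"
    return ""

def dip(t_prev, t, t_next, p25, p50, p75):
    """DIP(Tt) = -STEP(Tt) * STEP(Tt+1) when day t is a 1-day peak or dip, else 0."""
    step_t = per(t - t_prev, p25, p50, p75)
    step_next = per(t_next - t, p25, p50, p75)
    if (t - t_prev) * (t_next - t) < 0:   # sign change around day t
        return -step_t * step_next
    return 0.0
```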

Spatial Test Observed extreme events can be wrongly detected as potential errors by any test based only on the station's own distribution. It is therefore crucial for any quality control procedure to compare stations with neighbouring stations. This test is used extensively in our procedure and is applied as follows:
1. We pre-select neighbouring stations located at less than 500 km from the analyzed station, with a correlation larger than 0.8 (the correlation is computed only between data of the same month: January, February, etc.) and with a significance level higher than 99%. Our results are not significantly different whether one uses a lower or higher correlation value (in the range 0.7–0.85). This value was adopted as a compromise between selecting enough neighbours to compute the spatial tests and requiring correlations high enough to avoid introducing noise in the interpolation.


2. For each pre-selected neighbouring station, we compute a linear regression between the analyzed and the neighbouring daily data (based on all the common years for the month of the data to analyze). Despite the low density of the network, more than 80% (resp. 70%) of the daily maximum (resp. minimum) temperature observations have at least two neighbouring stations complying with the selection conditions; around 10% of the minimum and maximum temperature data have only one neighbouring station, and around 10% (resp. 20%) of the daily maximum (resp. minimum) temperature data have no neighbouring station to validate against. Such observations (and their related weather stations) are mainly located in the southern part of Argentina.
3. We then compute an interpolated value from the neighbours as follows:

Tint(t) = [ Σi∈[1,N] wi ( αim Ti(t) + βim ) ] / [ Σi∈[1,N] wi ]

where αim and βim are the coefficients of the linear regression between the analyzed and the neighbouring station, N is the total number of neighbouring stations, m is the month of the time t of the data to be analyzed, and wi is the weight of the neighbouring station in the interpolation, computed using a method similar to that of Thornton et al. (1997):

wi = exp( −a (ri/R)² ) − exp( −a ), for ri ≤ R

where R is the maximum distance used to select neighbours (500 km), ri is the distance between the analyzed and the neighbouring station, and a is a scaling factor. We tried different values and chose a = 3, as in Thornton et al. (1997). Small changes in this value (range 2–4) do not significantly affect the interpolated value and consequently our final results.
4. We then compute the difference between the analyzed and the interpolated value. The difference is normalized by the standard deviation of the daily temperature of the month of the day to check. The larger the normalized distance, the stronger the confidence in stating that the data is erroneous (a minimal code sketch of this interpolation is given after the decision-tree summary below).
5. Finally, we also compute the angle around the analyzed station covered by all the neighbours. Briefly, for each neighbouring station we define the angle between a parallel to the Equator (angle 0) and the straight line joining the analyzed station (reference point) to the neighbouring station. We then compute the total angle covered by all the neighbouring stations in order to quantify whether they are all located in a similar direction or whether they cover most of the area around the station.
6. In conclusion, for a given distance, the larger the angle covered by the neighbouring stations, the stronger the assertion that the data is Useful or Suspect. Similarly, for a given angle, the larger the distance, the stronger the assertion that the data is Useful or Suspect.

Finally, our decision-tree algorithm can be summarized as follows (see Fig. 4):

– If the DIP test cannot be computed (because of missing data on one of the 3 days needed to compute the test), the Outlier test is applied (code OX, OL, OM or O0 according to the result) and a neighbouring station test is applied, leading to one of the codes NN (no neighbour) or N0 to N4 (related to the distance and angle of the neighbouring stations).
– If the DIP test is zero, we apply the Constant Series test and measure its length, coding the identical values according to that length (SL for long series of three consecutive days or more, SS for short series of two consecutive days).
– If the DIP test is larger than 2.25, a code related to its amplitude is given to the data (DM for medium, DL for large, DX for very large values). Then the neighbouring station test is applied.
– In cases of low DIP values (DIP < 2.25), it is first checked whether the value is larger or lower than both daily values observed 1 day before and 1 day after. If it is a local maximum or minimum, it is coded LE, otherwise LN. If the code is LN, the neighbouring station test is applied. If the code is LE, the same test is applied to the interpolated neighbour values (see the Spatial Test above), leading to different codes (NEO: extreme of opposite sign, NEI: extreme of identical sign, NEN: non-extreme, and NNE: cannot be computed); finally, the neighbour test is applied. If the data is not a local maximum or minimum, the long constant series test is applied, leading to the SL code (for long series; three consecutive days or more) or SS (for short series; two consecutive days).
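As referenced above, here is a minimal sketch of the neighbour-based interpolation and normalized difference used in the spatial check (steps 3–4), with R = 500 km and a = 3 as in the text. It is an illustration rather than the authors' implementation; the regression coefficients are assumed to have been computed beforehand for the relevant calendar month, and all names are hypothetical.

```python
import numpy as np

def neighbour_weight(r_km, R_km=500.0, a=3.0):
    """Distance weighting after Thornton et al. (1997): exp(-a (r/R)^2) - exp(-a), for r <= R."""
    return np.exp(-a * (np.asarray(r_km, dtype=float) / R_km) ** 2) - np.exp(-a)

def interpolated_value(neighbour_values, alphas, betas, distances_km):
    """Weighted estimate Tint(t) of the analyzed station value from its neighbours.
    neighbour_values: daily values Ti(t) at the pre-selected neighbours;
    alphas, betas: monthly regression coefficients of the analyzed station on each neighbour."""
    w = neighbour_weight(distances_km)
    est = np.asarray(alphas) * np.asarray(neighbour_values) + np.asarray(betas)
    return float(np.sum(w * est) / np.sum(w))

def normalized_difference(observed, interpolated, month_std):
    """Step 4: observed-minus-interpolated difference in units of the standard
    deviation of the daily temperature of that calendar month."""
    return (observed - interpolated) / month_std

# Example with three neighbours at 100, 250 and 400 km:
# tint = interpolated_value([12.3, 11.8, 13.0], [0.9, 1.1, 0.8], [1.0, -0.5, 2.0], [100, 250, 400])
```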

Two important comments should be made here. First, the major output of the method is a code describing the path of the daily observation through the decision-tree algorithm. As a result, the user can define his/her own confidence table (such as Table 3). Second, the neighbour test is a very important test, and in cases where no neighbour data are available, the confidence level is more likely to be "Useful" or "Need-Check". This can be considered a precautionary measure to balance against the lack of data.

Table 3 Temperature confidence table

Code    NN         N0       N1         N2         N3         N4
DX      Doubtful   Useful   NeedCheck  Doubtful   Suspect    Suspect
DL      NeedCheck  Useful   NeedCheck  Doubtful   Suspect    Suspect
DM      NeedCheck  Useful   NeedCheck  NeedCheck  Suspect    Suspect
LENEO   NeedCheck  Useful   NeedCheck  Doubtful   Doubtful   Doubtful
LENEI   NeedCheck  Useful   NeedCheck  NeedCheck  Doubtful   Doubtful
LENEN   NeedCheck  Useful   NeedCheck  NeedCheck  Doubtful   Doubtful
LENNE   NeedCheck  Useful   NeedCheck  NeedCheck  Doubtful   Doubtful
LN      NeedCheck  Useful   NeedCheck  NeedCheck  NeedCheck  Doubtful
SL      NeedCheck  Useful   NeedCheck  Doubtful   Suspect    Suspect
SS      Useful     Useful   NeedCheck  NeedCheck  Doubtful   Doubtful
OX      NeedCheck  Useful   NeedCheck  NeedCheck  Doubtful   Suspect
OL      NeedCheck  Useful   Useful     NeedCheck  NeedCheck  Doubtful
OM      NeedCheck  Useful   Useful     NeedCheck  NeedCheck  NeedCheck
O0      Useful     Useful   Useful     Useful     NeedCheck  Useful

Useful means the value is certainly correct, NeedCheck means the value is probably correct, Doubtful means the value is probably wrong, but in unusual cases may be correct, Suspect means the value is certainly wrong, but in exceptional cases may be correct
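Since the final output is the code string describing each observation's path through the tree, the user-redefinable category assignment mentioned above can be as simple as a lookup table. A minimal illustration follows; the structure is hypothetical, with a few entries taken from Table 3:

```python
# Hypothetical user-editable mapping from (test code, neighbour code) to a category;
# only a few of the Table 3 entries are shown.
CONFIDENCE = {
    ("DX", "N4"): "Suspect",
    ("DX", "N0"): "Useful",
    ("LN", "N3"): "NeedCheck",
    ("SS", "NN"): "Useful",
}

def category(test_code, neighbour_code, table=CONFIDENCE):
    # Unlisted combinations default to NeedCheck ("probably correct, please verify").
    return table.get((test_code, neighbour_code), "NeedCheck")
```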


4 Tests and decision-tree for precipitation

The procedure to control the quality of the daily precipitation data is based on two classes of tests: (1) extreme daily precipitation and (2) extreme dry sequences.

Extreme daily precipitation test Considering the distribution, shape and amplitude of daily precipitation, it can be relatively difficult to assess whether a specific daily precipitation value is wrong or even doubtful. Our algorithm was therefore designed to identify how "extreme" a specific daily precipitation total is, leaving it to the user to decide whether to consider this value in his/her analysis. The algorithm is based on the following major steps (a minimal sketch of the first two steps is given after this list):
– First, for each month of the year (January, February, ..., December), a precipitation amplitude is transformed into a distance to the 75th percentile of the daily distribution of that month. If the distance is larger than 1.5 times the amplitude difference between the 75th and 50th percentiles, the day is flagged as a potential outlier.
– Second, starting with the lowest-amplitude potential outlier, we compute whether the amplitude difference with the following larger outlier is greater than 50% of its amplitude. If it is lower, the day is disregarded as an outlier. This rule unflags only those potential outliers whose values are continuous with the other values of the distribution.
– Third, if the daily precipitation distribution of the calendar month has enough observations (30 being a minimum), a parametric distribution is fitted to the data (best fit between a Gamma and a Weibull). We then remove the largest outlier from the distribution and compute a new fit. If the two fitted distributions are significantly different (at the 5% level according to the Kolmogorov–Smirnov test), the day is flagged a second time. The method is applied to all the potential outliers by removing the tail of the distribution in one case and the tail plus the following lower value in the second case. If the test is negative for all potential outliers, they are all unflagged. However, if one potential outlier is flagged a second time, all other potential outliers with a larger amplitude are also automatically flagged a second time.
This method is very conservative, as it flags values that are unlikely outliers. Considering that very few cases are flagged, no automatic spatial test has been developed, and the method leaves to a Human Control Procedure the decision to accept a value or not. Finally, we also compute whether the flagged value is observed on a Monday, as it may represent a cumulative value (Sunday plus Monday).
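The sketch below is a hedged illustration of the first two steps for one calendar month, not the authors' implementation; the Gamma/Weibull refit and Kolmogorov–Smirnov comparison of the third step are omitted, and the reading of the 50% rule follows the literal wording above.

```python
import numpy as np

def extreme_precip_candidates(daily_mm):
    """Steps 1-2 of the extreme daily precipitation test for one calendar month.

    Step 1: flag values lying more than 1.5 * (P75 - P50) above P75.
    Step 2: starting from the lowest flagged value, keep the flag only when the
    gap to the next larger flagged value exceeds 50% of the value itself, i.e.
    when it is not "continuous" with the rest of the distribution.
    """
    x = np.asarray(daily_mm, dtype=float)
    x = x[~np.isnan(x)]
    p50, p75 = np.percentile(x, [50, 75])
    threshold = p75 + 1.5 * (p75 - p50)
    candidates = sorted(v for v in x if v > threshold)

    kept = []
    for i, v in enumerate(candidates):
        nxt = candidates[i + 1] if i + 1 < len(candidates) else None
        if nxt is None or (nxt - v) > 0.5 * v:
            kept.append(v)
    return kept
```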


Fig. 5 Drought decision tree

4.1 Extreme dry sequence test

The test is structured as follows (see Fig. 5 and Tables 4, 5, 6 and 7):

1- Select the droughts to be checked
First, we identify all sequences of dry days and compute the dry-day sequence distribution (from the shortest to the longest). All dry sequences longer than 10 days and at least 50% longer than the previous dry sequence are flagged (see the code sketch below). Such a check is quite conservative, as our objective is to reduce the computation time by filtering out short dry sequences without missing any doubtful dry sequence. All sequences longer than 1 year are coded Long Length Drought (LLD) and considered Suspect.

2- Station distribution tests
For any dry sequence previously flagged, we compute the mean precipitation over the same period of time (for example, January 15th to April 10th) during all the years of the station record. We first check whether there has been another dry sequence during the same period of time in any other year of the record. In positive cases, the dry sequence is considered plausible, and we flag it with a value equal to the number of other dry sequences.

3- Neighbouring station tests
We developed three neighbour tests. In each test, we first select all stations within a radius of 200 km. Then we compute, in each selected station record, the mean precipitation during the same period as the dry sequence to be checked (for example, January 15th to April 10th). In all cases, if there is a dry sequence in any neighbouring station at the same time as the one being checked, the dry sequence is considered Useful. Otherwise, we proceed with one of the three neighbouring station tests, as shown in the decision tree (Fig. 5).
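A minimal sketch of the dry-sequence selection (step 1), as referenced above. The wet-day threshold, the reading of "50% longer than the previous dry sequence" (taken as the next shorter sequence in the sorted distribution) and the function names are assumptions:

```python
import numpy as np

def dry_runs(daily_precip_mm, wet_threshold=0.1):
    """Return (start_index, length) of maximal runs of dry days (precipitation below threshold)."""
    dry = np.asarray(daily_precip_mm, dtype=float) < wet_threshold
    runs, start = [], None
    for i, d in enumerate(list(dry) + [False]):       # sentinel closes a trailing run
        if d and start is None:
            start = i
        elif not d and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def sequences_to_check(runs, min_days=10, growth=1.5, one_year=365):
    """Step 1: sort runs by length; flag runs longer than `min_days` that are also at
    least 50% longer than the previous (next shorter) run; runs longer than one year
    are coded LLD and considered Suspect."""
    to_check, lld = [], []
    previous = None
    for run in sorted(runs, key=lambda r: r[1]):
        length = run[1]
        if length > one_year:
            lld.append(run)
        elif length > min_days and (previous is None or length >= growth * previous[1]):
            to_check.append(run)
        previous = run
    return to_check, lld
```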


Table 4 Confidence table when applying the first neighbour test

Drought classification            Drought classification at the station
at the neighbouring station       SPI_ED      SPI_SD      SPI_MD
SPI_ED                            Useful      Useful      Useful
SPI_SD                            NeedCheck   Useful      Useful
SPI_MD                            Doubtful    NeedCheck   NeedCheck
SPI_NN                            Suspect     Doubtful    Doubtful
SPI_MW                            Suspect     Suspect     Suspect
SPI_VW                            Suspect     Suspect     Suspect
SPI_EW                            Suspect     Suspect     Suspect

Table 5 Confidence table when applying the second neighbour test

Drought classification            Drought classification at the station
at the neighbouring station       SPI_ED      SPI_SD      SPI_MD
SPI_ED                            Useful      Useful      Useful
SPI_SD                            NeedCheck   Useful      Useful
SPI_MD                            Doubtful    NeedCheck   NeedCheck
SPI_NN                            Doubtful    Doubtful    Doubtful
SPI_MW                            Doubtful    Doubtful    Doubtful
SPI_VW                            Doubtful    Doubtful    Doubtful
SPI_EW                            Doubtful    Doubtful    Doubtful

Table 6 Confidence table when there is no neighbour data to check with (NN) or when at least one neighbouring station presents a dry sequence at the same time (ND)

        Drought classification at the station
        SPI_ED      SPI_SD      SPI_MD
NN      NeedCheck   NeedCheck   NeedCheck
ND      Useful      Useful      Useful

Table 7 Confidence table when the SPI value cannot be computed at the station (short series)

Neighbour station code    Station with too short time series
NN                        NeedCheck
ND                        Useful
MIN                       Useful
P1                        NeedCheck
P2                        NeedCheck
P3                        Doubtful

The neighbour codes are: NN (no neighbour), ND (drought in at least one neighbour station); MIN, P1, P2 and P3 refer to the third neighbour test


First neighbour test In the first neighbour test, we compute the rank correlation of the mean precipitation series (mean during the dry-sequence period being checked) between the reference station and each neighbouring station. If it is lower than 0.8 or if it does not reach a 99% significance level, we apply the second neighbour test. If it is higher and reaches a 99% significance level, we "interpolate" the neighbour precipitation data onto the reference station distribution by computing the precipitation value with the same percentile as in the neighbouring station distribution (a sketch of this mapping is given below). We then compute the corresponding SPI (Standardized Precipitation Index, McKee et al. 1993) value. From all neighbouring stations, we select the one with the driest SPI and compute the difference with the reference station dry-sequence SPI. A confidence level is suggested as a function of this difference (Table 4).

Second neighbour test In the second neighbour test, the correlation between the reference and neighbour time series is not sufficiently large to apply the first neighbour test. Therefore, considering that precipitation characteristics can be similar (i.e. the two stations have the same distribution and hence the same climate) even though the correlation is below 0.8, we compare the precipitation distributions using the Kolmogorov–Smirnov test. If the test suggests that the two distributions are identical at a 90% significance level, we select the station as a relevant neighbouring station, compute an SPI value (as in the first neighbour test) and the difference with the reference station dry-sequence SPI. A confidence level is suggested as a function of this difference (Table 5).

Third neighbour test This test is applied when the reference station time series is shorter than 20 years. If any of the neighbouring station records is longer than 20 years, we compute its mean cumulative distribution. We then compute the percentile, in the neighbouring station record, of the precipitation recorded at the same time as the reference station dry sequence. One of four codes is assigned: MIN (if this value is the minimum value of the distribution), P1 (lower than the 10th percentile), P2 (lower than the 20th percentile) and P3 (larger than the 30th percentile).
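To illustrate the "interpolation by matching percentiles" and the SPI comparison of the first neighbour test, here is a minimal sketch. The SPI fitting details (gamma fit with the zero-precipitation probability handled separately) follow common practice after McKee et al. (1993) and are an assumption, since the paper does not spell them out:

```python
import numpy as np
from scipy import stats

def quantile_map(value, neighbour_series, reference_series):
    """Map a neighbour precipitation value onto the reference station distribution
    by matching its empirical percentile (the 'interpolation' of the first test)."""
    pct = stats.percentileofscore(neighbour_series, value)
    return float(np.percentile(reference_series, pct))

def spi(value, climatology_mm):
    """Standardized Precipitation Index of `value` with respect to a climatological
    sample of the same accumulation period (gamma fit, after McKee et al. 1993)."""
    clim = np.asarray(climatology_mm, dtype=float)
    clim = clim[~np.isnan(clim)]
    p_zero = np.mean(clim == 0.0)                     # probability of a zero total
    shape, loc, scale = stats.gamma.fit(clim[clim > 0.0], floc=0)
    cdf = p_zero + (1.0 - p_zero) * stats.gamma.cdf(max(value, 0.0), shape, loc=loc, scale=scale)
    cdf = float(np.clip(cdf, 1e-6, 1.0 - 1e-6))       # keep the inverse normal finite
    return float(stats.norm.ppf(cdf))

# The confidence level then follows from the difference between the reference-station
# dry-sequence SPI and the driest SPI among the selected neighbours (Table 4).
```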

5 Application to the Argentine National Weather Service

5.1 Temperature results

First of all, it is important to note that more than 90% of the minimum and maximum temperature data are considered "Useful" after applying the decision tree. Between 5% and 10% of the data are flagged as "Need-Check" (Fig. 6), displaying a positive trend due to the negative trend in the density of the network. Less than 0.1% of the data (∼3000 records) are flagged as "Doubtful" and around 0.05% (∼1500 records) as "Suspect". Figure 6 shows that the two curves (Doubtful and Suspect) for minimum and maximum temperature are similar. They have high values at the beginning of the time series, when most of the data were recorded and filed on paper cards, which in some cases were damaged by dust and humidity. Indeed, prior to October 1967, the quality of the data was not controlled. After October 1967, there is a long period of very low error levels, corresponding to October 1967–1997, when a manual quality check was performed by the Argentine Weather Service, and an increase at the end of the period (stronger for Suspect values), when only real-time quality checks were performed.

486


Fig. 6 Percentage of minimum (plain) and maximum (dash-dot) temperature observations classified as NeedCheck (top panel), Doubtful (middle panel) and Suspect (bottom panel)

These results demonstrate that the quality checks performed by the Weather Service were good, but left some potential errors that our new tests have identified. As a consequence, our time series of flags shows a variability consistent with the Weather Service history and quality-checking procedures. Finally, the spatial distribution of the errors clearly shows (Fig. 7) that for stations south of 40° S with no neighbours, it is impossible to compute the neighbouring station tests for minimum temperature, which are crucial to unequivocally flag data as Suspect. However, for maximum temperature (Fig. 7), the test could be applied at some stations, allowing the detection of Doubtful and Suspect values. In regions of high spatial density of stations, we detected larger proportions of Suspect data in both datasets (minimum and maximum temperature). This difference can be explained by the fact that variability in minimum temperature is more local (less spatially correlated) than variability in maximum temperature. This result suggests that some Suspect data are likely to exist in the records of isolated stations, but cannot be classified as such using only the station distribution, as done in our tests. It also suggests that the development of the network in the future should take into account the poor density of the network in the southern parts of the country. It is possible that comparison with automated stations belonging to private networks may partially complete the Weather Service network and contribute to improved error detection in the future (Fig. 7).


Fig. 7 a Percentage of NeedCheck, Doubtful and Suspect values for minimum temperature (percentage computed relative to the total number of observed values). Filled circles represent a percentage larger than 15% (NeedCheck), 0.1% (Doubtful) and 0.05% (Suspect); circles a percentage larger than 5% (NeedCheck), 0.04% (Doubtful) and 0.01% (Suspect); crosses a percentage smaller than 5% (NeedCheck), 0.04% (Doubtful) and 0.01% (Suspect). b Same as a but for maximum temperature


Fig. 8 Interannual variability of the number of potential outliers in daily precipitation

Fig. 9 (Upper panel) Seasonal cycle of the number of dry days classified as NeedCheck (straight line), Doubtful (dashed) and Suspect (dash-dot) cases. Doubtful and Suspect cases have been multiplied by 100 to be displayed on the same plot as the NeedCheck values; (middle panel) interannual variability of the percentage of NeedCheck (straight line), Doubtful (dashed) and Suspect (dash-dot) cases; (lower panel) interannual variability of the total percentage of the NeedCheck, Doubtful and Suspect cases


5.2 Precipitation results

With the Extreme Daily Precipitation test, around 80 potential outliers per year were identified before 1976, and around 20 to 40 per year after that (Fig. 8). This difference may have two explanations. First, prior to 1967, no quality control had been applied to the data, which may explain why we detect more outlier cases. Second, the difference between the periods before and after 1976 may result from the fact that, after the Military Coup, many stations, especially in the interior, were closed. These stations were located in drier regions than those of the Humid Pampa. Moreover, a larger proportion of the remaining stations was located at airports and had a strategic value, which certainly favoured better real-time quality control. As previously explained, we also checked whether the potential precipitation outliers were observed on Mondays. We found that this was the case for 17% of the potential outliers. Although this value is slightly larger than expected from a random process (14.3%), it is very close. At this stage, we believe that a "Human Control" is required to verify the flagged precipitation values.

The Extreme Dry Sequence test (Fig. 9, upper panel) displays a large peak during the winter period, reaching about twice the number of "Need-Check" dry days observed during the rest of the year. While the interannual variability shows various peaks (middle panel), it also displays (Fig. 9, lower panel) a much larger number of non-Useful values (∼4% of all dry days) at the beginning of the time series, before the NWS quality check. Moreover, a minimum is observed before the Military Coup in 1976. Since that date, a small positive trend can be observed in the percentage of non-Useful values.

6 Conclusions and perspectives

One of the three strategic objectives of the CLARIS Project was to initiate the setting-up of a high-quality daily climate database for temperature and precipitation, as it would be of great value for validating and evaluating model skill in simulating climate trends and changes in extreme-event frequency. Different groups have contributed to similar objectives using their own data quality checks (often manual or visual). The CLARIS LPB project (7th EC Framework Programme, 2008–2012) now aims at gathering hundreds of official and private station records of precipitation and minimum and maximum temperature. It is therefore crucial to develop an automated system to perform data quality control. The present work is the first step towards that objective. In the present study, daily precipitation and minimum and maximum temperature data were checked separately. A potential improvement of the method (especially in regions with no neighbouring stations) would be to cross-check the different types of daily data.

We developed three different decision trees to flag possible errors in daily data. The first decision tree is applied to either minimum or maximum temperature. The central test, called the DIP test, measures how high a positive or negative 1-day peak is (when it occurs) in a 3-day sequence. This test is complemented by a Spatial Test confirming whether specific data can be flagged as Useful, NeedCheck, Doubtful or Suspect. The NeedCheck category most often includes data that are flagged but for which not enough complementary information (such as neighbouring stations) is available.


It is important to note that each daily datum is coded by a string following its path through the decision tree. The final flag (Useful, NeedCheck, Doubtful or Suspect) is a suggestion and can be modified by the users at their discretion. In order to validate the results of our quality control method, we decided to work exclusively on the Argentine daily weather stations (from 1959 to 2005) with professionals of the Argentine National Weather Service (NWS), allowing us to interpret the detection of potential errors in the dataset in the light of the NWS history. Overall, we found that the number of flagged temperature data followed the history of the NWS fairly well (Fig. 7). Indeed, before October 1967, when most of the data were filed on paper cards, lost or badly digitized due to humidity and dust, and no quality control was applied, our tests detected a larger number of flagged data (especially Doubtful and Suspect). During the 1967–1997 period, we observed a minimum of detections, corresponding to the NWS manual quality control. Finally, at the end of the time series there is an increase in flagged data, as the NWS has not yet performed a thorough quality control check. These results demonstrate that the quality checks performed by the Weather Service were good (especially during the 1967–1997 period), but left some potential errors that our new tests were able to identify. As a consequence, the time series of our flags display a variability consistent with the Weather Service history and quality checks.

The quality check of daily precipitation data is far more complex. We therefore applied two distinct decision trees, one for extreme precipitation and one for long drought sequences. The extreme precipitation test aimed at detecting very high precipitation values. The flagged data were more numerous before 1976 than later. This result suggests that, although it is difficult to detect them, some erroneous high precipitation values may exist in the database, requiring a thorough manual quality control of our flagged data. This process will be carried out in the future by the National Weather Service. The drought sequence test identified very long dry sequences, particularly using the Standardized Precipitation Index (SPI, McKee et al. 1993), and largely relies on neighbouring stations. As in the temperature case, more drought sequences (flagged as NeedCheck, Doubtful or Suspect) were found at the beginning of the period. A minimum was observed in the early 1970s, before the Military Coup (1976). Since then, a weak positive trend in the number of flagged dry days (especially NeedCheck) has been detected. It is possible that the decrease in the number of active stations since 1976 has reduced the number of neighbouring stations used in our tests and thus increased the number of NeedCheck flags.

Our results also suggest that some Suspect data exist in the records of isolated stations, but these cannot be classified as such using only the station distribution. This suggests that the development of the network in the future should take into account its poor density in the southern parts of the country. It is possible that a wider collection of automated station data belonging to private networks may partially complete the Weather Service network and contribute to more error detection in the future.
Our method will be applied in the framework of the CLARIS LPB project, which aims at gathering a large number of stations from official and private networks in the La Plata Basin in order to provide better insights into regional daily climate variability and to create gridded daily products, useful for validating global and regional models as done in the ENSEMBLES EC Project (6th FP; http://ensembles-eu.metoffice.com/).


Acknowledgements We wish to thank the European Commission 6th Framework programme for funding the CLARIS Project (Project 001454) during the 3-year duration of the project. Jean-Philippe Boulanger wants to thank the Centre National de la Recherche Scientifique (CNRS) for the administrative coordination of the project, the Institut de Recherche pour le Développement (IRD) for its constant support, and the University of Buenos Aires and its "Department of Atmosphere and Ocean Sciences" for welcoming him during the entire duration of the project. Special thanks are also addressed to Olga Penalba, Matilde Rusticucci and Enrique Segura.

References

Alexandersson H, Moberg A (1997) Homogenization of Swedish temperature data. Part I: homogeneity test for linear trends. Int J Climatol 17:25–34

Brandsma T, Können GP (2006) Application of nearest-neighbor resampling for homogenizing temperature records on a daily to sub-daily level. Int J Climatol 26:75–89. doi:10.1002/joc.1236

Caussinus H, Mestre O (2004) Detection and correction of artificial shifts in climate series. Appl Stat 53(3):405–425

Haylock M, Peterson TC, Alves LM, Ambrizzi T, Anunciação YMT, Baez J, Barros VR, Berlato MA, Bidegain M, Coronel G, Corradi V, Garcia VJ, Grimm AM, Karoly D, Marengo JA, Marino MB, Moncunill DF, Nechet D, Quintana J, Rebello E, Rusticucci M, Santos JL, Trebejo I, Vincent LA (2006) Trends in total and extreme South American rainfall 1960–2000 and links with sea surface temperature. J Climate 19:1490–1512

McKee TB, Doesken NJ, Kleist J (1993) The relationship of drought frequency and duration to time scales. In: Proc. Eighth Conf. on Applied Climatology. Amer Meteor Soc, Anaheim, CA, pp 179–184

Moberg A, Alexandersson H (1997) Homogenization of Swedish temperature data. Part II: homogenized gridded air temperature compared with a subset of global gridded air temperature since 1861. Int J Climatol 17:35–54

Rusticucci M, Barrucand M (2004) Observed trends and changes in temperature extremes over Argentina. J Climate 17(20):4099–4107

Rusticucci M, Renom M (2007) Variability and trends in indices of quality-controlled daily temperature extremes in Uruguay. Int J Climatol. doi:10.1002/joc.1607

Thornton PE, Running SW, White MA (1997) Generating surfaces of daily meteorological variables over large regions of complex terrain. J Hydrol 190:214–251

Vejen F, Jacobsson C, Fredriksson U, Moe M, Andresen L, Hellsten E, Rissanen P, Palsdottir B, Arason B (2002) Quality control of meteorological observations: automatic methods used in the Nordic countries. Report 08/2002, KLIMA, p 111

Vincent LA, Peterson TC, Barros VR, Marino MB, Rusticucci M, Carrasco G, Ramirez E, Alves LM, Ambrizzi T, Berlato MA, Grimm AM, Marengo JA, Molion L, Moncunill DF, Rebello E, Anunciação YMT, Quintana J, Santos JL, Baez J, Coronel G, Garcia J, Trebejo I, Bidegain M, Haylock MR, Karoly D (2005) Observed trends in indices of daily temperature extremes in South America 1960–2000. J Climate 18:5011–5023

Wijngaard JB, Klein Tank AMG, Können GP (2003) Homogeneity of 20th century European daily temperature and precipitation series. Int J Climatol 23:679–692