Adaptive Gaussian filters for interpolation, classification and retrieval

Peter Mills ∗, Stefan A. Buehler

Institute for Environmental Physics, University of Bremen, PO Box 330440, 28334 Bremen, Germany

Abstract

A method of adaptively varying the width of a multi-dimensional Gaussian filter is introduced. Such filters are shown, both formally and by illustration, to be an effective and general method for typical Bayesian estimation problems. It is shown how the filters may be applied to remote sensing problems through three examples, all employing data from the AMSU (Advanced Microwave Sounding Unit) series of satellite instruments. In the first example, the method is extended to apply to interpolation problems in which high-resolution, but irregularly sampled, AMSU-B measurements are re-sampled to a more convenient gridding. In the second example, simulated radiance measurements from space are classified according to water-vapour volume-mixing-ratio using adaptive Gaussian filtering. The potential usefulness of such a technique to study chaotic mixing is described in detail. In the final example, the algorithm is used to retrieve humidity in the upper troposphere. Detailed error analyses are included, both generally and for each of the examples. Actual implementations in the form of C++ subroutine listings are included in the appendix. The algorithm is analysed for efficiency and optimisations are suggested.

Key words: 93.85.+q Instrumentation and techniques for geophysical research. 02.30.Zz Inverse methods. 02.60.Ed Interpolation; curve fitting. 02.70.Uu Applications of Monte Carlo methods. 02.70.Tt Justifications or modifications of Monte Carlo methods. 42.68.Wt Remote sensing; LIDAR and adaptive systems. 42.68.Ay Propagation, transmission, attenuation, and radiative transfer. 47.52.+j Chaos in fluid dynamics. 92.60.Jq Water in the atmosphere (humidity, clouds, evaporation, precipitation).

∗ Correspondence should be sent to: Peter Mills ([email protected]), 49-421-218-4571 (phone), 49-421-218-4555 (fax)

Preprint submitted to Elsevier Science, 19 January 2005

Contents

1    Introduction
1.1  Data sources
2    Description of the algorithm
2.1  Relationship to optimal estimation and Bayesian probability theory
2.2  Error estimation
3    Example 1: Interpolation of AMSU-B radiances
4    Example 2: Classification of water-vapour volume mixing ratios
4.1  Isoline retrieval
4.2  Improving efficiency: finding the class borders
5    Example 3: Retrieval of upper tropospheric water vapour
6    Summary and conclusions
6.1  Further work
     Acknowledgements
A    Appendix: program listings
A.1  adgaf routine
A.2  Classification routine

1 Introduction

In Mills (2004), a method of adaptively choosing the width of a Gaussian filter for use in classification and interpolation problems was painted in broad brush-strokes. The goal of this paper is to flesh out this technique by describing the algorithm more formally and precisely, and also to provide three simple examples that apply it to different, but closely related, problems. All three problems, already alluded to in the title, make use of data from the AMSU-B (Advanced Microwave Sounding Unit B) satellite instrument.

The most obvious application of the algorithm is to perform interpolation on an unevenly distributed set of samples, and this is the focus of the first example, in which one day of radiance measurements is interpolated to a more convenient, high-resolution longitude-latitude grid. As the name suggests, however, adaptive Gaussian filters (AGFs) do not just interpolate, but also perform some smoothing on the data. Since unknown variables may frequently be treated in an equivalent manner to noise, it is this smoothing process that makes the algorithm suitable for Bayesian estimation problems. This is formally demonstrated in the first section, while the last two examples provide real illustrations.

The high horizontal resolution of the AMSU-B instrument would seem to make it an ideal candidate for the retrieval of detailed humidity fields. To squeeze every last bit of accuracy from the instrument, isolines of water-vapour volume-mixing-ratio (vmr) are retrieved via classification, i.e. in broad ranges, and could potentially be used to validate the results of contour advection simulations. This is the focus of the second example. Unfortunately, the low vertical resolution of the instrument makes this a difficult undertaking. To counter this, the water content within a broad layer may be retrieved, giving rise to such quantities as upper tropospheric water vapour (UTWV) and upper tropospheric humidity (UTH). These are derived in the final example by interpolating within a database of simulated brightness temperatures (BTs).

While these examples are by no means meant to be exhaustive, they demonstrate the usefulness and generality of such algorithms, especially when applied to satellite remote sensing. Although the method is slow, as are all of the most general algorithms, rapid advances in computer technology have made it quite feasible where a few years ago it might have been unusable. By deploying parallel algorithms on networked PCs or multi-processor architectures, such methods could be made even more attractive.

1.1 Data sources

Two major sources of data will be employed in the examples: so-called reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) and data from the AMSU series of satellite instruments.

AMSU-B is a downward-looking instrument that detects microwave radiation in five so-called double side-band channels in the sub-millimetre range. Three are centred on the water-vapour emission line at 183.31 GHz, and two are surface-looking, so-called "window" channels, also sensitive to water vapour and located at 89 and 150 GHz. Since the instrument is a sister to the AMSU-A instrument, the channels are typically numbered from 16 to 20, starting with the two surface-looking channels and ending with the deepest-looking of the three 183 GHz channels (Saunders et al., 1995). Details of all five AMSU-B channels are shown in Table 1, along with a selected subset of the AMSU-A channels (Rosenkranz, 2001; Cramer, 2002). The AMSU-A channels supply the temperature information needed for our retrievals (Mo, 1996).

Table 1
NOAA AMSU channel frequencies.

channel    centre frequency    sideband offset    nominal width
AMSU-A:         [MHz]               [MHz]             [MHz]
   6            54400               ± 105              190
   7            54940               ± 105              190
   8            55500               ± 87.5             155
   9            57290               ± 87.5             155
  10            57290               ± 217               77
AMSU-B:         [GHz]               [GHz]             [GHz]
  16             89.0               ± 0.9               1.0
  17            159.0               ± 0.9               1.0
  18            183.1               ± 1.0               0.5
  19            183.1               ± 3.0               1.0
  20            183.1               ± 7.0               2.0

The ECMWF supplies a synthesised, gridded data set based on an amalgam of sonde, in-situ, satellite and other remote-sensing measurements. The gridding of the data used in this study is 1.5 by 1.5 degrees in longitude and latitude, while the data are laid out vertically along sixty so-called "sigma" levels. The idea behind sigma levels is that they are terrain-following at the surface but revert to pressure levels at the highest altitudes; intermediate levels are interpolated between the two. For the most comprehensive and up-to-date information on this product, please refer to the ECMWF website: http://www.ecmwf.int/research/ifsdocs/index.html.

Included in the ECMWF data are gridded fields of temperature, pressure and humidity. These supply the values of water vapour needed both to constitute the so-called "training" dataset and to validate predicted values. AMSU radiances are also simulated from this data, as opposed to relying on real measurements and the messy collocations these entail. Obviously, such simulated radiances supply no new information about the real atmosphere, but they serve for algorithm assessments and feasibility studies such as those contained in this paper.


For a given vertical profile, the density of radiation at a single frequency emitted to space in a single direction may be modelled via the radiative transfer (RT) equation:

\frac{dI}{ds} = \sum_i \alpha_i(T(s), P(s)) \, \rho_i(s) \left[ B(T(s)) - I \right]    (1)

where I is the intensity of radiation per solid angle, per unit frequency, s is the path (a function of altitude), α_i is the absorption cross-section of the ith species, ρ_i is the density of the ith species and B is the Planck function. T and P are the temperature and pressure respectively, both functions of the path.

Radiative transfer simulations are performed using ARTS, the Atmospheric Radiative Transfer Simulator, version 1.0 (Buehler et al., 2004a). This version does not support scattering, so only clear-sky simulations are performed, but microwave radiation is not strongly affected by clouds. For a validation of ARTS 1.0 simulations against collocated AMSU measurements, please see Buehler et al. (2004b) and Buehler and John (2004), while a validation of the ARTS package against other RT simulators is contained in et al. (2004).
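To make equation (1) concrete, the following is a minimal sketch, in the spirit of the C++ listings in the appendix, of a clear-sky integration for a single absorbing species over uniform layers. It is not ARTS: the planck and simulate_radiance functions, the constant absorption cross-section and the crude forward-Euler stepping are simplifying assumptions made purely for illustration.

// Minimal sketch of integrating the clear-sky RT equation (1) for a single
// absorbing species along a vertical path, using forward-Euler steps over
// uniform layers. Hypothetical helper, not part of ARTS or the AMSU chain.
#include <cmath>
#include <vector>

// Planck radiance at frequency nu [Hz] and temperature T [K]:
double planck(double nu, double T) {
  const double h = 6.626e-34, c = 2.998e8, k = 1.381e-23;
  return 2*h*nu*nu*nu/(c*c) / (std::exp(h*nu/(k*T)) - 1.0);
}

// Integrate dI/ds = alpha*rho*(B - I) upward through discrete layers;
// alpha [m^2/kg] is taken as constant here for simplicity.
double simulate_radiance(double nu, double alpha,
                         const std::vector<double> &T,    // temperature per layer [K]
                         const std::vector<double> &rho,  // absorber density per layer [kg/m^3]
                         double ds,                       // layer thickness [m]
                         double I_surface)                // boundary radiance
{
  double I = I_surface;
  for (size_t i = 0; i < T.size(); i++)
    I += alpha * rho[i] * (planck(nu, T[i]) - I) * ds;
  return I;
}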

2 Description of the algorithm

Suppose we have a function y(\vec{x}) sampled at points \{\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_n\}, not of our own choosing. We would like some means of estimating y at some arbitrary point, \vec{x}. As a specific approach to the problem, we are looking for a set of weighting coefficients at the interpolation point to apply to the sampled data points:

y(\vec{x}) = \frac{1}{W} \sum_{i=1}^{n} w_i(\vec{x}) \, y_i    (2)

where W is a normalisation constant. If necessary, we choose W such that:

W = \sum_{i=1}^{n} w_i    (3)

The summation in (2) could be over the entire set of points or simply over a set of k nearest neighbours, which is equivalent to saying that the coefficients for all but these points are zero. The most obvious choice is to perform some sort of filtering and convolve the points with a symmetric filtering function centred at the interpolation point:

w_i = g(|\vec{x} - \vec{x}_i|)    (4)

where g is the filtering function and |\cdot| denotes a metric, typically Cartesian. Although in theory a more complicated metric may be employed, in practice it is often more efficient to first rescale or transform the variables so that a Cartesian metric is then appropriate. A natural choice for g would be a Gaussian:

g(r) = e^{-\frac{r^2}{2\sigma^2}}    (5)

where σ denotes the width of the filter. Such a scheme does not just interpolate between points, but also filters the data, so that small-scale details may be obscured. Thus, filter widths would naturally be tailored to the density of the samples, with smaller widths being used for higher sample densities. In this section we outline an adaptive method for selecting the filter width.

Closely related to function interpolation is the estimation of probability density functions (pdfs). That is, given a set of samples \{\vec{x}_i\} distributed according to some probability density function, P(\vec{x}), what is the probability density at \vec{x}? A very similar scheme may be used to estimate this pdf, much like a k-nearest-neighbours, but with a Gaussian weighting of the samples. Recalling the definition of W:

P(\vec{x}) \approx \frac{W}{n \, (2\pi)^{D/2} \sigma^D}    (6)

where n is the total number of samples and D is the number of dimensions.

How does this relate to the original problem of interpolating a function from a set of sampled points? The probability density may be related to the point spacing as follows:

\delta = \frac{1}{(n P(\vec{x}))^{1/D}}    (7)

where δ is the average point spacing at point \vec{x}. We wish to vary the filter width so that it matches the point spacing:

\sigma^{(opt)} = K \delta = \frac{K}{(n P(\vec{x}))^{1/D}}    (8)

where K is a constant. Substituting the approximated pdf from equation (6) for P(\vec{x}) in (8) produces the following, which must be solved for σ^{(opt)}:

W(\sigma^{(opt)}) \approx K^D (2\pi)^{D/2}    (9)

Thus, correctly selecting a fixed value for W (call it W_c) in (3) should produce an "optimal" (as we have defined it) filter width. We start with a trial value for σ and evaluate W. If this is greater than W_c, we square each of the terms and re-sum them until we drop below it; if it is less, the same procedure is applied except taking the square root. To speed up the algorithm, a subset of terms (say, one hundred or ten thousand) supplying the most weight may be pre-selected after the initial trial. As a last refinement, exponential interpolation to the target value, W_c, may be performed between the final value of W and its previous iteration.

We wish to have a more formal, precise description of this procedure. To begin with, let us assume that the initial filter width, which will be denoted σ^{(0)}, is larger than the local point spacing. The weighting coefficient is defined as before as a non-normalised Gaussian of the distance from the sample point, except now we add a superscript for the iteration:

w_i^{(j)} = e^{-\frac{d_i^2}{2 (\sigma^{(j)})^2}}    (10)

d_i = |\vec{x}_i - \vec{x}|    (11)

The jth iteration of the ith weighting coefficient, denoted w_i^{(j)}, is defined as the square of its previous iteration:

w_i^{(j)} = \left( w_i^{(j-1)} \right)^2    (12)

consequently the jth filter width, σ^{(j)}, obeys the following recursion relation:

\sigma^{(j)} = \frac{\sigma^{(j-1)}}{\sqrt{2}}    (13)

To find an initial filter width larger, but not too much larger, than the optimal, we may use \sigma^{(0)} = d_{\lceil W_c \rceil}, i.e. the distance to the \lceil W_c \rceil-th nearest sample. Following our superscripting convention, W^{(j)} is defined as:

W^{(j)} = \sum_i w_i^{(j)}    (14)

with the final iteration of j, call it f, defined such that:

W^{(f)} \leq W_c    (15)

The final, interpolated weights approximating those for the optimal filter width are calculated by exponential interpolation between the last two iterations of W, as follows:

c = \frac{\log W_c + \log W^{(f)} - 2 \log W^{(f-1)}}{2 \left( \log W^{(f)} - \log W^{(f-1)} \right)}    (16)

w_i^{(opt)} \approx \left( w_i^{(f)} \right)^c    (17)

while the corresponding filter width is given by:

\sigma^{(opt)} \approx \frac{\sigma^{(f)}}{\sqrt{c}}    (18)

The quantity W_c may be thought of as roughly the number of nearest neighbours included in the interpolation. Thus, large values will tend to filter the result quite strongly, while smaller values, on the order of one (1) to ten (10), will produce a result closer to interpolation. Very small values (< 1) will tend to mimic a nearest-neighbour scheme. The routine adgaf implements the procedure described above and is supplied in the appendix. This algorithm will henceforth be referred to as an "Adaptive Gaussian Filter" (AGF).

What are the advantages of this scheme? First, the weights and therefore the interpolated value depend only upon the distances of all the samples from the interpolation point, \vec{x}. Therefore, once these have been calculated, a set of nearest neighbours supplying the most weight may be selected to speed up the calculation. This may be implemented using binary trees, for instance, and performed in n log k time, where k is the number of nearest neighbours and n is the total number of samples. The adgaf routine includes a parameter to select out the k nearest neighbours, and this should be set one or two orders of magnitude larger than W_c (see appendix). Second, the most expensive part of the calculation, the exponential functions, are calculated only once for each weight. Refinements in the filter width and the consequent changes in the weights are performed by squaring the latter, not by recalculating the exponentials. Typically, the "optimal" filter width is found within one to three iterations.
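To make the procedure of equations (10) to (18) concrete, the following is a minimal sketch of the core weight-squaring loop. It is not the adgaf listing from the appendix: the interface and the omission of the k-nearest-neighbour pre-selection and of any error checking are simplifications assumed here for illustration; metric2 is the squared-distance helper described in the appendix.

// Sketch of the adaptive filter-width iteration (equations 10-18): compute
// Gaussian weights at a trial width, square them (dividing sigma by sqrt(2))
// until the total weight W drops to the target Wc, then interpolate the
// exponent and return the weighted mean of equation (2).
#include <cmath>
#include <vector>

float metric2(float *v1, float *v2, long m);   // squared distance (see appendix)

float adgaf_sketch(float **x_samp,   // training sample locations [n][D]
                   float *y_samp,    // training sample values [n]
                   long n, long D,
                   float *x,         // interpolation point [D]
                   float sigma0,     // initial filter width, sigma^(0)
                   float Wc)         // objective total weight, W_c
{
  // assumes sigma0 is chosen large enough that the initial W exceeds Wc
  // (cf. sigma^(0) = d_ceil(Wc) in the text)
  std::vector<float> w(n);
  float W = 0;
  for (long i = 0; i < n; i++) {                      // eq. (10)
    w[i] = std::exp(-metric2(x_samp[i], x, D)/(2*sigma0*sigma0));
    W += w[i];
  }
  float W_prev = W;
  while (W > Wc) {                                    // eqs. (12)-(15)
    W_prev = W;
    W = 0;
    for (long i = 0; i < n; i++) { w[i] *= w[i]; W += w[i]; }
  }
  float c = (std::log(Wc) + std::log(W) - 2*std::log(W_prev))
            / (2*(std::log(W) - std::log(W_prev)));   // eq. (16)
  float num = 0, den = 0;
  for (long i = 0; i < n; i++) {
    float wi = std::pow(w[i], c);                     // eq. (17)
    num += wi*y_samp[i];
    den += wi;
  }
  return num/den;                                     // eq. (2)
}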

2.1 Relationship to optimal estimation and Bayesian probability theory

The following tautology is applicable to any function:

f(\vec{x}) \equiv \lim_{\sigma \to 0} \frac{1}{(2\pi)^{D/2} \sigma^D} \int e^{-\frac{|\vec{x} - \vec{x}'|^2}{2\sigma^2}} f(\vec{x}') \, d\vec{x}'    (19)

Since the algorithm described above is simply a discretised version of this formula, less the limit operator, it is readily apparent that it should be an estimator of the function at the interpolation point, with the accuracy increasing with the density of the samples and (to a point) with decreasing filter width, σ. Less obvious, however, is its relationship to Bayesian probability theory.

More generally, one might say that the data samples not merely a function, but rather a joint probability: P(\vec{x}, y). Since the conditional probability, P(y|\vec{x}), returns a single, normalised probability function for any given location of \vec{x}, the expectation value operator at that point may be calculated in exactly the same manner as for a simple, one-dimensional pdf of a single variable:

\bar{y}(\vec{x}) = \int y \, P(y|\vec{x}) \, dy = \frac{1}{P(\vec{x})} \int y \, P(\vec{x}, y) \, dy    (20)

Suppose, however, that we have only limited knowledge of the interpolation point, as with a set of measurements for instance. Thus, it could be described by a new distribution, call it P_m(\vec{x}), also called a posterior probability (Rodgers, 2000; Evans et al., 2002). The expectation value would now be given by an integration over both variables:

\bar{y} = \int \frac{P_m(\vec{x})}{P(\vec{x})} \int y \, P(\vec{x}, y) \, dy \, d\vec{x}    (21)

Since the data is already importance sampled along the joint probability, a Monte Carlo method for evaluating this integral is particularly simple (Press et al., 1992, equation 7.8.3):

\bar{y} \approx \frac{\sum_i P_m(\vec{x}_i) \, y_i}{\sum_i P_m(\vec{x}_i)}    (22)

Note that the denominator approximates P(\vec{x}) as in (6). Equation (22) describes an adaptive Gaussian filter, except that the distribution, P_m(\vec{x}), is not based on the measurement statistics, but is chosen rather to optimise the inclusion of samples. This is not a Monte Carlo method in the strictest sense, since very often the positions of the samples (such as the results of RT simulations of randomly selected profiles) cannot be explicitly selected. Another thing that should be stressed is that AGFs return an expectation value estimator, not a maximum likelihood estimator. Equation (6) may be justified in a similar manner by considering the formula in (19), less the limit operator, as an estimator of P(\vec{x}) and assuming that this pdf is roughly constant in the region of interest.

2.2 Error estimation

It is not so easy to establish an error bound on the final value of the interpolate. An obvious means of doing so would be a weighted sum of the squared deviations of the samples, with the same weights as the filter itself. This might work quite well if the function is very noisy, but if the function is linearly varying and without noise, this will show up as an overly large error estimate. It is very easy to show that the interpolate has a second-order error term in the limit as the sample density goes to infinity, so linear trends, at a minimum, should be factored out of any rms-based error estimate. Given that analytical derivatives of the interpolated value of y are easily derived, one may obtain an rms-based error estimate by performing a Taylor expansion and using the differences between this and the samples:

e^2 = \frac{1}{W} \sum_i w_i \left[ \tilde{y} + \nabla_{\vec{x}} \tilde{y} \cdot (\vec{x}_i - \vec{x}) + \frac{1}{2} (\vec{x}_i - \vec{x}) \cdot \nabla_{\vec{x}} \nabla_{\vec{x}} \tilde{y} \cdot (\vec{x}_i - \vec{x}) - y_i \right]^2    (23)

Fig. 1. Error statistics for the retrieval shown in figure 16. Heavy solid line is a histogram of the error (true value less retrieved) normalised by the estimated error. Vertical lines denote the average and standard deviation while the broken curve traces a Gaussian of equivalent area, mean and variance.

where \tilde{y} is the interpolated value of y. To demonstrate the accuracy of this estimate, the error statistics corresponding to the UTH retrieval shown in figure 16 are summarised in figure 1. A histogram of the actual error, normalised by estimates from equation (23) less the second-order term (see equation (33)), traces almost a perfect Gaussian of zero mean and unit variance.
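As a rough illustration of how such an estimate might be coded, the following sketch evaluates only the zeroth-order part of equation (23), i.e. the weighted RMS deviation of the samples from the interpolate; the linear and quadratic Taylor terms, which the text argues should be subtracted using the analytical derivatives, are omitted here, so this version will overestimate the error wherever the function has a strong local trend.

// Zeroth-order sketch of the error estimate in equation (23): a weighted
// RMS deviation of the sample values from the interpolated value.
#include <cmath>

float error_estimate_sketch(const float *w,  // final weights w_i^(opt)
                            const float *y,  // sample values y_i
                            long n,          // number of samples
                            float y_tilde)   // interpolated value
{
  float W = 0, e2 = 0;
  for (long i = 0; i < n; i++) {
    float d = y_tilde - y[i];
    e2 += w[i]*d*d;
    W  += w[i];
  }
  return std::sqrt(e2/W);
}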

3 Example 1: Interpolation of AMSU-B radiances

It is a commonly encountered problem: a two-dimensional function sampled on a regular, rectangular lattice that we want to re-sample on a new grid or at irregularly chosen points. Typically one uses bilinear or bicubic interpolation, both of which are standard textbook fare (Press et al., 1992). More often, however, one has it the other way round: the data are sampled irregularly and one wishes to interpolate to a regular grid.

Fig. 2. Raw AMSU channel 18 brightness temperatures for January 1, 2004, swaths centred around 12:00 UT


Fig. 3. AMSU channel 18 brightness temperatures for January 1, 2004, interpolated to a 0.2◦ × 0.2◦ resolution at 12:00 UT using adaptive Gaussian filtering.


Fig. 4. Comparison of interpolates using matrix inversion vs. adaptive Gaussian filtering.

This is especially the case with measurements from satellites, which cannot be relied upon to look at the Earth only along integer values of longitude and latitude, nor to sample the whole globe in a single pass. Moreover, what if we also wish to interpolate from an irregular grid to an irregular grid, and there are more than two dimensions? Adaptive Gaussian filters may be used as a fully general and very stable method of interpolation. While the method actually smooths the data with a linear filter, if the objective total weight, W_c, is set to a small value on the order of one, then it approaches an interpolation. On the other hand, some smoothing may be desirable, in which case the parameter may be set higher.

A particularly good example of a satellite instrument that does not return data in an easily digestible form is NOAA's AMSU-B (Advanced Microwave Sounding Unit B) instrument. The NOAA series of satellites upon which it is mounted fly in sun-synchronous orbits with roughly seven (7) orbits per day, while the instrument uses a cross-track scanning geometry with ninety different viewing angles, sampling at a rate of roughly 22.5 scans per minute. That translates to over 300 000 measurements per satellite per day, while full global coverage requires almost a day of data from three satellites, all in an irregular, overlapping grid geometry: consider figure 2, showing each pixel of channel 18 measured from three NOAA satellites (numbers 15 through 17) along two orbits each, centred at 12:00 UT, Jan 1 2004.


What is needed is a method like adaptive Gaussian filtering to cut this problem down to size. Figure 3 shows the AMSU-B channel 18 brightness temperatures, resampled to a 0.2 by 0.2 degree longitude-latitude grid. For each interpolation point, one hundred (100) nearest neighbours were selected from one full day of data and filtered with an objective total weight, W_c, of three (3). Since performing the interpolation with all the data at once would be too slow and consume too much memory, it was done in separate 20 by 20 degree bins, each with a two degree overlap.

It should be noted before continuing that the raw radiances could not be used as is: because of the different viewing angles, measurements taken at different points along the scan line are not equivalent. To address this, they were corrected to approximate nadir using a set of empirically derived coefficients. For each viewing angle, the radiance of a given channel, corrected to nadir, is given by a weighted sum of all five channels, plus a constant term. Thus:

I_{j,0^\circ} \approx \sum_i c_{ji\theta} I_{i\theta} + b_{j\theta}    (24)

where I_{iθ} is the radiance of the ith channel with looking angle θ, while the c's are the weighting coefficients and the b's are the constant terms. The equivalent procedure is described for the AMSU-A instrument in Goldberg et al. (2001). The actual weighting coefficients used were taken from the National Oceanic and Atmospheric Administration (NOAA) website: http://poes.nesdis.noaa.gov/. These corrected brightness temperatures are what is shown in figure 2 and also what is interpolated in figure 3.

For comparison, figure 4 shows a scatter diagram comparing a selection of the interpolates in figure 3 with the same field interpolated using a slightly different method. Since we are resampling to a rectangular grid, coefficients to interpolate from this grid to the original, irregular sampling may be calculated and the resulting matrix inverted. The interpolation coefficients are calculated using an n-dimensional multilinear method and the matrix inverted via the normal equation:

K^T K \vec{y}\,' = K^T \vec{y}    (25)

where \vec{y} is the original vector of irregularly sampled ordinates and \vec{y}\,' is the new vector of ordinates on the rectangular grid. The matrix K is a sparse n by m matrix, where m is the number of elements in \vec{y}\,', containing the coefficients to go from the new sampling to the old sampling.

In order to use the AGF method properly, some rescaling of the variables may be necessary, as pointed out in section 2. The longitude and latitude coordinates were left untouched (not always a good idea considering the convergence of meridians towards the poles); time, however, was rescaled to a two-day index in order to better match the spatial intervals between samples. Scaling of the variables may be done so that the interpolation favours changes in different ones. For instance, the smaller the time variable is scaled, the stronger the weighting of points distant in time.

Fig. 5. An example of contour advection
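As an illustration of the scan-angle correction in equation (24), the following minimal sketch applies it to a single AMSU-B pixel. The coefficient arrays are placeholders only: the actual values, one set per scan position, were taken from the NOAA website cited above and are not reproduced here.

// Sketch of the nadir correction of equation (24): the corrected brightness
// temperature of channel j is a weighted sum of the five measured AMSU-B
// channels at the given scan position, plus a constant term. In practice
// there is one set of coefficients c and b for each of the ninety viewing
// angles; placeholder arrays are assumed here.
const int NCHAN = 5;   // AMSU-B channels 16-20

float nadir_correct(const float T[NCHAN],         // measured BTs at angle theta
                    const float c[NCHAN][NCHAN],  // weighting coefficients for this angle
                    const float b[NCHAN],         // constant terms for this angle
                    int j)                        // channel index to correct
{
  float Tj = b[j];
  for (int i = 0; i < NCHAN; i++) Tj += c[j][i]*T[i];
  return Tj;
}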

4 Example 2: Classification of water-vapour volume mixing ratios

Adaptive Gaussian filters are usable for typical spatial and temporal interpolation problems, but it is when interpolating in abstract, inhomogeneous vector spaces for the purposes of performing soundings or inversions that the technique really comes into its own. Using the terminology of atmospheric inverse theory, the vector \vec{x} may be thought of as the so-called measurement vector, while y is a single state variable. The reversal of the usual symbolic convention is unfortunate, but unavoidable: the problem is formulated so that the forward calculation from state to measurement is essentially already inverted! Samples of the two variables may be obtained from true measurements or from forward model calculations and directly sample their joint probability distribution. In this sense, the method may be used as a fully general, Bayesian inverse method which, unlike optimal estimation, makes no assumptions about the type of statistics (Rodgers, 2000). This type of application will be the focus of the next two sections.

With very little modification (in fact the technique originated in this way) AGFs may be used to perform classifications. Closely related to both function interpolation and pdf estimation are classification algorithms: we are given a set of vectors \{\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_n\} and, associated with each vector, instead of an arbitrary real number, one of a set of discrete values between 1 and n_c. These latter we will call classes, with n_c being the number of classes. Now the problem becomes: given an arbitrary vector \vec{x}, to which class does this vector belong? The problem is usually stated, once again, in Bayesian terms: given a test point, \vec{x}, what is the probability that it is a member of a particular class, j? That is, what is the conditional probability, P(j|\vec{x})? The class of which \vec{x} is most likely a member will be given by the value of j supplying the maximum value of this probability density function (pdf):

c = \arg\max_j P(j|\vec{x})    (26)

where c is the index of the class. Now we are talking about maximum likelihood estimation as opposed to expectation value. One approach to the problem is to approximate the pdf of each class from the density of the sampled points and compare each of the values. The formula in (6) may be applied with essentially no modification to estimate the conditional probability of a set of discrete variables. The only difference is that the sample locations \{\vec{x}_i\} are only those for the particular class whose probability we are trying to determine. Further, for numerical work, no normalisation coefficients need be applied, since it is only the relative values of the probabilities we are interested in. We can write the formula for the estimated conditional probability as follows:

P(j|\vec{x}) \approx \frac{1}{W} \sum_{i,\, c_i = j} w_i    (27)

where c_i is the class of the ith sample. The same adaptive technique may be applied when estimating pdfs as when interpolating a function. For more details on the algorithm, please refer to the routine classify supplied in the appendix. For a good introduction to classification and pdf estimation, and also for some alternative algorithms, please see Kohonen (2000).

A classification scheme may be used to retrieve trace atmospheric constituents in broad ranges. For instance, given a vector of AMSU radiances, we may wish to know if the water-vapour volume mixing ratio at a given altitude is higher or lower than a certain threshold. This raises two questions: 1. Why would we want to? and 2. Is retrieving in broad ranges more accurate than retrieving along a continuum? We must first address the motivation behind classification retrievals for continuum variables, because otherwise the second question is moot.

Many trace atmospheric constituents are inert enough that they may be modelled as passive tracers, that is, there exist neither sources nor sinks. Traditional methods for tracer advection track the concentration at fixed points along a grid, with the time evolution calculated by numerical integration of partial differential equations. The disadvantage of these so-called Eulerian methods is that the spatial resolution is often quite limited. So-called Lagrangian models track the concentration along moving air parcels. Contour advection, for example, is a new and powerful technique that models the evolution of one or more contours or isolines of a passive tracer (Dritschel, 1988, 1989). The method is adaptive in the sense that new points are added to or removed from the evolving contour in order to maintain the integrity of the curve. Thus, the horizontal configuration of an isoline may be predicted to a high degree of precision. Despite being driven by finitely resolved wind fields, these advected contours often show a very high level of fine-scale detail (Waugh and Plumb, 1994; Methven and Hoskins, 1999; et al., 2002), as shown in figure 5. This structure results from a continual process of stretching and folding, much as in a so-called "baker's map" (Ott, 1993; Ottino, 1989). Does this fine-scale structure, necessarily ignored by GCMs, exist in the real atmosphere? In order to validate an advected contour, one should appreciate that it is not necessary to know, at any given point, the exact value of the tracer. Rather, one need only know whether the point falls within or without the contour, that is, is the concentration higher or lower than the value of the isoline represented by the contour?

To address the second issue, a simple example might be instructive. Suppose there are only four points in the training data with non-trivial weights, all equidistant from the test point. Suppose also that the dependent variable is being retrieved in only two ranges: either above or below a threshold value. Three of the points have values very close to and slightly below the threshold, while the fourth is well above the threshold. In this case, taking an average (expectation value) will supply a result that is above the threshold, while performing a classification (maximum likelihood) will supply a result that is below. This latter would appear to be more realistic for this case and would also produce better statistical measures of accuracy.

To perform the classifications, the vector of coordinate variables is defined:

\vec{x} = (T_0, \beta, T_{18}, T_{19}, T_{20})    (28)

where T_{18}, T_{19} and T_{20} are AMSU-B brightness temperatures from channels 18 through 20 respectively, while T_0 and β, the reference temperature and temperature lapse rate respectively, define the mean temperature profile in the upper troposphere:

T = T_0 + \beta (z - z_0)    (29)

where z is altitude and z_0 a reference altitude. The training data set was built up by taking 150 000 profiles of temperature, pressure and water vapour from ECMWF reanalysis data, randomly selected in both time and space. To obtain the brightness temperatures, radiative transfer simulations were performed using the Atmospheric Radiative Transfer Simulator (ARTS), with these as input data. The simulations assume a clear-sky (no-scattering), one-dimensional atmosphere and a surface emissivity of 0.95, which is a realistic value for land, although perhaps too high for the ocean. In any case, except for the driest cases, none of the channels is consistently surface-looking, while the measurement vectors to be inverted were simulated using the same parameters anyway. The two temperature parameters were directly derived from the ECMWF data by performing fits between sigma levels 33 and 43 (roughly 300 to 500 hectopascals). For test purposes this is reasonable, while the same data could just as easily be derived from the sister instrument, AMSU-A, when doing real retrievals; see for instance Houshangpour et al. (2004) and Rosenkranz (2001). In order to give them approximately equal weight in the retrieval, all five variables were normalised by their standard deviations.

The first step was to test the accuracy of the method using samples selected from the training data itself. The data were classified according to the water-vapour volume mixing ratio (vmr) at sigma level 36 (roughly 400 hPa), with values below 0.0005 falling into the first class and those above into the second. Seven thousand five hundred (7500) samples were randomly selected from the training data and classified with good accuracy using adaptive Gaussian filtering. Ten thousand nearest neighbours (k = 10000) were selected for each classification, while the objective total weight was set at ten (W_c = 10). The resulting accuracy was 92.4%, while the correlation coefficient was 0.846. Since the percentage of correct classifications may not always provide a realistic measure of the accuracy, the correlation coefficient is also given. This may be defined for discrete variables just as well as for continuous ones, and provides a well-known and easily interpreted measure. For the simple two-class scenario, it is easy to show that the correlation coefficient has no dependence upon the often arbitrary assignment of values to classes.
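To illustrate how equations (26), (27) and (30) below fit together, the following is a minimal sketch of the classification step, assuming the weights w_i have already been produced by the adaptive filter as in section 2. It is not the classify routine of the appendix; the interface and the ClassResult structure are assumptions for illustration only.

// Sketch of AGF classification: sum the adaptive weights per class, pick the
// class with the largest sum (eq. 26), estimate its conditional probability
// (eq. 27) and rescale it into a confidence rating (eq. 30).
#include <vector>

struct ClassResult { long cls; float confidence; };

ClassResult classify_sketch(const float *w,   // adaptive weights w_i
                            const long *ci,   // class labels, 1..nc, per sample
                            long n,           // number of samples
                            long nc)          // number of classes
{
  std::vector<float> P(nc + 1, 0.0f);
  float W = 0;
  for (long i = 0; i < n; i++) { P[ci[i]] += w[i]; W += w[i]; }

  long c = 1;
  for (long j = 2; j <= nc; j++) if (P[j] > P[c]) c = j;   // eq. (26)

  float pc = P[c]/W;                                       // eq. (27)
  float C  = (nc*pc - 1)/(nc - 1);                         // eq. (30)
  return {c, C};
}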

Fig. 6. Plot of classification accuracy for different confidence ratings for the retrieval shown in figure 7.

4.1 Isoline retrieval

Once classifications have been done over a large enough section of the Earth's surface, isolines may be retrieved by simply tracing the borders between positive (higher than the threshold) and negative (lower than the threshold) classification results. This may be done via any contouring algorithm. The results of such a procedure are shown in figure 7 using the aforementioned simulated training data. The retrieval parameters were k = 10000 and W_c = 50. Note the higher value of the objective total weight, which produces better results for test sets drawn from a different population than the training data, since there will be more smoothing of the class borders. To compute the test data, additional radiances covering the whole field had to be calculated in the same manner as for the training data. The mean temperature profiles, comprising temperature and temperature lapse rate, were derived as before from the ECMWF reanalysis data for that date.

The heavy black line in figure 7 is the "true" isoline, with auxiliary fine contours for higher and lower multiples. The retrieved isoline is not shown; rather, a dot indicates a positive classification. Obviously, all the dots should fall within the heavy contour, but like a young child who has not yet learned how to colour, the retrieval does not perfectly fill the isoline. The shading in the figure indicates the confidence rating, which we define as follows.

Fig. 7. Test of isoline retrieval algorithm. k = 10000, Wc = 50


Fig. 8. Plot of isoline confidence interval as a function of confidence rating for the retrieval shown in figure 7.

If the true value of a given classification result is not known, full knowledge of its accuracy may nonetheless be found in the conditional probability, and a simple confidence rating may be defined by rescaling it:

C = \frac{n_c P(c|\vec{x}) - 1}{n_c - 1}    (30)

where c is the "winning" class and n_c is the number of classes. If C is zero, then the classification result is little better than chance, while if it is one, the result should be perfect, assuming that the pdf has been accurately calculated. By first forming a histogram over the confidence rating, actual classification accuracies may be computed for the different bins and compared, as in figure 6. The actual accuracy is very close to the confidence rating when equivalently rescaled, indicating that the conditional probabilities are indeed well estimated. As might be expected, the confidence is higher closer to the isoline, as well as where the gradients are low.

The measure may be used to our benefit in order to define an interval within which the true isoline is likely to fall. A "cutoff" value may be chosen and an "inner" and "outer" contour drawn at this cutoff. If it is chosen correctly, the two contours thus found should enclose the true isoline to within a certain statistical confidence. This latter confidence interval should not be confused with the previously defined confidence rating.

Fig. 9. Confidence intervals for retrieved isoline shown in figure 7. We expect the red curve to fall within the outermost pair of fine black curves 90% of the time.


To calculate the confidence interval from the confidence rating, a rather tedious line-integral method was used. Mathematically, this could be expressed as follows:

\delta l(C) = \int h(C'(\vec{r}) - C) \, ds    (31)

where h is the Heaviside function, δl is the confidence interval of the retrieved isoline as a function of confidence rating and C' is the confidence rating of the classification results as a function of geographical position, \vec{r}. The integral is performed along the length of the actual isoline. Although (31) implies that the integral must be evaluated separately for each value of the confidence rating, C, in actual fact it may be done for all values of C by sorting the confidence ratings of the results, i.e. C'. Finally, the results from (31), shown in figure 8, may be used to set a definite confidence interval on the retrieved isoline, as shown in figure 9. Now the true isoline is marked in red, while the heavy black line is the retrieved one. Fifty and ninety percent intervals are enclosed by the fine black lines. One standard deviation, assuming that the statistics are Gaussian, is bounded by the broken curves.

4.2 Improving efficiency: finding the class borders

We consider here only the case of two classes, but the generalisation to more is straightforward. The difference between the conditional probabilities of the two classes is given as:

R(\vec{x}) = P(1|\vec{x}) - P(2|\vec{x}) \approx \frac{1}{W} \sum_i (2 c_i - 3) w_i    (32)

where 1 and 2 are the classes and c_i is the class of the ith sample. If the total number of classes is just two, then the absolute value of this expression is equivalent to the confidence rating in (30). The class borders are found by setting this expression to zero, R(\vec{x}) = 0. One of the advantages of this method over a k-nearest-neighbours scheme is that it is possible to calculate analytical derivatives, given for the adaptive scheme as:

\frac{\partial R}{\partial x_j} \approx \frac{1}{W_c (\sigma^{(opt)})^2} \sum_i w_i^{(opt)} (2 c_i - 3) \left[ x_{ij} - x_j - \frac{1}{W_c} \sum_k w_k^{(opt)} (x_{kj} - x_j) \right]    (33)

where x_{ij} is the jth coordinate of the ith sample, while x_j is the jth coordinate of the test point. Equation (33) is useful for two purposes: first, for minimisation routines

Fig. 10. Comparison of confidence rating extrapolated using (40) versus explicitly calculated using (30)

designed to search for the class borders. Second, once we have a set of vectors that sample the borders, they may be stored along with the gradient vectors. Since these latter define the normals to the hyper-surface which is the class border, the class of a test point may be found using the following procedure: the nearest neighbour sampling the class border is found, and a dot product of the difference between these two vectors with the gradient vector returns the class through its sign. This has two advantages in speed: the number of samples required for an equivalent level of precision should be reduced by a power on the order of (D − 1)/D, and, while the algorithm still scales with n in speed (as opposed to n log k), the coefficient should be small since we are now only searching for a single nearest neighbour.

One possible procedure is as follows: pick two points at random, \vec{x}_1 and \vec{x}_2, belonging to classes 1 and 2 respectively. We define \hat{v} as the direction vector between the two points:

\hat{v} = \frac{\vec{x}_2 - \vec{x}_1}{|\vec{x}_2 - \vec{x}_1|}    (34)

The class border on the line between \vec{x}_1 and \vec{x}_2 is found by solving the following equation for t:

R(\vec{x}_1 + \hat{v} t) = 0    (35)

where \vec{b} = \vec{x}_1 + \hat{v}\, t|_{R=0} is located at the class border. The analytical derivatives may be used, for instance in a Newton's method, as an aid to the minimisation procedure:

\frac{dR}{dt} = \nabla_{\vec{x}} R \cdot \hat{v}    (36)

Good minimisation routines for this purpose may be found in Press et al. (1992). The border may be sampled as many times as necessary by repeatedly picking random pairs of points that straddle it. This will tend to over-sample the border in regions of high density while under-sampling it in regions of low density. It will also tend to under-sample regions of this hyper-surface that are tightly folded. Since the border we are dealing with is simply connected and possesses no complex curves, this is not an issue. If it were, some method to redistribute the samples would then be appropriate.

Finally, we now have a set of vectors, \vec{b}_i, sampling the class border, along with their corresponding gradients, \nabla_{\vec{x}} R(\vec{b}_i). The class of a given test point, \vec{x}, may be found as follows:

j = \arg\min_i |\vec{b}_i - \vec{x}|    (37)

p = (\vec{b}_j - \vec{x}) \cdot \nabla_{\vec{x}} R(\vec{b}_j)    (38)

c = (3 + p/|p|)/2    (39)

where c is the class. One may extrapolate the value of R to the test point:

R \approx \tanh p    (40)

allowing us to estimate the confidence. Equation (40) may be derived by considering two normally distributed classes of equal size. Using two thousand border samples, this scheme retrieved the same isoline as in figure 7 with equivalent skill (the two results were 99.6% similar, with a correlation of 0.991), while being over a thousand times faster. Searching for the class borders with a tolerance of 1 × 10^{-5} using the routine rtsafe from Numerical Recipes (Press et al., 1992) took about the same time as performing the classifications without any special training. The confidence estimated using (40) compares favourably with the more accurate result returned by (30), as figure 10 shows.
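Once the border samples and their gradients are stored, the classification of equations (37) to (40) reduces to a nearest-neighbour search followed by a dot product. The following is a minimal sketch with an assumed data layout and a brute-force search standing in for a proper tree; the sign convention follows equations (38) and (39) as given.

// Sketch of border-based classification: find the nearest border sample
// (eq. 37), project the displacement onto the stored gradient (eq. 38),
// read the class off the sign (eq. 39) and extrapolate R with tanh (eq. 40).
#include <cmath>
#include <cfloat>

float metric2(float *v1, float *v2, long m);   // squared distance (see appendix)

long border_classify_sketch(float **b,       // border samples [nb][D]
                            float **gradR,   // gradient of R at each border sample
                            long nb, long D,
                            float *x,        // test point
                            float *R_est)    // returns extrapolated R
{
  long j = 0;
  float dmin = FLT_MAX;
  for (long i = 0; i < nb; i++) {            // eq. (37)
    float d = metric2(b[i], x, D);
    if (d < dmin) { dmin = d; j = i; }
  }
  float p = 0;                               // eq. (38)
  for (long k = 0; k < D; k++) p += (b[j][k] - x[k])*gradR[j][k];

  *R_est = std::tanh(p);                     // eq. (40)
  return (p > 0) ? 2 : 1;                    // eq. (39): c = (3 + p/|p|)/2
}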

5 Example 3: Retrieval of upper tropospheric water vapour

Water vapour in the upper troposphere is one of the most significant contributors to the greenhouse effect. As a final example, adaptive Gaussian filtering was used to retrieve upper tropospheric water vapour (UTWV) and upper tropospheric humidity (UTH). The top three AMSU channels are ideally suited to retrieve these quantities, as their sensitivity peaks in this region, as measured by the so-called weighting functions, or Jacobian of the brightness temperatures with respect to humidity. Moreover, because the channels are sensitive to water vapour over a broad range of altitudes, as opposed to a narrow layer, retrievals should be more accurate than those for a single vertical level as in section 4.

First, we define the two quantities, UTWV and UTH. UTWV is the integrated column density of water vapour between two pressure levels, P_1 and P_2:

{\rm UTWV} = \int_{P_1}^{P_2} \rho_v \, dz    (41)

where ρ_v is water vapour density and z is the altitude. Integrated column densities, of which UTWV is one, have units of mass per unit area, i.e. [kg/m^2].

The UTH is defined as the average of the relative humidity between these same two pressure levels. This averaging is done over altitude, not pressure, so we can use the following integral to represent it:

{\rm UTH} = \frac{1}{\Delta z} \int_{P_1}^{P_2} \frac{P_v}{P_{sat}} \, dz    (42)

where Δz is the layer thickness, P_v is the water-vapour partial pressure and P_{sat} is the saturation water vapour pressure with respect to water. Relative humidity is a unitless quantity, but is usually expressed as a percentage. By setting the pressure limits between 300 and 500 hPa (P_1 = 300 hPa, P_2 = 500 hPa), the definitions of both UTH and UTWV become equivalent to those given in Houshangpour et al. (2004).

To retrieve UTH and UTWV, the same database of brightness temperatures and temperature parameters derived from randomly chosen ECMWF profiles was used for training data, as outlined in section 4. Here we move directly to retrieving a global field, with very promising results. In figure 11, the retrieved UTWV is shown via the shading, while the contours indicate the true ECMWF field. A scatter plot of the same results is shown in figure 12. The retrieval parameters used were the same as those in section 4, with k = 10000 and W_c = 50. The correlation is 0.968, the root mean square error (RMSE) is 0.293 kg/m^2 and the bias is −0.076 kg/m^2.
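The following is a minimal sketch of how equations (41) and (42) might be evaluated on a discretised profile. The level data, layer thicknesses and saturation pressures are assumed to be supplied, and a simple rectangle rule stands in for whatever vertical integration the actual processing uses.

// Sketch of UTWV (eq. 41) and UTH (eq. 42) from a discretised profile:
// sum the contributions of the levels whose pressure lies between p1 and p2.
void utwv_uth_sketch(const double *p,      // pressure per level [hPa]
                     const double *rho_v,  // water-vapour density [kg/m^3]
                     const double *pv,     // water-vapour partial pressure
                     const double *psat,   // saturation vapour pressure
                     const double *dz,     // layer thickness per level [m]
                     long nlev,
                     double p1, double p2, // e.g. 300 and 500 hPa
                     double *utwv, double *uth)
{
  double col = 0, rh_int = 0, z_tot = 0;
  for (long i = 0; i < nlev; i++) {
    if (p[i] >= p1 && p[i] <= p2) {
      col    += rho_v[i]*dz[i];             // integrand of eq. (41)
      rh_int += (pv[i]/psat[i])*dz[i];      // integrand of eq. (42)
      z_tot  += dz[i];
    }
  }
  *utwv = col;                              // [kg/m^2]
  *uth  = (z_tot > 0) ? rh_int/z_tot : 0;   // unitless; often quoted in %
}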

Fig. 11. Retrieved field of UTWV from temperature, temperature lapse and simulated AMSU-B brightness temperatures.


Fig. 12. Scatter plot of retrieved UTWV versus “true” ECMWF values.

Can this retrieval be taken seriously, since we are cheating in a sense by using temperature parameters derived directly from the source data being used to model the brightness temperatures? To check this, a similar test retrieval was performed using AMSU-A radiances in lieu of temperatures. For this, a much smaller database of roughly fourteen thousand ECMWF profiles was used, the details of which may be found in Chevallier (2001). This is the same data as used in Houshangpour et al. (2004); however, our retrieval is not based on any a priori physical assumptions or simplified analytical models. (The radiative transfer modelling does not count, since collocated, direct measurements could be used instead.) Therefore, no intermediate temperature retrieval is performed as it is not needed; rather, the channel radiances are used directly.

Both UTWV and UTH were retrieved with and without noise using six AMSU channels: 7, 8 and 9 from the 'A' instrument and 18, 19 and 20 from the 'B' instrument. The AMSU-A radiances were modelled in the same manner as those of AMSU-B and provide temperature information in the upper troposphere, since their weighting functions peak at these altitudes. Instrument noise was modelled by adding Gaussian noise of the following magnitudes to each of the channels:

Fig. 13. Test UTWV retrieval without noise.

Fig. 14. Test UTWV retrieval with simulated instrument noise.

(\sigma_7, \sigma_8, \sigma_9, \sigma_{18}, \sigma_{19}, \sigma_{20}) = (0.25, 0.25, 0.25, 1.06, 0.7, 0.7) \ [{\rm K}]    (43)

Fig. 15. Test UTH retrieval without noise.

Fig. 16. Test UTH retrieval with simulated instrument noise.

where σ_j is the standard deviation of the noise in the jth channel, as a brightness temperature (Rosenkranz, 2001). Figures 13 through 16 show test retrievals of UTWV and UTH performed by randomly selecting two thousand (2000) profiles from the training data.

Retrieval parameters were set at k = 1000 and W_c = 5. Despite the much smaller size of the training database and the use of AMSU-A radiances in place of known temperature parameters, the retrievals show similar skill to that of our first example. Also note the insensitivity of the method to random noise: retrievals with noise are almost as good as those without, indicating that the algorithm is extremely stable. The skill is also comparable with that in Houshangpour et al. (2004), which uses a more complex and far less general, albeit much quicker, method.
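For completeness, a minimal sketch of the noise simulation is given below: zero-mean Gaussian noise with the per-channel standard deviations of equation (43) is added to each simulated brightness temperature. The interface and the use of the standard C++ random-number facilities are assumptions for illustration.

// Add simulated instrument noise (eq. 43) to a vector of brightness
// temperatures; sigma[j] is the noise standard deviation of channel j in K.
#include <random>

void add_noise_sketch(float *bt, long nchan, const float *sigma, unsigned seed)
{
  std::mt19937 gen(seed);
  for (long j = 0; j < nchan; j++) {
    std::normal_distribution<float> noise(0.0f, sigma[j]);
    bt[j] += noise(gen);
  }
}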

6 Summary and conclusions

Adaptive Gaussian filtering was applied to interpolate high-resolution AMSU-B satellite measurements in both space and time. The results were compared to another method based on inverting normal linear interpolation and found to have passable, though not brilliant, agreement. The biggest discrepancies are thought to stem from the inherent instability of the second method; the value of an interpolate is not bounded by those of its nearest neighbours, as it is with a simple weighted mean. Indeed, because of these instabilities, the interpolation was limited to a mere two by two degree resolution, much lower than the first method or the raw pixels themselves. This type of interpolation, however, was never the intended application of the method. Rather, it is interpolation in abstract vector spaces as an approximate Bayesian estimator for which it shows the most promise.

With this in mind, the algorithm was successfully applied to retrieve water vapour in the upper troposphere and found to compare favourably with a very different, regression-based technique (Houshangpour et al., 2004). It was also shown how to extend the algorithm to classifications, so that water vapour could be successfully retrieved in broad ranges at a single vertical level. While no comparison with another method was discussed, informal tests suggest that classification using AGFs is at least as accurate as Kohonen's popular learning vector quantisation (LVQ) algorithms, albeit slower (Kohonen, 2000). This inefficiency, however, may be remedied by searching for the class borders.

In some regards, these types of methods are as accurate as can be, since they operate directly on the data itself. There are no elaborate models based on a priori assumptions; nor are there artificial models formed by correlating the dependent with the independent variables, as in a neural network algorithm for instance. The main problem is the speed: every retrieval depends upon the entire set of training data, and in this regard we would argue that such

techniques have now come of age. The current generation of computers is more than powerful enough to handle the demands, especially if some form of parallel computing is employed.

6.1 Further work

With regard to the algorithm itself, it would be difficult to take it further without moving towards general, non-linear decomposition or variable reduction schemes. In this regard it might be argued that Gaussian filtering (adaptive or not) may be too slow to be of use. In reaching towards such a goal, however, one might consider the metric. In order that a simple Cartesian metric be applied, it was suggested that the coordinate variables be transformed ahead of time to make the vector space as isotropic and homogeneous as possible. But what if this coordinate transformation (and therefore the metric) were determined locally and dynamically, like the filter width? Such a transformation could be determined through semi-analytic derivatives and might not only improve the accuracy of estimates, but would also comprise an empirical, non-linear decomposition.

In regard to the examples, there is much further work that could be done, the most important being the application of real, measured data instead of simulated. Towards this end, not only the retrievals, but also the training data could use real measurements. This removes any reliance on the faithfulness of model assumptions and instead reveals the true, empirical relationships between the data. This is relevant since polarisation and scattering by liquid and ice particles, along with the infinite possible parameterisations that these entail, were not included in the RT model. The chief difficulty with such a proposition is gathering enough collocated measurements of sufficient accuracy. In addition, because of small-scale inhomogeneities, there are fundamental limitations in how well the collocations can be done, resulting in a certain amount of noise in the training dataset even in the best case.

To be of any use for the applications described in this paper, most or all of the following factors must be taken into account. Assuming we are working with radiosondes mounted on weather balloons, one must keep track of both the motion and flight time of the balloon. Since collocations are almost never exact in either time or space, corrections need to be made by tracing the motion of the sampled air. Finally, the AMSU instrument takes most of its measurements at a significant angle to the Earth's surface and has a cross-section or footprint that is neither narrow nor uniform. In the case of integrated upper tropospheric humidity values, it is apparent that a satellite and a radiosonde will almost never sample exactly the same air.

Once accurate collocations can be made, it is then a matter of generating enough of them to sufficiently sample the measurement space. In the case of radiosondes, one can probably never hope for really good statistics, since the radiosonde network is sparse in places to begin with. Nevertheless, it is likely that considerable improvement can be made to the isoline retrieval algorithm described in section 4 as it has hitherto been applied to real data (Mills, 2004). Whether it can be made accurate enough to study chaotic mixing remains to be seen.

Acknowledgements

Thanks to Arash Houshangpour for sharing his results and data and to Oliver Lemke for invaluable programming, software and system support. Thanks to all those (you know who you are!) who made the completion of my master's degree possible. Thanks to ECMWF for ERA 40 data, to F. Chevallier at ECMWF for the ECMWF-field-derived training data set, and to Lisa Neclos from the Comprehensive Large Array-data Stewardship System (CLASS) of the US National Oceanic and Atmospheric Administration (NOAA) for AMSU data. Thanks to the ARTS radiative transfer community, many of whom have indirectly contributed by implementing features to the ARTS model. This study was funded by the German Federal Ministry of Education and Research (BMBF), within the AFO2000 project UTH-MOS, grant 07ATC04. It is a contribution to COST Action 723 'Data Exploitation and Modeling for the Upper Troposphere and Lower Stratosphere'.

A Appendix: program listings

A.1 adgaf routine

This is the C++ routine for performing adaptive Gaussian filtering. The user must supply seven parameters, briefly described within the listing, whose symbols are made to roughly match those in the text. Also supplied by the user is a function (called kleast) for selecting the k least values from an array. For a very efficient algorithm based on binary trees that operates in k log n time, please e-mail the corresponding author. Finally, the user must

supply a routine, metric2, that returns the squared distance between a pair of vectors. This may be a very simple function (easy to inline) of only a few lines, such as the following:

//returns the squared Cartesian distance between a pair of vectors:
inline float metric2(float *v1,    //first vector
                     float *v2,    //second vector
                     long m)       //number of dimensions
{
  float d;
  float diff;

  d=0;
  for (long i=0; i<m; i++) {
    diff=v2[i]-v1[i];
    d+=diff*diff;
  }

  return d;
}
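Since the kleast selection function mentioned above is left to the user, the following is one possible minimal sketch based on std::nth_element; the interface (returning the indices of the k smallest squared distances) is an assumption made here for illustration and differs from the binary-tree version referred to in the text.

// Select the indices of the k smallest values of d2 (assumes k <= n);
// the selected indices are returned unsorted.
#include <algorithm>
#include <numeric>
#include <vector>

void kleast(const float *d2, long n, long k, std::vector<long> &ind)
{
  ind.resize(n);
  std::iota(ind.begin(), ind.end(), 0L);
  std::nth_element(ind.begin(), ind.begin() + k, ind.end(),
                   [d2](long a, long b) { return d2[a] < d2[b]; });
  ind.resize(k);
}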