Fitting Densities and Hazard Functions with Neural Networks 

Colin Reeves, Charles Johnston

 School of Mathematical and Information Sciences, Coventry University, UK

Abstract

In this paper, we show that artificial neural networks (ANNs), in particular the Multi-Layer Perceptron (MLP), can be used to estimate probability densities and hazard functions in survival analysis. An interpretation of the mathematical function fitted is given, and the work is illustrated with some experimental results.

1 Introduction

Artificial neural networks (ANNs) have attracted a lot of attention over the past decade; partly this must be due to a certain fascination with the idea of imitating the human brain. While the biological background is interesting, however, it is sufficient for practical purposes to consider ANNs as a means of solving non-linear regression or classification problems, which comprise most of their practical applications (see [1] for a comprehensive survey). In this paper, we show that they can also be used for the estimation of probability densities and hazard functions in survival analysis. The most popular and pervasive ANN model is probably still the Multi-Layer Perceptron (MLP), which has produced impressive results in regression and classification problems.

2 Mathematical Formulation

The mathematical model that maps inputs to output for an MLP with several hidden layers is constructed as follows. We define the input to the $j$th unit in the $k$th layer to be

$$s_j^{(k)} = \sum_{i=0}^{N_{k-1}} w_{ij}^{(k-1)}\, x_i^{(k-1)}.$$

Here, $x_i^{(k-1)}$ is the output of the $i$th unit in layer $(k-1)$, $w_{ij}^{(k-1)}$ is the weight associated with the connection from unit $i$ in layer $(k-1)$ to unit $j$ in layer $k$, and $N_k$ is the number of units in layer $k$. Note that a variable $x_0^{(k-1)} \overset{\mathrm{def}}{=} 1$ is assumed at each level, to allow for a bias or constant term. The labelling given in this formulation implies that the input layer is labelled as layer 0, and it provides a consistent picture of the relationship between network layers, the associated net functions, and the sets of weights.
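As a minimal sketch of this layer indexing (our own illustration, not from the paper; NumPy is assumed and all names are hypothetical):

    import numpy as np

    def layer_net_input(x_prev, W):
        # s_j^(k) = sum_{i=0}^{N_{k-1}} w_ij^(k-1) * x_i^(k-1) for all units j,
        # where a leading 1 supplies the bias variable x_0^(k-1) = 1.
        # W has shape (N_{k-1} + 1, N_k): one row per unit of layer k-1, plus bias.
        x_aug = np.concatenate(([1.0], x_prev))
        return W.T @ x_aug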

The transmission section of each unit computes the `activation' $a_j^{(k)}$ of the unit, defined as $a_j^{(k)} = g(s_j^{(k)})$, which becomes the input $x_j^{(k)}$ to the next layer. The `squashing' function $g(\cdot)$ is most commonly the logistic function

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad (1)$$

which provides a continuous approximation to a threshold function by mapping the input to each unit onto the range (0,1). In the simplest case, we have an MLP with a single hidden layer of $p$ units and activation function $g(\cdot)$, and the mapping formed simplifies to the following composition of functions:

$$\varphi(\mathbf{x}) = g\Big(\alpha_0 + \sum_{i=1}^{p} \alpha_i\, g\big(\beta_{i0} + \sum_{j=1}^{n} \beta_{ij} x_j\big)\Big) \qquad (2)$$

where, for clarity, we use $\beta$ to denote the input-to-hidden weights, and $\alpha$ for the hidden-to-output weights. The biases are denoted by a subscript 0. This is a common formulation; in most cases the activation functions are the same for both the input-hidden and hidden-output connections. However, in some applications a non-linearity at the output may not be necessary, and a simple sum is sufficient. In the latter case, Eq. (2) becomes

$$\varphi(\mathbf{x}) = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, g\big(\beta_{i0} + \sum_{j=1}^{n} \beta_{ij} x_j\big) \qquad (3)$$
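As an illustration of Eq. (3), here is a minimal sketch in Python (our own, not from the paper; NumPy and a logistic activation are assumed, and all names are hypothetical):

    import numpy as np

    def sigma(x):
        # Logistic squashing function of Eq. (1).
        return 1.0 / (1.0 + np.exp(-x))

    def phi_linear_output(x, alpha0, alpha, beta0, beta):
        # Eq. (3): phi(x) = alpha0 + sum_i alpha_i * g(beta_i0 + sum_j beta_ij * x_j),
        # with g taken as the logistic function.
        # Shapes: x (n,), alpha (p,), beta0 (p,), beta (p, n).
        return alpha0 + alpha @ sigma(beta0 + beta @ x)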

3 Fitting Densities and Hazard Functions

A common problem in statistics is to use a sample of $m$ observations to estimate the probability distribution of the population from which the sample values were drawn. This is particularly difficult in the case of a continuous variable: simple methods such as the construction of histograms can be heavily influenced by the choice of group intervals and starting points. In this paper, we shall focus on the continuous univariate case. More sophisticated methods of density estimation rely on smoothing the histogram, perhaps by using kernel methods [2]. Fitting densities by these means is relatively demanding in terms of computational effort and understanding, although this is considerably alleviated by the development of libraries such as those available with modern statistical software.

A simple alternative is to use ANNs to fit the empirical distribution function (EDF), and given the rapidly spreading availability of software for fitting ANNs, this may be an attractive option. The output values are given as $(i - 0.5)/m$, where $i$ runs from 1 to $m$, approximating the cumulative probabilities from 0 to 1, while the corresponding input values are the order statistics $x_{[i]}$ (this construction is sketched in code below).

Another area where such a technique can be applied is the estimation of hazard functions in survival analysis. A common approach to survival data is to construct the Kaplan-Meier step-function as a non-parametric estimate of the survival curve, representing graphically the way the hazard changes over the course of time. Again, it seems obvious that this step-function could be smoothed by using a neural net.
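As flagged above, the EDF training pairs can be constructed as follows (a minimal sketch of our own, assuming NumPy; the function name is hypothetical):

    import numpy as np

    def edf_training_pairs(sample):
        # Inputs are the order statistics x_[i]; targets are (i - 0.5)/m,
        # approximating the cumulative probabilities from 0 to 1.
        x = np.sort(np.asarray(sample, dtype=float))
        m = len(x)
        targets = (np.arange(1, m + 1) - 0.5) / m
        return x, targets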

3.1 A Neural Net Approach

As has been made clear above, the result of fitting a neural net is a mathematical function. Thus, if we fit the EDF, we obtain a model for $F(x)$ which can be differentiated in order to estimate the density $f(x)$. Similarly, the Kaplan-Meier function is an empirical estimate of the survival function $S(t) = 1 - F(t)$. In fitting this with a neural net, we can find its derivative $-f(t)$, which in turn enables the estimation of the hazard function $h(t) = f(t)/S(t)$. From this the cumulative hazard

$$H(t) = \int_{0}^{t} h(u)\,du = -\log S(t)$$

can also be estimated. In the case of the logistic function of Eq. (1), this approach is particularly simple: the derivative of $\sigma(x)$ is

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \sigma(x)\,\big(1 - \sigma(x)\big).$$
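A minimal sketch of these calculations (our own, assuming the univariate form of Eq. (3) has been fitted to the Kaplan-Meier estimate; the parameter names stand for $\alpha_0$, $\alpha_i$, $\beta_{i0}$, $\beta_{i1}$ and are hypothetical):

    import numpy as np

    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    def survival_quantities(t, alpha0, alpha, beta0, beta1):
        # S(t) is the fitted net; f, h and H follow from the identities above.
        inner = sigma(beta0 + beta1 * t)
        S = alpha0 + np.sum(alpha * inner)
        f = -np.sum(alpha * beta1 * inner * (1.0 - inner))  # f(t) = -S'(t)
        h = f / S                                            # h(t) = f(t)/S(t)
        H = -np.log(S)                                       # H(t) = -log S(t)
        return S, f, h, H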

This is in fact a special case of the logistic density, which is in general the two-parameter function

$$\frac{e^{-(x-\mu)/\tau}}{\tau\big(1 + e^{-(x-\mu)/\tau}\big)^{2}},$$

whose mean is $\mu$ and whose variance is $(\tau\pi)^{2}/3$. This is a good approximation to a Normal distribution: the maximum difference between the standard Normal and logistic distribution functions is only 0.0228 [3]. In the univariate case of interest, the input to hidden-layer unit $i$ is of the form $\beta_{i0} + \beta_{i1}x$ (cf. Eq. (2)), from which it is easily seen that the unit represents a logistic density with mean $-\beta_{i0}/\beta_{i1}$ and variance $\pi^{2}/(3\beta_{i1}^{2})$.
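The implied component parameters can thus be read off the fitted weights, as in this minimal sketch (our own; the names are hypothetical):

    import numpy as np

    def implied_logistic_parameters(beta0, beta1):
        # Hidden unit i represents a logistic density with
        # mean -beta_i0 / beta_i1 and variance pi^2 / (3 * beta_i1^2).
        means = -beta0 / beta1
        variances = np.pi ** 2 / (3.0 * beta1 ** 2)
        return means, variances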

3.2 What does it mean?

For the case of Eq. (3), the function $\varphi(x)$ that is fitted to $F(x)$ will on differentiation become

$$\varphi'(x) = \sum_{i=1}^{p} \alpha_i \beta_{i1}\, g'(\beta_{i0} + \beta_{i1} x) \qquad (4)$$

Since, in the case of a logistic activation function, $\beta_{i1}\sigma'(\beta_{i0} + \beta_{i1} x)$ is the logistic density, Eq. (4) will also yield a density, provided that

$$\sum_{i=1}^{p} \alpha_i = 1.$$

This is, of course, not guaranteed by the neural net fitting procedure, although it would be possible to modify the algorithm by including a Lagrange multiplier $\lambda$ and adding (say)

$$\lambda \Big( \sum_{i=1}^{p} \alpha_i - 1 \Big)^{2}$$

to the mean squared error term. Whether this is worth doing will probably depend on the application: if all that is required is a reliable estimate of the (composite) distribution function, the individual components and their weights are of minor importance. However, if it is also hoped to identify the distinct components that make up the composite density, something like this approach would seem to be necessary.
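A minimal sketch of such a penalised error term (our own illustration; how the multiplier $\lambda$ is chosen or adapted is left open):

    import numpy as np

    def penalised_error(phi_pred, F_target, alpha, lam):
        # Mean squared error plus lam * (sum_i alpha_i - 1)^2, nudging the
        # hidden-to-output weights towards summing to one so that the
        # differentiated fit is a proper mixture density.
        mse = np.mean((phi_pred - F_target) ** 2)
        return mse + lam * (np.sum(alpha) - 1.0) ** 2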

We should note also that if Eq. (2) is used instead, with an additional non-linearity at the output, i.e.

$$\varphi(x) = g\Big( \alpha_0 + \sum_{i=1}^{p} \alpha_i\, g(\beta_{i0} + \beta_{i1} x) \Big),$$

we are guaranteed to have a density. However, how this relates to individual components that are assumed to exist is not clear, since the derivative of $\varphi(x)$ is no longer a simple linear combination of $p$ logistic densities:

$$\varphi'(x) = g'(v) \sum_{i=1}^{p} \alpha_i \beta_{i1}\, g'(\beta_{i0} + \beta_{i1} x) \qquad (5)$$

where, for clarity, we define

$$v = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, g(\beta_{i0} + \beta_{i1} x).$$

This is still a sum of $p$ components, and if we assume a logistic activation, $\beta_{i1}\sigma'(\beta_{i0} + \beta_{i1} x)$ is still a density. However, we cannot assume a simple sum of logistic densities, since the weights would be the values $\alpha_i\,\sigma'(v)$, which of course are a function of $x$. Alternatively, we could take the $\sigma'(v)$ term `inside' by treating $\sigma'(v)\,\beta_{i1}\sigma'(\beta_{i0} + \beta_{i1} x)$ as a density, and force $\sum_{i=1}^{p} \alpha_i = 1$ as before. However, we could equally well assume that the proportions are defined by the weights $\alpha_i$, and force $\sum_{i=1}^{p} \beta_{i1} = 1$ instead.
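To make Eq. (5) concrete, a minimal sketch (our own, assuming a logistic activation throughout; the names are hypothetical):

    import numpy as np

    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    def phi_prime_eq5(x, alpha0, alpha, beta0, beta1):
        # Eq. (5): phi'(x) = sigma'(v) * sum_i alpha_i * beta_i1 * sigma'(beta_i0 + beta_i1 * x),
        # with v = alpha0 + sum_i alpha_i * sigma(beta_i0 + beta_i1 * x).
        inner = sigma(beta0 + beta1 * x)
        v = alpha0 + np.sum(alpha * inner)
        sv = sigma(v)
        return sv * (1.0 - sv) * np.sum(alpha * beta1 * inner * (1.0 - inner))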

4 Experimental Work

To test these ideas out further, we generated 100 simulations, each consisting of 100 data points drawn from a mixture of two Normal distributions with means 0 and 4 respectively and a common variance of 1. The mixture was in the ratio 1:2. In each case, the methodology developed above was used to fit the EDF using a single-hidden-layer ANN with 2 hidden nodes and a logistic activation function. The fit obtained was extremely close, as measured by the $\chi^2$ criterion, but more interesting were the values of the logistic parameters. We would like to know if the above interpretation is practically useful. As we did not force the condition $\sum_{i=1}^{p} \alpha_i = 1$, the implied estimates of the proportions were not informative. (These are usually the most difficult quantities to estimate using conventional methods anyway.) The most useful parameters are the values of the means obtained from the logistic mixture that is implicit in the ANN. These were 0.98 and 5.10, with mean squared errors (MSE) of 1.22 and 1.61 respectively. At first sight these values seem rather poor; similar experiments in [4] using maximum likelihood (ML) techniques found MSE values two orders of magnitude better. However, on closer examination, it appears that the experiments in [4] were somewhat compromised by the fact that the true values of the means were used as initial values for the ML iteration. If you tell it what the answer is, it is not surprising if it works! In [4], the point was to evaluate the effectiveness of iterative ML techniques, and it is known that `good' initial estimates are very important. Our ANN technique, by contrast, makes no assumptions about the location of the means, and uses a logistic approximation to the Normal in any case; it has nevertheless obtained very respectable values, which could be refined by ML methods if required.
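For concreteness, the simulated data can be generated as follows (a minimal sketch of our own; we read the 1:2 ratio as weight 1/3 on the mean-0 component):

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_mixture(n=100):
        # Mixture of N(0,1) and N(4,1) with mixing weights 1/3 and 2/3
        # (our reading of the 1:2 ratio described above).
        from_first = rng.random(n) < 1.0 / 3.0
        return np.where(from_first,
                        rng.normal(0.0, 1.0, n),
                        rng.normal(4.0, 1.0, n))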

As a second experiment, a pilot study was carried out using a data set of patients suffering from malignant melanoma. The Kaplan-Meier (KM) survival probabilities for members of all subsets of the patient data can be used as a training set for the neural net. Patient data were categorised, and the subsets were defined by a hierarchical tree diagram. For each subset, a non-parametric survival analysis was performed. The output from this is a set of event and censoring times (i.e. the times when each patient leaves the study), together with values of the estimated KM probability of survival for each patient. The input to the neural net was then the patient factors (categorised covariates) and the times when they left the study. While this should not be taken as a report on the survival of such patients, since the data have been superseded, it is nonetheless useful as a vehicle for comparing the types of fit available from Cox proportional hazards (PH) models and the neural net approach.

The explanatory variables for these data were: Age (greater or less than 60 years); Gender; Clarke level (an indicator of the involvement of skin layers, which had four levels in this study); and pathological depth of the cancer (log-transformed and categorised into three levels for this study). The most popular parametric (or semi-parametric) model for survival analysis is Cox's proportional hazards (PH) model. Using the same data, a Cox PH model, which required an interaction term between age and Clarke level, was fitted in S-Plus. The results are illustrated in Figure 1 for the subset of older men with fairly severe skin cancer. The fit to the Kaplan-Meier survival plot obtained from the neural net was at least as good as that given by the Cox PH model for nearly all subsets, and in many cases (as here) actually better.

Fig. 1. Survival curves for malignant melanoma patients: Cox PH, Kaplan-Meier, and neural net fits (survival probability against time).

References

[1] S. Haykin (1999) Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall, London.

[2] M.P. Wand and M.C. Jones (1995) Kernel Smoothing, Chapman and Hall, London.

[3] N.L. Johnson, S. Kotz and N. Balakrishnan (1995) Continuous Univariate Distributions, Vol. II, Wiley, Chichester.

[4] B.S. Everitt and D.J. Hand (1981) Finite Mixture Distributions, Chapman and Hall, London.
