The Development of Risk Adjusted Control Charts and Machine Learning Models to Monitor the Mortality Rate of Intensive Care Unit Patients
David A. Cook B.Med.Sc., M.B.B.S., F.A.N.Z.C.A., F.J.F.I.C.M.
Submitted for the Degree of Doctor of Philosophy, School of Information Technology and Electrical Engineering, University of Queensland. Date of Submission: December 2003
Abstract To promote the highest quality healthcare, it is necessary to monitor the outcomes of patient care. The tools to monitor the outcomes of patient care are not optimal. This thesis develops risk adjusted control chart methods for monitoring in-hospital mortality outcomes in Intensive Care Unit (ICU) patients. It is a medical application of statistics and machine learning to measure outcomes for quality management. Three directions of investigation are followed to achieve this.
The first is the assessment methods of ICU models that predict the probability of death. The desirable attributes of model performance are discrimination, the ability to separate survivors and non-survivors, and calibration, a measure of the extent to which model risk prediction represents that patient's actual risk of dying. An independent assessment of the APACHE III model used in the Princess Alexandra Hospital ICU demonstrated good discrimination and calibration, and so the model was validated in this context for prediction of in-hospital mortality.
The second is a study of statistical process control charts for patient mortality rate. Risk adjusted control chart techniques are subsequently developed to incorporate the validated APACHE III probability of death estimate to control for casemix and severity of illness. The design and performance of these control charts are studied.
The third direction is the development of an alternative model for risk adjustment. The results of preparatory experiments using machine learning techniques, artificial neural networks (ANNs) and support vector machines (SVMs), were comparable to those previously obtained with logistic regression. SVM models are further investigated to model 30-day in-hospital mortality, using raw patient data from the equivalent of one year of patient admissions. Model development is successfully guided by the desirable attributes of model performance: discrimination and calibration.
The conclusions of this study are: 1) risk adjusted control charting offers an adjunct to current methods of ICU outcome assessment when monitoring the quality of care; 2) SVMs and ANNs are practical approaches to model the probability of in-hospital mortality for ICU patients; 3) model development can be guided by optimization of the model attributes of discrimination and calibration.
Declaration of Originality The work presented in this thesis is to the best of my knowledge and belief, original and my own work, except as acknowledged in the text. The material has not been submitted either in whole or in part for a degree at this or any other university.
0.7, good if > 0.8 and excellent if > 0.9. Rosenburg holds the opinion that areas under the ROC curve of "... 0.8 or better are expected for mortality predictions in current models, and scores of 0.7 or less are considered to be unacceptable." The references on which he bases this provide only opinion. My conclusion, drawn from a review of the published experience with classification performance of ICU models (Table 2.1), is that Rosenburg's position is correct. Contemporary ICU models that estimate probability of death should have an area under the ROC curve in the range of 0.80 - 0.90, with less than 0.70 being unacceptable. Steen gives an opinion that 0.90 may approach the upper limit for generalised discrimination performance for models based on biological measurements.
For small datasets, a non-parametric method to estimate the ROC area based on the Wilcoxon rank sum test (Mann-Whitney U test) or the calculation based on a series of trapezoids is appropriate. With large datasets, the "staircase" effect of discrete values is less important, and parametric curve fitting and non-parametric approaches give minimal difference in the calculated values of the area under the ROC curve and its standard error. The standard error can be used to compare the areas under ROC curves and to estimate confidence intervals. Where the performances of two models are compared on the same dataset, an alternative non-parametric method using paired comparisons is more powerful.
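The Mann-Whitney interpretation of the ROC area, as the probability that a randomly chosen non-survivor is assigned a higher risk estimate than a randomly chosen survivor, can be sketched as follows. This is an illustrative O(n_d × n_s) computation, not an efficient implementation, and the function name is my own.

```python
def auc_mann_whitney(p_deaths, p_survivors):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the proportion of (non-survivor, survivor) pairs in which the
    non-survivor received the higher risk estimate (ties count 1/2)."""
    wins = 0.0
    for pd_ in p_deaths:
        for ps in p_survivors:
            if pd_ > ps:
                wins += 1.0
            elif pd_ == ps:
                wins += 0.5
    return wins / (len(p_deaths) * len(p_survivors))

# Toy example: risk estimates for 3 non-survivors and 4 survivors.
deaths = [0.9, 0.6, 0.4]
survivors = [0.1, 0.2, 0.5, 0.3]
print(round(auc_mann_whitney(deaths, survivors), 3))  # 0.917
```

On large datasets the same quantity is usually obtained from ranks rather than all pairs, which is equivalent but far cheaper.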
At any decision threshold, the likelihood ratio is the ratio of the predicted probability of death in a patient who subsequently dies to the predicted probability of death in a patient who subsequently survives. It is the slope of the ROC curve, being the change in sensitivity divided by the change in (1 - specificity) over a given range of values. It provides important information about the performance of a model. The likelihood ratio is the relationship between the pre-test and post-test probabilities of mortality.
2.4.2 Calibration. Calibration is an attribute of model performance that reflects the extent to which a risk prediction represents that patient's actual risk of dying. An overall summary statistic, like the standardised
mortality ratio (SMR), provides information about how the overall mortality rate agrees with the mortality prediction for the sample. Other statistics and scores have been suggested, in non-ICU applications, by Brier, Yates, Hilden and Flora. Goodness-of-fit approaches analyse model fit in risk strata, grouping patients according to estimated risk (Hosmer and Lemeshow, Spiegelhalter).
To assess the calibration of a model, a global assessment of calibration and an analysis of fit in risk intervals should be considered.
Evaluation of Overall Model Prediction
There are statistics for assessing how well the model estimates compare with the actual mortality rate of the sample of patients. They compare the observed number of deaths to the predicted number of deaths. Large departures indicate failure of the model to predict the probability of patient death in that context.
1. Standardised Mortality Ratio
The SMR is a commonly used statistic. It is the ratio of observed deaths to predicted deaths, or observed mortality rate to predicted mortality rate (Equation 2.1). In the example, there are $n$ patients indexed by $i$; $\pi_i$ is the estimate of the probability of death provided by the model. For a patient who dies, the outcome $Y_i = 1$; $Y_i = 0$ if the patient survives.

$$\mathrm{SMR} = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} \pi_i} \qquad \text{Equation 2.1}$$
The hypothesis that there is no difference between the observed mortality rate and the predicted mortality rate can be formally tested by chi-squared or binomial methods. Confidence intervals (CI) estimate the precision of the SMR based on various assumptions about the relevant sampling distribution. A number of approximations to the binomial distribution have been reviewed, and the choice is a balance between simplicity and accuracy.
The useful estimate of the standard error for the term $\sum_{i=1}^{n} Y_i$ arises from the variances of all the individual predictions:

$$SE = \sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}$$

The estimate of the 95% CI of the SMR is then:

$$\mathrm{SMR} \pm 1.96 \cdot \frac{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}{\sum_{i=1}^{n} \pi_i}$$
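Equation 2.1 and the confidence interval above translate directly into code. A minimal sketch; the function and variable names are my own, not the thesis's.

```python
import math

def smr_with_ci(outcomes, predictions):
    """SMR = observed deaths / predicted deaths (Equation 2.1), with a
    95% CI built from SE = sqrt(sum pi_i * (1 - pi_i)) on the observed
    death count, divided through by the predicted deaths."""
    observed = sum(outcomes)                      # sum of Y_i
    predicted = sum(predictions)                  # sum of pi_i
    se = math.sqrt(sum(p * (1 - p) for p in predictions))
    smr = observed / predicted
    half_width = 1.96 * se / predicted
    return smr, smr - half_width, smr + half_width

# Toy example: 5 patients, 2 deaths, 1.5 predicted deaths.
y = [1, 0, 1, 0, 0]
pi = [0.6, 0.1, 0.5, 0.2, 0.1]
smr, lo, hi = smr_with_ci(y, pi)
```

An SMR whose CI excludes 1 suggests a real discrepancy between observed and predicted mortality, subject to the sampling assumptions above.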
2. Flora's Z score
Flora's Z score is similar to the method of using the SMR with confidence intervals and has the same advantages and limitations. It compares the observed number of deaths and the predicted number of deaths using the difference rather than the ratio of these numbers; the statistic is then standardised by the estimate of the standard error.
$$Z_{\mathrm{Flora}} = \frac{\sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \pi_i}{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}$$
The Flora score has been used in a Greek study to compare APACHE II and SAPS II in a single institution.
Where there is a large dataset, a chi-squared statistic has been proposed by Miller and Hui, though it has not been used to assess ICU models that predict in-hospital mortality. The Hosmer-Lemeshow (H-L) statistics, described below, can be regarded as a special case of this chi-squared statistic, where continuous probability estimates are grouped for analysis.
3. Mean Squared Error
The mean squared error (MSE) of the predictions, also known as the Brier score, can be used to assess model fit.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \pi_i)^2$$
This method is particularly useful where very large datasets exist. The MSE can be decomposed into components that reflect characteristics of outcome prevalence (and variance of the outcome), bias, noise, model complexity and over-fitting. The MSE has several drawbacks. Whether the model is overestimating or underestimating the probability of death is not apparent from the MSE. Also, the MSE is dependent on the mortality rate, which is a characteristic of the context as much as of the model performance.
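The Brier score itself is a one-line computation; a minimal sketch, with the symmetry drawback noted in the comment:

```python
def brier_score(outcomes, predictions):
    """Mean squared error of probability predictions (Brier score).
    The squared term is symmetric, so the score cannot reveal whether
    the model over- or under-estimates the probability of death."""
    n = len(outcomes)
    return sum((y - p) ** 2 for y, p in zip(outcomes, predictions)) / n

# A perfect model scores 0; a constant 0.5 prediction scores 0.25
# regardless of the outcomes, illustrating the loss of direction.
print(brier_score([1, 0], [1.0, 0.0]))      # 0.0
print(brier_score([1, 0, 1, 0], [0.5] * 4))  # 0.25
```

The dependence on mortality rate is visible here too: an uninformative model that always predicts the sample prevalence achieves an MSE of prevalence × (1 - prevalence), which varies with the context.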
Calibration Curves
Calibration curves allow qualitative evaluation of model fit across risk intervals and are widely used to evaluate ICU mortality models. The agreement between predicted and observed mortality in intervals defined by predicted risk can be displayed graphically with a curve comparing model predictions with observed frequencies. Patients are grouped into contiguous intervals of predicted risk. For each risk interval, the mortality rate is plotted against the mean estimated probability of death. A perfectly calibrated model will have a calibration curve with a slope of 1 and an intercept at the origin. For assessment of ICU models, groups can usually be defined by intervals of 0.1 or 0.05 in the estimated probabilities of death, depending on sample size. Figure 2.1 is an example of a calibration curve of the APACHE III model for the PAH ICU database 1995 - 1997. A full description of this evaluation of the model is provided in Chapter 3.
Figure 2.1: Calibration curve for the APACHE III hospital mortality model, PAH ICU 1995 - 1997. Observed mortality (+/- 95% confidence intervals) v predicted mortality in 10 deciles of risk. The APACHE III model predicts in-hospital risk of death with adjustment for hospital characteristics. [Figure: x-axis, APACHE III predicted hospital mortality, 0 - 1; y-axis, observed mortality, 0 - 1.]
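The grouping that underlies such a curve can be sketched as follows. The function name, interval handling and CI formula (normal approximation to the binomial, as discussed below) are my own choices for illustration.

```python
import math

def calibration_points(outcomes, predictions, width=0.1):
    """Group patients into contiguous risk intervals of the given width.
    For each non-empty interval return: mean predicted risk, observed
    mortality rate, and a 95% CI from the normal approximation to the
    binomial. Plotting observed against predicted gives the curve."""
    bins = {}
    n_bins = int(round(1 / width))
    for y, p in zip(outcomes, predictions):
        k = min(int(p / width), n_bins - 1)  # fold p == 1.0 into top bin
        bins.setdefault(k, []).append((y, p))
    points = []
    for k in sorted(bins):
        ys, ps = zip(*bins[k])
        n = len(ys)
        obs = sum(ys) / n
        half = 1.96 * math.sqrt(obs * (1 - obs) / n)
        points.append((sum(ps) / n, obs, obs - half, obs + half))
    return points
```

With real data the returned tuples would be plotted with the identity line for reference; a well calibrated model tracks the diagonal within the interval CIs.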
This graph gives a visual representation of the agreement, or calibration, at each of the levels of risk. Small data samples can produce irregular or noisy curves or empty risk intervals, though smoothing functions are available. Small numbers are particularly likely in the strata of higher risk of death.
The CI for the observed mortality rate in each interval can be estimated based on a normal approximation to the binomial. The relationship between the calibration curve and its degree of estimated variation allows qualitative appraisal of model performance, and is useful for making comparisons between models. In each interval, CI give an indication of the precision of the mortality estimate and account for sub-group size and random variation. However, the sub-groups in each interval are not independent. As an alternative to CI, some authors include a histogram of patient numbers by interval. Either approach allows the reader to infer the likely precision of the estimates in each interval, based on the number of cases in the intervals.
However, the calibration curve approach does not provide a way to test hypotheses about the adequacy of fit across all intervals. The lack of independence of the values, the issues of small numbers and precision, and the problems of false positives and multiple testing limit the quantitative inferences that can be drawn from calibration curves.
Hosmer-Lemeshow statistics
The Hosmer-Lemeshow (H-L) statistics (C and H) were proposed for the assessment of logistic regression models. Their use has been adapted and extended to include prospective, independent model validation.
H-L tests compare the observed against the predicted numbers of deaths and survivors in intervals of risk. Most applications that assess the calibration of ICU models have used 10 risk intervals. For the C statistic, patients are ranked according to predicted risk of death and divided into 10 near-equal groups. The H statistic uses the sample divided into 10 contiguous risk intervals of equal width but unequal number. The C and H statistics are chi-squared-like statistics calculated from a 4 x 10 table of observed and estimated mortality and survival. The value of the test statistic is compared with the chi-squared distribution with an appropriate number of degrees of freedom.
The number of degrees of freedom when the model is being assessed on the developmental dataset is the number of risk strata minus 2. By convention in the ICU literature, there are usually 10 deciles of risk, and so 8 degrees of freedom. The degrees of freedom for the chi-squared distribution when prospective independent validation is performed are equal to the number of risk intervals. Again, by convention there are 10 intervals, though 8 or 9 intervals are reported when there are small samples. With the H statistic, the risk intervals may contain few or no cases, requiring combination of intervals or use of an alternative method.
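The C statistic as described above can be sketched as follows. The function and variable names are mine, and predictions are assumed to lie strictly between 0 and 1 so that no expected count is zero.

```python
def hosmer_lemeshow_c(outcomes, predictions, groups=10):
    """H-L C statistic: rank patients by predicted risk, split them into
    `groups` near-equal groups, and sum (O - E)^2 / E over the observed
    and expected counts of deaths and survivors in each group.
    Assumes every prediction is strictly between 0 and 1."""
    pairs = sorted(zip(predictions, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        e_deaths = sum(p for p, _ in chunk)      # expected deaths
        o_deaths = sum(y for _, y in chunk)      # observed deaths
        e_surv = len(chunk) - e_deaths           # expected survivors
        o_surv = len(chunk) - o_deaths           # observed survivors
        stat += ((o_deaths - e_deaths) ** 2 / e_deaths
                 + (o_surv - e_surv) ** 2 / e_surv)
    return stat
```

The resulting value would then be compared against the chi-squared distribution with the degrees of freedom appropriate to the setting, as discussed above.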
As with all of the methods for model assessment studied in the current context, the H-L statistics are vulnerable to changes in patient casemix and the distribution of severity of illness, as well as model fit. The power of the analysis will depend on the sample size. Small samples tend to lack power to recognise poor fit. Conversely, large samples are more likely to suggest poor fit. Rowan et al. state that "A significant departure from the null hypothesis does not necessarily imply a bad fit, just that imperfections are of such a size that they can be detected in a large sample size". For comparisons on the same dataset, it is the magnitude of the chi-squared statistic that indicates the better model fit. In practice, it is necessary to decide on non-statistical grounds what level of fit is clinically acceptable.
Notwithstanding the effect of sample size and the inconsistencies in their use, the H-L statistics are widely used, and provide useful context-specific information about calibration.
Spiegelhalter's Z score
A similar approach to the H-L H statistic has been proposed by Spiegelhalter. His method calculates standardised Z scores for each of 10 risk intervals based on the observed mortality rate and the standard error of the risk estimates in the interval. It is assumed that Z will be approximately normally distributed; scores greater than 1.96 or less than -1.96 imply that the model is poorly calibrated in that interval.
Overall, the Spiegelhalter score provides an assessment of fit of the model predictions across all the intervals. It is calculated as the sum of the squares of the Z scores for the intervals and is then compared to a chi-squared distribution with 7 degrees of freedom.
By standardising in each risk interval and accounting for patient numbers and the distribution of risk of death, comparison between different contexts may be possible provided the samples are large. A further advantage of the Spiegelhalter score is that it can be calculated from the tables presented for H-L statistics. The Spiegelhalter method of calculating the standard error gives a conservative test. A major shortcoming arises with intervals containing small numbers of patients.
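One plausible reading of the per-interval calculation described above can be sketched as follows. This is my reconstruction for illustration, not Spiegelhalter's published formulation; the names and interval handling are invented.

```python
import math

def spiegelhalter_interval_z(outcomes, predictions, width=0.1):
    """For each non-empty risk interval, a standardised Z score comparing
    observed with expected deaths, using SE = sqrt(sum pi*(1-pi)) within
    the interval. Returns the per-interval Z scores and their sum of
    squares, which would be referred to a chi-squared distribution."""
    bins = {}
    n_bins = int(round(1 / width))
    for y, p in zip(outcomes, predictions):
        k = min(int(p / width), n_bins - 1)
        bins.setdefault(k, []).append((y, p))
    zs = []
    for k in sorted(bins):
        ys, ps = zip(*bins[k])
        se = math.sqrt(sum(p * (1 - p) for p in ps))
        if se > 0:
            zs.append((sum(ys) - sum(ps)) / se)
    return zs, sum(z * z for z in zs)
```

Intervals with very few patients give unstable Z scores, which is the shortcoming noted above.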
No assessment of performance of ICU mortality models has used the Spiegelhalter method to date.
Model Based Analysis of Performance
Model based analyses of performance are more often used during fine-tuning of a model on a developmental dataset, or for recalibration, than for validation studies. Real ICU data may violate some of the assumptions on which both the modelling and the subsequent analysis are carried out. However, Ash and Schwartz observe that "...these algorithms for transforming data are judged primarily by how closely their predictions match reality, rather than the extent to which underlying assumptions are met."
Approaches based on a logistic regression model allow analysis of the relationship between the estimated risk of death and observed outcomes. These assessment methods naturally allow new or refined ICU outcome models to be developed by adjustment of the parameters of an existing logistic regression model.
2.5 Recommendations for a Practical Approach to Validation of ICU Models that Estimate the Probability of In-Hospital Death
An ICU model that accurately estimates the probability of in-hospital mortality can potentially be used as a risk adjustment tool to analyse mortality outcomes in an ICU. The model's performance must, however, be thoroughly evaluated to determine whether the estimates of probability of death are accurate.
The recent publication of guidelines for standard reporting of the accuracy of diagnostic tests by the Standards for Reporting of Diagnostic Accuracy (STARD) initiative provides a useful example of an explicit and rigorous checklist to serve as a guide to methodology and documentation. Reports on the performance of ICU models that estimate the probability of in-hospital death share similar characteristics with reports on the accuracy of a diagnostic test. Therefore, the systematic approach used by STARD (Checklist for Reporting Diagnostic Accuracy) provides a framework for my recommendations.
The Title, Abstract and Keywords should identify the report as that of the assessment of the performance of an ICU mortality prediction model.
The Introduction should clearly state the aim of the report. It may be to develop and introduce a new model, to validate an existing model, to compare several models, or to adjust an existing model.
The Methods section should describe the context of the analysis and dates of data collection. A single or multiple ICUs, and the type of hospital and ICU should be described. The study population and the method of patient eligibility and exclusion should be described. This will allow the reader to assess
independence of the validation process, conditions affecting model performance, and the applicability of the model to a situation of interest.
An account of rules of data collection and a description of the model are essential. The mortality and survival endpoints must be defined. The methods by which patient data are divided into sets for model development and testing must be given. A description of the statistical methods used to develop the model and to assess its performance should be given.
The accuracy of ICU outcome models should be assessed in terms of discrimination and calibration.
For discrimination, the area under the ROC curve should be calculated, with standard error or confidence intervals for precision. For two models on the same sample, pair-wise comparison of models should be done.
Calibration should be assessed by an overall assessment statistic and an assessment of fit across risk intervals. The most commonly used global indication is the SMR with confidence intervals. A graphical approach with a calibration curve using confidence intervals should be presented. A numerical evaluation of goodness-of-fit using the H-L statistics or the Spiegelhalter approach should be used. If more than one model is being evaluated on the same dataset, then a statistical comparison, using H-L statistics or Spiegelhalter Z scores is suggested.
The Results section should state when the data collection was performed. A description of the sample should include characteristics of age, gender and major diagnostic categories, severity of illness measurements and mortality rate. If this is an independent evaluation of a model, it is useful to compare the patient characteristics of the validation data set with those of the sample on which the model was developed. All missing or incomplete records must be accounted for.
The recommended approach described above is applied to analysis of APACHE III in an Australian adult ICU in the next chapter.
2.6 Survey of Models that Estimate the Probability of In-Hospital Death of ICU Patients
Table 2.1 presents a tabular survey of studies that report the performance of ICU models to estimate the probability of in-hospital death.
The performance of all models on the developmental datasets is consistently better than on subsequent independent assessment. As expected, the issues of reproducibility and transportability of the models require that all models which estimate the probability of in-hospital death of ICU patients are validated at the site where the models will actually be used.
2.7 Conclusion
The key attributes of models that estimate the risk of death of patients in the ICU are discrimination and calibration. The area under the ROC curve is the best measure of discrimination. For models predicting contemporary ICU mortality, an area under the ROC curve in the range of 0.80 - 0.90 is expected. Model calibration is an attribute that is more difficult to capture. Though calibration curves provide a qualitative representation, statistical evaluation of calibration is done using H-L goodness-of-fit statistics. In practice, it is necessary to decide on non-statistical grounds what level of fit, and what maximum value, is acceptable. Goals for model performance using these measures will guide development of new models in Chapters 6 and 7.
[Table 2.1: Survey of studies reporting the performance of ICU models that estimate the probability of in-hospital death. The table layout was lost in extraction. The recoverable column headings are: Authors; Models; Comments; Discrimination (area under the ROC curve; classification matrix thresholds, with 95% confidence intervals or standard error); Calibration (overall assessment; H-L statistic; graph; classification matrix of sensitivity, specificity, positive and negative predictive values and correct classification rate at various thresholds). Reports are presented in chronological order of publication. * denotes absence of information (e.g. calibration not reported). Recoverable entries include the APACHE II developmental paper (USA, 13 ICUs, 5815 admissions; ROC curve displayed, area not stated) and studies of APACHE II modified with best GCS (Jacobs et al. 1987; Saudi Arabia, single ICU, Nov 1984 - Nov 1985, 210 admissions; Saudi Arabia, single ICU, Nov 1984 - April 1987, 583 admissions; New Zealand, 2 ICUs, Dec 1982 - Nov 1983, 1005 patients).]
0.8) or excellent (area under the ROC curve > 0.9). The area under the ROC curve was similar to that of APACHE III in-hospital mortality predictions on the developmental data set, the USA prospective multi-centre validation series and the UK multi-centre series. However, the discrimination of the APACHE III model is vulnerable to differences in case mix, clinical practice and data collection conditions, given the lesser performance in the multi-centre Brazilian series, and single institution studies from England and Germany.
In this sample, the H-L statistics, the calibration curves and the global agreement between observed and predicted outcomes concur that, among the models studied, only the unadjusted APACHE III hospital mortality model displayed inadequate fit. Comparison with other published calibration curves for APACHE III shows that the calibration at the PAH resembles the curves of the North American ICUs.
Despite differences in case mix and referral patterns between the PAH sample and the APACHE III developmental sample, the performance of the APACHE III models adjusted for hospital characteristics was good. Other analyses of APACHE III have only assessed the performance of the unadjusted APACHE III hospital mortality models. A UK validation study, with a patient sample having a higher average APACHE III score, more co-morbidities and different referral sources, found excellent discrimination. However, there was a 25% higher than predicted mortality rate and excess mortality in all risk ranges, indicating poor calibration. Differences in casemix were proposed as the likely reason for higher than predicted hospital mortality. A German study describes a 22% higher than predicted mortality, with data collection anomalies, casemix, lead-time bias, model inaccuracy or quality of care cited as possible contributors.
In contrast to other work, the present study showed that the APACHE III model can be applied to a patient population with a different case mix and referral pattern, outside of the USA, and produce similar performance to that observed on the APACHE III development and validation series.
The validation of a model that estimates the risk of death on an independent data set implies model validity and the reliability of variables and data collection methods. Potential for bias and inaccuracy, and threats to model performance, can arise from local anomalies of clinical practice, casemix or data collection. The apparent variability of performance of ICU outcome or risk adjustment models mandates that these models must be closely examined at each site where they are used before conclusions are drawn about comparative or historical performance.
In the PAH series during the study period, the APACHE III mortality estimates, particularly with proprietary adjustment for hospital characteristics, provided both good discrimination and good calibration. This supports the validity and robustness of APACHE III variables, data collection and the model for mortality predictions.
The local performance of APACHE III will therefore allow its use as a locally validated risk adjustment tool to analyse mortality outcomes at the PAH ICU.
Chapter 4 Control Charts for Analysis of Mortality Outcomes in Intensive Care.
4.1 Introduction The purpose of the next two chapters is to develop methods to continuously monitor outcomes of intensive care unit (ICU) patients using an adjustment for risk of death. In Chapter 4, control charts are applied to analyse ICU deaths. The choice of charting parameters is made by modelling the occurrence of false alarms and detection of changes in the mortality rate. Control charts incorporating risk adjustment (RA) are introduced in Chapter 5.
4.1.1 Overview Statistical analyses can be used to detect a change in the pattern of the outcomes of a process. A change in observed mortality rates for patients may be caused by various factors including a change in the process of patient care. Therefore, monitoring outcomes in a clinical setting may provide information to detect and influence improvements in the quality of patient care.
In this chapter, several approaches to track ICU mortality will be presented. Process control chart analysis will be applied to mortality at the PAH ICU during the period 1 January 1995 to 31 December 1999. The APACHE III model for estimating the in-hospital death of ICU patients, with proprietary adjustments for hospital characteristics, was shown in Chapter 3 to have good discrimination and calibration on the PAH ICU patient population. Therefore it will be used as a RA model for development of the RA control charts that are presented in Chapter 5.
The data used in the following chapters cover an additional two-year period beyond those used for the validation study of Chapter 3, which was published in 2000. A summary of the larger data set is provided in Appendix 1. An update on the performance of the APACHE III model on this data set has been published.
This project was commenced using the 1995 - 1997 patient dataset. This initial 3 year period is used as a period of historical observation. The chart monitoring will be applied to the subsequent data for 1998 - 1999.
4.2 Considerations in Monitoring In-Hospital Mortality of ICU Patients. An approach for monitoring ICU mortality should have several characteristics.
Primarily the methods must track the mortality rate and analyse the mortality observations in the context of a standard mortality rate. For the charts presented in this chapter, the expected mortality rate was derived from historical observations from the 1995 - 1997 patient data.
ICU mortality monitoring must include all patients that are admitted to the ICU, so that a global picture of ICU mortality is obtained which reflects the care given to all patients. This will also maximize the number of patients and the power of the analysis, and may minimize the delay in recognizing changes.
The data collection and monitoring must not distort the care provided to patients. Protocols have a place to guide the quality and efficiency of patient care. However, protocols that exist solely to improve data analysis must not constrain patient care or interfere with changes to the system of care. It is important that the usual care of patients is being assessed, rather than the effects of an experimental-type intervention.
The results of analysis must be available in a timely manner without unnecessary delay. This can be aided by using a sequential analysis technique or by grouping patients into the smallest practical samples that provide adequate power. Large sample groups potentially miss temporal relationships, such as seasonal cycles or other periodic influences.
The mortality outcome used for the development of the control chart methods is the APACHE III endpoint of patient survival to hospital discharge. ICU patient hospital stay can be very long, up to months or even years as an inpatient. A complete dataset can only be analysed after all patients are dead or discharged from hospital, which can take months or years. Early analysis tends to be biased toward the deaths, and I have found in practice that use of in-hospital mortality as the endpoint limits the timeliness of analysis. This observation is consistent with opinion in the literature, which has called for ICU patient survival to be reported at fixed times post ICU admission. The considerable limitation in using in-hospital mortality as the outcome is addressed in subsequent chapters of this thesis. When new models are developed for RA charting in this thesis, the outcome of 30-day in-hospital mortality is used. Patient outcomes can therefore be analysed 30 days after admission.
4.3 Control Chart Analysis of ICU Mortality
Process control charts, such as the p chart, CUSUM chart and EWMA chart, are suitable methods for detecting changes in proportions of binary outcomes of a process. In the ICU, the process under scrutiny is the complex milieu of patients and patient care. The outcome to be monitored is in-hospital mortality.
Variations in the mortality rate may be due to common cause or special cause variations. Common cause variations are related to the nature of the process and include chance variations. A process is operating "in-control" if the variability is due only to random variation. Control charts provide tools for monitoring and analysis by tracking whether the outcome of the process conforms to the expected in-control variability. The variance of a process in-control will account for these chance effects, and is the basis for the calculation and presentation of control limits.
A process is "out-of-control" when the output does not have a stable distribution and the variation is attributable to causes other than random variation. These are special causes, or assignable causes, of variation. They can be due to temporary or new factors that were not part of the in-control process. Examples in an industrial context usually include defective raw materials, improperly adjusted machines or operator errors. In the ICU, special cause variations causing a fall in mortality occur with an increase in low risk elective surgery, transfer of low risk patients from a nearby ICU or, potentially, a systematic increase in quality of care. In contrast, a decrease in mortality rate due to a chance run of unexpected survivors would represent a common cause variation.
The control parameters for a control chart are estimated from a stable, historical period of observation while the process is in-control. In this application, the initial period 1 January 1995 - 31 December 1997 of 36 monthly observations, or 31 blocks of 100 patients, was used. Appendix 2 presents an analysis which establishes that it was reasonable to assume that the ICU process and the mortality rate were in-control during that period.
Where a change in the distribution of the outcome observations exceeds a level that is expected by chance, a signal will occur. Such a signal should prompt several actions in an ICU. The first is an evaluation of the likely importance of the signal by considering the sensitivity and false alarm characteristics of the chart, which are calculated or simulated during the chart design process. Other alternative, complementary patient data will also be examined to differentiate between common or special causes of variation. Secondly, if a systematic change has occurred, an examination of the process in an appropriate and timely manner will be conducted, including a search for the assignable causes. A systematic and acceptable increase in mortality rate due to more high risk patients will lead to a revision of the expected mortality rates. An increase in low risk elective surgery will require a revision down of the expected mortality rate. On the other hand, an increase in mortality rate attributed to, say, premature patient discharge or inadequate medical staff numbers would demand intervention and improvement of the care offered to patients, without revision of the expected mortality targets. Equally important would be a true fall in mortality that was not attributable to severity, casemix or other expected influences. A search for the causes of improvement could provide hypotheses for further clinical investigation.
The key to differentiating between random variations and a systematic change in the underlying process lies in the design and performance of the control charts.
Design of clinical trials considers the power of the analysis and the Type I and Type II errors. Analogous charting concepts are studied by analysing run length to signal. With control charts it is important to understand how long a process might run before a signal is expected when the process is in-control (false alarm), and when conditions change (true positive). The expected pattern of false
alarms and true positive signals can be quantified as the average run length to signal (ARL) under in-control and under changed conditions.
A clinician can design the monitoring scheme prospectively. The information required is: the in-control conditions, the changed conditions to be detected, the tolerable ARLs for in-control and changed conditions, and the patient numbers. The performance of a specific chart method is predictable and the choice and design of charts can be tailored to specific applications.
There are limitations to monitoring ICU mortality that arise from the assumptions of the control chart model.
Independence of observations (patient outcomes) is an important assumption of control charting, but is difficult to assess. Correlations between successive outcomes may be related to cycles of activity, staffing, trauma, operating theatre lists and casemix. Staff learn on the job, so annual or term related cycles of staff gaining experience may affect the process. Public and school holidays affect casemix and staff availability. Annual budgetary cycles may have effects on patient casemix, particularly the volume of elective surgery. There is seasonal variation in many diseases such as heart disease, asthma and communicable diseases like respiratory tract infections, meningococcal disease or flaccid ascending paralysis.
Sequential effects may also exist. Nosocomial infection transmission increases the risk of exposure and cross infection and may cause clusters or outbreaks of infection. High staff activity levels followed by periods of exhaustion and low staffing levels may influence the quality of care. The recent experience of clinicians could affect future clinical decisions. Business considerations may impact on activity levels, which in turn could affect admission and discharge criteria, resource availability and withdrawal of therapy.
There is therefore the possibility of lack of independence of patient outcomes and the possibility that the in-control mortality observations do not conform to the predicted distribution. In either of these cases, the control chart analysis would be undermined.
For p charts and the CUSUM, it is assumed that the individual patient outcomes are random variables and a normal distribution can be used to calculate control limits. Departure from normality will prevent control limits from being correctly estimated. Appendix 2 presents a thorough analysis of the observations of monthly mortality rates and the mortality rates of blocks of 50 and 100 consecutive cases (1 January 1995 - 31 December 1997). The monthly and 100 case block mortality observations appear to have a normal distribution and there is no evidence of non-random clustering or mixing. The estimate of the standard deviation of the binomial distribution agrees with the observed standard deviation of mortality rate observations. There is some evidence for a 3 month (or 300 case) cycle in mortality rates.
The following discussion concentrates on the use of control charts to monitor the mortality rate of ICU patients unadjusted for any factors such as casemix.
4.4 Application of Control Charts to PAH ICU Data

4.4.1 p Chart

The Shewhart p chart is a control chart for the proportion of an attribute in a sample. Control limits are based on a normal approximation of the binomial distribution, with parameters of sample size and control rate of the attribute.
To plot a p chart for patient mortality rate, the statistic

$$\hat{p}_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}$$

the monthly mortality rate, is plotted each month, where the month or sample is indexed by $i$ and the $n_i$ patient outcomes are indexed by $j$. The patient outcome $Y_{ij}$ is 1 if the patient dies in hospital and 0 if the patient survives to hospital discharge.
The target $\bar{p}$ is the mean mortality rate during an in-control period for the process. The value

$$\sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}}$$

is an estimate of the standard deviation ($\sigma$) of the in-control process.

For large samples, control limits are calculated by a normal approximation to the binomial distribution:

$$CL_i = \bar{p} \pm a\sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}}$$

where $a$ is the number of $\sigma$ defining the width of the control limits.
Design of the p Chart
The p chart analysis presented in this section plots mortality rate by month. Appendix 3 presents an analysis of the effects on ARL to signal for a choice of control limit parameters in the setting of changing mortality rates. Based on this analysis, the charts are designed to signal plausible and clinically important changes in mortality rate in a timely fashion with an acceptable incidence of false alarms.
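The kind of run-length analysis described here can be sketched in a few lines of simulation. The in-control rate (0.16), block size (100), limit width and shifted rate (0.20) below are illustrative assumptions, not the values derived in Appendix 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_chart_run_length(p_true, p0=0.16, n=100, a=2.0, max_blocks=10_000):
    """Draw blocks of n patient outcomes until the observed mortality rate
    falls outside the +/- a sigma limits set for the in-control rate p0."""
    sigma = np.sqrt(p0 * (1 - p0) / n)
    lcl, ucl = p0 - a * sigma, p0 + a * sigma
    for block in range(1, max_blocks + 1):
        rate = rng.binomial(n, p_true) / n
        if rate < lcl or rate > ucl:
            return block            # run length to (possibly false) alarm
    return max_blocks

# ARL in-control (false alarms) versus ARL after a shift in mortality:
arl_in_control = np.mean([p_chart_run_length(0.16) for _ in range(2000)])
arl_shifted = np.mean([p_chart_run_length(0.20) for _ in range(2000)])
```

A design is acceptable when the in-control ARL is long enough to keep false alarms rare, while the ARL under the shift to be detected is short.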
Figure 4.1 presents the data from Appendix 1 Table A1.2 in a p chart of monthly mortality for the PAH ICU in 1995 - 1997. [Chart residue not otherwise recoverable; the surviving labels indicate a chart with C+ and C- series plotted against Admission Block Number.]
4.4.3 Estimates of the Current Mean: EWMA Chart
A number of approaches to estimating the current mean output are available. The overall mean of a series of values does not provide a good estimate of the current output, due to contributions of sometimes distant historical values. A method that provides better current estimates is the exponentially weighted moving average (EWMA). More detail for this statistic is provided as it leads to original work developing the RA EWMA statistic in Chapter 5.
The EWMA is an estimator that has been extensively investigated and it will be applied in this section. It assigns a higher weight to recent observations than to more distant or historical values. It is useful for detecting small persistent shifts in the mean outcome and for estimating the timing of those shifts. The EWMA is slower to detect large shifts in the process mean than the p chart. In contrast to the p chart and the CUSUM, the EWMA is relatively insensitive to departures from the normal distribution. As well as monitoring mortality for groups of patients, the EWMA can be applied to monitoring mortality rates by analysing a series of individual patient outcomes.
The formula for the EWMA is

$$EWMA_i = \lambda y_i + (1-\lambda)\,EWMA_{i-1}$$

where $y_i$ is the value of the $i$th observation. This value may be the mortality rate of a sample of patients, $\hat{p}_i$, or the outcome of a single patient, $Y_i$. In the examples used, both sample blocks of 100 consecutive patients and single patient outcomes are presented. $\lambda$ is a weight between 0 and 1. Larger values give more weight to recent observations, with the limiting value of $\lambda = 1$ producing a plot of $y_i$ against $i$, similar to a p chart. $EWMA_i$ is the value of the statistic indexed by $i$.

The calculation of the EWMA statistic is an iterative process. For the first value, $EWMA_0$, an estimate of the in-control mortality rate is used. In this example, the estimate is $\bar{p}$. $EWMA_i$ is a weighted average of the starting estimate $\bar{p}$ and the subsequent observations $y_1$ to $y_i$.

An alternative expression for $EWMA_i$ is

$$EWMA_i = (1-\lambda)^i\,\bar{p} + \lambda\sum_{k=1}^{i}(1-\lambda)^{i-k}\,y_k$$

where $k$ is a whole number $\le i$. We can calculate control limits for the EWMA using the in-control mortality rate and the standard deviation. A formula for the control limits is:

$$CL_i = \bar{p} \pm a\sigma\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]}$$

where $\sigma$ is the standard deviation of the observations of the in-control sample, and $a$ is the width of the control limits in multiples of $\sigma$. The estimate of $\sigma$ is

$$\hat{\sigma} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}}$$

and the control limits are

$$CL_i = \bar{p} \pm a\sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}\,\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]}$$

where $n_i$ is the number of patients in each block of admissions. For an EWMA for single cases, $n_i = 1$.
When the number of observations is large, a simplified formula,

$$CL = \bar{p} \pm a\sigma\sqrt{\frac{\lambda}{2-\lambda}}$$

can be used.
The design of the EWMA chart requires consideration of the effects of $a$, $\lambda$ and $n_i$ on ARL under in-control and changed conditions. Appendix 5 provides an analysis of the effect of these parameters. Parameter choice determines the ability to detect real shifts in the process mean and the likelihood of false positive signals.
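The recursion and its time-varying control limits can be sketched directly; the parameter values below (in-control mean 0.16, λ = 0.3, blocks of 100) mirror the example of Figure 4.7a, while the five observed rates are hypothetical:

```python
import numpy as np

def ewma_chart(y, p_bar, lam=0.3, a=2.0, n=100):
    """EWMA_i = lam*y_i + (1-lam)*EWMA_{i-1}, started at the in-control
    mean p_bar, with control limits that widen toward an asymptote."""
    sigma = np.sqrt(p_bar * (1 - p_bar) / n)   # s.d. of one block's rate
    z = p_bar                                  # EWMA_0
    points = []
    for i, yi in enumerate(y, start=1):
        z = lam * yi + (1 - lam) * z
        half = a * sigma * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        points.append((z, p_bar - half, p_bar + half))   # statistic, LCL, UCL
    return points

chart = ewma_chart([0.15, 0.13, 0.12, 0.14, 0.13], p_bar=0.16)
```

Setting n = 1 gives the single-case chart of Figure 4.8a, and λ = 1 recovers the p chart.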
Figure 4.7a presents an EWMA chart of the mortality rates for blocks of 100 patients during 1998 - 1999, with $\bar{p}$ = 0.16, a = 2 and λ = 0.3. It is clear that the mean mortality rate is not 0.16. The mean mortality rate appears to have changed to lie in the range 0.12 - 0.14.
Figure 4.7a: EWMA Chart PAH ICU In-Hospital Mortality 1998 - 9. Blocks of 100 cases, in-control mean = 0.16, λ = 0.3, a = +/-2. [Chart residue removed; series: EWMA, upper control limit +2σ, lower control limit -2σ; x-axis: Block of 100 Cases.]
Figure 4.7b presents the same data, but with charting parameters revised and estimated based on the EWMA estimate of the mean mortality. The new target mean was 0.141 after the second block of observations. There are no further signals. The chart cannot be used to demonstrate a statistically significant difference between the EWMA estimate of mortality rate and 0.141; that would require use of an appropriate statistical test. Nevertheless, the chart suggests that the mortality rate is in the range of 0.12 - 0.14 during the latter part of 1999.
Figure 4.7b: EWMA Chart PAH ICU In-Hospital Mortality 1998 - 9. Blocks of 100 cases, λ = 0.3, a = +/-2; initial estimate of in-control mean = 0.16, then after 2nd block, revised to 0.141. [Chart residue removed; series: EWMA, upper control limit +2σ, lower control limit -2σ; x-axis: Blocks of 100 Cases.]
The next series of charts plot the EWMA for individual patient outcomes. Figure 4.8a displays the EWMA chart of outcomes of admissions 1998 - 1999. The in-control mortality estimate is 0.16 and λ was chosen to be 0.001. Again it is clear that the mean during this period is not 0.16.
Figure 4.8a: EWMA Chart PAH ICU In-Hospital Mortality 1998 - 9. Single case, in-control mortality = 0.16, λ = 0.001, a = +/-2. [Chart residue removed.]
myocardial infarction patients and general surgical cases, and RA Shewhart charts have been described by
Alemi and co-workers. The methods described in these papers will be modified for the ICU context, and the RA exponentially weighted moving average (RA EWMA) chart will be introduced. The average run length (ARL) for these charts to detect changes in RA mortality rates will be analysed. Parameter choice will be made for this ICU application based on the performance of the charts over ranges of parameters and clinical scenarios.
An important consideration in the development of RA control chart methods has been to select a method to describe the distribution of mortality rates. This important basis for RA chart development is explored in Appendix 6. Three methods to characterise the distribution of observed mortality rates are discussed.
Two of these methods are applied to the charts and data in this chapter. The central limit theorem leads to approximation of the mortality rate distribution using a normal distribution. This is a good approximation for most RA chart applications. An exact method, using an iterative approach, is also used to calculate the cumulative probability function of mortality rates of samples with patients of known probability of death.
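A minimal sketch of such an iterative calculation (a convolution over patients, yielding what is sometimes called the Poisson-binomial distribution) is given below; the three risk values are hypothetical:

```python
import numpy as np

def death_count_pmf(risks):
    """Exact distribution of the number of deaths when patient j dies
    independently with probability risks[j]: convolve one patient at a time."""
    pmf = np.array([1.0])
    for p in risks:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf                      # pmf[k] = P(exactly k deaths)

risks = [0.05, 0.20, 0.50]          # hypothetical predicted risks of death
pmf = death_count_pmf(risks)
cdf = np.cumsum(pmf)                # P(k or fewer deaths) for each k
```

Dividing the death count by the sample size converts this into the exact distribution of the observed mortality rate.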
In previous applications of RA strategies to health care, the emphasis has been on comparison between institutions, rather than monitoring changes over time. For example, the emphasis in cardiac surgical studies has been on using RA approaches for comparisons. In contrast, the RA methods that will be described in this thesis are designed to monitor mortality rates, and detect changes within a single institution over time.
The American APACHE III hospital mortality model, adjusted for hospital characteristics, performed well as an estimate of in-hospital mortality in the PAH ICU between January 1995 and December 1999. It will be used as a RA tool to illustrate monitoring of ICU mortality. The characteristics of the data series and the performance of the APACHE III model for the in-control period of 1995 - 1997 are summarised in Chapter 3. Although a validated model of the APACHE system is used in these analyses, any model with adequate performance could be used for RA. Other models have been
developed using this dataset and further development of an alternative machine learning RA model will be explored in Chapters 6 and 7.
5.2: Use of RA methods in Monitoring Hospital Mortality Outcomes. RA control charts are proposed to detect a change in the process of care, using model fit as an indication of change. In Figure 1.2, Iezzoni's "Algebra of Risk" described the manner in which patient factors and diagnosis factors led to patient outcomes, influenced by the process of care and random variation. The relationship between the patient and disease factors, and the patient outcomes could be modelled. Subsequently, differences between observed and predicted patient mortality could be ascribed to random variation or to deterioration in model fit, potentially attributable to changes in quality of care. RA charts will provide a means of assessing model fit.
5.2.1: Issues with Monitoring RA Mortality. There are a number of issues that arise with risk adjusted outcome monitoring.
1. Performance of RA model
The key to monitoring RA mortality rates is to have an adequate model to estimate the patient risk of death. The model performance must meet the criteria discussed in Chapter 2. It must be reproducible and be able to be generalised to unseen data separate from the context in which the model was developed or validated. The performance must be assessed in terms of discrimination, for example, using the area under the ROC curve, and calibration, assessed by the Hosmer-Lemeshow statistics. Ultimately, any numerical recommendation will be an opinion based on the practicalities of data collection, model building and validation, and clinical use.
My recommendation is based on two important considerations.
The first consideration is the level of performance that can be realistically expected from an ICU mortality prediction model. A review of the performance of models to predict ICU outcomes was summarised in Chapter 2, Table 2.1. This gives a guide to the discrimination and calibration that can be
expected from this type of model. For example, the area under the ROC curve should be in the range 0.80 - 0.90. It is reasonable to expect that the Hosmer-Lemeshow C statistic based on 10 groups should have a value less than 15.5 ($\chi^2_8$, p > 0.05), but preferably C should be much smaller. The power of the Hosmer-Lemeshow method to detect departures from the null hypothesis of good model fit will depend on the size of the validation data set.
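One common formulation of the C statistic can be sketched as follows; grouping conventions vary between implementations, and the equal-count split below is an assumption rather than the thesis's exact procedure:

```python
import numpy as np

def hosmer_lemeshow_C(y, p_hat, groups=10):
    """C statistic: rank patients by predicted risk, split into equal-count
    groups, and compare observed with expected deaths in each group."""
    order = np.argsort(p_hat)
    y = np.asarray(y)[order]
    p_hat = np.asarray(p_hat)[order]
    C = 0.0
    for idx in np.array_split(np.arange(len(y)), groups):
        o = y[idx].sum()             # observed deaths in the group
        e = p_hat[idx].sum()         # expected deaths in the group
        n = len(idx)
        p_bar = e / n
        C += (o - e) ** 2 / (n * p_bar * (1 - p_bar))
    return C   # compared against a chi-squared value, e.g. 15.5 for p = 0.05
```

For a well-calibrated model the statistic behaves approximately as a chi-squared variable, so values far above the reference threshold indicate poor calibration.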
The second consideration is the sizes of the modelling and validation datasets. This requires a balance between the need for large datasets to develop and validate a model, and the finite time and resources available to collect patient data. A recent study by Clermont and co-workers modelled ICU patient outcomes with logistic regression and artificial neural networks, using a developmental dataset of 800 patients and a validation set of 447 cases. With developmental datasets of 800 patients or larger, satisfactory models were developed, but with smaller datasets the models were unreliable. A similar patient series of 1200 patients would take approximately 1 year to collect at the PAH ICU. This would provide 800 cases (67% of the data) for model development and 400 cases (33% of the data) for model validation. The study of Clermont et al. provides a useful benchmark, and on this basis, I propose that a practical compromise for model development dataset size be 800 patients, with a validation dataset size of 400 cases.
The APACHE III model with adjustment for hospital characteristics performed better than these minimum recommendations given above. On the 3159 patients at the PAH ICU during 1995 - 1997, the area under the ROC curve was 0.91, and the Hosmer-Lemeshow C statistic was 14.5.
2. Model generalisation
Several simulation studies show that models that estimate the probability of ICU patient death can have poor performance and provide unreliable predictions of risk of death when the patient samples are different from those under which the model was developed.
A statistical model potentially describes only the relationships that were present in the data on which it was developed, and does not represent an immutable truth. Models may not be applicable to all ICU
patient samples and institutions to which they might be applied. Model fit cannot be assumed and evaluation of the performance of each model at each site is required.
3. Changes over time
The fit of a RA model may change over time. This could be attributable to evolving quality of care. A simulation study by Zhu et al. demonstrated that poor quality of care, simulated by increased mortality, causes model fit to deteriorate. Alternatively, changes to model fit could be due to changes in discharge, admission or data collection practice, or changes in policy, equipment or treatment goals that impact on the variables and outcomes collected.
4. Observer effects
In practice, changes in performance could be due to the effect of the observer, or the commencement of surveillance on the process. During the 1920s, productivity at the Western Electric plant in Hawthorne, Illinois, improved whenever observation or any intervention was attempted. The Hawthorne effect could impact on the process of care, or it could exert a more subtle influence. Admission and discharge practices, data collection, patient selection and data interpretation could be influenced if the system is under scrutiny. For analysis of an historical dataset, this would not be an issue, but it is a consideration when these monitoring approaches are applied in a real clinical setting.
5. Limitations of a single measurement
It is unrealistic to use a single measurement to capture the quality of care of patients. Patient survival provides only one aspect to evaluating ICU outcomes. A programme of quality management should evaluate many domains of ICU performance including mortality, resource use, process measures, access measures and complication rates. RA control charts form part of the measurement of patient mortality. Control chart approaches can be readily applied to the other domains of measurement of quality of care in ICU.
6. Choice of outcome
There are other meaningful endpoints than in-hospital death. Quality of life after treatment in ICU is important to patients, but is usually not measured. It has been proposed that assessments of the quality of ICU outcomes should measure the quality of life rather than the rate of death. Mortality is a crude indicator of quality of ICU care and may offer limited insight into subtle changes in the process of care.

The ability of RA outcomes other than death to measure quality of care is untested and will be difficult to measure and model. Quality of life after ICU has been studied and reviewed, though it has not been used as a measure of quality of care. Unfortunately, there are no models that estimate the probability of alternative outcomes and provide a basis for RA.
The choice of the definition of the mortality outcome has important implications for the timeliness and accuracy of mortality analysis. ICU patients may have hospital stays of months or occasionally years. If the endpoint of in-hospital mortality is being analysed, then it may be months (or even years) after an admission date before all the patient records are finalised with the patients discharged from hospital or dead. Early analysis is biased toward a higher mortality rate. The APACHE III system uses in-hospital mortality, and for this retrospective application using a complete dataset this is not an issue. However, in practice, the 30-day in-hospital mortality endpoint allows analysis to be conducted 30 days after ICU admission. This important alternative endpoint is used in model development in Chapters 6 and 7.
7. Data source and quality
Data quality and collection issues are important considerations for the development and validation of RA models and subsequent RA chart analysis. The general considerations of data quality have been discussed in Chapter 2.
Some RA mortality models for settings other than ICU are based on the use of patient data designed for billing or administrative work, not for RA. Minimum datasets from administrative, demographic and patient billing records are relatively inexpensive and there are masses of data available. These data are of controversial quality because of ambiguities of definition and problems with coding. Administrative databases do not capture all the features that determine a patient's outcome, and do not include clinical variables reflecting severity of illness, co-morbidities and abnormal laboratory evaluations. Failure to capture these explanatory features will compromise the potential performance of models.
The best datasets should be gathered with clinical risk estimate modelling or RA purposes in mind, though this is time consuming and expensive. A prospective data collection and collation process is necessary to capture the administrative, demographic, diagnostic, physiological and co-morbid characteristics of the patients. The data for ICU patients at the PAH ICU 1995 - 1999 are a complete dataset collected under the rules of the APACHE III model.
5.2.2: Application of RA to the monitoring of ICU Mortality. RA techniques applied to monitoring outcomes in a single institution over time offer an adjunct to current methods of quality management. If the RA model meets performance criteria, charts comparing observed mortality rates against the RA expected values or other RA mortality statistics are possible.
The methods in the following sections are RA adaptations of standard charts for monitoring: the p chart of mortality rates (RA p chart), the RA CUSUM and the RA EWMA chart. Before describing these adaptations, it is necessary to describe and address some issues that arise in calculation of control limits for the charting procedures, and how valid the statistical assumptions are for the distribution of mortality rates. Appendix 6 describes three approaches to characterising the distribution of the observed mortality rate of a sample of patients.
In an industrial application, where the inputs are standardised, and the process is in-control the risk of failure for each item sampled is assumed to be the same. Changes in the statistical distribution of outcomes imply that there is a change in the process.
In contrast, in medical applications, each patient has a unique set of contributors to risk of death. A patient brings physiological reserve (captured by age and co-morbidities), physiological disturbance (captured by the abnormalities in a range of physiological, clinical and laboratory measurements) and a diagnosis or disease category. There will be additional factors that are not measured in the first 24 hours in the ICU. Some of these occur before admission (pre-admission treatment, details of the exact surgical event), some are present but not recorded (rare conditions, new technology or uncommon
measurements) and many of the determinants of outcome occur after the first day (failure or complications of therapy).
Alemi and Oliver argue that the estimate of risk of death should reflect the risk of death of a patient under ideal conditions, rather than what is reasonable to expect from a health care process. This ideal may be impossible to estimate, but if the argument were accepted, underperformance would be universal.
The exposure of patients to the process of care will not be constant and will vary from hours to months. The outcome of a patient who spends 8 hours in ICU after an elective surgical procedure is more reliant on the success of the surgery and the patient's underlying physiological reserve than ICU care. In contrast, the survival of a patient who spends 3 months in ICU with pneumonia and multiorgan failure depends heavily on the quality of the care offered in the ICU.
With these issues considered, the development of a RA charting approach will be presented.
5.3: RA p chart of mortality rate. The RA p chart plots the mortality rate compared to an expected mortality rate given by the RA model predictions for the patients in each sample.
The patient mortality data can be analysed in blocks of a constant number of patients, say, 100 admissions, or in specified time periods (e.g. months). For fixed block length, the time periods over which patients are accrued will vary, and will not neatly fit into calendar months or quarters. Fixed time periods, e.g. monthly blocks, are more convenient from a unit management perspective, but the case numbers vary. Either approach can be used. For this application, a fixed block length of 100 patients will be used. At the PAH ICU during the study period, 100 patients were admitted over about 4 weeks.
5.3.1 RA p Chart: Control limits calculated using a normal approximation
The RA p chart plots the observed mortality rate on a chart with control limits calculated using the risks of mortality of patients in the sample. RA $\bar{X}$ charts and p charts have been proposed by Alemi and co-workers using the t distribution to calculate control limits. The ICU application has large sample sizes, so I have adapted the RA p chart to ICU RA mortality monitoring by using a normal approximation to calculate the control limits.
The following notation and formulae are used for the RA p charts. $Y_{ij}$ is the outcome for patient $j$ in sample $i$. If the patient dies, $Y_{ij} = 1$ and the probability of death is $\pi_{ij}$. If the patient survives, $Y_{ij} = 0$ and the probability of survival is $1-\pi_{ij}$.

$$E(Y_{ij}) = \pi_{ij}$$

$$\mathrm{var}(Y_{ij}) = \pi_{ij}(1-\pi_{ij})$$

$\pi_{ij}$ may be estimated by $\hat{\pi}_{ij}$, using a statistical model such as the APACHE III risk of death estimate:

$$\widehat{\mathrm{var}}(Y_{ij}) = \hat{\pi}_{ij}(1-\hat{\pi}_{ij})$$

The observed mortality rate $R_i$ for sample $i$ is

$$R_i = \frac{\sum_j Y_{ij}}{n_i}$$

and the predicted mortality rate, $E(R_i)$, is

$$E(R_i) = \frac{\sum_j \hat{\pi}_{ij}}{n_i}$$

The variance of the observed mortality rate is

$$\mathrm{var}(R_i) = \frac{\sum_j \mathrm{var}(Y_{ij})}{n_i^2} = \frac{\sum_j \hat{\pi}_{ij}(1-\hat{\pi}_{ij})}{n_i^2}$$

The RA p chart compares $R_i$ to $E(R_i)$ with control limits calculated around $E(R_i)$. The control limits are defined as multiples of the standard deviation ($\sigma$). Let $a$ be the number of $\sigma$ used; then

$$CL_i = E(R_i) \pm a\sqrt{\mathrm{var}(R_i)} = \frac{\sum_j \hat{\pi}_{ij}}{n_i} \pm \frac{a}{n_i}\sqrt{\sum_j \hat{\pi}_{ij}(1-\hat{\pi}_{ij})}$$

In this illustration, $a = 2$ and $n_i = 100$ patients.
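The control limit calculation above reduces to a few lines; the risk vector and death count below are illustrative stand-ins for the APACHE III predictions and outcome of one block:

```python
import numpy as np

def ra_p_chart_point(risks, deaths, a=2.0):
    """One RA p chart point: observed rate, expected rate, and control
    limits built from each patient's predicted risk of death."""
    risks = np.asarray(risks, dtype=float)
    n = len(risks)
    observed = deaths / n
    expected = risks.sum() / n
    sd = np.sqrt((risks * (1.0 - risks)).sum()) / n
    return observed, expected, (expected - a * sd, expected + a * sd)

# Hypothetical block of 100 admissions with 21 deaths:
rng = np.random.default_rng(0)
obs, exp, (lcl, ucl) = ra_p_chart_point(rng.uniform(0.01, 0.8, 100), 21)
signal = obs < lcl or obs > ucl
```

When every patient carries the same risk, the limits collapse to those of the ordinary (unadjusted) p chart.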
Figure 5.1 is a RA p chart of the hospital mortality rate of all ICU patients, in blocks of 100 admissions. Control limits are set at +/- 2σ, and observed mortality rates that fall outside the limits are marked with an asterisk (*). The average APACHE III predicted mortality rate and the observed mortality rate per block are plotted.
Figure 5.1: RA p Chart of Hospital Mortality PAH ICU 1995 - 1999, n = 100, APACHE III RA. * marks observed mortality values beyond the 2σ control limits. [Chart residue removed; x-axis: Sample Number (blocks of 100 patients).]
In blocks 7, 13 and 19, the observed mortality was above the upper control limit and in block 51, the mortality rate was below the lower control limit. The chart can be interpreted as showing that the hospital mortality rate of ICU patients was very likely to have been significantly above the level expected from the patient characteristics in the earlier part of the analysis period. The hospital mortality may have been lower than predicted in the later part of the period.
An expression for the probability of a single observation of the RA p chart falling outside the control limits can be developed. This is the power for a single observation and is adapted from the formula of Flora. The estimated probability of $Y_{ij} = 1$ is $\hat{\pi}_{ij}$, but under changed conditions, which we wish to detect, the probability of $Y_{ij} = 1$ becomes $\tilde{\pi}_{ij}$.
Successive non-negative values will lead to accumulation of $S_j^+$ until its value exceeds the control limit, $h^+$.
For testing for a change in the RA odds of death where the mortality is falling, the procedure is similar to the test for an increased OR. For the convenience of plotting both charts on the same figure, the statistic $S_j^-$ is accumulated as a negative value (or zero) and $h^-$ is a negative value. Thus,
$$S_j^- = \min\left(S_{j-1}^- - W_j,\; 0\right)$$
The $OR_A$ is less than 1 and the control limit is $h^-$, the value below which $S_j^-$ must fall to give an alert or alarm.
The APACHE III prediction of risk of death, $\hat{\pi}_j$, is used as an estimate of $\pi_j$.
The choice of $h^+$ and $h^-$ is made by modelling the ARL where the RA outcome of patients has changed, or remains unchanged, given the set of estimates $\hat{\pi}_j$ and the choices of clinically relevant $OR_A$. Figure 5.10 shows the ARL using upper and lower RA CUSUM tests based on the log likelihood together, with the $OR_A$ set at 0.5 and 2, for a range of $h^+$ and $h^-$. In this example, $h^+$ and $h^-$ have the same absolute magnitude, though for other applications these decision thresholds will vary, and may not be equal. The experiment calculated the ARL for the RA CUSUM from 10 000 simulations. Each case was drawn at random and with replacement from the 5278 patients in the series. Each simulated patient outcome was a Bernoulli trial based on $\tilde{\pi}_j$, the altered probability of death. This analysis is the basis for the choice of chart parameters.
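This simulation style can be sketched as below. The weight $W_j$ is not reproduced in this excerpt, so the sketch uses the standard log-likelihood-ratio weight for an odds-ratio shift (the Steiner-style RA CUSUM weight); the risk profile, run counts and run-length cap are illustrative assumptions, not the thesis values:

```python
import numpy as np

rng = np.random.default_rng(0)

def weight(y, pi, odds_ratio):
    """Log-likelihood-ratio weight for one patient with pre-test risk pi,
    testing a shift of the odds of death by a factor odds_ratio."""
    denom = 1.0 - pi + odds_ratio * pi
    return np.log(odds_ratio / denom) if y else np.log(1.0 / denom)

def run_length(risks, or_true, or_test=2.0, h=4.5, max_n=30_000):
    """Bootstrap patients with replacement; outcomes are Bernoulli trials
    with the odds of death shifted by or_true. Returns cases to signal."""
    s = 0.0
    for j in range(1, max_n + 1):
        pi = rng.choice(risks)
        pi_true = or_true * pi / (1.0 - pi + or_true * pi)  # shifted risk
        y = rng.random() < pi_true
        s = max(s + weight(y, pi, or_test), 0.0)            # upper CUSUM
        if s > h:
            return j
    return max_n

risks = rng.uniform(0.02, 0.6, 500)    # hypothetical risk-of-death profile
arl_changed = np.mean([run_length(risks, or_true=2.0) for _ in range(100)])
arl_stable = np.mean([run_length(risks, or_true=1.0) for _ in range(20)])
```

Averaging the simulated run lengths at each candidate $h$ reproduces the kind of ARL curves shown in Figures 5.10 and 5.11.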
The values in Figure 5.10 differ from those published in Cook et al., which reported an ARL estimate of 5400 based on the patient population from the 1995 - 1997 dataset. The inclusion of the additional 1998 - 1999 cases used in this updated analysis has increased the ARL for monitoring for OR = 2, with $h^+$ = 4.5, to 5638.
Figure 5.10: Effect of Choice of h on ARL of RA CUSUM. 1-sided monitoring for OR = 2, 2-sided monitoring for OR = 0.5, 2; APACHE III RA, 10 000 simulations at each value of h+/-. [Chart residue removed.]
The relationship between ARL and changes in the OR is shown in Figure 5.11. For this experiment, 3 RA CUSUM monitoring schemes were studied: an upper RA CUSUM testing for OR_A = 2, a lower RA CUSUM testing for OR_A = 0.5, and a combined upper and lower RA CUSUM with OR_A(upper) = 2 and OR_A(lower) = 0.5. Initial experiments showed that for the upper RA CUSUM, h⁺ = 4.5 gave an in-control ARL of 5640 admissions. For the lower RA CUSUM, h⁻ = −4.5 gave an in-control ARL of 6701 cases, or about 6 years.
The effect of changing risks of death by changing the OR was examined by performing 10 000 RA CUSUMs of each type at each OR value in the range of 0.1 - 4 in increments of 0.1. To simulate patient outcomes, cases were randomly drawn from the series and outcomes were allocated as Bernoulli trials where the risk of death was π_j. The results of this simulation are shown in Figure 5.11.
Figure 5.11: Effect of Changed OR on ARL of RA CUSUM. Upper OR_A = 2; lower OR_A = 0.5; combined upper and lower RA CUSUM OR_A = 0.5 and 2; h⁺/⁻ = ±4.5; APACHE III RA; 10 000 simulations at each point.
[Chart: ARL (patients) plotted against Odds Ratio, with curves for the upper RA CUSUM testing for OR = 2, the lower RA CUSUM testing for OR = 0.5, and the combined upper and lower RA CUSUM.]
With unchanged OR = 1, the upper RA CUSUM has an ARL of 5640, and the lower 6701. The ARL is 3043 for the combined scheme. As the OR increases above 1, the estimate of the ARL is dominated by the ARL of the upper RA CUSUM scheme. Similarly, with OR below 1, the ARL is rapidly dominated by the effect of the lower RA CUSUM scheme. If only increases in RA mortality were of interest, the specificity of the signal could be increased by using only the upper RA CUSUM. With a
RA CUSUM only monitoring for an increased OR, the false alarm ARL would exceed 5640 cases, or 5 years.
The RA CUSUM rapidly detects decreases in OR. The ARL for OR of 0.7 is 646 and for OR of 0.5 is 244. The RA CUSUM also rapidly detects increases in OR, with an ARL of 842 with OR = 1.3, ARL of 416 with OR = 1.5 and ARL of 169 with OR = 2.0.
The parameters chosen for the RA CUSUM (upper and lower RA CUSUM with OR_A(upper) = 2 and OR_A(lower) = 0.5, control limits h⁺ = 4.5 and h⁻ = −4.5) provide rapid detection of clinically important changes in OR with an acceptably long ARL for in-control or minimally changed OR. If only an increase in OR was of clinical interest, the in-control ARL could be dramatically increased by only monitoring with the upper RA CUSUM.
These parameters are used to plot upper and lower RA CUSUM charts for the PAH ICU data 1995 - 1999 in Figure 5.12.
Figure 5.12: Steiner RA CUSUM, PAH ICU In-Hospital Mortality 1995 - 1999. h⁺/⁻ = ±4.5; optimal detection of change OR(lower) = 0.5 and OR(upper) = 2; APACHE III RA; single cases. * identifies S⁺ > h⁺, and # identifies S⁻ < h⁻.
[Chart: upper and lower RA CUSUM statistics plotted against Case Number (0 - 5000).]
5.6 Summary and Conclusion. This chapter has demonstrated a new application of RA control chart analysis to an ICU setting. My work is the first use of the APACHE III score as a validated RA tool to monitor ICU outcome. Each of the methods applied and developed in this section has a method for design and analysis described in the Appendices.
This work is original in its application. It is an adaptation of previously described methods, or presents new work. The RAp chart was developed from the method of Alemi et al "^ and incorporates a method for analysing the power of each sample, adapted from Flora ^. The Z score p chart relies on the work of Flora ^ and Sherlaw-Johnson and colleagues ™'^'. The use of the iterative approximation of the distribution to derive control charts is an original contribution.
The RA CUSUM is based on the work of Lovegrove et al. '' and the moving frame approach is adapted from the work on cardiac surgery charting by Poloniecki et al. "^. The use of the RA CUSUM of Steiner and co-workers '^*''^°''*^ is an original application in ICU, though the method has not been modified, apart from the use of the APACHE III.
Both the RA EWMA charts, the parametric approximation and the discrete approximations of the distribution of EWMA_j are original contributions.
This chapter is about current RA charting techniques that can be applied to such contexts as the ICU. The charting methods have an important place in the range of quality measures that are available, and potentially have wide application.
In the next two chapters ICU data will be modelled again using machine learning approaches (neural networks and support vector machines) on datasets of a small and practical size for application in a single ICU.
Chapter 6: Supervised Machine Learning Techniques with Application to Intensive Care Mortality Models
In the previous chapters, I have reviewed the techniques for assessment of models that estimate the probability of intensive care unit (ICU) patient death. With an existing, validated model, I have developed risk adjusted control charts, applying such models for monitoring mortality in ICU patients. This chapter will introduce machine learning approaches for development of new models to estimate the risk of death of ICU patients.
This chapter has three aims:
The first is to describe the machine learning methods of artificial neural networks (ANNs) and support vector machines (SVMs).
The second is to review applications of ANNs and SVMs in the intensive care unit (ICU) with particular reference to the estimation of probability of death of ICU patients.
The third is to present preliminary experiments with ANNs and SVMs on the Princess Alexandra Hospital (PAH) ICU dataset, as a prelude to development of a practical RA tool using raw patient data. One experimental task is the classification of patients according to 30-day in-hospital death or survival. The other is a regression problem to estimate the probability of 30-day in-hospital death. The variables used are pre-processed patient data: APACHE III acute physiology score, APACHE III chronic health score, patient age and modified APACHE III diagnostic coding. The results are compared to previous logistic regression results on the PAH ICU dataset.
6.1: Application of Machine Learning to Classification and Regression Problems to Predict ICU Mortality. ANNs and SVMs are machine learning approaches to modelling the probability of ICU patient death. It is practical to model ICU patient outcomes in this way because of the availability of powerful desktop computers. The recent review '"^^ of "Artificial Intelligence in the ICU" inadequately covered the topic in only six paragraphs. Four of these were introductory definitions and descriptions of the ANN. There is much more to the topic than that review suggested.
Machine learning to model ICU outcome is a supervised learning task. Supervised learning in this application uses training datasets where the patient and diagnostic variables and the resultant patient outcomes are known. Generally a machine learning algorithm adjusts or optimises aspects of a model to minimise an error function that compares the outcomes in the training dataset to model predictions.
6.1.1 Artificial Neural Networks ANNs provide a diverse range of models for classification and regression problems. ANNs have architectures of networks of interconnected simple processors. The training process involves learning patterns in the training data, often by modifying the weights that link the units. The following discussion will be limited to the supervised learning instance. A more detailed introduction to this topic is available in various references ^^^'^''.
The most commonly used example of an ANN is the multilayer perceptron (MLP), which is shown in Figure 6.1.
Figure 6.1: Architecture of 2 Layer Multi-Layer Perceptron. [Diagram: feature values x1 … xn of the input vector enter at the input units (left); weighted connections link them to hidden units in the hidden layer (centre), which connect through further weights to the output units (right).]
This example of a 2 layer MLP has an input layer of n input units (left, •) equal to the dimension of the input vector. It has hidden layer units (centre, ○) and output layer units (right, □). The data input is an n dimensional vector, X. In the example, feature values x1 to xn are presented at the input nodes. The arrows represent connecting weights. Each input node connects to each node in the layer below. The vector of weights at each node is W. In the hidden layer in Figure 6.1, W will have n elements. The input to each hidden unit is the sum of weighted inputs. This is the sum of elements of the dot product of the vectors W and X, i.e. W1X1 + W2X2 + … + WnXn, abbreviated as Σ W·X. The input is processed by a bounded increasing activation function to produce a non-linear output. A commonly used activation function is the sigmoid function f(Σ W·X) = 1 / (1 + e^(−Σ W·X)).
Figure 6.2 shows three input values and weight vectors processed at the hidden unit. The output signals of the processing units are then connected by weights to the next layer. The next layer in this example is the output layer, though further hidden layers are possible. At the output unit, the sum of the weighted signals is processed into the output signal.
Figure 6.2: Output of a hidden unit is a function of a sum of weighted inputs. [Diagram: inputs x1, x2, x3 are weighted and summed at a hidden unit.]
Supervised learning proceeds by iterative adjustments to the inter-connecting weights of the ANN, to minimise the sum of the squared errors between observed and predicted output values. Initially, weights are set to random values. Subsequent adjustment of the interconnecting weights in the network is by gradient descent down the error surface, seeking a minimum error value. Back propagation is one such algorithm, whereby the weights in the network are adjusted according to the derivative of the error with respect to the weights.
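As an illustrative sketch (not the software used in this thesis), gradient descent with back propagation for a small MLP can be written in a few lines of numpy. The toy data, layer sizes and learning rate are hypothetical choices for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    # append a constant input of 1 so each unit has a bias weight
    return np.hstack([a, np.ones((a.shape[0], 1))])

rng = np.random.default_rng(0)

# toy data: 2 inputs, binary target (an AND pattern)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [0.], [0.], [1.]])

W1 = rng.normal(scale=0.5, size=(3, 3))   # 2 inputs + bias -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(4, 1))   # 3 hidden + bias -> 1 output

lr = 0.5
for _ in range(5000):
    xb = add_bias(X)
    h = add_bias(sigmoid(xb @ W1))        # hidden layer outputs
    out = sigmoid(h @ W2)                 # network prediction
    # back propagation: derivative of squared error through each sigmoid
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:-1].T) * h[:, :-1] * (1 - h[:, :-1])
    W2 -= lr * h.T @ d_out                # gradient descent on the error surface
    W1 -= lr * xb.T @ d_h

pred = sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2)
print(pred.round(2).ravel())
```

After training, the prediction for the (1, 1) pattern approaches 1 and the others approach 0, illustrating the iterative weight adjustment described above.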
MLP ANNs have a number of potential issues '*^ which are common to many numerical optimisation procedures. For example, the gradient descent algorithm can become trapped in local error minima, rather than proceeding to the optimal solution. The MLP can produce different training results depending on the starting weights, by terminating at different local error minima. As well, MLPs are prone to over-fitting. It is usual practice to monitor performance on an external verification data set during the training. The training process is terminated when the generalisation performance begins to deteriorate. In addition, a test dataset is used to assess the model performance (see below). Whilst
there are strategies to limit the problems caused by these issues, it is important to train many MLPs and to choose the best ANN for the application at hand.
In the ANN experiments in this chapter, the available data were divided into three sets for each trial. A training set was used to train the ANN. A verification set was used to monitor the generalisation performance so that training could be stopped. A test set was used to assess the model performance on data not used for model development.
The radial basis function (RBF: Figure 6.3) network is another example of the multilayer ANNs used in this study.
Figure 6.3: Architecture of Radial Basis Function Network. [Diagram: feature values x1 … xn of the input vector are transferred from the input units to radial units in a single hidden layer; a weighted sum of the radial unit outputs feeds the output units.]
As with the MLP, the input vector X is presented at the n input nodes and each input is connected to the hidden layer. However, there is only a single hidden layer of processing units. The inputs to the hidden layer are transformed using a function with radial symmetry, such as a Gaussian function. The output from each unit in the radial layer depends on the distance of the feature values from the centre of the function for that unit '**. The sum of the weighted signals of the radial layer is processed with a linear
activation function at the output units. The weights between the radial layer and the output layer are adjusted by the learning algorithm. The RBF ANN is rapidly trained, and is useful for modelling continuous mappings of input features onto outputs '^^.
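The rapid training of the RBF network follows from its structure: once the radial centres and width are fixed, the output-layer weights of a linear output unit can be fitted in a single least-squares step. The following sketch illustrates this on hypothetical 1-D data; the centres, width and dataset are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def rbf_design(X, centres, width):
    """Gaussian radial layer: each unit's output depends on the distance
    of the input from that unit's centre."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# toy 1-D regression: y = sin(x) sampled on a grid
X = np.linspace(0, 6, 30)[:, None]
y = np.sin(X).ravel()

centres = X[::5]                 # a subset of the inputs as radial centres
Phi = rbf_design(X, centres, width=1.0)
# linear output layer: weights fitted in one step by least squares
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = Phi @ w
print(float(np.abs(pred - y).max()))   # small residual on the training grid
```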
The generalised regression neural network (GRNN) shown in Figure 6.4 is similar to the RBF ANN. It is used to recalibrate the mortality predictions in the regression experiments in this chapter. It is a statistical method of function approximation and in this application provides a Bayesian, kernel-based estimate of the risk of death. The GRNN was developed by Specht '^^ based on Parzen '^'•'^^. Figure 6.4 (based on Figure 1 in Specht '^*') shows the architecture of the GRNN.
Figure 6.4: Architecture of GRNN. [Diagram: feature values of the input vector enter at the input units, connect through weights to hidden pattern units in a radial hidden layer, then through weights to summation units and a single output unit.]
Figure 6.4 shows there are input units for each feature dimension of the input vector. Again, each input unit connects to all of the radial units in the first hidden layer. These pattern units are each dedicated to a training example, or represent clusters of similar training data. The pattern units have a Gaussian function which transforms the distance of each feature value from the pattern represented by each unit, similar to the RBF processing units. A smoothing parameter determines the shape of the radial function, and the overlap between functions. The second hidden layer has summation units which process the weighted signals from all the pattern units. The output node provides estimates of the mean risk of death of a patient with the input pattern x1…xn.
The GRNN uses one training epoch to set the weights, and can be trained very quickly. Although there are architectural similarities to the RBF ANN, the GRNN has no iterative algorithm of weight adjustment after all the training examples are incorporated.
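The GRNN prediction reduces to a Gaussian-kernel weighted mean of the training outcomes (a Nadaraya-Watson estimate). A minimal sketch, with a hypothetical smoothing parameter and toy data, might look like:

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=0.5):
    """Each pattern unit holds one training example; the output is a
    Gaussian-weighted mean of the training outcomes."""
    d2 = ((X_train - x) ** 2).sum(axis=1)     # squared distance to each pattern
    k = np.exp(-d2 / (2.0 * sigma ** 2))      # radial (Gaussian) activations
    return float((k * y_train).sum() / k.sum())

# toy data: outcome 1 clusters near (1, 1), outcome 0 near (0, 0)
X_train = np.array([[0., 0.], [0.1, 0.], [1., 1.], [0.9, 1.1]])
y_train = np.array([0., 0., 1., 1.])

print(grnn_predict(X_train, y_train, np.array([1., 1.])))   # near 1
print(grnn_predict(X_train, y_train, np.array([0., 0.])))   # near 0
```

The smoothing parameter sigma plays the role described above: larger values broaden the radial functions and increase their overlap.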
6.1.2 Review of the Use of ANN in the Intensive Care Unit. ANNs are widely applied in the medical area. Non-ICU applications include diagnosis '^^•'^', prognosis '^"'*^, physiological and laboratory data interpretation '55.i58.i67-i69, and pharmacology '^*''™. There are a number of reports describing ANNs used to model ICU patient data and outcomes. The following review of applications summarises the use of ANNs in the ICU from the perspective of predicting mortality or resource use.
Two studies have examined cardiac surgical sub-sets of ICU patients. Lippmann and Shahian ''' used a MLP ANN, logistic regression and Bayesian analysis to model the survival outcomes of 80 606 cases. Fifty-nine demographic, physiological, laboratory, diagnosis and cardiac assessment features were collected, of which 36 were used in the model. The calibration of the logistic regression model was the best, but the model had an area under the ROC curve of only 0.76. Orr "^ reported an ANN to estimate the risk of death in cardiac surgical patients using only 7 variables selected from a patient database. This model had good calibration, but lacked discrimination, with the area under the ROC curve of only 0.74.
Buchmann et al. ' " compared a logistic regression model, MLP, GRNN and a probabilistic neural network to classify ICU patients on the basis of chronicity (length of stay in ICU > 7 days), rather than to predict mortality. They found that the discrimination and calibration of the ANNs were superior to logistic regression. Another study of prediction of resource use, as measured by hospital length of stay, was published by Mobley et al. "*. An ANN was trained to estimate the hospital length of stay of 557 coronary care patients using 74 variables including demographic characteristics, physiological
observations, laboratory results, diagnostic tests and index events. The ANN was able to predict length of stay within 24 hrs in 72% of patients. However, this was only marginally better than using the mean length of stay of each diagnostic class in the dataset.
Doig et al. "^ compared the predictions of an ANN to a logistic regression model for the classification of a small series of 422 patients. Each patient had already survived to 72 hours in ICU. Variables were selected from the APACHE II system and re-modelled. Remarkable discrimination was achieved on the training set (area under the ROC curve = 0.99). The discrimination was less good on the validation set (area under the ROC curve 0.82), suggesting overtraining of the ANN, with overfitting to the training data at the expense of generalisation performance.
Dybowski et al. '^^ compared an ANN trained with a genetic algorithm to a logistic regression model for predicting the outcomes of a small subset of ICU patients (258 patients) with systemic inflammatory response syndrome. A classification tree and logistic regression were used to select variables from physiological and demographic variables, and index events that occurred during the hospital stay. The ANN had better discrimination than logistic regression (area under the ROC curve 0.86 v 0.75). No assessment was made of the calibration of the models.
Frize et al. ' " trained MLPs to classify non-operative (608) and operative (883) ICU patients according to the predicted duration of mechanical ventilation. These models were only assessed on the developmental dataset predictions. By pruning the number of input features from 51 to 6, classification performance improved and the network complexity was reduced. This study demonstrated a practical approach to limiting network complexity and provides useful documentation of a successful approach to processing and transformation of patient data. However, the ANNs in this study were designed to estimate resource use rather than to predict mortality outcome. A weakness of the study was that the models were not assessed on a separate test dataset, so no conclusions can be drawn about the reproducibility of the modelling, or of the model's generalisation performance.
Two studies compared ANNs to the APACHE II system. Wong and Young '" compared MLPs to the APACHE II system on 8796 patient admissions collected for an APACHE II database. Both the MLP
ANN and the APACHE II system had similar discrimination (area under the ROC curve 0.82 - 0.84) and calibration. Nimgaonkar et al. "^ also compared the performance of ANNs to the APACHE II system to predict mortality in an Indian ICU. A series of 2962 cases were modelled using the input variables for the APACHE II system. They analysed the contribution of each of the features to the models. Discrimination by the ANN was superior to the APACHE II (area under the ROC curve 0.88 v 0.77). The ANN displayed better calibration than the APACHE II model.
One of the most relevant studies is by Clermont et al. '^, who used logistic regression and ANN to model hospital mortality outcome on 1647 ICU patients. The demographic and physiology variables were collected under the rules of the APACHE III system. It is important to examine this study in some detail, as the modelling context is very similar to the task of modelling patient outcomes at the PAH ICU. In Clermont's study, the component variables of the APACHE III model and the APACHE III score were used to model the probability of patient death. The areas under the ROC curves were in the range of 0.8 (logistic regression) to 0.836 (ANN with coded APACHE III observations). All the models had reasonable calibration when 800 or more cases were used to develop the model.
The ANN and the logistic regression models were able to successfully predict ICU patient death. A further important conclusion was that both the logistic regression and ANN model performance deteriorated when the model development set size was reduced below 800 cases. In a real application, at the PAH ICU, this requires 9 months of patient data collection for model building. For smaller ICUs, it may represent 2 - 3 years of data collection.
Equally important was the authors' practical choice of 447 cases for the test dataset for model assessment. The size of this assessment set is a trade-off between the practicality of collecting patient data and the important statistical issues of the power and precision of the model assessment. For the PAH ICU and other busy ICUs, 447 patients requires about 3 - 4 months of data collection. For smaller units, a year may be required just to collect enough patient data for model assessment. Balanced against this is the issue of smaller assessment datasets giving low statistical power to the Hosmer-Lemeshow C (H-L C) test to detect imperfect calibration, and the loss of statistical precision in estimating the area under the ROC curve. Clermont et al. made their choice based on a timed period
of data collection. The size of the dataset they used strikes a reasonable compromise between the time for data collection and statistical issues.
There are limitations to Clermont's study. As Paetz '^^ commented in a letter to the Editor in a subsequent edition of Critical Care Medicine, the study did not involve re-sampling to demonstrate the robustness or reproducibility of the approach. It is possible that the non-random split of cases into the training and validation sets was a major determinant of the model's reported performance. Replicates of the modelling on alternative random data selections, and re-sampling to provide alternative assessment sets, are necessary to demonstrate consistency of the modelling approach. I would add to these criticisms that the authors used consecutive patients to build the models. The last 447 consecutive cases in the dataset were used to assess all the models. Even when smaller training sets were explored, non-random, consecutive sampling was used. Introduction of bias into the model, the effects of influential outliers, or fortuitous sampling cannot be excluded with their methodology.
There are further limitations to their techniques. The patient's diagnosis or diagnostic coding was not included in the variables for the model. Also, the authors relied on the APACHE III algorithm for the weights that they used to pre-process the variables for both the MLP ANN and the logistic regression model. These APACHE III weights are added together to give the APACHE III score, which the authors also selected as a variable in some models. The lack of a diagnosis variable and reliance on the APACHE III system may have limited the quality of the models that could be built.
The authors did not record how many ANNs were trained to yield the optimal performance ANN, so it is not clear whether a limited or an exhaustive survey of possible ANNs was conducted.
In summary, many applications of ANNs to prediction of ICU mortality have been published. Overall, the performance of ANNs on ICU mortality prediction tasks appears as good as or better than logistic regression, on the datasets on which the models have been developed. However, no meaningful conclusions can be made about the generalisation of these ANN models outside the context where each was developed.
6.1.3 Support Vector Machines The SVM is based on work by Vapnik '^^ using a linear, machine learning algorithm to model a projection of the data into multi-dimensional feature space. Whilst the example data may not be linearly separable in input space (its raw form), the attributes mapped by a kernel function into feature space may be separable. A useful definition of SVMs is "…learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory, to implement a learning bias derived from statistical learning theory…" *.
There are several introductory references to the theory of the SVM ^•'^'"'*^. The following description draws heavily on these references, and is not meant as a mathematically detailed account of the theory. Enough detail is provided to explain the basis for the experiments conducted in this and the next chapter.
For classification problems, SVMs are trained to define the position of an optimal separating hyperplane between classes. For regression problems, linear learning algorithms model a non-linear function (a regression "tube") in feature space. The term "support vector" describes how only the most informative patterns in the data are used in defining the model.
Linear machine learning routines can be applied on a fixed, non-linear mapping of the data vectors in feature space because the numerical operations required to perform the minimisation and learning procedure can be evaluated efficiently. This is a tractable problem, as the complexity of the calculations does not increase with the dimension of feature space.
A highly curved multi-dimensional model hypothesis might also increase the risks of over-fitting the data. Overfitting is a particular issue where there is noise in the data used for regression modelling, and where there is overlap of classes in classification problems. Several factors can be controlled during the learning process to limit the complexity of the model and promote generalisation of the SVM model. The learning algorithm minimises an error function within constraints of the model. Lagrange multipliers are included in the function to be optimised, presenting a quadratic optimisation problem with a convex error surface without local minima.
SVM Classification. The SVM algorithm defines a hyperplane to minimise the number of classification errors on the training dataset, and maximise the margin between the two classes (see Figure 6.5).
Figure 6.5 shows a simple two class classification problem of stars and circles (based on Figure 1.1 in Schölkopf et al. (1999) '*^). The data input pattern is the vector x of n elements. The class labels are Y = 1 for circles and Y = −1 for stars, and examples in this dataset are linearly separable in input space. The dashed lines define a margin, enclosing all possible separating hyperplanes where w·x + b = 0. w is a weight vector normal to the hyperplane, and b is a constant. There are no points that lie between the dashed lines.
Figure 6.5 Simple classification of stars and circles, showing the margin either side of the optimal hyperplane.
[Diagram: two classes either side of the optimal hyperplane (w·x) + b = 0 (in bold), with margin boundaries (w·x) + b = +1 and (w·x) + b = −1; the perpendicular distance from the origin to the hyperplane is −b/|w|.]
A decision function f(x) = sign[(w·x) + b] allows the maximal separating hyperplane (in bold) to be defined, by minimising ||w||²/2 subject to Y_i[(w·x_i) + b] ≥ 1, where i = 1…l. (Equation 6.1)
Lagrange multipliers (α_i, i = 1…l) are introduced to give an expression that is readily differentiated.
The optimal solution is found by minimising L with respect to w and b, and by maximising L with respect to the dual variables α_i to find the "saddle point". The conditions of Equation 6.1 allow w and b to be eliminated, and the solution then is to maximise
L = Σ_i α_i − ½ Σ_i Σ_j α_i α_j Y_i Y_j (x_i·x_j) (Equation 6.2)
The solution can be expanded in terms of only the training vectors which have non-zero Lagrange multipliers, α_i ≠ 0. These are the support vectors, which will lie on the dashed lines in Figure 6.5. The other training examples are not required to define the optimal separating hyperplane.
Where the data are not linearly separable in input space, the problem can be solved by mapping the data into a multidimensional feature space where the decision function may be a linear function. For the learning task, (Equation 6.2) can be rewritten using the dot product of a mapping function, Φ( ), of x_i and x_j:
L = Σ_i α_i − ½ Σ_i Σ_j α_i α_j Y_i Y_j (Φ(x_i)·Φ(x_j)) (Equation 6.3)
126
A kernel function, K( ), that can implement the dot product in Equation 6.3, is used:
L = Σ_i α_i − ½ Σ_i Σ_j α_i α_j Y_i Y_j K(x_i, x_j)
For the non-separable case, a "soft margin hyperplane" is sought, by introducing a positive slack variable, ξ_i.
A regularisation constant, C, is used to assign a cost to training errors that are introduced by the slack variable. This parameter C is the upper bound on the Lagrange multipliers, and will limit the influence of any single training vector. It will limit the complexity of the model and will influence the balance between overfitting and generalisation.
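The experiments in this thesis solve the dual quadratic programme above. Purely as an illustration of how C penalises margin violations, the equivalent soft-margin primal can be minimised directly by sub-gradient descent; the toy data, learning rate and names below are illustrative assumptions, not the thesis method.

```python
import numpy as np

rng = np.random.default_rng(1)

# two well separated clusters, labelled +1 and -1
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([0, 0], 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on ||w||^2/2 + C * sum of hinge(1 - y(w.x + b)).
    Larger C weights slack (margin violations) more heavily."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # cases inside the margin (slack > 0)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_soft_margin(X, y, C=1.0)
acc = (np.sign(X @ w + b) == y).mean()
print(acc)
```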
SVM Regression For regression tasks, the SVM algorithm is modified. Y_i is a real number representing the outcome, and x_i is an input vector of n elements, for patient i. For classification, it was useful to use the concept of the margin in which the optimal separating hyperplane lies. For regression, it is useful to consider a tube in feature space, in which the optimal regression function lies. The linear regression function, f(x) = (w·x) + b, is estimated by a tube of radius ε around the regression function, using the ε-insensitive loss function (Figure 6.6).
To estimate the linear regression it is necessary to minimise ||w||²/2 + C Σ_{i=1…l} |Y_i − f(x_i)|_ε
127
Figure 6.6 shows the concept of ε, the precision with which the regression function is modelled. It is based on Figure 1.4 of Schölkopf et al. 184.
Figure 6.6: ε, the regression tube precision (or width), around f(x) = (w·x) + b.
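The ε-insensitive loss depicted in Figure 6.6 is zero for points inside the tube and grows linearly outside it. A minimal sketch (with illustrative values):

```python
def eps_insensitive(y, f, eps=0.1):
    """Zero inside the tube |y - f| <= eps; grows linearly outside it."""
    return max(0.0, abs(y - f) - eps)

print(eps_insensitive(1.0, 1.05))        # inside the tube: 0.0
print(eps_insensitive(1.0, 1.5, 0.25))   # 0.25 outside the tube: 0.25
```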
There are two types of slack variables, ξ_i and ξ_i* ≥ 0, to account for positive and negative deviations from the tube.

… (or acute physiology score 71%), diagnostic group code 12 - 13%, and age 1 - 8%.
With knowledge about the importance of variables in ICU populations generally, and in this sample from PAH ICU particularly, four input variables were chosen. The APACHE III acute physiology score was chosen to reflect the amount of physiological disturbance in the first 24 hours in ICU. The diagnostic category was used to capture the influence of the patient's diagnosis, procedure, or surgery. The patient's physiological reserve was represented by the age and chronic health score component of the APACHE III score.
For each variable, the area under the ROC curve was used to assess the variable's ability to discriminate between the deaths and survivors, prior to inclusion in models.
1. Acute Physiology Score
The acute physiology component of the APACHE III score was calculated according to the data collection rules and definitions of APACHE III ^'"". It is a sum of scored components of the worst recorded observations and laboratory values from the first day of ICU admission. Neurological abnormalities, temperature, blood pressure, heart rate, respiratory rate, mechanical ventilation, and urine output are scored according to the extent of deviation from a midpoint "normal" physiological value. The blood chemistry measures: creatinine, white cell count, haematocrit, albumin, bilirubin,
glucose, sodium and urea are similarly scored according to deviance from a normal value. The sum of these scores gives the acute physiology component of the APACHE III score.
The acute physiology score (APS) has very good discrimination on the PAH ICU dataset. The area under the ROC curve of the APS alone on the PAH ICU dataset is 0.837. Any model built from variables that include the APS should have an area under the ROC curve greater than 0.837.
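The per-variable area under the ROC curve used here can be computed directly as the Mann-Whitney probability that a randomly chosen death scores higher than a randomly chosen survivor. A sketch with hypothetical scores (not PAH data):

```python
def auc(scores_dead, scores_alive):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a death outscores a survivor (ties count 0.5)."""
    wins = 0.0
    for d in scores_dead:
        for a in scores_alive:
            if d > a:
                wins += 1.0
            elif d == a:
                wins += 0.5
    return wins / (len(scores_dead) * len(scores_alive))

# hypothetical acute physiology scores for deaths and survivors
dead = [80, 65, 90, 55]
alive = [30, 45, 20, 55, 40]
print(auc(dead, alive))   # 0.975
```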
2. Disease Category In collaboration with Petra Graham '***, I have proposed a simple and powerful approach to recoding the complex APACHE III disease group and weights. A brief description is provided here to assist explanation of modelling and feature selection.
The disease group classification is a three level simplification of the APACHE III diagnostic categories. Each disease group is categorised as High, Low or Neutral risk according to whether the mortality for the disease group was higher, lower or not significantly different to the average mortality of the APACHE III model development dataset (Table 3, Knaus et al. ^). Table 6.1 shows the coding of the original APACHE III diagnoses and the simplified scale. Any diagnosis that was not present in the original APACHE III disease group list is classified as "Neutral", and marked with an asterisk in the table.
The risk was coded with high = 1, neutral = 0, and low = -1. The area under the ROC curve for the diagnostic code data alone on the PAH ICU dataset was 0.726.
Table 6.1: Mapping of APACHE III Disease Groups to Disease Category (high/neutral/low risk of death)

Nonoperative

Cardiovascular/vascular:
Cardiogenic shock: high
Cardiac arrest: high
Cardiomyopathy*: neutral
Aortic aneurysm: neutral
Congestive heart failure: high
Peripheral vascular disease: neutral
Rhythm disturbance: low
Acute myocardial infarction: low
Hypertension: low
Allergy*: neutral
Other cardiovascular diseases: neutral

Respiratory:
Parasitic pneumonia: high
Aspiration pneumonia: high
Respiratory neoplasm: high
Respiratory arrest: high
Pulmonary oedema (non-cardiogenic): high
Bacterial/viral pneumonia: high
Chronic obstructive pulmonary disease: high
Pulmonary embolism: neutral
Mechanical airway obstruction: neutral
Asthma: low
Other respiratory diseases: neutral

Gastrointestinal (GI):
Hepatic failure: high
GI perforation/obstruction: high
GI bleeding due to varices: high
GI inflammatory disease: neutral
GI bleeding due to neoplasm*: neutral
GI bleeding due to ulcer/laceration: low
GI bleeding due to diverticulosis: low
Other GI diseases: high

Neurologic:
Coma (unknown cause)*: neutral
Intracerebral haemorrhage: high
Subarachnoid haemorrhage: low
Stroke: high
Neurologic infection: neutral
Neurologic neoplasm: neutral
Neuromuscular disease: neutral
Seizure: neutral
Other neurologic diseases: neutral

Sepsis:
Sepsis (other than urinary tract): high
Sepsis of urinary tract origin: high

Trauma:
Head trauma (with/without multiple trauma): low
Multiple trauma (excluding head trauma): low
Metabolic:
Metabolic coma: high
Diabetic ketoacidosis: low
Drug overdose: low
Other metabolic diseases: neutral

Haematologic:
Coagulopathy/neutropenia/thrombocytopenia: high
Other haematologic diseases: low

Genitourinary:
Renal diseases: neutral
Pre-eclampsia: neutral
Other genitourinary diseases* (may be nonoperative or operative): neutral

Other:
Other medical diseases: low
Operative

Vascular/cardiovascular: high Dissecting/ruptured aorta, neutral Peripheral vascular disease (no bypass graft), low Valvular heart surgery, low Elective abdominal aneurysm repair, low Peripheral artery bypass graft, low Carotid endarterectomy, low Other cardiovascular diseases
Respiratory: neutral Respiratory infection, low Lung neoplasm, low Respiratory/neoplasm, low Other respiratory diseases, neutral
Gastrointestinal (GI): GI abscess* high, GI perforation/rupture neutral, GI inflammatory disease neutral, GI pancreatitis* neutral, GI peritonitis*, GI obstruction neutral, GI bleeding neutral, GI vascular* neutral, Liver transplant neutral, GI neoplasm low, GI cholecystitis/cholangitis low, Other GI diseases low
Neurologic: Intracerebral haemorrhage high, Subdural/Epidural haematoma high, Subarachnoid haemorrhage low, Laminectomy/other spinal cord surgery low, Craniotomy for neoplasm low, Other neurologic diseases low
Trauma: Head trauma (with/without multiple trauma) high, Multiple trauma (excluding head trauma) low
Renal: Renal neoplasm low, Renal transplant* neutral, Other renal diseases low
Gynaecologic: Hysterectomy low
Orthopaedic: Hip or extremity fracture neutral

* indicates a diagnosis that did not clearly map onto an APACHE III disease group at the time of modelling. These diagnoses had no risk data and were given a "neutral" disease group risk category coding.
3. Chronic Health Score
A score was calculated by adding points allocated according to the presence of co-morbid conditions, according to the data collection rules and definitions of APACHE III ^''*". Where these conditions are present, the points allocated are AIDS (23), hepatic failure (16), lymphoma (13), metastatic cancer (11), leukaemia/multiple myeloma (10), immune suppression (10) and cirrhosis (4). The area under the ROC curve for the Chronic Health Score alone on the PAH ICU dataset is 0.624.
4. Age
Age was used as a continuous variable calculated as the date of admission minus the patient's date of birth. The area under the ROC curve of the patient's age alone on the PAH ICU dataset is 0.62.
Index
An integer was used to index the order of admission. This variable was not subsequently used in the modelling, other than as a unique identifier.
The data were not manipulated or pre-processed further prior to modelling with the ANN and SVM.
6.2.4 Classification
The objective of the classification experiments was to assess how well ANNs and SVMs can classify patients into survivors or in-hospital deaths at 30 days, using the data collected in the first 24 hours.
SVM Classification
Initial classification experiments were conducted with the default value of the SVMlight parameter

    C_default = 1 / mean(||x_i||^2)

where x_i is the input vector and ||x_i|| = sqrt(x_i . x_i) is its 2-norm. Classification was investigated using the radial basis function kernel SVM and the polynomial kernel SVM.
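A sketch of this default, assuming (as in the SVMlight documentation) that it is the reciprocal of the mean squared 2-norm of the training vectors:

```python
import numpy as np

def svmlight_default_c(X):
    """SVMlight-style default C: 1 / mean(||x_i||^2) over the
    training vectors (a sketch of the rule cited in the text)."""
    norms_sq = np.sum(np.asarray(X, dtype=float) ** 2, axis=1)
    return 1.0 / norms_sq.mean()

X = np.array([[3.0, 4.0], [0.0, 5.0]])   # both vectors have norm 5
print(svmlight_default_c(X))             # -> 0.04, i.e. 1/25
```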
SVM: Radial basis function kernel
The RBF kernel that was used to implement the dot product of the mapping function in Equation 6.3 was

    K(x_i, x_j) = exp(-γ ||x_i - x_j||^2)

where x_i and x_j are input patterns, and ||x_i - x_j|| is the 2-norm of the difference between these input vectors. The RBF parameter γ was varied.
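The kernel above can be evaluated for all pairs of patterns at once. A small sketch (not thesis code):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for every pair of
    rows of X and Y, computed by broadcasting the differences."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

K = rbf_kernel([[0.0, 0.0], [1.0, 1.0]],
               [[0.0, 0.0], [1.0, 1.0]], gamma=0.5)
print(K.round(3))   # diagonal 1; off-diagonal exp(-0.5 * 2) = exp(-1)
```

As γ grows, the kernel value between distinct points shrinks towards 0, which is consistent with the overfitting behaviour at large γ reported below.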
SVMs with varying RBF kernel parameter γ were investigated (Figure 6.8). For each SVM, the sample was randomly split into a set of 20% (1056 cases) for model training, and the remaining 80% (4222 cases) for model testing. Values are the mean correct classification rate (CCR) of 10 SVMs trained at values of γ between 0 and 20. Figure 6.8 shows the CCR for RBF kernel SVMs in the range of γ between 0 and 1. The best CCR is 90.2% at γ between 0.007 and 0.01, whilst the best in the training set was 89% with γ less than 0.005. At larger γ, the CCR approaches the mortality rate, as all cases are classified as deaths.
Figure 6.8: CCR for RBF SVM, γ from 0 to 1. Values are the mean CCR of 10 models; 1056 cases in the training set. (Plot of training set and test set CCR, approximately 0.82 to 1.0, against γ.)

[Figure: frequency histogram of areas under the ROC curve, 0.82 to 0.90 on the horizontal axis, with the number of models on the vertical axis.]
Figure 6.16 is a frequency histogram of the average H-L C statistic values of 100 assessment sets for each of the 100 SVM-GRNN models. Twenty nine of the 100 SVM-GRNN models trained had an average H-L C > 15.5 on 400 case test sets. The histogram of the average H-L C statistics has a skewed distribution, which is expected as the H-L C statistic follows a chi-squared distribution. Note the 7 models grouped on the right of the histogram, which all had mean H-L C statistics > 50.
Figure 6.16: Calibration of SVM-GRNN Models. Histogram of the mean H-L C statistic; each observation is the mean of 100 samples of 400 case test sets for each of the SVM-GRNN models. (Mean H-L C, 12 to 48, on the horizontal axis; number of models, 0 to 24, on the vertical axis.)
Seventy one percent of the SVM-GRNN models met both the discrimination (area under the ROC curve > 0.80) and calibration (H-L C < 15.5 on 400 cases) criteria. Therefore, the SVM-GRNN is a practical approach to modelling the risk of 30 day in-hospital mortality on the PAH ICU data.
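For reference, a minimal sketch of the H-L C calibration statistic used above. Grouping conventions vary between implementations; here cases are split into equal-sized groups by predicted risk, and predicted probabilities are assumed to lie strictly between 0 and 1:

```python
import numpy as np

def hl_c_statistic(p, died, groups=10):
    """Hosmer-Lemeshow C: sort cases by predicted risk p, split into
    equal-sized groups, and sum (O - E)^2 / (n * pbar * (1 - pbar)),
    where O and E are the observed and expected deaths in a group
    and pbar = E / n is the group's mean predicted risk."""
    order = np.argsort(p)
    p = np.asarray(p, dtype=float)[order]
    died = np.asarray(died)[order]
    c = 0.0
    for gp, gd in zip(np.array_split(p, groups), np.array_split(died, groups)):
        n, e, o = len(gp), gp.sum(), gd.sum()
        pbar = e / n
        c += (o - e) ** 2 / (n * pbar * (1.0 - pbar))
    return c

# A perfectly calibrated toy example gives C = 0.
print(hl_c_statistic([0.5] * 4, [0, 1, 0, 1], groups=2))   # -> 0.0
```

Small values indicate agreement between predicted and observed death rates; the 15.5 threshold used in this chapter corresponds to the upper tail of the relevant chi-squared distribution.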
The distributions of the discrimination and calibration statistics in Figures 6.15 and 6.16 demonstrate the need to conduct multiple trials and re-sampling. The spread of areas under the ROC curve and H-L C statistics is due to sampling of the data for model building and model assessment. For each trial, a different training set was chosen. For assessment of each model, 100 different test sets were chosen from the remaining data. Experiments conducted on a single sampling of data only demonstrate that it is possible to build a suitable model on that sample. This experiment has demonstrated that the modelling process is very likely to provide a suitable model on this dataset.
While some investigation was done of the SVM kernels and kernel parameters, the SVM parameter C was not optimised, and the SVMlight default value was chosen. It is possible that more extensive investigation of SVM parameters may provide better performance. This will be explored in the next chapter.
ANN Regression
The ANN training was conducted in Statistica Neural Networks '''^ using Radial Basis Function (RBF) and Multilayer Perceptron (MLP) designs. Initial experiments indicated that the RBF ANN consistently produced lesser discrimination than the MLP, so the following experiment is reported using only the MLP. The methods of feature selection, training and optimisation were the same as for the previous classification experiment.
For each trial, numerous ANNs were successively trained and the best models were retained, in the same way that the classification experiments were performed. At the commencement of each trial, an ANN was trained until its generalisation performance began to decline. Training of that ANN then ceased and another ANN was similarly trained. If its best performance was superior to the first, it was retained; if not, it was discarded. As before, this process was continued until a series of ANNs with 20 increases in performance had been achieved.
Twenty such trials were conducted on random selections of training data. The performances of the best ANNs from each trial were analysed. The training datasets were 20% of the sample (1056 cases).
Initial analysis of the ANN outputs indicated that the outputs did not estimate the probability of death. Whilst the areas under the ROC curve demonstrated good discrimination, the output of the ANN was poorly calibrated. Therefore, the MLP ANN output values were recalibrated with a GRNN.
To evaluate the generalisation performance of the 20 MLP GRNN ANNs, the areas under the ROC curves and the H-L C statistics were calculated by analysis of 100 test sets of 400 cases. These datasets were drawn at random, and with replacement from the available test data. The average of the areas under the ROC curve and the H-L C statistics were used to describe the discrimination and calibration of each of the 20 MLP GRNN models. These results are presented in Table 6.3.
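The re-sampling evaluation just described can be sketched as follows. The data and metric here are synthetic stand-ins (the real metrics would be the area under the ROC curve and the H-L C statistic):

```python
import numpy as np

rng = np.random.default_rng(0)

def resampled_means(metric_fns, p, died, n_sets=100, set_size=400):
    """Mean of each metric over repeated test sets drawn at random,
    with replacement, from the held-out data."""
    p, died = np.asarray(p, dtype=float), np.asarray(died)
    totals = np.zeros(len(metric_fns))
    for _ in range(n_sets):
        idx = rng.integers(0, len(p), size=set_size)  # with replacement
        totals += [f(p[idx], died[idx]) for f in metric_fns]
    return totals / n_sets

# Synthetic predictions and outcomes stand in for model output.
p = rng.random(4000)
died = (rng.random(4000) < p).astype(int)
mean_mortality, = resampled_means([lambda q, d: d.mean()], p, died)
print(round(float(mean_mortality), 2))
```

Averaging over many resampled test sets smooths out the sampling variability that a single train/test split would leave in the reported statistics.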
Table 6.3: Discrimination (area under the ROC curve) and calibration results of ANN MLP-GRNN regression. For each of 20 experimental training samples of 1056 cases, each value is the mean of 100 re-sampling trials drawing 400 cases with replacement from the unseen data.
Trial number   Area under the ROC curve   H-L C statistic
1              0.866                      14.0
2              0.870                      12.7
3              0.855                      25.6
4              0.867                      10.8
5              0.870                      12.3
6              0.874                      23.3
7              0.862                      11.0
8              0.851                      17.1
9              0.860                      13.0
10             0.859                      10.2
11             0.857                      17.6
12             0.875                      27.7
13             0.862                      9.2
14             0.847                      10.0
15             0.866                      8.6
16             0.866                      14.2
17             0.872                      21.9
18             0.861                      17.7
19             0.867                      13.9
20             0.858                      10.1
Seven of the MLP-GRNN ANNs (Table 6.3) did not satisfy the model performance criteria that I have proposed (area under the ROC curve > 0.80 and H-L C < 15.5 on 400 cases). All models had an area under the ROC curve > 0.84. However, models 3, 6, 8, 11, 12, 17 and 18 had H-L C statistics that were greater than 15.5. This experiment demonstrates that MLP-GRNN models trained on a dataset of 1056 cases achieved adequate performance on 13 of 20 occasions. Therefore, the MLP-GRNN offers one way to train a model to estimate the probability of 30 day in-hospital mortality from the PAH ICU dataset.
Discussion of Regression Results
These preliminary experiments demonstrate that ANNs and SVMs can be used to build models that accurately estimate the probability of in-hospital death of ICU patients.
The SVM-GRNN and ANN-GRNN models were trained on 20% of the PAH ICU dataset, using 1056 cases, or approximately 10 months of admissions to the PAH ICU. This set size was chosen for the preliminary machine learning regression experiments for two reasons. The sample size of 1056 cases was successful in the classification experiments, and was likely to be adequate for the regression experiments. Clermont et al. '^^ had successfully used a training set size of 800 cases, but found that model performance with ANN and logistic regression deteriorated below this size.
The MLP ANN and the SVM adequately discriminated between the survivors and non-survivors on the PAH ICU dataset. However, the MLPs and SVMs did not provide estimates of the probability of death, but rather a point estimate of the predicted outcome. Therefore, the regression modelling experiment used a GRNN to recalibrate the predictions of the MLPs and the SVMs. The GRNN added an additional step to the modelling process, but allowed the output values to approximate the probability of death. This is verified by the discrimination and calibration performance of the recalibrated models.
By itself, the GRNN was unable to effectively rank the patients in order of risk of death, with areas under the ROC curve < 0.7. However, once the patients were modelled by the SVM or the ANN, the average area under the ROC curve was greater than 0.83. The GRNN was then used to adjust the SVM and ANN outputs to an accurate estimate of the risk of death, verified by the calibration of the models. The area under the ROC curve, and hence discrimination, was preserved.
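The GRNN recalibration step can be illustrated in one dimension as Nadaraya-Watson kernel regression: the recalibrated output is a Gaussian-weighted average of the observed outcomes of training cases with similar raw model outputs. The smoothing width sigma and the toy data below are invented for illustration:

```python
import numpy as np

def grnn_predict(x_train, y_train, x_new, sigma=0.1):
    """One-dimensional GRNN (Nadaraya-Watson kernel regression):
    maps a raw SVM/ANN output x onto the kernel-weighted average of
    observed outcomes y at nearby training outputs."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    out = []
    for x in np.atleast_1d(x_new):
        w = np.exp(-((x - x_train) ** 2) / (2.0 * sigma ** 2))
        out.append(float(np.dot(w, y_train) / w.sum()))
    return np.array(out)

# Toy recalibration: raw scores in [-1, 1] mapped towards observed rates.
raw   = [-0.9, -0.5, -0.1, 0.1, 0.6, 0.9]
death = [  0,    0,    0,   1,   1,   1 ]
print(grnn_predict(raw, death, [-0.8, 0.8]).round(2))
```

Because the mapping is monotone where the outcomes are, the ranking (and hence the area under the ROC curve) of the underlying model is preserved while the outputs move onto the probability scale.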
In a practical sense, the second phase of model calibration with the GRNN for the MLP ANN and the SVM was a useful experiment, but it is unwieldy. If this approach is to be introduced, the steps would have to be automated. Modelling as conducted in this experimental context could not be introduced into general use, due to its complexity and predilection to error. One alternative is to apply the recommendations of Weston et al. ''* to construct SVM kernels specifically for probability density estimation. Using a logit transformation, the log of the odds ratio, of the ANN or SVM outputs would be another alternative to the GRNN as a method to recalibrate the regression outputs to provide estimates of the risk of death between 0 and 1.
The alternative that will be investigated in the next chapter is to build a regression SVM that approximates the probability of in-hospital patient death without a re-calibration step. It is planned to tailor the design and parameter choice to the performance requirements of a risk adjustment model. An approximation of the probability of in-hospital death can be made if the optimisation of models is driven by the desirable performance attributes of discrimination and calibration. I will pursue this in the next chapter, using discrimination (maximising the area under the ROC curve) and calibration (minimising the H-L C statistic). If this is successful, then a regression SVM will be built with a single modelling step to approximate the probability of patient mortality.
This study uses re-sampling of data for training and testing to evaluate models. Previous comparable studies have relied on a single split of training and test sets. In a letter to the Editor, Paetz highlighted this weakness in the study of Clermont and co-workers. Without re-sampling, the process is vulnerable to variations with inclusion (or exclusion) of influential outliers that may bias the model building, training or assessment. Clermont et al. had shown that the machine learning modelling was possible, but not that their findings were necessarily reproducible. The histograms of Figures 6.15 and 6.16 illustrate the variability of model performance that is found when multiple samples of 1056 cases are drawn at random. The modelling with SVM and ANN was done with 20 MLP-GRNN trials and 100 SVM-GRNNs, each assessed with 100 random test sets. This approach, with multiple trials and re-sampling, confirms that the experimental method will produce models that are likely to meet the discrimination and calibration goals on any repeated sample of 1056 cases from the PAH ICU dataset. Whilst Clermont et al. showed that ANNs could successfully model patient probability of death, the experiments of this chapter demonstrate machine learning approaches that will usually be successful.
The performances of the SVM-GRNN and the MLP-GRNN models are as good as or better than the findings of Clermont et al. This is the only comparable study using machine learning with similar variables on a similar ICU dataset. Clermont et al. found the areas under the ROC curves of a logistic regression and MLP ANN model were each 0.839, using samples of 800 and 1200 cases. This compares with the experimental findings in this chapter using 1056 cases. The classification experiment trained MLPs (0.87) and SVMs (linear kernel: 0.84), and the regression experiment trained MLP-GRNNs (0.86) and SVM-GRNNs (0.86).
The calibration of the ANN and SVM based regression models was similar in this experiment. Under the conditions of this experiment, 65% of the MLP-GRNN models (13/20) and 71% of the SVM-GRNN models (71/100) had both acceptable discrimination and acceptable calibration. The calibration of these models, assessed by the H-L C statistic, was comparable to the findings of Clermont et al.
Any differences in the areas under the ROC curves could be explained by random variation, and it is possible that SVM, ANN and logistic regression will provide equivalent results. If, however, any difference is real, several explanations related to data and variable selection may be proposed. It is possible that the absence of a diagnostic field in the variables used by Clermont et al. may have contributed to a slightly lower performance. Petra Graham and I '** have shown a diagnosis variable to be associated with up to 13% of the model's explanatory power on the PAH ICU dataset, and up to 18% has been demonstrated on other ICU patient models ^^. Also, the failure by Clermont et al. to randomise to development or testing sets, or to use a re-sampling procedure, creates the potential for bias in the development and assessment of models. An unlucky split of cases may have resulted in the lesser measured areas under the ROC curves in their study. Other possible reasons can be proposed, such as data accuracy and precision, but not enough details are provided in the paper to comment. The data in the study by Clermont et al. and at the PAH ICU were collected and compiled according to the same rules of the APACHE III database system, with the same training and quality checks.
6.3 Conclusion
In the field of machine learning, ANNs, but not SVMs, have previously been applied to ICU outcome modelling. These preliminary experiments demonstrated that ANNs and SVMs provided model performance on classification and regression tasks comparable to logistic regression. In particular, the preliminary results indicate that SVMs may perform as well as logistic regression or ANNs. As this is an area that has not been extensively studied, an SVM model for risk adjustment will be pursued to estimate the risk of death of ICU patients.
However, the models proposed in this section have considerable room for improvement. This issue will be addressed in the next chapter.
Firstly, models will be built with the patient database data, rather than the APACHE III score. The variables used as input features for the ANNs and the SVMs were chosen on the basis of other successful ICU outcome modelling. These were heavily pre-processed and, as a preliminary area of study, this improved the chances of successful model development. Both the APS and the Disease Category variables offer effective discrimination between survivors and non-survivors, even before incorporation into a model. However, there is the possibility that this pre-processing may have limited the performance of the models. Therefore, in Chapter 7, patient outcome will be modelled with all the variables that are collected in the database, not just the 4 pre-processed variables used in Chapter 6. The raw physiological and laboratory data, the diagnostic code, co-morbidity information and additional demographic and admission information will be used.
If a model can be built that uses only the component data variables, then the model development can be done independently of the APACHE III score and the APACHE III system that provides proprietary estimates of the probability of death. This will potentially save money spent on the software licence fee. More importantly, it will provide the flexibility to remodel the patient data when a risk adjustment model used for risk adjusted control charting no longer fits the patient data.
Secondly, 1056 cases were used for the model development in this chapter. Clermont and co-workers reported that 800 cases were sufficient for model development using logistic regression and ANNs. Therefore, in the next chapter, a training dataset of 800 cases will be used. This is of practical importance, as 800 cases is approximately 8 months of ICU admissions. With an additional 400 cases for model assessment, the model development and assessment can be conducted on about 1 year's patient data.
Thirdly, neither MLP ANNs nor regression SVMs provided estimates of the probability of patient death without a calibration step with the GRNN. By using the value of the machine learning model output as the input variable for the GRNN, the ranking performance of the SVMs and the ANNs was preserved and a probability estimate was produced. These methods satisfied the discrimination and calibration criteria of the mortality prediction model that I have proposed: area under the ROC curve > 0.80 and H-L C statistic < 15.5 on 400 cases. However, an SVM regression model may be able to provide reliable estimates of the probability of death without a GRNN calibration step. SVM model parameter choice will be determined by the kernel and parameter combinations that, on average, meet the discrimination and calibration goals.
Fourthly, the estimation of regression SVM parameters will be revised. In this experiment, the polynomial and RBF kernels were explored for a range of values of ε, without investigating values for the SVM parameter C. Parameter C is an important determinant of the SVM generalisation error. For the SVM to reliably estimate the probability of patient death, the values of the kernel parameters and the parameters C and ε will be systematically explored, seeking the parameter combinations that satisfy the model performance criteria.
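Such a systematic exploration might be organised as a simple grid search, retaining only the parameter combinations whose assessment results satisfy the discrimination and calibration criteria. Everything here is illustrative: the grid values are invented, and train_and_score is a placeholder for fitting and assessing one regression SVM:

```python
from itertools import product

# Hypothetical parameter grid for the planned search.
GRID = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1], "gamma": [0.005, 0.01]}

def meets_criteria(auc, hl_c):
    """The performance criteria proposed in this thesis."""
    return auc > 0.80 and hl_c < 15.5

def search(train_and_score):
    """Evaluate every grid combination; keep those meeting both goals."""
    keep = []
    for C, eps, gamma in product(GRID["C"], GRID["epsilon"], GRID["gamma"]):
        auc, hl_c = train_and_score(C=C, epsilon=eps, gamma=gamma)
        if meets_criteria(auc, hl_c):
            keep.append((C, eps, gamma, auc, hl_c))
    return keep

# Dummy scorer so the sketch runs end to end.
print(len(search(lambda **kw: (0.85, 12.0))))   # all 12 combinations pass
```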
Therefore, in the next chapter, modelling of the ICU patients' risk of death will be undertaken with a regression SVM using the raw patient data variables, with transformation where necessary. After an initial exploration of the search space, an extensive optimisation process will seek the model with the best discrimination and calibration for use as a risk adjustment model.
Chapter 7
Development and Assessment of SVMs to Estimate the Probability of ICU Patients' 30-Day In-Hospital Mortality using Patient Data
This chapter presents an experiment in which support vector machines (SVMs) are developed to estimate the risk of in-hospital death of patients admitted to the Princess Alexandra Hospital (PAH) intensive care unit (ICU), using data about their admission, physiology and diagnoses.
SVMs have displayed performance comparable to artificial neural networks (ANNs) and logistic regression on the ICU dataset. As SVMs have not been extensively studied in this application, further SVM model development will be pursued. This experiment will employ a strategy to train a regression SVM based on achieving discrimination and calibration targets, and thus guide kernel and parameter selection. It will be shown that regression SVMs can be trained to accurately and reliably estimate the probability of a patient's 30-day in-hospital death using the number of cases equivalent to 1 year of activity. This provides a practical approach to modelling risk of death for risk adjusted (RA) outcome monitoring.
7.1 Overview
The purpose of the experiment was to build an SVM to accurately predict the probability of 30 day in-hospital mortality of the ICU patients. For the SVM to provide a practical alternative to the commercial APACHE model, it has to function under the following constraints.
The data variables were raw demographic, physiological, diagnostic and investigational data available within the first 24 hours of ICU admission. The modelling was to be done on a 1200 case sub-sample, the equivalent of a year of patient admissions to the PAH ICU during 1995 - 1999.
Estimating the probability of death can be framed as a regression problem. A proposed model had to have adequate discrimination and calibration to meet the standard of an accurate estimate of the probability of 30 day in-hospital death of an ICU patient at the PAH. In contrast to the experiment in the previous chapter, the aim was to train a regression SVM with outputs that did not require re-calibration by a method such as the general regression neural network (GRNN).
7.2 Method
7.2.1 Data
From the original set of 5278 consecutive admissions in the PAH ICU patient dataset, 1/1/1995 to 31/12/1999, 16 cases were excluded due to missing data on Source of Admission or Time to ICU Admission. ICU readmissions during a single hospitalisation were not included.
Thirty five input variables, not limited to the APACHE III physiology, laboratory or diagnosis variables, were used from the PAH ICU patient database. The list and description of the variables is shown in Table 6.1.
The demographic and admission data were collected at the time of admission. The diagnosis, co-morbidity, laboratory values, physiological observations and measurements were collected at the end of the first day in ICU. No observations or measurements were included if the observations pre-dated the admission to the physical ICU area. In the event that a diagnosis was revised after 24 hours in the ICU, the original diagnosis was retained. Physiological and laboratory values were chosen according to the most extreme values seen in the first 24 hours in ICU. Where physiological or laboratory values were not collected, the database allocated a physiological value within the normal range. As the aim of the modelling task was to estimate mortality based on patient attributes only, the APACHE III score, which had been successfully used in the last chapter, was not used in this experiment.
The mortality outcome to be predicted was death in-hospital within 30 days.
Samples of 1200 cases, comprising a training set of 800 cases and a test set of 400 cases, were randomly chosen without replacement from the 5262 cases.
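This sampling scheme can be sketched directly (the index range simply stands in for the 5262 cases; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a 1200-case sample without replacement, then split it into
# an 800-case training set and a 400-case test set.
sample = rng.choice(5262, size=1200, replace=False)
train, test = sample[:800], sample[800:]
print(len(train), len(test), len(set(sample)))   # 800 400 1200
```

Because the draw is without replacement, no case can appear in both the training and test sets of one sample.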
Initial experimentation on the dataset indicated that raw values could not be successfully modelled, and that the best results would be obtained if some of the variables underwent a transformation prior to inclusion in the model.
Only one example of the use of SVMs to model ICU patient data was found in the literature. Morik et al. '^' used the SVM to emulate physicians' behaviour to assist in rule generation and guide the haemodynamic management of patients in a German ICU. The data acquisition system was different to that used at the PAH ICU, and the German patient data were initially modelled using time series analysis. Morik et al. collected observations every minute to detect and manage haemodynamic shock, in contrast to the APACHE III database, which collects the worst value during the first 24 hours.
Morik and co-workers recoded categorical variables into a number of binary attributes. This is appropriate for the PAH application and was therefore used for the "surgical category" variable and the "patient admission source" variable. Surgical category became the binary indicator variables Elective Surgery, Non-Surgical and Emergency Surgery. Source of admission was coded using indicator variables for Emergency Room, Operating Room, Floor or Other Hospital.
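The indicator (one-hot) recoding of a categorical variable can be sketched as:

```python
def to_indicators(value, categories):
    """Recode one categorical value into binary indicator attributes,
    one per category, as used here for surgical category and
    admission source."""
    return [1 if value == c else 0 for c in categories]

SOURCES = ["Emergency Room", "Operating Room", "Floor", "Other Hospital"]
print(to_indicators("Operating Room", SOURCES))   # [0, 1, 0, 0]
```

Exactly one indicator is set for any value in the category list, so the recoded attributes carry the same information as the original categorical variable.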
However, the data transformation used by Morik and co-workers for the real valued, continuous or discrete variables could not be used. Their method was to normalise the values of each variable using

    norm(X) = (X - X_mean) / sqrt(var(X))
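Morik et al.'s normalisation is a standard z-score, which can be sketched as:

```python
import numpy as np

def standard_normalise(x):
    """norm(X) = (X - mean(X)) / sqrt(var(X)): the transformation of
    Morik et al., which was judged unsuitable for the PAH data."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / np.sqrt(x.var())

z = standard_normalise([2.0, 4.0, 6.0, 8.0])
print(z.mean(), z.std())   # 0.0 and 1.0
```

The result always has mean 0 and standard deviation 1, regardless of where the lowest-risk value of the variable actually lies, which is the objection raised in the next paragraph.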
This was not appropriate, as the values in the PAH database were collected according to the rules of the APACHE III system database. In common with the APACHE II ', SAPS II ' and MPM24 II * models, the APACHE III system uses the worst or most extreme values (high or low) during the first 24 hours in ICU. This method has been widely used, as it is the deviation from normal values that is associated with an increased risk of death, rather than the absolute measurements. Therefore, the frequency histograms of the values of some variables comprised two distributions of "worst" values. One component was observations that were the "worst low values", whilst the other was the "worst high values". Two examples are shown in Section 7.2.2. Worst Temperature (Figure 7.1) has a frequency distribution that is dominated by the worst high temperatures, reflecting that in critically ill patients, hyperthermia with sepsis or systemic inflammation is more common than hypothermia. Worst Mean Arterial Pressure (Figure 7.2) clearly shows a bimodal distribution of worst high blood pressures and worst low blood pressures.
Other machine learning methods have accurately modelled ICU patient in-hospital mortality, and some of their successful data transformation strategies are described next. Generally, the raw physiology and laboratory observations were processed to values that reflected the distance of the observation from a physiologically normal state, which should be associated with the lowest risk of mortality. Clermont et al. '^^ gave numerical values to the variables according to the original APACHE III scoring scheme. I have not used this method, as it relies too closely on the APACHE III algorithm, and requires knowledge of the scoring for each variable. Doig et al. "^ coded all variables, except creatinine and the Glasgow Coma Scale, as the difference above or below the median value of the variable. Potentially, the median value may not reflect the variable value with the lowest risk of death, and it will depend heavily on the casemix and patient severity.
The study by Frize et al. "' reported an ANN-based clinical decision support system for ICU patients, and provides the most detail about successful pre-processing. In their study, all non-binary variables were standardised and scaled so that the zero values of the variables were associated with the lowest risk of death. An ICU clinician chose the value that, in his expert opinion, gave the lowest risk of death for each variable. The data observations were then subtracted from this physiologically normal value, and scaled by dividing by 3 times the standard deviation.
With the previously published experience in mind, the PAH ICU raw non-binary data were processed to achieve two objectives. The first aim was to standardise each variable, so that there was a minimum or maximum of mortality risk at a standardised value of zero. The second aim was to process the variables so that all were of approximately the same scale and range of values.
To standardise a variable, a physiologically normal value was subtracted from the variable value. One option for settling on a physiologically normal value for this experiment was to use expert judgement, as in the paper by Frize et al. As an experienced ICU clinician, my clinical judgement tells me that this approach is too subjective. The second option was to use the normal values recommended by APACHE III. The values considered normal were derived from ICU patient data collected in 1988 and 1989 in North America. However, there have been advances in medical care, and it was plausible that 5 to 10 years after the APACHE III model was developed, the lowest risk of death would be found at different physiological values.
Therefore, a third option, to re-explore the data variables, was undertaken. The values for each variable associated with minimum risk of death were identified using a smoothed, locally weighted second order polynomial fitted to scatter plots of 30 day in-hospital death (dependent variable) against each non-binary variable. This method is similar to that used in the development of APACHE III ^, and in a recent large risk adjustment exercise reported by Render et al. '* from the Veterans Affairs ICUs in the United States. However, in this PAH ICU application, the smoothed curves were only used to identify the value of minimal (or maximal) risk of death, and not to assign continuously re-weighted values to the variables.
A distance weighted least squares procedure (as part of Statistica 5) was used to fit a curve to a plot of input variable values and the patient outcomes. A quadratic regression for each value of the variable is used to estimate the corresponding patient outcome, such that the influence of data points on the regression decreases with distance, along the x axis, from the particular variable value. This method demonstrated which values of each variable were associated with the lowest or highest risks of patient death. Each variable was standardised by subtracting the value associated with minimum (or maximum) risk from the raw data value. Scaling was done by dividing the result by the standard deviation of the standardised variable values.
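The curve-fitting and standardisation steps above can be sketched as follows. This is a minimal illustration, not the Statistica procedure used in the thesis: the Gaussian form of the distance weights, the bandwidth, and the toy temperature-like data are all assumptions.

```python
import numpy as np

def local_quadratic_fit(x, y, grid, bandwidth):
    """Distance-weighted least squares: at each grid value, fit a
    quadratic with weights that decay with distance along the x axis,
    and record the locally predicted outcome."""
    preds = []
    for g in grid:
        w = np.exp(-0.5 * ((x - g) / bandwidth) ** 2)   # assumed weight form
        X = np.column_stack([np.ones_like(x), x - g, (x - g) ** 2])
        WX = X * w[:, None]
        beta = np.linalg.solve(WX.T @ X, WX.T @ y)      # (X'WX) beta = X'Wy
        preds.append(beta[0])                           # fitted value at g
    return np.array(preds)

def standardise(values, risk_extreme_value):
    """Centre on the value of minimum (or maximum) risk, then scale by
    the standard deviation of the centred values."""
    centred = values - risk_extreme_value
    return centred / centred.std()

# Toy temperature-like data: risk of death is lowest near x = 37.
rng = np.random.default_rng(0)
x = rng.uniform(33.0, 41.0, 2000)
p_death = 0.05 + 0.05 * (x - 37.0) ** 2
y = (rng.random(2000) < p_death).astype(float)

grid = np.linspace(33.5, 40.5, 71)
risk_curve = local_quadratic_fit(x, y, grid, bandwidth=1.0)
x_min_risk = grid[np.argmin(risk_curve)]
z = standardise(x, x_min_risk)
```

With data of this kind, the minimum of the fitted risk curve lands close to the true minimum-risk value, and the standardised variable has zero at that point and unit standard deviation.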
The raw data of some variables had frequency histograms that appeared unimodal, with grossly skewed frequency distributions, or had values over a large range. In these cases, the logarithm to base 10 was used to transform non-zero raw data values to facilitate scaling and standardisation. Entries with raw values of zero were given the lowest transformed value from that variable.
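A sketch of this transformation rule, assuming a simple bilirubin-like vector of non-negative values:

```python
import numpy as np

def log10_with_zeros(values):
    """Log10-transform skewed, non-negative data: non-zero entries get
    log10; zero entries are assigned the smallest transformed value
    seen for that variable."""
    values = np.asarray(values, dtype=float)
    out = np.empty(values.shape)
    nonzero = values > 0
    out[nonzero] = np.log10(values[nonzero])
    out[~nonzero] = out[nonzero].min()
    return out

# e.g. a bilirubin-like variable with a zero entry
transformed = log10_with_zeros([0.0, 5.0, 12.0, 300.0])
```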
This pre-processing did not aim to create a "standard normal variate" as used by Morik et al. The histograms of values of some variables were often far from having a Gaussian distribution. In the PAH dataset, the mean values were only occasionally associated with the lowest risk of death.
7.2.2 Three examples of pre-processing variables: Worst Temperature, Worst Mean Blood Pressure and Worst Bilirubin
Three examples of pre-processing variables are worked to show the approach. Worst Temperature. Worst Temperature illustrates a unimodal distribution with a normal physiological temperature associated with the lowest risk of death.
Figure 7.1 Histogram of Worst Temperature Values, PAH ICU 1995-1999 [histogram; x axis: Worst Temperature (°C), 24 to 44]
The minimum risk of death was associated with a temperature of 37.2° Celsius, which is a normal body temperature. The Worst Temperature value minus 37.2 was scaled by dividing by the standard deviation (1.33). Figure 7.2 is a fit using distance weighted least squares that shows the relationship between the worst temperature and the outcome of 30 day in-hospital death.
Figure 7.2 Scatterplot of In-Hospital 30 Day Mortality and Temperature [scatterplot with fitted curve; only axis residue survives for the companion plots, including an x axis labelled Log(10) Bilirubin]
Binary variables, Chronic Health Score and the Diagnostic Category were not transformed. No cases were removed as outliers.
Table 7.1 shows the list of variables, descriptions, transformations and pre-processing employed to yield the standardised and scaled values for model building. Each variable is numbered according to the fields in the dataset. The name of the variable is based on the APACHE III database name, and the description of the variable provides information about how the variable was collected and coded, where necessary. The type of variable is binary, discrete or continuous. Where the pre-processing is described as "standardised", its treatment follows the methods detailed in Sections 7.2.1 and 7.2.2. Logarithmic transformations are noted when used. The notes provide information about the values associated with the lowest risk of death, and the nature of the relationship between each variable and the patient outcomes. All the features described in the table were used in model building. There was no further processing of features.
[Table 7.1 Variables, descriptions, transformations and pre-processing: the multi-page table is not legible in this copy.]
7.3 Software
For the experiments described in this chapter, data were pre-processed in Statistica 6.0. The SVM programme was SVMlight Version 5.00 (3 July 2002), written by T. Joachims. It can be downloaded, and documentation is available, at http://svmlight.joachims.org. I have adapted the interface programming in Matlab 6.5 from an interface for SVMlight 4.0 by A. Schwaighofer at http://www.cis.tugraz.at/igi/aschwaig/software.html. The model output and performance were analysed in Matlab 6.5.
7.4 SVM Parameter Choice Guided by Model Attributes: Discrimination and Calibration
Any new model to predict the risk of 30 day in-hospital mortality must exhibit acceptable discrimination and calibration if it is to be used as a risk adjustment tool. The issues of model assessment were discussed in Chapter 2. Therefore, in this experiment, models were developed, assessed and compared by performance on a test dataset according to the average area under the ROC curve (discrimination) and the average H-L C statistic (calibration).
As before, adequate discrimination was defined as an ROC curve area of > 0.8 and acceptable calibration was defined as an H-L C statistic of 15.5 or less on test data of 400 cases.
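The two assessment statistics can be sketched as follows. This is an illustrative implementation, not the software used in the thesis: ties in the predictions and the split of cases into equal-sized risk groups are handled in an assumed, simple way.

```python
import numpy as np

def roc_auc(y, p):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    identity; ties in p are broken arbitrarily in this sketch."""
    y, p = np.asarray(y), np.asarray(p)
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def hosmer_lemeshow_c(y, p, groups=10):
    """H-L C statistic: cases are split into equal-sized groups by
    predicted risk, then a chi-square of observed vs expected deaths
    is accumulated over the groups."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    order = np.argsort(p)
    c = 0.0
    for idx in np.array_split(order, groups):
        o, e, n = y[idx].sum(), p[idx].sum(), len(idx)
        pbar = e / n
        c += (o - e) ** 2 / (n * pbar * (1 - pbar))
    return c
```

On a toy sample where the predictions perfectly separate survivors and non-survivors, `roc_auc` returns 1.0, while `hosmer_lemeshow_c` still measures how closely the predicted risks match the observed group rates.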
7.5 Choice of SVM Kernel and Parameters
To train a SVM model, the choice of a kernel function and the SVM parameters ε and C for that training application is required. A heuristic method was developed that searches the parameter space and is shown to provide a solution close to the optimal parameter choice.
The first step is an initial investigation to gain an understanding of the properties of the search space, and the second is a more detailed study of the area near to the optimum. In the first step, both the RBF kernel SVM and the polynomial kernel SVM were investigated, and the calibration and discrimination of models were assessed.
7.5.1 Estimation of Parameters for RBF SVM
The RBF kernel function that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = exp(-γ ||x_i - x_j||²)

where x_i and x_j are input patterns, and ||x_i - x_j|| is the 2-norm of the difference between these vectors. The RBF parameter γ was varied.
In the absence of any a priori knowledge of the best RBF SVM parameter values, the parameters were initially explored over large ranges: γ 0 - 2, ε 0 - 2 and C 0 - 4.
The first to be studied was the RBF kernel parameter γ, in the range 0 - 2. The default parameter values of SVMlight were used, with ε set at 0.01 and the SVM parameter C at the default value:

C_default = 1 / mean_i(||x_i||²), where x_i is the input vector and ||x_i|| = √(x_i · x_i)

The most promising RBF kernel was taken forward and then ε was explored. C was still set at the default value.
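A sketch of the RBF kernel and of SVMlight's default C (the reciprocal of the mean squared 2-norm of the training inputs), assuming NumPy arrays as inputs:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def svmlight_default_c(X):
    """SVMlight's default C: 1 / mean_i(x_i . x_i), the reciprocal of
    the mean squared 2-norm of the training input vectors."""
    X = np.asarray(X, dtype=float)
    return float(1.0 / np.mean(np.sum(X * X, axis=1)))
```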
When a value for ε was set, the best model at a range of values for C was determined. Multiple sampling and training runs were done, with 50 trials at each point using 800 randomly selected cases for training and 400 randomly selected cases for model testing.
Discrimination, measured by the largest area under the ROC curve on the test set, and calibration, measured by the lowest H-L C statistic on the test set, were used to determine which parameters were chosen at each stage.
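The 50-trial resampling protocol can be sketched as below. The SVM itself is replaced by a trivial stand-in model, since the thesis trained its models in SVMlight externally; the function names and the stand-in are illustrative assumptions.

```python
import numpy as np

def evaluate_parameters(data, labels, train_fn, metric_fns,
                        n_trials=50, n_train=800, n_test=400, seed=0):
    """Repeatedly draw disjoint random training and test samples, fit a
    model on the training sample, score it on the test sample, and
    return the average of each metric over the trials."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in metric_fns}
    for _ in range(n_trials):
        idx = rng.permutation(len(labels))
        tr, te = idx[:n_train], idx[n_train:n_train + n_test]
        model = train_fn(data[tr], labels[tr])
        preds = model(data[te])
        for name, fn in metric_fns.items():
            scores[name].append(fn(labels[te], preds))
    return {name: float(np.mean(v)) for name, v in scores.items()}

# Hypothetical stand-in model: predicts the training mortality rate.
def mean_rate_model(X, y):
    rate = y.mean()
    return lambda Xt: np.full(len(Xt), rate)

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 5))
y = (rng.random(1500) < 0.15).astype(float)
out = evaluate_parameters(X, y, mean_rate_model,
                          {"mean_pred": lambda yt, p: p.mean()},
                          n_trials=5)
```

In the thesis experiments, `metric_fns` would hold the ROC curve area and the H-L C statistic, and `train_fn` would call SVMlight with the candidate kernel and parameters.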
Figure 7.7 shows the relationship between RBF γ in the range 0 - 2 and the average test set area under the ROC curve. The best area under the ROC curve was 0.84, at RBF γ = 0.03. As with the surface plots of the area under the ROC curve (Figure 7.18), irregularity and variation in the plot contour are due to the effects of sampling the model training and test datasets.
Figure 7.7 Area Under the ROC Curve for SVM RBF Kernel γ in the range 0 - 1, average of 50 trials at each point, 800 cases training set, 400 cases test set, ε = 0.01, C = default [x axis: RBF parameter γ]
Figure 7.8 shows the response of model calibration to changes in RBF γ, with ε = 0.01 and the SVM parameter C set at the default value. The best average H-L C statistic (24.2) was also achieved at RBF γ = 0.03. Note that at larger γ values (> 0.8), the H-L C statistic again begins to decrease, but in this range the SVMs have poor discrimination on the test set (ROC < 0.7), which precludes their use.
Figure 7.8 H-L C Statistic for SVM RBF Kernel γ in the range 0 - 1, average of 50 trials at each point, 800 cases training set, 400 cases test set, ε = 0.01, C = default [x axis: RBF parameter γ; y axis: H-L C statistic]
This reinforces the importance of examining both discrimination and calibration. The graph demonstrates that the model calibration response has more than one minimum of the H-L C statistic within the range of adequate calibration. However, one of these minima occurs in models that have unacceptably low discrimination.
The RBF parameter γ was fixed at 0.03, and the area under the ROC curve for ε between 0 and 2, with C set at the default value, was then explored. Figure 7.9 shows a part of that trial, with values of ε in the range 0.1 - 0.5, demonstrating that the response of the area under the ROC curve is relatively flat between 0.1 and 0.4, dropping off sharply at around ε = 0.48. For ε > 0.5, model predictions were no better than random, and some models gave the same estimate to all data points.
Figure 7.9 Area under the ROC Curve for SVM RBF Kernel with ε in the range 0 - 0.5, average of 50 trials at each point, 800 cases training set, 400 cases test set, γ = 0.03, C = default [x axis: ε; y axis: area under the ROC curve]
Figure 7.10 shows the H-L C statistic across a range of ε, with γ = 0.03 and the SVM parameter C at the default value. The best calibration was at ε = 0.057, where the H-L C statistic was 17.0. Inspection of the model outputs at ε < 0.03 revealed a tendency toward separation of the output values, which clustered around 0 or around 1. With ε < 0.03, Figure 7.9 shows that discrimination was retained, but Figure 7.10 demonstrates that the model outputs did not accurately reflect the risk of death.
Figure 7.10 H-L C Statistic for SVM RBF Kernel with ε in the range 0 - 0.6, average of 50 trials at each point, 800 cases training set, 400 cases test set, γ = 0.03, C = default
The SVM parameter ε was set at 0.057, the RBF γ set at 0.03, and a range of C was trialled. Figure 7.11 shows the response of the area under the ROC curve to changes in C in the range 0.1 - 2. The plot response was fairly flat. There is some variation in the area under the ROC curve, which is due to the random choice of model training and assessment datasets.
Figure 7.11 ROC Curve Area for SVM RBF Kernel with range of C 0.1 - 2, average of 50 trials at each point, 800 cases training set, 400 cases test set, γ = 0.03, ε = 0.057 [x axis: SVM parameter C]
Figure 7.12 plots the average H-L C statistic against a range of the SVM parameter C, with the RBF kernel, γ = 0.03 and ε = 0.057. The best H-L C values (< 20) appear for SVM parameter C in the range 0.5 - 1. These values of C gave an average area under the ROC curve ranging between 0.826 - 0.836, and average H-L C values were 16.7 - 20, with the exception of C = 0.9, where the average H-L C statistic was 715. Inspection of individual model test set performances in the range of parameter C 0.8 - 1.5 revealed that many, but not all, SVMs performed quite well. However, some combinations of the test data, the model and the validation data had H-L C statistics that were high and increased the average value.
Figure 7.12 H-L C Statistic for SVM RBF Kernel with range of C 0.1 - 2, average of 50 trials at each point, 800 cases training set, 400 cases test set, γ = 0.03, ε = 0.057 [y axis: H-L C statistic, log scale; x axis: SVM parameter C]
7.5.2 Estimation of Parameters with Polynomial Kernel SVM
The polynomial kernel function that was used to implement the dot product of the mapping functions in Equation 6.3 was

K(x_i, x_j) = (x_i · x_j)^d

where d, the degree of the polynomial, was varied.
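A one-line sketch of this homogeneous polynomial kernel:

```python
import numpy as np

def poly_kernel(xi, xj, d):
    """K(x_i, x_j) = (x_i . x_j)^d, the homogeneous polynomial kernel."""
    return float(np.dot(xi, xj)) ** d
```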
Trials of the polynomial kernels were conducted on polynomial functions with d = 1 - 4. As with the experiment in the previous chapter, the SVMlight algorithm failed to converge to a solution for the 4th order polynomial kernel SVM, and it was not studied further.
Figure 7.13 shows the average areas under the ROC curves for the polynomial functions in the range ε 0 - 0.6, with C at the default value. The polynomial kernels (d = 1 and d = 2) displayed adequate discrimination in the range ε < 0.4. The best SVM discrimination was with the polynomial kernel d = 1 (ROC curve area 0.88) at ε = 0.003.
Figure 7.13 Area under the ROC Curve for SVM Polynomial Kernels d 1 - 3 for ε 0 - 0.6, average of 50 trials at each point, 800 cases training set, 400 cases test set, C = default [curves for d = 1, d = 2 and d = 3; x axis: ε; y axis: area under the ROC curve]
Figure 7.14 shows a plot of the average H-L C statistic over a range of ε 0 - 0.5, with the SVM parameter C at the default value. The calibration was best for the polynomial kernel d = 2 with ε about 0.1 - 0.2, giving H-L C statistics of 41.6 - 50.6.
Figure 7.14: H-L C Statistic for SVM Polynomial Kernels d 1 - 3 for Range of ε 0 - 0.5, average of 50 trials at each point, 800 cases training set, 400 cases test set, C = default [curves for d = 1, d = 2 and d = 3; y axis: H-L C statistic, log scale; x axis: ε]
To optimise C, the SVM polynomial (d = 2) kernel and ε = 0.2 were set. The range C 0 - 2 was studied. The best performance was at C = 0. A portion of the chart is shown in Figure 7.15. The best area under the ROC curve was 0.841 at C = 0, with decreasing area under the ROC curve as the SVM parameter C was increased.
Figure 7.15 Area Under the ROC Curve for SVM Polynomial Kernel for C 0 - 0.002, average of 50 trials at each point, 800 cases training set, 400 cases test set, d = 2, ε = 0.2 [x axis: SVM parameter C; y axis: area under the ROC curve]
Figure 7.16 shows the calibration of the 2nd order polynomial kernel SVM over a range of the SVM parameter C. The best H-L C statistic value for this range was 51, at C = 0.
Figure 7.16 H-L C Statistic for SVM Polynomial Kernel for C 0 - 0.002, average of 50 trials at each point, 800 cases training set, 400 cases test set, d = 2, ε = 0.2 [y axis: H-L C statistic, log scale; x axis: SVM parameter C]
No satisfactory combination of parameters for the SVM polynomial kernel was found to come close to both the desired discrimination and calibration targets.
7.5.3 Summary of the Approximation of SVM Kernel and Parameter Selection
The best discrimination found with the RBF kernel was at γ = 0.03, ε = 0.057, C = 0.6, which displayed an average area under the ROC curve of 0.83 and an average H-L C of 16.7. In comparison, the best polynomial kernel was a 2nd order polynomial kernel, ε = 0.2, C = 0, which had an area under the ROC curve of 0.84 and an average H-L C statistic of 51. The discrimination of both the RBF and the polynomial kernel SVMs was adequate. The average calibration of the RBF kernel SVM was better, and the parameters around this approximation were then investigated more intensively.
These results are in contrast to those of the previous chapter, where the linear polynomial kernel had the best discrimination on the regression task (0.86 - 0.87), while the discrimination of the RBF kernel SVM was less than 0.80 at all parameter values examined. The better performance with the RBF kernel could be because of the differences between the variables used for modelling in the two experiments. In the regression experiment of Chapter 6, four heavily pre-processed variables were used. In the regression experiment in this chapter, the non-binary patient observations, measurements and demographic information were pre-processed so that a central value of zero was associated with a minimum or maximum risk of death for each non-binary variable. This transformation may be better suited to the RBF kernel. Alternatively, the investigation of changes in the parameter C and finer tuning of model parameters may have revealed the potential performance of the RBF kernel SVM.
7.5.4 Investigation of Values around Estimated SVM Parameters
The approximation of SVM parameters provided a guide to the values of the SVM parameters to be investigated more intensely. The RBF kernel SVM was the most promising, and further investigation of the best parameters and performance was undertaken close to the approximate values found in the previous section. The parameter ranges chosen for more detailed study were: γ 0.01 - 0.1, ε 0 - 0.2 and C 0.4 - 4.0.
Within these ranges, the RBF kernel parameter γ and the SVM parameters ε and C were varied, and the average H-L C statistic and area under the ROC curve were calculated over 50 trials at each value. Each trial used a training set of 800 randomly chosen patients, and the model performance was assessed on a validation set of 400 randomly chosen patients. Three dimensional plots of the response of the average H-L C statistics and areas under the ROC curves were produced (Figure 7.17 and Figure 7.18).
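The parameter sweep can be sketched as a grid search that keeps only the settings meeting both targets (ROC curve area > 0.80 and H-L C ≤ 15.5). The trial runner below is a hypothetical stand-in, with its optimum placed at the values reported in the text; it is not the real 50-trial SVM evaluation.

```python
import itertools

def grid_search(gammas, epsilons, cs, run_trials,
                roc_target=0.80, hl_target=15.5):
    """Evaluate every (gamma, epsilon, C) setting with the supplied
    trial runner (which returns the averaged ROC area and H-L C over
    repeated samples) and keep the settings meeting both targets."""
    acceptable = []
    for g, e, c in itertools.product(gammas, epsilons, cs):
        roc, hl = run_trials(g, e, c)
        if roc > roc_target and hl <= hl_target:
            acceptable.append((g, e, c, roc, hl))
    return acceptable

# Hypothetical stand-in for the 50-trial SVM evaluation, with its
# optimum placed at gamma = 0.03, epsilon = 0.05, C = 0.6.
def fake_run(g, e, c):
    roc = 0.83 - 2.0 * abs(g - 0.03) - 0.1 * abs(e - 0.05) - 0.01 * abs(c - 0.6)
    hl = 15.0 + 200.0 * abs(g - 0.03) + 100.0 * abs(e - 0.05)
    return roc, hl

good = grid_search([0.02, 0.03, 0.04], [0.0, 0.05, 0.1],
                   [0.4, 0.6, 1.0], fake_run)
```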
Figure 7.17: Surface Plot of Average H-L C Statistic for RBF SVM
The surface plot tones correspond to the average H-L C statistic: dark blue < 15.5; light blue, green and yellow 15.5 - 20; red > 20. [Separate surface panels are shown for γ = 0.02, 0.03, 0.04, 0.05, 0.06 and 0.07, each over ε 0 - 0.2 and C 0.4 - 4.]
Figure 7.18: Surface Plot of Average Area under the ROC Curve for RBF SVM
The surface plot tones correspond to the average area under the ROC curve. Dark blue indicates areas under the ROC curve > 0.82. Light blue, green and yellow are adequate areas under the ROC curve of 0.80 - 0.82. Red colours the areas under the ROC curve of < 0.80, which indicates unacceptable discrimination.
The plots in Figure 7.18 have irregular surface variations with local minima, due to the effects of random sampling with different model training and testing datasets. Overall, these 3 dimensional plots show that the discrimination is generally better for SVMs with γ between 0.01 and 0.04 than for γ > 0.04, and that discrimination is generally better for smaller values of parameter C at any choice of γ and ε. This is in contrast to the important relationships in the calibration surface plots in Figure 7.17, where changes in ε had the most dramatic effect on the values of the H-L statistic.
In Figure 7.18, the response is relatively flat for the surface plots at each of the RBF γ values. In general, smaller values of C and smaller values of ε give the best discrimination at any RBF kernel γ value. A SVM with an RBF γ in the range 0.01 - 0.04 will provide adequate discrimination, with area under the ROC curve > 0.80, across the range of ε and C. The best areas under the ROC curve, greater than 0.83, were found in the following zones: RBF γ = 0.02 with C 0.4 - 1.0 and ε 0 - 0.2; RBF γ = 0.03 with C 0.4 - 0.8 and ε 0 - 0.16; and RBF γ = 0.04 with C 0.4 - 0.6 and ε 0 - 0.16.
The largest average area under the ROC curve was 0.833, found at RBF γ = 0.02 with C = 0.6, ε = 0 and at C = 0.5, ε = 0.
Therefore, from the comparison of the 3 dimensional plots, the parameter settings that are most likely to give adequate models on this dataset are RBF γ = 0.03 with C = 0.6 and ε = 0.05, and RBF γ = 0.04 with C 1.0 - 1.9 and ε = 0.03. At these parameter settings, I expect that for samples of 800 cases, an RBF SVM will be trained that, when tested on a random sample of 400 cases, will have an area under the ROC curve of 0.82 - 0.83 and a H-L C statistic of < 15.5. A model with this performance would be suitable for use as a risk adjustment tool for control charting.
The calibration response was particularly sensitive to changes in ε and γ. In contrast, most parameter choices gave acceptable discrimination, but generally, smaller values of C gave slightly better discrimination than larger values.
The initial heuristic was able to provide a good estimate of the optimal parameter choices for the RBF SVM, and allowed the more intensive investigation of SVM parameters to be limited to an area with acceptable discrimination, initially in the parameter ranges γ 0.01 - 0.10, ε 0 - 0.2 and C 0 - 2. The range of C that was investigated was extended, as the area of almost acceptable calibration (H-L C 15.5 - 20) extended beyond C = 2. At C = 4 the discrimination was beginning to drop off, below an area under the ROC curve of 0.80.
The experiment in Figures 7.17 and 7.18 required 154,000 RBF SVMs to be trained and analysed. Each model, including sample selection, model building and analysis, took about 5 seconds to complete, and so the experiment took 9 days to run on a single 2.7 GHz processor. If the intensive study had been conducted with the same detail over the original ranges of the parameter estimates (γ 0.01 - 2, ε 0 - 2 and C 0 - 4) the experiment would have been intractable.
The shape of the surface plots of the area under the ROC curve and the H-L C statistic in response to changes in the SVM parameters allowed the estimation of approximate SVM parameters to be reasonably accurate. This estimation involved a single parameter being studied with the other parameters fixed. This approach works in this application because of the shape of the error surfaces and because two model characteristics were used to guide the parameter selection.
Using only a single model performance attribute, for example the H-L C statistic, could potentially lead to errors. The convex response seen in Figure 7.8, where the range of RBF γ was studied, might have led to the choice of an inappropriately large value for γ if the area under the ROC curve had not also been used. At low RBF γ values the areas under the ROC curve were > 0.82. At large RBF γ, the discrimination was no better than random, yet the H-L C statistic suggested adequate calibration of the model.
7.6 Discussion
These experiments demonstrated that the probability of ICU patient death in-hospital within 30 days of admission can be modelled with a SVM, using only the patient data available within the first 24 hours of ICU admission. The approach to optimisation of the SVM, using discrimination and calibration to guide the parameter choice, was successful in identifying parameters to build models of adequate performance.
The approximation of SVM parameters using an initial search heuristic provided an estimate of the best RBF parameters. This strategy reduced the computing time that would have been required. It is not certain that this approach will work in all applications of regression SVM. However, in this task, to estimate the probability of ICU patient death, the simple exploration of parameter space efficiently localised an approximate area of optimal parameter choice. This was possible because of the shape of the surfaces of the calibration and discrimination curves seen in Figures 7.17 and 7.18.
The results of the SVM models built in this experiment are equivalent to previous studies with ANN and logistic regression to estimate the probability of patient death. The calibration and discrimination of the RBF SVM on the test data are similar to the results of the models described by Clermont et al., using logistic regression and ANNs on a similar set of ICU patient data. Modelling data from 800 cases, Clermont described areas under the ROC curves for the logistic regression (0.829) and ANN (0.810), which is comparable to that found with the RBF SVM (0.82 - 0.83). The SVM assessment was the average of 50 trials of randomly selected sample sets, in contrast to the single, non-random sample of Clermont et al. Therefore, the RBF SVM discrimination performance is at least equivalent to their study. As 50 trials were conducted on random samples, the choice of RBF SVM parameters is very likely to consistently produce SVM models that have adequate performance on the PAH ICU dataset.
However, the APACHE III estimates of probability of death still offer superior discrimination, with an area under the ROC curve of 0.89 on the PAH ICU dataset. The APACHE III model was developed on 17,440 patient admissions. It is possible that with such a large dataset, similar levels of discrimination could be achieved with SVM modelling. However, very large datasets are impractical for local, single institution model development.
The calibration of the best RBF SVM was good, with an average H-L C of less than 15.5. Caution must be taken with any comparisons using the H-L statistic, particularly as the mortality endpoints differ and the size of the validation set of Clermont et al. is 447 cases, against 400 cases in the present SVM experiment. However, the SVMs probably have superior calibration to the logistic regression model of Clermont et al. (H-L C = 45.9) developed with data from 800 patients. The best SVM is, on average, equivalent to the calibration of their best ANN (H-L C = 17.3).
These results indicate that using an RBF SVM to model patient data offers a practical and reproducible alternative for modelling the probability of ICU patient death within 30 days of admission.
However, the SVM model using the component patient data in this experiment performed less well than models using pre-processed data, such as the APACHE III scores of the previous chapter. The area under the ROC curve of the APACHE III score alone on the PAH ICU dataset is 0.837, yet the largest ROC curve area found with the RBF SVM was 0.833. In the previous chapter, using the variables of APACHE III score, Age, Chronic Health Score and the Diagnostic Code, the average area under the ROC curve for both the polynomial SVM and the multi-layer perceptron ANN was 0.86. Similarly, using the same variables on the same dataset, the logistic regression models developed by Graham and Cook also displayed ROC curve areas on the validation set of 0.86. The models of Clermont et al., from a similar ICU, incorporating the APACHE III score had ROC curve areas up to 0.84. It may be, therefore, that future improved SVM models should include the APACHE III score, even though this means that such models are still reliant on the APACHE III system.
There must be an upper limit to the performance of a model that estimates probability of death using information from the first 24 hours in ICU. Much potentially important explanatory information is not available. An example relevant to this dataset would be detail of the success or otherwise of surgery. Important episodes that occurred in the lead time prior to ICU admission are not captured, and neither are the events, complications and progress after the first day. There will always be uncertainty in models using patient data from the first day only. It is this uncertainty, which represents random events and effects but also embraces the quality of patient care, that is the reason for the monitoring of risk adjusted mortality rates.
The choice of kernel and parameters is a problem to be solved for each set of data. In the SVM, the kernel function and the parameters ε and C are chosen or varied to provide optimal regression performance. For regression, the only reliable method that exists for all kernels and datasets is the use of a cross validation set, as used in these chapters. In this application, optimisation was guided by multiple trials of samples of training and testing sets. By using re-sampling and many training examples, SVM parameter values were identified that are most likely to work well in any sub-sample of the patient data. The average measurement of 50 samples at each data point demonstrated the reproducibility of the model performances at the parameter settings.
The choice of ε, by setting the regression tube width, affects the approximation error, the number of support vectors, the training time and the complexity of the solution. A training example will be a support vector only if the approximation error is larger than ε. A large ε will give few support vectors and rapid training, but poor accuracy, as the regression tube is very wide. A small ε defines a narrow regression tube, with a large number of support vectors and a complex model with a long training time. Mattera and Haykin suggested a robust approximation method for ε: choosing a value so that approximately 50% of the examples in the training set are support vectors. This is a compromise between complexity and maintaining a small approximation error. An alternative suggestion, by Weston et al., to set ε = 0 and rely on adjustments to C to control the generalisation performance, takes no account of training time or complexity considerations.
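Mattera and Haykin's 50% heuristic can be sketched as a quantile rule on the absolute training residuals; the residual distribution used here is an assumption for illustration only.

```python
import numpy as np

def epsilon_for_sv_fraction(abs_residuals, target_fraction=0.5):
    """Choose epsilon as the (1 - target_fraction) quantile of the
    absolute approximation errors: a training case is a support vector
    only if its error exceeds epsilon, so roughly target_fraction of
    the cases then lie outside the regression tube."""
    return float(np.quantile(abs_residuals, 1.0 - target_fraction))

# Assumed residual distribution, for illustration only.
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(0.0, 0.1, 800))
eps = epsilon_for_sv_fraction(residuals, 0.5)
frac_sv = float(np.mean(residuals > eps))
```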
For the PAH ICU patient data, the noise and the desired accuracy were not known. The values of ε were chosen experimentally to provide the best discrimination and calibration. The number of support vectors in the trained models, where the training set is 800 randomly selected cases, was on average 500, or 62.5%. The use of the ROC curve area and H-L C gave a similar number of support vectors to the approximation method of Mattera and Haykin.
The choice of C is also determined by experimentation. A guideline proposed for choice in the RBF kernel SVM by Mattera and Haykin is that C should be of a similar size to the expected outputs of the SVM, which are estimates of probability and thus between 0 and 1. This is only a guide, and further work on optimisation is necessary. In this application, with the RBF γ set to 0.03 and ε set at 0.05 - 0.06, the best values for C were experimentally determined to be between 0.5 and 1.5. The value of 0.6 was chosen by the parameter estimation, and is consistent with the estimate suggested by Mattera and Haykin.
In this light, the proposals of Mattera and Haykin for the estimation of C and ε can be used in conjunction with the parameter estimation method used in this chapter, as a means of localising the parameter solution and minimising the computing time required.
7.7 Conclusion

The probability of ICU patient death in-hospital within 30 days of admission can be modelled with an SVM, using only the patient data available within the first 24 hours of ICU admission. The approach to optimisation of the SVM, using the discrimination and calibration statistics to guide the parameter choice, is successful in identifying parameters to build models of adequate performance to use as risk adjustment tools.
These models can be trained and cross-validated on the number of cases seen in one year at the PAH ICU. Though the performance is less than that of the APACHE III models during 1995 - 1997, the SVM models have the advantages of using a 30 day in-hospital mortality endpoint, having the flexibility to be remodelled when the model no longer fits, and not incurring an annual licence fee.
Chapter 8 Summary, Conclusions and Future Work
8.1 Summary

The aim of this work was to develop risk adjusted control chart methods for monitoring in-hospital mortality outcomes in Intensive Care. It is a medical application of statistics and machine learning to the measurement of outcomes for quality management.
The methods of assessment of models that estimate the probability of death for a patient in the intensive care unit (ICU) were reviewed. From this review, the important attributes of discrimination and calibration were identified as the key measures of model performance. The performance of a model may deteriorate when it is applied to patient data which are not part of the database from which the model was developed. Therefore, a model such as APACHE III, which was developed on a large North American ICU population, must be validated in the Australian ICU setting before any conclusions can be drawn about the reliability of its mortality estimates in that particular population.
The area under the receiver operating characteristic (ROC) curve is the most useful approach to assessment of model discrimination. From a review of the assessment of models for ICU patient mortality, a reasonable expectation of the discrimination of ICU models is an area under the ROC curve greater than 0.80. Calibration is more difficult to assess. To evaluate calibration, a calibration curve showing observed and expected mortality in ranges of risk, and a statistical evaluation of calibration, such as the Hosmer-Lemeshow (H-L) statistic, can be used.
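Both measures can be computed directly from outcomes and predicted risks. The following is an illustrative numpy sketch (not the software used in this work): the ROC area via the rank-sum identity, and the H-L C statistic from deciles of ranked risk.

```python
import numpy as np

def roc_auc(y, p):
    """Area under the ROC curve via the Mann-Whitney identity: the
    probability that a randomly chosen non-survivor receives a higher
    predicted risk than a randomly chosen survivor (ties count 1/2)."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    pos, neg = p[y == 1], p[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def hosmer_lemeshow_C(y, p, groups=10):
    """H-L C statistic: cases are ranked by predicted risk, split into
    `groups` bins, and a chi-square of observed versus expected deaths
    is accumulated over the bins."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    bins = np.array_split(np.argsort(p), groups)
    c = 0.0
    for b in bins:
        obs, exp, n = y[b].sum(), p[b].sum(), len(b)
        pbar = exp / n
        c += (obs - exp) ** 2 / (n * pbar * (1 - pbar))
    return c

auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # perfectly separated toy data
```

The pairwise AUC computation is O(n²) and is for illustration only; rank-based formulations scale better to large series.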
These considerations were the basis for the evaluation of the APACHE III models at the Princess Alexandra Hospital ICU for the period 1995 - 1997. The models provided by the APACHE III system to predict the risk of death of an ICU patient in-ICU and in-hospital performed very well. For all models, the
discrimination was excellent, with the area under the ROC curve being 0.90 - 0.92. The calibrations of the in-ICU mortality models, and the in-hospital mortality model with proprietary adjustments for hospital characteristics were very good. The APACHE III model with adjustments for hospital characteristics provided the most accurate estimates of the probability of in-hospital death.
The initial control chart application was to in-hospital ICU patient mortality data without risk adjustment (RA). The p charts, cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) charts were used to analyse the in-hospital mortality rate. The years 1995 - 1997 were used as an in-control period to establish the control mortality rate, and charting was carried on through 1998 - 1999.
There was a significant fall in the mortality rate from the in-control rate of 0.16 to approximately 0.13. A post hoc analysis suggested that a change in the overall severity of patient illness probably did not occur. However, a change in casemix was demonstrated. There was an increase in elective surgery. This type of patient generally has a low risk of in-hospital death, and the change contributed to the decrease in the in-hospital mortality rate of ICU patients.
The second control chart application developed the techniques of control chart analysis of in-hospital ICU patient mortality data with RA. The APACHE III model with proprietary adjustments for hospital characteristics provided accurate estimates of in-hospital death, and was incorporated into the p chart, CUSUM and EWMA charts. The use of a RA model, such as APACHE III, improved the information provided by the control charts. After adjusting for the severity of illness and casemix of patients, these analyses demonstrated that patient survival had improved between 1995 and 1999. During the first 1 - 2 years, there were signals that the risk of patient death was higher than that predicted by the APACHE III model. During the last year of the analysis, the observed patient mortality was lower than that predicted by the APACHE III model. Possible explanations are that the RA monitoring has documented improvements in patient outcomes or quality of care.
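The core of risk adjustment in a p chart can be sketched as follows: for a block of patients, the centre line is the mean model-predicted risk, and the control limits follow from the sum of the individual Bernoulli variances. This is a minimal sketch of the general idea, not the specific chart designs of Chapter 5; the function name and 3-sigma default are assumptions.

```python
import numpy as np

def ra_p_chart_limits(pred_risks, z=3.0):
    """Risk adjusted p chart limits for one block of patients.

    pred_risks: each patient's model-predicted probability of death
    (e.g. from a validated model such as APACHE III). The expected
    block mortality is the mean predicted risk; its standard error
    follows from the sum of independent Bernoulli variances.
    """
    p = np.asarray(pred_risks, dtype=float)
    n = len(p)
    centre = p.mean()
    se = np.sqrt(np.sum(p * (1.0 - p))) / n
    lcl = max(0.0, centre - z * se)
    ucl = min(1.0, centre + z * se)
    return centre, lcl, ucl

# With a uniform risk of 0.16 this reduces to the unadjusted binomial limits:
centre, lcl, ucl = ra_p_chart_limits([0.16] * 100)
```

The observed block mortality is then plotted against limits that move with the casemix of each block, which is what distinguishes the RA chart from the fixed-limit p chart.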
If the observed mortality rate is consistently different to that predicted by the RA model, it means that the RA model no longer provides an accurate assessment of the patient risk of death. In this application, the APACHE III model was consistently overestimating the probability of patient death by the end of 1999. Therefore, I developed alternative tools to the APACHE III model to estimate the probability of patient deaths at the PAH ICU. Machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), were developed on the PAH ICU database, and thus provide a new RA tool.
Several shortcomings of the APACHE III system were addressed in the new model development. An endpoint of 30 day mortality was chosen, instead of the in-hospital mortality used by APACHE III. Mortality data could be analysed just 30 days after the episodes of patient care commenced, rather than waiting months for all patient hospitalisations to be finalised.
The APACHE III algorithm functions as a "black box" without model updates. In contrast, the machine learning models are intended to be updated when the model fit deteriorates. To this end, models were successfully and reliably developed on a training set of 800 cases and a test set of 400 cases. The data for model training and cross validation can be collected over approximately one year. This means that it is practical for a single ICU to develop and maintain RA models to continuously monitor their clinical outcomes.
The APACHE III software and model command an annual licence fee, whereas a locally developed RA model does not have that additional cost.
In Chapter 6, the preliminary ANN and SVM models were developed using the APACHE III based variables of acute physiological score, modified disease category and chronic health score, and the patient age. The models demonstrated good discrimination, but very poor calibration. Subsequently, a multilayer perceptron ANN and a linear kernel SVM were both recalibrated with a general regression neural network.
The performances of these models were equivalent to previous ANN and logistic regression models in a similar ICU application, and to logistic regression on the PAH ICU database.
A further experiment was then conducted to model 30 day in-hospital mortality. The component variables that describe patient physiology, demographics, laboratory results and diagnosis were used, rather than the APACHE III variables. Standardisation and scaling of the raw patient data were performed using successful data transformation strategies, after considering other machine learning applications in the ICU. A radial basis function (RBF) SVM was used to estimate the probability of 30-day in-hospital mortality. The SVM models were developed, cross-validated and compared according to the average area under the ROC curve (discrimination), and the average H-L C statistic (calibration).
A simple and efficient heuristic method to search for optimal SVM parameters was used. In this way, the optimal parameter choice was localised in parameter space, and this region was intensively explored for the best performing SVM models. The average performance of the RBF SVM on random samples of the dataset was adequate, with an area under the ROC curve of greater than 0.83 and an H-L C statistic of less than 15.5 on 400 patient test sets.
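A coarse-to-fine search of this kind can be sketched generically: evaluate a coarse grid, then lay a finer grid around the best coarse point. The objective below is a stand-in with a known optimum near C = 0.6 and γ = 0.03 (the values reported in Chapter 7); in the thesis the score combined cross-validated ROC area and H-L C, which this sketch does not reproduce.

```python
import itertools

def refine_grid(lo, hi, steps):
    """Evenly spaced grid of `steps` points on [lo, hi]."""
    step = (hi - lo) / (steps - 1)
    return [lo + i * step for i in range(steps)]

def two_stage_search(score, coarse_ranges, steps=5, shrink=0.25):
    """Coarse grid search over parameter ranges, then a finer grid
    centred on the best coarse point. `score` maps a parameter tuple
    to a value to be maximised."""
    def best_on(ranges):
        grids = [refine_grid(lo, hi, steps) for lo, hi in ranges]
        return max(itertools.product(*grids), key=score)
    best = best_on(coarse_ranges)
    fine_ranges = [
        (max(lo, b - shrink * (hi - lo)), min(hi, b + shrink * (hi - lo)))
        for b, (lo, hi) in zip(best, coarse_ranges)
    ]
    return best_on(fine_ranges)

def demo_score(params):
    # Stand-in objective; a real score would come from cross validation.
    C, gamma = params
    return -((C - 0.6) ** 2 + (gamma - 0.03) ** 2)

best = two_stage_search(demo_score, [(0.1, 2.0), (0.005, 0.1)])
```

As noted in section 8.3, such a heuristic localises a promising region cheaply but is not a substitute for thorough cross-validated evaluation of the final candidates.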
8.2 Original Contributions

This thesis brings together several streams of research. It arose from clinical medicine, patient care and the need to measure the quality of clinical care. To accomplish this, several novel contributions were made.
The most important contribution of this work is the application and development of RA control charts to monitor the ICU mortality rate. This paradigm of incorporating adjustments for casemix and severity of illness has not been applied to the ICU before. It is an idea that has wide application to monitor the quality of the care that we offer in many areas of the health care service. The prerequisites are the ability to accurately measure patient outcomes, and to collect patient data to permit modelling estimates of the probability of these outcomes.
Development of these RA charts involved the following novel contributions.

i. The assessment of the APACHE III models at the PAH was the first assessment of Australian experience with APACHE III, and used the largest single institution series of patients outside of the United States of America.

ii. The use of control charts to monitor in-hospital mortality of ICU patients is a logical extension of industrial statistics, but has not been carried out elsewhere. This is the first use of the APACHE III score as a validated RA tool to continuously monitor ICU outcome. The design and analysis of each of the techniques I developed is described in the Appendices. This allows the methods to be adapted and modified according to different clinical requirements.

iii. The RA control chart work is original in its application and involves modifications of previous methods. In addition, the RA EWMA chart is entirely new. The RA p chart was developed from the method of Alemi et al. and incorporates a method for analysing the power of each sample, adapted from Flora. The Z score p chart relies on the work of Flora and Sherlaw-Johnson et al. The use of the iterative method to characterise the distribution used for constructing p charts is presented in Chapter 5. The RA CUSUM is based on the work of Lovegrove et al., and the moving frame approach is adapted from the work on charting cardiac surgery by Poloniecki et al. The use of the RA CUSUM of Steiner and co-workers is an original application in an ICU setting, though the method, apart from the use of APACHE III, has not been modified. Both the RA EWMA charts, with the parametric approximation and the discrete approximations, are new work.

iv. The 30 day in-hospital mortality endpoint that was used in the machine learning is an original contribution, based on the specific requirements of RA control chart analysis of patient mortality.

v. This is the first application of SVMs to estimate the probability of ICU patient death.

vi. A model that estimates the probability of patient death has the attributes of discrimination and calibration. These two performance measures were used to guide model development and SVM parameter selection.
8.3 Future Research

During the course of this project, several areas were identified where further research is required.
Further study of the assessment of the performance of models that predict the probability of death of ICU patients is required. At present, there is a good understanding that model reliability is reduced when the model is applied to ICU contexts different from those where it was developed. Issues of data collection and rule interpretation, patient variables, clinical practice, admission and discharge practices, casemix, lead-time, mortality rate and type of hospital have all been shown to affect the ability of models to accurately estimate the probability of death of ICU patients. More work is required to understand the effects of these and other factors.
The assessment of model calibration proved a difficult part of this project. The H-L C statistic was determined to be the best technique available at the time at which this study was undertaken. Further work may lead to an alternative technique to measure the calibration and reliability of probability estimates.
The RA charts described in this thesis are initial steps in what could be a much larger undertaking. Potentially fruitful areas of research exist in additional analysis of the behaviour of the control charts under a larger range of plausible and clinically important situations. This thesis was limited to analysing the effects of increases and decreases in mortality rates, and increases and decreases in the odds ratio. From these scenarios, control chart parameters and control limits were chosen, and charts were constructed. Many other combinations of changed casemix, early discharge, and simulations of aspects of poor or improved care should be explored.
The use of the central limit theorem to permit estimation of the expected distribution of mortality rates and the EWMA statistics involves approximations. The errors are probably small compared to the imprecision of the currently available RA tools for ICU. However, further work can be done using alternative methods to characterise the distributions. Iterative methods, exact to the limits of the model predictions, are presented in this work. Further research can be done to more efficiently calculate the distributions. Examples are Monte Carlo simulations, or possibly analytical expressions for determining the probability density function of the mortality rates, based on the individual patients' predicted risks of death.
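One such iterative method can be made concrete: the number of deaths among n patients with individual predicted risks follows a Poisson binomial distribution, whose exact probability mass function can be built by adding one patient at a time and convolving. This sketch is a generic illustration of the idea, not the specific algorithm of Chapter 5.

```python
import numpy as np

def deaths_pmf(risks):
    """Exact distribution of the number of deaths in a block of patients,
    each with an individual predicted risk of death (Poisson binomial).
    Built iteratively: add one patient at a time and convolve."""
    pmf = np.array([1.0])             # zero patients: certainly 0 deaths
    for p in risks:
        new = np.zeros(len(pmf) + 1)
        new[:-1] += pmf * (1.0 - p)   # this patient survives
        new[1:] += pmf * p            # this patient dies
        pmf = new
    return pmf

# With identical risks the result reduces to the binomial distribution;
# mean deaths = 50 * 0.16 = 8 for this uniform-risk block.
pmf = deaths_pmf([0.16] * 50)
```

The same recursion applied to model-predicted risks gives control limits that are exact to the limits of the model predictions, at O(n²) cost per block, which is trivial for blocks of 50 - 100 patients.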
The latter part of this thesis described machine learning applications to estimate the probability of death of ICU patients. SVMs have, to my knowledge, only been employed once previously in the ICU setting, and only then to predict physicians' clinical actions, rather than to estimate the probability of death. There is considerable opportunity to investigate other applications in ICU and to further optimise the models to predict patient death.
The choice of kernels and SVM tuning parameters provides an area for ongoing work. The approach presented here is practical, as it uses the desirable performance characteristics of the potential RA tool to guide optimisation of the SVM parameters. A simple search heuristic to identify the likely area to search more intensively may not work in all applications. It is not a replacement for thorough parameter evaluation and testing of model performance by cross validation. Genetic algorithms, for example, may offer a fruitful alternative method to efficiently localise optimal parameter settings.
In this application, the polynomial and RBF kernels were investigated, but there are many other kernels which can be applied and optimised on the data. SVMlight provides the opportunity for future investigation of linear kernels, tanh sigmoid kernels, a range of polynomial functions and a user defined kernel option.
The recommendation of specific kernels for probability density estimation by Weston and co-workers points to a useful direction for future investigation. The variable selection and processing used for the SVM model can be explored further. Though the models gave acceptable performance, it is possible that more intensive evaluation may provide improvements.
All of the raw data components in the APACHE III algorithm were used for the SVM models in Chapter 7. Any additional variables from the database that may have had some explanatory power were included. It is possible that further additional variables could be included for future models. Variables describing organ failure, additional laboratory results such as potassium, platelet count and lactate, and injury severity score, therapeutic interventions, gender, marital status and ethnicity may improve prediction of outcomes.
Some of these variables could be improved by expert pre-processing. For example, there is potential for inclusion of the alveolar-arterial oxygen gradient, using the blood gas measurements to better estimate the ability of the lungs to exchange oxygen. The diagnostic code used in this study could be revised now that the APACHE III diagnostic weights are all publicly available.
It is not certain whether using all of the patient variables that were available on the database maximised the performance of the SVM models. Alternatively, the large number of variables could have increased the complexity of the model and perhaps led to deterioration in model performance. Ideally, feature selection should order the features by effectiveness, to provide the best combination of features and remove those of negligible relevance. Techniques such as automatic relevance determination, or alternative statistical methods of selection, could be pursued.
The data pre-processing was carried out to provide features with a similar range of values. The pre-processing was based on methods used in other machine learning applications. Where possible, each feature had an identifiable minimum or maximum in relation to the risk of death associated with that feature. The SVM was ineffective when used with raw data, and it is not clear which of the pre-processing manipulations were beneficial, or indeed whether some of the pre-processing may have limited the model's ultimate performance. Further experiments are necessary to explore the necessity and suitability of the pre-processing and transformations used in this experiment.
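One common transformation of this kind, scaling each feature to a similar range using constants fitted on the training set only, can be sketched as follows. This is a generic illustration with invented values (a heart rate and a pH column), not the exact transformations used in Chapter 7.

```python
import numpy as np

def fit_minmax(X):
    """Record per-feature minima and maxima from the training set only,
    so that test data are scaled with the same constants."""
    return X.min(axis=0), X.max(axis=0)

def scale_minmax(X, lo, hi):
    """Map each feature to [0, 1]; values outside the training range
    are clipped rather than extrapolated."""
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant features
    return np.clip((X - lo) / span, 0.0, 1.0)

# Toy training matrix: columns are heart rate and arterial pH.
train = np.array([[60.0, 7.35],
                  [120.0, 7.50],
                  [90.0, 7.20]])
lo, hi = fit_minmax(train)
scaled = scale_minmax(train, lo, hi)
```

Fitting the scaling constants on the training set and reusing them on the test set avoids leaking test information into the model, which matters when models are cross-validated on small yearly samples.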
In summary, the conclusions of this work are that RA control charting offers an important adjunct to current methods of assessment of ICU outcome to monitor the quality of care. SVMs provide a practical approach to model the probability of in-hospital mortality of ICU patients for RA, based on patient data obtained in the first 24 hours. Their development can be effectively guided by optimisation of the attributes of discrimination and calibration. Models can be reliably built and assessed on 1200 cases, and so provide a RA model for a single ICU.
Appendix 1
Data Description of Admissions to Princess Alexandra Hospital Intensive Care Unit: 1995 - 2000

The dataset for analysis was expanded from that presented in Chapter 3 (1 January 1995 to 1 January 1997) with additional patients from 1 January 1998 to 31 December 1999. This extended dataset (1 January 1995 to 31 December 1999) is used in Chapters 4 to 7 for control chart analysis and modelling.
A1.1 Data Collection and Sample Summary

Patient eligibility and exclusions and the statistical analysis are the same as in Chapter 3. There were 5681 eligible episodes of ICU admission analysed. There were 5278 primary admissions and 403 (7.1%) readmissions. The demographic features of the patients are summarised in Table A1.1. The overall mortality was 515 in-ICU and 779 in-hospital deaths from 5278 patient hospitalisations.
Table A1.1: Demographic features of the primary admissions to PAH ICU 1/1995 - 12/1999

Age, years: mean (sd)                                    53.0 (19.0)
Male: %                                                  62.4
Surgical patients: number (% of admissions)              3154 (55.5)
Elective surgical patients: number (% of admissions)     2365 (41.6)
Emergency surgical patients: number (% of admissions)    789 (13.9)
Non operative patients: number (% of admissions)         2527 (44.5)
APACHE III score: mean (sd, range)                       47.5 (25.7, 0-187)
ICU mortality: % (number of deaths)                      9.1 (515)
Hospital mortality: % (number of deaths)                 14.8 (779)
ICU length of stay, days: mean (sd), median              2.9 (5.1), median = 1.0
Hospital length of stay, days: mean (sd), median         27.9 (46.6), median = 15.0

Abbreviations: sd, standard deviation
The following Tables A1.2 and A1.3 present a summary of the mortality rate data that were used for the charting and analysis in Chapter 4. Table A1.2 groups admissions by month, and Table A1.3 groups patients in consecutive blocks of 50 patients.
Table A1.2: Admissions Grouped by Month of Admission to Princess Alexandra Hospital Intensive Care Unit 1995 - 1999

Month     Observed Mortality Rate     Number of Admissions
Jan-95    0.11    99
Feb-95    0.16    87
Mar-95    0.13    101
Apr-95    0.15    88
May-95    0.19    86
Jun-95    0.17    87
Jul-95    0.20    100
Aug-95    0.15    89
Sep-95    0.16    74
Oct-95    0.21    82
Nov-95    0.16    94
Dec-95    0.12    76
Jan-96    0.19    74
Feb-96    0.21    77
Mar-96    0.16    101
Apr-96    0.18    89
May-96    0.17    87
Jun-96    0.21    86
Jul-96    0.14    93
Aug-96    0.26    96
Sep-96    0.22    81
Oct-96    0.12    104
Nov-96    0.17    95
Dec-96    0.19    84
Jan-97    0.11    88
Feb-97    0.12    90
Mar-97    0.15    103
Apr-97    0.09    90
May-97    0.13    82
Jun-97    0.16    87
Jul-97    0.12    86
Aug-97    0.15    78
Sep-97    0.20    74
Oct-97    0.15    79
Nov-97    0.19    86
Dec-97    0.12    89
Jan-98    0.06    79
Feb-98    0.10    86
Mar-98    0.11    88
Apr-98    0.15    101
May-98    0.11    82
Jun-98    0.14    84
Jul-98    0.17    76
Aug-98    0.20    71
Sep-98    0.16    90
Oct-98    0.16    85
Nov-98    0.08    77
Dec-98    0.12    89
Jan-99    0.08    75
Feb-99    0.12    92
Mar-99    0.13    98
Apr-99    0.15    87
May-99    0.13    97
Jun-99    0.13    118
Jul-99    0.15    105
Aug-99    0.11    92
Sep-99    0.14    93
Oct-99    0.10    86
Nov-99    0.06    83
Dec-99    0.18    82
Table A1.3: Admissions Grouped into Ordered Blocks of 50 Admissions: 1995 - 1999
Entries are cumulative admissions: observed deaths in block (? indicates an unreadable value in the source).

50: 6     100: 4    150: 8    200: 8    250: 6
300: 7    350: 9    400: 7    450: 10   500: 8
550: 8    600: 7    650: 14   700: 8    750: 5
800: 10   850: 12   900: 8    950: 9    1000: 7
1050: 5   1100: 9   1150: 10  1200: 10  1250: 11
1300: 6   1350: 8   1400: 8   1450: 8   1500: 13
1550: 7   1600: 8   1650: 8   1700: 10  1750: 14
1800: 13  1850: 10  1900: 6   1950: 6   2000: 9
2050: 7   2100: 14  2150: 6   2200: 4   2250: 9
2300: 2   2350: 8   2400: 7   2450: 5   2500: 5
2550: 8   2600: 4   2650: 8   2700: 8   2750: 5
2800: 9   2850: 6   2900: 12  2950: 8   3000: 10
3050: 5   3100: 6   3150: 8   3200: ?   3250: 7
3300: 6   3350: 4   3400: 10  3450: 6   3500: 6
3550: 5   3600: 6   3650: 6   3700: 12  3750: 11
3800: 5   3850: 10  3900: 7   3950: 9   4000: 4
4050: 5   4100: 7   4150: 3   4200: 6   4250: 8
4300: 5   4350: 5   4400: 5   4450: 9   4500: 7
4550: 8   4600: 6   4650: 8   4700: 4   4750: 10
4800: 9   4850: 3   4900: 9   4950: 5   5000: 5
5050: 2   5100: 3   5150: 9   5200: 8   5250: ?
Appendix 2

Analysis of Mortality Rate Observations PAH ICU: 1995 - 1997, and Estimation of In-Control Parameters

The purpose of Appendix 2 is to present an analysis of the PAH ICU patient data from 1 January 1995 to 31 December 1997, including evaluating whether the process was in-control, and estimating the process parameters. The analysis was conducted according to the principles detailed by Kennett and Zacks. This analysis of the patient data examines whether the observed mortality rates are approximately normally distributed, and whether the process is stationary (i.e. the distribution of mortality rates does not change over time).
A three-year period of observation from 1 January 1995 to 31 December 1997 is covered by this analysis. This initial collection of data was available at the commencement of this project. The data were analysed in three groupings: monthly groups, blocks of 50 consecutive cases, and blocks of 100 consecutive cases. The term "mortality rate" is used for the proportions of deaths among blocks of consecutive cases, even though the observation periods vary. An overview of the data is presented in Chapter 2, and Appendix 1 tabulates the monthly observations (Table A1.2) and blocks of 50 cases (Table A1.3) up to the end of 1999.
A2.1 Data and Methods

The analysis was conducted on 3 years of admissions to the PAH ICU. Only the first admission to ICU during any hospitalisation was considered, to avoid double counting. The outcome of interest was survival to hospital discharge.
There were 3159 eligible admissions during this period. Data were grouped into months of varying sample size (36 months), samples of 50 consecutive cases (63 blocks) and samples of 100 consecutive cases (31 blocks). The overall mean mortality rate, the mean mortality rate of each grouping, and the range and standard deviation of the observed mortality rates were calculated. A normal probability plot and the Shapiro-Wilk W statistic were used to investigate whether the mortality rates were approximately normally distributed.
A comparison between the observed standard deviation of the mortality rates, and the standard deviation estimated from the binomial distribution, was made for the patient samples of 50 and 100 cases. The standard deviation is:

σ_i = √( p̄(1 − p̄) / n_i )

where p̄ is the mean mortality rate, and n_i is the number of patients in sample i, either 50 or 100.
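For the overall mean mortality rate of 0.16 used in these appendices, the formula gives approximately 0.052 for blocks of 50 and 0.037 for blocks of 100, the latter being the value carried into the CUSUM simulations of Appendix 4. A one-line sketch:

```python
import math

def block_sd(p_bar, n):
    """Standard deviation of the mortality rate of a block of n patients
    under the binomial model with common death probability p_bar."""
    return math.sqrt(p_bar * (1.0 - p_bar) / n)

sd_50 = block_sd(0.16, 50)    # blocks of 50 consecutive cases
sd_100 = block_sd(0.16, 100)  # blocks of 100 consecutive cases
```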
Runs Tests

A search for non-random patterns in the data was conducted using methods recommended by Kennett and Zacks, and the derivation of these tests and the notation comes from this reference.
Runs above or below the mean

The first of the runs tests is a test of the null hypothesis that the monthly mortality rates fall randomly on either side of the overall mean mortality rate. Each sample mortality rate observation is allocated to a group according to whether it is above or below the overall mean: "A" identifies values equal to or above the mean, and "B" identifies values below the mean. A run is defined as a consecutive series of one or more mortality rates that have the same classification. The statistic, R, is the observed number of runs.

If R is too small, there is a high chance of "clustering" due to a non-random distribution of mortality rates. Too many runs imply mixing, or a homogeneous distribution of the mortalities around the mean. A test of the null hypothesis of random distribution on either side of the mean has rejection regions R ≤ R_α or R ≥ R_(1−α). With large samples, such as used in this study (36 months), a normal approximation is used, and

R_α ≈ μ_R − z_α σ_R    and    R_(1−α) ≈ μ_R + z_α σ_R

where

μ_R = 1 + (2 m_A m_B) / n

σ_R² = [2 m_A m_B (2 m_A m_B − n)] / [n² (n − 1)]

α is the level of significance for rejection of the null hypothesis (in this application α = 0.025). z_α is the value of the standard normal distribution corresponding to α. m_A is the number of observations above (count of "A") and m_B is the number of observations below (count of "B") the mean mortality rate, and n = m_A + m_B.
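These formulas can be checked directly. The sketch below counts runs about the mean and evaluates μ_R and σ_R; with the counts reported later in this appendix (m_A = 19, m_B = 17, n = 36) it reproduces μ_R ≈ 18.9 and σ_R ≈ 2.9.

```python
import math

def runs_moments(m_a, m_b):
    """Mean and standard deviation of the number of runs under the
    null hypothesis of random arrangement."""
    n = m_a + m_b
    mu = 1 + 2 * m_a * m_b / n
    var = 2 * m_a * m_b * (2 * m_a * m_b - n) / (n ** 2 * (n - 1))
    return mu, math.sqrt(var)

def runs_above_below(rates):
    """Classify each rate as above/below the overall mean, count the
    runs R, and return R with its null mean and sd."""
    mean = sum(rates) / len(rates)
    sides = ['A' if r >= mean else 'B' for r in rates]
    R = 1 + sum(1 for a, b in zip(sides, sides[1:]) if a != b)
    mu, sd = runs_moments(sides.count('A'), sides.count('B'))
    return R, mu, sd

mu_R, sigma_R = runs_moments(19, 17)  # the monthly counts of this appendix
```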
Runs up or down

It is possible that cyclical effects could be present without being detected by the previous assessment of randomness. For this analysis, each mortality rate is compared to the previous mortality rate. Where it is the same or larger, there is a run up, identified by "U". Where the mortality rate is less than the previous rate, there is a run down, identified by "D". A run is defined as a consecutive series of one or more mortality rates that have the same classification, and the statistic R* is the number of runs counted.
The null hypothesis of random distribution of runs up or down is rejected if R* ≤ R*_α or R* ≥ R*_(1−α).

[Figure: normal probability plot of the observed monthly mortality rates; x-axis: Observed Monthly Mortality Rate, 0.05 - 0.3.]
Mortality rates above or below the monthly mean mortality rate gave the following sequence: BBBB AAA B AA BB AA B AAA B AA B AA BBBBB A BB A B A B, with 19 runs. R = 19, m_A = 19, m_B = 17, μ_R = 18.9 and σ_R = 2.9. R_α = 13.2 and R_(1−α) = 24.7; therefore R_α < R < R_(1−α), and the null hypothesis of randomness is not rejected.

… 50 represents appropriate performance for the p chart of in-hospital ICU mortality.
Appendix 4

Statistical Analysis of the CUSUM Chart

The purpose of this appendix is to summarise the background to statistical analysis using the CUSUM. The assumptions underlying the analysis, the notation and calculations, and an analysis of chart performance under a range of parameter values and changed mortality rates are included. This appendix is not an exhaustive review, and is provided as a reference for the material relevant to the text of the thesis. The results of the analyses in this appendix provide the basis for the design of the CUSUM charts in Chapter 4.
The books by Ryan, Montgomery, Hawkins and Olwell, and Kennett and Zacks provide excellent descriptions of aspects of the topic.
A4.1 Assumptions

The assumptions on which the methods are based are that the observations are independent, and that samples are randomly drawn from a population of known distribution. The analysis in Appendix 2, of the PAH ICU data during the in-control period (1 January 1995 to 31 December 1997), suggests that these assumptions are plausible. For the following example, the sample size of 100 cases will be used. The distribution of the mortality rate of the blocks of 100 patients is approximately normal.
A4.2 Statistical Tests

To test for shifts in the process mean of an anticipated magnitude, three equivalent statistical approaches can be used: the V mask, the decision interval, and Page's two sided CUSUM. The following discussion adopts the notation and conventions used by Hawkins and Olwell.
A4.3 Definitions and Notation

p_i: the mortality rate observed for the sample indexed by i.

μ_0: the mean of the process in-control, estimated by the mean mortality rate p̄_0. After a change in the process, the new mean will be μ_1. Where a shift to a higher mean mortality rate is of interest, then μ_1^+ > μ_0. Where a shift to a lower value is of interest, then μ_1^− < μ_0.

n: the number of patients in a sample or block.

σ: the standard deviation of the process in-control. It is estimated by

σ = √( p̄_0(1 − p̄_0) / n )
From Appendix 2, this estimate gave the same value as the standard deviation of the mortality rates 1995 - 1997.
K^+: this parameter is dependent on μ_0 and the choice of a (clinically) important increased process mean:

K^+ = (μ_0 + μ_1^+) / 2

Similarly, K^− depends on the clinically important new lower mean mortality rate:

K^− = (μ_0 + μ_1^−) / 2

K^+ and K^− have the units of the outcome measurement. In the examples of Chapter 4 the units are deaths per 100 cases. The CUSUM performance in detecting persistent shifts is optimal under these conditions, but the CUSUM is robust and will signal when shifts of greater or lesser magnitude are present. A signal does not imply that the shift is to μ_1 exactly.
h: this parameter is the control limit to which the CUSUM statistic is compared. It is chosen according to the desired performance characteristics of the chart, i.e. average run lengths (ARL) for in-control and changed mortality conditions under the parameter choices. h^+ is the upper decision interval and h^− is the lower decision interval.
The upper and the lower CUSUMs are separate statistical analyses, testing for increases and decreases respectively in the process mean. These statistics are often run concurrently. If only a change in the process mean in one direction is sought, then either an upper or lower CUSUM can be run in isolation. Where a single CUSUM is run, it has a lower rate of false alarms than running both upper and lower CUSUMs together. This analysis will consider the performance of both upper and lower CUSUMs together, and a single CUSUM alone.
The upper CUSUM statistic is calculated recursively:

C_0^+ = 0
C_i^+ = max(0, C_(i−1)^+ + p_i − K^+)

A signal occurs when C_i^+ > h^+.

Concurrently, a lower CUSUM can be run, and a decrease in the mean is signalled when C_i^− < h^−, given

C_0^− = 0
C_i^− = min(0, C_(i−1)^− + p_i − K^−)
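The two recursions above translate directly into code. The sketch below uses the parameter values developed later in this appendix (K^+ = 0.185, K^− = 0.135, h^(+/−) = 0.073); the function name and return convention are illustrative assumptions.

```python
def two_sided_cusum(rates, k_plus=0.185, k_minus=0.135,
                    h_plus=0.073, h_minus=-0.073):
    """Run the upper and lower CUSUM recursions over a sequence of
    block mortality rates. Returns the index of the first signal,
    or None if neither CUSUM crosses its decision interval."""
    c_up, c_dn = 0.0, 0.0
    for i, p in enumerate(rates):
        c_up = max(0.0, c_up + p - k_plus)   # accumulates upward shifts
        c_dn = min(0.0, c_dn + p - k_minus)  # accumulates downward shifts
        if c_up > h_plus or c_dn < h_minus:
            return i
    return None
```

For example, a run of blocks exactly at the in-control rate of 0.16 never signals, while a sustained rate of 0.22 accumulates 0.035 per block in the upper CUSUM and crosses h^+ = 0.073 on the third block.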
A4.4 Analysis of Performance of CUSUM Charts: Choice of Design Parameters

The performance of the CUSUM control charts can be described in terms of the ARL to signal under in-control conditions, and under conditions of a changed mean mortality rate. The in-control ARL is a measure of the occurrence of false alarms. The ARL when the process mean has changed is an indication of the efficiency with which the chart detects the changed mortality rate. The ARL to detect a changed mortality rate can be estimated using a starting value of CUSUM = 0, or from a CUSUM with a steady state value running under in-control conditions. The ARL from steady state will be shorter than the ARL from an in-control start. However, both methods are thought to give the same ranking of the efficiency of charting designs.
For this analysis, I will calculate the ARL for CUSUMs that have the changed or out-of-control mortality rates from the first observation.
To characterise the performance of the CUSUM charts used to analyse the ICU data, a series of simulations were run. All simulations were programmed in MATLAB 6.5, and the results graphed in Microsoft Excel. Alternative approaches using integral equations, Markov chain discrete approximations to the integral equations and other methods to reduce the computing intensity of simulations are described by Hawkins and Olwell.
A4.5 In-control ARL

A simple programme was written to simulate the process in-control. The ARL of a CUSUM chart of the mortality rates of blocks of 100 admissions was modelled. The in-control mortality rate was 0.16 and the shifts of mean mortality rate were to 0.11 and 0.21, giving $K^+ = 0.185$ and $K^- = 0.135$. The observed mortality rate of 0.16 and the standard deviation of 0.037 formed the basis for simulations. Simulated block mortality rates were randomly drawn from a normal distribution with these parameters, and any negative values were given a mortality rate of zero. 10 000 simulated runs were used to estimate the ARL at each value in the ranges studied. In each simulation an upper and a lower CUSUM were modelled.
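The simulation procedure just described can be sketched as follows. This is an illustrative Python re-implementation under the stated assumptions (normal draws truncated at zero, combined upper and lower CUSUM, lower decision interval taken as $-h$); the thesis simulations themselves were in MATLAB, and all names here are mine:

```python
import random

def simulate_arl(pi0=0.16, sd=0.037, k_plus=0.185, k_minus=0.135,
                 h=0.073, n_runs=1000, max_blocks=100000, seed=1):
    """Monte Carlo estimate of the ARL (in blocks of 100 admissions) of the
    combined upper-and-lower CUSUM scheme described in the text.

    Block mortality rates are drawn from a normal distribution with mean pi0
    and standard deviation sd; negative draws are set to zero.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        c_plus, c_minus = 0.0, 0.0
        for t in range(1, max_blocks + 1):
            p = max(0.0, rng.gauss(pi0, sd))
            c_plus = max(0.0, c_plus + p - k_plus)
            c_minus = min(0.0, c_minus + p - k_minus)
            if c_plus > h or c_minus < -h:  # signal from either chart
                total += t
                break
        else:
            total += max_blocks  # censored run (should not occur here)
    return total / n_runs
```

Running the same routine with a shifted mean (for example `pi0=0.21`) shows the much shorter run length to signal under out-of-control conditions.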
Figure A4.1 shows the relationship between ARL and the range of the decision interval parameter, $h$. An $h^{+/-}$ of 0.073 gives an in-control ARL for a single CUSUM test of 37 (or 3700 cases, approximately 3 years), and 20 (or 2000 cases, or 1.7 years) for the upper and lower CUSUM together.
Figure A4.1: In-control ARL for a range of $h^{+/-}$: single CUSUM, and combined upper and lower CUSUM monitoring scheme. In-control mortality rate = 0.16; optimal detection for change to 0.11 and 0.21.
The results of a further simulation to examine ARL over the range of $h^{+/-}$ of 0.07 to 0.08, for CUSUMs with both upper and lower, and single, monitoring schemes, are shown in Figures A4.2 and A4.3. The simulation was conducted as before, except that alternative mortality rates of interest ($\mu_1^+$ and $\mu_1^-$) were used. The choice of $\mu_1^+$ and $\mu_1^-$ has a great effect on the ARL while the process remains in control. For example, if the shifts in mean mortality rates that are to be detected are a lower rate of 0.08 and an upper rate of 0.24, the ARL in-control ($h^{+/-} = 0.073$) is 77 (7700 patients or about 7 years) for a two sided CUSUM, and 140 (14 000 patients or about 13 years) for a one sided CUSUM. This is a large increase in the ARL in-control compared to when the mortality rates of 0.11 and 0.21 are to be detected ($h^{+/-} = 0.073$), where the ARL for a single CUSUM is 37 and for an upper and lower CUSUM is 20.
Figure A4.2: In-control ARL for a range of $h^{+/-}$ and alternative mortality rates of interest. In-control mortality rate = 0.16; upper and lower CUSUM.

Figure A4.3: In-control ARL for a range of $h^{+/-}$ and alternative mortality rates of interest. In-control mortality rate = 0.16; single (upper or lower) CUSUM.
, or the outcome of a single patient, $Y_i$. In the examples used, both sample blocks of 100 consecutive patients and single patient outcomes are presented. $\lambda$ is the weight, between 0 and 1. $EWMA_i$ is the value of the statistic indexed by $i$.
The control limits are calculated by:
$CL_i = \pi_0 \pm a\sqrt{\frac{\pi_0(1-\pi_0)}{n_i}}\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]}$

where $n_i$ is the number of cases in the sample; $n_i = 1$ when the outcomes of single patients, $Y_i$, are being analysed. A series of simulation experiments were performed to characterise the run length distribution and the ARL under different conditions. $a$ is the parameter defining the width of the control limits as multiples of the standard deviation, where $\sigma^2 = \pi_0(1-\pi_0)$.
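The EWMA statistic and its widening control limits can be computed directly. This is an illustrative Python sketch assuming the standard EWMA recursion $EWMA_i = \lambda P_i + (1-\lambda)EWMA_{i-1}$ and the limit formula reconstructed above; the parameter values and names are mine, not the thesis design:

```python
import math

def ewma_with_limits(rates, lam=0.1, pi0=0.16, a=3.0, n=100):
    """EWMA of observed mortality rates with control limits
    pi0 +/- a * sqrt(pi0*(1-pi0)/n) * sqrt(lam/(2-lam) * (1-(1-lam)**(2i))).

    lam is the weight (0 < lam <= 1); n is the number of cases per sample
    (n = 1 for single-patient outcomes).  Returns (ewma, lower, upper) per sample.
    """
    ewma = pi0  # start the statistic at the in-control rate
    out = []
    for i, p in enumerate(rates, start=1):
        ewma = lam * p + (1 - lam) * ewma
        half_width = (a * math.sqrt(pi0 * (1 - pi0) / n)
                        * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))))
        out.append((ewma, pi0 - half_width, pi0 + half_width))
    return out
```

With a sequence of in-control rates the statistic stays at $\pi_0$ while the limits widen towards their asymptotic values.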
The observed mortality rate for sample $i$ is $R_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}$ and the predicted mortality rate, $E(R_i)$, is

$E(R_i) = \frac{\sum_{j=1}^{n_i} \pi_{ij}}{n_i}$
Three methods to characterise the distribution of $R_i$ will be considered.
A6.2 Approximation using each Estimate of Patient Risk of Death
$R_i$ is the average of the random variables $Y_{ij}$, so using the central limit theorem, the distribution of $R_i$ can be approximated by a normal distribution if the number of cases in the sample, $n_i$, is large.
The variance of $R_i$ is

$\mathrm{var}(R_i) = \frac{\sum_{j=1}^{n_i} \mathrm{var}(Y_{ij})}{n_i^2} = \frac{\sum_{j=1}^{n_i} \pi_{ij}(1-\pi_{ij})}{n_i^2}$
Where $n_i$ is small, Alemi and co-workers use the $t$ distribution rather than the standard normal distribution to approximate the distribution of these Z-scores.
For the application to the ICU dataset, the sample sizes are large, being more than 87 cases. The normal approximation is simple to work with and for the purposes of RA charting, its accuracy and precision will be adequate and will probably exceed the accuracy of the RA tool.
However, this model of the distribution of sample mortality rate is a continuous, unbounded approximation, whereas the distribution of $R_i$ given the series of $\pi_{ij}$ values is a discrete distribution of mortality rate values bounded by 0 and 1.
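The normal approximation above amounts to two sums over the patient risk estimates. A minimal Python sketch (the function names are mine; the formulas are the mean and variance of $R_i$ given above):

```python
import math

def normal_approx(pis):
    """Mean and variance of the sample mortality rate R_i under the
    central-limit approximation, from the per-patient risks pi_ij."""
    n = len(pis)
    mean = sum(pis) / n                              # E(R_i)
    var = sum(p * (1 - p) for p in pis) / n ** 2     # var(R_i)
    return mean, var

def z_score(r, pis):
    """Z-score of an observed sample mortality rate r under this approximation."""
    mean, var = normal_approx(pis)
    return (r - mean) / math.sqrt(var)
```

For example, 100 patients each with risk 0.5 give $E(R_i)=0.5$ and $\mathrm{var}(R_i)=0.0025$, so an observed rate of 0.6 corresponds to a Z-score of 2.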
A6.3 Approximation using Mean Patient Risk of Death

For non-RA analysis, it is assumed that the patients were independently and randomly selected. The individual patient's risk of death is not known, so it is assumed to be the same for all patients, and is denoted by $\bar{\pi}_i$. $E(R_i)$ was the average predicted mortality rate of the sample.
The control limits were calculated using the estimate of the variance of $R_i$,

$\mathrm{var}(R_i) = \frac{\bar{\pi}_i(1-\bar{\pi}_i)}{n_i}$
However, this expression for the variance over-estimates $\mathrm{var}(R_i)$ if the patients do not all have the same risk of dying. In the realistic case where the patients do not all have the same risk of dying, $\bar{\pi}_i$ is still their average risk, $\bar{\pi}_i = \frac{\sum_{j=1}^{n_i}\pi_{ij}}{n_i}$.
For the non-RA $p$ chart,

$\mathrm{var}(R_i) = \frac{\bar{\pi}_i(1-\bar{\pi}_i)}{n_i} = \frac{1}{n_i}\left(\frac{\sum_{j=1}^{n_i}\pi_{ij}}{n_i}\right)\left(1-\frac{\sum_{j=1}^{n_i}\pi_{ij}}{n_i}\right) = \frac{\sum_{j=1}^{n_i}\pi_{ij}}{n_i^2} - \frac{\left(\sum_{j=1}^{n_i}\pi_{ij}\right)^2}{n_i^3}$
By both adding and subtracting the term $\frac{\sum_{j=1}^{n_i}\pi_{ij}^2}{n_i^2}$,

$\mathrm{var}(R_i) = \frac{\sum_{j=1}^{n_i}\pi_{ij}}{n_i^2} - \frac{\sum_{j=1}^{n_i}\pi_{ij}^2}{n_i^2} + \frac{\sum_{j=1}^{n_i}\pi_{ij}^2}{n_i^2} - \frac{\left(\sum_{j=1}^{n_i}\pi_{ij}\right)^2}{n_i^3} = \frac{\sum_{j=1}^{n_i}\pi_{ij}(1-\pi_{ij})}{n_i^2} + \left[\frac{\sum_{j=1}^{n_i}\pi_{ij}^2}{n_i^2} - \frac{\left(\sum_{j=1}^{n_i}\pi_{ij}\right)^2}{n_i^3}\right]$

$\frac{\sum_{j=1}^{n_i}\pi_{ij}(1-\pi_{ij})}{n_i^2}$ is the estimate of the variance of the mortality rate where the patients' probabilities of death are not all equal. Assuming that this expression is accurate, the non-RA estimate of the variance will overestimate $\mathrm{var}(R_i)$ by the value of the term:

$\frac{\sum_{j=1}^{n_i}\pi_{ij}^2}{n_i^2} - \frac{\left(\sum_{j=1}^{n_i}\pi_{ij}\right)^2}{n_i^3}$
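The decomposition above is easy to verify numerically. A short Python sketch (names mine) computes the non-RA estimate, the estimate using each patient's own risk, and the overestimation term, and shows that the first equals the sum of the other two:

```python
def variance_estimates(pis):
    """Compare the non-RA (mean-risk) variance estimate of R_i with the
    estimate using each patient's own risk, and return the overestimation term."""
    n = len(pis)
    s = sum(pis)
    pbar = s / n
    var_nonra = pbar * (1 - pbar) / n                      # mean-risk estimate
    var_ra = sum(p * (1 - p) for p in pis) / n ** 2        # per-patient estimate
    overestimate = sum(p * p for p in pis) / n ** 2 - s ** 2 / n ** 3
    return var_nonra, var_ra, overestimate
```

The overestimation term is zero when all risks are equal and strictly positive otherwise, so the non-RA limits are always at least as wide.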
Therefore, where the average patient's risk of death is low and the sample sizes are large, this estimate is potentially a reasonable approximation. The advantage of this approximation is its ease of calculation.
Its disadvantage is that it overestimates the variance of $R_i$. Figure A6.1, at the end of this appendix, shows the error in this estimate compared to more accurate methods, using 100 patients randomly drawn from the PAH ICU dataset. This approach, though simple, will result in wide control limits that will provide a conservative analysis.
A6.4 Exact Method: Iterative approach to define the distribution of mortality rates.
The probability distribution of $R_i$ can be calculated iteratively, if the values of all $\pi_{ij}$ are known, or are each approximated by $\hat{\pi}_{ij}$. Consider that each patient outcome, $Y_{ij}$ (death = 1, survival = 0), is an independent Bernoulli random variable with probability $\pi_{ij}$. With $n_i$ patients in a sample, $0 \le \sum_{j=1}^{n_i} Y_{ij} \le n_i$. The probability distribution of $R_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}$ can be described in terms of the $\pi_{ij}$.
When the first patient is included, there are two possible values:

$\Pr(R_{i,1} = 1) = \pi_1$ and $\Pr(R_{i,1} = 0) = 1 - \pi_1$

After the second patient there are three possible observed mortality rates:

$\Pr(R_{i,2} = 1) = \pi_1\pi_2$

$\Pr(R_{i,2} = 0.5) = \pi_1(1-\pi_2) + (1-\pi_1)\pi_2$

$\Pr(R_{i,2} = 0) = (1-\pi_1)(1-\pi_2)$
The process can be continued with each patient risk of death estimate.
This iterative approach is a simple method to compute the probability distribution or the cumulative probability function of $R_i$ for a particular set of $\pi_{ij}$.
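The iterative scheme above is a convolution over patients (the distribution of the number of deaths is what is now often called a Poisson-binomial distribution). An illustrative Python sketch, with names of my choosing:

```python
def exact_death_distribution(pis):
    """Iteratively build Pr(exactly d deaths) for patients with risks pi_ij,
    as in the method above.  Entry d of the returned list is Pr(d deaths);
    the mortality rate R_i takes the value d / n_i with that probability."""
    dist = [1.0]  # before any patient: zero deaths with probability 1
    for p in pis:
        new = [0.0] * (len(dist) + 1)
        for d, prob in enumerate(dist):
            new[d] += prob * (1 - p)      # this patient survives
            new[d + 1] += prob * p        # this patient dies
        dist = new
    return dist
```

For two patients with risks $\pi_1 = 0.1$ and $\pi_2 = 0.2$, this reproduces the three probabilities given above: 0.72, 0.26 and 0.02 for 0, 1 and 2 deaths.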
The likelihood function, or the joint probability of all the terms $Y_{ij}$, is $\prod_{j=1}^{n_i} \pi_{ij}^{Y_{ij}}(1-\pi_{ij})^{1-Y_{ij}}$, and this provides a simple method of computing the probability of any $R_i$, given the set of values of $\pi_{ij}$. The probability of observing a number of deaths, $d$, out of the sample of $n_i$ cases is the sum of each of the likelihood terms that correspond to the $\binom{n_i}{d} = \frac{n_i!}{d!(n_i-d)!}$ ways that $d$ deaths can occur, i.e. over all outcome vectors $(Y_{i1}, \ldots, Y_{in_i})$ such that $\sum_{j=1}^{n_i} Y_{ij} = d$.
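This likelihood summation can be checked by brute force for small samples. The sketch below (Python, names mine) enumerates every outcome vector with exactly $d$ deaths and sums the corresponding likelihood terms; it is exponential in $n_i$ and is intended only as an illustration of the summation, not as a practical method:

```python
from itertools import product

def prob_d_deaths(pis, d):
    """Sum the joint probability of every outcome vector (Y_1..Y_n) with
    exactly d deaths, for patients with risks pi_ij."""
    total = 0.0
    for ys in product((0, 1), repeat=len(pis)):
        if sum(ys) != d:
            continue
        prob = 1.0
        for p, y in zip(pis, ys):
            prob *= p if y else (1 - p)   # pi^Y * (1-pi)^(1-Y)
        total += prob
    return total
```

For three patients with risks 0.1, 0.2 and 0.3, the probability of exactly one death is $0.1 \cdot 0.8 \cdot 0.7 + 0.9 \cdot 0.2 \cdot 0.7 + 0.9 \cdot 0.8 \cdot 0.3 = 0.398$, and the probabilities over all $d$ sum to 1.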
Figure A6.1 compares the estimates of the cumulative probability function of $R_i$ using the three methods described. The calculations have been performed for a single randomly chosen subset of 100 patients from the dataset.
Figure A6.1: Models of the Cumulative Probability Distribution of $R_i$. An example of 100 cases where each risk of death is estimated by APACHE III. (x-axis: observed mortality rate, 0.05 to 0.35.)
Figure A6.1 shows that the model using the sample mean mortality rate for all patients provides an overestimate of the variance of $R_i$. The staircase appearance of the iterative or likelihood function method represents the true distribution of $R_i$. The continuous distribution of the central limit theorem approximation overall provides a good estimate for RA $p$ charts.
However, in Chapter 5, with the RA CUSUM and RA EWMA charts, the continuous approximation is inaccurate, and further exact methods for the RA EWMA are developed.
Bibliography
1. Knaus WA, Draper EA, Wagner DP and Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med 1985; 13:818-829
2. Knaus WA, Wagner DP, Draper EA, Zimmerman JE, Bergner M, Bastos PG, Sirio CA, Murphy DJ, Lotring T, Damiano A and Harrell FE. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991; 100:1619-1636
3. APACHE®. APACHE III Management System Software. Washington: APACHE Medical Systems, 1990-1999
4. Hastie T, Tibshirani R and Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 1st ed. New York: Springer, 2001
5. Justice A, Covinski K and Berlin J. Assessing the generalizability of prognostic information. Arch Intern Med 1999; 130:515-524
6. Lemeshow S, Teres D, Klar J, Avrunin J, Gehlbach S and Rapoport J. Mortality probability models (MPM II) based on an international cohort of ICU patients. JAMA 1993; 270:2478-2486
7. Le Gall J, Lemeshow S and Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multi-centre study. JAMA 1993; 270:2957-2963
8. Cristianini N and Shawe-Taylor J. The learning methodology: The SVM. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press, 2000
9. Benneyan J and Borgman A. Risk adjusted sequential probability ratio tests and longitudinal surveillance. Int J of Quality in Health Care 2003; 15:5-6
10. de Leval MR, Francois K, Bull C, Brawn W and Spiegelhalter D. Analysis of a cluster of surgical failures. Application to a series of neonatal arterial switch operations. J Thorac Cardiovasc Surg 1994; 107:914-924
11. Lovegrove J, Valencia O, Treasure T, Sherlaw-Johnson C and Gallivan S. Monitoring the results of cardiac surgery by variable life-adjusted display. Lancet 1997; 350:1128-1130
12. Iezzoni L. Dimensions of risk. In: Iezzoni L, ed. Risk Adjustment for Measuring Health Care Outcomes. Chicago, Ill.: Health Administration Press, 1997
13. Spiegelhalter D, Grigg O, Kinsman R and Treasure T. Risk adjusted sequential probability ratio tests: applications to Bristol, Shipman and adult cardiac surgery. Int J for Quality in Health Care 2003; 15:7-13
14. Rosenberg AL. Recent innovations in ICU risk-prediction models. Curr Opin Crit Care 2002; 8:321-330
15. Lim T. Statistical process control tools for monitoring clinical performance. Int J for Quality in Health Care 2003; 15:3-4
16. Render M, Kim M, Welsh D, Timmons S, Johnston J, Hui S, Connors A, Wagner D, Daley J and Hofer T. Automated ICU risk adjustment: Results from a National Veterans Affairs study. Crit Care Med 2003; 31:1638-1646
17. Kennett R and Zacks S. Chapter 1: The role of statistical methods in modern industry. Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press, 1998; 2-13
18. Clermont G and Angus DC. Severity scoring systems in the modern ICU. Ann Acad Med Singapore 1998; 27:397-403
19. Gunning K and Rowan K. ABC of intensive care outcome data and scoring systems. BMJ 1999; 319:241-244
20. Lemeshow S, Klar J, Teres D, Avrunin J, Gehlbach S, Rapaport J and Rue M. Mortality probability models for patients in the ICU for 48 or 72 hours: A prospective multicenter study. Crit Care Med 1994; 22:1351-1358
21. Le Gall J, Lemeshow S and Saulnier F. Correction: A new simplified acute physiology score (SAPS II) based on a European/North American multicentre study. JAMA 1994; 271:1321
22. Zimmerman JE, Knaus WA, Wagner DP, Sun X, Hakim RB and Nystrom PO. A comparison of risks and outcomes for patients with organ system failure: 1982-1990. Crit Care Med 1996; 24:1633-1641
23. Zimmerman JE, Knaus WA, Sun X and Wagner DP. Severity stratification and outcome prediction for multi-system organ failure and dysfunction. World J Surg 1996; 20:401-405
24. Knaus WA, Wagner DP, Zimmerman JA and Draper EA. Variations in mortality and length of stay in ICU. Ann Int Med 1993; 118:753-761
25. Johnston JA, Wagner DP, Timmons S, Welsh D, Tsevat J and Render ML. Impact of different measures of comorbid disease on predicted mortality of ICU patients. Med Care 2002; 40:929-940
26. Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality audit. Chest: accepted for publication, October 2003
27. Young D and Ridley S. Mortality as an outcome measure in intensive care. In: Ridley S, ed. Outcomes in Critical Care. Oxford: Butterworth-Heinemann, 2002; 25-46
28. Ridley S. Severity of illness scoring systems and performance appraisal. Anaesthesia 1998; 53:1185-1194
29. Irwig L, Bossuyt P, Glasziou P, Gatsonis C and Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002; 324:669-671
30. Pappachan JV, Millar B, Bennett D and Smith GB. Comparison of outcome from intensive care admission after adjustment for case mix by the APACHE III prognostic system. Chest 1999; 115:802-810
31. Cook DA. Performance of APACHE III models in an Australian ICU. Chest 2000; 118:1732-1738
32. Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's APACHE II study in Britain and Ireland - I: Variations in casemix of adult admissions to general ICUs and impact on outcome. BMJ 1993; 307:972-977
33. Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's APACHE II study in Britain and Ireland - II: Outcome comparisons of ICUs after adjustments for casemix by the American APACHE II method. BMJ 1993; 307:977-981
34. Rowan KM, Kerr JH, Major E, McPherson K, Short A and Vessey MP. Intensive Care Society's APACHE II study in Britain and Ireland: a prospective, multi-center, cohort study comparing two methods for predicting outcome for adult intensive care patients. Crit Care Med 1994; 22:1392-1401
35. Castella X, Artigas A, Bion J and Kari A. A comparison of severity of illness scoring systems for ICU patients: Results of a multicenter, multinational study. Crit Care Med 1995; 23:1327-1335
36. Beck DH, Taylor BL, Millar B and Smith GB. Prediction of outcome from intensive care: a prospective cohort study comparing APACHE II and III prognostic systems in a United Kingdom intensive care unit. Crit Care Med 1997; 25:9-15
37. Marshall JC, Cook DJ, Christou NV, Bernard GR, Sprung CL and Sibbald WJ. Multiple Organ Dysfunction Score: A reliable descriptor of a complex clinical outcome. Crit Care Med 1995; 23:1638-1652
38. Diamond G. What price perfection? Calibration and discrimination of clinical predictive models. J Clin Epidemiol 1992; 45:85-89
39. Hilden J, Habbema JD and Bjerregaard B. The measurement of performance in probabilistic diagnosis: The trustworthiness of the exact values of the diagnostic probabilities. Meth Inform Med 1978; 17:227-237
40. Yates FJ. Decompositions of the mean probability score. Organisational Behaviour and Human Performance 1982; 30:132-156
41. Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991; 11:95-101
42. Zweig MH and Campbell G. ROC plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39:561-577
43. Poses RM, Cebul RD and Center RM. Evaluating physicians' probabilistic judgements. Med Decis Making 1988; 8:233-240
44. Jacobs S, Chang R, Lee B and Lee B. Audit of intensive care: a 30 month experience using the APACHE II severity of disease classification system. Int Care Med 1988; 14:567-574
45. Giangiuliani G, Mancini A and Gui D. Validation of a severity of illness score (APACHE II) in a surgical ICU. Int Care Med 1989; 15:519-522
46. Oh TE, Hutchinson R, Short S, Buckley T, Lin E and Leung D. Verification of the APACHE scoring system in a Hong Kong ICU. Crit Care Med 1993; 21:698-695
47. Wong DT, Crofts SL, Gomez M, McGuire GP and Byrick RJ. Evaluation of predictive ability of APACHE II system and hospital outcome in Canadian ICU patients. Crit Care Med 1995; 23:1177-1183
48. Nouira S, Belghith M, Elafrous S, Jaafoura M, Ellouzi M, Boujdaria R, Gahbiche M, Bouchoucha S and Abroug F. Predictive value of severity scoring systems: Comparison of four models in Tunisian adult ICUs. Crit Care Med 1998; 26:852-858
49. Markgraf R, Deutschinoff G, Pientka L and Scholten T. Comparison of APACHE II and III and SAPS II: A prospective cohort study evaluating these methods to predict outcome in a German interdisciplinary ICU. Crit Care Med 2000; 28:26-33
50. Ruttiman UE. Statistical approaches to the development and validation of predictive instruments. Crit Care Clin 1994; 10:19-35
51. Altman D and Bland J. Diagnostic tests 3: The ROC plot. BMJ 1994; 309:188
52. Glance LG, Osler T and Shinozaki T. ICU prognostic scoring systems to predict death: a cost effectiveness analysis. Crit Care Med 1998; 26:1842-1849
53. Henderson AR. Assessing test accuracy and its clinical consequences: a primer for ROC curve analysis. Ann Clin Biochem 1993; 30:521-539
54. Bossuyt P, Reitsma J, Bruns D, Gatsonis C, Glasziou P, Irwig L, Lijmer J, Moher D, Rennie D and de Vet H. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003; 326:41-44
55. Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240:1285-1293
56. Murphy-Filkins R, Teres D, Lemeshow S and Hosmer DW. Effect of changing patient mix on the performance of an ICU severity-of-illness model: How to distinguish a general from a specialty intensive care unit. Crit Care Med 1996; 24:1968-1973
57. Glance LG, Osler TM and Papadakos P. Effect of mortality rate on the performance of the APACHE II: a simulation study. Crit Care Med 2000; 28:3424-3428
58. Steen PM. Approaches to predictive modelling. Ann Thor Surg 1994; 58:1836-1840
59. Hanley JA and McNeil BJ. The meaning and use of a ROC curve. Radiology 1982; 143:29-36
60. Center RM and Schwartz JS. An evaluation of methods of estimating the area under the ROC curve. Med Decis Making 1985; 5:149-156
61. Hanley JA and McNeil BJ. A method of comparing the areas under the ROC curves derived from the same cases. Radiology 1983; 148:839-843
62. Spiegelhalter DJ. Statistical methodology for evaluating gastrointestinal symptoms. Clin Gastro 1985; 14:489-515
63. Yates JF and Curley SP. Conditional distribution analyses of probabilistic forecasts. J Forecast 1985; 4:61-73
64. Flora J. A method for comparing survival of burns patients to a standard survival curve. J Trauma 1978; 18:701-705
65. Lemeshow S and Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidem 1982; 115:92-106
66. Hosmer DW and Lemeshow S. Assessing the fit of the model. Applied Logistic Regression. New York: John Wiley and Sons, 1989; 135-175
67. Vollset S. Confidence intervals for a binomial proportion. Stat Med 1993; 12:809-824
68. Armitage P and Berry P. Inferences from proportions. Statistical Methods in Medical Research. London: Blackwell Scientific Publications, 1994; 118-124
69. Rapoport J, Teres D, Lemeshow S and Gehlbach S. A method for assessing the clinical performance and cost effectiveness of ICUs: a multi-centre inception cohort study. Crit Care Med 1994; 22:1385-1391
70. Sherlaw-Johnson C, Lovegrove J, Treasure T and Gallivan S. Likely variations in perioperative mortality associated with cardiac surgery: when does high mortality reflect bad practice? Heart 2000; 84:79-82
71. Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University College, 2000
72. Katsaragakis S, Papadimitropoulos K, Antonakis P, Stergiopoulos S, Konstadoulakis MM and Androulakis G. Comparison of APACHE II and SAPS II scoring systems in a single Greek ICU. Crit Care Med 2000; 28:426-432
73. Miller M and Hui S. Validation techniques for logistic regression models. Stat Med 1991; 10:1213-1226
74. Murphy AH. A new vector partition of the probability score. J Appl Meteorology 1973; 12:595-600
75. Dolan JG, Bordley DR and Mushlin AI. An evaluation of clinicians' subjective prior probability estimates. Med Dec Making 1986; 6:216-223
76. Moreno R and Morais P. Outcome prediction in intensive care: results of a prospective, multicentre, Portuguese study. Int Care Med 1997; 23:177-186
77. Zimmerman JE, Wagner DP, Draper EA, Wright L, Alzola C and Knaus WA. Evaluation of APACHE III predictions of hospital mortality in an independent database. Crit Care Med 1998; 26:1317-1326
78. Beck DH, Smith GB and Taylor BL. The impact of low-risk ICU admissions on mortality probabilities by SAPS II, APACHE II and APACHE III. Anaesthesia 2002; 57:21-26
79. Ash AS and Schwartz M. Evaluating the performance of risk adjustment methods: dichotomous methods. In: Iezzoni LI, ed. Risk Adjustment for Measuring Health Outcomes. Ann Arbor: Health Admin Press, 1994; 313-346
80. Lee KL, Pryor DB, Harrell FE, Califf RM, Behar VS, Floyd WL, Morris AJ, Waugh RA, Whalen RE and Rosati RA. Predicting outcome in coronary disease. Am J Med 1986; 80:553-560
81. Miller ME, Langefeld CD, Tierney WM, Hui SL and McDonald CJ. Validation of probabilistic predictions. Med Decis Making 1993; 13:49-58
82. Le Gall R, Klar J, Lemeshow S, Saulnier F, Alberti C, Artigas A and Teres D. The Logistic Organ Dysfunction System: A new way to assess organ dysfunction in the ICU. JAMA 1996; 276:802-810
83. Apolone G, Bertolini G, D'Amico R, Iapichino G, Cattaneo A, De Salvo G and Melotti R. The performance of SAPS II in a cohort of patients admitted to Italian ICUs: Results from GiViTi. Int Care Med 1996; 33:1368-1378
84. Sirio CA, Shepardson LB, Rotondi AJ, Cooper GS, Angus DC, Harper DL and Rosenthal GE. Community-wide assessment of intensive care outcomes using a physiologically based prognostic measure. Chest 1999; 115:793-801
85. Glance L, Osler T and Dick A. Identifying quality outliers in a large, multi-institutional database by using customised versions of the SAPS II and MPM II. Crit Care Med 2002; 30:1995-2002
86. Jacobs S, Chang R and Lee B. One year's experience with the APACHE II severity of disease classification system in a general ICU. Anaesthesia 1987; 42:738-744
87. Zimmerman JE, Knaus WA, Judson JA, Havill JH, Trubuhovich RV, Draper EA and Wagner DP. Patient selection for intensive care: A comparison of New Zealand and United States hospitals. Crit Care Med 1988; 16:318-326
88. Marsh M, Krishan I, Naessens J, Strickland R, Gracet D, Campion M, Nobrega F, Southom P, McMichan J and Kelly M. Assessment of prediction of mortality by using the APACHE II scoring system in ICUs. Mayo Clin Proc 1990; 65:1549-1557
89. Berger M, Marazzi A, Freeman J and Chiolero R. Evaluation of the consistency of APACHE II scoring in a surgical ICU. Crit Care Med 1992; 20:1681-1687
90. Sirio C, Tajima K, Tase C, Knaus W, Wagner D, Hirasawa H, Sakanishi N, Katsuya H and Taenaka N. An initial comparison of intensive care in Japan and in the United States. Crit Care Med 1992; 20:1207-1215
91. Le Gall J, Lemeshow S, Leleu G, Klar J, Huillard J, Rue M, Teres D and Artigas A. Customised probability models for early severe sepsis in adult intensive care. JAMA 1995; 273:644-650
92. Moreno R, Miranda DR, Fidler V and Van Schilfgaarde R. Evaluation of two outcome prediction models on an independent database. Crit Care Med 1998; 26:50-61
93. Tan IKS. APACHE II and SAPS II are poorly calibrated in a Hong Kong ICU. Ann Acad Med Singapore 1998; 27:318-322
94. Wong LS and Young JD. A comparison of ICU mortality prediction using the APACHE II scoring system and artificial neural networks. Anaesthesia 1999; 54:1048-1054
95. Buist M, Gould T, Hagley S and Webb R. An analysis of excess mortality not predicted to occur by APACHE III in an Australian Level III ICU. Anaes Int Care 2000; 28:171-177
96. Janssens U, Graf C, Graf J, Radke P, Konigs B, Koch K, Lepper W, vom Dahl J and Hanrath P. Evaluation of the SOFA score: a single center experience of a medical ICU unit in 303 consecutive patients with predominantly cardiovascular disorders. Int Care Med 2000; 26:1037-1045
97. Livingston BM, MacKirdy FN, Howie JC, Jones R and Norrie JD. Assessment of the performance of five intensive care scoring models within a large Scottish database. Crit Care Med 2000; 28:1820-1827
98. Arabi Y, Hahhad S, Goraj R, Al-Shimemeri A and Al-Malik S. Assessment of performance of 4 mortality prediction systems in a Saudi Arabian ICU. Crit Care Med 2002; 6:166-174
99. Glance L, Osler T and Dick A. Rating the quality of intensive care units: Is it a function of the ICU scoring system? Crit Care Med 2002; 30:1976-1982
100. Bastos PG, Sun X, Wagner DP, Knaus WA and Zimmerman JE. Application of the APACHE III prognostic system in Brazilian intensive care units: a prospective multicenter study. Int Care Med 1996; 22:564-570
101. APACHE® C. APACHE Resources: http://www.apache-web.com/public/pub main.html. 2003
102. Cowen JS and Kelly MA. Errors and bias in using predictive scoring systems. Crit Care Clin 1994; 10:53-72
103. Cook D, Joyce C, Barnett R, Birgan S, Playford H, Cockings J and Hurford R. Prospective independent validation of APACHE III models in an Australian tertiary referral ICU. Anaesth Int Care 2002; 30:308-315
104. Levett J and Carey R. Measuring for improvement: Toyota to thoracic surgery. Ann Thorac Surg 1999; 68:353-358
105. Kennett R and Zacks S. Chapter 10: Basic tools and principles of statistical process control. Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press, 1998; 322-359
106. Montgomery D. Chapter 4: Methods and philosophy of statistical process control. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996
107. Ryan T. Statistical Methods for Quality Improvement. New York: John Wiley and Sons, 1989
108. Seto T, Mittleman M, Davis R, Taira D and Kawachi. Seasonal variation in coronary artery disease mortality in Hawaii: an observational study. BMJ 1998; 316:1946
109. Montgomery D. Chapter 6: The control chart for fraction non-conforming. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996
110. Benneyan J. Statistical quality control methods in infection control and hospital epidemiology. Part II: Chart use, statistical properties and research issues. Infection Control and Hospital Epidemiology 1998; 19:265-283
111. Hawkins D and Olwell D. Theoretical foundations of the CUSUM. Cumulative Sum Charts and Charting for Quality Improvement. New York: Springer Verlag, 1998
112. Hawkins D and Olwell D. Introduction. Cumulative Sum Charts and Charting for Quality Improvement. New York: Springer Verlag, 1998; 1-29
113. Kennett R and Zacks S. Chapter 11: Advanced methods of statistical process control. Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press, 1998; 360-407
114. Montgomery D. Chapter 7: Cusum and EWMA control charts. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996; 313-347
115. Poloniecki J, Valencia O and Littlejohns P. Cumulative risk adjusted mortality chart for detecting changes in death rate: observational study of heart surgery. BMJ 1998; 316:1697-1700
116. Lawrance R, Dorsch M, Sapsford R, Macintosh A, Greenwood D, Jackson B, Morrell C, Robinson M and Hall A. Use of cumulative mortality data in patients with acute myocardial infarction for early detection of variation in clinical practice: Observational study. BMJ 2001; 323:324-327
117. Tekkis PP, McCulloch P, Steger AC, Benjamin IS and Poloniecki JD. Mortality control charts for comparing performance of surgical units: validation study using hospital mortality data. BMJ 2003; 326:786-791
118. Alemi F and Sullivan T. Tutorial on risk adjusted X-bar charts: Applications to measurement of diabetes control. Quality Management in Health Care 2001; 9:57-65
119. Alemi F and Oliver D. Tutorial on risk adjusted p charts. Quality Management in Health Care 2001; 10:1-9
120. Khuri S, Daley J, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J, Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk adjustment of the postoperative mortality rate for the comparative assessment of the quality of surgical care: Results of the National Veterans Affairs surgical risk study: Part 1. J Amer Coll Surg 1997; 185:315-327
121. Daley J, Khuri S, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J, Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk adjustment of the postoperative mortality rate for the comparative assessment of the quality of surgical care: Results of the National Veterans Affairs surgical risk study: Part 2. J Amer Coll Surg 1997; 185:328-340
122. Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using reliability measures: A classification free approach. Stat Med; submitted for publication March 2003
123. Clermont G, Angus DC, DiRusso SM, Griffin M and Linde-Zwirble WT. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med 2001; 29:291-296
124. Glance LG, Osler T and Shinozaki T. Effect of varying the casemix on the SMR and W statistic. Chest 2000; 117:1112-1116
125. Iezzoni L. Statistically derived predictive models: caveat emptor (Editorial). J Gen Intern Med 1999; 14:388-389
126. Zhu HP, Lemeshow S, Hosmer DW, Klar J, Avrunin J and Teres D. Factors affecting the performance of the models in the MPM II system and strategies of customization: A simulation study. Crit Care Med 1996; 24:57-63
127. Maxwell S and Delaney H. The logic of experimental design: Threats to the validity of inferences from experiments. Designing Experiments and Analyzing Data: A Model Comparison Perspective. Belmont, CA: Wadsworth Publishing Company, 1990; 25-32
128. Berenholtz S, Doorman T, Ngo K and Pronovost P. Qualitative review of ICU quality indicators. J Crit Care 2002; 17:1-15
129. Sivak ED and Rogers MAM. Assessing quality of care using in-hospital mortality: Does it yield informed choices? (Editorial). Chest 1999; 115:613-614
130. Sheldon T. Promoting health care quality: what role performance indicators? Quality in Health Care 1998; 7:S45-50
131. Rosen AK, Ash AS, McNiff KJ and Moskowitz MA. The importance of severity of illness adjustment in predicting adverse outcomes in the Medicare population. J Clin Epidemiol 1995; 48:631-643
132. Kutsogianis DJ and Noseworthy T. Quality of life after ICU. In: Ridley S, ed. Outcomes in Critical Care. Oxford: Butterworth-Heinemann, 2002; 139-168
133. Poloniecki J. Letter. BMJ 1998:1453
134. Parsonnet V, Dean D and Bernstein A. A method of uniform stratification of risk for evaluating the results of surgery in acquired adult heart disease. Circulation 1989; 79(S1):S3-S12
135. Jones D, Copeland G and de Cossart L. Comparison of POSSUM with APACHE II for prediction of outcome from a surgical high dependency unit. Br J Surg 1992; 79:1293-1296
136. Orr R, Maini B, Sottile F, Dumas E and O'Mara P. A comparison of four severity adjusted models to predict mortality after coronary artery bypass graft surgery. Arch Surg 1995; 130:301-306
137. Weightman W, Gibbs N, Sheminant M, Thackray N and Newman M. Risk prediction in coronary artery surgery: a comparison of four risk scores. Medical Journal of Australia 1997; 166:408-411
138. Steiner S, Cook R and Farewell V. Risk adjusted monitoring of surgical outcomes. Medical Decision Making 2001; 21:163-169
139. Gallivan S, Lovegrove J and Sherlaw-Johnson C. Letter. BMJ 1998; 317:1453
140. Steiner SH, Cook RJ and Farewell VT. Monitoring paired binary surgical outcomes using cumulative sum charts. Statistics in Medicine 1999; 18:69-86
141. Cook D, Steiner S, Cook R, Farewell V and Morton A. Monitoring the evolutionary process of quality: Risk adjusted charting to track outcomes in intensive care. Crit Care Med 2003
142. Steiner S, Cook R, Farewell V and Treasure T. Monitoring surgical performance using risk-adjusted cumulative sum charts. Biostatistics 2000; 1:441-452
143. Hanson WC and Marshall BE. Artificial intelligence applications in the ICU. Crit Care Med 2001; 29:427-435
144. Maren A, Harston C and Pap R, eds. Handbook of Neural Computing. San Diego: Harcourt Brace Jovanovich, 1990
145. Statistica Neural Networks. Tulsa, Oklahoma: Statsoft, 1998
146. Bishop C, ed. Neural Networks and Machine Learning. Berlin: Springer Verlag, 1998
147. Reed RD and Marks RJ. Neural Smithing: Supervised learning in feedforward artificial neural networks. 1st ed. Cambridge, Massachusetts: MIT Press, 1999
148. Anderson J. Chapter 13: Nearest neighbour classifiers. An Introduction to Neural Networks. Cambridge, Mass.: MIT Press, 1995
149. Maren A. NN structure: Form follows function. In: Maren A, Harston C and Pap R, eds. Handbook of Neural Computing. San Diego: Harcourt Brace Jovanovich, 1990
150. Specht DF. A general regression neural network. IEEE Transactions on Neural Networks 1991; 2:568-576
151. Parzen E. Mathematical considerations in the estimation of spectra. Technometrics 1961; 3:167-190
152. Parzen E. On estimation of a probability density function and mode. Annals of Mathematical Statistics 1962; 33:1065-1076
153. Floyd CE, Lo JY, Yun AJ, Sullivan DC and Kornguth PJ. Prediction of breast cancer malignancy using an artificial neural network. Cancer 1994; 74:2944-2948
154. Ortiz J, Ghefter CGM, Silva CES and Sabbatini RME. One year mortality prognosis in heart failure: A neural network approach based on echocardiographic data. JACC 1995; 26:1586-1593
155. Selker HP, Griffin JL, Patil S, Long WL and D'Agostino RB. A comparison of performance of mathematical predictive methods for medical diagnosis: Identifying acute cardiac ischaemia among emergency department patients. J Investigative Med 1995; 43:468-476
156. Doyle HR, Parmanto B, Munro WP, Marino IR, Aldrighetti L, Doria C, McMichael J and Fung JJ. Building clinical classifiers using incomplete observations - A neural network ensemble for hepatoma detection in patients with cirrhosis. Meth Inform Med 1995; 34:253-258
157. Setiono R. Extracting rules from pruned ANN for breast cancer diagnosis. AI in Med 1996; 8:37-51
158. Itchhaporia D, Snow PB, Almassy RJ and Oetgen WJ. ANN: Current status in cardiovascular medicine. JACC 1996; 28:515-521
159. Eisenstein EL and Alemi F. A comparison of 3 techniques for rapid model development: An application in patient risk stratification. Proc Med Informat Ass 1996:443-447
160. Lette J, Colletti BW, Cerino M, McNamara D, Eybalin MC, Levasseur A and Nattel S. Artificial intelligence vs logistic regression statistical modelling to predict cardiac complications after non cardiac surgery. Clin Cardiol 1994; 17:609-614
161. Doyle HR, Dvorchik I, Mitchell S, Marino IR, Ebert FH, McMichael J and Fung JJ. Predicting outcome after liver transplantation. Ann Surg 1994; 219:408-415
162. Hamamoto I, Okada S, Hashimoto T, Wakabayashi H, Maeba T and Maeta H. Prediction of the early prognosis of the hepatectomised patient with hepatocellular carcinoma with a neural network. Comput Biol Med 1995; 25:49-59
163. Dombi GW, Nandi P, Saxe JM, Ledgerwood AM and Lucas CE. Prediction of rib fracture outcome by an artificial neural network. J Trauma, Infection and Critical Care 1995; 39:915-921
164. Dvorchik I, Subotin M, Marsh W, McMichael J and Fung JJ. Performance of multi-layer feedforward neural network to predict liver transplantation outcome. Meth Inform Med 1996; 35:12-18
165. Izenberg SD, Williams MD and Luterman A. Prediction of trauma mortality using a neural network. American Surgeon 1997; 63:275-281
166. Jefferson MF, Pendleton N, Lucas SB and Horan MA. Comparison of a genetic algorithm neural network with logistic regression for predicting outcome after surgery for patients with non-small cell lung carcinoma. Cancer 1997; 79:1338-1342
167. Reibnegger G, Weiss G, Werner-Felmayer G, Judmaier G and Wachter H. Neural networks as a tool for utilizing laboratory information: Comparison with linear discriminant analysis and with classification and regression trees. Proc. Natl. Acad. Sci. USA 1991; 88
168. Forsstrom JJ and Dalton KJ. Artificial neural network for decision support in clinical medicine. Ann Med 1995; 27:509-517
169. Jorgensen JS, Pedersen JB and Pedersen SM. Use of neural network to diagnose acute myocardial infarction: Methodology. Clin Chem 1996; 42:604-612
170. Brier ME and Aronoff GR. Application of artificial neural network to clinical pharmacology. Int J Clin Pharm Therapeutics 1996; 34:510-514
171. Lippmann RP and Shahian DM. Coronary artery bypass risk prediction using neural networks. Ann Thor Surg 1997; 63:1635-1643
172. Orr RK. Use of a probabilistic neural network to estimate risk of mortality after cardiac surgery. Med Dec Making 1997; 17:178-185
173. Buchman TG, Kubos KL, Seidler AJ and Siegforth MJ. A comparison of statistical and connectionist models for the prediction of chronicity in a surgical ICU. Crit Care Med 1994; 22:750-762
174. Mobley BA, Leasure R and Davidson L. Artificial neural network predictions of lengths of stay on a post coronary care unit. Heart and Lung 1995; 24:251-256
175. Doig GS, Inman KJ, Sibbald WJ, Martin CM and Robertson JM. Modelling mortality in the ICU: comparing the performance of a back propagation, associative learning neural network with multivariate logistic regression. Proc. Ann. Symp. Computer Application in Med Care 1994; 17:361-365
176. Dybowski R, Weller P, Chang R and Gant V. Prediction of outcome in critically ill patients using artificial neural networks synthesized by genetic algorithm. Lancet 1996; 347:1146-1150
177. Frize M, Ennett CM, Stevenson M and Trigg HCE. Clinical decision support systems for ICU: Using artificial neural networks. Med Eng Physics 2001; 23:217-225
178. Nimgaonkar A, Sudarshan S and Karnad DR. Prediction of mortality in an Indian ICU: Comparison between APACHE II and artificial neural network. (Hansraj Prize paper). Proceedings of the Annual Scientific Meeting, Indian Society of Critical Care Medicine. 2001:43-46
179. Paetz J. Some remarks on choosing a method for outcome prediction (letter). Crit Care Med 2002; 30:724
180. Vapnik V. Statistical Learning Theory. New York: Wiley, 1998
181. Burges C. A tutorial on SVM for pattern recognition. Data Mining and Knowledge Discovery 1998; 2:121-167
182. Campbell C. An Introduction to Kernel Methods. In: Howlett R and Jain L, eds. Radial Basis Function Networks: Design and Application. Berlin: Springer Verlag, 2000
183. Cristianini N and Shawe-Taylor J, eds. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. 1st ed. Cambridge: Cambridge University Press, 2000
184. Scholkopf B, Burges C and Smola A. Introduction to support vector learning. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: Support Vector Learning. Cambridge: MIT Press, 1999
185. Mattera D and Haykin S. SVM for dynamic reconstruction of a chaotic system. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: SVM Learning. Cambridge, Massachusetts: MIT Press, 2000
186. Morik K, Brockhausen P and Joachims T. Combining statistical learning with a knowledge based approach - A case study in intensive care monitoring. http://www-ai.informatik.uni-dortmund.de/DOKUMENTE/morik_etal_99a.pdf, 1999
187. Morik K, Imhoff M, Brockhausen P, Joachims T and Gather U. Knowledge discovery and knowledge validation in intensive care. AI in Med 2000; 19:225-249
188. Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality audit. Submitted to Chest, January 2003
189. Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using reliability measures: A classification free approach. Stat Med 2003; Submitted for publication March 2003
190. Ripley B. Statistical theories of model fitting. In: Bishop C, ed. Neural Networks and Machine Learning. Berlin: Springer/NATO Scientific Affairs Division, 1998
191. Joachims T. SVMlight. Dortmund: University of Dortmund, Informatik, AI-Unit, "Collaborative Research Center on Complexity Reduction in Multivariate Data", 2002
192. Joachims T. Making large scale SVM learning practical. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: Support Vector Machine Learning. Cambridge: MIT Press, 2000
193. Mathworks. MATLAB. Boston: The MathWorks Inc., 2002
194. Schwaighofer A. MATLAB Wrapper for SVMlight: [email protected], 2002
195. Statistica. Tulsa: Statsoft, 1984-2000
196. Jencks SF, Williams DK and Kay TL. Assessing hospital associated deaths from discharge data: The role of length of stay and co-morbidities. JAMA 1988; 260:2240-2246
197. Glance LG and Szalados J. Benchmarking in critical care (editorial). Chest 2002; 121:326-328
198. Weston J, Gammerman A, Stitson M, Vapnik V, Vovk V and Watkins C. Support vector density estimation. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: Support Vector Machine Learning. Cambridge: MIT Press, 2000
199. Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University College, 2000
200. Neal R. Assessing the relevance determination methods using DELVE. In: Bishop C, ed. Neural Networks and Machine Learning: Springer-Verlag, 1998; 97-129
201. Kittler J, Pudil P and Somol P. Advances in statistical feature selection. In: Singh S, Murshed N and Kropatsch W, eds. ICAPR 2001. Berlin: Springer-Verlag, 2001
202. Guyon I and Elisseeff A. An introduction to variable and feature selection. J Machine Learning Research 2003; 3:1157-1182
203. Montgomery D. Chapter 8: Other statistical process control techniques. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996
204. Montgomery D. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996
205. Hawkins D and Olwell D. Cumulative Sum Charts and Charting for Quality Improvement. New York: Springer, 1998
206. Kenett R and Zacks S. Modern Industrial Statistics: Design and Control of Quality and Reliability. 1st ed. Belmont, CA: Duxbury Press, 1998