Computer Methods and Programs in Biomedicine, 35 (1991) 193-201
© 1991 Elsevier Science Publishers B.V. All rights reserved 0169-2607/91/$03.50

COMMET 01198

Section II. Systems and programs

CANEST: a microcomputer program for estimating cancer in a cohort

Bardur Sigurgeirsson

Department of Dermatology, Karolinska Hospital, Stockholm, Sweden

Certain diseases and symptoms are associated with an excess of cancer. To measure the strength of such an association, it is necessary to predict cancer development in the group being observed. A computer program for computers running under the MS DOS operating system has been developed for this purpose. The program is written in the CLIPPER programming language. The estimates are based on incidence and prevalence data from the Swedish Cancer Registry for the years 1958 to 1986. The program also computes confidence intervals based on the Poisson distribution. The results can be printed out or exported to other programs for further analysis.

Key words: Cancer; CANEST; Dermatology; Cohort estimation; CLIPPER

1. Introduction

The relation of skin signs and internal malignancy has always fascinated dermatologists. Many dermatological diseases, e.g., dermatomyositis and acanthosis nigricans, seem to have a clear relation to malignancy [1]. In other diseases this relation is less clear [2]. Our objectives were to investigate the development of malignancy in patients with dermatological diseases and to compare this with the expected number of patients with malignancy based on prevalence and incidence data from the Swedish Cancer Registry [3]. Calculating the expected number of malignancies for a patient group observed over a period of many years is complicated and time-consuming. The objectives of this part of the project were to develop a microcomputer program running under the MS DOS operating system to do these estimates based on prevalence and incidence data from the Swedish Cancer Registry [3-5]. Programs for this purpose have been developed before [6], but they use cruder methods, assuming constant incidence rates within age and calendar intervals. In the program described below, cancer risk is estimated separately for each individual in the sample by using calendar-specific incidence data for the years 1958 to 1986 [3]. The program is called CANEST (Cancer estimates).

Correspondence: B. Sigurgeirsson, Department of Dermatology, Karolinska Hospital, S-104 01 Stockholm, Sweden.

2. Computational methods and theory

2.1. Computations

In Fig. 1 a model of one patient during a hypothetical study is shown. The patient enters the study the same year as the dermatosis is diagnosed (YD). At that time there is a certain likelihood that the patient already has cancer. This likelihood is equal to the prevalence in the population at that point. We make the approximation that the prevalence is stable and use 1984 prevalence figures, which are the only ones available. The observation period is defined as the time between the diagnosis of the dermatosis and the last year for which incidence figures are available on-line in the Swedish Cancer Registry (CR). At the time of writing, this year is 1986. The probability that a certain individual acquires a malignancy during 1 year of observation (YO) is equal to the incidence of that particular malignancy for the age of the patient and that particular calendar year, times 1 (year), minus the mortality rate. This can be expressed with the following formula:

PROBABILITY = INCIDENCE × (1 - MORTALITY RATE)

This calculation must be repeated for each cancer form and for each year of observation (YO). As the patient gets older he moves between age groups, and new incidence figures are employed for each calendar year. For one patient observed ...

Fig. 1. Model of one patient during the study. Birth: The year the patient was born. YD: The year the dermatosis was first diagnosed. CR: The last year on-line in the Cancer Registry. Today: today. YO: one year of observation. See text for further explanation.
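The per-year calculation described in Section 2.1 can be illustrated with a short Python sketch. It is not the CLIPPER code of CANEST: the function names, the dictionary-based lookup tables, the 5-year age grouping and the choice to count every calendar year from YD through CR inclusive are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical in-memory stand-ins for the indexed data files.
IncidenceTable = Dict[Tuple[int, int, int], float]  # (ICD-7 code, year, age group) -> yearly rate
DeathRateTable = Dict[int, float]                   # age group -> yearly death rate


def age_group(age: int) -> int:
    """Map an age in years to a 5-year group index (assumed grouping, 85+ pooled)."""
    return min(max(age, 0) // 5, 17)


def yearly_probability(incidence: float, mortality_rate: float) -> float:
    """PROBABILITY = INCIDENCE x (1 - MORTALITY RATE), as stated in the text."""
    return incidence * (1.0 - mortality_rate)


def expected_cancers(
    birth_year: int,
    diagnosis_year: int,       # YD
    last_registry_year: int,   # CR
    icd7_codes: Iterable[int],
    incidence: IncidenceTable,
    death_rate: DeathRateTable,
) -> float:
    """Sum the yearly probabilities over all cancer forms and observation years."""
    codes = list(icd7_codes)
    total = 0.0
    # Assumption: the observation period covers YD..CR inclusive.
    for year in range(diagnosis_year, last_registry_year + 1):
        group = age_group(year - birth_year)  # the patient changes age group as he ages
        mortality = death_rate.get(group, 0.0)
        for code in codes:
            rate = incidence.get((code, year, group), 0.0)
            total += yearly_probability(rate, mortality)
    return total
```

Summing the value returned by `expected_cancers` over all patients would give the expected number of incident cancers in the cohort under these assumptions.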

... file is similarly indexed on PICD7 and PAGEGR.

Information about the death rate in Sweden was obtained from national authorities. This information is stored in the file DEATH. Age group is kept in DAGE and the death rate for males and females in MALEDEATH and FEMDEATH, respectively. The file is indexed on DAGE.

Patient data are kept in the file PAT. This file can have external sources such as a hospital diagnosis system, or the data can be fed into the program. SEKEL contains information about the century in which the patient was born. This is necessary for the program to be able to calculate the age of the patient. The 10-digit Swedish personal identification number is kept in the field PNR. The year of first diagnosis of the dermatosis is in DERMYEAR. The age at the time of diagnosis of the dermatosis, and the sex of the patient, are in DERMAGE and SEX, respectively. The program calculates this information from the personal identification number and the year the dermatosis was diagnosed.

When the program has calculated the estimated incidence during the observation period and the supposed prevalence at the time the dermatosis was diagnosed, this information is saved in the files INC and PRE, respectively.

TABLE 1
A brief explanation of the files used in the CANEST program

Database    Function
CREG        Data from Cancer Registry. A subset of FROMCREG
DEATH       Death rate in the Swedish population
DGN         ICD-7 diagnosis texts
FROMCREG    Original file from Cancer Registry
INC         Results of estimates based on incidence
INC5886     Incidence for the years 1958-1986
INCC        INC plus results of a count of cancer cases and Poisson upper and lower limits
PAT         The patient data file
POISSON     Poisson distribution
PRE         Results of estimates based on prevalence
PRECC       PRE plus results of a count of cancer cases and Poisson upper and lower limits
PREV84      Prevalence 1984
TOTC        INCC and PRECC accumulated to get total number of cases

Field       Type
ICD7S       Numeric
ICDS        Numeric
BEN         Numeric
PAD         Numeric
SNOMED      Numeric
PAT         Numeric
PREP        Numeric
DODCA       Numeric
DODORS1     Numeric
DODORS2     Numeric
OLTID       Numeric
DODALD      Numeric
DERMYEAR    Numeric
CAYEAR      Numeric
DERMAGE     Numeric
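How age and sex might be derived from the personal identification number can be sketched in Python. The sketch relies on the standard layout of the 10-digit Swedish number (YYMMDD followed by four digits, of which the second-to-last is odd for men and even for women). Whether CANEST uses exactly this rule, how SEKEL encodes the century, and the function and parameter names are assumptions; the numbers in the usage note are fictitious.

```python
from typing import NamedTuple


class PatientInfo(NamedTuple):
    birth_year: int
    sex: str               # "M" or "F"
    age_at_diagnosis: int  # corresponds to DERMAGE in the text


def parse_pnr(pnr: str, century: int, diagnosis_year: int) -> PatientInfo:
    """Derive birth year, sex and age at diagnosis from a 10-digit number.

    `century` plays the role of SEKEL and is assumed to hold the leading
    two digits of the birth year (for example 19 for the 1900s).
    """
    digits = pnr.replace("-", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit personal identification number")
    birth_year = century * 100 + int(digits[:2])
    # Second-to-last digit: odd for men, even for women.
    sex = "M" if int(digits[8]) % 2 == 1 else "F"
    # Age is taken as a difference of calendar years (an assumed simplification).
    return PatientInfo(birth_year, sex, diagnosis_year - birth_year)
```

For instance, `parse_pnr("450101-1234", 19, 1980)` would describe a man born in 1945 who was 35 in the year of diagnosis.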

These files have similar structures, with the ICD-7 code in PICD7 and IICD7. Incidence and prevalence of that particular malignancy for males are stored in IOMALES and POMALES, and similarly for females in IOFEM and POFEM. These files can be printed or exported to other programs for further analysis such as computation of confidence limits, further statistical analysis or graphical presentation. If the results of a match between the patient file and the Cancer Registry are available (file CREG) the data can be processed further. The actual number of cancers is counted in the CREG file and the results are merged with PRE and INC, resulting in the files PRECC and INCC, respectively. If the user so desires, the Poisson distribution for the observed number of cancers can be calculated, which facilitates the calculation of relative risks with confidence intervals. Finally, PRECC and INCC are merged into TOTC. All these files contain data for individual cancers and the total number of cancers. TOTC and INCC can be exported into EXCEL™ [17] format.

4. Sample of typical program runs

4.1. Interface

Control of the program is based on pull-down menus as specified in the IBM SAA standard [18]. An example of a pull-down menu can be seen in Fig. 4. The program can be controlled with the keyboard or a mouse. Many software programs are based on this standard, e.g., programs running under the Windows™ (Microsoft Corporation), OS/2™ (IBM Corporation) and Macintosh™ (Apple Corporation) operating systems. The main menu is at the top row on the screen and information about items selected is shown at the bottom. The area between the top and bottom rows is the working area of the program and is used for displaying results and editing files.

Fig. 4. The main menu, and calculate submenu.

4.2. A sample run

The program is started from the operating system prompt by typing its name, CANEST. The main menu now appears:

Files   Calculate   Print   Transfer   Setup

Briefly, the program is used in the following way:

1. Feed in patient data.
2. Match file with Cancer Registry.
3. Estimate cancer in patient population.
4. Count actual number of cancers in file from Cancer Registry.
5. Compare the results of the match and the estimate.
6. Print or export data.

Patient data are first fed into the patient file. The following items are registered:

1. The patient's unique personal identification number.
2. The year the dermatological disease was diagnosed.
3. The diagnosis if more than one diagnosis is under consideration.

If the patient data already exist in another program or in another computer system, data can be

imported as an ASCII file. The program now calculates the age and sex of the patient.

When all patient data have been entered, the file is sent to the Cancer Registry for matching. The estimates can now be calculated. This is done by choosing the menu item Calculate from the main menu. Cancer probability at the time the individual enters the study is first calculated based on prevalence and the individual probabilities summed. The progress of these calculations can be watched on the screen and takes about 1/10 of total calculation time. Next, cancer probability is calculated for the observation period. These calculations are based on incidence and have to be done for each year each individual is in the study, and are therefore much more time-consuming. As before, the progress of the calculations can be watched on the screen (Fig. 4). This feedback is important for the user as these calculations are time-consuming and can take up to 7 h, for 6000 patients with an average observation time of 10 years, on a microcomputer with an 80386 microprocessor and an 80387 coprocessor. The results of the calculations are now transferred to the output files. If so desired, the Poisson distribution can now be calculated for each cancer form and total cancer. The confidence limits based on the Poisson distribution are used later in the calculation of confidence intervals for the relative risk ratio.

The objective of the calculation of the estimates is to compare them to the situation in real life, i.e., how many cancers really occurred in the patient population. This can be done by matching the patient file against the Cancer Registry. The match is done on a large central computer at the Swedish Cancer Registry. The results of the match are stored in an ASCII (text) file which is then transferred to a personal computer and imported to the CANEST program (CREG file). The files can now be compared and the ratio between observed and expected cancer calculated. This ratio is called the relative risk. Several macros (macros are similar to small programs which can be used to do repetitive tasks) have been written in EXCEL™ to do the relative risk calculations and compute its confidence interval. A sample of the printout for such calculations is shown in Fig. 5.

Fig. 5. A sample printout. Mal Obs: Observed cancer, males. Mal Lo: Confidence limit, males, lower. Mal Up: Confidence limit, males, upper. Mal Exp: Expected (estimated by CANEST) cancer, males. Mal RR: Relative risk, males. Conf Lo: Confidence interval, males, lower limit. Conf Hi: Confidence interval, males, higher limit.

The results can now be printed to a printer by selecting the Print option in the main menu. Other files used in the system can also be printed in this option. In the Transfer option, data can be imported from other programs or exported to EXCEL™ or to any other programs which accept ASCII or DBF files as input. Data sets can also be transferred to disk for backup. In Setup the user can configure the program's various options and name and select data sets.

5. Hardware and software specifications

5.1. Material used

The program was developed and tested on a PS/2 computer (IBM Corporation) with an 80386 microprocessor, 120 MB hard disk and 6 MB of internal memory. The CLIPPER™ compiler (Nantucket Corporation) was used for all programming. CLIPPER™ is a true compiler originally based on the dBASE III™ programming language [19]. There are clear advantages in using a compiler, compared with an interpreter. The program runs much faster, the code is protected and it is possible to link together CLIPPER™ programs and programs written in other languages such as C or assembler. Programs developed with the CLIPPER™ compiler use a common file format (DBF) shared by many databases and spreadsheets, and this makes it easy to move data to other software programs for further analysis or graphical display etc. A software link was created to the EXCEL™ [17] (Microsoft Corporation) spreadsheet, and macros were written to format the output for high-quality printing and to do further statistical analysis.

5.2. System requirements

The program needs a microcomputer with 80286, 80386 or 80486 processors running under the MS DOS 3.3 operating system or higher, 640 KB or more of internal memory, a hard disk and a printer. The program runs faster if a mathematical coprocessor is installed. Access to the EXCEL™ (Microsoft Corporation) spreadsheet is preferable, as some of the calculations for confidence intervals and final printouts are written in EXCEL™ macros.
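The relative risk and confidence-interval step that the text assigns to the EXCEL™ macros can be sketched in Python. The sketch assumes that the Poisson limits are the exact two-sided 95% limits for the observed count and that the interval for the relative risk is obtained by dividing those limits by the expected number. The function names and the bisection solver are illustrative choices, not the author's macros.

```python
import math
from typing import Tuple


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for a Poisson variable with mean mu."""
    if k < 0:
        return 0.0
    if mu <= 0.0:
        return 1.0
    terms = (math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, sum(terms))


def _solve_mean(k: int, target: float, hi: float) -> float:
    """Find mu with poisson_cdf(k, mu) == target by bisection (the CDF falls as mu grows)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_limits(observed: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided confidence limits for the mean, given an observed count."""
    bracket = observed + 10.0 * math.sqrt(observed + 1.0) + 20.0
    lower = 0.0 if observed == 0 else _solve_mean(observed - 1, 1.0 - alpha / 2.0, bracket)
    upper = _solve_mean(observed, alpha / 2.0, bracket)
    return lower, upper


def relative_risk(observed: int, expected: float) -> Tuple[float, float, float]:
    """Return (RR, lower limit, upper limit) with RR = observed / expected."""
    lower, upper = poisson_limits(observed)
    return observed / expected, lower / expected, upper / expected
```

With 40 observed and 27.13 expected cases, values taken from one row of the sample printout, `relative_risk(40, 27.13)` gives a relative risk of about 1.47 with limits of about 1.05 and 2.01.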

6. Discussion

The objective was to create a microcomputer program to calculate the expected number of cancers in a group of individuals observed over a number of years based on Swedish prevalence and incidence figures. The resulting estimate is useful in cancer research. The estimate can be compared to the cancer development in the group and thus the risk of cancer development obtained. This technique has been applied to assessing the risk of cancer development in patients with various dermatological diseases, but it can of course be used on any patient group or other groups, e.g., occupational groups.

Before the development of the CANEST microcomputer program, a traditional method [9] was used for estimating cancer development in patient groups with dermatological diseases [14]. This method involves stratification of the patient material into 5-year age intervals. The estimated number of cancers is calculated based on the number of years each age group was under observation. The incidence figures used in these calculations are the figures for the year in the middle of the interval, or a cumulative incidence for the years 1971-1984. For example, for observing patients from 1965 to 1985 the 1975 figures would be used for the calculations. This method is clearly only approximate and can possibly lead to false assumptions when the limits are narrow. An example of this difference can be seen in 1155 patients with chronic urticaria, where the cancer incidence was estimated at 41 with the older method but 48 with the computer program. The observed number of cancers was 36. Neither of these estimates indicated a significant risk.

It can be argued that it is not suitable to use national incidence data to estimate cancer development in a regional patient material. At the time of writing, the effects of this are not known, but this method is commonly used [9]. The program has no limits regarding this aspect: a new data file (INC58_86) with regional incidence data is simply used. It is intended to compare the outcome between these two methods when regional incidence data become available in computer-readable form.

Being able to manipulate the data and do the calculations on a microcomputer has significant advantages compared to the use of a mainframe computer. The data are more accessible to the scientist, who does not have to rely on the help of computer specialists to do all calculations. The data can also be moved more easily to other programs for graphical presentation or further statistical analysis. Mainframe computer time is very expensive, so each run of the program costs money. Also, this microcomputer method allows for estimation of cancer cases in a cohort, when personal identification numbers are not complete, if the year of birth and sex are known. The cost of each run of the program on the microcomputer is negligible. It is easy to distribute and update the program, as computers running under MS DOS are widely available.
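The contrast drawn in the Discussion between the traditional stratified method and a year-specific calculation can be made concrete with a small Python sketch. All person-years and rates below are invented for illustration, a single age group stands in for the 5-year strata, and the function names are hypothetical. The sketch only shows why the two approaches can disagree when incidence and cohort size both change over calendar time.

```python
from typing import Dict, Tuple


def expected_traditional(person_years: Dict[str, float],
                         mid_year_rate: Dict[str, float]) -> float:
    """Person-years per age group times the incidence of the middle calendar year."""
    return sum(py * mid_year_rate[group] for group, py in person_years.items())


def expected_year_specific(person_years: Dict[Tuple[str, int], float],
                           rate: Dict[Tuple[str, int], float]) -> float:
    """Person-years per (age group, calendar year) times that year's incidence."""
    return sum(py * rate[key] for key, py in person_years.items())


def demo() -> Tuple[float, float]:
    """Hypothetical cohort observed 1965-1985 in one age group, growing over time."""
    years = range(1965, 1986)
    group = "60-64"
    py_by_year = {(group, y): 50.0 + 5.0 * (y - 1965) for y in years}
    rate_by_year = {(group, y): 0.002 + 0.0001 * (y - 1965) for y in years}
    total_py = {group: sum(py_by_year.values())}
    mid_rate = {group: rate_by_year[(group, 1975)]}  # 1975 is the middle year
    return (expected_traditional(total_py, mid_rate),
            expected_year_specific(py_by_year, rate_by_year))
```

In this invented cohort `demo()` returns roughly 6.3 expected cases for the mid-year approach and roughly 6.7 for the year-specific one, because more person-years fall in the later years, when the assumed incidence is higher.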

7. Availability

Please write to the author for details.

Acknowledgements

The author wishes to thank Professor Gunnar Eklund and Docent Bernt Lindelof for valuable discussions.

References

[1] D.J. McLean and A. Haynes, Cutaneous Aspects of Internal Malignancy, in: Dermatology in General Medicine, eds. T.A. Fitzpatrick, A.Z. Eisen, K. Wolff, I.M. Freedberg and K.F. Austen, pp. 1917-1937 (McGraw-Hill, New York, 1987).
[2] J.P. Callen, Skin Signs of Internal Malignancy: Fact, Fancy and Fiction, in: Skin Signs of Internal Malignancy, eds. A.J. Rook, H.I. Maibach and J.P. Callen, pp. 340-357 (Seminars in Dermatology, 1984).
[3] The Swedish Cancer Registry, Cancer Incidence in Sweden 1958-1986 (The National Board of Health and Welfare, annual publications, Stockholm, 1960-1990).
[4] H.-O. Adami, T. Gunnarsson, P. Sparen and G. Eklund, The prevalence of cancer in Sweden 1984, Acta Oncol. 28 (1989) 463-470.
[5] H.-O. Adami and P. Sparen, Cancer Prevalence in Sweden 1984, in: Cancer Incidence in Sweden 1986, pp. 93-99 (The National Board of Health and Welfare, Stockholm, 1990).
[6] R. Monson, Analysis of relative survival and proportional mortality, Comput. Biomed. Res. 7 (1974) 325-332.
[7] N.E. Breslow, Fundamental measures of disease occurrence and association, in: Statistical Methods in Cancer Research: Analysis of case-control studies, pp. 42-81 (IARC, Lyon, 1980).
[8] N. Mantel and W. Haenszel, Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease, J. Natl. Cancer Inst. 22 (1959) 719-748.
[9] G. Eklund, An example from CMR-70 - Adenocarcinoma in the nose, in: The Environmental Cancer Registry-70 (in Swedish), ed. C. Ortendahl, pp. 41-46 (The National Board of Health and Welfare, Stockholm, 1990).
[10] C. Lentner, Geigy Scientific Tables, pp. 152-155 (Ciba-Geigy, Basle, 1982).
[11] The Cancer Registry, Cancer Incidence in Sweden 1985, pp. 5-26 (The National Board of Health and Welfare, Stockholm, 1989).
[12] B. Mattsson, Cancer Registration in Sweden: Studies on Completeness and Validity of Incidence and Mortality Registers (Thesis), pp. 1-33 (Karolinska Institute, Stockholm, 1984).
[13] M. Gerhardsson, S.E. Norell, H.J. Kiviranta and A. Ahlbom, Respiratory cancers in furniture workers, Br. J. Ind. Med. 42 (1985) 403-405.
[14] B. Lindelof, B. Sigurgeirsson, C.F. Wahlgren and G. Eklund, Chronic Urticaria and Cancer, Br. J. Dermatol. 123 (1990) 453-456.
[15] S.J. Straley, Programming in Clipper (Addison-Wesley, New York, 1988).
[16] Swedish National Central Bureau of Statistics, Causes of death (SCB, annual publications, Stockholm, 1956-1986).
[17] E. Jones, Using EXCEL on the PC (Osborne McGraw-Hill, Berkeley, 1988).
[18] IBM Corporation, Systems Application Architecture. Common User Access Panel Design and User Interaction (IBM Corporation, 1988).
[19] J.D. Carrabis, dBASE III PLUS. The Complete Reference (Osborne McGraw-Hill, Berkeley, 1987).