Computer Methods and Programs in Biomedicine, 35 (1991) 193-201
© 1991 Elsevier Science Publishers B.V. All rights reserved 0169-2607/91/$03.50
COMMET 01198
Section II. Systems and programs
CANEST: a microcomputer program for estimating cancer in a cohort
Bardur Sigurgeirsson
Department of Dermatology, Karolinska Hospital, Stockholm, Sweden
Certain diseases and symptoms carry an overrepresentation of cancer. To measure the strength of such an association it is necessary to predict cancer development in the group being observed. A program for computers running under the MS DOS operating system has been developed for this purpose. The program is written in the CLIPPER programming language. The estimates are based on incidence and prevalence data from the Swedish Cancer Registry for the years 1958 to 1986. The program also computes confidence intervals based on the Poisson distribution. The results can be printed out or exported to other programs for further analysis.
Cancer; CANEST; Dermatology; Cohort estimation; CLIPPER
1. Introduction
The relation between skin signs and internal malignancy has always fascinated dermatologists. Many dermatological diseases, e.g., dermatomyositis and acanthosis nigricans, seem to have a clear relation to malignancy [1]. In other diseases this relation is less clear [2]. Our objectives were to investigate the development of malignancy in patients with dermatological diseases and to compare this with the expected number of patients with malignancy based on prevalence and incidence data from the Swedish Cancer Registry [3]. Calculating the expected number of malignancies for a patient group observed over a period of many years is complicated and time-consuming. The objective of this part of the project was to develop a microcomputer program running under the MS DOS operating system to make these estimates based on prevalence and incidence data from the Swedish Cancer Registry [3-5]. Programs for this purpose have been developed before [6], but they use cruder methods, assuming constant incidence rates within age and calendar intervals. In the program described below, cancer risk is estimated separately for each individual in the sample by using calendar-specific incidence data for the years 1958 to 1986 [3]. The program is called CANEST (Cancer estimates).
Correspondence: B. Sigurgeirsson, Department of Dermatology, Karolinska Hospital, S-104 01 Stockholm, Sweden.
2. Computational methods and theory
2.1. Computations
In Fig. 1 a model of one patient during a hypothetical study is shown. The patient enters the study the same year as the dermatosis is diagnosed (YD). At that time there is a certain likelihood that the patient already has cancer. This likelihood is equal to the prevalence in the population at that point. We make the approximation that the prevalence is stable and use the 1984 prevalence figures, which are the only ones available. The observation period is defined as the time between the diagnosis of the dermatosis and the last year for which incidence figures are available on-line in the Swedish Cancer Registry (CR). At the time of writing, this year is 1986.
Fig. 1. Model of one patient during the study. Birth: the year the patient was born. YD: the year the dermatosis was first diagnosed. CR: the last year on-line in the Cancer Registry. Today: today. YO: one year of observation. See text for further explanation.
The probability that a certain individual acquires a malignancy during 1 year of observation (YO) is equal to the incidence of that particular malignancy for the age of the patient and that particular calendar year, times 1 (year), adjusted for the mortality rate. This can be expressed with the following formula:
PROBABILITY = INCIDENCE x (1 - MORTALITY RATE)
This calculation must be repeated for each cancer form and for each year of observation (YO). As the patient gets older he moves between age groups, and new incidence figures are employed each calendar year. For one patient observed [...]
[Program listing: excerpt of the CLIPPER source, not legible in this copy.]
The prevalence file is similarly indexed on PICD7 and PAGEGR. Information about the death rate in Sweden was obtained from national authorities. This information is stored in the file DEATH. Age group is kept in DAGE and the death rate for males and females in MALEDEATH and FEMDEATH, respectively. The file is indexed on DAGE.
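As an illustration of the yearly calculation in Section 2.1, the following Python sketch sums the formula PROBABILITY = INCIDENCE x (1 - MORTALITY RATE) over the observation years of one patient for one cancer form. It is not the CLIPPER code of the program. The function names, the dictionary layout of the tables, the 5-year age groups, the counting of both end years and the toy rates are all assumptions made for the example.

```python
# Illustrative sketch only; table layout and names are assumptions, not CANEST's.

def age_group(age, width=5):
    """Map an age in years to an age-group index (5-year groups assumed)."""
    return age // width


def expected_cancers(birth_year, first_year, last_year, incidence, death_rate):
    """Sum PROBABILITY = INCIDENCE x (1 - MORTALITY RATE) over the years observed.

    incidence:  dict keyed by (calendar_year, age_group) -> yearly probability
    death_rate: dict keyed by age_group -> yearly probability of death
    """
    total = 0.0
    for year in range(first_year, last_year + 1):  # both end years counted (assumption)
        group = age_group(year - birth_year)       # the patient ages into new groups
        total += incidence[(year, group)] * (1.0 - death_rate[group])
    return total


# Toy rates, invented for the example (not registry data).
toy_incidence = {(y, g): 0.002 for y in range(1958, 1987) for g in range(20)}
toy_death_rate = {g: 0.01 for g in range(20)}

# One cancer form; patient born 1920, dermatosis diagnosed 1984, registry ends 1986.
risk = expected_cancers(1920, 1984, 1986, toy_incidence, toy_death_rate)
print(round(risk, 5))
```

With these toy rates the result is three yearly terms of 0.002 x 0.99. For a whole cohort the same sum would be repeated for every patient and every cancer form and the results added.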
Patient data are kept in the file PAT. This file can have external sources such as a hospital diagnosis system, or the data can be fed into the program. SEKEL contains information about in what century the patient was born. This is necessary for the program to be able to calculate the age of the patient. The 10-digit Swedish identification number is kept in the field PNR. The year of first diagnosis of the dermatosis is in DERMYEAR. Age at the time of diagnosis of the dermatosis, and the sex of the patient, are in DERMAGE and SEX, respectively. The program calculates this information from the personal identification number and the year the dermatosis was diagnosed. When the program has calculated the estimated incidence during the observation period and the supposed prevalence at the time the dermatosis was diagnosed, this information is saved in the files INC and PRE, respectively.
TABLE I
A brief explanation of the files used in the CANEST program
Database    Function
CREG        Data from Cancer Registry. A subset of FROMCREG
DEATH       Death rate in the Swedish population
DGN         ICD-7 diagnosis texts
FROMCREG    Original file from Cancer Registry
INC         Results of estimates based on incidence
INC5886     Incidence for the years 1958-1986
INCC        INC plus results of a count of cancer cases and Poisson upper and lower limits
PAT         The patient data file
POISSON     Poisson distribution
PRE         Results of estimates based on prevalence
PRECC       PRE plus results of a count of cancer cases and Poisson upper and lower limits
PREV84      Prevalence 1984
TOTC        INCC and PRECC accumulated to get total number of cases
[Figure: database file structure listing (field name, type, width, decimals). Field names legible in this copy, all of type Numeric: ICD7S, BEN, PAD, SNOMED, PAT, PREP, DODCA, DERMYEAR, CAYEAR, DERMAGE. The remaining names and the width and decimals columns are not recoverable.]
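The derivation of age and sex from the personal identification number, as described for the PAT file, could look roughly like the Python sketch below. It assumes the usual layout of the 10-digit Swedish number (YYMMDD followed by four digits, the ninth digit being odd for men and even for women) and takes the century as a separate argument standing in for the SEKEL field. How SEKEL is actually encoded is not stated in the text, the key BIRTHYEAR is a hypothetical name, and the example number is made up and not checksum-validated.

```python
# Illustrative sketch; field names follow the text, the encoding of SEKEL is assumed.

def parse_pnr(pnr, century, diagnosis_year):
    """Derive birth year, sex and age at diagnosis from a 10-digit number.

    pnr:      'YYMMDDNNNC'; a hyphen before the last four digits is ignored
    century:  e.g. 19 for a birth year 19YY; stands in for SEKEL (assumption)
    """
    digits = pnr.replace("-", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit identification number")
    birth_year = century * 100 + int(digits[:2])
    # Ninth digit: odd for men, even for women.
    sex = "M" if int(digits[8]) % 2 == 1 else "F"
    # Only the year of diagnosis is recorded, so age is a difference of years.
    return {"BIRTHYEAR": birth_year, "SEX": sex, "DERMAGE": diagnosis_year - birth_year}


# Made-up number for illustration.
record = parse_pnr("450312-1231", century=19, diagnosis_year=1975)
print(record)
```

Because the two-digit year is ambiguous across centuries, the century has to be supplied separately, which is the role the text gives to SEKEL.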
These files have similar structures, with the ICD-7 code in PICD7 and IICD7. Incidence and prevalence of that particular malignancy for males are stored in IOMALES and POMALES, and similarly for females in IOFEM and POFEM. These files can be printed or exported to other programs for further analysis such as computation of confidence limits, further statistical analysis or graphical presentation. If the results of a match between the patient file and the Cancer Registry are available (file CREG) the data can be processed further. The actual number of cancers is counted in the CREG file and the results are merged with PRE and INC, resulting in the files PRECC and INCC, respectively. If the user so desires, the Poisson distribution for the observed number of cancers can be calculated, which facilitates the calculation of relative risks with confidence intervals. Finally, PRECC and INCC are merged into TOTC. All these files contain data for individual cancers and the total number of cancers. TOTC and INCC can be exported into EXCEL™ [17] format.
4. Sample of typical program runs
4.1. Interface
Control of the program is based on pull-down menus as specified in the IBM SAA standard [18]. An example of a pull-down menu can be seen in Fig. 4. The program can be controlled with the keyboard or a mouse. Many software programs are based on this standard, e.g., all programs running under the Windows™ (Microsoft Corporation), OS/2™ (IBM Corporation) and Macintosh™ (Apple Corporation) operating systems. The main menu is at the top row on the screen and information about items selected is shown at the bottom. The area between the top and bottom rows is the working area of the program and is used for displaying results and editing files.
Fig. 4. The main menu, and calculate submenu.
4.2. A sample run
The program is started by typing its name, CANEST, at the operating system prompt. The main menu now appears:
Files   Calculate   Print   Transfer   Setup
Briefly, the program is used in the following way:
1. Feed in patient data.
2. Match file with Cancer Registry.
3. Estimate cancer in patient population.
4. Count actual number of cancers in file from Cancer Registry.
5. Compare the results of the match and the estimate.
6. Print or export data.
Patient data are first fed into the patient file. The following items are registered:
1. The patient's unique personal identification number.
2. The year the dermatological disease was diagnosed.
3. The diagnosis if more than one diagnosis is under consideration.
If the patient data already exist in another program or in another computer system, data can be
[Fig. 5: sample printout with one row per ICD-7 site (All cancers, 140 Lip, 141 Tongue and the following sites) and the columns Mal Obs, Mal Lo, Mal Up, Mal Exp, Mal RR, Conf Lo, Conf Hi. First row, All cancers: Mal Obs 160; Mal Lo 136,17; Mal Up 186,80; Mal Exp 126,65; Mal RR 1,26; Conf Lo 1,08; Conf Hi 1,47. The remaining rows cannot be reliably assigned to sites in this copy.]
Fig. 5. A sample printout. Mal Obs: Observed cancer, males. Mal Lo: Confidence limit, males, lower. Mal Up: Confidence limit, males, upper. Mal Exp: Expected (estimated by CANEST) cancer, males. Mal RR: Relative risk, males. Conf Lo: Confidence interval, males, lower limit. Conf Hi: Confidence interval, males, higher limit.
imported as an ASCII file. The program now calculates the age and sex of the patient. When all patient data have been entered, the file is sent to the Cancer Registry for matching.
The estimates can now be calculated. This is done by choosing the menu item Calculate from the main menu. Cancer probability at the time the individual enters the study is first calculated based on prevalence and the individual probabilities summed. The progress of these calculations can be watched on the screen; this step takes about 1/10 of the total calculation time. Next, cancer probability is calculated for the observation period. These calculations are based on incidence and have to be done for each year each individual is in the study, and are therefore much more time-consuming. As before, the progress of the calculations can be watched on the screen (Fig. 4). This feedback is important for the user, as these calculations can take up to 7 h for 6000 patients with an average observation time of 10 years on a microcomputer with an 80386 microprocessor and an 80387 coprocessor. The results of the calculations are now transferred to the output files. If so desired, the Poisson distribution can now be calculated for each cancer form and total cancer. The confidence limits based on the Poisson distribution are used later in the calculation of confidence intervals for the relative risk ratio.
The objective of the calculation of the estimates is to compare them to the situation in real life, i.e., how many cancers really occurred in the patient population. This can be done by matching the patient file against the Cancer Registry. The match is done on a large central computer at the Swedish Cancer Registry. The results of the match are stored in an ASCII (text) file which is then transferred to a personal computer and imported to the CANEST program (CREG file). The files can now be compared and the ratio between observed and expected cancer calculated. This ratio is called the relative risk. Several macros (macros are similar to small programs which can be used to do repetitive tasks) have been written in EXCEL™ to do the relative risk calculations and compute its confidence interval. A sample of the printout for such calculations is shown in Fig. 5.
The results can now be printed to a printer by selecting the Print option in the main menu. Other files used in the system can also be printed in this option. In the Transfer option, data can be imported from other programs or exported to EXCEL™ or to any other programs which accept ASCII or DBF files as input. Data sets can also be transferred to disk for backup. In Setup the user can configure the program's various options and name and select data sets.
5. Hardware and software specifications
5.1. Material used
The program was developed and tested on a PS/2 computer (IBM Corporation) with an 80386 microprocessor, 120 Mb hard disk and 6 Mb of internal memory. The CLIPPER™ compiler (Nantucket Corporation) was used for all programming. CLIPPER™ is a true compiler originally based on the dBASE III™ programming language [19]. There are clear advantages in using a compiler, compared with an interpreter. The program runs much faster, the code is protected and it is possible to link together CLIPPER™ programs and programs written in other languages such as C or assembler. Programs developed with the CLIPPER™ compiler use a common file format (DBF) shared by many databases and spreadsheets, and this makes it easy to move data to other software programs for further analysis or graphical display etc. A software link was created to the EXCEL™ [17] (Microsoft Corporation) spreadsheet, and macros were written to format the output for high-quality printing and to do further statistical analysis.
5.2. System requirements
The program needs a microcomputer with an 80286, 80386 or 80486 processor running under the MS DOS 3.3 operating system or higher, 640 k or more of internal memory, a hard disk and a printer. The program runs faster if a mathematical coprocessor is installed. Access to the EXCEL™ (Microsoft Corporation) spreadsheet is preferable, as some of the calculations for confidence intervals and final printouts are written in EXCEL™ macros.
6. Discussion
The objective was to create a microcomputer program to calculate the expected number of cancers in a group of individuals observed over a number of years based on Swedish prevalence and incidence figures. The resulting estimate is useful in cancer research. The estimate can be compared to the cancer development in the group and thus the risk of cancer development obtained. This technique has been applied to assessing the risk of cancer development in patients with various dermatological diseases, but it can of course be used on any patient group or other groups, e.g., occupational groups.
Before the development of the CANEST microcomputer program, a traditional method [9] was used for estimating cancer development in patient groups with dermatological diseases [14]. This method involves stratification of the patient material into 5-year age intervals. The estimated number of cancers is calculated based on the number of years each age group was under observation. The incidence figures used in these calculations are the figures for the year in the middle of the interval, or a cumulative incidence for the years 1971-1984. For example, for patients observed from 1965 to 1985 the 1975 figures would be used for the calculations. This method is only approximate and can lead to false conclusions when the limits are narrow. An example of this difference can be seen in 1155 patients with chronic urticaria, where the expected number of cancers was estimated at 41 with the older method but 48 with the computer program. The observed number of cancers was 36. Neither of these estimates indicated a significant risk.
It can be argued that it is not suitable to use national incidence data to estimate cancer development in a regional patient material. At the time of writing, the effects of this are not known, but this method is commonly used [9]. The program has no limits regarding this aspect: a new data file (INC58_86) with regional incidence data is simply used. It is intended to compare the outcome between these two methods when regional incidence data become available in computer-readable form.
Being able to manipulate the data and do the calculations on a microcomputer has significant advantages compared to the use of a mainframe computer. The data are more accessible to the scientist, who does not have to rely on the help of computer specialists to do all calculations. The data can also be moved more easily to other programs for graphical presentation or further statistical analysis. Mainframe computer time is very expensive, so each run of the program costs money, whereas the cost of each run on the microcomputer is negligible. Also, this microcomputer method allows for estimation of cancer cases in a cohort when personal identification numbers are not complete, if the year of birth and sex are known. It is easy to distribute and update the program, as computers running under MS DOS are widely available.
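The comparison of observed and expected cancers in the chronic urticaria example can be sketched in Python as follows. The paper states only that the confidence limits are based on the Poisson distribution and computed with spreadsheet macros, so the exact-interval method used here, the bisection solver and all function names are assumptions, not the author's code. The method does appear consistent with the limits printed in Fig. 5 for 0 observed cases (0,00 to 3,69) and 1 observed case (0,03 to 5,57).

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); terms are computed in log space."""
    if k < 0:
        return 0.0
    if mu <= 0.0:
        return 1.0
    terms = (math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, sum(terms))


def _root_of_decreasing(f, lo, hi, iterations=200):
    """Bisection for a function that falls from positive at lo to negative at hi."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_limits(observed, alpha=0.05):
    """Exact two-sided limits for the mean of a Poisson count (assumed method)."""
    bracket = observed + 10.0 * math.sqrt(observed + 1.0) + 10.0
    # Upper limit: the mean at which seeing `observed` or fewer has probability alpha/2.
    upper = _root_of_decreasing(lambda mu: poisson_cdf(observed, mu) - alpha / 2, 0.0, bracket)
    if observed == 0:
        return 0.0, upper
    # Lower limit: the mean at which seeing `observed` or more has probability alpha/2.
    lower = _root_of_decreasing(
        lambda mu: poisson_cdf(observed - 1, mu) - (1 - alpha / 2), 0.0, bracket)
    return lower, upper


def relative_risk(observed, expected):
    """Observed/expected ratio with limits obtained by scaling the Poisson limits."""
    lower, upper = poisson_limits(observed)
    return observed / expected, lower / expected, upper / expected


# Chronic urticaria example: 36 observed, 41 (older method) or 48 (CANEST) expected.
for expected in (41.0, 48.0):
    rr, lo, hi = relative_risk(36, expected)
    print(f"expected={expected:.0f}  RR={rr:.2f}  95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumptions both intervals include 1, in line with the statement that neither estimate indicated a significant risk.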
7. Availability
Please write to the author for details.
Acknowledgements
The author wishes to thank Professor Gunnar Eklund and docent Bernt Lindelof for valuable discussions.
References
[1] D.J. McLean and A. Haynes, Cutaneous Aspects of Internal Malignancy, in: Dermatology in General Medicine, eds. T.A. Fitzpatrick, A.Z. Eisen, K. Wolff, I.M. Freedberg and K.F. Austen, pp. 1917-1937 (McGraw-Hill, New York, 1987).
[2] J.P. Callen, Skin Signs of Internal Malignancy: Fact, Fancy and Fiction, in: Skin Signs of Internal Malignancy, eds. A.J. Rook, H.I. Maibach and J.P. Callen, pp. 340-357 (Seminars in Dermatology, 1984).
[3] The Swedish Cancer Registry, Cancer Incidence in Sweden 1958-1986 (The National Board of Health and Welfare, annual publications, Stockholm, 1960-1990).
[4] H.-O. Adami, T. Gunnarsson, P. Sparen and G. Eklund, The prevalence of cancer in Sweden 1984, Acta Oncol. 28 (1989) 463-470.
[5] H.-O. Adami and P. Sparen, Cancer Prevalence in Sweden 1984, in: Cancer Incidence in Sweden 1986, pp. 93-99 (The National Board of Health and Welfare, Stockholm, 1990).
[6] R. Monson, Analysis of relative survival and proportional mortality, Comput. Biomed. Res. 7 (1974) 325-332.
[7] N.E. Breslow, Fundamental measures of disease occurrence and association, in: Statistical Methods in Cancer Research: Analysis of case-control studies, pp. 42-81 (IARC, Lyon, 1980).
[8] N. Mantel and W. Haenszel, Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease, J. Natl. Cancer Inst. 22 (1959) 719-748.
[9] G. Eklund, An example from CMR-70 - Adenocarcinoma in the nose, in: The Environmental Cancer Registry-70 (in Swedish), ed. C. Ortendahl, pp. 41-46 (The National Board of Health and Welfare, Stockholm, 1990).
[10] C. Lentner, Geigy Scientific Tables, pp. 152-155 (Ciba-Geigy, Basle, 1982).
[11] The Cancer Registry, Cancer Incidence in Sweden 1985, pp. 5-26 (The National Board of Health and Welfare, Stockholm, 1989).
[12] B. Mattsson, Cancer Registration in Sweden: Studies on Completeness and Validity of Incidence and Mortality Registers (Thesis), pp. 1-33 (Karolinska Institute, Stockholm, 1984).
[13] M. Gerhardsson, S.E. Norell, H.J. Kiviranta and A. Ahlbom, Respiratory cancers in furniture workers, Br. J. Ind. Med. 42 (1985) 403-405.
[14] B. Lindelof, B. Sigurgeirsson, C.F. Wahlgren and G. Eklund, Chronic Urticaria and Cancer, Br. J. Dermatol. 123 (1990) 453-456.
[15] S.J. Straley, Programming in Clipper (Addison-Wesley, New York, 1988).
[16] Swedish National Central Bureau of Statistics, Causes of death (SCB, annual publications, Stockholm, 1956-1986).
[17] E. Jones, Using EXCEL on the PC (Osborne McGraw-Hill, Berkeley, 1988).
[18] IBM Corporation, Systems Application Architecture. Common User Access Panel Design and User Interaction (IBM Corporation, 1988).
[19] J.D. Carrabis, dBASE III PLUS. The Complete Reference (Osborne McGraw-Hill, Berkeley, 1987).