Automated Classification of Semi-structured Pathology Reports into ICD-O using SVM in Portuguese Michel Oleynik1 1 2
Diogo F. C. Patrão2
Marcelo Finger3
Medical University of Graz, Austria
A.C. Camargo Cancer Center, Brazil 3
University of São Paulo, Brazil
2017
1 / 11
Introduction
Introduction
Cancer: second leading cause of death in the U.S. Pathology reports Main source of information when diagnosing cancer Tumour site (topography) Histological type and malignancy (morphology)
International Classification of Diseases for Oncology (ICD-O) Encoded as part of normal operation or later for regulatory purposes Discordance rates as high as 50% among specialists
2 / 11
Introduction
Goal
C50.9: Breast, NOS 8522/3: Infiltrating duct and lobular carcinoma
Figure: Example of a pathology report mapped into ICD-O codes.
3 / 11
Materials and Methods
Materials
Materials
Anonymised data from ca. 95,000 patients over 14 years Written in Portuguese by Brazilian pathologists Pathology reports and cancer registry data Routine procedure in A.C. Camargo Cancer Center, Brazil
Gold standard creation Free-text content associated to structured data in cancer registries Discarded patients with confirmed metastasis or multiple classifications Reports unified using the patient identifier
Resulting dataset Topography: 18,905 patients (18 code groups) Morphology: 18,599 patients (49 code groups)
4 / 11
Materials and Methods
Methods
Methods
Support Vector Machines (SVMs) Bag-of-words, one-versus-all approach tf-idf weighting scheme with linear kernel Weka 3.6.6 with LibSVM 3.17
Supervised machine learning Micro-averaged with 10-fold cross-validation Evaluated by precision, recall and F1 -score
5 / 11
Results
Results: Topography Table: Top ten F1 scores in the ICD-O topography attribution task. Code Group
Description
n
P
R
F1
C44
Skin
3,858
0.88
0.94
0.91
C50
Breast
3,668
0.89
0.91
0.90
C73-C75
Thyroid and other endocrine glands
1,329
0.92
0.87
0.90
C60-C63
Male genital organs
1,536
0.93
0.81
0.87
C64-C68
Lymph nodes
C51-C58
Female genital organs
C69-C72
660
0.86
0.78
0.82
1,574
0.85
0.77
0.81
Eye, brain and other parts of central [...]
536
0.83
0.70
0.76
C00-C14
Lip, oral cavity and pharynx
903
0.80
0.71
0.75
C15-C26
Digestive organs
2,159
0.67
0.84
0.75
590
0.68
0.80
0.74
18,905
0.82
0.82
0.82
C77
Lymph nodes Overall
6 / 11
Results
Results: Morphology Table: Top ten F1 scores in the ICD-O morphology attribution task.
Code Group
Description
n
P
R
F1
959-972
Hodgkin and non-Hodgkin lymphomas
859
0.85
0.87
0.86
850-854
Ductal and lobular neoplasms
3,410
0.85
0.87
0.86
855
Acinar cell neoplasms
1,059
0.87
0.85
0.86
809-811
Basal cell neoplasms
1,704
0.80
0.89
0.84
872-879
Nevi and melanomas
1,473
0.87
0.81
0.84
906-909
Germ cell neoplasms
208
0.89
0.71
0.79
812-813
Transitional cell papillomas and carcinomas
384
0.81
0.74
0.78
938-948 858 868-871
Gliomas
237
0.82
0.71
0.76
Thymic epithelial neoplasms
17
1.00
0.59
0.74
Paragangliomas and glomus tumors
26
1.00
0.58
0.73
18,599
0.74
0.74
0.73
Overall
7 / 11
Discussion
Related Work
Martinez and Li (2011) Information extraction from pathology reports in a hospital setting Jouhet et al. (2012) Automated classification of free-text pathology reports for registration of incident cases of cancer Kavuluru et al. (2013) Automatic extraction of ICD-O-3 primary sites from cancer pathology reports
8 / 11
Discussion
Comparison with related work
Table: Comparison with related work. Topography Authors Martinez and Li (2011) Jouhet et al. (2012) Kavuluru et al. (2013)
Method
Morphology
n (docs)
Classes
F1
Classes
F1
NB1
217
11
0.58
N/A
N/A
NB, SVM2
5,121
26
0.72
18
0.85
SVM,
LR3
56,426
14
0.93
N/A
N/A
Oleynik et al. (2015)
NB
70,000+
18
0.76
49
0.63
Oleynik et al. (2017)
SVM
70,000+
18
0.82
49
0.73
1 Naïve
Bayes Vector Machine 3 Logistic Regression 2 Support
9 / 11
Conclusion
Conclusion
Future work Assess the kappa factor among specialists Evaluate in other domains and languages (e.g. MIMIC-III data) Other learning methods, e.g. neural networks
Contribution High efficiency rates comparable to those found in similar papers Easy process for obtaining structured information over textual data Accelerates the work of physicians when classifying patient data Successful baseline for future research, especially in Portuguese
10 / 11
Contact
Thank you! Contact Michel Oleynik
[email protected]
Acknowledgments Prof. Dr. Stefan Schulz Brazilian National Research Council (CNPq) Medical University of Graz, Austria A.C. Camargo Cancer Center, São Paulo, Brazil
11 / 11
Support Vector Machines f (~x ) = sign(~ w T ~x + b)
(1)
1/1