Automated Classification of Semi-structured ...

1 downloads 0 Views 649KB Size Report
Lip, oral cavity and pharynx. 903. 0.80. 0.71. 0.75. C15-C26. Digestive organs. 2,159. 0.67. 0.84. 0.75. C77. Lymph nodes. 590. 0.68. 0.80. 0.74. Overall. 18,905.
Automated Classification of Semi-structured Pathology Reports into ICD-O using SVM in Portuguese Michel Oleynik1 1 2

Diogo F. C. Patrão2

Marcelo Finger3

Medical University of Graz, Austria

A.C. Camargo Cancer Center, Brazil 3

University of São Paulo, Brazil

2017

1 / 11

Introduction

Introduction

Cancer: second leading cause of death in the U.S. Pathology reports Main source of information when diagnosing cancer Tumour site (topography) Histological type and malignancy (morphology)

International Classification of Diseases for Oncology (ICD-O) Encoded as part of normal operation or later for regulatory purposes Discordance rates as high as 50% among specialists

2 / 11

Introduction

Goal

C50.9: Breast, NOS 8522/3: Infiltrating duct and lobular carcinoma

Figure: Example of a pathology report mapped into ICD-O codes.

3 / 11

Materials and Methods

Materials

Materials

Anonymised data from ca. 95,000 patients over 14 years Written in Portuguese by Brazilian pathologists Pathology reports and cancer registry data Routine procedure in A.C. Camargo Cancer Center, Brazil

Gold standard creation Free-text content associated to structured data in cancer registries Discarded patients with confirmed metastasis or multiple classifications Reports unified using the patient identifier

Resulting dataset Topography: 18,905 patients (18 code groups) Morphology: 18,599 patients (49 code groups)

4 / 11

Materials and Methods

Methods

Methods

Support Vector Machines (SVMs) Bag-of-words, one-versus-all approach tf-idf weighting scheme with linear kernel Weka 3.6.6 with LibSVM 3.17

Supervised machine learning Micro-averaged with 10-fold cross-validation Evaluated by precision, recall and F1 -score

5 / 11

Results

Results: Topography Table: Top ten F1 scores in the ICD-O topography attribution task. Code Group

Description

n

P

R

F1

C44

Skin

3,858

0.88

0.94

0.91

C50

Breast

3,668

0.89

0.91

0.90

C73-C75

Thyroid and other endocrine glands

1,329

0.92

0.87

0.90

C60-C63

Male genital organs

1,536

0.93

0.81

0.87

C64-C68

Lymph nodes

C51-C58

Female genital organs

C69-C72

660

0.86

0.78

0.82

1,574

0.85

0.77

0.81

Eye, brain and other parts of central [...]

536

0.83

0.70

0.76

C00-C14

Lip, oral cavity and pharynx

903

0.80

0.71

0.75

C15-C26

Digestive organs

2,159

0.67

0.84

0.75

590

0.68

0.80

0.74

18,905

0.82

0.82

0.82

C77

Lymph nodes Overall

6 / 11

Results

Results: Morphology Table: Top ten F1 scores in the ICD-O morphology attribution task.

Code Group

Description

n

P

R

F1

959-972

Hodgkin and non-Hodgkin lymphomas

859

0.85

0.87

0.86

850-854

Ductal and lobular neoplasms

3,410

0.85

0.87

0.86

855

Acinar cell neoplasms

1,059

0.87

0.85

0.86

809-811

Basal cell neoplasms

1,704

0.80

0.89

0.84

872-879

Nevi and melanomas

1,473

0.87

0.81

0.84

906-909

Germ cell neoplasms

208

0.89

0.71

0.79

812-813

Transitional cell papillomas and carcinomas

384

0.81

0.74

0.78

938-948 858 868-871

Gliomas

237

0.82

0.71

0.76

Thymic epithelial neoplasms

17

1.00

0.59

0.74

Paragangliomas and glomus tumors

26

1.00

0.58

0.73

18,599

0.74

0.74

0.73

Overall

7 / 11

Discussion

Related Work

Martinez and Li (2011) Information extraction from pathology reports in a hospital setting Jouhet et al. (2012) Automated classification of free-text pathology reports for registration of incident cases of cancer Kavuluru et al. (2013) Automatic extraction of ICD-O-3 primary sites from cancer pathology reports

8 / 11

Discussion

Comparison with related work

Table: Comparison with related work. Topography Authors Martinez and Li (2011) Jouhet et al. (2012) Kavuluru et al. (2013)

Method

Morphology

n (docs)

Classes

F1

Classes

F1

NB1

217

11

0.58

N/A

N/A

NB, SVM2

5,121

26

0.72

18

0.85

SVM,

LR3

56,426

14

0.93

N/A

N/A

Oleynik et al. (2015)

NB

70,000+

18

0.76

49

0.63

Oleynik et al. (2017)

SVM

70,000+

18

0.82

49

0.73

1 Naïve

Bayes Vector Machine 3 Logistic Regression 2 Support

9 / 11

Conclusion

Conclusion

Future work Assess the kappa factor among specialists Evaluate in other domains and languages (e.g. MIMIC-III data) Other learning methods, e.g. neural networks

Contribution High efficiency rates comparable to those found in similar papers Easy process for obtaining structured information over textual data Accelerates the work of physicians when classifying patient data Successful baseline for future research, especially in Portuguese

10 / 11

Contact

Thank you! Contact Michel Oleynik [email protected]

Acknowledgments Prof. Dr. Stefan Schulz Brazilian National Research Council (CNPq) Medical University of Graz, Austria A.C. Camargo Cancer Center, São Paulo, Brazil

11 / 11

Support Vector Machines f (~x ) = sign(~ w T ~x + b)

(1)

1/1

Suggest Documents