
Basic and Applied Pathology 2012; 5: 15–18 doi:10.1111/j.1755-9294.2012.01124.x

ORIGINAL ARTICLE

Construction of an automated screening system to predict breast cancer diagnosis and prognosis

Sou-Young Jin,1 Jae-Kyung Won,2,3 Hojin Lee1 and Ho-Jin Choi1

1 Department of Computer Science, School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon; 2 Molecular Pathology Center, Seoul National University Cancer Hospital, Seoul; 3 Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea

Key words artificial intelligence, Breast Cancer Wisconsin dataset, computer-assisted image processing, data mining, mass screening.

Received 18 January 2012 Accepted 1 February 2012

Correspondence Dr Jae-Kyung Won, MD, Molecular Pathology Center, Seoul National University Cancer Hospital, Seoul 110-744, Korea. Email: [email protected]

∗ Sou-Young Jin, Jae-Kyung Won, and Hojin Lee contributed equally to this study. This work was accepted and presented at the International Conference on Internet (ICONI) 2011 Proceedings.

ABSTRACT

Background and aim: Machine learning methods can support clinical decision processes such as pathological diagnosis when microscopic feature datasets are available. In the present study, using the Breast Cancer Wisconsin dataset, we compared several classification algorithms to devise an optimal classifier that predicts both diagnosis (benign vs malignant) and prognosis (recurrent vs non-recurrent).

Methods: The performance of a two-step classifier, which decides diagnosis and prognosis sequentially, was compared with that of a multi-class (one-step) classifier, which separates all classes simultaneously.

Results: In the two-step classifier, the functional trees (FT) algorithm was the best for the first step of classification, and Naïve Bayes was the best for the second step. The one-step classifier showed better overall accuracy and better prediction of benign and non-recurrent cases than the two-step classifier, but lower accuracy in predicting recurrent cases, leading to lower sensitivity.

Conclusions: We conclude that the two-step classifier with FT and Naïve Bayes is better than the one-step classifier. This work will be helpful in setting up an automated screening system in real clinics, and it highlights clues for improving accuracy by refining data and algorithm selection in data mining or machine learning processes.

INTRODUCTION

When a patient with a breast mass goes to a hospital, the clinical decision process begins with a physical examination by a physician. Imaging studies such as mammography or ultrasonography can be added. Fine needle aspiration cytology is often used as an important screening method because it is safe, less invasive than surgical biopsy and highly sensitive in distinguishing a benign tumor from a malignant one.1 It has been suggested that machine learning is helpful in the clinical decision process. Pioneering works combined image processing with machine learning algorithms to predict the diagnosis or prognosis of breast cancer patients from fine needle aspiration slides.2–4 By extracting diverse nuclear features from cytologic slides, they performed data mining and machine learning, which led to successful predictions of diagnosis or prognosis. Their datasets (Breast Cancer Wisconsin Diagnostic and Prognostic), which are publicly available,5 have been used in many studies to demonstrate the performance of machine learning algorithms. Some works aimed to build an automatic diagnostic system to distinguish benign breast tumors from malignant ones,6–9 while others have focused on predicting the recurrence of breast cancer.10–12

When predicting recurrence, it is much more important to find every case that may recur than to exclude cases that will not recur. In other words, sensitivity must take priority over specificity. If we miss recurring cases, patients might die of cancer, whereas if we misclassify non-recurring cases as recurring, patients might undergo further examinations, which is not a severe problem. Almost all previous works ignore the importance of this sensitivity issue. Also, to date no study has proposed a model that can predict both diagnosis and prognosis. If such a screening system were available, it might be helpful in real clinical settings. In the present study, we modified the Breast Cancer Wisconsin Diagnostic and Prognostic datasets and created a unified dataset with three classes: benign, malignant-recurrent, and malignant-non-recurrent. We performed a diverse set of experiments that include combined successive binary classifications and multi-class classifications using diverse classification algorithms. By comparing the performance of multiple algorithms and experimental settings, we try to find the best model that can predict both diagnosis and prognosis of breast cancer. Also, by comparing the prediction performance for diagnosis with that for prognosis, we try to find problems in the dataset and the selected attributes, and search for clues to improve accuracy in future work.


Machine learning for cytologic diagnosis

S.-Y. Jin et al.

Table 1 Attributes of the Breast Cancer Wisconsin dataset

| Diagnostic dataset | Prognostic dataset |
| --- | --- |
| ID number | ID number |
| Diagnosis (M = malignant, B = benign) | Outcome (R = recurrent, N = non-recurrent) |
| – | Time (recurrence time) |
| Radius (mean of distances from center to points on the perimeter) | Radius (mean of distances from center to points on the perimeter) |
| Texture (standard deviation of gray-scale values) | Texture (standard deviation of gray-scale values) |
| Perimeter | Perimeter |
| Area | Area |
| Smoothness (local variation in radius lengths) | Smoothness (local variation in radius lengths) |
| Compactness (perimeter² / area – 1.0) | Compactness (perimeter² / area – 1.0) |
| Concavity (severity of concave portions of the contour) | Concavity (severity of concave portions of the contour) |
| Concave points (number of concave portions of the contour) | Concave points (number of concave portions of the contour) |
| Symmetry | Symmetry |
| Fractal dimension (“coastline approximation” – 1) | Fractal dimension (“coastline approximation” – 1) |
| – | Tumor size (diameter of the excised tumor in centimeters) |
| – | Lymph node status (number of positive axillary lymph nodes observed at time of surgery) |

Comparison between the Breast Cancer Wisconsin Diagnostic dataset and the Breast Cancer Wisconsin Prognostic dataset. Detailed information is available from: http://archive.ics.uci.edu/.

Figure 1 The structures of the original Breast Cancer Wisconsin dataset and our modified dataset. (a) The structure of the original Breast Cancer Wisconsin Diagnostic dataset and Prognostic dataset. (b) The structure of our modified dataset. It is composed of benign cases of the Breast Cancer Wisconsin Diagnostic dataset and the Breast Cancer Wisconsin Prognostic dataset (malignant cases).

METHODS

Modification of the Breast Cancer Wisconsin dataset

Our purpose was to generate and evaluate a screening algorithm that classifies cases into benign, recurrent (malignant) and non-recurrent (malignant) classes. We therefore used both the Breast Cancer Wisconsin Diagnostic dataset and the Breast Cancer Wisconsin Prognostic dataset from the UC Irvine Machine Learning Repository.5 The Diagnostic dataset is used to predict whether a case is benign or malignant, and the Prognostic dataset is used to predict whether a case will recur or not, so all cases in the Prognostic dataset are malignant (Fig. 1a). The attributes of the two datasets are nearly the same (Table 1). The Prognostic dataset overlaps largely with the malignant instances in the Diagnostic dataset, except for 73 cases. We therefore made a new dataset by combining the two datasets as described in equation 1.

\[
\text{Dataset} = \{\text{Benign instances in Diagnostic dataset}\} \cup \{\text{All instances in Prognostic dataset}\} \tag{1}
\]

We used the red-colored regions in Fig. 1b as the dataset for our experiments. Although benign cases in the Diagnostic dataset originally did not have the attributes ‘Outcome’ and ‘lymph node status’, we assumed that ‘Outcome’ is ‘non-recurrent’ and ‘lymph node status’ is 0 in benign cases in order to unify the datasets.
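The construction in equation 1, together with the two default values assumed for benign cases, can be sketched in a few lines of Python. This is only an illustration: the records, field names (`diagnosis`, `outcome`, `lymph_nodes`, `label`) and the `unify` helper are hypothetical and are not taken from the UCI files or from the authors' code.

```python
# Toy illustration of equation (1). All records and field names are invented
# placeholders, not rows or column names from the actual UCI files.

def unify(diagnostic_rows, prognostic_rows):
    """Return benign Diagnostic rows plus all Prognostic rows with a 3-way label."""
    unified = []
    for row in diagnostic_rows:
        if row["diagnosis"] != "B":
            continue  # malignant Diagnostic cases are not taken from this dataset
        merged = dict(row)
        merged["outcome"] = "N"      # benign cases are assumed non-recurrent
        merged["lymph_nodes"] = 0    # benign cases are assumed to have 0 positive nodes
        merged["label"] = "benign"
        unified.append(merged)
    for row in prognostic_rows:
        merged = dict(row)
        merged["diagnosis"] = "M"    # every Prognostic case is malignant
        merged["label"] = "recurrent" if row["outcome"] == "R" else "non-recurrent"
        unified.append(merged)
    return unified


diagnostic = [
    {"id": 1, "diagnosis": "B", "radius": 12.1},
    {"id": 2, "diagnosis": "M", "radius": 20.3},
]
prognostic = [
    {"id": 3, "outcome": "R", "radius": 19.8, "lymph_nodes": 4},
    {"id": 4, "outcome": "N", "radius": 17.2, "lymph_nodes": 0},
]

dataset = unify(diagnostic, prognostic)
labels = [row["label"] for row in dataset]
```

In this toy run the malignant Diagnostic record (id 2) is dropped, and the single benign record receives the assumed ‘non-recurrent’ outcome and a lymph node count of 0.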

Figure 2 The scheme of a two-step classifier and one-step classifier. (a) The schematic diagram of a two-step classifier, which is a binary classifier. (b) The schematic diagram of a one-step classifier, which is a multi-class classifier.

The two-step classifier is composed of two classifiers. First, the dataset is classified into benign and malignant classes by the first classifier. Then, the second classifier divides the instances that have been classified as malignant into recurrent and non-recurrent classes (Fig. 2a). We also considered a one-step classifier, which divides the dataset into benign, recurrent and non-recurrent classes simultaneously (Fig. 2b).
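The difference between the two schemes is mainly one of control flow, which a minimal Python sketch can make concrete. The threshold rules below (`toy_is_malignant`, `toy_will_recur`, `toy_three_class`) are hypothetical stand-ins chosen only so the example runs; they are not the FT, Naïve Bayes or other Weka models used in the study.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def two_step(x: Features,
             is_malignant: Callable[[Features], bool],
             will_recur: Callable[[Features], bool]) -> str:
    """Cascade of two binary classifiers: diagnosis first, then prognosis."""
    if not is_malignant(x):
        return "benign"
    # Only cases called malignant by the first step reach the second classifier.
    return "recurrent" if will_recur(x) else "non-recurrent"


def one_step(x: Features, classify3: Callable[[Features], str]) -> str:
    """A single multi-class classifier that outputs one of the three labels."""
    return classify3(x)


# Hypothetical stand-in rules, for illustration only.
def toy_is_malignant(x: Features) -> bool:
    return x["radius"] > 15.0


def toy_will_recur(x: Features) -> bool:
    return x["lymph_nodes"] >= 3


def toy_three_class(x: Features) -> str:
    if x["radius"] <= 15.0:
        return "benign"
    return "recurrent" if x["lymph_nodes"] >= 3 else "non-recurrent"


cases = [
    {"radius": 12.0, "lymph_nodes": 0},
    {"radius": 18.0, "lymph_nodes": 1},
    {"radius": 21.0, "lymph_nodes": 5},
]
two_step_out = [two_step(c, toy_is_malignant, toy_will_recur) for c in cases]
one_step_out = [one_step(c, toy_three_class) for c in cases]
```

A practical consequence of the cascade is that each step can use a different algorithm, whereas the one-step scheme commits to a single model for all three classes.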

RESULTS

Study method

We divided our experiment into two approaches: a two-step classifier and a one-step classifier. Each experiment was performed with algorithms of Weka software.13

Two-step classifier

Table 2 shows the results of applying several algorithms to the unified dataset. The functional trees (FT) algorithm shows the best sensitivity, specificity and accuracy. Therefore, we chose the FT algorithm as the first classifier of the two-step classifier. The instances classified as malignant by the first classifier are then separated into recurrent and non-recurrent instances by the second classifier (Table 3). The Naïve Bayes algorithm shows the best sensitivity, and the logistic model trees (LMT) algorithm shows the best specificity and accuracy. We chose the Naïve Bayes algorithm as the second classifier of the two-step classifier, because sensitivity is more important than specificity and accuracy in a medical screening test. Although Naïve Bayes is the best algorithm by this criterion, its sensitivity is lower than 50%, which makes it difficult to use in clinical practice. The net result of applying the two-step classifier to the Breast Cancer Wisconsin dataset is summarized in Table 4. It shows relatively good prediction performance on benign and non-recurrent cases, although it shows low accuracy on recurrent cases.

Table 2 Results of several algorithms in the first step of the two-step classifier

| Algorithm | Sensitivity | Specificity | Accuracy |
| --- | --- | --- | --- |
| Naïve Bayes | 89.65 | 95.51 | 93.42 |
| FT | 94.60 | 98.85 | 97.33 |
| IB1 | 91.87 | 96.61 | 94.92 |
| Random Tree | 90.20 | 94.34 | 92.86 |
| J48 | 90.30 | 95.27 | 93.50 |
| LMT | 94.39 | 98.74 | 97.19 |
| 1-R | 83.28 | 94.40 | 90.43 |

Values are presented as percentages. The performances of several algorithms that are applied to our unified dataset are summarized in this table. Sensitivity in this table means the ability to discriminate ‘malignant’ cases. FT, functional trees; LMT, logistic model trees.

Table 3 Results of several algorithms in the second step of the two-step classifier

| Algorithm | Sensitivity | Specificity | Accuracy |
| --- | --- | --- | --- |
| Naïve Bayes | 43.78 | 78.03 | 66.11 |
| FT | 38.89 | 77.24 | 68.16 |
| IB1 | 34.89 | 78.14 | 67.89 |
| Random Tree | 30.44 | 76.90 | 65.89 |
| J48 | 15.11 | 88.34 | 71.00 |
| LMT | 13.56 | 94.48 | 75.32 |
| 1-R | 12.00 | 86.27 | 60.68 |

Values are presented as percentages. The performances of several algorithms that are applied to cases that have been classified as ‘malignant’ are summarized in this table. Sensitivity in this table means the ability to discriminate ‘recurrent’ cases. FT, functional trees; LMT, logistic model trees.
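The selection rule described above (pick the algorithm with the highest sensitivity rather than the highest accuracy) can be expressed as a short Python sketch. The first-step numbers are copied from Table 2. For the second step only the two algorithms that the text names explicitly are included, and their values are read from Table 3 on the assumption that its rows follow the same algorithm order as Table 2. The `best_by` helper is hypothetical.

```python
# (sensitivity, specificity, accuracy) in percent, copied from Table 2.
first_step = {
    "Naive Bayes": (89.65, 95.51, 93.42),
    "FT": (94.60, 98.85, 97.33),
    "IB1": (91.87, 96.61, 94.92),
    "Random Tree": (90.20, 94.34, 92.86),
    "J48": (90.30, 95.27, 93.50),
    "LMT": (94.39, 98.74, 97.19),
    "1-R": (83.28, 94.40, 90.43),
}

# Second step: only the two algorithms named in the text, with values read
# from Table 3 assuming the same row order as Table 2 (an assumption).
second_step = {
    "Naive Bayes": (43.78, 78.03, 66.11),
    "LMT": (13.56, 94.48, 75.32),
}

METRIC_INDEX = {"sensitivity": 0, "specificity": 1, "accuracy": 2}


def best_by(results, metric):
    """Return the name of the algorithm with the highest value of `metric`."""
    index = METRIC_INDEX[metric]
    return max(results, key=lambda name: results[name][index])


first_choice = best_by(first_step, "sensitivity")
second_choice = best_by(second_step, "sensitivity")
second_choice_by_accuracy = best_by(second_step, "accuracy")
```

For the first step every criterion points to FT, so the choice is unambiguous. For the second step the criterion matters: ranking by sensitivity selects Naïve Bayes, while ranking by accuracy or specificity would select LMT.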

Table 4 Net results of the two-step classifier on the Breast Cancer Wisconsin dataset

| | Benign (classified) | Malignant-non-recurrent | Malignant-recurrent | Specificity/Sensitivity |
| --- | --- | --- | --- | --- |
| Benign (actual) | 352.9 | 2.9 | 1.2 | 98.85% |
| Malignant-non-recurrent | 10.2 | 105.8 | 39.0 | 68.26% |
| Malignant-recurrent | 0.5 | 25.1 | 19.4 | 43.11% |
| Positive/Negative predictive value | 97.06% | 79.07% | 32.56% | 85.83% |

The performances of our two-step classifier that is composed of combined algorithms (1st step: functional trees algorithm, 2nd step: Naïve Bayes) are summarized in this table.

One-step classifier

Table 5 shows the experimental results of the one-step classifier. The values for predicting benign, non-recurrent and recurrent correspond to specificity at the first step of the two-step classifier, specificity at the second step of the two-step classifier and sensitivity at the second step of the two-step classifier, respectively. In the one-step classifier, the highest accuracy of predicting benign is higher than that of the two-step classifier, while the highest accuracy of predicting non-recurrent is lower than that of the two-step classifier. We also see that the accuracy of predicting recurrent is significantly lower than that of the two-step classifier for all the algorithms we experimented with. The one-step classifier tends to predict instances as benign, so the accuracy of predicting benign is high while the other two predictions show low accuracy. In the two-step classifier, the best algorithm for each step is not the same. Consequently, in the one-step classifier, some algorithms, such as Naïve Bayes, show significantly lower accuracy than in the two-step classifier. Since sensitivity, the ability to detect malignancy, is more important than total accuracy in a medical screening test, we conclude that the two-step classifier is more useful than the one-step classifier.

Table 5 Results of the one-step classifier on the Breast Cancer Wisconsin dataset

| Algorithm | Predicting benign | Predicting non-recurrent | Predicting recurrent | Total accuracy |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 94.85 | 52.16 | 15.36 | 79.31 |
| FT | 98.88 | 76.62 | 8.61 | 86.79 |
| IB1 | 96.36 | 69.53 | 10.99 | 83.89 |
| Random Tree | 94.29 | 66.29 | 7.68 | 80.77 |
| J48 | 95.35 | 81.72 | 1.46 | 83.96 |
| LMT | 99.58 | 88.67 | 3.58 | 89.15 |
| 1-R | 95.18 | 68.87 | 4.24 | 81.12 |

Values are presented as percentages. The performances of our one-step classifier with a single algorithm are summarized in this table. FT, functional trees; LMT, logistic model trees.
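The margins of Table 4 follow from its 3 × 3 confusion matrix: each row percentage is the fraction of an actual class that was classified correctly, each column percentage is the fraction of a predicted class that was correct, and the corner value is the overall accuracy. The following Python sketch recomputes them; the function names are illustrative. The counts in the table are fractional and rounded to one decimal place, so a recomputed percentage can differ from the printed one in the last digit.

```python
# Confusion matrix from Table 4: rows are actual classes, columns are
# classified classes, both in the order benign, non-recurrent, recurrent.
classes = ["benign", "non-recurrent", "recurrent"]
confusion = [
    [352.9, 2.9, 1.2],
    [10.2, 105.8, 39.0],
    [0.5, 25.1, 19.4],
]


def per_class_recall(matrix):
    """Fraction of each actual class that was classified correctly (row-wise)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def per_class_predictive_value(matrix):
    """Fraction of each classified class that was correct (column-wise)."""
    size = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [matrix[c][c] / column_totals[c] for c in range(size)]


def overall_accuracy(matrix):
    """Correctly classified cases divided by all cases."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


recall_pct = [round(100 * v, 2) for v in per_class_recall(confusion)]
predictive_pct = [round(100 * v, 2) for v in per_class_predictive_value(confusion)]
accuracy_pct = round(100 * overall_accuracy(confusion), 2)

for name, rec, pv in zip(classes, recall_pct, predictive_pct):
    print(f"{name}: recall {rec}%, predictive value {pv}%")
print(f"overall accuracy: {accuracy_pct}%")
```

The recall of the recurrent row is the quantity the paper treats as decisive for screening, since it measures how many truly recurrent cases the cascade finds.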

DISCUSSION

Our experiments show that a successive application of two binary classifiers (two-step classifier) is more useful than a multi-class classifier (one-step classifier) to predict diagnosis and prognosis of breast cancer. Among several classification algorithms, the FT algorithm and Naïve Bayes show the best performance for this dataset. The experiments also show that the overall accuracy of diagnosis is much better than that of prognosis. In our opinion, the reason why diagnosis (benign vs malignant) is predicted much more accurately than prognosis (recurrent vs non-recurrent) lies mainly in the selection of data attributes. Pathological diagnostic criteria in real clinics are based on morphological features, and the attributes selected for the Breast Cancer Wisconsin Diagnostic dataset are consistent with those criteria. In reality, on the other hand, prognosis prediction is usually based on clinico-pathologic staging, such as tumor size, relationships with nearby tissues and lymph node status. Most of the attributes of the Breast Cancer Wisconsin Prognostic dataset do not reflect those factors, so they are of limited relevance from the standpoint of accepted domain knowledge. Considering all the previous papers and our work, the sensitivity of diagnosis classification never reaches 100%. In reality, a few types of malignancy, such as tubular carcinoma, have nuclear features that cannot be distinguished from those of benign tumors. Such cases might be included in this dataset and reduce sensitivity. To improve sensitivity and the classification of prognosis, additional data attributes, such as findings from physical examination, mammogram and ultrasonogram, are necessary; these would complement the nuclear features with more clinically relevant attributes.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation (NRF) grant (No. 2011–0018264) of the Ministry of Education, Science and Technology (MEST) of Korea.

REFERENCES

1. Ariga R, Bloom K, Reddy VB et al. Fine-needle aspiration of clinically suspicious palpable breast masses with histopathologic correlation. Am J Surg 2002; 184: 410–3.
2. Wolberg WH, Street WN, Heisey DM, Mangasarian OL. Computerized breast cancer diagnosis and prognosis from fine-needle aspirates. Arch Surg 1995; 130: 511–6.
3. Street WN, Mangasarian OL, Wolberg WH. An inductive learning approach to prognostic prediction. In: Proceedings of the 12th International Conference on Machine Learning; Lake Tahoe, CA; 1995 Jul 9–12. San Mateo: Morgan Kaufmann, 1995; 522–30.
4. Mangasarian OL, Street WN, Wolberg WH. Breast cancer diagnosis and prognosis via linear programming. AAAI Tech Rep 1994; SS-94–01: 83–6.
5. UCI Machine Learning Repository [Internet]. UCI Machine Learning Repository: 2010; [cited 2011 Dec 20]. Available from: http://archive.ics.uci.edu/ml.
6. Pawlak Z. Rough sets. Int J Comput Inf Sci 1982; 11: 341–56.
7. Chen HL, Yang B, Liu J, Liu DY. A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis. Expert Syst Appl 2011; 38: 9014–22.
8. Pous C, Gay P, Pla A et al. Modeling reuse on case-based reasoning with application to breast cancer diagnosis. Artif Intell Methodol Syst Appl 2008; 2008: 322–32.
9. Camastra F, Verri A. A novel kernel method for clustering. IEEE Trans Pattern Anal Mach Intell 2005; 27: 801–5.
10. Fung G, Mangasarian OL. Finite Newton method for Lagrangian support vector machine classification. Neurocomputing 2003; 55: 39–55.
11. Wu D, Bennett KP, Cristianini N, Shawe-Taylor J. Large margin decision trees for induction and transduction. In: Proceedings of the 6th International Conference on Machine Learning; 1999 Jun 27–30. Slovenia; 474–83.
12. Mena L, Gonzalez JA. Machine learning for imbalanced datasets: application in medical diagnostic. In: Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference; 2006 May 11–13. Melbourne Beach; 574–9.
13. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software. Vol. 11. New York: SIGKDD Explorations, 2009.

Copyright © 2012 The Korean Society for Cytopathology, The Korean Society for Legal Medicine, The Korean Society of Oral and Maxillofacial Pathology, The Korean Society of Pathologists, The Korean Society of Toxicological Pathology, The Korean Society of Veterinary Pathology and Blackwell Publishing Asia Pty Ltd