17th Mediterranean Conference on Control & Automation Makedonia Palace, Thessaloniki, Greece June 24 - 26, 2009
A Recommender System for Detection of Leukemia Based on Cooperative Game Atefeh Torkaman Faculty of Engineering Tarbiat Modares University Tehran, Iran
Nasrollah Moghaddam Charkari Faculty of Engineering Tarbiat Modares University Tehran, Iran
Abstract: Cancer is a term used for diseases in which abnormal cells divide without control and invade other tissues. Cancer types can be grouped into broader categories, including leukemia, carcinoma, sarcoma, lymphoma and myeloma, and central nervous system cancers. Among them, leukemia is a serious cancer that starts in blood-forming tissue such as the bone marrow, where blood cells are made. It is one of the leading causes of death in the world, which underlines the importance of diagnostic techniques; applying such techniques could decrease the mortality rate from leukemia. In this paper, an automatic system for classifying leukemia based on game theory is presented. The aim of this research is to apply game theory to classify leukemia into eight classes. More precisely, a cooperative game is used for classification according to different weights assigned to the markers. Throughout this paper, we work on real data (304 samples) taken from different types of leukemia and collected at the Iran Blood Transfusion Organization (IBTO). The modeling system can be used to model and classify a population according to the contributions of its members; in other words, it applies equally to other groups of data. The results show that the proposed model reaches a high classification accuracy (98.44%). It is therefore hoped that game theory can be used directly for classification in other cases.
Mahnaz Aghaeipour
Esmerdis Hajati
Iranian Blood Transfusion Research Center Tehran, Iran
Iranian Blood Transfusion Research Center Tehran, Iran
Keywords: Cooperative Game, Shapley value, Classification, Leukemia, Flow Cytometry

I. INTRODUCTION
Cancer is a term used for diseases in which abnormal cells divide without control and invade other tissues. More than a hundred different types of cancer are known today; among them, leukemia is a very common and serious cancer in humans. Leukemia is a type of cancer that starts in blood-forming tissue such as the bone marrow; it causes large numbers of abnormal blood cells to be produced and enter the blood (http://www.cancer.gov/cancertopics/what-is-cancer). Several tools exist for recognizing these cancers, and Flow Cytometry is one of the best of them. Flow Cytometry is a rapid and convenient technique for generating immunophenotypic data. The ability to perform multiparametric analysis on an individual cellular basis is a unique feature of the technique. It offers distinct advantages over competing immunophenotypic methods such as immunohistochemistry [1]. All hematopoietic cells express a set of markers (CD markers). According to the marker values obtained from Flow Cytometry, different types of leukemia can be classified.

978-1-4244-4685-8/09/$25.00 ©2009 IEEE

II. PROBLEM STATEMENT
In every diagnosis process, evaluating the data correctly and making correct decisions are the most important tasks; they help to decrease the mortality rate from diseases or the failure rate from faults. In Flow Cytometry, an expert has to interpret the data and recognize which category of malignancy a given sample belongs to. Classifier systems are well suited to this task: they not only help experts to make the right decision but also reduce possible errors. In this article, a classification method based on a cooperative game is investigated to assign diseases to the right classes. To the best of our knowledge, this is the first time that game theory has been applied directly to classification. The aim of this study is to assist hematologists and physicians in making correct decisions and in reducing errors in special cases.

III. RELATED WORK ON CANCER DIAGNOSIS
Several studies have addressed the medical diagnosis of different types of cancer with learning methods. Reference [2] used an artificial neural network and the Multivariate Adaptive Regression Splines (MARS) approach to classify breast cancer patterns. A new hybrid method based on a fuzzy-artificial immune system and the KNN algorithm was proposed in [3] for breast cancer diagnosis. Reference [4] developed an expert system for detecting breast cancer based on association rules and a neural network. In [5], a method based on feature selection and a support vector machine was applied to breast cancer diagnosis. Reference [6] used a Kohonen LVQ2 neural network to predict prostate cancer. An artificial neural network was used in [7] to predict lymph node metastases in gastric cancer, and in [8] neural network and decision tree models were proposed as prediction models for colorectal cancer patients. Reference [9] used stochastic game theory to model mutations, onset, progression and immune competition of cancer cells. An evolutionary game was applied to describe the evolution of tumor cell populations with interactions between cells [10]. Reference [11] proposed evolutionary game theory in an agent-based brain tumor model. In [12], a game theoretical approach was introduced to solve the classification problem in gene expression data analysis.

IV. BACKGROUND
Data classification is one of the most important topics in statistics, decision science and computer science. It has also been applied to problems in medicine, social science, management and engineering. A well-planned data classification system makes essential data easy to find. Many problems, such as disease diagnosis, image recognition and credit evaluation, use classification techniques [13]. In medicine, linear programming approaches are efficient and effective methods (for more information see [14], [15], [16], [17]) [4]. Fig. 1 shows some algorithms that have been widely used for classification. In this paper, we propose a cooperative game and the Shapley value to classify leukemia. Game theory is divided into two classes, noncooperative and cooperative games. These classes differ in how they formalize interdependence among the players [18] (see Fig. 2).

Figure 1. Some algorithms used for classification

Figure 2. Classifying games

In general, a cooperative game consists of a set of players and a characteristic function, which specifies the value created by different subsets of the players in the game. Let $N = \{1, 2, \ldots, n\}$ be the (finite) set of players and $v(S)$ the characteristic function, i.e., the value created when the members of $S$ come together and interact. These values reflect the contribution of the players to achieving a high payoff.

A cooperative game with transferable utility (TU-game) is an ordered pair $(N, v)$ consisting of the player set $N$ and the characteristic function $v : 2^N \to \mathbb{R}$ with $v(\emptyset) = 0$ [19]. $v(S)$ is interpreted as the maximal worth, or cost savings, that the members of $S$ can obtain when they cooperate.

A coalition is a group of players $T \subseteq N$, and the value of this coalition is $v(T)$. A TU-game $(N, v)$ is a [0, 1]-game when $v : 2^N \to [0, 1]$.

A TU-game is simple if $v(S) \in \{0, 1\}$ for every $S \subseteq N$ and $v(N) = 1$. In a simple game, $v(S) = 1 \Leftrightarrow S$ is winning and $v(S) = 0 \Leftrightarrow S$ is losing.

The payoff vector or allocation $(X_i)_{i \in N} \in \mathbb{R}^N$ assigns the amount $X_i$ to player $i \in N$ in a TU-game $(N, v)$. This allocation is efficient if

$$\sum_{i \in N} X_i = v(N).$$

A solution for a class of TU-games is a function $\psi$ that assigns to every TU-game in the class a payoff vector $\psi(v)$. The Shapley value, introduced by Shapley [21], is a well-known solution for TU-games; it assigns to each player its average marginal contribution over all possible orderings, i.e., permutations of the players [12]. An intuitive example of the Shapley value can be described in an academic setting. Assume that a professor running a lab has decided to distribute the yearly bonus among her students equitably, according to the real contribution of each student to the academic success of the lab. During the year, the students form spontaneous "coalitions", and each such group works and publishes a paper summarizing its work. Each paper gets a rank, which makes up the "payoff function". Based on this annual data of the students' coalitions and their associated payoffs, the Shapley value provides a fair and efficient way to distribute the bonus to each student according to her contribution over the year [20].

Given that $G = (N, v)$ is a game, the Shapley value assigns to player $i \in N$:

$$\phi_i(v) = \frac{1}{n!} \sum_{\pi} \big( v(P(\pi; i) \cup \{i\}) - v(P(\pi; i)) \big) \qquad (1)$$

where the sum runs over all permutations $\pi$ of the players and $P(\pi; i)$ is the set of players that precede player $i$ in the permutation $\pi$.

If two players $i, j$ are symmetric, i.e., $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq N \setminus \{i, j\}$, then $\phi_i(v) = \phi_j(v)$.

If a player $i$ has null marginal contributions, i.e., $v(S \cup \{i\}) = v(S)$ for all $S \subseteq N \setminus \{i\}$, then $\phi_i(v) = 0$.
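To make the permutation formula (1) concrete, the following minimal sketch computes exact Shapley values by enumerating all orderings of the players. The function name `shapley_values` and the three-player game `toy_game` are hypothetical and chosen only for illustration; they are not taken from the paper. In the toy game, a coalition is winning only if it contains both "a" and "b", so "a" and "b" are symmetric players and "c" has null marginal contributions.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, v):
    """Exact Shapley values by averaging marginal contributions
    over all n! orderings of the players, as in equation (1)."""
    players = list(players)
    totals = {p: Fraction(0) for p in players}
    n_orderings = 0
    for order in permutations(players):
        n_orderings += 1
        preceding = frozenset()  # P(pi; i): players that precede i
        for p in order:
            with_p = preceding | {p}
            totals[p] += Fraction(v(with_p)) - Fraction(v(preceding))
            preceding = with_p
    return {p: total / n_orderings for p, total in totals.items()}


def toy_game(coalition):
    """Hypothetical simple game: a coalition wins (worth 1)
    only if it contains both 'a' and 'b'; 'c' never matters."""
    return 1 if {"a", "b"} <= coalition else 0


phi = shapley_values(["a", "b", "c"], toy_game)
for player, value in phi.items():
    print(player, value)
```

In this example the symmetric players "a" and "b" each receive 1/2, the null player "c" receives 0, and the values add up to v(N) = 1, matching the efficiency, symmetry and null-player statements above. Enumerating all orderings is only practical for a small number of players.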
V. METHODOLOGY AND EXPERIMENTS
Throughout this paper, we work on matrices of markers and samples from different types of leukemia, collected at the Iran Blood Transfusion Organization (IBTO). The dataset contains 304 samples taken from human leukemia tissues. It consists of 17 features, which correspond to different CD markers related to leukemia. Data from the different experiments are stored as matrices, with rows referring to the CD markers (features) and columns to the individual disease samples. All of the samples are of the malignant class. Eight classes of leukemia were investigated: Early-Pre-B-All, Pre-B-All, Pro-B-All, B-Cell-All, Aml-M4-M5, Aml-M6, Aml-M7 and Aml-Non-M3. As an example, the conversion matrix of Aml-Non-M3 is shown in Fig. 3. The goal of this paper is to introduce a new method that classifies different types of leukemia as accurately as possible according to the values of the CD markers of each sample.
Figure 3. Conversion matrix of Aml-Non-M3

VI. PROPOSED MODEL BASED ON COOPERATIVE GAME
A block diagram of the implemented system is shown in Fig. 4. Leukemia is classified in two steps: a training phase and a diagnosis (testing) phase.

Figure 4. Model proposed based on cooperative game to classify samples

In the training phase, data collected from Flow Cytometry provide the initial samples. These data are in integer form. First, the data are converted to boolean format, where 1 denotes a CD marker whose value exceeds the 20% threshold and 0 otherwise. This is a critical task, as the value of each CD marker, which we call a feature, may strongly influence the data classes. To select the set of CD markers used in classifying samples, we propose a cooperative game with the markers in the role of players. We use the Shapley value to determine the weight (value) of each marker (player). One of the main advantages of the Shapley value is that it provides a solution that is unique and fair; the fairness property assigns each player its real share of the game. The use of the Shapley value for our goal is justified by the following axiomatic properties.

Axiom 1. (Pareto optimality or normalization) For any game $(N, v)$, $\sum_{i \in N} \phi_i(v) = v(N)$. This axiom implies that the worth of a dataset is divided exactly among the different markers.

Axiom 2. (Permutation invariance or symmetry) For any $(N, v)$ and any permutation $\pi$ on $N$, $\phi_i(v) = \phi_{\pi(i)}(\pi v)$. The value is not altered if the markers are arbitrarily renamed or reordered.

Axiom 3. (Dummy property) For any $(N, v)$ where $v(S \cup \{i\}) = v(S)$ for all $S \subseteq N \setminus \{i\}$, $\phi_i(v) = 0$. A marker that has no influence on a disease therefore receives zero value.

So, we can find the accurate value of each marker and the extent of its contribution to causing the disease. We then use these weights as the knowledge base of our system to classify data. We emphasize the role played by a set of markers in correlation with groups.

Let $B \in \{0, 1\}^{n \times k}$ be a boolean matrix, where $n$ denotes the number of CD markers and $k$ is the number of samples (from patients' results). The Shapley value can be calculated by (1) for matrix $B$, which determines the weight of each CD marker. The Shapley value assigns a value to each marker according to its contribution to the game. The system is then trained with these weights. Table 1 is an example in which the Shapley value assigns different values (weights) to each marker of Early-Pre-B-All and Pre-B-All.

TABLE 1. SHAPLEY VALUES (WEIGHTS) ASSIGNED TO EACH MARKER OF EARLY-PRE-B-ALL AND PRE-B-ALL.

In the diagnosis (testing) phase, Flow Cytometry is applied in order to find the malignant case. As in the training phase, the integer data are converted to boolean format.
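The boolean conversion used in both phases, and the distance-based choice of two candidate diseases made in the diagnosis phase, can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say exactly which vectors the Euclidean distance compares, so the sketch assumes each class is represented by one vector of marker weights and measures its distance to the boolean vector of the new sample. The function names, the class names and all numbers are hypothetical.

```python
import math

THRESHOLD_PERCENT = 20  # the 20% cut-off stated in the text


def to_boolean(marker_values, threshold=THRESHOLD_PERCENT):
    """Map raw marker readings to 1 (above threshold) or 0 (otherwise)."""
    return [1 if value > threshold else 0 for value in marker_values]


def two_nearest_classes(sample, class_weights):
    """Return the two class names whose weight vectors are closest
    to the sample in Euclidean distance (assumed comparison)."""
    ranked = sorted(
        class_weights,
        key=lambda name: math.dist(sample, class_weights[name]),
    )
    return ranked[:2]


# Hypothetical readings for four markers of one new sample.
raw_sample = [85, 12, 40, 3]
sample = to_boolean(raw_sample)

# Made-up per-class marker weights, standing in for trained Shapley weights.
class_weights = {
    "class_A": [0.9, 0.1, 0.8, 0.0],
    "class_B": [0.2, 0.7, 0.1, 0.6],
    "class_C": [0.8, 0.0, 0.3, 0.1],
}

d1, d2 = two_nearest_classes(sample, class_weights)
print(sample)   # [1, 0, 1, 0]
print(d1, d2)   # class_A class_C
```

Returning two candidates rather than one mirrors the described workflow, in which the specialist makes the final choice between the two closest diseases.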
In the diagnosis phase, the Euclidean distance is first computed for the collected weights. The system then determines two candidate diseases ($d_1$, $d_2$), namely those with the minimum distances to the goal. At this step, it is the specialist's task to examine the observations and decide which of the two results ($d_1$ or $d_2$) is the correct disease.

VII. EXPERIMENTAL RESULTS
This study was performed on data from different types of leukemia collected at the Iran Blood Transfusion Organization (IBTO). The data consist of 17 attributes and 304 records. 80% of the data were used in the training phase and the remainder in the testing phase. The performance evaluation and correctness are tabulated in Table 2.

TABLE 2. CONFUSION MATRIX REPRESENTATION.

                    Classification accuracy (%)
Model               50%-50% training-test partition    80%-20% training-test partition
Presented model     98.46                              98.44

VIII. CONCLUSIONS
In this study, an expert diagnosis system for classifying leukemia based on a cooperative game is presented. This is the first time that a cooperative game has been used for classifying leukemia. The proposed method yields high classification accuracies (98.46% and 98.44% for the 50–50% and 80–20% training-test partitions, respectively). In view of these results, the proposed model is very promising for classifying leukemia and can assist hematologists and physicians in making better decisions. We believe that this model can be used in other diagnostic systems. Because the cost of computing the Shapley value grows exponentially with the number of players, the processing time increases rapidly as the number of markers grows. The limitation of the model is therefore that it is time-consuming, and the modeling system considered here is not well suited for on-line use. Reducing the processing time appears to be an interesting and challenging research perspective.

REFERENCES
[1] B. Wood, M. Borowitz, N. Abraham, H. Massey, M. Bluth, I. Miller, G. Threatte, R. Hutchison, F. Unger, M. Lifshitz, "Henry's Clinical Diagnosis and Management by Laboratory Methods", Twenty-First Edition, pp. 599–600, 2007.
[2] S. M. Chou, T. S. Lee, Y. E. Shao, I. F. Chen, "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines", Expert Systems with Applications, vol. 27, pp. 133–142, 2004.
[3] S. Sahan, K. Polat, H. Kodaz, S. Günes, "A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis", Computers in Biology and Medicine, vol. 37, pp. 415–423, 2007.
[4] M. Karabatak, M. Cevdet, "An expert system for detection of breast cancer based on association rules and neural network", Expert Systems with Applications, vol. 36, pp. 3465–3469, 2009.
[5] M. Akay, "Support vector machines combined with feature selection for breast cancer diagnosis", Expert Systems with Applications, vol. 36, pp. 3240–3247, 2009.
[6] R. Desai, F. Lin, "Medical Diagnosis with a Kohonen LVQ2 Neural Network", 2001.
[7] E. Bollschweiler, S. Mönig, K. Hensler, E. Baldus, K. Maruyama, A. Hölscher, "Artificial Neural Network for Prediction of Lymph Node Metastases in Gastric Cancer: A Phase II Diagnostic Study", Annals of Surgical Oncology, vol. 11(5), pp. 506–511, 2004.
[8] S. Lee, J. Kang, M. Suh, "Comparison of Hospital Charge Prediction Models for Colorectal Cancer Patients: Neural Network vs. Decision Tree Models", J Korean Med Sci, vol. 19, pp. 677–681, 2004.
[9] N. Bellomo, M. Delitala, "From the mathematical kinetic, and stochastic game theory to modelling mutations, onset, progression and immune competition of cancer cells", Physics of Life Reviews, vol. 5, pp. 183–206, 2008.
[10] L. Bach, S. Bentzen, J. Alsner, F. Christiansen, "An evolutionary-game model of tumour–cell interactions: possible relevance to gene therapy", European Journal of Cancer, vol. 37, pp. 2116–2120, 2001.
[11] Y. Mansury, M. Diggory, T. Deisboeck, "Evolutionary game theory in an agent-based brain tumor model: Exploring the 'Genotype–Phenotype' link", Journal of Theoretical Biology, vol. 238, pp. 146–156, 2006.
[12] V. Fragnelli, S. Moretti, "A game theoretical approach to the classification problem in gene expression data analysis", Computers and Mathematics with Applications, vol. 55, pp. 950–959, 2007.
[13] D. Michie, D. J. Spiegelhalter, C. C. Taylor, "Machine learning, neural and statistical classification", London: Ellis Horwood, 1994.
[14] K. P. Bennett, O. L. Mangasarian, "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software, vol. 1, pp. 23–34, 1992.
[15] E. Freed, F. Glover, "A linear programming approach to the discriminant problem", Decision Sciences, vol. 12(1), pp. 68–74, 1981.
[16] R. C. Grinold, "Mathematical programming methods of pattern classification", Management Science, vol. 19(3), pp. 272–289, 1972.
[17] F. W. Smith, "Pattern classifier design by linear programming", IEEE Transactions on Computers, vol. C-17(4), pp. 367–372, 1968.
[18] A. Brandenburger, "Cooperative Game Theory: Characteristic Functions, Allocations, Marginal Contribution", 2007.
[19] R. Branzei, D. Dimitrov, S. Tijs, "Models in Cooperative Game Theory", Springer, 2008.
[20] S. Cohen, E. Ruppin, "Feature Selection Based on the Shapley Value", 2005.
[21] L. Shapley, "A value for n-person games", in: H. W. Kuhn, A. W. Tucker (Eds.), Contributions to the Theory of Games II, Annals of Mathematics Studies, vol. 28, Princeton University Press, Princeton, pp. 307–317, 1953.