Towards Intelligent Decision Making for Risk Screening
Panagiotis Moutafis and Ioannis A. Kakadiaris
Computational Biomedicine Lab, Department of Computer Science
University of Houston, 4800 Calhoun Rd., Houston, TX 77004
{pmoutafis,ioannisk}@uh.edu
ABSTRACT
Predicting the best next test for medical diagnosis is crucial as it can speed up diagnosis and reduce medical expenses. This determination should be made by fully utilizing the available information in a personalized manner for each patient. In this paper, we propose a method that uses synthesis to infer the best learning cohort for the patient under consideration. The constrained sample space is then used to select the best next test by maximizing the expected information gain.
General Terms
Decision making, Risk Screening

Keywords
Synthesis, Information Gain

Table 1: Notation overview. The number of samples is denoted by m, while the set L = {1, 2, ..., c} denotes the class labels.

Symbol                                  Description
Y = [y_i | y_i ∈ R^ψ, i = 1, ..., m]    High-dimensional input
X = [x_i | x_i ∈ R^χ, i = 1, ..., m]    Low-dimensional input
L = [l_i | l_i ∈ L, i = 1, ..., m]      Class labels
f_D : R^ψ → R^α                         Discriminative projection
f_C : R^χ → R^α                         Coupling projection
f_R : R^α → R^β                         Refining projection
1. INTRODUCTION
The epidemiological medical data available for research usually include many measurements for each patient. However, less information is available for each individual patient during regular medical visits. This paper addresses the problem of exploiting high-dimensional training data to select the best next test in a personalized manner for a patient under consideration. The impact is twofold: (i) faster and more accurate diagnosis, and (ii) better allocation of medical resources. The importance of each test varies for different groups of patients [1]. Therefore, some approaches examine local regions of the problem space separately to determine the importance of each test in a personalized manner. Domingos [4] used a clustering approach, and Chi [3] proposed a method that recursively determines the most relevant feature. However, existing approaches do not fully exploit the high-dimensional training data. To address this problem, we propose a method dubbed Intelligent Decision Making (IDM). Our approach learns coupled projections of the low- and high-dimensional data to a common subspace with increased class separation. Our rationale is that the class-separated projected data can be used to infer the best learning cohort in a more informed manner.
Each low-dimensional test sample is projected to the common subspace, and its k-nearest neighbors among the projected high-dimensional samples are used to determine the best learning cohort. Using the corresponding high-dimensional data, the best next test is selected according to the expected information gain rule based on Shannon entropy. Research on decision trees has shown that this process results in fewer rules [2]; in our application domain, this translates to faster diagnosis. The contribution of this paper is a framework that: (i) fully exploits the high-dimensional information, and (ii) determines the best next test in a personalized manner.
2. INTELLIGENT DECISION MAKING
Our approach comprises two steps: (i) synthesis, and (ii) cohort selection and best next test determination. The notation used in this paper is summarized in Table 1.

Synthesis: In general, the goal of synthesis is to seek the best high-dimensional representation of a low-dimensional sample. In this paper, we adopt a slightly different approach and seek a common subspace that results in increased class separation. Hence, the projected data can be used to infer the best learning cohort more effectively. The proposed synthesis model comprises a projection f_D for the high-dimensional epidemiological data available a priori and a projection f_C for the low-dimensional patient data available during the medical visit. An overview is presented in Fig. 1.

Learning f_D: This projection is learned with the goal of mapping the high-dimensional data to a subspace R^α that maximizes class separation. It ignores the low-dimensional data because the high-dimensional data contain more discriminative information [7]. Different approaches can be employed to learn f_D. In this paper, we use Local Fisher Discriminant Analysis (LFDA) [5] because: (i) it exploits the class labels using local constraints, (ii) it relies on closed-form formulas, and (iii) it naturally performs dimensionality reduction.
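To make the role of f_D concrete, the sketch below fits a discriminative projection on row-major training data. It is a minimal sketch, not the authors' code: scikit-learn does not ship LFDA, so ordinary Linear Discriminant Analysis stands in for the local variant of [5], and the function name learn_f_D and the array shapes are our own illustrative assumptions.

```python
# A minimal sketch of learning f_D, assuming row-major data (one sample
# per row). Plain LDA is used as a stand-in for LFDA [5]; the locality
# constraints that LFDA adds are omitted here.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def learn_f_D(Y, labels, alpha):
    """Fit f_D : R^psi -> R^alpha with increased class separation.

    Y      : (m, psi) high-dimensional training samples.
    labels : (m,) class labels from L = {1, ..., c}.
    alpha  : target dimensionality (at most c - 1 for plain LDA).
    """
    lda = LinearDiscriminantAnalysis(n_components=alpha).fit(Y, labels)
    return lda.transform  # a callable: f_D(Y) has shape (m, alpha)
```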
Algorithm 1 IDM: Testing
Input: t_j, Y, f_D, f_C, and k
Output: Best next test
1: Project the training and test data (i.e., f_D(Y) and f_C(t_j)).
2: Select the cohort samples Ŷ = {y_i | f_D(y_i) ∈ N_k(f_C(t_j))}.
3: Determine the best next test τ* using Eq. (6).
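Under the same row-major convention, Algorithm 1 can be sketched as below. This is a hedged illustration: idm_testing is our name, Euclidean distance is assumed for the nearest-neighbor step (the paper does not state the metric), and best_next_test is the information-gain step, sketched after Eqs. (4)-(6) below.

```python
# A sketch of Algorithm 1 (testing), assuming f_D and f_C are callables
# that map row-major arrays into the common subspace.
import numpy as np

def idm_testing(t_j, Y, labels, f_D, f_C, k):
    proj_train = f_D(Y)                 # step 1: project the training data
    proj_test = f_C(t_j[None, :])       # step 1: project the test sample
    # Step 2: cohort = the k nearest projected training samples to f_C(t_j).
    dists = np.linalg.norm(proj_train - proj_test, axis=1)
    cohort_idx = np.argsort(dists)[:k]
    # Step 3: maximize the expected information gain over attributes, Eq. (6)
    # (best_next_test is sketched after Eqs. (4)-(6) below).
    return best_next_test(Y[cohort_idx], labels[cohort_idx])
```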
Figure 1: Overview of the proposed method. The indices indicate the sample identifier.
Learning f_C: This projection is learned with the goal of mapping the low-dimensional data to their corresponding f_D-projected high-dimensional pairs because: (i) an analytical solution can be easily computed, (ii) it has been found to generalize well in similar problems [6], and (iii) it implicitly incorporates the discriminative information exploited by f_D. In particular, we define:

f_C = \arg\min_{f_C} \| f_D(Y) - f_C(X) \|_F^2 + \mu \| f_C \|_F^2 ,    (1)

where \| \cdot \|_F denotes the Frobenius norm (f_C is identified with its projection matrix) and μ denotes the weight of the regularization term. This ridge regression problem can be solved in closed form as:

f_C = f_D(Y) X^\top (X X^\top + \mu I)^{-1} ,    (2)

where I is the χ × χ identity matrix.

Learning f_R: To enhance class separation by simultaneously utilizing the low- and high-dimensional projected data, a refining projection is learned that uses as input both f_D(Y) and f_C(X). The projections f_D and f_C are then refined as:

f_D \leftarrow f_R \circ f_D  and  f_C \leftarrow f_R \circ f_C .    (3)
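Eq. (2) translates directly into a few lines of linear algebra, sketched below for row-major arrays (Table 1 stores samples as columns, so the transposes restore the column-based formula). Note the loud caveat on the second function: the paper does not specify how f_R is learned, so the plain-LDA refinement is purely our placeholder, and both function names are our own.

```python
# A minimal sketch of Eqs. (1)-(3); learn_f_D is the sketch given earlier.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def learn_f_C(fD_Y, X, mu):
    """Eq. (2): W_C = f_D(Y) X^T (X X^T + mu I)^{-1} in column notation.

    fD_Y : (m, alpha) projected training data f_D(Y), one sample per row.
    X    : (m, chi) low-dimensional training data, one sample per row.
    """
    chi = X.shape[1]
    A = X.T @ X + mu * np.eye(chi)           # (chi, chi); X X^T column-wise
    W_C = np.linalg.solve(A, X.T @ fD_Y).T   # (alpha, chi); A is symmetric
    return lambda X_new: X_new @ W_C.T       # f_C on row-major inputs

def learn_f_R(fD_Y, fC_X, labels, beta):
    """Eq. (3): refine by pooling both projected views. The paper leaves
    the learner for f_R unspecified; plain LDA is assumed, as in learn_f_D."""
    pooled = np.vstack([fD_Y, fC_X])                  # both views, stacked
    pooled_labels = np.concatenate([labels, labels])  # labels repeat per view
    f_R = LinearDiscriminantAnalysis(n_components=beta).fit(pooled, pooled_labels)
    return f_R.transform  # compose: f_D <- f_R o f_D and f_C <- f_R o f_C
```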
Cohort selection and best next test determination: The goal of cohort selection is to determine the best group of epidemiological data to use for a patient under consideration. The best next test is then selected by maximizing the expected information gain. Specifically, both the training data Y and a given low-dimensional test sample t_j are projected to the common subspace R^β using f_D and f_C, respectively. The k-nearest neighbors of f_C(t_j) and their corresponding high-dimensional representations Ŷ = {y_i | f_D(y_i) ∈ N_k(f_C(t_j))} are then found. The set Ŷ is used to select the best next test τ. An overview is provided in Alg. 1.

The entropy measures the uncertainty of the training set due to the different classes. We denote the frequency of each class by p_i, where i ∈ L. Hence, the entropy of the whole training set is defined as:

A = -\sum_{i=1}^{c} p_i \log_2 p_i    (4)

for p_i ≠ 0. The entropy is maximized when the samples are equally distributed among the different classes, and it is zero when all of them belong to a single class. For one attribute τ at a time, the training set is split into subsets and the new entropy values E_i(τ) of each subset are computed. These values are then weighted by the proportion w_i of the original samples in each of the subsets to form E(τ). That is, E(τ) = \sum_{i=1}^{c} w_i E_i(τ), where \sum_{i=1}^{c} w_i = 1. Finally, the information gain for this attribute is computed as:

Gain(τ) = A - E(τ) .    (5)

High values of Gain(τ) correspond to attributes with good discriminative properties. Thus, the best next test is determined by:

τ* = \arg\max_τ Gain(τ) .    (6)
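For completeness, a sketch of Eqs. (4)-(6) follows. Splitting on each attribute's distinct values is our assumption here (continuous measurements would first need discretization), following the standard decision-tree formulation of [2].

```python
# A sketch of Eqs. (4)-(6): class entropy, the weighted post-split
# entropy E(tau), and the gain-maximizing attribute (best next test).
import numpy as np

def entropy(labels):
    """Eq. (4): A = -sum_i p_i log2 p_i over the class frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_next_test(cohort, labels):
    """Eqs. (5)-(6): index of the attribute tau with maximal Gain(tau).

    cohort : (k, psi) high-dimensional cohort samples.
    labels : (k,) class labels of the cohort samples.
    """
    A = entropy(labels)
    gains = []
    for tau in range(cohort.shape[1]):
        E_tau = 0.0
        for v in np.unique(cohort[:, tau]):   # split on attribute tau's values
            subset = labels[cohort[:, tau] == v]
            w = len(subset) / len(labels)     # proportion weight w_i
            E_tau += w * entropy(subset)      # E(tau) = sum_i w_i E_i(tau)
        gains.append(A - E_tau)               # Gain(tau) = A - E(tau), Eq. (5)
    return int(np.argmax(gains))              # Eq. (6)
```

In practice, one would restrict the argmax to tests not yet administered to the patient under consideration.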
3. CONCLUSION
Our approach moves towards intelligent decision making in risk screening. A synthesis step is performed that fully exploits the high-dimensional data to infer the best group of cohort patients for the patient under consideration. The best next test is then selected by maximizing the expected information gain.
Acknowledgments
This research was funded in part by the University of Houston Hugh Roy and Lillie Cranz Cullen Endowment Fund. All statements of fact, opinion, or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the University of Houston.
4. REFERENCES
[1] D. Aha and R. Goldstone. Concept learning and flexible weighting. In Proc. 14th Annual Conference of the Cognitive Science Society, pages 534-539, Bloomington, IN, July 29 - August 1, 1992.
[2] M. Bramer. Decision tree induction: Using entropy for attribute selection. Principles of Data Mining, pages 51-64, 2007.
[3] C.-L. Chi. Medical decision support systems based on machine learning. PhD thesis, 2009.
[4] P. Domingos. Control-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11(1-5):227-253, 1997.
[5] M. Sugiyama, T. Idé, S. Nakajima, and J. Sese. Semi-supervised local Fisher discriminant analysis for dimensionality reduction. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 333-344, Osaka, Japan, May 20-23, 2008.
[6] S. Wang, L. Zhang, Y. Liang, and Q. Pan. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1518-1525, Providence, RI, June 16-21, 2012.
[7] W. W. Zou and P. Yuen. Very low resolution face recognition problem. IEEE Transactions on Image Processing, 21(1):327-340, 2012.