2015 17th International Conference on E-health Networking, Application & Services (HealthCom)

Applying Matrix Factorization in Data Reconstruction for Heart Disease Patient Classification

Huaxia Wang and Yu-Dong Yao

Wei Qian and Fleming Lure

Department of Electrical and Computer Engineering Stevens Institute of Technology Hoboken, NJ, USA Email: {hwang38, yyao}@stevens.edu

Department of Electrical and Computer Engineering University of Texas El Paso, TX, USA Email: [email protected]; [email protected]

Abstract—Heart disease is one of the most severe illnesses affecting human health. Developing accurate and efficient methods to diagnose heart disease is crucial to providing good heart healthcare to patients. In this paper, a data mining based technique for diagnosing heart disease is introduced, in which heart disease related patient data sets are utilized. A matrix factorization based technique for missing data reconstruction is presented. Numerical results show that the reconstructed data sets achieve reliable diagnosis or classification performance comparable to that obtained with the original complete patient data sets.

Index Terms—Data mining, Matrix factorization, Machine learning, Classification.

I. INTRODUCTION

Heart disease is one of the most common diseases of the circulatory system. In general, heart disease refers to heart attacks or chest pain caused by narrowed or blocked blood vessels. Both invasive and non-invasive examinations are used to diagnose heart disease: invasive examination involves cardiac catheterization and selective angiocardiography, while non-invasive examination includes a series of electrocardiographic examinations and ultrasound cardiograms. Accurately diagnosing patients with heart disease has a profound influence on patient recovery.

Data mining refers to using machine learning and statistical knowledge to explore specific patterns in a set of data. In general, the techniques used in data mining include supervised/unsupervised learning [1], affinity grouping, clustering, and description. Supervised learning analyzes labeled training data and generates an inferred function which can be used to classify new testing data. Unsupervised learning, in contrast, does not require the training data to be labeled and is closely related to density estimation in statistics.

Recently, researchers have been applying data mining techniques in the healthcare field. In [2], the support vector machine (SVM) algorithm was investigated as a method to classify genes using gene expression data. In [3], a healthcare quality monitoring method was introduced in which an outlier detection approach was developed.


In [4] and [5], methods to diagnose heart disease based on the Cleveland heart disease data set were presented. Note that [4] and [5] use only the complete data samples to perform detection or prediction; incomplete data sets are not used. In this paper, matrix factorization theory [6] is applied to recover such incomplete data sets or matrices.

We consider a heart disease diagnosis and classification model based on a reconstructed sample matrix. First, a matrix factorization technique is investigated to reconstruct the missing data, and an appropriate latent factor size is selected based on experiments using the original complete samples. After that, different machine learning algorithms are utilized to perform prediction, and we compare the prediction performance using both the complete and the reconstructed data sets.

The rest of this paper is organized as follows. The matrix factorization technique is presented in Section II. In Section III, several machine learning classification algorithms are introduced. Section IV presents the experimental results of the proposed method. Finally, conclusions are given in Section V.

II. MATRIX FACTORIZATION TECHNIQUE AND DATA RECONSTRUCTION IN HEART DISEASE DATASET

Matrix factorization is a method of splitting a matrix into a product of matrices and has been shown to be an effective approach for reconstructing data sets in recommender systems [6]. In this section, we first review the topic of matrix factorization and then analyze a matrix factorization technique for heart disease data reconstruction.

A. Matrix Factorization Techniques

When solving linear equations, one efficient approach is LU decomposition [7], where a matrix is factorized into an L part (lower triangular matrix) and a U part (upper triangular matrix). After LU decomposition, the original problem $Ax = b$ is converted into $(LU)x = b$, which can easily be solved in two steps: solving $Ly = b$ and then $Ux = y$, as sketched in the example below.
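The following minimal sketch illustrates this two-step triangular solve using SciPy; the library choice and the small example matrix are assumptions made for illustration only, since the paper does not describe any implementation.

```python
# Illustrative sketch: solving A x = b through LU factorization.
# The 3x3 matrix and right-hand side are arbitrary example values.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0, 0.0],
              [6.0, 3.0, 1.0],
              [0.0, 2.0, 5.0]])
b = np.array([7.0, 10.0, 7.0])

lu, piv = lu_factor(A)      # factorize A (with partial pivoting) into L and U
x = lu_solve((lu, piv), b)  # two triangular solves: forward then back substitution
print(x)                    # solution of A x = b
```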


Since each step involves only a triangular matrix, the computational cost is reduced significantly. QR decomposition [7] is another classical matrix factorization method which can be used to solve linear equations, where the original matrix $A$ is decomposed into an orthogonal matrix $Q$ and an upper triangular matrix $R$. Notice that, since $Q$ is orthogonal, we have $Q^{-1} = Q^{H}$, so instead of $Ax = b$ we solve $Rx = Q^{H}b$. Singular value decomposition (SVD) [7] is a matrix decomposition method closely related to eigendecomposition: a real or complex matrix $A$ is factorized as $A = U\Sigma V^{H}$, where $U$ and $V$ are unitary matrices and $\Sigma$ is a diagonal matrix with non-negative elements on the diagonal. The SVD technique has been widely used in signal processing research; for instance, the autocorrelation matrix of received data can be split into a signal subspace and a noise subspace through SVD [8]. There are other matrix factorization techniques such as Cholesky decomposition, Jordan decomposition, rank factorization, etc. In the following, we present the details of the rank factorization method and its effectiveness in heart disease data reconstruction.

B. Data Reconstruction Using Matrix Factorization Techniques

Latent variables are variables which cannot be directly observed in a specific model. Latent factor models [6] are used to explain the observable variables in terms of latent variables, and the matrix factorization approach is one of the most successful realizations of latent factor models. Accordingly, assume $\mathbf{A}_{U \times F}$ is the data matrix in our heart disease model, where the number of rows $U$ represents the total number of users and the number of columns $F$ represents the number of attributes for each user. Assuming there are $K$ latent factors, the original matrix $\mathbf{A}$ can be approximately written as $\mathbf{A} \approx \mathbf{P}\mathbf{Q}^{H} = \tilde{\mathbf{A}}$, where $\mathbf{P} \in \mathbb{R}^{U \times K}$ and $\mathbf{Q} \in \mathbb{R}^{F \times K}$. Denote $\tilde{a}_{ij}$ ($a_{ij}$) as the $ij$th estimated (original) element of matrix $\tilde{\mathbf{A}}$ ($\mathbf{A}$) and represent $\tilde{a}_{ij}$ as

$$\tilde{a}_{ij} = \mathbf{p}_{i}^{H}\mathbf{q}_{j} = \sum_{k=1}^{K} p_{ik} q_{kj} \qquad (1)$$

where $\mathbf{p}_{i}$ and $\mathbf{q}_{j}$ are the $K \times 1$ vectors formed from the $i$th row of $\mathbf{P}$ and the $j$th row of $\mathbf{Q}$, respectively. Denote $e_{ij}$ as the error between $a_{ij}$ and $\tilde{a}_{ij}$; then

$$e_{ij}^{2} = \left(a_{ij} - \tilde{a}_{ij}\right)^{2} = \left(a_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^{2} \qquad (2)$$

Considering regularization to avoid overfitting, the error $e_{ij}$ can be defined as [6]

$$e_{ij}^{2} = \left(a_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^{2} + \frac{\beta}{2}\sum_{k=1}^{K}\left(\|\mathbf{P}\|^{2} + \|\mathbf{Q}\|^{2}\right) \qquad (3)$$

where the regularization parameter $\beta$ is typically set to around 0.02.

Taking the derivative of the above equation with respect to $p_{ik}$ and $q_{kj}$, we have

$$\frac{\partial e_{ij}^{2}}{\partial p_{ik}} = -2 e_{ij} q_{kj} + \beta p_{ik} \qquad (4)$$

$$\frac{\partial e_{ij}^{2}}{\partial q_{kj}} = -2 e_{ij} p_{ik} + \beta q_{kj} \qquad (5)$$
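As a brief check of (4), assuming the norm penalty in (3) is applied elementwise to $p_{ik}$ and $q_{kj}$ (which is how the update rules below treat it), the gradient follows from the chain rule:

$$\frac{\partial e_{ij}^{2}}{\partial p_{ik}} = 2\left(a_{ij} - \sum_{k'=1}^{K} p_{ik'} q_{k'j}\right)\left(-q_{kj}\right) + \frac{\beta}{2}\cdot 2\,p_{ik} = -2 e_{ij} q_{kj} + \beta p_{ik},$$

and (5) follows by symmetry.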

The values of $p_{ik}$ and $q_{kj}$ can then be updated by gradient descent as follows [6]:

$$p_{ik}^{\mathrm{new}} = p_{ik} - \alpha \frac{\partial e_{ij}^{2}}{\partial p_{ik}} = p_{ik} + \alpha\left(2 e_{ij} q_{kj} - \beta p_{ik}\right) \qquad (6)$$

$$q_{kj}^{\mathrm{new}} = q_{kj} - \alpha \frac{\partial e_{ij}^{2}}{\partial q_{kj}} = q_{kj} + \alpha\left(2 e_{ij} p_{ik} - \beta q_{kj}\right) \qquad (7)$$

where $\alpha$ is a constant (the learning rate) which determines the update step size and is usually set to a very small value.

In our experiment, the heart disease data set [4], [5] contains 1541 samples, of which only 297 are complete; the remaining 1244 samples have missing attributes. We attempt to reconstruct those missing data through the matrix factorization technique discussed above (a minimal illustrative sketch is given below). Experimental results and further analysis are presented in Section IV.
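The sketch below follows the iterative updates (6)-(7) for real-valued data (so the conjugate transpose reduces to an ordinary transpose). The toy matrix, missing-entry pattern, latent factor size, and hyper-parameter values are illustrative assumptions, not the paper's actual data set or settings.

```python
# Illustrative sketch of missing-data reconstruction via matrix factorization.
import numpy as np

def mf_reconstruct(A, mask, K=2, alpha=0.002, beta=0.02, n_iter=5000):
    """Approximate A ~= P Q^T using only the observed entries (mask == True),
    updating P and Q with the gradient steps in (6) and (7)."""
    U, F = A.shape
    rng = np.random.default_rng(0)
    P = rng.random((U, K))
    Q = rng.random((F, K))
    for _ in range(n_iter):
        for i in range(U):
            for j in range(F):
                if mask[i, j]:
                    e_ij = A[i, j] - P[i, :] @ Q[j, :]                        # prediction error
                    P[i, :] += alpha * (2 * e_ij * Q[j, :] - beta * P[i, :])  # update (6)
                    Q[j, :] += alpha * (2 * e_ij * P[i, :] - beta * Q[j, :])  # update (7)
    return P @ Q.T  # reconstructed matrix, including estimates for missing entries

# Toy 4x4 example with two missing attributes marked as NaN.
A = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, 2.0, 1.0],
              [1.0, 1.0, 3.0, 5.0],
              [2.0, 1.0, 3.0, 4.0]])
mask = ~np.isnan(A)
A_hat = mf_reconstruct(np.where(mask, A, 0.0), mask, K=2)
print(np.round(A_hat, 2))
```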

III. HEART DISEASE PATIENT CLASSIFICATION ALGORITHMS

There are many machine learning algorithms for classification problems. In this section, we briefly discuss the algorithms used in our experiment.

A. Support Vector Machine

The support vector machine, a supervised machine learning algorithm, has been widely used in classification problems. The main idea and steps of the SVM algorithm can be summarized as follows. (1) Use a canonical equation to define an optimal hyperplane so that the data from two different classes have the maximum margin width between the classes. (2) For a problem that is not linearly separable, introduce slack variables [9] to relax the constraints of the canonical equation; the goal is to find a hyperplane with a minimum misclassification rate. (3) Map the data to a higher-dimensional space using kernel-based learning, where it is easier to classify with linear decision surfaces [10], and reformulate the problem so that the data does not need to be mapped explicitly to this space. The kernel function can take many forms; in this paper, we use a polynomial kernel function [9], which can be expressed as

$$k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \left(\mathbf{x}_{i} \cdot \mathbf{x}_{j} + c\right)^{2} \qquad (8)$$

We will compare the classification accuracy of both linear and kernel-based SVM algorithms in our experiment; an illustrative sketch of such a comparison is given below.
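The sketch below trains a linear SVM and a degree-2 polynomial-kernel SVM with scikit-learn. The synthetic data, the kernel constant c = 1, and the train/test split are assumptions for demonstration only and do not reproduce the paper's experimental setup.

```python
# Illustrative comparison of linear and polynomial-kernel SVM classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for heart disease attributes (binary label: disease or healthy).
X, y = make_classification(n_samples=300, n_features=13, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

linear_svm = SVC(kernel="linear")
# Polynomial kernel (x_i . x_j + c)^2 as in (8), with c = 1 (gamma fixed to 1).
poly_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)

for name, clf in [("linear SVM", linear_svm), ("polynomial-kernel SVM", poly_svm)]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))
```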


B. K Nearest Neighbor

The K-NN method is a type of instance-based learning and has been widely used due to its simple implementation. The basic idea of K-NN is to assign an object to the class to which the majority of its K nearest neighbors belong [11]. A commonly utilized distance metric in the K-NN algorithm is the Euclidean distance; alternatively, the similarity between variables can be measured using their Hamming distance. The fuzzy set [12] has been introduced to further strengthen the reliability of the K-NN algorithm.

• Traditional K-NN Algorithm: We first calculate the Euclidean distance between the testing sample's attributes and the corresponding attributes of each training sample. The distance values are then sorted in ascending order, and the first K elements in the ordered list are examined (note that K should be an odd number). Among these K elements, if the number of samples from the healthy group is larger than the number from the disease group, we claim that the testing sample of interest belongs to the healthy group, and vice versa (a minimal sketch of this classifier is given at the end of this subsection).

• Fuzzy-Set Based K-NN Algorithm: The fuzzy set was introduced by Zadeh in 1965, and researchers have found numerous ways to apply this theory in many fields. The so-called fuzzy-set based K-NN [12] was introduced to improve the performance of traditional K-NN in practical use. Consider a set of $n$ sample vectors $\{x_{1}, \ldots, x_{n}\}$ and $c$ classes, where each vector has a membership value for each class. Denote $u_{ik} = u_{i}(x_{k})$, for $i = 1, \ldots, c$ and $k = 1, \ldots, n$, as the membership degree of $x_{k}$ in class $i$. The memberships should satisfy

$$\sum_{i=1}^{c} u_{ik} = 1, \qquad 0 < \sum_{k=1}^{n} u_{ik} < n$$
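A minimal sketch of the traditional K-NN classifier described above (Euclidean distance, majority vote over an odd K) is shown below; the synthetic data and the choice K = 5 are illustrative assumptions only.

```python
# Illustrative sketch of traditional K-NN classification.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for heart disease attributes (binary label: disease or healthy).
X, y = make_classification(n_samples=300, n_features=13, n_informative=8,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Euclidean distance, majority vote over the K = 5 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_tr, y_tr)
print("K-NN accuracy:", knn.score(X_te, y_te))
```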
