Constrained Nonnegative Matrix Factorization for Data Privacy

Nirmal Thapa, Lian Liu, Pengpeng Lin, Jie Wang, and Jun Zhang

Abstract— The amount of data being produced has increased rapidly, and so has the number of data mining methods that aim to discover hidden patterns and knowledge in that data. With this growth comes the risk that confidential data will be disclosed; this paper is an effort to prevent such disclosure. We apply constrained nonnegative matrix factorization (NMF) to achieve what is known as dual privacy protection, which accounts for both data hiding and pattern hiding, although in this paper we focus mainly on pattern hiding. To add the constraint, we change both the objective function and the update rules in the NMF computation. When the procedure reaches convergence, it yields a new dataset in which the patterns considered confidential are suppressed. The effectiveness of this novel hiding technique is examined on two benchmark datasets (IRIS and YEAST). We show that an optimal solution can be computed in which the user-specified confidential memberships or relationships are hidden without undesirable alterations to non-confidential patterns, alterations also referred to as side effects in this paper. The paper also presents our findings on how the different parameters should vary to achieve convergence.
Keywords: Nonnegative Matrix Factorization, Privacy Protection, Data Hiding, Constraint, K-means.

Nirmal Thapa is with the Department of Computer Science, University of Kentucky, Lexington, KY 40506. Tel: (859) 227-6786, Fax: (859) 323-1971, Email: [email protected]
Lian Liu is with the Department of Computer Science, University of Kentucky, Lexington, KY 40506. Tel: (859) 218-6558, Fax: (859) 323-1971, Email: [email protected]
Pengpeng Lin is with the Department of Computer Science, University of Kentucky, Lexington, KY 40506. Tel: (859) 218-6558, Fax: (859) 323-1971, Email: [email protected]
Jie Wang is an Assistant Professor in the Department of Computer Information Systems, Indiana University Northwest, Gary, IN 46408. Tel: (219) 980-6623, Email: [email protected]
Jun Zhang is a Professor in the Department of Computer Science, University of Kentucky, Lexington, KY 40506. Tel: (859) 257-3892, Fax: (859) 323-1971, Email: [email protected]

I. INTRODUCTION

Privacy concerns in data mining have grown considerably in recent years. Techniques such as value-class membership, value distortion [1], matrix factorization [2], heuristic-based techniques, and cryptography-based techniques [3] have been areas of key interest for researchers, each algorithm having its own purpose and limitations. NMF, a matrix factorization technique, has been used in various scenarios such as text mining [8], part-based learning, handwritten digit recognition [10], and many more. Singular value decomposition and nonnegative matrix factorization for the purpose of privacy preservation have been studied by Wang et al. [4] and Xu et al. [5]. Wang in particular studied how pattern hiding, in terms of clustering, can be achieved using NMF. Clustering is a very widely studied topic that has been used
in different areas including machine learning, data mining, pattern recognition, image analysis, information retrieval, etc. There are many algorithms available for clustering; among them, k-means is one of the most popular and widely used techniques. Work utilizing NMF for clustering is not a new idea, but [11] goes one step further and presents the idea of the similarity between k-means and NMF. In this paper, we present our idea of combining clustering and NMF for the purpose of membership hiding by imposing an additional constraint on NMF. NMF with additional constraints, such as an orthogonality constraint [6] or a sparseness constraint [7], has been applied to various fields. Our study uses constrained nonnegative matrix factorization for the purpose of hiding particular memberships in a data analysis task. Some initial work in this field, i.e., applying NMF to privacy protection, was done by Wang et al. [4], [2]. The work in [4] applies NMF in a first phase and then tries to suppress the data patterns using different ad-hoc algorithms. This paper proposes the explicit incorporation of an additional constraint so that the data patterns are suppressed in the process of performing the matrix factorization itself, making it a single-stage operation.

II. BACKGROUND

A. K-means clustering

There are many clustering algorithms, such as k-means and its variants, hierarchical clustering, and density-based clustering. As mentioned earlier, k-means is the most popular clustering algorithm. The basic objective of k-means is to cluster n data items (x_1, x_2, ..., x_n) into k sets (k ≤ n), S = (S_1, S_2, ..., S_k), so as to minimize the within-cluster sum of squares. Euclidean distance is used as the metric and variance is used as the measure of cluster scatter. Common applications of the k-means algorithm are image segmentation and principal component analysis [12], [13]. This paper uses k-means to compare the results: the experiments first run k-means on the original data, and the ground truth is established from that result. Note that the ground truth is whatever k-means returns, which may not be the exact classification. We then perform the constrained NMF and compare the resulting clustering to the one obtained on the original data, to see whether there are any side effects (discussed later) or whether any element that we wanted to change has not been changed.
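As a concrete sketch of that comparison, the snippet below uses scikit-learn's KMeans (our choice; the paper does not name an implementation). Assigning the modified rows to the centroids fitted on the original data is a simplification we adopt to avoid the label-permutation problem between two independent k-means runs; the paper itself reruns k-means on the modified data.

```python
import numpy as np
from sklearn.cluster import KMeans

def side_effect_count(A_orig, A_mod, k, confidential_idx, seed=0):
    """Count non-confidential items whose cluster membership changed.

    A_orig: original data matrix; A_mod: modified matrix (H x W).
    confidential_idx: indices of items whose membership we intended to change.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(A_orig)
    before = km.labels_        # ground truth: k-means on the original data
    after = km.predict(A_mod)  # assign modified rows in the same label space
    nonconf = np.setdiff1d(np.arange(len(before)), confidential_idx)
    return int(np.sum(before[nonconf] != after[nonconf]))
```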
B. Nonnegative Matrix Factorization

There are many kinds of matrix factorization, such as principal component analysis (PCA), singular value decomposition (SVD), and NMF. NMF is different in the sense that it imposes the additional constraint that none of the elements of the factor matrix H and the basis matrix W can be negative. Another notable property of NMF is that its results are non-unique, which makes it an even better fit for data protection. Nonnegative matrix factorization is a technique from linear algebra in which a matrix A is decomposed into the product of two matrices H and W. Since H × W will not in general equal A exactly, R denotes the residual:

NMF(A) ⇒ H × W,    A = H × W + R,    Ã = H × W ≈ A

Formally it can be defined as follows: given a nonnegative data matrix A (n × m), find two nonnegative matrices H (n × k) and W (k × m), with k being the number of clusters in A, that minimize Q, where Q is an objective function defining the nearness between the matrices A and HW. The modified version of A is denoted Ã = H × W. Generally, (n + m)k < nm, which reduces the rank of the original matrix; in other words, the original matrix is compressed. There are two main aspects: one is the objective function and the other is the update rule. The objective function quantifies the quality of the factorization, usually in terms of the distance between the two matrices A and HW; the Euclidean distance, or Frobenius norm, is the common choice. The objective of NMF is then to minimize the distance between A and HW:

min_{H≥0, W≥0} f(A, H, W) = ‖A − HW‖_F^2

Since NMF is an iterative technique, the matrices H and W need to be updated in each iteration; the rule for doing so is termed the update rule. We discuss it further in the following sections.

C. Data Pattern Hiding

Data hiding can be defined as the process of changing the data with the aim of hiding the confidential data while minimizing the alteration to the non-confidential data. This paper focuses mainly on confidentiality in terms of clustering: we want the cluster membership of some particular data not to be disclosed. As said earlier, for a nonnegative data matrix A, NMF generates two nonnegative factor matrices H and W by minimizing the objective function. The matrix W, of size k × m, represents the coefficients for the clusters and defines the basis vectors, while H, of size n × k, contains cluster membership indicators representing the additive combination for each subject. To apply this idea to data pattern hiding, we can determine the cluster membership of a data item by finding the largest element in its factor vector from H, provided the factor vectors are related to the cluster property of the subjects [2]. A subject shifts from one cluster to another whenever its factors are modified; this is the essence on which data pattern hiding is based.
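A quick sketch of this convention, with a toy H of our own:

```python
import numpy as np

# Each row of H holds one subject's cluster-membership indicators;
# the index of the largest entry is the cluster the subject belongs to.
H = np.array([[0.70, 0.20, 0.10],   # subject 0 -> cluster 0
              [0.05, 0.15, 0.80]])  # subject 1 -> cluster 2
memberships = H.argmax(axis=1)      # array([0, 2])
```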
Let us say we have n items in total, with k clusters, and we want to change the cluster membership of an item X that was originally in cluster C_i. In such a case, there are two ways in which we can change the membership. It can be either of the following:
• Change the membership of item X to a particular cluster C_j, such that i ≠ j.
• Change the membership of item X to any cluster other than cluster C_i.
We discuss how to explicitly specify this information in the NMF in a later section. One important aspect that Wang et al. [2] mention in their work is the issue of side effects, which is discussed in the following section.

D. Side Effect

A side effect can be defined as an unwanted change introduced by applying the constrained nonnegative matrix factorization; in our case, a change in the cluster membership of non-confidential data. As side effects are directly related to the utility of the data, it is necessary to keep changes in the cluster membership of non-confidential data to a minimum; any technique must keep side effects minimal in order to be useful. Ideally, all the confidential data are changed and nothing else is altered, and in our method we strive to achieve this goal. There must be some measure of side effects, and for that we compare against the k-means result on the original data: the number of non-confidential subjects whose membership is changed by the application of the method is taken as the measure of side effects.

III. CONSTRAINT ON NONNEGATIVE MATRIX FACTORIZATION

Researchers have come up with different constraints to incorporate into NMF for solving different tasks; some works are based on orthogonality [6] and some on sparseness [7]. Wang et al. [2] used the magnitude of the elements of the factor vector H_x in the H matrix to determine the cluster category of the subject X. Following the previous work on constraints, we add a constraint, which we call the clustering constraint, that results in a matrix H in which, for each item, either one element is significantly larger than the others, representing the item's new cluster, or one element is insignificant in magnitude, ensuring that the item does not fall into that cluster. The objective function can be modified to accommodate penalty terms as

f(A, H, W) = α‖A − HW‖_F^2 + β‖H − C‖_F^2    (1)

Here, C is a matrix of size n × k, and the elements of C are set as follows:
• If the item is not to be changed, its row contains 1 at the index representing its current cluster and 0 elsewhere.
• If the item is to be changed to another particular cluster, its row contains 1 at the index representing the destination cluster and 0 elsewhere; we refer to this as an in a cluster change.
• If the item is to be changed to any other cluster, its row contains 0 at the index representing the source cluster, with the rest of the entries being random numbers in the range [0, 1]; we refer to this as a not in a cluster change.

A typical example:

  C1     C2     C3
  0      0      1      Element in Cluster 3
  1      0      0      Element in Cluster 1
  0.45   0.55   0      Element in any cluster other than Cluster 3
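As an illustration, here is a minimal sketch, ours rather than the paper's, of how such a constraint matrix C might be assembled from the three rules above; the function name and argument layout are hypothetical.

```python
import numpy as np

def build_constraint_matrix(labels, k, in_cluster=None, not_in_cluster=None, seed=0):
    """Build the n-by-k constraint matrix C from membership specifications.

    labels: original cluster index of each item (e.g., from k-means);
    in_cluster: dict {item_index: destination_cluster} for "in a cluster" changes;
    not_in_cluster: iterable of item indices for "not in a cluster" changes.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    C = np.zeros((n, k))
    C[np.arange(n), labels] = 1.0        # unchanged items: 1 at their own cluster
    for i, dest in (in_cluster or {}).items():
        C[i] = 0.0
        C[i, dest] = 1.0                 # 1 at the destination cluster, 0 elsewhere
    for i in (not_in_cluster or []):
        C[i] = rng.random(k)             # random values in [0, 1]
        C[i, labels[i]] = 0.0            # 0 at the source cluster
    return C
```

For instance, with labels = [2, 0, 2], k = 3, and not_in_cluster = [2], the third row of C comes out in the shape of the last row of the example above.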
A. Update Formula

Mathematical derivation of the update formulas. Let

Q = ‖A − HW‖_F^2
  = tr((A − HW)^T (A − HW))
  = tr(A^T A − A^T HW − W^T H^T A + W^T H^T HW)
  = tr(A^T A) − 2 tr(A^T HW) + tr(W^T H^T HW)    (2)

and let

L = ‖H − C‖_F^2
  = tr((H − C)^T (H − C))
  = tr(H^T H − H^T C − C^T H + C^T C)
  = tr(H^T H − 2 H^T C + C^T C)    (3)

• H fixed and W changing:

δf(A, H, W)/δW = δ(α‖A − HW‖_F^2 + β‖H − C‖_F^2)/δW
               = α δQ/δW + β δ(‖H − C‖_F^2)/δW
               = −2α δ(tr(A^T HW))/δW + α δ(tr(W^T H^T HW))/δW
               = −2α H^T A + 2α H^T HW    (4)

(the penalty term ‖H − C‖_F^2 does not depend on W, so its derivative vanishes)

• W fixed and H changing:

δf(A, H, W)/δH = δ(α‖A − HW‖_F^2 + β‖H − C‖_F^2)/δH
               = α δQ/δH + β δ(‖H − C‖_F^2)/δH

The first term gives

α δQ/δH = −2α AW^T + 2α HWW^T

and the second term gives

β δ(‖H − C‖_F^2)/δH = β(2H − 2C) = 2βH − 2βC

Combining the two,

δf(A, H, W)/δH = −2α AW^T + 2α HWW^T + 2βH − 2βC
               = 2α HWW^T + 2βH − 2α AW^T − 2βC    (5)

For an optimal solution, δf/δW = 0 and δf/δH = 0. Hence,

(H^T A) ⊘ (H^T HW) = E    (6)
(α AW^T + βC) ⊘ (H(α WW^T + βI)) = E    (7)

where ⊘ represents element-wise division, I denotes the k × k identity matrix, and E denotes the matrix of all ones (at a fixed point every element-wise ratio equals one). This gives rise to the update formulas for W and H:

W_{ij} ← W_{ij} [H^T A]_{ij} / [H^T HW]_{ij}    (8)
H_{ij} ← H_{ij} [α AW^T + βC]_{ij} / [H(α WW^T + βI)]_{ij}    (9)

B. Objective Function

As mentioned earlier, the objective function needs to be changed to incorporate the constraint. We start from our initial formula (1):

f(A, H, W) = α‖A − HW‖_F^2 + β‖H − C‖_F^2    (10)

The objective is not only to make ‖A − HW‖_F^2 small but to make the sum of both terms small, which gives

min_{H≥0, W≥0} (α‖A − HW‖_F^2 + β‖H − C‖_F^2)    (11)

This is the objective function used to check convergence: if its value falls below a certain limit, the NMF process is considered to have converged.

IV. ALGORITHM

In this section, we present the algorithm devised for data pattern hiding. The constrained NMF algorithm is as follows:

Algorithm 1: Constrained NMF
input : A ∈ R_+^{n×m}, 0 < k ≪ min(n, m), C ∈ R_+^{n×k}, mainItr, tol, maxItr, α, β
output: H ∈ R_+^{n×k}, W ∈ R_+^{k×m}
Initialize H and W with random nonnegative initial estimates:
  H_{ij}^{(0)} ← nonnegative value, 1 ≤ i ≤ n, 1 ≤ j ≤ k
  W_{ij}^{(0)} ← nonnegative value, 1 ≤ i ≤ k, 1 ≤ j ≤ m
for i ← 1 to mainItr do
  for j ← 1 to maxItr do
    H_{ij} ← H_{ij} [α AW^T + βC]_{ij} / [H(α WW^T + βI)]_{ij}
    W_{ij} ← W_{ij} [H^T A]_{ij} / [H^T HW]_{ij}
    Calculate the new Ã
    if value(objective function) ≤ tol then break
  if side effect = 0 then break
  Change the value of α
  Change the value of β

The original data matrix A, k, C, tol, maxItr, mainItr, α, and β are the inputs to the algorithm. The tol provides the stopping criterion, in other words the measurement of convergence; maxItr gives the number of updates to perform on H and W before stopping an NMF run if convergence is not achieved. The output of the algorithm is the two matrices H and W such that Ã = H × W ≈ A, where all the confidential data are hidden and the non-confidential data are intact. The constrained NMF algorithm is run for a certain number of iterations, checking each time whether the pattern hiding has been achieved; if there is any side effect, the algorithm continues to perform NMF, otherwise it stops. The algorithm above does not show how the side effect is calculated: it mainly consists of comparing the k-means result on the modified data with the k-means result on the original data for the non-confidential data, and comparing against what we specified at the beginning for the confidential data.
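The following is a minimal Python/NumPy sketch of Algorithm 1, using update formulas (8) and (9). The side-effect check and the α/β adjustment schedule are our own illustrative assumptions; the paper specifies them only at the level of the description above.

```python
import numpy as np

def constrained_nmf(A, C, k, alpha=0.9, beta=0.1, tol=1e-4,
                    main_itr=10, max_itr=500, eps=1e-9, seed=0):
    """Sketch of Algorithm 1: constrained NMF for cluster-membership hiding."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    H = rng.random((n, k)) + eps   # random nonnegative initial estimates
    W = rng.random((k, m)) + eps
    for _ in range(main_itr):
        for _ in range(max_itr):
            # (9): H <- H * (alpha*A*W^T + beta*C) / (alpha*H*W*W^T + beta*H)
            H *= (alpha * A @ W.T + beta * C) / (alpha * H @ (W @ W.T) + beta * H + eps)
            # (8): W <- W * (H^T A) / (H^T H W)
            W *= (H.T @ A) / (H.T @ H @ W + eps)
            obj = (alpha * np.linalg.norm(A - H @ W, 'fro') ** 2
                   + beta * np.linalg.norm(H - C, 'fro') ** 2)
            if obj <= tol:
                break
        # Simplified side-effect test: every row's dominant factor matches C.
        # (For "not in a cluster" rows this is stricter than strictly required.)
        if np.array_equal(H.argmax(axis=1), C.argmax(axis=1)):
            break
        # Illustrative reweighting toward the constraint term; the paper says
        # only "change value of alpha / beta", leaving the schedule unspecified.
        alpha, beta = max(alpha - 0.1, 0.1), min(beta + 0.1, 0.9)
    return H, W
```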
Fig. 1. Not in a cluster (IRIS)

V. EXPERIMENTAL RESULTS

For these experiments, positive real-valued data are needed, since the method is based on NMF; categorical data were avoided because this research does not deal with them. The IRIS dataset is a fairly standard benchmark in the data mining community, but it is desirable to base the experiments on multiple datasets. We also wanted the two datasets to have different numbers of attributes, giving more variation in what the experiments were tested against. Experiments were performed on the IRIS and YEAST datasets, both of which are well known.

• IRIS Data Set: IRIS is a simple data set with 150 instances in a 4-dimensional attribute space. The four attributes are sepal length, sepal width, petal length, and petal width. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant: Iris Setosa, Iris Versicolour, and Iris Virginica.
• YEAST Data Set: YEAST is a real-valued data set with 1484 instances and 8 attributes. It is used to predict the localization site of a protein, with 9 classes. The experiments used three classes of data from the YEAST dataset; since variation in dataset size was another focus of the experiments, we took the classes having the highest numbers of tuples.

All experiments used three classes of data, whether IRIS or YEAST. All comparisons were based on the ground truth, which was the result obtained by running the k-means algorithm on the original data. In all of the experiments it was verified that there were no side effects; it was encouraging that the algorithm was able to change the membership of all the confidential subjects while keeping the membership of the non-confidential data intact.
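For reproducibility, the IRIS data can be loaded as below; the use of scikit-learn and the UCI file layout for YEAST are our assumptions, not details from the paper.

```python
import numpy as np
from sklearn.datasets import load_iris

A_iris = load_iris().data      # 150 x 4 nonnegative matrix, 3 classes
assert (A_iris >= 0).all()     # NMF requires nonnegative input

# YEAST comes from the UCI repository (protein name, 8 numeric attributes,
# class label per row); keep the three largest classes as described above.
# yeast = np.loadtxt("yeast.data", usecols=range(1, 9))  # path/format assumed
```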
Fig. 2. In a cluster (IRIS)

A. Experiment 1

Two types of changes were made: the first was the in a cluster change, with graphs as shown in Figs. 2 and 4, and the second was the not in a cluster change, shown in Figs. 1 and 3. The experiment was done to observe the number of iterations it took to make all of the changes. The algorithm was unable to reach convergence for more than 15 items for the not in a cluster change, while the number goes beyond 25 for the in a cluster change in the case of the IRIS data. A similar observation was made for the YEAST data. It can be concluded from the experiment that it takes far fewer iterations to make the in a cluster change compared to the not in a cluster change. This can be attributed to the fact that one element in a row of the matrix H needs to be significantly larger than the others for it to work efficiently.

Fig. 3. Not in a cluster (YEAST)

Fig. 4. In a cluster (YEAST)
B. Experiment 2

The next experiment studied the relation between the values of α and β and the number of confidential subjects, in order to achieve convergence. We performed this experiment with both the IRIS and the YEAST data. For each dataset, we ran the experiment for a small n (total number of changes) and a larger n. Note that the number of in a cluster changes was kept equal to the number of not in a cluster changes. Initially, α = 0.9 and β = 0.1; the value of α was then decreased by 0.1 and the value of β increased by 0.1, the aim being to keep (α + β) = 1 so that the estimated solution does not diverge from the actual solution. The experiment was then repeated, but this time starting with α = 0.1 and β = 0.9, increasing α by 0.1 and decreasing β by 0.1. Each experiment was performed 100 times to see which region of α and β values gives the most convergence.

Fig. 5. Classes for α and β
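A sketch of this sweep, reusing the constrained_nmf function from the Section IV sketch (A, C, and k are assumed to be defined as in the earlier examples; the tolerance value is illustrative):

```python
import numpy as np

def convergence_rate(A, C, k, alpha, beta, trials=100, tol=1e-4):
    """Fraction of random restarts that reach the tolerance for one (alpha, beta)."""
    wins = 0
    for seed in range(trials):
        H, W = constrained_nmf(A, C, k, alpha=alpha, beta=beta, tol=tol, seed=seed)
        obj = (alpha * np.linalg.norm(A - H @ W, 'fro') ** 2
               + beta * np.linalg.norm(H - C, 'fro') ** 2)
        wins += (obj <= tol)
    return wins / trials

# Keep alpha + beta = 1 and scan in steps of 0.1, as in the experiment.
for alpha in [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]:
    beta = round(1.0 - alpha, 1)
    print(f"alpha={alpha:.1f} beta={beta:.1f} -> {convergence_rate(A, C, 3, alpha, beta):.0%}")
```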
Fig. 6. IRIS data with n=10

Fig. 7. IRIS data with n=26

Fig. 8. YEAST data with n=10

Fig. 9. YEAST data with n=38

We can see from Fig. 6 that when n is small, we get the most convergence in the class of α and β combinations where the iteration starts; but when n is increased to 26, as in Fig. 7, most convergence occurs in the region where β > α. A similar observation was made for the YEAST data in Figs. 8 and 9: when n is small, the values of α and β do not play as important a part in convergence. When n grows large, the distribution shifts towards the region with smaller β and greater α, indicating that to change a larger number of data the values of α and β should lie in this particular range. The values of α and β basically depend upon the data: in the case of the IRIS data it was the β > α region, but for the YEAST data it was more the β < α region.

C. Experiment 3

In the third experiment, we studied the relation between the total amount of data and the number of changes that can be made. The following tables show the total number of changes that were made successfully with different amounts of data.

TABLE I: IRIS CHANGES AND DATA SIZE

Data Size   Changes
60          10
90          14
120         20
150         24

TABLE II: YEAST CHANGES AND DATA SIZE

Data Size   Changes
150         50
180         70
240         90
300         110
360         140
420         140
450         150
480         150

It can be seen from the tables that, after the number of items in each cluster is increased, convergence can be achieved even for the cases in which convergence was not obtained before.

VI. CONCLUSION

We proposed a technique to change the membership of confidential data while making no changes to non-confidential data by integrating a clustering constraint into the NMF algorithm. We were able to change the membership of more than one item at a time. The smaller number of iterations required for the in a cluster change compared to the not in a cluster change is a good indicator that one of the elements in the H matrix needs to be significantly higher
compared to the other elements. From the experimental results we were able to see how the values of α and β should be changed as the number of items being changed grows or shrinks. The relation between the number of changes and the size of the data was also studied.

VII. FUTURE WORKS

There is still much to be done in this field. One prospective direction is to study the relation between the dimension of the data and the number of iterations it takes to reach convergence, which could be a very important and interesting topic for future work. In the immediate future, a study of the utility of the data under other applications, or of metrics for the distortion level as the number of changed subjects varies, would be a good step. We applied the clustering constraint in this paper; there is also the possibility of combining it with other constraints, such as the orthogonality constraint or the sparseness constraint, to improve the results.

REFERENCES

[1] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," Proc. ACM SIGMOD Conf. on Management of Data, pp. 439-450, May 2000.
[2] J. Wang, J. Zhang, L. Liu, and D. Han, "Simultaneous data and pattern hiding in unsupervised learning," The 7th IEEE International Conference on Data Mining - Workshops (ICDMW 2007), pp. 729-734, Omaha, NE, USA, October 2007. IEEE Computer Society.
[3] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis, "State-of-the-Art in Privacy Preserving Data Mining," ACM SIGMOD Record, vol. 33, no. 1, pp. 50-57, Mar. 2004.
[4] J. Wang, W. Zhong, and J. Zhang, "NNMF-based factorization techniques for high-accuracy privacy protection on non-negative-valued datasets," Proc. 2006 IEEE International Conference on Data Mining, International Workshop on Privacy Aspects of Data Mining, pp. 513-517, IEEE Computer Society, 2006.
[5] S. Xu, J. Zhang, D. Han, and J. Wang, "Singular value decomposition based data distortion strategy for privacy protection," Knowledge and Information Systems, 10(3):383-397, 2006.
[6] H. Li, T. Adalı, W. Wang, D. Emge, and A. Cichocki, "Non-negative matrix factorization with orthogonality constraints and its application to Raman spectroscopy," The Journal of VLSI Signal Processing, 48:83-97, 2007.
[7] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research, 5:1457-1469, 2004.
[8] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267-273, New York: Association for Computing Machinery, 2003.
[9] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics and Data Analysis, 52(1):155-173, 2007.
[10] M. Mazack, "Non-negative Matrix Factorization with Applications to Handwritten Digit Recognition," Department of Scientific Computation, University of Minnesota, 2009.
[11] C. Ding, X. He, and H. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," Proc. SIAM Data Mining Conference, 2005.
[12] H. Zha, C. Ding, M. Gu, X. He, and H. D. Simon, "Spectral Relaxation for K-means Clustering," Neural Information Processing Systems, vol. 14 (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001.
[13] C. Ding and X. He, "K-means Clustering via Principal Component Analysis," Proc. Int'l Conf. on Machine Learning (ICML 2004), pp. 225-232, July 2004.