Sparse Estimation in a Correlated Probit Model
Stephan Mandt, Institute for Data Sciences and Engineering, Columbia University, [email protected]
Florian Wenzel, Department of Computer Science, Humboldt University of Berlin, [email protected]
John P. Cunningham, Department of Statistics, Columbia University, [email protected]
Shinichi Nakajima, Berlin Big Data Center, Technical University of Berlin, [email protected]
Christoph Lippert, Human Longevity, Inc., San Diego, CA 92121, USA, [email protected]
Marius Kloft, Department of Computer Science, Humboldt University of Berlin, [email protected]
Abstract

A large class of problems in statistical genetics amounts to finding a sparse linear effect in a binary classification setup, such as finding a small set of genes that most strongly predict a disease. Very often, these signals are spurious and obfuscated by confounders such as age, ethnicity, or population structure. In the probit regression model, such confounding can be modeled in terms of correlated label noise, but this poses mathematical challenges for learning algorithms. In this paper we propose a learning algorithm that overcomes these problems: we manage to learn sparse signals that are less influenced by the correlated noise. This problem setup generalizes to fields outside statistical genetics. Our method can be understood as a hybrid between an ℓ1-regularized probit classifier and a Gaussian Process (GP) classifier. In addition to a latent GP that captures the correlated noise, the model captures sparse signals in a linear effect. Because the observed labels can be explained in part by the correlated noise, the linear effect will try to learn signals that carry information beyond the correlated noise. As we show on real-world data, the signals found by our model are less correlated with the top confounders; hence we can find effects closer to the unconfounded sparse effects we aim to detect. In addition, we show that our method outperforms Gaussian process classification and uncorrelated probit regression in terms of prediction accuracy.
1 Introduction
Detecting—among tens of thousands of genes—sparse signals that are strong predictors of complex diseases or other binary outcomes is one of the central challenges in statistical genetics [1, 2], as it is a first step in identifying regulatory components controlling heritability. However, for various diseases such as type 2 diabetes [3], these sparse signals remain largely undetected, which is why the missing associations have been termed The Dark Matter of Genomic Associations [4]. Central problems are that these signals are often very weak and that detected signals can be spurious due to confounding. Confounding can stem from varying experimental conditions and demographics such as age, ethnicity, and gender [5], and—crucially—from population structure, which is due to the relatedness between the samples [6, 5, 7]. Ignoring such confounders can lead to spurious false-positive findings that cannot be replicated on independent data [8]. Correcting for such confounding dependencies is considered one of the greatest challenges in statistical genetics [9].
Beyond statistical genetics, sparse estimation is a general problem in binary classification, with wide applications in science and technology, including, among many others, neuroscience [10], medicine [11], text classification [12], credit scoring [13], and computer malware detection [14]. In all of these applications, confounding of the sparse signal can have dramatic consequences such as false medical diagnoses or violations of financial regulations [15]. There is a need for statistical methods for feature selection that are robust to these confounding influences.

In this paper, we propose an algorithm for sparse estimation in the presence of confounding, focusing on applications in statistical genetics and computer malware detection, but presenting a general learning algorithm. Our algorithm builds on the ℓ1-norm regularized probit regression model [16, 17], one of the most popular models for feature selection in binary classification [18]. The new aspect of our approach is that we allow the model to account for correlated label noise; these correlations reflect the impact of confounders. This approach is inspired by related methods in linear regression [19], which we generalize to the more involved classification setup.

In contrast to regular, uncorrelated probit models, training the proposed model is a hard problem because marginalizing over the correlated noise cannot be done in closed form. We show that this results in an n-dimensional multivariate Gaussian integral over the positive orthant R^n_+, where n is the number of data points. In addition, the ℓ1 regularizer prevents the direct use of second-order methods. In this paper, we propose an algorithm that handles both technical complications; it borrows ideas from the alternating direction method of multipliers (ADMM) [20] in combination with Expectation Propagation (EP) as a numerical integrator for truncated Gaussian distributions [21]. This results in an EM-type algorithm for feature selection in the presence of correlated noise.

We analyze the approach empirically on synthetic data and real-world data from the domains of statistical genetics and computer malware detection. We first show that our model captures both ℓ1-norm probit regression and Gaussian Process (GP) probit classification as special cases. We then show that our algorithm outperforms both methods in terms of feature quality and accuracy on the considered data sets by up to 5 percentage points. Furthermore, as we show, the found features are less correlated with the top confounders.

Our paper is organized as follows. First, we present the mathematical framework of the proposed sparsity-inducing correlated probit model (Section 2). We then derive an ℓ1-norm regularized parameter optimization algorithm and show that it can be interpreted as an EM algorithm (Section 2.2). Carrying out the expectation step requires a further approximation, for which we introduce EP as a possible means, leading to a Newton-based EM-EP algorithm for parameter optimization. We then apply our method to extract features associated with diseases and traits from genetic data that is strongly confounded by non-genetic influences and population structure. Finally, we test our method on a dataset that contains a mix of different types of malicious computer software, giving rise to confounding by malware (population) structure.
1.1 Related Work
A common generalized linear model for classification is the logistic regression model [22]. Including correlations in this model is non-straightforward [23]; one has to resort to approximate inference techniques, including the Laplace and mean-field approximations that have been proposed in the context of GP classification [24], and the pseudo-likelihood method, which has been proposed in the context of generalized linear mixed models [25]. To our knowledge, feature selection has not been studied in a correlated logistic setup. On the other hand, without correlations, there is a large body of work on feature selection in Lasso regression [18]. Alternative sparse priors to the Lasso have been suggested in [26] for unsupervised learning (again, without compensating for confounders). The joint problem of sparse estimation in a correlated noise setup has so far been studied only in the linear regression case [27, 2, 19], as mentioned above, whereas we are interested in classification. We remark that the ccSVM [5] deals with confounding in a different way and does not yield a sparse solution. Finally, our algorithm builds on Expectation Propagation for GP classification [24, 21]; note, however, that GP classification does not yield sparse estimates.
2 Correlated Probit Regression

2.1 Model
Our algorithm builds on the ℓ1-norm regularized probit regression model [16, 17], one of the most popular models for feature selection in binary classification [18]. The probit regression model rests on the assumption of independent labels: given n observed inputs X_i, the corresponding labels y_i ∈ {+1, −1} are assumed to be generated from n independent latent random variables ε_i ∼ N(0, σ²) according to y_i = sign(X_i^⊤ w + ε_i), where w is the linear effect of interest. The assumption of independent noise variables greatly simplifies inference, as the likelihood factorizes into n terms, each of which can be efficiently computed through a one-dimensional Gaussian integral.

The new aspect is that we allow our model to account for correlated label noise. By including a non-diagonal noise covariance, we can correct for confounders, which broadly cause misclassification due to similarities between the samples X_i. While we explain this mechanism in more detail below, we first introduce our correlated noise model. Without loss of generality, we will from now on assume that all observed binary labels y_i are 1. The reason is that, starting with positive and negative labels, we can perform a linear transformation that absorbs the signs of the labels into the data matrix and the noise covariance; working with the transformed data matrix and noise covariance then yields our assumption (this is proven in Appendix A). The correlated probit model studied in this paper is

1 = sign(X_i^⊤ w + ε_i),    ε = (ε_1, ..., ε_n)^⊤ ∼ N(0, Σ).   (1)

In contrast to standard probit regression, ε is thus a vector of correlated noise, and the matrix Σ models the dependencies among the labels. The likelihood function is the probability that all transformed labels are 1, which is satisfied when X^⊤ w + ε > 0. In this paper, we are interested in the marginal likelihood,

P(y_i = 1 ∀i | w) = P(X^⊤ w + ε > 0) = ∫_{R^n_+} N(ε; X^⊤ w, Σ) d^n ε.   (2)
The marginal likelihood is hence an integral of a multivariate Gaussian over the positive orthant. Because w enters the likelihood as a Gaussian mean, we introduce the shorthand notation

µ ≡ µ(w) ≡ X^⊤ w.   (3)

To obtain sparse solutions, we minimize the negative log likelihood in the presence of an ℓ1-norm regularizer as follows.

Problem 1 (Sparsity-inducing Correlated Probit Regression). Minimize, with respect to w,

L(w) = L^loss(w) + L^reg(w),   with   L^loss(w) := − log ∫_{R^n_+} N(ε; µ(w), Σ) d^n ε   and   L^reg(w) := λ_0 ||w||_1.   (4)
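To make the objective concrete, here is a minimal sketch (not the authors' implementation) that evaluates L(w) from Problem 1 for very small n, using SciPy's multivariate normal CDF for the orthant integral in Eq. (2); all sizes and values in the usage example are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

def correlated_probit_objective(w, X, Sigma, lam0):
    """L(w) from Problem 1: negative log orthant probability plus l1 penalty.
    Assumes the labels have already been absorbed into X (d x n) and Sigma
    (Appendix A). Exact evaluation is only feasible for small n, since
    SciPy's multivariate normal CDF integrates numerically."""
    mu = X.T @ w                                    # mu(w) = X^T w, shape (n,)
    # P(eps > 0) for eps ~ N(mu, Sigma) equals P(z < mu) for z ~ N(0, Sigma)
    orthant_prob = multivariate_normal.cdf(mu, mean=np.zeros(len(mu)), cov=Sigma)
    return -np.log(orthant_prob) + lam0 * np.abs(w).sum()

# Tiny illustration with arbitrary sizes (not the paper's data):
rng = np.random.default_rng(0)
d, n = 5, 4
X = rng.standard_normal((d, n))
Sigma = 0.5 * np.eye(n) + 0.5 * X.T @ X / d
print(correlated_probit_objective(rng.standard_normal(d), X, Sigma, lam0=0.1))
```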
The central computational difficulties in the correlated probit regression model are that (1) the objective function contains an intractable, high-dimensional integral, and (2) it has a non-smooth Lasso regularizer. In this paper we propose an algorithm that nevertheless allows us to optimize this objective function. Before we come to the algorithm, we describe the choice of the noise covariance and explain why a non-trivial noise covariance may result in better features.

Covariance matrix. The covariance matrix models correlated label noise between the samples X_i. A priori, we do not know these correlations, but we can assume that we have some measure of similarity between the data points, expressible in terms of a finite set of n × n kernel matrices K_i. We then model the covariance of the correlated noise as a linear combination of these kernels, with weights λ_i that are determined by cross-validation:

Σ = λ_1 I + Σ_{i=2}^m λ_i K_i.   (5)

(As we explain below, this aspect is similar to probit Gaussian process classification [24].) The correlated probit model now delivers two alternative explanations of the observed labels: one in terms of a sparse linear effect, and another in terms of correlated label noise. By conditioning on the labels, the linear effect and the noise distribution become correlated; in other words, the correlated noise explains away parts of the observed labels. Therefore, the sparse linear effect
will try to fit only those labels that are hard to fit with a correlated noise distribution but easy to fit with a sparse linear effect. Including a noise covariance matrix is therefore a possible way to eliminate confounders. The kernels K_i can either be functions of the primary data X, or they can be constructed from side information. We now explain why including a linear kernel K = X^⊤ X partially eliminates confounding by population structure.

Linear kernels, population structure, and correlated noise. In statistical genetics it is widely acknowledged that the inclusion of a linear kernel in a linear mixed model is a way to capture and account for confounding influences in the data matrix [6, 19]. Here we explain this mechanism in the context of our model. In our model, we use the feature matrix X twice: once in the linear effect X^⊤ w, and once through X^⊤ X in the covariance matrix. Let us consider Σ = λ_1 I + λ_2 X^⊤ X. Model (1) is then equivalent to

1 = sign(X_i^⊤ (w + w′) + ε_i),    ε ∼ N(0, λ_1 I_{n,n}),    w′ ∼ N(0, λ_2 I_{d,d}).   (6)

This can be confirmed by integrating out w′, which is Gaussian (a numerical sanity check of this equivalence is sketched at the end of this subsection). Importantly, (w + w′) is the sum of the sparse solution w and the non-sparse, Gaussian-distributed variable w′. For regularization purposes it makes sense to include this additional Gaussian noise distribution in feature space R^d, in addition to the Gaussian noise in R^n. Predictions from this model involve a small contribution from all features due to w′, and a large contribution from the few features that have non-zero w. We can interpret w′ as an uncorrelated noise variable in the primal feature space. It thus induces correlated noise in the dual space, which means that data points with similar feature vectors X_i will also have strongly correlated label noise. This confounding by population structure can be corrected by including a linear noise kernel in our model.

Connection to other models. Our approach is inspired by related methods in the linear regression setup, where the Lasso regularizer has successfully been combined with correlated noise to select genetic variants that affect continuous traits. When removing the probit likelihood, the model becomes the linear mixed model Lasso (LMM-Lasso) of [19], with P(Y | w) = N(Y; X^⊤ w, Σ). This model has been shown to improve both the selection of true non-zero effects and prediction quality [19]. Our model is a natural extension of the model of [19] to binary outcomes, such as the disease status of a patient. As we explain in this paper, inference in our model is, however, much more challenging than in [19]. Furthermore, by construction, our model captures two limiting cases: uncorrelated probit regression and Gaussian process (GP) classification. To obtain uncorrelated probit regression, we simply set the parameters λ_i = 0 for i ≥ 2, thereby eliminating the non-diagonal covariance structure. To obtain GP classification, we simply omit the fixed effect (i.e., we set w = 0), so that our model likelihood becomes P(Y = Y^obs | w) = ∫_{R^n_+} N(ε; 0, Σ) d^n ε, where the noise variable ε plays the role of the latent function f in GPs [24]. When properly trained, our model will therefore outperform both approaches in terms of accuracy. We compare our method to all three limiting cases in the experimental part of the paper and show enhanced accuracy.
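The equivalence in Eq. (6) can be checked numerically. The following small sketch (not from the paper; all sizes are illustrative) compares the analytic covariance λ1 I + λ2 X^⊤X with the empirical covariance of X^⊤w′ + ε under the random-effect formulation.

```python
import numpy as np

# Monte Carlo sanity check of Eq. (6): integrating out a dense random effect
# w' ~ N(0, lam2 * I_d) with independent noise eps ~ N(0, lam1 * I_n) yields
# correlated noise with covariance  lam1 * I_n + lam2 * X^T X.
rng = np.random.default_rng(0)
d, n, lam1, lam2, m = 20, 6, 0.5, 2.0, 200_000
X = rng.standard_normal((d, n))

analytic = lam1 * np.eye(n) + lam2 * X.T @ X

W = rng.normal(scale=np.sqrt(lam2), size=(d, m))   # draws of w'
E = rng.normal(scale=np.sqrt(lam1), size=(n, m))   # draws of eps
empirical = np.cov(X.T @ W + E)                    # covariance of X^T w' + eps

print(np.abs(empirical - analytic).max())          # small, up to MC error
```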
2.2 Model Training
In this section, we give details of the training algorithm. We first describe how we split the optimization into optimizing the loss function and optimizing the regularizer, using ADMM [20]. This involves a Newton update, for which we need to compute the gradient and Hessian of the loss function. We then show how to use EP to approximate the Newton step. Prediction is addressed in Appendix B.

Newton-ADMM algorithm. We first focus on the treatment of the ℓ1-norm regularizer. The ℓ1 norm prevents us from directly applying second-order methods such as Newton's method. This problem can be circumvented by the alternating direction method of multipliers (ADMM) [20]. ADMM treats the problem by simultaneously optimizing L^loss(w) + λ_0 ||z||_1 over w and z subject to w = z. This is equivalent to solving the following min-max problem involving the Lagrange parameter η:

(w, z) = arg max_η min_{z,w} L(w, z, η);
L(w, z, η) = L^loss(w) + λ_0 ||z||_1 + η^⊤(w − z) + (c/2) ||w − z||_2^2.   (7)
By alternating between the minimization updates for w and z and a gradient step in η, we solve the original problem. While the updates for z and η have analytic solutions [20], note that the part of the objective depending on w, L^loss(w) + η^⊤ w + (c/2)||w − z||_2^2, is now effectively ℓ2-norm regularized. Because z = w is enforced in the final result, the constant c determines only the speed but not the outcome of the optimization algorithm [20] and does not have to be cross-validated. Note also that the loss function L^loss is convex, as shown in Appendix C. For a Newton update in w, we now compute the gradient and the Hessian of L(w, z, η) as a function of w, called L(w) for brevity.

Gradient-based parameter optimization. Having presented a strategy for optimizing the marginal log likelihood, we now show how to compute its gradient and its Hessian with respect to w. For the regularizer L^reg(w) = (c/2)||w − z||_2^2, it is straightforward to show that the components of the gradient and Hessian are given by ∇_{w_i} L^reg(w) = c (w_i − z_i) and H^reg_{ij} = c δ_{ij}. Before we proceed to the original loss function L^loss(w), it is convenient to introduce the following probability distribution:

p(ε | µ, Σ) = 1[ε ∈ R^n_+] N(ε; µ, Σ) / ∫_{R^n_+} N(ε; µ, Σ) d^n ε.   (8)
Above, 1[·] is the indicator function. This is just the multivariate Gaussian, truncated and normalized to the positive orthant; we call it the posterior noise distribution. We furthermore introduce

µ_p(w) = E_{p(ε|µ(w),Σ)}[ε],    Σ_p(w) = E_{p(ε|µ(w),Σ)}[(ε − µ_p(w))(ε − µ_p(w))^⊤].   (9)

These are just the mean and the covariance of the truncated multivariate Gaussian, as opposed to µ and Σ, which are the mean and covariance of the non-truncated Gaussian. In the following we suppress the dependence of µ_p and Σ_p on w. Abbreviating ∆µ = µ_p − µ for the difference between the two means, we prove the following formulas for the gradient and the Hessian H^loss(w) = ∇²_w L^loss(w) in Appendix D:

∇_w L^loss(w) = ∆µ^⊤ Σ^{-1} X^⊤,
H^loss(w) = −X [Σ^{-1} (Σ_p − ∆µ ∆µ^⊤) Σ^{-1} − Σ^{-1}] X^⊤.   (10)
Note that the dependence of the right-hand side on w enters through Σ_p(w) and ∆µ(w).

Variational approximation by EP. Without further approximation, the suggested Newton-ADMM scheme is still not tractable for large n: it involves the first moment and covariance of the posterior in each gradient step. We therefore use a recent application of expectation propagation (EP) [21] to approximate the posterior mean and variance needed for the gradient and Hessian. Note that other approximate inference schemes are also possible, such as variational inference or sampling methods. In contrast to variational inference, EP is known to approximate posterior moments better, which is why we resort to this approach [24]. EP approximates the moments of the posterior p(ε | µ, Σ) in terms of a variational distribution q(ε), approximately minimizing the Kullback-Leibler divergence,

q*(ε | µ_{q*}, Σ_{q*}) = arg min_q ( E_p[log p(ε | µ, Σ)] − E_p[log q(ε | µ_q, Σ_q)] ).   (11)
The variational distribution q*(ε) is an un-truncated Gaussian, q*(ε; µ_{q*}, Σ_{q*}) = N(ε; µ_{q*}, Σ_{q*}), characterized by the variational parameters µ_{q*} and Σ_{q*}. We approximate the mean and covariance of the posterior p in terms of the variational distribution, µ_p ≈ µ_{q*} and Σ_p ≈ Σ_{q*}. We warm-start each gradient computation with the optimal variational parameters of the previous iteration. Note that EP is just one of many options to approximate the required posterior moments.

Bayesian interpretation and EP-EM algorithm. Finally, before we present our experiments, we elaborate on the Bayesian view and show that our algorithm is an EM algorithm. Our model factorizes as p(Y, w, ε̃) = p(Y | w, ε̃) p(w) p(ε̃), where ε̃ and w are hidden variables with priors p(ε̃) = N(ε̃; 0, Σ̃) and p(w) ∝ exp(−λ ||w||_1). Because p(w) is a Laplace prior, it cannot conveniently be treated in a fully Bayesian way, and hence we use a MAP estimate for w, while ε̃ is marginalized out. The (un-normalized) likelihood p(Y | w, ε̃) = 1[Y = sign(X̃^⊤ w + ε̃)] is just the indicator function, as in regular probit regression. Transforming the integration variable ε̃ → ε̃ − X̃^⊤ w, followed by the transformation Σ = diag(Y) · Σ̃ · diag(Y) of Appendix A, results in our earlier formulas.

The Bayesian interpretation allows us to identify our algorithm as a partial EM scheme. To illustrate the idea, we simplify the Newton scheme to a first-order scheme:

E-step:  Q(w, w_t) = E_{p(ε|Y,w_t)}[log p(Y, w, ε)],
M-step:  w_{t+1} = w_t + ρ ∇_w Q(w, w_t)|_{w=w_t}.

For the proof, we first define the marginal likelihood p(Y, w) = ∫ p(Y, w, ε) dε. Our objective is exactly the log of the marginal likelihood, hence

∇_w L(w, Y) = ∇_w log ∫ p(Y, w, ε) dε = ∫ ∇_w p(Y, w, ε) dε / p(Y, w)
  = ∫ ∇_{w'} log p(Y, w', ε)|_{w'=w} p(Y, w, ε) dε / p(Y, w)
  = ∫ ∇_{w'}|_{w'=w} log p(Y, w', ε) p(ε | Y, w) dε = ∇_{w'} Q(w, w')|_{w'=w}.

We used the identity ∇_w p(Y, w, ε) = ∇_w log p(Y, w, ε) · p(Y, w, ε). This proves the interpretation of our algorithm as an EM algorithm, and hence our approach can be understood as an EM-EP scheme. Note that the model can also be interpreted from a frequentist viewpoint.
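To make the training procedure concrete, here is a minimal first-order sketch of the ADMM/EM loop described above. It is not the authors' implementation: the Newton step is replaced by a plain gradient step, the EP moment computation is replaced by crude rejection sampling of the truncated Gaussian (8), which is only viable on small toy problems, and the gradient is written as the column-form gradient of the negative log orthant probability; all helper names and parameter values are illustrative.

```python
import numpy as np

def truncated_moments(mu, Sigma, n_samples=100_000, rng=None):
    """Mean and covariance of p(eps | mu, Sigma) from Eq. (8), i.e. N(mu, Sigma)
    truncated to the positive orthant. Crude rejection-sampling stand-in for EP;
    only viable for small n and non-tiny orthant probability."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.multivariate_normal(mu, Sigma, size=n_samples)
    eps = eps[(eps > 0).all(axis=1)]
    return eps.mean(axis=0), np.cov(eps, rowvar=False)

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (analytic z-update of ADMM)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_correlated_probit(X, Sigma, lam0, c=1.0, step=0.1, iters=100, seed=0):
    """First-order ADMM sketch of Problem 1. Labels are assumed to be already
    absorbed into X (d x n) and Sigma (Appendix A). The w-update is a single
    gradient step on L_loss(w) + eta^T w + (c/2)||w - z||^2; the paper instead
    uses a Newton step with the Hessian of Eq. (10) and EP moments."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    w, z, eta = np.zeros(d), np.zeros(d), np.zeros(d)
    Sigma_inv = np.linalg.inv(Sigma)
    for _ in range(iters):
        mu = X.T @ w
        mu_p, _ = truncated_moments(mu, Sigma, rng=rng)
        # gradient (in column form) of -log of the orthant integral in Eq. (2)
        grad_loss = X @ Sigma_inv @ (mu - mu_p)
        grad_w = grad_loss + eta + c * (w - z)     # gradient of the w-subproblem
        w = w - step * grad_w
        z = soft_threshold(w + eta / c, lam0 / c)  # analytic z-update
        eta = eta + c * (w - z)                    # dual ascent step
    return z  # z carries the exact zeros induced by the l1 penalty

# Usage on a tiny synthetic problem (hypothetical sizes, not the paper's setup):
rng = np.random.default_rng(1)
d, n = 20, 8
w_true = np.zeros(d); w_true[:3] = 2.0
X = rng.uniform(-1, 1, size=(d, n))
Sigma = 0.5 * np.eye(n) + 0.5 * X.T @ X / d
y = np.sign(X.T @ w_true + rng.multivariate_normal(np.zeros(n), Sigma))
X_abs, Sigma_abs = X * y[None, :], Sigma * np.outer(y, y)   # absorb labels
print(fit_correlated_probit(X_abs, Sigma_abs, lam0=0.5).round(2))
```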
3 Empirical Analysis and Applications
In the following we apply our model to synthetic and real-world data and show that it outperforms GP classification and sparse probit regression in terms of accuracy and feature quality.
3.1 Experimental Setup and Choice of Covariance Matrix
For our real-world and synthetic experiments, we first need to make a choice for the class of kernels K_i that we use for the covariance matrix. We choose a combination of three contributions,

Σ = λ_1 I + λ_2 X^⊤ X + λ_3 Σ_side.   (12)
The third term is optional and depends on the context; it is a kernel extracted from side information in the form of an additional feature matrix X′, where we choose Σ_side as an RBF kernel with bandwidth σ = 0.2 on top of X′. Note that this way the data matrix enters the model both through the linear effect and through the linear kernel. For all datasets, features were centered and scaled to unit standard deviation, except in the computer security experiment, where the features are binary. We train the competing algorithms, which are summarized in Table 1, using n instances for training and splitting the remaining examples equally into validation and test sets. This process is repeated 50 times, over which we report average accuracies or areas under the ROC curve (AUCs) as well as standard errors [28]. The three hyperparameters λ1, λ2, and λ3, together with the fourth parameter λ0 of the ℓ1-norm regularizer, are determined on the validation set using grid search over a sufficiently large parameter space (optimal values attained inside the grid), in most cases λi ∈ [0.1, 1000].

Table 1: Relation of the correlated probit model to other models.
Method        Regularizer  Covariance  Likelihood
Corr. probit  ℓ1           yes         sign
LMM-Lasso     ℓ1           yes         linear
Probit        ℓ1           no          sign
GP            w = 0        yes         sign
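For illustration, the covariance in Eq. (12) could be assembled as follows (a sketch, not the authors' code; the exact parameterization of the RBF bandwidth σ = 0.2 is an assumption, since the paper does not specify it).

```python
import numpy as np

def rbf_kernel(X_side, sigma=0.2):
    """RBF kernel on the side-information feature matrix X_side (n x p).
    The bandwidth sigma is interpreted here as a length scale; the paper does
    not spell out the exact parameterization."""
    sq_dists = ((X_side[:, None, :] - X_side[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def build_covariance(X, X_side, lam1, lam2, lam3):
    """Sigma = lam1*I + lam2*X^T X + lam3*Sigma_side, cf. Eq. (12). X is d x n."""
    n = X.shape[1]
    Sigma = lam1 * np.eye(n) + lam2 * (X.T @ X)
    if X_side is not None:              # the side-information term is optional
        Sigma = Sigma + lam3 * rbf_kernel(X_side)
    return Sigma
```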
3.2 Simulated Data
We evaluate our algorithm on synthetic data that we generate as follows. First, we generate the underlying model parameters and features. To this end, we generate a weight vector w ∈ R^50 with k ≤ 50 entries equal to 1 and the remaining entries equal to 0. We then create a random covariance matrix Σ ∈ R^{200×200}, which serves as the side-information matrix Σ_side in the model (12).¹ We independently draw 200 points X = {x_1, ..., x_200} from a uniform distribution over the unit cube [−1, 1]^50 and create the labels according to the probit model (1) using Σ = Σ_side. We reserve a subset of n = 100 instances for training and use the remaining ones in equal parts for validation and testing.

¹ The covariance matrix was created as follows. The random generator in MATLAB version 8.3.0.532 was initialized to seed = 20 using the rng(20) command. The matrix Σ was then realized in two steps via A = 2*rand(50,200) - 1 and Σ = 3*A'*A + 0.6*eye(200) + 3*ones(200,200).
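For reference, here is a Python analogue of the described generator (the paper used MATLAB with a fixed seed, see footnote 1, so exact numbers will differ; the value k = 10 and the positions of the non-zero entries are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(20)
d, N, k = 50, 200, 10                      # k non-zero effects (illustrative)

w_true = np.zeros(d)
w_true[:k] = 1.0                           # sparse weight vector with k ones

A = 2 * rng.random((d, N)) - 1             # analogue of A = 2*rand(50,200)-1
Sigma_side = 3 * A.T @ A + 0.6 * np.eye(N) + 3 * np.ones((N, N))

X = rng.uniform(-1, 1, size=(d, N))        # 200 points in the cube [-1, 1]^50
eps = rng.multivariate_normal(np.zeros(N), Sigma_side)
y = np.sign(X.T @ w_true + eps)            # labels from the probit model (1)

train = np.arange(100)                     # n = 100 instances for training
val, test = np.arange(100, 150), np.arange(150, 200)
```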
[Figure 1: left panel "Toy Experiment" shows accuracy vs. percentage of non-zero features for oracle, correlated probit, GP, uncorrelated probit, and LMM; right panels "L1-Regularized Correlated Probit" and "L2-Regularized Correlated Probit" show the feature weights over the 50 features.]
Figure 1: TOY: Results of the synthetic experiment. LEFT: Average accuracies as a function of the number of true non-zero features in the generating model. Note the tiny error bars. RIGHT: Ground truth (blue solid line) and feature weights (green dots) of ℓ1-norm (TOP) and ℓ2-norm (BOTTOM) regularized probit regression.
As a benchmark, we introduce the oracle classifier, in which we use the correlated probit model but skip the training and instead use the true underlying w for prediction. In Fig. 1 we report the resulting accuracies (left) and feature weights (right) for an example realization of the scenario in which 80% of the features in the ground truth are sparse. We observe that in the dense scenarios our approach attains up to 9 percentage points higher accuracy than uncorrelated probit regression, outperforming also the LMM-Lasso, and performing similarly well as GP classification, which also takes the side-information covariance into account. In the sparse scenarios, our approach outperforms GP classification and the LMM-Lasso by up to 10 and 22 percentage points, respectively, and also performs substantially better than uncorrelated probit regression. Inspecting the computed feature weights (Fig. 1, right), we observe that the correlated probit model finds the true non-zero weights without suffering from large noise, unlike its ℓ2-norm counterpart. This leads to a higher prediction accuracy while simultaneously finding a sparse linear effect.
3.3 Tuberculosis Disease Outcome Prediction From Gene Expression Levels
We obtained the dataset of [29] from the National Center for Biotechnology Information website², which includes 40 blood samples from patients with active tuberculosis as well as 103 healthy controls, together with the transcriptional signature of the blood samples measured in a microarray experiment with 48,803 gene expression levels, which serve as features for our purposes. Also available is the age of the subjects when the blood sample was taken, from which we compute Σ_side. All competing methods are trained for various values of n ∈ [40, 80]. To be consistent with previous studies, we report the area under the ROC curve (AUC) rather than accuracy. The results are shown in Figure 2. We observe that correlated probit regression achieves a consistent improvement over its uncorrelated counterpart and over GP classification, by up to 12 and 3 percentage points, respectively. We furthermore computed the empirical correlation of the weight vector with the first principal component of the linear kernel (population structure). We found that the features selected by our model show much less correlation with population structure (confounding) than those of uncorrelated probit regression. This is because population structure was built into our model as a source of correlated noise.
3.4 Malicious Computer Software (Malware) Detection
We experiment on the Drebin dataset³ [14], which contains 5,560 Android software applications from 179 different malware families. There are 545,333 binary features; each feature denotes the presence or absence of a certain source-code string (such as a permission, an API call, or a network address). It makes sense to look for a sparse feature vector [14], as only a small number of strings are truly characteristic of malware. The idea is that we consider populations of different families of malware during training, and hence correct for the analogue of genetic population structure in this new context, which we call "malware structure".

² http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19491
³ http://user.informatik.uni-goettingen.de/~darp/drebin/download.html
[Figure 2: left panel "AUCs in Tuberculosis Data Set" shows AUC vs. number of training samples for correlated probit, GP, and uncorrelated probit; right panel shows correlation with the confounder vs. index sorted by decreasing absolute feature weight, for correlated and uncorrelated probit.]
Figure 2: TBC: Results of the tuberculosis experiment. LEFT: Average AUC. RIGHT: Feature correlation with the first principal component (i.e., population structure), where the x-axis is sorted by descending absolute weights.
[Figure 3: left panel "ROC Curves for Drebin Data Set" shows true-positive rate vs. false-positive rate for correlated probit, GP, uncorrelated probit, and LMM; right panel shows correlation with the confounder vs. index sorted by decreasing absolute feature weight, for correlated and uncorrelated probit.]
Figure 3: MALWARE: Results of the computer malware detection experiment. LEFT: Average ROC curves. RIGHT: Feature correlation with the first principal component (i.e., malware structure), where the x-axis is sorted by descending absolute weights.
We concentrate on the top 10 most frequently occurring malware families in the dataset.⁴ We took 10 instances from each family, forming a malicious set of 100 instances and a benign set of another 100 instances (i.e., 200 samples in total). We employ n = 80 instances for training and stratify in the sense that each training/validation/test set contains 50% benign samples and an equal number of malware instances from each family. There is no side information available for this application, but, as in the previous applications, we employ a linear kernel of the feature matrix. Again we report the (normalized) area under the ROC curve over the interval [0, 0.1]. The results are given in Figure 3 and Table 2. We observe that correlated probit regression achieves a consistent improvement in terms of AUC0.1 over its uncorrelated counterpart (by 5 percentage points) and over GPs (by 7.5 percentage points), and leads to features that are less confounded by the malware structure. Table E.1 in the Supplementary Material shows the top-scoring features.

Table 2: AUC0.1 attained on the malware dataset.
Corr. Probit   GP           Probit       LMM-Lasso
74.9 ± 0.2     69.8 ± 0.3   67.2 ± 0.3   66.45 ± 0.3

⁴ Geinimi, FakeDoc, Kmin, Iconosys, BaseBridge, GinMaster, Opfake, Plankton, FakeInstaller, DroidKungFu.
3.5 Flowering Time Prediction From Single Nucleotide Polymorphisms
We experiment on genotype and phenotype data consisting of 199 genetically different samples from the model plant Arabidopsis thaliana [30]. The genotype of each sample comprises 216,130 single-nucleotide polymorphism (SNP) features. The phenotype that we aim to predict is early or late flowering of a plant when grown at ten degrees centigrade. The original dataset contains the flowering time for each of the 199 genotypes. We split the dataset into the lower and upper 45% quantiles of the flowering time and binarize the labels, resulting in a set of 180 instances, from which we use n = 150 for training. This dataset is known to be strongly confounded by population structure. The results are reported in Table 3 and show that correlated probit regression has a slight advantage of about 0.5 percentage points in AUC over the competitors.

Table 3: Flowering time prediction experiment (AUCs).
Corr. Probit   GP           Probit       LMM-Lasso
84.1 ± 0.2     83.6 ± 0.2   83.5 ± 0.2   79.7 ± 0.2
An analysis restricted to the ten SNPs with the largest absolute regression weights in our model showed that they lie within four well-annotated genes that can all convincingly be related to flowering, structure, and growth: the gene AT2G21930 is a growth protein that is expressed during flowering, AT4G27360 is involved in microtubule motor activity, AT3G48320 is a membrane protein involved in plant structure, and AT5G28040 is a DNA-binding protein that is expressed during flowering.
4 Conclusion
We presented a novel algorithm for sparse feature selection in the presence of spurious correlations in the training data, which may be caused by confounders such as age, ethnicity, or population structure, as in the context of statistical genetics. Our method can be understood as a hybrid between an ℓ1-norm regularized probit classifier enforcing sparsity and a GP classifier. The model therefore learns to distinguish sparse linear effects from confounding by correlated Gaussian noise. We showed that the signals found by our model are less correlated with the top confounders and, hence, that we can find effects closer to the truly underlying sparsity pattern. Experiments on real data from statistical genetics and computer malware detection showed that the proposed method outperforms GP classification and uncorrelated probit regression in terms of prediction accuracy.

Finally, we remark that there is a certain preoccupation with sparse models in the scientific community, for various reasons. While sparsity by itself is not the ultimate virtue to strive for, we showed that the combination of sparsity-inducing regularization and dense probabilistic modeling (as in the proposed method) may improve over purely sparse models such as ℓ1-norm regularized uncorrelated probit regression. The reason for this is less obvious, and its theoretical exploration is left for future work. Nevertheless, a good starting point to this end will be the existing literature on compressed sensing as pioneered by [31, 32] and put forward by [33] in the context of 1-bit compressed sensing. For the latter case, such theory has recently been developed by [34], but under the assumption of independent noise variables—an assumption that is violated in the correlated probit model.
References

[1] T. A. Manolio, F. S. Collins, N. J. Cox, D. B. Goldstein, L. A. Hindorff, D. J. Hunter, et al., "Finding the missing heritability of complex diseases," Nature, vol. 461, no. 7265, pp. 747–753, 2009.
[2] S. Vattikuti, J. J. Lee, C. C. Chang, S. D. Hsu, and C. C. Chow, "Applying compressed sensing to genome-wide association studies," GigaScience, vol. 3, no. 1, p. 10, 2014.
[3] N. Craddock, M. E. Hurles, N. Cardin, et al., "Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls," Nature, vol. 464, no. 7289, pp. 713–720, 2010.
[4] The National Human Genome Research Institute, "Proceedings of the workshop on the dark matter of genomic associations with complex diseases: Explaining the unexplained heritability from genome-wide association studies," 2009.
[5] L. Li, B. Rakitsch, and K. M. Borgwardt, "ccSVM: correcting support vector machines for confounding factors in biological data classification," Bioinformatics, vol. 27, no. 13, pp. 342–348, 2011.
[6] C. Lippert, J. Listgarten, Y. Liu, C. Kadie, R. Davidson, and D. Heckerman, "Fast linear mixed models for genome-wide association studies," Nature Methods, vol. 8, pp. 833–835, October 2011.
[7] N. Fusi, O. Stegle, and N. D. Lawrence, "Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical studies," PLoS Computational Biology, vol. 8, no. 1, 2012.
[8] P. Kraft, E. Zeggini, and J. P. Ioannidis, "Replication in genome-wide association studies," Statistical Science, vol. 24, no. 4, p. 561, 2009.
[9] B. J. Vilhjálmsson and M. Nordborg, "The nature of confounding in genome-wide association studies," Nature Reviews Genetics, vol. 14, no. 1, pp. 1–2, 2013.
[10] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[11] A. M. Cohen and W. R. Hersh, "A survey of current work in biomedical text mining," Briefings in Bioinformatics, vol. 6, no. 1, pp. 57–71, 2005.
[12] S. Eyheramendy, A. Genkin, W.-H. Ju, D. D. Lewis, and D. Madigan, "Sparse Bayesian classifiers for text categorization," Journal of Intelligence Community Research and Development, vol. 13, 2003.
[13] D. J. Hand and W. E. Henley, "Statistical classification methods in consumer credit scoring: a review," Journal of the Royal Statistical Society: Series A, vol. 160, no. 3, pp. 523–541, 1997.
[14] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, K. Rieck, and C. Siemens, "Drebin: Effective and explainable detection of Android malware in your pocket," in Proc. of NDSS, 2014.
[15] D. Immergluck, Credit to the Community. M.E. Sharpe, 2004.
[16] C. I. Bliss, "The method of probits," Science, vol. 79, no. 2037, pp. 38–39, 1934.
[17] L. Fahrmeir, T. Kneib, S. Lang, and B. Marx, Regression. Springer, 2013.
[18] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.
[19] B. Rakitsch, C. Lippert, O. Stegle, and K. Borgwardt, "A Lasso multi-marker mixed model for association mapping with population structure correction," Bioinformatics, vol. 29, no. 2, pp. 206–214, 2013.
[20] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the ADMM," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[21] J. P. Cunningham, P. Hennig, and S. Lacoste-Julien, "Gaussian probabilities and expectation propagation," arXiv preprint arXiv:1111.6832, 2011.
[22] D. R. Cox, "The regression analysis of binary sequences," Journal of the Royal Statistical Society, Series B (Methodological), pp. 215–242, 1958.
[23] A. Ragab, "On multivariate logistic distribution," Microelectronics Reliability, vol. 31, no. 2, pp. 511–519, 1991.
[24] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[25] N. E. Breslow and D. G. Clayton, "Approximate inference in generalized linear mixed models," Journal of the American Statistical Association, vol. 88, no. 421, pp. 9–25, 1993.
[26] S. Mohamed, K. Heller, and Z. Ghahramani, "Bayesian and L1 approaches to sparse unsupervised learning," arXiv preprint arXiv:1106.1157, 2011.
[27] M. W. Seeger and H. Nickisch, "Large scale Bayesian inference and experimental design for sparse linear models," SIAM Journal on Imaging Sciences, vol. 4, no. 1, pp. 166–199, 2011.
[28] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[29] M. P. Berry, C. M. Graham, F. W. McNab, Z. Xu, S. A. Bloch, T. Oni, K. A. Wilkinson, R. Banchereau, J. Skinner, R. J. Wilkinson, et al., "An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis," Nature, vol. 466, no. 7309, pp. 973–977, 2010.
[30] S. Atwell, Y. S. Huang, B. J. Vilhjálmsson, G. Willems, M. Horton, Y. Li, D. Meng, A. Platt, A. M. Tarone, T. T. Hu, et al., "Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines," Nature, vol. 465, no. 7298, pp. 627–631, 2010.
[31] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?," IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
[32] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[33] P. T. Boufounos and R. G. Baraniuk, "1-bit compressive sensing," in Proc. of the 42nd Annual Conference on Information Sciences and Systems (CISS), pp. 16–21, IEEE, 2008.
[34] Y. Plan and R. Vershynin, "One-bit compressed sensing by linear programming," arXiv:1109.4299, 2012.
[35] A. Prékopa, "On logarithmic concave measures and functions," Acta Scientiarum Mathematicarum, vol. 34, pp. 335–343, 1973.
Supplementary Material

A Proof that we can set the observed labels to 1
In the following we show that, without loss of generality, we can assume that all observed labels are 1. The original problem involves the data matrix X̃ = (X̃_1, ..., X̃_n) ∈ R^{d×n} and a vector of n observed labels Y^obs ∈ R^n. The correlated probit model is

Y^obs = sign(X̃^⊤ w + ε̃),   ε̃ ∼ N(0, Σ̃).   (A.1)

We now transform every column of X̃ as X_i = X̃_i ◦ Y_i^obs, where ◦ is the Hadamard (element-wise) product. Multiplying the above equation element-wise with Y^obs yields

1 = sign(X^⊤ w + Y^obs ◦ ε̃),   ε̃ ∼ N(0, Σ̃).   (A.2)

Lastly, we observe that the new noise vector ε = Y^obs ◦ ε̃ has the effective covariance matrix Σ = diag(Y^obs) · Σ̃ · diag(Y^obs). Hence, after the above transformations, the model reads

1 = sign(X^⊤ w + ε),   ε ∼ N(0, Σ),   (A.3)

which proves our claim.
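As a concrete illustration (a sketch, not part of the paper), the transformation above amounts to two lines of NumPy:

```python
import numpy as np

def absorb_labels(X_tilde, Sigma_tilde, y_obs):
    """Transformation of Appendix A: rescale each column of the data matrix by
    its label and conjugate the noise covariance with diag(y), so that all
    labels of the transformed problem equal +1."""
    X = X_tilde * y_obs[None, :]                  # X_i = X~_i * y_i
    Sigma = Sigma_tilde * np.outer(y_obs, y_obs)  # diag(y) Sigma~ diag(y)
    return X, Sigma
```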
B Predicting new labels
When predicting new labels, one can either take correlations with the other data points into account, or predict independently of the correlations. Both cases have their justification, depending on the application. While the first case closely resembles prediction in Gaussian processes [24], in the latter case one simply takes the sign of X^⊤ w for a new data point. To go into more detail on the case where correlations are taken into account, let us introduce letters that indicate the training set (R) and the test set (E), and let Y_E and Y_R be the test and training labels, respectively. We define the mapping

Y_E ↦ Y := (Y_E^⊤, Y_R^⊤)^⊤ ∈ R^{m+n}.   (B.1)

We also concatenate test data and training data as

X = (X_E, X_R) ∈ R^{d×(m+n)}.   (B.2)

Finally, we consider the concatenated kernel matrices

K^i = [ K^i_{EE}  K^i_{ER}
        K^i_{RE}  K^i_{RR} ] ∈ R^{(m+n)×(m+n)}.   (B.3)

We use the weights λ_i that were determined by model selection on the training data (Y_R, X_R) to construct the covariance matrix on the extended space,

Σ = Σ_i λ_i K^i.   (B.4)

Finally, in order to predict new labels Y_E, we evaluate the objective, using X, Y = Y(Y_E), and the training weights w:

Y_E^* = arg min_{Y_E ∈ {±1}^m} L(w | X, Y, Σ).
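For a handful of test points, the arg-min rule above can be evaluated by brute force. The sketch below (not the authors' implementation) assumes a hypothetical helper loss_fn(w, X, Sigma) that returns L^loss after the labels have been absorbed as in Appendix A; it is only meant to illustrate the enumeration over candidate labelings.

```python
import itertools
import numpy as np

def predict_by_enumeration(w, X_E, X_R, y_R, Sigma_full, loss_fn):
    """Evaluate the objective for every candidate test labeling Y_E in
    {-1, +1}^m and return the minimizer; only feasible for small m."""
    m = X_E.shape[1]
    X = np.concatenate([X_E, X_R], axis=1)         # (X_E, X_R), cf. (B.2)
    best_y, best_loss = None, np.inf
    for labels in itertools.product((-1.0, 1.0), repeat=m):
        y = np.concatenate([np.array(labels), y_R])
        X_abs = X * y[None, :]                     # absorb labels (Appendix A)
        Sigma_abs = Sigma_full * np.outer(y, y)
        loss = loss_fn(w, X_abs, Sigma_abs)
        if loss < best_loss:
            best_y, best_loss = np.array(labels), loss
    return best_y
```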
Proof of the Convexity of L
Since Lreg is convex, it is sufficient to show that Lloss is convex in w. Let us recall that a function f is logarithmically convex if f is strictly positive and log f is convex; logarithmical concavity is defined analogously. Furthermore it is known [35] that for a logarithmically concave function R f : Rn+m → R and a convex subset A ⊂ Rn , the function g(x) = A f (x, y)dm y is logarithmic in the entire space Rn . We show now the convexity of Lloss . Since µ(w) is a linear function in w, it is sufficient to show 11
that f (µ)
:= − log
R
Rn +
N (; µ, Σ) dn
is convex in µ. The multivariate normal distribution N is logarithmically concave in (, µ)R> ∈ R2n (since N (; µ(w), Σ) > 0 for all µ, ∈ Rn and log N is concave in (, µ)> ). Therefore Rn N (; µ, Σ) dn + is logarithmically concave in µ. The logarithm of a logarithmically concave function is concave by definition, hence Z log N (; µ, Σ) dn Rn +
is concave. It follows that Lloss is convex.
D Gradient and Hessian
In this section, we calculate the gradient and the Hessian of the unregularized objective,

L^loss(w) = − log ∫_{R^n_+} N(ε; µ(w), Σ) d^n ε.   (D.1)

It will sometimes be more convenient to consider the objective as a function of µ = X^⊤ w rather than of w, in which case we define

L^loss(µ) = − log ∫_{R^n_+} N(ε; µ, Σ) d^n ε.   (D.2)
The following calculation relates the gradient and the Hessian to the first and second moments of the normalized truncated Gaussian p(ε | µ, Σ).

Gradient.

∇_w L^loss(w) = −∇_w log ∫_{R^n_+} N(ε; µ, Σ) d^n ε
  = − ( ∫_{R^n_+} ∇_µ N(ε; µ, Σ) d^n ε / ∫_{R^n_+} N(ε; µ, Σ) d^n ε ) ∂µ/∂w
  = ( ∫_{R^n_+} (ε − µ)^⊤ Σ^{-1} N(ε; µ, Σ) d^n ε / ∫_{R^n_+} N(ε; µ, Σ) d^n ε ) X^⊤
  = E_{p(ε|µ,Σ)}[(ε − µ)^⊤] Σ^{-1} X^⊤
  = (µ_p − µ)^⊤ Σ^{-1} X^⊤.   (D.3)

We defined µ_p = E_{p(ε|µ,Σ)}[ε] as the mean of the truncated Gaussian.

Hessian. We first consider the Hessian matrix of L^loss(µ) and call it B,

B_{ij}(µ) = ∂_{µ_i} ∂_{µ_j} L^loss(µ).   (D.4)

The chain rule relates this object to the Hessian of L^loss(w),

H_{ij}(w) = Σ_{kl} ∂_{µ_k} ∂_{µ_l} L(µ) (∂µ_k/∂w_i)(∂µ_l/∂w_j) = ( X · B(µ(w)) · X^⊤ )_{ij}.   (D.5)

The problem therefore reduces to calculating B(µ), which is n × n, whereas the original Hessian H(w) is d × d. To calculate B(µ), we define

I(µ) = ∫_{R^n_+} exp{−(1/2) (ε − µ)^⊤ Σ^{-1} (ε − µ)} d^n ε,   (D.6)
such that, up to a constant, L^loss(µ) = − log I(µ). The Hessian is given by

B_{ij}(µ) = −∂_{µ_i} ∂_{µ_j} log I(µ) = − ∂_{µ_i} ∂_{µ_j} I(µ) / I(µ) + ( ∂_{µ_i} I(µ) / I(µ) ) ( ∂_{µ_j} I(µ) / I(µ) ).   (D.7)

Note that this also involves the first derivatives of I(µ), which we have already calculated for the gradient. We still need to calculate ∂_{µ_i} ∂_{µ_j} I(µ). To simplify the calculation, we introduce

µ̃ = ε − µ.   (D.8)

As a consequence, ∂_{µ̃_i} = −∂_{µ_i}. Furthermore,

∂_{µ_i} ∂_{µ_j} exp{−(1/2) (ε − µ)^⊤ Σ^{-1} (ε − µ)}
  = ∂_{µ̃_i} ∂_{µ̃_j} exp{−(1/2) µ̃^⊤ Σ^{-1} µ̃}
  = ∂_{µ̃_i} ( −(Σ^{-1} µ̃)_j exp{−(1/2) µ̃^⊤ Σ^{-1} µ̃} )
  = [ (Σ^{-1} µ̃)_i (Σ^{-1} µ̃)_j − Σ^{-1}_{ij} ] exp{−(1/2) µ̃^⊤ Σ^{-1} µ̃}
  = ( Σ^{-1} µ̃ µ̃^⊤ Σ^{-1} − Σ^{-1} )_{ij} exp{−(1/2) µ̃^⊤ Σ^{-1} µ̃}.

Abbreviating exp{· · ·} = exp{−(1/2) (ε − µ)^⊤ Σ^{-1} (ε − µ)}, we find

∂_{µ_i} ∂_{µ_j} I(µ) / I(µ)
  = ∫_{R^n_+} ∂_{µ_i} ∂_{µ_j} exp{· · ·} d^n ε / ∫_{R^n_+} exp{· · ·} d^n ε   (D.9)
  = ∫_{R^n_+} ( Σ^{-1} (ε − µ)(ε − µ)^⊤ Σ^{-1} − Σ^{-1} )_{ij} exp{· · ·} d^n ε / ∫_{R^n_+} exp{· · ·} d^n ε   (D.10)
  = E_{p(ε|µ)}[ ( Σ^{-1} (ε − µ)(ε − µ)^⊤ Σ^{-1} − Σ^{-1} )_{ij} ]
  = ( Σ^{-1} E_{p(ε|µ)}[(ε − µ)(ε − µ)^⊤] Σ^{-1} − Σ^{-1} )_{ij}
  = ( Σ^{-1} Σ_p Σ^{-1} − Σ^{-1} )_{ij}.   (D.11)

For the second term, we just use our known result for the gradient, namely

∂_µ I(µ) / I(µ) = E_{p(ε|µ)}[(ε − µ)^⊤] Σ^{-1} = (µ_p − µ)^⊤ Σ^{-1}.

As a consequence,

( ∂_{µ_i} I(µ) / I(µ) ) ( ∂_{µ_j} I(µ) / I(µ) ) = (∆µ^⊤ Σ^{-1})_i (∆µ^⊤ Σ^{-1})_j = ( Σ^{-1} ∆µ ∆µ^⊤ Σ^{-1} )_{ij},

where, as above, ∆µ = µ_p − µ. This lets us summarize the Hessian matrix B(µ):

B(µ) = Σ^{-1} (Σ_p − ∆µ ∆µ^⊤) Σ^{-1} − Σ^{-1}.   (D.12)

The Hessian of L^loss(w) is therefore

H(w) = X B(µ(w)) X^⊤.   (D.13)
Hessian inversion formula. For the second-order gradient descent scheme, we need to compute the inverse of the Hessian H(w). Let us call D = λ_0 I the (diagonal) Hessian of the regularizer. We use the Woodbury matrix identity,

H^{-1} = (D + X B X^⊤)^{-1}
  = D^{-1} − D^{-1} X (B^{-1} + X^⊤ D^{-1} X)^{-1} X^⊤ D^{-1}   (D.14)
  = λ_0^{-1} I − λ_0^{-2} X (B^{-1} + λ_0^{-1} X^⊤ X)^{-1} X^⊤.

Note that this identity does not require us to invert a d × d matrix; it only involves the inversion of n × n matrices (in our genetic applications, the number of samples n is typically in the hundreds, while the number of genetic features d is of the order 10^4–10^5). We first precompute the linear kernel X^⊤ X. We also use the fact that the product H^{-1} ∇_w L can be computed more efficiently than by first calculating the Hessian inverse and then multiplying it with the gradient.
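A minimal sketch of the resulting matrix-free Newton direction (assuming B is invertible; not the authors' code):

```python
import numpy as np

def newton_direction(grad, X, B, lam0):
    """Compute H^{-1} grad with H = lam0*I + X B X^T via the Woodbury identity
    (D.14), inverting only n x n matrices. X is d x n, B is n x n, grad has
    length d."""
    XtX = X.T @ X                                  # precomputed linear kernel
    inner = np.linalg.inv(B) + XtX / lam0          # B^{-1} + X^T D^{-1} X
    # H^{-1} g = g/lam0 - (1/lam0^2) X (B^{-1} + X^T X/lam0)^{-1} X^T g
    return grad / lam0 - X @ np.linalg.solve(inner, X.T @ grad) / lam0**2
```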
E Supplement to the Empirical Analysis

E.1 Additional Results for the Computer Malware Detection Experiment of Section 3.4 in the Main Text
Table E.1 below shows the top features found by the correlated and the uncorrelated probit model.

(a) Correlated probit
weight   feature                                                        rank (uncorr.)
 1.08    permission::android.permission.SEND_SMS                        1
 0.66    permission::com.android.launcher.permission.INSTALL_SHORTCUT   2
 0.36    call::getSubscriberId                                          14
 0.18    permission::android.permission.ACCESS_WIFI_STATE               3
 0.13    feature::android.hardware.telephony                            16
 0.12    activity::.Background                                          762
 0.10    permission::android.permission.SET_WALLPAPER                   15
 0.09    api_call::android/telephony/TelephonyManager;->getDeviceId     7
-0.09    real_permission::android.permission.ACCESS_FINE_LOCATION       5
 0.09    real_permission::android.permission.READ_PHONE_STATE           8

(b) Uncorrelated probit
weight   feature                                                        rank (corr.)
 1.97    permission::android.permission.SEND_SMS                        1
 1.27    permission::com.android.launcher.permission.INSTALL_SHORTCUT   2
 0.87    permission::android.permission.ACCESS_WIFI_STATE               4
-0.57    url::www.w3.org                                                59
-0.57    real_permission::android.permission.ACCESS_FINE_LOCATION       9
-0.56    permission::android.permission.CALL_PHONE                      19
 0.54    api_call::android/telephony/TelephonyManager;->getDeviceId     8
 0.53    real_permission::android.permission.READ_PHONE_STATE           10
 0.49    call::Cipher(DES)                                              48
-0.44    intent::android.intent.category.DEFAULT                        14

Table E.1: Top features determined by correlated (TOP) and uncorrelated (BOTTOM) probit regression on the Android malware dataset, together with the corresponding feature weights w_j and the rank of each feature (sorted in descending order of absolute weight) in the weight vector of the respective competitor (uncorrelated/correlated probit regression).