
Large-Scale Graph-Guided Feature Selection with Maximum Flows

Chloé-Agathe Azencott∗

Knowledge discovery from structured data is one of the central topics in data mining. In particular, graphs, or networks, have attracted considerable attention in the community, as they may represent molecular, biological, social, or other types of systems whose functionality and mechanisms are far from being completely understood. A crucial concern when studying such systems is to determine which part of the graph is responsible for performing a particular function. Hence the general problem of feature selection on graphs, where features coincide with vertices and the graph topology can be viewed as a priori knowledge about the relationships between features, is of broad interest across disciplines. As one example among many, identifying sets of mutations in interacting genes that may influence heritable traits is a core concern of association genetics.

The common approach to this problem is to use Lasso-based regression [3] with an ℓ1-regularizer of the weight vector and additional structured regularizers that represent relationships between features. In spite of their success, we see a number of drawbacks to regression-based approaches in this context. First, they do not easily scale to millions or even hundreds of thousands of features, although such a setting is common, for instance, in genetics. Second, regression-based approaches concentrate on optimizing a prediction loss, while the problem to solve is often formulated in terms of finding features that are relevant for, correlated to or associated with a property of interest.

These two issues have been addressed by our recent work in statistical genetics, which proposes a new formulation of graph-constrained feature selection called SConES [1]. This method directly maximizes a score of association rather than minimizing a prediction error. Its optimization scheme is exact and efficient, thanks to a maximum flow reformulation, and it has been empirically shown to recover more causal features than its regression-based counterparts. We have also proposed a new formulation of SConES in a multi-task setting, to improve feature selection in each task by combining and solving multiple tasks simultaneously. Multi-SConES [2] is flexible enough to allow selecting overlapping but non-identical sets of features across related tasks, and incorporating different structural constraints for different tasks.

We propose to discuss this recently published work, and related issues in the context of the framework we developed, pertaining in particular to the choice of graph regularizer (SConES being particularly suitable for selecting modules of a modular graph), the choice of regularization parameters (currently based on stability/consistency criteria, but which could for instance rely on p-values), and the incorporation of non-linear models.
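
To make the maximum flow reformulation mentioned above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the selection objective has the form "sum of per-feature association scores, minus a sparsity penalty eta per selected feature, minus a connectivity penalty lam times the total weight of graph edges that link a selected feature to an unselected one"; this form, the parameter names eta and lam, the function names, and the toy scores are all assumptions made for illustration. Under that assumption, the optimal feature set is the source side of a minimum s-t cut, computed here with a small pure-Python Edmonds-Karp routine.

```python
from collections import deque


def min_cut_source_side(capacity, source, sink):
    """Run Edmonds-Karp max flow on a dict {(u, v): capacity} and return
    the nodes (excluding the source) still reachable from the source in
    the final residual graph, i.e. the source side of a minimum cut."""
    residual = {source: {}, sink: {}}
    for (u, v), cap in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0.0) + cap
        residual[v].setdefault(u, 0.0)  # reverse arc for flow cancellation

    def reachable_with_parents():
        # Breadth-first search over arcs with positive residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_with_parents()
        if sink not in parent:
            # No augmenting path left: the reachable set defines a min cut.
            return set(parent) - {source}
        # Find the bottleneck capacity along the path found by the search.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount of flow along that path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u


def select_features(scores, edges, eta, lam):
    """Hypothetical helper: pick the feature set S maximizing
    sum_{i in S} (scores[i] - eta) - lam * (weight of edges leaving S)."""
    capacity = {}
    for i, score in enumerate(scores):
        gain = score - eta
        if gain > 0:
            capacity[("s", i)] = gain   # paid if feature i is left out
        elif gain < 0:
            capacity[(i, "t")] = -gain  # paid if feature i is selected
    for i, j, weight in edges:
        # Paid whenever i and j end up on different sides of the cut.
        capacity[(i, j)] = capacity.get((i, j), 0.0) + lam * weight
        capacity[(j, i)] = capacity.get((j, i), 0.0) + lam * weight
    return sorted(min_cut_source_side(capacity, "s", "t"))


# Toy example (made-up scores): six features on a path graph 0-1-2-3-4-5.
scores = [3.0, 0.5, 3.0, 0.2, 0.1, 0.3]
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0), (4, 5, 1.0)]

print(select_features(scores, edges, eta=1.0, lam=0.0))  # [0, 2]
print(select_features(scores, edges, eta=1.0, lam=1.0))  # [0, 1, 2]
```

In this toy example, ignoring the graph (lam=0.0) keeps only the two strongly associated features, whereas the connectivity penalty (lam=1.0) also pulls in the weakly scored feature 1 that links them, so the selection forms a connected module. For large graphs, a dedicated maximum flow solver would be a more realistic choice than this short routine.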

References

[1] C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, and K. Borgwardt. Efficient network-guided multi-locus association mapping with graph cuts. Bioinformatics, 29(13):i171–i179, 2013.

[2] M. Sugiyama, C.-A. Azencott, D. Grimm, Y. Kawahara, and K. Borgwardt. Multi-task feature selection with multiple networks via maximum flows. In Proceedings of the 14th SIAM International Conference on Data Mining, 2014.

[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.

∗ Mines ParisTech, Centre for Computational Biology (CBIO), 77300 Fontainebleau, France – Institut Curie, 75248 Paris Cedex 05, France – INSERM U900, 75248 Paris Cedex 05, France
