Finite-Sample Generalization Theory for Machine Learning Practice for Science

Nageswara S. V. Rao
Oak Ridge National Laboratory, Oak Ridge, TN, USA
[email protected]
I. INTRODUCTION

Mathematical methods have been instrumental in developing and analyzing machine learning algorithms over the past decades, including the structural risk minimization methods of Vapnik [6], the bagging and boosting methods of Breiman [1], and the theory of the learnable by Valiant [5], to name a few. Indeed, these mathematical foundations not only led to practical algorithms but also provided rigorous characterizations of their ability to generalize beyond the data used in learning. The recent explosion of machine learning codes and tools (R modules, Python modules, TensorFlow frameworks, and others), combined with the availability of large data sets, has made them readily applicable to science applications in materials science, embrittlement prediction, computational chemistry and several other areas. But their applicability to science problems is generally unclear due to the lack of high-confidence generalization estimates for their outputs (except in a few cases). Generalization theories, which originated in statistics, pattern recognition and computational learning, address this aspect to a certain extent under different frameworks. Foundational results provide distribution-free, finite-sample performance bounds on the generalization of learning methods using combinatorial quantities such as the Vapnik-Chervonenkis (VC) dimension [6].

There is a gap between these generalization theories and the current practice of machine learning frameworks, codes in particular. The theories, while characterizing the power of machine learning methods, have major practical limitations:
(i) Practical Generalization Equations for Science: Generalization error bounds are in general too loose to provide the confidence measures needed and useful in practice; indeed, even "big science data" is often not big enough to provide high-confidence performance guarantees.
(ii) Informational Incompleteness and Computational Intractability: There is no universally best machine learning method, since any finite-sample method is only slightly better than a random guesser for certain data sets. Moreover, the underlying computational problem is NP-hard in general, thereby limiting solutions to approximations; even supercomputers do not scale well to large problem sizes.
We outline directions to overcome the first limitation by sharpening the generalization equations to exploit practical boundedness conditions together with smooth, non-smooth and algebraic conditions that are specific to DOE science areas (materials, facilities, data transport and others). We address the second limitation by utilizing information fusion methods that combine a broad spectrum of machine learning methods to ensure the best of their performance with high confidence.
II. DISTRIBUTION-FREE, FINITE-SAMPLE RESULTS

A generic problem consists of learning a relationship between two random variables X and Y based on a finite sample (X_1, Y_1), ..., (X_l, Y_l) drawn from an unknown distribution P_{Y,X}. A machine learning method selects an estimate f̂ from a function class F to approximate the best function f* ∈ F that minimizes the expected error I(f) = ∫ C(f(X), Y) dP_{Y,X}, defined by a cost function C. The generalization condition is

P[ I(f̂) − I(f*) > ε ] < δ,    (1)

for ε > 0 and 0 < δ < 1. This condition ensures that the error of f̂ is within ε of the optimal error (of f*) with probability 1 − δ, irrespective of the underlying measured or computed data distribution P_{Y,X}. Consider that the empirical error I_emp(f) = (1/l) Σ_{i=1}^{l} C(f(X_i), Y_i) is minimized by f̂ ∈ F by a machine learning algorithm. If F has finite capacity, then under bounded error, for a sufficiently large sample the condition in Eq (1) is guaranteed; a more general result ensures this condition under a finite scale-sensitive dimension [4]. While foundational to machine learning, these generalization equations are often very loose, as reflected by δ being far from 0. The challenge is to identify and exploit specific properties of F to derive sharpened generalization equations that explicitly incorporate and reflect the variables and parameters of science domains.
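As a concrete illustration of empirical error minimization over a function class, the following Python sketch selects f̂ from a small candidate class by minimizing I_emp under a quadratic cost. The data, the candidate class of linear functions, and the cost are hypothetical placeholders introduced here, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample (X_i, Y_i), i = 1..l, drawn from an unknown P_{Y,X}.
l = 200
X = rng.uniform(-1.0, 1.0, size=l)
Y = np.sin(np.pi * X) + 0.1 * rng.normal(size=l)

# A small candidate class F: linear functions f(x) = a*x + b on a parameter grid.
candidates = [(a, b) for a in np.linspace(-2, 2, 41) for b in np.linspace(-1, 1, 21)]

def cost(pred, y):
    # Quadratic cost C(f(X), Y), bounded in practice on bounded domains.
    return (pred - y) ** 2

def I_emp(a, b):
    # Empirical error I_emp(f) = (1/l) * sum_i C(f(X_i), Y_i).
    return float(np.mean(cost(a * X + b, Y)))

# f_hat minimizes the empirical error over F (empirical risk minimization).
a_hat, b_hat = min(candidates, key=lambda p: I_emp(*p))
print("f_hat: a=%.3f b=%.3f I_emp=%.4f" % (a_hat, b_hat, I_emp(a_hat, b_hat)))
```

The sketch only exhibits the estimator f̂; the finite-sample guarantee of Eq (1) is supplied by the capacity conditions on F discussed above, not by the computation itself.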
III. SHARPENED GENERALIZATION BOUNDS

A. Smoothness Conditions

Smooth function estimates are generated by several machine learning methods, including sigmoidal neural networks, radial basis functions, potential functions, polynomials and others. An important subclass of smooth functions are the Lipschitz continuous functions, for which |f(x) − f(x + ∂x)| ≤ L ‖∂x‖, where L is a Lipschitz constant. Differentiable functions satisfy this condition, wherein the maximum derivative can be used as L. This property, combined with the boundedness of domain variables such as embrittlement levels or network capacity, provides generalization bounds that explicitly show this dependence. This method applied to sigmoid neural networks shows that the sample size needed to ensure Eq (1) is linear in the number of neural network parameters [3], as opposed to the quadratic dependence of previous bounds for unbounded weights. In particular, we have the confidence estimate

δ = 8 (32W/ε)^{h(d+2)} e^{−ε²l/512}

for a sigmoid network with h hidden nodes and input dimension d with weights suitably bounded by W. We emphasize that several statistical estimates learned by R modules are smooth, and several science variables and parameters are bounded.

B. Non-Smooth and Algebraic Conditions

Machine learning methods also employ tree-based methods, which are non-smooth. In practice, the parameters are bounded and the learned functions have a finite (often small) number of jumps, which leads to their bounded total variation V < ∞. In this case, we have the confidence estimate

δ = 8 (1 + 128V/ε)² e^{−ε²l/2048}.    (2)

The unimodal and monotone functions that arise in the material embrittlement application have this property, which can be further exploited to sharpen the generalization equations. The algebraic structure of the domain variables can also be similarly exploited. When F is a v-dimensional vector space, we have δ = 8 (128e/ε)^v e^{−ε²l/512}. Also, when domain variables constitute real-algebraic sets, for example network transport profiles, the generalization equations can be derived by utilizing the boundedness of their VC dimension.
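To make the dependence on ε, δ and l concrete, the following Python sketch evaluates these confidence estimates and inverts Eq (2) for the required sample size. The constants are taken from the formulas above as reconstructed here, and the parameter values in the example are illustrative only.

```python
import math

def delta_sigmoid(eps, l, h, d, W):
    # Confidence for a sigmoid network with h hidden nodes, input dimension d,
    # and weights bounded by W, using the constants quoted above.
    return 8.0 * (32.0 * W / eps) ** (h * (d + 2)) * math.exp(-eps**2 * l / 512.0)

def delta_total_variation(eps, l, V):
    # Confidence under bounded total variation V, as in Eq (2).
    return 8.0 * (1.0 + 128.0 * V / eps) ** 2 * math.exp(-eps**2 * l / 2048.0)

def sample_size_total_variation(eps, delta, V):
    # Smallest l for which Eq (2) meets the target confidence delta.
    return math.ceil(2048.0 / eps**2 *
                     math.log(8.0 * (1.0 + 128.0 * V / eps) ** 2 / delta))

# Illustrative values: eps = 0.1, delta = 0.05, total variation V = 2.
print(sample_size_total_variation(eps=0.1, delta=0.05, V=2.0))
```

Even for these moderate values of ε and V, the required sample size runs into the millions, which illustrates why further domain-specific sharpening of the bounds matters in practice.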
IV. INCOMPLETENESS, INTRACTABILITY AND FUSION

The incompleteness theorem of Devroye [2] shows that there is no universally best machine learning method that uses a finite sample; informally, it demonstrates that the training sample, being finite, does not have sufficient information to optimize over the infinite class of data distributions. Indeed, any such learning method can be reduced to a random guesser for certain data distributions that are specific to it. As a practical consequence, diverse machine learning methods continue to be developed as newer learning problems arise in various disciplines, such as computational chemistry, nuclear science and data transfers. These methods employ powerful techniques to exploit the structure and distributions of domain-specific parameters. Together, they provide a set of rich, diverse estimates, but with varying performances even for a single science problem. Information fusion methods combine these estimates to ensure generalization at least as good as that of their best subset [4], as illustrated for embrittlement prediction of light-water reactors. Furthermore, these fusers can be learned from training samples along with their generalization equations.

Another limitation of machine learning is due to computational intractability, since practical computations can only guarantee approximations. For example, different starting weights provide different approximations by the backpropagation algorithm used to train sigmoidal neural networks. Then, a nearest neighbor fuser combines these estimates to provide a solution superior to all of them. In general, for a fuser class F_F for combining estimators f̂_i ∈ F_i, i = 1, 2, ..., N, that satisfies the isolation property [4], we have

I(f*_F) = min_{i=1,...,N} I(f̂_i) − Δ,

where Δ is the improvement achieved by the optimal fuser f*_F over the best estimator. For the learned fuser f̂_F that minimizes the empirical error, the confidence δ is given by Eq (2) with V = Σ_{i=1}^{N} V_i, where V_i is the total variation of F_i. By exploiting the domain information, these fuser generalization equations can be sharpened as described in the previous sections.
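The following Python sketch shows one simple way a fuser can be learned by empirical error minimization: a linear combination of N component estimates with weights chosen on a grid. This is a hypothetical illustration, not the nearest neighbor or projective fuser of [4]; the data and the component estimates f̂_i are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical component estimates f_hat_i (e.g., from different ML codes),
# evaluated on a common training sample (X_j, Y_j), j = 1..l.
l, N = 300, 3
X = rng.uniform(0.0, 1.0, size=l)
Y = X**2 + 0.05 * rng.normal(size=l)
F_hat = np.stack([X**2 + 0.2,                        # biased estimate
                  0.9 * X**2,                        # scaled estimate
                  X**2 + 0.1 * rng.normal(size=l)])  # noisy estimate

def I_emp(pred):
    # Empirical error of a combined estimate under quadratic cost.
    return float(np.mean((pred - Y) ** 2))

# A simple linear fuser f_F(x) = sum_i w_i * f_hat_i(x): choose weights on a
# grid by empirical error minimization (a stand-in for the learned fuser f_hat_F).
grid = np.linspace(0.0, 1.0, 21)
best = min(((w1, w2, 1.0 - w1 - w2)
            for w1 in grid for w2 in grid if w1 + w2 <= 1.0 + 1e-9),
           key=lambda w: I_emp(np.tensordot(w, F_hat, axes=1)))

print("component errors:", [round(I_emp(F_hat[i]), 4) for i in range(N)])
print("fused error:     ", round(I_emp(np.tensordot(best, F_hat, axes=1)), 4),
      "with weights", tuple(round(w, 2) for w in best))
```

Because the weight grid contains the unit vectors, the fused empirical error on the training sample is never worse than that of the best individual estimate, which mirrors the isolation property described above; the confidence of the learned fuser then follows from Eq (2) with V = Σ V_i.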
V. FUTURE DIRECTIONS AND CHALLENGES

The challenges and future directions for generalization theory for the practice of machine learning in science areas include:
(i) Science Applications and Infrastructures: Generalization equations and fusers for areas including embrittlement of irradiated materials in light-water reactors, computational chemistry functional relationships between nano, micro and macro parameters, and throughput of wide-area Lustre and GPFS file transfers.
(ii) Sharpened Generalization Equations: Boundedness conditions from science areas, together with smooth, non-smooth and algebraic conditions, exploited to provide practical finite-sample generalization equations.
(iii) Information Fusion: Information fusion methods customized to overcome computational challenges and achieve the best performance of a wide spectrum of available, rich and diverse machine learning codes and methods.

REFERENCES
[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[2] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[3] N. S. V. Rao. Simple sample bound for feedforward sigmoid networks with bounded weights. Neurocomputing, 29:115–122, 1999.
[4] N. S. V. Rao. Measurement-based statistical fusion methods for distributed sensor networks. In S. S. Iyengar and R. R. Brooks, editors, Distributed Sensor Networks, 2nd edition. Chapman and Hall/CRC Publishers, 2011.
[5] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[6] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.