Machine Learning for Automated Reasoning

Doctoral thesis (Proefschrift)

in partial fulfilment of the requirements for the degree of doctor from Radboud University Nijmegen, on the authority of the rector magnificus, prof. mr. S.C.J.J. Kortmann, according to the decision of the Council of Deans, to be defended in public on Monday, 14 April 2014 at 10:30 hours

by

Daniel A. Kühlwein

born on 7 November 1982 in Balingen, Germany

Supervisors (promotoren):
Prof. dr. Tom Heskes
Prof. dr. Herman Geuvers

Co-supervisor (copromotor):
Dr. Josef Urban

Manuscript committee (manuscriptcommissie):
Prof. dr. M.C.J.D. van Eekelen (Open University, the Netherlands)
Prof. dr. L.C. Paulson (University of Cambridge, UK)
Dr. S. Schulz (TU Munich, Germany)

This research was supported by the NWO project Learning2Reason (612.001.010).

Copyright © 2013 Daniel Kühlwein
ISBN 978-94-6259-132-5
Printed by Ipskamp Drukkers, Nijmegen

Contents

1 Introduction
    1.1 Formal Mathematics
        1.1.1 Interactive Theorem Proving
        1.1.2 Automated Theorem Proving
        1.1.3 Industrial Applications
        1.1.4 Learning to Reason
    1.2 Machine Learning in a Nutshell
    1.3 Outline of this Thesis

2 Premise Selection in ITPs as a Machine Learning Problem
    2.1 Premise Selection as a Machine-Learning Problem
        2.1.1 The Training Data
        2.1.2 What to Learn
        2.1.3 Features
    2.2 Naive Bayes and Kernel-Based Learning
        2.2.1 Formal Setting
        2.2.2 A Naive Bayes Classifier
        2.2.3 Kernel-based Learning
        2.2.4 Multi-Output Ranking
    2.3 Challenges
        2.3.1 Features
        2.3.2 Dependencies
        2.3.3 Online Learning and Speed

3 Overview of Premise Selection Techniques
    3.1 Premise Selection Algorithms
        3.1.1 Premise Selection Setting
        3.1.2 Learning-based Ranking Algorithms
        3.1.3 Other Algorithms Used in the Evaluation
        3.1.4 Techniques Not Included in the Evaluation
    3.2 Machine Learning Evaluation Metrics
    3.3 Evaluation
        3.3.1 Evaluation Data
        3.3.2 Machine Learning Evaluation
        3.3.3 ATP Evaluation
    3.4 Combining Premise Rankers
    3.5 Conclusion

4 Learning from Multiple Proofs
    4.1 Learning from Different Proofs
    4.2 The Machine Learning Framework and the Data
    4.3 Using Multiple Proofs
        4.3.1 Substitutions and Unions
        4.3.2 Premise Averaging
        4.3.3 Premise Expansion
    4.4 Results
        4.4.1 Experimental Setup
        4.4.2 Substitutions and Unions
        4.4.3 Premise Averaging
        4.4.4 Premise Expansions
        4.4.5 Other ATPs
        4.4.6 Comparison With the Best Results Obtained so far
        4.4.7 Machine Learning Evaluation
    4.5 Conclusion

5 Automated and Human Proofs in General Mathematics
    5.1 Introduction: Automated Theorem Proving in Mathematics
    5.2 Finding proofs in the MML with AI/ATP support
        5.2.1 Mining the dependencies from all MML proofs
        5.2.2 Learning Premise Selection from Proof Dependencies
        5.2.3 Using ATPs to Prove the Conjectures from the Selected Premises
    5.3 Proof Metrics
    5.4 Evaluation
        5.4.1 Comparing weights
    5.5 Conclusion

6 MaSh - Machine Learning for Sledgehammer
    6.1 Introduction
    6.2 Sledgehammer and MePo
    6.3 The Machine Learning Engine
        6.3.1 Basic Concepts
        6.3.2 Input and Output
        6.3.3 The Learning Algorithm
    6.4 Integration in Sledgehammer
        6.4.1 The Low-Level Learner Interface
        6.4.2 Learning from and for Isabelle
        6.4.3 Relevance Filters: MaSh and MeSh
        6.4.4 Automatic and Manual Control
        6.4.5 Nonmonotonic Theory Changes
    6.5 Evaluations
        6.5.1 Evaluation on Large Formalizations
        6.5.2 Judgment Day
    6.6 Related Work and Contributions
    6.7 Conclusion

7 MaLeS - Machine Learning of Strategies
    7.1 Introduction: ATP Strategies
        7.1.1 The Strategy Selection Problem
        7.1.2 Overview
    7.2 Finding Good Search Strategies with MaLeS
    7.3 Strategy Scheduling with MaLeS
        7.3.1 Notation
        7.3.2 Features
        7.3.3 Runtime Prediction Functions
        7.3.4 Crossvalidation
        7.3.5 Creating Schedules from Prediction Functions
    7.4 Evaluation
        7.4.1 E-MaLeS
        7.4.2 Satallax-MaLeS
        7.4.3 LEO-MaLeS
        7.4.4 Further Remarks
        7.4.5 CASC
    7.5 Using MaLeS
        7.5.1 E-MaLeS, LEO-MaLeS and Satallax-MaLeS
        7.5.2 Tuning E, LEO-II or Satallax for a New Set of Problems
        7.5.3 Using a New Prover
    7.6 Future Work
    7.7 Conclusion

Contributions

Bibliography

Scientific Curriculum Vitae

Summary

Samenvatting

Acknowledgments

Chapter 1

Introduction

Heuristically, a proof is a rhetorical device for convincing someone else that a mathematical statement is true or valid.
— Steven G. Krantz, [52]

I am entirely convinced that formal verification of mathematics will eventually become commonplace.
— Jeremy Avigad, [6]

This chapter is based on: "A Survey of Axiom Selection as a Machine Learning Problem", submitted to "Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch".

1.1 Formal Mathematics

The foundations of modern mathematics were laid at the end of the 19th century and the beginning of the 20th century. Seminal works such as Frege’s Begriffsschrift [30] established the notion of mathematical proofs as formal derivations in a logical calculus. In Principia Mathematica [118], Whitehead and Russell set out to show by example that all of mathematics can be derived from a small set of axioms using an appropriate logical calculus. Even though Gödel later showed that no effectively generated consistent axiom system can capture all mathematical truth [32], Principia Mathematica showed that most of normal mathematics can indeed be catered for by a formal system. Proofs could now be rigidly defined, and verifying the validity of a proof was a simple matter of checking whether the rules of the calculus were correctly applied. But formal proofs were extremely tedious to write (and read), and so they found no audience among practicing mathematicians.

1.1.1 Interactive Theorem Proving

With the advent of computers, formal mathematics became a more realistic proposal. Interactive theorem provers (ITPs), or proof assistants, are computer programs that support the creation of formal proofs. Proofs are written in the input language of the ITP, which can be thought of as being at the intersection between a programming language, a logic, and a mathematical typesetting system. In an ITP proof, each statement the user makes gives rise to a proof obligation. The ITP ensures that every proof obligation is met with a correct proof. ACL2 [47], Coq [11], HOL4 [90], HOL Light [39], Isabelle [68], Mizar [35], and PVS [71] are perhaps the most widely used ITPs.

Theorem. There are infinitely many primes: for every number n there exists a prime p > n.

Proof [after Euclid]. Given n. Consider k = n! + 1, where n! = 1 · 2 · 3 · . . . · n. Let p be a prime that divides k. For this number p we have p > n: otherwise p ≤ n; but then p divides n!, so p cannot divide k = n! + 1, contradicting the choice of p. QED

Figure 1.1: An informal proof that there are infinitely many prime numbers [117]

Figures 1.1 and 1.2 show a simple informal proof and the corresponding Isabelle proof. ITPs typically provide built-in and programmable automation procedures, called tactics, for performing reasoning. In Figure 1.2, the by command specifies which tactic should be applied to discharge the current proof obligation.

Developing proofs in ITPs usually requires a lot more work than sketching a proof with pen and paper. Nevertheless, the benefit of gaining quasi-certainty about the correctness of the proof led a number of mathematicians to adopt these systems. One of the largest mechanization projects is probably the ongoing formalization of the proof of Kepler’s conjecture by Thomas Hales and his colleagues in HOL Light [37]. Other major undertakings are the formal proofs of the Four-Color Theorem [33] and of the Odd-Order Theorem [34] in Coq, both developed under Georges Gonthier’s leadership. In terms of mathematical breadth, the Mizar Mathematical Library [61] is perhaps the main achievement of the ITP community so far: with nearly 52 000 theorems, it covers a large portion of the mathematics taught at the undergraduate level.

1.1.2 Automated Theorem Proving

In contrast to interactive theorem provers, automated theorem provers (ATPs) work without human interaction. They take a problem as input, consisting of a set of axioms and a conjecture, and attempt to deduce the conjecture from the axioms. The TPTP (Thousands of Problems for Theorem Provers) library [91] has established itself as a central infrastructure for exchanging ATP problems. Its main developer also organizes an annual competition, the CADE ATP Systems Competition (CASC) [95], that measures progress in this field. E [84], SPASS [114], Vampire [77], and Z3 [66] are well-known ATPs for classical first-order logic.

theorem Euclid: ∃p ∈ prime. n < p
proof
  let ?k = n! + 1
  obtain p where prime: p ∈ prime and dvd: p dvd ?k
    using prime-factor-exists by auto
  have n < p
  proof
    have ¬ p ≤ n
    proof
      assume p ≤ n
      with prime-g-zero have p dvd n! by (rule dvd-factorial)
      with dvd have p dvd ?k − n! by (rule dvd-diff)
      then have p dvd 1 by simp
      with prime show False using prime-nd-one by auto
    qed
    then show ?thesis by simp
  qed
  from this and prime show ?thesis ..
qed

corollary ¬ finite prime
  using Euclid by (fastsimp dest!: finite-nat-set-is-bounded simp: le-def)

Figure 1.2: An Isabelle proof corresponding to the informal proof of Figure 1.1 [117]

Some researchers use ATPs to try to solve open mathematical problems. William McCune’s proof of the Robbins conjecture using a custom ATP is the main success story on this front [62]. More recently, ATPs have also been integrated into ITPs [16, 109, 46], where they help increase the productivity by reducing the number of manual interactions needed to carry out a proof. Instead of using a built-in tactic, the ITP translates the current proof obligation (e.g., the lemma that the user has just stated but not proved yet) into an ATP problem. If the ATP can solve it, the proof is translated to the logic of the ITP and the user can proceed. In Isabelle, the component that integrates ATPs is called Sledgehammer [16]. The process is illustrated in Figure 1.3 and a detailed description can be found in Section 6.2. In Chapter 6, we show that almost 70% of the proof obligations arising in a representative Isabelle corpus can be solved by ATPs.

1.1.3 Industrial Applications

Apart from mathematics, formal proofs are also used in industry. With the ever increasing complexity of software and hardware systems, quality assurance is a large part of the time and money budget of projects. Formal mathematics can be used to prove that an implementation meets a specification. Although some tests might still be mandated by certification authorities, formal proofs can both drastically reduce the testing burden and increase confidence that the systems are bug-free. AMD and Intel have been verifying floating-point procedures since the late 1990s [65, 40], as a consequence of the Pentium bug. Microsoft has had success applying formal verification methods to Windows device drivers [7]. One of the largest software verification projects so far is seL4, a formally verified operating system kernel [48].

Figure 1.3: Sledgehammer integrates ATPs (here E) into Isabelle. [Diagram: Isabelle passes a proof obligation to Sledgehammer, which produces a first-order problem for the ATP; the ATP proof is translated back into an Isabelle proof.]

1.1.4 Learning to Reason

One of the main reasons why formal mathematics and its related technologies have not become mainstream yet is that developing ITP proofs is tedious. The reasoning capabilities of ATPs and ITP tactics are in many respects far behind what is considered standard for a human mathematician. Developing an interactive proof requires not only knowledge of the subject of the proof, but also of the ITP and its libraries.

One way to make users of ITPs more productive is to improve the success rate of ATPs. ATPs struggle with problems that have too many unnecessary axioms since they increase the search space. This is especially an issue when using ATPs from an ITP, where users have access to thousands of premises (axioms, definitions, lemmas, theorems, and corollaries) in the background libraries. Each premise is a potential axiom for an ATP. Premise selection algorithms heuristically select premises that are likely to be useful for inclusion as axioms in the problem given to the ATP.

A terminological note is in order. ITP axioms are fundamental assumptions in the common mathematical sense (e.g., the axiom of choice). In contrast, ATP axioms are arbitrary formulas that can be used to establish the conjecture. In an ITP, we call statements that can be used for proving a new statement premises. Alternative names are facts (mainly in the Isabelle community), items, or just lemmas. After a new statement has been proven, it becomes a premise for all following statements.

Learning mathematics involves studying proofs to develop a mathematical intuition. Experienced mathematicians often know how to approach a new problem by simply looking at its statement. Assume that p is a prime number and a, b ∈ N − {0}. Consider the following statement:

If p | ab, then p | a or p | b.

Even though mathematicians usually know about many different areas (e.g., linear algebra, probability theory, numerics, analysis), when trying to prove the above statement they would ignore those areas and rely on their knowledge about number theory. At an abstract level, they perform premise selection to reduce their search space.

Most common premise selection algorithms rely on (recursively) comparing the symbols and terms of the conjecture and axioms [41, 64]. For example, if the conjecture involves π and sin, they will prefer axioms that also talk about either of these two symbols, ideally both. The main drawback of such approaches is that they focus exclusively on formulas, ignoring the rich information contained in proofs. In particular, they do not learn from previous proofs.

1.2 Machine Learning in a Nutshell

This section aims to provide a high-level introduction to machine learning; for a more thorough discussion, we refer to standard textbooks [13, 60, 67]. Machine learning concerns itself with extracting information from data. Some typical examples of machine learning problems are listed below.

Spam classification: Predict if a new email is spam.
Face detection: Find human faces in a picture.
Web search: Predict the websites that contain the information the user is looking for.

The result of a learning algorithm is a prediction function that takes a new datapoint (email, picture, search query) and returns a target value (spam / not spam, location of faces, relevant websites). The learning is done by optimizing a score function over a training dataset. Typical score functions are accuracy (how many emails were correctly labeled?) and the root mean square error (the Euclidean distance between the predicted values and the real values). Elements of the training dataset are datapoints together with their intended value. For example:

Spam classification: A set of emails together with their classification.
Face detection: A set of pictures where all faces are marked.
Web search: A set of tuples of queries and relevant websites.

The performance of the learned function heavily depends on the quality of the training data, as expressed by the aphorism “Garbage in, garbage out.” If the training data is not representative for the problem, the prediction function will likely not generalize to new data.

In addition to the training data, problem features are also essential. Features are the input of the prediction function and should describe the relevant attributes of the datapoint. A datapoint can have several possible feature representations. Feature engineering concerns itself with identifying relevant features [59]. To simplify computations, most machine learning algorithms require that the features are a (sparse) real-valued vector. Potential features are listed below.

Spam classification: A list of all the words occurring in the email.
Face detection: The matrix containing the color values of the pixels.
Web search: The n-grams of the query.

From a mathematical point of view, most machine learning problems can be reduced to an optimization problem. Let D ⊆ X × T be a training dataset consisting of datapoints and their corresponding target value. Let ϕ : X → F be a feature function that maps a datapoint to its feature representation in the feature space F (usually a subset of R^n for some n ∈ N). Furthermore, let 𝓕 ⊆ (F → T) be a set of functions that map features to the target space, and let s : D × 𝓕 → R be a (convex) score function. One possible goal is to find the function f ∈ 𝓕 that maximizes the average score over the training set D. The main differences between various learning algorithms are the function space 𝓕 and the score function s they use.

If the function space is too expressive, overfitting may occur: the learned function f ∈ 𝓕 might perform well on the training data D, but poorly on unseen data. A simple example is trying to fit a polynomial of degree n − 1 through n training datapoints; this will give perfect scores on the training data but is likely to yield a curve that behaves so wildly as to be useless to make predictions. Regularization is used to balance function complexity with the result of the score function. To estimate how well a learning algorithm generalizes or to tune metaparameters (e.g., which prior to use in a Bayesian model), cross-validation partitions the training data into two sets: one used for training, the other for evaluation. Section 2.2.4 gives an example of metaparameter tuning with cross-validation.
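To make the overfitting example concrete, here is a short sketch (our own toy illustration, assuming NumPy is available) that fits polynomials of increasing degree to a handful of noisy datapoints. The degree n − 1 fit drives the training error to essentially zero, illustrating why training-set fit alone is a poor measure of quality.

import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)   # noisy targets

for degree in (1, 3, n - 1):
    coeffs = np.polyfit(x, y, degree)                       # least-squares polynomial fit
    train_rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    # Evaluate on a fine grid between the training points to see how the curve behaves.
    x_grid = np.linspace(0, 1, 100)
    spread = np.ptp(np.polyval(coeffs, x_grid))
    print(f"degree {degree}: training RMSE {train_rmse:.3f}, range of fitted curve {spread:.1f}")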

1.3 Outline of this Thesis

This work develops machine learning methods that can be used to improve both interactive and automated theorem proving. The first part of the thesis focuses on how learning from previous proofs can help to improve premise selection algorithms. In a way, we are trying to teach the computer mathematical intuition. The second part concerns itself with the orthogonal problem of strategy selection for ATPs. My detailed contributions to the thesis chapters are listed in the Contributions section.

Chapter 2 presents premise selection as a machine learning problem, an idea originally introduced in [101]. First, the problem setup and the properties of the training data are generally defined. The naive Bayesian approach of SNoW [21] is discussed and a new kernel-based Multi-Output Ranking (MOR) algorithm is introduced. The chapter ends with a discussion of the typical properties of the training datasets and the challenges they present to machine learning algorithms.

Chapter 3 compares the learning-based premise selection algorithms of SNoW and a faster variant of MOR, MOR-CG, with several other state-of-the-art techniques on the MPTP2078 benchmark dataset [2]. We find a discrepancy between the results of the typical machine learning evaluations and the ATP evaluations. Due to incomplete training data (i.e., alternative proofs), a low score in AUC and/or Recall does not necessarily imply a low number of solved problems by the ATP. With 726 problems, MOR-CG solves 11.3% more problems than the second best method, SInE [41].¹ An ensemble combination of learning (MOR-CG) with non-learning (SInE) algorithms leads to 797 solved problems, an increase of almost 10% compared to MOR-CG.

Chapter 4 explores how knowledge of different proofs can be exploited to improve the premise predictions. The proofs found from the ATP experiments of the previous chapter are used as additional training data for the MPTP2078 dataset. Several different proof combinations are defined and tested. We find that learning from ATP proofs instead of ITP proofs gives the best results. The ensemble of ATP-learned MOR-CG with SInE solved 3.3% more problems than the former maximum.

Chapter 5 takes a closer look at the differences between ITP and ATP proofs on the whole Mizar Mathematical Library. We compare the average number of dependencies of ITP and ATP proofs and try to measure the proof complexity. We find that ATPs tend to use alternative proofs employing more advanced lemmas whereas humans often rely on the basic definitions for their proofs.

Chapter 6 brings learning-based premise selection to Isabelle. MaSh is a modified version of the sparse naive Bayes algorithm that was built to deal with the challenges of premise selection. Unlike MOR and MOR-CG, it is fast enough to be used during everyday proof development and has become part of the default Isabelle installation. MeSh, a combination of MaSh and the old relevance filter MePo, increases the number of solved problems in the Judgment Day benchmark by 4.2%.

Chapter 7 presents MaLeS, a general learning-based tuning framework for ATPs. ATP systems tuned with MaLeS successfully competed in the last three CASCs. MaLeS combines strategy finding with automated strategy scheduling using a combination of random search and kernel-based machine learning. In the evaluation, we use MaLeS to tune three different ATPs, E, LEO-II [9] and Satallax [19], and evaluate the MaLeS version against the default setting. The results show that using MaLeS can significantly improve the ATP performance.

¹ With the ATP Vampire 0.6, 70 premises and a 5 second time limit. Section 3.3 contains additional information.


Chapter 2

Premise Selection in Interactive Theorem Proving as a Machine Learning Problem

Without premise selection, automated theorem provers struggle to discharge proof obligations of interactive theorem provers. This is partly due to the large number of background premises which are passed to the automated provers as axioms. Premise selection algorithms predict the relevance of premises, thereby helping to reduce the search space of automated provers. This chapter presents premise selection as a machine learning problem and describes the challenges that distinguish this problem from other applications of machine learning.

2.1 Premise Selection as a Machine-Learning Problem

Using an ATP within an ITP requires a method to filter out irrelevant premises. Since most ITP libraries contain several thousands of theorems, simply translating every library statement into an ATP axiom overwhelms the ATP due to the exploding search space.¹ To use machine learning to create such a relevance filter, we must first answer three questions:

1. What is the training data?
2. What is the goal of the learning?
3. What are the features?

This chapter is based on: "A Survey of Axiom Selection as a Machine Learning Problem", submitted to "Infinity, computability, and metamathematics. Festschrift celebrating the 60th birthdays of Peter Koepke and Philip Welch", and my part of [2] "Premise Selection for Mathematics by Corpus Analysis and Kernel Methods", published in the Journal of Automated Reasoning.

¹ Initially, even parsing huge problem files has been an issue with some ATPs.


Axiom 1. A
Axiom 2. B
Definition 1. C iff A
Definition 2. D iff C
Theorem 1. C
Proof. By Axiom 1 and Definition 1.
Corollary 1. D
Proof. By Theorem 1 and Definition 2.

Figure 2.1: A simple library

2.1.1 The Training Data

ITP proof libraries consist of axioms, definitions and previously proved formulas together with their proofs. We use these proofs as training data for the learning algorithms. For example, for Isabelle we can use the libraries included with the prover or the Archive of Formal Proofs [50]; for Mizar, the Mizar Mathematical Library [61]. The data could also include custom libraries defined by the user or third parties. Abstracting from its source, we assume that the training data consists of a set of formulas (axioms, definitions, lemmas, theorems, corollaries) equipped with

1. a visibility relation that for each formula states which other formulas appear before it,
2. a dependency graph that for each formula shows which formulas were used in its proof (for lemmas, theorems, and corollaries), and
3. a formula tree representation of each formula.

For the remainder of the thesis we simply use theorem to denote lemmas, theorems and corollaries.

Example. Figure 2.1 introduces a simple, constructed library. For each formula, every formula that occurs above it is visible. Axioms 1 and 2 and Definitions 1 and 2 are visible from Theorem 1, whereas Corollary 1 is not visible. Figure 2.2 presents the corresponding dependency graph. Finally, Figure 2.3 shows the formula tree of ∀x x + 1 > x.

When using an ATP as proof tactic of an ITP, the conjecture of the ATP problem is the current proof obligation the ITP user wants to discharge and the axioms are the visible premises. Recall that machine learning tries to optimize a score function over the training 10

2.1. PREMISE SELECTION AS A MACHINE-LEARNING PROBLEM

Ax. 2

Ax. 1

Def. 1

Def. 2

Thm. 1

Cor. 1 Figure 2.2: The dependency graph of the library of Figure 2.1, where edges denote dependency between formulas.



>

x

+

x

x

1

Figure 2.3: The formula tree for ∀x x + 1 > x dataset. If we ignore alternative proofs and assume that the dependencies extracted from the ITP are the dependencies that an ATP would use, then an ambitious, but unrealistic, learning goal would be to try to predict the parents of conjecture in the dependency graph. Treating premise selection as a ranking rather than a subset selection problem allows more room for error and simplifies the problem. Hence we state our learning goal as: Given a training dataset (Section 2.1.1) and the formula tree of a conjecture, rank the visible premises according to their predicted usefulness based on previous proofs. In the training phase, the learning algorithm is allowed to learn from the proofs of all previously proved theorems. For all theorems in the training set, their corresponding 11

CHAPTER 2. PREMISE SELECTION IN ITPS AS A MACHINE LEARNING PROBLEM

ATP problem 1

ed ank

mi pre

ses

tr hes

n1

hig

ni highest ranked premises

Premise ranking

ATP problem i nm

hig

hes

t ra

Sledgehammer

nke

dp

rem

ise s

ATP problem m Figure 2.4: Sledgehammer generates several ATP problems from a single ranking. For simplicity, other possible slicing options are not shown.

dependencies should be ranked as high as possible. I.e., the score function should optimize the ranks of the premises that were used in the proof. Alternative proofs and their effect on premise selection are addressed in Chapter 4, and Chapter 5 takes a look at the difference between ITP and ATP dependencies. When trying to prove the conjecture, the predicted ranking is used to create several different ATP problems. It has often been observed that it is better to invoke an ATP repeatedly with different options (e.g. numbers of axioms, type encodings, ATP parameters) for a short period of time (e.g., 5 seconds) than to let it run undisturbed until the user stops it. This optimization is called time slicing [99]. Figure 2.4 illustrates the process using Sledgehammer as an example. Slices with few axioms are more likely to find deep proofs involving a few obvious axioms, whereas those with lots of axioms might find straightforward proofs involving more obscure axioms.
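As a concrete illustration of time slicing, here is a small sketch (ours, not Sledgehammer's actual slicing code) that cuts a premise ranking into problems of increasing size, each meant to be run with a short time limit:

def make_slices(ranked_premises, sizes=(16, 64, 256), time_limit=5):
    """Turn one premise ranking into several ATP problems of increasing size.

    Each slice keeps only the highest-ranked premises; the ATP is meant to be
    run on every slice for `time_limit` seconds."""
    slices = []
    for n in sizes:
        slices.append({
            "axioms": ranked_premises[:n],   # the n highest ranked premises
            "time_limit": time_limit,
        })
    return slices

# Example: a ranking produced by some premise selector (hypothetical names).
ranking = [f"lemma_{i}" for i in range(1000)]
for problem in make_slices(ranking):
    print(len(problem["axioms"]), "axioms,", problem["time_limit"], "s")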

2.1.3 Features

Almost all learning algorithms require the features of the input data to be a real vector. Therefore, a method is needed that translates a formula tree into a real vector that characterizes the formula.


Symbols.

The symbols that appear in a formula can be seen as its basic characterization and hence a simple approach is to take the set of symbols of a formula as its feature set. The symbols correspond to the node labels in the formula tree. Let n ∈ N denote the vector size, which should be at least as large as the total number of symbols in the library. Let i be an injective index function that maps each symbol s to a positive number i(s) ≤ n. The feature representation of a formula tree t is the binary vector ϕ(t) such that ϕ(t)(j) = 1 iff the symbol with index j appears in t. The example formula tree in Figure 2.3 contains the symbols ∀, >, +, x, and 1. Given n = 10, i(∀) = 1, i(>) = 4, i(+) = 6, i(x) = 7, and i(1) = 8, the corresponding feature vector is (1, 0, 0, 1, 0, 1, 1, 1, 0, 0).
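A minimal sketch of this encoding (our own illustration; the index function is simply built from the symbols seen in the library):

def symbol_index(library_symbols):
    """Assign each symbol a position in the feature vector (an injective index)."""
    return {s: i for i, s in enumerate(sorted(library_symbols))}

def feature_vector(formula_symbols, index, n):
    """Binary vector with a 1 at the position of every symbol occurring in the formula."""
    v = [0] * n
    for s in formula_symbols:
        if s in index:          # symbols not in the index are simply ignored here
            v[index[s]] = 1
    return v

index = symbol_index({"∀", ">", "+", "x", "1", "sin", "π"})
print(feature_vector({"∀", ">", "+", "x", "1"}, index, n=len(index)))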

Subterms and subformulas.

In addition to the symbols, one can also include as features the subterms and subformulas of the formula to prove—i.e., the subtrees of the formula tree [110]. For example, the formula tree in Figure 2.3 has subtrees associated with x, 1, x + 1, x + 1 > x, and ∀x x + 1 > x. Adding all subtrees significantly increases the size of the feature vector. Many subterms and subformulas appear only once in the library and are hence useless for making predictions. An approach to curtail this explosion is to consider only small subtrees (e.g., those with a height of at most 2 or 3).

Types.

The formalisms supported by the vast majority of ITP systems are typed (or sorted), meaning that each term can be given a type that describes the values that can be taken by the term. Examples of types are int, real, real × real, and real → real. Adding the types that appear in the formula tree as additional features is reasonable [56, 45]. Like terms, types can be represented as trees, and we may choose between encoding only basic types or also some or all complex subtypes.

Context.

Due to the way humans develop complex proofs, the last few formulas that were proved are likely to be useful in a proof of the current goal [24]. However, the machine learning algorithm might rank them poorly because they are new and hence little used, if at all. Adding the feature vectors of some of the last previously proved theorems to the feature vector of the conjecture, in a weighted fashion, is a way to add information about the context in which the conjecture occurs to the feature vector. This method is particularly useful when a formula has very few or very general features but occurs in a wider context.


2.2 Naive Bayes and Kernel-Based Learning

We give a detailed example of an actual learning setup using a standard naive Bayes classifier and the kernel-based Multi-Output Ranking (MOR) algorithm. The mathematics underlying both algorithms is introduced and the benefits of kernels explained. Naive Bayes has already been used in previous work on premise selection [110], whereas the MOR algorithm is newly introduced in this thesis. The next chapter contains an evaluation of these two (among other) algorithms.

2.2.1 Formal Setting

Let Γ be the set of formulas that appear in the training dataset.

Definition 1 (Proof matrix). For two formulas c, p ∈ Γ we define the proof matrix µ : Γ × Γ → {0, 1} by
\[
µ(c, p) := \begin{cases} 1 & \text{if } p \text{ is used to prove } c, \\ 0 & \text{otherwise.} \end{cases}
\]
In other words, µ is the adjacency matrix of the dependency graph. The used premises of a formula c are the direct parents of c in the dependency graph:
\[
\mathrm{usedPremises}(c) := \{ p \mid µ(c, p) = 1 \}.
\]

Definition 2 (Feature matrix). Let T := {t_1, . . . , t_m} be a fixed enumeration of the set of all symbols and (sub)terms that appear in all formulas from Γ.² We define Φ : Γ × {1, . . . , m} → {0, 1} by
\[
Φ(c, i) := \begin{cases} 1 & \text{if } t_i \text{ appears in } c, \\ 0 & \text{otherwise.} \end{cases}
\]
This matrix gives rise to the feature function ϕ : Γ → {0, 1}^m which for c ∈ Γ is the vector ϕ^c with entries in {0, 1} satisfying ϕ^c_i = 1 ⟺ Φ(c, i) = 1. The expressed features of a formula are denoted by the value of the function e : Γ → P(T) that maps c to {t_i | Φ(c, i) = 1}.

For each premise p ∈ Γ we learn a real-valued classifier function C_p(·) : Γ → R which, given a conjecture c, estimates how useful p is for proving c. The premises for a conjecture c ∈ Γ are ranked by the values of C_p(c). The main difference between learning algorithms is the function space in which they search for the classifiers and the measure they use to evaluate how good a classifier is.

² If the set of features is not constant, they are enumerated in order of appearance.
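Continuing the toy library from Section 2.1.1, the proof matrix µ and the feature matrix Φ can be materialised as arrays, as in the following sketch (ours; the symbol sets standing in for formula trees are made up, and NumPy is assumed):

import numpy as np

formulas = ["Axiom1", "Axiom2", "Definition1", "Definition2", "Theorem1", "Corollary1"]
dependencies = {"Theorem1": {"Axiom1", "Definition1"},
                "Corollary1": {"Theorem1", "Definition2"}}
# Hypothetical symbol sets standing in for the formula trees.
symbols = {"Axiom1": {"A"}, "Axiom2": {"B"}, "Definition1": {"C", "A"},
           "Definition2": {"D", "C"}, "Theorem1": {"C"}, "Corollary1": {"D"}}

features = sorted({s for syms in symbols.values() for s in syms})   # t_1, ..., t_m

# Proof matrix mu: mu[c, p] = 1 iff p is used to prove c.
mu = np.zeros((len(formulas), len(formulas)), dtype=int)
# Feature matrix Phi: Phi[c, i] = 1 iff feature t_i appears in c.
phi = np.zeros((len(formulas), len(features)), dtype=int)

for ci, c in enumerate(formulas):
    for p in dependencies.get(c, ()):
        mu[ci, formulas.index(p)] = 1
    for s in symbols[c]:
        phi[ci, features.index(s)] = 1

print(mu[formulas.index("Corollary1")])    # row marking Theorem1 and Definition2
print(phi[formulas.index("Definition1")])  # row marking symbols A and C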


2.2.2 A Naive Bayes Classifier

Naive Bayes is a statistical learning method based on Bayes’ theorem about conditional probabilities³ with a strong (read: naive) independence assumption. In the naive Bayes setting, the value C_p(c) of the classifier function of a premise p at a conjecture c is the probability that µ(c, p) = 1 given the expressed features e(c).

To understand the difference between the naive Bayes and the kernel-based learning algorithm we need to take a closer look at the naive Bayes classifier. Let θ denote the statement that µ(c, p) = 1 and for each feature t_i ∈ T let \bar{t}_i denote that Φ(c, i) = 1. Furthermore, let e(c) = {s_1, . . . , s_l} ⊆ T be the expressed features of c (with corresponding \bar{s}_1, . . . , \bar{s}_l). Then (by Bayes’ theorem) we have
\[
P(θ \mid \bar{s}_1, \ldots, \bar{s}_l) \propto P(\bar{s}_1, \ldots, \bar{s}_l \mid θ)\, P(θ)    (2.1)
\]
where the logarithm of the right-hand side can be computed as
\[
\begin{aligned}
\ln P(\bar{s}_1, \ldots, \bar{s}_l \mid θ)\, P(θ) &= \ln P(\bar{s}_1, \ldots, \bar{s}_l \mid θ) + \ln P(θ) & (2.2) \\
&= \ln \prod_{i=1}^{l} P(\bar{s}_i \mid θ) + \ln P(θ) \quad \text{(by independence)} & (2.3) \\
&= \sum_{i=1}^{m} ϕ^c_i \ln P(\bar{t}_i \mid θ) + \ln P(θ) & (2.4) \\
&= w^T ϕ^c + \ln P(θ) & (2.5)
\end{aligned}
\]
where
\[
w_i := \ln P(\bar{t}_i \mid θ).    (2.6)
\]

There are two things worth noting here. First, P(\bar{t}_i | θ) and P(θ) might be 0. In that case, taking the natural logarithm would not be defined. In practice, if P(\bar{t}_i | θ) or P(θ) are 0, the algorithm replaces the 0 with a predefined very small ε > 0. Second, equation (2.5) shows that the naive Bayes classifier is “essentially” (after the monotonic transformation) a linear function of the features of the conjecture. The feature weights w are computed using formula (2.6).

³ In its simplest form, Bayes’ theorem asserts for a probability function P and random variables X and Y that
\[
P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)},
\]
where P(X | Y) is understood as the conditional probability of X given Y.
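To make the derivation concrete, here is a small sketch of such a naive Bayes ranker (our simplified rendering of formulas (2.1)–(2.6), not the actual SNoW or MaSh code; the ε-replacement is as described above):

import math

def train_naive_bayes(proofs, features, eps=1e-6):
    """proofs: {conjecture: set of used premises}; features: {formula: set of features}.
    For each premise p, estimate ln P(t_i | p used) and ln P(p used)."""
    models = {}
    all_premises = {p for used in proofs.values() for p in used}
    for p in all_premises:
        positives = [c for c, used in proofs.items() if p in used]
        prior = len(positives) / len(proofs)
        counts = {}
        for c in positives:
            for t in features[c]:
                counts[t] = counts.get(t, 0) + 1
        weights = {t: math.log(max(n / len(positives), eps)) for t, n in counts.items()}
        models[p] = (math.log(max(prior, eps)), weights)
    return models

def score(models, premise, conjecture_features, eps=1e-6):
    """Log-probability score of a premise for a conjecture, summing over expressed features."""
    log_prior, weights = models[premise]
    return log_prior + sum(weights.get(t, math.log(eps)) for t in conjecture_features)

# Example with the toy library (the feature sets are the hypothetical ones from above).
proofs = {"Theorem1": {"Axiom1", "Definition1"}, "Corollary1": {"Theorem1", "Definition2"}}
feats = {"Theorem1": {"C"}, "Corollary1": {"D"}}
models = train_naive_bayes(proofs, feats)
print(score(models, "Definition1", {"C"}))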

2.2.3 Kernel-based Learning

We saw that the naive Bayes algorithm gives rise to a linear classifier. This leads to several questions: ‘Are there better weights?’ and ‘Can one get better performance with non-linear functions?’. Kernel-based learning provides a framework for investigating such questions. In this subsection we give a simplified, brief description of kernel-based learning that is tailored to our present problem; further information can be found in [5, 82, 88].


Are there better weights?

To answer this question we must first define what ‘better’ means. Using the number of problems solved as measure is not feasible because we cannot practically run an ATP for every possible weight combination. Instead, we measure how well a classifier approximates our training data. We would like to have that ∀x ∈ Γ : C_p(x) = µ(x, p). However, this will almost never be the case. To compare how well a classifier approximates the data, we use loss functions and the notion of expected loss that they provide, which we now define.

Definition 3 (Loss function and Expected Loss). A loss function is any function l : R × R → R⁺. Given a loss function l we can then define the expected loss E(·) of a classifier C_p as
\[
E(C_p) = \sum_{x ∈ Γ} l(C_p(x), µ(x, p)).
\]

One might add additional properties such as l(x, x) = 0, but this is not necessary. Typical examples of a loss function l(x, y) are the square loss (y − x)² or the 0-1 loss defined by I(x = y).⁴ We can compare two different classifiers via their expected loss. If the expected loss of classifier C_p is less than the expected loss of a classifier C′_p, then C_p is the better classifier.
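For concreteness, a tiny sketch (ours) of the square loss, the 0-1 loss, and the expected loss of a classifier:

def square_loss(prediction, target):
    return (target - prediction) ** 2

def zero_one_loss(prediction, target):
    return 0 if prediction == target else 1

def expected_loss(classifier, mu_column, loss):
    """Sum of the loss over all formulas x; mu_column[x] is mu(x, p) for one fixed premise p."""
    return sum(loss(classifier(x), mu_column[x]) for x in mu_column)

# Hypothetical classifier that always predicts 0.2 for every conjecture.
mu_p = {"Theorem1": 1, "Corollary1": 0}
print(expected_loss(lambda x: 0.2, mu_p, square_loss))   # (1-0.2)^2 + (0-0.2)^2 = 0.68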

Nonlinear Classifiers

It seems straightforward that more complex functions would lead to a lower expected loss and are hence desirable. However, weight optimization becomes tedious once we leave the linear case. Kernels provide a way to use the machinery of linear optimization on non-linear functions.

Definition 4 (Kernel). A kernel is a function k : Γ × Γ → R satisfying k(x, y) = ⟨φ(x), φ(y)⟩, where φ : Γ → F is a mapping from Γ to an inner product space F with inner product ⟨·, ·⟩. A kernel can be understood as a similarity measure between two entities.

Example 1. A standard example is the linear kernel:
\[
k_{\mathrm{lin}}(x, y) := ⟨ϕ^x, ϕ^y⟩
\]
with ⟨·, ·⟩ being the normal dot product in R^m. Here, ϕ^f denotes the features of a formula f, and the inner product space F is R^m. A nontrivial example is the Gaussian kernel with parameter σ [13]:
\[
k_{\mathrm{gauss}}(x, y) := \exp\left( - \frac{⟨ϕ^x, ϕ^x⟩ - 2⟨ϕ^x, ϕ^y⟩ + ⟨ϕ^y, ϕ^y⟩}{σ^2} \right)
\]

⁴ I is defined as follows: I(x = y) = 0 if x = y, and I(x = y) = 1 otherwise.
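The two kernels of Example 1 translate directly into code; the following sketch is ours (NumPy assumed, σ a free parameter):

import numpy as np

def linear_kernel(x, y):
    """k_lin(x, y) = <x, y> for feature vectors x and y."""
    return float(np.dot(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """k_gauss(x, y) = exp(-(<x,x> - 2<x,y> + <y,y>) / sigma^2)."""
    sq_dist = np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y)
    return float(np.exp(-sq_dist / sigma ** 2))

x = np.array([1, 0, 1, 1, 0], dtype=float)   # feature vector of one formula
y = np.array([1, 1, 0, 1, 0], dtype=float)   # feature vector of another formula
print(linear_kernel(x, y), gaussian_kernel(x, y, sigma=2.0))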


We can now define our kernel function space in which we will search for classification functions.

Definition 5 (Kernel Function Space). Given a kernel k, we define
\[
F_k := \left\{ f ∈ \mathbb{R}^{Γ} \;\middle|\; f(x) = \sum_{v ∈ Γ} α_v k(x, v),\; α_v ∈ \mathbb{R},\; \|f\| < ∞ \right\}
\]
as our kernel function space, where for f(x) = \sum_{v ∈ Γ} α_v k(x, v)
\[
\|f\| = \sum_{u, v ∈ Γ} α_u α_v k(u, v).
\]

Essentially, every function in F_k compares the input x with formulas in Γ using the kernel, and the weights α determine how important each comparison is.⁵ The kernel function space F_k naturally depends on the kernel k. It can be shown that when we use k_lin, F_{k_lin} consists of linear functions of the features T. In contrast, the Gaussian kernel k_gauss gives rise to a nonlinear (in the features) function space.

Putting it all together

Having defined loss functions, kernels and kernel function spaces, we can now define how kernel-based learning algorithms learn classifier functions. Given a kernel k and a loss function l, recall that we measure how good a classifier C_p is with the expected loss E(C_p). With all our definitions it seems reasonable to define C_p as
\[
C_p := \arg\min_{f ∈ F_k} E(f).    (2.7)
\]
However, this is not what a kernel-based learning algorithm does. There are two reasons for this. First, the minimum might not exist. Second, in particular when using complex kernel functions, such an approach might lead to overfitting: C_p might perform very well on our training data, but badly on data that was not seen before. To handle both problems, a regularization parameter λ > 0 is introduced to penalize complex functions. This regularization parameter allows us to place a bound on possible solutions, which together with the fact that F_k is a Hilbert space ensures the existence of C_p. Hence we define
\[
C_p = \arg\min_{f ∈ F_k} E(f) + λ \|f\|^2.    (2.8)
\]
Recall from the definition of F_k that C_p has the form
\[
C_p(x) = \sum_{v ∈ Γ} α_v k(x, v),    (2.9)
\]
with α_v ∈ R. Hence, for any fixed λ, we only need to compute the weights α_v for all v ∈ Γ in order to define C_p. In Section 2.2.4 we show how to solve this optimization problem in our setting.

⁵ Schölkopf gives a more general approach to kernel spaces [81].


Naive Bayes vs Kernel-based Learning

Kernel-based methods typically outperform the naive Bayes algorithm. There are several reasons for this. Firstly and most importantly, while naive Bayes is essentially a linear classifier, kernel-based methods can learn non-linear dependencies when an appropriate non-linear (e.g. Gaussian) kernel function is used. This advantage in expressiveness usually leads to significantly better generalization⁶ performance of the algorithm, given properly estimated hyperparameters (e.g., the kernel width σ for Gaussian functions). Secondly, kernel-based methods are formulated within the regularization framework, which provides a mechanism to control the errors on the training set and the complexity ("expressiveness") of the prediction function. Such a setting prevents overfitting of the algorithm and leads to notably better results compared to unregularized methods. Thirdly, some of the kernel-based methods (depending on the loss function) can use very efficient procedures for hyperparameter estimation (e.g. fast leave-one-out cross-validation [78]) and therefore result in a close to optimal model for the classification/regression task. For such reasons kernel-based methods are among the most successful algorithms applied to various problems from bioinformatics to information retrieval to computer vision [88].

A general advantage of naive Bayes over kernel-based algorithms is its computational efficiency, particularly when taking into account the fact that computing the kernel matrix is generally quadratic in the number of training data points.

2.2.4 Multi-Output Ranking

We define the kernel-based multi-output ranking (MOR) algorithm. It extends previously defined preference learning algorithms by Tsivtsivadze and Rifkin [100, 78]. Let Γ = {x_1, . . . , x_n}. Then formula (2.9) becomes
\[
C_p(x) = \sum_{i=1}^{n} α_i k(x, x_i).
\]
Using this and the square loss l(x, y) = (x − y)², solving equation (2.8) is equivalent to finding weights α_i that minimize
\[
\min_{α_1, \ldots, α_n} \; \sum_{i=1}^{n} \left( \sum_{j=1}^{n} α_j k(x_i, x_j) - µ(x_i, p) \right)^{2} + λ \sum_{i,j=1}^{n} α_i α_j k(x_i, x_j).    (2.10)
\]
Recall that C_p is the classifier for a single premise. Since we eventually want to rank all premises, we need to train a classifier for each premise. So we need to find weights α_{i,p} for each premise p. We can use the fact that for each premise p, C_p depends on the values of k(x_i, x_j), where 1 ≤ i, j ≤ n, to speed up the computation. Instead of learning the classifiers C_p for each premise separately, we learn all the weights α_{i,p} simultaneously.

⁶ Generalization is the ability of a machine learning algorithm to perform accurately on new, unseen examples after training on a finite data set.


To do this, we first need some definitions. Let
\[
A = (α_{i,p})_{i,p} \qquad (1 ≤ i ≤ n,\; p ∈ Γ).
\]
A is the matrix where each column contains the parameters of one premise classifier. Define the kernel matrix K and the label matrix Y as
\[
K := (k(x_i, x_j))_{i,j} \quad (1 ≤ i, j ≤ n), \qquad Y := (µ(x_i, p))_{i,p} \quad (1 ≤ i ≤ n,\; p ∈ Γ).
\]
We can now rewrite (2.10) in matrix notation to state the problem for all premises:
\[
\arg\min_{A} \; \mathrm{tr}\left( (Y - KA)^T (Y - KA) + λ A^T K A \right)    (2.11)
\]
where tr(A) denotes the trace of the matrix A. Taking the derivative with respect to A leads to:
\[
\frac{∂}{∂A} \, \mathrm{tr}\left( (Y - KA)^T (Y - KA) + λ A^T K A \right) = \mathrm{tr}\left( -2K(Y - KA) + 2λKA \right) = \mathrm{tr}\left( -2KY + (2KK + 2λK)A \right).
\]
To find the minimum, we set the derivative to zero and solve with respect to A. This leads to:
\[
A = (K + λI)^{-1} Y.    (2.12)
\]
If the regularization parameter λ and the (potential) kernel parameter σ are fixed, we can find the optimal weights through simple matrix computations. Thus, to fully determine the classifiers, it remains to find good values for the parameters λ and σ. This is done, as is common with such parameter optimization for kernel methods, by simple (logarithmically scaled) grid search and cross-validation on the training data using a 70/30 split. For this, we first define a logarithmically scaled set of potential parameters. The training set is then randomly split in two parts cv_train and cv_test, with cv_train containing 70% of the training data and cv_test containing the remaining 30%. For each set of parameters, the algorithm is trained on cv_train and evaluated on cv_test. The process is repeated 10 times. The set of parameters with the best average performance is then picked for the real evaluation.
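The training procedure can be summarised in a short NumPy sketch (ours, not the thesis implementation): the closed-form solution (2.12) combined with a logarithmically scaled grid search over λ using repeated 70/30 splits.

import numpy as np

def train_mor(K, Y, lam):
    """Closed-form MOR weights A = (K + lambda*I)^(-1) Y for kernel matrix K and label matrix Y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Y)

def grid_search(K, Y, lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0), repeats=10, seed=0):
    """Pick lambda by repeated 70/30 cross-validation, scoring with mean squared error."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        errs = []
        for _ in range(repeats):
            perm = rng.permutation(n)
            train, test = perm[: int(0.7 * n)], perm[int(0.7 * n):]
            A = train_mor(K[np.ix_(train, train)], Y[train], lam)
            pred = K[np.ix_(test, train)] @ A        # C_p evaluated on the held-out formulas
            errs.append(np.mean((pred - Y[test]) ** 2))
        if np.mean(errs) < best_err:
            best_lam, best_err = lam, float(np.mean(errs))
    return best_lam, best_err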

2.3 Challenges

Premise selection has several peculiarities that restrict which machine learning algorithms can be effectively used. In this section, we illustrate these challenges on a large fragment of Isabelle’s Archive of Formal Proofs (AFP). The AFP benchmarks contain 165 964 formulas distributed over 116 entries contributed by dozens of Isabelle users.⁷ Most entries are related to computer science (e.g., data structures, algorithms, programming languages, and process algebras). The dataset was generated using Sledgehammer [56] and is available publicly at http://www.cs.ru.nl/~kuehlwein/downloads/afp.tar.gz.

⁷ A number of AFP entries were omitted because of technical difficulties.


2.3.1 Features

The features introduced in Section 2.1.3 are very sparse. For example, the AFP contains 20 461 symbols. Adding small subterms and subformulas as well as basic types raises the total number of features to 328 361. Rare features can be very useful, because if two formulas share a very rare feature, the likelihood that one depends on the other is very high. However, they also lead to much larger and sparser feature vectors. Figure 2.5 shows the percentage of features that appear in at least x formulas in the AFP, for various values of x. If we consider all features, then only 3.37% of the features appear in more than 50 formulas. Taking only the symbols into account gives somewhat less sparsity, with 2.65% of the symbols appearing in more than 500 formulas. Since there are 165 964 formulas in total, this means that 97.35% of all symbols appear in less than 0.3% of the training data.

Figure 2.5: Distribution of the feature appearances in the Archive of Formal Proofs. [Plot of the percentage of features (all features vs. symbols only) occurring in at least a given number of formulas, for 1 to 1000 formulas.]

Another peculiarity of the premise selection problem is that the number of features is not a priori fixed. Defining new names for new concepts is standard mathematical practice. Hence, the learning algorithm must be able to cope with an unbounded, ever increasing feature set.

2.3.2 Dependencies

Like the features, the dependencies are also sparse. On average, an AFP formula depends on 5.5 other formulas; 19.4% of the formulas have no dependencies at all, and 10.7% have at least 20 dependencies. Figure 2.6 shows the percentage of formulas that are dependencies of at least x formulas in the AFP, for various values of x. Less than half of the formulas (43.0%) are a dependency in at least one other formula, and 94 593 formulas are never used as dependencies. This includes 32 259 definitions as well as 17 045 formulas where the dependencies could not be extracted and were hence left empty. Only 0.08% of the formulas are used as dependencies more than 500 times.

The main issue is that the dependencies in the training data might be incomplete or otherwise misleading. The dependencies extracted from the ITP are not necessarily the same as an ATP would use [3]. For example, Isabelle users can use induction in an interactive proof, and this would be reflected in the dependencies—the induction principle is itself a (higher-order) premise. Most ATPs are limited to first-order logic without induction. If an alternative proof is possible without induction, this is the one that should be learned. Experiments with combinations of ATP and ITP proofs indicate that ITP dependencies are a reasonable guess, but learning from ATP dependencies yields better results (Chapter 4, [55, 110]).

More generally, the training data extracted from an ITP library lacks information about alternative proofs. In practice, this means that any evaluation method that relies only on the ITP proofs cannot reliably evaluate whether a premise selection algorithm produces good predictions. There is no choice but to actually run ATPs—and even then the hardware, time limit, and version of the ATP can heavily influence the results.

Figure 2.6: Distribution of the dependency appearances in the Archive of Formal Proofs. [Plot of the percentage of formulas used as a dependency in at least a given number of formulas, for 1 to 300 formulas.]

2.3.3 Online Learning and Speed

Any algorithm for premise selection must update its prediction model and create predictions fast. The typical use case is that of an ITP user who develops a theory formula by formula, proving each along the way. Usually these formulas depend on one another, often in the familiar sequence definition–lemma–theorem–corollary. After each user input, the prediction model might need to be updated. In addition, it is not uncommon for users to alter existing definitions or lemmas, which should trigger some relearning.

Speed is essential for a premise selection algorithm since the automated proof finding process needs to be faster than manual proof creation. The less time is spent on updating the learning model and predicting the premise ranking, the more time can be used by ATPs. Users of ITPs tend to be impatient: If the automated provers do not respond within half a minute or so, they usually prefer to carry out the proof themselves.


Chapter 3

Overview and Evaluation of Premise Selection Techniques

In this chapter, an overview of state-of-the-art techniques for premise selection in large theory mathematics is presented, and new premise selection techniques are introduced. Several evaluation metrics are defined and their appropriateness is discussed in the context of automated reasoning in large theory mathematics. The methods are evaluated on the MPTP2078 benchmark, a subset of the Mizar library, and a 10% improvement is obtained over the best method so far.

3.1 Premise Selection Algorithms

3.1.1 Premise Selection Setting

The typical setting for the task of premise selection is a large developed library of formally encoded mathematical knowledge, over which mathematicians attempt to prove new lemmas and theorems [102, 15, 109]. The actual mathematical corpora suitable for ATP techniques are only a fraction of all mathematics (e.g. about 52 000 lemmas and theorems in the Mizar library) and started to appear only recently, but they already provide a corpus on which different methods can be defined, trained, and evaluated. Premise selection can be useful as a standalone service for the formalizers (suggesting relevant lemmas), or in conjunction with ATP methods that can attempt to find a proof from the relevant premises.

This chapter is based on: [57] "Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics", published in the Proceedings of the 6th International Joint Conference on Automated Reasoning.


3.1.2 Learning-based Ranking Algorithms

Learning-based ranking algorithms have a training and a testing phase and typically represent the data as points in pre-selected feature spaces. In the training phase the algorithm tries to fit one (or several) prediction functions to the data it is given. The result of the training is the best fitting prediction function, which can then be used in the testing phase for evaluations. In the typical setting presented above, the algorithms would train on all existing proofs in the library and be tested on the new theorem the mathematician wants to prove. We compare three different algorithms.

SNoW: SNoW (Sparse Network of Winnows) [21] is an implementation of (among others) the naive Bayes algorithm that has already been successfully used for premise selection [102, 105, 2]. Naive Bayes is a statistical learning method based on Bayes‘ theorem with a strong (or naive) independence assumption. Given a new conjecture c and a premise p, SNoW computes the probability of p being needed to prove c, based on the previous use of p in proving conjectures that are similar to c. The similarity is in our case typically expressed using symbols and terms of the formulas. The independence assumption says that the (non-)occurrence of a symbol/term is not related to the (non-)occurrence of every other symbol/term. A detailed description can be found in Section 2.2.4.

MOR-CG: MOR-CG (Multi-Output Ranking with Conjugate Gradient) is a kernel-based learning algorithm [88] that is a faster version of the MOR algorithm described in the previous chapter. Instead of doing an exact computation of the weights as presented in Section 2.2.4, MOR-CG uses conjugate-gradient descent [89], which speeds up the time needed for training. Since preliminary tests gave the best results for a linear kernel, the following experiments are based on a linear kernel. Kernel-based algorithms do not aim to model probabilities, but instead try to minimize the expected loss of the prediction functions on the training data. For each premise p, MOR-CG tries to find a function C_p such that for each conjecture c, C_p(c) = 1 iff p was used in the proof of c. Given a new conjecture c, we can evaluate the learned prediction functions C_p on c. The higher the value C_p(c), the more relevant p is to prove c.
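The difference to the exact solution can be illustrated with a small sketch (ours; SciPy assumed): each column of A is approximated by conjugate-gradient iterations on (K + λI)α = y instead of computing a matrix inverse.

import numpy as np
from scipy.sparse.linalg import cg

def train_mor_cg(K, Y, lam):
    """Approximate A = (K + lambda*I)^(-1) Y column by column with conjugate gradients.

    K is a (symmetric, positive semi-definite) kernel matrix, so K + lambda*I is positive
    definite and CG applies; each column of Y is the label vector of one premise."""
    n = K.shape[0]
    M = K + lam * np.eye(n)
    A = np.zeros_like(Y, dtype=float)
    for col in range(Y.shape[1]):
        x, info = cg(M, Y[:, col])     # a few iterations instead of a full matrix inverse
        A[:, col] = x
    return A

# Tiny example with a random positive semi-definite kernel matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
K = X @ X.T                            # linear kernel matrix
Y = (rng.random((20, 3)) > 0.8).astype(float)
print(train_mor_cg(K, Y, lam=1.0).shape)   # (20, 3)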

BiLi: BiLi (Bi-Linear) is a new algorithm by Twan van Laarhoven that is based on a bilinear model of premise selection, similar to the work of Chu and Park [23]. Like MOR-CG, BiLi aims to minimize the expected loss. The difference lies in the kind of prediction functions they produce. In MOR-CG the prediction functions only take the features of the conjecture into account (in our experiments each feature indicates the presence or absence of a certain symbol or term in a formula). In BiLi, the prediction functions use the features of both the conjectures and the premises. This makes BiLi similar to methods like SInE that symbolically compare conjectures with premises. The bilinear model learns a weight for each combination of a conjecture feature together with a premise feature. Together, this weighted combination determines whether or not a premise is relevant to the conjecture. When the number of features becomes large, fitting a bilinear model becomes computationally more challenging. Therefore, in BiLi the number of features is first reduced to 100, using random projections [12]. To combat the noise introduced by these random projections, this procedure is repeated 20 times, and the averaged predictions are used for ranking the premises.
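A rough sketch of the scoring side of such a bilinear model is shown below; learning the weight matrices is omitted, and the projection dimension, the number of repetitions, and all names are illustrative assumptions rather than the actual BiLi code.

    import numpy as np

    def random_projection(X, k=100, seed=0):
        """Project 0/1 feature rows down to k dimensions with a random matrix."""
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
        return X @ R

    def bilinear_rank(conjecture_features, premise_features, weight_matrices, seeds):
        """Average the bilinear scores over several random projections and rank premises.
        weight_matrices[i] is the (k x k) matrix learned for the i-th projection."""
        scores = np.zeros(premise_features.shape[0])
        for W, seed in zip(weight_matrices, seeds):
            c = random_projection(conjecture_features[None, :], seed=seed)   # (1, k)
            P = random_projection(premise_features, seed=seed)               # (num_premises, k)
            scores += (c @ W @ P.T).ravel()
        scores /= len(weight_matrices)
        return np.argsort(-scores)        # premise indices, most relevant first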

3.1.3 Other Algorithms Used in the Evaluation

SInE: SInE, the SUMO Inference Engine, is a heuristic state-of-the-art premise selection algorithm by Kryštof Hoder [41]. The basic idea is to use global frequencies of symbols in a problem to define their generality, and build a relation linking each symbol S with all formulas F in which S has the lowest global generality among the symbols of F. In common-sense ontologies, such formulas typically define the symbols linked to them, which is the reason for calling this relation a D-relation. Premise selection for a conjecture is then done by recursively following the D-relation, starting with the conjecture's symbols. For the experiments described here the E implementation of SInE has been used (see http://www.mpi-inf.mpg.de/departments/rg1/conferences/deduction10/slides/stephanschulz.pdf), because it can be instructed to select exactly the N most relevant premises. This is compatible with the way other premise rankers are used in this chapter, and it allows us to compare the premise rankings produced by different algorithms for increasing values of N. The exact parameters used for producing the E-SInE rankings are at https://raw.github.com/JUrban/MPTP2/master/MaLARea/script/filter1.
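The D-relation idea can be sketched as follows (an illustrative approximation, not Hoder's or the E implementation; the tolerance parameter and the data layout are assumptions):

    from collections import defaultdict

    def sine_select(formulas, conjecture_symbols, tolerance=1.0, max_premises=200):
        """Illustrative SInE-style selection.  formulas: dict name -> set of symbols."""
        occurrences = defaultdict(int)
        for symbols in formulas.values():
            for s in symbols:
                occurrences[s] += 1

        d_relation = defaultdict(list)     # symbol -> formulas it triggers
        for name, symbols in formulas.items():
            least = min(occurrences[s] for s in symbols)
            for s in symbols:
                if occurrences[s] <= tolerance * least:
                    d_relation[s].append(name)

        selected, chosen = [], set()
        frontier, processed = set(conjecture_symbols), set()
        while frontier and len(selected) < max_premises:
            new_symbols = set()
            for s in frontier:
                processed.add(s)
                for name in d_relation[s]:
                    if name not in chosen:
                        chosen.add(name)
                        selected.append(name)
                        new_symbols |= formulas[name]
            frontier = new_symbols - processed
        return selected[:max_premises]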

Aprils: APRILS [79], the Automated Prophesier of Relevance Incorporating Latent Semantics, is a signature-based premise selection method that employs Latent Semantic Analysis (LSA) [26] to define symbol and premise similarity. Latent semantics is a machine learning method that has been successfully used for example in the Netflix Prize (http://www.netflixprize.com) and in web search. Its principle is to automatically derive “semantic” equivalence classes of words (like car, vehicle, automobile) from their co-occurrences in documents, and to work with such equivalence classes instead of the original words. In APRILS, formulas define the symbol co-occurrence, each formula is characterized as a vector over the symbols' equivalence classes, and the premise relevance is its dot product with the conjecture.

3.1.4 Techniques Not Included in the Evaluation

As a part of the overview, we also list important or interesting algorithms used for ATP knowledge selection that for various reasons do not fit this evaluation. We refer readers to [106] for their discussion.



• The default premise selection heuristic used by the Isabelle/Sledgehammer export [64]. This is an Isabelle-specific symbol-based technique similar to SInE that would need to be evaluated on Isabelle data.

• Goal directed ATP calculi including the Conjecture Symbol Weight clause selection heuristics in E prover [84] giving lower weights to symbols contained in the conjecture, the Set of Support (SoS) strategy in resolution/superposition provers, and tableau calculi like leanCoP [70] that are in practice goal-oriented.

• Model-based premise selection, as done by Pudlák's semantic axiom selection system for large theories [76], by the SRASS metasystem [97], and in a different setting by the MaLARea [110] metasystem.

• MaLARea [110] is a large-theory metasystem that loops between deductive proof and model finding (using ATPs and finite model finders), and learning premise selection (currently using SNoW or MOR-CG) from the proofs and models to attack the conjectures that still remain to be proved.

• Abstract proof trace guidance implemented in the E prover by Stephan Schulz for his PhD [83]. Proofs are abstracted into clause patterns collected into a common knowledge base, which is loaded when a new problem is solved, and used for guiding clause selection. This is also similar to the hints technique in Prover9 [63].

• The MaLeCoP system [112] where the clause relevance is learned from all closed tableau branches, and the tableau extension steps are guided by a trained machine learner that takes as input features a suitable encoding of the literals on the current tableau branch.

3.2 Machine Learning Evaluation Metrics

Given a database of proofs, there are several possible ways to evaluate how good a premise selection algorithm is without running an ATP. Such evaluation metrics are used to estimate the best parameters (e.g. regularization, tolerance, step size) of an algorithm. The input for each metric is a ranking of the premises for a conjecture together with the information which premises were used to prove the conjecture (according to the training data).

Recall

Recall@n is a value between 0 and 1 and denotes the fraction of used premises that are among the top n highest ranked premises:

Recall@n = |{used premises} ∩ {n highest ranked premises}| / |{used premises}|

Recall@n is always at most Recall@(n + 1). As n increases, Recall@n will eventually converge to 1. Our intuition is that the better the algorithm, the faster its Recall@n converges to 1.


AUC

The AUC (Area under the ROC Curve) is the probability that, given a randomly drawn used premise and a randomly drawn unused premise, the used premise is ranked higher than the unused premise. Values closer to 1 show better performance. Let x_1, ..., x_n be the ranks of the used premises and y_1, ..., y_m be the ranks of the unused premises. Then, the AUC is defined as

AUC = ( Σ_{i=1..n} Σ_{j=1..m} 1[x_i > y_j] ) / (m n)

where 1[x_i > y_j] = 1 iff x_i > y_j and zero otherwise.

100%Recall

100%Recall denotes the minimum n such that Recall@n = 1:

100%Recall = min{n | Recall@n = 1}

In other words, 100%Recall tells us how many premises (starting from the highest ranked one) we need to give to the ATP to ensure that all necessary premises are included.
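All three metrics are straightforward to compute from a predicted ranking and the set of premises used in the known proof. The following Python sketch is only illustrative (it is not the evaluation code used in this chapter); it treats a smaller list index as a better rank and assumes every used premise occurs in the ranking.

    def recall_at_n(ranking, used, n):
        """Fraction of used premises among the n highest ranked premises."""
        return len(set(ranking[:n]) & set(used)) / len(set(used))

    def auc(ranking, used):
        """Probability that a used premise is ranked better than an unused one."""
        position = {p: i for i, p in enumerate(ranking)}
        used_pos = [position[p] for p in used]
        unused_pos = [i for p, i in position.items() if p not in set(used)]
        better = sum(1 for u in used_pos for v in unused_pos if u < v)
        return better / (len(used_pos) * len(unused_pos))

    def full_recall(ranking, used):
        """100%Recall: the smallest n such that Recall@n = 1."""
        remaining = set(used)
        for n, p in enumerate(ranking, start=1):
            remaining.discard(p)
            if not remaining:
                return n
        return None   # some used premise never appears in the ranking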

3.3 Evaluation

3.3.1 Evaluation Data

The premise selection methods are evaluated on the large (chainy) problems from the MPTP2078 benchmark [2], available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078. These are 2078 related large-theory problems (conjectures) and 4494 formulas (conjectures and premises) in total, extracted from the Mizar Mathematical Library (MML). The MPTP2078 benchmark was developed to supersede the older and smaller MPTP Challenge benchmark (developed in 2006), while keeping the number of problems manageable for experimenting. Larger evaluations are possible (see [108, 3] for recent evaluations spanning the whole MML), but not convenient when testing a large number of systems with many different settings. MPTP2078 seems sufficiently large to test various hypotheses and find significant differences. MPTP2078 also contains (in the smaller, bushy problems) for each conjecture the information about the premises used in the MML proof. This can be used to train and evaluate machine learning algorithms using a chronological order emulating the growth of MML. For each conjecture, the algorithms are allowed to train on all MML proofs that were done up to (not including) the current conjecture. For each of the 2078 problems, the algorithms predict a ranking of the premises.



3.3.2 Machine Learning Evaluation: Comparison of Predictions with Known Proofs

We first compare the algorithms introduced in section 3.1 using the machine learning evaluation metrics introduced in section 3.2. All evaluations are based on the training data, the human-written formal proofs from the MML. They do not take alternative proofs into account.

Recall

Figure 3.1 compares the average Recall@n of MOR-CG, BiLi, SNoW, SInE and Aprils for the top 200 premises over all 2078 problems. Higher values denote better performance. The graph shows that MOR-CG performs best, and Aprils worst.

Figure 3.1: Recall comparison of the premise selection algorithms (average Recall@n against the number n of top-ranked premises, for SNoW, MOR-CG, BiLi, SInE and Aprils).

Note that there is a sharp distinction between the learning algorithms, which use the MML proofs and eventually reach a very similar recall, and the heuristic-based algorithms Aprils and SInE.

AUC

The average AUC of the premise selection algorithms is reported in Table 3.1. Higher values mean better performance, i.e. a higher chance that a used premise is ranked higher than an unused premise. SNoW (97%) and BiLi (96%) have the best average AUC scores, with MOR-CG taking the third spot with an average AUC of 88%. Aprils and SInE are considerably worse with 64% and 42% respectively. The standard deviation is very low, around 2%, for all algorithms.

Table 3.1: AUC comparison of the premise selection algorithms

Algorithm   Avg. AUC   Std.
SNoW        0.9713     0.0216
BiLi        0.9615     0.0215
MOR-CG      0.8806     0.0206
Aprils      0.6443     0.0176
SInE        0.4212     0.0142

100%Recall

The comparison of the 100%Recall measure values can be seen in figure 3.2. For the first 115 premises, MOR-CG is the best algorithm. From then on, MOR-CG hardly increases and SNoW takes the lead. Eventually, BiLi almost catches up with MOR-CG. Again we can see a big gap between the performance of the learning and the heuristic algorithms with SInE and Aprils not even reaching 400 problems with 100%Recall.

Figure 3.2: 100%Recall comparison of the premise selection algorithms (number of problems reaching 100%Recall within the top n premises, plotted against n).



Discussion

In all three evaluation metrics there is a clear difference between the performance of the learning-based algorithms SNoW, MOR-CG and BiLi and the heuristic-based algorithms SInE and Aprils. If the machine-learning metrics on the MML proofs are a good indicator for the ATP performance, then there should be a corresponding performance difference in the number of problems solved. We investigate this in the following section.

3.3.3 ATP Evaluation

Vampire

In the first experiment we combined the rankings obtained from the algorithms introduced in section 3.1 with version 0.6 of the ATP Vampire [77]. All ATPs are run with 5s time limit on an Intel Xeon E5520 2.27GHz server with 24GB RAM and 8MB CPU cache. Each problem is always assigned one CPU. We use Vampire because of its good performance in the CASC competitions as well as earlier experiments with the MML [108]. For each MPTP2078 problem (containing on average 1976.5 premises), we created 20 new ATP problems, containing the 10, 20, ..., 200 highest ranked premises. The results can be seen in figure 3.3.

Figure 3.3: Problems solved – Vampire (number of problems solved against the number of top-ranked premises used, for the MOR-CG, SNoW, SInE, BiLi and Aprils rankings).

Apart from the first 10-premise batch and the three last batches, MOR-CG always solves the highest number of problems, with a maximum of 726 problems with the top 70 premises.


SNoW solves fewer problems in the beginning, but catches up in the end. BiLi solves very few problems in the beginning, but gets better as more premises are given and eventually is as good as SNoW and MOR-CG. The surprising fact (given the machine learning performance) is that SInE performs very well, on par with SNoW in the range of 60-100 premises. This indicates that SInE finds proofs that are very different from the human proofs. Furthermore, it is worth noting that most algorithms have their peak at around 70-80 premises. It seems that after that, the effect of increased premise recall is beaten by the effect of the growing ATP search space.

Figure 3.4: Problems solved – E (number of problems solved against the number of top-ranked premises used, for the MOR-CG, SNoW and SInE rankings).

E, SPASS and Z3

We also compared the top three algorithms, MOR-CG, SNoW and SInE, with three other ATPs: E 1.4 [84], SPASS 3.7 [114] and Z3 3.2 [66]. The results can be seen in figures 3.4, 3.5 and 3.6 respectively. In all three experiments, MOR-CG gives the best results. Looking at the number of problems solved by E we see that SNoW and SInE solve about the same number of problems when more than 50 premises are given. In the SPASS evaluation, SInE performs better than SNoW after the initial 60 premises. The results for Z3 are clearer, with (apart from the first run with the top 10 premises) MOR-CG always solving more problems than SNoW, and SNoW solving more problems than SInE. It is worth noting that independent of the learning algorithm, SPASS solves the fewest problems and Z3 the most, and that (at least up to the limit of 200 premises used) Z3 is hardly affected by having too many premises in the problems.


Figure 3.5: Problems solved – SPASS (number of problems solved against the number of top-ranked premises used, for the MOR-CG, SNoW and SInE rankings).

Figure 3.6: Problems solved – Z3 (number of problems solved against the number of top-ranked premises used, for the MOR-CG, SNoW and SInE rankings).

Discussion

The ATP evaluation shows that a good ML evaluation performance does not necessarily imply a good ATP performance and vice versa. E.g. SInE performs better than expected, and BiLi worse.


A plausible explanation for this is that the human-written proofs that are the basis of the learning algorithms are not the best possible guidelines for ATP proofs, because there are a number of good alternative proofs: the total number of problems proved with Vampire by the union of all prediction methods is 1197, which is more (in 5s) than the 1105 problems that Vampire can prove in 10s when using only the premises used exactly in the human-written proofs. One possible way to test this hypothesis (to a certain extent at least) would be to train the learning algorithms on all the ATP proofs that are found, and test whether the ML evaluation performance correlates more closely with the ATP evaluation performance. The most successful 10s combination, solving 939 problems, is to run Z3 with the 130 best premises selected by MOR-CG, together with Vampire using the 70 best premises selected by SInE. It is also worth noting that when we consider all provers and all methods, 1415 problems can be solved. It seems the heuristic and the learning based premise selection methods give rise to different proofs. In the next section, we try to exploit this by considering combinations of ranking algorithms.

3.4 Combining Premise Rankers

There is clear evidence about alternative proofs being feasible from alternative predictions. This should not be too surprising, because the premises are organized into a large derivation graph, and there are many explicit (and also quite likely many yet-undiscovered) semantic dependencies among them. The evaluated premise selection algorithms are based on different ideas of similarity, relevance, and functional approximation spaces and norms in them. This also means that they can be better or worse in capturing different aspects of the premise selection problem (whose optimal solution is obviously undecidable in general, and intractable even if we impose some finiteness limits). An interesting machine learning technique to try in this setting is the combination of different predictors. There has been a large amount of machine learning research in this area, done under different names; ensembles is one of the most frequent. A recent overview of ensemble based systems is given in [75], while for example [87] deals with the specific task of aggregating rankers. As a final experiment that opens the premise selection field to the application of advanced ranking-aggregation methods, we have performed an initial simple evaluation of combining two very different premise ranking methods: MOR-CG and SInE. The aggregation is done by simple weighted linear combination, i.e., the final ranking is obtained via weighted linear combination of the predicted individual rankings. We test a limited grid of weights, in the interval [0, 1] with a step value of 0.25, i.e., apart from the original MOR-CG and SInE rankings we get three more weighted aggregate rankings as follows: 0.25 ∗ CG + 0.75 ∗ SInE, 0.5 ∗ CG + 0.5 ∗ SInE, and 0.75 ∗ CG + 0.25 ∗ SInE. Figure 3.7 shows their ATP evaluation.
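A minimal sketch of this kind of rank aggregation is shown below (illustrative only, not the exact combination code used here); the equally weighted combination would be obtained with aggregate_rankings([cg, sine], [0.5, 0.5]).

    def aggregate_rankings(rankings, weights):
        """Weighted linear combination of premise rankings (rank positions).
        rankings: list of premise lists, each ordered from best to worst.
        weights: one weight per ranking, e.g. [0.5, 0.5] for MOR-CG and SInE.
        Premises missing from a ranking are treated as ranked last."""
        all_premises = set().union(*rankings)
        combined = {}
        for p in all_premises:
            score = 0.0
            for ranking, w in zip(rankings, weights):
                position = ranking.index(p) if p in ranking else len(ranking)
                score += w * position
            combined[p] = score
        return sorted(all_premises, key=combined.get)   # smaller combined rank = better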


Figure 3.7: Combining CG and SInE: Problems solved (Vampire with the pure MOR-CG and SInE rankings and the three weighted combinations).

The machine learning evaluation (done as before against the data extracted from the human proofs) is not surprising, and the omitted graphs look like linear combinations of the corresponding figures for MOR-CG and SInE. The ATP evaluation (only Vampire was used) is a very different case. For example the equally weighted combination of MOR-CG and SInE solves over 604 problems when using only the top 20 ranked premises. The corresponding values for standalone MOR-CG resp. SInE are 476, resp. 341, i.e., they are improved by 27%, resp. 77%. The equally weighted combination solves 797 problems when using the top 70 premises, which is a 10% improvement over the best result of all methods (726 problems solved by MOR-CG when using the top 70 premises). Note that unlike the external combination mentioned above, this is done only in 5 seconds, with only one ATP, one premise selector, and one threshold.

3.5 Conclusion

Heuristic and inductive methods seem indispensable for strong automated reasoning in large formal mathematics, and significant improvements can be achieved by their proper design, use and combination with precise deductive methods. Knowing previous proofs and learning from them turns out to be important not just to mathematicians, but also for automated reasoning in large theories.

We have evaluated practically all reasonably fast state-of-the-art premise selection techniques and tried some new ones. The results show that learning-based algorithms can perform better than heuristics. Relying solely on ML evaluations is not advisable since in particular heuristic premise selection algorithms often find different proofs. A combination of heuristic and learning-based predictions gives the best results.


Chapter 4

Learning from Multiple Proofs

Mathematical textbooks typically present only one proof for most of the theorems. However, there are infinitely many proofs for each theorem in first-order logic, and mathematicians are often aware of (and even invent new) important alternative proofs and use such knowledge for (lateral) thinking about new problems. In this chapter we explore how the explicit knowledge of multiple (human and ATP) proofs of the same theorem can be used in learning-based premise selection algorithms in large-theory mathematics. Several methods and their combinations are defined, and their effect on the ATP performance is evaluated on the MPTP2078 benchmark. The experiments show that the proofs used for learning significantly influence the number of problems solved, and that the quality of the proofs is more important than the quantity.

4.1 Learning from Different Proofs

In the previous chapter we tested and evaluated several premise selection algorithms on a subset of the Mizar Mathematical Library (MML), the MPTP2078 large-theory benchmark (available at http://wiki.mizar.org/twiki/bin/view/Mizar/MpTP2078), using the (human) proofs from the MML as training data for the learning algorithms. We found that learning from such human proofs helps a lot, but alternative proofs can quite often be successfully constructed by ATPs, making heuristic methods like SInE surprisingly strong and orthogonal to learning methods. Thanks to these experiments we now also have (possibly several) ATP proofs for most of the problems. In this chapter, we investigate how the knowledge of different proofs can be integrated in the machine learning algorithms for premise selection, and how it influences the performance of the ATPs. Section 4.2 introduces the necessary machine learning terminology and explains how different proofs can be used in the algorithms. In Section 4.3, we define



several possible ways to use the additional knowledge given by the different proofs. The different proof combinations are evaluated and discussed in Section 4.4, and Section 4.5 concludes.

This chapter is based on: [55] “Learning from Multiple Proofs: First Experiments”, published in the Proceedings of the 3rd Workshop on Practical Aspects of Automated Reasoning.

4.2 The Machine Learning Framework and the Data

We start with the setting introduced in the previous chapter. Γ denotes the set of all facts that appear in a given (fixed) large mathematical corpus (MPTP2078 in this chapter). The corpus is assumed to use notation (symbols) and formula names consistently, since they are used to define the features and labels for the machine learning algorithms as defined in Chapter 2. The visibility relation over Γ is defined by the chronological growth of the ITP library. We say that a proof P is a proof over Γ if the conjecture and all premises used in P are elements of Γ. Given a set of proofs ∆ over Γ in which every fact has at most one proof, the (∆-based) proof matrix µ∆ : Γ × Γ → {0, 1} is defined as

µ∆(c, p) := 1 if p is used to prove c in ∆, and 0 otherwise.

In other words, µ∆ is the adjacency matrix of the graph of the direct proof dependencies from ∆. The proof matrix derived from the MML proofs, together with the formula features, is used as training data.

In the previous chapter, we compared several different premise selection algorithms on the MPTP2078 dataset. Thanks to this comparison we have ATP proofs for 1328 of the 2078 problems, found by Vampire 0.6 [77]. For some problems we found several different proofs, meaning that the sets of premises used in the proofs differ. Figure 4.1 shows the number of different ATP proofs we have for each problem. The maximum number of different proofs is 49. On average, we found 6.71 proofs per solvable problem.

This database of proofs allows us to start considering multiple proofs for a c ∈ Γ. For each conjecture c, let Θc be the set of all ATP proofs of c in our dataset, and let nc denote the cardinality of Θc. We use a generalized proof matrix to represent multiple proofs of c. The general interpretation of µX(c, p) is the relevance (weight) of a premise p for a proof of c determined by X, where X can either be a set of proofs as above, or a particular algorithm (typically in conjunction with the data to which it is applied). For a single proof σ, let µσ := µ{σ}, i.e.,

µσ(c, p) := 1 if σ ∈ Θc and p is used to prove c in σ, and 0 otherwise.
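As a small illustration, a proof matrix of this kind can be assembled from a table of proof dependencies as follows (a dense-matrix sketch with hypothetical inputs; in practice a sparse representation would be used for a corpus of this size):

    import numpy as np

    def proof_matrix(facts, proofs):
        """0/1 proof matrix µ over Γ × Γ from one proof per conjecture.
        facts: list of fact names in chronological order (the corpus Γ).
        proofs: dict mapping a conjecture name to the set of premises used in its proof."""
        index = {f: i for i, f in enumerate(facts)}
        mu = np.zeros((len(facts), len(facts)), dtype=np.int8)
        for conjecture, premises in proofs.items():
            for p in premises:
                mu[index[conjecture], index[p]] = 1
        return mu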

We end this section by introducing the concept of redundancy, which seems to be at the heart of the problem that we are exploring. Let c be a conjecture and σ1, σ2 be proofs for c (σ1, σ2 ∈ Θc) with used premises {p1, p2} and {p1, p2, p3} respectively. In this case, premise p3 can be called redundant since we know a proof of c that does not use p3 (for this we assume some similarity between the efficiency of the proofs in Θc, which is the case for our experiments based on the 5-second time limit).


Redundant premises appear quite frequently in ATP proofs, for example, due to exhaustive equational normalization that can turn out to be unnecessary for the proof. Now imagine we have a third proof of c, σ3, with used premises {p1, p3}. With this knowledge, p2 could also be called redundant (or at least unnecessary). But one could also argue that at least one of p2 and p3 is not redundant. In such cases, it is not clear what a meaningful definition of redundancy should be. We will use the term redundancy for premises that might not be necessary for a proof.


Figure 4.1: Number of different ATP proofs for each of the 2078 problems. The problems are ordered by their appearance in the MML.

4.3 Using Multiple Proofs

We define several different combinations of MML and ATP proofs and their respective proof matrices. Recall that there are many problems for which we do not have any ATP proofs. For those problems, we will always just use the MML proof. I.e., for all proof matrices µX defined below, if there is no ATP proof of a conjecture c, then µX(c, p) = µMML(c, p).



4.3.1 Substitutions and Unions

The simplest way to combine different proofs is to either only consider the used premises of one proof, or take the union of all used premises. We consider five different combinations.

Definition 6 (MML Proofs). µMML(c, p) := 1 if p is used to prove c in the MML proof, and 0 otherwise. This dataset will be used as baseline throughout all experiments. It uses the human proofs from the Mizar library.

Definition 7 (Random ATP Proof). For each conjecture c for which we have ATP proofs, pick a (pseudo)random ATP proof σc ∈ Θc. Then µRandom(c, p) := 1 if p is a used premise in σc, and 0 otherwise.

Definition 8 (Best ATP Proof). For each conjecture c for which we have ATP proofs, pick an(y) ATP proof with the least number of used premises, σ_c^min ∈ Θc. Then µBest(c, p) := 1 if p is a used premise in σ_c^min, and 0 otherwise.

Definition 9 (Random Union). For each conjecture c for which we have ATP proofs, pick a random ATP proof σc ∈ Θc. Then µRandomUnion(c, p) := 1 if p is a premise used in σc or in the MML proof of c, and 0 otherwise.

Definition 10 (Union). For each conjecture c for which we have ATP proofs, µUnion(c, p) := 1 if p is a premise used in any ATP or MML proof of c, and 0 otherwise.
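The following sketch computes the premise weights of these five combinations for a single conjecture (hypothetical data layout, illustrative only; the fallback to the MML proof mirrors the convention stated above):

    import random

    def combination_weights(mml_premises, atp_proofs, mode="best", rng=None):
        """Premise weights for one conjecture under the combinations of Section 4.3.1.
        mml_premises: set of premises of the MML proof.
        atp_proofs: list of sets of premises, one per known ATP proof (may be empty)."""
        rng = rng or random.Random(0)
        if not atp_proofs:                       # no ATP proof: fall back to the MML proof
            chosen = set(mml_premises)
        elif mode == "best":                     # ATP proof with the fewest premises
            chosen = set(min(atp_proofs, key=len))
        elif mode == "random":
            chosen = set(rng.choice(atp_proofs))
        elif mode == "random_union":
            chosen = set(rng.choice(atp_proofs)) | set(mml_premises)
        elif mode == "union":
            chosen = set().union(*atp_proofs) | set(mml_premises)
        else:
            raise ValueError(mode)
        return {p: 1 for p in chosen}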

4.3.2 Premise Averaging

Proofs can also be combined by learning from the average used premises. We consider three options: the standard average, a biased average and a scaled average.

Definition 11 (Average). The average gives equal weight to each proof:

µAverage(c, p) = (1 / (nc + 1)) · ( Σ_{σ∈Θc} µσ(c, p) + µMML(c, p) )



The intuition is that the average gives a better idea of how necessary a premise really is. When there are very different proofs, the average will give a very low weight to every premise. That is why we also tried scaling as follows:

Definition 12 (Scaled Average). The scaled average ensures that there is at least one premise with weight 1:

µScaledAverage(c, p) = ( Σ_{σ∈Θc} µσ(c, p) + µMML(c, p) ) / max_{q∈Γ} ( Σ_{σ∈Θc} µσ(c, q) + µMML(c, q) )

Another experiment is to make the weight of all the ATP proofs equal to the weight of the MML proof:

Definition 13 (Biased Average).

µBiasedAverage(c, p) = (1/2) · ( ( Σ_{σ∈Θc} µσ(c, p) ) / nc + µMML(c, p) )

4.3.3 Premise Expansion

Consider a situation where a ⊢ b and b ⊢ c. Obviously, not only b, but also a proves c. When we consider the used premises in a proof, we only use the information about the direct premises (b in the example), but nothing about the indirect premises (a in the example), the premises of the direct premises. Using this additional information might help the learning algorithms. We call this premise expansion and define three different weight functions that try to capture this indirect information. All three penalize the weight of the indirect premises with a factor of 1/2.

Definition 14 (MML Expansion). For the MML expansion, we only consider the MML proofs and their one-step expansions:

µMMLExp(c, p) = µMML(c, p) + ( Σ_{q∈Γ} µMML(c, q) µMML(q, p) ) / 2

Note that since µMML(c, p) is either 0 or 1, the sum Σ_{q∈Γ} µMML(c, q) µMML(q, p) just counts how often p is a grandparent premise of c.

Definition 15 (Average Expansion). The average expansion takes µAverage instead of µMML:

µAverageExp(c, p) = µAverage(c, p) + ( Σ_{q∈Γ} µAverage(c, q) µAverage(q, p) ) / 2

Definition 16 (Scaled Expansion). And finally, we consider an expansion of the scaled average:

µScaledAverageExp(c, p) = µScaledAverage(c, p) + ( Σ_{q∈Γ} µScaledAverage(c, q) µScaledAverage(q, p) ) / 2

Deeper expansions and different penalization factors are possible, but given the performance of these initial tests shown in the next section we decided to not investigate further.
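Viewed as matrices over Γ × Γ, the averaging and expansion operations are simple linear-algebra steps: the one-step expansion of a weight matrix µ is µ + (µ·µ)/2. The sketch below is illustrative only (dense numpy arrays and hypothetical inputs, not the code used for the experiments):

    import numpy as np

    def average_row(atp_premise_sets, mml_premises, all_facts):
        """µAverage for one conjecture: the mean of the 0/1 indicator vectors of its
        ATP proofs and its MML proof (Definition 11)."""
        rows = [np.isin(all_facts, list(s)).astype(float)
                for s in list(atp_premise_sets) + [mml_premises]]
        return np.mean(rows, axis=0)

    def expand(mu, penalty=0.5):
        """One-step premise expansion of a proof matrix: µ + penalty · (µ @ µ).
        Entry (c, p) of µ @ µ sums µ(c, q)·µ(q, p) over q, i.e. the grandparent
        contributions of Definitions 14-16."""
        mu = np.asarray(mu, dtype=float)
        return mu + penalty * (mu @ mu)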


4.4 Results

4.4.1 Experimental Setup

All experiments were done on the MPTP2078 dataset. Because of its good performance in earlier evaluations, we used the Multi-Output-Ranking (MOR) learning algorithm for the experiments. For each conjecture, MOR is allowed to train on all proofs that were (in the chronological order of MML) done up to that conjecture. In particular, this means that the algorithms do not train on the data they were asked to predict. Three-fold cross validation on the training data was used to find the optimal parameters. For the combinations in 4.3.1, the AUC measure was used to estimate the performance. The other combinations used the square-loss error. For each of the 2078 problems, MOR predicts a ranking of the premises. We again use Vampire 0.6 for evaluating the predictions. Version 0.6 was chosen to make the experiments comparable with the earlier results. Vampire is run with a 5s time limit on an Intel Xeon E5520 2.27GHz server with 24GB RAM and 8MB CPU cache. Each problem is always assigned one CPU. For each MPTP2078 problem, we created 20 new problems, containing the 10, 20, ..., 200 highest ranked premises and ran Vampire on each of them. The graphs show how many problems were solved using the 10, 20, ..., 200 highest ranked premises. As a performance baseline, Vampire 0.6 in CASC mode (that means also using SInE with different parameters on large problems) can solve 548 problems in 10 seconds [2].

4.4.2 Substitutions and Unions

Figure 4.2 shows the performance of the simple proof combinations introduced in 4.3.1. It can be seen that using ATP instead of MML proofs can improve the performance considerably, in particular when only a few premises are provided. One can also see the difference that the quality of the proof makes. The best ATP proof predictions always solved more problems than the random ATP proof predictions. Taking the union of two or more proofs decreases the performance. This can be due to the redundancy introduced by considering many different premises and suggests that the ATP search profits most from simple and clear (one-directional) advice, rather than from a combination of ideas.

4.4.3 Premise Averaging

Taking the average of the used premises could be a good way to combat the redundant premises. The idea is that premises that are actually important should appear in almost every proof, whereas premises that are redundant should only be present in a few proofs. In this way, important premises should get a high weight and unimportant premises a low weight. The results of the averaging combinations can be seen in Figure 4.3. Apart from the scaled average, it seems that taking the average does perform better than taking the union. However, the baseline of only the MML premises is better than or almost as good as the average predictions.



Figure 4.2: Comparison of the combinations presented in 4.3.1.


Figure 4.3: Comparison of the combinations presented in 4.3.2.



4.4.4 Premise Expansions

Finally, we compare how expanding the premises affects the ATP performance in Figure 4.4. While expanding the premises does add additional redundancy, it also adds further potentially useful information.


Figure 4.4: Comparison of the combinations presented in 4.3.3.

However, all expansions perform considerably worse than the MML proof baseline. It seems that the additional redundancy outweighs the usefulness.

4.4.5 Other ATPs

We also investigated how learning from Vampire proofs affects other provers, by running E 1.4 [84] and Z3 3.2 [66] on some of the learned predictions. Figure 4.5 shows the results. The predictions learned from the MML premises serve as a baseline. E using the predictions based on the best Vampire proofs is not improved over the MML-based predictions as much as Vampire is. This would suggest that the ATPs really profit most from “their own” best proofs. However, for Z3 the situation is the opposite: the improvement by learning from the best Vampire proofs is at some points even slightly better than for Vampire itself, and this helps Z3 to reach the maximum performance earlier than before. Also, learning from the averaged proofs behaves differently for the ATPs. For E, the MML and the averaged proofs give practically the same performance, for Vampire the MML proofs are better, but for Z3 the averaged proofs are quite visibly better.



(a) E


(b) Z3

Figure 4.5: Performance of other ATPs when learning from Vampire proofs.



4.4.6 Comparison With the Best Results Obtained so far

In the previous chapter, we found that a combination of SInE [41] and the MOR algorithm (trained on the MML proofs) has so far the best performance on the MPTP2078 dataset. Figure 4.6 compares the new results with this combination. Furthermore we also try combining SInE with MOR trained on ATP proofs. For comparison we also include our baseline, the MML Proof predictions, and the results obtained from the SInE predictions.


Figure 4.6: Comparison of the best performing algorithms.

While learning from the best ATP proofs leads to more problems solved than learning from the MML proofs, the combination of SInE and learning from MML proofs still beats both. However, combining the SInE predictions with the best ATP proof predictions gives even better results with a maximum of 823 problems solved (a 3.3% increase over the former maximum) when given the top 70 premises.

4.4.7 Machine Learning Evaluation

Machine learning has several methods to measure how good a learned classifier is without having to run an ATP. In the earlier experiments the results of the machine learning evaluation did not correspond to the results of the ATP evaluation. For example, SInE performed worse than BiLi on the machine learning evaluation but better than BiLi on the ATP evaluation. Our explanation was that we are training from (and therefore measuring) the wrong data. With SInE the ATP found proofs that were very different from the MML proofs.



(a) 100%Recall on the MML proofs.


(b) 100%Recall on the best ATP proofs.

Figure 4.7: 100%Recall comparison between evaluating on the MML and the best ATP proofs. The graphs show how many problems have all necessary premises (according to the training data) within the n highest ranked premises.



In Figure 4.7 we compare the machine learning evaluation (the 100%Recall measure) depending on whether we evaluate on the MML proofs or on the best ATP proofs. Ideally, the machine learning performance of the algorithms would correspond to the ATP performance (see Figure 4.6). This is clearly not the case for the 100%Recall on the MML proofs graph. The best ATP predictions are better than the MML proof predictions, and SInE solves more than 200 problems. With the new evaluation, the 100%Recall on the best ATP proofs graph, the performance is more similar to the actual ATP performance, but there is still room for improvement.

4.5 Conclusion

The fact that there is never only one proof makes premise selection an interesting machine learning problem. Since it is in general undecidable to know the “best prediction”, the domain has a randomness aspect that is quite unusual (Chaitin-like [22]) in AI.

In this chapter we experimented with different proof combinations to obtain better information for high-level proof guidance by premise selection. We found that it is easy to introduce so much redundancy that the predictions created by the learning algorithms are not good for existing ATPs. On the other hand we saw that learning from proofs with few premises (and hence probably less redundancy) increases the ATP performance. It seems that we should look for a measure of how ‘good’ or ‘simple’ a proof is, and only learn from the best proofs. Such measures could be for example the number of inference steps done by the ATP during the proof search, or the total CPU time needed to find the proof.

Another question that was (at least initially) answered in this chapter is to what extent learning from human proofs can help an ATP, in comparison to learning from ATP proofs. We saw that while not optimal, learning from human proofs seems to be approximately equivalent to learning from suboptimal (for example random, or averaged) ATP proofs. Learning from the best ATP proof is about as good as combining SInE with learning from the MML proofs. Combining SInE with learning from the best ATP proof still outperforms all other methods.


Chapter 5

Automated and Human Proofs in General Mathematics

First-order translations of large mathematical repositories allow discovery of new proofs by automated reasoning systems. Large amounts of available mathematical knowledge can be re-used by combined AI/ATP systems, possibly in unexpected ways. But automated systems can also be more easily misled by irrelevant knowledge in this setting, and finding deeper proofs is typically more difficult. Both large-theory AI/ATP methods, and translation and data-mining techniques of large formal corpora, have developed significantly in recent years, providing enough data for an initial comparison of the proofs written by mathematicians and the proofs found automatically. This chapter describes such a comparison conducted over the 52 000 mathematical theorems from the Mizar Mathematical Library.

5.1 Introduction: Automated Theorem Proving in Mathematics

Computers are becoming an indispensable part of many areas of mathematics [38]. As their capabilities develop, human mathematicians are faced with the task of steering, comprehending, and evaluating the ideas produced by computers, similar to the players of chess in recent decades. A notable milestone is the automatically found proof of the Robbins conjecture by EQP [62] and its postprocessing into a human-comprehensible proof by ILF [25] and Mathematica [29]. Especially in small equational algebraic theories (e.g., quasigroup theory), a number of nontrivial proofs have been already found automatically [74], and their evaluation, understanding, and automated post-processing is an open problem [113].

This chapter is based on: [3] “Automated and Human Proofs in General Mathematics: An Initial Comparison”, published in the Proceedings of the 18th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning. All three authors contributed equally to the paper. Part of Section 5.2.1 is taken from [2] “Premise Selection for Mathematics by Corpus Analysis and Kernel Methods”, published in the Journal of Automated Reasoning.



In recent years, large general mathematical corpora like the Mizar Mathematical Library (MML) and the Isabelle/HOL library are being made available to automated reasoning and AI methods [102, 73], leading to the development of automated reasoning techniques working in large theories with many previous theorems, definitions, and proofs that can be re-used [110, 41, 64, 112]. A recent evaluation (and tuning) of ATP systems on the MML [108] has shown that the Vampire/SInE [77] system can already re-prove 39% of the MML's 52 000 theorems when the necessary premises are precisely selected from the human proofs, and about 14% of the theorems when the ATP is allowed to use the whole available library, leading on average to 40 000 premises in such ATP problems. (Mizar proofs are initially human-written, but they are formal and machine-understandable. That allows their automated machine processing and refactoring, which can make them “less human”. Yet, we believe that their classification as “human” is appropriate, and that MML/MPTP is probably the most suitable resource today for attempting this initial comparison of ATP and human proofs.) In the previous chapters we showed that re-using (generalizing and learning) the knowledge accumulated in previous proofs can further significantly improve the performance of combined AI/ATP systems in large-theory mathematics. This performance, and the recently developed proof analysis for the MML [4], allowed an experiment with finding automatically all proofs in the MML by a combination of learning and ATP methods. This is described in Section 5.2. The 9 141 ATP proofs found automatically were then compared using several metrics to the human proofs in Section 5.3 and Section 5.4.

5.2 Finding proofs in the MML with AI/ATP support

To create a sufficient body of ATP proofs from the MML, we have conducted a large AI/ATP experiment that makes use of several recently developed techniques and significant computational resources. The basic idea of the experiment is to lift the setting used in [2] for large-theory automated proving of the MPTP2078 benchmark to the whole MML (approximately 52 000 theorems and more than 100 000 premises). The setting consists of the following three consecutive steps:

• mining proof dependencies from all MML proofs;

• learning premise selection from the mined proof dependencies;

• using an ATP to prove new conjectures from the best selected premises.

5.2.1 Mining the dependencies from all MML proofs

For the experiments below, we used Alama et al.'s method for computing fine-grained dependencies [4]. The first step in the computation is to break up each article in the MML into a sequence of Mizar texts, each consisting of a single statement (e.g., a theorem, definition, unexported lemma). Each of these texts can, with suitable preprocessing, be regarded as a complete, valid Mizar article in its own right.



The decomposition of a whole MML article into such smaller articles typically requires a number of nontrivial refactoring steps, comparable, e.g., to automated splitting and re-factoring of large programs written in programming languages with complicated syntactic mechanisms. In Mizar, every article has a so-called environment: a list ENV_0 = [statement_j : 1 ≤ j ≤ length(ENV_0)] of statements statement_j specifying the background knowledge (theorems, notations, etc.) that is used to verify the article. The actual Mizar content contained in an article's environment is, in general, a rather conservative overestimate of the statements that the article actually needs. The algorithm first defines the current environment as ENV_0. It then considers each statement in ENV_0 and tries to verify the article using the current environment without the considered statement. If the verification succeeds, the considered statement is deleted from the current environment. To be more precise, starting with the original environment ENV_0 (in which the article verification succeeds), the algorithm works by constructing a sequence of finer environments {ENV_i : 1 ≤ i ≤ length(ENV_0)} such that

ENV_i := ENV_{i−1} if the verification fails in ENV_{i−1} − {statement_i}, and ENV_i := ENV_{i−1} − {statement_i} otherwise.

The article verification thus still succeeds in the final ENV_{length(ENV_0)} environment, and this environment consists of all the statements of ENV_0 whose removal caused the article verification to fail during this construction. (Note that this final environment could in general still be made smaller, since after the removal of a certain statement another statement might become unnecessary, and its construction depends on the initial ordering of statements in the environment, which is chosen and fixed for all experiments.) The dependencies of the original statement, which formed the basis of the article, are then defined as the elements of ENV_{length(ENV_0)}. This process is described in detail in [4], where it is conducted for the 100 initial articles from the MML. The computation takes several days for all of MML; however, the information thus obtained gives rise to an unparalleled corpus of data about human-written proofs in the largest available formal body of mathematics. In the final account, the days of computation pay off, by providing more precise advice for proving new conjectures over the whole MML. An approximate estimate of the computational resources taken by this job is about ten days of full (parallel) CPU load (12 hyperthreading Xeon 2.67 GHz cores, 24 GB RAM) of the Mizar server at the University of Alberta. The resulting dependencies for all MML items can be viewed online at http://mizar.cs.ualberta.ca/mizar-items.
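The greedy minimization itself is simple; the following sketch shows the loop, with the call to the Mizar verifier abstracted into a hypothetical `verifies` oracle:

    def minimize_environment(env0, verifies):
        """Greedy dependency minimization.
        env0: list of environment statements in a fixed order.
        verifies: hypothetical oracle; takes a list of statements and returns True
        iff the article still verifies with exactly that environment."""
        current = list(env0)
        for statement in list(env0):
            candidate = [s for s in current if s is not statement]
            if verifies(candidate):
                current = candidate        # the statement was unnecessary, drop it
        return current                     # the dependencies of the article's statement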

5.2.2 Learning Premise Selection from Proof Dependencies

To learn premise selection from proof dependencies, one characterizes all MML formulas by suitable features, and feeds them (together with the detailed proof information) to a machine learning system that is trained to advise premises for later conjectures. Formula symbols have been used previously for this task in [102]. Thanks to sufficient hardware being available, we have for the first time included also term features generated by the MaLARea framework, added to it in 2008 [110] for experiments with smaller subsets of MML.



Thus, for each MML formula we include as its characterization also all the subterms and subformulas contained in the formula, which makes the learning and prediction more precise. To our surprise, the EPROVER-based [84] utility that consistently numbers all shared terms in all formulas, written for this purpose in 2008 by Josef Urban, scaled without problems to the whole MML. This feature-generation phase took only minutes and created over one million features. We have also briefly explored using validity in finite models (introduced in MaLARea in 2008, building on Pudlák's previous work [76]) as a more semantic way of characterizing formulas. However, this has turned out to be very time-consuming, most likely caused by the LADR-based clausefilter utility struggling with evaluating in the models some complicated (containing many quantifiers) mathematical formulas. Clearly, further optimizations are needed for extracting such semantic characterizations for all of MML. Even without such features, the machine learning was already pushed to the limit. The kernel-based multi-output ranker presented in Section 2.2.4 turned out to be too slow and memory-exhaustive to handle over one million features and over a hundred thousand training examples. The SNoW system used in naive Bayes mode took several gigabytes of RAM to train on the data, and on average about a second (ca. a day of computation for all of MML) to produce a premise prediction for each MML problem, based always on incremental training on all previous MML proofs (in the incremental learning mode, the evaluation and training are done at the same time for each example, hence there was no extra time taken by training). The results of this run are SNoW premise predictions for all of MML, available online as the raw SNoW output at http://mizar.cs.ualberta.ca/~mptp/proofcomp/snow_predictions.tar.gz, and also postprocessed into corresponding ATP problems (see below).

5.2.3 Using ATPs to Prove the Conjectures from the Selected Premises

As the MML grows from axioms of set theory to advanced mathematics, it gives rise to a chronological ordering of its theorems. When a new theorem C is conjectured, all the previous theorems and definitions are available as premises, and all the previous proofs are used to learn which of these premises are relevant for C. The SNoW system provides a ranking of all premises, and the best premises are given to an ATP which attempts a proof of C. There are many ways to organize several ATP systems to try to prove C, with different numbers of the highest ranked premises and with different time limits. For our experiments, we have fixed the ATP system to be Vampire (version 1.8) [77], and we have always used the 200 highest ranked premises and a time limit of 20 seconds. A 12-core 2.67 GHz Xeon server at the University of Alberta was used for (parallelized) proving, which took about a day in real time. This has produced 9 141 automatically found proofs that we further analyze. The overall success rate is over 18% of theorems proved, which is so far the best result on the whole MML, but we have not really focused yet on getting this as high as possible. For example, running Vampire in parallel with both 40 and 200 best recommended premises has been shown to significantly improve the success rate, and a preliminary experiment with the Z3 solver has provided another two thousand proofs from the problems with the 200 best premises.



Unfortunately, Z3 does not (yet) print the names of premises used in the proofs, so its proofs would not be directly usable for the analysis that is conducted here. When using a large number of premises, an ATP proof can often contain unnecessary premises. To weed out those unnecessary premises, we always re-run the ATP with only the premises that were used in the first run. The ATP problems are also available online for further experiments, as well as all the proofs found.

5.3 Proof Metrics

We thus have, for 9 141 Mizar theorems φ, the set of premises that were used in the (minimized) ATP proof of φ. Each ATP proof was found completely independently of its Mizar proof, i.e., no information (e.g., about the premises used) from the Mizar proof was transferred to the ATP proof. This gives us a notion of dependency for Mizar theorems, derived from an ATP. From the Mizar proof dependency analysis we also know precisely what Mizar items are needed for a given Mizar (human-written) proof to be successful.

Definition 17. For a Mizar theorem φ, let PMML(φ) be the minimal set of premises needed for the success of the (human) MML proof of φ. Let PATP(φ) be the set of premises used by an ATP to prove φ.

This gives rise to the notions of “immediate dependence” and “indirect dependence” of one Mizar item a upon another Mizar item b:

Definition 18. For Mizar items a and b, a
