2018 Conference on Systems Engineering Research

Comparing Frequentist and Bayesian Approaches for Forecasting Binary Inference Performance

Sean D. Vermillion, Jordan L. Thomas, David P. Brown, and Dennis M. Buede

Innovative Decisions, Inc., Vienna, VA, USA

Abstract

In this paper, we compare forecasts of the quality of inferences made by an inference enterprise, generated from a frequentist perspective and from a Bayesian perspective. An inference enterprise (IE) is an organizational entity that uses data, tools, people, and processes to make mission-focused inferences. When evaluating changes to an IE, the quality of the inferences that a new, hypothetical IE makes is uncertain. We can model the uncertainty in quality or performance metrics, such as recall, precision, and false positive rate, as probability distributions generated through either a frequentist approach or a Bayesian approach. In the frequentist approach, we run several experiments evaluating inference quality and fit a distribution to the results. In the Bayesian approach, we update prior performance beliefs with empirical results. We compare the two approaches on eighteen forecast questions and score the two sets of forecasts against ground truth answers. Both approaches forecast similar performance means, but the frequentist approach systematically produces wider confidence intervals. Therefore, the frequentist approach out-scores the Bayesian approach on metrics sensitive to interval width. © 2018 Sean D. Vermillion, Jordan L. Thomas, David P. Brown, and Dennis M. Buede.

Keywords: Inference Enterprise, Statistical Inference, Bayesian Updating, Binary Classification

1. Introduction

In this paper, we compare two approaches for forecasting inference enterprise model performance: (1) a frequentist approach and (2) a Bayesian approach. An inference enterprise (IE) is an organizational entity that uses data, tools, people, and processes to make mission-focused inferences [1]. For example, airport security organizations combine scanning equipment with trained agents' judgment to infer whether or not a passenger is a security threat. An inference enterprise model (IEM) is a model that uses available organizational information to forecast inference quality given


Nomenclature

DT    Decision Tree Classifier
f     False Positive Rate
FN    False Negative Count
FP    False Positive Count
IE    Inference Enterprise
IEM   Inference Enterprise Model
p     Precision
r     Recall
SVM   Support Vector Machine Classifier
TN    True Negative Count
TP    True Positive Count

changes to an IE. In many cases, organizations are reluctant to release representative data or lack the data needed for IEM activities [2]. Thus, the quality of inferences made by a proposed or hypothetical IE architecture is uncertain. The question becomes: how do we best forecast the performance of an IE's processes and architecture?

For forecast generation, there are two primary schools of thought: the frequentist approach and the Bayesian approach. The frequentist approach to forecast generation is generally seen as the more quantitative approach to probability, as it does not necessitate prior knowledge [3]. However, the primary limitation of the frequentist forecast is that all uncertainty contained in the forecast is attributed to randomness in the sample, i.e. aleatory uncertainty [4]. This means that a frequentist-generated forecast depends on the system being random and repeatable. An additional limitation of the frequentist approach is that it does not condition on observed data [5]. Bayesian probability, on the other hand, is generally thought to be the more subjective approach to probability because it requires a prior [3]. Since Bayesian probability uses prior knowledge, it can incorporate uncertainty due to lack of knowledge, i.e. epistemic uncertainty [4], in addition to uncertainty due to randomness. While many researchers have sought to find agreement between frequentist and Bayesian methods [6-8], significantly less research has compared the two schools of thought on their ability to predict performance.

In this paper we compare the frequentist and Bayesian approaches to forecasting inference quality. Specifically, we generate eighteen forecasts using both approaches and score the forecasts against ground truth answers. Additionally, we qualitatively compare the properties of the forecasts generated using both approaches.

This paper is organized as follows. Section 2 outlines the foundations relevant to this research, including a background on the metrics used to measure classifier performance and a summary of the Bayesian probabilistic framework used in the Bayesian forecast approach. In Section 3, we provide details of the methodology, and in Section 4 we discuss our results. We conclude the paper with a summary of the research and directions for future work.

2. Foundations

2.1. Binary Classifier Performance

In this paper, we only consider IEs that make binary inferences. Such IEs are analogous to, or even incorporate, binary classifiers, and thus we characterize inference quality in the same way a binary classifier's performance is characterized. Binary classification experiment results can be summarized as a confusion matrix; see Table 1. Each data point used in an experiment is binned into one of the four possible states in the confusion matrix: TP represents the total number of true positive data points, FN represents the total number of false negative data points, etc., so that N = TP + FN + FP + TN is the total number of data points used in an experiment. Classifier performance metrics are functions of elements in the confusion matrix [9]. Three such performance metrics are recall r, precision p, and false positive rate f, which are defined as follows:

$$ r = \frac{TP}{TP + FN}, \qquad p = \frac{TP}{TP + FP}, \qquad f = \frac{FP}{FP + TN} \qquad (1) $$


Table 1. Binary classification confusion matrix.

                             Predicted Label
  Actual Label               Target Class            Non-target Class
  Target Class               True Positive (TP)      False Negative (FN)
  Non-target Class           False Positive (FP)     True Negative (TN)
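To make Eq. (1) concrete, the following is a minimal Python sketch (the function name and example counts are illustrative, not taken from the paper) that computes the three metrics from a single confusion matrix laid out as in Table 1.

```python
def performance_metrics(tp: int, fn: int, fp: int, tn: int):
    """Compute recall, precision, and false positive rate from confusion-matrix counts."""
    recall = tp / (tp + fn)                # effectiveness at identifying target labels
    precision = tp / (tp + fp)             # agreement of predicted positives with true labels
    false_positive_rate = fp / (fp + tn)   # failure to identify non-target labels
    return recall, precision, false_positive_rate

# Example with hypothetical counts: 40 TP, 10 FN, 20 FP, 3430 TN
r, p, f = performance_metrics(tp=40, fn=10, fp=20, tn=3430)
print(f"recall={r:.3f}, precision={p:.3f}, false positive rate={f:.5f}")
```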

Recall is a classifier's effectiveness at identifying target labels, precision measures the agreement of the positive labels given by the classifier with the true labels, and false positive rate measures a classifier's failure to identify non-target labels. Recall and precision are to be maximized, while false positive rate is to be minimized.

When we have uncertainty in our data, or we test a classifier using multiple datasets, we generate a confusion matrix for each experiment. Therefore, we compute a recall, precision, and false positive rate for each experiment. From a frequentist perspective, we can fit distributions to the set of recall, precision, and false positive rate values generated through experiments, from which we make performance forecasts. In the next section, we discuss an alternative, Bayesian approach to generating distributions for recall, precision, and false positive rate from empirical data.

2.2. Probabilistic Framework for Classifier Performance Prediction

Goutte and Gaussier introduce a probabilistic framework for inferring recall and precision distributions given empirical results in the form of a confusion matrix as in Table 1 [10]. We extend this framework to inferring a distribution for false positive rate as well. For tractability, we summarize the framework in terms of inferring a distribution for recall, then discuss the translation to precision and false positive rate. Through Bayes' theorem, the probability of recall, r, given empirical observations, D, is the following:

$$ \Pr(r \mid D) \propto \Pr(D \mid r)\,\Pr(r) \qquad (2) $$

where D = (TP, FN, FP, TN), Pr(D | r) is the likelihood distribution for D, and Pr(r) is the prior belief distribution over r. Since recall takes a value between zero and one, we intuitively model our prior belief using a beta distribution such that

$$ r \sim \mathrm{Beta}(\alpha, \beta) \;\Rightarrow\; \Pr(r) \propto r^{\alpha - 1}(1 - r)^{\beta - 1} \qquad (3) $$

Goutte and Gaussier model the distribution of D as a multinomial distribution, whose marginal and conditional distributions are binomial. Using this property, we derive the likelihood distribution, Pr(D | r), through the following:

$$ \Pr(D) = \frac{N!}{TP!\,FN!\,FP!\,TN!}\,\pi_{TP}^{TP}\,\pi_{FN}^{FN}\,\pi_{FP}^{FP}\,\pi_{TN}^{TN} \;\Rightarrow\; \Pr(D \mid r) \propto r^{TP}(1 - r)^{FN} \qquad (4) $$

where the \(\pi\) terms denote the cell probabilities of the multinomial distribution.

Combining Eqs. 2-4, the posterior distribution for recall is the following:

$$ \Pr(r \mid D) \propto r^{TP + \alpha - 1}(1 - r)^{FN + \beta - 1} \;\Rightarrow\; r \mid D \sim \mathrm{Beta}(TP + \alpha,\, FN + \beta) \qquad (5) $$

Using the same logic, we generate a distribution for precision, Pr(p | D), by replacing FN in Eq. 5 with FP, and we generate a distribution for false positive rate, Pr(f | D), by replacing TP and FN in Eq. 5 with FP and TN, respectively.
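As a sketch of how Eq. (5) and its precision and false positive rate analogues might be implemented (assuming SciPy; the counts and function name are illustrative, not the authors' code), the posteriors are simply Beta distributions parameterized by the confusion matrix counts plus the prior parameters.

```python
from scipy import stats

def posterior_distributions(tp, fn, fp, tn, alpha=1.0, beta=1.0):
    """Beta posteriors for recall, precision, and false positive rate (Eq. 5),
    starting from a Beta(alpha, beta) prior on each metric."""
    recall_post = stats.beta(tp + alpha, fn + beta)
    precision_post = stats.beta(tp + alpha, fp + beta)
    fpr_post = stats.beta(fp + alpha, tn + beta)
    return recall_post, precision_post, fpr_post

# With a uniform Beta(1, 1) prior and one hypothetical confusion matrix:
recall_post, precision_post, fpr_post = posterior_distributions(tp=40, fn=10, fp=20, tn=3430)
print(recall_post.mean(), recall_post.interval(0.60))  # posterior mean and central 60% interval
```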


Fig. 1. Inference enterprise modeling meta-model.

The above probability distribution definitions incorporate only a single observation of a confusion matrix. However, we can use several observations to update the recall, precision, and false positive rate distributions. Suppose we perform one experiment and generate D_1; our recall distribution is computed as in Eq. 5. Then we perform a second experiment and generate D_2. The posterior distribution generated from D_1, Pr(r | D_1), becomes our new prior distribution, and our new posterior is r | D_1, D_2 ~ Beta(α + TP_1 + TP_2, β + FN_1 + FN_2). Extending this procedure to n empirical observations, recall is modeled as the following:

$$ r \mid D_1, D_2, \ldots, D_n \sim \mathrm{Beta}\!\left(\alpha + \sum_{i=1}^{n} TP_i,\; \beta + \sum_{i=1}^{n} FN_i\right) \qquad (6) $$
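The conjugacy behind Eq. (6) means that updating one experiment at a time is equivalent to pooling the counts across all experiments. A short illustrative check (hypothetical counts, assuming SciPy):

```python
from scipy import stats

# Hypothetical confusion matrices (tp, fn, fp, tn) from three simulation runs
runs = [(40, 10, 20, 3430), (35, 15, 25, 3425), (42, 8, 18, 3432)]

alpha, beta = 1.0, 1.0  # uniform Beta(1, 1) prior on recall

# Sequential updating: the posterior after each run becomes the prior for the next
a, b = alpha, beta
for tp, fn, fp, tn in runs:
    a, b = a + tp, b + fn

# Equivalent pooled update (Eq. 6)
a_pooled = alpha + sum(tp for tp, fn, fp, tn in runs)
b_pooled = beta + sum(fn for tp, fn, fp, tn in runs)
assert (a, b) == (a_pooled, b_pooled)

recall_posterior = stats.beta(a, b)
print(recall_posterior.mean(), recall_posterior.interval(0.60))
```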

Precision and false positive rate distributions are updated similarly.

3. Methodology

3.1. General Approach

To compare forecasts generated from frequentist and Bayesian perspectives, we use both approaches to generate eighteen performance forecasts from the same empirical observations. We then score the quality of the two forecast sets against ground truth answers using three scoring metrics. In the remainder of this section, we discuss the specific forecast questions, IE structures under consideration, data simulation, and scoring metrics used for comparing the two forecasting approaches.

3.2. Forecast Questions

In this paper, we use the eighteen forecast questions asked in the fifteenth challenge problem of the Intelligence Advanced Research Projects Activity-sponsored SCITE research program. This challenge problem is motivated by a particular IE structure, use case, and narrative in which an organization wishes to identify line managers within its organization of approximately 3,500 individuals using data on its employees' online activity. This data potentially includes web proxy data, email habits, VPN logs, and human resource (HR) data. The organization contracts an independent modeling team to estimate the recall, precision, and false positive rate of a decision tree (DT) classifier and a support vector machine (SVM) classifier for detecting line managers based on (1) only web proxy data, (2) only


web proxy and email data, and (3) all available data. The organization provides the modeling team with data from 4 October 2015 to 24 September 2016 and asks the modeling team to predict performance for 2 October 2016 to 25 February 2017. The combination of three performance metrics, two classifiers, and three data structures yields a total of eighteen forecasts the independent modeling team is asked to provide the organization. Forecasts include a point estimate of the mean recall, precision, and false positive rate with a 60% certainty interval for each classifier and data structure under consideration.

3.3. Data Simulation

For training and testing the DT and SVM classifiers, we are only provided aggregated statistics of the organization's approximately 3,500 employees' online activity. These statistics include means, standard deviations, correlations, histograms, and weekly autocorrelation for 142 variables for each week in the training period: 4 October 2015 to 24 September 2016. The 142 variables are grouped into one of the following categories:

1. Web proxy detectors, e.g. total number of unblocked connections, total number of blocked connections, etc.
2. Email detectors, e.g. total number of emails sent, total number of emails received, etc.
3. VPN detectors, e.g. average time of day of first connection to the VPN each day, maximum time of day of first connection to the VPN each day, etc.
4. Human resources detectors, e.g. years at the organization, number of unique IP addresses associated with a user, etc.

Separate sets of statistics are provided for line managers and non-line managers during this time period.

The focus of this paper is not on describing the data simulation procedure in depth, but we summarize it here. An overview of our Monte Carlo approach for simulating datasets from these aggregate statistics is presented in Fig. 1. For each week in the training period, we sample from a 142-dimensional Gaussian copula to generate a correlated sample set using the supplied correlation matrices. We transform this sample using the supplied autocorrelations and properties of a conditional multivariate Gaussian distribution to generate the data sample for another week in the training period. We then transform the uniform marginals of each week's Gaussian copula dataset to match the supplied histograms for each of the 142 variables. We extrapolate this procedure to the test period (2 October 2016 to 25 February 2017) by allowing for variability in correlations and autocorrelations. With each simulated dataset, we train and test the DT and SVM classifiers. We repeat this procedure thirty times.

3.4. Forecast Generation

The primary focus of this paper is on the final step shown in Fig. 1: building performance distributions. For each simulated dataset and each classifier-data structure pair, we compute a confusion matrix as in Table 1. The frequentist and Bayesian inference approaches use the confusion matrix results in different ways to build performance distributions and ultimately performance forecasts. With the frequentist approach, forecast mean performances are computed using arithmetic means, so, for instance, the mean for a recall forecast is computed as the following:

$$ \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i \qquad (7) $$

where n is the number of simulation runs and r_i is the recall computed from the i-th run's confusion matrix; see Eq. 1 for computing precision and false positive rate. We compute the upper and lower 60% certainty bounds for the frequentist approach by finding the 80th and 20th percentile performance metric values, respectively.

With the Bayesian updating approach, we are explicitly building a beta distribution, and can thus use the mean of a beta distribution and a beta distribution's cumulative distribution function (CDF) to compute our forecast. For a recall forecast, the distribution for recall given the empirical results D_1, …, D_n is r | D_1, …, D_n ~ Beta(α + Σ_i TP_i, β + Σ_i FN_i), where


α and β are the parameters of the prior distribution. Therefore, the forecast mean estimate is computed as the mean of this distribution:

$$ \bar{r} = \frac{\alpha + \sum_{i=1}^{n} TP_i}{\alpha + \beta + \sum_{i=1}^{n}\left(TP_i + FN_i\right)} \qquad (8) $$
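To summarize the two forecast recipes, here is a minimal sketch (assuming NumPy and SciPy; function names and counts are illustrative, not the authors' code) that produces a mean and a central 60% certainty interval for recall under each approach; the Beta-CDF interval used for the Bayesian bounds is the one described in the next paragraph.

```python
import numpy as np
from scipy import stats

def frequentist_forecast(recalls, level=0.60):
    """Arithmetic mean plus central certainty interval from per-run recall values (Eqs. 1 and 7)."""
    lo, hi = 50 * (1 - level), 50 * (1 + level)  # 20th and 80th percentiles for a 60% interval
    return np.mean(recalls), np.percentile(recalls, [lo, hi])

def bayesian_forecast(counts, alpha=1.0, beta=1.0, level=0.60):
    """Posterior mean plus central certainty interval for recall (Eqs. 6 and 8).
    `counts` is a list of (tp, fn) pairs, one per simulation run."""
    a = alpha + sum(tp for tp, _ in counts)
    b = beta + sum(fn for _, fn in counts)
    post = stats.beta(a, b)
    return post.mean(), post.interval(level)  # interval() inverts the Beta CDF at 0.2 and 0.8

# Hypothetical per-run results
counts = [(40, 10), (35, 15), (42, 8)]
recalls = [tp / (tp + fn) for tp, fn in counts]
print("frequentist:", frequentist_forecast(recalls))
print("bayesian:   ", bayesian_forecast(counts))
```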

We compute the 60% certainty interval for the Bayesian approach using the generated beta distribution's CDF, finding the values where the CDF equals 0.2 and 0.8 to serve as the lower and upper bounds, respectively. In this paper, we assume all prior distributions are uniform, so that α = β = 1 for all forecasts. The uniform prior reflects our naivety in that we have no prior experience for where recall, precision, or false positive rate lie.

3.5. Forecast Scoring Metrics

We score forecasts against the available ground truth answers, x_q*, for each of the eighteen forecast questions using three metrics: (1) mean squared error (MSE), (2) certainty interval calibration (CIC), and (3) interval scoring rule (ISR). MSE measures the average error in our forecasts' central tendencies and is defined as the following:

$$ \mathrm{MSE} = \frac{1}{18}\sum_{q=1}^{18}\left(\bar{x}_q - x_q^*\right)^2 \qquad (9) $$

where x̄_q is the mean of the q-th forecast. A smaller MSE is preferable to a larger one, reflecting the desire to mitigate error in the forecast means. CIC measures the proportion of forecasts where the ground truth answer is within the forecast's certainty interval and is defined as the following:

$$ \mathrm{CIC} = \frac{1}{18}\sum_{q=1}^{18} I\!\left(l_q \le x_q^* \le u_q\right) \qquad (10) $$

where l_q and u_q are the lower and upper certainty interval bounds for the q-th forecast, and I(⋅) is an indicator function giving a value of 1 if the argument is true and 0 if the argument is false. Since we compute certainty intervals at 60% confidence, the target CIC score is 0.6, indicating that 60% of our certainty intervals contain the ground truth answer. ISR simultaneously measures certainty interval width and the distance between the ground truth answer and the nearest certainty interval bound should the ground truth answer lie outside the certainty interval:

$$ \mathrm{ISR} = \frac{1}{18}\sum_{q=1}^{18}\left[\left(u_q - l_q\right) + 5\left(l_q - x_q^*\right) I\!\left(x_q^* < l_q\right) + 5\left(x_q^* - u_q\right) I\!\left(x_q^* > u_q\right)\right] \qquad (11) $$

If the ground truth answer lies within the certainty bounds, the interval score for that prediction is simply the interval width. If the answer lies outside the certainty bounds, then the score is the interval width plus five times the difference between the ground truth and the upper or lower bound (whichever is closer). A smaller ISR is preferable to a larger one, reflecting the desire to provide informative forecasts without overly wide intervals.
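The three scoring metrics in Eqs. (9)-(11) reduce to a few lines of NumPy. The sketch below uses hypothetical forecasts and ground truths; it is an illustration of the definitions, not the paper's scoring code.

```python
import numpy as np

def score_forecasts(means, lowers, uppers, truth):
    """Return (MSE, CIC, ISR) for a set of interval forecasts (Eqs. 9-11)."""
    means, lowers, uppers, truth = map(np.asarray, (means, lowers, uppers, truth))

    mse = np.mean((means - truth) ** 2)

    inside = (lowers <= truth) & (truth <= uppers)
    cic = np.mean(inside)  # target is 0.6 for 60% certainty intervals

    penalty = 5 * (lowers - truth) * (truth < lowers) + 5 * (truth - uppers) * (truth > uppers)
    isr = np.mean((uppers - lowers) + penalty)
    return mse, cic, isr

# Hypothetical forecasts for three questions
print(score_forecasts(means=[0.70, 0.20, 0.05],
                      lowers=[0.60, 0.15, 0.03],
                      uppers=[0.80, 0.25, 0.07],
                      truth=[0.65, 0.30, 0.04]))
```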

Table 2. Forecast scores.

  Metric    Frequentist Score    Bayesian Score
  MSE       0.0311               0.0311
  CIC       0.1667               0.0000
  ISR       0.4833               0.5823


Fig. 2. Ground truth answers and forecasted answers to the eighteen performance questions.

4. Results and Discussion

The frequentist and Bayesian forecasts for the eighteen forecast questions are shown in Fig. 2, and the forecast scores are listed in Table 2. From Table 2, we see that the frequentist approach out-scores the Bayesian-generated forecasts in two of the three aggregate scoring metrics: CIC and ISR. The two approaches produce similar MSE scores. The frequentist approach systematically produces forecasts with wider certainty intervals, and thus it is unsurprising that these forecasts score higher in CIC. Additionally, due to the wider intervals, the distance between a frequentist-generated bound and a ground truth answer is smaller even when the corresponding mean is far from the ground truth answer, contributing to a lower ISR. While the Bayesian-generated forecasts have small interval widths, they are penalized in ISR for the greater distance between a bound and the ground truth.

For all forecast questions, the Bayesian-generated forecast certainty intervals are consistently narrower than those generated through the frequentist approach. This observation is unsurprising in the context in which we generate forecasts. The variance of a random variable X distributed with a beta distribution such that X ~ Beta(α, β) is the following:

$$ \mathrm{var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \qquad (12) $$

As α and β increase, the denominator in Eq. 12 increases much more quickly than the numerator; thus the variance decreases. In this paper, we are dealing with datasets with thousands of data points, so the TP, FP, FN, TN values


in the confusion matrix of each simulation run are potentially large numbers. Since the parameters of our posterior beta distributions for recall, precision, and false positive rate use these large confusion matrix elements, the variance of our posterior distributions is small, resulting in narrow certainty intervals.

5. Summary

In this paper, we compared performance forecasts generated using a frequentist inference approach and a Bayesian inference approach on a representative binary classification task. While the two approaches generally produced forecasts that align with each other around their means, the Bayesian approach consistently produced narrower certainty intervals due to the large number of data points used in our comparison study. Therefore, the Bayesian approach arguably produces overly confident forecasts. If the prior distributions allow for a lot of performance uncertainty, the data is the driving element in defining the performance posterior distributions' parameter values, so the priors have little impact on the posteriors. That is, if the parameters of the prior distribution, α and β, are much smaller than the elements of a confusion matrix as in Table 1, the empirically generated elements of the posterior performance distributions drown out the prior beliefs, so that information from the priors has little influence on the posterior distributions.

Bayesian inference is an attractive framework for producing performance forecasts since we can incorporate our beliefs about inference quality. However, there are hurdles to overcome in employing this framework to make reasonable forecasts. Immediate future work includes the following:

• Generalize the differences between the frequentist and Bayesian forecasting approaches beyond the binary classification task presented in this paper.
• Conduct a parameter search to determine whether there exist particular performance prior beliefs that would produce posterior forecasts that outscore the frequentist approach, and investigate how reasonable those prior beliefs are.

Additionally, we seek to extend the Bayesian framework presented in this paper by allowing users to constrain or limit the ability of empirical results to overwhelm the prior distributions, potentially yielding a better balance between the contributions of the priors and the empirical results to the posteriors.

Acknowledgements

Research reported here was supported under IARPA contract 2016-16031400006. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. Government.

References

1. Huang E, Zaidi AK and Laskey KB. Inference Enterprise Multimodeling for Insider Threat Detection Systems. In: Madni AM, Boehm B, Ghanem RG, Erwin D and Wheaton MJ, (eds.). Disciplinary Convergence in Systems Engineering Research. Cham: Springer International Publishing, 2018, p. 175-86.
2. Turcotte MJ, Kent AD and Hash C. Unified Host and Network Data Set. arXiv preprint arXiv:1708.07518. 2017.
3. Efron B. Why isn't everyone a Bayesian? The American Statistician. 1986; 40: 1-5.
4. O'Hagan T. Dicing with the unknown. Significance. 2004; 1: 132-3.
5. Wagenmakers E-J, Lee M, Lodewyckx T and Iverson GJ. Bayesian versus frequentist inference. Bayesian Evaluation of Informative Hypotheses. Springer, 2008, p. 181-207.
6. Bartholomew D and Bassett E. A Comparison of Some Bayesian and Frequentist Inferences. II. Biometrika. 1966; 53: 262-4.
7. Jeffreys H. The Theory of Probability. OUP Oxford, 1998.
8. Tiao GC and Box GE. Some comments on "Bayes" estimators. The American Statistician. 1973; 27: 12-4.
9. Sokolova M and Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management. 2009; 45: 427-37.
10. Goutte C and Gaussier E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. European Conference on Information Retrieval. Springer, 2005, p. 345-59.