
Assessing Probabilistic Inference by Comparing the Generalized Mean of the Model and Source Probabilities

Kenric P. Nelson 1,2

1 Electrical and Computer Engineering Department, Boston University, Boston, MA 02215, USA; [email protected]; Tel.: +1-781-645-8564
2 Raytheon Company, Woburn, MA 01808, USA

Academic Editor: Raúl Alcaraz Martínez
Received: 31 May 2017; Accepted: 12 June 2017; Published: 19 June 2017
Entropy 2017, 19, 286; doi:10.3390/e19060286

Abstract: An approach to the assessment of probabilistic inference is described which quantifies performance on the probability scale. From both information and Bayesian theory, the central tendency of an inference is proven to be the geometric mean of the probabilities reported for the actual outcome and is referred to as the "Accuracy". Upper and lower error bars on the accuracy are provided by the arithmetic mean and the −2/3 mean. The arithmetic mean is called the "Decisiveness" due to its similarity with the cost of a decision, and the −2/3 mean is called the "Robustness" due to its sensitivity to outlier errors. Visualization of inference performance is facilitated by plotting the reported model probabilities versus the histogram-calculated source probabilities. The visualization of the calibration between model and source is summarized on both axes by the arithmetic, geometric, and −2/3 means. From information theory, the performance of the inference is related to the cross-entropy between the model and source distribution. Just as cross-entropy is the sum of the entropy and the divergence, the accuracy of a model can be decomposed into a component due to the source uncertainty and the divergence between the source and model. Translated to the probability domain, these quantities are plotted as the average model probability versus the average source probability. The divergence probability is the average model probability divided by the average source probability. When an inference is over/under-confident, the arithmetic mean of the model increases/decreases, while the −2/3 mean decreases/increases, respectively.

Keywords: probability; inference; information theory; Bayesian; generalized mean

1. Introduction

The challenges of assessing a probabilistic inference have led to the development of scoring rules [1–6], which project forecasts onto a scale that permits arithmetic averages of the score to represent average performance. While any concave function (for positive scores) is permissible as a score, its mean bias must be removed in order to satisfy the requirements of a proper score [4,6]. Two widely used, but very different, proper scores are the logarithmic average and the mean-square average. The logarithmic average, based upon the Shannon information metric, is also a local scoring rule, meaning that only the forecast of the actual event needs to be assessed. As such, no compensation for mean bias is required for the logarithmic scoring rule. In contrast, the mean-square average is based on a Euclidean distance metric, and its local measure is not proper. The mean-square average is made proper by including the distance between non-events (i.e., p = 0) and the forecast of the non-event. In this paper, a misconception about the average probability is highlighted to demonstrate a simpler, more intuitive, and theoretically stronger approach to the assessment of probabilistic forecasts. Intuitively, one is seeking the average performance of the probabilistic inference, which may be a human forecast, a machine algorithm, or a combination.
Unfortunately, the "average" has traditionally been assumed to be the arithmetic mean, which is incorrect for probabilities. Probabilities, as normalized ratios, must be averaged using the geometric mean, which is reviewed by Fleming and Wallace [7] and was discussed as early as 1879 by McAlister [8]. In fact, the geometric mean can be expressed as the log-average:

$$ P_{avg} \equiv \exp\left(\sum_{i=1}^{N} w_i \ln p_i\right) = \prod_{i=1}^{N} p_i^{w_i}, \qquad (1) $$

where $p = \{p_i : i = 1, \ldots, N\}$ is the set of probabilities and $w = \{w_i : i = 1, \ldots, N\}$ is the set of weights formed by such factors as the frequency and utility of event $i$. The connection with entropy and information theory [9–12] is established by considering a distribution in which the weights are also the probabilities ($w_i = p_i$) and the distribution of probabilities sums to one. In this case, the probability average is the transformation of the entropy function. For assessment of a probabilistic inference, the set of probabilities is specified as $p = \{p_{ij} : i = 1, \ldots, N;\ j = 1, \ldots, M\}$, with $i$ representing the samples, $j$ representing the classes, and $j^*$ specifying the labeled true event which occurred. Assuming $N$ independent trials of equal weight $w_i = 1/N$, the average probability (or forecast) of a probabilistic inference is then:

$$ P_{avg} \equiv \prod_{i=1}^{N} p_{ij^*}^{1/N}. \qquad (2) $$

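As a concrete illustration of Equation (2), the following sketch (not part of the original paper; it assumes NumPy and a hypothetical list of reported probabilities) computes the average probability through the log-average form, which avoids numerical underflow when N is large.

```python
import numpy as np

def average_probability(p_true):
    """Geometric mean of the probabilities reported for the true outcomes, Eq. (2).

    Evaluated as the exponential of the mean log-probability (the log-average of
    Eq. (1) with equal weights w_i = 1/N), which is robust to underflow for large N.
    """
    p_true = np.asarray(p_true, dtype=float)
    return np.exp(np.mean(np.log(p_true)))

# Hypothetical probabilities reported for the events that actually occurred
reported = [0.9, 0.7, 0.85, 0.6, 0.95]
print(average_probability(reported))   # ~0.79, versus an arithmetic mean of 0.80
```
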
A proof using Bayesian reasoning that the geometric mean represents the average probability will be completed in Section 2. In Section 3 the analysis will be extended to the generalized mean to provide assessment of the fluctuations in a probabilistic forecast. Section 4 develops a method of analyzing inference algorithms by plotting the average model probability versus the average probability of the data source. This provides quantitative visualization of the relationship between the performance of the inference algorithm and the underlying uncertainty of the data source used for testing. In Section 5 the effects of model errors in the mean, variance, and tail decay of a two-class discrimination problem are demonstrated. Section 6 shows how the divergence probability can be visualized and its role as a figure of merit for the level of confidence in accurately forecasting uncertainty.

2. Proof and Properties of the Geometric Mean as the Average Probability

The introduction showed the origin of the geometric mean of the probabilities as the translation of the entropy function and logarithmic score back to the probability domain. The geometric mean can also be shown to be the average probability from the perspective of averaging the Bayesian [10] total probability of multiple decisions and measurements.

Theorem 1. Given a definition of the average as the normalization of a total formed from independent samples, the average probability is the geometric mean of independent probabilities, since the total is the product of the sample probabilities.

Proof. Given $M$ class hypotheses and a measurement $y$ which is used to assign a probability to each class, the Bayesian assignment for each class is:

$$ p(j \mid y) = \frac{p(y \mid j)\, p_j}{\sum_{j=1}^{M} p(y \mid j)\, p_j}, \qquad (3) $$

where $p_j$ is the prior probability of hypothesis $j$. If $N$ independent measurements $\{y_1, \ldots, y_i, \ldots, y_N\}$ are taken and a probability is assigned to each of the $M$ hypotheses for every measurement, then the probability of the $j$th hypothesis given all the measurements $\mathbf{y} \equiv \{y_1, \ldots, y_i, \ldots, y_N\}$ is:

$$ p_j(\mathbf{y}) = \prod_{i=1}^{N} p_j(y_i), \qquad (4) $$

since each measurement is independent. Note, this is not the posterior probability of multiple decisions, but the total of $N$ probabilities for one decision. Given a definition of the average as the normalization of the total, the normalization of Equation (4) must be the $N$th root, since the total is a product of the individual samples. Thus the average probability for the $j$th hypothesis is the geometric mean of the individual probabilities:

$$ \bar{p}_j(\mathbf{y}) = \prod_{i=1}^{N} p_j(y_i)^{1/N}. \qquad (5) $$

Remark 1. The arithmetic mean is not a "robust statistic" [13] due to its sensitivity to outliers. The median is often used as a robust average instead of the arithmetic mean. The geometric mean is even more sensitive to outliers: a probability of zero for one sample will cause the average for all the samples to be zero. However, in evaluating the performance of an inference engine, the robustness of the system may be of higher importance. Thus, in using the geometric mean as a measure of performance, it is necessary to determine a precision for the system and limit probabilities to be greater than this precision, and to ensure that the system uses methods which are not over-confident in the tails of the distribution. Section 3 explores this further by defining upper and lower error bars on the average.

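The sensitivity described in Remark 1, and the effect of imposing a precision limit, can be demonstrated with a short sketch (illustrative only; NumPy is assumed, and the clip value q_lim = 1e-3 is an arbitrary choice rather than one prescribed by the paper):

```python
import numpy as np

def geometric_mean(p):
    # Log-average form of the geometric mean
    return np.exp(np.mean(np.log(p)))

reports = np.array([0.9, 0.8, 0.85, 0.0, 0.95])   # one over-confident report of zero

# Without a precision limit, a single zero drives the average toward zero.
print(geometric_mean(np.clip(reports, 1e-300, 1.0)))        # effectively 0

# Limiting probabilities to q_lim <= q <= 1 - q_lim keeps the measure finite
# while still penalizing the outlier heavily.
q_lim = 1e-3
print(geometric_mean(np.clip(reports, q_lim, 1.0 - q_lim))) # ~0.23
```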





The average probability of an inference can be divided into two components, in a similar way that cross-entropy has two components, the entropy and the divergence. Again, these two sources of uncertainty will be defined in the probability space rather than the entropy space. Consider first the comparison of two probability distributions, which are labeled the source and the model (or quoted) distribution ($p \equiv p_{source}$, $q \equiv q_{model}$). The labels anticipate the relationship between a source distribution used to generate feature data and a model distribution which uses feature data to determine a probability inference. From information theory we know that the cross-entropy between two distributions is the sum of the entropy plus the divergence:

$$ H(p, q) = H(p) + D(p \| q) = -\sum_{i=1}^{N} p_i \ln p_i - \sum_{i=1}^{N} p_i \ln\frac{q_i}{p_i} = -\sum_{i=1}^{N} p_i \ln q_i. \qquad (6) $$

The additive relationship transforms in the probability domain to a multiplicative relationship:

$$ P_{CE}(p, q) \equiv \exp(-H(p, q)) = \prod_{i=1}^{N} p_i^{p_i} \prod_{i=1}^{N} \left(\frac{q_i}{p_i}\right)^{p_i} = \prod_{i=1}^{N} q_i^{p_i} = P_{Ent}(p)\, P_{Div}(p \| q), \qquad (7) $$

where $P_{CE}$ is the cross-entropy probability, $P_{Ent}$ is the entropy probability, and $P_{Div}$ is the divergence probability. The divergence probability is the geometric mean of the ratio of the two distributions:

$$ P_{Div}(p \| q) = \prod_{i=1}^{N} \left(\frac{q_i}{p_i}\right)^{p_i}. \qquad (8) $$

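The probability-domain quantities in Equations (7) and (8) are straightforward to compute for discrete distributions. The sketch below (not from the paper; the two three-state distributions are hypothetical examples) verifies the multiplicative decomposition P_CE = P_Ent P_Div.

```python
import numpy as np

def entropy_probability(p):
    """P_Ent(p) = prod p_i^{p_i} = exp(-H(p))."""
    return np.prod(p ** p)

def divergence_probability(p, q):
    """P_Div(p||q) = prod (q_i/p_i)^{p_i} = exp(-D(p||q)), Eq. (8)."""
    return np.prod((q / p) ** p)

def cross_entropy_probability(p, q):
    """P_CE(p, q) = prod q_i^{p_i} = exp(-H(p, q)), Eq. (7)."""
    return np.prod(q ** p)

# Hypothetical source and model distributions over three states
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(cross_entropy_probability(p, q))                          # ~0.348
print(entropy_probability(p) * divergence_probability(p, q))    # same value, Eq. (7)
```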

The assessment of a forecast (which will be referred to as the model probability) can be decomposed into a component regarding the probability of the actual events (which will be referred to as the source probability) and the discrepancy between the forecast and event probabilities (which will be referred to as the divergence probability) for the true class events $ij^*$:

$$ \bar{q} = P_{Model}(p_{j^*}, q_{j^*}) = P_{Source}(p_{j^*})\, P_{Div}(p_{j^*} \| q_{j^*}) = \prod_{i=1}^{N} p_{ij^*}^{1/N} \prod_{i=1}^{N} \left(\frac{q_{ij^*}}{p_{ij^*}}\right)^{1/N}. \qquad (9) $$

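A small sketch of the decomposition in Equation (9) (illustrative; the per-sample source probabilities p and model reports q for the true class are hypothetical numbers, with p in practice estimated as described below):

```python
import numpy as np

def geo_mean(x):
    return np.exp(np.mean(np.log(x)))

# Hypothetical per-sample probabilities for the true class of each event
p_true = np.array([0.80, 0.65, 0.70, 0.55, 0.90])   # source probabilities
q_true = np.array([0.90, 0.60, 0.80, 0.40, 0.95])   # model (reported) probabilities

P_source = geo_mean(p_true)
P_div    = geo_mean(q_true / p_true)
P_model  = geo_mean(q_true)

# Eq. (9): the model accuracy factors into the source accuracy times the divergence probability.
print(P_model, P_source * P_div)   # both ~0.70
```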

A variety of methods can be used to estimate the source probabilities $p_{j^*} = \{p_{ij^*} : i = 1, \ldots, N,\ j^*\ \text{true class}\}$. Histogram analysis will be used here by binning the nearest neighbors' model forecasts $q$, though this can be refined by methods such as kernel density estimation.

3. Bounding the Average Probability Using the Generalized Mean

An average is necessarily a filter of the information in a collection of samples intended to represent an aspect of the central tendency. A secondary requirement is to estimate the fluctuations around the average. If entropy defines the average uncertainty, then one approach to bounding the average is to consider the standard deviation of the logarithm of the probabilities. Following the procedure of translating entropy analysis to a probability using the exponential function, the fluctuations about the average probability could be defined as:

$$ p_\sigma(p) \equiv \exp\left(\left(\frac{1}{N}\sum_{i=1}^{N} (\ln p_i)^2 - \left(\frac{1}{N}\sum_{i=1}^{N} \ln p_i\right)^2\right)^{\frac{1}{2}}\right). \qquad (10) $$

Unfortunately, this function does not simplify into a function of the probabilities without the use of logarithms and exponentials, defeating the purpose of simplifying the analysis. A more important concern is that the lower error bound on the entropy (the average of the log-probability (entropy) minus the standard deviation of the log-probabilities) can be less than zero. This is equivalent to $P(p)/p_\sigma(p) > 1$ and thus does not provide a meaningful measure of the variations in the probabilities. An alternative for establishing an error bar on the average probability can be derived from the research on generalized probabilities. The Rényi [14] and Tsallis [15] entropies have been shown to utilize the generalized mean of the probabilities of a distribution [16,17]. The Rényi entropy translates this average to the entropy scale using the natural logarithm, and the Tsallis entropy uses a deformed logarithm. Recently, the power of the generalized mean m (the symbol m is used instead of p to distinguish it from probabilities) has been shown to be a function of the degree of nonlinear statistical coupling κ between the states of the distribution [18]. The coupling defines a dependence between the states of the distribution and is the inverse of the degree of freedom ν, more traditionally used in statistical analysis. For symmetric distributions with constraints on the mean and scale, the Student's t distribution is the maximum entropy distribution, and is here referred to as a Coupled Gaussian using κ = 1/ν. The exponential distributions, such as the Gaussian, are deformed into heavy-tail or compact-support distributions. The relationship between the generalized mean power, the coupling, and the Tsallis/Rényi parameter q is:

$$ m = \frac{2\kappa}{1 + d\kappa} = q - 1, \qquad (11) $$

where d is the dimension of the distribution and will be considered one for this discussion; the numeral 2 can be a real number for generalization of the Lévy distributions, and 1 for generalization of the exponential distribution. For simplicity, the approach will be developed from the Rényi entropy, but for a complete discussion see [18]. Using the power parameter m and separating out the weight $w_i$, which is $p_i$ for the entropy of a distribution and $1/N$ for a scoring rule, the Rényi entropy is:

$$ S_R(\mathbf{w}, \mathbf{p}) = -\frac{1}{m} \ln\left(\sum_{i=1}^{N} w_i p_i^m\right). \qquad (12) $$

Translating this to the probability domain using $\exp(-S_R)$ and setting $w_i = 1/N$ results in the $m$th weighted mean of the probabilities:

$$ P_{avg}^{(m)} = \exp\left(\frac{1}{m}\ln\left(\frac{1}{N}\sum_{i=1}^{N} p_i^m\right)\right) = \left(\frac{1}{N}\sum_{i=1}^{N} p_i^m\right)^{\frac{1}{m}}. \qquad (13) $$

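A sketch of Equation (13) with equal weights (not from the paper; m = 0 is handled as the geometric-mean limit) shows how the Decisiveness, Accuracy, and Robustness metrics used below follow from a single function:

```python
import numpy as np

def generalized_mean(p, m):
    """Weighted generalized mean of Eq. (13) with equal weights w_i = 1/N.

    m = 0 is treated as the limiting case, which is the geometric mean.
    """
    p = np.asarray(p, dtype=float)
    if m == 0:
        return np.exp(np.mean(np.log(p)))
    return np.mean(p ** m) ** (1.0 / m)

# Hypothetical probabilities reported for the true class
reports = [0.9, 0.7, 0.85, 0.3, 0.95]

decisiveness = generalized_mean(reports, 1.0)    # arithmetic mean, upper error bar
accuracy     = generalized_mean(reports, 0.0)    # geometric mean, central tendency
robustness   = generalized_mean(reports, -2/3)   # -2/3 mean, lower error bar
print(decisiveness, accuracy, robustness)        # decisiveness >= accuracy >= robustness
```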

An example of how the generalized mean can be used to assess the performance of an inference model is illustrated in Figure 1. In this example two 10-dimensional Gaussian distributions are sources for a set of random variables. The distributions have equal variance of 1 along each dimension, no correlation, and are separated by a mean of one along each dimension. Three models of the uncertainty are compared. For each model, 25 samples are used to estimate the two means and single variance. The models vary the decay of the tail of a Coupled Gaussian:

$$ G_\kappa(\mathbf{x}; \boldsymbol{\mu}, \Sigma) \equiv \frac{1}{Z_\kappa(\Sigma)}\left(1 + \kappa\, (\mathbf{x} - \boldsymbol{\mu}) \cdot \Sigma^{-1} \cdot (\mathbf{x} - \boldsymbol{\mu})\right)_+^{-\frac{1}{2}\left(\frac{1}{\kappa} + d\right)}, \qquad (14) $$

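A one-dimensional sketch of the Coupled-Gaussian shape in Equation (14) (illustrative; the normalization Z_kappa is omitted, so only the unnormalized kernel is shown):

```python
import numpy as np

def coupled_gaussian_kernel(x, mu=0.0, sigma=1.0, kappa=0.0):
    """Unnormalized 1-D Coupled Gaussian of Eq. (14) with d = 1; Z_kappa is omitted.

    kappa > 0: heavy tails, kappa < 0: compact support, kappa -> 0: Gaussian kernel.
    """
    z = (x - mu) ** 2 / sigma ** 2
    if kappa == 0.0:
        return np.exp(-0.5 * z)
    base = np.maximum(0.0, 1.0 + kappa * z)       # the (.)_+ operator
    return base ** (-0.5 * (1.0 / kappa + 1.0))   # exponent -(1/kappa + d)/2 with d = 1

x = np.linspace(-5, 5, 11)
print(coupled_gaussian_kernel(x, kappa=0.5))      # heavy-tail example
print(coupled_gaussian_kernel(x, kappa=-0.095))   # compact support: exactly zero in the far tails
```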

where µ and Σ are the location vector and scale matrix, κ is the degree of nonlinear coupling controlling the tail decay, d is the number of dimensions, $Z_\kappa$ is the normalization, and $(a)_+ = \max(0, a)$. In Figure 1a, the decay is Gaussian (κ = 0). As more dimensions are modeled, the accuracy of the classification performance (bar) and the accuracy of the reported probabilities (gray square at 0.63) measured by the geometric mean (m = 0) reach their maximum at six dimensions.

Figure 1. Source of random variables are two independent 10-dimensional Gaussians. The model estimates the mean and variance of each dimension using 25 samples. The mean and variance are used as the location and scale for three Coupled-Gaussian models, (a) κ = 0 Gaussian, (b) κ = 0.162 Heavy-tail, (c) κ = −0.095 Compact-support.

For eight and 10 dimensions, the classification does not improve and the probability accuracy degrades. This is because the estimation errors in the mean and variance eventually result in an over-fit model. The decisiveness (upper gray error bar) approximates the percentage of correct classification (light brown bar) and the robustness (lower gray error bar) goes to zero. In contrast, in (b) the heavy-tail model with κ = 0.162 is able to maintain an accuracy of 0.69 for dimensions 6, 8 and 10.


The heavy-tail makes the model less susceptible to over-fitting. For (c) the compact-support model (κ = −0.095), even the 2-dimensional model has inaccurate probabilities (0), because a single probability report of zero causes the geometric mean to be zero.

4. Comparing Probability Inference Performance with the Data Distribution

To better understand the sources of inaccuracy in a model, Equation (9) can be used to separate the error into the "source probability" of the underlying data source and the "divergence probability" between the model and the data source. Doing so requires an estimate of the source distribution of the data for a reported probability. This is accomplished using histogram analysis of the reported probabilities, though improved estimates are possible utilizing, for example, Bayesian priors for the bins, kernel methods, and k-nearest neighbor. In addition to the splitting of the components of the standard entropy measures, the generalized entropy equivalents using the generalized mean are also split into source and divergence components as described below. The histogram results and the overall metrics (arithmetic, geometric, and −2/3 mean) are visualized in a plot of the model probability versus the source probabilities.

Figure 2 shows: (a) the parameters, (b) the likelihoods, (c) the posterior probabilities, and (d) the receiver operating characteristic (ROC) curve of a perfectly matched model for two classes with one-dimensional Coupled Gaussian distributions. The distribution data source in Figure 2c is shown as solid green and orange curves. Overlaying the source distributions are the model distributions, shown as dashed curves. The heavy-tail decay (κ = 0.5) has the effect of keeping the posterior probabilities from saturating away from the mean shown in Figure 2c; instead, the posteriors return gradually to the prior distribution. The actual ROC curve in Figure 2d uses the model probabilities to classify and the source probabilities to determine the distribution. The predicted ROC curve uses the model probabilities for both the classification and calculation of the distribution.

Figure 2. A two-class classification example with single-dimension Coupled Gaussian distributions. (a) The parameters for the orange class source and model are shown; (b) The likelihood distributions are shown. The solid (source) and dashed (model) overlap in this case, representing a perfectly matched model; (c) The posterior probabilities assuming equal priors for each class; (d) The receiver operating characteristic curve (solid) and a "predicted" ROC curve (dashed), which is determined using only the model distributions.


While the ROC curve is an important metric for assessing probabilistic inference algorithms, it does not fully capture the relationship between a model and the distribution of the test set. Figure 3 introduces a new quantitative visualization of the performance of a probabilistic inference. The approach builds upon histogram plots which are used to calibrate probabilities, shown as green and orange bubbles for each class. The histogram information is quantified into three metrics: the arithmetic mean (called Decisiveness), the geometric mean (called Accuracy), and the −2/3 mean (called Robustness), shown as brown marks. In the case of a model perfectly matched with the source, all the histogram bubbles and each of the metrics align along the 45° diagonal representing equality between the model probabilities and the source probabilities. In Section 5 illustrations will be provided which show how these metrics can reveal the gap in performance of a model, whether the model is under- or over-confident, and the limitations to discrimination of the available information.

Because the goal is to maximize the overall performance of the algorithm, the model probability is shown on the y-axis. The distribution of the source or test data is treated as the independent variable on the x-axis. This orientation is the reverse of common practice [19,20], which bins the reported probabilities and shows the distribution vertically; however, the adjustment is helpful for focusing on increasing model performance. In Figure 3 the orange and green bubbles are individual histogram data. Neighboring model probabilities for the green or orange class are grouped to form a bin. The y-axis location is the geometric mean of the model probabilities. The x-axis location is the ratio of the samples with the matching green or orange class divided by the total number of samples.

Figure 3. The average model probability versus the average source probability for the model/source of Figure 1. The orange and green bubbles represent individual bins, with bins grouped by the model probability and the ratio of class to total distribution represented by the source probability. The brown marks are the overall averages (arithmetic, geometric, and −2/3). For a perfect model all are on the identity.

The process for creating the model versus source probability plot, given a test set of reported or model probabilities Q = {q_ij : i = 1, ..., N, j = 1, ..., M}, where i is the event number, j is the class label, and q_{ij*} denotes the probability reported for the labeled true event, is as follows (a code sketch of the full procedure is given after the list). There are some distinctions between the defined process and that used to create the simulated data in Figure 3 through Figure 6 which will be noted.

(1) Sort data: The reported probabilities are sorted. As shown the sorting is separated by class, but could also be for all the classes. A precision limit q_lim is defined such that all the model probabilities satisfy q_lim ≤ q ≤ 1 − q_lim.



(2) Bin data: Bins with an equal amount of data are created to ensure that each bin has adequate data for the analysis. Alternatively, cluster analysis could be completed to ensure that each bin has elements with similar probabilities.

(3) Bin analysis: For each bin k the average model probability, the source probability, and the contribution to the overall accuracy are computed.

(a) Model probability (y-axis): Compute the arithmetic, geometric, and −2/3 mean of the model probabilities for the actual event:

$$
Q_k^{(m)} = \begin{cases} \left(\prod_{i=1}^{N_k} q_{ij^* k}\right)^{1/N_k}, & m = 0 \\ \left(\dfrac{1}{N_k}\sum_{i=1}^{N_k} q_{ij^* k}^{\,m}\right)^{1/m}, & m = 1, -2/3 \end{cases} \tag{15}
$$

where m is the power of the weighted generalized mean.

(b) Source probability (x-axis): Compute the frequency of true events in each bin:

$$
P_k = \frac{\sum_{j=1}^{M} N_{jk}^{*}}{\sum_{j=1}^{M} N_{jk}}, \tag{16}
$$

where N_{jk} and N*_{jk} are the number of class-j elements in bin k and the number of true class-j elements in bin k, respectively. If a source probability is calculated for each sample using k-nearest neighbor or kernel methods, then the arithmetic, geometric, and −2/3 mean can also be calculated for the source probabilities.

(c) Contribution to overall accuracy (size): Figure 3 through Figure 6 show bubble sizes proportional to the number of true events; however, a more useful visualization is the contribution each bin makes to the overall accuracy. This is determined by weighting the model average by the source average. Taking the logarithm of this quantity makes a linear scale for the bubble size:

$$
b_k = \frac{\sum_{j=1}^{M} N_{jk}^{*}}{\sum_{k=1}^{B}\sum_{j=1}^{M} N_{jk}}\, \ln\!\left(\prod_{i=1}^{N_{j^* k}} q_{ij^* k}\right)^{1/N_{j^* k}} \tag{17}
$$

(4) Overall Statistics: The overall statistics are determined by the generalized mean of the individual bin statistics.

(a) Overall Model Statistics:

$$
\left(\frac{1}{B}\sum_{k=1}^{B}\frac{1}{N_k}\sum_{i=1}^{N_k} q_{ij^* k}^{\,m}\right)^{1/m}, \tag{18}
$$

where B is the number of bins.

(b) Overall Source Statistics:

$$
\left(\frac{1}{B}\sum_{k=1}^{B}\left(\frac{\sum_{j=1}^{M} N_{jk}^{*}}{\sum_{j=1}^{M} N_{jk}}\right)^{m}\right)^{1/m} \tag{19}
$$
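A minimal sketch of steps (1)–(4) in Python, assuming a per-class calibration view in which each sample carries the probability the model reported for the class under analysis together with an indicator of whether that class is the true label. The names gen_mean, model_vs_source, and POWERS are illustrative rather than taken from the paper, the bubble weighting of Equation (17) is simplified to the fraction of samples per bin, and the source frequencies are clipped at q_lim so the negative-power mean stays finite:

```python
import numpy as np

# Powers of the generalized mean used throughout the paper.
POWERS = {"decisiveness": 1.0, "accuracy": 0.0, "robustness": -2.0 / 3.0}

def gen_mean(values, m):
    """Generalized mean with power m; m = 0 is the geometric-mean limit."""
    v = np.asarray(values, dtype=float)
    if m == 0:
        return float(np.exp(np.mean(np.log(v))))
    return float(np.mean(v ** m) ** (1.0 / m))

def model_vs_source(q, is_true_class, n_bins=10, q_lim=0.01):
    """Per-bin and overall statistics for one class of the model vs. source plot.

    q             : probability the model reported for this class, one value per sample
    is_true_class : 1 if the sample's true label is this class, else 0
    """
    q = np.clip(np.asarray(q, dtype=float), q_lim, 1.0 - q_lim)   # step (1): precision limit
    y = np.asarray(is_true_class, dtype=float)
    order = np.argsort(q)                                         # step (1): sort reported probabilities
    q_bins = np.array_split(q[order], n_bins)                     # step (2): equal-count bins
    y_bins = np.array_split(y[order], n_bins)

    per_bin = []
    for q_b, y_b in zip(q_bins, y_bins):                          # step (3): bin analysis
        model_means = {name: gen_mean(q_b, m) for name, m in POWERS.items()}   # Eq. (15)
        source_freq = float(np.mean(y_b))                         # Eq. (16): fraction of true events
        weight = len(q_b) / len(q)                                # simplified stand-in for Eq. (17)
        per_bin.append({"source": source_freq, "model": model_means, "weight": weight})

    # Step (4): overall statistics as the generalized mean of the per-bin statistics.
    overall_model = {name: gen_mean([gen_mean(q_b, m) for q_b in q_bins], m)
                     for name, m in POWERS.items()}                # Eq. (18)
    # Clip bin frequencies at q_lim so a bin with no true events stays finite for m <= 0.
    source_freqs = [max(float(np.mean(y_b)), q_lim) for y_b in y_bins]
    overall_source = {name: gen_mean(source_freqs, m) for name, m in POWERS.items()}  # Eq. (19)
    return per_bin, overall_model, overall_source

# Example with synthetic two-class data (1000 samples).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
q_class1 = np.clip(0.5 + 0.3 * (labels - 0.5) + 0.15 * rng.standard_normal(1000), 0.01, 0.99)
bins, model_stats, source_stats = model_vs_source(q_class1, labels == 1)
print(model_stats, source_stats)
```

Equal-count bins (step 2) keep each source-frequency estimate supported by the same number of samples; clustering of similar probabilities could be substituted, as noted in step (2).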

In Figure 3 through Figure 6 the brown marks are located at the (x, y) point whose coordinates are the source and model means, respectively, for the three metrics: Decisiveness (arithmetic, m = 1), Accuracy (geometric, m = 0), and Robustness (m = −2/3).

5. Effect of Model Errors in Mean, Variance, Shape, and Amplitude

The three metrics (Decisiveness, Accuracy, and Robustness) provide insight about the discrepancy between an inference model and the source data. The accuracy (geometric mean) always degrades due to errors, while the changes in the decisiveness and robustness provide an indication of whether an algorithm is under- or over-confident. The discrepancy in accuracy will be quantified as the "probability divergence" using Equation (8) and discussed more thoroughly in Section 6. The effect of four errors (mean, variance, shape, and amplitude) on one of the two class models will be examined.

In Figure 4 the mean of the model (a) is shifted from 2 to 1 for the orange class. The closer spacing of the hypotheses shown in (b) results in under-confident posterior probabilities shown in (c), where the dashed curves of the model are closer to 0.5 than the solid source curves. The actual ROC curve shown as a solid curve in (d) is better than the forecasted ROC curve (dashed). Each histogram bin for the analysis is a horizontal slice of the neighboring probabilities in Figure 4c. A source of error in estimating the source probabilities is that model probabilities are drawn from multiple features; two one-dimensional single-modal distributions are used here for demonstration. For more complex distributions with many modes and/or many dimensions the analysis would have to be segmented by feature regions.

Figure 4. (a,b) The orange model (dashed line) has a mean of 1 rather than the source mean of 2. This results in (c) under-confident posterior probabilities and a shift in the maximum posterior decision boundary. (d) While the ROC curve and Maximum Posterior classification performance are only modified slightly from the performance of the perfect model, the predicted ROC curve shows a significant discrepancy.


Figure 5a shows how the under-confident model affects the model probability versus source probability plot. The green and orange class bubbles separate. For source probabilities greater/less than 0.5 the model probabilities tend to be lower/higher, reflecting the higher uncertainty reported by the model. The large brown mark, which represents the average probability reported by the model versus the average probability of the source distribution, is below the equal probability dashed line due to the error in the model. The decisiveness and robustness metrics (smaller brown marks) have an orientation with the x-axis which is less than 45°, indicative of the model being less decisive but more robust (i.e., smaller variations in the reported uncertainty). The decisive metric, which is the arithmetic mean, has a larger average, while the robust metric, which is the −2/3 mean, has a smaller average. Figure 5b shows the contrasting result when the orange class model has an over-confident mean shifted from 2 to 3. The orange and green class bubbles now have average model probabilities which are more extreme than the source probabilities. The overall reported probability still decreases relative to a perfect model (large brown mark), but the orientation of the decisive and robust metrics is now greater than 45°.

Figure 5. The model probability versus source probability for (a) the under-confident mean model pictured in Figure 4 and (b) an over-confident mean model. The green and orange class bubbles separate along with the three averages for each class. The large brown mark is the overall accuracy which is now below the dashed line for both the under- and over-confident models. For (a) the alignment of the three overall metrics (brown) is less than 45° indicating under-confidence. For (b) the alignment of the three overall metrics is greater than 45° indicating over-confidence.

Other model errors have a similar effect in lowering the model average below equality with the source average, and shifting the orientation of the decisiveness and robustness to be greater or less than 45°. The full spectrum of the generalized mean may provide approaches for more detailed analysis of an algorithm's performance, but this will be deferred. In Figure 6 the effects of under- and over-confident variance and tail decay are shown. When the model variance is inaccurate (a,b) the changes to the three metrics are similar to the effect of an inaccurate mean. The model accuracy decreases and the angle of orientation for the decisive/robust metrics increases or decreases for over- or under-confidence. When the tail decay is over-confident (c) the probabilities in the middle of the distribution are accurate, but the tail inaccuracy can significantly increase the angle of orientation of the decisive/robust metrics. When the tail decay is under-confident (d) the probabilities are truncated near the extremes of 0 and 1. An under-confident model of the tails can be beneficial in providing robustness against other errors.
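A small numeric illustration of this diagnostic (the probability values below are hypothetical, not taken from the paper's simulations): over-confident reporting widens the spread between the arithmetic and −2/3 means of the true-event probabilities, while under-confident reporting narrows it.

```python
import numpy as np

def gen_mean(p, m):
    """Generalized mean; m = 0 gives the geometric mean."""
    p = np.asarray(p, dtype=float)
    return float(np.exp(np.mean(np.log(p))) if m == 0 else np.mean(p ** m) ** (1.0 / m))

cases = {
    "under-confident": [0.55, 0.60, 0.62, 0.58, 0.57],   # hedged toward 0.5
    "over-confident":  [0.90, 0.95, 0.98, 0.20, 0.85],   # extreme values, one large miss
}
for name, probs in cases.items():
    print(name,
          "decisiveness=%.3f" % gen_mean(probs, 1),
          "accuracy=%.3f" % gen_mean(probs, 0),
          "robustness=%.3f" % gen_mean(probs, -2 / 3))
```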


[Figure 6 plots: Model vs. Source Probability for Classes A and B; panel annotations show source/model parameter pairs (1.0, 0.5), (1.0, 2.0), (0.5, 0.0) over-confident, and (0.5, 2.0) tail truncated / prevents over-confidence; axes are Source Probability (x) and Model Probability (y).]

Figure 6. The model versus source average probability when the orange class variance model is over or under confident. (a) The orange model has a variance of 0.5 rather than the source variance of 1.0. The model accuracy (large brown mark) is lower than the source accuracy and the orientation of the decisive/robust metrics (smaller brown marks) is greater than 45°; (b) When the variance is over-confident (2.0) the accuracy also decreases and the decisive/robust orientation is less than 45°; (c) The coupling or tail decay of the model is over-confident. κ = 0 is a Gaussian distribution. Only the probabilities near 0 or 1 are inaccurate, but this significantly increases the orientation of the decisive/robust metrics angle above 45°; (d) When the tail model is under-confident reported probabilities near 0 or 1 are truncated. This can provide robustness against other sources of errors.

6. The Divergence Probability as a Figure of Merit

The accuracy of a probabilistic inference, measured by the source probability as described in the previous sections, is affected by two sources of error: the uncertainty in the distribution of the source data and the divergence between the model and source distributions. The divergence can be isolated by dividing out the probability associated with the source from the model probability. This is a rearrangement of Equation (9), P_Div(p_{j*} ‖ q_{j*}) = P_Model(p_{j*}, q_{j*}) / P_Source(p_{j*}), where p_{j*} is the vector of true state reported probabilities and q_{j*} is the vector of estimated source probabilities. The divergence probability can be utilized as a figure of merit regarding the degree of confidence in an algorithm's ability to report accurate probabilities. The source probability reflects the uncertainty in the data source based on issues such as the discrimination power of the features.

Figure 7 shows a geometric visualization of the divergence probability for the example when the tail is over-confident (model κ = 0 and source κ = 0.5). The model probability corresponds to the y-axis location of the algorithm accuracy shown as a brown mark. The source probability is the x-axis location of the accuracy. The ratio of these two quantities (the divergence probability) is equivalent to projecting the y-axis onto the point where the x-axis is length 1. The resulting quantities are P_Model(p_{j*}, q_{j*}) = 0.56, P_Source(p_{j*}) = 0.62, and P_Div(p_{j*} ‖ q_{j*}) = P_Model(p_{j*}, q_{j*}) / P_Source(p_{j*}) = 0.90.
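A quick check of these numbers (the values are read from Figure 7; the variable names are illustrative):

```python
p_model = 0.56    # geometric mean of the model probabilities for the true events
p_source = 0.62   # geometric mean of the source probabilities for the true events
p_div = p_model / p_source
print(f"P_Div = {p_div:.2f}")                         # ~0.90, as reported in the text
print(f"P_Source * P_Div = {p_source * p_div:.2f}")   # recovers P_Model = 0.56
```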

Figure 7. The accuracy of an inference algorithm measured by the model probability (y-axis) consists of two components, the source probability (x-axis) and the divergence probability (projection of the cross-entropy probability onto the 0 to 1 scale).

7. Discussion

The role and importance of the geometric mean in defining the average probabilistic inference has been shown through a derivation from information theory, a proof using Bayesian analysis, and via examples of a simple two-class inference problem. The geometric mean of a set of probabilities can be understood as the central tendency because probabilities, as ratios, reside on a multiplicative rather than an additive scale.

Two issues have limited the role of the geometric mean as an assessment of the performance of probabilistic forecasting. Firstly, information theoretic results have been framed in terms of the arithmetic mean of the logarithm of the probabilities. While optimizing the logarithmic score is mathematically equivalent to optimizing the geometric mean of the probabilities, use of the logarithmic scale obscures the connection with the original set of probabilities. Secondly, because the log score and the geometric mean are very sensitive averages, algorithms which do not properly account for the role of tail decay in achieving accurate models suffer severely against these metrics. Thus metrics, such as the quadratic mean, which keep the score for a forecast of p = 0 finite and are made proper by subtracting the bias, have been attractive alternatives.
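To make the equivalence noted above explicit, exponentiating the mean log-score of the true-event probabilities recovers their geometric mean, so the two objectives have the same maximizer:

$$
\exp\!\left(\frac{1}{N}\sum_{i=1}^{N}\ln p_i\right)=\left(\prod_{i=1}^{N} p_i\right)^{1/N}.
$$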


Two recommendations are suggested when using the geometric mean to assess the average performance of a probabilistic inference algorithm. First, the tail decay in the source of the data should be carefully accounted for in the models. Secondly, a precision should be specified which restricts the reported probabilities to the range q_lim ≤ q ≤ 1 − q_lim. Unlike an arithmetic scale, on the multiplicative scale of probabilities the precision affects the range of meaningful quantities. In the examples visualizing the model versus source probabilities, the precision is q_lim = 0.01 and matches the bin widths. The issue of precision is particularly important when forecasting rare events. Assessment of the accuracy needs to consider the size of the available data and how this affects the precision which can be assumed.

The role of the quadratic score can be understood to include an assessment of the "Decisiveness" of the probabilities rather than purely the accuracy. In the approach described here this is fulfilled by using the arithmetic mean of the true state probabilities. Removing the bias in the arithmetic mean via incorporation of the non-true state probabilities is equivalent to the quadratic score; however, by keeping it a local mean the method is simplified and insight is provided regarding the degree of fluctuation in the reported probabilities. Balancing the arithmetic mean, the −2/3 mean provides an assessment of the "Robustness" of the algorithm. In this case, the metric is highly sensitive to reported probabilities of the actual event which are near zero. This provides an important assessment of how well the algorithm minimizes fluctuations in reporting the uncertainty and provides robustness against events which may not have been included in the testing. Together the arithmetic and −2/3 metrics provide an indication of the degree of variation around the central tendency measured by the geometric mean.

The performance of an inference algorithm, including its tendency toward being under- or over-confident, can be visualized by plotting the model versus source averages. As just described, the model averages are based on the generalized mean of the actual events. The contribution to the uncertainty from the source is determined by binning similar forecasts. The source probability for each bin is the frequency of probabilities associated with a true event in each bin. The arithmetic, geometric, and −2/3 means of the source probability for each bin are used to determine the overall source probability means. The geometric mean of the model probabilities is called the model probability and is always less than the source probability. The ratio of these quantities is the divergence probability, Equation (9), and is always less than one. A perfect model has complete alignment between the source and model probabilities, and in turn the three metrics are aligned along the 45° line of the model versus source probability plot. If the algorithm is over-confident the arithmetic mean of the reported probabilities increases and the −2/3 mean decreases, causing a rotation in the orientation of the three metrics to a higher angle. In contrast, an under-confident algorithm causes a rotation to a lower angle. This was illustrated using a two-class discrimination problem as an example, showing the effects of errors in the mean, variance, and tail decay.
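Returning to the precision recommendation above, a minimal sketch of the clipping step and its effect on the geometric mean (the function name and the example values are illustrative; q_lim = 0.01 matches the examples in the text):

```python
import numpy as np

def clip_probabilities(q, q_lim=0.01):
    """Restrict reported probabilities to [q_lim, 1 - q_lim] before averaging."""
    return np.clip(np.asarray(q, dtype=float), q_lim, 1.0 - q_lim)

# A single near-zero report for a true event dominates the geometric mean unless clipped.
reported = np.array([0.8, 0.7, 1e-6, 0.9])
raw = np.exp(np.mean(np.log(reported)))
clipped = np.exp(np.mean(np.log(clip_probabilities(reported))))
print(f"raw geometric mean:     {raw:.4f}")      # ~0.03
print(f"clipped geometric mean: {clipped:.4f}")  # ~0.27
```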
The three metrics provide a quantitative measure of the algorithm's performance as a probability and an indication of whether the algorithm is under- or over-confident. Distinguishing between the types of error (mean, variance, and decay) may be possible by analyzing the full spectrum of the generalized mean and is recommended for future investigations.

Acknowledgments: Reviews of this work with Fred Daum and Carlos Padilla of Raytheon assisted in refining the methods.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Jose, V.R.R.; Nau, R.F.; Winkler, R.L. Scoring rules, generalized entropy, and utility maximization. Oper. Res. 2008, 56, 1146. [CrossRef]
2. Jewson, S. The problem with the Brier score. arXiv 2004, arXiv:physics/0401046.
3. Ehm, W.; Gneiting, T. Local Proper Scoring Rules; Department of Statistics, University of Washington: Seattle, WA, USA, 2009.
4. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [CrossRef]
5. Dawid, A.P. The geometry of proper scoring rules. Ann. Inst. Stat. Math. 2007, 59, 77–93. [CrossRef]
6. Dawid, A.; Musio, M. Theory and applications of proper scoring rules. Metron 2014. [CrossRef]
7. Fleming, P.J.; Wallace, J.J. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM 1986, 29, 218–221. [CrossRef]
8. McAlister, D. The law of the geometric mean. Proc. R. Soc. Lond. 1879, 29, 367–376. [CrossRef]
9. Khinchin, A.I. Mathematical Foundations of Information Theory; Courier Corporation: North Chelmsford, MA, USA, 1957.
10. Bernardo, J.M.; Smith, A.F.M. Inference and Information. In Bayesian Theory; John Wiley & Sons: Hoboken, NJ, USA, 2009; pp. 67–81.
11. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1949; p. 114.
12. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
13. Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2009.
14. Rényi, A. On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Stat. Probab. 1961, 1, 547–561.
15. Tsallis, C. Nonadditive entropy and nonextensive statistical mechanics—an overview after 20 years. Braz. J. Phys. 2009, 39, 337–356. [CrossRef]
16. Nelson, K.P.; Scannell, B.J.; Landau, H. A Risk Profile for Information Fusion Algorithms. Entropy 2011, 13, 1518–1532. [CrossRef]
17. Oikonomou, T. Tsallis, Renyi and nonextensive Gaussian entropy derived from the respective multinomial coefficients. Phys. A Stat. Mech. Appl. 2007, 386, 119–134. [CrossRef]
18. Nelson, K.P.; Umarov, S.R.; Kon, M.A. On the average uncertainty for systems with nonlinear coupling. Phys. A Stat. Mech. Its Appl. 2017, 468, 30–43. [CrossRef]
19. Cohen, I.; Goldszmidt, M. Properties and benefits of calibrated classifiers. Lect. Notes Comput. Sci. 2004, 3202, 125–136.
20. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), Bonn, Germany, 7–11 August 2005; pp. 625–632.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).