Belief functions: Random sets for the working scientist
An IJCAI 2016 Tutorial
Fabio Cuzzolin
Department of Computing and Communication Technologies, Oxford Brookes University, UK
This is what the tutorial will look like
Tutorial web site: http://cms.brookes.ac.uk/staff/FabioCuzzolin/ijcai2016.html
Uncertainty
Outline
1. Uncertainty: Nature of uncertainty; Mathematical probability; Interpretations of probability; Frequentist interpretation; Bayesian interpretation; Bayesians vs frequentists
2. Beyond probability: It’s the data, stupid!; Missing data; Propositional data; Scarce data; Pure data; No data (ignorance); Unusual (rare) data; Uncertain data; Knightian uncertainty
3. Understanding – a mathematical theory of evidence: Belief functions; Dempster’s combination; Families of frames; Interpretations; Misunderstandings
4. Building: Dempster’s approach; Likelihood-based inference; From preferences; Coin toss revised
5. Reasoning: Combining; Conditioning; Belief vs Bayesian reasoning; Generalised Bayes Theorem; Graphical models
6. Putting (in context): Derived frameworks; Uncertainty theories
7. Using belief functions: Decision making; Classification; Ranking aggregation; Applications – Regression (computer vision), Prediction (climate change)
8. Challenges: Efficient computation; Belief functions on reals
9. New horizons: Upper and lower likelihood; Generalising logistic regression and rare events; Frequentist inference with RS; Central limit theorem; The total belief theorem; Random set random variables; A new machine learning; Climatic change models; A geometry of uncertainty
10. Summarising
Nature of uncertainty
What is uncertainty?
uncertainty → lack of information or imperfect information: a state of limited knowledge, where it is impossible to exactly describe the existing state or future outcomes
Uncertainty is widespread
“There are some things that you know to be true, and others that you know to be false; yet, despite this extensive knowledge that you have, there remain many things whose truth or falsity is not known to you. We say that you are uncertain about them. You are uncertain, to varying degrees, about everything in the future; much of the past is hidden from you; and there is a lot of the present about which you do not have full information. Uncertainty is everywhere and you cannot escape from it.” Dennis Lindley, Understanding Uncertainty (2006)
Two types of uncertainty
- the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability
- different probability interpretations treat predictable and unpredictable variation differently
- this is also referred to as the distinction between common-cause and special-cause variation
- it has consequences on human behaviour: people are averse to unpredictable variation (Ellsberg’s paradox, see Decision making)
‘Knightian’ Uncertainty
- ‘second order’ uncertainty: being uncertain about our very model of uncertainty
- if (a big ‘if’) uncertainty is modelled by probabilities: being uncertain about the ‘correct’ probability model
‘Knightian’ Uncertainty
Chicago economist Frank Knight distinguished ‘risk’ from ‘uncertainty’:
“Uncertainty must be taken in a sense radically distinct from the familiar notion of risk, from which it has never been properly separated. ... The essential fact is that ‘risk’ means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating. ... It will appear that a measurable uncertainty, or ‘risk’ proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all.”
“You cannot be certain about uncertainty”
in Knight’s terms: risk = probability, uncertainty = second-order uncertainty
‘Knightian’ Uncertainty
- risk → a consequence of an action taken in the presence of uncertainty
- some models of uncertainty use the human propensity to act as a measure of uncertainty
Mathematical probability
Probability measures
- the mainstream mathematical theory of (first order) uncertainty: mathematical (measure-theoretical) probability, mainly due to the Russian mathematician Andrey Kolmogorov
- probability is an application of measure theory, the theory of assigning numbers to sets
- additive probability measure → mathematical representation of the notion of chance
- it assigns a probability value to every subset of a collection of possible outcomes (of a random experiment, of a decision problem, etc.)
- the collection of outcomes Ω is called the sample space or universe; a subset A of the universe is called an event
Example: the spinning wheel
- typical example: a spinning wheel with 3 possible outcomes, universe Ω = {1, 2, 3}
- eight possible events (the subsets of Ω), including the empty set
- the probability of ∅ is 0, the probability of Ω is 1
- additivity holds: P({1, 2}) = P({1}) + P({2})
Probability measures
- probability measure µ: a real-valued function on a probability space that satisfies countable additivity
- probability space: a triplet (Ω, F, P) formed by a universe Ω, a σ-algebra F of its subsets, and a probability measure P on F
  - not all subsets of Ω necessarily belong to F
- axioms of probability measures:
  - µ(∅) = 0, µ(Ω) = 1
  - 0 ≤ µ(A) ≤ 1 for all events A ∈ F
  - additivity: for every countable collection of pairwise disjoint events A_i, µ(∪_i A_i) = Σ_i µ(A_i)
Random variable
- a variable whose value is subject to random variations, i.e. due to ‘chance’ (what chance is, is subject to philosophical debate!)
- it can take one of a set of possible values, each with a given probability
- mathematically, it is a function X from a sample space Ω (which forms a probability space) to (usually) the real line
- it is subject to a condition of measurability: each range of values of the real line must have an anti-image in Ω which has a probability value
- this way, we can forget about the initial probability space and record the probabilities of the various values of X
(Discrete) random variable: example
- the sample space is the set of outcomes of rolling two dice: Ω = {(1, 1), (1, 2), ..., (6, 5), (6, 6)}
- a random variable can be the function that associates each roll of the two dice with the sum S of the two faces
- random variables can be discrete or continuous – this one is discrete
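A short Python sketch of this random variable: it enumerates the 36-element sample space and pushes the uniform probability through the map S, yielding the distribution of the sum.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # (1,1), (1,2), ..., (6,6)
S = {w: w[0] + w[1] for w in omega}            # the random variable: sum of faces

# distribution of S under the uniform measure on the 36 outcomes
counts = Counter(S[w] for w in omega)
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
print(pmf[7])   # Fraction(1, 6): six of the 36 rolls sum to 7
```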
Cumulative Distribution Function (CDF) of a random variable: F(x) = P(X ≤ x)
example: the CDF of a Gaussian random variable
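A two-line illustration of the definition with SciPy, for a standard Gaussian:

```python
from scipy.stats import norm

# F(x) = P(X <= x) for a standard Gaussian random variable
print(norm.cdf(0.0))    # 0.5: half the probability mass lies below the mean
print(norm.cdf(1.96))   # ~0.975
```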
Probability Density Function of a continuous random variable
- a random variable is called continuous when it can assume values in a non-countable set (e.g. the real line)
- it is described by a probability density function (PDF), which describes the likelihood of the variable taking any given value
- the probability of any range of values (e.g., an interval) is the integral of the PDF over the range: P([a, b]) = ∫_a^b f(x) dx
example: the PDF of a Gaussian random variable
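A quick numerical check of the statement above, again for a standard Gaussian: integrating the PDF over [a, b] agrees with the CDF difference F(b) − F(a).

```python
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
integral, _ = quad(norm.pdf, a, b)     # P([a, b]) as an integral of the PDF
print(integral)                        # ~0.6827
print(norm.cdf(b) - norm.cdf(a))       # the same value via the CDF
```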
Radon–Nikodym derivative (measure-theoretic probability theory)
- a continuous random variable with values in a measurable space (X, A) (usually R^n with the Borel sets as measurable family) has as probability distribution the pushforward measure X_*P on (X, A)
- formally, the probability density function of X is the Radon–Nikodym derivative f = dX_*P / dµ, where µ is a reference measure on (X, A)
- that is, f is any measurable function such that P(X ∈ A) = ∫_A f dµ
- it is analogous to a derivative in calculus
- modern mathematical probability is really just an application of measure theory
- the measure-theoretic approach allows us to unify the discrete and continuous cases, making the difference just a question of which reference measure is used
Law of large numbers
- describes what happens when you repeat the same random experiment an increasing number of times n
- the average of the results (sample mean) X̄_n = (X_1 + ... + X_n)/n should be close to the expected value (actual mean) µ
- probabilities become predictable as we run the same trial more and more times!
- strong law: P(lim_{n→∞} X̄_n = µ) = 1
- weak law: lim_{n→∞} P(|X̄_n − µ| > ε) = 0 for every ε > 0
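A simulation sketch of the law for a fair die (expected value 3.5): the sample mean drifts towards 3.5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n)   # n rolls of a fair die
    print(n, rolls.mean())               # sample mean approaches 3.5
```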
Central limit theorem the mean of a sufficiently large number of iterates of independent random variables is normally (Gaussian) distributed let X1 , ..., Xn independent and identically distributed random variables with the same mean µ and variance σ 2 X1 + ... + Xn we can build the sample average as X n = n √ the random variable n(X n − µ) tends to a Gaussian with mean 0 and variance σ2
Interpretations of probability
Does probability really exist? That sinking feeling
- what is probability, really? is it just the name we give to our ignorance/limitedness?
- can it be that, with sufficient information, any phenomenon is predictable in a deterministic way?
- Einstein: God does not play dice
- in E.E. ‘Doc’ Smith’s Lensman series, the Arisians have such mental powers that they compete on foreseeing future events to the tiniest detail
Does probability really exist? That sinking feeling
- the principles of quantum mechanics seem to suggest that probability is not just a figment of our mathematical imagination, or a representation of our ignorance
- the workings of the physical world seem to be inherently probabilistic
- we will come back to this later
Interpretations of probability: Savage’s take
even assuming that probability is inherent to the physical world, people cannot agree on what it is “It is unanimously agreed that statistics depends somehow on probability. But, as to what probability is and how it is connected with statistics, there has seldom been such complete disagreement and breakdown of communication since the Tower of Babel. Doubtless, much of the disagreement is merely terminological and would disappear under sufficiently sharp analysis.” L.J. Savage, 1954
Interpretations of probability: frequentist, subjective and behavioural
- as a result, probability has multiple competing interpretations:
- an objective description of frequencies of events (meaning ‘things that happen’) at a certain persistent rate, or ‘relative frequency’ → frequentist interpretation [Fisher, Pearson]
- a degree of belief in events (interpreted as statements/propositions on the state of the world), regardless of any random process → Bayesian or evidential probability [de Finetti, Savage]
- the propensity of an agent to act (or gamble, or decide) in case the event happens → behavioural probability [Walley, Vovk]
- neither frequentist nor Bayesian probability is in contrast with the classical mathematical definition of probability – others are (as we will see)
‘Classical’ probability
- championed by Pierre-Simon Laplace
- if a random experiment can result in N mutually exclusive and equally likely outcomes, and if N_A of these outcomes result in the occurrence of the event A, the probability of A is defined by P(A) = N_A / N
- works only for a finite number of possible outcomes
- you need to determine in advance that all the possible outcomes are equally likely, without relying on the notion of probability, to avoid circularity
Frequentist interpretation
Frequentist probability
- the (aleatory) probability of an event is its relative frequency in time
- when tossing a fair coin, frequentists say that the probability of getting heads is 1/2, not because there are two equally likely outcomes, but because repeated series of large numbers of trials demonstrate that the empirical frequency converges to the limit 1/2 as the number of trials goes to infinity:
  P(A) = lim_{n→∞} n_A / n  (≠ N_A / N)
  where n is the number of trials and n_A the number of those resulting in A
- it is clearly impossible to actually perform an infinity of repetitions of a random experiment
- hence: we can only measure an approximation of the ‘actual’ probability (whatever that is)
- what are the consequences for inference?
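A simulation sketch of this limit: the empirical frequency of heads over longer and longer runs of fair-coin tosses.

```python
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=1_000_000)   # 1 = heads, 0 = tails
for n in (100, 10_000, 1_000_000):
    print(n, tosses[:n].mean())               # n_A / n tends to 1/2
```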
Frequentist inference
- the frequentist interpretation offers guidance in the design of practical ‘random’ experiments
- developed by Fisher, Pearson and Neyman
- three main tools: statistical hypothesis testing, model selection, confidence interval analysis
Statistical hypothesis testing (frequentist inference)
- a statistical hypothesis is a hypothesis that is testable by observing a process modelled via a set of random variables
- statistical hypothesis testing:
  - a data set obtained by sampling is compared against synthetic data from an idealised model
  - a hypothesis is proposed for the statistical relationship between the two data sets
  - this is compared, as an alternative, to an idealised null hypothesis proposing no relationship between the two data sets
  - the comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realisation of the null hypothesis according to a threshold probability – the significance level
- as an alternative we can do model selection: statistical hypothesis testing is a form of confirmatory data analysis, as opposed to exploratory data analysis, which does not rely on pre-specified hypotheses
The testing process (statistical hypothesis testing)
1. state the research hypothesis
2. state the relevant null and alternative hypotheses
3. state the statistical assumptions being made about the sample, e.g. assumptions about statistical independence or about the form of the distributions of the observations
4. state the relevant test statistic T (a quantity derived from the sample)
5. derive the distribution of the test statistic under the null hypothesis from the assumptions
6. set a significance level (α), i.e. a probability threshold below which the null hypothesis will be rejected
7. compute from the observations the observed value t_obs of the test statistic T
8. calculate the p-value, the probability (under the null hypothesis) of sampling a test statistic at least as extreme as the observed value
9. reject the null hypothesis, in favour of the alternative hypothesis, if and only if the p-value is less than the significance level threshold
(a minimal worked example of these steps follows below)
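A minimal worked instance of steps 4–9, assuming a two-sided one-sample t-test on hypothetical data (H0: the population mean is 5.0):

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.3, 5.0])
mu0, alpha = 5.0, 0.05      # null hypothesis and significance level (step 6)

# test statistic T and its p-value under H0 (steps 7-8); the distribution
# of T under H0 (step 5) is Student's t with n - 1 degrees of freedom
t_obs, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_obs, p_value)

# step 9: reject H0 iff the p-value falls below alpha
print("reject H0" if p_value < alpha else "evidence insufficient: no conclusion")
```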
Statistical hypothesis testing: sketch [figure]
Type I and type II errors in statistical hypothesis testing [figure]
Statistical hypothesis testing: interpretation
- example: given the observed data, and assuming a (parameterised) probability distribution generating them of which we do not know the parameter value, we test hypotheses on the value of the parameter
- output: yes or no (a binary decision)
- interpretation: if the p-value is less than the required significance level, the null hypothesis is rejected; if not, the test has no result – the evidence is insufficient to support a conclusion
- a reductio ad absurdum argument adapted to statistics: a claim is shown to be valid by demonstrating the improbability of the consequences of its opposite
- modern hypothesis testing is in fact a hybrid of two seminal proposals, by Fisher and by Neyman/Pearson
P-values and error rates (statistical hypothesis testing)
- American Statistical Association: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
- p-value: the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed
- the reason is that, for continuous random variables, P(X = x|H) = 0, so we distinguish (see the sketch below):
  - right-tail event {X ≥ x} → p = P(X ≥ x|H)
  - left-tail event {X ≤ x} → p = P(X ≤ x|H)
  - double-tailed event: the ‘smaller’ of {X ≤ x} and {X ≥ x}
- α is the rate of falsely rejecting the null hypothesis (type I error)
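The three conventions, sketched for a hypothetical observed statistic z under a standard-Gaussian null:

```python
from scipy.stats import norm

z = 1.7                              # hypothetical observed statistic, H ~ N(0, 1)
p_right = norm.sf(z)                 # right tail:  P(X >= z | H)
p_left = norm.cdf(z)                 # left tail:   P(X <= z | H)
p_two = 2 * min(p_left, p_right)     # double-tailed convention
print(p_right, p_left, p_two)
```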
The notion of p-value and its misunderstandings
- the p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false: frequentist statistics does not and cannot attach probabilities to hypotheses
Maximum Likelihood Estimation (MLE)
- the term ‘likelihood’ was popularised in mathematical statistics by Ronald Fisher in 1922: ‘On the mathematical foundations of theoretical statistics’
- Fisher argues against ‘inverse’ (Bayesian) probability as a basis for statistical inferences, and instead proposes inferences based on likelihood functions
- likelihood principle: all of the evidence in a sample relevant to model parameters is contained in the likelihood function
- this is still hotly debated [Mayo, Gandenberger]
- maximum likelihood estimation:
  θ̂_mle ∈ arg max_{θ∈Θ} L(θ; x_1, ..., x_n)
  where L(θ; x_1, ..., x_n) = f(x_1, x_2, ..., x_n | θ) and {f(·|θ), θ ∈ Θ} is a parametric model
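A numerical sketch of the definition, assuming a Gaussian parametric model f(·|θ) = N(θ, 1) and hypothetical data; for this model the MLE coincides with the sample mean, which the optimiser recovers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])    # hypothetical observations

def neg_log_likelihood(theta):
    # -log L(theta; x_1, ..., x_n) for the model N(theta, 1)
    return -np.sum(norm.logpdf(x, loc=theta[0], scale=1.0))

theta_mle = minimize(neg_log_likelihood, x0=[0.0]).x[0]
print(theta_mle, x.mean())                  # both ~1.06
```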
Maximum Likelihood: properties
- maximum-likelihood estimators have no optimality properties for finite samples
- however, they do have good limiting properties: consistency, asymptotic normality, efficiency
- consistency: the sequence of MLEs converges in probability, for a sufficiently large number of observations, to the (actual) value being estimated
- asymptotic normality: as the sample size increases, the distribution of the MLE tends to a Gaussian distribution centred on the true parameter (under a number of conditions)
- efficiency: it achieves the Cramér–Rao lower bound when the sample size tends to infinity, i.e. no consistent estimator has lower asymptotic mean squared error than the MLE
Bayesian interpretation
Subjective probability
- (epistemic) probability = degree of belief of an individual assessing the state of the world
- Ramsey and de Finetti → subjective beliefs must follow the laws of probability if they are to be coherent (if this ‘proof’ were watertight, we would not be here in front of you!)
- also, evidence casts doubt on whether humans hold coherent beliefs or behave rationally
Are humans rational and/or coherent?
- this guy (Daniel Kahneman) won a Nobel prize supporting the exact opposite, in collaboration with Amos Tversky
- people pursue courses of action which are bound to damage them
- people do not understand the full consequences of their actions
https://en.wikipedia.org/wiki/Daniel_Kahneman
Bayesian probability
- in the Bayesian view, a probability is assigned to a hypothesis, whereas under frequentist inference a hypothesis is typically tested without being assigned a probability
- it is a special case of evidential probability: some prior probability is updated to a posterior probability in the light of new evidence (data)
- once again, it makes use of mathematical probabilities
- it needs to specify a prior probability distribution, taking into account the available (prior) information
- it sequentially uses Bayes’ rule to compute a posterior distribution as more data become available
Bayes’ rule (Bayesian probability): P(A|B) = P(B|A) P(A) / P(B) [figure]
Some history
- Thomas Bayes (1702–1761) proved a special case of what is now called Bayes’ theorem in a paper titled “An Essay towards solving a Problem in the Doctrine of Chances”
- Pierre-Simon Laplace (1749–1827) introduced a general version of the theorem
- Jeffreys’ “Theory of Probability” (1939) played an important role in the revival of the Bayesian view of probability, followed by works by Abraham Wald (1950) and Leonard J. Savage (1954)
- de Finetti: a Dutch book is made when a clever gambler places a set of bets that guarantee a profit, no matter what the outcome of the bets; if a bookmaker follows the rules of the Bayesian calculus, a Dutch book cannot be made
  - (however, Dutch book arguments leave open the possibility that non-Bayesian updating rules could avoid Dutch books)
- justification by axiomatisation has been tried, but with no great success
Bayesian inference
- the prior distribution is the distribution of the parameter(s) before any data are observed, i.e. p(θ | α); it depends on a vector of hyperparameters α
- the likelihood is the distribution of the observed data conditional on the parameters, i.e. p(X | θ)
- the marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalised over the parameter(s):
  p(X | α) = ∫_θ p(X | θ) p(θ | α) dθ
- the posterior distribution is the distribution of the parameter(s) after taking into account the observed data, as determined by Bayes’ rule:
  p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α)
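The four ingredients in the conjugate Beta-Binomial case (hypothetical prior Beta(2, 2) on the heads probability θ, and 14 heads observed in 20 tosses), where everything is available in closed form:

```python
import numpy as np
from scipy.special import betaln, comb
from scipy.stats import beta

a0, b0 = 2.0, 2.0        # prior p(theta | alpha) = Beta(a0, b0)
n, k = 20, 14            # data X: k heads in n tosses (binomial likelihood)

# posterior p(theta | X, alpha) = Beta(a0 + k, b0 + n - k), by conjugacy
a1, b1 = a0 + k, b0 + (n - k)
print(beta(a1, b1).mean())       # posterior mean of theta, ~0.667

# marginal likelihood p(X | alpha) = C(n, k) B(a1, b1) / B(a0, b0)
log_evidence = np.log(comb(n, k)) + betaln(a1, b1) - betaln(a0, b0)
print(np.exp(log_evidence))
```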
Bayesian prediction
- the posterior predictive distribution is the distribution of a new data point, marginalised over the posterior:
  p(x̃ | X, α) = ∫_θ p(x̃ | θ) p(θ | X, α) dθ
- a distribution over possible data values is obtained
- by comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s) – e.g., by maximum likelihood or maximum a posteriori (MAP) estimation – which does not account for any uncertainty in the value of the parameter
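Continuing the Beta-Binomial sketch: the posterior predictive probability that the next toss is heads, obtained by integrating the Bernoulli likelihood against the Beta posterior, is simply the posterior mean of θ.

```python
a1, b1 = 16.0, 8.0                 # posterior Beta(16, 8) from the previous sketch
p_next_heads = a1 / (a1 + b1)      # the predictive integral in closed form
print(p_next_heads)                # ~0.667
```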
Maximum A Posteriori (MAP) estimation
- again we want to estimate the parameter θ of a parametric model
- assume that a prior distribution g over θ exists – then the posterior is
  f(θ | x) = f(x | θ) g(θ) / ∫_{ϑ∈Θ} f(x | ϑ) g(ϑ) dϑ
- maximum a posteriori estimation then estimates θ as the mode of this posterior distribution; since the denominator does not depend on θ:
  θ̂_MAP(x) = arg max_θ f(θ | x) = arg max_θ f(x | θ) g(θ)
- MAP and MLE estimates coincide when the prior g is uniform
- it is not very representative of Bayesian methods, as the latter are characterised by the use of full distributions to draw inferences
- also, unlike ML estimates, the MAP estimate is not invariant under reparameterisation
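In the same Beta-Bernoulli sketch, the MAP estimate is the mode of the Beta posterior, (a − 1)/(a + b − 2) for a, b > 1; note how it differs from the MLE k/n and is pulled towards the prior:

```python
a1, b1 = 16.0, 8.0                        # posterior Beta(16, 8)
theta_map = (a1 - 1) / (a1 + b1 - 2)      # mode of Beta(a, b) for a, b > 1
theta_mle = 14 / 20                       # MLE from the same data
print(theta_map, theta_mle)               # ~0.682 vs 0.700
```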
Bayesians vs frequentists
Bayesian vs frequentist inference
- in frequentist inference, unknown parameters are often (but not always) treated as having fixed but unknown values, not capable of being treated as random variates
- Bayesian inference allows probabilities to be associated with unknown parameters
- the frequentist approach does not depend on a subjective prior that may vary from one investigator to another
- however, Bayesian machinery (e.g. Bayes’ rule) can be used by frequentists
- see www.stat.ufl.edu/~casella/Talks/BayesRefresher.pdf
Lindley’s paradox (Bayesian vs frequentist hypothesis testing)
- Lindley’s paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution
- it is not really a paradox – the two approaches answer fundamentally different questions
- Lindley’s paradox¹ occurs when:
  - the result x is ‘significant’ by a frequentist test of H0, indicating sufficient evidence to reject H0 at, say, the 5% level, and
  - the posterior probability of H0 given x is high, indicating strong evidence that H0 is in better agreement with x than H1
- this can happen when H0 is very specific, H1 less so, and the prior distribution does not strongly favour one or the other
¹ onlinelibrary.wiley.com/doi/10.1002/0470011815.b2a15076/pdf
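A numerical sketch of the paradox for a coin, assuming H0: p = 1/2 against H1: p uniform on [0, 1], prior mass 1/2 on each hypothesis, and 5,100 heads observed in 10,000 tosses (numbers chosen for illustration):

```python
from scipy.stats import binom, norm

n, x = 10_000, 5_100

# frequentist two-sided test of H0: z = 2.0, p ~ 0.046, 'significant' at 5%
z = (x - n / 2) / (0.5 * n ** 0.5)
print(2 * norm.sf(abs(z)))

# Bayes factor P(x | H0) / P(x | H1); under the uniform H1, P(x | H1) = 1/(n + 1)
bf = binom.pmf(x, n, 0.5) * (n + 1)
print(bf, bf / (1 + bf))   # BF ~ 10.8 in favour of H0; posterior P(H0 | x) ~ 0.92
```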
Bayesian vs frequentist inference
- it is not that they are different ways of solving the same problem: they are really designed to solve different problems!
- the result of a Bayesian approach can be a probability distribution on the parameters given the results of the experiment
- the result of a frequentist approach is either:
  - a ‘true or false’ (binary) conclusion from a significance test, or
  - a conclusion in the form that a given sample-derived confidence interval covers the true value
- either of these conclusions has a given probability of being correct
Bayesian vs frequentist for regression problems [figure]
Beyond probability
It’s the data, stupid!
Something is wrong?
- measure-theoretical mathematical probability is not general enough:
  - cannot (properly) model missing data
  - cannot (properly) model propositional data
  - cannot really model unusual data (second order uncertainty)
- the frequentist approach to probability:
  - cannot really model pure data (without ‘design’)
  - in a way, cannot even properly model continuous data
  - models scarce data only asymptotically
- Bayesian reasoning has several limitations:
  - cannot model no data (ignorance)
  - cannot model uncertain data
  - cannot model pure data (without a prior)
  - again, cannot properly model scarce data (only asymptotically)
It’s all about the data! What probability does not do so well
- model missing data
  - canonical examples: the cloaked die, occluded dice
- model interval or propositional data (e.g., in engineering)
  - canonical example: the reliability of witnesses in a trial
- properly model scarce data
  - paramount example: training in machine learning
- model pure data
  - without priors or designed experiments
- model no data (ignorance)
- model unusual data (the statistics of rare events)
  - extinct dinosaurs and black swans
- perform prediction under huge (Knightian?) uncertainties
  - making politicians happy
Fisher has not got it all right
- the setting is arguable:
  - the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals)
  - the criterion is arbitrary: who decides what an ‘extreme’ realisation is (choice of α)? what is the deal with 0.05 and 0.01?
  - the whole ‘tail’ idea comes from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero – it seems to patch an underlying problem with the way probability is mathematically defined
- it cannot cope with pure data, without assumptions on the process (experiment) which generated them (we will come back to this later)
- it deals with scarce data only asymptotically (see ‘scarce data’)
The problem(s) with Bayes
- pretty bad at representing ignorance:
  - Fisher: uninformative priors are just not adequate
  - different results on different parameter spaces
- Bayes’ rule assumes the new evidence comes in the form of certainty: “A is true”
  - in the real world, this is often not the case
- beware the prior! → model selection in Bayesian statistics
  - results from a confusion between the original subjective interpretation and the objectivist view of a rigorous objective procedure
  - why should we ‘pick’ a prior? either there is prior knowledge (beliefs) or there is not
  - all will be fine, in the end! asymptotically, the choice of the prior does not matter (really!)
Missing data
The cloaked die: the die as a random variable
- a die is a simple example of a (discrete) random variable
- there is a probability space Ω = {face1, face2, ..., face6} which maps to the real numbers 1, 2, ..., 6 (no need to worry about measurability here)
The cloaked die: observations which are sets
- now imagine that face1 and face4 are cloaked, and we roll the die
- the same probability space Ω = {face1, face2, ..., face6} is still there (nothing has changed in the way the die works)
- however, the mapping is now different: both face1 and face4 are mapped to the set of possible values {1, 4} (since we cannot observe the outcome)
- mathematically, this is called a random set [Matheron, Kendall, Nguyen], i.e. a set-valued random variable
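A minimal sketch of the cloaked die as a random set: pushing the uniform distribution on the six faces through the set-valued mapping yields a mass assignment over subsets of {1, ..., 6} – exactly the kind of object the rest of the tutorial calls a belief function.

```python
from collections import defaultdict
from fractions import Fraction

# face1 and face4 are cloaked, so both map to the set {1, 4};
# every other face still maps to its singleton value
mapping = {f: frozenset({1, 4}) if f in (1, 4) else frozenset({f})
           for f in range(1, 7)}

mass = defaultdict(Fraction)
for face in range(1, 7):                  # uniform P on the sample space
    mass[mapping[face]] += Fraction(1, 6)
print(dict(mass))   # {1,4}: 1/3, {2}: 1/6, {3}: 1/6, {5}: 1/6, {6}: 1/6
```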
Occluded dice: a more realistic scenario
- a more realistic scenario is one in which we roll, say, four dice
- for some of them, the top face might be occluded, but some of the side faces will still be visible, providing information
- e.g. I see the top faces of the Red, Green and Purple dice, but, say, I cannot see the outcome of the Blue die
- however, I can see some side faces of Blue, therefore the outcome of Blue is the set {2, 4, 5, 6}
Missing data and random sets
- the bottom line is: whenever data are missing, observations are inherently set-valued
- mathematically, we are not sampling a (scalar) random variable; we are sampling a set-valued random variable: a random set
- if outcomes are sets, the probability distribution has to be defined over sets
- missing data appears (or disappears?) everywhere in science and engineering, e.g. occlusions in computer vision
Dealing with missing data: traditional approaches
- traditional statistical approaches deal with missing data in one of the following ways:
- deletion: most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results
- single imputation: replacing a missing value with another, for instance:
  - from a randomly selected similar record in the same dataset
  - or selecting donors from another dataset
  - with the mean of that variable for all other cases (does not change the sample mean)
  - using a regression model (does not represent residual variance well)
  - using stochastic regression
Dealing with missing data: traditional approaches
- multiple imputation [Rubin]: averaging the outcomes across multiple imputed data sets (using, for instance, stochastic regression)
  - involves drawing values of the parameters from a posterior distribution
  - hence, it simulates both the process generating the data and the uncertainty associated with the parameters of the probability distribution of the data

Missing data with random sets: no need for imputation or deletion whatsoever – all observations are set-valued, some of them just happen to be pointwise.
Propositional data
Reliable witnesses: evidence supporting propositions
- suppose there is a murder, and three people are on trial for it: Peter, John and Mary
- our hypothesis space is therefore Θ = {Peter, John, Mary}
- there is a witness: he testifies that the person he saw was a man
- this amounts to supporting the proposition A = {Peter, John} ⊂ Θ
- should we take this testimony at face value? in fact, the witness was tested, and the machine reported an 80% chance that he was sober when he reported the crime
- we should partly support the (vacuous) hypothesis that any one among Peter, John and Mary could be the murderer: it is natural to assign 80% chance to proposition A, and 20% chance to proposition Θ
Dealing with propositional evidence
- even when evidence (data) supports propositions, Kolmogorov’s probability forces us to specify support for individual outcomes
- this is unreasonable – an artificial constraint due to a mathematical model that is not general enough
  - we have no elements to assign this 80% probability to either Peter or John, nor to distribute it among them
- the cause is the additivity of probability measures: but this is not the most general type of measure for sets
- under a minimal requirement of monotonicity, set measures can still be suitable to describe probabilities of events: these objects are called capacities
- in particular, random sets are capacities in which the numbers assigned to subsets are given by a probability distribution

Belief functions and propositional evidence: as capacities (and random sets in particular), belief functions allow us to assign mass directly to propositions.
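A sketch of the witness example as such a mass assignment, together with the induced lower and upper measures (belief and plausibility, defined formally in the Understanding part):

```python
theta = frozenset({"Peter", "John", "Mary"})
A = frozenset({"Peter", "John"})
m = {A: 0.8, theta: 0.2}          # mass assigned directly to propositions

def bel(B):
    # belief: total mass committed to subsets of B
    return sum(v for F, v in m.items() if F <= B)

def pl(B):
    # plausibility: total mass of focal sets compatible with B
    return sum(v for F, v in m.items() if F & B)

print(bel(A), pl(A))              # 0.8, 1.0
mary = frozenset({"Mary"})
print(bel(mary), pl(mary))        # 0.0, 0.2
```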
Scarce data
I know that I don’t know: learning from scarce data
- yeah I know... Socrates again... but he knew it already 2500 years ago!
- still, people insist on learning from very limited experience
How widespread is life? Learning from scarce data
- the argument on the likelihood of biological life in the universe is an extreme example: how likely is it for a planet to give birth to life forms?
- planetary habitability is largely an extrapolation of conditions on Earth and the characteristics of the Solar System (some form of anthropic principle)
- basically, what people do is model perfectly the (presumed) causes of the emergence of life on Earth: the planet needs to circle a G-class star, in the ‘right’ galactic neighbourhood, be in a certain ‘habitable zone’ around the star, have a large moon to deflect hazardous impact events, ...
  p(life) = p_A · p_B · ...
- how much can one learn from a single example?
- how sure can one be about what has been learned from very few examples?
Machines that learn
(image: thebayesianobserver.wordpress.com)
- we design algorithms that can learn → machine learning
- BUT we train them on a ridiculously small amount of data
- how do we make sure they have learned the right lesson? is there really a ‘precise’ lesson to learn?
- should we not work with sets of models instead?
A naive position: dealing with scarce data
- a somewhat naive objection: probability distributions assume an infinite amount of evidence, so in reality finite evidence can only provide a constraint on the ‘true’ probability values
  - unfortunately, those who believe probabilities to be limits of relative frequencies (the frequentists) never really ‘estimate’ a probability from the data – they only assume (‘design’) probability distributions for their p-values
  - Fisher: fine, I can never compute probabilities, but I can use the data to test my hypotheses about them
  - in opposition, those who do estimate probability distributions from the data (the Bayesians) do not think of probabilities as infinite accumulations of evidence (but as degrees of belief)
  - Bayes: I only need to be able to model a likelihood function of the data
- well, actually, frequentists do estimate probabilities from scarce data when they do stochastic regression: see logistic regression in a couple of slides
Asymptotic happiness
- what is true is that both frequentists and Bayesians seem to be happy with solving their problems ‘asymptotically’:
  - limit properties of ML estimates
  - the Bernstein–von Mises theorem
- what about the here and now? e.g. smart cars?
Size and composition of the sample in (stochastic) logistic regression
- logistic regression allows us, given a sample Y = {Y_1, ..., Y_n}, X = {x_1, ..., x_n}, where Y_i ∈ {0, 1} is a binary outcome at time i and x_i is the corresponding measurement, to learn the parameters of a conditional probability relation between the two:
  P(Y = 1|x) = 1 / (1 + e^{−(β_0 + β_1 x)})
- given a new x, one then has the probability of a positive outcome
- it generalises deterministic linear regression
- the n trials are assumed independent but not identically distributed: π_i = P(Y_i = 1|x_i) varies with i
- the parameters β_0, β_1 are estimated by maximising the likelihood of the sample:
  L(β|Y) = ∏_{i=1}^n π_i^{Y_i} (1 − π_i)^{1−Y_i}
- logistic regression suffers when the number of samples is ‘insufficient’, or when there are too few positive outcomes (1s)
- also, it tends to underestimate the probability of a positive outcome (see rare events)
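A sketch with scikit-learn on a hypothetical tiny sample, illustrating the fit described above (a very large C makes it effectively unregularised, i.e. plain maximum likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])   # measurements x_i
y = np.array([0, 0, 0, 1, 0, 1])                           # binary outcomes Y_i

model = LogisticRegression(C=1e6).fit(x, y)    # ~unregularised ML estimate
print(model.intercept_, model.coef_)           # beta_0, beta_1
print(model.predict_proba([[2.2]])[:, 1])      # P(Y = 1 | x = 2.2)
```

With only six observations the estimates of β_0, β_1 move substantially if a single label is flipped – the instability the slide warns about.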
Size of the sample in frequentist probability: confidence intervals
Confidence interval
Let X be a sample from a probability P(·|θ, φ), where θ is the parameter to be estimated and φ a nuisance parameter. A confidence interval for the parameter θ, with confidence level (or confidence coefficient) γ, is an interval [u(X), v(X)] determined by the pair of random variables u(X) and v(X), with the property:

    P(u(X) < θ < v(X) | θ, φ) = γ    ∀(θ, φ).
example: I observe the weight of 25 cups of tea, I assume it is normally distributed with mean µ, and I want to know the confidence interval (the interval of 'expected' values on new samples) for the mean
since the (normalised) sample mean Z is also normally distributed, I can ask which values of the mean are such that P(−z ≤ Z ≤ z) = 0.95 (for instance)
since Z = (X̄ − µ)/(σ/√n), this yields an interval for µ, e.g. P(X̄ − 0.98 ≤ µ ≤ X̄ + 0.98) = 0.95
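A minimal sketch of this computation (the sample values are invented; choosing σ = 2.5 and n = 25 reproduces the ±0.98 margin quoted above):

```python
import numpy as np
from scipy.stats import norm

# invented sample: weights of n = 25 cups of tea, known sigma (z-interval)
rng = np.random.default_rng(0)
sample = rng.normal(loc=250.0, scale=2.5, size=25)
sigma, n = 2.5, len(sample)

z = norm.ppf(0.975)              # P(-z <= Z <= z) = 0.95
margin = z * sigma / np.sqrt(n)  # 1.96 * 2.5 / 5 = 0.98, as on the slide
x_bar = sample.mean()
print(f"95% CI for mu: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")
```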
Confidence intervals: interpretation
confidence intervals are a form of interval estimate
correct interpretation: as we saw in the example, it is about sampling samples
if I keep extracting new sample sets, 95% (say) of the time the confidence interval (which will differ for every new sample set) will cover the true value of the parameter
alternatively: there is a 95% probability that the confidence interval calculated from some future experiment encompasses the true value of the parameter
it does not mean that a specific confidence interval contains the value of the parameter with 95% probability
the Bayesian version of them: credible intervals
Size of the sample and belief functions
how do belief functions cope with scarce data?
Belief functions and scarce data
Belief functions cope with scarce data by being cautious about the 'correct' probability model describing the studied process: a belief function corresponds to an entire set of probability distributions.
Modelling pure data: Bayesian approach

Bayesian reasoning requires modelling the data and a prior
  a prior is just a name for beliefs built over a long period of time, from the evidence you have observed
  so long a time has passed that all track record of observations is lost, and all that is left is a probability distribution
why should we 'pick' a prior? either there is prior knowledge (beliefs) or there is not; nevertheless we are compelled to pick one, because the mathematical formalism requires it
  this is the result of a confusion between the original subjective interpretation (where prior beliefs always exist) and the objectivist view of a rigorous objective procedure (where in most cases we do not have any prior knowledge)
Bayesians then go into 'damage limitation' mode, and try to pick the least damaging prior (see 'ignorance' later)
all will be fine, in the end! (Bernstein-von Mises theorem) Asymptotically, the choice of the prior does not matter (really!)
Modelling pure data: frequentist approach
the frequentist approach is inherently unable to describe pure data without making additional assumptions on the data-generating process
in Nature one cannot 'design' an experiment: data come your way, whether you want them or not – you cannot set the 'stopping rules'
  again, this recalls the old image of a scientist 'analysing' (from Greek 'ana'+'lysis', breaking up) a specific aspect of the world in their lab
the same data can lead to opposite conclusions (!)
  different experiments can lead to the same data, whereas the parametric model employed (the family of probability distributions) is linked to a specific experiment
  apparently, however, frequentists are just fine with this
Same data, different conclusions
http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf
Choosing the prior: Bayesian inference

the prior distribution is typically hard to determine
'solution' → pick an 'uninformative' prior
Jeffreys prior → proportional to the square root of the determinant of the Fisher information matrix
  it can be improper (unnormalised), and it violates the strong version of the likelihood principle: when using the Jeffreys prior, inferences about θ depend not just on the probability of the observed data as a function of θ, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe
uniform priors do depend on the chosen set of hypotheses: they can lead to different results on different spaces, given the same likelihood functions (this was pointed out by Shafer in his book, btw)
Choosing the prior: Bernstein-von Mises theorem

in Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (Bernstein-von Mises theorem)
little problem: the amount of information supplied by a sample of data must be large enough
caveat [Freedman 1965]: the Bernstein-von Mises theorem does not hold almost surely if the random variable has a countably infinite probability space
A. W. F. Edwards: "It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."
Dealing with ignorance: Shafer vs Bayes

'uninformative' priors can be dangerous: they violate the strong likelihood principle, and may be unnormalised
wrong priors can kill a Bayesian model
priors in general cannot handle multiple hypothesis spaces in a coherent way (families of frames, in Shafer's terminology)

Belief functions and priors
Reasoning with belief functions does not require any prior: belief functions encoding the data are combined with no need for priors.

Belief functions and ignorance
Belief functions naturally represent ignorance via the 'vacuous' belief function, assigning mass 1 to the whole hypothesis space.
Extinct dinosaurs: the statistics of rare events

dinosaur statisticians were probably worrying about overpopulation risks ...
... until it hit them!
Black swans: the statistics of rare events

'black swan' is a term coined by Nassim Nicholas Taleb
an unpredictable event which, once it has occurred, is rationalised in hindsight as being predictable/describable by the existing risk models
Knightian uncertainty is presumed not to exist, with typically bad consequences!
examples: financial crises, plagues, but also unexpected scientific or societal developments
What’s a rare event? Very unusual data
examples of rare events, also called ‘tail risks’, are: volcanic eruptions, meteor impacts, tsunamis .. in the most extreme cases, these events might have never occurred (e.g. your vote will be decisive in the next presidential election, [Gelman and King, 1998]) what is a ‘rare’ event? clearly we are interested in them because they are not so rare, after all! in other words, they may happen rarely when considering a single system, but when putting a lot of systems together (the real world) the change of them happening becomes tangible so, an event is rare when it covers a region of the hypothesis space which is seldom sampled
Dealing with rare events: traditional approaches

probability distributions for the system's behaviour are built in 'normal' times (e.g. while the nuclear plant is working just fine), then used to extrapolate results at the 'tail' of the distribution
popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events
  Harvard's G. King [2001] has proposed corrections based on oversampling the 'rare' events w.r.t. the 'normal' ones
in response, some people drop generative probabilistic models in favour of discriminative ones [random forests, Huang 2005]
once again, we fail to understand that uncertainty affects our very models of uncertainty
Dealing with rare events: imprecise probabilities

we should explicitly model second-order (Knightian) uncertainties
the most straightforward way of doing this is to consider sets of probability distributions as modelling the problem

Belief functions and Knightian uncertainty
Mathematically, belief functions (random sets) do amount to (convex) sets of probability distributions.

as we will see, there are many ways of doing this – credal sets, probability intervals ...
a possible insight: rare events are a form of scarce data, with an added qualitative element (where the data are scarce)
they are a form of missing information too – we are missing certain regions of the hypothesis space
Bayes' rule and certainty

Bayes' rule is used by Bayesians to reason (in time) when new evidence becomes available
it is used by frequentists to condition on the (certain) measurements and generate their p-values
indeed, it assumes that new evidence always comes in the form of certain statements: event A is true
this is reasonable or even true in many situations: in science and engineering measurements flow in, and this is a form of 'certain' evidence
applying Bayes' rule to condition on series of measurements, to construct likelihood functions (or p-values, if you are a frequentist), then appears very reasonable
in many real-world problems, though, evidence/data is uncertain
Uncertain data

concepts themselves can be not well defined, e.g. a 'dark' or 'somewhat round' object (qualitative data)
  fuzzy theory accounts for this via the concept of graded membership
unreliable sensors can generate faulty (outlier) measurements: can we still treat these data as 'certain'? or is it more natural to attach to them a degree of reliability, based on the past track record of the 'sensor' (the data-generating process)? but then, can we still apply Bayes' rule?
interval measurements are common in engineering, due to the limited sensitivity of sensors
  they could be treated as precise pairs (a, b), but this requires considering the set of all subsets of measured values
people ('experts', e.g. doctors) tend to express themselves in terms of likelihoods directly (e.g. 'I think diagnosis A is most likely, otherwise either A or B')
  if the doctors were frequentists, and were provided with the same data, they would probably apply logistic regression and come up with the same prediction on P(disease|symptoms): unfortunately doctors are not statisticians
multiple sensors can provide as output a PDF on the same space
  e.g., two Kalman filters, one based on colour, the other on motion (optical flow), each providing a normal predictive PDF on the location of the target in the image plane
Jeffrey's rule of conditioning

Jeffrey's rule of conditioning: a step forward from certainty and Bayes' rule
an initial probability P stands corrected by a second probability P′, defined only on a number of events
suppose P is defined on a σ-algebra A
there is a new probability measure P′ on a sub-algebra B of A, and the updated probability P″ has to:
  1. meet the probability values specified by P′ for events in B
  2. be such that ∀B ∈ B, X, Y ⊆ B, X, Y ∈ A:

    P″(X)/P″(Y) = P(X)/P(Y)  if P(Y) > 0;    P″(Y) = 0  if P(Y) = 0

there is a unique solution:

    P″(A) = Σ_{B∈B} P(A|B) P′(B)

it generalises conditioning (obtained when P′(B) = 1 for some B)
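A minimal sketch of Jeffrey's update on a finite space (the events and numbers below are invented for illustration):

```python
# Jeffrey's rule on a finite space: P''(A) = sum_B P(A|B) P'(B),
# where the B's partition the space. Numbers below are invented.
P = {'a': 0.2, 'b': 0.3, 'c': 0.5}          # initial probability on {a, b, c}
partition = [{'a', 'b'}, {'c'}]             # events carrying the new evidence
P_new = {frozenset({'a', 'b'}): 0.7,        # P' on the sub-algebra
         frozenset({'c'}): 0.3}

def jeffrey_update(P, partition, P_new):
    """Return P'' over the atoms, reallocating mass within each block B."""
    P2 = {}
    for B in partition:
        PB = sum(P[w] for w in B)
        for w in B:
            # conditional P(w|B), scaled by the new block mass P'(B)
            P2[w] = (P[w] / PB) * P_new[frozenset(B)] if PB > 0 else 0.0
    return P2

print(jeffrey_update(P, partition, P_new))
# {'a': 0.28, 'b': 0.42, 'c': 0.3} – within-block ratios P(a)/P(b) are preserved
```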
Belief functions and uncertain evidence: conditioning versus combination

what if I have a new probability on the same σ-algebra A? Jeffrey's rule cannot be applied!
as we saw, this happens when multiple sensors provide predictive PDFs
belief functions deal with uncertain evidence by moving away from the concept of conditioning (via Bayes' rule) ...
... to that of combining pieces of evidence supporting multiple (intersecting) propositions to various degrees

Belief functions and evidence
Belief reasoning works by combining existing belief functions with new ones, which are able to encode uncertain evidence.

in addition, belief functions can represent fuzzy concepts as consonant (nested) belief functions
they can represent unreliable measurements as 'discounted' probabilities (by assigning mass to the entire hypothesis set)
Certainty about uncertainty: Voltaire's view

it is also absurd to be certain about uncertainty
it is quite contemptuous to allow convenience to dictate your choice: 'my noise is Gaussian', etc.
Ellsberg's paradox: aversion to Knightian uncertainty

the Ellsberg paradox illustrates people's aversion to second-order uncertainty
a decision problem can be formalized by defining:
  a set Ω of states of the world;
  a set X of consequences;
  a set F of acts, where an act is a function f : Ω → X
let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by

    (fEh)(ω) = f(ω) if ω ∈ E;    (fEh)(ω) = h(ω) if ω ∉ E

Savage's Sure Thing Principle states that ∀E, ∀f, g, h, h′:

    fEh ≽ gEh  ⇒  fEh′ ≽ gEh′
Ellsberg's paradox: aversion to Knightian uncertainty

suppose you have an urn containing 30 red balls and 60 balls that are either black or yellow. Consider the following gambles:
  f1: you receive 100 euros if you draw a red ball
  f2: you receive 100 euros if you draw a black ball
  f3: you receive 100 euros if you draw a red or yellow ball
  f4: you receive 100 euros if you draw a black or yellow ball
the Ellsberg paradox has been widely studied in economics and decision making (see http://www.econ.ucla.edu/workingpapers/wp362.pdf)
Ellsberg's paradox: aversion to Knightian uncertainty

in this example Ω = {R, B, Y}, fi : Ω → R and X = R:

         R    B    Y
    f1  100    0    0
    f2    0  100    0
    f3  100    0  100
    f4    0  100  100

empirically, most people strictly prefer f1 to f2, while preferring f4 to f3
now, pick E = {R, B}: by definition
    f1{R,B}0 = f1,     f2{R,B}0 = f2
    f1{R,B}100 = f3,   f2{R,B}100 = f4
since f1 ≽ f2, i.e. f1{R,B}0 ≽ f2{R,B}0, the Sure Thing Principle would imply f1{R,B}100 ≽ f2{R,B}100, i.e., f3 ≽ f4
empirically, the Sure Thing Principle is violated!
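The observed preferences are consistent with decision making over a set of probabilities. A minimal sketch (my own illustration, not from the tutorial): the urn pins p(R) = 1/3 but leaves p(B) free in [0, 2/3], and ranking acts by their lower (maximin) expected utility over this credal set reproduces f1 over f2 and f4 over f3:

```python
import numpy as np

payoff = {'f1': {'R': 100, 'B': 0,   'Y': 0},
          'f2': {'R': 0,   'B': 100, 'Y': 0},
          'f3': {'R': 100, 'B': 0,   'Y': 100},
          'f4': {'R': 0,   'B': 100, 'Y': 100}}

def lower_expectation(act):
    """Minimise E[act] over the credal set p(R)=1/3, p(B)=t, p(Y)=2/3-t."""
    values = []
    for t in np.linspace(0, 2/3, 200):      # sweep the free parameter
        p = {'R': 1/3, 'B': t, 'Y': 2/3 - t}
        values.append(sum(p[w] * payoff[act][w] for w in p))
    return min(values)

for f in payoff:
    print(f, round(lower_expectation(f), 1))
# f1 33.3, f2 0.0, f3 33.3, f4 66.7  ->  f1 beats f2 and f4 beats f3
```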
Making politicians happy: coming up with numbers for climate change

politicians need to decide whether to invest billions of dollars/euros/pounds in expensive engineering projects to mitigate the effects of climate change
whether theirs will be the right decision, we will know only in 20-30 years' time – nevertheless, decisions need to be made now
Brexit (really?) Investors do not like uncertainty

living in Oxford, I just have to talk about this
"In New York, a recent meeting of S&P Investment Advisory Services' five-strong investment committee decided to ignore the portfolio changes that its computer-driven investment models were advising. Instead, members decided not to make any big changes ahead of the vote." (Wall Street Journal)
investors prefer 'certainty' to 'uncertainty': does 'certainty' mean a certain outcome of their bets? No, only that they think their models can handle 'known' (first-order) uncertainty
Dealing with huge uncertainties: predicting the future

to be fair, the mainstream in climate change studies is not to model uncertainty at all, but to simply use dynamical models of the atmosphere/planet for prediction
  by the way, even deterministic, correct (chaotic) models (can) deliver uncertain predictions, due to uncertainty on the initial conditions
climate modelling requires predictions very far off in the future: what does this entail?
  if we use (deterministic) dynamical models, these are simplified versions of the world that get it more and more wrong as time passes
when modelling uncertainty explicitly, what are the challenges?
  we don't have any priors (ouch, Bayesians), and we don't have any data (pretty much) either (extreme scarcity)
  as we just saw, scarcity is a source of Knightian uncertainty
  we cannot really use hypothesis testing, either (too bad, frequentists): this is not a designed experiment where one can assume an underlying data-generating mechanism
Understanding
A mathematical theory of evidence
Shafer called his proposal 'A mathematical theory of evidence'; the mathematical objects it deals with are called 'belief functions'
where do these names come from? what interpretation of probability do they entail?
  it is a theory of epistemic probability: it is about probabilities as a mathematical representation of knowledge (a human's knowledge, or a machine's)
  it is a theory of evidential probability: such probabilities representing knowledge are induced ('elicited') by the available evidence
Belief (in hypotheses)

belief → the state of mind in which a person thinks something to be the case, with or without there being empirical evidence
is knowledge the part of belief that is true, or just that which is justified to be true?
epistemology → the branch of philosophy concerned with the theory of knowledge
epistemic probability → probability as a representation of knowledge
Evidence (supporting hypotheses)

in probabilistic logic, statements such as "hypothesis H is probably true" are interpreted to mean that the empirical evidence E supports H to a high degree
this degree of support of H by E is called the logical or epistemic probability of H given E
in fact, Pearl and others have supported a view of these matters in terms of probabilities on the logical causes of a certain proposition ('probability of provability'), much related to modal logic
  to be fair, this connection to evidence is overlooked in much of the subsequent work

Rationale
There exists evidence in the form of probabilities, which supports degrees of belief on a certain matter.

the space where the evidence lives is different from the hypothesis space
they are linked by a one-to-many map: but this is a random set!
Dempster's original setting

going back to the trial example, the situation can be described by a diagram in which Ω is the space where the evidence lives, in the form of a probability distribution P, and Θ is the hypothesis space, the set of outcomes of the trial
elements of Ω are mapped to subsets of Θ: once again this is a random set, i.e., a set-valued random variable
the probability distribution P induces a mass assignment m : 2^Θ → [0, 1] via the multi-valued (one-to-many) mapping Γ : Ω → 2^Θ
in the example, Γ maps {not drunk} ∈ Ω to {Peter, John} ⊂ Θ
the corresponding mass function is: m({Peter, John}) = 0.8, m(Θ) = 0.2
Mass functions ("basic probability assignments")

let θ be an unknown quantity with possible values in a finite domain Θ, called the frame of discernment
a piece of evidence about θ may be represented by a mass function m on Θ, defined as a function 2^Θ → [0, 1] such that:

    m(∅) = 0,    Σ_{A⊆Θ} m(A) = 1

P(Θ) = 2^Θ is the set of all subsets of Θ
any subset A of Θ such that m(A) > 0 is called a focal element (FE) of m
Belief and plausibility functions: Dempster's upper and lower probabilities

for any A ⊆ Θ, we can define:
  the total degree of support (belief) in A as the probability that the evidence implies A:

    Bel(A) = P({ω ∈ Ω | Γ(ω) ⊆ A}) = Σ_{B⊆A} m(B)

  the plausibility of A as the probability that the evidence does not contradict A:

    Pl(A) = P({ω ∈ Ω | Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Ā)

the uncertainty on the truth value of the proposition "θ ∈ A" is the interval [Bel(A), Pl(A)]
belief and plausibility values can (but this is disputed) be interpreted as lower and upper bounds on the values of an unknown, underlying probability measure: Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ
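A minimal sketch of these definitions on a finite frame, using the murder example's mass function (frozensets stand for subsets of Θ; my own illustration):

```python
THETA = frozenset({'Peter', 'John', 'Mary'})
# mass function from the murder example: m({Peter, John}) = 0.8, m(Theta) = 0.2
m = {frozenset({'Peter', 'John'}): 0.8, THETA: 0.2}

def bel(A, m):
    """Bel(A) = total mass of focal elements contained in A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(A, m):
    """Pl(A) = total mass of focal elements intersecting A (= 1 - Bel of the complement)."""
    return sum(v for B, v in m.items() if B & A)

print(bel(frozenset({'Peter', 'John'}), m))  # 0.8
print(pl(frozenset({'Mary'}), m))            # 0.2 -> interval [0, 0.2] on 'Mary'
```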
A generalisation of sets, fuzzy sets, probabilities

belief functions generalise traditional ('crisp') sets: a logical (or "categorical") mass function has a single focal set A, with m(A) = 1
belief functions generalise standard probabilities: a Bayesian mass function has as focal sets only elements (rather than subsets) of Θ
complete ignorance is represented by the vacuous mass function, with m(Θ) = 1
belief functions generalise fuzzy sets (see possibility theory later): when the focal sets of m are nested, m is said to be consonant
in that case the plausibility function Pl is a possibility measure, i.e.,

    Pl(A ∪ B) = max(Pl(A), Pl(B))    ∀A, B ⊆ Θ,

and its contour function pl(θ) = Pl({θ}) is the membership function of a fuzzy set
Combination of evidence: murder example continued

the first item of evidence gave us: m1({Peter, John}) = 0.8, m1(Θ) = 0.2
new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime
this second body of evidence is encoded by the mass assignment m2({John, Mary}) = 0.6, m2(Θ) = 0.4
once again, our sources of evidence are given to us in the form of probability distributions in some space relevant to (but not coinciding with) the problem
how do we combine these two pieces of evidence? an answer can be given within the random set interpretation of belief functions
Combination of evidence

if 'codes' ω1 ∈ Ω1 and ω2 ∈ Ω2 were selected, then θ ∈ Γ1(ω1) ∩ Γ2(ω2)
if the codes are selected independently, then the probability that the pair (ω1, ω2) is selected is P1({ω1}) · P2({ω2})
if Γ1(ω1) ∩ Γ2(ω2) = ∅, the pair (ω1, ω2) cannot have been selected, hence the joint distribution on Ω1 × Ω2 must be conditioned to eliminate such pairs
Dempster's rule: definition

under these assumptions we get Dempster's rule of combination
let m1 and m2 be two mass functions on the same frame Θ, induced by two independent pieces of evidence
their combination using Dempster's rule is defined as:

    (m1 ⊕ m2)(A) = (1 / (1 − κ)) Σ_{B∩C=A} m1(B) m2(C),    ∀ ∅ ≠ A ⊆ Θ,

where

    κ = Σ_{B∩C=∅} m1(B) m2(C)

is the degree of conflict between m1 and m2
their Dempster's sum m1 ⊕ m2 exists iff κ < 1
the rule is easily extended to any number of BFs
Dempster's rule - example

take m1({θ1}) = 0.7, m1({θ1, θ2}) = 0.3 and m2({θ2}) = 0.6, m2({θ1, θ2}) = 0.4, so that the conflict is κ = m1({θ1}) m2({θ2}) = 0.42; then:

    m({θ1}) = 0.7 · 0.4 / (1 − 0.42) ≈ 0.48,
    m({θ2}) = 0.3 · 0.6 / (1 − 0.42) ≈ 0.31,
    m({θ1, θ2}) = 0.3 · 0.4 / (1 − 0.42) ≈ 0.21
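A minimal sketch of Dempster's rule on a finite frame, reproducing the numbers above (frozensets as subsets; my own illustration, not the tutorial's code):

```python
from collections import defaultdict

def dempster_combine(m1, m2):
    """Dempster's rule: intersect focal elements, drop conflict, renormalise."""
    joint = defaultdict(float)
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            joint[B & C] += v1 * v2          # mass flows to the intersection
    kappa = joint.pop(frozenset(), 0.0)      # degree of conflict (empty set)
    if kappa >= 1.0:
        raise ValueError("total conflict: Dempster's sum does not exist")
    return {A: v / (1.0 - kappa) for A, v in joint.items()}

t1, t2 = frozenset({'t1'}), frozenset({'t2'})
m1 = {t1: 0.7, t1 | t2: 0.3}
m2 = {t2: 0.6, t1 | t2: 0.4}
print(dempster_combine(m1, m2))
# {t1}: 0.483, {t2}: 0.310, {t1, t2}: 0.207 – matching 0.48 / 0.31 / 0.21 above
```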
Dempster's rule: properties

Dempster's rule has some interesting properties: commutativity, associativity, existence of a neutral element (the vacuous BF mΘ with m(Θ) = 1)
it generalises set-theoretical intersection: if mA and mB are logical mass functions and A ∩ B ≠ ∅, then mA ⊕ mB = m_{A∩B}
it generalises Bayes' rule of conditioning: if m = p is a probability and mA is a 'logical' mass function, then m ⊕ mA is the probability p(·|A) obtained via Bayes' conditioning
A generalisation of Bayesian inference

belief theory generalises Bayesian probability (it contains it as a special case), in that:
  classical probability measures are a special class of belief functions (in the finite case) or random sets (in the infinite case)
  Bayes' 'certain' evidence is a special case of Shafer's bodies of evidence (general belief functions)
  Bayes' rule of conditioning is a special case of Dempster's rule of combination
however, it overcomes its limitations:
  you do not need a prior: if you are ignorant, you will use the vacuous BF mΘ which, when combined with new BFs m′ encoding the data, will not change the result: mΘ ⊕ m′ = m′
  however, if you do have prior knowledge, you are welcome to use it!
Refinements and coarsenings

the theory allows us to handle evidence impacting on different but related domains
assume we are interested in the nature of an object in a road scene: we could describe it, e.g., in the frame Θ = {vehicle, pedestrian}, or in the finer frame Ω = {car, bicycle, motorcycle, pedestrian}
another example: different image features in pose estimation
a frame Ω is a refinement of a frame Θ (or, equivalently, Θ is a coarsening of Ω) if the elements of Ω can be obtained by splitting some or all of the elements of Θ
(diagram: a refining map ρ sending each element θ1, θ2, θ3 of Θ to a subset of Ω)
Families of compatible frames

when Ω is a refinement for a collection Θ1, ..., ΘN of other frames, it is called their common refinement
two frames are said to be compatible if they do have a common refinement
compatible frames can be associated with different variables/attributes/features:
  let ΘX = {red, blue, green} and ΘY = {small, medium, large} be the domains of attributes X and Y describing, respectively, the colour and the size of an object
  in such a case the common refinement ΘX ⊗ ΘY = ΘX × ΘY is simply the Cartesian product
or, they can be descriptions of the same variable at different levels of granularity (as in the road scene example)
evidence can be moved from one frame to another within a family of compatible frames
Marginalization

let ΩX and ΩY be two compatible frames, and let mXY be a mass function on ΩX × ΩY
it can be expressed in the coarser frame ΩX by transferring each mass mXY(A) to the projection of A on ΩX
we obtain a marginal mass function on ΩX:

    m^{XY↓X}(B) = Σ_{A ⊆ Ω_{XY}: A↓Ω_X = B} mXY(A)    ∀B ⊆ ΩX

(again, it generalizes both set projection and probabilistic marginalization)
Vacuous extension

the "inverse" of marginalization
a mass function mX on ΩX can be expressed in ΩX × ΩY by transferring each mass mX(B) to the cylindrical extension of B
this operation is called the vacuous extension of mX to ΩX × ΩY:

    m^{X↑XY}(A) = mX(B) if A = B × ΩY;    0 otherwise

a strong feature of belief theory: the vacuous belief function (our representation of ignorance) is left unchanged when moving from one hypothesis set to another!
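A minimal sketch of the two operations on small frames (my own illustration; focal elements on the product frame are frozensets of (x, y) pairs):

```python
from itertools import product
from collections import defaultdict

OMEGA_X = ['red', 'blue']
OMEGA_Y = ['small', 'large']

def marginalise_X(m_xy):
    """Project each focal element A onto Omega_X and accumulate its mass."""
    m_x = defaultdict(float)
    for A, v in m_xy.items():
        m_x[frozenset(x for x, _ in A)] += v
    return dict(m_x)

def vacuous_extension(m_x):
    """Send each focal element B to its cylinder B x Omega_Y."""
    return {frozenset(product(B, OMEGA_Y)): v for B, v in m_x.items()}

m_x = {frozenset({'red'}): 0.6, frozenset(OMEGA_X): 0.4}
m_xy = vacuous_extension(m_x)
print(marginalise_X(m_xy) == m_x)   # True: extending then marginalising is lossless
```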
The multiple semantics of belief functions

being complex objects, belief functions have a number of (sometimes conflicting) semantics and mathematical interpretations
the original one [Dempster 1967]: lower probabilities induced by a multivalued mapping
  the mathematical representation: the random set framework
Shafer's (1976): representations of pieces of evidence in favour of propositions within someone's subjective state of belief
  represented as set functions on a finite domain Ω
as convex sets of probability measures, in a robust Bayesian interpretation
  mathematically, a credal set whose lower and upper envelopes are belief and plausibility functions
other equivalent mathematical formulations:
  as non-additive (generalised) probabilities
  as monotone capacities
  as inner measures (linked to the rough set idea)
As non-additive (generalised) probabilities

Probability measure
A function P : F → [0, 1] over a σ-field F ⊆ 2^Θ such that P(∅) = 0, P(Θ) = 1; if A ∩ B = ∅, A, B ∈ F then P(A ∪ B) = P(A) + P(B) (additivity).

if we relax the third constraint, allowing the function to meet additivity only as a lower bound, we obtain a:

Belief function
A function Bel : 2^Ω → [0, 1] from the power set 2^Ω to [0, 1] such that: Bel(∅) = 0, Bel(Ω) = 1; for every n and for every collection A1, ..., An ∈ 2^Ω we have that:

    Bel(A1 ∪ ... ∪ An) ≥ Σ_i Bel(Ai) − Σ_{i<j} Bel(Ai ∩ Aj) + · · · + (−1)^{n+1} Bel(A1 ∩ ... ∩ An)
Wong and Lingras' approach: building belief functions from preferences

input: a body of preferences, in the form of a preference relation ≻ and an indifference relation ∼
goal: to build a belief function Bel such that A ≻ B iff Bel(A) > Bel(B), and A ∼ B iff Bel(A) = Bel(B)
such a belief function exists if ≻ is a weak order and ∼ an equivalence relation

Algorithm
1. consider all propositions that appear in the preference relations as potential focal elements (FEs)
2. elimination: if A ∼ B for some B ⊂ A, then A is not a FE
3. a perceptron algorithm is used to generate the mass m by solving the system of remaining equalities and disequalities

however: it arbitrarily selects one solution among many, and does not address possible inconsistency in the given preferences
Ben Yaghlane's constrained optimisation approach: building belief functions from preferences

uses preferences and indifferences as in Wong and Lingras, with the same axioms ...
... but converts them into a constrained optimisation problem
objective function: maximise the entropy/uncertainty of the BF to be generated (least informative result)
constraints are derived from the input preferences/indifferences, i.e.

    A ≻ B ↔ Bel(A) − Bel(B) ≥ ε,    A ∼ B ↔ |Bel(A) − Bel(B)| ≤ ε

where ε is a constant specified by the expert; various uncertainty measures can be plugged in
A coin toss example

consider a coin toss experiment: we toss the coin n = 10 times, obtaining the sample X = {H, H, T, H, T, T, T, H, H, H}, with k = 6 successes (heads H) and n − k = 4 failures (tails T)
parameter of interest: the probability θ = p of heads in a single toss
inference problem: gather information on the value of p (either in the form of a point estimate, the acceptability of certain guesses, a probability distribution on the possible values of p, ...)
Bayesian inference: coin toss example

general Bayesian inference: assume the trials to be independent (they are obviously equally distributed)
the likelihood of the sample is binomial: P(X|p) = p^k (1 − p)^{n−k}
apply Bayes' rule to get the posterior:

    P(p|X) = P(X|p) P(p) / P(X) ∝ P(X|p)

(since we do not have prior information on the chances of p or X)
the ML estimate is the peak of this likelihood function
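A minimal sketch of this computation with a flat prior, under which the posterior is Beta(k+1, n−k+1) (my own illustration):

```python
import numpy as np
from scipy.stats import beta

k, n = 6, 10                        # 6 heads in 10 tosses
posterior = beta(k + 1, n - k + 1)  # flat prior => posterior is Beta(k+1, n-k+1)

p = np.linspace(0, 1, 1001)
p_map = p[np.argmax(posterior.pdf(p))]
print(p_map)                        # ~0.6: the peak coincides with the ML estimate k/n
print(posterior.interval(0.95))     # a 95% credible interval for p
```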
Frequentist hypothesis testing: coin toss example

what would a frequentist do? well, it seems reasonable that the value of p is p = k/n
we can then test it: once again assuming independent and equally distributed trials, the distribution of the sample is the binomial
we can then compute the p-value at, say, significance level α = 0.05
the p-value is obviously P(p̂ ≥ 0.6) ≈ 1/2 > α = 0.05, and the hypothesis is sensible ('not rejected', to be precise)
Likelihood-based belief function inference

likelihood-based belief function inference:

    Pl_Θ(A|X) = sup_{p∈A} L̂(p|X),    Bel_Θ(A|X) = 1 − Pl_Θ(Ā|X)

where L̂ is the normalised likelihood; these bounds determine an entire envelope of PDFs on the parameter space Θ = [0, 1]
we can apply the same criterion to the normalised empirical counts: f̂(H) = 1, f̂(T) = 4/6 = 2/3
we get the mass assignment m({H}) = 1/3, m({T}) = 0, m(Ω) = 2/3
as a credal set, Bel = {P : 1/3 ≤ P(H) ≤ 1}
this 'robustifies' the ML estimate, which is a PDF compatible with the inferred BF
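A minimal sketch of the second construction: normalise the empirical counts to a contour (possibility) function, then read off the consonant mass assignment (my own illustration):

```python
# contour from normalised counts: pl(H) = 0.6/0.6 = 1, pl(T) = 0.4/0.6 = 2/3
counts = {'H': 6, 'T': 4}
top = max(counts.values())
pl = {w: c / top for w, c in counts.items()}    # normalised plausibility contour

# consonant mass: sort outcomes by contour, assign differences to nested sets
order = sorted(pl, key=pl.get, reverse=True)    # ['H', 'T']
levels = [pl[w] for w in order] + [0.0]
m = {}
for i, w in enumerate(order):
    focal = frozenset(order[:i + 1])            # nested focal element
    m[focal] = levels[i] - levels[i + 1]
print(m)   # {frozenset({'H'}): 1/3, frozenset({'H', 'T'}): 2/3}, as on the slide
```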
Summary on inference

general Bayesian inference → a continuous PDF on the parameter space Θ (a second-order distribution)
MLE/MAP estimation → a single parameter value = a single PDF on Ω
generalised maximum likelihood → a belief function on Ω (a convex set of PDFs on Ω)
  generalises MAP/MLE
likelihood-based / Dempster-based belief function inference → a belief function on Θ = a convex set of second-order distributions
  generalises general Bayesian inference
lower and upper likelihoods → an interval of belief functions on Ω (we will see this at the end!)
Reasoning
Combining vs conditioning: reasoning with belief functions

belief theory is a generalisation of Bayesian reasoning
while in Bayesian theory evidence is of the kind 'A is true' (e.g. a new datum is available) ...
... in belief theory, new evidence can assume the more general form of a belief function
  a proposition A is a very special case of belief function, with m(A) = 1
in most cases, reasoning then needs to be performed by combining belief functions, rather than by conditioning with respect to an event
nevertheless, conditional belief functions are of interest, especially for statistical inference
Dempster's rule under fire: Zadeh's paradox

the question is: is Dempster's sum the only possible rule of combination? it seems to have paradoxical behaviour in certain circumstances ...
doctors have opinions about the condition of a patient, Θ = {M, C, T}, where M stands for meningitis, C for concussion and T for tumor
two doctors provide the following diagnoses:
  D1: "I am 99% sure it's meningitis, but there is a small chance of 1% that it is concussion."
  D2: "I am 99% sure it's a tumor, but there is a small chance of 1% that it is concussion."
these can be encoded by the following mass functions:

    m1(A) = 0.99 if A = {M}, 0.01 if A = {C}, 0 otherwise;
    m2(A) = 0.99 if A = {T}, 0.01 if A = {C}, 0 otherwise    (1)
171 / 464
Dempster's rule under fire: Zadeh's paradox
their (unnormalised) Dempster's combination is:
m(A) = { 0.9999 if A = ∅; 0.0001 if A = {C} }
as the two masses are highly conflicting, normalisation yields the belief function focussed on C → "it is definitely concussion", although both experts had left it as only a fringe possibility
objections:
- the belief functions in the example are really probabilities, so this is, if anything, a problem with Bayes' rule!
- diseases are never exclusive, so it may be argued that Zadeh's choice of frame of discernment is misleading → open-world approaches with no normalisation
- the doctors disagree so much that anyone would conclude that one of them is simply wrong → the reliability of sources needs to be accounted for
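To make the numbers above concrete, here is a minimal Python sketch of Dempster's rule on Zadeh's example; the dict-of-frozensets representation and the function name dempster are our own illustrative choices, not taken from any library.

```python
def dempster(m1, m2):
    """Combine two mass dicts (frozenset -> mass) by Dempster's rule."""
    out, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                out[A] = out.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC      # mass falling on the empty set
    K = 1.0 - conflict                   # Dempster's normalisation factor
    return {A: v / K for A, v in out.items()}, conflict

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}   # doctor 1
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}   # doctor 2
m, conflict = dempster(m1, m2)
print(conflict)   # 0.9999: almost all mass is conflicting
print(m)          # {frozenset({'C'}): 1.0}: 'it is definitely concussion'
```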
Dempster's rule under fire: Tchamova's paradox
this time, the two doctors generate the following mass assignments over Θ = {M, C, T}:
m1(A) = { a if A = {M}; 1 − a if A = {M, C}; 0 otherwise },   m2(A) = { b1 if A = {M, C}; b2 if A = Θ; 1 − b1 − b2 if A = {T} }   (2)
assuming equal reliability of the two doctors, Dempster's combination yields m1 ⊕ m2 = m1, i.e., Doctor 2's diagnosis is completely absorbed by that of Doctor 1!
here the 'paradoxical' behaviour is not a consequence of conflict: in Dempster's combination, every source of evidence has a 'veto' power over the hypotheses it does not believe to be possible
if any of the sources gets it wrong, the combined belief function will never give support to the 'correct' hypothesis
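Plugging arbitrary numbers (a = 0.3, b1 = 0.4, b2 = 0.3 here) into the same dempster sketch illustrates the absorption effect:

```python
# Quick numerical check of the absorption effect, reusing dempster() above;
# the values of a, b1, b2 are arbitrary.
m1 = {frozenset({"M"}): 0.3, frozenset({"M", "C"}): 0.7}
m2 = {frozenset({"M", "C"}): 0.4, frozenset({"M", "C", "T"}): 0.3,
      frozenset({"T"}): 0.3}
m, _ = dempster(m1, m2)
print(m)   # {{'M'}: 0.3, {'M','C'}: 0.7}: exactly m1, whatever m2 says
```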
Proposed combination rules
a number of alternative combination mechanisms have been proposed:
- Yager's rule: the conflict mass is assigned to Θ
- Dubois' rule: the conflict mass of each B ∩ C = ∅ is assigned to B ∪ C
- conjunctive rule: Dempster without normalisation
- disjunctive rule: dual of the conjunctive (and Dempster's)
- Denoeux's cautious rule: minimum weight after canonical decomposition
- bold rule: dual of the cautious rule
- Murphy's averaging idea
- Deng's distance-weighted averaging
- Lefevre's weighting factors
Yager's and Dubois' rules
a first answer to Zadeh's objections: the view that conflict is generated by non-reliable information sources
the conflicting mass m(∅) = Σ_{B∩C=∅} m1(B)m2(C) should be re-assigned to the whole frame Θ
let m∩(A) = Σ_{B∩C=A} m1(B)m2(C); then
mY(A) = m∩(A) for ∅ ≠ A ⊊ Θ,   mY(Θ) = m∩(Θ) + m(∅)   (3)
Dubois and Prade's idea: similar to Yager's, BUT the conflicting mass is not transferred all the way up to Θ; it goes to B ∪ C instead (by the minimum specificity principle):
mD(A) = m∩(A) + Σ_{B∪C=A, B∩C=∅} m1(B)m2(C)   (4)
the resulting BF dominates Yager's combination: mD(A) ≥ mY(A) ∀A
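A hedged sketch of both rules (eqs. (3) and (4)), under the same illustrative mass-dict representation used above:

```python
def yager(m1, m2, frame):
    """Yager's rule: all conflicting mass goes to the whole frame (eq. 3)."""
    out, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                out[A] = out.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC
    out[frame] = out.get(frame, 0.0) + conflict
    return out

def dubois_prade(m1, m2):
    """Dubois-Prade: the mass of each empty intersection goes to the union (eq. 4)."""
    out = {}
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = (B & C) or (B | C)    # B∩C if non-empty, else B∪C
            out[A] = out.get(A, 0.0) + mB * mC
    return out

frame = frozenset({"M", "C", "T"})
m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(yager(m1, m2, frame))         # {C}: 0.0001, Θ: 0.9999
print(dubois_prade(m1, m2))         # e.g. {'M','T'}: 0.9801, {'M','C'}: 0.0099
```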
Smets' conjunctive rule
Smets also assumes that all sources to combine are reliable: conflict is the result of an incorrectly specified frame of discernment
rather than normalising (as in Dempster's rule) or re-assigning the conflicting mass m(∅) to other non-empty subsets (as in Yager's and Dubois' proposals), his conjunctive rule leaves the conflicting mass with the empty set
conjunctive rule of combination:
m∩(A) = Σ_{B∩C=A} m1(B)m2(C)   (5)
it is applicable to unnormalised belief functions
open-world assumption: the current frame only approximately describes the set of possible hypotheses
the empty set ∅ represents hypotheses not included in the current frame (but which might be, if more information became available)
Disjunctive rule
dual of the conjunctive rule: in Dempster's original random set idea, consensus between two sources is expressed by the union of the supported propositions, rather than by their intersection
disjunctive rule of combination:
m∪(A) = Σ_{B∪C=A} m1(B)m2(C)   (6)
note that Bel1 ∪ Bel2 (A) = Bel1(A) ∗ Bel2(A): belief values are simply multiplied!
it was also proposed by Ivan Kramosil
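Both rules reduce to a single nested loop over pairs of focal elements; a minimal sketch (eqs. (5) and (6)), again with our illustrative dict representation:

```python
def combine(m1, m2, op):
    out = {}
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = op(B, C)
            out[A] = out.get(A, 0.0) + mB * mC
    return out

def conjunctive(m1, m2):
    # open world: mass on frozenset() (the empty set) is kept, not normalised
    return combine(m1, m2, lambda B, C: B & C)

def disjunctive(m1, m2):
    return combine(m1, m2, lambda B, C: B | C)

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(conjunctive(m1, m2))  # {frozenset(): 0.9999, {'C'}: 0.0001}
print(disjunctive(m1, m2))  # mass moves to unions, e.g. {'M','T'}: 0.9801
```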
Inverting Dempster's sum: The canonical decomposition
a belief function can be decomposed into a Dempster's sum of 'simple' components:
m = ⊕_{A⊊Θ} m_A^{w(A)}   (7)
where m_A^w denotes the simple pseudo belief function such that
m_A^w(B) = { 1 − w if B = A; w if B = Θ; 0 for all other B }
and the weights satisfy w(A) ∈ [0, +∞) for all A ⊊ Θ (conjunctive canonical decomposition)
the conjunctive and disjunctive rules also admit simple inverses
Denoeux's cautious rule
based on Smets' canonical decomposition m = ⊕_{A⊊Θ} m_A^{w(A)}
cautious combination: the mass assignment with the weights
w1∧2(A) = min{w1(A), w2(A)},   A ∈ 2^Θ \ {Θ}   (8)
i.e., the belief function whose simple components have weight equal to the minimum of the two input weights
it is the least committed BF in the set that dominates the weights of the input ones
it is commutative, associative and idempotent! idempotence means that if I keep adding the same evidence, nothing changes (unlike with Dempster's rule)
(a cautious conjunctive rule which differs from Denoeux's was proposed by Destercke et al.)
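As a rough illustration of how this can be computed, the sketch below derives the canonical weights from commonalities and rebuilds the combined mass with the conjunctive() helper above; it assumes non-dogmatic inputs (m(Θ) > 0, so all commonalities are strictly positive), and all helper names are ours.

```python
import math
from itertools import combinations

def subsets(frame):
    return [frozenset(c) for r in range(len(frame) + 1)
            for c in combinations(sorted(frame), r)]

def canonical_weights(m, frame):
    """Conjunctive canonical weights w(A), A ⊊ frame, from commonalities Q."""
    Q = {A: sum(v for B, v in m.items() if A <= B) for A in subsets(frame)}
    w = {}
    for A in subsets(frame):
        if A == frame:
            continue
        log_w = -sum((-1) ** (len(B) - len(A)) * math.log(Q[B])
                     for B in subsets(frame) if A <= B)
        w[A] = math.exp(log_w)
    return w

def cautious(m1, m2, frame):
    """Eq. (8): take the minimum weight per subset, then rebuild the mass."""
    w1, w2 = canonical_weights(m1, frame), canonical_weights(m2, frame)
    m = {frame: 1.0}
    for A in w1:
        wA = min(w1[A], w2[A])
        m = conjunctive(m, {A: 1.0 - wA, frame: wA})   # simple component m_A^w
    return {A: v for A, v in m.items() if abs(v) > 1e-12}
```

One can check numerically that cautious(m, m, frame) returns m itself, i.e. idempotence, which dempster() does not satisfy.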
Bold rule
dual of the cautious rule, based on the canonical disjunctive decomposition: any unnormalised belief function can be uniquely decomposed as the disjunctive combination
m = ∪_{A≠∅} m_{A,v(A)}   (9)
where m_{A,v(A)} is the unnormalised belief function assigning mass v(A) to ∅ and 1 − v(A) to A
bold combination is defined as:
m1 ⊻ m2 = ∪_{A≠∅} m_{A, min{v1(A), v2(A)}}   (10)
it is only applicable to unnormalised belief functions
Averaging approaches
completely different rationale from the random set interpretation: some way of computing the 'mean' of the input mass functions
Murphy [2000]: average the input masses, then calculate the combined b.p.a. by combining the average with itself multiple times
Deng Yong [2005]: averaging based on distance
- the degree of credibility Crd(mi) of the i-th body of evidence,
Crd(mi) = Sup(mi) / Σ_j Sup(mj),   where Sup(mi) = Σ_{j≠i} (1 − d(mi, mj)),
is used to compute a weighted average of the input masses: m̃ = Σ_i Crd(mi) · mi
albeit empirical, these approaches try to address the issue with the 'veto' power of each piece of evidence
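A minimal sketch of Murphy's idea, reusing the dempster() helper from earlier (the example masses are Zadeh's):

```python
def murphy(masses):
    """Average the input masses, then Dempster-combine the average n-1 times."""
    n = len(masses)
    avg = {}
    for m in masses:
        for A, v in m.items():
            avg[A] = avg.get(A, 0.0) + v / n
    out = avg
    for _ in range(n - 1):
        out, _conflict = dempster(out, avg)
    return out

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(murphy([m1, m2]))   # {M}: 0.4999, {T}: 0.4999, {C}: 0.0002: no total veto
```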
Lefevre's weighting factors
given J input masses m = {m1, ..., mj, ..., mJ}, a family of combination rules distributes the conflicting mass m(∅) to each proposition A of a set of subsets P = {A}, according to a weighting factor w(A, m):
m(A) = m∩(A) + m^c(A),   (11)
where m^c(A) = w(A, m) · m(∅) if A ∈ P, and 0 otherwise
it includes Smets' and Yager's rules, for P = {∅} and P = {Θ} respectively
we get Dempster's rule when P = 2^Θ \ {∅}, with weights w(A, m) = m∩(A) / (1 − m(∅)) for all A ∈ 2^Θ \ {∅}
Dubois and Prade's operator can also be obtained; similar conflict-redistribution strategies have been proposed by others
Other proposals: Belief function combination
a number of other proposals exist for combination rules:
- Josang's consensus operator (from the beta distribution interpretation)
- Daniel's minC approach
- Wang's [2007]
- Yamada's 'combination by compromise'
- Yang's evidential reasoning rule
- Florea's Adaptive Combination Rules (ACR) and Proportional Conflict Redistribution (PCR) rule
.. and families of combination operators:
- Denoeux's families induced by t-norms and conorms
- α-junctions: linear operators associated with matrices
- Yager's family of quasi-associative operators
- Denneberg's family of updating rules
Combination: Moving forward
Yager's rule is rather unjustified; Dubois' is somewhat intermediate between conjunction and disjunction
the cautious and bold rules are inspired by possibility theory's min rule, rather than the original random set framework
my take on this: Dempster's (conjunctive) combination and disjunctive combination are the two extrema of a spectrum of possible results
Proposal: combination tubes?
Meta-uncertainty on the sources generating the input belief functions (their independence and reliability) induces uncertainty on the result of the combination, represented by a bracket of combination rules, which produce a 'tube' of BFs.
we encountered this idea when generalising the concept of likelihood; it was already hinted at by Pearl in "Reasoning with belief functions: An analysis of compatibility"
should we then perhaps work with intervals of belief functions?
Reasoning
Conditioning
Conditional belief functions
in Bayesian theory conditioning is done via Bayes' rule: P(A|B) = P(A ∩ B) / P(B)
for belief functions, many approaches to conditioning have been proposed (just as for combination!):
- Dempster's original conditioning
- Fagin and Halpern's lower envelopes
- 'geometric conditioning' [Suppes]
- unnormalised conditional belief functions [Smets]
- generalised Jeffrey's rules [Smets]
- sets of equivalent events under multi-valued mappings [Spies]
- conditioning by distance minimisation [Cuzzolin]
several of them are special cases of combination rules (Dempster's, Smets' ..); others arise as the unique solution when interpreting belief functions as convex sets of probabilities (Fagin's)
once again, a duality emerges between the most and the least cautious conditioning approaches
Dempster's conditioning
Dempster's rule of combination induces a conditioning operator: given a new event B, the 'logical' belief function such that m(B) = 1 is combined with the a-priori belief function Bel using Dempster's rule; the resulting BF is the conditional belief function given B à la Dempster, Bel⊕(A|B)
in terms of belief and plausibility values, Dempster's conditioning yields
Bel⊕(A|B) = [Bel(A ∪ B̄) − Bel(B̄)] / [1 − Bel(B̄)] = [Pl(B) − Pl(B \ A)] / Pl(B),   Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)
the latter is obtained from Bayes' rule by replacing probability with plausibility measures!
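Since Dempster conditioning is just combination with a categorical mass function, it is a one-liner on top of the earlier dempster() sketch (the example values are illustrative):

```python
def dempster_condition(m, B):
    """Condition m on event B by Dempster-combining with m_B(B) = 1."""
    cond, _conflict = dempster(m, {frozenset(B): 1.0})
    return cond

m = {frozenset({"M"}): 0.2, frozenset({"M", "C"}): 0.5,
     frozenset({"M", "C", "T"}): 0.3}
print(dempster_condition(m, {"M", "C"}))   # {{'M'}: 0.2, {'M','C'}: 0.8}
```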
Fagin and Halpern's lower envelopes
we know that a belief function can be seen as the lower envelope of the family of probabilities consistent with it: Bel(A) = inf_{P∈P[Bel]} P(A)
Fagin and Halpern defined a conditional belief function as the lower envelope (the inf) of the family of conditional probability functions P(A|B), where P is consistent with Bel:
BelCr(A|B) = inf_{P∈P[Bel]} P(A|B),   PlCr(A|B) = sup_{P∈P[Bel]} P(A|B)
it obviously generalises conditional probability (just like Dempster's conditioning)
similar notions have been considered by other authors too, e.g. Dempster '67 and Walley '81
Lower conditional envelopes: Closed-form expressions
obviously strongly linked to the robust Bayesian (credal) interpretation, and so rather incompatible with the random set interpretation
nevertheless, while lower/upper envelopes of arbitrary sets of probabilities are not in general belief functions, these actually are belief functions:
BelCr(A|B) = Bel(A ∩ B) / [Bel(A ∩ B) + Pl(Ā ∩ B)],   PlCr(A|B) = Pl(A ∩ B) / [Pl(A ∩ B) + Bel(Ā ∩ B)]
they provide a more conservative estimate than Dempster's conditioning:
BelCr(A|B) ≤ Bel⊕(A|B) ≤ Pl⊕(A|B) ≤ PlCr(A|B)
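A hedged sketch of the closed-form credal conditioning, computing Bel and Pl from a mass dict (names are ours; degenerate zero denominators are not handled):

```python
def bel(m, A):
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    return sum(v for B, v in m.items() if B & A)

def credal_condition(m, A, B, frame):
    """Fagin-Halpern lower/upper conditional values for A given B."""
    Abar = frame - A
    lower = bel(m, A & B) / (bel(m, A & B) + pl(m, Abar & B))
    upper = pl(m, A & B) / (pl(m, A & B) + bel(m, Abar & B))
    return lower, upper

frame = frozenset({"M", "C", "T"})
m = {frozenset({"M"}): 0.2, frozenset({"M", "C"}): 0.5, frame: 0.3}
print(credal_condition(m, frozenset({"M"}), frozenset({"M", "C"}), frame))
# (0.2, 1.0): wider than the Dempster-conditioned interval, as expected
```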
Suppes and Zanotti's geometric conditioning
Suppes and Zanotti proposed a 'geometric' conditioning approach:
BelG(A|B) = Bel(A ∩ B) / Bel(B),   PlG(A|B) = [Bel(B) − Bel(B \ A)] / Bel(B)
what it does is retain only the masses of focal elements inside B, and normalise them:
mG(A|B) = m(A) / Bel(B),   A ⊆ B
it is a consequence of the focussing approach to belief update: no new information is introduced, we merely focus on a specific subset of the original set
it is somewhat dual to Dempster's conditioning, as it replaces probability with belief measures in Bayes' rule:
Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)   ↔   BelG(A|B) = Bel(A ∩ B) / Bel(B)
open question: is it induced by some dual rule of combination?
Smets' conjunctive rule of conditioning
the 'unnormalised' conditional belief function: its mass is
m∩(A|B) = Σ_{X ⊆ B̄} m(A ∪ X) for A ⊆ B,   m∩(A|B) = 0 if A ⊄ B
it is induced by the conjunctive rule of combination: m∩(·|B) = m ∩ mB
belief and plausibility values:
Bel∩(A|B) = Bel(A ∪ B̄) if A ∩ B ≠ ∅, 0 if A ∩ B = ∅;   Pl∩(A|B) = Pl(A ∩ B) if A ⊉ B, 1 if A ⊇ B
it is compatible with the principles of belief revision [Gilboa, Perea]: a state of belief is modified to take into account a new piece of information
- in probability theory, both focussing and revision are expressed by Bayes' rule, but they are conceptually different operations, which produce different results on BFs
it is more committal than Dempster's rule! Bel⊕(A|B) ≤ Bel∩(A|B) ≤ Pl∩(A|B) ≤ Pl⊕(A|B)
Disjunctive rule of conditioning
induced by the disjunctive rule of combination: m∪(·|B) = m ∪ mB
obviously dual to conjunctive conditioning:
m∪(A|B) = Σ_{X ⊆ B} m((A \ B) ∪ X) for A ⊇ B,   while m∪(A|B) = 0 for all A ⊉ B
it assigns mass only to subsets containing the conditioning event B
belief and plausibility values:
Bel∪(A|B) = Bel(A) if A ⊇ B, 0 if A ⊉ B;   Pl∪(A|B) = Pl(A) if A ∩ B = ∅, 1 if A ∩ B ≠ ∅
it is less committal not only than Dempster's rule, but also than credal conditioning:
Bel∪(A|B) ≤ BelCr(A|B) ≤ PlCr(A|B) ≤ Pl∪(A|B)
Conditioning - an overview

  operator        belief Bel(A|B)                              plausibility Pl(A|B)
  Dempster's ⊕    [Pl(B) − Pl(B \ A)] / Pl(B)                  Pl(A ∩ B) / Pl(B)
  Credal Cr       Bel(A ∩ B) / [Bel(A ∩ B) + Pl(Ā ∩ B)]        Pl(A ∩ B) / [Pl(A ∩ B) + Bel(Ā ∩ B)]
  Geometric G     Bel(A ∩ B) / Bel(B)                          [Bel(B) − Bel(B \ A)] / Bel(B)
  Conjunctive ∩   Bel(A ∪ B̄), A ∩ B ≠ ∅                        Pl(A ∩ B), A ⊉ B
  Disjunctive ∪   Bel(A), A ⊇ B                                Pl(A), A ∩ B = ∅

Nested conditioning operators
Conditioning operators form a nested family, from the more committal to the least committal one!
Bel∪(·|B) ≤ BelCr(·|B) ≤ Bel⊕(·|B) ≤ Bel∩(·|B) ≤ Pl∩(·|B) ≤ Pl⊕(·|B) ≤ PlCr(·|B) ≤ Pl∪(·|B)
open question: what about geometric conditioning? is it induced by some combination rule dual to Dempster's?
Reasoning
Belief vs Bayesian reasoning
Belief vs Bayesian reasoning: Image data fusion for object classification
suppose we want to estimate the class of an object appearing in an image, based on feature measurements extracted from the image (e.g. by a convolutional neural network)
we capture a training set of images, complete with annotated object labels
assuming a PDF of a certain family (e.g. a mixture of Gaussians), we can learn from the training data a likelihood function p(x|y), where y is the object class and x the image feature vector
suppose n different 'sensors' extract n features xi from each image: x1, ..., xn
let us compare how data fusion works under the Bayesian and the belief function paradigms!
Belief vs Bayesian reasoning: Bayesian data fusion
the likelihoods of the individual features are computed using the n likelihood functions learned during training: p(xi|y), for all i = 1, ..., n
measurements are typically assumed to be conditionally independent, yielding the product likelihood p(x|y) = ∏_i p(xi|y)
Bayesian inference is applied, typically assuming uniform priors (for there is no reason to think otherwise), yielding
p(y|x) ∝ p(x|y) = ∏_i p(xi|y)
Belief vs Bayesian reasoning: Dempster-Shafer data fusion
with belief functions, for each feature type i a BF is learned from the individual likelihood p(xi|y), e.g. via the likelihood-based approach by Shafer
this yields n belief functions Bel(Y|xi), Y ⊆ Y, on the range of possible object classes Y
a combination rule (e.g. ⊕, ∩, ∪) is applied to compute an overall BF:
Bel(Y|x) = Bel(Y|x1) ⊕ ... ⊕ Bel(Y|xn),   Y ⊆ Y
an empirical comparison of this kind is shown under Regression
Inference under partially reliable data: Belief vs Bayesian reasoning
in the fusion example we have assumed that the data are measured correctly: what if the data-generating process is not completely reliable?
problem: suppose we just want to detect an object (a binary decision: yes Y or no N)
two sensors produce image features x1 and x2, but we learned from the training data that both are unreliable 20% of the time
at test time we get an image, measure x1 and x2, and unluckily sensor 2 got it wrong! the object is actually there
we get the following normalised likelihoods:
p(x1|Y) = 0.9, p(x1|N) = 0.1;   p(x2|Y) = 0.1, p(x2|N) = 0.9
how do the two fusion pipelines cope with this?
the Bayesian scholar assumes the two sensors/processes are conditionally independent, and multiplies the likelihoods, obtaining
p(x1, x2|Y) = 0.9 ∗ 0.1 = 0.09,   p(x1, x2|N) = 0.1 ∗ 0.9 = 0.09
so that p(Y|x1, x2) = 1/2, p(N|x1, x2) = 1/2
Shafer's faithful follower discounts the likelihoods by assigning mass 0.2 to the whole hypothesis space Θ = {Y, N}:
m(Y|x1) = 0.9 ∗ 0.8 = 0.72, m(N|x1) = 0.1 ∗ 0.8 = 0.08, m(Θ|x1) = 0.2;
m(Y|x2) = 0.1 ∗ 0.8 = 0.08, m(N|x2) = 0.9 ∗ 0.8 = 0.72, m(Θ|x2) = 0.2
thus, when we combine them by Dempster's rule we get the BF Bel on {Y, N}:
m(Y|x1, x2) = 0.458,   m(N|x1, x2) = 0.458,   m(Θ|x1, x2) = 0.084
when combined using the disjunctive rule (the least committal one) we get Bel′:
m′(Y|x1, x2) = 0.09,   m′(N|x1, x2) = 0.09,   m′(Θ|x1, x2) = 0.82
(figure: the corresponding credal sets of probabilities)
the credal interval for Bel is quite narrow: reliability was assumed to be 80%, yet one measurement in two (50%) was faulty!
the disjunctive rule is much more cautious about the correct inference
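The numbers above can be checked with the earlier dempster() and disjunctive() sketches plus a small discounting helper (our own illustrative code):

```python
theta = frozenset({"Y", "N"})

def discount(m, alpha):
    """Discount a mass dict by reliability alpha: mass 1-alpha goes to Θ."""
    out = {A: alpha * v for A, v in m.items()}
    out[theta] = out.get(theta, 0.0) + (1 - alpha)
    return out

m1 = {frozenset({"Y"}): 0.9, frozenset({"N"}): 0.1}
m2 = {frozenset({"Y"}): 0.1, frozenset({"N"}): 0.9}

bel, _ = dempster(discount(m1, 0.8), discount(m2, 0.8))
print(bel)                    # {Y}: 0.458, {N}: 0.458, Θ: 0.084
print(disjunctive(m1, m2))    # {Y}: 0.09, {N}: 0.09, Θ: 0.82
```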
Reasoning
Generalised Bayes Theorem
Generalising full Bayesian inference
in Bayesian inference a likelihood function p(·|θ), θ ∈ Θ, is known, so that we can compute the likelihood of a new sample p(x|θ), x ∈ X
after observing x, the probability distribution on Θ is updated to the posterior via Bayes' theorem:
P(θ|x) = P(x|θ)P(θ) / P(x),   P(x) = Σ_{θ′} P(x|θ′)P(θ′),   ∀θ ∈ Θ
Shafer's likelihood-based inference maps the likelihood p(x|θ) to a BF on Θ:
p(x|θ), ∀x ∈ X   ↦   BelΘ(A|x), A ⊂ Θ
Dempster's inference maps (for instance) the family of CDFs F(x|θ) associated with p(x|θ) to a belief function on Θ × X:
F(x|θ), ∀x ∈ X   ↦   BelΘ×X(·)
which by conditioning on x gives a BF BelΘ(A|x), A ⊂ Θ
Generalised Bayes Theorem: Generalising full Bayesian inference
in Smets' generalised Bayesian theorem setting, the input is a set of 'conditional' belief functions on the data space,
BelX(X|θ),   X ⊂ X, θ ∈ Θ,
each associated with a value θ of the parameter, rather than likelihoods p(x|θ)
these are not the same conditional belief functions we saw earlier, where a conditioning event B ⊂ Θ alters a prior belief function BelΘ, mapping it to BelΘ(·|B); they can rather be seen as a parameterised family of BFs on the data
the desired output is another family of belief functions on Θ, parameterised by all sets of measurements X on X: BelΘ(A|X), ∀X ⊂ X
it is natural to require that each piece of evidence m(X|θ) have an effect on our beliefs on the parameters
this is also coherent with the random set setting, in which we condition on set-valued observations
Generalised Bayes Theorem
the GBT implements the inference BelX(X|θ) ↦ BelΘ(A|X) by:
1. computing an intermediate family of BFs on X, parameterised by sets of parameter values, via the disjunctive rule of combination:
BelX(X|A) = ∪_{θ∈A} BelX(X|θ) = ∏_{θ∈A} BelX(X|θ)
2. assuming that PlΘ(A|X) = PlX(X|A) ∀A ⊂ Θ, X ⊂ X
3. this yields BelΘ(A|X) = ∏_{θ∈Ā} BelX(X̄|θ)
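A minimal sketch of step 3 for a finite Θ, taking as input the values BelX(X̄|θ) for the observed X; the function name and the example numbers are illustrative, and the output follows the unnormalised (TBM-style) convention, so the value for ∅ plays the role of conflict:

```python
from itertools import combinations

def gbt_posterior(bel_xbar):
    """bel_xbar: dict θ -> Bel_X(X̄|θ). Returns A (frozenset) -> Bel_Θ(A|X)."""
    thetas = sorted(bel_xbar)
    out = {}
    for r in range(len(thetas) + 1):
        for A in combinations(thetas, r):
            prod = 1.0
            for t in thetas:
                if t not in A:          # θ ranges over the complement of A
                    prod *= bel_xbar[t]
            out[frozenset(A)] = prod
    return out

print(gbt_posterior({"θ1": 0.3, "θ2": 0.8}))
# ∅: 0.24, {θ1}: 0.8, {θ2}: 0.3, {θ1,θ2}: 1.0
```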
Generalised Bayes Theorem: Assumptions and properties
1. the GBT derives from two requirements:
  I. if we apply the GBT to two variables X and Y we get BelΘ(A|X, Y) = BelΘ(A|X) ∩ BelΘ(A|Y): the conditional BF on Θ is the conjunctive combination of the two
  II. PlX(X|A) is a function of {PlX(X|θ), PlX(X̄|θ) : θ ∈ A}
2. it generalises Bayes' rule (by replacing P with Pl) when priors are uniform
Shafer's proposal for statistical inference, PlΘ(A|x) = max_{θ∈A} PlΘ(θ|x), does not meet requirement I
under requirement I of the GBT the two variables are conditionally cognitively independent (this extends stochastic independence):
PlX×Y(X ∩ Y|θ) = PlX(X|θ) ∗ PlY(Y|θ)   ∀X ⊂ X, Y ⊂ Y, θ ∈ Θ
Reasoning
Graphical models
Probabilistic graphical models: Pearl
conditional independence relationships are the building block of Pearl's probabilistic graphical models
conditional probabilities can be directly manipulated using Bayes' rule
the support for a graphical model is a directed acyclic graph G = (V, E); each node v ∈ V is associated with a random variable Xv
the set of random variables X = {Xv, v ∈ V} is a Bayesian network with respect to G if its joint probability density function is the product of the individual density functions, conditional on their parent variables:
p(x) = ∏_{v∈V} p(xv | x_pa(v))
where pa(v) is the set of parents of v
this expresses the conditional independence of the variables from any of their non-descendants, given the values of their parent variables
Probabilistic graphical models: Belief propagation
a message-passing algorithm for performing inference on graphical models
it calculates the marginal distribution for each unobserved node, conditional on any observed nodes
it was first formulated on trees, then polytrees, finally general graphs
it works by passing real-valued functions called messages µv→u, u ∈ N(v), along the edges
upon convergence, the estimated marginal distribution of each node is proportional to the product of all messages from adjoining factors (up to the normalisation constant):
pXv(xv) ∝ ∏_{u∈N(v)} µu→v(xv)
Graphical models for belief functions: on joint belief functions
early local propagation methods (see efficient computation) have also developed into graphical models for reasoning with belief functions
however, in networks using belief functions, relations among variables are usually represented by joint belief functions rather than conditional ones
furthermore, these networks are undirected graphs, for instance:
- hypertrees [Shenoy & Shafer, 1986]
- qualitative Markov trees [Shafer, Shenoy and Mellouli, 1987]
- join trees [Shenoy, 1997]
- valuation networks [Shenoy, 1992]
Shenoy and Shafer showed that if combination and marginalization meet three axioms, then local computation becomes possible
Cano et al. [1993]: adding three more axioms allows us to use Shenoy & Shafer's axiomatic framework for propagation in directed acyclic graphs
Graphical models for belief functions: on conditional belief functions
graphs of conditional belief function independence relations are more efficient [Shenoy, 1993]
due to a lack of directed belief networks (similar to Bayesian networks), more recent works integrate belief function theory and Bayesian networks:
- Cobb & Shenoy [2003]: plausibility transformation between models
- Simon & Weber [2006]: implementing belief calculus in Bayesian networks
an alternative line of research:
- evidential networks with conditional belief functions (ENC) [Xu and Smets, 1993-95] use directed acyclic graphs, BUT edges represent conditional relations (i.e. values in X generate conditional belief functions in Y) rather than conditional independence relations
- they use the Generalised Bayesian Theorem (GBT) for propagation
- propagation on ENCs only applies to binary relations between nodes
- Directed Evidential Networks (DEVN) [Ben Yaghlane and Mellouli, 2008] generalise ENCs to relations involving any number of nodes
Shafer-Shenoy architecture: Qualitative Conditional Independence
uses qualitative Markov trees, which generalise both diagnostic trees and causal trees (Pearl)
partitions Ψ1, ..., Ψn of a frame Θ are qualitatively conditionally independent (QCI) given the partition Ψ if
P ∩ P1 ∩ ... ∩ Pn ≠ ∅ whenever P ∈ Ψ, Pi ∈ Ψi and P ∩ Pi ≠ ∅ for all i
the notion does not involve probability, only logical independence; stochastic conditional independence does imply it
if two BFs Bel1 and Bel2 are 'carried by' partitions Ψ1, Ψ2 which are QCI given Ψ:
(Bel1 ⊕ Bel2)Ψ = (Bel1)Ψ ⊕ (Bel2)Ψ
Shafer-Shenoy architecture: Qualitative Markov trees
a qualitative Markov tree QMT = (V, E) is a tree of partitions of a base frame of discernment Θ: each node v ∈ V is associated with a partition Ψv of Θ
it meets the following requirement:
- deleting a node v and all incident edges yields a forest; denote the collection of nodes of the j-th such subtree by Vj(v)
- for every node v ∈ V, the minimal refinements of the partitions in Vj(v), for j = 1, ..., k, are QCI given Ψv
a Bayesian causal tree is a qualitative Markov tree in which each node v is associated with the partition Ψv induced by the random variable Xv
Propagation on QMTs: suppose a number of belief functions are inputted into a subset of nodes V′. Problem: computing ⊕_{v∈V′} Belv.
Shafer-Shenoy architecture: Propagating belief functions
rather than applying Dempster's combination over the whole frame Θ, we perform multiple Dempster's combinations over partitions
- restriction: each BF to combine has to be carried by a partition in the tree
a processor located at each node v combines BFs using Ψv as a frame of discernment, and projects BFs onto its neighbours:
1. send Belv to its neighbours N(v)
2. whenever it gets a new input, compute (Bel^T)Ψv ← (⊕{(Belu)Ψv : u ∈ N(v)} ⊕ Belv)Ψv
3. compute, for each neighbour w ∈ N(v), Belv,w ← (⊕{(Belu)Ψv : u ∈ N(v) \ {w}} ⊕ Belv)Ψw and send it to w
the final result at each processor v is the coarsening to that partition of the combination of all the inputted BFs: (⊕_{u∈V′} Belu)Ψv
Directed evidential networks: Propagation algorithm
extends Pearl's belief propagation to belief functions
problem: given BFs {Bel⁰v, v ∈ V} on the nodes of a DEN (a directed acyclic graph in which edges represent conditional relations), we seek the marginal on each node v ∈ V of their joint belief function
if there is a conditional relation between two nodes u and v, it uses the disjunctive combination and the generalised Bayesian theorem (GBT) to compute the posterior Belv(X|Y) given the conditional Belu(Y|X)
each variable of the network has a λ value and a π value associated with it
Directed evidential networks: Initialisation
we present here the (simpler) propagation algorithm for polytrees:
1. for each node v: Belv ← Bel⁰v, πv ← Belv, λv ← the vacuous BF
2. each root node sends a new message πv→u to every child u of v:
πv→u = Belv→u(Y) = Σ_{X⊆Θv} mv(X) Belu(Y|X)
3. node u waits for the messages from all its parents, then it:
- computes the new πu value via πu = Belu ⊕ (⊕_{v∈pa(u)} πv→u)
- computes the new marginal belief Belu ← πu ⊕ λu
- sends the new πu message to all its children
Directed evidential networks: Updating
whenever a new observation Bel^O is inputted into a node v:
1. node v computes its new value Belv = πv ⊕ λv, where
πv = Bel⁰v ⊕ (⊕_{u∈pa(v)} πu→v),   λv = Bel^O_v ⊕ (⊕_{w∈ch(v)} λw→v)
2. for every child node u, we calculate the new message and send it:
πv→u = Belv→u(Y) = Σ_{X⊆Θv} mv(X) Belu(Y|X)
where Belu(Y|X) is given by the disjunctive combination of Belu(Y|x), x ∈ Θv
3. for every parent node u, we compute the new message and send it:
λu→v = Belu→v(X) = Σ_{Y⊆Θu} mu(Y) Belv(X|Y)
where Belv(X|Y) is the posterior given by the GBT
Belief functions Random sets for the working scientist
IJCAI 2016
216 / 464
Reasoning
Graphical models
Directed evidential networks: Graphical example of propagation (figure)
Using belief functions
A set of tools for the working scientist: using belief functions
scientists face on a daily basis problems such as:
- making decisions based on the available data
- estimating a quantity of interest given the available data (which can be missing, incomplete, conflicting, or partially specified)
- classifying data-points into bins
  - extending k-NN classification approaches
  - fusing the results of multiple classifiers
- clustering clouds of data to make sense of them
- learning a mapping from measurements to a domain of interest (regression)
- ranking objects
belief functions can provide useful approaches to all these problems in the presence of (heavy) uncertainty
Using belief functions
Decision making
Decision making with belief functions: An overview
a natural application of the belief function representation of uncertainty
problem: selecting an act f from an available list F (making a 'decision') which optimises a certain objective function
various approaches to decision making:
- decision making in the TBM is based on expected utility via the pignistic transform
- Strat has proposed something similar in his 'cloaked carnival wheel' scenario
- generalised expected utility [Gilboa], based on classical expected utility theory [Savage, von Neumann]
there is also a lot of interest in multicriteria decision making (based on a number of attributes)
Expected utility approach: Decision making under uncertainty
a decision problem can be formalised by defining:
- a set Ω of states of the world
- a set X of consequences
- a set F of acts, where an act is a function f : Ω → X
let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
Savage (1954) showed that ≽ verifies some rationality requirements iff there exists a probability measure P on Ω and a utility function u : X → R such that
∀f, g ∈ F:   f ≽ g ⇔ EP(u ◦ f) ≥ EP(u ◦ g)
where EP denotes the expectation w.r.t. P; P and u are unique up to a positive affine transformation
does that mean that basing decisions on belief functions is irrational?
Decision making in the TBM: Expected utility using the pignistic probability
in the TBM, decision making is done by maximising the expected utility of actions based on the pignistic transform (as opposed to computing upper and lower expected utilities directly from (Bel, Pl) via the Choquet integral, as we will see later)
the set F of possible actions and the set Ω of possible outcomes are distinct, and the utility function is defined on F × Ω
Smets proves the necessity of the pignistic transform by maximising
E[u] = Σ_{ω∈Ω} u(f, ω) Pign(ω)
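A minimal sketch of the pignistic transform and the resulting expected utility (the mass values and utilities below are illustrative):

```python
def pignistic(m):
    """BetP: spread each mass equally over the elements of its focal element."""
    bet = {}
    for A, v in m.items():
        for w in A:
            bet[w] = bet.get(w, 0.0) + v / len(A)
    return bet

def expected_utility(u_f, bet):
    """u_f: dict ω -> u(f, ω) for a fixed act f."""
    return sum(u_f[w] * p for w, p in bet.items())

m = {frozenset({"Y"}): 0.458, frozenset({"N"}): 0.458,
     frozenset({"Y", "N"}): 0.084}
bet = pignistic(m)                                      # {'Y': 0.5, 'N': 0.5}
print(expected_utility({"Y": 100.0, "N": 0.0}, bet))    # 50.0
```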
Strat’s decision apparatus [UAI 1990]
Strat's decision apparatus is based on computing intervals of expected values
it assumes that the decision frame Ω is itself a set of scalar values (e.g. dollar returns), and does not distinguish between utilities and elements of Ω
.. so that an expected value interval can be computed: E(Ω) = [E_*(Ω), E^*(Ω)], where
E_*(Ω) = Σ_{A⊆Ω} inf(A) m(A),   E^*(Ω) = Σ_{A⊆Ω} sup(A) m(A)
this is not good enough to make a decision on its own, e.g.: should we pay a 6$ ticket to play when the expected interval is [5$, 8$]?
Strat's decision apparatus: A probability of favourable outcome
Strat identifies ρ as the probability that the value assigned to the hidden sector is the one the player would choose; 1 − ρ is the probability that the sector is chosen by the carnival hawker
Theorem. The expected value of the mass function of the wheel is E(Ω) = E_*(Ω) + ρ · (E^*(Ω) − E_*(Ω))
to decide whether to play the game we only need to assess ρ
basically, this amounts to a specific probability transform (like the pignistic one); Lesh [1986] had also proposed a similar approach
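A toy numerical check of the interval and of Strat's expected value, on an assumed wheel whose frame is a small set of dollar returns:

```python
# Masses on subsets of returns are illustrative, not from Strat's paper.
m = {frozenset({1}): 0.4, frozenset({5, 8}): 0.4, frozenset({1, 5, 8}): 0.2}
lower = sum(min(A) * v for A, v in m.items())   # E_*(Ω) = 2.6
upper = sum(max(A) * v for A, v in m.items())   # E^*(Ω) = 5.2
rho = 0.5   # assumed probability that the hidden sector favours the player
print(lower, upper, lower + rho * (upper - lower))   # 2.6 5.2 3.9
```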
Savage's axioms
Savage proposed 7 axioms, 4 of which are considered meaningful (the others are rather technical); let us examine the first two:
Axiom 1: ≽ is a total preorder (complete, reflexive and transitive)
Axiom 2 [Sure Thing Principle]: given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by
(fEh)(ω) = f(ω) if ω ∈ E,   (fEh)(ω) = h(ω) if ω ∉ E
then the Sure Thing Principle states that ∀E, ∀f, g, h, h′: fEh ≽ gEh ⇒ fEh′ ≽ gEh′
this axiom seems reasonable, but it is not verified empirically!
Ellsberg's paradox
suppose an urn contains 30 red balls and 60 balls that are either black or yellow. Consider the following gambles:
- f1: you receive 100 euros if you draw a red ball
- f2: you receive 100 euros if you draw a black ball
- f3: you receive 100 euros if you draw a red or yellow ball
- f4: you receive 100 euros if you draw a black or yellow ball
in this example Ω = {R, B, Y}, fi : Ω → R and X = R; the four acts are the mappings:

        R     B     Y
  f1    100   0     0
  f2    0     100   0
  f3    100   0     100
  f4    0     100   100

empirically, it is observed that most people strictly prefer f1 to f2, but strictly prefer f4 to f3
now, pick E = {R, B}: by definition
f1{R, B}0 = f1,   f2{R, B}0 = f2,   f1{R, B}100 = f3,   f2{R, B}100 = f4
since f1 ≻ f2, i.e. f1{R, B}0 ≻ f2{R, B}0, the Sure Thing Principle would imply f1{R, B}100 ≻ f2{R, B}100, i.e. f3 ≻ f4
empirically, the Sure Thing Principle is violated!
Gilboa's theorem
Gilboa (1987) proposed a modification of Savage's axioms with, in particular, a weaker form of Axiom 2
a preference relation ≽ meets these weaker requirements iff there exist a (not necessarily additive) measure µ and a utility function u : X → R such that
∀f, g ∈ F:   f ≽ g ⇔ Cµ(u ◦ f) ≥ Cµ(u ◦ g),
where Cµ is the Choquet integral, defined for X : Ω → R as
Cµ(X) = ∫_0^{+∞} µ(X(ω) ≥ t) dt + ∫_{−∞}^0 [µ(X(ω) ≥ t) − 1] dt
given a belief function Bel on Ω and a utility function u, this theorem supports making decisions based on the Choquet integral of u with respect to Bel or Pl
Lower and upper expected utilities
for finite Ω, it can be shown that
C_Bel(u ◦ f) = Σ_{B⊆Ω} m(B) min_{ω∈B} u(f(ω)),   C_Pl(u ◦ f) = Σ_{B⊆Ω} m(B) max_{ω∈B} u(f(ω))
let P(Bel) as usual be the set of probability measures P compatible with Bel, i.e., such that Bel ≤ P. Then it can be shown that
C_Bel(u ◦ f) = min_{P∈P(Bel)} E_P(u ◦ f) = E_*(u ◦ f),   C_Pl(u ◦ f) = max_{P∈P(Bel)} E_P(u ◦ f) = E^*(u ◦ f)
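For finite Ω the two Choquet integrals reduce to the min/max-weighted sums above; a minimal sketch (mass and utilities are illustrative):

```python
def lower_upper_expectation(m, utility):
    """m: mass dict (frozenset -> mass); utility: dict ω -> u(f, ω)."""
    lower = sum(v * min(utility[w] for w in B) for B, v in m.items())
    upper = sum(v * max(utility[w] for w in B) for B, v in m.items())
    return lower, upper

m = {frozenset({"Y"}): 0.458, frozenset({"N"}): 0.458,
     frozenset({"Y", "N"}): 0.084}
print(lower_upper_expectation(m, {"Y": 100.0, "N": 0.0}))   # (45.8, 54.2)
```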
Decision making: Strategies
for each act f we now have two expected utilities, E_*(f) and E^*(f): how do we make a decision?
possible decision criteria are based on interval dominance between the intervals [E_*(f), E^*(f)] of the available acts
Challenges
Efficient computation
Combining simple support functions
assume the input BFs are simple support functions, with focal elements only some A or Ω: we have, for each ω ∈ Ω, a Belω whose focal elements are only {ω}, ω̄ and Ω, and we want to combine them all
this uses the fact that the plausibility of the combined BF is a function of the input BFs' commonalities Q(A) = Σ_{B⊇A} m(B):
Pl(A) = Σ_{∅≠B⊆A} (−1)^{|B|+1} ∏_{ω∈Ω} Qω(B)
we get that
Pl(A) = K ( 1 + Σ_{ω∈A} Belω(ω) / (1 − Belω(ω)) − ∏_{ω∈A} Belω(ω̄) / (1 − Belω(ω)) )
the computation of a specific plausibility value Pl(A) is linear in the size of Ω (only the elements of A, and not its subsets, are involved)
however, the number of events A themselves is still exponential
Gordon and Shortliffe's scheme: based on diagnostic trees
they are interested in computing degrees of belief only for events forming a hierarchy (a diagnostic tree); in some applications certain events are not relevant, e.g. classes of diseases
the scheme combines simple support functions focused on the nodes or on their complements
it produces good approximations, unless the evidence is highly conflicting
Gordon and Shortliffe's scheme: based on diagnostic trees
however, intersections of complements produce focal elements that are not in the tree
the approximate algorithm:
1. first, combine all simple functions focussing on the node events (by Dempster's rule)
2. then, working down the tree, successively combine those focused on the complements of the nodes
3. tricky bit: when doing so, replace each intersection of focal elements with the smallest node in the tree that contains it
the result depends on the order of the combinations in phase 2
again, the approximation can be poor; also, no degrees of belief are assigned to the complements of the nodes, and therefore we cannot compute their plausibilities!
A simple Monte-Carlo approach to Dempster's combination - Wilson, 1989
we seek Bel = Bel1 ⊕ ... ⊕ Belm on Ω, where the evidence is induced by probability distributions Pi on Ci via Γi : Ci → 2^Ω
the Monte-Carlo algorithm simulates the random set interpretation of belief functions: Bel(A) = P(Γ(c) ⊆ A | Γ(c) ≠ ∅)

for a large number of trials n = 1:N do
    randomly pick c ∈ C such that Γ(c) ≠ ∅:
        for i = 1:m do
            randomly pick an element ci of Ci with probability Pi(ci)
        end for
        let c = (c1, ..., cm)
        if Γ(c) = ∅ then restart the trial
    if Γ(c) ⊆ A then the trial succeeds, T = 1
end for
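A runnable sketch of this simulation, with each source given as a mass dict from which focal elements are sampled (the setup and names are ours); note how the rejection step makes the cost grow with the amount of conflict:

```python
import random

def mc_bel(masses, A, trials=50_000):
    """Estimate the combined Bel(A) = P(Γ(c) ⊆ A | Γ(c) ≠ ∅) by sampling."""
    successes = valid = 0
    while valid < trials:
        picks = [random.choices(list(m), weights=list(m.values()))[0]
                 for m in masses]
        gamma = frozenset.intersection(*picks)
        if not gamma:
            continue            # Γ(c) = ∅: restart the trial
        valid += 1
        successes += gamma <= A
    return successes / valid

m1 = {frozenset({"Y"}): 0.72, frozenset({"N"}): 0.08, frozenset({"Y", "N"}): 0.2}
m2 = {frozenset({"Y"}): 0.08, frozenset({"N"}): 0.72, frozenset({"Y", "N"}): 0.2}
print(mc_bel([m1, m2], frozenset({"Y"})))   # ≈ 0.458, cf. the fusion example
```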
A Monte-Carlo approach - Wilson, 1989
the proportion of trials which succeed converges to Bel(A): E[T̄] = Bel(A), Var[T̄] ≤ 1/(4N)
we say the algorithm has accuracy k if 3σ[T̄] ≤ k
picking c ∈ C involves m random numbers, so it takes time A·m for some constant A; testing whether xj ∈ Γ(c) takes less than B·m, for some constant B
the expected time of the algorithm is (N·m / (1 − κ)) · (A + B|Ω|), where κ is Shafer's conflict measure
the expected time to achieve accuracy k is then (9m / (4(1 − κ)k²)) · (A + C|Ω|) for a constant C; it is better for simple support functions
conclusion: unless κ is close to 1 (highly conflicting evidence), Dempster's combination is feasible for large values of m (the number of BFs to combine) and large Ω (the hypothesis space)
Markov-Chain Monte-Carlo - Wilson and Moral, 1996
trials are not independent but form a Markov chain
the non-deterministic OPERATIONi changes at most the i-th coordinate c′(i) of c′ to y, with chance Pi(y):
Pr(OPERATIONi(c′) = c) ∝ Pi(c(i)) if c(j) = c′(j) for all j ≠ i, and 0 otherwise
the MCMC algorithm returns a value BELN(c0), the proportion of time in which Γ(cc) ⊆ X:

cc = c0; S = 0
for n = 1:N do
    for i = 1:m do
        cc = OPERATIONi(cc)
        if Γ(cc) ⊆ X then S = S + 1
    end for
end for
return S / (N·m)
Importance sampling - Wilson and Moral, 1996
Theorem. If C is connected (i.e., any c, c′ are linked by a chain of OPERATIONi), then given ε, δ there exist K0, N0 such that for all K ≥ K0, N ≥ N0 and c0: Pr(|BELNK(c0) − Bel(X)| < ε) ≥ 1 − δ
a further step is importance sampling: pick samples c¹, ..., c^N according to an 'easy to handle' probability distribution P∗, and assign to each sample a weight wi = P(c^i) / P∗(c^i)
if P(c) > 0 implies P∗(c) > 0, then the average
( Σ_{Γ(c^i)⊆X} wi ) / N
is an unbiased estimator of Bel(X)
one should try to use a P∗ as close as possible to the real P; strategies are proposed to compute P(C) = Σ_c P(c)
Efficient implementation: a summary
do belief functions have a problem with computational complexity? the answer is: only if naively implemented
does Bayesian inference on graphical models have computational issues? YES, it is NP-hard; even approximate inference is NP-hard
that was solved by Monte-Carlo methods, and the same holds for belief inference: we decide how many samples we want to use for the approximation, and go for it
the point is not assigning mass values to all the subsets out there in these huge spaces, but being allowed to assign mass to a subset when that is the right thing to do!
Challenges
Belief functions on reals
Continuous formulations of the theory of belief functions
in the original formulation by Shafer [1976], belief functions are defined on finite sets only
the need to generalise this to arbitrary domains was recognised at an early stage
the main approaches to a continuous formulation presented here:
- Shafer's allocations of probability [1982]
- belief functions as random sets [Nguyen]
- belief functions on Borel intervals of the real line [Strat, Smets]
other approaches, with limited impact (so far):
- generalised evidence theory
- MV algebras
- several others
Allocations of probability - Shafer, 1979
every belief function can be represented as an allocation of probability, i.e., a ∩-homomorphism into a positive and completely additive probability algebra (deduced from the integral representation due to Choquet)
- for every belief function Bel defined on a class of events E ⊆ 2^Ω there exist a complete Boolean algebra M, a positive measure µ and an allocation of probability ρ between E and M such that Bel = µ ◦ ρ
two regularity conditions for a belief function over an infinite domain are considered: continuity and condensability
canonical continuous extensions of belief functions to arbitrary power sets can be introduced by allocation of probability
the approach shows significant resemblance to the notions of inner measure and extension of capacities [Honda]
Continuity and condensability - Shafer's allocations of probability
E ⊆ 2^Θ is a multiplicative subclass of 2^Θ if A ∩ B ∈ E for all A, B ∈ E
a function Bel : E → [0, 1] such that Bel(∅) = 0, Bel(Θ) = 1 and Bel is monotone of order ∞ is a belief function
- likewise, an upper probability (plausibility) function is alternating of order ∞ (with ≥ exchanged with ≤)
a BF on 2^Θ is continuous if Bel(∩_i Ai) = lim_{i→∞} Bel(Ai) for every decreasing sequence of sets Ai; a BF on a multiplicative subclass E is continuous if it can be extended to a continuous one on 2^Θ
- continuity arises from partial beliefs on 'objective' probabilities
a BF on 2^Θ is condensable if Bel(∩A) = inf_{A∈A} Bel(A) for every downward net A in 2^Θ; a BF on a multiplicative subclass E is condensable if it can be extended to a condensable one on 2^Θ
- a downward net is such that, given two elements, there is always an element that is a subset of their intersection
condensability is restrictive, but related to Dempster's rule
Choquet's representation - Shafer's allocations of probability
Choquet's integral representation says that every belief function can be represented by an allocation of probability
r : E → F is a ∩-homomorphism if it preserves ∩
Choquet's theorem: for every BF Bel on a multiplicative subclass E of 2^Θ, there exist a set X, an algebra F of its subsets, a finitely additive probability measure µ on F, and a ∩-homomorphism r : E → F such that Bel = µ ◦ r
if we replace the measure space (X, F, µ) with a probability algebra (a complete Boolean algebra M with a completely additive probability measure µ) we get:
Allocation of probability: for every BF Bel on a multiplicative subclass E of 2^Θ, there exists an allocation of probability ρ : E → M such that Bel = µ ◦ ρ
the non-zero elements of M can be thought of as focal elements
Canonical extension

Theorem
A BF on a multiplicative subclass E can always be extended to a belief function on 2^Θ by canonical extension:
Bel(A) = sup { Σ_{∅ ≠ I ⊆ {1,...,n}} (−1)^{|I|+1} Bel(∩_{i∈I} A_i) : n ≥ 1, A_1, ..., A_n ∈ E, A_i ⊆ A }
- the proof is based on the existence of an allocation for the extension
- note the similarity with the superadditivity axiom
- also related to inner measures, which provide approximate belief values for subsets not in a σ-algebra
- Bel is the minimal such extension
- what about evidence combination? condensability ensures that the Boolean algebra M represents intersection properly for arbitrary (not just finite) collections B of subsets:
  ρ(∩B) = ∧_{B∈B} ρ(B) ∀B ⊂ 2^Ω
- this allows us to imagine Dempster's combinations of infinitely many belief functions
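To make the sup concrete, here is a minimal Python sketch on a finite frame (the dict-based encoding of Bel and E is my own illustration, not from the tutorial); over a finite subclass the sup is attained on subsets of the elements of E contained in A, since repeating a set never increases the inclusion-exclusion sum.

```python
from itertools import combinations

def canonical_extension(bel, E, A):
    """Shafer's canonical extension of bel (dict: frozenset -> value),
    defined on a multiplicative subclass E, to an arbitrary subset A."""
    inside = [B for B in E if B <= A]     # elements of E contained in A
    best = 0.0
    for n in range(1, len(inside) + 1):
        for coll in combinations(inside, n):
            s = 0.0
            for k in range(1, n + 1):
                for I in combinations(coll, k):
                    inter = frozenset.intersection(*I)   # stays in E
                    s += (-1) ** (k + 1) * bel.get(inter, 0.0)
            best = max(best, s)
    return best

# hypothetical multiplicative subclass and belief values
E = [frozenset({1}), frozenset({1, 2})]
bel = {frozenset({1}): 0.3, frozenset({1, 2}): 0.5}
print(canonical_extension(bel, E, frozenset({1, 2, 3})))   # 0.5
```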
Continuous belief functions (Strat's approach)
- idea: take a real interval I and split it into N bits
- take as frame of discernment the set of possible intervals with these extremes: [0, 1), [0, 2), [1, 4], etc.
- a belief function there has ∼ N²/2 possible focal elements, so that its mass lives on a triangle, and one can compute belief and plausibility by integration
Continuous belief functions (Strat's approach, continued)
- this trivially generalises to arbitrary subintervals of I:
  Bel([a, b]) = ∫_a^b ∫_x^b m(x, y) dy dx,  Pl([a, b]) = ∫_0^b ∫_{max(a,x)}^N m(x, y) dy dx
- Dempster's rule generalises as
  Bel1 ⊕ Bel2([a, b]) = (1/K) ∫_0^a ∫_b^N [ m1(x, b) m2(a, y) + m2(x, b) m1(a, y) + m1(a, b) m2(x, y) + m2(a, b) m1(x, y) ] dy dx
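A minimal numerical sketch of the two integrals above, using scipy; the mass density m(x, y) below (uniform over the triangle 0 ≤ x ≤ y ≤ N) is a hypothetical example, not from the tutorial.

```python
from scipy.integrate import dblquad

N = 10.0  # length of the interval I = [0, N]

def m(x, y):
    """Hypothetical mass density over intervals [x, y]:
    uniform over the triangle 0 <= x <= y <= N (integrates to 1)."""
    return 2.0 / N**2 if x <= y else 0.0

def bel(a, b):
    # Bel([a,b]): mass of intervals [x,y] contained in [a,b]
    val, _ = dblquad(lambda y, x: m(x, y), a, b, lambda x: x, lambda x: b)
    return val

def pl(a, b):
    # Pl([a,b]): mass of intervals [x,y] intersecting [a,b]
    val, _ = dblquad(lambda y, x: m(x, y), 0.0, b,
                     lambda x: max(a, x), lambda x: N)
    return val

print(bel(2.0, 6.0), pl(2.0, 6.0))   # 0.16, 0.8: Bel <= Pl as expected
```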
Continuous belief functions on the Borel algebra of intervals
- a pretty much identical approach is followed by Smets
- allows us to define a continuous pignistic PDF as
  Bet(a) = lim_{ε→0} ∫_0^a ∫_{a+ε}^1 [ m(x, y) / (y − x) ] dy dx
- can be easily extended to the real line, by considering belief functions defined on the Borel σ-algebra of subsets of R generated by the collection I of closed intervals
- the theory provides a way of building a continuous belief function from a pignistic density, by applying the least commitment principle and assuming unimodal pignistic PDFs:
  Bel(s) = −(s − s̄) dBet(s)/ds
  where s̄ is such that Bet(s) = Bet(s̄)
- example: Bet(x) = N(x; µ, σ) is normal → Bel(y) = (2y²/√(2π)) e^{−y²/2}, where y = (x − µ)/σ
Continuous belief functions induced by random closed intervals
- formal setting: let (U, V) be a two-dimensional random variable from (C, A, P) to (R², B(R²)) such that P(U ≤ V) = 1 and Γ(c) = [U(c), V(c)] ⊆ R
[figure: the multivalued mapping Γ sends each c ∈ (C, A, P) to the closed interval [U(c), V(c)] of the real line]
- this setting defines a random closed interval, which induces a belief function on (R, B(R)) defined by
  Bel(A) = P([U, V] ⊆ A), ∀A ∈ B(R)
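A Monte-Carlo sketch of this construction; the particular law of (U, V) below (normal centre, exponential half-width) is a hypothetical choice made only so that P(U ≤ V) = 1 holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical random closed interval: centre ~ N(0,1), half-width ~ Exp(1)
n = 100_000
centre = rng.normal(0.0, 1.0, n)
half = rng.exponential(1.0, n)
U, V = centre - half, centre + half     # U <= V always

def bel(a, b):
    # Bel([a,b]) = P([U,V] is contained in [a,b])
    return np.mean((U >= a) & (V <= b))

def pl(a, b):
    # Pl([a,b]) = P([U,V] intersects [a,b])
    return np.mean((U <= b) & (V >= a))

print(bel(-1, 1), pl(-1, 1))   # Bel <= Pl
```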
Special cases of random closed intervals
[figure: two panels; left, a consonant random interval induced by a possibility distribution π(x), with nested intervals Γ(c) = [U(c), V(c)] cut at level c ∈ [0, 1]; right, a p-box (F_*, F*) inducing intervals [U(c), V(c)] from the upper and lower CDF bounds]
- special cases:
  - a fuzzy set on the real line induces a mapping to a collection of nested intervals, parameterised by the level c
  - a p-box, i.e., upper and lower bounds to a cumulative distribution function (see later), also induces a family of intervals
From Boolean algebras to MV algebras
- study belief functions in a setting more general than Boolean algebras of events
- inspired by the generalisation of classical probability towards "many-valued" events, such as those resulting from formulas in Łukasiewicz infinite-valued logic
- an algebra of such many-valued events is called an MV algebra
- upper/lower probabilities and possibility measures can also be defined on MV algebras
MV algebra: Definition

MV algebra
An algebra ⟨M, ⊕, ¬, 0⟩ with a binary operation ⊕, a unary operation ¬ and a constant 0 such that ⟨M, ⊕, 0⟩ is an abelian monoid and the following equations hold for every f, g ∈ M:
¬¬f = f,  f ⊕ ¬0 = ¬0,  ¬(¬f ⊕ g) ⊕ g = ¬(¬g ⊕ f) ⊕ f
- we define 1 = ¬0, f ⊙ g = ¬(¬f ⊕ ¬g), and f ≤ g iff ¬f ⊕ g = 1
- the inf and sup so defined, f ∨ g = ¬(¬f ⊕ g) ⊕ g and f ∧ g = ¬(¬f ∨ ¬g), make ⟨M, ∨, ∧, 0, 1⟩ a distributive lattice
- example: the standard MV algebra is the real interval [0, 1] equipped with f ⊕ g = min(1, f + g), ¬f = 1 − f and f ⊙ g = max(0, f + g − 1)
  - in this case ⊙ and ⊕ are known as the Łukasiewicz t-norm and t-conorm
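As a sanity check, here is a minimal Python sketch of the standard MV algebra on [0, 1]; the random spot-check of the axioms is illustrative only.

```python
import random

# the standard MV algebra on [0,1] (Lukasiewicz operations)
def oplus(f, g):   return min(1.0, f + g)          # t-conorm
def neg(f):        return 1.0 - f
def otimes(f, g):  return max(0.0, f + g - 1.0)    # t-norm
def join(f, g):    return oplus(neg(oplus(neg(f), g)), g)   # f v g = max(f, g)
def meet(f, g):    return neg(join(neg(f), neg(g)))         # f ^ g = min(f, g)

# spot-check the MV axioms on random elements
for _ in range(1000):
    f, g = random.random(), random.random()
    assert abs(neg(neg(f)) - f) < 1e-12
    assert oplus(f, neg(0.0)) == 1.0
    lhs = oplus(neg(oplus(neg(f), g)), g)
    rhs = oplus(neg(oplus(neg(g), f)), f)
    assert abs(lhs - rhs) < 1e-12              # both equal max(f, g)
    assert abs(join(f, g) - max(f, g)) < 1e-12
    assert abs(meet(f, g) - min(f, g)) < 1e-12
```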
States as generalisations of finite probabilities on MV algebras
- Boolean algebras are also a special case, with ⊕, ⊙ and ¬ being union, intersection and complement
- semisimple algebras: isomorphic to continuous functions onto [0, 1] on some compact Hausdorff space; they can be viewed as many-valued counterparts of algebras of sets
- a totally monotone function b : M → [0, 1] can be defined on an MV algebra, by replacing ∪ with ∨ and ⊂ with ≤
- a state is a mapping s : M → [0, 1] such that s(1) = 1 and s(f ⊕ g) = s(f) + s(g) whenever f ⊙ g = 0 (a generalisation of a finitely additive probability measure)
- states on semisimple MV algebras are integrals with respect to a Borel probability measure on the Hausdorff space: s(f) = ∫ f dµ for each f ∈ M
Belief functions on MV algebras
- consider the MV-algebra [0,1]^{P(X)} of all functions P(X) → [0,1], where X is finite
- let ρ : [0,1]^X → [0,1]^{P(X)} be defined as
  ρ(f)(B) = min{f(x) : x ∈ B} for B ≠ ∅,  ρ(f)(B) = 1 otherwise
- if f = 1_A (the indicator function of event A) then ρ(1_A)(B) = 1 iff B ⊆ A, and we can rewrite Bel(A) = m(ρ(1_A)), where m is defined on collections of events

Definition
b : [0,1]^X → [0,1] is a belief function on [0,1]^X if there is a state s on the MV-algebra [0,1]^{P(X)} such that s(1_∅) = 0 and b(f) = s(ρ(f)), for every f ∈ [0,1]^X. The state s is called a state assignment.
- such belief functions take values on [0,1]-valued functions of X (events are a special case)
- state assignment → probability measure on Ω in the random set interpretation
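On a finite frame, one concrete reading of b(f) = s(ρ(f)) weights the minimum of f over each focal element by a mass assignment, so that b reduces to Bel(A) when f is an indicator; a minimal sketch (the frame, masses and fuzzy event below are hypothetical).

```python
# belief of a fuzzy event f on a finite frame X, given a mass assignment m:
# b(f) = sum_B m(B) * min_{x in B} f(x)   (reduces to Bel(A) when f = 1_A)
X = {'a', 'b', 'c'}
m = {frozenset({'a'}): 0.3,
     frozenset({'a', 'b'}): 0.5,
     frozenset(X): 0.2}                 # hypothetical mass assignment

def b(f):
    return sum(mass * min(f[x] for x in B) for B, mass in m.items())

fuzzy = {'a': 0.9, 'b': 0.4, 'c': 0.1}          # a fuzzy event
indicator_ab = {'a': 1.0, 'b': 1.0, 'c': 0.0}   # f = 1_{a,b}
print(b(fuzzy))            # belief of the fuzzy event
print(b(indicator_ab))     # = Bel({a,b}) = 0.3 + 0.5 = 0.8
```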
Properties of belief functions on MV algebras
- there is an integral representation of such belief functions by the Choquet integral
- strong connection with BFs on fuzzy sets
- all standard properties of classical BFs are met (e.g. superadditivity)
- the set of such belief functions on [0,1]^X is a simplex, whose extreme points correspond to the generalisation of categorical BFs
- can be extended to infinite spaces
Belief functions as random sets: Rationale
- given a multi-valued mapping Γ, a straightforward step is to consider the probability value P(c) as attached to the subset Γ(c) ⊆ Ω
- what we obtain is a random set in Ω, i.e., a probability measure on a collection of subsets; roughly speaking, a random set is a set-valued random variable
- the degree of belief Bel(A) of an event A becomes the cumulative distribution function (CDF) of the interval of sets {B ⊆ A}
- this approach has been emphasised in particular by [Nguyen, 1978], [Hestir, 1991] and [Shafer, 1987]
- example: a die where one or more faces are covered, so that we do not know what is beneath, is a random variable which 'spits out' subsets of possible outcomes: a random set
Belief functions as random sets: Mathematics
- the lower inverse of Γ is defined as
  Γ∗(A) := {c ∈ C : Γ(c) ⊆ A, Γ(c) ≠ ∅}
  while its upper inverse is
  Γ*(A) := {c ∈ C : Γ(c) ∩ A ≠ ∅}
- given two σ-fields A, B on C, Ω respectively, Γ is said to be strongly measurable iff ∀B ∈ B, Γ*(B) ∈ A
- the lower probability measure on B is defined as P∗(B) := P(Γ∗(B)) for all B ∈ B; this is nothing but a belief function!
- Nguyen proved that, if Γ is strongly measurable, the CDF P̂ of the random set coincides with the lower probability measure:
  P̂[I(B)] = P∗(B) ∀B ∈ B, where I(B) := {C ∈ B : C ⊆ B}
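A finite sketch of this setting: a probability on C pushed through a multivalued mapping gives the lower (belief) and upper (plausibility) measures via the two inverses; the source probability and mapping below are hypothetical.

```python
# probability on the source space C, and a multivalued mapping Gamma : C -> 2^Theta
P = {'c1': 0.5, 'c2': 0.3, 'c3': 0.2}
Gamma = {'c1': frozenset({1}),
         'c2': frozenset({1, 2}),
         'c3': frozenset({2, 3})}

def bel(A):
    # lower inverse: sources whose (non-empty) image is contained in A
    return sum(P[c] for c, G in Gamma.items() if G and G <= A)

def pl(A):
    # upper inverse: sources whose image intersects A
    return sum(P[c] for c, G in Gamma.items() if G & A)

A = frozenset({1, 2})
print(bel(A), pl(A))   # 0.8, 1.0
```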
Random sets to extend belief functions to arbitrary domains
- the notion of condensability has been studied by Nguyen for upper probabilities generated by random sets too [Nguyen 1978]
- efforts directed at a general theory on arbitrary domains
- for finite random sets (i.e. with a finite number of focal elements), under independence of variables, Dempster's rule can be applied:
  (F, m) = { A_{i_1,...,i_d} = ×_{j=1}^d A_{i_j} },  m_{i_1,...,i_d} = m_{i_1} · ... · m_{i_d}
- for dependent sources, Fetz and Oberguggenberger have proposed an 'unknown interaction' model; for infinite random sets, Alvarez (see p-boxes later) a Monte-Carlo sampling method
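A small sketch of this product construction for independent finite random sets (frames and masses are hypothetical); each joint focal element is stored as the tuple of its factors, and masses multiply.

```python
from itertools import product

def joint(*random_sets):
    """Joint random set of independent finite random sets;
    each argument is a dict frozenset -> mass on its own frame."""
    out = {}
    for combo in product(*(rs.items() for rs in random_sets)):
        focal = tuple(A for A, _ in combo)   # product focal element (as factors)
        mass = 1.0
        for _, m in combo:
            mass *= m
        out[focal] = out.get(focal, 0.0) + mass
    return out

rs1 = {frozenset({'H'}): 0.6, frozenset({'H', 'T'}): 0.4}
rs2 = {frozenset({0}): 0.5, frozenset({0, 1}): 0.5}
print(joint(rs1, rs2))   # four product focal elements, masses sum to 1
```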
Belief functions as random sets: Molchanov's work
- random set theory has been much studied by Molchanov [2006], who developed a theory of calculus with capacities and random sets:
  - Radon-Nikodym theorems for capacities and random sets (see Horizons) and derivatives of capacities
  - (conditional) expectations of random sets
  - limit theorems: strong law of large numbers, central limit theorem, Gaussian random sets
  - set-valued random processes
Belief functions on reals: State of the art
- the most popular extension, to closed intervals, proposed by Strat and Smets, gave birth to what are called 'continuous belief functions'
- after an initial effort by Nguyen and others, random sets had been rather neglected
- recently, strong renewed interest in a theory of random sets, thanks to Molchanov and others: a strong and powerful mathematical framework!
- the way forward for the theory, in my view; no treatment of conditioning and combination yet
New horizons
A research programme
- we made the case that non-additive probabilities arise from real issues with the way standard probability models the data (or its absence)
- we showed that random sets are the most natural representation of uncertainty; they are also a straightforward generalisation of mathematical statistics
- how should the theory develop? some modest proposals:
  - generalised logistic regression for dealing with rare events
  - parameterised families of random sets ..
  - .. which would allow frequentist hypothesis testing ..
  - .. and MAP-like estimation ..
  - in particular, Gaussian random sets ..
  - .. and how the central limit theorem generalises to RS
  - generalising the total probability theorem ..
  - .. and the concept of random variable
- where can its full impact be felt?
  - new, robust foundations for machine learning
  - a novel understanding of quantum mechanics
  - robust models of climatic change
- a geometry of uncertainty as a general framework for uncertainty theory
Upper and lower likelihood
Belief likelihood function: Generalising the sample likelihood
- the traditional likelihood function is a conditional probability of the data given a parameter θ ∈ Θ, i.e. a family of PDFs over X parameterised by θ
- different take: instead of using the conventional likelihood to build a belief function, can we define a 'belief likelihood function' of a sample x ∈ X?
- it is natural to define a belief (set-)likelihood function as a family of belief functions on X, Bel_X(·|θ), parameterised by θ ∈ Θ
  - this is the input of Smets' Generalised Bayesian Theorem, a collection of 'conditional' belief functions
- note that a belief likelihood takes values on sets of outcomes; individual outcomes are a special case
- seems a natural setting for computing likelihoods of set-valued observations → coherent with the random set philosophy
Belief likelihood function: Series of trials
- what can we say about the belief likelihood function of a series of trials?
- observations are a tuple x = (x_1, ..., x_n) ∈ X_1 × · · · × X_n, where X_i = X denotes the space of quantities observed at time i
- by definition the belief likelihood function is Bel_{X_1×···×X_n}(A|θ), where A is any subset of X_1 × · · · × X_n

Belief likelihood function of repeated trials
Bel_{X_1×···×X_n}(A|θ) := [ Bel_{X_1}^{↑×_i X_i} ⊛ · · · ⊛ Bel_{X_n}^{↑×_i X_i} ](A|θ)
where Bel_{X_j}^{↑×_i X_i} is the vacuous extension of Bel_{X_j} to the Cartesian product X_1 × · · · × X_n where the observed tuples live, and ⊛ is a combination rule.
Belief likelihood function: Series of trials, individual tuples
- can we reduce this to the belief values of the individual trials? yes, if we wish to compute likelihood values of tuples of individual outcomes rather than sets of them

Decomposition for individual tuples
When using either the conjunctive combination ∩ or Dempster's ⊕ as the combination rule in the definition of the belief likelihood function, the following holds:
L(x = {x_1, ..., x_n}) := Bel_{X_1×···×X_n}({(x_1, ..., x_n)}|θ) = ∏_{i=1}^n Bel_{X_i}(x_i)
L̄(x = {x_1, ..., x_n}) := Pl_{X_1×···×X_n}({(x_1, ..., x_n)}|θ) = ∏_{i=1}^n Pl_{X_i}(x_i)

- we can call these the lower and upper likelihoods of the sample x = {x_1, ..., x_n}
- the second line amounts to conditional conjunctive independence (but just for individual samples x)
- new result, yet unpublished; similar regularities hold when using the more cautious disjunctive combination ∪
- open question: does this hold for arbitrary subsets of samples A ⊂ X_1 × · · · × X_n?
Lower and upper likelihoods: Bernoulli trials
- let us go back to the Bernoulli trials example: X_i = X = {H, T}
- under conditional independence and equidistribution, the traditional likelihood for a series of Bernoulli trials reads as p^k (1 − p)^{n−k}, where k is the number of successes and n the number of trials
- let us compute the belief likelihood function for Bernoulli trials! we seek the belief function on X = {H, T}, parameterised by p = m(H), q = m(T) (with p + q ≤ 1 this time), which best describes the observed sample
- if we apply the previous result, since all the Bel_i are equally distributed, the lower and upper likelihoods of the sample x = {x_1, ..., x_n} are:
  L(x) = Bel_X({x_1}) · ... · Bel_X({x_n}) = p^k q^{n−k}
  L̄(x) = Pl_X({x_1}) · ... · Pl_X({x_n}) = (1 − q)^k (1 − p)^{n−k}
- after normalisation, these are PDFs over the space B of all belief functions definable on X!
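A small numerical sketch of these two functions (the counts k, n are made up); a grid search over the simplex p + q ≤ 1 of belief functions on {H, T} locates the two maxima discussed on the next slide.

```python
import numpy as np

k, n = 6, 10   # hypothetical observed successes / trials

def lower_L(p, q): return p**k * q**(n - k)
def upper_L(p, q): return (1 - q)**k * (1 - p)**(n - k)

ps = np.linspace(0, 1, 201)
best_lo, best_hi = (0, 0, -1.0), (0, 0, -1.0)
for p in ps:
    for q in ps[ps <= 1 - p + 1e-12]:      # stay inside the simplex
        lo, hi = lower_L(p, q), upper_L(p, q)
        if lo > best_lo[2]: best_lo = (p, q, lo)
        if hi > best_hi[2]: best_hi = (p, q, hi)

print(best_lo[:2])   # ~ (k/n, 1-k/n): the classical ML estimate, on p+q = 1
print(best_hi[:2])   # (0, 0): the vacuous belief function
```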
Lower and upper likelihoods (Bernoulli trials)
- the lower likelihood (left) reduces to the traditional likelihood p^k (1 − p)^{n−k} for p + q = 1
- the maximum of the lower likelihood is the traditional ML estimate
  - makes sense: the lower likelihood is highest for the most committed belief functions (i.e. probabilities)
- the upper likelihood (right) has its maximum in p = q = 0 (the vacuous BF on {H, T})
- the interval of BFs joining max L with max L̄ is the set of belief functions such that p/q = k/(n−k), those which preserve the ratio between the empirical counts
- once again the maths leads us to think in terms of intervals of belief functions, rather than individual ones
Generalising logistic regression and rare events
Generalising logistic regression (1)
- Bernoulli trials are central in statistics: generalising their likelihood allows us to represent uncertainty in a number of regression problems
- in logistic regression
  π_i = P(Y_i = 1|x_i) = 1 / (1 + e^{−(β_0 + β_1 x_i)}),  1 − π_i = P(Y_i = 0|x_i) = e^{−(β_0 + β_1 x_i)} / (1 + e^{−(β_0 + β_1 x_i)})   (19)
- the parameters β_0, β_1 are estimated by maximum likelihood of the sample, where
  L(β_0, β_1|Y) = ∏_{i=1}^n π_i^{Y_i} (1 − π_i)^{1−Y_i}
  with Y_i ∈ {0, 1} and π_i a function of β_0, β_1, yielding a single conditional PDF
- as in the Bernoulli series experiment, we can replace the conditional probability (π_i, 1 − π_i) on X = {0, 1} with a belief function there
Generalising logistic regression (2)
- upper and lower likelihoods can then be computed as
  L(β|Y) = ∏_{i=1}^n π_i^{Y_i} q_i^{1−Y_i},  L̄(β|Y) = ∏_{i=1}^n (1 − q_i)^{Y_i} (1 − π_i)^{1−Y_i}
  where this time the Bel_i are not equally distributed
- how do we generalise the logit link between observations x and outputs y? just assuming (19) does not yield any analytical dependency for q_i
- first simple proposal: add a parameter β_2 such that
  q_i = m(Y_i = 0|x_i) = β_2 e^{−(β_0 + β_1 x_i)} / (1 + e^{−(β_0 + β_1 x_i)})   (20)
- we can then find lower and upper optimal estimates for the parameters β:
  arg max_β L ↦ (β̲_0, β̲_1, β̲_2),  arg max_β L̄ ↦ (β̄_0, β̄_1, β̄_2)
- plugging these optimal parameters into (19), (20) yields an upper and a lower family of conditional belief functions given x (again an interval of BFs):
  Bel_X(·|β̲, x), Bel_X(·|β̄, x)
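A minimal sketch of fitting the lower-likelihood parameters under the link (20), on synthetic data (everything below, including the bound β_2 ∈ [0, 1] keeping π_i + q_i ≤ 1, is an illustrative assumption, not the tutorial's implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=200)                                   # synthetic inputs
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(float)

def sigmoid(t): return 1 / (1 + np.exp(-t))

def neg_log_lower(beta):
    b0, b1, b2 = beta
    pi = sigmoid(b0 + b1 * x)      # pi_i = m(Y_i = 1 | x_i), eq. (19)
    q = b2 * (1 - pi)              # q_i  = m(Y_i = 0 | x_i), eq. (20)
    eps = 1e-12
    return -np.sum(y * np.log(pi + eps) + (1 - y) * np.log(q + eps))

res = minimize(neg_log_lower, x0=[0.0, 1.0, 0.5],
               bounds=[(None, None), (None, None), (0.0, 1.0)])
print(res.x)   # lower estimates (beta0, beta1, beta2)
```

Note that maximising the lower likelihood pushes β_2 towards 1, recovering the classical logistic fit, in line with the Bernoulli analysis above; the upper-likelihood fit is analogous with (1 − q_i) and (1 − π_i) in place of π_i and q_i.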
Rare events with belief functions: Generalising logistic regression
- how do we use belief functions to be cautious about rare event prediction?
- when we measure a new observation x we plug it into Bel_X(·|β̲, x) and Bel_X(·|β̄, x), and get a lower and an upper belief function on Y
- note that each belief function is really an envelope of logistic functions
- robust estimation of rare events: how does this relate to the results of classical logit regression? more to come in the near future!
Frequentist inference with RS
Choice of multivalued mappings
- recall Dempster's random set interpretation
- should the multivalued mapping Γ which defines a random set be modelled, or derived from the problem?
- e.g.: in the cloaked die example, it is the occlusion which generates the mapping
Parameterised families of random sets: Parameterised mapping, fixed distribution
- however, in other cases it may make sense to model a parameterised family of multivalued mappings Γ(·|θ) : Ω → 2^Θ
- given a (fixed) probability on Ω, this yields a parameterised family of random sets
- rationale: start with the classical random experiments which generate a given family of distributions .. .. and generalise the setting (design) to the case of set-valued observations
- proposal families: Gaussian, binomial, multinomial random sets
Parameterised families of random sets: Parameterised distribution, fixed mapping
- the other option is to fix the multi-valued mapping (e.g., when it is given by the problem) .. .. and have the source probability vary within a certain parameterised family of distributions
- this will also induce a family of random sets
Hypothesis testing with random sets
- in hypothesis testing, designing an experiment amounts to choosing a family of probability distributions generating the data
- if parameterised families of random sets can be constructed, they can be plugged into the frequentist inference machinery (step 2 below):
1. state the relevant null hypothesis H0 and the alternative hypotheses
2. state assumptions about the random set (mass assignment) describing the observations (in place of the form of the distributions)
3. state the relevant test statistic T (a quantity derived from the sample); this time the sample contains set-valued observations!
4. derive the mass assignment (in place of the distribution) of the test statistic under the null hypothesis (from the assumptions)
5. set a significance level (α)
6. compute from the observations the observed value t_obs of the test statistic T; this will also be set-valued
7. calculate the conditional belief value (in place of the p-value) under H0 of sampling a test statistic at least as extreme as the observed value
8. reject the null hypothesis, in favour of the alternative hypothesis, if and only if such conditional belief value is less than the significance level
Central limit theorem
The role of Gaussians in probability theory
- the Gaussian ('normal') distribution is central in probability theory and its applications: my noise is Gaussian, my kernel is Gaussian, etc.
- Gaussians have very nice properties: their moments are sufficient statistics; the Gaussian is the PDF with maximum entropy among those with given mean and standard deviation
- the central limit theorem shows that sums of i.i.d. random variables are asymptotically Gaussian
- whenever test statistics or estimators are functions of sums of random variables, they will have asymptotically normal distributions
A central limit theorem for random sets
- the old proposal by Dempster and Liu merely transfers normal distributions on the real line by Cartesian product with R^m
- more sensible/interesting option: investigating how Gaussian distributions are transformed under (appropriate) multivalued mappings; this involves exploring the space of mappings for sensible/convenient ones
- other avenue of research: a central limit theorem for random sets
- the central limit theorem and law(s) of large numbers have been generalised to imprecise probabilities (see Introduction to Imprecise Probabilities):
  - Larry G. Epstein and Kyoungwon Seo (Boston University) [2011]: A Central Limit Theorem for Belief Functions
  - Xiaomin Shi (Shandong University) [2015]: Central limit theorems for belief measures
The total belief theorem
The total belief theorem: Generalising total probability to belief functions
- a generalisation of total probability exists for Walley's imprecise probabilities: it is called marginal extension
- however, natural and marginal extensions are not closed operators in the space of belief functions: when applied to a random set, the result is not a random set

Theorem
Suppose Θ and Ω are two frames of discernment, and ρ : 2^Ω → 2^Θ the unique refining between them. Let Bel_0 be a belief function defined over Ω = {ω_1, ..., ω_|Ω|}. Suppose there exists a collection of belief functions Bel_i : 2^{Π_i} → [0, 1], where Π = {Π_1, ..., Π_|Ω|}, Π_i = ρ({ω_i}), is the partition of Θ induced by its coarsening Ω. Then there exists a belief function Bel : 2^Θ → [0, 1] such that:
1. Bel_0 is the restriction of Bel to Ω
2. Bel ⊕ Bel_{Π_i} = Bel_i for all i = 1, ..., |Ω|, where Bel_{Π_i} is the logical belief function with mass m_{Π_i}(A) = 1 if A = Π_i, 0 otherwise
The total belief theorem: Visual representation
[figure: pictorial representation of the total belief theorem]
Structure of the focal elements of the total belief function
- restricted total belief theorem: Bel_0 has only disjoint focal elements
[figure: pictorial representation of the structure of the focal elements of a total BF Bel lying in the image of a focal element of Bel_0 of cardinality 3]
Graph of solutions: Restricted total belief theorem
- potential solutions correspond to square linear systems, and form a graph whose nodes are linked by linear transformations of columns:
  e ↦ e′ = −e + Σ_{i∈C} e_i − Σ_{j∈S} e_j
  where C is a covering set for e (i.e., every component of e is covered by at least one of them), and S a set of selection columns
- at each transformation, the most negative component decreases
- general solution based on simplex-like optimisation?
Random set random variables
Random set random variables?
- ok, random sets are set-valued random variables; BUT can we actually build a random variable using as a basis a random set on Ω, instead of a probability measure there?
- as usual we need a mapping from Θ to a measurable space (e.g. the real line): f : Θ → R^+ = [0, +∞], where this time Θ is the co-domain of a multivalued mapping Γ : Ω → 2^Θ
- for a continuous random variable X we can compute its probability density function (PDF) as its Radon-Nikodym derivative, the measurable function p such that
  P[X ∈ A] = ∫_A p dµ
- can we compute a (generalised) PDF for a random set random variable?
Generalising the Radon-Nikodym derivative to random sets and capacities
- the Radon-Nikodym derivative for set functions was studied first by Harding et al [1997]
- Yann Rebille [2009] has also studied the problem: 'A Radon-Nikodym derivative for almost subadditive set functions'
- Graf has tackled the problem of defining the RND for capacities (rather than probability measures); see Molchanov's Theory of Random Sets
- assume capacities µ, ν are monotone, subadditive and continuous from below

Absolute continuity
A capacity ν is absolutely continuous with respect to another capacity µ if, for every A ∈ F, ν(A) = 0 whenever µ(A) = 0.
- same definition as for standard measures
- for standard measures it is equivalent to ν being an indefinite integral of µ
Generalising the Radon-Nikodym derivative: Strong decomposition
- for capacities (as opposed to probability measures), absolute continuity does not guarantee existence of a RN derivative
- consider the case of a finite Θ, |Θ| = n: then any measurable function f : Θ → R^+ is determined by just n numbers, which do not suffice to uniquely define a capacity on 2^Θ

Strong decomposition
The pair (µ, ν) has the strong decomposition property if ∀α ≥ 0 there exists a measurable set A_α ∈ F such that
α(ν(A) − ν(B)) ≤ µ(A) − µ(B) if B ⊂ A ⊂ A_α,
α(ν(A) − ν(A ∩ A_α)) ≥ µ(A) − µ(A ∩ A_α) ∀A.
- the condition says that the 'incremental ratio' of the two capacities is bounded in a certain sub-power set
- all standard measures meet the SDP
Generalising the Radon-Nikodym derivative: Radon-Nikodym theorem for capacities
For every two capacities µ and ν, ν is an indefinite integral of µ if and only if the pair (µ, ν) has the strong decomposition property and ν is absolutely continuous with respect to µ.
- open problem: interpreting the conditions of the theorem (which holds for general capacities) for completely alternating capacities (distributions of random closed sets)
- Molchanov: as a first step, note that the strong decomposition property for ν = T_X and µ = T_Y means that
  α P_X(F_A^B) ≤ P_Y(F_A^B) if B ⊂ A ⊂ A_α, and α P_X(F_A^{A∩A_α}) ≥ P_Y(F_A^{A∩A_α}) ∀A
  where F_A^B = {C ∈ F : B ⊂ C ⊂ A}
- Nguyen: a constructive approach to RN derivatives for capacities of random sets, similar to the one in constructive measure theory based on derivatives of set functions [Shilov]
A new machine learning
What's wrong with machine learning?
- new challenging real-world applications, such as:
  - smart cars navigating a complex, dynamic environment
  - robot surgical assistants capable of predicting the surgeon's needs
- existing theory and algorithms typically focus on fitting the observable outputs in the training data
- this may lead, for instance, an autonomous driving system to perform well on validation tests but fail catastrophically when tested in the real world
Towards robust machine learning
[figure: the unfortunate (but predictable) Tesla accident]
- we are unable to predict how a system will behave in a radically new setting (e.g., how does a smart car cope with driving through extreme weather conditions?)
- most systems have no way of detecting whether their underlying assumptions have been violated: they will happily continue to predict and act even on inputs that are completely outside the scope of what they have actually learned
- it is imperative to ensure that these algorithms behave predictably 'in the wild'
Vapnik's statistical learning theory: PAC learning
- classical statistical learning theory [Vapnik] contemplates 'generalisation' criteria which are based on a naive correlation between smoothness and generality
- it makes PAC predictions on the reliability of a training set which are based on simple quantities, such as the number of samples N
- generalisation problem: the training error is different from the expected generalisation error; in classification problems
  E_{x∼D}[δ(h(x) ≠ y(x))] ≠ (1/N) Σ_{n=1}^N δ(h(x_n) ≠ y(x_n))
  where the training data x = [x_1, ..., x_N] is assumed drawn from a distribution D, h(x) is the predicted label for input x and y(x) the actual label

Probably Approximately Correct learning
The learning algorithm finds with probability at least 1 − δ a model h ∈ H which is approximately correct, i.e. it makes an error of no more than ε.
Vapnik's statistical learning theory: PAC learning (continued)
- the main result of PAC learning is that we can relate the required size N of a training sample to the size of the model space H:
  log |H| ≤ εN − log(1/δ)
- so the minimum number of training examples given ε, δ and |H| is
  N ≥ (1/ε) ( log |H| + log(1/δ) )
- for infinite-dimensional hypothesis spaces H we need the notion of:

Vapnik-Chervonenkis Dimension
The VC dimension of H is the maximum number of points that can be successfully shattered by a hypothesis h ∈ H (i.e., they can be correctly classified by some h ∈ H for all possible binary labellings of these points).
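The finite-|H| bound above turns into a one-line sample-complexity calculator; a minimal sketch (the numbers are an illustrative example):

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Minimum N satisfying N >= (1/eps) * (log|H| + log(1/delta)),
    for a finite hypothesis space of size h_size."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2^20 boolean hypotheses, 5% error, 95% confidence
print(pac_sample_size(2**20, eps=0.05, delta=0.05))   # ~ 338 examples
```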
Vapnik's statistical learning theory: Example of VC dimension
- consider 4 points in R², with H the space of linear separators
- however we arrange the 4 points, there is a labelling that we cannot shatter (correctly reproduce); therefore the VC dimension of linear separators in R² is 3
Vapnik's statistical learning theory: Max margin SVMs
- the theory dramatically overestimates the number of training instances required
- pretty useless for model selection, for the bounds are too wide: people do cross-validation instead
- however, it provides the only justification for max-margin linear SVMs! for the space H_m of linear classifiers with margin m:
  VC_SVM = min{ D, 4R²/m² } + 1
  where R is the radius of the smallest hypersphere enclosing all the training data (and D the dimension of the input space)

Large margin classifiers
As the VC dimension of H_m decreases when m grows, it is desirable to select linear boundaries with max margin.
Imprecise-theoretical foundations for machine learning: A modest proposal
- issues with Vapnik's traditional statistical learning theory have been recently recognised by many researchers [Ermon, Liang, Weller]
- what about deep learning? nobody has a clue why it works, really
- approaches should provide worst-case guarantees: it is not possible to rule out completely unexpected behaviours or catastrophic failures
- Percy Liang's proposal: a new generation of ML algorithms which, rather than learning models that predict accurately on a target distribution, use minimax optimisation to learn models that are suitable for any target distribution within a 'safe' family
- the concept does evoke imprecise probability! minimax models similar to Liang's are naturally associated with convex sets of probabilities
Imprecise-theoretical foundations for machine learning: A modest proposal (continued)
- imprecise probabilities naturally arise whenever the data are insufficient to allow the estimation of a probability distribution
- training sets in virtually all applications of machine learning constitute a glaring example of data which is:
  - insufficient in quantity (think of a Google object detection routine trained on even a few million images, compared to the thousands of billions of images out there)
  - insufficient in quality (as they are selected based on criteria such as cost, availability or mental attitudes, therefore biasing the whole learning process)
- uncertainty theory may be able to provide worst-case, cautious predictions, delivering AI agents aware of their own limitations
- research programme: a generalisation of the concept of Probably Approximately Correct; where does the probability distribution of the data come from?
Climatic change models
Climate change: A Bayesian approach

Question
What is the probability that a doubling of atmospheric CO2 from pre-industrial levels will raise the global mean temperature by at least 2°C?
- the kind of question a policymaker might ask a climate scientist
- Rougier [2007] has very nicely outlined a Bayesian approach to climate modelling and prediction
- the predictive distribution for future climate is found by conditioning future climate on the observed values for historical and current climate; however:
  - in climate prediction, the collection of uncertain quantities for which the climate scientist must specify prior probabilities can be large
  - specifying a prior distribution per se is not the difficulty, but specifying a good one is
- people spend thousands of hours collecting climate data and constructing a climate model: why so little attention to quantifying our judgements about how these two are related?
Predicting future climate
- represent 'climate' as a vector of measurements collected at a given time
  - e.g. components: CO2 level, concentration on a grid, etc.
- climate: the vector y = (y_h, y_f) collecting historical and present (y_h) and future (y_f) climate values
- measurement error e: z = y_h + e
  - e.g. a seasick technician, atmospheric turbulence

Assumption 1
Climate and measurement error are independent: e ⊥ y.

Assumption 2
Measurement error is Gaussian distributed: e ∼ N(0, Σ_e).
- predictive distribution of climate given measured values z = z̃:
  p(y|z = z̃) ∝ n(z̃ − y_h|0, Σ_e) p(y)
- we need to specify a prior distribution for the climate y
Climate models as models of the prior
- the choice of the prior p(y) is challenging, both because y is such a large collection of quantities, and because these quantities are linked by complex interdependencies, such as those arising from laws of nature
- the role of the climate model is to induce a distribution for the climate; it plays the role of a parametric model in statistical inference
- what's a climate model anyway? a deterministic mapping from a collection of parameters (equation coefficients, initial conditions, forcing functions) to a vector of measurements (our 'climate'): x → y = g(x), where g belongs to a 'model space' G
- model evaluation: the actual value g(x) computed for some parameter value x
- a climate scientist considers, on a priori grounds, that some choices of x are better than others, i.e. there exists x* such that y = g(x*) + ε*, where ε* is the model 'discrepancy'
Prediction with parametric model (1)
- the difference between the climate itself and model evaluations has two parts:
  y − g(x) = [g(x*) − g(x)] + ε*
- the first part is a contribution that may be reduced by a better choice of the model g; the second is an irreducible contribution that arises from the model's imperfections
- x* is not just a statistical parameter though, for it relates to physical quantities, so that climate scientists have a clear intuition of its effects
  - scientists may be able to provide a prior p(x*) on the input parameters

Assumption 3
'Best' input, discrepancy, and measurement error are mutually independent: x* ⊥ ε* ⊥ e
Prediction with parametric model (2)

Assumption 4
Discrepancy ε* is Gaussian distributed with mean 0 and covariance Σ_ε.
- assumptions 3 and 4 allow us to compute the desired climate prior as
  p(y) = ∫ n(y − g(x*)|0, Σ_ε) p(x*) dx*
- in practice, the climate model function g(·) is not known; we only know a sample of model evaluations {g(x_1), ..., g(x_n)}
- model validation: tuning the covariances Σ_ε, Σ_e, and checking the validity of the Gaussianity assumptions
  - can be done by using the model to predict past/present climates p(z), and applying some hypothesis testing: if the observed value z̃ is in the tail of the distribution, you have a problem
- as Rougier admits, responding to bad validation results is not straightforward
Model calibration
- assuming the model has been validated, it needs to be calibrated: find the desired 'best' value x* of the model parameters
- indeed, under the above assumptions we can compute p(x* | z = z̃) ∝ p(z = z̃ | x*) p(x*) = n(z̃ − g(x*) | 0, Σ_ε + Σ_e) p(x*)
- as we know, MAP estimation could be applied, with the usual danger posed by multiple modes
- apparently, climate scientists are not very happy with having a PDF over the parameters instead!
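A one-dimensional toy sketch of this calibration step (g, the prior and the variance are all placeholders assumed for illustration):

```python
# Sketch of calibration: evaluate the unnormalised posterior
#   p(x* | z = z_tilde) proportional to n(z_tilde - g(x*) | 0, Sigma_eps + Sigma_e) p(x*)
# on a grid and read off the MAP estimate. Everything here is a 1-d toy.
import numpy as np
from scipy.stats import norm

sigma2 = 0.3                           # Sigma_eps + Sigma_e, scalar toy value

def g(x):
    return np.sin(x)                   # toy 1-d 'climate model'

z_tilde = 0.4
xs = np.linspace(-3.0, 3.0, 601)
prior = norm.pdf(xs, loc=0.0, scale=1.0)                # placeholder p(x*)
post = norm.pdf(z_tilde - g(xs), scale=np.sqrt(sigma2)) * prior
x_map = xs[np.argmax(post)]            # MAP estimate; beware multiple modes
print(x_map)
```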
Bayesian posterior prediction
- by full Bayesian inference we can instead compute p(y_f | z = z̃) = ∫ p(y_f | x*, z = z̃) p(x* | z = z̃) dx*, where p(y_f | x*, z = z̃) is Gaussian with a mean that depends on z̃ − g(x)
- this highlights two routes by which climate data impact future climate predictions:
  1. they concentrate the distribution p(x* | z = z̃) relative to the prior p(x*), depending on both the quantity and the quality of the climate data
  2. a large difference z̃ − g(x) shifts the mean of p(y_f | x*, z = z̃) away from g(x)
Role of model evaluations
- go back to the initial question: what is the probability that a doubling of atmospheric CO2 will raise the global mean temperature by at least 2°C?
- let Q ⊂ Y be the set of climates y for which the global mean temperature is at least 2°C higher in 2100
- the probability is computed by integration: Pr(y_f ∈ Q | z = z̃) = ∫ f(x*) p(x* | z = z̃) dx*, where f(x) = ∫_Q n(y_f | μ(x), Σ) dy_f can be computed directly
- the remaining integral requires numerical integration, e.g. (a toy sketch of the estimators follows below):
  - naive Monte Carlo: Pr(y_f ∈ Q | z = z̃) ≈ (1/n) Σ_{i=1}^n f(x_i), with x_i ∼ p(x* | z = z̃)
  - weighted sampling: Pr(y_f ∈ Q | z = z̃) ≈ Σ_{i=1}^n w_i f(x_i), with x_i ∼ p(x*), weighted by the likelihood: w_i ∝ p(z = z̃ | x* = x_i)
- sophisticated models which take a long time to evaluate may not provide enough samples for the prediction to be statistically significant, albeit they may make the prior p(x*) and the covariance Σ_ε easier to specify
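A toy sketch contrasting the two estimators above; f, the prior and the likelihood are made-up stand-ins (in the real setting f(x) integrates a Gaussian over the region Q):

```python
# Toy contrast of the two estimators of Pr(y_f in Q | z = z_tilde).
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # stand-in for f(x) = integral over Q of n(y_f | mu(x), Sigma) dy_f
    return 1.0 / (1.0 + np.exp(-x))

def likelihood(x, z_tilde=0.5):
    # stand-in for p(z = z_tilde | x* = x)
    return np.exp(-0.5 * (z_tilde - x) ** 2)

n = 10_000
xs = rng.normal(size=n)              # x_i ~ p(x*): here a standard-normal prior

# weighted sampling: importance weights proportional to the likelihood
w = likelihood(xs)
w /= w.sum()
weighted_estimate = np.sum(w * f(xs))

# naive Monte Carlo would instead average f over samples drawn directly from
# the posterior p(x* | z = z_tilde), which we do not have in closed form here;
# the weighted estimator sidesteps that requirement.
print(weighted_estimate)
```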
Issues with Bayesian prediction
- there are a number of issues with making climate inferences in the Bayesian framework
- lots of assumptions are necessary (e.g. Gaussianity), most of them made to keep the calculations practical rather than anything else
- although the prior on climates is reduced to a prior on the parameters of a climate model, there is no obvious way of picking p(x*): it is far easier to say which choices are wrong (e.g. uniform priors)
- significant parameter tuning is required (e.g. for Σ_ε, Σ_e, ...)
Modelling climate with belief functions
Quite a lot of work remains to be done, but a few landmarks:
- avoid committing to priors p(x*) on the correct climate model parameters
- use the climate model as a parametric model to infer either a BF on the space of climates Y, or a BF on the space of parameters (e.g. covariances) of the distribution on Y
New horizons
A geometry of uncertainty
A geometric approach to the theory of evidence
- the collection B of all vectors b = [Bel(A), ∅ ⊊ A ⊊ Ω]′ representing a belief function on Ω is a simplex (in rough words, a higher-dimensional triangle): the belief space B = Cl(b_A, ∅ ⊊ A ⊆ Ω), the convex closure of (the vectors of) all 'logical' BFs b_A
- alternatively, we can adopt mass vectors m_b = [m_b(A), ∅ ⊊ A ⊆ Ω]′, living in a mass space M = Cl(m_A, ∅ ⊊ A ⊆ Ω)
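As a toy illustration of these coordinates (the frame size and the mass assignment are made up for the example), the belief vector can be computed from a mass vector via Bel(A) = Σ_{B⊆A} m(B), with subsets of Ω encoded as bitmasks:

```python
# Toy sketch: computing the belief vector b from a mass vector m_b,
# Bel(A) = sum over B subset of A of m(B). Subsets of Omega are bitmasks.
n = 3                                       # |Omega| = 3
m = {0b001: 0.5, 0b011: 0.3, 0b111: 0.2}    # toy mass assignment (sums to 1)

def bel(A):
    # B is a subset of A iff B has no bits outside A
    return sum(v for B, v in m.items() if B & ~A == 0)

# coordinates of b, one per non-empty A; the last one, Bel(Omega) = 1,
# is constant and is usually dropped from the belief-space coordinates
b = [bel(A) for A in range(1, 2 ** n)]
print(b)
```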
Binary example
[Figure: the simplex of belief functions on a frame of size 2]
- belief/mass space B_2 = M_2 for a binary frame
- the set of probabilities is a face of the simplex (a triangle)
- the region of consonant BFs is a 'simplicial complex': CO = ⋃_{x∈Ω} Cl(b_A, A ∋ x)
Bundle structure of the belief space
- a fiber bundle is a generalisation of the Cartesian product: a space is decomposed into a base space and fibers, each of which projects onto a point of the base space
- the belief space has a recursive bundle structure
- rationale: the mass associated with a belief function can be recursively assigned to subsets (focal elements) of increasing size
Geometry of Dempster's rule: conditional subspaces
- Dempster's rule behaves as follows w.r.t. affine combination:
  b ⊕ (Σ_i α_i b_i) = Σ_i β_i (b ⊕ b_i),   β_i = α_i κ(b, b_i) / Σ_{j=1}^n α_j κ(b, b_j),
  where κ(b, b_i) is the usual Dempster conflict
- convex closure (Cl) and ⊕ commute in the belief space: b ⊕ Cl(b_1, ..., b_n) = Cl(b ⊕ b_1, ..., b ⊕ b_n)
- the conditional subspace ⟨b⟩, i.e. the set of all BFs (Dempster-)conditioned by b,
  ⟨b⟩ ≐ {b ⊕ b′, ∀b′ ∈ B s.t. b ⊕ b′ exists},
  is the convex closure ⟨b⟩ = Cl(b ⊕ b_A, ∀A ⊆ C_b), where C_b is the core of b
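A minimal sketch of Dempster's rule itself (toy masses, subsets again encoded as bitmasks), making the role of the conflict κ explicit:

```python
# Sketch of Dempster's rule of combination on a small frame.
# Returns the combined mass function and the conflict kappa.
def dempster(m1, m2):
    combined, kappa = {}, 0.0
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            A = B & C                       # set intersection of focal elements
            if A == 0:
                kappa += v1 * v2            # mass assigned to conflicting pairs
            else:
                combined[A] = combined.get(A, 0.0) + v1 * v2
    if kappa == 1.0:
        raise ValueError("totally conflicting evidence: combination undefined")
    return {A: v / (1.0 - kappa) for A, v in combined.items()}, kappa

# toy example on Omega = {x, y}, encoded as bits 0b01 and 0b10
m1 = {0b01: 0.7, 0b11: 0.3}
m2 = {0b10: 0.4, 0b11: 0.6}
m12, kappa = dempster(m1, m2)
print(m12, kappa)        # kappa = 0.28; masses renormalised by 1 - kappa
```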
Geometry of Dempster's rule: geometric construction
- the pointwise behavior of ⊕ depends on the notions of constant mass locus [Cuzzolin, 2004] and of the foci {F_x, x ∈ Ω} of a conditional subspace
Geometry of combination: future agenda
- the other main combination operators remain to be understood:
  - Yager's rule
  - Dubois and Prade's rule
  - conjunctive and disjunctive rules
  - cautious and bold rules
  - Josang's consensus
  - Murphy's averaging
  - Deng's distance-based rule
- this would visualise a 'cone' of possible future belief states under stronger or weaker assumptions
- can we also do inference by geometric methods? it is necessary to represent data and uncertainty measures in the same space
Conditioning by geometric methods: the conditioning simplex
- each conditioning event A is associated with a conditional simplex B_A in the belief space: B_A ≐ Cl(b_B, ∅ ⊊ B ⊆ A)
- we can therefore define the geometric conditional belief function induced by a distance function d: the BF(s) b_d(·|A) which minimise the distance d(b, B_A)
Conditioning in the mass space: conditioning by geometric means
- the L1 conditional BFs given A are all those BFs with core contained in A and masses dominating m(B) on all subsets B of A; geometrically, they form a polytope
- the L2 conditional belief function is the unique mass function that redistributes, in equal shares to each and every subset B of A, the mass originally assigned to focal elements not included in A (see the sketch below)
- geometrically, it coincides with the center of mass of the polytope of L1 conditional BFs
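A sketch of the L2 conditioning rule just described, assuming the redistribution goes to the non-empty subsets of A; the frame and masses are toy choices:

```python
# Sketch of the L2 conditional mass function: the mass of focal elements not
# included in A is redistributed in equal shares to every non-empty subset
# of A. Subsets are bitmasks; the mass assignment is a toy example.
def l2_condition(m, A):
    inside = {B: v for B, v in m.items() if B & ~A == 0}
    outside_mass = sum(v for B, v in m.items() if B & ~A != 0)
    subsets_A = [B for B in range(1, 1 << A.bit_length()) if B & ~A == 0]
    share = outside_mass / len(subsets_A)
    return {B: inside.get(B, 0.0) + share for B in subsets_A}

m = {0b001: 0.4, 0b110: 0.35, 0b111: 0.25}   # toy masses on Omega = {x1,x2,x3}
print(l2_condition(m, 0b011))                # condition on A = {x1, x2}
```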
General imaging interpretation of conditioning in the mass space
- geometric conditional BFs in M possess an interpretation in terms of general imaging in belief revision [Lewis, Gärdenfors]
- upon observing the impossibility of a certain outcome, one should re-assign its probability (mass) to the 'closest' remaining state
- but what if there is no reason to consider any remaining state as the closest?
  - we can represent such ignorance as a vacuous BF on the set of 'weights' of the remaining states: this induces the polytope of L1 conditional BFs!
  - or, we can represent such ignorance as a uniform probability distribution on the weights: this induces the L2 conditional BF!
Geometry of uncertainty: future agenda
- geometry of combination:
  - what about the geometry of the other combination rules, in particular the conjunctive and disjunctive rules ∩ and ∪?
  - what is the geometry of the 'tubes' of BFs we get using ∩ and ∪?
  - inversion of combination results via geometric means
- geometry of conditioning:
  - what happens when we plug in different norms? [Jousselme et al.]
  - is geometric conditioning a general encompassing framework for conditioning in belief calculus?
  - the other main conditioning operators remain to be understood, e.g.:
    - lower and upper envelopes
    - Suppes' 'geometric' conditioning
    - Smets' unnormalised conditioning
Geometry of uncertainty: future agenda (continued)
- geometric inference: can we represent data (samples) and the uncertainty measures they induce in the same space?
  - what norm is it appropriate to minimise for inference purposes?
- geometry of convex sets of belief functions:
  - we saw they pop up all the time when reasoning or making inferences
- geometry of belief functions on the reals:
  - Borel intervals
  - random sets
- fancier geometries:
  - belief functions as projections of convex bodies
  - belief functions as spinors? exterior algebras
Belief functions as projections of convex bodies (fancier geometries)
- convex bodies are a fascinating field of study
- for a convex body in R^n there are 2^n orthogonal projections onto the subspaces generated by sets of coordinate axes, related to the notion of Grassmann manifold
- under the condition that the areas of these projections are normalised, a convex body can be seen as a belief function
Unified geometry of uncertainty: geometry of possibility
- the geometry of consonant belief functions needs the notion of a simplicial complex: a collection Σ of simplices such that
  1. if a simplex belongs to Σ, then all its faces belong to Σ
  2. the intersection of any two simplices is a face of both
- the region of consistent BFs is a simplicial complex: CO = ⋃_{x∈Θ} Cl(b_A, A ∋ x)
Unified geometry of uncertainty: future agenda
- what about all the other uncertainty measures of the hierarchy? most of them are not special cases of belief functions (in fact, they are more general)
- we need to extend the geometric space to encapsulate the most general such representation (arguably, imprecise probabilities)
- intermediate steps: the geometry of monotone capacities (in particular 2-monotone capacities and probability intervals)
- most fascinating: the geometry of sets of desirable gambles
Summarising
A summary of what we have learned in this tutorial
- the theory of belief functions is a modelling language for representing elementary items of evidence and combining them, in order to form a representation of our beliefs about certain aspects of the world
- it is relatively simple to implement and has been successfully applied
- it is grounded in the beautiful mathematics of random sets
- it has strong relationships with the other theories of uncertainty
- belief functions have interesting mathematical properties in terms of geometry, algebra and combinatorics
- evidential reasoning can be implemented even for very large spaces and numerous pieces of evidence, because:
  - elementary items of evidence induce simple belief functions, which can be combined very efficiently
  - the most plausible hypothesis can be found without computing the whole combined belief function
  - Monte Carlo approximations are easily implementable
  - local propagation schemes allow parallelisation
A summary of what we have learned in this tutorial (continued)
- statistical evidence may be represented in several ways:
  - by likelihood-based belief functions, generalising both likelihood-based and Bayesian inference
  - by Dempster's idea of using auxiliary variables
  - in the framework of the Generalised Bayesian Theorem
- propagation on graphical models can be performed
- decision-making strategies based on intervals of expected utilities can be formulated that are more cautious than traditional ones
- the extension to continuous domains can be tackled via the Borel interval representation, and in the more general case via the theory of random sets
- a toolbox of estimation, classification and regression tools based on the theory of belief functions is available
Recent trends in the theory and application of belief functions
- in 2014 alone, almost 1200 papers were published on belief functions
- new applications are gaining ground beyond sensor fusion and expert systems: earth sciences, telecoms, etc.
Publication venues
- conferences on the theory of uncertainty:
  - BFAS's International Conference on Belief Functions (BELIEF)
  - Uncertainty in Artificial Intelligence (UAI)
  - International Conference on Information Fusion (FUSION)
  - International Symposium on Imprecise Probability: Theories and Applications (ISIPTA)
  - Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU)
  - IEEE Systems, Man and Cybernetics (SMC)
  - Information Processing and Management of Uncertainty (IPMU)
- journals (for theoretical contributions):
  - International Journal of Approximate Reasoning (IJAR)
  - IEEE Transactions on Fuzzy Systems (I.F. 6.306)
  - IEEE Transactions on Cybernetics (I.F. 3.781)
  - Artificial Intelligence
  - Information Sciences (I.F. 4.038)
  - Fuzzy Sets and Systems
What still needs to be resolved
- clarify once and for all the epistemic interpretation of belief function theory → random variables for set-valued observations
- the mechanism for evidence combination is still debated, as it depends on meta-information about the sources that is hardly accessible → working with intervals of belief functions may be the way forward: it acknowledges the meta-uncertainty on the nature of the sources generating the evidence
- the same holds for conditioning (as we showed)
- what about computational complexity? → not really an issue: just apply sampling for approximate inference; we do not need to assign mass to all subsets, but we do need to be allowed to do so when necessary (e.g. with missing data)
- belief functions on the reals → Borel intervals are nice, but the way forward is grounding the theory in the mathematics of random sets
Future of random set/belief function theory
- full development of random set graphical models:
  - merge the two lines of research, (1) belief functions on graphical models and (2) evidential networks
- further development of machine learning tools:
  - random set random forests
  - tackling current trends such as transfer learning and deep learning
- a fully developed theory of statistical inference with random sets:
  - generalised likelihood, logistic regression
  - limit theorems and total probability for random sets
  - random set random variables and processes
  - frequentist inference with random sets
- propose solutions to high-impact problems:
  - rare event prediction
  - robust foundations for machine learning
  - robust climate change predictions
- mathematics and geometry of random sets and other uncertainty measures
For Further Reading
- papers and Matlab software available at: https://www.hds.utc.fr/~tdenoeux
- Belief Functions Encyclopedia: http://cms.brookes.ac.uk/staff/FabioCuzzolin
- these slides are available online at: http://cms.brookes.ac.uk/staff/FabioCuzzolin/files/IJCAI2016.pdf
THANK YOU!
Appendix
For Further Reading
For Further Reading I
- G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976.
- F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing, 2014.
- F. Cuzzolin (Ed.). Belief functions: theory and applications. LNCS Vol. 8764, Springer, 2014.
For Further Reading II
- F. Cuzzolin. The geometry of uncertainty: the geometry of imprecise probabilities. Springer-Verlag (upcoming).
- F. Cuzzolin. Fifty years of belief functions: theory. IEEE Transactions on Fuzzy Systems (in preparation), 2017.
- F. Cuzzolin and C. Sengul. Fifty years of belief functions: applications. International Journal of Approximate Reasoning (in preparation), 2017.