Belief functions: A gentle introduction

Seoul National University
Professor Fabio Cuzzolin
School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK

Seoul, Korea, 30/05/18


Outline

1 Uncertainty: Second-order uncertainty; Classical probability
2 Beyond probability: Set-valued observations; Propositional evidence; Scarce data; Representing ignorance; Rare events; Uncertain data
3 Belief theory: A theory of evidence; Belief functions; Semantics; Dempster's rule; Multivariate analysis; Misunderstandings
4 Reasoning with belief functions: Statistical inference; Combination; Conditioning; Belief vs Bayesian reasoning; Generalised Bayes Theorem; The total belief theorem; Decision making
5 Theories of uncertainty: Imprecise probability; Monotone capacities; Probability intervals; Fuzzy and possibility theory; Probability boxes; Rough sets
6 Belief functions on reals: Continuous belief functions; Random sets
7 Conclusions

Uncertainty

Second-order uncertainty

Orders of uncertainty

the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability
second-order uncertainty: being uncertain about our very model of uncertainty
it has consequences for human behaviour: people are averse to unpredictable variation (as in Ellsberg's paradox)
how good are Kolmogorov's measure-theoretic probability, or the Bayesian and frequentist approaches, at modelling second-order uncertainty?


Uncertainty

Classical probability

Probability measures

mainstream mathematical theory of (first-order) uncertainty: mathematical (measure-theoretic) probability, mainly due to the Russian mathematician Andrey Kolmogorov
probability is an application of measure theory, the theory of assigning numbers to sets
an additive probability measure → the mathematical representation of the notion of chance
it assigns a probability value to every subset of a collection of possible outcomes (of a random experiment, of a decision problem, etc.)
the collection of outcomes Ω → sample space, universe
a subset A of the universe → event


Uncertainty

Classical probability

Probability measures probability measure µ: a real-valued function on a probability space that satisfies countable additivity probability space: it is a triplet (Ω, F, P) formed by a universe Ω, a σ-algebra F of its subsets, and a probability measure on F I

not all subsets of Ω belong necessarily to F

axioms of probability measures: I I I

µ(∅) = 0, µ(Ω) = 1 0 ≤ µ(A) ≤ 1 for all events A ⊆ F additivity: for all countable collection of pairwise disjoint events Ai : ! [ X µ Ai = µ(Ai ) i

i

probabilities have different interpretations: we consider frequentist and Bayesian (subjective) probability
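On a finite universe the axioms are easy to check mechanically. The following sketch (a fair die, with F taken to be the whole power set — an illustrative choice, not part of the slides) verifies normalisation, boundedness and additivity:

```python
from itertools import chain, combinations

# Toy finite probability space: a fair die, F = the full power set of Omega.
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1 / 6 for w in omega}        # probability of each singleton outcome

def mu(event):
    """Additive probability measure: sum of the singleton probabilities."""
    return sum(p[w] for w in event)

# Enumerate every event A in 2^Omega.
events = [set(c) for c in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

assert mu(set()) == 0 and abs(mu(omega) - 1) < 1e-12       # mu(empty)=0, mu(Omega)=1
assert all(-1e-12 <= mu(A) <= 1 + 1e-12 for A in events)   # 0 <= mu(A) <= 1
A, B = {1, 2}, {5, 6}                                      # disjoint events
assert abs(mu(A | B) - (mu(A) + mu(B))) < 1e-12            # additivity
```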


Uncertainty

Classical probability

Frequentist inference

in the frequentist interpretation, the (aleatory) probability of an event is its relative frequency in time
the frequentist interpretation offers guidance in the design of practical 'random' experiments
developed by Fisher, Pearson and Neyman
three main tools:
  - statistical hypothesis testing
  - model selection
  - confidence interval analysis


Uncertainty

Classical probability

Statistical hypothesis testing

1. state the research hypothesis
2. state the relevant null and alternative hypotheses
3. state the statistical assumptions being made about the sample, e.g. assumptions about statistical independence or about the form of the distributions of the observations
4. state the relevant test statistic T (a quantity derived from the sample)
5. derive the distribution of the test statistic under the null hypothesis from the assumptions
6. set a significance level (α), i.e. a probability threshold below which the null hypothesis will be rejected
7. compute from the observations the observed value tobs of the test statistic T
8. calculate the p-value, the probability (under the null hypothesis) of sampling a test statistic at least as extreme as the observed value
9. reject the null hypothesis, in favour of the alternative hypothesis, if and only if the p-value is less than the significance level threshold
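As a toy illustration of steps 4–9 (all numbers hypothetical, not from the slides): testing whether a coin is fair from 20 flips with 15 heads, using the number of heads as test statistic and a two-sided p-value:

```python
from math import comb

n, t_obs = 20, 15            # 20 flips, 15 heads observed (steps 4 and 7)
alpha = 0.05                 # significance level (step 6)

def binom_pmf(k, n, p=0.5):
    """Distribution of the test statistic under H0: coin is fair (step 5)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two-sided p-value: probability, under H0, of any outcome at least as
# extreme (i.e. no more probable) than the one observed (step 8).
p_value = sum(binom_pmf(k, n) for k in range(n + 1)
              if binom_pmf(k, n) <= binom_pmf(t_obs, n))

reject = p_value < alpha     # step 9: here p ~ 0.041 < 0.05, so H0 is rejected
```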


Uncertainty

Classical probability

P-values

[Figure: a probability density over the set of possible results, with the observed data point marked; the p-value is the probability mass in the tail(s) of very unlikely observations at least as extreme as the observed one.]

the p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false: frequentist statistics does not and cannot attach probabilities to hypotheses


Uncertainty

Classical probability

Maximum Likelihood Estimation (MLE)

the term 'likelihood' was popularised in mathematical statistics by Ronald Fisher in 1922, in 'On the mathematical foundations of theoretical statistics'
Fisher argued against 'inverse' (Bayesian) probability as a basis for statistical inference, and instead proposed inferences based on likelihood functions
likelihood principle: all of the evidence in a sample relevant to the model parameters is contained in the likelihood function
  - this is still hotly debated [Mayo, Gandenberger]
maximum likelihood estimation:
  {θ̂mle} ⊆ { arg max_{θ∈Θ} L(θ; x1, ..., xn) },
where L(θ; x1, ..., xn) = f(x1, x2, ..., xn | θ) and {f(·|θ), θ ∈ Θ} is a parametric model
consistency: the sequence of MLEs converges in probability, for a sufficiently large number of observations, to the (actual) value being estimated
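A minimal sketch for a Bernoulli model, where the maximiser of the likelihood is available in closed form as the sample mean (the true parameter value below is a hypothetical choice):

```python
import random

# MLE for the success probability theta of a Bernoulli model:
# L(theta; x_1..x_n) = prod_i theta^{x_i} (1 - theta)^{1 - x_i},
# which is maximised in closed form by the sample mean.
random.seed(0)
theta_true = 0.3
xs = [1 if random.random() < theta_true else 0 for _ in range(100_000)]

theta_mle = sum(xs) / len(xs)   # arg max of the likelihood

# Consistency: with many observations the MLE is close to theta_true.
assert abs(theta_mle - theta_true) < 0.01
```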


Uncertainty

Classical probability

Subjective probability

(epistemic) probability = degrees of belief of an individual assessing the state of the world
Ramsey and de Finetti → subjective beliefs must follow the laws of probability if they are to be coherent (if this 'proof' were watertight, we would not be here in front of you!)
also, empirical evidence casts doubt on whether humans hold coherent beliefs or behave rationally


Uncertainty

Classical probability

Bayesian inference

prior distribution: the distribution of the parameter(s) before any data is observed, i.e. p(θ | α); it depends on a vector of hyperparameters α
likelihood: the distribution of the observed data conditional on the parameters, i.e. p(X | θ)
marginal likelihood (sometimes also termed the evidence): the distribution of the observed data marginalised over the parameter(s):
  p(X | α) = ∫_Θ p(X | θ) p(θ | α) dθ
posterior distribution: the distribution of the parameter(s) after taking into account the observed data, as determined by Bayes' rule:
  p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α)
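A minimal sketch of these ingredients for a Beta–Bernoulli model (all numbers hypothetical): a Beta(a, b) prior on θ with hyperparameters (a, b), a Bernoulli likelihood, and — by conjugacy — a Beta posterior, so Bayes' rule reduces to updating two counts:

```python
from math import isclose

# Beta(a, b) prior on theta; Bernoulli likelihood for each observation.
a, b = 2.0, 2.0                  # hyperparameters alpha of the prior
data = [1, 1, 0, 1, 0, 1, 1, 1]  # observed data X (coin flips)

# Conjugacy: p(theta | X, alpha) ∝ p(X | theta) p(theta | alpha)
# is again a Beta, with parameters (a + #heads, b + #tails).
heads, tails = sum(data), len(data) - sum(data)
a_post, b_post = a + heads, b + tails

posterior_mean = a_post / (a_post + b_post)   # E[theta | X, alpha]
assert isclose(posterior_mean, 8 / 12)        # (2+6) / (2+6 + 2+2)
```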



Beyond probability

Something is wrong?

measure-theoretic mathematical probability is not general enough:
  - cannot (properly) model missing data
  - cannot (properly) model propositional data
  - cannot really model unusual data (second-order uncertainty)
the frequentist approach to probability:
  - cannot really model pure data (without 'design')
  - in a way, cannot even properly model continuous data
  - models scarce data only asymptotically
Bayesian reasoning has several limitations:
  - cannot model no data (ignorance)
  - cannot model 'uncertain' data
  - cannot model pure data (without a prior)
  - again, cannot properly model scarce data (only asymptotically)


Beyond probability

Fisher has not got it all right

the setting of hypothesis testing is (arguably) arguable:
  - the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals)
  - the criterion is arbitrary: who decides what an 'extreme' realisation is (the choice of α)? what is the deal with 0.05 and 0.01?
  - the whole 'tail' idea comes from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero – it seems to patch an underlying problem with the way probability is mathematically defined
it cannot cope with pure data, without assumptions on the process (experiment) which generated them (we will come back to this later)
it deals with scarce data only asymptotically


Beyond probability

The problem(s) with Bayes

pretty bad at representing ignorance:
  - Jeffreys' 'uninformative' priors are just not adequate
  - different results on different parameter spaces
Bayes' rule assumes the new evidence comes in the form of certainty: "A is true"
  - in the real world this is often not the case ('uncertain' or 'vague' evidence)
beware the prior! → model selection in Bayesian statistics
  - results from a confusion between the original subjective interpretation and the objectivist view of a rigorous, objective procedure
  - why should we 'pick' a prior? either there is prior knowledge (beliefs) or there is not
  - all will be fine, in the end! asymptotically, the choice of the prior does not matter (really!)


Beyond probability

Set-valued observations

The die as random variable

[Figure: a die whose faces face1, ..., face6 are mapped to the real values 1, ..., 6.]

a die is a simple example of a (discrete) random variable
there is a probability space Ω = {face1, face2, ..., face6} which maps to a real number: 1, 2, ..., 6 (no need for measurability here)
now, imagine that face1 and face2 are cloaked, and we roll the die


Beyond probability

Set-valued observations

The cloaked die: set-valued observations

[Figure: the same die with faces 1 and 2 cloaked: both are now mapped to the set {1, 2}.]

the same probability space Ω = {face1, face2, ..., face6} is still there (nothing has changed in the way the die works)
however, the mapping is now different: both face1 and face2 are mapped to the set of possible values {1, 2} (since we cannot observe the outcome)
this is a random set [Matheron, Kendall, Nguyen, Molchanov]: a set-valued random variable
whenever data are missing, observations are inherently set-valued
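A sketch of the cloaked die as a set-valued mapping (a fair die is assumed): each cloaked face pushes its 1/6 of probability onto the set {1, 2}, inducing a mass assignment on subsets rather than on single values:

```python
from fractions import Fraction

# Fair die; faces 1 and 2 are cloaked, so both map to the set {1, 2};
# the visible faces map to their singleton value.
gamma = {f: frozenset({f}) if f > 2 else frozenset({1, 2})
         for f in range(1, 7)}

# Mass assignment on subsets of {1,...,6} induced by the mapping.
m = {}
for face, image in gamma.items():
    m[image] = m.get(image, Fraction(0)) + Fraction(1, 6)

assert m[frozenset({1, 2})] == Fraction(1, 3)   # two faces, 1/6 each
assert m[frozenset({5})] == Fraction(1, 6)      # an uncloaked face
assert sum(m.values()) == 1
```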


Beyond probability

Propositional evidence

Reliable witnesses
Evidence supporting propositions

suppose there is a murder, and three people are on trial for it: Peter, John and Mary
our hypothesis space is therefore Θ = {Peter, John, Mary}
there is a witness: he testifies that the person he saw was a man
this amounts to supporting the proposition A = {Peter, John} ⊂ Θ
should we take this testimony at face value? in fact, the witness was tested, and the machine reported a 20% chance that he was drunk when he witnessed the crime
we should therefore partly support the (vacuous) hypothesis that any one among Peter, John and Mary could be the murderer: it is natural to assign 80% chance to proposition A, and the remaining 20% to proposition Θ


Beyond probability

Propositional evidence

Dealing with propositional evidence

even when evidence (data) supports propositions, Kolmogorov's probability forces us to specify support for individual outcomes
this is unreasonable – an artificial constraint due to a mathematical model that is not general enough
  - we have no elements to assign this 80% probability to either Peter or John, nor to distribute it among them
the cause is the additivity of probability measures: but this is not the most general type of measure for sets
under a minimal requirement of monotonicity, set measures can potentially be suitable for describing probabilities of events: these objects are called capacities
in particular, random sets are capacities in which the numbers assigned to subsets are given by a probability distribution

Belief functions and propositional evidence
As capacities (and random sets in particular), belief functions allow us to assign mass directly to propositions.


Beyond probability

Scarce data

Machines that learn
Generalising from scarce data

machine learning: designing algorithms that can learn from data
BUT we train them on a ridiculously small amount of data: how can we make sure they are robust to new situations never encountered before (model adaptation)?
statistical learning theory [Vapnik] is based on traditional probability theory


Beyond probability

Scarce data

Dealing with scarce data

a somewhat naive objection: probability distributions assume an infinite amount of evidence, so in reality finite evidence can only provide a constraint on the 'true' probability values
  - unfortunately, those who believe probabilities to be limits of relative frequencies (the frequentists) never really 'estimate' a probability from the data – they only assume ('design') probability distributions for their p-values
  - Fisher: fine, I can never compute probabilities, but I can use the data to test my hypotheses about them
  - in opposition, those who do estimate probability distributions from the data (the Bayesians) do not think of probabilities as infinite accumulations of evidence (but as degrees of belief)
  - Bayes: I only need to be able to model a likelihood function of the data
well, actually, frequentists do estimate probabilities from scarce data when they do stochastic regression (e.g., logistic regression)


Beyond probability

Scarce data

Asymptotic happiness

what is true is that both frequentists and Bayesians seem to be happy with solving their problems 'asymptotically':
  - limit properties of ML estimates
  - the Bernstein–von Mises theorem
what about the here and now? e.g. smart cars?


Beyond probability

Representing ignorance

Modelling pure data
Bayesian inference

Bayesian reasoning requires modelling the data and a prior (actually, you need to pick the proper hypothesis space too!)
  - 'prior' is just a name for beliefs built over a long period of time, from the evidence you have observed – so long a time has passed that all track record of observations is lost, and all that is left is a probability distribution
why should we 'pick' a prior? either there is prior knowledge or there is not; nevertheless we are compelled to pick one, because the mathematical formalism requires it
  - this is the result of a confusion between the original subjective interpretation (where prior beliefs always exist) and the objectivist view of a rigorous, objective procedure (where in most cases we do not have any prior knowledge)
Bayesians then go into 'damage limitation' mode, and try to pick the least damaging prior (see 'ignorance' later)
all will be fine, in the end! (Bernstein–von Mises theorem) Asymptotically, the choice of the prior does not matter (really!)


Beyond probability

Representing ignorance

Dangerous priors
Bayesian inference

the prior distribution is typically hard to determine – the 'solution' → pick an 'uninformative' prior
  - Jeffreys prior → proportional to the square root of the determinant of the Fisher information matrix
  - it can be improper (unnormalised), and it violates the strong version of the likelihood principle: inferences depend not just on the data likelihood but also on the universe of all possible experimental outcomes
uniform priors can lead to different results on different spaces, given the same likelihood functions
the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (Bernstein–von Mises theorem)
A. W. F. Edwards: "It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."


Beyond probability

Representing ignorance

Modelling pure data Frequentist inference

the frequentist approach is inherently unable to describe pure data without making additional assumptions on the data-generating process
in Nature one cannot 'design' an experiment: data come your way, whether you want them or not – you cannot set the 'stopping rules'
  - again, this recalls the old image of a scientist 'analysing' (from the Greek 'ana' + 'lysis', breaking up) a specific aspect of the world in their lab
the same data can lead to opposite conclusions
  - different experiments can lead to the same data, whereas the parametric model employed (a family of probability distributions) is linked to a specific experiment
  - apparently, however, frequentists are just fine with this


Beyond probability

Representing ignorance

Dealing with ignorance
Shafer vs Bayes

'uninformative' priors can be dangerous (Andrew Gelman): they violate the strong likelihood principle, and may be unnormalised
wrong priors can kill a Bayesian model
priors in general cannot handle multiple hypothesis spaces in a coherent way (families of frames, in Shafer's terminology)

Belief functions and priors
Reasoning with belief functions does not require any prior.

Belief functions and ignorance
Belief functions naturally represent ignorance via the 'vacuous' belief function, assigning mass 1 to the whole hypothesis space.


Beyond probability

Rare events

Extinct dinosaurs The statistics of rare events

dinosaurs probably were worrying about overpopulation risks..
.. until it hit them!


Beyond probability

Rare events

What’s a rare event?

what is a 'rare' event? clearly we are interested in them because they are not so rare, after all!
examples of rare events, also called 'tail risks' or 'black swans', are: volcanic eruptions, meteor impacts, financial crashes ..
mathematically, an event is 'rare' when it covers a region of the hypothesis space which is seldom sampled – it is an issue with the quality of the sample


Beyond probability

Rare events

Rare events and second-order uncertainty

probability distributions for the system's behaviour are built in 'normal' times (e.g. while a nuclear plant is working just fine), then used to extrapolate results at the 'tail' of the distribution

[Figure: a logistic fit of P(Y=1|x) to training samples clustered in the 'normal' region; a 'rare' event lies far out in the tail, where the fitted probability is a sharp underestimate.]

popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events
Harvard's G. King [2001] has proposed corrections based on oversampling the 'rare' events w.r.t. the 'normal' ones
the issue is really one with the reliability of the model! we need to explicitly model second-order uncertainty

Belief functions and rare events
Belief functions can model second-order uncertainty: rare events are a form of lack of information in certain regions of the sample space.


Beyond probability

Uncertain data

Uncertain data

concepts themselves may not be well defined, e.g. a 'dark' or 'somewhat round' object (qualitative data)
  - fuzzy theory accounts for this via the concept of graded membership
unreliable sensors can generate faulty (outlier) measurements: can we still treat these data as 'certain'? or is it more natural to attach to them a degree of reliability, based on the past track record of the 'sensor' (the data-generating process)? but then, can we still apply Bayes' rule?
people ('experts', e.g. doctors) tend to express themselves directly in terms of likelihoods (e.g. 'I think diagnosis A is most likely, otherwise either A or B')
  - if the doctors were frequentists, and were provided with the same data, they would probably apply logistic regression and come up with the same prediction of P(disease|symptoms): unfortunately, doctors are not statisticians
multiple sensors can provide as output a PDF on the same space
  - e.g., two Kalman filters, one based on colour, the other on motion (optical flow), each providing a normal predictive PDF on the location of the target in the image plane



Belief theory

A theory of evidence

A mathematical theory of evidence

Shafer called his proposal 'A mathematical theory of evidence'; the mathematical objects it deals with are called 'belief functions'
where do these names come from? what interpretation of probability do they entail?
it is a theory of epistemic probability: it is about probabilities as a mathematical representation of knowledge (a human's knowledge, or a machine's)
it is a theory of evidential probability: such probabilities representing knowledge are induced ('elicited') by the available evidence

[Diagram: evidence supports belief, which encodes probabilistic knowledge about the truth.]


Belief theory

A theory of evidence

Evidence supporting hypotheses

in probabilistic logic, statements such as "hypothesis H is probably true" mean that the empirical evidence E supports H to a high degree, called the epistemic probability of H given E

Rationale
There exists evidence in the form of probabilities, which supports degrees of belief on a certain matter.

the space where the evidence lives is different from the hypothesis space
they are linked by a one-to-many map: but this is a random set!


Belief theory

Belief functions

Dempster's multivalued mappings

Dempster's work formalises random sets via multivalued (one-to-many) mappings Γ from a probability space (Ω, F, P) to the domain of interest Θ

[Figure: Ω = {drunk (0.2), not drunk (0.8)}, mapped by Γ onto subsets of Θ = {Peter, John, Mary}.]

the example is taken from a famous 'trial' example [Shafer]
elements of Ω are mapped to subsets of Θ: once again, this is a random set
  - in the example, Γ maps {not drunk} ∈ Ω to {Peter, John} ⊂ Θ
the probability distribution P on Ω induces a mass assignment m : 2^Θ → [0, 1] on the power set 2^Θ = {A ⊆ Θ} via the multivalued mapping Γ : Ω → 2^Θ


Belief theory

Belief functions

Belief and plausibility measures

the belief in A is the probability that the evidence implies A:
  Bel(A) = P({ω ∈ Ω | Γ(ω) ⊆ A})
the plausibility of A is the probability that the evidence does not contradict A:
  Pl(A) = P({ω ∈ Ω | Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Θ \ A)
these were originally termed by Dempster 'lower and upper probabilities'
belief and plausibility values can (but this is disputed) be interpreted as lower and upper bounds on the values of an unknown, underlying probability measure:
  Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ


Belief theory

Belief functions

Basic probability assignments
Mass functions

belief functions (BFs) are functions from 2^Θ, the set of all subsets of Θ, to [0, 1], assigning values to subsets of Θ
it can be proven that each belief function has the form
  Bel(A) = Σ_{B⊆A} m(B)
where m is a mass function or basic probability assignment on Θ, defined as a function m : 2^Θ → [0, 1] such that:
  m(∅) = 0,  Σ_{A⊆Θ} m(A) = 1
any subset A of Θ such that m(A) > 0 is called a focal element (FE) of m
working with belief functions reduces to manipulating focal elements
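A sketch of computing Bel from a mass function, reusing the 0.8/0.2 masses of the witness example:

```python
theta = frozenset({'Peter', 'John', 'Mary'})
# Mass function: two focal elements, {Peter, John} and the whole frame.
m = {frozenset({'Peter', 'John'}): 0.8, theta: 0.2}

def bel(A, m):
    """Bel(A) = sum of the masses of all focal elements contained in A."""
    return sum(mass for B, mass in m.items() if B <= A)

assert bel(frozenset(), m) == 0
assert bel(theta, m) == 1.0
assert bel(frozenset({'Peter', 'John'}), m) == 0.8
assert bel(frozenset({'Peter'}), m) == 0   # mass is NOT split among outcomes
```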


Belief theory

Belief functions

A generalisation of sets, fuzzy sets, probabilities

belief functions generalise traditional ('crisp') sets:
  - a logical (or 'categorical') mass function has a single focal element A, with m(A) = 1
belief functions generalise standard probabilities:
  - a Bayesian mass function has only elements (rather than subsets) of Θ as focal elements
complete ignorance is represented by the vacuous mass function: m(Θ) = 1
belief functions generalise fuzzy sets (see possibility theory later), which correspond to consonant BFs, whose focal elements are nested: A1 ⊂ ... ⊂ Am

[Figure: examples of consonant, Bayesian and vacuous belief functions.]


Belief theory

Semantics

Semantics of belief functions
Modelling second-order uncertainty

[Figure: the probability simplex of distributions over {x, y, z}, showing the credal sets P(A) and P(B) of probabilities consistent with two belief functions.]

belief functions have multiple interpretations:
  - as set-valued random variables (random sets)
  - as (completely monotone) capacities (functions from the power set to [0, 1])
  - as a special class of credal sets (convex sets of probability distributions) [Levi, Kyburg]
as such, they are a very expressive means of modelling uncertainty about the model itself, due to lack of data quantity or quality, or both


Belief theory

Semantics

Axiomatic definition

belief functions can also be defined in axiomatic terms, just like Kolmogorov's additive probability measures
this is the definition proposed by Shafer in 1976

Belief function
A function Bel : 2^Θ → [0, 1] from the power set 2^Θ to [0, 1] such that:
  - Bel(∅) = 0, Bel(Θ) = 1;
  - for every n and for every collection A1, ..., An ∈ 2^Θ:
    Bel(A1 ∪ ... ∪ An) ≥ Σ_i Bel(Ai) − Σ_{i<j} Bel(Ai ∩ Aj) + ... + (−1)^{n+1} Bel(A1 ∩ ... ∩ An)

Jeffrey's rule of conditioning

when new evidence assigns new probabilities P'(B) to the events B of a subalgebra B, while leaving the conditional probabilities given those events unchanged, there is a unique solution:
  P''(A) = Σ_{B∈B} P(A|B) P'(B)
it generalises Bayes' conditioning! (obtained when P'(B) = 1 for some B)
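A sketch of Jeffrey's rule on a toy example (the partition and all numbers are hypothetical): a prior P over four outcomes, updated when new evidence revises the probabilities of the two-cell partition B = {B1, B2}:

```python
# Outcomes and prior P; partition cells B1 = {a, b}, B2 = {c, d}.
P = {'a': 0.3, 'b': 0.3, 'c': 0.2, 'd': 0.2}
B1, B2 = frozenset({'a', 'b'}), frozenset({'c', 'd'})

# New evidence: the partition cells now have probabilities 0.9 and 0.1.
P_new = {B1: 0.9, B2: 0.1}

def jeffrey(A, P, P_new):
    """P''(A) = sum over B of P(A | B) P'(B)."""
    total = 0.0
    for B, pB_new in P_new.items():
        pB = sum(P[w] for w in B)
        pAB = sum(P[w] for w in A & B)     # P(A and B)
        total += (pAB / pB) * pB_new       # P(A | B) * P'(B)
    return total

# Updated probability of a single outcome: P(a|B1) P'(B1) = 0.5 * 0.9.
assert abs(jeffrey(frozenset({'a'}), P, P_new) - 0.45) < 1e-12
# Bayes' conditioning is recovered when P'(B1) = 1:
assert abs(jeffrey(frozenset({'a'}), P, {B1: 1.0, B2: 0.0}) - 0.5) < 1e-12
```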


Belief theory

Dempster’s rule

Conditioning versus combination

what if I have a new probability on the same σ-algebra A? Jeffrey's rule cannot be applied!
as we saw, this happens when multiple sensors provide predictive PDFs
belief functions deal with uncertain evidence by moving away from the concept of conditioning (via Bayes' rule) ..
.. to that of combining pieces of evidence supporting multiple (intersecting) propositions to various degrees

Belief functions and evidence
Belief reasoning works by combining existing belief functions with new ones, which are able to encode uncertain evidence.


Belief theory

Dempster’s rule

Dempster's combination

[Figure: two multivalued mappings into Θ = {Peter, John, Mary}: Γ1 from Ω1 = {drunk (0.2), not drunk (0.8)} and Γ2 from Ω2 = {cleaned (0.6), not cleaned (0.4)}.]

new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime
the assumption is that pairs of outcomes in the source spaces, ω1 ∈ Ω1 and ω2 ∈ Ω2, support the intersection of their images in 2^Θ: θ ∈ Γ1(ω1) ∩ Γ2(ω2)
if this is done independently, then the probability that the pair (ω1, ω2) is selected is P1({ω1}) P2({ω2}), yielding Dempster's rule of combination:

  (m1 ⊕ m2)(A) = (1 / (1 − κ)) Σ_{B∩C=A} m1(B) m2(C),  for all ∅ ≠ A ⊆ Θ,

where κ = Σ_{B∩C=∅} m1(B) m2(C) is the conflicting mass

Bayes' rule is a special case of Dempster's rule


Belief theory

Dempster’s rule

Dempster's combination
A simple numerical example

[Figure: two mass functions Bel1 and Bel2 on a frame {θ1, θ2, θ3, θ4}, with focal elements A1, A2 of m1 and B1, B2 of m2.]

the conflicting mass is κ = 0.7 ∗ 0.6 = 0.42, so that:
  m({θ1}) = 0.7 ∗ 0.4 / (1 − 0.42) ≈ 0.48
  m({θ2}) = 0.3 ∗ 0.6 / (1 − 0.42) ≈ 0.31
  m({θ1, θ2}) = 0.3 ∗ 0.4 / (1 − 0.42) ≈ 0.21
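A sketch of Dempster's rule reproducing the numbers above. The focal elements are inferred from the products in the example (m1 = {{θ1}: 0.7, {θ1, θ2}: 0.3}, m2 = {{θ2}: 0.6, {θ1, θ2}: 0.4}) — an assignment consistent with the arithmetic, not stated explicitly in the original figure:

```python
def dempster(m1, m2):
    """Dempster's rule: normalised sum of mass products over intersecting
    focal elements; kappa collects the mass of conflicting (empty) pairs."""
    combined, kappa = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                combined[A] = combined.get(A, 0.0) + mB * mC
            else:
                kappa += mB * mC
    return {A: v / (1 - kappa) for A, v in combined.items()}, kappa

t1, t2 = frozenset({'t1'}), frozenset({'t2'})
t12 = frozenset({'t1', 't2'})
m, kappa = dempster({t1: 0.7, t12: 0.3}, {t2: 0.6, t12: 0.4})

assert abs(kappa - 0.42) < 1e-9
assert abs(m[t1] - 0.28 / 0.58) < 1e-9    # ~0.48
assert abs(m[t2] - 0.18 / 0.58) < 1e-9    # ~0.31
assert abs(m[t12] - 0.12 / 0.58) < 1e-9   # ~0.21
```

Note that combining any mass function with the vacuous one ({Θ}: 1) leaves it unchanged, which is the "no prior needed" property discussed on the next slide.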


Belief theory

Dempster’s rule

A generalisation of Bayesian inference

belief theory generalises Bayesian probability (it contains it as a special case), in that:
  - classical probability measures are a special class of belief functions (in the finite case) or random sets (in the infinite case)
  - Bayes' 'certain' evidence is a special case of Shafer's bodies of evidence (general belief functions)
  - Bayes' rule of conditioning is a special case of Dempster's rule of combination
  - it also generalises set-theoretic intersection: if mA and mB are logical mass functions and A ∩ B ≠ ∅, then mA ⊕ mB = mA∩B
however, it overcomes its limitations:
  - you do not need a prior: if you are ignorant, you use the vacuous BF mΘ which, when combined with new BFs m' encoding the data, does not change the result: mΘ ⊕ m' = m'
  - however, if you do have prior knowledge, you are welcome to use it!

Professor Fabio Cuzzolin

Belief functions: A gentle introduction

Seoul, Korea, 30/05/18

44 / 125

Multivariate analysis: refinements and coarsenings

the theory allows us to handle evidence impacting on different but related domains
assume we are interested in the nature of an object in a road scene: we could describe it, e.g., in the frame Θ = {vehicle, pedestrian}, or in the finer frame Ω = {car, bicycle, motorcycle, pedestrian}
other example: different image features in pose estimation
a frame Ω is a refinement of a frame Θ (or, equivalently, Θ is a coarsening of Ω) if elements of Ω can be obtained by splitting some or all of the elements of Θ

[Figure: a refining ρ maps each element θ1, θ2, θ3 of Θ to a set of elements of Ω.]

Families of compatible frames

when Ω is a refinement for a collection Θ1, ..., ΘN of other frames it is called their common refinement
two frames are said to be compatible if they do have a common refinement
compatible frames can be associated with different variables/attributes/features:
  - let ΘX = {red, blue, green} and ΘY = {small, medium, large} be the domains of attributes X and Y describing, respectively, the color and the size of an object
  - in such a case the common refinement ΘX ⊗ ΘY = ΘX × ΘY is simply the Cartesian product
or, they can be descriptions of the same variable at different levels of granularity (as in the road scene example)
evidence can be moved from one frame to another within a family of compatible frames

Families of compatible frames: pictorial illustration

[Figure: a family of compatible frames Θ1, ..., Θn, with their refinings into a common refinement.]

Marginalisation

let ΘX and ΘY be two compatible frames, and let mXY be a mass function on ΘX × ΘY
it can be expressed in the coarser frame ΘX by transferring each mass mXY(A) to the projection of A on ΘX

[Figure: a focal element A ⊆ ΘX × ΘY is projected onto B = A↓ΘX ⊆ ΘX.]

we obtain a marginal mass function on ΘX:

    mXY↓X(B) = Σ_{A ⊆ ΘXY : A↓ΘX = B} mXY(A),   ∀ B ⊆ ΘX

(again, it generalises both set projection and probabilistic marginalisation)
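As a sketch of the operation above (illustrative code, not from the slides; the dictionary-based mass representation and the color/size frames are assumptions), marginalisation just sums the masses of all focal elements sharing the same projection:

```python
def marginalise(m_xy, project):
    """Transfer each mass m_XY(A) to the projection of A onto Theta_X."""
    m_x = {}
    for A, v in m_xy.items():
        B = frozenset(project(a) for a in A)  # projection of A on Theta_X
        m_x[B] = m_x.get(B, 0.0) + v
    return m_x

# focal elements are sets of (x, y) pairs; projecting keeps the x coordinate
m_xy = {
    frozenset({('red', 'small'), ('red', 'large')}): 0.6,
    frozenset({('red', 'small'), ('blue', 'small')}): 0.4,
}
m_x = marginalise(m_xy, lambda pair: pair[0])
# m_x[{red}] = 0.6, m_x[{red, blue}] = 0.4
```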

Vacuous extension

the "inverse" of marginalisation: a mass function mX on ΘX can be expressed in ΘX × ΘY by transferring each mass mX(B) to the cylindrical extension of B

[Figure: a focal element B ⊆ ΘX is extended to the cylinder A = B × ΘY.]

this operation is called the vacuous extension of mX in ΘX × ΘY:

    mX↑XY(A) = mX(B) if A = B × ΘY,   0 otherwise

a strong feature of belief theory: the vacuous belief function (our representation of ignorance) is left unchanged when moving from one space to another!
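A matching sketch for the inverse operation (again an illustrative assumption, with the same dictionary representation): each focal element is blown up into a cylinder, and the masses are untouched:

```python
def vacuous_extension(m_x, theta_y):
    """Transfer each mass m_X(B) to its cylindrical extension B x Theta_Y."""
    return {frozenset((x, y) for x in B for y in theta_y): v
            for B, v in m_x.items()}

theta_y = {'small', 'large'}
m_x = {frozenset({'red'}): 0.7, frozenset({'red', 'blue'}): 0.3}
m_xy = vacuous_extension(m_x, theta_y)
# each focal element becomes the cylinder B x Theta_Y, with unchanged mass
```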

Belief functions are not (general) credal sets

a belief function on Θ is in 1-1 correspondence with a convex set of probability distributions there (a credal set)
however, belief functions are a special class of credal sets, those induced by a random set mapping

[Figure: in the probability simplex, a general credal set Cre vs the credal set of a belief function Bel.]

Belief functions are not parameterised families of distributions, or confidence intervals

obviously, a parameterised family of distributions on Θ is a subset of the set of all possible distributions (just like a belief function), but not all families of distributions correspond to belief functions
example: the family of Gaussian PDFs with zero mean and arbitrary variance, {N(0, σ), σ ∈ R+}, is not a belief function

[Figure: in the probability simplex, a generic family of distributions Fam vs the credal set of a belief function Bel.]

they are not confidence intervals either: confidence intervals are one-dimensional interval estimates, and their interpretation is entirely different

Belief functions are not second-order distributions

unlike hypothesis testing, general Bayesian inference leads to probability distributions over the space of parameters (e.g. a Dirichlet distribution)
these are second-order probabilities, i.e. probability distributions on hypotheses which are themselves probabilities
belief functions can be defined on the hypothesis space Ω, or on the parameter space Θ:
  - when defined on Ω they are sets of PDFs, and can then be seen as 'indicator' second-order distributions (e.g. a uniform meta-distribution over an interval of parameters)
  - when defined on the parameter space Θ, they amount to families of second-order distributions
in the two cases they generalise MLE/MAP and general Bayesian inference, respectively


Reasoning with belief functions

1. inference: building a belief function from data (either statistical or qualitative)
2. reasoning: updating belief representations when new data arrives
   - either by combination with another belief function
   - or by conditioning with respect to new events/observations
3. manipulating conditional belief functions
   - via a generalisation of Bayes' theorem
   - via network propagation
   - via a generalisation of the total probability theorem
4. using the resulting belief function(s) for: decision making, regression, classification, etc. (estimation, optimisation, ...)

[Figure: the reasoning pipeline with belief functions. Inference from statistical data/opinions produces belief functions; combination and conditioning yield combined and conditional belief functions; manipulation (efficient computation, measuring uncertainty, continuous formulation) produces total/marginal belief functions, which feed decision making.]

Dempster's approach to statistical inference: the fiducial argument

consider a statistical model { f(x|θ), x ∈ X, θ ∈ Θ }, where X is the sample space and Θ the parameter space
having observed x, how do we quantify the uncertainty about the parameter θ, without specifying a prior probability distribution?
suppose that we know a data-generating mechanism [Fisher] X = a(θ, U), where U is an (unobserved) auxiliary variable with known probability distribution μ : U → [0, 1] independent of θ
for instance, to generate a continuous random variable X with cumulative distribution function (CDF) Fθ, one might draw U from U([0, 1]) and set X = Fθ^{−1}(U)

Dempster's approach to statistical inference

the equation X = a(θ, U) defines a multi-valued mapping Γ : U → 2^{X×Θ}:

    Γ : u ↦ Γ(u) = { (x, θ) ∈ X × Θ : x = a(θ, u) } ⊂ X × Θ

under the usual measurability conditions, the probability space (U, B(U), μ) and the multi-valued mapping Γ induce a belief function BelX×Θ on X × Θ:
  - conditioning it on θ yields BelX(·|θ) ∼ f(·|θ) on X
  - conditioning it on X = x gives BelΘ(·|x) on Θ

[Figure: the source (U, μ) is mapped by Γ into X × Θ, inducing BelX×Θ; conditioning on θ gives BelX(·|θ), conditioning on x gives BelΘ(·|x).]

Inference from classical likelihood [Shafer76, Denoeux]

consider a statistical model L(θ; x) = f(x|θ), x ∈ X, θ ∈ Θ, where X is the sample space and Θ the parameter space
BelΘ(·|x) is the consonant belief function (with nested focal elements) whose plausibility of the singletons equals the normalised likelihood:

    pl(θ|x) = L(θ; x) / sup_{θ′∈Θ} L(θ′; x)

this takes the empirical normalised likelihood to be the upper bound to the probability density of the sought parameter (rather than the actual PDF)
the corresponding plausibility function is PlΘ(A|x) = sup_{θ∈A} pl(θ|x); the plausibility of a composite hypothesis A ⊂ Θ,

    PlΘ(A|x) = sup_{θ∈A} L(θ; x) / sup_{θ∈Θ} L(θ; x),

is the usual likelihood ratio statistic, compatible with the likelihood principle
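The consonant construction above is easy to evaluate numerically. The sketch below is illustrative (not from the slides): it computes the singleton plausibilities for the binomial likelihood used in the coin example that follows, and the grid discretisation is an assumption.

```python
def pl_singletons(theta_grid, lik):
    """Consonant plausibility of singletons: likelihood normalised by its sup."""
    values = [lik(t) for t in theta_grid]
    sup = max(values)
    return [v / sup for v in values]

# binomial likelihood for k = 7 heads out of n = 10 tosses
k, n = 7, 10
lik = lambda p: p**k * (1 - p)**(n - k)
grid = [i / 100 for i in range(101)]
pl = pl_singletons(grid, lik)
# pl equals 1 at the MLE p = 0.7; Pl(A|x) is then the sup of pl over A
```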

Coin toss example: inference with belief functions

consider a coin toss experiment: we toss the coin n = 10 times, obtaining the sample X = {H, H, T, H, T, H, T, H, H, H} with k = 7 successes (heads, H) and n − k = 3 failures (tails, T)
the parameter of interest is the probability θ = p of heads in a single toss
the inference problem then consists in gathering information on the value of p

Coin toss example: general Bayesian inference

trials are typically assumed to be independent and equally distributed, so the likelihood of the sample is binomial: P(X|p) = p^k (1 − p)^{n−k}
apply Bayes' rule to get the posterior:

    P(p|X) = P(X|p)P(p) / P(X) ∼ P(X|p) = p^k (1 − p)^{n−k}

as we do not have a-priori information on the prior (a uniform prior is assumed)

[Figure: the likelihood function over p, peaking at the maximum likelihood estimate.]

Coin toss example: frequentist inference

what would a frequentist do? it is reasonable that p be equal to p = k/n, i.e., the fraction of successes
we can then test this hypothesis in the classical frequentist setting
this implies assuming independent and equally distributed trials, so that the conditional distribution of the sample is the binomial
we can then compute the p-value for, say, a confidence level of α = 0.05
the right-tail p-value for the hypothesis p = k/n is equal to 1/2 >> α = 0.05; hence, the hypothesis cannot be rejected

[Figure: the likelihood function over p, with the right-tail area (p-value = 1/2) highlighted.]

Coin toss example: inference with likelihood-based belief functions

likelihood-based belief function inference yields the following belief measure over Θ = [0, 1], conditioned on the observed sample X:

    PlΘ(A|X) = sup_{p∈A} L̂(p|X);   BelΘ(A|X) = 1 − PlΘ(A^c|X),   ∀A ⊆ Θ

where L̂(p|X) is the normalised version of the traditional likelihood
this determines an entire envelope of PDFs on the parameter space Θ = [0, 1] (a belief function there)
the random set associated with this belief measure is:

    ω ∈ Ω = [0, 1] ↦ ΓX(ω) = { θ ∈ Θ : PlΘ({θ}|X) ≥ ω } ⊂ Θ = [0, 1]

which is an interval centred around the ML estimate of p

[Figure: the random set induced by the likelihood: each ω selects the interval of parameters whose normalised likelihood is at least ω.]

Coin toss example: inference with likelihood-based belief functions (continued)

the same procedure can be applied to the normalised empirical counts f̂(H) = 7/7 = 1, f̂(T) = 3/7, rather than to the normalised likelihood function
imposing PlΩ(H) = 1, PlΩ(T) = 3/7 on Ω = {H, T}, and looking for the least committed belief function there with these plausibility values, we get the mass assignment:

    m(H) = 4/7,   m(T) = 0,   m(Ω) = 3/7

[Figure: the corresponding credal set: the interval of values of p between Bel(H) = 4/7 and Pl(H) = 1, containing the MLE 7/10.]

p = 1 needs to be excluded, as the available sample evidence already reports n(T) = 3 tail counts, so that 1 − p ≠ 0
this outcome (a belief function on Ω = {H, T}) 'robustifies' classical MLE

Summary on inference

general Bayesian inference → a continuous PDF on the parameter space Θ (a second-order distribution)
MLE/MAP estimation → a single parameter value = a single PDF on Ω
generalised maximum likelihood → a belief function on Ω (a convex set of PDFs on Ω); generalises MAP/MLE
likelihood-based / Dempster-based belief function inference → a belief function on Θ = a convex set of second-order distributions; generalises general Bayesian inference
Dempster's approach requires a data-generating process; the likelihood approach produces only consonant BFs

Combining vs conditioning

belief theory is a generalisation of Bayesian reasoning
whereas in Bayesian theory evidence is of the kind 'A is true' (e.g. a new datum is available) ...
... in belief theory, new evidence can assume the more general form of a belief function (a proposition A is a very special case of belief function, with m(A) = 1)
in most cases, reasoning then needs to be performed by combining belief functions, rather than by conditioning with respect to an event
nevertheless, conditional belief functions are of interest, especially for statistical inference

Dempster's rule under fire: Zadeh's paradox

the question is: is Dempster's sum the only possible rule of combination? it seems to have paradoxical behaviour in certain circumstances
doctors have opinions about the condition of a patient, Θ = {M, C, T}, where M stands for meningitis, C for concussion and T for tumor
two doctors provide the following diagnoses:
  - D1: "I am 99% sure it's meningitis, but there is a small chance of 1% that it is concussion"
  - D2: "I am 99% sure it's a tumor, but there is a small chance of 1% that it is concussion"
these can be encoded by the following mass functions:

    m1({M}) = 0.99,  m1({C}) = 0.01,  m1(A) = 0 otherwise;
    m2({T}) = 0.99,  m2({C}) = 0.01,  m2(A) = 0 otherwise.    (1)

Dempster's rule under fire: Zadeh's paradox (continued)

their (unnormalised) Dempster's combination is: m(∅) = 0.9999, m({C}) = 0.0001
as the two masses are highly conflicting, normalisation yields the belief function focussed on C → "it is definitely concussion", although both experts had left it as only a fringe possibility
objections:
  - the belief functions in the example are really probabilities, so this is a problem with Bayesian representations, if anything!
  - diseases are never exclusive, so it may be argued that Zadeh's choice of a frame of discernment is misleading → open-world approaches with no normalisation
  - the doctors disagree so much that any person would conclude that one of them is just wrong → the reliability of sources needs to be accounted for
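Zadeh's numbers are easy to reproduce. The sketch below is illustrative (not code from the slides); `dempster` is the obvious dictionary-based implementation, here also returning the conflict mass:

```python
from itertools import product

def dempster(m1, m2):
    """Dempster's rule, also returning the conflict mass."""
    combined, conflict = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            conflict += mB * mC
    return {A: v / (1.0 - conflict) for A, v in combined.items()}, conflict

m1 = {frozenset({'M'}): 0.99, frozenset({'C'}): 0.01}
m2 = {frozenset({'T'}): 0.99, frozenset({'C'}): 0.01}
m, conflict = dempster(m1, m2)
# conflict = 0.9999; after normalisation all the mass goes to {C}
```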

Dempster's rule under fire: Tchamova's paradox

this time, the two doctors generate the following mass assignments over Θ = {M, C, T}:

    m1({M}) = a,  m1({M, C}) = 1 − a;
    m2({M, C}) = b1,  m2(Θ) = b2,  m2({T}) = 1 − b1 − b2.    (2)

assuming equal reliability of the two doctors, Dempster's combination yields m1 ⊕ m2 = m1, i.e., Doctor 2's diagnosis is completely absorbed by that of Doctor 1!
here the 'paradoxical' behaviour is not a consequence of conflict: in Dempster's combination, every source of evidence has a 'veto' power over the hypotheses it does not believe to be possible
if any of the sources gets it wrong, the combined belief function will never give support to the 'correct' hypothesis

Yager's and Dubois' rules

a first answer to Zadeh's objections, based on the view that conflict is generated by non-reliable information sources
the conflicting mass m(∅) = Σ_{B∩C=∅} m1(B) m2(C) should be re-assigned to the whole frame Θ
let m∩(A) = Σ_{B∩C=A} m1(B) m2(C); Yager's rule is then:

    mY(A) = m∩(A) for ∅ ≠ A ⊊ Θ,   mY(Θ) = m∩(Θ) + m(∅).    (3)

Dubois and Prade's idea is similar to Yager's, BUT the conflicting mass is not transferred all the way up to Θ; it goes to B ∪ C instead (by the minimum specificity principle):

    mD(A) = m∩(A) + Σ_{B∪C=A, B∩C=∅} m1(B) m2(C).    (4)

the resulting BF dominates Yager's combination: mD(A) ≥ mY(A) ∀A
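Yager's re-assignment of the conflict to Θ can be sketched as follows (illustrative code with Zadeh's masses as input; the function name and dictionary representation are assumptions):

```python
from itertools import product

def yager(m1, m2, theta):
    """Yager's rule: the conflicting mass is re-assigned to the whole frame."""
    out, conflict = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            out[A] = out.get(A, 0.0) + mB * mC
        else:
            conflict += mB * mC
    out[theta] = out.get(theta, 0.0) + conflict
    return out

theta = frozenset({'M', 'C', 'T'})
m1 = {frozenset({'M'}): 0.99, frozenset({'C'}): 0.01}
m2 = {frozenset({'T'}): 0.99, frozenset({'C'}): 0.01}
mY = yager(m1, m2, theta)
# mY[{C}] = 0.0001, mY[theta] = 0.9999: the conflict is turned into ignorance
```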

Conjunctive and disjunctive rules

rather than normalising (as in Dempster's rule) or re-assigning the conflicting mass m(∅) to other non-empty subsets (as in Yager's and Dubois' proposals), Smets' conjunctive rule leaves the conflicting mass with the empty set:

    m∩(A) = Σ_{B∩C=A} m1(B) m2(C)    (5)

it is applicable to unnormalised belief functions, under an open-world assumption: the current frame only approximately describes the set of possible hypotheses
disjunctive rule of combination:

    m∪(A) = Σ_{B∪C=A} m1(B) m2(C)    (6)

consensus between two sources is expressed by the union of the supported propositions, rather than by their intersection
note that (Bel1 ∪ Bel2)(A) = Bel1(A) · Bel2(A): belief values are simply multiplied!
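Both rules differ from Dempster's only in where the pairwise product masses land, so they share one loop. A sketch (illustrative, with Zadeh's masses again as input):

```python
from itertools import product

def combine(m1, m2, op):
    """Smets-style combination: op is set intersection or set union."""
    out = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = op(B, C)
        out[A] = out.get(A, 0.0) + mB * mC
    return out

conj = lambda m1, m2: combine(m1, m2, lambda B, C: B & C)  # mass may land on the empty set
disj = lambda m1, m2: combine(m1, m2, lambda B, C: B | C)  # mass lands on unions

m1 = {frozenset({'M'}): 0.99, frozenset({'C'}): 0.01}
m2 = {frozenset({'T'}): 0.99, frozenset({'C'}): 0.01}
m_conj = conj(m1, m2)  # m_conj[empty set] = 0.9999, kept as-is (open world)
m_disj = disj(m1, m2)  # m_disj[{M, T}] = 0.9801, m_disj[{C}] = 0.0001
```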

Combination: some conclusions

Yager's rule is rather unjustified, and Dubois' is somewhat intermediate between conjunction and disjunction
my take on this: Dempster's (conjunctive) combination and disjunctive combination are the two extrema of a spectrum of possible results

Proposal: combination tubes? Meta-uncertainty on the sources generating the input belief functions (their independence and reliability) induces uncertainty on the result of the combination, represented by a bracket of combination rules, which produce a 'tube' of BFs.

this fits well with the belief likelihood concept, and was already hinted at by Pearl in "Reasoning with belief functions: An analysis of compatibility"
should we then work with intervals of belief functions?

Conditional belief functions

in Bayesian theory, conditioning is done via Bayes' rule: P(A|B) = P(A ∩ B) / P(B)
for belief functions, many approaches to conditioning have been proposed (just as for combination!):
  - original Dempster's conditioning
  - Fagin and Halpern's lower envelopes
  - "geometric conditioning" [Suppes]
  - unnormalised conditional belief functions [Smets]
  - generalised Jeffrey's rules [Smets]
  - sets of equivalent events under multi-valued mappings [Spies]
several of them are special cases of combination rules: Dempster's, Smets' ...
others are the unique solution when interpreting belief functions as convex sets of probabilities (Fagin's)
once again, a duality emerges between the most and least cautious conditioning approaches

Dempster's conditioning

Dempster's rule of combination induces a conditioning operator: given a new event B, the "logical" belief function such that m(B) = 1 is combined with the a-priori belief function Bel using Dempster's rule
the resulting BF is the conditional belief function given B
in terms of belief and plausibility values, Dempster's conditioning yields:

    Bel⊕(A|B) = (Bel(A ∪ B̄) − Bel(B̄)) / (1 − Bel(B̄)) = (Pl(B) − Pl(B \ A)) / Pl(B),
    Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)

i.e. it is obtained from Bayes' rule by replacing probability with plausibility measures!
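Since conditioning on B is just combination with the logical BF mB(B) = 1, it reduces to intersecting each focal element with B and renormalising by Pl(B). A sketch (illustrative masses and helper name, not from the slides):

```python
def dempster_condition(m, B):
    """Dempster's conditioning: combine m with the logical BF m_B(B) = 1."""
    out, conflict = {}, 0.0
    for A, v in m.items():
        I = A & B
        if I:
            out[I] = out.get(I, 0.0) + v
        else:
            conflict += v  # mass of focal elements disjoint from B
    return {A: v / (1.0 - conflict) for A, v in out.items()}

m = {frozenset({'a'}): 0.5, frozenset({'a', 'b'}): 0.3, frozenset({'c'}): 0.2}
mB = dempster_condition(m, frozenset({'a', 'b'}))
# Pl(B) = 0.8, so mB[{a}] = 0.5 / 0.8 = 0.625 and mB[{a, b}] = 0.375
```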

Lower envelopes of conditional probabilities

we know that a belief function can be seen as the lower envelope of the family of probabilities consistent with it: Bel(A) = inf_{P∈P[Bel]} P(A)
credal conditioning defines the conditional belief function as the lower envelope (the inf) of the family of conditional probability functions P(A|B), where P is consistent with Bel:

    BelCr(A|B) = inf_{P∈P[Bel]} P(A|B),   PlCr(A|B) = sup_{P∈P[Bel]} P(A|B)

this is quite incompatible with the random set interpretation
nevertheless, whereas lower/upper envelopes of arbitrary sets of probabilities are not in general belief functions, these actually are belief functions:

    BelCr(A|B) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B)),   PlCr(A|B) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B))

they provide a more conservative estimate than Dempster's conditioning:

    BelCr(A|B) ≤ Bel⊕(A|B) ≤ Pl⊕(A|B) ≤ PlCr(A|B)

Geometric conditioning

Suppes and Zanotti proposed a 'geometric' conditioning approach:

    BelG(A|B) = Bel(A ∩ B) / Bel(B),   PlG(A|B) = (Bel(B) − Bel(B \ A)) / Bel(B)

it retains only the masses of focal elements inside B, and normalises them:

    mG(A|B) = m(A) / Bel(B),   A ⊆ B

it is a consequence of the focussing approach to belief update: no new information is introduced, we merely focus on a specific subset of the original set
whereas Dempster's conditioning replaces probability with plausibility in Bayes' rule (Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)), geometric conditioning replaces it with belief measures (BelG(A|B) = Bel(A ∩ B) / Bel(B))

Conjunctive rule of conditioning

it is induced by the conjunctive rule of combination: m∩(A|B) = (m ∩ mB)(A), where mB is the logical BF focussed on B [Smets]
its belief and plausibility values are:

    Bel∩(A|B) = Bel(A ∪ B̄) if A ∩ B ≠ ∅,  0 if A ∩ B = ∅;
    Pl∩(A|B) = Pl(A ∩ B) if A ⊉ B,  1 if A ⊇ B

it is compatible with the principles of belief revision [Gilboa, Perea]: a state of belief is modified to take into account a new piece of information
  - in probability theory, both focussing and revision are expressed by Bayes' rule, but they are conceptually different operations, which produce different results on BFs
it is more committal than Dempster's rule:

    Bel⊕(A|B) ≤ Bel∩(A|B) ≤ Pl∩(A|B) ≤ Pl⊕(A|B)

Disjunctive rule of conditioning

induced by the disjunctive rule of combination: m∪(A|B) = (m ∪ mB)(A)
obviously dual to conjunctive conditioning: it assigns mass only to subsets containing the conditioning event B
belief and plausibility values:

    Bel∪(A|B) = Bel(A) if A ⊇ B,  0 if A ⊉ B;
    Pl∪(A|B) = Pl(A) if A ∩ B = ∅,  1 if A ∩ B ≠ ∅

it is less committal not only than Dempster's rule, but also than credal conditioning:

    Bel∪(A|B) ≤ BelCr(A|B) ≤ PlCr(A|B) ≤ Pl∪(A|B)

Conditioning - an overview

                  belief                                     plausibility
  Dempster's ⊕    (Pl(B) − Pl(B \ A)) / Pl(B)                Pl(A ∩ B) / Pl(B)
  Credal Cr       Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B))      Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B))
  Geometric G     Bel(A ∩ B) / Bel(B)                        (Bel(B) − Bel(B \ A)) / Bel(B)
  Conjunctive ∩   Bel(A ∪ B̄), A ∩ B ≠ ∅                      Pl(A ∩ B), A ⊉ B
  Disjunctive ∪   Bel(A), A ⊇ B                              Pl(A), A ∩ B = ∅

Nested conditioning operators: conditioning operators form a nested family, ordered from the least committal (disjunctive) to the most committal (conjunctive):

    Bel∪(·|·) ≤ BelCr(·|·) ≤ Bel⊕(·|·) ≤ Bel∩(·|·) ≤ Pl∩(·|·) ≤ Pl⊕(·|·) ≤ PlCr(·|·) ≤ Pl∪(·|·)

Belief vs Bayesian reasoning: a toy example

suppose we want to estimate the class of an object appearing in an image, based on feature measurements extracted from the image (e.g. by convolutional neural networks)
we capture a training set of images, complete with annotated object labels
assuming a PDF of a certain family (e.g. a mixture of Gaussians), we can learn from the training data a likelihood function p(x|y), where y is the object class and x the image feature vector
suppose n different 'sensors' extract n features x1, ..., xn from each image
let us compare how data fusion works under the Bayesian and the belief function paradigms!

(Naive) Bayesian data fusion

the likelihoods of the individual features are computed using the n likelihood functions learned during training: p(xi|y), for all i = 1, ..., n
measurements are typically assumed to be conditionally independent, yielding the product likelihood p(x|y) = Π_i p(xi|y)
Bayesian inference is applied, typically assuming uniform priors (for there is no reason to think otherwise), yielding p(y|x) ∼ p(x|y) = Π_i p(xi|y)

[Figure: the Bayesian fusion pipeline: each feature xi is mapped to p(xi|y) by its likelihood function; conditional independence gives Π_i p(xi|y); Bayes' rule with a uniform prior yields p(y|x) ∼ Π_i p(xi|y).]
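The pipeline above amounts to multiplying per-feature likelihoods and normalising. A minimal sketch, assuming uniform priors and hypothetical likelihood values for two classes (none of these numbers come from the slides):

```python
import math

def naive_bayes_posterior(likelihoods):
    """Fuse per-feature likelihoods p(x_i|y) under conditional independence
    and a uniform prior: p(y|x) is proportional to the product of p(x_i|y)."""
    classes = likelihoods[0].keys()
    joint = {y: math.prod(l[y] for l in likelihoods) for y in classes}
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

# hypothetical per-feature likelihoods for two object classes
post = naive_bayes_posterior([
    {'car': 0.8, 'pedestrian': 0.2},
    {'car': 0.6, 'pedestrian': 0.4},
])
# post['car'] = 0.48 / 0.56 = 6/7
```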

Dempster-Shafer data fusion

with belief functions, for each feature type i a BF is learned from the individual likelihood p(xi|y), e.g. via the likelihood-based approach by Shafer
this yields n belief functions Bel(y|xi) on the range Y of possible object classes
a combination rule (e.g. ⊕, ∩, ∪) is applied to compute an overall BF:

    Bel(Y|x) = Bel(Y|x1) ⊕ · · · ⊕ Bel(Y|xn)  (or using ∩ or ∪),   Y ⊆ Y

[Figure: the belief-function fusion pipeline: each p(xi|y) is mapped to Bel(Y|xi) by likelihood-based inference; belief function combination yields Bel(Y|x).]

Inference under partially reliable data

in the fusion example we have assumed that the data are measured correctly; what if the data-generating process is not completely reliable?
problem: suppose we want to just detect an object (binary decision: yes Y or no N)
two sensors produce image features x1 and x2, but we learned from the training data that both are unreliable 20% of the time (i.e. 80% reliable)
at test time we get an image and measure x1 and x2; unluckily, sensor 2 got it wrong: the object is actually there
we get the following normalised likelihoods:

    p(x1|Y) = 0.9, p(x1|N) = 0.1;   p(x2|Y) = 0.1, p(x2|N) = 0.9

Inference under partially reliable data (continued)

how do the two fusion pipelines cope with this?
the Bayesian scholar assumes the two sensors/processes are conditionally independent, and multiplies the likelihoods, obtaining:

    p(x1, x2|Y) = 0.9 · 0.1 = 0.09,   p(x1, x2|N) = 0.1 · 0.9 = 0.09

so that p(Y|x1, x2) = 1/2, p(N|x1, x2) = 1/2
Shafer's faithful follower discounts the likelihoods by assigning mass 0.2 to the whole hypothesis space Θ = {Y, N}:

    m(Y|x1) = 0.9 · 0.8 = 0.72,   m(N|x1) = 0.1 · 0.8 = 0.08,   m(Θ|x1) = 0.2;
    m(Y|x2) = 0.1 · 0.8 = 0.08,   m(N|x2) = 0.9 · 0.8 = 0.72,   m(Θ|x2) = 0.2

Inference under partially reliable data (continued)

thus, when we combine them by Dempster's rule we get the BF Bel on {Y, N}:

    m(Y|x1, x2) = 0.458,   m(N|x1, x2) = 0.458,   m(Θ|x1, x2) = 0.084

when the (undiscounted) likelihoods are combined using the disjunctive rule (the least committal one) we get Bel′:

    m′(Y|x1, x2) = 0.09,   m′(N|x1, x2) = 0.09,   m′(Θ|x1, x2) = 0.82

[Figure: the credal intervals for P(Y|x1, x2): [0.46, 0.54] for Bel, [0.09, 0.91] for Bel′, and the point 1/2 for the Bayesian posterior.]

the credal interval for Bel is quite narrow: reliability is assumed to be 80%, yet one measurement in two (50%) was faulty!
the disjunctive rule is much more cautious about the correct inference
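The 0.458 / 0.084 figures can be reproduced by discounting and combining, as sketched below (illustrative code; the reliability r = 0.8 and the `dempster` helper are reconstructions of the procedure described above, not code from the slides):

```python
from itertools import product

def dempster(m1, m2):
    """Dempster's rule on dictionary mass functions."""
    out, conflict = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            out[A] = out.get(A, 0.0) + mB * mC
        else:
            conflict += mB * mC
    return {A: v / (1.0 - conflict) for A, v in out.items()}

Y, N = frozenset({'Y'}), frozenset({'N'})
theta = Y | N
r = 0.8  # assumed sensor reliability: 20% of each mass is discounted to theta
m1 = {Y: 0.9 * r, N: 0.1 * r, theta: 1 - r}
m2 = {Y: 0.1 * r, N: 0.9 * r, theta: 1 - r}
m = dempster(m1, m2)
# m[Y] and m[N] are both about 0.458, m[theta] about 0.084
```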

Generalised Bayes Theorem: generalising full Bayesian inference

in Smets' generalised Bayesian theorem setting, the input is a set of 'conditional' belief functions on X, rather than likelihoods p(x|θ):

    BelX(X|θ),   X ⊂ X, θ ∈ Θ

each associated with a value θ of the parameter
(these are not the same conditional belief functions we saw before, where a conditioning event B ⊂ Θ alters a prior belief function BelΘ, mapping it to BelΘ(·|B))
they can be seen as a parameterised family of BFs on the data
the desired output is another family of belief functions on Θ, parameterised by all sets of measurements X on X: BelΘ(A|X), ∀X ⊂ X
each piece of evidence mX(X|θ) has an effect on our beliefs on the parameters
this is coherent with the random set setting, as we condition on set-valued observations

Reasoning with belief functions

Generalised Bayes Theorem

Generalised Bayes Theorem Generalised Bayes Theorem Implements this inference BelX (X |θ) 7→ BelΘ (A|X ) by: 1

computing an intermediate family of BFs on X parameterised by sets of parameter values: Y ∪ θ∈A BelX (X |θ) = BelX (X |A) = BelX (X |θ) θ∈A ∪ via the disjunctive rule of combination

2 3

assuming that PlΘ (A|X ) = PlX (X |A) ∀A ⊂ Θ, X ⊂ X Y ¯ |θ) this yields BelΘ (A|X ) = BelX (X ¯ θ∈A

generalises Bayes’ rule (by replacing P with Pl) when priors are uniform

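The two GBT formulas above reduce, for a singleton observation x, to simple products over the parameter values. The sketch below is a minimal illustration of that special case; the function name and the plausibility values are made up, not part of the original presentation.

```python
# Hedged sketch of Smets' GBT for a single observation x.
# Input: pl_x[theta] = Pl_X({x}|theta) for each parameter value theta.
# Output: Bel_Theta(A|x) = prod over theta not in A of (1 - pl_x[theta]),
#         Pl_Theta(A|x)  = 1 - prod over theta in A of (1 - pl_x[theta]).
def gbt(pl_x, A):
    bel = 1.0
    pl_compl = 1.0
    for theta, pl in pl_x.items():
        if theta in A:
            pl_compl *= 1.0 - pl   # accumulates the product over theta in A
        else:
            bel *= 1.0 - pl        # accumulates the product over the complement of A
    return bel, 1.0 - pl_compl

# illustrative two-parameter frame: x is very plausible under t1, less so under t2
bel, pl = gbt({'t1': 0.9, 't2': 0.4}, {'t1'})
# bel = 1 - 0.4 = 0.6,  pl = 1 - (1 - 0.9) = 0.9
```

As expected, the resulting belief-plausibility pair brackets the posterior probability a Bayesian analysis with a precise likelihood would produce.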

Reasoning with belief functions

The total belief theorem

The total belief theorem
Generalising the law of total probability

- conditional belief functions are crucial for our approach to inference
- they are the complementary link of the chain: a generalisation of the law of total probability
- recall that a refining is a mapping from elements of one set Ω to elements of a disjoint partition of a second set Θ

[Figure: a prior belief function Bel0 : 2^Ω → [0, 1] on Ω, a refining ρ mapping each ωi ∈ Ω to a partition cell Πi of Θ, and conditional belief functions Beli : 2^Πi → [0, 1] on the cells]



The total belief theorem
Statement

Total belief theorem. Suppose Θ and Ω are two finite sets, and ρ : 2^Ω → 2^Θ is the unique refining between them. Let Bel0 be a belief function defined over Ω = {ω1, ..., ω|Ω|}, and suppose there exists a collection of belief functions Beli : 2^Πi → [0, 1], where Π = {Π1, ..., Π|Ω|}, Πi = ρ({ωi}), is the partition of Θ induced by Ω. Then there exists a belief function Bel : 2^Θ → [0, 1] such that:

1. Bel0 is the marginal of Bel to Ω, i.e. Bel0(A) = Bel(ρ(A)) for all A ⊆ Ω;
2. Bel ⊕ BelΠi = Beli for all i = 1, ..., |Ω|, where BelΠi is the logical belief function with mass mΠi(A) = 1 if A = Πi, and 0 otherwise

several distinct solutions exist, and they appear to form a graph with symmetries; one such solution is easily identifiable



The total belief theorem
Existence of a solution [Zhou & Cuzzolin, UAI 2017]

- assume Θ′ ⊇ Θ, and let m be a mass function over Θ
- m can be identified with a mass function m⃗Θ′ over the larger frame Θ′: for any E′ ⊆ Θ′, m⃗Θ′(E′) = m(E) if E′ = E ∪ (Θ′ \ Θ), and m⃗Θ′(E′) = 0 otherwise
- such an m⃗Θ′ is called the conditional embedding of m into Θ′
- let Bel⃗i be the conditional embedding of Beli into Θ, for all Beli : 2^Πi → [0, 1], and let Bel⃗ = Bel⃗1 ⊕ ··· ⊕ Bel⃗|Ω|

Total belief theorem: existence
The belief function Bel = Bel0↑Θ ⊕ Bel⃗ is a valid total belief function, where Bel0↑Θ denotes the vacuous extension of Bel0 to Θ.

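The conditional embedding step above is simple to implement: every focal element of m is enlarged with the part of the frame that m does not cover, masses unchanged. A minimal sketch (all names and mass values are illustrative):

```python
# Hedged sketch: conditional embedding of a mass function m on Theta
# into a larger frame Theta' (Theta' contains Theta).
# Each focal element E is mapped to E united with (Theta' \ Theta).
def conditional_embedding(masses, theta, theta_prime):
    extra = frozenset(theta_prime) - frozenset(theta)
    return {frozenset(E) | extra: v for E, v in masses.items()}

# embed a mass function on Theta = {a, b} into Theta' = {a, b, c}
m = {frozenset({'a'}): 0.7, frozenset({'a', 'b'}): 0.3}
emb = conditional_embedding(m, {'a', 'b'}, {'a', 'b', 'c'})
# focal elements become {a, c} and {a, b, c}, keeping masses 0.7 and 0.3
```

Embedding each Beli this way makes the conditional pieces combinable on the common frame Θ via Dempster's rule, as the existence proof requires.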

Reasoning with belief functions

Decision making

Decision making with belief functions

a decision problem can be formalised by defining:
- a set Ω of possible states of the world
- a set X of consequences
- a set F of acts, where an act is a function f : Ω → X mapping a world state to a consequence

problem: select an act f from an available list F (i.e., make a decision) which optimises a certain objective function
there are various approaches to decision making with belief functions; among them:
- decision making in the TBM, based on expected utility via the pignistic transform
- generalised expected utility [Gilboa], based on classical expected utility theory [Savage, von Neumann]

there is also a lot of interest in multicriteria decision making (based on a number of attributes)



Decision making with the pignistic probability

- classical expected utility theory is due to von Neumann
- in Smets' Transferable Belief Model, decision making is done by maximising the expected utility of actions, based on the 'pignistic' transform
- this maps a belief function Bel on Ω to a probability distribution there:

  BetP[Bel](ω) = Σ_{A ∋ ω} m(A) / |A|,  ∀ω ∈ Ω

- the set of possible actions F and the set Ω of possible outcomes are distinct, and the utility function u is defined on F × Ω
- the optimal decision maximises

  E[u] = Σ_{ω ∈ Ω} u(f, ω) BetP[Bel](ω)

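The pignistic transform above is straightforward to compute for a finite frame: each focal element shares its mass equally among its elements. A small sketch (the mass assignment is made up for illustration):

```python
# Hedged sketch of the pignistic transform:
# BetP[Bel](w) = sum over focal sets A containing w of m(A)/|A|.
# 'masses' maps subsets of the frame (frozensets) to mass values.
def pignistic(masses):
    betp = {}
    for focal, mass in masses.items():
        share = mass / len(focal)          # split m(A) equally among elements of A
        for omega in focal:
            betp[omega] = betp.get(omega, 0.0) + share
    return betp

# frame {a, b, c}: mass 0.5 on {a}, 0.3 on {a, b}, 0.2 on the whole frame
m = {frozenset({'a'}): 0.5,
     frozenset({'a', 'b'}): 0.3,
     frozenset({'a', 'b', 'c'}): 0.2}
betp = pignistic(m)
# betp['a'] = 0.5 + 0.3/2 + 0.2/3, and the values sum to 1
```

Note that the output is an ordinary probability distribution, so standard expected-utility maximisation applies to it directly.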


Savage's sure thing principle

let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
Savage (1954) showed that ≽ satisfies certain rationality requirements iff there exists a probability measure P on Ω and a utility function u : X → R such that

∀f, g ∈ F,  f ≽ g ⇔ EP(u ◦ f) ≥ EP(u ◦ g)

does that mean that using belief functions is irrational?
given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by

(fEh)(ω) = f(ω) if ω ∈ E,  h(ω) if ω ∉ E

then the sure thing principle states that ∀E, ∀f, g, h, h′: fEh ≽ gEh ⇒ fEh′ ≽ gEh′
Ellsberg's paradox: empirically, the Sure Thing Principle is violated! this is because people are averse to second-order uncertainty

Belief functions: A gentle introduction

Seoul, Korea, 30/05/18

92 / 125


Ellsberg's paradox

suppose you have an urn containing 30 red balls and 60 balls that are either black or yellow, and four gambles:
- f1: you receive 100 euros if you draw a red ball
- f2: you receive 100 euros if you draw a black ball
- f3: you receive 100 euros if you draw a red or yellow ball
- f4: you receive 100 euros if you draw a black or yellow ball

in this example Ω = {R, B, Y}, fi : Ω → R and X = R
empirically, most people strictly prefer f1 to f2, but they strictly prefer f4 to f3

       R     B     Y
f1    100    0     0
f2     0    100    0
f3    100    0    100
f4     0    100   100

now, pick E = {R, B}: by definition f1{R, B}0 = f1, f2{R, B}0 = f2, f1{R, B}100 = f3, f2{R, B}100 = f4
since f1 ≻ f2, i.e. f1{R, B}0 ≻ f2{R, B}0, the Sure Thing Principle would imply f1{R, B}100 ≻ f2{R, B}100, i.e., f3 ≻ f4
empirically, the Sure Thing Principle is violated!

Belief functions: A gentle introduction

Seoul, Korea, 30/05/18

93 / 125
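The composite acts used in the argument above can be checked mechanically: the sketch below verifies that f1{R, B}100 coincides with f3 on the Ellsberg frame (the helper names are mine, not from the original slides).

```python
# Hedged sketch of Savage's composite act: (fEh)(w) = f(w) if w in E else h(w)
def fEh(f, h, E):
    return lambda w: f(w) if w in E else h(w)

# Ellsberg acts as payoff tables over Omega = {R, B, Y}
f1 = {'R': 100, 'B': 0, 'Y': 0}.get
f3 = {'R': 100, 'B': 0, 'Y': 100}.get

g = fEh(f1, lambda w: 100, {'R', 'B'})    # the composite act f1{R,B}100
same = all(g(w) == f3(w) for w in 'RBY')  # True: f1{R,B}100 = f3
```

The same two-liner confirms f2{R, B}100 = f4, which is all the sure-thing argument needs.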


Lower and upper expected utilities

Gilboa (1987) proposed a modification of Savage's axioms
a preference relation ≽ meets these weaker requirements iff there exists a (not necessarily additive) measure µ and a utility function u : X → R such that

∀f, g ∈ F,  f ≽ g ⇔ Cµ(u ◦ f) ≥ Cµ(u ◦ g),

where Cµ is the Choquet integral, defined for X : Ω → R as

Cµ(X) = ∫_{−∞}^{0} [µ(X(ω) ≥ t) − 1] dt + ∫_{0}^{+∞} µ(X(ω) ≥ t) dt

given a belief function Bel on Ω and a utility function u, this theorem supports making decisions based on the Choquet integral of u with respect to Bel
for finite Ω, it can be shown that

CBel(u ◦ f) = Σ_{B ⊆ Ω} m(B) min_{ω ∈ B} u(f(ω)),   CPl(u ◦ f) = Σ_{B ⊆ Ω} m(B) max_{ω ∈ B} u(f(ω))

(the lower and upper expectations of u ◦ f with respect to Bel)

Belief functions: A gentle introduction

Seoul, Korea, 30/05/18

94 / 125
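For finite Ω the two Choquet sums above reduce to a mass-weighted min/max over focal elements. The sketch below evaluates them for the Ellsberg act f2, encoding the urn as a belief function with m({R}) = 1/3 and m({B, Y}) = 2/3 (the function name is illustrative):

```python
# Hedged sketch: lower/upper (Choquet) expected utilities for finite frames.
# C_Bel(u o f) = sum_B m(B) * min over B of u(f(w));  C_Pl uses max instead.
# masses: dict frozenset -> m(B);  utility: dict w -> u(f(w)).
def choquet_bounds(masses, utility):
    lower = sum(m * min(utility[w] for w in B) for B, m in masses.items())
    upper = sum(m * max(utility[w] for w in B) for B, m in masses.items())
    return lower, upper

# Ellsberg urn: 30 red balls out of 90, the other 60 black or yellow
m = {frozenset({'R'}): 1/3, frozenset({'B', 'Y'}): 2/3}
u_f2 = {'R': 0, 'B': 100, 'Y': 0}   # act f2: 100 euros on a black ball
lo, hi = choquet_bounds(m, u_f2)
# lo = 0.0 and hi = 200/3: the expected payoff of f2 lies in [0, 66.67]
```

The wide interval for f2 (against the precise value 100/3 for f1) is exactly the ambiguity that drives the empirical preference f1 ≻ f2.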


Decision making
Possible strategies

let P(Bel), as usual, be the set of probability measures P compatible with Bel, i.e., such that Bel ≤ P; then, it can be shown that

CBel(u ◦ f) = min_{P ∈ P(Bel)} EP(u ◦ f) = E̲(u ◦ f),   CPl(u ◦ f) = max_{P ∈ P(Bel)} EP(u ◦ f) = Ē(u ◦ f)

two expected utilities E̲(f) and Ē(f): how do we make a decision? possible decision criteria based on interval dominance: