Belief functions: Random sets for the working scientist
An IJCAI 2016 Tutorial
Fabio Cuzzolin
Department of Computing and Communication Technologies, Oxford Brookes University, UK
This is what the tutorial will look like
Tutorial web site: http://cms.brookes.ac.uk/staff/FabioCuzzolin/ijcai2016.html
Uncertainty
Outline
1. Uncertainty: Nature of uncertainty; Mathematical probability; Interpretations of probability; Frequentist interpretation; Bayesian interpretation; Bayesians vs frequentists
2. Beyond probability: It’s the data, stupid!; Missing data; Propositional data; Scarce data; Pure data; No data (ignorance); Unusual (rare) data; Uncertain data; Knightian uncertainty
3. Understanding – a mathematical theory of evidence: Belief functions; Dempster’s combination; Families of frames; Interpretations; Misunderstandings
4. Building: Dempster’s approach; Likelihood-based inference; From preferences; Coin toss revised
5. Reasoning: Combining; Conditioning; Belief vs Bayesian reasoning; Generalised Bayes Theorem; Graphical models
6. Putting (in context): Derived frameworks; Uncertainty theories
7. Using belief functions: Decision making; Classification; Ranking aggregation; Applications – Regression (computer vision), Prediction (climate change)
8. Challenges: Efficient computation; Belief functions on reals
9. New horizons: Upper and lower likelihood; Generalising logistic regression and rare events; Frequentist inference with RS; Central limit theorem; The total belief theorem; Random set random variables; A new machine learning; Climatic change models; A geometry of uncertainty
10. Summarising
Nature of uncertainty
What is uncertainty?
uncertainty → lack of information or imperfect information: a state of limited knowledge, where it is impossible to exactly describe the existing state or future outcomes
Uncertainty is widespread
“There are some things that you know to be true, and others that you know to be false; yet, despite this extensive knowledge that you have, there remain many things whose truth or falsity is not known to you. We say that you are uncertain about them. You are uncertain, to varying degrees, about everything in the future; much of the past is hidden from you; and there is a lot of the present about which you do not have full information. Uncertainty is everywhere and you cannot escape from it.” Dennis Lindley, Understanding Uncertainty (2006)
Two types of uncertainty
- the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability
- different probability interpretations treat predictable and unpredictable variation differently
- this is also referred to as the distinction between common-cause and special-cause variation
- it has consequences on human behaviour: people are averse to unpredictable variation (Ellsberg’s paradox, see Decision making)
‘Knightian’ Uncertainty
- ‘second order’ uncertainty: being uncertain about our very model of uncertainty
- if (a big ‘if’) uncertainty is modelled by probabilities: being uncertain about the ‘correct’ probability model
‘Knightian’ Uncertainty
Chicago economist Frank Knight distinguished ‘risk’ from ‘uncertainty’:
“Uncertainty must be taken in a sense radically distinct from the familiar notion of risk, from which it has never been properly separated. ... The essential fact is that ‘risk’ means in some cases a quantity susceptible of measurement, while at other times it is something distinctly not of this character; and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating. ... It will appear that a measurable uncertainty, or ‘risk’ proper, as we shall use the term, is so far different from an unmeasurable one that it is not in effect an uncertainty at all.”
“You cannot be certain about uncertainty”
in Knight’s terms: risk = probability, uncertainty = second-order uncertainty
‘Knightian’ Uncertainty
- risk → a consequence of an action taken in the presence of uncertainty
- some models of uncertainty use the human propensity to act as a measure of uncertainty
Mathematical probability
Probability measures
- the mainstream mathematical theory of (first order) uncertainty: mathematical (measure-theoretical) probability, mainly due to the Russian mathematician Andrey Kolmogorov
- probability is an application of measure theory, the theory of assigning numbers to sets
- additive probability measure → mathematical representation of the notion of chance
- it assigns a probability value to every subset of a collection of possible outcomes (of a random experiment, of a decision problem, etc.)
- the collection of outcomes Ω is called the sample space or universe; a subset A of the universe is called an event
Example: the spinning wheel
- typical example: a spinning wheel with 3 possible outcomes, universe Ω = {1, 2, 3}
- eight possible events (the subsets of Ω), including the empty set
- the probability of ∅ is 0, the probability of Ω is 1
- additivity holds: P({1, 2}) = P({1}) + P({2})
Probability measures
- probability measure µ: a real-valued function on a probability space that satisfies countable additivity
- probability space: a triplet (Ω, F, P) formed by a universe Ω, a σ-algebra F of its subsets, and a probability measure P on F
  - not all subsets of Ω necessarily belong to F
- axioms of probability measures:
  - µ(∅) = 0, µ(Ω) = 1
  - 0 ≤ µ(A) ≤ 1 for all events A ∈ F
  - additivity: for every countable collection of pairwise disjoint events A_i, µ(∪_i A_i) = Σ_i µ(A_i)
Random variable
- a variable whose value is subject to random variations, i.e. due to ‘chance’ (what chance is, is subject to philosophical debate!)
- it can take one of a set of possible values, each with a given probability
- mathematically, it is a function X from a sample space Ω (which forms a probability space) to (usually) the real line
- it is subject to a condition of measurability: each range of values of the real line must have an anti-image in Ω which has a probability value
- this way, we can forget about the initial probability space and record the probabilities of the various values of X
(Discrete) random variable: example
- the sample space is the set of outcomes of rolling two dice: Ω = {(1, 1), (1, 2), ..., (6, 5), (6, 6)}
- a random variable can be the function that associates each roll of the two dice with the sum S of the two faces
- random variables can be discrete or continuous – this one is discrete
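A short Python sketch of this random variable: it enumerates the 36-element sample space and pushes the uniform probability through the map S, yielding the distribution of the sum.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # (1,1), (1,2), ..., (6,6)
S = {w: w[0] + w[1] for w in omega}            # the random variable: sum of faces

# distribution of S under the uniform measure on the 36 outcomes
counts = Counter(S[w] for w in omega)
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
print(pmf[7])   # Fraction(1, 6): six of the 36 rolls sum to 7
```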
Cumulative Distribution Function (CDF) of a random variable: F(x) = P(X ≤ x)
example: the CDF of a Gaussian random variable
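A two-line illustration of the definition with SciPy, for a standard Gaussian:

```python
from scipy.stats import norm

# F(x) = P(X <= x) for a standard Gaussian random variable
print(norm.cdf(0.0))    # 0.5: half the probability mass lies below the mean
print(norm.cdf(1.96))   # ~0.975
```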
Probability Density Function of a continuous random variable
- a random variable is called continuous when it can assume values in a non-countable set (e.g. the real line)
- it is described by a probability density function (PDF), which describes the likelihood of the variable taking any given value
- the probability of any range of values (e.g., an interval) is the integral of the PDF over the range: P([a, b]) = ∫_a^b f(x) dx
example: the PDF of a Gaussian random variable
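A quick numerical check of the statement above, again for a standard Gaussian: integrating the PDF over [a, b] agrees with the CDF difference F(b) − F(a).

```python
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
integral, _ = quad(norm.pdf, a, b)     # P([a, b]) as an integral of the PDF
print(integral)                        # ~0.6827
print(norm.cdf(b) - norm.cdf(a))       # the same value via the CDF
```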
Radon–Nikodym derivative (measure-theoretic probability theory)
- a continuous random variable with values in a measurable space (X, A) (usually R^n with the Borel sets as measurable family) has as probability distribution the pushforward measure X_*P on (X, A)
- formally, the probability density function of X is the Radon–Nikodym derivative f = dX_*P / dµ, where µ is a reference measure on (X, A)
- that is, f is any measurable function such that P(X ∈ A) = ∫_A f dµ
- it is analogous to a derivative in calculus
- modern mathematical probability is really just an application of measure theory
- the measure-theoretic approach allows us to unify the discrete and continuous cases, making the difference just a question of which reference measure is used
Law of large numbers
- describes what happens when you repeat the same random experiment an increasing number of times n
- the average of the results (sample mean) X̄_n = (X_1 + ... + X_n)/n should be close to the expected value (actual mean) µ
- probabilities become predictable as we run the same trial more and more times!
- strong law: P(lim_{n→∞} X̄_n = µ) = 1
- weak law: lim_{n→∞} P(|X̄_n − µ| > ε) = 0 for every ε > 0
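A simulation sketch of the law for a fair die (expected value 3.5): the sample mean drifts towards 3.5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n)   # n rolls of a fair die
    print(n, rolls.mean())               # sample mean approaches 3.5
```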
Central limit theorem the mean of a sufficiently large number of iterates of independent random variables is normally (Gaussian) distributed let X1 , ..., Xn independent and identically distributed random variables with the same mean µ and variance σ 2 X1 + ... + Xn we can build the sample average as X n = n √ the random variable n(X n − µ) tends to a Gaussian with mean 0 and variance σ2
Interpretations of probability
Does probability really exist? That sinking feeling
- what is probability, really? is it just the name we give to our ignorance/limitedness?
- can it be that, with sufficient information, any phenomenon is predictable in a deterministic way?
- Einstein: God does not play dice
- in E.E. ‘Doc’ Smith’s Lensman series, the Arisians have such mental powers that they compete on foreseeing future events to the tiniest detail
Does probability really exist? That sinking feeling
- the principles of quantum mechanics seem to suggest that probability is not just a figment of our mathematical imagination, or a representation of our ignorance
- the workings of the physical world seem to be inherently probabilistic
- we will come back to this later
Interpretations of probability: Savage’s take
even assuming that probability is inherent to the physical world, people cannot agree on what it is “It is unanimously agreed that statistics depends somehow on probability. But, as to what probability is and how it is connected with statistics, there has seldom been such complete disagreement and breakdown of communication since the Tower of Babel. Doubtless, much of the disagreement is merely terminological and would disappear under sufficiently sharp analysis.” L.J. Savage, 1954
Interpretations of probability: frequentist, subjective and behavioural
- as a result, probability has multiple competing interpretations:
- an objective description of frequencies of events (meaning ‘things that happen’) at a certain persistent rate, or ‘relative frequency’ → frequentist interpretation [Fisher, Pearson]
- a degree of belief in events (interpreted as statements/propositions on the state of the world), regardless of any random process → Bayesian or evidential probability [de Finetti, Savage]
- the propensity of an agent to act (or gamble, or decide) in case the event happens → behavioural probability [Walley, Vovk]
- neither frequentist nor Bayesian probability is in contrast with the classical mathematical definition of probability – others are (as we will see)
‘Classical’ probability
- championed by Pierre-Simon Laplace
- if a random experiment can result in N mutually exclusive and equally likely outcomes, and if N_A of these outcomes result in the occurrence of the event A, the probability of A is defined by P(A) = N_A / N
- works only for a finite number of possible outcomes
- you need to determine in advance that all the possible outcomes are equally likely, without relying on the notion of probability, to avoid circularity
Frequentist interpretation
Frequentist probability
- the (aleatory) probability of an event is its relative frequency in time
- when tossing a fair coin, frequentists say that the probability of getting heads is 1/2, not because there are two equally likely outcomes, but because repeated series of large numbers of trials demonstrate that the empirical frequency converges to the limit 1/2 as the number of trials goes to infinity:
  P(A) = lim_{n→∞} n_A / n  (≠ N_A / N)
  where n is the number of trials and n_A the number of those resulting in A
- it is clearly impossible to actually perform an infinity of repetitions of a random experiment
- hence: we can only measure an approximation of the ‘actual’ probability (whatever that is)
- what are the consequences for inference?
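A simulation sketch of this limit: the empirical frequency of heads over longer and longer runs of fair-coin tosses.

```python
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=1_000_000)   # 1 = heads, 0 = tails
for n in (100, 10_000, 1_000_000):
    print(n, tosses[:n].mean())               # n_A / n tends to 1/2
```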
Frequentist inference
- the frequentist interpretation offers guidance in the design of practical ‘random’ experiments
- developed by Fisher, Pearson and Neyman
- three main tools: statistical hypothesis testing, model selection, confidence interval analysis
Statistical hypothesis testing (frequentist inference)
- a statistical hypothesis is a hypothesis that is testable by observing a process modelled via a set of random variables
- statistical hypothesis testing:
  - a data set obtained by sampling is compared against synthetic data from an idealised model
  - a hypothesis is proposed for the statistical relationship between the two data sets
  - this is compared, as an alternative, to an idealised null hypothesis proposing no relationship between the two data sets
  - the comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realisation of the null hypothesis according to a threshold probability – the significance level
- as an alternative we can do model selection: statistical hypothesis testing is a form of confirmatory data analysis, as opposed to exploratory data analysis, which does not rely on pre-specified hypotheses
The testing process (statistical hypothesis testing)
1. state the research hypothesis
2. state the relevant null and alternative hypotheses
3. state the statistical assumptions being made about the sample, e.g. assumptions about statistical independence or about the form of the distributions of the observations
4. state the relevant test statistic T (a quantity derived from the sample)
5. derive the distribution of the test statistic under the null hypothesis from the assumptions
6. set a significance level (α), i.e. a probability threshold below which the null hypothesis will be rejected
7. compute from the observations the observed value t_obs of the test statistic T
8. calculate the p-value, the probability (under the null hypothesis) of sampling a test statistic at least as extreme as the observed value
9. reject the null hypothesis, in favour of the alternative hypothesis, if and only if the p-value is less than the significance level threshold
(a minimal worked example of these steps follows below)
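A minimal worked instance of steps 4–9, assuming a two-sided one-sample t-test on hypothetical data (H0: the population mean is 5.0):

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.3, 5.0])
mu0, alpha = 5.0, 0.05      # null hypothesis and significance level (step 6)

# test statistic T and its p-value under H0 (steps 7-8); the distribution
# of T under H0 (step 5) is Student's t with n - 1 degrees of freedom
t_obs, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_obs, p_value)

# step 9: reject H0 iff the p-value falls below alpha
print("reject H0" if p_value < alpha else "evidence insufficient: no conclusion")
```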
Statistical hypothesis testing: sketch [figure]
Type I and type II errors in statistical hypothesis testing [figure]
Statistical hypothesis testing: interpretation
- example: given the observed data, and assuming a (parameterised) probability distribution generating them of which we do not know the parameter value, we test hypotheses on the value of the parameter
- output: yes or no (a binary decision)
- interpretation: if the p-value is less than the required significance level, the null hypothesis is rejected; if not, the test has no result – the evidence is insufficient to support a conclusion
- a reductio ad absurdum argument adapted to statistics: a claim is shown to be valid by demonstrating the improbability of the consequences of its opposite
- modern hypothesis testing is in fact a hybrid of two seminal proposals, by Fisher and by Neyman/Pearson
P-values and error rates (statistical hypothesis testing)
- American Statistical Association: “The widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
- p-value: the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed
- the reason is that, for continuous random variables, P(X = x|H) = 0, so we distinguish (see the sketch below):
  - right-tail event {X ≥ x} → p = P(X ≥ x|H)
  - left-tail event {X ≤ x} → p = P(X ≤ x|H)
  - double-tailed event: the ‘smaller’ of {X ≤ x} and {X ≥ x}
- α is the rate of falsely rejecting the null hypothesis (type I error)
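The three conventions, sketched for a hypothetical observed statistic z under a standard-Gaussian null:

```python
from scipy.stats import norm

z = 1.7                              # hypothetical observed statistic, H ~ N(0, 1)
p_right = norm.sf(z)                 # right tail:  P(X >= z | H)
p_left = norm.cdf(z)                 # left tail:   P(X <= z | H)
p_two = 2 * min(p_left, p_right)     # double-tailed convention
print(p_right, p_left, p_two)
```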
The notion of p-value and its misunderstandings
- the p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false: frequentist statistics does not and cannot attach probabilities to hypotheses
Maximum Likelihood Estimation (MLE)
- the term ‘likelihood’ was popularised in mathematical statistics by Ronald Fisher in 1922: ‘On the mathematical foundations of theoretical statistics’
- Fisher argues against ‘inverse’ (Bayesian) probability as a basis for statistical inferences, and instead proposes inferences based on likelihood functions
- likelihood principle: all of the evidence in a sample relevant to model parameters is contained in the likelihood function
- this is still hotly debated [Mayo, Gandenberger]
- maximum likelihood estimation:
  θ̂_mle ∈ arg max_{θ∈Θ} L(θ; x_1, ..., x_n)
  where L(θ; x_1, ..., x_n) = f(x_1, x_2, ..., x_n | θ) and {f(·|θ), θ ∈ Θ} is a parametric model
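A numerical sketch of the definition, assuming a Gaussian parametric model f(·|θ) = N(θ, 1) and hypothetical data; for this model the MLE coincides with the sample mean, which the optimiser recovers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])    # hypothetical observations

def neg_log_likelihood(theta):
    # -log L(theta; x_1, ..., x_n) for the model N(theta, 1)
    return -np.sum(norm.logpdf(x, loc=theta[0], scale=1.0))

theta_mle = minimize(neg_log_likelihood, x0=[0.0]).x[0]
print(theta_mle, x.mean())                  # both ~1.06
```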
Maximum Likelihood: properties
- maximum-likelihood estimators have no optimality properties for finite samples
- however, they do have good limiting properties: consistency, asymptotic normality, efficiency
- consistency: the sequence of MLEs converges in probability, for a sufficiently large number of observations, to the (actual) value being estimated
- asymptotic normality: as the sample size increases, the distribution of the MLE tends to a Gaussian distribution centred on the true parameter (under a number of conditions)
- efficiency: it achieves the Cramér–Rao lower bound when the sample size tends to infinity, i.e. no consistent estimator has lower asymptotic mean squared error than the MLE
Bayesian interpretation
Subjective probability
- (epistemic) probability = degree of belief of an individual assessing the state of the world
- Ramsey and de Finetti → subjective beliefs must follow the laws of probability if they are to be coherent (if this ‘proof’ were watertight, we would not be here in front of you!)
- also, evidence casts doubt on whether humans hold coherent beliefs or behave rationally
Are humans rational and/or coherent?
- this guy (Daniel Kahneman) won a Nobel prize supporting the exact opposite, in collaboration with Amos Tversky
- people pursue courses of action which are bound to damage them
- people do not understand the full consequences of their actions
https://en.wikipedia.org/wiki/Daniel_Kahneman
Bayesian probability
- in the Bayesian view, a probability is assigned to a hypothesis, whereas under frequentist inference a hypothesis is typically tested without being assigned a probability
- it is a special case of evidential probability: some prior probability is updated to a posterior probability in the light of new evidence (data)
- once again, it makes use of mathematical probabilities
- it needs to specify a prior probability distribution, taking into account the available (prior) information
- it sequentially uses Bayes’ rule to compute a posterior distribution as more data become available
Bayes’ rule (Bayesian probability): P(A|B) = P(B|A) P(A) / P(B) [figure]
Some history
- Thomas Bayes (1702–1761) proved a special case of what is now called Bayes’ theorem in a paper titled “An Essay towards solving a Problem in the Doctrine of Chances”
- Pierre-Simon Laplace (1749–1827) introduced a general version of the theorem
- Jeffreys’ “Theory of Probability” (1939) played an important role in the revival of the Bayesian view of probability, followed by works by Abraham Wald (1950) and Leonard J. Savage (1954)
- de Finetti: a Dutch book is made when a clever gambler places a set of bets that guarantee a profit, no matter what the outcome of the bets; if a bookmaker follows the rules of the Bayesian calculus, a Dutch book cannot be made
  - (however, Dutch book arguments leave open the possibility that non-Bayesian updating rules could avoid Dutch books)
- justification by axiomatisation has been tried, but with no great success
Bayesian inference
- the prior distribution is the distribution of the parameter(s) before any data are observed, i.e. p(θ | α); it depends on a vector of hyperparameters α
- the likelihood is the distribution of the observed data conditional on the parameters, i.e. p(X | θ)
- the marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalised over the parameter(s):
  p(X | α) = ∫_θ p(X | θ) p(θ | α) dθ
- the posterior distribution is the distribution of the parameter(s) after taking into account the observed data, as determined by Bayes’ rule:
  p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α)
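The four ingredients in the conjugate Beta-Binomial case (hypothetical prior Beta(2, 2) on the heads probability θ, and 14 heads observed in 20 tosses), where everything is available in closed form:

```python
import numpy as np
from scipy.special import betaln, comb
from scipy.stats import beta

a0, b0 = 2.0, 2.0        # prior p(theta | alpha) = Beta(a0, b0)
n, k = 20, 14            # data X: k heads in n tosses (binomial likelihood)

# posterior p(theta | X, alpha) = Beta(a0 + k, b0 + n - k), by conjugacy
a1, b1 = a0 + k, b0 + (n - k)
print(beta(a1, b1).mean())       # posterior mean of theta, ~0.667

# marginal likelihood p(X | alpha) = C(n, k) B(a1, b1) / B(a0, b0)
log_evidence = np.log(comb(n, k)) + betaln(a1, b1) - betaln(a0, b0)
print(np.exp(log_evidence))
```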
Bayesian prediction
- the posterior predictive distribution is the distribution of a new data point, marginalised over the posterior:
  p(x̃ | X, α) = ∫_θ p(x̃ | θ) p(θ | X, α) dθ
- a distribution over possible data values is obtained
- by comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s) – e.g., by maximum likelihood or maximum a posteriori (MAP) estimation – which does not account for any uncertainty in the value of the parameter
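Continuing the Beta-Binomial sketch: the posterior predictive probability that the next toss is heads, obtained by integrating the Bernoulli likelihood against the Beta posterior, is simply the posterior mean of θ.

```python
a1, b1 = 16.0, 8.0                 # posterior Beta(16, 8) from the previous sketch
p_next_heads = a1 / (a1 + b1)      # the predictive integral in closed form
print(p_next_heads)                # ~0.667
```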
Maximum A Posteriori (MAP) estimation
- again we want to estimate the parameter θ of a parametric model
- assume that a prior distribution g over θ exists – then the posterior is
  f(θ | x) = f(x | θ) g(θ) / ∫_{ϑ∈Θ} f(x | ϑ) g(ϑ) dϑ
- maximum a posteriori estimation then estimates θ as the mode of this posterior distribution; since the denominator does not depend on θ:
  θ̂_MAP(x) = arg max_θ f(θ | x) = arg max_θ f(x | θ) g(θ)
- MAP and MLE estimates coincide when the prior g is uniform
- it is not very representative of Bayesian methods, as the latter are characterised by the use of full distributions to draw inferences
- also, unlike ML estimates, the MAP estimate is not invariant under reparameterisation
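In the same Beta-Bernoulli sketch, the MAP estimate is the mode of the Beta posterior, (a − 1)/(a + b − 2) for a, b > 1; note how it differs from the MLE k/n and is pulled towards the prior:

```python
a1, b1 = 16.0, 8.0                        # posterior Beta(16, 8)
theta_map = (a1 - 1) / (a1 + b1 - 2)      # mode of Beta(a, b) for a, b > 1
theta_mle = 14 / 20                       # MLE from the same data
print(theta_map, theta_mle)               # ~0.682 vs 0.700
```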
Bayesians vs frequentists
Bayesian vs frequentist inference
- in frequentist inference, unknown parameters are often (but not always) treated as having fixed but unknown values, not capable of being treated as random variates
- Bayesian inference allows probabilities to be associated with unknown parameters
- the frequentist approach does not depend on a subjective prior that may vary from one investigator to another
- however, Bayesian machinery (e.g. Bayes’ rule) can be used by frequentists
- see www.stat.ufl.edu/~casella/Talks/BayesRefresher.pdf
Lindley’s paradox (Bayesian vs frequentist hypothesis testing)
- Lindley’s paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution
- it is not really a paradox – the two approaches answer fundamentally different questions
- Lindley’s paradox¹ occurs when:
  - the result x is ‘significant’ by a frequentist test of H0, indicating sufficient evidence to reject H0 at, say, the 5% level, and
  - the posterior probability of H0 given x is high, indicating strong evidence that H0 is in better agreement with x than H1
- this can happen when H0 is very specific, H1 less so, and the prior distribution does not strongly favour one or the other
¹ onlinelibrary.wiley.com/doi/10.1002/0470011815.b2a15076/pdf
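A numerical sketch of the paradox for a coin, assuming H0: p = 1/2 against H1: p uniform on [0, 1], prior mass 1/2 on each hypothesis, and 5,100 heads observed in 10,000 tosses (numbers chosen for illustration):

```python
from scipy.stats import binom, norm

n, x = 10_000, 5_100

# frequentist two-sided test of H0: z = 2.0, p ~ 0.046, 'significant' at 5%
z = (x - n / 2) / (0.5 * n ** 0.5)
print(2 * norm.sf(abs(z)))

# Bayes factor P(x | H0) / P(x | H1); under the uniform H1, P(x | H1) = 1/(n + 1)
bf = binom.pmf(x, n, 0.5) * (n + 1)
print(bf, bf / (1 + bf))   # BF ~ 10.8 in favour of H0; posterior P(H0 | x) ~ 0.92
```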
Bayesian vs frequentist inference
- it is not that they are different ways of solving the same problem: they are really designed to solve different problems!
- the result of a Bayesian approach can be a probability distribution on the parameters given the results of the experiment
- the result of a frequentist approach is either:
  - a ‘true or false’ (binary) conclusion from a significance test, or
  - a conclusion in the form that a given sample-derived confidence interval covers the true value
- either of these conclusions has a given probability of being correct
Bayesian vs frequentist for regression problems [figure]
Beyond probability
It’s the data, stupid!
Something is wrong?
- measure-theoretical mathematical probability is not general enough:
  - cannot (properly) model missing data
  - cannot (properly) model propositional data
  - cannot really model unusual data (second order uncertainty)
- the frequentist approach to probability:
  - cannot really model pure data (without ‘design’)
  - in a way, cannot even properly model continuous data
  - models scarce data only asymptotically
- Bayesian reasoning has several limitations:
  - cannot model no data (ignorance)
  - cannot model uncertain data
  - cannot model pure data (without a prior)
  - again, cannot properly model scarce data (only asymptotically)
It’s all about the data! What probability does not do so well
- model missing data
  - canonical examples: the cloaked die, occluded dice
- model interval or propositional data (e.g., in engineering)
  - canonical example: the reliability of witnesses in a trial
- properly model scarce data
  - paramount example: training in machine learning
- model pure data
  - without priors or designed experiments
- model no data (ignorance)
- model unusual data (the statistics of rare events)
  - extinct dinosaurs and black swans
- perform prediction under huge (Knightian?) uncertainties
  - making politicians happy
Fisher has not got it all right
- the setting is arguable:
  - the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals)
  - the criterion is arbitrary: who decides what an ‘extreme’ realisation is (choice of α)? what is the deal with 0.05 and 0.01?
  - the whole ‘tail’ idea comes from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero – it seems to patch an underlying problem with the way probability is mathematically defined
- it cannot cope with pure data, without assumptions on the process (experiment) which generated them (we will come back to this later)
- it deals with scarce data only asymptotically (see ‘scarce data’)
The problem(s) with Bayes
- pretty bad at representing ignorance:
  - Fisher: uninformative priors are just not adequate
  - different results on different parameter spaces
- Bayes’ rule assumes the new evidence comes in the form of certainty: “A is true”
  - in the real world, this is often not the case
- beware the prior! → model selection in Bayesian statistics
  - results from a confusion between the original subjective interpretation and the objectivist view of a rigorous objective procedure
  - why should we ‘pick’ a prior? either there is prior knowledge (beliefs) or there is not
  - all will be fine, in the end! asymptotically, the choice of the prior does not matter (really!)
Missing data
The cloaked die: the die as a random variable
- a die is a simple example of a (discrete) random variable
- there is a probability space Ω = {face1, face2, ..., face6} which maps to the real numbers 1, 2, ..., 6 (no need to worry about measurability here)
The cloaked die: observations which are sets
- now imagine that face1 and face4 are cloaked, and we roll the die
- the same probability space Ω = {face1, face2, ..., face6} is still there (nothing has changed in the way the die works)
- however, the mapping is now different: both face1 and face4 are mapped to the set of possible values {1, 4} (since we cannot observe the outcome)
- mathematically, this is called a random set [Matheron, Kendall, Nguyen], i.e. a set-valued random variable
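A minimal sketch of the cloaked die as a random set: pushing the uniform distribution on the six faces through the set-valued mapping yields a mass assignment over subsets of {1, ..., 6} – exactly the kind of object the rest of the tutorial calls a belief function.

```python
from collections import defaultdict
from fractions import Fraction

# face1 and face4 are cloaked, so both map to the set {1, 4};
# every other face still maps to its singleton value
mapping = {f: frozenset({1, 4}) if f in (1, 4) else frozenset({f})
           for f in range(1, 7)}

mass = defaultdict(Fraction)
for face in range(1, 7):                  # uniform P on the sample space
    mass[mapping[face]] += Fraction(1, 6)
print(dict(mass))   # {1,4}: 1/3, {2}: 1/6, {3}: 1/6, {5}: 1/6, {6}: 1/6
```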
Occluded dice: a more realistic scenario
- a more realistic scenario is one in which we roll, say, four dice
- for some of them, the top face might be occluded, but some of the side faces will still be visible, providing information
- e.g. I see the top faces of the Red, Green and Purple dice, but, say, I cannot see the outcome of the Blue die
- however, I can see some side faces of Blue, therefore the outcome of Blue is the set {2, 4, 5, 6}
Missing data and random sets
- the bottom line is: whenever data are missing, observations are inherently set-valued
- mathematically, we are not sampling a (scalar) random variable; we are sampling a set-valued random variable: a random set
- if outcomes are sets, the probability distribution has to be defined over sets
- missing data appears (or disappears?) everywhere in science and engineering, e.g. occlusions in computer vision
Dealing with missing data: traditional approaches
- traditional statistical approaches deal with missing data in one of the following ways:
- deletion: most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results
- single imputation: replacing a missing value with another, for instance:
  - from a randomly selected similar record in the same dataset
  - or selecting donors from another dataset
  - with the mean of that variable for all other cases (does not change the sample mean)
  - using a regression model (does not represent residual variance well)
  - using stochastic regression
Dealing with missing data: traditional approaches
- multiple imputation [Rubin]: averaging the outcomes across multiple imputed data sets (using, for instance, stochastic regression)
  - involves drawing values of the parameters from a posterior distribution
  - hence, it simulates both the process generating the data and the uncertainty associated with the parameters of the probability distribution of the data

Missing data with random sets: no need for imputation or deletion whatsoever – all observations are set-valued, some of them just happen to be pointwise.
Propositional data
Reliable witnesses: evidence supporting propositions
- suppose there is a murder, and three people are on trial for it: Peter, John and Mary
- our hypothesis space is therefore Θ = {Peter, John, Mary}
- there is a witness: he testifies that the person he saw was a man
- this amounts to supporting the proposition A = {Peter, John} ⊂ Θ
- should we take this testimony at face value? in fact, the witness was tested, and the machine reported an 80% chance that he was sober when he reported the crime
- we should partly support the (vacuous) hypothesis that any one among Peter, John and Mary could be the murderer: it is natural to assign 80% chance to proposition A, and 20% chance to proposition Θ
Dealing with propositional evidence
- even when evidence (data) supports propositions, Kolmogorov’s probability forces us to specify support for individual outcomes
- this is unreasonable – an artificial constraint due to a mathematical model that is not general enough
  - we have no elements to assign this 80% probability to either Peter or John, nor to distribute it among them
- the cause is the additivity of probability measures: but this is not the most general type of measure for sets
- under a minimal requirement of monotonicity, set measures can still be suitable to describe probabilities of events: these objects are called capacities
- in particular, random sets are capacities in which the numbers assigned to subsets are given by a probability distribution

Belief functions and propositional evidence: as capacities (and random sets in particular), belief functions allow us to assign mass directly to propositions.
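A sketch of the witness example as such a mass assignment, together with the induced lower and upper measures (belief and plausibility, defined formally in the Understanding part):

```python
theta = frozenset({"Peter", "John", "Mary"})
A = frozenset({"Peter", "John"})
m = {A: 0.8, theta: 0.2}          # mass assigned directly to propositions

def bel(B):
    # belief: total mass committed to subsets of B
    return sum(v for F, v in m.items() if F <= B)

def pl(B):
    # plausibility: total mass of focal sets compatible with B
    return sum(v for F, v in m.items() if F & B)

print(bel(A), pl(A))              # 0.8, 1.0
mary = frozenset({"Mary"})
print(bel(mary), pl(mary))        # 0.0, 0.2
```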
Scarce data
I know that I don’t know: learning from scarce data
- yeah I know... Socrates again... but he knew it already 2500 years ago!
- still, people insist on learning from very limited experience
How widespread is life? Learning from scarce data
- the argument on the likelihood of biological life in the universe is an extreme example: how likely is it for a planet to give birth to life forms?
- planetary habitability is largely an extrapolation of conditions on Earth and the characteristics of the Solar System (some form of anthropic principle)
- basically, what people do is model perfectly the (presumed) causes of the emergence of life on Earth: the planet needs to circle a G-class star, in the ‘right’ galactic neighbourhood, be in a certain ‘habitable zone’ around the star, have a large moon to deflect hazardous impact events, ...
  p(life) = p_A · p_B · ...
- how much can one learn from a single example?
- how sure can one be about what has been learned from very few examples?
Machines that learn
(image: thebayesianobserver.wordpress.com)
- we design algorithms that can learn → machine learning
- BUT we train them on a ridiculously small amount of data
- how do we make sure they have learned the right lesson? is there really a ‘precise’ lesson to learn?
- should we not work with sets of models instead?
A naive position: dealing with scarce data
- a somewhat naive objection: probability distributions assume an infinite amount of evidence, so in reality finite evidence can only provide a constraint on the ‘true’ probability values
  - unfortunately, those who believe probabilities to be limits of relative frequencies (the frequentists) never really ‘estimate’ a probability from the data – they only assume (‘design’) probability distributions for their p-values
  - Fisher: fine, I can never compute probabilities, but I can use the data to test my hypotheses about them
  - in opposition, those who do estimate probability distributions from the data (the Bayesians) do not think of probabilities as infinite accumulations of evidence (but as degrees of belief)
  - Bayes: I only need to be able to model a likelihood function of the data
- well, actually, frequentists do estimate probabilities from scarce data when they do stochastic regression: see logistic regression in a couple of slides
Asymptotic happiness
- what is true is that both frequentists and Bayesians seem to be happy with solving their problems ‘asymptotically’:
  - limit properties of ML estimates
  - the Bernstein–von Mises theorem
- what about the here and now? e.g. smart cars?
Size and composition of the sample in (stochastic) logistic regression
- logistic regression allows us, given a sample Y = {Y_1, ..., Y_n}, X = {x_1, ..., x_n}, where Y_i ∈ {0, 1} is a binary outcome at time i and x_i is the corresponding measurement, to learn the parameters of a conditional probability relation between the two:
  P(Y = 1|x) = 1 / (1 + e^{−(β_0 + β_1 x)})
- given a new x, one then has the probability of a positive outcome
- it generalises deterministic linear regression
- the n trials are assumed independent but not identically distributed: π_i = P(Y_i = 1|x_i) varies with i
- the parameters β_0, β_1 are estimated by maximising the likelihood of the sample:
  L(β|Y) = ∏_{i=1}^n π_i^{Y_i} (1 − π_i)^{1−Y_i}
- logistic regression suffers when the number of samples is ‘insufficient’, or when there are too few positive outcomes (1s)
- also, it tends to underestimate the probability of a positive outcome (see rare events)
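A sketch with scikit-learn on a hypothetical tiny sample, illustrating the fit described above (a very large C makes it effectively unregularised, i.e. plain maximum likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])   # measurements x_i
y = np.array([0, 0, 0, 1, 0, 1])                           # binary outcomes Y_i

model = LogisticRegression(C=1e6).fit(x, y)    # ~unregularised ML estimate
print(model.intercept_, model.coef_)           # beta_0, beta_1
print(model.predict_proba([[2.2]])[:, 1])      # P(Y = 1 | x = 2.2)
```

With only six observations the estimates of β_0, β_1 move substantially if a single label is flipped – the instability the slide warns about.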
Size of the sample in frequentist probability: confidence intervals
Confidence interval
Let X be a sample from a probability P(·|θ, φ), where θ is the parameter to be estimated and φ a nuisance parameter. A confidence interval for the parameter θ, with confidence level (or confidence coefficient) γ, is an interval [u(X), v(X)] determined by the pair of random variables u(X) and v(X), with the property:

    P(u(X) < θ < v(X) | θ, φ) = γ    ∀(θ, φ).
example: I observe the weight of 25 cups of tea, I assume it is normally distributed with mean µ, and I want to know the confidence interval (the interval of 'expected' values on new samples) for the mean
since the (normalised) sample mean Z is also normally distributed, I can ask which values of the mean are such that P(−z ≤ Z ≤ z) = 0.95 (for instance)
since Z = (X̄ − µ)/(σ/√n), this yields an interval for µ, e.g. P(X̄ − 0.98 ≤ µ ≤ X̄ + 0.98) = 0.95
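A minimal sketch of this computation (the sample values are invented; choosing σ = 2.5 and n = 25 reproduces the ±0.98 margin quoted above):

```python
import numpy as np
from scipy.stats import norm

# invented sample: weights of n = 25 cups of tea, known sigma (z-interval)
rng = np.random.default_rng(0)
sample = rng.normal(loc=250.0, scale=2.5, size=25)
sigma, n = 2.5, len(sample)

z = norm.ppf(0.975)              # P(-z <= Z <= z) = 0.95
margin = z * sigma / np.sqrt(n)  # 1.96 * 2.5 / 5 = 0.98, as on the slide
x_bar = sample.mean()
print(f"95% CI for mu: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")
```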
Confidence intervals: interpretation
confidence intervals are a form of interval estimate
correct interpretation: as we saw in the example, it is about sampling samples
if I keep extracting new sample sets, 95% (say) of the time the confidence interval (which will differ for every new sample set) will cover the true value of the parameter
alternatively: there is a 95% probability that the confidence interval calculated from some future experiment encompasses the true value of the parameter
it does not mean that a specific confidence interval contains the value of the parameter with 95% probability
the Bayesian version of them: credible intervals
Size of the sample and belief functions
how do belief functions cope with scarce data?
Belief functions and scarce data
Belief functions cope with scarce data by being cautious about the 'correct' probability model describing the studied process: a belief function corresponds to an entire set of probability distributions.
Modelling pure data: Bayesian approach

Bayesian reasoning requires modelling the data and a prior
  a prior is just a name for beliefs built over a long period of time, from the evidence you have observed
  so long a time has passed that all track record of observations is lost, and all that is left is a probability distribution
why should we 'pick' a prior? either there is prior knowledge (beliefs) or there is not; nevertheless we are compelled to pick one, because the mathematical formalism requires it
  this is the result of a confusion between the original subjective interpretation (where prior beliefs always exist) and the objectivist view of a rigorous objective procedure (where in most cases we do not have any prior knowledge)
Bayesians then go into 'damage limitation' mode, and try to pick the least damaging prior (see 'ignorance' later)
all will be fine, in the end! (Bernstein-von Mises theorem) Asymptotically, the choice of the prior does not matter (really!)
Modelling pure data: frequentist approach
the frequentist approach is inherently unable to describe pure data without making additional assumptions on the data-generating process
in Nature one cannot 'design' an experiment: data come your way, whether you want them or not – you cannot set the 'stopping rules'
  again, this recalls the old image of a scientist 'analysing' (from Greek 'ana'+'lysis', breaking up) a specific aspect of the world in their lab
the same data can lead to opposite conclusions (!)
  different experiments can lead to the same data, whereas the parametric model employed (the family of probability distributions) is linked to a specific experiment
  apparently, however, frequentists are just fine with this
Same data, different conclusions
http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf
Choosing the prior: Bayesian inference

the prior distribution is typically hard to determine
'solution' → pick an 'uninformative' prior
Jeffreys prior → proportional to the square root of the determinant of the Fisher information matrix
  it can be improper (unnormalised), and it violates the strong version of the likelihood principle: when using the Jeffreys prior, inferences about θ depend not just on the probability of the observed data as a function of θ, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe
uniform priors do depend on the chosen set of hypotheses: they can lead to different results on different spaces, given the same likelihood functions (this was pointed out by Shafer in his book, btw)
Choosing the prior: Bernstein-von Mises theorem

in Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (Bernstein-von Mises theorem)
little problem: the amount of information supplied by a sample of data must be large enough
caveat [Freedman 1965]: the Bernstein-von Mises theorem does not hold almost surely if the random variable has a countably infinite probability space
A. W. F. Edwards: "It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."
Dealing with ignorance: Shafer vs Bayes

'uninformative' priors can be dangerous: they violate the strong likelihood principle, and may be unnormalised
wrong priors can kill a Bayesian model
priors in general cannot handle multiple hypothesis spaces in a coherent way (families of frames, in Shafer's terminology)

Belief functions and priors
Reasoning with belief functions does not require any prior: belief functions encoding the data are combined with no need for priors.

Belief functions and ignorance
Belief functions naturally represent ignorance via the 'vacuous' belief function, assigning mass 1 to the whole hypothesis space.
Extinct dinosaurs: the statistics of rare events

dinosaur statisticians were probably worrying about overpopulation risks ...
... until it hit them!
Black swans: the statistics of rare events

'black swan' is a term coined by Nassim Nicholas Taleb
an unpredictable event which, once it has occurred, is rationalised in hindsight as being predictable/describable by the existing risk models
Knightian uncertainty is presumed not to exist, with typically bad consequences!
examples: financial crises, plagues, but also unexpected scientific or societal developments
What’s a rare event? Very unusual data
examples of rare events, also called ‘tail risks’, are: volcanic eruptions, meteor impacts, tsunamis .. in the most extreme cases, these events might have never occurred (e.g. your vote will be decisive in the next presidential election, [Gelman and King, 1998]) what is a ‘rare’ event? clearly we are interested in them because they are not so rare, after all! in other words, they may happen rarely when considering a single system, but when putting a lot of systems together (the real world) the change of them happening becomes tangible so, an event is rare when it covers a region of the hypothesis space which is seldom sampled
Dealing with rare events: traditional approaches

probability distributions for the system's behaviour are built in 'normal' times (e.g. while the nuclear plant is working just fine), then used to extrapolate results at the 'tail' of the distribution
popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events
  Harvard's G. King [2001] has proposed corrections based on oversampling the 'rare' events w.r.t. the 'normal' ones
in response, some people drop generative probabilistic models in favour of discriminative ones [random forests, Huang 2005]
once again, we fail to understand that uncertainty affects our very models of uncertainty
Dealing with rare events: imprecise probabilities

we should explicitly model second-order (Knightian) uncertainties
the most straightforward way of doing this is to consider sets of probability distributions as modelling the problem

Belief functions and Knightian uncertainty
Mathematically, belief functions (random sets) do amount to (convex) sets of probability distributions.

as we will see, there are many ways of doing this – credal sets, probability intervals ...
a possible insight: rare events are a form of scarce data, with an added qualitative element (where the data are scarce)
they are a form of missing information too – we are missing certain regions of the hypothesis space
Bayes' rule and certainty

Bayes' rule is used by Bayesians to reason (in time) when new evidence becomes available
it is used by frequentists to condition on the (certain) measurements and generate their p-values
indeed, it assumes that new evidence always comes in the form of certain statements: event A is true
this is reasonable or even true in many situations: in science and engineering measurements flow in, and this is a form of 'certain' evidence
applying Bayes' rule to condition on series of measurements, to construct likelihood functions (or p-values, if you are a frequentist), then appears very reasonable
in many real-world problems, though, evidence/data is uncertain
Uncertain data

concepts themselves can be not well defined, e.g. a 'dark' or 'somewhat round' object (qualitative data)
  fuzzy theory accounts for this via the concept of graded membership
unreliable sensors can generate faulty (outlier) measurements: can we still treat these data as 'certain'? or is it more natural to attach to them a degree of reliability, based on the past track record of the 'sensor' (the data-generating process)? but then, can we still apply Bayes' rule?
interval measurements are common in engineering, due to the limited sensitivity of sensors
  they could be treated as precise pairs (a, b), but this requires considering the set of all subsets of measured values
people ('experts', e.g. doctors) tend to express themselves in terms of likelihoods directly (e.g. 'I think diagnosis A is most likely, otherwise either A or B')
  if the doctors were frequentists, and were provided with the same data, they would probably apply logistic regression and come up with the same prediction on P(disease|symptoms): unfortunately doctors are not statisticians
multiple sensors can provide as output a PDF on the same space
  e.g., two Kalman filters, one based on colour, the other on motion (optical flow), each providing a normal predictive PDF on the location of the target in the image plane
Jeffrey's rule of conditioning

Jeffrey's rule of conditioning: a step forward from certainty and Bayes' rule
an initial probability P stands corrected by a second probability P′, defined only on a number of events
suppose P is defined on a σ-algebra A
there is a new probability measure P′ on a sub-algebra B of A, and the updated probability P″ has to:
  1. meet the probability values specified by P′ for events in B
  2. be such that ∀B ∈ B, X, Y ⊆ B, X, Y ∈ A:

    P″(X)/P″(Y) = P(X)/P(Y)  if P(Y) > 0;    P″(Y) = 0  if P(Y) = 0

there is a unique solution:

    P″(A) = Σ_{B∈B} P(A|B) P′(B)

it generalises conditioning (obtained when P′(B) = 1 for some B)
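A minimal sketch of Jeffrey's update on a finite space (the events and numbers below are invented for illustration):

```python
# Jeffrey's rule on a finite space: P''(A) = sum_B P(A|B) P'(B),
# where the B's partition the space. Numbers below are invented.
P = {'a': 0.2, 'b': 0.3, 'c': 0.5}          # initial probability on {a, b, c}
partition = [{'a', 'b'}, {'c'}]             # events carrying the new evidence
P_new = {frozenset({'a', 'b'}): 0.7,        # P' on the sub-algebra
         frozenset({'c'}): 0.3}

def jeffrey_update(P, partition, P_new):
    """Return P'' over the atoms, reallocating mass within each block B."""
    P2 = {}
    for B in partition:
        PB = sum(P[w] for w in B)
        for w in B:
            # conditional P(w|B), scaled by the new block mass P'(B)
            P2[w] = (P[w] / PB) * P_new[frozenset(B)] if PB > 0 else 0.0
    return P2

print(jeffrey_update(P, partition, P_new))
# {'a': 0.28, 'b': 0.42, 'c': 0.3} – within-block ratios P(a)/P(b) are preserved
```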
Belief functions and uncertain evidence: conditioning versus combination

what if I have a new probability on the same σ-algebra A? Jeffrey's rule cannot be applied!
as we saw, this happens when multiple sensors provide predictive PDFs
belief functions deal with uncertain evidence by moving away from the concept of conditioning (via Bayes' rule) ...
... to that of combining pieces of evidence supporting multiple (intersecting) propositions to various degrees

Belief functions and evidence
Belief reasoning works by combining existing belief functions with new ones, which are able to encode uncertain evidence.

in addition, belief functions can represent fuzzy concepts as consonant (nested) belief functions
they can represent unreliable measurements as 'discounted' probabilities (by assigning mass to the entire hypothesis set)
Certainty about uncertainty: Voltaire's view

it is also absurd to be certain about uncertainty
it is quite contemptuous to allow convenience to dictate your choice: 'my noise is Gaussian', etc.
Ellsberg's paradox: aversion to Knightian uncertainty

the Ellsberg paradox illustrates people's aversion to second-order uncertainty
a decision problem can be formalized by defining:
  a set Ω of states of the world;
  a set X of consequences;
  a set F of acts, where an act is a function f : Ω → X
let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by

    (fEh)(ω) = f(ω) if ω ∈ E;    (fEh)(ω) = h(ω) if ω ∉ E

Savage's Sure Thing Principle states that ∀E, ∀f, g, h, h′:

    fEh ≽ gEh  ⇒  fEh′ ≽ gEh′
Ellsberg's paradox: aversion to Knightian uncertainty

suppose you have an urn containing 30 red balls and 60 balls that are either black or yellow. Consider the following gambles:
  f1: you receive 100 euros if you draw a red ball
  f2: you receive 100 euros if you draw a black ball
  f3: you receive 100 euros if you draw a red or yellow ball
  f4: you receive 100 euros if you draw a black or yellow ball
the Ellsberg paradox has been widely studied in economics and decision making (see http://www.econ.ucla.edu/workingpapers/wp362.pdf)
Ellsberg's paradox: aversion to Knightian uncertainty

in this example Ω = {R, B, Y}, fi : Ω → R and X = R:

         R    B    Y
    f1  100    0    0
    f2    0  100    0
    f3  100    0  100
    f4    0  100  100

empirically, most people strictly prefer f1 to f2, while preferring f4 to f3
now, pick E = {R, B}: by definition
    f1{R,B}0 = f1,     f2{R,B}0 = f2
    f1{R,B}100 = f3,   f2{R,B}100 = f4
since f1 ≽ f2, i.e. f1{R,B}0 ≽ f2{R,B}0, the Sure Thing Principle would imply f1{R,B}100 ≽ f2{R,B}100, i.e., f3 ≽ f4
empirically, the Sure Thing Principle is violated!
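The observed preferences are consistent with decision making over a set of probabilities. A minimal sketch (my own illustration, not from the tutorial): the urn pins p(R) = 1/3 but leaves p(B) free in [0, 2/3], and ranking acts by their lower (maximin) expected utility over this credal set reproduces f1 over f2 and f4 over f3:

```python
import numpy as np

payoff = {'f1': {'R': 100, 'B': 0,   'Y': 0},
          'f2': {'R': 0,   'B': 100, 'Y': 0},
          'f3': {'R': 100, 'B': 0,   'Y': 100},
          'f4': {'R': 0,   'B': 100, 'Y': 100}}

def lower_expectation(act):
    """Minimise E[act] over the credal set p(R)=1/3, p(B)=t, p(Y)=2/3-t."""
    values = []
    for t in np.linspace(0, 2/3, 200):      # sweep the free parameter
        p = {'R': 1/3, 'B': t, 'Y': 2/3 - t}
        values.append(sum(p[w] * payoff[act][w] for w in p))
    return min(values)

for f in payoff:
    print(f, round(lower_expectation(f), 1))
# f1 33.3, f2 0.0, f3 33.3, f4 66.7  ->  f1 beats f2 and f4 beats f3
```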
Making politicians happy: coming up with numbers for climate change

politicians need to decide whether to invest billions of dollars/euros/pounds in expensive engineering projects to mitigate the effects of climate change
whether theirs will be the right decision, we will know only in 20-30 years' time – nevertheless, decisions need to be made now
Brexit (really?) Investors do not like uncertainty

living in Oxford, I just have to talk about this
"In New York, a recent meeting of S&P Investment Advisory Services' five-strong investment committee decided to ignore the portfolio changes that its computer-driven investment models were advising. Instead, members decided not to make any big changes ahead of the vote." (Wall Street Journal)
investors prefer 'certainty' to 'uncertainty': does 'certainty' mean a certain outcome of their bets? No, only that they think their models can handle 'known' (first-order) uncertainty
Dealing with huge uncertainties: predicting the future

to be fair, the mainstream in climate change studies is not to model uncertainty at all, but to simply use dynamical models of the atmosphere/planet for prediction
  by the way, even deterministic, correct (chaotic) models (can) deliver uncertain predictions, due to uncertainty on the initial conditions
climate modelling requires predictions very far off in the future: what does this entail?
  if we use (deterministic) dynamical models, these are simplified versions of the world that get it more and more wrong as time passes
when modelling uncertainty explicitly, what are the challenges?
  we don't have any priors (ouch, Bayesians), and we don't have any data (pretty much) either (extreme scarcity)
  as we just saw, scarcity is a source of Knightian uncertainty
  we cannot really use hypothesis testing, either (too bad, frequentists): this is not a designed experiment where one can assume an underlying data-generating mechanism
Understanding
A mathematical theory of evidence
Shafer called his proposal 'A mathematical theory of evidence'; the mathematical objects it deals with are called 'belief functions'
where do these names come from? what interpretation of probability do they entail?
  it is a theory of epistemic probability: it is about probabilities as a mathematical representation of knowledge (a human's knowledge, or a machine's)
  it is a theory of evidential probability: such probabilities representing knowledge are induced ('elicited') by the available evidence
Belief (in hypotheses)

belief → the state of mind in which a person thinks something to be the case, with or without there being empirical evidence
is knowledge the part of belief that is true, or just that which is justified to be true?
epistemology → the branch of philosophy concerned with the theory of knowledge
epistemic probability → probability as a representation of knowledge
Evidence (supporting hypotheses)

in probabilistic logic, statements such as "hypothesis H is probably true" are interpreted to mean that the empirical evidence E supports H to a high degree
this degree of support of H by E is called the logical or epistemic probability of H given E
in fact, Pearl and others have supported a view of these matters in terms of probabilities on the logical causes of a certain proposition ('probability of provability'), much related to modal logic
  to be fair, this connection to evidence is overlooked in much of the subsequent work

Rationale
There exists evidence in the form of probabilities, which supports degrees of belief on a certain matter.

the space where the evidence lives is different from the hypothesis space
they are linked by a one-to-many map: but this is a random set!
Dempster's original setting

going back to the trial example, the situation can be described by a diagram in which Ω is the space where the evidence lives, in the form of a probability distribution P, and Θ is the hypothesis space, the set of outcomes of the trial
elements of Ω are mapped to subsets of Θ: once again this is a random set, i.e., a set-valued random variable
the probability distribution P induces a mass assignment m : 2^Θ → [0, 1] via the multi-valued (one-to-many) mapping Γ : Ω → 2^Θ
in the example, Γ maps {not drunk} ∈ Ω to {Peter, John} ⊂ Θ
the corresponding mass function is: m({Peter, John}) = 0.8, m(Θ) = 0.2
Mass functions ("basic probability assignments")

let θ be an unknown quantity with possible values in a finite domain Θ, called the frame of discernment
a piece of evidence about θ may be represented by a mass function m on Θ, defined as a function 2^Θ → [0, 1] such that:

    m(∅) = 0,    Σ_{A⊆Θ} m(A) = 1

P(Θ) = 2^Θ is the set of all subsets of Θ
any subset A of Θ such that m(A) > 0 is called a focal element (FE) of m
Belief and plausibility functions: Dempster's upper and lower probabilities

for any A ⊆ Θ, we can define:
  the total degree of support (belief) in A as the probability that the evidence implies A:

    Bel(A) = P({ω ∈ Ω | Γ(ω) ⊆ A}) = Σ_{B⊆A} m(B)

  the plausibility of A as the probability that the evidence does not contradict A:

    Pl(A) = P({ω ∈ Ω | Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Ā)

the uncertainty on the truth value of the proposition "θ ∈ A" is the interval [Bel(A), Pl(A)]
belief and plausibility values can (but this is disputed) be interpreted as lower and upper bounds on the values of an unknown, underlying probability measure: Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ
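A minimal sketch of these definitions on a finite frame, using the murder example's mass function (frozensets stand for subsets of Θ; my own illustration):

```python
THETA = frozenset({'Peter', 'John', 'Mary'})
# mass function from the murder example: m({Peter, John}) = 0.8, m(Theta) = 0.2
m = {frozenset({'Peter', 'John'}): 0.8, THETA: 0.2}

def bel(A, m):
    """Bel(A) = total mass of focal elements contained in A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(A, m):
    """Pl(A) = total mass of focal elements intersecting A (= 1 - Bel of the complement)."""
    return sum(v for B, v in m.items() if B & A)

print(bel(frozenset({'Peter', 'John'}), m))  # 0.8
print(pl(frozenset({'Mary'}), m))            # 0.2 -> interval [0, 0.2] on 'Mary'
```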
A generalisation of sets, fuzzy sets, probabilities

belief functions generalise traditional ('crisp') sets: a logical (or "categorical") mass function has a single focal set A, with m(A) = 1
belief functions generalise standard probabilities: a Bayesian mass function has as focal sets only elements (rather than subsets) of Θ
complete ignorance is represented by the vacuous mass function, with m(Θ) = 1
belief functions generalise fuzzy sets (see possibility theory later): when the focal sets of m are nested, m is said to be consonant
in that case the plausibility function Pl is a possibility measure, i.e.,

    Pl(A ∪ B) = max(Pl(A), Pl(B))    ∀A, B ⊆ Θ,

and its contour function pl(θ) = Pl({θ}) is the membership function of a fuzzy set
Combination of evidence: murder example continued

the first item of evidence gave us: m1({Peter, John}) = 0.8, m1(Θ) = 0.2
new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime
this second body of evidence is encoded by the mass assignment m2({John, Mary}) = 0.6, m2(Θ) = 0.4
once again, our sources of evidence are given to us in the form of probability distributions in some space relevant to (but not coinciding with) the problem
how do we combine these two pieces of evidence? an answer can be given within the random set interpretation of belief functions
Combination of evidence

if 'codes' ω1 ∈ Ω1 and ω2 ∈ Ω2 were selected, then θ ∈ Γ1(ω1) ∩ Γ2(ω2)
if the codes are selected independently, then the probability that the pair (ω1, ω2) is selected is P1({ω1}) · P2({ω2})
if Γ1(ω1) ∩ Γ2(ω2) = ∅, the pair (ω1, ω2) cannot have been selected, hence the joint distribution on Ω1 × Ω2 must be conditioned to eliminate such pairs
Dempster's rule: definition

under these assumptions we get Dempster's rule of combination
let m1 and m2 be two mass functions on the same frame Θ, induced by two independent pieces of evidence
their combination using Dempster's rule is defined as:

    (m1 ⊕ m2)(A) = (1 / (1 − κ)) Σ_{B∩C=A} m1(B) m2(C),    ∀ ∅ ≠ A ⊆ Θ,

where

    κ = Σ_{B∩C=∅} m1(B) m2(C)

is the degree of conflict between m1 and m2
their Dempster's sum m1 ⊕ m2 exists iff κ < 1
the rule is easily extended to any number of BFs
Dempster's rule - example

take m1({θ1}) = 0.7, m1({θ1, θ2}) = 0.3 and m2({θ2}) = 0.6, m2({θ1, θ2}) = 0.4, so that the conflict is κ = m1({θ1}) m2({θ2}) = 0.42; then:

    m({θ1}) = 0.7 · 0.4 / (1 − 0.42) ≈ 0.48,
    m({θ2}) = 0.3 · 0.6 / (1 − 0.42) ≈ 0.31,
    m({θ1, θ2}) = 0.3 · 0.4 / (1 − 0.42) ≈ 0.21
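A minimal sketch of Dempster's rule on a finite frame, reproducing the numbers above (frozensets as subsets; my own illustration, not the tutorial's code):

```python
from collections import defaultdict

def dempster_combine(m1, m2):
    """Dempster's rule: intersect focal elements, drop conflict, renormalise."""
    joint = defaultdict(float)
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            joint[B & C] += v1 * v2          # mass flows to the intersection
    kappa = joint.pop(frozenset(), 0.0)      # degree of conflict (empty set)
    if kappa >= 1.0:
        raise ValueError("total conflict: Dempster's sum does not exist")
    return {A: v / (1.0 - kappa) for A, v in joint.items()}

t1, t2 = frozenset({'t1'}), frozenset({'t2'})
m1 = {t1: 0.7, t1 | t2: 0.3}
m2 = {t2: 0.6, t1 | t2: 0.4}
print(dempster_combine(m1, m2))
# {t1}: 0.483, {t2}: 0.310, {t1, t2}: 0.207 – matching 0.48 / 0.31 / 0.21 above
```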
Dempster's rule: properties

Dempster's rule has some interesting properties: commutativity, associativity, existence of a neutral element (the vacuous BF mΘ with m(Θ) = 1)
it generalises set-theoretical intersection: if mA and mB are logical mass functions and A ∩ B ≠ ∅, then mA ⊕ mB = m_{A∩B}
it generalises Bayes' rule of conditioning: if m = p is a probability and mA is a 'logical' mass function, then m ⊕ mA is the probability p(·|A) obtained via Bayes' conditioning
A generalisation of Bayesian inference

belief theory generalises Bayesian probability (it contains it as a special case), in that:
  classical probability measures are a special class of belief functions (in the finite case) or random sets (in the infinite case)
  Bayes' 'certain' evidence is a special case of Shafer's bodies of evidence (general belief functions)
  Bayes' rule of conditioning is a special case of Dempster's rule of combination
however, it overcomes its limitations:
  you do not need a prior: if you are ignorant, you will use the vacuous BF mΘ which, when combined with new BFs m′ encoding the data, will not change the result: mΘ ⊕ m′ = m′
  however, if you do have prior knowledge, you are welcome to use it!
Refinements and coarsenings

the theory allows us to handle evidence impacting on different but related domains
assume we are interested in the nature of an object in a road scene: we could describe it, e.g., in the frame Θ = {vehicle, pedestrian}, or in the finer frame Ω = {car, bicycle, motorcycle, pedestrian}
another example: different image features in pose estimation
a frame Ω is a refinement of a frame Θ (or, equivalently, Θ is a coarsening of Ω) if the elements of Ω can be obtained by splitting some or all of the elements of Θ
(diagram: a refining map ρ sending each element θ1, θ2, θ3 of Θ to a subset of Ω)
Families of compatible frames

when Ω is a refinement for a collection Θ1, ..., ΘN of other frames, it is called their common refinement
two frames are said to be compatible if they do have a common refinement
compatible frames can be associated with different variables/attributes/features:
  let ΘX = {red, blue, green} and ΘY = {small, medium, large} be the domains of attributes X and Y describing, respectively, the colour and the size of an object
  in such a case the common refinement ΘX ⊗ ΘY = ΘX × ΘY is simply the Cartesian product
or, they can be descriptions of the same variable at different levels of granularity (as in the road scene example)
evidence can be moved from one frame to another within a family of compatible frames
Marginalization

let ΩX and ΩY be two compatible frames, and let mXY be a mass function on ΩX × ΩY
it can be expressed in the coarser frame ΩX by transferring each mass mXY(A) to the projection of A on ΩX
we obtain a marginal mass function on ΩX:

    m^{XY↓X}(B) = Σ_{A ⊆ Ω_{XY}: A↓Ω_X = B} mXY(A)    ∀B ⊆ ΩX

(again, it generalizes both set projection and probabilistic marginalization)
Vacuous extension

the "inverse" of marginalization
a mass function mX on ΩX can be expressed in ΩX × ΩY by transferring each mass mX(B) to the cylindrical extension of B
this operation is called the vacuous extension of mX to ΩX × ΩY:

    m^{X↑XY}(A) = mX(B) if A = B × ΩY;    0 otherwise

a strong feature of belief theory: the vacuous belief function (our representation of ignorance) is left unchanged when moving from one hypothesis set to another!
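A minimal sketch of the two operations on small frames (my own illustration; focal elements on the product frame are frozensets of (x, y) pairs):

```python
from itertools import product
from collections import defaultdict

OMEGA_X = ['red', 'blue']
OMEGA_Y = ['small', 'large']

def marginalise_X(m_xy):
    """Project each focal element A onto Omega_X and accumulate its mass."""
    m_x = defaultdict(float)
    for A, v in m_xy.items():
        m_x[frozenset(x for x, _ in A)] += v
    return dict(m_x)

def vacuous_extension(m_x):
    """Send each focal element B to its cylinder B x Omega_Y."""
    return {frozenset(product(B, OMEGA_Y)): v for B, v in m_x.items()}

m_x = {frozenset({'red'}): 0.6, frozenset(OMEGA_X): 0.4}
m_xy = vacuous_extension(m_x)
print(marginalise_X(m_xy) == m_x)   # True: extending then marginalising is lossless
```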
The multiple semantics of belief functions

being complex objects, belief functions have a number of (sometimes conflicting) semantics and mathematical interpretations
the original one [Dempster 1967]: lower probabilities induced by a multivalued mapping
  the mathematical representation: the random set framework
Shafer's (1976): representations of pieces of evidence in favour of propositions within someone's subjective state of belief
  represented as set functions on a finite domain Ω
as convex sets of probability measures, in a robust Bayesian interpretation
  mathematically, a credal set whose lower and upper envelopes are belief and plausibility functions
other equivalent mathematical formulations:
  as non-additive (generalised) probabilities
  as monotone capacities
  as inner measures (linked to the rough set idea)
As non-additive (generalised) probabilities

Probability measure
A function P : F → [0, 1] over a σ-field F ⊆ 2^Θ such that P(∅) = 0, P(Θ) = 1; if A ∩ B = ∅, A, B ∈ F then P(A ∪ B) = P(A) + P(B) (additivity).

if we relax the third constraint, allowing the function to meet additivity only as a lower bound, we obtain a:

Belief function
A function Bel : 2^Ω → [0, 1] from the power set 2^Ω to [0, 1] such that: Bel(∅) = 0, Bel(Ω) = 1; for every n and for every collection A1, ..., An ∈ 2^Ω we have that:

    Bel(A1 ∪ ... ∪ An) ≥ Σ_i Bel(Ai) − Σ_{i<j} Bel(Ai ∩ Aj) + · · · + (−1)^{n+1} Bel(A1 ∩ ... ∩ An)
Wong and Lingras' approach: building belief functions from preferences

input: a body of preferences, in the form of a preference relation ≻ and an indifference relation ∼
goal: to build a belief function Bel such that A ≻ B iff Bel(A) > Bel(B), and A ∼ B iff Bel(A) = Bel(B)
such a belief function exists if ≻ is a weak order and ∼ an equivalence relation

Algorithm
1. consider all propositions that appear in the preference relations as potential focal elements (FEs)
2. elimination: if A ∼ B for some B ⊂ A, then A is not a FE
3. a perceptron algorithm is used to generate the mass m by solving the system of remaining equalities and disequalities

however: it arbitrarily selects one solution among many, and does not address possible inconsistency in the given preferences
Ben Yaghlane's constrained optimisation approach: building belief functions from preferences

uses preferences and indifferences as in Wong and Lingras, with the same axioms ...
... but converts them into a constrained optimisation problem
objective function: maximise the entropy/uncertainty of the BF to be generated (least informative result)
constraints are derived from the input preferences/indifferences, i.e.

    A ≻ B ↔ Bel(A) − Bel(B) ≥ ε,    A ∼ B ↔ |Bel(A) − Bel(B)| ≤ ε

where ε is a constant specified by the expert; various uncertainty measures can be plugged in
A coin toss example

consider a coin toss experiment: we toss the coin n = 10 times, obtaining the sample X = {H, H, T, H, T, T, T, H, H, H}, with k = 6 successes (heads H) and n − k = 4 failures (tails T)
parameter of interest: the probability θ = p of heads in a single toss
inference problem: gather information on the value of p (either in the form of a point estimate, the acceptability of certain guesses, a probability distribution on the possible values of p, ...)
Bayesian inference: coin toss example

general Bayesian inference: assume the trials to be independent (they are obviously equally distributed)
the likelihood of the sample is binomial: P(X|p) = p^k (1 − p)^{n−k}
apply Bayes' rule to get the posterior:

    P(p|X) = P(X|p) P(p) / P(X) ∝ P(X|p)

(since we do not have prior information on the chances of p or X)
the ML estimate is the peak of this likelihood function
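A minimal sketch of this computation with a flat prior, under which the posterior is Beta(k+1, n−k+1) (my own illustration):

```python
import numpy as np
from scipy.stats import beta

k, n = 6, 10                        # 6 heads in 10 tosses
posterior = beta(k + 1, n - k + 1)  # flat prior => posterior is Beta(k+1, n-k+1)

p = np.linspace(0, 1, 1001)
p_map = p[np.argmax(posterior.pdf(p))]
print(p_map)                        # ~0.6: the peak coincides with the ML estimate k/n
print(posterior.interval(0.95))     # a 95% credible interval for p
```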
Frequentist hypothesis testing: coin toss example

what would a frequentist do? well, it seems reasonable that the value of p is p = k/n
we can then test it: once again assuming independent and equally distributed trials, the distribution of the sample is the binomial
we can then compute the p-value at, say, significance level α = 0.05
the p-value is obviously P(p̂ ≥ 0.6) ≈ 1/2 > α = 0.05, and the hypothesis is sensible ('not rejected', to be precise)
Likelihood-based belief function inference

likelihood-based belief function inference:

    Pl_Θ(A|X) = sup_{p∈A} L̂(p|X),    Bel_Θ(A|X) = 1 − Pl_Θ(Ā|X)

where L̂ is the normalised likelihood; these bounds determine an entire envelope of PDFs on the parameter space Θ = [0, 1]
we can apply the same criterion to the normalised empirical counts: f̂(H) = 1, f̂(T) = 4/6 = 2/3
we get the mass assignment m({H}) = 1/3, m({T}) = 0, m(Ω) = 2/3
as a credal set, Bel = {P : 1/3 ≤ P(H) ≤ 1}
this 'robustifies' the ML estimate, which is a PDF compatible with the inferred BF
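A minimal sketch of the second construction: normalise the empirical counts to a contour (possibility) function, then read off the consonant mass assignment (my own illustration):

```python
# contour from normalised counts: pl(H) = 0.6/0.6 = 1, pl(T) = 0.4/0.6 = 2/3
counts = {'H': 6, 'T': 4}
top = max(counts.values())
pl = {w: c / top for w, c in counts.items()}    # normalised plausibility contour

# consonant mass: sort outcomes by contour, assign differences to nested sets
order = sorted(pl, key=pl.get, reverse=True)    # ['H', 'T']
levels = [pl[w] for w in order] + [0.0]
m = {}
for i, w in enumerate(order):
    focal = frozenset(order[:i + 1])            # nested focal element
    m[focal] = levels[i] - levels[i + 1]
print(m)   # {frozenset({'H'}): 1/3, frozenset({'H', 'T'}): 2/3}, as on the slide
```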
Summary on inference

general Bayesian inference → a continuous PDF on the parameter space Θ (a second-order distribution)
MLE/MAP estimation → a single parameter value = a single PDF on Ω
generalised maximum likelihood → a belief function on Ω (a convex set of PDFs on Ω)
  generalises MAP/MLE
likelihood-based / Dempster-based belief function inference → a belief function on Θ = a convex set of second-order distributions
  generalises general Bayesian inference
lower and upper likelihoods → an interval of belief functions on Ω (we will see this at the end!)
Reasoning
Combining vs conditioning: reasoning with belief functions

belief theory is a generalisation of Bayesian reasoning
while in Bayesian theory evidence is of the kind 'A is true' (e.g. a new datum is available) ...
... in belief theory, new evidence can assume the more general form of a belief function
  a proposition A is a very special case of belief function, with m(A) = 1
in most cases, reasoning then needs to be performed by combining belief functions, rather than by conditioning with respect to an event
nevertheless, conditional belief functions are of interest, especially for statistical inference
Dempster's rule under fire: Zadeh's paradox

the question is: is Dempster's sum the only possible rule of combination? it seems to have paradoxical behaviour in certain circumstances ...
doctors have opinions about the condition of a patient, Θ = {M, C, T}, where M stands for meningitis, C for concussion and T for tumor
two doctors provide the following diagnoses:
  D1: "I am 99% sure it's meningitis, but there is a small chance of 1% that it is concussion."
  D2: "I am 99% sure it's a tumor, but there is a small chance of 1% that it is concussion."
these can be encoded by the following mass functions:

    m1(A) = 0.99 if A = {M}, 0.01 if A = {C}, 0 otherwise;
    m2(A) = 0.99 if A = {T}, 0.01 if A = {C}, 0 otherwise    (1)
171 / 464
Dempster's rule under fire: Zadeh's paradox
their (unnormalised) Dempster's combination is:
m(A) = { 0.9999 if A = ∅; 0.0001 if A = {C} }
as the two masses are highly conflicting, normalisation yields the belief function focussed on C → "it is definitely concussion", although both experts had left it as only a fringe possibility
objections:
- the belief functions in the example are really probabilities, so this is, if anything, a problem with Bayes' rule!
- diseases are never exclusive, so it may be argued that Zadeh's choice of frame of discernment is misleading → open-world approaches with no normalisation
- the doctors disagree so much that anyone would conclude that one of them is simply wrong → the reliability of sources needs to be accounted for
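To make the numbers above concrete, here is a minimal Python sketch of Dempster's rule on Zadeh's example; the dict-of-frozensets representation and the function name dempster are our own illustrative choices, not taken from any library.

```python
def dempster(m1, m2):
    """Combine two mass dicts (frozenset -> mass) by Dempster's rule."""
    out, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                out[A] = out.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC      # mass falling on the empty set
    K = 1.0 - conflict                   # Dempster's normalisation factor
    return {A: v / K for A, v in out.items()}, conflict

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}   # doctor 1
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}   # doctor 2
m, conflict = dempster(m1, m2)
print(conflict)   # 0.9999: almost all mass is conflicting
print(m)          # {frozenset({'C'}): 1.0}: 'it is definitely concussion'
```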
Dempster's rule under fire: Tchamova's paradox
this time, the two doctors generate the following mass assignments over Θ = {M, C, T}:
m1(A) = { a if A = {M}; 1 − a if A = {M, C}; 0 otherwise },   m2(A) = { b1 if A = {M, C}; b2 if A = Θ; 1 − b1 − b2 if A = {T} }   (2)
assuming equal reliability of the two doctors, Dempster's combination yields m1 ⊕ m2 = m1, i.e., Doctor 2's diagnosis is completely absorbed by that of Doctor 1!
here the 'paradoxical' behaviour is not a consequence of conflict: in Dempster's combination, every source of evidence has a 'veto' power over the hypotheses it does not believe to be possible
if any of the sources gets it wrong, the combined belief function will never give support to the 'correct' hypothesis
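Plugging arbitrary numbers (a = 0.3, b1 = 0.4, b2 = 0.3 here) into the same dempster sketch illustrates the absorption effect:

```python
# Quick numerical check of the absorption effect, reusing dempster() above;
# the values of a, b1, b2 are arbitrary.
m1 = {frozenset({"M"}): 0.3, frozenset({"M", "C"}): 0.7}
m2 = {frozenset({"M", "C"}): 0.4, frozenset({"M", "C", "T"}): 0.3,
      frozenset({"T"}): 0.3}
m, _ = dempster(m1, m2)
print(m)   # {{'M'}: 0.3, {'M','C'}: 0.7}: exactly m1, whatever m2 says
```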
Proposed combination rules
a number of alternative combination mechanisms have been proposed:
- Yager's rule: the conflict mass is assigned to Θ
- Dubois' rule: the conflict mass of each B ∩ C = ∅ is assigned to B ∪ C
- conjunctive rule: Dempster without normalisation
- disjunctive rule: dual of the conjunctive (and Dempster's)
- Denoeux's cautious rule: minimum weight after canonical decomposition
- bold rule: dual of the cautious rule
- Murphy's averaging idea
- Deng's distance-weighted averaging
- Lefevre's weighting factors
Yager's and Dubois' rules
a first answer to Zadeh's objections: the view that conflict is generated by non-reliable information sources
the conflicting mass m(∅) = Σ_{B∩C=∅} m1(B)m2(C) should be re-assigned to the whole frame Θ
let m∩(A) = Σ_{B∩C=A} m1(B)m2(C); then
mY(A) = m∩(A) for ∅ ≠ A ⊊ Θ,   mY(Θ) = m∩(Θ) + m(∅)   (3)
Dubois and Prade's idea: similar to Yager's, BUT the conflicting mass is not transferred all the way up to Θ; it goes to B ∪ C instead (by the minimum specificity principle):
mD(A) = m∩(A) + Σ_{B∪C=A, B∩C=∅} m1(B)m2(C)   (4)
the resulting BF dominates Yager's combination: mD(A) ≥ mY(A) ∀A
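A hedged sketch of both rules (eqs. (3) and (4)), under the same illustrative mass-dict representation used above:

```python
def yager(m1, m2, frame):
    """Yager's rule: all conflicting mass goes to the whole frame (eq. 3)."""
    out, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                out[A] = out.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC
    out[frame] = out.get(frame, 0.0) + conflict
    return out

def dubois_prade(m1, m2):
    """Dubois-Prade: the mass of each empty intersection goes to the union (eq. 4)."""
    out = {}
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = (B & C) or (B | C)    # B∩C if non-empty, else B∪C
            out[A] = out.get(A, 0.0) + mB * mC
    return out

frame = frozenset({"M", "C", "T"})
m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(yager(m1, m2, frame))         # {C}: 0.0001, Θ: 0.9999
print(dubois_prade(m1, m2))         # e.g. {'M','T'}: 0.9801, {'M','C'}: 0.0099
```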
Smets' conjunctive rule
Smets also assumes that all sources to combine are reliable: conflict is the result of an incorrectly specified frame of discernment
rather than normalising (as in Dempster's rule) or re-assigning the conflicting mass m(∅) to other non-empty subsets (as in Yager's and Dubois' proposals), his conjunctive rule leaves the conflicting mass with the empty set
conjunctive rule of combination:
m∩(A) = Σ_{B∩C=A} m1(B)m2(C)   (5)
it is applicable to unnormalised belief functions
open-world assumption: the current frame only approximately describes the set of possible hypotheses
the empty set ∅ represents hypotheses not included in the current frame (but which might be, if more information became available)
Disjunctive rule
dual of the conjunctive rule: in Dempster's original random set idea, consensus between two sources is expressed by the union of the supported propositions, rather than by their intersection
disjunctive rule of combination:
m∪(A) = Σ_{B∪C=A} m1(B)m2(C)   (6)
note that Bel1 ∪ Bel2 (A) = Bel1(A) ∗ Bel2(A): belief values are simply multiplied!
it was also proposed by Ivan Kramosil
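Both rules reduce to a single nested loop over pairs of focal elements; a minimal sketch (eqs. (5) and (6)), again with our illustrative dict representation:

```python
def combine(m1, m2, op):
    out = {}
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = op(B, C)
            out[A] = out.get(A, 0.0) + mB * mC
    return out

def conjunctive(m1, m2):
    # open world: mass on frozenset() (the empty set) is kept, not normalised
    return combine(m1, m2, lambda B, C: B & C)

def disjunctive(m1, m2):
    return combine(m1, m2, lambda B, C: B | C)

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(conjunctive(m1, m2))  # {frozenset(): 0.9999, {'C'}: 0.0001}
print(disjunctive(m1, m2))  # mass moves to unions, e.g. {'M','T'}: 0.9801
```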
Inverting Dempster's sum: The canonical decomposition
a belief function can be decomposed into a Dempster's sum of 'simple' components:
m = ⊕_{A⊊Θ} m_A^{w(A)}   (7)
where m_A^w denotes the simple pseudo belief function such that
m_A^w(B) = { 1 − w if B = A; w if B = Θ; 0 for all other B }
and the weights satisfy w(A) ∈ [0, +∞) for all A ⊊ Θ (conjunctive canonical decomposition)
the conjunctive and disjunctive rules also admit simple inverses
Denoeux's cautious rule
based on Smets' canonical decomposition m = ⊕_{A⊊Θ} m_A^{w(A)}
cautious combination: the mass assignment with the weights
w1∧2(A) = min{w1(A), w2(A)},   A ∈ 2^Θ \ {Θ}   (8)
i.e., the belief function whose simple components have weight equal to the minimum of the two input weights
it is the least committed BF in the set that dominates the weights of the input ones
it is commutative, associative and idempotent! idempotence means that if I keep adding the same evidence, nothing changes (unlike with Dempster's rule)
(a cautious conjunctive rule which differs from Denoeux's was proposed by Destercke et al.)
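As a rough illustration of how this can be computed, the sketch below derives the canonical weights from commonalities and rebuilds the combined mass with the conjunctive() helper above; it assumes non-dogmatic inputs (m(Θ) > 0, so all commonalities are strictly positive), and all helper names are ours.

```python
import math
from itertools import combinations

def subsets(frame):
    return [frozenset(c) for r in range(len(frame) + 1)
            for c in combinations(sorted(frame), r)]

def canonical_weights(m, frame):
    """Conjunctive canonical weights w(A), A ⊊ frame, from commonalities Q."""
    Q = {A: sum(v for B, v in m.items() if A <= B) for A in subsets(frame)}
    w = {}
    for A in subsets(frame):
        if A == frame:
            continue
        log_w = -sum((-1) ** (len(B) - len(A)) * math.log(Q[B])
                     for B in subsets(frame) if A <= B)
        w[A] = math.exp(log_w)
    return w

def cautious(m1, m2, frame):
    """Eq. (8): take the minimum weight per subset, then rebuild the mass."""
    w1, w2 = canonical_weights(m1, frame), canonical_weights(m2, frame)
    m = {frame: 1.0}
    for A in w1:
        wA = min(w1[A], w2[A])
        m = conjunctive(m, {A: 1.0 - wA, frame: wA})   # simple component m_A^w
    return {A: v for A, v in m.items() if abs(v) > 1e-12}
```

One can check numerically that cautious(m, m, frame) returns m itself, i.e. idempotence, which dempster() does not satisfy.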
Bold rule
dual of the cautious rule, based on the canonical disjunctive decomposition: any unnormalised belief function can be uniquely decomposed as the disjunctive combination
m = ∪_{A≠∅} m_{A,v(A)}   (9)
where m_{A,v(A)} is the unnormalised belief function assigning mass v(A) to ∅ and 1 − v(A) to A
bold combination is defined as:
m1 ⊻ m2 = ∪_{A≠∅} m_{A, min{v1(A), v2(A)}}   (10)
it is only applicable to unnormalised belief functions
Averaging approaches
completely different rationale from the random set interpretation: some way of computing the 'mean' of the input mass functions
Murphy [2000]: average the input masses, then calculate the combined b.p.a. by combining the average with itself multiple times
Deng Yong [2005]: averaging based on distance
- the degree of credibility Crd(mi) of the i-th body of evidence,
Crd(mi) = Sup(mi) / Σ_j Sup(mj),   where Sup(mi) = Σ_{j≠i} (1 − d(mi, mj)),
is used to compute a weighted average of the input masses: m̃ = Σ_i Crd(mi) · mi
albeit empirical, these approaches try to address the issue with the 'veto' power of each piece of evidence
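A minimal sketch of Murphy's idea, reusing the dempster() helper from earlier (the example masses are Zadeh's):

```python
def murphy(masses):
    """Average the input masses, then Dempster-combine the average n-1 times."""
    n = len(masses)
    avg = {}
    for m in masses:
        for A, v in m.items():
            avg[A] = avg.get(A, 0.0) + v / n
    out = avg
    for _ in range(n - 1):
        out, _conflict = dempster(out, avg)
    return out

m1 = {frozenset({"M"}): 0.99, frozenset({"C"}): 0.01}
m2 = {frozenset({"T"}): 0.99, frozenset({"C"}): 0.01}
print(murphy([m1, m2]))   # {M}: 0.4999, {T}: 0.4999, {C}: 0.0002: no total veto
```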
Lefevre's weighting factors
given J input masses m = {m1, ..., mj, ..., mJ}, a family of combination rules distributes the conflicting mass m(∅) to each proposition A of a set of subsets P = {A}, according to a weighting factor w(A, m):
m(A) = m∩(A) + m^c(A),   (11)
where m^c(A) = w(A, m) · m(∅) if A ∈ P, and 0 otherwise
it includes Smets' and Yager's rules, for P = {∅} and P = {Θ} respectively
we get Dempster's rule when P = 2^Θ \ {∅}, with weights w(A, m) = m∩(A) / (1 − m(∅)) for all A ∈ 2^Θ \ {∅}
Dubois and Prade's operator can also be obtained; similar conflict-redistribution strategies have been proposed by others
Other proposals: Belief function combination
a number of other proposals exist for combination rules:
- Josang's consensus operator (from the beta distribution interpretation)
- Daniel's minC approach
- Wang's [2007]
- Yamada's 'combination by compromise'
- Yang's evidential reasoning rule
- Florea's Adaptive Combination Rules (ACR) and Proportional Conflict Redistribution (PCR) rule
.. and families of combination operators:
- Denoeux's families induced by t-norms and conorms
- α-junctions: linear operators associated with matrices
- Yager's family of quasi-associative operators
- Denneberg's family of updating rules
Combination: Moving forward
Yager's rule is rather unjustified; Dubois' is somewhat intermediate between conjunction and disjunction
the cautious and bold rules are inspired by possibility theory's min rule, rather than the original random set framework
my take on this: Dempster's (conjunctive) combination and disjunctive combination are the two extrema of a spectrum of possible results
Proposal: combination tubes?
Meta-uncertainty on the sources generating the input belief functions (their independence and reliability) induces uncertainty on the result of the combination, represented by a bracket of combination rules, which produce a 'tube' of BFs.
we encountered this idea when generalising the concept of likelihood; it was already hinted at by Pearl in "Reasoning with belief functions: An analysis of compatibility"
should we then perhaps work with intervals of belief functions?
Reasoning
Conditioning
Conditional belief functions
in Bayesian theory conditioning is done via Bayes' rule: P(A|B) = P(A ∩ B) / P(B)
for belief functions, many approaches to conditioning have been proposed (just as for combination!):
- Dempster's original conditioning
- Fagin and Halpern's lower envelopes
- 'geometric conditioning' [Suppes]
- unnormalised conditional belief functions [Smets]
- generalised Jeffrey's rules [Smets]
- sets of equivalent events under multi-valued mappings [Spies]
- conditioning by distance minimisation [Cuzzolin]
several of them are special cases of combination rules (Dempster's, Smets' ..); others arise as the unique solution when interpreting belief functions as convex sets of probabilities (Fagin's)
once again, a duality emerges between the most and the least cautious conditioning approaches
Dempster's conditioning
Dempster's rule of combination induces a conditioning operator: given a new event B, the 'logical' belief function such that m(B) = 1 is combined with the a-priori belief function Bel using Dempster's rule; the resulting BF is the conditional belief function given B à la Dempster, Bel⊕(A|B)
in terms of belief and plausibility values, Dempster's conditioning yields
Bel⊕(A|B) = [Bel(A ∪ B̄) − Bel(B̄)] / [1 − Bel(B̄)] = [Pl(B) − Pl(B \ A)] / Pl(B),   Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)
the latter is obtained from Bayes' rule by replacing probability with plausibility measures!
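Since Dempster conditioning is just combination with a categorical mass function, it is a one-liner on top of the earlier dempster() sketch (the example values are illustrative):

```python
def dempster_condition(m, B):
    """Condition m on event B by Dempster-combining with m_B(B) = 1."""
    cond, _conflict = dempster(m, {frozenset(B): 1.0})
    return cond

m = {frozenset({"M"}): 0.2, frozenset({"M", "C"}): 0.5,
     frozenset({"M", "C", "T"}): 0.3}
print(dempster_condition(m, {"M", "C"}))   # {{'M'}: 0.2, {'M','C'}: 0.8}
```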
Fagin and Halpern's lower envelopes
we know that a belief function can be seen as the lower envelope of the family of probabilities consistent with it: Bel(A) = inf_{P∈P[Bel]} P(A)
Fagin and Halpern defined a conditional belief function as the lower envelope (the inf) of the family of conditional probability functions P(A|B), where P is consistent with Bel:
BelCr(A|B) = inf_{P∈P[Bel]} P(A|B),   PlCr(A|B) = sup_{P∈P[Bel]} P(A|B)
it obviously generalises conditional probability (just like Dempster's conditioning)
similar notions have been considered by other authors too, e.g. Dempster '67 and Walley '81
Lower conditional envelopes: Closed-form expressions
obviously strongly linked to the robust Bayesian (credal) interpretation, and so rather incompatible with the random set interpretation
nevertheless, while lower/upper envelopes of arbitrary sets of probabilities are not in general belief functions, these actually are belief functions:
BelCr(A|B) = Bel(A ∩ B) / [Bel(A ∩ B) + Pl(Ā ∩ B)],   PlCr(A|B) = Pl(A ∩ B) / [Pl(A ∩ B) + Bel(Ā ∩ B)]
they provide a more conservative estimate than Dempster's conditioning:
BelCr(A|B) ≤ Bel⊕(A|B) ≤ Pl⊕(A|B) ≤ PlCr(A|B)
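A hedged sketch of the closed-form credal conditioning, computing Bel and Pl from a mass dict (names are ours; degenerate zero denominators are not handled):

```python
def bel(m, A):
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    return sum(v for B, v in m.items() if B & A)

def credal_condition(m, A, B, frame):
    """Fagin-Halpern lower/upper conditional values for A given B."""
    Abar = frame - A
    lower = bel(m, A & B) / (bel(m, A & B) + pl(m, Abar & B))
    upper = pl(m, A & B) / (pl(m, A & B) + bel(m, Abar & B))
    return lower, upper

frame = frozenset({"M", "C", "T"})
m = {frozenset({"M"}): 0.2, frozenset({"M", "C"}): 0.5, frame: 0.3}
print(credal_condition(m, frozenset({"M"}), frozenset({"M", "C"}), frame))
# (0.2, 1.0): wider than the Dempster-conditioned interval, as expected
```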
Suppes and Zanotti's geometric conditioning
Suppes and Zanotti proposed a 'geometric' conditioning approach:
BelG(A|B) = Bel(A ∩ B) / Bel(B),   PlG(A|B) = [Bel(B) − Bel(B \ A)] / Bel(B)
what it does is retain only the masses of focal elements inside B, and normalise them:
mG(A|B) = m(A) / Bel(B),   A ⊆ B
it is a consequence of the focussing approach to belief update: no new information is introduced, we merely focus on a specific subset of the original set
it is somewhat dual to Dempster's conditioning, as it replaces probability with belief measures in Bayes' rule:
Pl⊕(A|B) = Pl(A ∩ B) / Pl(B)   ↔   BelG(A|B) = Bel(A ∩ B) / Bel(B)
open question: is it induced by some dual rule of combination?
Smets' conjunctive rule of conditioning
the 'unnormalised' conditional belief function: its mass is
m∩(A|B) = Σ_{X ⊆ B̄} m(A ∪ X) for A ⊆ B,   m∩(A|B) = 0 if A ⊄ B
it is induced by the conjunctive rule of combination: m∩(·|B) = m ∩ mB
belief and plausibility values:
Bel∩(A|B) = Bel(A ∪ B̄) if A ∩ B ≠ ∅, 0 if A ∩ B = ∅;   Pl∩(A|B) = Pl(A ∩ B) if A ⊉ B, 1 if A ⊇ B
it is compatible with the principles of belief revision [Gilboa, Perea]: a state of belief is modified to take into account a new piece of information
- in probability theory, both focussing and revision are expressed by Bayes' rule, but they are conceptually different operations, which produce different results on BFs
it is more committal than Dempster's rule! Bel⊕(A|B) ≤ Bel∩(A|B) ≤ Pl∩(A|B) ≤ Pl⊕(A|B)
Disjunctive rule of conditioning
induced by the disjunctive rule of combination: m∪(·|B) = m ∪ mB
obviously dual to conjunctive conditioning:
m∪(A|B) = Σ_{X ⊆ B} m((A \ B) ∪ X) for A ⊇ B,   while m∪(A|B) = 0 for all A ⊉ B
it assigns mass only to subsets containing the conditioning event B
belief and plausibility values:
Bel∪(A|B) = Bel(A) if A ⊇ B, 0 if A ⊉ B;   Pl∪(A|B) = Pl(A) if A ∩ B = ∅, 1 if A ∩ B ≠ ∅
it is less committal not only than Dempster's rule, but also than credal conditioning:
Bel∪(A|B) ≤ BelCr(A|B) ≤ PlCr(A|B) ≤ Pl∪(A|B)
Conditioning - an overview

  operator        belief Bel(A|B)                              plausibility Pl(A|B)
  Dempster's ⊕    [Pl(B) − Pl(B \ A)] / Pl(B)                  Pl(A ∩ B) / Pl(B)
  Credal Cr       Bel(A ∩ B) / [Bel(A ∩ B) + Pl(Ā ∩ B)]        Pl(A ∩ B) / [Pl(A ∩ B) + Bel(Ā ∩ B)]
  Geometric G     Bel(A ∩ B) / Bel(B)                          [Bel(B) − Bel(B \ A)] / Bel(B)
  Conjunctive ∩   Bel(A ∪ B̄), A ∩ B ≠ ∅                        Pl(A ∩ B), A ⊉ B
  Disjunctive ∪   Bel(A), A ⊇ B                                Pl(A), A ∩ B = ∅

Nested conditioning operators
Conditioning operators form a nested family, from the more committal to the least committal one!
Bel∪(·|B) ≤ BelCr(·|B) ≤ Bel⊕(·|B) ≤ Bel∩(·|B) ≤ Pl∩(·|B) ≤ Pl⊕(·|B) ≤ PlCr(·|B) ≤ Pl∪(·|B)
open question: what about geometric conditioning? is it induced by some combination rule dual to Dempster's?
Reasoning
Belief vs Bayesian reasoning
Belief vs Bayesian reasoning: Image data fusion for object classification
suppose we want to estimate the class of an object appearing in an image, based on feature measurements extracted from the image (e.g. by a convolutional neural network)
we capture a training set of images, complete with annotated object labels
assuming a PDF of a certain family (e.g. a mixture of Gaussians), we can learn from the training data a likelihood function p(x|y), where y is the object class and x the image feature vector
suppose n different 'sensors' extract n features xi from each image: x1, ..., xn
let us compare how data fusion works under the Bayesian and the belief function paradigms!
Belief vs Bayesian reasoning: Bayesian data fusion
the likelihoods of the individual features are computed using the n likelihood functions learned during training: p(xi|y), for all i = 1, ..., n
measurements are typically assumed to be conditionally independent, yielding the product likelihood p(x|y) = ∏_i p(xi|y)
Bayesian inference is applied, typically assuming uniform priors (for there is no reason to think otherwise), yielding
p(y|x) ∝ p(x|y) = ∏_i p(xi|y)
Belief vs Bayesian reasoning: Dempster-Shafer data fusion
with belief functions, for each feature type i a BF is learned from the individual likelihood p(xi|y), e.g. via the likelihood-based approach by Shafer
this yields n belief functions Bel(Y|xi), Y ⊆ Y, on the range of possible object classes Y
a combination rule (e.g. ⊕, ∩, ∪) is applied to compute an overall BF:
Bel(Y|x) = Bel(Y|x1) ⊕ ... ⊕ Bel(Y|xn),   Y ⊆ Y
an empirical comparison of this kind is shown under Regression
Inference under partially reliable data: Belief vs Bayesian reasoning
in the fusion example we have assumed that the data are measured correctly: what if the data-generating process is not completely reliable?
problem: suppose we just want to detect an object (a binary decision: yes Y or no N)
two sensors produce image features x1 and x2, but we learned from the training data that both are unreliable 20% of the time
at test time we get an image, measure x1 and x2, and unluckily sensor 2 got it wrong! the object is actually there
we get the following normalised likelihoods:
p(x1|Y) = 0.9, p(x1|N) = 0.1;   p(x2|Y) = 0.1, p(x2|N) = 0.9
how do the two fusion pipelines cope with this?
the Bayesian scholar assumes the two sensors/processes are conditionally independent, and multiplies the likelihoods, obtaining
p(x1, x2|Y) = 0.9 ∗ 0.1 = 0.09,   p(x1, x2|N) = 0.1 ∗ 0.9 = 0.09
so that p(Y|x1, x2) = 1/2, p(N|x1, x2) = 1/2
Shafer's faithful follower discounts the likelihoods by assigning mass 0.2 to the whole hypothesis space Θ = {Y, N}:
m(Y|x1) = 0.9 ∗ 0.8 = 0.72, m(N|x1) = 0.1 ∗ 0.8 = 0.08, m(Θ|x1) = 0.2;
m(Y|x2) = 0.1 ∗ 0.8 = 0.08, m(N|x2) = 0.9 ∗ 0.8 = 0.72, m(Θ|x2) = 0.2
thus, when we combine them by Dempster's rule we get the BF Bel on {Y, N}:
m(Y|x1, x2) = 0.458,   m(N|x1, x2) = 0.458,   m(Θ|x1, x2) = 0.084
when combined using the disjunctive rule (the least committal one) we get Bel′:
m′(Y|x1, x2) = 0.09,   m′(N|x1, x2) = 0.09,   m′(Θ|x1, x2) = 0.82
(figure: the corresponding credal sets of probabilities)
the credal interval for Bel is quite narrow: reliability was assumed to be 80%, yet one measurement in two (50%) was faulty!
the disjunctive rule is much more cautious about the correct inference
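The numbers above can be checked with the earlier dempster() and disjunctive() sketches plus a small discounting helper (our own illustrative code):

```python
theta = frozenset({"Y", "N"})

def discount(m, alpha):
    """Discount a mass dict by reliability alpha: mass 1-alpha goes to Θ."""
    out = {A: alpha * v for A, v in m.items()}
    out[theta] = out.get(theta, 0.0) + (1 - alpha)
    return out

m1 = {frozenset({"Y"}): 0.9, frozenset({"N"}): 0.1}
m2 = {frozenset({"Y"}): 0.1, frozenset({"N"}): 0.9}

bel, _ = dempster(discount(m1, 0.8), discount(m2, 0.8))
print(bel)                    # {Y}: 0.458, {N}: 0.458, Θ: 0.084
print(disjunctive(m1, m2))    # {Y}: 0.09, {N}: 0.09, Θ: 0.82
```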
Reasoning
Generalised Bayes Theorem
Generalising full Bayesian inference
in Bayesian inference a likelihood function p(·|θ), θ ∈ Θ, is known, so that we can compute the likelihood of a new sample p(x|θ), x ∈ X
after observing x, the probability distribution on Θ is updated to the posterior via Bayes' theorem:
P(θ|x) = P(x|θ)P(θ) / P(x),   P(x) = Σ_{θ′} P(x|θ′)P(θ′),   ∀θ ∈ Θ
Shafer's likelihood-based inference maps the likelihood p(x|θ) to a BF on Θ:
p(x|θ), ∀x ∈ X   ↦   BelΘ(A|x), A ⊂ Θ
Dempster's inference maps (for instance) the family of CDFs F(x|θ) associated with p(x|θ) to a belief function on Θ × X:
F(x|θ), ∀x ∈ X   ↦   BelΘ×X(·)
which by conditioning on x gives a BF BelΘ(A|x), A ⊂ Θ
Generalised Bayes Theorem: Generalising full Bayesian inference
in Smets' generalised Bayesian theorem setting, the input is a set of 'conditional' belief functions on the data space,
BelX(X|θ),   X ⊂ X, θ ∈ Θ,
each associated with a value θ of the parameter, rather than likelihoods p(x|θ)
these are not the same conditional belief functions we saw earlier, where a conditioning event B ⊂ Θ alters a prior belief function BelΘ, mapping it to BelΘ(·|B); they can rather be seen as a parameterised family of BFs on the data
the desired output is another family of belief functions on Θ, parameterised by all sets of measurements X on X: BelΘ(A|X), ∀X ⊂ X
it is natural to require that each piece of evidence m(X|θ) have an effect on our beliefs on the parameters
this is also coherent with the random set setting, in which we condition on set-valued observations
Generalised Bayes Theorem
the GBT implements the inference BelX(X|θ) ↦ BelΘ(A|X) by:
1. computing an intermediate family of BFs on X, parameterised by sets of parameter values, via the disjunctive rule of combination:
BelX(X|A) = ∪_{θ∈A} BelX(X|θ) = ∏_{θ∈A} BelX(X|θ)
2. assuming that PlΘ(A|X) = PlX(X|A) ∀A ⊂ Θ, X ⊂ X
3. this yields BelΘ(A|X) = ∏_{θ∈Ā} BelX(X̄|θ)
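A minimal sketch of step 3 for a finite Θ, taking as input the values BelX(X̄|θ) for the observed X; the function name and the example numbers are illustrative, and the output follows the unnormalised (TBM-style) convention, so the value for ∅ plays the role of conflict:

```python
from itertools import combinations

def gbt_posterior(bel_xbar):
    """bel_xbar: dict θ -> Bel_X(X̄|θ). Returns A (frozenset) -> Bel_Θ(A|X)."""
    thetas = sorted(bel_xbar)
    out = {}
    for r in range(len(thetas) + 1):
        for A in combinations(thetas, r):
            prod = 1.0
            for t in thetas:
                if t not in A:          # θ ranges over the complement of A
                    prod *= bel_xbar[t]
            out[frozenset(A)] = prod
    return out

print(gbt_posterior({"θ1": 0.3, "θ2": 0.8}))
# ∅: 0.24, {θ1}: 0.8, {θ2}: 0.3, {θ1,θ2}: 1.0
```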
Generalised Bayes Theorem: Assumptions and properties
1. the GBT derives from two requirements:
  I. if we apply the GBT to two variables X and Y we get BelΘ(A|X, Y) = BelΘ(A|X) ∩ BelΘ(A|Y): the conditional BF on Θ is the conjunctive combination of the two
  II. PlX(X|A) is a function of {PlX(X|θ), PlX(X̄|θ) : θ ∈ A}
2. it generalises Bayes' rule (by replacing P with Pl) when priors are uniform
Shafer's proposal for statistical inference, PlΘ(A|x) = max_{θ∈A} PlΘ(θ|x), does not meet requirement I
under requirement I of the GBT the two variables are conditionally cognitively independent (this extends stochastic independence):
PlX×Y(X ∩ Y|θ) = PlX(X|θ) ∗ PlY(Y|θ)   ∀X ⊂ X, Y ⊂ Y, θ ∈ Θ
Reasoning
Graphical models
Probabilistic graphical models: Pearl
conditional independence relationships are the building block of Pearl's probabilistic graphical models
conditional probabilities can be directly manipulated using Bayes' rule
the support for a graphical model is a directed acyclic graph G = (V, E); each node v ∈ V is associated with a random variable Xv
the set of random variables X = {Xv, v ∈ V} is a Bayesian network with respect to G if its joint probability density function is the product of the individual density functions, conditional on their parent variables:
p(x) = ∏_{v∈V} p(xv | x_pa(v))
where pa(v) is the set of parents of v
this expresses the conditional independence of the variables from any of their non-descendants, given the values of their parent variables
Probabilistic graphical models: Belief propagation
a message-passing algorithm for performing inference on graphical models
it calculates the marginal distribution for each unobserved node, conditional on any observed nodes
it was first formulated on trees, then polytrees, finally general graphs
it works by passing real-valued functions called messages µv→u, u ∈ N(v), along the edges
upon convergence, the estimated marginal distribution of each node is proportional to the product of all messages from adjoining factors (up to the normalisation constant):
pXv(xv) ∝ ∏_{u∈N(v)} µu→v(xv)
Graphical models for belief functions: on joint belief functions
early local propagation methods (see efficient computation) have also developed into graphical models for reasoning with belief functions
however, in networks using belief functions, relations among variables are usually represented by joint belief functions rather than conditional ones
furthermore, these networks are undirected graphs, for instance:
- hypertrees [Shenoy & Shafer, 1986]
- qualitative Markov trees [Shafer, Shenoy and Mellouli, 1987]
- join trees [Shenoy, 1997]
- valuation networks [Shenoy, 1992]
Shenoy and Shafer showed that if combination and marginalization meet three axioms, then local computation becomes possible
Cano et al. [1993]: adding three more axioms allows us to use Shenoy & Shafer's axiomatic framework for propagation in directed acyclic graphs
Graphical models for belief functions: on conditional belief functions
graphs of conditional belief function independence relations are more efficient [Shenoy, 1993]
due to a lack of directed belief networks (similar to Bayesian networks), more recent works integrate belief function theory and Bayesian networks:
- Cobb & Shenoy [2003]: plausibility transformation between models
- Simon & Weber [2006]: implementing belief calculus in Bayesian networks
an alternative line of research:
- evidential networks with conditional belief functions (ENC) [Xu and Smets, 1993-95] use directed acyclic graphs, BUT edges represent conditional relations (i.e. values in X generate conditional belief functions in Y) rather than conditional independence relations
- they use the Generalised Bayesian Theorem (GBT) for propagation
- propagation on ENCs only applies to binary relations between nodes
- Directed Evidential Networks (DEVN) [Ben Yaghlane and Mellouli, 2008] generalise ENCs to relations involving any number of nodes
Shafer-Shenoy architecture: Qualitative Conditional Independence
uses qualitative Markov trees, which generalise both diagnostic trees and causal trees (Pearl)
partitions Ψ1, ..., Ψn of a frame Θ are qualitatively conditionally independent (QCI) given the partition Ψ if
P ∩ P1 ∩ ... ∩ Pn ≠ ∅ whenever P ∈ Ψ, Pi ∈ Ψi and P ∩ Pi ≠ ∅ for all i
the notion does not involve probability, only logical independence; stochastic conditional independence does imply it
if two BFs Bel1 and Bel2 are 'carried by' partitions Ψ1, Ψ2 which are QCI given Ψ:
(Bel1 ⊕ Bel2)Ψ = (Bel1)Ψ ⊕ (Bel2)Ψ
Shafer-Shenoy architecture: Qualitative Markov trees
a qualitative Markov tree QMT = (V, E) is a tree of partitions of a base frame of discernment Θ: each node v ∈ V is associated with a partition Ψv of Θ
it meets the following requirement:
- deleting a node v and all incident edges yields a forest; denote the collection of nodes of the j-th such subtree by Vj(v)
- for every node v ∈ V, the minimal refinements of the partitions in Vj(v), for j = 1, ..., k, are QCI given Ψv
a Bayesian causal tree is a qualitative Markov tree in which each node v is associated with the partition Ψv induced by the random variable Xv
Propagation on QMTs: suppose a number of belief functions are inputted into a subset of nodes V′. Problem: computing ⊕_{v∈V′} Belv.
Shafer-Shenoy architecture: Propagating belief functions
rather than applying Dempster's combination over the whole frame Θ, we perform multiple Dempster's combinations over partitions
- restriction: each BF to combine has to be carried by a partition in the tree
a processor located at each node v combines BFs using Ψv as a frame of discernment, and projects BFs onto its neighbours:
1. send Belv to its neighbours N(v)
2. whenever it gets a new input, compute (Bel^T)Ψv ← (⊕{(Belu)Ψv : u ∈ N(v)} ⊕ Belv)Ψv
3. compute, for each neighbour w ∈ N(v), Belv,w ← (⊕{(Belu)Ψv : u ∈ N(v) \ {w}} ⊕ Belv)Ψw and send it to w
the final result at each processor v is the coarsening to that partition of the combination of all the inputted BFs: (⊕_{u∈V′} Belu)Ψv
Directed evidential networks: Propagation algorithm
extends Pearl's belief propagation to belief functions
problem: given BFs {Bel⁰v, v ∈ V} on the nodes of a DEN (a directed acyclic graph in which edges represent conditional relations), we seek the marginal on each node v ∈ V of their joint belief function
if there is a conditional relation between two nodes u and v, it uses the disjunctive combination and the generalised Bayesian theorem (GBT) to compute the posterior Belv(X|Y) given the conditional Belu(Y|X)
each variable of the network has a λ value and a π value associated with it
Directed evidential networks: Initialisation
we present here the (simpler) propagation algorithm for polytrees:
1. for each node v: Belv ← Bel⁰v, πv ← Belv, λv ← the vacuous BF
2. each root node sends a new message πv→u to every child u of v:
πv→u = Belv→u(Y) = Σ_{X⊆Θv} mv(X) Belu(Y|X)
3. node u waits for the messages from all its parents, then it:
- computes the new πu value via πu = Belu ⊕ (⊕_{v∈pa(u)} πv→u)
- computes the new marginal belief Belu ← πu ⊕ λu
- sends the new πu message to all its children
Directed evidential networks: Updating
whenever a new observation Bel^O is inputted into a node v:
1. node v computes its new value Belv = πv ⊕ λv, where
πv = Bel⁰v ⊕ (⊕_{u∈pa(v)} πu→v),   λv = Bel^O_v ⊕ (⊕_{w∈ch(v)} λw→v)
2. for every child node u, we calculate the new message and send it:
πv→u = Belv→u(Y) = Σ_{X⊆Θv} mv(X) Belu(Y|X)
where Belu(Y|X) is given by the disjunctive combination of Belu(Y|x), x ∈ Θv
3. for every parent node u, we compute the new message and send it:
λu→v = Belu→v(X) = Σ_{Y⊆Θu} mu(Y) Belv(X|Y)
where Belv(X|Y) is the posterior given by the GBT
Belief functions Random sets for the working scientist
IJCAI 2016
216 / 464
Reasoning
Graphical models
Directed evidential networks: Graphical example of propagation (figure)
Using belief functions
A set of tools for the working scientist: using belief functions
scientists face on a daily basis problems such as:
- making decisions based on the available data
- estimating a quantity of interest given the available data (which can be missing, incomplete, conflicting, or partially specified)
- classifying data-points into bins
  - extending k-NN classification approaches
  - fusing the results of multiple classifiers
- clustering clouds of data to make sense of them
- learning a mapping from measurements to a domain of interest (regression)
- ranking objects
belief functions can provide useful approaches to all these problems in the presence of (heavy) uncertainty
Using belief functions
Decision making
Decision making with belief functions: An overview
a natural application of the belief function representation of uncertainty
problem: selecting an act f from an available list F (making a 'decision') which optimises a certain objective function
various approaches to decision making:
- decision making in the TBM is based on expected utility via the pignistic transform
- Strat has proposed something similar in his 'cloaked carnival wheel' scenario
- generalised expected utility [Gilboa], based on classical expected utility theory [Savage, von Neumann]
there is also a lot of interest in multicriteria decision making (based on a number of attributes)
Expected utility approach: Decision making under uncertainty
a decision problem can be formalised by defining:
- a set Ω of states of the world
- a set X of consequences
- a set F of acts, where an act is a function f : Ω → X
let ≽ be a preference relation on F, such that f ≽ g means that f is at least as desirable as g
Savage (1954) showed that ≽ verifies some rationality requirements iff there exists a probability measure P on Ω and a utility function u : X → R such that
∀f, g ∈ F:   f ≽ g ⇔ EP(u ◦ f) ≥ EP(u ◦ g)
where EP denotes the expectation w.r.t. P; P and u are unique up to a positive affine transformation
does that mean that basing decisions on belief functions is irrational?
Decision making in the TBM: Expected utility using the pignistic probability
in the TBM, decision making is done by maximising the expected utility of actions based on the pignistic transform (as opposed to computing upper and lower expected utilities directly from (Bel, Pl) via the Choquet integral, as we will see later)
the set F of possible actions and the set Ω of possible outcomes are distinct, and the utility function is defined on F × Ω
Smets proves the necessity of the pignistic transform by maximising
E[u] = Σ_{ω∈Ω} u(f, ω) Pign(ω)
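A minimal sketch of the pignistic transform and the resulting expected utility (the mass values and utilities below are illustrative):

```python
def pignistic(m):
    """BetP: spread each mass equally over the elements of its focal element."""
    bet = {}
    for A, v in m.items():
        for w in A:
            bet[w] = bet.get(w, 0.0) + v / len(A)
    return bet

def expected_utility(u_f, bet):
    """u_f: dict ω -> u(f, ω) for a fixed act f."""
    return sum(u_f[w] * p for w, p in bet.items())

m = {frozenset({"Y"}): 0.458, frozenset({"N"}): 0.458,
     frozenset({"Y", "N"}): 0.084}
bet = pignistic(m)                                      # {'Y': 0.5, 'N': 0.5}
print(expected_utility({"Y": 100.0, "N": 0.0}, bet))    # 50.0
```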
Strat’s decision apparatus [UAI 1990]
Strat's decision apparatus is based on computing intervals of expected values
it assumes that the decision frame Ω is itself a set of scalar values (e.g. dollar returns), and does not distinguish between utilities and elements of Ω
.. so that an expected value interval can be computed: E(Ω) = [E_*(Ω), E^*(Ω)], where
E_*(Ω) = Σ_{A⊆Ω} inf(A) m(A),   E^*(Ω) = Σ_{A⊆Ω} sup(A) m(A)
this is not good enough to make a decision on its own, e.g.: should we pay a 6$ ticket to play when the expected interval is [5$, 8$]?
Strat's decision apparatus: A probability of favourable outcome
Strat identifies ρ as the probability that the value assigned to the hidden sector is the one the player would choose; 1 − ρ is the probability that the sector is chosen by the carnival hawker
Theorem. The expected value of the mass function of the wheel is E(Ω) = E_*(Ω) + ρ · (E^*(Ω) − E_*(Ω))
to decide whether to play the game we only need to assess ρ
basically, this amounts to a specific probability transform (like the pignistic one); Lesh [1986] had also proposed a similar approach
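A toy numerical check of the interval and of Strat's expected value, on an assumed wheel whose frame is a small set of dollar returns:

```python
# Masses on subsets of returns are illustrative, not from Strat's paper.
m = {frozenset({1}): 0.4, frozenset({5, 8}): 0.4, frozenset({1, 5, 8}): 0.2}
lower = sum(min(A) * v for A, v in m.items())   # E_*(Ω) = 2.6
upper = sum(max(A) * v for A, v in m.items())   # E^*(Ω) = 5.2
rho = 0.5   # assumed probability that the hidden sector favours the player
print(lower, upper, lower + rho * (upper - lower))   # 2.6 5.2 3.9
```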
Savage's axioms
Savage proposed 7 axioms, 4 of which are considered meaningful (the others are rather technical); let us examine the first two:
Axiom 1: ≽ is a total preorder (complete, reflexive and transitive)
Axiom 2 [Sure Thing Principle]: given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by
(fEh)(ω) = f(ω) if ω ∈ E,   (fEh)(ω) = h(ω) if ω ∉ E
then the Sure Thing Principle states that ∀E, ∀f, g, h, h′: fEh ≽ gEh ⇒ fEh′ ≽ gEh′
this axiom seems reasonable, but it is not verified empirically!
Ellsberg's paradox
suppose an urn contains 30 red balls and 60 balls that are either black or yellow. Consider the following gambles:
- f1: you receive 100 euros if you draw a red ball
- f2: you receive 100 euros if you draw a black ball
- f3: you receive 100 euros if you draw a red or yellow ball
- f4: you receive 100 euros if you draw a black or yellow ball
in this example Ω = {R, B, Y}, fi : Ω → R and X = R; the four acts are the mappings:

        R     B     Y
  f1    100   0     0
  f2    0     100   0
  f3    100   0     100
  f4    0     100   100

empirically, it is observed that most people strictly prefer f1 to f2, but strictly prefer f4 to f3
now, pick E = {R, B}: by definition
f1{R, B}0 = f1,   f2{R, B}0 = f2,   f1{R, B}100 = f3,   f2{R, B}100 = f4
since f1 ≻ f2, i.e. f1{R, B}0 ≻ f2{R, B}0, the Sure Thing Principle would imply f1{R, B}100 ≻ f2{R, B}100, i.e. f3 ≻ f4
empirically, the Sure Thing Principle is violated!
Gilboa's theorem
Gilboa (1987) proposed a modification of Savage's axioms with, in particular, a weaker form of Axiom 2
a preference relation ≽ meets these weaker requirements iff there exist a (not necessarily additive) measure µ and a utility function u : X → R such that
∀f, g ∈ F:   f ≽ g ⇔ Cµ(u ◦ f) ≥ Cµ(u ◦ g),
where Cµ is the Choquet integral, defined for X : Ω → R as
Cµ(X) = ∫_0^{+∞} µ(X(ω) ≥ t) dt + ∫_{−∞}^0 [µ(X(ω) ≥ t) − 1] dt
given a belief function Bel on Ω and a utility function u, this theorem supports making decisions based on the Choquet integral of u with respect to Bel or Pl
Lower and upper expected utilities
for finite Ω, it can be shown that
C_Bel(u ◦ f) = Σ_{B⊆Ω} m(B) min_{ω∈B} u(f(ω)),   C_Pl(u ◦ f) = Σ_{B⊆Ω} m(B) max_{ω∈B} u(f(ω))
let P(Bel) as usual be the set of probability measures P compatible with Bel, i.e., such that Bel ≤ P. Then it can be shown that
C_Bel(u ◦ f) = min_{P∈P(Bel)} E_P(u ◦ f) = E_*(u ◦ f),   C_Pl(u ◦ f) = max_{P∈P(Bel)} E_P(u ◦ f) = E^*(u ◦ f)
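For finite Ω the two Choquet integrals reduce to the min/max-weighted sums above; a minimal sketch (mass and utilities are illustrative):

```python
def lower_upper_expectation(m, utility):
    """m: mass dict (frozenset -> mass); utility: dict ω -> u(f, ω)."""
    lower = sum(v * min(utility[w] for w in B) for B, v in m.items())
    upper = sum(v * max(utility[w] for w in B) for B, v in m.items())
    return lower, upper

m = {frozenset({"Y"}): 0.458, frozenset({"N"}): 0.458,
     frozenset({"Y", "N"}): 0.084}
print(lower_upper_expectation(m, {"Y": 100.0, "N": 0.0}))   # (45.8, 54.2)
```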
Decision making: Strategies
for each act f we now have two expected utilities, E_*(f) and E^*(f): how do we make a decision?
possible decision criteria are based on interval dominance between the intervals [E_*(f), E^*(f)] of the available acts
Challenges
Efficient computation
Combining simple support functions
assume the input BFs are simple support functions, with focal elements only some A or Ω: we have, for each ω ∈ Ω, a Belω whose focal elements are only {ω}, ω̄ and Ω, and we want to combine them all
this uses the fact that the plausibility of the combined BF is a function of the input BFs' commonalities Q(A) = Σ_{B⊇A} m(B):
Pl(A) = Σ_{∅≠B⊆A} (−1)^{|B|+1} ∏_{ω∈Ω} Qω(B)
we get that
Pl(A) = K ( 1 + Σ_{ω∈A} Belω(ω) / (1 − Belω(ω)) − ∏_{ω∈A} Belω(ω̄) / (1 − Belω(ω)) )
the computation of a specific plausibility value Pl(A) is linear in the size of Ω (only the elements of A, and not its subsets, are involved)
however, the number of events A themselves is still exponential
Gordon and Shortliffe's scheme: based on diagnostic trees
they are interested in computing degrees of belief only for events forming a hierarchy (a diagnostic tree); in some applications certain events are not relevant, e.g. classes of diseases
the scheme combines simple support functions focused on the nodes or on their complements
it produces good approximations, unless the evidence is highly conflicting
Gordon and Shortliffe's scheme: based on diagnostic trees
however, intersections of complements produce focal elements that are not in the tree
the approximate algorithm:
1. first, combine all simple functions focussing on the node events (by Dempster's rule)
2. then, working down the tree, successively combine those focused on the complements of the nodes
3. tricky bit: when doing so, replace each intersection of focal elements with the smallest node in the tree that contains it
the result depends on the order of the combinations in phase 2
again, the approximation can be poor; also, no degrees of belief are assigned to the complements of the nodes, and therefore we cannot compute their plausibilities!
A simple Monte-Carlo approach to Dempster's combination - Wilson, 1989
we seek Bel = Bel1 ⊕ ... ⊕ Belm on Ω, where the evidence is induced by probability distributions Pi on Ci via Γi : Ci → 2^Ω
the Monte-Carlo algorithm simulates the random set interpretation of belief functions: Bel(A) = P(Γ(c) ⊆ A | Γ(c) ≠ ∅)

for a large number of trials n = 1:N do
    randomly pick c ∈ C such that Γ(c) ≠ ∅:
        for i = 1:m do
            randomly pick an element ci of Ci with probability Pi(ci)
        end for
        let c = (c1, ..., cm)
        if Γ(c) = ∅ then restart the trial
    if Γ(c) ⊆ A then the trial succeeds, T = 1
end for
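A runnable sketch of this simulation, with each source given as a mass dict from which focal elements are sampled (the setup and names are ours); note how the rejection step makes the cost grow with the amount of conflict:

```python
import random

def mc_bel(masses, A, trials=50_000):
    """Estimate the combined Bel(A) = P(Γ(c) ⊆ A | Γ(c) ≠ ∅) by sampling."""
    successes = valid = 0
    while valid < trials:
        picks = [random.choices(list(m), weights=list(m.values()))[0]
                 for m in masses]
        gamma = frozenset.intersection(*picks)
        if not gamma:
            continue            # Γ(c) = ∅: restart the trial
        valid += 1
        successes += gamma <= A
    return successes / valid

m1 = {frozenset({"Y"}): 0.72, frozenset({"N"}): 0.08, frozenset({"Y", "N"}): 0.2}
m2 = {frozenset({"Y"}): 0.08, frozenset({"N"}): 0.72, frozenset({"Y", "N"}): 0.2}
print(mc_bel([m1, m2], frozenset({"Y"})))   # ≈ 0.458, cf. the fusion example
```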
A Monte-Carlo approach - Wilson, 1989
the proportion of trials which succeed converges to Bel(A): E[T̄] = Bel(A), Var[T̄] ≤ 1/(4N)
we say the algorithm has accuracy k if 3σ[T̄] ≤ k
picking c ∈ C involves m random numbers, so it takes time A·m for some constant A; testing whether xj ∈ Γ(c) takes less than B·m, for some constant B
the expected time of the algorithm is (N·m / (1 − κ)) · (A + B|Ω|), where κ is Shafer's conflict measure
the expected time to achieve accuracy k is then (9m / (4(1 − κ)k²)) · (A + C|Ω|) for a constant C; it is better for simple support functions
conclusion: unless κ is close to 1 (highly conflicting evidence), Dempster's combination is feasible for large values of m (the number of BFs to combine) and large Ω (the hypothesis space)
Markov-Chain Monte-Carlo - Wilson and Moral, 1996
trials are not independent but form a Markov chain
the non-deterministic OPERATIONi changes at most the i-th coordinate c′(i) of c′ to y, with chance Pi(y):
Pr(OPERATIONi(c′) = c) ∝ Pi(c(i)) if c(j) = c′(j) for all j ≠ i, and 0 otherwise
the MCMC algorithm returns a value BELN(c0), the proportion of time in which Γ(cc) ⊆ X:

cc = c0; S = 0
for n = 1:N do
    for i = 1:m do
        cc = OPERATIONi(cc)
        if Γ(cc) ⊆ X then S = S + 1
    end for
end for
return S / (N·m)
Importance sampling - Wilson and Moral, 1996
Theorem. If C is connected (i.e., any c, c′ are linked by a chain of OPERATIONi), then given ε, δ there exist K0, N0 such that for all K ≥ K0, N ≥ N0 and c0: Pr(|BELNK(c0) − Bel(X)| < ε) ≥ 1 − δ
a further step is importance sampling: pick samples c¹, ..., c^N according to an 'easy to handle' probability distribution P∗, and assign to each sample a weight wi = P(c^i) / P∗(c^i)
if P(c) > 0 implies P∗(c) > 0, then the average
( Σ_{Γ(c^i)⊆X} wi ) / N
is an unbiased estimator of Bel(X)
one should try to use a P∗ as close as possible to the real P; strategies are proposed to compute P(C) = Σ_c P(c)
Efficient implementation: a summary
do belief functions have a problem with computational complexity? the answer is: only if naively implemented
does Bayesian inference on graphical models have computational issues? YES, it is NP-hard; even approximate inference is NP-hard
that was solved by Monte-Carlo methods, and the same holds for belief inference: we decide how many samples we want to use for the approximation, and go for it
the point is not assigning mass values to all the subsets out there in these huge spaces, but being allowed to assign mass to a subset when that is the right thing to do!
Challenges
Belief functions on reals
Continuous formulations of the theory of belief functions
in the original formulation by Shafer [1976], belief functions are defined on finite sets only
the need to generalise this to arbitrary domains was recognised at an early stage
the main approaches to a continuous formulation presented here:
- Shafer's allocations of probability [1982]
- belief functions as random sets [Nguyen]
- belief functions on Borel intervals of the real line [Strat, Smets]
other approaches, with limited impact (so far):
- generalised evidence theory
- MV algebras
- several others
Allocations of probability - Shafer, 1979
every belief function can be represented as an allocation of probability, i.e., a ∩-homomorphism into a positive and completely additive probability algebra (deduced from the integral representation due to Choquet)
- for every belief function Bel defined on a class of events E ⊆ 2^Ω there exist a complete Boolean algebra M, a positive measure µ and an allocation of probability ρ between E and M such that Bel = µ ◦ ρ
two regularity conditions for a belief function over an infinite domain are considered: continuity and condensability
canonical continuous extensions of belief functions to arbitrary power sets can be introduced by allocation of probability
the approach shows significant resemblance to the notions of inner measure and extension of capacities [Honda]
Continuity and condensability - Shafer's allocations of probability
E ⊆ 2^Θ is a multiplicative subclass of 2^Θ if A ∩ B ∈ E for all A, B ∈ E
a function Bel : E → [0, 1] such that Bel(∅) = 0, Bel(Θ) = 1 and Bel is monotone of order ∞ is a belief function
- likewise, an upper probability (plausibility) function is alternating of order ∞ (with ≥ exchanged with ≤)
a BF on 2^Θ is continuous if Bel(∩_i Ai) = lim_{i→∞} Bel(Ai) for every decreasing sequence of sets Ai; a BF on a multiplicative subclass E is continuous if it can be extended to a continuous one on 2^Θ
- continuity arises from partial beliefs on 'objective' probabilities
a BF on 2^Θ is condensable if Bel(∩A) = inf_{A∈A} Bel(A) for every downward net A in 2^Θ; a BF on a multiplicative subclass E is condensable if it can be extended to a condensable one on 2^Θ
- a downward net is such that, given two elements, there is always an element that is a subset of their intersection
condensability is restrictive, but related to Dempster's rule
Choquet's representation - Shafer's allocations of probability
Choquet's integral representation says that every belief function can be represented by an allocation of probability
r : E → F is a ∩-homomorphism if it preserves ∩
Choquet's theorem: for every BF Bel on a multiplicative subclass E of 2^Θ, there exist a set X, an algebra F of its subsets, a finitely additive probability measure µ on F, and a ∩-homomorphism r : E → F such that Bel = µ ◦ r
if we replace the measure space (X, F, µ) with a probability algebra (a complete Boolean algebra M with a completely additive probability measure µ) we get:
Allocation of probability: for every BF Bel on a multiplicative subclass E of 2^Θ, there exists an allocation of probability ρ : E → M such that Bel = µ ◦ ρ
the non-zero elements of M can be thought of as focal elements
Canonical extension

Theorem
A BF on a multiplicative subclass E can always be extended to a belief function on 2^Θ by canonical extension:
Bel(A) = sup { Σ_{∅ ≠ I ⊆ {1,...,n}} (−1)^{|I|+1} Bel(∩_{i∈I} A_i) : n ≥ 1, A_1, ..., A_n ∈ E, A_i ⊆ A }
- the proof is based on the existence of an allocation for the extension
- note the similarity with the superadditivity axiom
- also related to inner measures, which provide approximate belief values for subsets not in a σ-algebra
- Bel is the minimal such extension
- what about evidence combination? condensability ensures that the Boolean algebra M represents intersection properly for arbitrary (not just finite) collections B of subsets:
  ρ(∩B) = ∧_{B∈B} ρ(B) ∀B ⊂ 2^Ω
- this allows us to imagine Dempster's combinations of infinitely many belief functions
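To make the sup concrete, here is a minimal Python sketch on a finite frame (the dict-based encoding of Bel and E is my own illustration, not from the tutorial); over a finite subclass the sup is attained on subsets of the elements of E contained in A, since repeating a set never increases the inclusion-exclusion sum.

```python
from itertools import combinations

def canonical_extension(bel, E, A):
    """Shafer's canonical extension of bel (dict: frozenset -> value),
    defined on a multiplicative subclass E, to an arbitrary subset A."""
    inside = [B for B in E if B <= A]     # elements of E contained in A
    best = 0.0
    for n in range(1, len(inside) + 1):
        for coll in combinations(inside, n):
            s = 0.0
            for k in range(1, n + 1):
                for I in combinations(coll, k):
                    inter = frozenset.intersection(*I)   # stays in E
                    s += (-1) ** (k + 1) * bel.get(inter, 0.0)
            best = max(best, s)
    return best

# hypothetical multiplicative subclass and belief values
E = [frozenset({1}), frozenset({1, 2})]
bel = {frozenset({1}): 0.3, frozenset({1, 2}): 0.5}
print(canonical_extension(bel, E, frozenset({1, 2, 3})))   # 0.5
```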
Continuous belief functions (Strat's approach)
- idea: take a real interval I and split it into N bits
- take as frame of discernment the set of possible intervals with these extremes: [0, 1), [0, 2), [1, 4], etc.
- a belief function there has ∼ N²/2 possible focal elements, so that its mass lives on a triangle, and one can compute belief and plausibility by integration
Continuous belief functions (Strat's approach, continued)
- this trivially generalises to arbitrary subintervals of I:
  Bel([a, b]) = ∫_a^b ∫_x^b m(x, y) dy dx,  Pl([a, b]) = ∫_0^b ∫_{max(a,x)}^N m(x, y) dy dx
- Dempster's rule generalises as
  Bel1 ⊕ Bel2([a, b]) = (1/K) ∫_0^a ∫_b^N [ m1(x, b) m2(a, y) + m2(x, b) m1(a, y) + m1(a, b) m2(x, y) + m2(a, b) m1(x, y) ] dy dx
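A minimal numerical sketch of the two integrals above, using scipy; the mass density m(x, y) below (uniform over the triangle 0 ≤ x ≤ y ≤ N) is a hypothetical example, not from the tutorial.

```python
from scipy.integrate import dblquad

N = 10.0  # length of the interval I = [0, N]

def m(x, y):
    """Hypothetical mass density over intervals [x, y]:
    uniform over the triangle 0 <= x <= y <= N (integrates to 1)."""
    return 2.0 / N**2 if x <= y else 0.0

def bel(a, b):
    # Bel([a,b]): mass of intervals [x,y] contained in [a,b]
    val, _ = dblquad(lambda y, x: m(x, y), a, b, lambda x: x, lambda x: b)
    return val

def pl(a, b):
    # Pl([a,b]): mass of intervals [x,y] intersecting [a,b]
    val, _ = dblquad(lambda y, x: m(x, y), 0.0, b,
                     lambda x: max(a, x), lambda x: N)
    return val

print(bel(2.0, 6.0), pl(2.0, 6.0))   # 0.16, 0.8: Bel <= Pl as expected
```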
Continuous belief functions on the Borel algebra of intervals
- a pretty much identical approach is followed by Smets
- allows us to define a continuous pignistic PDF as
  Bet(a) = lim_{ε→0} ∫_0^a ∫_{a+ε}^1 [ m(x, y) / (y − x) ] dy dx
- can be easily extended to the real line, by considering belief functions defined on the Borel σ-algebra of subsets of R generated by the collection I of closed intervals
- the theory provides a way of building a continuous belief function from a pignistic density, by applying the least commitment principle and assuming unimodal pignistic PDFs:
  Bel(s) = −(s − s̄) dBet(s)/ds
  where s̄ is such that Bet(s) = Bet(s̄)
- example: Bet(x) = N(x; µ, σ) is normal → Bel(y) = (2y²/√(2π)) e^{−y²/2}, where y = (x − µ)/σ
Continuous belief functions induced by random closed intervals
- formal setting: let (U, V) be a two-dimensional random variable from (C, A, P) to (R², B(R²)) such that P(U ≤ V) = 1 and Γ(c) = [U(c), V(c)] ⊆ R
[figure: the multivalued mapping Γ sends each c ∈ (C, A, P) to the closed interval [U(c), V(c)] of the real line]
- this setting defines a random closed interval, which induces a belief function on (R, B(R)) defined by
  Bel(A) = P([U, V] ⊆ A), ∀A ∈ B(R)
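A Monte-Carlo sketch of this construction; the particular law of (U, V) below (normal centre, exponential half-width) is a hypothetical choice made only so that P(U ≤ V) = 1 holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical random closed interval: centre ~ N(0,1), half-width ~ Exp(1)
n = 100_000
centre = rng.normal(0.0, 1.0, n)
half = rng.exponential(1.0, n)
U, V = centre - half, centre + half     # U <= V always

def bel(a, b):
    # Bel([a,b]) = P([U,V] is contained in [a,b])
    return np.mean((U >= a) & (V <= b))

def pl(a, b):
    # Pl([a,b]) = P([U,V] intersects [a,b])
    return np.mean((U <= b) & (V >= a))

print(bel(-1, 1), pl(-1, 1))   # Bel <= Pl
```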
Special cases of random closed intervals
[figure: two panels; left, a consonant random interval induced by a possibility distribution π(x), with nested intervals Γ(c) = [U(c), V(c)] cut at level c ∈ [0, 1]; right, a p-box (F_*, F*) inducing intervals [U(c), V(c)] from the upper and lower CDF bounds]
- special cases:
  - a fuzzy set on the real line induces a mapping to a collection of nested intervals, parameterised by the level c
  - a p-box, i.e., upper and lower bounds to a cumulative distribution function (see later), also induces a family of intervals
From Boolean algebras to MV algebras
- study belief functions in a setting more general than Boolean algebras of events
- inspired by the generalisation of classical probability towards "many-valued" events, such as those resulting from formulas in Łukasiewicz infinite-valued logic
- an algebra of such many-valued events is called an MV algebra
- upper/lower probabilities and possibility measures can also be defined on MV algebras
MV algebra: Definition

MV algebra
An algebra ⟨M, ⊕, ¬, 0⟩ with a binary operation ⊕, a unary operation ¬ and a constant 0 such that ⟨M, ⊕, 0⟩ is an abelian monoid and the following equations hold for every f, g ∈ M:
¬¬f = f,  f ⊕ ¬0 = ¬0,  ¬(¬f ⊕ g) ⊕ g = ¬(¬g ⊕ f) ⊕ f
- we define 1 = ¬0, f ⊙ g = ¬(¬f ⊕ ¬g), and f ≤ g iff ¬f ⊕ g = 1
- the inf and sup so defined, f ∨ g = ¬(¬f ⊕ g) ⊕ g and f ∧ g = ¬(¬f ∨ ¬g), make ⟨M, ∨, ∧, 0, 1⟩ a distributive lattice
- example: the standard MV algebra is the real interval [0, 1] equipped with f ⊕ g = min(1, f + g), ¬f = 1 − f and f ⊙ g = max(0, f + g − 1)
  - in this case ⊙ and ⊕ are known as the Łukasiewicz t-norm and t-conorm
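As a sanity check, here is a minimal Python sketch of the standard MV algebra on [0, 1]; the random spot-check of the axioms is illustrative only.

```python
import random

# the standard MV algebra on [0,1] (Lukasiewicz operations)
def oplus(f, g):   return min(1.0, f + g)          # t-conorm
def neg(f):        return 1.0 - f
def otimes(f, g):  return max(0.0, f + g - 1.0)    # t-norm
def join(f, g):    return oplus(neg(oplus(neg(f), g)), g)   # f v g = max(f, g)
def meet(f, g):    return neg(join(neg(f), neg(g)))         # f ^ g = min(f, g)

# spot-check the MV axioms on random elements
for _ in range(1000):
    f, g = random.random(), random.random()
    assert abs(neg(neg(f)) - f) < 1e-12
    assert oplus(f, neg(0.0)) == 1.0
    lhs = oplus(neg(oplus(neg(f), g)), g)
    rhs = oplus(neg(oplus(neg(g), f)), f)
    assert abs(lhs - rhs) < 1e-12              # both equal max(f, g)
    assert abs(join(f, g) - max(f, g)) < 1e-12
    assert abs(meet(f, g) - min(f, g)) < 1e-12
```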
States as generalisations of finite probabilities on MV algebras
- Boolean algebras are also a special case, with ⊕, ⊙ and ¬ being union, intersection and complement
- semisimple algebras: isomorphic to continuous functions onto [0, 1] on some compact Hausdorff space; they can be viewed as many-valued counterparts of algebras of sets
- a totally monotone function b : M → [0, 1] can be defined on an MV algebra, by replacing ∪ with ∨ and ⊂ with ≤
- a state is a mapping s : M → [0, 1] such that s(1) = 1 and s(f ⊕ g) = s(f) + s(g) whenever f ⊙ g = 0 (a generalisation of a finitely additive probability measure)
- states on semisimple MV algebras are integrals with respect to a Borel probability measure on the Hausdorff space: s(f) = ∫ f dµ for each f ∈ M
Belief functions on MV algebras
- consider the MV-algebra [0,1]^{P(X)} of all functions P(X) → [0,1], where X is finite
- let ρ : [0,1]^X → [0,1]^{P(X)} be defined as
  ρ(f)(B) = min{f(x) : x ∈ B} for B ≠ ∅,  ρ(f)(B) = 1 otherwise
- if f = 1_A (the indicator function of event A) then ρ(1_A)(B) = 1 iff B ⊆ A, and we can rewrite Bel(A) = m(ρ(1_A)), where m is defined on collections of events

Definition
b : [0,1]^X → [0,1] is a belief function on [0,1]^X if there is a state s on the MV-algebra [0,1]^{P(X)} such that s(1_∅) = 0 and b(f) = s(ρ(f)), for every f ∈ [0,1]^X. The state s is called a state assignment.
- such belief functions take values on [0,1]-valued functions of X (events are a special case)
- state assignment → probability measure on Ω in the random set interpretation
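On a finite frame, one concrete reading of b(f) = s(ρ(f)) weights the minimum of f over each focal element by a mass assignment, so that b reduces to Bel(A) when f is an indicator; a minimal sketch (the frame, masses and fuzzy event below are hypothetical).

```python
# belief of a fuzzy event f on a finite frame X, given a mass assignment m:
# b(f) = sum_B m(B) * min_{x in B} f(x)   (reduces to Bel(A) when f = 1_A)
X = {'a', 'b', 'c'}
m = {frozenset({'a'}): 0.3,
     frozenset({'a', 'b'}): 0.5,
     frozenset(X): 0.2}                 # hypothetical mass assignment

def b(f):
    return sum(mass * min(f[x] for x in B) for B, mass in m.items())

fuzzy = {'a': 0.9, 'b': 0.4, 'c': 0.1}          # a fuzzy event
indicator_ab = {'a': 1.0, 'b': 1.0, 'c': 0.0}   # f = 1_{a,b}
print(b(fuzzy))            # belief of the fuzzy event
print(b(indicator_ab))     # = Bel({a,b}) = 0.3 + 0.5 = 0.8
```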
Properties of belief functions on MV algebras
- there is an integral representation of such belief functions by the Choquet integral
- strong connection with BFs on fuzzy sets
- all standard properties of classical BFs are met (e.g. superadditivity)
- the set of such belief functions on [0,1]^X is a simplex, whose extreme points correspond to the generalisation of categorical BFs
- can be extended to infinite spaces
Belief functions as random sets: Rationale
- given a multi-valued mapping Γ, a straightforward step is to consider the probability value P(c) as attached to the subset Γ(c) ⊆ Ω
- what we obtain is a random set in Ω, i.e., a probability measure on a collection of subsets; roughly speaking, a random set is a set-valued random variable
- the degree of belief Bel(A) of an event A becomes the cumulative distribution function (CDF) of the interval of sets {B ⊆ A}
- this approach has been emphasised in particular by [Nguyen, 1978], [Hestir, 1991] and [Shafer, 1987]
- example: a die where one or more faces are covered, so that we do not know what is beneath, is a random variable which 'spits out' subsets of possible outcomes: a random set
Belief functions as random sets: Mathematics
- the lower inverse of Γ is defined as
  Γ∗(A) := {c ∈ C : Γ(c) ⊆ A, Γ(c) ≠ ∅}
  while its upper inverse is
  Γ*(A) := {c ∈ C : Γ(c) ∩ A ≠ ∅}
- given two σ-fields A, B on C, Ω respectively, Γ is said to be strongly measurable iff ∀B ∈ B, Γ*(B) ∈ A
- the lower probability measure on B is defined as P∗(B) := P(Γ∗(B)) for all B ∈ B; this is nothing but a belief function!
- Nguyen proved that, if Γ is strongly measurable, the CDF P̂ of the random set coincides with the lower probability measure:
  P̂[I(B)] = P∗(B) ∀B ∈ B, where I(B) := {C ∈ B : C ⊆ B}
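A finite sketch of this setting: a probability on C pushed through a multivalued mapping gives the lower (belief) and upper (plausibility) measures via the two inverses; the source probability and mapping below are hypothetical.

```python
# probability on the source space C, and a multivalued mapping Gamma : C -> 2^Theta
P = {'c1': 0.5, 'c2': 0.3, 'c3': 0.2}
Gamma = {'c1': frozenset({1}),
         'c2': frozenset({1, 2}),
         'c3': frozenset({2, 3})}

def bel(A):
    # lower inverse: sources whose (non-empty) image is contained in A
    return sum(P[c] for c, G in Gamma.items() if G and G <= A)

def pl(A):
    # upper inverse: sources whose image intersects A
    return sum(P[c] for c, G in Gamma.items() if G & A)

A = frozenset({1, 2})
print(bel(A), pl(A))   # 0.8, 1.0
```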
Random sets to extend belief functions to arbitrary domains
- the notion of condensability has been studied by Nguyen for upper probabilities generated by random sets too [Nguyen 1978]
- efforts directed at a general theory on arbitrary domains
- for finite random sets (i.e. with a finite number of focal elements), under independence of variables, Dempster's rule can be applied:
  (F, m) = { A_{i_1,...,i_d} = ×_{j=1}^d A_{i_j} },  m_{i_1,...,i_d} = m_{i_1} · ... · m_{i_d}
- for dependent sources, Fetz and Oberguggenberger have proposed an 'unknown interaction' model; for infinite random sets, Alvarez (see p-boxes later) a Monte-Carlo sampling method
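A small sketch of this product construction for independent finite random sets (frames and masses are hypothetical); each joint focal element is stored as the tuple of its factors, and masses multiply.

```python
from itertools import product

def joint(*random_sets):
    """Joint random set of independent finite random sets;
    each argument is a dict frozenset -> mass on its own frame."""
    out = {}
    for combo in product(*(rs.items() for rs in random_sets)):
        focal = tuple(A for A, _ in combo)   # product focal element (as factors)
        mass = 1.0
        for _, m in combo:
            mass *= m
        out[focal] = out.get(focal, 0.0) + mass
    return out

rs1 = {frozenset({'H'}): 0.6, frozenset({'H', 'T'}): 0.4}
rs2 = {frozenset({0}): 0.5, frozenset({0, 1}): 0.5}
print(joint(rs1, rs2))   # four product focal elements, masses sum to 1
```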
Belief functions as random sets: Molchanov's work
- random set theory has been much studied by Molchanov [2006], who developed a theory of calculus with capacities and random sets:
  - Radon-Nikodym theorems for capacities and random sets (see Horizons) and derivatives of capacities
  - (conditional) expectations of random sets
  - limit theorems: strong law of large numbers, central limit theorem, Gaussian random sets
  - set-valued random processes
Belief functions on reals: State of the art
- the most popular extension, to closed intervals, proposed by Strat and Smets, gave birth to what are called 'continuous belief functions'
- after an initial effort by Nguyen and others, random sets had been rather neglected
- recently, strong renewed interest in a theory of random sets, thanks to Molchanov and others: a strong and powerful mathematical framework!
- the way forward for the theory, in my view; no treatment of conditioning and combination yet
New horizons
A research programme
- we made the case that non-additive probabilities arise from real issues with the way standard probability models the data (or its absence)
- we showed that random sets are the most natural representation of uncertainty; they are also a straightforward generalisation of mathematical statistics
- how should the theory develop? some modest proposals:
  - generalised logistic regression for dealing with rare events
  - parameterised families of random sets ..
  - .. which would allow frequentist hypothesis testing ..
  - .. and MAP-like estimation ..
  - in particular, Gaussian random sets ..
  - .. and how the central limit theorem generalises to RS
  - generalising the total probability theorem ..
  - .. and the concept of random variable
- where can its full impact be felt?
  - new, robust foundations for machine learning
  - a novel understanding of quantum mechanics
  - robust models of climatic change
- a geometry of uncertainty as a general framework for uncertainty theory
Upper and lower likelihood
Belief likelihood function: Generalising the sample likelihood
- the traditional likelihood function is a conditional probability of the data given a parameter θ ∈ Θ, i.e. a family of PDFs over X parameterised by θ
- different take: instead of using the conventional likelihood to build a belief function, can we define a 'belief likelihood function' of a sample x ∈ X?
- it is natural to define a belief (set-)likelihood function as a family of belief functions on X, Bel_X(·|θ), parameterised by θ ∈ Θ
  - this is the input of Smets' Generalised Bayesian Theorem, a collection of 'conditional' belief functions
- note that a belief likelihood takes values on sets of outcomes; individual outcomes are a special case
- seems a natural setting for computing likelihoods of set-valued observations → coherent with the random set philosophy
Belief likelihood function: Series of trials
- what can we say about the belief likelihood function of a series of trials?
- observations are a tuple x = (x_1, ..., x_n) ∈ X_1 × · · · × X_n, where X_i = X denotes the space of quantities observed at time i
- by definition the belief likelihood function is Bel_{X_1×···×X_n}(A|θ), where A is any subset of X_1 × · · · × X_n

Belief likelihood function of repeated trials
Bel_{X_1×···×X_n}(A|θ) := [ Bel_{X_1}^{↑×_i X_i} ⊛ · · · ⊛ Bel_{X_n}^{↑×_i X_i} ](A|θ)
where Bel_{X_j}^{↑×_i X_i} is the vacuous extension of Bel_{X_j} to the Cartesian product X_1 × · · · × X_n where the observed tuples live, and ⊛ is a combination rule.
Belief likelihood function: Series of trials, individual tuples
- can we reduce this to the belief values of the individual trials? yes, if we wish to compute likelihood values of tuples of individual outcomes rather than sets of them

Decomposition for individual tuples
When using either the conjunctive combination ∩ or Dempster's ⊕ as the combination rule in the definition of the belief likelihood function, the following holds:
L(x = {x_1, ..., x_n}) := Bel_{X_1×···×X_n}({(x_1, ..., x_n)}|θ) = ∏_{i=1}^n Bel_{X_i}(x_i)
L̄(x = {x_1, ..., x_n}) := Pl_{X_1×···×X_n}({(x_1, ..., x_n)}|θ) = ∏_{i=1}^n Pl_{X_i}(x_i)

- we can call these the lower and upper likelihoods of the sample x = {x_1, ..., x_n}
- the second line amounts to conditional conjunctive independence (but just for individual samples x)
- new result, yet unpublished; similar regularities hold when using the more cautious disjunctive combination ∪
- open question: does this hold for arbitrary subsets of samples A ⊂ X_1 × · · · × X_n?
Lower and upper likelihoods: Bernoulli trials
- let us go back to the Bernoulli trials example: X_i = X = {H, T}
- under conditional independence and equidistribution, the traditional likelihood for a series of Bernoulli trials reads as p^k (1 − p)^{n−k}, where k is the number of successes and n the number of trials
- let us compute the belief likelihood function for Bernoulli trials! we seek the belief function on X = {H, T}, parameterised by p = m(H), q = m(T) (with p + q ≤ 1 this time), which best describes the observed sample
- if we apply the previous result, since all the Bel_i are equally distributed, the lower and upper likelihoods of the sample x = {x_1, ..., x_n} are:
  L(x) = Bel_X({x_1}) · ... · Bel_X({x_n}) = p^k q^{n−k}
  L̄(x) = Pl_X({x_1}) · ... · Pl_X({x_n}) = (1 − q)^k (1 − p)^{n−k}
- after normalisation, these are PDFs over the space B of all belief functions definable on X!
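A small numerical sketch of these two functions (the counts k, n are made up); a grid search over the simplex p + q ≤ 1 of belief functions on {H, T} locates the two maxima discussed on the next slide.

```python
import numpy as np

k, n = 6, 10   # hypothetical observed successes / trials

def lower_L(p, q): return p**k * q**(n - k)
def upper_L(p, q): return (1 - q)**k * (1 - p)**(n - k)

ps = np.linspace(0, 1, 201)
best_lo, best_hi = (0, 0, -1.0), (0, 0, -1.0)
for p in ps:
    for q in ps[ps <= 1 - p + 1e-12]:      # stay inside the simplex
        lo, hi = lower_L(p, q), upper_L(p, q)
        if lo > best_lo[2]: best_lo = (p, q, lo)
        if hi > best_hi[2]: best_hi = (p, q, hi)

print(best_lo[:2])   # ~ (k/n, 1-k/n): the classical ML estimate, on p+q = 1
print(best_hi[:2])   # (0, 0): the vacuous belief function
```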
Lower and upper likelihoods (Bernoulli trials)
- the lower likelihood (left) reduces to the traditional likelihood p^k (1 − p)^{n−k} for p + q = 1
- the maximum of the lower likelihood is the traditional ML estimate
  - makes sense: the lower likelihood is highest for the most committed belief functions (i.e. probabilities)
- the upper likelihood (right) has its maximum in p = q = 0 (the vacuous BF on {H, T})
- the interval of BFs joining max L with max L̄ is the set of belief functions such that p/q = k/(n−k), those which preserve the ratio between the empirical counts
- once again the maths leads us to think in terms of intervals of belief functions, rather than individual ones
Generalising logistic regression and rare events
Generalising logistic regression (1)
- Bernoulli trials are central in statistics: generalising their likelihood allows us to represent uncertainty in a number of regression problems
- in logistic regression
  π_i = P(Y_i = 1|x_i) = 1 / (1 + e^{−(β_0 + β_1 x_i)}),  1 − π_i = P(Y_i = 0|x_i) = e^{−(β_0 + β_1 x_i)} / (1 + e^{−(β_0 + β_1 x_i)})   (19)
- the parameters β_0, β_1 are estimated by maximum likelihood of the sample, where
  L(β_0, β_1|Y) = ∏_{i=1}^n π_i^{Y_i} (1 − π_i)^{1−Y_i}
  with Y_i ∈ {0, 1} and π_i a function of β_0, β_1, yielding a single conditional PDF
- as in the Bernoulli series experiment, we can replace the conditional probability (π_i, 1 − π_i) on X = {0, 1} with a belief function there
Generalising logistic regression (2)
- upper and lower likelihoods can then be computed as
  L(β|Y) = ∏_{i=1}^n π_i^{Y_i} q_i^{1−Y_i},  L̄(β|Y) = ∏_{i=1}^n (1 − q_i)^{Y_i} (1 − π_i)^{1−Y_i}
  where this time the Bel_i are not equally distributed
- how do we generalise the logit link between observations x and outputs y? just assuming (19) does not yield any analytical dependency for q_i
- first simple proposal: add a parameter β_2 such that
  q_i = m(Y_i = 0|x_i) = β_2 e^{−(β_0 + β_1 x_i)} / (1 + e^{−(β_0 + β_1 x_i)})   (20)
- we can then find lower and upper optimal estimates for the parameters β:
  arg max_β L ↦ (β̲_0, β̲_1, β̲_2),  arg max_β L̄ ↦ (β̄_0, β̄_1, β̄_2)
- plugging these optimal parameters into (19), (20) yields an upper and a lower family of conditional belief functions given x (again an interval of BFs):
  Bel_X(·|β̲, x), Bel_X(·|β̄, x)
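A minimal sketch of fitting the lower-likelihood parameters under the link (20), on synthetic data (everything below, including the bound β_2 ∈ [0, 1] keeping π_i + q_i ≤ 1, is an illustrative assumption, not the tutorial's implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=200)                                   # synthetic inputs
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(float)

def sigmoid(t): return 1 / (1 + np.exp(-t))

def neg_log_lower(beta):
    b0, b1, b2 = beta
    pi = sigmoid(b0 + b1 * x)      # pi_i = m(Y_i = 1 | x_i), eq. (19)
    q = b2 * (1 - pi)              # q_i  = m(Y_i = 0 | x_i), eq. (20)
    eps = 1e-12
    return -np.sum(y * np.log(pi + eps) + (1 - y) * np.log(q + eps))

res = minimize(neg_log_lower, x0=[0.0, 1.0, 0.5],
               bounds=[(None, None), (None, None), (0.0, 1.0)])
print(res.x)   # lower estimates (beta0, beta1, beta2)
```

Note that maximising the lower likelihood pushes β_2 towards 1, recovering the classical logistic fit, in line with the Bernoulli analysis above; the upper-likelihood fit is analogous with (1 − q_i) and (1 − π_i) in place of π_i and q_i.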
Rare events with belief functions: Generalising logistic regression
- how do we use belief functions to be cautious about rare event prediction?
- when we measure a new observation x we plug it into Bel_X(·|β̲, x) and Bel_X(·|β̄, x), and get a lower and an upper belief function on Y
- note that each belief function is really an envelope of logistic functions
- robust estimation of rare events: how does this relate to the results of classical logit regression? more to come in the near future!
Frequentist inference with RS
Choice of multivalued mappings
- recall Dempster's random set interpretation
- should the multivalued mapping Γ which defines a random set be modelled, or derived from the problem?
- e.g.: in the cloaked die example, it is the occlusion which generates the mapping
Parameterised families of random sets: Parameterised mapping, fixed distribution
- however, in other cases it may make sense to model a parameterised family of multivalued mappings Γ(·|θ) : Ω → 2^Θ
- given a (fixed) probability on Ω, this yields a parameterised family of random sets
- rationale: start with the classical random experiments which generate a given family of distributions .. .. and generalise the setting (design) to the case of set-valued observations
- proposal families: Gaussian, binomial, multinomial random sets
Parameterised families of random sets: Parameterised distribution, fixed mapping
- the other option is to fix the multi-valued mapping (e.g., when it is given by the problem) .. .. and have the source probability vary within a certain parameterised family of distributions
- this will also induce a family of random sets
Hypothesis testing with random sets
- in hypothesis testing, designing an experiment amounts to choosing a family of probability distributions generating the data
- if parameterised families of random sets can be constructed, they can be plugged into the frequentist inference machinery (step 2 below):
1. state the relevant null hypothesis H0 and the alternative hypotheses
2. state assumptions about the random set (mass assignment) describing the observations (in place of the form of the distributions)
3. state the relevant test statistic T (a quantity derived from the sample); this time the sample contains set-valued observations!
4. derive the mass assignment (in place of the distribution) of the test statistic under the null hypothesis (from the assumptions)
5. set a significance level (α)
6. compute from the observations the observed value t_obs of the test statistic T; this will also be set-valued
7. calculate the conditional belief value (in place of the p-value) under H0 of sampling a test statistic at least as extreme as the observed value
8. reject the null hypothesis, in favour of the alternative hypothesis, if and only if such conditional belief value is less than the significance level
Central limit theorem
The role of Gaussians in probability theory
- the Gaussian ('normal') distribution is central in probability theory and its applications: my noise is Gaussian, my kernel is Gaussian, etc.
- Gaussians have very nice properties: their moments are sufficient statistics; the Gaussian is the PDF with maximum entropy among those with given mean and standard deviation
- the central limit theorem shows that sums of i.i.d. random variables are asymptotically Gaussian
- whenever test statistics or estimators are functions of sums of random variables, they will have asymptotically normal distributions
A central limit theorem for random sets
- the old proposal by Dempster and Liu merely transfers normal distributions on the real line by Cartesian product with R^m
- more sensible/interesting option: investigating how Gaussian distributions are transformed under (appropriate) multivalued mappings; this involves exploring the space of mappings for sensible/convenient ones
- other avenue of research: a central limit theorem for random sets
- the central limit theorem and law(s) of large numbers have been generalised to imprecise probabilities (see Introduction to Imprecise Probabilities):
  - Larry G. Epstein and Kyoungwon Seo (Boston University) [2011]: A Central Limit Theorem for Belief Functions
  - Xiaomin Shi (Shandong University) [2015]: Central limit theorems for belief measures
The total belief theorem
The total belief theorem: Generalising total probability to belief functions
- a generalisation of total probability exists for Walley's imprecise probabilities: it is called marginal extension
- however, natural and marginal extensions are not closed operators in the space of belief functions: when applied to a random set, the result is not a random set

Theorem
Suppose Θ and Ω are two frames of discernment, and ρ : 2^Ω → 2^Θ the unique refining between them. Let Bel_0 be a belief function defined over Ω = {ω_1, ..., ω_|Ω|}. Suppose there exists a collection of belief functions Bel_i : 2^{Π_i} → [0, 1], where Π = {Π_1, ..., Π_|Ω|}, Π_i = ρ({ω_i}), is the partition of Θ induced by its coarsening Ω. Then there exists a belief function Bel : 2^Θ → [0, 1] such that:
1. Bel_0 is the restriction of Bel to Ω
2. Bel ⊕ Bel_{Π_i} = Bel_i for all i = 1, ..., |Ω|, where Bel_{Π_i} is the logical belief function with mass m_{Π_i}(A) = 1 if A = Π_i, 0 otherwise
The total belief theorem: Visual representation
[figure: pictorial representation of the total belief theorem]
Structure of the focal elements of the total belief function
- restricted total belief theorem: Bel_0 has only disjoint focal elements
[figure: pictorial representation of the structure of the focal elements of a total BF Bel lying in the image of a focal element of Bel_0 of cardinality 3]
Graph of solutions: Restricted total belief theorem
- potential solutions correspond to square linear systems, and form a graph whose nodes are linked by linear transformations of columns:
  e ↦ e′ = −e + Σ_{i∈C} e_i − Σ_{j∈S} e_j
  where C is a covering set for e (i.e., every component of e is covered by at least one of them), and S a set of selection columns
- at each transformation, the most negative component decreases
- general solution based on simplex-like optimisation?
Random set random variables
Random set random variables?
- ok, random sets are set-valued random variables; BUT can we actually build a random variable using as a basis a random set on Ω, instead of a probability measure there?
- as usual we need a mapping from Θ to a measurable space (e.g. the real line): f : Θ → R^+ = [0, +∞], where this time Θ is the co-domain of a multivalued mapping Γ : Ω → 2^Θ
- for a continuous random variable X we can compute its probability density function (PDF) as its Radon-Nikodym derivative, the measurable function p such that
  P[X ∈ A] = ∫_A p dµ
- can we compute a (generalised) PDF for a random set random variable?
Generalising the Radon-Nikodym derivative to random sets and capacities
- the Radon-Nikodym derivative for set functions was studied first by Harding et al [1997]
- Yann Rebille [2009] has also studied the problem: 'A Radon-Nikodym derivative for almost subadditive set functions'
- Graf has tackled the problem of defining the RND for capacities (rather than probability measures); see Molchanov's Theory of Random Sets
- assume capacities µ, ν are monotone, subadditive and continuous from below

Absolute continuity
A capacity ν is absolutely continuous with respect to another capacity µ if, for every A ∈ F, ν(A) = 0 whenever µ(A) = 0.
- same definition as for standard measures
- for standard measures it is equivalent to ν being an indefinite integral of µ
Generalising the Radon-Nikodym derivative: Strong decomposition
- for capacities (as opposed to probability measures), absolute continuity does not guarantee existence of a RN derivative
- consider the case of a finite Θ, |Θ| = n: then any measurable function f : Θ → R^+ is determined by just n numbers, which do not suffice to uniquely define a capacity on 2^Θ

Strong decomposition
The pair (µ, ν) has the strong decomposition property if ∀α ≥ 0 there exists a measurable set A_α ∈ F such that
α(ν(A) − ν(B)) ≤ µ(A) − µ(B) if B ⊂ A ⊂ A_α,
α(ν(A) − ν(A ∩ A_α)) ≥ µ(A) − µ(A ∩ A_α) ∀A.
- the condition says that the 'incremental ratio' of the two capacities is bounded in a certain sub-power set
- all standard measures meet the SDP
Generalising the Radon-Nikodym derivative: Radon-Nikodym theorem for capacities
For every two capacities µ and ν, ν is an indefinite integral of µ if and only if the pair (µ, ν) has the strong decomposition property and ν is absolutely continuous with respect to µ.
- open problem: interpreting the conditions of the theorem (which holds for general capacities) for completely alternating capacities (distributions of random closed sets)
- Molchanov: as a first step, note that the strong decomposition property for ν = T_X and µ = T_Y means that
  α P_X(F_A^B) ≤ P_Y(F_A^B) if B ⊂ A ⊂ A_α, and α P_X(F_A^{A∩A_α}) ≥ P_Y(F_A^{A∩A_α}) ∀A
  where F_A^B = {C ∈ F : B ⊂ C ⊂ A}
- Nguyen: a constructive approach to RN derivatives for capacities of random sets, similar to the one in constructive measure theory based on derivatives of set functions [Shilov]
A new machine learning
What's wrong with machine learning?
- new challenging real-world applications, such as:
  - smart cars navigating a complex, dynamic environment
  - robot surgical assistants capable of predicting the surgeon's needs
- existing theory and algorithms typically focus on fitting the observable outputs in the training data
- this may lead, for instance, an autonomous driving system to perform well on validation tests but fail catastrophically when tested in the real world
Towards robust machine learning
[figure: the unfortunate (but predictable) Tesla accident]
- we are unable to predict how a system will behave in a radically new setting (e.g., how does a smart car cope with driving through extreme weather conditions?)
- most systems have no way of detecting whether their underlying assumptions have been violated: they will happily continue to predict and act even on inputs that are completely outside the scope of what they have actually learned
- it is imperative to ensure that these algorithms behave predictably 'in the wild'
Vapnik's statistical learning theory: PAC learning
- classical statistical learning theory [Vapnik] contemplates 'generalisation' criteria which are based on a naive correlation between smoothness and generality
- it makes PAC predictions on the reliability of a training set which are based on simple quantities, such as the number of samples N
- generalisation problem: the training error is different from the expected generalisation error; in classification problems
  E_{x∼D}[δ(h(x) ≠ y(x))] ≠ (1/N) Σ_{n=1}^N δ(h(x_n) ≠ y(x_n))
  where the training data x = [x_1, ..., x_N] is assumed drawn from a distribution D, h(x) is the predicted label for input x and y(x) the actual label

Probably Approximately Correct learning
The learning algorithm finds with probability at least 1 − δ a model h ∈ H which is approximately correct, i.e. it makes an error of no more than ε.
Vapnik's statistical learning theory: PAC learning (continued)
- the main result of PAC learning is that we can relate the required size N of a training sample to the size of the model space H:
  log |H| ≤ εN − log(1/δ)
- so the minimum number of training examples given ε, δ and |H| is
  N ≥ (1/ε) ( log |H| + log(1/δ) )
- for infinite-dimensional hypothesis spaces H we need the notion of:

Vapnik-Chervonenkis Dimension
The VC dimension of H is the maximum number of points that can be successfully shattered by a hypothesis h ∈ H (i.e., they can be correctly classified by some h ∈ H for all possible binary labellings of these points).
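The finite-|H| bound above turns into a one-line sample-complexity calculator; a minimal sketch (the numbers are an illustrative example):

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Minimum N satisfying N >= (1/eps) * (log|H| + log(1/delta)),
    for a finite hypothesis space of size h_size."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2^20 boolean hypotheses, 5% error, 95% confidence
print(pac_sample_size(2**20, eps=0.05, delta=0.05))   # ~ 338 examples
```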
Vapnik's statistical learning theory: Example of VC dimension
- consider 4 points in R², with H the space of linear separators
- however we arrange the 4 points, there is a labelling that we cannot shatter (correctly reproduce); therefore the VC dimension of linear separators in R² is 3
Vapnik's statistical learning theory: Max margin SVMs
- the theory dramatically overestimates the number of training instances required
- pretty useless for model selection, for the bounds are too wide: people do cross-validation instead
- however, it provides the only justification for max-margin linear SVMs! for the space H_m of linear classifiers with margin m:
  VC_SVM = min{ D, 4R²/m² } + 1
  where R is the radius of the smallest hypersphere enclosing all the training data (and D the dimension of the input space)

Large margin classifiers
As the VC dimension of H_m decreases when m grows, it is desirable to select linear boundaries with max margin.
Imprecise-theoretical foundations for machine learning: A modest proposal
- issues with Vapnik's traditional statistical learning theory have been recently recognised by many researchers [Ermon, Liang, Weller]
- what about deep learning? nobody has a clue why it works, really
- approaches should provide worst-case guarantees: it is not possible to rule out completely unexpected behaviours or catastrophic failures
- Percy Liang's proposal: a new generation of ML algorithms which, rather than learning models that predict accurately on a target distribution, use minimax optimisation to learn models that are suitable for any target distribution within a 'safe' family
- the concept does evoke imprecise probability! minimax models similar to Liang's are naturally associated with convex sets of probabilities
Imprecise-theoretical foundations for machine learning: A modest proposal (continued)
- imprecise probabilities naturally arise whenever the data are insufficient to allow the estimation of a probability distribution
- training sets in virtually all applications of machine learning constitute a glaring example of data which is:
  - insufficient in quantity (think of a Google object detection routine trained on even a few million images, compared to the thousands of billions of images out there)
  - insufficient in quality (as they are selected based on criteria such as cost, availability or mental attitudes, therefore biasing the whole learning process)
- uncertainty theory may be able to provide worst-case, cautious predictions, delivering AI agents aware of their own limitations
- research programme: a generalisation of the concept of Probably Approximately Correct; where does the probability distribution of the data come from?
Climatic change models
Climate change: A Bayesian approach

Question
What is the probability that a doubling of atmospheric CO2 from pre-industrial levels will raise the global mean temperature by at least 2°C?
- the kind of question a policymaker might ask a climate scientist
- Rougier [2007] has very nicely outlined a Bayesian approach to climate modelling and prediction
- the predictive distribution for future climate is found by conditioning future climate on the observed values for historical and current climate; however:
  - in climate prediction, the collection of uncertain quantities for which the climate scientist must specify prior probabilities can be large
  - specifying a prior distribution per se is not the difficulty, but specifying a good one is
- people spend thousands of hours collecting climate data and constructing a climate model: why so little attention to quantifying our judgements about how these two are related?
Predicting future climate
- represent 'climate' as a vector of measurements collected at a given time
  - e.g. components: CO2 level, concentration on a grid, etc.
- climate: the vector y = (y_h, y_f) collecting historical and present (y_h) and future (y_f) climate values
- measurement error e: z = y_h + e
  - e.g. a seasick technician, atmospheric turbulence

Assumption 1
Climate and measurement error are independent: e ⊥ y.

Assumption 2
Measurement error is Gaussian distributed: e ∼ N(0, Σ_e).
- predictive distribution of climate given measured values z = z̃:
  p(y|z = z̃) ∝ n(z̃ − y_h|0, Σ_e) p(y)
- we need to specify a prior distribution for the climate y
Climate models as models of the prior
- the choice of the prior p(y) is challenging, both because y is such a large collection of quantities, and because these quantities are linked by complex interdependencies, such as those arising from laws of nature
- the role of the climate model is to induce a distribution for the climate; it plays the role of a parametric model in statistical inference
- what's a climate model anyway? a deterministic mapping from a collection of parameters (equation coefficients, initial conditions, forcing functions) to a vector of measurements (our 'climate'): x → y = g(x), where g belongs to a 'model space' G
- model evaluation: the actual value g(x) computed for some parameter value x
- a climate scientist considers, on a priori grounds, that some choices of x are better than others, i.e. there exists x* such that y = g(x*) + ε*, where ε* is the model 'discrepancy'
Prediction with parametric model (1)
- the difference between the climate itself and model evaluations has two parts:
  y − g(x) = [g(x*) − g(x)] + ε*
- the first part is a contribution that may be reduced by a better choice of the model g; the second is an irreducible contribution that arises from the model's imperfections
- x* is not just a statistical parameter though, for it relates to physical quantities, so that climate scientists have a clear intuition of its effects
  - scientists may be able to provide a prior p(x*) on the input parameters

Assumption 3
'Best' input, discrepancy, and measurement error are mutually independent: x* ⊥ ε* ⊥ e
Prediction with parametric model (2)

Assumption 4
Discrepancy ε* is Gaussian distributed with mean 0 and covariance Σ_ε.
- assumptions 3 and 4 allow us to compute the desired climate prior as
  p(y) = ∫ n(y − g(x*)|0, Σ_ε) p(x*) dx*
- in practice, the climate model function g(·) is not known; we only know a sample of model evaluations {g(x_1), ..., g(x_n)}
- model validation: tuning the covariances Σ_ε, Σ_e, and checking the validity of the Gaussianity assumptions
  - can be done by using the model to predict past/present climates p(z), and applying some hypothesis testing: if the observed value z̃ is in the tail of the distribution, you have a problem
- as Rougier admits, responding to bad validation results is not straightforward
Model calibration
- assuming the model has been validated, it needs to be calibrated: find the desired 'best' value x* of the model parameters
- indeed, under the above assumptions we can compute p(x* | z = z̃) ∝ p(z = z̃ | x*) p(x*) = n(z̃ − g(x*) | 0, Σ_ε + Σ_e) p(x*)
- as we know, MAP estimation could be applied, with the usual danger posed by multiple modes
- apparently, climate scientists are not very happy with having a PDF over the parameters instead!
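A one-dimensional toy sketch of this calibration step (g, the prior and the variance are all placeholders assumed for illustration):

```python
# Sketch of calibration: evaluate the unnormalised posterior
#   p(x* | z = z_tilde) proportional to n(z_tilde - g(x*) | 0, Sigma_eps + Sigma_e) p(x*)
# on a grid and read off the MAP estimate. Everything here is a 1-d toy.
import numpy as np
from scipy.stats import norm

sigma2 = 0.3                           # Sigma_eps + Sigma_e, scalar toy value

def g(x):
    return np.sin(x)                   # toy 1-d 'climate model'

z_tilde = 0.4
xs = np.linspace(-3.0, 3.0, 601)
prior = norm.pdf(xs, loc=0.0, scale=1.0)                # placeholder p(x*)
post = norm.pdf(z_tilde - g(xs), scale=np.sqrt(sigma2)) * prior
x_map = xs[np.argmax(post)]            # MAP estimate; beware multiple modes
print(x_map)
```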
Bayesian posterior prediction
- by full Bayesian inference we can instead compute p(y_f | z = z̃) = ∫ p(y_f | x*, z = z̃) p(x* | z = z̃) dx*, where p(y_f | x*, z = z̃) is Gaussian with a mean that depends on z̃ − g(x)
- this highlights two routes by which climate data impact future climate predictions:
  1. they concentrate the distribution p(x* | z = z̃) relative to the prior p(x*), depending on both the quantity and the quality of the climate data
  2. a large difference z̃ − g(x) shifts the mean of p(y_f | x*, z = z̃) away from g(x)
Role of model evaluations
- go back to the initial question: what is the probability that a doubling of atmospheric CO2 will raise the global mean temperature by at least 2°C?
- let Q ⊂ Y be the set of climates y for which the global mean temperature is at least 2°C higher in 2100
- the probability is computed by integration: Pr(y_f ∈ Q | z = z̃) = ∫ f(x*) p(x* | z = z̃) dx*, where f(x) = ∫_Q n(y_f | μ(x), Σ) dy_f can be computed directly
- the remaining integral requires numerical integration, e.g. (a toy sketch of the estimators follows below):
  - naive Monte Carlo: Pr(y_f ∈ Q | z = z̃) ≈ (1/n) Σ_{i=1}^n f(x_i), with x_i ∼ p(x* | z = z̃)
  - weighted sampling: Pr(y_f ∈ Q | z = z̃) ≈ Σ_{i=1}^n w_i f(x_i), with x_i ∼ p(x*), weighted by the likelihood: w_i ∝ p(z = z̃ | x* = x_i)
- sophisticated models which take a long time to evaluate may not provide enough samples for the prediction to be statistically significant, albeit they may make the prior p(x*) and the covariance Σ_ε easier to specify
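A toy sketch contrasting the two estimators above; f, the prior and the likelihood are made-up stand-ins (in the real setting f(x) integrates a Gaussian over the region Q):

```python
# Toy contrast of the two estimators of Pr(y_f in Q | z = z_tilde).
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # stand-in for f(x) = integral over Q of n(y_f | mu(x), Sigma) dy_f
    return 1.0 / (1.0 + np.exp(-x))

def likelihood(x, z_tilde=0.5):
    # stand-in for p(z = z_tilde | x* = x)
    return np.exp(-0.5 * (z_tilde - x) ** 2)

n = 10_000
xs = rng.normal(size=n)              # x_i ~ p(x*): here a standard-normal prior

# weighted sampling: importance weights proportional to the likelihood
w = likelihood(xs)
w /= w.sum()
weighted_estimate = np.sum(w * f(xs))

# naive Monte Carlo would instead average f over samples drawn directly from
# the posterior p(x* | z = z_tilde), which we do not have in closed form here;
# the weighted estimator sidesteps that requirement.
print(weighted_estimate)
```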
Issues with Bayesian prediction
- there are a number of issues with making climate inferences in the Bayesian framework
- lots of assumptions are necessary (e.g. Gaussianity), most of them made to keep the calculations practical rather than anything else
- although the prior on climates is reduced to a prior on the parameters of a climate model, there is no obvious way of picking p(x*): it is far easier to say which choices are wrong (e.g. uniform priors)
- significant parameter tuning is required (e.g. for Σ_ε, Σ_e, ...)
Modelling climate with belief functions
Quite a lot of work remains to be done, but a few landmarks:
- avoid committing to priors p(x*) on the correct climate model parameters
- use the climate model as a parametric model to infer either a BF on the space of climates Y, or a BF on the space of parameters (e.g. covariances) of the distribution on Y
New horizons
A geometry of uncertainty
A geometric approach to the theory of evidence
- the collection B of all vectors b = [Bel(A), ∅ ⊊ A ⊊ Ω]′ representing a belief function on Ω is a simplex (in rough words, a higher-dimensional triangle): the belief space B = Cl(b_A, ∅ ⊊ A ⊆ Ω), the convex closure of (the vectors of) all 'logical' BFs b_A
- alternatively, we can adopt mass vectors m_b = [m_b(A), ∅ ⊊ A ⊆ Ω]′, living in a mass space M = Cl(m_A, ∅ ⊊ A ⊆ Ω)
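As a toy illustration of these coordinates (the frame size and the mass assignment are made up for the example), the belief vector can be computed from a mass vector via Bel(A) = Σ_{B⊆A} m(B), with subsets of Ω encoded as bitmasks:

```python
# Toy sketch: computing the belief vector b from a mass vector m_b,
# Bel(A) = sum over B subset of A of m(B). Subsets of Omega are bitmasks.
n = 3                                       # |Omega| = 3
m = {0b001: 0.5, 0b011: 0.3, 0b111: 0.2}    # toy mass assignment (sums to 1)

def bel(A):
    # B is a subset of A iff B has no bits outside A
    return sum(v for B, v in m.items() if B & ~A == 0)

# coordinates of b, one per non-empty A; the last one, Bel(Omega) = 1,
# is constant and is usually dropped from the belief-space coordinates
b = [bel(A) for A in range(1, 2 ** n)]
print(b)
```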
Binary example
[Figure: the simplex of belief functions on a frame of size 2]
- belief/mass space B_2 = M_2 for a binary frame
- the set of probabilities is a face of the simplex (a triangle)
- the region of consonant BFs is a 'simplicial complex': CO = ⋃_{x∈Ω} Cl(b_A, A ∋ x)
Bundle structure of the belief space
- a fiber bundle is a generalisation of the Cartesian product: a space is decomposed into a base space and fibers, each of which projects onto a point of the base space
- the belief space has a recursive bundle structure
- rationale: the mass associated with a belief function can be recursively assigned to subsets (focal elements) of increasing size
Geometry of Dempster's rule: conditional subspaces
- Dempster's rule behaves as follows w.r.t. affine combination:
  b ⊕ (Σ_i α_i b_i) = Σ_i β_i (b ⊕ b_i),   β_i = α_i κ(b, b_i) / Σ_{j=1}^n α_j κ(b, b_j),
  where κ(b, b_i) is the usual Dempster conflict
- convex closure (Cl) and ⊕ commute in the belief space: b ⊕ Cl(b_1, ..., b_n) = Cl(b ⊕ b_1, ..., b ⊕ b_n)
- the conditional subspace ⟨b⟩, i.e. the set of all BFs (Dempster-)conditioned by b,
  ⟨b⟩ ≐ {b ⊕ b′, ∀b′ ∈ B s.t. b ⊕ b′ exists},
  is the convex closure ⟨b⟩ = Cl(b ⊕ b_A, ∀A ⊆ C_b), where C_b is the core of b
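A minimal sketch of Dempster's rule itself (toy masses, subsets again encoded as bitmasks), making the role of the conflict κ explicit:

```python
# Sketch of Dempster's rule of combination on a small frame.
# Returns the combined mass function and the conflict kappa.
def dempster(m1, m2):
    combined, kappa = {}, 0.0
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            A = B & C                       # set intersection of focal elements
            if A == 0:
                kappa += v1 * v2            # mass assigned to conflicting pairs
            else:
                combined[A] = combined.get(A, 0.0) + v1 * v2
    if kappa == 1.0:
        raise ValueError("totally conflicting evidence: combination undefined")
    return {A: v / (1.0 - kappa) for A, v in combined.items()}, kappa

# toy example on Omega = {x, y}, encoded as bits 0b01 and 0b10
m1 = {0b01: 0.7, 0b11: 0.3}
m2 = {0b10: 0.4, 0b11: 0.6}
m12, kappa = dempster(m1, m2)
print(m12, kappa)        # kappa = 0.28; masses renormalised by 1 - kappa
```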
Geometry of Dempster's rule: geometric construction
- the pointwise behavior of ⊕ depends on the notions of constant mass locus [Cuzzolin, 2004] and of the foci {F_x, x ∈ Ω} of a conditional subspace
Geometry of combination: future agenda
- the other main combination operators remain to be understood:
  - Yager's rule
  - Dubois and Prade's rule
  - conjunctive and disjunctive rules
  - cautious and bold rules
  - Josang's consensus
  - Murphy's averaging
  - Deng's distance-based rule
- this would visualise a 'cone' of possible future belief states under stronger or weaker assumptions
- can we also do inference by geometric methods? it is necessary to represent data and uncertainty measures in the same space
Conditioning by geometric methods: the conditioning simplex
- each conditioning event A is associated with a conditional simplex B_A in the belief space: B_A ≐ Cl(b_B, ∅ ⊊ B ⊆ A)
- we can therefore define the geometric conditional belief function induced by a distance function d: the BF(s) b_d(·|A) which minimise the distance d(b, B_A)
Conditioning in the mass space: conditioning by geometric means
- the L1 conditional BFs given A are all those BFs with core contained in A and masses dominating m(B) on all subsets B of A; geometrically, they form a polytope
- the L2 conditional belief function is the unique mass function that redistributes, in equal shares to each and every subset B of A, the mass originally assigned to focal elements not included in A (see the sketch below)
- geometrically, it coincides with the center of mass of the polytope of L1 conditional BFs
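A sketch of the L2 conditioning rule just described, assuming the redistribution goes to the non-empty subsets of A; the frame and masses are toy choices:

```python
# Sketch of the L2 conditional mass function: the mass of focal elements not
# included in A is redistributed in equal shares to every non-empty subset
# of A. Subsets are bitmasks; the mass assignment is a toy example.
def l2_condition(m, A):
    inside = {B: v for B, v in m.items() if B & ~A == 0}
    outside_mass = sum(v for B, v in m.items() if B & ~A != 0)
    subsets_A = [B for B in range(1, 1 << A.bit_length()) if B & ~A == 0]
    share = outside_mass / len(subsets_A)
    return {B: inside.get(B, 0.0) + share for B in subsets_A}

m = {0b001: 0.4, 0b110: 0.35, 0b111: 0.25}   # toy masses on Omega = {x1,x2,x3}
print(l2_condition(m, 0b011))                # condition on A = {x1, x2}
```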
General imaging interpretation of conditioning in the mass space
- geometric conditional BFs in M possess an interpretation in terms of general imaging in belief revision [Lewis, Gärdenfors]
- upon observing the impossibility of a certain outcome, one should re-assign its probability (mass) to the 'closest' remaining state
- but what if there is no reason to consider any remaining state as the closest?
  - we can represent such ignorance as a vacuous BF on the set of 'weights' of the remaining states: this induces the polytope of L1 conditional BFs!
  - or, we can represent such ignorance as a uniform probability distribution on the weights: this induces the L2 conditional BF!
Geometry of uncertainty: future agenda
- geometry of combination:
  - what about the geometry of the other combination rules, in particular the conjunctive and disjunctive rules ∩ and ∪?
  - what is the geometry of the 'tubes' of BFs we get using ∩ and ∪?
  - inversion of combination results via geometric means
- geometry of conditioning:
  - what happens when we plug in different norms? [Jousselme et al.]
  - is geometric conditioning a general encompassing framework for conditioning in belief calculus?
  - the other main conditioning operators remain to be understood, e.g.:
    - lower and upper envelopes
    - Suppes' 'geometric' conditioning
    - Smets' unnormalised conditioning
Geometry of uncertainty: future agenda (continued)
- geometric inference: can we represent data (samples) and the uncertainty measures they induce in the same space?
  - what norm is it appropriate to minimise for inference purposes?
- geometry of convex sets of belief functions:
  - we saw they pop up all the time when reasoning or making inferences
- geometry of belief functions on the reals:
  - Borel intervals
  - random sets
- fancier geometries:
  - belief functions as projections of convex bodies
  - belief functions as spinors? exterior algebras
Belief functions as projections of convex bodies (fancier geometries)
- convex bodies are a fascinating field of study
- for a convex body in R^n there are 2^n orthogonal projections onto the subspaces generated by sets of coordinate axes, related to the notion of Grassmann manifold
- under the condition that the areas of these projections are normalised, a convex body can be seen as a belief function
Unified geometry of uncertainty: geometry of possibility
- the geometry of consonant belief functions needs the notion of a simplicial complex: a collection Σ of simplices such that
  1. if a simplex belongs to Σ, then all its faces belong to Σ
  2. the intersection of any two simplices is a face of both
- the region of consistent BFs is a simplicial complex: CO = ⋃_{x∈Θ} Cl(b_A, A ∋ x)
Unified geometry of uncertainty: future agenda
- what about all the other uncertainty measures of the hierarchy? most of them are not special cases of belief functions (in fact, they are more general)
- we need to extend the geometric space to encapsulate the most general such representation (arguably, imprecise probabilities)
- intermediate steps: the geometry of monotone capacities (in particular 2-monotone capacities and probability intervals)
- most fascinating: the geometry of sets of desirable gambles
Summarising
A summary of what we have learned in this tutorial
- the theory of belief functions is a modelling language for representing elementary items of evidence and combining them, in order to form a representation of our beliefs about certain aspects of the world
- it is relatively simple to implement and has been successfully applied
- it is grounded in the beautiful mathematics of random sets
- it has strong relationships with the other theories of uncertainty
- belief functions have interesting mathematical properties in terms of geometry, algebra and combinatorics
- evidential reasoning can be implemented even for very large spaces and numerous pieces of evidence, because:
  - elementary items of evidence induce simple belief functions, which can be combined very efficiently
  - the most plausible hypothesis can be found without computing the whole combined belief function
  - Monte Carlo approximations are easily implementable
  - local propagation schemes allow parallelisation
A summary of what we have learned in this tutorial (continued)
- statistical evidence may be represented in several ways:
  - by likelihood-based belief functions, generalising both likelihood-based and Bayesian inference
  - by Dempster's idea of using auxiliary variables
  - in the framework of the Generalised Bayesian Theorem
- propagation on graphical models can be performed
- decision-making strategies based on intervals of expected utilities can be formulated that are more cautious than traditional ones
- the extension to continuous domains can be tackled via the Borel interval representation, and in the more general case via the theory of random sets
- a toolbox of estimation, classification and regression tools based on the theory of belief functions is available
Recent trends in the theory and application of belief functions
- in 2014 alone, almost 1200 papers were published on belief functions
- new applications are gaining ground beyond sensor fusion and expert systems: earth sciences, telecoms, etc.
Publication venues
- conferences on the theory of uncertainty:
  - BFAS's International Conference on Belief Functions (BELIEF)
  - Uncertainty in Artificial Intelligence (UAI)
  - International Conference on Information Fusion (FUSION)
  - International Symposium on Imprecise Probability: Theories and Applications (ISIPTA)
  - Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU)
  - IEEE Systems, Man and Cybernetics (SMC)
  - Information Processing and Management of Uncertainty (IPMU)
- journals (for theoretical contributions):
  - International Journal of Approximate Reasoning (IJAR)
  - IEEE Transactions on Fuzzy Systems (I.F. 6.306)
  - IEEE Transactions on Cybernetics (I.F. 3.781)
  - Artificial Intelligence
  - Information Sciences (I.F. 4.038)
  - Fuzzy Sets and Systems
What still needs to be resolved
- clarify once and for all the epistemic interpretation of belief function theory → random variables for set-valued observations
- the mechanism for evidence combination is still debated, as it depends on meta-information about the sources that is hardly accessible → working with intervals of belief functions may be the way forward: it acknowledges the meta-uncertainty on the nature of the sources generating the evidence
- the same holds for conditioning (as we showed)
- what about computational complexity? → not really an issue: just apply sampling for approximate inference; we do not need to assign mass to all subsets, but we do need to be allowed to do so when necessary (e.g. with missing data)
- belief functions on the reals → Borel intervals are nice, but the way forward is grounding the theory in the mathematics of random sets
Future of random set/belief function theory
- full development of random set graphical models:
  - merge the two lines of research, (1) belief functions on graphical models and (2) evidential networks
- further development of machine learning tools:
  - random set random forests
  - tackling current trends such as transfer learning and deep learning
- a fully developed theory of statistical inference with random sets:
  - generalised likelihood, logistic regression
  - limit theorems and total probability for random sets
  - random set random variables and processes
  - frequentist inference with random sets
- propose solutions to high-impact problems:
  - rare event prediction
  - robust foundations for machine learning
  - robust climate change predictions
- mathematics and geometry of random sets and other uncertainty measures
For Further Reading
- papers and Matlab software available at: https://www.hds.utc.fr/~tdenoeux
- Belief Functions Encyclopedia: http://cms.brookes.ac.uk/staff/FabioCuzzolin
- these slides are available online at: http://cms.brookes.ac.uk/staff/FabioCuzzolin/files/IJCAI2016.pdf
THANK YOU!
Appendix
For Further Reading
For Further Reading I
- G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976.
- F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing, 2014.
- F. Cuzzolin (Ed.). Belief functions: theory and applications. LNCS Vol. 8764, Springer, 2014.
For Further Reading II
- F. Cuzzolin. The geometry of uncertainty: the geometry of imprecise probabilities. Springer-Verlag (upcoming).
- F. Cuzzolin. Fifty years of belief functions: theory. IEEE Transactions on Fuzzy Systems (in preparation), 2017.
- F. Cuzzolin and C. Sengul. Fifty years of belief functions: applications. International Journal of Approximate Reasoning (in preparation), 2017.