Random sets at the interface of statistics and AI
Fifth Bayesian, Fiducial, and Frequentist (BFF5) Conference
Prof Fabio Cuzzolin
School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK
Ann Arbor, MI, May 7 2018
Uncertainty
Second-order uncertainty
Orders of uncertainty
the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability
second-order uncertainty: being uncertain about our very model of uncertainty
this has consequences for human behaviour: people are averse to unpredictable variation (as in Ellsberg’s paradox)
how good are Bayesian and frequentist probability at modelling second-order uncertainty?
Fisher has not got it all right
the setting of frequentist hypothesis testing is (arguably) arguable:
- the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals)
- the criterion is arbitrary: who decides what an ‘extreme’ realisation is (the choice of α)? what is so special about 0.05 and 0.01?
- the whole ‘tail’ idea stems from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero – it looks like a patch for an underlying problem with the way probability is mathematically defined
- it cannot cope with pure data, without assumptions on the process (experiment) which generated them
The problem(s) with Bayes
pretty bad at representing ignorance:
- Jeffreys’ uninformative priors are just not good enough
- they give different results on different parameter spaces
Bayes’ rule assumes the new evidence comes in the form of certainty: “A is true”
- in the real world this is often not the case (‘uncertain’ or ‘vague’ evidence)
beware the prior! → model selection in Bayesian statistics
- it results from a confusion between the original subjective interpretation and the objectivist view of a rigorous, objective procedure
- why should we ‘pick’ a prior? either there is prior knowledge (beliefs) or there is not
- all will be fine, in the end! (Bernstein-von Mises theorem): asymptotically, the choice of the prior does not matter (really!)
Set-valued observations
The die as random variable
[Figure: a die; each face facei of the sample space is mapped to the corresponding number i on the real line X]
a die is a simple example of a (discrete) random variable
there is a probability space Ω = {face1, face2, ..., face6}, which maps to a real number: 1, 2, ..., 6 (no need for measurability here)
now, imagine that face1 and face2 are cloaked, and we roll the die
The cloaked die: set-valued observations
[Figure: the same die with face1 and face2 cloaked; both faces are now mapped to the set {1, 2} on X]
the same probability space Ω = {face1, face2, ..., face6} is still there (nothing has changed in the way the die works)
however, the mapping is now different: both face1 and face2 are mapped to the set of possible values {1, 2} (since we cannot observe the outcome)
this is a random set [Matheron, Kendall, Nguyen, Molchanov]: a set-valued random variable
whenever data are missing, observations are inherently set-valued
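As an illustration (not part of the original slides), here is a minimal Python sketch of the cloaked die as a random set: the multivalued mapping Γ sends the two cloaked faces to the set {1, 2}, and the uniform distribution on Ω induces a mass assignment on subsets of the domain.

```python
# Illustrative sketch (not from the slides): the cloaked die as a random set.
# Each element of Omega maps to a *subset* of X = {1,...,6}; the uniform
# probability on Omega induces a mass assignment on subsets of X.
from collections import defaultdict
from fractions import Fraction

omega = [f"face{i}" for i in range(1, 7)]
P = {w: Fraction(1, 6) for w in omega}                      # fair die

# multivalued mapping Gamma: the cloaked faces 1 and 2 both map to {1, 2}
Gamma = {"face1": frozenset({1, 2}), "face2": frozenset({1, 2}),
         "face3": frozenset({3}), "face4": frozenset({4}),
         "face5": frozenset({5}), "face6": frozenset({6})}

m = defaultdict(Fraction)                                   # induced mass assignment on 2^X
for w, A in Gamma.items():
    m[A] += P[w]

for A, mass in sorted(m.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(A), mass)                                  # {1,2}: 1/3, {3}...{6}: 1/6 each
```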
Belief functions
Random set definition
Dempster’s multivalued mappings
Dempster’s work formalises random sets via multivalued (one-to-many) mappings Γ from a probability space (Ω, F, P) to the domain of interest Θ
[Figure: Shafer’s trial example – Ω = {drunk (0.2), not drunk (0.8)} is mapped to subsets of Θ = {Mary, Peter, John}]
the example is taken from a famous ‘trial’ example [Shafer]
elements of Ω are mapped to subsets of Θ: once again, this is a random set
- in the example, Γ maps {not drunk} ∈ Ω to {Peter, John} ⊂ Θ
the probability distribution P on Ω induces a mass assignment m : 2^Θ → [0, 1] on the power set 2^Θ = {A ⊆ Θ} via the multivalued mapping Γ : Ω → 2^Θ
Belief and plausibility
Belief and plausibility measures
the belief in A is the probability that the evidence implies A:
Bel(A) = P({ω ∈ Ω | Γ(ω) ⊆ A}) = Σ_{B ⊆ A} m(B)
the plausibility of A is the probability that the evidence does not contradict A:
Pl(A) = P({ω ∈ Ω | Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Ā)
these were originally termed lower and upper probabilities by Dempster
belief and plausibility values can (although this is disputed) be interpreted as lower and upper bounds on the values of an unknown, underlying probability measure: Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ
belief measures include probability measures as a special case: what then replaces Bayes’ rule? a shift from conditioning (on certain events) to combination (of pieces of evidence)
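A small sketch (illustrative, not from the slides) of how belief and plausibility follow from a mass assignment; the mass values are the ones induced by the cloaked-die example above.

```python
# Illustrative sketch (not from the slides): belief and plausibility of the cloaked die.
from fractions import Fraction

X = frozenset({1, 2, 3, 4, 5, 6})
m = {frozenset({1, 2}): Fraction(1, 3),                     # mass induced by the cloaked die
     frozenset({3}): Fraction(1, 6), frozenset({4}): Fraction(1, 6),
     frozenset({5}): Fraction(1, 6), frozenset({6}): Fraction(1, 6)}

def bel(A):
    """Bel(A): total mass of focal elements contained in A (evidence implying A)."""
    return sum(v for B, v in m.items() if B <= A)

def pl(A):
    """Pl(A): total mass of focal elements intersecting A (evidence not contradicting A)."""
    return sum(v for B, v in m.items() if B & A)

A = frozenset({1, 3})
print(bel(A), pl(A))                                        # 1/6 and 1/2
assert pl(A) == 1 - bel(X - A)                              # duality Pl(A) = 1 - Bel(complement of A)
```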
Dempster’s combination
Dempster’s combination
[Figure: the trial example with a second source of evidence, Ω2 = {cleaned (0.6), not cleaned (0.4)}, mapped to subsets of Θ = {Mary, Peter, John}]
new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime
the assumption is that pairs of outcomes ω1 ∈ Ω1 and ω2 ∈ Ω2 in the source spaces support the intersection of their images in 2^Θ: θ ∈ Γ1(ω1) ∩ Γ2(ω2)
if this is done independently, the probability that the pair (ω1, ω2) is selected is P1({ω1}) P2({ω2}), yielding Dempster’s rule of combination:
(m1 ⊕ m2)(A) = (1 / (1 − κ)) Σ_{B ∩ C = A} m1(B) m2(C),  for all ∅ ≠ A ⊆ Θ,
where κ = Σ_{B ∩ C = ∅} m1(B) m2(C) is the mass of conflict
Bayes’ rule is a special case of Dempster’s rule
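A minimal sketch of Dempster's rule on a finite frame; the first mass assignment follows the trial example above, while the second is a made-up second source used purely for illustration.

```python
# Illustrative sketch (not from the slides): Dempster's rule on a finite frame.
from fractions import Fraction

def dempster_combine(m1, m2):
    """Combine two mass assignments (dicts frozenset -> mass) by Dempster's rule."""
    joint, conflict = {}, Fraction(0)
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            A = B & C
            if A:
                joint[A] = joint.get(A, Fraction(0)) + v1 * v2
            else:
                conflict += v1 * v2                         # mass kappa falling on the empty set
    return {A: v / (1 - conflict) for A, v in joint.items()}

Theta = frozenset({"Mary", "Peter", "John"})
m1 = {frozenset({"Peter", "John"}): Fraction(8, 10), Theta: Fraction(2, 10)}  # witness testimony
m2 = {frozenset({"Mary"}): Fraction(6, 10), Theta: Fraction(4, 10)}           # made-up second source
for A, v in dempster_combine(m1, m2).items():
    print(sorted(A), v)
```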
Semantics
Semantics of belief functions: modelling second-order uncertainty
[Figure: the probability simplex over a three-element frame {x, y, z}, with the credal sets induced by two belief functions drawn as convex regions inside it]
belief functions have multiple interpretations:
- as set-valued random variables (random sets)
- as (completely monotone) capacities (functions from the power set to [0, 1])
- as a special class of credal sets (convex sets of probability distributions) [Levi, Kyburg]
as such, they are a very expressive means of modelling uncertainty about the model itself, due to lack of data quantity or quality, or both
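The credal-set reading can be checked numerically: the sketch below (illustrative, with a made-up mass assignment on a three-element frame) redistributes each focal mass among its elements and verifies that every resulting probability distribution dominates Bel.

```python
# Illustrative sketch (not from the slides): the credal-set reading of a belief function.
# Sending each focal mass m(B) to any one of the elements of B yields a probability
# distribution p with p(A) >= Bel(A) for every event A (an extreme point of the credal set).
from itertools import product
from fractions import Fraction

Theta = ("x", "y", "z")
m = {frozenset({"x"}): Fraction(1, 2),
     frozenset({"y", "z"}): Fraction(3, 10),
     frozenset(Theta): Fraction(1, 5)}                      # made-up mass assignment

def bel(A):
    return sum(v for B, v in m.items() if B <= A)

events = [frozenset(s) for s in (("x",), ("y",), ("z",), ("x", "y"), ("x", "z"), ("y", "z"))]

for choice in product(*[sorted(B) for B in m]):             # one recipient element per focal set
    p = {t: Fraction(0) for t in Theta}
    for (B, v), elem in zip(m.items(), choice):
        p[elem] += v
    assert all(sum(p[t] for t in A) >= bel(A) for A in events)
print("every extreme allocation dominates Bel, as expected")
```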
Rare events
Rare events and second-order uncertainty
What’s a rare event?
what is a ‘rare’ event? clearly we are interested in them because they are not so rare, after all!
examples of rare events, also called ‘tail risks’ or ‘black swans’, are: volcanic eruptions, meteor impacts, financial crashes ...
mathematically, an event is ‘rare’ when it covers a region of the hypothesis space which is seldom sampled – it is an issue with the quality of the sample
Rare events and second-order uncertainty
probability distributions for the system’s behaviour are built in ‘normal’ times (e.g. while a nuclear plant is working just fine), then used to extrapolate results at the ‘tail’ of the distribution
[Figure: a logistic curve P(Y=1|x), with training samples concentrated far from the ‘rare’ event region]
popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events
Harvard’s G. King [2001] has proposed corrections based on oversampling the ‘rare’ events w.r.t. the ‘normal’ ones
the issue is really one with the reliability of the model! we need to explicitly model second-order uncertainty
belief functions can be employed to model this uncertainty: rare events are a form of lack of information in certain regions of the sample space
how do we infer belief functions from sample data?
Statistical inference with belief functions
Likelihood-based inference
Inference from classical likelihood [Shafer76, Denoeux]
consider a statistical model L(θ; x) = f(x|θ), x ∈ X, θ ∈ Θ, where X is the sample space and Θ the parameter space
Bel_Θ(·|x) is the consonant belief function (with nested focal elements) whose plausibility of the singletons equals the normalised likelihood:
pl(θ|x) = L(θ; x) / sup_{θ′ ∈ Θ} L(θ′; x)
this is compatible with the likelihood principle
it takes the empirical normalised likelihood to be the upper bound on the probability density of the sought parameter (rather than the actual PDF)
the corresponding plausibility function is Pl_Θ(A|x) = sup_{θ ∈ A} pl(θ|x)
the plausibility of a composite hypothesis A ⊂ Θ,
Pl_Θ(A|x) = sup_{θ ∈ A} L(θ; x) / sup_{θ ∈ Θ} L(θ; x),
is the usual likelihood ratio statistic
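A short sketch (illustrative, with a hypothetical Bernoulli sample) of this construction: the contour function pl(θ|x) is the likelihood normalised by its supremum, and the plausibility of a composite hypothesis is the supremum of pl over it.

```python
# Illustrative sketch (not from the slides): likelihood-based belief inference for a
# Bernoulli parameter theta, with a hypothetical sample of k = 5 successes in n = 8 trials.
import numpy as np

k, n = 5, 8
theta = np.linspace(1e-6, 1 - 1e-6, 1001)
loglik = k * np.log(theta) + (n - k) * np.log(1 - theta)
pl_theta = np.exp(loglik - loglik.max())                    # pl(theta|x) = L(theta;x) / sup L

# plausibility of the composite hypothesis A = {theta <= 0.5}: sup of pl over A
A = theta <= 0.5
print("Pl(A|x) =", pl_theta[A].max())                       # the usual likelihood ratio statistic
```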
Belief likelihood function
Belief likelihood function: generalising the sample likelihood [Cuzzolin UAI’18, u/r]
a different take: instead of using the conventional likelihood to build a belief function, can we define a belief likelihood function of a sample x ∈ X?
the traditional likelihood function is a conditional probability of the data given a parameter θ ∈ Θ, i.e. a family of PDFs over X parameterised by θ
it is natural to define a belief (set-)likelihood function as a family of belief functions on X, Bel_X(·|θ), parameterised by θ ∈ Θ
note that a belief likelihood takes values on sets of outcomes – individual outcomes are a special case
this is a natural setting for computing likelihoods of set-valued observations, such as those which naturally arise when data are missing
coherent with the random-set philosophy
Belief likelihood function: multivariate analysis
what can we say about the belief likelihood function of a series of trials?
observations are tuples x = (x1, ..., xn) ∈ X1 × ··· × Xn, where Xi = X denotes the space of quantities observed at time i
by definition, the belief likelihood function is Bel_{X1×···×Xn}(A|θ), where A is any subset of X1 × ··· × Xn

Belief likelihood function of repeated trials
Bel_{X1×···×Xn}(A|θ) ≐ (Bel_{X1}↑ ⊛ ··· ⊛ Bel_{Xn}↑)(A|θ)
here ⊛ is an arbitrary combination rule (Dempster’s, conjunctive, disjunctive, ...)
Bel_{Xj}↑ is the vacuous extension of Bel_{Xj} to the Cartesian product X1 × ··· × Xn where the observed tuples live
- (the vacuous extension assigns the mass of B ⊆ Xj to its cylinder B × ∏_{i ≠ j} Xi)
Belief likelihood function for ‘sharp’ samples
can we reduce this to the belief values of the individual trials?
yes, if we wish to compute likelihood values of tuples of individual outcomes x = (x1, ..., xn), rather than arbitrary subsets of X1 × ··· × Xn
it then makes sense to call the following lower and upper likelihoods

Lower and upper likelihoods of a sample x = (x1, ..., xn)
When using either the conjunctive rule ∩ or Dempster’s ⊕ as the combination rule in the definition of the belief likelihood function, the following factorisations hold:
L̲(x) ≐ Bel_{X1×···×Xn}({(x1, ..., xn)}|θ) = ∏_{i=1}^n Bel_{Xi}({xi})
L̄(x) ≐ Pl_{X1×···×Xn}({(x1, ..., xn)}|θ) = ∏_{i=1}^n Pl_{Xi}({xi})

the second result holds under conditional conjunctive independence [Smets]
similar regularities hold when using the more cautious disjunctive combination ∪
the top decomposition also holds for Cartesian products of subsets of the Xi
Lower and upper likelihoods
Lower and upper likelihoods (Bernoulli trials)
Bernoulli trials example: Xi = X = {1, 0}, iid random variables
under conditional independence and equidistribution, the traditional likelihood for a series of Bernoulli trials reads p^k (1 − p)^{n−k}, where k is the number of successes (1s) and n the number of trials
let us compute the belief likelihood function for Bernoulli trials!
we seek the belief function on X = {1, 0}, parameterised by p = m({1}), q = m({0}) (with p + q ≤ 1 this time), which best describes the observed sample
applying the previous result, since all the Bel_i are equally distributed, the lower and upper likelihoods of the sample x = (x1, ..., xn) are:
L̲(x) = Bel_X({x1}) · ··· · Bel_X({xn}) = p^k q^{n−k}
L̄(x) = Pl_X({x1}) · ··· · Pl_X({xn}) = (1 − q)^k (1 − p)^{n−k}
after normalisation, these are PDFs over the space B of all belief functions definable on X!
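A numerical sketch (with made-up counts k and n) of these two surfaces over the space of belief functions on {1, 0}, i.e. the pairs (p, q) with p + q ≤ 1; it also locates their maxima, anticipating the next slide.

```python
# Illustrative sketch (not from the slides): lower and upper likelihood surfaces for
# Bernoulli trials over the space of belief functions on {1, 0}, i.e. pairs (p, q)
# with p = m({1}), q = m({0}), p + q <= 1. Counts k, n are made up.
import numpy as np

k, n = 6, 10
p, q = np.meshgrid(np.linspace(0, 1, 201), np.linspace(0, 1, 201), indexing="ij")
valid = p + q <= 1

L_low = np.where(valid, p**k * q**(n - k), np.nan)          # lower likelihood
L_up = np.where(valid, (1 - q)**k * (1 - p)**(n - k), np.nan)  # upper likelihood

i, j = np.unravel_index(np.nanargmax(L_low), L_low.shape)
print("argmax of lower likelihood: p=%.2f, q=%.2f" % (p[i, j], q[i, j]))  # p=k/n, q=1-k/n (ML estimate)
i, j = np.unravel_index(np.nanargmax(L_up), L_up.shape)
print("argmax of upper likelihood: p=%.2f, q=%.2f" % (p[i, j], q[i, j]))  # p=q=0 (vacuous BF)
```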
Lower and upper likelihoods (Bernoulli trials): numerical example
both the lower likelihood (left) and the upper likelihood (right) subsume the traditional likelihood p^k (1 − p)^{n−k} for p + q = 1
the maximum of the lower likelihood is the traditional ML estimate
- this makes sense: the lower likelihood is highest for the most ‘committed’ belief functions (i.e. the probability measures, which attach all their mass to singleton elements)
the upper likelihood (right) has its maximum in p = q = 0 (the vacuous belief function on {1, 0})
the interval of belief functions joining max L̲ with max L̄ is the set of belief functions such that p/q = k/(n − k), i.e. those which preserve the ratio between the empirical counts
Generalised logistic regression
Logistic regression
Logistic regression
logistic regression models data in which one or more independent observed variables determine an outcome, represented by a binary variable
conditional probabilities are assumed to have a logistic form:
pi = P(Yi = 1|xi) = 1 / (1 + e^{−(β0 + β1 xi)}),   1 − pi = P(Yi = 0|xi) = e^{−(β0 + β1 xi)} / (1 + e^{−(β0 + β1 xi)})   (1)
given a series of observations D = {(xi, Yi), i = 1, ..., n}, the parameters β0, β1 are estimated by maximum likelihood of the sample, where
L(β0, β1|Y) = ∏_{i=1}^n pi^{Yi} (1 − pi)^{1−Yi}
with Yi ∈ {0, 1} and pi a function of β0, β1
logistic regression yields a single conditional PDF
to express second-order uncertainty on the model, we replace the conditional probability (pi, 1 − pi) on X = {0, 1} with a conditional belief function there, and look for the belief functions whose parameters (masses) optimise either the lower or the upper likelihood (or a combination of both)
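For reference, a minimal sketch of standard logistic regression fitted by maximising this likelihood (equivalently, minimising the negative log-likelihood) on synthetic data; scipy.optimize is used for the optimisation.

```python
# Illustrative sketch (not from the slides): standard logistic regression fitted by
# maximum likelihood on synthetic data, via scipy.optimize.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x))))   # generating model: beta = (-1, 2)

def neg_log_lik(beta):
    b0, b1 = beta
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    eps = 1e-12                                             # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

fit = minimize(neg_log_lik, x0=np.zeros(2))
print("estimated (beta0, beta1):", fit.x)                   # close to the generating values
```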
Framework
Generalised logistic regression
the upper and lower likelihoods can then be computed as
L̲(β|Y) = ∏_{i=1}^n pi^{Yi} qi^{1−Yi},   L̄(β|Y) = ∏_{i=1}^n (1 − qi)^{Yi} (1 − pi)^{1−Yi}
as in logistic regression, the Bel_i are not equally distributed
how do we generalise the logit link between observations x and outputs y? just assuming (1) does not yield any analytical dependency for qi
a first simple proposal: add a parameter β2 such that
qi = m(Yi = 0|xi) = β2 e^{−(β0 + β1 xi)} / (1 + e^{−(β0 + β1 xi)})   (2)
we can then find lower and upper optimal estimates of the parameters β:
arg max_β L̲(β|Y) ↦ (β̲0, β̲1, β̲2),   arg max_β L̄(β|Y) ↦ (β̄0, β̄1, β̄2)
plugging these optimal parameters into (1), (2) yields an upper and a lower family of conditional belief functions given x: Bel_X(·|β̲, x) and Bel_X(·|β̄, x)
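A sketch of how such a fit could be carried out numerically, under the assumptions that q_i = β2 (1 − p_i), which is just (2) rewritten, and that β2 is constrained to [0, 1] so that p_i + q_i ≤ 1; the data are synthetic and the optimiser choice (scipy's bounded minimiser) is ours, not prescribed by the slides.

```python
# Illustrative sketch (not from the slides): fitting the generalised model by maximising
# the lower and the upper likelihood, with q_i = beta2 * (1 - p_i) as in (2). The bound
# beta2 in [0, 1] (our assumption) keeps p_i + q_i <= 1; data are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x))))
eps = 1e-12

def masses(beta):
    b0, b1, b2 = beta
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))                # p_i = m(Y_i = 1 | x_i), eq. (1)
    q = b2 * (1.0 - p)                                      # q_i = m(Y_i = 0 | x_i), eq. (2)
    return p, q

def neg_log_lower(beta):
    p, q = masses(beta)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(q + eps))

def neg_log_upper(beta):
    p, q = masses(beta)
    return -np.sum(y * np.log(1 - q + eps) + (1 - y) * np.log(1 - p + eps))

bounds = [(-10, 10), (-10, 10), (0, 1)]
low = minimize(neg_log_lower, x0=[0.0, 0.0, 0.5], bounds=bounds)
up = minimize(neg_log_upper, x0=[0.0, 0.0, 0.5], bounds=bounds)
print("lower-likelihood fit:", low.x)
print("upper-likelihood fit:", up.x)

# each fitted family yields, for a new x0, the interval
# [Bel({1}|x0), Pl({1}|x0)] = [p(x0), 1 - q(x0)] for the probability of the outcome 1
```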
Dealing with rare events
Rare events with belief functions
how do we use belief functions to be cautious about rare-event prediction?
having learned a lower and an upper family of conditional belief functions given x from a training set D ...
... when observing a new x, we plug it into Bel_X(·|β̲, x) and Bel_X(·|β̄, x), and get a pair of lower and upper belief functions on Y
note that each such belief function is really an envelope of logistic functions
this will produce two intervals of probability values for the same x
open issues: how does this relate to the results of classical logit regression? how are these two intervals related? what about optimising a combination of the lower and upper likelihoods instead?
Generalised Laws of Probability
Central limit theorem
Central limit theorems for random sets
an ongoing effort concerns generalising the laws of classical probability to belief functions (and random sets)
the Gaussian (‘normal’) distribution is central in probability theory and its applications:
- it is the PDF with maximum entropy among those with given mean and σ
- the central limit theorem shows that sums of iid random variables are asymptotically Gaussian
- whenever test statistics or estimators are functions of sums of random variables, they have asymptotically normal distributions
an old proposal by Dempster and Liu merely transfers normal distributions on the real line by Cartesian product with R^m
a more sensible/interesting option: investigating how Gaussian distributions are transformed under (appropriate) multivalued mappings
another avenue of research: central limit theorems for random sets
- Larry G. Epstein and Kyoungwon Seo (Boston University) [2011]: ‘A Central Limit Theorem for Belief Functions’
- Xiaomin Shi (Shandong University) [2015]: ‘Central limit theorems for belief measures’
Total belief theorem
The total belief theorem: generalising the law of total probability
conditional belief functions are crucial for our approach to inference
the complementary link of the chain: a generalisation of the law of total probability
refining: a mapping from elements of one set Ω to the elements of a disjoint partition of a second set Θ
[Diagram: an a priori belief function Bel0 : 2^Ω → [0, 1] on Ω, and conditional belief functions Beli : 2^{Πi} → [0, 1] on the elements Πi of the partition of Θ induced by Ω]
The total belief theorem [Zhou & Cuzzolin, UAI’17]

Total belief theorem. Suppose Θ and Ω are two finite sets, and ρ : 2^Ω → 2^Θ the unique refining between them. Let Bel0 be a belief function defined over Ω = {ω1, ..., ω_|Ω|}. Suppose there exists a collection of belief functions Beli : 2^{Πi} → [0, 1], where Π = {Π1, ..., Π_|Ω|}, Πi = ρ({ωi}), is the partition of Θ induced by Ω. Then there exists a belief function Bel : 2^Θ → [0, 1] such that:
1. Bel0 is the marginal of Bel to Ω (Bel0(A) = Bel(ρ(A)));
2. Bel ⊕ Bel_{Πi} = Beli for all i = 1, ..., |Ω|, where Bel_{Πi} is the logical belief function with m_{Πi}(A) = 1 if A = Πi, and 0 otherwise.

the belief function
Bel ≐ Bel0↑Θ ⊕ (Bel1→ ⊕ ··· ⊕ Bel|Ω|→)
is a solution, where Beli→ denotes the conditional embedding of the conditional belief function Beli, the Beli→ are combined by Dempster’s sum, and Bel0↑Θ is the vacuous extension of Bel0 from Ω to Θ
other distinct solutions exist, and they likely form a graph with symmetries
Machine learning in the wild
Model adaptation
The problem with machine learning: generalising from scarce data
machine learning: designing algorithms that can learn from data
BUT we train them on a ridiculously small amount of data: how can we make sure they are robust to new situations never encountered before (model adaptation)?
we need to look at the foundations: statistical learning theory [Vapnik]
Statistical learning theory
Vapnik’s statistical learning theory
makes predictions on the reliability of a training set based on simple quantities, such as the number of samples N
generalisation issue: the training error is different from the expected error:
E_{x∼p}[δ(h(x) ≠ y(x))] ≠ (1/N) Σ_{n=1}^N δ(h(xn) ≠ y(xn))
the training data x = (x1, ..., xN) are assumed drawn from a distribution p; h(x) is the predicted label for input x and y(x) the actual label

Probably Approximately Correct (PAC) learning
the learning algorithm finds, with probability at least 1 − δ, a model h ∈ H which is approximately correct, i.e. which makes a training error of no more than ε
PAC learning aims at providing generalisation bounds of the kind
P[L(ĥ) − L(h*) > ε] ≤ δ,
on the difference between the loss L(ĥ) of the model ĥ learned from the training set and the minimal theoretical loss L(h*) for that class of models h ∈ H
Towards a robust statistical learning theory
Generalising statistical learning theory
[Figure: training and test distributions as distinct points in the probability simplex, each generating its own samples; a random set can cover the region containing both]
the issue is: training and test data are assumed to be sampled from the same (unknown) probability distribution p
machine learning deployment ‘in the wild’ has shown that this is hardly the case, leading to sometimes catastrophic failures (see Tesla, or recently Uber)
we recently [BELIEF’18, u/r] took a first step towards robustifying PAC learning, by analysing the case of finite, realisable model spaces
we adopt the relaxed assumption that the training and test distributions come from a known convex set of distributions (or possibly a random set)
Generalisation bounds for finite, realisable model spaces
we wish to generalise the proof of Theorem 4 in https://web.stanford.edu/class/cs229t/notes.pdf
let H be a hypothesis class, where each hypothesis h ∈ H maps some X to Y; let l be the zero-one loss, l((x, y), h) = I[y ≠ h(x)]; let p be any distribution over X × Y; and let ĥ be the empirical risk minimiser

Theorem. Assume that: (1) the model space H is finite, and (2) there exists a hypothesis h* ∈ H that obtains zero expected risk, that is, L(h*) = E_{(x,y)∼p}[l((x, y), h*)] = 0. Then, with probability at least 1 − δ:
L(ĥ) ≤ (log |H| + log(1/δ)) / n
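A trivial numeric companion to the bound (not from the slides): evaluating the right-hand side for given |H|, δ and n, and inverting it to get the sample size needed for a target error ε.

```python
# Illustrative sketch (not from the slides): evaluating the finite-realisable PAC bound
# L(h_hat) <= (log|H| + log(1/delta)) / n, and inverting it for a target risk epsilon.
import math

def risk_bound(H_size, delta, n):
    """Upper bound on L(h_hat), valid with probability at least 1 - delta."""
    return (math.log(H_size) + math.log(1.0 / delta)) / n

def samples_needed(H_size, delta, epsilon):
    """Smallest n for which the bound drops below epsilon."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

print(risk_bound(H_size=10**6, delta=0.05, n=5000))            # about 0.0034
print(samples_needed(H_size=10**6, delta=0.05, epsilon=0.01))  # 1682
```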
Credal realisability: generalising statistical learning theory [Cuzzolin, BELIEF’18, u/r]
a central notion is that of realisability: there exists a hypothesis h* ∈ H that obtains zero expected risk, L(h*) ≐ E_{(x,y)∼p}[l((x, y), h*)] = 0
in the credal case we can replace this with credal realisability: ∃ h* ∈ H, p* ∈ P : L_{p*}(h*) = 0
unfortunately, the traditional proof does not apply under credal realisability
uniform credal realisability can also be proposed: ∀ p ∈ P, ∃ h*_p ∈ H : E_p[l(h*_p)] = L_p(h*_p) = 0
in general, SLT proofs rely on classical concentration inequalities: is there a random-set version?
does assuming that the credal set is actually a random set simplify the derivations?
Conclusions
Some conclusions, and a research programme
we have appreciated the role of belief and random-set theory at the boundary of statistics and AI:
- the belief likelihood function as a generalisation of the traditional likelihood
- generalised logistic regression for rare-event analysis
- generalised laws of probability and the total belief theorem
- robustification of statistical learning theory
further development of machine learning tools:
- generalisation of maximum entropy classification and log-linear models
- random-set random forests
a fully developed theory of statistical inference with random sets:
- random-set random variables, generalisation of the Radon-Nikodym derivative
- frequentist inference with random sets
intriguing solutions to high-impact problems:
- robust climate change predictions
- robust statistical learning theory for machine learning ‘in the wild’
Appendix
For Further Reading
G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976.
I. Molchanov. Theory of random sets. Springer, 2017.
F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing, 2014.
F. Cuzzolin. The geometry of uncertainty - the geometry of imprecise probabilities. Springer-Verlag (in press).