Pattern Recognition Prof. Christian Bauckhage
outline
- lecture 03 recap
- basic terms and definitions
- the BIG picture: machine learning for pattern recognition
- building an automatic pattern recognition system
- summary
recap what is pattern recognition ?
regression
classification
clustering
recap word counts from definitions in the literature

(bar chart of word frequencies, from most to least frequent)
- category, decision, class, name / measurement, signal, object, observation, data
- classification, assignment
- mathematics, algorithm, function
- automatic
- representation
basic terms and definitions
basic axioms
we live in a very large but finitely sized universe U

what we can perceive of U is our environment E

inside of E, there are a lot of objects or events o

to perceive any o requires recording physically measurable quantities q(o) such as
- sizes, densities, energies, ...
- acoustic properties, ...
- visual properties, ...
observe
no sensory system (biological or technical) can record every aspect of E or every property of o

for example
- the human eye only reacts to a tiny interval of the whole em-spectrum
- the human ear only reacts to a small interval of acoustic frequencies
- the human tongue only perceives 5 (or so) basic tastes
basic axioms
a problem domain Q consists of quantifiable objects from a specific application domain; the q ∈ Q are called patterns

pattern recognition deals with mathematical and technical aspects of processing and analyzing patterns

the usual goal is to map patterns to symbols or data structures in order to classify them, for instance

q → a class Ωi
q → a tuple of classes Ωi1, Ωi2, ...
q → a class, a location, and an orientation (Ωi, t, R)
q → a symbolic description D
example
what animal is this (Ωi )? vs. what is going on here (D)?
basic axioms
classes or categories result from decomposing the problem domain Q into k or k + 1 subsets such that

Ωi ≠ ∅   ∀ i
Ωi ∩ Ωj = ∅   ∀ i ≠ j

where either

⋃_{i=1}^{k} Ωi = Q   or   ⋃_{i=0}^{k} Ωi = Q
postulates
- patterns within a class Ωi are similar
- patterns from different classes Ωi, Ωj are dissimilar
- Ω0 is the rejection class used for ambiguous patterns
observe
classification applies to simple and complex patterns
example
classifying a simple pattern ⇔ assign a pattern to a class Ωi
example
classifying a complex pattern ⇔ assign (parts of) a pattern to classes Ωi1 , Ωi2 , . . .
postulate
patterns have features that are characteristic of the class(es) they belong to

for instance, pictures of cheetahs show spots
postulate
there is a function f that extracts features from patterns

f(q) = x = (x1, ..., xm)ᵀ

for very simple patterns, it may suffice to consider f = id
postulate

features of patterns of a class form a more or less compact region in the feature space

features of patterns from different classes reside in more or less well separated regions of the feature space
example increasingly less compact and less well separated regions

(figure: three scatter plots in the feature space (x1, x2), each showing classes Ω1 and Ω2; from left to right, the class regions become less compact and overlap more)
classifier
a classifier is a function

y : R^m → {Ω0, Ω1, ..., Ωk}

that maps features x = f(q) to classes Ωi
question how to obtain a classifier ?
answer let’s see . . .
the BIG picture
(diagram: pattern recognition at the intersection of mathematics, computer science, data science, data mining, and machine learning)
observe
in this course, we understand the problem of obtaining classifiers or predictors as a machine learning problem
machine learning
machine learning is the science of fitting models to data
⇔ given a sample of problem specific data

1) decide on a "suitable" model class ⇔ specify a class of mathematical functions which you expect to be able to solve the problem at hand

2) determine / learn "appropriate" model parameters ⇔ use optimization algorithms to fit a mathematical function to the data
example regression

data {(xi, yi)}, i = 1, ..., n, where xi, yi ∈ R

possible model

y(x) = w0 + w1 x

w0 ≡ offset, w1 ≡ slope

(figure: scatter plot of the data and the fitted line)
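for this one-variable linear model, the least-squares parameters have a closed-form solution; the following sketch illustrates it (the helper name `fit_line` and the toy data are made up, not from the lecture):

```python
# Least-squares fit of the linear model y(x) = w0 + w1*x.

def fit_line(xs, ys):
    """Return (w0, w1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution of the normal equations for one input variable.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov_xy / var_x          # slope
    w0 = mean_y - w1 * mean_x    # offset
    return w0, w1

# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # -> 1.0 2.0
```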
example classification

data {(xi, yi)}, i = 1, ..., n, where xi ∈ R², yi ∈ {−1, +1}

possible model

y(x) = +1 if f(x) > 0, −1 otherwise

e.g. using the quadratic discriminant

f(x) = (x − µ1)ᵀ C1⁻¹ (x − µ1) − (x − µ2)ᵀ C2⁻¹ (x − µ2)

(figure: 2D scatter plot of both classes and the resulting decision boundary)
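a minimal sketch of this decision rule, assuming diagonal covariance matrices so that the inverses are trivial; the parameter values and function names below are invented for illustration:

```python
# Two-class quadratic discriminant as on the slide:
# f(x) = (x - mu1)^T C1^{-1} (x - mu1) - (x - mu2)^T C2^{-1} (x - mu2),
# with the slide's convention y(x) = +1 if f(x) > 0 (i.e. +1 when x is
# closer, in Mahalanobis distance, to mu2).

def mahalanobis_sq(x, mu, diag_cov):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / ci for xi, mi, ci in zip(x, mu, diag_cov))

def classify(x, mu1, cov1, mu2, cov2):
    """Return +1 if f(x) > 0, else -1."""
    f = mahalanobis_sq(x, mu1, cov1) - mahalanobis_sq(x, mu2, cov2)
    return +1 if f > 0 else -1

# Made-up class parameters: unit variances, means at (0,0) and (3,3).
print(classify((3.0, 3.0), (0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (1.0, 1.0)))  # -> 1
```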
example density estimation

data {xi}, i = 1, ..., n, where xi ∈ R

possible model

p(x) = Σ_{j=1}^{2} wj N(x | µj, σj²)

(figure: data histogram and the fitted two-component Gaussian mixture density)
example hard clustering

data {xi}, i = 1, ..., n, where xi ∈ R²

possible model

C(x) = argmin_j ‖x − µj‖²

(figure: 2D scatter plot with points colored by cluster)
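the assignment rule C(x) = argmin_j ‖x − µj‖², plus one k-means-style centroid update, can be sketched as follows (the data points and initial means are invented):

```python
# Hard cluster assignment and one centroid update, k-means style.

def assign(x, means):
    """Index of the nearest mean under squared Euclidean distance."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    return dists.index(min(dists))

def update_means(points, means):
    """Recompute each mean as the centroid of its assigned points."""
    clusters = [[] for _ in means]
    for p in points:
        clusters[assign(p, means)].append(p)
    return [tuple(sum(c) / len(c) for c in zip(*pts)) if pts else m
            for pts, m in zip(clusters, means)]

# Two made-up point clouds around (0,0) and (2,2).
points = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.1, 1.9)]
means = update_means(points, [(0.0, 0.0), (2.0, 2.0)])
print(assign((0.1, 0.0), means))  # -> 0
```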
example sequences of symbols

data

BBCBCAABCCCABCA BCACCAACABBCCAB CABCCCCCCABABBC BAABBBCACCCABCC CCCBABBABCACCAC CABCACABCCCABBB CACCCCCABCBBCCA BBBCCCBCCABCABB CBCABCABCABBABA BABAACABCBAABCA...

possible model

Markov chain over the states A, B, C

(figure: state-transition diagram over A, B, and C with the transition probabilities on the edges)
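under a first-order Markov model, the transition probabilities can be estimated by counting bigrams in the observed sequence; the snippet below does this for an excerpt of the data above:

```python
from collections import Counter, defaultdict

# Estimate first-order Markov chain transition probabilities from a
# symbol sequence by counting symbol bigrams.

def transition_probs(sequence):
    """Return nested dict p[a][b] = P(next = b | current = a)."""
    counts = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

# First block of the data shown above.
p = transition_probs("BBCBCAABCCCABCA")
```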
applications
once a machine learning algorithm has fitted a model that generalizes well, it can be applied in practice ⇔ depending on the task at hand, the learned model can do inference reasoning predictions decision making .. .
based on novel, previously unseen data
building an automatic pattern recognition system
observe
in this course, we treat the problem of training classifiers or predictors as a supervised learning problem

⇔ we assume we are given a (large) representative, labeled sample of tuples

(q1, y1), (q2, y2), ..., (qn, yn)

containing information about the problem domain at hand

the qj denote patterns, the yj are (hopefully correct) labels
example

qj : picture of an animal
yj : name of its species or index i of its class Ωi
note
the process of building / implementing and using a pattern recognition system involves the following 3½ phases

1) training phase
   validation phase (optional)
2) test phase
3) application phase
training
collect a representative training set Strain of patterns and (manually) label them

if necessary or appropriate, determine suitable features xj = f(qj); otherwise let xj = qj

decide on a model class Y and train a classifier y ∈ Y, i.e. determine the parameters of y such that it maps the given training data to its labels

y(xj) ≈ yj
validating (optional)
collect a representative validation set Sval of patterns and (manually) label them

the validation set must be independent of the training set

Sval ∩ Strain = ∅
evaluate the performance of y(x) on the labeled set Sval

if the overall performance is not "good enough", readjust the parameters of y(x)
testing
collect a representative test set Stest of patterns and (manually) label them

this test set must be independent of the training set

Stest ∩ Strain = ∅
objectively evaluate y(x) on Stest by measuring, say
- accuracy
- precision
- recall
- ...
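these test-set metrics can be computed as follows, assuming the course's label convention y ∈ {−1, +1} with +1 as the positive class; the example labels and the function name are made up:

```python
# Accuracy, precision, and recall from true vs. predicted labels,
# with +1 treated as the positive class.

def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# One true positive, one false negative, one true negative, one false positive.
m = evaluate([+1, +1, -1, -1], [+1, -1, -1, +1])
```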
application
if y(x) meets the desired, problem-specific quality requirements, apply it in practice or ship it to your customer
note
a (very) good performance of a trained model y on the training data means nothing

what really counts in pattern recognition and machine learning are the generalization capabilities of y

⇔ what really counts in pattern recognition and machine learning is how y performs on independent test data

in a few weeks, we will study details as to why this is
take home message
you must never ever in your life make the rookie mistake of improperly testing a machine learning or pattern recognition system

any statement you ever make about the quality or performance of your system must be derived from test data which is independent of the training data
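a minimal sketch of enforcing this independence: split the labeled sample once, train only on one part, and evaluate only on the held-out part (the function name, split fraction, and toy data are illustrative):

```python
import random

# Shuffle-and-split so that S_train and S_test are disjoint subsets
# of the labeled sample.

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Return (train, test) with test holding test_fraction of the data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up labeled sample of 100 (pattern, label) tuples.
data = [(i, +1 if i % 2 == 0 else -1) for i in range(100)]
train, test = train_test_split(data)
```

any accuracy, precision, or recall figure reported for the system should then be computed on `test` only.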
summary
we now know about
- problem domains and patterns
- classes, features, and classifiers
- the fact that machine learning is nothing but model fitting
- general aspects of building a pattern recognition system