Image and speech recognition
Lecture notes 2012
Włodzimierz Kasprzak
Project is co-financed by European Union within European Social Fund
CONTENT
Part I – Pattern recognition
1. Pattern recognition  5 - 42
2. Pattern transformation  43 - 77
3. Pattern classification  78 - 114
Part II – Image recognition
4. Image pre-processing  115 - 157
5. Boundary-based image segmentation  158 - 181
6. Region-based image segmentation  182 - 213
7. Model-based object recognition  214 - 231
W. Kasprzak: EIASR
2
CONTENT (2)
Part III – Speech recognition
8. Speech signal pre-processing  232 - 250
9. Acoustic feature detection  251 - 274
10. Phonetic speech model  275 - 289
11. Word and sentence recognition  290 - 314
GENERAL REFERENCES
1. W. Kasprzak: Rozpoznawanie obrazów i sygnałów mowy. WPW, Warszawa, 2009.
2. R. Duda, P. Hart, D. Stork: Pattern Classification. 2nd edition, John Wiley & Sons, New York, 2001.
3. H. Niemann: Klassifikation von Mustern. 2nd edition, Springer, 2003.
4. I. Pitas: Digital Image Processing Algorithms and Applications. Prentice Hall, New York, 2000.
5. D. Paulus, J. Hornegger: Applied Pattern Recognition. A Practical Introduction to Image and Speech Processing in C++. 3rd edition, Vieweg, Braunschweig, 2001.
6. J. Benesty, M.M. Sondhi, Y. Huang (eds): Handbook of Speech Processing. Springer, Berlin Heidelberg, 2008.
EIASR Pattern recognition Lecture 1
1. Introduction
The GOAL of pattern recognition:
• to analyse (to recognize, to understand) and describe the content of a given pattern (digital image, speech signal, etc.).
Example: the difference between computer graphics (image synthesis) and the image analysis process:
Pattern
Patterns – sets of multi-variable functions that express some physical entity, object, system, etc.
Domain of patterns: Ω = { f_r(x) | r = 1, 2, … } ⊂ U
Each pattern is a vector function of n variables:
f_r(x) = [ f_r1(x1, x2, …, xn), f_r2(x1, x2, …, xn), …, f_rm(x1, x2, …, xn) ]ᵀ
Example: some patterns (in bitmap form) from the domain of characters:
Pattern recognition approaches
Complexity of patterns → pattern recognition approach:
1. Simple patterns → template matching, classification.
2. Complex patterns (a sequence or set of simple patterns) → pattern-sequence or pattern-structure model-to-data matching.
3. Abstract patterns (e.g. a dynamic scene, continuous speech) → pattern understanding.
„2-D image and speech recognition”:
• patterns represented by speech or 2-D image signals,
• deals with simple and complex pattern recognition.
Pattern understanding Abstract pattern recognition (pattern understanding):
Signal, 2-D image
A signal is a function that relates one variable to another, e.g. a = s(t). Signal variables:
• Amplitude – the function value; may represent voltage, light intensity, sound pressure, etc.
• Time – the typical function parameter (signal in the time domain); other parameters are used in specific applications.
2-D image – a signal in which the parameter „time” takes the form of a 2-D space, e.g. a = I(x, y) (signal in the spatial domain).
Image scan
Scanning: converting a 2-D image to a 1-D signal. Common scan orders: line scan, aerial („zig-zag”) scan, Hilbert scan.
Line scan – not the best one, as local continuity is not preserved.
Aerial („zig-zag”) scan – preserves locality but prefers one direction.
2. 2-D image analysis
2-D image analysis systems contain different analysis processes at different data abstraction levels:
1. Signal level: image preprocessing (filtering) – image restoration, image enhancement.
Example: source separation from mixtures – (a) a finite number of mixtures, (b) reconstructed sources.
2-D image analysis (2) 2. Iconic level: image compression, image registration (normalization), classification of entire image. Example: image normalization w.r.t. pattern orientation:
(a) Input image, (b) background suppression, (c) foreground border image, (d) rotated (normalized) image.
2-D image analysis (3) 3. Segmentation level: boundary detection, region detection, texture classification, depth or motion detection. Example. Detection of characters in license plate regions:
(a) Motion mask, (b), (c) two regions of interest (ROI), (d) regions corresponding to characters.
2-D image analysis (4) 4. Object level: model-based 2-D and 3-D object recognition. Example. Palm recognition:
(a) 2-D palm model, (b) 3-D palm model, (c) contour detection, (d) discrete feature detection, (e) feature-to-model matching.
3. Speech recognition Typical problems (steps) in a speech recognition system:
Speech recognition (2) Problems (steps) (cont.):
Simple speech classification
Sometimes a simple speech recognition system will do, for example:
• for controlling a device that needs only a few commands,
• for selecting a phone number by voice.
The structure of a simple word recognition system:
1. Signal acquisition and VAD (voice activity detection).
2. Spectrogram analysis with a fixed number of frames (e.g. 40 frames of 512 samples, giving a spectrogram of 10,240 points).
3. Approximation of the spectrogram image – reducing the image resolution to 640 points (40 × 16).
4. In the learning stage: modelling words from the dictionary by corresponding average low-resolution spectrogram images.
5. Word recognition via direct image classification.
Spectrogram image classifier (2) Example. Four low-resolution spectrogram images of the spoken words: (a) „koniec” and (b) „OK”.
The coding of spectrogram values: black – the highest amplitude, blue – middle range, yellow – low range, white – near-zero range.
Spectrogram image classifier (3)
Example. Spectrograms of spoken numbers in Polish. Significant differences are visible.
4. Probability theory
The purpose of applying probability theory in pattern recognition:
• to model the uncertainty of sensor data and of processing results (for example, by means of noise),
• the methods of error correction (normalisation, filtering) require statistics of the error sources,
• stochastic distance measures are useful for judging pattern classification and model-based recognition results.
Discrete stochastic variable
A non-negative probability density function (pdf) of a discrete stochastic variable X: P(X=x) = p_X(x).
Probability of events specified by an interval of values:
P(A ≤ X ≤ B) = ∑_{A≤x≤B} p_X(x)
The cumulative distribution F_X(x) of a stochastic variable X:
F_X(x) = P(X ≤ x) = ∑_{z≤x} p_X(z)
Mean and variance of a pdf:
µ_X = ∑_{x∈dom(X)} x · p(x)
σ_X² = ∑_{x∈dom(X)} (x − µ_X)² · p(x)
Example
The distribution of pixel values (e.g. intensities, colours) in an image can be modelled as a discrete stochastic variable. The normalized histogram of an image represents the relative frequencies of pixel values in a given image: (a) an image, and (b) its intensity histogram.
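As a quick check of these definitions, the normalized histogram of a small array can be treated as an empirical pdf, with the discrete mean and variance computed from it. A minimal NumPy sketch; the tiny 2×3 "image" and the helper names are illustrative, not from the lecture:

```python
import numpy as np

def histogram_pdf(image, levels):
    """Normalized histogram: an empirical pdf p_X(x) of the pixel values."""
    counts = np.bincount(image.ravel(), minlength=levels)
    return counts / counts.sum()

def pdf_mean_var(p):
    """Mean and variance of a discrete pdf over the values 0..len(p)-1."""
    x = np.arange(len(p))
    mu = np.sum(x * p)
    var = np.sum((x - mu) ** 2 * p)
    return mu, var

img = np.array([[0, 1, 1], [2, 3, 1]])   # a tiny 2x3 "image", intensities 0..3
p = histogram_pdf(img, levels=4)          # p = [1, 3, 1, 1] / 6
mu, var = pdf_mean_var(p)
```

The same two helpers apply unchanged to a real intensity image with `levels=256`.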
Continuous stochastic variable
The probability of a given (exact) realization x of a continuous stochastic variable X, P(X=x), is infinitesimally small. It is meaningful to consider intervals of values:
P(A ≤ X ≤ B) = ∫_A^B p_X(x) dx
The cumulative distribution: F_X(x) = ∫_{−∞}^x p_X(z) dz
The pdf of a Gaussian (normally distributed) variable:
p_X(x | µ, σ) = 1/(σ √(2π)) · exp( −(x − µ)² / (2σ²) )
Mean and standard deviation (µ and σ) are the only two parameters of this distribution.
Continuous vs. discrete variable
• Discrete stochastic variables represent measurable (observable) entities, i.e. we deal with a finite set of observations – the stochastic features of the observed variable usually change with every new observation.
• Probability theory with continuous variables models the hidden stochastic process. Its features may be fixed in time.
Moments – the features of a pdf:
• the k-th absolute moment of p: m_k(p) = ∫ z^k · p(z) dz
• the k-th central moment of p: µ_k(p) = ∫ (z − m_1(p))^k · p(z) dz
A vector variable
Let X = [X1, X2, …, Xn]ᵀ be a vector of stochastic variables.
The mean vector: E{X} = [E{X1}, E{X2}, …, E{Xn}]ᵀ
The covariance matrix Σ, with entries:
σ_{i,j}(X) = E{ (X_i − E{X_i})(X_j − E{X_j}) }
Example. The n-dimensional Gaussian distribution:
p_X(x) = 1/√( (2π)ⁿ |det(Σ)| ) · exp( −[x − µ]ᵀ Σ⁻¹ [x − µ] / 2 )
where “det” is the determinant of a matrix.
Bayes rule
Chain rule for n stochastic variables:
p_X([x1, x2, …, xn]) = p_X1(x1) · p_X2|X1(x2 | x1) · p_X3|X1,X2(x3 | x1, x2) · … · p_Xn|X1,…,Xn−1(xn | x1, x2, …, xn−1)
For n stochastically independent variables it holds:
p_X([x1, x2, …, xn]) = Π_{i=1}^{n} p_Xi(xi)
The Bayes rule relates conditional probability to the joint distribution:
p_X1|X2(x1 | x2) = p_X1,X2(x1, x2) / p_X2(x2)
Hence: p_X1|X2(x1 | x2) · p_X2(x2) = p_X2|X1(x2 | x1) · p_X1(x1)
Entropy
Duality of information content and probability:
• low information of event (X=xi): high probability of the event,
• high information of event (X=xi): low probability of the event.
Information content of some realization xi of variable X:
I(xi) = −log_a P(xi)
Entropy of X – the mean information content of variable X:
H(X) = −∑_{xi∈X} P(xi) · log_a P(xi)
4. Sampling and digitalization
Signal acquisition and analogue-to-digital (A/D) signal conversion:
• sampling (in time or space) and amplitude digitalization.
Sampling
Sampling theorem: if the sampling frequency of signal f(x) is at least twice the maximum frequency component B_x of this signal (spectrum within (−B_x, B_x)), then full reconstruction of the original analogue signal from its sample set is possible: f_j = f(j·∆x), j = 0, ±1, ±2, …, where the sampling period satisfies ∆x ≤ 1/(2B_x).
Duality of signal representation in the time (space) domain f(t) and the frequency domain F(ω).
Digitalization of amplitude
A sequence of real-valued (analogue) samples: { f_j }.
The corresponding digital-valued samples: { h_j }.
The digitalisation error („noise”): n_j = f_j − h_j.
The quality of the digitalisation – the signal-to-noise ratio:
SNR = 10 log10( E{f_j²} / E{n_j²} )
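The SNR definition can be illustrated with a uniform mid-rise quantizer (a sketch under assumed parameters: 8-bit quantization of samples uniform on [−1, 1); for this case the usual "about 6 dB per bit" rule predicts roughly 48 dB):

```python
import numpy as np

def snr_db(analog, digital):
    """SNR = 10 log10( E{f^2} / E{n^2} ), with noise n = f - h."""
    analog = np.asarray(analog, dtype=float)
    noise = analog - np.asarray(digital, dtype=float)
    return 10.0 * np.log10(np.mean(analog ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
f = rng.uniform(-1.0, 1.0, 10000)           # analogue samples
step = 2.0 / 256                            # 8-bit uniform quantizer on [-1, 1)
h = np.floor(f / step) * step + step / 2    # mid-rise quantization to bin centers
snr = snr_db(f, h)                          # close to 6.02 dB/bit * 8 bits
```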
PCM coding
PCM coding („pulse code modulation”): optimal digitalization (amplitude intervals and quantization levels).
Error measure:
ε = ∑_{i=1}^{L} ∫_{a_i}^{a_{i+1}} (f − b_i)² p(f) df
Setting the derivatives to zero:
∂ε/∂b_i = ∫_{a_i}^{a_{i+1}} −2(f − b_i) p(f) df = 0
∂ε/∂a_i = (a_i − b_{i−1})² · p(a_i) − (a_i − b_i)² · p(a_i) = 0
Solution – intervals a_i and levels b_i:
a_i = (b_{i−1} + b_i) / 2, i = 2, 3, …, L = 2^B
b_i = [ ∫_{a_i}^{a_{i+1}} f · p(f) df ] / [ ∫_{a_i}^{a_{i+1}} p(f) df ], i = 1, 2, …, L.
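The two optimality conditions (levels at the centroids of their intervals, boundaries at midpoints between adjacent levels) suggest the classical Lloyd-Max iteration, sketched here on empirical samples; the Gaussian test data, L = 4, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def lloyd_max(samples, L=4, iters=100):
    """Iterate the two optimality conditions: boundaries a_i are midpoints
    between adjacent levels, levels b_i are centroids of their intervals."""
    samples = np.sort(np.asarray(samples, dtype=float))
    # initialize the L levels from sample quantiles
    b = np.quantile(samples, (np.arange(L) + 0.5) / L)
    a = (b[:-1] + b[1:]) / 2.0
    for _ in range(iters):
        a = (b[:-1] + b[1:]) / 2.0              # midpoint boundaries
        idx = np.searchsorted(a, samples)       # assign samples to intervals
        b = np.array([samples[idx == i].mean() if np.any(idx == i) else b[i]
                      for i in range(L)])       # centroid (empirical mean) levels
    return a, b

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 20000)
a, b = lloyd_max(x, L=4)   # for N(0,1) the optimal levels are near +/-0.45, +/-1.51
```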
5. Optimization theory elements
LSE (least square error), 2-D case: establish a linear relation between two quantities, y and x: y = a·x + b, or a·x + b − y = 0, given N observations:
a·x1 + b − y1 = ε1
a·x2 + b − y2 = ε2
…
a·xN + b − yN = εN.
Goal function: U(a, b) = ε1² + ε2² + … + εN² = ∑ εi² = |ε|²
Minimization of U(a, b): ∂U/∂a = 0, ∂U/∂b = 0.
Solution:
[a, b]ᵀ = 1/(N·∑xi² − (∑xi)²) · [ N, −∑xi ; −∑xi, ∑xi² ] · [ ∑xi·yi, ∑yi ]ᵀ
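A minimal sketch of this closed-form solution (the exact-line test data is illustrative, chosen so the residuals are zero):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form LSE solution of the 2x2 normal equations for y = a*x + b."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    det = N * np.sum(x ** 2) - np.sum(x) ** 2
    a = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / det
    b = (np.sum(x ** 2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / det
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0            # exact line, so all residuals epsilon_i are zero
a, b = fit_line(x, y)
```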
Moore-Penrose pseudo-inverse
The Moore-Penrose pseudo-inverse gives a general solution to a system of m linear equations with n unknowns:
A y = b, b ∈ Rᵐ, y ∈ Rⁿ, A ∈ R^{m×n}
The solution is: y = A† b, where the matrix A† is the Moore-Penrose pseudo-inverse. When A is of full rank, A† can be calculated directly:
• case m = n: A† = A⁻¹
• case m > n: the general linear LS solution, which minimizes the squared error (second-order norm) |b − Ay|²: A† = (Aᵀ A)⁻¹ Aᵀ
• case m < n: (it minimizes the norm |y|²): A† = Aᵀ (A Aᵀ)⁻¹
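The two rectangular full-rank cases can be checked numerically against `numpy.linalg.pinv` (random test matrices, which are full rank with probability one; an illustration, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# case m > n: the least-squares solution, A_dag = (A^T A)^-1 A^T
A = rng.normal(size=(5, 3))            # full column rank with probability one
A_dag = np.linalg.inv(A.T @ A) @ A.T

# case m < n: the minimum-norm solution, B_dag = B^T (B B^T)^-1
B = rng.normal(size=(3, 5))            # full row rank with probability one
B_dag = B.T @ np.linalg.inv(B @ B.T)
```

Both closed forms agree with the SVD-based pseudo-inverse computed by `np.linalg.pinv`.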
Nonlinear LSE (1)
Assume there is some number N of data samples: (xi, yi) ∈ R².
The relation between variables x and y is approximated by a parametric non-linear function y = f(x | p), where p ∈ Rⁿ, N ≥ n, is the parameter vector.
Goal: find the minimum of the goal (objective) function:
Φ(p) = ½ |r(p)|² = ½ ∑_{i=1}^{N} r_i²(p) = ½ ∑_{i=1}^{N} [yi − f(xi | p)]²
The gradient of the goal function:
∇Φ(p) = ∑_{i=1}^{N} r_i(p) ∇r_i(p) = J(p)ᵀ r(p)
The Jacobi matrix: {J(p)}_ij = ∂r_i(p)/∂p_j

Nonlinear LSE (2)
Second-order derivatives are collected in the Hessian:
H(p) = ∇²Φ(p) = J(p)ᵀ J(p) + ∑_{i=1}^{N} r_i(p) ∇²r_i(p)
For small errors r it is approximated as: ∇²Φ(p) ≈ J(p)ᵀ J(p)
Solutions of the nonlinear LSE problem:
1) Gradient descent search – an iterative first-order update rule:
p_{i+1} = p_i − λ ∇Φ(p_i)
2) Gauss-Newton search – a second-order update rule:
p_{i+1} = p_i − (∇²Φ(p_i))⁻¹ ∇Φ(p_i)
3) Levenberg-Marquardt algorithm – uses a combined rule:
p_{i+1} = p_i − (H(p_i) + λ diag(H(p_i)))⁻¹ ∇Φ(p_i)
Linear programming
Linear programming (LP), or linear optimization:
• optimization of a linear objective function,
• subject to linear equality and linear inequality constraints.
The solution is within a convex polyhedron, a set defined as the intersection of finitely many half-spaces.
Linear programming problems in canonical form:
• Maximize the objective function: x̂ = arg max_x cᵀ x
• Subject to constraints: A x ≤ b, where x ≥ 0.
LP in MATLAB (Optimization Toolbox): BINTPROG, LINPROG.
Quadratic programming
In Quadratic Programming (QP) the goal (objective) function is quadratic and the constraints are linear:
x̂ = arg min_{x∈Rⁿ} f(x) = ½ xᵀ H x + tᵀ x
subject to:
a_iᵀ x = b_i, i = 1, …, m
c_iᵀ x ≥ d_i, i = 1, …, p
where H is a strictly positive-definite matrix of dimension n×n, t is a vector of length n, a_i and c_i are vectors of length n, and b_i, d_i are scalars.
In MATLAB – the quadprog routine.
Quadratic programming (2)
Simple case: only equality constraints appear in the QP problem: A x = b.
If the number of equality constraints is the same as the number of independent variables (m = n), the constraints uniquely determine the solution to the QP problem: x = A⁻¹ b.
If the number of equality constraints is smaller than the number of independent variables (m < n), […] n eigenvalues (λ1 > λ2 > … > λn) and n corresponding eigenvectors.
2. PCA
The transformation optimizing s1 is Principal Component Analysis (also called the Karhunen-Loeve transform):
1) Get the zero-mean covariance matrix Q⁽¹⁾ of samples {jx}.
2) Obtain its eigenvectors by the eigenvector decomposition or SVD algorithm:
Q⁽¹⁾ = ∑_{r=1}^{n} λ_r φ_r φ_rᵀ = U Λ Uᵀ
where U is the matrix of eigenvectors φ_r of Q⁽¹⁾ and Λ is the diagonal matrix of positive eigenvalues λ_r, while n is the rank of the matrix Q⁽¹⁾.
3) Select the feature space dimension, m ≤ n. The feature vector jc_m corresponding to sample jx is:
jc_m = Λ_m U_mᵀ jx (non-normalized feature) or jc_m = Λ_m^{−1/2} U_mᵀ jx (normalized).
PCA (2)
PCA: a rotation of an orthogonal coordinate system that orders the axes to capture the largest variances of the data samples. If n is the input vector size, then some m < n principal components represent the interesting structure, while those with lower variances represent noise.
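The three PCA steps can be sketched as follows (NumPy illustration; the anisotropic test data and helper names are assumptions; the plain projection U_mᵀx is used for the unnormalized features here, and Λ_m^{−1/2} U_mᵀ x for the normalized ones as in step 3):

```python
import numpy as np

def pca_features(X, m):
    """PCA on the rows of X (one sample per row): eigendecomposition of the
    zero-mean covariance matrix, keeping the m largest-variance directions."""
    Xc = X - X.mean(axis=0)
    Q = Xc.T @ Xc / len(X)                 # zero-mean covariance matrix
    lam, U = np.linalg.eigh(Q)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]      # take the m largest
    lam, U = lam[order], U[:, order]
    C = Xc @ U                             # projections onto the eigenvectors
    C_white = C / np.sqrt(lam)             # normalized (unit-variance) features
    return C, C_white, lam

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 2)) * np.array([3.0, 1.0])   # variances ~9 and ~1
C, C_white, lam = pca_features(X, m=2)
```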
Inverse-PCA transform
Let dimension(c_m) = m and dimension(x) = n. In PCA usually m < n, and most often m << n. […]

[…] (k > 2)
The within-class variability matrix:
S_W = ∑_{i=1}^{k} S_i = ∑_{i=1}^{k} ∑_{j=1}^{N_i} (ix_j − m_i)(ix_j − m_i)ᵀ
where N_i is the number of samples from class i = 1, …, k.
The between-class variability matrix:
S_B = ∑_{i=1}^{k} N_i (m_i − m)(m_i − m)ᵀ
where m is the mass centre of all the samples.
General LDA
Definition. The linear Fisher discriminant function is: y = W_o x, where
W_o = arg max_W J(W), and J(W) = |Wᵀ S_B W| / |Wᵀ S_W W|.
In LDA we apply the W_o that induces the maximum ratio between the determinants of the between-class variability matrix and the within-class variability matrix of the samples.
Solution. The rows of matrix W_o are eigenvectors of the matrix S_W⁻¹ S_B:
S_W⁻¹ S_B W_o = Λ W_o
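A sketch of this solution via the eigenvectors of S_W⁻¹ S_B (two synthetic Gaussian classes separated along the first axis; the data and helper names are illustrative assumptions):

```python
import numpy as np

def fisher_lda(Xs):
    """Fisher LDA directions from the eigen-decomposition of SW^{-1} SB.
    Xs is a list of per-class sample arrays (one sample per row)."""
    m = np.vstack(Xs).mean(axis=0)                       # mass centre of all samples
    SW = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in Xs)
    SB = sum(len(X) * np.outer(X.mean(axis=0) - m, X.mean(axis=0) - m) for X in Xs)
    lam, V = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(lam.real)[::-1]                   # largest eigenvalue first
    return V.real[:, order], lam.real[order]

rng = np.random.default_rng(4)
X1 = rng.normal(size=(200, 2)) + np.array([3.0, 0.0])    # class 1
X2 = rng.normal(size=(200, 2)) + np.array([-3.0, 0.0])   # class 2
W, lam = fisher_lda([X1, X2])
w = W[:, 0] / np.linalg.norm(W[:, 0])   # dominant discriminant direction (~ x1 axis)
```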
4. The inverse problem
Instantaneous-time inverse problem:
• m unknown sources (e.g. signals in time), s[t] = [s1[t], …, sm[t]]ᵀ, or a set of m-dimensional data samples,
• n observed, possibly noisy, different, linear mixtures of the sources (n ≥ m): x[t] = [x1[t], …, xn[t]]ᵀ,
• the mixing coefficients are unknown constants:
x[t] = A s[t] + n[t] = ∑_{i=1}^{m} s_i[t] a_i + n[t]
Possible solutions
Goal: find the unknown matrix A and reconstruct the sources. Possible solutions:
1. Independent component analysis
2. Projection pursuit
3. Factor analysis
4. Principal component analysis
Factor analysis, Whitening
Factor analysis (FA) assumes that the sources (called factors) are uncorrelated, of Gaussian distribution and of unit variance: E{s sᵀ} = I.
The noise components are assumed to be uncorrelated with each other and with the factors: Q = E{n nᵀ}.
With the above assumptions the covariance matrix of the observation is: E{x xᵀ} = R_xx = A Aᵀ + Q.
Assuming Q is known or can be estimated, FA attempts to solve for A from: A Aᵀ = R_xx − Q.
Without noise the problem simplifies to whitening: A Aᵀ = R_xx. It is a specific case of PCA, where every „direction” in space is of unit variance.
ICA for feature extraction
Application of ICA (independent component analysis): a coding scheme – a signal (image block) x_i is represented by a feature vector that consists of the mixing coefficients (a_i1, a_i2, …, a_im) – the block is a mixture of some fixed set of „standard” image blocks (independent components) s_1, s_2, …, s_m.
Assumptions in ICA:
1. The sources are stochastically independent w.r.t. functions f, g that capture higher-order signal statistics, i.e. for y1, y2 it holds:
E{f(y1) g(y2)} = E{f(y1)} · E{g(y2)}
2. At most one of the sources is of Gaussian distribution.
Example
Example [Hyvarinen(2000)]. Two independent sources with uniform pdf:
p(s_i) = 1/(2√3) if |s_i| ≤ √3, and 0 otherwise.
The joint distribution of (s1, s2) is uniform on a square. The mixing matrix is:
A0 = [ 2 3 ; 2 1 ]
The mixture distribution (x1, x2) is uniform on a parallelogram.
Example (2)
Whitening: a linear transform so that the new mixtures are uncorrelated and have a variance of one.
PCA: finds orthogonal main directions → no separation.
ICA: source separation by generalized independence.
FastICA (1)
FastICA (Hyvarinen, Karhunen, Oja) is a batch method for ICA. It requires that the mixtures are centered and whitened first.
Centering: x_centered = x − E{x} = x − m_x
Whitening: E{x xᵀ} = R_xx = U Λ Uᵀ, x_whitened = Λ^{−1/2} Uᵀ x
Then the method iteratively updates the weight matrix W – vector by vector, while maximizing the non-Gaussianity of the projection wᵀx (see next page).
2. Pattern transformation
67
FastICA (2)
A. Initialize a nonzero matrix: W = [w1, w2, …, wn].
B. Iterate FOR outputs p = 1, 2, …, n DO:
1) Weight update:
w_p⁺ = E{ x g(w_pᵀ x) } − E{ g′(w_pᵀ x) } w_p
where g is a nonlinear function and g′ its first derivative.
2) Normalize to unit length: w_p = w_p⁺ / |w_p⁺|
3) Gram-Schmidt orthogonalization and normalization:
w_p′ = w_p − ∑_{j=1}^{p−1} (w_pᵀ w_j) w_j
w_p = w_p′ / |w_p′|
4) IF W has not yet converged THEN next iteration of 1)-4).
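A compact deflationary FastICA sketch with the kurtosis non-linearity g(u) = u³ (so g′(u) = 3u²); the fixed iteration count (no convergence test), the sub-Gaussian uniform sources and the mixing matrix are illustrative assumptions:

```python
import numpy as np

def fastica(Z, n_iter=200, seed=0):
    """Deflationary FastICA on whitened mixtures Z (one row per mixture)."""
    n, N = Z.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for p in range(n):
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            u = w @ Z
            # w+ = E{x g(w^T x)} - E{g'(w^T x)} w, with g(u) = u^3
            w_new = (Z * u ** 3).mean(axis=1) - 3.0 * (u ** 2).mean() * w
            w_new -= W[:p].T @ (W[:p] @ w_new)   # Gram-Schmidt vs found rows
            w = w_new / np.linalg.norm(w_new)
        W[p] = w
    return W

rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))  # unit-variance sources
A = np.array([[2.0, 3.0], [2.0, 1.0]])
X = A @ S
# center and whiten first, as FastICA requires
Xc = X - X.mean(axis=1, keepdims=True)
lam, U = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
V = np.diag(lam ** -0.5) @ U.T
W = fastica(V @ Xc)
P = W @ V @ A     # should approach a signed permutation matrix
```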
5. Blind source separation
BSS methods solve the inverse problem in a way similar to ICA, but in BSS there is a focus on “on-line” reconstruction of the sources. Application example: solving the „cocktail party” problem [Cichocki, Amari (2001)].
Demixing in BSS
Demixing in BSS: an m × n separating matrix W(t) is updated so that the m-vector y(t) = W(t) x(t) becomes an estimate, y(t) ≈ s(t), of the original independent sources, up to scaling and permutation indeterminacy. If source scaling (S) and permutation (P) are resolved, then:
W A = S P I
BSS (1)
The Kullback–Leibler divergence measures the dependency among output signals (mutual information):
D(y || {y_k}; W) = ∫ p(y) log[ p(y) / Π_{k=1}^{K} p(y_k) ] dy
D(y || {y_k}; W) = −H(y; W) + ∑_{k=1}^{K} H(y_k); D(y || {y_k}; W) ≥ 0
H(y; W) – the average information of the joint output y. The H(y_k) are entropies of the marginal distributions – their sum is constant – only the information related to the joint output distribution, H(y; W), can be optimized here.
The BSS optimization rule: arg min_W D(y || {y_k}; W)
BSS (2)
The adaptive separation rule, generated according to the natural gradient, is:
W(t+1) = W(t) + η(t) { I − f[y(t)] · g[yᵀ(t)] } · W(t)
f(y) = [f(y1), …, f(yn)]ᵀ and g(yᵀ) = [g(y1), …, g(yn)] are vectors of non-linear activation functions, which approximate higher-order moments of the signals.
If the source has a negative normalized kurtosis value (i.e. it is a sub-Gaussian signal), we choose: f(y_i) = y_i³, g(y_j) = y_j. For a super-Gaussian source with positive kurtosis, we choose: f(y_i) = tanh(α y_i), g(y_j) = y_j.
Example Sound separation: one unknown source, one mixture and one separated signal are shown:
Example Four unknown sources:
Four mixtures with added noise:
Separated signals:
Measuring the separation quality
(A) If the source signals are known
For every pair (output Y_i, source S_j), with amplitudes scaled to the interval, compute the MSE[i, j] coefficient – the error of approximating S_j by Y_i, i = 1, …, n; j = 1, …, m:
MSE[i, j] = (1/N) ∑_{k=0}^{N−1} [S_j(k) − Y_i(k)]² = E{(S_j − Y_i)²}
where N is the number of samples. A matrix P = [p_{i,j}] is created, with p_{i,j} = 1/SNR[i, j].
Every row i of P is scaled, P̃ = Norm(P), such that ∀i (max_j (p̃_{i,j}) = 1); every column j is scaled, P̂ = NormCol(P), such that ∀j (max_i (p̂_{i,j}) = 1).
The error index:
EI(P) = (1/m) ( ∑_i ∑_j |p̃_{i,j}| − n ) + (1/n) ( ∑_j ∑_i |p̂_{i,j}| − m )
Measuring the quality (2)
(B) If the mixing matrix A is known
The error index for the set of separated sources is given as:
EI(P) = (1/m) ( ∑_i ∑_j |p̃_{i,j}| − n ) + (1/n) ( ∑_j ∑_i |p̂_{i,j}| − m )
The entries p_{i,j} of the matrix P = W A are normalized along rows (i = 1, …, n) (for the first expression) or columns (j = 1, …, m), such that ∀i (max_j p̃_{i,j} = 1) or ∀j (max_i p̂_{i,j} = 1).
Ideal case of perfect separation:
• P becomes a permutation matrix.
• Only one element in each row and column equals unity, and all other elements are zero.
• Then the minimum EI = 0.
Measuring the quality (3)
(C) If both the sources and the mixing matrix are unknown
The normalised mutual correlation coefficients
r_{i,j} = ⟨f(y_i) · g(y_j)⟩ / ( |f(y_i)| |g(y_j)| )
are computed for every pair of output signals y_i and y_j, giving the square matrix P̃ = [r_{i,j}]. The error index for the set of separated sources is computed as:
EI(P̃) = (1/n) ( ∑_i ∑_j |r_{i,j}| − n )
EIASR Pattern classification Lecture 3
1. Numeric classifiers
Optimization criteria for classifiers:
1. Minimum computational complexity → linear discriminant classifiers.
2. Minimum of misclassification probability → Bayes classifier.
3. Ability to automatically learn a decision rule → multilayer perceptron.
4. Ability to generalise knowledge to unavailable samples → “Support Vector Machine”.
5. Minimization of general risk → ???
Decision theory approach
A. Make assumptions about the analysed domain – for example, learn the prior pdf-s.
B. Define the costs and risk of misclassification (decision).
C. Define a decision rule which minimises this risk.
Stochastic classifier – the analysed domain and the decision rule are expressed in terms of stochastic variables. Let us observe:
• Each decision induces individual costs.
• After many decisions the average (expected) costs can be estimated – the risk of misclassification.
• The calculation of average values requires the availability of full statistical information (pdf-s).
Risk minimization
(A) Assumptions about the analysed domain:
• Pdf of observation c given class k: p(c | k).
• Prior probabilities { p_k (k = 1, …, K) } of patterns from the classes.
(B) Calculation of the risk V(kc) for a pattern kc from class k:
1. Costs of misclassification: { r_ik = r(i | k), (i, k = 1, 2, …, K) }.
2. Obtain the probability of a pattern from class k: p(kc | k) p_k.
3. The risk of pattern kc: V(kc) = ∑_k ∑_i ( p_k p(kc | k) r_ik ).
(C) Decision rule:
1. Minimise the risk V. If a binary cost function is used,
r_kk = 0, r_ik = 1 for i ≠ k, i, k = 1, …, K,
then we need to maximize p(k | kc).
2. Select the class k with the highest p(k | c): i = arg max_k p(k | c).
2. Potential function-based classifier
• Potential functions as class distributions in feature space;
• all functions from the same parametric family of functions.
Linear potential function family:
d_i(c, a) = { a_iᵀ φ(c) | a_i ∈ R^{n+1} }, (i = 1, …, m)
φ(c) = (1, c1, c2, …, cn)ᵀ
Linear potential function-based classifier, two-class case. Get the potentials:
d1(c, a⁽¹⁾) = a0⁽¹⁾ + ∑_{i=1}^{n} a_i⁽¹⁾ · c_i
d2(c, a⁽²⁾) = a0⁽²⁾ + ∑_{i=1}^{n} a_i⁽²⁾ · c_i
Decision rule: d(c) = d1(c, a⁽¹⁾) − d2(c, a⁽²⁾); if d(c) ≥ 0 then select class Ω1, else class Ω2.
Example Binary classifier for a 2-D feature vector – the linear decision rule corresponds to two half-planes separated by the line: d1(c) = d2(c).
Learning the potential function-based classifier
Linear classifier for K classes:
ς(c) = arg max_k d_k(c, a⁽ᵏ⁾) = arg max_k ( a0⁽ᵏ⁾ + ∑_{i=1}^{n} a_i⁽ᵏ⁾ · c_i )
Learning:
1. Let C be a matrix of size N × 3, built from N samples, where N_k samples are from class k.
2. For every class Ω_k define the equation system: ε + D⁽ᵏ⁾ = C a⁽ᵏ⁾, where D⁽ᵏ⁾ = [d⁽ᵏ⁾1, d⁽ᵏ⁾2, …, d⁽ᵏ⁾N]ᵀ and d_k(c, a⁽ᵏ⁾) = 1 if c ∈ Ω_k, or d_k(c, a⁽ᵏ⁾) = −1 if c ∉ Ω_k; and ε = [ε1, ε2, …, εN]ᵀ is an error vector.
(continued on the next page)
Learning the classifier (2)
3. The „least square error” optimization problem is:
a_o = arg min_a ∑_{i=1}^{N} ε_i²(a)
The solution leads to a system: X⁽ᵏ⁾ = B⁽ᵏ⁾ a_o⁽ᵏ⁾, where X⁽ᵏ⁾ = C⁽ᵏ⁾ᵀ D⁽ᵏ⁾ and B⁽ᵏ⁾ = C⁽ᵏ⁾ᵀ C⁽ᵏ⁾.
This can be solved directly as: a_o⁽ᵏ⁾ = (C⁽ᵏ⁾ᵀ C⁽ᵏ⁾)⁻¹ C⁽ᵏ⁾ᵀ D⁽ᵏ⁾.
3. Bayes classifier
Stochastic distributions are needed:
• prior class pdf-s p(Ω_k), k = 1, …, K,
• conditional pdf-s p(c | Ω_k).
Decision rule:
ς(c) = arg max_λ p(Ω_λ | c) = arg max_λ p(Ω_λ) p(c | Ω_λ).
Learning the probability distributions:
1. p(Ω_k) are estimated as relative frequencies of the classes in the learning sample set;
2. p(c | Ω_k): non-parametric, e.g. histograms; or parametric – assume a density function family and estimate its parameters from the learning samples.
Learning a Bayes classifier
Non-parametric pdf: e.g. a normalized histogram.
Parametric density – the ML („maximum likelihood”) estimator: for Ω_k specify the parameter set θ⁽ᵏ⁾ which maximizes the probability of the given observations:
max_{θ⁽ᵏ⁾∈Θ} ∑_{c_j∈C⁽ᵏ⁾} P(c_j | θ⁽ᵏ⁾)
Solve:
∂/∂θ_i [ log ∑_{c_j∈C⁽ᵏ⁾} P(c_j | θ⁽ᵏ⁾) ] = 0, θ_i ∈ θ⁽ᵏ⁾
Minimum-distance classifier
This is a specific form of the Bayes classifier for Gaussian pdf-s, if it uses the Euclidean metric. Select the class of minimum distance: d(Ω_k, c) = ln p(Ω_k | c).
The Bayes condition, p(Ω1) p(c | Ω1) > p(Ω2) p(c | Ω2), is equivalent to:
ln p(Ω1) − ½ (c − µ1)ᵀ Σ1⁻¹ (c − µ1) > ln p(Ω2) − ½ (c − µ2)ᵀ Σ2⁻¹ (c − µ2)
With Σ1 = Σ2 = I and the maximum likelihood rule, p(Ω1) = p(Ω2), it simplifies to: |c − µ1|² < |c − µ2|²
[…]
The separating hyperplane is a weighted scalar product of the support vectors (those samples with ϑ_j > 0):
H*: d_ã(c) = ∑_{j=1}^{N} ϑ_j y_j (jcᵀ c) + a0 = 0
3. Pattern classification
94
Linear SVM under noise
Linear separation under noisy conditions: add a “penalty” component to the SVM’s goal function, covering wrongly separated samples.
Example. Error distances ε5 and ε6 (for samples of class +1) and ε7 and ε8 (for samples of class −1).
The dual problem
The modified goal function:
min_{ã∈R^{n+1}} f(a) = min_{ã∈R^{n+1}} ( ½ |a|² + C ∑_{j=1}^{N} ε_j )
subject to y_j (jcᵀ a + a0) ≥ 1 − ε_j, ∀ jc ∈ ω, ε_j ≥ 0
C is a constant controlling the influence of the noise component on the goal function.
The dual problem: max_{ϑ≥0} L_D(ϑ) = max_{ϑ≥0} [ inf_ã L(ã, ϑ) ]
with
L_D(ϑ) = ∑_{j=1}^{N} ϑ_j − ½ ∑_{j=1}^{N} ∑_{k=1}^{N} ϑ_j ϑ_k y_j y_k (jcᵀ kc) = iᵀ ϑ − ½ ϑᵀ Q ϑ
where Q_jk = y_j y_k jcᵀ kc; i = [1, 1, …, 1]ᵀ
subject to 0 ≤ ϑ_j ≤ C; 0 = ∑_{j=1}^{N} ϑ_j y_j = yᵀ ϑ
Solving the SVM problem
Both the primary and the dual form of the SVM optimization problem are instances of the general quadratic programming problem (for example, see the quadprog function in MATLAB). There exist traditional methods, like the Newton method or the active set method, for solving the quadratic programming problem, but with many hundreds or thousands of training samples traditional methods cannot be applied directly. An efficient learning algorithm that solves the dual problem for SVM is “Sequential Minimal Optimization” (SMO) (J.C. Platt).
Kernel SVM
Linear separators in a high-dimensional feature space F(c), obtained by replacing c and jc in the H* equation with F(c) and F(jc):
H*: d_ã(c) = ∑_{j=1}^{N} ϑ_j y_j ( F(c)ᵀ F(jc) ) + a0 = 0
Example:
(a) The true decision boundary is: x1² + x2² ≤ 1.
(b) After mapping into a three-dimensional feature space: (x1², x2², √2 x1 x2).
6. MLP
McCulloch-Pitts model of a single neuron:
Inputs: (x1, x2, …, xn)
Input function: y = ∑_{i=1}^{n} (w_i · x_i) − w0 · 1, where w0 is the bias weight
Activation function: z = θ(y)
Output: z
Single feed-forward layer In ANN (artificial neural networks) the neurons are organised into layers. In a feed-forward (perceptron) network the input signals are sent from the input to the output via excitatory links.
y = W · x,  z = θ(y)
Activation functions
Single step function: z = 1, if y_i > y_k, where y_k is a fixed threshold.
Sigmoid function: θ(y) = 1 / (1 + exp(−βy))
Example. Activation functions: (left) step function, (right) sigmoid function.
Sigmoid function
For z = θ(y) = 1 / (1 + exp(−y)) the derivative is:

dz/dy = z (1 − z)

Example. Sigmoid function with different values of the parameter β.
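The identity dz/dy = z(1 − z) can be checked numerically. A minimal sketch (the evaluation point 0.3 is arbitrary):

```python
import math

def sigmoid(y, beta=1.0):
    # z = 1 / (1 + exp(-beta * y))
    return 1.0 / (1.0 + math.exp(-beta * y))

def sigmoid_derivative(y):
    # For beta = 1 the slide's identity dz/dy = z * (1 - z) holds
    z = sigmoid(y)
    return z * (1.0 - z)

# Central-difference check of dz/dy = z(1 - z) at y = 0.3
h = 1e-6
numeric = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(0.3)) < 1e-8)  # True
```

This cheap derivative is why the sigmoid is convenient in backpropagation: the gradient is obtained from the already-computed activation z.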
Multi-layer perceptron
MLP: some number (at least two) of feed-forward layers arranged in a cascade. The transformation function of each layer l (= 1, 2, ...):

z^(l) = θ(W^(l) z^(l−1) − w0^(l)) , where z^(0) = x.

Example. An MLP with 3 layers (2 hidden layers).
Supervised learning of MLP
The Widrow-Hoff rule (delta rule): the weight of the incoming link between the i-th neuron and the j-th input is strengthened in proportion to the difference between the required and the real activation:

Δw_ij = μ (s_i − z_i) x_j

where s_i is the required activation of the i-th neuron and μ is a learning coefficient.
The error back-propagation rule in MLP is an extension of the delta rule. An update of the weights in layer l (l = L, L−1, ..., 1):

ΔW^(l) = μ · d^(l) · z^(l−1)T , z^(0) = x

where d^(l) is the correction vector for the l-th layer. A single learning iteration starts from the last layer (l = L) and proceeds backwards, layer by layer, until l = 1.
Backpropagation learning
The correction values are set as follows:
For the last layer the required output vector s is available, hence:

d^(L) = (s − z^(L)) .* ∂z/∂y

For the hidden layers the correction vector is projected back; for l = L−1, ..., 1:

d^(l) = [(W^(l+1))^T d^(l+1)] .* ∂z/∂y

Here ".*" denotes element-by-element multiplication of two vectors.
For a sigmoid activation function the corrections are:

d^(L) = (s − z^(L)) .* [1 − z^(L)] .* z^(L)
d^(l) = [(W^(l+1))^T d^(l+1)] .* [1 − z^(l)] .* z^(l)
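The backpropagation rule with sigmoid corrections can be sketched in a few lines. A minimal example, assuming one hidden layer, biases folded into the weight matrices via a constant input 1, and illustrative weights and data; the error for a single training sample should shrink over iterations:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer(W, x):
    # One feed-forward layer: z_i = sigmoid(sum_j W[i][j] * x_j)
    # (the bias weight is the last column of W, fed with a constant 1)
    return [sigmoid(sum(w * xj for w, xj in zip(row, x + [1.0]))) for row in W]

def backprop_step(W1, W2, x, s, mu=0.5):
    z1 = layer(W1, x)    # hidden layer output z^(1)
    z2 = layer(W2, z1)   # network output z^(L)
    # d^(L) = (s - z^(L)) .* z^(L) .* (1 - z^(L))
    d2 = [(si - zi) * zi * (1.0 - zi) for si, zi in zip(s, z2)]
    # d^(1) = [(W^(2))^T d^(2)] .* z^(1) .* (1 - z^(1))  (bias column dropped)
    back = [sum(W2[i][j] * d2[i] for i in range(len(d2))) for j in range(len(z1))]
    d1 = [b * z * (1.0 - z) for b, z in zip(back, z1)]
    # Delta W^(l) = mu * d^(l) * z^(l-1)^T
    W2 = [[w + mu * d2[i] * zin for w, zin in zip(W2[i], z1 + [1.0])]
          for i in range(len(W2))]
    W1 = [[w + mu * d1[i] * xin for w, xin in zip(W1[i], x + [1.0])]
          for i in range(len(W1))]
    return W1, W2

W1 = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]   # 2 hidden neurons, 2 inputs + bias
W2 = [[-0.6, 0.7, 0.05]]                    # 1 output neuron
x, s = [1.0, 0.0], [1.0]
err0 = abs(s[0] - layer(W2, layer(W1, x))[0])
for _ in range(200):
    W1, W2 = backprop_step(W1, W2, x, s)
err1 = abs(s[0] - layer(W2, layer(W1, x))[0])
print(err1 < err0)  # True: the output error decreased
```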
7. Pattern clustering
Unsupervised learning procedures use unlabeled samples.
"k-means" clustering
1. Init k cluster centers: μ(i).
2. FOR each observation c_j DO: associate it with the closest cluster center, i.e. assign ξ(c_j) ← i, where d(μ(i), c_j) = min_k d(μ(k), c_j), for some distance function d (e.g. Euclidean distance).
3. FOR each cluster i (with n(i) assigned observations) DO: estimate the mean of the observations assigned to this cluster:

μ(i) = (1/n(i)) Σ_{j: ξ(c_j)=i} c_j

4. REPEAT steps 2 and 3 a given number of times.
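The four steps above can be sketched directly. A minimal example; the init strategy (k evenly spaced observations) and the toy data are assumptions, since the slide leaves them open:

```python
def kmeans(samples, k, iters=20):
    # Step 1: init the k cluster centers; here simply k evenly spaced
    # observations (an assumption, the slide leaves the strategy open)
    centers = [samples[i * len(samples) // k] for i in range(k)]
    assign = [0] * len(samples)
    for _ in range(iters):
        # Step 2: assign every observation to the closest center (Euclidean)
        for j, c in enumerate(samples):
            assign[j] = min(range(k),
                            key=lambda i: sum((a - b) ** 2
                                              for a, b in zip(c, centers[i])))
        # Step 3: re-estimate each center as the mean of its members
        for i in range(k):
            members = [samples[j] for j in range(len(samples)) if assign[j] == i]
            if members:
                centers[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, assign

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # cluster near (0, 0)
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # cluster near (5, 5)
centers, assign = kmeans(data, 2)
print(sorted(round(c[0]) for c in centers))   # -> [0, 5]
```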
EM clustering
The expectation-maximization (EM) algorithm is a general estimation technique for dealing with missing data. Applied to clustering, the EM algorithm gives a maximum-likelihood estimate of a Gaussian mixture:

p(c | θ) = Σ_{i=1..K} α_i N(c | μ_i, Λ_i)

where N(.) denotes a Gaussian pdf, and θ is the parameter set that needs to be estimated: θ = {α_i, μ_i, Λ_i | i = 1, 2, ..., K}.
E-step: FOR each sample c_j and each class i DO:

P_j^i = p(c_j | ξ(c_j)=i, θ) α_i / Σ_i p(c_j | ξ(c_j)=i, θ) α_i

M-step: FOR each class i DO:

μ_i = Σ_j c_j P_j^i / Σ_j P_j^i
Λ_i = Σ_j P_j^i (c_j − μ_i)(c_j − μ_i)^T / Σ_j P_j^i
α_i = Σ_j P_j^i / Σ_i Σ_j P_j^i
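The E- and M-steps above can be sketched for the simplest case. A minimal 1-D example with K = 2, where the covariance Λ_i reduces to a scalar variance; the data, the init, and the small variance floor are illustrative assumptions:

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian pdf N(x | mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(samples, iters=50):
    # Two-component 1-D Gaussian mixture; theta = {alpha_i, mu_i, var_i}
    alpha = [0.5, 0.5]
    mu = [min(samples), max(samples)]   # illustrative init
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: posterior P_j^i of component i given sample c_j
        P = []
        for c in samples:
            w = [alpha[i] * gauss(c, mu[i], var[i]) for i in range(2)]
            tot = sum(w)
            P.append([wi / tot for wi in w])
        # M-step: re-estimate mu_i, var_i, alpha_i from the posteriors
        for i in range(2):
            wsum = sum(P[j][i] for j in range(len(samples)))
            mu[i] = sum(P[j][i] * samples[j] for j in range(len(samples))) / wsum
            var[i] = sum(P[j][i] * (samples[j] - mu[i]) ** 2
                         for j in range(len(samples))) / wsum + 1e-6
            alpha[i] = wsum / len(samples)
    return alpha, mu, var

data = [0.0, 0.1, -0.1, 0.05, 4.0, 4.1, 3.9, 4.05]
alpha, mu, var = em_gmm(data)
print(sorted(round(m) for m in mu))  # -> [0, 4]
```

Unlike k-means, each sample contributes to every component, weighted by the posterior P_j^i (soft assignment).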
8. Ensemble learning
Many classifiers or experts are generated and combined to solve a particular classification or decision problem:
1. Boosting – combining weak binary classifiers into a strong one. Each iteration of boosting creates three weak classifiers: the first classifier C1 is trained on a random subset of the training data; C2 is trained on a set of which only half is correctly classified by C1 and the other half is misclassified; C3 is trained on instances on which C1 and C2 disagree. The 3 classifiers are combined through a three-way majority vote.
AdaBoost.M1 (adaptive boosting) – an extension to multi-class classification, the most popular ensemble classifier (Freund & Schapire, 1995).
2. Mixture of experts (Jacobs & Jordan, 1991-94)
AdaBoost
AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of T "weak" classifiers:

F(x) = C[f(x)] = C[ Σ_{t=1..T} w_t · h_t(x) ]

where h_t(x) is the classification ("hypothesis") of sample x given by the t-th weak classifier, w_t are weights and C is the decision rule of the "strong" classifier. The set of "weak" classifiers, H = {h_i(x)}, is potentially infinite.
Training data samples are drawn from a distribution that is iteratively updated, such that subsequent classifiers focus on increasingly difficult instances: previously misclassified instances are more likely to appear in the next bootstrap sample. The classifiers are combined through weighted majority voting.
AdaBoost.M1 (1)
INPUT: classes Ω = {Ω1, Ω2, ..., ΩL}; training samples S = {x1, x2, ..., xn} with labels y_i ∈ Ω, i = 1, 2, ..., n; number of weak classifiers T.
TRAINING
INIT distribution: D_1(i) = 1/n, i = 1, 2, ..., n
FOR t = 1, 2, ..., T DO
1. Select the current training subset S_t according to D_t.
2. WeakLearn: train on S_t and return the "weak" classifier h_t: X → Ω with smallest error:

ε_t = Σ_{i=1..n} I(h_t(x_i) ≠ y_i) · D_t(i)

IF ε_t > 1/2 THEN STOP
3. Calculate the normalized error: β_t = ε_t / (1 − ε_t)
4. Update the distribution D_t:

D_{t+1}(i) = D_t(i) · β_t / Z_t , if h_t(x_i) = y_i
D_{t+1}(i) = D_t(i) / Z_t , if h_t(x_i) ≠ y_i
AdaBoost.M1 (2)
Remark: Z_t is a normalization coefficient that makes D_{t+1} a proper distribution.
CLASSIFICATION by weighted majority voting. Given a new unlabeled sample s:
1. Obtain the total vote received by each class:

v_j = Σ_{t=1..T} I(h_t(s) = Ω_j) · log(1/β_t) , j = 1, ..., L

i.e. the "weak" classifier weights are log(1/β_t).
2. Rule C: select the class Ω_j which receives the highest total vote v_j.
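The boosting loop and the weighted vote above can be sketched as follows. A minimal example under assumptions: the WeakLearn step is simplified away (the weak classifiers are pre-built 1-D threshold stumps supplied in order), and a small floor on ε_t avoids a zero β_t for a perfect classifier:

```python
import math

def adaboost_m1(samples, labels, weak_learners):
    # Each weak learner is a ready classifier h: x -> label; only the
    # boosting loop from the slide is implemented here.
    n = len(samples)
    D = [1.0 / n] * n                     # D_1(i) = 1/n
    chosen, betas = [], []
    for h in weak_learners:
        eps = sum(D[i] for i in range(n) if h(samples[i]) != labels[i])
        if eps > 0.5:                     # slide: STOP when eps_t > 1/2
            break
        eps = max(eps, 1e-10)             # avoid beta = 0 for a perfect h
        beta = eps / (1.0 - eps)          # normalized error beta_t
        # D_{t+1}(i) = D_t(i)*beta_t / Z_t if correct, else D_t(i) / Z_t
        D = [D[i] * beta if h(samples[i]) == labels[i] else D[i]
             for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        chosen.append(h)
        betas.append(beta)

    def classify(s):
        # Weighted majority vote with weights log(1 / beta_t)
        votes = {}
        for h, b in zip(chosen, betas):
            votes[h(s)] = votes.get(h(s), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)
    return classify

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
stumps = [lambda x: 0 if x < 3 else 1,    # perfect split
          lambda x: 0 if x < 2 else 1,    # errs on x = 2
          lambda x: 0 if x < 4 else 1]    # errs on x = 3
clf = adaboost_m1(xs, ys, stumps)
print([clf(x) for x in xs])  # -> [0, 0, 0, 1, 1, 1]
```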
Mixture of Experts (1)
Several experts (classifiers) are learned. The i-th expert produces its output:

o_i(x) = f(W_i x)

where W_i is a weight matrix and f(.) is a fixed nonlinearity. The outputs of the experts are combined through a (generalized) linear rule:

y = o(x) = Σ_{k=1..N} g(x, v_k) o_k(x)

The weights of this combination are determined by a gating network:

g(x, v_i) = exp(ξ_i) / Σ_{k=1..N} exp(ξ_k) ; ξ_i = v_i^T x
Mixture of Experts (2)
The classification step in the "mixture of experts" approach can be expressed in stochastic terms as the maximization of the posterior probability:

P(y | x, Ψ) = Σ_{k=1..N} g(x, v_k) P(y | x, W_k)

where Ψ is the set of all parameters (all expert and gating weights). The parameters are typically trained using the expectation-maximization (EM) algorithm. Let the training set be given as {(x_t, y_t) | t = 1, 2, ..., T}.
In the E-step, in the s-th epoch, the following posterior probabilities are computed for all the training data:

p_i(t) = g(x_t, v_i^(s)) P(y_t | x_t, W_i^(s)) / Σ_{k=1..N} g(x_t, v_k^(s)) P(y_t | x_t, W_k^(s)) , i = 1, 2, ..., N
Mixture of Experts (3)
In the M-step the following maximization problems are solved:

W_i^(s+1) = argmax_{W_i} Σ_{t=1..T} p_i(t) log P(y_t | x_t, W_i)

V^(s+1) = argmax_V Σ_{t=1..T} Σ_{k=1..N} p_k(t) log g(x_t, v_k)

where V is the set of all the parameters in the gating network.
EIASR Image pre-processing Lecture 4
1. Scene acquisition
Perspective projection
A simple "pinhole" model of the camera: p' = (−x_p, −y_p) is the reversed image point, P = (x_c, y_c, z_c) is the scene point (in camera coordinates), f is the focal length.

x_p = s_x · f · x_c / z_c , y_p = s_y · f · y_c / z_c

where s_x, s_y denote the scaling of scene-to-image units.
Perspective vs parallel projection
Parallel projection is obtained if f tends to infinity:

x_p = lim_{f→∞} s_x · f · x_c / (f + z_c) = s_x · x_c , y_p = lim_{f→∞} s_y · f · y_c / (f + z_c) = s_y · y_c

Perspective projection vs. parallel projection.
Vector operations
3-D vectors: u = [u1, u2, u3]^T , v = [v1, v2, v3]^T
Inner product of two vectors: <u, v> = u^T v = u1·v1 + u2·v2 + u3·v3
Norm: ||u|| = √(u1² + u2² + u3²)
Angle: cos(θ) = <u, v> / (||u|| ||v||)
Cross product of two vectors: u × v = Û v , where Û is the skew-symmetric matrix

Û = [ 0, −u3, u2 ; u3, 0, −u1 ; −u2, u1, 0 ]
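The matrix form of the cross product can be verified directly. A minimal sketch (the unit vectors are illustrative): building Û from u and multiplying by v reproduces x × y = z.

```python
def skew(u):
    # The skew-symmetric matrix U^ built from u, so that u x v = U^ v
    u1, u2, u3 = u
    return [[0.0, -u3, u2],
            [u3, 0.0, -u1],
            [-u2, u1, 0.0]]

def matvec(M, v):
    # Plain matrix-vector product
    return [sum(m * x for m, x in zip(row, v)) for row in M]

u, v = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(matvec(skew(u), v))  # -> [0.0, 0.0, 1.0]  (x cross y = z)
```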
Camera parameters
A transformation of a scene point P onto a pixel p depends on camera parameters of two types:
1. Extrinsic parameters: rotation R, translation T
2. Intrinsic parameters: projection K_f, camera-to-image K_s
Extrinsic camera transformation
Let P_w = [x_w, y_w, z_w] be the world coordinates and P_c = [x_c, y_c, z_c] the camera coordinates of some scene point P. Then: P_c = R P_w + T
Intrinsic camera transformation

p = [x_p, y_p, 1]^T = K_s · y , where y = (1/z_c) K_f P_c

K_f = [ f, 0, 0 ; 0, f, 0 ; 0, 0, 1 ] , K_s = [ s_x, s_θ, o_x ; 0, s_y, o_y ; 0, 0, 1 ]
Homogeneous coordinates
A 3-D point P ↔ 4-D homogeneous coordinates P_h:

P = [X, Y, Z]^T ↔ P_h = [kX, kY, kZ, k]^T , k ≠ 0.

Translation of a point by vector [t_x, t_y, t_z]^T:

T = [ 1, 0, 0, t_x ; 0, 1, 0, t_y ; 0, 0, 1, t_z ; 0, 0, 0, 1 ]

Scaling of coordinates by [s_x, s_y, s_z]^T:

S = [ s_x, 0, 0, 0 ; 0, s_y, 0, 0 ; 0, 0, s_z, 0 ; 0, 0, 0, 1 ]

Rotations of the coordinate system around axis Z by angle θ, around axis X by angle α, and around axis Y by angle β:

R_θ = [ cos θ, −sin θ, 0, 0 ; sin θ, cos θ, 0, 0 ; 0, 0, 1, 0 ; 0, 0, 0, 1 ]
R_α = [ 1, 0, 0, 0 ; 0, cos α, −sin α, 0 ; 0, sin α, cos α, 0 ; 0, 0, 0, 1 ]
R_β = [ cos β, 0, sin β, 0 ; 0, 1, 0, 0 ; −sin β, 0, cos β, 0 ; 0, 0, 0, 1 ]
Homogeneous coordinates (2)
Perspective transformation of homogeneous coordinates:

Ψ_f = [ 1, 0, 0, 0 ; 0, 1, 0, 0 ; 0, 0, 1, 0 ; 0, 0, 1/f, 0 ]

The reverse transformation is not unique! Let us observe:

p_h = [x_p, y_p, f, 1]^T = norm(Ψ_f · P_h) = norm([x_c, y_c, z_c, z_c/f]^T) = [f·x_c/z_c, f·y_c/z_c, f, 1]^T

The transformation from camera to pixel units: the operations of pixel scaling, skewing and shift of the optical center can be represented by a single matrix operating on homogeneous coordinates:

Ψ_s = [ s_x, s_θ, 0, o_x ; 0, s_y, 0, o_y ; 0, 0, 1, 0 ; 0, 0, 0, 1 ]
2. Camera calibration
Example. Extrinsic parameters in R, T:
- α : rotation of axes around global axis OZ;
- β : rotation of axes around global axis OY;
- θ : rotation of axes around global axis OX;
- G : translation of the global system to the camera fixture system;
- C : translation of the camera fixture system to the camera system.
Intrinsic parameters in Ψ_f, Ψ_s:
- f : the camera's focal length;
- s_x, s_y : the image-to-pixel unit scaling;
- o_x, o_y : the camera origin to image origin shift.
Example (2)
The transformation of a point P_w, given in world coordinates, into a pixel p:

p = (Ψ_s · Ψ_f · T_{−C} · R_θ · R_β · R_α · T_{−G}) · P_w = A · P_w

Remark: in this example we have defined a translation of the coordinate system, while in the above equation a point is moved. A transformation of a point is a dual operation to the transformation of a coordinate system.
Auto-calibration of the camera

p = [p1, p2, p3, p4]^T = [kx, ky, kz, k]^T = A · P_w

The goal is to estimate the combined matrix A for the scene-to-image transformation, based on some number of pixel-to-scene point correspondences.
Auto-calibration
Matrix A consists of 12 parameters, but only 8 are independent (the 3rd row linearly depends on the 4th row, by the factor f).

p = [kx, ky, kz, k]^T = [ a11, a12, a13, a14 ; a21, a22, a23, a24 ; a31, a32, a33, a34 ; a41, a42, a43, a44 ] · [X, Y, Z, 1]^T = A · P_w

As p1 = x·p4 and p2 = y·p4, we get a system of 3 linear equations:

x·p4 = a11 X + a12 Y + a13 Z + a14
y·p4 = a21 X + a22 Y + a23 Z + a24
p4 = a41 X + a42 Y + a43 Z + a44

(x, y) and (X, Y, Z) represent measurable data while p4 is not known. It needs to be eliminated from the system.
Auto-calibration (2)
After elimination of p4 from the 1st and 2nd row we get 2 equations with 12 unknowns:

a11 X + a12 Y + a13 Z − a41 xX − a42 xY − a43 xZ − a44 x + a14 = 0
a21 X + a22 Y + a23 Z − a41 yX − a42 yY − a43 yZ − a44 y + a24 = 0

Auto-calibration algorithm
1. Detect m ≥ 6 image points {p_i = (x_i, y_i), i = 1, 2, ..., m} which are projections of known 3-D points (in world coordinates) P_i = (X_i, Y_i, Z_i), i = 1, 2, ..., m.
2. For every pair of points (p_i, P_i) form two equations of the above form – in total M ≥ 12 equations with 12 unknowns a11 ... a44.
3. Solve the equation system M by an LSE (least squares error) approach.
3. Color spaces
Color spaces that are based on psychological observations of human perception:
• HLS (Hue, Lightness, Saturation),
• HSV (Hue, Saturation, Value).
Technical systems represent color as a mixture of three basic vectors in the color space:
• A typical color system for computer monitors uses three basic colors: red, green, blue (the RGB image, where R, G, B are coefficients from the interval [0, 1]). An additive, positive-oriented system: "white" is the sum of its components.
• Typical color printing on printers uses the CMY (Cyan, Magenta, Yellow) system. A subtractive ("negative") model: "black" is the sum of the components: [C, M, Y] = [1, 1, 1] − [R, G, B].
Color spaces (2)
• Typical analog television broadcast: YIQ (Y – luminance; I, Q – chromatic components). The specific sensitivity of the human eye to green allows green to be represented mostly by Y.

[Y; I; Q] = [ 0.299, 0.587, 0.114 ; 0.60, −0.28, −0.32 ; 0.21, −0.52, 0.31 ] · [R; G; B]

• In digital media: YUV (or Y Cb Cr), where Cb, Cr are distances of the color from the "gray" color along the blue and red axes.

[Y; Cb; Cr] = [ 0.299, 0.587, 0.114 ; −0.1687, −0.3313, 0.5 ; 0.5, −0.4187, −0.0813 ] · [R; G; B]
[Y; U; V] = [ 0.299, 0.587, 0.114 ; −0.1471, −0.2889, 0.436 ; 0.6149, −0.515, −0.100 ] · [R; G; B]

Note: R, G, B are coefficients from [0, 1]. Y also takes values from [0, 1], but Cb, Cr (or U, V) take values from [−0.5, 0.5].
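The Y Cb Cr matrix above is easy to apply per pixel. A minimal sketch using the slide's coefficients, checking the stated ranges on pure white (Y = 1, Cb = Cr = 0):

```python
def rgb_to_ycbcr(r, g, b):
    # Matrix from the slide; r, g, b in [0, 1], y in [0, 1],
    # cb and cr in [-0.5, 0.5]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b
    return y, cb, cr

y, cb, cr = rgb_to_ycbcr(1.0, 1.0, 1.0)   # white
print(abs(y - 1.0) < 1e-9 and abs(cb) < 1e-9 and abs(cr) < 1e-9)  # True
```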
Colour calibration
Ideal colours (blue, green, red, yellow, magenta, white, dark), specific "skin" colours ("dark skin" and "light skin"), etc. are defined in the YUV space. They are applied to make a colour calibration of the image.
[Figure: the colour plane U-V for the mid-value Y = 0.5 (of 1) (or 128 of 255).]
Y-based color normalization
In theory a Y-based normalization of the color components U, V is not required. But in practice it can help to limit the variability of "dark" and "light" versions of the same color. Assume an 8-bit-per-color-per-pixel representation, i.e. values from [0, 255]:

Y_p' = 128 ; U_p' = U_p + κ_U (Y_p' − Y_p) ; V_p' = V_p + κ_V (Y_p' − Y_p)

E.g. for a skin color: κ_U = 1.135 and κ_V = 1/κ_U.
Example The detection of skin colour in the image:
4. Image enhancement
Low-contrast images can be enhanced by "histogram stretching" or "histogram equalization".
"Stretching" of histogram H
1. Find a = min(H, A) and b = max(H, A), such that A% of pixels have a lower value than a and A% of pixels have a higher value than b (e.g. A = 0.05).
2. Transform the pixel value w to a new digital value in the interval [0, 1, ..., Dm] as:

w' = 0, if w < a ; w' = round[ Dm (w − a) / (b − a) ], if a ≤ w ≤ b ; w' = Dm, if w > b.
Histogram equalization
The purpose of histogram equalization is to make every pixel value (nearly) equally probable:

a = f(w) = round[ (Dm / PixNum) Σ_{i=0..w} H(i) ]

where H(i) is the histogram value for pixel value i, PixNum is the total pixel number, [0, 1, ..., Dm] is the interval of digital levels of the variable w in the original image, and round[] specifies rounding to the nearest digital level.
Remark: perfectly flat histograms are seldom obtained when working with digital variables. Equalization usually leads to a reduction of the number of pixel value levels in the transformed image compared to the original one.
Example Image and its histogram before and after histogram equalization:
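The equalization mapping above amounts to a cumulative-histogram lookup table. A minimal sketch (8 grey levels and a tiny toy image, both illustrative), where a low-contrast image crowded into levels 2..4 gets spread over almost the full range:

```python
def equalize(pixels, levels=8):
    # a = round((Dm / PixNum) * sum_{i<=w} H(i)) with Dm = levels - 1
    H = [0] * levels
    for w in pixels:
        H[w] += 1
    n = len(pixels)
    Dm = levels - 1
    cum, lut = 0, []
    for i in range(levels):
        cum += H[i]
        lut.append(round(Dm * cum / n))   # lookup table f(w)
    return [lut[w] for w in pixels]

img = [2, 2, 3, 3, 3, 3, 4, 4]   # values crowded into 2..4
out = equalize(img)
print(min(out), max(out))  # -> 2 7
```

Note the remark from the slide in action: the output still uses only a few distinct levels, just spread further apart.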
5. Image binarization
Problem: how to set the threshold for image binarization, i.e. to separate the foreground object pixels from the background?
Solution idea: the image histogram is approximated by a weighted sum of two normal (Gaussian) distributions. The border between the distributions determines the threshold value:

p(l) = α N(l, m1, σ1) + (1 − α) N(l, m2, σ2)
The Otsu method
1. FOR all possible thresholds θ: obtain the between-class variance as:

σ_B²(θ) = P1(θ) (1 − P1(θ)) (m1(θ) − m2(θ))²

where P1(θ) = Σ_{l=1..θ} h(l) / Σ_{l=1..L} h(l).
2. Select θ such that σ_B²(θ) is of maximum value.
There exists only a finite number of threshold positions. Hence, the above algorithm always terminates.
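The exhaustive search over thresholds can be made efficient with running sums. A minimal sketch, assuming 0-based grey levels (the slide sums from l = 1) and an illustrative bimodal 8-level histogram:

```python
def otsu_threshold(hist):
    # Maximize sigma_B^2(t) = P1(t) * (1 - P1(t)) * (m1(t) - m2(t))^2
    total = sum(hist)
    grand = sum(l * h for l, h in enumerate(hist))   # sum of l * h(l)
    best_t, best_var = 0, -1.0
    w1 = m1sum = 0.0
    for t in range(len(hist) - 1):
        w1 += hist[t]          # pixels in class 1 (levels <= t)
        m1sum += t * hist[t]
        P1 = w1 / total
        if P1 == 0.0 or P1 == 1.0:
            continue
        m1 = m1sum / w1
        m2 = (grand - m1sum) / (total - w1)
        var = P1 * (1 - P1) * (m1 - m2) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t              # levels <= best_t form class 1

# Bimodal histogram: dark peak at level 1, bright peak at level 6
hist = [2, 10, 3, 0, 0, 3, 12, 2]
print(otsu_threshold(hist))  # -> 2 (between the two peaks)
```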
6. Pattern normalization An image object can appear of different size and orientation and nevertheless we need to recognize it. Example: different letters W, but the same class of letter.
Example: normalization of image size
Pattern normalization (1)
Pattern transformation in image space (translation, scaling, rotation, mirroring) uses functions of the geometric moments. A normalization step maps pattern f(x, y) ≥ 0 with global moments m_pq onto pattern h(x', y') ≥ 0 with global moments μ_pq:

m_pq = Σ_x Σ_y x^p y^q f(x, y) , μ_pq = Σ_x' Σ_y' x'^p y'^q h(x', y')

Step 1: shift to the mass centre (x_c, y_c):

x_c = m10/m00 , y_c = m01/m00 , x' = x − x_c , y' = y − y_c

and normalize the amplitude:

h(x', y') = f(x, y) / m00

Then μ00 = 1 and μ10 = μ01 = 0.
Pattern normalization (2)
Step 2: size normalization – scaling of the axes:

r = √(m20 + m02) , x' = x/r , y' = y/r , h(x', y') = r² f(x, y)

Then μ20 + μ02 = 1.
Step 3: rotation – to normalize the orientation of the pattern. If (m20 ≠ m02) then minimize

S(α) = Σ_{(x,y)∈f} [ (x − x_c) cos α − (y − y_c) sin α ]²

and get:

tan(2α) = 2 m11 / (m20 − m02)
Pattern normalization (3)
Step 3 (cont.): Among the 4 possible angles select α such that after the rotation

[x'; y'] = [ cos α, sin α ; −sin α, cos α ] · [x; y] , h(x', y') = f(x, y)

it holds that: μ11 = 0, μ20 < μ02 and μ21 > 0.
Step 4: mirroring with respect to the Y axis. Select β ∈ {+1, −1} such that after the transformation x' = βx, y' = y, h(x', y') = f(x, y), the resulting pattern's moment is μ12 > 0.
7. Image filters
Basic image filters are used to suppress:
• the high frequencies in the image, i.e. smoothing the image, or
• the low frequencies, i.e. detecting edges in the image.
Spatial filter: convolve the input image f(i, j) with some filter function h(i, j) (called the kernel):

g(i, j) = h(i, j) ∗ f(i, j)

Frequency filter: (1) transform the image into the frequency domain, (2) multiply the result with the frequency filter function and (3) re-transform the result into the spatial domain.
Discrete convolution is a "shift and multiply" operation; for a square kernel of size (M+1)×(M+1):

g(i, j) = Σ_{m=−M/2..M/2} Σ_{n=−M/2..M/2} h(m, n) · f(i−m, j−n)
Basic image filters
Basic image filters:
1. Mean filter – noise reduction (NR) using the mean of the neighbourhood
2. Median filter – NR using the median of the neighbourhood
3. Gaussian smoothing – NR with a Gaussian smoothing kernel
4. Various gradient-based edge detectors
5. Laplace filter – second-derivative-based edge detection
Mean filter: replace each pixel value in an image with the mean (average) value of its neighbours, including itself. For a 3×3 neighbourhood the kernel holds the value 1/9 at every position.
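The discrete convolution sum and the mean filter can be sketched together. A minimal example; the border handling (leaving border pixels untouched) and the tiny test image are assumptions, since the slide does not specify them:

```python
def convolve(image, kernel):
    # Discrete convolution g(i,j) = sum_m sum_n h(m,n) * f(i-m, j-n)
    # for a square kernel; border pixels are left untouched here.
    M = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(M, rows - M):
        for j in range(M, cols - M):
            out[i][j] = sum(kernel[m + M][n + M] * image[i - m][j - n]
                            for m in range(-M, M + 1)
                            for n in range(-M, M + 1))
    return out

# 3x3 mean filter: the kernel of 1/9 everywhere smooths a bright pixel
mean3 = [[1 / 9] * 3 for _ in range(3)]
img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
result = convolve(img, mean3)
print(abs(result[1][1] - 1.0) < 1e-12)  # center becomes the neighbourhood mean
```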
Median filter
Non-linear filter: the addition operation in the discrete convolution is replaced by some non-linear operator:

g(i, j) = O_{m,n} [ h(m, n) · f(i−m, j−n) ]

The median filter is a non-linear filter. It replaces the pixel value with the median of the neighbour values. The median is calculated by first sorting all the pixel values from the local neighbourhood in numerical order and then replacing the central pixel by the middle value.
Example:
Gaussian smoothing
The Gaussian smoothing operator uses a kernel that approximates a Gaussian function. The 2-D isotropic (circularly symmetric) Gaussian:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

Discrete Gaussian kernel: assume zero at distances of more than three standard deviations from the mean and truncate the kernel.
Example: discrete approximation of the Gaussian function with σ = 1.0.
The 2-D isotropic Gaussian is separable into x and y components, so a 1-D convolution can be applied to rows and columns.
Edge image detection
A pre-processing step in edge detection: a smoothing operation in order to remove noise (spike-like variations) from the image.
Example: desired edge (left), real edge (right).
Basic types of edge image detectors:
1. discrete image function gradients,
2. convolving the image with kernels,
3. using parametric edge models,
4. mixed approaches.
Discrete image gradients
The gradient of a 2-D continuous function:

∇f(x, y) = [ f_x(x, y) ; f_y(x, y) ] = [ ∂f(x, y)/∂x ; ∂f(x, y)/∂y ]

Discrete differences in the case of a 2-D image f^(i, j) that represents the continuous function f(x, y):

f^_x(i, j) = f^(i+1, j) − f^(i, j) , f^_y(i, j) = f^(i, j+1) − f^(i, j)

As a result of edge detection two output images are computed:
1. the magnitude (strength) s, or the "absolute" strength s' (for computational simplicity):

s = √(f_x² + f_y²) , s' = |f_x| + |f_y|

2. the direction of the normal vector to the edge: r = arctan(f_y / f_x)
Roberts cross
The "Roberts cross" is an example of a discrete gradient-based edge operator. Two discrete gradients along 45° and 135°:

k_x(i, j) = f(i, j) − f(i+1, j+1) , k_y(i, j) = f(i+1, j) − f(i, j+1)

Remark: they are equivalent to 2×2 convolution kernels.
Edge strength: s = √(k_x² + k_y²) or s' = |k_x| + |k_y|
Edge orientation (a correction by −3π/4 is needed): r = arctan(k_y / k_x) − 3π/4
Characteristics of the "Roberts cross" operator:
• the same image format as the input image, e.g. maximum strength 255 for 8-bit images,
• simple implementation (only two difference values per pixel),
• sensitive to noise.
Differences of the central element
Two differences that approximate the two gradients of the image function along the main axes:

f^_x(i, j) = f^(i+1, j) − f^(i−1, j) , f^_y(i, j) = f^(i, j+1) − f^(i, j−1)

Remark: equivalent convolution kernels exist.
Characteristics:
• the same input-output format (e.g. maximum strength 255 for 1 byte/pixel),
• simple implementation,
• still sensitive to noise.
Convolution-based edge detectors
Sobel operator (kernels G_x, G_y):

G_x = [ −1, 0, 1 ; −2, 0, 2 ; −1, 0, 1 ] , G_y = [ −1, −2, −1 ; 0, 0, 0 ; 1, 2, 1 ]

Prewitt operator:

G_x = [ −1, 0, 1 ; −1, 0, 1 ; −1, 0, 1 ] , G_y = [ −1, −1, −1 ; 0, 0, 0 ; 1, 1, 1 ]

Characteristics:
• Sobel operator: a different output image range than the input, e.g. maximum strength 2040 for an 8-bit input image; simple implementation; good results.
• Prewitt operator: a different output image range than the input, e.g. maximum strength 1020 for an 8-bit input image; simple implementation; results almost as good as for the Sobel operator.
Discrete directions
A limited number of edge orientations is detected. This allows for an efficient implementation of the orientation detection step. For example, 16 discrete image directions are detected by comparing the two edge gradients for each pixel.
Laplace operator
The Laplace operator (the sum of second-order derivatives) for a continuous 2-D function is defined as:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y² = f_xx + f_yy

The discrete Laplace operator uses a single mask and is independent of direction (orientation information is lost). It gives a "positive" answer (i.e. a zero value) for both real edges and homogeneous regions in the image, so it can be used in combination with some other edge operator.
Basic discrete Laplace operator:

∇²f(i, j) = 4 f(i, j) − f(i−1, j) − f(i+1, j) − f(i, j−1) − f(i, j+1)

The convolution kernel: [ 0, −1, 0 ; −1, 4, −1 ; 0, −1, 0 ]
LOG filter (1)
Using a second derivative makes the Laplacian highly sensitive to noise, hence Gaussian smoothing is used as a preprocessing step.
Laplacian of the Gaussian (LOG filter):

L(x, y) = ∇²(G(x, y) ∗ I(x, y))

Due to linearity:

L(x, y) = ∇²G(x, y) ∗ I(x, y)

Derivation of the LOG filter (1-D case):

G(x) = (1 / (σ√(2π))) exp(−x² / (2σ²))

- first derivative: G'(x) = −(x / (σ³√(2π))) exp(−x² / (2σ²))
- second derivative: G''(x) = (1 / (σ√(2π))) (x²/σ⁴ − 1/σ²) exp(−x² / (2σ²))
LOG filter (2)
Illustration of the LOG filter.
Different Laplace convolution kernels: (a) basic Laplacian, (b), (c) LOG filters.
Color image edge operator
An edge operator defined for a monochromatic image can be extended to handle color images. For example, let an RGB image function f_RGB be given. Let pairs of colour pixels be given: f1 = (r1, g1, b1), f2 = (r2, g2, b2) – from the local neighbourhood of the current pixel. In the case of the Sobel color operator there will be three pairs for the "vertical" mask and three pairs for the "horizontal" mask. The difference for a pair (f1, f2) can alternatively be computed as:

D1(f1, f2) = { (r1 − r2)² + (g1 − g2)² + (b1 − b2)² }^(1/2)
D2(f1, f2) = |r1 − r2| + |g1 − g2| + |b1 − b2|
D3(f1, f2) = max{ |r1 − r2|, |g1 − g2|, |b1 − b2| }
D4(f1, f2) = ω_r |r1 − r2| + ω_g |g1 − g2| + ω_b |b1 − b2|
8. Edge thinning
Threshold-based edge elimination
This simple edge thinning method is an edge elimination operator with a minimum threshold parameter θ. The threshold is either fixed or set adaptively (e.g. θ = γ·S_max, where γ ∈ (0, 1)):

s_thin(P) = s(P), if s(P) > θ ; s_thin(P) = 0, otherwise.

Non-maximum edge elimination
It depends on a check in the local neighborhood of a given pixel P, with N_L and N_R the neighbours of P:

IF ( (s(P) ≥ s(N_L) OR |r(P) − r(N_L)| ≥ T)
AND (s(P) ≥ s(N_R) OR |r(P) − r(N_R)| ≥ T) )
THEN s_thin(P) = s(P) ELSE s_thin(P) = 0
Edge thinning (2)
Edge modification
A local neighborhood-based modification of the edge at pixel P:
• if P is the strongest edge element in the set {P, N_L, N_R} then: s'(P) = s(P) + α (s(N_L) + s(N_R));
• if P is the weakest edge element in the above set then: s'(P) = s(P) − 2α s(P);
• if one neighbour of P (denoted P+) is a stronger and the other neighbour of P (denoted P−) is a weaker edge element then: s'(P) = s(P) − α s(P+) + α s(P−).
Several iterations over the whole image may be necessary.
Edge thinning (3)
Edge elimination with hysteresis threshold
This edge thinning method works with two edge strength thresholds: the upper θ_H and the lower θ_L. In the first run those edge pixels whose strengths exceed the upper threshold are individually marked as "good". In the next run these "good" pixels are tentatively "extended" along a potential contour line in both directions ("positive" and "negative" line direction). For a single extension, the neighbor pixel needs to have a strength higher than the lower threshold.
Remark: now the neighbors are not competing with each other and they are searched along the expected contour line (not across it).
9. Canny operator
This is a multi-stage edge image detection and thinning procedure. It needs 3 parameters:
• the variance σ for the Gaussian mask,
• edge strength thresholds θ_H, θ_L (where θ_H > θ_L) for hysteresis-based thinning.
Input: a grey-scale image. Output: a "thinned" edge image.
Steps of the Canny operator
1. Image smoothing by convolution with a Gaussian kernel.
2. Edge image detection by a discrete-gradient operator in a 2×2 or 3×3 neighbourhood.
3. Edge thinning by the non-maximum elimination operator.
4. Edge elimination with a hysteresis threshold.
EIASR Boundary-based image segmentation Lecture 5
1. Line detection
Main approaches to line detection in images:
1. A 3-step line segment detection:
• edge image detection (filtering),
• edge chain following (segmentation),
• approximation of edge chains by straight line segments (symbolic description).
2. A 2-step line or contour detection:
• edge image detection,
• detection of straight lines, circles and contours by using a Hough transform.
3. The "active contour" method.
2. Edge chain following
Principle: searching for an extension of the current edge pixel P = P_cur by its successor edge N = c(P_cur). Two neighbour edge elements can be linked if the edge magnitude (strength) and direction differences are below certain thresholds and their magnitudes are relatively large:

|s(P) − s(N)| ≤ T1 , |r(P) − r(N)| mod 2π ≤ T2 , |s(P)| > T , |s(N)| > T

Denote the 3 nearest pixels along the direction r(P) as N1, N2, N3. The successor of P_cur is the edge element N_i whose strength and orientation are most similar to those of P_cur.
Successor candidates
Three candidates for a successor of pixel P.
Closing a gap for edge pixel P in directions (left) 0° and (right) 45°.
The hysteresis threshold method
The edge strength threshold T is now split into two:
- the upper threshold T0, and
- the lower threshold T = βT0, 0 < β < 1.
The strength of the start edge pixel has to be higher than the upper threshold T0: |s(P_start)| > T0. The strength of any next edge element in the chain needs to be higher than T.
Example. Edge chain labels:
The hysteresis threshold (2)
REPEAT
  Search for an edge pixel P_start with no segment label (i.e. "free") and with strength s(P_start) > T0 (upper threshold). P_start is now called the current pixel P_cur and its segment label is SegNum.
  FORWARD_SEARCH(P_cur)
  IF chain SegNum is not closed THEN
    Again set the current pixel P_cur to be the start pixel P_start.
    Perform BACKWARD_SEARCH(P_cur)
UNTIL all pixels are visited.
Example:
Edge chain image
Input image:
3. Line segment fit
The equation of a line determined by two points (x_j, y_j), (x_k, y_k):

x(y_j − y_k) + y(x_k − x_j) = y_j(x_k − x_j) − x_j(y_k − y_j)

i.e. ax + by = c , with a² + b² > 0.
The distance d_i of a point (x_i, y_i) from this line:

d_i = |a x_i + b y_i − c| / √(a² + b²)
Edge chain approximation
Procedure APPROXIMATE(CI, P1, PN)
1. Add the line l determined by the start point P1 and the end point PN of the chain CI to the set of line segments L(C).
2. For every point Pk in chain CI determine the distance sk to the line l.
3. IF FOR ALL Pk ∈ CI: sk ≤ θ THEN RETURN (END).
4. Select a point Pk with maximum distance sk. Remove the line l from the set L(C).
5. Decompose the chain CI into 2 parts: CI1 = P1 ... Pk; CI2 = Pk ... PN.
6. Call APPROXIMATE(CI1, P1, Pk). Call APPROXIMATE(CI2, Pk, PN).
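The recursive split procedure above can be sketched with the point-line distance formula from the previous slide. A minimal example (the tolerance θ = 0.5 and the L-shaped toy chain are illustrative); an L-shaped edge chain collapses to its three corner points:

```python
import math

def point_line_distance(p, a, b):
    # Distance of point p from the line through a and b
    (x, y), (xa, ya), (xb, yb) = p, a, b
    num = abs((yb - ya) * x - (xb - xa) * y + xb * ya - yb * xa)
    return num / math.hypot(xb - xa, yb - ya)

def approximate(chain, theta=0.5):
    # Keep the segment if all points lie within theta of the line through
    # the endpoints, otherwise split at the farthest point and recurse.
    if len(chain) <= 2:
        return chain
    a, b = chain[0], chain[-1]
    dists = [point_line_distance(p, a, b) for p in chain[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= theta:
        return [a, b]
    k = worst + 1                       # index of the split point in chain
    left = approximate(chain[:k + 1], theta)
    right = approximate(chain[k:], theta)
    return left[:-1] + right            # shared split point appears once

chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(approximate(chain))  # -> [(0, 0), (3, 0), (3, 3)]
```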
Approximation by arcs (1)
1. Try to approximate two consecutive line segments by an arc, e.g. start with arc ACB.
2. Measure the approximation error. E.g.:
eA = arg max_{i=1,2,3} |g(xi) − f(xi)| ;
eB = arg max_{i=5,6,7} |g(xi) − f(xi)| ;
IF sign(g(x_eA) − f(x_eA)) = sign(g(x_eB) − f(x_eB))
AND max(|g(x_eA) − f(x_eA)|, |g(x_eB) − f(x_eB)|) < γ·R
THEN Approximate(ACB, R)
Approximation by arcs (2) Approximate(ACB, R)
Approximation by arcs (3)
3. Try to extend an existing arc by including the next line segment. E.g. arc ACB is checked for extension to ACBD; the further extension to ACBDF fails, but a second arc DFE is found.
4. Try to approximate closed arc chains by circles or ellipses.
4. Hough transform (HT)
A polar representation of the straight line defined by an edge element ei consists of two parameters (d, α), where
d = xi cos α + yi sin α.
Remark: α = 90° + β, where β = r(e) is the edge's direction.
An image line corresponds to a point in the Hough space (d, α).
Hough transform for lines
The edge image E = [eij], i = 1,...,M; j = 1,...,N, is of size M × N and represents ONUM discrete edge directions. The Hough accumulator H(d, α) is a digital representation of the Hough space:
−π/2 ≤ α ≤ π/2
−(1/2)·sqrt(M² + N²) ≤ d ≤ (1/2)·sqrt(M² + N²)
Hough Transform for line detection
FOR every edge pixel eij (coordinates (xi, yj), orientation αij) DO:
- compute dij = xi cos αij + yj sin αij ;
- approximate (dij, αij) by the nearest discrete values (d+ij, α+ij) and increase H(d+ij, α+ij), i.e. H(d+ij, α+ij) = H(d+ij, α+ij) + 1.
FOR every (d, α) DO:
- IF H(d, α) > Threshold THEN the corresponding line is detected.
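A minimal sketch of the voting scheme, with the simplifying assumption that every discrete orientation is scanned for each edge pixel instead of being read from an edge-direction image (accumulator sizes and names are illustrative):

```python
import numpy as np

def hough_lines(edges, n_d=64, n_alpha=64, threshold=2):
    """Vote (d, alpha) for every edge pixel; return accumulator and detected cells."""
    M, N = edges.shape
    d_max = np.hypot(M, N)                       # |d| can never exceed the image diagonal
    alphas = np.linspace(-np.pi / 2, np.pi / 2, n_alpha, endpoint=False)
    H = np.zeros((n_d, n_alpha), dtype=int)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        for a_idx, alpha in enumerate(alphas):
            d = x * np.cos(alpha) + y * np.sin(alpha)
            d_idx = int(round((d + d_max) / (2 * d_max) * (n_d - 1)))
            H[d_idx, a_idx] += 1                 # nearest discrete (d, alpha) cell
    peaks = np.argwhere(H > threshold)
    return H, peaks
```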
HT for circle center detection
A circle, (x − xc)² + (y − yc)² = r², has 3 parameters.
A two-step circle recognition approach:
Step 1: detect circle centres by an appropriate Hough transform;
Step 2: detect edge chains around those centres.
Hough Transform for circle center detection
Let the line perpendicular to edge element ei be: gi: y = ai·x + bi.
The lines corresponding to circle border points cross at the point C = (xc, yc), the centre of the circle. In the Hough space H(a, b) these lines correspond to collinear points (ai, bi).
Circle center detection
A line gi that passes through (xc, yc) satisfies: y = ai(x − xc) + yc = ai·x + (yc − ai·xc).
This line corresponds to the point (ai, yc − ai·xc) in Hough space. A set of such points in Hough space lies approximately on a straight line: bi = −ai·xc + yc.
For N point observations:
[ −a1  1 ]              [ b1 ]
[ −a2  1 ] · [ xc ]  =  [ b2 ]
[  ...   ]   [ yc ]     [ ...]
[ −aN  1 ]              [ bN ]
the LSE solution is:
[ −xC ]            1             [   N     −Σai ]   [ Σ(ai·bi) ]
[  yC ]  =  ─────────────────  · [ −Σai    Σai² ] · [   Σbi    ]
            N·Σai² − (Σai)²
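This least-squares step can be sketched directly from the normal equations, assuming the slopes a_i and intercepts b_i of the gradient lines are given (names are illustrative):

```python
import numpy as np

def circle_center_lse(a, b):
    """Least-squares intersection of the lines b_i = -a_i*x_c + y_c in Hough space."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    det = n * np.sum(a**2) - np.sum(a)**2          # normal-equation determinant
    xc = (np.sum(a) * np.sum(b) - n * np.sum(a * b)) / det
    yc = (np.sum(a**2) * np.sum(b) - np.sum(a) * np.sum(a * b)) / det
    return xc, yc
```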
HT for circle detection
Now let us apply the Hough transform to detect all 3 parameters of the circle equation, (x − xc)² + (y − yc)² = r², in one step.
Hough transform for circle detection
FOR every edge pixel ei (coordinates (xi, yi), orientation αi) DO:
- the (unknown) centre coordinates (xC, yC) of a circle with (unknown) radius rd passing through point (xi, yi) satisfy: xi = xC + rd cos αi, yi = yC + rd sin αi ;
- FOR every discrete radius rd: 0 ≤ rd ≤ rmax, DO:
  - compute (xC, yC) from the above equations and
  - increase the Hough accumulator H(xC, yC, rd) by one;
FOR every (xC, yC, rd) DO:
- IF H(xC, yC, rd) > Threshold THEN the corresponding circle is detected.
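A sketch of the one-step voting in the 3-parameter space, assuming edge points with known orientations; since the centre may lie on either side of the edge point, both signs are tried here (an assumption, not stated on the slide):

```python
import numpy as np

def hough_circles(edge_pts, orientations, shape, r_max, threshold):
    """Vote in (x_c, y_c, r) space; the edge orientation gives the radial direction."""
    H = np.zeros((shape[1], shape[0], r_max + 1), dtype=int)
    for (x, y), alpha in zip(edge_pts, orientations):
        for r in range(1, r_max + 1):
            for sign in (1, -1):          # centre may lie on either side
                xc = int(round(x - sign * r * np.cos(alpha)))
                yc = int(round(y - sign * r * np.sin(alpha)))
                if 0 <= xc < shape[1] and 0 <= yc < shape[0]:
                    H[xc, yc, r] += 1
    return np.argwhere(H > threshold)
```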
Contour detection by GHT
A generalised Hough transform (GHT) for contour detection.
Parameters of the Hough space: C = (xC, yC) - location of the centre of mass, s - scale, α - contour orientation angle.
Model learning (design): for every pair of allowed discrete values (sd, αg) of scale and orientation, create a table R(sd, αg) = [r(ϕ), ϕ], where for each edge point xB with edge direction ϕ the pair [r(ϕ) = xB − C, ϕ] is stored.
Example: the model contour (top drawing) and the candidate contour (bottom).
5. Active contour
The active contour, or snake, is defined as an energy-minimizing polygon: the snake's energy depends on its shape and location within the image. A snake is initialized as a rough approximation of the object's boundary. The snake's motion rule is then applied iteratively, changing the snake's location and shape until the final condition is satisfied:
Finternal + Fexternal = 0, where Fexternal = −∇Eexternal.
Example. Two final snakes obtained from a common initial snake, using more (outer) or less (inner) external force.
Snake's forces
Edge-based external energy:
(1) Eexternal(x, y) = −|∇I(x, y)|²
or
(2) Eexternal(x, y) = −|∇(Gσ(x, y) ∗ I(x, y))|²
The internal energy is responsible for elasticity and stiffness, trying to shorten and to smooth the contour:
Einternal = Eelastic + Estiffness
E.g.
Eelastic = K1 · Σ_{i=0}^{n−1} |pi − pi−1|²
Estiffness = K2 · Σ_{i=0}^{n−1} |pi−1 − 2pi + pi+1|²
Motion (update) rule
Snake dynamics, the discrete motion rule: at each iteration, every control point pi = (xi, yi) is moved by a vector proportional to the force acting on it:
xi ← xi + α·FelasticX,i + β·FstiffnessX,i + γ·FexternalX,i
yi ← yi + α·FelasticY,i + β·FstiffnessY,i + γ·FexternalY,i
For example: α = 0.8, β = 0.1, γ = 0.6.
In particular:
FelasticX,i = 2K1·((xi−1 − xi) + (xi+1 − xi))
FelasticY,i = 2K1·((yi−1 − yi) + (yi+1 − yi))
FstiffnessX,i = K2·(4(xi−1 + xi+1) − 6xi − (xi−2 + xi+2))
FstiffnessY,i = K2·(4(yi−1 + yi+1) − 6yi − (yi−2 + yi+2))
FexternalX,i = (K3/2)·(I(xi + 1, yi) − I(xi − 1, yi))
FexternalY,i = (K3/2)·(I(xi, yi + 1) − I(xi, yi − 1))
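One iteration of this update can be sketched for a closed snake (the external force is omitted here, i.e. γ = 0; indices wrap around; parameter names are illustrative):

```python
def snake_step(xs, ys, k1=1.0, k2=1.0, alpha=0.1, beta=0.02):
    """One explicit update of a closed snake using elastic and stiffness forces."""
    n = len(xs)
    nxs, nys = [], []
    for i in range(n):
        im2, im1, ip1, ip2 = (i - 2) % n, (i - 1) % n, (i + 1) % n, (i + 2) % n
        f_el_x = 2 * k1 * ((xs[im1] - xs[i]) + (xs[ip1] - xs[i]))
        f_el_y = 2 * k1 * ((ys[im1] - ys[i]) + (ys[ip1] - ys[i]))
        f_st_x = k2 * (4 * (xs[im1] + xs[ip1]) - 6 * xs[i] - (xs[im2] + xs[ip2]))
        f_st_y = k2 * (4 * (ys[im1] + ys[ip1]) - 6 * ys[i] - (ys[im2] + ys[ip2]))
        nxs.append(xs[i] + alpha * f_el_x + beta * f_st_x)
        nys.append(ys[i] + alpha * f_el_y + beta * f_st_y)
    return nxs, nys
```

Without an external force the elastic term shrinks the contour, which is the expected behaviour.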
Example Two different final contours – obtained by varying the weights of the internal force. Inner contour: α = 0.8, β = 0.5, γ = 0.6 Outer contour: α = 0.8, β = 0.1, γ = 0.6
6. Point detector
The Harris-Stephens operator
Detect average image gradients Ix, Iy in the local neighbourhood of an image point (xk, yk) and represent them by the 2 × 2 matrix of their products:
A(xk, yk) = [ Σ Ix²    Σ Ix·Iy ]
            [ Σ Ix·Iy  Σ Iy²   ]
A point feature is detected when both eigenvalues of matrix A are high and of similar value.
SIFT and SURF detector
SIFT (Scale-Invariant Feature Transform) [Lowe, 2004] is both a multi-scale keypoint detector and an image descriptor. Basic idea: DoG, scale-space filtering by Difference of Gaussians.
The SIFT implementation consists of the following steps:
1. Scale-and-image space filtering (LoG filter),
2. Detection of extrema in scale-image space (DoG filter),
3. Keypoint (corner feature) detection,
4. Detection of the dominating orientation at the keypoint,
5. Keypoint description.
SURF (Speeded Up Robust Features) [Bay et al., 2008]
Idea: DoH, Determinant of Hessian, for keypoint detection.
EIASR Region-based image segmentation Lecture 6
1. Homogeneous region
A region R is a connected image area which is homogeneous with respect to some parameter or vector of parameters (e.g. intensity, colour, texture) and a given predicate: H(R) = 1.
Examples of homogeneity predicates:
1. H(R) = 1, if |max(R) − min(R)| < θ(R); 0 otherwise,
   where max(R) = max_{p∈R} f(p), min(R) = min_{p∈R} f(p).
2. H(p, R) = 1, if |f(p) − mean(R)| < θ(R); 0 otherwise,
   where θ(R) = θA − (F(R)/F(Rmax))·(θA − θB)
   or θ(R) = kσ = k·sqrt( (1/n)·Σ_{i=1}^{n} (f(pi) − mean(R))² ), with pi ∈ R, i = 1,...,n.
W. Kasprzak: EIASR
6. Region-based image segmentation
183
Region growing and merging
A simple approach to region detection is to start from some pixels (called seeds) si representing distinct image regions Ri and to grow them until they cover the entire image.
Initial seed detection: compute a histogram of the entire image; pixels whose value corresponds to a histogram peak are selected as seeds.
Region growing: at each stage k, and for each region Ri(k), check whether there are unlabelled neighbour pixels along the region border and whether the homogeneity criterion would still hold for the enlarged region. If so, enlarge the region by this pixel.
Split-and-merge
Region merging: merge adjacent regions that have similar statistical properties. For example, two regions R1, R2 may be merged if their arithmetic means are similar:
Predicate (3): |mean(R1) − mean(R2)| < θ(R1 ∪ R2)
Split-and-merge (quadtree structure):
Init: start the quadtree at some mid-level.
SPLIT: check whether every leaf node is homogeneous.
MERGE: check whether 4 sibling leaves can be merged.
Region detection algorithm
Combined region detection algorithm (M, θ, θA, θ(R)):
Initial image tessellation into square regions of size M × M.
REPEAT with the quadtree representation DO
  MERGE step (with a fixed threshold θ)
  SPLIT step (with a fixed threshold θ)
UNTIL no further split or merge is possible (predicate (1)).
Region merging/growing with a fixed threshold θA (predicate (3)).
Region merging with an adaptive threshold θ(R) (predicate (2)).
2. Texture Texture - a measure of image regularity, coarseness and smoothness. Examples of textures: grass, wood, water surface, etc.
Texture classification
Texture recognition is a classification problem.
Learning: find appropriate features and train a classifier.
Classification: detect the features of the current texture; classify them in terms of the learned classes.
Texture features:
• statistical (e.g. variance, skewness);
• spectral (e.g. using the autocorrelation function or the Fourier transform);
• structural (e.g. using discrete features such as colour, motion, number of endings).
Histogram-based texture features
Let fk, k = 1,...,L, be the values of the image function, and p(fk) the normalized histogram of an image region.
1. Mean: µ1 = µ = Σ_{k=1}^{L} fk·p(fk)
2. Variance: µ2 = σ² = Σ_{k=1}^{L} (fk − µ1)²·p(fk)
3. Skewness: µ3 = (1/σ³)·Σ_{k=1}^{L} (fk − µ1)³·p(fk)
4. Kurtosis: µ4 = (1/σ⁴)·Σ_{k=1}^{L} (fk − µ1)⁴·p(fk) − 3
5. Entropy: H(f) = −Σ_{k=1}^{L} p(fk)·log p(fk)
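The five features can be computed directly from a normalized histogram; a minimal sketch:

```python
import math

def histogram_features(p, values):
    """Mean, variance, skewness, kurtosis and entropy of a normalized histogram."""
    mean = sum(f * pf for f, pf in zip(values, p))
    var = sum((f - mean) ** 2 * pf for f, pf in zip(values, p))
    sigma = math.sqrt(var)
    skew = sum((f - mean) ** 3 * pf for f, pf in zip(values, p)) / sigma ** 3
    kurt = sum((f - mean) ** 4 * pf for f, pf in zip(values, p)) / sigma ** 4 - 3
    entropy = -sum(pf * math.log(pf) for pf in p if pf > 0)
    return mean, var, skew, kurt, entropy
```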
The interpretation of features
1. The mean gives the average pixel value in the given region.
2. The variance measures the dispersion of pixel values in this region.
3. Skewness is a measure of histogram symmetry.
4. Kurtosis is a measure of the tails of the histogram: long-tailed histograms correspond to spiky regions. The subtraction of 3 ensures that the kurtosis of a Gaussian distribution is normalized to zero.
5. Entropy expresses how uniformly the pixel values are distributed.
The major limitation of histogram features is that they cannot express the spatial characteristics of the texture.
Histograms of pairs
Histograms of sums and differences. Let fj;k and fj+µ;k+ν be two pixels in the image separated by the displacement vector V = (µ, ν)ᵀ. Their sum and difference are:
sj;k = fj;k + fj+µ;k+ν ;  dj;k = fj;k − fj+µ;k+ν
The histograms of sums and of differences, Hs(l, V) and Hd(m, V), for displacement V:
Hs(l, V) = # fj;k (such that sj;k = l)
Hd(m, V) = # fj;k (such that dj;k = m)
They contain information about the spatial organization of pixel values. E.g. a coarse texture results in a concentration of the histograms around l = 2 × mean and m = 0. Several displacements are considered, e.g.:
V = (µ, ν)ᵀ ∈ {(1,0), (0,1), (−1,0), (0,−1)}
Features of sums and differences
After normalisation:
hs(l) = Hs(l) / Σ_l Hs(l), l = 0,...,2(L−1)
hd(m) = Hd(m) / Σ_m Hd(m), m = −(L−1),...,(L−1)
Typical features are:
1. Mean: c1 = Σ_{l=0}^{2L−2} l·hs(l) ;  c2 = Σ_{m=−L+1}^{L−1} m·hd(m)
2. Contrast: c3 = Σ_{l=0}^{2L−2} l²·hs(l) ;  c4 = Σ_{m=−L+1}^{L−1} m²·hd(m)
3. Variance: c5 = Σ_{l=0}^{2L−2} (l − c1)²·hs(l) ;  c6 = Σ_{m=−L+1}^{L−1} (m − c2)²·hd(m)
4. Entropy: c7 = −Σ_{l=0}^{2L−2} hs(l)·log(hs(l)) ;  c8 = −Σ_{m=−L+1}^{L−1} hd(m)·log(hd(m))
Co-occurrence matrix
The grey-level co-occurrence matrix (GLCM) (by R. Haralick et al.) is defined as: G(d, r) = [gij(d, r)], where an element gij(d, r) specifies the number of pixel pairs (f = i, f = j) which are separated by distance d along the direction r.
Example. For a 5 × 5 image with L = 4 values, the co-occurrence matrix has L² = 4 × 4 elements. E.g.:
f = [ 0 0 1 1 0
      0 0 1 1 0
      0 2 2 2 0
      2 2 3 3 0
      2 2 2 2 2 ]
G(1; 0°) = [ 4  4  2  1
             4  4  0  0
             2  0 14  1
             1  0  1  2 ]
g00(1, 0°) counts the pairs (f00, f01), (f01, f00), (f10, f11), (f11, f10).
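The symmetric GLCM of the example can be reproduced with a short sketch (argument names are illustrative):

```python
import numpy as np

def glcm(img, d=1, direction=(0, 1), levels=4):
    """Symmetric grey-level co-occurrence matrix for displacement d*direction."""
    img = np.asarray(img)
    dy, dx = direction[0] * d, direction[1] * d
    G = np.zeros((levels, levels), dtype=int)
    rows, cols = img.shape
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                G[img[y, x], img[ny, nx]] += 1
                G[img[ny, nx], img[y, x]] += 1   # count the pair in both orders
    return G
```

Applied to the example image above, this reproduces g00 = 4 and g22 = 14 as on the slide.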
Parameters of GLCM
The adjacency of pixels in a pair is considered along four directions r (horizontal, vertical, left and right diagonal).
The distance parameter d is usually set to d = 1, but it can be set to d > 1. R. Haralick proposed fourteen features computed from the GLCM.
Features of the GLCM
Normalisation of the GLCM:
gij(d, r) = gij(d, r) / Σ_{i,j} gij(d, r), i, j = 0,...,(L−1)
Marginal distributions:
g1i = Σ_{j=0}^{L−1} gij ;  g2j = Σ_{i=0}^{L−1} gij
Mean and standard deviation of the marginal distributions:
µ1 = Σ_{i=0}^{L−1} i·g1i ;  σ1 = sqrt( Σ_{i=0}^{L−1} (i − µ1)²·g1i )
µ2 = Σ_{j=0}^{L−1} j·g2j ;  σ2 = sqrt( Σ_{j=0}^{L−1} (j − µ2)²·g2j )
Features of the GLCM
Selected features:
1. Mean-square of the distribution: c1 = Σ_{i,j} gij²
2. Contrast of the "difference" variable: c2 = Σ_{l=0}^{L−1} l²·( Σ_{|i−j|=l} gij )
3. Normalized "correlation" of the marginal distributions:
c3 = ( Σ_{i,j} (i − µ1)·(j − µ2)·gij ) / (σ1·σ2)
4. Entropy: c4 = −Σ_{i,j} gij·log gij
Filter banks
Convolving an image block with a set of kernels yields a representation of the block's texture in a different space. There is a strong response when the texture looks similar to the filter kernel, and a weak response when it does not.
Example. The collection of kernels may consist of a series of spots and bars at different scales. Spot filters respond strongly to small (non-oriented) points. Bar filters are oriented and tend to respond to edges.
Gabor filter banks
Gabor filters: Fourier transform elements multiplied by Gaussians. A Gabor kernel responds strongly if located over an image block whose texture has a particular spatial frequency and orientation. Gabor filters come in pairs: symmetric and anti-symmetric.
A symmetric kernel:
Gsymmetric(x, y) = cos(kx·x + ky·y) · exp{ −(x² + y²) / (2σ²) }
An anti-symmetric kernel:
Ganti-symmetric(x, y) = sin(kx·x + ky·y) · exp{ −(x² + y²) / (2σ²) }
where (kx, ky) are the spatial frequencies.
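The two kernels can be sketched on a discrete grid (parameter names are illustrative):

```python
import numpy as np

def gabor_kernels(kx, ky, sigma, half_size):
    """Symmetric (cosine) and anti-symmetric (sine) Gabor kernels on a square grid."""
    r = np.arange(-half_size, half_size + 1)
    x, y = np.meshgrid(r, r)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))   # Gaussian window
    g_sym = np.cos(kx * x + ky * y) * envelope
    g_anti = np.sin(kx * x + ky * y) * envelope
    return g_sym, g_anti
```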
Example. Gabor filter kernels, where mid-grey values represent zero, darker values represent negative numbers and lighter values represent positive numbers. The top row shows three anti-symmetric kernels, and the bottom row three symmetric kernels, in the horizontal direction.
Gabor filters in MPEG-7
MPEG-7 descriptors in the frequency domain: in polar coordinates the frequency domain is sampled into 30 "property channels"; the phase width is uniformly 30 degrees, and the magnitude widths are powers of 2:
θr = 30°·r, (r = 0, 1, ..., 5)
ωs = ω0·2^(−s), (s = 0, 1, ..., 4), ω0 = 3/4
Gabor filters in MPEG-7 (cont.)
The Gauss function in channel (s, r):
G_Ps,r(ω, θ) = exp( −(ω − ωs)² / (2σρs²) ) · exp( −(θ − θr)² / (2σθr²) )
The energy ei of the i-th channel:
ei = log(1 + gi), (i = 1, 2, ..., 30)
gi = Σ_{ω=0+}^{1} Σ_{θ=0°+}^{180°} [ G_Ps,r(ω, θ) · F(ω, θ) ]²
The standard deviation of energy qi in the i-th channel:
qi = log(1 + σi), (i = 1, 2, ..., 30)
σi = sqrt( Σ_{ω=0+}^{1} Σ_{θ=0°+}^{180°} [ ( G_Ps,r(ω, θ) · F(ω, θ) )² − gi/N ]² )
where F(ω, θ) is the 2D Fourier transform of the image, represented in polar frequency coordinates.
3. Point-based image descriptor
The image descriptor in SIFT
Dominating orientation. An orientation is assigned to each keypoint to achieve invariance to image rotation. A neighbourhood is taken around the keypoint location, depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created. The highest peak in the histogram is taken, and any peak above 80% of it is also considered for calculating the orientation. Eventually, this replicates the keypoint into many, with the same location and scale but different directions.
SIFT feature descriptor. A 16s × 16s neighbourhood around the keypoint is taken. It is divided into 16 sub-blocks of size 4 × 4. For each sub-block, an 8-bin orientation histogram is created. Thus, 128 bin values are available. The dominating direction is used for registration of the feature vector.
SURF descriptor
Dominating orientation. SURF uses wavelet responses in the horizontal and vertical direction in a neighbourhood of size 6s, with adequate Gaussian weights, to obtain gradients dx, dy. The dominant orientation is estimated by calculating the sum of all responses in the (dx, dy) space contained within a sliding orientation window of angle 60 degrees.
SURF feature descriptor. A neighbourhood of size 20s × 20s is taken around the keypoint, where s is the scale. It is divided into 4 × 4 subregions. For each subregion, horizontal and vertical wavelet responses are taken and a vector is formed:
v = [ Σdx, Σdy, Σ|dx|, Σ|dy| ]
This gives a SURF feature descriptor with 64 dimensions in total. An extended descriptor with 128 elements is also defined: the sums of dx and |dx| are computed separately depending on the sign of dy.
The magnitude part of the frequency characteristics. Example: a spectrogram before and after pre-emphasis.
8. Speech signal pre-processing
248
Auto-correlation of signal
For a window of size N we compute the normalized auto-correlation of the signal samples with their copies shifted by k. Starting at the m-th sample, the k-shifted normalized auto-correlation is:
rk(m) = ( Σ_{n=m}^{m+N−k−1} fn·fn+k ) / ( ||[fn]|| · ||[fn+k]|| )
The maximum of the normalized auto-correlation always appears at k = 0 and equals 1. For a voiced speech part, local maxima appear every k0 samples, i.e. for some k0, 2k0, 3k0, etc., because many strong harmonic frequencies exist in this part.
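A minimal sketch of the k-shifted normalized auto-correlation for one frame:

```python
import math

def norm_autocorr(frame, k):
    """Normalized auto-correlation of a frame with its k-shifted copy."""
    a = frame[:len(frame) - k]
    b = frame[k:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0
```

For a periodic frame the value peaks near 1 at multiples of the period, and a shift by half a period gives a strongly negative value.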
Base frequency
By normalized mutual correlation we mean a correlation factor between two consecutive signal frames. Let two frames of signal samples be given (in vector form):
f(m − k, m − 1) = [fm−k, ..., fm−1]
f(m, m + k − 1) = [fm, ..., fm+k−1]
The normalized mutual correlation of these two vectors is:
ρm(k) = ( f(m − k, m − 1) · f(m, m + k − 1) ) / ( ||f(m − k, m − 1)|| · ||f(m, m + k − 1)|| )
Start with k = 2 and compute the correlation factor for increasing values of k. In a voiced part of the speech, the maximum of the normalized mutual correlation appears at the k that corresponds to the base period (and base frequency F0) of the speaker.
EIASR Acoustic feature detection Lecture 9
1. Speech features
Frame-based speech signal features:
1. Mel-frequency cepstral coefficients (MFCC), extended by their first derivatives in time, or
2. speech features based on Linear Predictive Coding (LPC), e.g. LPCC - linear predictive cepstral coefficients.
Diagram: speech signal (frames) → features.
W. Kasprzak: EIASR
9. Acoustic feature detection
252
Cepstrum (1)
The cepstrum of a signal x[n] is the result of a homomorphic transformation: cepstrum(x) = F⁻¹(log |F(x)|), where F is the discrete-time Fourier transform (DFT) for MFCC or the Z transform for LPCC.
Note the word play: spectrum → spec|trum, ceps|trum → cepstrum.
Diagrams: the MFCC and LPCC processing chains.
Cepstrum (2)
Why are cepstrum features useful for speech recognition?
• The cepstrum features characterizing the (impulse response of the) vocal tract are located near the "zero" feature k = 0, whereas the input impulse components, corresponding to the larynx-modulated oscillations (which are not useful for speech recognition), are located at higher values of k ("longer" cepstrum time), where the cepstrum features achieve a maximum value.
• The useful features can be separated from the others by selecting an appropriate number of them, starting from k = 0, or by an additional multiplication called liftering.
• The speech impulse response can also be separated from the acquisition channel's (microphone) response by using centered cepstrum features.
2. MFCC
1. Short-time Fourier transform (STFT)
A windowed DFT for every frame τ of the input signal:
F(k, τ) = Σ_{t=0}^{M−1} ( x[τ + t] · e^(−i2πkt/M) · wτ[t] ), k = 0, 1, ..., M−1
Window functions w[t]:
1. rectangular window,
2. triangle window,
3. Hamming window:
wτ[t] = 0.54 − 0.46·cos(2πt/M), for t = 0, 1, ..., M−1; 0 otherwise,
etc.
Windowing example
Example. x[n] is the sum of two sine functions, uniformly sampled from 0 to 2π by 128 samples:
x[n] = sin(2πn/5) + sin(2πn/12), n = 0, 1, 2, ..., 127.
Figures: a single frame with the rectangular window applied, and a single frame with the Hamming window applied.
Windowing example (2)
Example (cont.). Magnitude of the Fourier coefficients: with the rectangular window and with the Hamming window.
Window functions
Conclusion (figure): f × w, Mag[DFT(w)], Mag[STFT(f × w)].
Spectrogram
Power of the Fourier coefficients (squared magnitude):
FC(k, τ) = |F(k, τ)|² = | Σ_{t=0}^{M−1} ( x[τ + t] · e^(−i2πkt/M) · wτ[t] ) |², k = 0, ..., M−1
Mel frequency scale
Non-linear response of the human ear to the frequency components of the audio spectrum: differences between frequencies at the low end (< 1 kHz) are easier to detect than differences of the same magnitude at the high end of the audible spectrum.
Approach: mimic the non-linear frequency analysis performed by the human ear; the higher the frequency, the lower its resolution.
MEL scale (empirical result):
fmel = 2595·log10( 1 + f / 700 Hz )
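The MEL mapping and its inverse (the inverse is an assumption used later for placing the triangular filters, not stated on this slide):

```python
import math

def hz_to_mel(f_hz):
    """Empirical MEL scale: 2595*log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, useful for placing triangular filters uniformly in mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```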
MFC
Mel frequency coefficients (MFC). Triangular filters D(l, k) are located uniformly on the Mel frequency scale:
MFC(l, τ) = Σ_{k=0}^{M−1} [ D(l, k) · FC(k, τ) ], l = 1,...,L
The MFC value associated with each bin corresponds to a weighted average of the power spectral values in the particular frequency range specified by the shape of the filter.
MFCC
The Mel-frequency cepstrum coefficients are computed by the homomorphic transformation MFCC(h) = FT⁻¹{log MFC{FT{h}}}, for h = x ⊗ w.
The last step is the inverse Fourier transform (a cosine transform) of the logarithmic Mel frequency coefficients:
MFCC(k, τ) = Σ_{l=0}^{L−1} [ log MFC(l, τ) · cos( k·(2l + 1)·π / (2L) ) ], k = 1,...,K
Centered MFCC:
MFCCcentered(k, τ) = MFCC(k, τ) − mean{ MFCC(k, τ) | τ = 1, 2, ... }, k = 1,...,K
Delta features
Energy feature. An additional feature is the total energy of the signal in a single frame:
E(τ) = log( Σ_{i=1}^{M} xi² )
Gradients of features in time ("delta" features). A schematic view of spectrograms for different phoneme types: single vowels (left), diphthongs (middle), plosives (right).
A linear regression over 5 consecutive frames is applied to find the delta coefficients d (of the MFCCs and of the energy feature c):
d(τ) = ( 2c(τ + 2) + c(τ + 1) − c(τ − 1) − 2c(τ − 2) ) / 10
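The regression formula can be sketched directly; here `c` is a per-frame sequence of one coefficient (an illustrative name):

```python
def delta(c, tau):
    """Delta coefficient by linear regression over 5 consecutive frames."""
    return (2 * c[tau + 2] + c[tau + 1] - c[tau - 1] - 2 * c[tau - 2]) / 10.0
```

For a linearly rising coefficient the delta equals the slope, and for a constant coefficient it is zero.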
Feature set
Per frame: energy, MFCC, delta energy, delta MFCC, and general features (total energy, mean and variance, normalized maximum auto-correlation, low-band ratio).
3. LPC and LPCC
The Z transform is a discrete-time signal transform, dual to the Laplace transform of continuous-time signals; it probes the signal with sinusoids and (decaying) exponentials:
X(z) = Σ_{n=−∞}^{∞} x[n]·z^(−n)
where z is a complex number: z = r·e^(−jω), r = e^(−σ).
The synthesis model of human speech (in the z-domain) consists of:
• an excitation source E(z) on the input,
• a linear filter with transmittance H(z),
• the speech signal X(z) on its output
(the signals and the filter are represented by their transforms in the complex-valued domain z).
Speech synthesis model
Let H(z) denote the transmittance of the filter (the z transform of its impulse response h[n]). The following relations hold in the z domain:
X(z) = H(z)·E(z) ,  E(z) = A(z)·X(z)
IIR filter
A digital IIR filter is characterized by a recursive equation:
x[n] = b0·e[n] + b1·e[n−1] + b2·e[n−2] + ... + bp·e[n−p] + a1·x[n−1] + a2·x[n−2] + ... + am·x[n−m]
The n-th output sample, x[n], is computed from the current and previous input samples and from previous output samples. In short:
x[n] − Σ_{k=1}^{m} ak·x[n−k] = Σ_{k=0}^{p} bk·e[n−k]
The corresponding description in the z-domain is:
H(z) = X(z)/E(z) = ( Σ_{k=0}^{p} bk·z^(−k) ) / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
LPC
The Auto-Regressive (AR) model assumes that the numerator is 1:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
Thus in the AR model the n-th output sample, xn, is estimated from only the m previous output samples and the current input sample:
x[n] = e[n] + a1·x[n−1] + a2·x[n−2] + ... + am·x[n−m]
In short: xn = en + Σ_{k=1}^{m} ak·xn−k
Ideally, for voiced parts the vocal tract is cyclically fed by a Dirac delta impulse; then e0 = 1, en = 0, for short-time frames. Thus the n-th speech sample (in a frame) is estimated as a linear combination of the previous m samples:
x̂n = Σ_{k=1}^{m} ak·xn−k
Auto-correlation method for LPC
The task is to compute the parameters {ak | k = 1,...,m} for every signal frame. By the LSE approach, for a given frame, we minimize:
ε = Σ_{n=n0}^{n1} (xn − x̂n)² ;  ∂ε/∂ai = −Σ_n 2·( xn − Σ_k ak·xn−k )·xn−i = 0
where n0, n1 are the training sample indices in the given frame. We get m equations with m unknowns:
Σ_k ak·Σ_n xn−k·xn−i = Σ_n xn·xn−i , i = 1,...,m
By introducing the first m+1 auto-correlation coefficients:
r|i−k| = Σ_{n=0}^{M−1−|i−k|} xn·xn+|i−k| = Σ_n xn−k·xn−i
the equation system takes the form:
Σ_{k=1}^{m} ak·r|i−k| = ri , i = 1,...,m
LPC computation
[ r0    r1    r2   ...  rm−1 ]   [ a1 ]   [ r1 ]
[ r1    r0    r1   ...  rm−2 ] · [ a2 ] = [ r2 ]
[ ...                        ]   [ ...]   [ ...]
[ rm−1  rm−2  rm−3 ...  r0   ]   [ am ]   [ rm ]
R·a = r
The matrix R is a Toeplitz matrix (it is symmetric with equal diagonal elements). Due to this Toeplitz property, an efficient algorithm is available for computing a without computing the inverse matrix R⁻¹.
Alternative method: the Levinson-Durbin algorithm, an iterative method for the computation of the LPC parameters.
The Levinson-Durbin algorithm
E represents the prediction error, Ki the reflection coefficients between consecutive parts of the acoustic tube, and aj the final prediction coefficients.
INIT: E0 = r0
FOR i = 1 to m:
  Ki = ( ri − Σ_{j=1}^{i−1} αj,i−1·r|i−j| ) / Ei−1
  αi,i = Ki
  FOR j = 1 to i−1: αj,i = αj,i−1 − Ki·αi−j,i−1
  Ei = (1 − Ki²)·Ei−1
FOR j = 1 to m: aj = αj,m
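A sketch of the recursion above (0-based lists; `alpha[j]` holds α_{j,i} of the current order i):

```python
def levinson_durbin(r, m):
    """Solve the Toeplitz system R a = r for the LPC coefficients a[1..m].
    `r` holds the autocorrelation coefficients r[0..m]."""
    err = r[0]
    alpha = [0.0] * (m + 1)
    for i in range(1, m + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / err                  # reflection coefficient K_i
        new_alpha = alpha[:]
        new_alpha[i] = k
        for j in range(1, i):
            new_alpha[j] = alpha[j] - k * alpha[i - j]
        alpha = new_alpha
        err *= (1.0 - k * k)           # prediction error E_i
    return alpha[1:], err
```

For autocorrelation values of a first-order AR signal the method recovers the single true coefficient and leaves the higher-order ones at zero.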
LPC-based features
Basic parameters. The prediction parameters can themselves be considered a feature vector (usually the error value is also included):
ci = ai for i = 0,...,m, where a0 = 1, and cm+1 = ε.
The number of prediction parameters is m ∈ ⟨12, 20⟩. Approximately, for the sampling frequency fs [kHz]: m ≈ fs + [2, 4].
Spectral features (smoothed signal spectrum). By transforming the parameter vector into the frequency domain we get a smooth spectrum of the signal frame. The required resolution in the frequency domain is achieved by padding the parameter vector with zeros to get a vector with M elements:
A_M = DFT([1, a1, a2, ..., am, 0, 0, ..., 0])
LPCC (1)
LPCC - the cepstral LPC. Recall that the speech synthesis filter is described in the z-domain by the transmittance function:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
The polynomial in the denominator can be factored, giving an all-pole transmittance function:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) ) = 1 / Π_{k=1}^{m} (1 − pk·z^(−1))
Next we apply the ln function and the inverse Z transform:
c[1 : m] = Z⁻¹(ln[H(z)]) = Z⁻¹( −Σ_{k=1}^{m} ln[1 − pk·z^(−1)] )
LPCC (2)
A direct iterative method for computing the LPCC features
Instead of performing the individual steps of the cepstrum transformation of the LPC coefficients, there exists an iterative method for the direct computation of the LPCC features from the LPC coefficients.
For 1 ≤ n ≤ m (where m is the order of the LPC transform):
c[n] = −an − Σ_{k=1}^{n−1} (1 − k/n)·c[n−k]·ak ;  n = 1, 2, ..., m
For n > m:
c[n] = −Σ_{k=1}^{n−1} (1 − k/n)·c[n−k]·ak ;  n > m
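The recursion can be sketched as follows, keeping the slide's sign convention; `a` is indexed a[1..m] with a[0] unused, and ak = 0 is assumed for k > m:

```python
def lpc_to_lpcc(a, n_ceps):
    """Iterative LPC -> LPCC conversion following the recursion above."""
    m = len(a) - 1
    c = [0.0] * (n_ceps + 1)           # c[1..n_ceps]; c[0] unused
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= m else 0.0
        for k in range(1, n):
            if k <= m:                 # a_k = 0 beyond the model order
                acc -= (1.0 - k / n) * c[n - k] * a[k]
        c[n] = acc
    return c[1:]
```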
EIASR Phonetic speech model Lecture 10
1. Phonetic categories
Phone (articulated sound) - the smallest element of speech. "A phone is a speech sound considered as a physical event without regard to its place in the sound system of a language" (Webster dictionary). A phone may have different realisations, determined by tonality, duration and intonation. The IPA (International Phonetic Association) has defined some 100 phones, while at least 40 are required for a rough transcription.
Phoneme - a category of similar phones. Even though no two speech sounds are identical, all of the phones classified into one phoneme category are similar enough to have the same meaning.
10. Phonetic speech model
276
Groups of phonemes
Primary groups: vowels and consonants - the basic contrast is whether a phoneme can serve as a syllable nucleus or not.
The mode of phonation:
1. Sonorants - "buzzes" - characterized mainly by voicing: the repetitive opening and closing of the vocal cords.
2. Fricatives - "hisses" - generally non-voiced sounds.
3. Plosives - "pops" - explosive sounds (including affricates).
4. Silence - phrase marker, breathing, mini-silences and closures before a plosive.
Vowels
1. Monophthongs (11 phonemes) - a single vowel quality.
• stressed vowels: /A/, /E/, /@/, /i:/, /o/, /U/ (e.g. in the words father, bet, bat, beet, above, boot);
• reduced vowels: /^/, /&/, /I/, /u/ (e.g. in the words bus, above, bit, book);
• r-coloured vowels: long /3r/ and short /&r/ (as in bird and butter).
2. Diphthongs (6) - a clear change from start to end: /ei/ (e.g. late), /aI/ (e.g. bye), />i/ (e.g. boy), /iU/ (e.g. few), /aU/ (e.g. loud), /oU/ (e.g. boat).
Consonants (1)
3. Approximants (4) – semivowels – similar to vowels, but with more obstruction in the vocal tract than for the vowels:
• liquids: the /l/ (e.g. "like") and the retroflex /9r/ (e.g. "red");
• glides: the /j/ (e.g. "yes") and the /w/ (e.g. "won").
4. Nasals (3) – the airflow is blocked completely in the oral tract, but a lowering of the velum allows a weak flow through the nose: /m/ as in "me", /n/ as in "new", /ng/ as in "sing".
5. Fricatives (9) – weak or strong friction noises, produced when the articulators are close enough together to cause turbulence in the airflow:
• voiceless /f/, /T/, /s/, /S/, /h/ as in: "fine", "thing", "sign", "assure", "hope";
• voiced /v/, /D/, /z/, /Z/ as in: "voice", "than", "resign", "vision".
Consonants (2)
6. Plosives (6) – bursts or explosive sounds produced by a complete closure of the vocal tract followed by a rapid release of the closure:
• unvoiced /ph/, /th/, /kh/ (as in "can"), and
• voiced: /b/, /d/, /g/.
7. Affricates (2) – plosives released with frication: the /tS/ sound (as in "church"), the /dZ/ (as in "judge").
The IPA (International Phonetic Alphabet) (c/o Department of Linguistics, University of Victoria, Victoria, British Columbia, Canada) is recognised as the international standard for the transcription of phonemes in all languages.
Articulation of vowels
Articulation of vowels and /j/ according to the level of mouth opening and the tongue location.
Articulation of consonants
Classification of consonants according to phonetic categories and articulation area.
2. Spectral cues
Formants (energy concentrations in frequency bands): F0–F5.
Fundamental frequency (F0) – the pitch. Usually 80–200 Hz (men), 150–350 Hz (women).
F1 can vary from 300 Hz to 1000 Hz. The lower it is, the closer the tongue is to the roof of the mouth. /i:/ has the lowest F1 (∼300 Hz); /A/ has the highest F1 (∼950 Hz).
F2 can vary from 850 Hz to 2500 Hz. The F2 value is proportional to the front or back position of the tongue tip. In addition, lip rounding causes a lower F2 than with unrounded lips. /i:/ has an F2 of 2200 Hz, the highest F2 of any vowel; /u/ has an F2 of 850 Hz – the tongue tip is far back, and the lips are rounded.
Formants
Distribution of the main formants in vowels.
Typical spectrograms (1)
1. Monophthong vowels – a strong voicing, with stable formants.
2. Diphthong vowels – a strong voicing, but moving formants.
3. Approximants – voiced, but with less visible formants.
Typical spectrograms (2)
4. Nasals – low energy and a characteristic "zero" region.
5. Fricatives – a high-frequency Gaussian region.
6. Plosives – an explosive burst of acoustic energy, following a short period of silence.
7. Affricates – a plosive followed by a fricative.
3. Sub-phonemes
Pronunciation differences (co-articulation effects): phones have a great influence on neighbouring phones.
Tri-phone model: split each phone into one, two, or three parts, depending on the typical duration of that phone as well as how much that phone is influenced by surrounding phones.
Example. A tri-phone that represents the phoneme /E/ in "yes": $front$fric (/E/ in the context of a following fricative). In general /E/ can be decomposed into 17 tri-phones:
• 8 for the possible left contexts,
• 8 for the possible right contexts, and
• 1 context-independent (centre phone).
Context categories (1)
1. Front – front vowels and similar approximants.
2. Mid – central vowels and similar approximants.
3. Back – back vowels and similar approximants.
4. Sil – silence.
5. Nasal – nasals.
6. Retro – retroflex approximant and retroflex (r-coloured) vowels.
7. Fric – fricatives and affricates.
8. Other – plosives and remaining approximants.
Context categories (2)
EIASR Word and sentence recognition Lecture 11
1. Speech variability
1. Acoustic-phonetic variability – covers different accents, pronunciations, pitches, volumes, and so on.
2. Temporal variability – covers different speaking rates.
3. Syntactic variability (of sentences).
A useful simplification is to treat them independently. The temporal variability is the easiest to handle – an efficient algorithm for it is known as Dynamic Time Warping. Acoustic-phonetic variability is more difficult to model; it covers acoustic feature schemes and phonetic modelling. Syntactic variability deals with different word sequences representing the same sentence (meaning).
Dynamic Time Warping
Let Y be a prototype (reference pattern) and X the current observation (input pattern). Both Y and X are sequences of simple patterns. The distance D(X, Y) between the input and a reference pattern is defined despite the different lengths of the sequences. The goal of DTW is to find a path in the space of possible matches of both sequences that is optimal w.r.t. the cost function, i.e.
l* = arg min_l D(X, Y_l)
D(X, Y) is defined as the sum of local distances, d_ij = d(x_i, y_j), between corresponding elements on the solution path. The cost of a path leading to "state" (x_i, y_j) is:
R(x_i, y_j) = min { R(x_{i−1}, y_{j−1}) + d(x_{i−1}, y_{j−1}),
                    R(x_{i−1}, y_j) + d(x_{i−1}, y_j),
                    R(x_i, y_{j−1}) + d(x_i, y_{j−1}) }
DTW - Example
Matching a 3-segment input pattern (x1 x2 x3) with a 4-element model pattern (y1 y2 y3 y4).
An existing path can only be extended by one of the three allowed transitions:
– state(x_{i−1}, y_{j−1}) → state(x_i, y_j)
– state(x_{i−1}, y_j) → state(x_i, y_j)
– state(x_i, y_{j−1}) → state(x_i, y_j)
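The DTW recursion with the three allowed transitions can be sketched as follows (a minimal sketch; it charges the local distance at the current cell, a common textbook variant of the accumulated-cost recursion, and uses absolute difference as the default local distance):

```python
import math

def dtw_distance(X, Y, d=lambda x, y: abs(x - y)):
    """Accumulated DTW cost D(X, Y) between two sequences of possibly
    different lengths, using the three transitions from the slide."""
    n, m = len(X), len(Y)
    # R[i][j] = best cost of a path reaching (x_i, y_j); borders are inf
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = d(X[i - 1], Y[j - 1]) + min(R[i - 1][j - 1],  # diagonal
                                                  R[i - 1][j],      # advance input
                                                  R[i][j - 1])      # advance model
    return R[n][m]
```

Recognition then picks the reference pattern Y_l with the smallest `dtw_distance(X, Y_l)`.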
2. Speech recognition
Word recognition
In word recognition the question is: what is the most likely word represented by a given acoustic sequence? A word consists of a sequence of phonemes.
W* = arg max_word P(word | signal)
From the Bayes rule we get:
P(word | signal) = α · P(signal | word) · P(word)
Two prior stochastic models are required:
1. the acoustic-phonetic model – P(signal | word), and
2. the language model – P(word).
In particular, the signal is represented by n feature vectors:
P(word | c_1 c_2 ... c_n) = α · P(c_1 c_2 ... c_n | word) · P(word)
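Since the normalising factor α is the same for every candidate word, the arg max can be computed from the two models alone. A toy illustration (all words and scores are invented for the example):

```python
# Hypothetical acoustic likelihoods P(signal | word) and
# language-model priors P(word) for three candidate words.
acoustic = {"yes": 0.030, "guess": 0.025, "chess": 0.010}
prior    = {"yes": 0.60,  "guess": 0.30,  "chess": 0.10}

# Bayes rule: P(word | signal) = alpha * P(signal | word) * P(word);
# alpha cancels in the arg max, so it can be ignored.
best = max(acoustic, key=lambda w: acoustic[w] * prior[w])
```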
Sentence recognition
In sentence recognition the question is: what is the most likely sentence represented by a given sequence of words? A sentence consists of words from our lexicon.
S* = arg max_sentence P(sentence | words)
A sentence is a meaningful sequence of words; the signal is a sequence of observed words. From the Bayes rule we get:
P(sentence | words) = α · P(words | sentence) · P(sentence)
Two prior stochastic models are needed:
1. the application model – P(words | sentence), and
2. the language model – P(sentence).
Hidden Markov Model
A HMM is defined as a 5-tuple: HMM = (S, Y, Π, A, B), where
S = {S_1, S_2, ..., S_N} – the set of states,
Y – the set of output symbols,
Π – the start-state probability vector,
A = [a_ij] – the state transition probability matrix,
B = [b_jk] – the output probability matrix.
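The 5-tuple maps directly onto a data structure. A minimal sketch in Python (the field names and the `sample` method are my own additions; `sample` only illustrates how Π, A and B interact when a discrete HMM generates an observation sequence):

```python
from dataclasses import dataclass
import random

@dataclass
class HMM:
    """Discrete HMM = (S, Y, Pi, A, B), as defined on the slide."""
    states: list     # S = {S_1, ..., S_N}
    symbols: list    # Y - the output alphabet
    pi: list         # Pi - start-state probabilities, length N
    A: list          # A[i][j] = P(next state = S_j | current state = S_i)
    B: list          # B[j][k] = P(output = Y_k | state = S_j)

    def sample(self, T, rng=None):
        """Generate an observation sequence of length T."""
        rng = rng or random.Random(0)
        i = rng.choices(range(len(self.states)), weights=self.pi)[0]
        out = []
        for _ in range(T):
            out.append(rng.choices(self.symbols, weights=self.B[i])[0])
            i = rng.choices(range(len(self.states)), weights=self.A[i])[0]
        return out
```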
HMM in word recognition
Output probability model
1. Discrete scalar – a single output symbol is observed in one state:
P_j(O^t = Y_k | S_j) = b_jk ;  j = 1,...,N; k = 1,...,M; t = 1,...,T
2. Discrete vector – a vector of symbols per state is observed:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · P_k(c^t)
3. Semi-continuous distribution – a mixture of Gaussian (per-class) distributions:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · p_k(c^t | Ω_k) = Σ_{k=1}^M b_jk · N(c^t; μ_k, Σ_k)
4. Continuous – a mixture of different distributions per state and class:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · p_k(c^t | Ω_jk) = Σ_{k=1}^M b_jk · N(c^t; μ_jk, Σ_jk)
HMM in word recognition (2)
Structure: a left-to-right HMM.
Actions: INSert, SUBstitute, DELete.
Viterbi search (1)
Viterbi search – the goal is to find the best path of states in the HMM, w.r.t. the observed sequence:
(S_1 ... S_T)* = arg max_{S_1...S_T} P(S_1 ... S_T | c_1 c_2 ... c_T)
This is done iteratively, from t=1 to t=T, while extending every partial path (S_1 S_2 ... S_{t−1}) in the HMM model in a best way with respect to the observed sequence of feature vectors (c_1 c_2 ... c_t), according to the dynamic programming principle:
P(S_1 ... S_{t−1} S_t | c_1 c_2 ... c_t) = P(c_t | S_t) · max_{S_{t−1}} ( P(S_t | S_{t−1}) · P(S_1 ... S_{t−1} | c_1 ... c_{t−1}) )
In practice a logarithmic scale can be used:
ln P(S_1 ... S_{t−1} S_t | c_1 c_2 ... c_t) = ln P(c_t | S_t) + max_{S_{t−1}} ( ln P(S_t | S_{t−1}) + ln P(S_1 ... S_{t−1} | c_1 ... c_{t−1}) )
Viterbi search (2)
The search space in Viterbi search – consistent with the dynamic programming search – is a 2-D space indexed by time (observed sequence) and HMM state indices.
Recall that in a discrete HMM: a_ij = P(S_j | S_i); b_jk = P(Y_k | S_j).
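The iteration over the 2-D time-state space can be sketched as follows (a minimal sketch for a discrete HMM, working in the log domain and keeping back-pointers for the final backtracking step):

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely state path for a discrete HMM.

    pi[i]   - start probability of state i
    A[i][j] - transition probability i -> j
    B[j][k] - probability that state j emits symbol k
    obs     - observed symbol indices (length T)
    """
    N = len(pi)
    log = lambda p: math.log(p) if p > 0 else -math.inf
    # initialisation at t = 0
    V = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []                              # back-pointers, one row per step
    for t in range(1, len(obs)):
        back.append([max(range(N), key=lambda i: V[i] + log(A[i][j]))
                     for j in range(N)])
        V = [V[back[-1][j]] + log(A[back[-1][j]][j]) + log(B[j][obs[t]])
             for j in range(N)]
    # backtrack from the best final state
    path = [max(range(N), key=lambda j: V[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```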
Viterbi search (3)
Baum-Welch training (1)
The parameters of a discrete HMM model, λ = (Π, A, B), can be estimated by the Baum-Welch training (an EM-like approach that uses the "forward-backward" algorithm to obtain the current forward and backward "messages"). Given a discrete HMM model and an observation sequence O = (O_1, ..., O_T), the Baum-Welch training is an ML-estimation of the parameters λ:
P_HMM(λ) = log P(O | λ) = log Σ_{q ∈ S^T} P(O, q | λ)
where q is a state sequence of length T, corresponding to the observation O. The goal function:
λ^ = arg max_λ P_HMM(λ)
„Forward-backward” algorithm
Let N be the number of states.
The "forward term" – the probability of generating the partial sequence and ending up in the state with index j at time t:
α_t(j) = P(O_1 ... O_t, q_t = s_j | λ)
The "backward term" – the probability of generating the remainder of the sequence, given the state with index i at time t:
β_t(i) = P(O_{t+1} ... O_T | q_t = s_i, λ)
Thus the probability of visiting the state with index j at frame t, for a complete observation sequence O, is the product of the above probabilities:
α_t(j) · β_t(j) = P(O, q_t = s_j | λ)
Both terms are computed recursively:
α_t(j) = b_j(O_t) · Σ_{i=1}^N α_{t−1}(i) · a_ij
β_t(i) = Σ_{j=1}^N a_ij · b_j(O_{t+1}) · β_{t+1}(j)
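The two recursions can be sketched directly (a minimal sketch for the discrete case; `pi`, `A`, `B` are the HMM parameters as defined earlier and the observations are given as symbol indices):

```python
def forward_backward(pi, A, B, obs):
    """Forward and backward terms for a discrete HMM.

    alpha[t][j] = P(O_1..O_t, q_t = s_j)
    beta[t][i]  = P(O_{t+1}..O_T | q_t = s_i)
    """
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for j in range(N):                       # forward init, t = 0
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):                    # forward recursion
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j]
                                             for i in range(N))
    for i in range(N):                       # backward init, t = T-1
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):           # backward recursion
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    return alpha, beta
```

A useful sanity check: Σ_j α_t(j)·β_t(j) equals P(O | λ) for every t, in particular Σ_j α_T(j) at the last frame.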
„Baum-Welch” algorithm (1)
REPEAT
1) With current λ = {π_i, a_ij, b_jk | i,j = 1,...,N; k = 1,...,M} and t = 1,...,T, get α_t(j), β_t(i) by the "forward-backward" algorithm.
2) (The E-step) The expected probability of the transition s_i → s_j at time t, given the observation sequence O = (O_1, ..., O_T), is:
ξ_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) = P(q_t = s_i, q_{t+1} = s_j, O | λ) / P(O | λ)
         = α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) / Σ_{i=1}^N α_t(i) · β_t(i)
γ(i,j) – the expected probability that the transition from state i to state j appears, at any time from 1 to T:
γ(i,j) = Σ_t ξ_t(i,j)
„Baum-Welch” (2)
γ(i) – the expected probability that state i is visited, at any time from 1 to T:
γ(i) = Σ_j Σ_t ξ_t(i,j)
The expected probability that state i emits the symbol (class) Y_k:
γ(i, Y_k) = Σ_{t: O_t ∈ Y_k} Σ_j ξ_t(i,j)
The expected probability of visiting state S_i at a particular time t is:
γ_t(i) = P(q_t = S_i | O, λ) = α_t(i) · β_t(i) / Σ_{j=1}^N α_t(j) · β_t(j) = Σ_{j=1}^N ξ_t(i,j)
3) (The M-step) In this step we re-estimate the HMM parameters – the vector Π and the matrices A and B – by taking simple ratios between the above terms.
„Baum-Welch” (3)
The update rules for the parameters of a discrete HMM model:
π^_i = γ_1(i) = α_1(i) · β_1(i) / Σ_{j=1}^N α_1(j) · β_1(j) = Σ_{j=1}^N ξ_1(i,j)
â_ij = γ(i,j) / γ(i) = [ Σ_{t=1}^T ξ_t(i,j) ] / [ Σ_{t=1}^T γ_t(i) ]
     = [ Σ_{t=1}^T α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) ] / [ Σ_{t=1}^T α_t(i) · β_t(i) ]
b^_jk = γ(j, Y_k) / γ(j) = [ Σ_{t=1}^T γ_t(j) · χ(O_t ∈ Y_k) ] / [ Σ_{t=1}^T γ_t(j) ]
     = [ Σ_{t=1}^T α_t(j) · β_t(j) · χ(O_t ∈ Y_k) ] / [ Σ_{t=1}^T α_t(j) · β_t(j) ]
where χ(a ∈ Y) = 1 if the formula in brackets is satisfied and χ(·) = 0 otherwise.
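One full E-step plus M-step for a single observation sequence can be sketched as follows (a minimal sketch; it recomputes the forward and backward terms inline and then forms the M-step ratios described above):

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step for a discrete HMM
    (single observation sequence)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    # forward-backward
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j]
                                             for i in range(N))
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    PO = sum(alpha[T-1])                     # P(O | lambda)
    # E-step: state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: simple ratios of expected counts
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return pi_new, A_new, B_new
```

Iterating this step monotonically increases P(O | λ) until a local maximum of the likelihood is reached.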
3. Sentence recognition
Two recognition steps – many-word recognition is performed first (using a 2-layer HMM model) and alternative word hypotheses are generated; then sentence recognition is performed with an additional HMM layer that searches for a syntactically proper and semantically meaningful word sequence.
Word lattice
The word recognition stage generates 1-, 2- or 3-syllable-based (partially alternative) word hypotheses.
Example: a word lattice over syllable positions 0–10.
N-grams
The language model provides the prior probability of a given word sequence, P(words). For n words (s_1 ... s_n) we have:
P(s_1 ... s_n) = P(s_1) · P(s_2 | s_1) ··· P(s_n | s_1 ... s_{n−1}) = Π_{i=1}^n P(s_i | s_1 ... s_{i−1})
In practice n needs to be of limited length. The bigram model – the probability of the next word depends only on the direct predecessor word:
P(s_i | s_1 ... s_{i−1}) ≈ P(s_i | s_{i−1})
P(s_1 ... s_n) = P(s_1) · P(s_2 | s_1) ··· P(s_n | s_{n−1}) = Π_{i=1}^n P(s_i | s_{i−1})
The distribution P(s_i | s_{i−1}) can be learned by counting the relative frequencies of word pairs in a large learning set.
N-gram – the N-th word depends on the N−1 predecessor words.
Smoothing an N-gram
Katz' smoothing for a three-gram model:
P*(w_i | w_{i−1}, w_{i−2}) = C(w_{i−2}, w_{i−1}, w_i) / C(w_{i−2}, w_{i−1}),   for r > k
P*(w_i | w_{i−1}, w_{i−2}) = d_r · C(w_{i−2}, w_{i−1}, w_i) / C(w_{i−2}, w_{i−1}),   for 0 < r ≤ k