Image and speech recognition
Lecture notes 2012
Włodzimierz Kasprzak
Project is co-financed by European Union within European Social Fund
CONTENT
Part I – Pattern recognition
1. Pattern recognition  5 - 42
2. Pattern transformation  43 - 77
3. Pattern classification  78 - 114
Part II – Image recognition
4. Image pre-processing  115 - 157
5. Boundary-based image segmentation  158 - 181
6. Region-based image segmentation  182 - 213
7. Model-based object recognition  214 - 231
W. Kasprzak: EIASR
2
CONTENT (2)
Part III – Speech recognition
8. Speech signal pre-processing  232 - 250
9. Acoustic feature detection  251 - 274
10. Phonetic speech model  275 - 289
11. Word and sentence recognition  290 - 314
GENERAL REFERENCES
1. W. Kasprzak: Rozpoznawanie obrazów i sygnałów mowy. WPW, Warszawa, 2009.
2. R. Duda, P. Hart, D. Stork: Pattern Classification. 2nd edition, John Wiley & Sons, New York, 2001.
3. H. Niemann: Klassifikation von Mustern. 2nd edition, Springer, 2003.
4. I. Pitas: Digital Image Processing Algorithms and Applications. Prentice Hall, New York, 2000.
5. D. Paulus, J. Hornegger: Applied Pattern Recognition. A Practical Introduction to Image and Speech Processing in C++. 3rd edition, Vieweg, Braunschweig, 2001.
6. J. Benesty, M.M. Sondhi, Y. Huang (eds): Handbook of Speech Processing. Springer, Berlin Heidelberg, 2008.
EIASR Pattern recognition Lecture 1
1. Introduction
The GOAL of pattern recognition:
• to analyse (to recognize, to understand) and describe the content of a given pattern (digital image, speech signal, etc.).
Example: the difference between computer graphics (image synthesis) and the image analysis process:
Pattern
Patterns – sets of multi-variable functions that express some physical entity, object, system, etc.
Domain of patterns: Ω = { f_r(x) | r = 1, 2, … } ⊂ U
Each pattern is a vector function of n variables:
f_r(x) = [ f_r1(x1, x2, …, xn), f_r2(x1, x2, …, xn), …, f_rm(x1, x2, …, xn) ]ᵀ
Example: some patterns (in bitmap form) from the domain of characters:
Pattern recognition approaches
Complexity of patterns → pattern recognition approach:
1. Simple patterns → template matching, classification.
2. Complex patterns (a sequence or set of simple patterns) → pattern-sequence or pattern-structure model-to-data matching.
3. Abstract patterns (e.g. a dynamic scene, continuous speech) → pattern understanding.
„2-D image and speech recognition”:
• patterns represented by speech or 2-D image signals,
• deals with simple and complex pattern recognition.
Pattern understanding Abstract pattern recognition (pattern understanding):
Signal, 2-D image
A signal is a function that relates one variable to another, e.g. a = s(t). Signal variables:
• Amplitude – the function value; may represent voltage, light intensity, sound pressure, etc.
• Time – the typical function parameter (signal in the time domain); other parameters are used in specific applications.
2-D image – a signal in which the parameter „time” takes the form of a 2-D space, e.g. a = I(x, y) (signal in the spatial domain).
Image scan
Scanning: converting a 2-D image to a 1-D signal. Common scan orders: line scan, aerial („zig-zag”) scan, Hilbert scan.
Line scan – not the best one, as local continuity is not preserved.
Aerial („zig-zag”) scan – preserves locality but prefers one direction.
2. 2-D image analysis
2-D image analysis systems contain different analysis processes at different data abstraction levels:
1. Signal level: image preprocessing (filtering) – image restoration, image enhancement.
Example: source separation from mixtures – (a) a finite number of mixtures, (b) reconstructed sources.
2-D image analysis (2) 2. Iconic level: image compression, image registration (normalization), classification of entire image. Example: image normalization w.r.t. pattern orientation:
(a) Input image, (b) background suppression, (c) foreground border image, (d) rotated (normalized) image.
2-D image analysis (3) 3. Segmentation level: boundary detection, region detection, texture classification, depth or motion detection. Example. Detection of characters in license plate regions:
(a) Motion mask, (b), (c) two regions of interest (ROI), (d) regions corresponding to characters.
2-D image analysis (4) 4. Object level: model-based 2-D and 3-D object recognition. Example. Palm recognition:
(a) 2-D palm model, (b) 3-D palm model, (c) contour detection, (d) discrete feature detection, (e) feature-to-model matching.
3. Speech recognition Typical problems (steps) in a speech recognition system:
Speech recognition (2) Problems (steps) (cont.):
Simple speech classification
Sometimes a simple speech recognition system will do, for example:
• for controlling a device that needs only a few commands,
• for selecting a phone number by voice.
The structure of a simple word recognition system:
1. Signal acquisition and VAD (voice activity detection).
2. Spectrogram analysis with a fixed number of frames (e.g. 40 frames of 512 samples, giving a spectrogram of 10,240 points).
3. Approximation of the spectrogram image – reducing the image resolution to 640 points (40 × 16).
4. In the learning stage: modelling words from the dictionary by corresponding average low-resolution spectrogram images.
5. Word recognition via direct image classification.
Spectrogram image classifier (2) Example. Four low-resolution spectrogram images of the spoken words: (a) „koniec” and (b) „OK”.
The coding of spectrogram values: black – the highest amplitude, blue – middle range, yellow – low range, white – near-zero range.
Spectrogram image classifier (3)
Example. Spectrograms of spoken numbers in Polish. Significant differences are visible.
4. Probability theory
The purpose of applying probability theory in pattern recognition:
• to model the uncertainty of sensor data and of processing results (for example, by means of noise),
• the methods of error correction (normalisation, filtering) require statistics of the error sources,
• stochastic distance measures are useful for judging pattern classification and model-based recognition results.
Discrete stochastic variable
A non-negative probability density function (pdf) of a discrete stochastic variable X: P(X=x) = p_X(x).
Probability of events specified by an interval of values:
P(A ≤ X ≤ B) = ∑_{A≤x≤B} p_X(x)
The cumulative distribution F_X(x) of a stochastic variable X:
F_X(x) = P(X ≤ x) = ∑_{z≤x} p_X(z)
Mean and variance of a pdf:
µ_X = ∑_{x∈dom(X)} x · p(x)
σ_X² = ∑_{x∈dom(X)} (x − µ_X)² · p(x)
Example
The distribution of pixel values (e.g. intensities, colours) in an image can be modelled as a discrete stochastic variable. The normalized histogram of an image represents the relative frequencies of pixel values in a given image: (a) an image, and (b) its intensity histogram.
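As a quick check of these definitions, the normalized histogram of a small array can be treated as an empirical pdf, with the discrete mean and variance computed from it. A minimal NumPy sketch; the tiny 2×3 "image" and the helper names are illustrative, not from the lecture:

```python
import numpy as np

def histogram_pdf(image, levels):
    """Normalized histogram: an empirical pdf p_X(x) of the pixel values."""
    counts = np.bincount(image.ravel(), minlength=levels)
    return counts / counts.sum()

def pdf_mean_var(p):
    """Mean and variance of a discrete pdf over the values 0..len(p)-1."""
    x = np.arange(len(p))
    mu = np.sum(x * p)
    var = np.sum((x - mu) ** 2 * p)
    return mu, var

img = np.array([[0, 1, 1], [2, 3, 1]])   # a tiny 2x3 "image", intensities 0..3
p = histogram_pdf(img, levels=4)          # p = [1, 3, 1, 1] / 6
mu, var = pdf_mean_var(p)
```

The same two helpers apply unchanged to a real intensity image with `levels=256`.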
Continuous stochastic variable
The probability of a given (exact) realization x of a continuous stochastic variable X, P(X=x), is infinitesimally small. It is meaningful to consider intervals of values:
P(A ≤ X ≤ B) = ∫_A^B p_X(x) dx
The cumulative distribution: F_X(x) = ∫_{−∞}^x p_X(z) dz
The pdf of a Gaussian (normally distributed) variable:
p_X(x | µ, σ) = 1/(σ √(2π)) · exp( −(x − µ)² / (2σ²) )
Mean and standard deviation (µ and σ) are the only two parameters of this distribution.
Continuous vs. discrete variable
• Discrete stochastic variables represent measurable (observable) entities, i.e. we deal with a finite set of observations – the stochastic features of the observed variable usually change with every new observation.
• Probability theory with continuous variables models the hidden stochastic process. Its features may be fixed in time.
Moments – the features of a pdf:
• the k-th absolute moment of p: m_k(p) = ∫ z^k · p(z) dz
• the k-th central moment of p: µ_k(p) = ∫ (z − m_1(p))^k · p(z) dz
A vector variable
Let X = [X1, X2, …, Xn]ᵀ be a vector of stochastic variables.
The mean vector: E{X} = [E{X1}, E{X2}, …, E{Xn}]ᵀ
The covariance matrix Σ, with entries:
σ_{i,j}(X) = E{ (X_i − E{X_i})(X_j − E{X_j}) }
Example. The n-dimensional Gaussian distribution:
p_X(x) = 1/√( (2π)ⁿ |det(Σ)| ) · exp( −[x − µ]ᵀ Σ⁻¹ [x − µ] / 2 )
where “det” is the determinant of a matrix.
Bayes rule
Chain rule for n stochastic variables:
p_X([x1, x2, …, xn]) = p_X1(x1) · p_X2|X1(x2 | x1) · p_X3|X1,X2(x3 | x1, x2) · … · p_Xn|X1,…,Xn−1(xn | x1, x2, …, xn−1)
For n stochastically independent variables it holds:
p_X([x1, x2, …, xn]) = Π_{i=1}^{n} p_Xi(xi)
The Bayes rule relates conditional probability to the joint distribution:
p_X1|X2(x1 | x2) = p_X1,X2(x1, x2) / p_X2(x2)
Hence: p_X1|X2(x1 | x2) · p_X2(x2) = p_X2|X1(x2 | x1) · p_X1(x1)
Entropy
Duality of information content and probability:
• low information of event (X=xi): high probability of the event,
• high information of event (X=xi): low probability of the event.
Information content of some realization xi of variable X:
I(xi) = −log_a P(xi)
Entropy of X – the mean information content of variable X:
H(X) = −∑_{xi∈X} P(xi) · log_a P(xi)
4. Sampling and digitalization
Signal acquisition and analogue-to-digital (A/D) signal conversion:
• sampling (in time or space) and amplitude digitalization.
Sampling
Sampling theorem: if the sampling frequency of signal f(x) is at least twice the maximum frequency component B_x of this signal (spectrum within (−B_x, B_x)), then full reconstruction of the original analogue signal from its sample set is possible: f_j = f(j·∆x), j = 0, ±1, ±2, …, where the sampling period satisfies ∆x ≤ 1/(2B_x).
Duality of signal representation in the time (space) domain f(t) and the frequency domain F(ω).
Digitalization of amplitude
A sequence of real-valued (analogue) samples: { f_j }.
The corresponding digital-valued samples: { h_j }.
The digitalisation error („noise”): n_j = f_j − h_j.
The quality of the digitalisation – the signal-to-noise ratio:
SNR = 10 log10( E{f_j²} / E{n_j²} )
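The SNR definition can be illustrated with a uniform mid-rise quantizer (a sketch under assumed parameters: 8-bit quantization of samples uniform on [−1, 1); for this case the usual "about 6 dB per bit" rule predicts roughly 48 dB):

```python
import numpy as np

def snr_db(analog, digital):
    """SNR = 10 log10( E{f^2} / E{n^2} ), with noise n = f - h."""
    analog = np.asarray(analog, dtype=float)
    noise = analog - np.asarray(digital, dtype=float)
    return 10.0 * np.log10(np.mean(analog ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
f = rng.uniform(-1.0, 1.0, 10000)           # analogue samples
step = 2.0 / 256                            # 8-bit uniform quantizer on [-1, 1)
h = np.floor(f / step) * step + step / 2    # mid-rise quantization to bin centers
snr = snr_db(f, h)                          # close to 6.02 dB/bit * 8 bits
```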
PCM coding
PCM coding („pulse code modulation”): optimal digitalization (amplitude intervals and quantization levels).
Error measure:
ε = ∑_{i=1}^{L} ∫_{a_i}^{a_{i+1}} (f − b_i)² p(f) df
Setting the derivatives to zero:
∂ε/∂b_i = ∫_{a_i}^{a_{i+1}} −2(f − b_i) p(f) df = 0
∂ε/∂a_i = (a_i − b_{i−1})² · p(a_i) − (a_i − b_i)² · p(a_i) = 0
Solution – intervals a_i and levels b_i:
a_i = (b_{i−1} + b_i) / 2, i = 2, 3, …, L = 2^B
b_i = [ ∫_{a_i}^{a_{i+1}} f · p(f) df ] / [ ∫_{a_i}^{a_{i+1}} p(f) df ], i = 1, 2, …, L.
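The two optimality conditions (levels at the centroids of their intervals, boundaries at midpoints between adjacent levels) suggest the classical Lloyd-Max iteration, sketched here on empirical samples; the Gaussian test data, L = 4, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def lloyd_max(samples, L=4, iters=100):
    """Iterate the two optimality conditions: boundaries a_i are midpoints
    between adjacent levels, levels b_i are centroids of their intervals."""
    samples = np.sort(np.asarray(samples, dtype=float))
    # initialize the L levels from sample quantiles
    b = np.quantile(samples, (np.arange(L) + 0.5) / L)
    a = (b[:-1] + b[1:]) / 2.0
    for _ in range(iters):
        a = (b[:-1] + b[1:]) / 2.0              # midpoint boundaries
        idx = np.searchsorted(a, samples)       # assign samples to intervals
        b = np.array([samples[idx == i].mean() if np.any(idx == i) else b[i]
                      for i in range(L)])       # centroid (empirical mean) levels
    return a, b

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 20000)
a, b = lloyd_max(x, L=4)   # for N(0,1) the optimal levels are near +/-0.45, +/-1.51
```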
5. Optimization theory elements
LSE (least square error), 2-D case: establish a linear relation between two quantities, y and x: y = a·x + b, or a·x + b − y = 0, given N observations:
a·x1 + b − y1 = ε1
a·x2 + b − y2 = ε2
…
a·xN + b − yN = εN.
Goal function: U(a, b) = ε1² + ε2² + … + εN² = ∑ εi² = |ε|²
Minimization of U(a, b): ∂U/∂a = 0, ∂U/∂b = 0.
Solution:
[a, b]ᵀ = 1/(N·∑xi² − (∑xi)²) · [ N, −∑xi ; −∑xi, ∑xi² ] · [ ∑xi·yi, ∑yi ]ᵀ
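A minimal sketch of this closed-form solution (the exact-line test data is illustrative, chosen so the residuals are zero):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form LSE solution of the 2x2 normal equations for y = a*x + b."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    det = N * np.sum(x ** 2) - np.sum(x) ** 2
    a = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / det
    b = (np.sum(x ** 2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / det
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0            # exact line, so all residuals epsilon_i are zero
a, b = fit_line(x, y)
```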
Moore-Penrose pseudo-inverse
The Moore-Penrose pseudo-inverse gives a general solution to a system of m linear equations with n unknowns:
A y = b, b ∈ Rᵐ, y ∈ Rⁿ, A ∈ R^{m×n}
The solution is: y = A† b, where the matrix A† is the Moore-Penrose pseudo-inverse. When A is of full rank, A† can be calculated directly:
• case m = n: A† = A⁻¹
• case m > n: the general linear LS solution, which minimizes the squared error (second-order norm) |b − Ay|²: A† = (Aᵀ A)⁻¹ Aᵀ
• case m < n: (it minimizes the norm |y|²): A† = Aᵀ (A Aᵀ)⁻¹
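The two rectangular full-rank cases can be checked numerically against `numpy.linalg.pinv` (random test matrices, which are full rank with probability one; an illustration, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# case m > n: the least-squares solution, A_dag = (A^T A)^-1 A^T
A = rng.normal(size=(5, 3))            # full column rank with probability one
A_dag = np.linalg.inv(A.T @ A) @ A.T

# case m < n: the minimum-norm solution, B_dag = B^T (B B^T)^-1
B = rng.normal(size=(3, 5))            # full row rank with probability one
B_dag = B.T @ np.linalg.inv(B @ B.T)
```

Both closed forms agree with the SVD-based pseudo-inverse computed by `np.linalg.pinv`.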
Nonlinear LSE (1)
Assume there is some number N of data samples: (xi, yi) ∈ R².
The relation between variables x and y is approximated by a parametric non-linear function y = f(x | p), where p ∈ Rⁿ, N ≥ n, is the parameter vector.
Goal: find the minimum of the goal (objective) function:
Φ(p) = ½ |r(p)|² = ½ ∑_{i=1}^{N} r_i²(p) = ½ ∑_{i=1}^{N} [yi − f(xi | p)]²
The gradient of the goal function:
∇Φ(p) = ∑_{i=1}^{N} r_i(p) ∇r_i(p) = J(p)ᵀ r(p)
The Jacobi matrix: {J(p)}_ij = ∂r_i(p)/∂p_j

Nonlinear LSE (2)
Second-order derivatives are collected in the Hessian:
H(p) = ∇²Φ(p) = J(p)ᵀ J(p) + ∑_{i=1}^{N} r_i(p) ∇²r_i(p)
For small errors r it is approximated as: ∇²Φ(p) ≈ J(p)ᵀ J(p)
Solutions of the nonlinear LSE problem:
1) Gradient descent search – an iterative first-order update rule:
p_{i+1} = p_i − λ ∇Φ(p_i)
2) Gauss-Newton search – a second-order update rule:
p_{i+1} = p_i − (∇²Φ(p_i))⁻¹ ∇Φ(p_i)
3) Levenberg-Marquardt algorithm – uses a combined rule:
p_{i+1} = p_i − (H(p_i) + λ diag(H(p_i)))⁻¹ ∇Φ(p_i)
Linear programming
Linear programming (LP), or linear optimization:
• optimization of a linear objective function,
• subject to linear equality and linear inequality constraints.
The solution is within a convex polyhedron, a set defined as the intersection of finitely many half-spaces.
Linear programming problems in canonical form:
• Maximize the objective function: x̂ = arg max_x cᵀ x
• Subject to constraints: A x ≤ b, where x ≥ 0.
LP in MATLAB (Optimization Toolbox): BINTPROG, LINPROG.
Quadratic programming
In Quadratic Programming (QP) the goal (objective) function is quadratic and the constraints are linear:
x̂ = arg min_{x∈Rⁿ} f(x) = ½ xᵀ H x + tᵀ x
subject to:
a_iᵀ x = b_i, i = 1, …, m
c_iᵀ x ≥ d_i, i = 1, …, p
where H is a strictly positive-definite matrix of dimension n×n, t is a vector of length n, a_i and c_i are vectors of length n, and b_i, d_i are scalars.
In MATLAB – the quadprog routine.
Quadratic programming (2)
Simple case: only equality constraints appear in the QP problem: A x = b.
If the number of equality constraints is the same as the number of independent variables (m = n), the constraints uniquely determine the solution to the QP problem: x = A⁻¹ b.
If the number of equality constraints is smaller than the number of independent variables (m < n), […] n eigenvalues (λ1 > λ2 > … > λn) and n corresponding eigenvectors.
2. PCA
The transformation optimizing s1 is Principal Component Analysis (also called the Karhunen-Loeve transform):
1) Get the zero-mean covariance matrix Q⁽¹⁾ of samples {jx}.
2) Obtain its eigenvectors by the eigenvector decomposition or SVD algorithm:
Q⁽¹⁾ = ∑_{r=1}^{n} λ_r φ_r φ_rᵀ = U Λ Uᵀ
where U is the matrix of eigenvectors φ_r of Q⁽¹⁾ and Λ is the diagonal matrix of positive eigenvalues λ_r, while n is the rank of the matrix Q⁽¹⁾.
3) Select the feature space dimension, m ≤ n. The feature vector jc_m corresponding to sample jx is:
jc_m = Λ_m U_mᵀ jx (non-normalized feature) or jc_m = Λ_m^{−1/2} U_mᵀ jx (normalized).
PCA (2)
PCA: a rotation of an orthogonal coordinate system that orders the axes to capture the largest variances of the data samples. If n is the input vector size, then some m < n principal components represent the interesting structure, while those with lower variances represent noise.
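The three PCA steps can be sketched as follows (NumPy illustration; the anisotropic test data and helper names are assumptions; the plain projection U_mᵀx is used for the unnormalized features here, and Λ_m^{−1/2} U_mᵀ x for the normalized ones as in step 3):

```python
import numpy as np

def pca_features(X, m):
    """PCA on the rows of X (one sample per row): eigendecomposition of the
    zero-mean covariance matrix, keeping the m largest-variance directions."""
    Xc = X - X.mean(axis=0)
    Q = Xc.T @ Xc / len(X)                 # zero-mean covariance matrix
    lam, U = np.linalg.eigh(Q)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]      # take the m largest
    lam, U = lam[order], U[:, order]
    C = Xc @ U                             # projections onto the eigenvectors
    C_white = C / np.sqrt(lam)             # normalized (unit-variance) features
    return C, C_white, lam

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 2)) * np.array([3.0, 1.0])   # variances ~9 and ~1
C, C_white, lam = pca_features(X, m=2)
```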
Inverse-PCA transform
Let dimension(c_m) = m and dimension(x) = n. In PCA usually m < n, and most often m << n. […]

[…] (k > 2)
The within-class variability matrix:
S_W = ∑_{i=1}^{k} S_i = ∑_{i=1}^{k} ∑_{j=1}^{N_i} (ix_j − m_i)(ix_j − m_i)ᵀ
where N_i is the number of samples from class i = 1, …, k.
The between-class variability matrix:
S_B = ∑_{i=1}^{k} N_i (m_i − m)(m_i − m)ᵀ
where m is the mass centre of all the samples.
General LDA
Definition. The linear Fisher discriminant function is: y = W_o x, where
W_o = arg max_W J(W), and J(W) = |Wᵀ S_B W| / |Wᵀ S_W W|.
In LDA we apply the W_o that induces the maximum ratio between the determinants of the between-class variability matrix and the within-class variability matrix of the samples.
Solution. The rows of matrix W_o are eigenvectors of the matrix S_W⁻¹ S_B:
S_W⁻¹ S_B W_o = Λ W_o
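A sketch of this solution via the eigenvectors of S_W⁻¹ S_B (two synthetic Gaussian classes separated along the first axis; the data and helper names are illustrative assumptions):

```python
import numpy as np

def fisher_lda(Xs):
    """Fisher LDA directions from the eigen-decomposition of SW^{-1} SB.
    Xs is a list of per-class sample arrays (one sample per row)."""
    m = np.vstack(Xs).mean(axis=0)                       # mass centre of all samples
    SW = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in Xs)
    SB = sum(len(X) * np.outer(X.mean(axis=0) - m, X.mean(axis=0) - m) for X in Xs)
    lam, V = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(lam.real)[::-1]                   # largest eigenvalue first
    return V.real[:, order], lam.real[order]

rng = np.random.default_rng(4)
X1 = rng.normal(size=(200, 2)) + np.array([3.0, 0.0])    # class 1
X2 = rng.normal(size=(200, 2)) + np.array([-3.0, 0.0])   # class 2
W, lam = fisher_lda([X1, X2])
w = W[:, 0] / np.linalg.norm(W[:, 0])   # dominant discriminant direction (~ x1 axis)
```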
4. The inverse problem
Instantaneous-time inverse problem:
• m unknown sources (e.g. signals in time), s[t] = [s1[t], …, sm[t]]ᵀ, or a set of m-dimensional data samples,
• n observed, possibly noisy, different, linear mixtures of the sources (n ≥ m): x[t] = [x1[t], …, xn[t]]ᵀ,
• the mixing coefficients are unknown constants:
x[t] = A s[t] + n[t] = ∑_{i=1}^{m} s_i[t] a_i + n[t]
Possible solutions
Goal: find the unknown matrix A and reconstruct the sources. Possible solutions:
1. Independent component analysis
2. Projection pursuit
3. Factor analysis
4. Principal component analysis
Factor analysis, Whitening
Factor analysis (FA) assumes that the sources (called factors) are uncorrelated, of Gaussian distribution and of unit variance: E{s sᵀ} = I.
The noise components are assumed to be uncorrelated with each other and with the factors: Q = E{n nᵀ}.
With the above assumptions the covariance matrix of the observation is: E{x xᵀ} = R_xx = A Aᵀ + Q.
Assuming Q is known or can be estimated, FA attempts to solve for A from: A Aᵀ = R_xx − Q.
Without noise the problem simplifies to whitening: A Aᵀ = R_xx. It is a specific case of PCA, where every „direction” in space is of unit variance.
ICA for feature extraction
Application of ICA (independent component analysis): a coding scheme – a signal (image block) x_i is represented by a feature vector that consists of the mixing coefficients (a_i1, a_i2, …, a_im) – the block is a mixture of some fixed set of „standard” image blocks (independent components) s_1, s_2, …, s_m.
Assumptions in ICA:
1. The sources are stochastically independent w.r.t. functions f, g that capture higher-order signal statistics, i.e. for y1, y2 it holds:
E{f(y1) g(y2)} = E{f(y1)} · E{g(y2)}
2. At most one of the sources is of Gaussian distribution.
Example
Example [Hyvarinen(2000)]. Two independent sources with uniform pdf:
p(s_i) = 1/(2√3) if |s_i| ≤ √3, and 0 otherwise.
The joint distribution of (s1, s2) is uniform on a square. The mixing matrix is:
A0 = [ 2 3 ; 2 1 ]
The mixture distribution (x1, x2) is uniform on a parallelogram.
Example (2)
Whitening: a linear transform so that the new mixtures are uncorrelated and have a variance of one.
PCA: finds orthogonal main directions → no separation.
ICA: source separation by generalized independence.
FastICA (1)
FastICA (Hyvarinen, Karhunen, Oja) is a batch method for ICA. It requires that the mixtures are centered and whitened first.
Centering: x_centered = x − E{x} = x − m_x
Whitening: E{x xᵀ} = R_xx = U Λ Uᵀ, x_whitened = Λ^{−1/2} Uᵀ x
Then the method iteratively updates the weight matrix W – vector by vector, while maximizing the non-Gaussianity of the projection wᵀx (see next page).
2. Pattern transformation
67
FastICA (2)
A. Initialize a nonzero matrix: W = [w1, w2, …, wn].
B. Iterate FOR outputs p = 1, 2, …, n DO:
1) Weight update:
w_p⁺ = E{ x g(w_pᵀ x) } − E{ g′(w_pᵀ x) } w_p
where g is a nonlinear function and g′ its first derivative.
2) Normalize to unit length: w_p = w_p⁺ / |w_p⁺|
3) Gram-Schmidt orthogonalization and normalization:
w_p′ = w_p − ∑_{j=1}^{p−1} (w_pᵀ w_j) w_j
w_p = w_p′ / |w_p′|
4) IF W has not yet converged THEN next iteration of 1)-4).
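A compact deflationary FastICA sketch with the kurtosis non-linearity g(u) = u³ (so g′(u) = 3u²); the fixed iteration count (no convergence test), the sub-Gaussian uniform sources and the mixing matrix are illustrative assumptions:

```python
import numpy as np

def fastica(Z, n_iter=200, seed=0):
    """Deflationary FastICA on whitened mixtures Z (one row per mixture)."""
    n, N = Z.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for p in range(n):
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            u = w @ Z
            # w+ = E{x g(w^T x)} - E{g'(w^T x)} w, with g(u) = u^3
            w_new = (Z * u ** 3).mean(axis=1) - 3.0 * (u ** 2).mean() * w
            w_new -= W[:p].T @ (W[:p] @ w_new)   # Gram-Schmidt vs found rows
            w = w_new / np.linalg.norm(w_new)
        W[p] = w
    return W

rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))  # unit-variance sources
A = np.array([[2.0, 3.0], [2.0, 1.0]])
X = A @ S
# center and whiten first, as FastICA requires
Xc = X - X.mean(axis=1, keepdims=True)
lam, U = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
V = np.diag(lam ** -0.5) @ U.T
W = fastica(V @ Xc)
P = W @ V @ A     # should approach a signed permutation matrix
```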
5. Blind source separation
BSS methods solve the inverse problem in a way similar to ICA, but in BSS there is a focus on “on-line” reconstruction of the sources. Application example: solving the „cocktail party” problem [Cichocki, Amari (2001)].
Demixing in BSS
Demixing in BSS: an m × n separating matrix W(t) is updated so that the m-vector y(t) = W(t) x(t) becomes an estimate, y(t) ≈ s(t), of the original independent sources, up to scaling and permutation indeterminacy. If source scaling (S) and permutation (P) are resolved, then:
W A = S P I
BSS (1)
The Kullback–Leibler divergence measures the dependency among output signals (mutual information):
D(y || {y_k}; W) = ∫ p(y) log[ p(y) / Π_{k=1}^{K} p(y_k) ] dy
D(y || {y_k}; W) = −H(y; W) + ∑_{k=1}^{K} H(y_k); D(y || {y_k}; W) ≥ 0
H(y; W) – the average information of the joint output y. The H(y_k) are entropies of the marginal distributions – their sum is constant – only the information related to the joint output distribution, H(y; W), can be optimized here.
The BSS optimization rule: arg min_W D(y || {y_k}; W)
BSS (2)
The adaptive separation rule, generated according to the natural gradient, is:
W(t+1) = W(t) + η(t) { I − f[y(t)] · g[yᵀ(t)] } · W(t)
f(y) = [f(y1), …, f(yn)]ᵀ and g(yᵀ) = [g(y1), …, g(yn)] are vectors of non-linear activation functions, which approximate higher-order moments of the signals.
If the source has a negative normalized kurtosis value (i.e. it is a sub-Gaussian signal), we choose: f(y_i) = y_i³, g(y_j) = y_j. For a super-Gaussian source with positive kurtosis, we choose: f(y_i) = tanh(α y_i), g(y_j) = y_j.
Example Sound separation: one unknown source, one mixture and one separated signal are shown:
Example Four unknown sources:
Four mixtures with added noise:
Separated signals:
Measuring the separation quality
(A) If the source signals are known
For every pair (output Y_i, source S_j), with amplitudes scaled to the interval, compute the MSE[i, j] coefficient – the error of approximating S_j by Y_i, i = 1, …, n; j = 1, …, m:
MSE[i, j] = (1/N) ∑_{k=0}^{N−1} [S_j(k) − Y_i(k)]² = E{(S_j − Y_i)²}
where N is the number of samples. A matrix P = [p_{i,j}] is created, with p_{i,j} = 1/SNR[i, j].
Every row i of P is scaled, P̃ = Norm(P), such that ∀i (max_j (p̃_{i,j}) = 1); every column j is scaled, P̂ = NormCol(P), such that ∀j (max_i (p̂_{i,j}) = 1).
The error index:
EI(P) = (1/m) ( ∑_i ∑_j |p̃_{i,j}| − n ) + (1/n) ( ∑_j ∑_i |p̂_{i,j}| − m )
Measuring the quality (2)
(B) If the mixing matrix A is known
The error index for the set of separated sources is given as:
EI(P) = (1/m) ( ∑_i ∑_j |p̃_{i,j}| − n ) + (1/n) ( ∑_j ∑_i |p̂_{i,j}| − m )
The entries p_{i,j} of the matrix P = W A are normalized along rows (i = 1, …, n) (for the first expression) or columns (j = 1, …, m), such that ∀i (max_j p̃_{i,j} = 1) or ∀j (max_i p̂_{i,j} = 1).
Ideal case of perfect separation:
• P becomes a permutation matrix.
• Only one element in each row and column equals unity, and all other elements are zero.
• Then the minimum EI = 0.
Measuring the quality (3)
(C) If both the sources and the mixing matrix are unknown
The normalised mutual correlation coefficients
r_{i,j} = ⟨f(y_i) · g(y_j)⟩ / ( |f(y_i)| |g(y_j)| )
are computed for every pair of output signals y_i and y_j, giving the square matrix P̃ = [r_{i,j}]. The error index for the set of separated sources is computed as:
EI(P̃) = (1/n) ( ∑_i ∑_j |r_{i,j}| − n )
EIASR Pattern classification Lecture 3
1. Numeric classifiers
Optimization criteria for classifiers:
1. Minimum computational complexity → linear discriminant classifiers.
2. Minimum of misclassification probability → Bayes classifier.
3. Ability to automatically learn a decision rule → multilayer perceptron.
4. Ability to generalise knowledge to unavailable samples → “Support Vector Machine”.
5. Minimization of general risk → ???
Decision theory approach
A. Make assumptions about the analysed domain – for example, learn the prior pdf-s.
B. Define the costs and risk of misclassification (decision).
C. Define a decision rule which minimises this risk.
Stochastic classifier – the analysed domain and the decision rule are expressed in terms of stochastic variables. Let us observe:
• Each decision induces individual costs.
• After many decisions the average (expected) costs can be estimated – the risk of misclassification.
• The calculation of average values requires the availability of full statistical information (pdf-s).
Risk minimization
(A) Assumptions about the analysed domain:
• Pdf of observation c given class k: p(c | k).
• Prior probabilities { p_k (k = 1, …, K) } of patterns from the classes.
(B) Calculation of the risk V(kc) for a pattern kc from class k:
1. Costs of misclassification: { r_ik = r(i | k), (i, k = 1, 2, …, K) }.
2. Obtain the probability of a pattern from class k: p(kc | k) p_k.
3. The risk of pattern kc: V(kc) = ∑_k ∑_i ( p_k p(kc | k) r_ik ).
(C) Decision rule:
1. Minimise the risk V. If a binary cost function is used,
r_kk = 0, r_ik = 1 for i ≠ k, i, k = 1, …, K,
then we need to maximize p(k | kc).
2. Select the class k with the highest p(k | c): i = arg max_k p(k | c).
2. Potential function-based classifier
• Potential functions as class distributions in feature space;
• all functions from the same parametric family of functions.
Linear potential function family:
d_i(c, a) = { a_iᵀ φ(c) | a_i ∈ R^{n+1} }, (i = 1, …, m)
φ(c) = (1, c1, c2, …, cn)ᵀ
Linear potential function-based classifier, two-class case. Get the potentials:
d1(c, a⁽¹⁾) = a0⁽¹⁾ + ∑_{i=1}^{n} a_i⁽¹⁾ · c_i
d2(c, a⁽²⁾) = a0⁽²⁾ + ∑_{i=1}^{n} a_i⁽²⁾ · c_i
Decision rule: d(c) = d1(c, a⁽¹⁾) − d2(c, a⁽²⁾); if d(c) ≥ 0 then select class Ω1, else class Ω2.
Example Binary classifier for a 2-D feature vector – the linear decision rule corresponds to two half-planes separated by the line: d1(c) = d2(c).
Learning the potential function-based classifier
Linear classifier for K classes:
ς(c) = arg max_k d_k(c, a⁽ᵏ⁾) = arg max_k ( a0⁽ᵏ⁾ + ∑_{i=1}^{n} a_i⁽ᵏ⁾ · c_i )
Learning:
1. Let C be a matrix of size N × 3, built from N samples, where N_k samples are from class k.
2. For every class Ω_k define the equation system: ε + D⁽ᵏ⁾ = C a⁽ᵏ⁾, where D⁽ᵏ⁾ = [d⁽ᵏ⁾1, d⁽ᵏ⁾2, …, d⁽ᵏ⁾N]ᵀ and d_k(c, a⁽ᵏ⁾) = 1 if c ∈ Ω_k, or d_k(c, a⁽ᵏ⁾) = −1 if c ∉ Ω_k; and ε = [ε1, ε2, …, εN]ᵀ is an error vector.
(continued on the next page)
Learning the classifier (2)
3. The „least square error” optimization problem is:
a_o = arg min_a ∑_{i=1}^{N} ε_i²(a)
The solution leads to a system: X⁽ᵏ⁾ = B⁽ᵏ⁾ a_o⁽ᵏ⁾, where X⁽ᵏ⁾ = C⁽ᵏ⁾ᵀ D⁽ᵏ⁾ and B⁽ᵏ⁾ = C⁽ᵏ⁾ᵀ C⁽ᵏ⁾.
This can be solved directly as: a_o⁽ᵏ⁾ = (C⁽ᵏ⁾ᵀ C⁽ᵏ⁾)⁻¹ C⁽ᵏ⁾ᵀ D⁽ᵏ⁾.
3. Bayes classifier
Stochastic distributions are needed:
• prior class pdf-s p(Ω_k), k = 1, …, K,
• conditional pdf-s p(c | Ω_k).
Decision rule:
ς(c) = arg max_λ p(Ω_λ | c) = arg max_λ p(Ω_λ) p(c | Ω_λ).
Learning the probability distributions:
1. p(Ω_k) are estimated as relative frequencies of the classes in the learning sample set;
2. p(c | Ω_k): non-parametric, e.g. histograms; or parametric – assume a density function family and estimate its parameters from the learning samples.
Learning a Bayes classifier
Non-parametric pdf: e.g. a normalized histogram.
Parametric density – the ML („maximum likelihood”) estimator: for Ω_k specify the parameter set θ⁽ᵏ⁾ which maximizes the probability of the given observations:
max_{θ⁽ᵏ⁾∈Θ} ∑_{c_j∈C⁽ᵏ⁾} P(c_j | θ⁽ᵏ⁾)
Solve:
∂/∂θ_i [ log ∑_{c_j∈C⁽ᵏ⁾} P(c_j | θ⁽ᵏ⁾) ] = 0, θ_i ∈ θ⁽ᵏ⁾
Minimum-distance classifier
This is a specific form of the Bayes classifier for Gaussian pdf-s, if it uses the Euclidean metric. Select the class of minimum distance: d(Ω_k, c) = ln p(Ω_k | c).
The Bayes condition, p(Ω1) p(c | Ω1) > p(Ω2) p(c | Ω2), is equivalent to:
ln p(Ω1) − ½ (c − µ1)ᵀ Σ1⁻¹ (c − µ1) > ln p(Ω2) − ½ (c − µ2)ᵀ Σ2⁻¹ (c − µ2)
With Σ1 = Σ2 = I and the maximum likelihood rule, p(Ω1) = p(Ω2), it simplifies to: |c − µ1|² < |c − µ2|²
[…]
The separating hyperplane is a weighted scalar product of the support vectors (those samples with ϑ_j > 0):
H*: d_ã(c) = ∑_{j=1}^{N} ϑ_j y_j (jcᵀ c) + a0 = 0
3. Pattern classification
94
Linear SVM under noise
Linear separation under noisy conditions: add a “penalty” component to the SVM’s goal function, covering wrongly separated samples.
Example. Error distances ε5 and ε6 (for samples of class +1) and ε7 and ε8 (for samples of class −1).
The dual problem
The modified goal function:
min_{ã∈R^{n+1}} f(a) = min_{ã∈R^{n+1}} ( ½ |a|² + C ∑_{j=1}^{N} ε_j )
subject to y_j (jcᵀ a + a0) ≥ 1 − ε_j, ∀ jc ∈ ω, ε_j ≥ 0
C is a constant controlling the influence of the noise component on the goal function.
The dual problem: max_{ϑ≥0} L_D(ϑ) = max_{ϑ≥0} [ inf_ã L(ã, ϑ) ]
with
L_D(ϑ) = ∑_{j=1}^{N} ϑ_j − ½ ∑_{j=1}^{N} ∑_{k=1}^{N} ϑ_j ϑ_k y_j y_k (jcᵀ kc) = iᵀ ϑ − ½ ϑᵀ Q ϑ
where Q_jk = y_j y_k jcᵀ kc; i = [1, 1, …, 1]ᵀ
subject to 0 ≤ ϑ_j ≤ C; 0 = ∑_{j=1}^{N} ϑ_j y_j = yᵀ ϑ
Solving the SVM problem
Both the primary and the dual form of the SVM optimization problem are instances of the general quadratic programming problem (for example, see the quadprog function in MATLAB). There exist traditional methods, like the Newton method or the active set method, for solving the quadratic programming problem, but with many hundreds or thousands of training samples traditional methods cannot be applied directly. An efficient learning algorithm that solves the dual problem for SVM is “Sequential Minimal Optimization” (SMO) (J.C. Platt).
Kernel SVM
Linear separators in a high-dimensional feature space F(c), obtained by replacing c and jc in the H* equation with F(c) and F(jc):
H*: d_ã(c) = ∑_{j=1}^{N} ϑ_j y_j ( F(c)ᵀ F(jc) ) + a0 = 0
Example:
(a) The true decision boundary is: x1² + x2² ≤ 1.
(b) After mapping into a three-dimensional feature space: (x1², x2², √2 x1 x2).
6. MLP
McCulloch-Pitts model of a single neuron:
Inputs: (x1, x2, …, xn)
Input function: y = ∑_{i=1}^{n} (w_i · x_i) − w0 · 1, where w0 is the bias weight
Activation function: z = θ(y)
Output: z
Single feed-forward layer In ANN (artificial neural networks) the neurons are organised into layers. In a feed-forward (perceptron) network the input signals are sent from the input to the output via excitatory links.
y = W · x,  z = θ(y)
Activation functions
Single step function: z = 1, if y_i > y_k, where y_k is a fixed threshold.
Sigmoid function: θ(y) = 1 / (1 + exp(−βy))
Example. Activation functions: (left) step function, (right) sigmoid function.
Sigmoid function
For z = θ(y) = 1 / (1 + exp(−y)) the derivative is:

dz/dy = z (1 − z)

Example. Sigmoid function with different values of the parameter β.
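The identity dz/dy = z(1 − z) can be checked numerically. A minimal sketch (the evaluation point 0.3 is arbitrary):

```python
import math

def sigmoid(y, beta=1.0):
    # z = 1 / (1 + exp(-beta * y))
    return 1.0 / (1.0 + math.exp(-beta * y))

def sigmoid_derivative(y):
    # For beta = 1 the slide's identity dz/dy = z * (1 - z) holds
    z = sigmoid(y)
    return z * (1.0 - z)

# Central-difference check of dz/dy = z(1 - z) at y = 0.3
h = 1e-6
numeric = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(0.3)) < 1e-8)  # True
```

This cheap derivative is why the sigmoid is convenient in backpropagation: the gradient is obtained from the already-computed activation z.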
Multi-layer perceptron
MLP: some number (at least two) of feed-forward layers arranged in a cascade. The transformation function of each layer l (= 1, 2, ...):

z^(l) = θ(W^(l) z^(l−1) − w0^(l)) , where z^(0) = x.

Example. An MLP with 3 layers (2 hidden layers).
Supervised learning of MLP
The Widrow-Hoff rule (delta rule): the weight of the incoming link between the i-th neuron and the j-th input is strengthened in proportion to the difference between the required and the real activation:

Δw_ij = μ (s_i − z_i) x_j

where s_i is the required activation of the i-th neuron and μ is a learning coefficient.
The error back-propagation rule in MLP is an extension of the delta rule. An update of the weights in layer l (l = L, L−1, ..., 1):

ΔW^(l) = μ · d^(l) · z^(l−1)T , z^(0) = x

where d^(l) is the correction vector for the l-th layer. A single learning iteration starts from the last layer (l = L) and proceeds backwards, layer by layer, until l = 1.
Backpropagation learning
The correction values are set as follows:
For the last layer the required output vector s is available, hence:

d^(L) = (s − z^(L)) .* ∂z/∂y

For the hidden layers the correction vector is projected back; for l = L−1, ..., 1:

d^(l) = [(W^(l+1))^T d^(l+1)] .* ∂z/∂y

Here ".*" denotes element-by-element multiplication of two vectors.
For a sigmoid activation function the corrections are:

d^(L) = (s − z^(L)) .* [1 − z^(L)] .* z^(L)
d^(l) = [(W^(l+1))^T d^(l+1)] .* [1 − z^(l)] .* z^(l)
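The backpropagation rule with sigmoid corrections can be sketched in a few lines. A minimal example, assuming one hidden layer, biases folded into the weight matrices via a constant input 1, and illustrative weights and data; the error for a single training sample should shrink over iterations:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer(W, x):
    # One feed-forward layer: z_i = sigmoid(sum_j W[i][j] * x_j)
    # (the bias weight is the last column of W, fed with a constant 1)
    return [sigmoid(sum(w * xj for w, xj in zip(row, x + [1.0]))) for row in W]

def backprop_step(W1, W2, x, s, mu=0.5):
    z1 = layer(W1, x)    # hidden layer output z^(1)
    z2 = layer(W2, z1)   # network output z^(L)
    # d^(L) = (s - z^(L)) .* z^(L) .* (1 - z^(L))
    d2 = [(si - zi) * zi * (1.0 - zi) for si, zi in zip(s, z2)]
    # d^(1) = [(W^(2))^T d^(2)] .* z^(1) .* (1 - z^(1))  (bias column dropped)
    back = [sum(W2[i][j] * d2[i] for i in range(len(d2))) for j in range(len(z1))]
    d1 = [b * z * (1.0 - z) for b, z in zip(back, z1)]
    # Delta W^(l) = mu * d^(l) * z^(l-1)^T
    W2 = [[w + mu * d2[i] * zin for w, zin in zip(W2[i], z1 + [1.0])]
          for i in range(len(W2))]
    W1 = [[w + mu * d1[i] * xin for w, xin in zip(W1[i], x + [1.0])]
          for i in range(len(W1))]
    return W1, W2

W1 = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]   # 2 hidden neurons, 2 inputs + bias
W2 = [[-0.6, 0.7, 0.05]]                    # 1 output neuron
x, s = [1.0, 0.0], [1.0]
err0 = abs(s[0] - layer(W2, layer(W1, x))[0])
for _ in range(200):
    W1, W2 = backprop_step(W1, W2, x, s)
err1 = abs(s[0] - layer(W2, layer(W1, x))[0])
print(err1 < err0)  # True: the output error decreased
```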
7. Pattern clustering
Unsupervised learning procedures use unlabeled samples.
"k-means" clustering
1. Init k cluster centers: μ(i).
2. FOR each observation c_j DO: associate it with the closest cluster center, i.e. assign ξ(c_j) ← i, where d(μ(i), c_j) = min_k d(μ(k), c_j), for some distance function d (e.g. Euclidean distance).
3. FOR each cluster i (with n(i) assigned observations) DO: estimate the mean of the observations assigned to this cluster:

μ(i) = (1/n(i)) Σ_{j: ξ(c_j)=i} c_j

4. REPEAT steps 2 and 3 a given number of times.
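The four steps above can be sketched directly. A minimal example; the init strategy (k evenly spaced observations) and the toy data are assumptions, since the slide leaves them open:

```python
def kmeans(samples, k, iters=20):
    # Step 1: init the k cluster centers; here simply k evenly spaced
    # observations (an assumption, the slide leaves the strategy open)
    centers = [samples[i * len(samples) // k] for i in range(k)]
    assign = [0] * len(samples)
    for _ in range(iters):
        # Step 2: assign every observation to the closest center (Euclidean)
        for j, c in enumerate(samples):
            assign[j] = min(range(k),
                            key=lambda i: sum((a - b) ** 2
                                              for a, b in zip(c, centers[i])))
        # Step 3: re-estimate each center as the mean of its members
        for i in range(k):
            members = [samples[j] for j in range(len(samples)) if assign[j] == i]
            if members:
                centers[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, assign

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # cluster near (0, 0)
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # cluster near (5, 5)
centers, assign = kmeans(data, 2)
print(sorted(round(c[0]) for c in centers))   # -> [0, 5]
```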
EM clustering
The expectation-maximization (EM) algorithm is a general estimation technique for dealing with missing data. Applied to clustering, the EM algorithm gives a maximum-likelihood estimate of a Gaussian mixture:

p(c | θ) = Σ_{i=1..K} α_i N(c | μ_i, Λ_i)

where N(.) denotes a Gaussian pdf, and θ is the parameter set that needs to be estimated: θ = {α_i, μ_i, Λ_i | i = 1, 2, ..., K}.
E-step: FOR each sample c_j and each class i DO:

P_j^i = p(c_j | ξ(c_j)=i, θ) α_i / Σ_i p(c_j | ξ(c_j)=i, θ) α_i

M-step: FOR each class i DO:

μ_i = Σ_j c_j P_j^i / Σ_j P_j^i
Λ_i = Σ_j P_j^i (c_j − μ_i)(c_j − μ_i)^T / Σ_j P_j^i
α_i = Σ_j P_j^i / Σ_i Σ_j P_j^i
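The E- and M-steps above can be sketched for the simplest case. A minimal 1-D example with K = 2, where the covariance Λ_i reduces to a scalar variance; the data, the init, and the small variance floor are illustrative assumptions:

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian pdf N(x | mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(samples, iters=50):
    # Two-component 1-D Gaussian mixture; theta = {alpha_i, mu_i, var_i}
    alpha = [0.5, 0.5]
    mu = [min(samples), max(samples)]   # illustrative init
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: posterior P_j^i of component i given sample c_j
        P = []
        for c in samples:
            w = [alpha[i] * gauss(c, mu[i], var[i]) for i in range(2)]
            tot = sum(w)
            P.append([wi / tot for wi in w])
        # M-step: re-estimate mu_i, var_i, alpha_i from the posteriors
        for i in range(2):
            wsum = sum(P[j][i] for j in range(len(samples)))
            mu[i] = sum(P[j][i] * samples[j] for j in range(len(samples))) / wsum
            var[i] = sum(P[j][i] * (samples[j] - mu[i]) ** 2
                         for j in range(len(samples))) / wsum + 1e-6
            alpha[i] = wsum / len(samples)
    return alpha, mu, var

data = [0.0, 0.1, -0.1, 0.05, 4.0, 4.1, 3.9, 4.05]
alpha, mu, var = em_gmm(data)
print(sorted(round(m) for m in mu))  # -> [0, 4]
```

Unlike k-means, each sample contributes to every component, weighted by the posterior P_j^i (soft assignment).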
8. Ensemble learning
Many classifiers or experts are generated and combined to solve a particular classification or decision problem:
1. Boosting – combining weak binary classifiers into a strong one. Each iteration of boosting creates three weak classifiers: the first classifier C1 is trained on a random subset of the training data; C2 is trained on a set of which only half is correctly classified by C1 and the other half is misclassified; C3 is trained on instances on which C1 and C2 disagree. The 3 classifiers are combined through a three-way majority vote.
AdaBoost.M1 (adaptive boosting) – an extension to multi-class classification, the most popular ensemble classifier (Freund & Schapire, 1995).
2. Mixture of experts (Jacobs & Jordan, 1991-94)
AdaBoost
AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of T "weak" classifiers:

F(x) = C[f(x)] = C[ Σ_{t=1..T} w_t · h_t(x) ]

where h_t(x) is the classification ("hypothesis") of sample x given by the t-th weak classifier, w_t are weights and C is the decision rule of the "strong" classifier. The set of "weak" classifiers, H = {h_i(x)}, is potentially infinite.
Training data samples are drawn from a distribution that is iteratively updated, such that subsequent classifiers focus on increasingly difficult instances: previously misclassified instances are more likely to appear in the next bootstrap sample. The classifiers are combined through weighted majority voting.
AdaBoost.M1 (1)
INPUT: classes Ω = {Ω1, Ω2, ..., ΩL}; training samples S = {x1, x2, ..., xn} with labels y_i ∈ Ω, i = 1, 2, ..., n; number of weak classifiers T.
TRAINING
INIT distribution: D_1(i) = 1/n, i = 1, 2, ..., n
FOR t = 1, 2, ..., T DO
1. Select the current training subset S_t according to D_t.
2. WeakLearn: train on S_t and return the "weak" classifier h_t: X → Ω with smallest error:

ε_t = Σ_{i=1..n} I(h_t(x_i) ≠ y_i) · D_t(i)

IF ε_t > 1/2 THEN STOP
3. Calculate the normalized error: β_t = ε_t / (1 − ε_t)
4. Update the distribution D_t:

D_{t+1}(i) = D_t(i) · β_t / Z_t , if h_t(x_i) = y_i
D_{t+1}(i) = D_t(i) / Z_t , if h_t(x_i) ≠ y_i
AdaBoost.M1 (2)
Remark: Z_t is a normalization coefficient that makes D_{t+1} a proper distribution.
CLASSIFICATION by weighted majority voting. Given a new unlabeled sample s:
1. Obtain the total vote received by each class:

v_j = Σ_{t=1..T} I(h_t(s) = Ω_j) · log(1/β_t) , j = 1, ..., L

i.e. the "weak" classifier weights are log(1/β_t).
2. Rule C: select the class Ω_j which receives the highest total vote v_j.
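The boosting loop and the weighted vote above can be sketched as follows. A minimal example under assumptions: the WeakLearn step is simplified away (the weak classifiers are pre-built 1-D threshold stumps supplied in order), and a small floor on ε_t avoids a zero β_t for a perfect classifier:

```python
import math

def adaboost_m1(samples, labels, weak_learners):
    # Each weak learner is a ready classifier h: x -> label; only the
    # boosting loop from the slide is implemented here.
    n = len(samples)
    D = [1.0 / n] * n                     # D_1(i) = 1/n
    chosen, betas = [], []
    for h in weak_learners:
        eps = sum(D[i] for i in range(n) if h(samples[i]) != labels[i])
        if eps > 0.5:                     # slide: STOP when eps_t > 1/2
            break
        eps = max(eps, 1e-10)             # avoid beta = 0 for a perfect h
        beta = eps / (1.0 - eps)          # normalized error beta_t
        # D_{t+1}(i) = D_t(i)*beta_t / Z_t if correct, else D_t(i) / Z_t
        D = [D[i] * beta if h(samples[i]) == labels[i] else D[i]
             for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        chosen.append(h)
        betas.append(beta)

    def classify(s):
        # Weighted majority vote with weights log(1 / beta_t)
        votes = {}
        for h, b in zip(chosen, betas):
            votes[h(s)] = votes.get(h(s), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)
    return classify

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
stumps = [lambda x: 0 if x < 3 else 1,    # perfect split
          lambda x: 0 if x < 2 else 1,    # errs on x = 2
          lambda x: 0 if x < 4 else 1]    # errs on x = 3
clf = adaboost_m1(xs, ys, stumps)
print([clf(x) for x in xs])  # -> [0, 0, 0, 1, 1, 1]
```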
Mixture of Experts (1)
Several experts (classifiers) are learned. The i-th expert produces its output:

o_i(x) = f(W_i x)

where W_i is a weight matrix and f(.) is a fixed nonlinearity. The outputs of the experts are combined through a (generalized) linear rule:

y = o(x) = Σ_{k=1..N} g(x, v_k) o_k(x)

The weights of this combination are determined by a gating network:

g(x, v_i) = exp(ξ_i) / Σ_{k=1..N} exp(ξ_k) ; ξ_i = v_i^T x
Mixture of Experts (2)
The classification step in the "mixture of experts" approach can be expressed in stochastic terms as the maximization of the posterior probability:

P(y | x, Ψ) = Σ_{k=1..N} g(x, v_k) P(y | x, W_k)

where Ψ is the set of all parameters (all expert and gating weights). The parameters are typically trained using the expectation-maximization (EM) algorithm. Let the training set be given as {(x_t, y_t) | t = 1, 2, ..., T}.
In the E-step, in the s-th epoch, the following posterior probabilities are computed for all the training data:

p_i(t) = g(x_t, v_i^(s)) P(y_t | x_t, W_i^(s)) / Σ_{k=1..N} g(x_t, v_k^(s)) P(y_t | x_t, W_k^(s)) , i = 1, 2, ..., N
Mixture of Experts (3)
In the M-step the following maximization problems are solved:

W_i^(s+1) = argmax_{W_i} Σ_{t=1..T} p_i(t) log P(y_t | x_t, W_i)

V^(s+1) = argmax_V Σ_{t=1..T} Σ_{k=1..N} p_k(t) log g(x_t, v_k)

where V is the set of all the parameters in the gating network.
EIASR Image pre-processing Lecture 4
1. Scene acquisition
Perspective projection
A simple "pinhole" model of the camera: p' = (−x_p, −y_p) is the reversed image point, P = (x_c, y_c, z_c) is the scene point (in camera coordinates), f is the focal length.

x_p = s_x · f · x_c / z_c , y_p = s_y · f · y_c / z_c

where s_x, s_y denote the scaling of scene-to-image units.
Perspective vs parallel projection
Parallel projection is obtained if f tends to infinity:

x_p = lim_{f→∞} s_x · f · x_c / (f + z_c) = s_x · x_c , y_p = lim_{f→∞} s_y · f · y_c / (f + z_c) = s_y · y_c

Perspective projection vs. parallel projection.
Vector operations
3-D vectors: u = [u1, u2, u3]^T , v = [v1, v2, v3]^T
Inner product of two vectors: <u, v> = u^T v = u1·v1 + u2·v2 + u3·v3
Norm: ||u|| = √(u1² + u2² + u3²)
Angle: cos(θ) = <u, v> / (||u|| ||v||)
Cross product of two vectors: u × v = Û v , where Û is the skew-symmetric matrix

Û = [ 0, −u3, u2 ; u3, 0, −u1 ; −u2, u1, 0 ]
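The matrix form of the cross product can be verified directly. A minimal sketch (the unit vectors are illustrative): building Û from u and multiplying by v reproduces x × y = z.

```python
def skew(u):
    # The skew-symmetric matrix U^ built from u, so that u x v = U^ v
    u1, u2, u3 = u
    return [[0.0, -u3, u2],
            [u3, 0.0, -u1],
            [-u2, u1, 0.0]]

def matvec(M, v):
    # Plain matrix-vector product
    return [sum(m * x for m, x in zip(row, v)) for row in M]

u, v = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(matvec(skew(u), v))  # -> [0.0, 0.0, 1.0]  (x cross y = z)
```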
Camera parameters
A transformation of a scene point P onto a pixel p depends on camera parameters of two types:
1. Extrinsic parameters: rotation R, translation T
2. Intrinsic parameters: projection K_f, camera-to-image K_s
Extrinsic camera transformation
Let P_w = [x_w, y_w, z_w] be the world coordinates and P_c = [x_c, y_c, z_c] the camera coordinates of some scene point P. Then: P_c = R P_w + T
Intrinsic camera transformation

p = [x_p, y_p, 1]^T = K_s · y , where y = (1/z_c) K_f P_c

K_f = [ f, 0, 0 ; 0, f, 0 ; 0, 0, 1 ] , K_s = [ s_x, s_θ, o_x ; 0, s_y, o_y ; 0, 0, 1 ]
Homogeneous coordinates
A 3-D point P ↔ 4-D homogeneous coordinates P_h:

P = [X, Y, Z]^T ↔ P_h = [kX, kY, kZ, k]^T , k ≠ 0.

Translation of a point by vector [t_x, t_y, t_z]^T:

T = [ 1, 0, 0, t_x ; 0, 1, 0, t_y ; 0, 0, 1, t_z ; 0, 0, 0, 1 ]

Scaling of coordinates by [s_x, s_y, s_z]^T:

S = [ s_x, 0, 0, 0 ; 0, s_y, 0, 0 ; 0, 0, s_z, 0 ; 0, 0, 0, 1 ]

Rotations of the coordinate system around axis Z by angle θ, around axis X by angle α, and around axis Y by angle β:

R_θ = [ cos θ, −sin θ, 0, 0 ; sin θ, cos θ, 0, 0 ; 0, 0, 1, 0 ; 0, 0, 0, 1 ]
R_α = [ 1, 0, 0, 0 ; 0, cos α, −sin α, 0 ; 0, sin α, cos α, 0 ; 0, 0, 0, 1 ]
R_β = [ cos β, 0, sin β, 0 ; 0, 1, 0, 0 ; −sin β, 0, cos β, 0 ; 0, 0, 0, 1 ]
Homogeneous coordinates (2)
Perspective transformation of homogeneous coordinates:

Ψ_f = [ 1, 0, 0, 0 ; 0, 1, 0, 0 ; 0, 0, 1, 0 ; 0, 0, 1/f, 0 ]

The reverse transformation is not unique! Let us observe:

p_h = [x_p, y_p, f, 1]^T = norm(Ψ_f · P_h) = norm([x_c, y_c, z_c, z_c/f]^T) = [f·x_c/z_c, f·y_c/z_c, f, 1]^T

The transformation from camera to pixel units: the operations of pixel scaling, skewing and shift of the optical center can be represented by a single matrix operating on homogeneous coordinates:

Ψ_s = [ s_x, s_θ, 0, o_x ; 0, s_y, 0, o_y ; 0, 0, 1, 0 ; 0, 0, 0, 1 ]
2. Camera calibration
Example. Extrinsic parameters in R, T:
- α : rotation of axes around global axis OZ;
- β : rotation of axes around global axis OY;
- θ : rotation of axes around global axis OX;
- G : translation of the global system to the camera fixture system;
- C : translation of the camera fixture system to the camera system.
Intrinsic parameters in Ψ_f, Ψ_s:
- f : the camera's focal length;
- s_x, s_y : the image-to-pixel unit scaling;
- o_x, o_y : the camera origin to image origin shift.
Example (2)
The transformation of a point P_w, given in world coordinates, into a pixel p:

p = (Ψ_s · Ψ_f · T_{−C} · R_θ · R_β · R_α · T_{−G}) · P_w = A · P_w

Remark: in this example we have defined a translation of the coordinate system, while in the above equation a point is moved. A transformation of a point is a dual operation to the transformation of a coordinate system.
Auto-calibration of the camera

p = [p1, p2, p3, p4]^T = [kx, ky, kz, k]^T = A · P_w

The goal is to estimate the combined matrix A for the scene-to-image transformation, based on some number of pixel-to-scene point correspondences.
Auto-calibration
Matrix A consists of 12 parameters, but only 8 are independent (the 3rd row linearly depends on the 4th row, by the factor f).

p = [kx, ky, kz, k]^T = [ a11, a12, a13, a14 ; a21, a22, a23, a24 ; a31, a32, a33, a34 ; a41, a42, a43, a44 ] · [X, Y, Z, 1]^T = A · P_w

As p1 = x·p4 and p2 = y·p4, we get a system of 3 linear equations:

x·p4 = a11 X + a12 Y + a13 Z + a14
y·p4 = a21 X + a22 Y + a23 Z + a24
p4 = a41 X + a42 Y + a43 Z + a44

(x, y) and (X, Y, Z) represent measurable data while p4 is not known. It needs to be eliminated from the system.
Auto-calibration (2)
After elimination of p4 from the 1st and 2nd row we get 2 equations with 12 unknowns:

a11 X + a12 Y + a13 Z − a41 xX − a42 xY − a43 xZ − a44 x + a14 = 0
a21 X + a22 Y + a23 Z − a41 yX − a42 yY − a43 yZ − a44 y + a24 = 0

Auto-calibration algorithm
1. Detect m ≥ 6 image points {p_i = (x_i, y_i), i = 1, 2, ..., m} which are projections of known 3-D points (in world coordinates) P_i = (X_i, Y_i, Z_i), i = 1, 2, ..., m.
2. For every pair of points (p_i, P_i) form two equations of the above form – in total M ≥ 12 equations with 12 unknowns a11 ... a44.
3. Solve the equation system M by an LSE (least squares error) approach.
3. Color spaces
Color spaces that are based on psychological observations of human perception:
• HLS (Hue, Lightness, Saturation),
• HSV (Hue, Saturation, Value).
Technical systems represent color as a mixture of three basic vectors in the color space:
• A typical color system for computer monitors uses three basic colors: red, green, blue (the RGB image, where R, G, B are coefficients from the interval [0, 1]). An additive, positive-oriented system: "white" is the sum of its components.
• Typical color printing on printers uses the CMY (Cyan, Magenta, Yellow) system. A subtractive ("negative") model: "black" is the sum of the components: [C, M, Y] = [1, 1, 1] − [R, G, B].
Color spaces (2)
• Typical analog television broadcast: YIQ (Y – luminance; I, Q – chromatic components). The specific sensitivity of the human eye to green allows green to be represented mostly by Y.

[Y; I; Q] = [ 0.299, 0.587, 0.114 ; 0.60, −0.28, −0.32 ; 0.21, −0.52, 0.31 ] · [R; G; B]

• In digital media: YUV (or Y Cb Cr), where Cb, Cr are distances of the color from the "gray" color along the blue and red axes.

[Y; Cb; Cr] = [ 0.299, 0.587, 0.114 ; −0.1687, −0.3313, 0.5 ; 0.5, −0.4187, −0.0813 ] · [R; G; B]
[Y; U; V] = [ 0.299, 0.587, 0.114 ; −0.1471, −0.2889, 0.436 ; 0.6149, −0.515, −0.100 ] · [R; G; B]

Note: R, G, B are coefficients from [0, 1]. Y also takes values from [0, 1], but Cb, Cr (or U, V) take values from [−0.5, 0.5].
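The Y Cb Cr matrix above is easy to apply per pixel. A minimal sketch using the slide's coefficients, checking the stated ranges on pure white (Y = 1, Cb = Cr = 0):

```python
def rgb_to_ycbcr(r, g, b):
    # Matrix from the slide; r, g, b in [0, 1], y in [0, 1],
    # cb and cr in [-0.5, 0.5]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b
    return y, cb, cr

y, cb, cr = rgb_to_ycbcr(1.0, 1.0, 1.0)   # white
print(abs(y - 1.0) < 1e-9 and abs(cb) < 1e-9 and abs(cr) < 1e-9)  # True
```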
Colour calibration
Ideal colours (blue, green, red, yellow, magenta, white, dark), specific "skin" colours ("dark skin" and "light skin"), etc. are defined in the YUV space. They are applied to make a colour calibration of the image.
[Figure: the colour plane U-V for the mid-value Y = 0.5 (of 1) (or 128 of 255).]
Y-based color normalization
In theory a Y-based normalization of the color components U, V is not required. But in practice it can help to limit the variability of "dark" and "light" versions of the same color. Assume an 8-bit-per-color-per-pixel representation, i.e. values from [0, 255]:

Y_p' = 128 ; U_p' = U_p + κ_U (Y_p' − Y_p) ; V_p' = V_p + κ_V (Y_p' − Y_p)

E.g. for a skin color: κ_U = 1.135 and κ_V = 1/κ_U.
Example The detection of skin colour in the image:
4. Image enhancement
Low-contrast images can be enhanced by "histogram stretching" or "histogram equalization".
"Stretching" of histogram H
1. Find a = min(H, A) and b = max(H, A), such that A% of pixels have a lower value than a and A% of pixels have a higher value than b (e.g. A = 0.05).
2. Transform the pixel value w to a new digital value in the interval [0, 1, ..., Dm] as:

w' = 0, if w < a ; w' = round[ Dm (w − a) / (b − a) ], if a ≤ w ≤ b ; w' = Dm, if w > b.
Histogram equalization
The purpose of histogram equalization is to make every pixel value (nearly) equally probable:

a = f(w) = round[ (Dm / PixNum) Σ_{i=0..w} H(i) ]

where H(i) is the histogram value for pixel value i, PixNum is the total pixel number, [0, 1, ..., Dm] is the interval of digital levels of the variable w in the original image, and round[] specifies rounding to the nearest digital level.
Remark: perfectly flat histograms are seldom obtained when working with digital variables. Equalization usually leads to a reduction of the number of pixel value levels in the transformed image compared to the original one.
Example Image and its histogram before and after histogram equalization:
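The equalization mapping above amounts to a cumulative-histogram lookup table. A minimal sketch (8 grey levels and a tiny toy image, both illustrative), where a low-contrast image crowded into levels 2..4 gets spread over almost the full range:

```python
def equalize(pixels, levels=8):
    # a = round((Dm / PixNum) * sum_{i<=w} H(i)) with Dm = levels - 1
    H = [0] * levels
    for w in pixels:
        H[w] += 1
    n = len(pixels)
    Dm = levels - 1
    cum, lut = 0, []
    for i in range(levels):
        cum += H[i]
        lut.append(round(Dm * cum / n))   # lookup table f(w)
    return [lut[w] for w in pixels]

img = [2, 2, 3, 3, 3, 3, 4, 4]   # values crowded into 2..4
out = equalize(img)
print(min(out), max(out))  # -> 2 7
```

Note the remark from the slide in action: the output still uses only a few distinct levels, just spread further apart.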
5. Image binarization
Problem: how to set the threshold for image binarization, i.e. to separate the foreground object pixels from the background?
Solution idea: the image histogram is approximated by a weighted sum of two normal (Gaussian) distributions. The border between the distributions determines the threshold value:

p(l) = α N(l, m1, σ1) + (1 − α) N(l, m2, σ2)
The Otsu method
1. FOR all possible thresholds θ: obtain the between-class variance as:

σ_B²(θ) = P1(θ) (1 − P1(θ)) (m1(θ) − m2(θ))²

where P1(θ) = Σ_{l=1..θ} h(l) / Σ_{l=1..L} h(l).
2. Select θ such that σ_B²(θ) is of maximum value.
There exists only a finite number of threshold positions. Hence, the above algorithm always terminates.
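The exhaustive search over thresholds can be made efficient with running sums. A minimal sketch, assuming 0-based grey levels (the slide sums from l = 1) and an illustrative bimodal 8-level histogram:

```python
def otsu_threshold(hist):
    # Maximize sigma_B^2(t) = P1(t) * (1 - P1(t)) * (m1(t) - m2(t))^2
    total = sum(hist)
    grand = sum(l * h for l, h in enumerate(hist))   # sum of l * h(l)
    best_t, best_var = 0, -1.0
    w1 = m1sum = 0.0
    for t in range(len(hist) - 1):
        w1 += hist[t]          # pixels in class 1 (levels <= t)
        m1sum += t * hist[t]
        P1 = w1 / total
        if P1 == 0.0 or P1 == 1.0:
            continue
        m1 = m1sum / w1
        m2 = (grand - m1sum) / (total - w1)
        var = P1 * (1 - P1) * (m1 - m2) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t              # levels <= best_t form class 1

# Bimodal histogram: dark peak at level 1, bright peak at level 6
hist = [2, 10, 3, 0, 0, 3, 12, 2]
print(otsu_threshold(hist))  # -> 2 (between the two peaks)
```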
6. Pattern normalization An image object can appear of different size and orientation and nevertheless we need to recognize it. Example: different letters W, but the same class of letter.
Example: normalization of image size
Pattern normalization (1)
Pattern transformation in image space (translation, scaling, rotation, mirroring) uses functions of the geometric moments. A normalization step maps pattern f(x, y) ≥ 0 with global moments m_pq onto pattern h(x', y') ≥ 0 with global moments μ_pq:

m_pq = Σ_x Σ_y x^p y^q f(x, y) , μ_pq = Σ_x' Σ_y' x'^p y'^q h(x', y')

Step 1: shift to the mass centre (x_c, y_c):

x_c = m10/m00 , y_c = m01/m00 , x' = x − x_c , y' = y − y_c

and normalize the amplitude:

h(x', y') = f(x, y) / m00

Then μ00 = 1 and μ10 = μ01 = 0.
Pattern normalization (2)
Step 2: size normalization – scaling of the axes:

r = √(m20 + m02) , x' = x/r , y' = y/r , h(x', y') = r² f(x, y)

Then μ20 + μ02 = 1.
Step 3: rotation – to normalize the orientation of the pattern. If (m20 ≠ m02) then minimize

S(α) = Σ_{(x,y)∈f} [ (x − x_c) cos α − (y − y_c) sin α ]²

and get:

tan(2α) = 2 m11 / (m20 − m02)
Pattern normalization (3)
Step 3 (cont.): Among the 4 possible angles select α such that after the rotation

[x'; y'] = [ cos α, sin α ; −sin α, cos α ] · [x; y] , h(x', y') = f(x, y)

it holds that: μ11 = 0, μ20 < μ02 and μ21 > 0.
Step 4: mirroring with respect to the Y axis. Select β ∈ {+1, −1} such that after the transformation x' = βx, y' = y, h(x', y') = f(x, y), the resulting pattern's moment is μ12 > 0.
7. Image filters
Basic image filters are used to suppress:
• the high frequencies in the image, i.e. smoothing the image, or
• the low frequencies, i.e. detecting edges in the image.
Spatial filter: convolve the input image f(i, j) with some filter function h(i, j) (called the kernel):

g(i, j) = h(i, j) ∗ f(i, j)

Frequency filter: (1) transform the image into the frequency domain, (2) multiply the result with the frequency filter function and (3) re-transform the result into the spatial domain.
Discrete convolution is a "shift and multiply" operation; for a square kernel of size (M+1)×(M+1):

g(i, j) = Σ_{m=−M/2..M/2} Σ_{n=−M/2..M/2} h(m, n) · f(i−m, j−n)
Basic image filters
Basic image filters:
1. Mean filter – noise reduction (NR) using the mean of the neighbourhood
2. Median filter – NR using the median of the neighbourhood
3. Gaussian smoothing – NR with a Gaussian smoothing kernel
4. Various gradient-based edge detectors
5. Laplace filter – second-derivative-based edge detection
Mean filter: replace each pixel value in an image with the mean (average) value of its neighbours, including itself. For a 3×3 neighbourhood the kernel holds the value 1/9 at every position.
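The discrete convolution sum and the mean filter can be sketched together. A minimal example; the border handling (leaving border pixels untouched) and the tiny test image are assumptions, since the slide does not specify them:

```python
def convolve(image, kernel):
    # Discrete convolution g(i,j) = sum_m sum_n h(m,n) * f(i-m, j-n)
    # for a square kernel; border pixels are left untouched here.
    M = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(M, rows - M):
        for j in range(M, cols - M):
            out[i][j] = sum(kernel[m + M][n + M] * image[i - m][j - n]
                            for m in range(-M, M + 1)
                            for n in range(-M, M + 1))
    return out

# 3x3 mean filter: the kernel of 1/9 everywhere smooths a bright pixel
mean3 = [[1 / 9] * 3 for _ in range(3)]
img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
result = convolve(img, mean3)
print(abs(result[1][1] - 1.0) < 1e-12)  # center becomes the neighbourhood mean
```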
Median filter
Non-linear filter: the addition operation in the discrete convolution is replaced by some non-linear operator:

g(i, j) = O_{m,n} [ h(m, n) · f(i−m, j−n) ]

The median filter is a non-linear filter. It replaces the pixel value with the median of the neighbour values. The median is calculated by first sorting all the pixel values from the local neighbourhood in numerical order and then replacing the central pixel by the middle value.
Example:
Gaussian smoothing
The Gaussian smoothing operator uses a kernel that approximates a Gaussian function. The 2-D isotropic (circularly symmetric) Gaussian:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

Discrete Gaussian kernel: assume zero at distances of more than three standard deviations from the mean and truncate the kernel.
Example: discrete approximation of the Gaussian function with σ = 1.0.
The 2-D isotropic Gaussian is separable into x and y components, so a 1-D convolution can be applied to rows and columns.
Edge image detection
A pre-processing step in edge detection: a smoothing operation in order to remove noise (spike-like variations) from the image.
Example: desired edge (left), real edge (right).
Basic types of edge image detectors:
1. discrete image function gradients,
2. convolving the image with kernels,
3. using parametric edge models,
4. mixed approaches.
Discrete image gradients
The gradient of a 2-D continuous function:

∇f(x, y) = [ f_x(x, y) ; f_y(x, y) ] = [ ∂f(x, y)/∂x ; ∂f(x, y)/∂y ]

Discrete differences in the case of a 2-D image f^(i, j) that represents the continuous function f(x, y):

f^_x(i, j) = f^(i+1, j) − f^(i, j) , f^_y(i, j) = f^(i, j+1) − f^(i, j)

As a result of edge detection two output images are computed:
1. the magnitude (strength) s, or the "absolute" strength s' (for computational simplicity):

s = √(f_x² + f_y²) , s' = |f_x| + |f_y|

2. the direction of the normal vector to the edge: r = arctan(f_y / f_x)
Roberts cross
The "Roberts cross" is an example of a discrete gradient-based edge operator. Two discrete gradients along 45° and 135°:

k_x(i, j) = f(i, j) − f(i+1, j+1) , k_y(i, j) = f(i+1, j) − f(i, j+1)

Remark: they are equivalent to 2×2 convolution kernels.
Edge strength: s = √(k_x² + k_y²) or s' = |k_x| + |k_y|
Edge orientation (a correction by −3π/4 is needed): r = arctan(k_y / k_x) − 3π/4
Characteristics of the "Roberts cross" operator:
• the same image format as the input image, e.g. maximum strength 255 for 8-bit images,
• simple implementation (only two difference values per pixel),
• sensitive to noise.
Differences of the central element
Two differences that approximate the two gradients of the image function along the main axes:

f^_x(i, j) = f^(i+1, j) − f^(i−1, j) , f^_y(i, j) = f^(i, j+1) − f^(i, j−1)

Remark: equivalent convolution kernels exist.
Characteristics:
• the same input-output format (e.g. maximum strength 255 for 1 byte/pixel),
• simple implementation,
• still sensitive to noise.
Convolution-based edge detectors
Sobel operator (kernels G_x, G_y):

G_x = [ −1, 0, 1 ; −2, 0, 2 ; −1, 0, 1 ] , G_y = [ −1, −2, −1 ; 0, 0, 0 ; 1, 2, 1 ]

Prewitt operator:

G_x = [ −1, 0, 1 ; −1, 0, 1 ; −1, 0, 1 ] , G_y = [ −1, −1, −1 ; 0, 0, 0 ; 1, 1, 1 ]

Characteristics:
• Sobel operator: a different output image range than the input, e.g. maximum strength 2040 for an 8-bit input image; simple implementation; good results.
• Prewitt operator: a different output image range than the input, e.g. maximum strength 1020 for an 8-bit input image; simple implementation; results almost as good as for the Sobel operator.
Discrete directions
A limited number of edge orientations is detected. This allows for an efficient implementation of the orientation detection step. For example, 16 discrete image directions are detected by comparing the two edge gradients for each pixel.
Laplace operator
The Laplace operator (the sum of second-order derivatives) for a continuous 2-D function is defined as:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y² = f_xx + f_yy

The discrete Laplace operator uses a single mask and is independent of direction (orientation information is lost). It gives a "positive" answer (i.e. a zero value) for both real edges and homogeneous regions in the image, so it can be used in combination with some other edge operator.
Basic discrete Laplace operator:

∇²f(i, j) = 4 f(i, j) − f(i−1, j) − f(i+1, j) − f(i, j−1) − f(i, j+1)

The convolution kernel: [ 0, −1, 0 ; −1, 4, −1 ; 0, −1, 0 ]
LOG filter (1)
Using a second derivative makes the Laplacian highly sensitive to noise, hence Gaussian smoothing is used as a preprocessing step.
Laplacian of the Gaussian (LOG filter):

L(x, y) = ∇²(G(x, y) ∗ I(x, y))

Due to linearity:

L(x, y) = ∇²G(x, y) ∗ I(x, y)

Derivation of the LOG filter (1-D case):

G(x) = (1 / (σ√(2π))) exp(−x² / (2σ²))

- first derivative: G'(x) = −(x / (σ³√(2π))) exp(−x² / (2σ²))
- second derivative: G''(x) = (1 / (σ√(2π))) (x²/σ⁴ − 1/σ²) exp(−x² / (2σ²))
LOG filter (2)
Illustration of the LOG filter.
Different Laplace convolution kernels: (a) basic Laplacian, (b), (c) LOG filters.
Color image edge operator
An edge operator defined for a monochromatic image can be extended to handle color images. For example, let an RGB image function f_RGB be given. Let pairs of colour pixels be given: f1 = (r1, g1, b1), f2 = (r2, g2, b2) – from the local neighbourhood of the current pixel. In the case of the Sobel color operator there will be three pairs for the "vertical" mask and three pairs for the "horizontal" mask. The difference for a pair (f1, f2) can alternatively be computed as:

D1(f1, f2) = { (r1 − r2)² + (g1 − g2)² + (b1 − b2)² }^(1/2)
D2(f1, f2) = |r1 − r2| + |g1 − g2| + |b1 − b2|
D3(f1, f2) = max{ |r1 − r2|, |g1 − g2|, |b1 − b2| }
D4(f1, f2) = ω_r |r1 − r2| + ω_g |g1 − g2| + ω_b |b1 − b2|
8. Edge thinning
Threshold-based edge elimination
This simple edge thinning method is an edge elimination operator with a minimum threshold parameter θ. The threshold is either fixed or set adaptively (e.g. θ = γ·S_max, where γ ∈ (0, 1)):

s_thin(P) = s(P), if s(P) > θ ; s_thin(P) = 0, otherwise.

Non-maximum edge elimination
It depends on a check in the local neighborhood of a given pixel P, with N_L and N_R the neighbours of P:

IF ( (s(P) ≥ s(N_L) OR |r(P) − r(N_L)| ≥ T)
AND (s(P) ≥ s(N_R) OR |r(P) − r(N_R)| ≥ T) )
THEN s_thin(P) = s(P) ELSE s_thin(P) = 0
Edge thinning (2)
Edge modification
A local neighborhood-based modification of the edge at pixel P:
• if P is the strongest edge element in the set {P, N_L, N_R} then: s'(P) = s(P) + α (s(N_L) + s(N_R));
• if P is the weakest edge element in the above set then: s'(P) = s(P) − 2α s(P);
• if one neighbour of P (denoted P+) is a stronger and the other neighbour of P (denoted P−) is a weaker edge element then: s'(P) = s(P) − α s(P+) + α s(P−).
Several iterations over the whole image may be necessary.
Edge thinning (3)
Edge elimination with hysteresis threshold
This edge thinning method works with two edge strength thresholds: the upper θ_H and the lower θ_L. In the first run those edge pixels whose strengths exceed the upper threshold are individually marked as "good". In the next run these "good" pixels are tentatively "extended" along a potential contour line in both directions ("positive" and "negative" line direction). For a single extension, the neighbor pixel needs to have a strength higher than the lower threshold.
Remark: now the neighbors are not competing with each other and they are searched along the expected contour line (not across it).
9. Canny operator
This is a multi-stage edge image detection and thinning procedure. It needs 3 parameters:
• the variance σ for the Gaussian mask,
• edge strength thresholds θ_H, θ_L (where θ_H > θ_L) for hysteresis-based thinning.
Input: a grey-scale image. Output: a "thinned" edge image.
Steps of the Canny operator
1. Image smoothing by convolution with a Gaussian kernel.
2. Edge image detection by a discrete-gradient operator in a 2×2 or 3×3 neighbourhood.
3. Edge thinning by the non-maximum elimination operator.
4. Edge elimination with a hysteresis threshold.
EIASR Boundary-based image segmentation Lecture 5
1. Line detection
Main approaches to line detection in images:
1. A 3-step line segment detection:
• edge image detection (filtering),
• edge chain following (segmentation),
• approximation of edge chains by straight line segments (symbolic description).
2. A 2-step line or contour detection:
• edge image detection,
• detection of straight lines, circles and contours by using a Hough transform.
3. The "active contour" method.
2. Edge chain following
Principle: searching for an extension of the current edge pixel P = P_cur by its successor edge N = c(P_cur). Two neighbour edge elements can be linked if the edge magnitude (strength) and direction differences are below certain thresholds and their magnitudes are relatively large:

|s(P) − s(N)| ≤ T1 , |r(P) − r(N)| mod 2π ≤ T2 , |s(P)| > T , |s(N)| > T

Denote the 3 nearest pixels along the direction r(P) as N1, N2, N3. The successor of P_cur is the edge element N_i whose strength and orientation are most similar to those of P_cur.
Successor candidates
Three candidates for a successor of pixel P.
Closing a gap for edge pixel P in directions (left) 0° and (right) 45°.
The hysteresis threshold method
The edge strength threshold T is now split into two:
- the upper threshold T0, and
- the lower threshold T = βT0, 0 < β < 1.
The strength of the start edge pixel has to be higher than the upper threshold T0: |s(P_start)| > T0. The strength of any next edge element in the chain needs to be higher than T.
Example. Edge chain labels:
The hysteresis threshold (2)
REPEAT
  Search for an edge pixel P_start with no segment label (i.e. "free") and with strength s(P_start) > T0 (upper threshold). P_start is now called the current pixel P_cur and its segment label is SegNum.
  FORWARD_SEARCH(P_cur)
  IF chain SegNum is not closed THEN
    Again set the current pixel P_cur to be the start pixel P_start.
    Perform BACKWARD_SEARCH(P_cur)
UNTIL all pixels are visited.
Example:
Edge chain image
Input image:
3. Line segment fit
The equation of a line determined by two points (x_j, y_j), (x_k, y_k):

x(y_j − y_k) + y(x_k − x_j) = y_j(x_k − x_j) − x_j(y_k − y_j)

i.e. ax + by = c , with a² + b² > 0.
The distance d_i of a point (x_i, y_i) from this line:

d_i = |a x_i + b y_i − c| / √(a² + b²)
Edge chain approximation
Procedure APPROXIMATE(CI, P1, PN)
1. Add the line l determined by the start point P1 and the end point PN of the chain CI to the set of line segments L(C).
2. For every point Pk in chain CI determine the distance sk to the line l.
3. IF FOR ALL Pk ∈ CI: sk ≤ θ THEN RETURN (END).
4. Select a point Pk with maximum distance sk. Remove the line l from the set L(C).
5. Decompose the chain CI into 2 parts: CI1 = P1 ... Pk; CI2 = Pk ... PN.
6. Call APPROXIMATE(CI1, P1, Pk). Call APPROXIMATE(CI2, Pk, PN).
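The recursive split procedure above can be sketched with the point-line distance formula from the previous slide. A minimal example (the tolerance θ = 0.5 and the L-shaped toy chain are illustrative); an L-shaped edge chain collapses to its three corner points:

```python
import math

def point_line_distance(p, a, b):
    # Distance of point p from the line through a and b
    (x, y), (xa, ya), (xb, yb) = p, a, b
    num = abs((yb - ya) * x - (xb - xa) * y + xb * ya - yb * xa)
    return num / math.hypot(xb - xa, yb - ya)

def approximate(chain, theta=0.5):
    # Keep the segment if all points lie within theta of the line through
    # the endpoints, otherwise split at the farthest point and recurse.
    if len(chain) <= 2:
        return chain
    a, b = chain[0], chain[-1]
    dists = [point_line_distance(p, a, b) for p in chain[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= theta:
        return [a, b]
    k = worst + 1                       # index of the split point in chain
    left = approximate(chain[:k + 1], theta)
    right = approximate(chain[k:], theta)
    return left[:-1] + right            # shared split point appears once

chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(approximate(chain))  # -> [(0, 0), (3, 0), (3, 3)]
```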
Approximation by arcs (1)
1. Try to approximate two consecutive line segments by an arc, e.g. start with arc ACB.
2. Measure the approximation error. E.g.:
eA = arg max_{i=1,2,3} |g(xi) − f(xi)| ;
eB = arg max_{i=5,6,7} |g(xi) − f(xi)| ;
IF sign(g(x_eA) − f(x_eA)) = sign(g(x_eB) − f(x_eB))
AND max(|g(x_eA) − f(x_eA)|, |g(x_eB) − f(x_eB)|) < γ·R
THEN Approximate(ACB, R)
Approximation by arcs (2) Approximate(ACB, R)
Approximation by arcs (3)
3. Try to extend an existing arc by including the next line segment. E.g. arc ACB is checked for extension to ACBD; the further extension to ACBDF fails, but a second arc DFE is found.
4. Try to approximate closed arc chains by circles or ellipses.
4. Hough transform (HT)
A polar representation of the straight line defined by an edge element ei consists of two parameters (d, α), where
d = xi cos α + yi sin α.
Remark: α = 90° + β, where β = r(e) is the edge's direction.
An image line corresponds to a point in the Hough space (d, α).
Hough transform for lines
The edge image E = [eij], i = 1,...,M; j = 1,...,N, is of size M × N and represents ONUM discrete edge directions. The Hough accumulator H(d, α) is a digital representation of the Hough space:
−π/2 ≤ α ≤ π/2
−(1/2)·sqrt(M² + N²) ≤ d ≤ (1/2)·sqrt(M² + N²)
Hough Transform for line detection
FOR every edge pixel eij (coordinates (xi, yj), orientation αij) DO:
- compute dij = xi cos αij + yj sin αij ;
- approximate (dij, αij) by the nearest discrete values (d+ij, α+ij) and increase H(d+ij, α+ij), i.e. H(d+ij, α+ij) = H(d+ij, α+ij) + 1.
FOR every (d, α) DO:
- IF H(d, α) > Threshold THEN the corresponding line is detected.
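A minimal sketch of the voting scheme, with the simplifying assumption that every discrete orientation is scanned for each edge pixel instead of being read from an edge-direction image (accumulator sizes and names are illustrative):

```python
import numpy as np

def hough_lines(edges, n_d=64, n_alpha=64, threshold=2):
    """Vote (d, alpha) for every edge pixel; return accumulator and detected cells."""
    M, N = edges.shape
    d_max = np.hypot(M, N)                       # |d| can never exceed the image diagonal
    alphas = np.linspace(-np.pi / 2, np.pi / 2, n_alpha, endpoint=False)
    H = np.zeros((n_d, n_alpha), dtype=int)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        for a_idx, alpha in enumerate(alphas):
            d = x * np.cos(alpha) + y * np.sin(alpha)
            d_idx = int(round((d + d_max) / (2 * d_max) * (n_d - 1)))
            H[d_idx, a_idx] += 1                 # nearest discrete (d, alpha) cell
    peaks = np.argwhere(H > threshold)
    return H, peaks
```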
HT for circle center detection
A circle, (x − xc)² + (y − yc)² = r², has 3 parameters.
A two-step circle recognition approach:
Step 1: detect circle centres by an appropriate Hough transform;
Step 2: detect edge chains around those centres.
Hough Transform for circle center detection
Let the line perpendicular to edge element ei be: gi: y = ai·x + bi.
The lines corresponding to circle border points cross at the point C = (xc, yc), the centre of the circle. In the Hough space H(a, b) these lines correspond to collinear points (ai, bi).
Circle center detection
A line gi that passes through (xc, yc) satisfies: y = ai(x − xc) + yc = ai·x + (yc − ai·xc).
This line corresponds to the point (ai, yc − ai·xc) in Hough space. A set of such points in Hough space lies approximately on a straight line: bi = −ai·xc + yc.
For N point observations:
[ −a1  1 ]              [ b1 ]
[ −a2  1 ] · [ xc ]  =  [ b2 ]
[  ...   ]   [ yc ]     [ ...]
[ −aN  1 ]              [ bN ]
the LSE solution is:
[ −xC ]            1             [   N     −Σai ]   [ Σ(ai·bi) ]
[  yC ]  =  ─────────────────  · [ −Σai    Σai² ] · [   Σbi    ]
            N·Σai² − (Σai)²
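This least-squares step can be sketched directly from the normal equations, assuming the slopes a_i and intercepts b_i of the gradient lines are given (names are illustrative):

```python
import numpy as np

def circle_center_lse(a, b):
    """Least-squares intersection of the lines b_i = -a_i*x_c + y_c in Hough space."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    det = n * np.sum(a**2) - np.sum(a)**2          # normal-equation determinant
    xc = (np.sum(a) * np.sum(b) - n * np.sum(a * b)) / det
    yc = (np.sum(a**2) * np.sum(b) - np.sum(a) * np.sum(a * b)) / det
    return xc, yc
```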
HT for circle detection
Now let us apply the Hough transform to detect all 3 parameters of the circle equation, (x − xc)² + (y − yc)² = r², in one step.
Hough transform for circle detection
FOR every edge pixel ei (coordinates (xi, yi), orientation αi) DO:
- the (unknown) centre coordinates (xC, yC) of a circle with (unknown) radius rd passing through point (xi, yi) satisfy: xi = xC + rd cos αi, yi = yC + rd sin αi ;
- FOR every discrete radius rd: 0 ≤ rd ≤ rmax, DO:
  - compute (xC, yC) from the above equations and
  - increase the Hough accumulator H(xC, yC, rd) by one;
FOR every (xC, yC, rd) DO:
- IF H(xC, yC, rd) > Threshold THEN the corresponding circle is detected.
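A sketch of the one-step voting in the 3-parameter space, assuming edge points with known orientations; since the centre may lie on either side of the edge point, both signs are tried here (an assumption, not stated on the slide):

```python
import numpy as np

def hough_circles(edge_pts, orientations, shape, r_max, threshold):
    """Vote in (x_c, y_c, r) space; the edge orientation gives the radial direction."""
    H = np.zeros((shape[1], shape[0], r_max + 1), dtype=int)
    for (x, y), alpha in zip(edge_pts, orientations):
        for r in range(1, r_max + 1):
            for sign in (1, -1):          # centre may lie on either side
                xc = int(round(x - sign * r * np.cos(alpha)))
                yc = int(round(y - sign * r * np.sin(alpha)))
                if 0 <= xc < shape[1] and 0 <= yc < shape[0]:
                    H[xc, yc, r] += 1
    return np.argwhere(H > threshold)
```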
Contour detection by GHT
A generalised Hough transform (GHT) for contour detection.
Parameters of the Hough space: C = (xC, yC) - location of the centre of mass, s - scale, α - contour orientation angle.
Model learning (design): for every pair of allowed discrete values (sd, αg) of scale and orientation, create a table R(sd, αg) = [r(ϕ), ϕ], where for each edge point xB with edge direction ϕ the pair [r(ϕ) = xB − C, ϕ] is stored.
Example: the model contour (top drawing) and the candidate contour (bottom).
5. Active contour
The active contour, or snake, is defined as an energy-minimizing polygon: the snake's energy depends on its shape and location within the image. A snake is initialized as a rough approximation of the object's boundary. The snake's motion rule is then applied iteratively, changing the snake's location and shape until the final condition is satisfied:
Finternal + Fexternal = 0, where Fexternal = −∇Eexternal.
Example. Two final snakes obtained from a common initial snake, using more (outer) or less (inner) external force.
Snake's forces
Edge-based external energy:
(1) Eexternal(x, y) = −|∇I(x, y)|²
or
(2) Eexternal(x, y) = −|∇(Gσ(x, y) ∗ I(x, y))|²
The internal energy is responsible for elasticity and stiffness, trying to shorten and to smooth the contour:
Einternal = Eelastic + Estiffness
E.g.
Eelastic = K1 · Σ_{i=0}^{n−1} |pi − pi−1|²
Estiffness = K2 · Σ_{i=0}^{n−1} |pi−1 − 2pi + pi+1|²
Motion (update) rule
Snake dynamics, the discrete motion rule: at each iteration, every control point pi = (xi, yi) is moved by a vector proportional to the force acting on it:
xi ← xi + α·FelasticX,i + β·FstiffnessX,i + γ·FexternalX,i
yi ← yi + α·FelasticY,i + β·FstiffnessY,i + γ·FexternalY,i
For example: α = 0.8, β = 0.1, γ = 0.6.
In particular:
FelasticX,i = 2K1·((xi−1 − xi) + (xi+1 − xi))
FelasticY,i = 2K1·((yi−1 − yi) + (yi+1 − yi))
FstiffnessX,i = K2·(4(xi−1 + xi+1) − 6xi − (xi−2 + xi+2))
FstiffnessY,i = K2·(4(yi−1 + yi+1) − 6yi − (yi−2 + yi+2))
FexternalX,i = (K3/2)·(I(xi + 1, yi) − I(xi − 1, yi))
FexternalY,i = (K3/2)·(I(xi, yi + 1) − I(xi, yi − 1))
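One iteration of this update can be sketched for a closed snake (the external force is omitted here, i.e. γ = 0; indices wrap around; parameter names are illustrative):

```python
def snake_step(xs, ys, k1=1.0, k2=1.0, alpha=0.1, beta=0.02):
    """One explicit update of a closed snake using elastic and stiffness forces."""
    n = len(xs)
    nxs, nys = [], []
    for i in range(n):
        im2, im1, ip1, ip2 = (i - 2) % n, (i - 1) % n, (i + 1) % n, (i + 2) % n
        f_el_x = 2 * k1 * ((xs[im1] - xs[i]) + (xs[ip1] - xs[i]))
        f_el_y = 2 * k1 * ((ys[im1] - ys[i]) + (ys[ip1] - ys[i]))
        f_st_x = k2 * (4 * (xs[im1] + xs[ip1]) - 6 * xs[i] - (xs[im2] + xs[ip2]))
        f_st_y = k2 * (4 * (ys[im1] + ys[ip1]) - 6 * ys[i] - (ys[im2] + ys[ip2]))
        nxs.append(xs[i] + alpha * f_el_x + beta * f_st_x)
        nys.append(ys[i] + alpha * f_el_y + beta * f_st_y)
    return nxs, nys
```

Without an external force the elastic term shrinks the contour, which is the expected behaviour.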
Example Two different final contours – obtained by varying the weights of the internal force. Inner contour: α = 0.8, β = 0.5, γ = 0.6 Outer contour: α = 0.8, β = 0.1, γ = 0.6
6. Point detector
The Harris-Stephens operator
Detect average image gradients Ix, Iy in the local neighbourhood of an image point (xk, yk) and represent them by the 2 × 2 matrix of their products:
A(xk, yk) = [ Σ Ix²    Σ Ix·Iy ]
            [ Σ Ix·Iy  Σ Iy²   ]
A point feature is detected when both eigenvalues of matrix A are high and of similar value.
SIFT and SURF detector
SIFT (Scale-Invariant Feature Transform) [Lowe, 2004] is both a multi-scale keypoint detector and an image descriptor. Basic idea: DoG, scale-space filtering by Difference of Gaussians.
The SIFT implementation consists of the following steps:
1. Scale-and-image space filtering (LoG filter),
2. Detection of extrema in scale-image space (DoG filter),
3. Keypoint (corner feature) detection,
4. Detection of the dominating orientation at the keypoint,
5. Keypoint description.
SURF (Speeded Up Robust Features) [Bay et al., 2008]
Idea: DoH, Determinant of Hessian, for keypoint detection.
EIASR Region-based image segmentation Lecture 6
1. Homogeneous region
A region R is a connected image area which is homogeneous with respect to some parameter or vector of parameters (e.g. intensity, colour, texture) and a given predicate: H(R) = 1.
Examples of homogeneity predicates:
1. H(R) = 1, if |max(R) − min(R)| < θ(R); 0 otherwise,
   where max(R) = max_{p∈R} f(p), min(R) = min_{p∈R} f(p).
2. H(p, R) = 1, if |f(p) − mean(R)| < θ(R); 0 otherwise,
   where θ(R) = θA − (F(R)/F(Rmax))·(θA − θB)
   or θ(R) = kσ = k·sqrt( (1/n)·Σ_{i=1}^{n} (f(pi) − mean(R))² ), with pi ∈ R, i = 1,...,n.
W. Kasprzak: EIASR
6. Region-based image segmentation
183
Region growing and merging
A simple approach to region detection is to start from some pixels (called seeds) si representing distinct image regions Ri and to grow them until they cover the entire image.
Initial seed detection: compute a histogram of the entire image; pixels whose value corresponds to a histogram peak are selected as seeds.
Region growing: at each stage k, and for each region Ri(k), check whether there are unlabelled neighbour pixels along the region border and whether the homogeneity criterion would still hold for the enlarged region. If so, enlarge the region by this pixel.
Split-and-merge
Region merging: merge adjacent regions that have similar statistical properties. For example, two regions R1, R2 may be merged if their arithmetic means are similar:
Predicate (3): |mean(R1) − mean(R2)| < θ(R1 ∪ R2)
Split-and-merge (quadtree structure):
Init: start the quadtree at some mid-level.
SPLIT: check whether every leaf node is homogeneous.
MERGE: check whether 4 sibling leaves can be merged.
Region detection algorithm
Combined region detection algorithm (M, θ, θA, θ(R)):
Initial image tessellation into square regions of size M × M.
REPEAT with the quadtree representation DO
  MERGE step (with a fixed threshold θ)
  SPLIT step (with a fixed threshold θ)
UNTIL no further split or merge is possible (predicate (1)).
Region merging/growing with a fixed threshold θA (predicate (3)).
Region merging with an adaptive threshold θ(R) (predicate (2)).
2. Texture Texture - a measure of image regularity, coarseness and smoothness. Examples of textures: grass, wood, water surface, etc.
Texture classification
Texture recognition is a classification problem.
Learning: find appropriate features and train a classifier.
Classification: detect the features of the current texture; classify them in terms of the learned classes.
Texture features:
• statistical (e.g. variance, skewness);
• spectral (e.g. using the autocorrelation function or the Fourier transform);
• structural (e.g. using discrete features such as colour, motion, number of endings).
Histogram-based texture features
Let fk, k = 1,...,L, be the values of the image function, and p(fk) the normalized histogram of an image region.
1. Mean: µ1 = µ = Σ_{k=1}^{L} fk·p(fk)
2. Variance: µ2 = σ² = Σ_{k=1}^{L} (fk − µ1)²·p(fk)
3. Skewness: µ3 = (1/σ³)·Σ_{k=1}^{L} (fk − µ1)³·p(fk)
4. Kurtosis: µ4 = (1/σ⁴)·Σ_{k=1}^{L} (fk − µ1)⁴·p(fk) − 3
5. Entropy: H(f) = −Σ_{k=1}^{L} p(fk)·log p(fk)
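The five features can be computed directly from a normalized histogram; a minimal sketch:

```python
import math

def histogram_features(p, values):
    """Mean, variance, skewness, kurtosis and entropy of a normalized histogram."""
    mean = sum(f * pf for f, pf in zip(values, p))
    var = sum((f - mean) ** 2 * pf for f, pf in zip(values, p))
    sigma = math.sqrt(var)
    skew = sum((f - mean) ** 3 * pf for f, pf in zip(values, p)) / sigma ** 3
    kurt = sum((f - mean) ** 4 * pf for f, pf in zip(values, p)) / sigma ** 4 - 3
    entropy = -sum(pf * math.log(pf) for pf in p if pf > 0)
    return mean, var, skew, kurt, entropy
```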
The interpretation of features
1. The mean gives the average pixel value in the given region.
2. The variance measures the dispersion of pixel values in this region.
3. Skewness is a measure of histogram symmetry.
4. Kurtosis is a measure of the tails of the histogram: long-tailed histograms correspond to spiky regions. The subtraction of 3 ensures that the kurtosis of a Gaussian distribution is normalized to zero.
5. Entropy expresses how uniformly the pixel values are distributed.
The major limitation of histogram features is that they cannot express the spatial characteristics of the texture.
Histograms of pairs
Histograms of sums and differences. Let fj;k and fj+µ;k+ν be two pixels in the image separated by the displacement vector V = (µ, ν)ᵀ. Their sum and difference are:
sj;k = fj;k + fj+µ;k+ν ;  dj;k = fj;k − fj+µ;k+ν
The histograms of sums and of differences, Hs(l, V) and Hd(m, V), for displacement V:
Hs(l, V) = # fj;k (such that sj;k = l)
Hd(m, V) = # fj;k (such that dj;k = m)
They contain information about the spatial organization of pixel values. E.g. a coarse texture results in a concentration of the histograms around l = 2 × mean and m = 0. Several displacements are considered, e.g.:
V = (µ, ν)ᵀ ∈ {(1,0), (0,1), (−1,0), (0,−1)}
Features of sums and differences
After normalisation:
hs(l) = Hs(l) / Σ_l Hs(l), l = 0,...,2(L−1)
hd(m) = Hd(m) / Σ_m Hd(m), m = −(L−1),...,(L−1)
Typical features are:
1. Mean: c1 = Σ_{l=0}^{2L−2} l·hs(l) ;  c2 = Σ_{m=−L+1}^{L−1} m·hd(m)
2. Contrast: c3 = Σ_{l=0}^{2L−2} l²·hs(l) ;  c4 = Σ_{m=−L+1}^{L−1} m²·hd(m)
3. Variance: c5 = Σ_{l=0}^{2L−2} (l − c1)²·hs(l) ;  c6 = Σ_{m=−L+1}^{L−1} (m − c2)²·hd(m)
4. Entropy: c7 = −Σ_{l=0}^{2L−2} hs(l)·log(hs(l)) ;  c8 = −Σ_{m=−L+1}^{L−1} hd(m)·log(hd(m))
Co-occurrence matrix
The grey-level co-occurrence matrix (GLCM) (by R. Haralick et al.) is defined as: G(d, r) = [gij(d, r)], where an element gij(d, r) specifies the number of pixel pairs (f = i, f = j) which are separated by distance d along the direction r.
Example. For a 5 × 5 image with L = 4 values, the co-occurrence matrix has L² = 4 × 4 elements. E.g.:
f = [ 0 0 1 1 0
      0 0 1 1 0
      0 2 2 2 0
      2 2 3 3 0
      2 2 2 2 2 ]
G(1; 0°) = [ 4  4  2  1
             4  4  0  0
             2  0 14  1
             1  0  1  2 ]
g00(1, 0°) counts the pairs (f00, f01), (f01, f00), (f10, f11), (f11, f10).
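The symmetric GLCM of the example can be reproduced with a short sketch (argument names are illustrative):

```python
import numpy as np

def glcm(img, d=1, direction=(0, 1), levels=4):
    """Symmetric grey-level co-occurrence matrix for displacement d*direction."""
    img = np.asarray(img)
    dy, dx = direction[0] * d, direction[1] * d
    G = np.zeros((levels, levels), dtype=int)
    rows, cols = img.shape
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                G[img[y, x], img[ny, nx]] += 1
                G[img[ny, nx], img[y, x]] += 1   # count the pair in both orders
    return G
```

Applied to the example image above, this reproduces g00 = 4 and g22 = 14 as on the slide.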
Parameters of GLCM
The adjacency of pixels in a pair is considered along four directions r (horizontal, vertical, left and right diagonal).
The distance parameter d is usually set to d = 1, but it can be set to d > 1. R. Haralick proposed fourteen features computed from the GLCM.
Features of the GLCM
Normalisation of the GLCM:
gij(d, r) = gij(d, r) / Σ_{i,j} gij(d, r), i, j = 0,...,(L−1)
Marginal distributions:
g1i = Σ_{j=0}^{L−1} gij ;  g2j = Σ_{i=0}^{L−1} gij
Mean and standard deviation of the marginal distributions:
µ1 = Σ_{i=0}^{L−1} i·g1i ;  σ1 = sqrt( Σ_{i=0}^{L−1} (i − µ1)²·g1i )
µ2 = Σ_{j=0}^{L−1} j·g2j ;  σ2 = sqrt( Σ_{j=0}^{L−1} (j − µ2)²·g2j )
Features of the GLCM
Selected features:
1. Mean-square of the distribution: c1 = Σ_{i,j} gij²
2. Contrast of the "difference" variable: c2 = Σ_{l=0}^{L−1} l²·( Σ_{|i−j|=l} gij )
3. Normalized "correlation" of the marginal distributions:
c3 = ( Σ_{i,j} (i − µ1)·(j − µ2)·gij ) / (σ1·σ2)
4. Entropy: c4 = −Σ_{i,j} gij·log gij
Filter banks
Convolving an image block with a set of kernels yields a representation of the block's texture in a different space. There is a strong response when the texture looks similar to the filter kernel, and a weak response when it does not.
Example. The collection of kernels may consist of a series of spots and bars at different scales. Spot filters respond strongly to small (non-oriented) points. Bar filters are oriented and tend to respond to edges.
Gabor filter banks
Gabor filters: Fourier transform elements multiplied by Gaussians. A Gabor kernel responds strongly if located over an image block whose texture has a particular spatial frequency and orientation. Gabor filters come in pairs: symmetric and anti-symmetric.
A symmetric kernel:
Gsymmetric(x, y) = cos(kx·x + ky·y) · exp{ −(x² + y²) / (2σ²) }
An anti-symmetric kernel:
Ganti-symmetric(x, y) = sin(kx·x + ky·y) · exp{ −(x² + y²) / (2σ²) }
where (kx, ky) are the spatial frequencies.
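The two kernels can be sketched on a discrete grid (parameter names are illustrative):

```python
import numpy as np

def gabor_kernels(kx, ky, sigma, half_size):
    """Symmetric (cosine) and anti-symmetric (sine) Gabor kernels on a square grid."""
    r = np.arange(-half_size, half_size + 1)
    x, y = np.meshgrid(r, r)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))   # Gaussian window
    g_sym = np.cos(kx * x + ky * y) * envelope
    g_anti = np.sin(kx * x + ky * y) * envelope
    return g_sym, g_anti
```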
Example. Gabor filter kernels, where mid-grey values represent zero, darker values represent negative numbers and lighter values represent positive numbers. The top row shows three anti-symmetric kernels, and the bottom row three symmetric kernels, in the horizontal direction.
Gabor filters in MPEG-7
MPEG-7 descriptors in the frequency domain: in polar coordinates the frequency domain is sampled into 30 "property channels"; the phase width is uniformly 30 degrees, and the magnitude widths are powers of 2:
θr = 30°·r, (r = 0, 1, ..., 5)
ωs = ω0·2^(−s), (s = 0, 1, ..., 4), ω0 = 3/4
Gabor filters in MPEG-7 (cont.)
The Gauss function in channel (s, r):
G_Ps,r(ω, θ) = exp( −(ω − ωs)² / (2σρs²) ) · exp( −(θ − θr)² / (2σθr²) )
The energy ei of the i-th channel:
ei = log(1 + gi), (i = 1, 2, ..., 30)
gi = Σ_{ω=0+}^{1} Σ_{θ=0°+}^{180°} [ G_Ps,r(ω, θ) · F(ω, θ) ]²
The standard deviation of energy qi in the i-th channel:
qi = log(1 + σi), (i = 1, 2, ..., 30)
σi = sqrt( Σ_{ω=0+}^{1} Σ_{θ=0°+}^{180°} [ ( G_Ps,r(ω, θ) · F(ω, θ) )² − gi/N ]² )
where F(ω, θ) is the 2D Fourier transform of the image, represented in polar frequency coordinates.
3. Point-based image descriptor
The image descriptor in SIFT
Dominating orientation. An orientation is assigned to each keypoint to achieve invariance to image rotation. A neighbourhood is taken around the keypoint location, depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created. The highest peak in the histogram is taken, and any peak above 80% of it is also considered for calculating the orientation. Eventually, this replicates the keypoint into many, with the same location and scale but different directions.
SIFT feature descriptor. A 16s × 16s neighbourhood around the keypoint is taken. It is divided into 16 sub-blocks of size 4 × 4. For each sub-block, an 8-bin orientation histogram is created. Thus, 128 bin values are available. The dominating direction is used for registration of the feature vector.
SURF descriptor
Dominating orientation. SURF uses wavelet responses in the horizontal and vertical direction in a neighbourhood of size 6s, with adequate Gaussian weights, to obtain gradients dx, dy. The dominant orientation is estimated by calculating the sum of all responses in the (dx, dy) space contained within a sliding orientation window of angle 60 degrees.
SURF feature descriptor. A neighbourhood of size 20s × 20s is taken around the keypoint, where s is the scale. It is divided into 4 × 4 subregions. For each subregion, horizontal and vertical wavelet responses are taken and a vector is formed:
v = [ Σdx, Σdy, Σ|dx|, Σ|dy| ]
This gives a SURF feature descriptor with 64 dimensions in total. An extended descriptor with 128 elements is also defined: the sums of dx and |dx| are computed separately depending on the sign of dy.
The magnitude part of the frequency characteristics. Example: a spectrogram before and after pre-emphasis.
8. Speech signal pre-processing
248
Auto-correlation of signal
For a window of size N we compute the normalized auto-correlation of the signal samples with their copies shifted by k. Starting at the m-th sample, the k-shifted normalized auto-correlation is:
rk(m) = ( Σ_{n=m}^{m+N−k−1} fn·fn+k ) / ( ||[fn]|| · ||[fn+k]|| )
The maximum of the normalized auto-correlation always appears at k = 0 and equals 1. For a voiced speech part, local maxima appear every k0 samples, i.e. for some k0, 2k0, 3k0, etc., because many strong harmonic frequencies exist in this part.
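A minimal sketch of the k-shifted normalized auto-correlation for one frame:

```python
import math

def norm_autocorr(frame, k):
    """Normalized auto-correlation of a frame with its k-shifted copy."""
    a = frame[:len(frame) - k]
    b = frame[k:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0
```

For a periodic frame the value peaks near 1 at multiples of the period, and a shift by half a period gives a strongly negative value.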
Base frequency
By normalized mutual correlation we mean a correlation factor between two consecutive signal frames. Let two frames of signal samples be given (in vector form):
f(m − k, m − 1) = [fm−k, ..., fm−1]
f(m, m + k − 1) = [fm, ..., fm+k−1]
The normalized mutual correlation of these two vectors is:
ρm(k) = ( f(m − k, m − 1) · f(m, m + k − 1) ) / ( ||f(m − k, m − 1)|| · ||f(m, m + k − 1)|| )
Start with k = 2 and compute the correlation factor for increasing values of k. In a voiced part of the speech, the maximum of the normalized mutual correlation appears at the k that corresponds to the base period (and base frequency F0) of the speaker.
EIASR Acoustic feature detection Lecture 9
1. Speech features
Frame-based speech signal features:
1. Mel-frequency cepstral coefficients (MFCC), extended by their first derivatives in time, or
2. speech features based on Linear Predictive Coding (LPC), e.g. LPCC - linear predictive cepstral coefficients.
Diagram: speech signal (frames) → features.
W. Kasprzak: EIASR
9. Acoustic feature detection
252
Cepstrum (1)
The cepstrum of a signal x[n] is the result of a homomorphic transformation: cepstrum(x) = F⁻¹(log |F(x)|), where F is the discrete-time Fourier transform (DFT) for MFCC or the Z transform for LPCC.
Note the word play: spectrum → spec|trum, ceps|trum → cepstrum.
Diagrams: the MFCC and LPCC processing chains.
Cepstrum (2)
Why are cepstrum features useful for speech recognition?
• The cepstrum features characterizing the (impulse response of the) vocal tract are located near the "zero" feature k = 0, whereas the input impulse components, corresponding to the larynx-modulated oscillations (which are not useful for speech recognition), are located at higher values of k ("longer" cepstrum time), where the cepstrum features achieve a maximum value.
• The useful features can be separated from the others by selecting an appropriate number of them, starting from k = 0, or by an additional multiplication called liftering.
• The speech impulse response can also be separated from the acquisition channel's (microphone) response by using centered cepstrum features.
2. MFCC
1. Short-time Fourier transform (STFT)
A windowed DFT for every frame τ of the input signal:
F(k, τ) = Σ_{t=0}^{M−1} ( x[τ + t] · e^(−i2πkt/M) · wτ[t] ), k = 0, 1, ..., M−1
Window functions w[t]:
1. rectangular window,
2. triangle window,
3. Hamming window:
wτ[t] = 0.54 − 0.46·cos(2πt/M), for t = 0, 1, ..., M−1; 0 otherwise,
etc.
Windowing example
Example. x[n] is the sum of two sine functions, uniformly sampled from 0 to 2π by 128 samples:
x[n] = sin(2πn/5) + sin(2πn/12), n = 0, 1, 2, ..., 127.
Figures: a single frame with the rectangular window applied, and a single frame with the Hamming window applied.
Windowing example (2)
Example (cont.). Magnitude of the Fourier coefficients: with the rectangular window and with the Hamming window.
Window functions
Conclusion (figure): f × w, Mag[DFT(w)], Mag[STFT(f × w)].
Spectrogram
Power of the Fourier coefficients (squared magnitude):
FC(k, τ) = |F(k, τ)|² = | Σ_{t=0}^{M−1} ( x[τ + t] · e^(−i2πkt/M) · wτ[t] ) |², k = 0, ..., M−1
Mel frequency scale
Non-linear response of the human ear to the frequency components of the audio spectrum: differences between frequencies at the low end (< 1 kHz) are easier to detect than differences of the same magnitude at the high end of the audible spectrum.
Approach: mimic the non-linear frequency analysis performed by the human ear; the higher the frequency, the lower its resolution.
MEL scale (empirical result):
fmel = 2595·log10( 1 + f / 700 Hz )
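The MEL mapping and its inverse (the inverse is an assumption used later for placing the triangular filters, not stated on this slide):

```python
import math

def hz_to_mel(f_hz):
    """Empirical MEL scale: 2595*log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, useful for placing triangular filters uniformly in mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```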
MFC
Mel frequency coefficients (MFC). Triangular filters D(l, k) are located uniformly on the Mel frequency scale:
MFC(l, τ) = Σ_{k=0}^{M−1} [ D(l, k) · FC(k, τ) ], l = 1,...,L
The MFC value associated with each bin corresponds to a weighted average of the power spectral values in the particular frequency range specified by the shape of the filter.
MFCC
The Mel-frequency cepstrum coefficients are computed by the homomorphic transformation MFCC(h) = FT⁻¹{log MFC{FT{h}}}, for h = x ⊗ w.
The last step is the inverse Fourier transform (a cosine transform) of the logarithmic Mel frequency coefficients:
MFCC(k, τ) = Σ_{l=0}^{L−1} [ log MFC(l, τ) · cos( k·(2l + 1)·π / (2L) ) ], k = 1,...,K
Centered MFCC:
MFCCcentered(k, τ) = MFCC(k, τ) − mean{ MFCC(k, τ) | τ = 1, 2, ... }, k = 1,...,K
Delta features
Energy feature. An additional feature is the total energy of the signal in a single frame:
E(τ) = log( Σ_{i=1}^{M} xi² )
Gradients of features in time ("delta" features). A schematic view of spectrograms for different phoneme types: single vowels (left), diphthongs (middle), plosives (right).
A linear regression over 5 consecutive frames is applied to find the delta coefficients d (of the MFCCs and of the energy feature c):
d(τ) = ( 2c(τ + 2) + c(τ + 1) − c(τ − 1) − 2c(τ − 2) ) / 10
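The regression formula can be sketched directly; here `c` is a per-frame sequence of one coefficient (an illustrative name):

```python
def delta(c, tau):
    """Delta coefficient by linear regression over 5 consecutive frames."""
    return (2 * c[tau + 2] + c[tau + 1] - c[tau - 1] - 2 * c[tau - 2]) / 10.0
```

For a linearly rising coefficient the delta equals the slope, and for a constant coefficient it is zero.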
Feature set
Per frame: energy, MFCC, delta energy, delta MFCC, and general features (total energy, mean and variance, normalized maximum auto-correlation, low-band ratio).
3. LPC and LPCC
The Z transform is a discrete-time signal transform, dual to the Laplace transform of continuous-time signals; it probes the signal with sinusoids and (decaying) exponentials:
X(z) = Σ_{n=−∞}^{∞} x[n]·z^(−n)
where z is a complex number: z = r·e^(−jω), r = e^(−σ).
The synthesis model of human speech (in the z-domain) consists of:
• an excitation source E(z) on the input,
• a linear filter with transmittance H(z),
• the speech signal X(z) on its output
(the signals and the filter are represented by their transforms in the complex-valued domain z).
Speech synthesis model
Let H(z) denote the transmittance of the filter (the z transform of its impulse response h[n]). The following relations hold in the z domain:
X(z) = H(z)·E(z) ,  E(z) = A(z)·X(z)
IIR filter
A digital IIR filter is characterized by a recursive equation:
x[n] = b0·e[n] + b1·e[n−1] + b2·e[n−2] + ... + bp·e[n−p] + a1·x[n−1] + a2·x[n−2] + ... + am·x[n−m]
The n-th output sample, x[n], is computed from the current and previous input samples and from previous output samples. In short:
x[n] − Σ_{k=1}^{m} ak·x[n−k] = Σ_{k=0}^{p} bk·e[n−k]
The corresponding description in the z-domain is:
H(z) = X(z)/E(z) = ( Σ_{k=0}^{p} bk·z^(−k) ) / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
LPC
The Auto-Regressive (AR) model assumes that the numerator is 1:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
Thus in the AR model the n-th output sample, xn, is estimated from only the m previous output samples and the current input sample:
x[n] = e[n] + a1·x[n−1] + a2·x[n−2] + ... + am·x[n−m]
In short: xn = en + Σ_{k=1}^{m} ak·xn−k
Ideally, for voiced parts the vocal tract is cyclically fed by a Dirac delta impulse; then e0 = 1, en = 0, for short-time frames. Thus the n-th speech sample (in a frame) is estimated as a linear combination of the previous m samples:
x̂n = Σ_{k=1}^{m} ak·xn−k
Auto-correlation method for LPC
The task is to compute the parameters {ak | k = 1,...,m} for every signal frame. By the LSE approach, for a given frame, we minimize:
ε = Σ_{n=n0}^{n1} (xn − x̂n)² ;  ∂ε/∂ai = −Σ_n 2·( xn − Σ_k ak·xn−k )·xn−i = 0
where n0, n1 are the training sample indices in the given frame. We get m equations with m unknowns:
Σ_k ak·Σ_n xn−k·xn−i = Σ_n xn·xn−i , i = 1,...,m
By introducing the first m+1 auto-correlation coefficients:
r|i−k| = Σ_{n=0}^{M−1−|i−k|} xn·xn+|i−k| = Σ_n xn−k·xn−i
the equation system takes the form:
Σ_{k=1}^{m} ak·r|i−k| = ri , i = 1,...,m
LPC computation
[ r0    r1    r2   ...  rm−1 ]   [ a1 ]   [ r1 ]
[ r1    r0    r1   ...  rm−2 ] · [ a2 ] = [ r2 ]
[ ...                        ]   [ ...]   [ ...]
[ rm−1  rm−2  rm−3 ...  r0   ]   [ am ]   [ rm ]
R·a = r
The matrix R is a Toeplitz matrix (it is symmetric with equal diagonal elements). Due to this Toeplitz property, an efficient algorithm is available for computing a without computing the inverse matrix R⁻¹.
Alternative method: the Levinson-Durbin algorithm, an iterative method for the computation of the LPC parameters.
The Levinson-Durbin algorithm
E represents the prediction error, Ki the reflection coefficients between consecutive parts of the acoustic tube, and aj the final prediction coefficients.
INIT: E0 = r0
FOR i = 1 to m:
  Ki = ( ri − Σ_{j=1}^{i−1} αj,i−1·r|i−j| ) / Ei−1
  αi,i = Ki
  FOR j = 1 to i−1: αj,i = αj,i−1 − Ki·αi−j,i−1
  Ei = (1 − Ki²)·Ei−1
FOR j = 1 to m: aj = αj,m
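A sketch of the recursion above (0-based lists; `alpha[j]` holds α_{j,i} of the current order i):

```python
def levinson_durbin(r, m):
    """Solve the Toeplitz system R a = r for the LPC coefficients a[1..m].
    `r` holds the autocorrelation coefficients r[0..m]."""
    err = r[0]
    alpha = [0.0] * (m + 1)
    for i in range(1, m + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / err                  # reflection coefficient K_i
        new_alpha = alpha[:]
        new_alpha[i] = k
        for j in range(1, i):
            new_alpha[j] = alpha[j] - k * alpha[i - j]
        alpha = new_alpha
        err *= (1.0 - k * k)           # prediction error E_i
    return alpha[1:], err
```

For autocorrelation values of a first-order AR signal the method recovers the single true coefficient and leaves the higher-order ones at zero.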
LPC-based features
Basic parameters. The prediction parameters can themselves be considered a feature vector (usually the error value is also included):
ci = ai for i = 0,...,m, where a0 = 1, and cm+1 = ε.
The number of prediction parameters is m ∈ ⟨12, 20⟩. Approximately, for the sampling frequency fs [kHz]: m ≈ fs + [2, 4].
Spectral features (smoothed signal spectrum). By transforming the parameter vector into the frequency domain we get a smooth spectrum of the signal frame. The required resolution in the frequency domain is achieved by padding the parameter vector with zeros to get a vector with M elements:
A_M = DFT([1, a1, a2, ..., am, 0, 0, ..., 0])
LPCC (1)
LPCC - the cepstral LPC. Recall that the speech synthesis filter is described in the z-domain by the transmittance function:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) )
The polynomial in the denominator can be factored, giving an all-pole transmittance function:
H(z) = 1 / ( 1 − Σ_{k=1}^{m} ak·z^(−k) ) = 1 / Π_{k=1}^{m} (1 − pk·z^(−1))
Next we apply the ln function and the inverse Z transform:
c[1 : m] = Z⁻¹(ln[H(z)]) = Z⁻¹( −Σ_{k=1}^{m} ln[1 − pk·z^(−1)] )
LPCC (2)
A direct iterative method for computing the LPCC features
Instead of performing the individual steps of the cepstrum transformation of the LPC coefficients, there exists an iterative method for the direct computation of the LPCC features from the LPC coefficients.
For 1 ≤ n ≤ m (where m is the order of the LPC transform):
c[n] = −an − Σ_{k=1}^{n−1} (1 − k/n)·c[n−k]·ak ;  n = 1, 2, ..., m
For n > m:
c[n] = −Σ_{k=1}^{n−1} (1 − k/n)·c[n−k]·ak ;  n > m
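The recursion can be sketched as follows, keeping the slide's sign convention; `a` is indexed a[1..m] with a[0] unused, and ak = 0 is assumed for k > m:

```python
def lpc_to_lpcc(a, n_ceps):
    """Iterative LPC -> LPCC conversion following the recursion above."""
    m = len(a) - 1
    c = [0.0] * (n_ceps + 1)           # c[1..n_ceps]; c[0] unused
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= m else 0.0
        for k in range(1, n):
            if k <= m:                 # a_k = 0 beyond the model order
                acc -= (1.0 - k / n) * c[n - k] * a[k]
        c[n] = acc
    return c[1:]
```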
EIASR Phonetic speech model Lecture 10
1. Phonetic categories
Phone (articulated sound) - the smallest element of speech. "A phone is a speech sound considered as a physical event without regard to its place in the sound system of a language" (Webster dictionary). A phone may have different realisations, determined by tonality, duration and intonation. The IPA (International Phonetic Association) has defined some 100 phones, while at least 40 are required for a rough transcription.
Phoneme - a category of similar phones. Even though no two speech sounds are identical, all of the phones classified into one phoneme category are similar enough to have the same meaning.
10. Phonetic speech model
276
Groups of phonemes
Primary groups: vowels and consonants - the basic contrast is whether a phoneme can serve as a syllable nucleus or not.
The mode of phonation:
1. Sonorants - "buzzes" - characterized mainly by voicing: the repetitive opening and closing of the vocal cords.
2. Fricatives - "hisses" - generally non-voiced sounds.
3. Plosives - "pops" - explosive sounds (including affricates).
4. Silence - phrase marker, breathing, mini-silences and closures before a plosive.
Vowels
1. Monophthongs (11 phonemes) - a single vowel quality.
• stressed vowels: /A/, /E/, /@/, /i:/, /o/, /U/ (e.g. in the words father, bet, bat, beet, above, boot);
• reduced vowels: /^/, /&/, /I/, /u/ (e.g. in the words bus, above, bit, book);
• r-coloured vowels: long /3r/ and short /&r/ (as in bird and butter).
2. Diphthongs (6) - a clear change from start to end: /ei/ (e.g. late), /aI/ (e.g. bye), />i/ (e.g. boy), /iU/ (e.g. few), /aU/ (e.g. loud), /oU/ (e.g. boat).
Consonants (1)
3. Approximants (4) – semivowels – similar to vowels, but with more obstruction in the vocal tract than for the vowels:
• liquids: the /l/ (e.g. "like") and the retroflex /9r/ (e.g. "red");
• glides: the /j/ (e.g. "yes") and the /w/ (e.g. "won").
4. Nasals (3) – the airflow is blocked completely in the oral tract, but a lowering of the velum allows a weak flow through the nose: /m/ as in "me", /n/ as in "new", /ng/ as in "sing".
5. Fricatives (9) – weak or strong friction noises, produced when the articulators are close enough together to cause turbulence in the airflow:
• voiceless /f/, /T/, /s/, /S/, /h/ as in: "fine", "thing", "sign", "assure", "hope";
• voiced /v/, /D/, /z/, /Z/ as in: "voice", "than", "resign", "vision".
Consonants (2)
6. Plosives (6) – bursts or explosive sounds produced by a complete closure of the vocal tract followed by a rapid release of the closure:
• unvoiced /ph/, /th/, /kh/ (as in "can"), and
• voiced: /b/, /d/, /g/.
7. Affricates (2) – plosives released with frication: the /tS/ sound (as in "church"), the /dZ/ (as in "judge").
The IPA (International Phonetic Alphabet) (c/o Department of Linguistics, University of Victoria, Victoria, British Columbia, Canada) is recognised as the international standard for the transcription of phonemes in all languages.
Articulation of vowels
Articulation of vowels and /j/ according to the level of mouth opening and the tongue location.
Articulation of consonants
Classification of consonants according to phonetic categories and articulation area.
2. Spectral cues
Formants (energy concentrations in frequency bands): F0–F5.
Fundamental frequency (F0) – the pitch. Usually 80–200 Hz (men), 150–350 Hz (women).
F1 can vary from 300 Hz to 1000 Hz. The lower it is, the closer the tongue is to the roof of the mouth. /i:/ has the lowest F1 (∼300 Hz); /A/ has the highest F1 (∼950 Hz).
F2 can vary from 850 Hz to 2500 Hz. The F2 value is proportional to the front or back position of the tongue tip. In addition, lip rounding causes a lower F2 than with unrounded lips. /i:/ has an F2 of 2200 Hz, the highest F2 of any vowel; /u/ has an F2 of 850 Hz – the tongue tip is far back, and the lips are rounded.
Formants
Distribution of the main formants in vowels.
Typical spectrograms (1)
1. Monophthong vowels – a strong voicing, with stable formants.
2. Diphthong vowels – a strong voicing, but moving formants.
3. Approximants – voiced, but with less visible formants.
Typical spectrograms (2)
4. Nasals – low energy and a characteristic "zero" region.
5. Fricatives – a high-frequency Gaussian region.
6. Plosives – an explosive burst of acoustic energy, following a short period of silence.
7. Affricates – a plosive followed by a fricative.
3. Sub-phonemes
Pronunciation differences (co-articulation effects): phones have a great influence on neighbouring phones.
Tri-phone model: split each phone into one, two, or three parts, depending on the typical duration of that phone as well as how much that phone is influenced by surrounding phones.
Example. A tri-phone that represents the phoneme /E/ in "yes": $front$fric (/E/ in the context of a following fricative). In general /E/ can be decomposed into 17 tri-phones:
• 8 for the possible left contexts,
• 8 for the possible right contexts, and
• 1 context-independent (centre phone).
Context categories (1)
1. Front – front vowels and similar approximants.
2. Mid – central vowels and similar approximants.
3. Back – back vowels and similar approximants.
4. Sil – silence.
5. Nasal – nasals.
6. Retro – retroflex approximant and retroflex (r-coloured) vowels.
7. Fric – fricatives and affricates.
8. Other – plosives and remaining approximants.
Context categories (2)
EIASR Word and sentence recognition Lecture 11
1. Speech variability
1. Acoustic-phonetic variability – covers different accents, pronunciations, pitches, volumes, and so on.
2. Temporal variability – covers different speaking rates.
3. Syntactic variability (of sentences).
A useful simplification is to treat them independently. The temporal variability is the easiest to handle – an efficient algorithm for it is known as Dynamic Time Warping. Acoustic-phonetic variability is more difficult to model; it covers acoustic feature schemes and phonetic modelling. Syntactic variability deals with different word sequences representing the same sentence (meaning).
Dynamic Time Warping
Let Y be a prototype (reference pattern) and X the current observation (input pattern). Both Y and X are sequences of simple patterns. The distance D(X, Y) between the input and a reference pattern is defined despite the different lengths of the sequences. The goal of DTW is to find a path in the space of possible matches of both sequences that is optimal w.r.t. the cost function, i.e.
l* = arg min_l D(X, Y_l)
D(X, Y) is defined as the sum of local distances, d_ij = d(x_i, y_j), between corresponding elements on the solution path. The cost of a path leading to "state" (x_i, y_j) is:
R(x_i, y_j) = min { R(x_{i−1}, y_{j−1}) + d(x_{i−1}, y_{j−1}),
                    R(x_{i−1}, y_j) + d(x_{i−1}, y_j),
                    R(x_i, y_{j−1}) + d(x_i, y_{j−1}) }
DTW - Example
Matching a 3-segment input pattern (x1 x2 x3) with a 4-element model pattern (y1 y2 y3 y4).
An existing path can only be extended by one of the three allowed transitions:
– state(x_{i−1}, y_{j−1}) → state(x_i, y_j)
– state(x_{i−1}, y_j) → state(x_i, y_j)
– state(x_i, y_{j−1}) → state(x_i, y_j)
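The DTW recursion with the three allowed transitions can be sketched as follows (a minimal sketch; it charges the local distance at the current cell, a common textbook variant of the accumulated-cost recursion, and uses absolute difference as the default local distance):

```python
import math

def dtw_distance(X, Y, d=lambda x, y: abs(x - y)):
    """Accumulated DTW cost D(X, Y) between two sequences of possibly
    different lengths, using the three transitions from the slide."""
    n, m = len(X), len(Y)
    # R[i][j] = best cost of a path reaching (x_i, y_j); borders are inf
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = d(X[i - 1], Y[j - 1]) + min(R[i - 1][j - 1],  # diagonal
                                                  R[i - 1][j],      # advance input
                                                  R[i][j - 1])      # advance model
    return R[n][m]
```

Recognition then picks the reference pattern Y_l with the smallest `dtw_distance(X, Y_l)`.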
2. Speech recognition
Word recognition
In word recognition the question is: what is the most likely word represented by a given acoustic sequence? A word consists of a sequence of phonemes.
W* = arg max_word P(word | signal)
From the Bayes rule we get:
P(word | signal) = α · P(signal | word) · P(word)
Two prior stochastic models are required:
1. the acoustic-phonetic model – P(signal | word), and
2. the language model – P(word).
In particular, the signal is represented by n feature vectors:
P(word | c_1 c_2 ... c_n) = α · P(c_1 c_2 ... c_n | word) · P(word)
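Since the normalising factor α is the same for every candidate word, the arg max can be computed from the two models alone. A toy illustration (all words and scores are invented for the example):

```python
# Hypothetical acoustic likelihoods P(signal | word) and
# language-model priors P(word) for three candidate words.
acoustic = {"yes": 0.030, "guess": 0.025, "chess": 0.010}
prior    = {"yes": 0.60,  "guess": 0.30,  "chess": 0.10}

# Bayes rule: P(word | signal) = alpha * P(signal | word) * P(word);
# alpha cancels in the arg max, so it can be ignored.
best = max(acoustic, key=lambda w: acoustic[w] * prior[w])
```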
Sentence recognition
In sentence recognition the question is: what is the most likely sentence represented by a given sequence of words? A sentence consists of words from our lexicon.
S* = arg max_sentence P(sentence | words)
A sentence is a meaningful sequence of words; the signal is a sequence of observed words. From the Bayes rule we get:
P(sentence | words) = α · P(words | sentence) · P(sentence)
Two prior stochastic models are needed:
1. the application model – P(words | sentence), and
2. the language model – P(sentence).
Hidden Markov Model
A HMM is defined as a 5-tuple: HMM = (S, Y, Π, A, B), where
S = {S_1, S_2, ..., S_N} – the set of states,
Y – the set of output symbols,
Π – the start-state probability vector,
A = [a_ij] – the state transition probability matrix,
B = [b_jk] – the output probability matrix.
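The 5-tuple maps directly onto a data structure. A minimal sketch in Python (the field names and the `sample` method are my own additions; `sample` only illustrates how Π, A and B interact when a discrete HMM generates an observation sequence):

```python
from dataclasses import dataclass
import random

@dataclass
class HMM:
    """Discrete HMM = (S, Y, Pi, A, B), as defined on the slide."""
    states: list     # S = {S_1, ..., S_N}
    symbols: list    # Y - the output alphabet
    pi: list         # Pi - start-state probabilities, length N
    A: list          # A[i][j] = P(next state = S_j | current state = S_i)
    B: list          # B[j][k] = P(output = Y_k | state = S_j)

    def sample(self, T, rng=None):
        """Generate an observation sequence of length T."""
        rng = rng or random.Random(0)
        i = rng.choices(range(len(self.states)), weights=self.pi)[0]
        out = []
        for _ in range(T):
            out.append(rng.choices(self.symbols, weights=self.B[i])[0])
            i = rng.choices(range(len(self.states)), weights=self.A[i])[0]
        return out
```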
HMM in word recognition
Output probability model
1. Discrete scalar – a single output symbol is observed in one state:
P_j(O^t = Y_k | S_j) = b_jk ;  j = 1,...,N; k = 1,...,M; t = 1,...,T
2. Discrete vector – a vector of symbols per state is observed:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · P_k(c^t)
3. Semi-continuous distribution – a mixture of Gaussian (per-class) distributions:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · p_k(c^t | Ω_k) = Σ_{k=1}^M b_jk · N(c^t; μ_k, Σ_k)
4. Continuous – a mixture of different distributions per state and class:
P_j(O^t = c^t | S_j) = Σ_{k=1}^M b_jk · p_k(c^t | Ω_jk) = Σ_{k=1}^M b_jk · N(c^t; μ_jk, Σ_jk)
HMM in word recognition (2)
Structure: a left-to-right HMM.
Actions: INSert, SUBstitute, DELete.
Viterbi search (1)
Viterbi search – the goal is to find the best path of states in the HMM, w.r.t. the observed sequence:
(S_1 ... S_T)* = arg max_{S_1...S_T} P(S_1 ... S_T | c_1 c_2 ... c_T)
This is done iteratively, from t=1 to t=T, while extending every partial path (S_1 S_2 ... S_{t−1}) in the HMM model in a best way with respect to the observed sequence of feature vectors (c_1 c_2 ... c_t), according to the dynamic programming principle:
P(S_1 ... S_{t−1} S_t | c_1 c_2 ... c_t) = P(c_t | S_t) · max_{S_{t−1}} ( P(S_t | S_{t−1}) · P(S_1 ... S_{t−1} | c_1 ... c_{t−1}) )
In practice a logarithmic scale can be used:
ln P(S_1 ... S_{t−1} S_t | c_1 c_2 ... c_t) = ln P(c_t | S_t) + max_{S_{t−1}} ( ln P(S_t | S_{t−1}) + ln P(S_1 ... S_{t−1} | c_1 ... c_{t−1}) )
Viterbi search (2)
The search space in Viterbi search – consistent with the dynamic programming search – is a 2-D space indexed by time (observed sequence) and HMM state indices.
Recall that in a discrete HMM: a_ij = P(S_j | S_i); b_jk = P(Y_k | S_j).
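The iteration over the 2-D time-state space can be sketched as follows (a minimal sketch for a discrete HMM, working in the log domain and keeping back-pointers for the final backtracking step):

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely state path for a discrete HMM.

    pi[i]   - start probability of state i
    A[i][j] - transition probability i -> j
    B[j][k] - probability that state j emits symbol k
    obs     - observed symbol indices (length T)
    """
    N = len(pi)
    log = lambda p: math.log(p) if p > 0 else -math.inf
    # initialisation at t = 0
    V = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []                              # back-pointers, one row per step
    for t in range(1, len(obs)):
        back.append([max(range(N), key=lambda i: V[i] + log(A[i][j]))
                     for j in range(N)])
        V = [V[back[-1][j]] + log(A[back[-1][j]][j]) + log(B[j][obs[t]])
             for j in range(N)]
    # backtrack from the best final state
    path = [max(range(N), key=lambda j: V[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```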
Viterbi search (3)
Baum-Welch training (1)
The parameters of a discrete HMM model, λ = (Π, A, B), can be estimated by the Baum-Welch training (an EM-like approach that uses the "forward-backward" algorithm to obtain the current forward and backward "messages"). Given a discrete HMM model and an observation sequence O = (O_1, ..., O_T), the Baum-Welch training is an ML-estimation of the parameters λ:
P_HMM(λ) = log P(O | λ) = log Σ_{q ∈ S^T} P(O, q | λ)
where q is a state sequence of length T, corresponding to the observation O. The goal function:
λ^ = arg max_λ P_HMM(λ)
„Forward-backward” algorithm
Let N be the number of states.
The "forward term" – the probability of generating the partial sequence and ending up in the state with index j at time t:
α_t(j) = P(O_1 ... O_t, q_t = s_j | λ)
The "backward term" – the probability of generating the remainder of the sequence, given the state with index i at time t:
β_t(i) = P(O_{t+1} ... O_T | q_t = s_i, λ)
Thus the probability of visiting the state with index j at frame t, for a complete observation sequence O, is the product of the above probabilities:
α_t(j) · β_t(j) = P(O, q_t = s_j | λ)
Both terms are computed recursively:
α_t(j) = b_j(O_t) · Σ_{i=1}^N α_{t−1}(i) · a_ij
β_t(i) = Σ_{j=1}^N a_ij · b_j(O_{t+1}) · β_{t+1}(j)
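The two recursions can be sketched directly (a minimal sketch for the discrete case; `pi`, `A`, `B` are the HMM parameters as defined earlier and the observations are given as symbol indices):

```python
def forward_backward(pi, A, B, obs):
    """Forward and backward terms for a discrete HMM.

    alpha[t][j] = P(O_1..O_t, q_t = s_j)
    beta[t][i]  = P(O_{t+1}..O_T | q_t = s_i)
    """
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for j in range(N):                       # forward init, t = 0
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):                    # forward recursion
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j]
                                             for i in range(N))
    for i in range(N):                       # backward init, t = T-1
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):           # backward recursion
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    return alpha, beta
```

A useful sanity check: Σ_j α_t(j)·β_t(j) equals P(O | λ) for every t, in particular Σ_j α_T(j) at the last frame.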
„Baum-Welch” algorithm (1)
REPEAT
1) With current λ = {π_i, a_ij, b_jk | i,j = 1,...,N; k = 1,...,M} and t = 1,...,T, get α_t(j), β_t(i) by the "forward-backward" algorithm.
2) (The E-step) The expected probability of the transition s_i → s_j at time t, given the observation sequence O = (O_1, ..., O_T), is:
ξ_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) = P(q_t = s_i, q_{t+1} = s_j, O | λ) / P(O | λ)
         = α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) / Σ_{i=1}^N α_t(i) · β_t(i)
γ(i,j) – the expected probability that the transition from state i to state j appears, at any time from 1 to T:
γ(i,j) = Σ_t ξ_t(i,j)
„Baum-Welch” (2)
γ(i) – the expected probability that state i is visited, at any time from 1 to T:
γ(i) = Σ_j Σ_t ξ_t(i,j)
The expected probability that state i emits the symbol (class) Y_k:
γ(i, Y_k) = Σ_{t: O_t ∈ Y_k} Σ_j ξ_t(i,j)
The expected probability of visiting state S_i at a particular time t is:
γ_t(i) = P(q_t = S_i | O, λ) = α_t(i) · β_t(i) / Σ_{j=1}^N α_t(j) · β_t(j) = Σ_{j=1}^N ξ_t(i,j)
3) (The M-step) In this step we re-estimate the HMM parameters – the vector Π and the matrices A and B – by taking simple ratios between the above terms.
„Baum-Welch” (3)
The update rules for the parameters of a discrete HMM model:
π^_i = γ_1(i) = α_1(i) · β_1(i) / Σ_{j=1}^N α_1(j) · β_1(j) = Σ_{j=1}^N ξ_1(i,j)
â_ij = γ(i,j) / γ(i) = [ Σ_{t=1}^T ξ_t(i,j) ] / [ Σ_{t=1}^T γ_t(i) ]
     = [ Σ_{t=1}^T α_t(i) · a_ij · b_j(O_{t+1}) · β_{t+1}(j) ] / [ Σ_{t=1}^T α_t(i) · β_t(i) ]
b^_jk = γ(j, Y_k) / γ(j) = [ Σ_{t=1}^T γ_t(j) · χ(O_t ∈ Y_k) ] / [ Σ_{t=1}^T γ_t(j) ]
     = [ Σ_{t=1}^T α_t(j) · β_t(j) · χ(O_t ∈ Y_k) ] / [ Σ_{t=1}^T α_t(j) · β_t(j) ]
where χ(a ∈ Y) = 1 if the formula in brackets is satisfied and χ(·) = 0 otherwise.
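One full E-step plus M-step for a single observation sequence can be sketched as follows (a minimal sketch; it recomputes the forward and backward terms inline and then forms the M-step ratios described above):

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step for a discrete HMM
    (single observation sequence)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    # forward-backward
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j]
                                             for i in range(N))
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    PO = sum(alpha[T-1])                     # P(O | lambda)
    # E-step: state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: simple ratios of expected counts
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return pi_new, A_new, B_new
```

Iterating this step monotonically increases P(O | λ) until a local maximum of the likelihood is reached.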
3. Sentence recognition
Two recognition steps – many-word recognition is performed first (using a 2-layer HMM model) and alternative word hypotheses are generated; then sentence recognition is performed with an additional HMM layer that searches for a syntactically proper and semantically meaningful word sequence.
Word lattice
The word recognition stage generates 1-, 2- or 3-syllable-based (partially alternative) word hypotheses.
Example: a word lattice over syllable positions 0–10.
N-grams
The language model provides the prior probability of a given word sequence, P(words). For n words (s_1 ... s_n) we have:
P(s_1 ... s_n) = P(s_1) · P(s_2 | s_1) ··· P(s_n | s_1 ... s_{n−1}) = Π_{i=1}^n P(s_i | s_1 ... s_{i−1})
In practice n needs to be of limited length. The bigram model – the probability of the next word depends only on the direct predecessor word:
P(s_i | s_1 ... s_{i−1}) ≈ P(s_i | s_{i−1})
P(s_1 ... s_n) = P(s_1) · P(s_2 | s_1) ··· P(s_n | s_{n−1}) = Π_{i=1}^n P(s_i | s_{i−1})
The distribution P(s_i | s_{i−1}) can be learned by counting the relative frequencies of word pairs in a large learning set.
N-gram – the N-th word depends on the N−1 predecessor words.
Smoothing an N-gram
Katz' smoothing for a three-gram model:
P*(w_i | w_{i−1}, w_{i−2}) = C(w_{i−2}, w_{i−1}, w_i) / C(w_{i−2}, w_{i−1}),   for r > k
P*(w_i | w_{i−1}, w_{i−2}) = d_r · C(w_{i−2}, w_{i−1}, w_i) / C(w_{i−2}, w_{i−1}),   for 0 < r ≤ k