UDC 621.391+519.72+528.71

G. GIMEL'FARB, A.A. FARAG

TEXTURE ANALYSIS BY ACCURATE IDENTIFICATION OF SIMPLE MARKOVIAN MODELS

Keywords: Markov random field, parameter estimation, image segmentation, texture synthesis.

1. INTRODUCTION
Statistical pattern recognition and probabilistic signal/image modelling have attracted considerable attention in "Cybernetics and Systems Analysis", at that time simply "Cybernetics" (Kibernetika), from the very outset (see, e.g., [11, 20, 28]). As is shown below, one of those very first steps, namely, Schlesinger's unsupervised learning algorithm [28], is still serviceable in today's pattern recognition and image analysis.

In this paper, we consider two problems of analysing image textures: (i) precise segmentation of spatially inhomogeneous multi-modal images and (ii) modelling and realistic synthesis of homogeneous textures. Each region-of-interest in a multi-modal image relates to an individual dominant peak, or mode, of the empirical marginal probability distribution of grey levels. Texture homogeneity is restricted to only translational invariance of selected second-order signal statistics.

Significant achievements in solving these problems have come from Markov-Gibbs random field (MGRF) models of grayscale images and/or region maps indicating the regions-of-interest in the grayscale images. This avenue of investigation, which originated in the late seventies and early eighties [3, 9, 18], persists up to the present. Each MGRF model relates image probabilities to an explicit spatial geometry and quantitative strengths of statistical dependences, or interactions, between grey levels and/or region labels in the sites (pixels) of an arithmetic lattice supporting the images. The interaction geometry is specified with a neighbourhood graph linking the interacting pixels, traditionally called neighbours. Its translational invariance stems from a fixed neighbourhood of a single pixel. Quantitative interaction strengths are given by Gibbs potential functions on a certain subset of cliques of the graph [5, 18].

At present, increasingly more complex models come to the forefront in attempts to better describe intricate classes of images (see, e.g., [31]). Nonetheless, a reasonably large number of important practical problems can still be solved effectively with far simpler conventional MGRFs, provided that more accurate identification (parameter estimation) is able to precisely focus the model on a particular class of images. This paper overviews the potentialities of two such simple models, namely, a joint MGRF of multi-modal grayscale images and their region maps that refines the like model in [10, 14], and a general MGRF of spatially homogeneous grayscale textures or region maps with multiple pairwise pixel interactions [12, 13]. The parameters of the latter model to be estimated from a given training image consist of a characteristic pixel neighbourhood and the corresponding Gibbs potentials, one per family of translation invariant pairwise cliques. The former model is much simpler and assumes statistically independent grey levels with the same multi-modal probability distribution in each pixel and conditionally interdependent region labels. The labels involve translation invariant nearest-neighbour pairwise interactions and conditional pixel-wise interactions with the grey levels. The model parameters consist of the multi-modal grey level distribution to be accurately recovered from the empirical marginal distribution of grey levels in the training image and the Gibbs potentials for the conditional MGRF of region labels, given the image.

The paper is organised as follows. Section 2 summarises analytical properties of the MGRFs useful for their accurate identification.
Novel algorithms for segmenting multi-modal images by the Expectation-Maximization (EM) based identification of a simple MGRF model of images and region maps are briefly reviewed in Section 3. Section 4 outlines the use of an identified MGRF model with multiple pairwise pixel interactions to realistically synthesise spatially homogeneous stochastic textures with only close-range characteristic interactions and nearly periodic mosaics with both the close- and long-range interactions. A few concluding remarks are given in Section 5.

2. PROPERTIES OF MGRFS WITH PAIRWISE PIXEL INTERACTIONS
Let $\mathbf{R} = \{(x, y) : x = 0, 1, \ldots, X - 1;\ y = 0, 1, \ldots, Y - 1\}$ denote an arithmetic lattice of size $XY$ supporting digital images $s = [s(x, y) : (x, y) \in \mathbf{R};\ s(x, y) \in \mathbf{S}]$, where $\mathbf{S} = \{0, 1, \ldots, S - 1\}$ is a finite set of $S$ non-negative integer signals. Let $\mathbf{N} = \{(\xi_n, \eta_n) : n = 1, \ldots, N\}$ be a neighbourhood specifying offsets between the neighbours $((x, y), (x + \xi, y + \eta)) \in \mathbf{R}^2$; $(\xi, \eta) \in \mathbf{N}$. Given $\mathbf{N}$, the conditional probabilities of signals in each pixel $(x, y)$ depend only on the signals in the neighbours $\{(x \pm \xi, y \pm \eta) : (\xi, \eta) \in \mathbf{N}\} \cap \mathbf{R}$. Below, $s$ refers to grayscale images $g$ with $Q$ grey levels $\mathbf{Q} = \{0, 1, \ldots, Q - 1\}$ and/or region maps $m$ with $K$ region labels $\mathbf{K} = \{0, 1, \ldots, K - 1\}$ that indicate regions-of-interest in the grayscale images. Therefore, $\mathbf{S}$ may stand for $\mathbf{Q}$, $\mathbf{K}$, or $\mathbf{Q} \times \mathbf{K}$, respectively, depending on the type of images.

General MGRF with multiple pairwise pixel interactions. The translation invariant spatial geometry of pairwise pixel interactions is given by a subset $\mathbf{C}_{\mathbf{N}} = \{\mathbf{C}_{\xi,\eta} : (\xi, \eta) \in \mathbf{N}\}$ of clique families $\mathbf{C}_{\xi,\eta} = \{((x, y), (x + \xi, y + \eta)) : (x, y) \in \mathbf{R}\} \cap \mathbf{R}^2$ in the neighbourhood graph linking all the neighbours in $\mathbf{R}$. Each family contains all the pixel pairs in $\mathbf{R}$ with the same offset $(\xi, \eta)$. Quantitative strengths of interactions depend on signal co-occurrences in the cliques and are specified with a Gibbs potential function $V_{\xi,\eta} : \mathbf{S}^2 \to (-\infty, \infty)$ associated with the clique family. Let $\mathbf{F}_{\xi,\eta}(s) = [F_{\xi,\eta}(s, s' \mid s) : (s, s') \in \mathbf{S}^2;\ \sum_{(s,s') \in \mathbf{S}^2} F_{\xi,\eta}(s, s' \mid s) = 1]$ denote the empirical probability distribution of pairwise signal co-occurrences collected over the clique family $\mathbf{C}_{\xi,\eta}$ in an image $s$.

Definition 2.1. The total interaction energy $E(s)$ of the $N$ clique families for the neighbourhood $\mathbf{N}$ in the image $s$, normalised by reducing to a pixel, is defined as

$$E(s) = \sum_{(\xi,\eta) \in \mathbf{N}} \rho_{\xi,\eta} E_{\xi,\eta}(s), \qquad (1)$$
where the scaling factors $\rho_{\xi,\eta} = |\mathbf{C}_{\xi,\eta}| / |\mathbf{R}|$ account for the different cardinalities of the clique families and $E_{\xi,\eta}(s)$ denotes the partial interaction energy of the family $\mathbf{C}_{\xi,\eta}$, normalised by reducing to a clique:

$$E_{\xi,\eta}(s) = \frac{1}{|\mathbf{C}_{\xi,\eta}|} \sum_{((x,y),(x',y')) \in \mathbf{C}_{\xi,\eta}} V_{\xi,\eta}(s(x,y), s(x',y')) \equiv \sum_{(s,s') \in \mathbf{S}^2} V_{\xi,\eta}(s, s')\, F_{\xi,\eta}(s, s' \mid s). \qquad (2)$$
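As an illustration (not part of the original paper), the following Python sketch collects the sufficient statistics of Eq. (2) for one clique family and evaluates the partial energy; the function names and the NumPy-based lattice handling are our own assumptions:

```python
import numpy as np

def cooc(img, xi, eta, S):
    """Empirical co-occurrence distribution F_{xi,eta}(s, s' | img) of Eq. (2),
    collected over all cliques ((x, y), (x + xi, y + eta)) inside the lattice;
    img is a 2-D integer array with values in 0..S-1, indexed as img[y, x]."""
    H, W = img.shape
    y0, y1 = max(0, -eta), min(H, H - eta)        # rows where both clique pixels exist
    x0, x1 = max(0, -xi), min(W, W - xi)          # columns where both clique pixels exist
    s1 = img[y0:y1, x0:x1]                        # first pixel of each clique
    s2 = img[y0 + eta:y1 + eta, x0 + xi:x1 + xi]  # second pixel, offset by (xi, eta)
    F = np.zeros((S, S))
    np.add.at(F, (s1.ravel(), s2.ravel()), 1.0)
    return F / F.sum()                            # normalise so that sum F(s, s') = 1

def partial_energy(V, F):
    """Normalised partial energy of Eq. (2): sum over (s, s') of V(s, s') F(s, s')."""
    return float((V * F).sum())
```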
Given the neighbourhood $\mathbf{N}$ and the potential $\mathbf{V} = \{V_{\xi,\eta} : (\xi, \eta) \in \mathbf{N}\}$, the MGRF image model with multiple translation invariant pairwise pixel interactions is specified with the Gibbs probability distribution (GPD)

$$\Pr{}_{\mathbf{N},\mathbf{V}}(s) = \frac{1}{Z_{\mathbf{N},\mathbf{V}}} \exp(|\mathbf{R}|\, E(s)), \qquad (3)$$

where $Z_{\mathbf{N},\mathbf{V}} = \sum_{s \in \mathcal{S}} \exp(|\mathbf{R}|\, E(s))$ is the normalising partition function and $\mathcal{S}$ is the parent population of all the images $s$ on $\mathbf{R}$. The empirical distributions of pairwise signal co-occurrences $\mathbf{F}(s) = \{\mathbf{F}_{\xi,\eta}(s) : (\xi, \eta) \in \mathbf{N}\}$ are sufficient statistics of the model. The GPD belongs to exponential families of distributions [1], hence its log-likelihood function $L_{\mathbf{N}}(\mathbf{V} \mid s) = \log \Pr_{\mathbf{N},\mathbf{V}}(s)$ is unimodal in the space of the potentials under very loose conditions holding for the vast majority of the training statistics $\mathbf{F}(s)$.

Let $F_{\mathrm{irf}} = \frac{1}{S^2}$, $\mathrm{Var}_{\mathrm{irf}} = F_{\mathrm{irf}}(1 - F_{\mathrm{irf}})$, and $\mathrm{Var}_{\xi,\eta}(s) = \sum_{(s,s') \in \mathbf{S}^2} (F_{\xi,\eta}(s, s' \mid s) - F_{\mathrm{irf}})\, F_{\xi,\eta}(s, s' \mid s)$ denote the signal co-occurrence probability for an independent random field (IRF) of equiprobable signals $s \in \mathbf{S}$, its variance, and the variance of the empirical signal co-occurrence distribution $\mathbf{F}_{\xi,\eta}(s)$, respectively. The IRF is the MGRF of Eq. (3) with zero potential values $V_{\xi,\eta}(s, s') = 0$; $(s, s') \in \mathbf{S}^2$; $(\xi, \eta) \in \mathbf{N}$. Using the derivation scheme proposed in [12, 13], based on a truncated Taylor expansion of the log-likelihood $L_{\mathbf{N}}(\mathbf{V} \mid s)$ in the close vicinity of the zero potential $\mathbf{V} = \mathbf{0}$, it is easy to prove the following theorem.

Theorem 2.1. The maximum likelihood estimate (MLE) of the Gibbs potential for the clique family $\mathbf{C}_{\xi,\eta}$; $(\xi, \eta) \in \mathbf{N}$, in the MGRF model of Eq. (3), given a training image $s$, has the following first approximation:

$$V_{\xi,\eta}(s, s') = \lambda \rho_{\xi,\eta} (F_{\xi,\eta}(s, s' \mid s) - F_{\mathrm{irf}});\quad (s, s') \in \mathbf{S}^2. \qquad (4)$$
The scaling factor $\lambda$ is the same for all the clique families related to the neighbourhood $\mathbf{N}$:

$$\lambda = \frac{1}{\mathrm{Var}_{\mathrm{irf}}} \cdot \frac{\sum_{(\xi,\eta) \in \mathbf{N}} \rho_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s)}{\sum_{(\xi,\eta) \in \mathbf{N}} \rho^2_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s)} \equiv \frac{S^4}{S^2 - 1} \cdot \frac{\sum_{(\xi,\eta) \in \mathbf{N}} \rho_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s)}{\sum_{(\xi,\eta) \in \mathbf{N}} \rho^2_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s)}. \qquad (5)$$

It follows that the partial energy in Eq. (2) for the training image is proportional, in the first approximation, to the variance of the empirical signal co-occurrence distribution $\mathbf{F}_{\xi,\eta}(s)$:

$$E_{\xi,\eta}(s) = \lambda \rho_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s). \qquad (6)$$

As $|\mathbf{R}| \to \infty$, the scaling factors $\rho_{\xi,\eta} \to 1$, $\lambda \to \mathrm{Var}_{\mathrm{irf}}^{-1} \approx S^2$ for $S \gg 1$, and $V_{\xi,\eta}(s, s') \to S^2 F_{\xi,\eta}(s, s' \mid s) - 1$; $(s, s') \in \mathbf{S}^2$.

Definition 2.2. A model-based interaction map (MBIM) for a training image $s$ is a collection of the relative partial energies $e_{\xi,\eta}(s) = \rho_{\xi,\eta} \mathrm{Var}_{\xi,\eta}(s)$ for a large set $\mathbf{N}_s = \{(\xi, \eta) : [1 \le \xi \le \Delta \wedge \eta = 0] \vee [|\xi| \le \Delta \wedge 1 \le \eta \le \Delta]\}$ of pixel neighbours with absolute offsets no larger than $\Delta$ in both directions.

The relative energy estimates the contribution of a clique family to the total interaction energy: the greater the relative energy, the more characteristic the family. Hence the characteristic structure of pairwise pixel interactions is recovered by selecting the top-rank clique families sorted by their energies [12, 13]. The selection accounts for the empirical distribution of the energies $\{e_{\xi,\eta}(s) : (\xi, \eta) \in \mathbf{N}_s\}$, e.g. by focussing on their statistically significant deviations from the mean energy; a minimal computational sketch follows below.
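A sketch of the identification in Eqs. (4)-(6) and of the MBIM of Definition 2.2, reusing `cooc()` from the previous sketch; again the function names are ours, and no attempt is made to reproduce the exact top-rank selection rule of [12, 13]:

```python
import numpy as np

def mbim(img, S, Delta=32):
    """Relative partial energies e_{xi,eta} = rho * Var_{xi,eta} of Definition 2.2
    over the offset set N_s with maximum absolute offset Delta."""
    H, W = img.shape
    F_irf = 1.0 / S ** 2
    offsets = [(xi, 0) for xi in range(1, Delta + 1)] + \
              [(xi, eta) for eta in range(1, Delta + 1)
                         for xi in range(-Delta, Delta + 1)]
    energies = {}
    for xi, eta in offsets:
        F = cooc(img, xi, eta, S)
        rho = (H - abs(eta)) * (W - abs(xi)) / (H * W)  # |C_{xi,eta}| / |R|
        var = float(((F - F_irf) * F).sum())            # Var_{xi,eta}(img)
        energies[(xi, eta)] = rho * var
    return energies

def potential_mle(img, neighbourhood, S):
    """First-approximation MLEs of Eqs. (4)-(5) for the chosen clique families."""
    H, W = img.shape
    F_irf = 1.0 / S ** 2
    var_irf = F_irf * (1.0 - F_irf)
    stats = []
    for xi, eta in neighbourhood:
        F = cooc(img, xi, eta, S)
        rho = (H - abs(eta)) * (W - abs(xi)) / (H * W)
        var = float(((F - F_irf) * F).sum())
        stats.append((xi, eta, rho, var, F))
    # common scaling factor lambda of Eq. (5)
    lam = (sum(rho * var for _, _, rho, var, _ in stats)
           / (var_irf * sum(rho ** 2 * var for _, _, rho, var, _ in stats)))
    return {(xi, eta): lam * rho * (F - F_irf)          # potentials of Eq. (4)
            for xi, eta, rho, var, F in stats}
```

Recovering the characteristic neighbourhood then amounts to sorting the MBIM energies and keeping the top-rank offsets before calling `potential_mle()`.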
Simple MGRF model of region maps. In this case all the nearest-neighbour interactions are represented with a single clique family $\mathbf{C} = \{((x, y), (x + \xi, y + \eta)) : (x, y) \in \mathbf{R};\ (\xi, \eta) \in \mathbf{N}_{\mathrm{nn}}\} \cap \mathbf{R}^2$, where $\mathbf{N}_{\mathrm{nn}} = \{(0,1), (1,0)\}$ or $\{(0,1), (1,-1), (1,0), (1,1)\}$ for the symmetric 4- or 8-neighbourhood of each pixel, respectively. The potential $\mathbf{V}_{\mathrm{nn}} = [V_{\mathrm{nn}}(k, k') : (k, k') \in \mathbf{K}^2]$ takes account only of whether the labels are equal or not: $V_{\mathrm{nn}}(k, k') = u_{\mathrm{eq}}$ if $k = k'$ and $u_{\mathrm{ne}}$ otherwise. The total interaction energy in the region map $m$, normalised by reducing to a pixel, is

$$E_{\mathrm{nn}}(m) = \frac{1}{|\mathbf{R}|} \sum_{((x,y),(x',y')) \in \mathbf{C}} V_{\mathrm{nn}}(m(x,y), m(x',y')) \equiv \rho \left(u_{\mathrm{eq}} F_{\mathrm{eq}}(m) + u_{\mathrm{ne}} F_{\mathrm{ne}}(m)\right), \qquad (7)$$
where $F_{\mathrm{eq}}(m)$ and $F_{\mathrm{ne}}(m) = 1 - F_{\mathrm{eq}}(m)$ are the empirical probabilities of equal and non-equal nearest-neighbour pairs of region labels in the map $m$, and the scaling factor $\rho = \frac{|\mathbf{C}|}{|\mathbf{R}|}$. Let $F_{\mathrm{irf,eq}} = \frac{1}{K}$, $F_{\mathrm{irf,ne}} = 1 - \frac{1}{K}$, and $\mathrm{Var}_{\mathrm{irf,nn}} = \frac{1}{K}\left(1 - \frac{1}{K}\right)$ denote the probabilities of equal and non-equal pairs of labels and their variance for the IRF of equiprobable labels. The derivation scheme from [12, 13] results in Theorem 2.2.

Theorem 2.2. The first approximation of the MLE of the potential values $u_{\mathrm{eq}}$ and $u_{\mathrm{ne}}$ is as follows:

$$u_{\mathrm{eq}} = -u_{\mathrm{ne}} = 0.5 \frac{1}{\mathrm{Var}_{\mathrm{irf,nn}}} (F_{\mathrm{eq}}(m) - F_{\mathrm{irf,eq}}) \equiv 0.5 \frac{K^2}{K - 1} \left(F_{\mathrm{eq}}(m) - \frac{1}{K}\right). \qquad (8)$$

For $K \gg 1$, $u_{\mathrm{eq}} = -u_{\mathrm{ne}} \approx 0.5 (K F_{\mathrm{eq}}(m) - 1)$.
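A minimal sketch of the estimate in Eq. (8) (our own illustration; the helper name and the array conventions are assumptions):

```python
import numpy as np

def nn_potential_mle(m, K, eight_nbhd=False):
    """First-approximation MLE of Eq. (8) for the nearest-neighbour Gibbs
    potential of a region map m (2-D integer array of labels 0..K-1)."""
    offsets = [(0, 1), (1, 0)] + ([(1, -1), (1, 1)] if eight_nbhd else [])
    n_eq = n_total = 0
    H, W = m.shape
    for xi, eta in offsets:
        y0, y1 = max(0, -eta), min(H, H - eta)
        x0, x1 = max(0, -xi), min(W, W - xi)
        a = m[y0:y1, x0:x1]                        # first clique member
        b = m[y0 + eta:y1 + eta, x0 + xi:x1 + xi]  # second clique member
        n_eq += int((a == b).sum())
        n_total += a.size
    F_eq = n_eq / n_total                  # empirical probability of equal pairs
    u_eq = 0.5 * K ** 2 / (K - 1) * (F_eq - 1.0 / K)
    return u_eq, -u_eq                     # u_eq and u_ne = -u_eq
```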
Simple joint MGRF model of grayscale images and region maps. Let a joint probability distribution of images and their maps $\Pr(g, m) = \Pr(g) \Pr(m \mid g)$ be expanded into an unconditional distribution $\Pr(g)$ of grayscale images and a conditional distribution $\Pr(m \mid g)$ of region maps, given the image. We assume the unconditional model is an IRF of grey levels with the same multi-modal probability distribution $\mathbf{P} = [P(q) : q \in \mathbf{Q}]$ such that each region-of-interest relates to its own mode in $\mathbf{P}$. Let $\mathbf{F}(g) = [F(q \mid g) : q \in \mathbf{Q}]$ denote the empirical grey level probability distribution for an image $g$. Then

$$\Pr(g) = \prod_{(x,y) \in \mathbf{R}} P(g(x, y)) \equiv \prod_{q \in \mathbf{Q}} P(q)^{|\mathbf{R}|\, F(q \mid g)}. \qquad (9)$$
The conditional model of region maps, given the image, is restricted to the pixel-wise and the nearest-neighbour pairwise interactions. The pairwise interactions involve only region labels and are the same as in the MGRF model of Eq. (7). The pixel-wise interactions characterise conditional dependences between the grey levels and region labels. The corresponding potential $\mathbf{V}_{\mathrm{pix}} = [V_{\mathrm{pix}}(k, q) : (k, q) \in \mathbf{K} \times \mathbf{Q}]$ takes account of all the signal pairs $(k, q) \in \mathbf{K} \times \mathbf{Q}$ in the individual pixels. Let $\mathbf{F}_{\mathrm{pix}}(m, g) = [F_{\mathrm{pix}}(k, q \mid m, g) : (k, q) \in \mathbf{K} \times \mathbf{Q}]$ denote the joint empirical probability distribution of the pixel-wise region label and grey level co-occurrences. The total interaction energy for the conditional model, normalised by reducing to a pixel, is as follows:

$$E(m, g) = E_{\mathrm{pix}}(m, g) + E_{\mathrm{nn}}(m), \qquad (10)$$

where $E_{\mathrm{nn}}(m)$ is the energy of the pairwise pixel interactions in Eq. (7) and $E_{\mathrm{pix}}(m, g)$ is the energy of the pixel-wise interactions:

$$E_{\mathrm{pix}}(m, g) = \frac{1}{|\mathbf{R}|} \sum_{(x,y) \in \mathbf{R}} V_{\mathrm{pix}}(m(x, y), g(x, y)) \equiv \sum_{(k,q) \in \mathbf{K} \times \mathbf{Q}} V_{\mathrm{pix}}(k, q)\, F_{\mathrm{pix}}(k, q \mid m, g). \qquad (11)$$
Given the grayscale image $g$, the probability of each pair $(k, q)$ in a pixel for the conditional IRF of equiprobable region labels is $F_{\mathrm{irf}}(q \mid g) = \frac{1}{K} F(q \mid g)$. The derivation scheme from [12, 13] leads to Theorem 2.3.

Theorem 2.3. The MLEs of the pixel-wise and pairwise potentials of the conditional MGRF model have the following first approximation:

$$V_{\mathrm{pix}}(k, q) = \lambda \left(F_{\mathrm{pix}}(k, q \mid m, g) - F_{\mathrm{irf}}(q \mid g)\right);\qquad v_{\mathrm{eq}} = -v_{\mathrm{ne}} = 0.5 \lambda \rho \left(F_{\mathrm{eq}}(m) - \frac{1}{K}\right) \qquad (12)$$
with the common factor $\lambda$:

$$\lambda = \frac{D_{\mathrm{pix}}(m, g) + D_{\mathrm{nn}}(m)}{U_{\mathrm{pix}}(m, g) + \rho\, \mathrm{Var}_{\mathrm{irf,nn}}\, D_{\mathrm{nn}}(m)}, \qquad (13)$$

where

$$D_{\mathrm{pix}}(m, g) = \sum_{(k,q) \in \mathbf{K} \times \mathbf{Q}} \left(F_{\mathrm{pix}}(k, q \mid m, g) - F_{\mathrm{irf}}(q \mid g)\right)^2,$$

$$D_{\mathrm{nn}}(m) = \rho^2 \left(F_{\mathrm{eq}}(m) - \frac{1}{K}\right)^2,$$

$$U_{\mathrm{pix}}(m, g) = \sum_{q \in \mathbf{Q}} \sum_{k \in \mathbf{K}} \left(F_{\mathrm{pix}}(k, q \mid m, g) - F_{\mathrm{irf}}(q \mid g)\right)^2 F_{\mathrm{irf}}(q \mid g)\left(1 - F_{\mathrm{irf}}(q \mid g)\right).$$
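A sketch of Theorem 2.3, following our reconstruction of Eqs. (12)-(13) above (the function name and argument conventions are assumptions; $F_{\mathrm{eq}}$ can be computed as in the previous sketch, and $\rho = |\mathbf{C}|/|\mathbf{R}|$ is about 2 for the 4-neighbourhood):

```python
import numpy as np

def conditional_potentials(m, g, K, Q, F_eq, rho):
    """First-approximation MLEs of Eqs. (12)-(13) for the conditional MGRF of
    region maps, given the image; m and g are same-shape integer arrays."""
    R = m.size
    F_pix = np.zeros((K, Q))                       # joint F(k, q | m, g)
    np.add.at(F_pix, (m.ravel(), g.ravel()), 1.0 / R)
    F_irf = F_pix.sum(axis=0) / K                  # conditional IRF: F(q|g) / K
    diff = F_pix - F_irf[None, :]                  # F(k,q|m,g) - F_irf(q|g)
    D_pix = float((diff ** 2).sum())
    D_nn = rho ** 2 * (F_eq - 1.0 / K) ** 2
    U_pix = float(((diff ** 2) * (F_irf * (1.0 - F_irf))[None, :]).sum())
    var_irf_nn = (1.0 / K) * (1.0 - 1.0 / K)
    lam = (D_pix + D_nn) / (U_pix + rho * var_irf_nn * D_nn)  # Eq. (13)
    V_pix = lam * diff                             # pixel-wise potential, Eq. (12)
    v_eq = 0.5 * lam * rho * (F_eq - 1.0 / K)
    return V_pix, v_eq, -v_eq
```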
3. SEGMENTATION OF MULTI-MODAL IMAGES
In spite of their simplicity, the multi-modal probability models of Eq. (9) are commonly used in various application domains of modern data analysis and pattern recognition [5, 6, 25] involving large data sets, e.g. in astronomy, physics, remote sensing of the Earth's surface, or medical imaging. The final goal of the model identification in texture analysis is to segment a given multi-modal image, that is, to accurately separate the individual regions-of-interest. At the first stage, a multi-modal probability model $\mathbf{P}$ mixing the individual grey level distributions for each region-of-interest is estimated from the empirical grey level distribution $\mathbf{F}(g)$ and used for the pixel-wise signal classification. We follow the conventional assumption that the number $K$ of the modes (or dominant peaks) in $\mathbf{P}$ is already known. The estimated model has to accurately approximate not only the peaks themselves but also the behaviour of the signals away from each peak, because the classification accuracy depends on the intersecting tails of the distributions associated with each mode. Generally, no accurate classification can be achieved by identifying only such a signal model by itself, and the joint MGRF model of grayscale images and region maps is typically involved and identified. But even the first-stage data classification (or clustering) is of practical interest.

Let $\varphi_\theta(z)$ and $\Phi_\theta(z)$, $-\infty \le z \le \infty$, be the Gaussian density function and the cumulative Gaussian probability function, respectively, with the shorthand notation $\theta = (\mu, \sigma^2)$ for the mean $\mu$ and variance $\sigma^2$.

Definition 3.1. A linear combination of Gaussians (LCG) with $J_p$ positive and $J_n$ negative Gaussians is a continuous function $p(z) = \sum_{i=1}^{J_p} w_{p,i}\, \varphi(z \mid \theta_{p,i}) - \sum_{j=1}^{J_n} w_{n,j}\, \varphi(z \mid \theta_{n,j})$ such that $\int_{-\infty}^{\infty} p(z)\, dz = 1$ under the strictly positive weights $\mathbf{w} = [w_{p,i} : i = 1, \ldots, J_p;\ w_{n,j} : j = 1, \ldots, J_n]$ meeting the condition

$$\sum_{i=1}^{J_p} w_{p,i} - \sum_{j=1}^{J_n} w_{n,j} = 1. \qquad (14)$$
Definition 3.2. A discrete Gaussian (DG) with the parameters $\theta$ (mean and variance) is the discrete probability distribution $\Psi_\theta = [\psi_\theta(s) : s \in \mathbf{S};\ \sum_{s=0}^{S-1} \psi_\theta(s) = 1]$ of $S$ signals $s \in \mathbf{S}$ such that $\psi_\theta(0) = \Phi_\theta(0.5)$, $\psi_\theta(S - 1) = 1 - \Phi_\theta(S - 1.5)$, and $\psi_\theta(s) = \Phi_\theta(s + 0.5) - \Phi_\theta(s - 0.5)$ for $s = 1, \ldots, S - 2$.
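Definition 3.2 translates directly into code; the following sketch (our illustration) builds a DG by integrating the Gaussian density over unit bins and folding both tails into the border signals:

```python
import numpy as np
from math import erf, sqrt

def discrete_gaussian(mu, sigma2, S):
    """Discrete Gaussian of Definition 3.2 with parameters theta = (mu, sigma2)."""
    sigma = sqrt(sigma2)
    Phi = lambda z: 0.5 * (1.0 + erf((z - mu) / (sigma * sqrt(2.0))))
    psi = np.array([Phi(s + 0.5) - Phi(s - 0.5) for s in range(S)])
    psi[0] = Phi(0.5)                  # left tail folded into s = 0
    psi[S - 1] = 1.0 - Phi(S - 1.5)    # right tail folded into s = S - 1
    return psi                         # sums to 1 by construction
```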
Definition 3.3. A linear combination of DGs (LCDG) with $J_p$ positive and $J_n$ negative DGs is a discrete function $P(s) = \sum_{i=1}^{J_p} w_{p,i}\, \psi(s \mid \theta_{p,i}) - \sum_{j=1}^{J_n} w_{n,j}\, \psi(s \mid \theta_{n,j})$ of $S$ signals $s \in \mathbf{S}$ such that $\sum_{s \in \mathbf{S}} P(s) = 1$ under the strictly positive weights $\mathbf{w}$ of Eq. (14).
Under the fixed numbers $J_p$ and $J_n$ of the additive and subtractive components, the model parameters are the weights $\mathbf{w}$ and the individual characteristics $\Theta = \{\theta_{p,i} : i = 1, \ldots, J_p;\ \theta_{n,j} : j = 1, \ldots, J_n\}$. Below, the total number of components is denoted $J = J_p + J_n$.

Two novel Expectation-Maximization (EM) based algorithms introduced in [10, 14] accurately approximate the distribution $\mathbf{P}$ with an LCG. Below we extend these algorithms to LCDGs. Due to the positive and negative components, both the LCDGs and LCGs fit multi-modal distributions much closer than any conventional probability mixture of only positive Gaussians [16, 27, 30]. Historically, the very first EM algorithm for identifying probability mixtures, though called unsupervised or self-learning at that time, appeared in the late nineteen-sixties [28] (see also [29]). The technique received its present name and became very popular only a decade later, after it was reinvented and applied in [4] to the general problem of parameter estimation from an incomplete data set. At present, a number of EM algorithms exist for identifying various probability models [23].

Although the LCG and LCDG do not strictly belong to the probability domain due to possible negative values, their higher approximation accuracy within a limited signal range is found to be more important in practice. Probability distributions form a proper subset of all the LCGs or LCDGs due to the restriction $p(z) \ge 0$ or $P(s) \ge 0$, respectively, which automatically holds for mixtures with only positive components. Below we ignore this restriction because our goal is only to precisely separate the classes of signals associated with each dominant mode, and this is better achieved with the LCDGs or LCGs. Compared to the latter, the DGs improve the stability and convergence rate of the approximation.

Given the number $K$ of the dominant modes, the total numbers $J_p$ and $J_n$ of the components of each type are provided by the initialisation algorithm and do not change during the subsequent refinement of the model parameters. The initialisation also provides the starting parameter values $\mathbf{w}^{[0]}$ and $\Theta^{[0]}$. Under the assumed statistical independence of the grey levels, the model $\mathbf{P}$ in Eq. (9) is identified with the MLE of the LCDG that maximises the empirical log-likelihood $L(\mathbf{w}, \Theta \mid g) = \frac{1}{|\mathbf{R}|} \log \Pr(g)$:

$$L(\mathbf{w}, \Theta \mid g) = \sum_{q \in \mathbf{Q}} F(q \mid g) \log P(q) \equiv \sum_{q \in \mathbf{Q}} F(q \mid g) \log \Big( \sum_{i=1}^{J_p} w_{p,i}\, \psi(q \mid \theta_{p,i}) - \sum_{j=1}^{J_n} w_{n,j}\, \psi(q \mid \theta_{n,j}) \Big). \qquad (15)$$
Apart from using the DGs, the identification algorithms are the same as in [10, 14]. The parameter refinement algorithm modifies the conventional EM algorithm for normal mixtures in [28, 29] by taking account of components with alternating signs. Because the refinement is sensitive to the starting parameter values, the initialisation algorithm builds a close initial LCDG approximation of $\mathbf{P}$.

EM-based refinement of the LCDG. This EM process actually performs an iterative block relaxation search for a local maximum of the log-likelihood in Eq. (15), based on a specific expansion of the search space [14, 28, 29]. Let $P^{[t]}(q) = \sum_{i=1}^{J_p} w^{[t]}_{p,i}\, \psi(q \mid \theta^{[t]}_{p,i}) - \sum_{j=1}^{J_n} w^{[t]}_{n,j}\, \psi(q \mid \theta^{[t]}_{n,j})$ denote the current LCDG at iteration $t$. The relative contributions of each signal $q \in \mathbf{Q}$ to each positive and negative DG at iteration $t$ are given by the respective conditional weights serving as posteriors:
$$\pi^{[t]}_p(i \mid q) = \frac{w^{[t]}_{p,i}\, \psi(q \mid \theta^{[t]}_{p,i})}{P^{[t]}(q)} \quad \text{and} \quad \pi^{[t]}_n(j \mid q) = \frac{w^{[t]}_{n,j}\, \psi(q \mid \theta^{[t]}_{n,j})}{P^{[t]}(q)}, \quad i = 1, \ldots, J_p;\ j = 1, \ldots, J_n, \qquad (16)$$

such that the following $Q$ constraints hold:

$$\sum_{i=1}^{J_p} \pi^{[t]}_p(i \mid q) - \sum_{j=1}^{J_n} \pi^{[t]}_n(j \mid q) = 1, \quad q \in \mathbf{Q}. \qquad (17)$$
The EM process iterates the following two block relaxation steps until the changes of the log-likelihood (or of the model parameters) become small: (1) find the conditional mathematical expectations of the parameter estimates $\mathbf{w}^{[t+1]}$, $\Theta^{[t+1]}$ using the fixed conditional weights of Eq. (16) for the step $t$ as conditional probabilities, and (2) update these weights by maximising $L(\mathbf{w}, \Theta \mid g)$ under the fixed $\mathbf{w}^{[t+1]}$, $\Theta^{[t+1]}$. This process converges to a local maximum of the log-likelihood of Eq. (15). This is easily shown by rewriting the log-likelihood in the equivalent form using the constraints of Eq. (17) as unit factors:

$$L(\mathbf{w}^{[t]}, \Theta^{[t]} \mid g) = \sum_{q \in \mathbf{Q}} F(q \mid g) \Big[ \sum_{i=1}^{J_p} \pi^{[t]}_p(i \mid q) \log P^{[t]}(q) \Big] - \sum_{q \in \mathbf{Q}} F(q \mid g) \Big[ \sum_{j=1}^{J_n} \pi^{[t]}_n(j \mid q) \log P^{[t]}(q) \Big] \qquad (18)$$

and replacing $\log P^{[t]}(q)$ in the first and the second brackets with the equal terms $\log w^{[t]}_{p,i} + \log \psi(q \mid \theta^{[t]}_{p,i}) - \log \pi^{[t]}_p(i \mid q)$ and $\log w^{[t]}_{n,j} + \log \psi(q \mid \theta^{[t]}_{n,j}) - \log \pi^{[t]}_n(j \mid q)$, respectively, which follow from Eq. (16):

$$L(\mathbf{w}^{[t]}, \Theta^{[t]} \mid g) = \sum_{q \in \mathbf{Q}} F(q \mid g) \Big[ \sum_{i=1}^{J_p} \pi^{[t]}_p(i \mid q) \big( \log w^{[t]}_{p,i} + \log \psi(q \mid \theta^{[t]}_{p,i}) - \log \pi^{[t]}_p(i \mid q) \big) \Big] - \sum_{q \in \mathbf{Q}} F(q \mid g) \Big[ \sum_{j=1}^{J_n} \pi^{[t]}_n(j \mid q) \big( \log w^{[t]}_{n,j} + \log \psi(q \mid \theta^{[t]}_{n,j}) - \log \pi^{[t]}_n(j \mid q) \big) \Big]. \qquad (19)$$

Hence the expected estimates of the weights at the E-step,

$$w^{[t+1]}_{p,i} = \sum_{q \in \mathbf{Q}} F(q \mid g)\, \pi^{[t]}_p(i \mid q) \quad \text{and} \quad w^{[t+1]}_{n,j} = \sum_{q \in \mathbf{Q}} F(q \mid g)\, \pi^{[t]}_n(j \mid q),$$

are precisely those provided by the conditional Lagrange maximisation of the log-likelihood in Eq. (19) under the restriction of Eq. (14). The expected parameters of each DG are also the conventional unconditional MLEs that stem from the log-likelihood maximisation after each difference of the cumulative Gaussians is replaced with its close approximation by the Gaussian density (below "c" stands for "p" or "n", respectively):

$$\mu^{[t+1]}_{c,i} = \frac{1}{w^{[t+1]}_{c,i}} \sum_{q \in \mathbf{Q}} q\, F(q \mid g)\, \pi^{[t]}_c(i \mid q); \qquad \left(\sigma^{[t+1]}_{c,i}\right)^2 = \frac{1}{w^{[t+1]}_{c,i}} \sum_{q \in \mathbf{Q}} \left(q - \mu^{[t+1]}_{c,i}\right)^2 F(q \mid g)\, \pi^{[t]}_c(i \mid q).$$

Step 2, performing the conditional Lagrange maximisation of the log-likelihood of Eq. (19) under the $Q$ restrictions of Eq. (17), results exactly in the weights $\pi^{[t+1]}_p(i \mid q)$ and $\pi^{[t+1]}_n(j \mid q)$ of Eq. (16) for all $i = 1, \ldots, J_p$; $j = 1, \ldots, J_n$, and $q \in \mathbf{Q}$. The above modification of the EM algorithm remains valid as long as all the weights $\mathbf{w}$ are strictly positive, but the iterations have to be terminated when the log-likelihood of Eq. (19) begins to decrease due to accumulation of numerical errors.
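A compact sketch of the refinement loop under the scheme above (our illustration, reusing `discrete_gaussian()`; a production version would also track the log-likelihood of Eq. (19) and stop when it begins to decrease):

```python
import numpy as np

def refine_lcdg(F_emp, w_p, th_p, w_n, th_n, n_iter=100):
    """EM-style block-relaxation refinement of an LCDG; F_emp is the empirical
    distribution F(q|g) over S grey levels, th_p/th_n are lists of (mu, sigma2)
    pairs, and w_p/w_n are 1-D arrays of positive weights."""
    S = F_emp.size
    q = np.arange(S)

    def mle(pi, w):  # unconditional MLEs of (mu, sigma2) for one component
        mu = float((q * F_emp * pi).sum() / w)
        s2 = float(((q - mu) ** 2 * F_emp * pi).sum() / w)
        return mu, max(s2, 1e-6)           # guard against degenerate variance

    for _ in range(n_iter):
        psi_p = np.stack([discrete_gaussian(mu, s2, S) for mu, s2 in th_p])
        psi_n = np.stack([discrete_gaussian(mu, s2, S) for mu, s2 in th_n])
        P = w_p @ psi_p - w_n @ psi_n      # current LCDG P^[t](q)
        if np.any(P <= 0) or np.any(w_p <= 0) or np.any(w_n <= 0):
            break                          # left the valid domain; terminate
        pi_p = w_p[:, None] * psi_p / P    # posteriors of Eq. (16)
        pi_n = w_n[:, None] * psi_n / P
        w_p = (F_emp * pi_p).sum(axis=1)   # E-step weight updates
        w_n = (F_emp * pi_n).sum(axis=1)
        th_p = [mle(pi, w) for pi, w in zip(pi_p, w_p)]
        th_n = [mle(pi, w) for pi, w in zip(pi_n, w_n)]
    return w_p, th_p, w_n, th_n
```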
Sequential EM-based initialisation. By assumption, the number of the dominant modes, or signal classes, $K$ is known. To simplify the notation, we consider below the bi-modal case when the distribution $\mathbf{P}$ has two separate dominant modes representing an object and its background, respectively. The algorithm is easily extended to the general case of $K > 2$ dominant modes.

Let $\mathbf{P}_{(J)}$ denote the LCDG with $J$ components. First, the empirical distribution $\mathbf{F}(g)$ is approximated with a dominant mixture $\mathbf{P}_{(2)}$ of two DGs using the above EM-based refinement algorithm with only positive weights (in this case it closely resembles the conventional EM algorithm in [28, 29]). The mixture roughly approximates each dominant mode with a single DG. Deviations of $\mathbf{F}(g)$ from the dominant mixture have to be described by the other components of the initial LCDG. Such a model has two dominant positive weights, say, $w_{p,1}$ and $w_{p,2}$ with $w_{p,1} + w_{p,2} = 1$, and a number of "subordinate" weights of smaller absolute values such that $\sum_{i=3}^{J_p} w_{p,i} - \sum_{j=1}^{J_n} w_{n,j} = 0$.
Both the number of the subordinate DGs and all the weights and parameters of the LCDG components are accurately estimated as follows.

1. The deviations $d(q) = F(q \mid g) - P_{(2)}(q)$; $q \in \mathbf{Q}$, between $\mathbf{F}$ and $\mathbf{P}_{(2)}$ are separated into the additive and subtractive groups of their absolute values, $\mathbf{D}_p = [d_p(q) = \max\{d(q), 0\} : q \in \mathbf{Q}]$ and $\mathbf{D}_n = [d_n(q) = \max\{-d(q), 0\} : q \in \mathbf{Q}]$, respectively, such that $d(q) = d_p(q) - d_n(q)$ for each $q \in \mathbf{Q}$.

2. If the scaling factor $\gamma = \sum_{q=0}^{Q-1} d_p(q) \equiv \sum_{q=0}^{Q-1} d_n(q)$ is less than a certain accuracy threshold, the algorithm terminates and returns the dominant bi-modal model $\mathbf{P}_{(2)}$.

3. Otherwise, the scaled-up absolute deviations $\frac{1}{\gamma}\mathbf{D}_p$ and $\frac{1}{\gamma}\mathbf{D}_n$ are considered as two new "distributions", and the EM-based refinement algorithm with only positive weights is used iteratively to find the sizes, $J_p$ or $J_n$, and the parameters of the mixtures of DGs, $\mathbf{P}_p$ or $\mathbf{P}_n$, respectively, that precisely approximate the scaled-up absolute deviations. The size of each mixture is estimated by sequential minimisation of the total error between the scaled-up absolute deviation, $\mathbf{D}_p$ (or $\mathbf{D}_n$), and its mixture model, $\mathbf{P}_p$ (or $\mathbf{P}_n$), the number of components being sequentially incremented while the total error decreases.

4. The weights of the components of the subordinate models are scaled down, and the scaled-down models $\mathbf{P}_p$ and $\mathbf{P}_n$ are added to and subtracted from the dominant mixture $\mathbf{P}_{(2)}$, respectively, to form the accurate initial LCDG model $\mathbf{P} \equiv \mathbf{P}_{(J)}$ of the size $J = 2 + J_p + J_n$.

After the EM-based refinement, the final LCDG model $\mathbf{P}$ is subdivided further into the two models $\mathbf{P}[k] = [P(q \mid k) : q \in \mathbf{Q}]$, $k = 1, 2$, each describing an individual mode, or class of signals. For simplicity's sake, the subdivision is based on the mean values of the DGs. Let $\mu_1$ and $\mu_2$, $0 < \mu_1 < \mu_2 < Q - 1$, be the means of the two dominant DGs, and let $t \in [\mu_1, \mu_2]$ denote a threshold relating the subordinate DGs with means below it to the first class. The chosen threshold has to minimise the expected signal classification error $e_t = \sum_{q=0}^{t-1} P(q \mid 2) + \sum_{q=t}^{Q-1} P(q \mid 1)$; a brute-force search over $t$ is sketched below.
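A minimal sketch of the threshold selection (our illustration; `P1` and `P2` denote the per-class LCDG models as 1-D arrays over the grey levels):

```python
import numpy as np

def best_threshold(P1, P2, mu1, mu2):
    """Exhaustive search for the class-separating threshold t in [mu1, mu2]
    minimising e_t = sum_{q < t} P(q|2) + sum_{q >= t} P(q|1)."""
    ts = list(range(int(np.ceil(mu1)), int(np.floor(mu2)) + 1))
    errors = [P2[:t].sum() + P1[t:].sum() for t in ts]
    k = int(np.argmin(errors))
    return ts[k], errors[k]
```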
Typical experimental example. Figure 1 shows the approximation of the bi-modal empirical distribution of $Q = 256$ grey levels in a typical slice of the human body obtained by Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DMRI).

Fig. 1. LCDG approximation of the bi-modal grey level distribution: the original DMRI slice (A), its empirical grey level distribution (b) approximated with the dominant mixture (a) of two DGs (B), the final LCDG model indicated by dots (C) and its ten DGs (D), the pixel-wise segmentation map (E), and its errors (1.26%) with respect to the expert's "ground truth" (F).
The dominant modes represent the brighter kidney area and its darker background. The initial LCDG consists of the 2 dominant, 4 additive, and 4 subtractive DGs, that is, $J_p = 6$, $J_n = 4$, and $J = 10$. The initial and refined final LCDG models of each class are obtained with the thresholds $t = 78$ and $t = 85$, respectively, ensuring the best separation of the classes. The first 37 iterations of the EM refinement increase the log-likelihood of Eq. (19) from $-6.90$ to $-4.49$, and the convergence to the log-likelihood maximum is considerably more stable than in [14] with the LCGs. The region map produced by the pixel-wise segmentation has a relative error of 1.26% with respect to the ground truth.

These and other experiments with different multi-modal images show that our EM-based algorithms result in very accurate LCDG models of signal probability distributions. The resulting pixel-wise signal classification combined with post-processing typically has small errors with respect to the "ground truth" region maps produced by experts, e.g. less than 3% for several hundred 2- to 4-modal medical images obtained with Computed Tomography and Magnetic Resonance Imaging. The post-processing is based on the conditional MGRF model of region maps, given the image, with the analytical potential estimates of Eq. (12). The like model identification with the LCGs in [14] also yields small segmentation errors, but the LCDGs take better account of discrete signals, ensure more robust convergence to the local log-likelihood maximum, and are more tolerant to accumulation of numerical errors. Conventional normal mixtures of the same size and under the same post-processing result in up to ten times larger errors because some inter-class intervals are represented by single Gaussians. Because such a component combines the tails of two class distributions, accurate class separation becomes hardly possible.

4. MGRF-BASED TEXTURE SYNTHESIS-BY-ANALYSIS
The diversity and complexity of natural textures turn their analysis and realistic synthesis into extremely complicated computational problems, even for only spatially homogeneous textures. Nonetheless, visual realism, or similarity between synthetic and training texture types, is achieved in many cases by bringing close together their size-independent sufficient signal statistics for a particular MGRF model (see, e.g., [3, 12, 18]). After the model is identified from a training image, the well-known Markov Chain Monte Carlo (MCMC) processes of pixel-wise stochastic relaxation or simulated annealing allow for generating images distributed in accord with the MGRF [3, 18] or conforming eventually to the global maximum of the GPD [9], respectively; a minimal sketch of one relaxation sweep is given below.
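The following sketch (our illustration, not the paper's implementation) performs one sweep of pixel-wise stochastic relaxation for the model of Eq. (3); simulated annealing would additionally scale the conditional energies by a decreasing temperature:

```python
import numpy as np

def relaxation_sweep(img, V, S, rng):
    """One Gibbs-sampling sweep: every pixel is resampled from its conditional
    distribution given the current neighbours; V maps each offset (xi, eta) to
    an S x S potential array V_{xi,eta}(s, s'), and img is indexed as img[y, x]."""
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            energy = np.zeros(S)  # conditional energy of each candidate signal
            for (xi, eta), Vxe in V.items():
                if 0 <= x + xi < W and 0 <= y + eta < H:
                    energy += Vxe[:, img[y + eta, x + xi]]  # pixel is the 1st clique member
                if 0 <= x - xi < W and 0 <= y - eta < H:
                    energy += Vxe[img[y - eta, x - xi], :]  # pixel is the 2nd clique member
            p = np.exp(energy - energy.max())  # Gibbs conditional (sign as in Eq. (3))
            img[y, x] = rng.choice(S, p=p / p.sum())
    return img
```

For example, `relaxation_sweep(img, potential_mle(train, nbhd, S), S, np.random.default_rng(0))` iterated many times drifts the image statistics towards those of the training image.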
An alternative MCMC-based stochastic approximation process, called controllable simulated annealing in [12, 13], makes a characteristic set of the empirical grey level co-occurrence distributions $\mathbf{F}(s) = [\mathbf{F}_{\xi,\eta}(s) : (\xi, \eta) \in \mathbf{N}]$ in each synthetic image closely approach the like statistics $\mathbf{F}(s^{\circ})$ for the training image $s^{\circ}$. The distributions are the sufficient statistics of the MGRF in Eq. (3). Because only selected signal statistics are approached, the synthetic textures are "homogenised", translation invariant counterparts of the training images. Fortunately, in many cases the homogenised synthetic textures generated with this latter MCMC-based process realistically represent training images of certain types. Nonetheless, the MCMC-based synthesis is typically too computationally complex to produce the large-size images of main practical interest.

At present, much faster synthesis is achieved with non-parametric sampling involving the probability models only implicitly. The training image itself is considered a source of random signal samples corresponding to an unknown probability model. The sampled signals are replicated and placed into a large-size goal texture in either a pixel-wise or a patch-wise way. The pixel-by-pixel synthesis (see, e.g., [8]) extends a very small initial "seed" by randomly choosing each next pixel among the training or already generated goal pixels having neighbourhoods similar to the neighbourhood of the desired pixel in the current goal texture. However, such extrapolation is computationally complex, is typically unstable due to accumulation of errors, and offers no criteria for selecting a proper neighbourhood for a given texture. More stable and faster patch or block sampling [7, 21, 22, 24] replicates and permutes relatively large rectangular or arbitrarily shaped training patches to produce almost realistic large-size textures. The resulting images are very impressive, except that identical replicas of the same patches are readily noticeable, and special post-processing is typically required to suppress false borders between the permuted adjacent or overlapping patches [7, 22, 24]. Once again, there are no criteria for choosing the most characteristic patches of a particular texture.

Below we briefly overview an alternative synthesis-by-analysis approach, called bunch sampling in [15, 33] and based on the accurate identification of the explicit MGRF with multiple pairwise pixel interactions of Eq. (3). It intends to bridge the gap between the heuristic and the model-based non-parametric sampling. Unlike other block or patch sampling techniques, both the geometric shape of and the placement grid for spatial bunches of signals are derived from the MBIM for a given training image (see Definition 2.2). All the bunches in the training image are considered a set of texture elements (texels [17] or textons [19]) of fixed geometry (i.e. shape and size) that are characteristic under the identified MGRF model. The texels are randomly sampled, replicated, and placed into the goal image with due account of the spatial dependences specified by their relative positions with respect to the training and goal placement grids.

Geometry and placement grid for texels. Figure 2 shows several training textures of size 128 × 128 from the texture collections [2, 26] and their scaled-up interaction structures.
Each structure represents the symmetric characteristic pixel neighbourhood $\{(\xi, \eta), (-\xi, -\eta) : (\xi, \eta) \in \mathbf{N}\}$ with $N = 200$ neighbours having the top-rank relative partial energies in the MBIM for the supporting set $\mathbf{N}_s$ of size 65 × 65 (i.e., with the maximum relative offset $\Delta = 32$). The top-rank energies form specific spatial clusters in the MBIM. Stochastic textures such as D004 and D009 [2] have only one central cluster, indicating that only close-range pixel interactions dominate in those textures. The central clusters represent the most energetic close-range interactions relating mainly to a uniform background. On the contrary, the peripheral clusters of nearly periodic mosaics like D034 and D101 [2] reflect their spatial repetitiveness.

Fig. 2. Training textures 128 × 128: D004, D006, D009, D012, D020, D024, D029, D034 (a), D077, D101 [2], fabrics0000, fabrics0013, flowers0005, food0000, food0007, and grass0001 [26] (c), and their grey-coded interaction structures (b, d) with 200 characteristic clique families (the darker the point, the larger the relative energy).

The placement grid has to preserve the realistic visual appearance of a synthetic texture by sending each texel to the same relative position with respect to the others as in the training image. Assuming that non-overlapping texels are conditionally independent, a compact bounding parallelogram around the characteristic interaction structure derived from the MBIM serves as a cell of a simple placement grid of equidistant guiding lines that form the sides of each cell.
The parallelogram found with the modified algorithm in [32] specifies the orientation angles $\alpha_x$ and $\alpha_y$ of the guiding lines with respect to the image coordinate axes $(x, y)$ and the side sizes $d_x$ and $d_y$ of the cell. The sizes are the maximum texel spans along the guiding lines. After tessellation of the training and goal lattices with the placement grids, the relative position of each bunch is defined as its spatial offset with respect to the closest cell in the grid.

Due to the fundamentally different nature of their MBIMs, the two types of textures impose different constraints on the texel geometry and placement grids. Stochastic textures are formed by a random arrangement of a very large number of different texels. The texel geometry is derived from the sole central cluster in the MBIM indicating the average spatial span of the characteristic close-range pixel interactions. In this case the realistic synthesis depends more on the estimated size and shape of the texels than on their placement grid. Peripheral clusters in the MBIMs of nearly periodic mosaics, primarily the clusters closest to the centre, reflect the periodicity and spatial symmetry of the texture. The texel is typically disjoint and may be of much smaller size than for stochastic textures, down to a single pixel for almost or precisely repetitive mosaics like D101 [2]. In this case the placement grid is most essential for the realistic synthesis and must be estimated very accurately, because even small orientation and size errors of the cell may totally deteriorate the synthetic texture [15].

Typical results of bunch sampling. After the texel geometry and the placement grid are estimated from the MBIM for the identified MGRF model of Eq. (3), synthetic textures of arbitrary sizes are obtained by random sampling of the texels from the training image with their subsequent replication and randomised placement; a simplified sketch follows below. At each step, a texel randomly picked from the training image in accord with its estimated geometry is placed into the synthetic texture in the same relative position with respect to the placement grid as in the training image. The absolute position is randomly selected among the candidate positions satisfying this condition. Possible signal collisions, when a new texel is to be placed into an area partly or completely occupied by previously placed texels, are safely resolved in many cases by a simple heuristic rule [15] of preserving the already placed signals.
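A deliberately simplified sketch of the placement procedure (our illustration: rectangular cells instead of the bounding parallelogram of [32], and a fixed number of texel placements instead of running until full coverage):

```python
import numpy as np

def bunch_sample(train, texel, cell, out_shape, n_texels, rng):
    """Texels of fixed geometry (texel: a list of (dy, dx) offsets around an
    anchor pixel) are picked at random from the training image and copied into
    the goal texture so that each anchor keeps the same position relative to a
    rectangular placement cell of size cell = (cy, cx).  Signal collisions are
    resolved by keeping the already placed signals."""
    Ht, Wt = train.shape
    Hg, Wg = out_shape
    out = np.full(out_shape, -1, dtype=int)        # -1 marks still-empty pixels
    cy, cx = cell
    for _ in range(n_texels):
        ys, xs = int(rng.integers(Ht)), int(rng.integers(Wt))  # training anchor
        ry, rx = ys % cy, xs % cx                  # position relative to its cell
        yg = int(rng.integers(Hg // cy)) * cy + ry  # goal anchor with the same
        xg = int(rng.integers(Wg // cx)) * cx + rx  # relative position in the grid
        for dy, dx in texel:
            ya, xa, yb, xb = ys + dy, xs + dx, yg + dy, xg + dx
            if (0 <= ya < Ht and 0 <= xa < Wt and
                    0 <= yb < Hg and 0 <= xb < Wg and out[yb, xb] < 0):
                out[yb, xb] = train[ya, xa]        # preserve already placed signals
    return out
```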
Each goal texture is generated until its lattice is totally covered by signals transferred from the training image, so that the synthesis is very fast: its time complexity is $O(|\mathbf{R}| / |\mathbf{N}|)$. However, its major limitation stems from the assumed spatial homogeneity of the MGRF dictating the fixed texel geometry and equidistant placement grids.
The synthetic images of nearly periodic training mosaics can easily be "rectified", i.e. made strictly homogeneous, by reducing the texel size to a single pixel. All the pixels in the placement cell are considered as individual texels, and each set of signals for the equivalent training texels having the same relative offset with respect to the placement grid is replaced with the most probable signal of that set.

Figure 3 presents a few examples of synthetic textures obtained with the bunch sampling from the training images in Fig. 2. The synthetic and training textures are similar both visually and quantitatively, in terms of the total $\chi^2$-distances between the empirical grey level co-occurrence distributions for the characteristic clique families $\mathbf{C}_{\mathbf{N}}$.

Fig. 3. Synthetic textures 512 × 512: D004 (a), food0000 (b), grass0001 (c) with $N = 200$, and D101 (d) with $N = 10$.

These and many other experimental results show that the bunch sampling effectively synthesises different types of homogeneous stochastic and periodic textures due to deriving the texel geometry and placement grids from the accurately identified MGRF models.

5. CONCLUDING REMARKS
From the above discussion, it appears that accurate identification can make simple conventional MGRF image models quite competitive with more elaborate counterparts in solving selected problems of modern texture analysis and synthesis. Although the growing complexity of the problems generally necessitates the development of increasingly more intricate new image models and processing techniques, improvements of the more conventional and simpler ones continue to be of theoretical and practical interest. In this way, a number of earlier results, including the very first EM algorithm published in "Cybernetics and Systems Analysis" in the late 1960s, continue to persist up to the present.

REFERENCES

1. O. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory, Wiley, N.Y. (1978).
2. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York (1966).
3. G. R. Cross and A. K. Jain, "Markov random field texture models," IEEE Trans. on Pattern Analysis and Machine Intelligence, 5, 25–39 (1983).
4. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statistical Society, 39B, 1–38 (1977).
5. R. C. Dubes and A. K. Jain, "Random field models in image analysis," J. Applied Statistics, 16, 131–164 (1989).
6. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, N.Y. (2001).
7. A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in: Proc. 28th Int. Conf. on Computer Graphics and Interactive Techniques SIGGRAPH 2001 (Los Angeles, California, USA), ACM Press, N.Y. (2001), pp. 341–346.
8. A. A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling," in: Proc. IEEE Int. Conf. on Computer Vision (Corfu, Greece), 2, IEEE CS Press, Los Alamitos (1999), pp. 1033–1038.
9. S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. on Pattern Analysis and Machine Intelligence, 6, 721–741 (1984).
10. A. A. Farag, A. El-Baz, and G. L. Gimel'farb, "Iterative approximation of empirical grey level distributions for precise segmentation of multi-modal images," EURASIP J. on Applied Signal Processing, [in print] (2005).
11. G. L. Gimel'farb, "On one general approach to building statistically optimal image recognition algorithms (Ob odnom obshchem podhode k postroeniyu statisticheski optimal'nyh algoritmov raspoznavaniya izobrazheniy)," Kibernetika, No. 3, 84–90 (1967).
12. G. L. Gimel'farb, "Texture modeling with multiple pairwise pixel interactions," IEEE Trans. on Pattern Analysis and Machine Intelligence, 18, 1110–1114 (1996).
13. G. L. Gimel'farb, Image Texture and Gibbs Random Fields, Kluwer, Dordrecht (1999).
14. G. Gimel'farb, A. A. Farag, and A. El-Baz, "Expectation-Maximization for a linear combination of Gaussians," in: Proc. IAPR Int. Conf. Pattern Recognition (ICPR-2004) (Cambridge, UK), 3, IEEE CS Press, Los Alamitos (2004), pp. 422–425.
15. G. Gimel'farb and D. Zhou, "Fast synthesis of large-size textures using bunch sampling," in: Proc. Image and Vision Computing New Zealand (IVCNZ 2002) (Auckland, New Zealand) (2002), pp. 215–220.
16. A. Goshtasby and W. D. O'Neill, "Curve fitting by a sum of Gaussians," CVGIP: Graphical Models and Image Processing, 56, 281–288 (1999).
17. R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Vol. 2, Addison-Wesley, Reading (1993).
18. M. Hassner and J. Sklansky, "The use of Markov random fields as models of textures," Computer Graphics and Image Processing, 12, 357–370 (1980).
19. B. Julesz, "Textons, the elements of texture perception, and their interactions," Nature, No. 290, 91–97 (1981).
20. V. A. Kovalevsky, "Present state of the pattern recognition problem (Sovremennoe sostoyanie problemy raspoznavaniya obrazov)," Kibernetika, No. 5, 78–86 (1967).
21. L. Liang, C. Liu, and H. Y. Shum, "Real-time texture synthesis by patch-based sampling," Technical Report MSR-TR-2001-40, Microsoft Research (2001).
22. V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, "Graphcut textures: image and video synthesis using graph cuts," in: ACM Trans. on Graphics, 22 (Proc. 30th Conf. on Computer Graphics and Interactive Techniques SIGGRAPH 2003, San Diego, California, USA), ACM Press, N.Y. (2003), pp. 277–286.
23. G. J. McLachlan, The EM Algorithm and Extensions, Wiley, N.Y. (1997).
24. A. Neubeck, A. Zalesny, and L. van Gool, "Cut-primed smart copying," in: Proc. Texture 2003: The 3rd Int. Workshop on Texture Analysis and Synthesis in Conjunction with ICCV 2003 (Nice, France), Heriot-Watt Univ. (2003), pp. 71–76.
25. N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, 26, 1277–1294 (1993).
26. R. Picard, C. Graszyk, S. Mann, et al., "VisTex database," MIT Media Lab, Cambridge, Mass. (1995).
27. T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE, 78, 1481–1497 (1990).
28. M. I. Schlesinger, "Relation between supervised and unsupervised learning in pattern recognition (Vzaimosvyaz obucheniya i samoobucheniya v raspoznavanii obrazov)," Kibernetika, No. 2, 81–88 (1968).
29. M. I. Schlesinger and V. Hlavac, Ten Lectures on Statistical and Structural Pattern Recognition, Kluwer, Dordrecht (2002).
30. H. W. Sorenson and D. L. Alspach, "Recursive Bayesian estimation using Gaussian sums," Automatica, 7, 465–479 (1971).
31. A. Srivastava, X. Liu, and U. Grenander, "Universal analytical forms for modeling image probabilities," IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 1200–1214 (2002).
32. K. Voss and H. Suesse, "Invariant fitting of planar objects with primitives," IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 80–84 (1997).
33. D. Zhou and G. Gimel'farb, "Model-based estimation of texels and placement grids for fast realistic texture synthesis," in: Proc. Texture 2003: The 3rd Int. Workshop on Texture Analysis and Synthesis in Conjunction with ICCV 2003 (Nice, France), Heriot-Watt Univ. (2003), pp. 119–123.

Received 14.01.05