TEXTURE RECOGNITION THROUGH
MACHINE LEARNING AND CONCEPT
OPTIMIZATION
P. Pachowicz and J. Bala P91-12
MLI 91-5
TEXTURE RECOGNITION THROUGH MACHINE LEARNING AND CONCEPT OPTIMIZATION
P.W. Pachowicz and J.W. Bala
Computer Science Department and
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030
Key words: texture recognition, machine vision through machine learning, concept optimization
ABSTRACT

This paper justifies and demonstrates a machine learning approach to the problem of texture recognition. Learning-based texture recognition is separated into the following phases: (i) the acquisition of texture concepts, (ii) the optimization of concept prototypes, and (iii) the recognition of unknown texture samples. A methodology adapted to the acquisition and recognition of noisy texture data is introduced, based on the AQ learning-from-examples algorithm. Characteristics of learning-based recognition of texture concepts are presented for different parameters of attribute extraction, different numbers of training data, and different settings of the learning tool. Special emphasis is given to the optimization of noisy texture concepts. The optimization model and processes are designed to improve system recognition effectiveness according to given optimization criteria and evaluation measures. These criteria and measures are designed with regard to the texture recognition and segmentation tasks. Various concept optimization methods are presented and tested. The empirical evaluation of the developed learning-based approach to texture recognition is demonstrated on a domain composed of twelve texture classes. Additionally, the effectiveness of a genetic search applied to improve the worst performing concept descriptions is presented.
ACKNOWLEDGEMENT

The authors thank Eric Bloedorn for his help in the preparation of this report. This research was done in the Artificial Intelligence Center of George Mason University. Research activities of the Center are sponsored in part by the Defense Advanced Research Projects Agency under grant No. N00014-87-K-0874, administered by the Office of Naval Research, and in part by the Office of Naval Research under grants No. N00014-88-K-0226, No. N00014-88-K-0397, and No. N00014-91-J-1351.
TABLE OF CONTENTS
1. INTRODUCTION AND MOTIVATION
   1.1. Traditional texture recognition and segmentation
   1.2. New dimensions in texture recognition and segmentation
   1.3. New approaches to adaptive object recognition
   1.4. Tools for concept acquisition and representation
   1.5. Objectives and goals
2. PROCESSING TEXTURE DATA
   2.1. Image data
   2.2. Texture feature extraction
   2.3. Selecting and coding texture events
3. ACQUIRING RULE DESCRIPTION OF TEXTURE CLASSES
   3.1. Choosing a learning tool
   3.2. Learning concept prototypes
   3.3. Concept representation
4. RECOGNIZING CLASS MEMBERSHIP THROUGH MATCHING RULES
   4.1. Strict matching
   4.2. Flexible matching
5. OPTIMIZING CONCEPT PROTOTYPES
   5.1. Optimization model
   5.2. Optimization criteria and quality measures
   5.3. Concept optimization methods
      5.3.1. Truncation of less significant concept components
      5.3.2. Generalization of concept components over negative examples
      5.3.3. Filtering final training data by pre-optimized concept descriptions
6. INTRODUCTORY EXPERIMENTS WITH LEARNING TEXTURE CLASS DESCRIPTIONS
   6.1. Attribute space complexity
   6.2. Basic characteristics of the learning approach
   6.3. Effectiveness of specific versus general concept descriptions
7. EMPIRICAL EVALUATION OF CONCEPT OPTIMIZATION METHODS
   7.1. Effectiveness of simple truncation of less significant concept components
   7.2. Effectiveness of the SG-TRUNC optimization method
   7.3. Effectiveness of the filtration of final training data by pre-optimized concept descriptions
(6)

∀i: Di ⇒ ¬(x ∈ Xj), where i ≠ j    (7)
These assumptions are not fulfilled when the system learns from texture data. First, training texture data is noisy. Moreover, two or more different classes of texture can possess small areas of common characteristics even though these classes differ over a larger area. The second assumption requires that the training data represents a complete model of the texture classes. Such a model is very difficult to acquire when one considers the 3-D nature of texture and variable texture occurrences (under different conditions of scene projection and illumination). A texture model can be complete for a given training area only when one includes all pixels of this area in the training set (i.e., for an area of 128x128 pixels one has to prepare 16k training events). Most learning programs, however, will not learn a class description from such a large set of training data, or the learning time will not be acceptable. On the other hand, the characteristics of texture data vary with external perceptual conditions.
In the texture domain, we accept that the training data is neither consistent nor complete. To handle inconsistency, the AQ programs have a parameter for ambiguous data. The parameter controls whether an ambiguous example is considered as a positive example, as a negative example, or is ignored. So, inconsistent data can be included in or excluded from a training set. In our experiments, we decided to include the inconsistent training examples as positive examples when learning a class description; i.e., the program deals with the case where several classes can have common local characteristics. Such learning acquires a characteristic description of texture rather than a discriminant one.
3.3. Concept Representation
The learning process, repeated for each class of training events, generates the set D = {D1, D2, ..., D#Classes} of classification hypotheses (concept descriptions). Classification hypotheses tautologically or weakly imply the observational statements, and satisfy the background knowledge. Classification hypotheses are represented by the family of AQ programs as Disjunctive Normal Forms (DNF). A single description d ∈ Di, also called a cover, is composed of a disjunction of ordered pairs <rule_ij : t(i,j)>, where the first component is a rule (i.e., the j-th concept component of the i-th class) and the second component characterizes rule typicality; i.e.,

Di ∋ d = <rule_i1 : t(i,1)> ∨ ... ∨ <rule_ij : t(i,j)> ∨ ...    (8)
A decision rule is an if ... then ... statement describing the distribution of training data through the attribute space. The conditional part of a rule, defined as a complex, is a conjunction of conditions
<rule_ij : c(i,j)> := <cond_1> <cond_2> ... <cond_k> ... ⇒ <i : c(i,j)>    (9)
A single condition, called a selector, describes a range of attribute values

<cond_k> = [x_k op val1[..val2]]  for op ∈ {=, ≠, <, >}    (10)
When the conditional part of rule_ij is satisfied (i.e., when it matches a test sample), the rule yields the i-th class with confidence c(i,j). The confidence level states the degree of match between the test sample and a given rule as a concept component. The value of c(i,j) depends on the applied matching technique and distance measure (see Section 4).
Rules are concept components of a given class, and they are ordered by decreasing value of t(i,j). In this way, the most typical concept components are followed by less typical components; i.e.,

t(i,j) ≥ t(i,k)  where j ≤ k, for the i-th class    (11)
The typicality measure for the family of AQ programs corresponds to the number of training examples covered by a concept component. The first concept component, the most typical one, covers the most training examples. The last concept component covers the least training data. An example description is presented below:

Texture_class_d4:
  < [x1 < 43] [x5 = 5..10] : (t=45) > ∨
  < [x1 > 48] [x3 < 5] [x6 < 6] : (t=15) > ∨
  < [x3 = 29..33] [x4 > 47] [x5 = 48..55] : (t=5) > ∨
  < ... : (t=2) >.
The AQ programs can produce either intersecting or disjoint covers. The induction process in the intersecting mode produces a class cover that can logically intersect with covers of other classes. This intersection is limited, however, to the "do not care" areas of the attribute space. On the other hand, the induction process in the disjoint mode produces a cover that cannot intersect at all with covers of other classes. So, rules produced in the intersecting mode are more general than rules produced in the disjoint mode. The generation of a single complex is performed under control criteria and parameters. Control parameters specify the generality of a concept component. A concept component can be (i) as general as possible (e.g., by the reduction of selectors or the extension of the attribute range val1..val2), (ii) as simple as possible (e.g., by removing redundant values from extended selectors in a cover), and (iii) as specific as possible (e.g., by preserving the maximum number of selectors, each with a minimum range of values). The DNF representation of concept descriptions is well suited for very complex distributions of training examples. The attribute space is represented by concept components that cover irregular distributions of training examples. Components of the same concept description may overlap in the attribute space.
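To make the representation concrete, the following sketch shows one way such covers might be held in memory. It is a hypothetical Python rendering of our own (the class names, zero-based attribute indexing, and the coverage predicate are illustrative assumptions, not the AQ implementation):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Selector:
    """One condition [x_k op val1..val2], stored as an inclusive value range
    (a point value is a range with lo == hi); integer attribute values are
    assumed, as in the matching example of Section 4."""
    attr: int   # index k of attribute x_k (zero-based here)
    lo: int     # lower bound of the admissible range
    hi: int     # upper bound of the admissible range

    def covers(self, event: Tuple[int, ...]) -> bool:
        return self.lo <= event[self.attr] <= self.hi

@dataclass
class Rule:
    """A complex (conjunction of selectors) with its typicality t(i,j)."""
    selectors: List[Selector]
    typicality: int  # number of training examples covered

    def covers(self, event: Tuple[int, ...]) -> bool:
        return all(s.covers(event) for s in self.selectors)

# A cover: the DNF description of one class, with rules ordered by decreasing
# typicality --- here, the Texture_class_d4 example from above (assuming 55
# attribute levels, so e.g. [x1 < 43] becomes the range 0..42).
texture_d4: List[Rule] = [
    Rule([Selector(0, 0, 42), Selector(4, 5, 10)], typicality=45),
    Rule([Selector(0, 49, 54), Selector(2, 0, 4), Selector(5, 0, 5)], typicality=15),
    Rule([Selector(2, 29, 33), Selector(3, 48, 54), Selector(4, 48, 55)], typicality=5),
]
```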
4. RECOGNIZING CLASS MEMBERSHIP THROUGH MATCHING RULES

Acquired concept prototypes are applied to classify unknown data. This classification is performed by matching test data with concept descriptions, along with the computation of a confidence level for this match; i.e.,
R: D × X → <I, C>    (12)

where i ∈ I is a class membership, and c ∈ C is a confidence level of decision i. The decision-making process r(d,x) is performed through the calculation of a θ distance function. The distance value is calculated for each rule_ik concept component, and then the minimum value is searched for; i.e.,

r(d,x) = <i, c>: ∃n (θ(rule_in, x) = min over all j,k of θ(rule_jk, x))    (13)

4.1. Strict matching

Strict matching assumes that a concept prototype either does or does not cover a given test instance. If the typicality of a matched rule is not considered, then the confidence level of such a match can take only two values; i.e., c=1 (the instance matched) or c=0 (the instance did not match). There is no consideration of the degree of closeness between an instance and the conditional part of a rule when the instance is not covered by a concept. Therefore, strict matching is not suitable for the classification of noisy data, which is usually displaced from the clusters of training data.
4.2. Flexible matching
If one assumes that the training data does not completely represent the variability of texture characteristics, then one has to accept that test data can be translated slightly from the main clusters of training data in the attribute space. Additionally, if the recognition phase is preceded by an optimization of concept prototypes, one has to assume that any reduction of concept descriptions can remove both noisy and correct concept components. Such reduced concept descriptions will not strictly match even training data that uniquely satisfied the truncated components. So, to evaluate the membership of any instance in a texture class, one has to apply a flexible match.
Recognizing class membership of noisy data, flexible matching measures the degree of closeness between an instance and the conditional part of a rule. Such closeness is computed as the θ distance between an instance and the rule_jk concept component within the attribute space. Thus, the confidence level can take any value in the range from 0 (i.e., does not match) to 1 (i.e., matches). Calculation of the closeness degree and the confidence level for a single test instance (implemented within the recognition module of the system) is executed according to the following schema. For a given condition of a rule [x_n = val_j] and an instance where x_n = val_k, the normalized confidence value is

1 - (|val_j - val_k| / #levels)    (15)
where #levels is the total number of attribute values. The confidence level of a rule is computed by multiplying the evaluation values of each condition of the rule. The total evaluation of class membership of a given test instance is equal to the confidence level of the best matching rule, i.e., the rule with the highest confidence value. For example, the confidence value c for matching the following conditional part of a rule

[x1 = 0] [x2 = 1..2] [x7 = 10] [x8 = 10..20]    (16)

with a given test instance x = <4, 5, 24, 34, 0, 12, 6, 25> and the number of attribute values #levels = 55 is computed as follows:

c_x1 = 1 - (|0 - 4| / 55) = .928
c_x2 = 1 - (|2 - 5| / 55) = .946
c_x7 = 1 - (|10 - 6| / 55) = .928
c_x8 = 1 - (|20 - 25| / 55) = .91
c_x3 = c_x4 = c_x5 = c_x6 = 1

c = c_x1 · c_x2 · c_x3 · c_x4 · c_x5 · c_x6 · c_x7 · c_x8 = .74    (17)
The recognition process yields class membership for the rule that has the highest confidence level among the matched rules. The calculated confidence level, however, is not a probability measure, and it can yield more than one class membership. This means that for a given preclassified test dataset, system recognition effectiveness is calculated as the ratio of the number of correctly classified instances to the total number of instances in the dataset.
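The matching schema above is easy to express in code. The following is a minimal Python sketch of our own (the function names and the tuple encoding of selectors are illustrative, not the system's actual recognition module), assuming integer attribute values; it reproduces the worked example of equations (16)-(17):

```python
from typing import Dict, List, Tuple

Selector = Tuple[int, int, int]   # (attribute index, lo, hi) -- inclusive range
Rule = List[Selector]             # conditional part of a rule (a complex)

def flexible_match(rule: Rule, x: Tuple[int, ...], levels: int) -> float:
    """Confidence of flexibly matching instance x against one rule.

    Each condition contributes 1 - (distance of the value to the range /
    #levels), per equation (15); the rule confidence is the product of the
    condition evaluations."""
    c = 1.0
    for attr, lo, hi in rule:
        dist = max(lo - x[attr], x[attr] - hi, 0)   # 0 when inside lo..hi
        c *= 1.0 - dist / levels
    return c

def classify(covers: Dict[str, List[Rule]], x: Tuple[int, ...], levels: int):
    """Class and confidence of the best matching rule over all class covers."""
    return max(((cls, flexible_match(r, x, levels))
                for cls, rules in covers.items() for r in rules),
               key=lambda pair: pair[1])

# The worked example of equations (16)-(17); attribute x1 is index 0, etc.
rule = [(0, 0, 0), (1, 1, 2), (6, 10, 10), (7, 10, 20)]
x = (4, 5, 24, 34, 0, 12, 6, 25)
print(round(flexible_match(rule, x, 55), 2))   # -> 0.74
```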
5. OPTIMIZING CONCEPT PROTOTYPES

Machine learning theory and algorithms were developed and tested long ago for simple problems. Since then, machine learning has been moving slowly towards more difficult applications. One of the most important application problems of machine learning is the influence of noise on acquired knowledge. Since image data is noisy and no perfect methods for attribute extraction exist, training data contains noise. Consequently, concept descriptions learned from such a dataset contain noisy or imperfect components. Noise in training data can be numerical, structural, or represented by incorrect classification provided by a teacher. Since complete elimination of noise is not possible, we develop concept optimization methods in order to improve concept performance. Currently, there do not exist systems that learn concept prototypes that are optimal according to a given performance measure. Generally, processes of concept acquisition, for example those typical for AQ programs, are performed in two phases; i.e., concept acquisition and concept manipulation. The first phase, concept acquisition, derives concept prototypes according to specified preference criteria. These criteria guide the inductive search for the best local element of a concept. A single concept component is derived separately and then integrated with the concept structure. Thus, a single component is locally optimal according to the given preference criteria. Preference criteria, however, do not deal with concept performance on testing/tuning data. In the second phase, concept manipulation, the system analyzes sets of concept components and modifies concept descriptions. Criteria for the modification of concept descriptions are different from the criteria of component acquisition. These criteria consider concept performance and the overall rather than the local structure.

The problem of acquiring well-performing concept prototypes has been studied intensely. One of the first approaches incorporated the selection of the most representative training data and learning in the incremental mode (Michalski and Larson, 1978). The problem of manipulation of concept descriptions was then discussed from different perspectives (see for example: Fisher and Schlimmer, 1988, Iba, et al., 1988, Markovitch and Scott, 1988, Tambe and Newell, 1988, Holte, et al., 1989, Tcheng, et al., 1989, Weiss, et al., 1990). Concept manipulation techniques were implemented within several learning programs; for example, pruning decision trees was implemented for the ID family of programs (Quinlan, 1987), and the SG-TRUNC optimization method was implemented for the AQ family of programs (Zhang and Michalski, 1989, Zhang, 1990). An exception to the traditional schema of the acquisition and manipulation of concept descriptions, however, is a new algorithm for learning optimal decision trees (i.e., the ID family of programs) using the Minimum Description Length Principle (Quinlan and Rivest, 1989). This algorithm integrates
local and global heuristics in order to derive concept descriptions. Such an approach, however, has not been applied to the AQ family of learning programs.
5.1. Optimization model
Optimization processes are applied in order to modify concept descriptions in such a way that the optimized descriptions reach a higher quality measure when compared with the primary descriptions; i.e.,

O: D × Z → D    (18)

and

Q[r(d*, x)] = max over Z of Q[r(d_z, x)]    (19)

where: d* ∈ D is the optimal concept description according to a given Q quality measure, d_z = o(d,z), Z is the space of optimization parameters, r(d,x) is a recognition process, and x ∈ X.
To follow the consequences of applied concept optimization, let us study the characteristics of recognition processes under the following assumptions: (i) class descriptions are learned by the AQ family of programs (incorporating the general coverage method, Michalski, 1983), (ii) training data is noisy, and (iii) hyper-spheres of training examples of different classes overlap through the attribute space. Let us consider the simplest case of class distributions; i.e., there are only two classes, where the first class (Ω1) has a random distribution of training data through the whole attribute space, and the second class (Ω2) has a normal distribution of training data N(μ,σ). The Ω1 class represents background noisy training data belonging to other hypothetical classes within the attribute space that cause the partitioning of the description of the Ω2 class. Learned descriptions of both classes will contain more than one concept component (rule) --- the learning process derives a cover (conditional part of a rule) over positive examples only. Another reason for partitioning a single concept description can be the irregularity of the attribute space. This reason, however, is not considered in this paper.

Statement 1: For two classes in the attribute space, where one of them (Ω1) has a random distribution and the second one (Ω2) has a normal distribution, the relationship between acquired concept components (rules) of the second class (Ω2) is as follows:
t(Ω2,n) > t(Ω2,m) ⇒ P(θ(rule_Ω2,n, μ) < θ(rule_Ω2,m, μ)) > P(θ(rule_Ω2,n, μ) > θ(rule_Ω2,m, μ))    (20)
This means that more typical concept components (i.e., those covering more training examples) are generally closer to the center of a cluster of training data than less typical concept components. In this way, concept typicality depends on the distance from the center of the local cluster of training data to a concept component. If the areas of the concept components of class Ω2 are similar, then the distribution of the typicality measure t(Ω2,j) is a normal-like distribution with regard to the center of the data cluster. The assumption of similar areas of the concept components of class Ω2 is generally fulfilled by the distribution of examples belonging to class Ω1 that partition the description of class Ω2.
Statement 2: For more than two overlapping classes in the attribute space, where one of them (Ω1) has a random distribution and the other classes (Ω2, Ω3, ...) have normal distributions, the distribution of the t(Ωk,j) typicality measure for k=2,3,...,n is not a normal-like distribution at the border area between classes.
The irregular distribution of the typicality measure is caused by the overlapping effect of two (or more) classes (see Figure 4). Traditional pattern recognition methods (e.g., minimization of Bayes risk) approximate the distribution of training data by a given distribution model. Such a model of a single class is not affected by the distribution of other neighboring classes. The inductive learning presented in Section 3, however, acquires the description of a single class with respect to other classes. It causes higher partitioning of concept descriptions through the border area between two classes if they are close together; i.e., the valley between the typicality measures of the Ω2 and Ω3 classes in Figure 4 is deeper than for the normal distributions.
[Figure 4 plots the typicality of concept components against attribute value, with cluster centers μ2 and μ3.]

Fig. 4 Irregular distribution of typicality of concept components caused by the overlapping effect of training data (dotted line - normal distribution)

The effect of higher partitioning of concept descriptions is certainly negative. It means that concept partitioning increases with an increase in the standard deviation of the distribution of training data. The increase in standard deviation causes two classes to overlap to a higher degree; i.e., the border line between the classes tends to disappear. The partitioning problem is much greater when the distance between the cluster centers of two classes decreases. While the traditional pattern recognition method of Bayes risk minimization is able to approximate a cross-section between two overlapping classes, the introduced machine learning method is affected by the mentioned problems. The negative effect of description partitioning causes a single instance to be classified incorrectly (i.e., matched with a noisy concept component of a counter-class) even if it is in the center of its class membership. This situation occurs when a noisy concept component is derived across the border with another class description (i.e., on the opposite side of the valley). This component can then be matched with test data incorrectly.

The partitioning problem has been ignored by most previous research. Recently, Whitehall and Stepp (1990) developed the CAQ algorithm that searches for the border between distributions of numeric attributes of different classes. Their approach, however, assumes that the distribution of training data is known explicitly. We found this assumption to be a disadvantage of their method. Section 6.1 demonstrates that such an assumption cannot be made and that the distribution of training data can be very complex. The partitioning effect discussed above, however, can be utilized through the optimization of concept descriptions in order to improve system recognition effectiveness.
Theorem 1: If D* ⊂ D, D ∋ d = <d1, d2, ..., d#classes>, and D* ∋ d* = o(d) is optimized in such a way that the number of less significant concept components is reduced, then the recognition process r(d*,x) matching an instance x with the optimized description d* can perform better than the recognition process r(d,x) matching the same instance x with the primary description d.

Proof:
Removing less significant concept components from the description d, we decrease the effect of class overlapping. This removal is characterized by the following reasoning: (i) the probability that a less significant concept component of a given Ωi class is eliminated from the area of clusters typical for counter-classes is higher than from the area occupied by clusters of that class itself --- because of the t(Ωi,j) typicality measure of the given Ωi class, (ii) the removal of less significant concept components decreases the negative partitioning of the attribute space, and (iii) the decrease in partitioning of the attribute space clears both the area of cluster centers and the border areas between class descriptions. The optimization of a description d produces a description d* that has clearer cluster areas and clearer border areas. The cluster area of one class contains fewer noisy concept components of other classes. The border areas between class descriptions are more distinct, allowing test instances to be matched with closer and more significant concept components. Thus, the probability that a test instance x located within the cluster area of its class membership is classified correctly is higher for the optimized description d* than for the primary description d.

The degree of concept optimization plays a major role in the explained manipulation of concept descriptions. An increase in the optimization parameters is expected to be followed by an increase in the quality value representing system performance, because the system will no longer match test instances with less significant concept components (i.e., noisy or incorrect concept components). Substantial increases in the optimization parameters, however, will have a negative effect on system performance. Such an increase can cause the removal of more significant concept components, especially when the distribution of such a concept is very irregular (e.g., when the training data has many local clusters or forms a complex path through the attribute space). In order to protect the system against such removal of more significant (not noisy) concept components, the maximum of system recognition effectiveness should be tracked for a given texture domain.
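In practice, the model of equations (18)-(19) amounts to a search over the optimization parameter while monitoring quality on tuning data. The following Python sketch is our own illustration of that loop, with `optimize` and `quality` as hypothetical stand-ins for an optimization operator o(d,z) and a quality measure Q:

```python
from typing import Callable, Sequence, Tuple

def best_optimization(d, z_values: Sequence[float],
                      optimize: Callable,    # o(d, z): returns description d_z
                      quality: Callable) -> Tuple[object, float]:
    """Search Z for the description d* maximizing Q[r(d_z, x)] (eqs. 18-19).

    `quality(d_z)` is assumed to run the recognition process r(d_z, x) over a
    tuning set and return a scalar quality value Q."""
    best_d, best_q = d, quality(d)
    for z in z_values:
        d_z = optimize(d, z)
        q = quality(d_z)
        if q > best_q:               # track the maximum: quality first rises,
            best_d, best_q = d_z, q  # then falls once significant components
    return best_d, best_q            # start being removed
```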
5.2. Optimization criteria and quality measures

The improvement of concept performance can be viewed as a four-fold objective; i.e., (i) to increase system recognition effectiveness, (ii) to improve the stability of system performance, (iii) to decrease the storage capacity of concept descriptions, and (iv) to increase the speed of recognition processes. In this paper, we consider optimization techniques that allow us both to improve system recognition effectiveness and to improve the stability of system performance. Let us discuss the criteria that guide the optimization of concept prototypes and the Q quality measures that can evaluate system performance. Considering computer vision as the application domain of machine learning, the quality of a segmented and annotated image depends on the quality of all three phases of the texture recognition and segmentation schema, i.e., (i) image
processing performed to extract texture attributes, (ii) matching image elements with learned texture class descriptions in order to annotate them with the most probable classification hypotheses, and (iii) local unification of classification hypotheses in order to segment an image into homogeneous areas corresponding to certain objects. Quality criteria for the evaluation of segmentation processes are precise (Zucker, et al., 1975, Davis, et al., 1981, Hsiao and Sawchuk, 1989, du Buf, et al., 1990). An excellent system should: (i) preserve sharp and precise borders between different texture areas, (ii) smooth homogeneous texture surface areas, and (iii) preserve small objects against their removal from the segmented image. These criteria, however, are too difficult for current computer vision systems. They are contradictory. For example, the criterion of smoothing texture surface areas requires the extension of the radius of the local operators extracting texture attributes. On the other hand, if the radius is enlarged, then the operators blur borders between texture areas and can remove small objects from an image.

What, then, are the optimization criteria for the intermediate phase of our recognition system, i.e., the matching of image events (attribute vectors) with texture class descriptions? Considering the hierarchical processes of the texture recognition and segmentation schema and their mutual dependencies caused by the quality of their performance, we require that the best matching system should:
• increase the classification confidence when matching a class description with data belonging to this class,
• decrease the classification confidence when matching data with other class descriptions, and
• perform on an equal confidence level for all classes when data is matched with their class descriptions (system stability criterion).

Considering the above requirements, a confusion matrix should have the highest values for the diagonal elements and the lowest through all other matrix elements. Moreover, the system stability criterion requires that each concept should have the same chance to be recognized. It means that a highly negative effect is reached when one class can be recognized with the highest confidence (e.g., above 95%) and another class with relatively low confidence (e.g., below 60%).

Experiments presented in this paper were evaluated by testing acquired and optimized concept descriptions on separated sets of test data (Section 2). For each subset of 200 events extracted from a given homogeneous texture area representing the i-th class (different from the area used for the extraction of learning events), the system computes a recognition rate for this class. The recognition rate for class Ωi was calculated by dividing the number of correctly classified test events from the Tevents_Ωi test dataset by the total number of test events for this class; i.e.,

rec_rate_Ωi = #{x: x ∈ Tevents_Ωi and r(d,x) = <Ωi, c>} / #{x: x ∈ Tevents_Ωi}    (21)
System recognition effectiveness was then evaluated through the computation and monitoring of the following measures: 1) the average recognition rate, computed over all twelve test data sets representing texture classes (we require the average recognition rate to be the highest), 2) the standard deviation from the average recognition rate (the system stability criterion prefers the standard deviation to be minimal), and 3) the minimum recognition rate, representing the worst performing concept description (this rate should be as high as possible to allow all classes to be recognized).
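These three measures are straightforward to compute. The sketch below is a minimal Python illustration, assuming per-class test sets and a `classify` function returning the predicted class (both names are hypothetical stand-ins):

```python
import statistics
from typing import Callable, Dict, List, Tuple

def recognition_measures(test_sets: Dict[str, List[Tuple[int, ...]]],
                         classify: Callable) -> Tuple[float, float, float]:
    """Average, standard deviation, and minimum of per-class recognition rates.

    `test_sets` maps each class name to its test events (e.g., 200 per class);
    `classify(x)` returns the class name assigned to event x (equation 21)."""
    rates = []
    for cls, events in test_sets.items():
        correct = sum(1 for x in events if classify(x) == cls)
        rates.append(100.0 * correct / len(events))
    return (sum(rates) / len(rates),   # average recognition rate (maximize)
            statistics.stdev(rates),   # stability measure (minimize)
            min(rates))                # worst performing class (maximize)
```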
5.3. Concept optimization methods
We distinguish two classes of concept optimization methods; i.e., direct and indirect. Direct optimization methods directly manipulate acquired concept descriptions. This manipulation involves both specialization and generalization operations. Specialization operations shrink a concept through the elimination of less significant concept components. On the other hand, generalization extends a concept over the nearest and hypothetically noisy (less significant) components belonging to counter-classes. The second approach to concept optimization, indirect optimization, incorporates pre-optimized concept descriptions (i.e., optimized by a direct method) to filter the final training data. The learning process is then repeated for the modified set of training data. In this paper, we present optimization techniques that were integrated with AQ programs. Through practical experimentation, we found that some optimization techniques are effective for simple domains but do not improve system performance for complex domains. Considering the texture recognition task, we developed an indirect optimization method that follows our optimization model and performs better for complex domains with irregular attribute distributions and a larger number of texture classes.

5.3.1. Truncation of less significant concept components

The first optimization method of concept descriptions is based on the theory of Two-Tiered Representation (TT) of imprecise concepts (Michalski, 1987). The theory assumes that an acquired concept description can be transformed into its TT representation through a separation of the most significant concept properties (Base Concept Representation) from exceptions to these properties (Inferential Concept Interpretation). Since the concept descriptions learned by the family of AQ programs are composed of ordered components (i.e., from the most to the less significant components), one can truncate such descriptions by removing some less significant components. In our experiments, the truncation degree is controlled by a parameter corresponding to the percentage of learning events covered by the removed components. Such optimized concept descriptions are more specific. As explained in Section 5.1, they can improve the performance of the recognition system because test data will no longer match the removed noisy concept components. However, this direct optimization method does not apply coverage over the areas of the removed components, so the partitioning of concept descriptions remains unchanged. The cooperation schema between the learning, optimization, and recognition modules is the simplest one and is shown in Figure 5a.

5.3.2. Generalizing concept components over negative examples

The SG-TRUNC method is another direct optimization method applied to improve the performance of concept prototypes (Zhang and Michalski, 1989). This method incorporates both specialization, through truncation of less significant concept components, and generalization of concept components. The generalization of concept components is performed through the extension of attribute values within concept descriptions. In this way, a concept component covers not only positive training examples but can cover some negative examples as well. The degree of allowed coverage is controlled by optimization parameters. The SG-TRUNC optimization method was implemented within the AQ15 learning program (Michalski, et al., 1986) and within the AQ16 integrated learning system (Zhang, 1990).
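As an illustration only (a hypothetical sketch of our own, not the AQ15/AQ16 implementation), the two kinds of direct operation might look as follows, with rules carrying typicality counts and selectors stored as value ranges:

```python
from typing import List, Tuple

Selector = Tuple[int, int, int]      # (attribute index, lo, hi)
Rule = Tuple[List[Selector], int]    # (selectors, typicality t(i,j))

def truncate(cover: List[Rule], degree: float) -> List[Rule]:
    """Specialization: remove the least typical rules as long as the removed
    rules together cover at most `degree` percent of the training examples
    (the truncation parameter of section 5.3.1)."""
    total = sum(t for _, t in cover)
    budget = total * degree / 100.0
    kept = sorted(cover, key=lambda r: r[1])   # least typical first
    removed = 0
    while kept and removed + kept[0][1] <= budget:
        removed += kept.pop(0)[1]              # drop the least typical rule
    return sorted(kept, key=lambda r: -r[1])   # back to typicality order

def generalize(rule: Rule, margin: int, levels: int) -> Rule:
    """Generalization: widen each selector by `margin` attribute levels, so the
    rule may also cover nearby (possibly noisy) negative examples."""
    selectors, t = rule
    widened = [(a, max(0, lo - margin), min(levels - 1, hi + margin))
               for a, lo, hi in selectors]
    return (widened, t)
```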
Two parameters control the optimization process of the AQ15 and AQ16 programs. The first parameter
controls the degree of concept specialization, while the second parameter controls the degree of concept generalization. The optimization process requires training data (see Figure 5b). The authors of this method demonstrated its ability to improve system recognition effectiveness for simple domains. The performance of the AQ16 integrated system is presented later in this paper for a complex texture domain.

5.3.3. Filtering final training data by pre-optimized concept descriptions

The first of the direct optimization methods discussed above directly eliminates less significant concept components from concept descriptions, and the second method manipulates single components in order to expand them over negative examples. Such expansion, however, considers the local rather than the overall distribution of training data within the attribute space. Advancing concept optimization methods, we present the following conclusions acquired from the analysis of the optimization model:
• If the learning program draws a class description over positive examples only, then any removal of less significant concept components of counter-class descriptions does not imply automatic generalization of the most significant components of a given class over the space released by the removed counter-class components.
• The ultimate goal of such removal, however, should be an increase in the typicality of the concept components of a given class and a decrease in their number over the border areas between different concepts.
• Such an increase in the typicality of concept components and decrease in their number over the border areas between different concepts can be achieved through indirect optimization. In this method, pre-optimized concept descriptions are used to filter the final set of training data, and the learning process is repeated with the new dataset.

In order to derive homogeneous areas representing concept descriptions and to improve the borders between concept descriptions of different classes, one has to merge partitioned concept components. This merging can be executed correctly over the space released by the removal of less significant concept components if these components were incorrectly acquired as components of counter-class descriptions. Such generalization over the released areas of the attribute space (1) extends the concept components describing the main cluster areas, i.e., increases the typicality of these concept components, and (2) improves the separation between concept descriptions of different classes.

To implement such generalization over the released areas of the attribute space, we developed an indirect optimization method. The introduced method incorporates pre-optimized concept descriptions to release areas of the attribute space representing less significant concept components by the filtration of the training dataset. If some of the primary training examples were noisy and thus produced less significant concept components, then some of these noisy examples can be filtered out by the pre-optimized concept descriptions. It logically follows that the filtered set of training data can be reused to learn the final concept descriptions.

We implemented the above reasoning within the schema presented in Figure 5c. The system learns from the originally provided training data and optimizes the acquired descriptions by truncating less significant concept components. The learning process is performed by the AQ14 program. The program was set to acquire specific and simple concept components. This was done because AQ programs tend to combine several clusters of positive examples and describe them as a single concept component. Thus, less significant components can be combined with the most significant
components, protecting them against removal. By requesting the acquisition of specific and simple concept descriptions, the system is forced to derive single (non-connected) complexes rather than connected complexes. The applied direct optimization process (described in Section 5.3.1) removes less significant concept components from the originally learned concept descriptions, where the optimization degree is the percentage of training examples covered by the removed concept components. The acquired and optimized primary concept descriptions are then applied to filter the final set of training data. Training examples are passed if they are covered (matched strictly) by the optimized descriptions. If they are not covered, they are removed from the final set of training examples. The learning process is then repeated in order to acquire the final concept descriptions.
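The whole indirect schema of Figure 5c can be summarized in a few lines. This is a sketch under stated assumptions: `learn` stands in for the AQ14 learning step, `truncate` for the direct truncation of section 5.3.1 (as sketched earlier), and `strict_match` for strict coverage; all three names are our own:

```python
def indirect_optimization(training_data, learn, truncate, strict_match, degree):
    """Filter final training data with pre-optimized descriptions (Figure 5c).

    1. Learn primary covers from the original training data.
    2. Truncate less significant components (pre-optimization, section 5.3.1).
    3. Keep only training examples strictly covered by the truncated covers.
    4. Re-learn the final covers from the filtered data.
    """
    primary = {cls: learn(cls, examples)
               for cls, examples in training_data.items()}
    pre_optimized = {cls: truncate(cover, degree)
                     for cls, cover in primary.items()}
    filtered = {cls: [x for x in examples
                      if strict_match(pre_optimized[cls], x)]
                for cls, examples in training_data.items()}
    return {cls: learn(cls, examples) for cls, examples in filtered.items()}
```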
[Figure 5 shows three cooperation pipelines of the form: training data → Learning → rules → Optimization (controlled by the optimization degree) → optimal rules → Recognition of test data → decision; in schema (c) the optimized descriptions filter the training data and learning is repeated.]

Fig. 5 Cooperation schemas for different concept optimization methods: a) simple truncation of concept components, b) generalization of concept descriptions over negative examples (the AQ16 integrated system), and c) filtration of final training data incorporating pre-optimized concept prototypes
6. INTRODUCTORY EXPERIMENTS WITH LEARNING TEXTURE CLASS DESCRIPTIONS

This section presents the results of applying a learning approach to the acquisition of texture class descriptions from noisy data. The experiments show the dependency of the learning and recognition processes on variable learning conditions. We investigate this dependency on: 1. the size of the texture feature extraction window (for the computation of local macro-statistics --- Section 2.2), 2. the number of training examples, and 3. the acquisition of specific versus general concept descriptions.
We validate the choice of feature extraction and learning parameters for the subsequent experiments investigating different optimization approaches. All experiments are done for 12 classes of texture
presented in Figure 1. The extraction of texture attributes, learning processes, and flexible matching processes were performed as explained in preceding sections.
6.1. Attribute space complexity

To justify our learning-based approach to the acquisition of texture concepts, we first inspect the distribution of the attribute space. The task is to demonstrate that the distribution of attributes for texture classes is complex. We have already mentioned that most researchers try to use the powerful parametric pattern recognition method of minimization of Bayes risk (Duda and Hart, 1973) in the acquisition and recognition of texture class descriptions. The application of this method, however, requires the distribution of texture attributes to be normal. The approximation of an attribute distribution by a normal distribution can sometimes be done successfully if the system learns a few class descriptions and these textures are significantly different. But if the system has to learn a large number of texture class descriptions and the attribute distribution is complex, traditional parametric methods of pattern recognition cannot be applied.
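A quick way to reproduce this kind of inspection is to test each attribute's distribution against normality. The sketch below is our own illustration (assuming one class's events are available as an array of integer attribute vectors) using SciPy's D'Agostino-Pearson normality test:

```python
import numpy as np
from scipy import stats

def non_normal_attributes(events: np.ndarray, alpha: float = 0.01):
    """Flag non-normal attributes of one texture class.

    `events` is an (n_events, n_attributes) array of attribute vectors.
    Returns the indices of attributes whose distribution deviates from
    normality according to the D'Agostino-Pearson test."""
    flagged = []
    for k in range(events.shape[1]):
        _, p = stats.normaltest(events[:, k])
        if p < alpha:          # normality rejected for attribute x_k
            flagged.append(k)
    return flagged
```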
[Figure 6 shows four smoothed attribute distributions plotted against attribute value (0 to 50): class d9, attribute S5E5; class d54, attribute E5S5; class d54, attribute L5S5; and class d93, attribute R5R5.]

Fig. 6 Examples of non-normal attribute distribution
We simplify the proof that the distribution of attributes for our texture classes within the n-dimensional attribute space is not normal by showing that the distribution of single attributes is not normal. Figure 6 presents the most representative samples of non-normal attribute distributions, where the solid line corresponds to the smoothed distribution of an attribute and the dotted line corresponds to the approximated normal distribution. The left-side diagrams of Figure 6 present relatively simple attribute distributions. The right-side diagrams of Figure 6, however, present two cases of more complex attribute distributions. The upper right diagram indicates the possible formation of more than one cluster of training data. The lower right diagram presents a very complex distribution of an attribute without distinct regular clusters. It is seen that the R5R5 attribute does not carry significant information about the d93 texture class. Considering the above complexity of texture attributes, one can apply neither parametric pattern recognition methods nor the CAQ learning program that deals with numeric attributes and noise. Both approaches assume the distribution of training data to be known and approximated by a parametric distribution.
6.2. Basic characteristics of the learning approach

The effectiveness of a learning schema applied to the acquisition and recognition of texture concepts depends both on the processes preceding the learning phase and on the parameters of the learning program. These processes are related to the extraction of texture attributes and the selection of training data. We investigate these characteristics depending on (1) the size of the averaging window applied to compute local macro-statistics of texture energy (Laws, 1980), and (2) the number of training examples provided for the learning phase. The acquired characteristics of system recognition effectiveness are presented in Figure 7. The training dataset was adjusted from 50 learning examples per class to 300 learning examples per class. The test dataset was constant, and it consisted of 200 testing examples per class. The radius of the averaging window applied to acquire macro-statistics of texture energy (see Figure 3) was 3.5, 5.5, and 7.5 pixels.

Figure 7 shows that the average recognition rate increased slightly with an increase in the number of training examples. At the same time, the standard deviation decreased and the minimum recognition rate increased rapidly for larger window sizes. This observation indicates that an increase in the number of training examples has a substantial influence on system recognition stability. On the other hand, overall system recognition effectiveness, measured by the average recognition rate, is very sensitive to the window size. The overall effectiveness improved substantially with larger windows.

Based on this evidence, one could consider an increase both in the window size and in the number of training examples in order to improve system recognition effectiveness. An increase in window size, however, has a significant influence on the image segmentation processes (Tomita and Tsuji, 1977) and must be limited to an area dependent on the content of the texture image (i.e., the size and shape of texture areas). A significant increase in the number of training data is impossible because of the increase in concept complexity. This complexity, measured by the average number of concept components per class, increased linearly (in our experiments) with the increase in the number of training data. Considering the results of applying the learning approach to the acquisition of texture concepts from noisy texture data, one finds that the improvement of system performance must be achieved in ways other than adjusting the window size and increasing the number of training examples. A certain balance is required in order to support high recognition effectiveness, high stability of recognition decisions, and relatively low complexity of concept descriptions. We decided that further experiments would be run with an attribute extraction window of radius R=7.5 and 200 training examples representing each texture class.
[Figure 7 plots the average recognition rate, standard deviation, and minimum recognition rate against the number of training examples per class (50 to 300), for window radii R=3.5, R=5.5, and R=7.5.]

Fig. 7 Learning characteristics for different numbers of training events and different sizes of the window extracting texture attributes
6.3. Effectiveness of specific versus general concept descriptions
Most learning tools are controlled by special parameters. These parameters, as indicated in Section 3, can tune a learning tool to a specific application domain. Applying the AQ14 program to learn
texture concepts, we considered a relatively large number of program parameters. In this section, we present the influence of one of those parameters: the one that requests learning general or specific concept descriptions. The difference between specific and general concept descriptions can be illustrated by the area of the attribute space covered by these descriptions. General descriptions cover a larger area than specific descriptions; i.e., the conditional part of a general rule contains a larger range of attribute values than that of a specific rule. Moreover, general descriptions overlap a larger area than specific descriptions. Thus, general descriptions can be matched strictly with test data over a larger area of the attribute space.

When matching general descriptions with test instances, we increase the probability that an instance is classified into more than one class. The primary effect of such matching is an increase in the average recognition rate --- where the recognition rate is not a probability measure and is computed as the ratio of the number of instances classified correctly to the concept to the total number of test instances. A test instance, however, can be covered by more than one general concept, which allows classification into more than one class. In this way, a highly negative effect of such matching is seen as an increase in the mis-classification rate. The mis-classification rate monitors both the number of instances that are classified incorrectly and the number of instances that are not uniquely classified to the correct class.

Figure 8 presents the improvement in the average recognition rate, standard deviation, and minimum recognition rate when concept descriptions are learned as general descriptions (black marks) and specific descriptions (white marks). This improvement is illustrated for different sizes of the attribute extraction window and for different numbers of training examples. These results suggest it is better to learn general rather than specific concept descriptions.
[Figure 8 plots the average recognition rate, standard deviation, and minimum recognition rate against the number of training examples per class, for general and specific descriptions at window radii R=3.5 and R=7.5.]
Fig. 8 Recognition effectiveness of specific and general concept descriptions

The mis-classification rate, however, suggests otherwise. An example is shown in Figure 9, where confusion matrices are presented for both general and specific concept descriptions. The average
mis-classification rate shows a nearly two-fold increase when general descriptions are applied to recognize the test data. The minimization of such a mis-classification rate is important for the image segmentation phase. Thus, the choice of general or specific concept descriptions must consider both the average recognition rate and the average mis-classification rate. Considering the rapid increase in the mis-classification rate, further experiments with this learning approach applied to texture recognition will be based on the acquisition of specific rather than general concept descriptions.

[Figure 9 shows 12x12 confusion matrices over the twelve texture classes. For general concept descriptions: average recognition = 74.67, average mis-classification = 4.83. For specific concept descriptions: average recognition = 73.00, average mis-classification = 2.66.]

Fig. 9 Learning characteristics for the acquisition of general and specific concept descriptions
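The two rates can be computed from the raw matching decisions. The sketch below is our own Python illustration, under the assumption (our reading of the description above) that each test instance yields the set of classes matched with maximal confidence, so non-unique classifications are possible:

```python
from typing import Dict, List, Set

def recognition_and_misclassification(decisions: Dict[str, List[Set[str]]]):
    """Average recognition and mis-classification rates over all classes.

    `decisions` maps each true class to the list of matched-class sets
    returned for its test instances. An instance counts as recognized when
    its true class is among the matched classes; it counts as mis-classified
    when it is matched incorrectly or not uniquely matched to the correct
    class (the two cases the mis-classification rate monitors)."""
    rec_rates, mis_rates = [], []
    for true_cls, matches in decisions.items():
        n = len(matches)
        recognized = sum(1 for m in matches if true_cls in m)
        unique_correct = sum(1 for m in matches if m == {true_cls})
        rec_rates.append(100.0 * recognized / n)
        mis_rates.append(100.0 * (n - unique_correct) / n)
    return (sum(rec_rates) / len(rec_rates),   # average recognition rate
            sum(mis_rates) / len(mis_rates))   # average mis-classification rate
```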
7. EMPIRICAL EVALUATION OF CONCEPT OPTIMIZATION METHODS

This section presents experimental results of learning-based texture recognition when different concept optimization approaches were applied in order to increase system recognition effectiveness. These experiments were run for twelve texture classes (see Figure 1). The training data was extracted by applying the modified Laws' method with an averaging window radius equal to 7.5 pixels. The training data for each class consisted of 200 examples. The test data was extracted from different areas of the same textures and consisted of 200 test events for each class.
7.1. Effectiveness of simple truncation of less significant concept components

The effectiveness of the simplest concept optimization method, which incorporates the truncation of less significant concept complexes (Section 5.3.1), was already investigated. Zhang (1990)
demonstrated an increase in recognition effectiveness when this method was applied to simple domains of symbolic data. This data, however, did not characterize real engineering domains. In the case of texture data, the applied simple truncation of less significant concept components did not increase the average recognition rate. This rate was equal to 73% for a wide range of optimization degrees; i.e., for removed complexes covering 2% to 30% of the training examples. At the same time, the minimum recognition rate was constant and equal to 37%. The standard deviation applied to monitor the system stability criterion oscillated within a very small range of values; i.e., between 23 and 23.5.
We found this optimization method inefficient when applied to increase texture recognition effectiveness. However, the lack of a decrease in system performance suggests that the truncation of less significant concept components could be applied to reduce the size of concept descriptions without negative effects on recognition effectiveness.

7.2. Effectiveness of the SG-TRUNC optimization method

The SG-TRUNC concept optimization method (see Section 5.3.2) was applied to the texture recognition problem in our first introductory experiments (Pachowicz, 1990). The data, however, was much simpler; i.e., there were six classes of texture, and the averaging window applied in the computation of the texture energy measure was much larger (i.e., a square window of 2000 pixels). Both training and testing data were less noisy, and the attribute space was less complex. Experiments were performed incorporating the AQ15 learning program (Hong, et al., 1986) and the ATEST testing program (Reinke, 1984). Concept descriptions were generated as general concepts. As reported, the optimization increased the average recognition rate and decreased the deviation of the recognition rate. Generally speaking, this method performed well for simple texture domains and for learned general concept descriptions (rather than specific descriptions), but the mis-classification rate was not investigated in those experiments.

In experiments with more complex texture data, we decided to apply the AQ16 integrated learning system rather than the combination of the AQ15 and ATEST programs. These two differ mostly in the implementation of the SG-TRUNC method and in the concept matching technique. The effectiveness of the AQ16 system was already demonstrated by Zhang (1990), who reported it as a very effective concept optimization method. The AQ16 integrated system, however, was tested with simple non-engineering data and had never been applied to texture data.
For the texture data presented in this paper, concept optimization performed by the AQ16 system gave poor results. These results are presented in Figure 10. Dotted lines in the diagrams represent the average recognition rate, standard deviation, and minimum recognition rate as the optimization level was increased. Solid lines represent smoothed characteristics. The average recognition rate dropped from about 72% to 63% and then recovered slightly to the 70% level with the increase in optimization level. This recovery, however, was associated with a very fast increase in the value of the standard deviation and with a rapid decrease in the minimum recognition rate. This means that the performance of well performing class descriptions was further increased while the performance of worse performing classes was decreased deeply. The obtained results indicate a deep decrease in the stability of system recognition performance. At the same time, the average recognition rate had no clear trend. We tested this method of concept optimization on other sets of texture data with similar results. Therefore, we conclude that the SG-TRUNC concept optimization method implemented within the AQ16 system is not an effective method when applied to hard texture domains with many texture classes and very noisy training data.
[Figure 10 plots the average recognition rate, standard deviation, and minimum recognition rate against the optimization level (0 to 15).]

Fig. 10 Recognition results for the AQ16 integrated system (SG-TRUNC optimization method)
7.3. Effectiveness of the filtration of final training data by pre-optimized concept descriptions

The introduced indirect optimization method, which filters the final training data incorporating pre-optimized concept descriptions, performed very well according to the discussed evaluation criteria. These results are presented in Figure 11, where dotted lines correspond to the acquired characteristics and solid lines represent smoothed characteristics.
[Figure 11 plots the average recognition rate, standard deviation, and minimum recognition rate against the optimization degree (0 to 30).]

Fig. 11 Recognition results for the filtration of final training data incorporating pre-optimized concept descriptions
The average recognition rate increased from the 73% level up to 74.5% before a slower decrease to below the 73% level. This increase was characteristic of the range from 0% to 10% of filtered training data interpreted as noisy examples. At the same time, the standard deviation decreased from 23.5 to below 22. The minimum recognition rate increased significantly from 37% to above 45%. For higher optimization degrees, the minimum recognition rate decreased slowly, but still to a level well above the initial 37%.

The optimization effect was then followed by an analysis of the recognition characteristics for the worst performing texture descriptions; i.e., class d92 (37%) and class d28 (44%). The analysis is based on pattern labeling of the recognition curves. Increasing recognition curves are marked by black-headed arrows, while decreasing and oscillating patterns are marked by white-headed arrows (see Figure 12). These labels were incorporated in the development of a novel recognition method for noisy concepts (Bala and Pachowicz, 1991). In this method, the final classification yields recognition of the class that has an increasing recognition pattern. If more than one class has such an increasing pattern of recognition curves, then the classification yields recognition of the class with the highest recognition rates within the increasing pattern. Figure 12 shows the increase in the recognition rate for the worst performing texture descriptions. The recognition rate for the d92 class of texture increased from 37% to a value well above 45% with the increase in the optimization degree. The recognition rate for the d28 class of texture increased slightly from 44% to the level of 50%. We find the above results to be a very positive effect of the applied filtration of training data by pre-optimized concept descriptions. These results are the best among the three concept optimization methods tested in this paper.
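The curve-labeling rule described above can be approximated in a few lines. This sketch is our own reading of the description, not the published algorithm of Bala and Pachowicz (1991); in particular, the trend test used to label a curve as increasing is an assumption:

```python
from typing import Dict, List

def label_increasing(curve: List[float]) -> bool:
    """Label a recognition curve (rates over increasing optimization degrees)
    as 'increasing' when its overall trend is upward."""
    upward_steps = sum(1 for a, b in zip(curve, curve[1:]) if b >= a)
    return curve[-1] > curve[0] and upward_steps >= len(curve) // 2

def classify_by_curves(curves: Dict[str, List[float]]) -> str:
    """Prefer classes whose curves are increasing; among them (or, failing
    that, among all classes) pick the one with the highest recognition rate."""
    increasing = {c: r for c, r in curves.items() if label_increasing(r)}
    pool = increasing if increasing else curves
    return max(pool, key=lambda c: max(pool[c]))
```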
[Figure 12 shows recognition-rate curves versus optimization degree for the worst performing classes, d92 and d28.]