Proceedings of the International Conference on Cognition and Recognition

Detection of Monotonic Chain Structures in Mixed Feature Type Multidimensional Data

Manabu Ichino
Tokyo Denki University, Department of Information and Arts
Hatoyama, Saitama 350-0394, Japan

Abstract

Symbolic data analysis aims at generalizing some standard statistical methods. Generalization of principal component analysis (PCA) is an interesting and important research problem in symbolic data analysis. A main purpose of PCA is to find a linear structure in multidimensional data. However, a direct extension of PCA is difficult when each object is described not only by usual quantitative features but also by interval valued features and qualitative features. This paper describes a method to detect “monotonic chain structures”, including “linear structures”, based on the Cartesian system model (CSM), which is a mathematical model to manipulate objects described by mixed type feature values. Simple examples are presented to illustrate our approach.

1. INTRODUCTION

Symbolic data analysis (SDA) is a new direction to generalize standard statistical methods [1]. Generalization of principal component analysis (PCA) is a major research theme in SDA. When objects are described not only by usual quantitative values but also by interval values and qualitative values, a direct extension of PCA is difficult. The author defined generalized Minkowski metrics on mixed feature type multidimensional data (MFTMD) [5]. These generalized metrics were used to generalize classical PCA and clustering problems for MFTMD [6]. The extension of PCA to interval data has been proposed by several authors [2][3].

A main purpose of classical PCA is to find a “linear structure” in multidimensional data. In contrast to this idea, this paper presents a method to find “monotonic structures” embedded in MFTMD. We assume a finite set $U$ of objects described in the Cartesian system model (CSM), which is a mathematical model to manipulate MFTMD.

We define relative neighborhood sets for each object in $U$ under a selected set of features. Then, based on the relative neighborhood sets, we present a formulation and interpretation of chain connected coverings for the set $U$. As special classes of chain connected coverings, we study monotonic chain structures in relation to the nested coverings of $U$. We define the similarity between features as the average Marczewski-Steinhaus distance between relative neighborhood sets. The detection of monotonic chain structures in multidimensional data is realized by a hierarchical feature clustering based on our similarity measure. A simple measure is also defined to evaluate the monotonicity of chain connected structures.

2. CARTESIAN SYSTEM MODEL

Let $U$ be a finite set of $K$ objects:

$$U = \{\omega_1, \omega_2, \ldots, \omega_K\} \tag{1}$$

Let each object $\omega_k$ be described by $d$ features (attributes). Let $D_i$ be the domain of feature $F_i$, $i = 1, 2, \ldots, d$. Then the feature space is defined by the product set

$$D^{(d)} = D_1 \times D_2 \times \cdots \times D_d \tag{2}$$

Since we permit the simultaneous use of various feature types, we use the notation $D^{(d)}$ for the feature space in order to distinguish it from the usual $d$-dimensional Euclidean space $D^d$. Each object $\omega_i$ in the set $U$ is represented in the feature space $D^{(d)}$ as

$$E_i = E_{i1} \times E_{i2} \times \cdots \times E_{id} \quad \text{or} \quad E_i = (E_{i1}, E_{i2}, \ldots, E_{id}), \tag{3}$$

where $E_{ij}$, $j = 1, 2, \ldots, d$, is the feature value taken by the feature $F_j$.

2.1 Feature Types

In this paper, we treat the following feature types.

1) Continuous quantitative feature: The height and the weight of a person are examples of this feature type.
2) Discrete quantitative feature: The number of cities in a state and the number of family members of a person are examples of this feature type.
3) Ordinal qualitative feature: One’s academic background {junior high school, high school, college or university, graduate school} and military rank are examples of this feature type. We assume an appropriate numerical coding.
4) Nominal qualitative feature: The sex of a person {male, female} and the blood type of a person {A, B, AB, O} are examples of this feature type.

We permit interval values for feature types 1) - 3), and finite set values for feature type 4). The Cartesian product of the form (3) described in terms of feature types 1) - 4) is called an event.

2.2 The Cartesian Join and Meet Operators

The Cartesian join, $A \boxplus B$, of a pair of events $A = (A_1, A_2, \ldots, A_d)$ and $B = (B_1, B_2, \ldots, B_d)$ in the feature space $D^{(d)}$ is defined by

$$A \boxplus B = [A_1 \boxplus B_1] \times [A_2 \boxplus B_2] \times \cdots \times [A_d \boxplus B_d], \tag{4}$$

where $[A_i \boxplus B_i]$ is the Cartesian join of the feature values $A_i$ and $B_i$ for feature $F_i$ and is defined as follows. When $F_i$ is a quantitative or an ordinal qualitative feature, $[A_i \boxplus B_i]$ is a closed interval given by

$$[A_i \boxplus B_i] = [\min(A_{iL}, B_{iL}), \max(A_{iU}, B_{iU})], \tag{5}$$

where $A_{iL}$ and $A_{iU}$ are the minimum and the maximum values of the interval $A_i$, respectively (and likewise $B_{iL}$ and $B_{iU}$ for $B_i$), and $\min(A_{iL}, B_{iL})$ and $\max(A_{iU}, B_{iU})$ take the minimum and the maximum values of the sets $\{A_{iL}, B_{iL}\}$ and $\{A_{iU}, B_{iU}\}$, respectively. When $F_i$ is a nominal feature, $[A_i \boxplus B_i]$ is a union:

$$[A_i \boxplus B_i] = A_i \cup B_i. \tag{6}$$

The Cartesian meet, $A \boxtimes B$, of a pair of events $A = (A_1, A_2, \ldots, A_d)$ and $B = (B_1, B_2, \ldots, B_d)$ in the feature space $D^{(d)}$ is defined by

$$A \boxtimes B = [A_1 \boxtimes B_1] \times [A_2 \boxtimes B_2] \times \cdots \times [A_d \boxtimes B_d], \tag{7}$$

where $[A_i \boxtimes B_i]$ is the Cartesian meet of the feature values $A_i$ and $B_i$ for feature $F_i$, defined by the intersection

$$[A_i \boxtimes B_i] = A_i \cap B_i. \tag{8}$$

When the intersection (8) takes the empty value $\emptyset$ for at least one feature, the events $A$ and $B$ have no common part. We denote this fact by

$$A \cap B = \emptyset, \tag{9}$$

and we say that $A$ and $B$ are completely distinguishable.
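As a minimal sketch of the operators in (4)-(8), the following Python fragment represents an interval value as a (lower, upper) tuple and a nominal value as a frozenset. The type aliases, the function names, and the use of `None` to encode the empty value are assumptions made here for illustration; they are not part of the CSM definition.

```python
from typing import Optional, Tuple, Union

Interval = Tuple[float, float]        # closed interval [lower, upper]
Value = Union[Interval, frozenset]    # nominal values are finite sets
Event = Tuple[Value, ...]             # one feature value per feature


def join_value(a: Value, b: Value) -> Value:
    """Cartesian join of two feature values: eq. (5) for intervals, eq. (6) for sets."""
    if isinstance(a, frozenset):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet_value(a: Value, b: Value) -> Optional[Value]:
    """Cartesian meet of two feature values, eq. (8); None encodes the empty value."""
    if isinstance(a, frozenset):
        common = a & b
        return common if common else None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def join(a: Event, b: Event) -> Event:
    """Feature-wise Cartesian join of two events, eq. (4)."""
    return tuple(join_value(x, y) for x, y in zip(a, b))


def meet(a: Event, b: Event) -> Optional[Event]:
    """Feature-wise Cartesian meet, eq. (7); None when the events are completely distinguishable."""
    parts = [meet_value(x, y) for x, y in zip(a, b)]
    return None if any(p is None for p in parts) else tuple(parts)


# Hypothetical events: (height interval in cm, set of blood types).
A = ((160.0, 170.0), frozenset({"A", "B"}))
B = ((165.0, 180.0), frozenset({"O"}))
print(join(A, B))   # height interval widens, blood-type sets are united
print(meet(A, B))   # None: the blood-type sets do not intersect
```

Returning `None` as soon as one feature has an empty meet mirrors the statement that a single empty intersection makes the two events completely distinguishable.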

We call the triplet $(D^{(d)}, \boxplus, \boxtimes)$ the Cartesian System Model (CSM) [7][8].

3. RELATIVE NEIGHBORHOOD AND NEIGHBORHOOD SET

In the following discussion, we treat various subsets of the given set of features. To clarify this, let $F_0$ be the set of feature numbers given by

$$F_0 = \{1, 2, \ldots, d\}, \tag{10}$$

and be called the feature set. For a feature subset $F = \{p_1, p_2, \ldots, p_m\} \subseteq F_0$, an object $\omega_k$ in the set $U = \{\omega_1, \omega_2, \ldots, \omega_K\}$ may be given as follows:

$$E_k = E_{kp_1} \times E_{kp_2} \times \cdots \times E_{kp_m} \quad \text{or} \quad E_k = (E_{kp_1}, E_{kp_2}, \ldots, E_{kp_m}). \tag{11}$$

3.1 Join Region and Neighborhood Set

Definition 1: Join Region
For a pair of objects $\omega_p, \omega_q \in U$, let $J(\omega_p, \omega_q \mid F)$ be the Cartesian join region in the feature space spanned by a feature subset $F \subseteq F_0$, i.e.,

$$J(\omega_p, \omega_q \mid F) = \prod_{r \in F} [E_{pr} \boxplus E_{qr}], \tag{12}$$

where $\prod$ is the operator for the Cartesian product, and the square brackets “[” and “]” mean here that the boundary values of the Cartesian join for feature $F_r$ are included in the join region (i.e., it is a closed region).

Definition 2: Relative Neighborhood
Two objects $\omega_p, \omega_q \in U$ are called relative neighbors under a feature subset $F \subseteq F_0$ if the following condition is satisfied:

$$J(\omega_p, \omega_q \mid F) \boxtimes E_k \neq E_k \quad \text{for all } k \neq p, q. \tag{13}$$

In other words, no third object is entirely contained in the join region of $\omega_p$ and $\omega_q$.

Figure 1 shows seven objects represented by rectangular events under the feature set $\{F_1, F_2\}$. In this figure, $(\omega_1, \omega_2)$, $(\omega_1, \omega_5)$, $(\omega_2, \omega_5)$, and $(\omega_5, \omega_7)$ are examples of relative neighbors. A neighborhood set of an object $\omega \in U$ under a feature subset $F$, denoted by $n(\omega \mid F)$, is a non-empty subset of $U$. The operator $n(\cdot \mid F)$ is a mapping $n: U \to 2^U$, where $2^U$ denotes the power set of $U$.

Fig. 1: Illustration for the relative neighborhood

Definition 3: Neighborhood Set
For each $\omega \in U$, let $n(\omega \mid F)$ be the set of all relative neighbors of $\omega$ under $F$. We assume that each $n(\omega \mid F)$ includes $\omega$ itself as a neighbor.

3.2 Chain and Chain Connected Covering

Definition 4
Two objects $\omega_p, \omega_q \in U$ are called chain connected (or simply connected) under $F$ if

$$\omega_p, \omega_q \in n(\omega_p \mid F) \cap n(\omega_q \mid F). \tag{14}$$
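The relative neighbor test of Definition 2 and the neighborhood sets of Definition 3 can be sketched as follows. This is a minimal illustration restricted to interval-valued features, where the condition $J \boxtimes E_k = E_k$ reduces to checking that $E_k$ lies inside the join region. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def join(a, b):
    """Cartesian join of two interval events (tuples of (lower, upper) pairs)."""
    return tuple((min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b))


def contains(region, event):
    """True when the event lies wholly inside the region, i.e. their meet equals the event."""
    return all(r[0] <= e[0] and e[1] <= r[1] for r, e in zip(region, event))


def neighborhood_sets(events, features):
    """Map each object index k to n(omega_k | F), where F is a tuple of feature indices."""
    proj = [tuple(e[i] for i in features) for e in events]
    n = {k: {k} for k in range(len(proj))}      # each object is its own neighbor
    for p, q in combinations(range(len(proj)), 2):
        region = join(proj[p], proj[q])
        others = (k for k in range(len(proj)) if k not in (p, q))
        if not any(contains(region, proj[k]) for k in others):
            n[p].add(q)                         # condition (13) holds for every other k
            n[q].add(p)
    return n


# Hypothetical data: four small rectangles placed along a rising diagonal.
events = [((0, 1), (0, 1)), ((2, 3), (2, 3)), ((4, 5), (4, 5)), ((6, 7), (6, 7))]
print(neighborhood_sets(events, (0, 1)))
```

For this data only consecutive rectangles are relative neighbors, so the two end objects obtain neighborhood sets of size two and the inner objects sets of size three.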

Definition 5: Chain
A series of objects $\omega_{p_1}, \omega_{p_2}, \ldots, \omega_{p_m}$ is called a chain under $F$ if the following conditions are satisfied:

$$\omega_{p_k}, \omega_{p_{k+1}} \in n(\omega_{p_k} \mid F) \cap n(\omega_{p_{k+1}} \mid F), \quad k = 1, 2, \ldots, m-1, \tag{15}$$

where $\omega_{p_1}$ and $\omega_{p_m}$ are called the terminal points, and $m$ is the length of the chain.

Definition 6: Chain Connected Covering (CCC)
A chain $\omega_{p_1}, \omega_{p_2}, \ldots, \omega_{p_m}$ is called a chain connected covering (CCC) of $U$ under $F$ if

$$U \subseteq \bigcup_{k=1}^{m} n(\omega_{p_k} \mid F). \tag{16}$$
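Conditions (15) and (16) translate directly into set operations once the neighborhood sets are available. The sketch below assumes the neighborhood sets are given as a dictionary from object index to a set of indices. That representation, the function names, and the sample sets are illustrative choices rather than part of the definitions.

```python
def is_chain(chain, n):
    """Condition (15): each consecutive pair lies in both of its own neighborhood sets."""
    return all({a, b} <= (n[a] & n[b]) for a, b in zip(chain, chain[1:]))


def is_ccc(chain, n, universe):
    """Condition (16): the series is a chain and its neighborhood sets cover the universe."""
    covered = set().union(*(n[p] for p in chain))
    return is_chain(chain, n) and universe <= covered


# Hypothetical neighborhood sets for four objects arranged in a line.
n = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {2, 3}}
U = {0, 1, 2, 3}
print(is_ccc([1, 2], n, U))   # True: n(1) and n(2) together cover U
print(is_ccc([0, 1], n, U))   # False: object 3 is not covered
print(is_chain([0, 2], n))    # False: 0 and 2 are not neighbors
```

Note that a CCC need not visit every object; it only has to cover every object by some neighborhood set along the chain.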

Example 1: In Figure 1, the series $\omega_3, \omega_2, \omega_5, \omega_6, \omega_7$ is a CCC of the set $U$ of seven objects. As an example for a nominal feature, consider the following five objects composed of nominal values $a_k$, $k = 1, 2, \ldots, 7$: $\omega_1 = \{a_1, a_2, a_3\}$, $\omega_2 = \{a_2, a_3, a_4\}$, $\omega_3 = \{a_3, a_4, a_5\}$, $\omega_4 = \{a_4, a_5, a_6\}$, and $\omega_5 = \{a_5, a_6, a_7\}$. Then these five objects yield a CCC under $F$, where $\omega_1$ and $\omega_5$ are the terminal points.

Definition 7: Monotonic Chain
A chain $\omega_{p_1}, \omega_{p_2}, \ldots, \omega_{p_m}$ is called a monotonic chain under $F$ if the chain satisfies the nesting property:

$$J(\omega_{p_1}, \omega_{p_k} \mid F) \subseteq J(\omega_{p_1}, \omega_{p_{k+1}} \mid F), \quad k = 1, 2, \ldots, m-1. \tag{17}$$
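The nesting property (17) can be checked by building the join region between the first object and each later object in turn and testing containment of consecutive regions. The following is a minimal sketch for interval-valued events only. It tests the nesting property alone and assumes the input has already been verified to be a chain. All names and data are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]
Event = Tuple[Interval, ...]


def join(a: Event, b: Event) -> Event:
    """Cartesian join of two interval events."""
    return tuple((min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b))


def is_subregion(inner: Event, outer: Event) -> bool:
    """True when inner is contained in outer for every feature."""
    return all(o[0] <= i[0] and i[1] <= o[1] for i, o in zip(inner, outer))


def has_nesting_property(chain: Sequence[Event]) -> bool:
    """Check that the join regions from the first object grow monotonically, as in (17)."""
    first = chain[0]
    regions = [join(first, e) for e in chain]
    return all(is_subregion(r, s) for r, s in zip(regions, regions[1:]))


rising = [((0, 1), (0, 1)), ((2, 3), (2, 3)), ((4, 5), (4, 5))]
falling = [((0, 1), (6, 7)), ((2, 3), (4, 5)), ((4, 5), (0, 1))]
zigzag = [((0, 1), (0, 1)), ((4, 5), (4, 5)), ((2, 3), (2, 3))]
print(has_nesting_property(rising), has_nesting_property(falling), has_nesting_property(zigzag))
```

Both the rising and the falling arrangement satisfy the nesting property, which illustrates that the notion covers monotonic structures in either direction, while the zigzag series does not.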

3.3 Similarity and Monotonicity Measures

Let $U = \{\omega_1, \omega_2, \ldots, \omega_K\}$ be the set of objects described under the feature set $F_0$. For an object $\omega_k \in U$, let $n(\omega_k \mid F_1)$ and $n(\omega_k \mid F_2)$ be the neighborhood sets for feature subsets $F_1 \subseteq F_0$ and $F_2 \subseteq F_0$, respectively. Then the similarity between $F_1$ and $F_2$ with respect to object $\omega_k$ is defined by

$$S(F_1, F_2 \mid \omega_k) = \frac{|n(\omega_k \mid F_1) \cap n(\omega_k \mid F_2)|}{|n(\omega_k \mid F_1) \cup n(\omega_k \mid F_2)|}, \tag{18}$$

where $|\cdot|$ denotes the cardinality of a set, and $1 - S(F_1, F_2 \mid \omega_k)$ is called the Marczewski-Steinhaus metric between two neighborhood sets [4]. Then we define the similarity between two feature subsets $F_1$ and $F_2$ over the set of objects $U$ as follows.

Definition 8: Similarity Measure
The similarity between feature subsets $F_1$ and $F_2$ is defined by

$$S(F_1, F_2 \mid U) = \frac{1}{K} \sum_{k=1}^{K} S(F_1, F_2 \mid \omega_k) = \frac{1}{K} \sum_{k=1}^{K} \frac{|n(\omega_k \mid F_1) \cap n(\omega_k \mid F_2)|}{|n(\omega_k \mid F_1) \cup n(\omega_k \mid F_2)|}. \tag{19}$$
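Equation (19) is the average, over all objects, of the ratio between the sizes of the intersection and the union of two neighborhood sets. A minimal sketch, assuming the neighborhood sets under each feature subset are already available as dictionaries keyed by object index (a representation chosen here for illustration), is:

```python
from fractions import Fraction


def similarity(n1, n2):
    """Average over objects of |n1[k] & n2[k]| / |n1[k] | n2[k]|, as in eq. (19).

    n1 and n2 map every object index to its neighborhood set under F1 and F2.
    Neighborhood sets are non-empty, so the union is never empty.
    """
    total = sum(Fraction(len(n1[k] & n2[k]), len(n1[k] | n2[k])) for k in n1)
    return total / len(n1)


# Hypothetical neighborhood sets of three objects under two feature subsets.
n_f1 = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
n_f2 = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2}}
print(similarity(n_f1, n_f2))   # 7/9
print(similarity(n_f1, n_f1))   # 1
```

Exact fractions are used so that results can be compared directly with hand-computed values such as those in the examples of this section.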

This similarity measure satisfies the inequality

$$0 \le S(F_1, F_2 \mid U) \le 1. \tag{20}$$

Suppose that the objects in $U$ compose a monotonic chain $\omega_1, \omega_2, \omega_3, \ldots, \omega_K$ under a feature set $F$. Then the two terminal points $\omega_1$ and $\omega_K$ each have two relative neighbors, and the other objects $\omega_k$, $k = 2, 3, \ldots, K-1$, each have three relative neighbors, counting each object as its own neighbor. Therefore, in total, the $K$ objects have $2 \times 2 + 3 \times (K-2) = 3K - 2$ relative neighbors. Based on this fact, we define a monotonicity measure for a set of objects $U$ under $F$ as follows.

Definition 9: Monotonicity Measure

$$M(U \mid F) = \frac{1}{3K-2} \sum_{k=1}^{K} |n(\omega_k \mid F)|, \tag{21}$$

where $|\cdot|$ denotes the cardinality of a set. This measure satisfies the inequality

$$1 \le M(U \mid F) \le K^2/(3K-2). \tag{22}$$

The minimum value is achieved when all objects in $U$ compose a complete monotonic chain, while the maximum value is achieved when all objects in $U$ have the same $K$ relative neighbors.

Example 2: In Figure 1, we have the result: $S(F_1, F_2 \mid U) = 8/21 \approx 0.381$, $M(U \mid F) = 27/19 \approx 1.421$, and $K^2/(3K-2) = 49/19 \approx 2.579$. On the other hand, for the nominal feature $F$ in Example 1, we have $M(U \mid F) = 15/13 \approx 1.154$ and $K^2/(3K-2) = 25/13 \approx 1.923$.
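The bounds in (22) and the upper-bound values quoted in Example 2 can be reproduced with a few lines of Python. The sketch below assumes that the monotonicity measure is the total size of all neighborhood sets divided by $3K-2$, which is the form consistent with the counting argument and with the bounds in (22). The function names and the dictionary representation of neighborhood sets are illustrative.

```python
from fractions import Fraction


def monotonicity(n):
    """Total neighborhood size divided by 3K - 2 (assumed form of the measure)."""
    K = len(n)
    return Fraction(sum(len(s) for s in n.values()), 3 * K - 2)


def upper_bound(K):
    """Value K^2 / (3K - 2), reached when every object has all K objects as neighbors."""
    return Fraction(K * K, 3 * K - 2)


# Hypothetical neighborhood sets: a complete chain of four objects ...
chain = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {2, 3}}
# ... and the opposite extreme, where every object neighbors every other.
full = {k: {0, 1, 2, 3} for k in range(4)}

print(monotonicity(chain))             # 1, the lower bound
print(monotonicity(full))              # 8/5, equal to upper_bound(4)
print(upper_bound(7), upper_bound(5))  # 49/19 and 25/13, as in Example 2
```

Values close to 1 therefore indicate that the neighborhood structure is close to that of a single monotonic chain.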


By using various hierarchical clustering methods based on our similarity measure, we can detect monotonic structures embedded in mixed feature type multidimensional data. Then, for the clustered feature subsets, it is possible to realize a PCA-like analysis. As an indirect approach, our similarity measure may be used to construct a similarity matrix (correlation matrix) for a traditional PCA.

4. CONCLUDING REMARKS

We presented the notions of the chain connected covering and the monotonic chain structure. Then we defined a similarity measure between feature sets. Our similarity measure is applicable to mixed feature type multidimensional data. Moreover, we defined a simple monotonicity measure. These measures may be useful tools in the generalization of classical PCA.

REFERENCES

[1] Bock, H.-H. and Diday, E. (eds.): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Series: Studies in Classification, Data Analysis, and Knowledge Organization, 15. Springer-Verlag, Berlin (2000)
[2] Chouakria, A., Diday, E. and Cazes, P.: An Improved Factorial Representation of Symbolic Objects. KESDA’98, April, Luxembourg (1998)
[3] Lauro, C.N. and Palumbo, F.: Principal Component Analysis of Interval Data: a Symbolic Data Analysis Approach. Computational Statistics, Vol.15 n.1 (2000), pp.73-87
[4] Lin, T.Y. and Cercone, N.: Rough Sets and Data Mining. Kluwer Academic Publishers (1997)
[5] Ichino, M.: General Metrics for Mixed Features - The Cartesian Space Theory for Pattern Recognition. Proc. IEEE Int. Conf. on Syst. Man, Cybern., Beijing and Shenyang, China (1988), pp.494-497
[6] Ichino, M. and Yaguchi, H.: Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans. on Syst. Man, Cybern., Vol.24 n.4 (1994), pp.698-708
[7] Ichino, M.: Feature Selection for Symbolic Data Classification. In E. Diday, et al. (Eds.), New Approaches in Classification and Data Analysis, Springer-Verlag, Berlin (1994)
[8] Ichino, M. and Yaguchi, H.: Symbolic Pattern Classifiers Based on the Cartesian System Model. In C. Hayashi, et al. (Eds.), Data Science, Classification, and Related Methods, Springer-Verlag, Tokyo (1998)
