Efficient Learning of Unlabeled Term Trees with Contractible Variables from Positive Data
Yusuke Suzuki¹, Takayoshi Shoudai¹, Satoshi Matsumoto², and Tomoyuki Uchida³
¹ Department of Informatics, Kyushu University, Kasuga 816-8580, Japan {y-suzuki,shoudai}@i.kyushu-u.ac.jp
² Department of Mathematical Sciences, Tokai University, Hiratsuka 259-1292, Japan [email protected]
³ Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan [email protected]
Abstract. In order to represent structural features common to tree structured data, we propose an unlabeled term tree, which is a rooted tree pattern consisting of an unlabeled ordered tree structure and labeled variables. A variable is a labeled hyperedge which can be replaced with any unlabeled ordered tree of size at least 2. In this paper, we deal with a new kind of variable, called a contractible variable, which is an erasing variable adjacent to a leaf. A contractible variable can be replaced with any unlabeled ordered tree, including a singleton vertex. Let OTT^c be the set of all unlabeled term trees t such that all the labels attached to the variables of t are mutually distinct. For a term tree t in OTT^c, the term tree language L(t) of t is the set of all unlabeled ordered trees which are obtained from t by replacing all variables with unlabeled ordered trees. First, we give a polynomial time algorithm for deciding whether or not a given term tree in OTT^c matches a given unlabeled ordered tree. Next, for a term tree t in OTT^c, we define the canonical term tree c(t) of t in OTT^c, which satisfies L(c(t)) = L(t). Then, for two term trees t and t′ in OTT^c, we show that if L(t) = L(t′) then c(t) is isomorphic to c(t′). Using this fact, we give a polynomial time algorithm for finding a minimally generalized term tree in OTT^c which explains all given data. Finally, we conclude that the class OTT^c is polynomial time inductively inferable from positive data.
1 Introduction
A term tree is a rooted tree pattern which consists of tree structures, ordered children and internal structured variables. A variable in a term tree is a list of vertices and it can be replaced with an arbitrary tree. The term tree language L(t) of a term tree t, which is regarded as the representational power of t, is the set of all ordered trees which are obtained from t by replacing all variables in t with arbitrary ordered trees. The subtrees which are obtained from t by removing the variables in t represent the common subtree structures among
[Figure 1: panels labeled T1, T2, T3; t1, t2, t3; g1, g2, g3. Variable labels shown: x, y, z.]
Fig. 1. An uncontractible (resp. contractible) variable is represented by a single (resp. double) lined box with lines to its elements. The label inside a box is the variable label of the variable.
the trees in L(t). A term tree is suited for representing structural features in tree structured data such as HTML/XML files, which are represented by rooted trees with ordered children and edge labels [1]. Hence, in data mining from tree structured data, a term tree is used as a knowledge representation. Since SGML/XML lets users freely define tree structures by using arbitrary strings as tags, extracting meaningful structural features from such data is often difficult. In this case, we need to extract meaningful structural features by ignoring edge labels (e.g., tags) in tree structures and using only structural information. Based on this motivation, we consider the polynomial time learnability of term trees without any vertex or edge labels in the inductive inference model.
A term tree t is said to be unlabeled if t has neither vertex labels nor edge labels, and said to be regular if all variable labels in t are mutually distinct. In this paper, we deal with a new kind of variable, called a contractible variable, which is an erasing variable adjacent to a leaf. A contractible variable can be replaced with any unlabeled ordered tree, including a singleton vertex. A usual variable, called an uncontractible variable, does not match a singleton vertex. Let OTT^c be the set of all unlabeled regular term trees with contractible variables. In this paper, we show that the class OTT^c is polynomial time inductively inferable from positive data.
First, we give a polynomial time algorithm for deciding whether or not a given term tree in OTT^c matches a given unlabeled ordered tree. For any term tree t, we denote by s(t) the ordered unlabeled tree obtained from t by replacing all uncontractible variables of t with single edges and all contractible variables of t with singleton vertices. We write t ≡ t′ if t is isomorphic to t′. Second, we give a total of 52 pairs (g_ℓ, g_r) of term trees g_ℓ and g_r such that g_ℓ ≢ g_r, s(g_ℓ) ≡ s(g_r), and L(g_ℓ) = L(g_r).
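The definitions of regularity and of s(t) above can be illustrated with a minimal sketch. The representation is an assumption made only for this example: each vertex stores how it is attached to its parent (plain edge, uncontractible variable, or contractible variable), and the names `Node`, `is_regular` and `skeleton` are hypothetical rather than taken from the paper. Replacing a contractible variable with a singleton vertex is read here as merging its leaf child port into the parent port, so the leaf disappears from s(t).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tags for how a vertex is attached to its parent.
EDGE, VAR, CVAR = "edge", "var", "cvar"  # edge, uncontractible, contractible


@dataclass
class Node:
    kind: str = EDGE              # attachment to the parent (ignored for the root)
    label: Optional[str] = None   # variable label; None for plain edges
    children: List["Node"] = field(default_factory=list)  # ordered children


def variable_labels(t: Node) -> List[str]:
    """Collect the labels of all variables in t, in left-to-right preorder."""
    labels: List[str] = []
    for child in t.children:
        if child.kind in (VAR, CVAR):
            labels.append(child.label)
        labels.extend(variable_labels(child))
    return labels


def is_regular(t: Node) -> bool:
    """A term tree is regular if all its variable labels are mutually distinct."""
    labels = variable_labels(t)
    return len(labels) == len(set(labels))


def skeleton(t: Node) -> tuple:
    """Return s(t) as a nested tuple: a vertex is the tuple of its children."""
    kept = []
    for child in t.children:
        if child.kind == CVAR:
            # Assumption from the text: a contractible variable is adjacent
            # to a leaf, so its child port has no children of its own.
            if child.children:
                raise ValueError("child port of a contractible variable must be a leaf")
            continue  # singleton-vertex replacement: the leaf merges into its parent
        # Plain edges and uncontractible variables both become single edges.
        kept.append(skeleton(child))
    return tuple(kept)


# Example term tree: an edge whose lower vertex has an uncontractible variable x
# and a contractible variable y, plus a contractible variable z at the root.
t = Node(children=[
    Node(EDGE, None, [Node(VAR, "x"), Node(CVAR, "y")]),
    Node(CVAR, "z"),
])
s_t = skeleton(t)
```

Nested tuples are used for s(t) so that two ordered unlabeled trees are isomorphic exactly when their tuples are equal, which keeps the comparison s(g_ℓ) ≡ s(g_r) a plain equality test in this sketch.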
We say that a term tree t in OTT^c is a canonical term tree if no term subtree of t is isomorphic to the second term tree g_r of any of the 52 pairs. For any term tree t in OTT^c, there exists a canonical term tree in OTT^c, denoted by c(t), which satisfies s(c(t)) ≡ s(t) and L(c(t)) = L(t). Then we show that for any term trees t and t′ in OTT^c, if L(t) = L(t′) then c(t) ≡ c(t′). Using this fact, we give a polynomial time algorithm for finding a minimally generalized term tree in OTT^c which explains all given data. From this algorithm and Angluin's theorem [3], we show that the class OTT^c is polynomial time inductively inferable from positive data.
For example, the term tree t3 in Fig. 1 is a minimally generalized term tree in OTT^c which explains T1, T2 and T3. The term tree t2 is also minimally generalized among all unlabeled regular term trees with no contractible variable which explain T1, T2 and T3. On the other hand, t1 is overgeneralized and meaningless, since t1 explains any tree of size at least 2.
In analyzing tree structured data, knowledge (or patterns) that is sensitive to slight differences among such data is often meaningless. For example, patterns extracted from HTML/XML files are affected by attributes of tags, which can be regarded as noise. In fact, a term tree with only uncontractible variables is very sensitive to such noise. By introducing contractible variables, we can find term trees that are robust to such noise. For this reason, we consider that in Fig. 1, t3 is a more valuable term tree than t2.
A term tree is different from other representations of ordered tree structured patterns in [2, 4, 13] in that an ordered term tree has structured variables which can be substituted by arbitrary ordered trees. In [10, 12], we showed that some fundamental classes of regular ordered term tree languages with no contractible variable are polynomial time inductively inferable from positive data.
In [5, 8, 9], we showed that some classes of regular unordered term tree languages are polynomial time inductively inferable from positive data. Moreover, we showed in [6] that some classes of regular ordered term tree languages with no contractible variable are exactly learnable in polynomial time using queries. In [7], we gave a data mining method from semistructured data using ordered term trees.
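The route from a minimal generalization algorithm to inductive inference from positive data, mentioned in the introduction, can be sketched as a generic learner loop: after each positive example, the learner outputs a minimally generalized hypothesis for everything seen so far. This is only an illustrative sketch under stated assumptions. The function `learn_from_positive_data` and its `minl` parameter are hypothetical names, and `interval_minl` is a toy stand-in over integer intervals, not the paper's algorithm for OTT^c.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

X = TypeVar("X")  # type of positive examples
H = TypeVar("H")  # type of hypotheses


def learn_from_positive_data(
    stream: Iterable[X],
    minl: Callable[[List[X]], H],
) -> Iterator[H]:
    """Yield one hypothesis per example: minl applied to all examples seen so far."""
    seen: List[X] = []
    for example in stream:
        seen.append(example)
        yield minl(seen)


def interval_minl(sample: List[int]) -> Tuple[int, int]:
    """Toy stand-in (assumption): the smallest integer interval covering the sample."""
    return (min(sample), max(sample))


# Toy run: the hypotheses stabilize once the extreme points have been seen.
hypotheses = list(learn_from_positive_data([3, 5, 4, 1, 2], interval_minl))
```

In the setting of this paper, `minl` would be replaced by a procedure that returns a minimally generalized term tree in OTT^c explaining the given trees; the loop itself is unchanged.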
2 Term Trees with Contractible Variables
Let T = (V_T, E_T) be a rooted tree with ordered children (or simply a tree) which has a set V_T of vertices and a set E_T of edges. Let E_g and H_g be a partition of E_T, i.e., E_g ∪ H_g = E_T and E_g ∩ H_g = ∅, and let V_g = V_T. A triplet g = (V_g, E_g, H_g) is called a term tree, and elements in V_g, E_g and H_g are called vertices, edges and variables, respectively. We assume that edges and vertices have no label. A label of a variable is called a variable label. X denotes a set of variable labels. For a term tree g and its vertices v_1 and v_i, a path from v_1 to v_i is a sequence v_1, v_2, ..., v_i of distinct vertices of g such that for any j with 1 ≤ j < i, there exists an edge or a variable which consists of v_j and v_{j+1}. If there is an edge or a variable which consists of v and v′ such that v lies on the path from the root to v′, then v is said to be the parent of v′ and v′ is a child of v. We use the notation [v, v′] to represent a variable {v, v′} ∈ H_g such that v is the parent of v′. Then we call v the parent port of [v, v′] and v′ the child port of [v, v′]. For a term tree g, the children of every internal vertex u in g are totally ordered. The ordering on the children of u is denoted by