OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites
Hasan Davulcu, Srinivas Vadrevu, and Saravanakumar Nagarajan
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85287, USA
{hdavulcu, svadrevu, nrsaravana}@asu.edu
Abstract. RDF/XML has been widely recognized as the standard for annotating online Web documents and for transforming the HTML Web into the so-called Semantic Web. In order to enable widespread usability for the Semantic Web, there is a need to bootstrap large, rich and up-to-date domain ontologies that organize the most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialized domain ontologies by organizing and mining a set of relevant Web sites provided by the user. We develop algorithms that detect and utilize HTML regularities in the Web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. Experimental evaluation for the News and Hotels domains indicates that our algorithms can bootstrap and populate domain-specific ontologies with high precision and recall.
1 Introduction
RDF and XML have been widely recognized as the standard for annotating online Web documents and for transforming the HTML Web into the so-called Semantic Web. Several researchers have recently questioned whether participation in the Semantic Web is too difficult for “ordinary” people [1–3]. In order to enable widespread usability for the Semantic Web, there is a need to bootstrap large, rich and up-to-date domain ontologies that organize the most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialized domain ontologies by organizing and mining a set of relevant Web sites provided by the user. As an example application, a user of OntoMiner can use the system to rapidly bootstrap an ontology populated with instances and then tidy up the bootstrapped ontology to create a rich set of labeled examples that can be utilized by supervised machine learning systems such as WebKB [4]. The user of the OntoMiner system needs only to provide the URLs of the home pages of 10 to 15 domain-specific Web sites that characterize her
domain of interest. Next, the OntoMiner system detects and utilizes the HTML regularities in Web documents and turns them into hierarchical semantic structures encoded as XML by utilizing a hierarchical partition algorithm. We present tree-mining algorithms that identify the most important key domain concepts selected from within the directories of the home pages. OntoMiner proceeds by expanding the mined concept taxonomy with sub-concepts, selectively crawling through the links corresponding to key concepts. OntoMiner also has algorithms that can identify the logical regions within Web documents that contain links to instance pages. OntoMiner can accurately separate the “human-oriented decoration”, such as navigational panels and advertisement bars, from real data instances, and it utilizes the inferred hierarchical partition corresponding to instance pages to accurately collect the semi-structured concept instances. A key characteristic of OntoMiner is that, unlike the systems described in [5, 6], it does not make any assumptions about the usage patterns of the HTML tags within the Web pages. Also, OntoMiner can separate the data instances from the data labels within the vicinity of extracted data, and it attempts to accurately annotate the extracted data by using the labels whenever they are available. We do not provide algorithms for extracting and labeling data from within HTML tables since there are existing solutions for detecting and wrapping these structures [7, 8]. Other related work includes schema learning [9–11] for semi-structured data and techniques for finding frequent substructures in hierarchical semi-structured data [12, 13], which can be utilized to train structure-based classifiers that help merge and map between similar concepts of the bootstrapped ontologies and better integrate their instances. The rest of the paper is organized as follows. Section 2 outlines the hierarchical partitioning, Section 3 discusses taxonomy mining, and Section 4 describes instance mining. Experimental evaluation for the News and Hotels domains indicates that our algorithms can bootstrap and populate domain-specific ontologies with high precision and recall.
2 Semantic Partitioning
2.1 Flat Partitioner
The Flat Partitioner detects the various logical partitions of a Web page. For example, for the home page of http://www.nytimes.com, the logical partitions are marked in boxes B1 through B5 in Figure 1. The boxes in the snapshot of the Web page in Figure 1 correspond to the dotted lines shown in the tree view of the Web page in Figure 1. The Flat Partitioner Algorithm takes an ordered DOM tree of the Web page as input and finds the flat partitions in it. Intuitively, it groups contiguous similar structures in the Web page into partitions by detecting a high concentration of neighboring repeated nodes with similar root-to-leaf tag-paths. First, the partition boundary is initialized to be the first leaf node in the DOM tree. Next, any two leaf nodes in the tree are linked together with a "similarity link"
Fig. 1. Snapshot of New York Times Home Page and Parse Tree View of the Home Page
if they share the same path from the root of the tree and all the leaf nodes in between have different paths. Then the ratio of the number of "similarity links" that cross the current candidate boundary to the total number of "similarity links" inside the current partition is calculated. If this ratio is less than a threshold δ, the current node is marked as the partition boundary. Otherwise, the current node is added to the current partition and the next node is considered as the candidate boundary. The process terminates when the last element in the list of leaf nodes is reached.

A Path Index Tree (PIT) is built from the DOM tree of the Web page; it helps to determine all the "similarity links" between the leaf nodes within a single traversal. The PIT is a trie-based data structure made up of all unique root-to-leaf tag-paths, and in its leaf nodes the PIT stores the "similarity links" between the leaf nodes of the DOM tree.

The tree view in Figure 1 illustrates the Flat Partitioning Algorithm. The arrows in the tree view denote the "similarity links" between the leaf nodes. Let us assume the threshold δ is set to 60%. Then, when the current node is "Job Market", the total number of outgoing unique "similarity links" (out in line 9) is 1 and the total number of unique "similarity links" (total in line 10) is 1. Hence the ratio of out to total is 100%, which is greater than the threshold, and current in line 6 becomes the next leaf node. At node "International", out becomes 1 and total is also 1, so the ratio is still greater than the threshold. When current reaches "Community Affairs", out becomes 0 whereas total is 1, and hence the ratio is less than the threshold δ. Now "Community Affairs" (B2 in Figure 1) is added to the set of partition boundaries in line 12 and all the "similarity links" are removed from the partition nodes in line 13.
The same boundary detection condition is satisfied once again when the algorithm reaches "6.22 PM ET", where out becomes 1 and total is 3, giving a ratio of about 33%, which is below δ. Hence "6.22 PM ET" (B3 in Figure 1) is added to the partition boundaries.
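The boundary test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each leaf is represented only by its root-to-leaf tag-path (a tuple of tag names), a plain dictionary keyed on the full path stands in for the trie-based PIT, and the names `next_similar` and `flat_partition` are hypothetical. The case where no "similarity links" have been collected yet (total = 0) is not covered by the description, so the sketch simply skips the test there.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Path = Tuple[str, ...]  # root-to-leaf tag-path of one leaf node


def next_similar(paths: Sequence[Path]) -> List[Optional[int]]:
    """For each leaf, return the index of the next leaf with the same tag-path.

    A dict keyed on the whole path replaces the trie-based PIT (assumption);
    like the PIT, it yields all similarity links in a single traversal.
    """
    last_seen: Dict[Path, int] = {}
    links: List[Optional[int]] = [None] * len(paths)
    for i, path in enumerate(paths):
        if path in last_seen:
            links[last_seen[path]] = i  # link previous occurrence to this one
        last_seen[path] = i
    return links


def flat_partition(paths: Sequence[Path], delta: float = 0.6) -> List[int]:
    """Return the indices of the leaves marked as flat partition boundaries."""
    links = next_similar(paths)
    partition_nodes: Set[int] = set()  # targets of links seen in this partition
    boundaries: List[int] = []
    for current in range(len(paths)):
        target = links[current]
        if target is not None:
            partition_nodes.add(target)
        # Unique paths among all collected link targets.
        total = len({paths[m] for m in partition_nodes})
        if total == 0:
            continue  # assumption: no links collected yet, so no ratio to test
        # Unique paths among link targets that lie beyond the current leaf.
        out = len({paths[m] for m in partition_nodes if m > current})
        if out / total < delta:
            boundaries.append(current)
            partition_nodes.clear()
    return boundaries


if __name__ == "__main__":
    link = ("html", "body", "table", "tr", "td", "a")
    text = ("html", "body", "table", "tr", "td", "span")
    # Three repeated link leaves followed by three repeated text leaves.
    print(flat_partition([link, link, link, text, text, text]))
```

In this toy input the last leaf of each repeated run is reported as a boundary, which mirrors the "Community Affairs" step of the worked example: once no "similarity link" crosses the current leaf, out drops to 0 and the ratio falls below δ.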
Algorithm 1 Flat Partition Algorithm
Flat Partitioner
Input: T: DOM Tree
Output: <b1, b2, ..., bk>: Flat Boundaries
 1: PIT := PathIndexTree(T)
 2: current := first leaf node of T
 3: Partition_Nodes := φ
 4: Partition_Boundaries := φ
 5: for each lNode in Leaf_Nodes(T) do
 6:   current := lNode → next
 7:   if N = PIT.next_similar(current) exists then
 8:     Partition_Nodes := Partition_Nodes ∪ N
 9:   end if
10:   out = |{path(m) | m ∈ Partition_Nodes and m > current}|
11:   total = |{path(m) | m ∈ Partition_Nodes}|
12:   if out/total < δ then
13:     Partition_Boundaries := Partition_Boundaries ∪ current
14:     Partition_Nodes := φ
15:   end if
16: end for
17: Return Partition_Boundaries

2.2 Hierarchical Partitioning
The Hierarchical Partitioner infers the hierarchical relationships among the leaf nodes of the HTML parse tree, where all the page content is stored. The Hierarchical Partitioner achieves this through a sequence of three operations: Binary Semantic Partitioning, Grouping and Promotion.
Binary Semantic Partitioning
The Binary Semantic Partitioning of the Web page relies on a dynamic programming algorithm which employs the following cost function. The dynamic programming algorithm determines the nodes that need to be grouped together by finding the grouping with the minimal cost. The cost for grouping any two nodes in the HTML parse tree is recursively defined as follows.
– Cost(Li, Lj) = 0, if i = j
– Cost(Li , Lj ) = mini≤k