

Toward integration travel information data using information extraction and instance matching

Feng Shi, Juanzi Li, and Jianqiang Hu
Computer Science Department, Tsinghua University, China

Abstract

In this paper, we introduce a method for integrating travel information data embedded in web pages using information extraction and instance matching. Furthermore, we extend the concept of instance matching to find implicit relationships between instances extracted from different sources, in order to improve the integration result. We extracted more than 145,000 travel data items covering sights, routes, agents, hotels, restaurants and tickets from several different sources, and integrated them into a single travel data set with comprehensive information.

1 Overview

Many web sites provide travel information, including information about sights, agents, routes, hotels, restaurants, tickets and so on, but almost none of them offers comprehensive travel information on its own. A tourist who wants to gather enough information to plan a trip has to switch back and forth among many web sites. If the tourism data from different sources can be integrated into a whole, the consolidated semantic data can make queries more convenient, and the rich semantic information behind the integrated data can also make queries more accurate and make data sharing and reuse easier.

Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data [1]. However, integrating tourism data is not easy. The data embedded in web pages is mostly unstructured or semi-structured, so the first question is how to extract it from the pages. Moreover, data from different sources often reflect different perspectives, so they may overlap with each other, or they may have no apparent relation at the literal level.
So simply merging the data cannot work for integration. For example, “Spring City” and “Kunming” refer to the same city, but they have totally different labels. Likewise, the travel route “One-day tourism in Beijing” contains the sight “The Great Wall”, yet these two instances share no property.

To solve these issues, we adopt the Semantic Web [2] technologies of information extraction and instance matching [3]. Information extraction obtains useful data from web pages by semantic annotation based on the characteristics of travel data, and instance matching resolves the semantic conflicts among heterogeneous data sources and finds the implicit relations between instances extracted from different sources.

2 Defining the Travel Ontology Schemas

Our travel ontology covers the following areas: travel route, sight, agent, hotel, restaurant, airline ticket and train ticket. Figure 1 shows the rough structure of our travel ontology schemas. Orange nodes represent primary concepts, and red nodes represent literal properties shared by two or more concepts. These shared literal nodes help us find the relations between instances of different concepts.
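The role of shared literal properties can be illustrated with a small Python sketch. The two concept classes and all of their property names (`name`, `region`, `sight_type`, `description`) are hypothetical placeholders; the actual properties are those defined in Figure 1.

```python
from dataclasses import dataclass, fields

# Hypothetical concept classes; the property names are placeholders,
# not the properties of the actual travel ontology.
@dataclass
class Sight:
    name: str
    region: str = ""
    sight_type: str = ""

@dataclass
class Route:
    name: str
    region: str = ""
    description: str = ""

def shared_literal_properties(concept_a, concept_b):
    """Return the names of literal properties declared by both concepts."""
    names_a = {f.name for f in fields(concept_a)}
    names_b = {f.name for f in fields(concept_b)}
    return sorted(names_a & names_b)

print(shared_literal_properties(Sight, Route))
```

Properties found this way are the natural candidates for comparing instances of two different concepts.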

FIG. 1 The travel ontology schema definition

3 Semantic Annotation and Instance Matching

The data integration process consists of three main steps: first, extract the information embedded in web pages or stored in relational databases according to the travel ontology schemas; second, clean the extracted data to eliminate errors, and normalize it to unify different data forms; third, integrate the data by instance matching and transform it into semantic data.

3.1 Extract Information Embedded in Web Pages

Through analysis of many travel web sites, we find that nearly all the web pages are built from templates, so most of the information is embedded in regularly structured objects. A simple but effective approach to extracting data from such pages is a template-like method based on rules. Figure 2 shows some web pages from travel information sites.

FIG. 2 Information extraction from web pages

Compared with the rule-based approach, machine learning approaches such as SVM and CRF can be applied more widely. However, the precision of these approaches is not high enough.
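The rule-based, template-like extraction described in this section can be sketched as follows. This is a minimal illustration, assuming a page template in which each field sits in an element with a fixed class name; the rule table `SIGHT_RULES`, the class names, and the sample page are all hypothetical and do not come from any particular travel site.

```python
import re
from html import unescape

# Hypothetical rules for one site template: field name -> regex with one group.
SIGHT_RULES = {
    "name": re.compile(r'<h1 class="sight-name">(.*?)</h1>', re.S),
    "region": re.compile(r'<span class="region">(.*?)</span>', re.S),
    "ticket_price": re.compile(r'<td class="price">(.*?)</td>', re.S),
}

def extract(html, rules):
    """Apply every rule to the page; fields whose rule does not match are omitted."""
    record = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        if match:
            # Decode HTML entities and collapse runs of whitespace.
            record[field] = " ".join(unescape(match.group(1)).split())
    return record

# Made-up sample page following the assumed template.
page = """
<html><body>
  <h1 class="sight-name">Old Town of Lijiang</h1>
  <span class="region">Lijiang, Yunnan</span>
</body></html>
"""
print(extract(page, SIGHT_RULES))
```

Each site template needs its own rule table, which is the price paid for the higher precision of rules on regularly structured pages.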

3.2 Integrating the Travel Data with Instance Matching

By extracting information from the web, we obtain enough data for the integration. However, the data come from different sources, so simply merging them cannot work. The main problem of data integration is the conflict and overlap between heterogeneous data sources [4]. Two instances that refer to the same real-world entity may have totally different labels in different sources, and the same label may have different meanings in different sources. To resolve this problem, we use instance matching [5], with which we can not only remove redundancy from the data, but also find the relations between instances of different concepts. Through instance matching we can thus integrate instances extracted from different sources into a whole. Figure 3 shows an example of data integration with instance matching.

FIG. 3 Data integration with instance matching

Instance matching finds the relations between instances by computing the degree of similarity between them. We use different similarity computation methods in different situations.

3.2.1. Similarity Computation

Since most properties of the ontology elements are expressed as strings, computing the similarity degree between strings is an important way to find the relations between instances from different sources. There are many methods for computing the similarity degree between strings, such as string edit distance, the n-gram similarity algorithm, cosine similarity based on TF-IDF and so on. We extend these methods, and combine different similarity computation methods in different situations.

1. If both strings are short and their lengths are close, we directly use a method such as string edit distance, which is defined as the minimum cost of transforming one string into another by insertions, deletions, or substitutions.
2. If one string is much longer than the other, we use a method such as the full match algorithm, which is defined as the max length of continuous words the two strings share. For example, for “Old Town of Lijiang” and “Lijiang is a famous old town located in Lijiang City, Yunnan, China”, the max length is three: old, town, Lijiang.

3. If both strings are very long, we directly use a method such as cosine similarity based on TF-IDF: every word of each string is given a weight according to its TF and IDF, the words of each string form a vector as in the vector space model (VSM), and the cosine of the two vectors is taken as the similarity degree of the two long strings.

3.2.2. Relations Finding between Instances

We define two types of relations between two instances from different data sources:

1. They may refer to the same real-world entity;
2. There is some object property linking them.

We can use the ontology features to extend methods that only compute the similarity between strings, by making use of the properties of instances to compute the instance similarity. We choose the data properties used to compute the similarity, give each property a weight according to its importance, and compute the overall similarity.

The first type of relation can be found by checking the similarity of each data property of the two instances. If the overall similarity degree reaches a high enough level, the two instances are regarded as matched. For example, we extracted sight information from two web sites. One site has no information about the sight region, while the other has no information about the sight type. We use this method to check whether two instances from the two sites refer to the same sight, and then integrate all the information of the two instances into one.

The second type of relation means that there is an object property between the two instances. Here we first need to determine which data properties should be compared, and which similarity computation method should be chosen for the specific situation. For example, the relation between instances of travel routes and instances of sights belongs to the second type.
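As a rough illustration of Sections 3.2.1 and 3.2.2, the following Python sketch picks a string similarity measure from the string lengths and then combines property similarities with weights. The function names, the length thresholds (`short`, `ratio`), the smoothed IDF formula, the property weights, the sample records, and the 0.9 matching threshold are all illustrative assumptions rather than the settings used in this work. Because the Lijiang example above counts three shared words (old, town, Lijiang), the sketch reads the full match measure as the number of distinct words the two strings share, which is also an assumption.

```python
import math
import re
from collections import Counter

def words(text):
    """Lower-cased word tokens of a string."""
    return re.findall(r"\w+", text.lower())

def edit_similarity(a, b):
    """Rule 1: Levenshtein distance rescaled to [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def shared_word_count(a, b):
    """Rule 2 (assumed reading): number of distinct words the strings share."""
    return len(set(words(a)) & set(words(b)))

def tfidf_cosine(a, b, corpus=()):
    """Rule 3: cosine of TF-IDF vectors; IDF is estimated from `corpus`."""
    docs = [set(words(d)) for d in corpus]

    def vector(text):
        # Smoothed IDF (an assumption) keeps weights positive for unseen words.
        return {w: tf * (math.log((1 + len(docs)) / (1 + sum(w in d for d in docs))) + 1)
                for w, tf in Counter(words(text)).items()}

    va, vb = vector(a), vector(b)
    dot = sum(weight * vb.get(w, 0.0) for w, weight in va.items())
    norm = math.sqrt(sum(x * x for x in va.values()) * sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0

def string_similarity(a, b, corpus=(), short=30, ratio=3):
    """Choose a measure from the string lengths (thresholds are assumptions)."""
    longer, shorter = max(len(a), len(b)), min(len(a), len(b))
    if longer >= ratio * max(shorter, 1):
        short_text = a if len(a) <= len(b) else b
        return shared_word_count(a, b) / max(len(set(words(short_text))), 1)
    if longer <= short:
        return edit_similarity(a, b)
    return tfidf_cosine(a, b, corpus)

def instance_similarity(x, y, weights, corpus=()):
    """Weighted mean of property similarities over properties both instances have."""
    score = total = 0.0
    for prop, weight in weights.items():
        if x.get(prop) and y.get(prop):
            score += weight * string_similarity(x[prop], y[prop], corpus)
            total += weight
    return score / total if total else 0.0

def merge(x, y):
    """Combine two matched instances; values from x win when both have a property."""
    return {**y, **x}

# Illustrative records: one source lacks the region, the other lacks the type.
site_a = {"name": "Old Town of Lijiang", "type": "historic town"}
site_b = {"name": "Old town of Lijiang", "region": "Yunnan"}
weights = {"name": 0.6, "region": 0.2, "type": 0.2}

if instance_similarity(site_a, site_b, weights) >= 0.9:  # assumed threshold
    print(merge(site_a, site_b))
```

In this sketch only properties present in both instances contribute to the overall score, so a missing region or type does not penalize a match; whether missing values should lower the score is a design choice that the text above leaves open.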
We extracted more than 145,000 pieces of common data from several travel web sites, including about 6,000 instances of tourist routes, more than 13,000 instances of sights, about 2,000 instances of agents, more than 1,500 instances of airline tickets, more than 3,300 instances of train tickets, more than 90,000 instances of hotels and more than 30,000 instances of restaurants, and integrated all these instances into a single travel data set with comprehensive information.

References

[1] Maurizio Lenzerini: Data Integration: A Theoretical Perspective. Principles of Database Systems 2002, Page(s): 233-246
[2] Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5) (2001) 34-43
[3] Chao Wang, Jie Lu, Guangquan Zhang: Integration of Ontology Data through Learning Instance Matching. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
[4] Cheng Hian Goh: Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Sources. PhD thesis, MIT, 1997
[5] Yuangui Lei: An instance mapping ontology for the semantic web. In: Proceedings of the 3rd International Conference on Knowledge Capture, 2005