Web Data Integration System: Approach and Case Study Abdolreza Hajmoosaei and Sameem Abdul-Kareem Faculty of Computer Science and Information Technology University of Malaya, PO Box 50603, Kuala lumpur, Malaysia
[email protected],
[email protected]
Abstract. There are a lot of valuable data on the web that organizations or users can use to improve their decision making process. It is therefore, very important and critical that this information be complete, precise and can be acquired on time. Most web sources provide data in semi-structured form on the internet. The extraction and combination of semi-structured data from different sources on the internet often fails because of syntactic and semantic differences. The access, retrieval and utilization of information from the different web data sources imposes a need for the data to be integrated. Integration of web data is a complex process because of the open, dynamic and heterogeneity nature of web data. The solution to this problem is a web data integration system. External information can be extracted from web sources and utilized for users through a web data integration system. In this paper, we first propose an approach and architecture for web data integration system and then develop a prototype of the proposed system for Malaysian universities. Keywords: Web data source, Heterogeneity conflict, Web data integration.
1 Introduction The web is the platform for information publishing; it is the biggest resource of information of any type. There are a lot of valuable data and business data on the web that organizations or users can use to improve their decision making process. It is therefore, very important and critical that this information be complete, precise and can be acquired on time [1]. It is also vital that such external information be systematically managed and utilized for users. The solution to the mentioned problem is a web data integration system. External information can be extracted from web sources and utilized for users through a web data integration system. The access, retrieval and utilization of information from the different data sources imposes a need for the data to be integrated. There are many types of heterogeneity and differences among web sources that make a combined effort to access data from different sources on the internet difficult and error-prone [2] [3]. In web data integration process we need to resolve heterogeneity conflicts between web data sources. There are different views about classification of Heterogeneity conflicts. The heterogeneity conflicts can be classified according to the following abstraction levels [2] [4]: W. Abramowicz and D. Fensel (Eds.): BIS 2008, LNBIP 7, pp. 410–423, 2008. © Springer-Verlag Berlin Heidelberg 2008
Web Data Integration System: Approach and Case Study
411
•
Data Value Conflicts: Data value conflicts are those conflicts that arise at the instance level. They are related to the representation or the interpretation of the data values. Examples of these conflicts are discrepancies of type, unit, precision and allowed values (e.g. "kg" and "gram" or "$" and "dollar"). • Schema Conflicts: Schema conflicts are due to different alternatives provided by one data model to develop schemas for the same reality. For example, what is modeled as an attribute in one relational schema may be modeled as an entity in another relational schema for the same application domain (e.g. "Author" as attribute for the entity "book" and "author" as an entity that has a relationship with "book"). Another example two sources may use different names to represent the same concept (e.g. "price" and "cost") , or the same name to represent different concepts , or two different ways, for conveying the same information(e.g. "data of birth" and "age"). • Data Model Conflicts: Data model conflicts occur when databases use different data models, e.g., one database designed according to the relational model, and another one object-oriented. Conflicts in each level can be categorized into two categories: • Syntactic Conflicts: Syntactic conflicts refer to discrepancies in the representation of data (e.g. "1/23" and "1.23" or "price=23$" and "price: 23$"). • Semantic Conflicts: Semantic conflicts refer to disagreement about the meaning, interpretation use of the same or related data (e.g. "staff" and "employee"). The major aim of our work is to give a solution for resolving heterogeneity conflicts mentioned above in a web data integration process. For this purpose we first recommend an approach and architecture for web data integration system and subsequently develop a prototype of our web data integration system in domain of Malaysian universities.
2 System Approach and Architecture There are approaches for web data integration that have been proposed by researchers. Some of the major researches and approaches in web data integration domain are SIMS [5], COIN [6], MOMIS [7], KRAFT [8], OBSERVER [9]. For reconciliation of semantic conflicts between heterogeneous data sources, the above mentioned projects create one global or shared ontology by integrating or merging local schemas or ontologies. Subsequently, they perform semantic mapping between created global ontology and the local schemas or ontologies. In the web context the maintenance and updating of global or shared ontology is very time consuming and costly because many web data sources are involved and the number of involved web data sources change frequently; web designers and users are free to use their own terms and vocabulary and schemata which are subject to frequent changes. In our proposed approach we try to overcome this problem by using domain specific ontology and we resolve semantic problems between web sources through semantic mapping between the domain ontology and the local ontologies.
412
A. Hajmoosaei and S. Abdul-Kareem
User
Query Construction module
Graphic User Interface
Domain ontology
Query creation
Mapping module Repository
Web ontology server Local ontology selection
Transformer
Domain Dictionary
Ontology and data source catalogue
Local ontology
Mapper
Query process module Reformulation of query
Optimization of query plan
Data Extraction Module Local ontology
Translator
Web data source
Wrapper
Translator Wrapper
Local ontology Web data source
Data Correlation Module Execution query plan
Exporter
Converter Conversion functions
User
Fig. 1. Web Data Integration System
Our proposed system (figure1) uses domain specific ontologies for the creation of user queries. There is a domain specific ontology for each application that covers the semantic definition of terms which are required for user queries in a particular application domain. The domain ontologies are modeled in a uniform representation model. The user can browse the domain ontology and choose terms for his/her query; afterwards the system creates the user query. We assume each web source has an underlying pre-existing local ontology on the web and each local ontology is associated with one or more web sources. After the creation of the user query, the web ontology server chooses local ontologies related and relevant to user query domain and sends them to the mapping module. The first, local ontology is transformed to the system uniform representation model by transforming and mapping to corresponding
Web Data Integration System: Approach and Case Study
413
terms from the local ontology and thence subsequently rewritten using the terms from the local ontology. This system uses ontologies to resolve semantic schema conflicts between web data sources. The proposed system resolves semantic schema heterogeneities between web source and user query through semantic mapping between the domain ontology and local ontology. The rewritten user query is sent to a query process module for reformulation and creation of optimized query plan from the user query. The gained sub queries from reformulation process are translated to the web sources query languages by translators and then the answer of sub-queries are extracted from related web sources by wrappers. Each wrapper knows the structure of the underlying web source and extracts the related data to its sub-query and presents the extracted data in a common format. Through wrappers, the system resolves data model heterogeneity conflicts between web data sources. The extracted data obtained through wrappers are sent to the converter. The converter resolves data value heterogeneity conflicts between data values. The converter exploits conversion functions and rules for resolving heterogeneities between two data value (such as units, data types, value format conversion functions). Finally the query plan operations are executed and final answers are exported to user preference format by the exporter. Our proposed web data integration system covers all abstraction levels of data heterogeneity conflicts between web data sources. The system applies: • • •
ontology as a solution for resolving schema heterogeneities; wrapper as solution for resolving data model heterogeneities; converter as solution for resolving data value heterogeneities;
The proposed web data integration system is scalable to any domain by adding related domain ontology to system. That means for using of the system in any application area we must develop and add a domain ontology (relevant to application domain area) to system. Our system implements a query based approach to information extraction and integration, from heterogeneous and distributed web data sources.The extraction and integration process in the proposed system consists of eleven major tasks as follows: 1. Creation of user query; 2. Determination of related local ontologies and their underlying web sources with query domain; 3. Transformation of related local ontologies to internal uniform representation model; 4. Semantic mapping between query terms and related local ontologies terms; 5. Rewriting of user query with corresponding terms from local ontologies; 6. Reformulation of query and creation of optimized query plan; 7. Translation of sub queries to web sources query languages; 8. Extraction of data; 9. Conversion of data values; 10. Execution of query plan operations; 11. Exporting of data;
414
A. Hajmoosaei and S. Abdul-Kareem
3 Prototype of System In the rest of the paper we discuss a prototype of our proposed web data integration system in a specific domain. We develop a system for universities in Malaysia so that users can access their required information relating to Malaysian universities. For example students can pose queries about programs, admission dates, fees or lecturers can access information about new publications, events and so on, in Malaysia universities. 3.1 Building University Ontology Before the creation of the domain ontology we need to define one uniform representation model for the ontologies in our system. In order to compare and find similar terms between the domain and local ontology, the system needs to represent all ontologies in a uniform model. In our system we propose one uniform representation model for ontologies. This representation model is general and any ontology with any representation model can be transformed to this uniform representation model. Definition1: T:=(C,A,R,V), each ontology element (term) is one of following entities: • • • •
C: concept or instance of one concept A: attribute of one concept R: relationship between concepts V: value range of one relationship
For example student (concept), age (attribute), master student (instance of student, it is considered as sub-concept of student in our model), attend (relationship between student and class) and “” and one string that show the value of its range.
Web Data Integration System: Approach and Case Study
415
Definition 6: O:=(G, G’), each ontology is represented by two graphs. Definition 7: G:=(N,E), N=, E=, G is acyclic directed rooted graph that consists of nodes and edges. Each node is a concept (or instance of a concept). Each edge is “is-a” relation that shows sub-concept (subclass) relation between nodes. Indeed, G is a hierarchy concept model of ontology. Each node has one father and may have no, one or more child nodes. If one node has two fathers, the model resolves this problem with repeating child node for each one of its fathers. Definition 8: G’:=(N,E’), N=, E’=, G’ is cyclic graph that consists nodes and edges. Each node is a concept (or instance of a concept) or one value. Each edge is relationship between two nodes that show the relationship between concepts. Indeed, G’ is a concept relationship model of ontology. The synonym sets are specified only for domain ontology terms in during the creation of the domain ontology. In our uniform representation model, all elements (concepts, attributes, relationships and values) are string (chain of characters). Our representation model of ontology is very general, so that our proposed approach which uses this formalization will work with any ontology representation languages. We need to transform the local ontology to the uniform representation model (in mapping module of system). We build computer college domain ontology based on our uniform representation model. We follow the methodology below for building the domain ontology. Phase 1: Domain and Scope clarification. Scope of our domain ontology is the current existing universities in Malaysia. Phase 2: Concept (or instance), attribute and relationship extraction. For extraction of existing terms in the university we need to investigate and study conceptualization of this domain in detail. In our investigation we should answer several basic questions: • Why are we using the ontology? • For what types of questions should the ontology provide answers? Phase 3: Key property determination. We specify key properties of each concept in domain ontology. These key properties show main meaning and semantic of each concept which characterizes it from others. Phase 4: synonym determination. We exploit dictionaries and universities websites to specify synonyms of each existing term in the domain ontology. Phase 5. Modeling: In this step we build concept hierarchy and relation concept graphs by an ontology-editing tool such as Protégé-2000, Ontolingua-1997 or Chimaera-2000. We use Protégé-2000 for our university domain ontology. Protégé2000 was developed by Mark Musen’s group at Stanford Medical Informatics. Phase 6: Implementation. Finally we implement and store uniform representation models (hierarchy concept and relation concept graphs) of the university domain ontology in the SQL/SERVER DBMS.
416
A. Hajmoosaei and S. Abdul-Kareem
Fig. 2. Ontology editing tool
3.2 Building of GUI Users interact with the system through a GUI (graphic user interface). The GUI must display the domain ontology to the user so that the user can find his/her query terms easily and quickly. After the user has found his/her query terms, the system creates the user query. The query construction module uses the following structure and syntax for the expression of the user query. SELECT < attributes names > FROM < concept name > WHERE { …} In this query structure, the user can query the attributes of one concept from the domain ontology. The user can specify constraints and conditions over attributes of the query concept. Constraints and conditions are expressed after the “WHERE” clause in the query expression. We clarify the query syntax and structure with the following example: Suppose that a user needs the names and emails of professors in university who are above 50 years of age and are female. The user traverses the terms in the university domain ontology and chooses his/her query terms. Afterwards, the system constructs an expression based on the user query as follows: SELECT name, Email FROM Professor WHERE{ sex =Female , age >50 } Note that the user can not pose complex queries. The user can split his/her complex query to simple sub-queries and subsequently submit them to the system. In our
Web Data Integration System: Approach and Case Study
417
prototype, the GUI first displays super concepts of computer college domain ontology to the user (concepts in top level of hierarchy concept graph). The user chooses one of the super concepts. We call it, T1. In next step the GUI shows three types of information to the user related to T1 that are choices for the user: • • •
Sub-concepts (children) of T1, Attributes of T1, Relationships and their ranges which T1 is the domain of those relationships (we call, relationship of T1).
Figure 3 provides a partial illustration of this information about the concept of student.
Fig. 3. Example of GUI
The user has the following choices for selection: •
•
If T1 is a query concept, then: o user specify attributes of T1 which are in question and choose “Y” for query attributes and o If query possesses constraints then user selects attribute constraints and enters the constraint value of the attributes in the value fields. Afterwards, user submits form and the process of query construction is finished. If T1 is not query concept then user repeats the following tasks until query concept is reached: o choose one of sub-concepts or o choose one of ranges (if range is concept element, no value element).
In this way, user finds his/her query terms and the query construction module creates the user query.
418
A. Hajmoosaei and S. Abdul-Kareem
3.3 Developing of Ontology Server The web ontology maintains a catalog of related ontologies and web sources, which contains information about local ontologies and the data sources underlying each local ontology. The ontology server is invoked by other modules for the following services: • • •
Selection of related local ontology with user query domain and transfer it to mapping module; Give information about existing data in related web sources to query process module; Give information about query language of related web sources to translators;
In this step, the first we list all universities in Malaysia and determine their ontologies. We should create ontology for those universities which have no ontology. All of Malaysian universities do not have ontology. We extract a concept hierarchy and concept relation to build the ontology for them from their web sites. We formalize extracted universities local ontologies in OWL (ontology web language) and maintain them in the ontology server. We used IIS as a server platform and implement the ontology server services by visual studio .NET framework. 3.4 Building Transformer The universities local ontologies are formalized in OWL. We need to transform OWL-based local ontologies to a system uniform representation model. Our created transformer transforms the universities local ontologies to a system uniform representation model. 3.5 Mapping Algorithm After the creation of user query by the query construction module, the query needs to be translated to relevant web sources query languages. For this translation first the user query terms must be semantically mapped to similar terms in the local ontologies of web sources. This semantic mapping resolves semantic schema conflicts between query terms and web source terms. The semantic mapping relates each query terms with its semantically similar term of related local ontology. The semantic mapping is partially performed between the domain ontology (related to query terms) and local ontology. There are approaches for semantic mapping between ontologies that have been proposed by researchers. Some of the recent researches and approaches in ontology mapping domain are Chimaera [10], Anchor-PROMPT [11], QOM [12], Cupid [13], GLUE [14], SAT [15] and ASCO [16]. The approaches exploit available information from ontologies and map similar terms of two given ontologies to each other through mapping algorithms. Our mapping algorithm was motivated by some ideas of the above approaches. Inputs of our mapping algorithm are: query terms, domain ontology and local ontology. There are three types of ontology term in user query: concept (C), attribute (A) and value (V). The purpose of the mapping algorithm is to discover semantically similar terms with query terms from the local ontology and then rewriting of queries
Web Data Integration System: Approach and Case Study
419
with discovered similar terms. For calculation of similarity between two terms between user query and local ontology, the following function is used in mapping algorithm: MF (Mapping Function): MF(T1,T2):=[0 1]; This function calculates similarity between two terms. Value rang of [0 1] indicates amount of similarity. MF performs two sub-functions for similarity calculation. First sub-function, normalizes two terms to their tokens. In this sub-function each term (concept, attribute) be: • • •
Tokenized: Lemmatized: Eliminated:
For normalization, it exploits and uses one domain specific dictionary. This dictionary consists all existing terms in a specific domain including their synonym sets. The second sub-function compares tokens (string without space) of normalized terms with each other and calculates similarity between tokens. Finally, similarity between two terms is calculated from aggregation of token similarities. There are well-known metrics for calculating string similarity between two tokens such as JaroWinkler metric [17], Levenstein metric and Monger-Elkan [18]. We implement JaroWinkler metric in our mapping algorithm. The main Steps of our mapping algorithm are as follows: First step: name and synonym matching between C and all local ontology concepts(CL); MF is executed between C(name) (query concept in user query), all its synonyms names C(syn-name) with all university local ontology concepts (all CL(name) ). for all CL(name)null do {If MF(C(name),CiL(name)) >= threshold then Add (C, CiL) to similarity-table; Else: for all C(syn-name) null do {If MF(C(syn-name),CiL(name)) >= threshold then Add (C,CiL) to similarity-table; }} The result of the above step is some similar pairs, which have similarity measure above the algorithm threshold. We call them, candidate mapping pairs. Second step: father matching; in this step father and grandfather of C are compared with father and grandfather of each CiL (CiL in similar table). For each CiL we calculate: Father-matching(C,CiL) [MF(father(C), MF(grandfather(C), grandfather(CiL))]/2;
father(CiL))
+
Third step: key property matching; Algorithm executes MF between key attributes and key relationships of C with all attributes and relationships of CiL. (CiL in similar table) Algorithm saves number of matching attributes and relationships (properties) between C and each CiL. while CiL (name)null do { for all A,R of C do
420
A. Hajmoosaei and S. Abdul-Kareem
If each MF(C(key-property-name & key-propertysyn-names), CiL(A & R)) >= threshold then CiL(similar-property) +1;} Fourth step: aggregation; In this step the algorithm finds the most similar corresponding concept with C among all CiL in the similar table. For this purpose, algorithm aggregates results of second and third steps for each CiL. The CiL which has highest weight above the algorithm threshold is final mapping concept of C. We call it C1L. For all CiL do { Weight(CiL) father-matching(CiL) property); } C1L CiL [Max weight (CiL)]; concept-mapping-table (C1, C1L);
+
CiL(similar-
Fifth step: Attribute mapping between C and C1L; After finding similar concept of C, we must find similar attributes for query attributes and constraint attributes expressed in user query from local ontology. We should notice, algorithm just execute MF between C-atrribute-name (in user query), all C-att-synset-names with attributesnames and relationships-names of its mapping pair (C1L in mapping table). Algorithm chooses maximum MF that is above threshold and stores similar-attribute pairs in Catt-mapping table (such as: , …..). For all C(A-name) in user query do { If MF(C(A-name),C1L(A-name or R-name)) >= threshold then Add (C(A-name), C1L(A-name or R-name)) to att-mapping-table; Else: while A-syn-name null do { If MF(C(A-syn-name),C1L(A-name or R-name)) >= threshold then Add (C(A-name), C1L(A-name or R-name)) to att-mapping-table; } } If algorithm does not find similar terms of query concept and query attributes from the local ontology, mapping is not executed between the user query terms and the local ontology terms and the local ontology has failed. 3.6 Developing of Query Process Module This module consists of the following tasks: 1. Reformulation of user query. After the mapping process, the user query that has been rewritten into terms of one or more local ontologies is reformulated into one or more sub queries. The query process module must first reformulate the query into sub queries that refers directly to the schemas in the web sources. In order for the system to do this, it needs to have a set of source descriptions. This information exists in the web ontology server. 2. Creation and optimization of the query plan. After the minimal set of data sources were selected for a given query and the query was reformulated to some sub queries a key problem is to find the optimal query execution plan for these sub
Web Data Integration System: Approach and Case Study
421
queries. In particular, the plan specifies the order in which to perform the different operations between the sub queries (join, selection and projection) and scheduling of different operators. In our prototype we submit whole rewritten user query to each selected university web source. Therefore we do not need to reformulate the user query and query plan. 3.7 Building of Translators After the query was rewritten with local ontology terms, it is translated to the local web sources query language. For each term of the web source there exists some mapping information in the repository of the web ontology server module that links the local ontology elements with the underlying data elements of the web source. This information is used by the translator to translate the user query (expressed in terms of the local ontology) into different queries to the underlying web sources query language. In our prototype, the university local ontologies terms use the same name with their corresponding terms in the university web sites. In our prototype, the rewritten user query with universities local ontologies terms is sent to the related wrapper and the user query is parsed to each university web source query language by wrappers. 3.8 Building Wrappers In this step, data is extracted from the pre-selected web sources by wrappers. Wrappers are used for retrieving data from web sources. A wrapper is a module which understands a specific data organization. It knows how to retrieve data from the underlying repository and hide the specific data organization to the rest of the information system. There are three freeware wrapper tools which have been developed by various researchers: 1. 2. 3.
W4F (http://cheops.cis.upenn.edu/W4F/) XWRAP (http://www.cc.gatech.edu/projects/disl/XWRAPElite/) DEByE (http://www.lbd.dcc.ufmg.br/~debye/)
We customize one of above wrappers for our prototype.
3.9 Building of Converter After extraction of answers by wrapper from web sources, they are sent to the converter. The converter resolves existing data value heterogeneities between extracted data values. The popular data value conflicts consist of discrepancies in: • • • •
Type (e.g. integer and char) Unit (e.g. "kg" and "gram") Value format such (e.g. date format, time format). Value sign (e.g. "$" and "dollar")
For resolving these kinds of conflicts, the converter uses some conversion functions. These functions should cover all existing value data heterogeneities in the
422
A. Hajmoosaei and S. Abdul-Kareem
domain of the system. In our prototype domain, we define some conversion function for resolving conflicts between data values such as name, URI, date and fee. We define standard format for these values and convert extracted data values which are different with our standard format by conversion functions. 3.10 Building Exporter Finally the data is ready to be presented to the user. In this step, the data is exported to the user representation environment in our prototype. We present data in the HTML page to the user.
4 Conclusion In this paper, we first proposed an approach and architecture for a web data integration system. The proposed system resolves data heterogeneities between web data sources. Afterwards, we discussed the development of prototype and a case study for our proposed system. Our prototype is in the domain of the Malaysian universities. The existence of the proposed web data integration system is critical for any organization to access valuable external information on the web. Our proposed system extracts this valuable information from web sources, integrates them to each other and presents to the user.
References 1. Heflin, J., Hendler, J.: 2000: Semantic interoperability on the web. In Extreme Markup Languages (2000), http://www.cs.umd.edu/projects/plus/SHOE/pubs/extreme2000.pdf 2. Kashyap, V., Sheth, A.: Semantic heterogeneity in global information systems: The role of metadata, context and ontologies. In: Papazoglou, M.P., Schlageter, G. (eds.) Cooperative Information Systems: Current Trends and Directions, pp. 139–178. Academic Press Ltd, London (1998) 3. Fensel, D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer, Heidelberg (2001) 4. Ram, S., Park, J.: Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema-Level Semantic Conflicts. IEEE Transactions on Knowledge and Data Engineering 16(2), 189–202 (2004) 5. Arens, Y., Ciiee, Y., Knoblock, A.: SIMS: Integrating data from multiple information sources. In: Information science institute, University of Southern California, U.S.A (1992) 6. Goh, C.H., Bressan, S., Madnick, S., Siegel., M.: Context interchange New features and formalisms for the intelligent integration of information. ACM Transaction on Information Systems 17(3), 270–290 (1999) 7. Beneventano, D., Bergamaschi, S., Guerra, F., Vincini., M.: The MOMIS approach to information integration. In: ICEIS 2001, Proceedings of the 3rd International Conference on Enterprise Information Systems, Portugal (2001) 8. Visser, P.R., Jones, D.M., Beer, M., Bench-Capon, T., Diaz, B., Shave, M.: Resolving ontological heterogeneity in the KRAFT project. In: Bench-Capon, T.J.M., Soda, G., Tjoa, A.M. (eds.) DEXA 1999. LNCS, vol. 1677, pp. 668–677. Springer, Heidelberg (1999)
Web Data Integration System: Approach and Case Study
423
9. Mena, E., Kashyap, V.: OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies (1996) 10. McGuiness, D.L., Fikes, R., Rice, J., Wilder, S.: The chimaera ontology environment. In: McGuiness, D.L., Fikes, R., Rice, J., Wilder, S. (eds.) Seventh National Conference on Artificial Intelligence (AAAI-2000) (2000) 11. Noy, N.F., Musen, M.A.: Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In: Workshop on Ontologies and Information Sharing. IJCAI, Seattle, WA (2001) 12. Ehrig, M., Staab, S.: Efficiency of Ontology Mapping Approaches. In: Institute AIFB, University of Karlsruhe (2001) 13. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid. Proc. Of the 27th Conference on Very Large Databases (2001) 14. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: 2002: Learning to Map between Ontologies on the Semantic Web. In: The Eleventh International World Wide Web Conference (WWW 2002), Hawaii, USA (2002) 15. Giunchiglia, F., Shvaiko, P.: Semantic Matching. CEUR-WS 71 (2003) 16. Thanh Le, B., Dieng-Kuntz, R., Gandon, F.: On Ontology Matching Problems: for building a corporate Semantic Web in a multi-communities organization. In: Institute National and Research in Informatic, Sophia Antipolis, France (2004) 17. Winkler, W.E.: The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04 (1999) 18. Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Distance Metrics for Name Matching Tasks. In: IJCAI 2003, workshop on Information Integration on the Web. (2003)