Semantic Web Search Model for Information Retrieval of the Semantic Data *
Okkyung Choi1, SeokHyun Yoon1, Myeongeun Oh1, and Sangyong Han2
Department of Computer Science & Engineering, Chungang University
221, Huksuk-dong, Dongjak-ku, Seoul, 156-756, Korea
(okchoi,lazecool,hamnuri)@archi.cse.cau.ac.kr
[email protected]
Abstract. In this paper, we propose an ontology-based semantic web search model to enhance the efficiency and accuracy of information retrieval for unstructured and semi-structured documents. A new evaluation model is also proposed to measure the similarity between documents that carry semantic information. The model is implemented and compared with existing web search models.
1 Introduction
The web advanced rapidly because it is convenient and easily accessible to anyone around the world. However, a single search for specific information now returns too many results. Search engines have been developed, and new ideas are being tried, in order to solve this problem. But as long as the search method is based on plain data processing, it remains difficult for users to find the specific material they need on the web, because the search is executed fragmentarily, based only on words and sentence construction, and leaves out the semantic contents of the web document.[5] As a solution to this difficulty, the semantic web connects a large number of ontologies to each other in a decentralized manner, so that the semantic contents of a web document are expressed clearly and the semantic boundaries are well arranged, allowing the user to find the necessary information more easily. Here, the ontology refers to a “concept” database, which allows the machine, in other words the web environment, to understand the semantic language that humans use. As described above, to enable users to search the web efficiently, a machine must be able to comprehend and process the semantic contents of the information. This calls for an ontology-based information search method that organizes a concept-like database for use as a knowledge base. Accordingly, the present study proposes an RDF (Resource Description Framework) and ontology-based semantic web search model. The efficiency and accuracy of the proposed model are verified through a new method of similarity measurement using semantic metadata.

The study is organized as follows. Section 2 explains the necessity for the semantic web search model along with its techniques. Section 3 introduces a new similarity measurement technique and applies it in evaluating, comparing and analyzing the current search model and the newly proposed semantic web search model. Conclusions and future studies are given in the final section.

* This research was supported by ITRI of the Chung-Ang University.
C.-W. Chung et al. (Eds.): HSI 2003, LNCS 2713, pp. 588-593, 2003. Springer-Verlag Berlin Heidelberg 2003
2 Semantic Web Search Model
2.1 The Necessity
The search model currently in use is poor in efficiency and accuracy: it cannot return the specific results requested by the user, because web documents are written in HTML and XML, which do not define the meaning of their contents. For this reason, RDF, which is intended for semantic data search, and the ontology-based semantic web search model should be adopted as the prototype of the next-generation web, and they will become increasingly important in the near future.
2.2 System Model and Functions
This section examines the system model of the E-engine Ontology Model, which is designed to support the extension of semantic web technology.
Fig. 1. E-engine Ontology Model : System Model
As shown in Figure 1, the system model is composed of three main layers. Each layer is an extension of a layer of the original semantic web model. The application layer at the very top is the Interface Manager, the second layer is the World Map, and the third layer at the bottom is the Content Management Module. The three layers are described in more detail below.
2.2.1 Content Management Module
(1) E-engine Syntactic Layer (Metadata Syntactic Layer)
This layer can assign an arbitrary document structure, but it neither defines the meaning of that structure nor interprets the meaning of the document expressed in it. The layer defines the syntax of the data. The format is handled by the data expression layer and is constructed with XML.
(2) E-engine Semantic Layer (Semantic Layer)
In order to interpret and handle the contents on the web, a language is needed that expresses not only the data but also the rules that govern inference over the data. RDF (Resource Description Framework) is a language that expresses the nature of the resources on the web and the relations between different resources.[1] RDF is used for assigning meaning to a document. The layer is constructed with frameworks and schemas.

2.2.2 World Map (Ontology)
The World Map is placed above the semantic layer. It is a systematic method of expression that improves on the present situation, in which information is processed simply as data and the semantic context must be supplied by a person, and it allows information to have value as knowledge. It is composed of three components: the Content Manager, the Schema Manager and the Thesaurus Manager. The Content Manager applies the definition of the semantic metadata and the definition of the classification model for semantic data search, along with succession and equivalence, to define the relations between different metadata. The Schema Manager defines the standard data type and format of the Content Manager’s standard classification model and the Thesaurus Manager’s semantic integration model. The Thesaurus Manager is like an encyclopedia. It defines the identification and property standards in accordance with the international standards for electronic commerce. The Thesaurus Manager also integrates schemas and unifies and reorganizes similar terms. In other words, it is in charge of integrating terms that are semantically the same.

2.2.3 Interface Manager (Semantic Search Engine)
The current web search does not consider the semantic connection of the query. For this reason, the metadata elements that connect the semantic annotations with the content information must be defined. The following describes the methods and technologies that are necessary for engineering the search engine.
(1) Semantic Standardization
Standardization plays an important role in handling the semantic heterogeneity that arises from multiple data sources. In the metadata standardization process, all contents that belong to the same domain are connected to the same metadata regardless of the source or format. Semantic standardization is the process of unifying the multiple names that denote the same concept under a common term.
(2) Semantic Connection
By establishing the relations between data, the search engine can determine whether A and B have any connection to each other and provide the user with fast and accurate information. Ontology technology must be used here in order to benefit from such semantic connection processes.
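As an illustration of semantic standardization and semantic connection, the following is a minimal sketch rather than the E-engine implementation. The thesaurus entries, the document identifiers, the triples, and the rule that two resources are connected when they share a property and value are all hypothetical choices made for this example.

```python
# Hypothetical thesaurus: each variant term maps to one canonical concept name.
THESAURUS = {
    "writer": "author",
    "creator": "author",
    "author": "author",
    "heading": "title",
    "title": "title",
}


def standardize(term: str) -> str:
    """Return the canonical concept for a term, or the lowercased term if unknown."""
    key = term.strip().lower()
    return THESAURUS.get(key, key)


# Hypothetical metadata from different sources, as (subject, property, value) triples.
# Property names are standardized when the triples are stored.
TRIPLES = {
    ("doc1", standardize("Creator"), "berners"),
    ("doc2", standardize("writer"), "berners"),
    ("doc3", standardize("Heading"), "semantic web"),
}


def connected(a: str, b: str) -> bool:
    """Treat two resources as connected if they share a (property, value) pair.

    This rule is an assumption for illustration only.
    """
    pairs_a = {(p, v) for s, p, v in TRIPLES if s == a}
    pairs_b = {(p, v) for s, p, v in TRIPLES if s == b}
    return bool(pairs_a & pairs_b)


if __name__ == "__main__":
    print(standardize("Creator"))
    print(connected("doc1", "doc2"))
    print(connected("doc1", "doc3"))
```

Because "Creator" and "writer" are both mapped to "author" before storage, the two documents end up sharing the same metadata even though their sources used different names.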
3 The Search Method and Analysis of Its Performance
3.1 Search Method Using the RDF Semantic Metadata
The proposed method applies RDF semantic metadata when measuring the similarity of web resources. Fig. 2 shows the search method using the RDF semantic metadata. The user’s query is extended to an RDF query so that the RDF documents can be searched. Since RQL is a query language for finding resources that match the RDF query, it is more accurate to call this a search for data rather than a search for information. For more accurate and efficient similarity measurement, it is necessary to combine the cosine similarity used in the vector-space model, which ranges from 0 to 1, with the RDF search result obtained through RQL, which is binary.
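The two ingredients described above can be sketched as follows. This is an illustrative sketch only: the stop-word list, the rule that a query of the form "X is Y" becomes a (property, value) pattern, and the triple layout are assumptions, and the pattern match is a plain Python stand-in for an RQL query, not RQL itself.

```python
import math

STOP_WORDS = {"is", "the", "a", "of"}  # hypothetical stop-word list


def keyword_terms(query: str) -> list:
    """Keyword view of the query: lowercase the words and drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def rdf_pattern(query: str):
    """Assumed extension rule: 'X is Y' becomes the pattern (property X, value Y)."""
    words = query.lower().split()
    if len(words) == 3 and words[1] == "is":
        return (words[0], words[2])
    return None


def rdf_match(doc_triples, pattern) -> int:
    """Binary R: 1 if some triple has the requested property and value, else 0."""
    if pattern is None:
        return 0
    prop, value = pattern
    return int(any(p == prop and v == value for _, p, v in doc_triples))


def cosine(q: dict, d: dict) -> float:
    """Cosine similarity of two sparse term-weight vectors (0 to 1 for non-negative weights)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


if __name__ == "__main__":
    query = "author is berners"
    print(keyword_terms(query))
    print(rdf_pattern(query))
    print(rdf_match([("doc", "author", "berners")], rdf_pattern(query)))
```

The keyword view yields a graded score through `cosine`, while `rdf_match` yields only 0 or 1, which is why the two have to be combined before ranking.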
Fig. 2. Search Method Using the RDF Semantic Metadata
Fig. 3. Improved Similarity Measurement Using RDF Semantic Metadata
Fig. 3 shows the improved similarity measurement method using the RDF semantic metadata. K1 is the proportionality constant for the ordinary search result value, and K2 is the proportionality constant for the RDF search result.
3.2 Comparison and Analysis
For this study, “semantic web” was searched on the search engine Google [3], and the documents ranked one to ten were selected to test the proposed model. The URLs of the ten selected documents are listed in [4], and the numbers indicated in the left box are the document numbers. When searching without RDF, the results are ranked according to the cosine similarity of the currently used vector space model. The search query is “author is berners.” When the cosine similarity of the current model is applied, the word “is” is deleted because it is a stop word, so the query is interpreted as “author & berners”. When the RDF search method is applied, the search also reflects the fact that documents are sought whose context implies that the “author” is “berners”. So, beyond searching for documents that contain “author” and “berners”, the RDF search method places more value on the documents that contain the fact that the author is berners (author -> berners). The vector-based cosine similarity was calculated for each of the ten documents. The cosine similarity was then combined with the R value (the RDF search value) to obtain the final similarity. The measurements are shown below in Table 1.
Table 1. Similarity Measurement Results (k1=1, k2=100)
Doc.  Weight     Weight     Cosine      Rank  R  RDF         RDF   Etc
Num   (author)   (berners)  Similarity           Similarity  Rank
b     0          0.031604   0.015802    2     0  0.015645    3
c     0.064481   0.004183   0.034332    0     0  0.033992    0
d     0          0          0           9     1  0.009900    5
e     0          0.001389   0.000695    7     0  0.000688    8
f     0.025723   0.012015   0.018869    1     0  0.018682    2
g     0          0.013045   0.006522    4     0  0.006457    6
h     0.022013   0.005712   0.013863    3     1  0.023626    1
i     0          0          0           8     0  0           9     ×
j     0.009767   0          0.004883    5     0  0.004834    7
k     0          0.006416   0.003208    6     1  0.013077    4
According to the test results, document d, which ranked ninth according to the similarity measured by the current method, ranked fifth when the RDF search method was applied. While the current model measures similarity simply according to whether the documents contain the search words, the newly proposed method places more value on the documents whose semantic context implies that the “author” is “berners”.
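The figures in Table 1 can be re-derived with a short sketch. The formula of Fig. 3 is not reproduced in the text, so the combination used here is an assumption: a normalized weighted sum in which the cosine score carries weight 100 and R carries weight 1. This choice is made only because it matches the tabulated RDF similarity values to rounding; how these two weights correspond to k1 and k2 is left open. Ranks are zero-based, as in the table.

```python
# Assumed weights, chosen because they reproduce the RDF Similarity column of Table 1.
W_COS, W_RDF = 100, 1

# (cosine similarity, R) for each document, copied from Table 1.
ROWS = {
    "b": (0.015802, 0), "c": (0.034332, 0), "d": (0.0, 1), "e": (0.000695, 0),
    "f": (0.018869, 0), "g": (0.006522, 0), "h": (0.013863, 1), "i": (0.0, 0),
    "j": (0.004883, 0), "k": (0.003208, 1),
}


def combined(cos: float, r: int, w_cos: float = W_COS, w_rdf: float = W_RDF) -> float:
    """Normalized weighted sum of the graded cosine score and the binary RDF result."""
    return (w_cos * cos + w_rdf * r) / (w_cos + w_rdf)


scores = {doc: combined(cos, r) for doc, (cos, r) in ROWS.items()}

# Zero-based rankings, best first. Ties (d and i under cosine) keep table order.
cosine_ranking = sorted(ROWS, key=lambda doc: ROWS[doc][0], reverse=True)
rdf_ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for doc in rdf_ranking:
        print(doc, round(scores[doc], 6), rdf_ranking.index(doc))
```

Under this assumption, document d has a cosine score of 0 and sits at the bottom of the cosine ranking, but its R value of 1 moves it to rank 5 in the combined ranking, which agrees with the table.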
4 Conclusion and Further Research
This study proposes an information search model that applies semantic web elements as a solution to the problems of the current search model. These semantic elements are emerging as the web technology of the next generation. With the new semantic web search model, the efficiency and accuracy of automatic classifiers and information extractors are enhanced, and both semi-structured and unstructured documents can be processed. Through the establishment of the ontology, data standardization, data integration and the semantic connection method, the semantic web search model is capable of semantic data search.

Additionally, the study introduced a new similarity measurement method for the proposed model. The new method was compared with the cosine similarity method of the current vector space model and analyzed to verify the efficiency and accuracy of the semantic search model. The cosine similarity method used in the current search model cannot reflect RDF semantic metadata, whereas the new measurement method uses the RDF semantic metadata to produce more efficient search results.

Further research is planned to improve the search method using the RDF semantic metadata by extending the Boolean model, which yields a binary value through RQL. With this extension, similarity can be measured with non-binary weights, and the semantic vector space can be expressed by using the ontology and RDF schema. Additional research also lies ahead on using the RDF search results to extend the query, so that the RDF semantic metadata can be used even with current search engines.
References
[1] Lee Jae-ho and Yang Jeong-jin, “The Semantic Web: The Intelligent Technology of the Next Generation”, TTA Journal, Serial No. 81, June 2002
[2] Sheth, A.; Bertram, C.; Avant, D.; Hammond, B.; Kochut, K.; Warke, Y., “Managing semantic content for the Web”, IEEE Internet Computing, Vol. 6, 2002, pp. 80-87
[3] http://www.google.com/en, Google
[4] http://ec.cse.cau.ac.kr/okchoi/test_webdata1.html, Web Pages for Testing
[5] http://islab.hanyang.ac.kr/~jmchoi/cse995, CSE 995 Semantic Web