Semantic Web Search Engine Using Ontology, Clustering and Personalization Techniques
Noryusliza Abdullah1, Rosziati Ibrahim1
1 Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
{yusliza,rosziati}@uthm.edu.my
Abstract. Data accuracy and reliability have become a serious issue given the vast amount of information emerging on the Web. Advanced web searching has assisted knowledge retrieval. However, most knowledge on the Web is presented as natural-language text that is understandable by humans but difficult for computers to interpret. Therefore, the Semantic Web approach is widely used to build more reliable applications. This paper presents a framework for enhancing knowledge retrieval processes using Semantic Web technologies. Instead of using ontology and categorization alone, we inject a personalization concept based on data from a Relational Database (RDB) to ensure that more reliable data are obtained. The proposed framework is discussed in detail. A case study is presented to show the viability of the proposed framework in retrieving meaningful information.
Keywords: Semantic Web Search, Ontology, Clustering, User Profiling.
1 Introduction
Knowledge is important in every aspect of life. Although many sources offer large amounts of information, they still fall short in terms of knowledge reliability. Because an organization's ability to learn and to handle knowledge processes or knowledge products is considered the new key success factor [1], research in information and knowledge retrieval is actively conducted. Such research also spares researchers from digging through every single document when searching for information. However, attempts to retrieve the real meaning of data often fail to give the desired result. Now that Information Technology has reached a mature phase, this situation should not be allowed to continue. Something has to be done to ensure that we can learn from other people by capturing as much information or knowledge as we can and making it meaningful for our purpose. To this end, knowledge retrieval is chosen as an enabler to overcome the problem stated above. Knowledge retrieval systems are expected to be the next generation of retrieval systems, coping with the rapid increase in data and information in order to find the right knowledge [2]. Nonetheless, retrieving knowledge requires a substantial amount of effort. According to Tao, Li, & Nayak [3], interpreting users' information needs is essential in knowledge retrieval. Hence, they proposed the Local Instance Repository (LIR), a personal collection of web documents recently visited by the user.
The major challenge in implementing either information or knowledge retrieval on the WWW is that most knowledge on the Web is presented as natural-language text that is understandable by humans but difficult for computers to interpret. Therefore, the Semantic Web approach is widely used to build more reliable applications. Mikroyannidis [4] explains that the Semantic Web gives information a well-defined meaning and enables better cooperation between computers and people. Ontology is commonly discussed in applications of the Semantic Web. An ontology is an explicit specification of a conceptualization. d'Aquin & Noy [5] state that the data interoperability offered by ontologies, which permits sharing and reuse, is a key promise of the Semantic Web. These advantages are highlighted across the 11 ontology libraries they survey. The libraries that may be taken into consideration in this study because of their general domain are Cupboard [6], Ontology Design Patterns (ODP) [7], OntoSelect [8], OntoSearch2 [9] and Schema-Cache [10]. While ontologies are capable of giving good outcomes, researchers are trying to enhance search methods using clustering techniques [11] and user profiling/personalization [12, 13, 14]. Although previous research gives good results, we are motivated to improve the output. Therefore, we propose a hybrid Semantic Web Search Engine, a knowledge retrieval platform that uses Semantic Web search to ensure that reliability criteria are fulfilled in retrieving knowledge. This web search is based on three criteria: ontology, clustering and user profiling/personalization. The techniques are consolidated to give more reliable searching, particularly from the user's perspective. The proposed technique is intended to extract meaningful information and to have a positive impact in the area of knowledge retrieval. The remainder of this paper is organized as follows. Section 2 reviews related work relevant to this research.
Section 3 presents the proposed framework, while Section 4 describes a case study of the framework. Finally, Section 5 provides the conclusion.
2 Related Works
In this section we discuss the Semantic Web, online ontology resources, clustering (or categorization) and user profiling (or personalization).
2.1. Semantic Web
In our research, we concentrate on the search engine. A Semantic Web search engine ranks Semantic Web documents, RDF graphs, triples and terms. This is different from conventional search engines, where only Web pages are ranked [15]. The functionality of a Semantic Web search engine resembles that of typical search engines such as Google and Yahoo, but according to Jiang [16] its benefit is the availability of machine-understandable descriptions of meaning. The web helps us to reach the information that we search for and other data related to it. Thus, the Semantic Web shares not just the text of a page but data and facts as well [17]. Another motivation for using the Semantic Web is that it helps in collecting data together from the web [17]. According to Mikroyannidis [4], the Semantic Web is better than the conventional web because of its ability to handle unstructured content. The Semantic Web can overcome this problem by using software agents that enhance search precision and enable logical reasoning. The Semantic Web is a significant product area among established companies such as Oracle, Vodafone, Amazon, Adobe, Yahoo and Google, which use it to provide a smarter web [18]. Moreover, Joo [19] considers that the Semantic Web has the potential to implement semantic integration and reduce information overload. According to Janev & Vrane [20], this is a popular area in the Information and Communication Technology field. Many research efforts are conducted to improve the traditional web and make its content available on the Semantic Web. In line with this, Edwards [17] explains that moving from HTML to XML was the original plan for the Semantic Web. Loopholes in HTML are addressed by Linked Data, which connects data, information and knowledge on the Semantic Web using Uniform Resource Identifiers (URI) and the Resource Description Framework (RDF).
2.2. Online Ontology Resources
Ontology is the heart of the Semantic Web. It is a representation of a domain and its knowledge [21, 22]. According to Hepp [23], ontologies are the vocabulary that can be used to express a knowledge base, while Diez-Rodriguez et al. [24] state that the intention of representing concepts in ontologies is to improve knowledge searching and discovery mechanisms. In-depth research is conducted on ontologies because they function as the backbone of the Semantic Web [20]. Joo [19] states that research on ontology is necessary to ensure the diffusion of the Semantic Web. In addition, ontology-based knowledge organization can help to express the contents of information elements and the semantic relations between them. It can also support semantic reasoning and retrieval [25]. Furthermore, Maier, Hadrich & Peinl [1] state that documented knowledge that is spread across multiple sources requires identification and visualization with the help of knowledge maps, and integration supported by ontologies that manage the semantic content. However, to ensure that ontologies and metadata represent information correctly, they need constant updates and maintenance [4]. The Web Ontology Language (OWL) is used for this purpose. It is a semantic markup language for publishing and sharing ontologies on the World Wide Web, and it is used to describe classes and the relations between them [21]. Still, according to Cardoso [18], building an ontology is more complex in terms of logic and structure than building software. The main goal of ontology engineering is to produce useful, consensual, rich, current, complete and interoperable ontologies. In building ontologies, linking them to knowledge organization systems is the main priority in order to increase interoperability and data accessibility [23]. The most widely adopted methodology for developing ontologies is Methontology. Ontology development requires an editor. There are several editors, including Protégé, SWOOP, OntoEdit and OntoStudio. Among them, Protégé is the most used editor because it supports a wide variety of plugins and import formats and is free and open source.
According to D'Amato et al. [26], combining Semantic Web search with an ontological background is a promising research approach. New Semantic Web applications discover ontologies on the web. Exploring large-scale semantics requires certain tasks to be performed: finding relevant resources, selecting appropriate knowledge, exploiting heterogeneous knowledge sources, and combining ontologies and resources [27]. Semantic applications that use online knowledge can thereby obtain appropriate semantic resources. D'Aquin et al. [27] list several Semantic Web search engines, such as Swoogle, Sindice, Falcon-S and Watson. Among these search engines, Watson is better in terms of finding, selecting, exploiting and combining online resources without having to download the ontologies. It uses a set of crawlers to explore sources and checks for duplicates, copies or prior versions. Analysis and indexing depend on content, complexity, quality and relation to other resources.
2.3. Clustering/Categorization
As an extension to the current approach, Trillo et al. [11] propose a categorization or clustering method, a semantic technique that groups the results of a keyword search into different categories. They use online ontologies to define the possible categories.
2.4. User Profiling/Personalization
Research on personalization or user profiling in the Semantic Web is actively conducted. Jie et al. [12] use information on users' homepages for profile extraction. Data such as interests and publications are extracted to obtain more information on users. Other researchers base personalization on the history of visited sites. A personalization mechanism is used to improve browsing results. This mechanism is based on user preferences and on monitoring user navigation. Antoniou et al. [13] propose a method of suggesting pages that were highly accessed in past users' navigational patterns to new users. This method handles bursts of very frequent access over short periods of time by using advanced data structures. Yoo [14] supports effective retrieval of personalized information on the Semantic Web by using a hybrid query processing method. The hybrid of two methods, a query rewriting method and a reasoning method, is able to process queries when individual requirements change. Many researchers use the terms user profiling and personalization interchangeably and treat them as the same entity. However, some researchers treat them as two different things: personalization refers to navigational behaviour, while user profiling refers to the user's personal data. We use the term user profiling from now on to avoid confusion. While most researchers concentrate on browsing history and web data for personalization or user profiling, we choose to hybridize our Semantic Web Search engine with data from our Relational Databases (RDB) to obtain more information on users. Because we do not have an Oracle-like RDBMS that implements the RDF model in its databases, we map our RDB to RDF.
3 The Framework of Semantic Web Search Engine
In this section, the proposed framework for implementing the hybrid Semantic Web Search is presented.
3.1. Semantic Web Search Construction
Several techniques can be implemented for retrieving knowledge. The Semantic Web is chosen because of the advantages stated in the previous section. To ensure that the results obtained are more reliable, the method in [11] is used, modified with a user profiling concept.
3.2. Search Result based on User Profiling
This research focuses on the Universiti Tun Hussein Onn Malaysia (UTHM) dataset. With the emphasis on user profiling, members' own data are extracted and used to ensure that results are more reliable from the user's perspective. The following components need to be examined: Staff ID, Staff Name, Faculty ID and Faculty Name.
3.3. Proposed Model
Based on [11], the approach of extracting online ontologies from the web is applied. The results are then categorized to assist users. In addition, the search facility is optimized by adding a user profiling technique to the current approach. Figure 1(b) shows the framework adapted from Trillo et al. [11] (see Figure 1(a)).
[Figure 1(a), Trillo et al. [11] framework. Step 1, Discovery of the Semantics of User Keywords: user keywords; extraction of keyword senses (sources shown: Web, Wordnet, other lexical resources, other ontologies (not indexed), extracted database); disambiguation of user keywords with a disambiguation algorithm (WSD). Step 2, Semantics-guided Data Retrieval: recollection of hits; cleaning and lexical annotation of hits; categorization of hits; ranking of categories and presentation of results to the user as a ranked list of categories.]
[Figure 1(b), proposed framework. User identification: user, keyword. Semantic discovering: online semantic resources/ontology searching; results categorization and clustering; search result. User profiling/personalization: RDB to RDF mapping; RDF to UTHM ontology comparison. Ranking.]
Fig. 1. (a) Trillo et al. [11] framework. (b) Proposed framework.
Figure 1 shows the adaptation of Trillo et al. [11] with the enhancement in user profiling. Compared to the previous framework, the proposed framework matches categorized keywords with users' personal data and ranks the output based on those data. In our approach, the user's own data are compared to the clustered search result. The computer's name might be used for identification. Otherwise, users might key in simple data, for instance a staff ID, to be recognized and given a personalized result. Enabling the semantic search to draw data from the database requires a particular method, because the RDF format used in the Semantic Web differs from that of Relational Databases (RDB). RDF data are represented as subject, predicate, object triples. Therefore, RDB to RDF mapping is conducted. According to Matthias et al. [28], Direct Mapping is more suitable for RDB to RDF application scenarios. In this approach, relational tables are mapped to classes in an RDF vocabulary and table attributes to properties in the vocabulary. Hence, Direct Mapping is used in the user profiling part of our framework.
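The table-to-class and attribute-to-property idea can be illustrated with a minimal Python sketch. It is not the mapping code used in the framework: the base IRI, the function name direct_map and the sample row (including the facID value) are assumptions made for illustration, and the IRI layout only loosely follows the usual Direct Mapping convention.

```python
# Hypothetical base IRI; the paper does not state the one used for UTHM data.
BASE = "http://example.org/uthm/"


def direct_map(table, primary_key, rows):
    """Turn relational rows into (subject, predicate, object) triples.

    The table becomes a class, each column becomes a property, and each
    row becomes one subject identified by its primary-key value.
    """
    triples = []
    for row in rows:
        subject = f"{BASE}{table}/{primary_key}={row[primary_key]}"
        # One triple typing the row as an instance of the table's class.
        triples.append((subject, "rdf:type", f"{BASE}{table}"))
        # One triple per column, with the column name as the property.
        for column, value in row.items():
            triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


# Illustrative row only; the facID value "IT" is an assumption.
staff_rows = [{"staffID": "718", "staffName": "Yusliza", "facID": "IT"}]
staff_triples = direct_map("staff", "staffID", staff_rows)
```

Each row yields one typing triple plus one triple per column, so a three-column staff row produces four triples that a semantic search component could then query.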
3.3. Algorithm
The algorithm shown in Figure 2 is used in the framework. In line 1, the user enters an ID for identification. Keyword entry, processing and categorization are done in lines 2 to 4. In these steps, online ontologies are used to specify and conceptualize the keywords. The main contribution of this research lies in lines 5 to 17, which apply the user profiling technique and rank the results. Combining these steps with online ontology and clustering is not implemented by Trillo et al. [11].
Input: 1. User entered keywords
Output: 1. Mapped clustered/categorized result with user data
Begin
1.  User identification
2.  Keyword entered
3.  If keyword == ontology
4.    CategorizeHits, C
5.    While C > 0 do
6.      If User, U ∈ UTHM database
7.        Map RDB to RDF
8.        Initialize n = 1
9.        If Field data, F == C
10.         Rank = n
11.       Else
12.         Rank = n+1
13.       Endif
14.     Else
15.       Rank = n+2
16.     Endif
17.   Done
End
Fig. 2. Algorithm used in the Semantic web Search Engine
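Lines 5 to 17 of Figure 2 can be read as assigning each category one of three rank tiers. The following Python sketch is one possible reading, not the authors' implementation: the function names, the dictionary staff_fields (standing in for the RDB lookup and RDB to RDF mapping of line 7) and the sample data are assumptions.

```python
def rank_tier(category, staff_id, staff_fields):
    """Return the rank tier of one category, following Fig. 2 lines 6-16."""
    n = 1
    if staff_id in staff_fields:            # line 6: user is in the UTHM database
        field = staff_fields[staff_id]      # line 7: stand-in for the RDB to RDF mapping
        if field == category:               # line 9: field data F equals category C
            return n                        # line 10
        return n + 1                        # line 12
    return n + 2                            # line 15: user is not in the database


def rank_all(categories, staff_id, staff_fields):
    """Apply the tier rule to every categorized hit (the loop of line 5)."""
    return {c: rank_tier(c, staff_id, staff_fields) for c in categories}


# Illustrative data only: one staff member whose field is "Computer".
demo_fields = {"718": "Computer"}
demo_categories = ["Computer", "Cartoon", "Environment"]
```

Under this reading, a staff member sees the category that equals their field in tier 1 and all other categories in tier 2, while an unknown user sees every category in tier 3.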
3.4. Data Description
The important attributes are listed in Table 1 and Table 2, together with their structure and description. The data structure is based on real data from UTHM's Relational Database (RDB). In the implementation phase, actual UTHM data will be used as the datasets.

Table 1. Datasets structure – Faculty Table
Field      Structure       Description
facID      Varchar2 (3)    Faculty in UTHM.
facName    Varchar2 (50)   Name of faculty.

Table 2. Datasets structure – Staff Table
Field      Structure       Description
staffID    Varchar2 (10)   ID for every staff. Unique. Used as identification.
staffName  Varchar2 (50)   Name of staff.
facID      Varchar2 (3)    Faculty for staff.
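To make the two table structures concrete, the sketch below builds them in an in-memory SQLite database and looks up a user's faculty from a staff ID. This is only an illustration under stated assumptions: SQLite and the TEXT type stand in for the Oracle-style VARCHAR2 columns, the table names fac and staff are taken from the queries in Figure 7, and the sample rows and function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create the Faculty and Staff tables with one illustrative row each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    # TEXT stands in for VARCHAR2; facID is the primary key of the faculty table.
    conn.execute("CREATE TABLE fac (facID TEXT PRIMARY KEY, facName TEXT)")
    # staffID is unique and identifies the user; facID refers to the faculty table.
    conn.execute(
        "CREATE TABLE staff ("
        "staffID TEXT PRIMARY KEY, staffName TEXT, "
        "facID TEXT REFERENCES fac(facID))"
    )
    # Hypothetical sample values, not actual UTHM data.
    conn.execute("INSERT INTO fac VALUES ('IT', 'Information Technology')")
    conn.execute("INSERT INTO staff VALUES ('718', 'Yusliza', 'IT')")
    return conn


def lookup_profile(conn, staff_id):
    """Return (staffID, staffName, facID, facName), or None for unknown IDs."""
    return conn.execute(
        "SELECT s.staffID, s.staffName, f.facID, f.facName "
        "FROM staff s JOIN fac f ON s.facID = f.facID "
        "WHERE s.staffID = ?",
        (staff_id,),
    ).fetchone()
```

A lookup that returns None corresponds to a user who is not UTHM staff, which is the case the framework treats separately when ranking.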
4 Case Study of the Semantic Web Search Engine
This section describes a case study of the Semantic Web Search Engine for UTHM members using three techniques: online ontology, clustering and user profiling. The algorithm in Figure 2 is explained in detail. A total of 968 academic staff from UTHM are expected to make use of this work. Table 3 shows the number of academic staff by faculty. However, for testing purposes, only ten percent of them, selected randomly, will take part in the testing phase.

Table 3. Academic staff
Faculty                       Number of academic staff
Management                    88
Civil/Environment             167
Mechanical                    200
Electrical                    201
Vocational                    99
Information Technology (IT)   69
Science & Technology          144
TOTAL                         968
* Data as of Wednesday, 18th January 2012
4.1. Step 1 - User Identification
The goal of this process is to capture the user's profile. Figure 3 shows the Graphical User Interface (GUI) for identification. The search engine identifies the user's faculty. For convenience, computer data stored in the web log might be used so that users do not have to enter their ID every time they use the application.
Fig. 3. GUI of user identification
4.2. Step 2 - Ontology searching and clustering
In this step, the user enters keywords, which are then mapped to online ontologies. The results are initially mixed, so the hits are clustered and listed in specific groups. These processes are shown in Figure 4.
[Figure 4: search interface (IdioSearch / SophSearch) with the keyword "mouse". The keyword search is matched against online ontologies, and clustering groups the hits into the categories Computer, Cartoon and Environment, each with a numbered list of results.]
Fig. 4. Web ontology searching and clustering
Using the framework in [11], the expected output is shown in Table 4. Categories are listed in no particular order, without considering the users' profiles. The data indicate that all users obtain the same results. The enhancement using the user profiling technique is discussed in Steps 3 to 5.
Table 4. Results using Trillo et al. [11] framework
Users     UTHM Staff   Web Category
Yusliza   Yes          Computer, Cartoon, Environment
Azma      Yes          Computer, Cartoon, Environment
Ziela     No           Computer, Cartoon, Environment
4.3. Step 3 - User Profiling using RDB to RDF mapping
This process uses the Direct Mapping technique. The staff ID entered in Step 1 is used here. It is mapped to the UTHM relational database described in Tables 1 and 2. The structures of these tables are shown in Figure 5 and Figure 6. The code for the mapping process, which uses RDF and SPARQL, the query language for RDF, is listed in Figure 7.
facID      VARCHAR2 (3)    PRIMARY KEY
facName    VARCHAR2 (50)
Fig. 5. Faculty table

staffID    VARCHAR2 (10)   PRIMARY KEY
staffName  VARCHAR2 (50)
facID      VARCHAR2 (3)    FOREIGN KEY
Fig. 6. Staff table
Select '' AS facURI , facNo , facName from fac
Select '' AS staffURI , staffID , staffName , facID from staff
Fig. 7. RDF and SPARQL coding to map UTHM database

4.4. Step 4 - RDF to UTHM ontology comparison
The UTHM ontology shown in Figure 8 is developed so that no changes need to be made to the database. Modifying the databases would affect current systems, since we use actual UTHM datasets. After clustering, the user's faculty captured and mapped in Step 3 is compared with the UTHM ontology to find the field of the user's faculty. The field derived from this process is compared with the categories from Step 2.
[Figure 8: ontology graph. UTHM [Organization] is linked by hasOffice to the [Faculty] nodes Mechanical, Civil and IT. Civil has field Environment and hasEmployee Azma. IT has field Computer and hasEmployee Yusliza.]
Fig. 8. UTHM Ontology
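The comparison in Step 4 can be sketched as a lookup over a small set of triples. The triples below are one reading of Figure 8 (consistent with the fields listed for Azma and Yusliza in Table 5) and should be treated as an assumption, as should the function names; a real implementation would query an OWL or RDF store instead of a Python list.

```python
# Assumed reading of Fig. 8 as (subject, predicate, object) triples.
UTHM_ONTOLOGY = [
    ("UTHM", "hasOffice", "Mechanical"),
    ("UTHM", "hasOffice", "Civil"),
    ("UTHM", "hasOffice", "IT"),
    ("Civil", "field", "Environment"),
    ("IT", "field", "Computer"),
    ("Civil", "hasEmployee", "Azma"),
    ("IT", "hasEmployee", "Yusliza"),
]


def field_of_faculty(faculty):
    """Return the field attached to a faculty, or None if none is recorded."""
    for subject, predicate, obj in UTHM_ONTOLOGY:
        if subject == faculty and predicate == "field":
            return obj
    return None


def ontology_fields():
    """Return every field that appears anywhere in the ontology."""
    return {obj for _, predicate, obj in UTHM_ONTOLOGY if predicate == "field"}
```

The field returned for the user's faculty is what gets compared with the web categories from Step 2, and the set of all fields distinguishes categories that are known to the ontology from those that are not.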
4.5. Step 5 - Ranking
In this final stage, the clustered/categorized hits are ranked depending on the user's data. As shown in Table 5, the Semantic Web Search uses the entered ID for identification. The name is retrieved from the RDB. If the user is a UTHM staff member, the search engine uses the faculty field obtained in Step 4. The web categories produced by the clustering in Step 2 are compared with the results from Step 4. A category that matches the user's field receives the highest rank. A category that does not match but is still in the UTHM ontology receives a lower rank, and a category that neither matches nor appears in the ontology receives the lowest rank. If the user is not a UTHM staff member, the categories are ranked randomly. Figure 9 shows the expected result in the GUI.

Table 5. Web category ranking
ID    Name     UTHM Staff  Field        Web Category  Web Category = UTHM Field  Rank
718   Yusliza  Yes         Computer     Environment   1                          2
718   Yusliza  Yes         Computer     Cartoon       0                          3
718   Yusliza  Yes         Computer     Computer      1                          1
615   Azma     Yes         Environment  Environment   1                          1
615   Azma     Yes         Environment  Cartoon       0                          3
615   Azma     Yes         Environment  Computer      1                          2
-     Ziela    No          -            Environment   0                          1
-     Ziela    No          -            Cartoon       0                          2
-     Ziela    No          -            Computer      0                          3
Fig. 9. Semantic Web Search Engine
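The three-level ranking described in Step 5 can be sketched as a stable sort over the categories. This is an illustrative sketch rather than the system's code: the function name and arguments are hypothetical, and for a user who is not UTHM staff the sketch simply keeps the listing order, which is an assumption standing in for the random ranking mentioned in the text.

```python
def rank_categories(categories, user_field, ontology_fields):
    """Rank web categories for one user; rank 1 is the best position.

    user_field is None for a user who is not UTHM staff.
    """
    if user_field is None:
        # Assumption: keep listing order instead of a random order.
        return {c: i + 1 for i, c in enumerate(categories)}

    def tier(category):
        if category == user_field:
            return 0          # matches the user's field: highest rank
        if category in ontology_fields:
            return 1          # no match, but present in the UTHM ontology
        return 2              # neither matching nor in the ontology

    # sorted() is stable, so categories in the same tier keep their order.
    ordered = sorted(categories, key=tier)
    return {c: i + 1 for i, c in enumerate(ordered)}


# Categories in the order listed in Table 5.
table5_categories = ["Environment", "Cartoon", "Computer"]
table5_fields = {"Environment", "Computer"}
```

With the categories in the order used in Table 5, this sketch yields the same ranks as the table for the three example users.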
This framework is based on the work of previous researchers, Trillo et al. [11]. In contrast with our research, their approach gives only a list of categories from the online ontologies and clustering processes. Moreover, the categories are mixed and listed in no particular order, and an excessive number of categories causes confusion. By adding the user profiling technique, we expect to produce results that are more reliable with respect to user preferences. This technique generates the results in Table 5 and produces a ranking that does not exist in Table 4.
5 Conclusion
The proposed framework for knowledge retrieval using a hybrid Semantic Web Search has been discussed. Three criteria, namely online ontology, clustering and user profiling, have been used in this research. The enhancement using user profiling is embedded into the current practice, which uses only online ontology and clustering. It is expected to give more reliable search results by considering users' own data in the RDB. This paper provides the framework, algorithm, dataset structure and expected result. For better illustration, an example with a detailed explanation is included in this paper. The hybrid Semantic Web Search Engine is expected to give the desired result with respect to the user's profile.
Acknowledgement
This work is supported by Universiti Tun Hussein Onn Malaysia (UTHM) and the Faculty of Computer Science and Information Technology, UTHM. The authors would like to thank the Information Technology Centre, UTHM, for providing statistics and live data.
References
1. Maier, R., Hadrich, T., & Peinl, R. Enterprise Knowledge Infrastructures (2nd ed.). Berlin: Springer (2009).
2. Yao, Y., Zeng, Y., Zhong, N., & Huang, X. Knowledge Retrieval (KR). Paper presented at the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (2007).
3. Tao, X., Li, Y., & Nayak, R. A knowledge retrieval model using ontology mining and user profiling. Integrated Computer-Aided Engineering, Volume 15 (Number 4 / 2008), 313-329 (2009).
4. Mikroyannidis, A. Toward a Social Semantic Web. Computer, 40(11), 113-115 (2007).
5. d'Aquin, M., & Noy, N. F. Where to publish and find ontologies? A survey of ontology libraries. Web Semantics: Science, Services and Agents on the World Wide Web, 11(0), 96-111 (2011).
6. d'Aquin, M., & Lewen, H. Cupboard – A Place to Expose Your Ontologies to Applications and the Community. Lecture Notes in Computer Science, Volume 5554 (The Semantic Web: Research and Applications), 913-918 (2009).
7. Ontology Design Patterns.org (ODP). Available from http://ontologydesignpatterns.org/wiki/Main_Page (2010).
8. Buitelaar, P., Eigner, T., & Declerck, T. OntoSelect: A Dynamic Ontology Library with Support for Ontology Selection. Paper presented at the International Semantic Web Conference (2004).
9. Thomas, E., Pan, J. Z., & Sleeman, D. ONTOSEARCH2: Searching Ontologies Semantically [Electronic Version], from http://ceur-ws.org/Vol-258/paper26.pdf (2008).
10. Schema-cache. Available from http://schemacache.com/
11. Trillo, R., Po, L., Ilarri, S., Bergamaschi, S. and Mena, E. Using semantic techniques to access web data. Information Systems, 36 (2), 117-133 (2011).
12. Jie, T., Limin, Y., Duo, Z., & Jing, Z. A Combination Approach to Web User Profiling. ACM Trans. Knowl. Discov. Data, 5(1), 1-44 (2010).
13. Antoniou, D., Paschou, M., Sourla, E. and Tsakalidis, A. A Semantic Web Personalizing Technique: The Case of Bursts in Web Visits. Proceedings of the Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on. 530-535 (2010).
14. Yoo, D. Hybrid query processing for personalized information retrieval on the Semantic Web (2011).
15. Bussler, C. Is Semantic Web Technology Taking the Wrong Turn. Internet Computing IEEE, 12(1), 75-79 (2008).
16. Jiang, H. Information retrieval and the semantic web. Proceedings of the Chongqing, China. IEEE Computer Society, V3461-V3463 (2010).
17. Edwards, C. Analysis: Semantic web's hidden meanings. Engineering and Technology, 5 (16), 52-53 (2010).
18. Cardoso, J. The semantic web vision: Where are we? IEEE Intelligent Systems, 22 (5), 84-88 (2007).
19. Joo, J. Adoption of Semantic Web from the perspective of technology innovation: A grounded theory approach. International Journal of Human Computer Studies, 69 (3), 139-154 (2011).
20. Janev, V. and Vrane, S. Applicability assessment of Semantic Web technologies. Information Processing and Management, 47 (4), 507-517 (2010).
21. Wecel, K. Towards an Ontological Representation of Knowledge on The Web. In W. Abramowicz (Ed.), Knowledge-based Information Retrieval and Filtering From the Web. USA: Kluwer Academic Publisher (2003).
22. Fluit, C., Sabou, M. and Harmelen, F. v. Ontology-based Information Visualization. In V. Geroimenko and C. Chen (Eds.), Visualizing the Semantic Web: XML-based Internet and Information Visualization. Verlag. Springer (2003).
23. Hepp, M. Ontologies: State of the Art, Business Potential, and Grand Challenges. In M. Hepp, P. D. Leenheer, A. d. Moor & Y. Sure (Eds.), Ontology Management. Semantic Web, Semantic Web Services, and Business Applications. New York: Springer (2008).
24. Diez-Rodriguez, H., Morales-Luna, G., & Olmedo-Aguirre, J. O. Ontology-based Knowledge Retrieval. Paper presented at the 2008 Seventh Mexican International Conference on Artificial Intelligence (2008).
25. Hao, Y. and Zhang, Y.-f. Research on Knowledge Retrieval by Leveraging Data Mining Techniques. Proceedings of the 2010 International Conference on Future Information Technology and Management Engineering. IEEE, 479-484 (2010).
26. D'Amato, C., Esposito, F., Fanizzi, N., Fazzinga, B., Gottlob, G. and Lukasiewicz, T. Inductive reasoning and semantic web search. Proceedings of the SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing. Sierre, Switzerland. Association for Computing Machinery, 1446-1447 (2010).
27. D'Aquin, M., Motta, E., Sabou, M., Angeletou, S., Gridinoc, L., Lopez, V. and Guidi, D. Toward a new generation of semantic web applications. IEEE Intelligent Systems, 23 (3), 20-28 (2008).
28. Matthias, H., Gerald, R. and Harald, C. G. A comparison of RDB-to-RDF mapping languages. Proceedings of the 7th International Conference on Semantic Systems. Graz, Austria. ACM (2011).