Available online at www.sciencedirect.com
Procedia Technology 5 (2012) 388 – 396
CENTERIS 2012 - Conference on ENTERprise Information Systems / HCIST 2012 - International Conference on Health and Social Care Information Systems and Technologies
GSSP – A Generic Semantic Search Platform
Sara Paiva a,*, Manuel Ramos-Cabrer b, Alberto Gil-Solla b
a [email protected] - Viana do Castelo Polytechnic Institute – Avenida do Atlântico, Apartado 574 – 4900-348 Viana do Castelo
b {mramos,agil}@det.uvigo.es – Universidade de Vigo – Telematics Engineering Department
Abstract
Semantic search has grown rapidly as a way to improve search results: interpreting the meaning of the input expression has been shown to produce better results than traditional keyword matching. Several semantic search engines have been proposed, but each of them is implemented for a specific goal. We consider it important to develop a generic semantic search system that can be rapidly adapted to any system and domain with search needs. This paper introduces GSSP, a proposal for a generic semantic search platform. We present the platform and the steps that need to be followed for it to be used. We also describe some application examples that we are working on.
© 2012 Published by Elsevier Ltd. Selection and/or peer review under responsibility of CENTERIS/SCIKA - Association for Promotion and Dissemination of Scientific Knowledge.
Keywords: semantic search; generic search platform; free search criteria.
1. Introduction
The search for information has been an active line of study for several years, as quick and objective retrieval saves time that can then be applied to more productive tasks. However, finding information quickly and objectively is still a difficult task, as search engines have not yet found a way to provide useful results [1]. The issue of how quickly the information is delivered to the end user can be
* Corresponding author. Tel.: +351-258 819 700; fax: +351-258 827 636. E-mail address: [email protected].
2212-0173 © 2012 Published by Elsevier Ltd. Selection and/or peer review under responsibility of CENTERIS/SCIKA - Association for Promotion and Dissemination of Scientific Knowledge doi:10.1016/j.protcy.2012.09.043
Sara Paiva et al. / Procedia Technology 5 (2012) 388 – 396
addressed with a hardware intervention, creating more and better resources to process data faster for delivery to the end user. Objectivity, however, means delivering exactly what the user wants, no more and no less, and this requires a correct understanding of what is being asked of the search engine. As such, conventional search gives way to semantic search, characterized by a meaning-based approach [2].
There are currently several proposals in the literature regarding the possible architecture of a semantic search engine [1]. In this paper we present another one, called GSSP - a Generic Semantic Search Platform. It is based on its predecessor PRECISION [3] and is, at this point, optimized for document searches. GSSP was designed to suit any scenario where searches are helpful, with few configuration needs. GSSP intends to be easy to use, not only for the end user but also for the system that adopts it. For the adopting system, the only intervention required is the definition of the concepts to search and the data insertion itself. Everything else is generic to GSSP and adaptable to any system.
The rest of this paper is organized as follows: the next section presents a review of the related literature. Next we present GSSP, starting with an overview of the system and its architecture, followed by a detailed explanation of the four stages for its use in a specific domain: concept definition, data insertion, data expansion and search. Before the conclusions, we present some possible usage scenarios for GSSP.
2. Literature review
When reviewing the literature before introducing GSSP, several aspects need to be addressed. First and most important, the need for a generic system emerges because we did not find in the literature any system capable of being easily adopted for a wide range of domains.
What we mainly find are proposals for semantic search engine architectures, such as [3-5], where the authors present the components that the architecture should include but do not focus on how the system should be used or adopted for generic purposes. There are also several proposals for semantic search systems for specific domains such as images [6], video [7], educational information resources [8, 9], web portals [10] or traditional medical informatics [11].
It is also important to analyze specific aspects of search engines, such as the input methods for query construction. As stated in [12], several of these methods exist, such as free text input, controlled terms, operators or user feedback. In addition, we presented in [13] other methods such as navigation abilities, form-based input, self-defined syntaxes or guided search. We strongly believe that combining several of these methods is the best choice, as end users are more likely to be satisfied when they can choose the method they prefer.
Regarding the storage of ontologies, the typical OWL format is sometimes replaced by a standard database representation, as addressed in [14], so that typical query languages, such as SQL, can be used for information retrieval [15].
A fundamental aspect when dealing with semantic information relates to tags and keywords [16-18]. The better a given resource is classified in terms of tags or keywords, the more easily it can be found. Some authors argue that keywords need to have a context [16] and, as a consequence, need to be assigned automatically. Several problems are pointed out when tagging is manual [17]: the user may choose not to tag, which can compromise the whole search system, or the tags can become heterogeneous and subjective. Finally, stopwords [19] are another important point to consider, as they can interfere with resource retrieval in a semantic search system.
3. GSSP
3.1. General Overview
Before describing the GSSP architecture, it is important to note that it was designed with three main aspects in mind: 1) building a generic search platform regardless of the nature of the information; 2) providing an abstraction over
the ontology layer, as it introduces too many formalisms for non-IT people; and 3) reducing the configuration effort for a system that adopts it, to encourage non-IT systems to use it as well. The main points in the design of GSSP are the following: 1) an automatic mapping between the ontology and a database model, where data is inserted; 2) four distinct input methods for query formulation; 3) an OWL reasoner for detecting semantic validation errors in the input expression provided by the user; 4) synonym expansion through online resources; 5) a self-defined method of determining relevant information.
3.2. Architecture
In this section, we describe the system architecture through the possible use cases and the interactions between actors and the system, because we think this is the best way to understand the overall functionality of the system and how it works. Fig. 1 shows the actors that a scenario using GSSP might have, together with the use cases.
Figure 1. Use case diagram for GSSP
The order in which the use cases are presented is the order in which they should be executed, giving four different stages in the adoption of GSSP by an existing system. Each of the actors and use cases is put in context in Fig. 2, which shows the complete architecture of GSSP. Next we explain each of the use cases/stages from the perspective of a system that is adopting GSSP.
3.3. The stages
3.3.1. Stage I - Configure concepts
This is the first task that needs to be done, usually by the administrator of the system or someone who knows the type of information the system deals with. It does not necessarily have to be an IT person, although the task is easier for someone with this kind of knowledge.
Figure 2. GSSP architecture
The concepts inherent to ontologies were used, but we chose to create an abstraction over those concepts so that the platform would not expose too many formalisms, which could make it difficult to use. Therefore, the concept definition is kept in a database (shown in Fig. 3), as we do not need other characteristics of ontologies such as, for instance, inference. We identify this database as GSSP_BASE.
Figure 3. Ontology database model
The information that should be provided in this stage, as discussed in [20], is: 1) the concept designation; 2) whether the concept is searchable or auxiliary; 3) the characteristics of the concept (datatype properties); 4) the relations between concepts (object properties) and 5) the nature of each relation (functional
characteristic of object properties). To meet this need, a simple interface was developed whose main purpose is to collect, in an easy way, the information that is fundamental to the search process.
Automatic system database generation
The final model of the system can be generated automatically from the ontology model. This automatic migration matters because the system database must follow a specific model for the search process to be generic. If each system administrator could define their own database, an ad-hoc structure would result, which would invalidate the generic nature of the search process. Next, we describe this automatic generation process (AGP).
The database resulting from the AGP, which we identify as GSSP_AGP, has to be able to store information about 1) the data that is going to be the target of user searches, 2) the tags/keywords that describe each searchable resource and 3) synonyms for keywords. To improve the accuracy of the system and understand what users want even if they do not use the exact words the resources are tagged with, GSSP uses the notion of synonyms. As such, the database resulting from the AGP must support keywords for each resource and synonyms for each keyword. Additionally, to avoid unnecessary Internet accesses for converting a given input word into its word of origin, we store in a database table the association between a specific word and its corresponding word of origin. Finally, as stated above, concepts can be searchable or auxiliary. Therefore, a table must also exist to identify the tables that hold information about searchable concepts.
First, we start by showing the static part of the model generated by the AGP. The rest of the model that the AGP generates is based on well-known database concepts such as primary keys, foreign keys, and 1:1 and 1:N relations [21].
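The static part of the model described above can be pictured with a small relational sketch. The following Python snippet, which uses the standard sqlite3 module, shows one possible layout. Only SEARCHABLE is named in the text; every other table and column name (KEYWORD, SYNONYM, WORD_ORIGIN, ID_K, TABLE_NAME and so on) is an assumption made for illustration and is not taken from the GSSP implementation.

```python
import sqlite3

# Hypothetical layout of the static GSSP_AGP tables. Only SEARCHABLE is
# named in the paper; all other identifiers are illustrative assumptions.
STATIC_DDL = """
CREATE TABLE SEARCHABLE (
    TABLE_NAME TEXT PRIMARY KEY      -- tables that hold searchable concepts
);
CREATE TABLE KEYWORD (
    ID_K INTEGER PRIMARY KEY,
    WORD TEXT NOT NULL UNIQUE        -- keyword stored as its word of origin
);
CREATE TABLE SYNONYM (
    ID_K INTEGER NOT NULL REFERENCES KEYWORD(ID_K),
    WORD TEXT NOT NULL,
    PRIMARY KEY (ID_K, WORD)         -- several synonyms per keyword
);
CREATE TABLE WORD_ORIGIN (
    WORD   TEXT PRIMARY KEY,         -- word as typed or tagged
    ORIGIN TEXT NOT NULL             -- cached word of origin
);
"""


def create_static_model(conn):
    """Create the static tables in the given SQLite database."""
    conn.executescript(STATIC_DDL)


def cached_origin(conn, word):
    """Return the cached word of origin, or None if an online lookup is needed."""
    row = conn.execute(
        "SELECT ORIGIN FROM WORD_ORIGIN WHERE WORD = ?", (word,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_static_model(conn)
    conn.execute("INSERT INTO WORD_ORIGIN VALUES (?, ?)", ("estudantes", "estudante"))
    print(cached_origin(conn, "estudantes"))
```

The WORD_ORIGIN table plays the role of the local cache mentioned in the text: a word is sent to an online service only when `cached_origin` finds no entry for it.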
The following premises apply to the AGP:
AGP - Premise 1: Each row of the table GSSP_BASE.CONCEPT, which identifies a concept C with a given purpose P, generates a new table GSSP_AGP.T (where T matches GSSP_BASE.CONCEPT.DESCR).
AGP - Premise 2: GSSP_AGP.T is composed of one primary key (ID_T) and all the properties identified in GSSP_BASE.PROPERTIES as being related to the concept. At this point, a field named TAG is also required for the expansion process.
AGP - Premise 3: If P(T) is searchable then:
AGP - Premise 3.1: A new record is inserted in GSSP_AGP.SEARCHABLE.
AGP - Premise 3.2: A new table GSSP_AGP.T_KEYWORD is created to allow several keywords to be associated with each resource. This table is composed of two fields, which together form the primary key: ID_T and ID_K.
AGP - Premise 4: For each relation in GSSP_BASE.RELATION:
AGP - Premise 4.1: A new field is added to GSSP_AGP.T with the name GSSP_BASE.RELATION.DEST_CONCEPT. This field is a foreign key to the corresponding table.
AGP - Premise 4.2: If the relation is 1:N:
AGP - Premise 4.2.1: A new table named GSSP_AGP.T1 is created, where T1 matches GSSP_BASE.RELATION.DEST_CONCEPT. The table has one primary key named ID_T1 and a TAG field for the expansion process.
AGP - Premise 4.2.2: A new table GSSP_AGP.T1_KEYWORD is created to allow several keywords to be associated with each resource. This table is composed of two fields, which together form the primary key: ID_T1 and ID_K.
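To make the premises concrete, the sketch below turns a small in-memory concept definition into the SQL statements that the premises imply. It is only an illustration under several assumptions: the `Concept` and `Relation` classes stand in for rows of GSSP_BASE, all properties are typed as TEXT, the column TABLE_NAME of SEARCHABLE is hypothetical, a destination table is created by Premise 4.2.1 only when Premise 1 does not already produce it, and the DOCUMENT/PROCESS example is invented. Identifiers are interpolated without sanitization, which a real generator would need to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    dest_concept: str           # GSSP_BASE.RELATION.DEST_CONCEPT
    one_to_many: bool = False   # True when the relation is 1:N


@dataclass
class Concept:
    descr: str                  # GSSP_BASE.CONCEPT.DESCR, used as table name T
    searchable: bool            # purpose P: searchable (True) or auxiliary (False)
    properties: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def keyword_table(t):
    """Premises 3.2 and 4.2.2: T_KEYWORD with the composite key (ID_T, ID_K)."""
    return (f"CREATE TABLE {t}_KEYWORD (ID_{t} INTEGER, ID_K INTEGER, "
            f"PRIMARY KEY (ID_{t}, ID_K))")


def generate_agp(concepts):
    """Return the SQL statements that the premises imply for the given concepts."""
    declared = {c.descr for c in concepts}
    statements = []
    for c in concepts:
        t = c.descr
        # Premise 2: primary key, declared properties (typed TEXT here) and TAG.
        columns = [f"ID_{t} INTEGER PRIMARY KEY"]
        columns += [f"{p} TEXT" for p in c.properties]
        columns.append("TAG TEXT")
        # Premise 4.1: one foreign-key field per relation.
        for r in c.relations:
            d = r.dest_concept
            columns.append(f"{d} INTEGER REFERENCES {d}(ID_{d})")
        # Premise 1: one table per concept.
        statements.append(f"CREATE TABLE {t} ({', '.join(columns)})")
        if c.searchable:
            # Premise 3.1 (TABLE_NAME is an assumed column name).
            statements.append(f"INSERT INTO SEARCHABLE (TABLE_NAME) VALUES ('{t}')")
            # Premise 3.2.
            statements.append(keyword_table(t))
        for r in c.relations:
            # Premise 4.2: a 1:N destination gets its own table and keyword
            # table, unless Premise 1 already generates that table.
            if r.one_to_many and r.dest_concept not in declared:
                d = r.dest_concept
                statements.append(f"CREATE TABLE {d} (ID_{d} INTEGER PRIMARY KEY, TAG TEXT)")
                statements.append(keyword_table(d))
                declared.add(d)
    return statements


if __name__ == "__main__":
    doc = Concept("DOCUMENT", True, ["TITLE"], [Relation("PROCESS", one_to_many=True)])
    for statement in generate_agp([doc]):
        print(statement)
```

For the single searchable concept in the example, the generator yields five statements: the DOCUMENT table, its SEARCHABLE record, DOCUMENT_KEYWORD, and the PROCESS and PROCESS_KEYWORD tables required by the 1:N relation.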
3.3.2. Stage II – Data insertion
This stage is completely independent of GSSP, as the way data is inserted into the created database (DFP - data feed process) has to be defined by the IT person responsible for the system that is adopting GSSP. The goal is to populate all automatically created tables, with the exception of the keyword tables.
3.3.3. Stage III – Data expansion
The Data Expansion Process (DEP) is one of the most important in GSSP, as the quality of the synonyms is key to the system's responses. The richer they are, the more associations between words will be found and the more meaningful the results returned to the user who performed the query. So far, the system incorporates two online services for the Portuguese language from which information is expanded: 1) Priberam, a Portuguese dictionary, and 2) Thesaurus, a web service that provides synonyms in different languages.
The DEP operates on every searchable table and its related tables, namely on the TAG field (first mentioned in the AGP premises). The content of this field is split into words, stopwords are eliminated (for this purpose, we used the stopword list from [22]) and each word is submitted to Priberam to obtain its word of origin. For example, the word 'estudantes' (in English: students) is reduced to 'estudante' (in English: student). This could also be achieved with a stemming process, but we did not use one as we do not want the final stem of the word. As mentioned in the introduction, at this stage we are applying GSSP to document searches, and in this type of system document titles tend to use words like 'student' or 'payment' and not variations like 'studying' or 'paying', which would justify the use of stemming. When submitting each word to Priberam, we obtain its word of origin and also its synonyms.
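The expansion steps described so far (splitting the TAG field into words, removing stopwords, and looking up the word of origin and synonyms) can be sketched as follows. This is a minimal illustration, not the GSSP implementation: the `lookup` function and its tiny `DICTIONARY` are hypothetical local stand-ins for the online dictionary service, the stopword set is a made-up excerpt rather than the list from [22], and the in-memory `cache` plays the role of the word-of-origin table.

```python
import re

# Illustrative stand-ins only: a made-up stopword excerpt and a tiny local
# dictionary replacing the online service. No real service API is modelled.
STOPWORDS = {"a", "o", "e", "de", "da", "do", "das", "dos"}
DICTIONARY = {
    "estudantes": ("estudante", ["aluno"]),
    "pagamentos": ("pagamento", ["liquidação"]),
}


def lookup(word):
    """Hypothetical dictionary lookup returning (word of origin, synonyms)."""
    return DICTIONARY.get(word, (word, []))


def expand_tag(tag, cache):
    """Map each non-stopword of a TAG value to its origin and synonyms.

    `cache` remembers earlier lookups so that a word is only submitted
    to the (here simulated) dictionary service once.
    """
    expansion = {}
    for word in re.findall(r"\w+", tag.lower()):
        if word in STOPWORDS:
            continue
        if word not in cache:
            cache[word] = lookup(word)
        origin, synonyms = cache[word]
        # Repeated synonyms are eliminated and kept in a stable order.
        expansion[origin] = sorted(set(synonyms))
    return expansion


if __name__ == "__main__":
    cache = {}
    print(expand_tag("Pagamentos dos estudantes", cache))
```

Keeping the cache separate from the expansion result mirrors the design in the text: the stored word-to-origin association avoids repeated online requests, while the expansion itself is what feeds the keyword and synonym tables.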
With the word of origin, we make a submission to Thesaurus, which provides even more synonyms (repetitions are eliminated) to improve the system's capabilities. This information is used to populate the searchable and related tables and their associations to the keywords. At the end of the process, the keywords and synonyms of each resource can be viewed through the interface, so that some of them can be removed or new ones added.
3.3.4. Stage IV – Search
This stage comprises searching for resources and adding new keywords to a resource. We start by explaining the different types of search our system supports:
Free search - A free search, as its name indicates, is characterized by the freedom the user has in entering the keywords he or she wants. There are no limitations, restrictions or guidance. It is fundamental, in a type of
search like this, to have efficient criteria so that the results provided are relevant to the user. As such, we have defined the following criteria, which aim to return results with different levels of proximity to the entered keywords:
Criterion 1: The first criterion returns the resources whose description, defined in the searchable tables, contains the words used in the query (reduced to their origin).
Criterion 2: The second criterion returns the resources for which the description in tables related to the searchable ones contains the words used in the query (reduced to their origin).
Criterion 3: The third criterion returns the resources for which synonyms of the keywords in the description defined in the searchable tables match the words used in the query.
Criterion 4: The last criterion returns the resources for which the description in tables related to the searchable ones contains synonyms of the words used in the query.
Index search - This type of search shows the complete list of existing keywords and their synonyms. By clicking on one of them, the user can view the associated resources. The advantage for users is that they can navigate through concepts when they do not remember, or simply do not know, how to find what they need.
Hierarchical search - This type of search allows the user to navigate over resources hierarchically. The first level is composed of all searchable tables and, within each one, we show the relations. Each entry of the hierarchy shows the total number of resources.
Guided search - This type of search is not detailed here, as we dedicated some of our previous work to this theme [3, 13, 20, 23]. Basically, in a guided search the system helps the user construct the search expression so that it is clearly understood by the system, which can therefore answer objectively.
3.4. Applications of GSSP
Currently, we are applying GSSP to the Quality Management System † (QMS) of Viana do Castelo Polytechnic Institute (IPVC) and preparing its application to an Alzheimer information website. The QMS of IPVC is mainly about document searches, so the system is perhaps better optimized for this kind of search. The Alzheimer information portal mainly holds information on symptoms, definitions and experiences.
4. Conclusions and Future Work
In this paper we presented a generic semantic search platform, as we did not find any in the current literature. With the growth of these kinds of services, we believe it is important to make them easy to adopt by any system that has search needs. Therefore, we defined four stages: concept definition, data insertion, data expansion and search. The first stage allows the creation of an ontology with an
† http://sgq.ipvc.pt/
abstraction layer so that it can be defined by someone who is not completely at ease with these techniques. The data insertion process is completely outside the scope of GSSP, as it depends on the system that is adopting GSSP. For the data expansion process, we have already defined two components that access online services to obtain the word of origin and synonyms for every keyword inserted in the query. Finally, the search process has four types: free search, guided search, hierarchical search and index search. So far, we are using the prototype in the case study of a document search system, namely the Quality Management System. As a continuation of this work, we have two fundamental ideas: 1) to evolve to other types of systems so that we can add more search functionalities and capabilities and 2) to allow more relations between tables and reflect them in the search process. Also, applying the system in other domains gives us feedback to evolve our platform and increase its generality.
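As a closing illustration of the free search summarized above, the four criteria of Section 3.3.4 can be read as a tiered lookup in which each resource is labelled with the first criterion it satisfies. The sketch below is a hedged reading of those criteria, not the GSSP implementation. It assumes that a match on any single query word is enough, that query words and descriptions are already reduced to their word of origin, and that synonyms are stored per keyword in a plain dictionary. The resource records and their field names (`words`, `related_words`) are hypothetical.

```python
def free_search(query_words, resources, synonyms):
    """Return (resource id, criterion) pairs, ordered from criterion 1 to 4.

    query_words: set of query words already reduced to their origin.
    resources:   list of dicts with 'id', 'words' (description of the
                 searchable table) and 'related_words' (related tables).
    synonyms:    dict mapping a keyword to the set of its synonyms.
    """
    def synonym_match(words):
        # True when some description word and some query word are synonyms.
        return any(
            q in synonyms.get(k, set()) or k in synonyms.get(q, set())
            for q in query_words
            for k in words
        )

    matches = []
    for res in resources:
        own, related = res["words"], res["related_words"]
        if query_words & own:            # Criterion 1: own description
            level = 1
        elif query_words & related:      # Criterion 2: related tables
            level = 2
        elif synonym_match(own):         # Criterion 3: synonyms, own description
            level = 3
        elif synonym_match(related):     # Criterion 4: synonyms, related tables
            level = 4
        else:
            continue
        matches.append((res["id"], level))
    # sorted() is stable, so resources keep their input order within a level.
    return sorted(matches, key=lambda item: item[1])


if __name__ == "__main__":
    resources = [
        {"id": "doc1", "words": {"estudante", "pagamento"}, "related_words": set()},
        {"id": "doc2", "words": {"regulamento"}, "related_words": {"estudante"}},
        {"id": "doc3", "words": {"aluno"}, "related_words": set()},
        {"id": "doc4", "words": {"propina"}, "related_words": {"aluno"}},
        {"id": "doc5", "words": {"biblioteca"}, "related_words": set()},
    ]
    synonyms = {"aluno": {"estudante"}}
    print(free_search({"estudante"}, resources, synonyms))
```

Ordering the results by criterion number reflects the idea of different levels of proximity to the entered keywords: direct matches in the searchable tables come first and synonym matches in related tables come last.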
References
[1] Kassim, J.M. & Rahmany, M., 2009. Introduction to Semantic Search Engine. 2009 International Conference on Electrical Engineering and Informatics, (August), pp. 380-386.
[2] Tümer, D., Shah, M.A. & Bitirim, Y., 2009. An Empirical Evaluation on Semantic Search Performance of Keyword-Based and Semantic Search Engines: Google, Yahoo, Msn and Hakia. 2009 Fourth International Conference on Internet Monitoring and Protection, pp. 51-55.
[3] Paiva, S. et al., 2011. Precision: a Guided-Based System for Semantic Validation and Personalized Natural Language Generation of Queries. In Proceedings of the 29th International Conference on Consumer Electronics (ICCE 2011). Las Vegas, USA, pp. 503-504.
[4] Ponnada, M. & Sharda, N., 2007. Model of a semantic web search engine for multimedia content retrieval. In Computer and Information Science, 2007. ICIS 2007. 6th IEEE/ACIS International Conference on. IEEE, pp. 818-823.
[5] Ilyas, Q.M., Kai, Y.Z. & Talib, M.A., 2004. A conceptual architecture for semantic search engine. Multitopic Conference, pp. 605-610.
[6] Lv, C. et al., 2009. Image Semantic Search Engine. 2009 First International Workshop on Database Technology and Applications, pp. 156-159.
[7] Worring, M. et al., 2007. The MediaMill semantic video search engine. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, pp. 213-216.
[8] Chang-qin, H., Ru-lin, D. & Zhi-ting, Z., 2009. A semantic enabled intelligent search system for educational information resources. IT in Medicine &, pp. 539-544.
[9] Çelik, D., Elci, A. & Elverici, E., 2011. Finding Suitable Course Material through a Semantic Search Agent for Learning Management Systems of Distance Education. 2011 IEEE 35th Annual Computer Software and Applications Conference Workshops, pp. 386-391.
[10] Li, W. & Yang, C., 2008. A Semantic Search Engine for Spatial Web Portals. Geoscience and Remote Sensing Symposium, 2008. IGARSS 2008. IEEE International, pp. 1278-1281.
[11] Mao, Y. & Tian, W., 2009. A Semantic-Based Search Engine for Traditional Medical Informatics. 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, pp. 503-506.
[12] Slimani, T. & Yaghlane, B.B., 2001. An Overview of Search Methodologies in Semantic Web. World Wide Web Internet And Web Information Systems, pp. 1-8.
[13] Paiva, Sara, Ramos-Cabrer, M. & Gil-Solla, A., 2010b. Semantic Query Validation in Guided-Based Systems: assuring the construction of queries that make sense. In Proceedings of the 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2010). London, England.
[14] A, Y. & Pierra, G., 2006. Querying Ontology Based Database Using OntoQL (an Ontology Query Language). In Proceedings of Ontologies DataBases and Applications of Semantics ODBASE2006 Montpellier France Oct 31 Nov 2 2006. pp. 704-721.
[15] Pan, Z. & Heflin, J., 2004. DLDB: Extending Relational Databases to Support Semantic Web Queries. Design, (C), pp. 303-308.
[16] Hope, G., Wang, T. & Barkataki, S., 2007. Convergence of Web 2.0 and Semantic Web: A Semantic Tagging and Searching System for Creating and Searching Blogs. Semantic Computing 2007 ICSC 2007 International Conference on DOI 101109ICSC200795, pp. 201-208.
[17] Kalender, M., Dang, J. & Uskudarli, S., 2010. Semantic TagPrint - Tagging and Indexing Content for Semantic Search and Content Management. 2010 IEEE Fourth International Conference on Semantic Computing, pp. 260-267.
[18] Torres, D. et al., 2011. Semdrops: A Social Semantic Tagging Approach for Emerging Semantic Data. In 2011 IEEE/WIC/ACM International Conference on Web Intelligence WI 2011.
[19] Braga, I.A., 2009. Evaluation of Stopwords Removal on the Statistical Approach for Automatic Term Extraction. 2009 Seventh Brazilian Symposium in Information and Human Language Technology, (Icmc), pp. 142-149.
[20] Paiva, Sara, Ramos-Cabrer, M. & Gil-Solla, A., 2010a. Automatic query generation in guided systems: natural language generation from graphically built query. In Proceedings of the 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2010). London, England, pp. 165-170.
[21] Darwen, H., 2010. An Introduction to Relational Database Theory, BookBoon.com.
[22] Federal, U.S.P.U., Universidade, U. & Paulista, E., 2008. CSTNews: um Córpus de Textos Jornalísticos Anotados segundo a Teoria Discursiva Multidocumento CST (Cross-document Structure Theory) Priscila Aleixo Thiago Alexandre Salgueiro Pardo. Structure, pp. 1-12.
[23] Paiva, Sara & Ramos-Cabrer, M., 2009. Aplicação da pesquisa semântica, ontologias e sistemas de recomendação a portais governamentais. In Proceedings of the 9th Conference of the Portuguese Association of Information Systems. Viseu, Portugal.
[24] Wang, F. & Beijing, T., 2010. Personalized Semantic Search Agent Interaction Mode Research 1 ) Information Location rather than Information Personalized Semantic Search Agent Interactive Mode. In 2010 IEEE 11th International Conference on Computer-Aided Industrial Design & Conceptual Design. Beijing, China, pp. 1188-1193.