Procedia Information Technology & Computer Science
00 (2013) 000-000
3rd World Conference on Information Technology 2012
Specialized Information Retrieval in the Context of the Chemical Textile Domain
Carolina PRIETOᵃ,¹, Javi FERNÁNDEZᵇ, Elena LLORETᵇ and Manuel PALOMARᵇ
ᵃ AITEX, Technological Textile Institute, Plaza Emilio Sala Nº1, Alcoy 03801, Spain
ᵇ University of Alicante, Alicante, Spain
Abstract
Given the large amount of available information, it is impossible to keep up to date on a daily basis without taking advantage of Natural Language Processing tools. This article analyzes the use of two domain-specific Information Retrieval systems applied to the Chemical Textile domain. The aim is to study whether search engines and virtual observatory systems are appropriate for retrieving specific information in this domain. To this end, we develop a specialized search engine, study it together with an existing alert system (a virtual observatory system), and compare both with two widespread general-purpose alert systems, Google Alerts and Yahoo! Alerts. The results show that the specialized search engine is the most appropriate tool for professionals because it is able to retrieve information that is of interest for the studied domain. Moreover, several limitations encountered with the chosen systems are discussed, and possible solutions are suggested for further work.
Keywords: Information Retrieval, Alert Systems, Chemical Textile domain, Search engine, Virtual Observatory
Selection and/or peer review under responsibility of Prof. Dr. Dogan Ibrahim.
©2012 Academic World Education & Research Center. All rights reserved.
¹ * ADDRESS FOR CORRESPONDENCE: Carolina Prieto Ferrero, AITEX, Technological Textile Institute, Plaza Emilio Sala 1, 03801 Alcoy, Spain. E-mail address: [email protected] / Tel.: +0034-96-554-2200
1. Introduction and motivation
With the rapid growth of the Web, professionals are often faced with large amounts of information and find it difficult to search for what is relevant and useful. To process this information, we can take advantage of Natural Language Processing (NLP) tools, which allow us to retrieve, extract, classify and summarize the information that is useful for a given domain. Information Retrieval (IR) is one of the areas within NLP. In particular, professionals working in the Chemical Textile domain, who deal with a lot of information every day, could use such tools to increase their performance in the workplace. To the best of our knowledge, the Chemical Textile domain has few specialized resources.

In previous work [8], we carried out a preliminary evaluation of the appropriateness of general-purpose alert systems for finding the specific information that professionals in this field need for their day-to-day activities. We concluded that this type of general system has a number of limitations when applied to specific domains. Therefore, in this paper we focus on specialized resources (a domain-specific alert system and a specialized crawler) in order to analyze whether they are more useful and help to overcome some of the limitations encountered.

Besides general-purpose search engines and IR systems, several works in the literature address the retrieval of relevant information in a very specific domain. Examples are BioPatentMiner [7], a system that facilitates IR from biomedical patents, and MedSearch [5], a specialized medical Web search engine that uses several specific techniques (e.g., tf.idf) to improve its usability and the quality of its search results.
In [10], the information retrieved by a patent IR system in the Chemical domain is evaluated, leading to the conclusion that domain-specific search engines may be more appropriate for retrieving information of interest when focusing on a specific domain. Restricted domains have been addressed not only with IR systems but also with different crawling techniques [4]. Focused web crawlers identify whether a URL or a document is relevant to a specific domain, and prioritize and analyze such items accordingly, using more advanced techniques such as ontologies [1,6].

The aim of this paper is to analyze to what extent different IR tools are appropriate when the domain of application is very restricted and specific, as is the case of the Chemical Textile domain. In particular, we develop two IR systems, a specialized search engine and a specific alert system, and analyze their application in this domain to see which of them is more appropriate for finding specific information.

2. Specialized Information Retrieval Systems for the Chemical Textile domain
In this section we explain the two Information Retrieval systems used for the study of the Chemical Textile domain: a specialized search engine and a Virtual Observatory.

The search engine has been developed to help perform queries about the Chemical Textile domain. An expert in the domain selected a restricted set of web sites to be included in the system. The documents from these web sites are downloaded using the crawler [2] developed by the DLSI department² at the University of Alicante³. The search engine is based on a modified version of Lucene⁴. In this version, document terms are analyzed using a stemmer (i.e., Snowball⁵), but both the stem and the original term are indexed. In this way, we can retrieve a larger number of documents while always giving more weight to those containing the original words. The system also favors precision, offering a smaller number of results but with higher reliability. Additional features have been included, such as the prioritization of recent documents, duplicate removal, and automatic grouping of the results (clustering) using Carrot2⁶ for faster navigation through the results list.

² http://www.dlsi.ua.es/
³ http://www.ua.es/
⁴ http://lucene.apache.org/
⁵ http://snowball.tartarus.org/

The Virtual Observatory⁷ is an alert system developed at the University of Alicante. As with the search engine, a group of experts in the Chemical Textile domain selects a set of sources to be checked periodically. These sources are mainly RSS feeds but also include generic web pages. When these sources publish new content, the system extracts the new information and sends it to the subscribed users as alerts in a daily e-mail. The challenge at this point is to decide which alerts are relevant to which users. First, the experts create a set of categories of interest in the specific domain. Second, they select a set of documents and assign them to those categories. Then, the system uses these documents as examples and learns how to classify new documents automatically. This learning relies on Machine Learning techniques, specifically the Weka⁸ [3] implementation of the Support Vector Machines algorithm, chosen for its good performance in text categorization tasks [9].

3. Experiments
To perform the study, an expert in the Chemical Textile domain defined 4 groups of terms of different granularity: generic terms, specific terms, compound terms and multiword expressions applied to the Chemical Textile domain. All the terms were in English, and these groups were chosen because they are relevant to this domain. Most of them are found on legislation web sites such as REACH⁹, CPSC¹⁰ or OEKO-TEX¹¹. For our experiments, we chose 6 terms for each group¹². The evaluation was performed using the previously mentioned tools.
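As an aside on the tools just mentioned, the indexing scheme of the specialized search engine (Section 2), in which both the stem and the original term are indexed and documents containing the original words receive more weight, can be illustrated with a minimal Python sketch. This is not the modified Lucene code used in the system: the `toy_stem` function is a crude stand-in for Snowball, the `DualIndex` class is hypothetical, and the two weights are assumed values chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumed weights: an exact (original-term) match counts more than a stem match.
EXACT_WEIGHT = 2.0
STEM_WEIGHT = 1.0


def toy_stem(token):
    """Crude suffix stripper used as a stand-in for Snowball (not its algorithm)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class DualIndex:
    """Hypothetical inverted index that stores original terms and stems separately."""

    def __init__(self):
        self.exact = defaultdict(lambda: defaultdict(int))  # term -> doc_id -> count
        self.stems = defaultdict(lambda: defaultdict(int))  # stem -> doc_id -> count

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.exact[token][doc_id] += 1
            self.stems[toy_stem(token)][doc_id] += 1

    def search(self, query):
        """Return (doc_id, score) pairs, best first; exact matches score higher."""
        scores = defaultdict(float)
        for token in tokenize(query):
            for doc_id, count in self.exact.get(token, {}).items():
                scores[doc_id] += EXACT_WEIGHT * count
            for doc_id, count in self.stems.get(toy_stem(token), {}).items():
                scores[doc_id] += STEM_WEIGHT * count
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = DualIndex()
index.add("d1", "flame retardants in textiles")
index.add("d2", "flame retardant finishing of cotton")
results = index.search("retardants")
print(results)
```

In this toy run both documents are retrieved through the shared stem, which widens recall, while the document containing the exact query word is ranked first. A real deployment would rely on the search library's own analyzers and scoring rather than on raw counts.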
The results obtained are compared with those provided by the general-purpose alert systems for the same terms [8]. The assessment consisted of counting the number of interesting documents returned by each system. For this, an expert in the Chemical Textile domain individually evaluated each of the retrieved documents and classified it as interesting or uninteresting for that domain. Table 1 shows the overall percentages of interesting documents retrieved by the general-purpose alert systems, i.e., Google Alerts and Yahoo! Alerts, compared to the search engine. These percentages are calculated as the number of retrieved documents classified as interesting in each group divided by the total number of documents retrieved in the same group.

Table 1. Overall percentage of interesting documents for each group of terms retrieved by the different Information Retrieval systems
Term group              Google Alerts   Yahoo! Alerts   Search Engine
Generic Terms           3.9%            9.7%            71.2%
Specific Terms          1.8%            12.3%           77.67%
Compound Terms          50%             20.5%           69.47%
Multiword expressions   0%              83.3%           82.40%

⁶ http://project.carrot2.org/
⁷ http://en.ovtt.org/alerts
⁸ http://www.cs.waikato.ac.nz/ml/weka/
⁹ http://www.reachinnova.com
¹⁰ http://www.cpsc.gov/about/cpsia/cpsia.html
¹¹ http://www.oekotex.com
¹² http://intime.dlsi.ua.es/papers/wcit2012091101.html
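The percentages in Table 1 follow the ratio defined above: interesting documents divided by retrieved documents within a group. The short Python sketch below reproduces that calculation for the only raw counts reported in this paper (Yahoo! Alerts and the search engine for multiword expressions, and the Virtual Observatory alerts). The function name `interesting_percentage` is an illustrative choice, and the handling of an empty result set is an assumption, since the paper does not state how that case was treated.

```python
def interesting_percentage(interesting, retrieved):
    """Return the share of retrieved documents judged interesting, in percent.

    Returns None when nothing was retrieved, because the ratio is undefined
    (an assumption; the paper does not describe this case).
    """
    if retrieved == 0:
        return None
    if not 0 <= interesting <= retrieved:
        raise ValueError("interesting must be between 0 and retrieved")
    return 100.0 * interesting / retrieved


# (interesting, retrieved) counts taken from the text of this paper.
counts = {
    "Yahoo! Alerts, multiword expressions": (5, 6),
    "Search engine, multiword expressions": (117, 142),
    "Virtual Observatory, all alerts": (15, 40),
}

percentages = {
    name: interesting_percentage(interesting, retrieved)
    for name, (interesting, retrieved) in counts.items()
}
for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}%")
```

Because the ratio ignores how many documents were retrieved, a percentage based on 6 documents and one based on 142 documents are not directly comparable, so the raw counts are worth reading alongside the table.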
In these results we can observe that it is difficult for general-purpose alert systems to retrieve information in a specific domain. A frequent problem is the ambiguity of the terms. For instance, terms such as lead or flame have other meanings and, as a consequence, these systems cannot distinguish which meaning we refer to. Despite this, Yahoo! Alerts, as a generic IR system, is more accurate and has more coverage than Google Alerts.

Regarding the specialized search engine, we notice that it performs better than the general-purpose alert systems. This is because the search engine is domain specific and is therefore capable of retrieving more interesting information for the Chemical Textile domain. Only for multiword expressions is the percentage for the search engine slightly lower than for Yahoo! Alerts (82.40% versus 83.3%). This is due to the fact that Yahoo! Alerts retrieved only 6 documents, 5 of which were interesting. In contrast, our search engine retrieved 142 documents, 117 of which were interesting, so it retrieves much more information than Yahoo! Alerts.

Concerning the Virtual Observatory, the results obtained were lower than expected. The number of alerts retrieved within the studied period of time was very low, 40 alerts in total, of which only 15 were of interest. We believe that, in this case, it may be necessary to extend the period of time over which this system is analyzed in order to obtain more conclusive results.

4. Conclusion and Future Work
In this paper, we developed and studied two specific Information Retrieval systems for the Chemical Textile domain: a specific search engine and a virtual observatory. These systems can be of great help to experts who deal daily with large amounts of information pertaining to this domain.
For the experiments, we compared their performance with the results obtained in previous work for general-purpose alert systems that are accessible to any user, using the same terms. The specialized search engine is a good retrieval system that retrieves interesting information for professionals and users. Moreover, most of the information it retrieves is interesting, as shown by the results obtained. With this system, the problem of ambiguity that affects some terms when using generic alert systems disappears. Concerning the Virtual Observatory, its results were not very satisfactory, since it did not retrieve a large quantity of information. It may be necessary to extend the period of time over which this system is analyzed, in order to see whether it can retrieve more sites, and to analyze their usefulness.

Despite the encouraging results, several limitations have been encountered with regard to the chosen systems. For example, the Virtual Observatory sometimes retrieves information that is not directly related to the topic of interest. Therefore, as future work we plan to build a specific ontology for the Chemical Textile domain that can be applied to IR systems to improve the search results. We propose to repeat the analysis with the Virtual Observatory alert system over a longer period of time (6 months), using the ontology of the Chemical Textile domain. With the integration of the ontology in this system, we could check whether it is possible to retrieve more specific and relevant information for our domain.
Acknowledgements
This research work has been funded by the Spanish Government through the project TEXT-MESS 2.0 (TIN2009-13391-C04) and by the Valencian Government through projects PROMETEO (PROMETEO/2009/199) and ACOMP/2011/001.

References
[1] Naresh Chauhan, Nisha Pahal, and A. K. Sharma. Context-Ontology Driven Focused Crawling of Web Documents. Pages 121–124, 2007.
[2] Javi Fernández, J. M. Gómez, and Patricio Martínez-Barco. Evaluación de sistemas de recuperación de información web sobre dominios restringidos. Procesamiento de Lenguaje Natural, 45(0):273–276, 2010.
[3] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update, volume 11. 2009.
[4] Maryam Hazman. A Survey of Focused Crawler Approaches. Journal of Global Research in Computer Science, 2012.
[5] Gang Luo, Chunqiang Tang, Hao Yang, and Xing Wei. MedSearch: a specialized search engine for medical information retrieval. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pages 143–152, New York, NY, USA, 2008. ACM.
[6] Hiep Phuc Luong, Susan Gauch, and Qiang Wang. Ontology-Based Focused Crawling. 2009 International Conference on Information, Process, and Knowledge Management, pages 123–128, February 2009.
[7] Sougata Mukherjea and Bhuvan Bamba. BioPatentMiner: an information retrieval system for biomedical patents. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, VLDB '04, pages 1066–1077. VLDB Endowment, 2004.
[8] Carolina Prieto, Elena Lloret, and Manuel Palomar. Análisis de la Calidad de la Información Recuperada por Sistemas de Alertas en el dominio Químico Textil. II Spanish Conference on Information Retrieval, 2012.
[9] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, March 2002.
[10] Jianhan Zhu and John Tait. A proposal for chemical information retrieval evaluation. In Proceedings of the 1st ACM workshop on Patent information retrieval, PaIR '08, pages 15–18, New York, NY, USA, 2008. ACM.