A Meta-Information Extractor for Interrogative Sentences
Cleyton Souza1, Joaquim Maia1, Luiz Silva1, Jonathas Magalhães2, Heitor Barros2, Evandro Costa3, and Joseana Fechine2
1 Federal Institute of Education, Science and Technology of Paraíba - IFPB, Monteiro - PB, Brazil.
2 Federal University of Campina Grande - UFCG, Campina Grande - PB, Brazil.
3 Federal University of Alagoas - UFAL, Maceió - AL, Brazil.
Abstract. The development of tools for Information Retrieval Systems or Expertise Finding Systems shares a common task: understanding the information need. Since the information need is usually expressed in natural language, its computational processing involves several NLP techniques. There is a vast set of such tools for English, but other languages have been marginalized. Thus, in this paper, we present a Web Service that offers clients NLP treatment for interrogative sentences written in Brazilian Portuguese. The Web Service receives a question as input and returns its meta-information. The main differential of our proposal is that we offer a full analysis of the question text through a single function. We evaluate a feature of the Web Service named Category Labeler, responsible for automatically discovering the subject of the question, and we found that it has a true positive rate higher than 50% (α=10%).
Keywords: Natural Language Processing, Brazilian Portuguese, NLP, Web Services
1
Introduction
In current society, information plays a major role. The lack of information is one of the main issues faced by people and organizations [3]. In this context, many areas of computing have as their main goal the resolution of information needs. The Information Retrieval (IR) area, for instance, deals with searching for documents that can satisfy an information need. Expertise Finding Systems (EFS) recommend experts who can offer support on a task or an issue. In all these cases, understanding the information need is an essential step. Since the information need is usually expressed in natural language, several techniques from the Natural Language Processing (NLP) area are frequently applied in order to represent the requests computationally [1]. Nowadays, there are many tools that support the application of these techniques [7]. However, there is still a gap in the literature regarding NLP tools for languages other than English, e.g., Portuguese. In addition, each tool usually provides access to only a small set of techniques, so many solutions must be combined in order to achieve a more complex and complete analysis.
In this paper, we present an NLP Web Service that offers a single function, specifically intended for the analysis of interrogative sentences written in Brazilian Portuguese. The input of the Web Service is a question and the output is the meta-information related to the question, such as topic, type, vector-model representation, etc. We are not proposing a novel NLP technique, but making available a tool that could be used for many ends, as discussed later.
This work is organized as follows: Section 2 presents the Related Work; Section 3 details our proposal; Section 4 shows the results of the evaluation of one of our meta-information extractors; and Section 5 ends the paper with Conclusions and Future Work.
2
Related Work
This is not the first work to provide NLP techniques via Web Service for a non-English language, nor the first to do so for Portuguese. Ogrodniczuk and Przepiórkowski [7] summarize and compare eight frameworks for NLP in a more complete review than we aim to provide in this section. Our related work analysis concentrates on NLP Web Services for non-English languages. Martínez et al. [6] describe a text handler that analyzes free text and outputs sentence boundaries, among other basic patterns. Their tool is meant for Catalan and Spanish, although the authors say that it could be extended to English. According to Martínez et al. [6], even basic functionalities are not equally available for every language. In addition, the authors propose a Web Service Description Language (WSDL) interface for text handling tools. Eryiğit [4] presented an NLP platform, the ITU Turkish NLP Web Service, that provides state-of-the-art NLP tools for the Turkish language. Users may communicate with the platform via three channels: (1) a Web interface, (2) file upload, and (3) a Web API. According to Eryiğit [4], the importance of providing such tools this way comes from the difficulty of sharing NLP resources among different people with distinct purposes and varying levels of computing background. Prokopidis et al. [8] provide an overview of the NLP technologies available from the Institute for Language and Speech Processing (ILSP). This NLP suite is exclusive to the Greek language and comprises a series of processing units based on both machine learning and rule-based approaches. According to the authors, given the large number of linguistic services and tools already developed by various organizations, it is imperative to make an NLP suite like theirs available as a Web Service that can be combined with services provided by other teams in larger processing workflows.
Branco et al. [2] report the development of a cluster of Web Services for NLP for Portuguese named LXService. This cluster includes a Sentence Chunker, a Tokenizer, a POS tagger, a Nominal featurizer, a Nominal lemmatizer, a Verbal featurizer and lemmatizer, a Verbal conjugator and a Nominal inflector, all available through four Web Services named LX-Inflector, LX-Lemmatizer, LX-Conjugator and LX-Suite. These tools provide basic NLP functions that can be further combined and expanded by other client applications. Branco et al. [2] is the closest work to ours; however, we consider our work an interesting complement to theirs. Specifically, Branco et al.'s work provides many basic functions, while we focus on providing a more complex NLP analysis that is specialized in interrogative sentences. This specialization results in at least two additional functions: (1) a Question-category labeler and (2) a Question-type labeler. Regarding the complex analysis, our aim is to perform, in a single operation, a full analysis of the question text in order to extract its meta-information, as detailed in Section 3. In addition to these tools, we want to highlight the Linguateca Project [5]. Its goal is to be a resource center of tools for the computational processing of Portuguese. It makes access to existing tools easier by aggregating information about them on its web site. The importance of such a space is huge, both for experienced researchers in computing and for students with little or no background in computing.
3
Proposal
Our proposal consists of a Web Service that extracts the following meta-information from a question using a single operation, although it is also possible to invoke the underlying Web Services separately. In the following, we describe each piece of meta-information.
– Vector-model Representation – It represents the question text using the vector model. In this representation, each coordinate of the vector represents a stem present in the question text and the value of the coordinate is the product of the TF and the IDF. To build this vector, we used the Lucene API. It provides many basic NLP functions, such as tokenization, stop-word removal and stemming. In addition, it has analyzers available for several languages, including Portuguese.
– Category Label – Each question has a subject. The category label is a single label that represents the question subject. The list of possible categories is pre-defined. Each category is also represented by a vector. We calculate the cosine similarity between the question vector and all category vectors. The category with the highest similarity is assigned as the Category of the question.
– Type Label – Questions can be classified according to their intention. Morris et al. (2010) proposed a classification into one of eight types: recommendation, opinion, factual knowledge, rhetorical, invitation, favor, social connection and offer. We represent the question using some descriptors. Next, we use a Naive Bayes classifier to automatically classify the question into one of these eight types.
– Category Similarities – It represents an array of tuples with the Category label plus the cosine similarity. Before we label the question with a category, we test the similarity between the question vector and every category vector in order to find the highest similarity. All these data are persisted in an array.
– Type Probabilities – It represents an array of tuples with the Type label plus the probability calculated by the Naive Bayes model. The output of the Naive Bayes classifier is a table of probabilities. The meta-information consists of this table.
– Expressions – It represents an array of tuples with the Entities mentioned in the question text plus a categorization of each entity (e.g., place, date, people, product, etc.). We use a NER tool for this extraction. However, we are still testing the efficacy of different tools with Brazilian Portuguese. Currently, we are working with Stanford NER and the OpenNLP API.
– Interrogative Particle – It represents the interrogative particle of the question. In English, it would be equivalent to what, who, where, how, etc. Before applying the stop-word removal, we search the question text for one or more of the Portuguese interrogative particles.
3.1
Design
Figure 1 presents the components of the proposed Web Service. These components are discussed below.
– Information Extraction Services (IES) Layer – This layer contains the functions used to extract the information from the question. Each function was also implemented as a separate Web Service, but can be invoked indirectly through our proposal. This separation leads to high cohesion and low coupling, which makes the code easier to maintain and reuse and also allows other applications to use each function individually. How the functions were implemented is described in the next section.
– Management Layer – The role of this layer is to manage all the other Web Services. In addition, it works like a Facade that provides sequential access to all Web Service functions. Basically, this layer has a single Web Service that defines the order in which each Web Service of the IES Layer is invoked and also invokes them. The input of this Web Service is a question and the output is the meta-information related to the question.
– Client Layer – This layer is responsible for invoking the functions of the Management Layer to get the meta-information from the question. Due to the dynamic nature of Web Services, these clients could be implemented on any platform or in any language, and could even be other Web Services.
3.2
Service Workflow
In this section, we briefly detail the sequential execution process of the Service Manager. Figure 2 is an Activity Diagram of the Service Manager, which is the Web Service responsible for invoking the others.
Fig. 1. Components of the Proposed Web Service.
First, the Service Manager receives the question as input. Next, it starts calling the remaining services in a chain. The first service to be called returns the Vector-Model representation of the question. It comes first because some of the other services, such as the Category Labeler, depend on this information. Next, the remaining Web Services are executed in parallel. Each one returns a different piece of meta-information. These data are combined and structured before the Service Manager returns them to the client. It is interesting to notice that the Category Similarities Service is invoked before the Category Label Service. Since the function of the former is to estimate how well the question fits each category, it must be executed first so that the label can be derived from its output without repeating the comparisons. The same can be said of the Type Probabilities Service and the Type Label Service. In the end, after all services have been executed, their results are checked for inconsistencies and the structured meta-information is returned in XML format. An example of such a file is presented in Figure 3.
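The workflow above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three service functions are hypothetical local stand-ins for the remote IES Web Services, the vector is a toy term count rather than TF-IDF over stems, and the XML element names are assumptions, since the actual schema appears only in Figure 3.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def tokens(question):
    # Toy tokenizer (assumption): lowercase, drop outer "?", split on spaces.
    return question.lower().strip("?").split()


def vector_service(question):
    # Hypothetical stand-in: raw term counts instead of TF-IDF over stems.
    counts = {}
    for token in tokens(question):
        counts[token] = counts.get(token, 0) + 1
    return counts


def category_service(vector):
    # Hypothetical stand-in: depends on the vector produced first.
    return "database" if "mysql" in vector else "unknown"


def particle_service(question):
    # Hypothetical stand-in: looks for a few Portuguese interrogative particles.
    particles = {"como", "qual", "quem", "onde", "quando", "quanto"}
    return [t for t in tokens(question) if t in particles]


def service_manager(question):
    # Step 1: the vector representation is computed first (others depend on it).
    vector = vector_service(question)
    # Step 2: the remaining services run in parallel.
    with ThreadPoolExecutor() as pool:
        category = pool.submit(category_service, vector)
        particles = pool.submit(particle_service, question)
        results = {"category": category.result(),
                   "particles": particles.result()}
    # Step 3: combine the results into a single XML document.
    root = ET.Element("question")
    ET.SubElement(root, "text").text = question
    ET.SubElement(root, "category").text = results["category"]
    ET.SubElement(root, "interrogativeParticle").text = " ".join(results["particles"])
    return ET.tostring(root, encoding="unicode")


print(service_manager("Como criar uma tabela no MySQL?"))
```

The facade keeps the ordering constraint (vector first) in one place, so a client only ever calls `service_manager`.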
Fig. 2. Activity Diagram of the Proposal.
4
Evaluation and Results
As our proposal has many functions and some of these functions are very basic, we concentrate our evaluation on the more complex features: the Category Labeler, the Type Labeler and the Named Entity Recognizer.
4.1
The category labeler
Methodology. To evaluate the category labeler function, we performed an experiment using two datasets. Each dataset is composed of a set of questions labeled with at least one tag (a question can have multiple tags), and the goal of the experiment is to confirm whether the category labeler is able to correctly assign one of these tags. We split each dataset into a set for training (30%) and another for testing (70%). The questions in the training set were used to build the vector for each category. A category vector is built by merging all questions of that category in the training set. Next, we compared each question vector of the test set with the category vectors using the cosine similarity. The category with the highest similarity was assigned to the question.
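The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the evaluated implementation: tokenization is a plain whitespace split (the paper uses Lucene for tokenization, stop-word removal and stemming), "merging" is interpreted as summing the TF-IDF vectors of the training questions in a category, the IDF formula is one common smoothed variant, and the toy questions and tags are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split with punctuation stripped.
    cleaned = (t.strip("?.,!") for t in text.lower().split())
    return [t for t in cleaned if t]


def build_idf(documents):
    # Smoothed IDF (assumed variant): log(N / df) + 1.
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # Terms unseen in training are dropped, so they cannot match any category.
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_category_vectors(training, idf):
    # Merge (here: sum) the vectors of all training questions per tag.
    vectors = {}
    for question, tag in training:
        vectors.setdefault(tag, Counter()).update(tfidf(question, idf))
    return vectors


def label(question, category_vectors, idf):
    # Returns (best tag or None, all similarities); None means "left out".
    qv = tfidf(question, idf)
    scores = {tag: cosine(qv, cv) for tag, cv in category_vectors.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


# Toy training data (invented for illustration).
training = [("Como criar uma tabela no mysql?", "banco-de-dados"),
            ("Como declarar uma classe em java?", "java")]
idf = build_idf([q for q, _ in training])
cats = build_category_vectors(training, idf)
print(label("Qual o tipo de uma tabela mysql?", cats, idf))
```

Returning the full similarity table alongside the label mirrors the Category Similarities meta-information; a question whose similarities are all zero gets no label, which corresponds to the "left out" case.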
Fig. 3. Meta-information XML File.
The first dataset was built with questions from Stack Overflow in Portuguese⁴. The other dataset was built with questions from Forro Square⁵, a regional CQA site for questions about São João, a traditional party in Brazil that happens in June. Table 1 details both datasets. Both datasets are available for download⁶.
Table 1. Details of the datasets.
Dataset          # Questions   # Tags
Stack Overflow   501           60
Forro Square     65            8
Results. Table 2 summarizes the results for both datasets.
Table 2. Results.
Dataset          True Positive   False Positive   Left Out   Amount of Questions   True Positive Rate
Stack Overflow   267             223              11         501                   53%
Forro Square     42              19               4          65                    64%
Discussion. As can be seen in Table 2, the Category Labeler achieved a True Positive rate of 53% for the Stack Overflow dataset and 64% for the Forro Square dataset. The Left Out column represents the number of questions to which the Category Labeler was not able to assign a category because the cosine similarity was equal to 0 in all comparisons. We used a one-tailed (right) binomial test to confirm these results statistically. The test showed that the Category Labeler achieved a rate statistically higher than 50% with a confidence level of 95% (p-value 0.01241) in the classification of questions from the Forro Square dataset. Since this dataset has far fewer questions than the Stack Overflow dataset, it is not too difficult to achieve such results. However, it makes some interesting tendencies easier to observe. For instance, as we said, a question can have multiple tags. In this scenario, if the Category Labeler assigns any of its tags, we count it as a hit. However, as the tags were first assigned by users, we are assuming that they are correct; in a small dataset like this one, we observed the over-use of some tags and also, sometimes, their incorrect use. As our training is based on assigned tags, it is expected that questions from the same category share a common vocabulary, but the over-use and incorrect assignment of tags make this harder.
Regarding the Stack Overflow dataset, the Category Labeler also achieved a True Positive rate statistically higher than 50%, but at a confidence level of 90% (p-value 0.07637). These results are quite impressive considering that the Stack Overflow dataset has 60 different tags (Table 1). The higher the number of tags, the more difficult it is for the Category Labeler to correctly assign one, since its decision is based solely on the cosine similarity comparison. In addition, the tags from the Stack Overflow dataset had closely related subjects. For instance, there was a tag called mysql, but there were also tags like foreign key, database and table-database⁷. So, the possibility of a common vocabulary across tags makes the task of the Category Labeler more difficult. However, this was not enough to interfere with the results, since the results for both datasets were similar.
Threats to Validity. We identify two types of threats to these results, regarding External and Construct validity. External validity refers to the generalizability of the results. Although we evaluated the Category Labeler using datasets from two different domains (with two groups of unrelated tags) and found similar results with both of them, we cannot assume that this will repeat across different domains. However, we are planning another case study using more generic categories like those used on Quora⁸ or Yahoo! Answers⁹. It may seem odd not to use Yahoo! Answers in this study but, at the time when we were planning this evaluation, the Yahoo! Answers API had unfortunately been discontinued¹⁰. Construct validity defines how well a test or experiment measures up to its claims and relates to problems in the design of the experiment. Regarding construct validity, we built two datasets to use in the experiment; however, we did not curate or control the data in them. Thus, there is no intended balance in the number of questions with each tag, nor any guarantee of the accuracy of the assigned tags taken as ground truth.
⁴ http://pt.stackoverflow.com/
⁵ http://forrosquare.lsd.ufcg.edu.br/index.php
⁶ https://docs.google.com/file/d/0B4ZS_d4fhCZVdVdpZEw2LW04enc/
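As an illustration of the one-tailed binomial test mentioned in the Discussion, the sketch below computes the exact right-tail probability against a 50% rate from the counts in Table 2. It assumes that every question in the dataset (including the left-out ones) counts as a trial, which is our reading of the reported rates; the function name is hypothetical and the original analysis tool is not stated in the paper. Under that assumption the values should land close to the reported p-values.

```python
from math import comb


def binomial_right_tail(successes, trials, p0=0.5):
    # Exact P(X >= successes) for X ~ Binomial(trials, p0).
    return sum(comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
               for k in range(successes, trials + 1))


# (true positives, amount of questions) taken from Table 2.
counts = {"Stack Overflow": (267, 501), "Forro Square": (42, 65)}

for name, (hits, total) in counts.items():
    rate = hits / total
    p_value = binomial_right_tail(hits, total)
    print(f"{name}: rate={rate:.3f}, p-value={p_value:.5f}")
```

A p-value below 0.05 supports the claim at the 95% confidence level, while a p-value between 0.05 and 0.10 supports it only at the 90% level.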
5
Conclusion and Future Work
In this paper, we present a Web Service for NLP processing of interrogative sentences written in Brazilian Portuguese. This tool performs, with a single function, a full analysis of the question text and returns an XML document containing all the meta-information extracted. In [9], we have been using such analysis for Expertise Finding and Hint Recommendation. However, we believe that there are many other applications, such as information retrieval and information extraction. So far, we have validated one of the features of our proposal, the Category Labeler, a Web Service responsible for assigning a label that categorizes the subject of the question. The experiment with two widely different datasets showed that this feature has a true positive rate of over 50%. The interesting part is that the Category Labeler reached such results in a dataset with 60 possible tags for a question. As future work, we plan to evaluate other features of our proposal that are still in development, such as the Named Entity Recognizer and the Type Labeler. In addition, we plan to develop other tools as clients of this Web Service, such as a Chatter Bot able to offer answers or an Information Extraction System that looks in documents for the answers to questions asked in natural language.
⁷ These tags were translated to English, but they are in Portuguese in the dataset.
⁸ http://www.quora.com/
⁹ http://answers.yahoo.com
¹⁰ https://developer.yahoo.com/answers/
6
Acknowledgements
We want to thank the Laboratory of Distributed Systems (LSD) of the Federal University of Campina Grande (UFCG) for providing the Forro Square dataset.
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology Behind Search. Addison Wesley, 2011.
[2] A. Branco, F. Costa, P. Martins, F. Nunes, J. Silva, and S. Silveira. LXService: Web services of language technology for Portuguese. In LREC 2008, 2008.
[3] M. Castells. The Rise of the Network Society: The Information Age: Economy, Society, and Culture. Information Age Series. Wiley, 2010.
[4] G. Eryiğit. ITU Turkish NLP Web Service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–4. Association for Computational Linguistics, 2014.
[5] Linguateca. Processamento computacional do português. http://www.linguateca.pt/proc_comp_port.html, Jan. 2003. Retrieved February, 2015.
[6] H. Martínez, J. Vivaldi, and M. Villegas. Text handling as a Web Service for the IULA processing pipeline. In Proceedings of the Workshop on Web Services and Processing Pipelines in HLT: Tool Evaluation, LR Production and Validation (WSPP 2010) at the Language Resources and Evaluation Conference (LREC 2010), pages 22–29, 2010.
[7] M. Ogrodniczuk and A. Przepiórkowski. Linguistic processing chains as web services: Initial linguistic considerations. In Proceedings of the Workshop on Web Services and Processing Pipelines in HLT: Tool Evaluation, LR Production and Validation (WSPP 2010) at the Language Resources and Evaluation Conference (LREC 2010), pages 1–7, 2010.
[8] P. Prokopidis, B. Georgantopoulos, and H. Papageorgiou. A suite of NLP tools for Greek. In The 10th International Conference of Greek Linguistics, Komotini, Greece, 2011.
[9] C. Souza, J. Magalhães, E. Costa, J. Fechine, and R. Reis. Enhancing the status message question asking process on Facebook. In B. Murgante, S. Misra, A. Rocha, C. Torre, J. Rocha, M. Falcão, D. Taniar, B. Apduhan, and O. Gervasi, editors, Computational Science and Its Applications – ICCSA 2014, volume 8582 of Lecture Notes in Computer Science, pages 682–695. Springer International Publishing, 2014.