An Approach to support Web Service Classification and Annotation Marcello Bruno, Gerardo Canfora, Massimiliano Di Penta, and Rita Scognamiglio
[email protected],
[email protected],
[email protected],
[email protected] RCOST - Research Centre on Software Technology University of Sannio, Department of Engineering Palazzo ex Poste, Via Traiano 82100 Benevento, Italy
Abstract The need for supporting the classification and semantic annotation of services constitutes an important challenge for service–centric software engineering. Late–binding and, in general, service matching approaches, require services to be semantically annotated. Such a semantic annotation may require, in turn, to be made in agreement to a specific ontology. Also, a service description needs to properly relate with other similar services. This paper proposes an approach to i) automatically classify services to specific domains and ii) identify key concepts inside service textual documentation, and build a lattice of relationships between service annotations. Support Vector Machines and Formal Concept Analysis have been used to perform the two tasks. Results obtained classifying a set of web services show that the approach can provide useful insights in both service publication and service retrieval phases. Keywords: Ontology Building, Service Classification, Semantic Annotation
1 Introduction One of the most relevant advantages of service–centric software engineering is the possibility a developer has to build his/her own system as a composition of one or more abstract services, i.e., semantic descriptions that can be matched at run–time with the description of one or more concrete services. The subsumption relationship between an abstract service and the concrete services is completed by means of matching algorithms integrated in the service broker [18]. The choice of the actual concrete service to bind to an abstract service can also consider concrete services’ Quality of Service (QoS) attributes. The above described scenario requires that each service must have a semantic description, according to a specific
ontology1 . Service semantic annotation is, however, a difficult task that, given the actual state–of–the–art, is often too expensive to be done in practice. Also the building and maintenance of ontologies requires expertise and budgets not always available. Unfortunately, very often the only source of information available is a pure–textual description of the service, sometimes extracted from source code comments. During service publication, it would be therefore useful to exploit this form of textual information to: • permit the automatic classification of services to be published according to the broker’s service ontological classification; • support building and maintenance of domain–specific ontologies; • aid the semantic annotation of a service with respect to the ontology. By detecting concepts inside the service textual documentation, it would be possible to see how the service concepts can be identified in the ontology, and how the service can be cataloged with respect to other existing services. The usefulness of a semi–automatic support for service classification and annotation is not limited to service publication phase. In fact, it can also be used during service retrieval. Let us suppose that a service integrator is querying (sending a free–text query) the broker to search for a service performing a particular task. Such an automatic classification mechanism can be applied to free-text queries to: • identify the category (or the scored list of categories) in which any service matching the query can be found; 1 There is work investigating the possibility of matching between services described with different ontologies. This aspect, however, is out of scope for this paper and will not be further considered.
• ease the browsing among the available services, once the service integrator chooses a category. As it will be clearer later, the relationships between different services belonging to a specific domain can be represented, with some simplifications, using a concept lattice. Thus, it would be useful to develop a mechanism able to identify the lattice region in which the service the integrator is searching for could be found. This paper proposes an approach that, starting from service textual description, performs an automatic classification (to catalog services across specific domains, such as telecommunications, finance, etc.), and then identifies service key concepts and their relationships as a concept lattice. The approach relies on Support Vector Machines (SVM) [22] and Information Retrieval (IR) Vector Spaces [10] for service classification, and uses Formal Concept Analysis (FCA) [23] to build concept lattices from service descriptions. The results showed that, even if a totally automatic construction of the lattice is not feasible, FCA still gives aids and useful insights to help the publisher annotating the service and, when necessary, maintaining the ontology. The remainder of the paper is organized as follows. First, Section 2 provides an overview of the related literature and available tools. Then, Section 3 describes the proposed approach and its application scenarios. The first results obtained are presented and discussed in Section 4. Finally, Section 5 concludes.
2 Related Work Text classification has seen a great deal of success with the application of several studies addressed towards machine learning [12, 15, 19, 24]. Among the many learning algorithms, SVM [22] appears to be most promising. The first application of text classification using SVM has been presented by Joachims [12]. The results were also confirmed by different other studies [12, 24]. Joachims et al. [13] developed a theoretical learning model of text classification for SVMs, which provides some explanation about SVMs performance in text classification. Di Lucca et al. [8] compared the effectiveness of different IR and machine learning approaches for classifying software maintenance requests. As a result, SVM outperformed other approaches. The manual construction and maintenance of specific– domain ontologies is an expensive and complex work, requiring significant waste of effort and time, as well as a detailed knowledge of the domain to be modeled. Fridman Noy at al. [17] describe the knowledge model of Protege 2000 [3], an ontology–editing and knowledge-acquisition environment. Tao [21] developed a Protege 2000 FCA–
based plug–in for the building and maintenance of ontologies. Up to now, some work has been done in the field of automatic support for ontology building. An example of using FCA in ontology merging has been proposed by Maedche et al. [16]. However, few papers investigated the possibility of using FCA in ontology building and structuring. Cimiano et al. [7] discuss how FCA can be used to support ontology engineering and how ontologies can be exploited in FCA applications. They present the method “FCA-Merge” for merging ontologies following a bottom–up approach. HeleMai Haav [11] presented an approach, based on Natural Language Processing (NLP), for the automatic or semiautomatic discovery of domain-specific ontologies from free text. Kim and Compton [14] propose an ontology browsing mechanism relying on FCA and incremental knowledge acquisition mechanisms. JBraindead Information Retrieval System [9] combines a free–text search engine that uses FCA to organize the results of a query. This work showed that conceptual lattices can be very useful to group relevant information in a free–text search task.
3 Approach Description As stated in the introduction, the proposed free–text service classification approach aims at accomplishing a twofold task: i) perform the automatic classification of a service description, i.e., determine to which category/domain a service belongs to; ii) locate a service description in a concept lattice. The remainder of this section will explain in details the three steps of the classification approach, depicted in Figure 1.
Figure 1. The service classification approach
3.1 Text Preprocessing The first step aims to preprocess service textual descriptions. Textual description of web services might be in the form of Web Service Description Language (WSDL) documents, coming from UDDI registries, as well as any other textual document provided as a documentation support for the service itself. Words are extracted from documents and then preprocessed. Successively, words are filtered by means of a stop–list, and normalized. The stop–list contains articles, prepositions, and in general words that are frequent in each query, and therefore not discriminant (”web”, ”service” or ”SOAP”). During the stemming phase, verbs are brought back to infinitive, plurals to singulars, etc. using the Wordnet dictionary [5] and its Java API.
3.2 Service Classification The classification of services into domain–specific classes is performed using the SVM method. In our implementation, the freely available LIBSVM tool [1] was used. As stated in the introduction, automatic service classification both serves during service publication (to classify the new service) and service retrieval (to identify the class(es) where to restrict the focus of the query). In this section’s context, both web service documentation and user queries are considered as a textual description to be classified (represented as grey arrows in Figure 1). Prior to apply SVM, sequences of words, obtained in the previous preprocessing phase, must be mapped onto vectors. In our approach, the mapping is achieved using IR techniques. Each element of the vector corresponds to a word (or term) in a vocabulary extracted from the documents themselves. All words are weighted with tf-idf metric. In this way, each document is mapped onto a vector using an injective function. The whole document set is encoded in a matrix, where rows represent documents (vectors) and columns are the weighted words. No information about the position or the meaning of the words is used, i.e., no semantic is known using this matrix. A classification task using supervised algorithm such as SVM or ANN requires a training set. In other word, our SVM needs to be trained with a pre–classified set of documents. This produces a ”model matrix” that will be used to “predict” to which class the document to be classified belongs to.
3.3 Building the Concept Lattice Once classified the service, or re–directed the query to a specific domain, key concepts need to be extracted from the service/query2 and their lattice needs to be built. Clearly, 2 From
this point we will refer both as “service” indistinctly.
such a lattice only represents a simplification of a domain ontology. FCA advantage comes from the way it shows how the presence or absence of attribute distinguishes objects, i.e., by means of super–concept/sub–concept relationships. A concepts lattice can well represent services names and keywords belonging to a specific domain, highlighting “isA” relationships between concepts and attributes. Without loss of generality, let us suppose we want to build a concept lattice from a set of service descriptions. First and foremost, we need to identify discriminant words, useful for the lattice. To this aim, we use the idf metric to eliminate words that do not appear in at least two or more, depending on the number of documents/services belonging to that domain/class. More formally: A service context is a triple C=(S, K, I) where S is a set of service names (the objects), K is the set of service description keywords (the attributes), and I the binary relationship which indicates the presence or absence of words into documents. The obtained lattice3 may be used to identify concepts for a specific domain, as well as the relationships between services belonging to a class. Such a lattice aids a service publisher when providing service semantic annotation, in that it tries, starting just a textual description, to discover the service “hidden semantic”.
4 Empirical Study To validate and gain insights about the usefulness of the proposed approach, we performed an experiment aiming to classify a set of web service documentations, and to build lattices for services belonging to some particular classes/domains. Results are presented and discussed in this section.
4.1 Case Study Description Getting a suitable and extensive case study for experiments dealing with web services is still a challenge. Although, at the time of writing, several UDDI registries exist and are available for querying, too often the set of services obtained is almost useless. Even the service are trivial, or their documentation is dummy. We used, as a case study, a set of pre–classified services available on the net [4] and downloaded from some UDDI registries. Such a set was composed of 205 services, classified in 11 classes, representing domains such as news, weather, credit card, etc. As said, each service is provided with a short description, extracted from the WSDL tag. For example, a Credit Card Web Service presents a description as follows: 3A thorough example is available http://www.rcost.unisannio.it/mdipenta/tr-ca.ps.
online
at
”Will accept and validate a Credit Card Number. Returns True for a Valid Number and Returns False for a Invalid Number”
4.2 Service Classification Results SVM service classification performances were measured using the leave–one–out validation [20]: each document (vector) in the set (matrix) was classified using a SVM model built using the remaining ones, and the percentage of correct classifications was measured. We found that different performances can be achieved by properly calibrating the SVM parameters, namely the kernel function and the gamma parameter. Other than the first classification obtained for each service, we let the SVM find alternative classification, for which we also measured the correct classification ratio, incrementally with respect to the first classification ratio. This permits to obtain, for each service, an ordered list of classes, to which the service has the highest likelihood to belong, according to our classification algorithm. In other words, the algorithm ensures that the service belongs, with a given likelihood, to the first class, with an higher likelihood to one of the two first classes, etc. For our case study, precision for the first score position is of about 63%. This is not that high (up to 84% was obtained for software maintenance ticket classification by Di Lucca et al. [8]), however reasonable, considering the quality and quantity of the training set, and that the approach was able to classify across 11 classes. When looking further, to best– two and best–three class scores, we found that, clearly, performance increases, respectively to 73% and 83%. In conclusion, although the approach could not always suggest the correct class, at least is able to indicate a limited group of classes to which the service could belong.
4.3 Building Service Concept Lattice After classification was performed and thus services belonging to each category identified, we built concept lattices for each category using FCA. As described in Section 3, words having high idf were pruned, in that they are considered not relevant for building the concept lattice. Figure 2 shows a lattice obtained applying FCA to documents belonging to the “Credit Card” class. Each node in the lattice can show both concepts and objects. Concepts appear in the high part of the node, while objects in the low. In our case, concepts represent sets of keywords, while objects represents services. Generic concepts (referred as top concepts) are placed in the high part of the lattice (card, credit, Visa, Mastercard, etc.). The lattice easy permits to find, for example, services that both support Visa or Mastercard (service11 and service6). service6 is considered to
be more specific than service11 because, according to its description, it can validate credit card numbers, while service11 does not advertise such a feature. Going further in our lattice analysis, it can be noted that service16 can be used to validate credit cards and it seems to be more specific than service1. This reflects what is specified in their description: • service1: “credit card (whole text is: Offering Loans And Credit Cards to Consumers)” • service16: “accept card credit validate valid (whole text is ”Will accept and validate a Credit Card Number. Returns True for a Valid Number and Returns False for a Invalid Number”)” Note that word relevance depends on the term document frequency. If a word appeared only in one document, it has been pruned (”loans”, ”consumer”, ”invalid”). Other words have also been stopped. According to the proposed approach, new concepts are added to the lattice when terms appear in more than one document. Therefore, if a new service description containing the word consumer is used to expand the lattice, a new concept will be added, and the lattice structure will change. A further example of lattice building, related to the mail domain, is reported in a technical report available online [6].
4.4 Discussion Results obtained for service classification showed how the approach can be useful. It appears evident that the automatic classification helps, identifying, with a likelihood of 83%, the ordered list of 3 out of 11 classes among which the service may be classified. The service publisher can exploit this result from several points of view. First and foremost, he/she can accept one of the classifications proposed by the automatic tool, possibly manually refining the choice. In this case, the tool helps in reducing the publisher classification degrees of freedom across a limited number of service classes/domains. It may happen that the proposed classes may be completely different than publisher expectancies. If a “weather” service is classified in the “finance” or “mail” domain, it means that the service description may be ambiguous. In this case the tool raises the publisher’s attention to the problem, highlighting the need for correcting the deployed service documentation. This will reduce the risk that, during service search, the service is never found by queries related to its own features and, instead, it is found by queries related to other kind of features. Regarding concept lattice building, it appears immediately clear that a completely automatic ontology or semantic annotation building is unfeasible. This, however, was not
Figure 2. Concept lattice of services: the credit card example
our purpose. Instead, we found that, while human supervision and intervention cannot be avoided, useful insights can be obtained from service lattices. In fact, by highlighting relationships between services, it can help to build and refine the service semantic annotation. By looking to the lattice, the publisher can found that some keyword may simply make the service annotation heavier, or even misleading. Thus, it can be decided to remove or replace these keywords. When a service developer publishes some services, he/she is aware of the genericity/specificity of the services. If this is not reflected in the lattice, it means that service descriptions are misleading or incomplete. More generally, if the web service textual description is incomplete, too generic or containing phrases not properly related to the service features, it may be hard to automatically classify it and to find the correct position in the concept lattice. Much in the same way, let us suppose that some domain– specific services have been already published and semantically annotated according to a specific ontology, and that we want to publish a new service. By using proper tools, such as the FCA plug-in for Protege 2000 [2], it can be possible to extract a context from the ontology. If we add a row, representing our service keywords, to such a context, and
then we build a lattice using FCA, we will be able to immediately highlight how our service can be annotated with respect to the ontology. The second consideration we can make about the usefulness of these concept lattices is related to ontology building. As stated in the introduction, service annotations coherent with ontologies may be necessary to allow automatic service matching for late–binding. The concept identification made by FCA, as well as the lattice structure of these concepts, although giving a limited view of an ontology, can indeed be useful for its building, completion or maintenance. In fact, when publishing new service, new concepts may need to be added, especially if the ontology is not yet complete. Conversely, when a user performs a query to retrieve a service, the following scenario can happen. First and foremost, the user is guided, by the SVM classifier, to focus on some particular domains. In these domains, the portion of lattice of interest is highlighted, significantly easing the service search. Finally, our studies suggested that lattices appear to be useful when focusing to well restricted domains. Wide domains and upper ontologies would generate, in fact, unmanageable and difficult to understand lattices.
5 Conclusions This paper presented an approach, based on machine– learning techniques, to support service classification and annotation. Starting from free–text service documentation, services are automatically classified in classes/domains using Support Vector Machines. Successively, Formal Concept Analysis is used to build service concept lattice for each specific domain. Results of a classification experiment on a set of 205 services downloaded from the web shown the feasibility of the approach. Although needing user guidance, automatic classification, by proposing the nearest–three classes out of 11 with a likelihood of 83%, can ease and support the service publication and annotation. Much in the same way, the obtained concept lattices highlighted relationships existing between services, and aided the identification of domain key concepts. Finally, we showed with some examples how the same approach can also be integrated in the service retrieval mechanism. Work–in–progress is devoted to further improve the proposed technique, to confirm the obtained results with other case studies, and to integrate the approach in a service broker we are developing in a project in cooperation with an Italian software company.
[12]
[13]
[14]
[15]
[16] [17]
[18]
[19]
References [1] [2] [3] [4] [5] [6]
[7]
[8]
[9]
[10]
[11]
Libsvm tool. http://www.csie.ntu.edu.tw/ cjlin/libsvm/. Plugin for protege–2000. http://www.ntu.edu.sg/. Protege–2000. http://protege.stanford.edu/. Textual documentation of web services and classified services. http://moguntia.ucd.ie/repository/ws2003/. Wordnet dictionary. http://www.cogsci.princeton.edu/∼wn/. M. Bruno, G. Canfora, M. Di Penta, and R. Scognamiglio. An approach to support web service classification and annotation. Technical report, RCOST - University of Sannio, Italy, Sep 2004. S. P. Cimiano, J. Staab, and Tane. Deriving concept hierarchies from text by smooth formal concept analysis. In Proceedings of the GI Workshop Lehren Lerner -Wissen Adaptivitat (LLWA), 2003. G. Di Lucca, M. Di Penta, and S. Gradara. An approach to classify software maintenance requests. In Proceedings of IEEE International Conference on Software Maintenance, pages 93–102, Montr´eal, QC, Canada, Oct 2002. P. W. Eklund, editor. Browsing Search Results via Formal Concept Analysis: Automatic Selection of Attributes, volume 2961/2004 of Lecture Notes in Computer Science. Springer, feb 2004. W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992. H.-M. Haav. An application of inductive concept analysis to construction of domain-specific ontologies. In B. Thalheim and G. Fiedler, editors, Emerging Database Research
[20]
[21] [22] [23] [24]
in East Europe, Proceedings of the Pre-Conference Workshop of VLDB 2003, volume 14/03 of Computer Science Reports, pages 63–67. Brandenburg University of Technology at Cottbus, nov 2003. T. Joachims. Text categorization with support vector machines: learning with many relevant features. European Conf. Mach. Learning, ECML98, Apr. 1998. T. Joachims. A statistical learning model of text classification for support vector machines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 128–136, 2001. M. Kim and P. Compton. Formal concept analysis for domain-specific document retrieval systems. Lecture Notes in Computer Science, 2256, 2001. D. D. Lewis. Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, US, 1992. E. Maedche and G. Stumme. FCA-MERGE: Bottom-up merging of ontologies. Jan. 06 2001. N. F. Noy, R. W. Fergerson, and M. A. Musen. The knowledge model of Prot´eg´e-2000: Combining interoperability and flexibility. Lecture Notes in Computer Science, 1937, 2000. M. Paolucci, T. Kawamura, T. R. Payne, and K. Sycara. Semantic matching of web services capabilities. In First International Semantic Web Conference (ISWC 2002), volume 2348, pages 333–347. Springer-Verlag, June 2002. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002. M. Stone. Cross-validatory choice and assesment of statistical predictions (with discussion). Journal of the Royal Statistical Society B, 36:111–147, 1974. G. Tao. Using formal concept analysis for ontology structuring and building. PhD thesis, 1992. V. N. Vapnik. Statistical Learning Theory. John Wiley, Sept. 1998. B. G. R. Wille. Formal Concept Analysis. Mathematical Foundations, Springer Verlag, 1999. Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42–49, Berkeley, US, 1999. ACM Press, New York, US.