Folksonomy in a Software Component Search Engine

0 downloads 0 Views 360KB Size Report
The Java class source code, which is considered in Maracatu as a .... A relevant work involving search engines and component retrieval is [9]. ... Architecture”, In the 9th International Symposium on Component-Based Software Engineering (CBSE. 2006), Lecture Notes in Computer Science (LNCS), Mälardalen University, ...
Folksonomy in a Software Component Search Engine – Cooperative Classification through Shared Metadata Taciana Amorim Vanderlei, Vinicius Cardoso Garcia, Eduardo Santana de Almeida, Silvio Romero de Lemos Meira Federal University of Pernambuco Recife Center for Advanced Studies and Systems (C.E.S.A.R.) Recife – PE - Brazil {tav, vcg, esa2, srlm}@cin.ufpe.br

Abstract. This paper presents the use of folksonomy concepts in a software component search engine as an alternative to improve the performance of the search engine, covering the tool specification, design and implementation. A set of requirements to perform component search and retrieval with folksonomy are presented, followed by the description of search using this concept in the engine. The engine’s current version combines folksonomy, text mining, and facet-based search techniques.

1. Introduction The high cost of software development dictates that software components and related artifacts like requirements, design models, source code, and test plans should be reused whenever possible [1] once reuse can increase time-to-market and quality and reduce costs. However, there are several problems that must be resolved in order to achieve systematic reuse and one of these is the problem of component search and retrieval [1]. In order to promote reuse, sometimes it is viable for organizations to maintain a large reusable software component library that requires an efficient method for retrieving components satisfying given requirements [1]. However, the majority of the automated methods for retrieving software are not completely satisfactory for finding relevant components [2]. The search mechanisms must focus on intuitive ways to classify and identify the components with low costs. Moreover, they should be based on information that is familiar to the software engineers. In this way, they can find the appropriate components during the software development without necessarily having knowledge about the contents of the repository. In this scenario, different techniques should be experimented to address the aforementioned drawbacks. Folksonomy1 can be the key for a distributed classification system with initial low costs, since folksonomies are maintained by users. Folksonomy, combining “folk” and “taxonomy” [3], refers to a collaborative but unsophisticated way in which information is categorized on the web, for the most part. Instead of using a

1

Flickr (http://www.flickr.com/) and Del.icio.us (http://del.icio.us/) are two of the best known examples of social software using Folksonomy. The basic idea of these is simply to make people share respectively web bookmarks and photos annotated with tags.

centralized classification scheme, users are encouraged to freely assign chosen keywords (called tags) to pieces of data, in a process known as tagging. In this paper, an improved version of Maracatu [4], combining the original text and facet search with folksonomy is discussed. Such addition improves component search precision and recall and its specification and design are presented herein. The remainder of this paper is organized as follows: Section 2 discusses the new features and technical requirements. Section 3 presents Maracatu architecture with folksonomy features. Section 4 describes the implementation of the current stage of Maracatu. Related works are presented in Section 5 and, finally, Section 6 presents some concluding remarks and directions for future work.

2. Features and Technical Requirements Folksonomy is the practice of allowing anyone – especially consumers – to freely attach keywords or tags to existing objects. In folksonomies, we should have three distinct data elements [5]: (i) the tag, pieces of information separate from, but related to, an object; (ii) a clear understanding of the object being tagged; and (iii) the identification of the user doing the tagging. These elements allow the users to tag information and objects in order to increase the recall. That is the most important aspect, because using a known vocabulary can help the future recovery of the tagged objects by different users. Additionally, other requirements for an efficient component search and retrieval engine using folksonomy are: a. Integration of different search techniques. The engine should use the folksonomy technique combined with traditional classification schemes to improve the precision of the search. b. Search by tags. It should be possible to discover all items from all users that match a specific tag [6]. Moreover, the engine should support the discovery of items tagged from specific users that match the tag. d. User authentication. The authors authorized to add tags to components should be registered and identified on the engine. This is necessary to aid user identification on the tag in registration and to identify author’s groups with the same understanding and knowledge of a domain, helping the search. e. Database persistence. Tags, related to a specific component and author, should be stored in a persistent database for posterior reuse. This means that data can be accessed at any time by the tool. f. Concurrency control. The engine should manage simultaneous accesses to the database. Moreover, it must prevent two users from tagging the same component at the same time and also must be concerned with transactions for backup and recovery. g. High precision, recall and efficiency. The engine should present a high precision, recovering the most relevant components; with a high recall, few relevant components are left behind; and high efficiency refers to the time and space required by a retrieval method [4].

3. Maracatu Search Engine Architecture The previous version of Maracatu combined text mining and facet-based classification scheme of search [4]. The Java class source code, which is considered in Maracatu as a white-box component, can be indexed by a mechanism that accesses distributed CVS open project repositories, indexing and ranking these artifacts using the Lucene search engine [7]. Maracatu’s architecture is based on the client-server model, and uses Web Services technology [8] for message exchange between the subsystems. Maracatu Service is responsible for indexing the components, in background, and responding to the user queries. The Eclipse plug-in act as a Web Service client to access the server to retrieve components according to developer formulated queries. In order to improve search precision, we propose to complement the Maracatu classification approaches with folksonomy concepts. Figure 1 shows the new Maracatu architecture. Next the Figure 1. Maracatu Architecture additional features and the reasons for them are discussed. The text mining and facet-based classification schema in Figure 1 represents the old version of the architecture and is detailed in [4]. Thus, the new modules are: Folksonomy classifier. This module is responsible for performing the persistence of the tags in a database structure. In the classification stage, the tags are stored in a XML file with related components and authors. The XML language was chosen because of its simplicity and ability for interoperability. It will allow extending Maracatu in a flexible way to add features such as the option to do searches and associate tags to components via web browser and to store ontology for the tags of the folksonomy search technique. Folksonomy search. The search is performed through folksonomy usage and can be combined with text mining and facet-based search techniques. The result of the folksonomy search is merged with the results of the others two search techniques which utilize the Lucene search engine. Author’s tags, on the search, can be chosen only if the tag field was filled by the user performing the query; otherwise it will be disabled, since it does not make sense to find components associated with an author without tags. Component rank. This module is responsible to rank the search results according to the number of times the tags searched appear on the database (XML file). In case of draw, the components are ranked according to how closely each component matches the query. This rank is valid if the folksonomy search is used; otherwise the

components are ranked according to the Lucene ranking algorithm that utilizes the same criteria to rank when draws occur in folksonomy-based searches. User identification. This module is responsible for maintaining the list of users of the engine in an XML file, at the server side. Any developer, at the client side, must be identified with login and password before tagging or performing a search.

4. Maracatu Search Engine Implementation The folksonomy version of Maracatu can be seen in Figure 2, which shows the Maracatu plug-in2 being used in the Eclipse development environment. The window on the left shows the association of tags to a specific component by an author.

Figure 2. Maracatu plug-in screens

To add tags to a specific component the user must first inform which component is being tagged. Next, the text area (1) will be enabled with the author’s tags related to the specific component, if any for updating, or a blank text area to insert new tags. Each tag should be separated by blank spaces. Below the text area, are presented the lists: (i) Tags suggestions (relevant tags in the component), (ii) Top tags (most used tags) and (iii) Top author tags (most used author tags), addressed to help the author in choosing the tags that fit better in the component. Checking a specific tag corresponds to add it to the text area. The window on the right shows the new search mechanism, where the developer may choose combinations of folksonomy, facets and keywords while performing queries such as: “retrieval all Networking components that are identified by the author Taciana Amorim as an infrastructure or driver and developed for the J2EE Platform in the EJB Component Model”. The list of authors with the same domain understanding is shown at the upper right corner of the screen (2) to aid the choice of the author field in the search. The identification of these groups is determined by the authors who have tagged

2

The current version of the plug-in may be obtained on the project site http://done.dev.java.net

a certain number of components using the same tag as the user (author) performing the search. The Tag Cloud area (3) presented on the lower right corner of the search screen is another useful option during the search. Tags registered by the authors are displayed in alphabetical order and with different colors according with their popularity. By selecting a tag in the Tag Cloud area, a collection of components associated to that tag are displayed in the text area (4). Other option is filling the Tag field by selecting a number of tags within the tag cloud using the CTRL key. The current version of the search engine implementation contained 120 classes, divided into 57 packages, with 6688 lines of code (comments not included). The previous version of Maracatu had 106 classes, 55 packages and 3844 lines of code.

5. Related Work A relevant work involving search engines and component retrieval is [9]. Ye et al. present CodeBroker, a search engine that marked the main changes in the active information delivery research area. Differently from Maracatu, which searches for components when requested, the engine autonomously delivers components that are relevant to the programming task at hand and that are unknown to the programmer. The contributions of this work are important because the authors investigate different issues related to component reuse, despite the fact that the search functionality uses just signature matching without considering any semantic information. However, it does not reflect users’ domain, an important functionality provided by folksonomy. Other important work is Krugle3, a search engine that uses tags to identify code snippets. This work follows the same direction of our search technique, relating components with tags, but does not identify authors’ tags. Additionally, this tool only provides a web interface, what makes it difficult to be used by developers. Enterprises have also started blogging and experimenting with folksonomies. An example is IBM's Intranet that serves 315,000 IBM employees worldwide in different languages and with multiple roles and information needs [10]. They started experimenting with folksonomy to keep information updated and organized following their users’ personal way of accessing the system. Analyzing the literature [2] we can observe that there are different ways to store and retrieve software components from repositories. However, our engine is the first one to search components involving cooperative classification and reflecting users’ domain.

6. Concluding Remarks and Future Works A criticism folksonomies often face is that they encourage idiosyncratic tagging, increasing the complexity of content retrieval. The phenomena that can present problems include polysemy (or lexical ambiguity), words which have multiple related meanings; synonym, multiple words with the same or similar meanings and plural words. Those who prefer top-down taxonomies/ontologies argue that an agreed set of tags enables more efficient indexing and searching of content [11]. Still, the idea of

3

Krugle - http://www.krugle.com/product/

socially constructed classification schemes (with no input from an information architect) is interesting to improve repository effectiveness, producing results that reflect more accurately the population’s conceptual model of the information. Folksonomies are not the solution to every modern problem of classification and they must not be considered as alternatives to the traditional classification schemes librarians have designed over the years, but a complement to them. They should be applied only under the right circumstances and considering their own specific properties and the differences in respect to other classification schemes as taxonomies and controlled vocabulary. In this paper, we presented the use of folksonomies in Maracatu, a component search engine, covering the tool specification, design and implementation. The initial work shows that this technique can be useful and stimulate future research involving cooperative classification and reflecting users’ domain. Previous experiments in Maracatu indicate that using text mining search and facet-based search is better than using them separately [4]. We believe that folksonomy combined with others search techniques can increase the precision of the results. As a future work an experiment using text mining, facet-based and folksonomy search will be conducted to verify this assumption.

References [1] Podgurski, A. and Pierce, L. (1993) “Retrieving Reusable Software by Sampling Behavior”, ACM Transaction on Software Engineering and Methodology, Vol. 02, No. 03, July, pp. 286-303. [2] Lucrédio, D., Almeida, E. S., Prado, A. F. (2004) “A Survey on Software Components Search and Retrieval”, In Steinmetz, R., Mauthe, A., eds.: 30th IEEE EUROMICRO Conference, ComponentBased Software Engineering Track, Rennes - France, IEEE/CS Press, pp. 152–159. [3] Mathes, A., (2004) “Folksonomies – Cooperative Classification and Communication Through Shared Metadata”, Computer Mediated Communication - LIS590CMC, December. [4] Garcia, V. C., Lucrédio, D., Durão, F. A., Santos, E. C. R., Almeida, E. S., Fortes, R. P. M., Meira, S. R. L. (2006) “From Specification to the Experimentation: A Software Component Search Engine Architecture”, In the 9th International Symposium on Component-Based Software Engineering (CBSE 2006), Lecture Notes in Computer Science (LNCS), Mälardalen University, Västerås, Sweden. [5] Wal, T. V., (2006) “Online Information Folksonomy Presentation Posted”, January, available at: http://www.personalinfocloud.com/2006/01/online_informat.html (Accessed on April 17, 2006). [6] Quintarelli, E. (2005) "Folksonomies: power to the people", Proceedings of the 1st International Society for Knowledge Organization, UniMIB Meeting, June 24, Milan, Italy, ISKOI, Italy, available at: http://www.iskoi.org/doc/folksonomies.htm (Last accessed on April 17, 2006). [7] Hatcher, E., Gospodnetic, O. (2004) “Lucene in Action”, In Action series. Manning Publications Co., Greenwich, CT. [8] Stal, M. (2002) “Web services: beyond component-based computing”, Communications of ACM Vo0l. 45, No. 10, pp. 71–76. [9] Ye, Y., and Fischer, G. (2002) “Supporting reuse by delivering task-relevant and personalized information”, In ICSE 2002 - ICSE, Orlando, Florida, USA, pp. 513–523. [10] (2005) “IBM's Intranet and Folksonomy”, March, available at: http://thecommunityengine.com/home/archives/2005/03/ibms_intranet_a.html (Accessed on April 17, 2006). [11] Bruijin, J. (2003) “Using ontologies”, Tecnical report, Digital Enterprise Research Institute.