
eColabra - An Enterprise Collaboration & Reuse Environment

NGITS’99

Orit Edelstein, Avi Yaeli, Gabi Zodik
IBM Haifa Research Lab, MATAM, Haifa 31905, Israel
{orit, aviy, gabi}@vnet.ibm.com

Abstract. In this paper we present a new methodology for achieving reuse and knowledge sharing by applying information retrieval techniques to OO resources while exploiting OO language semantics and characteristics. The methodology can cut development and maintenance costs and also reduce the “time to market”. Unlike traditional reuse processes and infrastructures, our methodology requires extremely low investment, leading to a win-win solution. This is achieved by an environment based on autonomous tools that require no support while implementing our approach. We further describe novel techniques for combining free text and attribute-based searches and show how they improve search precision. In addition, the eColabra server collects valuable statistics that can serve as the organization’s knowledge management infrastructure.

1 Introduction

The ability to develop new applications (in particular Web-based applications) in a short time is crucial to the success of software companies that need to compete aggressively in today's market. Because software technologies emerge very fast and change on a daily basis, this becomes an even more complicated task. For this reason it is vital to share and reuse the knowledge and programming experience gained inside development organizations in an efficient and productive manner.

Previous attempts have been made at solving the reuse problem. Most solutions rely on radical changes to the developer’s programming habits, software engineering processes and software architectures [Griss 97]. Organizations have to establish a reuse infrastructure, including a component library in which developers have to register their code in specific ways. These attempts have failed in most cases, mainly because of the overhead that these solutions introduced and the additional cost required for developing reusable components and maintaining the component libraries.

Other projects, such as jCentral [Watson 97] [jCentral 98], attempted to create a global network reuse environment on the WWW. These projects focused on the Java language only and assumed that Java programmers share their code on HTTP servers. The latter assumption is still valid; however, these days we find more and more Java developers inside our organizations, something that was not true two or three years ago. Given this observation, we should first exhaust the knowledge and reuse of source code inside our organization before seeking it outside on the Web. In addition, inside the organization’s intranet we no longer look for information on HTTP servers only; our main target for information gathering is any shared file system. The advantages are obvious: no royalty is required, the quality of the code meets the organization’s standards, and sharing the same class library can reduce the cost of code maintenance.

These are the main reasons that have driven us to establish eColabra, an enterprise collaboration and reuse environment. eColabra supports a large variety of programming languages in addition to Java: it already supports Java, C++, C, OO-COBOL, and COBOL. Moreover, the same ideas apply to structured documents as well, and our system already supports structured and free text searching through HTML and XML documents.

We believe that within the enterprise, code reuse can and should be achieved in more ways than traditional black box reuse. We perceive knowledge sharing as an additional important form of reuse, in particular the reuse of specific programming knowledge and the reuse of design and design patterns. With eColabra one can automatically assemble a repository with a simple classification scheme that does not require manual registration or quality control, while providing advanced retrieval and visualization mechanisms that facilitate locating and integrating reusable code resources. We have embodied our approach in eColabra, a costless environment for enterprise collaboration and reuse, which we describe below.

2 eColabra - The Proposed Solution

We have developed eColabra, an environment for reuse, by building upon previous attempts at solving the reuse problem that applied information retrieval techniques to software components in general [Frakes & Nejmeh 87] [Maarek et al. 91] or to OO classes [Helm & Maarek 91] [jCentral 98], and by extending them to a large variety of programming languages and structured documents in order to create one large reusable repository. We take advantage of highly precise information retrieval indexing methods [IMT 98] and graph visualization techniques (e.g., [Maarek et al. 97] [Zernik & Zodik 97]) in order to customize reuse during its two basic stages: the classification stage, by using a programming-language-specific indexing and classification scheme when adding selected resources to the repository, and the retrieval stage, by displaying the reuse candidates in a meaningful and novel way.

In addition, the system includes annotation and notification services. The notification service releases developers from the burden of monitoring the eColabra repository for updates; instead, eColabra provides automatic notification on registered items. Annotations are feedback collected by eColabra from developers. Once the repository includes a sufficient number of annotations, users will be able to specify more detailed queries that take advantage of the additional information encapsulated in these annotations.

2.1 Reuse Categories

We consider many aspects of code reuse beyond traditional reuse, which focused mainly on black box reuse. In particular, one must consider another major benefit of code reuse, namely “time to market”. The time to market will be reduced no matter which of the following reuse methods is used, although maintenance and development costs may not be reduced in all cases. In many situations the economic advantage of being first in the market can have a tremendous impact on the future of the organization and create opportunities for much higher profits.

We have classified the various reuse types into the following six categories in order to emphasize the different potential cost savings of the various reuse methods. The list starts with the best case, i.e. reuse of code and design with frameworks.

1. Frameworks - reusing code and design.

2. Black box reuse - using the module (class, component) or a class library as is.

3. Collaborated modules and component sharing - a module being productively maintained by one team and used by several teams.

4. White box reuse - browsing through the code in order to reuse pieces of it while making local changes.

5. Learn by example - see how others have implemented or used a class, method, event, or new undocumented feature in order to learn from others’ experience.

6. Knowledge sharing - browse or search the enterprise’s code resources to find out what has been and is being developed by other teams.

All of the above reuse categories can be achieved with our tool; however, its major contribution is in promoting reuse in the last four categories. Our methodology focuses on knowledge sharing rather than on the software process that leads to reusable components and frameworks, which is what traditional reuse methodologies address.

2.2 An Economic View on Reuse

Two main reasons lead organizations to pursue reuse: first, cutting development and maintenance costs, and second, shortening the time to market of their products. If one takes a closer look at the various reasons that drive reuse, one will notice that they are all rooted in economics, mainly in improving the company's profits.



Naturally, the obvious question, as with any project, is the profitability of the entire process and investment. The answer to this question is not trivial and in many cases is impossible to determine, mainly because these projects have to be measured over long periods while technologies change very fast and require new investments. In addition, the benefit of reuse is very hard to measure.

Our approach to this open question is to provide a solution that involves almost no cost at all; compared to the costs that organizations currently invest in setting up a traditional reuse environment, this amount is negligible. We achieve this goal by providing an automatic tool that can handle all the activities. On this basis we state that our strategy is a win-win solution, as an organization using our tool can only gain from the reuse and cannot lose. Even if the high level of black box reuse is not achieved and only simpler cases of reuse occur, they can only raise the company’s profit by reducing development and maintenance costs and, more importantly, the “time to market” can be reduced significantly even by these so-called lower-level reuse categories. This in turn can become a significant factor in companies' profits and in their role in the marketplace.

2.3 eColabra Architecture

The eColabra architecture consists of two main modules, presented in Fig. 1 and Fig. 2: information gathering and run time. During the information gathering phase, resources are discovered (crawled), analyzed and stored in a searchable database. During the run time phase, analyzed information is retrieved from the database and results are delivered to the client in a client-server model over the Web via HTTP.

2.3.1 Information Gathering Architecture

The information gathering module is responsible for collecting the raw resource data, analyzing it, and storing it in a database. After the analyzed resources have been stored in the database, the textual information is indexed using IBM Intelligent Miner for Text [IMT 98]. The time spent in this phase grows linearly with the amount and types of resources that have to be analyzed; the phase is performed off-line.

[Figure 1 diagram. Recoverable labels: Server; scheduler; Crawlers; WebDAV Crawler; WebDAV Request; WebDAV Server; Analyzers (Java, C++, COBOL, UML/XMI, HTML Analyzer); DB2 database & index server.]

Fig. 1. The information gathering architecture.
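As an illustration only, the crawl, analyze and store flow of the information gathering phase can be sketched as follows. This is not the eColabra implementation: the analyzer functions, the `ANALYZERS` mapping, the regular expressions and the in-memory dictionary standing in for the database are all assumptions made for the example. Each resource is visited once, which matches the linear growth noted above.

```python
import os
import re
import tempfile


def analyze_java(source):
    """Very rough Java analyzer: pull out class names with a regular expression."""
    return {"language": "Java", "classes": re.findall(r"\bclass\s+(\w+)", source)}


def analyze_html(source):
    """Very rough HTML analyzer: pull out the document title."""
    match = re.search(r"<title>(.*?)</title>", source, re.IGNORECASE | re.DOTALL)
    return {"language": "HTML", "title": match.group(1).strip() if match else ""}


# Hypothetical mapping from file extension to analyzer.
ANALYZERS = {".java": analyze_java, ".html": analyze_html}


def crawl(root):
    """Yield paths of all files under a shared directory that have an analyzer."""
    for folder, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if os.path.splitext(filename)[1].lower() in ANALYZERS:
                yield os.path.join(folder, filename)


def gather(root):
    """Crawl, analyze, and store each resource; the dict stands in for the database."""
    database = {}
    for path in crawl(root):
        with open(path, encoding="utf-8") as handle:
            source = handle.read()
        analyzer = ANALYZERS[os.path.splitext(path)[1].lower()]
        database[os.path.relpath(path, root)] = analyzer(source)
    return database


# Small self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "Stack.java"), "w", encoding="utf-8") as f:
        f.write("public class Stack { void push(Object o) {} }")
    with open(os.path.join(root, "index.html"), "w", encoding="utf-8") as f:
        f.write("<html><title>Team page</title></html>")
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("not analyzed")
    repository = gather(root)
```

Dispatching on file type keeps each language analyzer independent, so supporting a further language would only mean registering one more analyzer.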

2.3.2 Run Time Architecture

The server/run time architecture is designed as a 3-tier client-server model over the Web, using HTTP. The client accesses the eColabra HTTP server with a standard browser, either filling in forms (in the HTML version) or filling in the various fields of an applet (in the applet interface). In both interfaces the outcome is a POST request sent to the server. The server receives the client’s requests, processes them, and runs them against the DB2 database server. The query results obtained from the database are further processed by the result processor, formatted, and sent back to the client in XML format. The client applet receives the XML response and presents the results; for HTML clients, the XSL [XML 98] processor in the client browser is used with a suitable stylesheet to present the XML result.
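The result-processing step described above can be sketched as follows, as an illustration only. The paper does not specify the XML vocabulary, so the element and attribute names (`results`, `hit`, `rank`, `name`, `score`) and the row layout are assumptions, and the sorting function merely stands in for the result processor.

```python
import xml.etree.ElementTree as ET


def process_results(rows):
    """Sort raw (name, language, score) rows by descending score."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def to_xml(query, ranked_rows):
    """Serialize ranked rows into an XML string (element names are hypothetical)."""
    root = ET.Element("results", {"query": query})
    for position, (name, language, score) in enumerate(ranked_rows, start=1):
        hit = ET.SubElement(root, "hit", {"rank": str(position), "language": language})
        ET.SubElement(hit, "name").text = name
        ET.SubElement(hit, "score").text = f"{score:.2f}"
    return ET.tostring(root, encoding="unicode")


# Rows as they might come back from the database query (made-up values).
rows = [("SmallTable", "Java", 0.40), ("BigTable", "Java", 0.90)]
response = to_xml("large sets of data", process_results(rows))

# A client (applet or stylesheet) would consume the XML; here we just parse it back.
parsed = ET.fromstring(response)
```

Returning XML rather than finished HTML lets both client types share one server response, with presentation decided on the client side.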

[Figure 2 diagram. Recoverable labels: Client (html, applet); query request; results (XML); Server: HTTP server, servlets, Query Processor, Result Processor; query (SQL, IR); ranked result set; DB2 database & index server.]

Fig. 2. The run time architecture.


2.4 Search & Retrieval

eColabra provides two kinds of search options: “simple search” and “power search”. In the power search forms, the user can query on language-specific source attributes. For example, in the Java power search form, the user can ask questions such as "search for all classes that implement method add(), are subclasses of Hashtable, and include the free text ‘capable to handle large sets of data’". The types of searchable attributes vary from one programming language to another; however, the OO languages naturally share many common attributes.

We enable the user to mix attribute-based queries with free text queries. We believe that this feature will enable eColabra users to express their needs more precisely, and thus the system will be able to provide more precise results. Previous works have based their solutions either solely on free text searches [Helm & Maarek 91] or on attribute-based searches only. This new and important feature introduces additional complexity: the free text query is a relevance-oriented query, while the attribute-based query is a boolean-oriented query. The problem of mixing different types of queries is known in the literature, and some solutions have been suggested [Fagin 98].

Our current implementation uses an algorithm that intersects the results from these two domains. We intend, however, to implement Fagin’s algorithm and to continue research in this field. Another solution we plan to implement is to define a weight for each attribute and then perform a weighted union over all records in the database that match at least one of the attributes or obtain a non-zero rank in the free text query. In the context of personalization, we are considering maintaining a user profile in which we will save, among other parameters, the weights of the various attributes as adjusted over time by each individual user.
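The difference between the intersection approach and the planned weighted union can be sketched as follows. This is a minimal illustration under stated assumptions, not the eColabra algorithm and not Fagin's algorithm: the `ClassRecord` structure, the toy term-overlap score in `text_rank`, the attribute tests and the weight values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClassRecord:
    """Hypothetical repository record for one analyzed class."""
    name: str
    superclass: str
    methods: set = field(default_factory=set)
    text: str = ""  # comments and documentation


def text_rank(record, query):
    """Toy relevance score: fraction of query terms found in the record text."""
    terms = query.lower().split()
    words = set(record.text.lower().split())
    if not terms:
        return 0.0
    return sum(t in words for t in terms) / len(terms)


def intersect_search(records, attribute_tests, query):
    """Keep records that pass every boolean test and have a non-zero text rank."""
    hits = []
    for rec in records:
        if all(test(rec) for test in attribute_tests.values()):
            rank = text_rank(rec, query)
            if rank > 0:
                hits.append((rank, rec.name))
    return [name for rank, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


def weighted_union_search(records, attribute_tests, weights, query, text_weight=1.0):
    """Score each record by its weighted matched attributes plus weighted text rank."""
    hits = []
    for rec in records:
        score = text_weight * text_rank(rec, query)
        for attr, test in attribute_tests.items():
            if test(rec):
                score += weights.get(attr, 0.0)
        if score > 0:  # matched at least one attribute or the free text
            hits.append((score, rec.name))
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


records = [
    ClassRecord("BigTable", "Hashtable", {"add", "get"}, "handle large sets of data"),
    ClassRecord("SmallTable", "Hashtable", {"get"}, "handle large sets of data quickly"),
    ClassRecord("Adder", "Object", {"add"}, "adds two numbers"),
]
tests = {
    "has_add": lambda r: "add" in r.methods,
    "extends_hashtable": lambda r: r.superclass == "Hashtable",
}
query = "large sets of data"
strict = intersect_search(records, tests, query)
relaxed = weighted_union_search(
    records, tests, {"has_add": 0.5, "extends_hashtable": 0.5}, query
)
```

With this sample data the intersection returns only `BigTable`, while the weighted union also returns the partial matches, ranked lower. Per-user weights of the kind mentioned for personalization would simply replace the fixed `weights` dictionary.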

2.5 Repository Statistics

One of the additional benefits of eColabra, as a central place that developers contact in order to search for resources, is the statistical information accumulated over time. Based on our past experience with such systems, in particular with jCentral [jCentral 98], we can describe the characteristics and types of information we plan to collect and the conclusions one can reach from these statistics. The following examples demonstrate the kind of statistical information that we plan to collect with eColabra:

1. Reuse statistics.

2. Organization/project design, implementation and usage characteristics based on implementation. These include: (a) design characteristics, for example how deep project hierarchies are, the average number of inherited classes, the typical/average number of overridden methods, and the typical number of virtual methods; (b) implementation characteristics, including code size, which external packages or libraries are being used, the average number of methods per class, the average number of classes in a project, etc.

3. User query statistics. These include a wide range of information regarding the various types and content of queries. For example, one could find out what content is most frequently searched for and to what source type the search is applied.
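A few of the design and implementation characteristics listed above could be derived from analyzed class records along the following lines. The record layout and the sample values are assumptions made for illustration; the paper does not describe how eColabra stores or computes these figures.

```python
from statistics import mean

# Hypothetical analyzed records: class name -> (superclass or None, method count).
classes = {
    "Base": (None, 4),
    "Table": ("Base", 6),
    "BigTable": ("Table", 2),
    "Util": (None, 8),
}


def depth(name):
    """Number of ancestors of a class that are present in the repository."""
    levels = 0
    parent = classes[name][0]
    while parent in classes:
        levels += 1
        parent = classes[parent][0]
    return levels


# Deepest inheritance chain, average methods per class, and number of
# classes that inherit from another class in the repository.
max_depth = max(depth(name) for name in classes)
avg_methods = mean(count for _parent, count in classes.values())
subclass_count = sum(1 for parent, _count in classes.values() if parent in classes)
```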

3 Conclusions

In this paper we have presented a new methodology for achieving reuse and knowledge sharing by applying information retrieval techniques to OO resources while exploiting OO language semantics and characteristics. We have described several novel techniques for combining free text and attribute-based searches, used both at query composition time and while processing the result sets. We have demonstrated how our methodology can cut development and maintenance costs and also reduce the “time to market”. Unlike traditional reuse processes and infrastructures, applying our methodology requires almost no investment, leading to a win-win solution. This is achieved by a methodology and environment based on autonomous tools that require very little support.

In addition, we plan to exploit the eColabra server to collect valuable statistics on both the organization’s usage patterns and its resource assets. We shall further investigate how to exploit this data, in particular how it can serve to manage the organization’s knowledge, much of which is embodied in its code resources. In the near future we plan to investigate additional indexing schemes, enable browsing by categories, add further source/document analyzers, and support novel integration tools to further enhance reuse from within IDEs.

4 Acknowledgments

We express our appreciation to Ron Pinter for his contribution to this project and for all the previous work that laid the ground for this work. We would like to thank Yoelle Maarek and Pnina Vortman for their contributions to past activities that have enabled this project, and David Bernstein for his support through all the stages of this activity.

5 References

[DB2 98] http://www.software.ibm.com/data/DB2/
[DOM 98] http://www.w3.org/DOM/
[ECOOP 97] Gabi Zodik, Yardena Peres, Jerry W. Malcolm, Pnina Vortman, "A Framework Registration Language", 11th European Conference on Object-Oriented Programming, 1997.
[Fagin 98] Ron Fagin, Yoelle Maarek, "Allowing users to weight search terms", IBM Research Report RJ 10108, March 1998.
[Frakes & Nejmeh 87] W. B. Frakes and B. A. Nejmeh, "Software reuse through information retrieval", in Proceedings of the 20th Annual HICSS, pages 530-535, Kona, HI, January 1987.
[Frakes 97] W. B. Frakes, personal communication, August 1997.
[Gamma] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, "Design Patterns: Elements of Reusable Object-Oriented Software", Addison-Wesley, ISBN 0-201-63361-2.
[Griss 97] Ivar Jacobson, Martin Griss, Patrik Jonsson, "Software Reuse: Architecture, Process and Organization for Business Success", ACM Press and Addison Wesley Longman, ISBN 0-201-92476-5.
[Grouper 98] http://zhadum.cs.washington.edu/zamir/cluster.html
[Hawaii 99] S. Teng, Q. Lu, M. Eichstaedt, D. Ford, T. Lehman, "Collaborative Web Crawling: Information Gathering/Processing over Internet", Hawaii International Conference on System Sciences, January 1999.
[Helm & Maarek 91] Richard Helm and Yoelle S. Maarek, "Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object-Oriented Class Libraries", in Proceedings of OOPSLA'91, Las Vegas, 1991.
[jCentral 98] Qi Lu, Reiner Kraft, Matthias Eichstaedt, Gabi Zodik, Daniel Ford, Ron Pinter, Dirk Nicol, "jCentral: Search the Web for Java", WWW8.
[IMT 98] Intelligent Miner for Text, http://www.software.ibm.com/data/iminer/fortext/
[Maarek et al. 91] Y. S. Maarek, D. M. Berry and G. E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries", in Transactions on Software Engineering, 17:8, August 1991.
[Maarek et al. 97] Yoelle S. Maarek, Michal Jacovi, Menachem Shtalhaim, Sigalit Ur, Dror Zernik and Israel Z. Ben Shaul, "WebCutter: A System for Dynamic and Tailorable Site Mapping", Computer Networks and ISDN Systems, Elsevier Science, to appear. Also in Proceedings of the 6th WWW Conference, Santa Clara, CA, April 1997.
[Oren Z. 98] Oren Zamir, Oren Etzioni, "Web Document Clustering: A Feasibility Demonstration", in Proceedings of the 19th ACM SIGIR'98, pages 46-54, 1998.
[Research 97] "Information on the Fast Track", IBM Research Magazine, Vol. 35, No. 3, pp. 18-21, 1997.
[StrField 97] Reiner Kraft, Qi Lu, Ron Pinter, "System for Creating Structured Fields on Electronic Forms", US Patent, 1997.
[Watson 97] Joshua Dobies, Dan Ford, Reiner Kraft, Pete Lazarus, Qi Lu, Ron Pinter, Orit Edelstein, Yoelle Maarek, Fang Min, Pnina Vortman, Gabi Zodik, "Java Central: An Internet Java Resource Center", The 1997 IBM T.J. Watson Research Java Conference.
[XML 98] http://www.w3.org/XML/ , http://www.w3.org/TR/WD-xsl
[Zernik & Zodik 97] Dror Zernik and Gabi Zodik, "A graph management framework for 3D visualization", IBM Object Technology 97.
