Software Development Using Context Aware Searching Of Components In Large Repositories

Sayan Paul, Tushar Makkar, K. Chandrasekaran
Department of Computer Science and Engineering
National Institute of Technology Karnataka, Surathkal, India
[email protected], [email protected], [email protected]
Abstract— This paper proposes a new approach to locating software components in large online open source component repositories, combining the inherent features of context-aware browsing, ranking and semantic tagging. Tagging individual components helps make searching fast and efficient. We improve the results of context-aware browsing by ranking them on the basis of Hidden Markov Models. The inputs to the Hidden Markov Models consist of automatically generated contextual queries; these queries formulate the resource set of our Hidden Markov Model. The queries are refined using reformulation, specialization, generalization and general association. This automation not only reduces the search space of components for efficient browsing but also enables developers to use components whose existence they did not even anticipate.

Index Terms— Software development, software reuse, context aware, open source, components, tagging, ranking.
I. INTRODUCTION

One of the emerging branches of software engineering is Component-based software engineering (CBSE), or Component-based development (CBD). Its main focus is to segment or compartmentalize the whole software system based on its wide-ranging functionalities. This is a reuse-based approach to defining, implementing and composing loosely coupled independent components into a single unified system. The practice has not only short-term benefits but also long-term benefits, for both the software and the organization responsible for its development [27]. The quality and productivity of software development can be improved with the help of component-based software development. However, the developer must be aware of components before he can use or reuse them from the online repositories. These repositories are very large, and it is hard to keep track of each individual component. Moreover, they are dynamic in nature: new components are added and old components get updated, and all these changes need to be tracked properly. In this paper we therefore propose techniques to search through these repositories and find relevant components. Among the largest freely available online component repositories are the open source repositories, e.g. GitHub, Mercurial, SVN, etc. In order to search through this huge
collection, we need some kind of classification and identification process for individual components. One way of achieving this is by tagging components. Informal metadata for objects like web pages, images and videos can be expressed using words or phrases known as tags. In Web 2.0, users have the ability to create a richer, more adaptive and responsive way to navigate and search both existing and new media; tagging is a phenomenon that supports this Web 2.0 mentality [2]. With the help of these tags we can identify and classify every individual component. Tags also make the searching process more efficient. Using the tags created for or allotted to each component, we try to find the component the developer is currently trying to implement. Our proposed system runs continuously in the background of the development environment, monitors the components' interactions with that environment, and infers the software developer's need for particular components. Using the development context, including the development task and the background knowledge of the developer, as search criteria, potentially reusable components are autonomously located and actively presented. These search results are then ranked on the basis of Hidden Markov Models and their contextual sensitivity. The search results can also be filtered according to their relevance, source, etc. Our proposed model thus significantly increases the opportunity for component reuse. Context-aware computing is defined as the use of context in software applications, where the behavior of the application changes with any change in the context [7]. With the advent of Big Data and the amount of information it holds, the importance of context-aware retrieval has increased significantly.
It acts as an extension of classical information retrieval, with contextual information as search criteria, aiming to deliver information that is more relevant to users within their current context [8]. Searching and browsing are information access mechanisms, both of which require users to initiate the process. To assist users in finding components in large component repositories, these information access mechanisms are complemented by information delivery mechanisms, which actively present
information to users without the need for explicit queries of any sort [9]. Context-aware computing can help us do this efficiently. Converting the decontextualized information available within large open source repositories into contextualized information, by learning the working context of the user's application, is pivotal for thriving software reuse. The software developer need not formulate any specific query: the system continuously monitors and evaluates the developer's interactions with the application environment, and simultaneously locates and delivers potentially reusable components that could be used to implement the task being developed. This reduces the search space for browsing components, leading to a mammoth decrease in the time taken to find the desired components during development [1]. Component repositories that support searching use different searching and retrieval mechanisms. Components can be represented in three different ways:

• Text-based: Components and queries are represented using textual descriptions, and the relevance of components is determined using information retrieval techniques. Textual descriptions can be drawn from accompanying documents [17], or extracted from comments and/or identifier names in components [18].

• Structure-representation: Components are searched using a knowledge base or conceptual distance graph. Components are represented using multiple facets in multi-faceted classification schemas [19], and semantic relationships among these terms are determined using conceptual distance graphs. In LaSSIE [20], components are represented as frames, organized into hierarchical, taxonomic categories.

• Formal method-based: Components and queries are represented using formal specifications or signatures [21].
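The text-based approach above can be illustrated with a small sketch: classical TF-IDF weighting plus cosine similarity between a query and component descriptions. The component names and descriptions here are invented for illustration, and real systems would use a proper IR library rather than this minimal stdlib version.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(docs):
    """Return one sparse TF-IDF weight dict per document."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    return [{t: c * math.log((1 + n) / (1 + df[t]))
             for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical component descriptions harvested from comments and docs.
components = {
    "fib": "compute fibonacci numbers using dynamic programming",
    "sort": "quick sort implementation for integer arrays",
    "crawler": "crawl html pages and extract links",
}
query = "fibonacci sequence computation"

vecs = tfidf(list(components.values()) + [query])
qvec = vecs[-1]
ranked = sorted(zip(components, vecs),
                key=lambda kv: cosine(qvec, kv[1]), reverse=True)
print([name for name, _ in ranked][0])  # fib
```

Only the "fib" description shares a weighted term with the query, so it ranks first; components with no overlap score zero.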
Section I introduces software development using software components and some common methods. Section II reviews related work in this field. Section III explains the problem in more detail, along with its associated challenges. Section IV explains our proposed method and its system details. Section V analyses the different techniques. Section VI concludes the paper with ideas for future work.

II. RELATED WORK

Most component repository systems organize components using an inheritance structure to support browsing, such as the Smalltalk programming environment. This type of browsing structure makes it nearly impossible for developers
to search for a component without comprehensive knowledge of the repository's structure. This method lacks meaningful information about the repositories, such as what the repository contains and how to discover components that might be useful for the current task [22]. Much research has been carried out on improving the browsing of large component repositories. Fischer [23] proposed a browsing mechanism involving specification management, which combined various concepts and specifications for structuring the repositories. This technique enabled developers to find components in a repository by selecting particular attributes required for the task, or deselecting attributes not involved in completing a particular task. Drummond et al. [24] added an active agent to the browsing system to speed up the search for components: the software developer's browsing actions are monitored to deliver components that closely match the developer's implicit goal. CodeBroker [1] attempts to locate components automatically; the system adapts to each development session, tailoring the search process to the background knowledge of each developer.

III. PROBLEM DESCRIPTION

Searching and browsing are among the most important techniques for locating components in large component repositories. Searching is generally fast and direct if suitable queries are available: software developers formulate a query, and the system returns the corresponding components. The main challenge is formulating the queries. This is difficult because developers or designers have to bridge the gap from the situational model, which involves understanding a given task, to the system model, which is inferred from the description of the components [4].
Browsing is defined as a task where the relevance of components is determined by developers in accordance with their development task, while the associated links in the component repository are traversed. Developers tend to feel more comfortable with browsing than with searching, because they need not commit resources up front and it supports incremental development of the requirements as information is evaluated [6]. However, browsing does not scale at all, because of the enormous number of components available to developers. The links between components increase exponentially with the number of repositories, so software developers get overwhelmed by the complexity. Moreover, as most repositories are structured according to inheritance relationships, the current structure fails badly at locating components by functionality. The problem is further aggravated by an inherent dilemma in the design of the browsing structure: the relationships between components are difficult to detect, and most software developers get confused by the complex network of repositories while traversing the links. Most developers prefer creating their own programs instead of attempting to use, or even find,
the components from the large repositories available to them, because they either do not know whether those components exist, or do not know a systematic methodology for finding them [10].

IV. PROPOSED METHOD

A. Description of our system
Figure 1. Flow Diagram of System

Figure 1 describes the flow model of our proposed system. The software developer (the user), the actor in this scenario, provides information to our context-aware browsing system, which is essentially a process running in the background of the text editor, gathering decontextualized information from other sources such as the user's past history and reliability score. The system then generates automated queries for searching components in the repositories. Meanwhile, it tags the repositories on the basis of context. These tagged repositories are then searched, and the results are reformulated and ranked on the basis of Markov models and the optimizations applied to the input queries. This yields results more relevant to the context of the code the user is writing, leading in turn to efficient browsing of large component repositories and effective software reuse.
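The flow just described can be sketched end to end. This is a minimal illustration under strong simplifying assumptions: the editor context is a single string, the tagged repository is an in-memory dict, query generation is keyword intersection, and ranking is tag overlap. All component names and tags are hypothetical.

```python
import re

# Hypothetical tagged repository: component name -> set of tags.
REPO = {
    "fib.py": {"fibonacci", "dynamic-programming"},
    "qsort.py": {"sorting"},
    "bfs.py": {"graph", "search"},
}

def infer_context(editor_buffer):
    """Context monitor: derive candidate terms from the code being written."""
    return set(re.findall(r"[a-z][a-z-]*", editor_buffer.lower()))

def generate_query(context):
    """Auto-generate a contextual query: keep terms known to the tag database."""
    known_tags = set().union(*REPO.values())
    return context & known_tags

def search_and_rank(query):
    """Search tagged components and rank them by tag overlap with the query."""
    scored = [(len(query & tags), name) for name, tags in REPO.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

buffer = "def fibonacci(n): # dynamic-programming solution"
results = search_and_rank(generate_query(infer_context(buffer)))
print(results)  # ['fib.py']
```

In the full system, the overlap-based ranking step would be replaced by the Hidden-Markov-Model ranking described later in this section.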
B. Tagging of components

Tagging is one of the crucial criteria when it comes to searching for components in a large repository. In this paper, we extend our search range to online open source repositories. To achieve this, we first keep a database of pre-recognized tags (mainly consisting of common module names, algorithms, class abstractions, etc.). These tags serve as seeds; each module is tested against them and classified accordingly. Code is fetched from online open source repositories that are freely available for use, either with a web crawler that crawls through the HTML to extract the components, or through an API. Each tag in the database is compared with the comments or variables in a module/component, or with the documents associated with it. This process uses string matching algorithms, so the code from the online open source repositories and its associated documents are checked and matched against the tags. One challenge with this method is that the same code can be represented in different ways by different users. Tagging of components need not be done only by comparison with the tag itself; the system also compares the code with previously tagged code to measure similarity and tags it accordingly. As new code gets classified or tagged, the system adds its links to our database, and further classifications use these as training data. Thus the whole tagging system keeps growing and building itself. Users can also tag components in the online repositories, and these tags are added to the database. Based on user responses to the tagged components, the system learns the tagging pattern and tags more accurately. User feedback makes the self-learning system increasingly accurate.
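The seed-tag matching step above can be sketched as follows. The seed database, its keywords, and the code snippet are all invented for illustration; a real system would use fuzzier string matching and the similarity-based comparison with previously tagged code described above.

```python
import re

# Hypothetical seed-tag database: tag -> indicative keywords.
SEED_TAGS = {
    "fibonacci": ["fibonacci", "fib"],
    "sorting": ["quicksort", "mergesort", "sort"],
    "graph": ["dijkstra", "bfs", "dfs", "adjacency"],
}

def extract_terms(source):
    """Pull identifier and comment words out of a code snippet."""
    words = {w.lower() for w in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)}
    for comment in re.findall(r"#(.*)", source):
        words.update(w.lower() for w in re.findall(r"[A-Za-z]+", comment))
    return words

def tag_component(source):
    """Assign every seed tag whose keywords occur in the component's text."""
    terms = extract_terms(source)
    return sorted(tag for tag, keys in SEED_TAGS.items()
                  if any(k in terms for k in keys))

snippet = '''
# iterative fibonacci generator
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
'''
print(tag_component(snippet))  # ['fibonacci']
```

Both the identifier `fib` and the word "fibonacci" in the comment match seed keywords, so the component is tagged even though its author never tagged it explicitly.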
So, when a user tags some code in the repository, the daemon running in the background is triggered to check the new tag against the whole repository, tagging relevant components: if a user tags a Fibonacci code, all components with similar code or comments get tagged as well. All these tags and their associated links are stored in the database for later access. We can use probabilistic models such as Hidden Markov Models to predict tags for untagged text; rule-based approaches and neural networks have also been quite successful at predicting outcomes with high efficiency. Through these approaches [16], some important observations can be made:

• There exists a limited number of tags for a given word, and those tags can be found either in a dictionary or through a morphological study of the word.

• Local context helps to find the correct tag from a given set of tags. This can be done using contextual rules that define valid tag sequences. A priority list can be made of the rules so that a better selection can be made when several rules apply.

A particular component need not have a single tag; it may have multiple tags. All the tags have a relevance value, β, which not only
depends on the code a user is searching for, but also on the context of the particular user. User-tagged code can be used as training data, and code tagged by the system can be backed by positive user feedback. System-tagged components that receive positive feedback can then be used as training data for the model, increasing the efficiency and accuracy of the whole system. A common reliability factor (α) is assigned to each user; it indicates how much importance should be given to a tag that the user has created. For example, the reliability factor α for the author of the code would be maximal, since he knows what his code essentially does. The reliability factor also depends on the user's previous history: if a malicious user goes on tagging components wrongly, his reliability score goes down, and his malicious tags no longer affect the overall searching mechanism of our system. Similarly, a constant value can be assigned for our system's own reliability score, to be improved on the basis of ratings by other users. Learning-by-feedback algorithms (neural networks, deep learning, etc.) can be incorporated in the system.

C. Design and Architecture of tagging system
Figure 2: Architecture of tagging system

The whole architecture of the tagging system can be subdivided into layers, as shown in Figure 2. The first layer is the crawler, which crawls through HTML code to find individual components. The filter layer is the second layer; it separates the relevant components from the rest. The RDBMS, the third layer, stores the tags and their associated links. The fourth layer is divided into two parts: the context analyser, which links individual components to their tags based on their context, and the personalizer, which predicts the tags for untagged components based on previous history.
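The four layers can be sketched as a simple pipeline. Each function below is a deliberately naive in-memory stand-in (plain string splitting for the crawler, a dict for the RDBMS, a hard-coded rule for the context analyser); the real layers would use an HTML parser, a database, and the tagging models described above.

```python
def crawl(pages):
    """Layer 1 (crawler): extract candidate blocks from raw page text."""
    for page in pages:
        for block in page.split("---"):  # '---' stands in for HTML structure
            yield block.strip()

def filter_relevant(blocks):
    """Layer 2 (filter): keep only blocks that look like code components."""
    return [b for b in blocks if "def " in b or "class " in b]

def store(components, db):
    """Layer 3 (RDBMS stand-in): persist component -> tags links."""
    for c in components:
        db.setdefault(c, set())
    return db

def analyse_context(db):
    """Layer 4 (context analyser): attach naive tags from identifier names."""
    for component, tags in db.items():
        if "fib" in component:  # hypothetical contextual rule
            tags.add("fibonacci")
    return db

pages = ["intro text---def fib(n): pass---footer"]
db = analyse_context(store(filter_relevant(crawl(pages)), {}))
print(db)  # {'def fib(n): pass': {'fibonacci'}}
```

Only the code-like block survives the filter; the surrounding page text is discarded before anything reaches the store.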
D. Searching with tagging

Here we use signature matching along with free-text information retrieval techniques to fetch task-relevant components. A probability-based information retrieval approach computes the resemblance between conceptual queries, extracted from comments in the programs under development, and markdown documents (.md files) or other text documents related to the project that describe the components in the repository [15]. The probability-based information retrieval (IR) technique assigns appropriate weights to the different terms in a document, which are then used to estimate the affinity between a query and a document. This returns a rank-ordered list of pre-indexed documents that best match a given query. The concept similarity between the document of a component Dj and the query is used to rank components: we seek a ranking function f : R → R+ such that f(r1) > f(r2) if and only if r1 is more relevant than r2, with respect to K, q and Lk. Here R+ is the set of all positive real numbers, and f is essentially the conditional probability p(r|q, K, Lk). After proper formulation and simplification [12] it can be interpreted as:

p(r|q, K, Lk) = [p(r|q) / p(r)] · Σt [p(r|t, K, Lk) · p(t|K, Lk)].   (3.1)

Here p(r|q) is the relevance of r to q when context-specific information is not taken into consideration, while p(r) is the prior probability that the user accesses r in his next try, with respect to the user model K and the most recent access history Lk. For notational convenience we now define

f1(r) = p(r|q) / p(r),   (3.2)

and

f2(r) = Σt [p(r|t, K, Lk) · p(t|K, Lk)].   (3.3)

So it can be rewritten as

f(r) = f1(r) · f2(r).   (3.4)

We introduce a tuning factor i (0