SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure

Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling
Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada
{i_keiv, l_roost, p_schuge, rilling}@cse.concordia.ca

Abstract—Available code search engines typically provide coarse-grained lexical search. To address this limitation we present SE-CodeSearch, a Semantic Web-based approach for Internet-scale source code search. It uses an ontological representation of source code facts and analysis knowledge to complete missing information using an inference engine. This approach allows us to reason and search across project boundaries containing often incomplete code fragments extracted in a one-pass and no-order manner. The infrastructure provides a scalable approach to process and query across large code bases mined from software repositories and code fragments found online. We have implemented SE-CodeSearch as part of the SE-Advisor framework to demonstrate the scalability and applicability of our Internet-scale code search in a software maintenance context.

Keywords—Semantic Web; code search; software maintenance

I. INTRODUCTION

While there exists a significant body of work on source code search in general [2], Internet-scale source code search (ISCS) [1] is an emerging area that is gaining momentum due to the large amount of quality code publicly available on the Internet. It differs from traditional code search in the following aspects: (1) the heterogeneity of the resources from which information is mined, (2) the need to scale to the very large amounts of information that have to be indexed and analyzed, and (3) the incompleteness of the code and the implicit and explicit dependencies among code fragments.

In this research, we define a new type of ISCS, a Semantic-rich Internet-scale Code Search (SICS), that exploits the semantics behind the analyzed source code. In our SICS approach, we model source code using Description Logic (DL) via the Web Ontology Language (OWL) and apply Semantic Web reasoning to deal with knowledge that is incomplete or missing during fact extraction. We have identified basic requirements for common code search queries based on existing work in the code search domain [1-4]. Our ontological knowledge-based approach can address these requirements for object-oriented languages. We apply Semantic Web reasoners to infer implicit facts via explicitly represented analysis knowledge, supporting searches that might involve transitive closure, unqualified name resolution, and static code analysis. Based on our SE-CodeSearch infrastructure, we have developed a proof-of-concept implementation that has been integrated within our existing SE-Advisor [5] project. Furthermore, we investigate the scalability of the infrastructure for very large amounts of code.

Our research is significant for several reasons: (1) we identify basic queries and requirements to be supported by a SICS; (2) we provide a novel knowledge-based Internet-scale code search infrastructure that integrates existing Semantic Web technologies to provide a scalable and semantic-rich source code search; (3) our approach can be easily parallelized by breaking the problem down into smaller sub-problems, as we follow a one-pass and no-order analysis approach; and (4) we demonstrate the scalability of our infrastructure using real-world data.

The remainder of this paper is organized as follows. Section 2 discusses the motivation of this research. In Section 3, we introduce the SE-CodeSearch infrastructure, the source code ontology, and the implementation. Section 4 presents the results of our experiments. Related work and concluding remarks are presented in Sections 5 and 6, respectively.

II. MOTIVATION

We expect a SICS to provide the ability to search over any kind of source code (even incomplete or uncompilable) available on the Internet. It provides fine-grained search functionality by considering the underlying semantics of the code. Fact extraction from such partial code fragments represents a major challenge, since imported classes/binaries (dependencies) might not be available during the extraction process.

A. Usage scenarios, challenges and limitations

In what follows, we discuss two usage scenarios to motivate our work. The first scenario involves the simple identification of a class by its Fully Qualified Name (FQN), while the second searches for classes that call a certain method.

Scenario #1: While there is only one org.eclipse.jgit.util.IO class, there exist hundreds of Java IO classes declared by the numerous open source projects indexed by, e.g., GoogleCS [6]. However, most publicly available ISCS engines are not able to search for classes using FQNs but are limited to simple class name matching (IO in our example). A FQN is the combination of package and class names and can be used as a semi-global identifier in languages such as Java.

Scenario #2: To be able to support queries related to method calls, the caller and callee of each method call statement must be known, even if one of them has been defined earlier in another code fragment.
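The FQN-based lookup that Scenario #1 calls for can be sketched as a simple index keyed by qualified names (a minimal illustration; the index contents and function names are invented for this example):

```python
# Hypothetical sketch: an index keyed by Fully Qualified Names lets a search
# engine distinguish the many classes that share a simple name such as 'IO'.

def fqn(package, class_name):
    """Combine package and class names into a semi-global identifier."""
    return f"{package}.{class_name}" if package else class_name

# Invented index contents: two of the many unrelated 'IO' classes.
index = {fqn("org.eclipse.jgit.util", "IO"), fqn("java.io", "IO")}

def search_by_fqn(index, qualified_name):
    # Precise: at most one match per fully qualified class.
    return sorted(n for n in index if n == qualified_name)

def search_by_simple_name(index, simple_name):
    # Coarse: matches every class with that simple name.
    return sorted(n for n in index if n.rsplit(".", 1)[-1] == simple_name)
```

Searching by FQN returns exactly the intended class, while simple-name matching returns all homonyms.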

Transitive closure-based queries (TCQ). A TCQ requires that the engine support the transitivity property, to be able to query, for example, over object-oriented inheritance structures.

Method call-based queries (MCQ). It must be possible to search over method call statements by specifying the call site type and the method specification.
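The transitive closure behind TCQs can be sketched without any OWL machinery; the inheritance facts below are an invented example (OWL's transitive roles give a reasoner this closure for free):

```python
# Minimal sketch of a transitive-closure query (TCQ) over inheritance edges,
# computed explicitly here for illustration.

def supertypes(extends, cls):
    """All direct and indirect supertypes of cls."""
    seen, stack = set(), [cls]
    while stack:
        for parent in extends.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Invented inheritance facts: C extends B, B extends A.
extends = {"C": ["B"], "B": ["A"]}
```

A query such as "all subclasses of A" is then answered by checking whether A appears in each class's closure.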

For a search engine to be able to support both scenarios, it has to consider the semantics of the programming language during the fact extraction phase. For example, the following facts must be extracted from the sample code shown in Fig. 1 to support the discussed queries: (A) ClassName: util.SomeClass and (B) MethodCall: (Receiver: util.SomeClass, Method: MyMethod), from lines 1, 4, and 20. However, existing search engines instead extract the facts (C) ClassName: SomeClass and (D) MethodCall: MyMethod. It is obvious that C and D are not sufficient to answer our usage scenarios. Other, more sophisticated dependency analysis queries also require a semantic-rich search, such as the chained method call MethodA().MethodB() in Fig. 1, where no import statement is given for the return type of MethodA.
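The gap between facts B and D can be made concrete with a toy fact base (the second receiver type, other.Helper, is invented to show the ambiguity that name-only facts cannot resolve):

```python
# Facts like D: method name only -- every call to any 'MyMethod' collapses
# into one indistinguishable entry.
lexical_facts = {"MyMethod"}

# Facts like B: (receiver type, method) pairs. 'other.Helper' is an invented
# second receiver illustrating what lexical facts cannot separate.
semantic_facts = {("util.SomeClass", "MyMethod"),
                  ("other.Helper", "MyMethod")}

def calls_on(facts, receiver_fqn, method):
    """Scenario #2 style query: calls of 'method' on a given receiver type."""
    return {r for (r, m) in facts if r == receiver_fqn and m == method}
```

With D-style facts only, both call sites are indistinguishable; with B-style facts the receiver-qualified query returns exactly one.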

Absent information queries (AIQ). This type of query handles the absence of certain information. An example of an AIQ is: "Find all classes that do not inherit from a specified class". If one has to deal (as a SICS does) with incomplete repositories, support for the Open World Assumption (OWA) is necessary to avoid invalid results caused by a possibly missing intermediate class in the inheritance tree.
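The difference between closed- and open-world answers to such a query can be sketched with a toy inheritance fact base (class names invented; only direct superclasses are checked, for brevity):

```python
# Known direct-superclass facts; class 'B' extends a class that has not been
# analyzed yet, so its position in the inheritance tree is unknown.
known_supers = {"A": "Base", "B": "Unseen"}
analyzed = {"A", "Base"}          # classes whose declarations were analyzed

def not_inheriting_cwa(cls):
    """Closed world: absence of evidence counts as 'no' -- unsound here."""
    return known_supers.get(cls) != "Base"

def not_inheriting_owa(cls):
    """Open world: withhold the answer (None) when facts may be missing."""
    sup = known_supers.get(cls)
    if sup == "Base":
        return False
    if sup is None or sup not in analyzed:
        return None               # cannot rule out inheritance via 'Unseen'
    return True
```

Under the closed world, B is (possibly wrongly) reported as a non-inheritor of Base; under the open world, the engine reports it as unknown.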

It has to be noted that not all implicit facts in Fig. 1 can be extracted at analysis time by compiler-based search engines, unless the contents of packageA and packageB have been analyzed beforehand. This includes the FQN of TargetType (and thus the FQN of foo's type) and the receiver type of the MethodB call statement. Without support for inter-code analysis it is not possible to determine whether foo's type refers to packageA.TargetType or packageB.TargetType. Depending on the web crawler, files can be analyzed in any order, which might lead to a situation where only partial information is available at analysis time.

From a source code analysis point of view, fact C can easily be extracted from an Abstract Syntax Tree (AST) created for the given code sample. However, in order to extract fact A, which is a detailed form of C, an Annotated AST is required. The creation of an Annotated AST, which carries FQNs instead of only class names, requires inter-code semantic analysis. This creates a major challenge for Internet-scale code search due to the incompleteness of the data being analyzed.

B. Semantic-rich Internet-scale code search requirements

There are a number of core queries [1-4] that should be supported by a SICS. In what follows, we classify them based on their requirements to form our infrastructure specification; the transitive closure (TCQ), method call (MCQ), and absent information (AIQ) query types were introduced above.

Pure structural queries (PSQ). PSQs refer to queries that focus on the structural aspects of source code. These queries provide information typically available in the Annotated AST after unqualified name resolution.

Metadata queries (MDQ). Queries related to source code metadata, such as the project license, are considered MDQs.

Free-form queries (FFQ). Free-form query-based code search is a diverse search family with special applications, as discussed in [7]. It is not covered in this research.

Mixed queries (MXQ). A SICS must support complex searches, since faceted queries are common in code search.

     1  package util;
     2  import packageA.*;
     3  import packageB.*;
     4  class SomeClass {
     5      TargetType foo;
        ...
    20      MyMethod();
    21      packageA.OtherType bar;
    22      bar.MethodA().MethodB();
        ...

Figure 1. A sample code fragment showing the necessity of inter-code analysis.

C. Comparison of Internet-scale source code engines

So far, we have discussed the challenges and requirements of a Semantic-rich Internet-scale Code Search. In what follows, we perform a feature-based comparison of existing source code search engines. Each engine was evaluated by two graduate students using heuristic-based queries, each with more than four years of industrial programming experience. Due to differences in the underlying repositories and services, each person had to derive customized queries for each test case. The results were then analyzed and compared for inconsistencies; in case of discrepancies, a third person was asked to determine the final answer.

The results of this evaluation are shown in Table I, where an asterisk indicates that there was insufficient information and a hyphen that the question was not applicable. Two categories of code search engines can be distinguished: (1) fine-grained search providers limited to compilable code, such as Sourcerer [8], and (2) coarse-grained services such as GoogleCS [6]. Neither category supports all the required properties of a SICS, so there is a need for a third family that supports both incomplete (uncompilable) code and fine-grained services covering all the specified queries.

TABLE I. A FEATURE-BASED EVALUATION OF INTERNET-SCALE SOURCE CODE ENGINES HELPS US TO FIND THE GAP THAT MUST BE COVERED

Engine           TCQ   PSQ   MCQ   Incomplete Code
Assieme [3]      *     Yes   -     No
Sourcerer [8]    Yes   Yes   No    No
Strathcona [9]   No    -     Yes   No
Koders.com       No    Yes   No    *
Codase.com       No    No    No    *
Krugle.com       No    No    No    *
JExample.com     No    Yes   Yes   -
GoogleCS [6]     No    No    No    *

III. SE-CODESEARCH INFRASTRUCTURE

In our research, we use Semantic Web technologies to address the limitations of existing Internet-scale code search. This knowledge-based infrastructure can support incremental source code analysis, similar to demand-driven analysis using logic programming [10]. In our approach, we treat facts as the information and code analysis techniques as the knowledge. The general approach is to mine available source code and extract as many facts as possible using existing source code analysis techniques. In the case of incomplete code (e.g., Fig. 1) we extract analysis knowledge instead of the corresponding unresolvable facts. With an inference engine attached to the repository (containing both facts and analysis knowledge), unresolved facts can be incrementally inferred from the available analysis knowledge as additional knowledge becomes available.

In what follows, we provide an overview of the SE-CodeSearch architecture (Fig. 2). For internal knowledge and information modeling, our approach relies on an ontological representation. The ontologizer subsystem transforms extracted facts into an OWL representation. Due to the different types of inference needed, our architecture uses both DL and lightweight reasoners: DL reasoners are used for the more complex analysis (inference) such as PSQs and full MCQs, while the lightweight reasoners cover the transitive closure required for TCQs and the classification queries (mostly MDQs). A SPARQL endpoint interface, a GUI, and web services are available to access the data.

As part of our approach, we have created an optimized ontology that satisfies the previously discussed requirements, with scalability as a major design factor. The major entities of our source code ontology (sicsont) are shown in Fig. 3.
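The incremental idea can be sketched as a tiny forward-chaining fact base standing in for the reasoner; the rule and fact tuples below are invented for illustration:

```python
# Hedged sketch: 'analysis knowledge' is kept as rules whose conclusions
# materialize once the missing facts arrive -- in any order, matching the
# one-pass, no-order analysis described above.

class FactBase:
    def __init__(self):
        self.facts, self.rules = set(), []

    def add_rule(self, premises, conclusion):
        self.rules.append((frozenset(premises), conclusion))
        self._close()

    def add_fact(self, fact):
        self.facts.add(fact)
        self._close()

    def _close(self):
        # Naive forward chaining to a fixed point.
        changed = True
        while changed:
            changed = False
            for premises, conclusion in self.rules:
                if premises <= self.facts and conclusion not in self.facts:
                    self.facts.add(conclusion)
                    changed = True

kb = FactBase()
# Caller fragment analyzed first: the receiver of MethodB is unresolvable,
# so analysis knowledge (a rule) is stored instead of a fact.
kb.add_rule({("returns", "MethodA", "pkg.SomeType")},
            ("receiver", "MethodB", "pkg.SomeType"))
# Callee fragment arrives later, from another file:
kb.add_fact(("returns", "MethodA", "pkg.SomeType"))
```

After the second fragment is added, the pending receiver fact is inferred automatically, regardless of the order in which the fragments were processed.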

Figure 2. SE-CodeSearch infrastructure

Figure 3. SICSONT major entities overview.

A. Analysis knowledge representation

Similar to other code ontologies [11], sicsont includes some very high-level concepts for object-oriented source code. However, to support the SICS requirements for capturing code analysis knowledge, some parts of our model are unique among the available code ontologies. The ontology and related documents are available at http://aseg.cs.concordia.ca/ontology#sicsont. In what follows, we highlight some of the sicsont design aspects.

Type hierarchy. In sicsont, we use OWL individuals to represent types (i.e., classes and interfaces in OO languages) and their implementation details. This keeps the required formalism decidable, which is a major design decision, since object-oriented meta-modeling typically leads to undecidable models [2]. Transitive properties that may exist between these types are captured using OWL [12] transitivity. Based on this design decision, object-oriented classes defined in the source code are represented as individuals of the Class/Interface concepts (Fig. 3). Variables are defined as individuals of the Variable concept, and inheritance hierarchies and the relationships among instances and classes are captured using hasType, which is a transitive role.

Package relationships and their dependencies. Three forms of relationship (ownership, hierarchy, and dependency) must be distinguished. The ownership and hierarchy roles are belongsTo and hasSubPackage, respectively. These roles let us search for packages and find classes belonging to a given package or all its sub-packages (using transitive closure). The dependency relationship is modeled through the imports role.

Method call and return statements. At the source code level, method call statements are an important code aspect. We represent method calls using individuals of the MethodCall concept. A method call statement can be connected to the receiver code by the hasMethodCall role.
However, each method call statement has two important pieces of information associated with it: the method and the caller. Both are required to answer MCQs in the context of object-oriented code, since the caller type determines the actual callee (i.e., the method implementation). The ReturnStm concept is key to solving the MCQ challenges. Each new sub-concept of ReturnStm contains the analysis knowledge needed by the reasoner to extract additional implicit facts (i.e., the connection between a call statement and the message receiver). This enables incremental fact extraction for method call statements in the case of incomplete code: the caller code and the method owner, which contains the method implementation, can be converted to the ontology separately and in any order. The reasoner infers the connection and further facts if they are relevant to the query.

Fig. 4 shows how the analysis knowledge is modeled within the return statement concept using our ontology. It also shows how we deliberately connect OWL concepts to individuals so that the reasoner can infer the return type of each method call. While this kind of logic-based code analysis is typical in static analysis [10], it is not a trivial achievement here, since we are using Description Logic in a SICS context. In other words, the solution shown in Fig. 4 is one of the interesting aspects of this research, since it exploits the ability of Description Logic and OWL to simulate part of a code analysis that is usually implemented with other types of formalism.

Unqualified name resolution. Based on our previous discussion, one of the main PSQ requirements is unqualified name resolution. Within SE-CodeSearch we introduce two approaches to address this problem. The first is a verbose modeling approach that we refer to as loose unqualified name resolution, which requires no reasoning at all. The second approach uses a DL reasoner to resolve the names accurately.

Loose unqualified name resolution predicts all possible qualified names for each unqualified name instance based on the local information available at each step. For each potential result, one individual of type TempResolvedType is instantiated. For example, for the code shown in Fig. 1, we add two temporary types for foo to the repository: packageA.TargetType and packageB.TargetType. One may assume that this approach adds invalid information to the repository; however, since we separate this temporary information by tagging semi-resolved types as TempResolvedType, this is not the case. This approach lets us return more complete results to the user with no reasoning. The underlying rationale of loose unqualified name resolution is that in each case only one of the guesses is true, and we can rely on the user's knowledge at runtime to identify the answer. When a user searches for variables of type packageA.TargetType, we assume that packageA.TargetType must have been implemented; consequently, foo can be returned as an answer.

While loose unqualified name resolution does not require inter-code analysis, our second approach is specifically designed for inter-code inference. In this approach, a child of UnqualifiedResolution is created for each indexed qualified name.
Then, using OWL equivalence and subclass restrictions, the resolution knowledge is represented similarly to the return statement template given earlier. The interesting aspect of this solution is its capability of resolving names even without analyzing the target class implementation (i.e., the type declaration statement); it is sufficient for the ontologizer to encounter a FQN once in any kind of statement. Unqualified name resolution is a key technique for improving the accuracy of search results. Moreover, covering all the queries described in Section II helps the user search precisely over both complete and incomplete code.

B. Implementation

As a proof of concept, we have implemented a Java version of SE-CodeSearch and integrated it within our SE-Advisor tool [5]. Most of our tool chain, except the ontologizer, is based on available software components. We used Javaparser [13] for AST construction; Racer [14] and AllegroGraph [15] are adopted as the reasoner and repository, respectively. The ontologizer is responsible for annotating the ASTs based on our loose unqualified name resolution, and then extracting and populating information or knowledge statements using the sicsont ontology.
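The loose resolution step the ontologizer performs can be sketched as pure candidate generation (the tuple-based tagging here is a simplification of the TempResolvedType individuals described above):

```python
# Hedged sketch of loose unqualified name resolution: every wildcard import
# in scope yields one candidate FQN, tagged as temporary, with no reasoning.

def loose_resolve(unqualified_name, wildcard_imports):
    """Return all candidate FQNs for a name, each tagged as temporary."""
    return [("TempResolvedType", f"{pkg}.{unqualified_name}")
            for pkg in wildcard_imports]

# The Fig. 1 situation: 'TargetType' under two wildcard imports.
candidates = loose_resolve("TargetType", ["packageA", "packageB"])
```

At query time, only the candidate matching the user's qualified search term is reported, so the invalid guess never surfaces in results.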

Figure 4. Knowledge-based analysis template for return statement

The two major types of analysis knowledge are the return statements and the reasoning-based unqualified name resolution conditions. Identification of entities is a major aspect of the Semantic Web, as it affects reasoning capability. Three kinds of individuals are created by the implementation: some identifier names must be randomized to ensure uniqueness within our global knowledge base (e.g., method call statements); others, such as classes, must always have a FQN. Based on the sicsont design, methods must be independent of their owner classes so that they can be identified by their method names alone. This approach ensures that entities have consistent identifiers even when they are created at different processing steps, which also enables parallel processing of web content through multiple crawlers. Other data required for graphical presentation, such as statement line numbers, are modeled using Semantic Web reification.
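The three identifier kinds can be sketched as follows; the URI namespace is invented for illustration, not the actual sicsont one:

```python
import uuid

BASE = "http://example.org/sicsont#"   # hypothetical namespace

def class_uri(fqn):
    # Deterministic: independently analyzed fragments converge on one URI.
    return BASE + fqn

def method_uri(name):
    # Owner-independent, per the sicsont design: method name alone.
    return BASE + "method-" + name

def call_uri():
    # Per-occurrence entities get randomized ids to stay globally unique.
    return BASE + "call-" + uuid.uuid4().hex
```

Deterministic URIs for classes and methods are what allow multiple crawlers to populate the repository in parallel without coordination, while randomized call-statement ids prevent collisions between distinct occurrences.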

IV. SE-CODESEARCH SCALABILITY TEST

A key challenge for any fine-grained knowledge-based search infrastructure is scalability. To evaluate the scalability of our approach, we conducted a set of experiments to observe its performance. Our data is based on a very large Java source code repository (about 400 gigabytes) [16]. For our experiment, we considered only Java classes (not the binaries imported by each project) in order to simulate incomplete code fragments, as code can be mined from any Internet resource. Dealing with incomplete code is more challenging, since the search engine must in this case infer additional facts. The fully populated repository contained 600,000 classes. The experiments were performed on a standard PC with 1.5 GB of available RAM and a 3.40 GHz CPU. For the experiments we selected queries from the TCQ and MCQ groups, which at the same time covered PSQ and AIQ. While the query templates were predefined for the experiment, the actual search criteria were randomly selected based on the available data. This randomization of the queries is one of the main reasons for the fluctuation in the results (Fig. 5).

Figure 5. The scalability test of SE-CodeSearch shows that its response time is independent of the repository size.

The results show that our implementation scales reasonably well. Given the repository size and the limited hardware, the observed performance can be attributed to our ontology design and population, which requires only limited and divisible forms of reasoning. A backward-chaining reasoner is used to access the complete repository for lightweight tasks, while OWL-DL reasoning tasks use a forward-chaining reasoner that accesses a very limited subset of the repository, determined by the given query.

V. RELATED WORK

Assieme [3] is a compiler-based code search project with two interesting contributions: (1) the ability to find different kinds of resources and (2) the ability to resolve possible implicit references. The second is closely related to our unqualified name resolution. Sourcerer is a relational-model-based repository and search engine [8]; it is also a compiler-based approach and does not support incomplete code. Strathcona [9] is another engine that is able to restrict a method call query by the receiver type, but it does not support hierarchical transitivity. The main drawback of a relational model approach is its lack of support for AIQs, since it does not inherently support the OWA [17]. There are also projects that provide higher-level services, such as Code Conjurer [18] and S6 [19]. Their common approach is to filter code retrieved from other code engines by means of keywords, and then test the remaining code against given criteria such as test cases. Component rank [20] is a weighted graph-based approach that represents class dependency relations to provide a usage-frequency ranking model. SPARS-J uses the component ranking schema to support coarse-grained source code search. Component ranking can be seen as a complementary approach to SE-CodeSearch, since both use graph-based models as their internal representation. As an example of FFQ-based search, which is not covered by SE-CodeSearch, SNIFF [7] uses IR techniques to support free-form queries by analyzing textual documentation to rank relevant code snippets. Similar to SNIFF, Exemplar [21] also supports natural language-based queries. While both projects rely on textual information, Exemplar only returns software components as the final result. As discussed in [7], free-form query-based search is specifically applicable when a user is searching for beacons in the program.
This approach could be applied as a preliminary step prior to a FQN-based code search to help users find the related names first.

VI. CONCLUSION AND FUTURE WORK

In this paper, we introduced a Semantic Web-enabled approach to Internet-scale code search. To the best of our knowledge, SE-CodeSearch is the first knowledge-based, semantic-rich Internet-scale code search that provides an incremental analysis that does not require a compiler for fact extraction. Instead, it uses a one-pass approach that extracts facts from incomplete code incrementally, without considering the order of the parsed code. sicsont is a novel ontology that addresses the representation of static code analysis knowledge in OWL and DL. The optimized design of the sicsont ontology allows SE-CodeSearch to scale well even for large source code repositories. Its repository population is faster than that of other engines, since no compilation step is required, and compared to other approaches the SE-CodeSearch infrastructure requires fewer hardware resources while achieving similar search granularity.

As part of our future work, we plan to add versioning information extracted from metadata to the repository and to deal with inconsistent information through levels of trust. We also plan to make our complete tool chain available to the community.

REFERENCES

[1] R. E. Gallardo-Valencia and S. Elliott Sim, "Internet-scale code search," Proceedings of the 1st ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation, pp. 49-52, 2009.
[2] C. Welty, "Augmenting abstract syntax trees for program understanding," Proceedings of the 12th IEEE International Conference on Automated Software Engineering, pp. 126-133, 1997.
[3] R. Hoffmann and J. Fogarty, "Assieme: finding and leveraging implicit references in a web search interface for programmers," Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, pp. 13-22, 2007.
[4] S. Bajracharya and C. Lopes, "Mining search topics from a code search engine usage log," Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories, pp. 111-120, 2009.
[5] J. Rilling, R. Witte, P. Schuegerl, and P. Charland, "Beyond information silos - an omnipresent approach to software evolution," International Journal of Semantic Computing, vol. 2, 2008.
[6] Google Code Search, http://www.google.com/codesearch.
[7] S. Chatterjee, S. Juvekar, and K. Sen, "SNIFF: a search engine for Java using free-form queries," Proceedings of the 12th International Conference on Fundamental Approaches to Software Engineering, 2009.
[8] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes, "Sourcerer: a search engine for open source code supporting structure-based search," Proceedings of the 21st Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2006.
[9] R. Holmes and G. C. Murphy, "Using structural context to recommend source code examples," Proceedings of the 27th International Conference on Software Engineering, pp. 117-125, 2005.
[10] D. Saha and C. R. Ramakrishnan, "Incremental and demand-driven points-to analysis using logic programming," Proceedings of the International Conference on Principles and Practice of Declarative Programming, 2005.
[11] C. Kiefer, A. Bernstein, and J. Tappolet, "Analyzing software with iSPARQL," Proceedings of the 3rd ESWC International Workshop on Semantic Web Enabled Software Engineering, 2007.
[12] D. L. McGuinness and F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[13] Javaparser, http://code.google.com/p/javaparser/.
[14] RacerPro 2.0, http://www.racer-systems.com/.
[15] AllegroGraph, http://www.franz.com/agraph/allegrograph.
[16] C. Lopes, S. Bajracharya, J. Ossher, and P. Baldi, UCI Source Code Data Sets, University of California, Bren School of Information and Computer Sciences, 2010. http://www.ics.uci.edu/~lopes/datasets.
[17] B. Motik, I. Horrocks, and U. Sattler, "Bridging the gap between OWL and relational databases," Journal of Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, pp. 74-89, 2009.
[18] O. Hummel, W. Janjic, and C. Atkinson, "Code Conjurer: pulling reusable software out of thin air," IEEE Software, pp. 45-52, 2008.
[19] S. P. Reiss, "Semantics-based code search," Proceedings of the 31st International Conference on Software Engineering, 2009.
[20] K. Inoue, R. Yokomori, T. Yamamoto, M. Matsushita, and S. Kusumoto, "Ranking significance of software components based on use relations," IEEE Transactions on Software Engineering, pp. 213-225, 2005.
[21] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby, "A search engine for finding highly relevant applications," Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010.
