The Journal of Systems and Software 121 (2016) 345–357
Context-awareness in the software domain—A semantic web enabled modeling approach

Mostafa Erfani a, Mohammadnaser Zandi a, Juergen Rilling a,∗, Iman Keivanloo b
a Concordia University, Montreal, Canada
b Queen's University, Kingston, Canada
Article history: Received 21 January 2015; Revised 18 February 2016; Accepted 23 February 2016; Available online 4 March 2016.
Keywords: Context-awareness; Meta-modeling; Semantic Web
Abstract

Recent years have witnessed rapid advances in the use of contextual information in ubiquitous and ambient computing. Such information improves situated cognition and awareness as well as stakeholders’ usage experience. While domains such as Web 3.0 – the next generation of the web – have made context-awareness a main requirement of their solution space, the software engineering domain still lacks the same rate of adoption. In our research, we introduce an ontology based context-aware meta-model that takes advantage of Semantic Web technologies to capture and formalize context information. Providing such formal context representation allows us to make context information an integrated and reusable part of the software engineering domain. We present several case studies related to the software evolution domain to illustrate the benefit of sharing and reusing context for various software engineering tasks, such as mentor recommendation, code search, and result ranking. © 2016 Elsevier Inc. All rights reserved.
1. Introduction

Many definitions for the term “context” exist in the literature. One of the earliest examples dates back to the writings of the Chinese Legalist School of philosophers (500–60 B.C.), who attempted to model human behavior in order to better guide emperors in their decision-making (Chinese Thought and Philosophy Legalism). Over the past decade, “context” has often been referred to as “situated cognition” and has been used to establish awareness of information sources (e.g., physical, computational, and user task). Establishing context-awareness requires technologies that can not only capture data, but also reason about it. Sensors are a good example of such technology; they have emerged as one of the primary methods for generating large quantities of data, often referred to as big data. Context-aware computing is a domain where the ability to mine and analyze such big data is essential to support mobile and pervasive computing applications (Ashton, 2009). Furthermore, context-aware computing supports the integration and interpretation of existing knowledge resources. It is therefore considered a key technology in Web 3.0 (Lassila and Hendler, 2007) for enabling the creation of smart
∗ Corresponding author.
E-mail addresses: [email protected] (M. Erfani), [email protected] (M. Zandi), [email protected], [email protected], [email protected] (J. Rilling), [email protected] (I. Keivanloo).
http://dx.doi.org/10.1016/j.jss.2016.02.023 0164-1212/© 2016 Elsevier Inc. All rights reserved.
services, which can deliver personalized and customizable results through autonomous agents. Collecting context data and integrating context-awareness at the application level is expensive. In order to reduce this cost, reuse and sharing of context information among context-aware applications must be considered from the beginning of the development cycle. However, such early lifecycle integration requires the availability of well-defined and reusable context models that allow for the sharing and integration of context data across resource and application boundaries. While existing work on context modeling (e.g., Korpipaa et al., 2003; Costa et al., 2004) has focused mainly on the conceptualization of context at the application or domain level, it lacks support for sharing, reusing, and integrating context across application and domain boundaries. In this research, we introduce an ontology-based meta-modeling approach, which takes advantage of the Semantic Web technology stack (Berners-Lee) to support context reuse at both the model and knowledge sharing levels. Given their expressiveness and support for both semantic reasoning and the open world assumption (OWA), ontologies have been widely used in meta-modeling (Bettini et al., 2010; Perttunen et al., 2009) to conceptualize a domain of discourse. While context information might differ among application domains, its core requirements remain the same across domains. As part of our modeling approach, we support both the capturing of core context requirements as well as domain and application specific contexts. More specifically, we are interested in modeling and reusing context information found in the software
Fig. 1. Context information in web search.
engineering domain. Context information in software engineering differs significantly from other domains based on the availability of hardware sensors and the type of context data. For example, in the software engineering domain the main sources for sensory data are: (1) dynamic (run-time) sensory data (e.g., collected from a user interaction with the system) and (2) static sensory data mined from static information resources (e.g., software repositories). The ontology based context-aware meta-model which we present in this paper is an extension of our previous work (Erfani et al., 2014). For our current research, we not only further refined our original context model, but also present several case studies to illustrate the potential benefits of context sharing and reuse in the software evolution domain. The remainder of the paper is organized as follows. Section 2 motivates our research, highlighting the lack of context-awareness in the software domain. Section 3 reviews related work, followed by Section 4, which provides an overview of our ontology based meta-model approach. In Section 5, an instantiation of our meta-model for the software domain is shown. Section 6 presents an application level evaluation and an example of a software domain instance. Section 7 describes various threats to validity. Finally, in Section 8 we present our discussions and conclusions.

2. Motivation

A common objective for the use of context information is to facilitate the interaction between a system and user task completion by capturing situation-specific information (Bettini et al., 2010; Perttunen et al., 2009). Context-awareness has become an essential part of many application domains including ambient and heterogeneous computing (Bettini et al., 2010; Perera et al., 2014), mobile applications (de Farias et al., 2007), and web search (Haveliwala, 2003).
Personalization of information plays an important role in the success of context-aware applications, allowing services to adapt to a user’s context. Context-aware systems differ from personalized systems by prioritizing the mining of global information over individual users’ long- and/or short-term histories. Furthermore, context-awareness has also become an essential part
of emerging application domains such as Web 3.0 (Lassila and Hendler, 2007). These application domains take advantage of context information to identify users’ needs, provide user-centric results, and facilitate system-to-system interaction. A key challenge for integrating context in existing systems is the necessity to deal with heterogeneous knowledge collected from different resources, while making this information machine interpretable and reusable across different applications. Web 3.0 is expected to address many of these challenges by integrating different technologies such as ubiquitous computing and the Semantic Web in order to facilitate personalization and context awareness. Web 3.0 also aims to make context information semantically structured, readable, and interpretable by both humans and machines. Having structured information allows for the integration with other knowledge resources and the development of intelligent agents. This will enable the creation of end-user content that is semantically richer and offers a greater degree of personalization and situation-awareness. For example, most Web search engines exploit a large amount of different context data in order to improve and customize their result sets (Haveliwala, 2003). Fig. 1 illustrates such use of contextual information as part of a Web search engine. Their ability to integrate both personalized and context data has become a main factor in their continuing success. In many cases, Internet search engines use a combination of over 200 individual signals such as geographical information, search history, and profile information to establish a user context before returning their result set to the end-user. This is in contrast to the software domain where software engineering infrastructures often lack the same support for modeling and integration of context information as part of their solution space.
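To make the idea of combining contextual signals concrete, the following is a purely illustrative sketch, not any search engine's actual algorithm: a base relevance score is adjusted by a geographic signal and a search-history signal. Every name, field, and weight here is an invented assumption.

```python
# Illustrative sketch of context-aware result ranking: a base relevance
# score is adjusted by contextual signals (location, search history).
# Real engines combine hundreds of signals with learned weights; the
# field names and weights below are invented for illustration only.

def contextual_score(base_relevance, result, context):
    score = base_relevance
    if result.get("region") == context.get("region"):
        score += 0.2  # geographic signal: result matches the user's region
    overlap = set(result.get("topics", [])) & set(context.get("history_topics", []))
    score += 0.1 * len(overlap)  # history signal: topical overlap with past searches
    return score

context = {"region": "CA", "history_topics": ["java", "testing"]}
results = [
    {"id": "a", "region": "CA", "topics": ["java"]},
    {"id": "b", "region": "US", "topics": ["cooking"]},
]
ranked = sorted(results, key=lambda r: contextual_score(0.5, r, context), reverse=True)
print([r["id"] for r in ranked])  # → ['a', 'b']
```

The same base relevance yields different rankings for different user contexts, which is the essence of the personalization discussed above.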
For example, source code search engines, while similar to Web search engines, rely on only a few, mostly context-unaware criteria (e.g., term frequency) to improve their result set ranking. Several reasons exist for this lack of context-awareness, ranging from the limited availability of large datasets containing sensory and context-mineable data to the widespread globalization of the software industry and the consequent distribution
Fig. 2. Semantic Web architecture in layers.
of knowledge across various resources. The situation is further complicated by the lack of standardized models for capturing and sharing knowledge in the software engineering domain. Proprietary data formats are still widely used to exchange data between applications, limiting the ability to seamlessly integrate and reuse knowledge across application or even domain boundaries. These exchange formats are also not extendable to support evolving system and application needs. Our research is motivated by the need for a formal, standardized context modeling approach. We therefore focus on developing a context model for the software engineering domain, which provides a reusable and extendible modeling approach based on mature, standardized knowledge representation techniques. Similar to the Web 3.0 paradigm, we rely on the Semantic Web technology stack to model context and integrate context information as an ambient part of the software engineering domain. Taking advantage of the Semantic Web provides us not only with the ability to integrate distributed knowledge resources, but also allows us to deal with the onerous data processing requirements needed to establish context-awareness.

3. Background and related work

3.1. Semantic Web and its technology stack

For machines to understand and reason about knowledge, it must first be represented in a well-defined and machine-readable language. The Semantic Web is an initiative of the W3C to introduce such a language for standardizing the formal representation of knowledge on the Internet. The following section provides a brief overview of the Semantic Web technology stack and its enabling technologies. Fig. 2 provides an overview of the complete Semantic Web architecture and technology stack as introduced by the W3C (Berners-Lee). The lowest layer, URI and Unicode, contains elements that are already key to the existing WWW.
Uniform Resource Identifiers (URIs) allow an understandable identification of resources (e.g., documents) that are distributed across the Internet; Uniform Resource Locators (URLs) are considered a subset of URIs. XML is a general-purpose markup language for documents containing structured information. With its namespace and schema definitions, it provides us with a common syntax used by the Semantic Web. The Resource Description Framework (RDF) (Klyne and Carroll, 2004) has emerged as a core data representation format for the Semantic Web. RDF is the primary language used to represent and store resource (meta-) data in graph form. RDF is based on triples of the form ‘subject-predicate-object’ that comprise the data graphs. To allow for a standardized description of taxonomies and other ontological constructs, an RDF Schema (RDFS) (Brickley and Guha, 2004) combines formal semantics with RDF. RDFS describes taxonomies of classes and properties and uses them to create lightweight ontologies. For more detailed ontologies, the Web Ontology Language (OWL) (McGuinness and Van Harmelen, 2004) can be used. OWL is derived from description logics and is syntactically embedded into RDF. Much like RDFS, it provides additional standardized vocabulary. OWL comes in three species: (1) OWL Lite for taxonomies and simple constraints, (2) OWL DL for full description-logic support, and (3) OWL Full for maximum expressiveness and syntactic freedom of RDF (Baader et al., 2003). RDFS and OWL have well-defined semantics that allow for reasoning within ontologies as well as within knowledge bases that are described using these languages. The Semantic Web supports standardized rule based languages (e.g., RIF (Kifer, 2008) and SWRL (Horrocks et al., 2004)) to further extend its reasoning capabilities. To query RDF data as well as RDFS and OWL ontologies within knowledge bases, the SPARQL Protocol and RDF Query Language (SPARQL) (Prud’hommeaux and Seaborne, 2008) is available. Since both RDFS and OWL are built on RDF, SPARQL can be used to query both of them. As part of the Semantic Web stack, all rules regarding semantics will be executed below the Proof layer. Results from these executions will be used by the Proof layer to prove deductions. These formal proofs, together with trusted inputs for the proofs, ensure that results can be trusted. For reliable inputs, cryptographic means should be employed (e.g., digital signatures) to allow for the verification of the origin of the sources.
On top of the Semantic Web technology stack are end user interfaces and applications that take advantage of the Semantic Web infrastructure.
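To make the triple-based data model above concrete, the following toy sketch stores facts as subject-predicate-object triples and answers a SPARQL-style triple pattern with wildcards. It is plain Python rather than a real RDF library, and all resource names are invented for illustration.

```python
# Minimal illustration of the RDF data model: facts are stored as
# subject-predicate-object triples and retrieved by pattern matching,
# which is the core idea that SPARQL generalizes. A real system would
# use an RDF library and a SPARQL engine instead of this toy store.

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

graph = [
    ("ex:commit5", "ex:committedBy", "ex:alice"),
    ("ex:commit5", "ex:fixes", "ex:issue3"),
    ("ex:issue3", "ex:hasStatus", "RESOLVED"),
]

# Rough analogue of: SELECT ?p ?o WHERE { ex:commit5 ?p ?o }
for s, p, o in match(graph, s="ex:commit5"):
    print(p, o)
```

The wildcard pattern is the essence of SPARQL's basic graph pattern matching; languages such as RDFS and OWL then layer standardized vocabulary and inference on top of the same triple structure.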
3.2. Context and context modeling

The term context-awareness has been widely used for systems that keep track of their surroundings (Schilit et al., 1994). Context-awareness is also an essential part of ubiquitous computing (also known as omni-present, pervasive, everywhere or universal computing), where devices are used to provide users with personalized services to fit a user’s environment. Ubiquitous computing integrates information processing in everyday objects and activities, allowing machines to fit user environments instead of forcing users to adapt to machines. As a result, for many ordinary activities, ubiquitous computing will engage many computational devices and systems simultaneously (Perera et al., 2014). Over the past decade, context modeling has become a mainstream activity, with existing approaches mainly focusing on creating context models. Such models typically abstract and conceptualize contexts from different individual systems or applications into a unified model (Bettini et al., 2010; Perera et al., 2014). These models differ in their capabilities, including their expressiveness, usability, interoperability, and support for specific application domains. In a context-aware system, the focus is on creating smart services, which work in concert to support people in carrying out their everyday life activities in an easy way using information and intelligence (Ashton, 2009). Context data for these models is collected through a variety of sensors and non-sensor resources, which allow capturing the current, often ambiguous, imprecise, and erroneous status of an environment. In the literature (e.g. Bettini et al., 2010; Perera et al., 2014), several core requirements (Table 1) have been identified which should be supported by any context model.
Table 1. Context modeling core requirements.

Heterogeneity: Ability to express different types of context information.
Relations: Relationships among context information types.
Timeliness: Context history as an information source.
Imperfection: Factors (e.g., time) affecting the quality of context information.
Reasoning: Consistency checking, context adaptation, and inference of new context information.
Usability: Ability to translate real-world concepts to modeling constructs and to manipulate context information at run time.
Efficiency: Fast access to large amounts of context information and data objects.
Validation: Ability to validate partial or full data content.

3.3. Applications of context and ontologies in the software engineering domain
Context modeling approaches (Bettini et al., 2010; Perera et al., 2014) can be categorized based on their representation approach, such as: Key-value Models, Markup Models, Graphical Models, Object Oriented Models, Logic-based Models, and Ontology-based Models. We limit our review to ontology-based context models. For a detailed review of the other modeling approaches, we refer the reader to the surveys in Bettini et al. (2010), Perttunen et al. (2009), Perera et al. (2014), and Strang and Linnhoff-Popien (2004). Common to ontology-based models is their support for validation through different strategies such as data-type validation, consistency checking, and range specifications. A key difference among ontology-based context models is the standards and representation schemas used to model context data, such as: (1) the Resource Description Framework (RDF) (Klyne and Carroll, 2004), (2) the Resource Description Framework Schema (RDFS) (Brickley and Guha, 2004), and (3) the Web Ontology Language (OWL) (McGuinness and Van Harmelen, 2004). A good example of an ontology-based context model is the Aspect-Scale-Context (ASC) model introduced by Strang and Linnhoff-Popien (2003), which supports interoperability among data sources through bi-directional relations. The same model also forms the core for their non-monolithic Context Ontology Language (CoOL) (Strang et al., 2003), which allows representing context-awareness in distributed service frameworks. Gu and Wang (2004) introduce SOCAM, a service-oriented context-aware middleware based on OWL. Their underlying CONON (Wang et al., 2004) ontology consists of two layers: an upper ontology to capture the model’s core concepts, and a lower layer for the domain specific concepts. OWL reasoners support contextual reasoning and consistency checking. However, the model is restricted in its structure and assertions, leading to potential semantic contradiction for real world contexts.
CoBrA (Chen et al., 2003) is an agent-based architecture supporting context-aware systems in smart spaces (rooms, homes, and cars). This approach is domain-centric and relies mostly on hardware sensors. In Fuchs et al. (2005), a context meta-model based on the OMG meta-modeling approach is introduced that provides a formal mapping between the meta-model and OWL by creating restricted ontologies that comply with OWL DL and support DL reasoning. Several RDF-based models have been introduced for capturing quality-related context information, such as Korpipaa et al. (2003) for mobile devices, as well as Schmidt et al. (1999). A major limitation of these RDF-based models is their lack of modeling abstractions, which restricts model reuse in different applications or domains. In general, a key difference between existing ontology-based context models and our approach is that these existing models focus mainly on context conceptualization, whereas our modeling approach also considers data characteristics as an integrated part of the model. As a result, our model also supports data quality and time (freshness) aspects found in the software domain.
For the software engineering domain, context-aware systems aim to improve the tools and environments used to support situation awareness, allowing for user-centric results and support. In addition, context information captured in the software engineering domain is restricted to a limited number of artifacts and sensors (e.g., developer profiles, API calls, coding patterns, test cases and documentation). Existing applications of context in the software domain such as code search and code completion (Nguyen et al., 2012; Sahavechaphan and Claypool, 2006; Thummalapenta and Xie, 2007) have utilized some context information (i.e., code patterns, code signatures) to improve their analysis results. Recommendation systems tightly integrated within a given IDE have started to gain a foothold in the software domain (e.g., Rational Jazz). Common to these context applications is that they only support proprietary tool contexts. This is in contrast to our approach, which provides a formal and non-tool-specific context modeling solution, allowing for reuse and integration of context information across tool and resource boundaries. In other related work, ontologies have been used to conceptualize a domain of discourse. Würsch et al. (2012) proposed a set of ontologies organized in a three-layer architecture to model software evolution knowledge. However, these ontologies only capture historical rather than current context information. Although the authors claim that their approach is extendible, it lacks support for some of the general requirements of context-aware modeling (e.g. reasoning). Similarly, in Zhang et al. (2008) and Rilling et al. (2008), we introduced ontologies for the software engineering domain. However, the focus of our earlier work was on knowledge integration and traceability aspects among software repositories. In Keivanloo et al.
(2012), we introduced an RDF-based approach, creating a unified representation for a very large dataset capturing facts from different software repositories. IBM Rational Jazz and Hipikat (Čubranić et al., 2005) have introduced frameworks for the annotation and integration of software artifacts at different stages of the software development cycle. However, given their proprietary context modeling approaches, their applicability is restricted to these tools, limiting the integration of context information from external sources and its reuse with third-party tools.

4. An ontology-based context meta-model

Context data originates from a variety of dynamic (run-time) sensors and static information resources, which collectively capture an environment and its context. A key challenge when dealing with context data is that it tends to be ambiguous, imprecise, and erroneous. Ontological models can deal with these data challenges by taking advantage of different types of knowledge inference services. These inference services provide various validation options for consistency, transitive dependency, and range specification checking (Bettini et al., 2010; Perttunen et al., 2009). Similar to Web 3.0 and its basic idea of providing a standardized and semantically-enriched knowledge modeling approach, our context model (Fig. 3) takes advantage of the Semantic Web technology stack (including ontologies, formal semantics, inference services, and RDF) to capture, model, and reason upon context information. Our context model consists of several different abstraction layers, which allows us to improve the generalizability and reusability of the model. Fig. 4 shows the upper meta layers of our model. A complete description of the model, including all of its concepts and relations, is available online at SECOLD Ontology.
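The layered model sketched in Fig. 4 can be illustrated with a small class hierarchy. The class names (Class, Property, AbstractEntity, AbstractAssociation, AbstractConcept) follow our model; the Python structure and attributes are purely illustrative assumptions.

```python
# Sketch of the meta-meta and meta layers of the context model. Only the
# class names come from the model (Fig. 4); everything else is an
# illustrative assumption, not the model's actual implementation.

class Class:          # meta-meta level: anything that can be a subject or object
    pass

class Property:       # meta-meta level: anything that can be a predicate
    pass

# Meta level: a triple <subject, predicate, object> is modeled as
# <AbstractEntity, AbstractAssociation, AbstractConcept>.
class AbstractEntity(Class):       # captures the subject of a triple
    pass

class AbstractConcept(Class):      # represents the object of a triple
    pass

class AbstractAssociation(Property):  # corresponds to the predicate of a triple
    def __init__(self, entity, concept):
        self.entity = entity    # the subject side of the association
        self.concept = concept  # the object side of the association

# Domain-level classes would refine these, e.g. Developer extending the
# user concept; here a plain association stands in for a concrete triple:
assoc = AbstractAssociation("Developer:alice", "ChangeSet:42")
print(assoc.entity, assoc.concept)
```

Lower modeling levels (Section 5) extend and refine these classes, which is what makes the approach reusable across domains.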
Fig. 3. Overview of our approach.
Fig. 4. Context meta-meta and meta-model abstraction.
At the meta-meta level, two conceptual classes capture our model’s core semantics: Class and Property. The lower modeling levels extend and refine these two conceptual classes to capture context details. At the meta level, triple structures are modeled through the conceptual classes: (1) AbstractConcept, representing the object of the triple, (2) AbstractAssociation, corresponding to the predicate, and (3) AbstractEntity, which captures the subject. A key premise of context models is that when representing situated cognition a model also has to consider temporal and quality aspects of the information resources. For example, as time elapses the quality of some context information can degrade rapidly (in particular sensor and run-time data), whereas the quality of static information is less time sensitive. In our model, time is captured explicitly as part of sensory data and implicitly through some static resources. A challenge in dealing with freshness (time) using an ontological modeling approach is that ontologies provide only limited native support for explicitly capturing time and temporal aspects. For our meta-model, we address this shortcoming by taking advantage of existing ontology design patterns, which are discussed in detail in the next section. From a modeling perspective, a key advantage of our ontological modeling approach is its ability to capture both core and
non-core contextually-relevant information categories (Bettini et al., 2010; Perttunen et al., 2009). Table 2 provides an overview of the mapping between these contextual categories and their corresponding ontological representation in our context meta-model, as well as their instantiation at the (software) domain level. It should be noted that we have omitted some high-level context categories in Table 2 (e.g., operational, objective, external, physical) to improve the readability of the table.

5. Software context-aware model

The software domain layer (Fig. 5) is derived from our meta-model to capture domain-specific context aspects, such as: (1) static information mined from different resources (e.g., software repositories, social networks, log files and other resources), and (2) sensory (dynamic) data collected from IDEs or active software repositories, where state changes are automatically broadcast (e.g., concurrent code check-out or code changes directly within an IDE). While sensors in the software domain still capture both time and freshness, their actual quality measures differ significantly from the sensory data found in other domains. For example,
Fig. 5. Software domain context model.
Table 2. Context categories and their meta-model mapping.

Context category | Meta-model | Software domain instance
User | User | Software Engineer
Computing (System) | Association, Concept | commit → changeSet
Physical (Environment) | Association, Concept | speaks → ENGLISH
Historical | Association, Concept | workedOn → bug
Social | Association, Concept | knows → developer
Networking | Association, Concept | knows → developer
Things | Concept | Bug
Sensor | Sensor | IDE
Who (Identity) | Association, User | commitedBy → Developer
Where (Location) | Association, Entity | place → USA
When (Time) | TimeEnabledAssociation, Type | CaptureDateTime → DATE
What (Activity) | Association, Concept | commit → changeSet
Why | Association, Concept | commitFor → bug
Sensed | SensedAssociation, Concept | hasStatus → RESOLVED
Static | StaticAssociation, Concept | knows → JAVA
Profiled | SensedAssociation, Concept | assignee → developer
Derived | DerivedAssociation, Concept | worked → API
accuracy and freshness for GPS signals in mobile computing are measured in milliseconds (almost real-time) rather than the typical seconds or minutes used in the software domain. For the software domain layer, we extend the AbstractUser class with Developer to capture developer-specific information. ArtifactRepository extends the Sensor class and is further refined through the subclass SoftwareArtifactRepository to capture data related to software repositories. In addition, the class DevelopmentTool (also derived from the Sensor class) is introduced to model the specific types of sensors found in the software domain. Both classes (ArtifactRepository and DevelopmentTool) play a key role in modeling historical and dynamic context information. SoftwareArtifact extends AbstractArtifact to capture software artifact specifics and its data. As mentioned before, sensory data in the software engineering domain differs from traditional context-aware systems, since software artifacts lack quality information about their data. While in other domains (e.g., mobile applications) the quality of sensory data might fluctuate (e.g., the accuracy of GPS data might vary depending on signal strength), sensory data in the software domain can be considered accurate. For the software domain,
sensory data originates from controlled systems (e.g., version control systems, IDEs) where status changes and events are typically well defined. However, in situations dealing with inaccurate information that is derived and implicit (e.g., through pattern or data mining analysis), the capturing of quality aspects is required. For instance, traceability links among software sensors might not be explicitly available and are often only established through analysis, pattern matching, or other types of inference, in such a fashion that makes these links less reliable. In our model, we capture these potential inaccuracies (quality fluctuations) through the DerivedAssociation class and its quality properties. As part of our modeling approach, facts about derived information are captured as children of DerivedAssociation (e.g., hasTraceabilityLink), and quality information for those facts is represented by the reification ontology design pattern. In this design pattern, two predicates (i.e., reification properties) replace the derived predicate. Both of these reification properties use the PropertyReification class as their domain, with their ranges being the domain and range of the original derived property (Fig. 6(a)). The PropertyReification class is an instantiation of QualityEnabled, capturing the quality information for specific derived information. In the following example, we illustrate the use of the property reification pattern to model inferred facts between a commit message and an issue. Assuming the following commit message: “Issue #3 is resolved as part of commit #5”, an implicit traceability link can now be established between a commit and an issue using the following triple: <Issue#3, fixedIn, Commit#5>. In order to capture the quality (accuracy) of this link (e.g., 0.9 precision), we now substitute the original fixedIn predicate with a new subclass of the ReificationProperty class (i.e., TraceabilityLink) and two other predicates (i.e., hasIssue, hasCommit), shown in Fig. 6(b).
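A minimal sketch of this reification pattern and its deduction rule (Eq. (1)) in plain Python: the predicate names hasIssue, hasCommit, and fixedIn follow the running example, while the node identifier (link1), the precision predicate name, and the tuple-based triple store are our illustrative assumptions.

```python
# Sketch of the property reification pattern: the derived predicate
# fixedIn between an issue and a commit is replaced by a reification
# node (a TraceabilityLink) carrying two predicates, hasIssue and
# hasCommit, plus a quality value. deduce_derived applies the rule
# R1(rp, d) ∧ R2(rp, r) → R3(d, r) to recover the original derived fact.

triples = [
    ("ex:link1", "hasIssue", "ex:issue3"),    # R1: reification node -> domain
    ("ex:link1", "hasCommit", "ex:commit5"),  # R2: reification node -> range
    ("ex:link1", "precision", 0.9),           # quality of the derived fact
]

def deduce_derived(triples, r1, r2, r3):
    """Derive R3(d, r) for every reification node rp with R1(rp, d) and R2(rp, r)."""
    domains = {s: o for s, p, o in triples if p == r1}
    ranges = {s: o for s, p, o in triples if p == r2}
    return [(domains[rp], r3, ranges[rp]) for rp in domains if rp in ranges]

print(deduce_derived(triples, "hasIssue", "hasCommit", "fixedIn"))
# → [('ex:issue3', 'fixedIn', 'ex:commit5')]
```

The quality triple (precision 0.9) stays attached to the reification node, while the deduction rule restores the plain fixedIn fact for reasoning purposes.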
Using this modeling approach, we can now represent the traceability link and its quality attribute as explicit triples. Although this reification design approach increases the complexity of our ontology design (in terms of the number of classes), it provides us with an extensible and expressive model that can capture different quality aspects. A drawback of our design is that the original property becomes a class, and therefore some of the modeling expressiveness associated with properties (e.g., transitivity or symmetry) is no longer supported. In order to mitigate this problem, we added a rule (shown in Eq. (1)) to deduce a derived property based on the two new properties used in the reification design approach. Through this rule, transitive and symmetric reasoning on the property deduced from the reification design pattern is again supported.

R1(rp, d) ∧ R2(rp, r) → R3(d, r)    (1)

where R1 and R2 are the two reification properties relating the reification instance rp to the domain individual d and the range individual r, and R3 is the deduced derived property.

Fig. 6. (a) Property reification design pattern. (b) Property reification example.

Another modeling challenge for sensory data is the need to capture ordered and temporal data. For example, snapshots or file changes in a version control system need to be modeled such that the temporal order of these changes is maintained. Ontologies in general are not well suited to model such sequential (ordered) information. We address this modeling challenge by taking advantage of another existing ontology design pattern, the Ordered List (Abdallah and Ferris, 2010) pattern (Fig. 7). This pattern efficiently models sequences of information resources as a semantic graph using basic concepts and properties. The Ordered List pattern overcomes limitations of the native RDF sequence representation (i.e., rdf:Seq), which lacks query efficiency and requires that the size of the list be known at design time. The Ordered List pattern is based on two concepts: OrderedList and Slot. Each olo:Slot refers to its content through the olo:item property. The number of slots is modeled by the length attribute of olo:OrderedList. Each slot uses olo:index as its direct access method to a single slot item. Sequential access to the previous or next slot is also supported by the Ordered List pattern, and the item index also provides direct access to additional information such as the first/last item in the list.

The Ordered List pattern also supports advanced semantic reasoning to further improve the flexibility and inference of our knowledge model. Using inference services in connection with the Ordered List pattern allows us not only to define and identify sequences of change patterns but also to query directly for status changes in the knowledge base without the need to first preprocess the time sequences. For example, using the Ordered List pattern it is possible to trigger a rule to take action once the status of an issue changes (e.g., from open to assigned), rather than having to first analyze the individual event timestamps to derive the sequence of event changes. Furthermore, the ontology pattern allows us to define and identify sequences of change patterns.

Fig. 7. Ordered list ontology design pattern (Abdallah and Ferris, 2010).

6. Application level examples
Traditionally, software systems were developed as closed-source projects following well-defined development processes and reporting structures. For these systems, context-awareness was only considered (if at all) as a highly customized or organization-specific problem, resulting in proprietary and non-reusable context modeling. However, over the past decade both context-awareness and the personalization of information in the software domain have been gaining importance. This is due to many factors, such as: (1) the availability of new and more sophisticated sensors (e.g., repositories), including the availability of more sensory data; (2) the globalization of the software industry, with its distribution of knowledge resources across project and organizational boundaries; and (3) the popularity of open source projects with their less controlled, process-oriented development philosophy.

At the same time, progress has been made in capturing task- and context-relevant information. Software repositories now collect some sensory and user-specific information as part of their repository data (e.g., time stamps, location, user-id). In parallel, the mining software repositories (MSR) community has focused on analyzing and mining historical data stored in these software repositories to discover patterns and make this data actionable. However, such data must be considered 'information silos' due to the lack of a standardized knowledge representation that would allow for a seamless integration of facts and context information mined from multiple sources and sensors (Hassan, 2008). In contrast, our modeling approach supports the creation of information hubs (as promoted by Web 3.0) where context information becomes an integrated part of the solution space. These information hubs integrate dynamic sensory data with static knowledge resources to support software and knowledge retrieval services that are context-aware.
The following use cases illustrate how our ontology-based context meta-modeling approach can improve the system and user interaction experience in the software engineering domain by establishing situation awareness.

6.1. Mentor assignment

Mentor assignment is the process of recommending experienced developer(s) to guide a novice programmer who is unfamiliar with an existing code base. However, several challenges exist in recommending a mentor. Firstly, given the growing size of software systems, even core developers are often only familiar with certain aspects of a system. As a result, developers take ownership of certain features, and therefore multiple experts might be
required to mentor software immigrants and novice programmers during the completion of different tasks. Secondly, as reported in a study conducted by Compdata (2013 Turnover Rates by Industry), which surveyed over 34,000 companies in the US, the average turnover rate in the US between 2008 and 2013 was 15.2%. This high turnover rate, combined with the increasing size of project teams, adds to the challenge of identifying and assigning mentors to new programmers.

A significant body of research exists on mentoring and integrating new developers into existing software projects. Previous studies in this area have focused on mining history information from sources like version control systems (Canfora et al., 2013; Mockus and Herbsleb, 2002; Minto and Murphy, 2007), social networks (Shami et al., 2008; Ye et al., 2007; Moraes et al., 2010), and issue tracking systems (Kersten and Murphy, 2006) to recommend potential mentors (Mockus, 2010; Canfora et al., 2012). However, only limited research exists that focuses on task-specific mentoring (expert referral), which is the process of recommending mentors who are experts in a given task.

As discussed earlier, our context modeling approach takes advantage of both real-time (dynamic) information collected from software sensors such as development tools (e.g., IDEs), as well as static historical information stored in software repositories (e.g., issue trackers, version control) and other relevant resources. In the following study, we illustrate how a mentor assignment approach can benefit from context-awareness based on available static and dynamic information to improve the quality of the assignment.

Task-at-hand. The mentor assignment process is triggered by a developer's current task-at-hand. Initial task situation awareness is established through sensory data collected from different sensors (e.g., IDE, issue tracker).
For example, a task context can be established through issue status changes (e.g., the issue status changes from "Assigned" to "In Progress") as well as additional information mined from the issue tracker (e.g., summaries, descriptions, specifications, comments).

IDE – a software sensor. IDEs (e.g., Eclipse) are capable of capturing a developer's real-time context information (e.g., project name, the file the user is currently working on, open files in the IDE, programming language) and making it available for further processing. The frequency of updates, the current file being edited, and all dynamic information collected from the IDE are context-relevant data, which can be used to provide further 'task-at-hand'-specific information (Malheiros et al., 2012).

Mentors list. Knowledge in collaborative and global software development processes tends to be distributed across resources and their boundaries. Given this diversity and heterogeneity of knowledge resources, programmers (and other stakeholders) face an increasing challenge in locating resources relevant to a current task-at-hand. In particular, software immigrants and junior developers, who are often unfamiliar with the current system, lack the required domain, application, and programming experience. Furthermore, they face additional challenges due to missing, inconsistent, or incomplete documentation. In such situations, developers seek guidance from people familiar with their problem context to complete their task at hand (Čubranić et al., 2005; Steinmacher et al., 2012; Begel et al., 2010). However, several challenges might arise when trying to identify such potential mentors automatically, since the actual selection process has to consider not only a developer's but also a potential mentor's current work context. In this research, we therefore propose an automated mentor recommendation approach, which takes advantage of available resources.
Among these knowledge resources are sensory data, which capture the current work contexts of both the developer and potential mentors (e.g., task, files being edited, day and time, location). In addition, historical data is mined from various resources (e.g.,
versioning systems, issue trackers) to identify previous usage patterns. The quality of any automated mentor recommendation approach will depend on its ability to collect and interpret context information to personalize and improve the quality of its recommendations in order to closely match a user's task-at-hand context.

Mentor. We consider a mentor to be any developer within a project who has expertise relevant to the task-at-hand context of a programmer (e.g., a software immigrant) seeking help. This expertise could involve the fact that this developer (mentor) has previously worked on similar issues, or perhaps has performed the most recent changes to the files the software immigrant is currently working on.

Mentor recommendation. An automated mentor recommendation approach should not only be capable of identifying potential mentors but also guide users in selecting the most appropriate mentor, by ranking the recommended mentors based on some quality, similarity, or relevance criteria.

Ranking of mentors. Many different criteria are used for the ranking of the mentor result set, including the relevance or similarity of a mentor's historical data and expertise with the context sensory data collected from a user's current task-at-hand. The relevancy of an individual mentor recommendation is measured through fImportance, which includes criteria such as keyword matching and term frequencies to assess the similarity of a user's current task context and the mentor recommendation. Moreover, we consider both the importance and frequency of individual changes, which are captured by the cImportance measure. This measure assesses a developer's experience based on the assumption that a developer modifying a file should be familiar with the file and the task to be completed (Mockus and Herbsleb, 2002).
Therefore, the relevancy of a mentor recommendation is based on changes to the files (in the current developer’s task context) and the times when these commits occurred – see Eq. (2).
score(d) = Σ_{i}^{n} fImportance(f_i) × Σ_{j}^{m} cImportance(c_j)    (2)
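Read operationally, Eq. (2) is a weighted sum over the files in the task context (its symbols are explained below). The following is a minimal Python sketch; the fImportance/cImportance stubs in the usage example are illustrative placeholders, whereas the paper's actual measures use keyword matching and commit recency.

```python
def score(task_files, mentor_commits, f_importance, c_importance):
    """Eq. (2): sum over the n files f_i in the task context of
    fImportance(f_i) times the summed cImportance of the mentor's
    m previous commits c_j touching f_i."""
    return sum(
        f_importance(f) * sum(c_importance(c) for c in mentor_commits.get(f, []))
        for f in task_files
    )

# Toy illustration (hypothetical file names and weights): two open files,
# and one candidate mentor's commit history per file.
f_importance = {"Widget.java": 1.0, "Util.java": 0.5}.get  # file relevance to the task
c_importance = lambda commit: commit["recency"]            # newer commits weigh more
mentor_commits = {
    "Widget.java": [{"recency": 0.9}, {"recency": 0.4}],
    "Util.java": [{"recency": 0.2}],
}
task_score = score(["Widget.java", "Util.java"], mentor_commits,
                   f_importance, c_importance)  # 1.0*(0.9+0.4) + 0.5*0.2 = 1.4
```

As the task context changes (files opened or closed), re-evaluating this sum over each candidate yields the updated ranking.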
where n is the number of files that are opened or currently edited in a developer's current task context, with f_i referring to the individual files. Function fImportance(f_i) calculates the relevancy of a file's content to the task at hand. A file is considered to be modified if a change has been recorded in the set of previous commits c_j, with m being the number of previous commits on f_i by mentor d. Function cImportance calculates the importance of individual commits based on the recorded date of c_j and the changes in f_i. As the task context changes, so do the scores for potential mentors, with resulting changes in the recommendation ranking result set.

In what follows, we present a controlled study designed to assess the applicability of our context model by examining the impact of context information on the quality of automated mentor recommendations.

Objective: For this study, we take advantage of historical data found in the software repositories of mature open source projects. More specifically, we assess how closely mentors recommended by our approach match the known involvement of different mentors/programmers in completing a particular task (in our case, an already-closed issue).

Setting: Data used for our evaluation was extracted from our existing SECOLD V2 datasets (SECOLD Ontology): an Internet-scale dataset populated with metadata, commits, issues, and source code extracted from over 50,000 open source projects prior to April 2012, including the Google-web-toolkit project (see Table 4).

Results: For our study, we manually inspected a subset of the Google-web-toolkit project data (Versions 2.3–2.5). We selected a subset of commits, each of which corresponded to a 'fix' for an
Table 3
Results from the mentor assignment evaluation (I#: issue number; R#: revision number; NoF: number of files associated with a commit; NoC: number of potential mentors; MR: ranking position of the assigned mentor in the result set).

R#     I#    NoF  NoC  MR
10948  6961    2    5   1
10944  3069    2    7   2
10941  6724    1    5   2
10940  4930    1    4   1
10939  4221    1   16   3
10935  6760    1    2   1
10896  5264    2   10   1
10895  7065    5    9   1
10894  2934    8    5   2
10865  6167    1    4   3
10858  6834    1    2   1
10697  4342   13   22   2
10636  6161    5   11   2
10489  6367   12   22   6
10472  6081    7    5   2
10437  6554    1    8   1
10317  5367   18    6   3
10288  6040    2    9   2
10282  6145   33   27   1
10257  5639   17   14  12
10248  5707    2   10   5
10170  6015    1    8   2
10122  6208    8    3   2
10120  6300    7    3   2
Table 4
Google-web-toolkit project information.

Project size in number of files      7474
Project size in lines of code     720,016
Total number of committers            164
Total number of commits              5859
Total number of issues               7229
Total number of issue assignees        73
Date range                 04/2011–04/2012
Version number                    2.3–2.5

Fig. 8. Sample SPARQL query.

open issue. We then established for each of these selected commits a task-at-hand context by analyzing the corresponding open issue. We then analyzed each file's content prior to the committed changes by comparing the file content (e.g., keyword matching, method signatures) against the issue content in order to rate the importance of files with respect to the particular issue. Next, in order to obtain a list of mentor candidates, we took advantage of SPARQL queries (Fig. 8) to identify developers that had committed changes. For evaluating our recommendation approach, we created a baseline by manually inspecting all past commits that any potential mentor had performed on the files related to the given task-at-hand. In order to eliminate potential noise in the data, we conducted a separate preprocessing step that removed all commits with any of the following properties: (1) commits that included more than 100 files, in the belief that such commits might represent non-relevant changes to these files (such as changing the copyright) (Hindle et al., 2008); (2) commits for which manual inspection did not return any favorite mentor(s); and (3) all previous file commits related to the task-at-hand that had been performed by only a single developer.

Discussion: For our analysis, we manually selected the first best mentor to compare against our result set. The results from our study (Table 3) show that in 21 out of 24 cases, our approach returned the manually determined favorite mentor within the top three results of our ranked list. We also observed that in 8 out of these 21 cases our favorite mentor was ranked first; in 10 out of 21 cases, they ranked second. The study also showed a few cases (e.g., revision numbers 10489 and 10257) where our recommendation approach did not perform as well and the favorite mentor was not returned within the top five.

Further analysis of these scenarios showed that in both cases, the revision involved a large number of different files and therefore developers. Precision and recall (Eq. (3)) are two of the most well-established measures for evaluating unranked result sets. In information retrieval, these measures consider the following data: (1)
the total number of relevant items in the result set (r); (2) the total number of relevant items in the collection (a); and (3) the total number of items in the result set (s).

Table 5
Results from the mentor assignment evaluation using the two most relevant results (R#: revision number; FP: first position; SP: second position).

R#     FP  SP
10948   1   2
10944   2   3
10941   2   3
10940   1   2
10939   3   9
10935   1   2
10896   1   2
10895   1   2
10894   2   3
10865   3   4
10858   1   2
10697   2   4
10636   2   3
10489   6   9
10472   2   3
10437   1   2
10317   3   4
10288   2   3
10282   1   2
10257  12  14
10248   5   6
10170   2   3
10122   2   3
10120   2   3
Precision = r / s        Recall = r / a    (3)
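In code, Eq. (3) is the standard unranked-retrieval computation; the following small self-contained sketch uses the variable names defined above (the example values are illustrative, not figures from our study):

```python
def precision(r, s):
    """Fraction of the returned result set that is relevant: r / s."""
    return r / s

def recall(r, a):
    """Fraction of all relevant items that were returned: r / a."""
    return r / a

# Illustrative example: a result set of s = 2 recommended mentors contains
# r = 1 relevant mentor, out of a = 2 relevant mentors overall.
p = precision(1, 2)  # 0.5
q = recall(1, 2)     # 0.5
```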
Evaluation: As with any other ranked result set, our mentor assignment should return the most experienced contributors at the top of the recommendation result set. In order to reduce any bias towards a single favorite mentor, we consider in our evaluation the two most suitable mentors. We observed an overall precision for our recommendations of 0.54, meaning that at least one of the two suitable mentors (s = 2) was ranked first in the result set (Table 5). Further analysis shows that in 8 out of 24 cases our approach correctly placed both of the two best mentors in the top two positions; in 9 other cases, they were within the top three.

Discussion: As the analysis of our mentor assignment showed, our context-aware approach can often successfully identify mentors relevant to a given task context. In order to evaluate the impact of sensory data on our results, we analyzed the impact of different sensory data on the quality of the recommendations. For this analysis, we incrementally removed parts of the context information used by our ranking approach to study its impact on the result set. More specifically, we analyzed three different scenarios: (1) Full context, with all sensory data used for the recommendation; (2) Semi context, which omits the file importance factor; and (3) Minimum context, which only considers the list of files and omits both file importance and commit importance.

Fig. 9. Mentor assignment based on context levels.

The results of our experiments are shown in Fig. 9. Using the Full context provided an overall improvement of the recommendation quality by 46% when compared to the Minimum context; in another 25% of cases, the Full context ranked equally well. Only in 29% of cases did it rank worse than the Minimum context. The experiment also showed that the Full context provided only a marginal improvement in the quality of the result set compared to the Semi context. In 8% of the cases, the Full context outperformed the Semi context, whereas in 88% it returned the same results. In 4% of cases, the result ranking was even worse. We performed a more detailed manual analysis of the cases where the Full context approach did not perform well. We observed that the ranking in our Full context approach was affected by two major factors: (1) commits that tended to involve a small number of files (e.g., revision numbers 10944, 10941, 10939), or (2) revisions that contained a large number of contributors (e.g., revision number 10489). We believe that a more refined pre-processing (filtering) step and the integration of additional context data (criteria) as part of our full-context ranking will significantly improve the result ranking. We plan to investigate both aspects as part of our future work.
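The three context levels can be sketched as variants of the Eq. (2) score in which an omitted factor degenerates to a constant weight of 1.0. This is our simplification for illustration, not the exact implementation; the score function is restated here so the sketch is self-contained.

```python
def score(task_files, commits, f_imp, c_imp):
    # Eq. (2): sum over task files of file importance times summed commit importance.
    return sum(f_imp(f) * sum(c_imp(c) for c in commits.get(f, []))
               for f in task_files)

def full_context(task_files, commits, f_imp, c_imp):
    # (1) Full context: all sensory data contributes to the ranking.
    return score(task_files, commits, f_imp, c_imp)

def semi_context(task_files, commits, c_imp):
    # (2) Semi context: the file importance factor is omitted.
    return score(task_files, commits, lambda f: 1.0, c_imp)

def minimum_context(task_files, commits):
    # (3) Minimum context: only the list of files is considered, so the
    # score reduces to counting a mentor's commits on those files.
    return score(task_files, commits, lambda f: 1.0, lambda c: 1.0)
```

Structuring the levels this way makes the ablation above mechanical: each scenario is the same ranking function with one weighting factor neutralized.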
6.2. Context-enabled code search

Similar to Web search, code search engines (e.g., SeCodeSearch (Keivanloo et al., 2010)) can also benefit from integrating information (e.g., source code fragments, modules) relevant to the current task-at-hand context. Code search engines that are context-aware can use context information in addition to the source code to improve their search results. Such improvements may include: supporting query auto-completion, search space restrictions, and integrating context information into result ranking. A context-aware code search engine will not only benefit from improved search performance due to the reduced search space, but also from improved quality and user satisfaction by returning results more relevant to the current user context.

In what follows, we illustrate how sensory data collected from users' system interactions, including dynamic resources (e.g., IDEs) and static resources (e.g., search histories, usage logs), can enhance existing source code search engines. The necessary sensory data is modeled through the IDE class, which is a subclass of the Sensor class. Similar to IDEs, software repositories can also contribute information towards establishing a search context. For example, issue trackers can provide mineable historical data (e.g., issues worked on) as well as current context information (e.g., issue(s) currently being worked on by a developer). A developer's issue tracker history is captured using an instance of class IssueTracker and its association of type AbstractSensedAssociation to SoftwareArtifact (i.e., Issue). Issue-related activities are captured by the properties and associations of other SoftwareArtifact instances (e.g., IssueActivity) (Fig. 10). The inclusion of context information within source code search engines will improve the quality of the result sets by returning results that are more relevant to a given user context.

The SPARQL query shown in Fig. 11 exemplifies how our context model can integrate dynamic context information derived from different sensory data. The query takes advantage of the following sensory data: (1) the programming language used by files currently open in the IDE, (2) import statements (i.e., class-level dependency declaration statements) of files open in the IDE, and (3) keywords associated with the in-progress issue assigned to a developer.

7. Threats to validity

7.1. Internal threats to validity
One of the threats to validity in our mentor recommendation methodology could be the selection of candidates only from the set of active contributors. While one could also consider a non-developer as a mentor, Google-web-toolkit, like most other open source projects, relies mostly on its active developer community (typically characterized by the number of commits and releases a product has) to capture product-related expertise. Another threat to validity for our approach comes from any potential noise in our sensory and repository data. In order to address these challenges, we apply several techniques to eliminate noisy data, such as the removal of duplicate issues and duplicate files open in the editor. The manual selection of a favorite mentor can be considered another threat to the validity of our method. We mitigate this potential threat by applying several pre-processing steps. First, we eliminated commits from the dataset that did not include
Fig. 10. Partial view of the application level context model.
an explicit reference to a fixed issue or issue number as part of their commit message. Also eliminated from consideration were files and tasks with a sole developer. It should be noted that our results are based on a controlled study that takes as a natural presupposition that all files committed as part of an issue or task-at-hand are opened or edited within the IDE. However, in a real-world setting, not all of these files might be open at the same time. We address this potential threat by including other criteria in our ranking approach, such as committed files (and their time stamps) and keyword matching between commits and issues.

7.2. External threats to validity

The size of the project and the number of cases might be considered threats to our evaluation. However, Google-web-toolkit is a medium-sized, mature project that has gained support not only from wider industry, but also from a large and vibrant developer community. Another potential threat to validity is the assumption that the issue number in a commit message is correct and, further, that the files that were committed are exactly those required to fix the issue. To counter this, we manually verified the issue numbers in commit messages and the files committed to confirm their accuracy. Another potential threat to validity is the need to determine the relevancy of files committed for a given task-at-hand context. This challenge was addressed by introducing our candidates' score measure, which considers various sensory data resources to reduce the dependency on a single ranking source.

While Semantic Web reasoners have advanced significantly in recent years, the scalability of in-memory inference remains a major potential problem. In order to mitigate this performance challenge, it is possible to take advantage of a native triplestore (e.g., AllegroGraph), which represents RDF data and metadata as triples. While a triplestore can be queried and reasoned upon with scalable performance, its built-in reasoning capabilities are limited. Furthermore, given that RDF is a standard model for data interchange on the web, we can now seamlessly integrate context models with existing software artifacts (datasets) already published in RDF (e.g., SECOLD (Keivanloo et al., 2012)), and reason across these ontologies.

Fig. 11. Integration of context information from different resources.

8. Discussion and conclusions
Ontologies and the Semantic Web have been widely used in meta-modeling to conceptualize a domain of discourse and have emerged as an essential part of the next version of the Web, Web 3.0. This paper introduces an ontology-based meta-modeling approach that takes advantage of Semantic Web technologies to model context at different abstraction levels. In our work, we use ontologies not only for conceptualization purposes, but also to create a reusable context model that is capable of supporting the core requirements of any context model. A key objective of our design approach is to be able to instantiate an application-specific context model that can capture the different context categories found across the software domain. Applying reasoning and rule-based inference can also improve context-aware information retrieval, which has been lacking in traditional software applications. In addition, our approach takes advantage of real-time context information captured from software sensors and historical repository data. Furthermore, context in the software domain is not limited to user interaction but also includes interaction among systems and tools that can take advantage of context-specific results and analysis.

As part of our future work, we plan to extend the application of our context modeling approach to other areas in the software
domain, such as secure programming or context-aware documentation. Our current evaluation has mainly focused on a quantitative analysis of the results from the case studies, limiting our ability to generalize the applicability and validity of the approach. As part of our future work, we therefore plan to perform qualitative analysis in the form of user studies, which will allow for both an evaluation of our approach's applicability as well as an analysis of the result sets from an expert user perspective.

References

Abdallah, S.A., Ferris, B., 2010. The ordered list ontology. [Online]. Available: http://smiy.sourceforge.net/olo/spec/orderedlistontology.html.
AllegroGraph RDFStore Web 3.0's database. [Online]. Available: http://franz.com/agraph/allegrograph/.
Ashton, K., 2009. That 'Internet of Things' thing. RFID J. [Online]. Available: http://www.rfidjournal.com/articles/view?4986.
Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F., 2003. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press.
Begel, A., Khoo, Y.P., Zimmermann, T., 2010. Codebook: discovering and exploiting relationships in software repositories. In: 2010 ACM/IEEE 32nd Int. Conf. Softw. Eng., vol. 1, pp. 125–134.
Berners-Lee, T., Semantic Web Architecture. [Online]. Available: http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html.
Bettini, C., Brdiczka, O., Henricksen, K., Indulska, J., Nicklas, D., Ranganathan, A., Riboni, D., 2010. A survey of context modelling and reasoning techniques. Pervasive Mob. Comput. 6 (2), 161–180.
Brickley, D., Guha, R.V., 2004. RDF vocabulary description language 1.0: RDF schema. W3C. [Online]. Available: http://www.w3.org/TR/rdf-schema/.
Čubranić, D., Murphy, G.C., Singer, J., Booth, K.S., 2005. Hipikat: a project memory for software development. IEEE Trans. Softw. Eng. 31 (6), 446–465.
Canfora, G., Di Penta, M., Oliveto, R., Panichella, S., 2012.
Who is going to mentor newcomers in open source projects? In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pp. 44:1–44:11.
Canfora, G., Di Penta, M., Giannantonio, S., Oliveto, R., Panichella, S., 2013. YODA: young and newcomer developer assistant. In: Proceedings - International Conference on Software Engineering, pp. 1331–1334.
Chen, H., Finin, T.W., Joshi, A., 2003. Using OWL in a pervasive computing broker. In: Proceedings of the Workshop on Ontologies in Agent Systems (OAS 2003), pp. 9–16.
Chinese thought and philosophy legalism. [Online]. Available: http://www.chinaknowledge.de/Literature/Diverse/legalism.html.
Costa, P.D., Pires, L.F., Van Sinderen, M., Filho, J.P., 2004. Towards a services platform for context-aware applications. In: IWUC, pp. 48–61.
de Farias, C.R.G., Leite, M.M., Calvi, C.Z., Pessoa, R.M., Filho, J.G.P., 2007. A MOF metamodel for the development of context-aware mobile applications. In: Proc. 2007 ACM Symp. Appl. Comput. - SAC '07, pp. 947–952.
Erfani, M., Rilling, J., Keivanloo, I., 2014. Towards an ontology-based context-aware meta-model for the software domain. In: 2014 IEEE 38th International Computer Software and Applications Conference Workshops (COMPSACW), pp. 696–701.
Fuchs, F., Hochstatter, I., Krause, M., Berger, M., 2005. A metamodel approach to context information. In: Third IEEE Int. Conf. Pervasive Comput. Commun. Work., pp. 8–14.
Gu, T., Wang, X., 2004. An ontology-based context model in intelligent environments. In: Proceedings of Communication Networks and Distributed Systems Modeling and Simulation Conference.
Hassan, A.E., 2008. The road ahead for mining software repositories. In: Frontiers of Software Maintenance, pp. 48–57.
Haveliwala, T.H., 2003. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng. 15 (4), 784–796.
Hindle, A., German, D.M., Holt, R., 2008.
What do large commits tell us?: a taxonomical study of large commits. In: MSR 2008, pp. 99–108.
Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M., 2004. SWRL: a semantic web rule language combining OWL and RuleML. W3C Member Submission 21.
Keivanloo, I., Roostapour, L., Schugerl, P., Rilling, J., 2010. SE-CodeSearch: a scalable semantic web-based source code search infrastructure. In: IEEE International Conference on Software Maintenance, ICSM.
Keivanloo, I., Forbes, C., Hmood, A., Erfani, M., Neal, C., George, P., Rilling, J., 2012. A linked data platform for mining software repositories. In: 2012 9th IEEE Working Conference on Mining Software Repositories, MSR 2012, pp. 32–35.
Kersten, M., Murphy, G.C., 2006. Using task context to improve programmer productivity. In: Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering - SIGSOFT '06/FSE-14, pp. 1–11.
Kifer, M., 2008. Rule interchange format: the framework. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS, vol. 5341, pp. 1–11.
Klyne, G., Carroll, J.J., 2004. Resource description framework (RDF): concepts and abstract syntax. W3C Recomm. 10, 1–20.
Korpipaa, P., Mantyjarvi, J., Kela, J., Keranen, H., Malm, E.J., 2003. Managing context information in mobile devices. IEEE Pervasive Comput. 1 (3), 42–51.
Lassila, O., Hendler, J., 2007. Embracing "Web 3.0". IEEE Internet Comput. 11 (3), 90–93.
Malheiros, Y., Moraes, A., Trindade, C., Meira, S., 2012. A source code recommender system to support newcomers. In: IEEE 36th Annu. Comput. Softw. Appl. Conf., pp. 19–24.
McGuinness, D.L., Van Harmelen, F., 2004. OWL web ontology language overview. W3C Recomm. 10 (February), 1–22.
Minto, S., Murphy, G.C., 2007. Recommending emergent teams. In: Proceedings - ICSE 2007 Workshops: Fourth International Workshop on Mining Software Repositories, MSR 2007.
Mockus, A., Herbsleb, J.D., 2002. Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th International Conference on Software Engineering, ICSE 2002.
Mockus, A., 2010. Organizational volatility and its effects on software defects. In: Proceedings of the 2010 Foundations of Software Engineering Conference, pp. 117–126.
Moraes, A., Silva, E., da Trindade, C., Barbosa, Y., Meira, S., 2010. Recommending experts using communication history. In: 2nd International Workshop on Recommendation Systems for Software Engineering - RSSE '10, pp. 41–45.
Nguyen, A.T., Nguyen, T.T., Nguyen, H.A., Tamrawi, A., Nguyen, H.V., Al-Kofahi, J., Nguyen, T.N., 2012. Graph-based pattern-oriented, context-sensitive source code completion. In: 34th International Conference on Software Engineering (ICSE), pp. 69–79.
Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D., 2014. Context aware computing for the internet of things: a survey. IEEE Commun. Surv. Tutorials 16 (1), 414–454.
Perttunen, M., Riekki, J., Lassila, O., 2009. Context representation and reasoning in pervasive computing: a review. Int. J. Multimed.
Ubiquitous Eng., 4, pp. 1– 28. Prud’hommeaux, E., Seaborne, A., 2008. SPARQL query language for RDF. W3C Recomm 2009 (January), 1–106. IBM Rational Jazz technology platform. [Online]. Available: http://www-01.ibm.com/ software/rational/jazz/. Rilling, P.C.J., Witte, R., Schuegerl, P., 2008. Beyond information silos – an omnipresent approach to software evolution. Int. J. Semant. Comput. (IJSC), Spec. Issue Ambient Semant. Comput. 2 (4), 431–468. Sahavechaphan, N., Claypool, K., 2006. XSnippet: mining for sample code. ACM SIGPLAN Not 41 (10), 413–430. Schilit, B., Adams, N., Want, R., 1994. Context-aware computing applications. In: Workshop on Mobile Computing Systems and Applications, pp. 85–90. Schmidt, A., Beigl, M., Gellersen, H.-W., 1999. There is more to context than location. Comput. Graphics 23 (6), 893–901. SECOLD ontology. [Online]. Available: http://www.secold.org/ontology. Shami, N.S., Ehrlich, K., Millen, D.R., 2008. Pick me!: link selection in expertise search results. In: Proceedings of ACM CHI 2008 Conference on Human Factors in Computing Systems, vol. 1, pp. 1089–1092. Steinmacher, I., Wiese, I.S., Gerosa, M.A., Paulo, S., 2012. Recommending mentors to software project newcomers. In: 3rd International Workshop on Recommendation Systems for Software Engineering, pp. 63–67. Strang, T., Linnhoff-Popien, C., 2003. Service interoperability in ubiquitous computing environments. In: International Conference on Advances in Infrastructure for Electronic Business, Education, Science, Medicine, and Mobile Technologies on the Internet. Strang, T., Linnhoff-Popien, C., 2004. A context modeling survey. In: Workshop on Advanced Context Modelling, Reasoning and Management, UbiComp 2004 - The Sixth International Conference on Ubiquitous Computing. Strang, T., Linnhoff-Popien, C., Frank, K., 2003. CoOL: a context ontology language to enable contextual interoperability. IFIP Int. Fed. Inf. Process 2893, 236– 247. Thummalapenta, S., Xie, T., 2007. 
PARSEWeb : a programmer assistant for reusing open source code on the web. In: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering (ASE ’07), pp. 204–213. 2013 Turnover rates by industry. [Online]. Available: http://www.compensationforce. com/2014/02/2013- turnover- rates- by- industry.html. Würsch, M., Ghezzi, G., Hert, M., Reif, G., Gall, H.C., July 2012. SEON: a pyramid of ontologies for software evolution and its applications. Computing 94 (11), 857–885. Wang, X.H., Zhang, D.Q., Gu, T., Pung, H.K., 2004. Ontology based context modeling and reasoning using OWL. In: IEEE Annu. Conf. Pervasive Comput. Commun. Work. 2004. Proc. Second. Ye, Y., Nakakoji, K., Yamamoto, Y., 2007. Reducing the cost of communication and coordination in distributed software development. In: Proceedings of the 1st International Conference on Software Engineering Approaches for Offshore and Outsourced Development, pp. 152–169. Zhang, Y., Witte, R., Rilling, J., Haarslev, V., 2008. Ontological approach for the semantic recovery of traceability links between software artefacts. IET Software 2 (3), 185.
Mostafa Erfani is currently a Ph.D. candidate in the Department of Computer Science and Software Engineering at Concordia University, Montreal, Canada. Prior to joining the Ph.D. program at Concordia, he received a Master's degree in Computer Science from the University of Technology Malaysia (UTM), Skudai, Malaysia, and a Bachelor's degree in Computer Science from Shahid Bahonar University of Kerman, Iran.
Mohammadnaser Zandi is currently a Master's student in the Department of Computer Science and Software Engineering at Concordia University, Montreal, Canada. Prior to joining Concordia, he received a B.Sc. in Computer Science from Shahid Bahonar University of Kerman, Iran.
Juergen Rilling is a professor in the Department of Computer Science and Software Engineering at Concordia University, Montreal, Canada. He obtained a Diploma degree in Computer Science from the University of Reutlingen, Germany, in 1991 and an M.Sc. in Computer Science from the University of East Anglia, UK, in 1993. He received his Ph.D. from the Illinois Institute of Technology, Chicago, USA, in 1998. The general theme of his research over the last 19 years has been providing software maintainers with techniques, tools, and methodologies to support the evolution of software systems. His current research focus is on supporting the modeling and analysis of global software ecosystems. He has published over 100 papers in major refereed international journals, conferences, and workshops. He also serves on the program committees of numerous international conferences and workshops in the area of software maintenance and program comprehension, and as a reviewer for all major journals in his research area.

Iman Keivanloo is currently working as a Software Engineer at Veeva Systems. Prior to this position, he was a post-doctoral fellow with Dr. Ying Zou at the Software Reengineering Lab in the Department of Electrical and Computer Engineering at Queen's University. Dr. Keivanloo received his Ph.D. (2013) in Computer Science from Concordia University in Montreal, under the supervision of Dr. Juergen Rilling. His research focuses mainly on recommendation systems for software engineering using big data. More specifically, his research interests are (1) source code similarity search, (2) clone detection, (3) source code recommendation/completion, and (4) their applications for software evolution and maintenance. His research interests have also included applications of Linked Data and the Semantic Web in software engineering, development, and maintenance.