myjournal manuscript No. (will be inserted by the editor)

MOOGLE: A Metamodel-Based Model Search Engine

Daniel Lucrédio¹, Renata P. de M. Fortes¹, Jon Whittle²*

¹ Institute of Mathematical and Computer Science - USP. Av. Trabalhador São-Carlense, 400 - Centro. Cx Postal 668 - CEP 13560-970 - São Carlos - SP, Brazil

² Computing Department - InfoLab21, Lancaster University, Lancaster LA1 4WA, UK

Received: date / Revised version: date

Abstract

Models are becoming increasingly important in the software development process. As a consequence, the number of models being used is increasing, and so is the need for efficient mechanisms to search them. Various existing search engines could be used for this purpose, but they lack features to properly search models, mainly because they are strongly focused on text-based search. This paper presents Moogle, a model search engine that uses metamodeling information to create richer search indexes and to allow more complex queries to be performed. The paper also presents the results of an evaluation of Moogle, which showed that the metamodel information improves the accuracy of the search.

* This work was developed at the ISE department of George Mason University, USA, with the support from Microsoft Research and the Brazilian institutions CAPES (process 0657/07-7) and CNPq (process 141975/2008-3).

1 Introduction

Frakes & Fox [15] have shown that developers' lack of knowledge about reusable software assets, together with the costs of finding such assets, are among the main reasons why developers do not try to reuse software. For this reason, finding reusable assets is a fundamental aspect of an effective reuse program [16,25,44,32]. Through efficient searching, developers can find and understand an asset prior to reusing it [16]. In large projects, this is even more important, since the difficulties in finding assets are greater, and the search engine may be the determinant factor for reuse success [48]. As a result, a number of reuse-oriented search engines have been studied and proposed, mostly for code search and retrieval [30].

With model-driven development (MDD) gaining importance, an increasingly large number of models is being produced and used by software organizations. In this sense, the reuse problem is extended to models in addition to code and other artifacts. Some development methodologies, such as software factories [21], strongly rely on the reuse of models in order to take advantage of previously developed assets. In such an environment, model repositories play an important role (see, for example, [33]), and the ability to automatically search for models inside large repositories may be as critical as searching for code.


Model search also has an important educational benefit for students and teachers, allowing learners to find good (and bad) examples that can then be used to improve the acquisition of knowledge and provide hints at solutions [27].

Models are normally represented as text, using for example the XMI format (XML Metadata Interchange) [34], and thus it is possible to use general-purpose search engines, such as Google, to find models. For example, in Google, the query "filetype:xmi customer" will retrieve all XMI files containing the word "customer". However, these engines are not designed for model searching. There are three principal reasons why general-purpose search engines are not effective at model search.

Firstly, they do not take metamodel information into account. It is not possible, for example, to search for a class named Customer or a sequence diagram whose participants are objects of the class Registration.

Secondly, general-purpose search engines do not display model-based results well. The results are only shown textually, i.e., an XMI file will be presented. Even for users familiar with the XMI format, large XMI files are difficult to comprehend. Thus, without actually downloading the returned models and opening them in a modeling tool, it is largely impossible to read the content. In the context of searching, where several models have to be inspected, downloading all of them is not practical.

Thirdly, and this is more of a practical limitation, in some of these engines the only way to specify what kind of model is wanted is through file extension. This makes it impossible, for example, to distinguish between different types of models that share the same file extension, such as XMI, which is a common format, making the search for specific types of models a difficult task.

Moogle [29] addresses these drawbacks by using metamodel information during the entire searching process. The metamodel is used to create more efficient indexes and to support more complex queries, which may refer to metamodel elements. Moogle also formats the search results in a more readable way by removing irrelevant tags and characters from the textual model file. Finally, Moogle can search for models in any language, not only UML, as long as a well-defined metamodel has been provided to Moogle. This feature makes it suitable for domain-specific modeling [10].

This paper describes the interface and architecture of Moogle and reports on an experimental evaluation comparing Moogle's advanced search (which uses metamodeling information) with simple text-based search. The results show that advanced search is more accurate, making it more suitable for cases where the user has greater knowledge about the model s/he is looking for. However, it tends to retrieve fewer relevant models, and hence it is less suitable when the user wants to retrieve a broader (albeit less accurate) sample.

The remainder of this paper is structured as follows. Section 2 describes the context of Moogle by detailing state-of-the-art research and practice in model search. This section includes a discussion of the key features a model search engine should possess. Section 3 describes how Moogle is used to search for models. Section 4 details the architecture of the search engine.


Section 5 then reports on an empirical evaluation of Moogle designed to compare advanced versus simple text-based search. Final remarks are given in Section 6.

2 State-of-the-art in model search

This section presents a state-of-the-art review of academic and commercial search engines which can be used to retrieve models. The aim is to place Moogle in the context of existing work as well as to compare approaches in the literature. As a means of comparing existing work, we first develop a list of key features, based on our experience and the reviewed literature, reflecting our perception of what an ideal model search engine should have (Section 2.1). We revisit this feature list in Section 5 to compare Moogle with existing approaches.

2.1 Key features of a model search engine

Some of the requirements on a model search engine are the same as those for searching for software components. In addition, however, models possess particular characteristics (e.g., the availability of a metamodel) that lead to specific desired features for model search engines. We elaborate on an ideal list of key features below and evaluate work in the literature against these features. Of these, inexact matching, ranking, automatic indexing and browsing are generic characteristics, whereas the others are specific to modeling.


Metamodel-based searching: Pure text-based search allows a user to search through any kind of artifact that can be described in terms of words, including documents, source code, and different kinds of models. Most search engines can easily determine if a particular query matches a model by examining the occurrence of the query terms inside it, possibly discarding irrelevant terms like markup tags and stop words. In general, these search engines disassociate the terms from the underlying structure of the document, mainly because they are meant to be used on unstructured documents. Such a characteristic is a limitation in model searching.

For example, consider a learning scenario: inexperienced developers wanting to learn a specific kind of modeling technique may want to search for other models of the same kind to learn from them. Or consider a design activity: the developer may want to search for similar models for the same problem he is trying to handle, and see how other developers managed to solve this problem, what design decisions they made, what design patterns they used, and so on.

In these scenarios, it is essential that the user is capable of specifying the types of models to be retrieved; otherwise, he could obtain a large number of irrelevant results. For example, a user searching for "purchase order use case model" should not get class or sequence diagrams as part of the search result. The same is true at finer-grained levels, using model elements as part of the query. For example, a user searching for "utf-8 encoding operation" should not get a UML actor as a response. Such information (model and model elements) is contained in the metamodel.

Thus, in order to be effective for model searching, a search engine should be able to process queries that refer to information from the metamodel. It should be possible to search for model elements of a particular type, e.g., UML class diagrams with a class containing the word "Customer", or messages in a sequence diagram containing the word "send". Any metamodel element should be allowed, thus providing the capability for complex queries: for example, to search for any model element that directly or indirectly inherits from a superclass.
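The benefit of keeping metamodel types in the index can be illustrated with a minimal sketch. The class, the index layout and all names below are invented for illustration; they are not Moogle's actual implementation. The idea is simply that each indexed term remembers the metamodel type of the element it came from, so a query can be qualified by type:

```python
# Hypothetical sketch: an inverted index that records the metamodel type of
# the element each term came from, so queries can be qualified by type.
from collections import defaultdict

class TypedIndex:
    def __init__(self):
        # (metamodel_type, term) -> set of model ids
        self._index = defaultdict(set)

    def add(self, model_id, metamodel_type, text):
        for term in text.lower().split():
            self._index[(metamodel_type, term)].add(model_id)

    def search(self, term, metamodel_type=None):
        term = term.lower()
        if metamodel_type is not None:
            return self._index[(metamodel_type, term)]
        # untyped query: union over all element types
        hits = set()
        for (_mtype, word), ids in self._index.items():
            if word == term:
                hits |= ids
        return hits

idx = TypedIndex()
idx.add("m1", "Class", "Customer")
idx.add("m2", "Actor", "Customer")

# A plain text query cannot tell the two models apart...
print(sorted(idx.search("customer")))           # ['m1', 'm2']
# ...but a type-qualified query retrieves only the class named Customer:
print(sorted(idx.search("customer", "Class")))  # ['m1']
```

A pure text-based engine corresponds to the untyped branch; the type-qualified branch is what metamodel-based searching adds.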

Real-world concepts: String pattern matching is a widely used technique for searching through text. In this technique, the query normally consists of a regular expression that specifies possible matches. However, specifying a query using regular expressions has limitations. First, additional effort is needed to construct such expressions in order to provide satisfactory support for similarity [37]. Also, there are weaknesses in relation to synonyms and homonyms [31]. Finally, the user must have prior knowledge of the vocabulary used inside the artifacts. Unknown terms may never be discovered unless the user actually inspects the artifact.

Queries formulated using natural language can help to overcome these limitations. They are simple to express and, given the popularity of Internet search engines, they do not introduce an additional learning curve. Also, natural language processing deals with homonyms and synonyms, reducing the need for prior knowledge of the vocabulary used inside the artifacts.

Thus, a model search engine should support queries formulated using natural language. During search, the engine should consider all information inside model elements, such as their names, descriptions and comments. However, irrelevant character sets, such as XML tag names, should be ignored.
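The distinction between human-authored content and serialization noise can be sketched as follows. The XMI-like fragment and the choice of attributes to keep are invented for the example; a real indexer would follow the metamodel instead of a hard-coded attribute list:

```python
# Illustrative sketch of extracting indexable words from a model's XML
# serialization: names and comment bodies are kept, while the XML tag
# vocabulary itself is discarded. The XMI fragment is invented.
import xml.etree.ElementTree as ET

XMI = """
<model>
  <ownedElement type="Class" name="Customer">
    <comment body="Represents a paying customer"/>
  </ownedElement>
</model>
"""

def indexable_terms(xml_text):
    terms = []
    for elem in ET.fromstring(xml_text).iter():
        # keep human-authored content (names, comments)...
        for attr in ("name", "body"):
            if attr in elem.attrib:
                terms.extend(elem.attrib[attr].lower().split())
        # ...but ignore the tag names themselves (<ownedElement>, etc.)
    return terms

print(indexable_terms(XMI))
# ['customer', 'represents', 'a', 'paying', 'customer']
```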

Support for the entire life-cycle: Model search may be useful at different stages of the development life-cycle. In earlier stages, models can help in requirements identification and analysis. At this stage, the user will not know many details about which models are relevant. The search engine should assist these users by providing an easy method to browse a wide variety of possibly unrelated models referring to a particular domain.

In later stages of the development life-cycle, such as design and implementation, a developer will have specific knowledge about which models are relevant and will thus need to restrict search results to a small number of highly relevant models. In this way, the developer will learn how other developers managed to solve a similar problem and which design decisions were made.

Different kinds of models: Although UML is currently the most widely-used modeling language, the current trend is towards domain-specific modeling languages. Nowadays, developers may invent their own modeling languages in-house using technologies such as GME [43] or the Eclipse Modeling Project [14]. A model search engine therefore must support user-defined modeling languages.

Inexact matching: Often, developers have only a vague idea of what they are looking for. In these cases, exact matching (high precision) may not be the best choice, and inexact matching (high recall) is desirable, so that more options can be delivered to the developer. Thus, a model search engine should not only retrieve exact matches, but also closely related elements, by using different techniques, such as regular expressions, wildcard, fuzzy and proximity searches. For example, the query "Pay*" could be used to retrieve models containing the terms "Pay", "Payment", "Paying", etc.
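Two of the inexact-matching techniques mentioned above can be sketched in a few lines. The term list and the fuzzy-match threshold are invented for the example; edit similarity stands in for whatever measure a real engine would use:

```python
# Toy illustration of wildcard matching ("Pay*") and fuzzy matching
# by edit similarity.
import fnmatch
from difflib import SequenceMatcher

terms = ["Pay", "Payment", "Paying", "Customer", "Payee"]

# Wildcard search: "Pay*" matches every term starting with "Pay".
print(fnmatch.filter(terms, "Pay*"))  # ['Pay', 'Payment', 'Paying', 'Payee']

# Fuzzy search: keep terms whose similarity to the query exceeds a threshold,
# so a misspelled query like "Paymnet" still finds "Payment".
def fuzzy(query, terms, threshold=0.8):
    return [t for t in terms
            if SequenceMatcher(None, query.lower(), t.lower()).ratio() >= threshold]

print(fuzzy("Paymnet", terms))  # ['Payment']
```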

Ranking: The results of the search should be ranked according to their relevance to the specified query. This is important to provide a good balance between exact and inexact matching. Developers with a good idea about what they want, but who do not want to spend lots of time browsing the results, will find the most exact matches appearing first in the result list. Developers trying to understand what they want normally browse the results list sequentially. By ranking the list properly, the time spent in this browsing is reduced, since the developer spends less time inspecting irrelevant results.

Automatic indexing: Models should be automatically indexed, so that no manual effort is needed in order to allow them to be searched. For searching models, there are choices in deciding how to index searchable models. The most obvious choice is to index according to the metamodel. This can generally be done automatically. An alternative choice might be to index model elements according to a semantic interpretation. However, this generally has to be done manually, which requires a large degree of effort. In addition, manual indexing can lead to subjective interpretations. Although this is desirable in some domains, it may lead to misunderstandings when multiple individuals are involved in the development.

Proper preview of the results: As discussed earlier, textual representations of models cannot be easily understood by most users. Without a proper preview, the user has no choice but to manually inspect each result in a modeling tool in order to determine whether it is relevant or not. This is unrealistic and impractical, especially considering large repositories.

Thus, a model search engine should provide comprehensible formatting of results, either by displaying models graphically or by removing unreadable or irrelevant characters in the textual representation. The objective is to avoid the need for manual inspection of each result. Ideally, results should be displayed with contextual information. For example, the relevance of model elements returned by the search will be difficult to assess unless related elements, such as the element's parents and siblings, are also shown.

Browsing: Search engines cannot be trusted to provide high recall for every query. Sometimes a relevant result is simply left behind, no matter how good the search engine or the query is. While in some scenarios this is more or less acceptable, in model-driven development this may cause a valuable reuse opportunity to be lost. For such cases, there should be a mechanism to help the developer make sure no relevant model is left behind, even if it requires a more time-consuming manual inspection of some parts of the repository.

Thus, in addition to automatic searching, a model search engine should support model browsing. Although not a direct feature of a search engine, browsing is an important mechanism, which is effectively used in combination with searching to gain knowledge about the existing artifacts [37].

In terms of models, automated browsing support gives the user the ability to view and navigate all models according to some filtering criteria, such as facets [38]. One particularly important browsing criterion is to allow users to browse for models that are located in the same folder or project as a particular model. This is important because models located in the same place are possibly related, even if the search engine fails to notice it.

Large repositories: We envisage a scenario of model reuse on a large scale, where thousands of models are shared among different developers. This raises several issues that must be addressed properly by the model search engine, including good performance, precision and recall. This means that the search engine should be able to search through hundreds of thousands of models and model elements. In such a scenario, relying on a complete inspection of the repository during query execution is impractical.

We now present a review of the state-of-the-art in model search and compare existing research according to the list of features defined above. We limit our discussion to those approaches most similar to Moogle.


2.2 An abductive, linguistic approach to model retrieval

Gulla et al. [23] present a linguistic approach for retrieving models. Queries are written using a domain-specific language relevant to a particular problem domain. These queries are then translated automatically into model-specific queries referring to metamodel elements such as table, association and class. In this way, the approach does not require users to have an understanding of the modeling language's metamodel.

The translation process works by attaching semantic interpretations to queries using a lexicon proposed in [24]. Figure 1 shows examples of how the words loan, have and book are represented in this semantically-based lexicon.

Fig. 1 Lexical conceptual representations of loan, have and book. Extracted from [23].


In Figure 1a, the conceptual structure for the transitive use of loan is shown. This structure indicates several semantic properties of loan: it is a temporally situated ([+temporal]), dynamic act ([+dynamic]). It has some duration ([-punctual]) and a clear terminal point ([+completed]). The structure also includes similar descriptions of two arguments, x and y, denoting the loaner and the thing being loaned, respectively. Two semantic relations, FORCE and CONTROL, mark argument x as the source of energy and the controller in the situation, and argument y as the entity exposed to this force and control. The arrow in front of POSSESSION means that the situation terminates by a transition to a state in which argument x is the possessor of argument y. Figures 1b and 1c describe the words have and book, respectively. The arguments x and y of have are omitted for the sake of simplicity.

Semantic interpretations are also attached to primitives of the modeling language which is the object of the search. For example, in an Entity-Relationship model, entities are lexically represented as [-temporal, +concrete, +bounded], relationships are represented as [+temporal, -punctual], and attributes are [temporal].

Given these semantic interpretations, a search query is processed as follows:

1. A query is specified in terms of known predicates and arguments. For example, if a user wants to search for models related to book loaning, s/he specifies the query loan(X,book).


2. The query is translated into an equivalent query referring to metamodel elements of the modeling language. The translation from the domain-specific to the metamodel-query language is done by matching the semantic interpretations in the two languages. Hence, the predicate loan is interpreted as a relationship between entities because both are [+temporal, -punctual]. Similarly, book and X are interpreted as entities that are connected to this relationship.

3. The search engine uses the metamodel-query to search for relevant models. A model is considered relevant if (1) its structure matches the metamodel-query; and (2) the semantic interpretation of terms in the query matches the interpretation of the matched model elements. Unlike most approaches, where the actual words are the basis for establishing a match (through synonyms, hyponyms and meronyms), here only the semantic properties are used. This means that, for the query loan(X,book), instead of looking for a relationship named loan connected to an entity named book and some unnamed entity, the system retrieves not only models containing structures that describe book loaning, but also models describing car rental, for example. This happens because car and rent have the same semantic properties as book and loan. In a traditional information retrieval system, this kind of match would not be established, because the words car and book are not synonyms, hyponyms or meronyms.
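The feature-based matching in step 2 can be sketched as follows. The feature sets are simplified from the examples above, and the subsumption test is an assumption about how such a match could be implemented, not Gulla et al.'s actual algorithm:

```python
# Minimal sketch of feature-based matching: words and modeling primitives
# both carry semantic features, and a match is established when the features
# are compatible, rather than when the words are synonyms.

LEXICON = {
    "loan": {"+temporal", "-punctual"},
    "rent": {"+temporal", "-punctual"},
    "book": {"-temporal", "+concrete", "+bounded"},
    "car":  {"-temporal", "+concrete", "+bounded"},
}

# Semantic interpretation of Entity-Relationship primitives (from the text).
PRIMITIVES = {
    "relationship": {"+temporal", "-punctual"},
    "entity":       {"-temporal", "+concrete", "+bounded"},
}

def interpret(word):
    """Map a query word to the ER primitive whose features it subsumes."""
    features = LEXICON[word]
    for primitive, required in PRIMITIVES.items():
        if required <= features:  # all required features present
            return primitive
    return None

# loan(X,book) and rent(X,car) translate to the same metamodel query:
print(interpret("loan"), interpret("book"))  # relationship entity
print(interpret("rent"), interpret("car"))   # relationship entity
```

This is why car rental models match a book-loaning query: the words differ, but their semantic properties coincide.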


By comparing Gulla et al.'s approach with the criteria defined in Section 2.1, some conclusions can be drawn. The approach provides good support for model search in the initial stages of development, using real-world concepts¹ from a particular domain as the basis for the query. Inexact matching is a major feature, and is achieved using rich semantic information that makes the approach stand out from traditional information retrieval systems. However, the authors do not discuss how the approach can be used to search within large portions of text, for example, inside comments or use case descriptions.

No indexing process is described, but we can assume that automatic indexing is possible, because the user does not need to manually insert information for each model. However, a complete lexicon must be provided, containing semantic descriptions of every word that is to be used in the query.

Although it is not the main focus of the approach, the user can specify metamodel-queries directly, using the primitives from the modeling language. Thus, it supports metamodel-based searching. This also makes it suitable for later development stages, providing support for the entire life-cycle. However, the user would need to know exactly how the modeling primitives are represented in terms of the formulas that describe them semantically. This is why, according to the authors, users would almost never use this feature.

¹ In this and the following sections, bold face is used as a reference to the key features of a model search engine discussed in Section 2.1.


Different kinds of models are supported, as long as their primitives are formally described in terms of the semantically-based lexicon. The authors do not discuss in detail how to do this, nor how much effort is needed to include support for a new modeling language.

We posit that the approach is not suitable for large repositories, since the matching process would have to look at each model element individually to search for elements with similar semantic descriptions. Although this could be improved with indexing techniques, the authors do not discuss how this could be done.

There is no information about ranking, proper preview of the results and browsing, and therefore we cannot evaluate how it performs against these features.

2.3 Using WordNet for Case-Based Retrieval of UML Models

Gomes et al. [19] present an environment, called REBUILDER, which can be used for retrieving UML class models. REBUILDER uses a knowledge repository that is based on the WordNet² common sense ontology. "WordNet is a large lexical database of English. (...) Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations." The result is a network of meaningfully related words and concepts that makes a useful tool for computational linguistics and natural language processing.

² wordnet.princeton.edu

REBUILDER uses WordNet to categorize UML elements, namely: packages, classes (with attributes and methods), interfaces (with methods only) and relations (generalizations, associations, dependencies, etc.).

REBUILDER tries to categorize each model element into a context synset. Initially, only the element's name is considered: if WordNet has a single synset for that particular word, it is chosen as the element's context synset. However, an element's name may have more than one word, or it may be a word with multiple meanings. This means that there are multiple synsets that are candidates for being the context synset. In this case, more information about the model element has to be used in order to find the correct synset.

Which information to use for this disambiguation process depends on the kind of model element being evaluated. For example, if it is a class, the names of its attributes and methods are used. If it is a package, its subpackages and other packages it relates to are considered. The synset chosen is the one which is closest (according to some well-defined distance criterion) to all of the model element's synsets.

This context attribution process happens in REBUILDER both when a model is indexed and when a query is submitted.

REBUILDER indexes a model by attributing a context synset to it. A model can be a package (thus containing other packages, classes, interfaces and relations), a single class or an interface. For example, consider the four models shown in Figure 2.

Fig. 2 Examples of UML models. Adapted from [19].

The models shown in Figure 2 are from the educational domain, and each one contains a single package, P1, P2, P3 and P4, respectively. The indexing process consists of assigning these packages to a context synset. Figure 3 shows the resulting index created for the models in Figure 2. P1 and P2 are indexed under the university synset, and P4 is indexed under school, because the package names are directly mapped to single concepts. P3's name cannot be directly mapped into a single synset, and therefore the disambiguation process has to be performed. Using a simple metric that considers the sum of the distances to all synsets associated with P3's elements' names (Central, Institution, University, Teacher, Department and Student), the synset educational institution is chosen.
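The disambiguation step can be reconstructed as a toy sketch. The miniature synset graph and the breadth-first distance below are invented stand-ins; REBUILDER uses the full WordNet network and its own distance metric:

```python
# Toy reconstruction of disambiguation: among the candidate synsets for a
# package name, pick the one with the smallest summed graph distance to the
# synsets of the package's elements.
from collections import deque

EDGES = [  # invented miniature synset graph (undirected)
    ("educational institution", "institution"),
    ("educational institution", "school"),
    ("educational institution", "university"),
    ("institution", "central"),
    ("university", "department"),
    ("school", "teacher"),
    ("school", "student"),
    ("university", "student"),
]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, []).append(b)
    ADJ.setdefault(b, []).append(a)

def distance(src, dst):
    """Breadth-first shortest-path distance in the synset graph."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in ADJ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")

def disambiguate(candidates, context):
    """Choose the candidate synset closest to all context synsets."""
    return min(candidates, key=lambda c: sum(distance(c, s) for s in context))

context = ["central", "institution", "university", "teacher", "department", "student"]
print(disambiguate(["educational institution", "school"], context))
# educational institution
```

With this graph, "educational institution" wins over "school" because it sits one or two hops from every context synset, mirroring the choice made for P3.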


Fig. 3 Models indexed according to WordNet. Each node of this network is a synset from WordNet. Adapted from [19].

The same synset attribution process happens when a query is submitted. In REBUILDER, a query consists of a piece of design, containing partial UML packages, classes or interfaces, and the number of models to be retrieved. First, a synset is attributed to the query model, using the process described above. Next, starting from this synset, a spreading activation algorithm visits the nearby synset nodes, including all models found on the way into the result, until the number of models to be retrieved is reached, or there are no more nodes to be visited. Finally, the resulting models are ranked according to their similarity to the query model.

For example, consider a query where the user wants to find four models involving students and teachers. Figure 4 shows a possible design for such a query.

REBUILDER would retrieve four models as follows (using the index from Figure 3):


Fig. 4 Example query submitted to REBUILDER.

1. The educational institution synset is assigned to the query;
2. The index in Figure 3 is traversed according to the spreading activation algorithm, starting at educational institution. Model P3 is indexed under this node and so becomes a search result. The nearby nodes, school, university and institution, are marked for being visited next;
3. The algorithm then visits the next marked node, school, and hence model P4 is included in the results. Nodes primary school and secondary school are marked for visit next. But first, node university is visited, and P1 and P2 are included in the results;
4. Since the number of models to be retrieved is reached (four), the spreading activation stops.
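The steps above amount to a breadth-first traversal that collects indexed models as it spreads outward. The sketch below mirrors Figure 3; the data structures are invented for illustration:

```python
# Sketch of spreading-activation retrieval: starting from the query's synset,
# visit neighboring synsets breadth-first and collect the models indexed
# under each visited node until the requested number of results is reached.
from collections import deque

NEIGHBORS = {  # synset adjacency, mirroring Figure 3
    "educational institution": ["school", "university", "institution"],
    "school": ["educational institution", "primary school", "secondary school"],
    "university": ["educational institution"],
    "institution": ["educational institution"],
    "primary school": ["school"],
    "secondary school": ["school"],
}

INDEX = {  # models indexed under each synset
    "educational institution": ["P3"],
    "school": ["P4"],
    "university": ["P1", "P2"],
}

def spreading_activation(start, k):
    results, seen, frontier = [], {start}, deque([start])
    while frontier and len(results) < k:
        node = frontier.popleft()
        results.extend(INDEX.get(node, []))   # collect models on the way
        for nxt in NEIGHBORS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return results[:k]

print(spreading_activation("educational institution", 4))
# ['P3', 'P4', 'P1', 'P2']
```

The traversal reproduces steps 1-4: P3 at the start node, then P4 under school, then P1 and P2 under university, at which point the quota of four is reached and activation stops.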

REBUILDER supports many of the features identified in Section 2.1. Metamodel-based searching is achieved, since the queries refer to metamodel elements. However, REBUILDER is restricted to UML, and thus there is no support for different kinds of models.

The approach allows queries to be specified using real-world concepts, because WordNet is a general-purpose ontology. However, the use of WordNet may be limiting since technical concepts are not covered in WordNet, and thus poor object categorization may occur in low-level design models. Since the queries are specified using design fragments, REBUILDER is not applicable to early life-cycle stages and thus does not support the entire life-cycle adequately.

The indexing and retrieval processes are efficient. The models are automatically indexed and ranked. This makes the approach suitable for large repositories. At the same time, inexact matching is achieved using the semantic relations described in WordNet and the spreading activation algorithm. The authors do not discuss preview of the results or if the approach supports browsing.

2.4 A Framework for Architecture-driven Service Discovery

Kozlenkov et al. [28] present a framework for discovering services. Although not explicitly designed for model search, the framework uses a model-based query language consisting of UML class and sequence diagrams specifying the structural and behavioral design models for the services. The objective of the query engine is to discover whether any available services may be useful for that particular piece of design. A UML profile is used to mark which model elements must be present in any discovered service. Figure 5 shows an example of a query in this approach. It shows a sequence diagram where some messages are marked with the stereotype asd_query_message. These identify the services to be discovered. In this example of a system for a GPS device located in a car, two services are being queried: one for finding the location (longitude and latitude coordinates) of a particular address (messages 1 and 2), and one for finding points of interest (POIs) of a particular type, in a predefined area (messages 5 and 6).

Fig. 5 Example query submitted to the approach. Adapted from [28].

The query execution engine searches for service specifications that satisfy the query using a similarity analysis algorithm that relies on a function measuring the distance between two operations. The distance function has a linguistic component that calculates the semantic distance between two operation names based on WordNet. Information about input and output parameters is also considered in this function, as well as other descriptors, such as additional constraints and pre- and post-conditions. The resulting services are then ranked according to the calculated distance in relation to the query operations.
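A distance function in this spirit could be sketched as follows. Edit-distance similarity stands in for the WordNet-based linguistic measure, and the weights and operation signatures (loosely echoing the GPS example) are invented, not taken from Kozlenkov et al.:

```python
# Hedged sketch of an operation-distance function: a linguistic component
# compares operation names and a second component compares parameter types;
# the combined, weighted distance is used to rank candidate services.
from difflib import SequenceMatcher

def name_distance(a, b):
    # stand-in for the WordNet-based semantic distance between names
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def param_distance(params_a, params_b):
    """Jaccard distance between the two sets of parameter types."""
    a, b = set(params_a), set(params_b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def operation_distance(op_a, op_b, w_name=0.6, w_params=0.4):
    return (w_name * name_distance(op_a["name"], op_b["name"])
            + w_params * param_distance(op_a["params"], op_b["params"]))

query = {"name": "findLocation", "params": ["Address"]}
candidates = [
    {"name": "getLocation", "params": ["Address"]},
    {"name": "listPOIs", "params": ["Area", "POIType"]},
]
# Rank candidate service operations by distance to the query (closest first).
ranked = sorted(candidates, key=lambda op: operation_distance(query, op))
print([op["name"] for op in ranked])  # ['getLocation', 'listPOIs']
```

The weighting is the design lever here: raising w_params favors signature compatibility over naming similarity.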

Comparing it with the main features of a model search engine discussed in Section 2.1, this approach is restricted to services and the UML models that are used to model them. Thus, it does not support different kinds of models. However, it is extremely well integrated into a UML-based design process. It employs metamodel-based searching to enable architecture-driven search, but is limited to only a few UML elements. The results are automatically integrated into the original design, and thus they are displayed properly. On the other hand, the approach is less suitable for early life-cycle phases, and thus does not support the entire life-cycle.

Being service-oriented and focused on design, the approach is not based on specifying real-world concepts. But inexact matching, together with a distance-based ranking algorithm, are effectively used to avoid discarding relevant results.

The services must be modeled properly using structural and behavioral models, and thus indexing is not entirely automatic. However, once these models exist, there is no need for further categorization effort.

The authors do not discuss if there are options for manual browsing or if it can be effectively used with large repositories.

2.5 EMF Search Project

The EMF Search project (http://www.eclipse.org/modeling/emft/?project=search#search) provides the infrastructure for querying EMF-based models, focusing on tight integration with the Eclipse Core Search API for end users. The search engine takes metamodel information into consideration, thus allowing the user to search for specific metamodel elements.

In EMF Search, there are two main query types: textual and OCL.

Textual queries work almost like normal file search in Eclipse: a case-sensitive string matching algorithm is implemented. Wild cards are allowed, as well as Java-based regular expressions. What makes it different from normal search is the possibility of including the participants, i.e. the metaelements that will be part of the query. Figure 6 shows Eclipse's main search page for querying EMF models, where the user is searching for different UML2 action elements with names matching the Secret* regular expression.

OCL-based search uses simple invariants based on a particular metaelement. For example, by specifying the query self.name <> 'noname' for the element EClass, all instances of a class with a name different from 'noname' are retrieved.

In relation to the main features of a model search engine, the EMF Search project is strongly focused on metamodel-based searching. Different kinds of models are supported, since the mechanism works at the metamodel level. The results are shown inside Eclipse, and thus a proper preview of the results is achieved. Regarding real world concepts and support for the entire life-cycle, EMF Search performs relatively well, since any kind of model (requirements, analysis, design, implementation, etc.) can be searched for.

Fig. 6 Textual EMF-based search in Eclipse. Extracted from wiki.eclipse.org.

However, inexact matching is based on regular expressions, so the results will not include models that have only a semantic relation to the query. There is no ranking policy: results are shown in the order of the search space (for example, the ordering of the file system or the order of appearance inside a file). There is no indexing, which makes it inefficient for large repositories. Since it is an API, the EMF Search project could be used to implement browsing, but this is not present in the default implementation inside Eclipse.

2.6 Source Code Search

The literature has a large body of work on source code search. For example, a previous work [30] surveyed component search and retrieval approaches from 1991 to 2004. As discussed in Section 1, general-purpose code search engines may also be used for model searching. Code searching and model searching are similar problems: both involve searching through structured and semi-structured text (in some cases XML) and identifying concepts and related segments in a corpus. In fact, Moogle employs the same techniques used in source code search to effectively index and search models. However, there are some limitations with current approaches that prevent them from being effectively used in model searching. These are further discussed later in this section.

Information retrieval (IR) techniques have been successfully used to address problems related to or involving code searching. IR-based search has many interesting features that bring great advantages over simple regular expression search [37]. Such features include multiple-term queries, natural language queries, Boolean operators and ranking of search results. Scalability and reliability of the search engine are also achieved, which is important for massive file repositories, such as large-scale software systems [37].


Marcus et al. [31] used Latent Semantic Indexing (LSI), a well-known information retrieval technique [9], to address the problem of concept location [39] through efficient searching. Like other IR methods, LSI indexes documents in a document space by extracting information about the occurrences of terms within them. This information is then used to define the similarity between queries and documents [6].

In Marcus et al.'s work, documents are typically files of source code or program entities such as classes, functions or interfaces. These are called source code documents [31]. The terms are extracted from source code comments and program identifiers. Comments are mostly composed of natural language text, so term extraction occurs naturally. Program identifiers, which normally carry important semantic information, are analyzed according to two commonly used coding styles: combination of words using the underscore character "_" and letter capitalization (as in insertCustomerInDatabase, a convention also known as camel case). The result is that each source code document is reduced so that it contains only meaningful terms, excluding programming language keywords and other terms that are irrelevant to search. The combination of documents and terms is then used to feed the search mechanism, which can calculate the similarity between documents and queries, and use this similarity to find matching results. This approach is also used in Moogle, as will be explained later.
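The identifier-splitting step described above can be sketched as follows; the function name and the regular expression are illustrative assumptions, not Marcus et al.'s actual implementation:

```python
import re

def extract_terms(identifier):
    """Split a program identifier into lowercase terms using
    underscores and camel-case boundaries (illustrative sketch)."""
    parts = []
    for chunk in identifier.split("_"):
        # Capture runs of capitals, capitalized words, lowercase runs, digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]

print(extract_terms("insertCustomerInDatabase"))  # ['insert', 'customer', 'in', 'database']
print(extract_terms("XML_parser"))                # ['xml', 'parser']
```

The resulting terms are what would feed the LSI document space in place of the raw identifiers.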

Marcus et al.'s work was later continued to include tool implementations for C++ [36] and Java [35].


Poshyvanyk et al. [37] used Google Desktop Search (GDS, http://desktop.google.com/) as a source code search engine. They developed an Eclipse plug-in, called GES (Google Eclipse Search), which uses GDS to provide IR-search capabilities with efficiency and the ability to unobtrusively re-index the search space as it changes. Being integrated into the development environment (Eclipse), GES does not require the developer to switch between different tools for searching and coding, thus reducing the overall effort.

In fact, Google Code Labs have also used the same concept, applying their engine to search for source code. This mechanism, called Google Code Search, is hosted at http://www.google.com/codesearch, and can be used to search public source code, normally located in version control systems or publicly accessible through a web server. Similar tools are available at http://www.koders.com and http://www.krugle.com.

De Lucia et al. [6] have also used LSI to index software in a similar manner, but with the objective of recovering traceability links in software. For this reason, they also investigated how LSI could be used to index other types of software artifacts, including documentation. They have developed a tool, called ADAMS Re-Trace [7], which has an indexer module that processes software artifacts and removes non-textual tokens and stop words [5]. The result is a plain text document that represents the artifact, similarly to the work of Marcus et al. described before. For traceability recovery, this is enough. However, search and retrieval for reuse, especially of models, requires additional functionality, as will be discussed later in this section.

Although it can be used to index different types of software artifacts [7], including UML models such as use cases and activity diagrams, ADAMS Re-Trace does not go inside diagrams to extract text. In a later work [8], De Lucia et al. explain that for UML activity diagrams, for example, only the corresponding textual description is used.

Hill et al. [26] investigated the use of contextual search to improve the developer's ability to determine result relevance and reformulate queries. After a query is performed, natural language phrases are automatically generated from the words surrounding the query terms in the source code, so that the developer can quickly discard irrelevant results without having to investigate the code. In addition, the phrases extracted from source code form a hierarchy of related phrases, with more general phrases appearing at the top of the hierarchy and more specific phrases appearing at the bottom. This hierarchy helps the developer to identify relevant program elements and formulate more effective queries.

In Hill et al.'s work, phrase extraction occurs through rules developed specifically for different groups of related signatures. For example, there are rules for static methods and fields, methods with and without parameters, and methods and fields with verbs and objects in the name [26]. Such rules examine the signatures, split the identifiers according to punctuation, numbers and camel case, and identify the phrases. A morphological parser, called PC-KIMMO (http://www.sil.org/pckimmo/), is used to determine the types of words (nouns and verbs) that form the phrases.

2.6.1 Searching for Models Using Source Code Search Engines

Although more suited to searching for source code, code search engines can also be used to search for models. Since most models are described using text files, by specifying particular file types, such as XMI or XML, it is possible to search for specific keywords inside models using a code search engine. For example, Google Code Search uses the IEEE POSIX Extended Regular Expressions (ERE) standard [1]. The file: operator can be used together with an ERE to search inside files that have a specific extension. For instance, the query "file:\.xmi customer insert" finds all .xmi files that contain the terms customer and insert.

But, as described in Section 1, code search engines (and general-purpose web search engines) have three main limitations related to model search. First, the search is purely text-based, and so metamodel-based search is not possible. In the approaches described earlier in this section, like the works of De Lucia et al. [5-7], although model files can be indexed, all structural information is discarded, so it becomes impossible to associate a particular term with some structure. A second drawback is that the results are only shown textually. If the user is not familiar with XMI or the model representation format being used, the results are difficult to interpret without actually downloading the file and opening it in a proper tool. Contextual search, like in the work of Hill et al. [26], could be used to overcome this drawback, but the multitude of model types and the variety of model elements arising from domain-specific modeling [?] make it very difficult to create specific rules for every case. Finally, some of these code search engines, like Google Code Search, rely on the file extension to distinguish between model types. Although it is possible to include any file extension in a query, this information alone is not enough to provide a distinction, because most models share the same file extensions (.xml and .xmi).

Code search engines implement the most generic features described in Section 2.1:

real world concepts, support for the entire life-cycle, limited support for different kinds of models, inexact matching, ranking, automatic indexing and support for large repositories. Browsing is partially supported, since it is possible to manually browse through the files located in the same folder as a particular result. The main drawbacks are the absence of metamodel information and the poor preview of the results.

2.7 XML-Based Search Engines

Finally, there are numerous XML-based search engines available (see http://www.searchtools.com/info/xml-resources.html for a comprehensive list). Both text-based and structured query engines exist, and either could be used to search models. XML text-based queries work in a similar way as EMF Search (Section 2.5): the user specifies some keywords, and a text comparison is performed within the content of XML documents.

Structured queries are more complex, and can benefit from information about the XML structure, such as DTDs or XML Schema. In this sense, it is possible to specify finely-tuned queries, such as searching only inside specific parts of the document, or inside specific XML tags and attributes. To support this, a structured query language is used, of which W3C's XML Query Language, or XQuery [45], is the most well known.
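To illustrate the difference between text-based and structured queries over model files, the following sketch (hypothetical document and element names) restricts matching to a specific tag/attribute combination instead of scanning the whole text:

```python
import xml.etree.ElementTree as ET

# A miniature XMI-like document (hypothetical content for illustration).
doc = """<model>
  <packagedElement type="Class" name="Customer"/>
  <packagedElement type="Class" name="Invoice"/>
  <packagedElement type="Comment" name="Customer data is private"/>
</model>"""

root = ET.fromstring(doc)

# A plain text search for "Customer" would also match the comment;
# a structured query can match only elements of a given type.
classes = [e.get("name")
           for e in root.findall(".//packagedElement[@type='Class']")
           if "Customer" in e.get("name")]
print(classes)  # ['Customer']
```

An XQuery engine performs the same kind of restriction, but with a full query language over the document structure.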

There are XML-based search engines that support almost all the features described in Section 2.1. They support different kinds of models, since XML is a widely used format for storing models. For the same reason, they offer support for the entire life-cycle. There are some free-text-based engines that allow the user to search within models using real world concepts. Inexact matching is achieved not only through wild cards and regular expressions, but also through word similarity. Some engines implement ranking algorithms and automatic indexing, which makes them suitable for large repositories. Metamodel-based searching is achieved in a limited way, because although it is possible to specify queries using XML metaelements, these are normally composed of unreadable, machine-oriented tags, which are not very user-friendly. For the same reason, these engines do not provide a proper preview of the results, only the original XML file.

We are unaware of an XML-based search engine that allows manual browsing.

2.8 Comparison of Existing Approaches

Table 1 shows a comparison between the approaches described in the previous subsections and highlights their support for the identified features of an ideal model search engine. The next section describes Moogle, and in Section 5 we revisit this table and assess Moogle against the criteria therein.

Table 1 Comparison between the existing approaches (Gulla et al. [23], Gomes et al. [19], Kozlenkov et al. [28], EMF Search, Code Search, XML Search), rating support for each key feature as full support, limited support, no support, or not enough information. The key features are: (1) Metamodel-based searching, (2) Real world concepts, (3) Support for the entire life-cycle, (4) Different kinds of models, (5) Inexact matching, (6) Ranking, (7) Automatic indexing, (8) Proper preview of the results, (9) Browsing, (10) Large repositories.

3 Using Moogle

This section describes Moogle from a user's perspective. Moogle (currently hosted at http://alambique.icmc.usp.br/moogle/) offers three ways to search for models: Simple Search, Advanced Search and Browse.

3.1 Simple search

In simple search, general-purpose queries can be specied using keywords. Simple search is well suited to earlier stages of development, when the developer is looking for models related to domain concepts.

Moogle uses Apache Lucene's [2] query syntax, which has the following features:

– Terms: A term can be a single word or a phrase (a group of words surrounded by double quotes). Multiple terms can be combined with Boolean operators to form a more complex query.

– Fields: In addition to terms, it is possible to search in specific fields of a document. For example, the query metaelement:Actor will retrieve all documents that contain a model element of type Actor.

– Wildcards: Single-character wildcard searches are supported, using the (?) modifier. For example, to search for test or text, the query te?t can be used. Multiple-character wildcard searches are also supported, through the (*) modifier. For example, to search for test, tests or tester, the query test* can be used. Wildcards can also be used in the middle of a term, but not at the beginning.

– Fuzzy searches: Fuzzy searches use the Levenshtein distance metric [4] to find terms that are similar in spelling, and are indicated by inserting the tilde character (~) at the end of a single-word term. For example, the query roam~ will find terms like foam and roams.

– Proximity searches: In phrase terms, it is possible to determine the maximum distance between the words of the phrase, using the tilde character (~) and a number at the end of a phrase. For example, the query "customer purchase"~10 will find documents where these words appear with 10 words or fewer between them.

– Term boosting: A query considers all terms as being equally relevant. Putting the caret symbol (^) with a boost factor after a term makes it more relevant. For example, the query customer^4 purchase will consider the term customer four times more relevant than purchase.

– Boolean operators: These allow terms to be combined using the logical operators AND, OR, and NOT. The default operator is OR, i.e. if there are no operators between terms, the OR operator is used. Parentheses can be used to specify operator precedence.
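As an illustration of the metric behind fuzzy searches, a minimal Levenshtein distance implementation (a textbook sketch, not Lucene's actual code) can be written as:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, the metric behind
    fuzzy queries such as roam~ (illustrative sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("roam", "foam"))   # 1
print(levenshtein("roam", "roams"))  # 1
```

Terms within a small distance of the query term (here, foam and roams at distance 1 from roam) are accepted as matches.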

As an example, the query waiting for call states (Figure 7) will return all models containing any of the specified words, irrespective of the type of the model elements. In the example in Figure 7, three UML models containing the words waiting, call or states are shown. These words can appear inside any model element, such as an Operation, State, StateMachine, Property or even a Comment.

When simple search is used, Moogle automatically inserts a field term in the query. This can be seen at the top of Figure 7, where the term metaelement:XMIFile was automatically inserted after the original query, with the logical AND operator between them. This tells Moogle to search inside XMI files, which means that any model file will be considered, and that only entire models are retrieved.

Fig. 7 Simple search
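The query rewriting described above can be sketched as follows; the helper name and the exact rewritten string are assumptions based on the example in Figure 7:

```python
def simple_search_query(user_query):
    """Sketch of simple-search query rewriting: the user's terms are
    AND-ed with a field term restricting results to whole model files
    (function name and exact string layout are assumptions)."""
    return "(%s) AND metaelement:XMIFile" % user_query

print(simple_search_query("waiting for call states"))
# (waiting for call states) AND metaelement:XMIFile
```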

3.2 Advanced search

In advanced search, a Moogle user can specify the types of model elements to be returned. For example, the query waiting for call states can be refined to retrieve only UML elements of type State. By specifying waiting for call as a query (Figure 8a) and filtering by UML states (Figure 8b), Moogle retrieves all UML states that contain the specified words (Figure 8c). Internally, Moogle uses an Apache Lucene field to identify the model elements to include in the query. For example, the query in Figure 8 is transformed into the Lucene query (waiting for call) AND ((metamodel:UML 2 AND (metaelement:State))).


Fig. 8 Advanced search

The advanced search considers the full metaclass hierarchy when filtering the results. For example, by specifying the query string Phone and selecting UML 2 Classifier as a filter, all elements containing the word Phone that are instances of UML 2's Classifier metaclass or any of its direct or indirect subclasses are retrieved. The left side of Figure 9 shows the results for this particular query, with four different types being retrieved: a Class (result 1), a StateMachine (result 2), an Interaction (result 3) and a Collaboration (result 4). The right side of Figure 9 highlights these four results on an excerpt of the UML 2 metamodel, to show that not only direct subclasses are retrieved (like Class), but also other elements contained in the hierarchy.

Fig. 9 Advanced search results using hierarchy filtering

The advanced search also allows more than one filter to be combined. For example, by searching for Phone call, and selecting Action, State and StateMachine as filters, the following query string is produced: (Phone call) AND ((metamodel:UML 2 AND (metaelement:Action OR metaelement:State OR metaelement:StateMachine))). This retrieves all model elements that are instances of UML 2's Action, State or StateMachine metaclasses.

By changing this string directly, it is possible to search for multiple inheritance. Switching the metaelements' OR operator for the AND operator causes all model elements that are instances of BOTH metaclasses to be retrieved. For example, by modifying a query to look like (Phone) AND ((metamodel:UML 2 AND (metaelement:BehavioredClassifier AND metaelement:StructuredClassifier))), all model elements that are instances of both BehavioredClassifier and StructuredClassifier (like a Collaboration), or any of their subclasses, are retrieved.
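The construction of such filter queries can be sketched as follows; the helper is hypothetical, but the output format follows the query strings shown above:

```python
def advanced_query(terms, metamodel, metaelements, operator="OR"):
    """Sketch of advanced-search query rewriting: combine user terms
    with a metamodel restriction and one field clause per selected
    metaelement (helper name is an assumption)."""
    clause = (" %s " % operator).join(
        "metaelement:%s" % m for m in metaelements)
    return "(%s) AND ((metamodel:%s AND (%s)))" % (terms, metamodel, clause)

print(advanced_query("Phone call", "UML 2",
                     ["Action", "State", "StateMachine"]))
```

Passing operator="AND" yields the multiple-inheritance variant of the query described above.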

By clicking on a result, a formatted version of the model is presented. Figure 10 shows a comparison between the formatted preview provided by Moogle and the original XMI format. In the formatted version of this example, a package named StateMachine can be clearly seen, containing a state machine named Supplier, with four triggers and four operations. Also, the fonts of attribute values and element types are larger, to highlight the most important information. In the original XMI format, it is more difficult to see this information among all the machine-oriented XML tags and without any highlighting.

Fig. 10 Formatted preview vs original XMI format

3.3 Browsing

Browsing (Figure 11) is similar to advanced search, but the developer does not specify keywords to be searched. Instead, all elements matching the specified filters are shown. The filters are organized into a faceted scheme [38]: there are five facets (Figure 11a): Location, Metamodel, Metaelement, File size and Last update. These facets were chosen because they represent useful clues about a model, and can narrow down the results to a smaller number of options. For example, when one is looking through his personal files trying to locate a particular document but does not remember its name exactly, knowing it is a large file can narrow down the choices and drastically reduce the browsing time.


For each facet, the user can select one or more values, combining them freely. In this example, all UML2 Types located in files of size between 10 Kb and 100 Kb are shown (Figure 11b).

Fig. 11 Browsing

Also note that in browsing, as in advanced searching, the hierarchy between metamodel elements is considered. This is shown in the example of Figure 11, where the specified filter is UML2 Type, but the results contain a UML2 Stereotype (a subclass of Type) named metaProperty.
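Assuming browse requests are translated into the same kind of field queries used elsewhere in Moogle (an assumption; the paper does not show the internal query), a facet combination might be rendered as:

```python
def browse_query(facets):
    """Hypothetical sketch of a browse request built purely from facet
    filters; field names follow the schema in Section 4.2, and the
    exact query rewriting is an assumption."""
    return " AND ".join("%s:%s" % (field, value)
                        for field, value in sorted(facets.items()))

# All UML2 Types in files between 10 Kb and 100 Kb, with the size
# facet written as a Lucene-style range over the filesize field.
print(browse_query({"metaelement": "Type",
                    "filesize": "[10240 TO 102400]"}))
```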

4 Architecture of the Search Engine

There are three basic components in Moogle's architecture (Figure 12): the Graphical User Interface, the Searcher and the Model Extractor. The Graphical User Interface is what is shown to the developer, and it is where he/she can specify the query, defining the search string and its parameters, like the filters for the search. The Searcher is responsible for interpreting the query provided by the user, finding matching models and returning them to be shown in the Graphical User Interface.

The Searcher also creates all the required indexes to perform the search in an effective way, but it requires model descriptors to create them. These descriptors are XML documents that contain all kinds of information to be searched, including text content and any kind of metadata, such as the size of the model, its last update, and the type of the file, among others. The descriptors are provided by the Model Extractor, which is responsible for retrieving models from their original locations and creating the required descriptors.

Fig. 12 Architecture of the model search engine

With the indexes created, it is possible to submit queries to the Searcher. After being specified by the user in the GUI, a query is passed to the Searcher using the HTTP protocol, with the query parameters being passed as HTTP parameters. The Searcher then generates an XML file with the results of the search. The GUI has an XML formatter that reads the results and shows them to the user in a proper way. Next, each component of this architecture is further described.

4.1 Model Extractor

Figure 13 shows a detailed description of the Model Extractor component.

Fig. 13 Detailed description of the Model Extractor component

The objective of the Model Extractor is to find and read model files from different repositories (a database, the file system or a web site, for example), and to produce model descriptors to be used by the Searcher component. These descriptors contain all the searchable content, already structured and formatted in a way that the Searcher component can understand. Moogle is designed to support any kind of model. To implement this feature, the extractor is extensible, supporting different parsers. For example, there can be an XML parser for XML-based models, a domain-specific parser for models based on a domain-specific language, a UML parser for UML models, and so on. Currently, only an EMF [12] parser is provided, because EMF covers a substantial part of languages and commercial tools. It can read and interpret XMI [34] files, a format supported by most tools.

EMF is a generic modeling engine, allowing the manipulation of different models and metamodels, including UML 2. This means that a single EMF-based parser can read and interpret different models, like ER or UML models, without any modification. All that is required is that the corresponding metamodels are loaded before parsing. An EMF model may conform to multiple metamodels simultaneously. For example, a model may conform to the UML metamodel and a specific UML profile.

To manage the metamodels, the Model Extractor component uses an internal metamodel repository, which stores EMF-compliant metamodel files (in .jar format) and a metamodel configuration file. The configuration file contains the namespaces of each metamodel supported by Moogle. A namespace is a string that identifies, inside a model file, which metamodel is being used in that particular file.

When reading a model file, the extractor must first identify which metamodels to use. Some models identify the metamodel through the file extension, like some UML tools, which use the .uml2 extension. But this is not always true (some tools use only .xmi, or a domain-specific extension), and thus this is not a reliable way of identifying metamodels. Moogle uses the XMI namespace declaration for that purpose, since it identifies the metamodels being used by a particular model [34].


For each (EMF) model file, the extractor reads the first line and extracts the namespace declaration. It then compares the namespace declaration with all namespaces contained in the configuration file, and determines which .jar files to load. If a required metamodel is not found (false negative), the model will not be read. If the wrong metamodel is inferred (false positive), the model will also not be read, because the extractor will throw an exception and move to the next model. Both cases can be easily corrected by modifying the configuration file, as long as the model contains a valid, unique namespace declaration.
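The namespace-matching step can be sketched as follows; the regular expression, namespace URIs and configuration mapping are illustrative assumptions:

```python
import re

def extract_namespaces(first_line):
    """Pull xmlns namespace declarations out of the first line of an
    XMI file, to decide which metamodel .jar files to load
    (illustrative sketch; real XMI headers vary)."""
    return re.findall(r'xmlns(?::\w+)?="([^"]+)"', first_line)

line = ('<xmi:XMI xmlns:xmi="http://schema.omg.org/spec/XMI/2.1" '
        'xmlns:uml="http://www.eclipse.org/uml2/2.1.0/UML">')
namespaces = extract_namespaces(line)

# The configuration file maps known namespaces to metamodel jars
# (hypothetical entry); unknown namespaces yield a false negative.
known = {"http://www.eclipse.org/uml2/2.1.0/UML": "uml2.jar"}
jars = [known[ns] for ns in namespaces if ns in known]
print(jars)  # ['uml2.jar']
```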

Once all metamodels are loaded into the EMF parser, the extractor component reads the model, extracting the information needed to create model descriptors. The irrelevant XML and XMI tags are ignored, and the generated descriptor contains only the actual content of the model. Figure 14 shows an example of a model descriptor. In addition to the basic metadata, such as the path of the model, the file size, etc., there is a field called storedtext, containing every word of every element of the model, so that the model will be found whenever one of these words is searched.

Fig. 14 Example of a model descriptor produced by Moogle's Model Extractor component


The descriptor in Figure 14 allows entire model files to be searched. However, the extractor also adds one entry for each element inside a model, so that each element can be found individually. Figure 15 shows an example descriptor for a UML2 Operation, including its metamodel (UML 2), its type (Operation) and its attributes (name, visibility, isAbstract).

Fig. 15 Example of a model descriptor for individual model elements, in this case a UML 2 operation.
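A descriptor like the one in Figure 15 could be generated along these lines; the field layout is a simplified assumption based on the static fields described in Section 4.2:

```python
import xml.etree.ElementTree as ET

def element_descriptor(metamodel, metaelement, attributes):
    """Build a SOLR-style <doc> descriptor for one model element, with
    one <field> per attribute plus metamodel/metaelement fields
    (simplified assumption; the real descriptor has more fields)."""
    doc = ET.Element("doc")
    fields = {"metamodel": metamodel, "metaelement": metaelement}
    fields.update(attributes)
    # storedtext aggregates every searchable word of the element.
    fields["storedtext"] = " ".join(str(v) for v in attributes.values())
    for name, value in fields.items():
        f = ET.SubElement(doc, "field", name=name)
        f.text = str(value)
    return ET.tostring(doc, encoding="unicode")

print(element_descriptor("UML 2", "Operation",
                         {"name": "pickUp", "visibility": "public",
                          "concurrency": "sequential"}))
```

A descriptor built this way would answer queries such as "find all operations with sequential concurrency" via the concurrency field.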

This richer descriptor allows more fine-tuned queries to be specified and executed, such as: find all operations with sequential concurrency. For a UML operation, for example, there are fields for each attribute of an operation, such as name, visibility, isAbstract and concurrency, among others that are read from the UML metamodel. The complete schema of the descriptors is detailed in the next section.

In addition to the XML descriptors, which are used for searching, the model extractor produces a formatted model preview. The content of this preview is the same as the XML descriptor, but HTML and CSS are used to provide user-readable content. The following formatting rules are used by the formatter:

– each line contains a single model element;

– indentation is used to indicate containment between model elements;

– each model element is shown in the format ModelElementType(attr1="value1", attr2="value2", ...);

– model element types and attribute values are shown in larger fonts, while attribute names and other characters are shown in smaller fonts; and

– line numbers are inserted at the beginning of each line, to facilitate the identification of model elements.

Figure 16 shows an example of a formatted preview of a UML2 model. In this example, a package named ClassDiagram contains a class named Delivery Order, which contains two properties: selectedItem and invoice.

Fig. 16 Example of formatted preview
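The formatting rules above (minus the HTML/CSS styling) can be sketched as follows; the tuple-based model representation is an assumption for illustration:

```python
def format_preview(element):
    """Render a nested (type, attrs, children) tuple following the
    preview rules: one element per line, indentation for containment,
    ModelElementType(attr="value", ...) layout, and line numbers
    (illustrative sketch; font styling omitted)."""
    lines = []

    def walk(node, depth):
        etype, attrs, children = node
        attr_text = ", ".join('%s="%s"' % kv for kv in attrs.items())
        lines.append("%d %s%s(%s)" % (len(lines) + 1, "  " * depth,
                                      etype, attr_text))
        for child in children:
            walk(child, depth + 1)

    walk(element, 0)
    return lines

# The example from Figure 16, as a nested tuple.
model = ("Package", {"name": "ClassDiagram"},
         [("Class", {"name": "Delivery Order"},
           [("Property", {"name": "selectedItem"}, []),
            ("Property", {"name": "invoice"}, [])])])
print("\n".join(format_preview(model)))
```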

We decided to generate the previews at extraction time because we used an existing Searcher component (described in the next section) that already implements many functionalities, like term highlighting, which depend on the existence of the formatted text. This design decision imposed the need for extra storage space, but required no modification to the Searcher component, resulting in a faster implementation of Moogle. Future versions could introduce modifications to the Searcher component so that stored previews are not necessary, although storage is not a big problem nowadays.


In summary, for each model, three actions are taken by Moogle's model extractor:

1. Metamodel identification: First, the extractor determines to which metamodels a model conforms, in order to correctly load the necessary contents into the EMF parser;

2. Metamodel loading: After the metamodels are identified, the extractor loads their content (.jar files) into the EMF parser; and

3. Model descriptor generation: The extractor reads the model file, extracts all textual content and creates two model descriptors, one that describes the entire model and one that describes each model element individually. It also creates a formatted model preview.

The descriptors produced by the extractor are used by the Searcher component, which is described next.

4.2 Searcher

Except for the metamodel information, Moogle has the same functionalities as many existing web search engines. For this reason, the Searcher component uses an existing search engine, Apache SOLR [3]. SOLR provides many functionalities, including:

– advanced full-text search capabilities;

– optimizations for high-volume web traffic;

– standards-based open interfaces;

– comprehensive HTML administration interfaces;

– efficient replication to other SOLR search servers; and

– an extensible plugin architecture.

It is an open source project, well documented, and with an active community. For these reasons, it was chosen to be included in this project. The other components were adapted to make use of these functionalities. For example, the model descriptors generated by the Model Extractor component follow the format understood by SOLR. SOLR is preconfigured and ready to be used in a number of scenarios. However, to be used in Moogle, the schema of the descriptors had to be configured properly. Figure 17 shows part of the schema used in Moogle's descriptors. SOLR's schema is basically composed of static and dynamic fields. Static fields are those that are common to all models and model elements, no matter which metamodel they conform to. All these fields apply to both models and model elements. These are:

- id: a unique number, generated by the extractor;
- path: the path to the model's file;
- filename: the model's file name;
- filesize: the model's file size;
- lastupdate: the timestamp of the last modification;
- location: the place where this model was located (which website, repository, version control system, etc.);


Fig. 17 Part of the schema used by Moogle to index models.

- linenumber: used for model elements, to indicate where they are located inside the model files;
- metamodel: the metamodels associated with this particular model;
- metaelement: the type of this particular model element;
- storedtext: all searchable terms of a particular model or model element, described as a sequence of space-delimited words; and
- fullpreview: a formatted preview of the model/element. This is the preview that is shown in the results page. This field is used for term highlighting, one of SOLR's functions.


For model elements, file-related fields, like path, file name, file size, last update and metamodel, are inherited from the containing model. All fields have attributes that define how they are to be used by SOLR. The most important attributes are:

- type: the type of the field. This is used by SOLR to determine which types of queries can be used in this particular field. For example, numeric fields can appear in range queries, while string fields can appear in queries with wild cards or regular expressions;
- indexed: this attribute indicates if this field will be indexed, i.e. if it will contribute to the query. The only field that is not indexed is the preview, which is only used to show the results;
- required: indicates whether a field is required or not. The optional fields in Moogle are the linenumber, which is only used for model elements, and the dynamic fields, which will be described next; and
- multiValued: indicates if a field may contain multiple values per document. For example, a model may conform to multiple metamodels, hence this field is multivalued.

The dynamic fields represent fields that are specific to a particular metamodel. For example, a UML classifier has a name text field, and two boolean fields to indicate whether it is abstract or an interface. A UML operation, like the one shown in Figure 15, contains fields like name, visibility and concurrency, among others. These are described in SOLR using dynamic fields. All dynamic fields are indexed, and their types are identified by a suffix on their names, as shown in Figure 17. The advantage of using dynamic fields is that the generated descriptors alone contain all information required by SOLR to perform metamodel-based queries.
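As an illustration, a descriptor for the Delivery Order class of Figure 16 might combine static and dynamic fields as sketched below. The suffix convention (`_t` for text, `_b` for boolean) and all field values are assumptions made for this sketch, not Moogle's actual schema:

```python
# Hypothetical descriptor mixing Moogle's static fields with suffix-typed
# dynamic fields ("_t" = text, "_b" = boolean -- assumed suffixes).
descriptor = {
    # static fields, common to every model element
    "id": 42,
    "path": "/models/orders.uml",
    "metamodel": ["http://www.eclipse.org/uml2"],  # multiValued field
    "metaelement": "Class",
    "linenumber": 17,
    "storedtext": "Delivery Order selectedItem invoice",
    # dynamic fields, specific to the UML2 metamodel
    "name_t": "Delivery Order",
    "isAbstract_b": False,
    "isInterface_b": False,
}

def dynamic_fields(doc):
    # A field is dynamic when its name carries a recognized type suffix.
    return {k: v for k, v in doc.items()
            if "_" in k and k.rsplit("_", 1)[1] in {"t", "b"}}
```

Because the type travels with the field name, a descriptor like this is self-describing: SOLR needs no per-metamodel schema changes to index it.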

4.3 Graphical User Interface

The Graphical User Interface is the component responsible for managing the interaction between the developer and the Searcher component. It is web-based, comprising four different pages:

- initial page, for simple queries;
- results page, similar to the initial page, but with the results appearing below it, and controls for navigating the results;
- advanced search page, for more complex queries, including metamodel elements; and
- browse page, where faceted browsing is allowed.

Section 3 contains examples of Moogle's GUI. One particularly interesting aspect of Moogle's GUI is how the filters used in the advanced search and browsing pages are shown. As discussed in Section 3, model element types are shown to the user in a list, and can be used as a filter to the query. To create this list, the GUI component uses the same EMF parser used by the Model Extractor. However, instead of inspecting models, here it is used to inspect the metamodels, in order to identify all possible metaelements (the searchable model element types) that can be used as a search filter. Figure 18 illustrates this mechanism.

Fig. 18 Automatic identification of metaelements

The EMF parser is configured with the Ecore metamodel, which is EMF's meta-metamodel [12]. This makes it able to read any EMF-compatible metamodel. Next, Moogle's metamodel repository is inspected, and for each metamodel, all metaelements are inserted into a list in the GUI. In the example of Figure 18, two metamodels are shown: Web Navigation (a domain-specific metamodel) and UML2.

The result is that there is no manual effort required to maintain the GUI. Every metaelement in every metamodel supported by Moogle is automatically inserted into the list. However, the names of these metaelements are also automatically extracted from the metamodel, which caused some problems with some obscure element names, particularly in the UML2 metamodel, as discussed in Section 5.4.4.
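The listing mechanism can be approximated as follows. This sketch scans the `.ecore` XML serialization directly for EClass definitions, instead of using EMF's reflective API as Moogle does; the sample metamodel and function name are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# xsi:type attribute key as ElementTree expands it
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

def metaelements(ecore_xml):
    """List the EClass names of a metamodel, to populate the GUI filter list."""
    root = ET.fromstring(ecore_xml)
    return sorted(
        e.get("name")
        for e in root.iter("eClassifiers")
        if e.get(XSI_TYPE, "").endswith("EClass")  # skip data types, enums, ...
    )
```

Each name returned would appear as one entry in the advanced search filter list, grouped under its metamodel.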


5 Evaluation of the Search Engine

The difference between Moogle and other text-based search engines is its awareness of the underlying metamodels. This awareness has three potential advantages for model search: (i) the use of metamodel information optimizes the indexing process, since irrelevant tags in the model files can be ignored; (ii) the advanced search feature allows for more powerful and expressive queries; and (iii) model previews are improved since formatting is metadata-aware.

To evaluate Moogle, an experiment was designed and run with users to assess to what extent Moogle provides the advantages above. More specically, the experiment had three main goals:

- G1. Effectiveness in retrieving models. To determine if the use of metamodel information produces better indexes, and, therefore, more effective searching;
- G2. Benefits of advanced search. To determine if the advanced search is helpful to users, when compared with text-based search; and
- G3. Expressive power of the formatted preview. To determine if the formatted preview is sufficient so that the user does not need to download the results file and open it in a tool in order to understand the contents.


5.1 Metrics

To measure these goals, we collected data on recall and precision metrics, which are the classic measures for search engines. Precision is the number of relevant results retrieved over the total number of results retrieved, and recall is the number of relevant results retrieved over the number of relevant elements in the database [22]:

precision = |relevantResults ∩ retrievedResults| / |retrievedResults|

recall = |relevantResults ∩ retrievedResults| / |relevantElementsInDatabase|

In order to determine if the search engine has a good balance between precision and recall, we also used the f-measure, which is the harmonic mean of precision and recall [40]. The closer the f-measure is to 1.0, the better the mechanism is. But this will only occur if both precision and recall are high. The f-measure is calculated using the following formula:

F_α = ((1 + α) ∗ precision ∗ recall) / (α ∗ precision + recall)

where α is used to weight the precision in relation to recall. We used α = 1, which means that both precision and recall are equally important.

These metrics can be used to evaluate goals G1 to G3 as follows. For goal G1, precision and recall for Moogle were compared with reference values from other reuse-oriented search engines in the literature. In general, existing reuse-oriented search engines have values for precision and recall close to 50% [16, 48, 17]. The f-measure for these engines is also 50%. We therefore used these values as benchmarks.
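The three metrics can be computed directly from the result sets; a minimal sketch, where the set arguments are illustrative names rather than identifiers from Moogle's implementation:

```python
def precision(relevant, retrieved):
    # fraction of the retrieved results that are relevant
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # fraction of all relevant elements in the database that were retrieved
    return len(relevant & retrieved) / len(relevant)

def f_measure(p, r, alpha=1.0):
    # reduces to the harmonic mean of p and r when alpha == 1
    return (1 + alpha) * p * r / (alpha * p + r)
```

For example, retrieving 2 of 4 relevant models among 4 results gives precision 0.5, recall 0.5, and an f-measure of 0.5.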

For goal G2, we compared values of precision, recall and f-measure for Moogle's simple search and advanced search.

Goal G3 is more subjective, and thus more difficult to measure. We decided to use a combination of recall and the user's perception about having found the right model. If Moogle retrieves correct results and the user perceives these results as correct, then we posit that Moogle's output is understandable. If not, there is a problem with the comprehensibility of the output. To capture this, during the experiment, for each query performed by the user, s/he would answer yes or no to the question: "Did you find the right model?" We then measured the number of queries where the user's perception about having found a model was inconsistent with the recall, i.e., a query had 100% recall and yet the user claimed not to find a model. We believe, from our experience with search engines, that one of the main causes of this discrepancy is the lack of understanding about a model's content, caused by a poor textual representation. By measuring the number of queries exhibiting this discrepancy, we are, to some extent, measuring the quality of the formatted view. We found nothing to use as a benchmark value for this measure in the literature, so we considered that if more than 25% of the queries resulted in an incorrect perception, the formatted preview was inadequate.

5.2 Experimental Design

An experiment was designed in which users were provided a model repository and were asked a series of search-related questions which required them to retrieve models from this repository. We were then able to collect recall and precision metrics according to whether their search queries returned relevant models. This allowed us to evaluate goals G1 and G2. Furthermore, by asking questions related to the users' perceptions of whether they found the right answer (as described above), we were able to evaluate goal G3.

For the experiment, a set of eight questions was defined, covering a range of plausible search tasks, geared towards both early lifecycle development, where the user has little idea about details of the domain (e.g. "find a UML model that describes a customer and its attributes"), and towards later stages of development, where more detailed models are needed (e.g. "Which states and events should be considered in a state machine for a phone call?"). The problems were defined based on the experience of the authors and two modeling experts, both with academic and industrial experience: one of the experts had experience in creating and using UML models in software development as well as experience with automated transformations for recovering UML designs from legacy systems, and the other had some experience with domain-specific modeling in MDD projects. The definition of the problems also considered the availability of models that could solve the problem.

To set up the model repository, the search engine was populated with 146 model files, gathered from publicly available repositories and samples. We developed a web crawler that automatically navigates through the results of a web search engine. We then used it with google.com to automatically find and download XMI files. The following queries were submitted: filetype:ecore, filetype:xmi, filetype:uml, filetype:uml2.

The results are mostly model files published in open source repository systems, such as CVS and Trac⁶. We also used model samples used in research projects and classes, gathered from colleagues and students. These are random files, used in real projects, and therefore they represent a good sample of the kind of models that would exist in a models repository. The files comprise 82096 model elements that can be individually searched using Moogle's advanced search features.

For each of the eight questions, we manually inspected the repository to search for relevant models. We considered as relevant any model that could contribute to a solution for the problem. For example, for the problem "Which states and events should be considered in a state machine for a phone call?", a model that has a class with a property named "phone number" is clearly irrelevant, while a model containing states and events named "phone", "call" and "waiting" is relevant. This process took approximately two hours for each problem. We found between 1 and 4 relevant models for each problem, each one containing 10 to 50 relevant elements.

⁶ http://www.cvshome.org and http://trac.edgewall.org, respectively

An evaluation prototype was developed to allow the experiment to be remotely performed by users in their own free time. The prototype consisted of a web interface that presented the eight questions in sequence. Users created and executed search queries for each of these questions using Moogle's interface. The prototype logged every interaction between the system and the user. When the user finished a problem, s/he clicked on a link to capture his/her perceptions about whether or not s/he found a model. After all problems were completed, a short questionnaire was presented to gather other information from the user, including his/her impressions about the search engine and previous experience with search engines. Users were free to decide whether to use simple or advanced search to solve the tasks.

5.3 Experimental Subjects

The web evaluation prototype was sent to different people by e-mail, together with a short tutorial on how to use Moogle and how to perform the evaluation. The subjects were chosen by availability, and all had basic modeling skills. The evaluation was conducted over a two week period. At the end of the two weeks, thirteen subjects had completed the evaluation. Eleven had already used code search engines before, such as Google code search, Koders and Krugle, and thus had some experience in searching software for reuse; however, none had used a model search engine before. The subjects performed a total of 200 queries, 129 using only simple search, and 71 using advanced search (the subjects were not forced to use advanced search). The precision/recall/f-measure was calculated for each query.

5.4 Experimental Results

One issue we faced was that, in Moogle, each query results in one or more pages of results, each page containing up to five models. However, we noticed that users do not always browse through all pages. Our results have shown that in 80% of the queries, only the first page was visited, and in 95% of the queries, the user visited three pages or less. This means that if a model is retrieved, but appears only after page three, it is practically the same as not being retrieved at all. For this reason, our calculations consider pages one to three only. We also calculated the recall for an arbitrarily large number of pages (30), but only to verify if there are many relevant results appearing in later pages that are not normally visited. If the recall values considering three and thirty pages are similar, this means that there are not many models left unseen.

We also noticed that many queries found no models at all. A total of 58 queries had zero results (27 using simple search and 31 using advanced search). Analyzing these queries, we found that in most of them (42), users were experimenting with different filters non-existent in the repository. In twelve of these queries, the advanced search features were being incorrectly used, with the user changing the query string directly in the text field, instead of the advanced search screen, or incorrectly placing characters such as quotes and double quotes, thus leading to invalid queries. In the remaining four queries leading to no results, the users specified the correct keywords and filters, and thus these were indeed the result of Moogle's inability to find the right model. We eliminated from the calculations the twelve queries where the users were using Moogle incorrectly.

Fig. 19 Overall results.
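The page-limited recall used in these calculations can be sketched as follows, assuming five results per page as stated above; the function and argument names are illustrative, not taken from Moogle:

```python
PAGE_SIZE = 5  # Moogle shows up to five models per results page

def recall_at_pages(relevant, ranked_results, pages):
    # Only results appearing within the first `pages` pages count as
    # retrieved; anything ranked later is treated as not retrieved at all.
    seen = set(ranked_results[:pages * PAGE_SIZE])
    return len(set(relevant) & seen) / len(set(relevant))
```

Comparing `recall_at_pages(..., 3)` with `recall_at_pages(..., 30)` is exactly the check described above: similar values mean few relevant models are hiding in the rarely visited later pages.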

Next, we present the results for each goal.

5.4.1 Goal G1: Effectiveness in retrieving models

Figure 19 shows the overall results. The first observation is that, for each new page, precision decreases and recall increases. Precision goes down because more pages mean more irrelevant results. Recall goes up because, although there are more irrelevant results, there are also more relevant ones. Such a result coincides with intuition.


Recall varied from 44% to 53% for the first and first three pages, respectively. These are similar to the reference value of 50% as reported in the literature. The precision varied from 24% to 17%, which is smaller than the reference value.

One positive result is that recall is exactly the same (53%), regardless of whether 3 or 30 pages are considered. This means that all relevant models retrieved are located in the first three pages. We thus concluded that Moogle's ranking policy, which is implemented by SOLR, is performing well.

Moogle performed worse than similar studies in terms of precision, but presented similar recall. A possible explanation is that the values of 50% refer to code search engines, with interfaces that are specific for searching code. For example, the engine tested by Frakes & Pole has a list of all keywords available for the user to choose [16]. Ye & Fischer use signature matching [48]. Although Moogle is closer to these in purpose, its architecture and interface are more similar to a document search engine, so we decided to additionally compare the precision and recall with document search engines.

Shafi & Rather present a study on precision and recall of five web search engines [41]. Their mean values of precision vary between 14% and 57%, with the most popular web search engines (AltaVista, Google and HotBot) presenting a precision rate around 28%. These values are closer to our results, indicating that Moogle could be behaving more like a document search engine.


Fig. 20 Simple search vs advanced search

However, a more plausible explanation is that the lower precision could be the result of the relatively small number of models in the repository. Although there were more than 80,000 indexed model elements, these corresponded to only 146 files. If we compare the number of files with the number of components in the repositories of the reference works (between 2000 and 3000 components), and if we consider that only a small percentage of these files are relevant, precision would naturally be smaller, and recall would naturally be higher.

But we do not consider this lower precision as a major drawback, since this only means that the user will have to navigate through three pages to find the models. Considering that each page has 5 results, this is not very difficult.

5.4.2 Goal G2: Benefits of advanced search

Figure 20 summarizes the comparison between simple versus advanced search, which showed that advanced search is more accurate, with a small improvement of precision in the first page (around 8%), and a large improvement in the first three pages (around 66%). However, it has smaller recall, performing around 40% worse than simple search in the first three pages. The f-measure, which balances both precision and recall, shows that simple search is better in the first page (around 15%), but that advanced search performs better in the first three pages (around 27%). Advanced search is more accurate, which means that it can lead to more relevant results when the right query is specified. However, it is not so good in retrieving a larger amount of models. In simple search, more generic queries can lead to more models being retrieved, but at the expense of more irrelevant results. We conclude that advanced search may be more suited for more skilled users, with more knowledge about what is needed, while simple search may help novice users in gaining knowledge about the engine and the repository.

5.4.3 Goal G3: Expressive power of the formatted preview

In 76 queries, the subjects claimed NOT to find an answer to the proposed problem. Figure 21 shows how these queries are distributed according to the recall calculated for the first three pages. A recall of 100% means that the right answer to the problem was definitely present (i.e. all relevant models were retrieved and were presented in the first three pages). According to our criteria, if the expressive power of the formatted preview is good enough, the subjects should be able to correctly recognize it. Therefore, for the queries with a recall of 100%, there should be no subjects claiming not to find an answer, i.e. the 100% column in Figure 21 should be zero. However, there are 29 queries (38%) in this column. These are the queries where the subjects failed to see the answer when it was there. This means, according to our criteria, that Moogle's formatted preview is not good enough, because the subjects are not recognizing the relevant models in the results.

Fig. 21 Recall distribution for the queries where the subjects claimed not to find an answer

In the remaining 112 queries, the subjects claimed to find an answer to the proposed problem. Figure 22 shows how these queries are distributed according to the recall calculated for the first three pages.

Fig. 22 Recall distribution for the queries where the subjects claimed to find an answer


A recall of 0% means that the right answer to the problem was NOT there, and according to our criteria, if the expressive power of the formatted preview is good enough, the subject should NOT be able to correctly recognize one. Therefore, for the queries with a recall of 0%, there should be no subjects claiming to find an answer, i.e. the 0% column in Figure 22 should be zero. However, there are 41 queries (37%) in this column. These are the queries where the subjects claimed to find an answer, but in fact there was not one to be found. This means, according to our criteria, that Moogle's formatted preview is not good enough, because the subjects are falsely recognizing relevant models. Of course, in these cases the subjects could have found an alternative answer to the problem, one which we did not predict. However, the number of subjects having the wrong idea about finding an answer is similar in both cases (38% in Figure 21 and 37% in Figure 22), which indicates that this is not the case. Nevertheless, both values are larger than our reference value (25%), which could mean that the users were not able to understand the model contents, and thus a better preview is needed.

5.4.4 Qualitative results

We did not formally analyze Moogle's performance. But even with more than 80000 indexed elements, the query results are returned in a matter of seconds, often less than a second⁷. The indexing process is much slower. Indexing all 146 files takes approximately 15 minutes. This is mostly because the model interpretation process uses reflection to identify the metaelements. However, a full reconstruction of the indexes is rarely necessary, since Moogle allows partial index updating.

⁷ Results given by the SOLR interface, on a 2.2 GHz processor with 2 GB of RAM

The subjects reported some benefits when using Moogle: the advanced search feature offers useful guidance in searching models; Moogle allows models to be found in large repositories, where manual search is not possible; it is very accurate, helping to reduce false hits with the filters; it has a simple and easy to use interface; it offers different options to search for models; and it allows searching inside complex XMI files. The noted benefits are in accordance with the quantitative data, because we verified that advanced search can indeed improve precision, and that Moogle can help in different ways when searching for models, both for novice and more skilled users. Among the difficulties, almost all subjects reported the same drawbacks: the lack of a graphical preview of the models; and the fact that the user needs to know the metamodel elements in order to properly use the advanced search. The latter point is a consequence of the fact that Moogle uses UML2's metaelements, which do not always coincide with the commonly used name for a concept. For example, there is no "attribute" metaelement in UML2. Rather, attribute filtering in Moogle must refer to "property". We also collected the following direct answers from the questionnaire:

- From the 13 subjects, 12 answered that advanced search is a good improvement over simple search;
- All subjects believe that a search engine can increase model reuse; and
- 12 subjects believe that a model search engine helps in the learning process.

5.5 Discussion and summary of the results

With the evaluation results, the three goals can be summarized as follows:

- G1. Effectiveness in retrieving models. Moogle performed worse than similar studies from the literature in terms of precision, but presented similar recall. We don't think this is a major drawback, since basically this means that the same number of relevant models is retrieved, but the user will have to navigate through at least three pages to find them. Also, we don't know how these other studies treated paging issues;

- G2. Benefits of advanced search. Advanced search performed better in terms of precision, but worse in terms of recall. This was reported in the questionnaire by one of the subjects, who said that accuracy was one of the benefits of Moogle. In this sense, we can argue that advanced search could be a good option in later design and implementation phases, where more detailed models are desired (higher precision), while simple search can be used in early stages of the development, where a larger quantity of examples from a domain is desired (higher recall).

- G3. Expressive power of the formatted preview. The quantitative data showed that a large number of queries resulted in a wrong perception about finding a correct model or not. Although there are many possible explanations for this inconsistency, the qualitative data supported our theory that the visual representation is not good enough: most subjects complained about not having a visual representation of the models.

5.6 Threats to validity

One of the main challenges of this experiment was to find a good quantity of models to populate the search engine. Our approach of collecting models using a web crawler found many models that could not be indexed using EMF, and so we had to discard them. The problem is that many models created using commercial tools do not work with the metamodel files we had, which were gathered from the open source Eclipse UML2 [11] and EMF [12] projects.

If the number of models is too small, then the engine is not really being tested for effectiveness. We managed to find 146 models, which correspond to more than 80 thousand searchable elements. However, the problems ended up having a small number of relevant models (1 to 4). We would like to test it with around 2000-3000 models, mainly to have more relevant models for each problem, and to better evaluate its accuracy.

Another threat is the determination of what is a relevant model. In this experiment, we extensively browsed the repository in searching for relevant models for each problem. However, a user's interpretation about what is relevant or not may be different. Although we think we reached a satisfactory consensus between the authors, a more unbiased selection process is desired.

The number of subjects is also important. Another study with search engines [16] involved 35 subjects, against 13 in our experiment. Although there was a large number of queries (200), more subjects could lead to more solid results.

Finally, in our experiment, we did not force the users to use the advanced search features in a query. We did so because we wanted to determine the benefits of the advanced search as an additional feature, to be used together with simple search, and not separately. As a result, many queries that should have been performed using advanced search were performed only with simple search, which caused the imbalance between simple and advanced queries.

5.7 Comparison with the key features

Considering the key features of a model search engine identified in Section 2.1, Moogle's strongest focus is on metamodel-based searching. It allows more complex queries to be specified, increasing the accuracy of the search. Moogle supports the use of real world concepts, because free-text searching is employed. Thus, any word present inside the models is considered, including comments and other textual content. As shown in the evaluation, Moogle supports both the initial and later stages of the development, through a combination of simple search, advanced search, and browsing. Thus, it supports the entire life-cycle.


Different kinds of models are supported, as long as a metamodel is provided.

Inexact matching is achieved by the underlying search API, and it uses wild cards, regular expressions, and word similarity. However, there is no complex similarity matcher as with some of the related approaches described in Section 2, and thus there is room for improvement.

Moogle's ranking algorithm is implemented by the search API, and is a commonly used algorithm. The evaluation results show that it includes more relevant results first.

Models are automatically indexed, requiring no additional effort for categorization. This feature, along with the strong performance of the underlying search API, provides good support for large repositories.

The main missing feature is a proper preview of the results, which has proven to be very limited and needs improvement. Table 2 replicates the information from Table 1, now including Moogle, comparing it to the existing approaches with respect to its support for the key features of a model search engine.

6 Conclusion and Future Work

This paper presented Moogle, a model search engine that uses metamodel information to enrich its indexing and querying tasks. Moogle allows simple and complex queries to be performed, taking into account not only the textual content of the models, but also their internal structure. Moogle


Table 2 Comparison between the existing approaches (Gulla et al. [23], Gomes et al. [19], Kozlenkov et al. [28], EMF Search, Code Search, XML Search) and Moogle, with respect to their support for the key features of a model search engine: (●) Full Support, (◐) Limited Support, (○) No Support, (-) Not enough information. The key features are: (1) Metamodel-based searching, (2) Real world concepts, (3) Support for the entire life-cycle, (4) Different kinds of models, (5) Inexact matching, (6) Ranking, (7) Automatic indexing, (8) Proper preview of the results, (9) Browsing, (10) Large repositories.

may be useful in two scenarios: in a development environment, providing a way to nd and retrieve existing models for reuse; and in an educational environment, providing a means for teachers and students to nd examples of models during the learning process.

This paper presented a significant evaluation of Moogle, which showed that Moogle's advanced search and the use of metamodel information lead to more accurate results. When combined with the higher recall obtained in simple search, Moogle is suited to a wider range of users and activities. The evaluation has some threats to validity, as discussed in Section 5.6, but we believe these do not invalidate the overall findings. Nevertheless, in future work we intend to undertake further experiments to confirm the results.

The evaluation identified a number of possible enhancements for Moogle. Currently, Moogle's biggest drawback is its lack of a graphical display of the retrieved models. The textual preview, although an abstracted version of a textual model file, was shown to be insufficient for comprehensibility. There are a number of ways to provide a graphical display of Moogle's output.

One solution would be to use a graphical API, such as GraphViz [20] or Eclipse GMF (Graphical Modeling Framework) [13], to automatically create a picture for the model elements and for the diagrams. This would allow Moogle to generate pictures at runtime and to show contextualized visual information according to the search results. For example, if a UML actor is found, Moogle could generate pictures for the use case and sequence diagrams where it appears, and use these as a preview.
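As a concrete illustration of the GraphViz option, a preview generator could emit a DOT description for a retrieved model fragment and hand it to the `dot` tool for rendering. The function and element names below are ours; a real generator would also map metamodel types to shapes and icons:

```python
def model_to_dot(elements, relations):
    """Render model elements and their relations as a GraphViz DOT digraph.

    elements: list of element names.
    relations: list of (source, target, label) tuples.
    Illustrative sketch only.
    """
    lines = ["digraph preview {", "  node [shape=box];"]
    lines += [f'  "{name}";' for name in elements]
    lines += [f'  "{src}" -> "{dst}" [label="{lbl}"];'
              for src, dst, lbl in relations]
    lines.append("}")
    return "\n".join(lines)

dot = model_to_dot(["Customer", "Order"], [("Customer", "Order", "places")])
print(dot)
```

The resulting text can be piped to GraphViz at indexing time, so that each search hit is stored alongside a small rendered picture.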

The major challenge of this solution is to determine the graphical appearance of the model elements, since information about icons is not provided in XMI files. One possibility is to implement language-specific plug-ins. Each language supported by Moogle would have its own plug-in for generating pictures, and a selection algorithm would try to find the best plug-in for a particular model. For UML models, there are some existing solutions that can be used to facilitate this task, such as the UMLGraph API [42], which generates UML pictures based on textual content. For domain-specific models, GMF may be used, but since it was not designed as a runtime API, it is not very easy to use in a scenario like Moogle's, where it would run as a background service at indexing time. Layout might also be an issue, because automatic layout algorithms are notoriously insufficient. However, since search results consist of model fragments, not entire models, automatic layout may be feasible.
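The plug-in selection step described above can be sketched as follows. The class names, metamodel URIs, and rendering strings are all hypothetical; they merely show one way a registry could dispatch on a model's metamodel identifier with a generic fallback:

```python
class PreviewPlugin:
    """Base class for language-specific preview generators (hypothetical API)."""
    metamodel_uri = None  # None means: fallback, matches any model

    def render(self, element):
        raise NotImplementedError

class UmlPreviewPlugin(PreviewPlugin):
    metamodel_uri = "http://www.eclipse.org/uml2"

    def render(self, element):
        return f"[UML picture of {element}]"

class GenericPreviewPlugin(PreviewPlugin):
    def render(self, element):
        return f"[generic tree view of {element}]"

def select_plugin(plugins, model_uri):
    """Return the first plug-in registered for the model's metamodel URI,
    falling back to the generic renderer."""
    for plugin in plugins:
        if plugin.metamodel_uri and model_uri.startswith(plugin.metamodel_uri):
            return plugin
    return next(p for p in plugins if p.metamodel_uri is None)

plugins = [UmlPreviewPlugin(), GenericPreviewPlugin()]
print(select_plugin(plugins, "http://www.eclipse.org/uml2/3.0").render("Actor"))
# → [UML picture of Actor]
```

Matching on the metamodel URI keeps the core engine language-agnostic: adding support for a new modeling language only requires registering a new plug-in.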

Another issue we discovered with Moogle is the fact that advanced search is based on the names of elements in the metamodel. Some subjects had difficulties understanding the meaning of the metaelements, because modelers typically do not have an in-depth knowledge of the UML metamodel. What is needed is richer semantic information associated with the models, and work from the semantic web area, such as [18], is a good starting point for future work here. This could be used, for example, to allow users to express queries using concepts familiar to them, which are translated automatically to the corresponding metaelements.
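In its simplest form, such a translation could be a lookup table from familiar vocabulary to metaelement names. The mapping below is entirely hypothetical (the concept terms and metaelement identifiers are ours, not part of Moogle or any ontology):

```python
# Hypothetical concept-to-metaelement table.
CONCEPT_MAP = {
    "business entity": "uml.Class",
    "screen": "uml.Interface",
    "step": "uml.Action",
}

def translate_query(query):
    """Rewrite familiar concepts in a query into metaelement names."""
    for concept, metaelement in CONCEPT_MAP.items():
        query = query.replace(concept, metaelement)
    return query

print(translate_query("business entity named Customer"))
# → uml.Class named Customer
```

An ontology-guided approach such as [18] would generalize this table to synonym sets and semantic relations rather than exact string matches.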

Moogle does not currently support similarity matching search, although the Levenshtein Distance [4] metric supported by the underlying search API implements this to a limited degree. Semantic relations and structural similarity metrics, like the ones described in Section 2, could be effective in Moogle, although we have not yet found strong evidence that similarity matching is needed in practice.
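For reference, the Levenshtein Distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following dynamic-programming sketch is ours, not the search API's implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Custommer", "Customer"))  # → 1 (one duplicated letter)
```

A small distance threshold lets a query like "Custommer" still match a model element named "Customer", which is the limited inexact matching Moogle inherits from its search API.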

Finally, in advanced search, when a model element is found, Moogle currently only shows the element isolated from the rest of the model. It would also be beneficial to show its context, so that the user can better understand the element's place in the overall model.
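One lightweight way to provide such context is to display the element's containment path, from the model root down to the hit. The dictionary-based sketch below is illustrative (Moogle would instead read containment references from the model itself):

```python
def containment_path(element, parent_of):
    """Walk the containment hierarchy upward so that a search hit can be
    shown in context, e.g. Model > Package > Class > Attribute."""
    path = [element]
    while element in parent_of:
        element = parent_of[element]
        path.append(element)
    return " > ".join(reversed(path))

parents = {"name": "Customer", "Customer": "sales", "sales": "ShopModel"}
print(containment_path("name", parents))
# → ShopModel > sales > Customer > name
```

Even without a graphical preview, a breadcrumb line like this tells the user where the retrieved fragment sits in the overall model.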


We are planning to expand Moogle by plugging it into some common model repositories, such as the Atlantic Zoos (http://www.emn.fr/z-info/atlanmod/index.php/Zoos). We are also planning to make Moogle available to the community, so that people will be able to upload their models and use it as a public model search tool. As well as its application in a software engineering environment, we believe that Moogle has useful applications in an educational context, and we plan to investigate its use as a tool to assist novices in learning modeling. It has been shown [47] that a common learning technique of novice programmers is to look at exemplars, that is, good (and bad) examples of programs that solve a similar problem to the one under study. Novices can then adapt these exemplar solutions to serve their own particular needs. We suggest that the same may be true of novice modelers. In such a case, a tool such as Cynthia [46] could be designed to allow students to retrieve and adapt sample models. Moogle could form a key part of such a tool.

References

1. IEEE Std 1003.1-2008: Standard for Information Technology - Portable Operating System Interface (POSIX) - Base Specifications, Issue 7.

2. Apache. Lucene. http://lucene.apache.org/.

3. Apache. SOLR. http://lucene.apache.org/solr/.

4. Ricardo Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-query trees. In Combinatorial Pattern Matching: Proceedings of the 5th Annual Symposium, CPM 94, Asilomar, CA, USA, June 5-8, volume 807 of Lecture Notes in Computer Science, pages 198–212. Springer, Berlin/Heidelberg, 1994.

5. Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. ADAMS Re-Trace: A traceability recovery tool. In European Conference on Software Maintenance and Reengineering, pages 32–41, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

6. Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4):13, 2007.

7. Andrea De Lucia, Rocco Oliveto, and Genoveffa Tortora. ADAMS Re-Trace: traceability link recovery via latent semantic indexing. In ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 839–842, New York, NY, USA, 2008. ACM.

8. Andrea De Lucia, Rocco Oliveto, and Genoveffa Tortora. Assessing IR-based traceability recovery tools through controlled experiments. Empirical Software Engineering, 14(1):57–92, 2009.

9. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

10. Arie van Deursen, Paul Klint, and Joost Visser. Domain-specific languages: An annotated bibliography. SIGPLAN Notices, 35(6):26–36, 2000.

11. Eclipse. Eclipse UML2 project. http://www.eclipse.org/modeling/mdt/?project=uml2.

12. Eclipse. EMF - Eclipse Modeling Framework. http://www.eclipse.org/modeling/emf/.

13. Eclipse. GMF - Graphical Modeling Framework. http://www.eclipse.org/modeling/gmf/.

14. Eclipse. Eclipse modeling project. http://www.eclipse.org/modeling/, 2009.

15. William B. Frakes and C. J. Fox. Sixteen questions about software reuse. Communications of the ACM, 38(6):75–87, 1995.

16. William B. Frakes and Thomas P. Pole. An empirical study of representation methods for reusable software components. IEEE Transactions on Software Engineering, 20(8), 1994.

17. Vinicius Cardoso Garcia, Daniel Lucrédio, F. A. Durão, E. C. R. Santos, Eduardo Santana de Almeida, Renata Pontin de Mattos Fortes, and Silvio Romero de Lemos Meira. From specification to experimentation: A software component search engine architecture. In 9th International Symposium on Component-Based Software Engineering (CBSE), Sweden, 2006. Lecture Notes in Computer Science, Springer-Verlag.

18. Mauro Gaspari and Davide Guidi. Towards an ontology-guided search engine. Technical report, Department of Computer Science, University of Bologna, 2003.

19. Paulo Gomes, Francisco C. Pereira, Paulo Paiva, Nuno Seco, Paulo Carreiro, José L. Ferreira, and Carlos Bento. Using WordNet for case-based retrieval of UML models. AI Communications, 17(1):12–23, 2004.

20. Graphviz. Graph visualization software. http://www.graphviz.org.

21. Jack Greenfield, Keith Short, Steve Cook, and Stuart Kent. Software Factories: Assembling Applications with Patterns, Models, Frameworks and Tools. Wiley, 2004.

22. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, Dordrecht, Netherlands, 2nd edition, 2004.

23. Jon Atle Gulla, Bram van der Vos, and Ulrich Thiel. An abductive, linguistic approach to model retrieval. Data and Knowledge Engineering, 23(1):17–31, 1997.

24. L. Hellan and M. Dimitrova-Vulchanova. Preliminary notes on a framework for 'Lexically Dependent Grammar'. In Lecture Series at International Summer Institute in Syntax. Central Institute of English and Foreign Languages, Hyderabad, India, July 1994.

25. Scott Henninger. An evolutionary approach to constructing effective software reuse repositories. ACM Transactions on Software Engineering and Methodology, 6(2):111–140, 1997.

26. Emily Hill, Lori Pollock, and K. Vijay-Shanker. Automatically capturing source code context of NL-queries for software maintenance and reuse. In ICSE '09: Proceedings of the 2009 IEEE 31st International Conference on Software Engineering, pages 232–242, Washington, DC, USA, 2009. IEEE Computer Society.

27. Tomás Isakowitz and Robert J. Kauffman. Supporting search for reusable software objects. IEEE Transactions on Software Engineering, 22(6), 1996.

28. A. Kozlenkov, V. Fasoulas, F. Sanchez, G. Spanoudakis, and A. Zisman. A framework for architecture-driven service discovery. In SOSE '06: Proceedings of the 2006 International Workshop on Service-Oriented Software Engineering, pages 67–73, Shanghai, China, 2006.

29. Daniel Lucrédio, Renata Pontin de Mattos Fortes, and Jon Whittle. MOOGLE: A model search engine. In Krzysztof Czarnecki, Ileana Ober, Jean-Michel Bruel, Axel Uhl, and Markus Völter, editors, MoDELS, volume 5301 of Lecture Notes in Computer Science, pages 296–310. Springer, 2008.

30. Daniel Lucrédio, Eduardo Santana de Almeida, and Antonio Francisco do Prado. A survey on software components search and retrieval. In Ralf Steinmetz and Andreas Mauthe, editors, 30th IEEE EUROMICRO Conference, Component-Based Software Engineering Track, pages 152–159, Rennes, France, 2004. IEEE/CS Press.

31. Andrian Marcus, Andrey Sergeyev, Vaclav Rajlich, and Jonathan I. Maletic. An information retrieval approach to concept location in source code. In WCRE '04: Proceedings of the 11th Working Conference on Reverse Engineering, pages 214–223, Washington, DC, USA, 2004. IEEE Computer Society.

32. J. C. C. P. Mascena, Silvio Romero de Lemos Meira, Eduardo Santana de Almeida, and Vinicius Cardoso Garcia. Towards an effective integrated reuse environment. In 5th ACM International Conference on Generative Programming and Component Engineering (GPCE), short paper, Portland, Oregon, USA, 2006.

33. Dan Matheson, Robert France, James Bieman, Roger Alexander, James DeWitt, and Nathan McEachen. Managed evolution of a model-driven development approach to software-based solutions. In OOPSLA and GPCE Workshop, Vancouver, British Columbia, Canada, 2004.

34. OMG. XML Metadata Interchange (XMI) specification. Technical report, Object Management Group, 2006.

35. Denys Poshyvanyk, Andrian Marcus, and Yubo Dong. JIRiSS - an Eclipse plug-in for source code exploration. In ICPC '06: Proceedings of the 14th IEEE International Conference on Program Comprehension, pages 252–255, Washington, DC, USA, 2006. IEEE Computer Society.

36. Denys Poshyvanyk, Andrian Marcus, Yubo Dong, and Andrey Sergeyev. IRiSS: a source code exploration tool. In Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM 2005), Industrial and Tool volume, 25-30 September 2005, Budapest, Hungary, pages 69–72, 2005.

37. Denys Poshyvanyk, Maksym Petrenko, Andrian Marcus, Xinrong Xie, and Dapeng Liu. Source code exploration with Google. In IEEE International Conference on Software Maintenance, pages 334–338, Los Alamitos, CA, USA, 2006. IEEE Computer Society.

38. Rubén Prieto-Díaz. Implementing faceted classification for software reuse. Communications of the ACM, 34(5), 1991.

39. Václav Rajlich and Norman Wilde. The role of concepts in program comprehension. In IWPC '02: Proceedings of the 10th International Workshop on Program Comprehension, page 271, Washington, DC, USA, 2002. IEEE Computer Society.

40. J. Robin and F. Ramalho. Can ontologies improve web search engine effectiveness before the advent of the semantic web? In XVIII Brazilian Symposium on Databases, pages 157–169, Manaus, Amazonas, Brazil, 2003.

41. S. M. Shafi and Rafiq A. Rather. Precision and recall of five search engines for retrieval of scholarly information in the field of biotechnology. Webology, 2(2), 2005.

42. UMLGraph. Automated drawing of UML diagrams. http://www.umlgraph.org.

43. Vanderbilt University. GME - Generic Modeling Environment. http://www.isis.vanderbilt.edu/projects/gme/.

44. T. A. Vanderlei, Vinicius Cardoso Garcia, Eduardo Santana de Almeida, and Silvio Romero de Lemos Meira. Folksonomy in a software component search engine: cooperative classification through shared metadata. In XX Brazilian Symposium on Software Engineering, Tool Session, Florianópolis, Brazil, 2006.

45. W3C. XQuery 1.0: An XML Query Language. W3C Recommendation, 23 January 2007. Technical report, World Wide Web Consortium, 2007.

46. Jon Whittle, Alan Bundy, and Richard J. Boulton. Proofs-as-programs as a framework for the design of an analogy-based ML editor. Formal Aspects of Computing, 13(3-5):403–421, 2002.

47. Jon Whittle and Andrew Cumming. Evaluating environments for functional programming. International Journal of Human-Computer Studies, 52(5):847–878, 2000.

48. Yunwen Ye and Gerhard Fischer. Supporting reuse by delivering task-relevant and personalized information. In ICSE 2002: 24th International Conference on Software Engineering, pages 513–523, Orlando, Florida, USA, 2002.