Impact of the Research Community on the Field of Software Configuration Management
An Impact Project Report
Jacky Estublier (Editor)
Grenoble University, 220 rue de la Chimie, BP53, 38041 Grenoble, France
[email protected]

David Leblang (Co-Editor)
24 Oxbow Road, Wayland, MA 01778, USA
[email protected]

Geoffrey Clemm
Rational Software, 20 Maguire Road, Lexington, MA 02421, USA
[email protected]

Reidar Conradi
IDI-Gløshaugen, NTNU, NO-7491 Trondheim, Norway
[email protected]

André van der Hoek
Dept. of Info. & Comp. Science, 444 Computer Science Bldg., Irvine, CA 92697-3425, USA
[email protected]

Walter Tichy
Informatics, U. Karlsruhe, 76128 Karlsruhe, Germany
[email protected]

Darcy Wiborg-Weber
Telelogic, 9401 Geronimo Road, Irvine, CA 92618, USA
[email protected]
ABSTRACT
Software Configuration Management (SCM) is an important discipline in professional software development and maintenance. The importance of SCM has increased as programs have become larger and more mission/life-critical. This paper discusses the evolution of SCM technology from the early days of software development to the present, and the impact that university and industrial research has had along the way. It also includes a survey of the industrial state of the practice and of research directions.

Keywords
Software configuration management, software engineering, research impact.
PREFACE: WHAT IS IMPACT?
During the preparation of this report, there was a lively debate among the authors about what defines impact and which research to include. The impact itself is easy to discern: a successful, multimillion-dollar industry in software configuration management tools has arisen. SCM tools have become so pervasive that virtually all major projects use them. SCM provides tangible and recognized benefits for software professionals and managers; software development standards even prescribe their use. There are now over a dozen textbooks dealing exclusively with SCM, and most textbooks on software engineering include a chapter on the topic. Software engineering conferences and workshops regularly seek contributions in this area. Finally, there is a lively international community of SCM researchers and an international SCM workshop series.

Identifying the research that caused this impact was more difficult. We (the authors) came quickly to the conclusion that citing only academic research would be far too narrow, because corporate research had contributed a great deal of results and ideas. In fact, the software configuration management community owes much of its liveliness to a healthy and competitive mix of researchers and developers from both academia and industry. Our first ground rule was therefore to take an inclusive view of research. We discovered quickly that it was futile to determine who contributed "more", academia or industry. Our opinions about what was more important diverged widely, depending on our personal points of view. We will therefore leave the evaluation of the relative merits of research contributions to our readers and to historians; our goal is to provide an honest and accurate picture of the major research ideas in SCM and show where they had an impact. Given the long lead time for research results to show up in practice, some as yet unused results may have their impact still ahead of them. Thus, we decided to discuss current research even if it has not yet had any discernible impact on practice.

So far, these three ground rules, though important, do not help identify relevant research. After some debate we decided to concentrate on publications in the open literature. This criterion is not perfect, since it leaves out unpublished work, in particular results that go directly into products. Fortunately for the community, however, research results were vigorously published even by industry, notably Bell Labs, Apollo, Atria, and others. We are fairly certain that the major research ideas and results in the domain of SCM have actually been published. Of course, there is a chance of some unpublished invention being heavily used in
some real product. However, this is unlikely, because competition among SCM vendors forces the major players to offer comparable functionality and feature sets. Comparing the published material with what is known about commercial SCM tools, it does appear that the major research results incorporated in products have actually been published in the open literature. Conversely, consider what happens to research results that are not published. The probability is high that these results will be forgotten once the people who were engaged in the work move on. The impact of such work is severely limited; after a while, it appears as if it never happened. The work cannot be extended by others without actually redoing it. We concluded that basing this report on research published in the open literature (including patents) would be adequate.

The emphasis on publications leaves out other activities that have had impact. A major impact is caused by people, in particular university graduates moving to industry. Graduates with PhD topics in SCM directly benefit the SCM industry, but even at the B.S. and M.S. levels, many computer science students have been exposed to SCM in software engineering courses and projects, not to mention computer science in general, be it algorithms, the latest software tools, the internet, etc. Of course, recent graduates lack experience, but it is generally acknowledged that young people bring with them fresh, new ideas. Industry also impacts academia via people. Career moves from industry to academia are relatively rare, but many of the academics in the SCM field have enjoyed long and fruitful relationships with the SCM industry, be it through consulting, sabbaticals, joint projects, or spearheading start-up companies. Industrial contacts are a source of new problems to solve and act as a corrective on what is important and what is not.

Last but not least, workshops and conferences have had a significant impact on the SCM community. Given the competitive nature of the software business, it has fallen to the academics to organize vendor-independent meetings in which the latest results in SCM are presented and new, unpublished ideas discussed. It is unlikely that even the authors of this report would all have met without the SCM workshops that have brought together researchers and developers in the SCM field for over a decade.

In sum, we hope that this report is an interesting compendium of the "software configuration management story", a success story that is still unfolding. We are proud to be part of this story and hope that many more chapters will be added to it in the future.

Plan. Chapter 2 briefly describes the state of the practice, while chapters 3 to 9 describe, for each SCM area, the research that has been done and its impact on actual commercial tools. Chapter 10 summarizes the impact of research work along a time line. Chapters 11 and 12 conclude the report.
2. STATE OF PRACTICE AND EVOLUTION
Configuration management (CM) is the discipline of controlling changes in large and complex systems. Its goal is to prevent the chaos caused by the numerous corrections, extensions, and adaptations that are applied to any large system over its lifetime. The objective of CM is to ensure a systematic and traceable development process, so that a system is in a well-defined state, with accurate specifications and verified quality attributes, at all times.
CM was first developed in the aerospace industry in the 1950s, when production of spacecraft experienced difficulties caused by inadequately documented engineering changes. Software Configuration Management (SCM) is CM tailored to systems, or portions of systems, that consist predominantly of software [Whi91].
SCM has evolved significantly since its introduction, but a number of characteristics have remained constant. SCM is a technology in charge of most of the issues raised by large software product development and maintenance that are not addressed by programming languages. However, SCM does not address issues related to the early life-cycle phases (requirements, design, analysis, …), nor to the last ones (deployment, dynamic evolution and reconfiguration).
At first, different coloured punch cards were used to indicate changes; fortunately, software became an on-line entity and hence could easily be placed under automatic, programmed control. Currently, SCM systems are sophisticated tools which emphasize process support, concurrent engineering control, distributed and remote workspace control, and other high-level functions.
SCM system functionality has evolved to address the issues successively raised in this area, as shown in the following figure.
[Figure: SCM Evolution. Origin: home-grown, ad-hoc systems on dedicated mainframes, used for critical software, for versioning and rebuilding. Then: off-the-shelf tools for large software on Unix, adding workspace control for maintenance teams. Now: any software, with product and process modeling, everybody involved in the software, on everybody's machine.]
From File control…
Originally, SCM's duty was to manage the many files involved in a software product, their versions, and the building of the system out of these files. Fundamentally, this has not changed. Currently, there is a tendency to hide the concept of file, found to be too low-level, in favor of concepts closer to actual programming practices, as found in programming environments (project, folder, package, …). Nevertheless, the file system did not disappear, and software engineering tools still rely on this file system view, which remains, for the foreseeable future, the underlying structure. This is what fundamentally distinguishes SCM from other configuration management tools, like PDM. The consequence is that the main structure of a system, as perceived by SCM tools, is the file system structure, and a basic SCM function is the ability to (re)generate a file system view of a software system so that desktop tools and development environments can work.
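To make this concrete, here is a minimal sketch (in Python, with hypothetical names; this is not any particular tool's API) of what "regenerating a file system view" amounts to: a configuration, seen as a map from relative paths to file-version contents, is materialized as an ordinary directory tree that compilers and editors can work on.

```python
import os

def materialize(configuration, workspace_root):
    """Write a configuration (a mapping of relative file paths to the
    contents of the selected versions) out as an ordinary directory
    tree, so that file-based desktop tools can operate on it."""
    for rel_path, content in configuration.items():
        abs_path = os.path.join(workspace_root, rel_path)
        os.makedirs(os.path.dirname(abs_path), exist_ok=True)
        with open(abs_path, "w") as f:
            f.write(content)

# A configuration is a consistent set of file versions:
config = {
    "src/main.c": "int main(void) { return 0; }\n",
    "src/util.h": '#define VERSION "1.2"\n',
}
materialize(config, "/tmp/workspace")
```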
Version control
To avoid the confusion that results when the same identifier is assigned to two versions of the same item, a new and unique identifier is issued whenever an item is changed. But the proliferation of items
requires managing them, recording their relationships and their common properties. This is what version control is about. Originally, each file was individually managed as a tree of versions (revisions and branches), following the SCCS [Roc75] and RCS [Tic81] systems. Surprisingly, these systems are still at the core of almost all commercial SCM systems. SCM systems then managed configurations, a configuration being a "consistent" collection of file versions. Only recently, with the advent of task-based approaches, did versioning evolve again to integrate change-set features, but on an RCS-based internal representation. Note that the configuration concept is general enough to cover any consistent collection of files, not only source code. Indeed, many customers use document configurations.
Product modeling
Of course, an SCM system also manages the software under its own view: versions and configurations are to be maintained and, because of the rebuilding function, at least the compile-time dependencies must be maintained. Thus, for a long time, most SCM systems maintained only the version trees, the configuration (a list of version identifiers), and a makefile for the compile-time dependencies. Recently, for their internal representation of a software product, high-end SCM systems have added features that attach semantics to items. They allow types, attributes, and relationships to be associated with items in accordance with a data model. Consequently, an SCM system needs to translate (fast) between the file system view and the internal representation of the same software product. This is a limiting factor, which explains why even high-end commercial tools use simple data models close to the file system view.
System model and rebuild
As software systems have grown to include many thousands of components, keeping track of the pieces and how they fit together has become a difficult but critically important task. SCM systems accomplish this task with system modeling facilities. System modeling was originally used to help rebuild a system from source files (Make [Fel79]), and even now, in most SCM systems, the makefile is the only "system model". A rich system model also provides a high-level view of the product organisation (its architecture), which can be used to better control product evolution, to compute "consistent" configurations, and so on. Unfortunately, any information that has to be provided and maintained by hand represents a (high) cost which may rapidly outweigh its expected benefit. For this reason, in the current state of practice, system models are not widely adopted by practitioners. High-level system models will be used only if they can be extracted automatically from source code, or if they are already available for other reasons.
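As an illustration of the core idea behind Make-style rebuilding (a simplified sketch, not Make's actual implementation), the system model is a dependency graph, and a target is regenerated only when it is missing or older than one of its dependencies:

```python
import os

def needs_rebuild(target, dependencies):
    """Make's timestamp rule: rebuild if the target is missing or is
    older than any file it depends on."""
    if not os.path.exists(target):
        return True
    return any(os.path.getmtime(d) > os.path.getmtime(target)
               for d in dependencies)

def rebuild(target, graph, commands):
    """Depth-first rebuild over a system model given as a dependency
    graph {target: [dependencies]}; 'commands' maps each derived
    target to the action (e.g. a compiler call) that regenerates it."""
    for dep in graph.get(target, []):
        rebuild(dep, graph, commands)   # bring dependencies up to date first
    if target in commands and needs_rebuild(target, graph.get(target, [])):
        commands[target]()

# Hypothetical system model: a program built from two object files.
graph = {"prog": ["main.o", "util.o"],
         "main.o": ["main.c"], "util.o": ["util.c"]}
```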
Rebuild and Selection
With so many components and versions, a configuration selection mechanism is needed to select which versions to include in a new configuration. The general problem would require considering, for every possible combination, whether it is "consistent". This is both extremely expensive and requires some semantic knowledge of all artefacts. Of course, SCM tools avoid addressing this issue in its generality. Instead, a new configuration is built starting from a reference consistent configuration called a baseline. The baseline is "checked out" into the file system (it becomes a workspace), on which a few changes are performed. Once tested, the modified baseline is "checked in" as the new configuration. This simple schema is the basis of almost all SCM systems, with extensions for concurrent engineering support (see below).
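A minimal sketch of this baseline/check-out/check-in cycle (illustrative Python, not any vendor's interface):

```python
class Repository:
    """Sketch of the baseline scheme: new configurations are always
    derived from a known-consistent baseline, never assembled from
    arbitrary version picks."""

    def __init__(self, baseline):
        self.baselines = {"initial": dict(baseline)}
        self.current = "initial"

    def check_out(self):
        """Copy the reference baseline into a workspace (file view)."""
        return dict(self.baselines[self.current])

    def check_in(self, workspace, name):
        """After testing, the modified workspace becomes the new baseline."""
        self.baselines[name] = dict(workspace)
        self.current = name

repo = Repository({"main.c": "v1", "util.c": "v1"})
ws = repo.check_out()           # baseline copied into a workspace
ws["main.c"] = "v2"             # a few changes are made and tested
repo.check_in(ws, "release-2")  # checked in as the new configuration
```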
Workspace and concurrent engineering
Workspaces are "naturally" independent directory hierarchies; they can therefore provide areas where users can perform their job isolated (or insulated) from their colleagues' activities. This is a nice property. Unfortunately, concurrent engineering often requires combining (merging) the work performed, simultaneously or not, in different workspaces. Thus high-end systems propose features, such as rules, providing flexibility in combining the work performed in different workspaces. Since the late 80s, this has been one of the important research topics.
Change control
From the beginning, auditing and tracking have been considered basic functions of an SCM system. They require that any change performed on a large software system be related to a formal change request. Change control is the set of procedures governing
how a change is decided, implemented, tested and released. Most current SCM systems support change control based on a set of states assigned to software items or change-request items; this is "product-based" change control. More recently, in high-end systems, workspaces came to be considered units of work, with sub-workspaces corresponding to sub-activities. Thus a workspace becomes the support for the task or activity concept. In these systems, change control can therefore be expressed in terms of tasks assigned to users in specific workspaces; this is activity-based process control. Process support is a generalisation of change control applied to other activities. The fact that activities are currently assimilated to workspaces explains why process support in SCM is limited to activities performed in a workspace.
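Product-based change control, as described here, amounts to a small state machine over change-request items. A hypothetical sketch (the state names are illustrative; real tools let administrators define their own):

```python
# Illustrative change-request life cycle; only declared transitions are legal.
TRANSITIONS = {
    "submitted":   {"approved", "rejected"},
    "approved":    {"implemented"},
    "implemented": {"tested"},
    "tested":      {"released"},
}

class ChangeRequest:
    def __init__(self, title):
        self.title = title
        self.state = "submitted"

    def advance(self, new_state):
        """Enforce the change-control procedure: a change request may
        only move along the declared transitions."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

cr = ChangeRequest("fix buffer overflow")
cr.advance("approved")
cr.advance("implemented")
```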
High-end versioning and task support
Because tasks and workspaces are strongly related, version selection and tasks become related, and consequently versions and tasks are related. This has two major consequences. First, the grain of a version is the complete workspace, and thus naturally a change set. Second, a selection is based on a known release, and a version is a change with respect to that release. This explains why the basic change-set approach is gaining acceptance, and why advanced change-set approaches do not fit practice and are therefore not used. The workspace (or task, or activity) can be seen as unifying and subsuming most basic SCM concepts: file management, version and selection control, building, and change control. This is why it recently became key to providing simpler interfaces. However, this understanding took time to mature, and much research was needed to reach that level of abstraction.
Distributed and remote development
Distributed SCM coordinates the work of geographically distributed teams. Distribution requires that the main SCM functions introduced above all be network enabled. In local area networks, client-server implementations of these functions will do, but in wide-area networks the bandwidth between sites may not be sufficient. In this case, each site needs to replicate the relevant parts of the database, and a periodic update process must synchronize them [All95]. Updates that travel over public networks should be encrypted to foil industrial espionage. Distributed development places a strong emphasis on process support. If team members rarely meet, telephone calls are difficult because of time-zone differences, and email round-trips take a day, then informal arrangements about who works on what for how long break down. In this situation, much more emphasis must be placed on automated support for scheduling, tracking work, and preventing information loss. Web technology and the generalization of the virtual enterprise increase the need for good support for remote development, which is an important area of work for vendors. Web site management and web-enabled cooperative work have become a new area where SCM techniques are clearly needed. Curiously, the actual leaders in the "content management" market are not SCM vendors (BroadVision, Vignette, ATG, Allaire, InterWorld, Interwoven, Blue Martini, Open Market, BSCW, …), but they use the basic SCM techniques. Web technology has also addressed issues related to cooperative work on common files, producing an extension of the HTTP
protocol (WebDAV) which includes multi-author facilities, and DeltaV, which adds most basic SCM features: not only basic versioning, but also workspace management. Not surprisingly, some SCM vendors had a definite influence on the definition of this standard.
Evolution in response to the market
Although current tools build on past inventions, changes in the hardware, software, and business environment have made the evolutionary path of SCM far from a straight line. Throughout the 1980s, debates raged on the best type of delta storage and retrieval mechanisms. New tools were built extending the groundwork laid by earlier tools, and rapid progress was made. However, in the 1990s non-textual objects became more common and totally new algorithms were required. By 2000, disk storage had become so inexpensive, CPUs so fast, and non-text objects so common that deltas became unimportant; many new tools use simple (zip-like) compression. Feature selection versus branching as a way to build customized configurations of a product was a popular subject in the 1980s and 1990s. However, in recent years the business environment has made building custom software releases much less popular. Even the creation of "patches", a mainstay of the past, has largely been replaced by annual "service updates". Development environments like Visual Studio have taken over Make responsibilities and some system modeling features, so new tools rarely include proprietary build features. These changes do not mean that SCM is dead, but rather that it needs to solve new problems more than it needs to improve on solutions to old problems. Some of those new problems include dealing with teams collaborating over the internet, allowing software developers to take their work home at night or on an airplane by docking and undocking, dealing with web content, integrating with other development tools, and providing a GUI that is easy to use but still provides the needed power. The success of SCM is also due to the fact that SCM systems have been carefully designed to be independent of programming languages and application semantics, and to make the best of simple algorithms (like the version tree) and simple heuristics (like line merging). This makes SCM systems general and efficient tools, while avoiding the intrinsic complexity of syntactic and semantic issues. Instead, SCM systems seek to integrate the tools which need syntactic or semantic knowledge (like syntactic editors).
Summary
SCM is one of the most successful branches of software engineering. There is a lively international research community working on SCM, and a billion-dollar commercial industry has developed. Nearly all corporate and government software projects use some SCM tool. SCM is now unanimously considered an essential tool for the success of any software development project; its use is required to reach the second CMM maturity level. Practitioners consider SCM tools helpful, mature, and stable. Consequently, the SCM market is growing quite fast (see Table 1). The success of SCM can also be measured by the fact that its basic techniques have become pervasive and are found in many tools. All 4GL tools, most programming environment tools, and even simple editors, like Word, now include a basic version manager. New domains like "web content management" also apply SCM techniques, as does the new web protocol WebDAV/DeltaV.
Table 1: Worldwide SCM tools ($M) [IDC]

The following chapters detail the different areas identified above, from both a research and a practice perspective.
3. VERSIONING
Classic versioning
The classic technique to avoid confusion consists in issuing a new identifier each time an item changes. But issuing a new identifier for every change may obscure relations between items. One may want to record, for instance, that a given configuration item is a revision of another, correcting certain errors. The version control function of SCM records this information. Classic SCM systems such as SCCS [Roc75] from Bell Labs and RCS [Tic85] from Purdue University, and all major commercial CM systems, collect related configuration items into a set called a version group (or version graph, or component, or element) and manage the evolution of these sets. The items in a version group are organized into a directed acyclic graph (DAG) whose arcs are "successor" relations. There are typically three types of successor relations: the revision-of relation records historical development lines; the variant-of relation connects items that differ in some aspect of function, design, or implementation, but are interchangeable in other respects; and the merger relation indicates that changes made on several variants have been combined. Figure 1 illustrates a version group with two diverging lines of development: one is a corrective branch and the other is a main development branch. The corrective branch starts as an offshoot of an old development version; later, its changes are merged back into the original development line. The variant branch continues in case there are more corrections in the future.
[Figure 1: A version group. A main development line (versions 1, 2, 3, 4, 5) and a variant branch (3.1, 3.2, 3.2.1, 3.3). Legend: Revision, Branch or Variant, Successor.]
For practical reasons, version control also incorporates the identification function, by simply incrementing a version number for every new item added to the group. Team members update version groups by following the check-in/check-out protocol. Before commencing work on a configuration item, a developer performs a checkout. The checkout operation copies a selected version from its version group into the developer's workspace, where the developer can carry out modifications undisturbed by the activities of other workers (see "workspaces" below). The checkout may also place a reservation in the version group, giving the developer exclusive rights to later check in a new version as a revision-of the one that was checked out (extending the line of descent). Other users wishing to modify the same original version can only create a variant-of this version (a variant branch) and later merge their changes back into the original line of descent. Thus, the reservation protects concurrent developers from inadvertently overwriting each other's changes. In NSE [Sun89] and several modern SCM systems, reservations are optional. Without reservations, the first person to check in extends the development line, and all others must merge their changes in.

Version control also maintains a logbook recording the reasons for changes. The log entries typically record the time of the change and the identity of the developer, plus a short commentary describing the nature of the change. The log is convenient for surveying the changes a system has undergone over time.

Version control also compresses the space occupied by a version group, using standard data compression techniques (like zip) or delta storage. Delta storage saves only the differences between versions rather than complete copies. With modern delta generators, space consumption for source code objects is reduced to less than 10 percent of full storage [Hunt98]. Whenever a version is needed, a decoder program first reconstructs the data from the differences. SCCS and RCS [Roc75, Tic85] are early version control systems implemented with delta mechanisms, and variants of their delta techniques are still used in modern commercial systems. Binary objects, such as executable images, need a different delta algorithm than line-oriented text objects. A binary delta mechanism developed at Linz University [Rei91] is now incorporated into commercial products (ClearCase [Leb94]).
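A minimal sketch of line-oriented delta storage in the spirit of the SCCS/RCS mechanisms (much simplified; real systems store deltas in dedicated archive formats): only the differences between successive versions are kept, and a decoder replays them to reconstruct a version.

```python
import difflib

def make_delta(old_lines, new_lines):
    """Store a version as the edit operations that turn its predecessor
    into it, instead of as a full copy (a simplified forward delta)."""
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def apply_delta(old_lines, delta):
    """The decoder: reconstruct a version by replaying its delta
    against the base version."""
    result, pos = [], 0
    for i1, i2, new in delta:
        result.extend(old_lines[pos:i1])  # copy the unchanged region
        result.extend(new)                # apply the inserted/replacing lines
        pos = i2
    result.extend(old_lines[pos:])
    return result

v1 = ["int main() {", "    return 0;", "}"]
v2 = ["int main() {", '    puts("hi");', "    return 0;", "}"]
delta = make_delta(v1, v2)
assert apply_delta(v1, delta) == v2       # the decoder restores v2 exactly
```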
Advanced versioning
Classic versioning traditionally deals with configuration items that are files, and with relationships limited to those in the version graph (revision-of, variant-of, merger). The only attribute of importance is the baseline name (e.g., BetaRelease). This section deals with other kinds of components (both more general and more specialized), more advanced version models, and user-defined relationship types and attribute types. First we introduce the concepts of product space and version space.
Product space
A product space of an application (software project) is a generalization of the notion of configuration items. It encompasses all of the entities that make up the application, including the configuration items needed to build the application; the design documents describing the application; the hyperlinks (relationship entities), such as those that link design documents to the source code realizing the design; attributes, such as baseline names and change completion states; and change requests. The metadata database in the software repository may hold declarable component types (e.g. source, derived, composite) and subtypes (e.g. c++source), hyperlink types (e.g. PartOf), and attribute types. These type definitions may include semantics for each type. In the metadata database, or embedded in the SCM tool, there may also be product rules (typically constraints). For instance, some components may not be combined with other components, regardless of which versions are selected. The overall description of the software product space depends on the actual data schema (component and relation types) and its instantiation. Hyperlinks may describe relationships both within a lifecycle phase and between lifecycle phases. Hyperlinks within a phase model relations like PartOf or DependsOn from source code objects to other source code objects (see the section on system modeling). Hyperlinks between phases are typically used to trace a change from a requirements document to a design document to source code modules to tests.
Version Space
A version space is the set of all possible versions of components. It generalizes the way versions are identified. Each version (configuration item) still must have a unique identifier, but it may also be identified by attribute values, e.g. (OS = Unix, Language = French, State = Released). One predefined attribute, such as "VersionID", is typically an RCS-like version identifier. The "extra" version attributes may be local to a component or a set of components, or they may be global; in any case, they provide an alternative mechanism for selecting versions. The SCM system may contain meta-information that includes rules to control which combinations of attributes are likely to lead to a "good" version. Such version rules may express constraints, defaults, or preferences. Indeed, an advanced SCM repository could serve as a deductive database: it would use system-stored rules to augment ambiguous, or partial, user-level specifications to make up a complete and legal version binding. A version model, of which there are two main kinds, describes the version space. The component-based versioning model and the change-set versioning model are discussed below.
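A hypothetical sketch of attribute-based selection over such a version space, with a stored default rule completing a partial user query (data and names are illustrative):

```python
# Versions identified by attribute values (illustrative data).
versions = [
    {"VersionID": "1.3", "OS": "Unix",    "Language": "French",  "State": "Released"},
    {"VersionID": "1.4", "OS": "Unix",    "Language": "English", "State": "Tested"},
    {"VersionID": "2.0", "OS": "Windows", "Language": "English", "State": "Released"},
]

def select(versions, query, defaults):
    """Return the versions matching the user query, completed by
    stored default rules for any attribute the query leaves open."""
    full_query = {**defaults, **query}    # explicit user choices win
    return [v for v in versions
            if all(v.get(a) == val for a, val in full_query.items())]

# A partial query (OS only) plus a default rule (prefer Released):
print(select(versions, {"OS": "Unix"}, {"State": "Released"}))
# -> the single version 1.3
```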
Component-based Versioning
Component-based versioning is the kind most commonly used in classic and commercial SCM systems. As described above, there is a version graph for each component, where each node in the graph is explicitly created by the developer and checked into the SCM repository (as a revision-of, or variant-of, an earlier version). The version graph defines the exact and total set of versions of a component. The version graph is local to each component, even if other components have topologically similar version graphs.
A change, in this model, is defined as a new node in a component's version graph. A workspace (see below) is constructed by choosing one version from each component's graph. Hyperlink relationships are usually not versioned in this model (they have no version graph), and the Variant-Of, Revision-Of, and Merger relations have special semantics and are treated differently from other relationship types. Component-based versioning is simple to understand, but it does result in a proliferation of variants, necessitating explicit merging. Adele [Est85] distinguishes between permanent and temporary variants, where the latter are intended for cooperative work.
Change-set Versioning
Change-set versioning came from industrial practices in the 70s (the IBM and CDC update systems), extended by [Hart81]. The idea relies basically on the conditional compilation technique¹: at check-in, all "bundled" changes are tagged with the same label, which later allows files to be (re)built given an arbitrary list of labels. Thus it becomes possible to build files that never existed before, by combining changes performed independently, perhaps concurrently. This technique was introduced to the market under the name change set by [ADC86]. In change-set versioning, a change is the sum of related modifications made to several components. The motivation is that a local component change (delta) is often part of a global update (a transaction) involving many other components. A cluster of related updates should therefore be grouped together and described by one identifier. This cluster of updates is usually called a change-set [ADC89]. Each new change requires at least one previous change to establish a base for editing (except for initial creations). The "requires" relation forms a connected acyclic graph where each change-set is a node. Effectively, the change-sets can be thought of as versions in the version space, and the "requires" relation becomes the standard version graph "predecessor" relation (predecessor being the inverse of revision-of and variant-of).
[Figure 2: A "requires" graph of change-sets; change 1 modifies components A, B, C; change 2 modifies D; change 3 modifies A, B; change 4 modifies C; change 5 modifies A, D; change 6 combines changes 3 and 4. Caption: Features from change #3 and #4 can be used individually or combined in a "virtual" change #6.]
In a pure change-set system, a workspace is constructed by naming the desired changes. The system may expand this list to include all indirectly required components and versions². More than one version of a component may be needed in order to fulfill the desired changes. In this case, the system must merge together deltas from several different updates in order to produce a workspace object. This means that new component versions can be temporarily instantiated without having to be explicitly created and checked in. That is, the selected version represents an initial combination and merge of several previous changes. Such combinations may drastically reduce the need for explicit variant branches.

Pure change-set systems do not work well in practice, for several reasons. First, deltas sometimes overlap and conflict, so although the system can physically construct any combination, some combinations may not parse or compile. For some binary formats, there is no way to sensibly combine any deltas except with the version they were based on. Second, when a merge conflict results, each new consumer of the conflicting changes must repeatedly resolve it. Third, in a software project with tens of thousands of components and hundreds of thousands of changes, it is too difficult to reliably make a system by naming change-sets. In fact, only a tiny percentage of change combinations are actually useful. Commercial change-set systems like TrueChange [McC00] and Bellcore's (in-house) Asgard system [Mic96] make heavy and frequent use of baselines to define a starting point for new changes. These baselines are usually tested and stable. A developer need only specify a baseline and a few additional changes. Although this reduces the number of possible permutations of changes, it vastly simplifies version selection and increases confidence in the result.

AT&T did an early implementation of change-based versioning, called Sublime. It was designed for telecom applications. The goal was to allow for feature engineering (see also [Tur99]); that is, to pick a subset of available functionalities (not just components) to configure a new, tailored system, e.g. with properties A, F and Z, but not R and S. Sublime used a variant of SCCS that could construct a file by picking specific deltas out of the SCCS archive file. Other early systems include a multi-version editor at IBM [Sar88]; the approach was independently explored in the Norwegian EPOS project under the name change-oriented versioning [Lie89]. EPOS applied Boolean attributes, called options, to version fragments (deltas). Boolean expressions over options determined the visibility of a delta [Wes01].

In the meantime, it was understood that a special use of the merging technique could provide a subset of the facilities found in the change-set approaches. Called change packages [Dar97], this technique has the big advantage of using the standard versioning technology, which is mature and efficient, and it allows for occasional change selection in a hybrid approach. This is why this technology is gaining acceptance in current SCM tools. Only recently, with the advent of task-based approaches, did versioning evolve again to integrate change-set features, but on an RCS-based internal representation.
¹ Conditional compilation is well represented by cpp, the C preprocessor. Cpp analyses the special constructs "#if label" and "#endif"; if label is defined, the text between the #if and the corresponding #endif is left in the file, otherwise it is removed.
² This is done by computing a closure over the "requires" relationships of the initial list of change-sets.
Recently, Zeller and Snelting [Zel97] have explored similar approaches in their ICE prototype, using feature logic to express the version rules. Many attempts have also been made at offering fully versioned databases, including the Damokles OODBMS [Dit87] and ObjectStore [Lam94].
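To illustrate the selection step these systems share, here is a sketch of the closure computation of footnote 2: the user names a baseline plus a few change-sets, and the system expands the list over the "requires" relation before constructing the workspace. The graph and names below are illustrative, not taken from Figure 2.

```python
def close_over_requires(selected, requires):
    """Expand a user-named set of change-sets with everything they
    transitively require (the closure of footnote 2)."""
    closed, stack = set(), list(selected)
    while stack:
        cs = stack.pop()
        if cs not in closed:
            closed.add(cs)
            stack.extend(requires.get(cs, []))
    return closed

# Illustrative "requires" graph: each change-set lists its bases.
requires = {"c2": ["c1"], "c3": ["c1"], "c4": ["c2"], "c5": ["c3"]}
print(sorted(close_over_requires({"c4", "c5"}, requires)))
# -> ['c1', 'c2', 'c3', 'c4', 'c5']
```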
Impact
The first significant and influential systems, SCCS and RCS, came from Bell Labs and Purdue University. Delta and merge algorithms all came from research labs. It may be a surprise that, although classic versioning technology has not changed significantly during the last 20 years (since RCS), almost all commercial systems rely on classic versioning. Moreover, traditional versioning technology has become pervasive, and basic versioning capabilities can be found in very many tools today, not only SCM tools. Since these mechanisms are still at the core of almost all commercial SCM systems, the influence of research in this area is obvious.

The most advanced versioning facilities are not present in today's commercial systems. This may be because the standard facilities are considered adequate by industrial users; simple SCM tools (like SourceSafe) may be enough. Although change-sets could be used to uniformly implement conditional compilation as well as revisions, in practice users view conditional compilation as a programming language issue distinct from ordinary revisions. In general, software development organizations now try to reduce the number and variety of releases in order to save costs and increase quality. In addition, organizations are more concerned with change management and process control than with advanced versioning models.

Nevertheless, many commercial SCM systems have added the ability to track logical changes rather than individual file changes, though the names and features are not always similar: ADC/Pro uses the term "change set", CCC/Harvest uses "package", CM Synergy uses "task", ClearGuide uses "activity", PCMS uses "work package", and StarTeam uses "subproject". Even lower-end SCM tools such as Intersolv's PVCS and StarBase's StarTeam can mark a source code change with the corresponding defect report or enhancement request. Within the next few years, we expect that tracking software by units of logical change will be the state of the practice for most commercial software configuration management [Dar97]. The driving force behind the acceptance of change sets may be the desire to manage changes in a process-oriented way, not as a version selection mechanism.

Altogether, versioning in industrial systems shows, on the one hand, remarkable technical stability (the version tree) and, on the other hand, a slow evolution that gradually incorporates some of the advanced services pioneered in change-set prototypes, without compromising the efficient underlying technology and without its complexity. Currently, some high-end systems propose features close to change-sets, but almost three decades were needed to get there.
4. COMPOSITION AND SELECTION
Background
A configuration is a set of entities (files, objects, artifacts, documents, …) that must have desired characteristics and satisfy certain compatibility constraints. With thousands of entities, and even more versions, naming each entity explicitly would be a daunting task. To address this problem, many schemes for specifying versions have evolved, and every SCM system is somewhat different.
Latest and Greatest
Some systems, such as Microsoft SourceSafe [Mic00], SCCS, and RCS, simply default to the latest version of the principal variant and leave it to the user to fetch individual exceptions. These systems have very limited mechanisms for specifying versions.
Hierarchy
Teamware, and many other systems, use a hierarchy of workspaces to specify local versions and then inherit other versions from parent and grandparent workspaces.
General Attributes
Adele [Est94] formulates general queries over the set of attribute values to specify versions, for example, (date > 6.20.83) OR ((status = approved) AND (owner = jacky)). It is difficult to specify a consistent configuration with this technique, so the user query is supplemented with locally stored version rules during the selection process. This gives a flexible and more tractable approach (see the example below); however, it is hard to guess or know in advance what will be selected. Inscape [Per87] and Adele also select components, not just versions, with their query language and closures, for example, (roots = system1 & lib2, relations = PartOf & DependsOn). The EPOS database [Gul91] and feature attributes in the ICE system [Zel97] select versions first, then eliminate components that have no selected version.
Change-sets and Features
McCabe TrueChange [McC00] and Asgard use change-sets, as described above, to specify versions. A user asks for a baseline plus the changes they want (e.g. RLS2.5 + "buffer expansion" + "bug fix 83" + "bug fix 91"). These systems also provide feature selection [Zel97]. In Asgard, the transitive closure of the "requires" relation is always computed to help select a consistent configuration.
Rule-based
Rule-based systems such as SHAPE [Lam88] and ClearCase [Leb94] use an ordered set of rules to specify versions, for example: "my checked-out versions", otherwise "the latest versions on the 'multiprocessor' variant", otherwise "the baseline labeled RLS2.5". The rules may be general queries: if a version matches the query it is used, otherwise the next rule is evaluated. For example, (created_since(Monday) & Status=Approved). Rules may be qualified by patterns (wildcards) in the product space, for example, "compiler/backend/…".
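The first-match semantics of such rule lists can be sketched as follows (illustrative Python; this is not the config-spec syntax of any particular tool):

```python
def select_version(element_versions, rules):
    """Evaluate an ordered rule list against one element's versions;
    the first rule that matches any version decides the selection."""
    for rule in rules:
        matches = [v for v in element_versions if rule(v)]
        if matches:
            return matches[-1]            # e.g. the latest matching version
    return None                           # no rule matched: element not selected

versions = [
    {"id": "1.5", "branch": "main", "label": "RLS2.5", "checked_out_by": None},
    {"id": "1.7", "branch": "main", "label": None, "checked_out_by": None},
    {"id": "1.8", "branch": "multiprocessor", "label": None, "checked_out_by": "me"},
]
rules = [
    lambda v: v["checked_out_by"] == "me",      # my checked-out versions
    lambda v: v["branch"] == "multiprocessor",  # latest on the variant branch
    lambda v: v["label"] == "RLS2.5",           # otherwise the baseline
]
print(select_version(versions, rules)["id"])    # -> 1.8
```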
Impact
As a practical matter, it is difficult for a user to comprehend a configuration that varies too greatly from a known baseline, so users tend to create configurations with small additions to a known baseline. They build, test, and frequently make new baselines. For this reason, generalized attribute specification is not very popular; users of change-set systems tend to specify a baseline plus only a handful of changes, and rule-based system users typically have only a small number of rules. This gives manageable version selection, and reflects how development teams perform cooperative updating and integration.
ClearCase allows versions to be selected by generalized queries involving attributes, hyperlinks, and baselines, but this feature is very rarely used³ in practice. Managing a large number of changes with respect to the same baseline is tractable only if the SCM system hides the versioning and selection issues under a higher-level concept and comfortable interfaces. This is what happens in systems using hierarchical workspaces [Est00] and task concepts, in which hundreds or thousands of changes are routinely managed through a very simple interface. Industry has adopted the research results slowly and only partially; however, SCM systems are starting to combine version selection with other disciplines such as process control, change management, and system modeling. There is an opportunity for research in this area.

³ A paraphrased query: created_by(me) & Approved & (created_since(Monday) | part_of_baseline(BL2.1))
5. DATA MODEL AND DATABASE SUPPORT
Background
A data model is used to define the structure of a software product while it is in development. It is an abstract representation of the types of artifacts to be stored in the system and how they are related to one another. Typical data models in databases are hierarchical, relational, entity-relationship, or object-oriented. A file system is also a (simple) data model. A product model, or system model, is the representation of a given (software) product's structure in terms of the underlying data model. In database terminology, a product model is called a schema. If a file system is used, the product model is the hierarchy of files and directories. Typically, product models are represented as a set of object classes with attributes and relationships. A product model is independent of the actual storage technique; it can be implemented in an object-oriented database, a relational database, or even a set of files.

One of the primary functions of an SCM system is to store a component and all of its versions safely. In the early 1970s, version control tools such as SCCS [Roc75] and RCS [Tic82] [Tic85] were introduced. These tools rely on the persistent storage facilities provided by the underlying operating system: a file system with files and directories. They store components as files in the file system. Different versions of the same component are stored within a single file using a delta mechanism. Additional information, such as the version number, date, and comments, is also stored in the archive. A directory tree of archives represents a system, and tagging individual versions of each component represents a specific baseline of the system. To compose a configuration, the user extracts all versions with the same tag. A simple data model can be inferred from tools such as these, but it is fixed and not very flexible.
Databases
In the mid-to-late 1980s, the more robust commercial SCM tools were implemented on commercial database systems [Leb84] [CCC87] [Amp89]. The primary goal was to provide better reliability and scalability for the SCM data they stored, through the use of features such as transactions and backup services.
The database was used to store the attributes and relationships (often called "metadata"), but components continued to be stored in the file system, using deltas to save space, because database systems were not good at storing large, unstructured "blobs" of data. The database itself was not directly visible to users. In any case, in commercial SCM systems the "real" data are the files, and the database is used to store information about the files making up the software. There are no entities other than files, and no complex composed structures. In other words, no general data modeling is possible with current commercial SCM tools.
Tools
Also in the 1980s, several tools tried to define data models dedicated to SCM, i.e. including versioning, selection, and so on, but also general enough to model more complex data structures not necessarily related to files. These include Adele [Est85] and the Configuration Management Assistant (CMA) [Plo89]. CMA, developed by Tartan Laboratories (a spin-off from Carnegie Mellon), provided a data model with classes of attributes and relationships that represent dependencies, consistency, compatibility, and so on. Adele was originally a research system developed at the University of Grenoble. It uses an active, object-oriented, and versioned data model to represent the components of the system. Objects can represent any kind of entity; attributes are either primitive (string, int, date), another object type, or predefined, like files, activities, and functions. Relationships are independent entities that model associations such as derivation, dependency, and composition. Triggers can be attached to both objects and relationships [Est87, Est94]. Such a data model provides several benefits [Jor95]:
• It is possible to design objects that closely model the application domain, not only software.
• SCM systems become capable of dealing with very small objects that might represent a fragment of a source program, as well as structured objects that might represent a complete system containing thousands of objects.
• Structured objects provide the freedom to copy or share components among many versions and users.
• Components can be constructed as incremental modifications to existing components in a lightweight manner.
Some commercial SCM systems developed object-oriented data models around the late 1980s. Aide-De-Camp (ADC, later known as ADC Pro and most recently TrueChange) provided attributes and relationships, as well as an innovative technique of storing changes independently of the file in which they were made [ADC89]. CaseWare/CM or CCM (later known as Continuus/CM and most recently CM Synergy) developed an object-oriented data model in 1989 that uses objects, attributes, and relationships to model complex systems [Wri90] [Cag93] [Cag94]. An object's type defines the behavior of its instances using platform-independent, run-time interpreted methods. Types are organized in a type hierarchy with single inheritance. Relationship and attribute types can also be defined and attached to objects dynamically (if security permits). Even in the best SCM data models, a configuration is a special object type along with several predefined relationships, including derivation (history), binding (composition), and dependency. Like other objects, a configuration may have versions and may have new types of relationships added, but the semantics of these ad hoc relationships must be implemented by the application.

Despite successive enhancements, the data model proposed by current commercial SCM tools is still closely bound to files and directories. The metadata stored in databases is related only to files; there are no entities other than files, and no complex, composed, heterogeneous structures. This constitutes a serious limitation, since it prevents SCM tools from addressing other closely related domains where the objects of interest are not classic files and directories. These include web sites (where pages are html files, but are organized and managed differently) and Product Data Management (PDM), where entities are abstractions of real-world artifacts (bolts, wires, engines, …) with many attributes, complex relationships, and complex organization. In PDM, the entities of interest are primarily stored in a database, not in files. SCM data models are much too rudimentary for modeling PDM data.
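A skeletal sketch of the kind of typed data model described in this section (objects, attributes, and named relationship types); the class and relationship names here are purely illustrative:

```python
class Component:
    """An object in the product space: typed, attributed, and linked
    to other objects by named relationship types (hyperlinks)."""

    def __init__(self, name, ctype, **attributes):
        self.name, self.ctype = name, ctype
        self.attributes = attributes      # e.g. baseline names, states
        self.links = []                   # (relationship type, target object)

    def link(self, rel_type, target):
        self.links.append((rel_type, target))

design = Component("parser-design", "document", state="approved")
source = Component("parser.cpp", "c++source", baseline="BetaRelease")
source.link("Realizes", design)           # trace source code back to design
source.link("DependsOn", Component("lexer.cpp", "c++source"))
```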
Impact
The design of both CCM's and ClearCase's data modeling facilities was influenced by PCTE [Bou88] and by other object-oriented research at the time they were developed. Although commercial systems still do not contain many of the features of research systems, some do provide data modeling features. For example, commercial systems provide hyperlinks and attributes, but attributes and relationships are not versioned, and types do not support multiple inheritance. Much of the market demand for the relationship aspects of data modeling came from the old DoD-2167A requirements tracing specification (and later from ISO 9000 and SEI/CMM). Interestingly, the software market demand for data modeling has become weaker over time as demand for change and process management has increased. Conversely, the increasing need for consistent management of software and other products (PDM) sets very high expectations for data modeling that current commercial systems do not seem capable of meeting in the near future. The prototypes experimented with so far are a good step in this direction, but a strong demand from industry, and more research and experimentation, will be required anyway.