ADiJaC – Automatic Differentiation of Java Classfiles
EMIL I. SLUŞANSCHI and VLAD DUMITREL, University Politehnica of Bucharest
This work presents the current design and implementation of ADiJaC, an automatic differentiation tool for Java classfiles. ADiJaC uses source transformation to generate derivative codes in both the forward and the reverse modes of automatic differentiation. We describe the overall architecture of the tool and present various details and examples for each of the two modes of differentiation. We emphasize the enhancements that have been made over previous versions of ADiJaC and illustrate their influence on the generality of the tool and on the performance of the generated derivative codes. The ADiJaC tool has been used to generate derivatives for a variety of problems, including real-world applications. We evaluate the performance of such codes and compare it to derivatives generated by Tapenade, a well-established automatic differentiation tool for Fortran and C/C++. Additionally, we present a more detailed performance analysis of a real-world application. Apart from being the only general-purpose automatic differentiation tool for Java bytecode, we argue that ADiJaC's features and performance are comparable to those of similar mature tools for other programming languages such as C/C++ or Fortran.

Categories and Subject Descriptors: G.1.4 [Numerical Analysis]: Quadrature and Numerical Differentiation—Automatic differentiation; D.3.4 [Programming Languages]: Processors—Compilers, Preprocessors; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—Operational semantics, Program analysis; G.1.6 [Numerical Analysis]: Optimization—Gradient methods

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Source transformation

ACM Reference Format: Emil I. Sluşanschi and Vlad Dumitrel. 2016. ADiJaC – Automatic differentiation of Java classfiles. ACM Trans. Math. Softw. 43, 2, Article 9 (September 2016), 33 pages. DOI: http://dx.doi.org/10.1145/2904901
1. INTRODUCTION
Derivatives are a crucial ingredient of various computational techniques used in science and engineering. Numerous applications such as parameter identification, design optimization, sensitivity analysis, and data assimilation rely heavily on derivatives. The accurate evaluation of the derivatives of functions, specified in the form of computer programs, is thus required. Traditional approaches for computing derivatives include divided differences and symbolic differentiation. The former is a class of techniques that provide numerical approximations for derivatives [Burden and Faires 2001] and therefore suffer from truncation and cancellation errors. The latter is used in algebraic systems such as Maple [Kofler 1997; Geddes et al. 1993] and Mathematica [Maeder 1991; Wolfram 1991] and generates results in the form of formulae that are often cumbersome and
inefficient; furthermore, this technique is usually limited in terms of the nature and complexity of the input functions that it can handle.

On the other hand, automatic differentiation (AD) [Naumann 2011; Griewank and Walther 2008; Bücker et al. 2006; Bischof and Bücker 2000; Berz et al. 1996; Bischof et al. 1995; Griewank 1989; Rall 1981] provides an efficient way of accurately evaluating derivatives of arbitrarily complex functions represented in high-level programming languages, such as C, C++, Fortran, or MATLAB, and even in algebraic systems like Maple [Monagan and Rodoni 1996]. Using automatic differentiation (also known as algorithmic differentiation), a program that computes a mathematical function is transformed into another program that computes its desired derivatives. Compared to the traditional techniques, AD handles a large variety of programming constructs, does not suffer from truncation errors, and is capable of producing efficient derivative codes.

1.1. Building an AD Tool for Java
Java [Gosling et al. 2013] is an object-oriented programming language that uses strong static typing. Java programs are executed inside a virtual machine, which provides automatic memory management and garbage collection. Pointer arithmetic is not allowed and memory leaks are not possible. Consequently, Java is considered to be memory and type safe. Java programs are portable at both the source and the bytecode level. In addition, the language incorporates various powerful features, such as exception handling, native multithreading, dynamic linking, and reflection. These characteristics make Java a flexible and safe language.

The downside to these benefits is an overall penalty in terms of performance. Java is usually considered to be less efficient than languages such as C, C++, or Fortran. Various improvements have been implemented over the years to alleviate this disadvantage, including the introduction of Just-in-Time (JIT) compilers [Adl-Tabatabai et al. 1998]. Thanks to such advances, Java has become more competitive in terms of performance [Boisvert et al. 2001; Bull et al. 2001; Moreira et al. 2000] and has been used successfully in many areas of scientific computing.

The primary motivation behind the development of the ADiJaC tool [Sluşanschi 2008] was the fact that no usable AD implementation for Java was available, despite the language's improvements in terms of performance and its increasing use in scientific applications. The tool is designed to offer general-purpose AD capabilities, minimize user intervention, and produce efficient derivative codes. As the name suggests (Automatic Differentiation of Java Classfiles), ADiJaC operates on Java classfiles, which consist of Java bytecode. As mentioned before, this representation is highly portable; furthermore, compilers for languages other than Java are also capable of generating bytecode. This common representation offers the possibility of using ADiJaC to differentiate code that originates from different languages. Had we chosen to differentiate Java source code directly, we would not have been able to take advantage of the portability and flexibility offered by Java bytecode.

However, there are several drawbacks to using Java bytecode directly. It is a stack-based, untyped representation, and comprises over 200 instructions, making it less suitable for AD-specific transformations. More appropriate representations are offered by Soot [Vallée-Rai et al. 1999], a Java optimization framework developed at McGill University. Soot represents an important part of ADiJaC's infrastructure, not only by providing proper intermediate representations, but also by enabling various analyses, transformations, and optimizations.

The main contribution of this work is the design and implementation of a fully functional AD tool for the Java programming language, in both the forward and reverse modes. The transformations are conducted at a language-independent level using the Soot framework, and a number of code analyses and optimizations are
implemented in both modes of AD. This article presents details on the code generation and experimental results for numerical and scientific computing applications. We show that the performance obtained by ADiJaC-generated AD codes is comparable to that of codes generated by the well-established AD tool Tapenade [Hascoët and Pascual 2013].

1.2. Outline
The following section contains an overview of the principles behind automatic differentiation and the available techniques and tools, including those related to the Java programming language. Section 3 presents the architecture and infrastructure of the ADiJaC tool, including implementation details that apply to all modes of differentiation. Sections 4 and 5 describe the design and implementation of the forward and reverse modes, respectively. Section 6 discusses a series of experimental performance results and a more detailed performance analysis of a real-world problem. Finally, Section 7 concludes this work and discusses possible future developments of the ADiJaC tool.

2. RELATED WORK

2.1. An Overview of Automatic Differentiation
Automatic differentiation can be regarded as a semantic transformation that can be applied to any computer code. Any computation expressed in the form of a computer program can be represented as a sequence of arithmetic operations (e.g., addition, multiplication) and intrinsic functions (e.g., sin, exp, etc.). The key concept of the AD technology is the repeated application of the chain rule of differentiation (if f = g ∘ h, then ∂f/∂x = (∂g/∂h)(∂h/∂x)) to such elementary operations, whose derivatives are well known and simple to express. These elementary derivatives are combined together to yield the derivative computation of an entire program.

Given a computer program that evaluates a function f : Rⁿ → Rᵐ, automatic differentiation can generate a program that computes the m × n Jacobian matrix df/dx, where x is an input vector of size n. It is often the case that one only requires the derivatives of a subset of the outputs with respect to a subset of the inputs. In the context of AD, an input is referred to as independent if derivatives with respect to it are required. Conversely, an output is referred to as dependent if its derivatives with respect to some independent inputs are required. Finally, a variable is called active if it depends on some independent input, and some dependent output depends on it.

The associativity of the chain rule allows for the computed partial derivatives to be combined in a variety of ways, all yielding the same results but at different computational costs in terms of time and memory usage. However, it should be noted that, by definition, computer floating-point arithmetic is not associative. We now proceed to briefly describe the two main accumulation strategies used in automatic differentiation.

2.1.1. The Forward Mode. In the forward mode, a gradient object ∇u is associated with each active scalar variable in the program such that ∇u contains the derivatives of u with respect to the n-dimensional input x. In general, the value of ∇u is changed whenever u itself is changed. For example, given an assignment statement u = v ⊙ w that computes a binary operation ⊙, assuming that all three variables are active, the corresponding derivative statement will be ∇u = (∂u/∂v)∇v + (∂u/∂w)∇w. In this way, the derivative information is carried along with the evaluation of f, starting with ∇x initialized as the n × n identity matrix (a process called seeding), and finally producing the desired values ∇f = ∂f/∂x. In the resulting derivative code, the derivative objects are vectors of length n, and their computation introduces additional loops.

2.1.2. The Reverse Mode. In the reverse mode, an adjoint object ū is associated with each active scalar variable in the program, such that ū contains the derivatives of the m-dimensional output f with respect to u. As before, we consider the assignment
u = v ⊙ w, where all three variables are active. The reverse propagation is achieved by adding the contributions ū(∂u/∂v) and ū(∂u/∂w) to v̄ and w̄, respectively, and then resetting ū to 0. The adjoint information is propagated in reverse order with respect to the original program execution, starting with the known values f̄ = ∂f/∂f seeded as the m × m identity matrix, and producing the desired values x̄ = ∂f/∂x.

The reverse mode requires that the evaluation of f be completed before the derivative computation can commence. Since the intermediate results of the function evaluation are sometimes necessary in reverse order during the adjoint computation, these values must be either saved or recomputed [Giering and Kaminski 1998a]. In addition, information regarding the control flow for the evaluation of f must also be available during the reverse propagation [Naumann et al. 2004]. These issues make the reverse mode significantly more difficult to implement than its forward counterpart [Hascoët 2009; Griewank and Walther 2000].
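To make the two propagation rules concrete, consider the single active assignment u = v * w. The sketch below is our own illustration, not code generated by ADiJaC; the array parameters merely emulate in-out arguments.

    // Illustration only (not ADiJaC output): derivative statements for u = v * w.
    class ChainRuleExample {
        // Forward mode: the gradient statement precedes a clone of the original
        // statement; d(v*w)/dv = w and d(v*w)/dw = v.
        static double forward(double v, double v_g, double w, double w_g, double[] u_g) {
            u_g[0] = w * v_g + v * w_g;   // gradient of u
            return v * w;                 // original computation
        }

        // Reverse mode: executed during the reverse sweep, after u = v * w has
        // already been evaluated in the forward sweep.
        static void reverse(double v, double w, double[] v_a, double[] w_a, double[] u_a) {
            v_a[0] += u_a[0] * w;         // contribution u_a * du/dv
            w_a[0] += u_a[0] * v;         // contribution u_a * du/dw
            u_a[0] = 0.0;                 // u was overwritten, so its adjoint is reset
        }
    }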
2.1.3. Computational Requirements. As mentioned before, the accumulation strategies may differ significantly in terms of computation costs, specifically time and memory requirements [Griewank 2003]. As a general rule, the temporal complexity (or overhead) of forward accumulation grows linearly with the number of inputs, whereas the temporal complexity of reverse accumulation grows linearly with the number of outputs. Consequently, the forward mode is superior when the number of inputs is significantly smaller than the number of outputs (n ≪ m), whereas the reverse mode is preferable when the number of outputs is significantly smaller than the number of inputs (m ≪ n). It should also be noted that the reverse mode may introduce high memory requirements by saving intermediate values and control flow information.

Apart from these well-established strategies, it is possible to use other ways of applying the chain rule. The problem of finding a strategy that computes a function's Jacobian using the minimum number of operations is called the optimal Jacobian accumulation problem, and is proven to be NP-complete [Naumann 2008]. However, given additional knowledge concerning the structure of a problem or the sparsity patterns of the Jacobian matrix, it is possible to produce derivative codes that achieve superior performance compared to the standard forward and reverse approaches [Gebremedhin et al. 2005; Bischof et al. 1996a].

2.2. Automatic Differentiation Tools
Both AD modes can be implemented using either the source transformation or the operator overloading technique.

2.2.1. Operator Overloading. Operator overloading effectively hides the implementation of the derivative computation from the user, but it is only applicable in programming languages that support some form of operator overloading. To use the tool, the users must modify the types of the active variables and then recompile the code with the new implementations for the overloaded operations and intrinsics. This technique offers flexibility, as the implementation of new functionalities is restricted to the new derivative type (class). However, the runtime overhead of the operator overloading technique can be substantial, due to the large number of method calls involved. The fact that the source code itself remains unaltered is at first glance an attractive feature, but this advantage is overshadowed by the lack of transparency introduced in the code, especially when debugging the derivative computations. Additionally, the operators have no knowledge about possible dependencies between various variables. This considerably limits the flexibility provided by the associativity of the chain rule of differential calculus, and thus reduces the potential improvements in terms of performance of the differentiated code.
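Since Java itself offers no operator overloading, tools that follow this approach for Java (see Section 2.3) replace arithmetic operators with method calls on an active type. The sketch below is our own minimal illustration of such a type; the class and method names are hypothetical and are not taken from any of the tools discussed here.

    // Hypothetical "active" type in the operator-overloading style (illustration only).
    class ADouble {
        final double val;   // function value
        final double dot;   // derivative value for a single direction
        ADouble(double val, double dot) { this.val = val; this.dot = dot; }

        ADouble mult(ADouble other) {            // u = v * w
            return new ADouble(val * other.val,
                               val * other.dot + dot * other.val);
        }
        static ADouble sin(ADouble v) {          // u = sin(v)
            return new ADouble(Math.sin(v.val), Math.cos(v.val) * v.dot);
        }
    }

Every elementary operation becomes a method call on such objects, which is also the source of the runtime overhead mentioned above.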
AD tools based on operator overloading include Adept [Hogan 2014], ADOL-C [Griewank et al. 1996], CppAD [Bell 2016], and FADBAD [Bendtsen and Stauning 1996] for C/C++; ADF [Straka 2005] and ADOL-F [Shiriaev and Griewank 1996] for Fortran; and ADMAT [Verma 1999] and MAD [Forth 2006] for MATLAB. Additionally, some scientific computing frameworks such as COSY [Makino and Berz 2006] and Trilinos with the Sacado implementation [Heroux et al. 2005] contain packages that enable this technique. Additional work was done, for various operator overloading AD tools, on improved formulations that are well suited for the evaluation of first and second derivatives of optimal values, yielding significant savings in time and memory [Bell and Burke 2008]. An interesting implementation strategy of AD for C++ programs in the forward mode uses operator overloading and expression templates [Hogan 2014], and thus allows for a limited scope of exploiting chain rule associativity.

2.2.2. Source Transformation. On the other hand, the source transformation approach of AD is capable of analyzing, detecting, and making use of more complex structures such as statements or nested loops for which more efficient derivative code can be generated through the exploitation of the chain rule associativity. It is a compiler-based technique for the transformation of the original computer code into another code that explicitly contains statements for the computation of the desired derivatives, thus allowing inspection of the generated code. The main disadvantage of the source transformation approach is the complexity involved when implementing AD tools, requiring a mature infrastructure capable of handling the language-specific constructs. Using this approach, a number of AD tools were also developed: Adifor [Bischof et al. 1992, 1996b; Bischof and Griewank 1992], TAF [Giering and Kaminski 2003], and TAMC [Giering and Kaminski 1998b] for Fortran; ADIC [Narayanan et al. 2010; Bischof et al. 1997] for C/C++; and ADiMat [Willkomm et al. 2014; Bischof et al. 2002] for MATLAB. Tools such as Tapenade [Hascoët and Pascual 2013; Pascual and Hascoët 2008] and OpenAD [Utke et al. 2008] provide AD functionalities for both C/C++ and Fortran.

2.3. Automatic Differentiation for the Java Programming Language
A tentative implementation of forward-mode AD for the Java programming language was conducted by Rune Skjelvik for his master's thesis in 2001 [Skjelvik 2001]. However, the proposed implementation only simulates an operator overloading approach over Java, as the language itself does not allow operator overloading. A preprocessor is used to transform all applications of arithmetic operators to method calls (e.g., v * w becomes v.mult(w)), which compute both the original values and the derivatives.

The paper of Fischer et al. [2005] contains a reference to a library called JavaDiff that implements AD on a representation of a computational graph in a fashion similar to ADOL-C. No details are provided about the implementation except for acknowledging the limitations arising from the computational graph approach, namely, its inability to deal with input-dependent control flow.

JAutoDiff [JAutoDiff 2014] is a tool based on a form of operator overloading that relies on manually constructing functions using primitive operations (e.g., plus, mul, sin, etc.). The major disadvantage of this implementation is the fact that the functions to be processed are limited to single expressions; this means that features such as control flow are out of the scope of this approach. Another implementation offering support for vector computations of computer derivatives in Java is described in Montiel et al. [2013]. Deriva [Kwiecinski 2016] is an AD tool for Java and Clojure that also relies on manually constructed functions. Unlike JAutoDiff, however, the supported expressions can also contain conditionals.
In contrast to the aforementioned efforts, JAP: Java Jacobian Automatic Programming [Pham-Quang and Delinchant 2012] offers a more flexible and generic set of AD capabilities. In a manner similar to Skjelvik's approach, operator invocations are transformed into method calls, which are defined for objects of type Jdouble1. The authors call this technique virtual operator overloading. JAP can be used to obtain Jacobian matrices and partial derivatives, but only in the forward mode. It also offers some support for Java arrays, external functions, and runtime selection of partial derivatives.

3. OVERVIEW
The ADiJaC tool uses source transformation to generate derivatives for functions expressed in Java bytecode. In this context, source transformation is used to refer to the AD technique transforming Java bytecode, and not the original Java source code. Although source transformation in AD is more difficult to implement, it is generally considered to produce more efficient code than the alternative, namely, the operator overloading approach. Furthermore, the original Java source code requires minimal modifications.

3.1. The Soot Framework
ADiJaC uses the Soot framework [Einarsson and Nielsen 2008; Vallée-Rai et al. 1999] to parse Java classfiles. These are translated into an unstructured, typed, three-address intermediate representation (IR) called Jimple. All AD-related analyses and transformations operate on this representation, which is more suitable for this purpose than the stack-based bytecode. The generated code is represented internally using a different IR, also provided by Soot, called Grimp. This is similar to Jimple, the important difference being that it allows aggregate expressions (i.e., not limited to a single operation on the right-hand side). This facilitates the generation of derivative code, as derivative statements are generally more complex than their corresponding original statements. After the Grimp code has been generated, Soot translates it into bytecode, which can be executed on a Java Virtual Machine.

The Soot framework features a Java decompiler called Dava. This would allow the translation of the classfiles back into Java code. However, the decompiler has been found to be unreliable for anything but the simplest of codes. Therefore, the decision was made to use the bytecode-to-bytecode model. The primary disadvantage of this approach is that the classfiles are difficult to inspect and modify. It is therefore important for the generated code to not require any user intervention in order to function correctly.

Soot provides an application programming interface (API) that is useful for numerous compiler-related techniques. The various entities associated with an input program are represented internally by Java objects, the most important of which are the following:

—Scene – the entire application
—SootClass – an individual class or interface
—SootField – a field (instance or static)
—SootMethod – a method (does not contain the actual code)
—Body – the body of a method
—Unit – an instruction or a statement
—Value – the interface for expressions (Expr), local variables (Local), constants (Constant), and so forth

Code transformations, analyses, and optimizations are usually introduced by extending Soot's transformer classes. The SceneTransformer class operates at the level of the entire application and can be used to add or remove classes and methods and
perform interprocedural transformations. The BodyTransformer operates on the bodies of individual methods and is used for intraprocedural transformations.

Soot also provides a flexible and generic way of defining static analyses, which are very common in the field of code transformations [Schwartzbach 2008]. To this end, the classes ForwardFlowAnalysis and BackwardFlowAnalysis can be extended by specifying the appropriate flow equations. The framework includes various traditional analyses that have been built on top of these base classes. However, when dealing with a specific kind of code transformation (such as automatic differentiation), it is useful to be able to define domain-specific static analyses.
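As a schematic illustration of what such a domain-specific analysis looks like on top of Soot, the sketch below outlines a forward "varied" analysis in the spirit of the activity analysis of Section 3.6. This is our own sketch, not ADiJaC's implementation, and minor details of the Soot API (generics, the exact set of abstract methods) may differ between Soot versions.

    // Schematic forward dataflow analysis built on Soot (illustration only).
    import soot.Local;
    import soot.Unit;
    import soot.ValueBox;
    import soot.toolkits.graph.UnitGraph;
    import soot.toolkits.scalar.ArraySparseSet;
    import soot.toolkits.scalar.FlowSet;
    import soot.toolkits.scalar.ForwardFlowAnalysis;

    class VariedAnalysis extends ForwardFlowAnalysis<Unit, FlowSet<Local>> {
        VariedAnalysis(UnitGraph graph) {
            super(graph);
            doAnalysis();                                     // run the fixed-point iteration
        }
        @Override protected FlowSet<Local> newInitialFlow() { return new ArraySparseSet<Local>(); }
        @Override protected FlowSet<Local> entryInitialFlow() {
            return new ArraySparseSet<Local>();               // a real analysis would seed the independents here
        }
        @Override protected void copy(FlowSet<Local> src, FlowSet<Local> dest) { src.copy(dest); }
        @Override protected void merge(FlowSet<Local> in1, FlowSet<Local> in2, FlowSet<Local> out) {
            in1.union(in2, out);                              // varied on any incoming path
        }
        @Override protected void flowThrough(FlowSet<Local> in, Unit stmt, FlowSet<Local> out) {
            in.copy(out);
            boolean usesVaried = false;                       // does the statement read a varied local?
            for (ValueBox use : stmt.getUseBoxes())
                if (use.getValue() instanceof Local && in.contains((Local) use.getValue()))
                    usesVaried = true;
            if (usesVaried)                                   // then everything it defines becomes varied
                for (ValueBox def : stmt.getDefBoxes())
                    if (def.getValue() instanceof Local)
                        out.add((Local) def.getValue());
        }
    }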
3.2. Workflow

The ADiJaC tool takes Java .class files as inputs; therefore, the original source files must be translated into bytecode by a compiler (e.g., javac). The syntax used to run ADiJaC is the following:

    java -cp [...] adijac.ADiJaCDriver [mode] [class] [function]

Apart from the ADiJaC classes, the classpath must contain the Soot library, the classes to be transformed, and their corresponding dependencies. The mode must be one of the following:

—fm – forward scalar mode
—fvm – forward vector mode
—rm – reverse scalar mode

The class specifies the name of the class that contains the input function to be differentiated. First, the tool invokes Soot, which generates Jimple code as an intermediate representation. The first AD-related action is to identify the methods that will be differentiated within the application and to generate the methods that will contain the derivative code. This is done using one of several scene transformers, each corresponding to a different mode of differentiation. Afterward, each of these methods is processed by a body transformer, also in accordance with the specified mode. At this level, ADiJaC performs the required AD-related analyses and transformations on the Jimple representation, generating the Grimp code for each derivative method. Finally, Soot translates the Grimp code into bytecode, thus producing the final results in the form of .class files. The resulting files contain valid Java bytecode and can be run on a Java Virtual Machine.

3.3. Annotations
In order to use ADiJaC to differentiate a function, one must first introduce annotations in the original code. The purpose of these annotations is twofold: identify the active user-defined functions and specify the input (independent) and output (dependent) variables. The choice of input and output variables is not limited to function arguments. Any local scalar or array of float or double can be an input or an output, but other entities (e.g., integers, references, fields) cannot. The annotations are preserved in the bytecode, and therefore can be inspected by Soot.
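The annotated example that originally accompanied this paragraph is not reproduced here. A hypothetical annotated method in the same spirit might look as follows; the annotation name and its elements are placeholders of our own and do not reflect ADiJaC's actual annotation syntax, which is defined by the ADiJaCInfo interface described in Section 3.4.

    // Hypothetical annotation type (illustration only; not ADiJaC's actual syntax).
    @interface Differentiate {
        String[] independents();
        String[] dependents();
    }

    class Annotated {
        // The active method is marked, and the independent and dependent
        // variables are identified by name.
        @Differentiate(independents = {"x"}, dependents = {"z"})
        public static double compute(double x) {
            double t = Math.sin(x);
            double z = t * x;
            return z;
        }
    }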
As this example suggests, the independent and dependent variables are identified by name. A more desirable alternative would have been to annotate variables directly in the original body; however, such annotations are not preserved in bytecode.

3.4. Support Classes
The derivative-enhanced classfiles generated by ADiJaC require a set of AD-related classes that reside in the package adsupport. These must be accessible to both ADiJaC (during code generation) and the application that uses the derivative code. Three of the classes in this package are always required, regardless of the AD mode:

—ADiJaCInfo – the interface for ADiJaC annotations
—RealBox – a mutable (i.e., changeable) reference to a real number; contains the public field double val
—ArrayBox – a mutable reference to a real array; contains the public field Object arr

The box classes are necessary when the changes made to the arguments of a function must be visible outside of that function. In a language such as C, this can be easily achieved by using pointers to variables. However, the Java programming language does not allow passing arguments by mutable references. Previous versions of ADiJaC used a class called DerivType to encapsulate both a real value and its corresponding derivative value. This approach was changed, primarily for performance reasons; thus, the current implementation uses separate variables for the two values.
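In their simplest form, the two box classes are just mutable wrappers around a value. The sketch below shows what such minimal classes might look like, based on the fields listed above; the real adsupport classes may contain additional members.

    // Minimal sketches of the box classes (each would reside in its own source
    // file in the adsupport package); only the documented public fields are shown.
    public class RealBox {
        public double val;   // the wrapped real value
    }

    public class ArrayBox {
        public Object arr;   // the wrapped real array, e.g., a double[] or double[][]
    }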
3.5. Preprocessing

Before performing the actual AD-related analyses and code generation, ADiJaC executes a preprocessing phase. The first step of this phase is to translate each body into a form in which, with minor exceptions, a scalar variable cannot appear on both sides of an assignment. This is achieved by introducing intermediate variables and assignments. The second step converts each basic block into static single assignment (SSA) form. In the current implementation, this transformation does not cross the boundaries of basic blocks, does not deal with issues of control flow, and consequently can avoid the introduction of phi-nodes (used when a variable can be assigned different values based on the path of control flow). This reduces the complexity of the analysis and that of the generated code. The tradeoff in this case is the potential increase in the number of used variables. The purpose of the preprocessing phase is to bring the input function bodies to a more canonical form, which facilitates further analyses and transformations.
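As a small illustration of the first step (ours, not actual ADiJaC output), an assignment in which the same scalar appears on both sides is split by introducing an intermediate variable:

    // Illustration of the first preprocessing step: the same scalar may not
    // appear on both sides of an assignment, so a temporary is introduced.
    static double scale(double z, double t) {
        // original statement:  z = z * t;
        double tmp0 = z * t;   // introduced temporary (the name is illustrative)
        z = tmp0;
        return z;
    }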
3.6. Activity Analysis

An important improvement to ADiJaC is the introduction of the activity analysis [Hascoët et al. 2005; Hascoët and Araya-Polo 2006]. In previous versions [Sluşanschi 2008], all float and double variables were considered active, and were therefore converted to DerivType. The activity analysis uses the information provided in the annotations. It actually consists of a forward varied analysis, which identifies the variables that depend on at least one input, and a backward useful analysis, which identifies the variables on which at least one output depends. A variable is considered active if it is varied and useful. If a variable is found to be inactive, the derivative code can be improved in several ways:

—Its derivative(s) will not be computed.
—Derivative instructions that would normally use the variable's derivative(s) can be simplified, as those derivative(s) are effectively null.
—There will be no derivative object associated with the variable.
The first two implications related to inactive variables generally reduce the number of operations present in the derivative-enhanced code, whereas the latter reduces the memory footprint. In addition to these direct benefits, activity information is also necessary for the to-be-recorded analysis [Hascoët et al. 2005], which is specific to the reverse mode (see Section 5.2). It must be noted that, to date, in ADiJaC only local variables can be active. Active class and instance fields are not yet supported by our tool. As previously described in Section 3.3, the current implementation of the activity analysis in ADiJaC requires the use of annotations for all active functions and is therefore not interprocedural in nature.
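A small example of how the varied and useful analyses combine (our illustration, with x the only independent input and z the only dependent output):

    // Activity illustration (x independent, z dependent):
    static double example(double x) {
        double t = Math.sin(x);   // varied and useful          -> active
        double d = 2.0 * x;       // varied, but not useful     -> inactive
        double c = 3.0;           // useful, but not varied     -> inactive
        double z = t * c;         // varied and useful          -> active (dependent output)
        return z;
    }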
3.7. Postprocessing

Before Soot outputs the resulting bytecode, ADiJaC performs a series of optimizations on the derivative code. Some of these are standard, general-purpose optimizations, such as constant propagation and expression aggregation, and are provided by the Soot library [Schwartzbach 2008]; these are applicable to any kind of code, not just derivative-enhanced code. Others are customized for ADiJaC and can take advantage of the specific structure and characteristics of the derivative code; the removal of dead assignments falls into this category—some useless default initializations are not detected by Soot's general-purpose analyses.

3.8. Language Coverage
As mentioned before, one of the goals of ADiJaC is to be able to successfully differentiate as many types of language constructs as possible. In this respect, the tool has been improved and is now able to handle language features that were not supported before [Sluşanschi 2008]. Among these new features, the most important are the following:

Method calls. Previously, support for interprocedural differentiation was limited to a single nested method with a predefined name. In contrast, ADiJaC can now differentiate call graphs of arbitrary complexity as long as the active methods are annotated appropriately.

Control flow. Prior to this version, ADiJaC's approach for transforming control flow structures relied primarily on commonly used patterns and did not attempt to offer generic support. This approach introduced various constraints on the input computer code, such as limitations on the nesting levels and the inability to use unstructured jumps (e.g., break, continue). In addition, various assumptions were made regarding the input code that were not valid for all compilers. Our new strategy greatly favors generality and is able to differentiate arbitrary control flow graphs. It is even possible to handle unstructured flows that constitute valid bytecode but cannot appear in high-level languages such as Java (e.g., arbitrary unconditional jumps using the goto statement).

Multidimensional arrays. ADiJaC can now differentiate code that contains arrays of any number of dimensions. This was not the case in previous versions, which meant that users had to intervene and modify the Java source code in order to manually reduce the dimensionalities.

The strategies used to implement these features usually depend on the mode of differentiation. Therefore, they will be discussed in more detail in the following two sections.

4. THE FORWARD MODE

4.1. Scalar Mode
Fig. 1. The forward-mode workflow. The additional steps required for the vector mode are displayed on the right.

The forward mode is a simple form of automatic differentiation. For each user-defined active method, a corresponding gradient method is generated. The name of a derivative
method is obtained by adding the prefix g_ to the name of the original function. The derivative methods, including their signatures, are generated by the forward-mode scene transformer. The workflow for ADiJaC’s forward mode is illustrated in Figure 1. Within the active bodies, a gradient object that contains derivative information is associated with each active variable. The name of each gradient object consists of the name of the original variable, to which we append the suffix _g. Each original body is processed sequentially, in a single sweep, by a forward-mode body transformer. With minor exceptions, the derivative code for an active instruction will contain a clone of the original instruction, preceded by the corresponding gradient instruction. As mentioned before, no corresponding gradient operations are generated for inactive instructions. Consequently, such instructions, as well as the ones that define control flow (i.e., jumps), are simply cloned into the derivative code. To illustrate some of the basic forward-mode transformations, we consider the following method:
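The original listing is not reproduced in this version of the text. A method consistent with the description that follows (a call to Math.sin, a comparison feeding a conditional jump, and a multiplication producing the dependent output z) might look like the sketch below, which is our illustration rather than the article's exact example:

    // Illustrative reconstruction only (not the article's original listing).
    public static double compute(double x) {
        double t = Math.sin(x);
        double z = 1.0;
        if (t > 0.5) {
            z = t * x;
        }
        return z;
    }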
The bytecode corresponding to this method is translated by Soot into Jimple code. The AD-related transformations will act on a preprocessed Jimple body:
We will use this small example to showcase some basic characteristics of the Jimple IR. The Jimple function compute has a header that is very similar to the original Java function header; it specifies that the function takes a double parameter and returns a double. The function body starts with declarations for all the local variables (lines 2– 3 in our example). The statement at line 4 associates the local variable x with the first parameter of the function. Computations are expressed using simple instructions such as assignments (line 8), arithmetic operations (line 9), and relational operations (line 6). Additionally, we can have function calls (invocations) such as the one at line 5. Control flow is expressed using conditional and unconditional jumps; at line 7, we have a conditional jump to label0. The function returns the value of z via the statement at line 11. The same conventions also apply to the Grimp IR; the only notable exception is the fact that Grimp allows compound expressions. The ADiJaC transformations generate a Grimp method body, which is shown next. Throughout this work, the resulting Grimp bodies are displayed without having been subjected to postprocessing, unless otherwise specified; this makes it easier to observe individual AD transformations.
The first thing to notice with respect to the generated method is its signature. The original method has one independent input and one dependent output. Consequently, the derivative method takes the following arguments: the input x of type double, the gradient of x encapsulated in a RealBox object (x_g_ref), and the gradient of z, also encapsulated in a RealBox object (z_g_ref). In this scenario, the method uses the value seeded inside x_g_ref, accumulates the derivative of z with respect to x, and outputs it via z_g_ref. Apart from performing the derivative computation, g_compute also returns the result of the original function. The generated forward-mode Grimp code begins with a definition and initialization phase for the local variables, as well as the binding for each corresponding method parameter. As mentioned before, the derivative code generation in this mode is then performed in a straightforward fashion. For instance, the assignment at line 9 in the Jimple code contains a multiplication operation. The gradient statement corresponding to this assignment can be found at line 17 in the Grimp code. It follows the well-known rule for differentiating multiplications and is succeeded by a clone of the original statement. The assignment at line 5 in the Jimple code contains a call to an intrinsic function (Math.sin), and its corresponding gradient instruction (line 11 in the Grimp code) contains a call to the derivative of this function (Math.cos). The input code also contains a conditional statement. It is easy to see that the generated Grimp code simply mirrors this instruction (line 14). The only active operations that do not follow the pattern presented previously are the ones that contain calls to user-defined (i.e., not intrinsic) active functions. In such cases, ADiJaC simply generates the call to the derivative function, along with any necessary boxing and unboxing statements. This is possible in the forward mode because a derivative function can also compute the original results, which are usually required in the code that is making the call. To illustrate code generation for function calls, let us assume that the compute method from the previous example is called in an active statement by some other method in the program. The Jimple instruction inside the caller will be of the form: The derivative Grimp code corresponding to this call is displayed next in its most generic form and illustrates two important issues. First, a single call to g_compute is used to obtain both the value of t (as computed by the original function) and the gradient t_g. Second, because g_compute takes RealBox arguments for the gradients, the derivative code must box the values of x_g and t_g (lines 1–2) before the call is performed and unbox them (lines 5–6) after the call has returned.
Before calling a derivative function produced by this mode, the user must first initialize the gradient values for both the independent inputs and the dependent outputs. The common approach is to set the gradients associated with the inputs to 1.0 and those associated with the outputs to 0.0.
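Putting the pieces together, a caller of the scalar forward-mode derivative described above might seed and invoke it as sketched below; the enclosing class name Example and the use of a no-argument RealBox constructor are assumptions of ours.

    // Sketch of seeding and calling the generated forward-mode derivative (illustration only).
    static double derivativeOfComputeAt(double x) {
        RealBox x_g = new RealBox();
        RealBox z_g = new RealBox();
        x_g.val = 1.0;                               // seed the gradient of the independent input x
        z_g.val = 0.0;                               // clear the gradient of the dependent output z
        double z = Example.g_compute(x, x_g, z_g);   // also returns the original function value
        return z_g.val;                              // dz/dx at the given x
    }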
4.2. Vector Mode

So far we have discussed what is known as the scalar forward mode, where each active value has an associated scalar derivative object. ADiJaC also features an
implementation of the vector forward mode, where each active value has an associated vector of derivatives. Consequently, the derivative variables will grow in dimensionality by 1 (i.e., scalar derivatives become 1D arrays, 1D array derivatives become 2D matrices, etc.). By enclosing each gradient instruction in a loop over the derivative vector, we can compute the derivatives for multiple directions. With the exception of the added loops, the rules for generating derivatives are the same for both forward modes. The number of directions for the vector mode is determined by a static field that resides in the adsupport.GradMax class. The value of this field defaults to 1 and must be set to the appropriate value before calling the differentiated functions. Being able to specify this size at runtime offers more flexibility than alternative approaches, which require the size to be fixed at compile time. The Grimp body shown next has been obtained by differentiating the compute method from the previous example in the forward vector mode. All of the gradient objects are now arrays, as opposed to scalars. Furthermore, because in general the gradient array (i.e., the pointer itself, not its contents) associated with a dependent variable may be changed inside the function, the gradient of z is encapsulated inside an ArrayBox. As before, we consider the multiplication at line 9 in the Jimple code. The cloned instruction can be found at line 31 in the derivative function. The preceding loop (lines 26–30) updates each component of the derivative vector at line 28.
In this form, the vector mode is usually suboptimal due to the fragmented nature of the loops. However, a dependency analysis allows us to reorder the instructions in the derivative code and coalesce the loops generated for adjacent instructions. Both the dependency analysis and the loop coalescing are performed as separate sweeps over the derivative code. This transformation also brings up the possibility of automatic parallelization of the derivative loops, which will be an objective of future research and implementation efforts.

5. THE REVERSE MODE

Fig. 2. The reverse-mode workflow.
The reverse mode of automatic differentiation is more complex than the forward mode; however, in many cases it is much more efficient. Each user-defined active method will have a corresponding adjoint method. The name of each derivative method consists of the prefix a_ added to the name of the original method. The derivative methods, including their signatures, are generated by the reverse-mode scene transformer. As explained in Section 3.2, the actual code generation is performed by a reverse-mode body transformer. The complete reverse-mode workflow for ADiJaC is illustrated in Figure 2. It requires two major sweeps over the original code, which can be summarized as follows:

—The forward sweep – executes the original instructions, recording intermediate values and control flow information
—The reverse sweep – executes the adjoint instructions in reverse order, using the recorded intermediate values and taking the original control flow into account

Each active variable will have an associated adjoint object in the derivative code. The name of each adjoint object is obtained by appending the suffix _a to the name of the original variable. During the reverse sweep, the derivative information is propagated backward, from the adjoints of the dependent variables to the adjoints of the independent variables. It
is therefore important for the adjoint instructions to have access to any necessary intermediate values computed in the forward sweep. To this end, ADiJaC uses the store-all strategy, which means that all the required intermediate values are recorded onto data structures commonly called tapes. In contrast to other strategies, such as recompute-all, no recomputations are performed, except for some easily reversible integer operations. In order to determine which values are actually necessary in the reverse sweep, ADiJaC employs a to-be-recorded (TBR) analysis [Hascoët et al. 2005], which will be detailed in Section 5.2. At the moment, the reverse mode does not offer checkpointing [Dauvergne and Hascoët 2006; Griewank and Walther 2000] capabilities.

The reverse mode uses several stacks to record information in the forward sweep. Each active user-defined function will have a separate instance of each of these stacks. In contrast to strategies involving global stacks, this approach limits the scopes and lifetimes of these data structures. This increases the likelihood that they will be deallocated by the Garbage Collector during execution, thus improving the memory usage of the reverse-mode implementation. Intermediate values, depending on their type, are stored in the DoubleStack, IntStack, or RefStack, respectively. Control flow information is recorded in the ControlStack; we will expand on this issue in Section 5.3. The stacks are implemented as singly linked lists of contiguous chunks and reside in the adsupport package.

As is the case with the forward mode, using derivative methods obtained through the reverse mode also requires seeding. Most commonly, the adjoints of the independent variables are set to 0.0, whereas the adjoints of the dependent variables are set to 1.0 (corresponding to their derivatives with respect to themselves, ∂f/∂f = 1 for every output f).
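The chunked-stack layout described above can be pictured with a small sketch of our own; the chunk size, names, and methods below are illustrative and do not reflect the actual adsupport.DoubleStack implementation.

    // Sketch of a chunked, double-valued tape: a singly linked list of contiguous chunks.
    class DoubleStackSketch {
        private static final int CHUNK = 4096;
        private static final class Node {
            final double[] data = new double[CHUNK];
            Node next;                       // link to the previously filled chunk
        }
        private Node top = new Node();
        private int pos = 0;                 // next free slot in the top chunk

        void push(double v) {
            if (pos == CHUNK) {              // current chunk is full: start a new one
                Node n = new Node();
                n.next = top;
                top = n;
                pos = 0;
            }
            top.data[pos++] = v;
        }
        double pop() {
            if (pos == 0) {                  // top chunk exhausted: drop back one chunk
                top = top.next;
                pos = CHUNK;
            }
            return top.data[--pos];
        }
    }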
5.1. Code Generation

Performing the forward sweep of the reverse mode involves cloning the original instructions, with the exception of return statements, which are not cloned, as the execution of the derivative function must continue after the forward sweep has completed. In fact, because they are not required to compute the original results, functions generated using ADiJaC's reverse mode do not contain any statements that return values. In addition, we must generate instructions that record the required information on the stacks. When a variable is overwritten, it is sometimes necessary to generate a push statement before the clone of the original instruction, in order to record its value. This ensures that the intermediate values will be available in reverse order during the reverse sweep, thus achieving the necessary data flow reversal.

In the reverse sweep, the adjoint instructions are generated in reverse order with respect to the original code. The adjoint code for an active assignment will update the adjoints of each active variable on the right-hand side of that assignment, unless the statement contains a call to an active user-defined method. The adjoint assignment will be succeeded, if required, by a pop statement that restores the value of the overwritten variable. Therefore, as explained before, the correct value for this variable will be available for subsequent adjoint instructions. It should be noted that such pop statements may appear even if the original instruction is inactive.

For our basic reverse-mode example, we simplify the method used to illustrate the forward mode so that it only contains straight-line code. Control flow reversal will be discussed later in this section. The input Java method is the following:
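Again, the original listing is not reproduced here; a straight-line variant of the earlier example, consistent with the discussion below, might be:

    // Illustrative reconstruction only (not the article's original listing).
    public static double compute(double x) {
        double t = Math.sin(x);
        double z = t * x;
        return z;
    }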
After preprocessing, the corresponding Jimple code will have the following form:
For the sake of clarity, in addition to disabling postprocessing, the TBR analysis has also been disabled for the generation of this code. Therefore, all values that appear on the left-hand side of assignments will be stored on, and subsequently retrieved from, the stack. Similar to the forward mode, the generated reverse-mode Grimp code begins with an initialization phase followed by the forward and reverse sweeps, respectively.
Let us consider the multiplication at line 6 in the Jimple code. In the forward sweep of the reverse mode, a clone of this assignment is generated at line 20 in the resulting code. At line 19 in the Grimp code, the value of the variable on the left-hand side of the assignment (z) is pushed onto the stack. Because both operands on the right-hand side of the assignment are active, both their adjoints are updated in the reverse sweep (lines 23–24). Because z is overwritten in the original assignment, its previous value will no longer influence any active variables; therefore, its adjoint z_a is set to 0.0 at line 25. Finally, at line 26, the value of z is retrieved from the stack.

For interprocedural transformations in the reverse mode, ADiJaC uses the joint reversal strategy. In this approach, for each call to an active user-defined method in the original code, the forward sweep will contain a clone of this call, and the reverse sweep will contain a call to the corresponding adjoint method. This implies that the original instructions will be executed twice for each such call, both as part of the clone of the original method and as part of the forward sweep of the adjoint method. This strategy, although it involves executing more operations, has a lower memory footprint, as the values recorded during different method calls do not need to reside simultaneously on the stacks. As was the case for the forward mode, it is sometimes necessary for calls to adjoint methods to be accompanied by parameter boxing and unboxing. We consider again a scenario in which the compute method is called from another method. The Jimple code in the caller method is the following:
The forward sweep in the generated code will contain an invocation of the original compute method:
In the reverse sweep, the adjoint method is invoked, and the necessary boxing and unboxing are performed:
5.2. To-Be-Recorded Analysis
In previous implementations [Sluşanschi 2008], ADiJaC would store all variables that appeared on the left-hand side of an assignment during the forward sweep. However, this approach was highly inefficient, especially in terms of memory usage; it can be improved by employing a to-be-recorded (TBR) analysis [Hascoët and Araya-Polo 2006; Hascoët et al. 2005; Naumann 2002]. This analysis is specific to the reverse mode and comprises a killed analysis, which determines the set of variables that will be overwritten at a given point in the program, and an adjoint-used analysis, which determines the set of variables that will be used in the adjoint code. A variable is to be recorded at a given point if it is both killed and adjoint used. It is only in such cases that we need to generate push and pop statements.
In our previous example, there are no variables to be recorded; thus, the complexity of the derivative code is greatly reduced when the analysis is enabled. The generated Grimp code becomes:
5.3. Control Flow Reversal
In both the Jimple and Grimp intermediate representations, control flow is unstructured, meaning it is represented exclusively by means of conditional and unconditional jumps (if..goto.. and goto..). In contrast to structured code, where reversing if blocks or for blocks can usually be done by generating similar constructs in the reverse sweep, reversing unstructured code is more challenging.

Our approach to control flow reversal is one that favors generality. It is important for the tool to be able to handle as many types of control flow constructs as possible. In this respect, the reversal is done in an unstructured manner, using a basic block-oriented strategy that operates on the original control flow graph. In other words, ADiJaC does not attempt to recognize structure in the unstructured Jimple code. Control flow reversal is performed in a separate pass over the code, after the forward and reverse sweeps have completed.

Each node BBi in the graph represents a basic block, and there is a directed edge BBi → BBj if control can flow from the basic block denoted by BBi to the basic block denoted by BBj. For each BBi, we refer to the basic block that contains the corresponding reverse-sweep adjoint code as the reverse block of BBi. Using these notations, control flow reversal in ADiJaC can be summarized as follows:

—A unique index is assigned to each basic block in the original control flow graph.
—A new control graph is generated by reversing the edges in the original graph.
—In an iterative process, the basic blocks that have no corresponding adjoint instructions in the reverse sweep and are not exit blocks are removed from the reversed graph.
—Each time a basic block BBk is removed, each pair of edges of the form (BBi → BBk, BBk → BBj) in the reversed graph is replaced by the edge BBi → BBj.
—Each basic block in the forward sweep that has a corresponding block in the resulting reversed graph will push its index on the control stack every time it is executed.
—In the reverse sweep, after the execution of a reverse basic block, an index is popped from the stack to determine which reverse basic block will be executed next.
—Finally, a transition is made to the following executing basic block; this consists of either a flow through or some type of jump.

Depending on the number of successors a basic block has in the reversed graph, the code used for the transition to the next basic block in the reverse sweep will differ:

—If the block has more than two successors, this is represented as a switch statement.
—If the block has two successors, it will consist of an if construct.
—If the block has a single successor, control will either flow through or perform an unconditional jump to the next basic block.

To exemplify ADiJaC's control flow reversal, we consider the following Java method:
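The original example method is not reproduced in this version of the text. Any method that mixes a loop with a conditional exercises the reversal strategy; the hypothetical method below is our own illustration and does not necessarily match the basic-block structure shown in Figure 3.

    // Illustrative only; not the article's original control-flow example.
    public static double compute(double x, int n) {
        double z = x;
        for (int i = 0; i < n; i++) {
            if (z < 10.0) {
                z = z * Math.sin(x);
            }
        }
        return z;
    }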
The method will yield the following preprocessed Jimple code. The comments on the right mark each individual basic block.
Unlike in previous examples, the derivative Grimp code displayed for this method is fully optimized by ADiJaC. This includes applying the TBR analysis and the postprocessing phase.
Fig. 3. Control flow graphs for the reversal example.
In this scenario, the Jimple body consists of six basic blocks. Figure 3(a) displays the control flow graph for the original body. In Figure 3(b), the graph has been reversed and two basic blocks have been marked for removal (BB2 and BB4), as they have no corresponding adjoint instructions in the reverse sweep. In contrast, although BB5 has no corresponding instructions, it is an exit block in the original code, and therefore it will not be removed from the graph. Finally, in Figure 3(c), the reversed graph is reduced to four nodes, and new edges are introduced.

The Grimp code is generated according to these graph transformations. As shown in Figure 3(c), BB5 has three successors in the reversed graph, and therefore a switch statement is generated (lines 46–51 in the Grimp code) in order to determine which block will be executed next. For BB1, a conditional statement (line 64) is generated, as it has only two successors. In the case of BB3, which has a single successor, no conditionals or switches are necessary, and the transition is made using an unconditional jump (line 55). Finally, BB0 represents the exit block for the adjoint code; it simply boxes the result and returns.

It may be argued that this highly generic approach sacrifices performance, especially in the case of simple for loops. However, it has the advantage that it works correctly for arbitrary control flows, including scenarios that cannot appear in Java code, such as unstructured goto statements. It should be mentioned that this strategy for control flow reversal is significantly different from the one used in previous versions of ADiJaC [Sluşanschi 2008]. In the previous approach, the tool attempted to identify common structures such as loops and conditional blocks, which had to conform to a limited set of nesting patterns. Another
Another significant difference has to do with conditional blocks: in the previous approach, the strategy was to store the values used in the condition, retrieve them in the reverse sweep, and evaluate the condition again in order to determine the correct branch. As we have shown, the unstructured approach represents a more generic, and arguably less complex, way of reversing arbitrary control flows.
6. EXPERIMENTAL RESULTS
6.1. Performance Evaluation – MINPACK-2 Problems
Fig. 4. SFI execution time (forward mode).
For our performance evaluation, we used ADiJaC to differentiate several problems (translated into Java) from the MINPACK-2 Test Problem Collection [Averick et al. 1992]. The same problems were translated into the C language and differentiated using the Tapenade [Hascoët and Pascual 2013] AD tool. We measured and compared the execution times and memory footprints of the resulting programs, as well as of the original ones, for various problem sizes. All tests presented in this article were conducted on an Intel Core i7-3770K (3.5GHz) processor with 16GB of RAM, running Ubuntu 13.10 64-bit. OpenJDK 7 and GCC 4.8.1 were used to compile Java and C, respectively. The version of Tapenade used for comparisons was 3.9.
6.1.1. Forward-Mode Performance. For the forward mode, the Solid Fuel Ignition (SFI) and Flow in a Driven Cavity (FDC) problems were chosen, each consisting of a system of nonlinear equations that is evaluated at a certain point x₀. Each problem features an input matrix and an output matrix, both of size n×n. We therefore have n² independent and n² dependent variables. The derivative values of all of the independent variables are seeded to 1.0. We perform a single execution of the derivative code, generated using the forward scalar mode, for each value of n. Consequently, after the execution of the derivative code, the derivative object corresponding to each dependent variable will contain the cumulative derivative with respect to all of the independent variables. The execution costs of the derivative codes are expected to be small multiples of the costs of the original functions.
Figures 4(a) through 7(a) show the execution times and the memory footprints for the original and AD codes as functions of n, for the SFI and FDC problems, respectively. For both problems, the execution time and the memory footprint grow quadratically with n, which is to be expected, since the output vector actually has n² elements and the temporal complexity of the problems is O(n²).
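To make this seeding concrete, the sketch below uses a hypothetical value/derivative pair rather than ADiJaC's actual runtime representation: every independent variable carries a derivative component seeded to 1.0, so a single forward-scalar run leaves each dependent variable holding the sum of its partial derivatives with respect to all inputs.

    // Hypothetical value/derivative pair; ADiJaC's generated classes may differ.
    final class ForwardSeedDemo {
        static final class DValue {
            double val, der;
            DValue(double val, double der) { this.val = val; this.der = der; }
            static DValue add(DValue a, DValue b) { return new DValue(a.val + b.val, a.der + b.der); }
            static DValue mul(DValue a, DValue b) { return new DValue(a.val * b.val, a.der * b.val + a.val * b.der); }
        }
        public static void main(String[] args) {
            DValue x1 = new DValue(2.0, 1.0);   // derivative seeded to 1.0
            DValue x2 = new DValue(3.0, 1.0);   // derivative seeded to 1.0
            // y = x1*x1 + x1*x2
            DValue y = DValue.add(DValue.mul(x1, x1), DValue.mul(x1, x2));
            // y.der = dy/dx1 + dy/dx2 = (2*2.0 + 3.0) + 2.0 = 9.0
            System.out.println(y.val + " " + y.der);   // prints 10.0 9.0
        }
    }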
Fig. 5. SFI memory usage (forward mode).
Fig. 6. FDC execution time (forward mode).
Fig. 7. FDC memory usage (forward mode).
Fig. 8. SSC execution time (reverse mode).
From the same figures, we observe that the original C code is significantly faster than its Java counterpart for the SFI problem, but not for the FDC problem. Finally, the memory consumption of both the original and the AD codes is independent of the language and AD tool, and the memory footprints of the AD codes are consistently higher than those of the corresponding original functions.
In Figures 4(b) through 7(b), we illustrate the relative performance (performance ratios) of the derivative functions with respect to the original ones. Overall, the ratios for the ADiJaC and Tapenade implementations are very similar. The only exception is the execution time for the SFI problem, where the ADiJaC version performed noticeably better, having a consistently smaller performance ratio than the Tapenade-generated code. As expected, the performance ratios either stabilize or oscillate slightly in the vicinity of certain values. In either case, they do not appear to grow significantly when the problem size is increased.
6.1.2. Reverse-Mode Performance. To test the reverse-mode performance, we used the Steady-State Combustion (SSC) and Elastic-Plastic Torsion (EPT) minimization problems. Each function takes n² input parameters and produces a single scalar value (to be minimized). In each case, the generated reverse-mode code computes the partial derivatives of the output variable with respect to each input variable, for a total of n² derivative values. As before, the performance ratios are expected to be small; furthermore, they should not exhibit significant growth with respect to the problem size.
The performance of the SSC and EPT problems is illustrated in Figures 8(a) through 11(a), respectively. As before, the problems have quadratic temporal and spatial complexity with respect to n. However, they are more demanding in terms of both execution time and memory consumption. As shown in Figure 8(a), there is a notable difference between the original execution times in C and Java for the SSC problem, whereas Figure 10(a) shows that there is no such difference for the EPT problem. Another notable aspect is the difference in the memory consumption patterns of the reverse-mode-differentiated functions. As the plots in Figures 9(a) and 11(a) show, the memory usage of the C versions grows more smoothly than that of the Java versions; the irregular behavior of the Java memory allocation can be attributed to the policies of the Java Virtual Machine, which is not optimized for the distinctive, stack-based allocation patterns of adjoint functions.
Fig. 9. SSC memory usage (reverse mode).
Fig. 10. EPT execution time (reverse mode).
Fig. 11. EPT memory usage (reverse mode).
The performance ratios shown in Figures 8(b) through 11(b) are generally larger than the ones in our forward-mode experiments. This is mostly due to the additional overhead introduced by the stack operations. Although the relative performances are similar on average, the ADiJaC versions displayed much more irregular behavior than the Tapenade versions, especially for the memory ratios. Despite these irregularities, the ADiJaC-generated codes managed to achieve small performance ratios that do not seem to grow significantly with the problem size. The only case in which the Tapenade-generated code performed consistently better was the execution time for the SSC problem.
6.2. Case Study – AdaptiveAmberFF
In this section, we present a detailed evaluation of a more complex application: the AdaptiveAmberFF energy function from the OGOLEM framework for computational chemistry [Dieterich and Hartke 2010]. Apart from presenting performance results for this problem, we use it to illustrate the benefits of the newly introduced static analyses and the behavior of the Java Virtual Machine memory management for these types of applications.
The input function is the objective of an optimization problem: it computes a scalar energy value corresponding to a configuration of atoms. The properties of the configuration are given by many different parameters; however, for our purposes, we only differentiate with respect to the positions of the atoms. The problem size in our performance analysis is given by the number of atoms, denoted by n. The position of each atom is given by three Cartesian coordinates; therefore, we have 3n independent variables. As there is a single scalar dependent variable (the energy), the reverse mode is more suitable for the differentiation of this problem.
The energy calculated by AdaptiveAmberFF is composed of four additive terms, each computed by a partialInteraction function in objects of different types. Each of these partial interactions is differentiated separately, and the results are added up to obtain the total derivatives. The complete call graph of active methods is displayed in Figure 12, where each box represents a different class. We consider this problem to be well suited for such an analysis, as it is computationally demanding and uses several interesting ADiJaC features, including interprocedural transformations, multidimensional arrays, and unstructured control flow.
6.2.1. Performance Evaluation. In the performance evaluation for this problem, we compare the execution times and memory footprints of three functions, for increasing problem sizes (numbers of atoms). The three functions are:
—The original AdaptiveAmberFF energy function, part of the OGOLEM library
—An analytical derivative function, also part of the OGOLEM library
—A derivative function obtained via reverse-mode automatic differentiation using ADiJaC
The execution times and memory footprints for these functions are displayed in Figure 13. The execution times for all three functions are close to identical. In terms of memory consumption, the AD version performed consistently better than the analytical one.
6.2.2. Effects of Static Analyses. We use the AdaptiveAmberFF problem to measure and illustrate the performance improvements enabled by the activity and TBR static analyses. The same batch of tests was run using two derivative functions generated by ADiJaC: the fully optimized adjoint function used in the previous tests and an adjoint function for which the static analyses were disabled.
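To illustrate what the TBR ("to be recorded") analysis saves, the hand-written fragment below adjoins a hypothetical two-statement computation; it is not ADiJaC output, and the stack type and variable names are illustrative. Only the value overwritten by the nonlinear statement needs to be pushed; without TBR, the value overwritten by the linear statement would also be pushed and popped, even though the adjoint never reads it.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hand-written sketch of the effect of the TBR analysis on the hypothetical
    // fragment:  x = x * x;  x = 2.0 * x;  y = x;   (not ADiJaC output)
    final class TbrSketch {
        static double dydx(double x) {
            Deque<Double> stack = new ArrayDeque<>();
            // --- forward sweep ---
            stack.push(x);        // TBR: old x is required by the adjoint of the nonlinear statement
            x = x * x;
            x = 2.0 * x;          // linear: its adjoint never reads the old x, so nothing is pushed
            double y = x;
            // --- reverse sweep (seed yb = 1.0) ---
            double yb = 1.0, xb = 0.0;
            xb += yb;             // adjoint of  y = x
            xb = 2.0 * xb;        // adjoint of  x = 2.0 * x  (no restore needed)
            x = stack.pop();      // restore x to its value before  x = x * x
            xb = 2.0 * x * xb;    // adjoint of  x = x * x
            return xb;            // dy/dx = 4 * (input x)
        }
        public static void main(String[] args) {
            System.out.println(dydx(3.0));   // prints 12.0
        }
    }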
Fig. 12. Active call graph for the AdaptiveAmberFF problem.
Fig. 13. AdaptiveAmberFF performance (reverse mode).
The results displayed in Figure 14 show that the optimized version performed significantly better. On average, it ran consistently faster and required between 3 and 4 times less memory than the unoptimized version. It should be noted that the speedup decreases with the problem size, whereas the memory improvement is more pronounced, as the main effect of the TBR analysis is to reduce the size of the stacks. Figure 15 compares the total number of stack operations for the optimized and unoptimized versions of the adjoint function. Regardless of the problem size, the ratio of push/pop pairs between the unoptimized and the optimized versions is consistently close to 4.
Fig. 14. Performance improvements due to static analyses for AdaptiveAmberFF.
Fig. 15. Effects of TBR analysis on the number of stack operations for AdaptiveAmberFF.
6.2.3. Effects of Java Virtual Machine Memory Allocation. Throughout our performance tests, we have frequently observed a distinctive behavior of adjoint functions with respect to the way in which the Java Virtual Machine manages memory [Lindholm et al. 2013; Stark et al. 2001; Venners 1996]. To illustrate this behavior, we measured the memory usage for the AdaptiveAmberFF problem as reported by the methods of the java.lang.Runtime class. We compared these footprints with the ones obtained in the previous tests, as reported by the operating system. The two sets of measurements are displayed in Figure 16.
As the results show, the memory footprint reported by the operating system is, on average, consistently larger than the one reported by the JVM. In addition, the OS-reported memory usage as a function of the problem size behaves irregularly: it is not monotonic and frequently grows in large steps. In contrast, the memory utilization reported by the JVM grows in a much smoother and more predictable fashion.
This behavior is caused by the JVM's generational memory management strategy. The JVM tends to allocate memory in increasingly large quantities and to deallocate it automatically at certain points during execution. In contrast to languages such as C, the user has little control over memory allocation and deallocation. Additionally, it is difficult for the JVM to determine precisely and efficiently how much memory it should allocate at a given point in time, and which objects can be safely deallocated.
Fig. 16. Memory footprints as reported by the operating system and the JVM for AdaptiveAmberFF.
Therefore, there are situations where additional memory is allocated that will never be useful, and where some objects are not deallocated even though they will no longer be used. Most JVM implementations feature command-line options that provide a certain degree of control over memory management; these allow altering the growth of the memory pool between generations and the behavior of the garbage collector. Depending on the application, it may be possible to improve memory management using such features; however, this type of solution is usually not portable.
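The JVM-side measurements discussed above rely on the methods of java.lang.Runtime; a minimal sketch of such a measurement follows. The workload and the sampling points are illustrative assumptions, not the article's exact harness.

    // Minimal sketch of sampling JVM-reported heap usage with java.lang.Runtime.
    final class MemoryProbe {
        static long usedBytes() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();   // heap currently in use
        }
        public static void main(String[] args) {
            long before = usedBytes();
            double[] work = new double[1 << 20];          // stand-in for an adjoint computation
            for (int i = 0; i < work.length; i++) {
                work[i] = Math.sin(i);
            }
            long after = usedBytes();
            System.out.printf("JVM-reported memory delta: %d bytes%n", after - before);
        }
    }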
7. CONCLUSIONS AND OUTLOOK
Automatic differentiation is an efficient, accurate, and flexible alternative to numeric and symbolic differentiation. ADiJaC continues to be the only general-purpose AD tool for Java bytecode. It offers bytecode-to-bytecode automatic differentiation in both the forward (scalar and vector) and the reverse mode.
In this work, we have presented the current architecture and implementation of the ADiJaC tool. Apart from having undergone a major refactoring that simplified its code, the tool is now able to process a more diverse set of language features and can produce significantly more efficient derivative codes. Among the new features, we consider the support for interprocedural transformations and arbitrary (unstructured) control flow in the reverse mode to be the most important. The latter required a completely new approach to control flow reversal, one that is based on basic block graphs rather than being instruction oriented as it was before.
The most powerful enhancement in terms of performance is the introduction of AD-specific static analyses, namely activity and TBR. The former is useful in both modes of differentiation and allows ADiJaC to generate fewer derivative objects and instructions. The latter is specific to the reverse mode and greatly diminishes the number of objects that are saved on and restored from the stacks. An interesting approach to reducing the overhead of using stacks in the reverse mode is given in Willkomm et al. [2015]; we plan to explore the possibility of using the RIOS software in conjunction with ADiJaC.
ADiJaC has been tested on several problems from the MINPACK-2 collection. The performance of its generated codes has been compared to that of C implementations of the same problems, differentiated using the Tapenade tool. Although in terms of absolute performance the C implementations generally fared better, the derivative-to-original function performance ratios were similar, and actually favored ADiJaC in some cases.
This shows that even though Java is generally less efficient than languages such as C, ADiJaC-generated derivative codes can provide very good performance relative to the original functions.
Our case study of the AdaptiveAmberFF problem allowed us to illustrate performance issues for a complex, real-world application. We used this study to show the concrete performance impact of the static analyses and to discuss the effects of JVM memory management on the behavior of AD functions.
The ADiJaC tool is not open source; it is currently available for testing at ADiJaC [2016]. More information on ADiJaC can also be found on the Community Portal for Automatic Differentiation [Autodiff 2016].
ADiJaC's current design makes it easily extensible. In this respect, a module for generating codes for second derivatives [Martinelli and Hascoët 2008; Ghate and Giles 2007; Abate et al. 1997] is currently under development. Additionally, it may be useful to add support for a multiobjective (vector) reverse mode, similar to tools such as Tapenade [Hascoët and Pascual 2013]. Automatic parallelization is also a possible future enhancement for the forward vector mode. In terms of language coverage, we plan to offer support for active class and instance fields in future releases of ADiJaC. Finally, the fact that ADiJaC performs bytecode-to-bytecode transformations is not fully exploited. We intend to explore the possibility of differentiating codes written in languages other than Java, such as Scala [Odersky et al. 2016] or Clojure [Hickey 2008], which also compile into bytecode.
REFERENCES
Jason Abate, Christian H. Bischof, Alan Carle, and Lucas Roh. 1997. Algorithms and design for a second-order automatic differentiation module. In International Symposium on Symbolic and Algebraic Computing. SIAM, Philadelphia, PA, 149–155.
ADiJaC. 2016. ADiJaC Website. (2016). http://adijac.cs.pub.ro. Accessed January 2016.
Ali-Reza Adl-Tabatabai, Michał Cierniak, Guei-Yuan Lueh, Vishesh M. Parikh, and James M. Stichnoth. 1998. Fast, effective code generation in a just-in-time Java compiler. SIGPLAN Notices 33, 5 (1998), 280–290.
Autodiff. 2016. Community Portal for Automatic Differentiation. (2016). http://www.autodiff.org.
Brett M. Averick, Richard G. Carter, Jorge J. Moré, and Guo L. Xue. 1992. The Minpack-2 Test Problem Collection. Argonne National Laboratory, Mathematics and Computer Science Division, ANL/MCS-P153-0692.
Bradley M. Bell. 2016. A Package for Differentiation of C++ Algorithms. (2016). http://www.coin-or.org/CppAD/. Accessed January 19, 2016.
Bradley M. Bell and James V. Burke. 2008. Algorithmic differentiation of implicit functions and optimal values. In Advances in Automatic Differentiation. Springer, 67–77.
Claus Bendtsen and Ole Stauning. 1996. FADBAD, a Flexible C++ Package for Automatic Differentiation. Technical Report 17. Department of Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark.
Martin Berz, Christian H. Bischof, and George Corliss. 1996. Computational differentiation: Techniques, applications, and tools. In SIAM Proceedings in Applied Mathematics (1996).
Christian H. Bischof, Ali Bouaricha, Alan Carle, and Peyvand Khademi. 1996a. Efficient computation of gradients and Jacobians by transparent exploitation of sparsity in automatic differentiation. Optimization Methods and Software 7 (1996), 1–39.
Christian H. Bischof and H. Martin Bücker. 2000. Computing Derivatives of Computer Programs. Vol. 3. John von Neumann Institute for Computing. 315–327 pages.
Christian H. Bischof, H. Martin Bücker, Bruno Lang, Arno Rasch, and André Vehreschild. 2002. Combining source transformation and operator overloading techniques to compute derivatives for MATLAB programs. In Proceedings of the 2nd IEEE International Workshop on Source Code Analysis and Manipulation. IEEE Computer Society, Los Alamitos, CA, 65–72.
Christian H. Bischof, Alan Carle, George Corliss, and Andreas Griewank. 1992. ADIFOR: Automatic differentiation in a source translator environment. International Symposium on Symbolic and Algebraic Computing 92 (1992), 294–302.
Christian H. Bischof, Alan Carle, Peyvand Khademi, and Andrew Mauer. 1996b. ADIFOR 2.0: Automatic differentiation of Fortran 77 programs. IEEE Computational Science & Engineering 3, 3 (1996), 18–32.
Christian H. Bischof, Alan Carle, Peyvand Khademi, and Gordon Pusch. 1995. Automatic differentiation: Obtaining fast and reliable derivatives – fast. In Control Problems in Industry. Springer, 1–16.
Christian H. Bischof and Andreas Griewank. 1992. ADIFOR: A Fortran system for portable automatic differentiation. In Proceedings of the 4th AIAA/USAF/NASA/OAI Symposium on Multidisciplinary Analysis and Optimization (1992), 433–441.
Christian H. Bischof, Lucas Roh, and Andrew Mauer-Oats. 1997. ADIC: An extensible automatic differentiation tool for ANSI-C. Software: Practice and Experience 27 (1997), 1427–1456.
Ronald F. Boisvert, José Moreira, Michael Philippsen, and Roldan Pozo. 2001. Java and numerical computing. Computing in Science & Engineering 3, 2 (2001), 18–24.
H. Martin Bücker, George Corliss, Paul Hovland, Uwe Naumann, and Boyana Norris. 2006. Automatic Differentiation: Applications, Theory, and Implementations. Vol. 50. Springer.
J. Mark Bull, Lorna A. Smith, L. Pottage, and Robin Freeman. 2001. Benchmarking Java against C and Fortran for scientific applications. In Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande (JGI'01). ACM, 97–105.
Richard L. Burden and J. Douglas Faires. 2001. Numerical Analysis. Brooks/Cole (2001).
Benjamin Dauvergne and Laurent Hascoët. 2006. The data-flow equations of checkpointing in reverse automatic differentiation. In Computational Science (ICCS'06). Springer, 566–573.
Johannes M. Dieterich and Bernd Hartke. 2010. OGOLEM: Global cluster structure optimization for arbitrary mixtures of flexible molecules. A multiscaling, object-oriented approach. Molecular Physics 108, 3–4 (2010), 279–291.
Arni Einarsson and Janus Nielsen. 2008. A survivor's guide to Java program analysis with Soot. BRICS (2008).
Vincent Fischer, Laurent Gerbaud, and Frédéric Wurtz. 2005. Using automatic code differentiation for optimization. IEEE Transactions on Magnetics 41, 5 (2005), 1812–1815.
Shaun A. Forth. 2006. An efficient overloaded implementation of forward mode automatic differentiation in MATLAB. ACM Transactions on Mathematical Software 32, 2 (2006), 195–222.
Assefaw Hadish Gebremedhin, Fredrik Manne, and Alex Pothen. 2005. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review 47, 4 (2005), 629–705.
Keith O. Geddes, Bruce W. Char, Gaston H. Gonnet, Benton L. Leong, Michael B. Monagan, and Stephen M. Watt. 1993. Maple V: Language Reference Manual. Springer.
Devendra Ghate and Michael B. Giles. 2007. Efficient Hessian calculation using automatic differentiation. In 25th AIAA Applied Aerodynamics Conference 4059 (2007).
Ralf Giering and Thomas Kaminski. 1998a. Recipes for adjoint code construction. ACM Transactions on Mathematical Software 24, 4 (1998), 437–474.
Ralf Giering and Thomas Kaminski. 1998b. Using TAMC to generate efficient adjoint code: Comparison of automatically generated code for evaluation of first and second order derivatives to hand written code from the Minpack-2 collection. Automatic Differentiation for Adjoint Code Generation 3555 (1998), 31–37.
Ralf Giering and Thomas Kaminski. 2003. Applying TAF to generate efficient derivative code of Fortran 77-95 programs. PAMM 2, 1 (2003), 54–57.
James Gosling, Bill Joy, Guy L. Steele, Gilad Bracha, and Alex Buckley. 2013. The Java Language Specification: Java SE 7 Ed. Prentice Hall.
Andreas Griewank. 1989. On automatic differentiation. In Mathematical Programming: Recent Developments and Applications. Kluwer Academic Publishers, 83–108.
Andreas Griewank. 2003. A mathematical view of automatic differentiation. Acta Numerica 12 (2003), 321–398.
Andreas Griewank, David Juedes, and Jean Utke. 1996. Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software 22, 2 (1996), 131–167.
Andreas Griewank and Andrea Walther. 2000. Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software 26, 1 (2000), 19–45.
Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.
Laurent Hascoët. 2009. Reversal strategies for adjoint algorithms. In From Semantics to Computer Science. Essays in Memory of Gilles Kahn. Cambridge University Press, 487–503.
Laurent Hascoët and Mauricio Araya-Polo. 2006. The adjoint data-flow analyses: Formalization, properties, and applications. In Automatic Differentiation: Applications, Theory, and Implementations. Springer, 135–146.
Laurent Hascoët, Uwe Naumann, and Valérie Pascual. 2005. "To be recorded" analysis in reverse-mode automatic differentiation. Future Generation Computer Systems 21, 8 (2005), 1401–1417.
Laurent Hascoët and Valérie Pascual. 2013. The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Transactions on Mathematical Software 39, 3 (2013), 20:1–20:43.
Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, and Eric T. Phipps. 2005. An overview of the Trilinos project. ACM Transactions on Mathematical Software 31, 3 (2005), 397–423.
Rich Hickey. 2008. The Clojure programming language. In Proceedings of the 2008 Symposium on Dynamic Languages. ACM.
Robin J. Hogan. 2014. Fast reverse-mode automatic differentiation using expression templates in C++. ACM Transactions on Mathematical Software 40, 4 (2014), 26:1–26:24.
JAutoDiff. 2014. JAutoDiff: A pure Java library for automatic differentiation. (2014). https://github.com/uniker9/JAutoDiff.
Michael Kofler. 1997. Maple: An Introduction and Reference. Addison-Wesley Longman Publishing Co.
Daniel Kwiecinski. 2016. Deriva. (2016). https://github.com/lambder/Deriva. Accessed January 2016.
Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. 2013. The Java Virtual Machine Specification. Addison-Wesley.
Roman E. Maeder. 1991. Programming in Mathematica. Addison-Wesley Longman Publishing Co.
Kyoko Makino and Martin Berz. 2006. COSY INFINITY version 9. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors, and Associated Equipment 558, 1 (2006), 346–350.
Massimiliano Martinelli and Laurent Hascoët. 2008. Tangent-on-tangent vs. tangent-on-reverse for second differentiation of constrained functionals. In Advances in Automatic Differentiation. Springer, 151–161.
Michael Monagan and René R. Rodoni. 1996. An implementation of the forward and reverse mode in Maple. In Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, 353–362.
María E. Portillo Montiel, Nelson Arapé, and Pirela Morillo. 2013. Biblioteca de diferenciación automática para la máquina virtual de Java (Automatic differentiation library for the Java virtual machine). Revista Tecnocientífica Universidad Rafael Urdaneta 5 (2013), 11–25.
Jose E. Moreira, Samuel P. Midkiff, Manish Gupta, Pedro V. Artigas, Marc Snir, and Richard D. Lawrence. 2000. Java programming for high-performance numerical computing. IBM Systems Journal 39, 1 (2000), 21–56.
Sri Hari Krishna Narayanan, Boyana Norris, and Beata Winnicka. 2010. ADIC2: Development of a component source transformation system for differentiating C and C++. Procedia Computer Science 1, 1 (2010), 1845–1853.
Uwe Naumann. 2002. Reducing the memory requirement in reverse mode automatic differentiation by solving TBR flow equations. In Computational Science (ICCS'02), 1039–1048.
Uwe Naumann. 2008. Optimal Jacobian accumulation is NP-complete. Mathematical Programming 112, 2 (2008), 427–441.
Uwe Naumann. 2011. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. SIAM.
Uwe Naumann, Jean Utke, Andrew Lyons, and Michael Fagan. 2004. Control flow reversal for adjoint code generation. In 4th IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, 55–64.
Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. 2016. The Scala language specification. (2016). http://www.scala-lang.org/files/archive/spec/2.11/. Accessed January 19, 2016.
Valérie Pascual and Laurent Hascoët. 2008. TAPENADE for C. In Advances in Automatic Differentiation. Springer, 199–209.
Phuong Pham-Quang and Benoit Delinchant. 2012. Java automatic differentiation tool using virtual operator overloading. In Recent Advances in Algorithmic Differentiation. Springer, 241–250.
Louis B. Rall. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science, Vol. 120. Springer, Berlin.
Michael I. Schwartzbach. 2008. Lecture notes on static analysis. University of Aarhus, Denmark.
Dimitri Shiriaev and Andreas Griewank. 1996. ADOL-F: Automatic differentiation of Fortran codes. In Computational Differentiation: Techniques, Applications, and Tools. 375–384.
Rune Skjelvik. 2001. Automatic differentiation in Java. Department of Informatics, University of Bergen, Norway.
Emil-Ioan Sluşanschi. 2008. Algorithmic differentiation of Java programs. Universitätsbibliothek, RWTH Aachen University.
Robert F. Stark, Egon Borger, and Joachim Schmid. 2001. Java and the Java Virtual Machine: Definition, Verification, Validation. Springer-Verlag New York, Secaucus, NJ.
Christian W. Straka. 2005. ADF95: Tool for automatic differentiation of a Fortran code designed for large numbers of independent variables. Computer Physics Communications 168, 2 (2005), 123–139.
Jean Utke, Uwe Naumann, Mike Fagan, Nathan Tallent, Michelle Strout, Patrick Heimbach, Chris Hill, and Carl Wunsch. 2008. OpenAD/F: A modular open-source tool for automatic differentiation of Fortran codes. ACM Transactions on Mathematical Software 34, 4 (2008), 18.
Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay Sundaresan. 1999. Soot – A Java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research (1999), 13.
Bill Venners. 1996. Inside the Java Virtual Machine. McGraw-Hill.
Arun Verma. 1999. ADMAT: Automatic differentiation in MATLAB using object oriented methods. In SIAM Interdisciplinary Workshop on Object Oriented Methods for Interoperability (1999), 174–183.
Johannes Willkomm, Christian H. Bischof, and H. Martin Bücker. 2014. A new user interface for ADiMat: Toward accurate and efficient derivatives of MATLAB programs with ease of use. International Journal of Computational Science and Engineering 9, 5/6 (2014), 408–415.
Johannes Willkomm, Christian H. Bischof, and H. Martin Bücker. 2015. RIOS: Efficient I/O in reverse direction. Software: Practice and Experience 45, 10 (2015), 1399–1427.
Stephen Wolfram. 1991. Mathematica. Addison-Wesley.
Received August 2015; revised February 2016; accepted March 2016