Accelerated Solution of Moral Hazard Problems through the Swift Grid Scripting System

Tiberiu Stef-Praun, Gabriel A. Madeira, Ian Foster, Robert Townsend et al.
University of Chicago, Computation Institute and the Dept. of Economics

August 15, 2007

Abstract

The study of human interactions and the modeling of social structures often need large-scale experiments that involve complex models and large computational resources. These are very taxing requirements placed on the researchers, and they often limit the scale of the investigation quite significantly. We introduce the Swift virtual application infrastructure and the SwiftScript workflow language as the means to enable researchers to set up and manage large experiments that need to execute on distributed resources such as academic Grids. We propose a computation model where the various components of the experiment are distributed across computational resources, and the researcher expresses the experiment logic in SwiftScript through input-output (file) dependencies. We present the features of the Swift infrastructure by describing them in relation to a successful ongoing project conducted by researchers from the Economics Department at the University of Chicago. The obvious benefits (such as significant speedup and reduced experiment management effort) in a "(computationally) heavyweight" social science such as Economics suggest the fitness of SwiftScript for the other research domains in the social sciences.
1 Introduction
The study of human interactions, as addressed in various fields of the social sciences, can generally be characterized by large sets of unstructured input data, by complex models of the actors involved, by sophisticated interactions between the actors, and by complex dependencies between the actors and the social groups they belong to. Researchers working in these domains often have the daunting task of modeling the complex nature of humans and societies, and of producing realistic models, theories, and results. Additionally, much more work is required to replicate the scale and complexity of the interactions and to map the components of the model to physical computing resources. Generic experiment management tasks consist of modeling the actors' behaviors, expressing the interactions and, essential for the topic of this paper, setting up and managing the logical dependency constraints and the physical resources needed to run the models. The complexity of the setup and management phase is often the cause for using restricted-scale models, with limited validation of the results, and frustration for the researcher conducting the experiment.
Our research addresses such issues by providing an easy-to-use workflow scripting language, which we call SwiftScript, and a transparent resource virtualization infrastructure, which we call Swift, that provides the support on which SwiftScript executes. With Swift and SwiftScript at hand, the researcher only needs to focus on providing the application modules that make up his research and on describing (in SwiftScript) the interactions between those modules (in terms of inputs, outputs, and their dependencies). As soon as the applications are installed on the physical resources (such as computing clusters or research Grids - Teragrid, OSG) that will be used in the research experiment, the whole management of the computing resources and of the workflow execution is handled transparently by Swift. The Swift system has evolved from the GriPhyN Virtual Data System (VDS) [13], a computational infrastructure designed to automate the processing of large data sets generated by high energy physics experiments. The functionality of Swift's execution engine is similar to that of the VDS scheduler, the Pegasus [2] workflow planner, in the sense that both map the execution of the components of a given workflow onto Grid resources. Other workflow and orchestration systems that (unsurprisingly) provide equivalent functionality are Taverna [9], Kepler [6], and others. These systems are built around web services orchestration (in the style of BPEL), while Swift has focused on executing Grid applications. Other distributed application execution frameworks include GenePattern [11], which takes a graphical approach to expressing the workflow, and the MapReduce [1] system, which has been optimized for high-volume parallel data processing on Google's dedicated infrastructure. Swift differs through its XDTM-based support for heterogeneous data sources and through its support for task-parallel and data-parallel execution.
The social sciences research application of Swift that we discuss here is an economic problem of dynamic optimal choice of organization under moral hazard assumptions. Worker agents (entities) can be organized in groups, with good internal information, or in a relative-performance individualistic regime, which is informationally more opaque. Faced with various choices of production and organization, and with various payoffs from the entrepreneur agent, the worker agents use linear programming over a discrete grid of parameters to compute their optimal behavior. The fragmentation of the parameter space allowed us to employ a distributed approach to solving the problem, and we took advantage of the independent nature of the subproblems by solving them in parallel on the Grid. The rest of the paper is organized as follows. Section two introduces the concepts of virtualizing applications through Swift and the way of expressing complex application logic using SwiftScript. Section three discusses an example application from the economics domain, introduces the socio-economic problem that we have implemented in Swift, and hints at the complexity of the Moral Hazard model. Section four presents implementation details of the economic problem and explains its execution in a Grid system. We conclude with a discussion of the future of Swift and its role in enabling large-scale computations on the Grid.
2 Swift and SwiftScript
In this section we introduce the language constructs used for expressing application workflows, and we present the Grid execution infrastructure that enables a distributed execution of the workflow's components. The SwiftScript language constructs have been designed to address the expressivity requirements of modern applications, and the Swift engine that interprets the workflow transparently tries to achieve the highest degree of parallelism in executing the workflow components. The dependency constraints that limit the parallelism of a workflow are expressed in SwiftScript through (file) objects that are both the output of one component and the input of the dependent component(s).
2.1 The SwiftScript Language
Starting from the observation that a relatively simple set of programmatic constructs - control-flow operators (foreach, if, while), variable manipulation (typing, declarations, assignments), data structures, and support for procedures - is expressive enough to address the needs of fairly complex computational tasks, we developed the SwiftScript workflow language to encompass all these features. One can describe SwiftScript as an extended scripting language designed for executing applications in distributed environments. One feature of SwiftScript addresses the fact that the unstructured data that often characterizes the social sciences (for instance, combinations of data and metadata, tags, and relationship networks) sometimes needs to be handled as a whole. SwiftScript handles this case by making use of the XDTM (XML Data Type and Mapping) specification for "messy" data. Starting from basic data types such as file (mostly used for inputs and outputs), integer, string, and date, one can declare and use data structures (arrays, structs) as in any other programming language. In addition to the programmatic constructs mentioned above, SwiftScript is organized as a procedural language, and its programs can be seen as having three hierarchical elements. The first element in the hierarchy is the atomic procedure, an abstraction of the software components that have been installed on the remote Grid nodes and that are documented in a site catalog. At run time, the Swift workflow execution engine chooses from the catalog one of the Grid resources that has a specific application installed, and executes it. The second element in the SwiftScript code is the compound procedure, generally used to provide an encapsulation of the problem logic around the atomic procedures. In the most obvious example, compound procedures consist of loops that consume parameter sets or input files by passing them to atomic procedures.
The highest level in the hierarchy is the complete problem, expressed as a workflow that combines all the atomic and compound procedures. This can be thought of as the main() procedure of a C program.
2.2 Distributed Computing with Swift
The distributed nature of our tool is handled by introducing mapper constructs that connect the logical (programmatic) data representation to the physical (file) entity containing the data. This allows the Swift environment to virtualize the data resources, both in the sense that their location is abstracted away from the SwiftScript programmer, and in the sense that the state of these mapped entities determines the parallel execution of the application. The Swift engine executes in parallel all the workflow components that have their inputs available, and as more outputs are generated, the components depending on these generated objects (usually files) are also sent for execution. This data-driven dependency makes SwiftScript a flow language that is optimized for parallel execution. The execution of the workflow components (most of the time, applications that process the researcher's inputs) is handled by provider extensions, which address both the data (file) transfer in and out of the remote execution sites, and the remote execution invocation and status management. Powered by the Karajan [4] just-in-time execution engine, the Swift virtualizing infrastructure makes the complexities of distributed and Grid computing almost fully transparent to the user, through its reliance on the Globus [3] Grid middleware and the CoG Kit [4] client libraries for Globus. At run time, the mapping of the executable application components of the workflow onto Grid sites is done transparently by an internal resource scheduler; this uses as an input a site catalog that contains a list of the sites where all the application components have been previously installed. The installation of the application is a
one-time process; after that, any workflow invoking that specific application can be mapped on the fly onto the corresponding sites. Once there is a SwiftScript description of the problem, the Swift engine uses its resource providers to transparently send (stage in) the problem's input files to the sites (clusters) where the execution will take place, and to manage the execution and retrieve (stage out) any outputs needed by the researcher. The experiment execution management happens behind the scenes; all the researcher needs to worry about is providing the input files and declaring, in the site catalog, a set of sites that are capable of executing the problem components. The one-time installation and its further reuse is a very attractive feature of Swift, as it encourages collaboration and code reuse. On a larger scale, we like to think of Swift as the infrastructure that will enable research communities.
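This dependency-driven behaviour can be sketched in a few lines of plain Python (a toy analogue of the Swift engine, not its actual implementation; solve and merge are hypothetical placeholders for remotely executed applications, and futures stand in for Swift's mapped files):

```python
from concurrent.futures import ThreadPoolExecutor

def solve(point):
    # stand-in for a remote atomic procedure: "solves" one grid point
    return point * point

def merge(parts):
    # stand-in for the merging step that combines partial solutions
    return sum(parts)

with ThreadPoolExecutor(max_workers=4) as pool:
    # independent grid points run in parallel as soon as they are submitted
    futures = [pool.submit(solve, p) for p in range(8)]
    # merge depends on all solver outputs, so it runs only once they resolve
    partial = [f.result() for f in futures]
    total = merge(partial)

print(total)  # 0 + 1 + 4 + ... + 49 = 140
```

As in Swift, nothing in the script orders the individual solves; only the data dependency between the solver outputs and the merge step constrains the execution.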
3 Economic Application
Following Madeira and Townsend [7], we introduce a model of social interaction based on economic principles: consider one entity (the entrepreneur) being in control of some resources and entering into a business contract with other entities (the workers) that will use these resources to produce outputs. There are two organizational forms available: one where the workers cooperate in their efforts and divide up their income (thus sharing risks), and another where the workers are independent of each other and are rewarded based on their relative performance. Both of these are stylized versions of what is observed in tenancy data in villages such as those in Maharashtra, India, described in detail in Townsend and Mueller [12] and Mueller, Prescott and Sumner [8]. This social interaction gets more complicated because the organizational regime in these communities can change over time. In practice, cooperative regimes sometimes break down into individualistic arrangements, and cooperation and risk sharing may emerge from initially competitive communities. This organizational instability is a key element that Madeira and Townsend incorporate in their model. Sociological theorists Leik and Chalkey [5] discuss a list of possible causes of instability documented in this and many other studies: unreliable measurement, external change, inherent instability, and systematic change from endogenous forces. Other development economists have analyzed some of these scenarios and tried to produce theories of groups and networks. The model of Madeira and Townsend formulates the tradeoff between individualistic versus cooperative regimes as a choice of alternative incentive structures under moral hazard: production depends on unobserved effort, and incentives for effort may be provided under both individualistic and cooperative arrangements. They show numerically that these two types of arrangements may coexist and be interchangeable.
3.1 Moral Hazard Problem Model
In the current model, we consider three actors: two agents and a principal. The agents' preferences are described by discounted expected utilities over consumption c and effort e. The utility of agent i at period t is

w_{it} \equiv E\left\{ \sum_{s=t}^{T} \beta^{s-t} \left[ U(c_i^s) + V(e_i^s) \right] \right\} \qquad (1)
The parameter β represents a subjective discount factor. There is a production technology function which maps the agents' efforts into output, and the model represents this as a probability distribution of outputs given the efforts of both agents:

p(q_1^t, q_2^t \mid e_1^t, e_2^t) > 0 \qquad (2)
The entrepreneur's share (or profit) is given by the surplus of production over consumption:

S_t \equiv \sum_{s=t}^{T} \left( \frac{1}{1+r} \right)^{s} \left[ q_1^s + q_2^s - c_1^s - c_2^s \right] \qquad (3)
The parameter r is an exogenous interest rate. The variables of the model are discretized, which makes it possible to employ linear programming in the solution. This allows the use of lotteries as optimal policies and produces a reliable solution (conditional on the grid). The dimensionality of the problem depends on the sizes of the grids of consumption C, efforts E, and outputs Q of the agents, and also on the set of possible organizational forms O (cooperative groups, with the corresponding power balances within them, or relative performance). The model is solved for a grid of current utility pairs for the agents, W. The elements of W are initial conditions for the solution, but they also define the set of possible states in any future period. In practice, the future values of elements of W are part of current policies: promises for the future are part of the incentives given today to motivate effort, and the resulting dynamics on the elements of this set drive the whole organizational history generated by the model. The model was solved with the following cardinality (parameter granularity) measures: |Q| = 2, |E| = 2, |C| = 18, |W| = 30, |O| = 102 (there are 101 possible values of Pareto weights defining the internal balance of power inside groups, plus the possibility of relative performance).
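These cardinalities determine the sizes of the linear programs reported in Section 3.2; a quick Python check reproduces several of the program sizes quoted there. The factorizations below are our own reading of how the sizes arise from the grids (the paper does not state them explicitly), so treat them as an illustration rather than a specification:

```python
# Grid cardinalities reported in the text
Q, E, C, W = 2, 2, 18, 30     # outputs, efforts, consumption, utility pairs
V = 45                        # interim-utility grid (Section 3.2)
Wr = Wg = 40                  # regime-conditional utility grids
PARETO, SURPLUS = 101, 52     # Pareto weights and surplus levels

# Stage one chooses a lottery over consumption pairs and future-utility
# pairs, so a |C|^2 * |W|^2 choice vector matches the reported 291,600:
assert C * C * W * W == 291_600

# Second group-specific program: Pareto weights x surplus levels
assert PARETO * SURPLUS == 5_252

# Final regime-choice stage: pairs drawn from the 40-point grids
assert Wr * Wg == 1_600

print("program sizes consistent with reported cardinalities")
```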
3.2 Moral Hazard Implementation
Because each agent and variable potentially introduces a dimension in the problem domain, the computational requirements grow exponentially with the number of agents and the size of the grids. To avoid the resulting Curse of Dimensionality, the problem is broken into 5 interdependent pieces. Also, a new variable is introduced: interim utility, which summarizes the utility from both current consumption and future arrangements and belongs to a grid set V with cardinality |V| = 45. All of these stages are solved by linear programming. The problem is solved backward. First (last stage chronologically), the balance between promises for the future and consumption to optimally reward agents (to give them interim utilities) is determined. This is a linear program that takes as inputs the surplus of the principal as a function of the future utilities (a 30 × 30 matrix that describes the surplus for each pair in |W| × |W|) and an initial pair of interim utilities, and determines the optimal probability distribution over elements of C and future utilities in W for each agent. This is solved for a grid of |V| × |V| elements, generating as an output a matrix of |V| × |V| elements representing the surplus of the principal for each pair of interim utilities. Each gridpoint of this program is generated by a program with 291,600 elements and a constraint matrix with dimension 3 × 291,600. Next, two programs for groups and one for relative performance need to be solved (the group and relative performance stages can run in parallel, although there is a required sequence between the two group stages). The relative performance program takes as an input the matrix determining the surplus associated with each interim utility pair (the output of the first program), and an initial utility pair conditional on the regime (which for each agent lies in a grid Wr, for which we impose a cardinality of 40). The linear program determines the optimal
joint probability distribution of outputs, efforts, and interim utilities that generates the regime-conditional utilities, subject to a set of technological and incentive constraints (implied by a constraint matrix in the LP program). This program must be solved for a grid of initial values of Wr for each individual, and thus generates as an output a matrix of |Wr| × |Wr| elements representing the surplus under relative performance conditional on initial promises (for the RP regime). Each gridpoint of this program is generated by a program with 10,816 elements and a constraint matrix with dimension 21 × 10,816. The first group-specific program (last chronologically) takes as inputs the matrix determining the surplus associated with each interim utility pair (the output of the first program), an initial Pareto weight, and a surplus level. The linear program determines the utility-maximizing joint probability distribution of outputs, efforts, and interim utilities, subject to a set of technological and incentive constraints. This program is solved for a grid of 101 values of the Pareto weights and 52 values of surplus. The output is two 101 × 52 matrices determining the optimal utility of, respectively, agents one and two given a Pareto weight and a surplus level. Each gridpoint of this program is generated by a program with 10,816 elements and a constraint matrix with dimension 18 × 10,816. The second group-specific program takes as inputs the output matrices of the first group program (the stage just described) and an initial group-specific utility pair (which for each agent lies in a grid Wg, for which we impose a cardinality of 40). It chooses the surplus-maximizing joint distribution of surplus and Pareto weights conditional on an initial group-specific surplus. The output is a 40 × 40 matrix determining the optimal surplus under groups for each initial utility pair in |Wg| × |Wg|.
Each gridpoint of this program is generated by a program with 5,252 elements and a constraint matrix with dimension 3 × 5,252. Finally, a fifth stage (the first chronologically) determines the choice of regime and the utilities in each regime. It has as inputs the outputs of the relative performance program (the 40 × 40 matrix describing the surplus function under RP) and of the second group program (the 40 × 40 matrix describing the surplus function under groups), and also an initial pair of utilities. This program is run for a grid of 30 × 30 elements (corresponding to pairs of the elements in W) and generates as an output a 30 × 30 matrix determining the overall surplus function. This stage has a choice vector with 1,600 elements and a constraint matrix of dimension 3 × 1,600. The output of this last stage can be used, iteratively, as an input to the first stage. This program is run until the surplus function converges; the resulting solution represents the infinite-period solution of the model.
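The iterate-until-convergence scheme can be sketched as a fixed-point loop. The sketch below is a toy stand-in: the chain of five linear-programming stages is replaced by a simple discounted update so the block is self-contained, but the control flow (update the surplus function, measure the change, stop below a tolerance) mirrors the procedure just described:

```python
# Toy sketch of the backward iteration: the surplus "matrix" is a small
# grid, and one_pass() stands in for solving all five LP stages.
GRID = 4          # stand-in for the 30x30 grid of utility pairs
BETA = 0.9        # discounting makes the update a contraction

def one_pass(surplus):
    # stand-in for the chain of stage solutions; in the real model each
    # entry would come from a linear program over lotteries
    return [[1.0 + BETA * surplus[i][j] for j in range(GRID)]
            for i in range(GRID)]

surplus = [[0.0] * GRID for _ in range(GRID)]
for iteration in range(1000):
    updated = one_pass(surplus)
    diff = max(abs(updated[i][j] - surplus[i][j])
               for i in range(GRID) for j in range(GRID))
    surplus = updated
    if diff < 1e-9:
        break

# the fixed point of s -> 1 + BETA*s is 1/(1 - BETA) = 10
print(round(surplus[0][0], 6))  # 10.0
```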
4 Implementation and Measurements
We introduce here the practical aspects of parallelizing applications in a Grid environment by referring to the MoralHazard model that we presented above. We mention the components of the Grid-ified system, and then we show how to use Swift to connect these components.
4.1 Problem Structure and Implementation Details
The moral hazard problem had been implemented in Matlab, and the solving of the linear problem instances was initially handed to CPLEX. Given licensing issues, we replaced these applications with open-source alternatives: Matlab was replaced with Octave, and CPLEX was replaced with CLP from the COIN-OR optimization libraries. The structure of the problem at each stage was similar: the Matlab code would set up the matrices representing the linear programming parameters, and the linear solver would use these matrices to generate an optimal solution. This procedure was replicated for all the points in the parameter grids
that were mentioned previously, and because the problems were independent for each set of parameters, the parallelization procedure was straightforward: have the instances of linear problems from each grid point be solved in parallel on different machines on the Grid. At this point we can define the canonical component that is the unit of parallelization: it is the piece of Matlab code that sets up the linear problem, plus the associated linear solver that produces the solution maximizing the objective function. This functionality is encoded in our workflow in the moralhazard_solver atomic procedure. In the scenario of parallel execution on multiple machines, the prerequisite is that we have both the supporting software (Matlab/Octave and CPLEX/CLP) installed on those machines and the problem instance copied (staged in) to each machine for execution. There is another atomic function which is artificially needed due to the fact that a previously monolithic application will now be solved in pieces in a distributed environment: the solutions-merging function, which puts the partial solutions from the network-distributed solvers back together into a form that the remainder of the Matlab code can easily import. The rest of the parallelization process consists of expressing the general logic of the MoralHazard problem in Swift, and of determining which parts of the original Matlab code go into each canonical distributed code component. Each of these Matlab code components takes as input parameters that determine the point in the grid being solved by the current component instance. This is where we implement the looping over the grid of parameters and the invocation of the atomic procedure that represents the remote solvers.
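The batch decomposition used by the solver procedure (a batch size of 26 and an offset selecting one slice of the grid, as in the stageOne code shown below) can be mimicked in a few lines of Python. The solver and merge functions here are hypothetical placeholders for the Octave + CLP component and the econMerge procedure:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_batch(points):
    # placeholder for the Octave + CLP solver working on one batch of
    # grid points; returns one partial solution per point
    return [p * 2 for p in points]

def econ_merge(partials):
    # placeholder for the merge procedure that reassembles the partial
    # solutions into the form the remaining Matlab/Octave code expects
    merged = []
    for part in partials:
        merged.extend(part)
    return merged

grid = list(range(26 * 26))          # all grid points of one stage
batch_size = 26
batches = [grid[i:i + batch_size] for i in range(0, len(grid), batch_size)]

# each batch is an independent job, so all batches can run in parallel
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(solve_batch, batches))

solution = econ_merge(partials)
assert len(solution) == len(grid)
```

Batching trades off scheduling overhead against parallelism: fewer, larger batches mean fewer remote submissions, while smaller batches expose more concurrency.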
The complete Moral Hazard workflow consists of the five stages connected as described by the logic of the solution, with the dependencies between the stages implemented, as explained above, through the files generated by one stage and consumed by the next one. Each stage of the Moral Hazard problem is represented by a compound procedure that loops over the parameter grid and solves the individual linear optimization problems, followed by one atomic procedure that merges the results for that specific stage. We show below fragments of the workflow code covering the atomic procedure definitions, the first-stage compound procedure definition, and the invocations of stages one and two.

    //define the linear solver
    (file solutions) moralhazard_solver (file scriptfile, int batch_size,
            int batch_no, string inputName, string outputName,
            file inputData[], file prevResults[]) {
        app {
            moralhazard @filename(scriptfile) batch_size batch_no
                inputName outputName;
        }
    }

    //the merge atomic procedure
    (file mergeSolutions[]) econMerge (file merging[]) {
        app {
            econMerge @filenames(mergeSolutions) @filenames(merging);
        }
    }

    //the stage one compound procedure
    (file solutions[]) stageOne (file inputData[], file prevResults[]) {
        file script;
        int batch_size=26;
        int batch_range[]=[0:25];
        string inputName="IRRELEVANT";
        string outputName="stageOneSolverOutput";
        foreach i in batch_range {
            int position=i*batch_size;
            solutions[i]=moralhazard_solver(script, batch_size, position,
                inputName, outputName, inputData, prevResults);
        }
    }
Figure 1: Moral Hazard stages diagram
    //the invocation of the first two stages of the workflow
    file stageOneSolutions[];
    file stageOneInputFiles[];
    file stageOnePrevFiles[];
    stageOneSolutions=stageOne(stageOneInputFiles, stageOnePrevFiles);
    //merge results at stage 1
    file stageOneOutputs[];
    stageOneOutputs=econMerge(stageOneSolutions);

    file stageTwoSolutions[];
    file stageTwoInputFiles[];
    stageTwoSolutions=stageTwo(stageTwoInputFiles, stageOneOutputs);
    //merge results at stage 2
    file stageTwoOutputs[];
    stageTwoOutputs=econMerge(stageTwoSolutions);
4.2 The Execution of the Workflow
The workflow is executed in the following way: the Swift workflow engine takes as a parameter the file expressing the workflow, then it attempts to execute all the procedures that have their inputs defined, and it delays all the procedures that are waiting for input from their upstream dependency procedures. All the atomic procedures that are runnable will be executed by choosing a remote site on which the corresponding code will be executed, then staging in the
required input data files, and, after the execution ends successfully, staging out the results. When the newly generated output files have been staged out to the system that runs the workflow, the execution engine can submit the other tasks that depend on these files as inputs. When all the files have been generated, the workflow and the application are finished. If a task fails, the workflow engine will automatically resubmit it.
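The scheduling behaviour described above - run whatever has its inputs, release dependents as outputs appear, retry on failure - can be condensed into a short Python model (illustrative only, not Swift's actual engine; the task structure and names are our own):

```python
# Each task is a dict with input/output "file" names and an action.
# A task is runnable once all of its inputs exist among produced files.
def run_workflow(tasks, initial_files, max_retries=3):
    available = set(initial_files)
    pending = list(tasks)
    while pending:
        runnable = [t for t in pending if set(t["inputs"]) <= available]
        if not runnable:
            raise RuntimeError("deadlock: unmet dependencies")
        for task in runnable:
            pending.remove(task)
            for attempt in range(max_retries):
                try:
                    task["action"]()          # stage in, execute, stage out
                    available |= set(task["outputs"])
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise                 # give up after the last retry
    return available

log = []
tasks = [
    {"inputs": ["a.out"], "outputs": ["b.out"],
     "action": lambda: log.append("stage2")},
    {"inputs": ["in.dat"], "outputs": ["a.out"],
     "action": lambda: log.append("stage1")},
]
files = run_workflow(tasks, ["in.dat"])
print(log)  # stage1 runs before stage2 despite being listed second
```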
4.3 Measurements
Both the system designers of Swift and the researchers using it have obtained quite promising results: in the case of the Economics application, depending on the size of the input, on the complexity and parallelism of the workflow, and on the sites where we ran the experiments, we measured speedups of between 2.5 and 10 times compared to the execution of the same experiment on a single machine. This is encouraging mainly because it shows that Swift was efficient enough to hide the significant overheads resulting from decomposing the problem into subcomponents, transferring data over the Internet, waiting in queues for processing resources on the remote sites, and recombining all the outputs once they have been generated remotely. The current problem had loops with 25 to 100 iterations; for larger problems we expect a much higher speedup. These measurements were made without any speedup optimizations in place; much better performance is to be expected when the workflow is tuned for speed. The Grid is a shared environment by nature and by design, and there are no implied guarantees on acquiring the resources needed by various users. Therefore the measurements of this workflow were often affected by resource availability. We chose to run our experiments on the UC/Argonne Teragrid site, which tends to have lower utilization. Either using this resource at times when there were few users, or making a reservation for 40 computing nodes, allowed us to make the following measurements:
Table 1: Execution time of the Moral Hazard problem

    Resource                       Running Time
    Single Machine                 2.5 hrs
    Swift default @Argonne         1 hr, 3 min
    Swift with Falkon @Argonne     27 minutes

5 Conclusion
SwiftScript is a lightweight, user-friendly, yet powerful and extensible workflow language; together with its Swift execution environment, it will significantly improve the research capabilities of anyone needing large amounts of resources for their work. The Economics problem example shows that Swift was able to manage without problems a fairly complex workflow consisting of 165 atomic procedures over a heterogeneous pool of resources in Teragrid. Our team has also successfully used Swift to address much larger problems (> 10000 atomic procedure invocations) in domains like Bioengineering and Physics. Further work has already focused extensively on the acquisition and management of resources, through the use of the Falkon [10] system, and on extending the SwiftScript language with features such as macros for basic data structure (e.g., string) manipulation. In the near
future we hope to extend the Swift engine with providers that enable web services invocation.
References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[2] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, M. Su, K. Vahi, and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In Grid Computing, 2004.
[3] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. International J. Supercomputer Applications, 2001.
[4] G. V. Laszewski, M. Hategan, and D. Kodeboyina. Java CoG Kit workflow. Workflows for Science, 2007.
[5] R. K. Leik and M. A. Chalkey. On the stability of network relations under stress. Social Networks, 1997.
[6] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, and M. Jones. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10), 2005.
[7] G. Madeira and R. Townsend. Endogenous groups and dynamic selection in mechanism design. Journal of Economic Theory, 2007.
[8] R. A. E. Mueller, E. S. Prescott, and D. A. Sumner. Hired hooves: Transactions in a south Indian village factor market. The Australian Journal of Agricultural and Resource Economics, 2002.
[9] T. Oinn, M. Greenwood, M. Addis, M. N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. R. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 8(10), 2005.
[10] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde. Falkon: a fast and light-weight task execution framework. Supercomputing Conference, 2007.
[11] M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J. P. Mesirov. GenePattern 2.0. Nature Genetics, 38(5):500-501, 2006.
[12] R. M. Townsend and R. A. E. Mueller. Mechanism design and village economies: Credit to tenancy to cropping group. Review of Economic Dynamics, 1998.
[13] Y. Zhao, M. Wilde, I. Foster, J. Voeckler, J. Dobson, E. Gilbert, T. Jordan, and E. Quigg. Virtual data Grid middleware services for data-intensive science. Concurrency and Computation: Practice and Experience, 2000.