Design and Optimization of Scientific Workflows

By

DANIEL ZINN
Dipl.-Inf. (Ilmenau University of Technology, Germany) 2005
M.S. (University of California, Davis) 2008

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE
in the
OFFICE OF GRADUATE STUDIES
of the
UNIVERSITY OF CALIFORNIA
DAVIS

Approved:

Professor Bertram Ludäscher (Chair)
Professor Todd J. Green
Professor Zhendong Su
Committee in Charge 2010
Daniel Zinn January 2010 Computer Science
Design and Optimization of Scientific Workflows

Abstract

This work considers the problem of design and optimization of scientific workflows. Progress in the natural sciences increasingly depends on effective and efficient means to manage and analyze large amounts of data. Scientific workflows form a crucial piece of cyberinfrastructure, which allows scientists to combine existing components (e.g., for data integration, analysis, and visualization) into larger software systems to conduct this new form of scientific discovery. We propose VDAL (Virtual Data Assembly Lines), a dataflow-oriented paradigm for scientific workflows. In the VDAL approach, data is organized into nested collections, much like XML, and flows between components during workflow execution. Components are configured with XQuery/XPath-like expressions to specify their interaction with the data. We show how this approach addresses many challenges that are common in scientific workflow design, thus leading to better overall designs. We then study different ways to optimize VDAL execution. First, we show how to leverage parallel computing infrastructure by exploiting the pipeline, task, and data parallelism exhibited by the VDAL paradigm itself. To this end, we compile VDAL workflows into several MapReduce tasks, executed in parallel. We then show how the cost of data shipping can be reduced in a distributed streaming implementation. Next, we propose a formal model for VDAL, and show how static analysis can provide additional design features to support the scientist during workflow creation and maintenance, namely, by displaying actor dependencies, previewing the structure of the results, and explaining how output data will be generated from input data. Consequently, certain design errors can be detected prior to the actual workflow execution. Finally, we investigate the fundamental question of how to decide equivalence of VDAL workflows. We show that testing the equivalence of string polynomials, a new problem, reduces to workflow equivalence when an ordered data model is used. Here, our preliminary work defines several normal forms for approximating the equivalence of string polynomials.
To my family.
Acknowledgments

Being a Ph.D. student at UC Davis was a wonderful experience. I want to express my deepest gratitude to all people that made this chapter of my life as fantastic as it was. I want to thank my major advisor Professor Bertram Ludäscher. He not only provided excellent guidance and inspired me, but also helped to organize, refine and structure my ideas. He taught me how important it is to present your ideas well, and how to do so. I deeply appreciate that even under tight schedules, Bertram always had time for me: be it for discussing research ideas, revising papers, or even to provide guidance in non-research-related fields of my life. Bertram truly lives up to all expectations associated with a Doktorvater1: he has not only become my role model for being a scientist, but also a very good friend. I am deeply thankful for all his advice, patience and help. I further want to thank the members of my dissertation committee, Prof. Zhendong Su and Prof. Todd J. Green. Zhendong has not only provided valuable feedback for this work, but was always open for my questions. Special thanks to Zhendong for his extra support after I broke my elbows. I want to thank T.J. for his detailed feedback on this work as well as his support while working on Chapter 7. Technical discussions with T.J. are not only enjoyable but also very fruitful. I thank him for his patience, especially while listening to my sometimes very nice, but wrong "proofs". I further want to thank Shawn Bowers and Timothy McPhillips for their valuable help, discussions and suggestions with this dissertation. In particular, I want to thank them for their work on Comad, which this dissertation builds on and extends. Further, I want to thank Xuan Li for his great help on the Kepler-PPN project in Chapter 5. I also want to thank Prof. Michael Gertz for co-advising me in the early stages of my Ph.D. In addition, I want to thank my advisors during my wonderful time at Google: Rebecca Schultz during my first internship with the platforms group, and Jerry Zhao and Jelena Pjesivac-Grbovic
1 doctoral father
during my second one with the MapReduce team. It was an honor for me to work with such nice and smart people. The experience interacting with the systems staff at the Department of Computer Science, most notably Babak Moghadam and Ken Gribble, could not have been better. It is great to work with such approachable, knowledgeable, and open-minded people. Without their help, the experimental evaluations would not have been possible. I also want to thank the staff in the CS department office. Their friendliness made every single visit to the office enjoyable. Special thanks go to Mary Reid, our first CS graduate advisor, for her fantastic help in my early phases as a Ph.D. student. A very important part of my graduate life was interacting with fellow graduate students. I am very lucky to have found a very best friend and collaborator, Michael Byrd. Thank you for the always great times we had! Our lunches at Sophia's, Raja's and almost every other restaurant in walking distance will be unforgettable to me. I can only say: pants, pants, pants! I am further very thankful to all the people of the database research group, who are not only very nice people and became good friends, but also provided valuable feedback during presentations and practice talks. I especially want to thank Zijie Qi, for trying to teach us some Chinese; Sven Köhler, not only for being a great collaborator in fighting Hadoop for Chapter 3, but also for being a great friend for more than 10 years. I thank Dave Thau for helping me a lot with my talk at the EDBT Ph.D. Workshop; and Manish Anand for, among other things, saving my life in Portland with delicious cookies. Also, thank you, David Welker, for some legal advice; and Haifeng Zhao, whose friends showed me around in Shanghai. For always inspiring interactions, I want to thank Earl T. Barr and Jedidiah R. Crandall. I also want to thank fellow graduate students outside the Database lab. These are all great people and I am very grateful for my very pleasant interaction with them: James Shearer, Yuan Niu, Ananya Das, Till Stegers, and Jim Bosch. I further want to thank Prof. Kai-Uwe Sattler, my thesis advisor during my Master's studies in Ilmenau. Without his support, I would not have been able to come to Davis. Special thanks also go to Prof. Horst Salzwedel; without his generous offer to live at his
house in Palo Alto during an internship, I would not have been infected with the California virus. I also thank Colin K. Mick and Ulla Mick for their help and support during these first 5 months I spent in California. I am also deeply thankful for the many friends I made here in Davis: Abhijeet Kulkarni, Tina Schütz, Francesca Martinez, Jeff Stuart, Tony Bernadin, Dan Alcantara, Mauricio Hess, Jay Jongjitirat, Jeff Wu, and many more. I especially thank Zach Grounds for being such a good friend and apartment-mate. I also thank Conny Franke, who was up for the adventure to apply for a Ph.D. program in California and accompanied me in the early phases of my Ph.D. Finally, and most importantly, I have deep gratitude for my family. I thank my mum, Christel Zinn, for her deep love, dedication, open-mindedness, and all she did for me throughout my life. Without her extensive care during my early childhood, I would not have lived to see my first day at school. She encouraged me to pursue higher education and stay in school, since "there is no hurry to get out of school, you will have to work afterwards for the rest of your life". I am also deeply thankful to my dad, Gerhard Zinn, who, besides many other things, introduced me to computers and the joy of math. I am very thankful for my dad's patience and excellence in explaining technical and logical matters, even when he helped me to take my first steps programming BASIC. I also thank his wife Ursel Zinn, especially for being brave enough to read over my Master's thesis written in English. I further thank my wonderful brother, Enrico Zinn, for being the best brother ever. I also thank Ines Greiner-Hiero for being a wonderful friend since I was eleven years old. Moreover, I want to thank my loving grandparents, Karl and Lonny Sesselmann. My grandpa was a magnificent person who was not as fortunate as me to have had the possibility of a good education. He nevertheless was a great teacher for me. During the last two years, I was able to talk with my grandma almost every day. I will miss her support, and I wish she could have lived to see me accomplish this goal. Last, but not least, I would like to thank my amazing girlfriend, Tu Anh Ngoc Huynh. Her understanding, encouragement, patience, and love are my endless sources of energy and happiness.
Contents

List of Figures ........ xi
List of Listings ........ xiii
List of Tables ........ xiv
Structure and Contributions ........ 1

1 Introduction ........ 2
  1.1 Problem Statement ........ 3
  1.2 Script-based Approaches ........ 5
  1.3 Scientific Workflow Approach ........ 7
    1.3.1 Examples ........ 8
    1.3.2 Scientific Workflow Terminology ........ 11
    1.3.3 Advantages ........ 12
    1.3.4 Limitations ........ 14
  1.4 Collection-Oriented Modeling and Design (Comad) ........ 17
    1.4.1 Advantages ........ 19
    1.4.2 Limitations ........ 22
  1.5 Towards Virtual Data Assembly Lines ........ 23
    1.5.1 Research Questions ........ 24
  1.6 Detailed Description of Contributions ........ 25

2 Improving Scientific Workflow Design with Virtual Data Assembly Lines ........ 30
  2.1 Workflow Design Challenges ........ 31
    2.1.1 Parameter-Rich Functions and Services ........ 32
    2.1.2 Maintaining Data Cohesion ........ 34
    2.1.3 Conditional Execution ........ 37
    2.1.4 Iterations over Cross Products ........ 39
    2.1.5 Workflow Evolution ........ 40
  2.2 Virtual Data Assembly Lines (VDAL) ........ 41
    2.2.1 Inside VDAL ........ 42
    2.2.2 VDAL Components and Configurations ........ 45
    2.2.3 Example: VDAL Actor Configurations ........ 50
  2.3 Design Challenges Revisited ........ 50
    2.3.1 Parameter-rich Black Boxes ........ 50
    2.3.2 Maintaining Data Cohesion ........ 52
    2.3.3 Conditional Execution ........ 54
    2.3.4 Iterations over Cross Products ........ 55
    2.3.5 Workflow Evolution ........ 56
  2.4 Related Work ........ 56
  2.5 Summary ........ 58

3 Optimization I: Exploiting Data Parallelism ........ 59
  3.1 Introductory Example ........ 60
  3.2 MapReduce ........ 62
  3.3 Framework ........ 64
    3.3.1 XML Processing Pipelines ........ 65
    3.3.2 Operations on Token Lists ........ 67
    3.3.3 XML-Pipeline Example ........ 69
  3.4 Parallelization Strategies ........ 70
    3.4.1 Naive Strategy ........ 70
    3.4.2 XMLFS Strategy ........ 72
    3.4.3 Parallel Strategy ........ 80
    3.4.4 Parallel Strategy in Detail ........ 81
    3.4.5 Summary of Strategies ........ 84
  3.5 Experimental Evaluation ........ 85
    3.5.1 Comparison with Serial Execution ........ 87
    3.5.2 Comparison of Strategies ........ 89
  3.6 Related Work ........ 91
  3.7 Summary ........ 94

4 Optimization II: Minimizing Data Shipping ........ 96
  4.1 ∆-XML: Virtual Assembly Lines over XML Streams ........ 97
    4.1.1 Types and Schemas ........ 98
    4.1.2 Actor Configurations ........ 101
    4.1.3 Type Propagation ........ 104
  4.2 Optimizing ∆-XML Pipelines ........ 105
    4.2.1 Cost Model ........ 106
    4.2.2 X-CSR: XML Cut, Ship, Reassemble ........ 106
    4.2.3 Distributor and Merger Specifications ........ 110
  4.3 Implementation and Evaluation ........ 111
    4.3.1 Experimental Setup ........ 112
    4.3.2 Experimental Results ........ 113
  4.4 Related Work ........ 118
  4.5 Summary ........ 119

5 Implementation: Light-weight Parallel PN Engine ........ 120
  5.1 General Design Decisions ........ 120
    5.1.1 Workflow Setup ........ 123
    5.1.2 Workflow Run ........ 126
  5.2 VDAL-specific Functionality ........ 127
  5.3 Kepler as a PPN GUI ........ 128
    5.3.1 Architecture ........ 129
    5.3.2 PPN Monitoring Support in Kepler ........ 130
    5.3.3 Communication with Kepler Actors ........ 131
    5.3.4 Demonstration: Movie Conversion Workflow ........ 133
  5.4 Summary and Related Work ........ 134

6 Static Analysis I: Supporting Workflow Design ........ 135
  6.1 Design Use Cases ........ 136
  6.2 Well-formed Workflows ........ 138
    6.2.1 Review: Virtual Assembly Lines ........ 138
    6.2.2 Notions about Well-formed Workflows ........ 139
  6.3 Compilation of VDAL to FLUX ........ 141
    6.3.1 Necessary FLUX Extensions ........ 141
    6.3.2 Rewriting VDAL to FLUX ........ 143
  6.4 Static Analysis for FLUX-compiled VDAL Workflows ........ 145
  6.5 Discussion and Related Work ........ 147
  6.6 Future Work: Workflow Resilience ........ 148
    6.6.1 Input Resilience ........ 149
    6.6.2 Resilience against Workflow Changes ........ 149
    6.6.3 Inserting Actors ........ 149
    6.6.4 Deleting Actors ........ 150
    6.6.5 Replacing Actors ........ 150

7 Static Analysis II: Towards Deciding Workflow Equivalence ........ 152
  7.1 Relation to Conventional Regular Expression Types ........ 154
  7.2 General Notions ........ 155
    7.2.1 Data Model ........ 155
    7.2.2 Core XQuery Fragment XQ ........ 156
    7.2.3 Expressive Power of XQ ........ 157
    7.2.4 Regular Expression Types ........ 159
    7.2.5 Conventional Type Propagation ........ 159
  7.3 General Idea of Possible-Value Types ........ 161
  7.4 Possible-Value Types ........ 162
    7.4.1 Propagation Rules for XQ ........ 168
    7.4.2 From PV-Typings to Query Equivalence ........ 176
  7.5 Equality of String Polynomials ........ 179
    7.5.1 Restriction to a two-letter alphabet ........ 180
    7.5.2 Reduction to Restricted Equivalence ........ 181
    7.5.3 Simple Normal Form ........ 181
    7.5.4 Alternating Normal Form ........ 182
    7.5.5 Collecting Exponents into Big Polynomials ........ 185
    7.5.6 Comparing Lists of Monomials ........ 187
    7.5.7 Towards Distributive Alternating Normalform ........ 188
    7.5.8 Deciding M1^P ≡≥c M2^Q for dist-minimal Mi ........ 191
    7.5.9 Summary of Findings and Future Steps ........ 191
  7.6 Undecidability of Value-Difference for PV-Types ........ 192
  7.7 Undecidability of Query-Equivalence for XQdeep-EQ ........ 195
  7.8 Related Work ........ 196
  7.9 Summary ........ 198

8 Concluding Remarks ........ 200

Bibliography ........ 203
List of Figures

1.1 Promoter identification workflow (from [ABB+03]) ........ 9
1.2 Monitoring workflow (from [PLK07]) ........ 11
1.3 Snapshot of a Comad execution ........ 18
1.4 Conceptual model of Comad-actor execution ........ 19
1.5 Workflow dependency on input structure (adapted from [MBL06]) ........ 20
1.6 Three-layer Comad architecture ........ 23

2.1 Parameter-rich service, record assembly and disassembly ........ 32
2.2 Record-handling shims ........ 35
2.3 Maintaining nesting structure using array tokens and additional loops ........ 36
2.4 Dataflow model for "if (Test1 or Test2) then A" ........ 38
2.5 Conventional cross-products ........ 39
2.6 Architectural differences of conventional versus VDAL ........ 41
2.7 VDAL Actor Anatomy ........ 43
2.8 Dataflow inside VDAL Actor ........ 44
2.9 Example grouping via binding expression in γ ........ 47
2.10 Syntax for FLUX (adapted from [Che08]) ........ 49
2.11 Blackbox and VDAL actor configuration ........ 49
2.12 Linear workflows ........ 52
2.13 Hierarchical data used in phylogenetic workflow ........ 52
2.14 Maintaining nesting structure ........ 53
2.15 Localizing if-then-else routing via XML attributes ........ 55
2.16 Cross-products in VDAL ........ 55

3.1 XML Pipeline Example ........ 60
3.2 Example splits and groups ........ 65
3.3 Image transformation pipeline ........ 69
3.4 Processes and dataflow for the three parallelization strategies ........ 71
3.5 Parallel re-grouping ........ 81
3.6 Serial versus MapReduce-based execution ........ 88
3.7 Runtime comparison of compilation strategies ........ 90

4.1 Simple schema with examples for several concepts ........ 101
4.2 X-CSR overview: Standard versus optimized with schemas ........ 108
4.3 X-CSR experiments standard versus optimized ........ 115

5.1 Detailed workflow execution times (collection structures only) ........ 126
5.2 Detailed workflow execution times (collections and data) ........ 126
5.3 General Architecture of Kepler-PPN Coupling ........ 129
5.4 Kepler-PPN Coupling ........ 132
5.5 Demonstrating Communication between Kepler and PPN ........ 132
5.6 Kepler-PPN Coupling in action ........ 133

6.1 Components and dataflow inside VDAL actor ........ 138
6.2 Example for VDAL actor configuration ........ 144
6.3 FLUX-Code corresponding to VDAL actor given in Fig. 6.2 ........ 145
6.4 Generating Required-For Relation ........ 147

7.1 Semantics of XQ ........ 158
7.2 Commutativity diagram for PV-Typings ........ 161
7.3 Semantics [[ . ]]v for XML pv-types without free indexes ........ 165
7.4 Propagation Rules for XQ (constraint-free pv-types) ........ 170
7.5 Lift Algorithm for XML pv-types ........ 177
7.6 Deciding query equivalence with input restrictions for XQ ........ 178
7.7 Algorithm to transform into alt-NF ........ 186
7.8 Helper Algorithms for alt-NF ........ 186
List of Listings

3.1 Split, Map, Reduce for Naive strategy ........ 72
3.2 Split and Map for XMLFS ........ 78
3.3 Split for XMLFS & Parallel ........ 79
3.4 Map and Reduce for Parallel ........ 83
3.5 Group and sort for Parallel strategy ........ 84
4.1 X-CSR algorithm for statically computing distributor specifications ........ 110
5.1 Actor class declaration ........ 122
5.2 Port class declaration ........ 123
5.3 Sample PPN workflow setup script ........ 124
5.4 Sample schema declaration ........ 127
5.5 Sample signature declaration ........ 127
5.6 Sample description of synthetic data ........ 128
5.7 Kepler configuration file of existing PPN actors ........ 130
List of Tables

3.1 Main differences for compilation strategies ........ 85
4.1 X-CSR standard versus optimized: Savings on shipping and execution time ........ 114
6.1 Actor definitions: Reading versus blind ........ 139
6.2 Actor definitions: Feeding versus starving ........ 140
Structure and Contributions

In Chapter 1, we motivate and introduce scientific workflows and provide necessary background information. Chapters 2–5 describe Virtual Data Assembly Lines (VDAL) and present optimization strategies for their execution; Chapters 6 and 7 present the more theoretical part of this work, where we consider static analysis of VDAL workflows. In particular, Chapter 2 presents the VDAL paradigm. We outline concrete design challenges and show, based on example scenarios, how VDAL workflows address these challenges and are thus easier to design, maintain and evolve than workflows based on plain dataflow network primitives. In Chapters 3 and 4, we describe approaches to enhance execution efficiency: Chapter 3 analyzes how the execution of VDAL workflows can be enhanced by exploiting data parallelism. Here, we show how a MapReduce framework can be used to execute VDAL workflows in a cluster environment. In Chapter 4, we present a type system for workflows operating on ordered collections, and show how we can optimize data shipping by analyzing data dependencies based on information provided by the actors. Chapter 5 describes the workflow execution engine developed for this dissertation. Then, Chapter 6 describes how VDAL workflows can be translated into XML-processing programs written in the XML update language FLUX. We further define concepts related to workflow well-formedness and show how existing type systems for XML languages can be used to answer important design questions. In Chapter 7, we extend existing typing approaches for XML languages by developing a sound and complete type system for a core language for XML processing. We further introduce the problem of "String-Polynomial-Equality", and show that it is at the core of deciding query equivalence for XQuery with an ordered data model. Our current results towards solving this problem conclude this chapter. Chapter 8 summarizes this work and outlines future research opportunities. This dissertation is based on the following publications: Chapter 1: [Zin08, MBZL09]; Chapter 2: [ZBML09a]; Chapter 3: [ZBKL09]; Chapter 4: [ZBML09b]; Chapter 5: [ZLL09]; and Chapter 6: [ZBL10].
Chapter 1
Introduction

Imagination is more important than knowledge. For while knowledge defines all we currently know and understand, imagination points to all we might yet discover and create.
(Albert Einstein)
Progress in computing technology has contributed greatly to accelerating scientific discovery [HTT09]. Along with experiments conducted in the field or in the lab, computation and data analysis have been established as new sources of ideas and inspiration, and as vital techniques to validate scientific hypotheses. For example, discovering the relationships of living organisms via phylogenetic analysis is based on large gene- or protein-sequence databases, algorithms for aligning sequences from different species, and applying sophisticated models of evolution, which are then solved computationally. Furthermore, computer simulations are used in particle physics to validate hypotheses, and in earth sciences to, for example, forecast our weather. In observational physics, such as astronomy, global virtual observatories are built to collect, process and analyze the massive amounts of data being captured. In short, experiments done in silico have become an asset on equal footing with in vitro and in vivo experimentation. The term e-Science [HT05] is used to describe such data- or computation-intensive sciences. Here, to make scientific discoveries, algorithms need to be created and composed
into larger studies, e.g., analysis pipelines, which need to be deployed and executed. Consequently, in addition to traditional data management tasks, problems that are common to software engineering, such as writing robust and bug-free code, creating re-usable and easy-to-maintain software, etc., need to be addressed in these application domains.
1.1 Problem Statement
When conducting computational science experiments, or e-Science, we can distinguish two main parts: (1) the scientific challenge of having the right data analysis or simulation idea, and (2) the engineering challenge to go from this idea, or conceptual plan, to an actual implementation. Of course, these two tasks are intertwined: scientists might develop new hypotheses once more data has been analyzed, and then want to test these hypotheses via actual data analysis or simulations, creating more ideas for possibly new hypotheses. The engineering challenge is to transform their ideas into executable programs. In the case of scientific data analysis, this often requires the integration of multiple domain-specific tools and specialized applications for data processing. This work considers this problem of integrating existing software components to build complex scientific analysis systems. Considering human and machine time as valuable resources, we present approaches that use them as efficiently as possible. In particular, to support scientists and developers, the systems should be easy to build, evolve and maintain, as well as easy to share and re-use. To save computing resources and to provide the scientists with the results quickly, the system should execute efficiently while utilizing available computing resources. This component-integration problem, embedded into an e-Science context, poses the following challenges: Designing complex systems is hard. Software systems for e-Science are often complicated, large-scale applications. Typically, several complex steps, e.g., sophisticated algorithms, need to be used together. Therefore, problems common to complex
software systems in general emerge: In particular, it is not clear how to structure and develop components of such systems in a way that they can be easily re-used, that the dependencies among them are minimized, or that components can be developed independently by different groups. Managing data is hard. In e-Science, the nature of the data that is processed poses additional challenges. For example, particle physics simulations often generate large amounts of data that need to be monitored or analyzed. In the Center for Plasma Edge Simulation project [CPE], for example, experiments typically produce around 40 GBytes of data to be analyzed every hour. In the Large Synoptic Survey Telescope project [LSS], the telescope under construction will generate on the order of 15 TBytes of data every night! In addition to being large, scientific data can also be semantically rich (such as data in the life sciences) and organized into complex hierarchies. In phylogenetic workflows, for example, phylogenetic trees are inferred from sets of gene sequences of different living organisms; managing these collection structures also poses challenges. Furthermore, data lineage and other provenance information are becoming increasingly important, e.g., to validate results and ensure their reproducibility. Effective and efficient data provenance management poses new challenges [ABML09]. Building distributed systems is hard. E-Science problems exhibit many characteristics that require building them as distributed or parallel applications. They deal with very large amounts of data, or are computationally very expensive, or both. Thus, using clusters or supercomputers is often the only way to keep execution times feasible. Furthermore, the systems themselves are usually inherently distributed. Scientists, located all over the world, want to access and process data gathered at different locations, such as in astronomical observatories, or during experiments conducted on very expensive machinery (e.g., particle accelerators) available at only a few locations in the world. In the biological sciences, access to remote databases containing gene and
protein sequences or data about biological pathways is essential. As is well-known in computer science, building distributed systems is hard. Since large-scale distributed systems can seldom be built using one single programming language, problems arise due to different representations of data or function-calling conventions. Having separate address spaces makes parameter passing or global variables expensive and complex. Since there are separate processes, there is no common control flow, which then raises the need for synchronization between them. Single components (e.g., cluster hosts) can fail due to hardware errors. The more components a system is built of, the less likely it is that all of the components will work properly. Therefore, distributed systems usually employ special algorithms to tolerate partial failures, which again increases their complexity. Moreover, using different hardware usually results in heterogeneous data representations. Furthermore, problems arise from the communication between components themselves: data transport is more expensive (in time, and possibly also in money) than in a local environment, and due to open communication channels, security issues may also arise.
1.2 Script-based Approaches
A common approach to integrate scientific software is to use scripting languages that “glue” together already existing components. In the Large Synoptic Survey Telescope project [LSS], for example, complex analysis steps are developed in C++, while Python is used to orchestrate components at a higher level. For the bio-sciences, an extensive toolkit of Perl modules called BioPerl [SBB+ 02, Bio09] has been built for the purpose of making it easy to combine components for data retrieval (e.g., from remote biological databases) with data analysis steps (e.g., via local programs). While scripting languages are broadly applicable, they have shortcomings: Programming expertise in C++/Perl/Python necessary. Even if the overall software system is very well structured, a high level of expertise in these languages (C++,
Perl, or Python) is often necessary to build scientific applications. No specific component model. General-purpose languages, and scripting languages in particular, do not define specific component models. While a general object-oriented paradigm is usually applied, it is often not clear how to structure the components into classes and what their interactions should be. This ambiguity often prevents scientists who are not trained in the scripting language from extending the libraries or taking full advantage of the framework. A more constrained model of computation could give more guidelines for component creation and re-use. Such a model of computation could also provide more support for designing applications from components, and for application evolution. No automatic provenance. Since data is handled explicitly by the scientist, it is often hard to track data provenance (also called lineage) within these scripting languages. For a scientist, however, it is very important to record how and from which raw data some final data product has been derived. While such a feature could theoretically be added to a general-purpose scripting language, for example based on Perl's taint-checking mechanism, it is not clear how well such a provenance system would integrate with existing libraries. Hard to utilize distributed or parallel resources. In scripting environments it is often hard to utilize distributed resources, e.g., when distribution has to be explicitly programmed by the scientist. Even if high-level libraries are used, additional pieces of code clutter the scientific logic and can lead to solutions that are hard to adapt to different resources. Low-level textual representation. There is no standard on how to write scientific applications in scripting languages. Especially in Perl, with its motto "There's more than one way to do it", equivalent programs can be written and structured in many different ways. While this could be desirable for a general-purpose language, it does
not directly facilitate an easy understanding of the program semantics. Programs that look very different can essentially compute the same products1, which can complicate sharing and peer-reviewing of programs. Furthermore, certain dependencies (e.g., dataflow) are easier to comprehend in graphical than in textual form. Little or no automated design support. When writing in a scripting language, the system's support for checking the correctness of applications is often very limited. For example, BioPerl provides no compile-time type-checking: errors are therefore found only at run-time, which greatly hampers productivity, especially in complex projects. Scripting languages also provide little help to create new applications from existing ones. In scripting languages, it is tempting (and therefore common) to copy-paste code fragments when evolving the script over time2. However, these languages provide few mechanisms to ensure that the pasted code fragment will work in the new environment. Other than checking if variables are declared prior to their use, no dataflow analysis is performed to detect programming errors resulting from copy-paste.
1.3 Scientific Workflow Approach
In view of these shortcomings, scientific workflow systems [FG06, TDGS07, DGST08, LAB+ 09] have emerged in recent years as tools for domain scientists to integrate existing components into larger analysis systems. That is, they support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis frameworks from existing building blocks, including algorithms that are available as locally installed software packages or globally accessible web services. Compared to script-based approaches, scientific workflow systems promise to offer a number of advantages, including built-in support for recording and querying data
1 In fact, as is well known, program equivalence is undecidable for general-purpose programming languages due to Rice's theorem.
2 In fact, even in object-oriented programming, copy-and-paste is common [KBLN04].
provenance [DBE+ 07], and for deploying workflows on cluster or Grid environments [Dee05]. In many kinds of scientific workflows, data is of primary concern. Computation is often used to derive new data from old, e.g., to infer phylogenetic trees from gene sequences, or to remove artifacts from astronomical images that can then be used to detect new objects in the sky. Therefore, many scientific workflow systems are data-driven or even adopt dataflow-oriented models of computation to describe scientific applications. Here, computational steps are represented as nodes and dataflow between these tasks is made explicit using edges between them. A scientific workflow tool then allows the scientist to select certain methods (actors), place them on a canvas and connect the output of one node with inputs of others. The generated graph then represents the executable application for performing the (complex) scientific analysis.
1.3.1 Examples
We will now have a closer look at two example workflows to describe basic concepts and abstractions in dataflow-oriented workflow systems. Both workflows have been created using the Kepler system [LAB+ 06]. Promoter identification workflow. Fig. 1.1 shows a screen shot of the Kepler workflow system displaying the Promoter Identification Workflow [ABB+ 03]. This workflow is used by a biologist to identify likely transcription factor binding sites in a series of genes. The process involves a series of tasks, such that performing the same series manually for each of a few dozen genes can be quite a repetitive and time-consuming process [Kep04a]. The screen shot shows major components of most current scientific workflow systems. In the main area the components (or actors) are placed on a workflow-design canvas. They are connected with each other via channels through which data objects flow during workflow execution. Each actor can have multiple input and output ports, similar to functions that can have multiple input parameters. Many scientific workflow tools including Kepler also support nested workflows: Fig. 1.1 shows the model of the top-level workflow and the model
Figure 1.1: Promoter identification workflow (from [ABB+ 03])
for the actor GeneSequenceProcessing. Nested subworkflows (a.k.a. composite actors) contain formal ports as their defined interface. These are shown in the top-right part of the image: the three black connection-endpoints on the far right are formal output ports, and the one black connection endpoint on the left is a formal input port. Accordingly, there is one actual input port and three actual output ports on the composite actor instance (here: GeneSequenceProcessing) in the main workflow. This approach defines clear interfaces to the rest of the system. The interaction between different actors is performed through data that is flowing through their ports. By typing input and output ports, the actor developer can specify additional restrictions on the data, similar to the types used in function signatures. On the left-hand side of Fig. 1.1, a library of actors and data sources is shown. From here, the user can select available algorithms and suitable data to be placed onto the canvas. In addition to input ports, actors can also have parameters as a form of configuration. Values for parameters are typically set by double-clicking on the actor instances and are valid for a complete workflow run. Once a workflow has been built, it can be executed to perform the computations (data integration, analysis, visualization, etc.) as defined by the workflow graph and model of computation. CPES Workflow for processing simulation data and archival.
Fig. 1.2 shows a
subworkflow of a larger “monitoring workflow” that is used in the Center for Plasma Edge Simulation for processing simulation data and archival [PLK07]. Among other things, this workflow is used to automate the submission of jobs to a supercomputer. The overall workflow controls data transport from a supercomputing center to other sites, e.g., to scientists’ home universities and backup storage. This workflow is used to automate tasks that scientists did manually before. In contrast to the first workflow that actually performs data analysis, this “plumbing” workflow orchestrates the simulation and data analysis on a supercomputer and automates data transport [LAB+ 09].
Figure 1.2: Monitoring workflow (from [PLK07])
1.3.2 Scientific Workflow Terminology
A scientific workflow is a description of a process, usually in terms of scientific computations and their dependencies [LBM09], and can be visualized as a directed graph, whose nodes (also called actors) represent workflow steps or tasks, and whose edges represent dataflow and/or control-flow [DGST09]. A basic formalism is to use directed acyclic graphs (DAGs), where an edge A→B means that actor B can start only after A finishes (a control-flow dependency). With such a DAG-based workflow model (e.g., used by Condor/DAGMan [CKR+ 07]) one can easily capture serial and task-parallel execution of workflows, but other data-driven computations, such as data streaming, pipeline parallelism, and loops, cannot be represented directly in this model. In contrast, dataflow-oriented computation models, like those based on Kahn's process networks [Kah74], incorporate pipeline-parallel computation over data streams and allow cyclic graphs (to explicitly model loops). This model underlies most Kepler workflows, and mutatis mutandis, applies to other scientific workflow systems with data-driven models of computation as well (e.g., Taverna [OGA+ 02], Triana [TSWH07], Vistrails [BCC+ 05, FSC+ 06], etc.). In these dataflow-oriented workflow models, each edge represents a unidirectional channel, which connects an output port of an actor to an input port of another actor. Channels can be thought of as unbounded queues (FIFO
buffers) that transport and buffer tokens that flow from the token-producing output port to the token-consuming input port. For workflow modeling and design purposes, it makes sense to distinguish different kinds of ports: a data port (the default) is used by an actor A to consume (read) or produce (write) data tokens during each invocation (or firing) of A. In contrast, a control port of A is a special input port whose (control) token value is not used by A’s invocation, but which can trigger A’s execution in the first place. An actor parameter can be seen as a special, more “static” input port from which data usually is not consumed upon each invocation, but rather remains largely fixed (except during parameter sweeps). While actor data ports are used to stream data in and out of an actor, actor parameters are typically used to configure actor behavior, set up connection or authentication information for remote resources, and so forth. A composite actor encapsulates a subworkflow and allows the nested workflow to be used as if it were an atomic actor with its own ports and parameters. While the Kahn model is silent on the data types of tokens flowing between actors, practical systems often employ a structured data model. However, in practice, when actors implement web service calls or external shell commands, data on the wire is often of type string or file, even if the data conceptually has more structure. Kepler natively employs
a model with structured types (inherited from Ptolemy [BLL+ 08]), including records and arrays. This creates many options when sending data from one actor to another. For example, a list of data can be sent from one actor to another in a single array token, or as a stream of tokens corresponding to the elements of the list. Similarly, large record tokens can be assembled and later broken into smaller fragments. These choices can in fact complicate workflow design (see below), whereas the proper use of a serializable, semistructured model such as XML allows a more flexible and uniform data treatment.
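To make these notions concrete, the following minimal Python sketch (our illustration only, not Kepler's or Ptolemy's actual API) models two data-transformer actors connected by channels implemented as unbounded FIFO queues. Because each actor runs in its own thread and interacts with the rest of the workflow only through its ports (here, the queues), upstream and downstream actors can overlap their work on successive tokens, which is the pipeline parallelism discussed below; the source also illustrates streaming element tokens instead of shipping one large array token.

    # Minimal process-network sketch (illustrative only; not Kepler's API).
    import threading
    import queue

    STOP = object()  # end-of-stream marker

    def actor(fire, inq, outq):
        # Repeatedly consume one input token, fire, and emit the result downstream.
        while True:
            token = inq.get()
            if token is STOP:
                if outq is not None:
                    outq.put(STOP)
                return
            result = fire(token)
            if outq is not None:
                outq.put(result)

    c1, c2 = queue.Queue(), queue.Queue()   # two unidirectional channels (FIFO buffers)
    align = threading.Thread(target=actor, args=(str.upper, c1, c2))   # hypothetical "align" step
    display = threading.Thread(target=actor, args=(print, c2, None))   # hypothetical "display" step
    align.start(); display.start()

    # Stream individual element tokens instead of one large array token.
    for seq in ["acgt", "ttga", "ccat"]:
        c1.put(seq)
    c1.put(STOP)
    align.join(); display.join()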
1.3.3 Advantages
Scientific workflow systems already provide a number of advantages over traditional, e.g., script-based, approaches:
Component model. Using the abstraction of actors, scientific workflow systems provide uniform access to different, already existing software. There are, for example, actors that represent web services, actors that call local applications, as well as actors that invoke R or Matlab scripts. Once the components have been wrapped as actors, complex systems can in principle be built from these by simply placing them on a canvas and defining connections between them. To transport data from one component to the other, only a wire has to be "drawn" in the scientific workflow user interface. Data source discovery. Scientific workflow systems often support the discovery of relevant data sources. Here, too, unified access to databases, files, or web services can be provided. While such a feature could also be provided for scripting languages by using a separate tool, the integrated environment of a workflow tool can be more convenient for the scientist. Provenance framework. Since dataflow is modeled explicitly in scientific workflow tools, provenance frameworks attached to the system are able to record and provide provenance information. There is no need for the scientist to explicitly record or store provenance information as part of the workflow. Instead, this functionality is completely provided by the workflow system [ABJF06, BML+ 06, BML08, ABML09, ABL09]. Semantic types. Workflow systems often provide additional features to support the scientist in building workflows. An approach for adding semantic types to workflow systems has been proposed and implemented in the Kepler system [BL04]. This helps to integrate data sources and actors based on semantic information as opposed to only structural types such as strings or arrays of integers. Users can leverage semantic type information by checking whether actors are compatible with each other, or by finding actors in a large library that operate on certain data. Parallelism. Since a dataflow paradigm is employed, workflows naturally exhibit pipeline
and task parallelism. Lee et al. show in [LP95] that the process network model of computation is deterministic with few synchronization requirements. The system can therefore execute the workflows with a high level of parallelism without requiring the scientist to think about synchronization.
1.3.4 Limitations
Despite the fact that workflow systems provide the mentioned advantages over script-based solutions, several problems and challenges remain. In the following, we present workflow requirements that are not, or only partially, met by current workflow systems. These are based on several years of experience with real-world use cases (e.g., see [KLN+ 07, LAB+ 06, PLK07]). For a more detailed explanation of them see [MBZL09].

Little Support for Data Modeling

Scientific workflows usually operate on large amounts of often structured and nested data. The various formats which are in use for storing structured data (NetCDF [Net], Nexus [MSM97], XML, etc.) reflect this observation. Therefore, a scientific workflow tool should support nested data types and should assist the scientist with data modeling. In conventional workflow systems, such as Kepler [AJB+ 04] and Triana [TSWH07], however, capabilities to model scientific data are rather limited. Simple, basic types are often used on the channels between actors. To represent, for example, a list of gene sequences, one can create an array token in Kepler which contains the gene sequences. Not only is an actor needed to create the array token, but it is also necessary to introduce other special actors once a particular operation is to be performed on each element of the array. The workflow presented in Fig. 1.1, for example, makes use of this technique. Low-level components such as the SequenceToArray and ArrayToSequence actors "clutter" the workflow. These data assembly and disassembly actors are not on the same level as "scientific actors" such as RunClustalW. Workflows mixing in such data-management actors therefore tend to lose their self-explanatory character. In many applications, however, it is necessary to
group together one sort of data for providing it as a whole to the next component: to infer a phylogenetic tree, for example, a list of aligned sequences is needed. Scientists familiar with the specific domain should be able to quickly grasp which tasks and methods are used within a workflow. Furthermore, self-explanatory workflows could be used as a means of communication between scientists. Just as UML is used as a "unified" way to communicate about object-oriented design, self-explanatory workflow graphs could be used to support communication about data-driven scientific procedures. In fact, web-based repositories, like myExperiment [GDR07], already provide a place to store, discuss and share scientific workflows.

Workflow Designs may be Brittle

Scientific workflows should tolerate certain changes in the structure of their input data, i.e., they should exhibit a certain degree of input resilience. A workflow that was, for example, created to work on a single data set of type T should also be usable if the scientist wants to apply the workflow to a series of data sets of type T. Also, a scientist might have many of these data sets, which could themselves be structured by the projects they belong to, by the methods that were used to derive them, or by any other criteria. In practice, the directory structure on the hard disk represents such an organization. Here, a reasonable desideratum for a scientific workflow tool is to take such a whole structure as input and perform the workflow on the data sets without destroying the organizational structure. Furthermore, scientific workflows should be easy to modify. Adding new components, removing (possibly non-vital) components, or replacing components by structurally equivalent ones should be possible and easy to do for the user. The workflow tool should be able to tolerate certain changes and predict the consequences of other changes that invalidate the workflow design. In the workflows shown in Fig. 1.1 and Fig. 1.2, it is, for example, hard to determine where to add new components or which components are not vital for the workflow run (e.g., the Display actor) due to the complex wiring. Unfortunately, all wires in the
workflow are carefully placed and necessary once it is modeled at this level of abstraction. We will therefore argue to raise the level of abstraction for the workflow graph in order to allow easier modifications.

Optimization is not Performed Automatically

The workflow system should be able to optimize workflow execution performance. Much of the impetus for developing scientific workflow systems derives from the need to carry out expensive computational tasks efficiently using available and often distributed resources. Workflow systems are used to launch, distribute and monitor jobs, move data, manage multiple processes, and recover from failures. One approach often taken today is to specify these tasks within the workflow itself, as shown in Fig. 1.2. The result is that scientific workflow specifications can become cluttered with job-distribution constructs that hide the scientific intent of the workflow. Workflows that confuse systems management with scientific computation are difficult to design in the first place and extremely difficult to re-deploy on a different set of resources. Even worse, requiring users to describe such technical details in their workflows excludes many scientists who have neither the experience nor the interest in playing the role of a distributed operating system. Systems should not require scientists to understand and avoid concurrency pitfalls (e.g., deadlock, data corruption due to concurrent access, race conditions) to take full advantage of available parallel computing infrastructure. Rather, workflow systems should safely exploit as many concurrent computing opportunities as possible, without requiring users to understand them. Ideally, workflow specifications would be abstract and employ principles and metaphors appropriate to the domain rather than including explicit descriptions of data routing, flow control, and pipeline and task parallelism. As we will see in Chapters 3 and 4, the approach presented in this dissertation can satisfy this requirement.
1.4 Collection-Oriented Modeling and Design (COMAD)
Collection-Oriented Modeling and Design (Comad) [MB05, MBL06], a special way of developing scientific workflows, has been proposed to address many of the shortcomings described in the previous section. Since our approach will extend Comad, we now briefly describe the Comad idea, its advantages and drawbacks. As mentioned in Section 1.3, the majority of scientific workflow systems represent workflows using dataflow languages. The specific dataflow semantics used, however, varies from system to system [YB05]. Not only do the meanings of nodes and of connections between nodes differ, but the assumptions about how an overall workflow is to be executed given a specification can vary dramatically. Kepler makes an explicit distinction between the workflow graph on the one hand, and the model of computation used to interpret and enact the workflow on the other, by requiring workflow authors to specify a director for each workflow. It is the director that specifies whether the workflow is to be interpreted and executed according to a process network (PN), synchronous dataflow (SDF), or other model of computation [LSV98]. Most Kepler actors in PN or SDF workflows are data transformers. Such actors consume data tokens and produce new data tokens on each invocation; these actors operate like functions in traditional programming languages. Other actors in a PN workflow can operate as filters, distributors, multiplexors, or otherwise control the flow of tokens between other actors; however, the bulk of the computing is performed by data transformers. Assembly-line metaphor. In Comad, the roles of actors and of connections between actors are different from those in PN or SDF. Instead of assuming that actors consume one set of tokens and produce another set on each invocation, Comad is based on an assembly-line metaphor: Comad actors (coactors or simply actors below) can be thought of as workers on a virtual assembly line, each contributing to the construction of the workflow product(s). In a physical assembly line, workers perform specialized tasks on products that pass by on a conveyor belt. Workers only "pick" relevant products, objects, or parts
        for two different fragments)
    RETURN SortCompare(keyA, keyB)

SortCompare: SKey keyA, SKey keyB → { }
    // always lexicographically compare ``leading path ⊕ start''
    RETURN LexicCompare( keyA.lp ⊕ keyA.start, keyB.lp ⊕ keyB.start )

Listing 3.5: Group and sort for Parallel strategy
scheme for each token whose lexicographical order corresponds to the standard document order. Further, since each fragment contains the leading path to its first token and the ID start (a local ID smaller than the ID of the first token), the leading path's ID list extended by start can be used to globally order the fragments. See, for example, Fig. 3.5: in the intermediary row, the ID lists 0.5 < 1,0.5 < 1,1,0.5 < 1,2.5 < 1.5 < 2,0.5 order the fragments from left to right. We use this ordering to sort the fragments such that they are presented in the correct order to the reduce functions. Listing 3.5 shows the definitions of the grouping and sorting comparators used in the Parallel strategy. Two keys that both have the group flag set are compared based on the lexicographical order of their gpath entries. Keys that do not have the group flag set are compared directly; this ensures that one of them is strictly before the other and that the returned order is consistent. The sorting comparator simply compares the IDs of the leading paths, extended by start, lexicographically.
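To make the fragment ordering concrete, here is a minimal Python sketch of the sorting comparator, assuming each key carries its leading-path IDs and its start ID as numeric tuples; the names (SKey, lp, start) mirror Listing 3.5, but the code itself is only illustrative, not the actual implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SKey:
    lp: Tuple[float, ...]    # IDs along the leading path of the fragment
    start: float             # local ID of the fragment's first own token

def sort_key(key: SKey) -> Tuple[float, ...]:
    # lexicographically compare ``leading path (+) start'', as in Listing 3.5
    return key.lp + (key.start,)

# hypothetical keys, sorted into global document order
keys = [SKey((1.0, 2.5), 0.5), SKey((0.5,), 1.0), SKey((1.0,), 0.5)]
print([k.lp + (k.start,) for k in sorted(keys, key=sort_key)])
# [(0.5, 1.0), (1.0, 0.5), (1.0, 2.5, 0.5)]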
3.4.5 Summary of Strategies
Table 3.1 presents the main differences between the presented strategies, Naive, XMLFS, and Parallel. Note that while Naive has the simplest data structures, it splits and groups the data in a centralized manner. XMLFS parallelizes grouping via the file system but still has a centralized split phase. The Parallel strategy is fully parallel for both splitting and grouping, at the expense of more complex data structures and multiple reduce tasks.

Strategy   Data                         Split        Group                                              KeyStructure                                     ValueStructure
Naive      XML file                     Centralized  Centralized by one reducer                         One integer                                      SAX-elements
XMLFS      File system representation   Centralized  Via file system + naming (no shuffle, no reduce)   Leading path with Ids                            SAX-elements with XMLIds
Parallel   Key-value pairs              Parallel     Parallel by reducers                               Leading path with Ids and grouping information   SAX-elements with XMLIds

Table 3.1: Main differences for compilation strategies
3.5 Experimental Evaluation
Our experimental evaluation of the different strategies presented above focuses on the following questions:

1. Can we achieve significant speed-ups over a serial execution?
2. How do our strategies scale with an increasing data load?
3. Are there significant differences between the strategies?

Execution Environment. We performed our experiments on a Linux cluster with 40 3GHz Dual-Core AMD Opteron nodes with 4GB of RAM each, connected via a 100MBit/s LAN. We installed Hadoop [Bor07] on the local disks9, which also serve as the storage space for hdfs. Having approximately 60GB of locally free disk storage provides us with 2.4TB of raw storage inside the Hadoop file system (hdfs). In our experiments, we use an hdfs replication factor of 3, as is typically used to tolerate node failures. The cluster runs the ROCKS [roc] software and is managed by Sun Grid Engine (SGE) [Gen01]; we created a common SGE parallel environment that reserves computers for use as nodes in the Hadoop environment while performing our tests. We used 30 nodes running as “slaves”, i.e., they run the MapReduce tasks as well as the hdfs data nodes for the Hadoop file system. We use an additional node, plus a backup node, to run the master processes for hdfs and the MapReduce job tracker, to which we submit jobs. We used Hadoop version 0.18.1 as available on the web page. We configured Hadoop to launch mapper and reducer tasks with 1024MB of heap space (-Xmx1024) and restricted the framework to 2 Map and 2 Reduce tasks per slave node. Our measurements are done using the UNIX time command to measure wall-clock times for the main Java program that submits the job to Hadoop and waits until it is finished. While our experiments were running, no other jobs were submitted to the cluster, so as not to interfere with our runtime measurements.

9 Running Hadoop from the NFS home directory results in extremely large start-up times for mappers and reducers.

Handling of Data Tokens. We first implemented our strategies by reading the XML data, including the images, into the Java JVM. Not surprisingly, the JVM ran out of memory in the split function of the Naive implementation as it tried to hold all data in memory. This happened for as few as #B = 50 and #C = 10: as each picture was around 2.3MB in size, the raw data alone already exceeds the 1024MB of heap space in the JVM. Although all our algorithms could be implemented in a streaming fashion (the required memory is of the order of the depth of the XML tree; output is successively returned as indicated by the EMIT keyword), we chose a trick that is often used in practice: we place references in the form of file names into the XML data structure, while keeping the large binary data at a common storage location (inside hdfs). Whenever we place an image reference into the XML data, we obtain an unused filename from hdfs and store the image there. When an image is removed from the XML structure, we also remove it from hdfs. Not storing the image data physically inside the data tokens also has the advantage that only the data that is actually requested by a pipeline step is lazily shipped to it. Another consequence is that the data that is actually shipped from the mapper to the reducer tasks is small, thus making even our Naive strategy a viable option.
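As a rough illustration of this reference-passing trick, the following Python sketch uses a local directory as a stand-in for hdfs; the helper names are purely hypothetical.

import os, uuid

BLOB_DIR = "/tmp/blob-store"          # stand-in for a shared store such as hdfs
os.makedirs(BLOB_DIR, exist_ok=True)

def put_blob(data: bytes) -> str:
    """Store large binary data under a fresh name and return a reference."""
    ref = os.path.join(BLOB_DIR, uuid.uuid4().hex)
    with open(ref, "wb") as f:
        f.write(data)
    return ref                         # this string is what goes into the XML token

def get_blob(ref: str) -> bytes:
    """Fetch the data only when a pipeline step actually requests it."""
    with open(ref, "rb") as f:
        return f.read()

def drop_blob(ref: str) -> None:
    """Remove the stored data when its token is removed from the XML structure."""
    os.remove(ref)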
Number of Mappers and Reducers. As described in Section 3.2, a split method is used to group the input key-value pairs into so-called input splits. Then, for each input split one mapper is created, which processes all key-value pairs of this split. Execution times of MapReduce jobs are influenced by the number of mapper and reducer tasks. While many mappers are beneficial for load balancing, they also increase the overhead of the parallel computation, especially if the number of mappers significantly outnumbers the available slots on the cluster. A good choice is to use one mapper for each key-value pair if the work per pair is significantly higher than the task creation time. In contrast, if the work per scope match is fast, then the number of slots, or a small multiple thereof, is a good choice. All output key-value pairs of the mappers are distributed to the available reducers according to a hash function on the key. Of course, keys that are to be reduced by the same reducer (as in Naive) should be mapped to the same hash value. Only our Parallel strategy has more than one reducer. Since the work for each group is rather small, we use 60 reducers in our experiments. The hash function we used is based on the GroupBy part of the PKey. In particular, for all fragments that have the group flag set, we compute a hash value h based on the IDs inside gpath: let l be the flattened list of all the digits (longs) inside the IDs of gpath. Divide each element in l by 25 and then interpret l as a number N in base 100. While doing so, compute h = (N mod 2^63) mod the number of available reduce tasks. For fragments with the group flag not set, we simply return a random number to distribute these fragments uniformly over the reducers10. Our hash function resulted in an almost even distribution of all key-value pairs over the available reducers.

10 Hadoop does not support special handling for keys that will not be grouped with any other key. Instead of shuffling such a fragment to a random reducer, the framework could just reduce the pair at the closest available reducer.
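A minimal Python sketch of this reducer assignment is given below; the field names are illustrative, and the 2^63 intermediate modulus is assumed as described above.

import random

def reducer_for(key, num_reducers: int) -> int:
    """Pick a reduce task for a fragment key (illustrative sketch).

    key.group : whether the fragment takes part in grouping
    key.gpath : list of XML IDs, each ID being a list of integer digits
    """
    if not key.group:
        # ungrouped fragments are spread uniformly over the reducers
        return random.randrange(num_reducers)
    digits = [d for xml_id in key.gpath for d in xml_id]   # flatten the ID digits
    n = 0
    for d in digits:
        n = n * 100 + (d // 25)        # interpret as a number in base 100
    h = n % (2 ** 63)                  # assumed intermediate modulus
    return h % num_reducers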
3.5.1 Comparison with Serial Execution
We used the image transformation pipeline (Fig. 3.3), which represents pipelines that perform intensive computations by invoking external applications over CData organized in a hierarchical manner. We varied the number #C of hCi collections inside each hBi, i.e., the total number of hCi-labeled collections in a particular input dataset is #B · #C. Execution times scaled linearly for increasing #B (from 1 to 200) for all three strategies. We also ran the pipeline serially on one host of the cluster. Fig. 3.6 shows the execution times for #B = 200 and #C ranging over 1, 5, and 10. All three strategies significantly outperform the serial execution. With #C = 10, the speed-up is more than twenty-fold. Thus, although the parallel execution with MapReduce incurs overhead for storing images in hdfs and for copying data from host to host during execution, speed-ups are substantial if the individual steps are relatively compute-intensive in comparison to the amount of data being shipped. In our example, each image is about 2.3MB in size; blur executes on an input image in around 1.8 seconds, coloring an image once takes around 1 second, and the runtime of montage varies from around 1 second for one image to 13 seconds for combining 50 images11.

11 There are 5 differently colored images under each hCi; with #C = 10, 50 images thus have to be “montaged”.

Figure 3.6: Serial versus MapReduce-based execution. Speed-ups relative to serial execution of the image processing pipeline (Fig. 3.3). All three strategies outperform a serial execution. The achieved speed-up for #C = 1 is only around 13x, whereas in the experiments with more data, speed-ups of more than 20x were achieved. #B was set to 200.

We also experimented with the number of mappers. When creating one mapper for each fragment, we achieved the fastest and most consistent runtimes (shown in Fig. 3.7). When fixing the number of mappers to 60, runtimes started to show high fluctuations due
to so-called “stragglers”, i.e., single mappers that run slowly and cause all others to wait for their termination. For this pipeline, all our approaches showed almost the same runtime behavior, with Naive performing slightly worse in all three cases. The reason for the similar runtimes is that the XML structure used to organize the data is rather small. Therefore, splitting and grouping the XML structure causes little overhead, especially compared to the workload performed by each processing step.
3.5.2 Comparison of Strategies
To analyze the overhead introduced by splitting and grouping, we use the pipeline given in the introduction (Fig. 3.1). Since it does not invoke any expensive computations in each step, the runtimes directly correspond to the overhead introduced by MapReduce in general and by our strategies in particular. In the input data, we always use 100 empty hDi collections as leaves, and vary #B and #C as in the previous example. The results are shown in Fig. 3.7. For small data sizes (#C = 1 and small #B), Naive and XMLFS are both faster than Parallel, and XMLFS outperforms Naive. This confirms our expectations: Naive uses fewer reducers than the Parallel approach (1 vs. 60); even though the 60 reducers are executed in parallel, there is some overhead involved in launching the tasks and waiting for their termination. Furthermore, the XMLFS approach has no reducers at all and is thus a mapper-only pipeline, and hence very fast. We ran the pipeline with #C = 1 up to #B = 1000 to investigate the behavior with more data. From approximately #B = 300 to around 700, all three approaches had similar execution times. Starting from #B = 800, Naive and XMLFS perform worse than Parallel (380s and 350s versus 230s, respectively). Runtimes for #C = 10 are shown in Fig. 3.7(b). Here, Parallel outperforms Naive and XMLFS at around #B = 60 (with a total number of 60,000 hDi collections). This is very close to the 80,000 hDi collections at the “break-even” point for #C = 1. In Fig. 3.7(c) this trend continues. Our fine-grained measurements for #B = 1 to 10 show that the “break-even” point is, again, around 70,000 hDi collections.
Figure 3.7: Runtime comparison of the compilation strategies. Runtimes for executing the pipeline given in Fig. 3.1 are compared. #B is varied on the x-axis; the y-axis shows the wall-clock runtime of the pipeline. For small XML structures, Naive and XMLFS outperform Parallel since fewer tasks have to be executed. On the other hand, the larger the data, the better Parallel performs in comparison to the other two approaches.
The consistency in the “break-even” point numbers suggests that our Parallel strategy outperforms XMLFS and Naive once the number of fragments to be handled and regrouped from one task to the next is on the order of 100,000. In this experiment, we set the number of mappers to 60 for all steps, as the work for each fragment is small in comparison to the task startup times. As above, we used 60 reducers for the Parallel strategy.

Experimental results. We confirmed that our strategies can decrease execution time for (relatively) compute-intensive pipelines. Our image-processing pipeline executed with a speed-up factor of 20. For moderately sized XML data, all three strategies work well, often with XMLFS outperforming the other two. However, as the data size increases, Parallel clearly outperforms the other two strategies due to its fully parallel split and group phases.
3.6 Related Work
Although the approaches presented here focus on efficient parallelization techniques for executing XML-based processing pipelines, our work shares a number of similarities with other systems (e.g., [TSWR03, Dee05, ZHC+07, FPD+05]) for optimizing workflow execution. For example, the Askalon project [FPD+05] has a similar goal of automating aspects of parallel workflow execution so that users are not required to program low-level grid-based functions. To this end, Askalon provides a distributed execution engine, in which workflows can be described using an XML-based “Abstract Grid Workflow Language” (AGWL). Our approach, however, differs from Askalon (and similar efforts) in a number of ways. We adopt a more generic model of computation that supports the fine-grained modeling and processing of (input and intermediate) workflow data organized into XML structures. Our model of computation also supports and exploits processes that employ “update semantics” through the use of explicit XPath scope expressions. This computation model has been shown to have advantages over traditional workflow modeling approaches [MBZL09], and a number of real-world workflows have been developed within the Kepler system using this approach
(e.g., for phylogenetics and meta-genomics applications). Also unlike Askalon, we employ an existing and broadly used open-source distribution framework for MapReduce (i.e., Hadoop) [DG08] that supports task scheduling, data distribution, and check-pointing with restarts. This approach further inherits the scalability of the MapReduce framework.12 Our work also differs significantly from Askalon by providing novel approaches for exploiting data parallelism in workflows modeled as XML processing pipelines. Alternatively, Qin and Fahringer [QF07] introduce simple data collections (compared with nested XML structures) and collection-shipping constructs that can reduce unnecessary data communication (similar approaches are also described in [FLL09, Goo07, OGA+02]). Using special annotations for different loop constructs and activities, they compute matching iteration data sets for executing a function and forward only the necessary data to each iteration instance. Within a data collection, each individual element can be addressed and shipped separately. This technique requires users to specify additional constraints during workflow creation, which can make workflow design significantly more complex. In Chapter 4, we address similar problems for XML processing pipelines; however, the necessary annotations in our approach can be automatically inferred from the workflow scope descriptions. We complement these approaches here by focusing on strategies for efficient and robust workflow execution through data-parallelization strategies, while leveraging the data and process distribution and replication provided by Hadoop. Thus, through our compilation strategies, we directly take advantage of the operations and sorting capability of the MapReduce framework for data packaging and distribution. MapReduce is also employed in [FLL09] for executing scientific workflows. That approach extends map and reduce operations for modeling workflows, requiring users to design workflows explicitly using these constructs. In contrast, we provide a high-level workflow modeling language and automatically compile workflows to standard MapReduce operations.

12 This was demonstrated, e.g., by solving the Tera-sort challenge, where Hadoop successfully scaled to close to 1000 nodes, and by Google's MapReduce scaling to 4000 nodes on the Peta-sort benchmark.

Our work also has a number of similarities to the area of query processing over XML
streams (e.g., see [KSSS04a, CCD+03, CDTW00, BBMS05, KSSS04b, GGM+04, CDZ06]). Most of these approaches consider optimizations for specific XML query languages or language fragments, sometimes taking into account additional aspects of streaming data. FluXQuery [KSSS04a] focuses on minimizing the memory consumption of XML stream processors. Our approach, in contrast, focuses on optimizing the execution of compute- and data-intensive “scientific” functions and on developing strategies for the parallel and distributed execution of corresponding pipelines of such components. DXQ [FJM+07] is an extension of XQuery to support distributed applications, and similarly, in Distributed XQuery [RBHS04], remote-execution constructs are embedded within standard XQuery expressions. Both approaches are orthogonal to ours in that they focus on expressing the overall workflow in a distributed XQuery variant, whereas we focus on a dataflow paradigm with actor abstractions, along the lines of Kahn process networks [Kah74]. A different approach is taken in Active XML [ABC+03], where XML documents contain special nodes that represent calls to web services. This constitutes a different type of computation model, applied more directly to P2P settings, whereas our approach targets XML processing for scientific applications deployed within cluster environments. To the best of our knowledge, our approach is the first to consider applications of the MapReduce framework for efficiently executing XML processing pipelines.

Most relevant to the approach presented here is the work around Google's MapReduce framework (see [DG08] or [Läm08]). Similar to MapReduce, our approach provides a framework in which user-defined functions (or external programs) can be applied to sets of data. In addition to map and reduce, however, our framework itself is aware of whole pipelines composed of many such data-analysis functions. Furthermore, our framework provides a hierarchical data model and a declarative middle layer to configure the granularity at which user-defined functions are applied to the data. Questions such as “Should f be called on each A, or on each B (each of which in turn contains a list of As)?” can easily be configured using a configuration language, whereas this aspect is not considered by the MapReduce programming model.
3.7 Summary
In this chapter, we have presented novel approaches for exploiting data parallelism for the efficient execution of XML-based processing pipelines. We considered a general model of computation for scientific workflows that extends existing approaches by supporting fine-grained processing of data organized as XML structures. Unlike other approaches, our computation model also supports processes that employ “update semantics” [MBZL09]. In particular, each step in a workflow can specify (using XPath expressions) the fragments of the overall XML structure it takes as input. During workflow execution, the framework supplies these fragments to processes, receives the updated fragments, combines these updates with the overall structure, and forwards the result to downstream processes. To efficiently execute these workflows, we introduced and analyzed new strategies for exploiting data parallelism in processing pipelines based on a compilation of workflows to the MapReduce framework [DG08]. While MapReduce has been shown to support efficient and robust parallel processing of large (relational) data sets [YDHP07], similar approaches had not been developed to leverage MapReduce for efficient XML-based data processing. The work presented here addresses these open issues by describing parallel approaches to efficiently split and partition XML-structured data sets, which are input to and produced by workflow steps. Similarly, we describe mechanisms for dynamically merging partitions at any level of granularity while maximizing parallelism. Our Parallel strategy allows for maximally decentralized splitting and grouping at any level of granularity: if there are more fragments than slots for parallel execution (i.e., hosts or cores), then any re-grouping is performed in parallel. This has been achieved via specific key structures and MapReduce's sorting support. Furthermore, our framework also allows the data to be merged into a very small number of very large partitions. This is in contrast to existing approaches, in which the partitions are either computed centrally (which can lead to bottlenecks) or a fixed partitioning scheme is assumed. Supporting a dynamic level of data partitioning is beneficial to the workflow tasks, as they are provided the data at exactly the granularity they requested via
declarative scope expressions. Our experimental results verify the efficiency benefits of our parallel regrouping in comparison to more centralized approaches (Naive and XMLFS). By employing MapReduce, we also obtain a number of benefits “for free” over more traditional workflow optimization strategies, including fault tolerance, monitoring, logging, and recovery support. As future work, we intend to extend the Kepler Scientific Workflow System with support for our compilation strategies, as well as to combine the data-parallel approaches presented here with the pipeline-parallel and data-shipping optimizations presented in Chapter 4.
Chapter 4
Optimization II: Minimizing Data Shipping

Efficiency is intelligent laziness.
David Dunham
As we have seen in Chapter 2, VDAL provides a number of advantages over traditional workflow modeling and design. However, there is an associated problem: the flexibility of the VDAL approach comes in part from its “ignore data that is out-of-scope” approach. When implemented directly, this can introduce significant overhead in a distributed environment, because all data is sent to every actor, even if the actor is configured to ignore some or even most of it. This is, especially in data-intensive scientific applications, a major drawback of VDAL workflows. In this chapter1, we show how this problem can be solved by the workflow system itself, without requiring the scientist to explicitly define the data routing, as is the case in, for example, plain process networks. We can thus provide the high-level modeling features to the user while keeping the overhead at a minimum.

1 This chapter is based on [ZBML09b].

We consider a special variant of VDAL workflows called ∆-XML. In particular, we use ordered trees as the basic data model. Since this is the classic model for XML data, we can
not only leverage a large body of existing work on XML, but our results are also widely applicable to other distributed XML processing systems.

Contributions. Using a type system based on XML schemas, we show how to perform a data-dependency analysis to determine which parts of the input data are actually consumed by an actor. We then deploy additional components (distributors and mergers) into the VDAL workflow to eliminate unnecessary data transport between actors. The key idea is to dynamically partition the data stream, ship each part to the “right” place, and then reassemble the stream. We describe this process in detail and present an experimental evaluation that shows the effectiveness of this approach.
4.1 ∆-XML: Virtual Assembly Lines over XML Streams
We adopt a simplified XML data model to represent nested data collections in Comad-style workflows. Here, an XML stream consists of a sequence of tokens of the usual form, i.e., opening tags “[t”, closing tags “]t”, and data nodes “#d”. A well-formed XML stream corresponds to a labeled ordered tree. In general, we view an XML stream as a tree for typing, but as a sequence of tokens in XML process networks.

Definition 4.1 (Streams). Data streams s are given by the following grammar:

s ::= ε | #d | [t s ]t | s s

where t is a label from the label set T. From the perspective of ∆-XML, we assume data nodes to contain binary data whose specific representation is unknown to the framework but understood by actors. Although element labels are typically attached directly to opening and closing delimiters, we use the more convenient notation t[. . . ] when writing streams. Note that element attributes are not considered in the model, but they can be emulated as singleton subtrees.
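As an illustration (not part of the formal model), token streams of this form can be represented directly, e.g. in Python; the well-formedness check below simply verifies that opening and closing tags nest properly.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Open:            # an opening tag  "[t"
    label: str

@dataclass
class Close:           # a closing tag   "]t"
    label: str

@dataclass
class Data:            # a data node     "#d"
    payload: bytes

Token = Union[Open, Close, Data]

def is_well_formed(stream: List[Token]) -> bool:
    """True iff opening and closing tags nest properly (a sequence of trees)."""
    stack: List[str] = []
    for tok in stream:
        if isinstance(tok, Open):
            stack.append(tok.label)
        elif isinstance(tok, Close):
            if not stack or stack.pop() != tok.label:
                return False
    return not stack

# t[ #d ] written as a token sequence:
print(is_well_formed([Open("t"), Data(b"..."), Close("t")]))   # True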
In the following we describe specific languages to express ∆-XML data schemas and actor configurations, and describe type propagation procedures based on these for ∆-XML pipelines. In the next section, we use the approach for type propagation described here to optimize shipping costs for distributed ∆-XML pipelines.
4.1.1 Types and Schemas
To describe the structure of ∆-XML data streams, we use XML regular expression types [HVP05, MLMK05]. These correspond to XML Schema, but use a more compact notation. As is well known, DTDs can be expressed via XML Schema; in fact, XML Schemas are more expressive than DTDs, since they can capture context dependence [BNdB04, LC00].

Definition 4.2 (Type declaration). A type declaration (or production rule) has the form T ::= hti R, where T is a type name, hti is a label (or tag), and R is a regular expression over types. Regular expressions are defined as usual, i.e., using the symbols “,” (sequence), “|” (alternation or union), “?” (optional), “∗” (zero or more), “+” (one or more), and “ε” (empty word). When clear from the context, we omit “,” in sequence expressions.

Definition 4.3 (Schema). We define a schema τ as a finite set of type declarations. Every schema τ implies a set of labels Lτ and a set of types Tτ = Cτ ∪˙ Bτ . The complex types Cτ are those that occur on the left-hand side (lhs) of a declaration in τ, and the base types Bτ are those that only occur on the right-hand side (rhs) of a declaration in τ. By convention, we use the type name Z to denote the base type representing any data item #d.

For the purposes of type propagation and optimization, we place the following restrictions on schemas τ:

1. τ has a single, distinguished type S (i.e., the start or root type), such that S does not occur on the rhs of a declaration in τ. Thus, schemas can be represented as trees
(see Fig. 4.1).

2. Each complex type of τ is defined in at most one type declaration.

3. The type declarations of τ are non-recursive.

4. τ is non-ambiguous [MLMK05], i.e., each stream s that is an instance of τ has a unique mapping (i.e., interpretation; see below) from s to τ. This very common restriction is also known as a deterministic content model; it is, for example, required for element types by the W3C recommendation for XML [BPSM+08].

Restrictions (1) and (2) simplify notation. Restriction (3) is an assumption made by the X-CSR algorithm presented here; a generalization to recursive schemas is, however, possible by computing shipping destinations at run-time. Restriction (4) is crucial and necessary for the definition of signatures and schema-based operations on streams. By dropping (4), we would allow signatures for which there exist data streams with ambiguous mappings to the signature, and thus the question of whether an actor reads a particular data item would no longer be well-defined. This restriction is common in the XML community.

Definition 4.4 (Reachability, down-closed and up-closed). We define reachability on types in a schema τ as follows: the type B is directly reachable from A, denoted A ⇒τ B, iff B occurs in the rhs of the declaration for A; as usual, we define ⇒∗τ as the transitive and reflexive closure of ⇒τ. Let T be a set of types in schema τ. We say that T is down-closed iff T is closed under ⇒∗τ. We define the down-closed extension of T (denoted T↓τ , or simply T↓ when the context is clear) to be the smallest down-closed set T′ that contains T. Similarly, we define up-closed and the up-closed extension (denoted T↑τ ) for the inverse of the relation ⇒τ.

Definition 4.5 (Roots, independence). Using reachability, we can define the roots T∧ of a set of types T in τ (i.e., the “top-most” types of the set) as the smallest set with (T∧)↓ = T↓. Similarly, we say that a set of types T is independent for a schema τ if it is
not possible to reach a type T2 ∈ T from another type T1 ∈ T, i.e., it is not the case that T1 ⇒∗τ T2 for T1 ≠ T2.

Definition 4.6 (Interpretation). An interpretation I of a stream s against a schema τ is a mapping from each node n in s to a type T such that:

1. I(n) = S if n is the root node of s (where S is the start type of τ), and

2. for each node n and its child nodes n1, n2, . . . , nm, there exists a type declaration X ::= hai R with: (i) I(n) = X such that n has the tag hai; and (ii) I(n1), I(n2), . . . , I(nm) ∈ JRK, where JRK is the set of strings over type names denoted by the regular expression R.
Definition 4.7 (Instance). A stream s is an instance of the schema τ (denoted s ∈ Jτ K) iff there exists an interpretation of s against τ .
Definition 4.8 (Subtype). A schema τ1 is a subtype of a schema τ2, denoted τ1 ≺ τ2, iff Jτ1 K ⊆ Jτ2 K. As discussed in [HVP05], the problem of determining whether one regular expression type is a subtype of another is decidable, but EXPTIME-complete; highly optimized implementations, however, work well in practice.

Example 4.9 (Types, Closures, and Roots). Consider the schema τ in Fig. 4.1. Types are shown graphically such that S is the root type of τ, and each downward-pointing arrow denotes a type declaration (with tags given on edges). Moreover, as shown in Fig. 4.1, the set {B, C, Z4} is down-closed, but not up-closed, because S ⇒τ B and S is not a member of the set. {B, C} has the single root B. Also, D↑ = {S, A, D} and D↓ = {D, F, G, Z1, Z2}. An instance s of τ is also given, such that an interpretation I maps each node of s to the type with the corresponding node label.
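The closure operations of Definition 4.4 amount to plain reachability over the type declarations. The following Python sketch computes them, with the schema given as a map from each complex type to the types on the rhs of its declaration; all names are illustrative.

def down_closure(types, rhs_types):
    """Smallest down-closed set containing `types`.
    rhs_types: dict mapping each complex type to the set of types on the
    rhs of its declaration (base types simply have no entry)."""
    closed, work = set(types), list(types)
    while work:
        t = work.pop()
        for child in rhs_types.get(t, ()):   # t directly reaches child
            if child not in closed:
                closed.add(child)
                work.append(child)
    return closed

def up_closure(types, rhs_types):
    """Same computation over the inverted reachability relation."""
    parents = {}
    for parent, children in rhs_types.items():
        for c in children:
            parents.setdefault(c, set()).add(parent)
    return down_closure(types, parents)

# For the schema of Fig. 4.1, {D} down-closes to {D, F, G, Z1, Z2}
# and up-closes to {S, A, D}.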
τ : {S ::= hsi (A | B)∗ , A ::= hai (D+ | E ∗ ), B ::= hbi C, C ::= hci Z4 , D ::= hdi F ∗ G, E ::= hei H ∗ , F ::= hfi Z1 , G ::= hgi Z2 , H ::= hhi Z3 }

Tτ = {S, A, . . . , Z},  Lτ = {hsi, hai, . . . , hhi}

{D}↑ = {S, A, D},  {D}↓ = {D, F, G, Z1 , Z2 },  {B, C}∧ = {B}

s ∈ Jτ K:  s = s[ b[c] a[ d[f f g] d[g g] ] a[ e[h h] e[h h h] ] b[c] b[c] ]

Figure 4.1: A simple schema τ along with: the types and labels of τ; example up-closed, down-closed, and root sets; and an instance s. Base values of type Z are omitted in s.
4.1.2 Actor Configurations
In ∆-XML process networks, actor configurations describe where an actor may modify its input, how the actor's output is structured, and where that output is put back into the stream.

Definition 4.10 (Actor Configuration). A configuration has the form ∆A = hσ, τα , τω i and consists of a specialized actor signature σ for selecting and replacing relevant subtrees of an input stream, an input selection schema τα (A's read-scope), and an output replacement schema τω (A's write-scope).

In general, an actor can be given different configurations in different ∆-XML pipelines. The schema τα describes how a subtree must be structured to be in scope for the actor. The signature σ determines which parts of the in-scope input are to be replaced by new data. The schema of the collection data produced by the actor is given by τω . Next, we define actor signatures, which are used to describe how actors modify the XML stream.
Definition 4.11 (Signature, match rules, match and replacement types). A signature is a set of match rules. A match rule has the form X → R, where X is a type of τα and R is a regular expression over types of τω . We call X the match type, and each type in R a replacement type. We require the match types of a configuration to be independent (see Def. 4.5), thus avoiding ambiguous or nested matches. Additionally, no match type is used in two different rules of σ, i.e., for each match type there is exactly one replacement given.

Intuitively, a match rule says that for any fragment of type X in the input stream, the actor A will put in place of X an output of type R. Unlike data-stream schemas, configuration schemas are allowed to contain multiple root types, which provides greater flexibility when configuring actors. For example, a common root type is not required for types X1 and X2 (similarly Y1 and Y2 ) used in different match rules X1 → Y1 and X2 → Y2 . Match types can be additionally constrained within τα as follows: if a match type X occurs on the lhs of a τα type declaration, the declaration constrains X's content model; whereas if X occurs on the rhs, the declaration constrains X's “upper” context. Similar upper-context constraints are not allowed in τω : we require all lhs types of τω to be reachable from a replacement type Y in σ such that Y is not a type of τα . We also require that the result of applying σ to τα is a non-recursive, non-ambiguous schema (and similarly for propagated schemas τ ′ = ∆A (τ )). Once workflows are configured, we can detect cases that violate this constraint and reject such designs.

Example 4.12 (Actor Configurations). Consider an actor A1 that produces a thumbnail from an image, and an input type

τ : {S ::= hsi G∗ , G ::= hgi Z}
representing a set of images of type G. To replace each image in the given set with the corresponding thumbnail, we use the configuration ∆A1 = hσ, τα , τω i such that σ : {G′ → T }, τα : {G′ ::= hgi Z}, and τω : {T ::= hti Z}. The type that results from applying ∆A1 to τ is thus τ ′ : {S ::= hsi T ∗ , T ::= hti Z}.

Now consider an actor A2 that takes an image and produces a collection containing both the thumbnail and the original image. We can use A2 to replace each image in a stream of type τ with a thumbnail collection using the configuration ∆A2 = hσ, τα , τω i such that σ : {G′ → C}, τα : {G′ ::= hgi Z}, and τω : {C ::= hci G′ T, T ::= hti Z}. The type that results from applying ∆A2 to τ is τ ′ : {S ::= hsi C ∗ , C ::= hci G T, G ::= hgi Z, T ::= hti Z}.

Finally, given an input schema τ with intermediate levels of nesting, τ : {S ::= hsi X ∗ Y ∗ , X ::= hxi G∗ , Y ::= hyi G∗ , G ::= hgi Z}, we can configure A2 to work only on the images under X (and not those under Y ) by simply adding the declaration X ′ ::= hxi G′∗ to τα in ∆A2 .
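For intuition only, the first configuration of Example 4.12 can be replayed with a deliberately simplified Python sketch that substitutes match-type names by their replacements and ignores context constraints, tag matching, and the relevance filter; it is therefore not the general ∆-XML propagation procedure.

def propagate(schema, match_rules, replacement_decls):
    """Very simplified type propagation for the flat case of Example 4.12.

    schema, replacement_decls: dicts  TypeName -> (tag, content_expr)
    match_rules:               dict   MatchTypeName -> replacement_expr
    """
    out = {}
    for name, (tag, expr) in schema.items():
        if name in match_rules:            # matched types are replaced, so drop them
            continue
        for m, repl in match_rules.items():
            expr = expr.replace(m, repl)   # substitute match type by replacement
        out[name] = (tag, expr)
    out.update(replacement_decls)          # add declarations for replacement types
    return out

# Example 4.12: replacing each image G by a thumbnail T
tau   = {"S": ("s", "G*"), "G": ("g", "Z")}
sigma = {"G": "T"}
tau_w = {"T": ("t", "Z")}
print(propagate(tau, sigma, tau_w))        # {'S': ('s', 'T*'), 'T': ('t', 'Z')}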
4.1.3 Type Propagation
Given an input schema τ and an actor configuration ∆A = hσ, τα , τω i, we can infer the schema τ ′ = ∆A (τ ) from ∆A and τ as follows. Without loss of generality, we assume that complex type names are disjoint between τ and τα ∪ τω . Let Mσ = M(σ) be the set of match types of σ. We define Mτ as the types of τ that correspond to the match types Mσ , where T ∈ Mτ iff there exists a T ′ ∈ Mσ such that T and T ′ have the same element tag. Let T0 = (M↓τ )↑ be the set of potentially relevant types, which include the types of Mτ and (intuitively) the types in τ that are “above” and “below” them. Here T0 defines the initial context and corresponding content model of match types in σ mapped to τ . The actual context (the relevant types) Tα = πα (T0 ) is obtained from a so-called relevance filter πα . We use πα : Tτ → Tτ here specifically to remove, or filter out, the types of T0 that do not satisfy the context and content-model constraints of τα . We can define πα as follows. Notice that from T0 we can obtain a set of type paths P of the form

X1 /X2 / . . . /Xn    (n ≥ 1)

where X1 = S, Xi is a parent type of Xi+1 in τ (i.e., Xi ⇒τ Xi+1 ), and there is an Xj = T ∈ Mτ . Thus each path of P starts at the root type of τ and passes through a corresponding match type from σ. Informally, a path is removed from P if: (1) the types above T (wrt. τ ) along the path do not satisfy the context constraints of T in τα ; or (2) the types below T are either not mentioned in, or do not satisfy, the content-model constraints of T in τα .2 Similar to [HVP05], both tests for determining whether a given path satisfies the constraints of τα for T can be performed by checking inclusion between tree automata [CDG+97]. Thus Tα = πα (T0 ) consists of the set of nodes along the unfiltered paths P ′ ⊆ P.

2 Note that match types T with multiple root types in τα can be satisfied from any one of the corresponding root types, i.e., the set of root types of T can be viewed as a union of constraints.

∆-XML Type Propagation. The output type τ ′ = ∆A (τ ) can now be inferred from τ and ∆A as follows. Let P ′ be the set of unfiltered paths as described above. Further,
let P ∈ P ′ be a path that passes through a match type T ∈ Mτ such that T ′ ∈ Mσ is the corresponding match type in σ, and T ′ → R is the replacement rule for T ′ . We construct τ ′ from τ for all such P by replacing the particular occurrence of T (according to P ) by R, and then adding the associated replacement type declarations of τω to τ ′ . In the following section, we use the equivalent formulation τ ′ = ∆A (Tα ) to denote type propagation, where Tα refers only to the rooted fragments of τ that are relevant for an actor configuration ∆A , as opposed to the entire schema τ . Given the above type propagation procedure for ∆-XML, type propagation through a ∆-XML pipeline is straightforward. In particular, we sequentially infer types τi+1 = ∆Ai (τi ) for 1 ≤ i ≤ n, where the output of actor Ai is sent directly to actor Ai+1 , τ = τ1 is the input schema of the pipeline, and τ ′ = τn+1 is the output schema of the pipeline.
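The pipeline-level propagation is just a left-to-right fold over the actors; a minimal sketch, assuming some single-actor propagation function is available (here called apply_config, a hypothetical name):

def propagate_pipeline(tau, configs, apply_config):
    """Sequentially infer tau_{i+1} = Delta_{A_i}(tau_i) along a linear pipeline.
    `apply_config(schema, config)` is assumed to implement single-actor
    propagation; this wrapper only threads the schema through the actors."""
    schemas = [tau]
    for cfg in configs:
        schemas.append(apply_config(schemas[-1], cfg))
    return schemas   # schemas[0] = input schema, schemas[-1] = output schema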
4.2 Optimizing ∆-XML Pipelines
After a ∆-XML pipeline has been built and configured, each actor automatically receives data that matches its read-scope τα , whereas data outside of the read-scope is automatically routed around the actor. Note, however, that all data, regardless of whether it is within an actor's scope or not, is still shipped to each actor. Here we describe the X-CSR dataflow optimization, which exploits the schema and signature information provided by ∆-XML pipelines to route data directly to the relevant “downstream” actors. Intuitively, we distribute the workflow's input data, and the data generated by actors, to the first actor in the pipeline that has it in scope. Here the distribution problem arises:

Problem 4.13 (Distribution). Given the data to be distributed, what is the destination for each collection token and data element?

At the input of downstream actors (and at the very end of the workflow), data might arrive from various locations upstream in the workflow. Since we want to keep the optimized workflow equivalent to the unoptimized one, we need to guarantee that each actor
receives the same data in its scope with and without optimization. It is therefore important to carefully order the data arriving from the various upstream locations while merging it again:

Problem 4.14 (Merge). How can multiple streams be merged such that the original order of the data and collections is restored?

We solve the distribution problem by analyzing where in the workflow each part of the input is first used. We do this by leveraging the signature information of the actors and the workflow's input schema. We solve the merge problem by putting additional information into the streams when they are split up by the distributors. On the “main stream”, going serially from actor to actor, we put “hole” markers at the positions where we cut out parts of the data that are sent further down the stream (on “bypassing lanes”). By grouping bypassed token sequences using filler markers, we can then pair holes and fillers to merge the streams in an order-preserving manner.
4.2.1 Cost Model
For our routing optimization we assume that base data is significantly larger in size than the opening and closing tokens used to provide the context. This assumption is especially true in data-driven scientific workflows [PLK07], which typically deal with complex and large data sets. We also assume for simplicity of presentation that all actors are allocated at different hosts and that data can be sent between arbitrary hosts. We strive to minimize the total amount of data that is shipped during the execution of a scientific workflow (modeled as a ∆-XML pipeline).
4.2.2 X-CSR: XML Cut, Ship, Reassemble
In ∆-XML, by definition, actor results are independent of all data fragments that are outside the actor's read-scope. Therefore, we can alter the incoming stream of an actor as long as we guarantee that all data within its read-scope is presented to the actor as before.
Example. Consider a stream with the schema {S ::= hsi (A | B | C)∗ , A ::= hai Z, B ::= hbi Z, C ::= hci Z} and a workflow consisting of three actors A1 , A2 , and A3 that consume A, B, and C while producing A′ , B ′ , and C ′ , respectively. We introduce a stream splitter, or distributor, in front of the three actors. The distributor has three output channels, each of which leads directly to one of the actors. After all three actors “did their job”, there are three separate streams, each containing the output of a single actor. A stream merger is then used to merge these streams together to form a single output. Of course, we expect the output stream of an optimized workflow to be equivalent to that of an unoptimized one. It is therefore essential that we keep track of the order of events, especially when splitting the stream. We therefore insert markers, or holes (denoted ◦), into the main stream3 whenever we decide to send irrelevant data onto the “fastlane”. When we later merge the bypassed data fragments back into the stream, we only need to fit those fragments into their original positions, denoted by corresponding holes.4 We further “group” fragments on the bypassing lane by adding filler tags (denoted •) to match single holes within a possibly longer sequence of bypassed elements.
General Case

In the general setting, we deploy distributors after each actor to route its output to the closest downstream actor that has the data in its scope. Similarly, we also deploy mergers in front of every actor, as it might receive data from various upstream locations. In Fig. 4.2, this general pattern is shown5.
3 The “main stream” is the stream on the original assembly-line channels, i.e., from one actor to the next.
4 This approach is similar to promise pipelining, a technique that greatly reduces latency in distributed messaging settings.
5 There is no merger in front of the first actor and no distributor after the last one, as there is obviously no data to be merged, and only one final destination to send data to.
Actor signatures used in Fig. 4.2: σA1 = {F → BEG}, σA2 = {G → (E | B)}, σA3 = {E → X}.
Figure 4.2: X-CSR (“X-scissor ”) in action: (a) conceptual user design (unoptimized) with actor signatures σAi (part of configurations ∆Ai ), initial input schema τ and inferred intermediate schemas (dash-boxed schema trees, above channels) and final inferred schema (after A3 ). (b) optimized (system-generated) design: The sub-network M2 ; A2 ; D2 (M2 and D2 reside on the same host as A2 ) shows the general pattern: A2 receives, via the merger M2 , all parts relevant for its read-scope, then performs its localized operation. The distributor D2 “cuts away” parts that are irrelevant to the subsequent actor A3 and ships them further downstream, but not before leaving marked “holes” behind where the cutting was done. This allows downstream mergers to pair the cut-out, bypassed chunks (which were going on the “fastlane”) back with the holes from which the chunks were originally taken; (c) distributors D0 , D1 , and D2 “cut” the schemas on the wire as indicated.
Consider the input type τ of the ∆-XML pipeline in Fig. 4.2, together with the read-scope of actor A1 : only F is in the match type of its signature σ1 . The relevant type path leading to F is S/A/D/F ; we therefore send the “F -data” and its context to actor A1 . Now consider the second actor A2 : its match type is G. As there is a G in τ (right next to F ), this G is relevant for the second actor. Note that, because actor A1 is not allowed to change parts of the stream that are not in its scope (i.e., G), it is safe to send G “on the fastlane” directly to the front of actor A2 , where it will be merged back into the main stream. Next, consider actor A3 , which operates on E. Since both A1 and A2 will ignore the “E-data” (including the list of H-data beneath it), we can safely ship this portion of the stream to the third actor. Now we have determined shipping destinations for all types except B and C. As they are not “picked up” by any actor in the workflow, we bypass them to the very end. In summary, we imagine the input schema τ being cut into pieces as shown at the bottom of Fig. 4.2. The immediately following actor receives the data inside its read-scope together with the corresponding context. Then all other downstream actors take turns cutting out their portion of the stream, possibly together with some of the remaining context, i.e., context that has not yet been shipped to a preceding actor. The distributor then partitions the stream according to the partitioning of the schema. We will use a partition of types d◦i,i+1 , d•i,i+2 , . . . , d•i,i+j ⊆ Tτ of a schema τ to describe the action of a certain distributor Di . While d◦i,i+1 contains the main context (up to the type S) and holes wherever data is cut out, the d•i,j contain bypassed data grouped under •-labels.

Labeling Holes. To be able to attach bypassed parts back into the main context, which is sent on the main line, the distributors put hole markers (◦) into the stream; the bypassed parts are grouped using filler tags (•). To match up the fillings with the holes later on, the holes need to be indexed: if holes were not distinguishable, merger M2 would not know whether an encountered hole corresponds to some data that is sent on d•02 (and should thus be filled), or whether the hole corresponds to some data on channel d•03 (and should therefore be ignored, because merger M3 will fill it). However, only marking the hole with the index of the merger that should
fill in the data again is not sufficient. To see this, consider merger M3 when it receives a hole marked “to-be-filled-by-M3 ”: it cannot decide from which bypassing channel it is supposed to grab the filling. However, since we are not changing the order of the main and bypassing channels, it is not necessary to number the holes and fillings uniquely; “source distributor” and “destination merger” together provide enough information for each merger to unambiguously augment the stream with the formerly bypassed data.

 1  INPUT:  τ, ∆1 , . . . , ∆n ; ∆i = hσi , ταi , τωi i          // input schema, configurations
 2  OUTPUT: d◦i−1,i for i = 1, . . . , n                          // distribution queries
 3          d•ij for i + 1 < j; j = 1, . . . , n + 1
 4  CODE:
 5
 6  Mσi := M(σi ), i = 1, . . . , n                               // actor match types
 7  τ1 := τ                                                       // intermediate schemas
 8  FOR i := 1 TO n DO
 9      d◦i−1,i := παi ( (Mσi ↓τi )↑τi )                          // ship what is asked for
10      R := d◦i−1,i                                              // already assigned variables
11      FOR j := i + 1 TO n DO
12          d•i−1,j := παj ( (Mσj ↓τi )↑τi ) \ R
13          R := R ∪ d•i−1,j
14      ENDFOR
15      d•i−1,n+1 := Tτi \ R
16      τi+1 := ∆i ( d◦i−1,i ( mi ( τi ) ) )                      // type propagation
17  ENDFOR
18  RETURN: all d◦ij and d•ij , with labeling according to channel number

Listing 4.1: X-CSR algorithm for statically computing distributor specifications
4.2.3 Distributor and Merger Specifications
As illustrated above, mergers are not very complex. A merger Mi with one main-line and 1, . . . , n fastlane inputs will sequentially read the main-line stream. This “actor” ignores all tokens in the stream except holes labeled with ◦jk where k = i. When such a hole ◦ji is read, the merger reads a new filling from channel j and inserts the data within the filling
markers back into the main stream. Distributors, on the other hand, need to be configured, i.e., the correct partitioning d◦i,i+1 , d•i,i+2 , . . . , d•i,i+j of the set of types needs to be inferred. This can be done in the spirit of the example above; the general algorithm is given in Listing 4.1. Starting with the input type τ of the workflow (line 7), the match types are complemented by all types and data below them (down-close operation in line 9), and the types up to the root symbol S are added (up-close operation). Then, the relevant parts are selected via the παi operator as described in the previous section. This set of types “denotes” the operation on the main line. We accumulate the types to which we have already assigned a destination in the set R (line 13). Then, we loop over all downstream actors to find the “left-over” yet relevant data for them. If some types are not relevant for any actor in the workflow, they are added to the last bypassing channel, which merges at the very end (like B and C in Fig. 4.2). Once one distributor is fully specified, the current type is propagated through the “hole-making” operation and through the merger, and the result type is then propagated through the next actor (line 16). All following distributors are then configured by performing the same steps.

From Schema-level to Instance-level. At runtime, the distributor continuously maps incoming tokens to type symbols in its schema6 . It then sends the data to the correct destination based on the partition of its schema, adding holes and fillings as appropriate.
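As a rough sketch of the runtime behavior of a merger, the following Python generator copies the main stream and splices in bypassed chunks when it sees a hole addressed to it; the hole and channel representations here are hypothetical, not those of the actual implementation.

from collections import namedtuple

# A hole records which distributor cut the data out (src) and which merger
# is responsible for filling it back in (dst).
Hole = namedtuple("Hole", ["src", "dst"])

def merger(my_id, main_stream, bypass_channels):
    """Yield the merged token stream for one merger (sketch).

    main_stream     : iterator over tokens, possibly containing Hole markers
    bypass_channels : dict  src-distributor-id -> iterator over filler chunks,
                      where each chunk is the list of tokens of one filling
    """
    for tok in main_stream:
        if isinstance(tok, Hole) and tok.dst == my_id:
            # splice the next bypassed chunk from the matching channel
            for bypassed in next(bypass_channels[tok.src]):
                yield bypassed
        else:
            # pass through everything else, including holes for later mergers
            yield tok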
4.3 Implementation and Evaluation
In this section, we describe the implementation and experimental evaluation of the X-CSR optimization. The cost savings enabled by X-CSR are in part based on the following observations. Assuming actors perform in-place substitution and given a basic cost model that considers only the cost of shipping base data, X-CSR yields optimal shipping costs: 6
6 Remember, this mapping exists because our schemas are non-ambiguous. To increase “stream-ability”, we can further demand that the mapping be computable as the tokens come in.
Proposition 4.15 (Shipping Optimality). Every base data element is sent directly from its originating actor Ai on host Hi to its destination actor Ak on host Hk (i ≤ k), without being sent to an intermediate actor Aj on host Hj for which it is irrelevant.

To see that X-CSR is shipping-optimal, notice that as soon as a data token is produced by an actor (or provided to the first actor of a pipeline), X-CSR finds the closest actor downstream that has the data item in scope. The data is then sent directly to this actor without passing through intermediate actors (as it would in the unoptimized case). Because the data item must be received by this actor to guarantee equivalence with the unoptimized version of the pipeline, this shipping is indeed necessary, and thus optimal. We also show in the evaluation below that the overhead introduced by X-CSR is minimal. In an unoptimized workflow, shipping data hdi from actor Ai to actor Ai+n , the closest downstream actor that has hdi in scope, involves shipping sizeof(hdi) · n bytes. The optimized version sends the data directly to Ai+n and thus ships only sizeof(hdi) bytes, resulting in a savings of sizeof(hdi) · (n − 1) bytes. It can be shown that the saved shipping cost is linear in the number of bypassed actors as well as in the size of the total base data involved in the shipping optimization. Thus, with X-CSR, the more data is shipped, the bigger the savings.
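The savings formula can be checked with a few lines of Python; the item model below (size plus number of hops to the first consuming actor) is a simplification of the cost model of Section 4.2.1.

def shipping_costs(items):
    """items: list of (size_mb, hops), where `hops` is the number of pipeline
    hops between the actor producing the item and the closest downstream
    actor that has it in scope (n in the text)."""
    unoptimized = sum(size * hops for size, hops in items)
    optimized = sum(size for size, _ in items)   # one direct shipment per item
    return unoptimized, optimized, unoptimized - optimized

# A 5MB item whose consumer is 3 hops downstream: savings of 5 * (3 - 1) = 10MB.
print(shipping_costs([(5, 3)]))   # (15, 5, 10)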
4.3.1 Experimental Setup
We have implemented a distributed stream-processing system based on the Kahn Process Network [Kah74] model. We use PVM (Parallel Virtual Machine) [PVMa] for process creation and message passing between actors. Each actor is implemented as its own process and runs on a different host. Opening and closing tags (including holes and fillers) are sent using PVM messages, whereas (large) data tokens are kept as files on the local filesystems and are sent between hosts using remote secure copy (scp). Keeping large data files on disk is common in scientific workflows, e.g., actors often invoke external programs with input and output provided and generated as files. This setting fits our assumptions that data is generally expensive to ship compared to collection delimiters.
To evaluate the approach, we deployed the system on a 40-node cluster of 2.5GHz dual-Opteron machines running Linux. The nodes are connected to each other by a 100 MBit/s switched LAN. Each actor is started on one of the cluster nodes using a round-robin assignment of actors to hosts.

Example Workflow. The following example is used to analyze and explain the benefits of the X-CSR optimization. Consider a 3-actor workflow A1 → A2 → A3 with replacement rules

∆A1 : σ = {A → B | U }
∆A2 : σ = {B → C | V }
∆A3 : σ = {C → W }.

Corresponding actor match types τα and replacement types τω contain type declarations of the form X ::= hxi Z for each type X in the replacement rules. For instance, actor A1 works on input labeled with hai; it outputs data tagged either with hbi or with hui, depending on the actual data it reads. The property by which A1 “chooses” its output is not observed by the type system: A1 could perform an expensive analysis of its input data, and depending on the quality of the outcome, the result might need further refinement by A2 , or no further refinement by A2 or A3 at all. Analogously, A2 can either produce a final result V or pass its result on to the third actor. In our experiments, we consider data tokens of 5MB in size. We also conducted experiments with varying data sizes of 1, 10, 20, and 100MB, where execution times scaled linearly across data sizes. In addition, we assume that each actor Ai immediately outputs its result without any extensive computation or delay.
4.3.2 Experimental Results
Based on the actor configurations, several scenarios of data flow are possible. In Table 4.1, we present three cases, called parallel, serial, and mixed, to study the savings in data shipping as well as the overall execution time. Fig. 4.3 shows wall-clock execution times for these cases. We varied the amount of input data to the workflow on the x-axis. We executed each workflow 5 times for each input7. Crosses and pluses represent individual runs (optimized and original workflow, respectively). The curves connect the averaged times to factor out noise and to show the overall trend.

7 For some configurations, we ran the workflow more often, obtaining the same results with equally low variance.

Scenario (a) Parallel
  Actor rules: A1 : hai ↦ hui, A2 : hbi ↦ hvi, A3 : hci ↦ hwi
  Input data: s[ (a[z] b[z] c[z] w[z]) ∗ i ]
  Actual dataflow (optimized): D0 routes items directly to A1 , A2 , A3 , and M4
  Data shipped (MB): orig. 80i, opt. 35i (−56%)
  Exec. time (sec): orig. ≈ 3.6i, opt. ≈ 1.1i (−69%)

Scenario (b) Serial
  Actor rules: A1 : hai ↦ hbi, A2 : hbi ↦ hci, A3 : hci ↦ hwi
  Input data: s[ a[z] ∗ i ]
  Actual dataflow (optimized): unchanged, A1 → A2 → A3
  Data shipped (MB): orig. 80i, opt. 80i (0%)
  Exec. time (sec): orig. ≈ 3.6i, opt. ≈ 2.6i (−28%)

Scenario (c) Mixed
  Actor rules: A1 : hai ↦ hbi, A2 : hbi ↦ hvi, A3 : hci ↦ hwi
  Input data: s[ (a[z] ∗ i) (c[z] ∗ i) ]
  Actual dataflow (optimized): D0 routes to A1 → A2 and to A3 ; merged at M4
  Data shipped (MB): orig. 80i, opt. 50i (−38%)
  Exec. time (sec): orig. ≈ 3.6i, opt. ≈ 2.2i (−39%)

Table 4.1: X-CSR optimized vs. standard: reduction in data shipping and execution times.
[Figure 4.3 consists of three plots, (a) Parallel, (b) Serial, and (c) Mixed, each showing runtime in seconds (0 to 80) over the number of data items (0 to 20), with individual and averaged runs of the original and the optimized workflow.]

Figure 4.3: X-CSR experiments, standard versus optimized. Execution times with and without X-CSR optimization for an increasing number of data items in the different scenarios.
The system experiences a larger speed-up than the saved amount of data alone would suggest: by using distributors and mergers, the expensive data transfer is moved away from the actors themselves, allowing additional concurrency. We will observe a speed-up due to this effect in all other cases as well.
(b) "Serial" actors. Now consider the other extreme case, in which A1 and A2 always output ⟨b⟩- and ⟨c⟩-tagged data, respectively. If only ⟨a⟩ data is provided as input to the workflow, not a single data item can be bypassed. The dataflow structure of the optimized workflow does not differ from the original workflow structure (Table 4.1(b)). Since the same data has to be shipped in both versions, we would expect their execution times to be very similar. In our experiments, however, the optimized version outperformed the original one by 28%. We attribute this additional speed-up to the increased concurrency gained by decoupling sending and receiving from the actors' execution through the introduction of distributors and mergers.
(c) "Mixed parallel and serial" actors. Let A1 and A2 always output ⟨b⟩- and ⟨v⟩-labeled data, respectively. If we provide the workflow with only ⟨a⟩ data (for the first actor) and ⟨c⟩ data (for the third), the dataflow shown in Table 4.1(c) arises. Savings in data shipping as well as in execution time fall, as expected, between our two extreme cases "parallel" and "serial".
Dynamic routing. In practice, processing pipelines often involve combinations of the cases given in (a)-(c). That is, a single run of a pipeline can involve different types of routing and levels of parallelism. Because our distributors are implemented using a hierarchical state machine to parse incoming data streams, the correct routing decision is made dynamically, at runtime. Having all of the possible actor dependencies within one workflow demonstrates the generality of the X-CSR approach: while it would be possible to explicitly model a task-parallel network as depicted in Table 4.1(a), this model would not be able to accommodate the case that A1 produces output for A2. On the other hand, modeling the workflow as in (b) results in expensive, unnecessary shipping when data is
not serially flowing from one actor to the next. Leveraging X-CSR, the workflow can be conveniently modeled as a linear pipeline while the data is dynamically routed by the framework itself, ensuring shipping optimality for large data items according to (sopt).
Overhead. To investigate the overhead introduced by the additional actors (distributors and mergers), as well as by the additional tokens (holes and fillers) sent, we ran the workflow on the same input structure but without data items. The execution times without data being sent are very small in general. In fact, the time spent sending tokens through the workflow was 0.5 seconds^8 for both the optimized and the original workflow. However, the time for starting and connecting the actors on the different hosts increased from 0.2 seconds to 0.4 seconds from the original to the optimized version. Hence, the total execution time (running + setup) increased from 0.7 seconds to 0.9 seconds for the given workflow when run without the actual data. We believe this initial delay is a tolerable overhead, and we do not expect the shipping of additional tokens or the added distributors and mergers to slow down the workflow significantly, considering that in real workflows significant execution time is spent in shipping and computation.
Comparison to a "central database" approach. The parallel example would also perform reasonably well in a more traditional setting where the data is kept in a central repository and only references are shipped through the stream. Each actor would fetch the relevant data, process it, and then push it back to the central server. However, since we would need to ship i · 30MB^9 from and to the actors, this would take at least 2.4i seconds^10 for a server connected at 100MBit/s (as in our cluster), which is more than twice the time it takes in the X-CSR optimized version. The situation is even worse in the serial case, as there are 6 · 4 · i data shipments of 5MB chunks involved^11, which results in a lower bound of 9.6i seconds, more than 3 times slower than our X-CSR approach.

8 Filtering out some larger execution times that were caused by noise on the cluster.
9 Or even i · 5MB more if we assume source and target are on different nodes, as in our scenarios.
10 i · 30MB · 8 Bits/Byte divided by 100 MBits/sec.
11 Each of the actors fetches and puts the data.
Read-only access of actors. The default mode of computation in ∆-XML assumes that actors perform in-place substitution, i.e., matched fragments are in general replaced with new data. If, however, an actor only adds new data to the input stream, keeping its matched input data intact, the presented ∆-XML type system is not aware of this. Our framework can, however, easily be extended to handle such add-only actors by allowing configurations that declare which matched types are to be replaced and which are to be left in place by an actor. An extended X-CSR algorithm could then take this additional information into account and ship read-only data to multiple destinations in parallel to increase concurrency.
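As an illustration of what such an extended configuration could look like, the sketch below marks each matched type as read-only or consumed and computes all shipping destinations for a tag; the struct and the function are our own illustration and are not part of the implemented system.

    #include <string>
    #include <vector>

    // Illustrative extended signature entry: the tag an actor matches and whether
    // it only reads the matched data (add-only) or replaces it.
    struct MatchDecl {
      std::string tag;
      bool readOnly;
    };

    // All destinations a data item labeled `tag` should be shipped to: every read-only
    // matcher downstream, up to and including the first actor that consumes (replaces) it.
    std::vector<size_t> destinations(const std::vector<std::vector<MatchDecl> >& actors,
                                     const std::string& tag) {
      std::vector<size_t> dests;
      for (size_t i = 0; i < actors.size(); ++i) {
        for (size_t j = 0; j < actors[i].size(); ++j) {
          if (actors[i][j].tag != tag) continue;
          dests.push_back(i);
          if (!actors[i][j].readOnly) return dests;  // consumed here: stop shipping
        }
      }
      return dests;  // never consumed: the item also travels to the end of the stream
    }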
4.4 Related Work
Closely related to the shipping optimization presented here is work on query processing over XML streams, see [LMP02, CCD+03, CDTW00, KSSS04b, GGM+04, CDZ06] among many others. However, most of these approaches consider optimizations for specific XML query languages or language fragments, sometimes taking into account additional aspects of the streaming data (e.g., sliding windows). They do not, however, evaluate their approaches against the specific desiderata we have given in Section 1.3.4. They also do not focus on incorporating existing scientific functions into the framework; and the XML documents dealt with in these approaches usually do not contain large chunks of leaf-node data, which is very common in scientific applications. Consequently, they do not address the implications that come with this assumption. To the best of our knowledge, there exists, for example, no work that exploits a regular-expression-based type system to analyze dataflow and subsequently optimize the data shipping for distributed XML process networks.
Since VDAL workflows are usually executed as streaming collections flowing through data processors (the actors), previous work on stream processing, and in particular on streaming XML data, becomes relevant. Work in the general area of stream processing (see [BBD+02, CBB+03] for an overview) is concerned with the following aspects:
Blocking/unblocking. A significant amount of work focuses on operators for stream
processing. As streaming datasets can only be read once (due to their possibly infinite size), the focus is often on "unblocking" traditionally blocking operators. The punctuations work [TMSF03] is related to ours, since the holes we use in the X-CSR optimization can be seen as special punctuations that inform stream processors about properties of the stream, i.e., that data arriving on another stream has to be read to fill in the hole.
Bounding memory of stream processing elements. Since input streams can potentially be infinite, it is necessary to restrict the memory usage of the stream processor. Here, a large body of work exists on using automata to parse and process XML streams; see [Sch07] for a survey.
Closely related to the X-CSR shipping optimization is the work presented in [CLL06]. Here, Chirkova et al. consider the problem of minimizing the data that is sent as the answer to related relational queries in a distributed system. For a set of conjunctive queries, they try to find a minimal set of views that is sufficient to answer the queries; a minimal view set is one that takes the least number of bytes to store. We are not aware of any work in the field of XML processing that tries to minimize the size of data shipped between stream processors.
4.5 Summary
In this chapter we showed how to utilize type-level information about actors to optimize data transport in scientific workflows. We presented a formalism to represent schema information about the data sent between actors. We also defined actor signatures to characterize which parts of the input are used by the actors to produce outputs. Performing a data-dependency analysis, we were able to insert distributors into the pipeline that ship base data only to actors that will read, modify, delete or replace this data. Using labeled hole and filler items in the data stream allowed us to merge the forked data streams back into one single output. We showed the optimality of our approach for large base data sizes with respect to the information captured by the actor signatures. Our experimental analysis, performed on a cluster, showed the effectiveness of the overall approach.
Chapter 5
Implementation: Light-weight Parallel PN Engine

Science is what we understand well enough to explain to a computer. Art is everything else we do.
Donald Knuth
In this chapter, we describe the implementation of our light-weight, parallel process network engine (PPN). This engine has been used to perform the experiments described in Chapter 4. We first describe the design decisions made, then give an overview of the system architecture, and present experiments demonstrating system performance. The PPN engine has been coupled to the Kepler system via a dedicated Kepler director, in an effort to support transparent parallel execution within the Kepler system. In the last part of this chapter, we describe this effort.^1

1 The PPN engine and its Kepler coupling have been presented in [ZLL09].
5.1 General Design Decisions
Our work here is intended as an experimentation platform for the execution of scientific workflows that are computationally intensive, data-intensive, or both. We provide a clean
process-network engine with Actor and Port abstractions. We also put a strong emphasis on being able to write actors in different programming languages. Currently, actors can be written in C++, Perl, Python, and Java, and as shell scripts. To obtain a scalable base system, workflow execution is completely decentralized; a central component is only necessary to orchestrate (set up, monitor, and terminate) workflow execution. As implementation language we chose C++, since it is easy to link C++ libraries to other languages such as Perl, Java or Python via SWIG [SWI]. PPN is implemented on top of PVM, a portable software library for message passing that provides abstractions for hosts, tasks, and messages between tasks. We used PVM++ [PVMb] to interface with PVM, as well as the Boost library [Boo] for interacting with the filesystem and for parsing command-line options. All network and messaging access goes through PVM++, which makes it easy to exchange PVM for another library, for example an MPI implementation.
In PPN, each actor is implemented as its own process. We provide a base class Actor from which all components in a workflow inherit (see Listing 5.1). For logging and debugging purposes, each actor has an actor name and an instance name. The method initialize() is called when the actor is created, before ports are connected; go() is called iteratively while the workflow is running and go() itself returns true. When the execution is done, the method cleanup() is invoked to perform user-defined cleanup work. For sending data, we created a template class Port that can be instantiated with primitive types such as int, char or std::string. The Port class encapsulates sending and receiving completely (see Listing 5.2). From the actor's point of view, the two operators << and >> are used to send data through and receive data from a port. We also implemented a custom BLOB data type that represents data residing in the temporary directory of the actor. The BLOB class provides methods to get a filename for this data and to create new BLOB tokens from existing files. When BLOB data is sent through ports, it is sent via scp to the host on which the receiving actor is located. The workflow system can be configured to not copy files if both actors are on the same host; instead, a hard link to the existing file is created.
    class ActorImpl;

    class Actor {
    public:
      Actor(const std::string & instanceName = "");
      virtual ~Actor();

      virtual void initialize();
      virtual bool go();
      virtual void cleanup();

      std::string getInstanceName();
      virtual std::string getActorName() { return "Actor"; }

      static void sleep(unsigned long microseconds);
      static void system(const std::string &cmd);

    protected:
      ActorImpl *myImpl;
    };

Listing 5.1: Actor class declaration
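As a usage sketch, a minimal user-defined actor could look as follows; the class is hypothetical and only exercises the lifecycle described above, omitting ports for brevity.

    #include <iostream>
    #include <string>

    // Hypothetical example actor: prints a message a fixed number of times and stops.
    class GreeterActor : public Actor {
    public:
      GreeterActor() : Actor("greeter1"), count(0) {}
      virtual void initialize() { count = 0; }              // called once, before ports are connected
      virtual bool go() {                                    // called repeatedly while it returns true
        std::cout << getInstanceName() << ": step " << count << std::endl;
        return ++count < 3;
      }
      virtual void cleanup() { std::cout << "done" << std::endl; }
      virtual std::string getActorName() { return "GreeterActor"; }
    private:
      int count;
    };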
    template <class T> class InPort : public InPortB {
    public:
      InPort(const char *name);
      InPort & operator>>(T &data);
      void read(T &data);
      T readR() { T d; read(d); return d; }
      ~InPort();

    protected:
      InPortImpl<T> *myImpl;
      virtual void reset();
    };

    template <class T> class OutPort : public OutPortB {
    public:
      OutPort(const char *name);
      OutPort & operator ......

...... 0, then f(Γ(E)) and l(Γ(E)) are independent from Γ; so let us lift f and l to expressions, keeping in mind that we restrict valuations to not contain zeroes. It is easy to see that the following recursion computes f(E) and l(E) for a ∈ Σ, x ∈ V,
and p ∈ N[V]:

    f(a^p)     := a        l(a^p)     := a
    f(E · E′)  := f(E)     l(E · E′)  := l(E′)
    f(E^x)     := f(E)     l(E^x)     := l(E)
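For example, for E = (0^2 1 0^3)^x · 1^y this recursion yields f(E) = f((0^2 1 0^3)^x) = 0 and l(E) = l(1^y) = 1.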
Theorem 7.26 (Checking E for alt-NF). An expression E in simple-NF is in alt-NF if E is collision-free, with collision-freeness (cfree) defined recursively as follows:

    cfree(a^p)     := true
    cfree(E · E′)  := l(E) ≠ f(E′) ∧ cfree(E) ∧ cfree(E′)
    cfree(E^x)     := l(E) ≠ f(E) ∧ cfree(E).

Proof. ¬cfree(E) ⇒ E is not in alt-NF: choose Γ ≡ 1 to obtain a valuation in which the length of E without the small polynomials is greater than its number of alternations. cfree(E) ⇒ E is in alt-NF: simple structural induction.

Transforming from simple-NF to alt-NF. Given an expression E in simple-NF, we construct an expression E′ in alt-NF that is "almost" equivalent to E. Here, "almost" can be understood in the following sense: we partially expand some scalar multiplications and substitute the respective variables with new ones, e.g., (010)^x becomes 010(010)^{x″}010. Since our variables are natural numbers, the new expression then "misses" some values (namely those for Γ(x) = 0 and for Γ(x) = 1). However, if we "move" the variables by the same amount on both sides of the SPE equation and compare these new polynomials, then we have essentially solved a restricted-equivalence problem for the original polynomials, which is enough according to our earlier reduction.
We now explain the crucial insights behind the transformation on simple examples. This explanation is not very formal, but it provides enough intuition to understand the general algorithm, which is presented afterwards. The crucial insights are:
(1) Concatenating expressions with simple ends is easy: Consider two expressions E and E′ with simple ends, i.e., whose tail and head, respectively, are monomials. Example: the violating concatenation 0 1^x · 1^{y+x} 0^3 is easily fixed by merging the ends to form 0 1^{y+2x} 0^3; however, concatenating (01)^x with 1^{y+x} 0^3 is harder, since we cannot easily merge the ends with each other.
(2) From an expression E that does not have simple ends, we can create an expression E′ that does: We can easily transform (01)^x, which has neither a simple head nor a simple tail, into the expression 01(01)^{x′} with a simple head, and further into 01(01)^{x″}01 with a simple head and tail. Note that this transformation is only valid for Γ(x) ≥ 2, with x″ then ranging from 0 to infinity as usual. We will take care of this problem later.^4
(3) Any violating expression of the form E^x, i.e., with l(E) = f(E), can be "fixed": Consider E^x = (0^2 1 0^3)^x; clearly E^x is not in alt-NF although E is. By expanding x once to the right, we obtain E′ := E^{x′} · E = (0^2 1 0^3)^{x′} 0^2 1 0^3. Note that the 0^3 will be right next to 0^2 for Γ(x′) = 1. Further, if Γ(x′) > 1, then the outside ends, here 0^2 and 0^3, will always be next to each other and form a regular pattern with the inside of E, here 1. We can thus rewrite the term by moving the 0^2 inside the parentheses to the end of the parentheses after 0^3, adding a single 0^2 before the parentheses to make up for the loss, and removing the 0^2 right after the parentheses. This turns the expression into alt-NF; in our example, we obtain 0^2 (1 0^5)^{x′} 1 0^3. Again, this transformation is only valid for x ≥ 1 and 0 ≤ x′ ∈ N.

4 In fact, as we will show later, we can just replace x″ by x − 2. Possibly negative exponents can be interpreted as inverse letters that cancel out normal letters.
The algorithms for the general case are shown in Fig. 7.7 and Fig. 7.8.
7.5.5 Collecting Exponents into Big Polynomials
Starting from string polynomials in alt-NF, we now develop further machinery for building normal forms. As a relatively simple rewriting, we collect cascading exponentiations into polynomials. Although these are regular polynomials, we call them big polynomials to emphasize their position in the string polynomial.
    to-alt-NF(a^p):
        return a^p

    to-alt-NF(E1 · E2):
        N1 := to-alt-NF(E1)
        N2 := to-alt-NF(E2)
        if l(N1) ≠ f(N2) then
            return N1 · N2
        else
            N1′ := makeEndingSimple(N1)
            N2′ := makeBeginningSimple(N2)
            return to-simple-NF(N1′ · N2′)

    to-alt-NF(E^x):
        N := to-alt-NF(E)
        if f(N) ≠ l(N) then
            return N^x
        else
            N′ := makeEndingSimple(N)
            S := makeBeginningSimple(N′)
            expand S^x to S^{x′} · S
            M := MoveParenthesisIn(S^{x′} · S)
            return to-simple-NF(M)

Figure 7.7: Algorithm to transform into alt-NF
    makeBeginningSimple(E):
        if E = a^p            return E
        if E = F^x            return makeBeginningSimple(F) · F^{x′}
        // Now E is of type E1 · E2
        if E = a^p · E2       return E
        if E = (F^x) · E2     return makeBeginningSimple(F^x) · E2
        if E = (E1 E2) · E3   return makeBeginningSimple(E1) · E2 E3

    makeEndingSimple(E):
        // analogous to makeBeginningSimple

    MoveParenthesisIn(E^x · E):
        // E needs to have a simple/monomial head and tail of the same kind,
        // i.e., is of the form: let a^p · E′ a^q := E
        return a^p (E′ a^{q+p})^x E′ a^q

Figure 7.8: Helper algorithms for alt-NF
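As a worked trace of these algorithms on the earlier example: for (0^2 1 0^3)^x, the body 0^2 1 0^3 is already in alt-NF but has f = l = 0, so the else branch of to-alt-NF applies; makeEndingSimple and makeBeginningSimple leave it unchanged (both ends are already monomials), expanding once gives (0^2 1 0^3)^{x′} 0^2 1 0^3, and MoveParenthesisIn yields 0^2 (1 0^5)^{x′} 1 0^3, the alt-NF expression derived informally above.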
Given an expression in alt-NF (and thus in simple-NF), we apply the following rewrite rule anywhere in the tree:

    (E^p)^q → E^{p·q}    for p, q ∈ V
It is easy to see that this rewrite rule is normalizing, again modulo polynomial equivalence in the big polynomials. Note that all big polynomials we create (so far) are products of variables. Furthermore, this transformation obviously translates E into an expression E′ with E′ ≡ E.
7.5.6 Comparing Lists of Monomials
Theorem 7.27 (Monomial equivalence). Testing monomials for equivalence reduces to testing regular polynomial equivalence:

    a^p ≡_{≥c} b^q for a c ≥ 1   ⇐⇒   a = b ∧ p ≡ q
Proof. "⇐" is clear. "⇒": For a^p and b^q to be equivalent for all Γ ≥ c with c ≥ 1, clearly a = b. Further, p and q need to agree on all Γ ≥ c. Since there are infinitely many such Γ and the maximal degree of p and q is finite, the polynomials have to be equivalent.
Similarly, for a concatenation of simple monomials M1 := a^{p1} b^{p2} · · ·^{pn} and M2 := ā^{q1} b̄^{q2} · · ·^{qm} to be equivalent for all Γ ≥ c ≥ 1, they need to start with the same letter (thus ā = a and b̄ = b) and have the same number of alternations (m = n), which makes the pi align with the qi; and since there are infinitely many Γ ≥ c, the pi agree with the qi at infinitely many positions, and therefore pi ≡ qi:

Theorem 7.28 (List-of-monomials equivalence). With the usual symbols:

    a^{p1} b^{p2} · · ·^{pn} ≡_{≥c} ā^{q1} b̄^{q2} · · ·^{qm} for a c ≥ 1   ⇐⇒   a = ā ∧ b = b̄ ∧ n = m ∧ pi ≡ qi for all i
Proof. "⇐" is clear; "⇒": see the explanation above.
Note that once M1 ≡_{≥c} M2 for some c, then M1 ≡_{≥c′} M2 for all those c′ for which all small polynomials evaluate to a positive number. During all our transformations we guarantee that the small polynomials are positive for Γ > 1.
7.5.7 Towards Distributive Alternating Normalform
Given two lists of monomials M1 and M2 as defined above, let us for ease of notation call expressions of this type limos and use Mi to denote them from now on. We now want to compare M1^P with M2^Q for some big polynomials P and Q. Our goal is to characterize for which Mi the equivalence of M1^P and M2^Q implies that P ≡ Q and M1 ≡ M2. We will see that this is the case iff we cannot distribute the big polynomials over the Mi, that is, M1^P is not equivalent to M1′^P M1″^P for some M1′ M1″ = M1, and similarly for M2. We now make these notions more precise:

Definition 7.29 (Distributively minimal). For a limo M := a^{p1} b^{p2} · · ·^{pn} and a big polynomial P, M^P is distributively minimal iff there does not exist a j such that M = M1 M2 with M1 = a^{p1} b^{p2} · · ·^{pj} and M2 = d^{pj+1} · · ·^{pn}, and M^P ≡_{≥c} M1^P M2^P for a c ≥ 1.

Checking distributive minimality for limos. The following result allows for a simple procedure to check distributive minimality for a limo M.

Theorem 7.30 (Non-distributively-minimal limos have a repeating core). For a limo M := M1 · M2 with (M1 M2)^P in alt-NF, and 2 ≤ c ∈ N:

    (M1 M2)^P ≡_{≥c} M1^P M2^P   iff   ∃ M3 with M1 = (M3)^k and M2 = (M3)^l for k, l ∈ N        (7.20)

Proof. "⇐" is easy, since M1 M2 ≡_{≥c} M2 M1. "⇒" via a case distinction as follows:

Case alts(M1) = alts(M2): Choose infinitely many ci with Γ ≡ ci ≥ c; therefore Γ(P) ≥ 2 for all these infinitely many Γ. For all those Γ, we have

    M1 M2 M1 M2 · · · M1 M2   ≡_{≥c}   M1 M1 · · · M1  M2 M2 · · · M2        (7.21)

with 2Γ(P) many M groups on the left, and Γ(P) many M1 followed by Γ(P) many M2 on the right. Since alts(M1) = alts(M2) and (M1 M2)^P is in alt-NF, the first M2 on the left and the second M1 on the right side align perfectly with each other (for all of the infinitely many Γ). But then the monomials inside M1 and M2 also agree with each other for infinitely many valuations, which requires them to be equal; thus M1 = M2, and setting M3 := M1 proves the claim.

Case k · alts(M1) = alts(M2): With a large enough d ≥ c and infinitely many Γ ≥ d, an argument similar to the previous case aligns the first M2 on the left with k many M1 on the right; with the same arguments as above, M2 = (M1)^k, which shows the claim.

Case k · alts(M2) = alts(M1): Analogous to the previous case, aligning from the right side.

Case alts(M1) ≠ alts(M2) and gcd 1 or 2: Let mi := alts(Mi). If the greatest common divisor of m1 and m2 is 1, then at least one of them has to be an odd number. But then M1^P M2^P is not in alt-NF, and because (M1 M2)^P is in alt-NF, alt(Γ(M1^P M2^P)) < alt(Γ((M1 M2)^P)) for Γ ≡ c; consequently, the premise is false and there is nothing to show. Now consider the case that the gcd of m1 and m2 is 2. Then let A1 B2 · · · A_{m1−1} B_{m1} := M1, i.e., M1 has m1/2 groups comprising two monomials each. Similarly, M2 has m2/2 groups: let A′1 B′2 · · · A′_{m2−1} B′_{m2} := M2. Now consider those Γ for which Γ ≥ c and Γ(P) > 2(m1 + m2 + 42); obviously, there are infinitely many such Γ. We will now show that for all of these Γ and for all i and j, the monomial Ai aligns with the monomial A′j, and similarly for the B monomials: each B in M1 will match up with each of the B′ monomials in M2. Since there are infinitely many Γ in which they agree with each other, they actually have to have the same small polynomial! Therefore, M1 = (A1 B1)^{m1/2} and M2 = (A1 B1)^{m2/2}, q.e.d. To see that they indeed match up with each other, imagine (7.21) with no M1 on the left side and many, many M1 on the right side. Clearly, the groups would then be aligned after m1 placements of M2 or m2 placements of M1 (and not earlier, since the gcd of m1/2 and m2/2 is 1). In the alignments from then on, every single displacement of the groups will occur (gcd = 1). Adding in the M1 on the left side only displaces the ending position of the M2 by exactly one m1, and thus does not change the pattern but only requires more M1 on the right side. Since we place one M2 for each M1, we need around m1 more of the M1 on the right side, totaling to about m1 + m2 of the M1's needed; so with Γ(P) > 2(m1 + m2 + 42) we are on the safe side.

Case alts(M1) ≠ alts(M2), gcd > 2, and mi ≠ k·mj: This case can be proven analogously to the previous case, with the only difference that the group size for the groups of monomials that are aligned with each other is larger; in fact, it equals gcd(m1, m2). Since gcd(m1, m2) < m1 and gcd(m1, m2) < m2 for mi ≠ k·mj, we are done.

Algorithm for "Is M = (M′)^k for some k > 1?". Note that we only consider M that are in alt-NF. For each pair of small polynomials pi and pj in M, test whether they are equivalent; then build the small-polynomial-characteristic strings s0 and s1 as follows: For s0, map each 0^{pi} to a new letter C0(0^{pi}) from a new alphabet Σ′ such that C0(0^{pi}) = C0(0^{pj}) iff pi ≡ pj, and map each 1^{pi} to the empty string C0(1^{pi}) = ε; then s0 := C0(M). For s1, map the 1^{pi} to letters and remove the 0^{pi}. Then try, for all divisors k of ½ alts(M), whether s0 = (s′)^k for some string s′ over Σ′ and s1 = (s″)^k for the same k and some s″ over Σ′. If such a k is found, answer YES, else NO.

Corollary 7.31. For M^P, we can check whether it is distributively minimal; if it is not, we can distribute the P onto the smallest repeating group Mg and equivalently transform M^P into Mg^P · · · Mg^P. Then Mg will be distributively minimal.
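The final step of the algorithm above, testing whether a characteristic string is a non-trivial power of a shorter string, can be sketched in a few lines of C++; the function names are illustrative, and s0 and s1 are assumed to contain one letter per group of M.

    #include <string>

    // True iff s consists of exactly k copies of its prefix of length |s|/k.
    bool isKthPower(const std::string& s, size_t k) {
      if (k == 0 || s.empty() || s.size() % k != 0) return false;
      const size_t len = s.size() / k;
      for (size_t i = len; i < s.size(); ++i)
        if (s[i] != s[i % len]) return false;
      return true;
    }

    // Sketch of the overall test "Is M = (M')^k for some k > 1?": try every k > 1 that
    // divides the number of groups and require both characteristic strings to be k-th powers.
    bool limoIsNontrivialPower(const std::string& s0, const std::string& s1, size_t groups) {
      for (size_t k = 2; k <= groups; ++k)
        if (groups % k == 0 && isKthPower(s0, k) && isKthPower(s1, k))
          return true;
      return false;
    }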
7.5.8 Deciding M1^P ≡_{≥c} M2^Q for dist-minimal Mi

Theorem 7.32. For two distributively minimal limos M1 and M2 and two big polynomials P and Q, the following holds for all c ≥ 2:

    M1^P ≡_{≥c} M2^Q   ⇐⇒   P ≡ Q ∧ M1 ≡_{≥c} M2

Proof. "⇐" is trivial. "⇒": Case distinction on the relationship between m1 := alts(M1) and m2 := alts(M2).
Case m1 = m2: Then clearly M1 ≡_{≥c} M2, since their groups are perfectly aligned; further, P and Q agree on infinitely many Γ and thus have to be equivalent.
Case m1 = k·m2 for a 1 < k ∈ N: Similar to the proof above; now M1 lines up with exactly k many copies of M2 for infinitely many Γ. Thus M1 = M2^k, which contradicts M1 being distributively minimal.
Case k·m1 = m2: Analogous to the previous case.
Case l·m1 = k·m2 for 1 < l, k ∈ N: Also similar to the proof above; now M1 and M2 need to be composed of some repeating core Mj, since the cores align with each other for infinitely many Γ. Thus M1 is not distributively minimal, a contradiction.
7.5.9 Summary of Findings and Future Steps
We proposed steps to transform string polynomials into an alternating normal form with big polynomials as exponents. Aligning the substituted variables with each other and then checking syntactic equivalence of two polynomials in this normal form already provides a sound, but not precise, algorithm for checking their equivalence; in other words, their syntactic equivalence is a sufficient but not necessary condition for string-equivalence. We have also presented several ideas on how to develop procedures that err on the other side, i.e., that yield necessary conditions:
• One test is to imagine the concatenation operator to be commutative, and thus to test for equivalence of multivariate polynomials with integer coefficients. Clearly, if the string polynomials are equivalent, so are the "commutatively relaxed" polynomials.
• Another procedure is to compare the polynomials that compute the number of alternations for an alphabet of size two. Clearly, this is also a sound check.
• Yet another check would be to replace all variables by "*" and compare the resulting regular expressions with each other. This is clearly also a sound procedure.
We conjecture that SPE is decidable. We further believe that merging neighboring, identical limos into one limo with a merged big polynomial, and applying this process bottom-up to more complex string polynomials as well, will actually yield a precise decision procedure for SPE. For this we would need to design another normal form in which neighboring groups are not equivalent (otherwise they would have been merged). We would then need to prove that, for this normal form, two polynomials are equivalent only if they have the same syntactic form. Once we have found a decision procedure for SPE, we can then add the other constructors of pv-types, i.e., annotation suffixes and or-types. We hope that handling these is orthogonal to the SPE problem.
7.6 Undecidability of Value-Difference for PV-Types
In this section we show that for two XML pv-types τ1 and τ2, it is undecidable whether they represent different values under all valuations. Besides being an interesting result by itself, it also motivates our later proof of the undecidability of equivalence for XQ with a deep equality operator. Before we continue, we quickly define a symbol ≶ for this relation over pv-types:

Definition 7.33 (Value-difference). Two pv-types τ1 and τ2 are (always-)value-different (in symbols τ1 ≶ τ2) iff [[τ1]]_v ≠ [[τ2]]_v for all v that define τ1 and τ2.

We now show that the value-difference relation is undecidable for general pv-types. We show this by reducing the problem of solving Diophantine equations with integer coefficients and variables ranging over the natural numbers to the question whether two pv-types are value-different.
Solving a Diophantine equation means finding integer solutions for a polynomial equation P(x1, x2, . . . , xn) = 0 with integer coefficients. The decision problem is to answer the question whether a solution exists. From the well-known fact that the decision problem for Diophantine equations over the integers (IntD) is undecidable [Mat93], it is fairly easy to see that the decision problem for Diophantine equations restricted to solutions in the natural numbers (NatD) is undecidable as well.

Theorem 7.34. NatD is undecidable.^5

Proof. Assume there is a decider for NatD. We can then construct a decider for IntD as follows: Given a Diophantine equation E with variables x1, . . . , xn, consider the set E of 2^n equations resulting from E by replacing each xi by either yi or −yi, for all n variables. We now test whether any of these equations has a solution in the natural numbers via NatD; if so, E has an integer solution, else E does not. Proof: Assume E has an integer solution x1, . . . , xn. Then the equation in which exactly those variables xi with xi < 0 are replaced by −yi has a solution in the natural numbers. Furthermore, if any of the equations in E has a solution y1, . . . , yn in the natural numbers, then we can construct an integer solution to E by setting xi := yi if xi was replaced by yi, and xi := −yi if xi was replaced by −yi.

Theorem 7.35. τ1 ≶ τ2 is undecidable for our pv-types.

Proof. NatD can be reduced to deciding τ1 ≶ τ2. We now show how to encode an instance of NatD into the decision problem of whether two types are always-value-different. The general idea is to represent the number 1 by the type a[]^6, addition by concatenation of types, and multiplication with a variable x by a star-type colored with x. In particular, any Diophantine equation E : P(x1, . . . , xn) = 0 can be transformed into an equation P1(x1, . . . , xn) = P2(x1, . . . , xn) such that each Pi is a sum of products with only positive coefficients. We now create τ1 and τ2 for P1 and P2, respectively. Each product in Pi is of the form c · xj · · · xk. We create types for each product by encoding the constant c as a sequence of c copies of the type a[], and each multiplication by xj via a star type (..)^{xj}. As an example, we would encode 2xy^2 as (((a[], a[])^x)^y)^y. The summation of products is encoded as concatenation, so x + y becomes (a[])^x, (a[])^y. It is now easy to see that if E has a solution in the natural numbers, then τ1 ≶ τ2 does not hold; and conversely, if τ1 ≶ τ2, then E has no solution in the natural numbers.

5 This is a known result; a different proof appears in Matiyasevich's book [Mat93].
6 We use a[] as shorthand for a[()].
Remark. Intuitively, in any programming language with a (standard) if-then-else statement, if we want to make non-trivial statements about program output or program behavior, we need to be able to predict, for an arbitrary program p, whether both branches of an if-then-else statement in p are taken or only one of them when p is executed on the set of considered values. Otherwise, we could just use such an "undecidable" if-statement to switch between a sub-program p1 that clearly causes the property and another one p2 that clearly causes p to not satisfy the property:

    p := build x; if "undecidable(x)" then p1 else p2

Such non-trivial properties could, for example, be "does p ever output a?", "does p always output a?", or "does p ever execute an ill-defined statement?". The second question, for example, is a specific case of program equivalence, and not being able to decide the third one prevents one from creating a sound and complete check for a "well-definedness" property as described in [Van05, dBGV04]. If the language allows comparing arbitrary values, then not being able to decide τ1 ≶ τ2 or τ1 ≡ τ2 is a good indication that an innocent-looking deepEQ($x, $y) operator can cause an unpredictable if-then-else statement. However, the undecidability of τ1 ≶ τ2 or τ1 ≡ τ2 does not necessarily imply the existence of an unpredictable if-statement: the type system itself might be too complex for the target language. For example, consider a type system that is able to describe sets of values that
cannot be created in the program. Comparing these (over-specifying) types can then be very hard even if the language itself is well analyzable. It is thus often easier to take the undecidability of τ1 ≶ τ2 or τ1 ≡ τ2 as inspiration for creating a class of programs for which it can be shown that they are not analyzable, which is what we do next.
7.7 Undecidability of Query-Equivalence for XQdeep-EQ
The idea of encoding Diophantine equations can easily be adapted to show that query equivalence for XQdeep-EQ is undecidable, leading to the following theorem:

Theorem 7.36. Query equivalence for XQdeep-EQ is undecidable.

Proof. Indirect; assume we could decide equivalence for XQdeep-EQ. To solve an arbitrary but fixed Diophantine equation in natural numbers with positive coefficients, E : P1(x1, . . . , xn) = P2(x1, . . . , xn), we can construct XQ programs p1 and p2 that represent the two sides of the equation, similarly to the previous section. As an example, consider how the following statements simulate P1 = 2xy^2 + y:

    let $p1 = (
      let $x  = ( for $xh in $root :: x return a[] ) in
      let $y  = ( for $yh in $root :: y return a[] ) in
      let $s1 = ( for $i in $x return
                    for $j in $y return
                      for $k in $y return a[], a[] ) in
      let $s2 = $y in
      $s1, $s2
    ) in ...

The input $root ↦ x[], x[], x[], y[], y[], for example, would represent the case x = 3 and
y = 2. With a similar sub-program for P2 we can now build two queries:

    q1 := let $p1 = . . . in (
            let $p2 = . . . in (
              if deepEQ($p1, $p2) then "a" else () ))

    q2 := ()

Clearly, E has a solution iff q1 and q2 are not equivalent.
7.8 Related Work
Related to the work on string polynomials is the work of Bogdanov et al. [BW05] and Raz et al. [RS05], which considers noncommutative polynomial identity testing. However, it is not clear to us how the models used in these works can be applied to solve SPE. Andrej Bogdanov mentioned in personal communication that he does not think that the problem of SPE suggested here has been considered before.
The remainder of this related work section discusses work related to XML processing. Colazzo et al. [CGMS04] describe a sound and complete type system for "path-correctness" of XML queries. That is, their method can statically decide whether an XQuery sub-query will create a non-empty result set for some input to the whole query. Their type language, supporting recursion with minor restrictions, is as expressive as regular tree languages and thus powerful enough to capture possibly recursive DTDs or XML Schemas. Queries are for-let-return queries with the child and descendant-or-self axes for navigation. The data model is equivalent to ours, i.e., lists of ordered, labeled trees in which leaves from a base data type are allowed. We plan on extending their results in the following way: being able to solve query equivalence subsumes path-correctness, since being equivalent to the query "()" essentially amounts to the question of path correctness. For non-recursive types, our work is thus strictly more general.
Kepser describes a "simple proof" of the Turing-completeness of a more expressive fragment of XQuery in [Kep04b], using the fragment's capability of defining recursive functions and XPath's capability of doing integer arithmetic. Our result is orthogonal, as we analyze the core of XQuery with the result that, even without functions and without integer arithmetic, Diophantine equations can be encoded, causing the fragment XQdeep-EQ to be "not analyzable".
Vansummeren [Van05] analyzes well-definedness for XQuery fragments under a depth-bounded type system. Well-definedness is closely related to query equivalence (for both, if-statements have to be predictable), and thus his work can be seen as complementary to ours, since we consider the problem of query equivalence and present a different approach (pv-typing) to our positive and negative results. It is noteworthy that [Van05] mentions that XQuery is not analyzable if the language contains + and × as base operations to modify atoms, because of a possible reduction from Diophantine equations. In our work, we show that Diophantine equations can be encoded into core XQuery if a deep-equality operator is allowed, even without explicit operations on base values. In [Van07], he characterizes for which base operations well-definedness is decidable for XQuery; in particular, these are the monotone base operations.
Hidders et al. [HPV06] study the expressive power of XQuery fragments. Our work is complementary to theirs, since we consider equivalence of an, admittedly, very small fragment of XQuery. In recent work, DeHaan [DeH09] studies the equivalence of nested queries with mixed semantics. He shows how to decide equivalence under several unordered data models by encoding nested relations into flat relations. It is not clear how this approach could be ported to an ordered data model. Work that also considers containment and equivalence of queries returning nested objects is Levy and Suciu's [LS97]; again, this mainly focuses on unordered data models. There is also work on XPath containment and equivalence [Woo03, MS04, Sch04]; however, these works do not consider for-loops.
Recent work by Cate et al. [CL09] studies query containment for XPath 2.0 fragments (with for-loops); however, the semantics of XPath expressions is there defined as a relation over sets of nodes, which differs from the XQuery semantics, which returns labeled, ordered trees.
Buneman et al. use colors in [BCV08] to track individual data values through complex operations. Besides being applied to the nested relational algebra instead of to ordered, labeled trees, their approach and ours also differ significantly in goals and methods. Buneman et al. propose a coloring scheme with propagation rules to automatically track provenance, i.e., the origin of data, while we use colors to statically analyze queries, namely to check query equivalence. Consequently, Buneman et al. color data values instead of types. Issues that arise in our approach due to star-types and for-loops do not occur if coloring is performed at the value level. Coloring at the value level, however, does not allow one to decide query equivalence: while the authors note in Lemma 1 that two functions f : s → t and g : s → t are equivalent if f(v) = g(v) for all distinctly colored v ∈ s, they do not provide a decision procedure to check this premise. Using colors only at the value level, we would need to check all possible values; our approach performs this check on the type level via a symbolic simulation of many values at once. A second difference between provenance and checking for value equality is that data origin matters for the former but not for the latter. From a provenance point of view, there is a difference between grabbing an element X from the input and creating a new X from scratch. In contrast, when comparing two queries for equivalence, we are interested in their input/output behavior; that is, it does not matter how or from where the output was constructed, as long as it is a specific value.
7.9 Summary
In this chapter, we introduced the concept of possible-value types for semantic simulation. In contrast to conventional, set-semantic types, which denote a set of values, pv-types denote a function from a set of possible worlds to values. For a concrete set of pv-types, suited for
XML processing, we showed that it is undecidable whether two pv-types denote the same value in all worlds, while it is decidable whether they are the same. The negative result translates to XQuery, where we showed that once a deep equality operator is allowed, query equivalence is no longer decidable. We furthermore adapted the concept of sound and precise typing from conventional set-semantic typing and showed how the problem of query equivalence (with input restrictions) can be reduced to type questions. We introduced the problem of string-polynomial equality, which lies at the core of pv-type equivalence. Here, we further proposed several normal forms that provide sound (but not complete) tests for string-polynomial equivalence, as well as several other insights resulting in decidable necessary conditions.
Chapter 8
Concluding Remarks

Men love to wonder, and that is the seed of science.
Ralph Waldo Emerson
This dissertation considers the problem of designing and optimizing data-driven workflows, in particular dataflow-oriented scientific workflows. The main challenges arise from the heterogeneity of existing algorithms, libraries and tools, their computational complexities, the large amounts of inter-related data, and the exploratory nature of the scientific process. This problem involves three main areas of computer science: finding the right process for scientists to build these systems is a software engineering challenge; designing a language for workflow specification together with appropriate methods for its static analysis lies in the realm of programming languages; and inventing domain-specific query languages and data models that are suited for efficient execution is at the heart of database research.
Virtual data assembly lines, our proposed paradigm for building such scientific workflows, combine existing ideas from dataflow networks with principles of XML data processing. Virtual assembly lines deploy a tree-based data model, with an "assembly line" of existing tools that interact with the data via XQuery-based configurations. We showed that this approach solves many design problems that are common to scientific workflows and arise in dataflow-oriented modeling approaches. We further demonstrated how static analysis can be used to support the design process, and to guide the scientist during workflow construction, maintenance and evolution. Moreover, we showed how VDAL workflows can be parallelized and how static analysis can be used to compile VDAL workflows into equivalent dataflow networks that exhibit more efficient data routing. We also developed a type system for XQuery with an ordered data model that is more precise than existing ones, and showed that query equivalence reduces to type equivalence. Type equivalence inspires theoretical questions about the equality of polynomials over strings with multiplications. This structure exhibits a non-commutative "+" operation together with a scalar multiplication, with the scalars forming a ring. Here, we made progress towards solving this problem by constructing a normal form that allows a sound approximation.
Future Work

This work opens many opportunities for future research and development, ranging from solving open theoretical questions and investigating further aspects of workflow design support and resilience, to evaluating new strategies for efficient execution and providing a better integration with Kepler. We now detail each of these points.

Basic theoretical research. As a first milestone, we want to either provide a sound and complete procedure to decide string-polynomial equality, or prove its undecidability. From there, it is still a long way to deciding pv-type equivalence: the first step would be to add types with annotation suffixes; it is then interesting to investigate how the different conditional tests (emptiness and base-value equivalence) affect pv-type equivalence.

Workflow design support and resilience. Based on an exact type system for VDAL workflows, the use cases from Chapter 2 can be reconsidered and addressed more precisely than we did here. We will further investigate the use cases that have not been addressed in this dissertation: creating a schema-level provenance graph, and a canonical input schema with sample instance data. It is also interesting to show how all use cases can be solved in an unordered data model; recent work on static analysis of XQuery over unordered data [DeH09] looks very promising as a foundation here. Other interesting questions to be addressed are the ones related to workflow resilience outlined at the end of Chapter 2.

Alternative VDAL execution strategies. Besides the data-parallel MapReduce strategy and the pipeline-parallel strategy, other approaches should be investigated. It would, for example, be interesting to create DAGMan [Dag02] models to execute VDAL workflows. Since the intermediate data products are dynamically generated during workflow execution, the DAGMan model would need to be extended while it is running. A completely different strategy that is worth benchmarking is to store the VDAL data as "shredded relations" in a standard relational database; a VDAL workflow could then be compiled down to SQL. With horizontal fragmentation, this approach could also lead to acceptable performance.

Implementation. Our light-weight PPN engine as well as our Hadoop-based MapReduce implementation are research prototypes. In our ongoing work, we would like to improve their respective implementations and make them available to a broader user base.
Bibliography [ABB+ 03]
Ilkay Altintas, Sangeeta Bhagwanani, David Buttler, Sandeep Chandra, Zhengang Cheng, Matthew Coleman, Terence Critchlow, Amarnath Gupta, Wei Han, Ling Liu, Bertram Lud¨ascher, Calton Pu, Reagan Moore, Arie Shoshani, and Mladen A. Vouk. A Modeling and Execution Environment for Distributed Scientific Workflows. In SSDBM, pages 247–250, 2003. xi, 8, 9
[ABC+ 03]
Serge Abiteboul, Angela Bonifati, Gr´egory Cob´ena, Ioana Manolescu, and Tova Milo. Dynamic XML documents with distribution and replication. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 527–538, New York, NY, USA, 2003. ACM Press. 93
[ABE+ 09]
David Abramson, Blair Bethwaite, Colin Enticott, Slavisa Garic, and Tom Peachey. Parameter Space Exploration Using Scientific Workflows. In Intl. Conf. on Computational Science, LNCS 5544, pages 104–113, 2009. 57
[ABJF06]
Ilkay Altintas, Oscar Barney, and Efrat Jaeger-Frank. Provenance Collection Support in the Kepler Scientific Workflow System. In Intl. Provenance and Annotation Workshop (IPAW), pages 118–132. 2006. 13
[ABL09]
Manish Kumar Anand, Shawn Bowers, and Bertram Lud¨ascher. A navigation model for exploring scientific workflow provenance graphs. In Deelman and Taylor [DT09]. 13
[ABML09]
Manish Kumar Anand, Shawn Bowers, Timothy M. McPhillips, and Bertram Lud¨ ascher. Efficient provenance storage over nested data collections. In Martin L. Kersten, Boris Novikov, Jens Teubner, Vladimir Polutin, and Stefan Manegold, editors, EDBT, volume 360 of ACM International Conference Proceeding Series, pages 958–969. ACM, 2009. 4, 13
[AJB+ 04]
Ilkay Altintas, Efrat Jaeger, Chad Berkley, Matthew Jones, Bertram Lud¨ ascher, and Steve Mock. Kepler: An Extensible System for Design and Execution of Scientific Workflows. In 16th Intl. Conf. on Scientific and Statistical Database Management (SSDBM), pages 423–424, Santorini, Greece, 2004. 14
204 [AvLH+ 04] K. Amin, G. von Laszewski, M. Hategan, NJ Zaluzec, S. Hampton, and A. Rossi. GridAnt: A Client-controllable Grid Workflow System. System Sciences, 2004. Proceedings of the 37th Annual Hawaii International Conference on, pages 210–219, 2004. 134 [BBD+ 02]
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. Proceedings of the twenty-first ACM SIGMODSIGACT-SIGART symposium on principles of database systems, pages 1–16, 2002. 118
[BBMS05]
Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Mike Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, June 2005. 93
[BCC+ 05]
Louis Bavoil, Steven P. Callahan, Patricia J. Crossno, Juliana Freire, Carlos E. Scheidegger, Claudio T. Silva, and Huy T. Vo. VisTrails: Enabling Interactive Multiple-View Visualizations. In Proceedings of IEEE Visualization, pages 135–142, Minneapolis, Oct 2005. 11, 57
[BCF03]
V´eronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: an XMLcentric general-purpose language. In Intl. Conf. on Functional Programming (ICFP), pages 51–63, New York, NY, USA, 2003. 66, 148, 154
[BCG+ 03]
C. Barton, P. Charles, D. Goyal, M. Raghavachari, M. Fontoura, and V. Josifovski. Streaming XPath Processing with Forward and Backward Axes. In Proceedings of the International Conference on Data Engineering, pages 455– 466. IEEE Computer Society Press; 1998, 2003. 71
[BCV08]
Peter Buneman, James Cheney, and Stijn Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Transactions on Database Systems (TODS), 33(4):28, 2008. 198
[Bio09]
Bioperl tutorial. http://www.bioperl.org/wiki/Bptutorial.pl, 2009. 5
[BKC+ 01]
M.D. Beynon, T. Kurc, U. Catalyurek, C. Chang, A. Sussman, and J. Saltz. Distributed processing of very large datasets with DataCutter. Parallel Computing, 27(11):1457–1478, 2001. 134
[BKW98]
A. Bruggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 142(2):182–206, 1998. 127
[BL04]
Shawn Bowers and Bertram Lud¨ascher. An Ontology Driven Framework for Data Transformation in Scientific Workflows. In International Workshop on Data Integration in the Life Sciences (DILS), LNCS 2994, pages 25–26, Leipzig, Germany, March 2004. 13, 57
205 [BL05]
Shawn Bowers and Bertram Lud¨ascher. Actor-Oriented Design of Scientific Workflows. In 24st Intl. Conference on Conceptual Modeling (ER), LNCS, Klagenfurt, Austria, October 2005. Springer. 57
[BLL+ 08]
Christopher Brooks, Edward A. Lee, Xiaojun Liu, Stephen Neuendorffer, Yang Zhao, and Haiyang Zheng. Heterogeneous Concurrent Modeling and Design in Java (Volume 1: Introduction to Ptolemy II). Technical Report No. UCB/EECS-2008-28, April 2008. 12
[BLNC06]
Shawn Bowers, Bertram Lud¨ascher, Anne H.H. Ngu, and Terence Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. In Post-ICDE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), Atlanta, GA, April 2006. 38
[BML+ 06]
Shawn Bowers, Timothy McPhillips, Bertram Lud¨ascher, Shirley Cohen, and Susan B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. In Intl. Provenance and Annotation Workshop (IPAW), pages 133–147. 2006. 13
[BML08]
S. Bowers, T.M. McPhillips, and B. Lud¨ascher. Provenance in collectionoriented scientific workflows. Concurrency and Computation: Practice & Experience, 20(5):519–529, 2008. 13
[BMN02]
G.J. Bex, S. Maneth, and F. Neven. A formal model for an expressive fragment of XSLT. Information Systems, 27(1):21–39, 2002. 148
[BMR+ 08]
Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Lud¨ ascher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. In Intl. Provenance and Annotation Workshop (IPAW), 2008. 33
[BNdB04]
Geert Jan Bex, Frank Neven, and Jan Van den Bussche. DTDs versus XML Schema: A Practical Study. In WebDB, pages 79–84, 2004. 98, 159
[Boo]
Boost C++ Libraries. http://www.boost.org/. 121
[Bor07]
Dhruba Borthakur. The Hadoop Distributed File System: Architecture and Design. Apache Software Foundation, 2007. http://svn.apache.org/repos/ asf/hadoop/core/tags/release-0.15.3/docs/hdfs_design.pdf. 64, 73, 85
[BPSM+ 08] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and Fran¸cois Yergeau. Extensible markup language (xml) 1.0 (fifth edition), November 2008. W3C Recommendation. http://www.w3.org/TR/2008/REC-xml-20081126/. 99 [BW05]
Andrej Bogdanov and Hoeteck Wee. More on noncommutative polynomial identity testing. In Proceedings of the 20th Annual IEEE Conference on Computational Complexity, pages 92–99. Citeseer, 2005. 196
206 [CBB+ 03]
M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik. Scalable Distributed Stream Processing. CIDR Conference, 2003. 118
[CCD+ 03]
Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR’03), Asilomar, CA, January 2003. 93, 118
[CDG+ 97]
H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available at http://www.grappa.univ-lille3.fr/tata, 1997. Release of October 1, 2002. 104
[CDTW00]
Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 379–390, Dallas, Texas, USA, 2000. ACM Press. 93, 118
[CDZ06]
Yi Chen, Susan B. Davidson, and Yifeng Zheng. An Efficient XPath Query Processor for XML Streams. In Intl. Conf. on Data Engineering (ICDE), page 79, 2006. 93, 118
[CGMS04]
D. Colazzo, G. Ghelli, P. Manghi, and C. Sartiani. Types for path correctness of XML queries. In Proceedings of the ninth ACM SIGPLAN international conference on Functional programming, pages 126–137. ACM New York, NY, USA, 2004. 142, 196
[CGMS06]
Dario Colazzo, Giorgio Ghelli, Paolo Manghi, and Carlo Sartiani. Static analysis for path correctness of XML queries. J. Funct. Program., 16(4-5):621–661, 2006. 142, 146, 148, 156, 159
[Che08]
James Cheney. FLUX: functional updates for XML. In ICFP ’08: Proceedings of the 13th ACM SIGPLAN international conference on Functional programming, pages 3–14, New York, NY, USA, 2008. ACM. xi, 48, 49, 141, 142, 146, 148, 159
[Che09]
J. Cheney. Provenance, XML, and the Scientific Web. In ACM SIGPLAN Workshop on Programming Language Technology and XML (PLAN-X 2009), 2009. Invited paper. 153, 155
[Cho02]
Byron Choi. What are real DTDs like? In WebDB, pages 43–48, 2002. 159
[CKR+ 07]
Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber, and Kent Wenger. Workflow Management in Condor, pages 357–375. In Taylor et al. [TDGS07], 2007. 11
[CL09]
B. ten Cate and C. Lutz. The complexity of query containment in expressive fragments of XPath 2.0. Journal of the ACM (JACM), 56(6):31, 2009. 198
[CLL06]
R. Chirkova, C. Li, and J. Li. Answering queries using materialized views with minimum size. The VLDB Journal, 15(3):191–210, 2006. 119
[CPE]
Center for plasma edge simulation. http://www.cims.nyu.edu/cpes/. 4
[Dag02]
The directed acyclic graph manager (DAGMan), 2002. http://www.cs.wisc.edu/condor/dagman/. 202
[DBE+ 07]
Susan B. Davidson, Sarah Cohen Boulakia, Anat Eyal, Bertram Ludäscher, Timothy M. McPhillips, Shawn Bowers, Manish Kumar Anand, and Juliana Freire. Provenance in Scientific Workflow Systems. IEEE Data Engineering Bulletin, 30(4):44–50, 2007. 8
[dBGV04]
Jan Van den Bussche, Dirk Van Gucht, and Stijn Vansummeren. Well-Definedness and Semantic Type-Checking in the Nested Relational Calculus and XQuery. CoRR, cs.DB/0406060, 2004. 194
[Dee05]
E. Deelman. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3):219–237, 2005. 8, 57, 91
[DeH09]
David DeHaan. Equivalence of nested queries with mixed semantics. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 207–216. ACM, 2009. 153, 197, 202
[DG08]
Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. 22, 62, 92, 93, 94
[DGST08]
Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, In Press, 2008. 7
[DGST09]
Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Gen. Computer Systems, 25(5):528–540, 2009. 11
[DT09]
Ewa Deelman and Ian Taylor, editors. Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS 2009, November 16, 2009, Portland, Oregon, USA. ACM, 2009. 203, 215
[EJ03]
J. Eker and J.W. Janneck. CAL Language Report: Specification of the Cal Actor Language. Technical Report UCB/ERL M03/48, EECS Department, University of California, Berkeley, 2003. 57
[Fel04]
J. Felsenstein. PHYLIP (phylogeny inference package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, 2004. 32
[FG06]
Geoffrey C. Fox and Dennis Gannon, editors. Concurrency and Computation: Practice & Experience, Special Issue: Workflow in Grid Systems, volume 18(10). Wiley, August 2006. 7, 210
[FJM+ 07]
Mary F. Fernández, Trevor Jim, Kristi Morton, Nicola Onose, and Jérôme Siméon. Highly distributed XQuery with DXQ. In Proc. ACM SIGMOD, pages 1159–1161, 2007. 93
[FLL09]
X. Fei, S. Lu, and C. Lin. A MapReduce-Enabled Scientific Workflow Composition Framework. In IEEE International Conference on Web Services (ICWS), pages 663–670, 2009. 92
[FMJG+ 05]
R.A. Ferreira, W. Meira Jr., D. Guedes, L.M.A. Drummond, B. Coutinho, G. Teodoro, T. Tavares, R. Araujo, and G.T. Ferreira. Anthill: A Scalable Run-Time Environment for Data Mining Applications. In Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing, pages 159–167, 2005. 134
[FPD+ 05]
T. Fahringer, R. Prodan, R. Duan, F. Nerieri, S. Podlipnig, J. Qin, M. Siddiqui, H.L. Truong, A. Villazon, and M. Wieczorek. ASKALON: A Grid Application Development and Computing Environment. International Workshop on Grid Computing, pages 122–131, 2005. 91
[FSC+ 03]
Mary F. Fernández, Jérôme Siméon, Byron Choi, Amélie Marian, and Gargi Sur. Implementing XQuery 1.0: The Galax Experience. In VLDB, pages 1077–1080, 2003. 147
[FSC+ 06]
Juliana Freire, Claudio Silva, Steven Callahan, Emanuele Santos, Carlos Scheidegger, and Huy Vo. Managing Rapidly-Evolving Scientific Workflows. In Intl. Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10– 18, 2006. 11, 57
[FSW01]
M. Fernandez, J. Simeon, and P. Wadler. A semi-monad for semi-structured data. In Proceedings of the 8th International Conference on Database Theory, pages 263–300. Springer, 2001. 159
[GDR07]
C.A. Goble and D.C. De Roure. myExperiment: social networking for workflow-using e-scientists. In Proceedings of the 2nd workshop on Workflows in support of large-scale science, page 2. ACM, 2007. 15
[Gen01]
W. Gentzsch. Sun Grid Engine: Towards Creating a Compute Power Grid. In First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001. Proceedings, pages 35–36, 2001. 85
[GGM+ 04]
Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems (TODS), 29(4):752–788, 2004. 93, 118
[Goo07]
D. J. Goodman. Introduction and evaluation of Martlet: a scientific workflow language for abstracted parallelisation. In International World Wide Web Conference (WWW), pages 983–992, 2007. 92
[GRD+ 07]
Yolanda Gil, Varun Ratnakar, Ewa Deelman, Gaurang Mehta, and Jihie Kim. Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows. In National Conference on Artificial Intelligence, pages 1767–1774, 2007. 57
[GS03]
A.K. Gupta and D. Suciu. Stream Processing of XPath Queries with Predicates. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 419–430. ACM New York, NY, USA, 2003. 71
[HHMW07]
T. Härder, M. Haustein, C. Mathis, and M. Wagner. Node labeling schemes for dynamic XML documents reconsidered. Data & Knowledge Engineering, 60(1):126–149, 2007. 75
[HKS+ 05]
Jan Hidders, Natalia Kwasnikowska, Jacek Sroka, Jerzy Tyszkiewicz, and Jan Van den Bussche. Petri Net + Nested Relational Calculus = Dataflow. In OTM Conferences, LNCS 3760, pages 220–237, 2005. 57
[HPV06]
Jan Hidders, Jan Paredaens, and Roel Vercammen. On the expressive power of XQuery-based update languages. Lecture Notes in Computer Science, 4156:92, 2006. 147, 197
[HS08]
J. Hidders and J. Sroka. Towards a Calculus for Collection-Oriented Scientific Workflows with Side Effects. In Proceedings of the OTM 2008 Confederated International Conferences (CoopIS, DOA, GADA, IS, and ODBASE), On the Move to Meaningful Internet Systems, Part I, page 391. Springer, 2008. 57
[HSL+ 04]
Duncan Hull, Robert Stevens, Phillip Lord, Chris Wroe, and Carole Goble. Treating shimantic web syndrome with ontologies. In First Advanced Knowledge Technologies Workshop on Semantic Web Services (AKT-SWS04), Open University, Milton Keynes, UK., 2004. CEUR-WS.org ISSN:1613-0073. 57, 58
[HT05]
T. Hey and A.E. Trefethen. Cyberinfrastructure for e-Science. Science, 308(5723):817, 2005. 2
[HTT09]
Anthony J. G. Hey, Stewart Tansley, and Kristin M. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. 2
[HVP05]
Haruo Hosoya, Jerome Vouillon, and Benjamin C. Pierce. Regular expression types for XML. ACM Transactions on Programming Languages and Systems (TOPLAS), 27(1):46–90, 2005. 98, 100, 104, 148, 154, 159
[Ima08]
ImageMagick. http://www.imagemagick.org, 2008. 69
[JAZ+ 05]
Efrat Jaeger, Ilkay Altintas, Jianting Zhang, Bertram Ludäscher, Deana Pennington, and William Michener. A Scientific Workflow Approach to Distributed Geospatial Data Processing using Web Services. In 17th Intl. Conference on Scientific and Statistical Database Management (SSDBM), Santa Barbara, California, June 2005. 34
[Kah74]
G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Proc. of the IFIP Congress 74, pages 471–475. North-Holland, 1974. 11, 93, 112
[KBLN04]
Miryung Kim, Lawrence Bergman, Tessa Lau, and David Notkin. An ethnographic study of copy and paste programming practices in OOPL. In International Symposium on Empirical Software Engineering, pages 83–92. Citeseer, 2004. 7
[Kep04a]
Kepler Actors User Manual. http://poc.vl-e.nl/distribution/manual/kepler-1.0.0alpha7/kepler-ActorUserManual.pdf, 2004. 8
[Kep04b]
S. Kepser. A simple proof for the turing-completeness of XSLT and XQuery. In Extreme Markup Languages, 2004. 197
[KLN+ 07]
C. Kamath, B. Ludäscher, J. Nieplocha, S. Parker, R. Ross, N. Samatova, and M. Vouk. SDM Center Technologies for Accelerating Scientific Discoveries. Journal of Physics: Conference Series, 78(012068):1–5, 2007. 14
[KSC+ 08]
D. Koop, C.E. Scheidegger, S.P. Callahan, J. Freire, and C.T. Silva. VisComplete: Automating Suggestions for Visualization Pipelines. IEEE Transactions on Visualization and Computer Graphics, 14(6):1691–1698, 2008. 57
[KSSS04a]
C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. FluXQuery: An Optimizing XQuery Processor for Streaming XML Data. Proc. VLDB 2004, pages 1309–1312, 2004. 93
[KSSS04b]
C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. In 28th Conf. on Very Large Data Bases (VLDB), pages 228–239, 2004. 93, 118
[LAB+ 06]
Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. In Concurrency and Computation: Practice & Experience [FG06], pages 1039–1065. 8, 14, 56
[LAB+ 09]
Bertram Ludäscher, Ilkay Altintas, Shawn Bowers, Julian Cummings, Terence Critchlow, Ewa Deelman, David De Roure, Juliana Freire, Carole Goble, Matthew Jones, Scott Klasky, Timothy McPhillips, Norbert Podhorszki, Claudio Silva, Ian Taylor, and Mladen Vouk. Scientific Process Automation and Workflow Management. In Arie Shoshani and Doron Rotem, editors, Scientific Data Management: Challenges, Existing Technology, and Deployment, Computational Science Series, chapter 13. Chapman & Hall/CRC, 2009. 7, 10, 31
[Läm08]
R. Lämmel. Google’s MapReduce programming model—Revisited. Science of Computer Programming, 70(1):1–30, 2008. 93
[LBM09]
Bertram Ludäscher, Shawn Bowers, and Timothy McPhillips. Scientific Workflows. In Encyclopedia of Database Systems. Springer, 2009. 11
[LC00]
Dongwon Lee and Wesley W. Chu. Comparative analysis of six XML schema languages. SIGMOD Rec., 29(3):76–87, 2000. 98
[LLF+ 09]
Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In IEEE Intl. Conf. on Services Computing, Bangalore, India, 2009. 57
[LMP02]
Bertram Ludäscher, Pratik Mukhopadhyay, and Yannis Papakonstantinou. A Transducer-Based XML Query Processor. In 28th Conf. on Very Large Data Bases (VLDB), pages 227–238, Hong Kong, 2002. 118
[LP95]
Edward A. Lee and Thomas Parks. Dataflow Process Networks. Proceedings of the IEEE, 83(5):773–799, May 1995. 14
[LS97]
Alon Y. Levy and Dan Suciu. Deciding containment for queries with complex objects and aggregations. Proc. of PODS, Tucson, Arizona, 1997. 197
[LSS]
Large Synoptic Survey Telescope (LSST). www.lsst.org. 4, 5
[LSV98]
Edward A. Lee and Alberto L. Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(12):1217–1229, 1998. 17, 57
[Mat93]
Y. Matiyasevich. Hilbert’s 10th Problem. Foundations of Computing Series. The MIT Press, 1993. 193
[MB05]
Timothy M. McPhillips and Shawn Bowers. An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record, 34(3):12–17, 2005. 17, 25, 58
[MBL06]
Timothy McPhillips, Shawn Bowers, and Bertram Ludäscher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. In 3rd Intl. Workshop on Data Integration in the Life Sciences (DILS), LNCS, pages 248–263, European Bioinformatics Institute, Hinxton, UK, July 2006. Springer. xi, 17, 20, 25, 31, 56, 58
[MBZL09]
Timothy McPhillips, Shawn Bowers, Daniel Zinn, and Bertram Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, 2009. 1, 14, 18, 57, 91, 94
[MLMK05]
M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology (TOIT), 5(4):660–704, 2005. 98, 99
[Mor94]
J. Paul Morrison. Flow-Based Programming – A New Approach to Application Development. Van Nostrand Reinhold, 1994. 58
[MS04]
G. Miklau and D. Suciu. Containment and equivalence for a fragment of XPath. Journal of the ACM (JACM), 51(1):2–45, 2004. 197
[MSM97]
D.R. Maddison, D.L. Swofford, and W.P. Maddison. NEXUS: An Extensible File Format for Systematic Information. Systematic Biology, 46(4):590–621, 1997. 14
[Net]
NetCDF (Network Common Data Form). http://www.unidata.ucar.edu/software/netcdf/. 14
[OAF+ 04]
T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045, 2004. 34
[OGA+ 02]
Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: Lessons in Creating a Workflow Environment for the Life Sciences. Concurrency and Computation: Practice & Experience, pages 1067–1100, 2002. 11, 37, 57, 92
[OOP+ 04]
P. O’Neil, E. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 903–908. ACM New York, NY, USA, 2004. 75
[PLK07]
Norbert Podhorszki, Bertram Ludäscher, and Scott Klasky. Workflow Automation for Processing Plasma Fusion Simulation Data. In 2nd Workshop on Workflows in Support of Large-Scale Science (WORKS), June 2007. xi, 10, 11, 14, 38, 69, 106
[Pto06]
Ptolemy II project and system. Department of EECS, UC Berkeley, 2006. http://ptolemy.eecs.berkeley.edu/ptolemyII/. 56, 128
[PVMa]
Parallel Virtual Machine. http://www.csm.ornl.gov/pvm/. 112
[PVMb]
PVM++: A C++-Library for PVM. http://pvm-plus-plus.sourceforge.net/. 121
[QF07]
Jun Qin and Thomas Fahringer. Advanced data flow support for scientific grid workflow applications. In Proceedings of the ACM/IEEE conference on Supercomputing (SC), pages 1–12. ACM, 2007. 92
[RBHS04]
C. Re, J. Brinkley, K. Hinshaw, and D. Suciu. Distributed XQuery. Workshop on Information Integration on the Web, pages 116–121, 2004. 93
[roc]
Rocks clusters. http://www.rocksclusters.org/. 85
[RS05]
R. Raz and A. Shpilka. Deterministic polynomial identity testing in noncommutative models. Computational Complexity, 14(1):1–19, 2005. 196
[SBB+ 02]
J.E. Stajich, D. Block, K. Boulez, S.E. Brenner, S.A. Chervitz, C. Dagdigian, G. Fuellen, J.G.R. Gilbert, I. Korf, H. Lapp, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome research, 12(10):1611, 2002. 5
[Sch04]
T. Schwentick. XPath query containment. ACM SIGMOD Record, 33(1):101– 109, 2004. 197
[Sch07]
T. Schwentick. Automata for XML – A survey. Journal of Computer and System Sciences, 73(3):289–315, 2007. 119
[SCZ+ 07]
K. Stevens, D. Cutler, M. Zwick, P. de Jong, K.H. Huang, M. Koriabine, B. Ludäscher, C. Marston, S. Lee, D. Okou, K. Osoegawa, J. Warrington, D.J. Begun, and C.H. Langley. DPGP Cyberinfrastructure and Open Source Toolkit for Chip Based Resequencing. In Advances in Genome Biology and Technology (AGBT), 2007. 35
[SHS04]
G.M. Sur, J. Hammer, and J. Simeon. An XQuery-based language for processing updates in XML. In Proceedings of PLAN-X 2004, 2004. 147
[SOL05]
A. Stamatakis, M. Ott, and T. Ludwig. RAxML-OMP: An Efficient Program for Phylogenetic Inference on SMPs. Lecture Notes in Computer Science, 3606:288–302, 2005. 33
[SWI]
Simplified Wrapper and Interface Generator. http://www.swig.org/. 121
[TDGS07]
Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Mark Shields, editors. Workflows for e-Science: Scientific Workflows for Grids. Springer, 2007. 7, 206
[TMG+ 07]
D. Turi, P. Missier, C. Goble, D. De Roure, and T. Oinn. Taverna workflows: Syntax and semantics. In Proceedings from the 3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, 2007. 57
[TMSF03]
P.A. Tucker, D. Maier, T. Sheard, and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Transactions on Knowledge and Data Engineering, pages 555–568, 2003. 119
[TSWH07]
I. Taylor, M. Shields, I. Wang, and A. Harrison. The Triana workflow environment: Architecture and applications. In Workflows for e-Science, pages 320–339, 2007. 11, 14, 31, 57
[TSWR03]
I. Taylor, M. Shields, I. Wang, and O. Rana. Triana Applications within Grid Computing and Peer to Peer Environments. Journal of Grid Computing, 1(2):199–217, 2003. 91
[Van05]
Stijn Vansummeren. Deciding well-definedness of XQuery fragments. In PODS ’05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 37–48, New York, NY, USA, 2005. ACM. 194, 197
[Van07]
Stijn Vansummeren. On deciding well-definedness for query languages on trees. Journal of the ACM (JACM), 54(4):19, 2007. 197
[vHHH+ 09]
Kees van Hee, Jan Hidders, Geert-Jan Houben, Jan Paredaens, and Philippe Thiran. On the relationship between workflow models and document types. Information Systems, 34(1):178–208, March 2009. 57
[vL96]
G. von Laszewski. An Interactive Parallel Programming Environment Applied in Atmospheric Science. Making Its Mark, Proceedings of the 6th Workshop on the Use of Parallel Processors in Meteorology, pages 311–325, 1996. 134
[Woo03]
P.T. Wood. Containment for XPath fragments under DTD constraints. Lecture notes in computer science, pages 300–314, 2003. 197
[YB05]
Jia Yu and Rajkumar Buyya. A taxonomy of scientific workflow systems for grid computing. SIGMOD Record, 34(3):44–49, September 2005. 17, 134
[YDHP07]
Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Mapreduce-merge: simplified relational data processing on large clusters. In SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029–1040, New York, NY, USA, 2007. ACM. 94
[ZBKL09]
Daniel Zinn, Shawn Bowers, Sven Köhler, and Bertram Ludäscher. Parallelizing XML data-streaming workflows via MapReduce. Journal of Computer and System Sciences, In Press, 2009. 1, 27, 59
[ZBL10]
Daniel Zinn, Shawn Bowers, and Bertram Ludäscher. XML-Based Computation for Scientific Workflows. In Intl. Conf. on Data Engineering (ICDE), 2010. To appear; see also technical report. 1, 28, 135
[ZBML09a]
Daniel Zinn, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. Scientific workflow design with data assembly lines. In Deelman and Taylor [DT09]. 1, 26, 31
[ZBML09b]
Daniel Zinn, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. X-CSR: Dataflow Optimization for Distributed XML Process Pipelines. In Intl. Conf. on Data Engineering (ICDE), pages 577–580, 2009. Also see Technical Report CSE-2008-15, UC Davis. 1, 27, 96
[ZDF+ 05]
Yong Zhao, Jed Dobson, Ian Foster, Luc Moreau, and Michael Wilde. A notation and system for expressing and executing cleanly typed workflows on messy scientific data. SIGMOD Rec., 34(3):37–43, 2005. 69
[ZHC+ 07]
Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde. Swift: Fast, Reliable, Loosely Coupled Parallel Computation. In IEEE Congress on Services, pages 199–206, 2007. 91
[Zin08]
Daniel Zinn. Modeling and optimization of scientific workflows. In Ph.D. ’08: Proceedings of the 2008 EDBT Ph.D. workshop, pages 1–10, New York, NY, USA, 2008. ACM. 1
[ZLL09]
Daniel Zinn, Xuan Li, and Bertram Ludäscher. Parallel Virtual Machines in Kepler. Eighth Biennial Ptolemy Miniconference, UC Berkeley, California, April 2009. 1, 28, 120, 129