High Performance Computing of Hydrologic Models Using HTCondor

Spencer Taylor

A project report submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

Norman L. Jones, Chair
Everett James Nelson
Gus P. Williams

Department of Civil and Environmental Engineering
Brigham Young University
April 2013

Copyright © 2013 Spencer Taylor
All Rights Reserved

ABSTRACT

High Performance Computing of Hydrologic Models Using HTCondor

Spencer Taylor
Department of Civil and Environmental Engineering, BYU
Master of Science

"Big Iron" supercomputers and commercial cloud resources (Amazon, Google, Microsoft, etc.) are some of the most prominent resources for high-performance computing (HPC) needs. These resources have many advantages, such as scalability and computational speed. However, limited access to supercomputers and the cost associated with cloud systems may prevent water resources engineers and planners from using HPC in water applications. The goal of this project is to demonstrate an alternative model of HPC for water resource stakeholders who would benefit from an autonomous pool of free and accessible computing resources. To demonstrate this concept, a system called HTCondor was used at Brigham Young University in conjunction with the scripting language Python to parallelize intensive stochastic computations with a hydrological model called the Gridded Surface Subsurface Hydrologic Analysis (GSSHA) model. HTCondor is open-source software developed by the University of Wisconsin-Madison that provides access to the processors of idle computers for performing computational jobs on a local network. We found that performing stochastic simulations with GSSHA using the HTCondor system significantly reduces overall computational time for simulations involving multiple model runs and improves modeling efficiency. Hence, HTCondor is an accessible and free solution that can be applied to many projects under different circumstances using various water resources modeling software.

Keywords: cloud computing, daemon, GDM, GSSHA, high-performance computing, high-throughput computing, HTCondor, parallelization, pool, Python

ACKNOWLEDGEMENTS

It is with immense gratitude that I acknowledge the support and help of my advisor, Dr. Norm L. Jones of the Civil and Environmental Engineering Department at Brigham Young University, for guiding me through this project. I am also grateful to Dr. Everett James Nelson and Dr. Gus P. Williams for their input and enthusiasm for my project. Further thanks must also be given to my associates Dr. Kris Latu, Nathan Swain, and Scott Christensen for working closely with me on this project. This material is based upon work supported by the National Science Foundation under Grant No. 1135482.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Problem Statement
2 Literature Review
3 Stochastic Simulations Using GSSHA and HTCondor
  3.1 HTCondor
    3.1.1 Master Computer
    3.1.2 Worker Computers
    3.1.3 The BYU HTCondor Pool
    3.1.4 HTCondor Workflow
    3.1.5 Universe Environments
  3.2 External HTCondor Resources
    3.2.1 Cloud Resources
    3.2.2 Supercomputing Resources
    3.2.3 External HTCondor Pools
  3.3 GSSHA
  3.4 Stochastic Simulations and Statistical Convergence
  3.5 Using Python to Connect GSSHA and HTCondor
4 Case Study
5 Conclusions
  5.1 HTCondor
  5.2 Mary Lou
  5.3 Amazon Cloud
  5.4 Future Work
REFERENCES
Appendix A. Python Scripts
  A.1 Scripts for Method One – Statistical Convergence
  A.2 Scripts for Method Two – Static Number of Jobs

LIST OF TABLES

Table 1. List of HTCondor universes and descriptions
Table 2. Comparison of the two stochastic GSSHA simulation methods


LIST OF FIGURES

Figure 1. HTCondor network structure
Figure 2. Stochastic convergence of average peak discharge
Figure 3. Diagram of how statistical convergence is determined for this project
Figure 4. Python script interaction with HTCondor
Figure 5. Plot of the time it took to complete certain numbers of iterations
Figure 6. A-D are graphs of statistical convergence for a tolerance of 0.001
Figure 7. Graph of statistical convergence for a tolerance of 0.0001


1 INTRODUCTION

Water resource applications are often modeled one instance at a time on a single desktop computer. For smaller models, such a setup provides enough computing power to complete a model run in a reasonable time frame. Other water resource applications, however, are composed of multiple model instances and larger model domains that require more computing power than the average desktop computer alone can provide. For example, researchers at Brigham Young University (BYU) are in the process of creating a web application for generating stochastic simulations of hydrologic models as part of a larger National Science Foundation (NSF) initiative known as the Cyber-Infrastructure to Advance High Performance Water Modeling (CI-WATER) project. An ongoing and computationally intensive modeling project such as this requires high-performance computing (HPC), which provides access to large amounts of computing resources that can manage multiple jobs at the same time. Common methods of HPC include traditional supercomputers, which are designed to process as many floating-point operations per second (FLOPS) as possible. The problem with these resources is that they can be prohibitively expensive, so access is limited to a small number of users within the hydrological community. While FLOPS is a good measure of computational speed, it does not translate directly into a good metric of computational performance when the quantity of interest is actual jobs completed. High-throughput computing (HTC) is a form of HPC that uses a large number of relatively slower processors over a long period of time to complete massive amounts of computations. HTC measures computational performance in terms of jobs completed during a long period of time, such as a week or a month.

HTCondor is scheduling and networking software developed by the University of Wisconsin-Madison (UW-Madison) that accomplishes this task by linking large numbers of existing desktop computers together to create one computing resource (Thain, Tannenbaum, & Livny, 2006). This project was not an attempt to compare scheduling software; however, a brief justification for the selection of HTCondor is given here. There are many other open-source queuing programs similar to HTCondor that could be used to meet various scheduling and networking needs (Epema, Livny, van Dantzig, Evers, & Pruyne, 1996). What sets HTCondor apart is its ability to manage many different types of computing resources and operating systems while also facilitating job creation as well as job scheduling. Its unique strengths include opportunistic computing, which dynamically matches jobs to available computational resources, and checkpointing, which allows jobs to be interrupted and resumed at a later time (Thain, Tannenbaum, & Livny, 2005). It is difficult to gauge how many instances of HTCondor have been installed since its creation in 1984. Because of its open-source nature it is not sold and licensed like proprietary software and is therefore harder to track. At one point it was confirmed that HTCondor had been installed on over 60,000 central processing units (CPUs) in 39 countries, a testament to how trusted and reliable its performance is (Thain et al., 2006). Having been created at UW-Madison, HTCondor has undergone extensive testing and continues to be the subject of intensive research and development (Litzkow & Livny, 1990). HTCondor also offers a system that can be scaled to fit the needs of almost any collection of computing resources, from a half dozen desktop computers to a mix of hundreds of HPC clusters and cloud resources as well as desktop CPUs (Bradley et al., 2011).
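For context, each job is handed to HTCondor through a short submit description file. The sketch below mirrors the submit descriptions generated by the Python scripts in Appendix A (vanilla universe, a 64-bit Windows requirement, and file-transfer settings); the file names shown are illustrative placeholders rather than values from an actual run.

    Universe                = vanilla
    Executable              = project1.bat
    Requirements            = Arch == "X86_64" && OpSys == "WINDOWS"
    Request_Memory          = 1200 Mb
    Log                     = project1.log.txt
    Output                  = project1.out.txt
    Error                   = project1.err.txt
    transfer_executable     = TRUE
    transfer_input_files    = gssha.exe,project1.prj,project1.cmt
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Queue

Submitting this file with condor_submit places the job in the queue, and HTCondor matches it to an idle machine whose ClassAd satisfies the Requirements expression.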

1.1 Problem Statement

The objective of this project is to demonstrate how HTCondor can be implemented at BYU to provide a robust and economical computational environment that can support web-based applications that perform hydrologic stochastic simulations.


2 LITERATURE REVIEW

There is a substantial amount of queuing and scheduling software available, as the literature suggests. This review focuses on literature that explains HTCondor's strengths and utility as they pertain to the specific instance of HTCondor at BYU. Although the literature describes HTCondor's utility in a variety of settings and applications, this project report applies each insight to how HTCondor can benefit hydrologic web applications at BYU. Epema et al. (1996) provide evidence and explanation of how HTCondor can help to provide a robust computational environment, such as one that connects several pools of resources at BYU to process jobs from a web application. They discuss the ability and usefulness of connecting HTCondor pools, thus creating a flock of Condor resources that, in some cases, may span continents while providing increased computational power to all connected pools. One such flock connected resources in Madison (USA), Amsterdam and Delft (Netherlands), Geneva (Switzerland), Warsaw (Poland), and Dubna (Russia). The flocking mechanism can make use of the massive amounts of idle resources that sweep around the globe every 24 hours. The HTCondor philosophy places emphasis on maintaining the owners' control over their workstations throughout the flock of resources. The three guiding principles of the HTCondor system are as follows:

1. Condor should have no impact on owners.


2. Condor should be fully responsible for matching jobs to resources and informing users of job progress.

3. Condor should preserve the environment on the machine on which the job was submitted.

Bradley et al. (2011) provide examples of how HTCondor can be scaled to increase its usefulness as an organization expands; their work also examines the optimal scalability and deployment of HTCondor. Each new version of Condor increases the system's ability to schedule, match, and collect more jobs in a pool of computing resources while decreasing wasted time.

Appendix A. Python Scripts

A.1 Scripts for Method One – Statistical Convergence

def writeJob(resultFilePath, fileList, numJobs, b):
    # NOTE: the opening lines of this function fall outside this excerpt; the
    # header above and the "pr" assignment below are inferred from the Method Two script.
    pr = "project%i." % (b)
    newJobFile = resultFilePath + pr + "job"
    fileStr = ""
    file = open(newJobFile, 'a')
    file.write("Universe = vanilla\n")
    file.write(("Executable = {0}bat\n").format(pr))
    file.write('Requirements = Arch == "X86_64" && OpSys == "WINDOWS"\n')
    file.write("Request_Memory = 1200 Mb\n")
    file.write(("Log = {0}log.txt\n").format(pr))
    file.write(("Output = {0}out.txt\n").format(pr))
    file.write(("Error = {0}err.txt\n").format(pr))
    file.write("transfer_executable = TRUE\n")
    file.write("transfer_input_files = " + fileList[0])
    for i in range(1, len(fileList)):
        fileStr = fileStr + "," + fileList[i]
    file.write(fileStr + "\n")
    file.write("should_transfer_files = YES\n")
    file.write("when_to_transfer_output = ON_EXIT_OR_EVICT\n")
    file.write(("Queue {0} \n").format(numJobs))
    file.close()

def writeExe(resultFilePath, b):
    newExe = resultFilePath + "project%i.bat" % (b)
    file = open(newExe, 'a')
    file.write("C:\Python26\ArcGIS10.0\python.exe oneRndGSSHA.py > cmdout.txt\n")
    file.close()

def writeSubmit(resultFilePath, b):
    drive = resultFilePath.split(':')[0]  # this allows the jobs to be sent from a thumb drive
    newBat = resultFilePath + "submit.bat"
    file = open(newBat, 'a')
    file.write(drive + ": \n")
    file.write("cd " + resultFilePath + "\n")
    file.write(("condor_submit " + resultFilePath + "project{0}.job > submit.out\n").format(b))
    file.close()

def fileFinder(baseFilePath):
    fileList = []
    files = os.listdir(baseFilePath)
    for file in files:
        if file.endswith('.prj'):
            project = baseFilePath + file
            projectName = file.split(".")[0]
    for file in files:
        if file.startswith(projectName):
            if file.endswith(".ohl") or file.endswith(".rec") or file.endswith(".gmh") or file.endswith(".dep"):
                pass
            else:
                fileList.append(baseFilePath + file)
        elif file.endswith(".idx"):
            fileList.append(baseFilePath + file)
        elif file.endswith("gssha.exe"):
            fileList.append(baseFilePath + file)
        elif file.endswith("oneRndGSSHA.py"):
            fileList.append(baseFilePath + file)
        elif file.startswith("HMET"):
            fileList.append(baseFilePath + file)
    return fileList

def main():
    t1 = time.clock()
    #set basepath
    baseFilePath = os.getcwd() + "\\"
    #set and clear results folder
    resultFilePath = baseFilePath + "Results1\\"
    b = 2
    while os.path.exists(resultFilePath):
        resultFilePath = (baseFilePath + "Results{0}\\").format(b)
        b = b + 1
    b = b - 1  #"b" is the number of the current Results folder
    #Make current Results path
    if not os.path.exists(resultFilePath):
        os.makedirs(resultFilePath)
    numJobs = 500
    pargssha(baseFilePath, resultFilePath, b, numJobs)
    results = fileCollector(resultFilePath, numJobs)
    dt = (time.clock()) - t1
    resultsOutFile = open((resultFilePath + "result{0}.out").format(b), 'a')
    resultsOutFile.write("main Results = \n")
    for i in range(0, len(results)):
        resultsOutFile.write(str(results[i]) + "\n")
    resultsOutFile.write(("Finished in {0} seconds.\n").format(int(round(dt, 0))))
    resultsOutFile.close()
    csv = open((resultFilePath + "result{0}.csv").format(b), 'a')
    for i in range(0, len(results[1])):
        csv.write(str(i + 1) + "," + str(results[1][i]) + "\n")
    csv.close()

if __name__ == '__main__':
    main()
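Method One stops submitting batches of jobs once the running statistics stop changing (see Figures 2 and 6). A minimal sketch of such a check is shown below, assuming convergence is declared when a new batch changes the running average peak discharge by less than a relative tolerance; the function name, the batching helper, and the relative-change criterion are illustrative assumptions, not the original implementation.

    # Illustrative convergence check (not part of the original scripts): declare
    # convergence when the running average of peak discharge changes by less than
    # a relative tolerance after a new batch of GSSHA results is added.
    def has_converged(peaks, tolerance):
        if len(peaks) < 2:
            return False
        previous_avg = sum(peaks[:-1]) / float(len(peaks) - 1)
        current_avg = sum(peaks) / float(len(peaks))
        return abs(current_avg - previous_avg) / abs(previous_avg) < tolerance

    # Hypothetical usage around the submission/collection functions above:
    # peaks = []
    # while not has_converged(peaks, 0.001):
    #     peaks.extend(submit_and_collect_batch())  # hypothetical helper wrapping pargssha/fileCollector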

A.2 Scripts for Method Two – Static Number of Jobs

multRndCondorGSSHA.py – This script runs a set number of randomized GSSHA models on HTCondor and parses the outputs to generate a statistical results file. All of the randomized CMT files are created on the submission computer prior to submitting jobs to HTCondor.

import math, os, random, subprocess, sys, threading, time

t1 = time.clock()

def pargssha(argv):
    baseFilePath = argv[0]
    project = argv[1]
    projectName = argv[2]
    stuff = argv[3]
    b = argv[4]
    i = argv[5]
    gssha = argv[6]
    resultFilePath = argv[7]
    run = argv[8]
    cmt = argv[9]
    fileList = argv[10]
    #create a new work folder to copy altered .prj file into
    newPath = (resultFilePath + "NewGSSHAfolder{0}_{1}\\").format(b, i)
    if not os.path.exists(newPath):
        os.makedirs(newPath)
    #set altered .prj filepath and remove if it exists
    projectNew = (newPath + "project{0}_{1}.prj").format(b, i)
    if os.path.exists(projectNew):
        os.remove(projectNew)
    rain = ""
    #write altered .prj file
    arg1 = [baseFilePath, newPath, project, projectNew, projectName, stuff, rain, b, i, gssha, resultFilePath, run, cmt, fileList]
    readCmt(arg1)
    writePrj(arg1)
    writeJob(arg1)
    writeExe(arg1)
    writeSubmit(arg1)
    submitCommand = newPath + "submit.bat"
    subprocess.call([submitCommand])

def writeJob(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    projectName = argv[4]
    b = argv[7]
    i = argv[8]
    gssha = argv[9]
    pr = "project%i_%i." % (b, i)
    newJobFile = newPath + pr + "job"
    fileStr = ""
    file = open(newJobFile, 'a')
    file.write("Universe = vanilla\n")
    file.write(("Executable = {1}bat\n").format(newPath, pr))
    file.write('Requirements = Arch == "X86_64" && OpSys == "WINDOWS"\n')
    file.write(("Request_Memory = 1200 Mb\n").format(pr))
    file.write(("Log = {0}log.txt\n").format(pr))
    file.write(("Output = {0}out.txt\n").format(pr))
    file.write(("Error = {0}err.txt\n").format(pr))
    file.write("transfer_executable = TRUE\n")
    file.write(("transfer_input_files = " + gssha + ",{0}prj,{0}cmt").format(pr))
    for i in range(0, len(fileList)):  # fileList is the module-level list built by fileFinder()
        fileStr = fileStr + "," + fileList[i]
    file.write(fileStr + "\n")
    file.write("should_transfer_files = YES\n")
    file.write("when_to_transfer_output = ON_EXIT_OR_EVICT\n")
    file.write("Queue\n")
    file.close()

def writeExe(argv):
    newPath = argv[1]
    projectName = argv[4]
    b = argv[7]
    i = argv[8]
    pr = "project%i_%i." % (b, i)
    newExe = newPath + pr + "bat"
    file = open(newExe, 'a')
    file.write(("gssha.exe {0}prj > cmdout.txt\n").format(pr))
    file.write(("del {0}.dep /F /Q\n").format(projectName))
    file.write(("del {0}.rec /F /Q\n").format(projectName))
    file.write(("del {0}.ghm /F /Q\n").format(projectName))
    file.write("del maskmap /F /Q\n")
    #file.write("del out_time.out /F /Q\n")
    file.close()

def writeSubmit(argv):
    newPath = argv[1]
    b = argv[7]
    i = argv[8]
    drive = newPath.split(':')[0]
    newBat = newPath + "submit.bat"
    file = open(newBat, 'a')
    file.write(drive + ": \n")
    file.write("cd " + newPath + "\n")
    file.write(("condor_submit " + newPath + "project{0}_{1}.job\n").format(b, i))
    file.close()

def writePrj(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    project = argv[2]
    projectNew = argv[3]
    projectName = argv[4]
    stuff = argv[5]
    rain = argv[6]
    b = argv[7]
    i = argv[8]
    idx = argv[13]
    for file in os.listdir(baseFilePath):
        if file.endswith('.prj'):
            project = baseFilePath + file
    newPrjFile = open(projectNew, 'a')
    for line in open(project):
        if line.split()[0] == "SUMMARY":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t\t\t " + '"' "project{0}_{1}.sum" + '"\n').format(b, i))
        elif line.split()[0] == "OUTLET_HYDRO":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}_{1}.otl" + '"\n').format(b, i))
        elif line.split()[0] == "MAPPING_TABLE":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}_{1}.cmt" + '"\n').format(b, i))
        else:
            newPrjFile.write(line)
    newPrjFile.close()

def readCmt(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    b = argv[7]
    i = argv[8]
    for file in os.listdir(baseFilePath):
        if file.endswith('.cmt'):
            cmtFilePath = baseFilePath + file
    newCmt = (newPath + "project{0}_{1}.cmt").format(b, i)
    cmt = open(cmtFilePath)
    new = open(newCmt, 'a')  #this creates a new file
    nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    while nextLine.split()[0] == "INDEX_MAP":
        new.write(nextLine)
        nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    numIDs = int(nextLine.split()[-1])
    new.write(nextLine)
    nextLine = cmt.readline()
    new.write(nextLine)
    for i in range(0, numIDs):
        nextLine = cmt.readline()
        lineBy = nextLine.split()
        mean = float(lineBy[-1])
        min1 = mean*0.8
        max1 = mean*1.2
        stdev = (max1-min1)/6
        num = normalDist(mean, stdev, min1, max1)
        repLine = nextLine.replace(lineBy[-1], str(num)[:7])
        new.write(repLine)
    for i in range(0, 500):
        nextLine = cmt.readline()
        new.write(nextLine)
    new.close()

def normalDist(mean, stdev, min1, max1):
    x = (min1-5.0)
    while x < min1 or x > max1:  # resample until x falls inside the bounds (condition corrected from "max1 > x < min1")
        x = gauss()*stdev + mean
    return x

def gauss():
    fac = 0.0
    r = 1.5
    V1 = 0.0
    V2 = 0.0
    rnd = random.random
    while r >= 1:
        V1 = 2*rnd() - 1
        V2 = 2*rnd() - 1
        r = V1**2 + V2**2
    fac = (-2*math.log(r)/r)**(1/2.0)
    gauss = V2*fac
    return gauss

def parseResults(path, numRuns, b, timeSum):
    resultFilePath = path
    numRuns = float(numRuns)
    b = b
    timeSum = timeSum
    openResultsSum = open((resultFilePath + "ResultSummary%i.txt" % b), 'a')
    maxPeak = [0.0, 0.0]
    minPeak = [100000000000000000000000000000000.0, 0.0]
    sumPeak = [0.0, 0.0]
    sqrsumPeak = [0.0, 0.0]
    avePeak = [0.0, 0.0]
    stdPeak = [0.0, 0.0]
    varPeak = [0.0, 0.0]
    listPeak = []
    listTime = []
    for dir in os.listdir(resultFilePath):
        currentFolder = resultFilePath + dir + "/"
        if os.path.isdir(currentFolder):
            for file in os.listdir(currentFolder):
                if file.endswith('.sum'):
                    sumFilePath = currentFolder + file
                elif file.endswith('.otl'):
                    oltFilePath = currentFolder + file
            peak = extractPeakFlow(oltFilePath)
            sumPeak[0] = sumPeak[0] + peak[0]
            sumPeak[1] = sumPeak[1] + peak[1]
            listPeak.append(peak[0])
            listTime.append(peak[1])
            if peak[0] > maxPeak[0]:
                maxPeak = peak
            elif peak[0] < minPeak[0]:
                minPeak = peak
    avePeak[0] = sumPeak[0]/numRuns
    avePeak[1] = sumPeak[1]/numRuns
    for i in range(0, len(listPeak)):
        varPeak[0] = varPeak[0] + ((listPeak[i]-avePeak[0])**2)
        varPeak[1] = varPeak[1] + ((listTime[i]-avePeak[1])**2)
    stdPeak[0] = (varPeak[0]/numRuns)**(0.5)
    stdPeak[1] = (varPeak[1]/numRuns)**(0.5)
    openResultsSum.write("MaxPeak Discharge = %f @ Time = %f\n" %(maxPeak[0], maxPeak[1]))
    openResultsSum.write("MinPeak Discharge = %f @ Time = %f\n" %(minPeak[0], minPeak[1]))
    openResultsSum.write("AvePeak Discharge = %f AveTime to Peak = %f\n" %(avePeak[0], avePeak[1]))
    openResultsSum.write("StDevPeak Discharge = %f StDevTime to Peak = %f\n" %(stdPeak[0], stdPeak[1]))
    openResultsSum.write(timeSum + "\n")
    openResultsSum.close()

def extractPeakFlow(path):
    otlFilePath = path
    openOtl = open(otlFilePath)
    peak = [0.0, 0.0]
    flow = 0.0
    for line in openOtl:
        lineBy = line.split()
        flow = float(lineBy[1])
        if flow > peak[0]:
            peak[0] = flow
            peak[1] = float(lineBy[0])  # include the time for comparison
    return peak

def fileFinder(baseFilePath):
    baseFilePath = baseFilePath
    fileList = []
    files = os.listdir(baseFilePath)
    for file in files:
        if file.startswith(projectName):
            if file.endswith('.prj') or file.endswith(".cmt") or file.endswith(".ohl") or file.endswith(".rec") or file.endswith(".gmh"):
                pass
            else:
                fileList.append(baseFilePath + file)
        elif file.endswith(".idx"):
            fileList.append(baseFilePath + file)
        elif file.startswith("HMET"):
            fileList.append(baseFilePath + file)
    return fileList

#set basepath
baseFilePath = os.path.split( sys.argv[0] )[0] + "\\"
print baseFilePath
#set and clear results folder
resultFilePath = baseFilePath + "Results1\\"
b = 2
while os.path.exists(resultFilePath):
    resultFilePath = (baseFilePath + "Results{0}\\").format(b)
    b = b + 1
b = b - 1  #"b" is the number of the current Results folder
#Make current Results path
if not os.path.exists(resultFilePath):
    os.makedirs(resultFilePath)
#set gssha.exe path
gssha = baseFilePath + "gssha.exe"
print "The path to gssha.exe is " + gssha
#set main .prj, .cmt, and .idx file paths
project = ""
allFiles = os.listdir(baseFilePath)
for file in allFiles:
    if file.endswith('.prj'):
        projectName = file.split(".")[0]
cmt = ""
fileList = fileFinder(baseFilePath)
numRuns = 100
run = numRuns + 1  # run-1 is the number of threads that will be generated
for i in range(1, run):
    print "Processing Project%i_%i..." %(b, i)
    #set variable to look for in main .prj file
    stuff = "RAIN_INTENSITY"
    arg1 = [baseFilePath, project, projectName, stuff, b, i, gssha, resultFilePath, run, cmt, fileList]
    #create thread "i" and run gssha model
    thread = threading.Thread(target=pargssha, args=[arg1])
    thread.start()
    for i in range(0, 1000000):
        pass
    thread.join()
for i in range(1, run):
    checkFile = (resultFilePath + "NewGSSHAfolder{0}_{1}/project{0}_{1}.sum").format(b, i)
    while not os.path.exists(checkFile):
        pass
    print "WORKING..."
dt = (time.clock()) - t1
timeSum = ("Finished {0} Projects in {1} seconds").format(i, int(round(dt, 0)))
parseResults(resultFilePath, numRuns, b, timeSum)
print "FINISHED!"
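For reference, the randomization that readCmt performs on each mapping-table value reduces to sampling a normal distribution centered on the original value, bounded at plus or minus 20 percent, with a standard deviation of one sixth of that range. A compact sketch that is equivalent in intent, using Python's built-in random.gauss rather than the custom gauss() above (and not part of the original script), is:

    import random

    # Equivalent in intent to the normalDist/gauss pair in multRndCondorGSSHA.py:
    # sample around the original parameter value, resampling until the draw
    # falls within +/-20% of the mean.
    def random_parameter(mean):
        low = mean * 0.8
        high = mean * 1.2
        stdev = (high - low) / 6.0
        x = low - 5.0  # start outside the bounds to force at least one draw
        while x < low or x > high:
            x = random.gauss(mean, stdev)
        return x

With a standard deviation of one sixth of the allowed range, roughly 99.7 percent of draws already fall inside the bounds, so the resampling loop rarely has to repeat.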