Detecting Thread-Safety Violations in Hybrid OpenMP/MPI Programs

Hongyi Ma, Liqiang Wang, and Krishanthan Krishnamoorthy
Department of Computer Science, University of Wyoming
{hma3, lwang7, kkrishna}@uwyo.edu

Abstract—We propose an approach that integrates static and dynamic program analysis to detect thread-safety violations in hybrid MPI/OpenMP programs. Our key idea is to transform thread-safety violation problems into race-condition problems. The static analysis identifies the MPI calls that may be involved in thread-safety violations and replaces them with our own MPI wrappers, which access specific shared variables. Because unrelated code is not instrumented, the runtime overhead is significantly reduced. The dynamic analysis then applies both happens-before and lockset-based race detection to these shared variables; by detecting races on them, we identify thread-safety violations according to their specifications. Our experimental evaluation on real-world applications shows that our approach is both accurate and efficient.

Keywords—Hybrid MPI/OpenMP, Thread-Safety Violation, Race, Static Analysis, Dynamic Analysis

I. RELATED WORK AND BACKGROUND

Many tools exist for detecting errors in MPI programs, using both static and dynamic methods. On the static side, model checking and symbolic execution often suffer from state-space explosion [14]. On the dynamic side, examples include DAMPI [4], Marmot [5], Umpire [15], and MPI-CHECK [13]. Marmot uses a timeout threshold to decide whether a deadlock has occurred. DAMPI uses a scalable algorithm based on Lamport clocks and vector clocks to capture non-deterministic matches in MPI communications. MPI-CHECK instruments the source code at compile time, attaching extra information to MPI calls, and allows collective verification for the full MPI standard. Several research efforts address error detection in OpenMP programs. One of the earliest is Intel Thread Checker [1], which rewrites the program binary with intercepting instructions to monitor a serial execution and infer possible parallel execution events of the program.

Kim et al. [7] designed a practical on-the-fly dynamic monitoring tool to detect data races in OpenMP programs. Kang et al. [6] presented a tool that focuses on detecting first data races, i.e., races with no explicit happens-before order, in OpenMP programs. Chen et al. [2] implemented a dynamic analysis to detect data races in multithreaded programs. The OpenMP Analysis Toolkit (OAT) [9] uses symbolic analysis based on a Satisfiability Modulo Theories (SMT) solver to detect data races and deadlocks in OpenMP code.

II. OVERVIEW

This paper proposes a novel approach to detecting thread-safety violations by integrating static and dynamic program analysis. Figure 1 shows the workflow of our approach, HAME, which consists of two phases: static analysis and runtime checking. During the static analysis phase, we replace the MPI calls inside OpenMP parallel regions that may affect thread-safety with our own MPI call wrappers. Such selective instrumentation reduces the overhead of the dynamic analysis. To identify MPI calls related to thread-safety properties, our static analysis first generates the control flow graph of the hybrid MPI/OpenMP program using the Rose compiler, in which each MPI call is represented as a node. By visiting all nodes in the control flow graph, our tool inserts shared-variable definitions that work together with our MPI wrappers. In this way, we transform thread-safety violations into races on the monitored variables.

III. STATIC ANALYSIS

To determine whether two MPI calls can be executed concurrently at the thread level, we replace MPI routines with wrappers, as sketched below, that capture runtime information from each MPI call, such as the source, tag, communicator, and thread ID, in addition to executing the original MPI routine.
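As a minimal sketch of such a wrapper in C (the wrapper name HAME_Recv and the monitored variable mon_recv are hypothetical identifiers introduced here for illustration, not the tool's actual names), a receive wrapper could record the call's parameters in a shared, unsynchronized variable and then invoke the original routine:

/* Minimal sketch of an MPI_Recv wrapper (hypothetical identifiers). */
#include <mpi.h>
#include <omp.h>

/* Shared monitored variable: unsynchronized writes to it by two
 * concurrent threads are what the race detector observes. */
typedef struct {
    int      source;
    int      tag;
    MPI_Comm comm;
    int      thread_id;
} monitored_recv_t;

static monitored_recv_t mon_recv;   /* deliberately shared and unlocked */

int HAME_Recv(void *buf, int count, MPI_Datatype type, int source,
              int tag, MPI_Comm comm, MPI_Status *status)
{
    /* Record runtime information of this call. */
    mon_recv.source    = source;
    mon_recv.tag       = tag;
    mon_recv.comm      = comm;
    mon_recv.thread_id = omp_get_thread_num();

    /* Execute the original MPI routine. */
    return MPI_Recv(buf, count, type, source, tag, comm, status);
}

Under this scheme, call sites of MPI_Recv inside #pragma omp parallel regions would be rewritten to invoke such a wrapper, so that conflicting receives issued concurrently by two threads manifest as a race on the monitored variable.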

Figure 1. The architecture of the tool HAME: the hybrid MPI/OpenMP source code is processed by the static analysis, which replaces MPI calls with wrappers containing predefined monitored variables; the resulting executable binary code is then checked by dynamic hybrid (lockset/happens-before) race detection against the specification of thread-safety violations, producing a report of thread-safety violations.

Different MPI calls require different monitored variables; Section V gives the details.

IV. DYNAMIC ANALYSIS

We use Intel Pin [8] to monitor, during execution, the shared variables defined in the MPI wrappers. By analyzing locksets and happens-before orders on the accesses to these variables, we determine thread-safety violations. Pure lockset analysis can find more races than happens-before-based approaches, but may increase the overhead [11]. The happens-before relation is only a partial order of events, and an analysis that misses some of the actual ordering constraints may report false positives. Hence, we follow the approach in [11] and combine the two techniques to make our analysis both more efficient and more accurate. Intel Pin can monitor not only memory accesses but also locking primitives used for thread synchronization. The main challenges in our implementation of the dynamic analysis with Intel Pin are finding the happens-before relationship between two events and obtaining the lockset of each event. To determine the happens-before relationship between two events, we must consider the OpenMP synchronization points, which include explicit synchronization such as #pragma omp barrier and #pragma omp critical, and implicit synchronization directives such as #pragma omp single, #pragma omp task, and omp reduction. To obtain the lockset of an event, we must consider #pragma omp section, #pragma omp master, #pragma omp single, and #pragma omp critical, as well as omp_set_lock() and omp_unset_lock(), in the implementation.
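As a rough illustration of this hybrid check (a sketch only, not the tool's actual Pin-based implementation; the data structures and function names below are assumed for illustration), two accesses to a monitored variable are reported as a race only if neither happens before the other and their locksets do not intersect:

/* Sketch of the hybrid race check on recorded accesses (illustrative). */
#include <stdbool.h>

#define MAX_THREADS 64
#define MAX_LOCKS   64

typedef struct {
    int  clock[MAX_THREADS];   /* vector clock at the access */
    bool locks[MAX_LOCKS];     /* lockset held at the access  */
    int  thread;               /* issuing thread              */
} access_t;

/* a happens before b if a's vector clock is component-wise <= b's
 * and strictly smaller in at least one component. */
static bool happens_before(const access_t *a, const access_t *b)
{
    bool strictly_less = false;
    for (int i = 0; i < MAX_THREADS; i++) {
        if (a->clock[i] > b->clock[i]) return false;
        if (a->clock[i] < b->clock[i]) strictly_less = true;
    }
    return strictly_less;
}

/* Lockset rule: the accesses are protected if they share a lock. */
static bool share_lock(const access_t *a, const access_t *b)
{
    for (int i = 0; i < MAX_LOCKS; i++)
        if (a->locks[i] && b->locks[i]) return true;
    return false;
}

/* A race is reported only for accesses from different threads that are
 * unordered by happens-before and hold no common lock. */
bool is_race(const access_t *a, const access_t *b)
{
    if (a->thread == b->thread) return false;
    if (happens_before(a, b) || happens_before(b, a)) return false;
    return !share_lock(a, b);
}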

V. THREAD-SAFETY IN HYBRID MPI/OPENMP PROGRAMMING

The MPI standard [3] allows the use of multiple threads within an MPI process, subject to several rules stated in the standard. Respecting these thread-safety rules in hybrid MPI/OpenMP programs is vital; otherwise a program may behave incorrectly. The following is the checklist of thread-safety specifications in hybrid MPI/OpenMP programs.
• Initialization violation: MPI calls issued from threads must follow the thread-support level requested at MPI initialization. For example, if MPI_THREAD_SINGLE is specified, MPI routines must not be called inside an omp parallel region; if MPI_THREAD_SERIALIZED is specified, MPI routines must not be called from two concurrent threads. To detect initialization violations, we monitor these conditions in the MPI wrappers.
• Concurrent MPI finalization violation: MPI_Finalize should be called by the main thread of each process. In addition, before calling MPI_Finalize, each thread must complete its own outstanding MPI calls so that no communication is pending. We insert a termination variable into all MPI wrappers and check it to detect whether any MPI call executes concurrently with MPI_Finalize.
• Concurrent MPI_Recv violation: Each thread within an MPI process may issue MPI calls; however, threads are not separately addressable: the rank in a send or receive identifies a process, not a thread. Hence, when two threads call MPI_Recv with the same tag and communicator, the matching order is undefined, and we can expect a race (a minimal example is sketched after this list).
• Concurrent request violation in MPI_Wait and MPI_Test: MPI does not allow two or more threads to concurrently invoke MPI_Wait(request) or MPI_Test(request) on the same shared request variable.
• Probe violation in MPI_Probe and MPI_Iprobe: Two concurrent invocations of MPI_Probe() or MPI_Iprobe() from different threads on the same communicator should not use the same source and tag as their parameters. We check the src and tag variables in the MPI wrappers for race conditions.
• Collective call violation: According to the MPI requirements, all processes in a given communicator must make the same collective call. Furthermore, the user must ensure that the same communicator is not used concurrently by two different collective calls issued by threads of the same process.
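For concreteness, the following hypothetical two-process example (not taken from the paper or its benchmarks) illustrates the concurrent MPI_Recv violation: two OpenMP threads in rank 1 post receives with the same source, tag, and communicator, so which message matches which receive is undefined.

/* Hypothetical two-process example of a concurrent MPI_Recv violation. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);  /* would itself be an initialization violation */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Two sends with the same tag; either may match either receive. */
        int a = 1, b = 2;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[2] = {0, 0};
        /* Two threads receive with identical (source, tag, communicator);
         * the rank addresses the process, not a thread, so which message
         * lands in which buffer is undefined. */
        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            MPI_Recv(&buf[t], 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        printf("buf = {%d, %d}\n", buf[0], buf[1]);
    }

    MPI_Finalize();
    return 0;
}

Run with two MPI processes; according to the specification above, the two concurrent receives with identical source, tag, and communicator constitute a concurrent MPI_Recv violation.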

Figure 2. LU-MZ hybrid MPI/OpenMP testing: execution time in seconds versus number of processes for Base, HAME, Marmot, and ITC.

Figure 3. BT-MZ hybrid MPI/OpenMP testing: execution time in seconds versus number of processes for Base, HAME, Marmot, and ITC.

Figure 4. SP-MZ hybrid MPI/OpenMP testing: execution time in seconds versus number of processes for Base, HAME, Marmot, and ITC.

VI. EXPERIMENTS

To measure the overhead and effectiveness of our approach, we conducted all experiments on a virtual cluster on Amazon EC2. The overhead of our approach ranges from 10% to 40%, and it detects all 6 kinds of thread-safety violations described in Section V.

A. System Setup

We evaluate HAME on microbenchmarks for effectiveness and on real-world applications for scalability. Our Amazon EC2 virtual cluster uses M3 instances with 2.6 GHz Intel Xeon E5-2670 processors; we use up to 32 instances. Our experiments are based on the NPB-MZ suite of the NAS Parallel Benchmarks [10], which includes BT-MZ, SP-MZ, and LU-MZ, run at Class C size. The number of threads in each MPI process is set to 2, because the overhead of Intel Thread Checker would be very high with more threads per process.

The experiments demonstrate that our tool HAME detects all 6 injected violations. Intel Thread Checker misses the MPI_Probe violation in NPB LU. Marmot detects only violations that occur in the actual execution, so some potential violations can be missed. Intel Thread Checker also produces a false positive in the BT test because it does not recognize the omp critical directive correctly, so a single-thread execution routine is reported as being executed concurrently by multiple threads. The overhead of Marmot and Intel Thread Checker is much higher than that of HAME.

Table I
THE EFFECTIVENESS COMPARISON OF THREE TOOLS: HAME, ITC, AND MARMOT.

Benchmarks       HAME   ITC   Marmot
NPB-MZ LU (6)     6      5     5
NPB-MZ BT (6)     6      7     6
NPB-MZ SP (6)     6      6     5

B. Performance Analysis and Comparison

We compare our approach with Intel Thread Checker [12] and Marmot [5] by injecting 6 different kinds of violations into the LU, BT, and SP benchmarks of NPB-MZ [10]. Table I shows the effectiveness of the three tools; ITC stands for Intel Thread Checker.

Figures 2, 3, and 4 show the execution time with injected violations for the hybrid MPI/OpenMP benchmarks LU, BT, and SP, respectively. In addition, during testing we added MPI calls within OpenMP parallel regions without affecting the computation or semantics of the original benchmarks.

Figure 5. Overhead measurement: average overhead (percentage) versus number of processes for HAME, Marmot, and ITC.

"Base" denotes the original runtime of the application without any instrumentation. The execution times of our approach, Marmot, and Intel Thread Checker are shown in Figures 2, 3, and 4. Figure 2 shows that the program using Intel Thread Checker slows down by 100% or more. The overhead of our approach increases as the number of processes grows. The programs instrumented with Intel Thread Checker always perform worst due to its high overhead.

VII. CONCLUSIONS

In this paper, we presented a scalable approach to detecting thread-safety violations in hybrid MPI/OpenMP programs. Our experimental results show that our tool detects violations more accurately than Intel Thread Checker and Marmot and incurs less overhead than both; the observed overhead ranges from 10% to 40%. Our future work will focus on further reducing the runtime overhead and improving scalability. In addition, we plan to investigate more subtle programming errors in hybrid MPI/OpenMP programs and design effective detection algorithms.

REFERENCES

[1] Intel Thread Checker. http://software.intel.com/en-us/intel-thread-checker/.
[2] Q. Chen, L. Wang, and Z. Yang. SAM: Self-adaptive dynamic analysis for multithreaded programs. In Proceedings of the 7th International Haifa Verification Conference on Hardware and Software: Verification and Testing, HVC'11, Berlin, Heidelberg, 2012. Springer-Verlag.
[3] M. P. Forum. MPI: A Message-Passing Interface Standard. Technical report, Knoxville, TN, USA, 1994.

[4] G. Gopalakrishnan, R. M. Kirby, S. Siegel, R. Thakur, W. Gropp, E. Lusk, B. R. de Supinski, M. Schulz, and G. Bronevetsky. Formal analysis of MPI-based parallel programs. Commun. ACM, 54(12):82–91, Dec. 2011.
[5] T. Hilbrich, M. S. Müller, and B. Krammer. Detection of violations to the MPI standard in hybrid OpenMP/MPI applications. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism, IWOMP'08, pages 26–35, Berlin, Heidelberg, 2008. Springer-Verlag.
[6] M.-H. Kang, O.-K. Ha, S.-W. Jun, and Y.-K. Jun. A tool for detecting first races in OpenMP programs. In PaCT '09: Proceedings of the 10th International Conference on Parallel Computing Technologies, pages 299–303, Berlin, Heidelberg, 2009. Springer-Verlag.
[7] Y.-J. Kim, M.-Y. Park, S.-H. Park, and Y.-K. Jun. A practical tool for detecting races in OpenMP programs. In PaCT, pages 321–330, 2005.
[8] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190–200, New York, NY, USA, 2005. ACM.
[9] H. Ma, S. R. Diersen, L. Wang, C. Liao, D. Quinlan, and Z. Yang. Symbolic analysis of concurrency errors in OpenMP programs. In Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP '13, pages 510–516, Washington, DC, USA, 2013. IEEE Computer Society.
[10] NASA NAS Parallel Benchmarks. Available from www.nas.nasa.gov/Software/NPB.
[11] R. O'Callahan and J.-D. Choi. Hybrid dynamic data race detection. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '03, pages 167–178, New York, NY, USA, 2003.
[12] P. Sack, B. E. Bliss, Z. Ma, P. Petersen, and J. Torrellas. Accurate and efficient filtering for the Intel Thread Checker race detector. In Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability, ASID '06, pages 34–41, New York, NY, USA, 2006. ACM.
[13] E. Saillard, P. Carribault, and D. Barthou. PARCOACH: Combining static and dynamic validation of MPI collective communications. Int. J. High Perform. Comput. Appl., 28(4), Nov. 2014.
[14] S. F. Siegel. Verifying parallel programs with MPI-Spin. In Proceedings of the 14th European PVM/MPI User's Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 13–14, Berlin, Heidelberg, 2007. Springer-Verlag.
[15] J. S. Vetter and B. R. de Supinski. Dynamic software testing of MPI applications with Umpire. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, SC '00, Washington, DC, USA, 2000. IEEE Computer Society.