Performance Issues for Parallel Implementations of Bootstrap Simulation Algorithm
22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2010)
Ricardo M. Czekster, Paulo Fernandes, Afonso Sales and Thais Webber
Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
PaleoProspec Project - PUCRS/Petrobras
Funded also by CAPES and CNPq - Brazil
Context
Interest
The solution of complex and large state-based stochastic models to extract performance indices.
Solution
Numerical (iterative methods)
◮ Power method [Stewart 94]
◮ Arnoldi [Arnoldi 51]
◮ GMRES [Saad and Schultz 86]
Simulation
◮ Traditional [Ross 96]
◮ Monte Carlo [Häggström 02]
◮ Backward [Propp and Wilson 96]
◮ Bootstrap [Czekster et al. 10]
  ⋆ reliable estimations
  ⋆ high computational cost to generate repeated batches of samples
Context
Markovian simulation
Generation of independent samples (parallel execution)
Parallel sampling (e.g., master-worker approach)
Possible sequence of states using the transition matrix
◮ random walk or simulation trajectory
◮ huge size → huge memory cost
◮ Stochastic Automata Networks (structured formalism, underlying Markov Chain)
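To make "a possible sequence of states using the transition matrix" concrete, here is a minimal Python sketch of a random walk. The 3-state matrix `P`, the function name `simulate_trajectory` and the fixed seed are illustrative assumptions, not taken from the presentation.

```python
import random

def simulate_trajectory(P, start, n, seed=0):
    """Return the states visited by an n-step random walk on matrix P."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    states = list(range(len(P)))
    trajectory = [start]
    for _ in range(n):
        current = trajectory[-1]
        # Draw the next state according to row `current` of P.
        nxt = rng.choices(states, weights=P[current])[0]
        trajectory.append(nxt)
    return trajectory

# Hypothetical 3-state discrete-time chain (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
walk = simulate_trajectory(P, start=0, n=1000)
```

The sketch stores the transition matrix explicitly, which is the memory cost noted on the slide for huge state spaces; a structured formalism such as SAN avoids building that matrix.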
Objective
Goal
To present a parallel implementation of Bootstrap simulation, focusing on the overall performance of the technique, with a method that generates a large number of samples in less time.
Discussion
◮ processing × communication times
◮ model size × number of generated samples
Outline
Stochastic Automata Networks (SAN)
Bootstrap simulation
Parallelization
Experiments and results
Conclusion and future works
Stochastic Automata Networks (SAN)
• It allows the description of a large system in a structured manner by its parts (automata)

[Figure: SAN model with three automata A, B and C, each with states 0 and 1, and the events l1, s1 and s2; beside it, the underlying Continuous-Time Markov Chain with global states 000, 100 and 111 and transition rates α, β and γ.]

Type   Event   Rate
syn    s1      α
syn    s2      β
loc    l1      f

f = [(B == 0) && (C == 0)] × γ
[Figure: the same SAN model and event table, with the three global states of the underlying Continuous-Time Markov Chain relabeled 0, 1 and 2; transition rates α, β and γ.]
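One possible reading of the SAN example can be sketched in Python. The transition structure used here (l1 moves A from 0 to 1, s2 moves B and C from 0 to 1, s1 returns all three automata to 0) is an assumption inferred from the figure labels and the event table; the function names are hypothetical.

```python
def f(B, C, gamma):
    """Functional rate of local event l1: gamma when B == 0 and C == 0, else 0."""
    return gamma if (B == 0 and C == 0) else 0.0

def build_generator(alpha, beta, gamma):
    """Generator matrix over the global states (A, B, C) shown in the CTMC figure."""
    states = [(0, 0, 0), (1, 0, 0), (1, 1, 1)]
    index = {s: i for i, s in enumerate(states)}
    Q = [[0.0] * 3 for _ in range(3)]

    def add(src, dst, rate):
        Q[index[src]][index[dst]] += rate
        Q[index[src]][index[src]] -= rate   # keep every row summing to zero

    add((0, 0, 0), (1, 0, 0), f(0, 0, gamma))  # assumed: local event l1 in A
    add((1, 0, 0), (1, 1, 1), beta)            # assumed: s2 synchronizes B and C
    add((1, 1, 1), (0, 0, 0), alpha)           # assumed: s1 synchronizes A, B and C
    return states, Q

def stationary_of_cycle(alpha, beta, gamma):
    """Stationary distribution of the 3-state cycle 000 -> 100 -> 111 -> 000.

    In a cycle, the share of time spent in a state is proportional to
    the inverse of its exit rate.
    """
    weights = [1.0 / gamma, 1.0 / beta, 1.0 / alpha]
    total = sum(weights)
    return [w / total for w in weights]

states, Q = build_generator(alpha=2.0, beta=4.0, gamma=1.0)
pi = stationary_of_cycle(alpha=2.0, beta=4.0, gamma=1.0)
```

The rate values are arbitrary; the point is that the functional rate `f` only enables l1 in global states where B and C are both 0.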
Bootstrap
Method
A well-known statistical method, applied in many fields, that improves accuracy when estimating from samples of complex distributions.
In the simulation context
Bootstrap simulation provides more reliable estimations than traditional simulation [SCSC'10].
Main feature
Generation of repeated batches of samples, which helps to improve the accuracy of the method.
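n: trajectory length; K: bootstrap; z: number of bootstraps

\[ \pi_0 = \frac{\bar{x}_1[0] + \bar{x}_2[0] + \cdots + \bar{x}_z[0]}{z} \]
\[ \pi_1 = \frac{\bar{x}_1[1] + \bar{x}_2[1] + \cdots + \bar{x}_z[1]}{z} \]
\[ \pi_2 = \frac{\bar{x}_1[2] + \bar{x}_2[2] + \cdots + \bar{x}_z[2]}{z} \]

Mean permanence probability:
\[ \pi = \frac{\sum_{i=1}^{z} \bar{x}_i}{z} \]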
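The averaging step for the mean permanence probability can be sketched in a few lines of Python. This covers only the final average over the z bootstrap vectors; how each vector is produced is not shown, and the input values and the name `mean_permanence` are made up for illustration.

```python
def mean_permanence(xbars):
    """Average z bootstrap vectors component by component."""
    z = len(xbars)
    size = len(xbars[0])
    return [sum(x[j] for x in xbars) / z for j in range(size)]

# Hypothetical vectors from z = 2 bootstraps over 3 states.
xbars = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
]
pi = mean_permanence(xbars)  # [0.375, 0.375, 0.25]
```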
Parallelization
Approach
Split the bootstrap sampling tasks over the processing nodes
Each node performs the full trajectory simulation but produces a different set of samples
Master-worker pattern
Implementation: C++ language and MPI primitives
Executed on a cluster with 8 Dell PowerEdge R610 connected in a Gigabit Ethernet network
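The master-worker pattern can be illustrated with a short Python sketch. It is not the C++/MPI implementation named on the slide: a thread pool stands in for the cluster nodes, a plain seeded random walk stands in for the per-node Bootstrap sampling, and every name and parameter is an assumption.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(task):
    """Simulate one full trajectory of length n and return visit frequencies."""
    P, n, seed = task
    rng = random.Random(seed)        # distinct seed: a different set of samples
    states = list(range(len(P)))
    visits = [0] * len(P)
    current = 0
    for _ in range(n):
        current = rng.choices(states, weights=P[current])[0]
        visits[current] += 1
    return [v / n for v in visits]

def master(P, n, num_tasks, num_workers):
    """Distribute the tasks, collect one vector per task, and average them."""
    tasks = [(P, n, seed) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(worker, tasks))   # results keep task order
    size = len(P)
    return [sum(r[j] for r in results) / num_tasks for j in range(size)]

# Hypothetical 3-state chain (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
estimate = master(P, n=2000, num_tasks=8, num_workers=4)
```

Each worker returns a vector whose length is the number of states, while its running time grows with n; that is the same split between communication and processing discussed in the results.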
Experiments
Number of bootstraps assigned to nodes in each configuration

configuration          1    2    3    4    5    6    7    8
number of bootstraps
  node 1              36   18   12    9    7    6    5    4
  node 2                   18   12    9    7    6    5    4
  node 3                        12    9    7    6    5    4
  node 4                              9    7    6    5    4
  node 5                                   8    6    5    5
  node 6                                        6    5    5
  node 7                                             6    5
  node 8                                                  5
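The assignment in this table matches an even split of 36 bootstraps in which any remainder goes to the last nodes. A minimal Python sketch that reproduces the table's columns follows; the function name is hypothetical, and the rule is inferred from the table, not stated by the authors.

```python
def split_bootstraps(total, nodes):
    """Split `total` bootstraps over `nodes`, giving the extras to the last nodes."""
    base, extra = divmod(total, nodes)
    return [base] * (nodes - extra) + [base + 1] * extra

# One list per configuration (1 to 8 nodes), 36 bootstraps in total.
configurations = {k: split_bootstraps(36, k) for k in range(1, 9)}
# configurations[5] == [7, 7, 7, 7, 8]
# configurations[8] == [4, 4, 4, 4, 5, 5, 5, 5]
```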
Models (examples)
ASP - Alternate Service Patterns: describes an Open Queueing Network with servers that map P different service patterns.
FAS - First Available Server: indicates the availability of N servers.
RS - Resource Sharing: maps R shared resources to P processes.
Results
Large models (millions of states)
[Figure: four panels, for n = 10^6, n = 10^7, n = 10^8 and n = 10^9. Each panel shows Time (s), split into processing (Proc.) and communication (Comm.), against the number of nodes (1 to 8) for the ASP, FAS and RS models.]
Results
Small models (hundreds of states)
[Figure: four panels, for n = 10^6, n = 10^7, n = 10^8 and n = 10^9. Each panel shows Time (s), split into processing (Proc.) and communication (Comm.), against the number of nodes (1 to 8) for the ASP, FAS and RS models.]
Conclusion and future works
Summary
An efficient implementation of a novel simulation algorithm
Considerable speedup for very large models
◮ especially for long trajectories
The speedup was consistent across different SAN models
◮ nearly 5 times speedup for 8 nodes
The processing demands depend only on the simulation trajectory length (n)
The communication demands depend only on the reachable state space size of the model
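As a rough reading of the reported speedup, the Python sketch below computes the parallel efficiency and a load-balance bound. The bound assumes that processing time is proportional to the number of bootstraps on the busiest node, which is an assumption made here, not a claim from the presentation; the 36 bootstraps and 8 nodes come from the experiments.

```python
import math

def efficiency(speedup, nodes):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return speedup / nodes

def load_balance_bound(total_bootstraps, nodes):
    """Best processing speedup if time tracks the busiest node's bootstrap count."""
    busiest = math.ceil(total_bootstraps / nodes)
    return total_bootstraps / busiest

eff = efficiency(5, 8)              # 0.625 for "nearly 5 times" on 8 nodes
bound = load_balance_bound(36, 8)   # 36 / 5 = 7.2 under the stated assumption
```

Under that assumption, part of the gap between 5 and 8 comes from 36 not dividing evenly by 8; the rest would be communication and other overheads.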
Conclusion and future works
Future works
Study of bootstrap distribution over non-uniform memory architectures
◮ some levels of shared memory could be highly beneficial to cope with high communication (short trajectories for large models)
Blending methods
◮ combination of parallel Bootstrap approach with more sophisticated simulation approaches (e.g., Perfect Simulation)