
Towards System-Level Electromagnetic Field Simulation on Computing Clouds

Dipanjan Gope1, Vikram Jandhyala1,2, Xiren Wang1, Don Macmillen1, Raul Camposano1, Swagato Chakraborty1, James Pingenot1, and Devan Williams1
1: Nimbic, Mountain View, CA, [email protected]
2: Department of Electrical Engineering, University of Washington, Seattle WA, [email protected]

Abstract— Cloud computing is a potential paradigm-shifter for system-level electronic design automation tools for chip-package-board design. However, exploiting the true power of on-demand scalable computing is as yet an unmet challenge. We examine electromagnetic (EM) field simulation on cloud platforms.

Keywords— Electromagnetic (EM) and EM interference modeling, simulation algorithms, tools and flows

Commodity computing, in the form of scalable public cloud platforms, is creating a revolution in the software industry by offering the SaaS (software as a service) model. The benefits of hosted software are well understood and reported in existing literature [1]. An even more exciting development is the availability of scalable computing platforms, such as the Amazon Web Services (AWS) elastic computing platform EC2 [2], which enable developers to create value by designing inherently parallel, scalable software solutions.

Scalable parasitic extraction and electromagnetic field simulation is a cornerstone of a large class of electronic design automation (EDA) applications, including signal integrity (SI), power integrity (PI), simultaneous switching noise (SSN), and electromagnetic interference (EMI). With ever-increasing system-level design challenges, a growing number of systems being designed, and the preponderance of multiple packaging and system integration strategies such as package-on-package (PoP), system-in-package (SiP), system-on-chip (SoC), stacked die, 3DIC, etc., these EDA applications are expected to become progressively more important over time, perhaps at the cost of reduced EDA emphasis on commodity chips from a few well-entrenched providers.

Parasitic extraction and field solution have traditionally been areas that have adopted two very disparate approaches. In the first approach, slow but accurate methods have been used for electromagnetic simulation. These approaches lead to very slow design cycles, and sometimes only a few runs are possible, resulting in incomplete verification. The scale of problems that can be attempted is also limited, leading to ad hoc manual design decomposition and hence accuracy issues even with accurate solvers. The second approach has been to sacrifice accuracy in order to gain scale and speed, for instance through geometry-based lookup tables, or through quasi-static or planar/transverse approximations. In today's world of complex, low-cost, high-density 3D integrated systems, neither approach is acceptable. Therefore, approaches based on fast solvers have taken root, such as those based on multilevel boundary element integral equation solvers [5-8]. Even with these solvers, there is a critical gap between the needs of the electronic design community and the efficacy of the parasitic extraction and electromagnetic (EM) simulation tools.


I. CLOUD COMPUTING

A typical cloud computing environment such as Amazon EC2 [2] offers several choices in the type of machine instances available, and each type of machine instance is typically suited to a particular parallelization paradigm. Virtualized setups mimicking desktop computing configurations are available with a range of core counts and memory and are suitable for multicore or shared-memory parallelization using OpenMP [3]. Cluster compute instances, which are specially hosted on machines in close proximity and interconnected with high-bandwidth links to reduce latency, are suited to parallelization using the Message Passing Interface (MPI) protocol [4]. Recently, Graphics Processing Units (GPUs) with high core counts have also been incorporated into the cloud ecosystem, lending themselves to massively parallel computing. A queuing and scheduling tool like Sun Grid Engine (SGE) can then act as an umbrella over multiple machine instances, harnessing their composite power to generate unprecedented speedup and memory capacity. A typical cloud framework in terms of compute power and available memory resources is shown in Figure 1.
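As an illustrative sketch only (not taken from the paper's implementation), the two paradigms can be combined by running one MPI rank per cluster-compute instance while OpenMP threads occupy the cores within each instance:

// Hypothetical sketch: one MPI rank per cluster-compute instance,
// OpenMP threads across the cores of each instance.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // Request thread support because OpenMP threads live inside each rank.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        // Each instance reports its share of the hybrid configuration.
        #pragma omp master
        std::printf("rank %d of %d running %d OpenMP threads\n",
                    rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

The rank count would be matched to the number of cluster-compute instances and the thread count to the cores of each instance.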

Figure 1: Cloud computing infrastructure elements as a function of compute power and memory.


II. PARALLEL EM SOLVERS ON THE CLOUD


Boundary Element Method (BEM) based fast electromagnetic field solvers typically employ fast matrix-vector products [5-8] in a Krylov subspace iterative solution framework. Compared with a traditional dense direct matrix inverse formulation, the time and memory performance is greatly improved, approaching almost linear complexity. While there are several scalable fast methods for BEM solvers, we exemplify our approach here using a simplified low-rank methodology that is well known [7,8]. Analysis of large-scale structures like systems-in-package (SiP) presents significant challenges in terms of the memory capacity needed to fit the full system and the quick turnaround time required for early design optimization. Therefore, there is a strong necessity to harness the power of cloud computing in large-scale electromagnetic simulations. The selection of the type of parallelism employed at each different phase in the hybrid framework depends on a scalability study of the underlying algorithms and is guided by Amdahl's law, which states:


S(N) = 1 / ((1 - p) + p/N)    (1)

where p is the fraction of the algorithm that can be parallelized and N is the number of parallel processing elements.

A. Matrix Setup
Parallelization of solver-level matrix operations can often be complicated by the inherently serial content of tree traversal in the underlying fast algorithm, and can therefore present scalability challenges. A pre-determined matrix structure can be employed to alleviate the problem [8]. From an implementation perspective, the key aspects are effective load balancing and scheduling [9]. Figure 2 demonstrates a typical matrix setup process for multicore (a) and cluster compute (b) configurations.

Figure 2: Matrix setup on (a) shared memory many-core architecture and (b) cluster compute farmed memory architecture.

The communication between the MPI nodes is much slower than between the OpenMP cores owing to off-core interconnects. This leads to subtle differences in the parallelization techniques. For example, dynamic scheduling can be employed in OpenMP to improve load balancing, whereas cluster compute is limited to static scheduling across nodes. Figure 3 demonstrates the scaling obtained in the two cases. The multithreaded scaling is superior to the MPI scaling due to the minimal communication overheads. The advantage of MPI, however, is in the farming of distributed memory on the individual cluster nodes to provide huge on-demand capacity limited only by the number of cluster node instances available.

Figure 3: Comparison of scaling for (a) multicore and (b) multiple cluster node instances.

The multithreaded implementations for GPU architectures are constrained by the GPU memory, typically limited to 4 GB. As memory gets consumed, the matrices generated need to be transferred to the associated CPU memory, leading to a significant communication bottleneck. Also, GPU computation is significantly slower for double precision arithmetic as opposed to single precision, thereby limiting the advantages gained from the high core count. For matrix sizes that fit in the available memory, a 240-core GPU provides ~40x speedup over a single CPU core.
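As a minimal sketch of the dynamic-scheduling idea mentioned above (the block structure and names are assumed for illustration, not the solver's actual data structures), OpenMP can assign independent matrix blocks to threads dynamically, while the analogous MPI variant would statically assign whole block ranges to ranks:

// Hypothetical sketch: fill independent low-rank matrix blocks in parallel.
// Block sizes vary, so dynamic scheduling improves load balance on shared memory.
#include <omp.h>
#include <vector>
#include <cstddef>

struct Block { int rows, cols; std::vector<double> data; };

void assemble_blocks(std::vector<Block>& blocks) {
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < static_cast<int>(blocks.size()); ++b) {
        Block& blk = blocks[b];
        blk.data.assign(static_cast<std::size_t>(blk.rows) * blk.cols, 0.0);
        // Placeholder for the actual integral-equation interaction computation.
        for (std::size_t k = 0; k < blk.data.size(); ++k)
            blk.data[k] = 1.0 / (1.0 + static_cast<double>(k));
    }
}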

B. Matrix Solve
The computation time of a typical fast iterative matrix solve is dominated by the cost of matrix-vector products. The most suitable cloud parallelization options for combinations of different flavors of matrix-vector (matvec) operations and types of GMRES [8] framework are detailed in Table 2.

Table 2: Suitable parallelism techniques for different choices of matvec and GMRES methods

               Parallel Matvec    Parallel RHS
Scalar GMRES   OpenMP, MPI        SGE
Block GMRES    OpenMP, MPI        SGE

The parallel matvec option, which parallelizes a single matrix-vector product, is limited to shared-memory multithreading or MPI-based cluster compute. Similar to matrix setup, the scaling is better for OpenMP-based shared-memory parallelization; however, the serial content is larger due to repeated memory fetch operations. The parallel RHS option is applicable to all of OpenMP, MPI and SGE, but is best suited for the latter due to the minimal data transfer required and the cost and availability of multiple machine instances rather than specialized high core count or cluster compute facilities.
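A minimal sketch of the two options, under assumed data structures (the names below are illustrative, not the solver's API): the parallel matvec threads over a single product, while the parallel RHS path assigns whole right-hand-side vectors to independent workers or machine instances:

// Hypothetical sketch contrasting the two parallelization options.
#include <omp.h>
#include <vector>
#include <cstddef>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense stand-in for the fast-solver operator

// Option 1: parallelize a single matrix-vector product (OpenMP shared memory).
Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(A.size()); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Option 2: parallel RHS -- each right-hand side is an independent solve,
// so the vectors can be farmed out to separate machine instances (threads here).
std::vector<Vec> solve_all_rhs(const Mat& A, const std::vector<Vec>& rhs) {
    std::vector<Vec> solutions(rhs.size());
    #pragma omp parallel for schedule(dynamic)
    for (int r = 0; r < static_cast<int>(rhs.size()); ++r)
        solutions[r] = matvec(A, rhs[r]);  // placeholder for an iterative solve
    return solutions;
}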

Figure 4: Performance comparison for parallel matvec using OpenMP and parallel RHS using SGE.

C. Multiple frequency and design parametrics
Discrete multiple-frequency or parametric sweeps belong to the so-called "embarrassingly parallel" class of problems, which are easy to parallelize with almost 100% scalability and are therefore very effective where applicable. However, this kind of parallelization duplicates memory and should not be employed within shared-memory frameworks, as shown in Figure 5.

Figure 5: Memory considerations for multiple frequency or parametric sweep parallelism in (a) shared-memory and (b) multiple instance cases.
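One assumed arrangement for such a sweep (illustrative only, not the product's scripts) is an SGE array job in which each task reads its task index from the environment and simulates a single frequency point, so no memory is shared or duplicated within an instance:

// Hypothetical sketch: each SGE array-job task simulates one frequency point.
#include <cstdlib>
#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    // SGE exports SGE_TASK_ID (1-based) for array jobs; default to 1 if absent.
    const char* id = std::getenv("SGE_TASK_ID");
    const int task = id ? std::atoi(id) : 1;

    // Illustrative sweep: 50 log-spaced points between 1 MHz and 20 GHz.
    const int n_points = 50;
    std::vector<double> freqs(n_points);
    for (int i = 0; i < n_points; ++i)
        freqs[i] = 1e6 * std::pow(2e4, static_cast<double>(i) / (n_points - 1));

    const int idx = (task >= 1 && task <= n_points) ? task - 1 : 0;
    std::printf("task %d simulating %.3e Hz\n", task, freqs[idx]);
    // ... launch the EM solve for this single frequency here ...
    return 0;
}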

III. HYBRID PARALLEL FRAMEWORK FOR THE CLOUD

A hybrid framework is built around the individual components as shown in Figure 6. At the top, the SGE1 layer parallelizes over discrete frequency or parametric sweeps. At the next level, the SGE2 layer employs parallel RHS depending on the number of RHS vectors to solve. The third layer is a cluster compute MPI layer that provides large memory capacity if required. The bottom-most layer consists of multithreading with OpenMP to utilize the many cores of an individual machine instance.

Figure 6: Hybrid parallel framework.
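The nesting of these layers can be pictured with an illustrative driver; all names and the layer bookkeeping below are assumptions for exposition, not the product's code. Each process discovers its place in the hierarchy, and the MPI and OpenMP layers are then initialized as in the earlier sketch:

// Hypothetical sketch of how a process locates itself in the four-layer hierarchy.
#include <cstdlib>
#include <cstdio>

int env_int(const char* name, int fallback) {
    const char* v = std::getenv(name);
    return v ? std::atoi(v) : fallback;
}

int main() {
    // Layer 1 (SGE1): one array task per frequency or parametric point.
    const int freq_task = env_int("SGE_TASK_ID", 1);
    // Layer 2 (SGE2): a second-level index selecting a block of RHS vectors
    // (illustrative variable name; the real bookkeeping is implementation-specific).
    const int rhs_block = env_int("RHS_BLOCK_ID", 0);
    // Layers 3 and 4: MPI ranks across cluster instances and OpenMP threads
    // inside each instance would be initialized here.
    std::printf("frequency task %d, RHS block %d\n", freq_task, rhs_block);
    return 0;
}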

IV. NUMERICAL RESULTS

In this section, numerical results are presented detailing the performance of the hybrid parallel solver framework. The graphical user interface is resident locally on the client machine and is used to set up the problem before the data is uploaded to the computing cluster on Amazon EC2. The test structure under consideration is a full-package layout called plasma_32R8955_040728fl, which was presented as a challenge problem by IBM for a special session at the IEEE Conference on Electrical Performance of Electronic Packaging (EPEP) 2006.

Figure 7: The top view of the IBM plasma_32R8955_040728fl package. Side view is presented as inset.

The package is meshed using 200,000 non-uniformly refined triangular elements, which lead to 350,000 edges or triangle interfaces. An electromagnetic extraction is required to model the electrical properties of the package, either in the form of an RLGC netlist or S-Y-Z parameters, which can then be used to study the signal integrity (SI) and power integrity (PI) characteristics of the package.

Capacitance Extraction: A 322x322 capacitance matrix is obtained by solving the 200,000x200,000 MoM matrix with 322 orthogonal right-hand-side (RHS) vectors. The entire problem was solved using 80 m2.4xlarge Amazon EC2 instances, each with 8 virtual cores. The solution employs shared-memory multithreading inside each machine and a parallel-RHS solution across machines. The time taken and the corresponding speed-up and compute-cost analysis are presented in Table 3.

Table 3: Performance metrics for capacitance extraction of the plasma package
# cores        Matrix Setup   Matrix Solve   Speed-up   Compute Cost
640            1.8 min        2.2 min        217x       $10.66
8              1.8 min        143 min        6x         $4.76
1 (estimate)   10.8 min       858 min        1x         $28.56

The 1-core timings, which pertain to running an equivalent but non-parallelized solver, are projected based on the typical 6x speed-up observed on 8 cores. It should also be noted that commodity cloud computing providers like Amazon currently charge in 1-hour increments, but the costs calculated here pertain to fractional unit costs, assuming a full pipeline of tasks in the time sequence.
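Under that fractional-cost assumption, the bookkeeping reduces to multiplying the number of instances by the instance-hours consumed; the hourly rate in the sketch below is a placeholder parameter, not the actual EC2 price:

// Hypothetical sketch of the fractional-cost bookkeeping described above.
#include <cstdio>

// Cost of running `instances` machines for `minutes`, billed fractionally.
// `rate_per_instance_hour` is an assumed parameter, not a quoted price.
double compute_cost(int instances, double minutes, double rate_per_instance_hour) {
    return instances * (minutes / 60.0) * rate_per_instance_hour;
}

int main() {
    // Example: 80 instances for a 4-minute run at an assumed $2 per instance-hour.
    std::printf("estimated cost: $%.2f\n", compute_cost(80, 4.0, 2.0));
    return 0;
}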

Inductance Extraction: The magnetostatic solution of the 350,000x350,000 MoM matrix with 1000 RHS vectors is distributed over 125 Amazon machine instances. Each machine instance performs the matrix setup as a serial component and solves the 8 RHS vectors assigned to it. The performance metrics are enumerated in Table 4.

Table 4: Performance metrics for inductance extraction of the plasma package
# cores        Matrix Setup   Matrix Solve   Speed-up   Compute Cost
1000           28 min         25 min         316x       $220.83
8              28 min         2713 min       6x         $91.36
1 (estimate)   168 min        16278 min      1x         $542.60

Full-wave Extraction: In the first full-wave simulation, the entire package is modeled with all nets and 1000 ports at a single frequency of 1 GHz. The ports are created between VDD or signal pins and the closest GND pin. The 320 signal nets account for 640 ports, one for each net on the die side and one on the BGA side. The remaining 360 ports are created between a die- or BGA-side VDD pin and the nearest GND pin.

Table 5: Performance metrics for full-wave 1000-port S-parameter extraction of the plasma package
# cores        Matrix Setup   Matrix Solve   Speed-up   Compute Cost
1000           50 min         45 min         315x       $395
8              50 min         4950 min       6x         $167
1 (estimate)   300 min        29700 min      1x         $1000

In the next simulation, a broadband discrete sweep of 50 frequency points between 1 MHz and 20 GHz is performed on the selected 20 nets and an un-cropped GND, as shown in Fig. 8.

Figure 8: The layout consisting of 20 nets and an un-cropped GND.

Each signal net forms one port on the die side with the nearest GND pin and one port similarly on the BGA side, leading to a total of 40 ports. The structure is meshed using triangular mesh elements, resulting in a matrix size of 150,000. Each frequency point is simulated using 5 instances, each solving 8 RHS vectors, and the 50 frequency points are also parallelized in a hybrid framework employing a total of 125 m2.4xlarge machine instances.

Table 6: Performance metrics for full-wave S-parameter extraction of the 40-port plasma package over 50 frequency points
# cores        Total Time     Speed-up   Compute Cost
1000           70 min         510x       $291
8              5954 min       6x         $198
1 (estimate)   35724 min      1x         $1190

The 40x40 S-parameter matrix is obtained for the 50 frequency points, and the frequency behavior is shown in Fig. 9.

Figure 9: S-parameters for the 40-port plasma package over 50 frequency points from 1 MHz to 20 GHz.

REFERENCES
[1] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing", TR No. UCB/EECS-2009-28, Univ. of California at Berkeley, Feb. 10, 2009.
[2] http://aws.amazon.com/ec2/
[3] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference, 2nd Edition, The MIT Press, 1998.
[4] http://www.openmp.org
[5] R. Coifman, V. Rokhlin, and S. Wandzura, "The fast multipole method for the wave equation: a pedestrian prescription", IEEE Antennas and Propagation Magazine, vol. 35, pp. 7-12, June 1993.
[6] J. R. Phillips and J. White, "A precorrected-FFT method for electrostatic analysis of complicated 3-D structures", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 10, pp. 1059-1072, Oct. 1997.
[7] S. Kapur and D. Long, "IES3: A fast integral equation solver for efficient 3-dimensional extraction", IEEE/ACM International Conference on Computer-Aided Design, pp. 448-455, Nov. 1997.
[8] D. Gope and V. Jandhyala, "Efficient solution of EFIE via low-rank compression of multilevel predetermined interactions", IEEE Trans. on Antennas and Propag., vol. 53, no. 10, pp. 3324-3333, Oct. 2005.
[9] X. Wang and V. Jandhyala, "Parallel algorithms for fast integral equation based solvers", Conference on Electrical Performance of Electronic Packaging, pp. 249-252, 2007.
