Semi-Static Performance Prediction for MPSoC Platforms*
Ana Lucia Varbanescu, Henk Sips, and Arjan van Gemund
Department of Computer Science, Delft University of Technology, The Netherlands
[email protected]
Abstract. While most of the past research in the field of Multiprocessor Systems-on-Chip (MPSoC) has been dedicated to increasing the available processing power on a chip, less effort has gone into analyzing their system-level performance or predicting their behavior. This paper introduces PAM-SoC, a system-level performance predictor for MPSoCs, built by adapting Pamela, a performance prediction tool for parallel applications, to the requirements of MPSoCs. We validate the proposed methodology with a set of five benchmark applications, whose performance is measured on the target MPSoC and compared to the behavior predicted by PAM-SoC for the same platform. The experiments show that PAM-SoC is able to correctly predict the behavior of various application types running on an MPSoC.
1 Introduction
MPSoC platforms are a hot topic in the industry's high-performance computing trends, opening new uses for parallel programming techniques and technologies in real-time applications. The fast-paced development of multiprocessor systems-on-chip is mainly focused on getting more on-chip processing power by extensively increasing hardware complexity. As a result, real-time constraints, together with specific hardware limitations, reveal unexpected performance holes that may have been neglected in the design phases. For MPSoC platforms, performance evaluation relies heavily on trace-based or full-fledged simulation. Although quite accurate, such simulators are difficult to build, slow, and platform-specific (different MPSoCs require different simulators due to hardware differences). This paper presents PAM-SoC, a generic MPSoC performance predictor that generates system-level performance estimations within reasonable limits. PAM-SoC was developed based on Pamela [1], a static performance prediction methodology already proved valuable for general-purpose parallel platforms (GPPPs). The aim of Pamela is to compute a lower bound on the execution time of a given application, implemented using the Series-Parallel (SP) [2] programming paradigm, by coupling its model with the targeted machine model.

* The research is part of the Scalp project (http://scalp.ewi.tudelft.nl), funded by STW-Progress.
Because the Pamela approach is static (i.e., analytical) and due to its algebraic reduction engine, predictions can be obtained very quickly even for large applications, avoiding time-consuming simulations; previous results have shown that, even for modest problem sizes, a speedup of many orders of magnitude can be obtained compared to simulation. Furthermore, although the machine model is still built by hand, automated application modeling allows several variants of the same parallel application (e.g., using different data partitioning or code parallelization) to be evaluated quickly and accurately. To obtain a performance predictor suitable for MPSoCs, Pamela had to be adapted to the specifics of these platforms, requiring (1) several additions to the application modeling tool-chain, and (2) a machine modeling technique able to capture the architectural details that influence the overall performance. As a result of this process, presented within this paper, we have obtained a first version of a performance predictor for MPSoCs, named PAM-SoC. Validating PAM-SoC is, in itself, a difficult task, requiring a set of specific benchmarking applications: they must be representative of the MPSoC application classes and, at the same time, have to stress the architecture limits. We have used a set of five basic benchmark applications that are analyzed with PAM-SoC, which estimates the lower bound of their execution times. The same applications have been simulated by an accurate MPSoC simulator, and the results (mainly the measured execution times) are compared to those of the predictor. The comparative analysis of these two result sets proves the value of PAM-SoC, as well as its current accuracy. The paper is organized as follows. Section 2 presents the specifics and challenges of MPSoC platforms, also introducing our target MPSoC platform. Section 3 presents the benchmark applications and their simulation results.
Section 4 presents the Pamela methodology for GPPPs and introduces the new PAM-SoC predictor. Section 5 details the predictor validation process, presenting the experimental results. Section 6 presents related work and results in the field of performance prediction for embedded systems and MPSoCs. Finally, Section 7 draws the conclusions and presents future work directions.
2 Multiprocessor Systems-on-Chips - MPSoCs
Systems-on-Chips (SoC) are integrated circuits that embed (most of) the heterogeneous components of a complete electronic system on a single chip [3]. They provide a high processing-power-to-area ratio, which should make them the best option for small mobile devices, as well as for covering the entire processing requirements of other host systems, like household appliances or cars. MPSoCs are a somewhat natural extension of SoCs, integrating several programmable processors and thus providing, ideally, faster and more flexible programmability. So far, MPSoC architectures have been considered an interesting hardware research topic, allowing hardware designers to prove their skills in squeezing as much performance as possible out of a single chip. As a result, many different platforms have been designed, with different complexities and usage scenarios. Semiconductor companies answered the challenge as well, with various MPSoC
solutions [4]: Intel with the IXP2850 Network Processor, Texas Instruments with the OMAP platforms, Philips with Nexperia, ST with Nomadik, and so on. Nevertheless, MPSoCs still face difficult challenges:

Hardware imbalance. Even though the available processing power is almost unlimited from the user's point of view, severe limitations are imposed on memory sizes, mainly due to area constraints. In addition, the significant difference between on-chip and off-chip memory latencies is a second source of unpredictable behavior: the fast on-chip memory leads to very good performance for small-sized applications, while the slow off-chip memory can cause severe performance drops when data grows to exceed the on-chip capacity.

Lack of programming models and suitable applications. There are no standardized programming models for MPSoCs, although they are required to increase programming efficiency. Furthermore, only a few applications can immediately make use of the computational power available on MPSoCs, and very few of them belong to the consumer market MPSoCs are aiming at.

Difficult performance analysis. Currently, most of the analysis is done by simulation, a time-consuming solution that also requires well-defined benchmarks and suitable programming models to provide useful conclusions. Embedded-systems performance evaluation techniques [5] are yet to give results on MPSoCs, while analytical solutions are hindered by the sheer complexity of the platforms.

Given these challenges, static performance prediction for MPSoCs can offer a viable and fast alternative to simulation, predicting system performance with reasonable accuracy.

2.1 Wasabi - the target MPSoC platform

For our experiments, both with the simulations and the predictions, we focus on Wasabi [6], a tile of the CAKE architecture from Philips [7].
A Wasabi chip, presented in Figure 1, is a shared-memory MPSoC with up to ten processors, both digital signal processors and general-purpose processors, several specialized hardware units, and various interfaces. The tile memory hierarchy has three levels: (1) private separate data and instruction caches for each processor (L1's), (2) one on-chip shared common cache (L2), available to all processors, and (3) one off-chip memory module for each tile. L2 is split into banks, to allow concurrent L2 access to memory addresses that are not in the same bank. A hardware mechanism keeps all L1's and L2 coherent and consistent. For software support, Wasabi runs eCos, a modular open-source Real-Time Operating System (RTOS), which has embedded support for multithreading. Wasabi is currently programmable in C/C++, using eCos synchronization system calls and the default eCos thread scheduler. eCos has been successfully ported to the Wasabi simulator, so we were able to run eCos-C programs on the simulator and perform time measurements. The simulation experiments have been run on Wasabi's cycle-accurate simulator, provided by Philips, which can be tuned to support various hardware
Fig. 1. The architecture of the Wasabi MPSoC
configurations (i.e., different numbers and/or types of processors, various memory sizes, and so on). For our experiments, we have chosen a fixed memory configuration, which is the closest to a future hardware platform: the data L1's are set at 256KB, L2 has 2MB, organized in 8 banks, and the off-chip memory has 256MB. We have opted for homogeneous processors, using various configurations with up to 9 identical TriMedia processors (the maximum number for one Wasabi chip).
3 Simulation results

3.1 Benchmark applications
For the first experiments, we have chosen a set of five simple benchmark applications. By "simple" we mean that any of these applications could easily be a component of a more complex, real-life MPSoC application. We have not yet proceeded to testing more complex applications for two reasons: (1) some performance bottlenecks are already exposed by these very simple applications, and (2) application (de)composability is a step further in our research, and can only be studied once the simple application cases are carefully analyzed. The set of benchmark applications comprises the following: (1) element-wise matrix addition - trivial, but stresses the memory architecture due to its very small computation-to-communication ratio, (2) matrix multiplication - a classical benchmark, computation intensive, yet simple to implement and parallelize, (3) RGB-to-YIQ conversion - a color-space transformation filter, with the same complexity as matrix addition, but a higher computation-to-communication ratio [8], (4) high-pass Grey scale filter - an image convolution filter [8], and (5) filter-chain - a chain of three filters (YIQ-to-RGB, RGB-to-Grey scale, high-pass Grey scale) successively applied to the same input image. All the benchmark applications have been implemented according to the following rules:
(TriMedia is a family of Philips DSP cores optimized for audio/video processing.)
• Applications are implemented for a shared-memory model, compliant with the Wasabi memory model;
• Applications use the SP programming model and exploit data parallelism only (no task parallelism - a natural choice for these one-task applications);
• To avoid any OS scheduler interference, all applications use a number of threads equal to the number of processors in the target architecture.
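To make these rules concrete, here is a minimal Python sketch of the data-parallel, shared-memory structure the benchmarks follow, using element-wise matrix addition as the example. This is an illustration only: the actual benchmarks are written in C/C++ with eCos threads, and the function name here is our own.

```python
import threading

def parallel_matrix_add(a, b, n_threads):
    """Element-wise matrix addition, data-parallel over row blocks.

    Mirrors the implementation rules above: shared memory, data
    parallelism only, one thread per (virtual) processor.
    """
    rows, cols = len(a), len(a[0])
    c = [[0] * cols for _ in range(rows)]

    def worker(tid):
        # Each thread owns a contiguous block of rows (block partitioning).
        for i in range(tid * rows // n_threads, (tid + 1) * rows // n_threads):
            for j in range(cols):
                c[i][j] = a[i][j] + b[i][j]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c
```

Since the row blocks are disjoint, the threads share the input matrices but never write to the same output element, so no synchronization beyond the final join is needed.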
3.2 The results
The simulation results are presented in Figure 2. We briefly comment on the specific behavior pattern of each application.
Matrix addition. The application performance decreases as the number of processors increases. Visible in Figure 2(a), this effect is due to the very small computation-to-communication ratio.
Matrix multiplication. Figure 2(b) shows two types of behavior: (1) for matrices that fit in the L2 cache, the speedup is almost linear (the system behaves like a shared-memory multiprocessor), and (2) for bigger matrices, although the execution time increases, L2 becomes a real cache, and superlinear speedup may be achieved due to cache effects.
RGB-to-YIQ. The application behaves as if running on an SMP, exhibiting good speedup - Figure 2(c).
High-pass Grey scale filter. The application scales very well (see Figure 2(d)) because all data fits in the L2 cache, which again behaves like a shared memory.
Filter-chain. Figure 2(e) is, as expected, a combination of the graphs of two filters like RGB-to-YIQ (graph (c)) and one final Grey filter (graph (d)). The behavior is altered by the barrier synchronization between the filters. The peaks obtained for 7 processors seem to be an anomaly caused by the simulator.
As a common result of these measurements, several performance discontinuities have been identified. Thus, the goal of PAM-SoC is not to be cycle-accurate in its numeric predictions, but to detect such unintuitive application behavior.
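The contrast between the matrix addition and matrix multiplication curves follows directly from their computation-to-communication ratios. A back-of-the-envelope sketch, using idealized operation counts (assuming each matrix element is moved between processor and memory exactly once - a simplification of our own, not a measurement):

```python
def ratios(n):
    """Rough flop/memory-operation counts for n x n matrices."""
    add_flops, add_mem = n * n, 3 * n * n        # one add per element; read a, b, write c
    mul_flops, mul_mem = 2 * n ** 3, 3 * n * n   # n^3 multiply-adds; a, b, c each touched once
    return add_flops / add_mem, mul_flops / mul_mem

# Matrix addition stays at a constant 1/3 flop per memory operation regardless
# of n, while multiplication grows as 2n/3: addition is memory-bound and scales
# poorly, whereas multiplication has enough work per byte to speed up well.
```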
4 The PAM-SoC predictor

4.1 Pamela Methodology
Pamela (PerformAnce ModEling LAnguage) [9] is a performance simulation formalism that facilitates symbolic cost modeling, featuring a modeling language, a compiler, and a performance analysis technique that enables Pamela models to be compiled into symbolic performance models that trade prediction accuracy for the lowest possible solution complexity. For Pamela’s symbolic cost modeling, parallel programs are mapped into explicit, algebraic performance expressions that are parameterized in terms of program parameters, such as problem size,
Fig. 2. The simulation results (speedup vs. number of processors, for several problem sizes) for: (a) matrix addition, (b) matrix multiplication, (c) RGB-to-YIQ conversion, (d) High-pass Grey filter, and (e) filter chain
and machine parameters, like the number of processors, computation and communication parameters, etc. Instead of simulation, the models are automatically compiled into a symbolic cost model, which can be further compiled into a time-domain cost model and, finally, evaluated to a time estimate. Figure 3 presents the Pamela methodology.

Fig. 3. The Pamela symbolic cost estimation

Modeling language. Pamela is a process-oriented language designed to capture the concurrency and timing behavior of parallel systems. Data computations from the original source code are modeled only in terms of their workload. Based on a process algebra, Pamela uses the equation syntax and substitution semantics found in ordinary algebra. A Pamela model of a program is written as a set of process equations. Work (computation, communication) is described by the elementary use process. The construct use(r,t) exclusively acquires service from resource r for t units of (virtual) time. The resource r may have a multiplicity greater than 1. The mutual exclusion synchronization offered by the use construct is defined according to a work-conserving scheduling discipline - FCFS is currently supported - with nondeterministic conflict arbitration. Application and machine models are composed from use and delay processes using sequential, parallel, and conditional composition.

Modeling Technique. The machine model is expressed in terms of resources and application-independent processes. A scheduling policy and a multiplicity have to be specified for all resources, while the application-independent processes are modeled as an abstract instruction set of the target architecture. As a rule of thumb, separate abstract instructions should be created if they (1) use different resources, (2) use the same resources but with different delays (i.e., different use times), or (3) need different parameters (other than latency). Most of the Pamela models used so far have focused on NUMA-like multiprocessors, with models comprising three types of abstract instructions: local processing, local memory access, and remote memory access. Due to the significant latency difference between local and remote memory accesses, the small difference between local memory reads and local writes is irrelevant to the overall performance. The application model of the parallel program is then implemented using these abstract instructions.

Symbolic Compilation. A Pamela model is translated to a time-domain performance model by substituting every process equation by a numeric equation that models the execution time associated with the original process.
The result is a new Pamela model that comprises only numeric equations, as the original process and resource equations are no longer relevant. The analytic approach underlying the translation, together with the algebraic reduction engine that drastically optimizes the process, is further detailed in [10]. Finally, the Pamela compiler can evaluate this model for different values of the parameters, calculating the value of the execution time.
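As a toy illustration of the kind of bound such an evaluation produces (a deliberately simplified stand-in of our own, not Pamela's actual reduction algorithm), consider a set of use terms executed fully in parallel: the execution time is bounded from below both by the longest single task and by the serialization on each shared resource.

```python
def lower_bound(tasks, resources):
    """Contention-aware lower bound on the makespan of parallel 'use' terms.

    tasks: list of (resource, time) pairs, all executed in parallel.
    resources: dict mapping each resource to its multiplicity.
    """
    # Bound 1: no schedule can finish before the longest single task.
    critical_path = max(t for _, t in tasks)
    # Bound 2: per resource, total demand divided by multiplicity
    # (mutual exclusion serializes access to each resource unit).
    demand = {}
    for r, t in tasks:
        demand[r] = demand.get(r, 0) + t
    contention = max(demand[r] / resources[r] for r in demand)
    return max(critical_path, contention)

# Eight parallel tasks, each holding a single-ported memory for 10 time
# units, are bounded by 80, not 10: serialization at the resource dominates.
```

The value of the symbolic form is that such expressions stay parameterized in the number of processors, problem size, and latencies, and are only instantiated with numbers at evaluation time.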
4.2 Pamela for MPSoCs
The differences between GPPPs and MPSoCs prevent the Pamela methodology from being used immediately and directly for the latter. Details that can be safely ignored on GPPPs (as they do not have a major influence on the overall performance) become very important for MPSoC behavior. Thus, both the applications and the machine architecture require more detailed models.

MPSoC Machine Modeling. To exercise the machine modeling of MPSoCs, we used the Wasabi chip as target platform. This particular choice does not limit the generality of the method - Wasabi's architecture is sufficiently complex, containing most of the problem-posing components of an MPSoC. As a consequence, the resulting model is generic enough and can easily be tuned to model most existing MPSoCs. The Pamela machine model of Wasabi must include all its significant architectural details. Given the complexity of the architecture and its multiple types of resources, we used an iterative modeling process, starting with coarser characteristics and repeatedly adding finer architectural details:
1. Refine the machine model (add/refine hardware detail);
2. Run Pamela for several input sets, and get the prediction results;
3. Compare the prediction results to the simulation results and, in case of no match, go back to Step 1. (The predicted behavior of the application being completely different from the simulated one counts as no match; the predicted execution time is usually not equal to the measured one, as Pamela predicts its lower bound, but the model should identify the same performance trends as those exposed by the simulation.)

For the current version of Wasabi's machine model, we have performed several major iterations, finally obtaining the model presented in Figure 4.

Fig. 4. The Wasabi machine model

Before further detailing this process, one should note that, according to the implementation rules we imposed, the processors do not exhibit contention. Thus, the refinements have basically focused on the memory system of the MPSoC, adding more complexity to the memory access operations. In the following description, MEM(p,addr) represents the generic memory access instruction, whose equation has been continuously refined.

Model 1: Basic model. The basic model is very similar to a shared-memory multiprocessor model, with the tile cache (L2) acting as the shared memory. A fixed memory access latency, t_L2, is used. No notion of cache behavior (e.g., differentiated hit or miss behavior) is included:

MEM(p,addr) = use(L2, t_L2)

For the applications we have tested, the model is too general and, more importantly, not able to predict any performance variation when increasing the number of used processors (from 1 up to 8).

Model 2: L2 cache. The next step is to investigate how the L2 cache behavior can be modeled in Pamela. Because Pamela is a state-less formalism, the hit/miss behavior cannot be modeled in terms of contention. Our solution is to compute an average access time for L2, depending on the cache hit ratio, h_L2, and the different hit/miss latencies of L2, t_h_L2 and t_m_L2. The refined model is:

MEM(p,addr) = use(L2, t_h_L2*h_L2 + t_m_L2*(1-h_L2))

Using this model, Pamela correctly predicts the typical expected application behavior: as the L2 hit ratio increases, the predicted performance also increases (i.e., the execution time decreases). The important note to be made here is that h_L2 may vary not only with the application pattern, but also with the number of processors and the problem size. Thus, for applications that exhibit significant hit ratio variations (like matrix multiplication), the model was able to predict the correct behavior curve, detecting the performance trend expected from the simulation (i.e., increasing or decreasing speedup). However, for applications whose L2 hit ratio varies only slightly with the number of processors (like matrix addition), the model is not correct, the behavior being influenced more by other architectural details, yet to be included in the model.

Model 3: L1 caches. Next, we have added the L1 caches, in a similar probabilistic manner as L2. The L1 caches have been modeled on a per-processor basis, allowing individual hit ratios if necessary. The accuracy improvements we obtained were not significant, showing that the L1 caches do not hinder the overall system performance for the considered applications. The model including both the L1's and the L2 cache is:

MEM(p,addr) = use(L1(p), t_h_L1*h_L1 + t_m_L1*(1-h_L1));
              use(L2, (1-h_L1)*(t_h_L2*h_L2 + t_m_L2*(1-h_L2)))

Model 4: L2 banks. Wasabi's L2 cache has an interleaved banked structure, providing concurrent access to addresses that do not belong to the same bank. At the same time, if several requests are addressed to the same L2 bank, the access time will vary depending on the bank load (i.e., the length of the queue of requests waiting to be served), even if the request itself results in an L2 hit. Using the bank load as a new parameter, we have refined the hit term t_h_L2:

MEM(p,addr) = use(L1(p), t_h_L1*h_L1 + t_m_L1*(1-h_L1));
              use(L2(bank(addr)), (1-h_L1)*(h_L2*t_h_L2*load(bank(addr)) +
                  (1-h_L2)*t_m_L2))

The model is not complete because t_m_L2 depends on the L2 cache miss policies and the performance of the external memory. Thus, the next step is to add the off-chip memory to the model.

Model 5: Off-chip memory. The off-chip memory was added as a very high-latency connection, together with its transfer buffers, write_to_MEM and read_from_MEM (two buffers the hardware architecture includes to alleviate the effect of the high-latency connection to the off-chip memory). Unfortunately, as these buffers are not included in the simulator, we could not validate them against real results, so we had to disable them for these tests. Finally, the off-chip memory accesses have been separated into reads and writes. The new model has two abstract instructions, MEM_RD and MEM_WR, with an identical structure but different latencies:

MEM_RD(p,addr) = use(L1(p), t_h_L1*h_L1 + t_m_L1*(1-h_L1));
                 use(L2(bank(addr)), (1-h_L1)*(h_L2*t_h_L2*load(bank(addr)) +
                     (1-h_L2)*t_m_L2));
                 use(M, (1-h_L1)*(1-h_L2)*t_M_RD)

To conclude, the current machine model is able to predict the behavior curves of the benchmark applications, except for the anomalies shown in Subsection 3.2. To estimate the lower bound of the execution time for Wasabi, the model uses five sets of numerical parameters: (1) processor operation latencies, (2) memory latencies (measured in no-contention conditions), (3) L1 hit ratios, (4) L2 hit ratio, and (5) bank loads.
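Reading the MEM_RD equation as an average-cost expression, the expected latency of one read can be sketched in Python as follows. Parameter names shadow the Pamela model; the values passed in are placeholders, and the flat summation deliberately ignores the resource contention that the use terms additionally capture.

```python
def mem_rd_cost(h_l1, h_l2, t_h_l1, t_m_l1, t_h_l2, t_m_l2, t_m_rd, bank_load=1):
    """Average cost of one memory read under the refined model (Model 5).

    Three additive terms: an L1 term, an L2 term weighted by the L1 miss
    ratio (with the L2 hit latency scaled by the bank load), and an
    off-chip term weighted by both miss ratios.
    """
    l1 = t_h_l1 * h_l1 + t_m_l1 * (1 - h_l1)
    l2 = (1 - h_l1) * (h_l2 * t_h_l2 * bank_load + (1 - h_l2) * t_m_l2)
    off_chip = (1 - h_l1) * (1 - h_l2) * t_m_rd
    return l1 + l2 + off_chip
```

For instance, with a perfect L1 (h_l1 = 1) the cost collapses to the L1 hit latency, while with both caches missing it degenerates to the sum of the three miss latencies, which matches the structure of the Pamela equation above.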
The latencies are, theoretically, fixed values for a given architecture, and they can be obtained either from the hardware specification itself or by means of micro-benchmarking. We have based the measurements on theoretical latencies from the hardware specification, although the simulator is not entirely accurate with respect to these. The values for the other three parameter sets are both machine- and application-dependent, and they have to be computed/evaluated on a per-application basis. This, however, does not imply that the machine model has become application-dependent - the numerical values of the parameters are required only for the numerical evaluation of the symbolic expressions of the execution time. For obtaining the actual values of these parameters, we have considered three options: (1) use a full-fledged or a trace-based simulator - this option was overruled, as a prediction strategy cannot involve such an expensive simulation; (2) use analytical methods for evaluation - still an open research issue in memory-system performance analysis; or (3) use a hybrid approach - a light-weight simulator able to evaluate the parameters with reasonable accuracy. We have opted for the latter, and built MemBE, a Memory system BEhavior simulator that, based on an application model, generates the three sets of parameters mentioned above as outputs. The simulator is a very light-weight multi-threaded application, performing no data processing, but carefully monitoring the data path of the modeled application. Furthermore, the simulator has been built to closely emulate Wasabi's memory system, one of the most complex to be found in current MPSoCs. As a result, MemBE can be successfully used, with minimal tuning, for any other architecture that resembles the memory hierarchy of Wasabi or any simplified version of it.

Machine Model Abstract Instructions. When designing this interface from the application model to the machine model, we still had to decide which instructions to model, and how. In the case of MPSoCs, simple GPPP-like models do not hold, mainly because of (1) the more significant latency differences, which require more versions of the same instruction (see MEM(p) versus MEM_RD(p) and MEM_WR(p)), and (2) the more complex nature of the instruction implementation, which requires more arguments in the prototype (see MEM_RD(p) versus MEM_RD(p, addr, bank)).
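As an illustration of the kind of parameter extraction MemBE performs, the following is a minimal set-associative cache hit-ratio estimator with LRU replacement. This is a hypothetical stand-in of our own, not MemBE itself: the real tool replays the full data path of the application model against a close emulation of Wasabi's memory system.

```python
def hit_ratio(trace, n_sets, line_size=64, assoc=1):
    """Estimate the hit ratio of a set-associative LRU cache.

    trace: iterable of byte addresses accessed by the modeled application.
    """
    sets = [[] for _ in range(n_sets)]  # each set: tag list, MRU at the head
    hits = total = 0
    for addr in trace:
        total += 1
        line = addr // line_size
        s = sets[line % n_sets]         # interleaved set selection
        tag = line // n_sets
        if tag in s:
            hits += 1
            s.remove(tag)               # re-insert below as most recently used
        elif len(s) >= assoc:
            s.pop()                     # evict the least recently used tag
        s.insert(0, tag)
    return hits / total if total else 0.0
```

Feeding such an estimator the address streams of each thread yields per-cache hit ratios that can then be plugged into the symbolic cost expressions, which is the division of labor between MemBE and Pamela in the tool-chain.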
4.3 Application modeling
Translating an application implemented in a high-level programming language to its Pamela application model (as well as writing a Pamela model from scratch) involves two distinct phases: (1) modeling the application as a series-parallel graph of processes, and (2) modeling each of the processes in terms of Pamela machine instructions. For all the experiments presented in this paper, we have developed the application models by hand, postponing the construction of an automated translator for two reasons: (1) the applications themselves were quite simple and straightforward to model, and (2) the machine instruction "prototypes" were part of the iterative process of refining the machine model. The PAM-SoC tool-chain is presented in Figure 5. First, note that no modifications have been made to Pamela itself. Second, note that the shaded components are, at this stage of the predictor, still under development, meaning that we will further investigate whether there are better or simpler solutions to implement them.

4.4 Discussion

In the present stage of the predictor, the conversion from the application source code to the simplified application model that is fed into MemBE is done by hand. However, an automated translator can be implemented, provided that we define (1) the full range of operations that have to be preserved in the application model, and (2) the parameters that each of them should preserve.
Fig. 5. The PAM-SoC tool-chain. Shaded blocks are on the open issues list
A similar discussion can be held for the Pamela application model. It has been proved before [11] that automatic translation from full source code to a Pamela application model is possible, but this translation would require a more complex translator to deal with the finer granularity of the entire system. With respect to the PAM-SoC tool-chain, we will investigate whether Pamela and MemBE can use the same application model. From the current experiments, such a common model does not seem to limit either of the two, but we have to analyze more complex applications before any definitive conclusions can be drawn. An important simplification of the modeling process would be the removal of MemBE in favor of an analytical module. Although that would be a more efficient and elegant solution, it would also require further deep research into application memory behavior and cache system characteristics. Until such research is completed, it is not certain that such an approach is feasible.
5 PAM-SoC Experiments and Results
The prediction results of PAM-SoC for the same five benchmark applications analyzed in Section 3 are presented in Figure 6. We briefly discuss the meaning of each graph.
Matrix addition. PAM-SoC correctly predicts the speedup behavior for matrix addition - Figure 6(a). This indicates that the system is able to detect whether an application exhibits speedup or not.
Matrix multiplication. PAM-SoC detects the speedup behavior of matrix multiplication - Figure 6(b) - for the two data sets presented (due to a yet unidentified error, the third data set could not deliver any results, so it was not included).
RGB-to-YIQ and High-pass Grey scale filter. For these applications - Figure 6(c,d) - PAM-SoC was still accurate in the speedup trend estimation, as well as in the memory contention saturation. However, it did not correctly predict the exact number of processors at which the saturation occurs, being off by one. This shifting of the saturation point is an effect of the differences between the theoretical hardware latencies, used in PAM-SoC, and those of the
Fig. 6. The prediction results (predicted speedup vs. number of processors, for several problem sizes) for: (a) matrix addition, (b) matrix multiplication, (c) RGB-to-YIQ conversion, (d) High-pass Grey filter, and (e) filter chain
simulator.
Filter-chain. Figure 6(e) is again a correct prediction of the speedup behavior of the filter chain, demonstrating that PAM-SoC can also handle basic composed applications.
Due to time limitations, we have run only a few experiments with large data sets, and their definitive results are not yet available. However, we believe we have shown that PAM-SoC is able to predict application behavior trends for different types of applications running on an MPSoC platform.
6 Related Work
With respect to MPSoC performance analysis, the state of the art is simulation. MPSoC designers and producers deliver their own proprietary tool-chains, while generic frameworks, like [12], provide flexible and complex solutions for hardware/software co-simulation. Yet, such simulations are a very time-consuming solution, despite optimizations [13], and thus difficult to use in the early design stages due to the significant delay they add to the entire process. However, there are attempts to analyze individual components of MPSoCs, such as on-chip communication [14], [15] or memory systems [16]. The only available solution for formal system-level performance verification of MPSoCs is presented in [17], but the approach aims to verify the performance of the hardware system, not to estimate its application-specific behavior. Despite the existence of system-level performance evaluation techniques for embedded systems [5], [18], the problem of adapting these solutions to MPSoCs has not been addressed yet.
7 Conclusions and Future Work
In this paper we have presented a semi-static performance prediction strategy for MPSoCs, a low-cost solution based on Pamela, an existing static performance estimator with good results in evaluating GPPP platforms and applications. To our knowledge, this paper proposes the first such solution for predicting MPSoC hardware/software performance. We motivate the need for static performance prediction for MPSoCs by its reduced cost, which allows it to be part of the design loop. Even though static performance predictors trade accuracy for estimation cost, we argue that a behavior estimate obtained within seconds is more valuable, at least in the early design phases, than a twenty-hour simulation with very precise results.

The paper details the process of adapting Pamela to the specific requirements of MPSoCs, where capturing the right amount of detail was the most difficult task, requiring, for the current version of PAM-SoC, additional tools to avoid major changes to Pamela itself. To validate our approach, we ran five simple benchmark applications on the MPSoC platform simulator and, at the same time, used PAM-SoC to predict their behavior. The comparison of the results proved the experiments successful, validating the PAM-SoC toolchain.

In the future, the first step towards improving PAM-SoC accuracy is a systematic procedure for calibrating its latencies against those of the real system (in our case, the simulator), as it has been shown that even small latency deviations may visibly influence the results. Further on, we plan to explore three directions: (1) modeling and testing more complex applications, for further validation and improvement of PAM-SoC, (2) making the modeling process as automatic as possible, and (3) investigating whether PAM-SoC can become a fully static performance predictor.

Acknowledgements.
We would like to thank Paul Stravers from Philips Research for providing the Wasabi simulator, and for the help and support in understanding the details of the Wasabi architecture, which allowed us to properly model the system.
References

1. Gemund, A.v.: Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems (2003)
2. Gonzalez-Escribano, A.: Synchronization Architecture in Parallel Programming Models. PhD thesis, Dpto. Informatica, University of Valladolid (2003)
3. Jerraya, A., Wolf, W.: Multiprocessor Systems-on-Chips. Morgan Kaufmann Publishers (2004)
4. Wolf, W.: The future of multiprocessor systems-on-chips. In: DAC '04: Proceedings of the 41st Annual Conference on Design Automation, New York, NY, USA, ACM Press (2004) 681–685
5. Pimentel, A.D.: The Artemis workbench for system-level performance evaluation of embedded systems. Int. Journal of Embedded Systems 1 (2005)
6. Borodin, D.: Optimisation of multimedia applications for the Philips Wasabi multiprocessor system. Master's thesis, TU Delft (2005)
7. Stravers, P., Hoogerbrugge, J.: Homogeneous multiprocessing and the future of silicon design paradigms. In: International Symposium on VLSI Technology, Systems, and Applications (VLSI-TSA) (2001)
8. EEMBC (http://www.eembc.org/benchmark/)
9. Gemund, A.v.: Performance Modeling of Parallel Systems. PhD thesis, Delft University of Technology (1996)
10. Gautama, H., Gemund, A.v.: Static performance prediction of data-dependent programs. In: ACM Proc. of the Second International Workshop on Software and Performance (WOSP '00), Ottawa, ACM (2000) 216–226
11. Gemund, A.v.: Symbolic performance modeling of data parallel programs. In: International Workshop on Compilers for Parallel Computers (2003) 299–310
12. Mahadevan, S., Storgaard, M., Madsen, J., Virk, K.: ARTS: A system-level framework for modeling MPSoC components and analysis of their causality. In: MASCOTS '05: Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Washington, DC, USA, IEEE Computer Society (2005) 480–483
13. Benini, L., Bertozzi, D., Bogliolo, A., Menichelli, F., Olivieri, M.: MPARM: Exploring the multi-processor SoC design space with SystemC. Journal of VLSI Signal Processing 41 (2005) 169–182
14. Loghi, M., Angiolini, F., Bertozzi, D., Benini, L., Zafalon, R.: Analyzing on-chip communication in an MPSoC environment. In: DATE '04: Proceedings of the Conference on Design, Automation and Test in Europe, Washington, DC, USA, IEEE Computer Society (2004) 20752
15. Pande, P.P., Grecu, C., Jones, M., Ivanov, A., Saleh, R.: Performance evaluation and design trade-offs for Network-on-Chip interconnect architectures. IEEE Transactions on Computers 54 (2005) 1025–1040
16. Loghi, M., Poncino, M.: Exploring energy/performance tradeoffs in shared memory MPSoCs: Snoop-based cache coherence vs. software solutions. In: Proceedings of the 2005 Design, Automation and Test in Europe Conference and Exposition (DATE 2005), IEEE Computer Society (2005)
17. Richter, K., Jersak, M., Ernst, R.: A formal approach to MpSoC performance verification. IEEE Computer 36 (2003) 60–67
18. Thiele, L., Wandeler, E.: Performance analysis of embedded systems. In: The Embedded Systems Handbook. CRC Press (2004)