Performance Evaluation of Parallel Implementation of Nested Loop Control Structures

Silvia Brunet, Virgil Andronache, Nelson L. Passos, Ranette Halverson
Department of Computer Science, Midwestern State University, Wichita Falls, TX 76308
(brunet|andron|passos|rhalver)@cs.mwsu.edu
Abstract
The computer field has experienced tremendous progress over the past 20 years. Day after day, innovative technology has contributed to the creation of new systems whose complexity is hardly understood. The implementation of parallel capabilities has added to this complexity. These parallel capabilities are used, in some cases, to decrease the CPU time required in applications such as image processing and 2-D hydrodynamics simulation, where the largest amount of time is spent executing loop structures. Different techniques have been developed to fully exploit these parallel resources, and simulation is the common approach to understanding such systems. This paper describes a simulator developed to evaluate the performance of a specific parallel algorithm implemented to achieve high levels of optimization inside nested loops.
Keywords: Retiming, trace-driven simulation, execution-driven simulation, superscalar, parallel processing.
1. Introduction
At the edge of the new millennium, researchers and designers have invested a great part of their time in developing techniques to increase the efficiency of computer systems. Parallel computation is the most exploited topic toward this purpose. The importance of developing such techniques is due to the broad variety of applications in which critical time is spent in repetitive calculations. These calculations increase CPU usage tremendously, decreasing the efficiency of the application. Applications such as image processing, fluid mechanics and 2-D hydrodynamics depend heavily on calculations done through the execution of nested loops. To achieve high-speed performance in these applications, a reduction of the time spent in the execution of nested loops has to be pursued. Several techniques are used to improve the execution time within nested loops and between code blocks. A process similar to the one used in this paper is loop pipelining.
However, loop pipelining has produced methods that center on one-dimensional problems [9, 11]. Other software approaches have been used previously in the field of loop transformation, such as software pipelining [7, 10, 17] and the wavefront method. These methods have attained varying degrees of success, but they also present disadvantages. A hardware perspective using two functional units has also been studied. That approach uses a dataflow instruction-scheduling unit together with a pipelined instruction execution unit; however, the use of two processing modules is not always necessary. This paper explores a method of achieving high levels of parallelism inside nested loops by optimizing the execution in an individual processor. A software (compiler) approach is explored, in which the multidimensional (MD) retiming technique is used to restructure nested loops. This technique modifies the order in which operations are executed in the iterations of the original nested loop, producing a new structure [3, 13], and thus changes the way the iteration space of the loop is traversed. An iteration space, for a multiple nested loop, is the set of all integral points determined by using the loop indices as system coordinates. Due to the MD retiming transformation, two sets of instructions are moved outside of the loop, the prologue and the epilogue [2, 13]. The determination of the prologue and epilogue is done at compile time, along with an efficient implementation of the loop execution. The performance of the code after the necessary transformations is accurately measured with the simulation described in this paper, which evaluates the execution of the parallel code on an individual processor. The results obtained from the simulation open the doors to new kinds of software implementations that achieve high levels of optimization inside nested loops with the use of a uniprocessor.
In the following section, the concepts relevant to the approach taken in the development of the simulator are introduced. Section 3 presents the steps followed in the implementation process. Sections 4 and 5 discuss the experiments and the results obtained from the simulation. Finally, Section 6 concludes the paper with a summary of the advantages obtained through the simulation.
2. Background
Evaluation, study and understanding of a system depend on the development of a correct model. One of the major advantages obtained from modeling a system is the level of exploration that can be achieved through the manipulation of parameters, design alternatives or operating environment. In the computer field, a number of simulation techniques are widely used to enhance the understanding of recent architectural advances. Different types of simulation methodologies evaluate the performance of new architectural features as well as aid in the development of efficient algorithms. Two simulation methodologies are usually applied: Trace-Driven Simulation and Execution-Driven Simulation. Trace-Driven Simulation is used for the evaluation of uniprocessor models. A disadvantage of this methodology is that it requires the creation of an input trace, which can contain a very large number of references and use up to gigabytes of tape or disk space. When evaluating multiprocessor architectures, Trace-Driven models report inaccurate results due to assumptions built into the traces. Execution-Driven simulations are used to achieve higher accuracy than Trace-Driven simulations. One basic advantage of Execution-Driven simulation over Trace-Driven simulation is its ability to create and use the traces on the fly, without the use of large trace files on tapes or disks. It also permits an accurate evaluation of uniprocessor as well as multiprocessor performance. Execution-Driven simulation is mostly used for quantitative evaluations such as run-time estimates for floating-point and fixed-point execution units, error recovery percentages of a superscalar processor, and address and timing analysis of multiprocessors. There is a type of Execution-Driven simulation called Direct-Execution, which is often used for the evaluation of shared memory systems and message-passing parallel programs.
Current processors exploit high levels of instruction-level parallelism, using complex features such as multiple issue, out-of-order issue and speculation. In this paper, a Direct-Execution simulation is used, which directly executes a parallelizing algorithm for nested loop control structures to obtain the information necessary to evaluate its behavior and execution-time performance. This simulation accounts for out-of-order instruction execution by providing for actual execution of each instruction.
3. Implementation
The simulation tool described in this paper was implemented using Visual C++ and functional programming and design. In this implementation, our goal is to apply to the original nested loop of a given application an optimal sequence of loop transformations that enhances parallelism. The simulation tool is divided into three main stages: input, loop transformation and execution, as shown in the flow diagram in Figure 1.
In the first stage, the procedure read_graph is in charge of reading the code to be optimized from a text file. This code represents the nested loop being tested. The main values the simulator is interested in are the inner and outer indices of the loop, as well as the data points and the dependencies between them. After these values are recognized, the procedure converts them into a graph format. The conversion is done as follows: each instruction becomes a node in a directed graph representing the code, and the edges in the graph represent the data dependencies between the instructions. As soon as the graph has been constructed, the second stage is activated.

(Read Code → Build Graph → Return Graph → Simulate Original Code → Simulate Optimized Code)
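As a minimal sketch of the kind of structure read_graph might produce (the type names here are hypothetical, not the simulator's actual code), the sample loop of Section 4 yields two nodes and three dependence edges, each edge tagged with its (outer, inner) iteration distance:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical graph format: one node per instruction, one edge per
// data dependence, annotated with the (i, j) dependence distance.
struct Edge {
    int from, to;  // node indices
    int di, dj;    // dependence distance along the outer and inner index
};

struct Graph {
    std::vector<std::string> nodes;
    std::vector<Edge> edges;
};

// Build the graph for the sample loop
//   a[i,j] = b[i,j-1] + b[i-1,j]
//   b[i,j] = a[i,j]
Graph build_sample_graph() {
    Graph g;
    g.nodes = {"a", "b"};               // node 0 computes a, node 1 computes b
    // b feeds a with distances (0,1) and (1,0); a feeds b in the same iteration.
    g.edges = {{1, 0, 0, 1}, {1, 0, 1, 0}, {0, 1, 0, 0}};
    return g;
}
```

The edge annotations are exactly the delay vectors that the retiming stage later manipulates.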
Figure 1. Flow Diagram

The main procedure in charge of the second stage is called retime_graph. This procedure calculates a valid retiming vector for the graph, using a modified version of the MD retiming technique. One such modification is that the retiming vector is chosen to be of the form (1,-2n). This choice reduces the CPU time spent on the expensive multiplications needed at the end of each execution sequence, which are used to calculate the starting indices of each execution sequence. An execution sequence is a set of points in the iteration space where the transition from one point to the next is derived using the same formula. The multiplications mentioned above are replaced by more efficient bit shifts and logical operations. Finally, in the third stage, the execution space is broken down into five components, the prologue, the epilogue and three general execution sections, as seen in Figure 2. Using the graph constructed from the loop information, the loop without any transformations is also executed. This paper compares the execution time of the optimized section of the algorithm tested (the time spent in the execution stage) with that of the regular, unretimed execution. The time involved in stages one and two is not considered part of the execution time, as both tasks would be performed at compile time.
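To illustrate the multiplication-to-shift replacement (the function names below are illustrative, and the simplification assumes the loop bound n is a power of two, an assumption the paper does not state explicitly), a product of the form k * 2n that arises from the (1,-2n) retiming vector can be computed with a single shift:

```cpp
#include <cassert>

// Straightforward computation of a term k * 2n appearing in a
// starting-index formula.
int start_index_mul(int k, int n) { return k * 2 * n; }

// Equivalent shift form, assuming n = 1 << log2n (n is a power of two):
// k * 2n == k << (log2n + 1).
int start_index_shift(int k, int log2n) { return k << (log2n + 1); }
```

For bounds that are not powers of two, a combination of shifts and additions would be needed instead of a single shift.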
With the valid retiming vector computed in the second stage, the graph is transformed into a retimed graph, as seen in Figure 3b. The transformation of Figure 3a into 3b is done using the retiming function (1,-1). This function is applied to node a, thereby adding (1,-1) to its outgoing edges and subtracting (1,-1) from the incoming ones.
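The retiming update just described can be sketched as follows (a minimal illustration, not the simulator's actual code): an edge u → v with delay d becomes d + r(u) - r(v), so retiming node a by r adds r to a's outgoing edge delays and subtracts r from its incoming ones.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Delay = std::pair<int, int>;  // two-dimensional delay (i, j)

struct REdge { int from, to; Delay d; };

// Apply the retiming vector r to a single node: add r to the delays of
// its outgoing edges, subtract r from the delays of its incoming edges.
void retime(std::vector<REdge>& edges, int node, Delay r) {
    for (auto& e : edges) {
        if (e.from == node) { e.d.first += r.first; e.d.second += r.second; }
        if (e.to   == node) { e.d.first -= r.first; e.d.second -= r.second; }
    }
}
```

For the sample loop, retiming node a by (1,-1) turns the incoming delay (0,1) into (-1,2), matching the retimed graph of Figure 3b.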
4. Retiming
The technique of multidimensional retiming is applied to nested loops. Consider the case of the two-dimensional nested loop shown below:

for i = 1 to n
  for j = 1 to n
    a[i,j] = b[i,j-1] + b[i-1,j]
    b[i,j] = a[i,j]

After the execution of the first stage, the loop above is represented in a graph format, as seen in Figure 3a.

Figure 3. (a) MD Graph; (b) Retimed Graph

5. Results
The tests were performed using an Intel Pentium II processor running at 300 MHz, with 128 MB of RAM, in a Windows 95 environment.

Figure 2. Execution sections of the iteration space

In order to analyze the behavior of our simulator, some initial tests were conducted. The nested loop used in these initial tests was the sample code shown previously. Table 1 shows some of the data obtained by applying the retiming technique and the new loop structure to the example described in the previous section.

Table 1. Simulation results (execution times in seconds)
Array sizes tested: 20 x 20, 50 x 50, 165 x 165, 185 x 185, 190 x 190, 195 x 195
As can be observed, for small problems the optimizing method does not produce any improvement over the original code, and instead introduces a significant overhead. As the arrays grow toward the typical sizes used in common applications, the improvement becomes visible, as seen in the last line of the table. Another experiment was conducted using one of the Livermore loops, a standard benchmark for various aspects of computer performance. Livermore loop 23 uses nested loops of various sizes. Using the algorithm described in this paper, improvements of 6.5% were obtained.
Finally, a test using a medium-resolution image (512 x 512 pixels) was performed. This image size is used in a broad variety of image vision applications, such as image-to-image transformations, global feature extraction and 2-D filters. Since the amount of data processed for this kind of image is extremely large, a great amount of processing time is needed; the computation required to support real-time image applications can be greatly reduced through the use of parallel processing. The improvement obtained from the transformed nested loop over the original was approximately 9%.
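The 2-D filters mentioned above are dominated by exactly the kind of doubly nested loop the transformation targets. As an illustration only (the paper does not specify which filter was tested), a simple 3 x 3 averaging filter has this structure:

```cpp
#include <cassert>
#include <vector>

// Illustrative 3x3 box (averaging) filter: the two outer loops over the
// image rows and columns form the nested loop structure that MD retiming
// would restructure.
std::vector<std::vector<int>>
box_filter(const std::vector<std::vector<int>>& img) {
    int n = static_cast<int>(img.size());
    int m = static_cast<int>(img[0].size());
    std::vector<std::vector<int>> out(n, std::vector<int>(m, 0));
    for (int i = 1; i + 1 < n; ++i)
        for (int j = 1; j + 1 < m; ++j) {
            int s = 0;
            for (int di = -1; di <= 1; ++di)
                for (int dj = -1; dj <= 1; ++dj)
                    s += img[i + di][j + dj];
            out[i][j] = s / 9;  // average of the 3x3 neighborhood
        }
    return out;
}
```

On a 512 x 512 image the two outer loops alone execute over 260,000 iterations, which is why even a single-digit percentage improvement is significant.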
6. Summary This paper has presented a simulator for evaluating the improvement of the execution of applications such as real-time image processing, 2-D hydrodynamics and fluid mechanics. The problem is simulated using a Direct-Execution simulation after it is converted to a multidimensional graph representing the original nested loop. The simulator was able to detect improvements of 4 to 9 percent in the execution time of the target experiment. Using this approach, promising results have been achieved which can lead to new kinds of software implementations that achieve high levels of optimization inside nested loops with the use of a uniprocessor.
7. Acknowledgments
This work was supported by the National Science Foundation under Grant No. MIP 9704276.
References
[1] A. Aiken and A. Nicolau, "Fine-grain Parallelization and the Wavefront Method," Languages and Compilers for Parallel Computing, Cambridge, Massachusetts, MIT Press, 1990, pp. 1-16.
[2] V. Andronache, R. Simpson and N. L. Passos, "An Efficient Implementation of Nested Loop Control Instructions for the Fine Grain Parallelism," Proceedings of the Ninth Annual CCSC South Central Conference, Vol. 13, No. 4, March 1998, pp. 67-76.
[3] L.-F. Chao and E. H.-M. Sha, "Static Scheduling of Uniform Nested Loops," Advances in Languages and Compilers for Parallel Processing, Cambridge, Massachusetts, MIT Press, 1990, pp. 192-219.
[4] D. M. Dickens, P. Heidelberger and D. M. Nicol, "Parallelized Direct Execution Simulation of Message-Passing Parallel Programs," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 10, October 1996, pp. 1090-1105.
[5] S. Dwarkadas, J. R. Jump and J. B. Sinclair, "Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis," ACM Transactions on Modeling and Computer Simulation, Vol. 4, No. 4, October 1994, pp. 314-338.
[6] T. M. Frazier and Y. Tamir, "Execution-Driven Simulation of Error Recovery Techniques for Multicomputer," 30th Simulation Symposium, April 1997, pp. 4-13.
[7] G. R. Gao and Z. Paraskevas, "Compiling for Dataflow Software Pipelining," Languages and Compilers for Parallel Computing, Cambridge, Massachusetts, MIT Press, 1991, pp. 275-306.
[8] S. R. Goldschmidt and J. L. Hennessy, "The Accuracy of Trace-Driven Simulations of Multiprocessors," Stanford University Computer Systems Laboratory, Technical Report CSL-TR-92-546, September 1992.
[9] G. Goossens, J. Vandewalle and H. De Man, "Loop Optimization in Register Transfer Scheduling for DSP Systems," Proceedings ACM/IEEE Design Automation Conference, 1989, pp. 826-831.
[10] R. Govindarajan, E. R. Altman and G. R. Gao, "A Framework for Resource-Constrained Rate-Optimal Software Pipelining," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 11, November 1996, pp. 1133-1149.
[11] T.-F. Lee, A. C.-H. Wu, D. D. Gajski and Y.-L. Lin, "An Effective Methodology for Functional Pipelining," Proceedings Int'l Conference on Computer-Aided Design, December 1992, pp. 230-233.
[12] D. Park and R. H. Saavedra, "Trojan: High Performance Simulator for Parallel Shared Memory Architectures," 29th Annual Simulation Symposium, New Orleans, LA, April 1996, pp. 44-53.
[13] N. L. Passos and E. H.-M. Sha, "Achieving Full Parallelism Using Multidimensional Retiming," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 11, November 1996, pp. 1150-1163.
[14] H. A. Rizvi, J. B. Sinclair and J. R. Jump, "Execution-Driven Simulation of a Superscalar Processor," Proceedings of the 27th International Conference on System Sciences, Vol. 1, January 1994, pp. 185-194.
[15] J. B. Rothman, "Multiprocessor Memory Reference Generation Using Commune," http://djinn.CS.Berkeley.edu/rothman/commune/
[16] A. Sampogna, D. R. Kaeli, D. Green, M. Silva and C. J. Sniezek, "Performance Modeling Using Object Oriented Execution-Driven Simulation," Proceedings of the 29th Annual Simulation Symposium, New Orleans, LA, April 1996, pp. 183-192.
[17] M. E. Wolf and M. S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, October 1991, pp. 452-471.