Vertex-To-Vertex Parallel Radiosity on Clusters of PCs

Adi Bar-Lev, Ayal Itzkovitz, Alon Raviv, Assaf Schuster

Department of Computer Science, Technion, Haifa 32000, Israel
e-mail: [email protected], fayali, [email protected]

ABSTRACT

In this paper we describe the parallel implementation of the Vertex-To-Vertex Radiosity method on a cluster of PC hosts. We first explain the evaluation of the Form-Factor by casting stochastic rays between pairs of vertices to establish the visibility ratio between them. We then describe an implementation of the Radiosity computation on top of the millipede system, a strong virtual parallel machine that runs (in user mode) on top of standard distributed environments of PC/NT, interconnected by slow communication media such as the Ethernet. The initial result of this distributed implementation is a large slowdown. Using millipede optimization tools, we then work step by step on improving the locality of the memory references, until eventually we obtain good speedups.

1. Introduction

Fast generation of a photorealistic rendering of a scene using computer graphics techniques is one of the most challenging tasks today. The computational complexity involved is due to the complex interactions of materials, light and occlusions, which determine the appearance of an environment in the real world. A constant-in-time amount of energy that is emitted by a light source establishes an energy transfer balance between all the objects in the environment in a matter of just a few nanoseconds. We wish to simulate this state of energy balance and to display it as the final stage in the generation of the synthetic scene.

One method which tries to give a solution to the problem is the Ray-Tracing technique, which casts rays from the imaginary location of the viewer towards the virtual scene. This in effect reverses the original paths of light reaching the viewer's eye in reality. The Ray-Tracing method is based on the reflectance quality of the material, i.e., the quality of a material to reflect most of the energy reaching it from a given direction in a non-uniform manner. For highly reflective objects (highly specular material), the ray-tracer generates another ray in the reflection direction, and gains its shade from the nearest object that is hit by that ray. The Ray-Tracing algorithm is especially effective in scenes which have shiny and highly reflective (specular) objects which are treated almost as mirrors. In contrast, it does not deal well with closed-space scenes where the light effects are produced mostly by the diffusion of light from the walls and the objects. In such scenes the shading will seem unrealistic, and will lack the effect of materials having highly diffuse attributes. More theory concerning this method can be found in Glassner's book [2].

Most of the shading which we see in everyday life is caused by the diffusion quality of materials in nature. This quality of the material causes the light to appear "soft", and to shade places which are not in direct view of the light source, for example the shade near the corners of a room, or the shade below a table. Adapted from the field of Heat Transfer (Sparrow & Cess [8]), the Radiosity method tries to solve the problem from another perspective. The basic assumption is that most of the materials in a scene are diffusive, namely, they distribute the unabsorbed energy from light in a uniform manner in all directions (see Cohen & Wallace [1]). Using this method, each light source in the 3D virtual environment is treated as a source of energy which casts energy (flux) towards the scene. The Radiosity method tries to simulate the energy transfer after it was distributed from the energy source by discretizing the environment into meshes and finding an approximation of the energy that each "patch" in the mesh will have after all the energy is distributed over the mesh. Eventually each patch in the scene is given a calculated amount of energy from the total amount that was distributed by the light sources, and this energy is converted into color and intensity by combining the three elements of color (in the human eye): Red, Green, and Blue. This method is highly effective in closed scenes that are composed of objects with diffuse material attributes, because the method simulates the energy transfer as a transfer of flux that is distributed equally in all directions. Notice that the Radiosity assumption is incorrect for specular materials such as mirrors or shiny surfaces, and so, while the method works well in most scenes taken from the real world, it does not deal well with specular materials, which get a "softer" and more shallow shading than in nature. The best results with this method are obtained in closed scenes such as rooms and buildings, which do not suffer from loss of energy except for the absorption of energy by the objects in the scene.

The solution of the Radiosity method is known to be highly exhaustive both in memory and in computation when compared to the Ray-Tracing method, and can be as much as two orders of magnitude slower. On the other hand, the Radiosity method has the advantage of being "view independent", i.e., once the synthetic scene is generated, a viewer may be placed anywhere, and it then takes very little time to compute how the scene should look from that point of view, since the energy distribution was already computed during the Radiosity process. In contrast, the Ray-Tracing method is "view dependent" and so the full computation must be repeated for every new point of view. Hence, if most of the objects are static in time, the Radiosity method is preferable for the generation of a walk through the scene. Another advantage of the Radiosity method is the ability to simulate not only light from light sources, but also the shade caused by heat sources such as furnaces in a plant, thus enabling the simulation of much more realistic scenes than with any other method.

In this work we were interested in parallel implementations of rendering algorithms, which is relatively straightforward with Ray-Tracing, but seems harder when the Radiosity method is employed. The important feature, shared by both the Ray-Tracing and the Radiosity techniques, is that the given world of objects is read-only. Thus, as long as the local memory of the processors involved is sufficiently large (several MBytes suffice for fairly complex scenes), the constructed scene can be split into fragments which may be computed independently with almost no communication overhead; see for example [3]. In contrast to Ray-Tracing, the Radiosity algorithm is based on iterated energy casting. Spreading the energy between the elements of the scene typically involves an "all-to-all"-type communication which may prevent speedups with naive implementations. In particular, we developed in this work a parallel version of Radiosity on a standard, non-dedicated cluster of PCs, so that the communication medium is extremely inefficient when compared to other parallel computing platforms.
Furthermore, we used the millipede system, which implements (among its other features, see below) Distributed Shared Memory in user space, and which sometimes creates implicit bursts of communication. Hence we had to design the data structures carefully, to use the optimization techniques that are provided by the millipede system, and to fine-tune the involved parameters.

1.1. The Millipede System

millipede is a framework for adaptive distributed computation on non-dedicated networks of personal computers [4, 5, 6] (millipede is currently implemented on the Windows-NT operating system; see also http://www.cs.technion.ac.il/Labs/Millipede). millipede includes a user-level implementation of a distributed shared memory (dsm) and load sharing mechanisms (mgs); it provides a convenient way to distributively execute multithreaded shared-memory applications with transparent page- and job-migration. millipede provides several techniques for reducing the implicit communication which results from using the distributed shared memory. These include strong support for reduced consistency of the shared-memory pages (thus allowing duplicate local copies on different hosts), with the consistency type assigned on a per-variable basis; a mechanism for allocating different variables on different pages of memory (thus avoiding ping-pong and false-sharing situations); and instructions to copy-in-and-lock pages of memory and release them eventually (again, to avoid ping-pongs and to stabilize the system).

A millipede application is issued by a user from the millipede Daemon running on the user's machine. The Daemon automatically distributes the application over the network, so that identical instances (copies) of the program are available at hosts that are under-loaded. If a node is a symmetric multi-processor, then all its internal processors are used in a transparent way by a single instance of an application. Instances of an application share a single virtual space. They communicate in a location-independent way via the dsm mechanism and synchronize using the millipede primitives.

1.2. Radiosity - An Overview

Assuming a closed space environment, we can divide it into sets of polygons, called patches. These patches receive and transmit the energy in the given environment. Based on the theory of energy transfer in diffuse materials, the relative amount of energy transferred between two patches depends on the amount of area which is not occluded by other patches in the environment, and on their geometric relation (angles of the planes, and distance). This relation, called the Form-Factor, represents the relative amount of visibility between every pair of points in the scene. For example, if we choose a point in a scene and look toward a polygon, we can estimate the Form-Factor by projecting the polygon onto a hemisphere which surrounds the point on its plane, and dividing the projected area by the area of the hemisphere. Notice that this definition of the Form-Factor dictates that its value lies between zero (when no point of one patch of the pair has a clear line of sight to any point on the other patch) and 1, the total relative visibility of all the scene patches from any given patch.

After discretizing the scene into meshes, the whole problem can be viewed as a set of N equations with N variables. The variables in these equations are the total amount of energy each patch receives from the environment after the energy balance in the environment is achieved, and the coefficients of each equation are the Form-Factors between the given patch and the other patches in the scene. Prior to the discretization, the differential equation is
$B_i \, dA_i = E_i \, dA_i + \rho_i \int_{A_j} B_j \, F_{j,i} \, dA_j$,
where $dA_i, dA_j$ are the differential areas of points i and j in the direction of the plane, $B_i, B_j$ are the energies of $dA_i$ and $dA_j$, $E_i$ is the amount of energy emitted from $dA_i$, $\rho_i$ is the reflectivity coefficient of $dA_i$, and $F_{j,i}$ is the Form-Factor from $dA_j$ to $dA_i$. After taking the scene and discretizing the surfaces into meshes, we get $B_i A_i = E_i A_i + \rho_i \sum_j (B_j F_{j,i} A_j)$. The fact that the Form-Factor is essentially a geometric quantity relating two patches dictates the symmetric (reciprocity) relation $F_{i,j} A_i = F_{j,i} A_j$. Re-organizing and plugging this into the above equation, we get $B_i = E_i + \rho_i \sum_j (B_j F_{i,j})$. This set of equations can now be solved using Gaussian elimination or Cramer's rule. The time complexity of the solution is $O(N^3)$, where N is the number of patches in the scene. Since N in most cases varies from a few thousand patches to a few hundred thousand, the complexity of the above method is clearly too high.

A different approach for solving the set of equations uses the Jacobi method, which approximates the correct solution in iterations. The involved complexity is $O(N^2)$ per iteration. This may be further reduced, roughly by half, by using the Gauss-Seidel technique, which finds the approximate solution by letting each iteration use the approximate values found in the previous iterations. Notice that these equations simulate the energy transfer from the environment (the patches) toward each patch in turn. Cohen et al. [1] described a slightly different approach called the Progressive-Refinement Radiosity method. In their method they isolated each variable (the energy of a patch) and, instead of computing the amount of energy it receives from the other variables on each iteration, they computed the amount of energy it casts over the rest of the patches. This approach does not alter the solution, but lets the algorithm pick the patch with the largest amount of energy for distribution, thus speeding up the approximation process by focusing on the most "important" equations in the set. As another advantage of this method, the iteration process can be stopped at any given time and still yield a "uniformly-approximated" result. Alternatively, the process may terminate when the patch with the largest amount of energy has less energy to distribute than a certain pre-specified threshold, thus ensuring an approximation within an error bound.
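To make the classical iterative solution concrete, here is a minimal C++ sketch (written for this summary, not taken from the paper) of repeated Gauss-Seidel sweeps over the discretized system $B_i = E_i + \rho_i \sum_j B_j F_{i,j}$; the form-factor matrix F, the emission vector E and the reflectivities rho are assumed to be given.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One Gauss-Seidel sweep over B_i = E_i + rho_i * sum_j B_j * F[i][j].
// Returns the largest change in any B_i so the caller can iterate to convergence.
double gaussSeidelSweep(const std::vector<std::vector<double>>& F,
                        const std::vector<double>& E,
                        const std::vector<double>& rho,
                        std::vector<double>& B) {
    double maxDelta = 0.0;
    for (std::size_t i = 0; i < B.size(); ++i) {
        double gathered = 0.0;
        for (std::size_t j = 0; j < B.size(); ++j)
            if (j != i)                               // a planar patch does not see itself
                gathered += B[j] * F[i][j];           // uses already-updated B[j] for j < i
        double newB = E[i] + rho[i] * gathered;
        maxDelta = std::max(maxDelta, std::fabs(newB - B[i]));
        B[i] = newB;
    }
    return maxDelta;
}

// Sweep repeatedly until the solution stops changing by more than eps.
void solveRadiosity(const std::vector<std::vector<double>>& F,
                    const std::vector<double>& E,
                    const std::vector<double>& rho,
                    std::vector<double>& B, double eps = 1e-4) {
    B = E;                                            // start from the emitted energy only
    while (gaussSeidelSweep(F, E, rho, B) > eps) { /* keep sweeping */ }
}
```

The progressive-refinement variant discussed above reorders the same computation by always shooting the largest remaining energy first, rather than sweeping all patches in fixed order.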

2. The Radiosity Algorithm

In this section we explain the methods and ideas we used when we implemented the Radiosity algorithm.

2.1. Adaptive Area Subdivision

As mentioned before, the variables in the Radiosity equations are the energy amounts of each patch in the scene. The scene is usually given as sets of polygons, but whether the scene was composed of polygons in the first place, or polygonized after creating it as meshes, the shape and size of the polygons in most scenes do not match the "shading lines" which would be created in a matching scene in reality. For example, consider a scene of a room with a light source dangling from the ceiling and a table in the center which creates a shaded area underneath. This scene in the virtual environment will most probably be composed of only one polygon for the floor, without any division according to shade or size. Thus, we should first find a good way to split the scene, and only then decide about the eventual energy of the patches. Recall that as the division refines and the area of each patch decreases, the time complexity of the Radiosity method increases by a polynomial in the division ratio, as discussed above. We get a tradeoff between the time complexity and the accuracy of the result. It is important to notice that the major energy contributors in the equations are the light sources, which also define the shadow lines in the scene. We thus chose to split the scene into patches by testing the energy differences between the vertices. The first stage of the method involves calculating the Form-Factor from each light source towards each vertex in the scene, and distributing the energy of the light sources among the original vertices of the scene. We now set a threshold and compare the energy of the vertices of each polygon in the scene. If the energy differs by more than the given threshold, we divide the polygon using the central point composed of all the polygon vertices. We call this point the center of gravity, see Figure 1. We repeat this process until all patches are below the threshold, or until the number of subdivisions at this branch reaches a certain limit.
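Purely as an illustration, the subdivision criterion can be sketched in C++ as follows; the Vertex and Polygon types and the quad-per-vertex split around the center of gravity are assumptions made for the sketch (they follow the delta-area construction of Section 2.2 and Figure 1), not the paper's actual data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vertex  { double x, y, z, energy; };
struct Polygon { std::vector<Vertex> verts; };

static Vertex average(const Vertex& a, const Vertex& b) {
    return { (a.x + b.x) / 2, (a.y + b.y) / 2, (a.z + b.z) / 2, (a.energy + b.energy) / 2 };
}

// "Center of gravity": the average of all polygon vertices (geometry and energy).
static Vertex centerOfGravity(const Polygon& p) {
    Vertex c{0, 0, 0, 0};
    for (const Vertex& v : p.verts) { c.x += v.x; c.y += v.y; c.z += v.z; c.energy += v.energy; }
    double n = static_cast<double>(p.verts.size());
    c.x /= n; c.y /= n; c.z /= n; c.energy /= n;
    return c;
}

// Recursively subdivide while the energy contrast across a polygon exceeds the threshold.
void subdivide(const Polygon& p, double threshold, int depth, int maxDepth,
               std::vector<Polygon>& patches) {
    double lo = p.verts.front().energy, hi = lo;
    for (const Vertex& v : p.verts) { lo = std::min(lo, v.energy); hi = std::max(hi, v.energy); }
    if (hi - lo <= threshold || depth >= maxDepth) {   // smooth enough, or depth limit reached
        patches.push_back(p);
        return;
    }
    Vertex cg = centerOfGravity(p);
    std::size_t n = p.verts.size();
    for (std::size_t i = 0; i < n; ++i) {              // one sub-patch per original vertex
        const Vertex& prev = p.verts[(i + n - 1) % n];
        const Vertex& cur  = p.verts[i];
        const Vertex& next = p.verts[(i + 1) % n];
        Polygon child{{ cur, average(cur, next), cg, average(prev, cur) }};
        subdivide(child, threshold, depth + 1, maxDepth, patches);
    }
}
```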

2.2. Form-Factor Computation

The major computational bottleneck in solving the Radiosity equations is the computation of the Form-Factors between the patches. Our approach is to gather all the energy intensities at the vertices (either the original ones or those created by our subdivision process). After we divide the polygons into smaller patches, we take each patch and distribute it among its vertices using the following method. We first calculate the "center of gravity" (CG) of the patch, and then define the following new vertices, which determine the new area of the vertex: the vertex itself, the point half way to the right vertex, the CG of the patch, and the point half way to the left vertex. Notice that as a result of the subdivision, it is likely that a vertex is shared by several patches. An example of such a division is depicted in Figure 1.

Now that we have the basic elements for our algorithm, we proceed to describe the computation of the Form-Factor between each pair of vertices A and B in the scene. We first compute the Form-Factor between a vertex A and the delta area of vertex B, called dB. We do this by using a formula which computes the Form-Factor as a sum around the perimeter of dB.



Figure 1: Left: the initial polygon from the scene. Right: subdividing the polygon into patches using its center of gravity.

This formula, which is suitable for convex polygons, is
$F(A \rightarrow dB) = \frac{1}{2\pi} \sum_{i=1}^{n} \beta_i \; N_A \cdot (R_i \times R_{i+1})$,
where $F(A \rightarrow dB)$ is the Form-Factor from vertex A to the area dB, $n$ is the number of vertices of dB (always 4 in our case), $\beta_i$ is the angle between the two lines connecting A with the endpoints of the $i$-th edge of dB, $N_A$ is the plane normal of dA, and $R_i$ is the vector which starts at A and ends at the $i$-th vertex of dB. The derivation of this formula, as well as other formulations for Form-Factor calculation, can be found in Cohen & Wallace [1] or in Sillion & Puech [7].

We now have the formulated Form-Factor, which from now on will be referred to as the Initial Form-Factor, since its computation does not take into account that any pair dA and dB can be partially or completely occluded from each other. Computing the occlusion accurately is an exhaustive task, and since we want to reduce this computation, we chose to use a stochastic ray casting method. From each vertex A we cast N rays toward each dB in the scene, where N is determined by two parameters: the number of rays to be cast per hemisphere for every vertex, and the number of rays cast at every dB, proportional to the size of the Initial Form-Factor from A toward dB. The rays are computed starting at a random location on dA and towards a random point on dB. This provides a good estimation of the visibility between dA and dB, an estimation which becomes closer to the actual value as the number of rays increases. After casting the rays we count the number of rays reaching dB without hitting another patch along their course, and the Final Form-Factor is computed as
$\text{Final Form-Factor} = \text{Initial Form-Factor} \times \frac{N_{hit}}{N_{total}}$,
where $N_{hit}$ denotes the number of rays which arrive at dB, and $N_{total}$ denotes the total number of rays shot from dA toward dB.

Since light sources emit energy in all directions (we assume a point light type for all light sources) and because they do not have a plane, computing the Form-Factor for light sources takes a different approach. The solution we took is to divide the light by an imaginary plane and consider the two resulting hemispheres, assuming each has half the light source energy. We use the normal of this plane to compute the Initial Form-Factor.
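The two stages just described (the contour-sum Initial Form-Factor and its stochastic correction) can be sketched in C++ as follows; this is illustrative code written for this summary, and the Vec3 type, the random-point samplers and the occlusion test are assumed stand-ins for the paper's geometry and ray/scene intersection routines.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

struct Vec3 {
    double x, y, z;
    Vec3   operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    double dot(const Vec3& o) const { return x * o.x + y * o.y + z * o.z; }
    Vec3   cross(const Vec3& o) const {
        return {y * o.z - z * o.y, z * o.x - x * o.z, x * o.y - y * o.x};
    }
    double len() const { return std::sqrt(dot(*this)); }
    Vec3   unit() const { double l = len(); return {x / l, y / l, z / l}; }
};

const double kPi = 3.14159265358979323846;

// Initial Form-Factor from vertex A (plane normal nA) to a delta area dB,
// given the corners of dB, using the contour-sum formula quoted above.
double initialFormFactor(const Vec3& A, const Vec3& nA, const std::vector<Vec3>& dB) {
    double sum = 0.0;
    std::size_t n = dB.size();                          // n == 4 for the vertex delta areas
    for (std::size_t i = 0; i < n; ++i) {
        Vec3 Ri = dB[i] - A;
        Vec3 Rj = dB[(i + 1) % n] - A;
        double c = std::max(-1.0, std::min(1.0, Ri.unit().dot(Rj.unit())));
        double beta = std::acos(c);                     // angle between R_i and R_{i+1}
        Vec3 x = Ri.cross(Rj);
        double xlen = x.len();
        if (xlen > 0.0) sum += beta * nA.dot(x) / xlen; // beta_i * N_A . unit(R_i x R_{i+1})
    }
    return sum / (2.0 * kPi);
}

// Final Form-Factor = Initial Form-Factor * (rays that reach dB / rays shot).
// The point samplers and the occlusion test are caller-supplied stand-ins for the
// paper's random-point generation and ray/scene intersection code.
double finalFormFactor(double initialFF, int nRays,
                       const std::function<Vec3()>& randomPointOnA,
                       const std::function<Vec3()>& randomPointOnB,
                       const std::function<bool(const Vec3&, const Vec3&)>& occluded) {
    int hits = 0;
    for (int r = 0; r < nRays; ++r)
        if (!occluded(randomPointOnA(), randomPointOnB()))
            ++hits;
    return initialFF * static_cast<double>(hits) / static_cast<double>(nRays);
}
```

In the paper the number of rays per dB is made proportional to the Initial Form-Factor; here nRays is simply passed in by the caller.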

2.3. The Progressive Refinement Algorithm

The progressive refinement algorithm goes as follows:

1. Compute the Form-Factors of the light sources and cast their energy.
2. Subdivide the scene into patches and recompute the energy transfer, initializing each vertex with its respective energy from the light sources.
3. Repeat the following operations: sort the vertices by energy amounts into an ordered list. If the energy of the first vertex in the list is below the threshold, then stop. Otherwise, for each of the first few vertices from the list do:
   a. Calculate the Form-Factor from the vertex toward each vertex in the scene.
   b. Distribute its unabsorbed energy to the vertices.
   c. Clear its unabsorbed Delta energy.
   d. Insert it back into the global list of vertices.

Notice that in this algorithm we referred to two types of vertex energy: the global energy which the vertex did not absorb and which will give it its shade, and the amount of energy it received from other vertices in the scene (and did not distribute yet). The latter energy is added to the global energy at each distribution, and is also added to a Delta energy which, in turn, will be distributed by the vertex. The list of vertices is sorted by the amount of the Delta energy of each vertex (and not by the global energy).

There are a few points in the algorithm that are worth noticing. First, the Form-Factors are computed only if the vertex which distributes the energy is one of the main contributors to the equations, and so there may be iterations where a vertex which already computed its Form-Factors is again at the beginning of the list due to the Delta energy distributed to it, thus saving the time to compute its Form-Factors all over again. Second, we can save some time on computing the Form-Factors by first shooting a few random rays from a vertex which is about to distribute its energy toward each dB, and deciding that dB is visible only if at least one ray hits the other vertex.
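The loop above can be condensed into the following C++ sketch; the VertexRecord fields mirror the two energy types described in the text, computeFormFactors is a placeholder for the Section 2.2 computation, and, unlike the paper (which processes the first few vertices of the sorted list per round), this sketch takes only the single most energetic vertex.

```cpp
#include <cstddef>
#include <vector>

struct VertexRecord {
    double energy       = 0.0;   // global energy: determines the final shade
    double deltaEnergy  = 0.0;   // unabsorbed energy still waiting to be shot
    double reflectivity = 0.5;
    bool   ffComputed   = false; // Form-Factors already computed for this vertex?
    std::vector<double> ff;      // Form-Factor toward every other vertex (filled lazily)
};

// Placeholder for the expensive Form-Factor computation of Section 2.2.
void computeFormFactors(std::vector<VertexRecord>& vs, std::size_t i) {
    vs[i].ff.assign(vs.size(), 0.0);   // real code: contour sum + stochastic ray casting
    vs[i].ffComputed = true;
}

void progressiveRefinement(std::vector<VertexRecord>& vs, double threshold) {
    for (;;) {
        // Pick the vertex with the largest unshot (Delta) energy.
        std::size_t best = 0;
        for (std::size_t i = 1; i < vs.size(); ++i)
            if (vs[i].deltaEnergy > vs[best].deltaEnergy) best = i;
        if (vs[best].deltaEnergy < threshold) return;   // approximation is good enough

        if (!vs[best].ffComputed) computeFormFactors(vs, best);

        double shot = vs[best].deltaEnergy;
        vs[best].deltaEnergy = 0.0;                     // clear the energy being distributed
        for (std::size_t j = 0; j < vs.size(); ++j) {
            if (j == best) continue;
            double received = vs[j].reflectivity * shot * vs[best].ff[j];
            vs[j].energy      += received;              // contributes to the final shade
            vs[j].deltaEnergy += received;              // and will be re-shot later
        }
    }
}
```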

3. Parallelizing The Algorithm

In this section we describe the practical experience we had in parallelizing the Radiosity algorithm. The sequential Radiosity algorithm was running as a Windows-NT application, and we modified it to make it work efficiently in parallel. The parallelization was done in two main steps. In the first step we analyzed the potential maximal level of parallelism of the problem, i.e., we located places in the code where work could potentially be carried out in parallel rather than sequentially. We made minimal code modifications to make the application run concurrently on millipede, and tested the correctness of the parallel algorithm. In the second step, work was done towards improving performance by handling three factors: true sharing, false sharing, and the scheduling policy. The results show that the second step was vital for improving the performance and obtaining speedups in a loosely-coupled distributed shared memory environment.

3.1. The Naive Parallelization

The Radiosity algorithm is based on iterated energy casting as described earlier, where each step is in fact independent of the other steps. Parallelism can be found mainly in three places. The first is the subdivision stage, where we can divide the scene between the hosts so that each will compute the effect of the subdivision on the polygons it received. The second place, which is the main stage of the algorithm, is the iteration stage, in which we can assign each vertex to a different host, and this host will calculate the Form-Factor and distribute the vertex delta energy among the other vertices. The third place is the display of the scene. In our work we implemented the second and the third options.

We implemented the second by maintaining for each vertex a data structure containing its current energy, its current delta energy which it can further distribute to the other vertices, and the new delta energy, which is received during the time the vertex distributes its current delta energy. Calculating the Form-Factor and distributing the delta energy can therefore be done in parallel in the following way:

Parallel Step, for a vertex v with non-zero new delta energy:
1. Copy the new delta energy to the current energy and the current delta energy.
2. If the Form-Factor has not been calculated yet, then calculate the Form-Factor of v.
3. Distribute the current delta energy of v according to the Form-Factor to the new delta energy of the vertices that are visible from this vertex.
End Parallel Step

After initialization, when some initial energy is distributed to the vertices, the algorithm spawns millipede jobs that execute the Parallel Steps concurrently for all vertices. Mutual exclusion should be enforced in two places. The first is while a vertex takes the new delta energy and adds it to its current energy and current delta energy (Parallel Step, stage 1); no concurrent update of the new delta energy is allowed at that time. The second place is when the new delta energy of the other vertices is being updated, an operation that should be performed exclusively by a single vertex at a time (Parallel Step, stage 3).
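To make the per-vertex bookkeeping concrete, here is a standalone C++ sketch of the naive Parallel Step; a per-vertex std::mutex stands in for millipede's atomic updates, and the Form-Factor computation is stubbed out, so this illustrates the structure of the step rather than the actual millipede code.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct SharedVertex {
    double currentEnergy = 0.0;   // global energy (the eventual shade)
    double currentDelta  = 0.0;   // delta energy being distributed right now
    double newDelta      = 0.0;   // delta energy received while distributing
    bool   ffComputed    = false;
    std::vector<double> ff;       // Form-Factors toward the other vertices
    std::mutex lock;              // stands in for millipede's atomic updates
};

// Naive Parallel Step for vertex v (one job per selected vertex).
void parallelStep(std::vector<SharedVertex>& vs, std::size_t v) {
    {   // Stage 1: atomically move newDelta into currentEnergy / currentDelta.
        std::lock_guard<std::mutex> g(vs[v].lock);
        vs[v].currentEnergy += vs[v].newDelta;
        vs[v].currentDelta   = vs[v].newDelta;
        vs[v].newDelta       = 0.0;
    }
    // Stage 2: compute the Form-Factors of v if not done yet (reads other vertices).
    if (!vs[v].ffComputed) {
        vs[v].ff.assign(vs.size(), 0.0);   // placeholder for the Section 2.2 computation
        vs[v].ffComputed = true;
    }
    // Stage 3: distribute currentDelta to the newDelta of the visible vertices.
    for (std::size_t u = 0; u < vs.size(); ++u) {
        if (u == v || vs[v].ff[u] == 0.0) continue;
        std::lock_guard<std::mutex> g(vs[u].lock);
        vs[u].newDelta += vs[v].currentDelta * vs[v].ff[u];
    }
}
```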

Using Millipede

millipede is designed to support multithreaded parallel programming on top of clusters of PCs. Therefore, the code changes for parallelizing the sequential Radiosity algorithm were minimal. Basically, each vertex could execute the Parallel Step independently of all other vertices, thus making it a parallel execution task (a job in millipede terminology). The memory allocation and deallocation operators new and delete were overloaded to use the millipede interface API for allocating memory from the shared memory. The millipede atomic operation faa (Fetch-And-Add) was used to preserve mutual exclusion when the new delta energy of the current job's vertex was updated, or when this value was updated for the vertices of other jobs.

Starting a new job for each vertex turned out to cause thrashing and calculation of redundant data. Thus we created execution jobs for a limited number of vertices that were picked from the beginning of the vertex array after the sorting phase, i.e., only those vertices with the highest energy to be cast are handled exclusively by a dedicated job. Scheduling is done in the following way. A special job is created which is in charge of scheduling new jobs. Its code looks as follows:

while (not finished)
  1. Sort the vertex array according to energy amounts.
  2. For each vertex v_i in the first k places of the array, spawn a new job running the Parallel Step for v_i.
  3. Sleep τ milliseconds.
end while

In Figure 2 we see that the naive implementation performs poorly. When the number of machines grows, the computation becomes slower, and there is no benefit from using the parallel algorithm compared to the sequential one.
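A minimal sketch of this periodic scheduling job, reusing the SharedVertex and parallelStep definitions from the sketch above; std::thread stands in for spawning a millipede job, and joining the batch before sleeping is a simplification of the sketch, not a property of the paper's scheduler.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Periodic ("eager") scheduler: every tau milliseconds, sort the vertices by
// unshot energy and spawn one job for each of the k most energetic ones.
void schedulerLoop(std::vector<SharedVertex>& vs, std::size_t k, int tauMs,
                   std::atomic<bool>& finished) {
    while (!finished.load()) {
        std::vector<std::size_t> order(vs.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return vs[a].newDelta > vs[b].newDelta; });

        std::vector<std::thread> jobs;              // std::thread stands in for a millipede job
        for (std::size_t i = 0; i < k && i < order.size(); ++i)
            jobs.emplace_back(parallelStep, std::ref(vs), order[i]);
        for (std::thread& j : jobs) j.join();       // simplification: the real scheduler does not wait

        std::this_thread::sleep_for(std::chrono::milliseconds(tauMs));
    }
}
```

The sleep interval tauMs plays the role of τ in the text; its tuning is revisited in Section 3.2.3.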


Figure 2: Performance of two versions of the parallel Radiosity algorithm. The right graph shows that the naive implementation performs poorly when the number of machines involved in the computation grows. The left graph shows that the difference in performance between the naive implementation and the best (fully optimized) implementation can grow to more than 450 percent as the number of machines grows.

3.2. Improving the parallel algorithm

The parallel implementation, as described above, suffers from several inefficiencies which greatly influence the performance. We focused on three factors: the sharing of the data structures, the false sharing in the algorithm implementation, and the scheduling policy. We show that each change by itself makes a large improvement in the execution time, while the combination of all of them results in good speedups on various scenes, as can be seen in Figure 3.


Figure 3: Speedups for three scenes, two relatively small ones (number 1 and 2, each less than 200 seconds sequentially), and one relatively large (number 3, about 2000 seconds). For problems with high computational demands (like scene 3) good speedups are achieved.

3.2.1. True Sharing

Calculating the Form-Factor of a vertex is a time consuming operation. The calculation involves reading the characteristics of other vertices, performing the mathematical operations, and finally storing the Form-Factor in the private data structure of the vertex. In terms of dsm operations, calculating the Form-Factor involves reading from the data structures of other vertices and writing to that of its own vertex. On the other hand, distributing the vertex energy (called shooting radiosity) is a very short operation, and requires merely updating the data structures of other vertices according to the current values of the vertex's own data structure. In terms of dsm operations, distributing the energy involves reading from the vertex's own data structure and writing to the data structures of other vertices.

In the naive parallel algorithm a job does both operations, namely, calculating the Form-Factor (if required) and distributing the energy of its vertex to other vertices. Thus, in the naive parallelization, true sharing of the vertices' data structures occurs. A data structure of a vertex v is shared by each job which calculates the Form-Factor of any vertex u to which v is visible (shared for reading), by each job that distributes the energy of its vertex u to the visible vertex v (shared for writing), and by the job to which vertex v is assigned for calculating the Form-Factor (shared for reading) or for distributing its energy to other visible vertices (shared for writing). The sharing of the data structures by several jobs leads to massive page shuttling and high delays in execution.

Our solution is as follows. Note that the Form-Factor calculation takes the majority of the computation time and has a high computation-to-communication ratio (the vertices' data structures are only being read). Therefore, in our solution only Form-Factor calculations are carried out in parallel. All energy distribution is done on one of the machines, while calculations of Form-Factors are sent to other machines to be computed concurrently. As a result of this change, sharing is reduced significantly. The data structures of the vertices are accessed on the following occasions: a remote job might need a structure for reading (it is calculating the Form-Factor of that vertex), or local jobs might need it for writing (they are distributing their energy to this vertex). The Parallel Step therefore changes so that it activates either of the two distinct Parallel Steps shown below. The scheduling job is left unchanged.

Parallel Step 1, for vertex v:
1. Calculate the Form-Factor of v.

Parallel Step 2, for a vertex v with non-zero new delta energy:
1. Copy the new delta energy to the current energy and the current delta energy.
2. Distribute the current delta energy of v according to the Form-Factor to the new delta energy of the vertices that are visible from this vertex.

Parallel Step, for vertex v:
If the Form-Factor has not been calculated yet, then spawn a new job running Parallel Step 1. Else spawn a new local job running Parallel Step 2.

Figure 4 shows the performance boost of the improved algorithm (i.e., when the parallel steps are separated into local and remote execution). The improvement is the result of a dramatic reduction in the sharing of the vertices' data structures. In fact, for the sake of locality of memory references, the potential level of parallelism is reduced so that distributing the energy (shooting radiosity) is not carried out concurrently.


Figure 4: Maximizing the level of parallelism does not give optimal execution times. When the creation of remote concurrent jobs is restricted to those calculating the Form-Factor, an improvement of 60-200 percent is achieved due to less sharing and increased locality of memory references.

3.2.2. False Sharing

A known phenomenon in page-based shared memory systems is false sharing, in which two or more distinct variables that are used by distinct machines are placed on the same memory page. Thus, although there is no true sharing, the page will frequently move between the machines as if its data were shared. It turned out that in the naive parallel Radiosity implementation there were several variables in the vertex structure that were falsely shared, thus causing a performance degradation. Among others, this structure contains the values that reflect the energy in the node (current energy, current delta energy and new delta energy). This data is referenced merely in the stage of energy distribution, and is not required during the Form-Factor calculation. Thus, the false sharing on the vertices' data structure can be avoided by extracting the energy fields into a stand-alone allocation, and replacing them with a pointer to that allocation. As a result, jobs which calculate the Form-Factor do not (falsely) share data with shooting radiosity operations, and thus are not disturbed by energy distribution that is done for other vertices concurrently. Figure 5 shows that a performance improvement of 30-50 percent is obtained when the energy variables are extracted from the vertex data structure into a stand-alone allocation. We mention here that in the naive implementation there were several other variables that were falsely shared, and we used the same technique to eliminate this sharing.
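The fix can be sketched in C++ as follows; the field names and the page size are illustrative, but the idea, moving the frequently written energy values out of the vertex record into their own allocation and padding arrays to whole pages, is the one described above.

```cpp
#include <cstddef>
#include <vector>

// Frequently written energy values, moved out of the vertex record so that writes
// made while shooting radiosity do not invalidate the pages holding the read-mostly
// data used by Form-Factor jobs on other machines.
struct EnergyBlock {
    double currentEnergy = 0.0;
    double currentDelta  = 0.0;
    double newDelta      = 0.0;
};

struct VertexData {
    // Read-mostly fields used by the remote Form-Factor calculations.
    double position[3] = {0, 0, 0};
    double normal[3]   = {0, 0, 1};
    std::vector<double> ff;
    // One extra level of indirection: the energy lives in its own allocation,
    // so it can land on a different page than the fields above.
    EnergyBlock* energy = nullptr;
};

// Illustrative page-padding helper: round an allocation size up to whole pages,
// so that consecutive arrays never share a page (another source of false sharing).
constexpr std::size_t kPageSize = 4096;   // assumed page size
constexpr std::size_t paddedBytes(std::size_t bytes) {
    return ((bytes + kPageSize - 1) / kPageSize) * kPageSize;
}
```

The extra pointer dereference is the indirection overhead mentioned in the caption of Figure 5; it is visible on a single machine but negligible in the distributed setting.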



Figure 5: False sharing causes a slowdown of 30-50 percent relative to the potential optimal execution time of the parallel Radiosity algorithm. The graph shows the effect of solving it by separating the related allocations. The right graph shows some performance degradation on a single machine, which is the result of an additional level of indirection in accessing some variables. In a distributed system this overhead can be neglected.

An additional optimization method we use is padding array allocations to an integral number of pages. This also eliminates many of the false sharing situations.

3.2.3. Scheduling Policies

The previously described naive scheduler wakes up once every τ milliseconds. It sorts the array of vertices according to their energy level, and then spawns k new jobs for the k most significant vertices in this respect which are not taken by other jobs at that time. It turned out that fine tuning of the value of τ is very important, but this is not described here for lack of space. We also examined a different approach in which the scheduler is designed to be event triggered. We used the integrated mechanism for message passing in millipede (named mjec) to implement the scheduler so that when the Parallel Step ends, the job sends a FINISHED message to the scheduler. This triggers the scheduler to spawn new tasks: it 1. sorts the array of vertices according to energy amounts, and 2. for each vertex v_i in the first k places of the array spawns a new job running the Parallel Step for v_i. Figure 6 shows that the event-triggered scheduling policy performs better than the eager τ-based scheduling, even when using the optimal τ value for the naive scheduler.
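As an illustration, the event-triggered variant might look as follows, with a small condition-variable channel standing in for millipede's mjec FINISHED messages; the sketch reuses the SharedVertex and parallelStep definitions from the Section 3.1 sketch and simplifies the scheduler to wait for a whole batch before spawning the next one.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for mjec: a job "sends FINISHED" by decrementing a counter and notifying.
struct FinishedChannel {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;                       // jobs spawned but not yet finished

    void sendFinished() {
        std::lock_guard<std::mutex> g(m);
        --pending;
        cv.notify_one();
    }
    void waitForBatch() {                  // simplification: wait for the whole batch
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return pending == 0; });
    }
};

// Event-triggered scheduler: instead of waking every tau milliseconds, spawn a new
// batch of k jobs as soon as the previously spawned jobs report FINISHED.
void eventTriggeredScheduler(std::vector<SharedVertex>& vs, std::size_t k, double threshold) {
    FinishedChannel done;
    for (;;) {
        std::vector<std::size_t> order(vs.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return vs[a].newDelta > vs[b].newDelta; });
        if (vs[order[0]].newDelta < threshold) return;     // nothing significant left to shoot

        std::size_t batch = std::min(k, order.size());
        done.pending = static_cast<int>(batch);
        for (std::size_t i = 0; i < batch; ++i) {
            std::size_t v = order[i];
            std::thread([&vs, &done, v] {
                parallelStep(vs, v);                       // run the job
                done.sendFinished();                       // "send FINISHED to the scheduler"
            }).detach();
        }
        done.waitForBatch();                               // wake on FINISHED, not on a timer
    }
}
```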

4. Conclusions

In this work we implemented a parallel Radiosity rendering algorithm on a loosely-coupled cluster of PCs. The initial naive algorithm did not perform well, and in fact the computation time increased with the number of machines. We then applied a handful of tunings and optimization techniques which eventually made the algorithm execute efficiently, achieving speedups close to optimal. This project was carried out on the millipede system, which provides an easy-to-use environment for parallel programming on top of distributed clusters of PCs. In fact, the millipede concept was designed to promote this precise modus operandi (in the process of parallel programming): a first, easy phase of porting an existing sequential code to the parallel domain, and then, in a follow-up phase, a step-by-step gradual optimization towards efficiency and improved speedups.



Figure 6: The naive scheduler performs 15-20 percent worse than the event-triggered scheduler. Results in the graph were taken with the optimal τ values for the naive scheduler.

References

[1] M. F. Cohen and J. R. Wallace. Radiosity and Realistic Image Synthesis. AP Professional, 1993.
[2] A. S. Glassner. An Introduction to Ray Tracing. Academic Press, 1989.
[3] C. Gotsman, A. Reisman, and A. Schuster. Parallel Progressive Rendering of Animation Sequences at Interactive Rates on Distributed-Memory Machines. In Proc. Parallel Rendering Symposium, Phoenix, October 1997 (to appear). Preliminary version appeared in SUP'EUR 96, CYFRONET, Krakow, September 1996, pp. 131-141.
[4] A. Itzkovitz, A. Schuster, and L. Shalev. Millipede: Supporting Multiple Programming Paradigms on Top of a Single Virtual Parallel Machine. In Proc. HIPS Workshop, Geneva, April 1997.
[5] A. Itzkovitz, A. Schuster, and L. Shalev. Thread Migration and its Applications in Distributed Shared Memory Systems. The Journal of Systems and Software, 1997. To appear. (Also: Technion TR LPCR-#9603.)
[6] A. Schuster and L. Shalev. Access Histories: How to Use the Principle of Locality in Distributed Shared Memory Systems. Technical Report #9701, Technion/LPCR, January 1997.
[7] F. X. Sillion and C. Puech. Radiosity and Global Illumination. Morgan Kaufmann, San Francisco, California, 1994.
[8] E. M. Sparrow and R. D. Cess. Radiation Heat Transfer. Hemisphere Publishing Corp., 1978.
