Interactive Direct Volume Rendering on a Cluster of Workstations

Ping-Fu Fung ([email protected])
Pheng-Ann Heng ([email protected])
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong

This work is supported by Hong Kong RGC Co-operative Research Centres Scheme Grant No. CRC4/98 and RGC Earmarked Grant No. CUHK4162/97E.
Abstract

Visualization systems based on direct volume rendering, a resource-hungry procedure, often lack interactivity. The dropping cost-to-performance ratio of modern workstations and PCs makes it feasible and cost-effective to distribute the demanding rendering tasks over a cluster of machines in a local area network (LAN). This approach improves volume rendering speed dramatically and lends itself to interactive visualization applications. In this paper we discuss how to distribute the volume rendering tasks economically over a LAN of limited bandwidth while striving for interactivity on realistically sized volume data sets. This involves various design issues in data and task partitioning, communication patterns and load balancing. We also provide performance measurements of our integrated system implementation.

Keywords: volume visualization, direct volume rendering, distributed rendering

1 Introduction

There has been much effort put into improving the speed of direct volume rendering by various techniques such as object space coherence and leaping [15] [1] [3], adaptive sampling [8], splatting [14] [12], special efficient traversal orders [7], etc. Current software implementations have almost pushed the capability of a single processor for direct volume visualization to its limit, as evidenced by the significant cache-level optimization effort in Lacroute's Shear-warp implementation. On the other hand, the interactivity requirement, or simply the demand for high rendering throughput posed by medical, engineering and scientific applications, has always been increasing. Special hardware-based solutions such as 3D texture mapping are one way to achieve interactive or even real-time volume visualization [2] [5]. Nevertheless, such hardware is usually costly, sensitive to volume size and platform dependent. Meanwhile, parallel and distributed processing offers a low-cost, effective and feasible answer to this demand. Several existing volume visualization systems adopt parallel processing in an attempt to cross the frame rate barrier. Lacroute applied the Shear-warp Factorization algorithm on an SGI Challenge as well as a Stanford DASH multiprocessor [6], both of which are closely-coupled shared memory multiprocessors. He reported an interactive rendering speed of more than 10 fps for a volume of size 256³ with 16 costly processing nodes. Ma took another approach, distributing the ray casting process over a cluster of loosely-coupled workstations in a LAN using hierarchical bisection [9]. His system takes seconds to minutes to render a single frame, lagging far behind the multiprocessors. Based on the LAN environment, we show how a cluster of low-cost machines can compete with a closely-coupled multiprocessing giant in direct volume rendering.

2 Multi-platform Virtual Parallel Environment

Nowadays workstations, and even personal computers, are equipped with large amounts of main memory (typically 64 MB or more) and powerful CPUs, while their prices keep dropping. In contrast, parallel computers are powerful but their cost-to-performance ratio is very high. A body of previous research [6] [9] [10] [11] [13] suggests that direct volume rendering is amenable to parallel processing. All these trends together stimulate the development of a loosely-coupled parallel computing environment for direct volume rendering. Our virtual parallel environment allows a volume renderer to be executed on a multi-platform, loosely-coupled network of machines. The rendering task is performed as a set of processes on a cluster of workstations of various platforms, which do not directly share any resources and have no common clock. They communicate and synchronize through network packets using a message-passing mechanism on top of TCP/IP (Transmission Control Protocol/Internet Protocol).
[Figure 1: Virtual parallel environment. (a) System architecture: the console reads the configuration and launches renderer processes across the local area network; console and renderers exchange network message packets. (b) Distributed rendering actions through time: an initialization stage (launch processes, send/receive acknowledgements) followed by a rendering stage (issue task, receive task, perform rendering, return result, receive result).]
Fig. 1(a) shows our simplified system architecture. Each node in this heterogeneous virtual parallel machine possesses its own CPU and memory. One of the machines is dedicated as the console, displaying the rendered images and handling the user interfacing tasks. The rendering processes are spawned all over the network of participants. The various actions performed by the console and the renderers are best illustrated in the time diagram of Fig. 1(b).

3 Distributed Rendering

In designing our distributed volume renderer, there are several important design issues. First of all is the data and task partitioning scheme. This affects the communication pattern, memory usage, CPU utilization, speed and efficiency of the system. The partitioning scheme in turn depends very much on the underlying system architecture as well as the characteristics of the application. Secondly, we have to come up with a communication pattern which suits our data and task partitioning scheme well. In this section, we go through the various design considerations and introduce our Distributed Rendering Pipeline.

3.1 Network Architecture of a Loosely-Coupled System

Since we are building the system on top of a LAN, the underlying network architecture cannot be assumed. It may be a shared bus network such as Ethernet, a ring network such as Token Ring, or a star network. Unlike closely-coupled systems, our inter-node connectivity and bandwidth are usually within tight limits. Therefore, we should try to reduce network traffic and design the communication pattern carefully in order to minimize possible bottlenecks. For example, if all the machines are linked by a 10Base-T Ethernet bus, we have a shared bus network with a maximum practical data rate of about 1 Mbyte per second (an experimental peak value measured in our department, with nearly no collisions on an Ethernet bus segment). Given an 8-bit precision graylevel image of 256 by 256 pixels, the raw bit count is 256² × 8 bit = 512 kbit. If a sustained frame rate of 10 Hz is required, the raw image data alone generate a data stream of 512 kbit × 10 Hz ≈ 5 Mbps. This figure does not include the overhead introduced by the underlying communication and control flow. Notice that this traffic calculation involves only the resulting rendered image; the traffic induced by the data volume and/or intermediate rendering information is not taken into account.
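As a quick cross-check of these back-of-the-envelope numbers, the same arithmetic can be written down directly. This is only a sketch of the estimate in the text; the 1 Mbyte/s figure is the measured value quoted above, not a property of Ethernet in general.

```cpp
#include <cstdio>

int main() {
    // One 8-bit graylevel frame of 256 x 256 pixels, in bits.
    const double frame_bits = 256.0 * 256.0 * 8.0;                   // ~512 kbit
    const double frame_rate_hz = 10.0;                               // sustained frame rate target
    const double stream_mbps = frame_bits * frame_rate_hz / 1.0e6;   // raw image stream only

    // Practical 10Base-T capacity measured on a quiet segment: ~1 Mbyte/s ~= 8 Mbit/s.
    const double bus_mbps = 1.0 * 8.0;

    std::printf("image stream %.1f Mbit/s uses %.0f%% of the %.0f Mbit/s practical bus capacity\n",
                stream_mbps, 100.0 * stream_mbps / bus_mbps, bus_mbps);
    return 0;
}
```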
3.2 Data and Task Partitioning
The core of a parallel/distributed system is the data and task partitioning scheme. According to the above analysis, we conclude with the following design directions:

- Reduce network traffic.
- Make full use of the abundant main memory resources.
- Simplify the communication pattern.

We employ an image space strip-wise task partitioning scheme as shown in Fig. 2. On each node, we keep the whole volume and the related information used by the rendering engine. In this way, we avoid massive data communication among the nodes. Although the volume data sets are very large, typically 16 MB or more, the huge memory capacity of today's workstations can handle such data sizes. It is justified to impose this partitioning scheme in return for reduced network traffic and increased rendering speed. If we do not replicate the volume data, either part of the volume or some of the intermediate rendering information has to be exchanged among the renderers through the network. Ma's approach [9] is representative of this class: sub-volume ray segment information is exchanged among the workstations in his IVES system. Assume, conservatively, that a data stream equal to 5% of the volume data set size has to be exchanged among the nodes for each frame rendered. Rendering a volume of size 16 MB (256³) at a frame rate of 10 Hz would then produce a data stream of 64 Mbps. This figure is not acceptable in a loosely-coupled network unless a special underlying network architecture is assumed, which may not be feasible in practice. The above estimate of the data exchange volume is justified: each axial bisection of the volume at each hierarchy level in Ma's algorithm creates a bisection area of n² for a volume of size n³, and data exchanges happen across the bisection planes. Suppose there are 3 levels in the hierarchy with a 256³ volume. There will be (2³ − 1) × 3 axes = 21 bisecting planes, creating a total bisecting plane area of 21 × 256², which is more than 8% of the volume size. Under our scheme, each node is responsible for rendering a strip of the image. Therefore, the underlying volume rendering algorithm should be image order based, for example ray casting. We have implemented two perspective ray-casting volume renderers for our distributed rendering system: a brute-force caster and one incorporating the IsoRegion Leaping Acceleration [3]. Both fit into our system well, and we employed the latter in all the results presented in this paper.
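The counting argument behind the bisection estimate can be checked with a few lines of arithmetic; the 3-level hierarchy and 256³ volume below are the figures assumed in the text.

```cpp
#include <cstdio>

int main() {
    const double n = 256.0;        // the volume is n x n x n voxels
    const int levels = 3;          // levels of hierarchical bisection assumed in the text

    // Bisections along 3 axes give (2^levels - 1) cuts per axis in total.
    const int planes = ((1 << levels) - 1) * 3;         // 21 bisecting planes
    const double plane_area = planes * n * n;           // voxels lying on bisection planes
    const double fraction = plane_area / (n * n * n);   // fraction of the volume they cross

    std::printf("%d planes, %.1f%% of the volume lies on bisection planes\n",
                planes, 100.0 * fraction);
    return 0;
}
```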
[Figure 2: Image based strip-wise partition. Each image k (k = 1, 2, 3) is divided into strips I_k,1 ... I_k,7, which are assigned to the renderers as subtasks.]
3.3 Communication Pattern and Analysis

Before rendering starts, all the renderers need to load the volume and the related information. The console then sends viewing parameters and strip information to the renderers. Since an image order based volume rendering algorithm is employed and each renderer holds the full data set, the renderers can work on their strips without communicating with the other renderers. At the same time, the console is free to handle requests from the user. After finishing their strips, the renderers return their results to the console and the final image is displayed.
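As an illustration of how the console might carve an image into fixed-size strip subtasks (the measurements below use four subtasks per renderer), here is a minimal sketch. The Strip type and the row-based split are assumptions for illustration, not the paper's actual data structures.

```cpp
#include <cstdio>
#include <vector>

// A strip of consecutive scanlines of the output image (hypothetical layout).
struct Strip {
    int first_row;
    int num_rows;
};

// Split an image of 'height' rows into equally sized strips,
// renderers * subtasks_per_renderer strips in total.
std::vector<Strip> partition_image(int height, int renderers, int subtasks_per_renderer) {
    const int count = renderers * subtasks_per_renderer;
    std::vector<Strip> strips;
    int row = 0;
    for (int i = 0; i < count; ++i) {
        int rows = height / count + (i < height % count ? 1 : 0);  // spread the remainder
        strips.push_back({row, rows});
        row += rows;
    }
    return strips;
}

int main() {
    // A 256-row image, 3 renderers, four subtasks per renderer (the multiplier used in the experiments).
    for (const Strip& s : partition_image(256, 3, 4))
        std::printf("rows %3d..%3d\n", s.first_row, s.first_row + s.num_rows - 1);
    return 0;
}
```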
[Figure 3: Simple subtask assignment scheme. Time diagram of console send/receive and renderer 1-3 receive/action/send activity; I_k,j denotes the instruction for image #k, subtask #j, W_k,j the work on it, and R_k,j the returned result.]
Fig. 3 shows the basic subtask assignment scheme. Here, we take an example of 3 renderers and suppose each image is partitioned into 7 subtasks. Table 1 summarizes the rendering performance of this simple scheme. For a fair comparison, machines of the same platform were configured into the system in all the tests in this section: Sun Ultra 1/140 workstations with 64 MB of main memory running Solaris 2.5.1. The renditions are at a resolution of 256 × 256. In all the experiments, the number of subtasks equals four times the number of renderers and the size of each subtask is fixed. Each measurement is the average of 80 renderings, made by rotating the camera around the object.
Data set: CT Head (128 × 128 × 64)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.49  1.80  2.60  3.60
  Speedup:                1.00  1.20  1.74  2.41
  Efficiency (%):         100   60    44    30
  Response time (ms):     670   557   385   278

Data set: CT Lung (256 × 256 × 200)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.42  1.87  2.55  3.25
  Speedup:                1.00  1.32  1.80  2.29
  Efficiency (%):         100   66    45    29
  Response time (ms):     705   534   392   307

Table 1: Result for simple subtask assignment scheme.

The result shows that the speedup is sub-linear and far from satisfactory. Moreover, the frame rate is low. There is still room for improvement, as indicated in the surprisingly low efficiency entries.
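Throughout the tables, speedup and efficiency are relative to the single-renderer frame rate. A minimal sketch of how the table columns relate, using the CT Head row of Table 1:

```cpp
#include <cstdio>

int main() {
    // Single-renderer baseline and 8-renderer frame rate from Table 1 (CT Head).
    const double fps_1 = 1.49, fps_8 = 3.60;
    const int n = 8;

    const double speedup = fps_8 / fps_1;            // ~2.41, as listed
    const double efficiency = 100.0 * speedup / n;   // ~30%, as listed

    std::printf("speedup = %.2f, efficiency = %.0f%%\n", speedup, efficiency);
    return 0;
}
```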
In Fig. 3 there are lots of empty slots. Such space fragments in fact represent stalls in the pipeline: either a network channel is empty or a machine is idle. To increase performance, we should fill up this space and make full use of both the network and computation resources.
[Figure 4: Task assignment scheme with pre-assignment. Time diagram in the same notation as Figure 3.]
Fig. 4 delivers a different picture: the shaded areas are packed tighter. The trick is to allow more concurrency between network transfer and rendering, since I/O and CPU events are theoretically independent. With the advanced multi-tasking design and operating system coordination of modern computer systems, it is possible to arrange for such events to occur concurrently. This avoids the situation of Fig. 3, where the renderers waste time waiting for the next rendering request. Instead, the next rendering request packet is transmitted while a renderer is still computing a strip. To make this happen, the console sends one extra request packet to each renderer at the beginning of the rendering pipeline; that is, a subtask pre-assignment is performed. By the time a renderer finishes its current subtask, the next subtask is ready. The results of the improved scheme are shown in Table 2.
Data set: CT Head (128 × 128 × 64)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.56  2.33  3.09  3.49
  Speedup:                1.00  1.49  1.98  2.73
  Efficiency (%):         100   75    49    28
  Response time (ms):     640   429   324   286

Data set: CT Lung (256 × 256 × 200)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.49  2.28  2.72  3.01
  Speedup:                1.00  1.53  1.83  2.02
  Efficiency (%):         100   76    46    25
  Response time (ms):     671   439   367   332

Table 2: Result for subtask assignment scheme with pre-assignment.

We achieve slightly better speedup and efficiency with the new scheme, compared to the simple subtask assignment scheme, and a slight overall improvement in response time. Even though subtask pre-assignment produces more satisfactory results, the system efficiency and the frame rate are still low. Fig. 4 helps to explain this phenomenon. With a pre-assignment scheme, the empty slots are indeed squeezed together; however, large slots are formed between the rendering of two consecutive images. In the example, renderer 1 and renderer 3 are kept idle. Both of them and the console are waiting for renderer 2, because the console does not distribute the subtasks of the next frame until the current image is displayed. If the console keeps only one image buffer, this cannot be avoided. To improve on this, we introduce multiple image buffers, so that the console can send rendering requests for the following frames before showing the current one. This allows several images to be rendered concurrently. Fig. 5 illustrates this subtask assignment scheme.
[Figure 5: Task assignment scheme with both pre-assignment and multiple image buffers. Time diagram in the same notation as Figure 3; subtasks of image #2 are issued and rendered while image #1 is still being assembled and displayed.]
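The console-side bookkeeping implied by pre-assignment plus multiple image buffers can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the two-deep per-renderer queue, the frame admission rule and all names are assumptions.

```cpp
#include <cstdio>
#include <deque>
#include <utility>
#include <vector>

int main() {
    const int renderers = 3, subtasks_per_frame = 7, image_buffers = 2, frames = 3;

    std::deque<std::pair<int, int>> todo;       // (frame, subtask) not yet assigned
    std::vector<int> queued(renderers, 0);      // requests outstanding per renderer
    std::vector<int> remaining;                 // unfinished subtasks per admitted frame
    int next_frame = 0, displayed = 0;

    auto issue_frames = [&] {
        // Multiple image buffers: admit new frames while fewer than
        // 'image_buffers' frames are still unfinished.
        while (next_frame < frames && next_frame - displayed < image_buffers) {
            for (int s = 0; s < subtasks_per_frame; ++s) todo.push_back({next_frame, s});
            remaining.push_back(subtasks_per_frame);
            ++next_frame;
        }
    };
    auto top_up = [&](int r) {
        // Pre-assignment: keep up to two requests queued at each renderer.
        while (queued[r] < 2 && !todo.empty()) {
            std::printf("send I%d,%d to renderer %d\n", todo.front().first + 1,
                        todo.front().second + 1, r + 1);
            ++queued[r];
            todo.pop_front();
        }
    };

    issue_frames();
    for (int r = 0; r < renderers; ++r) top_up(r);
    // ... on each result received from renderer r: --queued[r]; decrement remaining[frame];
    //     when a frame completes: display it, ++displayed, then issue_frames();
    //     finally top_up(r) so the renderer never waits for its next request.
    return 0;
}
```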
Data set: CT Head (128 × 128 × 64)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.70  3.20  4.64  7.68
  Speedup:                1.00  1.88  2.73  4.52
  Efficiency (%):         100   94    68    56
  Response time (ms):     855   453   407   313

Data set: CT Lung (256 × 256 × 200)
  Number of processors:   1     2     4     8
  Frame rate (fps):       1.66  2.86  4.34  7.48
  Speedup:                1.00  1.72  2.61  4.51
  Efficiency (%):         100   86    65    56
  Response time (ms):     945   527   383   316

Table 3: Result for subtask assignment scheme with both pre-assignment and multiple image buffers.
With our ultimate scheme, the pipeline efficiency is greatly improved: there are fewer pipeline stalls, and both the console and the renderers are kept busy most of the time. Table 3 reflects the results.
[Figure 6: Average frame rate of the three task assignment schemes (Simple, Pre-assign, Pre-assign w/Multi-buffer) with varying concurrency factor, for (a) CT Head (128 × 128 × 128) and (b) CT Lung (256 × 256 × 185).]

[Figure 7: Average speedup of the three task assignment schemes with varying concurrency factor, plotted against linear speedup (vertical axes in log scale), for the same two data sets.]

[Figure 8: Average response time of the three task assignment schemes with varying concurrency factor, for the same two data sets.]
Fig. 6 to Fig. 8 show a series of graphical comparisons among several aspects of the above schemes: the frame rate, the speedup and the response time of the three schemes with varying concurrency factors. Although the ideal case of linear speedup with 100% efficiency is not, and would not be, achieved, our system delivers acceptable performance. With fewer than 10 machines in a local area network, an interactive frame rate of about 8 fps is reached. Our partitioning and task assignment scheme is computationally efficient, keeping every processor busy most of the time. Each of them works on its own, and network communication overhead is kept minimal.

4 Load Balancing

On a cluster of heterogeneous loosely-coupled machines, the performance of the renderers running on different platforms may vary through time. This may be due to a number of factors:

- intrinsic differences in the computational power of the machines;
- background job load in a multi-user, multi-tasking environment;
- variations in the network traffic load;
- variations in the complexity of the rendering subtasks.

There is therefore a need to balance the load between different renderers in order to maintain the overall average performance. To attain a balanced load, our system adjusts the size of the strips dynamically. When generating just a single frame, we do not have much information on the performance of each renderer; in addition, the effect of load imbalance is not apparent in such a case. When a series of images is to be rendered, a performance estimate for each individual renderer is available by measuring the average time the renderer took for its previous subtasks. The console can then assign variable-sized subtasks to the renderers accordingly. This load balancing scheme is based on the assumption that the performance of the renderers does not change abruptly, i.e. the estimate is a valid approximation of the performance within a small time interval. To avoid sudden changes of load (bang-bang controlled load balancing), it is better to low-pass filter the effect of isolated instances of performance disorder. We introduce a forgetting function for the performance estimation of renderer i at epoch t (assuming there are n renderers):

\[ r_i(t) = \frac{m_i(t)^{-1}}{\sum_{j=1}^{n} m_j(t)^{-1}} \qquad (1) \]
\[ p_i(t) = \alpha\, p_i(t-1) + r_i(t)\,(1-\alpha) \qquad (2) \]
\[ p_i(0) = n^{-1} \quad \forall i \qquad (3) \]

where m_i(t) and r_i(t) are respectively the measured time taken and the relative performance of renderer i at epoch t, and p_i(t) is the estimated performance measure of renderer i at epoch t, accumulated over time. The damping factor α ∈ [0, 1] determines the forgetting characteristics of the performance measure. The value of α should be tuned for different systems and configurations; generally, any value between 0.1 and 0.9 should be fine. Our experiments suggest α = 0.9.
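A direct transcription of equations (1)-(3), together with one way the estimates could be turned into strip sizes, might look as follows. This is a sketch under the assumptions stated in the text (times measured per epoch, strips sized in proportion to p_i); the function names and the sample timings are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Update the per-renderer performance estimates p from the measured subtask
// times m of the latest epoch, using the damping factor alpha (eqs. 1-3).
void update_estimates(std::vector<double>& p, const std::vector<double>& m, double alpha) {
    double inv_sum = 0.0;
    for (double mi : m) inv_sum += 1.0 / mi;
    for (size_t i = 0; i < p.size(); ++i) {
        double r = (1.0 / m[i]) / inv_sum;         // relative performance, eq. (1)
        p[i] = alpha * p[i] + (1.0 - alpha) * r;   // damped accumulation, eq. (2)
    }
}

int main() {
    const double alpha = 0.9;                      // value suggested by the experiments
    std::vector<double> p(3, 1.0 / 3.0);           // p_i(0) = 1/n, eq. (3)

    // One epoch of measured strip times (ms): renderer 2 is currently the slowest.
    update_estimates(p, {90.0, 150.0, 100.0}, alpha);

    // Assign the next frame's 256 image rows in proportion to the estimates.
    double total = p[0] + p[1] + p[2];
    for (int i = 0; i < 3; ++i)
        std::printf("renderer %d: p = %.3f, next strip ~ %.0f rows\n",
                    i + 1, p[i], 256.0 * p[i] / total);
    return 0;
}
```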
5 Heterogeneous Rendering

Apart from the above experiments, which use solely Sun Ultras, we have also performed experiments in a heterogeneous environment. Table 4 shows the result of heterogeneous rendering. Here, several parameters are fixed. Firstly, we employ the subtask pre-assignment scheme with multiple image buffers, the best scheme according to the results and discussion in Section 3.3. Secondly, the load balancing parameter α is set to 0.9. Thirdly, there are 16 renderers drawn from a mixture of 2 Suns, 2 SGIs and 12 Pentium PCs.
Data set   Frame rate (fps)   Speedup   Efficiency (%)   Response time (ms)
CT Head    16.91              9.98      62               396
CT Lung    15.83              9.54      60               339

Table 4: Result of heterogeneous rendering.

With this heterogeneous environment, our volume rendering system scored a frame rate of over 10 Hz, comparable with the state-of-the-art Shear-warp based multiprocessor implementation by Lacroute [6]. In fact, our solution is much more cost-effective in terms of computing equipment, and there is no sacrifice in image quality. Launching our system on a cluster of PCs makes interactive volume visualization generally affordable.
6 Conclusion

Interactive direct volume rendering is very important and useful in many volume visualization applications. In addition to algorithmic improvements, hardware-level assistance and multi-processing are also significant in boosting the rendering process. Instead of investing in special hardware, we show how to achieve interactive direct volume visualization by distributed rendering on a cluster of low-cost heterogeneous machines. This solution can generate frame rates high enough for interactive volume visualization, as illustrated by our interactive Computer Simulated Bronchoscopy system [4], which makes use of this engine. 2D/3D graphics acceleration add-on cards for PCs now offer high throughput at affordable cost. Combining the power of such hardware with distributed processing can definitely improve the quality and speed of direct volume rendering considerably. This is one of our future research directions, in the hope of reducing the number of nodes and hence the cost.
References

[1] Daniel Cohen and Zvi Sheffer. Proximity Clouds - an Acceleration Technique for 3D Grid Traversal. The Visual Computer, 11(1):27-38, November 1994.
[2] Shiaofen Fang, Rajagopalan Srinivasan, and Su Huang. Volume Rendering by Template Based Octree Projection. 8th EuroGraphics Workshop on Visualization in Scientific Computing, April 1997.
[3] Ping-Fu Fung and Pheng-Ann Heng. Efficient Volume Rendering by IsoRegion Leaping Acceleration. Proceedings of The Sixth International Conference in Central Europe on Computer Graphics and Visualization '98, 3:495-501, February 1998.
[4] Ping-Fu Fung and Pheng-Ann Heng. High Performance Computer Simulated Bronchoscopy with Interactive Navigation. Proceedings of Computer Aided Radiology and Surgery '99, pages 161-165, 1999.
[5] Todd Kulick. Building an OpenGL Volume Renderer. SIGGRAPH Course Notes 1996, pages 1-8, 1996.
[6] Philippe Lacroute. Real-Time Volume Rendering on Shared Memory Multiprocessors Using the Shear-Warp Factorization. Proceedings of the 1995 Parallel Rendering Symposium, pages 15-22, 1995.
[7] Philippe Lacroute and Marc Levoy. Fast Volume Rendering Using a Shear-Warp Factorization of the Viewing Transformation. SIGGRAPH '94, in Computer Graphics Proceedings, Annual Conference Series, ACM SIGGRAPH, pages 451-458, 1994.
[8] Marc Levoy. Efficient Ray Tracing of Volume Data. ACM Transactions on Graphics, 9(3):245-261, July 1990.
[9] Kwan-Liu Ma and James S. Painter. Parallel Volume Visualization on Workstations. Computers and Graphics, 17(1):31-37, 1993.
[10] Raghu Machiraju, Loren Schwiebert, and Roni Yagel. Parallel Algorithms for Volume Rendering. The Ohio State University Department of Computer and Information Science Technical Report, October 1992.
[11] Raghu Machiraju and Roni Yagel. Efficient Feed-Forward Volume Rendering Techniques for Vector and Parallel Processors. Proceedings of Supercomputing, pages 699-708, April 1993.
[12] Klaus Mueller and Roni Yagel. Fast Perspective Volume Rendering with Splatting by Utilizing a Ray-Driven Approach. IEEE Visualization '96 Conference, pages 65-72, 1996.
[13] Peter Schroder and Gordon Stoll. Data Parallel Volume Rendering as Line Drawing. 1992 Workshop on Volume Visualization Proceedings, pages 25-32, October 1992.
[14] L. Westover. Interactive Volume Rendering. Chapel Hill Workshop on Volume Visualization, pages 9-16, 1989.
[15] Roni Yagel and Zhouhong Shi. Accelerating Volume Animation by Space-Leaping. Proceedings of Visualization '93, pages 62-69, 1993.