2010 IEEE 3rd International Conference on Cloud Computing
Scalable Support for 3D Graphics Applications in Cloud
Weidong Shi and Yang Lu, ViTie Inc., P.O. Box 4617, Naperville, IL 60567 ([email protected], [email protected])
Zhu Li, Dept. of Computing, Hong Kong Polytechnic University, 11 Yuk Choi Rd, Hong Kong ([email protected])
Jonathan Engelsma, School of Computing and Information Science, Grand Valley State University, 1 Campus Drive, Allendale, MI 49401 ([email protected])
services, etc. Beyond the enterprise, cloud computing has also been commonly leveraged by nascent startup companies who offer consumer-facing applications and services and need to be positioned to rapidly scale up their offerings should demand suddenly materialize. Yet these endeavors involve the same types of software services and application categories as the established enterprise. One application category that has until now not been actively pursued in the context of cloud computing is multimedia applications, especially those that involve 3D graphics. This is perhaps due in part to the fact that conventional enterprise software services and applications have been the low-hanging fruit and have thus far received the most attention, but more likely due to the default perception that the technical obstacles are insurmountable. Nevertheless, multimedia and 3D graphics functionality is becoming more and more prevalent in many application categories, well beyond the obvious “fast twitch” video games where the technology has been most widely applied. Interactive situated displays are beginning to be deployed in brick-and-mortar retail establishments to allow shoppers to visualize what they might look like in a new outfit, or as information kiosks in public places. Many enterprises now use online 3D virtual worlds for training and collaboration. In such situations, the computing needs of the typical office worker begin to transcend today's standard-issue office computer, which often has very primitive graphics processing capabilities. In all of these situations involving rich multimedia applications and 3D graphics, the economies of scale and elasticity of cloud computing are equally attractive. Yet to date, little effort has been expended on understanding whether this is indeed viable, or on developing the technology and wherewithal to make it a practical reality.
The main objective of this work is to design and evaluate a scalable cloud-based solution that enables the delivery of realtime 3D virtual appliances to users. To this end, we propose a framework called SHARC (Scalable 3D Graphics Virtual Appliance Delivery in Cloud). The primary contributions of this framework are: • A scalable pipelined cloud infrastructure for supporting high performance realtime 3D virtual appliances; • A number of optimizations for enabling high performance virtual graphics processing and delivery of 3D virtual appliance experiences to cloud users; and • An extension to the VNC (Virtual Network Computing) protocol that supports the overlay of streaming windows of 3D graphics applications running in a cloud.
Abstract—Recent advances in virtualization technology and wide acceptance of the cloud computing model are having a significant impact on the software service industry. Though cloud computing and virtualization technologies have been widely applied to support the information processing needs of conventional enterprise and business applications, there has been little success to date in enabling realtime 3D virtual appliances in the cloud. This paper addresses this deficiency by presenting SHARC, a solution for enabling scalable support of realtime 3D applications in a cloud computing environment. The solution uses a scalable pipelined processing infrastructure consisting of three processing networks organized according to the principle of division of labor: a virtualization server network for running 3D virtual appliances, a graphics rendering network for processing graphics rendering workloads with load balancing, and a media streaming network for transcoding rendered frames into H.264/MPEG-4 media streams and streaming them to cloud users. The paper describes a prototype implementation of SHARC and reports test results that demonstrate the viability of this approach.
Keywords-Cloud Computing, 3D Graphics, Scalability
I. INTRODUCTION
Cloud computing is emerging as a viable alternative to premise-based deployment of hardware and software solutions. The economies of scale and elasticity that cloud computing offers in an increasingly dynamic and competitive business climate have garnered rapid adoption and, as a consequence, are quickly altering the landscape of the information technology service industry. Cloud computing, as a disruptive technology, derives its power in part from recent advances in virtualization technology. Virtualization turns traditional software into virtual appliances 1 and allows software, together with its execution environment, to be deployed and delivered as services in ways that are both massively scalable and elastic. Though virtualization existed long before the emergence of cloud computing, it is the recent introduction of low-cost multi-core processors combined with the power of virtualization that makes cloud computing especially effective. Future many-core processors will increase virtualization densities to the point at which large numbers of virtual servers can be run concurrently on a single physical server. To date, cloud computing has been primarily concerned with delivering the software and application services required by the conventional business enterprise, such as web servers, transaction processing, business applications, email
1 A virtual appliance is a software application combined with a lightweight or just-enough operating system that lets it run optimally on a virtualization platform, which may or may not be deployed in the cloud.
978-0-7695-4130-3/10 $26.00 © 2010 IEEE DOI 10.1109/CLOUD.2010.76
allow users to launch and control 3D virtual appliances running in the cloud through a VNC client. Delivering 3D virtual appliances running in a cloud over the Internet presents challenges beyond those already mentioned and addressed by SHARC. The lack of QoS support on the Internet and network latency both potentially hinder cloud users from accessing cloud-based 3D virtual appliances. These hurdles can be dealt with by solutions complementary or orthogonal to SHARC. First, for users in proximity to cloud infrastructure deploying SHARC (e.g., within a few hundred miles or a few Internet hops), the service can be accessed with reasonable quality. For geographically dispersed users, SHARC can be used in conjunction with distributed clouds. In a distributed cloud model, many regional mini-cloud centers provide cloud computing services to regional users who demand low-latency access to the cloud. Apart from applications using 3D graphics, a regional cloud computing center can host many other types of applications that also prefer low communication latency, such as desktop applications, remote database servers, and multi-player game servers, to name just a few. In the following sections, we describe in detail how SHARC satisfies these design objectives by employing a scalable pipelined infrastructure in a modular fashion. Each subsystem of SHARC can be individually developed and tested.
The proposed VNC extension targets users who access 3D virtual appliances running in a cloud via a client supporting the extended VNC protocol. In SHARC, we are concerned with both horizontal scaling and vertical scaling. By the former we mean being able to support large numbers of concurrent virtual appliances and users, and by the latter we mean the ability to scale with future increases in virtualization density. The remainder of the paper is organized as follows. In section II, we present the design and architecture of the SHARC framework. Then, in section III, we report and analyze results derived from a research prototype implementation of SHARC. In section IV we discuss our future work regarding SHARC, and section V surveys related work by others and discusses how our work differs from those approaches. Finally, section VI concludes the paper.
II. SHARC SYSTEM DESIGN
Having provided a basic introduction and motivation for our work, we now present some of the underlying technical details of SHARC. We begin with a discussion of the design objectives we identified at the outset, and then proceed to demonstrate how these objectives are realized in the overall system architecture.
A. Design Objectives
SHARC is designed to bring the same economies of scale and elasticity that cloud computing brings to conventional non-multimedia software applications to applications that utilize realtime 3D graphics. In other words, our overarching goal is to be able to deploy realtime 3D virtual appliances in the cloud.
Hence, we designed SHARC with the following design objectives in mind: • It should enable realtime 3D virtual appliances (both interactive and non-interactive) for cloud users; • It should be massively scalable and compatible with the virtualization solutions widely adopted by existing cloud infrastructure; • The design should be modular, so that each sub-functionality can be individually developed and tested; and • It should build upon and extend widely adopted protocols and related technologies with regard to the interfaces between client applications and the systems running 3D virtual appliances in the cloud. To further clarify the first design objective listed above, our goal is to enable both interactive and non-interactive 3D graphics virtual appliances. Interactive 3D graphics virtual appliances typically include interactive 3D applications, 3D games, 3D design tools, etc. Non-interactive 3D applications include 3D visualization applications, non-interactive 3D walkthrough applications, 3D rendering applications, etc. We address compatibility with virtualization platforms in existing cloud infrastructure by adopting Xen virtualization as our initial target environment. The preponderance of VNC for running desktop applications in the cloud made extending the VNC client to support 3D graphics desktop applications running as 3D virtual appliances in the cloud a natural choice. Hence, SHARC extends the VNC protocol to
B. Overall System Architecture
Figure 1. Overall Architecture (diagram: within a regional cloud center, a resource management server coordinates virtualization servers hosting virtual guests, graphics rendering servers, and media streaming servers; user inputs flow in and media streams flow out)
Figure 1 presents a high-level diagram of SHARC. As mentioned above, SHARC can be deployed in either a central cloud computing facility or a regional cloud computing center. Regardless of the particular deployment details, within a cloud environment SHARC consists of three networks of servers inter-connected using a high-performance interconnect or local network (e.g., InfiniBand or multi-Gigabit Ethernet). One of the networks is a collection of inter-connected virtualization servers which form the backbone of a cloud infrastructure. Each virtualization server runs a Xen virtual host with hardware-assisted virtualization (HVM) and supports a number of virtual Linux or Windows
guests. The virtualization servers are designated to run 3D virtual appliances as virtual guests. In addition to virtualization servers, there is a collection of graphics rendering servers that offer high performance 3D graphics processing capabilities to the virtualization servers. These graphics rendering servers connect with the virtualization servers via a high-performance interconnect network. Each graphics rendering server can handle graphics processing workloads from multiple 3D virtual appliances concurrently. Furthermore, there is a third network of media streaming servers. The media streaming servers provide realtime media processing services to the virtualization servers and graphics rendering servers. They convert frames rendered by the graphics rendering servers into media streams and deliver the media streams to the cloud clients, which receive them either using a client software application built on top of the VNC protocol with streaming extension support or using a Flash application embedded in a web browser. SHARC does not use a frame capture device for capturing frames rendered by graphics rendering servers. Instead, the current implementation of SHARC uses a more flexible and scalable approach that directly converts rendered frames into JPEG in the graphics accelerators, using the accelerators' stream processing support, and sends the resulting JPEG streams to the media streaming servers. The graphics rendering servers connect with the media streaming servers via a high-performance interconnect network. Figure 1 also shows a resource manager server that manages and allocates resources for the 3D virtual appliances. For each 3D virtual appliance, the resource manager server decides which graphics rendering server and media streaming server should be used for graphics processing and media streaming.
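The allocation decision just described (choosing one rendering server and one streaming server per appliance) can be sketched as follows. The server names and the scalar utilization metric are illustrative assumptions, not SHARC's actual interface:

```python
# Minimal sketch of SHARC-style resource allocation: pick the least-loaded
# rendering and streaming server for a new 3D virtual appliance.

def pick_least_loaded(servers):
    """Return the server id with the lowest reported utilization."""
    return min(servers, key=servers.get)

def allocate_appliance(render_servers, stream_servers):
    """Choose one rendering server and one streaming server for an appliance."""
    return pick_least_loaded(render_servers), pick_least_loaded(stream_servers)

render = {"gfx-1": 0.72, "gfx-2": 0.35, "gfx-3": 0.90}   # reported GPU load
stream = {"media-1": 0.55, "media-2": 0.20}              # reported CPU load
print(allocate_appliance(render, stream))                # ('gfx-2', 'media-2')
```

In the real system, the utilization values would come from the periodic workload reports each server sends to the resource manager.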
The resource manager server also provides load-balancing services for the graphics processing network and media streaming network. Each graphics processing server or media streaming server reports its workload status and utilization statistics to the resource manager. When a new 3D virtual appliance arrives, the resource manager allocates resources by choosing a graphics processing server and a media processing server with more available resources. Currently, SHARC implements the resource manager using the Ruby scripting language and XML-RPC (remote procedure call). Note that the present paper describes a system where graphics rendering servers and virtual servers are separated. However, the system can be easily adapted to any cloud computing infrastructure where virtual servers possess high performance graphics processing capabilities. This only requires merging the functionality of graphics rendering servers with virtual servers. Compared with a monolithic system that combines graphics rendering servers with virtual servers, our current system has certain advantages. It is more flexible and scalable in distributing graphics processing workload, considering that a single physical cloud server may in the future support hundreds of virtual servers. Adding expensive, power-hungry, and large-form-factor graphics accelerators to the physical servers of a general purpose cloud infrastructure vendor may not be a good choice compared with using separate dedicated graphics rendering servers. Due to the modular
design, components we have developed can be applied to both scenarios.
C. Virtual Appliance Server
Figure 2. Virtual Appliance Server (diagram: a virtual guest runs DirectX and OpenGL applications over a virtual Direct3D driver translation layer and a virtual OpenGL driver, connected through a virtual graphics bus; virtual pointing, keyboard, and display device drivers and a user-space input daemon server sit alongside a VNC server, all above the Xen VMM host and its network stack)
Figure 2 shows the internal structure of a virtualization server in SHARC. The server consists of a Xen VMM and a set of virtual guests. Figure 2 highlights a Windows virtual guest. 1) Virtual Graphics Support: To support 3D virtual appliances, SHARC uses a virtual OpenGL driver (for both Linux and Windows). This OpenGL driver was initially developed by modifying the open source Chromium OpenGL driver [1]. VMGL [2] is a Linux virtual OpenGL driver for the desktop Xen environment that lets a Linux virtual guest use the host's 3D graphics processing capability. Since VMGL is also based on the source code of Chromium [1], SHARC incorporates the modifications VMGL made to Chromium for its Linux virtual OpenGL driver. SHARC offers a Linux GLX driver [3] and implements many GLX stub functions that are unimplemented or missing in both Chromium and VMGL. For the Windows virtual OpenGL driver, SHARC implements a display driver based on a sample display driver provided by the Windows DDK. SHARC implements a virtual graphics bus for transmitting graphics rendering commands and data from a guest using shared memory. A proxy daemon running in the privileged dom0 guest routes graphics commands and data to the graphics rendering server. SHARC supports Direct3D (version 9) by wrapping Direct3D over OpenGL. SHARC's virtual Direct3D driver is based on the open source WineX D3D driver, which implements Direct3D support by wrapping the Direct3D API over OpenGL. A number of OpenGL APIs required by WineX's D3D driver are not implemented or supported by the original Chromium OpenGL implementation. SHARC ported and implemented those OpenGL APIs in order to wrap Direct3D over SHARC's virtual OpenGL driver. 2) Audio Support: Virtual audio support is based on sound card emulation (SB16) in the dom0 guest. This works fine for the Windows guest. In our prototype system, we
accelerator. For a 3D virtual appliance assigned to a graphics rendering server, the graphics rendering server creates a corresponding graphics rendering context. As shown in figure 3, a graphics rendering scheduler is responsible for assigning or binding graphics rendering contexts to graphics processing resources. SHARC's graphics rendering server was developed by modifying Chromium's server. Since SHARC's objectives are significantly different from Chromium's, substantial changes were made on the server side in order to handle concurrent rendering of multiple 3D applications over multiple graphics accelerators on a single graphics rendering server. The original Chromium server supports low-overhead software OpenGL context switching, which is also incorporated into SHARC's graphics rendering server. Unlike Chromium [1] or VMGL [2], SHARC renders frames for virtual appliances into Pbuffers, because displaying rendered frames on a graphics rendering server is unnecessary and cumbersome. The graphics rendering scheduler in figure 3 connects to the resource manager. It reports its workload status periodically and receives control messages from the resource manager. The graphics rendering scheduler also receives information from the resource manager for each 3D virtual appliance, identifying which media streaming server its cJPEG stream should be sent to. SHARC compresses each rendered frame into a custom JPEG format simplified for running on a GPU's SIMD processors. Initially, SHARC used a JPEG library optimized with Intel's libraries. Later, we ported JPEG compression to the GPU using Nvidia's CUDA GPU programming support [6]. 2) Object Caching: To reduce transmission overhead, as shown in figure 3, SHARC employs a graphics object cache for caching graphics objects frequently used by 3D virtual appliances. A graphics rendering server uses system memory as a graphics object cache.
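The caching idea can be sketched as follows: a virtualization server skips retransmitting a texture or vertex array if the rendering server already holds it. Keying the cache by a content hash is our illustrative assumption; the paper does not specify the lookup mechanism.

```python
# Illustrative sketch of a graphics object cache: cache large graphics
# objects (textures, vertex arrays) by content hash so repeated frames can
# send only the key instead of the payload.
import hashlib

class GraphicsObjectCache:
    def __init__(self):
        self._store = {}

    def key(self, data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    def put(self, data: bytes) -> str:
        k = self.key(data)
        self._store[k] = data
        return k

    def has(self, k: str) -> bool:
        return k in self._store

cache = GraphicsObjectCache()
texture = b"\x00" * 4096          # stand-in for 4 KB of texture bytes
k = cache.put(texture)
# On later frames, the virtualization server sends only the short key.
assert cache.has(k)
```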
The current version of SHARC supports caching only of textures and geometry data (e.g., vertex arrays). For a 3D virtual appliance, a virtualization server does not transmit graphics objects that are already cached by the target graphics rendering server. Profiling results show that, depending on the application, the number of graphics rendering commands required to render a single frame can range from a few thousand to tens of thousands. To reduce bandwidth consumption further, aggressive compression of graphics commands and data may be required. Though not currently implemented, we plan to compress graphics commands and data in addition to graphics object caching. 3) GPU JPEG compression: Compressing a rendered frame on the GPU has two advantages. First, it offloads the bulk of the compression workload from a graphics rendering server's main processors and improves scalability. Second, it reduces the bandwidth overhead of transmitting rendered frames from a graphics accelerator to the graphics rendering server's host memory. SHARC's GPU-based compression uses a custom JPEG format (cJPEG) which reduces the bandwidth required for transmitting rendered frames to 20-30% of what would otherwise be needed. It should be pointed out that the current GPU architecture and CUDA programming model do not support a complete implementation of standard
Figure 3. Graphics Rendering Server (diagram: rendering command streams arrive via the network stack at a graphics rendering scheduler backed by a graphics object cache; each of several graphics accelerators hosts multiple graphics rendering contexts that render to Pbuffers, perform CUDA JPEG compression, and transmit the resulting cJPEG streams to the host)
also tried Virtual Audio Cable. Virtual Audio Cable [4] is a virtual audio driver for the Windows platform that can capture the audio waveform and redirect it to a user process. The audio output is transmitted to the media streaming server, where it is merged with the graphics output. To synchronize video and audio output, metadata annotating the audio output with a graphics frame counter is sent along to the media streaming server. 3) Interactive Input Support: For prototyping, a virtualization server can receive interactive input (e.g., keyboard or pointing device input) from a client. For a virtual Windows guest, SHARC implements two virtual input device drivers using the Windows DDK, one for a keyboard and the other for a mouse pointing device. Both drivers can inject synthetic input signals into a Windows guest. We developed a user-space input daemon server that receives encrypted input from a client over a TCP connection and sends the received input to the virtual keyboard or mouse device driver. For the Linux virtual guest, the input daemon server uses Linux's udev [5] support to implement virtual keyboard and mouse devices. The Linux input daemon injects keyboard and mouse inputs into a guest after they are received from a client and decrypted.
D. Graphics Rendering Server
Figure 3 shows the internal structure of a graphics rendering server in SHARC. A graphics rendering server comprises multiple graphics accelerators attached through PCIe, a graphics rendering scheduler, and a graphics object cache. A graphics rendering server can receive graphics rendering command streams from multiple 3D virtual appliances, carry out graphics rendering, and transmit custom JPEG streams (cJPEG streams) to the media streaming servers.
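As a rough illustration of how such a server might track per-appliance rendering contexts across its accelerators, consider the sketch below. The class name and the least-used binding policy are our assumptions, not SHARC's actual implementation:

```python
# Minimal sketch of per-appliance rendering-context bookkeeping: each 3D
# virtual appliance gets one context, bound to the least-used accelerator.

class RenderingScheduler:
    def __init__(self, n_accelerators):
        self.n = n_accelerators
        self.contexts = {}              # appliance id -> accelerator index

    def bind(self, appliance_id):
        """Create a context for an appliance on the least-used accelerator."""
        if appliance_id not in self.contexts:
            load = [0] * self.n
            for acc in self.contexts.values():
                load[acc] += 1
            self.contexts[appliance_id] = load.index(min(load))
        return self.contexts[appliance_id]

sched = RenderingScheduler(n_accelerators=4)
print([sched.bind(a) for a in ("vm-1", "vm-2", "vm-3", "vm-4", "vm-5")])
# [0, 1, 2, 3, 0]
```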
1) Rendering Context Scheduling: To maximize performance and scalability, SHARC's graphics rendering server allows concurrent processing of graphics rendering commands received from multiple 3D virtual appliances. This is achieved by employing multiple graphics accelerators and allocating multiple 3D virtual appliances to each graphics
graphics rendering network, transcode the cJPEG streams into H.264/MPEG-4 media streams, and transmit the media streams to cloud clients. A salient feature of SHARC's media streaming server is its inherently massive task-level parallelism. First, multiple cJPEG stream processing workloads are assigned to each media streaming server. Second, each cJPEG stream undergoes three steps of pipelined processing performed by a string of tasks. For each cJPEG stream, the tasks for each step are chained through a Linux shared memory queue [8]. A media streaming server receives cJPEG streams sent from the graphics rendering server network via networked pipes. After pre-processing, each cJPEG stream is transcoded into an H.264/MPEG-4 stream. SHARC takes advantage of the task-level parallelism in a media streaming server's workloads and of multi-core architecture by assigning tasks to a pool of processor cores. In the foreseeable future, a media streaming server may comprise hundreds or even thousands of MIMD (multiple instruction multiple data) processor cores, which are ideally suited for handling workloads with massive task-level parallelism. To maximize performance, SHARC binds tasks to processor cores using Linux's processor affinity support [9]. Processor affinity improves performance by exploiting cache locality. As shown in figure 4, there are three sets of tasks: networked pipes, transcoding tasks, and media streaming tasks. These tasks are bound to the processor cores of an SMP multi-core media streaming server using Linux's processor affinity support. Further, a media streaming server uses the H.264 baseline profile for transcoding, in which only I- and P-frames are used. This profile is suitable for realtime scenarios requiring low latency because B-frames are not used. In addition, a media streaming server uses a realtime Linux kernel [10], [11] for task scheduling. The Linux realtime task scheduler can handle tasks with soft recurring deadlines.
It improves QoS of realtime transcoding and streaming [12], [13].
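The task-to-core binding described above can be sketched as follows; the round-robin assignment policy is an illustrative assumption, and the actual pinning call (`sched_setaffinity`) is Linux-specific:

```python
# Sketch of SHARC-style task pinning: each (stream, stage) task in the
# networked-pipe / transcoding / streaming pipeline gets a dedicated core.
import os

STAGES = ["networked_pipe", "transcoding", "streaming"]

def plan_affinity(n_streams, n_cores):
    """Assign each (stream, stage) task a core, round-robin over cores."""
    plan, core = {}, 0
    for s in range(n_streams):
        for stage in STAGES:
            plan[(s, stage)] = core % n_cores
            core += 1
    return plan

def apply_affinity(pid, core):
    """Pin a task's process to one core (no-op on non-Linux platforms)."""
    if hasattr(os, "sched_setaffinity"):      # Linux-only API
        os.sched_setaffinity(pid, {core})

plan = plan_affinity(n_streams=2, n_cores=8)
print(plan[(0, "transcoding")], plan[(1, "streaming")])   # 1 5
```

Keeping each stage of a stream's pipeline on its own core is what lets the shared-memory queues between stages benefit from cache locality.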
Figure 4. Streaming Server (diagram: each incoming cJPEG stream, with an optional audio stream, feeds a pipeline of networked pipe, transcoding, and streaming (RTMP/RTP/RTSP/MMS) tasks, and each task is bound to a core of a many-core processor)
JPEG. It is impractical to implement entropy coding in CUDA. cJPEG uses a simple encoding scheme suited to CUDA but with much lower coding efficiency than standard JPEG. Currently, a graphics rendering server performs the following processing on the GPU: compression preparation, DCT transformation, quantization, and simple encoding. Compression preparation comprises the preprocessing required on a rendered frame before it is transformed using the DCT. Such preprocessing includes color space transformation and downsampling. After quantization, the quantized results usually contain many zeros. Since it is not practical to conduct entropy encoding using CUDA, SHARC adopts a simple encoding scheme that stores all non-zero quantization results in an array. We apply the parallel prefix sum approach in [7] to compute an array offset value for each non-zero quantization result. The computed offset values are stored in an offset map. Finally, after the offset position for each non-zero quantization value is determined, another CUDA program is executed to encode each non-zero quantization value and output the encoded result to an array using the array offset computed in the previous step. Each encoded value consists of three parts: one block bit indicating whether the output marks the beginning of a DCT coefficient block, the relative position of the output within a DCT coefficient block (x and y relative positions, 3 bits each), and the quantized coefficient. In order to delimit the beginning of a DCT coefficient block, the first quantized coefficient value (at zero relative x and y position) is always output regardless of whether its quantized value is zero. After the array is filled, it is transmitted to the graphics rendering server's host memory using DMA.
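The offset computation described above can be illustrated with a serial CPU stand-in; SHARC runs the parallel (CUDA) equivalent on the GPU, and this sketch omits the block-bit and position fields for brevity:

```python
# CPU sketch of the cJPEG sparse-packing step: compute output offsets for
# non-zero quantized coefficients with an exclusive prefix sum, then scatter
# the values into a dense output array.

def pack_nonzero(coeffs):
    """Pack non-zero values using prefix-sum offsets (serial stand-in)."""
    flags = [1 if c != 0 else 0 for c in coeffs]
    # Exclusive prefix sum: offsets[i] = number of non-zeros before index i.
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    packed = [0] * total
    for i, c in enumerate(coeffs):
        if c != 0:
            packed[offsets[i]] = c     # scatter to its computed slot
    return packed

print(pack_nonzero([5, 0, 0, -3, 0, 7]))   # [5, -3, 7]
```

In the CUDA version, the flag computation, prefix sum, and scatter each map naturally onto data-parallel kernels, which is why this scheme is practical on a GPU while full entropy coding is not.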
F. Streaming VNC Support
SHARC extends the VNC (Virtual Network Computing) protocol [14] to support 3D virtual appliances for cloud users who access virtual appliances through a VNC client. In a standard VNC implementation, a VNC client periodically sends requests to a VNC server asking whether there has been any display update. If one or more regions of the VNC server's display have changed, the VNC server sends updates of those regions as compressed images to the VNC client. The standard VNC client is designed to operate with low bandwidth requirements and a low update frequency. These assumptions do not hold for realtime 3D applications, which demand timely, high-frequency frame updates. SHARC addresses support for 3D virtual appliances in the VNC client by extending the VNC protocol. The result is a proof-of-concept VNC client that allows views of realtime 3D applications in the cloud to be displayed as media streaming overlay windows on top of a regular VNC framebuffer. Views of applications that do not require frequent updates are still updated using the conventional VNC update approach. Several new messages are defined and added to
E. Media Streaming Server
Figure 4 shows the internal structure of a media streaming server in SHARC. A prototype version of SHARC's media streaming server (Linux-based) comprises an SMP (symmetric multi-processor) system using state-of-the-art multi-core processors. As shown in figure 4, a streaming server can receive multiple cJPEG streams sent from a
the VNC protocol. These new messages handle the creation, teardown, resizing, and positioning of 3D application windows. A 3D application window is an overlay view that is superimposed on a VNC framebuffer. The overlay window updates its display in realtime based on the media stream received from a media streaming server. The current SHARC client supports streaming overlay windows with H.264/MPEG-4 encoded content delivered using RTP/RTSP/MMS. A VNC server manages and keeps track of all the 3D application views (windowed or fullscreen display) that require streaming overlay.
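To illustrate, a hypothetical wire encoding of such overlay-control messages is shown below. The opcodes and field layout are our assumptions; the paper does not specify the actual format:

```python
# Hypothetical encoding of streaming-overlay extension messages: each
# message carries a type byte, a window id, and a position/size rectangle,
# packed big-endian in the style of RFB protocol messages.
import struct

CREATE_OVERLAY, RESIZE_OVERLAY, DESTROY_OVERLAY = 0xF0, 0xF1, 0xF2

def encode_overlay_msg(msg_type, window_id, x, y, w, h):
    """Pack one overlay control message into 13 bytes."""
    return struct.pack(">BIHHHH", msg_type, window_id, x, y, w, h)

def decode_overlay_msg(buf):
    """Unpack an overlay control message back into its fields."""
    return struct.unpack(">BIHHHH", buf)

msg = encode_overlay_msg(CREATE_OVERLAY, 7, 100, 50, 640, 480)
print(decode_overlay_msg(msg))   # (240, 7, 100, 50, 640, 480)
```

A client receiving such a message would create, move, or tear down the corresponding overlay window and route the associated media stream into it.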
Figure 5. Averaged Latency Decomposition (chart legend: client total, JPEG encoding (CUDA), JPEG decoding (SW), H264 encoding (SW), networking (SHARC internal), and others)
III. EXPERIMENTS
We implemented a fully functional prototype of SHARC and used it to evaluate our approach. This section presents and discusses our findings.
A. Test Setup
Figure 6. Screenshots of Tested Applications (Tux, WoW3 Demo, and Alien Arena)
The prototyping environment consisted of several Linux servers (Ubuntu 8.04 distribution) and a Gigabit Ethernet switch. An Intel Xeon dual quad-core (E5410 2.33GHz) server with 16GB memory running Xen 3.1 was used as the virtualization server. The virtualization server supported both Linux and Windows (XP) virtual guests. The graphics rendering server comprised a single-processor motherboard with four PCIe x16 slots. It contained four Nvidia 9800GT graphics cards and 4GB memory. Another two Intel quad-core servers (E5410 2.33GHz) with 4GB memory were used as media streaming servers. A client Linux workstation was situated on the same local area network as the servers. The workstation comprised a dual-core AMD Opteron 2210 processor with 2GB memory. All machines were connected through a Cisco Gigabit switch.
C. Preliminary Results
One critical performance metric of SHARC is the total latency for delivering a frame after it is rendered by a graphics rendering server. In this preliminary study, this latency was measured as the time interval between when a frame was rendered and when the frame was displayed by a client. The interval included the time spent on JPEG compression by a graphics rendering server; the time spent transcoding a frame into an H.264-encoded frame, which further comprised the time for transmitting the frame internally, JPEG decompression time, and H.264 encoding time; the latency of delivering the encoded frame to a client; and finally the time spent by the client to decode the frame plus additional display scheduling delay. Latency data were collected from three 3D applications: Tux Racer (a GPL racing game), Alien Arena (a GPL FPS game), and the Warcraft 3 demo (a popular RPG game). Figure 6 shows sample screenshots of the three tested applications. Figure 5 shows the averaged decomposition of the total latency for the three tested applications. Averaged results were used because variations across the tested applications were very small. The averaged total latency was around 113 milliseconds, which is significant considering that our test environment was LAN based. According to figure 5, nearly 40% of the total latency was spent on the client side. The current prototype client was not sufficiently optimized. It decoded frames in software and used loosely coupled multi-threading. The client used one thread for receiving and decoding frames, and another thread for scheduling when the decoded frame should be displayed.
B. Profiling Support In order to study the performance of the SHARC prototype and identify potential bottlenecks, we implemented a number of timers and data gathering counters in various places along the processing pipeline. To support detailed timing analysis, we designed a frame marking scheme that embeds color codes into each rendered frame. The color coding scheme allowed profiling of time spent at each stage of the pipelined processing by a frame as the frame traverses from a graphics rendering server, to a media streaming server, and finally to a destination client. One problem we faced in profiling the system was how to measure time and assign a precise timestamp when the same frame was processed by different machines. In order to achieve sub-millisecond timestamping performance, clocks of all the servers in our test environment were synchronized using PTPd [15], an open source implementation of IEEE 1588 standard [16]. PTP was developed to provide very precise time coordination of LAN connected computers. PTPd can achieve sub-millisecond precision for computers situating within an ethernet LAN which is sufficient for our profiling purpose [17].
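As a sanity check, the latency decomposition reported above can be reproduced with simple arithmetic. The sketch below uses the approximate stage shares quoted in the text (total of about 113 ms; roughly 40% client side, 25% H.264 encoding, 16% JPEG decompression); the exact per-stage values in Figure 5 may differ slightly.

```python
# Back-of-the-envelope decomposition of the ~113 ms average frame latency.
# Stage shares are the approximate figures quoted in the text, not the
# exact Figure 5 measurements.

TOTAL_MS = 113.0

stage_share = {
    "client decode + display scheduling": 0.40,  # ~40% on the client side
    "H.264 software encoding":            0.25,  # ~25%, i.e. roughly 28 ms
    "JPEG decompression":                 0.16,  # ~16% before H.264 encoding
}
# Whatever remains is attributed to the other pipeline stages.
stage_share["other pipeline stages"] = 1.0 - sum(stage_share.values())

def decompose(total_ms, shares):
    """Return per-stage latency in milliseconds."""
    return {name: total_ms * share for name, share in shares.items()}

if __name__ == "__main__":
    for name, ms in decompose(TOTAL_MS, stage_share).items():
        print(f"{name:36s} {ms:6.1f} ms")
```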
This makes the number of OpenGL API calls processed per second range from tens of thousands to hundreds of thousands. Table I also shows, for each tested application, the averaged bandwidth required to transmit OpenGL commands and data from a virtual server to a graphics rendering server.
Figure 7. Average JPEG Processing Time Decomposition (preparation, DCT, quantization, parallel prefix sum, and encoding stages)
IV. FURTHER WORK

SHARC represents an initial effort to bring scalable real-time 3D application support to cloud computing. We have by no means solved all the problems of supporting real-time 3D applications in the cloud. However, we believe our results are the first to demonstrate the viability of the general concept. SHARC is an ongoing project that requires more in-depth evaluation and further optimization in order to become a real-world production service. Our prototype system reveals many interesting aspects of SHARC that can serve as launch points for further research. For example, one challenge is how to exploit future MIMD acceleration hardware and many-core processor architectures to support real-time media transcoding on a massive scale. Another research topic is how to achieve high-performance virtual graphics processing; the ultimate objective is to deliver virtual graphics processing with performance nearly on par with native graphics. Furthermore, increasing H.264 encoding performance, reducing JPEG processing time, and optimizing delays incurred on the client side are all subjects of future research and optimization.
Table I
API CALL PROFILE AND VIRTUAL-TO-GRAPHICS-SERVER BANDWIDTH UTILIZATION

Application    Averaged API Calls Per Frame    Bandwidth (Mbit/s)
Tux            4276                            2.04
AlienArena     6489                            18.30
WoW3 Demo      8430                            19.86
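The per-frame figures in Table I translate directly into per-second API call rates once a frame rate is assumed; the 30 fps figure below is a hypothetical value chosen for illustration, not a measured number.

```python
# Per-second OpenGL API call rates implied by Table I.
# The 30 fps frame rate is an assumed, illustrative value; the table
# itself reports only per-frame averages and bandwidth.

FPS = 30  # hypothetical frame rate

apps = {
    # application: (averaged API calls per frame, bandwidth in Mbit/s)
    "Tux":        (4276, 2.04),
    "AlienArena": (6489, 18.30),
    "WoW3 Demo":  (8430, 19.86),
}

def calls_per_second(calls_per_frame, fps=FPS):
    return calls_per_frame * fps

for name, (cpf, mbps) in apps.items():
    print(f"{name:12s} ~{calls_per_second(cpf):,} API calls/s at {FPS} fps")
```

At 30 fps the rates span roughly 128,000 to 253,000 calls per second, consistent with the "tens of thousands to hundreds of thousands" range noted in the text.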
There was a significant delay between when a frame was decoded and when it was displayed. Optimizing the client to reduce this latency is a subject we plan to investigate further. Furthermore, H.264 encoding took approximately another 25% of the total latency; on average, it took 30-35 milliseconds to encode a frame in software. Another 16% of the total latency was spent on software cJPEG decompression before the frame was encoded using H.264. We expect that both the H.264 encoding and cJPEG decompression latencies can be significantly reduced in the future with aggressive optimization and performance tuning; possible optimizations include applying hardware acceleration to boost H.264 and JPEG performance. As shown in Figure 5, the latencies incurred in the other processing steps were relatively insignificant compared with those discussed above. Figure 7 decomposes the total time of CUDA-based JPEG compression. On average, it took less than 10 milliseconds to compress a frame (640x480) into cJPEG. Surprisingly, quantization took more time than the DCT. This was likely because, for each coefficient, quantization outputs a binary value (1 or 0) to an offset map used as input by the parallel prefix sum; this processing may involve conditional tests that are expensive for the GPU. Further, as shown in Figure 7, the parallel prefix sum was the most time-consuming procedure in our CUDA-based JPEG compression. How to further improve the performance of GPU-based JPEG compression remains an ongoing challenge. Table I shows the averaged OpenGL API calls per frame for the three tested applications. To render one frame, an application may issue at least a few thousand OpenGL API calls.
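The quantize-then-scan structure described above can be illustrated with a sequential sketch: quantization emits a 0/1 flag per coefficient, an exclusive prefix sum over the flags yields output offsets, and the non-zero coefficients are scattered to those offsets. This is a plain Python illustration of the idea, not the CUDA kernel used in SHARC.

```python
def exclusive_prefix_sum(flags):
    """Exclusive scan: out[i] = sum(flags[:i])."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out

def compact_nonzero(quantized):
    """Stream compaction of non-zero quantized coefficients.

    Mirrors the GPU pipeline described in the text: a binary flag per
    coefficient (the offset map produced by quantization), a prefix sum
    over the flags to compute write offsets, then a scatter of the
    surviving coefficients into a packed array.
    """
    flags = [1 if c != 0 else 0 for c in quantized]
    offsets = exclusive_prefix_sum(flags)
    packed = [0] * sum(flags)
    for c, f, off in zip(quantized, flags, offsets):
        if f:
            packed[off] = c
    return packed

# Example: a mostly-zero run of quantized DCT coefficients.
coeffs = [12, 0, 0, -3, 0, 5, 0, 0]
print(compact_nonzero(coeffs))  # → [12, -3, 5]
```

On a GPU the scan and scatter run in parallel over thousands of coefficients, which is why the prefix sum dominates the Figure 7 breakdown.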
V. RELATED WORK

Support for 3D graphics in virtualization is a recent phenomenon [2], [18]. Some of the early 3D virtual graphics solutions [2] were inspired by Chromium [1], [19], a networked OpenGL rendering framework for rendering 3D graphics applications at very large display sizes (e.g., video walls) across distributed servers. Prior to the appearance of virtualization, a number of solutions were proposed to support remote execution and rendering of 3D graphics applications [20], [21]. VirtualGL [22] is an open source remote rendering project; it runs an OpenGL application and performs graphics rendering on the same server. To support display of rendered frames on a client, VirtualGL can capture rendered frames, compress them into JPEG images, and transmit the JPEG images to the client. A VirtualGL-based remote rendering approach that uses a GPU to accelerate image compression was described in [23], [24]. The open source WineX project did pioneering work in wrapping the Direct3D API over OpenGL [25]; this approach has been adopted by a number of virtualization products to enable Direct3D support. SHARC differs from the prior work cited above in the following aspects. First, previous support for 3D graphics in virtualization focuses on enabling 3D graphics on a virtual guest using the VMM host's graphics processing capability and display device. In contrast, SHARC targets cloud users who access 3D applications over the network, where graphics rendering and display on a VMM host are not required, and sometimes not even feasible. Second, SHARC distinguishes itself from previous remote rendering research by supporting 3D applications using virtualized
graphics, which makes it suited to the cloud computing model. Last but not least, SHARC distinguishes itself from previous simple remote rendering environments by utilizing a scalable pipelined processing infrastructure. Both SHARC and VMGL provide OpenGL support in virtual guests. Some of the differences between SHARC and VMGL are as follows. On the virtual guest side, SHARC supports both Windows and Linux guests while VMGL works only for Linux guests; SHARC also supports Direct3D and implements more stub functions. Furthermore, SHARC uses inter-domain shared memory and a proxy in dom0 for transmitting data out of a guest domain, while VMGL uses TCP/IP. On the graphics server side, each SHARC graphics rendering server has a management mechanism that enables centralized management and control of SHARC graphics rendering servers. SHARC renders all frames to a Pbuffer while VMGL renders frames directly on screen. In addition, SHARC implements a managed frame readback mechanism and a GPU-supported frame processing software module. Further, unlike VMGL, which is designed to serve a local guest, SHARC is designed to handle the concurrent workload of a large number of virtual servers; to facilitate this objective, SHARC includes mechanisms such as load balancing and graphics object caching.
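The stub-function forwarding described in this comparison (guest-side wrappers that serialize API calls and ship them to a rendering server) can be illustrated with a minimal sketch. The command ID, wire format, and in-memory transport below are invented for illustration; SHARC's actual shared-memory protocol is not specified here.

```python
import struct

# Hypothetical command ID; SHARC's real wire protocol is not documented here.
CMD_GL_VERTEX3F = 2

def pack_call(cmd_id, fmt, *args):
    """Serialize one intercepted API call: [cmd_id][payload_len][payload]."""
    payload = struct.pack(fmt, *args) if fmt else b""
    return struct.pack("<II", cmd_id, len(payload)) + payload

class StubTransport:
    """Stand-in for the guest-to-renderer channel (inter-domain shared
    memory in SHARC; a plain in-process buffer here)."""
    def __init__(self):
        self.buffer = bytearray()

    def send(self, data):
        self.buffer += data

transport = StubTransport()

# A stub function, as a guest-side GL wrapper library might define it:
def glVertex3f(x, y, z):
    transport.send(pack_call(CMD_GL_VERTEX3F, "<fff", x, y, z))

glVertex3f(1.0, 0.0, -1.0)
print(len(transport.buffer))  # 8-byte header + 12-byte payload = 20
```

On the receiving side, a rendering server would parse the header, look up the command ID, and replay the call against a real OpenGL context bound to a Pbuffer.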
VI. CONCLUSION

This paper presents a solution for enabling support of real-time 3D applications in a cloud computing environment. Following the principle of division of labor, the solution uses a scalable pipelined processing infrastructure consisting of three processing networks: a virtualization server network for running 3D virtual appliances, a graphics rendering network for processing graphics rendering workloads with load balancing, and a media streaming network for transcoding rendered frames into H.264/MPEG-4 media streams and delivering them to cloud users. Experimental results obtained from a fully functional prototype implementation are promising and suggest a number of interesting avenues for further research.

REFERENCES

[1] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski, "Chromium: a stream-processing framework for interactive rendering on clusters," ACM Trans. Graph., vol. 21, no. 3, pp. 693-702, 2002.

[2] H. A. Lagar-Cavilla, N. Tolia, M. Satyanarayanan, and E. de Lara, "VMM-independent graphics acceleration," in VEE '07: Proceedings of the 3rd International Conference on Virtual Execution Environments. New York, NY, USA: ACM, 2007, pp. 33-43.

[3] P. Karlton, P. Womack, and J. Leech, "OpenGL graphics with the X Window System, version 1.4," http://www.opengl.org/documentation/specs/glx/glx1.4.pdf.

[4] Ntonyx, "Virtual Audio Cable," http://www.ntonyx.com/vac3.htm#sc.

[5] G. Kroah-Hartman, "udev: a userspace implementation of devfs," in Proceedings of the Linux Symposium, July 2003.

[6] NVIDIA, NVIDIA CUDA Programming Guide 2.0, 2008.

[7] M. Harris, "Parallel prefix sum (scan) with CUDA," http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf, 2007.

[8] M. D. Cvetkovic and M. S. Jevtic, "Interprocess communication in real-time Linux," in Proceedings of the 6th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS 2003), vol. 2, 2003, pp. 618-621.

[9] R. Love, Linux System Programming: Talking Directly to the Kernel and C Library. O'Reilly Media, Inc., 2007.

[10] S. Rostedt and D. V. Hart, "Internals of the RT patch," in Proceedings of the Linux Symposium, 2007, pp. 161-172.

[11] FSMLabs, "The RTLinux company," http://www.fsmlabs.com.

[12] D. Sitaram and A. Dan, Multimedia Servers: Applications, Environments, and Design. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000.

[13] A. Ferone, M. Miralto, and A. Petrosino, "A real-time streaming server in the RTLinux environment using VideoLAN client," University of Naples Parthenope, Tech. Rep. RT-DSA-UNIPARTHENOPE-07-03, March 2007.

[14] T. Richardson, Q. Stafford-Fraser, K. R. Wood, and A. Hopper, "Virtual network computing," IEEE Internet Computing, vol. 2, no. 1, pp. 33-38, 1998.

[15] PTPd, "Precision time protocol daemon," http://ptpd.sourceforge.net/.

[16] "IEEE Std 1588-2002," http://ieee1588.nist.gov/.

[17] K. Correll and N. Barendt, "Design considerations for software only implementations of the IEEE 1588 precision time protocol," in Conference on IEEE 1588 Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, 2006.

[18] M. Dowty and J. Sugerman, "GPU virtualization on VMware's hosted I/O architecture," in First Workshop on I/O Virtualization, 2008.

[19] I. Buck, G. Humphreys, and P. Hanrahan, "Tracking graphics state for networked rendering," in HWWS '00: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware. New York, NY, USA: ACM, 2000, pp. 87-95.

[20] D. S. Wallach, S. Kunapalli, and M. F. Cohen, "Accelerated MPEG compression of dynamic polygonal scenes," in SIGGRAPH '94: Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, 1994, pp. 193-196.

[21] Y. Noimark and D. Cohen-Or, "Streaming scenes to MPEG-4 video-enabled devices," IEEE Comput. Graph. Appl., vol. 23, no. 1, pp. 58-64, 2003.

[22] VirtualGL, http://virtualgl.sourceforge.net/.

[23] S. Lietsch and O. Marquardt, "A CUDA-supported approach to remote rendering," in ISVC (1), 2007, pp. 724-733.

[24] S. Lietsch and P. H. Lensing, "GPU-supported image compression for remote visualization: realization and benchmarking," in ISVC '08: Proceedings of the 4th International Symposium on Advances in Visual Computing. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 658-668.

[25] WineD3D, http://wiki.winehq.org/WineD3D.