On the Applicability of Remote Rendering of Networked Virtual Environments on Mobile Devices

Peter Quax∗, Bjorn Geuns∗†, Tom Jehaes∗†, Wim Lamotte∗ and Gert Vansichem‡

∗ Hasselt University, Expertise Centre for Digital Media, Wetenschapspark 2, BE-3590 Diepenbeek, Belgium
† Interdisciplinary Institute for BroadBand Technology (IBBT), Expertise Centre for Digital Media, BE-3590 Diepenbeek, Belgium
‡ Androme NV, Wetenschapspark 4, BE-3590 Diepenbeek, Belgium
Email: {peter.quax, bjorn.geuns, tom.jehaes, wim.lamotte}@uhasselt.be
Email: [email protected]

Abstract— The technologies behind Networked Virtual Environments have seen increased interest in recent years, mainly due to the proliferation of Massively Multiplayer Online games. It is increasingly important for developers and service providers to be able to target the mass market of mobile device users. With the increased availability of wireless connectivity, one of the hurdles has been cleared. However, displaying a graphically attractive environment on a device with limited capabilities and battery power still presents a major obstacle. At the same time, mobile television viewing is currently being actively promoted as one of the key drivers behind the market uptake of 3G services. By using a technique known as remote rendering, we are able to use existing client-side hardware (suited for 3G applications) to display NVE-based applications on mobile devices. In this paper, we present an overview of a system architecture that supports remote rendering of these applications, together with measurement results that show the feasibility of the approach.

I. INTRODUCTION

The large-scale deployment of applications based on technology from networked virtual environments on mobile devices is considered to be the next step up for providers of IT entertainment products. Classic examples of these applications include the hugely popular Massively Multiplayer Online Role-Playing Games and the upcoming market of Virtual Communities. Enabling a user to take the virtual world and associated virtual identity along wherever he or she travels promises increased income for application providers and network providers alike. At the same time, a new market segment will be opened up by taking advantage of the proliferation of mobile phones and, to a lesser extent, Personal Digital Assistants. While certain classes of mobile devices, mainly state-of-the-art (and thus relatively expensive) PDAs, are becoming increasingly powerful, several issues remain when this type of application is to be deployed on new platforms. Battery power consumption in mobile devices becomes problematic when the load on the main (general-purpose) CPU is continuously increased. This is one of the reasons why manufacturers of mobile devices have opted to include one or more custom integrated circuits in their devices, which are at the same time more energy efficient and faster

than a general-purpose microprocessor. The main drawback of these integrated circuits is that they are targeted towards a single-use scenario. For example, 3G mobile phones may include circuitry (which we will call a media processor) that provides application developers with the ability to write pieces of code that make use of the acceleration built into the device, through specific APIs and compiler toolchains. This acceleration support can range from hardware-supported still image compression to fully fledged video coding and decoding abilities. However, these support (media) processors require the use of special APIs and cannot execute code that is not specifically supported in hardware. A media processor that is, for example, able to decode H263 video cannot be used to accelerate computer graphics. This clearly shows that for NVE-related applications (which rely extensively on 3D computer-generated graphical environments) to become successful on mobile platforms, some hardware support will be needed in order to keep battery consumption at acceptable levels. Although a very limited number of devices do include specific support for acceleration of 3D computer-generated content, their presence in off-the-shelf mobile hardware is currently lacking. A possible solution to some of these problems is to make use of a technique called remote rendering, in which a (collection of) server(s) is deployed specifically to continuously render 3D computer-generated images. These images are subsequently encoded in a pre-specified video format and transmitted over a network connection. At the client side, the stream can be decoded as a generic video sequence, making use of any available hardware acceleration. The main advantage of this architecture is that the (server-side) hardware can easily be adapted for this specific use. PC hardware to support rendering of 3D graphics has become commonplace and low in cost due to the high demand created by the computer game market. The same hardware can be used in remote render servers to generate the desired output images, keeping the total cost of such a system relatively low. In the specific case of mobile clients discussed here, with limited screen resolution, it is possible to have a single remote render server deliver images to a multitude of simultaneously

connected users. The issue of the transmission of the data related to state synchronization and of the resulting video streams from the remote render process is beyond the scope of this paper. We refer the interested reader to related work, such as [1]. This paper is structured as follows: in section II, we provide some pointers to related work. In section III we elaborate on the context in which the system was developed. Section IV focuses on the overall architecture of our system, including all elements needed for remote rendering, while section V describes the remote render server itself. In section VI we provide absolute numbers to support our findings about the applicability of remote rendering in this context. Section VII concludes with a summary and some pointers for future work.

II. RELATED WORK

Remote rendering has been the subject of previous research, however mainly for non-interactive applications. NVE applications differ from these usage scenarios precisely through their high degree of interactivity. This means that the entire architecture has to be adapted to respond effectively to input provided by the user. The ultimate goal should be to provide users on a mobile platform with the same degrees of freedom they have on a classic (PC-based) platform. They will thus provide input that triggers interaction with other users and/or objects in the virtual world, possibly causing a cascade of changes in all views to be rendered. In most circumstances, this user interaction will consist of a limited interface to control the virtual camera, such as in [2]. Other previous work shares more similarities with ours, such as [3], where the authors have used a server farm to generate the 3D images that are subsequently transmitted to the users. The fundamental difference is that in this case, the remote render servers are all-in-one solutions and do not appear to integrate seamlessly into an existing architecture. A well-known framework for remote rendering is Chromium [4]. This framework is used in several implementations, such as the one described above. We have opted not to use this framework due to the availability of in-house developed code and technologies that made integration with the existing infrastructure easier. Also, the limitation of Chromium to the OpenGL rendering technology may turn out to be an impediment in the future (due to the ever increasing popularity of Microsoft's DirectX technology). An even more limited user experience is delivered by the system described in [5], where users are only able to enter start and end points for an otherwise non-interactive walkthrough, after which a movie is rendered at the server side and transmitted to the mobile clients. The system described in [6] uses a web browser as the client application for displaying the 3D environment. Through the use of advanced rendering techniques such as image-based rendering, the authors claim to provide better render frame rates and higher compression rates. The authors of [7] employ the combined processing power of clients and server, whereby the clients are responsible for the rasterization of the server-generated 3D models. The visual quality of experience is

however lacking, due to limitations in the use of colors and shading.

III. USAGE SCENARIO CONTEXT

In order to clarify some of the choices that were made and that are discussed in the next section, it is necessary to first provide an overview of the usage scenario for the actual application to be developed with support for remote rendering. The application to be developed is a mobile city guide, which runs on off-the-shelf mobile hardware that is handed out to tourists visiting the city (through the visitor information centers) and/or hardware that is owned by inhabitants of the city. The system provides a 3D environment representing the city center. Through the use of active sensing devices, in this case GPS sensors and/or the available wireless network connection, the location of each device and user active in the city is known at all times. This information is subsequently transmitted to all other users that have shown an interest in this type of information, allowing people to track one another while walking through the city. Evidently, the tracking features of the system can easily and selectively be turned off, for example for privacy reasons. Specifically in the 3D city context, the generation of new information and accompanying metadata is a crucial contribution of the project. In a typical usage scenario, pictures and/or video fragments will be shot using the built-in camera of the mobile device the user is carrying, and subsequently annotated through the use of speech-to-text analysis. At the same time, the location at which the picture was taken will be added to the asset as metadata. The picture is instantaneously uploaded to the main database through the device's wireless network link and shared with other users of the system. This type of user-generated content will be visualized through the use of tags in the 3D environment, representing the particular type of data that is available at that particular location. Besides the rather trivial means of representing pictures as tags, the system also allows users to actively participate in extending and embellishing the 3D model of the city, by contributing their pictures to serve as textures to be placed on the facades of the various buildings in the city. Tags that are placed in various locations throughout the city may be commented on by other users, e.g. to provide reviews on restaurants, museums, etc.

IV. NVE SYSTEM ARCHITECTURE

Typically, NVE systems feature either a client-server or a peer-to-peer architecture. While a peer-to-peer architecture typically exhibits superior behavior in terms of scalability (and has therefore been the subject of much research, e.g. [8] and [9]), a client-server system is often chosen in practical applications. Some of the obvious advantages of such an architecture are ease of management, the ability to provide access to a persistent world and the fact that client state is easily synchronized. As described above, the usage scenario for our application is a mobile city guide, the concept of which relies heavily on the availability of a server infrastructure to store and retrieve the user-generated content.
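To make the description of the user-generated content more concrete, the sketch below shows one possible way to represent an uploaded asset and its metadata. The structure and field names are our own illustration and are not taken from the actual system.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical representation of a user-generated asset as described in
// section III: a picture or video fragment, annotated with a speech-to-text
// transcript and tagged with the GPS position at which it was captured.
struct GeoPosition {
    double latitude;   // WGS84 degrees
    double longitude;  // WGS84 degrees
};

enum class AssetType { Picture, VideoFragment };

struct UserGeneratedAsset {
    std::uint64_t             id;          // assigned by the Content server
    std::string               ownerId;     // user that uploaded the asset
    AssetType                 type;
    GeoPosition               location;    // where the picture/video was taken
    std::string               annotation;  // result of speech-to-text analysis
    std::vector<std::string>  comments;    // reviews added by other users
    std::vector<std::uint8_t> payload;     // encoded image or video data
    bool                      useAsFacadeTexture; // contribute to the 3D city model
};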

Fig. 1. Screenshot of system concept.

Fig. 2. NVE System Architecture.

A management system is also already present, based on accounting/billing servers and SIP servers for inter-person communication through Instant Messaging and audio/video calls. Given that the goal was to include remote rendering capabilities in the system, the logical choice was to build the system around a client-server architecture. At the same time, the targeted user group consists of a maximum of 250 people for trials, which poses no problems as far as World server load is concerned. The system architecture, including all main (server) components and their connections, is shown in Fig. 2. As should be clear from this picture, the server infrastructure consists of four main parts, namely the SIP server, the World server, the Content server and finally the subject of this work: the Remote Render server. Clients that wish to join the virtual world do so by first connecting to the SIP server and, after authentication and session setup is finalized, are assigned a World server to connect to. All state communication (positions, orientations, camera angles, ...) happens between the clients and the World server, without the need for extra inter-client connectivity. Only when audio/video communication is explicitly requested is the SIP server queried for the current location (IP address) of the other party, after which a direct connection is established for transmission of RTP data. The responsibility of the Content server is the management of all multimedia assets and associated metadata present in the virtual world. We should point out at this time that, taking into consideration all elements in the architecture described above, there is no distinction between clients that perform their rendering locally and those that use the remote rendering infrastructure. There is in fact no direct data flow originating from the mobile clients towards the remote render server (as is made clear by the unidirectional arrow in the diagram).
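As an illustration of the state communication between clients and the World server described above, the snippet below sketches what a minimal position/orientation update could look like. The message layout, the field names and the use of a plain packed struct are our own assumptions for illustration; the actual wire format used by the system is not specified in this paper.

#include <cstdint>

// Hypothetical state update sent from a client to the World server (and
// relayed from there to other interested clients and to the Remote Render
// server). Positions may originate from the GPS module or from keypad/stylus
// input, as described in section V.
#pragma pack(push, 1)
struct StateUpdate {
    std::uint32_t clientId;     // assigned during SIP session setup
    std::uint32_t sequence;     // for ordering/duplicate detection
    float         position[3];  // avatar position in world coordinates
    float         orientation;  // heading in degrees
    float         cameraPitch;  // camera angle for the rendered viewport
    std::uint64_t timestampMs;  // client clock, milliseconds
};
#pragma pack(pop)

static_assert(sizeof(StateUpdate) == 4 + 4 + 12 + 4 + 4 + 8,
              "packed layout, no padding");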

Fig. 3. Remote Render stages.

All information that is necessary for performing the rendering by the server can be retrieved from the World server. While this slightly increases the processing load on the clients that wish to use remote rendering, we feel that this is an acceptable compromise, given the fact that it greatly simplifies the communication flow between all types of devices in the architecture.

V. REMOTE RENDER SERVER

The Remote Render Server consists of two separate processes. The first process is responsible for generating the graphical content; in this case the graphical content is the view of a client on the 3D virtual city.

When a client wishes to use the facilities offered by the Remote Render infrastructure, a notification is sent to both processes of the server, see Fig. 3. In the first process (Render process) this notification results in the allocation of an available viewport to the client. In the second process (Customization process) a worker thread is started containing a video encoder that processes the incoming views into a video stream, which is sent back to the client, which in turn decodes the stream to get a view on the virtual world. This view can correspond to the user's view in the real world when GPS is used, but it can also deviate from the real world when the user chooses to use keypad/stylus input. In a typical usage scenario, the mobile client will automatically transmit positional data using a GPS module, indirectly telling the first process where to position the client in the virtual world. The user can also choose to send commands telling the World server to make his/her avatar move in a certain direction or to make his/her avatar turn around to get a better view on the surroundings. We should stress again that this data does not get sent directly to the Remote Render server, but rather to the World server, which will in turn deliver this information to the Remote Render server. In every render pass the 3D engine renders the scene in every viewport using the corresponding personalized virtual camera.

The second process (Customization) is responsible for grabbing the graphical content generated by the Render process. When a client connects to the virtual environment, the Customization process in the Remote Render server gets notified and starts a worker thread. Every worker thread corresponds to a client and contains an RGB buffer, a YUV buffer and an encoder. After every pass of the Render process, the main thread of the second process grabs the backbuffer of the Render process containing all the viewports. To efficiently grab the backbuffer of the first process, the Customization process installs hooks in the graphical library (OpenGL or DirectX) used by the engine in the Render process. Every time the render engine in the Render process calls the swap-buffer function, a function in the Customization process gets triggered to first copy the backbuffer to main memory before the actual swapping begins. The second process then has the backbuffer in main memory and can start processing the data contained in this buffer. First the buffer gets sliced into segments; each segment contains the data corresponding to one viewport. These segments are passed to a framebuffer manager. The main thread then notifies all the waiting worker threads that a new frame has arrived and is ready to be processed. Every worker thread retrieves the corresponding framebuffer and converts the data in this buffer from RGB to YUV. This step is necessary because nearly all codecs rely on the input frames being in YUV format. The encoder then takes this YUV frame and adds it to the video stream. When a worker thread has finished with that frame it puts itself on hold until it gets notified by the main thread. The resulting stream is sent back to the client, which in turn decodes the video stream to get a view on the 3D virtual city.
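The following sketch illustrates the kind of hook described above for the OpenGL case: a replacement for the swap-buffer call that first copies the backbuffer into main memory and hands it over before the real swap takes place. This is our own simplified illustration (Windows/wgl is assumed, and the hook installation itself, e.g. import-table patching or a proxy DLL, is omitted); it is not the authors' actual implementation.

#include <windows.h>
#include <GL/gl.h>
#include <cstdint>
#include <vector>

// Pointer to the original wglSwapBuffers, saved when the hook is installed
// (the installation mechanism itself is not shown here).
static BOOL (WINAPI *realSwapBuffers)(HDC) = nullptr;

void onBackbufferGrabbed(const std::vector<std::uint8_t> &rgb, int width, int height)
{
    // In the real system this is where the Customization side takes over:
    // the buffer is sliced into per-client viewports (PutBytes), handed to the
    // framebuffer manager and the worker threads are woken up. Crossing the
    // process boundary would require e.g. shared memory, which is omitted.
    (void)rgb; (void)width; (void)height;
}

// Replacement for wglSwapBuffers: copies the backbuffer to main memory
// before the actual swap happens, as described in section V.
BOOL WINAPI hookedSwapBuffers(HDC hdc)
{
    RECT rect{};
    GetClientRect(WindowFromDC(hdc), &rect);   // full backbuffer dimensions
    const int width  = rect.right - rect.left;
    const int height = rect.bottom - rect.top;

    std::vector<std::uint8_t> rgb(static_cast<size_t>(width) * height * 3);
    glPixelStorei(GL_PACK_ALIGNMENT, 1);       // tightly packed rows
    glReadBuffer(GL_BACK);
    glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, rgb.data());

    // Hand the frame to the Customization side, then perform the real swap.
    onBackbufferGrabbed(rgb, width, height);
    return realSwapBuffers ? realSwapBuffers(hdc) : TRUE;
}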

Every worker thread's encoder can be customized to best fit the client's needs. Parameters such as frames per second, bit rate, resolution and codec can be adapted. This way the stream can be tailored to best fit the receiving device.

VI. TEST RESULTS

In order to get accurate readings from our trials that can later be extrapolated, we have opted to use a single trial server with the following specifications: an Intel dual core 3 GHz CPU, 1 GB of main memory, an NVIDIA GeForce 7200 LE graphics accelerator and a standard installation of Windows XP. Given the fact that all of this hardware is commonplace, the inclusion of such an infrastructure will remain cost-effective for service providers. Several video codecs were considered for use in the architecture; however, given the limitations of mobile devices in terms of network throughput and screen resolution, the choice fell on the H263 and MPEG-4 codecs. These are also often supported by the media processors that may be present in 3G mobile devices. In this section we provide test results for both H263 and MPEG-4. Both codecs used are part of the LibAVCodec library, which in turn is part of the FFMPEG project [10]. It would be interesting to include H264 in these tests; however, due to license restrictions on the available encoders, this was not an option at the time of writing. The first stage in the remote rendering process is the actual rendering of the virtual world for all assigned clients, based on the state as received from the World server. However, as the performance of this stage depends heavily on the complexity of the scene that is being rendered and the available graphics accelerator, this stage is not included in the measurements. The second stage can roughly be divided into five functions: GetFrame, PutBytes, GetBytes, RGB2YUV and Encode. The GetFrame function comprises grabbing the backbuffer via the hooks in the graphical library and transferring this backbuffer from graphical memory into main memory. In the function PutBytes, the main thread slices the backbuffer into separate viewports and copies them into framebuffer objects contained in a thread-safe framebuffer manager. Every worker thread in turn calls GetBytes to retrieve the framebuffer from the framebuffer manager. Next, the RGB data contained within the framebuffer object is transformed into YUV data in RGB2YUV. Finally this YUV data is passed on to the encoder, which adds a new frame to its output stream. We have performed measurements on these five functions using the H263 and MPEG-4 codecs. For both codecs we measured results using QCIF (176 x 144) and CIF (352 x 288) resolution viewports. Every resolution was measured running 1, 4, 16 and 25 clients. The results show that GetFrame remains constant for all tests using CIF and also for all tests using QCIF (see e.g. figures 4(c) and 4(d)). The resolution of the backbuffer used was 1920x1440 with CIF viewports and 1024x768 with QCIF viewports. The cost of GetFrame is defined by the cost of copying data from graphical memory to main memory and, as the amount of data remained constant for all tests with CIF and all tests with QCIF, the time spent in this function is constant.
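Since both codecs come from LibAVCodec, the per-client worker described in section V can be sketched as follows, covering the encoder customization as well as the RGB2YUV and Encode steps measured in this section. This is a hedged illustration written against the current FFmpeg API (avcodec_send_frame/avcodec_receive_packet and libswscale for the RGB-to-YUV step), which differs from the libavcodec interface available when this work was carried out; parameter values such as the frame rate and bit rate are examples, not the settings used in the trials.

extern "C" {
#include <libavcodec/avcodec.h>
#include <libswscale/swscale.h>
}
#include <cstdint>
#include <stdexcept>

// One encoder per worker thread, configurable per client (resolution, frame
// rate, bit rate, codec), as described in section V.
class ClientEncoder {
public:
    ClientEncoder(int width, int height, int fps, int64_t bitRate,
                  AVCodecID codecId = AV_CODEC_ID_H263)
    {
        const AVCodec *codec = avcodec_find_encoder(codecId);
        if (!codec) throw std::runtime_error("encoder not found");

        ctx_ = avcodec_alloc_context3(codec);
        ctx_->width     = width;              // e.g. 176x144 (QCIF) or 352x288 (CIF)
        ctx_->height    = height;
        ctx_->time_base = AVRational{1, fps}; // e.g. 15 frames per second
        ctx_->framerate = AVRational{fps, 1};
        ctx_->bit_rate  = bitRate;            // tailored to the mobile link
        ctx_->pix_fmt   = AV_PIX_FMT_YUV420P; // codecs expect YUV input
        if (avcodec_open2(ctx_, codec, nullptr) < 0)
            throw std::runtime_error("cannot open encoder");

        // RGB -> YUV conversion context (the RGB2YUV step).
        sws_ = sws_getContext(width, height, AV_PIX_FMT_RGB24,
                              width, height, AV_PIX_FMT_YUV420P,
                              SWS_BILINEAR, nullptr, nullptr, nullptr);

        frame_ = av_frame_alloc();
        frame_->format = AV_PIX_FMT_YUV420P;
        frame_->width  = width;
        frame_->height = height;
        av_frame_get_buffer(frame_, 0);
        pkt_ = av_packet_alloc();
    }

    // Called by the worker thread for every viewport slice retrieved from the
    // framebuffer manager (GetBytes); performs the RGB2YUV and Encode steps.
    void encodeViewport(const std::uint8_t *rgb, int64_t pts)
    {
        const std::uint8_t *src[1] = { rgb };
        const int srcStride[1]     = { 3 * ctx_->width };
        sws_scale(sws_, src, srcStride, 0, ctx_->height,
                  frame_->data, frame_->linesize);
        frame_->pts = pts;

        avcodec_send_frame(ctx_, frame_);
        while (avcodec_receive_packet(ctx_, pkt_) == 0) {
            // sendToClient(pkt_->data, pkt_->size);  // transport to the mobile client not shown
            av_packet_unref(pkt_);
        }
    }

private:
    AVCodecContext *ctx_   = nullptr;
    SwsContext     *sws_   = nullptr;
    AVFrame        *frame_ = nullptr;
    AVPacket       *pkt_   = nullptr;
};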

Fig. 4. H263 timing results in msec. (a) QCIF, X axis shows stage in customization process. (b) CIF, X axis shows stage in customization process. (c) QCIF, X axis denotes number of clients. (d) CIF, X axis denotes number of clients.

The time spent in the function PutBytes increases with the number of clients. This function's initial cost is that of waiting for a lock on the framebuffer manager. Each client adds an additional cost in terms of copying the viewport data into a framebuffer object assigned to that client. Every worker thread has to wait for a lock on the framebuffer manager to get the content of the corresponding framebuffer object and copy the framebuffer data into an RGB buffer. This RGB buffer is converted by RGB2YUV into a YUV buffer. The cost of the functions PutBytes, GetBytes, RGB2YUV and Encode is determined by the amount of data per viewport and the number of clients. We aim for a video stream of 15 frames per second. The H263 codec produces satisfactory visual results for limited screen resolutions at 15 frames per second. Our measurements have resulted in the following figures for QCIF resolution with a load of 25 clients and H263 encoding: GetFrame 7.88 msec, PutBytes 0.042 msec, GetBytes 0.112 msec, RGB2YUV 0.749 msec and Encode 1.221 msec. This results in a total time of 10.09 msec per client per frame, or 61.48 msec per frame for a total of 25 clients. This translates into a frame rate of approximately 15 FPS. Using CIF resolution, the totals change to 38.372 msec per client per frame, or 164 msec per frame for all 25 clients. This translates into approximately 6 FPS.
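Our reading of how these per-function timings combine is sketched below; the exact accounting is not spelled out in the paper, so the model is an assumption on our part rather than the authors' formula. The backbuffer grab (GetFrame) is paid once per frame, while the remaining four functions are paid once per client:

T_frame ≈ T_GetFrame + N · (T_PutBytes + T_GetBytes + T_RGB2YUV + T_Encode)

With the QCIF/H263 figures and N = 25 this gives 7.88 + 25 · (0.042 + 0.112 + 0.749 + 1.221) ≈ 61.0 msec per frame, i.e. roughly 16 FPS, which is in line with the reported 61.48 msec and approximately 15 FPS.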

The main bottlenecks, however, are the GetFrame function and the RGB to YUV conversion, both of which will be subject to optimization (see the section on future work). In summary, we can say that by using QCIF resolution for the viewports and the H263 codec for encoding, the tests show that this approach is feasible for 25 simultaneous clients. Using CIF resolution, approximately half that number can be supported. It is also clear from the differences between figures 4 and 5 that the H263 codec is able to compress the frames faster than the MPEG-4 codec used.

VII. CONCLUSIONS AND FUTURE WORK

We have presented an architecture for networked virtual environments that incorporates remote rendering facilities. This allows devices with limited graphical abilities (such as 3G phones and low-end Personal Digital Assistants) to display attractive virtual worlds. Due to the nature of the network architecture, the remote rendering infrastructure integrates seamlessly with the existing servers. Clients in fact need not be aware whether other parties are using the remote rendering facilities or not. The remote render server gets its world state information directly from the World servers, thereby ensuring consistency for all connected users. Test results have shown that it is feasible to implement such an infrastructure on low-

Fig. 5. MPEG-4 timing results in msec. (a) QCIF, X axis shows stage in customization process. (b) CIF, X axis shows stage in customization process. (c) QCIF, X axis denotes number of clients. (d) CIF, X axis denotes number of clients.

cost off-the-shelf hardware, whereby it is possible to support at least 25 simultaneous users on a single server. It has also been shown that under these load factors, a sufficient frame rate can be maintained that preserves the end-user Quality of Experience, which is important due to the highly interactive nature of these NVE applications. Our future work consists of the addition of an in-house developed H264 encoder. We are also looking into ways of achieving a higher transfer speed between graphical and main memory, as well as eliminating the need for the costly RGB to YUV conversions that are currently required.

ACKNOWLEDGMENT

Part of this research is funded by IWT project MAVIC (040541) and IBBT project A4MC3. The authors would like to thank all members of the NVE research group at EDM for their support in developing the application presented here.

REFERENCES

[1] G. Kuhne and C. Kuhmunch, "Transmitting MPEG-4 video streams over the Internet: problems and solutions," in MULTIMEDIA '99: Proceedings of the Seventh ACM International Conference on Multimedia (Part 2). New York, NY, USA: ACM Press, 1999, pp. 135–138.
[2] D. Cohen-Or, Y. Noimark, and T. Zvi, "A server-based interactive remote walkthrough," in Proceedings of the Sixth Eurographics Workshop on Multimedia 2001. New York, NY, USA: Springer-Verlag New York, Inc., 2002, pp. 75–86.

[3] F. Lamberti, C. Zunino, A. Sanna, A. Fiume, and M. Maniezzo, "An accelerated remote graphics architecture for PDAs," in Web3D '03: Proceedings of the Eighth International Conference on 3D Web Technology. New York, NY, USA: ACM Press, 2003, pp. 55 ff.
[4] G. Humphreys, M. Houston, Y. Ng, R. Frank, S. Ahern, P. Kirchner, and J. Klosowski, "Chromium: A stream processing framework for interactive graphics on clusters," 2002. [Online]. Available: citeseer.ist.psu.edu/humphreys02chromium.html
[5] M. Brachtl, J. Slajs, and P. Slavk, "PDA based navigation system for a 3D environment," Computers and Graphics, vol. 25, no. 4, pp. 627–634, August 2001.
[6] I. Yoon and U. Neumann, "Web-based remote rendering with IBRAC (image-based rendering acceleration and compression)," Computer Graphics Forum, vol. 19, no. 3, 2000. [Online]. Available: citeseer.ist.psu.edu/yoon00webbased.html
[7] J. Diepstraten and T. Ertl, "Remote line rendering for mobile devices," in Proceedings of Computer Graphics International 2004. IEEE, 2004, pp. 454–461.
[8] P. Quax, P. Monsieurs, T. Jehaes, and W. Lamotte, "Using autonomous avatars to simulate a large-scale multi-user networked virtual environment," in Proceedings of the 2004 International Conference on Virtual Reality Continuum and its Applications in Industry (VRCAI 2004), Singapore. ACM Press, 2004, pp. 88–94.
[9] P. Quax, T. Jehaes, M. Wijnants, and W. Lamotte, "Mobile extensions for a multi-user framework supporting video-based avatars," in International Conference on Internet and Multimedia Systems, and Applications (IMSA 2005), Honolulu, Hawaii, USA. ACTA Press, 2005. Electronic proceedings.
[10] FFMPEG, http://www.ffmpeg.org.
