Enabling Hardware-Software Multitasking on a Reconfigurable Computing Platform for Networked Portable Multimedia Appliances

J-Y. Mignolet, S. Vernalde, D. Verkest*, R. Lauwereins**
IMEC vzw, Kapeldreef 75, 3001 Leuven, Belgium
* also Professor at Vrije Universiteit Brussel
** also Professor at Katholieke Universiteit Leuven
Nowadays many multimedia applications are emerging on portable appliances. However, these applications require both the flexibility of upgradable devices (typically software based) and a powerful computing engine (typically hardware). This application domain is therefore a very good target for reconfigurable computing. This paper describes a user scenario that motivates the introduction of reconfigurable hardware into portable devices, together with a vision of the platform that should be built to support reconfigurable computing. Solutions to both the hardware and software problems are presented. Finally, a first demonstrator is described.
1. Introduction
Nowadays many multimedia applications are emerging on portable devices such as personal digital assistants (PDAs) and mobile phones; examples are MP3 players, MPEG players, games and browsing. Multimedia applications are usually computationally intensive and exhibit a lot of parallelism, which prevents them from being implemented efficiently on general-purpose embedded processors. To achieve the minimal Quality of Service (QoS) required for these applications, traditional multimedia platform designs contain special hardware accelerators or application-specific instruction-set processors (ASIPs). The first approach does not provide flexibility, while the second offers improved computation power for a specific application domain only. If different applications have to run, different components have to be integrated on the same device; if a novel application emerges, a new platform has to be designed.

We are therefore building a general-purpose compute platform that can run different multimedia applications and offers enough flexibility to download and execute future applications. This platform should be powerful, flexible, energy conscious and inexpensive. The goal is to develop a programming environment for such a reconfigurable platform that allows the same ease of programming as a general-purpose processor platform today. Based on our experience in reconfigurable systems [1][2], we believe that the combination of instruction-set processors (ISPs) with reconfigurable hardware is the best trade-off for such a platform. Reconfigurable hardware provides low power consumption at high computation power thanks to its hardware characteristics on the one hand, and flexibility through its reconfiguration capabilities on the other. The platform has to support true hardware/software multitasking: tasks should be able to run either on an ISP or on the reconfigurable hardware, and an operating system will manage the applications by distributing the tasks over the available resources.

By developing this platform for multimedia applications, we have also found a good driver for reconfigurable computing. The domain has evolved considerably in recent years, but it is still looking for the killer application that can demonstrate its benefits. We think that multimedia applications on portable devices are certainly a place to find such an application.
The remainder of this paper is organized as follows: Section 2 describes the application scenario that motivated the design of the platform. Section 3 presents our vision of the hardware/software multitasking platform. Section 4 discusses the results we are achieving. Finally, conclusions are drawn in Section 5.

2. The application scenario

The driver application scenario is depicted in Figure 1. Mr. Smith powers on his portable device. He feels like watching a movie, so he presses the start button and begins enjoying the show (Figure 1.a.). Some time later, a scene bores him. He wants to play a 3D game for a while and resume watching the movie after the scene, so he wants the movie displayed in a small part of the screen (Figure 1.b.). Now that the scenario is described, we can analyze what happens inside Mr. Smith's portable device. We will first describe the platform and then the step-by-step behavior. The device is composed of one or more ISPs, one or more application-specific integrated circuits (ASICs) for functions like wireless communication or LCD management, and reconfigurable hardware.

When Mr. Smith starts his movie decoder for the first time, the reconfigurable hardware is free and completely available. As a result, the movie player can be started on it as a hardware task (Figure 1.a.). The computational power of the reconfigurable hardware allows playing the movie at full quality. When Mr. Smith decides to start the 3D game, there are not enough resources available on the reconfigurable hardware to run both applications. However, the movie can be downscaled in resolution and frame rate, resulting in reduced computational requirements for the player. The movie player can therefore be rescheduled on one of the ISPs as a software task. This frees up resources and allows the 3D game to be started as a hardware task.
Figure 1: Application scenario (a. a movie is displayed; b. 3D game with reduced movie inlaid)

What about running this on an architecture containing only fixed hardware? Some experiments on video and 3D sequences have been performed in-house on dedicated architectures. If a movie is played on a set-top
box (STB) with a TriMedia processor (TriMedia 1300 running at 166 MHz), a frame rate of 25 fps (frames per second) can be reached. If a 3D game is executed on the same STB, only 2 fps can be achieved (for 12800 triangles). On the other hand, the same 3D game runs in real time (32 fps) on a 3D graphics card (Neon250 board with a PowerVR chip). This shows that different multimedia content requires different architectures. Consequently, both architectures would have to be present on the device to enable the scenario. By running these applications on reconfigurable hardware, we circumvent this problem: we can download the architecture that best fits the application running on the device. We could even implement an ASIP tuned for each application and replace it every time we want to change the application.

This scenario can be extended to the task level. An application is typically composed of communicating tasks. It would therefore be interesting to have reconfigurable hardware that can run multiple separate (hardware) tasks at the same time. As a result, our platform could realize true hardware-software multitasking. Let us consider a reconfigurable hardware matrix containing n_h tiles, each able to run one task. An application i is divided into n_i tasks, and each task can run either in hardware or in software. The system will first try to schedule the tasks on the hardware tiles. If it fails - meaning that the total number of tasks over all applications is larger than the n_h available tiles - two solutions can be envisaged:
1. the operating system selects which tasks are critical and should run in hardware, and schedules the other ones on the ISP;
2. the tasks are dynamically scheduled one after the other on the hardware tiles.
The first solution implies that some applications receive more compute power than others, with the aim of generating the best trade-off for the user. The second solution is equivalent to considering the n_h tiles as n_h processors that each schedule multiple tasks sequentially; it requires the ability to perform a context switch of a hardware task.
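The first policy (the operating system keeps the most critical tasks in hardware and pushes the rest to the ISP) can be sketched as a greedy allocator. The sketch below is our own illustration, not the system's actual scheduler; the task names and criticality scores are hypothetical:

```python
def schedule(tasks, n_tiles):
    """Greedy hardware/software placement (illustrative sketch).

    tasks: list of (name, criticality) pairs; a higher criticality
    means the task benefits more from a hardware tile.
    Returns a dict mapping each task name to "HW" or "SW".
    """
    # Rank most critical tasks first, so they claim the hardware tiles.
    ranked = sorted(tasks, key=lambda t: t[1], reverse=True)
    placement = {}
    for i, (name, _) in enumerate(ranked):
        placement[name] = "HW" if i < n_tiles else "SW"
    return placement

# Hypothetical task sets for Mr. Smith's two applications: with only
# three tiles, the least critical tasks fall back to the ISP.
movie = [("video_decode", 5), ("audio_decode", 2)]
game = [("rasterize", 9), ("geometry", 7), ("ai", 1)]
print(schedule(movie + game, n_tiles=3))
```

Under this policy the game's rendering tasks and the video decoder occupy the tiles, while the low-criticality tasks run as software, matching the downscaled-movie behavior of the scenario.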
3. Our vision of the hardware/software multitasking platform
This section presents our vision of the platform that is needed to realize a scenario as described in Section 2.
Figure 2: Layered approach of the platform

Our approach can be sketched in three layers (Figure 2). The lowest layer is the physical platform itself, including the reconfigurable hardware and the ISPs. The middle and upper layers form the programming infrastructure that needs to be built on top of the physical platform to allow easy design of applications. The middle layer is the operating system, which has to enable multitasking of both hardware and software tasks and has to provide real-time services. The upper layer is a middleware layer that should deliver two types of services. First, it has to create an abstraction layer for programmers: they should be shielded from the internals of the platform, having a common way of coding their applications independently of the target device on which they run. Second, the middleware has to manage the usage of the resources in order to always provide the best trade-off in terms of QoS for each of the applications. Each layer is detailed in the next paragraphs, starting with the physical platform.
3.1 The reconfigurable SoC platform
As already mentioned, the platform is composed of ISP(s), ASIC(s) and reconfigurable hardware. According to the presented scenario, we want to achieve multitasking on both the ISPs and the
reconfigurable hardware. The key problem is to find suitable reconfigurable hardware that allows us to run multiple tasks in parallel.
To run independent tasks in parallel on one piece of reconfigurable hardware, we have to be able to remove and create tasks without affecting the others. On a Field Programmable Gate Array (FPGA), for example, we would like to be able to modify some part of the logic without affecting the rest. Two possibilities come into the picture:
1. we read back the configuration of the FPGA, update it and write back the modified configuration;
2. we only write part of the configuration each time, i.e. we perform a partial reconfiguration.
However, each partial reconfiguration would need a complete Place and Route (P&R) iteration, which would prevent run-time use. Indeed, the borders between the different blocks of the system are not fixed on the FPGA architecture and would have to be rearranged each time a modification is performed. A P&R run typically takes minutes to hours, and is therefore not possible when targeting run-time reconfiguration.

To avoid a complete P&R iteration for each reconfiguration, an additional layer with a fixed interface topology has to be implemented on top of the FPGA, raising the granularity of the architecture. We have created a coarse-grain platform containing a set of logical tiles that can be reconfigured separately. A fixed communication network between the tiles allows them to communicate with each other and with the ISPs. This separation between communication and computation enables an easy and flexible instantiation of new blocks of functionality in the system.

Figure 3: Packet-switched interconnection network (a grid of task tiles, each connected to 1D routers in the X and Y directions)

We selected a packet-switched interconnection network as a means to accomplish this (Figure 3). This architecture comes from the multicomputing world. It exists in different flavors, both in topology (k-ary n-cubes, hypercubes, butterflies, crossbars) and in routing type (wormhole, virtual cut-through, mad postman switching) [4]. The concept has been introduced in SoCs (Systems-on-Chip), allowing packets to be routed instead of wires [5]. Applied in the context of FPGAs, having no wires to route means that no cumbersome P&R iteration is required anymore. As communication model, the packet-switched interconnection network uses message passing. This model can also be used towards the ISPs, so that a unified scheme is obtained for hardware/hardware and hardware/software communication.

Today, state-of-the-art FPGAs allow partial reconfiguration of their functionality. For example, the Xilinx Virtex™ series [3] offers a reconfiguration speed of 50 Mbytes/s. Reconfiguring 10% of an 800 k system-gate device (which contains 4,715,616 configuration bits) would take about 1.1 ms. This is fast enough for user-initiated interaction (starting a new program, e.g. the 3D game) but probably too slow for dynamic context switching of tasks. Therefore, to realize a dynamic multitasking system, tasks will be spawned on the available resources, either a hardware tile or an ISP (the ISP runs a traditional multitasking operating system). If a task requires (for QoS reasons) to be executed in hardware, one of the less critical tasks running on a hardware tile will be rescheduled on an ISP, providing a free resource for the new task.

In short, we are building an interconnection network on top of a commercial FPGA in order to allow dynamic reconfiguration of the tasks running in hardware. The communication scheme is unified for both hardware and software tasks. Regarding the rest of the platform, fewer constraints come into the picture. At least one of the ISPs should be used to run the operating system. This ISP should therefore have a
connection to the reconfiguration interface of the FPGA. The ASIC part is there for managing fixed parts of the platform, like an LCD or connectivity (e.g., wireless).
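The reconfiguration-latency figure quoted in Section 3.1 follows from a back-of-the-envelope calculation. The 50 Mbytes/s bandwidth and the 4,715,616-bit configuration size are the values from the text; everything else is plain arithmetic:

```python
def reconfig_time_ms(config_bits, fraction, bandwidth_mbytes_s):
    """Time to rewrite `fraction` of a configuration of `config_bits`
    bits at `bandwidth_mbytes_s` Mbytes/s, in milliseconds."""
    config_bytes = config_bits / 8
    return (config_bytes * fraction) / (bandwidth_mbytes_s * 1e6) * 1e3

# 10% of an 800 k system-gate Virtex device (4,715,616 configuration
# bits) at the quoted 50 Mbytes/s reconfiguration speed:
print(round(reconfig_time_ms(4_715_616, 0.10, 50), 2))  # ~1.18 ms
```

The ~1.2 ms result is acceptable for user-initiated reconfiguration but, as the text notes, likely too slow for per-timeslice hardware context switches.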
3.2 Operating System
If we want to create a true hardware/software multitasking environment, we have to create an operating system that will manage the different applications running on the platform.
The main function of the operating system is to manage tasks. An application consists of a combination of tasks, and each task should run on one of the available resources. When a new task is started or an old task is deleted, some of the existing tasks might have to be rescheduled on another resource (e.g. to provide more computational power to the new task or to decrease power consumption). Applied to our architecture, task creation and deletion consist of the usual task handling for a software task, and of reconfiguring the FPGA for a hardware task. The operating system should, however, keep track of the available resources, i.e. the location of all running tasks (e.g. which tile of the FPGA). Task rescheduling is more complex. The main problem is to establish state equivalence between a software task and a hardware task: when a task is rescheduled from hardware to software or vice versa, its state information must be transferred in order to resume the task where it stopped.

The operating system should also handle the communication between tasks. Three cases are envisaged: communication between two hardware tasks, between a hardware task and a software task, and between two software tasks. The communication scheme we selected is message passing. Communication between hardware tasks is handled by the interconnection network: the operating system updates the routing tables of the tasks at run time, and the network drives the messages to the correct destination. For hardware-software communication, the hardware block uses a specific address in its routing table that corresponds to the operating system; the messages between the ISP and the FPGA are stored in buffers. For software-software communication, the same scheme (message passing) is used in order to have a unified representation of the communication between tasks, no matter where the tasks are located. In order to guarantee QoS in the system, real-time services should be provided by the operating system: the communication generates interrupts that must be handled fast enough (especially in the case of multimedia streams).
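The unified message-passing scheme can be modeled as a routing-table lookup: a destination resolves either to a tile address on the network or to an OS-managed buffer on the ISP. This is a minimal sketch of the idea only, not the actual operating-system code; all class, method, and task names here are hypothetical:

```python
class MessageRouter:
    """Illustrative model of the unified message-passing scheme.

    Every task, hardware or software, has a routing-table entry.
    Hardware destinations resolve to an (x, y) tile address on the
    packet-switched network; software destinations resolve to an
    OS-managed message buffer on the ISP.
    """
    def __init__(self):
        self.table = {}        # task name -> ("HW", (x, y)) or ("SW", None)
        self.isp_buffers = {}  # software task -> queued messages
        self.noc_log = []      # packets injected into the on-chip network

    def register(self, task, kind, tile=None):
        self.table[task] = (kind, tile)
        if kind == "SW":
            self.isp_buffers[task] = []

    def send(self, dst, payload):
        kind, tile = self.table[dst]
        if kind == "HW":
            # Inject a packet addressed to the destination tile.
            self.noc_log.append((tile, payload))
        else:
            # Buffer the message for the ISP-side task.
            self.isp_buffers[dst].append(payload)

r = MessageRouter()
r.register("video_decode", "HW", tile=(0, 1))
r.register("audio_decode", "SW")
r.send("video_decode", "frame-42")
r.send("audio_decode", "pcm-block")
```

Because the sender only names a destination task, rescheduling a task between hardware and software reduces to updating its routing-table entry, which is exactly what makes the scheme uniform.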
3.3 Middleware layer
The middleware layer has to fulfill two functions: platform abstraction and QoS-aware rescheduling of tasks. Why do we need platform abstraction? We want the user to have access to real-time updates of services. Using his or her wireless networked device, the user will download new services from service providers. A typical example is an applet that a user downloads while browsing the Internet. This applet is coded in a platform-abstracted way, using the JAVA™ framework; the code is then run on a JAVA virtual machine, which interprets the bytecodes of the applet. If we want to apply the same concept to a heterogeneous platform like ours, we need a virtual machine that can run applets in hardware as well as in software. The hardware virtual machine is a concept that has already been developed [6]. We should now combine the hardware and software virtual machines into one unique machine that can, from a single bytecode, spawn tasks both in hardware and in software.

This approach also alleviates the problem of the design entry for the platform. The application is split into tasks, and each task might run in hardware or software. But a software task is usually an object file, while a hardware task is represented by a bitstream. Therefore the level of abstraction should be raised. In our research roadmap, we plan a three-step approach to explore solutions to this problem. In a first phase, we will use several implementations of an application with different hardware/software partitionings, generated at design time; the different flavors provide different performance, and for each application one flavor can be selected at run time. At this stage we keep the traditional design workflow and end up with object files and bitstreams. In a second phase, the application will contain tasks for which both implementations are available; the operating system will select at task level whether a task runs in hardware or software. In a third phase, a unified language should be decoded by the operating system and executed on either the hardware or software resources.

The second and third phases present two possible solutions for the design entry. The second phase corresponds to a platform-dependent approach: every platform builder would provide its own hardware/software compiler, and every application would be represented by two sets of tasks (a hardware and a software version of each task). The third phase corresponds to platform abstraction, in the same way as JAVA does it today. Both solutions should be evaluated before any conclusions can be drawn. Using a compiler solution requires storing both views (hardware and software) of each task, increasing the memory footprint; on the other hand, a unified representation could generate overhead that reduces the performance of the platform.

The other functionality of the middleware is the QoS-aware rescheduling of tasks. We use recent research in the domain of multimedia terminal QoS [7][8] as an application driver to exploit the heterogeneity of the platform. The idea of terminal QoS is to adapt the decoding/rendering computations to the available power of the terminal. For instance, in the case of a 3D face that has to be rendered on a screen, both the number of projected pixels and the number of triangles can be varied to provide the best trade-off between visual perception quality for the user and computation time on the resource. The computation time should indeed be small enough to provide a reasonable frame rate. Applying this terminal QoS to our platform adds a new degree of freedom.
Indeed, it also gives the possibility to change the resources on which the different multimedia applications run. As explained in Section 2, when multiple applications run on the platform, only a limited number of tasks can be scheduled on the hardware tiles. The computation time of an application will of course depend on the number of its tasks that can be scheduled in hardware. Coming back to the example of the 3D face, in addition to varying the number of triangles and the number of projected pixels, the system can select the set of tasks that run in hardware. This modifies the computation time and increases or limits the frame rate. The QoS-aware rescheduling task of the middleware consists of determining the best trade-off in terms of QoS for the different applications that run simultaneously on the platform. By varying, for each application, the resources on which its tasks are running, different global QoS levels can be provided.
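The terminal-QoS knobs can be made concrete with a toy cost model. Everything below is an invented illustration (the linear cost in triangles and pixels, the per-tile speedup, and the cycle budget are all our own assumptions, not measured numbers); it only shows that both content downscaling and tile allocation move the frame rate:

```python
def frame_rate(triangles, pixels, hw_tiles, hw_speedup=10.0, budget=2.0e6):
    """Toy terminal-QoS model: frame cost grows with triangle count and
    projected pixels, and each hardware tile multiplies effective
    throughput. All parameters are illustrative assumptions."""
    cost = triangles * 50 + pixels            # abstract cycles per frame
    throughput = budget * (1 + hw_speedup * hw_tiles)
    return throughput / cost

# Three ways to raise the frame rate of a 12,800-triangle scene:
fps_all_sw = frame_rate(12_800, 320 * 240, hw_tiles=0)
fps_one_tile = frame_rate(12_800, 320 * 240, hw_tiles=1)       # add a tile
fps_fewer_triangles = frame_rate(3_200, 320 * 240, hw_tiles=0)  # downscale
```

The middleware's job is to search this combined space (content quality per application plus tile assignment across applications) for the best global trade-off.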
4. First proof of concept
Currently, we are designing a first demonstrator that will highlight some of the concepts presented in this paper. We will first describe the demonstrator, following the bottom-up approach of Section 3, and then point out which concepts we want to showcase with it.

The platform is composed of a board containing a Xilinx Virtex-II™ XC2V6000 FPGA together with some glue logic, and a Compaq iPaq™ handheld device. The FPGA is our reconfigurable hardware, on which we are designing a packet-switched interconnection network. The iPaq contains the rest of the platform. There is only one ISP, a StrongARM SA-1110 processor. The iPaq and the FPGA board are connected via the expansion bus of the iPaq. The communication between the processor and the network on the FPGA is performed by buffering messages in a dual-port RAM (DPRAM). The glue logic on the FPGA board allows the processor to perform a partial reconfiguration of the FPGA. The operating system is based on Linux™: a port of Linux is available for the iPaq, to which we have ported the real-time services, and we are creating the necessary extensions to enable the management of hardware tasks.

The scenario of the demonstrator follows the one presented in Section 2. A video decoder and a 3D game are developed on the platform, and different partitionings will be available. Depending on the number of tasks running on the hardware tiles, a different QoS level will be achieved for each application. If, for instance, one video task achieves 12.5 fps in hardware, we can use two tiles to reach 25 fps; if the resources are not available (because they are used by the 3D game), we can only have 12.5 fps. Another example: we could achieve 25 fps with one task but only for a quarter of an image. In this case we need four tiles for a complete image; if only one is available, we can still have a smaller-size movie decoded in real time.

With this demonstrator, we want a proof of concept for three main topics. The first is the infrastructure that allows multitasking on an FPGA; for this we have to develop an interconnection network and exploit the partial reconfiguration abilities of the newest FPGAs. The second is an operating system that can manage both hardware and software tasks. The third is an application scenario that demonstrates the necessity and capabilities of reconfigurable computing.
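The tile/frame-rate arithmetic of the demonstrator (12.5 fps per video tile, capped at the 25 fps target) can be captured in a one-line policy. The function name and the linear-scaling assumption are ours, used only to illustrate the QoS choice:

```python
def video_fps(free_tiles, fps_per_tile=12.5, target=25.0):
    """Demonstrator-style QoS choice (illustrative sketch): the decoder
    uses as many free tiles as it can, capped at the target rate."""
    return min(free_tiles * fps_per_tile, target)

print(video_fps(1))  # 12.5 (the 3D game holds the other tiles)
print(video_fps(2))  # 25.0 (full frame rate with two tiles)
```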
5. Conclusions
In this paper, we have presented a scenario that motivates the use of reconfigurable hardware in embedded multimedia terminals. We presented our vision of the architectural challenges and the required programming environment for such a platform. Finally, we described our first demonstrator, which will showcase this vision.
6. References

[1] D. Desmet, P. Avasare, P. Coene, S. Decneut, F. Hendrickx, T. Marescaux, J.-Y. Mignolet, R. Pasko, P. Schaumont, D. Verkest: Design of Cam-E-leon: A Run-time Reconfigurable Web Camera, Lecture Notes in Computer Science, Springer-Verlag (accepted).
[2] D. Verkest, D. Desmet, P. Avasare, P. Coene, S. Decneut, F. Hendrickx, T. Marescaux, J.-Y. Mignolet, R. Pasko, P. Schaumont: Design of a Secure, Intelligent, and Reconfigurable Web Cam Using a C Based System Design Flow, Proceedings of the 35th Asilomar Conference on Signals, Systems & Computers, pp. 463-467, Pacific Grove, CA, Nov. 2001.
[3] Xilinx: Virtex™ Data Sheet, http://www.xilinx.com/partinfo/ds003.htm
[4] J. Duato, S. Yalamanchili, L. Ni: Interconnection Networks: An Engineering Approach, September 1997, ISBN 0-8186-7800-3.
[5] W. J. Dally, B. Towles: Route Packets, Not Wires: On-Chip Interconnection Networks, Proceedings of the 38th DAC, pp. 684-689, Las Vegas, NV, June 2001.
[6] Y. Ha, B. Mei, P. Schaumont, S. Vernalde, R. Lauwereins, H. De Man: Development of a Design Framework for Platform-Independent Networked Reconfiguration of Software and Hardware, FPL'01, pp. 264-273, Belfast, UK, August 2001.
[7] G. Lafruit, N. Pham Ngoc, W. van Raemdonck, J. Bormans: Terminal QoS for Real-time 3D Visualization Using Scalable MPEG-4 Coding, IEEE Transactions on Circuits and Systems for Video Technology (submitted), 2001.
[8] G. Lafruit, L. Nachtergaele, K. Denolf, J. Bormans: 3D Computational Graceful Degradation, Proceedings of the ISCAS Workshop and Exhibition on MPEG-4, pp. III-547 - III-550, May 28-31, 2000.