A Component-based Coordination Language for Efficient Reconfigurable Streaming Applications
Maik Nijhuis and Herbert Bos and Henri E. Bal
Vrije Universiteit, Computer Systems Group, Amsterdam, The Netherlands
{maik,herbertb,bal}@cs.vu.nl

Abstract

Consumer electronics applications are becoming increasingly complex because of increased functionality requirements, such as watching multiple compressed video streams on a single screen. We address this complexity by allowing a programmer to specify the application in terms of independent components. Components interact using streaming communication and by sending and receiving events. From this component specification, the executable is generated. We use the Hinch run time system and the SpaceCAKE architecture to validate the effectiveness of our approach. Because the specification language is generic, the application can easily be ported to different platforms.

Keywords: Streaming, components, reconfigurability, user interaction, XML, consumer electronics

1. Introduction

Currently, software for consumer electronics (CE) devices is typically developed using C and/or hand-coded assembly. This approach works very well for simple CE devices with limited functionality, such as calculators or digital watches. However, CE applications are becoming increasingly complex for three main reasons. First, the individual functions to perform are increasingly complicated. For example, portable media players are nowadays typically used to play digitally compressed media formats. To achieve higher compression ratios, increasingly complex (de)compression algorithms are supported. Second, CE applications have to be capable of performing multiple different functions, often at the same time. For example, many modern cell phones can be used to play music or listen to the radio while communicating with multiple base stations. Third, CE platforms increasingly exhibit parallel architectures, for example Cell [14] and SpaceCAKE [22]. For optimal performance, the application has to be written as
a parallel application dealing with synchronization, communication and load balancing, amongst others. Moreover, with the increase in complexity it becomes challenging to handle dynamic reconfiguration and user event handling. To address these issues we have proposed the SP@CE framework [29], in which we model the application as a Series-Parallel (SP) graph of interacting components. Each component implements a specific function of the application. Components are connected using streams. A stream is a data structure in which the data is only used for a limited amount of time. It is typically implemented using a FIFO queue. To ease the transition to SP@CE, components can still be developed using the traditional languages for CE programming, such as C and even assembly. In our view the end-user writes an application by specifying components in optimized C code and by specifying the application graph at a higher level, e.g., using a front-end with a graphical user interface. This front-end then uses a coordination language for the internal representation of the application graph. In this paper, we present a new coordination language, called XSPCL (pronounced as x-special). XSPCL should be seen as a generic intermediate language for efficient streaming applications. It is based on SPC-XML [11]. XSPCL supports task- and data parallelism. Dynamically reconfigurable applications are also supported, as well as interactive applications that respond to asynchronous user events. As XSPCL is based on XML, it is extensible. We have implemented a prototype XSPCL processing tool that implements these features. This tool converts an XSPCL specification into a runnable C program, which contains the glue code between the various components. It is linked to the component code and to the Hinch run time system [20]. Hinch provides automatic load balancing using a central job queue. It runs the application in a data flow style by putting a job in this queue for each component that is ready to be run. Furthermore, Hinch provides generic functions for streaming and event communication. The overhead of XSPCL is negligible because the generated glue code is only run at initialization time, or when the program is reconfigured. During normal operation, the
application speed is limited by that of the components and the run time system, not by the glue code.

The main contributions of this paper are:

1. We present a new coordination language for streaming applications with support for task parallelism, data parallelism, dynamic reconfigurability, and asynchronous user interaction.

2. Using experiments with a prototype implementation, we show that the overhead of XSPCL is low.

3. We show that XSPCL can be used to express efficient parallel applications for a shared-memory Multi-Processor System-on-Chip (MPSoC) architecture.

In the remainder of this paper, we first give an overview of the required properties of XSPCL. The next section then describes the coordination language itself. Section 4 presents experiments with three applications, investigating XSPCL overhead and parallel performance. Section 5 describes related work. Section 6 concludes the paper and gives directions for future work.

2. Required properties of XSPCL

The design of XSPCL is focused on streaming applications, because many challenging CE applications fall into this category. Streaming applications consist of multiple components that are organized into a task graph. The task graph determines the scheduling dependencies between the components. The application is run as a series of iterations of the task graph, in which each node is executed once. Streams provide the main communication primitive between components. The data in a stream is only used in the current and possibly a few next iterations, after which it is discarded. To cope with this kind of application, XSPCL should be able to express the following items:

1. The task graph of components. We have chosen the Series-Parallel Contention (SPC) model [28] for expressing the task graph. In this model, the graph is specified recursively by combining subgraphs using sequential and parallel constructs. The leaves in this hierarchy tree are the individual components. Both task- and data-parallel constructs can be used to build the task graph. Typically, task-parallel parts contain data parallelism, but the reverse is also possible. SPC allows efficient performance prediction, while the performance penalty compared to non-SPC applications is very small [10]. For embedded and CE applications, performance prediction is important because the application has to process data and events in a timely manner. Performance prediction can be used to verify that the application meets its deadlines. Moreover, it can be used to tune application parameters to make the application more efficient.

2. Procedural abstraction. To avoid specifying similar subgraphs multiple times, one should be able to encapsulate these subgraphs into procedures. These procedures can then be instantiated at multiple positions in the application, possibly using different parameters.

3. Communication between components. Communication should be abstract in the sense that a component does not need to know to which other component(s) it is connected. This way, a component can easily be reused in a different application, as long as it is connected to compatible components. Two forms of communication are identified:

(a) Streams provide a synchronous communication primitive for large pieces of data. Each component has a fixed number of i/o ports to which streams can be connected. When the component is run, it reads the data at its input ports. This data has been written to the stream by another component that has been scheduled earlier in the iteration. Similarly, it has to write data to its output ports. This data will be read by other components that will be scheduled later in the iteration.

(b) Events provide an asynchronous communication primitive for small pieces of data. When run, a component can send an event to another component, for example when it detects that the user has pressed a key. Events can be sent or received at any moment, independent of the current iteration. The action to perform can vary from changing the component's behavior by adjusting parameters to a complete reconfiguration of a subgraph of the application, in which multiple components are destroyed and/or created. In non-interactive applications, events can be used to respond to special input values.

4. Reconfigurability. Parts of the application may be enabled or disabled at runtime, for example in response to an event. This yields two requirements for XSPCL:

(a) One should be able to declare entire subgraphs as optional.

(b) A graph with optional subgraphs must have a special container that specifies the reconfigurable part of the application. This container is needed to keep the contained subgraph in a consistent state. For example, nodes in the optional subgraph can be connected using streams to other nodes in the contained subgraph. When these connections are made, the nodes in the optional subgraph have to synchronize with the other nodes in the contained subgraph. As we will show in the next section, the container is also a convenient place to define the relation between events and the option(s) that need to be enabled.

5. New primitives. XSPCL should be extensible. It should be easy to add new primitives to XSPCL, for example new parallelization strategies.

3. XSPCL

Figure 1. Position of XSPCL within a framework (front end, XSPCL, run time system (RTS) and prediction tool)

Since XSPCL is derived from SPC-XML [11], most of its XML tags resemble SPC-XML tags. However, XSPCL is novel because it supports domain-specific features such as streaming communication, reconfiguration and user event handling. XSPCL also allows special groups with non-SP dependencies. As shown in Figure 1, XSPCL should be seen as an intermediate language between a front-end, in which the end-user expresses the application using a graphical interface, and the compiled executable. This way, the XSPCL specification can also be fed to a performance estimation tool that provides feedback for parallelization decisions. When XSPCL is used in a streaming application development platform, the end-user of this platform is ideally not aware of the XSPCL layer. The framework design is discussed in more detail in [29]. We have currently implemented XSPCL, the run time system [20], a performance prediction tool [30], and a conversion tool from XSPCL to an executable that uses the run time system. The front-end and a conversion tool from XSPCL to the performance prediction tool still have to be developed and are shown using dashed lines in Figure 1. XSPCL can specify programs at multiple abstraction levels. At the most abstract level, all possible parallelism in the application is expressed. Implementation details, such as the exact number of data-parallel copies of a component, are left out. The abstraction is lowered by filling in these details, for example based on performance prediction results. We chose to base XSPCL on SPC-XML, and we chose an XML-based language because XML processing tools are widely available, XML specifications are readable by humans, and an XML-based language can easily be extended with new primitives. In the remainder of this section we will describe XSPCL in a bottom-up fashion.

3.1. Components

A component is the most basic structure in an XSPCL specification. Components implement the basic functionality of the application. An example of a component that implements a spatial down scaler is given in Figure 2. This component reduces the size of each frame at its input stream by a given down scale factor. The resulting pixels are written to its output stream. The class attribute is used to find the C function that initializes the component. Multiple instances of the same class can be defined. A component can have two kinds of parameters:

1. Stream parameters. These are used to establish streaming connections between the components. In the example, input images are read from the 'big' stream and output images are written to the 'small' stream.

2. Initialization parameters. These are used to configure the component. In the example, the 'factor' parameter specifies that the input images are to be reduced in size by a factor of 3. Initialization parameters can also be used to supply pointers to data that is shared between multiple components. However, this use is discouraged since mutually exclusive access to this data is not guaranteed. Moreover, if two components synchronize using shared data, this might interfere with the scheduler and cause deadlock.

Besides these parameters, a separate tag may be used to give the component a reconfiguration request upon creation. A component may have a reconfiguration interface at which it listens for reconfiguration requests. For example, a picture-in-picture blender can support changing the position of the blended picture. The reconfiguration interface is also used to tell a component which part of the input it has to process when it is run in data-parallel mode. Data parallelism will be explained in Section 3.3.

Most components do not need any resources beyond exclusive CPU and memory access, which are implicitly assigned to the component as it is being scheduled. These components always run to completion, and the design of XSPCL guarantees that the application will never deadlock. However, when multiple components need external resources, deadlocks and livelocks could occur because of resource contention. It is up to the component designer to prevent deadlocks and livelocks in this case.

3.2. Procedures

A procedure is used to specify a generic set of components that can be inserted into another procedure using a call tag. This way, procedural abstraction is facilitated.

Figure 2. Example component
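The XML listing of Figure 2 is not reproduced in this version. As a rough illustration, a component specification along the lines described in Section 3.1 might look as follows; the element names (component, streamparam, initparam) are assumptions made for this sketch, and only the class attribute and the 'big', 'small' and 'factor' parameters are taken from the text:

    <component name="scaler" class="downscale">
      <!-- stream parameters: connect the component's i/o ports to streams -->
      <streamparam port="input"  stream="big"/>
      <streamparam port="output" stream="small"/>
      <!-- initialization parameter: reduce each input frame by a factor of 3 -->
      <initparam name="factor" value="3"/>
    </component>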

Figure 3. Example procedure with a call to it

All procedures must have a unique name. One procedure, named 'main', is special: it defines the top-most procedure in the application. An example of a procedure and its caller is shown in Figure 3. The parameter specification in a call is the same as in a component, as explained above; in this case it denotes the actual parameters given to the procedure. The parameter specification in a procedure denotes the formal parameters of the procedure. These can be used by the components inside the body of the procedure. The body of a procedure describes the contained component tree, which is built using components, procedure calls, or one of the other structures described later in this section. Although procedures can be nested, recursion is currently not supported, as there is no way to terminate the recursion.
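The XML of Figure 3 is likewise not reproduced. The following hypothetical sketch of a procedure and a call to it uses element names assumed for illustration; the component names x and y and the 'main' procedure are taken from the text:

    <procedure name="scale_and_blend">
      <!-- formal parameters, usable by the components in the body -->
      <streamparam name="in"/>
      <streamparam name="out"/>
      <body>
        <component name="x" class="downscale"> ... </component>
        <!-- y is specified after x, so it is scheduled after x -->
        <component name="y" class="blend"> ... </component>
      </body>
    </procedure>

    <procedure name="main">
      <body>
        <!-- instantiate the procedure with actual parameters -->
        <call procedure="scale_and_blend">
          <streamparam name="in" stream="camera"/>
          <streamparam name="out" stream="screen"/>
        </call>
      </body>
    </procedure>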

3.3. Parallelism

By default, when two components are specified after one another, they are scheduled sequentially. Components x and y in Figure 3 are an example of this. To exploit parallelism in the underlying architecture, XSPCL has several ways of specifying parallelism in the application. First of all, multiple iterations of the application may be active concurrently. Sequentially scheduled components can be run in parallel in a pipeline style, executing different iterations of the pipeline. No special tags are needed to exploit pipeline parallelism, as the underlying runtime system automatically starts multiple concurrent iterations.

Figure 4. Examples of parallel specifications

Parallelism within an iteration has to be specified using a parallel tag, as shown in Figure 4. Each parallel tag contains one or more parblocks, which indicate the components that are to be run in parallel. Parallel groups may be nested to exploit multiple types of parallelism. Currently three types of parallelism are supported. These are indicated by the shape attribute (a sketch of such specifications follows after this list):

1. Task. Once an iteration reaches the parallel region, each parblock is scheduled in parallel. When all parblocks have finished, the successors of the parallel block are scheduled.

2. Slice. Only one parblock is allowed when using this shape. Upon creation of the parallel group, the components inside the group are copied n − 1 times, where n is an attribute of the parallel group. Each copy, including the original component, is given its position within the group together with the group size. All copies are then run in parallel, while each copy only operates on the region it has been assigned to. In the case of images, these regions correspond to horizontal slices of the image, hence the name 'slice'.

3. Cross dependencies ('crossdep'). This works the same as the slice shape, except that multiple parblocks are allowed, with special dependencies between the parblocks. The components inside each parblock are again copied n − 1 times. The dependencies are shown in Figure 5: a component with slice i is scheduled when slices i − 1, i, and i + 1 in the previous parblock have finished. At the start of the parallel region, all copies of the first parblock are scheduled. The whole region is completed when all copies of the last parblock have finished.

Figure 5. Cross dependencies (slices i − 1, i, i + 1 of parblock j and the corresponding slices of parblock j + 1)

Cross dependencies are very helpful for parallelizing a sequence of image filters. These filters need the data in the next or previous slice when computing boundary pixels. This structure does not adhere to the Series-Parallel paradigm. It shows that optimized subgraphs with non-SP dependencies can easily be expressed in XSPCL. If performance prediction is required on this structure, it has to be transformed into SP form by adding a synchronization point between the parblocks.
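Since the listing of Figure 4 is not reproduced, the following hypothetical sketch illustrates the three shapes. The names parallel, parblock, shape, 'slice' and 'crossdep', and the attribute n follow the text; the component names and classes are assumptions:

    <!-- task parallelism: both parblocks are scheduled in parallel -->
    <parallel shape="task">
      <parblock> <component name="idct_u" class="idct"/> </parblock>
      <parblock> <component name="idct_v" class="idct"/> </parblock>
    </parallel>

    <!-- data parallelism: the single parblock is copied n-1 times -->
    <parallel shape="slice" n="8">
      <parblock> <component name="scaler" class="downscale"/> </parblock>
    </parallel>

    <!-- cross dependencies between the data-parallel copies of two parblocks -->
    <parallel shape="crossdep" n="9">
      <parblock> <component name="blur_h" class="blur_horizontal"/> </parblock>
      <parblock> <component name="blur_v" class="blur_vertical"/> </parblock>
    </parallel>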

3.4. Reconfigurability

Each component or group of components can be declared optional by encapsulating it inside an option tag. The option must be contained inside a special manager structure. This manager implements the container mentioned in Section 2. A manager is responsible for the consistency of its contained subgraph. The manager is invoked twice in every iteration: at the entrance of its subgraph, when its subgraph is about to be scheduled, and at the exit of its subgraph, when the whole subgraph has completed the iteration. It can halt the managed subgraph for reconfiguration by suspending the execution of its subgraph. Each manager is associated with an event queue. Using an event queue, components can send events to each other. In our prototype implementation, the event queue for a sending component is supplied using an initialization parameter, as described in Section 3.1. When the manager is invoked, it polls the event queue for events. One or more of the following actions can be defined for each event:

• Enable, disable or toggle an option. The event is ignored when the option is already in the required state.
• Forward the event to another event queue.
• Send a reconfiguration request to all components in the managed subgraph.

Figure 6. Manager example

When an option needs to be enabled or disabled, the manager first halts its subgraph. When it is halted, reconfiguration is performed. For options that need to be enabled, the components that need to be added are created and initialized as soon as the event is detected, even though the contained subgraph is still active. This way, reconfiguration time is reduced, as these components do not have to be created and initialized during reconfiguration. When the subgraph is quiescent, there are only two simple actions left to perform. First, the created components are added to the subgraph. Second, the new components are synchronized to the subgraph.
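The XML of Figure 6 is not reproduced either. A hypothetical sketch of a manager with an optional subgraph and an event mapping, with element names assumed for illustration, might look like this:

    <manager>
      <!-- map an incoming event to an action on an option -->
      <event name="pip2_key" action="toggle" option="second_pip"/>
      <body>
        <component name="blend1" class="blend"> ... </component>
        <!-- optional subgraph that can be switched on and off at run time -->
        <option name="second_pip">
          <component name="scale2" class="downscale"> ... </component>
          <component name="blend2" class="blend"> ... </component>
        </option>
      </body>
    </manager>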

4. Experiments

To investigate the suitability of XSPCL for expressing various streaming applications, we have used a cycle-accurate simulator for the Philips SpaceCAKE [22] architecture, which simulates a tile with at most 9 TriMedia cores. A TriMedia is a VLIW processor aimed at multimedia applications. On a tile, each TriMedia has its own level 1 cache; the level 2 cache is shared between all TriMedias. On this architecture, we have run three applications:

1. Picture-In-Picture (PiP). This application reads multiple uncompressed video files and combines these into a single video file. One file contains the background video, which is simply copied. The other files contain the picture-in-picture videos. These videos are scaled down in size by a factor of 4 and blended into the background video. Task parallelism is exploited by processing these components in a pipeline, and by processing the various color fields in the images concurrently. Data parallelism is exploited by running the down scaler and blender using 8 slices. The size of the image frames is 720x576.

2. JPEG Picture-In-Picture (JPiP). This application is similar to PiP. The input videos consist of compressed JPEG images instead of uncompressed video. Besides down scaling and blending, the application also has to decode the JPEG images. The application structure, with one picture-in-picture, is shown in Figure 7. The figure shows the components and the communication streams. The application has been converted to Series-Parallel form by introducing synchronization points between the operations. For example, before the Blend components are run, all Downscale and IDCT components must have finished. Again, task parallelism is exploited by processing the components in a pipeline, and by processing the various color fields in the images concurrently within each operation. Data parallelism is exploited by running the IDCT, down scale and blend components using 45 slices. The input image size is 1280x720. The down scale factor is 16.

Figure 7. JPiP with one picture-in-picture (components MJPEG input, JPEG decode, IDCT Y/U/V, Downscale Y/U/V, Blend Y/U/V and Output, connected by streams)

3. Blur. A 3x3 or 5x5 Gaussian blurring kernel is applied to the luminance field of a 360x288 uncompressed video file. The standard deviation of both kernels is set to 1. The 5x5 kernel thus gives a more accurate result than the 3x3 kernel, but is computationally more intensive. The kernel is separated into a horizontal and a vertical phase. The two phases are run in parallel using cross dependencies, as explained in Section 3.3. 9 data-parallel slices are used.

The PiP and Blur applications process 96 image frames. Because of the limited simulation speed, the JPiP application processes 24 image frames. To exploit pipeline parallelism, as explained in Section 3.3, five iterations are simultaneously scheduled.
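To illustrate how such an application graph might be put together from the constructs of Section 3, the sketch below outlines a possible main procedure for the Blur application. It is not taken from the paper: the component classes, stream names and element names are all assumptions for this sketch; only the horizontal/vertical split, the crossdep shape and the 9 slices follow the description above.

    <procedure name="main">
      <body>
        <component name="reader" class="video_input">
          <streamparam port="output" stream="raw"/>
        </component>
        <!-- horizontal and vertical blur phases with cross dependencies,
             each copied into 9 data-parallel slices (Section 3.3) -->
        <parallel shape="crossdep" n="9">
          <parblock>
            <component name="blur_h" class="blur_horizontal">
              <streamparam port="input" stream="raw"/>
              <streamparam port="output" stream="half_blurred"/>
              <initparam name="kernel" value="3x3"/>
            </component>
          </parblock>
          <parblock>
            <component name="blur_v" class="blur_vertical">
              <streamparam port="input" stream="half_blurred"/>
              <streamparam port="output" stream="blurred"/>
              <initparam name="kernel" value="3x3"/>
            </component>
          </parblock>
        </parallel>
        <component name="writer" class="video_output">
          <streamparam port="input" stream="blurred"/>
        </component>
      </body>
    </procedure>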

4.1. Sequential overhead

First we have investigated the overhead of using XSPCL. To do this, we compare the XSPCL versions of our applications to hand-written sequential versions that do not use the Hinch runtime system. PiP and JPiP are run with one and two picture-in-pictures (PiP-1 and PiP-2). Blur is run both in 3x3 and 5x5 mode. The results are shown in Figure 8.

Figure 8. Sequential overhead (run time in millions of cycles of the sequential and XSPCL versions of PiP-1, PiP-2, JPiP-1, JPiP-2, Blur-3x3 and Blur-5x5)

The sequential versions of PiP and JPiP combine several operations, for example down scaling and blending, into a single function. In the XSPCL versions, these operations are performed by two separate components. This induces overhead because data has to be buffered in a stream between the two components. For PiP-1 and PiP-2, this results in a total overhead of 5%. The JPiP application clearly suffers more from this behavior, as its overhead is 18%.

Profiling information of the XSPCL JPiP application shows that the number of cache misses is significantly higher than when the sequential version is run. This issue can be addressed in future versions by grouping several components into a group that is scheduled as one entity. The consumer components in such a group will then be run immediately after the producers, when the data is still in the cache. However, this approach reduces the amount of parallelism in the application, so it might degrade the parallel performance of the application. Choosing the right balance is subject to further research.

In the sequential Blur application, no operations are combined. The performance of the XSPCL version is even slightly better than that of the sequential version. Profiling information shows the caching behavior is the same for both applications. As the performance difference is small (< 1.1%), we attribute it to measurement noise.

Figure 9. Speedup on the SpaceCAKE simulator (speedup versus number of nodes (1-9) for PiP-1, PiP-2, JPiP-1, JPiP-2, Blur-3 and Blur-5, with the ideal speedup shown for reference)

4.2. Parallel Performance

To verify the power of the XSPCL parallelization constructs, we have run the applications on multiple nodes. Figure 9 shows the results. All speedup measurements are relative to the fastest sequential version of the application; for Blur, this is the parallel version. When a parallel version is run on 1 node, parallelization overhead is reduced by disabling all synchronization operations. All applications exhibit good efficiency. JPiP performs worse because its overhead compared to the sequential counterpart is relatively high, as explained in Section 4.1. The Blur applications perform best, because they have the largest computation-to-communication ratio.

4.3. Reconfiguration overhead

To investigate the overhead of the reconfiguration primitives in XSPCL, we have created reconfigurable variants of the three applications. JPiP-12 and PiP-12 start with one picture-in-picture and switch a second picture-in-picture on and off every 12 frames. Blur-35 switches between the 3x3 and 5x5 kernel every 12 frames. The run times of these applications, divided by the average run time of the corresponding static applications, are shown in Figure 10. Although reconfiguration occurs very often, the overhead stays below 15%.

Figure 10. Reconfiguration overhead (overhead in percent versus number of nodes (1-9) for PiP-12, JPiP-12 and Blur-35)

When the application is stopped for reconfiguration, the amount of parallelism in the application drops until the application is run sequentially. Thus, on average there is less parallelism to exploit in the reconfigurable applications, and the reconfigurable applications will perform relatively worse on larger numbers of nodes. This causes the reconfigurability overhead to increase with the number of nodes, which is clearly visible in Figure 10.

When a reconfiguration request arrives and the program is made idle, it can occur that all threads finish the available work simultaneously. It can also occur that most threads have to wait for a single thread to finish. In the first case, more parallelism is exploited than in the second case, on average, and the reconfiguration overhead will be lower. Therefore, small variations in reconfiguration overhead do occur, as shown in Figure 10.

5. Related work

In application areas other than consumer electronics, UML is often used to aid software design, for example in Rhapsody by I-Logix [12] and AndroMDA [3]. Rhapsody is used to create reliable embedded systems, for example for the aviation industry. AndroMDA targets enterprise applications. In these tools, UML is used as a high-level coordination or modeling language, from which the application is generated. Rhapsody also uses UML to verify the correctness of the application and to prevent design flaws. Lee also argues that using separate coordination and implementation languages provides the means to easily build reliable parallel applications [17]. Two notable exceptions to this approach are the X10 and Threaded-C languages [6, 24]. X10 is based on Java, while Threaded-C has its roots in C. Both languages include various high-level primitives for expressing parallelism. Coordination primitives have thus been incorporated into the implementation language, instead of having a separate coordination language. Although these languages provide valuable abstractions for parallel programming, they do not provide the higher abstractions that XSPCL and other coordination languages provide. Threaded-C applications can be run efficiently in parallel using the EARTH run time system [26]. Although much work has been put into the design of X10, we are unaware of any performance figures that show the efficiency of X10 for (CE) applications on parallel hardware.

The component model of XSPCL resembles that of the Common Component Architecture (CCA) [2], which targets Grid applications rather than embedded CE applications. The main difference is that XSPCL uses components to exploit parallelism within an application. CCA, on the other hand, provides a generic interface for complete (parallel) applications and aims at connecting these applications using this interface. ASSIST [1] is another component-based framework for Grid applications. It includes the ASSISTcl coordination language. Like XSPCL, ASSIST features communication using events and streams. It also supports dynamic reconfiguration based on the availability of resources.

Dryad [13] has a similar approach to ours in the sense that it also uses a graph of computational elements with communication ports. These ports can be connected using various communication mechanisms, including files, shared-memory FIFOs and TCP pipes. The main difference is that Dryad targets coarse-grained data-parallel applications, where a single job processes gigabytes or more. In contrast, we primarily support fine-grained parallelism, where the job input size is less than a megabyte. Another difference is that we also support multiple forms of task parallelism besides data parallelism. Furthermore, dynamic reconfigurability is not supported in Dryad, although plans exist to add support for this feature in the future [13].

Our data flow approach is also used in the Nizza multimedia run time system [23]. However, Nizza does not offer run time reconfiguration primitives, as far as we know. We are also unaware of any coordination language using Nizza. Another multimedia run time system that is similar to our Hinch run time system is Decklight [4]. It has been integrated into the Soundium multimedia authoring platform, which acts as a graphical front-end to Decklight [18]. Decklight aims at integrating various external multimedia components. Although this is also possible using XSPCL and Hinch, we aim at self-contained components that do not need external functionality.

Many other systems adhere to the streaming programming paradigm. Although these systems provide a useful abstraction of streaming communication, they lack a separate coordination language such as XSPCL, which provides an even higher abstraction level to the application developer. Among these are Space-Time Memory [21], StreamIt [25], Brook [5], and the Stream Virtual Machine [16]. Streaming is often used within the Kahn Process Network model [15], in which the application is split up into multiple parallel threads with specific functions. These threads communicate using streams. Within Philips, several of these systems have been developed [8, 27, 19].

6. Conclusion and Future Work

We have presented XSPCL, a coordination language for streaming consumer electronics applications. Many advanced structures can easily be expressed in XSPCL. Besides streaming communication, these include task and data parallelism, procedural abstraction, dynamic reconfiguration and asynchronous user event handling. To evaluate the language, we have implemented a prototype XSPCL processing tool that converts an XSPCL specification to a running application. Experiments have shown that XSPCL incurs low overhead, and that applications are able to scale well on a parallel shared-memory architecture.

In the future, we plan to use XSPCL for applications on various other architectures. First, we will investigate how we can develop efficient applications for the Cell processor [14], which has fast specialized vector engines. We will create special wrapper components that interact with these vector engines. We also plan to look at High Performance Computing (HPC) applications on shared-memory, distributed-memory and hybrid architectures. We think XSPCL can easily be extended to support these kinds of applications, as long as the application can be expressed as a streaming application. An example application is the processing of data from radio telescopes. Modern radio telescopes produce huge data streams (>100 Gb/s) and require compute power in the order of teraflops [9].

Another research direction is skeletal parallelism [7]. The various shapes of parallelism we have shown already implement a skeletal template. This can be extended to the components themselves: template components can be developed for certain classes of algorithms. Using the initialization parameters, different instances can be created. Currently, XSPCL does not provide the means to express deadlines in real-time systems. However, an XSPCL specification could be used to estimate the worst-case execution time by recursively traversing the component graph. This is also subject to future research.

7. Acknowledgments

This work is supported by the Dutch government's STW/Progress project DES.6397. We would like to thank Kees van Reeuwijk and the anonymous reviewers for providing feedback on this paper.

References

[1] M. Aldinucci, S. Campa, M. Coppola, M. Danelutto, D. Laforenza, D. Puppin, L. Scarponi, M. Vanneschi, and C. Zoccolo. Components for high performance grid programming in Grid.it. In V. Getov and T. Kielmann, editors, Proc. Workshop on Component Models and Systems for Grid Applications, CoreGRID series, Saint Malo, France, Jan. 2005. Springer.
[2] B. A. Allan, R. Armstrong, D. E. Bernholdt, F. Bertrand, K. Chiu, T. L. Dahlgren, K. Damevski, W. R. Elwasif, T. G. W. Epperly, M. Govindaraju, D. S. Katz, J. A. Kohl, M. Krishnan, G. Kumfert, J. W. Larson, S. Lefantzi, M. J. Lewis, A. D. Malony, L. C. McInnes, J. Nieplocha, B. Norris, S. G. Parker, J. Ray, S. Shende, T. L. Windus, and S. Zhou. A component architecture for high-performance scientific computing. Intl. J. High-Perf. Computing Appl., 20(2):163-202, Summer 2006.
[3] AndroMDA. http://www.andromda.org.
[4] S. M. Arisona, S. Schubiger-Banz, and M. Specht. A real-time multimedia composition layer. In Proc. AMCMM '06, pages 97-106, New York, NY, USA, Oct. 2006. ACM Press.
[5] I. Buck, T. Foley, D. Horn, J. Sugerman, P. Hanrahan, M. Houston, and K. Fatahalian. Brook. http://brook.sourceforge.net.
[6] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. 20th Annual Conference on Object Oriented Programming, Systems, Languages, and Applications, pages 519-538, San Diego, California, USA, Oct. 2005. ACM.
[7] M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press & Pitman, 1989.
[8] E. A. de Kock, W. J. M. Smits, P. van der Wolf, J.-Y. Brunel, W. M. Kruijtzer, P. Lieverse, K. A. Vissers, and G. Essink. YAPI: application modeling for signal processing systems. In Proc. 37th Design Automation Conference, pages 402-405, New York, NY, USA, 2000. ACM Press.
[9] C. de Vos, K. van der Schaaf, and J. Bregman. Cluster computers and grid processing in the first radio-telescope of a new generation. In Proc. CCGRID '01, 2001.
[10] A. González-Escribano, V. Cardeñoso-Payo, and A. J. van Gemund. On the loss of parallelism by imposing synchronization structure. In Proc. IASTED Intl. Conf. on Parallel and Distributed Systems, pages 251-256, Barcelona, Spain, June 1997.
[11] A. González-Escribano, A. J. van Gemund, and V. Cardeñoso-Payo. SPC-XML: A structured representation for nested-parallel programming languages. In Proc. Euro-Par '05, pages 541-555. Springer, Sept. 2005.
[12] I-Logix. Rhapsody. http://www.ilogix.com/.
[13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. 2nd European Conference on Computer Systems, Lisbon, Portugal, March 21-23, 2007. Also available as MSR-TR-2006-140.
[14] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589, Sept. 2005.
[15] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information Processing, pages 471-475, Stockholm, Sweden, Aug. 1974. North Holland, Amsterdam.
[16] F. Labonte, P. Mattson, W. Thies, I. Buck, C. Kozyrakis, and M. Horowitz. The stream virtual machine. In Proc. 13th Intl. Conf. on Parallel Architectures and Compilation Techniques, pages 267-277, Antibes Juan-les-Pins, France, Sept. 2004.
[17] E. A. Lee. The problem with threads. IEEE Computer, 39(5):33-42, May 2006.
[18] P. Müller, S. M. Arisona, S. Schubiger-Banz, and M. Specht. Interactive media and design editing for live visuals applications. In Proc. Intl. Conf. on Computer Graphics Theory and Applications, pages 232-241, Feb. 2006.
[19] A. Nieuwland, J. Kang, O. P. Gangwal, R. Sethuraman, N. Busá, K. Goossens, R. P. Llopis, and P. Lippens. C-HEAP: A heterogeneous multi-processor architecture template and scalable and flexible protocol for the design of embedded signal processing systems. Design Automation for Embedded Systems, 7(3):233-270, 2002.
[20] M. Nijhuis, H. Bos, and H. E. Bal. Supporting parallel reconfigurable multimedia applications. In Proc. Euro-Par '06, LNCS 4128, pages 765-776, Dresden, Germany, Aug. 2006. Springer. Accepted as distinguished paper.
[21] U. Ramachandran, R. S. Nikhil, N. Harel, J. M. Rehg, and K. Knobe. Space-time memory: A parallel programming abstraction for interactive multimedia applications. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 183-192, Atlanta, Georgia, 1999.
[22] P. Stravers and J. Hoogerbrugge. Single chip multiprocessing for consumer electronics. In Bhattacharyya, editor, Domain-Specific Processors, chapter 11. Marcel Dekker, 2003.
[23] D. Tanguay, D. Gelb, and H. H. Baker. Nizza: A framework for developing real-time streaming multimedia applications. Technical Report HPL-2004-132, HP Labs, Palo Alto, Aug. 2004.
[24] K. B. Theobald, J. N. Amaral, G. Heber, O. Maquelin, X. Tang, and G. R. Gao. Overview of the Threaded-C language. CAPSL Technical Memo 19, University of Delaware, Mar. 1998.
[25] W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proc. 11th Intl. Conf. on Compiler Construction, pages 179-196, Grenoble, France, Apr. 2002.
[26] G. Tremblay, C. J. Morrone, J. N. Amaral, and G. R. Gao. Implementation of the EARTH programming model on SMP clusters: a multi-threaded language and runtime system. Concurrency and Computation: Practice and Experience, 15(9):821-844, Aug. 2003.
[27] P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, and G. Essink. Design and programming of embedded multiprocessors: an interface-centric approach. In Proc. Intl. Conf. on Hardware/Software Codesign and System Synthesis, pages 206-217, Sept. 2004.
[28] A. J. van Gemund. SPC: A model of parallel computation. In Proc. Euro-Par '96, pages 397-400, Sept. 1996.
[29] A. L. Varbanescu, M. Nijhuis, A. González-Escribano, H. Sips, H. Bos, and H. Bal. SP@CE - an SP-based programming model for consumer electronics streaming applications. In Proc. 19th International Workshop on Languages and Compilers for Parallel Computing, LNCS 4382 (to appear). Springer, Nov. 2006.
[30] A. L. Varbanescu, H. Sips, and A. van Gemund. PAM-SoC: A toolchain for predicting MPSoC performance. In Proc. Euro-Par '06, LNCS 4128, pages 111-123, Dresden, Germany, Aug. 2006. Springer.