Adaptive resource allocation for streaming applications

Timon D. ter Braak, Hermen A. Toersche, André B.J. Kokkeler, Gerard J.M. Smit
Computer Architectures for Embedded Systems Group
Department of Electrical Engineering, Mathematics and Computer Science
University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{t.d.terbraak, h.a.toersche, a.b.j.kokkeler, g.j.m.smit}@utwente.nl

Abstract—Streaming applications often have latency and throughput requirements due to timing-critical signal processing, or due to timely interaction with their environment. Mapping such applications to a multi-core architecture is mostly done at design-time, in order to be able to analyze the complex design space. However, such design flows cannot deal with a dynamic platform or a dynamic set of applications: hardware faults and resources claimed by other applications may render the assumed resources inaccessible. To avoid the assumptions a fixed resource allocation poses on the state of the platform, applications should be designed with location transparency in mind. We require applications to be analyzed at design-time to determine the required resource budgets. Sufficient performance is then guaranteed when such applications are mapped onto an architecture in which each resource is arbitrated by a budget scheduler. Within the Cutting edge Reconfigurable ICs for Stream Processing project, a many-core platform is developed that adheres to these requirements. Using the configuration features of the platform, the system is able to control which resources are used by the applications. This paper shows that run-time resource allocation can effectively adapt to the available set of resources, providing partial distribution transparency to the user. As an example, a GNSS receiver is mapped to the platform containing faulty hardware components. A few resources are critical, but in most cases, adequate resources can be allocated to the application.

I. INTRODUCTION

Multiprocessing has been common in embedded systems for decades. Rather than aiming at affordable solutions for general applications, embedded system architectures are tuned to the constraints and economics of a specific application. This often involves the use of multiple processors, some or all of which are specialized for the task at hand. A commonly employed component is the digital signal processor (DSP), which is specialized to efficiently execute tight multiply-accumulate loops, enabling high-performance, real-time signal processing [1].
Within the Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project [2], a scalable and dependable reconfigurable many-core system concept has been developed, named the "General Stream Processor" (GSP).

This research is conducted within the FP7 Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project (ICT-215881) supported by the European Commission, and partly by the Sensor Technology Applied in Reconfigurable systems for sustainable Security (STARS) project. For further information: http://www.starsproject.nl/

Figure 2 shows

Figure 1. Resource allocation at run-time deals with an unknown set of applications and a dynamic platform. (Diagram: an application presents its resource demand to the resource manager, which checks the platform's resource availability, receives updates from the platform, and admits or rejects the application.)

the demonstration platform, consisting of 5 Reconfigurable Fabric Devices (RFDs), each hosting 9 XENTIUM DSP cores.
The XENTIUM is a 32-bit VLIW DSP core with 10 functional units that can operate in parallel [3]. The 45 cores are interconnected by a NoC that supports both packet-switched connections (best-effort) and circuit-switched connections (guaranteed throughput) [4].

Figure 2. A General Stream Processor instantiation of one GPD and 5 RFDs: (a) hardware verification board; (b) schematic view showing the various components.

As more resources are added, both the dependability of the platform and the connectivity between the resources become more important. To improve the dependability of the platform, hardware support has been added to perform structural tests of the XENTIUM cores. Together with a software-based functional test of the interconnect, this provides the platform with a hardware fault detection mechanism. In some domains, hardware is required to remain functional for years; in others, a very short mean time to repair is required (e.g., a war zone), or replacing the hardware is simply not possible (e.g., space). Run-time resource allocation can then provide graceful degradation of the platform by working around faulty hardware components that are not critical to the system.
Signal processing applications often demand predictable latency: the input and output ports operate in the real-time domain, so correctness is affected by timing. Temporal irregularities should be masked by buffers or by massively overdimensioned performance, both of which are expensive. Currently, only multi-core architectures with restricted resource sharing can be analyzed effectively. Synchronous dataflow is often used to analyze the performance of applications while considering resource allocations [5]. Wiggers et al. provide the dataflow techniques to analyze the behavior of run-time scheduling [6]. Using these techniques, we can determine the minimum time per interval (budget) a task or communication channel needs from a resource to meet its timing constraints. We can calculate these budgets for applications in isolation; therefore, we do not need to know the set of applications in advance. Over time, new or updated applications may be added without the need to change other parts of the system. A resource manager may also allow unanticipated combinations of applications to execute simultaneously on a non-ideal set of resources, without violating their timing constraints. Figure 1 summarizes the context of this paper, in which we assume an unknown set of applications and a dynamic platform.
The organization and contributions of this paper are as follows. In Section III, we propose a platform model in which the heterogeneity of real platforms can be described, as opposed to the regular (meshed) platforms often used in related work. Section V shows that, from a resource management perspective, faults in the hardware are just resources that are unavailable: resource allocation upon start-up of an application should disregard any type of unusable resource. As an illustration of our claims, Sections VI and VII present the results of mapping a GNSS receiver to a platform with faulty components.

II. RELATED WORK

Chou and Marculescu address fault-aware resource management in [7]. Although they state that faults in an MPSoC result in irregularity, they assume a 6x6 meshed MPSoC platform containing only two types of tiles: computational cores and memories.
In order to configure and start applications, a more detailed description of the hardware is required, resulting in more types of resources. Link contention is undesired in their best-effort system, but allowed. For our application domain, real-time streaming applications, this is not allowed. Their approach results in highly clustered mappings of applications, whereas in most platforms communication becomes the bottleneck, especially when accessing memories or I/O ports. The fault-aware mapping algorithm proposed by Chou and Marculescu is tailored to these assumptions, and therefore has to be adapted to work with other architectures. An adaptive, two-step approach with region forming of resources is presented in [8].
Reasoning about the global state of the platform often results in centralized approaches [9]. In [10], a central entity initially distributes applications into virtual clusters, after which a packing strategy tries to map their tasks close to each other. For larger platforms, centralized resource management might not scale sufficiently. In [11], a decentralized heuristic is presented that combines a spring layout approach with Tabu search. A combined distributed and fault-tolerant approach is presented in [12], which integrates the task mapping algorithm into the routing protocol.
An interconnect topology must be exploited effectively; how a network is used is determined by the routing strategy. In macroscopic networking, routing is based on dynamically constructed tables that map addresses to routing hops. For high-performance computing this approach is already considered prohibitive, so one should not expect it to be usable for NoCs in general. General architectures use regular addressing, allowing routes to be constructed on demand. In our real-time digital signal processing context, the situation is different: communication partners rarely change, and we wish to control channel utilization [13]. Rather than depending on addressed routing, small (semi-)static source routing tables are often used. Routes are computed and adapted at design-time [14], [15] or at run-time [13], [16], [17].
The latter approach is followed in our platform.
Energy consumption is used in [7] as an optimization or decision criterion. Realistic models of the energy (in)directly consumed by applications are too detailed for evaluation at run-time. Simpler models are too inaccurate, due to deviations of the application behavior from the specification, the layout of the architecture, and variations in silicon. We assume that longer communication routes take more energy, and we use a cost factor to distinguish between different types of communication links.

III. PLATFORM MODEL

Resource management requires knowledge about the platform that has to be controlled and maintained. In conventional systems, information about the underlying hardware is provided by a board support package, used to initialize data structures within the operating system kernel and accompanying drivers. Such information mostly concerns the local processing node and its peripherals. Larger systems have to maintain information about the current state of multiple processing nodes and their connectivity. Within a hierarchical platform model, information about the available computation as well as communication resources of multiple nodes can be combined to get a global view of the platform.
We propose a model that describes the amount, type and connectivity of the resources provided by the platform. At the top level, the platform is described by a single element, which contains components and links between those components. Within components, a new level of elements is defined. This allows the platform definition to be queried at various levels of granularity. An element is of a given element type, which allows an unbounded number of different resources to be described in a single platform specification. Each element contains a schedule that is used to account the budgets of tasks that have access to the modeled resource. The status of an element indicates the fault status of the element and carries a timestamp of the last performed test. Recursively, each element contains a subset of (more fine-grained) resources that are spatially related to each other. The structure of the platform model we propose is given below as a regular expression (* means zero or more; + means one or more):

    platform     → element
    element      → element_type components links incoming outgoing schedule status
    element_type → (T_ARM | T_XENTIUM | T_MEMORY | ... | T_FPGA)
    link         → link_type components links incoming outgoing schedule status
    link_type    → (T_NOC | T_MCP)
    components   → (element_type element+)*
    links        → (link_type link+)*
    incoming     → (element | link)*
    outgoing     → (element | link)*
    status       → (S_CORRECT | S_UNKNOWN | S_ABSENT | S_FAULTY
                   | S_FAULTY_CORE | S_FAULTY_MEMORY) timestamp

In our platform model, the structures describing links and elements in the platform have the same signature. Different types are used to distinguish between them, but they are modeled, queried and scheduled in an identical manner. We believe that the given platform model can describe any platform that contains resources suitable for budget scheduling.

Figure 3. The Reconfigurable Fabric Device (RFD). (Diagram: nine XENTIUM DSP tiles XE0-XE8 and two 64KB memory tiles connected by NoC routers, plus a Dependability Manager (DM) tile, RFD registers, SMR, and MCP/DLI off-chip links.)
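As an illustration, the recursive element structure of the platform model could be rendered in C roughly as follows. This is a sketch with invented type and field names, not the actual CRISP data structures; the schedule and incoming/outgoing fields are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical C rendering of the platform-model grammar. */
enum element_type { T_ARM, T_XENTIUM, T_MEMORY, T_FPGA, T_RFD, T_NOC, T_MCP };
enum fault_status { S_CORRECT, S_UNKNOWN, S_ABSENT, S_FAULTY,
                    S_FAULTY_CORE, S_FAULTY_MEMORY };

struct element {
    enum element_type type;
    enum fault_status status;
    time_t            last_test;    /* timestamp of the last performed test */
    struct element  **components;   /* finer-grained elements (recursive)   */
    size_t            n_components;
    struct element  **links;        /* links share the same signature       */
    size_t            n_links;
};

/* Count the elements of a given type that are currently usable,
 * recursing through the component hierarchy. */
static size_t count_usable(const struct element *e, enum element_type t)
{
    size_t n = (e->type == t && e->status == S_CORRECT) ? 1 : 0;
    for (size_t i = 0; i < e->n_components; i++)
        n += count_usable(e->components[i], t);
    return n;
}

/* Toy example: one RFD with nine XENTIUM tiles, one marked faulty. */
static size_t demo_usable_xentiums(void)
{
    static struct element xe[9];
    static struct element *kids[9];
    static struct element rfd;
    for (int i = 0; i < 9; i++) {
        xe[i].type = T_XENTIUM;
        xe[i].status = S_CORRECT;
        kids[i] = &xe[i];
    }
    xe[4].status = S_FAULTY;   /* as if reported by the Dependability Manager */
    rfd.type = T_RFD;
    rfd.status = S_CORRECT;
    rfd.components = kids;
    rfd.n_components = 9;
    return count_usable(&rfd, T_XENTIUM);
}
```

A resource manager could use such a query to check, before allocation, how many fault-free instances of a resource type remain.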

A. Example: The CRISP 46-core platform

Within the Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project [2], a scalable and dependable reconfigurable many-core system concept has been developed, named the "General Stream Processor" (GSP). Figure 2a shows the hardware verification board, which hosts two kinds of chips: Reconfigurable Fabric Devices (RFDs) are used for signal processing, and one General Purpose Device (GPD) controls the platform. The FPGA is mainly used as a data generator for predefined test scenarios. Figure 2b shows a schematic overview of the platform at the board level. At this granularity, the platform is described as one element (the board), having three sets of resources as components (the GPD, 5 RFDs, and an FPGA), which are connected using off-chip connections as links. Each element can be queried when more detail is desired. For example, Figure 3 shows the information contained in the elements that model each of the RFDs. The hierarchy in the model can be exploited for representation to the user; it can be exported as a resource filesystem or in a human-readable format. For example, "gsp0.rfd2.xe1" refers to XENTIUM number 1 on RFD 2 within GSP 0.

B. Platform discovery and fault detection

For initialization of the system, a platform specification is used that describes the platform as it is assumed to be present. The status of each resource in that initial platform specification is set to 'absent'. First, a probing mechanism detects whether boards and chips are actually present. Then, techniques similar to those described in [18] and [19] are used to perform functional tests on the interconnect. Links that are found to be fault-free are marked as such in the platform model. After the interconnect has been tested, the XENTIUM cores and memories are examined. Figure 3 shows the Dependability Manager (DM), positioned in the bottom left corner of each RFD. This tile provides hardware support to perform structural tests of the XENTIUM cores, using the NoC as test access mechanism [20]. Using the principle of majority voting, triples of XENTIUM cores are tested for faults. The platform model is then updated according to the latest test results. When applicable, similar techniques are applied to other types of elements. These tests can be scheduled periodically to monitor the platform during operation. The set of available resources can thus change at run-time, either by changing the status of the elements and links, or by altering the structure of the platform model (reflecting the addition or removal of hardware).

IV. APPLICATIONS

Each application that is to be used in combination with the run-time resource manager has to be described in a predefined format. An application is defined by a set of tasks. Each task must have at least one implementation, which denotes to which types of elements the task can be mapped. These types are not limited to hardware blocks used for processing, but can also describe other hardware. The on-board FPGA, the Dependability Manager and the RFD registers are examples of fixed hardware components that may also be required by tasks within an application. In this approach, the framework supports application-specific needs to use I/O ports or to configure hardware, with an unbounded degree of heterogeneity. A task is linked to other tasks through ports; an input port (C_IN) is matched with an output port (C_OUT) within the same application, giving the notion of 'channels'. Dataflow analysis of the application at design-time gives the required scheduling budgets for each task and each channel. A more detailed description of an application specification is given below:

    application    → name task+
    task           → name port* implementation+
    port           → name (C_IN | C_OUT) budget
    implementation → element_type budget initialization*
    initialization → (I_RESET | I_SEND_MAIL | ... | I_LOAD_BINARY)
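The application grammar might map onto C types as in the following sketch (names are ours, not the CRISP implementation). It also shows the pairing rule for channels: every C_OUT port must have a C_IN port with the same name somewhere in the application.

```c
#include <assert.h>
#include <string.h>

enum port_dir { C_IN, C_OUT };

struct port {
    const char   *name;   /* channel name; an IN/OUT pair forms a channel  */
    enum port_dir dir;
    unsigned      budget; /* communication budget from dataflow analysis   */
};

struct task {
    const char  *name;
    struct port *ports;
    unsigned     n_ports;
    /* implementations (element_type, budget, initialization*) omitted */
};

/* Return nonzero if every C_OUT port in the task set has a matching
 * C_IN port somewhere in the same application. */
static int channels_complete(const struct task *tasks, unsigned n_tasks)
{
    for (unsigned t = 0; t < n_tasks; t++)
        for (unsigned p = 0; p < tasks[t].n_ports; p++) {
            if (tasks[t].ports[p].dir != C_OUT)
                continue;
            int matched = 0;
            for (unsigned u = 0; u < n_tasks && !matched; u++)
                for (unsigned q = 0; q < tasks[u].n_ports; q++)
                    if (tasks[u].ports[q].dir == C_IN &&
                        strcmp(tasks[u].ports[q].name,
                               tasks[t].ports[p].name) == 0) {
                        matched = 1;
                        break;
                    }
            if (!matched)
                return 0;
        }
    return 1;
}

/* Trimmed toy example mirroring the input chain of Figure 5. */
static int demo_channels_ok(void)
{
    static struct port rf_ports[]   = { { "rf_input", C_OUT, 15 } };
    static struct port fifo_ports[] = { { "rf_input", C_IN, 15 },
                                        { "input_ip", C_OUT, 15 } };
    static struct port ip_ports[]   = { { "input_ip", C_IN, 15 } };
    static struct task tasks[] = {
        { "RF Front End",     rf_ports,   1 },
        { "Input FIFO",       fifo_ports, 2 },
        { "Input Processing", ip_ports,   1 },
    };
    return channels_complete(tasks, 3);
}
```

Such a check could reject malformed application specifications before any resources are allocated.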

A. Example: A Global Navigation Satellite System receiver

In the CRISP project, a Global Navigation Satellite System (GNSS) application is specified and designed for the GSP platform. The three main blocks in any GNSS receiver are i) a radio front end for analogue signal processing, ii) a digital baseband processing part, and iii) navigation calculus to determine PVT (position, velocity and time) from measured pseudoranges. The GNSS application is able to solve the PVT of the receiver if four or more satellites are tracked successfully. Thus, after an initial acquisition stage, the receiver application should always have four or more cores running a tracking process to enable navigation. If a satellite is lost, no satellite data is available until the reacquisition procedure has been performed and concluded.

Figure 4. Simplified task graph of the GNSS application (tasks: RF, IP, Acq 0-1, Tra 0-3, Result (Res) + Control)

Figure 4 shows a simplified version of the GNSS application; the entire task graph consists of 17 tasks and 68 unidirectional channels. The implementation steps towards the CRISP GNSS application are explained in more detail in [21]. Figure 5 shows a partial application specification that describes the GNSS application in terms of the given application structure.

    { "name": "RF Front End",
      "ports": [ { "channel": "rf_input", "out": 15 } ],
      "implementations": [ { "type": "RFFrontEnd" } ] },
    { "name": "Input FIFO",
      "ports": [ { "channel": "rf_input", "in": 15 },
                 { "channel": "input_ip", "out": 15 } ],
      "implementations": [ { "type": "SmartMemory", "memory": 0x8000 } ] },
    { "name": "Input Processing",
      "ports": [ { "channel": "input_ip", "in": 15 },
                 { "channel": "ip_fifo0", "out": 15 },
                 { "channel": "in_tra0",  "out": 15 } ],
      "implementations": [ { "type": "Xentium", "memory": 0x4000,
                             "init": [ ["CONFIG_ROUTES"],
                                       ["SOFT_RESET", 1],
                                       /* Send ID 0 to mailbox 3 */
                                       ["SEND_MAIL", 3, 0] ] } ] },

Figure 5. Specification of the input chain of the GNSS application

B. Location transparency

The IP tiles of the RFD (Figure 3) are connected to the routers with a network interface that provides the abstraction of a master/slave memory bus. Accesses to off-tile memory regions are encapsulated in a common request/response protocol. Each master has a local look-up table that maps an address range to a <forward path, return path> pair. When the corresponding memory range is accessed, both paths are transmitted. The addressed peer network interface subsequently responds with the requested data using the return path. In our application specifications, the index of a port in the list determines the entry in the corresponding routing table of the element the task is mapped to. The relation between channels and routing-table entries should thus be consistent with the program code using those entries. The resource manager can then (re)configure the view of each XENTIUM of the 'outside' world, providing location transparency for the largest part of the GSP.
The other part, the GPD, runs a Linux kernel and software stack to provide end-users with interaction with and control of the GSP. A device driver has been implemented to communicate with the NoC [22]. The user receives a handle to a specific resource on the platform by sending routing information to the device driver. Common file operations can then be used to communicate through this handle with the requested resource. This allows for a rich-featured platform support library that is independent of the specific locations of resources. Figure 6 shows an example, in which routing information is passed to the function c2c_open, which returns a file descriptor (handle). The function c2c_write_value writes the value 0x1 to offset 0x200, without knowing the type or location of the resource that is being accessed. The resource manager is typically used to provide the handles to the requested resources, again providing the desired location transparency.
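A minimal sketch of such a master-side look-up table is given below; the structure, field names and path encoding are invented for illustration, since the real network-interface format is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of the per-master look-up table: an address range
 * mapped to a <forward path, return path> pair of source routes. */
struct route_entry {
    uint32_t    base, limit;   /* address range [base, limit)       */
    const char *forward;       /* e.g. "WEWL": hops toward the peer */
    const char *ret;           /* hops the peer uses to respond     */
};

/* Resolve an off-tile address to its route pair; returns NULL on miss. */
static const struct route_entry *
route_lookup(const struct route_entry *tbl, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tbl[i].base && addr < tbl[i].limit)
            return &tbl[i];
    return NULL;
}

/* Toy table with two made-up address windows. */
static const struct route_entry demo_tbl[] = {
    { 0x10000000u, 0x10010000u, "WEWL", "LWEW" },
    { 0x20000000u, 0x20008000u, "NNE",  "WSS"  },
};

static int demo_lookup_ok(void)
{
    return route_lookup(demo_tbl, 2, 0x10000200u) == &demo_tbl[0]
        && route_lookup(demo_tbl, 2, 0x30000000u) == NULL;
}
```

When the resource manager remaps a task, only these table entries need to change; the program code keeps using the same entry indices, which is what provides the location transparency.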

    struct gv_header fw, rt;
    int fd;

    gvheader_init_with_route(&fw, "WEWL");
    gvheader_init_with_route(&rt, "LWEW");
    fd = c2c_open(&fw, &rt);
    c2c_write_value(fd, 0x1, 0x200);
    c2c_close(fd);

Figure 6. Example usage of the C2C device driver to access the NoC

V. RUN-TIME RESOURCE MANAGEMENT

At run-time, applications can request resources from the platform through a resource manager (see Figure 1). The resource manager queries the current platform state and updates the state when the active set of applications changes. Depending on the resource demands and the state of the platform, an application is admitted to the platform only if adequate resources are available. In this section, we briefly describe the heuristic we use to find the required set of resources, and the conditions that hold for the GSP platform.

A. Resource allocation heuristic

Our resource allocation heuristic [23] requires an initial step that builds on the assumption that in each application, a (small) subset of tasks can only be mapped to specific resources in the platform. This assumption holds for tasks responsible for the I/O of the application. These dependencies may vary per execution of the application, but at the start of the application both the origin of the inputs and the designated output have to be known. For example, the GNSS receiver described in Section IV-A contains an input task that has to be mapped to an RF front end. This input task alone fixates the source of the input datastream within the application to the correct location in the platform. Optionally, it could configure the RF front end with application-specific parameters.
The initial subset of tasks T0 is thus mapped to a subset of elements E0 in the platform. A subset of channels C0,1 connects tasks T0 to another subset of tasks T1. Starting from the locations E0, we search through the platform model by following the links L0,1 that are available for routing the channels C0,1, until we find a set of elements E1 to which all tasks in T1 can be mapped. In the search for the next set of resources Ei, it is also checked whether links in the interconnect have sufficient budget remaining to route the channels Ci-1,i. This procedure is performed until all tasks are mapped, or until the heuristic fails to find an adequate amount of resources.

Figure 7. For every subset Ti of tasks in application A, a subset Ei of the elements in platform P is selected to form mapping Mi.
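The wavefront search can be sketched as follows. This is our own simplified reconstruction (a linear chain of tasks, one task per step, unit link budgets), not the actual heuristic of [23].

```c
#include <assert.h>
#include <string.h>

#define MAX_ELEMS 16

/* Toy platform: adjacency matrix of elements with remaining link budget,
 * and a per-element flag saying whether it can host another task. */
struct platform {
    int n;
    int link_budget[MAX_ELEMS][MAX_ELEMS]; /* 0 = no usable link     */
    int free_slot[MAX_ELEMS];              /* 1 = can accept a task  */
};

/* Map a chain of n_tasks tasks, task 0 pinned to element `fixed` (the
 * I/O task). Each next task is placed on the nearest reachable element
 * with a free slot, claiming one unit of budget per traversed link
 * (BFS wavefront). Returns 1 and fills place[] on success, 0 on failure. */
static int map_chain(struct platform *p, int n_tasks, int fixed, int *place)
{
    if (!p->free_slot[fixed]) return 0;
    p->free_slot[fixed] = 0;
    place[0] = fixed;
    for (int t = 1; t < n_tasks; t++) {
        int prev_hop[MAX_ELEMS], queue[MAX_ELEMS], head = 0, tail = 0;
        memset(prev_hop, -1, sizeof prev_hop);
        int src = place[t - 1];
        prev_hop[src] = src;
        queue[tail++] = src;
        int dst = -1;
        while (head < tail && dst < 0) {
            int u = queue[head++];
            for (int v = 0; v < p->n; v++) {
                if (p->link_budget[u][v] <= 0 || prev_hop[v] >= 0)
                    continue;
                prev_hop[v] = u;
                if (p->free_slot[v]) { dst = v; break; }
                queue[tail++] = v;
            }
        }
        if (dst < 0) return 0;                 /* no adequate resources */
        for (int v = dst; v != src; v = prev_hop[v])
            p->link_budget[prev_hop[v]][v]--;  /* claim route budget    */
        p->free_slot[dst] = 0;
        place[t] = dst;
    }
    return 1;
}

/* Three elements in a line (0-1-2), three tasks pinned to start at 0. */
static int demo_ok(void)
{
    static struct platform p;
    int place[3];
    memset(&p, 0, sizeof p);
    p.n = 3;
    p.link_budget[0][1] = p.link_budget[1][0] = 1;
    p.link_budget[1][2] = p.link_budget[2][1] = 1;
    p.free_slot[0] = p.free_slot[1] = p.free_slot[2] = 1;
    if (!map_chain(&p, 3, 0, place)) return 0;
    return place[0] == 0 && place[1] == 1 && place[2] == 2;
}
```

Marking an element faulty simply clears its free slot (or its link budgets), so the same search transparently routes around unavailable resources.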

B. Mapping constraints

The XENTIUM has instruction-level parallelism, but is a single-core DSP. Therefore, we can only map a single task to a XENTIUM core. This limitation is expressed in the schedule that is associated with each core. Each NoC link provides 4 virtual channels, allowing up to 4 connections to traverse a physical link. Routers can be configured to assign different weights to each virtual channel in order to asymmetrically distribute the available bandwidth over the active connections.
Applications may hold on to their allocated resources indefinitely. Currently, we assume that not every task can be migrated to a different set of resources. Also, no suitable method has been found yet for fault isolation within applications running on our system. When an application is malfunctioning (compromised by a hardware fault), we currently have to reset it to make sure that its state remains consistent. Online hardware tests could result in one or more hardware components being marked as 'faulty', such that the application is restarted on fault-free resources.
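The weighted division of link bandwidth over virtual channels can be illustrated with a small calculation; the weight and bandwidth values below are made up, and real routers arbitrate per flit rather than computing rates like this.

```c
#include <assert.h>

#define N_VCHANNELS 4

/* Divide a link's bandwidth over up to 4 virtual channels in
 * proportion to their configured weights (0 = channel unused). */
static void vc_bandwidth(unsigned link_bw,
                         const unsigned weight[N_VCHANNELS],
                         unsigned bw[N_VCHANNELS])
{
    unsigned total = 0;
    for (int i = 0; i < N_VCHANNELS; i++)
        total += weight[i];
    for (int i = 0; i < N_VCHANNELS; i++)
        bw[i] = total ? link_bw * weight[i] / total : 0;
}

/* Example: a 400-unit link with weights 2:1:1 over three connections
 * gives the first virtual channel half the bandwidth. */
static unsigned demo_vc0_share(void)
{
    const unsigned weight[N_VCHANNELS] = { 2, 1, 1, 0 };
    unsigned bw[N_VCHANNELS];
    vc_bandwidth(400, weight, bw);
    return bw[0];
}
```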

VI. MAPPING A GNSS RECEIVER TO A CHIP WITH FAULTY COMPONENTS

The GNSS application of Section IV-A is mapped by the software designer to the resources of a single RFD. Figure 8a shows (a part of) the manual mapping; for clarity, not all communication channels are drawn. The colors of the marked connections match the edges of the task graph in Figure 4. An input datastream is taken from the RF front end (located at the bottom right of Figure 2b), which is routed through the FPGA to I/O port number 2. Within the memory tile, a FIFO buffers the datastream, which is in turn read by the input processing task (IP). After decimation, the IP task first forwards the datastream through another FIFO to the acquisition chain (Acq 0,1), and then writes the same samples into the local memory of the first tracker task (Tra 0).

A. Experiment

Applications may span multiple RFDs, and mapping applications over the boundaries of a single RFD is well supported and exploited by other applications. In the CRISP project, we studied two applications: the GNSS receiver and digital beamforming. The beamforming application has 85 tasks and 252 channels, and uses most of the resources on the board (the GPD, 5 RFDs and the FPGA).
For the GNSS application, we used the resources of a single RFD to test all possible single hardware fault scenarios. In turn, each of the 100 NoC links, 16 routers, 9 XENTIUM cores and 2 memory tiles is marked as faulty, after which we attempt to start the application. In the case that the run-time resource manager admits the GNSS application to the platform, the application actually runs for a couple of iterations to make sure that no details are overlooked. The complexity of the mapping problem is easily underestimated; there are potentially 9! possible task-to-core mappings, each requiring a different set of communication resources. We experienced that applications often deadlock if we do not take the communication resources into account.

VII. RESULTS

Figure 8b shows the generated mapping of the GNSS application to a fault-free RFD. Although one would expect a different assignment of the tracker tasks, the mapping proposed by the resource manager is actually more efficient in terms of communication resources than the manual mapping. The generated mapping has an average hop count of 6.4 ± 1.4 versus 6.8 ± 1.8 hops in the manual mapping.
Table I shows the results of the experiment described in the previous section. It shows, per resource type, the utilization by the GNSS application, and how often the application was rejected because a faulty resource was critical to the GNSS application. The GNSS application does not require all processing capacity on the RFD. Having a single faulty XENTIUM thus does not keep the application from being admitted to the platform. The memory tiles are heavily utilized, making each of them critical for the GNSS application. As a consequence, three out of the four access paths to these memory tiles are also found to be critical. Lastly, two off-chip links are essential for the application as well. MCP link number 2 takes the input stream from the RF front end. MCP link number 5 leads to the GPD that manages the platform and performs part of the control of the application (see GPD and RFD 2 in Figure 2b). Figure 8c shows the RFD with the resources marked that are critical to the GNSS application.

Table I
ADMISSION RATE OF THE GNSS APPLICATION ON A FAULTY RFD

    Resource type  | Utilization | Critical      | Admission rate
    NoC links      | 32 ± 32%    | 10 out of 100 | 90%
    NoC routers    | 31 ± 20%    | 5 out of 16   | 69%
    XENTIUM cores  | 89%         | 0 out of 9    | 100%
    Memory tiles   | 63%         | 2 out of 2    | 0%

Table II shows the time required by the resource manager to perform all steps needed to start the application. The resource manager runs on the GPD, which contains an ARM926EJ-S core running at 200 MHz. A little over 10% of the time is required to parse the task graph and create the associated data structures. The majority of the time is then devoted to the mapping of the 17 tasks and 68 channels to the RFD. Note that we put as much flexibility in the application as possible; adding constraints to the mapping problem often leads to a solution (or rejection) faster. The third step is the realization of the application on the platform, consisting of the configuration of the hardware and the uploading of binaries (program and data).

Table II
TIME REQUIRED TO START AND STOP THE GNSS APPLICATION

    Phase                                   | Time required
    Parsing of application specification    | 53.56 ± 3.13 ms
    Resource allocation                     | 318.76 ± 10.31 ms
    Configuration and starting of GNSS app. | 43.71 ± 0.44 ms
    Releasing resources                     | 11.15 ± 0.25 ms

VIII. CONCLUSIONS AND FUTURE WORK

This paper shows a proof-of-concept implementation of a run-time resource manager, demonstrated with a GNSS receiver mapped to the CRISP GSP architecture [2]. This platform provides system-level fault tolerance for applications: most hardware faults can be circumvented, allowing applications to start in the face of faulty hardware. Parts of the interconnect, mainly located near I/O ports and memories, remain critical to applications, as illustrated with the GNSS application.
Within this paper, a centralized resource manager is used to administer the entire platform. However, a hierarchical model of the platform allows the resources to be distributed over multiple resource managers, each taking ownership of a part of the platform. In such a distributed setup, a resource manager can look beyond its own set of resources, and might request resources from other clusters. Multiple boards as shown in Figure 2a can be daisy-chained to easily create larger platforms, which outgrow the practical size that can be handled in a centralized manner. Future work thus includes the distribution of applications, resource negotiation, and consistency of state in such a distributed environment.

Figure 8. GNSS application mapped to a single Reconfigurable Fabric Device: (a) manual mapping on a fault-free chip; (b) example generated mapping on a fault-free chip; (c) critical links and routers.

Another challenge is to seamlessly switch between the quality-of-service levels provided by an application. Scenarios exist where it is preferable to scale down running applications to allow additional functionality to be executed on the same platform. In other scenarios, hardware faults may trigger a reconfiguration of an application, such that it gracefully degrades by running in a reduced mode on a (slightly) different set of resources. The routing table in the network interface of a Xentium can temporarily be invalidated, until the routing tables of all producers for the migrating task are updated with the new location. This is a relatively cheap procedure, and may not be noticed by the producers at all if the tasks communicate via message-based data streams, as the routing information is only required when transmitting a new message. The location transparency provided by our platform thus also allows for migration transparency.

ACKNOWLEDGMENT

The authors would like to thank everybody who contributed to the CRISP project for creating the necessary building blocks needed to perform our research.

REFERENCES

[1] E. J. Tan and W. B. Heinzelman, "DSP architectures: past, present and futures," Computer Architecture News, vol. 31, no. 3, pp. 6–19, 2003.
[2] The CRISP Consortium. (2011, Jan.) CRISP - Cutting edge Reconfigurable ICs for Stream Processing. FP7-ICT-215881. [Online]. Available: http://www.crisp-project.eu/
[3] Recore Systems. (2011) Xentium technology. [Online]. Available: http://www.recoresystems.com/technology/xentium-technology/
[4] P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit, "An energy-efficient reconfigurable circuit-switched network-on-chip," in Proc. of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS '05) - 12th Reconfigurable Architectures Workshop (RAW 2005), Denver, CO, USA. Los Alamitos, CA, USA: IEEE Computer Society, Apr. 2005, p. 155.
[5] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal, "Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs," in DAC '07: Proc. of the 44th annual Design Automation Conference. New York, NY, USA: ACM, 2007, pp. 777–782.
[6] M. H. Wiggers, M. J. Bekooij, and G. J. Smit, "Monotonicity and run-time scheduling," in EMSOFT '09: Proc. of the 7th ACM International Conference on Embedded Software. New York, NY, USA: ACM, 2009, pp. 177–186.
[7] C. Chou and R. Marculescu, "FARM: Fault-aware resource management in NoC-based multiprocessor platforms," in Proc. of the Conference on Design, Automation and Test in Europe (DATE 2011), Grenoble. European Design and Automation Association, Mar. 2011.

[8] ——, "Run-time task allocation considering user behavior in embedded multiprocessor networks-on-chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 1, pp. 78–91, Jan. 2010.
[9] V. Nollet, T. Marescaux, P. Avasare, D. Verkest, and J.-Y. Mignolet, "Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles," in DATE '05: Proc. of the Conference on Design, Automation and Test in Europe. Washington, DC, USA: IEEE Computer Society, 2005, pp. 234–239.
[10] A. Singh, W. Jigang, A. Prakash, and T. Srikanthan, "Efficient heuristics for minimizing communication overhead in NoC-based heterogeneous MPSoC platforms," in Proc. of the IEEE/IFIP International Symposium on Rapid System Prototyping (RSP '09), 2009, pp. 55–60.
[11] P. Zipf, G. Sassatelli, N. Utlu, N. Saint-Jean, P. Benoit, and M. Glesner, "A decentralised task mapping approach for homogeneous multiprocessor network-on-chips," Int. J. Reconfig. Comput., vol. 2009, pp. 1–14, 2009.
[12] M. Hosseinabady and J. Nunez-Yanez, "Run-time resource management in fault-tolerant network on reconfigurable chips," in Proc. of the International Conference on Field Programmable Logic and Applications (FPL 2009), Aug. 2009, pp. 574–577.
[13] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, "Run-time spatial mapping of streaming applications to a heterogeneous multiprocessor system-on-chip," in DATE '08: Proc. of the Conference on Design, Automation and Test in Europe, Mar. 2008, pp. 212–217.
[14] A. Hansson, K. Goossens, and A. Rădulescu, "A unified approach to constrained mapping and routing on network-on-chip architectures," in CODES+ISSS '05: Proc. of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. New York, NY, USA: ACM, 2005, pp. 75–80.
[15] A. Hansson, M. Coenen, and K. Goossens, "Undisrupted quality-of-service during reconfiguration of multiple applications in networks on chip," in DATE '07: Proc. of the Conference on Design, Automation and Test in Europe. San Jose, CA, USA: EDA Consortium, 2007, pp. 954–959.
[16] V. Nollet, T. Marescaux, D. Verkest, J.-Y. Mignolet, and S. Vernalde, "Operating-system controlled network on chip," in Proc. of the 41st annual Design Automation Conference (DAC '04). New York, NY, USA: ACM, 2004, pp. 256–259. [Online]. Available: http://doi.acm.org/10.1145/996566.996637
[17] T. Marescaux, J.-Y. Mignolet, A. Bartic, W. Moffat, D. Verkest, and S. Vernalde, "Networks on chip as hardware components of an OS for reconfigurable systems," in Proc. of the 13th International Conference on Field Programmable Logic and Applications, 2003, pp. 595–605.
[18] M. Herve, E. Cota, F. L. Kastensmidt, and M. Lubaszewski, "Diagnosis of interconnect shorts in mesh NoCs," in Proc. of the International Symposium on Networks-on-Chip, 2009, pp. 256–265.
[19] M. Cuviello, S. Dey, X. Bai, and Y. Zhao, "Fault modeling and simulation for crosstalk in system-on-chip interconnects," in Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD 1999), 1999, pp. 297–303.

[20] X. Zhang and H. Kerkhoff, "Design of a highly dependable beamforming chip," in Proc. of the 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools (DSD '09), Aug. 2009, pp. 729–735.
[21] H. Hurskainen, J. Raasakka, T. Ahonen, and J. Nurmi, "Multicore software-defined radio architecture for GNSS receiver signal processing," EURASIP J. Embedded Syst., vol. 2009, pp. 3:1–3:10, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1155/2009/543720

[22] H. A. Toersche, "On chip network support in Linux for a General Stream Processor," Dec. 2010.
[23] T. D. ter Braak, P. K. F. Hölzenspies, J. Kuper, J. L. Hurink, and G. J. M. Smit, "Run-time spatial resource management for real-time applications on heterogeneous MPSoCs," in Proc. of the Conference on Design, Automation and Test in Europe (DATE 2010), Dresden. European Design and Automation Association, Mar. 2010, pp. 357–362.