Power Aware Heterogeneous MPSoC with Dynamic Task Scheduling and Increased Data Locality for Multiple Applications

Oliver Arnold and Gerhard Fettweis
Vodafone Chair Mobile Communications Systems
Technische Universität Dresden, Dresden, Germany
{oliver.arnold, fettweis}@ifn.et.tu-dresden.de

Abstract—A new heterogeneous multiprocessor system with dynamic memory and power management for improved performance and power consumption is presented. Increased data locality is revealed automatically, leading to enhanced memory access capabilities. Several applications can run in parallel, sharing processing elements, memories as well as the interconnection network. Real-time constraints are addressed by prioritizing processing element allocation, scheduling and data transfers. Scheduling and allocation are performed dynamically based on runtime data dependency checking. We show that execution times, bandwidth demands and power consumption are decreased. A tool flow is introduced for easy generation of the hardware platform and of software binaries for cycle accurate simulations. Further newly developed tools are available for power analysis, data transfer observation and task execution visualization.

Keywords: MPSoC, Task Scheduling, Runtime Scheduling, Runtime Memory Management, Runtime Power Management, MPSoC development tools, MPSoC analysis tools

I. INTRODUCTION

A heterogeneous MPSoC is a promising approach to the challenge of meeting the conflicting objectives of increased performance and reduced power consumption. It provides the flexibility to integrate different processing elements (PEs), e.g. application-specific integrated circuits, general purpose processors like reduced instruction set computers, and application-specific instruction-set processors. All of these processors have to be assembled in a convenient way. An easy to use programming model is important for fast application development. Furthermore, several applications run concurrently in the system, sharing processing elements, memories and the interconnection network. Synchronization between control code and PEs as well as synchronization among PEs has to be possible. Tools for debugging, visualization, verification and power profiling are important for a fast, error-free and power aware implementation of applications.

Application performance is nowadays limited by the memory access capabilities, the available processing power, the power consumption constraints and the inherent parallelism of the application. In the last years it has become evident that memory access capabilities are growing slower than the processing power on the chip [15]. Thus, the gap between data fetches and processing increases, and the connection to the memory is a major bottleneck of the system. We will show an improvement of memory access by explicit dynamic memory management. For this, reuse of local data and bypassing between PEs is applied.

In state of the art architectures, power management is done by a dedicated power management unit (e.g. the Infineon XPMU 6xx series [16]), which makes power shut down and frequency/voltage scaling available. In this work explicit power management is introduced, which allows a faster and more accurate control of the power management unit. Thus, system power consumption is reduced.

Multi-core platforms are e.g. Coresonic, PicoChip, Infineon's MuSIC [1], Icera, and Sandbridge's SB3011 platform [2]. In contrast to these platforms, dynamic scheduling at runtime is applied in this work, including explicit dynamic memory and power management. Furthermore, programming these chips is not easy. Programming models for parallel architectures are e.g. OpenMP, MPI, OpenCL (Apple, Khronos Group), CUDA (NVidia), Sequoia (Stanford University) [3], Cilk (MIT) [4], Ct (Intel) [5], and CellSs (Barcelona Supercomputing Center) [6]. The programming model in this approach is comparable with the last one, but it is not limited to the Cell BE architecture [18] with its homogeneous synergistic PEs.

The remainder of the paper is organized as follows: In section II, the hardware system and the programming model are presented. In the following section, the Software CoreManager and its memory and power management capabilities as well as the real-time support are introduced. The available tools and the tool flow are described in section IV. Section V presents the experimental results using benchmarks running on a generated architecture. The conclusion is presented in section VI.

(This work was supported by the German Federal Ministry of Education and Research (BMBF) as part of the "CoolBaseStations" project under grant 13N10788.)

II. SYSTEM MODEL

The idea of this approach is the execution of atomic tasks on a PE [7]. Each task is a composition of instructions and can be seen as a computational kernel. Its input and output are defined at runtime and are thus not static. All necessary input data as well as instruction data are fetched prior to the execution of the task. Data are transferred using direct memory access (DMA) transfers. The programming model is described in the next subsection.

The system framework of our heterogeneous MPSoC is shown in Fig. 1. It consists of a number of application processors (#APP) acting as control processors, one scheduling unit called the CoreManager, several global memories (#MEM) and several PEs. The global memories are disjointly mapped to the global address space. The total number of processing elements is determined by the number of PE types (#PE_TYPES) and the number of processing elements of each type (#PE). Furthermore, each PE has one local memory for instruction and data storage. The local memories have two ports: the first is mapped to the global address space; the second is mapped to the local address space of the PE. For data transfers, several direct memory access controllers (DMACs) are available. The number of DMACs can be equal to the number of global memories if concurrent access is possible. An additional DMAC can be added for data transfers among PEs. The CoreManager is responsible for programming the DMACs, allocating PEs and scheduling tasks.

A. Programming Model

A programming model for heterogeneous MPSoCs is introduced in [7]. We extend it to support further functionality. The model consists of a function call on the application processors (APP) which leads to the execution of a task on a PE. The function arguments are the task name, at least one input data parameter and one output data parameter. Each data parameter is separated into a pointer, indicating a memory address in the shared memory, and a size, which specifies the width of the data region. A source code example is shown in Listing 1. It runs on the APP, which is therefore responsible for evaluating the control code dependencies. In this case an if-else statement is present; thus, either task2 or task3 will be executed on a PE. By calling a task(..) function on the APP, the task description is transferred to the CoreManager. The task description is threefold: 1. the task name, 2. possible and preferred PEs of this task, and 3. all input and output information. Input and output data are specified by a pointer and a size using the IN(..) and OUT(..) statements. For example, in1, in2 and in3 are pointers to data regions in the main memory or the scratch pad; the second argument is the size in bytes. The PE annotation for a task is taken from a configuration file and is inserted into the task description in the function call.

Figure 1. System Framework
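To make the threefold task description concrete, the following C sketch shows one possible descriptor layout. All type and field names (task_desc_t, data_param_t, etc.) are illustrative assumptions, not the platform's actual data structures.

#include <stdint.h>

#define MAX_PARAMS 16  /* the platform reserves 16 data pointers per task */

typedef struct {
    uint32_t addr;       /* pointer into the shared global address space */
    uint32_t size;       /* width of the data region in bytes */
    uint8_t  is_output;  /* set for OUT(..) parameters, cleared for IN(..) */
} data_param_t;

typedef struct {
    uint32_t     name;                /* 1. the task name */
    uint32_t     possible_pe_types;   /* 2. possible PEs, one bit per PE type */
    uint32_t     preferred_pe_types;  /*    preferred PEs, from the config file */
    uint8_t      num_params;
    data_param_t params[MAX_PARAMS];  /* 3. all input and output information */
} task_desc_t;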

During the execution of someFunction() on the application processor, task1 and task3 run concurrently on PEs if the else branch is taken. Thus, the parallelism level is three if the application processor is taken into account. The taskSync() call forces the application processor to stall until all remaining tasks of this application are executed and the output data are written back to memory. The stall of the application processor can be realized e.g. by a wait-for-interrupt mode or by polling a variable. The CoreManager informs the APP to continue after task processing is finished.

task( task1, IN( in1, 256), IN( in2, 128),
      OUT( out1, 512), OUT( out2, 256));
if ( random1 == random2 )
    /* data dependency on out1 */
    task( task2, IN( in3, 256), IN( out1, 128),
          OUT( out3, 512));
else
    /* control code dependency */
    task( task3, IN( out1+512, 512), OUT( out4, 512));
someFunction();
taskSync();

Listing 1. Application processor source code example

The interface for communicating with the CoreManager is shown in Table 1. Functions marked with an asterisk are taken from [7]; the others are introduced in this work. Each PE can be excluded at runtime from the list of available PEs in the CoreManager by using the enablePE(..) and disablePE(..) functions. Furthermore, for each PE a maximum clock frequency can be set with setMaxClockPE(..). Nevertheless, the clock frequency actually used is determined by the CoreManager as described in the next section.

Table 1. Enhanced CoreManager Interface

task(taskName, ...)*        Transfers the task description to the CoreManager for scheduling a task on a PE.
taskId task(taskName, ...)  Task call as above; a unique task identifier is returned.
taskSync()*                 Synchronization of all tasks. The application processor is stalled until all tasks are finished.
taskSync(taskId)            Synchronization to the task with taskId. The application processor is stalled until this task is finished.
setMaxClockPE(peId, clock)  Sets the maximum clock frequency of a PE.
enablePE(peId)              Enables a PE.
disablePE(peId)             Disables a PE.
setClockCM(clock)           Sets the CoreManager clock frequency.
taskDepCheck(mode)          Sets the dependency check mode of the CoreManager. DEP(..) can be used for explicit task dependency annotations.
turnOnCM(enable)            Turns the CoreManager on and off.
dataLocality(enable)        Enables/disables the data locality options (local memory reuse & bypassing).
Config getConfig()          Returns the PE and CoreManager configuration.
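The following hypothetical snippet illustrates how the extended interface could be used from application code; the PE identifier and clock values are made up for illustration.

setMaxClockPE(3, 200);  /* cap PE 3 at 200 MHz */
disablePE(7);           /* exclude PE 7 from scheduling */
dataLocality(1);        /* enable local memory reuse & bypassing */
uint32 id = task( task1, IN( in1, 256), OUT( out1, 512));
taskSync(id);           /* stall until exactly this task has finished */
enablePE(7);            /* make PE 7 available again */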

III. SOFTWARE COREMANAGER

In contrast to [8], a CoreManager written in software is used. This approach is chosen for increased flexibility and faster system evaluation compared to a hardware implementation. Because it is written in C, the implementation can run on any RISC core; in this case an ARM926 is chosen. Furthermore, dynamic memory management (local memory reuse and bypassing) and power management (frequency scaling and power shut off) are added. Additionally, several applications can run concurrently. For real-time support, priorities for PE allocation, scheduling and the interconnection are available.

Dynamic scheduling is chosen because of its advantages compared to static scheduling. The latter has a fixed DMAC and PE allocation scheme, and the schedule is fixed as well. A static approach with energy awareness is presented in [10]. If multiple applications run concurrently in the system, the static approach has disadvantages in terms of performance and power consumption. In contrast, a dynamic approach can increase the utilization of the PEs and the interconnection due to PE and DMAC allocation at runtime. The schedule itself is dynamically optimized according to current demands like priorities and power consumption. A list based scheduling approach is implemented using an as-soon-as-possible (ASAP) algorithm. For each priority a dedicated list is available. The distinction between preferred and possible PEs for each task allows a further optimization of PE allocation and thus improves performance and power consumption. Due to the flexible and modular implementation, the scheduling and allocation can be easily changed.

The Software CoreManager has three task dependency checking modes. The first one does not check dependencies at all, allowing independent tasks to execute immediately after the task description has arrived in the CoreManager, as long as free suitable PEs and DMACs are available. The second mode allows an explicit annotation of task dependencies: the taskId, which is a return value of a task(..) function call on the application processor, is passed as a further argument to the successor task. An example is shown in Listing 2. The dependencies of task3 are explicitly given using the newly introduced DEP(..) command. The dependencies are merged into the task description, which is sent to the CoreManager. If the body of the if clause is not executed, only a dependency on task1 is generated for task3. The most advanced mode of operation does data dependency checking at runtime: the CoreManager checks the last arrived task against all tasks in the system by comparing the input and output memory regions of all tasks. Thus, predecessor tasks are found and the execution of the current task is delayed until these tasks finish.

For each application a dedicated task queue exists. A size has to be specified for each queue; thus, the number of concurrently running tasks is limited. If the queue size of an application is reached, the application is stalled until a free task slot is available.

uint32 taskId1 = task( task1, IN( in1, 128), OUT( out1, 256));
uint32 taskId2 = 0;
if ( random1 == random2 )
    taskId2 = task( task2, IN( in2, 128), OUT( out2, 256));
task( task3, DEP( taskId1, taskId2),
      IN( in3, 128), OUT( out3, 256));

Listing 2. Explicit task dependencies
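The runtime data dependency check of the third mode amounts to an interval overlap test on the tasks' memory regions. The following is a minimal sketch, reusing the hypothetical task_desc_t layout sketched in section II; it covers only read-after-write dependencies, and all names are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Two [addr, addr+size) regions overlap iff neither lies entirely
   before the other. */
static bool regions_overlap(uint32_t a, uint32_t a_size,
                            uint32_t b, uint32_t b_size)
{
    return (a < b + b_size) && (b < a + a_size);
}

/* True if the newly arrived task 'succ' reads a region that the pending
   task 'pred' writes; 'succ' then has to be delayed until 'pred' finishes. */
static bool depends_on(const task_desc_t *succ, const task_desc_t *pred)
{
    for (int i = 0; i < succ->num_params; i++) {
        if (succ->params[i].is_output) continue;       /* inputs of succ */
        for (int j = 0; j < pred->num_params; j++) {
            if (!pred->params[j].is_output) continue;  /* outputs of pred */
            if (regions_overlap(succ->params[i].addr, succ->params[i].size,
                                pred->params[j].addr, pred->params[j].size))
                return true;
        }
    }
    return false;
}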

A. Memory Management

The boot code for a PE as well as the instruction code for a task reside in the local memory of the PE as long as it is powered on. Nevertheless, task instruction code can be replaced at runtime if a different task type has to be executed. If memory space is available, more than one task instruction code can be kept. The boot code initializes necessary memory regions like heap and stack for general purpose processors.

The architecture of the MPSoC and the runtime data dependency checking allow an explicit data management. Due to the CoreManager's knowledge of all input and output data transfers of all tasks, data locality can be increased without any annotations by the software developer. In particular, the local memories of the PEs are used as fast data storage. As a main requirement for functional correctness, all output data must reside in the global memory after a taskSync() call, allowing flawless synchronization among PEs and APPs.

Reuse of local data and bypassing are applied to increase data locality. Reuse of local data allows output data of a task to remain in the local memory for further use on the same PE. The control is done within the CoreManager according to the data dependencies of the tasks. The advantage of reusing local data is shown in Figure 2, which depicts a sequence of two task executions. The first task T1 has one input and one output; the second task T2 has two inputs and one output. In Figure 2a) no data is reused. Therefore, the output data of task T1, which is equal to the second input of task T2, is written back to the global memory. Afterwards, this data has to be fetched again for the execution of task T2. In comparison, in Figure 2b) local data is reused and the first input of task T2 is prefetched; the second input data transfer of this task can be omitted. After task T1 is finished, task T2 can immediately start its execution. It can be seen that the execution time is significantly decreased if local data is reused. For functional correctness, it has to be ensured that the output memory region of a task is not modified during the execution of its successors until the write-back transfer is finished. Otherwise, reuse of local data has to be omitted.

Figure 2. Prefetching and reuse of local data: a) without reuse, b) with reuse
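The reuse and bypass decisions can be sketched as follows; the data_loc_t structure and all names are illustrative assumptions, not the CoreManager's actual internal bookkeeping.

#include <stdint.h>

typedef struct {
    int      pe_id;       /* PE whose local memory holds the data; -1: global */
    uint32_t local_addr;  /* location within that local memory */
} data_loc_t;

/* Redirect the successor's input to the predecessor's output location,
   as annotated during dependency checking. */
static void apply_data_locality(data_loc_t *succ_in,
                                const data_loc_t *pred_out,
                                int succ_pe,
                                int *skip_global_fetch,
                                int *use_pe_bypass)
{
    if (pred_out->pe_id == succ_pe) {
        *succ_in = *pred_out;   /* reuse of local data: DMA transfer omitted */
        *skip_global_fetch = 1;
    } else if (pred_out->pe_id >= 0) {
        *succ_in = *pred_out;   /* data still resides in another PE's memory */
        *use_pe_bypass = 1;     /* PE-to-PE transfer instead of global fetch */
    }
}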

If global data reside in a local memory, they can be bypassed to another PE on which a task will be executed. Thereby, bandwidth demands to the global memory are decreased. An example is shown in Figure 3. Three tasks are executed; task T1 has one input, the other tasks have two inputs. By using the bypass it is possible to execute task T3 much earlier. Without bypassing, the task has to wait for the output data of task T1 to be written back to the global memory; only afterwards can all input data for task T3 be loaded.

Data reuse and data bypassing are implemented in separate functions. The algorithm is straightforward due to the explicit data dependency checking. During this process all necessary information is annotated in an internal data structure. After all predecessor tasks are finished, local data can be reused for increased data locality. Therefore, the successor task's data locations are updated with data locations taken from the predecessor tasks. Furthermore, the DMA transfer is skipped in the case of local data reuse.

For local memory reuse and bypassing, a task has to be informed about the location of its input and output data in the local memory of a PE. This information is dynamic and can be, as already mentioned in the previous paragraph, changed during execution. Therefore, the CoreManager needs to inform the task on the PE about the local memory locations of all input and output data. It writes local input and output pointers to its shared memory. These pointers point to data regions within the local memory address space of the PEs. These data are transferred to the local memory of a PE using a DMA transfer. Within the local memory of each PE, four dedicated memory places are available for storing this information. Thus, task information for four tasks can be written to this memory region; altogether it sums up to 256 reserved bytes. After task execution the memory region is released and can be used for the next task. For each task 16 pointers can be specified. When a task is executed on a PE, these input and output pointers are given as arguments to the task. Thus, the locations of input and output data are hidden from the application developer. Only the number of arguments and the argument order have to be the same as in the task(..) call on the application processor.

For dynamic local memory allocation, different strategies are possible. The simplest one is the Single-Space allocator. It allocates the whole local memory and allows exactly one task to be scheduled. The Top-Down allocator can schedule two tasks: data for the first task grows from the beginning of the local memory, data for the second task grows from its end. If the accumulated memory space of both tasks is larger than the available local memory, the memory allocation of the second task fails. The most advanced memory allocation scheme is implemented in the Block-Based allocator. It divides the local data memory into n blocks. The number of blocks is configurable during the CoreManager compilation process through configuration files. For each local memory, n bits are necessary to store its current status; one bit represents one block, and a block is occupied if its bit is set to one. For each task a bit field is available representing the local memory it occupies. After the task finishes, this bit field allows an easy release of the local memory. In the case of local data reuse, the bit field of the predecessor task is changed. This modification avoids false memory release.
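The Block-Based allocator can be sketched in C as follows. This is a minimal illustration assuming contiguous block allocation and at most 32 blocks per local memory; the actual allocator and its names may differ.

#include <stdint.h>

#define N_BLOCKS 32  /* configurable through the CoreManager configuration files */

/* Try to reserve 'needed' contiguous blocks in the occupancy bit field
   (one bit per block, 1 = occupied). On success the allocated bits are
   returned as the task's bit field; 0 signals allocation failure. */
static uint32_t alloc_blocks(uint32_t *occupancy, int needed)
{
    uint32_t mask = (needed >= 32) ? ~0u : ((1u << needed) - 1u);
    for (int pos = 0; pos + needed <= N_BLOCKS; pos++) {
        uint32_t candidate = mask << pos;
        if ((*occupancy & candidate) == 0) {
            *occupancy |= candidate;  /* mark the blocks occupied */
            return candidate;         /* kept as the task's bit field */
        }
    }
    return 0;
}

/* Releasing local memory after task completion, and fixing up the bit
   field on local data reuse, are single mask operations. */
static void free_blocks(uint32_t *occupancy, uint32_t task_bits)
{
    *occupancy &= ~task_bits;
}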

Figure 3. Task data bypass

B. Real-Time Capabilities

Soft real-time capabilities are available by assigning a priority to each application. The next scheduled task is always taken from the highest prioritized queue. Furthermore, the interconnection is prioritized, allowing a higher prioritized application to gather all necessary DMA transfers, which leads to increased throughput and decreased latency.

Runtime predictability is important for worst case execution time calculation. Especially the predictability of caches is difficult, and bounds are calculated more pessimistically than needed. PEs solely work on their local on-chip memories; thus, caches can be omitted and predictability is increased, with no performance loss. Runtime prediction is exact due to the fact that the same data and instruction input leads to exactly the same task execution time.

C. Power Management

The global system view of the CoreManager can be used for explicit power management of the PEs, the local memories, peripherals and the interconnection. For this purpose, PEs can be grouped into power domains; each PE can even reside in its own power domain, allowing the CoreManager to power PEs and their local memories on and off. For enabling this approach on RISC cores, additional boot code is necessary to set up the PE and define proper settings, e.g. stack and heap memory regions. Furthermore, the power states of the local memories can be controlled independently, allowing the CoreManager to switch on a local memory for data prefetching and only afterwards power on the PE for task execution. The start-up time of the PEs is accounted for through time annotations for each PE type, allowing the PE to start the task execution immediately after data prefetching. A similar approach for low power scheduling is given in [11] and [12], but it does not allow explicit power management at runtime as proposed in this work.

Furthermore, the CoreManager can explicitly control the frequency of each clock domain, allowing a reduction of the voltage for possible energy savings. As in the case of the power domains, one clock domain can be available for each PE. Nevertheless, as long as a PE is powered on, its static power consumption must be considered. The currently implemented algorithm for frequency assignment works as follows: In normal mode the frequency is set to the base frequency of the PE type. In the case of executing a high prioritized task on the PE, the frequency is doubled. If all PEs of a PE type are taken, the frequency is doubled once more to relax this situation. E.g. the base frequency of the ARM926 is 50 MHz and that of the ARM1176 is 100 MHz, leading to further frequency levels of 100/200 MHz for the ARM926 and 200/400 MHz for the ARM1176.
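One possible reading of this frequency assignment rule is the following C sketch; the function name and the exact combination of the two doubling conditions are assumptions.

/* Base frequency per PE type, doubled for a high-priority task, doubled
   once more when all PEs of the type are taken,
   e.g. ARM926: 50 -> 100 -> 200 MHz. */
static uint32_t assign_frequency_mhz(uint32_t base_mhz,
                                     int high_priority_task,
                                     int all_pes_of_type_taken)
{
    uint32_t f = base_mhz;
    if (high_priority_task)
        f *= 2;
    if (all_pes_of_type_taken)
        f *= 2;
    return f;
}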

Priority inheritance is applied if a possible PE of a higher prioritized task is taken by a lower prioritized task; this is necessary because no preemption is supported at the moment. Hence, power awareness is available for differently prioritized applications as well as for the system load of each PE type. The flexible and modular approach allows an easy integration of other algorithms, which may lead to a further reduction of the power consumption.

IV. TOOLS

A tool flow is available for the automatic generation of the hardware platform and the software binaries. For system analysis, the following tools were written by the authors: a task execution and data transfer visualization tool called TaskVisualizer, the ResultChecker for task consistency checking, and the PowerProfiler for power and energy profiling. VaST Comet 6.2 is chosen as the cycle accurate simulation environment [14]. Several modules are available which are usable as APP, PE or peripherals. Processors developed by us, e.g. digital signal processors and RISC cores, can be integrated by interfacing ModelSim.

A. Tool Flow

Software binary creation is shown in Figure 4. The SoftwareGenerator reads the application source files and the task definitions. For each task definition, at least one PE type has to be annotated with pragmas; preferred and possible PEs can be distinguished. By parsing this information, binaries are generated by the corresponding compilers. The CoreManager is compiled after including the hardware specification. As a result, several binaries are generated. Task definitions are included in the application binary. The application is generated independently of the number of available PEs of each PE type; at least one PE of each necessary PE type has to be included in the hardware platform. Hence, portability of the software is given due to the dynamic scheduling. Nevertheless, the CoreManager has to be reconfigured for the hardware platform.

Figure 4. Software binary generation

The hardware platform generation tool flow is shown in Figure 5. The platform definition is user generated and written in XML format; it defines the modules of the system. By parsing the platform definition, the PlatformGenerator generates the MPSoC architecture. Currently, the hardware generation tool flow is adapted to Comet 6.2. Nevertheless, the general approach of the PlatformGenerator allows integration into similar system level design environments. All processor models are able to execute binaries which are compiled by the standard compilers of the processor vendors, e.g. the compilers for the ARM modules are taken from ARM Inc.

Figure 5. Hardware generation

In Figure 6 the runtime environment is shown. By combining all binaries and loading them into the main memory of the MPSoC architecture, the simulation environment starts its execution. The configuration determines e.g. memory sizes and latencies. The post processing step collects all traced information and hands it over to the ResultChecker, the TaskVisualizer and the PowerProfiler. Traced information is gathered without influencing the simulation results. For verification, the ResultChecker is able to check all task executions, all transfers between global and local memories, and all transfers among local memories. In case of an error, the user is informed.

In a heterogeneous environment, debugging can be error prone. The task concept (PEs running in a local address space) allows the use of off-the-shelf debuggers; no multi-core debuggers are necessary. This is especially helpful in a heterogeneous environment. All input and output data as well as the instruction code are defined at task start time and reside in the local memory. This reduces validation and verification effort. No data race conditions occur on a PE.

Figure 6. Runtime environment and post processing


B. TaskVisualizer

By visualizing task executions, data transfers and further information, the TaskVisualizer supports the application and hardware developer in debugging and improving the system. The input for the TaskVisualizer is twofold. First, for each DMAC the corresponding connection to the interconnection is traced, whereby a value change dump (VCD) file of the bus signals is written; the TaskVisualizer parses these signals. For instance, in Figure 7 one DMA controller is instantiated. Each color corresponds to one PE. The blue DMA rectangle (second transfer in the timeline) represents task input data transferred to PE 1. By clicking on the rectangle, further details are displayed in the upper left corner, indicating e.g. the global and local address, the size, and the start and end time. If data are transferred to the global memory, the DMA transfer is marked with a black square in the upper part (OUT transfer). The second input for the TaskVisualizer is generated by profiling: each task call on a PE generates an entry with task start and end times in the profiling file. This information is used to generate the task execution times. TaskIds are profiled within the CoreManager.

C. PowerProfiler

The PowerProfiler analyzes traced data. The following energy consumption contributions are regarded:

- EE - Energy to execute tasks on PEs.
- EM - Energy to store data in memory, e.g. the main memories, local memories, caches, and scratch pads.
- ET - Energy for data transfers between memories.
- ES - Energy to schedule a task.
- EA - Energy for the application processors.
- EP - Energy for peripherals.

All necessary information is traced at runtime; the processing of this information is done after the simulation. E.g. the transferred bytes of the interconnection network per time unit are counted for all possible interconnection frequencies. Executed instructions can be traced for each processor. Cache misses do not occur on the PEs due to the local memory approach and thus the absence of caches. The energy for scheduling a task is obtained by averaging the power of the CoreManager processor over a dedicated period of time; this is necessary because traced instructions cannot be attributed to the scheduling of an individual task. The information needed to map the gathered data to power consumption figures is obtained from published papers and data sheets. E.g. the interconnection can be modeled using the approach given in [17]; for the ARM cores, power values are directly obtained from ARM Inc. within the eMuCo project [13]. The overall energy consumption is the sum of all mentioned energies. Different frequency/voltage pairs, static power consumptions and technologies are regarded.
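In formula form, the overall energy consumption is thus

E_total = EE + EM + ET + ES + EA + EP,

evaluated with the respective frequency/voltage settings over the simulated period.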

Figure 7. TaskVisualizer

In Figure 8 the power consumption over time of a bandwidth limited system is shown. For better visibility, only a subset of the possible power consumption contributions is given; local memories are regarded as local caches in this figure. The time axis is divided into 100 time slots; the number of time slots is configurable. Furthermore, all PE power states are analyzed. In Figure 9 the power states of an ARM926 core are shown: power state changes from off to on and vice versa, as well as frequency changes during task execution, e.g. from 50 to 200 MHz. In this example the frequency is not set to zero during power off mode, but this can easily be forced.

Figure 8. PowerProfiler

Figure 9. PowerProfiler: PE states

V. RESULTS

According to the introduced heterogeneous MPSoC framework, a system is generated as shown in Figure 10. It consists of an ARM11MPCore with four ARM1176 cores which represent the application processors; it runs at a constant frequency of 320 MHz. Three kinds of global memory are available: SDRAM (connected by an ARM PL340 dynamic memory controller), flash memory (connected by an ARM PL354 static memory controller), and a 256 Kbyte on-chip scratch pad. As PEs, eight ARM926 and two ARM1176 processors are instantiated, each connected to a 64 Kbyte local memory. For reference timing, the Platform Baseboard for ARM11MPCore [9] is used, e.g. for the latencies of the interconnection and the memories. Clock frequencies are variable. Simulations are cycle accurate. First results are given to show the feasibility of the approach and of the developed tools for hardware and software system design and analysis to improve performance and power consumption.

A. Single Application Example

The first example is a mathematical application consisting of 960 tasks. At runtime 420 tasks are executed; the others are omitted due to control code dependencies. During execution, 311 tasks have to be delayed due to data dependencies. A subset of all PEs is chosen: only the ARM926 cores are used. Therefore, the speed up relative to one PE is theoretically limited to 8. Nevertheless, the interconnection is the limiting part in this application, as can be seen in Figure 11. Three different interconnection frequencies are evaluated, revealing the speed up of the system. A task queue size of 16 is chosen.

B. Multiple Application Example

As a second example, four identical mathematical applications are chosen and run concurrently in the system. For each application, 1044 tasks are chosen out of 1750. Thus, each application has the same workload for every run. All tasks can run on any PE, leading to a maximum of ten available PEs. Different priorities are set for each application; application 3 has the lowest priority. Synchronization between tasks is forced using taskSync(). Inter-application synchronization is done using semaphores; the latter is only used after the boot-up process to allow similar start times.

Figure 10. Generated System Model

In Table 2, results of a comparison between explicit data management and no data management are shown. Five and ten PEs are regarded, a task queue size of 16 is used, and data dependencies are checked at runtime. Furthermore, the interconnection frequency is set to 100, 250 and 1000 MHz. It can be seen that the execution time is reduced most for low interconnection frequencies due to the high DMAC utilization. E.g. for an interconnection frequency of 100 MHz and ten PEs, the absolute DMAC utilization changed from 96.8 % to 95.2 %; in both cases the DMAC is nearly saturated. Thus, the impact of explicit data management is fully exploited. Likewise, the PE utilization is increased most. In contrast, the execution time reduction is smaller at an interconnection frequency of 1000 MHz, even though nearly the same amount of transferred bytes is saved. It can be seen that the energy consumption is slightly decreased due to fewer DMA transfers, increased PE utilization and earlier PE shut off. Altogether, PEs are powered on 59 times.

Quality of service for soft real time is made available by prioritization of PE allocation, scheduling and data transfers. The result of executing all four applications is shown in Table 3. It can be seen that the start times are nearly the same. Prior to application 0, the CoreManager initialization is done on the same processor; thus, this application's start time is slightly delayed due to cache effects. Nevertheless, the application finish times show the feasibility of this approach.


Figure 11. Single application example; speed up is shown against the execution with 1 PE


Table 2. Relative change in % by using explicit memory management; baseline: no data locality (in parentheses: absolute utilization values)

Number of PEs: 5
Interconnection frequency (MHz)   100           250           1000
Execution time                    -34.1         -18.1         -3.4
Transferred bytes (DMA)           -39.2         -38.3         -39.6
Energy consumption                -16.1         -12.1         -5.6
DMAC utilization                  -5.1 (88.9)   -25.2 (49.0)  -34.5 (13.7)
PE utilization                    +54.4 (60.4)  +21.9 (83.4)  +6.9 (93.4)

Number of PEs: 10
Interconnection frequency (MHz)   100           250           1000
Execution time                    -32.3         -25.2         -5.7
Transferred bytes (DMA)           -35.1         -35.2         -34.4
Energy consumption                -17.1         -12.9         -6.4
DMAC utilization                  -1.7 (95.2)   -14.2 (72.4)  -30.3 (20.9)
PE utilization                    +45.6 (29.4)  +32.8 (58.6)  +5.4 (66.0)

C. Software CoreManager Performance

The CoreManager is written in software and runs on an ARM926 processor at 300 MHz. Each functional part is responsible for a share of the overall runtime (10 PEs, task queue size of 16). An example runtime profile for the multiple application scenario is the following: dependency checking (38.0%), interrupt processing (12.8%), DMA configuration (2.0%), PE allocation/initialization (4.1%), transfer scheduling (3.9%), dependency cleanup (1.6%), and task scheduling (4.0%). The remaining part accounts for idle mode and some overhead. Thus, the Software CoreManager is limited by the number of events in the system, which depends on the average task execution time, the data transfer time, the number of applications, the task queue sizes of each application, and the dependency check mode.

The influence of the task queue size is shown in Table 4. The first example application is chosen; four PEs and an interconnection frequency of 100 MHz are used. The execution time is normalized to the execution time for a queue size of four. By increasing the task queue size, the execution time is decreased due to a higher probability of tasks in the queue that have no dependencies and thus can be executed.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, a heterogeneous MPSoC with dynamic memory and power management is presented. Task dependencies can be either explicitly annotated or checked at runtime. The increase of data locality through reuse of local data and bypassing is performed automatically by the CoreManager, using different memory allocation schemes. This error-prone task is taken away from the application programmer. Thus, a performance improvement is possible while ensuring functional correctness. Power is saved due to earlier power shut off. Furthermore, power management is done by dynamic PE allocation and explicit frequency scaling. The CoreManager is capable of handling different applications running concurrently in the system. Quality of service for soft real time is made available by prioritization of PE allocation, scheduling and data transfers. A tool flow is introduced for hardware and software platform generation, power analysis, data transfer observation and task execution visualization.

Future work aims at integrating more sophisticated algorithms for power consumption reduction and thermal distribution. A battery model will be integrated in the system which allows the CoreManager to react on the currently consumed power. Furthermore, preemption of tasks will be supported for increased responsiveness.

Table 3. Application execution times using priorities

Application   Start Time      Finish Time
App 0         364,500 ns      6,437,710 ns
App 1         311,500 ns      8,970,270 ns
App 2         313,190 ns      13,284,360 ns
App 3         290,160 ns      15,539,730 ns

Table 4. Results for different task queue sizes in %; baseline for the execution time: task queue size of 4

Task queue size    4      8      16     32
Execution time     ±0    -26.0  -32.8  -34.8
Dependent tasks    24.9   49.0   75.1   77.5
PE utilization     56.0   75.6   83.2   90.1

ACKNOWLEDGMENT

The authors acknowledge the excellent cooperation with all project partners within the CoolBaseStations project.

REFERENCES

[1] U. Ramacher, "Software defined radio prospects for multistandard mobile phones", Computer, 2007.
[2] J. Glossner, D. Iancu, M. Moudgill, G. Nacer, S. Jinturkar, S. Stanley, and M. Schulte, "The sandbridge SB3011 platform", EURASIP J. Embedded Syst., 2007.
[3] K. Fatahalian, T. J. Knight, M. Houston et al., "Sequoia: Programming the memory hierarchy", IEEE Conference on Supercomputing, 2006.
[4] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language", in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1998.
[5] A. Ghuloum, E. Sprangle, J. Fang, G. Wu, and X. Zhou, "Flexible Parallel Programming for Terascale Architectures with Ct", Intel, 2007.
[6] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, "CellSs: a programming model for the Cell BE architecture", in Proceedings of the ACM/IEEE Supercomputing 2006 Conference, November 2006.
[7] H. Seidel, "A Task-level Programmable Processor", PhD thesis, WiKu, October 2006.
[8] T. Limberg, M. Winter, M. Bimberg, R. Klemm et al., "A Heterogeneous MPSoC with Hardware Supported Dynamic Task Scheduling for Software Defined Radio", DAC/ISSCC Student Design Contest, 2009.
[9] ARM Inc., Platform Baseboard for ARM11MPCore, datasheet.
[10] J. Hu and R. Marculescu, "Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints", Design, Automation and Test in Europe Conference and Exhibition, 2004.
[11] Y. Zhang, X. Hu, and D. Chen, "Task scheduling and voltage selection for energy minimization", Design Automation Conference, June 2002.
[12] J. Liang, S. Swaminathan, and R. Tessier, "aSoC: a scalable, single-chip communications architecture", International Conference on Parallel Architectures and Compilation Techniques, Oct. 2000.
[13] http://www.emuco.eu, March 2010.
[14] http://www.vastsystems.com (now Synopsys: http://www.synopsys.com), March 2010.
[15] K. Asanovic, R. Bodik, B. C. Catanzaro et al., "The landscape of parallel computing research: A view from Berkeley", Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[16] http://www.infineon.com, March 2010.
[17] T. T. Ye, L. Benini, and G. De Micheli, "Analysis of power consumption on switch fabrics in network routers", Design Automation Conference, June 2002.
[18] D. Pham et al., "The design and implementation of a first-generation cell processor", in Proceedings of the IEEE International Solid-State Circuits Conference, 2005.
