Design Space Exploration for Customized Asymmetric Heterogeneous MPSoC

Bouthaina Dammak, Rachid Benmansour and Smail Niar
University of Valenciennes and Hainaut Cambrésis, 59300 Valenciennes, France

Mouna Baklouti and Mohamed Abid
National School of Engineers of Sfax, 3042 Sfax, Tunisia

Abstract—Modern FPGAs allow the design of very complex Systems-on-Chip (SoC). To fulfil the performance/energy-consumption requirements of modern applications, Heterogeneous Multiprocessor System-on-Chip (Ht-MPSoC) architectures represent a promising solution. In such systems, the processor instruction set is enhanced by application-specific custom instructions implemented on reconfigurable fabric, namely FPGA. To increase area utilization and guarantee that application constraints are respected, we propose a new Ht-MPSoC architecture in which hardware accelerators (HW accelerators) are shared among different processors in an intelligent manner. In this paper, we extend existing Ht-MPSoC architectures by considering asymmetric architectures (AHt-MPSoC), in which the cores have different resources that they may share in different manners. Depending on the running applications and their processing needs, private and shared HW accelerators are attached to the different cores. On an 8-core AHt-MPSoC, we obtained a speed-up of 2.6 with a reduced number of HW accelerators for our benchmarks. Index Terms—FPGA, Shared hardware accelerators, Application-specific instructions, Area constraint, MIP model.

I. INTRODUCTION

The increase in HW resources in the latest FPGA generations makes it possible to implement extremely complex Heterogeneous Multi-Processor System-on-Chip (Ht-MPSoC) architectures. These architectures combine hardware and/or software cores, application-specific HW accelerators and communication units. Cyclone V from Altera [1], Zynq from Xilinx [2] and SmartFusion2 from Microsemi [3] are examples of Ht-MPSoC platforms. They include one or more hard cores and up to 500K reconfigurable logic elements to build computational accelerators. Thanks to this reconfigurable area, it is possible to obtain either a symmetric Ht-MPSoC (SHt-MPSoC), in which all the processors have the same number of private and shared HW accelerators, or an asymmetric architecture (AHt-MPSoC), in which the HW accelerators attached to the processors differ from one processor to another. In an AHt-MPSoC, applications (or tasks, for a multitasked system) that are critical and that require many HW resources to respect their execution-time constraints are executed by cores with a large number of private accelerators. In contrast, applications that are not critical and do not require many resources are run by cores with a small number of private HW accelerators, or none at all.

Figures 1 and 2 show two examples of Ht-MPSoC with 4 processors (P1 to P4). Figure 1 shows a SHt-MPSoC: the processors have the same type and number of HW accelerators. Each processor has one identical private HW accelerator and shares m accelerators with the other processors. Figure 2 gives an example of an AHt-MPSoC. In this architecture, P1, P3 and P4 each have a private accelerator; these private accelerators can be similar or different. P2 has no private accelerator. P1 and P2 share the same accelerator, named "Sh 1" in Figure 2, whereas P2, P3 and P4 share another HW accelerator, named "Sh 2". We believe that AHt-MPSoC is a very promising class of architectures, as it allows an efficient utilization of HW resources and provides high performance with a relatively low energy consumption. However, it increases even more the size of the configuration space to explore. Thus, it is necessary to provide the designer with a Design Space Exploration (DSE) tool that determines the best architectural configuration for a given set of concurrent applications. Such a tool plays an important role in the design flow of Ht-MPSoC architectures, as it allows the most efficient Ht-MPSoC configuration to be found in a reduced time. This configuration is the one that requires the least FPGA resources within a given execution-time and energy budget.

In the literature, very few works have been devoted to DSE tools for AHt-MPSoC. This is due to the fact that previous and current generations of FPGA circuits offer relatively few resources in terms of logic elements compared to ASICs. Thus, it was possible to explore the entire design space in a relatively short time, either by simulation or by simple analytical models. Other studies consider only symmetric Ht-MPSoC architectures in which all processors have the same number and type of accelerators. This approach limits the applicability of Ht-MPSoC and does not operate effectively for highly complex reconfigurable systems.

In the solution proposed in this paper, we target next-generation FPGA circuits with a high number of reconfigurable logic elements and their utilization in AHt-MPSoC architectures. In these architectures, the number of HW accelerators and their type may vary from one processor to another. Furthermore, in our model the applications executed by the processors may differ from one processor to another. Our solution is based on a Mixed Integer Linear Programming (MIP) formulation to explore the very large space of possible configurations. Due to the complexity of finding an optimal enumerative solution, the proposed mathematical model allows the identification of a unique global minimum (i.e., minimum area usage) in a reasonable time.

Fig. 1. Example of a SHt-MPSoC architecture with 4 processors. Each processor has one private HW accelerator and shares m accelerators with the other processors.

Fig. 2. Example of an AHt-MPSoC architecture with 4 processors. P1 and P2 share one accelerator and P2, P3 and P4 share another accelerator. P2 has no private accelerator.
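To make such configurations concrete, the following illustrative sketch (ours, not part of the paper) encodes the Figure 2 example as a simple data structure. The class names and the area values are placeholders chosen for the example.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Accelerator:
        name: str          # e.g. "Priv_P1", "Sh_1"
        area_units: int    # FPGA area required by the accelerator (placeholder values)

    @dataclass
    class AHtMPSoCConfig:
        processors: list[str]
        # maps each accelerator to the set of processors allowed to use it;
        # a singleton set means the accelerator is private
        mapping: dict[Accelerator, set[str]] = field(default_factory=dict)

        def sharing_degree(self, acc: Accelerator) -> int:
            return len(self.mapping[acc])

        def total_area(self) -> int:
            return sum(acc.area_units for acc in self.mapping)

    # The AHt-MPSoC of Figure 2: P1, P3 and P4 each own a private accelerator,
    # Sh_1 is shared by P1 and P2, Sh_2 is shared by P2, P3 and P4.
    sh1, sh2 = Accelerator("Sh_1", 6), Accelerator("Sh_2", 5)
    cfg = AHtMPSoCConfig(
        processors=["P1", "P2", "P3", "P4"],
        mapping={
            Accelerator("Priv_P1", 4): {"P1"},
            Accelerator("Priv_P3", 4): {"P3"},
            Accelerator("Priv_P4", 4): {"P4"},
            sh1: {"P1", "P2"},
            sh2: {"P2", "P3", "P4"},
        },
    )
    print(cfg.sharing_degree(sh2), cfg.total_area())  # 3, 23

The DSE problem addressed in the rest of the paper is precisely the choice of such a mapping (which accelerators to instantiate and who shares them) under area and performance constraints.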

This model was solved, after a constraint-linearization process, using the CPLEX linear programming solver.

The paper consists of 6 sections. In the next section, a survey of existing approaches for Ht-MPSoC is presented. In Section 3, we detail AHt-MPSoC architectures and discuss their benefits. In Section 4, we develop our MIP-based Design Space Exploration (DSE) for AHt-MPSoC. The next section presents the experimental results and the performance of our AHt-MPSoC for real and synthetic benchmarks. Finally, in Section 6, we give a conclusion and some possible extensions to make AHt-MPSoC more efficient.

II. RELATED WORKS

In [4], architectures that use custom instructions are classified according to the coupling of the reconfigurable hardware components to the processor. The author distinguishes two classes: tightly coupled and loosely coupled. In the first class, the reconfigurable logic is part of the data path of the processor, i.e., it is placed inside the processor. In the second class, the reconfigurable logic is placed

close to the processor, but outside it, on a dedicated bus. In [5], the authors propose a polynomial-time heuristic that uses resource sharing to minimize the area required to synthesize a set of custom Instruction Set Extensions (ISEs). Their resource-sharing approach transforms the set of ISEs into a single hardware datapath. Zuluaga et al. [6] introduce latency constraints in the merging process of the ISEs to control the performance improvement. Their parametric algorithm combines a path-based resource-sharing algorithm, similar to the one presented in [5], with a timing-budget management scheme. More recently, the work presented by Stojilović et al. [7] aims at a pragmatic increase in flexibility to integrate different ISEs from different applications; it is motivated by path-based datapath-merging algorithms. While [5] aims at minimizing the area cost, [7] increases the flexibility for a moderate cost. Their approach ensures that all ISEs from an application domain map onto the same proposed domain-specific coarse-grained array. The architectures proposed in these papers belong to the tightly coupled class of application-specific architectures. The

work we present in this paper differs from [5], [6] and [7] at several levels. First, they consider the sharing of fine-grained custom instructions that are tightly coupled to the processor's pipeline. In contrast, we consider the loosely coupled class of application-specific architectures and the sharing of coarse-grained custom instructions, which provide a larger performance improvement. These previous works provide an approach to share logic between different custom instructions for a single processor: their heuristic algorithms select the custom instructions to be mapped onto logic, gaining additional area savings through operation sharing. Instead, we propose the sharing of an entire custom instruction between different cores, and our MIP model explores both the custom instructions to be mapped onto hardware and the sharing degree of the hardware supporting the ISE. In [5], [6] and [7], the proposed algorithms focus only on area reduction without taking the impact on performance into account. Even if Zuluaga et al. [6] introduce latency features, they do not consider application-performance constraints. In contrast, our work explores all the possible sharing configurations that minimize area usage while respecting the desired application-performance constraints in terms of execution time. Moreover, all the previous works consider single-processor architectures. This approach is not efficient for MPSoC: to obtain a better resource utilization and the optimal architecture, the customization of the different processors has to be carried out synergistically and simultaneously.

In [8], the authors present another approach to minimize area usage by resource sharing for tightly coupled application-specific architectures. They propose to share the reconfigurable area at runtime in a multi-core architecture, and they develop an algorithm to select the ISEs to be mapped and to optimize the fabric sharing between cores leading to the best execution time. In [9], the authors propose a pseudo-polynomial-time algorithm to explore the design space of Multi-Application Specific Instruction Processors (MASIP). In their approach, task mapping and custom instruction set selection are separated into two different phases. Unlike our work, they do not consider accelerator sharing between cores. In [10], we proposed the utilization of a SHt-MPSoC executing the same application on all cores. In this paper, we extend [10] to loosely coupled AHt-MPSoC application-specific architectures based on FPGA.

III. AHT-MPSOC BASED ON HARDWARE SHARING

Application-specific instructions are an effective way of improving processor performance. Critical computations can be accelerated by new instructions executed on HW accelerators. These accelerators can be either loosely coupled to the processor via the system bus or a memory controller, or tightly coupled to the instruction pipeline. In this work, the HW accelerators are implemented as hardware modules executing application-specific instructions and are loosely coupled to the system bus.

In this section, we first discuss the benefits of sharing HW accelerators in multiprocessor systems, then we present our AHt-MPSoC architecture.

A. Proposed Hardware Sharing Approach

For an Ht-MPSoC in which N applications are running on the different processors, hardware resource constraints make it impossible to implement all the desired HW accelerators privately; the integration of HW accelerators in such an architecture therefore cannot be fully explored. Most existing embedded applications, such as multimedia, telecommunication or automotive applications, use a common set of kernel functions: matrix operations, convolutions and filters are frequently used in these applications. In an Ht-MPSoC that does not use hardware sharing, separate private accelerators are instantiated for different custom instructions even when they provide the same computation. The proposed sharing approach enables a range of application-specific instructions to be shared among different tasks, which means that different computational tasks can share several accelerators. This optimization is expected to reduce area and power consumption while preserving performance. We call a shared pattern a computational task that exists in different applications; the identification of shared patterns offers a range of possible optimizations. In Figure 3, different applications contain the same heavy computational patterns, which could be executed simultaneously on private accelerators (one accelerator per pattern and per processor). With our accelerator-sharing technique, however, only one accelerator per pattern can be implemented and shared among the processors. The sharing degree is the number of processors sharing the same accelerator. In Figure 3, processor P2 can have a 3-degree shared accelerator for T3, shared with processors P3 and P4. Accelerator sharing and the choice of sharing degree open a large architectural exploration space. The next paragraph gives an overview of the proposed AHt-MPSoC hardware architecture.

B. AHt-MPSoC Architecture

The AHt-MPSoC architecture class proposed in this paper uses HW accelerators that execute application-specific instructions and are loosely coupled to the processors via a shared bus. Bus bridges are used for the communication between a processor and the shared accelerators. Figure 2 illustrates a 4-processor AHt-MPSoC where the processors have private as well as shared accelerators; the latter are activated by application-specific instructions. The synchronization between processors that have shared HW accelerators is established through a lock/barrier mechanism implemented in hardware. When a processor Pj tries to use a shared accelerator that is already locked by another processor Pi, it is put in a waiting state until Pi unlocks the shared accelerator. The lock/barrier mechanism grants access in a First-In-First-Out fashion; a software analogue of this arbitration is sketched below. In the next section, a Mixed Integer Linear Programming (MIP) model is presented to identify the optimal shared-resource allocation that respects a required performance.
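Since the lock/barrier arbitration is implemented in hardware in the proposed architecture, the following is only a rough software analogue (our illustration, not the authors' implementation): a ticket-based lock that grants a shared accelerator to requesting processors in FIFO order.

    import threading

    class FifoLock:
        """Grants access in First-In-First-Out order (software analogue of the
        hardware lock protecting a shared accelerator)."""
        def __init__(self):
            self._cond = threading.Condition()
            self._next_ticket = 0     # next ticket to hand out
            self._now_serving = 0     # ticket currently allowed to proceed

        def acquire(self):
            with self._cond:
                ticket = self._next_ticket
                self._next_ticket += 1
                # the requester waits until the shared accelerator is free
                # and it is its turn (FIFO order)
                while ticket != self._now_serving:
                    self._cond.wait()

        def release(self):
            with self._cond:
                self._now_serving += 1
                self._cond.notify_all()

    shared_accel_lock = FifoLock()

    def run_custom_instruction(pid: int):
        shared_accel_lock.acquire()      # Pj blocks while Pi holds the accelerator
        print(f"P{pid} uses the shared accelerator")
        shared_accel_lock.release()

    threads = [threading.Thread(target=run_custom_instruction, args=(i,)) for i in range(1, 5)]
    for t in threads: t.start()
    for t in threads: t.join()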

Fig. 3. An example of 4 parallel programs with different HW accelerators (shown in circles).

IV. MIXED INTEGER LINEAR PROGRAMMING MODEL

Our space exploration addresses the way to merge the computational patterns existing in the different applications so as to reduce the overall area usage while respecting the application-performance constraints. Increasing the sharing degree reduces the area usage, but it may also increase the delay each processor incurs to access the shared accelerators, so the required performance may not be reached. Thus, the goal of our MIP model is to reach a minimum area usage while keeping the execution time of each application under a required limit.

A. Problem Formulation

The architecture is a multi-processor system with p processors running p applications. These applications can be identical, i.e. the Single Program Multiple Data model, or different, i.e. Multiple Program Single/Multiple Data. We call a pattern a time-consuming task kernel existing in one or several tasks. Based on the expected acceleration, we select the sequence of the most computation-intensive patterns executed on the p processors. These patterns are candidates for hardware implementation that may fulfill the desired performance. Let {T_1, ..., T_j, ..., T_m} denote this sequence and P = {P_1, P_2, ..., P_i, ..., P_n} the sequence of n homogeneous processors. Let N = {1, 2, ..., n} and M = {1, 2, ..., m}. Each pattern T_j (j in M) is assigned a predefined area constant a_j, whose value is the amount of FPGA area required to implement T_j as a HW accelerator. In addition, we define two integer constants ts_{j,i} and te_{j,i}, respectively the start time and the end time of pattern T_j on processor P_i. The implementation of T_j as a private accelerator provides a defined acceleration tacc_j. For each processor P_i, an acceleration constant limit_i is set as a constraint on the total processor acceleration acc_i.

Our problem can be formally stated as follows: given m patterns executed on n homogeneous processors, find the number of processors sharing each pattern (the sharing degree) such that the execution time of each processor remains under its time constraint and the total used area is minimized.

B. Objective Function

In this section, we present a MIP formulation of the problem so that an optimal solution can be obtained with the help of a commercial mixed integer linear programming solver. Let x_j be a binary variable that denotes whether T_j is implemented in hardware (HW):

$$\forall j \in M, \quad x_j = \begin{cases} 1 & \text{if } T_j \text{ is implemented on HW} \\ 0 & \text{else} \end{cases}$$

Let y_{jik} be a binary variable that denotes whether the accelerator (Acc) of task T_j is shared between processors P_i and P_k:

$$\forall j \in M, \ \forall (i,k) \in N^2, \quad y_{jik} = \begin{cases} 1 & \text{if the Acc of } T_j \text{ is shared between } P_i \text{ and } P_k \\ 0 & \text{otherwise} \end{cases}$$

Our objective function is to minimize the total area required to implement the m patterns:

$$Total\_Area = \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{x_j \, a_j}{\sum_{k=1}^{n} y_{jik}} \qquad (1)$$

In the Total_Area equation we have implicitly defined the sharing degree of a processor P_i for the task T_j. Let sh_{ij} be this variable:

$$sh_{ij} = \sum_{k=1}^{n} y_{jik} \qquad (2)$$

To linearise Eq. (1), we define new continuous variables z_{ij} and w_{ij}:

$$z_{ij} = \frac{1}{\sum_{k=1}^{n} y_{jik}} = \frac{1}{sh_{ij}}, \qquad w_{ij} = z_{ij} \, x_j$$

The definition of z_{ij} can be expressed in linear form as follows:

$$z_{ij} + z_{ij} \sum_{k=1}^{n} y_{jik} = 1 \quad \Longrightarrow \quad z_{ij} + \sum_{k=1}^{n} \theta_{ijk} = 1 \qquad (3)$$

where theta_{ijk} is a continuous variable expressed as follows:

$$\theta_{ijk} = z_{ij} \, y_{jik} \qquad (4)$$

The objective function (Eq. (1)) can then be re-written as:

$$MinAir = \sum_{i=1}^{n} \sum_{j=1}^{m} a_j \, w_{ij}$$

C. Performance Constraint

The performance constraint is based on the following principle: the total acceleration of each processor, provided by the mapping of the different tasks on HW accelerators, needs to be above a performance limit. Depending on the sharing degree of a HW accelerator, each processor incurs a delay R_{ji} to access this shared accelerator. The performance constraint can be imposed as follows:

$$acc_i = \sum_{j=1}^{m} \left( x_j \, tacc_j - x_j \, R_{ji} \right) \;\geq\; limit_i, \quad \forall i \in N \qquad (5)$$

R_{ji} is the delay incurred by processor P_i to access a shared accelerator of T_j. This variable is defined as the time interval between the end time of executing the shared task T_j on processor P_k, where P_k (0 <= k <= i-1) is the last processor sharing task T_j with P_i, and the start time of executing task T_j on processor P_i:

$$R_{ji} = teh_{jk} - tsh_{ji}, \quad \text{where } k = \max\{k' \in \{0, 1, \ldots, i-1\} \mid y_{jik'} = 1\}$$

Let p_{jik}, j in M, (i,k) in N^2, be a binary variable defined as follows:

$$p_{jik} = \begin{cases} 1 & \text{if } P_k \text{ is the last processor sharing } T_j \text{ with } P_i \\ 0 & \text{else} \end{cases}$$

Then:

$$R_{ji} = \sum_{k=1}^{i-1} p_{jik} \, (teh_{jk} - tsh_{ji}) \qquad (6)$$

where tsh_{j,i} and teh_{j,k}, j in M, i in N, k in {1, 2, ..., i}, are continuous variables that define respectively the start time and the end time of executing T_j on HW on processors P_i and P_k. They are calculated as follows:

$$\forall j \in M, \ \forall i \in N, \ \forall k \in \{1, 2, \ldots, i\}: \quad tsh_{ji} = ts_{ji} - \sum_{l=1}^{j-1} (acc_l - R_{li}), \qquad teh_{jk} = te_{jk} - \sum_{l=1}^{j} (acc_l - R_{lk})$$

The performance constraint can now be re-written as:

$$\forall i \in N: \quad acc_i = \sum_{j=1}^{m} x_j \left( tacc_j - \sum_{k=1}^{i-1} p_{jik} \, (h_{j,k} - h_{(j-1),i}) \right) \;\geq\; limit_i \qquad (7)$$

Our objective function is to minimize the total area:

$$MinAir = \sum_{i=1}^{n} \sum_{j=1}^{m} a_j \, w_{ij}$$
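For illustration, the following sketch shows how an exploration of this kind can be expressed with an off-the-shelf MIP library (here PuLP with its bundled CBC solver; the model above was solved with CPLEX). It is a deliberately simplified variant: it decides how many accelerator instances of each pattern to build and which processors use them, minimizing area under a per-processor acceleration constraint, but it omits the shared-access delay R_ji and the z, w, theta, p linearization of the exact model. The area values 6 and 5 are the HDCT/VDCT units reported in the next section; the tacc and limit numbers are placeholders.

    # Simplified DSE sketch (not the paper's exact linearized model): pick which
    # accelerator instances to build and which processors use them, minimizing
    # total area while each processor reaches its required acceleration.
    import pulp

    n, m  = 4, 2                     # processors, patterns (e.g. HDCT, VDCT)
    a     = [6, 5]                   # area units per accelerator (HDCT, VDCT values from Sec. V)
    tacc  = [40, 35]                 # acceleration per HW-mapped pattern -- placeholder values
    limit = [30] * n                 # required acceleration per processor -- placeholder values
    G = range(n)                     # at most n instances of each pattern's accelerator

    prob  = pulp.LpProblem("aht_mpsoc_dse", pulp.LpMinimize)
    build = pulp.LpVariable.dicts("build", [(j, g) for j in range(m) for g in G], cat="Binary")
    use   = pulp.LpVariable.dicts("use", [(i, j, g) for i in range(n) for j in range(m) for g in G],
                                  cat="Binary")

    # Objective: total area of the instantiated accelerators
    prob += pulp.lpSum(a[j] * build[j, g] for j in range(m) for g in G)

    for i in range(n):
        # Performance constraint (cf. Eq. (5), without the access delay R_ji)
        prob += pulp.lpSum(tacc[j] * use[i, j, g] for j in range(m) for g in G) >= limit[i]
        for j in range(m):
            # A processor uses at most one instance of each pattern ...
            prob += pulp.lpSum(use[i, j, g] for g in G) <= 1
            for g in G:
                # ... and only instances that are actually built
                prob += use[i, j, g] <= build[j, g]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    for j in range(m):
        for g in G:
            if (build[j, g].value() or 0) > 0.5:
                sharers = [f"P{i+1}" for i in range(n) if (use[i, j, g].value() or 0) > 0.5]
                print(f"T{j+1}: one instance shared by {sharers} (sharing degree {len(sharers)})")

With these placeholder numbers the solver builds a single shared instance of the first pattern, which is the expected behaviour when sharing carries no delay penalty; the full model trades sharing degree against the access delay R_ji.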

V. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed AHt-MPSoC system as well as to study the effectiveness of our MIP model, we first validate the model with a 4-processor architecture executing the same application, namely the JPEG encoder. The MIP-generated solution is compared with FPGA implementation results. Our target platform is a Xilinx Virtex-5 FPGA; the basic processing element of this platform is the MicroBlaze softcore processor running at 125 MHz. Second, for asymmetric multiprocessor architectures executing different applications, we generate the best-area solution under different performance constraints.

A. JPEG encoder application execution

The JPEG profiling results show that more than 40% of the JPEG execution time is spent in the HDCT (Horizontal DCT) and VDCT (Vertical DCT) tasks. These two tasks are considered as potential shared patterns for our MIP model.

Fig. 4. Real and model estimation results for the JPEG application.

TABLE I
MIP MODEL VARIABLES

                HDCT (j = 1)      VDCT (j = 2)
Matrix xj       1                 1
Matrix Yjik     1 1 1 1           1 1 1 1
                1 1 1 1           1 1 1 1
                1 1 0 1           1 1 1 1
                1 1 1 1           1 1 1 1
Matrix Pjik     0 0 0 0           0 0 0 0
                1 0 0 0           1 0 0 0
                0 1 0 0           0 1 0 0
                0 0 1 0           0 0 1 0
Matrix Rji      0 0 0 0           0 0 0 0

The HDCT and VDCT accelerators consume 6 and 5 area units, respectively; an area unit corresponds to 336 slices. The HDCT and VDCT pattern information (area, start time, end time) is inserted into our MIP model. For each processor, the performance constraint is set to ensure the 20 images/second requirement of the JPEG application. The solver produces the decision variables of the optimal architecture; these variables are summarized in Table I. From the y_jik values in Table I, we note that the optimal architecture integrates both the HDCT and the VDCT accelerator (x1 = 1, x2 = 1). For each pattern, only one hardware accelerator is used and shared among the processors (y1ik = 1, y2ik = 1). The generated architecture requires 9.6 area units to integrate one shared HDCT and one shared VDCT. The physical integration of this architecture on the FPGA consumes 11.6 area units; this overhead is due to the bus bridges needed for the shared HDCT and VDCT accelerators.

To assess the MIP model results, the physical implementation results of different configurations are compared with the model-based results. The area consumption, execution time and energy dissipation per image are presented in Figure 4. The architecture with private HDCT and VDCT accelerators for every processor cannot fit in the FPGA: the area constraint allows the implementation of only one accelerator type in the private configurations, i.e. configurations (4,0) and (0,4) in Figure 4.a. Accelerator sharing allows the HDCT and VDCT accelerators to be integrated in the same architecture, as illustrated by configurations (2,2) and (1,1) in Figure 4. From Figure 4.b, we notice that accelerator sharing does not introduce a notable performance cost; moreover, the best performance was obtained with the (1,1) and (2,2) configurations, and the 20 images/second requirement was satisfied with both. The 2-degree (2,2) and 4-degree (1,1) shared HDCT and VDCT configurations provide the same performance, but configuration (1,1) has a reduced area cost. Thus, as expected from the MIP resolution, the configuration with a 4-degree shared HDCT and a 4-degree shared VDCT (configuration (1,1) in Figure 4) provides the best performance/area trade-off.

B. Different synthetic applications execution

Fig. 5. Generation of different synthetic applications.

Now we focus on generating the best configuration for an 8-processor architecture where the processors execute different applications. Here, we use synthetic applications that are built from the same computational patterns.

Fig. 6. Area usage (y-axis) of the MIP-model-generated configurations for different speed-ups (x-axis) vs. an 8-processor configuration without HW accelerators. FPGA-based implementation results (real measurements) are also given.

The executed applications consist of three computational patterns (T1: data inversion, T2: loop multiplication, T3: search maximum) and a number of non-computational tasks implemented as loop iterations, as illustrated in Figure 5. For the different processors, we vary the number of iterations (i, j and k) of the non-computational loops to obtain different applications. Table II summarizes the execution time and the area requirement of the three tasks T1, T2 and T3. Note that the area requirement is given in area units; here, an area unit corresponds to 150 slices. Figure 6 presents the logic area usage calculated with our MIP model while varying the speed-up (limit_i in Eq. (5)). The speed-up is defined as follows:

$$speedup = \frac{T_{sw}}{T_{sw} - limit_i} \qquad (8)$$

where Tsw is the execution time without HW accelerators.
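As a quick numerical illustration of Eq. (8) (our own example): the acceleration a HW-mapped pattern can contribute is its SW execution time minus its HW execution time, taken here from Table II; the software-only time Tsw and the required acceleration used below are hypothetical values.

    # Per-pattern acceleration potential, from Table II (cycles): SW time - HW time
    tacc = {
        "T1": 440 - 222,     # 218 cycles saved per T1 execution
        "T2": 3815 - 213,    # 3602 cycles saved per T2 execution
        "T3": 2000 - 1200,   # 800 cycles saved per T3 execution
    }

    def speedup(t_sw: float, limit_i: float) -> float:
        """Eq. (8): speed-up obtained when processor i saves limit_i cycles
        out of a software-only execution time t_sw."""
        return t_sw / (t_sw - limit_i)

    # Hypothetical processor: software-only time of 10000 cycles, required to
    # save the cycles of one HW-mapped T2 instance.
    print(tacc, speedup(10000, tacc["T2"]))   # speed-up of about 1.56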

In Figure 6, the configurations given by the MIP model to satisfy the 1.007 and 1.014 speed-ups consume only 2 area units. These solutions have a single shared HW accelerator for T1 (x1 = 1 and y1ik = 1), whereas T2 and T3 are executed in software (x2 = 0, x3 = 0). In contrast, when the speed-up is increased, the generated solutions integrate T2 and T3 on HW accelerators. In Figure 6, the solution for a speed-up of 2.15 consumes 25 area units; it corresponds to a configuration with HW accelerators for T1, T2 and T3 (x1 = 1, x2 = 1, x3 = 1). To illustrate the impact of the combination of processors that share a HW accelerator, we compare configurations of different points in Figure 6 that consume the same area. The points at 1.6 and 1.75 both use 30 area units, but they correspond to different configurations. For 1.6, the MIP model generates an AHt-MPSoC architecture with two HW accelerators for T2, one private to P1 and the second shared among (P2, P3, P4, P5, P6, P7, P8). The AHt-MPSoC architecture that provides a speed-up of 1.75 also has two HW accelerators for T2, one shared among (P1, P2, P4) and the other shared among (P3, P5, P6, P7, P8). We deduce that different combinations of processors sharing the

same HW accelerator can impact the performance of the AHt-MPSoC architecture. Thus, for a given area, the designer has to choose the hardware architecture configuration that provides the highest performance. Figure 6 also demonstrates that the maximum speed-up can be obtained with a configuration that uses less area than the fully private configuration. The 8-processor architecture with private T1, T2 and T3 accelerators provides a speed-up of 2.6 and consumes 136 area units. To guarantee the same speed-up, our MIP model generates a configuration that consumes only 96 area units. The generated AHt-MPSoC architecture integrates T1, T2 and T3 on HW accelerators (x1 = x2 = x3 = 1, as shown in Table III). This architecture has 4 HW accelerators for T1, 4 HW accelerators for T2 and 7 HW accelerators for T3. In Table III, the y1ik matrix indicates that for T1, (P1, P2, P3, P4) share one HW accelerator, (P5, P7) share another, and P6 and P8 each have a private one. In Figure 6, we also compare the area usage of the MIP-generated results and of the FPGA-based implementation results. Due to the extra logic area needed by the bus bridges, we note a slight area overhead for the physical implementation. Figure 6 demonstrates that the proposed MIP model produces results close to those obtained with the physical implementation.

VI. CONCLUSION

In this paper, we presented a new class of Ht-MPSoC architectures. In the proposed AHt-MPSoC architectures, an accelerator can be shared between processors in different manners. Experimental results have demonstrated that with an AHt-MPSoC it is possible to obtain the same performance with fewer logic elements than with a SHt-MPSoC architecture. Our proposed MIP formulation is also able to explore, in reasonable time, the whole design space of Ht-MPSoC: all the configurations between the two extremes, i.e. a configuration with only private accelerators on one side and a configuration with only shared accelerators on the other side. Thanks to the proposed

TABLE II
T1, T2 AND T3 AREA AND EXECUTION TIME INFORMATION

                              T1: Data inversion loop   T2: Loop multiplication   T3: Search maximum
Area usage (area units)       2                         15                        4
SW execution time (cycles)    440                       3815                      2000
HW execution time (cycles)    222                       213                       1200

TABLE III
MIP MODEL RESOLUTION FOR A SPEED-UP EQUAL TO 2.6

Model inputs:
  N        = 8
  M        = 3
  acc[M]   = 200  3375  1000
  a[M]     = 2  15  4
  te[M][N] = 440   670   910   1150  1240  1360  1470
             4255  4485  4725  4965  5055  5175  5245
             6255  6485  6725  6965  7055  7175  7245
  limit[N] = 4150  4100  4200  4280  4300  4350  4500

Model outputs:
  xj   = 1  1  1
  y1ik = 1 1 1 1 0 0 0 0
         1 1 1 1 0 0 0 0
         1 1 1 1 0 0 0 0
         1 1 1 1 0 0 0 0
         0 0 0 0 1 0 1 0
         0 0 0 0 0 0 1 0
         0 0 0 0 1 0 1 0
         0 0 0 0 0 0 0 1
  Rji