Mcost (S))
79
Function Assessment same solution Improvement Deterioration Improvement Improvement need analysis Deterioration need analysis Deterioration
The two objective functions are Memory Cost (Mcost) and Memory Cycles (Mcyc). Memory cost is computed using equation (4.1); memory cycles are obtained using the same greedy backtracking heuristic described in Section 3.5 for the data layout. (Energy consumption is not considered in the optimization here; we defer this to Chapter 6.) The initial temperature is computed by randomizing the solution space initially: the mean and the variance of δ (the improvement or deterioration) are computed over the initial iterations of this randomization process, and the temperature is initialized to the standard deviation of δ. To generate a new solution S′ from the current solution S, we use controlled randomization to ensure that S′ is in the neighborhood of S and that it represents a valid solution. For example, we change the number of banks in S′ by adding a randomly generated offset to the number of banks in S, while ensuring that the total does not exceed the maximum number of banks. We also ensure that the memory size is greater than the data size. Let us denote the memory cost and memory cycles associated with S as Mcost(S) and Mcyc(S); similarly, let Mcost(S′) and Mcyc(S′) correspond to solution S′. The new solution is a definite improvement if Mcost(S′) ≤ Mcost(S) and Mcyc(S′) ≤ Mcyc(S), in which case we say S′ dominates S. When the new solution does not dominate the existing solution, there are several possibilities, as illustrated in Table 4.2. We maintain an upper and lower threshold for each objective function; let (Mcyc^lt, Mcyc^ut) be the limits for memory cycles and (Mcost^lt, Mcost^ut) be the limits for memory cost. The change in the overall cost function is computed as follows.
δ = (Mcost(S′) − Mcost(S)) / (Mcost^ut − Mcost^lt) + (Mcyc(S′) − Mcyc(S)) / (Mcyc^ut − Mcyc^lt)        (4.4)
Since our objective is to present to the designer a list of all good solutions, we maintain a list of competitive solutions seen during the course of the optimization. Each of these solutions is assigned a weight. When a new locally good solution is encountered, we compare its weight with that of all the globally competitive solutions seen so far. There is a fixed amount of room in the data structure that stores globally competitive solutions; as a result, we remove a solution from the list if its weight is lower than that of all others, including the new entrant.
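To make the annealing loop concrete, the sketch below shows how the neighbourhood move, the normalised δ of equation (4.4), and the bounded archive of competitive solutions could be coded. It is a minimal illustration under assumed names (MAX_BANKS, the solution dictionary, the archive capacity); m_cost and m_cyc are placeholders for equation (4.1) and the Section 3.5 data-layout heuristic, not the thesis implementation.

    import math
    import random

    MAX_BANKS = 16  # assumed architectural limit, illustrative only

    def m_cost(s):
        # placeholder for the memory cost of equation (4.1)
        raise NotImplementedError

    def m_cyc(s):
        # placeholder for the data-layout heuristic of Section 3.5
        raise NotImplementedError

    def neighbour(s, data_size):
        # Controlled randomization: perturb the bank count of S while keeping
        # the architecture valid (bank limit respected, data still fits).
        s2 = dict(s)
        s2["banks"] = min(MAX_BANKS, max(1, s["banks"] + random.randint(-2, 2)))
        s2["bank_size"] = max(s["bank_size"], math.ceil(data_size / s2["banks"]))
        return s2

    def delta(s, s2, cost_lim, cyc_lim):
        # Normalised change in the combined objective, equation (4.4);
        # cost_lim and cyc_lim are the (lower, upper) thresholds.
        return ((m_cost(s2) - m_cost(s)) / (cost_lim[1] - cost_lim[0]) +
                (m_cyc(s2) - m_cyc(s)) / (cyc_lim[1] - cyc_lim[0]))

    def update_archive(archive, solution, weight, capacity=100):
        # Fixed-size list of competitive solutions: once capacity is
        # exceeded, evict the lowest-weight entry (possibly the new one).
        archive.append((weight, solution))
        if len(archive) > capacity:
            archive.remove(min(archive, key=lambda e: e[0]))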
4.5 Experimental Results

4.5.1 Experimental Methodology
We have implemented the multi-objective Genetic Algorithm (GA), Simulated Annealing (SA) and the heuristic data-layout algorithm on a standard desktop as a framework to perform memory architecture exploration. Some practical implementation constraints are applied to the memory architecture parameters to limit the search space. For example, the memory bank sizes (Bs, Bd, and Es) are restricted to powers of 2, as is done in practical implementations. As before, we have used the Texas Instruments (TI) TMS320C55X and the Texas Instruments Code Composer Studio (CCS) environment for obtaining the profile data and also for validating the data-layout placements. We have used the same set of 4 applications from the multimedia and communications domain used in the previous chapter as benchmarks to evaluate our methodology. The applications are compiled with the C55X processor compiler and assembler. The profile data is obtained by running the compiled executable in a cycle-accurate SW simulator. To obtain the profile data we use a memory configuration of a single large bank of single-access RAM to fit the
application data size. This configuration is selected because it does not resolve any of the parallel or self conflicts, so the conflict matrix can be obtained from this simulated memory configuration. The output profile data contains (a) the frequency of access for all data sections and (b) the conflict matrix. The other input required for our method is the application data section sizes, which were obtained from the C55X linker.
4.5.2 Experimental Results
This section presents our results on memory architecture exploration. We have applied GA and SA for the memory architecture exploration. Both the GA and SA based methods use the same data layout heuristic described in Section 3.5. The reason for trying two different evolutionary schemes for memory architecture exploration is to identify the better approach between GA and SA for the multi-objective problem at hand. A better approach will be able to search the design space uniformly and identify non-dominated points that are globally Pareto-optimal. In this section, we first compare the results from GA and SA, and then describe our observations of the exploration process based on the better approach.

4.5.2.1 Comparison of GA and SA Approaches
The objective is to obtain the set of Pareto-optimal points that minimize either memory cost or memory cycles. For one of the benchmarks, Vocoder, Figure 4.3 plots all the memory architectures explored by GA and SA; each point represents a memory architecture, and the non-dominated (Pareto-optimal) points are also plotted in the same figure. Note that each of the non-dominated points represents the best memory architecture for a given Mcyc or Mcost. In Figure 4.3, the x-axis represents the normalised memory cost as calculated by equation (4.1). We have used Ws = 1, Wd = 3 and We = 0.05 in our experiments. Based on these values and equation (4.1), the important data range for the x-axis (Mcost) is from 0.05 to 3.0. It can be seen that Mcost = 0.05 corresponds to an architecture where all the memory is off-chip, while Mcost = 3.0 corresponds to a memory
Figure 4.3: Comparison of GA and SA Approaches for Memory Exploration
architecture that has only on-chip memory composed of DARAM banks. The y-axis represents the memory stall cycles, i.e., the number of processor cycles spent in data accesses. This includes the memory bank conflicts and the additional wait-states for data accessed from external memory.
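As a sanity check on this axis, the snippet below reproduces the quoted end points, assuming equation (4.1) takes the form of a size-weighted sum over SARAM, DARAM and external memory, normalised by the total memory size; the exact form of (4.1) is the one defined earlier in the chapter, so treat this as an illustrative sketch only.

    def normalised_memory_cost(saram, daram, ext, ws=1.0, wd=3.0, we=0.05):
        # Assumed form of equation (4.1): size-weighted sum of the three
        # memory types, normalised by the total memory size.
        total = saram + daram + ext
        return (ws * saram + wd * daram + we * ext) / total

    # end points quoted in the text: all off-chip gives 0.05, all DARAM gives 3.0
    assert abs(normalised_memory_cost(0, 0, 96) - 0.05) < 1e-12
    assert abs(normalised_memory_cost(0, 96, 0) - 3.0) < 1e-12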
Figure 4.4: Vocoder Non-dominated Points Comparison Between GA and SA
From Figure 4.3, we observe that the multi-objective GA explores the design points uniformly in all regions of memory cost, whereas SA explores a large number of design points only in the region of Mcost < 1. Our observation is that the sharing and niche formation methods used in GA lead to better solution diversity than SA. This trend is observed in the other benchmarks as well. Table 4.3 presents the number of memory architectures explored and the number of non-dominated points obtained from the GA and SA based approaches. For each of the applications, GA and SA are run for a fixed time, so as to compare the efficiency of the two approaches. The execution time reported in Table 4.3 is the time taken on a Pentium P4 desktop machine with 1GB main memory operating at 1.7 GHz. From Table 4.3 we observe that both GA and SA explore a large number of design points (a few thousand) for each of the benchmarks and identify a few hundred Pareto optimal design points which are interesting from a platform based design [59] viewpoint. The total computation time
taken by these methods for each benchmark varies from 3 hours to 11 hours. Compared to this, the memory design space exploration typically done manually in industry can take several man-months and may explore only a few design points of interest. We observe that GA produces most of the non-dominated points in the first 25% of the time and slowly improves the solution quality after that. On the other hand, SA gives its best results only towards the end, when the annealing temperature approaches zero. Hence, given sufficient time SA catches up with GA, but the time taken by SA to reach the solution quality of GA is 2-3 times GA's run-time. We observe that GA explores significantly more points in the design space (almost by a factor of 2 to 3) than SA for all applications except Mpeg Enc. This is due to the higher execution time of 11 hours: SA's performance improves with time, and we observe that the number of global non-dominated points found by SA is highest for Mpeg Enc. However, the number of non-dominated points identified by the two methods is nearly the same. Interestingly, the non-dominated design points identified by the two methods only partly overlap. Note that our definition of a non-dominated point with respect to the GA and SA approaches refers to those design points that are not dominated by any other point seen so far by the respective method. Thus it is possible that a point identified as non-dominated in one approach is dominated by design points identified in the other approach. We observe that the non-dominated points from SA and GA are very close. This can be observed from Figure 4.4, where the non-dominated points of GA and SA are plotted together in one graph for two of the applications. Table 4.4 presents data on the total number of non-dominated points obtained from GA and SA. The table presents the number of common non-dominated points between GA and SA in column 4. These are the same Pareto optimal design points identified by both GA and SA. The number of unique non-dominated points represents the solutions that are globally non-dominated but present in only one of the GA/SA approaches. The presence of a unique non-dominated point in one approach means that this point is missing in the other approach. Column 6 reports the global non-dominated points; this is the sum of column 4 and column 5. The ratio of column 6 to column 3 in a way represents the efficiency of an approach. We
Table 4.3: Memory Architecture Exploration

Application   Time Taken   GA: Arch explored   GA: non-dominated pts   SA: Arch explored   SA: non-dominated pts
Mpeg Enc      11 hours     8780                270                     9172                287
Vocoder       6 hours      7724                105                     3850                104
Jpeg          3 hours      9266                89                      2240                90
DSL           6 hours      11240               133                     8560                149
Table 4.4: Non-dominated Points Comparison GA-SA

Application   Method   Non-dom (ND) pts   Common ND pts   Unique ND pts   Global ND pts   Dominated pts   Avg min dist from unique NDs
Mpeg Enc      GA       270                115             143             258             12              1.3%
Mpeg Enc      SA       287                115             99              214             73              2.1%
Vocoder       GA       105                56              45              101             4               0.77%
Vocoder       SA       104                56              22              78              26              3.1%
Jpeg          GA       90                 32              63              95              2               1.6%
Jpeg          SA       89                 32              2               34              55              2.1%
DSL           GA       133                71              54              125             8               1.8%
DSL           SA       149                71              34              105             44              5.5%
observe that the number of common points increases if the time allotted to SA is increased. Further, column 7 reports the number of non-dominated points identified by one method that are dominated by points from the other method. This too is an indicator of the efficiency of the approach: the more dominated points, the lower the efficiency. For example, for the Mpeg Enc benchmark, 73 of the non-dominated design points reported by SA are in fact dominated by design points seen by the GA approach. As a consequence, the number of global non-dominated points reduces to 214 for this benchmark. In contrast, GA reports 270 non-dominated points, of which 258 are globally non-dominated. This trend is observed for almost all benchmarks. Thus the experimental data indicate that GA performs better than SA. One concern that still remains is the set of
unique non-dominated points identified by SA but not by GA. If these design points are interesting from a platform based design viewpoint, then to be competitive the GA approach should at least find a close enough design point. To assess this analytically, we find the minimum Euclidean distance between each unique non-dominated point reported by SA and all the non-dominated points reported by GA. The minimum distance is normalised with respect to the distance between the unique non-dominated point and the origin. This metric in some sense identifies a close enough design point for each Pareto optimal point missed by GA. If we can find an alternate non-dominated point in GA at a very small distance from the unique non-dominated point reported by SA, then GA's solution space can be considered an acceptable superset. In column 8, we report the average (arithmetic mean) minimum distance of all unique non-dominated points in SA to the non-dominated points in GA. A similar metric is reported for the unique non-dominated points identified by GA. We also report the maximum of the minimum distance over all unique non-dominated points in column 9. The worst case average distance from unique non-dominated points is 1.8% for GA and 5.5% for SA. Thus for every unique non-dominated point reported by SA, the GA method can find a corresponding non-dominated point within a distance of 1.8%. In summary, we observe that GA finds more non-dominated points in general and results in better solution quality for a given time. Only a few of GA's non-dominated points are dominated by SA. Also, GA searches the design space more uniformly. This may be due to the sharing and niche count based approach used in the multi-objective GA, which facilitates better solution diversity and explores a larger number of Pareto-optimal points.
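Stated as code, the metric is straightforward; the sketch below assumes each design point is an (Mcost, Mcyc) pair and is illustrative only.

    import math

    def min_normalised_distance(point, front):
        # Minimum Euclidean distance from one unique non-dominated point to
        # the other method's front, normalised by the point's distance to
        # the origin.
        nearest = min(math.hypot(point[0] - p[0], point[1] - p[1]) for p in front)
        return nearest / math.hypot(point[0], point[1])

    def avg_min_distance(unique_points, front):
        # Arithmetic mean over all unique non-dominated points (column 8 of
        # Table 4.4); taking max() instead of the mean gives column 9.
        return sum(min_normalised_distance(u, front)
                   for u in unique_points) / len(unique_points)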
4.5.2.2 Memory Architecture Exploration Results
Figures 4.5, 4.6, 4.7 and 4.8 plot all the memory architectures explored for each of the 4 applications using the GA approach. The figures also plot the non-dominated points with respect to memory cost and memory cycles. Note that each non-dominated point is a Pareto optimal memory architecture for a given memory cost or memory cycles. The results present 150-200 non-dominated solutions (representing optimal architectures)
for each of the applications.
Figure 4.5: Vocoder: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.6: MPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.7: JPEG: Memory Exploration (All Design Points Explored and Non-dominated Points)
Figure 4.8: DSL: Memory Exploration (All Design Points Explored and Non-dominated Points)
4.6 Related Work
Broadly, two types of approaches have been attempted for memory design space exploration: (i) Architecture Description Language (ADL) based approaches that use
simulation as a means to evaluate different design choices, and (ii) exhaustive search or evolutionary approaches for memory architecture exploration, with analytical model based estimation to evaluate memory architectures. Architecture description language based approaches like LISA [56], EXPRESSION [46], and ISDL [32] capture processor architecture details in a high level language as a front-end and use a generator as a back-end to generate 'C' models that simulate the processor architecture configuration. Specifically, LISA and EXPRESSION capture the micro-architectural details of the memory organization in a high level language format. From the specification, both EXPRESSION and LISA generate 'C' models that simulate the memory behavior. To evaluate a specific memory configuration for a given application, the application has to be compiled and run on the generated 'C' model to get the performance numbers in terms of the number of memory stalls. LISA allows the flexibility to capture memory architecture details at different abstraction levels, such as functional and cycle accurate specifications. A functional 'C' model is 1-2 orders of magnitude faster, in terms of run-time, than a cycle accurate simulation model. Though ADLs provide an elegant means to capture the memory architecture details and further provide a platform to evaluate a given configuration by means of simulation, there are some open issues that need to be addressed. One, simulation is an expensive method in terms of run-time, and this limits the number of configurations that can be evaluated. Two, the memory configurations need to be fed as inputs manually. To evaluate significantly different memory organizations, developing the specification is a time consuming task. Further, these methods do not address the problem of configuration selection itself. Providing new configurations is a manual task, and depending on the designer who modifies the specification, the type of configurations evaluated could differ. While these methods are very effective in evaluating a given memory architecture accurately, they are not suitable for exploring a design space with thousands of configurations, for the following reasons: (i) for every configuration that needs to be explored, the input specification needs to be modified, and this is a manual process, and (ii) since these are simulation based approaches, even with a functional simulator, the number of architecture
configurations that can be evaluated for a large application is very limited because of the large time taken by the simulator. The second type of approach is estimation based methods. In [54], Panda et al. present a heuristic algorithm for SPRAM-cache based local memory exploration. The objective of this work is to determine the right size of on-chip memory for a given application. Their algorithm partitions the on-chip memory into Scratch-pad RAM and data cache and also computes the right line size for the data cache. It searches the entire memory space to find the combination of Scratch-pad RAM, data cache and line size that gives the best memory performance. This approach is very useful for architectures that contain both SPRAM and cache. Our work differs from this work in many aspects. We address a different memory architecture class, which consists of an on-chip SPRAM with multiple SARAM banks and DARAM banks, but without cache memory. We have proposed a two-level iterative approach for memory architecture exploration. The main advantage of our method is that it integrates data layout and memory exploration into one problem. To the best of our knowledge there is no other work that considers the integration of memory exploration and data layout as one single problem and optimizes for performance and area (memory cost). The memory exploration strategy presented in [64] explores the design space to find the optimal configurations considering the cache size, processor cycles and energy consumption. They propose an enumeration based search. Our approach, on the other hand, uses evolutionary methods and is efficient, in terms of computation time, in exploring complex memory architectures with multiple objectives. There are also other memory design space exploration approaches that consider cache based target memory architectures [9, 52, 51]. In this chapter, our work addresses the memory architecture exploration of DSP memory architectures that are typically organized as multiple memory banks, where each of the banks can consist of single/dual port memories with different sizes. We consider non-uniform memory bank sizes. Our work uses an integrated data layout and memory architecture exploration approach, which is key for guiding GA's search path in the right direction. The cost functions and the solution search space are very different for a cache based memory architecture as used in [9, 52] and an on-chip
scratch pad based DSP memory architecture as used in our work. Although the approach presented in this chapter does not address cache based architectures, we deal with this in Chapter 7. In summary, the unique contributions of our work are the following: (a) integrating memory architecture exploration and data layout in an iterative framework to explore the memory design space, (b) addressing the class of DSP memory architectures that are more complex and heterogeneous, and (c) solving the design space exploration problem for multiple objectives (memory architecture performance and memory area) to obtain a set of Pareto-optimal design solutions.
4.7 Conclusions
In this chapter we addressed the multi-level multi-objective memory architecture exploration problem through a combination of evolutionary algorithms (for memory architecture exploration) and an efficient heuristic data placement algorithm. More specifically, for the outer level memory exploration problem, we have used two different evolutionary algorithms: (a) a multi-objective Genetic Algorithm and (b) Simulated Annealing. We have addressed two of the key system design objectives: (i) performance in terms of memory stall cycles and (ii) memory cost. Our approach explores the design space and gives a few hundred Pareto optimal memory architectures at various system design points in a few hours of run time on a standard desktop. Each of these Pareto optimal design points is interesting to the system designer from a platform based design viewpoint. We have presented a fully automated approach in order to meet time-to-market requirements. We extend the methodology to handle energy consumption in Chapter 6.
Chapter 5
Data Layout Exploration

5.1 Introduction
In Chapter 3, we addressed the data-layout problem only from a performance (reducing memory stalls) perspective. In this chapter, we explore the data-layout design space with the objective of identifying a set of Pareto-optimal data-layout solutions that are interesting from a performance and power viewpoint. In the earlier chapters, we solved the data layout and memory architecture design space exploration problems from the logical architecture viewpoint. This is because, during system design, the software application developers work with the logical memory architecture, which specifies the logical structure of the memory architecture in terms of the size of the on-chip memory, the organization of the on-chip memory in terms of the number of memory banks, the number of ports per memory bank, the size of the external memory, and the access latency of all these memories. These are the parameters that impact the performance of the application, and hence the software developers use this logical view of the memory architecture to optimize the data layout and extract the maximum performance. The hardware designers take the logical memory architecture specification as input and design a physical memory architecture. This process is referred to as memory allocation in the literature [35, 61]. Each of the logical memories is constructed with one or
more memory modules taken from a semiconductor vendor memory library. For example, a logical memory bank of 16KB×16 can be constructed with four 4KB×16, eight 2KB×16, eight 4KB×8, or sixteen 1KB×16 memory units. Each of these options, for different process technologies and different memory unit organizations, results in different performance, area and energy consumption trade-offs. Hence the memory allocation process is performed with the objective of reducing the memory area in terms of silicon gates, and the energy consumption. The memory allocation problem in general is NP-Complete [35]. Earlier approaches to the data layout step typically use a logical memory architecture as input [10, 53], and as a consequence power consumption data for the memory banks is not available. By considering the physical memory architecture, the data layout method proposed in this chapter can optimize for power as well. Also, a common design assumption in earlier approaches [10] is that, for data layout, power and performance are non-conflicting objectives, and therefore optimizing performance will also result in lower power. However, we show that this assumption in general is not valid for all classes of memory architectures. Specifically, we show that for DSP memory architectures, power and performance are conflicting objectives and a significant trade-off (up to 70%) is possible. Hence this factor needs to be carefully considered in the data layout method to choose an optimal power-performance point in the design space. When we extend these problems to take the physical memory architecture into account, there are two possible approaches. One approach is to solve the data layout and memory architecture exploration problem for the logical memory architecture, as described in the previous chapters, and then map the logical memory architecture to a physical memory architecture. Alternatively, the above problem can be directly solved for the physical memory architecture. We evaluate both these approaches and demonstrate that the latter is more beneficial. In this process, we develop a comprehensive automatic memory architecture exploration framework, which can explore logical and physical memory architectures. We do this in a systematic way, first addressing the data layout problem for physical memory architecture in this chapter. The following chapter deals with the memory architecture exploration problem considering the physical memory architecture.
In this chapter we propose MODLEX, a Multi Objective Data Layout EXploration framework based on a Genetic Algorithm, which explores the data layout design space for a given logical memory architecture mapped to a physical memory architecture and obtains a list of Pareto-optimal data layout solutions from a performance and power perspective. In other words, the MODLEX approach identifies the best data layouts for a given physical memory architecture (implementing a logical memory architecture) in order to identify design points that are interesting from a power and performance viewpoint. The main contributions of this chapter are: (a) combining different data layout optimizations into a unified framework that can be used for complex embedded DSP memory architectures; even though we target DSP memory architectures, our method works for microcontrollers as well; (b) modeling the data layout problem as a multi-objective Genetic Algorithm (GA) with performance and power as the objectives; our method optimizes the data layout for power and run-time and presents a set of solution points that are optimal with respect to power and performance; and (c) showing that, although most of the work in the literature assumes performance and power are non-conflicting objectives with respect to data allocation, a significant trade-off (up to 70%) is possible between power and performance. The remainder of this chapter is organized as follows. Section 5.2 deals with the problem definition. In Section 5.3, we present our MODLEX framework. In Section 5.4, we describe the experimental methodology and report our experimental results. In Section 5.5, we present the related work. Finally, concluding remarks are provided in Section 5.6.
5.2 Problem Definition
We are given a logical memory architecture Me with m on-chip SARAM memory banks, n on-chip DARAM memory banks, and an off-chip memory, together with the memory access characteristics of the application in terms of (i) the conflict matrix defined in Section 3.2.1, which specifies the number of concurrent accesses between each pair of data sections i and j as well as self-conflicts, and (ii) the frequency of access of individual data sections. The problem at
hand is to realise the logical memory architecture Me in terms of physical memory modules available in the ASIC memory library for a given technology or process node, and to obtain a suitable data layout for the physical memory architecture Mp such that the number of memory stalls incurred and the energy consumed by the memory architecture are minimised. More specifically, we consider the data layout problem for physical memory architecture with the following two objectives:

• the number of memory stalls incurred due to conflicting accesses (parallel and self conflicts) and the additional cycles incurred in accessing off-chip memory; and

• the total memory power, calculated as the sum of the memory power of all memory banks over all memory accesses. The memory power of each bank is computed by multiplying the number of read/write accesses (based on the data placed in the bank) by the power per read/write access of the specific memory module accessed.

We defer the consideration of memory area optimization from a physical memory architecture exploration perspective to the following chapter.
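For illustration, the two objectives could be evaluated as sketched below, given a placement vector, the conflict matrix, per-section access frequencies and per-bank module data. The data structures are assumptions made for the sketch; the actual stall model is the one described in Section 3.5, and the power-per-access values come from the ASIC memory library.

    def memory_stalls(placement, conflicts, banks):
        # First objective: conflicts between data sections placed in the
        # same single-port bank, weighted by that bank's latency. Dual-port
        # (DARAM) banks serve two concurrent accesses, so parallel and self
        # conflicts there cost nothing. Extra wait-states for off-chip
        # accesses would be added in the same way.
        stalls = 0
        for b, bank in enumerate(banks):
            if bank["dual_port"]:
                continue
            sections = [i for i, p in enumerate(placement) if p == b]
            for x in sections:
                for y in sections:
                    if x <= y:  # x == y picks up the self conflicts C[x][x]
                        stalls += conflicts[x][y] * bank["latency"]
        return stalls

    def memory_power(placement, access_freq, banks):
        # Second objective: accesses to each data section multiplied by the
        # power per read/write access of the physical module backing the
        # bank (off-chip memory modelled as one more entry in banks).
        return sum(access_freq[i] * banks[b]["power_per_access"]
                   for i, b in enumerate(placement))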
5.3 MODLEX: Multi Objective Data Layout EXploration

5.3.1 Method Overview
We formulate the data layout problem as a multi-objective GA [30] to obtain a set of Pareto optimal design points. The objectives are minimizing memory stall cycles and memory power. Figure 5.1 illustrates our MODLEX (Multi Objective Data Layout EXploration) framework, which takes application profile information and a logical memory architecture as inputs. The logical memory architecture, as explained in Chapter 4, contains the number of memory banks, memory bank sizes, memory bank types (single-port, dual-port), and memory bank latencies. The logical memory to physical memory
map is obtained using a greedy heuristic method, which is explained in the following section. The core engine of the MODLEX framework is the multi-objective data layout, which is implemented as a Genetic Algorithm (GA). The data layout block takes the application data and the logical memory architecture as input and outputs a data placement. The cost of a data placement in terms of memory stalls is computed as explained in Chapter 3. To compute the memory power, we use the physical memory architecture and the power per read/write access obtained from the ASIC memory library. The memory power computation is further explained in Section 5.3.3.3. The overall fitness function used by the GA is a combination of the memory stall cost and the memory power cost. Based on the fitness function, the GA evolves by selecting the fittest individuals (the data placements with the lowest cost) for the next generation. To handle multiple objectives, the fitness is computed by ranking the chromosomes based on the non-dominated criteria (as explained in Section 2.5). This process is repeated for a maximum number of generations, specified as an input parameter.
Figure 5.1: MODLEX: Multi Objective Data Layout EXploration Framework
5.3.2 Mapping Logical Memory to Physical Memory
To get the memory power and area numbers, the logical memories have to be mapped to physical memory modules available in an ASIC memory library for a specific technology/process node. As mentioned earlier, each logical memory bank can be implemented physically in many ways. For example, a logical memory bank of 4K×16 bits can be formed with two physical memories of size 2K×16 bits or four physical memories of size 2K×8 bits. Different approaches have been proposed for mapping logical memory to physical memories [35, 61]. The memory mapping problem in general is NP-Complete [35]. However, since the logical memory architecture is already organized as multiple memory banks, most of the mapping turns out to be a direct one-to-one mapping. In this chapter a simple greedy heuristic is used to perform the mapping of logical to physical memory with the objective of reducing silicon area. This is achieved by first sorting the memory modules by area per byte and then choosing the smallest area-per-byte physical memory modules to form the required logical memory bank size. Though this heuristic is very simple, it results in an efficient physical memory architecture. Further, in the following chapter, we consider the exploration of the physical memory architecture with the added objective of area optimization.
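A minimal sketch of this greedy mapping is shown below, assuming each library entry records its capacity in bytes and its silicon area; bit-width matching and other electrical constraints imposed by a real memory library are ignored here.

    def map_logical_bank(bank_bytes, library):
        # Tile one logical bank with the most area-efficient modules first
        # (smallest area per byte); pad any remainder with the smallest module.
        ranked = sorted(library, key=lambda m: m["area"] / m["bytes"])
        chosen, remaining = [], bank_bytes
        for module in ranked:
            while remaining >= module["bytes"]:
                chosen.append(module["name"])
                remaining -= module["bytes"]
        if remaining > 0:
            chosen.append(min(ranked, key=lambda m: m["bytes"])["name"])
        return chosen

    # hypothetical library entries (area in arbitrary units)
    lib = [{"name": "4Kx16", "bytes": 8192, "area": 1.0},
           {"name": "2Kx16", "bytes": 4096, "area": 0.55}]
    print(map_logical_bank(16384, lib))  # ['4Kx16', '4Kx16']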
5.3.3 Genetic Algorithm Formulation

To map the data layout problem to the GA framework, we use the chromosomal representation, fitness computation, selection function and genetic operators defined in Section 3.4. For easy reference and completeness, we briefly describe them in the following subsections.

5.3.3.1 Chromosome Representation
For the data memory layout problem, each individual chromosome represents a memory placement. A chromosome is a vector of d elements, where d is the number of data sections. Each element of a chromosome can take a value in (0..m), where 1..m represent on-chip logical memory banks (including both SARAM and DARAM banks) and 0
represents off-chip memory. For the purpose of data layout it is sufficient to consider the logical memory architecture, from which the number of memory stalls can be computed. However, for computing the power consumption of a given placement, the corresponding physical memory architecture, obtained from our heuristic mapping algorithm, needs to be considered. If element i of a chromosome has the value k, then data section i is placed in memory bank k; thus a chromosome represents a memory placement for all data sections. Note that a chromosome may not always represent a valid memory placement, as the size of the data sections placed in a memory bank k may exceed the size of k. Such a chromosome is marked as invalid and assigned a low fitness value.
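As an illustration, the validity check could look like the following sketch (the names are assumptions; the encoding is the one described above).

    def is_valid(chromosome, section_sizes, bank_capacities):
        # chromosome[i] is the bank index for data section i: 1..m are the
        # on-chip banks, 0 is off-chip (treated as unbounded here). The
        # placement is invalid when any on-chip bank's capacity is exceeded.
        used = [0] * (len(bank_capacities) + 1)
        for section, bank in enumerate(chromosome):
            used[bank] += section_sizes[section]
        return all(used[b] <= bank_capacities[b - 1]
                   for b in range(1, len(used)))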
5.3.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated, resulting in a total population of (P + M). From this, the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented as explained in Section 3.4.

5.3.3.3 Fitness Function and Ranking
For each individual, corresponding to a data layout, the fitness function computes the power consumed by the memory architecture (Mpow) and the performance in terms of memory stall cycles (Mcyc). The computation of Mcyc is similar to the cost function used in our heuristic algorithm described in Section 3.5 and is explained briefly below. The number of memory stalls incurred in a memory bank j can be computed by summing the number of conflicts between pairs of data sections that are kept in j. For each pair of conflicting data sections, the number of conflicts is given by the conflict matrix. Thus the number of stalls in memory bank j is given by Σ Cx,y, over all (x, y) such that data sections x and y are placed in memory bank j. As DARAM banks support concurrent
accesses, DARAM bank conflicts Cx,y between data sections x and y placed in a DARAM bank, as well as self conflicts Cx,x, do not incur any memory stalls. Note that our model assumes only up to two concurrent accesses in any cycle. The total memory stalls incurred in bank j can be computed by multiplying the number of conflicts by the bank latency. The total memory stalls for the complete memory architecture are computed by summing the memory stalls incurred by all the individual memory banks. The memory power corresponding to a chromosome is computed as follows. Assume each logical memory bank j is mapped to a set of physical memory banks mj,1, mj,2, ..., mj,nj. If Pj,k is the power per read/write access of memory module mj,k and AFi,j,k is the number of accesses to data section i that map to physical memory bank mj,k, then the total power consumed is given by
Ponchip = Σi Σj Σk AFi,j,k × Pj,k        (5.1)
Note that AFi,j,k is 0 if data section i is either not mapped to logical memory bank j, or not mapped to physical memory bank k. Also, AFi,j,k and AFi,j,k′ would both account for an access to data section i that is mapped to logical memory bank j when j is implemented using multiple banks k and k′; for example, a logical memory bank of 2K×16 implemented using two physical memory modules of size 2K×8. Thus the total power Mpow for all the memory banks, including off-chip memory, is given by

Mpow = Ponchip + Σi AFi,off × Poff
where AFi,off represents the number of accesses to off-chip memory from data section i, and Poff is the power per access for off-chip memory. Once the memory power and memory cycles are computed for all the individuals in the population, the individuals are ranked according to the Pareto optimality conditions on power consumption (Mpow) and performance in terms of memory stall cycles (Mcyc). More specifically, if (Mpow^a, Mcyc^a) and (Mpow^b, Mcyc^b) are the memory power and memory cycles
of chromosome A and chromosome B, A is ranked higher (i.e., has a lower rank value) than B if

((Mpow^a < Mpow^b) ∧ (Mcyc^a ≤ Mcyc^b)) ∨ ((Mcyc^a < Mcyc^b) ∧ (Mpow^a ≤ Mpow^b))
The ranking process in a multi-objective GA proceeds in the non-dominated sorting manner described in Section 4.3. All non-dominated individuals in the current population get rank value 1 and are flagged. Subsequently, rank-2 individuals are identified as the non-dominated solutions in the remaining population. In this way all chromosomes in the population get a rank value. Higher fitness values are assigned to rank-1 individuals as compared to rank-2, and so on. This fitness is used for the selection probability: individuals with higher fitness get a better chance of being selected for reproduction. To ensure solution diversity, which is critical for getting a good distribution of solutions on the Pareto-optimal front, the fitness value is reduced for a chromosome that has many neighboring solutions. This is accomplished as explained in Section 4.3. The GA must be provided with an initial population, which is created randomly. In our implementation we have used a fixed number of generations as the termination criterion. As the GA evolves over generations, non-dominated solutions, which are Pareto-optimal (data layouts) in terms of performance and power, are saved in a database.
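The sketch below isolates the dominance test and the front-peeling loop, with each individual reduced to its (Mpow, Mcyc) pair; the fitness-sharing and niche-count adjustments of Section 4.3 are omitted.

    def dominates(a, b):
        # a and b are (Mpow, Mcyc) pairs: a dominates b when it is strictly
        # better in one objective and no worse in the other.
        return ((a[0] < b[0] and a[1] <= b[1]) or
                (a[1] < b[1] and a[0] <= b[0]))

    def rank_population(objectives):
        # Non-dominated sorting: rank 1 is the current Pareto front, rank 2
        # the front of what remains, and so on.
        ranks, remaining, rank = {}, set(range(len(objectives))), 1
        while remaining:
            front = {i for i in remaining
                     if not any(dominates(objectives[j], objectives[i])
                                for j in remaining if j != i)}
            for i in front:
                ranks[i] = rank
            remaining -= front
            rank += 1
        return ranks

    print(rank_population([(3, 1), (1, 3), (2, 2), (3, 3)]))
    # the first three are mutually non-dominated (rank 1); (3, 3) gets rank 2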
5.4 Experimental Results

5.4.1 Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier chapters. For the memory allocation step, we have used TI's ASIC memory library, from which the area and power numbers are obtained. We consider a set of 6 different logical memory architectures listed in Table 5.1. The corresponding physical memory architectures and the normalized area (as the ASIC library is proprietary to Texas Instruments, we present only normalized power and area numbers) required by the physical memory for the different architectures are also given in Table 5.1. Note that the memory size used for each of the memory architectures is 96KB, which is enough to fit the data of each of the applications considered. Further, the architectures A1 to A5 are sorted by physical memory area in descending order. Architecture A6 will be used in Section 5.4.3 for comparison with other related work. In column 3 of Table 5.1, the physical memory banks marked 1P and 2P represent single and dual port memory banks respectively. Architectures A1 to A5 are selected such that the memory configuration, in terms of the number of memory banks and the bank types (SARAM and DARAM), is varied. In all of these configurations, the data width is 16 bits in both the logical architecture and the physical memory banks. From the table it can be observed that the memory area increases with the DARAM size and the number of banks. A1 has the highest number of memory banks and the largest DARAM size; hence A1 consumes the largest area. A2 and A3 have the same DARAM size but different SARAM configurations. A3 and A4 present a non-uniform bank size based SARAM architecture. Non-uniform bank size based architectures allow the usage of memory banks of multiple sizes and hence present opportunities to optimize memory area and power consumption: larger memory banks optimize area, whereas smaller memory banks reduce power consumption. A5 has the fewest memory banks and uses larger memories with a reduced memory area. In summary, we would expect architecture A1 to perform very well in terms of performance because of its large DARAM memory, and architecture A4 to perform better in terms of power consumption because of its smaller DARAM size and the presence of non-uniform bank sizes. Note that architecture A4 has more memory area than A5 even though it has only half of A5's DARAM; this is due to the higher number of banks in A4. Note that A6 has the lowest area because it has 32KB of off-chip RAM, and off-chip memory is not included in the area.
Table 5.1: Memory Architectures Used for Data Layout

Arch   Logical Memory        Physical Memory      Memory Area
A1     2x8K (SARAM)          2x8192 (1P)          1
       5x16K (DARAM)         20x4096 (2P)
A2     16x4K (SARAM)         16x4096 (1P)         0.91
       32K (DARAM)           8x4096 (2P)
A3     8x4K (SARAM)          8x4096 (1P)          0.82
       1x32K (SARAM)         1x32K (1P)
       8x4K (DARAM)          8x4096 (2P)
A4     8x2K (SARAM)          8x2048 (1P)          0.77
       4x4K (SARAM)          4x4096 (1P)
       3x16K (SARAM)         3x16K (1P)
       16K (DARAM)           4x4096 (2P)
A5     64K (SARAM)           2x32K (1P)           0.72
       32K (DARAM)           8x4096 (2P)
A6     8x2K (SARAM)          8x2048 (1P)          0.57
       4x4K (SARAM)          4x4096 (1P)
       16K (SARAM)           1x16K (1P)
       16K (DARAM)           4x4096 (2P)
       1x32K (Off-Chip)      1x32K (SDRAM)
5.4.2 Experimental Results
This section presents the experimental results on the multi-objective data layout for physical memory architectures. Figures 5.2, 5.3, and 5.4 show the sets of non-dominated points, each of which corresponds to a Pareto optimal data layout from a power and performance viewpoint, for the 3 applications on architectures A1-A5. Note that architectures A1 to A5 correspond to fixed physical memory architectures with known silicon areas. Figure 5.2 presents the data layout solution space from a power consumption and performance (memory stalls) viewpoint. Each point in the plot represents a data layout for a given architecture. Observe that several data layout solutions are presented for each of the architectures considered.
Figure 5.2: Data Layout Exploration: MPEG Encoder
It should be noted that the non-dominated points seen by the multi-objective GA are only near optimal, in the sense that they are non-dominated among the solutions seen so far; the evolutionary method may later find a design point that dominates an existing non-dominated solution. However, we choose the number of generations in our method in such a way that increasing the number of iterations does
Figure 5.3: Data Layout Exploration: Voice Encoder
Figure 5.4: Data Layout Exploration: Multi-Channel DSL
Figure 5.5: Individual Optimizations vs Integrated
not result in any new non-dominated points. Note that the solution points corresponding to architecture A1 give better performance (fewer memory stalls); observe that there is a solution point (data layout) that resolves all the memory stalls for MPEG. This is because of the large DARAM in A1. However, most of the solution points of A1 consume more power, again due to the large DARAM size. The solution points of A2 follow those of A1 very closely. Observe that A2's solution points are only slightly inferior in performance to A1's, even though A2 has less than half the DARAM of A1. Also, A2's solution points dominate most of A1's solution points in the low performance region (they perform better in terms of both power and performance; note also that the memory area of A2 is lower than that of A1). Hence, it can be deduced that MPEG does not gain in performance from a larger DARAM, while at the same time the larger DARAM of A1 decreases the power efficiency of the data layouts. Interestingly, even A3's solution points dominate A1's in the lower performance region. The solution points corresponding to A4 give the best results
in terms of power and performance in the mid-region. Observe that A4's solution points dominate all the other solution points in the mid-region. A5's solution points are notably inferior compared to the rest of the solutions. This is because of the small number of memory banks in A5. From this, it can be deduced that MPEG performs multiple simultaneous memory accesses and thus, for MPEG, multiple memory banks are more important than DARAM banks for achieving better solution points. Figure 5.3 presents the results for the Voice Encoder application for the 5 architectures A1-A5. Unlike MPEG, the solution points of A1 are clearly superior here, mainly in terms of performance. Observe that the solution points of architectures A1, A2 and A4 dominate parts of the power-performance space: solutions of A1 dominate the high performance region, solutions of A2 and A4 dominate the middle region in both performance and power, and solutions of A2 again dominate the low power-performance region. From the results, it can be deduced that for the voice encoder, DARAM and multiple memory banks are both equally critical. With only a small increase in area compared to A5, A3 achieves much better performance than A5; this is due to the higher number of banks in A3, which resolves more parallel conflicts. Figure 5.4 presents the results for the Multi-channel DSL application for the 5 architectures A1-A5 described in Table 5.1. Observe that every architecture gives a solution point with near zero memory stalls. This indicates that the application does not require more than 16K of DARAM (the smallest DARAM size used among architectures A1-A5, used in A4). Also, it can be deduced that this application does not need more than 3 banks to resolve all the parallel conflicts (note that A5 has only 3 banks). A significant portion of the DSL application was developed in the 'C' language, and this is one reason for the smaller number of parallel and self conflicts. Typically, hand optimized assembly code will try to exploit DSP architectures by using multiple simultaneous accesses and self accesses, whereas compiler generated assembly code may not be as efficient as hand-optimized code, mainly in terms of parallel memory accesses. Interestingly, the solution points of A4 dominate most of the other solution points. This is mainly due to the non-uniform bank sizes of A4, which present opportunities for the data
layout to optimize and trade off power and performance. Also observe the wide range of trade-offs available between power and performance for all the applications on each of the architectures. This is very useful for application engineers and system designers from a platform design viewpoint. These different power-performance operating points are also essential for SoCs that have Dynamic Voltage and Frequency Scaling (DVFS) [23]. DVFS presents different operating points for an SoC to save power based on use cases. For example, in a mobile phone, stand-alone MP3 playback may not require much performance, whereas MP3 playback while shooting a still picture will. DVFS allows different operating points for the stand-alone MP3 player and MP3 with camera: the MP3 player can be operated with the processor running at 80 MHz at 1.2V, whereas MP3 with camera needs more performance and can be operated with the processor running at 120 MHz at 1.45V. Hence we need two different data layouts for MP3 at these two operating points. Next, we report the execution time required by our multi-objective GA to obtain the data layouts. The GA is run on a Pentium4 desktop at 1.7GHz. It takes 26 to 31 minutes to obtain all Pareto optimal design points for a single architecture; this run-time is approximately the same for all the applications.
5.4.3 Comparison of MODLEX and Stand-alone Optimizations
In this section we present results for the stand-alone optimizations and compare them with our integrated approach, MODLEX, where all the optimizations are considered together. For this purpose we consider the following optimizations:

O1: performing just the on-chip/off-chip data partitioning, similar to the approach proposed in [10, 67]

O2: performing O1 and also resolving parallel memory conflicts by utilizing only multiple memory banks [44, 40]

O3: the MODLEX approach, which integrates O1 and O2, resolves self-conflicts, and also exploits non-uniform sized memory banks
Figure 5.5 presents the results for MPEG on the memory architecture A6 described in Table 5.1. There are six different plots, and each plot corresponds to a specific data layout optimization. The plots O1 and O2 use only the optimizations O1 and O2 respectively; in comparison, our MODLEX framework with optimization O3 presents different solution points from a power and performance viewpoint. Observe that for the same memory architecture, the MODLEX approach presents a wide range of solutions, from the high performance region that resolves almost all the memory stalls to the low performance region. Note that, from a power and performance perspective, the solution points of the integrated approach completely dominate the solution points of the other two plots. Methods like [10, 67] give power/performance close to point P1, point P2 corresponds to the works [44, 40, 58], and the data layout that optimizes power [15] is represented by point P3. From the results we can conclude that our integrated approach gives better solution points with respect to both power and performance. Also, from the experimental results it can be concluded that a wide range of design points with respect to power and performance can be obtained from multi-objective data layout optimization. The computation cost involved in our approach is very small: less than an hour on a standard desktop.
5.5 Related Work
The data layout problem [10, 15, 40, 44, 53, 67] has been widely researched in the literature, from either a performance or a power perspective individually. In [18], a low-energy memory design method referred to as VAbM is proposed; it optimizes the memory area by allocating multiple memory banks with variable bit-width to optimally fit the application data. In [15], Benini et al. present a data layout method that aims at energy reduction. The main idea of this work is to use the access frequency of the memory address space as the starting point and to design smaller (larger) banks for the most (least) frequently accessed memory addresses. In [40], the authors present a heuristic algorithm to efficiently partition the data to avoid parallel conflicts in DSP applications. Their objective is to partition the data into multiple chunks of the same size so that they can fit in a memory
architecture with uniform bank sizes. This approach works well if we consider only performance as an objective. However, if the objective is to optimize both performance and power, then a memory architecture with non-uniform banks is very attractive. All the above optimizations are very effective individually for the class of memory architecture they target. However, a complete data layout approach has to combine many or all of these approaches to address the problem comprehensively. Also, simply combining different optimizations may not be optimal compared to an integrated approach, which is likely to yield a better result. Our MODLEX framework accomplishes this. Further, our data layout approach can effectively partition data to resolve parallel conflicts and also exploit non-uniform bank architectures to save power. To the best of our knowledge there is no other work in the literature that addresses this problem.
5.6 Conclusions
In this chapter we presented MODLEX, a Multi Objective Data Layout EXploration framework for physical memory architectures. Our approach results in many data layouts that are Pareto-optimal with respect to power and performance, which is important from a platform design viewpoint. We demonstrated that a significant trade-off (up to 70%) is possible between power and performance. In the next chapter we extend our framework to explore the memory architecture design space along with the data layout.
Chapter 6
Physical Memory Exploration

6.1 Introduction
As discussed in Chapter 2, at the memory exploration step a memory architecture is defined; this includes determining the on-chip memory size, the number and size of each memory bank in SPRAM, the number of memory ports per bank, the types of memory (scratch pad RAM or cache), and the wait-states/latency. This architecture was referred to as the logical memory architecture in Chapter 4, as it is not yet tied to specific ASIC memory library modules. We proposed a logical memory exploration (LME) framework in Chapter 4 for identifying Pareto optimal design points in terms of performance and area. At the memory architecture design stage, each of the logical memory architectures under consideration has to be mapped to a physical memory architecture. Alternatively, one can directly explore the space of physical memory architectures, taking into consideration the different memory banks/modules available in the semi-conductor vendor memory library. The physical memory architecture proposed for a given application is once again evaluated from a performance, power and cost (memory area) viewpoint, and a list of Pareto optimal design points is obtained that are interesting from a platform design perspective. In this chapter we evaluate these two approaches, explained in greater detail below.

• The first approach is an extension of the Logical Memory Exploration (LME)
Figure 6.1: Memory Architecture Exploration
method described in Chapter 4. The output of LME is a set of design points (logical memory architectures) that are Pareto optimal with respect to performance and (logical) memory cost. Also, as part of LME, the data layout step generates data placement details for each of the logical memory architectures explored in LME. The non-dominated points from LME and the placement details for each non-dominated point form the inputs to the physical memory architecture exploration step. The mapping of a logical memory architecture to a physical memory architecture is formulated as a multi-objective Genetic Algorithm that explores the design space with power and area as the objectives. The area and power numbers of the physical memory modules are obtained from a semi-conductor vendor memory library. The physical memory exploration step is performed for every non-dominated point from LME. Note that performance was one of the objectives at the LME level, and it does not change during the physical memory exploration step. Hence, at the output of the physical memory exploration, for every non-dominated point generated by LME, a set of non-dominated points is identified that are optimal with respect to power and area. We refer to this approach as LME2PME.

• The second approach is a direct approach to Physical Memory Exploration (PME). In this approach we integrate three critical components: (i) memory architecture exploration, (ii) memory allocation, which constructs a logical memory by picking memory modules from a semi-conductor vendor memory library, and (iii)
data layout exploration, which is critical for estimating performance. The memory allocation step is important as it influences the power per read/write access as well as the memory area of all the memory modules. This integrated approach is shown in Figure 6.2. For memory architecture exploration, we use a multi-objective non-dominated sorting Genetic Algorithm approach [25]. For the data layout problem, which needs to be solved for each of the thousands of memory architectures, we use the fast and efficient heuristic method described in Section 3.5. For the memory allocation, we use an exhaustive search algorithm. Thus the overall framework uses a two-level iterative approach, with memory architecture exploration and memory packing at the outer level and data layout at the inner level. We propose a fully automated framework for this integrated approach, which we refer to as DirPME.
Figure 6.2: Memory Architecture Exploration - Integrated Approach
Thus, the main contribution of this chapter is a two-pronged approach to physical memory architecture exploration. Our method optimizes the memory architecture for a given application and presents a set of solution points that are optimal with respect to performance, power and area. The remainder of this chapter is organized as follows. In Section 6.2, we present the LME2PME approach. Section 6.3 deals with our DirPME framework. In Section 6.4, we present the experimental methodology and results for both the LME2PME and DirPME
frameworks. Section 6.5 covers some of the related work from the literature. Finally in Section 6.6, we conclude by summarizing our work in this chapter.
6.2 Logical Memory Exploration to Physical Memory Exploration (LME2PME)
6.2.1 Method Overview
The LME2PME method extends the Logical Memory Exploration (LME) process described in Chapter 4 by considering memory power and memory area in addition to the memory performance objective addressed by the LME. Note that the LME works on minimizing the number of memory stalls for a given logical memory cost, where the logical memory cost is a factor proportional to memory area. For a given application, LME finds a list of Pareto optimal logical memory architectures with performance and logical memory area as the objective criteria. This is shown in the top right portion of Figure 6.3. At the LME step, the memory is not mapped to physical modules, and hence the actual silicon area and power consumption numbers are not known. Also, for a given logical memory architecture there are many possible ways to implement the actual physical memory architecture. As shown in Figure 6.3, the non-dominated points from LME are taken as inputs for the memory allocation exploration step. The output of Physical Memory Exploration is a set of Pareto optimal points with memory power, memory area and memory stalls as the objective criteria. For each non-dominated logical memory architecture generated by LME, there are multiple physical memory architectures with different power-area operating points but the same memory stalls. This is shown in Figure 6.3, where the design solution LM1 in LME's output corresponds to a memory stall of ms1 and generates a set of Pareto optimal points (denoted by PM1s in the lower half of Figure 6.3) with respect to memory area and memory power. Similarly, LM2, which incurs a memory stall of ms2, results in a set PM2s of physical memory architectures. Note that ms1 and ms2, the memory stalls as determined by
LME, do not change during the Physical Memory Exploration step. Different physical memory architectures are explored with different area-power operating points for a given memory performance.
Figure 6.3: Logical to Physical Memory Exploration - Overview
6.2.2 Physical Memory Exploration
In a traditional HW-SW codesign method, once the logical memory architecture is finalized, the HW and SW designs proceed independently. The SW design teams focus on performance optimization of the application on the given logical memory architecture, while the HW design teams focus on area optimization during the memory allocation step. In the process, power optimization, which requires both a HW and a SW perspective, is not considered. Our LME2PME method addresses this problem by taking the required inputs from the data layout step, which helps in optimizing power consumption and area at the same time during memory allocation. Figure 6.4 describes the LME2PME method.
The top part of Figure 6.4, the logical memory architecture exploration (LME), is the same as what was described in Chapter 4. The bottom part of Figure 6.4 shows the physical memory exploration. As shown in the figure, the Physical Memory Exploration (PME) step takes two inputs from the LME. The first input is the set of non-dominated points generated by LME; the second input is the data placement details, which are the output of the data layout step and provide information on which data-section is placed in which memory bank. From the data placement and the profile data, the PME computes the number of memory accesses per logical memory bank. This information can be used to decide between larger and smaller memories while mapping a logical memory bank. As discussed in Chapter 2, a smaller memory consumes less power per read/write access than a larger memory. Hence, if a logical memory bank is known to hold data that is accessed a large number of times, it is power-optimal to build this logical memory bank from many smaller physical memories. However, this comes with a higher silicon area cost and hence results in an area-power trade-off. We formulate the memory allocation exploration as a Multi-Objective Genetic Algorithm problem. To map an optimization problem to the GA framework, we need the following: a chromosomal representation, a fitness computation, a selection function, genetic operators, the creation of the initial population and the termination criteria. Figure 6.5 explains the GA formulation of the Physical Memory Mapping problem.
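As an illustration, the per-bank access count can be derived from the data placement and the profile data along the following lines; this is a hypothetical sketch, not the thesis implementation.

# Hypothetical sketch: memory accesses per logical bank, computed from the
# LME data placement (section -> bank) and profiled access frequencies.

def accesses_per_bank(placement, access_freq):
    bank_accesses = {}
    for section, bank in placement.items():
        bank_accesses[bank] = bank_accesses.get(bank, 0) + access_freq[section]
    return bank_accesses

# Example: two sections in bank 0 contribute their combined access counts.
assert accesses_per_bank({"x": 0, "y": 0, "z": 1},
                         {"x": 100, "y": 50, "z": 10}) == {0: 150, 1: 10}

Banks with high access counts are candidates for construction from several small physical modules, trading extra area for lower power per access.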
6.2.3 Genetic Algorithm Formulation
6.2.3.1 Chromosome Representation
Each individual chromosome represents a physical memory architecture. As shown in Figure 6.5, a chromosome consists of a list of physical memories picked from an ASIC memory library. This list of physical memories is used to construct a given logical memory architecture. Typically, multiple physical memory modules are used to construct a logical memory bank. As an example, if the logical bank is of size 8K×16bits, the physical memory modules can be two 4K×16bits, eight 2K×8bits, eight 1K×16bits, and so on. We have limited the number of physical memory modules per logical memory bank
Figure 6.4: Logical to Physical Memory Exploration - Method
to at most k. Thus, a chromosome is a vector of d elements, where d = Nl × k + 1 and Nl is the number of logical memory banks, which is an input from LME. Each element is an index into the semiconductor vendor memory library and corresponds to a specific physical memory module. For decoding a chromosome, for each of the Nl logical banks, the chromosome has k elements. As mentioned earlier, each of the k elements is an integer used to index into the semiconductor vendor memory library. With the k physical memory modules, a logical memory bank is formed. We have used a memory allocator that performs exhaustive combinations of the k physical memory modules to obtain the largest logical memory required with the specified word size. Here, the bank size, the word size and the number of ports are obtained from the logical memory architecture corresponding to the chosen non-dominated point.
Figure 6.5: GA Formulation of LME2PME
In this process, it may happen that m out of the total k physical memories selected are not used, if the given logical memory bank can be constructed with k−m physical memories¹. For example, if k=4, the 4 elements are 2K×8bits, 2K×8bits, 1K×8bits and 16K×8bits, and the logical memory bank is 2K×16bits, then our memory allocator builds the 2K×16bits logical memory bank from the two 2K×8bits modules, and the remaining two memories are ignored. Note that the 16K×8bit and 1K×8bit memories are removed from the configuration, as the logical memory bank can be constructed optimally with the two 2K×8bit memory modules. Here, the memory area of this logical
¹This approach of using only the required k − m physical memory modules relaxes the constraint that the chromosome representation has to exactly match a given logical memory architecture. This, in turn, facilitates the GA approach to explore many physical memory architectures efficiently.
memory bank is the sum of the memory areas of the two 2K×8bit physical memory modules². This process is repeated for each of the Nl logical memory banks. The memory area of a memory architecture is the sum of the areas of all the logical memory banks.
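The decoding step might be sketched as follows. This is an illustrative reconstruction under assumptions: decode_chromosome and allocate_bank are hypothetical names, with allocate_bank standing in for the exhaustive allocator described above.

# Hypothetical sketch of LME2PME chromosome decoding. Each logical bank owns
# k gene slots; each gene indexes a module in the vendor memory library.

def decode_chromosome(genes, logical_banks, library, k, allocate_bank):
    """Return (modules used per bank, total area) or None if infeasible."""
    used_per_bank, total_area = [], 0.0
    for b, bank_spec in enumerate(logical_banks):
        candidates = [library[g] for g in genes[b * k:(b + 1) * k]]
        used = allocate_bank(candidates, bank_spec)  # surplus modules are ignored
        if used is None:
            return None                              # bank cannot be constructed
        used_per_bank.append(used)
        total_area += sum(m.area for m in used)      # area of the used modules only
    return used_per_bank, total_area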
6.2.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). The crossover operation is performed as illustrated in Figure 6.5. From this total population of (P + M), the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way.
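A minimal sketch of this survivor selection, assuming crossover, mutate and rank stand in for the genetic operators and the non-dominated ranking (lower rank means fitter):

import random

def next_generation(parents, crossover, mutate, rank, M):
    # P parents produce M offspring; the P fittest of (P + M) survive.
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(M)]
    pool = parents + offspring
    pool.sort(key=rank)
    return pool[:len(parents)]      # the remaining M individuals are annihilated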
6.2.3.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea and Mpow. Note that Mcyc is not computed, as it is already available from LME. Marea is obtained from the memory mapping block as the sum of the areas of all the physical memory modules used in the chromosome. Mpow is computed based on two factors: (a) the access frequency of the data-sections and the data-placement information, and (b) the power per read/write access derived from the semiconductor vendor memory library for all the physical memory modules. To compute the memory power, the method uses the data layout information provided by the LME step. Based on the data layout, and the physical memories required to form the logical memory (obtained from the chromosome representation), the accesses to each data section are mapped to the respective physical memories.
²Although the chromosome representation may have more physical memories than required to construct the given logical memory, the fitness function (area and power estimates) is derived only for the required physical memories.
From this mapping, the power per access for each physical memory, and the number of accesses to each data section, the total memory power consumed by all accesses to a data section is determined. The total memory power consumed by the entire application on a given physical memory architecture is then computed by summing the power consumed by all the data sections. Once the memory area, memory power and memory cycles are computed for all the individuals in the population, the individuals are ranked according to the Pareto optimality condition given in the following equation, which is similar to the Pareto optimality condition discussed in Chapter 4 but considers all three objective functions. Let $(M^a_{pow}, M^a_{cyc}, M^a_{area})$ and $(M^b_{pow}, M^b_{cyc}, M^b_{area})$ be the memory power, memory cycles and memory area of chromosome A and chromosome B respectively. A dominates B if the following expression is true:

$$\big((M^a_{pow} < M^b_{pow}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{cyc} < M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{area} < M^b_{area}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow})\big) \qquad (6.1)$$
For ranking of the chromosomes, we use the non-dominated sorting process described in Section 4.3. The GA must be provided with an initial population that is created randomly. In our implementation we have used a fixed number of generations as the termination criterion.
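The dominance condition above is the standard Pareto test for three minimized objectives; a direct transcription for (power, cycles, area) tuples:

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

# Example: equal cycles, lower power and area.
assert dominates((1.0, 100, 2.0), (1.2, 100, 2.5))
assert not dominates((1.0, 100, 2.0), (1.0, 100, 2.0))   # no strict improvement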
6.3 Direct Physical Memory Exploration (DirPME) Framework
6.3.1 Method Overview
In the LME2PME approach described in the previous section, the physical memory exploration is done in two steps. In this section we describe the DirPME framework that
directly operates in the physical memory design space.
Figure 6.6: MAX: Memory Architecture eXploration Framework
Figure 6.6 explains our DirPME framework. The core engine of the framework is the multi-objective memory architecture exploration, which takes the application data size and the semi-conductor vendor memory library as inputs and forms different memory architectures. The memory allocation procedure builds the complete memory architecture from the memory modules chosen by the exploration block. If the memory modules together do not form a proper memory architecture, the memory allocation block rejects the chosen memory architecture as invalid. The memory allocation step also checks the access time of the on-chip memory modules and rejects those whose cycle time is greater than the required access time. The exploration process using the genetic algorithm and the chromosome representation is discussed in detail in the following section. Once the memory modules are selected, the memory mapping block computes the total memory area, which is the sum of the areas of all the individual memory modules. Details of the selected memory architecture, such as the on-chip memory size, number of memory banks, number of ports and off-chip memory bank latency, are passed to the data layout procedure. The application data buffers and the application profile information are also
given as inputs to the data layout. The application itself consists of multiple modules, including several third-party IP modules, as shown in Figure 6.6. With these inputs, the data layout maps the application data buffers to the memory architecture; the data layout heuristic is the same as explained in Section 3.5. The output of the data layout is a valid placement of application data buffers; from this placement and the application's memory access characteristics, the memory stalls are determined. The memory power is also computed using the application characteristics and the power per access available from the semi-conductor vendor memory library. Lastly, the memory cost is computed by summing the cost of the individual physical memories. Thus the fitness function for the memory exploration is computed from the memory area, performance and power. Based on the fitness function, the GA evolves by selecting the fittest individuals for the next generation. Since the fitness function contains multiple objectives, the fitness is computed by ranking the chromosomes based on the non-dominated criteria (explained in Section 6.3.2). This process is repeated for a maximum number of generations specified as an input parameter.
6.3.2 Genetic Algorithm Formulation
6.3.2.1 Chromosome Representation
For the memory architecture exploration problem in DirPME, each individual chromosome represents a physical memory architecture. As shown in Figure 6.7, a chromosome consists of two parts: (a) the number of logical memory banks (Li), and (b) the list of physical memory modules that form each logical memory bank. Once again we assume that each logical memory bank is constructed using at most k physical memories. A key difference between the LME2PME and DirPME approaches is that in LME2PME the number of logical memory banks is fixed (equal to Nl), so the chromosomes are all of the same size, whereas in DirPME each chromosome has its own Li and hence chromosomes can be of different sizes. Thus, a chromosome is a vector of d elements, where d = Li × k + 1 and Li is the number of logical memory banks of the ith chromosome. The first element of a chromosome is Li and it can take a value in
(0 .. maxbanks), where maxbanks is the maximum number of logical banks given as an input parameter. The remaining elements of a chromosome can take a value in (0 .. m), where 1 .. m represent the physical memory module ids in the semiconductor vendor memory library. The index 0 represents a void memory (of size zero bits), which gives the memory allocation step flexibility in constructing the logical memories.
Figure 6.7: GA Formulation of Physical Memory Exploration
For decoding a chromosome, first Li is read, and then for each of the Li logical banks the chromosome has k elements. Each of the k elements is an integer used to index into the semiconductor vendor memory library. With the k physical memory modules corresponding to a logical memory bank, a rectangular memory bank is formed. We have used the
same memory allocator (described in Section 6.2.3.1), which performs exhaustive combinations of the k physical memory modules to obtain the largest logical memory with the required word size. In this process it may happen that some of the physical memory modules are wasted. For example, if k=4, the 4 elements are 2K×8bits, 2K×8bits, 1K×8bits and 16K×8bits, and the bit-width requirement is 16 bits, then our memory allocator builds a 5K×16bits logical memory bank from the given 4 memory modules. Note that 11K×8bits is wasted in this configuration; this architecture will have a low fitness, as the memory area will be very high, but it is considered in the exploration process nonetheless. The memory area of a logical memory bank is the sum of the memory areas of all its physical memory modules. This process is repeated for each of the Li logical memory banks. The memory area of a memory architecture is the sum of the areas of all the logical memory banks.
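For contrast with the LME2PME decoding sketched earlier, a hypothetical sketch of the DirPME decoding follows; build_bank stands in for the exhaustive allocator, and the gene layout is as described above.

def decode_dirpme(genes, library, k, word_size, build_bank):
    # First gene is Li, the number of logical banks, so chromosomes vary in length.
    num_banks = genes[0]
    banks, total_area = [], 0.0
    for b in range(num_banks):
        ids = genes[1 + b * k : 1 + (b + 1) * k]
        modules = [library[i] for i in ids if i != 0]   # index 0 is the void memory
        bank = build_bank(modules, word_size)           # largest bank of given width
        if bank is None:
            return None                                 # invalid architecture, rejected
        banks.append(bank)
        total_area += sum(m.area for m in modules)      # wasted bits still cost area
    return banks, total_area

Note the difference from LME2PME: here the area includes every selected module, so architectures that waste bits are penalized through their fitness rather than discarded.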
6.3.2.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). From this, the P fittest individuals survive to the next generation; the remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way. The crossover operation is illustrated in Figure 6.7.
6.3.2.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea, Mpow and Mcyc. The value of Mcyc is computed by the data layout using the heuristic explained in Section 3.5. Marea is obtained from the memory mapping block as the sum of the areas of all the memory modules used in the chromosome. Memory power computation is performed in the same way as described in Section
6.2.3.3. Once Marea, Mpow and Mcyc are computed, the chromosomes are ranked as per the process described in Section 6.2.3.3.
6.4 Experimental Methodology and Results
6.4.1 Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier chapters. For performing the memory allocation step, we have used TI’s ASIC memory library. The area and power numbers are obtained from the ASIC memory library. The results from the LME2PME method are presented in the following section. After that we present the results from the DirPME framework. Finally we compare the results from LME2PME and DirPME.
6.4.2 Experimental Results from LME2PME
As discussed in Section 6.2, the LME2PME approach performs memory allocation exploration on the set of Pareto-optimal logical memory architectures obtained from the LME, with the objective of obtaining Pareto-optimal physical memory architectures that are interesting from an area, power and performance viewpoint. Figures 6.8, 6.9 and 6.10 present the results of the LME2PME approach for all 3 applications. In these figures, the x-axis represents the total memory area (normalized) required by a physical memory architecture and the y-axis represents the total power (normalized) consumed by the memory accesses. Each plot corresponds to a set of performance operating points from the LME. Note that the performance points are grouped to reduce the number of plots, so that it is easier to analyze the results. The performance band 0 − 0.1 corresponds to an operating point that resolves > 90% of the memory stalls (from the on-chip memory bank conflicts) and hence is a high performance operating point. Similarly, the performance band 0.8 − 0.9 corresponds to an operating point that resolves less than 20% of the memory stalls and hence is a low performance operating point.
For each of the Pareto-optimal logical memory architectures, the memory allocation exploration step constructs a set of physical memory architectures that have different area-power operating points. Note that the performance (number of memory stalls) remains unchanged from the LME step. Each point in Figures 6.8, 6.9 and 6.10 represents a physical memory architecture. It can be observed from these figures that each plot presents a wide choice of area-power operating points in the physical view. Note that the plots are arranged from the high performance band to the low performance band. Each plot starts in a high-power, low-area region and ends in a low-power, high-area region. Observe that all the high performance (low memory stalls) plots operate in a high area-power region, while the low performance (high memory stalls) operating points have lower area-power values. Thus, from a platform design viewpoint, a system designer needs to be clear on the critical factor among area, power and performance. Based on this information, the system designer can select the appropriate set of operating points that are interesting from the system design perspective.
6.4.3 Experimental Results from DirPME
This section presents the experimental results of the multi-objective Memory Architecture Exploration. The objective is to explore the memory design space to obtain all the non-dominated design solutions (memory architectures) that are Pareto optimal with respect to area, power and performance. The Pareto-optimal design points identified by our framework for the voice encoder application are shown in Figure 6.11. It should be noted that the non-dominated points seen by the multi-objective GA are only near optimal, as the evolutionary method may produce a design point in future generations that dominates them. One can observe a set of points for each x-z plane (memory power - memory stalls) corresponding to a given area. These represent the trade-off between power and performance for a given area. The same data is plotted as a 2D graph in Figure 6.12, where architectures that require an area within a specific range are plotted using the same color. These correspond to the points in a set of x-z planes for the area range.
Figure 6.8: Voice Encoder: Memory Architecture Exploration - Using LME2PME Approach
Figures 6.12, 6.13 and 6.14 show the set of non-dominated points, each corresponding to a Pareto optimal memory architecture, for the 3 applications. It can be observed from Figures 6.12, 6.13 and 6.14 that an increase in memory area results in improved performance and power. Increased area translates to one or more of: larger on-chip memory, an increased number of memory banks, and more dual-port memory; all of these are essential for improved performance. We look at the optimal memory architectures derived by our framework. In particular we consider the regions (a) R1, (b) R2 and (c) R3 in each of the figures. The region R1 corresponds to (high performance, high area, high power); R2 corresponds to (low performance, high area, low power); and the region R3 corresponds to (medium performance, low area, medium power). Since the memory exploration design space is very large, it is important to focus on regions that are critical to the targeted applications. The region R1 has memory architectures with large dual-port memories that aid performance but are also a cause of high power consumption. The region R2 has a large
Figure 6.9: MPEG: Memory Architecture Exploration - Using LME2PME Approach
number of memory banks of different sizes. This helps in reducing the power consumption by keeping the data sections with higher access frequency in smaller memory banks. However, the region R2 does not have dual-port memory modules and hence results in low performance, while the presence of a higher number of memory banks increases the area. The region R3 does not have dual-port memory modules and also has a smaller number of on-chip memory banks. Since the memory banks are large, the power per access is high, resulting in higher power consumption. Note that for a given area there can be more than one memory architecture. Also, it can be observed that for a fixed memory area, the design points are Pareto optimal with respect to power and performance. Observe the wide range of trade-offs available between power and performance for a given area. We observe that by trading off performance, the power consumed can be reduced by as much as 70-80%. Table 6.1 gives details on the run-time, the total number of memory architectures explored and the number of non-dominated (near-optimal) points for each application. Note that the number of non-dominated design solutions is also large. Hence,
Figure 6.10: DSL: Memory Architecture Exploration - Using LME2PME Approach
Table 6.1: Memory Architectures Explored - Using DirPME Approach

Application   Time Taken   No. of Arch. Explored   No. of Non-dominated Points
Mpeg Enc      2.5 hours    9780                    670
Vocoder       3.5 hours    13724                   981
DSL           2 hours      7240                    438
to select an optimal memory architecture for a targeted application, the system designer needs to follow a clear top-down approach of narrowing down the region (area, power, performance) of interest and then focusing on specific memory architectures. The table also reports the execution time taken on a standard desktop (Pentium 4 at 1.7 GHz). As can be seen, the execution time for each of these applications is fairly low.
Figure 6.11: Voice Encoder (3D view): Memory Architecture Exploration - Using DirPME Approach
6.4.4 Comparison of LME2PME and DirPME
In this section we compare the non-dominated points from the LME2PME and DirPME approaches. Table 6.2 presents data on the total number of non-dominated points obtained from LME2PME and DirPME. The number of unique non-dominated points listed in column 4 represents the solutions that are globally non-dominated but present in only one of the LME2PME and DirPME approaches. The presence of unique non-dominated points in one approach means that these points are missing in the other approach. The ratio of column 4 to column 3 in a way represents the efficiency of an approach. We observe that this ratio is lower for the DirPME approach than for LME2PME. The number of unique non-dominated points in DirPME increases if the time allotted to DirPME is increased.
Figure 6.12: Voice Encoder: Memory Architecture Exploration - Using DirPME Approach
Further, column 5 of Table 6.2 reports the number of non-dominated points identified by one method that are dominated by points from the other method. For example, for the MPEG encoder benchmark, 709 of the non-dominated design points reported by DirPME are in fact dominated by design points seen by the LME2PME approach. As a consequence, the unique non-dominated points reduce to 26 for this benchmark. In contrast, LME2PME fares better with 175 non-dominated points, of which none are dominated by the DirPME approach. This trend is observed for almost all benchmarks. Thus the experimental data indicate that LME2PME does a better job than DirPME. One concern that still remains is the set of unique non-dominated points identified by DirPME but not by LME2PME. If these design points are interesting from a platform-based design perspective, then to be competitive the LME2PME approach should at least find a close enough design point. In order to quantitatively assess this, we find the minimum Euclidean distance between each unique non-dominated point reported by DirPME
Figure 6.13: MPEG Encoder: Memory Architecture Exploration - Using DirPME Approach
and all the non-dominated points reported by LME2PME. The minimum distance is normalized with respect to the distance between the unique non-dominated point and the origin. This metric in some sense represents how close a non-dominated point in the DirPME approach is to a point in LME2PME. If we can find an alternate non-dominated point in LME2PME at a very close distance to the unique non-dominated point reported by DirPME, then the LME2PME solution space can be considered an acceptable superset. In column 6, we report the average (arithmetic mean) minimum distance of all unique non-dominated points in DirPME to the non-dominated points in LME2PME; a similar metric is reported for the unique non-dominated points identified by LME2PME. We also report the maximum of the minimum distance over all unique non-dominated points in column 7 of Table 6.2. The worst case average distance from the unique non-dominated points is 0.46% for LME2PME and 0.49% for DirPME. Thus for every unique non-dominated point reported by DirPME, the LME2PME method can find a corresponding non-dominated
Figure 6.14: DSL: Memory Architecture Exploration - Using DirPME Approach
point within a distance of 0.46%. In column 7, we report the maximum of the minimum distances of all non-dominated points in DirPME to the non-dominated points in LME2PME; the same metric is presented the other way, i.e. the maximum of the minimum distances of all non-dominated points in LME2PME to the non-dominated points in DirPME. Observe from column 7 that for every non-dominated point that is missing in LME2PME but reported by DirPME, we can find a close enough non-dominated point in LME2PME within at most 4.1% distance of the missing point for the MPEG benchmark. Similarly, for every new non-dominated point reported by LME2PME, we can find a close enough non-dominated point in DirPME within at most 6.2% distance of the missing point. Finally, in column 8, the run-times for all the benchmarks for both approaches are reported. Note that the DirPME approach takes significantly more time than the LME2PME approach. In summary, we observe that LME2PME finds more non-dominated points in general and offers better solution quality for a given time.
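The closeness metric can be sketched as follows, assuming 3-tuples of normalized objective values; this is an illustrative transcription of the computation described above, not the thesis code.

import math

def normalized_min_distance(point, other_front):
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    d_min = min(dist(point, q) for q in other_front)
    return d_min / dist(point, (0.0,) * len(point))   # normalize by distance to origin

def closeness_stats(unique_points, other_front):
    ds = [normalized_min_distance(p, other_front) for p in unique_points]
    return sum(ds) / len(ds), max(ds)   # (average, maximum), as reported in Table 6.2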
Table 6.2: Non-dominant Points Comparison LME2PME-DirPME

Application   Method    Num of non-dom   Unique   No. of      avg min dist   max of min dist   run-time
                        (ND) points      ND pts   dominated   from unique    from unique
                                                  pts         NDs            NDs
Mpeg Enc      LME2PME   175              175      0           0.22%          4.1%              0.45.47
              DirPME    735              26       709         0.37%          6.2%              7.17.52
Vocoder       LME2PME   214              192      22          0.04%          0.23%             0.40.52
              DirPME    558              13       545         0.49%          6.8%              3.08.54
DSL           LME2PME   134              114      20          0.46%          3.75%             0.26.23
              DirPME    1093             12       1081        0.38%          7.08%             4.26.34
However, since for every unique non-dominated point in LME2PME we can find a very close non-dominated point in DirPME, and vice versa, we can conclude that the two approaches perform very closely. Further, the DirPME approach operates on a much bigger search space. Hence we expect the DirPME approach to catch up and fare equally well or even better than the LME2PME approach when sufficient time is given.
6.5 Related Work
Memory architecture exploration is performed in [18] using a low-energy memory design method, referred to as VAbM, that optimizes the memory area by allocating multiple memory banks with variable bit-widths to optimally fit the application data. Their work addresses custom memory design for application-specific hardware accelerators, whereas our work focuses on defining the memory architecture for the programmable processors of an embedded SoC. In [15], Benini et al. present a method that combines memory allocation and data layout to optimize the power consumption and area of the memory architecture. They start from a given data layout and design smaller (larger) bank sizes for the most (least) frequently accessed memory addresses. In our method the data layout is not fixed, and hence we explore the complete design space with respect to area, performance and power. Performance-energy design space exploration is presented in [72]. They present
a branch and bound algorithm which produces Pareto trade-off points representing different performance-energy execution options. In [62], an integrated memory exploration approach is presented which combines scheduling and memory allocation. They consider different speeds of memory access during memory exploration, but they consider only performance and area as objectives and output only one design point. In our work we consider area, power and performance as objectives, and we explore the complete design space to output several hundred Pareto optimal design points. There are other methods for memory architecture exploration for target architectures involving on-chip caches [9, 51, 52, 64]. We compare them with our memory architecture exploration approach for hybrid architectures described in the next chapter. The memory allocation exploration step of the LME2PME approach is an extension of the memory packing or memory allocation process [63]. The memory allocation step typically constructs a logical memory architecture from a set of physical memories with minimizing memory area as the objective. In our approach, however, the memory allocation exploration has to consider two aspects: (a) area optimization, by picking the right set of memory modules, and (b) power optimization, by considering the memory access frequency of the data-sections placed in a logical bank. Note that these are conflicting objectives, and our approach outputs Pareto-optimal design points which present interesting trade-offs between these objectives.
6.6 Conclusions
In this chapter we presented two different approaches for Physical Memory Architecture Exploration. The first method, LME2PME, is a two-step process and an extension of the LME method described in Chapter 4. The LME2PME method offers flexibility by exploring the logical and physical memory architecture design spaces independently. This enables system designers to start the memory architecture definition process without locking in the technology node and semiconductor vendor memory library. The second method is a direct physical memory architecture exploration (DirPME) framework that integrates memory exploration, logical to physical memory
mapping and data layout. Both the LME2PME and DirPME approaches address three of the key system design objectives: (i) memory area, (ii) performance and (iii) memory power. Our approach explores the design space and gives a few hundred Pareto-optimal memory architectures at various system design points in a few hours of run time. We have presented a fully automated approach that meets time-to-market requirements. In the next chapter, we extend the framework to address cache based memory architectures.
Chapter 7
Cache Based Architectures
7.1 Introduction
In the previous chapters, memory architecture exploration frameworks and data layout heuristics were presented for target architectures that are primarily Scratch-Pad RAM (SPRAM) based. Many SoC designs, on the other hand, also include a cache in their memory architecture [77], as caches provide performance benefits comparable to SPRAM but with lower software overhead [83], both at program development time (requiring very little data layout and management effort from the application developer) and at runtime (the movement of data from off-chip memory to cache is transparent and managed by hardware). Hence in this chapter we consider memory architectures with both SPRAM and cache. The work in this chapter also applies to memory architectures with on-chip memories that can be configured both as cache and as Scratch-Pad RAM. We discussed in Chapter 6 how the presence of caches alters the objective functions in memory architecture exploration and in the data layout heuristics. In a cache architecture, if two data sections that are accessed alternately are mapped to the same cache sets, a large number of conflict misses results [33], potentially eliminating any benefit from the cache. Hence it is important to address memory exploration and data layout approaches for cache based architectures. Further, the memory exploration problem becomes more challenging if the target architecture consists of both SPRAM
and cache. In this chapter, we address the memory architecture exploration problem for hybrid memory architectures that have a combination of SPRAM and cache. As discussed in Chapter 4, the evaluation of a memory architecture cannot be separated from the problem of data layout, which physically places the application data in the memory. A non-optimal data layout will yield inferior performance even on a very good memory architecture platform, thereby leading the memory exploration search path in a wrong direction. Hence, before addressing the memory architecture exploration problem for a cache based memory architecture, it is important to have an efficient data layout heuristic. For SPRAM-Cache based architectures, a critical step is to partition the data placement between on-chip SPRAM and external RAM. Data partitioning aims at improving the overall memory sub-system performance by placing in SPRAM those data sections that have the following characteristics: (a) higher access frequency, (b) lifetimes overlapping with many other data sections, and (c) poor spatial access characteristics. Placing all data that exhibit the above characteristics in SPRAM reduces the number of potentially conflicting data sections in the cache; the resulting reduction in cache misses improves overall memory sub-system performance. Typically the SPRAM size is small, and hence it is not possible to accommodate all the data identified for SPRAM placement. Hence, even after data partitioning, there will be a significant number of potentially conflicting data sections placed in external RAM. If these data are not carefully placed in the off-chip RAM, there will be a significant number of cache misses, resulting in lower system performance. Cache conscious data layout addresses this problem and aims at placing data in external RAM (off-chip RAM) with the objective of reducing cache misses. The mapping of data from off-chip RAM to the L1 cache is dictated by the cache size and associativity. Hence data-sections which map to the same cache set can incur a large number of conflict misses when accessed alternately. Careful analysis of data access characteristics and an understanding of the temporal access pattern of the data structures are required in order to come up with a cache conscious data layout that minimizes conflict misses. A number of earlier approaches address the problem of data
layout mapping for cache architectures in embedded systems [17, 21, 22, 41, 50, 53]. In this chapter our aim is to perform memory architecture exploration and data layout in an integrated manner, assuming a hybrid architecture which includes both on-chip SPRAM and data cache (see Figure 7.1). As a first step, we address the data layout problem for Cache-SPRAM based architectures. We address this problem using a two-step approach for each memory architecture: (a) data partitioning, to divide the data between SPRAM and cache with the objective of improving the overall memory sub-system performance and power, and (b) cache conscious data layout, to minimize the number of cache misses within a given external memory address space.
Figure 7.1: Target Memory Architecture
The major contributions of this chapter are:

• an efficient heuristic to partition data between SPRAM and caches based on access frequency, temporal and spatial locality in access patterns;

• a data layout heuristic for data caches that improves run-time and reduces the off-chip memory address space usage; and

• hybrid memory architecture exploration with the objective of improving run-time performance, power consumption and area.

The rest of this chapter is organized as follows. In the following section we give an overview of the proposed method. In Section 7.3 we explain our data partitioning heuristic. Section 7.4 describes the cache conscious data layout heuristic. In Section 7.5, we present the experimental results. We discuss related work in Section 7.6. Conclusions are presented in Section 7.7.
7.2 Solution Overview
Figure 7.2 presents our memory architecture exploration framework. The proposed memory exploration framework consists of two levels. The outer level explores various memory architectures, while the inner level explores the placement of data sections (the data layout problem) to minimize memory stalls. More specifically, the outer level, the memory architecture exploration phase, targets the optimization of the cache and SPRAM sizes and the organization of the cache architecture, including cache-line size and associativity. We use an exhaustive search¹ for memory architecture exploration by imposing certain practical constraints (such as the memory bank size always being a power of 2) on the architectural parameters. Although these constraints limit the search space, they still allow all "practical" architectures to be considered and at the same time help to reduce the run-time of the memory exploration phase drastically. The exploration module takes the application's total data size as input and provides an instance of a memory architecture by defining (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size². Based on the SPRAM size and the application access characteristics, the data partitioning heuristic identifies the data sections to be placed in SPRAM. The remaining data sections are placed in off-chip RAM. The details of the data partitioning heuristic are presented in Section 7.3. The cache conscious data layout heuristic assigns addresses to the data sections placed in off-chip RAM such that these data do not conflict in the cache. The data layout heuristic uses the temporal access information as input to find the optimal data placement; the objective is to minimize the number of cache misses. In Section 7.4 we discuss the proposed cache conscious data layout. The data partitioning heuristic and the data layout heuristic together place the application data in SPRAM and off-chip RAM respectively. From the temporal access information
¹Alternative approaches such as genetic algorithms or simulated annealing could also be used here. However, we found that the exhaustive approach explores all practical memory architectures in a reasonable amount of computation time.
²The proposed framework can easily be extended to consider SPRAM organization parameters such as the number of banks, number of ports, etc. We do not consider these here, as they were extensively dealt with in the earlier chapters.
of the data sections and the access frequency information, the run-time performance in terms of memory stall cycles is computed. The memory stalls include stall cycles due to concurrent accesses to the same single-ported SPRAM bank, stall cycles due to cache misses, and the miss-penalty (the off-chip memory access to fetch the cache block). The software eCacti [45] is used to obtain the power per cache read-hit, read-miss, write-hit and write-miss. The SPRAM power per read access and power per write access are obtained from the semiconductor vendor's ASIC memory library. The area for a given cache architecture is computed using eCacti [45], and the area for the SPRAM is obtained from the memory library.
Figure 7.2: Memory Exploration Framework
The exploration process is repeated for all valid memory architectures, and the area, power and performance are computed for each of them. The last step is to identify the list of "optimal" architectures. Since this is a multi-objective problem, all the solution points are evaluated according to the Pareto optimality condition given by Equation 6.1 in Section 6.2.3.3. According to this equation, if $(M^a_{pow}, M^a_{cyc}, M^a_{area})$ and $(M^b_{pow}, M^b_{cyc}, M^b_{area})$ are the memory power, memory cycles and memory area for memory architectures A and B respectively, then A dominates B if the following expression is true:

$$\big((M^a_{pow} < M^b_{pow}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{cyc} < M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow}) \land (M^a_{area} \le M^b_{area})\big)$$
$$\lor\; \big((M^a_{area} < M^b_{area}) \land (M^a_{cyc} \le M^b_{cyc}) \land (M^a_{pow} \le M^b_{pow})\big)$$
From the set of solutions generated by the memory architecture exploration module, all the dominated solutions are identified and removed. The non-dominated solutions form the Pareto optimal set, which represents the set of good architectural solutions that provide interesting design trade-off points from a power, performance and cost viewpoint.
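A sketch of the exhaustive enumeration with power-of-two constraints is given below; the parameter bounds and the validity check are illustrative assumptions, not the ranges used in the experiments.

from itertools import product

def powers_of_two(lo, hi):
    v = lo
    while v <= hi:
        yield v
        v *= 2

def enumerate_architectures():
    cache_sizes = list(powers_of_two(1024, 65536))   # assumed range: 1 KB .. 64 KB
    block_sizes = list(powers_of_two(8, 128))        # bytes per cache block
    assocs = [1, 2, 4]                               # cache associativity
    spram_sizes = list(powers_of_two(1024, 65536))
    for c, b, a, s in product(cache_sizes, block_sizes, assocs, spram_sizes):
        if b * a <= c:                               # at least one set per way
            yield {"cache": c, "block": b, "assoc": a, "spram": s}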
7.3 Data Partitioning Heuristic
As the cache structure has associated tag overheads, SPRAM consumes much less area than a cache on a per-bit basis [12]. Further, an SPRAM access consumes less power than a memory access that is a cache hit [12]. While the data sections mapped to off-chip memory share the cache space dynamically and in a transparent manner, SPRAM space is assigned to data sections exclusively if dynamic data layout is not used. As a result, the usage of SPRAM is costly from a system perspective, as it gets locked to specific data after the data-section layout, unlike a cache, where the space is effectively reused through dynamic mapping of data by hardware. Hence, SPRAM has to be carefully utilized, and the objective in a memory architecture exploration should be to minimize the SPRAM size. The objective of data partitioning is to identify the data sections that must be placed in SPRAM for the best performance. We refer to a set of data (one or more scalar variables or array variables) that are grouped together as one data-section. A data-section forms an atomic unit that will be assigned a memory address; all data that are part of a data section are placed contiguously in memory. An example of a data section is an array data structure. In order to identify data sections that should be mapped to SPRAM, our heuristic
uses different characteristics of the data section. These include the access frequency, the temporal access pattern and the spatial locality pattern, which are explained below. To model the temporal access pattern of different data sections, a temporal relationship graph (TRG) representation has been proposed in [17]. A TRG is an undirected graph, where nodes represent data sections and an edge between a pair of nodes indicates that two successive references to either of the data sections are interleaved by a reference to the other. The weight associated with an edge (a, b) represents the number of times such interleaved accesses of a and b have occurred in the access pattern. We illustrate these ideas with the help of an example. Let there be 4 data-sections a, b, c and d, and let the access pattern of these data sections in the application be: aaabcbcbcbcdddddaaaaaaacacaacac
Figure 7.3: Example: Temporal Relationship Graph
For this access pattern the TRG is shown in Figure 7.3. Given a trace of data memory references, the weight associated with (a, b), denoted by TRG(a, b), is the number of times that two successive occurrences of a are intervened by at least one reference to b, or vice versa. As an example, for the pattern bcbcbcb, TRG(b, c) = 5: references to c intervene successive references to b on three occasions, and references to b intervene
successive references of c twice, making TRG(b, c) = 5. For the given pattern, TRG(b, d) = 0, as there are no interleaved accesses; hence no edge exists between b and d. The TRG is computed for all the data sections from the address trace collected from an instruction set simulator. We define STRG(i) as the sum of all TRG weights on the edges connected to node i. As an example, from Figure 7.3, STRG(a) = 10. Next, we define a term, the spatial locality factor, which gives a measure of spatial locality in the access trace for each data section. The spatial locality is influenced by the stride in accessing different elements of the data section. The spatial locality factor is computed by determining the number of misses incurred by a data section on a cache with a single block, driven by the filtered access trace that contains only accesses pertaining to that data section; it is the ratio of the number of such misses to the size of the data section. For example, if the accesses to data section b in the filtered trace bbbb correspond to cache blocks b1 b2 b1 b1, where b1 and b2 correspond to different blocks (determined by the cache block size), and the size of data section b is Sb cache blocks, then the spatial locality factor is 3/Sb.
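The two measures can be computed from an access trace along the following lines; this is a hypothetical sketch in which a trace is a sequence of (section, block) pairs, not the thesis implementation.

from collections import defaultdict

def build_trg(trace):
    """TRG edge weights: one count per gap of a section that another intervenes."""
    trg = defaultdict(int)
    seen_since = {}                          # section -> others seen since its last access
    for sec, _ in trace:
        if sec in seen_since:                # repeat access: close the gap
            for other in seen_since[sec]:
                trg[frozenset((sec, other))] += 1
        seen_since[sec] = set()              # open a new gap for sec
        for other in seen_since:
            if other != sec:
                seen_since[other].add(sec)   # sec intervenes the others' gaps
    return trg

def spatial_locality_factor(trace, section, size_in_blocks):
    """Misses of `section` on a one-block cache, divided by its size in blocks."""
    misses, resident = 0, None
    for sec, block in trace:
        if sec == section and block != resident:
            misses += 1
            resident = block
    return misses / size_in_blocks

# TRG(b, c) = 5 for the pattern bcbcbcb, and SLF = 3/Sb for the blocks example.
assert build_trg([(s, None) for s in "bcbcbcb"])[frozenset("bc")] == 5
assert spatial_locality_factor([("b", 1), ("b", 2), ("b", 1), ("b", 1)], "b", 4) == 0.75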
Table 7.1: Input Parameters for Data Partitioning Algorithm

Notation     Description
N            Number of data sections
TRG(a, b)    Temporal access pattern between nodes a and b
STRG(a)      Sum of TRG weights on all edges connected to node a
AF(a)        Access frequency of data section a
SLF(a)       Spatial locality factor of data section a
nSTRG(a)     Normalized STRG(a)
nAF(a)       Normalized AF(a)
nSLF(a)      Normalized SLF(a)
CI(a)        Conflict index of data section a
There are three parameters that control the decision to keep a data section in on-chip SPRAM.

1. Access Frequency (AF): placing the most frequently accessed data sections in SPRAM gives better power consumption and better run-time performance.

2. Temporal Access Characteristics: a data section is said to be conflicting if it is accessed along with many other data sections. Placing the most conflicting data sections in SPRAM reduces the number of cache conflict misses and hence improves the overall memory subsystem performance. This parameter is computed from the TRG; the STRG factor is a direct indication of the extent to which a data section's lifetime overlaps with those of other data sections.

3. Spatial Locality Factor (SLF): data sections with a small spatial locality factor use more cache lines simultaneously and thereby reduce the available cache space for other data. Such data also exhibit less spatial reuse, causing more cache misses, which in turn increases the power consumption due to off-chip memory accesses. Hence, it is both power and performance efficient to place a data section with a small spatial locality factor in SPRAM.

Thus, a frequently accessed data section that conflicts most with the rest of the data and also exhibits little spatial locality is an ideal candidate for SPRAM placement, as this gives the best performance from an overall memory subsystem perspective. For each data section, a conflict index is computed from the three parameters above. The conflict index of the node corresponding to data section s is computed as follows:

$$nSTRG(s) = \frac{STRG(s)}{\sum_{i=1}^{N} STRG(i)} \qquad (7.1)$$

$$nAF(s) = \frac{AF(s)}{\sum_{i=1}^{N} AF(i)} \qquad (7.2)$$

$$nSLF(s) = \frac{SLF(s)}{\sum_{i=1}^{N} SLF(i)} \qquad (7.3)$$

$$CI(s) = nSTRG(s) + nAF(s) + nSLF(s) \qquad (7.4)$$
In the above equations, SLF(s) and AF(s) correspond to the spatial locality factor and the access frequency of s respectively. The terms on the LHS of Equations 7.1, 7.2 and
7.3 are normalized factors. The higher the conflict index, the more suitable the data section is for SPRAM placement. Our data partitioning heuristic is given in Figure 7.4. The greedy heuristic sorts the data sections by conflict index and assigns the data section with the highest conflict index to SPRAM. The corresponding node is removed from the TRG, and the conflict index for the remaining data sections is recomputed. Note that the above step is performed for every data section identified for SPRAM placement. This process is repeated either until the SPRAM space is full or until there are no more data sections to be placed.
7.4 Cache Conscious Data Layout
7.4.1 Overview
The data partitioning step places the most conflicting data in SPRAM and thereby reduces the possible conflict misses in the cache. However, the SPRAM is typically very small, and only a few data-sections will have been placed in it³. The remaining data sections still need to be placed carefully to reduce cache misses. In this section we discuss the cache conscious data layout. The problem of cache-conscious data layout is to find an optimal data placement in off-chip RAM with the following objectives: (a) to reduce the number of cache misses and (b) to reduce the address space used in off-chip RAM. In other words, the objective is to reduce the "holes" in off-chip RAM after placement. By this, we mean that the data sections are placed in the off-chip RAM in such a manner that the gaps left between data sections to reduce conflict misses are minimized; these gaps lead to wasted memory space and hence increase hardware cost. To the best of our knowledge, reducing cache misses (the first objective) has been the sole objective targeted by all earlier published data layout approaches [17, 22, 41, 50, 53]. But it is very important to consider objective (a) in the context of objective (b) for the following reasons.

³As mentioned earlier, data placement within the SPRAM can be done in a subsequent phase using any of the data layout methods discussed in Chapter 3. We do not experiment with this, as it has been extensively dealt with in the previous chapters.
Algorithm: SPRAM-Cache Data Partitioning
Inputs:
  N = number of data sections
  Access frequency of all data sections
  Temporal Relationship Graph (TRG)
  Spatial Locality Factor (SLF)
  Data section sizes
Output: List of data sections to be placed in SPRAM
begin
  1. Compute the access frequency per byte for all data sections
  2. Normalize the access frequency per byte for all data sections
  3. for i = 0 to N-1
     3.1 compute STRG(i); sumSTRG += STRG(i);
  4. for i = 0 to N-1
     4.1 nSTRG(i) = STRG(i)/sumSTRG;
  5. for i = 0 to N-1
     5.1 compute SLF(i); sumSLF += SLF(i);
  6. for i = 0 to N-1
     6.1 nSLF(i) = SLF(i)/sumSLF;
  7. for i = 0 to N-1
     7.1 conflict-index(i) = nSTRG(i) + nAF(i) + nSLF(i);
  8. sort the data sections in descending order of conflict-index
  9. while (available space in SPRAM)
     9.1 identify the data section s with the highest conflict index
     9.2 place s in SPRAM if it fits within the available space
     9.3 update the SPRAM available space to account for the above placement
     9.4 remove s from the TRG
     9.5 recompute STRG for the remaining nodes in the TRG
     9.6 recompute the conflict index with the newly updated STRG
  10. exit
end
Figure 7.4: Heuristic Algorithm for Data Partitioning
• For SoC architectures with an instruction cache and a data cache that share the same off-chip RAM, a data layout approach that optimizes only the data cache misses, without considering optimization of the off-chip RAM address space, will use up too much address space by spreading the data placement, leaving many holes. This places severe constraints on code placement, requiring the code to be placed across the holes and in the remaining off-chip RAM, which may result in additional instruction cache misses. Hence, there is a chance that all the gains achieved by optimizing the data cache misses are lost.

• A data layout approach which optimizes the data placement in off-chip RAM without any holes will be independent of the instruction cache placement. Hence, the architecture exploration of the data cache can be done independently of the instruction cache. For example, an application with 96K of data will have around 2700 hybrid architectures that are worth exploring. If the code placement is not independent of the data layout and the code segments are placed in the holes created, then the memory exploration process needs to consider both instruction and data cache configurations together. This will increase the number of architectures considered; in such a scenario, the number of architectures explored could increase to 50000+. Hence, it is important to design a data layout algorithm that is independent of the instruction cache.

We formulate the cache conscious data layout problem as a graph partitioning problem [38]. Inputs to the data layout algorithm are (i) the application's data section sizes and (ii) the Temporal Relationship Graph. The data layout algorithm is explained in a block diagram in Figure 7.5. The first step in the data layout problem is modelled as a graph partitioning problem, where data sections are grouped into disjoint subsets such that the memory requirement of the data sections in each subset is less than the cache size. More specifically, the first step is a k-way graph partitioning, where k = ⌈application data size / cache size⌉. The data sections in each partition are selected such that they have intervening accesses and hence can cause potential conflict misses. Thus the output of the graph partitioning step is k partitions, with each partition
Figure 7.5: Cache Conscious Data Layout
having a set of data sections that conflict among themselves the most, with each partition size less than the cache size. Since each of the k partitions is smaller than the cache size, each of these partitions can be mapped to an off-chip RAM address space that corresponds to one cache page. This step eliminates all the conflicts between data sections that are in the same partition. The graph partitioning method is discussed in detail in Section 7.4.2. The next step in the data layout is to minimize the possible conflicts between data sections that are in two different partitions. This is handled by the offset-computation step; the details of the offset computation are presented in Section 7.4.3. Once the offset-computation step assigns a cache-block offset to each data section, the address-assignment step allocates unique off-chip addresses to all the data sections. Finally, using the address assignment, the number of cache misses and the power consumed for cache
and off-chip memory accesses are computed; these are used for identifying the Pareto-optimal solutions. The following subsections detail the graph partitioning heuristic and the offset computation heuristic.
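The partition count k of the first step follows directly from the application data size and the cache size. A one-line sketch, using the 96K data footprint mentioned above together with a hypothetical 8KB cache:

```python
import math

def num_partitions(data_size_bytes: int, cache_size_bytes: int) -> int:
    """k for the k-way partitioning: one partition per cache page."""
    return math.ceil(data_size_bytes / cache_size_bytes)

# e.g., 96KB of application data with an 8KB cache gives 12 partitions
assert num_partitions(96 * 1024, 8 * 1024) == 12
```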
7.4.2 Graph Partitioning Formulation
In this section we explain the graph partitioning heuristic, which is a generalisation of Kernighan-Lin [38] and operates on the temporal relationship graph for the data sections that need to be placed in off-chip RAM. Note that this excludes all data sections that have been mapped to SPRAM. The temporal relationship graph is G = {V, E, s, w}, where V is the set of vertices representing data sections and E is the set of edges between pairs of data sections, each representing a temporal access conflict. Further, the functions s and w are associated respectively with the nodes and edges of the TRG: s(u) represents the size of the data section associated with a node u, and w(u, v) represents the number of temporal access conflicts between a pair of nodes u and v. The weight function w(u, v) is the same as TRG(u, v), but restricted to the data sections that need to be assigned to the off-chip RAM. The graph partitioning problem aims at dividing G into m disjoint partitions. An m-way partition of G is a collection of subsets G_i = {V_i, E_i} such that

• the subsets are disjoint: V_i ∩ V_j = ∅ for i ≠ j;

• ⋃_{i=1}^{m} V_i = V; and

• every edge e = (u, v) ∈ E is in G_i iff u ∈ G_i and v ∈ G_i.

The objective of the graph partitioning step is to group the nodes such that the sum of the weights on the internal edges is maximized. The objective function to be maximized is given in Equation (7.5), with the constraint given in Equation (7.6):

$$\max \sum_{i} \sum_{e_j \in E_i} w(e_j) \qquad (7.5)$$

$$\sum_{u_j \in V_i} s(u_j) \le \text{cache-size}, \quad \forall \, G_i \qquad (7.6)$$
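As a small sketch of how a candidate partitioning can be scored against Equations (7.5) and (7.6), assuming the TRG is represented as a dict mapping an unordered pair of section names to w(u, v):

```python
from itertools import combinations

def internal_cost(partition, trg):
    """Sum of w(e) over edges with both endpoints in the partition."""
    return sum(trg.get(frozenset(pair), 0)
               for pair in combinations(partition, 2))

def is_feasible(partition, sizes, cache_size):
    """Constraint (7.6): the partition must fit in one cache page."""
    return sum(sizes[u] for u in partition) <= cache_size

def objective(partitions, trg):
    """Equation (7.5): total internal edge weight, to be maximized."""
    return sum(internal_cost(p, trg) for p in partitions)
```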
An edge e_ext = (u, v) is said to be an external edge for a partition G_i if u ∈ G_i and v ∉ G_i; i.e., if one of the nodes connected by the edge is in partition G_i and the other is not. Similarly, an edge e_int is said to be an internal edge if both the nodes it connects are in partition G_i. The sum of the weights on the external edges of partition G_i is referred to as its external cost, $E_i = \sum_{e_{ext} \in G_i} w(e_{ext})$. The sum of the weights on the internal edges of partition G_i is referred to as its internal cost, $I_i = \sum_{e_{int} \in G_i} w(e_{int})$. The total external cost is $E = \sum_i \sum_{e_{ext} \in G_i} w(e_{ext})$. Thus the objective of the partitioning problem is to find a partition with minimum external cost. Alternatively, the graph partitioning problem can be formulated as maximizing the total internal cost, i.e., $\sum_i \sum_{e_{int} \in G_i} w(e_{int})$, subject to the constraint $\sum_{u_j \in G_i} s(u_j) \le \text{cache-size}$ for all G_i.

The optimal partitioning problem is NP-complete [38, 66]. There are a number of heuristic approaches [26, 47] to this problem, including the well-known Kernighan-Lin heuristic [38] for two partitions. We extend the heuristic proposed in [38, 66] to solve our problem. The Kernighan-Lin heuristic aims at finding a minimal external cost partition of a graph into two equally sized sub-graphs. It achieves this by starting with a random partition and repeatedly swapping the pair of nodes that gives the maximum gain, where the gain is computed from the difference between external and internal costs. Let us consider two nodes a and b present in two different sub-graphs A and B respectively. We define the external cost (ECost) of a as $E_a = \sum_{x \in B} w(a, x)$ and the internal cost (ICost) of a as $I_a = \sum_{y \in A} w(a, y)$, for each a ∈ A. Similarly, the ECost and ICost of b are defined as E_b and I_b respectively. Let D_a = E_a − I_a be the difference between ECost and ICost for each a ∈ A. A result proved by Kernighan and Lin [38] shows that, for any a ∈ A and b ∈ B, if they are interchanged, the reduction in partitioning cost is given by R_ab = D_a + D_b − 2 × w(a, b). The nodes a and b are interchanged to partitions B and A respectively if R_ab > 0. In [66], the graph partitioning heuristic is generalized to an m-way partition. It starts with a random set of m partitions, picks any two of the partitions, and applies the Kernighan-Lin heuristic repeatedly on this pair until no more profitable exchanges are possible. Then these two partitions are marked as pair-wise optimal, and the algorithm picks two other partitions to apply the heuristic. This process is repeated until all the
partitions are pair-wise optimal. We have adapted the algorithm of [66] and added additional constraints to make it work for our problem; a sketch of the constrained swap test follows this list. The main constraints are as below:

1. $\sum_{a \in G_i} s(a) \le \text{cache-size}$ for all partitions;

2. if a data-section size s(a) > cache-size, then this data section is placed in a partition by itself and marked optimal; and

3. nodes a and b are interchanged to partitions B and A respectively only if R_ab > 0, and only if $\sum_{a \in A} s(a) < \text{cache-size}$ and $\sum_{b \in B} s(b) \le \text{cache-size}$.
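A minimal sketch of the constrained swap test, with the size checks read as applying to the two partitions after the exchange (one plausible reading of constraint 3):

```python
def w(trg, u, v):
    """TRG edge weight between sections u and v (0 if no edge)."""
    return trg.get(frozenset((u, v)), 0)

def gain(a, b, A, B, trg):
    """R_ab = D_a + D_b - 2*w(a, b), with D = ECost - ICost."""
    Da = (sum(w(trg, a, x) for x in B)
          - sum(w(trg, a, y) for y in A if y != a))
    Db = (sum(w(trg, b, y) for y in A)
          - sum(w(trg, b, x) for x in B if x != b))
    return Da + Db - 2 * w(trg, a, b)

def can_swap(a, b, A, B, sizes, trg, cache_size):
    """Allow the exchange only if it helps and both partitions still fit."""
    if gain(a, b, A, B, trg) <= 0:
        return False
    size_A = sum(sizes[u] for u in A) - sizes[a] + sizes[b]
    size_B = sum(sizes[u] for u in B) - sizes[b] + sizes[a]
    return size_A <= cache_size and size_B <= cache_size
```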
The output of the graph partitioning step is a collection of sub-graphs that maximizes the internal cost, minimizes the external cost, and ensures that no partition is larger than the cache size4. Thus, each partition can be placed in an off-chip RAM address region that maps to one cache page, such that none of the data sections in the same partition conflict in the cache. We are now left with optimizing the cache conflicts that might arise between data sections belonging to two different partitions. Since the external cost has already been minimized, the number of such conflicts will already be small. The offset computation step, described in the following subsection, aims at reducing conflicts caused by data sections belonging to different partitions.
7.4.3 Cache Offset Computation
The cache offset computation step aims at reducing cache conflict misses between data sections that are part of two different partitions. Each partition is placed in the off-chip RAM address space that corresponds to one cache page. It may be noted that the ordering of the partitions does not have any impact on the cache misses. For each data section in a partition, a cache-block offset needs to be assigned, which in turn determines a unique off-chip memory address for the data section.

4 Obviously, a partition containing a data section whose size is larger than the cache size will not obey this property. But such a data section can be considered to form l = ⌈data section size / cache-size⌉ consecutive partitions, each less than or equal to the cache size.
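As a small illustration of the last point, a unique off-chip address can be derived from a section's partition index and its cache-line offset, assuming the partitions are laid out in consecutive cache-page-sized regions (their ordering, as noted above, does not affect misses):

```python
def offchip_address(base: int, partition_idx: int, line_offset: int,
                    cache_size: int, block_size: int) -> int:
    """Start address of a section assigned a given cache-line offset."""
    return base + partition_idx * cache_size + line_offset * block_size

# e.g., a 4KB cache page with 32-byte blocks: a section assigned
# line 4 of partition 2 starts at base + 2*4096 + 4*32 = base + 8320
assert offchip_address(0, 2, 4, 4096, 32) == 8320
```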
Algorithm: Offset Computation Heuristic
Inputs:
  TRG_blk values for all the data blocks
  External costs for all the partitions (E_i)
  Internal costs for all the partitions (I_i)
  External costs E_{i,uj} for each node u_j in a partition G_i
  Cache configuration
  Data section sizes
Output:
  Offsets assigned to each of the data sections
begin
1. Sort the partitions in decreasing order of external cost
2. for i = 1 to k partitions
   2.1 pick the partition G_i with the highest external cost
   2.2 sort its data sections in descending order of external cost E_{i,uj}
   2.3 for all data sections in G_i
       2.3.1 pick the data section u_j with the highest E_{i,uj}
       2.3.2 evaluate the placement cost of u_j for each available cache line of the target cache configuration:
             2.3.2.1 place u_j in the available cache line, with the constraint that the data section must be placed contiguously
             2.3.2.2 compute the cost of placement using the TRG_blk information for all the data blocks already placed
             2.3.2.3 store the cost of placement C_l for cache line l
             2.3.2.4 repeat the last three steps for all possible cache lines
       2.3.3 find the cache line l that gives the minimal cost
       2.3.4 assign l as the starting offset for u_j
       2.3.5 mark the cache lines from l to l + size(u_j)/block-size as unavailable for other data sections in G_i
   2.4 end for
3. end for
4. placement complete
end
Figure 7.6: Heuristic Algorithm for Offset Computation
To decide the offset that gives the least number of conflicts, we compute the placement cost for all possible placements of the data section inside a cache page. To compute the placement cost, we use a fine-grained version of the TRG. Note that the TRG computed in Section 7.3 is at the granularity of a data section, but to determine at which offset to place a data section, the temporal access pattern needs to be computed at a finer granularity. We illustrate these ideas with the help of an example. Let there be two data sections a and b of size 128 bytes and 64 bytes respectively, and consider the following access pattern:

a[0] b[0] a[60] b[1] a[61] b[2] a[62] b[3]

For this access pattern, TRG(a, b) is 6, as explained in Section 7.3; basically, data sections a and b are accessed 6 times in an interleaved way. However, for a direct-mapped cache of size 4KB with a 32-byte block size, placing a at address k and b at off-chip address k + 4KB will not result in any conflict misses, even though TRG(a, b) = 6. This is because a[60], a[61] and a[62] map to one cache line (C + 1), while a[0], b[0], b[1], b[2] and b[3] map to cache line C. In contrast, if a is placed at address k and b at address k + 4KB + 32B, the placement results in 5 conflict misses. Hence, to determine the cost of placing a data section, in terms of conflict misses, the TRG values are needed at a finer granularity. For the above example, if we keep the granularity at one cache block, then data section a is divided into 4 data blocks and data section b into 2 data blocks. We define a new term TRG_blk that represents the temporal access pattern among data blocks; this is similar to the approach described in [17]. The above access sequence then becomes a0, b0, a1, b0, a1, b0, a1, b0, where a0 and a1 represent the first two (cache-block-sized) blocks of data section a and b0 represents the first block of data section b. For this example, TRG_blk consists of the nodes a0, a1, a2, a3, b0 and b1; TRG_blk(a1, b0) = 5 and all other TRG_blk values are 0. We use the TRG_blk values to compute the cost of placement, C(s, l), for a data section s at a cache offset l.

The offset computation algorithm is presented in Figure 7.6. To begin with, the partitions are ordered based on their total external cost (E_i). The partition G_i with the highest external cost is selected first for offset computation. Data sections that are part
of partition G_i are ordered based on the external cost of the corresponding nodes in G_i. The data section u_j with the highest external cost (E_{i,uj}) is taken up for offset computation first. Data section u_j is placed in each of the allowable cache lines and the placement cost is computed with the help of TRG_blk. Here, by allowable, we mean that there are enough contiguous free cache lines in the cache page to accommodate data section u_j. For example, if the data section size is 128 bytes and the cache block size is 32 bytes, then a feasible cache line means that 4 contiguous lines are free. Note that at this point no offset is assigned to the data section u_j. The cost of placement C(u_j, l) for data section u_j is computed for every allowable cache line l from 1 to N_l, where N_l is the total number of cache lines. The cache line l that has the minimum cost is assigned to data section u_j, and the cache lines from l to l + size(u_j)/line-size are marked as full so that they are not available to any other data section in G_i. This restriction ensures that the cache offsets for all data sections in a partition G_i are assigned within one cache page, which in turn ensures that the amount of external address space used is close to the application data size. The above process is repeated for all data sections in partition G_i. After this, the partition G_{i+1} with the next highest external cost is selected for offset computation. This process continues until all partitions are handled.
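The numbers in the worked example above pin down one counting rule for the TRG that is consistent with both TRG(a, b) = 6 and TRG_blk(a1, b0) = 5: every re-access of a node contributes one conflict with each distinct node touched since its previous access. The sketch below implements that rule; the rule itself is our reading of Section 7.3, not a verbatim reproduction of it.

```python
from collections import defaultdict

def build_trg(trace):
    """Count, for every re-access of a node u, one conflict with each
    distinct node v accessed since u's previous access."""
    trg = defaultdict(int)
    seen = set()
    since_last = defaultdict(set)  # u -> nodes touched since u's last access
    for u in trace:
        if u in seen:
            for v in since_last[u]:
                trg[frozenset((u, v))] += 1
        seen.add(u)
        since_last[u] = set()
        for k in since_last:
            if k != u:
                since_last[k].add(u)
    return trg

# Block-level trace of the example: a0 b0 a1 b0 a1 b0 a1 b0
trg_blk = build_trg(["a0", "b0", "a1", "b0", "a1", "b0", "a1", "b0"])
assert trg_blk[frozenset(("a1", "b0"))] == 5
assert trg_blk.get(frozenset(("a0", "b0")), 0) == 0
```

The same routine applied to the section-level trace a b a b a b a b reproduces TRG(a, b) = 6.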
7.5 Experimental Methodology and Results

7.5.1 Experimental Methodology
We have used Texas Instruments' TMS320C64X processor for our experiments. This processor has a 16KB data cache, and we have used Texas Instruments' Code Composer Studio (CCS) environment for obtaining profile data and data memory address traces, and also for validating data-layout placements. We have used 3 different applications - AAC (Advanced Audio Codec), MPEG video encoder and JPEG image compression - from MediaBench [43] for performing the experiments. We compute the TRG, sumTRG, and the spatial locality factor from the data memory address traces obtained from CCS. We used eCacti [45] to obtain the area and power numbers for the different cache configurations. First, we
report experimental results demonstrating the benefits of our cache-conscious data layout method. Subsequently, in Section 7.5.4, we report the results pertaining to cache-SPRAM memory architecture exploration.
7.5.2 Cache-Conscious Data Layout
In this section we present results of our cache-conscious data layout and compare them with the approach proposed by Calder et al. [17]. We have used the above 3 MediaBench applications and 4 different cache sizes. In this experiment, for all cache sizes, we have used a 32-byte cache block and a direct-mapped cache configuration. Table 7.2 presents the results of the data layout. Column 4 in Table 7.2 presents the number of cache misses incurred when the data layout approach of [17] is used, and Column 5 gives the number of cache misses incurred when our data layout approach is applied. Our approach performs consistently better and reduces the number of cache misses, especially for AAC and MPEG; it achieves up to a 34% reduction in cache misses (for AAC with a 16KB cache). Also, our approach consumes an off-chip memory address space that is very close to the application data size. This is by construction of the graph-partitioning approach, which avoids gaps during data layout as explained in Section 7.4. In contrast, Calder's approach [17] consumes 1.5 to 2.6 times the application data size in off-chip address space to achieve the performance given in Table 7.2. This is a significant advantage of our approach, as increased off-chip address space implies increased memory cost for the SoC. In Table 7.3, we present the results of our approach for different cache configurations (direct-mapped, 2-way and 4-way set-associative caches). Note that these experiments are performed with a cache-only architecture and no SPRAM. Observe that for all the applications, the reduction in misses is significant for 2-way and 4-way set-associative caches. However, for the 4KB cache configuration for MPEG, the reduction in cache misses is small; this is due to the large data set (footprint) requirement of MPEG. Also, observe that the data set of JPEG is much smaller, and hence a direct-mapped 16KB cache or a 4-way set-associative 8KB cache resolves most of the conflict misses.
Table 7.2: Data Layout Comparison

Application  Cache Size  Number of        Cache misses,  Cache misses,    Improvement (%)
                         memory accesses  Calder [17]    Graph-Partition
                                                         (our approach)
AAC          32KB        43 Million       0              0                0
             16KB                         14746          9711             34
             8KB                          155749         128322           17
             4KB                          446912         385795           14
MPEG         32KB        92 Million       17204          14574            15
             16KB                         275881         224278           19
             8KB                          2332008        2314398          1
             4KB                          11919814       11919814         0
JPEG         32KB        38 Million       0              0                0
             16KB                         0              0                0
             8KB                          2350           2112             10
             4KB                          10220          10294            -1
Table 7.3: Data Layout for Different Cache Configurations

                         Number of cache misses
Application  Cache Size  Direct Mapped  2-Way Set-Associative  4-Way Set-Associative
AAC          32KB        0              0                      0
             16KB        9711           5252                   4111
             8KB         128322        66741                  53732
             4KB         385795        314122                 260110
MPEG         32KB        14574         2122                   712
             16KB        224278        123632                 78412
             8KB         2314398       1863214                1301257
             4KB         11919814      10121122               9884788
JPEG         32KB        0             0                      0
             16KB        0             0                      0
             8KB         2112          112                    10
             4KB         10294         4300                   3200
7.5.3 Cache-SPRAM Data Partitioning
In this section we present the results of our cache-SPRAM data partitioning method. Figures 7.7, 7.8 and 7.9 present the results of the data partitioning heuristic. In these figures, the x-axis represents the SPRAM size and the y-axis represents the performance in terms of memory stalls. Experiments were performed for three different cache sizes (4KB, 8KB and 16KB). For each of the cache sizes, the SPRAM size is increased from 0 to the application data size. For each memory configuration, data partitioning and cache-conscious data layout are performed to obtain the memory stalls. The memory stalls refer to the number of stalls due to external memory accesses caused by cache misses.
Figure 7.7: AAC: Performance for different Hybrid Memory Architecture
Observe that for all the applications, when the SPRAM size is increased, a significant performance improvement is achieved for all the cache sizes. However, the performance improvement is more pronounced with 4KB and 8KB caches than with 16KB caches. Observe that for AAC, an 8KB cache with 24KB of SPRAM gives the same performance as a 16KB cache with 4KB of SPRAM, while the 16KB cache with 4KB of SPRAM consumes more area than the 8KB cache with 24KB of SPRAM. Similarly, for JPEG, a 4KB cache with 20KB of SPRAM gives the same performance as a 16KB cache with no SPRAM. This gives the designers an architectural choice to select a configuration that suits
Figure 7.8: MPEG: Performance for different Hybrid Memory Architecture
Figure 7.9: JPEG: Performance for different Hybrid Memory Architecture
the target application. As we discussed earlier, both caches and SPRAM have their own advantages. For instance, caches offer hardware-managed, reusable on-chip memory space that provides feature extendability to the system, whereas SPRAM provides predictable
performance and lower power consumption. Hence, the selection of an architecture needs careful analysis from different viewpoints. We now present the power consumption numbers for all the applications in Figures 7.10, 7.11 and 7.12. In these figures, the x-axis represents the SPRAM size and the y-axis represents the total power consumed by the memory subsystem. There are three plots, one for each cache size (4KB, 8KB, and 16KB). As expected, the power numbers for the 16KB cache configurations are higher than for the other two. However, in all the figures, observe that the power numbers converge for higher SPRAM sizes. This is because, for higher SPRAM sizes, most of the application's critical data sections are mapped to SPRAM and hence not much activity happens in the cache; the power numbers are then mostly influenced by the SPRAM accesses. Observe that for the 16KB cache, the power numbers are higher for smaller SPRAM sizes and gradually decrease as the SPRAM size increases.
Figure 7.10: AAC: Power consumed for different hybrid memory architecture
In summary, the system designer needs to look at the performance graphs for his application, similar to those presented in Figures 7.7, 7.8 and 7.9, and also study the power graphs, similar to those presented in Figures 7.10, 7.11 and 7.12, to arrive at a suitable
Figure 7.11: MPEG: Power consumed for different hybrid memory architecture
Figure 7.12: JPEG: Power consumed for different hybrid memory architecture
architecture. One more dimension that is not covered here is the memory area. The next section presents memory architecture exploration, where the system designer can look
at the memory design space from a power, performance and area viewpoint.
7.5.4 Memory Architecture Exploration
In this section we present the results of our memory architecture exploration. As mentioned in Section 7.2, we explore the cache-SPRAM solution space with the following parameters: (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size. Again, we have used the same 3 benchmark applications and, as mentioned earlier, an exhaustive search method for memory exploration, varying the above parameters. We start with no SPRAM and a 4KB cache and keep increasing the cache size up to the application data size (88KB, 108KB and 40KB for AAC, MPEG and JPEG respectively). For each cache size explored, we then increase the SPRAM size from 0 to the application data size in 4KB steps. Also, for each cache configuration, we vary the block size from 8 bytes to 64 bytes in 8-byte steps and the associativity from 1 to 4. Based on the application data size, the number of memory configurations evaluated varies from 1200 to 2800. From the total set of memory configurations evaluated, we compute the non-dominated solutions based on the Pareto-optimality criteria explained in Section 7.2.

Figures 7.13, 7.14, and 7.15 present the non-dominated solutions for AAC, MPEG and JPEG respectively. In these figures, the x-axis represents the number of memory stall cycles and the y-axis represents the power consumption. We have presented the power vs. performance graphs for different area bands5. We observe from Figure 7.13 that as the area band increases, we get better power and performance; note that the solution points converge from the top-right portion of the graph (a high-power, low-performance region) to the lower-left portion (a low-power, high-performance region) as the area is increased. In Figure 7.14, the solution in the top right corner has a memory configuration of a 4KB direct-mapped cache with 32-byte cache blocks and no SPRAM. As we can observe, this is a very conservative architecture, giving low performance and high power consumption. On the other hand, the solution in the lower left corner has a memory configuration of an 8KB, 2-way set-associative cache with 16-byte cache blocks and 128KB of SPRAM. This is a very high-end architecture that consumes a lot of area but gives the best performance and power consumption. Thus, the set of Pareto-optimal design points gives designers a critical view from which to pick memory configurations that suit the application and system requirements.

5 Again, due to proprietary reasons, we present normalized areas for the different configurations instead of absolute values.
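The non-dominated filtering itself is straightforward; a minimal sketch over the three measured objectives (all to be minimized), with the Solution fields as illustrative names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Solution:
    stalls: int    # memory stall cycles
    power: float   # memory subsystem power
    area: float    # normalized memory area

def dominates(a: Solution, b: Solution) -> bool:
    """a dominates b: no worse in every objective, better in at least one."""
    no_worse = (a.stalls <= b.stalls and a.power <= b.power
                and a.area <= b.area)
    strictly_better = (a.stalls < b.stalls or a.power < b.power
                       or a.area < b.area)
    return no_worse and strictly_better

def pareto_front(solutions):
    """Keep only the configurations not dominated by any other."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```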
Figure 7.13: AAC: Non-dominated Solutions
7.6 Related Work

7.6.1 Cache Conscious Data Layout
Many earlier works propose source-code-level transformations with the objective of improving data locality. Loop-transformation-based data locality optimizing algorithms are proposed by Wolf et al. [82]; they describe a locality optimization algorithm that applies a combination of loop interchange, skewing, reversal and tiling to improve the data locality of loop nests. Earlier works [21, 37, 53] propose source-level
Figure 7.14: MPEG: Non-dominated Solutions
Figure 7.15: JPEG: Non-dominated Solutions
transformations such as array tiling, re-ordering of data structures and loop unrolling to improve cache performance. In contrast, we focus on optimizing data placement at the object-module level, without any code modifications. We emphasize that this is important, as the application development flow in embedded systems typically involves the integration of many IPs whose source code may not be available. The data layout optimization proposed by Panda et al. addresses the scenario of data arrays placed at off-chip RAM addresses that are multiples of the cache size, which results in thrashing due to cache conflict misses in a direct-mapped cache. They propose introducing dummy words (padding) between the data arrays to avoid cache conflicts. Data layout heuristics that aim at minimizing cache conflict misses have been proposed in [17, 41]. The problem has been formulated as an Integer Linear Program (ILP) in [41]; the authors also propose a heuristic method to avoid the long run-times of ILP solvers. Calder et al. [17] use a Temporal Relationship Graph (TRG) that captures the temporal access characteristics of data and propose a greedy algorithm for cache-conscious data layout. While the approaches in [17, 41] target only the minimization of conflict misses en masse, our approach aims at minimizing conflict misses within a certain off-chip memory address space. The constraint of working within a certain external memory address space is very important for memory architecture exploration, since it makes the instruction-cache performance independent of the data cache for architectures where the external memory address space is common to both data and instruction caches, thereby reducing the memory architecture search space. Chilimbi et al. [22] propose two cache-friendly placement techniques - coloring and clustering - that improve a data structure's spatial and temporal locality and thereby improve cache performance. Their approach works mainly for trees and tree-like data structures. They also propose a cache-conscious heap allocation method which allocates memory close to contemporaneously accessed data objects based on the programmer's input; this reduces the number of conflict misses. However, this approach is expensive in terms of performance, as run-time decisions need to be taken. Embedded systems are
performance sensitive, and hence the use of dynamic heap objects is usually discouraged; any additional run-time overhead in memory allocation will take away the benefit that comes from reduced conflict misses. Further, critical sections of embedded applications are typically developed in hand-written assembly language, so modifications to the layout of structures cannot be handled completely by compilers. A greedy data layout heuristic is proposed in [65] that optimizes energy consumption in horizontally partitioned cache architectures. Their approach uses the observation that the energy consumed per access in a small cache is less than that in a larger cache. Hence, for cache architectures that have a main cache and a smaller mini-cache, the authors show that a simple greedy data partitioning heuristic, which partitions data between the main cache and the mini-cache, performs well in reducing the overall energy consumption of the memory subsystem. Our work addresses a different target memory architecture, with SPRAM and cache. Palem et al. [50] propose a compile-time data remapping method to reorganize record data types with the objective of improving the temporal access characteristics of data objects. Their method analyzes program traces to mark the data objects of records whose access characteristics and field layout do not exhibit temporal locality, and they propose a heuristic algorithm to remap the fields of the data objects marked during the analysis phase. The heuristic remaps the fields of data objects to improve temporal locality and thus avoids additional cache misses. Their approach is very efficient for record-type data structures like linked lists and trees. However, it requires compiler support to reorganize the fields of data structures, along with the corresponding code changes to access the remapped fields, whereas our work focuses on the layout of data structures that do not require code changes - an important constraint in the IP-based embedded application development flow.
7.6.2 SPRAM-Cache Data Partitioning
An Integer Linear Programming (ILP) based approach to partitioning instruction traces between SPRAM and instruction cache, with the objective of reducing energy consumption,
has been proposed in [80]. Our work focuses on data partitioning between SPRAM and data cache. Further, we consider DSP applications, which typically have multiple simultaneous memory accesses leading to parallel and self conflicts. To the best of our knowledge, only [53] addresses data partitioning for SPRAM-cache based hybrid architectures. They present a data partitioning technique that places data into on-chip SRAM and data cache with the objective of maximizing performance. Based on the lifetimes and access frequencies of array variables, the most conflicting arrays are identified and placed in scratch-pad RAM to reduce the conflict misses in the data cache. This work addresses the problem of limiting the number of memory stalls by reducing conflict misses in the data cache through efficient data partitioning. They also demonstrate memory exploration of hybrid architectures with their proposed data partitioning heuristic, and they propose a model to estimate the number of cycles spent in cache accesses. However, their memory exploration framework does not have an integrated cache-conscious data layout. Our approach performs data partitioning based on three factors: (i) access frequency, (ii) temporal access characteristics and (iii) spatial access characteristics. Our proposed method is a comprehensive data layout approach for SPRAM-cache based architectures, as we perform data partitioning followed by cache-conscious data layout. Also, our approach addresses all the key system design objectives, namely area, power and performance.
7.6.3 Memory Architecture Exploration
Panda et al. propose a local memory exploration method for SPRAM-cache based memory organizations. They propose a simple and efficient heuristic to walk through the SPRAM-cache space; for each memory architecture configuration considered, the performance of the configuration is estimated by an analytical model. In [64], an exhaustive-search based exploration approach is proposed for a cache-based memory architecture, which explores the memory design space over parameters like on-chip memory size, cache size, line size and associativity. The authors extend the work of Panda et al. to consider energy consumption as the performance metric for the memory
architecture exploration. Memory exploration for cache-based memory architectures is also considered in [51]. The main difference between the above works and our method is that our memory architecture exploration framework integrates an efficient data layout heuristic as part of the framework to evaluate each memory architecture. Without an efficient data layout, a random mapping of the application may result in poor performance even for a good memory architecture. Further, our memory architecture exploration framework considers multiple objectives, namely performance, area and power. The memory hierarchy exploration problem is formulated in a Genetic Algorithm framework in [52] and [9]; their target architecture consists of separate L1 caches for instruction and data, and a unified L2 cache. Their objective function is a single formula which combines area, average access time and power. In [52], additional parameters such as bus width and bus encoding are considered, and the problem is modeled in a multi-objective GA framework. The main difference between their work and ours is the integration of data layout as part of the memory architecture exploration framework. The absence of a cache-conscious data layout means that the application data may not be efficiently placed in off-chip RAM, leading to poor performance; a point to note here is that such poor performance is a result of inefficient data placement and not of the cache configuration. The other main difference is that [52, 9] use simulation-based fitness function evaluation, which limits the number of evaluations due to the large run-time; in comparison, our approach uses an analytical model to compute the fitness functions.
7.7 Conclusions
In this chapter we have presented a memory architecture exploration framework for SPRAM-cache based memory architectures. Our framework integrates memory exploration, data partitioning between SPRAM and cache, and cache-conscious data layout to explore the memory design space, and presents a list of Pareto-optimal solutions. We have addressed three of the key system design objectives, viz., (i) memory area, (ii) performance and (iii) memory power. Our approach explores the memory design space and presents
several Pareto-optimal solutions within a few hours on a standard desktop. Our solution is fully automated and meets time-to-market requirements.
Chapter 8

Conclusions

In this chapter, we present a summary of the thesis and outline possible extensions to this work.
8.1 Thesis Summary
In this work, we presented methods and a framework to address the memory subsystem optimization problem for embedded SoCs. In Chapter 3, we presented three different methods to address the data layout problem for Digital Signal Processors. Multiple methods are required to address the different stages of the embedded design flow. For instance, data layout during memory architecture exploration needs to be very fast, as data layout is invoked several thousand times to evaluate different memory architectures; on the other hand, the data layout method used for final system production needs to generate a highly optimized solution irrespective of run-time. Hence, we proposed three different approaches for data layout in Chapter 3: (i) an Integer Linear Programming (ILP) based approach, (ii) a Genetic Algorithm (GA) formulation of data layout and (iii) a fast and efficient heuristic method. We compared the results of all three approaches. The heuristic method performs very efficiently both in terms of the quality of the data layout and in terms of run-time: the quality of the data layout (the number of memory stalls reduced) generated by the heuristic algorithm is within 5% of
that of the GA's output. The ILP approach gives the best quality solution, but its run-time is very high.

In Chapter 4, we addressed the logical memory architecture exploration problem for embedded DSP processors. As discussed in Chapter 1, the logical view is closer to the behavior and helps in reducing the search space by abstracting the problem. We formulated the logical memory architecture exploration (LME) problem in multi-objective GA and multi-objective SA frameworks. The multiple objectives include performance (in terms of memory stalls) and cost (in terms of "logical" memory area). Both GA and SA produce 100-250 Pareto-optimal design points for all application benchmarks. Our experiments showed that the multi-objective GA performed better than the SA approach in terms of (i) the quality of solutions, measured as the number of non-dominated solutions generated in a given time, and (ii) uniformity of search across the design space (diversity of solutions). Both GA and SA based approaches take approximately 30 minutes of run-time to generate the Pareto-optimal solutions for one benchmark.

Chapter 5 addressed the data layout exploration problem from a physical memory architecture perspective; again, the target memory architecture is that of embedded DSP processors. We proposed a Multi-Objective Data Layout EXploration (MODLEX) framework that searches the data layout design space from a performance and power-consumption viewpoint for a given physical memory architecture. We showed that our method effectively uses the multiple memory banks, single/dual-ported memories, and non-uniform banks to produce around 100-200 data layout solutions that are Pareto-optimal with respect to performance and power consumed. We also showed that a large trade-off, of up to 70%, between power and performance is possible by using different data layout solutions, specifically for DSP-based memory architectures.

In Chapter 6, we addressed the memory architecture exploration of embedded DSP processors from a physical memory perspective. We proposed two different approaches to physical memory exploration. The first approach extends the logical memory architecture exploration described in Chapter 4 to address the physical memory architecture exploration problem; this approach is referred to as LME2PME. As part of the steps to extend
the LME to address PME, we proposed a memory allocation exploration framework that takes a Pareto-optimal logical memory architecture and its corresponding data layout as input, and explores the physical memory space by constructing the given logical memory architecture from physical memories in the different possible ways, with the objective of optimizing area and power consumption. The memory allocation exploration is formulated as a multi-objective GA. The second approach proposed in Chapter 6 is an integrated approach that directly addresses the physical memory architecture exploration problem; this approach is known as DirPME. It formulates the physical memory exploration problem directly as a multi-objective GA, working on data layout, memory exploration and memory allocation simultaneously, and hence its search space is very large compared to that of LME2PME. We showed that both approaches, LME2PME and DirPME, provide several hundreds of Pareto-optimal points that are interesting from an area, power and performance viewpoint. Further, we showed that, for a given run-time, LME2PME provides better solutions than DirPME. However, the solutions of DirPME and LME2PME are very close, and hence both approaches are useful depending on the needs of system designers.

Finally, in Chapter 7, we extended our memory architecture exploration framework to address SPRAM-cache based on-chip memory architectures. We proposed an efficient data partitioning heuristic to partition data sections between on-chip SPRAM and cache, and a graph-partitioning based cache-conscious data layout heuristic with the objective of reducing cache conflict misses. An exhaustive search method is applied to explore the SPRAM-cache design space. Each memory architecture is evaluated by mapping the target application onto the memory architecture under consideration, using the data partitioning heuristic and the cache-conscious data layout heuristic, to obtain the performance in terms of the number of memory stalls. We used eCacti [45] to obtain the area and power-per-access numbers for the cache, and a semiconductor memory library to obtain the area and power numbers for SPRAM. Based on area, power and performance, and by applying the Pareto-optimality conditions, the list of Pareto-optimal memory architectures
is identified.
8.2 Future Work
In this section, we outline some of the possible extensions to our work.
8.2.1 Standardization of Input and Output Parameters
The memory architecture exploration problem, as discussed in Chapter 1 and illustrated in Figure 1.4, has to be addressed at several levels, namely the behavioral level, the logical architecture level, the physical architecture level, and the data layout. To be able to communicate the interfaces and input/output parameters across these different levels of abstraction, it is critical to standardize the communication. This involves standardizing the input and output file formats and parameters from an IP, platform, and semiconductor-library viewpoint. Currently there is no standardization of the format, syntax and semantics of these parameters that is aligned across these levels. It is critical to address this problem so that multiple methods and optimizations can be integrated seamlessly to address specific applications, architectures and system aspects.
8.2.2 Impact of platform change on system performance
The impact of a platform change on system parameters like area, power and performance can be studied for a given application, semiconductor memory library and process node. This impact analysis is critical to identify where to spend effort in improving the platform so that the overall system performance improvement is high.
8.2.3 Impact of Application IP library rework on system performance
We have addressed the memory architecture exploration problem for a given set of applications. At this stage, the make, buy or reuse decisions are made and the list of IP
modules to be used as part of the system is known, as shown in Figure 1.4. We could extend our memory architecture exploration framework to analyze the impact of rework or design improvement of one or more software IPs on memory power, performance and area. This analysis could properly direct the IP optimization efforts, with the objective of improving system area, power and performance.
8.2.4 Impact of semiconductor library rework on the system performance
Our memory architecture exploration framework can be extended to study the suitability of a specific semiconductor memory library for a specific embedded system. Further, the impact of reworking a semiconductor memory library on memory system area and power can be studied, to decide on and prioritize the areas of rework.
8.2.5 Multiprocessor Architectures
Our work on data layout and memory architecture exploration focuses mainly on optimizing the on-chip memory organization of a processor (DSP or microcontroller) in a SoC. Our work can be extended to optimize shared memory subsystems in a multiprocessor-based SoC.
Bibliography

[1] ARM920T and ARM922T: ARM9 Family of Embedded Processors. http://www.arm.com/products/CPUs/families/ARM9Family.html.
[2] ARM926EJ-S and ARM926E-S: ARM9E Family of Embedded Processors. http://www.arm.com/products/CPUs/families/ARM9Family.html.
[3] lp_solve. http://lpsolve.sourceforge.net/5.5/.
[4] SystemC - Language for System-Level Modeling, Design and Verification. http://www.systemc.org/home.
[5] Verilog Hardware Description Language. http://www.verilog.com/index.html.
[6] International Technology Roadmap for Semiconductors. SEMATECH, 3101 Industrial Terrace, Suite 106, Austin, TX 78758, 2001.
[7] 2007 global mobile communications - statistics, trends and forecasts. Technical report, http://www.reportbuyer.com/telecoms/mobile/2007 global mobile trends.html, 2007.
[8] 1st IEEE International Symposium on Industrial Embedded Systems. Panel discussion: Open Issues in SoC Design. http://www.iestcfa.org/panel discussions.htm, 2006.
[9] G. Ascia, V. Catania, and M. Palesi. Parameterised system design based on genetic algorithms. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[10] O. Avissar, R. Barua, and D. Stewart. Heterogeneous memory management for embedded systems. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[11] F. Balasa, F. Catthoor, and H. De Man. Background memory area estimation for multidimensional signal processing systems. IEEE Trans. VLSI Systems, 3:157-172, June 1995.
[12] R. Banakar, S. Steinke, B-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: A design alternative for Cache On-chip memory in Embedded Systems. In Tenth International Symposium on Hardware/Software Codesign (CODES), Estes Park, Colorado, May 2002. ACM.
[13] M. Barr. Embedded Systems Gallery. http://www.netrino.com/Publications/Glossary/index.php.
[14] L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino. From architecture to layout: Partitioned memory synthesis for embedded systems-on-chip. In Design Automation Conference, 2001.
[15] L. Benini, L. Macchiarulo, A. Macii, and M. Poncino. Layout driven memory synthesis for embedded systems-on-chip. In Proceedings of the ACM 3rd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2002.
[16] Broadcom. BCM2702: High Performance Mobile Multimedia Processor. http://www.broadcom.com/collateral/pb/2702-PB02-R.pdf, 2006.
[17] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, October 1998.
[18] Y. Cao, H. Tomiyama, T. Okuma, and H. Yasuura. Data memory design considering effective bitwidth for low energy embedded systems. In Proc. of the International Symposium on System Synthesis (ISSS), 2002.
[19] F. Catthoor, N. D. Dutt, and C. E. Kozyrakis. How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? In Design, Automation and Test in Europe Conference and Exhibition, pages 426-433, 2000.
[20] J.A. Chandy and P. Banerjee. Parallel simulated annealing strategies for VLSI cell placement. In Ninth International Conference on VLSI Design, 1996.
[21] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache conscious structure definition. In Proceedings of the 1999 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, May 1999.
[22] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache conscious structure layout. In International Conference on Programming Languages Design and Implementation (PLDI99), May 1999.
[23] K. Choi, R. Soma, and M. Pedram. Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times. In Design, Automation and Test in Europe Conference and Exhibition, volume I, 2004.
[24] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[25] K. Deb. Multi-objective evolutionary algorithms: Introducing bias among pareto-optimal solutions. Technical report, IIT Kanpur, 1996.
[26] W.W. Donaldson and R.R. Meyer. A dynamic-programming heuristic for regular grid-graph partitioning. Technical report, http://pages.cs.wisc.edu/~wwd/rev4.pdf, 2007.
[27] G. Dueck and T. Scheuer. Threshold accepting: A general purpose optimization algorithm appearing superior to simulated annealing. Journal of Computational Physics, 90:161-175, 1990.
[28] R. Fehr. Intellectual property: A solution for system design. In Technology Leadership Day, October 2000.
[29] D. Gajski. Design methodology for systems-on-chip. Technical report, Centre for Embedded Computer Systems, University of California, Irvine, California, http://www.cecs.uci.edu/eve presentations.htm, 2002.
[30] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[31] P. Grun, N. Dutt, and A. Nicolau. Memory Architecture Exploration for Programmable Embedded Systems. Kluwer Academic Publishers, 2003.
[32] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An Instruction Set Description Language for Retargetability. In Proceedings of the Design Automation Conference (DAC), June 1997.
[33] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, second edition, 1995.
[34] J. D. Hiser and J. W. Davidson. Embarc: an efficient memory bank assignment algorithm for retargetable compilers. In Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 182-191. ACM Press, 2004.
[35] P. K. Jha and N. D. Dutt. Library mapping for memories. In EuroDesign, 1997.
[36] B. Juurlink and P. Langen. Dynamic techniques to reduce memory traffic in embedded systems. In Conference On Computing Frontiers, pages 192-201, 2004.
[37] M. Kandemir, J. Ramanujam, and A. Choudhary. Improving cache locality by a combination of loop and data transformations. IEEE Transactions on Computers, 1999.
[38] B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, 1970.
[39] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220, 1983.
[40] M. Ko and S. S. Bhattacharyya. Data partitioning for DSP software synthesis. In Proceedings of the International Workshop on Software and Compilers for Embedded Processors, September 2003.
[41] C. Kulkarni, C. Ghez, M. Miranda, F. Catthoor, and H. De Man. Cache conscious data layout organization for embedded multimedia applications. In Design, Automation and Test in Europe, pages 686-691, 2001.
[42] Bernard Laurent and Thierry Karger. A system to validate and certify soft and hard IP. In Design, Automation and Test in Europe Conference and Exhibition, 2003.
[43] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In International Symposium on Microarchitecture, 1997.
[44] R. Leupers and D. Kotte. Variable partitioning for dual memory bank DSPs. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City (USA), May 2001.
[45] M. Mamidipaka and N. Dutt. eCACTI: An enhanced power estimation model for on-chip caches. Technical report, Centre for Embedded Computer Systems, University of California, Irvine, California, 2004.
[46] P. Mishra, P. Grun, N. Dutt, and A. Nicolau. Processor-memory co-exploration driven by a memory-aware architecture description language. In Proceedings of the International Conference on VLSI Design, 2001.
[47] B. Monien and R. Diekmann. A local graph partitioning heuristic meeting bisection bounds. In 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[48] H. Orsila, T. Kangas, E. Salminen, T. D. Hamalainen, and M. Hannikainen. Automated memory-aware application distribution for multi-processor system-on-chips. Journal of System Architecture: the EUROMICRO Journal, 53(11), November 2007.
[49] R. Oshana. DSP Software Development Techniques for Embedded and Real-Time Systems. Embedded computer systems, 2006.
[50] K. V. Palem, R. M. Rabbah, V. J. Mooney III, P. Korkmaz, and K. Puttaswamy. Design space optimization of embedded memory systems via data remapping. ACM Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), June 2002.
[51] G. Palermo, C. Silvano, and V. Zaccaria. Multi-objective design space exploration of embedded systems. Journal of Embedded Computing, 1(3), August 2005.
[52] M. Palesi and T. Givargis. Multi-objective design space exploration using genetic algorithms. In International Workshop on Hardware/Software Codesign (CODES), May 2002.
[53] P. R. Panda, N. D. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-chip: Optimizations and Exploration. Kluwer Academic Publishers, Norwell, Mass., 1998.
[54] P. R. Panda, N. D. Dutt, and A. Nicolau. Local memory exploration and optimization in embedded systems. IEEE Trans. Computer-Aided Design, 18(1):3-13, January 1999.
[55] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation of Electronic Systems, 5(3):682-704, July 2000.
[55] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation of Electronic Systems, 5(3):682–704, July 2000. [56] S. Peesl, A. Hoffmannl, V. Zivojnovic2, and H. Meyrl. LISA - Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures. In Design Automation Conference, 1999. [57] R. Rutenbar. Simulated annealing algorithms: an overview. IEEE Circuits and Devices Magazine, January 1989. [58] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks in digital signal processors. In Proceedings of the 7th Intl Conference Architectural Support for Programming Languages and Operating Systems, pages 234–243, October 1996. [59] A. Sangiovanni-Vincentelli, L. Carloni, F. De Bernardinis, and M. Sgroi. Benefits and challenges for platform based design. In Design Automation Conference, 2004. [60] E. Schmidt. Power Modelling of Embedded Memories. PhD thesis, 2003. [61] H. Schmit and D. Thomas. Array mapping in behavioral synthesis. In Proc. of the International Symposium on System Synthesis (ISSS), 1995. [62] J. Seo, T. Kim, and P. Panda. An integrated algorithm for memory allocation and assignment in high-level synthesis. In Proceedings of 39th Design Automation Conference, pages 608–611, 2002. [63] J. Seo, T. Kim, and P. Panda. Memory allocation and mapping in high-level synthesis: an integrated approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(5), October 2003. [64] W.T. Shiue and C. Chakrabarti. Memory exploration for low power, embedded systems. In Design Automation Conference, pages 140–145, New York, 1999. ACM Press.
[65] A. Shrivastava, I. Issenin, and N. D. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In Proceedings of the ACM 6th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), September 2005.
[66] K. Shyam and R. Govindarajan. Compiler directed power optimization for partitioned memory architectures. In Proc. of the Compiler Construction Conference (CC-07), 2007.
[67] J. Sjodin and C. Platen. Storage allocation for embedded processors. In Proceedings of the ACM 2nd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[68] A.J. Smith. Cache memories. ACM Computing Surveys, 1993.
[69] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
[70] S. Sriram and S. S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Embedded computer systems, 2000.
[71] A. Sundaram and S. Pande. An efficient data partitioning method for limited memory embedded systems. In ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (in conjunction with PLDI '98), pages 205-218, 1998.
[72] R. Szymanek, F. Catthoor, and K. Kuchcinski. Time-energy design space exploration for multi-layer memory architectures. In Design Automation and Test Europe, 2004.
[73] Texas Instruments. Code Composer Studio (CCS) IDE. http://focus.ti.com/dsp/docs/.
[74] Texas Instruments. TMS320 DSP Algorithm Standard. http://dspvillage.ti.com/, 2001.
[75] Texas Instruments. TMS320C54x DSP CPU and Peripherals Reference Set, volume 1, 2001. http://dspvillage.ti.com/docs/dspproducthome.html.
[76] Texas Instruments. TMS320C55x DSP CPU Reference Guide, 2001. http://dspvillage.ti.com/docs/dspproducthome.html.
[77] Texas Instruments. TMS320C64x DSP CPU Reference Guide, 2003. http://dspvillage.ti.com/docs/dspproducthome.html.
[78] H. Tomiyama and N. D. Dutt. Program path analysis to bound cache-related preemption delay in preemptive real-time systems. In Eighth International Symposium on Hardware/Software Codesign (CODES), 2000.
[79] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions in Embedded Computing Systems, 5:1-33, 2005.
[80] M. Verma, L. Wehmeyer, and P. Marwedel. Cache-aware scratchpad allocation algorithm. In Proceedings of the conference on Design, Automation and Test in Europe - Volume 2. IEEE Computer Society, 2004.
[81] L. Wehmeyer and P. Marwedel. Influence of memory hierarchies on predictability for time constrained embedded software. In Design, Automation and Test in Europe (DATE), 2005.
[82] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the 1991 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, June 1991.
[83] W. Wolf and M. Kandemir. Memory system optimization of embedded software. Proceedings of the IEEE, 91(1), January 2003.
[84] D.F. Wong, H.W. Leong, and C.L. Liu. Simulated Annealing for VLSI Design. Kluwer Academic Publishers, 1988.