Hardware/Software Codesign for Video Compression Using the EXECUBE Processing Array

Michael Sheliga, Edwin Hsing-Mean Sha, Peter M. Kogge

Dept. of Computer Science & Engineering, University of Notre Dame, Notre Dame, IN 46556

ABSTRACT

Real time image processing is an important area of study that requires the computational power of parallel computers. In this paper MPEG video compression is used as a benchmark application to study and improve the performance of the processor-in-memory (PIM) EXECUBE array, which has been proposed as a solution for the PetaFlop project. Since parallel computers naturally lend themselves to the display of such pictures, the EXECUBE array of parallel processors is a logical choice. It is shown that the array is not able to display video in real time, due both to computationally intensive algorithms within MPEG and to an internal communication bottleneck. Hardware/software codesign is therefore used to overcome these problems while minimizing design time. Through the use of specialized hardware, the timing constraints may be met. Results are shown which illustrate the efficiency of the system as well as the savings achieved.



This work was supported by NSF MIP 95-01006 and NSF/ACS96-12028.


1 Introduction

The PetaFlop project studies super parallel architectures in order to achieve 10^15 floating point operations per second. As one of the design point teams sponsored by NSF, NASA, and other agencies for the PetaFlop project, we propose the processor-in-memory architecture to achieve the desired petaflop performance. In this research, we use the example application of video compression, along with the hardware/software codesign methodology, to study and improve the performance characteristics of the processor-in-memory (PIM) EXECUBE array. The EXECUBE array is a scalable array of processors designed for massively parallel processing [1, 2] that includes a novel processor-in-memory design.

Image processing is an important problem that requires high computer performance and that lends itself well to parallel solutions. Researchers and designers in this area are looking for solutions to large, real time problems through the use of parallel computers, such as the EXECUBE array, and/or specialized hardware. In this paper we consider the problem of real time video display with the MPEG video compression algorithm using the EXECUBE array, and show how extra hardware added via hardware/software codesign can improve system performance. It is shown that the EXECUBE array, used without additional hardware, is not able to display video in real time. In particular, the computational complexity of portions of the MPEG algorithm, as well as internal communication constraints, limits the effective speed of the EXECUBE array. Therefore, additional, embedded hardware is inserted in order to improve system performance. The technique of hardware/software (hw/sw) codesign [8] is used when adding the hardware, in order to keep the design time to a minimum. By using this technique, the EXECUBE array is able to display video in real time.

In order to evaluate the EXECUBE array, we have chosen video compression, which naturally lends itself to parallel implementations. We decided to use the MPEG [3, 4] video compression standard, since it is one of the most widely used video compression standards. The MPEG standard is used to compress pictures before they are stored and/or transmitted, decompressed, and displayed. This process is shown in figure 1. Our results show that the EXECUBE array, under current constraints, cannot display video in real time, even with the maximum processing array under consideration (512 nodes).

[Figure 1: The steps performed during real time video display: video generation, MPEG compression (EXECUBE), storage and/or transmission, MPEG decompression (EXECUBE), and display.]

The display rate could be improved by designing a new parallel array of computers with specialized video processing functions. Unfortunately, such a solution would likely take a large amount of design time. Therefore, the system throughput was improved through the use of hardware/software (hw/sw) codesign [8].

HW/SW codesign is becoming an increasingly important design style, justified by the increasing complexity and functionality of the computer systems being built. It is very difficult for custom systems to be designed, built, and tested within an acceptable time period, even with the most advanced computer-aided design tools, unless software is used. However, many systems also have time critical parts which must be implemented in hardware. Hardware/software systems are able to take advantage of standardized processors which have been previously designed and tested, reducing design time and improving reliability. At the same time they use hardware to meet time and area constraints which could not be met by using only general purpose processors.

As an example, consider the optical wheel speed sensor system shown in figure 2 (A) [7], which is to be implemented in 100 clock cycles using no more than 40 square units of chip area.

[Figure 2: (A) Block diagram of an optical wheel speed sensor system (input decoding, tick-to-speed inversion, FIR filter, output encoding), with system constraints of 40 area units and 100 cycles. (B) The system implemented in software on five processors: 48 area units, 132 cycles, 2 months design time. (C) The system implemented in hardware: 24 area units, 52 cycles, 9 months design time. (D) The system implemented using hw/sw codesign, with three processors and an ASIC: 37 area units, 95 cycles, 3.5 months design time.]

As with most systems, this system could be implemented using standardized processors, specialized hardware, or various combinations of both. Figure 2 (B) shows a design of the wheel speed sensor system implemented solely with software. It took only two months to implement; however, it did not meet the chip area constraint or the timing constraint. Hence, while this design was the easiest and fastest to build and test, it was not acceptable. A second design, shown in figure 2 (C), was implemented solely in hardware. It surpassed both the area and timing constraints by at least 40%. In fact, it is the minimum, in terms of the AREA × TIME product, among the designs considered. However, the design cycle time increased to nine months. In many applications, especially those in which products are competitively sold, such delays are becoming increasingly unacceptable. This is especially true as hardware advances continue to come at faster rates and the effective lifetimes of computer systems are reduced. A third design of the system, shown in figure 2 (D), was implemented in both hardware and software. While this design is not as efficient as the all-hardware design, and was ready to be used slightly later than the all-software design, it established a balance between the two extremes. This implementation would have allowed the producer to market the product before the competition while also meeting the technical constraints.
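To make the trade-off concrete, the short Python sketch below (not part of the original design flow) simply restates the figure 2 comparison: each candidate implementation is checked against the area and timing constraints, and among the feasible candidates the one with the shortest design time is kept. The numeric values are taken from figure 2; the selection rule itself is our illustrative assumption.

```python
# Hypothetical restatement of the figure 2 comparison (values from the figure).
CONSTRAINTS = {"area": 40, "cycles": 100}   # system constraints from figure 2 (A)

candidates = {
    "all software":   {"area": 48, "cycles": 132, "design_months": 2.0},   # figure 2 (B)
    "all hardware":   {"area": 24, "cycles": 52,  "design_months": 9.0},   # figure 2 (C)
    "hw/sw codesign": {"area": 37, "cycles": 95,  "design_months": 3.5},   # figure 2 (D)
}

def feasible(design):
    """A design is acceptable only if it meets both the area and timing constraints."""
    return (design["area"] <= CONSTRAINTS["area"]
            and design["cycles"] <= CONSTRAINTS["cycles"])

best = min((name for name, d in candidates.items() if feasible(d)),
           key=lambda name: candidates[name]["design_months"])
print(best)   # -> "hw/sw codesign": it meets both constraints with far less design time
```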

In addition to the above trade-offs, there is also an implicit trade-off between the amount of hardware used (and hence the chip area used) and the response time of the final system. Since specialized hardware adds significantly to the design and test cycle time, our algorithms emphasize adding hardware that may be used several times. In this way only a small portion of the total hardware needs to be designed and tested for correctness.

In the hw/sw codesign area, relatively little research has been done on actually synthesizing an entire design. Most research has focused on particular aspects of the design process, such as creating appropriate abstractions and specifications of the problem [14, 16], formal description techniques [9], and formal verification [15]. Other research that has actually synthesized systems has done so on a case by case basis. For instance, [10] covers a study of a distributed elevator controller, while [17] examines the tradeoffs in partitioning a floating point square root computation. Some automated tools have been developed to perform hw/sw codesign. For example, COSYMA [13] uses a simulated annealing partitioning algorithm, and VULCAN II [11, 12] performs traditional hw/sw codesign for reactive systems which have inputs whose arrival times are unknown. Unfortunately, these systems do not lend themselves well to the problem at hand. Therefore, the hw/sw algorithm discussed in section 3 was used to accelerate the evaluation and utilization of system prototypes and to reduce the development cycle of our system.

Our results show that the MPEG algorithm can not be implemented in real time using the existing EXECUBE array, even with a "full" array of 512 nodes. By adding additional hardware to the EXECUBE array, we were able to take advantage of EXECUBE's ability to perform non time critical tasks in software, while also speeding up the system by having the hardware perform time critical tasks. Two important, time intensive steps in the MPEG algorithm are the motion estimation (ME) and discrete cosine transformation (DCT) calculations. Analysis of the execution time of the MPEG algorithm on the EXECUBE array shows that these steps consume a large amount of the processing time. By adding additional hardware to perform these steps, the system throughput was increased so that video could be compressed in real time. While adding ME and DCT hardware greatly improves the theoretical execution rate of the system, it also introduces an internal communication bottleneck into the system.

Therefore, we also considered using some nodes as queues for the hardware, inserting additional hardware queues, and changing the routing of communication between the EXECUBE array and the hardware.

The rest of this paper is organized as follows. Section 2 introduces the EXECUBE array, the details of the MPEG video compression algorithm, and the problem under consideration. Section 3 covers the hw/sw algorithm used to improve the system's throughput, while section 4 covers the configurations that were used in our simulations and demonstrates the effectiveness of the algorithm for varying hardware configurations. Finally, section 5 draws conclusions from the results obtained and discusses ideas for future research.

2 Preliminaries

2.1 The EXECUBE Array

The EXECUBE array is a scalable array of processors that is designed for massively parallel processing [1, 2]. It has been fully implemented and tested, and is designed to be used in applications that contain a high degree of parallelism. In this paper we use the problem of real time video display, considering varying sizes of images and processing arrays, to evaluate the EXECUBE array. The EXECUBE array has several important features that make it a good candidate for this problem.

First, the EXECUBE array is scalable. Scalability enables the system to be expanded simply by adding more processors, instead of having to redesign or reprogram a large portion of the system. Therefore the same algorithm may be used on eight processors or 512. Furthermore, if it is desired to increase the speed of the system, we may simply double the number of processors. This feature is necessary to implement very large applications without redesigning hardware. Scalability is common to both EXECUBE and a number of other parallel designs. However, in addition, the EXECUBE array has a number of innovative design features that make it superior to other parallel machines for many problems. Most traditional parallel machines have attempted to increase system performance by improving the processing speed of the core CPU. However, the EXECUBE array focuses on using a large number of less powerful,

more fully utilized nodes that occupy a smaller area per node. This strategy allows the overall throughput of the system to be increased in many cases. There are several consequences of this strategy.

First, since traditional parallel designs have concentrated on increasing the speed of each node, they normally consist of a number of chips that include a central CPU, main memory, network interface logic, as well as memory managers, math coprocessors, boot PROMs, etc. Each of these needs to be duplicated when an additional node is added. In contrast, the EXECUBE array requires only one chip, which includes a scaled-down version of the above functions, to be replicated. This greatly simplifies the design.

Second, the EXECUBE array places processors in memory (PIM). It may be noted that today's memory chips have sufficient bandwidth to keep up with an average high-end CPU. However, most of this bandwidth is discarded when we leave the chip. It is then recovered by using a large number of memory management systems and caches. Each of these chips adds to the design complexity; however, they do not increase the system storage, nor do they perform additional computations. Therefore, the EXECUBE array uses the strategy of including the memory on the same chip as the CPU. By not having to go off chip for memory, the amount of memory for each node is decreased. However, the speed of the memory is greatly increased, since a large number of bandwidth issues are resolved. In addition to eliminating a large number of caches and memory management systems, this decision also eliminates the large number of buffers that must be accessed to retrieve data from memory. As such, memory access times are also improved.

Third, by using a smaller, simpler CPU, EXECUBE not only increases the number of nodes that may be implemented in a given area, but also increases the utilization of the CPUs. The EXECUBE CPU was designed by starting with a single CPU and then adding a few additional features that would provide the largest increase in performance per transistor. As such, while the central CPU is relatively simple, there are not a large number of poorly utilized features in it.

Finally, the EXECUBE array supports both Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) operation. As such, it is easy to program the EXECUBE array to perform a single operation in parallel, by using SIMD mode and having each processor perform the same task. It is also possible to have one or more processors perform unique tasks,

by using MIMD mode. For example, one of every eight processors could be used to queue data while the other seven processors perform the MPEG algorithm. Most other massively parallel architectures do not support SIMD.

2.2 The MPEG Video Compression Algorithm

In order to evaluate the EXECUBE array, we have chosen video compression, which naturally lends itself to parallel implementations. Since the frame rate for standard video is normally 30 frames per second, and each frame takes on the order of one megabyte of computer memory, the processing and transmission requirements are obviously very great. In order to handle this processing rate of 30 megabytes per second, the MPEG video compression standard [3, 4] is used to decrease the bandwidth. We have chosen MPEG since it is one of the most widely used video compression standards. The MPEG standard is used to compress pictures before they are stored and/or transmitted, decompressed, and displayed. Our goal is to show how high quality video can be generated, compressed, transmitted, decompressed, and displayed in real time. This process is shown in figure 1. An optional step of storing the data after compression is also possible. Since compression is considerably more time intensive than decompression, we consider compression in this paper. Using an EXECUBE structure similar to the one developed here, video may also be decompressed in real time.

The MPEG video compression standard is a generic standard for the compression of moving pictures. It defines a method for compressing pictures so that they may be stored and transmitted more efficiently. For example, one application of MPEG would be to compress a series of pictures taken by a space telescope so that they could be transmitted using less energy. The MPEG standard is independent of implementation details and is applicable to a wide range of moving picture applications. The MPEG standard also supports a range of important properties such as random access, forward and reverse searches, reverse playback, audio-visual synchronization, editability, and the ability to trade off compression/decompression time for quality. It should also be noted that MPEG compression is a lossy compression. In other words, the picture after compression and decompression,

while very nearly similar to the original picture, will not be identical. While the compression ratio is dependent on a number of factors, most notably the video to be compressed, it is often on the order of 25. Finally, it should be noted that the MPEG standard applies equally well to both color and grey-scale images.

A common application of MPEG would be to compress a VHS video that has been digitized into a sequence of 512 by 512 pixel arrays, with each array representing one frame of the original video. Through the use of MPEG the digitized video could be stored and transmitted more efficiently. While the MPEG standard is independent of implementation details, it requires that pictures be compressed using certain techniques, most notably the discrete cosine transform (DCT). A second common, although optional, technique used in the MPEG algorithm is motion estimation (ME). While this technique is normally included in MPEG implementations, it is not actually part of the MPEG standard. Therefore it may be implemented differently in different implementations of MPEG, or even left out entirely.

Overall, the MPEG algorithm is a challenging problem on which to test the EXECUBE array, for several reasons. It is a real time, computation intensive application that lends itself well to massively parallel computers. In addition, it has portions that can not be done in parallel, so communication between processors is also tested. Finally, it is an important real-world problem that has been examined by others.
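For reference, the sketch below shows textbook forms of the two techniques named above: an 8 by 8 two-dimensional DCT and an exhaustive (full search) block matching routine that minimizes the sum of absolute differences. This is only an illustration of the kernels involved, not the EXECUBE implementation; the use of NumPy and the +/-8 pixel search range are assumptions.

```python
import numpy as np

def dct_2d(block):
    """Textbook 8x8 2-D DCT (type II), the transform MPEG applies to each block."""
    n = block.shape[0]
    k = np.arange(n)
    # 1-D DCT basis: C[u, x] = alpha(u) * cos((2x + 1) * u * pi / (2n))
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T          # separable transform: rows, then columns

def full_search_me(cur_mb, ref, top, left, search=8):
    """Exhaustive block matching: find the motion vector that minimizes the sum of
    absolute differences (SAD) between a 16x16 macroblock and a reference frame."""
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + 16 > ref.shape[0] or x + 16 > ref.shape[1]:
                continue                        # candidate block falls outside the frame
            sad = np.abs(cur_mb.astype(int) - ref[y:y+16, x:x+16].astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```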

2.3 Increasing the Speed of the EXECUBE Array

In order to test the EXECUBE array, we first estimated the amount of time it would take to run the MPEG algorithm. This number could not be simulated exactly, since the EXECUBE compiler is still in the development stage. However, by using a number of manual techniques, a reasonable estimate can be obtained. Our results show that the EXECUBE array can not display video in real time, even with the maximum processing array under consideration (512 nodes), for a picture

size of one megabyte. The display rate could be improved by designing a new parallel array of computers with specialized video processing functions; however, this would involve a large amount of effort and time. Instead, we chose to improve system performance by using a form of hardware/software (hw/sw) codesign. Hardware/software codesign allows a system to be designed and implemented quickly by using standardized, off-the-shelf software for most of the system. At the same time, it allows the system to meet timing constraints by using specialized, custom built hardware for the time critical portions of the system.

By adding additional hardware to the EXECUBE array, we were able to take advantage of its ability to perform many parallel tasks, while also speeding up the system by having the hardware perform time critical tasks. A crucial step in the MPEG algorithm is the motion estimation (ME) calculation. Since our analysis showed that the motion estimation routine was the most time critical portion of our MPEG implementation, we chose to add additional hardware for it, which increased the system throughput considerably. We also reduced the compression time by simplifying the motion estimation algorithm, at the cost of less efficient encoding. We also explored the possibility of using hardware to implement the discrete cosine transform (DCT). In these cases the extra design time would be minimal, since full search motion estimation and DCT hardware already exist. One of the techniques involved in our system is having one piece of hardware serve several pieces of software.

Next, a number of different configurations were simulated. These configurations included the EXECUBE array by itself, the EXECUBE array with additional hardware, and different implementations of MPEG. The extra hardware consisted of varying numbers of DCTs, as well as varying numbers of queues. In addition, we also tested the case where some nodes were used to queue data for the DCTs, as well as cases where the communication channels were modified.


3 Algorithm

With any hardware/software codesign system it is important to remember what assumptions are made, as well as what limitations are placed on the design process. We present these, together with the main flow chart of the hardware/software codesign algorithm, in this section. We then briefly discuss how we decide what hardware to add to the system, as well as the internal communication overhead that results from this hardware.

It is assumed that design time is the most important parameter in the hw/sw codesign algorithm. This has several effects on our hw/sw codesign process. First, the system begins with all software, and then adds hardware until a solution is found. In this manner only as much hardware as is needed is added, resulting in less design and test time. Second, when new hardware is introduced, it is added to the system in a regular, repeating pattern that does not require EXECUBE's structure to be altered. For example, if a coprocessor of some type were to be added to the EXECUBE array, one coprocessor might be introduced for each cluster of eight EXECUBE nodes. If it were necessary to eliminate some EXECUBE nodes due to chip area limitations, entire clusters would be eliminated, as opposed to eliminating part of a cluster and redesigning the layout of the array.

In order to simplify the design process we do not consider the input and output of the video data from the EXECUBE array. Obviously, for real time video display the IO demands are high, resulting in the need for parallel IO using specialized hardware. This problem is not considered in this paper. Instead, it is assumed that data is distributed to the nodes as needed. We do, however, attempt to minimize IO demands when it is feasible.

We also begin with the assumption that each node is to process a portion of the video that is to be compressed (as opposed to having each node work on a different frame). The input is broken up into areas of 16 by 16 pixels; in the MPEG algorithm such an area is known as a macro block. Since the motion estimation subroutine of MPEG requires the processing element to use not only the values of the macro block from the current frame, but also the corresponding macro blocks in surrounding frames, each node will process the same macro block for all frames. For example, EXECUBE node six might process the macro block corresponding to the lower left corner of the video for all frames. If necessary, each node will compress more than one macro block per frame. In this case the macro blocks will be adjacent, since only a limited amount of additional IO is necessary for the motion estimation subroutine for each adjacent macro block.
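A minimal sketch of this data decomposition is given below. The helper name and the row-major assignment order are our assumptions, but the 16 by 16 macro block size and the goal of giving each node the same, adjacent macro blocks in every frame follow the description above.

```python
MB = 16  # macro block edge length in pixels

def assign_macroblocks(width, height, nodes):
    """Return node_id -> list of (row, col) macro block indices. Each node keeps
    the same contiguous run of macro blocks for every frame, so the blocks given
    to one node stay adjacent and motion-estimation IO stays limited."""
    mbs_x, mbs_y = width // MB, height // MB
    all_mbs = [(r, c) for r in range(mbs_y) for c in range(mbs_x)]
    per_node = len(all_mbs) // nodes          # e.g. 4096 macro blocks / 512 nodes = 8
    return {n: all_mbs[n * per_node:(n + 1) * per_node] for n in range(nodes)}

assignment = assign_macroblocks(1024, 1024, 512)
print(len(assignment[6]))   # node 6 compresses the same 8 macro blocks in every frame
```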

ALGORITHM MAIN_HW/SW ()
Input:
    Display_Rate:  Required display rate
    Nodes:         Number of EXECUBE nodes
    Size:          Picture size
Output:
    A design that meets the timing constraints
begin
    Initialize(No_Hardware)
    while (Current_Rate < Display_Rate)
        New_Hardware <- Most_Needed_Hardware()
        New_Speed    <- Hardware_Speed(New_Hardware)
        Ratio        <- Determine_Ratio(New_Speed)
        Add_Hardware(New_Hardware, Ratio)
        if (Data_Transfer_Rate() < Display_Rate)
            Add_Transfer_Hardware()
        end
    end
end

Figure 3: The hw/sw codesign algorithm.

Figure 3 presents the pseudo code of the algorithm that was used during the hw/sw codesign process. We begin by attempting to implement the entire system using the EXECUBE array without any additional hardware. If the system can not meet the time constraints, the algorithm begins to add specialized hardware in an iterative fashion. The subroutine Most_Needed_Hardware is used to determine what type of hardware to add. Hardware is added based upon the amount of time spent in each subroutine, as well as the difficulty of implementing the subroutine in hardware, and hence the hardware's area. Fortunately, the algorithms that take up the greatest amount of time are structured loops, such as the motion estimation algorithm and the DCT. Not only are these algorithms cyclical in nature, but they have also been previously researched and implemented in hardware [8].

Once the new hardware to be added is decided upon, the ratio of EXECUBE nodes to hardware must be calculated. Since the specialized hardware normally runs much faster than the EXECUBE

nodes, the hardware may be shared among a number of nodes. For example, a group of 16 EXECUBE nodes might share one DCT unit. The ratio of nodes to hardware is determined by the function Determine_Ratio. Once the ratio and hardware are set, the hardware is added via the Add_Hardware subroutine. It should be noted that if adding a unit of hardware violates the chip area constraint, some EXECUBE nodes may have to be eliminated. As noted above, in this case nodes would be eliminated in groups.

The total time spent calculating the output of each macro block is defined as T_mb, while the amount of time that it takes each node to complete its internal, non data transfer calculations for each macro block is defined as T_n. Similarly, the amount of time that it takes the specialized hardware to complete its processing for each macro block is defined as T_h. Each node must access the hardware N_mb times for each macro block. Hence, it may appear that the execution time for a macro block is simply T_mb = T_n + (N_mb × T_h). However, we must also consider the time spent transferring data between the EXECUBE nodes and the hardware. In addition, if groups of nodes share a hardware unit, the time spent in the hardware queue waiting for the hardware to finish calculations from other nodes must be considered.

Since EXECUBE interleaves data transfers with internal operations, data transfers may significantly affect the speed of the nodes. In addition to the amount of time spent transferring data, there is also a small amount of time required to set up the data transfer, which we will denote as T_se. (We use T_se to include both the set up time for transferring data from the EXECUBE array to the hardware and the time for receiving data from the hardware back into the EXECUBE array.) The set up time is independent of the amount of data transferred. Therefore, if D pieces of data must be transferred for each hardware operation, and T_d is the time it takes to transfer a single piece of data from the EXECUBE array to the hardware, then the total time for each macro block is

    T_mb = T_n + ((T_h + (D × T_d) + T_se) × N_mb).

As an example, there are four DCT calculations required per macro block. If T_n = 48000, T_h = 16, D = 64, T_d = 240, T_se = 2400, and N_mb = 4, then

    T_mb = 48000 + ((16 + (64 × 240) + 2400) × 4) = 48000 + (17776 × 4) = 48000 + 71104 = 119,104.
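The per-macroblock timing model above can be written directly as a small function; the sketch below is only an illustration (not the simulator used in section 4) and reproduces the worked example.

```python
def macroblock_time(t_n, t_h, d, t_d, t_se, n_mb):
    """Per-macroblock time from the model above:
       T_mb = T_n + ((T_h + D*T_d + T_se) * N_mb)."""
    return t_n + (t_h + d * t_d + t_se) * n_mb

# Worked example from the text: four DCT calculations per macro block.
print(macroblock_time(t_n=48000, t_h=16, d=64, t_d=240, t_se=2400, n_mb=4))  # 119104
```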

It should also be noted that the speed will be affected by how we implement the data transfers. In some cases data may be passed from one node, called the originating node, through a series of other nodes, called passing nodes, before being received by the hardware. In this case the passing nodes do not spend T_d time for each piece of data being transferred. On the other hand, passing nodes must receive a message from the originating node telling them to pass data through themselves. The time spent receiving, parsing, and acting on this message will be denoted as T_m. For example, if sixteen nodes share a hardware coprocessor, and all sixteen nodes pass data through node five, which is the only node connected to the coprocessor, then it should be expected that node five will run slower than the other nodes, since it will spend a relatively large amount of time processing messages that tell it to pass data on for other nodes. In this case, it may be worthwhile to reserve node five solely for transferring and queuing data.

A second issue is what each node does while the hardware is performing calculations for it. For example, do we assume that an EXECUBE node may execute other operations while it waits for the specialized hardware to finish its calculations? If we assume that the video is not real time, we may work on more than one frame at a time. In this case, the effect of the hardware queue on the overall execution rate will be negligible, since the EXECUBE nodes may execute instructions from other frames while waiting for the hardware to process data for the current frame. While this model simplifies the timing calculations, it is not used, since it includes the assumption that the data is not arriving and being compressed in real time. The model used is that of real time video, in which each frame must be received and processed before the next frame is received. In this case the amount of time spent in the hardware queue may be a significant portion of the total execution time.

If the data is to be queued for the hardware, we must consider where it is to reside while waiting for the hardware, as well as the path from it to the hardware. For example, only one node per cluster may be connected to the hardware with no queue; one node per cluster may be connected to the hardware and act as a queue; or one node per cluster may be connected to a hardware queue which is then connected to the hardware that performs the processing. Other options would be to add more hardware so that each node can be connected to the specialized hardware, or to have a single large queue for all nodes and all hardware units. The model used will affect not only the response time of the system, but also the chip area. Each of these alternative queuing models is considered in more detail in the next section.
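As a rough illustration of the queuing effect discussed above, the sketch below bounds the hardware-queue wait under the pessimistic assumption that every node sharing one hardware unit submits a request at the same instant. This simple bound is our assumption, not the model used in the simulations of section 4.

```python
def worst_case_wait(nodes_sharing, t_h, d, t_d, t_se):
    """Crude upper bound on hardware-queue wait: if all sharing nodes submit a
    request at the same instant, the last node waits for every earlier request
    to be transferred and processed."""
    per_request = t_h + d * t_d + t_se
    return (nodes_sharing - 1) * per_request

# With the example values above and eight nodes per hardware unit:
print(worst_case_wait(8, t_h=16, d=64, t_d=240, t_se=2400))  # 124432
```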


4 Experimental Results

This section presents experimental results for several different system configurations. The parameters to the system were:

- the figure size - height and width of the figure in pixels
- the number of EXECUBE nodes - ranging from 8 to 512
- the MPEG algorithm - with or without motion estimation
- additional hardware - none, motion estimation, or DCT
- the queue structure - none, an EXECUBE node, or additional hardware
- the interconnection structure - all data passed to the hardware through EXECUBE's hypercube via a single node, or each node connected directly to the hardware

In addition, a target frame rate of 30 Hz was used for all experiments. A summary of some of the different system configurations is shown in figure 4. For each configuration we attempted to calculate the overall run time of the compression algorithm. A few comments will help explain the figure. Figure 4 (A) shows eight nodes of the EXECUBE array along with a single unit of extra motion estimation hardware. Note that the EXECUBE array is interconnected in a 3-D hypercube. These interconnections are set by the hardware and can not be modified; however, additional interconnections may be added, as discussed below. The extra motion estimation hardware is added to the system and connected to node 1. In this configuration all data must be passed to the ME hardware through node 1. However, there are a number of different paths that can be taken to get to node 1. For example, we could pass data from node 8 through nodes 6 and 2, or through nodes 4 and 3. In our experiments, we assumed that data was first passed in the x direction, then in the y direction, and finally in the z direction. Hence data from node 8 would go through nodes 4 and 3, while data from node 7 would go through node 3. These paths are denoted by the dark black lines in figure 4 (A). While these paths could be modified as execution proceeds, they are not, since, as shown below, the link between node 1 and the ME hardware is the bottleneck regardless of the other interconnections.
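The x-then-y-then-z rule is the usual dimension-ordered routing on a hypercube. The sketch below computes the sequence of passing nodes for such a route; labelling the nodes 0 through 7 by their (x, y, z) bits is our assumption, since figure 4 numbers the nodes 1 through 8.

```python
def route(src, dst):
    """Dimension-ordered routing on a 3-D hypercube: correct the x bit, then y, then z."""
    path, cur = [src], src
    for bit in (0, 1, 2):                 # bit 0 = x, bit 1 = y, bit 2 = z
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit               # hop along that dimension
            path.append(cur)
    return path

print(route(0b111, 0b000))   # [7, 6, 4, 0]: one hop per differing dimension
```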

[Figure 4: The different system configurations that were simulated. (A) The EXECUBE array and motion estimation hardware. (B) The EXECUBE array, ME and DCT hardware. (C) The EXECUBE array, ME hardware, and the first node used as a queue. (D) The EXECUBE array, ME hardware, and an additional hardware queue. (E) The EXECUBE array and two ME hardware units. (F) The EXECUBE array and ME hardware with different interconnections.]

Figure 4 (B) shows the same setup, except with DCT hardware in addition to ME hardware. In cases where DCT hardware was added, ME hardware was always added first, since the ME algorithm is much more time intensive than the DCT algorithm. Figure 4 (C) is the same as (A), except that node 1 has now become a queue; this is denoted by the letter Q in the node. In this configuration, it is hoped that by queuing data "closer" to the hardware, the algorithm will run faster. The disadvantage of this configuration is that node 1 is no longer available for doing calculations. Figure 4 (D) shows a configuration similar to (C), except that here a queue has been added in hardware, and node 1 has once again become a calculation node. This configuration has the advantage that all EXECUBE nodes may be used for calculations, at the expense of extra queuing hardware. Figure 4 (E) shows a configuration in which more than one ME unit has been added for each eight nodes. In this configuration data from the left side of the array (nodes 1, 3, 5, and 7) was passed to the left ME unit only, while data from the right side of the array was passed to the right ME unit only. Finally, figure 4 (F) shows a configuration where each node is connected directly to the ME hardware. Such a configuration relieves all nodes of the work associated with passing data around. As such, it is likely to be the most efficient configuration. Furthermore, the EXECUBE array has an additional IO channel, aside from the three shown in (A)-(E), that can be used. In (A)-(E) this channel is used for communication between groups of nodes and the outside. In (F) this channel must also double as a path for data between the node and the ME hardware. It should also be noted that figures 4 (A) through (F) do not represent all possible configurations. For example, an additional hardware queue could be added to configuration (F), or configuration (B) could be changed to include the options in figures (D), (E), and (F).

Figure 5 shows the execution time for the EXECUBE array without additional hardware, presuming a 1024 by 1024 image size and 512 nodes. If the algorithm includes a full motion estimation algorithm, the MPEG algorithm takes 266.1ms. This corresponds to a frame rate of approximately 3.8 frames per second, or about one eighth of the desired rate. It may be seen that a large portion of the algorithm is spent in the ME subroutine. If we eliminate this routine (which is feasible by simply presuming the motion to be zero), the MPEG algorithm takes 37.8ms, corresponding to a frame rate of 26.5 frames per second. This is near, but slightly below, the desired frame rate. This design solution is not used since it results in poor compression.

With ME:
Subroutine   Time (ms)   Percent
Setup        3.1         1.2
ME           228.3       85.8
DCT          27.4        10.3
Encode       1.6         0.6
Other        5.7         2.1
Total        266.1       100

Without ME:
Subroutine   Time (ms)   Percent
Setup        3.1         8.2
ME           0           0
DCT          27.4        72.5
Encode       1.6         4.2
Other        5.7         15.1
Total        37.8        100

ME: Motion Estimation. DCT: Discrete Cosine Transform. Encode: Huffman Encoding.

Figure 5: Execution times (in ms) for a 1024 by 1024 image with 512 nodes.

Figure 6 shows the results of simulating the EXECUBE array, using the General Purpose Simulation System (GPSS) language [5], for various configurations. It should be noted that cases G, O, H, I, K, and M correspond to figure 4 (A) through (F), respectively. The hardware column indicates the type of hardware that was added to the system. Note that DCT hardware is only added after ME hardware; hence "DCT" implies that both DCT and ME hardware are being used. The "HW No" column shows the number of hardware units added per eight EXECUBE nodes. Since the hardware is faster than the EXECUBE nodes, each hardware unit may be grouped with several EXECUBE nodes. We use eight as our grouping since eight nodes are contained in each EXECUBE chip. The "Q" column shows the type of queue used, as per figure 4. A blank indicates that no queue was used, while a single Q indicates that one hardware queue was added for each group of eight nodes. Similarly, "2Q" indicates that two hardware queues were added, each to a different node. Finally, N1 indicates that node 1 was used as a queue instead of as a processing node. The bus column indicates what type of bus structure was used: a one indicates that the hypercube structure was used, a two indicates that half a hypercube was associated with each piece of hardware, and "Full" indicates that each node is connected directly to the hardware or queue.

Cases A through F show the frame rates for various picture sizes and numbers of nodes, with and without the ME algorithm. Only in those cases where all 512 nodes are used and the ME algorithm is not used does the frame rate approach the desired rate of 30 frames per second. However, as noted above, not using the ME algorithm results in poor compression, and is rejected

Case  Hardware  HW No  Q    Bus   Hor  Vert  Nodes  ME  Time    Frame
A     None                        1K   1K    512    Y   266.1   3.8
B     None                        1K   1K    512    N   37.8    26.5
C     None                        1K   1K    8      Y   17030   0.06
D     None                        1K   1K    8      N   2417    0.42
E     None                        520  720   512    Y   88.7    11.4
F     None                        520  720   512    N   12.6    79.5
G     ME        1           1     1K   1K    512    Y   82.4    12.1
H     ME        1      N1   1     1K   1K    512    Y   82.4    12.1
I     ME        1      Q    1     1K   1K    512    Y   79.8    12.5
J     ME        2           1     1K   1K    512    Y   79.8    12.5
K     ME        2           2     1K   1K    512    Y   47.6    21.0
L     ME        2      2Q   2     1K   1K    512    Y   47.1    21.2
M     ME        1           Full  1K   1K    512    Y   79.5    12.6
N     ME        1      Q    Full  1K   1K    512    Y   42.1    23.8
O     DCT       1           1     1K   1K    512    Y   83.0    12.0
P     DCT       1           2     1K   1K    512    Y   79.0    12.7
Q     DCT       1           Full  1K   1K    512    Y   80.0    12.5
R     DCT       1      Q    Full  1K   1K    512    Y   15.4    64.9

Hor and Vert: horizontal and vertical pixels. Nodes: EXECUBE nodes. HW No: hardware units per 8 nodes. Q: queue number and type (N1: node 1 used as a queue). Bus: bus interconnection structure. ME: motion estimation used or not. Time: execution time per frame (ms). Frame: frame rate (frames per second).

Figure 6: Results for several configurations.


as a solution. Case G of figure 6 shows that when extra ME hardware is added, the execution time does not decrease nearly as much as expected. This is because an internal communication bottleneck is encountered between EXECUBE and the ME hardware. While the ME hardware itself is very fast, it can only receive data from one node at a time. For a 1024 by 1024 array, with 512 nodes, there are 64 by 64 macroblocks, for a total of 4096 macroblocks. Hence each node must perform ME for eight macroblocks. Since each macroblock consists of a 16 by 16 pixel area, and the ME algorithm used requires one macroblock from the current frame, nine macroblocks from the previous frame, and nine macroblocks from the next, already encoded frame, 4864 pixels must be sent for each macroblock. With eight macroblocks per node, and eight nodes per ME hardware unit, approximately 2^14 × 19, or about 311000, data items must be sent to each ME hardware unit per frame. Since each communication instruction takes 240ns, each EXECUBE node will spend 74.7ms per frame just communicating data to the ME hardware. Hence, regardless of the queuing method used, the execution rate will remain nearly constant unless data can be queued in parallel. Analysis of the results also points this out: only in the cases where more than one piece of data may be queued at once does the execution time noticeably decrease. Even with ME hardware and a fully parallel queuing structure, the frame rate is still somewhat less than the desired rate. Therefore DCT hardware is also added. With both ME and DCT hardware, the desired frame rate may be met.
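The communication estimate above can be checked with a few lines of arithmetic; the sketch below simply reproduces the numbers in the preceding paragraph.

```python
# Reproducing the data-volume estimate for one ME hardware unit per frame.
MB_PIXELS      = 16 * 16                                 # pixels per macroblock
MBS_PER_NODE   = (1024 // 16) * (1024 // 16) // 512      # 4096 macroblocks / 512 nodes = 8
NODES_PER_ME   = 8                                       # nodes sharing one ME unit
REGIONS_PER_MB = 1 + 9 + 9                               # current + previous + next frame blocks
T_COMM_NS      = 240                                     # ns per communication instruction

items_per_frame = MBS_PER_NODE * NODES_PER_ME * REGIONS_PER_MB * MB_PIXELS
print(items_per_frame)                      # 311296, i.e. roughly 2**14 * 19
print(items_per_frame * T_COMM_NS / 1e6)    # about 74.7 ms spent just moving data
```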

5 Conclusion

The complexity and functionality of computer systems are increasing at an exponential rate. At the same time, the life cycles of such products are decreasing due to rapidly advancing technology. Due to these two factors, the design time for an integrated circuit has begun to take up more and more of the circuit's life cycle. These factors have greatly increased not only the need for hw/sw codesign, but also the need for design tools that automate this process. Hardware/software codesign is able to create a system rapidly, since standardized parts are used to implement most of the system. However, it is also able to meet timing constraints, since a limited amount of custom hardware is used for the time critical parts of the system.

We consider the problem of video compression using the MPEG algorithm and the EXECUBE

processing array. Since EXECUBE alone can not meet the desired frame rate, hw/sw codesign is used to meet it while keeping the design time to a minimum. Unfortunately, adding extra hardware introduces an internal communication bottleneck that can only be overcome through the use of a parallel queuing structure. Results that illustrate the savings which are possible with these solutions are presented for several system configurations.

References

[1] P.M. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, and E. Retter, "Combined DRAM and Logic Chip for Massively Parallel Systems," Proc. 16th Conference on Advanced Research in VLSI, 1995.
[2] P.M. Kogge, "EXECUBE - A New Architecture for Scaleable MPPs," Proc. 1994 International Conference on Parallel Processing, pp. 77-82.
[3] D.J. Le Gall, "MPEG: A Video Compression Standard for Multimedia Applications," Communications of the ACM, Vol. 34, No. 4, April 1991, pp. 47-58.
[4] D.J. Le Gall, "The MPEG Video Compression Algorithm," Signal Processing: Image Communication, Vol. 4, No. 2, April 1992, pp. 129-140.
[5] S.V. Hoover, "Simulation: A Problem Solving Approach," Addison-Wesley Publishing Company, Inc., NY, NY, 1990.
[6] G.K. Wallace, "The JPEG Still Picture Compression Standard," IEEE Transactions on Consumer Electronics, Vol. 38, No. 1, February 1992, pp. xviii-xxxiii.
[7] X. Niu, F. Lovett, M. Sheliga, and J. Swadener, "Optical Wheel Speed Sensor," Term Project Report, Computer Systems Design, University of Notre Dame, May 1993.
[8] D. Gajski, F. Vahid, S. Narayan, and J. Gong, "Specification and Design of Embedded Systems," Prentice-Hall, Inc., Englewood Cliffs, NJ, 1994.
[9] V. Carchiolo and M. Malgeri, "Behavioural Approach to System Codesign," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[10] J.V. D'Anniballe and P.J. Koopman Jr., "Towards Execution Models of Distributed Systems: A Case Study of Elevator Design," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[11] R. Gupta and G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems," IEEE Design and Test of Computers, October 1993, pp. 29-41.
[12] R. Gupta and G. De Micheli, "System Level Synthesis Using Re-programmable Components," The European Conference on Design Automation, March 1992, pp. 2-7.
[13] J. Henkel, T. Benner, and R. Ernst, "Hardware Generation and Partitioning Effects in the COSYMA System," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[14] S. Kumar, J.H. Aylor, B.W. Johnson, and W.W. Wulf, "Exploring Hardware/Software Abstractions & Alternatives for Codesign," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[15] M.C. McFarland and T.J. Kowalski, "Formal Verification of CPA Descriptions with Audit," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[16] C.-J.R. Shi, "Towards a Unified Operational Semantics for Behavior Specification of VLSI Systems," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
[17] P.G. Soderquist and M.E. Leeser, "Implementing Floating-Point Square Root Computation With Newton's Method: A Case Study in Hardware/Software Partitioning," 2nd International Workshop on Hardware-Software Co-Design, Workshop Handout, 1993.
