A Low Cost, Multithreaded Processing-in-Memory System Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, Peter M. Kogge University of Notre Dame, Notre Dame, IN 46556 USA
[email protected] http://www.nd.edu/~pim
Abstract. This paper discusses die cost vs. performance tradeoffs for a PIM system that could serve as the memory system of a host processor. For less than twice the cost of a commodity DRAM part, it is possible to realize a performance speedup of nearly a factor of 4 on irregular applications. This cost efficiency derives from developing a custom multithreaded processor architecture and implementation style that is well-suited for embedding in a memory. Specifically, it takes advantage of the low latency and high row bandwidth both to simplify processor design, reducing area, and to improve processing throughput. To support our claims of cost and performance, we have used simulation, analysis of existing chips, and a prototype chip, PIM Lite, which we designed and fully implemented.
1 Introduction
Processing In Memory (PIM), where significant processing logic is embedded with dense DRAM on the same chip, has become a feasible technology in recent years. This paper shows how a system consisting of a low-cost network of PIM chips can deliver very high performance on both regular and irregular scientific applications. The processors embedded within the memory make very efficient use of the available high bandwidth and low latency to memory through a novel instruction set that combines multithreaded MIMD with SIMD short-vector operations, implemented with a highly compact microarchitecture and VLSI floorplan. The processors also provide highly efficient mechanisms for lightweight messages between PIM nodes, called parcels. Based on an analysis of commodity DRAM products, available embedded DRAM macros, embedded and commodity microprocessors, as well as the PIM Lite experience, the paper develops a PIM area versus functionality model showing that a multithreaded core with 4 SIMD floating-point units could be implemented in an area approximately equivalent to 8 MBytes of DRAM; embedding four such cores in a 64 MByte DRAM would thus add roughly 50 percent to the die area.
This research was sponsored by DARPA under the Data Intensive Systems (DIVA), PAC/C (MORPH) and HPCS (Cascade) programs.
In conventional terms, our PIM system can be approximated as a highly multithreaded, NUMA, distributed shared memory system. In addition to the basic principles that would apply to any system of this class, such as organizing data for spatial locality to minimize communication traffic and providing low-overhead message support, several additional principles specific to an in-memory system have influenced our design:
– minimize the amount of execution state physically resident in the processor. This not only leads to lower processor area, but also minimizes the amount of data transferred between the processor and memory during thread context switches;
– transfer data between the processor and memory at the highest bandwidth supported by the on-chip memory. This entails packaging thread state in wide-words matched to the wide memory bus, and operating on wide-words as such in the processor pipeline;
– provide fast mechanisms to determine whether a given memory reference is local to the current node. If a memory reference is non-local, provide an efficient way for user-level code to determine how to react and to transfer control to the node that contains the data.
While most of the discussion in this paper pertains to a general class of systems, for many of the examples we will refer to a prototype implementation called PIM Lite. PIM Lite is a 16-bit, multithreaded processing-in-memory architecture and corresponding VLSI implementation, developed for convenient experimentation with many aspects of PIM technology, including execution models, ISA design, microarchitecture, physical layout, system software, and applications programming. Fig. 1 shows the layout of the PIM Lite chip, which has been fabricated on a 0.18 µm TSMC process through MOSIS.
Fig. 1. 4-node PIM Lite layout (left) and fabricated 1-node chip.
The term “Lite” refers to both virtues and limitations of the architecture and implementation. As a virtue, PIM Lite provides a complete working demonstration of a minimal-state, lightweight multithreaded processor with low-overhead thread swapping. The ISA has no registers, and the implementation has only a very small data cache, used primarily to provide multiported access to operands for arithmetic operations. The microarchitecture and VLSI implementation also demonstrate a very compact wide-word, bit-sliced organization that is physically pitch-matched to line up with columns of memory. The main limitations of PIM Lite derive from the 16-bit instruction width and address space, as well as the tight design budget, which necessitated a small die size and simplifications in circuit design. Complete details on the VLSI implementation of PIM Lite are beyond the scope of this paper and may be found in [34].
The remainder of this paper is organized as follows. First, we provide the example of an N-body tree code to motivate the use of a lightweight multithreaded PIM system. Next, we discuss the programming and execution models, with examples from PIM Lite. Finally, we discuss the cost of such a system, based on both an analysis of commercial chips and our PIM Lite experience.
2 Example: N-Body Force Calculation
While in-memory processing can use either conventional shared memory or message passing metaphors to orchestrate parallel programs, it also opens a new possibility: moving the thread to the data, rather than copying data back to an immobile thread. In this section, we demonstrate the use of travelling threads for an N-body tree code. The goal of the N-body problem is to simulate the motion of bodies in space resulting from interactive forces between the bodies. Since each body exerts a force on every other body, the problem of calculating the net force on each body is O(N^2). A number of methods have been devised, however, that reduce the complexity of this in practice. The Barnes-Hut [3] method is based on the heuristic that if a group of bodies is tightly clustered, then it may be approximated as a single body for the purposes of calculating the net force on a single body that is sufficiently far away from the group. Central to the Barnes-Hut method is building a quadtree in 2-space or an octree in 3-space that recursively subdivides the region into cells, until each cell contains 0 or 1 bodies. To calculate the net force on a given body, the Barnes-Hut algorithm performs a depth-first traversal of the tree. If a given cell is far enough from the body, the entire cell is approximated as a single body at its center of mass and the traversal along that branch of the tree stops. Otherwise, the traversal expands to the children of that cell. In our PIM version of the program, we distributed the tree spatially over a set of PIM nodes, as shown in Fig. 2. We generated a thread for each body to traverse the tree in parallel and accumulate the net force acting upon each body. In order to eliminate the bottleneck at the top of the tree, since each thread starts
its traversal at the root, we replicated the top few levels of the tree onto each node before beginning the traversal (not shown in the figure). Parcels are only sent over the network when a thread encounters a link to a child off-chip. When this occurs, the entire thread state must be sent, but in this application the thread state is very small, since it consists only of the accumulated force, the mass of the body, its coordinates in 3-space, and a pointer to the body for reporting back the net force when the traversal is complete. For the simulation, we used a multithreaded version of the MIPS ISA. The simple heuristic of replicating the root of the tree distributed the load very evenly during execution.
Fig. 2. Distribution of the N-body tree over the PIM system (host CPU and cache on one side, networked PIM memory chips on the other)
Additional speedup for this example comes from SIMD operations on wide-words, since 44 percent of the floating point operations in the force calculation “inner loop” consist of vector arithmetic operations on 3-element vectors. An analysis of the N-body performance for a variety of PIM configurations is given in Section 4.2.
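As an illustration only, the travelling-thread traversal can be sketched in C roughly as follows. pim_MoveTo is the library call described in Section 3; the cell layout and the far_enough and add_force helpers are hypothetical stand-ins for the opening criterion and the force update, not the actual PIM Lite code.

  /* Minimal sketch of a travelling-thread Barnes-Hut force traversal. */
  typedef struct cell {
      double       mass;        /* total mass of the cell              */
      double       pos[3];      /* center of mass of the cell          */
      struct cell *child[8];    /* octree children (NULL if empty)     */
  } cell_t;

  typedef struct {
      double mass, pos[3];      /* the body this thread works for      */
      double force[3];          /* accumulated net force               */
      void  *home;              /* where to report the final result    */
  } body_thread_t;              /* the entire travelling-thread state  */

  extern void pim_MoveTo(void *addr);                        /* Section 3 library call */
  extern int  far_enough(const cell_t *c, const double *p);  /* hypothetical helper    */
  extern void add_force(body_thread_t *b, const cell_t *c);  /* hypothetical helper    */

  void traverse(cell_t *c, body_thread_t *b)
  {
      if (c == NULL) return;
      pim_MoveTo(c);                    /* migrate the thread to the node holding c */
      if (far_enough(c, b->pos)) {
          add_force(b, c);              /* treat the whole cell as a single body    */
      } else {
          for (int i = 0; i < 8; i++)
              traverse(c->child[i], b); /* otherwise expand to the children         */
      }
  }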
3 Programming and Execution Models
Our multithreaded PIM execution model borrows heavily from early work in hybrid dataflow architectures including P-RISC [25] and Monsoon [28], as well as the Threaded Abstract Machine (TAM) [7]. Lightweight messaging using parcels [6] was also inspired by protocols developed for split-phase memory access in the dataflow work, as well as active messages [35], the MDP [10] and the J-Machine [26]. Our model extends this prior work by adapting lightweight multithreading and communication for use with wide-words from on-chip memory and by integrating short SIMD operations into the architecture.
Below, we briefly summarize the set of architectural mechanisms for the multithreaded PIM processing system. All of these mechanisms have been implemented in PIM Lite; a complete description of the PIM Lite ISA and programming environment may be found in [4], [5].

Frames  In the multithreaded PIM model, nearly all the state information of a thread is taken out of the CPU and kept in memory at all times. In place of named registers in the CPU, thread state is packaged in data frames of memory. The main difference between a frame and a register set is that frames are logically and physically part of the memory system, rather than part of the processor, and that a multithreaded program can have access to many different dynamically-created frames over the course of execution. Logically, a frame is simply a region of contiguous memory locations within a single node. Physically, a frame consists of one or possibly a few rows in a memory block. For a typical DRAM block such as [15], a row of memory fetched in a single internal read cycle is 2048 bits; this is storage equivalent to thirty-two 64-bit registers. Normally, the row latched at the sense amplifiers must then be paged out to the digital interface, typically as 256-bit wide-words. Because the ALU needs simultaneous access to several slots at a time, a multi-ported frame cache is used to store frames recently fetched from memory. A key issue with frames is allocation and deallocation of storage. In PIM Lite, we used fixed-size frames managed in software on a simple linked list; if the head of the list were part of the architecture, then frames could be allocated in a single cycle as long as free frames were available. Analysis of sample programs suggests that using a fixed frame size of approximately 32 words would not use memory any less efficiently than variable-sized frames. For example, an analysis of the SPEC benchmarks shows that 70% of the frames allocated were 32 words or less, 93% were 64 words or less, and more than 99% were 256 words or less.

Continuations  The execution state of a PIM thread is completely described by a continuation [28] consisting of two pointer values: a frame pointer, FP, which points to the starting location of the data frame, and an instruction pointer, IP, which points to the current instruction. During an instruction cycle, a single continuation is removed from a pool and processed, and 0, 1, or 2 continuations may be written back to the pool. The pool may contain multiple continuations with the same frame pointer value FP, meaning that multiple threads of execution can share values through a common frame. Interleaving continuations within a single pipeline provides an added degree of thread-level parallelism on top of having multiple independent nodes. In PIM Lite, a hardware scheduling mechanism ensures that two continuations with a common FP will never be in the pipeline at the same time, eliminating the need for hazard detection or forwarding logic in the pipeline. At very fine granularity, it may be difficult for a compiler to automatically find enough parallelism to keep the pipeline full; for scientific applications written in a coarse-grain parallel style, however, threads that might otherwise be scheduled onto independent nodes can be scheduled onto the same node for improved hardware efficiency when this is a concern.
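The continuation and fixed-size frame mechanisms can be pictured with a small C sketch. This is illustrative only: the field names, widths, and free-list allocator below are assumptions for exposition, not the actual PIM Lite encoding.

  #include <stdint.h>

  typedef struct {
      uint32_t FP;   /* frame pointer: base of the thread's data frame   */
      uint32_t IP;   /* instruction pointer: next instruction to execute */
  } continuation_t;  /* the entire scheduled state of a thread           */

  #define FRAME_WORDS 32              /* fixed frame size, about one wide row */

  typedef union frame {
      uint64_t     slot[FRAME_WORDS]; /* operand slots, used like registers   */
      union frame *next_free;         /* link field when on the free list     */
  } frame_t;

  static frame_t *free_list;          /* head of the software-managed list    */

  frame_t *frame_alloc(void)          /* constant-time pop: a single-cycle    */
  {                                   /* operation if the head were hardware  */
      frame_t *f = free_list;
      if (f != NULL)
          free_list = f->next_free;
      return f;
  }

  void frame_free(frame_t *f)         /* constant-time push back onto list    */
  {
      f->next_free = free_list;
      free_list = f;
  }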
Arithmetic and Logic Operations  Transferring frames as wide-words from memory is one approach to exploiting the high-bandwidth access to rows in memory. Since the frame structure preserves the wide-word organization of data, it is also natural to perform short vector/SIMD operations on wide-words in frames. Wide-word SIMD operations have been used in a number of past PIM designs, including [16] [22] [11] [19]. PIM Lite supports a basic set of wide-word operations, including vector-vector, vector-scalar, scalar-scalar, and permutation, using a single wide-word datapath. In contrast, the VIRAM chip [21] uses a true vector instruction set that supports non-unit and irregular stride memory operations. While this has been demonstrated to be highly effective for multimedia applications, our research focuses on the use of multithreading for irregular applications.

Parcels  A PARallel Communication ELement, or parcel, is a type of message for performing operations on remote objects in memory. As with conventional memory operations, parcels are routed to target memory locations, but they greatly expand the set of possible actions beyond simple reads and writes. Parcels may be viewed as an outgrowth of other lightweight messaging schemes for multithreaded systems, such as the MDP [9], the J-Machine [26], TAM [7], and Active Messages [35]. The concept of parcels has formed the basis for several PIM designs including HTMT [33], DIVA [11], and MIND/Gilgamesh [32]. A parcel contains the following information:
– the address of a target memory location,
– the operation to be performed,
– a set of arguments.
In PIM Lite, a parcel is formatted as a wide-word, where the first slot always specifies the pointer to the frame (FP) that the new thread will use for local storage, and the next slot specifies the instruction pointer (IP) of the first instruction in the new thread. The remaining 6 slots in the wide-word serve as arguments that can be passed to the new thread. The send instruction injects a parcel into the network. Upon receipt, the parcel is copied into the target frame and the first two slots of the parcel are inserted as a continuation into the thread scheduler in a single instruction cycle, providing extremely low overhead for message handling.

Thread Management Operations  In PIM Lite, fork and join instructions control the synchronization of threads operating on a common frame, using a counting semaphore in a frame slot. The fork instruction inserts two continuations into the thread pool, while the join instruction inserts either zero or one, depending upon the value of the semaphore. PIM Lite also provides a critical section mechanism for atomic operations spanning multiple instructions. During a critical section, only a single thread has access to the pipeline.
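Returning to the parcel format described above, the 8-slot wide-word layout and the receive action can be summarized by the following C sketch. The slot sizes match the 16-bit PIM Lite word, but the struct names and the enqueue_continuation routine are assumptions standing in for the hardware scheduler insert, not an actual API.

  #include <stdint.h>
  #include <string.h>

  /* One parcel occupies a single 8-slot wide-word. */
  typedef struct {
      uint16_t FP;       /* frame the new thread uses for local storage */
      uint16_t IP;       /* first instruction of the new thread         */
      uint16_t arg[6];   /* up to six argument slots                    */
  } parcel_t;

  extern void enqueue_continuation(uint16_t FP, uint16_t IP);  /* assumed scheduler hook */

  /* Software view of what the hardware does on parcel receipt: copy the
     parcel into the target frame, then schedule (FP, IP) as a continuation. */
  void parcel_receive(const parcel_t *p, uint8_t *local_memory)
  {
      memcpy(local_memory + p->FP, p, sizeof *p);  /* copy into the target frame        */
      enqueue_continuation(p->FP, p->IP);          /* done within one instruction cycle */
  }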
In the same manner as earlier lightweight multithreaded machines, including [28] [25] [8] [7], our multithreaded PIM model uses a tree of frames to represent the procedure linkage structure. The PIM Lite prototype ISA has no dedicated call or return instructions. Instead, parcels are used both to call the callee and to return the response to the caller.

Location-Aware Memory Operations  For applications that use large dynamic data structures, such as the N-body example, it is generally not possible to determine at compile time on which PIM node a given object resides. When a PIM thread needs access to data that is not on the current node, there are two choices: either bring the data to the thread, or move the thread to the data. The latter option is especially attractive when the thread state is small and the likelihood of future references to data on the new node is high. In either case, the thread must be able to determine whether a given virtual address is on the current node or not. PIM Lite uses conditional memory transfer instructions that check the local page table. Load and store operations in PIM Lite transfer data between global memory and data frames if the global memory address is on the current PIM node; otherwise, they fall through to a user instruction that specifies how to respond. Typically, this instruction slot would be a jump to a parcel send operation.

Software Libraries  To simplify development of multithreaded PIM programs, we created a small library of functions, based on and extending parts of the Pthreads library, for building and traversing distributed data structures. Location awareness is expressed in the library through the concept of the neighborhood of an address, which indirectly identifies the node on which that address resides, without having to explicitly name the node. Two functions in the library are:
pim_malloc  Allocate and return a pointer to a block of PIM memory in the neighborhood of (on the same node as) a given address. Several macros are defined to support data distribution, including PIM_HERE, the current node, and PIM_RANDOM, which randomly chooses an address.
pim_MoveTo  Move a thread to the neighborhood of (the same node as) a given address. If the thread is already on the same node as the given address, the call is effectively a nop. Otherwise, the call sends parcels to transfer the state of the current thread (its frame and possibly extension frames linked to it) and then restarts the thread on the new node.
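A small usage sketch of these library calls follows, building and searching one distributed binary tree. Only pim_malloc, pim_MoveTo, PIM_HERE, and PIM_RANDOM come from the library described above; the node type, the search logic, and the argument order shown for pim_malloc are assumptions for illustration.

  #include <stddef.h>

  extern void *pim_malloc(size_t size, void *neighborhood);  /* assumed prototype */
  extern void  pim_MoveTo(void *addr);

  typedef struct tnode {
      double        key;
      struct tnode *left, *right;
  } tnode_t;

  /* Allocate a child in the neighborhood of its parent, so a thread that
     has already moved to the parent usually does not need to move again. */
  tnode_t *make_child(tnode_t *parent, double key, int go_left)
  {
      tnode_t *c = pim_malloc(sizeof *c, parent);
      c->key  = key;
      c->left = c->right = NULL;
      if (go_left) parent->left = c; else parent->right = c;
      return c;
  }

  /* A travelling-thread search: the thread follows the data. */
  tnode_t *find(tnode_t *root, double key)
  {
      tnode_t *n = root;
      while (n != NULL) {
          pim_MoveTo(n);        /* nop if n is already in our neighborhood */
          if (key == n->key) return n;
          n = (key < n->key) ? n->left : n->right;
      }
      return NULL;
  }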
4 System Architecture and Silicon Cost

4.1 An Entry-Level PIM Configuration
Selecting a PIM system configuration is clearly a tradeoff between benefits and costs. Even among applications with large amounts of natural parallelism, computationally-intensive applications may benefit from more chip area devoted to processing logic, whereas data-intensive applications could suffer from the
higher cost-per-bit of memory. In this study, we focus on configurations that would have minimal impact on the cost-per-bit for a given memory capacity, yet still provide an order-of-magnitude or more performance improvement for a broad range of engineering and scientific applications that could be run on a high-end desktop workstation with 1–2 GB of memory. Today, a 1 GB DIMM typically has 16 × 512 Mb (64 MB) DRAM chips. Using this as a basis, we propose a two-stage network of PIM nodes, with a high-level network connecting 16 PIM chips on a card and a low-level network connecting 4 independent PIM nodes on a chip, for a total of 64 independent, networked PIM nodes.

Because the main performance gains from PIM come from the parallel speedup of pushing multiple compute nodes into the memory system, the performance of each PIM node in operations-per-second is not nearly as important as it would be for a uniprocessor system. Further, added complexity in the processor design increases chip area, adding to the cost-per-bit for memory. Still, if the single-node performance is too low, either much larger systems will be needed (and more parallelism extracted from the application) to achieve a given performance level, or the speedup for a fixed-size system will be too low. Our basic approach to processor instruction throughput is to use a simple single-issue, RISC-like pipeline, running at the highest clock rate that the DRAM fabrication process can comfortably accommodate. Table 1 summarizes the clock frequencies for several embedded RISC CPU cores on 180 nm and 130 nm fabrication processes. The PowerPC-440G clock rate numbers are cited for IBM's embedded DRAM processes, SA27-E and Cu-11. From this we conclude that a 200 MHz 5-stage RISC pipeline on a 130 nm DRAM process should be readily feasible.

Core                 Pipe Stages   Clock Rate (180 nm)   Clock Rate (130 nm)
ARM7 Core            3             80–110 MHz            100–133 MHz
ARM926EJ-S Core      5             180–200 MHz           220–250 MHz
MIPS32 4K Core       5             200–240 MHz           260–300 MHz
PowerPC-440G Core    7             400–500 MHz           500–667 MHz
Table 1. Embedded CPU clock rates
For IBM's Cu-08 density-optimized 130 nm embedded DRAM ASIC process, the DRAM random access time is 9 ns. Assuming a 200 MHz CPU, memory is thus within 5 clock cycles of the processor. Either caching or hardware multithreading can be used to hide even this relatively short memory latency. In [30], researchers at Sun Microsystems investigated cache configurations with very wide cache lines that could be filled in a single memory cycle with on-chip DRAM. Results from the SPEC benchmark suite showed that an 8 KB instruction cache with 512 B lines had miss rates between 0.1 and 0.5 percent. Further, a 16 KB, 2-way associative data cache with a 16-entry by 32 B victim cache had miss rates of mostly 5 percent or less across the same benchmark suite. The miss rates from these very small, simple caches are acceptable, since the miss penalty
is low; assuming the 512 B memory row is multiplexed out as eight 512-bit wide-words, the time to fill a cache line is only 13 cycles.

Before considering the node configuration and chip area, we first consider the network design. For the board-level network, inexpensive commodity 10/100/1000 Mb/s media switch chips with up to 24 ports are readily available today. For the on-chip, low-level network stage, the area cost of fully connecting 4 PIM nodes with extremely high bandwidth is negligible. For example, a 4-way crossbar with 256 bits per link could be implemented with a wire matrix of 1024 rows by 1024 columns; this is the same wiring requirement as a 1 Mb (0.13 MB) array of DRAM cells. Running at only 100 MHz, this network would provide 25.6 Gb/s of bandwidth per link. Furthermore, since 45 percent of the area of a typical commodity DRAM is decoding logic and interconnect between independent storage arrays, this 4-way crossbar could be embedded into a commodity DRAM floorplan without any noticeable area penalty.
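The back-of-envelope figures in this subsection follow from a few lines of arithmetic, reproduced below purely as a check; every input is an assumption already stated in the text.

  #include <stdio.h>

  int main(void)
  {
      /* 4-way crossbar with 256-bit links: wire-matrix size */
      int links = 4, bits_per_link = 256;
      int rows = links * bits_per_link;                 /* 1024 x 1024 crossings */
      printf("crossbar wire matrix: %d x %d (about %g Mb of DRAM-array wiring)\n",
             rows, rows, (double)rows * rows / 1048576.0);

      /* per-link bandwidth at 100 MHz */
      printf("per-link bandwidth: %.1f Gb/s\n", bits_per_link * 100e6 / 1e9);

      /* cache-line fill: 512 B row paged out as 512-bit wide-words,
         plus roughly 5 cycles of DRAM access latency */
      int access_cycles = 5, transfers = (512 * 8) / 512;
      printf("line fill: %d cycles\n", access_cycles + transfers);
      return 0;
  }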
4.2 The Cost of PIM versus Performance
To get a rough estimate of the relative costs of storage and processing logic in terms of silicon die area, we measured the areas of a variety of memory and logic structures, including DRAM, SRAM cache, integer datapaths, and floating point units, from published die photos of a sample of commodity and performance-optimized embedded DRAM designs, commodity microprocessors, and embedded cores. We then normalized the areas of these structures for different process gate lengths, Lp, determined "average" areas for each structure, and conservatively rounded the areas for the integer cores and floating point units up. Next, we expressed the area of each structure in terms of equivalent MB of commodity DRAM (including support circuitry such as decoders and sense amplifiers). Table 2 summarizes the results of this study.

Functional Unit              Reference             Average Area     Equivalent DRAM   Area vs. 64 MB
                                                   (10^6 · Lp^2)    Area (MB)         Commodity DRAM
1 MB Commodity DRAM          [36] [1] [18] [31]    96               1.0               1.6%
1 MB Embedded DRAM           [15] [14]             434              4.5               7.0%
1 MB SRAM Cache              [20] [24] [29]        2120             22.0              34.4%
32-bit Integer Unit/Core     [24] [29] [2] [13]    69               0.7               1.1%
64-bit Floating Point Unit   [24] [29] [17] [23]   169              1.8               2.8%
Table 2. Functional unit sizings relative to commodity DRAM

From the table, we see that 1 MB of SRAM cache requires more area than 20 MB of commodity DRAM. On the other hand, a 32-bit CPU integer core with a 64-bit floating point unit can be implemented in the area of 2.5 MB of DRAM or less. Put another way, it is instructive to consider the areas of these structures relative to a 512 Mb (64 MB) commodity DRAM, the largest-capacity memory part shipping in high-volume production today. Whereas a nominal 32-bit integer/64-bit floating point processor would add less than 4 percent to the area of a 64 MB DRAM, 1 MB of cache alone would add over 30 percent. Assuming that die cost increases with die area to the third or fourth power, the relatively low cost of processing logic and the high cost of cache are clear.

As a comparison, the university-developed PIM Lite processor logic was implemented in an area equivalent to 10.3 KBytes of single-ported SRAM on the same chip [34]. For the "professionally-developed" chips summarized in Table 2, the average 32-bit integer unit has an equivalent area of 33.3 KBytes of SRAM cache. This suggests that, at least within a factor of 2 or so, the PIM Lite floorplan can reasonably be used for relative area comparisons. On the other hand, when normalized to process line width, the PIM Lite memory and logic components were each approximately an order of magnitude less dense than their "professional" counterparts. This can be explained by the fact that PIM Lite used only 4 layers of metal for local and global interconnect, as well as by the substantially reduced engineering effort and experience that went into the design.

We now consider the silicon area requirements for several candidate PIM chip configurations, each with 4 independent MIMD nodes and, optionally, some amount of SIMD short-vector/wide-word acceleration. Table 3 lists several possibilities for chips with 64 MB of commodity DRAM. Sizings in the table are based on the data in Table 2, which are very conservative estimates of the area required for the non-DRAM structures. As Table 3 shows, 4 fully-connected 32-bit integer processors would add only 5 percent to the area of a 64 MB DRAM chip. Including an FPU with each node would increase the chip area versus memory-only by 14 percent, while including 4 FPUs per node would cost 33 percent more area than a memory chip alone.
Configuration        DRAM (MB)   Int Units   FP Units   Cache (KB)   Equivalent Area   Processor Area /
                                                                     (MB DRAM)         Total Area
Plain DRAM           64          0           0          0            64.0              0.00
Integer only         64          4           0          32           67.6              0.05
Single shared FPU    64          4           1          32           69.3              0.08
1 FPU per node       64          4           4          32           74.6              0.14
4 FPUs per node      64          4           16         32           95.6              0.33
Table 3. Area requirements for PIM chip configurations with 64 MB commodity DRAM and 4 processing nodes.
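For readers who want to check the arithmetic, the entries in Table 3 can be approximately re-derived from the raw areas in Table 2. The C snippet below is only a restatement of those numbers; the small differences from the printed values come from rounding in Table 2.

  #include <stdio.h>

  int main(void)
  {
      /* per-structure areas from Table 2, in units of 10^6 * Lp^2 */
      const double dram_per_mb = 96.0, sram_per_mb = 2120.0;
      const double int_core = 69.0, fpu = 169.0;
      const double mb_dram = 64.0, cache_kb = 32.0;

      struct { const char *name; int ints, fpus; } cfg[] = {
          { "Plain DRAM",        0,  0 },
          { "Integer only",      4,  0 },
          { "Single shared FPU", 4,  1 },
          { "1 FPU per node",    4,  4 },
          { "4 FPUs per node",   4, 16 },
      };

      for (int i = 0; i < 5; i++) {
          double cache = cfg[i].ints ? (cache_kb / 1024.0) * sram_per_mb : 0.0;
          double logic = cfg[i].ints * int_core + cfg[i].fpus * fpu + cache;
          double total = mb_dram + logic / dram_per_mb;   /* equivalent MB of DRAM */
          printf("%-18s  %5.1f MB-equiv   logic fraction %.2f\n",
                 cfg[i].name, total, (total - mb_dram) / total);
      }
      return 0;
  }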
A final consideration is to relate the area cost of PIM-technology-based chips to monetary cost. An analytic model for PIM die cost is rather complex, especially when one considers the use of redundancy in the DRAM cell array. A general rule of thumb, however, is that die cost is proportional to die area raised to the power of α, where 3 < α < 4. A preliminary detailed study of PIM cost, which also considers testing, suggests that this heuristic is reasonable. Fig. 3 plots the die cost multiplier versus the die area multiplier, for α equal to both 3
and 4. According to the model, a 50 percent increase in die area would lead to a cost increase of a factor of approximately 3 to 5. When we consider that for this cost multiplier we could embed 64 integer cores, each with 4 FPUs, into a 1 GB DIMM, the cost/performance improvement is potentially very good for the broad class of scientific applications that are known to parallelize well.

Fig. 4 illustrates the cost versus performance tradeoffs for the N-body program discussed in Section 2. The plot shows the relative increase in both die cost and speedup as the number of PIM nodes per chip varies. Two different PIM node configurations are considered: one with a single FPU, and one with a 4-way SIMD FPU. The results assume that all parts of the N-body program except the force calculation run on the host processor, that the clock rate of the PIMs is half the clock rate of the host, and that the value of the die cost parameter is α = 3.5. The PIM card is assumed to be a 1 GB DIMM with 16 64-MB PIM chips. As the plot shows, the performance gains for the SIMD unit are modest, while the cost increase is substantial; at 4 nodes the cost increase nearly equals the performance gain. With a single PIM node with a single FPU on the die, the speedup is 3.3 with a relative cost of 1.1. With 2 nodes per die, the speedup is 4.0 with a relative cost of 1.3. Beyond this, the benefits versus cost rapidly diminish.
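Written out, the rule of thumb behind Fig. 3 and the numbers above is a simple power law in the area multiplier; the middle value below restates the α = 3.5 case used for the N-body estimates:

  die cost multiplier ≈ (die area multiplier)^α,   3 < α < 4
  1.5^3 ≈ 3.4,   1.5^3.5 ≈ 4.1,   1.5^4 ≈ 5.1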
Fig. 3. Die cost multiplier versus die area multiplier, for α = 3 and α = 4

Fig. 4. Cost versus performance for the N-body program: relative increase in die cost and speedup versus the number of nodes per PIM chip, for nodes with a single FPU and with a 4-way SIMD FPU
5 Conclusions
In this paper, we have described a multithreaded architecture for a low-cost PIM system. We have shown that for less than twice the cost of a commodity DRAM part, we can add significant integer and floating point multithreaded parallel processing capability. This cost efficiency derives from
developing a custom processor architecture and implementation style that is well-suited for embedding in a memory. Specifically, it takes advantage of the low latency and high row bandwidth both to simplify processor design, reducing area, and to improve processing throughput. To support our claims of cost and performance, we have used simulation, analysis of existing chips, and a prototype chip, PIM Lite, which we designed and fully implemented. The chip was fabricated on a 0.18 µm TSMC CMOS process through MOSIS and is currently being tested. In addition to supporting conventional parallel processing models using shared memory and message passing, PIM offers the possibility of efficiently sending threads to data, rather than copying data back to a thread running at a fixed location. We demonstrated the use of these "travelling threads" in an N-body tree code, and showed how we could obtain nearly ideal parallel speedup through multithreading, with an extra "kicker" from short-vector/SIMD operations, with low network traffic and minimal cache. Results of this work may be extended to other PIM-based systems, including [12] and [27].
References
1. A multi-gigabit DRAM technology with 6F2 open-bit-line cell distributed overdriven sensing and stacked-flash fuse. In International Solid-State Circuits Conference (ISSCC), San Francisco, CA, February 2002. IEEE.
2. ARM. ARM Thumb family. www.arm.com, 2003.
3. Josh Barnes and Piet Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, December 1986.
4. J. B. Brockman. PIM Lite architecture and assembly language manual. Technical report, University of Notre Dame CSE Dept., July 2003.
5. J. B. Brockman. Programming PIM Lite. Technical report, University of Notre Dame CSE Dept., July 2003.
6. Jay B. Brockman, Peter M. Kogge, Vincent W. Freeh, Shannon K. Kuntz, and Thomas L. Sterling. Microservers: A new memory semantics for massively parallel computing. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 454–463, Rhodes, Greece, June 20–25, 1999.
7. David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM – A compiler controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, 18(3):347–370, July 1993.
8. David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164–175, April 1991.
9. W. J. Dally, A. Chien, J. A. S. Fiske, G. Fyler, W. Horwat, J. S. Keen, R. A. Lethin, M. Noakes, P. R. Nuth, and D. S. Wills. The message driven processor: An integrated multicomputer processing element. In International Conference on Computer Design, VLSI in Computers and Processors, pages 416–419, Los Alamitos, CA, October 1992.
10. W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Fyler. The message-driven processor. IEEE Micro, pages 23–39, April 1992.
11. J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca. The architecture of the DIVA processing-in-memory chip. In ACM International Conference on Supercomputing (ICS'02), June 2002.
12. Basilio Fraguela, Paul Feautrier, Jose Renau, David Padua, and Josep Torrellas. Programming the FlexRAM parallel intelligent memory system. In International Symposium on Principles and Practice of Parallel Programming, June 2003.
13. IBM. The PowerPC 440 core. Technical report, IBM Microelectronics Division, Research Triangle Park, NC, September 1999.
14. IBM. IBM SA-27E Embedded DRAM Macro Datasheet, April 2002.
15. IBM. Embedded Memory Selection Guide. http://www-3.ibm.com/chips/products/asics/products/ememory.html, March 2003.
16. Ken Iobst, Maya Gokhale, and Bill Holmes. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer, 28(4), April 1995.
17. Romesh M. Jessani and Michael Putrino. Comparison of single- and dual-pass multiply-add fused floating point units. IEEE Transactions on Computers, 47(9):927–937, 1998.
18. T. Kirihata et al. A 113 mm2 600 Mb/s/pin 512 Mb DDR2 SDRAM with vertically-folded bitline architecture. In International Solid-State Circuits Conference (ISSCC), San Francisco, CA, February 2002.
19. Graham Kirsch. Active memory device delivers massive parallelism. In Microprocessor Forum, San Jose, CA, October 2002.
20. G. Konstadinidis et al. Implementation of a third-generation 1.1 GHz 64 b microprocessor. In International Solid-State Circuits Conference (ISSCC), page 338, San Francisco, CA, February 2002.
21. Christoforos Kozyrakis, Joseph Gebis, David Martin, Samuel Williams, Ioannis Mavroidis, Steven Pope, Darren Jones, and David Patterson. Vector IRAM: A media-enhanced vector processor with embedded DRAM. In Hot Chips 12, Stanford University, Stanford, California, August 13–15, 2000.
22. G. Lipovski and C. Yu. The dynamic associative access memory chip and its application to SIMD processing and full-text database retrieval. In IEEE International Workshop on Memory Technology, Design and Testing, pages 24–33, San Jose, CA, August 1999.
23. MIPS. MIPS64 5K family. www.mips.com, 2003.
24. S. D. Naffziger and G. Hammond. The implementation of the next-generation 64 b Itanium microprocessor. In International Solid-State Circuits Conference (ISSCC), page 344, San Francisco, CA, February 2002.
25. Rishiyur S. Nikhil and Arvind. Can dataflow subsume von Neumann computing? In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 262–272, June 1989.
26. M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-Machine multicomputer: An architectural evaluation. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224–236, San Diego, CA, May 1993.
27. M. Oskin, F. Chong, and T. Sherwood. Active Pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), New York, June 27–July 1, 1998.
28. Gregory M. Papadopoulos and David E. Culler. Monsoon: An explicit token-store architecture. In 17th International Symposium on Computer Architecture, pages 82–91, Seattle, Washington, May 28–31, 1990.
29. R. P. Preston et al. Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In International Solid-State Circuits Conference (ISSCC), page 334, San Francisco, CA, February 2002.
30. Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for processor/memory integration. In 23rd Annual International Symposium on Computer Architecture (ISCA '96), pages 90–101, May 1996.
31. Semiconductor Industries Association. International technology roadmap for semiconductors. Technical report, 2001.
32. T. Sterling and H. Zima. Gilgamesh: A multithreaded processor-in-memory architecture for petaflops computing. In Supercomputing: High-Performance Networking and Computing, November 2002.
33. Thomas Sterling and Larry Bergman. A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 286–293, Rhodes, Greece, June 20–25, 1999.
34. Shyamkumar Thoziyoor. PIM Lite: A VLSI prototype of a multithreaded processor-in-memory chip. M.S. thesis, University of Notre Dame, April 2004.
35. Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.
36. Hongil Yoon et al. A 4 Gb DDR SDRAM with gain-controlled pre-sensing and reference bitline calibration schemes in the twisted open bitline architecture. In International Solid-State Circuits Conference (ISSCC), pages 378–379, San Francisco, CA, February 2002.