Kishore Pinninti et al. / International Journal of Engineering Science and Technology (IJEST)

IMPLEMENTATION OF DUAL-CORE MULTITHREADED PROCESSOR ON XILINX SPARTAN-III FPGA KISHORE PINNINTI* Department of Electronics and Communication Engineering, Anil Neerukonda Institute of Technology and Sciences, Sangivalasa, Bheemili (Mandal), Visakhapatnam, Andhra Pradesh, India [email protected]

P.V. SRIDEVI Department of Electronics and Communication Engineering, Andhra University College of Engineering(A),Visakhapatnam ,Andhra Pradesh, India [email protected]

Abstract: Technology is advancing at a rapid pace, and the need for smaller and faster systems continues to grow exponentially. The demand for designing and implementing new systems with a short time to market has driven the evolution of digital system design. The speed of a system depends largely on the speed of its core processing unit, and raising processor speed becomes a bottleneck for the designer when multitasking must be realized sequentially; the major difficulty is the synchronization of data operations. This paper aims at the realization of a fast 5-stage dual-core pipelined microprocessor for enhancing the operational efficiency of a system. The processor is implemented with 5-stage fine-grained parallelism controlled using the multithreading concept. The design is implemented in VHDL and simulated using the ModelSim tool for functional verification. Synthesis, placement and routing, and floor planning of the implemented design are carried out on a XILINX Spartan-III FPGA.

Key words: Dual-core processor, multithread, FPGA

1. INTRODUCTION

A dual-core CPU combines two independent processors, with their respective caches and cache controllers, onto a single silicon chip, or integrated circuit. IBM's POWER4 was the first microprocessor to incorporate two cores on a single die. Various dual-core CPUs were developed by companies such as Motorola, Intel, and AMD, and appeared in consumer products in 2005. Dual-core CPU technology first became a practical viability in 2001, as 180 nm CMOS process technology became feasible for volume production. At this feature size, multiple copies of the largest microprocessor architectures could be incorporated onto a single production die [1].
(Alternative uses of this newly available "real estate" include widening the bus and internal registers of existing CPU cores, or incorporating more high-speed cache memory on-chip.) The dual-core processor falls into the architectural class of tightly coupled processors. In this class, each processing unit, with an independent instruction stream, executes code from a pool of shared memory. Contention for the memory as a resource is managed by arbitration and by per-processing-unit caches. The localized caches make the architecture viable, since modern CPUs are highly optimized to maximize bandwidth to the memory interface; without them, each CPU would run near 50% efficiency. Access to the same resource from multiple caches must be managed with a cache coherency protocol [1-4]. Beyond dual-core processors, there are examples of chips with more cores. Such chips include network processors, which may have a large number of cores or micro-engines that operate independently on different packet-processing tasks within a networking application [6].
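As an illustrative aside (not part of the paper's VHDL design; all class and variable names below are our own), the invariant that a coherency protocol enforces can be sketched in Python with a toy write-invalidate scheme over two private caches and one shared memory:

```python
# Toy write-invalidate coherence: two private caches over one shared memory.
# All names here are illustrative assumptions, not the paper's design.

class CoherentCache:
    def __init__(self, memory, peers):
        self.memory = memory      # shared backing store (dict: addr -> value)
        self.lines = {}           # private cache: addr -> value
        self.peers = peers        # other caches to invalidate on writes

    def read(self, addr):
        if addr not in self.lines:               # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        for peer in self.peers:                  # invalidate stale copies
            peer.lines.pop(addr, None)
        self.lines[addr] = value                 # write-through to memory
        self.memory[addr] = value

memory = {0: 7}
c0, c1 = CoherentCache(memory, []), CoherentCache(memory, [])
c0.peers.append(c1); c1.peers.append(c0)
c1.read(0)          # c1 caches the old value 7
c0.write(0, 42)     # c0's write invalidates c1's stale copy
print(c1.read(0))   # c1 misses and re-fetches: prints 42
```

Real protocols such as MESI track per-line states instead of simply invalidating, but the invariant is the same: a write must remove stale copies before it becomes visible.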

ISSN : 0975-5462

Vol. 3 No.12 December 2011



2. DUAL-CORE PROCESSOR EVOLUTION

Computer processor design has evolved at a constant pace for the last 20 years. The proliferation of computers into the mass market, and the tasks we ask of them, continue to push the need for more powerful processors [7-8]. The market requirement for higher-performing processors is linked to the demand for more sophisticated software applications. E-mail, for instance, which is now used globally, was a limited and expensive technology only 10 years ago. Today, software applications span everything from helping large corporations better manage and protect their business-critical data and networks to allowing home PCs to edit videos, manipulate digital photographs, and burn downloaded music to CDs. Tomorrow, software applications might create real-world simulations so vivid that it will be difficult for people to know whether they are looking at a computer monitor or out the window; however, advancements like this will only come with significant performance increases from readily available and inexpensive computer technologies. The computing industry has planned for these advancements since the late 1990s, when the current processor architectures were first announced [9-10]. Various organizations have proposed processor architectures designed from the ground up to accommodate multiple cores on a single processor. A dual-core design offers enhanced overall system performance as well as a sophisticated platform to better tackle today's more complex software applications. Dual-core processors offer an immediate and cost-effective technology for solving today's processor design challenges, alleviating the by-products of heat and power consumption that arise when single-core processor frequency, or 'clock speed', is continually advanced [12].

2.1. Parallelism in Processor

Throughput is the number of operations processed by a circuit in a given time (usually per clock period).
The concept of throughput applies to any type of circuit, and we use Hennessy and Patterson's academic DLX microprocessor [11] as an example throughout this paper. For the throughput of microprocessors, the operations that we are concerned with are instructions. Pipelining [3] is commonly used to optimize throughput by partitioning the function of the circuit into stages, so that multiple instructions can be processed concurrently.

3. DESIGN APPROACH

The CPU designed and implemented is a typical dual-core machine following a simple load/store architecture and executing simple instructions. The processor uses a four-stage pipeline consisting of instruction fetch, data fetch, execute, and store (data memory operations and write-back) stages. The pipeline is designed to eliminate all data dependencies between instructions by forwarding results or by writing into the register file, which has two read ports and one write port; this ensures a logical split of functionality. Both the instruction memory (I-cache) and the data memory (D-cache) units are direct mapped; the I-cache prefetches instructions and ensures that the CPU is never starved of instructions to execute. Control signals are generated using hardwired logic for groups of instructions or particular cases; for the generation of the control signals, the instruction set was thoroughly analyzed and minimized. Our processor has fixed-point execution units: the fixed-point ALU is responsible for performing all fixed-point (integer) arithmetic, logic, shift, and rotate instructions. Load and store instructions are also handled in this processor. The data bus is one word wide.
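The overlap that pipelining provides can be illustrated with a toy cycle-level model (a Python sketch under our own simplifying assumptions: no stalls or dependencies, one instruction issued per cycle; the stage names follow the four stages described above):

```python
# Toy model of a 4-stage pipeline: fetch -> data fetch -> execute -> store.
# Stage names are illustrative; real hazards and stalls are ignored.
STAGES = ["fetch", "data_fetch", "execute", "store"]

def run_pipeline(instructions):
    """Return the number of cycles needed to retire all instructions."""
    pending = list(instructions)
    n = len(pending)
    pipeline = [None] * len(STAGES)       # one slot per stage
    cycles = retired = 0
    while retired < n:
        cycles += 1
        pipeline = [None] + pipeline[:-1]  # every stage advances one step
        if pending:
            pipeline[0] = pending.pop(0)   # fetch the next instruction
        if pipeline[-1] is not None:       # store stage completes this cycle
            retired += 1
    return cycles

# 10 instructions: 4 cycles to fill the pipeline, then one retires per cycle,
# versus 40 cycles if each instruction ran through all stages sequentially.
print(run_pipeline(range(10)))   # prints 13
```

The general pattern is cycles = stages + instructions - 1, which is why throughput approaches one instruction per cycle once the pipeline is full.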


Fig. 1. Dual-core processor architecture

3.1. Description

Fig. 1 shows the architecture of the dual-core processor. To read an instruction, the contents of the program counter (PC) are transferred to the address lines. This is done when the fetch signal is high and the address multiplexer selects the contents of the PC to be loaded onto the address bus. As soon as the contents of the program counter are loaded onto the address bus, a memory read cycle is initiated: the instruction is read from the location pointed to by the address lines and the micro-instruction code is placed onto the data bus. The program counter is incremented to point to the next micro-instruction in the control memory. The data bus transfers the micro-instruction to the instruction register. The instruction register has two fields, in two different formats, namely:

Opcode, Data Operand
Opcode, Address of Data Operand
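As a toy illustration of the two formats (opcode names, field layout, and the accumulator model below are our assumptions, not the paper's actual encoding):

```python
# Toy accumulator machine with the two instruction formats above:
#   ("ADDI", value)  - opcode with an immediate data operand (format 1)
#   ("ADDM", addr)   - opcode with the address of the data operand (format 2)
# Opcode names and memory layout are illustrative assumptions.

def execute(instr, acc, memory):
    opcode, field = instr
    if opcode == "ADDI":          # format 1: operand is carried in the instruction
        return acc + field
    if opcode == "ADDM":          # format 2: a memory read cycle fetches the operand
        return acc + memory[field]
    raise ValueError(f"unknown opcode {opcode}")

memory = {0x10: 5}
acc = 0
acc = execute(("ADDI", 3), acc, memory)     # acc = 3
acc = execute(("ADDM", 0x10), acc, memory)  # acc = 3 + mem[0x10] = 8
print(acc)   # prints 8
```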

In the first case, the opcode is given to the ALU and the control unit for decoding, and a series of micro-operations is generated. The data operand is loaded onto the data bus and transferred to the ALU for the respective micro-operations specified by its opcode. In the second case, the address of the data operand is loaded onto the address bus (as the fetch signal is low, the multiplexer loads the IR's address contents onto the address lines) and a memory read cycle is initiated. The main-memory location specified by the address lines is read, the data is transferred onto the data bus, and it is then given to the ALU to undergo the operations specified by the opcode. The results of the ALU are stored in the accumulator. Data operations may combine the memory contents with the accumulator, the result being transferred back to the accumulator. The NOR gate outputs high whenever all of its inputs are low, and remains low otherwise. It is attached to a tri-state buffer: when the tri-state buffer is enabled, the data from the ALU is fed to the data bus attached to the memory, allowing the data to be stored into the memory; when it is disabled, the data is given to the accumulator and cut off from being written onto the data bus.

3.1.1. Thread Selection

First, we need to describe the thread selection algorithm we used. One of the schemes proposed by Tullsen in [2] suggests giving lowest priority to the thread that has instructions closest to the head of a queue, the oldest instruction being at the head. This policy is based on the assumption that fast-running threads will not leave their instructions for long in the queues, and vice versa. This algorithm (called IQPOSN) was always second best, within 4% of ICOUNT, in his study. We used a slightly modified version of this algorithm, which gives highest priority to the thread with the least instructions after the instruction closest to the head in all queues.
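The modified policy can be sketched as follows (a Python illustration; the queue representation and thread ids are our assumptions, not the hardware implementation):

```python
# Sketch of the modified selection policy: for each thread, sum over all
# queues (total instructions - position of the thread's oldest instruction);
# the thread with the smallest sum gets the highest priority.
# Queues are modeled as lists of thread ids with the head at index 0.

def priority_key(thread, queues):
    score = 0
    for q in queues:
        if thread in q:
            pos = q.index(thread)    # distance of the oldest instr from the head
            score += len(q) - pos    # instructions at or behind that position
    return score

def select_threads(threads, queues, n=2):
    """Pick the n threads with the smallest score (highest priority)."""
    return sorted(threads, key=lambda t: priority_key(t, queues))[:n]

# Two queues (e.g. an integer queue and a load queue), entries are thread ids:
queues = [["A", "A", "B", "A"], ["B", "A"]]
print(select_threads(["A", "B"], queues))   # prints ['B', 'A']
```

Smaller scores win: a thread whose oldest instruction sits near the tail of short queues is presumed to be draining its work quickly and is given priority.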
This means that for each thread we find the number of instructions between the head of each queue and the first instruction of the thread (that is, the oldest instruction for the thread) and subtract it from the total number of instructions in the queue. We then add the results for all queues, and the thread with the smallest number has the highest priority. This scheme should be slightly better, since the number calculated for each thread gives some additional insight into the queue's population compared to IQPOSN; thus, we term this scheme IQCOUNT. The "queues" involved in our model are the IRS, the FPRS, the LQ and the SQ. After assigning priorities to threads, we select the two active threads with the highest priority. A thread is removed from the active list if it misses in the instruction cache, if it is stalled and the condition is not relieved in the next cycle, or if it is a return (BTB hit) and the RAS has been popped more times than its size in a row. In addition to the above, the second thread is selected so that no bank conflict occurs.

4. RESULT ANALYSIS

Fig. 2. Routing of the design targeted to the Spartan-III FPGA.

Fig. 3. Logic block implementation in a CLB, taken from the design targeted to the Spartan-III FPGA.

Fig. 4. Floor plan of the design in the Spartan-III FPGA floor planner tool.


The dual-core Power5-style fine-grained multithreaded processor has been implemented on a XILINX Spartan-III FPGA; the target device is xc3s1600e-5fg484. The design used 125 slices, 181 BELs, 49 registers, and 200 4-input LUTs, and the reported maximum FPGA clock rate is 191.571 MHz.

4.1. Synthesis Report

Table 1. Synthesis Report of the processor on XILINX Spartan-III

Device utilization        Utilization          % of utilization
No. of slices             125 out of 4656      2.6%
No. of slice flip-flops   181 out of 9312      1.9%
No. of 4-input LUTs       200 out of 9000      2.2%
No. of bonded IOBs        66 out of 232        28%
No. of GCLKs              1 out of 24          4%
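The utilization percentages follow directly from used/available; as a quick arithmetic check (a Python sketch using the figures from Table 1; the last digit may differ slightly from the table due to rounding):

```python
# Sanity check of Table 1: % of utilization = used / available * 100.
# The (used, available) pairs are taken from the table above.
resources = {
    "slices":       (125, 4656),
    "flip_flops":   (181, 9312),
    "4_input_luts": (200, 9000),
    "bonded_iobs":  (66, 232),
    "gclks":        (1, 24),
}
percent = {name: used / avail * 100 for name, (used, avail) in resources.items()}
for name, pct in percent.items():
    print(f"{name}: {pct:.1f}%")
```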

5. CONCLUSION

In this work, we conducted research on ways of assigning priorities to threads in such a throughput-oriented environment. One privileged thread was used for this study, in order to assess the feasibility of this task. Previous research guided our approach to prioritization. The basic properties of the SMT environment were preserved in order to sustain the realizable throughput, which is its main advantage. Our initial approach, despite targeting interference and contention with the primary thread, had limited success, but mostly served to reveal a more appropriate one. A cache partitioning scheme was introduced in order to manage the effects of multiple threads sharing the processor. Interference in the first-level instruction and data caches was addressed by allowing sharing of only half of these caches, with full access reserved for the primary thread. We found that the advantage given to the primary thread was not fully exploited, but was in the right direction. A further modification of this scheme was introduced, providing a dedicated port to the part of the cache private to the primary thread, in order to address contention for fetch bandwidth. This modified design limited the primary thread latency increase to 14% with 2 threads, 36% with 4 threads, and 52% with 6 threads, whereas the corresponding increases in the fully shared model of Tullsen are 19%, 72%, and 121%. At the same time, total throughput was not compromised. Some cases of variance, and the goal of further limiting the primary thread latency increase, guided experiments on one additional model. This time a private cache is provided to the primary thread, in order to allow it the fetch bandwidth it would have in autonomous execution; in addition, the prioritization approach previously taken for the data cache is removed. The results show that we can limit the latency increase of the primary thread to 9% with 2 running threads, 18% with 4 running threads, and 24% when 6 threads are running.
At the same time, overall throughput is improved relative to the fully shared model, which we did not anticipate. For this last, most successful model, we did not use the partitioning scheme for the data cache: if partitioning the data cache does not provide much of a primary thread improvement, the data cache should be shared in whole. In this case, no coherence scheme is needed for our prioritization scheme, since no stale data can appear in the instruction cache. These results show that applying priorities in hardware in the throughput-oriented environment of an SMT processor is feasible without throughput loss. The operating system can easily provide fair behavior, and reduced response time if needed, by using this mechanism. On the other hand, because primary thread latency depends only weakly on how many threads are running on the SMT processor, an application requiring unaffected processing would lose only a small fraction of its speed in our final model. Further limiting of the primary thread latency increase with 4 and 6 running threads should be possible.


References

[1] Dean M. Tullsen, Susan J. Eggers, Henry M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 191-202, June 1995.
[2] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 191-202, May 1996.
[3] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading", ACM Transactions on Computer Systems, August 1997.
[4] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, Dean M. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors", IEEE Micro, pp. 12-18, September/October 1997.
[5] Jack L. Lo, "Exploiting Thread-Level Parallelism on Simultaneous Multithreaded Processors", Ph.D. thesis, http://www.cs.washington.edu/homes/jlo/papers/dissertation.ps, 1998.
[6] Sébastien Hily and André Seznec, "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading", Technical Report PI 1086, IRISA, February 1997 (appears in Proceedings of the MTEAC'98 Workshop, February 1998); Sébastien Hily and André Seznec, "Branch Prediction and Simultaneous Multithreading (V2)", Technical Report PI 997, IRISA, March 1996 (appears in Proceedings of PACT'96, Boston, October 1996).
[7] Tse-Yu Yeh and Yale N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction", 19th International Symposium on Computer Architecture, pp. 124-134, May 1992; Scott McFarling, "Combining Branch Predictors", Technical Report TN-36, Digital Western Research Laboratory (DEC WRL), June 1993.
[8] Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen, "Tuning Compiler Optimizations for Simultaneous Multithreading", Proceedings of the 30th Annual International Symposium on Microarchitecture, pp. 114-124, December 1997.
[9] Dean M. Tullsen, Jack L. Lo, Susan J. Eggers, and Henry M. Levy, "Supporting Fine-Grain Synchronization on a Simultaneous Multithreaded Processor", Proceedings of the 5th International Symposium on High Performance Computer Architecture, pp. 54-58, January 1999.
