Design and Implementation of the Multicore Architecture Teaching Experiment Platform
Qian Wang, Yuhua Tang, Zongbo Li and Jin Wang

Abstract— With the continuous improvement of integrated circuit technology, the replacement of single-core processors by low-power multicore architectures has become an inevitable trend. The emergence of multicore architectures means that the experiments in teaching courses must be updated. To meet this demand, this article reforms the teaching experiments of the computer architecture courses and completes the design and implementation of a multicore architecture experiment platform based on the Tianhe Sunshine development board. We first analyze the Beehive experiment platform, which is based on the XUPV5 development board. On this basis, and according to the hardware configuration of the Tianhe Sunshine development board, we design a single-core processor, a DDR controller, and a communication module that exchanges information between the PC and the board, completing a multicore architecture teaching experiment platform based on the message-passing model. Finally, we simulate and verify the functionality of the platform. The test results show that the designed modules perform their intended functions.

I. INTRODUCTION
For half a century, the development of integrated circuit technology has driven improvements in processor performance. To raise CPU speed further, we can no longer rely on the traditional single-core CPU. The multicore processor (chip multiprocessor, CMP) is a new architecture that arose from this urgent requirement [2]. A multicore processor integrates multiple cores on one chip, making the processor architecture hierarchical, modular, and parallel. Its appearance changes system structure design and drives a series of technical changes in programming models, compilers, runtime libraries, working sets, and operating systems. At the same time, the rapid development of computer technology increases society's demand for computer professionals. We must therefore reform the content of teaching and experiments to keep pace with the development of information technology and to improve students' theoretical and practical abilities [1]. We reform the computer hardware course series, and design and implement a multicore architecture teaching experiment platform on the Tianhe Sunshine development board.

This work is supported by the National Natural Science Foundation of China under Grant No. 60921062. Q. Wang, corresponding author, is with the State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410000, Hunan, PRC (phone: +8613739087185; email: [email protected]). Y. Tang, Z. Li and J. Wang are with the School of Computer, National University of Defense Technology, Changsha 410000, PRC (email: [email protected], [email protected], [email protected]).

The construction of the new experimental teaching platform proceeds as follows: we analyze Microsoft Research's Beehive experimental platform in depth and, after trimming and improving its design, use the Tianhe Sunshine development board to implement a multicore processor architecture experiment platform with the same functions.

The remainder of this paper is divided into four sections. Section II covers related work and mainly introduces the Beehive multicore architecture experimental platform developed by Microsoft Research. Section III describes the design and implementation of the experimental platform, which ports the Beehive platform to the Tianhe Sunshine development board. Section IV describes the validation of the multicore experiment platform. Section V summarizes the whole work.

II. RELATED WORK
In recent years, well-known universities and research institutions at home and abroad have carried out research on multicore processors (CMP) and used them in computer architecture experiments. Stanford University implemented the Hydra chip multiprocessor, which integrated four single-issue MIPS processor cores, each with a private level-one instruction cache and data cache. All cores shared an on-chip 1 MB level-two cache for communication and data sharing [4]. To solve on-chip scalability problems, the RAW project team led by Professor Agarwal at MIT proposed a fully distributed modular design; its first chip contained 16 processing units [5]. The MIT M-Machine multicore processor integrated three cores on chip, which communicated at the register level through a crossbar switch. Professor Andrew Birrell of Microsoft Research designed the Beehive multicore architecture experimental platform, which includes a number of RISC cores, each with a local I/O system. The processor is a fairly conventional 32-register RISC. It uses byte addressing for data accesses and word addressing for instructions. Data and instruction accesses can address 8 GB of DDR2 DRAM memory. The processors are connected to each other and to the memory by a token ring interconnect, shown in Fig. 1. The interconnect carries a stream of 32-bit data items plus a 4-bit SlotType field and a 4-bit SrcDest field indicating the source or destination of an operation. The ring addresses one of the most serious limitations of an FPGA many-core design: routing congestion. With a ring, all inter-core wiring is local and relatively short, so it is possible to instantiate a large number of cores. The processor is small enough that a single FPGA can hold 16 to 32 cores, depending on the number and complexity of other high-performance I/O devices. Processors control these devices by exchanging messages with them, and the devices can perform DMA directly to and from memory [8].
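To make the ring slot format concrete, the following minimal Python sketch packs one slot (32-bit data, 4-bit SlotType, 4-bit SrcDest) into a 40-bit word and unpacks it again. The field widths come from the description above; the bit positions, the `RingSlot` class and its method names are assumptions made for illustration and are not taken from the Beehive design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingSlot:
    """One item travelling on the ring (hypothetical model)."""
    slot_type: int  # 4-bit SlotType field
    src_dest: int   # 4-bit SrcDest field
    data: int       # 32-bit data item

    def __post_init__(self):
        if not 0 <= self.slot_type < 16:
            raise ValueError("slot_type must fit in 4 bits")
        if not 0 <= self.src_dest < 16:
            raise ValueError("src_dest must fit in 4 bits")
        if not 0 <= self.data < 2 ** 32:
            raise ValueError("data must fit in 32 bits")

    def pack(self) -> int:
        # Assumed layout: [39:36] SlotType, [35:32] SrcDest, [31:0] data.
        return (self.slot_type << 36) | (self.src_dest << 32) | self.data

    @classmethod
    def unpack(cls, word: int) -> "RingSlot":
        return cls((word >> 36) & 0xF, (word >> 32) & 0xF, word & 0xFFFFFFFF)
```

With this layout, a slot with SlotType 3, SrcDest 5 and data 0xDEADBEEF packs to the 40-bit value 0x35DEADBEEF.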

Fig. 1. The Ring Interconnect

III. IMPLEMENTATION AND DESIGN OF THE EXPERIMENT PLATFORM
A. Platform Structure Framework
In this paper, the multicore architecture platform is designed around the capabilities of the target development board. The original modules of the Beehive platform have been improved and trimmed. The work is as follows:
1) Keep the original data path of the single-core processor unchanged and add a debug module. With this module, the user can control program execution in real time, using the main core to decide where to place the code and which core executes it.
2) Keep the multicore interconnect structure unchanged and redesign the DDR controller. Because the hardware configuration of the target board is quite different from that of the XUPV5, we must redesign the DDR controller to provide the same function, with some performance enhancements.
3) Remove the Ethernet control module and add a communication module to download user programs and monitor program execution. Because serial communication is simple and easy to use, this design uses it to provide the functions of the original Ethernet module. We use the communication module to monitor the entire hardware platform. In addition, its simple communication protocol makes it easy for students to use and gives them a better understanding of how a multicore program executes.

B. Implementation and Design of Single Core
We keep the overall data path of the original single-core processor unchanged and add a debug module. The instruction format of the single-core processor is shown in Fig. 2. All instructions are 32 bits wide, and the instruction format is modeled after the MIPS instruction format. The main function of the debug module is to let core 1 (the main core) control and debug the other cores (the slave cores). The debug module is located at I/O position 7 of each single-core processor, and the main core controls the slave cores through it [9]. Every single-core processor includes a debug module, although the main core's own debug module is almost unused. The debug module connection ports are shown in Fig. 3.

Fig. 3. Debug Module Connection Ports

Fig. 2. Single Core Instruction Set Architecture

The debug module contains three registers: savedPC and savedLink are 32-bit registers, and running is a 1-bit register. When a break occurs, savedPC and savedLink are loaded. When j7valid is set, the debug module uses Rb[2:0] to select the debugging operation:
Rb[2:0] = 0: no operation;
Rb[2:0] = 1: the address of this core's savedarea is loaded into link. Each core has a reserved memory area in which to save its registers, PC, and link, together with the number of items in its read queue and the items themselves, when the core is stopped. The first byte of the savedarea is at address 0x4000 + 512 * N;
Rb[2:0] = 2: savedPC is loaded into link;
Rb[2:0] = 3: savedLink is loaded into link;
Rb[2:0] = 4: link is set to 0 if the read queue is empty and to 1 otherwise, which allows the read queue to be tested for emptiness;
Rb[2:0] = 5: link is set to 1 if running is 1 and to 0 otherwise;
Rb[2:0] = 6: indicates that the current instruction is a breakpoint instruction.
The main core sends control information to the slave cores. Type 2 control information indicates kill (the slave core stops immediately and empties its address queue and write queue), type 1 indicates stop (the slave core completes the next instruction and then stops), and type 0 indicates start. When the system resets, or when a slave core encounters a breakpoint instruction or receives stop or kill control information, the slave core saves its working state and sends a stopped message to the main core. The slave core then waits until it receives the running information, at which point it restores its state.
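The Rb[2:0] decoding and the control types described above can be sketched as a small behavioural model. This is only an illustration in Python, not the hardware description: the `DebugModule` class, its method names, the reading of N as the core number, and clearing `running` on a break are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

SAVE_AREA_BASE = 0x4000
SAVE_AREA_SIZE = 512


class Control(IntEnum):
    START = 0
    STOP = 1
    KILL = 2


def save_area_address(core: int) -> int:
    # Assumption: N in "0x4000 + 512 * N" is the core number.
    return SAVE_AREA_BASE + SAVE_AREA_SIZE * core


@dataclass
class DebugModule:
    """Behavioural sketch of one core's debug module (hypothetical names)."""
    core: int
    saved_pc: int = 0
    saved_link: int = 0
    running: bool = False
    read_queue_len: int = 0

    def on_break(self, pc: int, link: int) -> None:
        # savedPC and savedLink are loaded when a break occurs.
        self.saved_pc = pc
        self.saved_link = link
        self.running = False  # assumption: a break stops the core

    def on_control(self, ctrl: int) -> None:
        # Only the running flag is modelled; queue flushing on KILL is not.
        self.running = Control(ctrl) is Control.START

    def link_value(self, rb: int) -> Optional[int]:
        """Value loaded into link for Rb[2:0], or None if link is untouched."""
        op = rb & 0b111
        if op == 1:
            return save_area_address(self.core)
        if op == 2:
            return self.saved_pc
        if op == 3:
            return self.saved_link
        if op == 4:
            return 0 if self.read_queue_len == 0 else 1
        if op == 5:
            return 1 if self.running else 0
        return None  # 0: no operation, 6: breakpoint marker, 7: not described
```

For example, under the stated assumption about N, the save area of core 5 would begin at 0x4000 + 512 * 5 = 0x4A00.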

C. Implementation and Design of Memory Controller
1) DDR SDRAM controller logic structure: The structure of the DDR SDRAM controller is shown in Fig. 4. The main modules are the state machine control module, the refresh control module, the clock control module, the DDR configuration register module, the DDR read data channel, and the DDR write data channel. The DDR state machine module is the control core; it issues the DDR SDRAM [11] operation commands and handles state switching. The refresh module times the refresh interval and promptly notifies the control core to perform an SDRAM refresh. The DDR configuration module provides the user interface through which the controller can be configured as needed, including the refresh interval, the DDR SDRAM timing parameters, and the DQS delay.

Fig. 4. The DDR Controller Block Diagram

2) DDR SDRAM controller state machine design: The state machine is the core of the whole controller design. It covers DDR SDRAM initialization, working mode switching, read and write commands, and other operations. To simplify the state machine and improve its execution efficiency, this design provides the following functions: DDR SDRAM initialization, variable-length burst read and write, automatic refresh, precharge, and mode register reset. Fig. 5 shows the controller state transition diagram.

Fig. 5. The Controller State Transition Diagram
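The command sequencing that such a state machine enforces can be modelled in a few lines. The sketch below is a behavioural Python model rather than the controller's HDL, and it covers only the rules stated in this section: the SDRAM must be initialized after reset, a row must be activated before a burst, bursts are issued without precharge, and a burst termination is followed by an automatic precharge. The state names, the `DdrModel` class and its methods are hypothetical, and permitting refresh only while no row is open is an assumption.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # after reset, before initialization
    READY = auto()       # initialized, no row open
    ROW_ACTIVE = auto()  # a row has been activated


class DdrModel:
    """Behavioural model of the command sequencing (hypothetical names)."""

    def __init__(self):
        self.state = State.IDLE
        self.open_row = None
        self.commands = []  # commands issued to the SDRAM, in order

    def _require(self, expected):
        if self.state is not expected:
            raise RuntimeError(f"illegal in state {self.state.name}")

    def initialize(self):
        self._require(State.IDLE)
        self.commands.append("INIT")
        self.state = State.READY

    def activate(self, row):
        self._require(State.READY)
        self.commands.append(f"ACTIVE row={row}")
        self.open_row = row
        self.state = State.ROW_ACTIVE

    def burst(self, kind, length):
        # Reads and writes are issued without precharge,
        # so the row stays open after the burst.
        self._require(State.ROW_ACTIVE)
        if kind not in ("READ", "WRITE"):
            raise ValueError("kind must be 'READ' or 'WRITE'")
        self.commands.append(f"{kind} len={length}")

    def burst_terminate(self):
        # A burst termination is followed by an automatic precharge
        # that closes the open row.
        self._require(State.ROW_ACTIVE)
        self.commands += ["BURST_TERMINATE", "PRECHARGE"]
        self.open_row = None
        self.state = State.READY

    def refresh(self):
        # Assumption: auto refresh is issued only while no row is open.
        self._require(State.READY)
        self.commands.append("AUTO_REFRESH")
```

Keeping the row open between bursts is what lets consecutive accesses to the same row skip the activate and precharge steps.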

After the system is reset, the DDR SDRAM is in the idle state (Idle) and must be initialized before any read or write operation. Before a read (write) command, the controller must activate (Active) the row to be read (written); only then can the burst read (write) be performed. In this controller design, all read and write commands are issued without precharge, so an activated row remains active until the user sends a burst termination command, at which point the controller automatically generates a precharge command to close the current row. This improves the system's data throughput. The controller also provides an automatic refresh counter.

3) DDR controller data path design: The read and write data channels of the DDR controller directly affect the stability of the data being read and written. DDR SDRAM read and write timing uses both the rising and the falling edge of the same clock, so timing correctness is difficult to guarantee. Our design therefore optimizes the data channels to improve stability.

Read Data Channel: Fig. 6 shows the read data channel structure. rd_data is the 16-bit data read from the DDR SDRAM on the DQ bus. We use dqs_dl (DQS delayed by a 90-degree phase shift) as the sampling clock and sample data on both its rising and falling edges. The data captured on the rising edge is stored in rd_data_u, and the data captured on the falling edge is stored in rd_data_d. These form the low 16 bits and the high 16 bits, respectively, of the new 32-bit word data_rd, which is output synchronously by the system.

Fig. 6. Read Data Channel Structure

Write Data Channel: The write data channel splits the 32-bit data delivered by the system into a low 16 bits and a high 16 bits and sends them to the 16-bit DDR SDRAM data bus. The design is shown in Fig. 7. First, the 32-bit system data wr_data is divided into wr_data_l (low 16 bits) and wr_data_h (high 16 bits), which are fed into a 2-to-1 MUX that uses the system clock as its select signal. The MUX output is registered with the internal clock clk_2x to keep the data stable, producing ddr_wr_data, which is edge-aligned with DQS. The data is captured on the falling edge of clk_2x so that the DQ signal is delayed by a 90-degree phase shift, which meets the timing requirements.
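Ignoring all clocking and phase alignment, the data transformation performed by the two channels is a split and a recombination of 16-bit halves. The Python sketch below shows only that arithmetic; the function names are hypothetical, and the choice to treat the low half as the first beat on the write bus is an assumption made so that it mirrors the read channel.

```python
from typing import Tuple

MASK16 = 0xFFFF


def split_write_word(wr_data: int) -> Tuple[int, int]:
    """Split a 32-bit system word into (wr_data_l, wr_data_h)."""
    wr_data_l = wr_data & MASK16          # low 16 bits
    wr_data_h = (wr_data >> 16) & MASK16  # high 16 bits
    return wr_data_l, wr_data_h


def combine_read_halves(rd_data_u: int, rd_data_d: int) -> int:
    """Rebuild the 32-bit word from the two 16-bit samples.

    rd_data_u (rising-edge sample) supplies the low 16 bits and
    rd_data_d (falling-edge sample) supplies the high 16 bits.
    """
    return ((rd_data_d & MASK16) << 16) | (rd_data_u & MASK16)
```

Writing a word and reading it back through these two functions returns the original value, which is the property the simulation in Section IV checks at the signal level.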

Fig. 7. Write Data Channel Structure

D. Implementation and Design of Communication Controller
In this paper, a PicoBlaze soft-core processor uses serial communication to download user programs and to monitor register values in the multicore system.

1) Communication module hardware implementation: By default, serial control resides in the KCPSM3 of the communication module. The PC downloads programs by sending commands to the communication module, and can also obtain data values from the multicore system through it. The interaction between the experimental board and the PC proceeds as follows. The bit file is downloaded to the experimental board and the system is reset. By default, RS232 control is in KCPSM3 [3]. First, the communication module gates the clock signals of all cores through the clock control module. After the download succeeds, it supplies the clock to each core. The DDR module is initialized and returns a success signal to the communication module, and the user at the PC sees the corresponding feedback. The communication module then waits passively for user commands. By sending different commands, the user can download user programs and observe the register values of each core while it runs.

In this design, the communication module consists of a soft processor module, a serial port module, and a clock control module. The serial port module [10] is built from macro modules. The clock control module is relatively simple: while the system is reset, it does not drive the clocks of the cores. When a user program download command is executed and the memory controller signals that the program has been written to the DDR successfully, the system clock is applied to the cores so that they begin executing. The soft processor structure is shown in Fig. 8 and the communication module structure is shown in Fig. 9.

Fig. 8. Soft Processor Architecture

Fig. 9. Communication Module Structure Diagram

E. Communication Protocol Design
The communication protocol ensures correct communication between the PC and KCPSM3.
1) Information/data packet format: The protocol uses the packet as the unit of information transmission. The packet format is shown in Table I.

TABLE I
INFORMATION/DATA PACKET FORMAT

Byte 1    Byte 2    Byte 3    Byte 4 to Byte (n-2)    Byte (n-1)    Byte n
Header    Length    Ctrl      Data                    CRC-8         Tail

<Header>: packet header flag; marks the start of a packet.
<Length>: packet length; the combined length in bytes of the control word <Ctrl>, the data segment <Data>, and the check code <CRC-8>.
<Ctrl>: control word; identifies the type of the packet and what data follows.

<Data>: data segment; may consist of more than one byte, or may be empty if no data needs to be transmitted.
<CRC-8>: 8-bit CRC check code, computed over the control word <Ctrl> and the data segment <Data>. The generator polynomial is CRC8 = X^8 + X^5 + X^4 + 1 (CCITT standard).
<Tail>: packet tail flag; marks the end of the packet.

During transmission, the PC is the master device and KCPSM3 is the slave device. The flow by which the slave device receives a data packet is shown in Fig. 10. For each packet received from the master device, the slave returns an ACK if reception succeeds (the CRC check in the flowchart passes), or a NAK containing the error information if reception fails. ACK and NAK packets are tagged by specific control words <Ctrl>. This version uses control word 0x00 for the NAK packet, whose data segment is a one-byte error type <error num>. Several typical error types are:
<error num> = 01: UART FIFO overflow (reduce the baud rate)
<error num> = 02: PicoBlaze SPM overflow (packet length error)
<error num> = 03: CRC checksum error
<error num> = 04: packet tail byte error
The list can be extended for specific applications. <error num> is one byte and therefore supports 256 different error codes. In an ACK packet, the control word equals the control word of the packet being acknowledged. After sending a packet, the master device waits for a response. When the response arrives, the master compares its control word with that of the packet it sent; if the two match, the master knows that the response is the ACK for the previously transmitted packet and that the slave device received the packet successfully. Table II shows the specific formats of the NAK and ACK packets.

Fig. 10. Slave Device Receives the Data Packet Flow Chart

IV. RESULT
A. Construction of Experimental Environment
Hardware equipment: PC.
Software: Xilinx ISE 12.1, serial debugging tools, and simulation software such as Modelsim.
We used software simulation tools to validate the functions of the experimental board.

B. Simulation and Verification of Single Core
The debug module of each single core mainly gives core 1 control over the other cores. In the test file, the main core debugs core 5. First, the main core asserts the debug start signal ctrlValid and then sends the start control information to core 5, putting it into the running state. It then issues the stop control information, and the current PC is saved. The test result is shown in Fig. 11.

Fig. 11. Test Result

TABLE II
NAK AND ACK PACKET SPECIFIC FORMAT

Packet    Header      Length    Ctrl                                 Data           CRC-8    Tail
NAK       <Header>    03        00                                   <error num>    CRC-8    <Tail>
ACK       <Header>    02        corresponding packet control word    None           CRC-8    <Tail>
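The packet layout of Table I and the NAK/ACK formats of Table II can be sketched in Python as follows. The paper does not give the header and tail byte values, the CRC bit order or its initial value, nor the order of the checks in Fig. 10, so `HEADER`, `TAIL`, the MSB-first CRC with a zero initial value, and the check order in `slave_reply` are all assumptions; the function names are hypothetical as well. The generator polynomial X^8 + X^5 + X^4 + 1 corresponds to the constant 0x31.

```python
HEADER = 0xAA    # hypothetical header flag value
TAIL = 0x55      # hypothetical tail flag value
NAK_CTRL = 0x00  # control word of a NAK packet
ERR_CRC = 0x03   # <error num> for a CRC checksum error
ERR_TAIL = 0x04  # <error num> for a tail byte error
POLY = 0x31      # X^8 + X^5 + X^4 + 1, with the X^8 term implicit


def crc8(data: bytes) -> int:
    """Bitwise CRC-8, MSB first, initial value 0 (both assumed)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def build_packet(ctrl: int, data: bytes = b"") -> bytes:
    """Header, Length, Ctrl, Data, CRC-8, Tail as in Table I."""
    body = bytes([ctrl]) + data
    length = len(body) + 1  # Ctrl + Data + CRC-8, in bytes
    return bytes([HEADER, length]) + body + bytes([crc8(body), TAIL])


def slave_reply(packet: bytes) -> bytes:
    """Reply of the slave to a well-framed packet: ACK or NAK."""
    length = packet[1]
    body = packet[2:1 + length]  # Ctrl and Data
    received_crc = packet[1 + length]
    tail = packet[2 + length]
    if tail != TAIL:
        return build_packet(NAK_CTRL, bytes([ERR_TAIL]))
    if received_crc != crc8(body):
        return build_packet(NAK_CTRL, bytes([ERR_CRC]))
    return build_packet(body[0])  # ACK echoes the control word
```

With these definitions, an ACK carries a Length of 02 and a NAK a Length of 03, matching Table II. A master would compare the control word of the reply (the third byte) with the one it sent to decide whether the packet was acknowledged.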

C. Simulation and Verification of Memory Controller
The main function of the DDR controller is to ensure correctly sequenced access to the off-chip DDR SDRAM and accurate data capture. Because the hardware environment is not yet available, we test with the DDR simulation model developed by Micron (available for download from Micron's official website), which is designed for testing the function of a completed DDR controller. The result is shown in Fig. 12.

Fig. 12. DDR Controller Test Result

Fig. 13. Write Data to SDRAM

Fig. 14. Read Result

First, data is written to the memory. The burst read length is set to 4, with 4 banks activated; after a read command is received, 4 data items are read consecutively. The write and read results are printed in the Modelsim monitor window, as shown in Fig. 13 and Fig. 14, and the test passes.

D. Simulation and Verification of Communication Module
Because the hardware environment is not available, we test only the serial communication module.
1) Simulation and verification of serial port module: The en_16_x_baud signal is asserted when baud_count reaches 324; otherwise baud_count is incremented. The data buffer contains the data 11110000. With simulated control signals, the test result is shown in Fig. 15. As shown in Fig. 15, the serial port sends the correct data to the transmission module and the serial output signal returns to high.

Fig. 15. Serial Port Test Result

V. CONCLUSION
In this paper, we design a multicore experimental teaching platform that provides the same functions as the Beehive platform. It keeps the original data path of the Beehive single core unchanged, adds a debug module and a communication module, and redesigns the DDR controller. Using the platform in teaching practice can help students understand multicore deeply, design multicore system structures, and improve their parallel programming ability. Students can design different parallel programs according to their own circumstances, and can also improve the design of the platform using their own knowledge of multicore. In future work, we will add a friendly GUI for operating programs on the hardware platform and implement cache coherence with a hardware protocol. Because the Tianhe Sunshine development board has not yet been completed, the multicore design is still at the simulation stage, in which software is used to verify the functions of each module. Once the hardware board is ready, we will finish hardware debugging and develop applications on the Tianhe Sunshine experiment board.

REFERENCES
[1] C. Zhang, Z. Wang and C. Zhang, Computer Architecture, Second Edition. Beijing: Higher Education Press, 2005, pp. 133-145.
[2] J. Lu, Research and implementation of key technologies on CMP, Ph.D. dissertation, National University of Defense Science and Technology, Changsha, Hunan, PRC, 2002.
[3] J. Li, Multicore processor design technology research, Ph.D. thesis, Harbin Engineering University, Harbin, Heilongjiang, PRC, 2010.
[4] Z. Xie, Multiprocessor communication technology research, M.S. thesis, Electronic Science and Technology University, Chengdu, Sichuan, PRC, 2009.
[5] Y. Wang, Multicore processor hardware implementation, M.S. thesis, Xi'an Electronic and Science University, Xi'an, Shanxi, PRC, 2010.
[6] X. Song and X. Wu, "Multiprocessor communication mechanism design," Journal of Zhejiang University of Technology, vol. 4, no. 38, pp. 426-429, 2010.
[7] R. Dong and Y. Zhu, "Parallel programming model research and development," Journal of Computer Technology and Development, vol. 1, no. 21, pp. 92-99, 2011.
[8] C. Hu and X. Wang, "Based on multicore cluster system for parallel programming model research," Journal of Computer Technology and Development, vol. 4, no. 18, pp. 70-73, 2008.
[9] W. Chen and W. Wu, Quinn M. J., MPI & OpenMP Parallel Program Design (C Edition). Beijing: Tsinghua University Press, 2001, pp. 212-245.
[10] K. Chapman. (2003, Jan). UART Transmitter and Receiver Macros [Online]. Available: http://www.dsi.fceia.unr.edu.ar/downloads/DDA/UART_Manual.pdf
[11] G. F. Pfister, D. A. George and S. L. Harvey. (1985). The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture [Online]. Available: http://web.mit.edu/6.173/www/currentsemester/readings/R03-ibm-rp3-1985.pdf
[12] C. L. Seitz, "The Cosmic Cube," Communications of the ACM, vol. 1, no. 28, pp. 22-33, 1985.
[13] K. Chapman. (2003). KCPSM3 8-bit Micro Controller for Spartan-3, Virtex-II and Virtex-IIPRO [Online]. Available: http://www.eng.auburn.edu/~strouce/class/elec4200/KCPSM3_Manual.pdf
[14] L. Hammond, B. A. Nayfeh and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer Special Issue on Billion-Transistor Processors, vol. 30, no. 9, pp. 79-85, Sep. 1997.
