Instruction Level Test Methodology for CPU Core Software-Based Self-Testing

Saeed Shamshiri, Hadi Esmaeilzadeh and Zainalabedin Navabi
Electrical and Computer Engineering Department, University of Tehran, Tehran, Iran
{shamshiri, [email protected], navabi@ece.neu.edu

Abstract- TIS (Test Instruction Set) [1] is an instruction level methodology for CPU core self-testing that enhances the instruction set of a CPU with test instructions. Since the functionality of test instructions is the same as that of the NOP instruction, NOP instructions can be replaced with test instructions so that online testing can be done with no performance penalty. TIS tests different parts of the CPU and detects stuck-at faults. This method can be employed in offline and online testing of all kinds of processors. A hardware-oriented implementation of TIS that tests only the combinational units of the processor was proposed previously [1]. The contributions of this paper are, first, a software-based approach that reduces the hardware overhead to a reasonable size and, second, testing the sequential parts of the processor in addition to the combinational parts. Both hardware and software oriented approaches are implemented on a pipelined CPU core and their area overheads are compared. To demonstrate the appropriateness of the TIS test technique, several programs are executed and fault coverage results are presented.

Keywords- instruction level testing, CPU core testing, software-based self testing, test instruction set, BIST, pipelined CPU

I. INTRODUCTION

In many SoCs, embedded processor cores are widely used because they offer several advantages over ASICs, including design reuse and portability. Core based design allows processors to be used in a variety of applications in a cost effective manner. On the other hand, design based on processor cores presents new challenges for testing, since access to these embedded processors becomes further removed from the pins of the chip [2].

Self-testing for high-speed circuits has clear advantages over testing through external testers. The tester's OTA (Overall Timing Accuracy) does not increase as fast as the on-chip clock speed, and this implies more yield loss [3]. One approach for realizing self-testing is running a test program on the processor which tests it by its own instructions. This pure software self-testing method has some disadvantages, including low fault coverage, a large program size which cannot fit in an on-chip memory, and long test time [4]. For self-testing of a microprocessor for either stuck-at or delay faults by test program generation, several approaches have been proposed [5][6][7][8][9][10][11]. Another proposed method is an instruction level DFT that adds instructions for improving the controllability and observability of processor cores for software based self-testing [4].

In the TIS method [1], test instructions are added and employed to test a processor core. This instruction level testing method can be used for both online and offline testing. In the offline testing phase, the only instructions that run in the CPU are test instructions. Therefore all combinational and sequential parts of the processor can be tested with a high level of fault coverage. In the online testing phase, test instructions are inserted in the machine code by the assembler or compiler in place of NOP instructions. This way, the combinational parts of the processor are tested while the processor performs its normal operation, without any performance penalty.

The TIS method [1] follows a unique approach for online and offline testing of processor cores. For testing the processor core, this method utilizes all the time that is wasted due to processor stalls after data, control and structural hazards or cache misses. The TIS method is appropriate for online testing of pipelined architectures. In a pipelined architecture, one or many NOP instructions are inserted as stalls between instructions which are data or control dependent.

For TIS realization, a hardware-oriented approach was previously proposed [1] that is based on the BIST architecture and employs LFSRs and MISRs for generating test vectors and compressing the results respectively. This paper proposes a software-oriented approach that decreases the hardware overhead of the previous method by removing the LFSRs and decreasing the number of MISRs. In this approach, test vectors are generated using pseudo random pattern generator software that is embedded in the processor's assembler. These random test vectors are appended to the test instructions. When executing a test program, test data are fetched as immediate data along with the test instructions.

Section II illustrates the software-oriented implementation of the TIS method and discusses the implementation framework and challenges. Experimental results are presented in Section III and the paper is concluded in Section IV.
0-7803-8714-7/04/$20.00 ©2004 IEEE
Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on May 12, 2009 at 20:16 from IEEE Xplore. Restrictions apply.
II. SOFTWARE ORIENTED IMPLEMENTATION OF TIS

We have implemented the TIS method on the PAYEH (Pipelined SAYEH) [1] processor. SAYEH [12] is a multicycle RISC CPU with a 16-bit data bus and a 16-bit address bus. PAYEH is a pipelined version of SAYEH with a similar instruction set. Table IV shows the instruction set of PAYEH. The PAYEH processor has 5 pipe stages, illustrated in Fig. 5. The main combinational components of PAYEH are a 16-bit adder and a control unit in the ID stage, and a 16-bit ALU and a branch unit in the EXE stage. In the PAYEH CPU, the instructions that need stalls are LDA (load addressed), BRZ (branch if zero), BRC (branch if carry), JPA (jump addressed) and JPR (jump relative). BRZ, BRC and JPA need two stalls while JPR and LDA need one stall.

In the software-oriented implementation of TIS, the assembler inserts test vectors (random or deterministic) with the test instruction opcode. These test vectors are fetched from the instruction memory at run time, immediately after fetching the test instruction opcode. When a test instruction enters a pipe stage, the test controller puts the corresponding combinational unit in test mode and applies the test vectors to it. Then the test results are collected and compressed with a MISR. For testing the sequential parts of the processor in addition to the combinational parts, the test results are also written to the corresponding part of the pipe register in the next clock cycle. Then, to validate the register, its output is compared with its input by some XOR gates.

PAYEH is a 16-bit processor and all of its instructions are 16 bits long. This 16-bit space may be insufficient for a test instruction. For example, a test instruction that is to test the 16-bit ALU of PAYEH requires 16 bits for each ALU input and 4 bits for the instruction opcode. These 36 bits require three memory words, so three clock cycles are required to load and execute such an instruction. Therefore the software-oriented implementation of TIS needs a controller to handle the running of these multi-cycle test instructions.

In this case study, test instructions that are one or two words long are preferred, since PAYEH instructions need at most two stalls. It is therefore better to define a separate test instruction for each combinational unit of PAYEH. For the test instruction dedicated to testing the control unit, the 16-bit space is sufficient, since the control unit does not have more than 12 inputs. The situation is the same for the branch unit, but different for the adder and the ALU. In the latter two cases the test instruction must be at least 36 bits long, containing the opcode bits and two 16-bit immediate test vectors. Each of these test instructions therefore needs three words of instruction memory and three memory cycles to execute. Replacing NOP instructions with these three-cycle instructions would affect the performance of a running program. To reduce these instructions to two words, a 36-bit instruction must be reduced by 4 bits. This leaves 28 bits for the test data fetched from the memory, so the two 16-bit test vectors share four of their bits.
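The two-word encoding can be illustrated with a minimal Python sketch. The text fixes only the sizes (a 4-bit opcode, 28 bits of test data, two 16-bit vectors sharing four bits); the bit positions used here (opcode in the top four bits, the first vector taken from the upper 16 data bits and the second from the lower 16) and the function names are assumptions for illustration, not the layout used in PAYEH.

```python
from typing import Tuple

OPCODE_BITS = 4   # opcode width stated in the text
DATA_BITS = 28    # test data left after shortening 36 bits to 32
VECTOR_BITS = 16  # width of each test vector


def pack_test_instruction(opcode: int, data: int) -> Tuple[int, int]:
    """Pack a 4-bit opcode and 28 bits of test data into two 16-bit words.

    Assumed layout: the opcode sits in the top 4 bits of the first word.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("test data must fit in 28 bits")
    combined = (opcode << DATA_BITS) | data
    return (combined >> 16) & 0xFFFF, combined & 0xFFFF


def unpack_test_vectors(word0: int, word1: int) -> Tuple[int, int, int]:
    """Recover the opcode and the two overlapping 16-bit test vectors."""
    combined = (word0 << 16) | word1
    opcode = combined >> DATA_BITS
    data = combined & ((1 << DATA_BITS) - 1)
    vector_a = data >> (DATA_BITS - VECTOR_BITS)  # upper 16 of the 28 data bits
    vector_b = data & ((1 << VECTOR_BITS) - 1)    # lower 16 of the 28 data bits
    return opcode, vector_a, vector_b


words = pack_test_instruction(0xA, 0x1234567)
opcode, vector_a, vector_b = unpack_test_vectors(*words)
```

With this assumed layout, the low four bits of the first vector and the high four bits of the second vector are the same physical bits, which is what lets 4 + 16 + 16 bits fit in 32.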
Capturing the test results of different combinational units can be done with only one MISR, because at any given time only one combinational unit is being tested and no conflict can occur over this single MISR. Since one MISR captures the results of more than one unit, the assembler must follow a predetermined order for inserting the test instructions in the machine code.

Four different test instructions are defined for testing the four combinational units of PAYEH: TST1, TST2, TST3 and TST4. TST1 and TST2 are responsible for testing the control unit and the branch unit respectively; they fit into one memory slot and their results are captured with a shared MISR. TST3 and TST4 are responsible for testing the adder and the ALU respectively, and their results are captured with another shared MISR. As mentioned, by overlapping the two test vectors in four of their bits, these test instructions fit in two memory slots and can be inserted by the processor assembler in place of two consecutive NOPs.

This software-oriented approach has less hardware overhead than the hardware-oriented implementation. The only requirements are two parallel MISRs, a test controller and some additional discrete gates. On the other hand, the test time is longer by a factor of six, because in the hardware-oriented implementation a single test instruction tests all parts of the processor in one clock cycle, while in the software-oriented implementation four test instructions must be executed in six clock cycles.

III. EXPERIMENTAL RESULTS
To demonstrate the results of TIS in both implementation approaches, several experiments have been done. The first objective is to illustrate the role of test instructions in online testing of the processor, and the second objective is to measure the fault coverage of the method. For the first objective, we used the following programs:

Power: This program calculates a^b for natural numbers a and b (see Fig. 1). The two stalls after the BRZ instruction and the one stall after the JPR instruction are filled with test instructions.

Factorial: This program calculates a! (see Fig. 2). The two stalls after the BRZ instruction and the one stall after the JPR instruction are filled with test instructions.

Fibonacci: This program calculates the nth term of the Fibonacci series (see Fig. 3). The two stalls after each BRZ instruction and the one stall after each JPR instruction are filled with test instructions.

Vector addition: This program adds two vectors from the data memory and stores the results into the data memory (see Fig. 4). The two stalls after the BRZ instruction, the one stall after the JPR instruction and the one stall after the dependent LDA instruction are filled with test instructions.
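The stall filling performed by the assembler in the software-oriented implementation can be sketched as follows. This is a minimal Python sketch, not the authors' assembler: it takes the stall counts from Section II (two after BRZ, BRC and JPA, one after JPR and after an LDA whose result is used by the next instruction) and assumes that the "predetermined order" is a simple alternation, TST3/TST4 for two-word slots and TST1/TST2 for one-word slots. The function names and the list-of-strings program format are hypothetical.

```python
from itertools import cycle

TWO_STALL_OPS = {"BRZ", "BRC", "JPA"}  # two stall slots each (Section II)
ONE_STALL_OPS = {"JPR"}                # one stall slot each (Section II)


def parse(line):
    """Split 'for: LDA R3,R1' into ('LDA', ['R3', 'R1']), ignoring labels."""
    if ":" in line:
        line = line.split(":", 1)[1].strip()
    parts = line.split(None, 1)
    if not parts:
        return "", []  # the line held only a label
    operands = parts[1].replace(" ", "").split(",") if len(parts) > 1 else []
    return parts[0].upper(), operands


def fill_stalls(program):
    """Return a copy of the program with stall slots filled by test instructions."""
    one_word = cycle(["TST1", "TST2"])  # assumed order: control unit, branch unit
    two_word = cycle(["TST3", "TST4"])  # assumed order: adder, ALU
    filled = []
    for index, line in enumerate(program):
        filled.append(line)
        op, operands = parse(line)
        if op in TWO_STALL_OPS:
            filled.append(next(two_word))  # one two-word test covers both stalls
        elif op in ONE_STALL_OPS:
            filled.append(next(one_word))
        elif op == "LDA" and index + 1 < len(program):
            _, next_operands = parse(program[index + 1])
            # Only a dependent LDA stalls: the next instruction uses the loaded register.
            if operands and operands[0] in next_operands:
                filled.append(next(one_word))
    return filled


power = ["MVI R0,0", "MVI R1,1", "LDA R0,R0", "LDA R1,R1", "MVI R2,1",
         "for: MUL R2,R0", "DEC R1,1", "BRZ end", "JPR for", "end:"]
for line in fill_stalls(power):
    print(line)
```

A real assembler would also have to attach the test vectors to each inserted opcode; the sketch only shows where the test instructions go.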
(a)                          (b)
       MVI R0,0                     MVI R0,0
       MVI R1,1                     MVI R1,1
       LDA R0,R0                    LDA R0,R0
       LDA R1,R1                    LDA R1,R1
       MVI R2,1                     MVI R2,1
for:   MUL R2,R0             for:   MUL R2,R0
       DEC R1,1                     DEC R1,1
       BRZ end                      BRZ end
       TST                          TST3
       TST
       JPR for                      JPR for
       TST                          TST1
end:                         end:

Fig. 1. Calculating R2 = R0^R1 while R0 and R1 are loaded from the 0 and 1 data memory locations respectively. (a) Hardware-oriented implementation. (b) Software-oriented implementation.

(a)                          (b)
       MVI R0,0                     MVI R0,0
       LDA R0,R0                    LDA R0,R0
       MVI R1,1                     MVI R1,1
for:   MUL R1,R0             for:   MUL R1,R0
       DEC R0,1                     DEC R0,1
       BRZ end                      BRZ end
       TST                          TST4
       TST
       JPR for                      JPR for
       TST                          TST2
end:                         end:

Fig. 2. Calculating R1 = R0! while R0 is loaded from the 0 location of the data memory. (a) Hardware-oriented implementation. (b) Software-oriented implementation.

(a)                          (b)
       MVI R0,0                     MVI R0,0
       LDA R0,R0                    LDA R0,R0
       MVI R1,1                     MVI R1,1
       MVI R2,1                     MVI R2,1
       DEC R0,1                     DEC R0,1
       BRZ end1                     BRZ end1
       TST                          TST3
       TST
       DEC R0,1                     DEC R0,1
       BRZ end2                     BRZ end2
       TST                          TST4
       TST
for:   ADD R1,R2             for:   ADD R1,R2
       DEC R0,1                     DEC R0,1
       BRZ end1                     BRZ end1
       TST                          TST3
       TST
       ADD R2,R1                    ADD R2,R1
       DEC R0,1                     DEC R0,1
       BRZ end2                     BRZ end2
       TST                          TST4
       TST
       JPR for                      JPR for
       TST                          TST1
end1:  MVR R3,R1             end1:  MVR R3,R1
       JPR end                      JPR end
       TST                          TST2
end2:  MVR R3,R2             end2:  MVR R3,R2
end:                         end:

Fig. 3. Calculating the R0th term of the Fibonacci series while R0 is loaded from the 0 location of the data memory. (a) Hardware-oriented implementation. (b) Software-oriented implementation.

(a)                          (b)
       MVI R0,0                     MVI R0,0
       MVI R1,100                   MVI R1,100
for:   LDA R2,R0             for:   LDA R2,R0
       LDA R3,R1                    LDA R3,R1
       TST                          TST1
       ADD R2,R3                    ADD R2,R3
       STA R0,R2                    STA R0,R2
       INC R0,1                     INC R0,1
       INC R1,1                     INC R1,1
       MOV R2,R0,0,1                MOV R2,R0,0,1
       SUB R2,R0                    SUB R2,R0
       BRZ end                      BRZ end
       TST                          TST3
       TST
       JPR for                      JPR for
       TST                          TST2
end:                         end:

Fig. 4. Adding two vectors from the data memory and saving the results. The size of the vectors is specified in the R0 register from the second window of the register file. (a) Hardware-oriented implementation. (b) Software-oriented implementation.

In the hardware-oriented implementation all stalls are filled with TST instructions, but in the software-oriented implementation the assembler fills the two consecutive stalls after BRZ, BRC and JPA with TST3 or TST4, and it fills the single stall after the JPR and LDA instructions with TST1 or TST2. To illustrate the effect of each program on online testing of the processor, a parameter called the test period is defined. The test period is the time it takes to test the whole processor with one test vector during the normal operation of a running program. The test period depends on the program context and can be calculated as follows:

$$\text{Test period} = \frac{\text{Execution time of a program (\#clock cycles)}}{\text{Total number of test instructions in the program}} \times \text{Number of test instructions of the processor}$$

The test period can be calculated based on the loop bodies of the benchmark programs. Table I summarizes this parameter for all benchmark programs in both types of implementation. The test period shows the relation between the online testing time and the offline testing time of each separate component:

$$\text{Online test time of each component} = \text{Offline test time of each component} \times \text{Test period}$$
In the fault coverage measurement process, the fault coverage of each combinational component is measured separately. The method used for fault coverage measurement is based on synthesizing the design into a faulty library. In the faulty library, each gate reports its detected faults during the test procedure [13]. Table II shows the fault coverage achieved for each component in both kinds of implementation after testing it with 8192 test vectors.

TABLE I
THE TEST PERIOD OF EACH BENCHMARK PROGRAM EXECUTED ON PAYEH PROCESSOR.
[Table body not recoverable from the source.]

TABLE II
FAULT COVERAGE OF EACH COMBINATIONAL COMPONENT AFTER TESTING
Components: Control Unit, Branch Unit, Adder, ALU
Implementations: Hardware Oriented, Software Oriented
Fault coverage values, in source order: 97%, 97.3%, 100%, 96.3%, 92.1%, 96.9%, 90.3%, 81.8%
[The assignment of these values to table cells is not recoverable from the source.]

Both the hardware-oriented and the software-oriented implementations of the TIS method on the PAYEH processor, with four LFSRs and four parallel MISRs for the hardware-oriented approach and two parallel MISRs for the software-oriented approach, have been synthesized with a 0.5 µm ASIC technology. Table III shows the post-synthesis hardware overhead of both implementations.

TABLE III
THE AREA OVERHEAD OF BOTH TIS IMPLEMENTATIONS IN PAYEH PROCESSOR.
Implementation       PAYEH (Gate Count)   with TIS (Gate Count)   Area Overhead
Software Oriented    55876                59004                   5.6%
[Only one row of values survives in the source, next to the label "Software Oriented"; the other row is not recoverable.]

IV. CONCLUSION AND FUTURE WORK

A software-based implementation of the TIS high level self-testing methodology was presented. A method for testing the sequential parts of the processor in addition to the combinational parts was proposed. The implementation challenges of TIS in the software-oriented implementation were explained, and a real implementation of the method on the PAYEH processor was presented. Some sample programs on this CPU demonstrated the method's appropriateness for at-speed online testing of pipelined processors. For each of these programs, the fault coverage of each component was measured. These measurements show that this method can achieve a desirable level of fault coverage for at-speed online and offline self-testing. The hardware overhead of both implementations was measured and compared. Applying this method to other processors with more complicated architectures, such as VLIW and superscalar processors, is part of our future work.

REFERENCES

[1] S. Shamshiri, H. Esmaeilzadeh, M. Alisafaee, P. Lotfikamran and Z. Navabi. Test Instruction Set (TIS): An Instruction Level CPU Core Self-Testing Method. Proc. of ETSW, pp. 15-16, May 2004.
[2] Murray and J. Hayes. Testing ICs, getting to the core of the problem. IEEE Design and Test of Computers, Vol. 29, No. 11.
[3] The National Technology Roadmap for Semiconductors. Semiconductor Industry Association, 1997.
[4] Wei-Cheng Lai and Kwang-Ting (Tim) Cheng. Instruction-Level DFT for Testing Processor and IP Cores in System-on-a-Chip. Design Automation Conference, June 2001.
[5] W.-C. Lai, A. Krstic and K.-T. Cheng. Test Program Synthesis for Path Delay Faults in Microprocessor. Proceedings of International Test Conference, pages 1080-1089, 2000.
[6] L. Chen and S. Dey. DEFUSE: A Deterministic Functional Self-Test Methodology for Processors. VLSI Test Symp., pp. 255-262, May 2000.
[7] D. Brahme and J.A. Abraham. Functional Testing of Microprocessors. IEEE Transactions on Computers, vol. C-33, pages 475-485, 1984.
[8] F. Distante and V. Piuri. Optimum Behavioral Test Procedure for VLSI Devices: A Simulated Annealing Approach. Proceedings of the IEEE International Conference on Computer Design, pages 31-35, 1986.
[9] J. Shen and J.A. Abraham. Native Mode Functional Test Generation for Processors with Applications to Self Test and Design Validation. Proceedings of International Test Conference, pages 990-999, 1998.
[10] K. Batcher and C.A. Papachristou. Instruction Randomization Self Test for Processor Cores. VLSI Test Symposium, pages 34-40, 1999.
[11] J. Lee and J.H. Patel. Architectural Level Test Generation for Microprocessors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(10):1288-1300, October 1994.
[12] Zainalabedin Navabi. Digital Design and Implementation with Field Programmable Devices. Kluwer Academic Publishers, May 2004.
[13] M. Zolfy, S. Mirkhani and Z. Navabi. SPC-FC: A New Method for Fault Simulation Implemented in VHDL. Proc. of NATW'01, pp. 17-21, June 2001.
Fig. 5. The data path and controller of PAYEH with its five pipe stages (IF, ID, EXE, MEM, WB). This processor has been designed and implemented by Saeed Shamshiri.
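TABLE IV
[Instruction set of PAYEH. Table body not recoverable from the source; the surviving column headers are "(15:10)", "Instruction Mnemonic and Definition" and "RTL Notation: Comments or Condition".]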