An Incremental Learning Framework for Estimating Signal Controllability in Unit-Level Verification

Charles H.-P. Wen, Li-C. Wang†, Jayanta Bhadra‡
Dept. of Communication Engineering, National Chiao-Tung University, Hsinchu, Taiwan 300
†Dept. of Electrical and Computer Engineering, Univ. of California, Santa Barbara, CA 93106
‡Freescale Semiconductor Inc., Austin, TX 78729
{[email protected], [email protected], [email protected]}

Abstract—Unit-level verification is a critical step in the success of full-chip functional verification for microprocessor designs. In unit-level verification, a unit is first embedded in complex emulation software that mimics the behavior of the surrounding units, and a sequence of stimuli is then applied to measure functional coverage. To generate such a sequence, designers need to comprehend the relationship between the boundaries of the unit under verification and the inputs to the emulation software. However, figuring out this relationship can be very difficult. Therefore, this paper proposes an incremental learning framework that incorporates an ordered-binary-decision-forest (OBDF) algorithm to automate the estimation of unit-level signal controllability and to provide full-chip-level information for designers to govern these signals. Mathematical analysis shows that the proposed OBDF algorithm has lower model complexity and lower error variance than the previous algorithms. A commercial microprocessor core is also used to demonstrate that the controllability of the input signals of its load/store unit can be estimated automatically and that information about how to govern these signals can be extracted successfully. (This work is supported in part by Semiconductor Research Corporation contract No. 2005-TJ-1360.)
I. INTRODUCTION

A high-performance microprocessor design is often divided into several units. Each unit is designed individually and verified before all units are assembled for full-chip functional verification. Unit-level verification [1][2][3] is a critical step in the success of full-chip functional verification. Scafidi et al. [4] demonstrate that even after the full-chip model of Intel's Itanium-2 processor neared tape-out quality, unit-level verification could still uncover additional bugs. The divide-and-conquer strategy therefore gives tremendous benefit in terms of simulation and debug efficiency. For microprocessor designs, legacy testbenches are accumulated over generations. These legacy tests provide the first set of golden tests to be applied to a new generation of the design. Conventionally, unit-level verification requires the unit under verification (UUV) to pass the legacy tests. However, legacy tests are typically assembly programs, and their correctness is defined on the content of external memory or on signal values on the system bus. To verify a unit with legacy tests, emulation software is developed as the full-chip model to interact with the UUV. This software defines the mapping from the assembly
instructions to the input signals of the UUV, and the mapping from the output signals of the UUV to the system bus and memory. The emulation software evolves together with the microprocessor designs. Legacy tests often function only as a sanity check; they do not provide the designated coverage for the UUV. Designers are therefore often asked to write additional tests to cover certain targets with respect to different metrics. For example, it is common practice today for designers to prepare tests that cover assertions embedded in the UUV. Once designers successfully develop new tests, these tests can be used in later full-chip simulation and may become legacy tests themselves. However, these new unit-level tests cannot be written on the unit-level boundary alone, because the legal input space from the assembly instructions to the UUV is typically not explicitly specified in the design specification but implicitly defined within the emulation software. Hence, unit-level tests need to be written based on the inputs to the emulation software. Because the designers of the unit under verification are usually not equipped with knowledge of the mapping from the inputs of the emulation software to the inputs of the UUV, it can be a difficult task to write tests that satisfy the required values on the unit-level signals.

Fig. 1. An engineering flow for conducting signal controllability: generate trial tests, collect simulation data, examine switching activities, and conduct controlling rules; on failure, generate new trial tests; on success, stop.
The tests that provide controllability of the unit-level input signals may not be complex. However, figuring out what kind of simple test can be used to govern one unit-level input signal can be a tedious process. Done manually, designers would need to (1) generate trial tests, (2) collect the simulation results on the unit-level signals of interest, (3) use a waveform viewer to examine the switching activities on the input, and (4) conduct a manual induction process to find a test that can indeed control the target input signal. Figure 1 illustrates this engineering flow. For each unit-level input signal, understanding what kind of test can be used to control its value is tremendously helpful when writing tests for the UUV. Hence, if the tedious process of figuring out this information can be automated, such a methodology will be valuable to designers for writing
unit-level tests. Therefore, this work pursues this direction and develops such a learning framework. The rest of the paper is organized as follows: Section II describes the flow of the proposed framework together with its requirements and the difficulties to be solved. Section III explains the steps of the overall framework, while Section IV elaborates the last step, the OBDF learning algorithm, in detail. Section V describes the experimental setup for a commercial Power Architecture Technology e200 microprocessor from Freescale Semiconductor Inc., and Section VI discusses practical issues at the architecture level and their impact on the effectiveness of learning. Section VII summarizes the experimental results to show the effectiveness of the proposed framework on Freescale's e200 microprocessor core. Section VIII concludes the paper.

II. INCREMENTAL LEARNING FRAMEWORK

Fig. 2. Proposed incremental learning framework: each instruction template 1..K is instantiated into a set of tests, the tests are simulated against the unit under verification embedded in the C++ emulation software, and learning summarizes, per template, which signals are controllable and how.
Figure 2 illustrates the simulation-based incremental learning framework. It starts by generating multiple test sequences: given a collection of test templates, each template is instantiated into a set of tests (each a sequence of instructions) based on the constraints and biases specified in the template. After the test sequences under each template are simulated, the data observed on the boundaries at both the full-chip level and the unit level are collected and sent to the subsequent learning process. For each test template, the final learned result summarizes its signal controllability with two kinds of useful information: (1) which inputs can be controlled under this template, and (2) how these inputs can be controlled when refining the template to instantiate tests. Note that each template essentially defines a functional sub-space for the inputs of the UUV; learning explores this sub-space based on the collected data in order to induce both kinds of information mentioned above. To assess the effectiveness of the learning framework, we further measure two indices: (1) learning accuracy and (2) confidence in the learning accuracy. Learning accuracy is measured by simulating an additional set of tests, and this index indicates how well the framework learns from the existing data. Since it is impossible to exhaust the entire input space, the learning accuracy can only be evaluated on the sub-space explored by the additional set of tests; therefore, a second index is incorporated to quantify the confidence of this evaluation.

Considering the high complexity of modern microprocessor designs, exploring the entire functionality from a limited amount of simulation data is virtually impossible. Our goal cannot be to learn the exact functionality; instead, we try to identify tests that are as simple as possible for controlling the input signals of the UUV to the desired values. The complexity of a test template can be defined by the number of key instructions executed: the fewer core instructions a template executes, the less complex (simpler) it is. In our case, the simplest template is a single-instruction template. To achieve effective data mining and easy test production for designers, we always start with the simpler test templates and gradually increase the complexity; if a signal can be controlled by a single-instruction template, that template is easier to use than a multi-instruction template. In this sense, our framework follows an incremental learning flow. As a result, different signals may require templates of different complexities. Inevitably, certain signals, such as exception and overflow flags, can only be exercised by tests with long instruction sequences. The incremental learning framework cannot find simple tests for those special signals, but it aims at quickly finding simpler tests for the other signals.

A data-mining engine based on a decision-diagram-based learning approach was first proposed in [6]. The authors show that the learning accuracy of their approach is comparable to other state-of-the-art learning techniques; however, they do not show how to quantify the confidence in their learning accuracy. In this work, we propose an ordered-binary-decision-forest (OBDF) based learning algorithm. The algorithm follows the approach of ensemble learning [14], which builds a collection of learners, each trained with only a randomly selected portion of the dataset. Imagine that learning is to cover a space represented by the data. Figure 3 illustrates the basic principle of our OBDF algorithm (and of ensemble learning in general): each OBDD is an individual learner that learns a part of the space. If the majority of the OBDDs agree on what they find for a particular sub-space, then what they agree on is used as the model for that sub-space; and the more OBDDs agree with each other, the higher our confidence in the model.

Fig. 3. OBDF vs. Venn diagram for covering space: the forest of OBDD1 ... OBDDt covers the true space through the overlap of its individual learners.
In this work, we apply the proposed framework to a commercial e200 microprocessor from Freescale Semiconductor Inc. Our experimental results show that many highly controllable inputs can be automatically identified from a collection of simple instruction templates. The learning results are stored in the form of OBDFs, which provide the information about how to govern those inputs from the assembly instructions. If an input is considered controllable after learning, we verify this controllability based on its learned OBDFs and additional simulation.

III. THREE STEPS OF THE FRAMEWORK

The proposed framework computes the signal controllability in three main steps: (1) test preparation/simulation, (2) data preprocessing, and (3) OBDF learning. Steps 1 and 2 are discussed in this section; step 3 executes the proposed OBDF learning algorithm, which involves several different techniques, and is elaborated in detail in the next section.

Test preparation/simulation: Instead of trying to model complicated architectural behaviors, we choose simple instruction templates that avoid triggering unpredictable ones. Our objective is not to understand the circuit functionality but to understand which signals are controllable and how they are controlled, and simple instruction templates provide good controllability while preventing unpredictable architectural behaviors from happening. Each simple instruction template, composed of an initialization sequence and a core instruction selection, is instantiated into a set of testcases. The learning boundaries are monitored to collect data during the simulation of these testcases.

Data preprocessing: The preprocessing step transforms the simulation data into a format that the learning engine can take; Figure 4 illustrates the process. Since every input signal from the starting instruction in the initialization section to the current instruction in the core section may impact the output signals at the current cycle, the sequence of those input signals in the simulation trace is concatenated into one bit stream. A pruning pass is then performed on the bit stream to remove inputs with low-sensitivity activity. The sensitivity σ_i is the total number of 1's seen on input i divided by the total size of the data set. Users can specify a threshold δ for low-sensitivity signals, e.g. 1%; if σ_i < δ or σ_i > (1 − δ), the 0/1 distribution on input i is too biased to provide information, and the input can be removed from the input dimension. The reduced bit stream is sent to the learning algorithm as its inputs.

Dropping a dimension for one input may cause ambiguities in the training data. For example, given f(x1 = 10001) = 1, f(x1 = 00001) = 0 and f(x1 = 11001) = 1, if we remove the first two variables, the reduced vector 001 results in inconsistent answers. In this example we use 1 as the answer, since the number of cases with f(x_i) = 1 is larger than the number with f(x_i) = 0; then f(x1 = 00001) = 0 introduces an error into the training data for the later learning. In the literature [10][11] this is called irreducible error: computation-wise, no learning algorithm can possibly fix this kind of error.

Fig. 4. Testcase and learning data transformation: an assembly testcase (initialization section followed by core section) is simulated into a trace, the trace is concatenated into input/output bit streams, and the input stream is pruned before being handed to the learning engine as samples of an unknown constrained function.
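As a concrete illustration of the pruning step, the following is a minimal Python sketch (ours, not the authors' implementation); it assumes the simulation trace has already been flattened into a 0/1 sample matrix with one concatenated bit stream per row:

```python
import numpy as np

def prune_low_sensitivity(samples, delta=0.01):
    """Drop input bits whose 0/1 distribution is too biased to be informative.

    samples: (num_testcases, num_input_bits) 0/1 matrix of concatenated
             input bit streams; delta: user-specified threshold (e.g. 1%).
    Returns the reduced matrix and the indices of the surviving bits.
    """
    sigma = samples.mean(axis=0)                   # sensitivity of each input bit
    keep = (sigma >= delta) & (sigma <= 1.0 - delta)
    return samples[:, keep], np.flatnonzero(keep)

# Example with synthetic data: heavily skewed columns get pruned.
rng = np.random.default_rng(0)
data = (rng.random((1000, 8)) < [0.5, 0.001, 0.4, 0.999, 0.6, 0.0, 1.0, 0.3]).astype(int)
reduced, kept = prune_low_sensitivity(data)
print(kept)   # indices of columns whose sensitivity lies in [delta, 1 - delta]
```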
IV. OBDF LEARNING ALGORITHM

First, let us look at the main idea behind the OBDF learning algorithm. Mathematical analysis is provided to demonstrate the power of ensemble learning. We then discuss each constituent technique and conclude on the effectiveness of OBDF learning by comparison with the previous algorithm in [6].

A. Concept of ensemble learning

In general, when the dimension of the learning inputs in Figure 4 is high but the training sample size is relatively low, it is difficult to construct an effective single learner, due to the bias of the training data and the variance of the dimension estimation. Such a learner tends to be unstable, because any small change of the training data may lead to a large change in the learner; it is therefore called a weak learner. Typically, a weak learner can be recognized by its poor learning performance when validated on different samples. Ensemble learning [12] is one of the most popular approaches to improving a weak learner: it constructs a set of weak learners instead of one and combines them into a powerful decision rule.

The task of ensemble learning is to approximate a target Boolean function f(x) : B^n → B, where x ∈ X. Assume that an ensemble learner contains |T| independent weak learners g_i(x). A weighted ensemble learner g(x) is defined as

g(x) = 1 if Σ_{i∈T} w_i g_i(x) ≥ 0.5, and g(x) = 0 otherwise.

Each weight w_i can be viewed as the belief in the individual learner g_i(x). For an input x, the error of the ensemble ε(x), the error of the i-th learner ε_i(x), and its ambiguity a_i(x) are measured with mean-square-error (MSE) functions:

ε(x) = (f(x) − g(x))²
ε_i(x) = (f(x) − g_i(x))²
a_i(x) = (g_i(x) − g(x))²

The ensemble error can be written as ε(x) = ε̄(x) − ā(x), where ε̄(x) = Σ_{i∈T} w_i ε_i(x) is the (weighted) average of the errors of the individual learners, and ā(x) = Σ_{i∈T} w_i a_i(x) is the average of their ambiguities, which stands for the variance of the output over the ensemble.
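The decomposition ε(x) = ε̄(x) − ā(x) holds exactly when g(x) is taken as the real-valued weighted average of the g_i(x) before thresholding. The following small numerical sketch (illustrative only, not part of the paper's tool flow) verifies the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
w = np.full(T, 1.0 / T)                          # uniform beliefs w_i (sum to 1)
f = 1.0                                          # target value f(x) for one input x
g_i = rng.integers(0, 2, size=T).astype(float)   # weak-learner outputs g_i(x)

g_bar = np.dot(w, g_i)                     # real-valued weighted ensemble output
eps = (f - g_bar) ** 2                     # ensemble error eps(x)
eps_bar = np.dot(w, (f - g_i) ** 2)        # weighted average of individual errors
amb_bar = np.dot(w, (g_i - g_bar) ** 2)    # weighted average ambiguity

assert np.isclose(eps, eps_bar - amb_bar)  # eps(x) = eps_bar(x) - amb_bar(x)
print(eps, eps_bar, amb_bar)
```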
Therefore, by averaging over the total Boolean space B^n, the true generalization error is

ε = Σ_{x_j∈B^n} ε(x_j)/2^n = Σ_{x_j∈B^n} ε̄(x_j)/2^n − Σ_{x_j∈B^n} ā(x_j)/2^n = ε̄ − ā,

where ε̄ and ā stand for the (weighted) average of the true errors of all ε_i()'s and the average of the true ambiguities of all a_i()'s over B^n. An important feature of ensemble learning is that it separates the generalization error into one term that depends on the generalization errors of the individual learners and another term that contains all correlations between learners. The more the learners differ, the lower the error will be. It also follows that ε ≤ ε̄, since ā ≥ 0, which explicitly justifies the power of ensemble learning. To sum up, when designing an algorithm for ensemble learning, two major concerns need to be addressed:
• How to construct different but similar learners from the training data?
• How to estimate the belief of each weak learner?

Our learning algorithm resolves these two concerns by constructing an ordered-binary-decision-forest (OBDF). It employs the bootstrap method from [17], followed by ordered nearest neighbor (ONN) learning from [6], to construct the individual learners; out-of-bag evaluation then helps us decide the weight of each OBDD learner. Figure 5 illustrates the step-by-step flow of our learning algorithm, and the following sections provide more details of the individual techniques.

Fig. 5. Step-by-step illustration of the OBDF learning algorithm: (1) bootstrap sampling of the training sample into BS1 ... BST, (2) selection of supporting variables, (3) ONN learning of OBDD1 ... OBDDT, and (4) OOB weighting of the forest.
B. Bootstrap sampling

In order to construct multiple weak learners, we adopt bootstrap sampling [17]. Its objective is to produce multiple bootstrap samples BS_i from one training sample. The training objects are sampled with replacement: after an object is randomly drawn from the training data, it is put back before the next object is drawn. In Figure 5, the bootstrap sampling is repeated |T| times and produces |T| different bootstrap samples for computing the |T| learners later. The bootstrap sampling approach provides several important statistical properties:
• The bootstrap distribution is centered close to the mean of the training sample, i.e. it has low bias with respect to the training sample.
• On average, a bootstrap sample of size n = |X| contains a fraction p = 1 − (1 − 1/n)^n of the objects in the training sample. If n is sufficiently large, p asymptotically approaches 1 − e^{−1} = 63.2%. In other words, 36.8% of the objects, possibly including malicious outliers, are not present in the bootstrap sample, which may result in a better learner.

Note that since not all objects in the training sample are included, bootstrap sampling may introduce a small number of different errors, from the excluded objects, into each bootstrap sample BS_i. In contrast to the irreducible error mentioned in Section III, such error is called reducible error, in the sense that it can be minimized by a good learning algorithm. We will analyze how our ensemble learning algorithm mitigates this error later.
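A minimal sketch of the bootstrap step (our illustration, with hypothetical function names): it draws |T| samples with replacement and records the out-of-bag rows that Section IV-E reuses for weighting:

```python
import numpy as np

def bootstrap_samples(num_objects, num_learners, seed=0):
    """Draw |T| bootstrap samples of size n from an n-object training set.

    Returns per-learner (in_bag, out_of_bag) index arrays. On average the
    out-of-bag set holds (1 - 1/n)^n ~ 36.8% of the objects.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_learners):
        in_bag = rng.integers(0, num_objects, size=num_objects)  # with replacement
        out_of_bag = np.setdiff1d(np.arange(num_objects), in_bag)
        samples.append((in_bag, out_of_bag))
    return samples

pairs = bootstrap_samples(num_objects=10000, num_learners=10)
print(np.mean([len(oob) for _, oob in pairs]) / 10000)   # ~0.368
```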
C. Selecting supporting variables

One major difficulty of the previous decision-diagram-based learning algorithm in [6] is finding a good variable ordering for learning. In our algorithm, however, the OBDD sizes are bounded by the size of the training data X: the worst-case OBDDs are complete binary trees with at most 2^(⌈log₂|X|⌉+1) nodes. Therefore, instead of finding a good variable ordering directly, given the limited information provided by the training data, our first objective is to find the variables that have the greatest impact on the learning output among the large number of learning inputs. We call them supporting variables.

Given the bootstrap samples generated in step 1, the current task is to select the supporting variables. This task can be transformed into finding the best splitting attributes of the dataset for a tree-based algorithm. The impact of a variable can be defined by the impurity difference of the dataset before and after splitting on the variable: the smaller the impurity after the split, the clearer the trend in the dataset. In this paper, the Gini index introduced by Breiman et al. in [15] is applied to quantify the impact of a variable. Given a training sample dataset of w objects with output class distribution C = {c₀, c₁}, the Gini index of this dataset is defined as

GI(w) ≡ 1 − p(c₀|w)² − p(c₁|w)²

where c₀ and c₁ represent the numbers of 0's and 1's seen at the output of w, respectively. Given a variable x_i, w can be split into a cofactor x_i = 0 with w_{x̄i} objects and a cofactor x_i = 1 with w_{xi} objects. The impact of x_i is the difference between the Gini index of the original sample of w objects and the weighted sum of the Gini indices of the two cofactor datasets w_{x̄i} and w_{xi}. That is,

Im(x_i) ≡ GI(w) − (w_{x̄i} / (w_{x̄i} + w_{xi}))·GI(w_{x̄i}) − (w_{xi} / (w_{x̄i} + w_{xi}))·GI(w_{xi})
Figure 6 shows an example comparing the impact values of two variables x1 and x2. Tables I, II and III in Figure 6 give the 0/1 class distributions of the original training sample, of the split on x1, and of the split on x2, together with the Gini index of each node. The impacts of x1 and x2 are computed as follows:

Im(x1) = GI(N) − (7/12)·GI(N1) − (5/12)·GI(N2) = 0.0143
Im(x2) = GI(N) − (5/12)·GI(N3) − (7/12)·GI(N4) = 0.1286

Since Im(x2) > Im(x1), variable x2 is the better splitting variable.

Fig. 6. Impact comparison of splitting on x1 and x2. Root node N (Table I): c0 = 6, c1 = 6, GI(N) = 1 − (6/12)² − (6/12)² = 0.5. Splitting on x1 (Table II): N1 with c0 = 4, c1 = 3 gives GI(N1) = 1 − (4/7)² − (3/7)² = 0.4898; N2 with c0 = 2, c1 = 3 gives GI(N2) = 1 − (2/5)² − (3/5)² = 0.48. Splitting on x2 (Table III): N3 with c0 = 1, c1 = 4 gives GI(N3) = 1 − (1/5)² − (4/5)² = 0.32; N4 with c0 = 5, c1 = 2 gives GI(N4) = 1 − (5/7)² − (2/7)² = 0.4082.
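The Gini computation above is easy to reproduce; here is a short sketch (ours, for illustration) that recomputes the worked example from Figure 6:

```python
from collections import Counter

def gini(labels):
    """Gini index GI(w) = 1 - p(c0|w)^2 - p(c1|w)^2 of a label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impact(labels_x0, labels_x1):
    """Im(x_i): Gini impurity drop from splitting the sample on x_i."""
    n0, n1 = len(labels_x0), len(labels_x1)
    n = n0 + n1
    return (gini(labels_x0 + labels_x1)
            - (n0 / n) * gini(labels_x0)
            - (n1 / n) * gini(labels_x1))

# Class labels per branch, taken from the Figure 6 example.
n1_labels = [0] * 4 + [1] * 3   # x1 = 0 branch (node N1)
n2_labels = [0] * 2 + [1] * 3   # x1 = 1 branch (node N2)
n3_labels = [0] * 1 + [1] * 4   # x2 = 0 branch (node N3)
n4_labels = [0] * 5 + [1] * 2   # x2 = 1 branch (node N4)
print(impact(n1_labels, n2_labels))   # Im(x1) ~ 0.0143
print(impact(n3_labels, n4_labels))   # Im(x2) ~ 0.1286
```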
The impact of each variable can be computed in this manner, and all variables are sorted by their impact values; the top 2^(⌈log₂|X|⌉+1) variables are selected as supporting variables for the target output. Selecting the best splitting variable has been widely studied in data-mining research, and many other measures, such as information gain and entropy, have also been proposed (summarized in [18]). However, the Gini index is popular because its definition is simple and straightforward: only one scan of the data is needed for its calculation, which leads to better efficiency than the support-confidence framework in [6] and many other measures [18].

D. Ordered nearest neighbor (ONN) learning

Given the bootstrap samples and the supporting variables, we apply ordered nearest neighbor (ONN) learning from [6] as the baseline algorithm to grow weak learners. The underlying idea is that, given the variable ordering and two bit vectors (p1, ..., pn) and (q1, ..., qn), ONN uses the weighted distance Σ_{k=1}^{n} (p_k ⊕ q_k)·2^{n−k} to determine the nearest neighbor; this weighting scheme ensures that (p_k ⊕ q_k) weighs more than (p_{k+1} ⊕ q_{k+1}). Our ONN learning is implemented with a binary decision diagram (BDD), and Algorithm 1 describes the conversion from a data matrix Mλ into the corresponding BDD. Due to space limitations, readers may refer to [6] for more details of ONN learning.

Algorithm 1 ONN learning algorithm: ONN(Mλ)
1: if Mλ ≠ constant then
2:   x ← the 1st variable in Mλ;
3:   create node Tx;
4:   if compatible(Mλ0, Mλ1) then
5:     Mλx = merge(Mλ0, Mλ1);
6:     return ONN(Mλx);
7:   else
8:     Tx.lefttree ← ONN(Mλ0);
9:     Tx.righttree ← ONN(Mλ1);
10:  end if
11:  return Tx;
12: else
13:  return constant(Mλ);
14: end if

From [14][16], we know that unstable learners such as tree-based approaches characteristically have high variance and low bias. The original ONN learning algorithm has similar characteristics: it always fits the training sample perfectly, and hence the derived learning models are vulnerable to small changes in the data. OBDF learning mitigates this problem in the individual learner by tolerating the errors introduced during input dimension reduction and bootstrap sampling, which prevents the model complexity from growing too high.
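To make the ONN distance concrete, here is a small sketch (our illustration, not the BDD-based implementation of [6]) of the weighted metric and the nearest-neighbor lookup it induces:

```python
def onn_distance(p, q):
    """Weighted distance sum_{k=1..n} (p_k XOR q_k) * 2^(n-k) from Section IV-D.

    p, q: equal-length 0/1 vectors ordered so that the most important
    (supporting) variable comes first; a mismatch on variable k outweighs
    all possible mismatches on variables k+1..n combined.
    """
    n = len(p)
    return sum((pk ^ qk) << (n - 1 - k) for k, (pk, qk) in enumerate(zip(p, q)))

def onn_predict(query, training_rows):
    """Answer a query by copying the output of its ONN-nearest training row.

    training_rows: list of (bit_vector, output) pairs.
    """
    _, label = min(training_rows, key=lambda row: onn_distance(query, row[0]))
    return label

rows = [([1, 0, 0, 0, 1], 1), ([0, 0, 0, 0, 1], 0), ([1, 1, 0, 0, 1], 1)]
print(onn_predict([1, 0, 0, 1, 1], rows))   # nearest under the ONN metric -> 1
```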
E. Out-of-bag (OOB) weighting

Typical ensemble learning uses uniform weights for the weak learners, i.e. g(x) = 1 if (1/|T|)·Σ_i g_i(x) ≥ 0.5, where |T| is the number of weak learners in the ensemble. This has the desirable property ε ≤ (1/|T|)·Σ_i ε_i, which guarantees that the true generalization error of the ensemble learner is less than or equal to the average true generalization error of the individual learners. However, because not all learners are equally good in all parts of the problem space, weighting can improve the resolution of the learning accuracy. Therefore, as shown in Section IV-A, the weight w_i is introduced to reflect the generalization capability of each learner. For each bootstrap sample BS_i in Figure 5, only about 63.2% of the objects of the training sample are included; these are called in-bag data, while the remaining 36.8% are called out-of-bag (OOB) data. We re-use the OOB data to evaluate the quality of each weak learner, since the OOB data are drawn from the same distribution as the in-bag data used for training. An ensemble learner with good generalization capability should also show high accuracy on the OOB data; conversely, if the accuracy on the OOB data is poor, the failure to generalize to the entire space is likely due to an insufficient amount of training data.
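The paper does not spell out how the OOB accuracy maps to a weight w_i; the following sketch assumes a simple normalized-accuracy weighting, which is one plausible choice:

```python
import numpy as np

def oob_weights(learners, X, y, oob_index_sets):
    """Derive a belief w_i for each weak learner from its own OOB accuracy.

    learners: fitted predictors with .predict(inputs) -> 0/1 array;
    oob_index_sets: per-learner out-of-bag row indices (bootstrap step).
    Normalizing keeps sum(w) = 1 so the 0.5 vote threshold is unchanged.
    """
    acc = np.array([
        float(np.mean(clf.predict(X[oob]) == y[oob])) if len(oob) else 0.0
        for clf, oob in zip(learners, oob_index_sets)
    ])
    return acc / acc.sum()
```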
F. Comparison of OBDF and ONN

In summary, OBDF learning outperforms the previous ONN learning in two main aspects:
• Lower model complexity: to avoid over-fitting the training sample, a small number of errors is tolerated to absorb the variance arising from the randomness of the training sample. Bootstrap sampling and input dimension reduction introduce exactly this kind of noise into OBDF learning. Therefore, the average model complexity of an individual learner in an OBDF is lower than that of the learner produced by ONN learning.
• Lower error variance: aggregating the |T| learners of the ensemble averages out the reducible errors caused by single or few learners. Recall that ε = ε̄ − ā; the differences among the individual learners make ā ≥ 0, and hence ε ≤ ε̄. Moreover, the variance of the ensemble learner is lower than the average variance of the individual learners, that is, var(g(x)) ≤ Σ_{i∈T} w_i·var(g_i(x)).
V. EXPERIMENTAL SETUP

Fig. 7. Block diagram of Freescale's e200 microprocessor: CPU control logic, SPR interface, MMU, instruction cache, instruction bus interface unit, instruction unit, EXE unit, MUL unit, SPE unit, load/store unit (the unit under verification), data cache, and data bus interface unit; the ADDR/DATA/CONTROL buses at the instruction and data bus interface units form the learning-input boundary, and the I/O of the load/store unit forms the learning-output boundary.
Figure 7 shows a block diagram of the various units in the processor used for our experiments. The e200 is a dual-issue, 32-bit Power Architecture Technology compliant CPU that supports variable-length encoding technology, in-order execution and retirement. The processor integrates a pair of integer execution units, a branch control unit, an instruction fetch unit and a load/store unit, together with a multi-ported register file capable of sustaining six read and three write operations per clock. It contains a 16KB instruction cache, a 16KB data cache, and a memory management unit, and it uses the ten-stage instruction pipeline shown in Figure 8.

Fig. 8. The 10-stage pipeline in Freescale's e200 microprocessor: FET0, FET1, FET2, DEC0, DEC1, EXE0, EXE1, EXE2, EXE3, WB.
We choose the load/store unit as our driver example. The dotted lines in Figure 7 indicate the boundaries for the learning. The outer line, at the I/O of the instruction and data bus interface units, is the learning-input boundary; it represents how the surrounding logic interacts with the load/store unit. The inner line, surrounding the I/O of the load/store unit, is the learning-output boundary. Simulation data across timeframes at these two boundaries are collected, and learning then uncovers the relationship between the signals at the two boundaries.

VI. PRACTICAL ARCHITECTURAL ISSUES

Even though we know the signal boundaries indicated by the dotted lines in Figure 7, we are still far from understanding the relationship between them, because each testcase executes a sequence of instructions during simulation and a learning output may be affected by certain signals several timeframes earlier. Therefore, the first issue to resolve is the number of timeframes we need to watch at the output boundary. Several architectural techniques worsen this problem:
• Instruction pre-fetch: this technique increases operating speed by fetching the next instruction and starting to generate operand addresses before the current result has been calculated and stored. It may cause the same instruction to be fetched more than once, depending on the current machine state.
• Pipeline stall: hazards are situations that prevent the next instruction from executing in its designated clock cycle. The most common solution is to stall the pipeline and wait for the completion of older instructions. Different kinds of hazards may induce different numbers of stall cycles, so the activity lifespan (measured in cycles) of testcases instantiated from the same template may vary.
• Pipeline flush: many situations, such as branch mispredictions or exception handling, may cause the instruction pipeline to be flushed. Flushed cycles similarly vary the activity lifespan of testcases and make simulation data partitioning harder. Moreover, the instructions issued before the flush is taken are not the true instructions in execution and become noise in the data.

In our experiments, we resolve these issues by constraining our test templates to prevent the unwanted situations from happening, which greatly simplifies the learning process. Tests instantiated from these constrained templates provide predictable controllability of the unit's input signals, which gives designers another level of control if they decide to use the templates later.

Suppose we have successfully extracted the correct portion of the simulation data to learn from for each test. Learning can still be an issue, because the number of learning inputs is the number of bits per instruction multiplied by the number of cycles used in the test, which can reach hundreds or even thousands. Even though the number of input variables may be large, not all variables are critical for learning a given dataset. If we can identify a small subset of input variables that are important for learning the dataset, we can dramatically reduce the complexity of the learning; we call these selected input variables supporting variables. Therefore, in our OBDF learning algorithm, the step of selecting supporting variables is motivated not only by algorithmic limitations but also by computational efficiency.

VII. EXPERIMENTAL RESULTS

Assuming a practical time budget for full-chip simulation, 10,000 testcases are instantiated from each simple instruction template with K core instructions (K = 1, 2, 3, 4) and simulated to produce training data. As indicated before, the bus interface at the full-chip boundary is monitored as the learning-input boundary while the module input ports of the load/store unit are monitored as the learning-output boundary. For example, the CPU runtime of simulating 10,000 4-instruction testcases on this setup is about 40 hours. We will demonstrate later that varying the number of core instructions in the test templates is a simple way to observe the impact on learning accuracy and on the confidence in the predicted results. Before applying OBDF learning to the simulation data, we first perform a preprocessing step of input dimension reduction
in order to make the learning computationally easier. Table I shows the results of the data preprocessing step that transforms the data format and reduces the input dimension. The rows give the total number of clock cycles used by the testcases of each simple instruction template (the cycles of the core section plus the cycles of the initialization section), the width of the originally concatenated input bit stream, and the reduced width after input dimension reduction. Table I demonstrates that pruning the low-sensitivity inputs greatly reduces the number of learning inputs, which simplifies the data complexity and improves computational efficiency.

TABLE I. Preprocessed data and reduced input dimension

#core instructions               K=1    K=2    K=3    K=4
#clock cycles used in testcases   40     50     60     70
#original input width           1280   1920   2560   3200
#reduced input width             384    768   1152   1536
The first experiment observes the signal activities of the learning outputs under the different simple instruction templates. Signal activity is measured as the number of occurrences in which a learning output changes during the learning cycles. Since the core sections use 20 clock cycles in our experiments, the maximum number of occurrences for one output is 20. Figure 9(a) plots the number of activity occurrences of each learning output: 22 learning outputs have no or low activity during 1-instruction testcase simulation, which implies that these module inputs of the load/store unit are not controllable through the current test template. Figures 9(b), (c) and (d) show the activity occurrence distributions for 2-, 3- and 4-instruction testcase simulation, respectively. Note that 3 additional learning outputs (demarcated by stars) are specially marked in Figure 9(b): these outputs cannot be controlled by the 1-instruction template but become controllable through the 2-instruction template. Similarly, the 3- and 4-instruction templates each excite one more learning output.

Fig. 9. Signal activities in K-instruction test templates; panels (a) K=1, (b) K=2, (c) K=3 and (d) K=4 plot the activity occurrences against the index of outputs.

The second observation compares the similarity of the learners obtained by applying ONN learning to the bootstrap samples, using outputs 2 and 23 as examples. The average number of supporting variables used in an individual tree is 4 for output 2 and 14 for output 23, while the total numbers of distinct supporting variables used in the OBDFs are 7 and 22, and no pair of learners in the OBDFs uses the same input variable combination. Figures 10(a) and (b) show the histograms of the supporting variables: 2 of the 7 variables for output 2 and 5 of the 22 variables for output 23 appear in every tree, which leaves 2 and 9 supporting variables per tree to be picked from the remaining inputs (5 and 17, respectively). Both histograms suggest that although the individual learners differ from each other, their variance is not too large. This confirms that bootstrap sampling together with supporting variable selection successfully produces multiple different but correlated learners.

Fig. 10. Supporting variable histograms for (a) output 2 and (b) output 23: occurrence counts of each supporting variable (input index) across the trees of the forest.
Next, we validate the controllability of the learning outputs. A different sample, generated by the same means as the training data, represents the true design space: another 10,000 testcases are randomly generated, and the simulation answers are compared with the answers predicted by the OBDF learner. The learning accuracy is measured as the number of correctly predicted answers divided by the total number of testcases; the higher the accuracy, the more controllable the learning output is from the template. Figure 11 shows the accuracies of all learning outputs from 1- and 2-instruction testcase simulation with the default tree size of 10. After learning from 1-instruction testcase simulation data, 28 outputs achieve 100% correctness, while the learning accuracy of the remaining controllable outputs ranges from 69.9% to 97.2%. After learning from 2-instruction testcase simulation data, the learning accuracies of most of the 1-instruction-controllable outputs are elevated to 100% correctness; outputs 25, 41, 50 and 63 retain some negligible errors, resulting in 99.2%, 99.8%, 99.6% and 99.6% accuracy, respectively.

Fig. 11. Accuracy distribution of the learning outputs for (a) K=1 and (b) K=2: accuracy (%) against the index of outputs.
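The validation step itself is straightforward; a minimal sketch (illustrative) of the per-output accuracy measurement:

```python
import numpy as np

def learning_accuracy(predicted, simulated):
    """Per-output accuracy: #correct predictions / #validation testcases."""
    return np.mean(np.asarray(predicted) == np.asarray(simulated), axis=0)

# predicted / simulated: (num_testcases, num_outputs) 0/1 matrices coming
# from the OBDF learner and from simulating the fresh testcases.
# controllable = np.flatnonzero(learning_accuracy(predicted, simulated) == 1.0)
```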
We further investigate the causes of these errors and observe that they are rooted in the irreducible errors described in Section III. Fundamentally, the OBDF learning algorithm cannot avoid such errors; rather, this shows how the algorithm avoids the over-fitting problem by deliberately tolerating this small amount of error in order to derive learning models of lower complexity. Table II shows the learning accuracy and the average OBDD size of one tree in the OBDF for the additional controllable outputs demarcated by stars in Figure 9. The high learning accuracy together with the small OBDD sizes of these outputs implies that their underlying behaviors in the multiple-instruction templates are simple; in other words, these signals are easy to control from the corresponding multiple-instruction templates.

TABLE II. Accuracy on additional controllable outputs

K-inst. template    K=2    K=2    K=2    K=3    K=4
output index         25     41     48     38     27
accuracy           99.2%  99.8%  100%   86.4%  84.8%
avg. OBDD size      8.7     14     14   12.8   13.1
In the last experiment, we study the impact of the number of trees in the forest on learning accuracy and prediction confidence. As mentioned in Section IV-A, the ensemble learner g(x) outputs the predicted answer decided by Σ_{i∈T} w_i g_i(x), which ranges from 0 to 1. Predicted answers ranging from 0.3 to 0.7 are classified as high risk, and the high risk ratio is defined as the number of answers in the high-risk range divided by the total number of validation testcases. We use the high risk ratio to estimate confidence: the lower the high risk ratio, the higher the confidence of the ensemble learner. Figure 12 shows the curves of learning accuracy and high risk ratio as the number of trees increases, for learning output 27 after learning from 4-instruction testcase simulation. Both the accuracy and the high risk ratio saturate when the number of trees grows large; once the high risk ratio becomes stable, we can be more confident of the learning accuracy. Empirically, 10-20 trees are sufficient for most learning outputs of the current simple instruction templates to reach high accuracy and a stable high risk ratio.

Fig. 12. Accuracy vs. confidence for output 27 at K=4: learning accuracy and high risk ratio (%) as functions of the number of trees.
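A minimal sketch (ours) of the confidence metric, assuming the per-testcase weighted vote sums are available:

```python
import numpy as np

def high_risk_ratio(vote_sums, low=0.3, high=0.7):
    """Fraction of validation testcases whose weighted vote sum
    sum_i w_i * g_i(x) falls inside the ambiguous band [low, high].

    vote_sums: array of per-testcase ensemble scores in [0, 1].
    A lower ratio means higher confidence in the ensemble's answers.
    """
    scores = np.asarray(vote_sums, dtype=float)
    return float(np.mean((scores >= low) & (scores <= high)))

# e.g. high_risk_ratio([0.05, 0.95, 0.45, 0.62]) -> 0.5
```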
VIII. CONCLUSION

Full-chip functional verification of modern microprocessor designs usually adopts a divide-and-conquer strategy to save simulation cost and achieve better debug efficiency, and unit-level verification is critical to this success. The unit under verification is typically embedded inside emulation software which mimics the full-chip behavior; however, it can be very difficult to control unit-level signals from the full-chip boundary. Hence, this work proposes an ordered-binary-decision-forest (OBDF) algorithm that implements an incremental learning framework to automate the estimation of signal controllability and to provide the information needed to govern these unit-level signals. The proposed OBDF algorithm mathematically outperforms the previous ONN algorithm in terms of lower model complexity and lower error variance. We also use Freescale's e200 microprocessor design to demonstrate the effectiveness of the incremental learning framework: the learning results show that 45 inputs of the load/store unit are highly controllable using only simple instruction templates, and the framework accurately estimates the controllability of unit-level inputs from the full-chip boundary.

In the future, several issues deserve further investigation: (1) research on selecting best splits is thriving in the data-mining area, and many other measures have been proposed that could also be applied to decide the supporting variables; (2) searching for the optimal number of trees that achieves the best learning accuracy and a stable high risk ratio can be formulated as an optimization problem to which existing algorithms such as genetic algorithms can be applied; (3) the learned information can be integrated into a pseudo-random test pattern generator (RTPG) to provide better guidance during test generation.

REFERENCES
[1] C. Roth, J. Tyler, P. Jagodik and H. Nguyen, "Divide and conquer approach to functional verification of PowerPC microprocessors," Proc. Int'l Workshop on Rapid System Prototyping, pp. 128-133, 1997.
[2] J. Monaco, D. Holloway and R. Raina, "Functional verification methodology for the PowerPC 604 microprocessor," Proc. Design Automation Conf., pp. 319-324, 1996.
[3] K. Albin, "Nuts and bolts of core and SoC verification," Proc. Design Automation Conf., pp. 249-252, 2001.
[4] C. Scafidi, J. D. Gibson and R. Bhatia, "Validating the Itanium 2 exception control unit: a unit-level approach," IEEE Design & Test of Computers, pp. 94-101, 2004.
[5] C. Wen, L.-C. Wang and K.-T. Cheng, "Simulation-based functional test generation for embedded processors," IEEE Trans. on Computers, vol. 55, no. 11, pp. 1-9, Nov. 2006.
[6] C. Wen, O. Guzey, L.-C. Wang and J. Yang, "Simulation-based functional test justification using a Boolean data miner," Proc. IEEE Int'l Conf. on Computer Design, 2006.
[7] L. Valiant, "A theory of the learnable," Communications of the ACM, vol. 27, no. 11, pp. 1134-1142, 1984.
[8] M. Kearns, M. Li, L. Pitt and L. Valiant, "On the learnability of Boolean formulae," Proc. 19th Symp. on Theory of Computing, pp. 285-295, 1987.
[9] D. Angluin, "Queries and concept learning," Machine Learning, vol. 2, no. 4, pp. 319-342, 1987.
[10] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
[11] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1999.
[12] A. Krogh and P. Sollich, "Statistical mechanics of ensemble learning," Physical Review E, vol. 55, no. 1, pp. 811-816, 1997.
[13] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, vol. 8, no. 3-4, pp. 385-403, 1996.
[14] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[15] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, California, 1984.
[16] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[17] B. Efron and R. Tibshirani, "The bootstrap method for standard errors and confidence intervals of the adjusted attributable risk," Statistical Science, vol. 1, no. 1, pp. 54-77, 1986.
[18] I. Kononenko, "On biases in estimating multi-valued attributes," Proc. Int'l Joint Conf. on Artificial Intelligence, pp. 1034-1040, 1995.
[19] R. E. Bryant, "Graph-based algorithms for Boolean function manipulation," IEEE Trans. on Computers, vol. C-35, no. 8, pp. 677-691, 1986.