Learning Policies for Local Instruction Scheduling

J. Eliot B. Moss,1 John Cavazos, Darko Stefanović, Paul Utgoff, Doina Precup
Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610, USA; {moss,cavazos,stefanov,utgoff}@cs.umass.edu

David Scheeff, Carla Brodley
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA; {scheeff,brodley}@ecn.purdue.edu
Submitted to SIGPLAN PLDI '98, 7 November 1997

Abstract. Execution speed of programs on modern computer architectures is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, it is now customary for an optimizing compiler to employ a heuristic algorithm for instruction scheduling. These algorithms are currently hand-crafted. We show how to cast the local instruction scheduling problem as a machine learning task, so that one obtains a heuristic scheduling algorithm automatically. Our empirical results demonstrate that just a few features are adequate for competence at this task for a real modern processor, and that any of several supervised learning methods performs nearly optimally with respect to the features used. The learning goal is to minimize the cycles needed for each basic block, in isolation and assuming all memory accesses hit in the cache. We found that competence at this task resulted in good measured execution times for entire scheduled programs. In the end, we essentially match the performance of commercial compilers for the same processor, and offer some evidence that our learned heuristics are nearly optimal.
1 Introduction

Modern computer architectures provide semantics of execution equivalent to sequential execution of instructions one at a time. However, to achieve higher execution efficiency, they employ a high degree of internal parallelism. Overall execution time can vary widely, because individual instruction execution times vary, depending on when an instruction's inputs are available, when its computing resources are available, and when it is presented. Based on just the data and control dependences of instructions, a sequence of instructions usually has many permutations with equivalent meaning, but the different sequences may have considerably different execution times. Compiler writers therefore include algorithms to schedule instructions to achieve low execution time. Currently they hand-craft such algorithms for each compiler and target processor. We apply machine learning to construct the scheduling algorithm and thus automate this process.

We focus on local instruction scheduling, i.e., ordering instructions within a basic block. For our purposes, a basic block is a straight-line sequence of code with a conditional or unconditional branch instruction at the end. The scheduler should find optimal, or good, orderings of the instructions prior to the branch. It is safe to assume that the compiler generated a semantically correct sequence of instructions for each basic block. We consider only reorderings of each sequence (not more general rewritings), and only those reorderings that cannot affect the semantics. The semantics of interest are captured by dependences of pairs of instructions. Specifically, instruction $I_j$ depends on (must follow) instruction $I_i$ if it follows $I_i$ in the input block and has one or more of the following dependences on $I_i$: (a) $I_j$ uses a register used by $I_i$ and at least one of them writes the register (condition codes, if any, are treated as a register); (b) $I_j$ accesses a memory location that may be the same as one accessed by $I_i$, and at least one of them writes the location.
1 Corresponding author; telephone +1-413-545-4206.
[Figure 1 here. (a) C code: X = V; Y = *P; P = P + 1;  (b) Instruction sequence to be scheduled: 1: STQ R1, X; 2: LDQ R2, 0(R10); 3: STQ R2, Y; 4: ADDQ R10, R10, 8.  (c) Dependence DAG of the instructions, marking which are initially available.  (d) A partial schedule, marking scheduled, available, and not-available instructions.]
Figure 1: Example basic block code, DAG, and partial schedule

From the input sequence of instructions, one can thus build a dependence DAG, usually a partial (not a total) order, that represents all the semantics essential for scheduling the instructions of a basic block. Figure 1 shows a sample basic block and its DAG. The task of scheduling is to find a least-cost total order of each block's DAG.
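As a concrete (and much simplified) sketch of this dependence test, not taken from the authors' implementation: the `Instr` encoding and the deliberately conservative `may_alias` stand-in below are our assumptions, and a real compiler's alias analysis would prune the extra memory edges this version adds.

```python
from collections import namedtuple

# Hypothetical encoding: reads/writes are sets of register names; mem is
# None or a (mode, address) pair with mode 'r' or 'w'.
Instr = namedtuple('Instr', 'reads writes mem')

def may_alias(m1, m2):
    # Conservative stand-in for real alias analysis: assume any two
    # memory references may touch the same location.
    return True

def build_dag(block):
    """Return successor lists: edges[i] lists each j > i that must follow i."""
    edges = {i: [] for i in range(len(block))}
    for j in range(len(block)):
        for i in range(j):
            a, b = block[i], block[j]
            # (a) register dependence: one of the pair writes a register
            # the other uses.
            reg_dep = bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)
            # (b) memory dependence: possibly the same location, and at
            # least one of the two accesses writes it.
            mem_dep = (a.mem is not None and b.mem is not None
                       and 'w' in (a.mem[0], b.mem[0])
                       and may_alias(a.mem, b.mem))
            if reg_dep or mem_dep:
                edges[i].append(j)
    return edges

# Figure 1's block (0-based indices here):
block = [
    Instr(reads={'R1'},  writes=set(),   mem=('w', 'X')),       # STQ  R1, X
    Instr(reads={'R10'}, writes={'R2'},  mem=('r', '0(R10)')),  # LDQ  R2, 0(R10)
    Instr(reads={'R2'},  writes=set(),   mem=('w', 'Y')),       # STQ  R2, Y
    Instr(reads={'R10'}, writes={'R10'}, mem=None),             # ADDQ R10, R10, 8
]
print(build_dag(block))  # {0: [1, 2], 1: [2, 3], 2: [], 3: []}
# With exact alias information (X, Y, and 0(R10) known distinct), only
# the edges 1->2 and 1->3 would remain, matching Figure 1's DAG.
```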
2 Learning to Schedule

The learning task is to produce a scheduling procedure to use in the performance task of scheduling the instructions of basic blocks. One needs to transform the partial order of instructions into a total order that will execute as efficiently as possible. At run time, execution costs will be affected by preceding blocks (which may leave functional units busy, etc.) and by memory references that miss in the cache. To simplify our scheduling algorithms and the learning task, we ignored these effects. Our performance results demonstrate that this simplification does not result in significantly poorer local instruction scheduling, as measured in terms of execution times of scheduled benchmark programs.

We consider the class of schedulers that work by repeatedly selecting the apparent best of those instructions that could be scheduled next, proceeding from the beginning of the block to the end. While one can imagine other approaches to scheduling (such as permuting non-dependent instructions), this greedy approach is practical for production use, and in fact corresponds to list scheduling, a technique used in production compilers.

Because the scheduler selects the apparent best from those instructions that could be selected next, the learning task consists of learning to make this selection accurately. Hence, the learner needs to acquire the notion of 'apparent best instruction'. The process of selecting the best of the alternatives is like finding the maximum of a list of numbers: one keeps the current best in hand and proceeds with pairwise comparisons, always keeping the better of the two. One can view this as learning a relation over triples $(P, I_i, I_j)$, where $P$ is the partial schedule (the total order of what has been scheduled, and the partial order remaining) and $I_i$ and $I_j$ are candidate instructions from the set from which the selection is to be made. Those triples that belong to the relation define pairwise preferences in which the first instruction is considered preferable to the second. Each triple that does not belong to the relation represents a pair in which the first instruction is not better than the second.

We frame the learning problem as a supervised learning task, in which one presents positive and negative examples of a concept (in this case, examples of triples in the relation and of triples not in the relation) to the learning algorithm. The learning algorithm then chooses a predicate, within the space of predicates that it can represent, with the aim of separating the positive and negative examples. That is, the resulting predicate should have a low error rate in its determination of whether triples are or are not members of the relation.

To implement this plan, we must choose a representation in which to state the relation, create a process to supply positive and negative examples, and modify the learned predicate in the face of the examples presented. We now consider
each of these steps in greater detail.
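Before turning to those steps, the greedy selection loop itself can be sketched as follows. This is our sketch, not the authors' code: the hypothetical `prefer(P, a, b)` stands in for the learned relation, and `edges` is a dependence DAG in the successor-list form used above.

```python
def list_schedule(n, edges, prefer):
    """Greedily emit the apparent best ready instruction until done.

    prefer(P, a, b) is the learned predicate: True iff instruction a
    looks better than b given the partial schedule P."""
    preds = {j: set() for j in range(n)}
    for i, succs in edges.items():
        for j in succs:
            preds[j].add(i)
    scheduled = []
    done = set()
    ready = {i for i in range(n) if not preds[i]}
    while ready:
        # Like finding the maximum of a list: keep the current best in
        # hand and compare it pairwise against each other candidate.
        best = None
        for cand in sorted(ready):
            if best is None or prefer(scheduled, cand, best):
                best = cand
        scheduled.append(best)
        done.add(best)
        ready.remove(best)
        # An instruction becomes ready once all its predecessors issue.
        for j in preds:
            preds[j].discard(best)
            if not preds[j] and j not in done and j not in ready:
                ready.add(j)
    return scheduled
```

With `prefer = lambda P, a, b: a < b` this degenerates to source order; the learned relation replaces that trivial tie-break.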
2.1 Representation of Scheduling Preference

Because we have framed the determination of the apparent best instruction as learning a logical relation, the learning algorithm must construct or revise a boolean expression that evaluates to true if and only if the triple $(P, I_i, I_j)$ belongs to the relation. We note that if $(P, I_i, I_j)$ is considered to be a member of the relation, then it is safe to infer that $(P, I_j, I_i)$ is not a member.
What are the primitive facts from which we can construct the boolean expression? In this case we need features of the candidate instructions $I_i$ and $I_j$ and of the partial schedule $P$. There is considerable art to choosing useful features. We were greatly assisted by having in hand a good hand-crafted heuristic scheduler (called DEC below), supplied by the processor vendor, which suggested potentially useful features. We describe our feature set in detail below.
2.2 Obtaining Examples and Counter-Examples

How can we construct a set of example triples $(P, I_i, I_j)$ (and counter-examples $(P, I_j, I_i)$)? For local instruction scheduling, we can search the entire space of schedules for basic blocks that are not prohibitively long. We found it practical to do this only for blocks of ten or fewer instructions. For such blocks, for each partial schedule $P$ and each candidate next instruction $I_i$, given a timing model of the processor we can determine the optimal cost of schedules starting with $P$ and choosing $I_i$ next. If the cost for $P$ followed by $I_i$ is less than the cost of $P$ followed by $I_j$, we generate an example triple $(P, I_i, I_j)$ and a counter-example $(P, I_j, I_i)$. We had no trouble generating millions of examples and counter-examples.

Is this procedure biased because we considered only small blocks? For the benchmark programs we studied, 92% of the basic blocks were small enough, and the average block size was 4.9 instructions. However, we did see that large blocks tend to be executed more often and thus have a disproportionate impact on execution time. To determine what effect omitting the large blocks from the training data had on learning, we generated examples for large blocks by examining a large sample of schedules using Monte Carlo techniques. We found that using the large block examples resulted in no scheduling improvement over using the short block examples.
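A brute-force sketch of this generation procedure follows. Here `sim_cost` stands in for the DEC timing model, `preds` maps each instruction to the set it must follow, and plain enumeration of permutations is, as the text says, only feasible for small blocks.

```python
from itertools import permutations

def legal(order, preds):
    """True iff the order respects every dependence in preds."""
    pos = {inst: k for k, inst in enumerate(order)}
    return all(pos[i] < pos[j] for j in preds for i in preds[j])

def make_examples(n, preds, sim_cost):
    """For each decision point of an n-instruction block, emit an example
    (P, a, b) when choosing a over b leads to a cheaper optimal completion,
    plus the mirrored counter-example (P, b, a)."""
    orders = [o for o in permutations(range(n)) if legal(o, preds)]

    def opt(prefix):
        # Optimal cost over all legal schedules extending this prefix.
        return min(sim_cost(o) for o in orders if o[:len(prefix)] == prefix)

    pos_ex, neg_ex = [], []
    prefixes = {o[:k] for o in orders for k in range(n)}
    for P in prefixes:
        ready = {o[len(P)] for o in orders if o[:len(P)] == P}
        for a in ready:
            for b in ready:
                if a != b and opt(P + (a,)) < opt(P + (b,)):
                    pos_ex.append((P, a, b))  # example: choosing a is better
                    neg_ex.append((P, b, a))  # mirrored counter-example
    return pos_ex, neg_ex
```

A production version would memoize `opt` on prefixes rather than rescanning the order list, but the structure of the search is the same.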
2.3 Updating the Preference Relation

There are many learning algorithms potentially suitable for building an expression of our preference relation. We used a decision tree induction package (ITI [10]), a rule set induction package (Ripper (RULE) [3]), table lookup (TLU),1 a function approximator (ELF [11]), and a feed-forward artificial neural network (NN [6]). A companion paper [5] includes a detailed description of each learning algorithm and some comparisons of all schemes except for RULE. Since the schemes performed about the same, we report here results only for ELF and RULE.

1 This basically memorizes all the examples and counter-examples, and interpolates for cases not seen. If there are conflicting examples and counter-examples, it chooses the majority class (positive or negative).

3 Empirical Results

We designed our experiments to answer the following questions: Can we schedule as well as hand-crafted algorithms in production compilers? Can we schedule as well as the best hand-crafted algorithms? How close can we come to optimal
schedules? We answer the first two questions by comparing program execution times in three forms: times predicted from simulations of individual basic blocks (multiplied by the number of executions of the blocks as measured in sample program runs), cycle counts from a cycle-level simulator, and actual run time measurements. The predicted execution time measure seems fair for local instruction scheduling, since it omits the other execution-time factors we ignore; that is, it indicates how well the learning algorithms learned the task as posed. The actual run time measurements indicate whether doing well on the learning task carries over to actual performance improvements, which of course is what compiler implementers ultimately care about. We used simulated cycle counts as well in order to consider inter-block interference effects separately from cache effects: we set the simulator to treat every memory access as a cache hit, so the simulated times reflect only inter-block effects.

Answering the optimality question is more difficult, since it is infeasible to generate optimal schedules for long blocks. We offer a partial answer by measuring the number of optimal choices made within small blocks.

To proceed, we selected a computer architecture implementation and a standard suite of benchmark programs (SPEC95) compiled for that architecture. We extracted basic blocks from the compiled programs and used them for training, testing, and evaluation as described below.
3.1 Architecture and Benchmarks

We chose the Digital Alpha [7] as our architecture for the instruction scheduling problem. When introduced, it was the fastest scalar processor available, and from an instruction dependence analysis standpoint its instruction set is simple. The 21064 implementation of the instruction set [4] is interestingly complex, having two dissimilar pipelines and the ability to issue two instructions per cycle (also called dual issue) if a complicated collection of conditions holds. Instructions take from one to many tens of cycles to execute.

We used the SPEC95 benchmark suite, which consists of 18 programs: 10 written in FORTRAN, which tend to use floating point calculations heavily, and 8 written in C, which focus more on integers, character strings, and pointer manipulations. These were compiled with the vendor's compilers, set at the highest level of optimization offered, which includes compile- or link-time instruction scheduling. We call these the Orig schedules for the blocks. The resulting collection has 447,127 basic blocks, composed of 2,205,466 instructions.
3.2 Simulator, Schedulers, and Features

Researchers at Digital made publicly available a simulator for basic blocks on the 21064, which indicates how many cycles a given block requires for execution, assuming all memory references hit in the caches and translation look-aside buffers, and no resources are busy when the basic block starts execution. When presenting a basic block, one can also request that the simulator apply a heuristic greedy scheduling algorithm. We call this scheduler DEC.

By examining the DEC scheduler, applying intuition, and considering the results of various preliminary experiments, we chose to learn with the features shown in Table 1. The possible values for the features are: odd: a single boolean 'yes' or 'no'; wcp and e: 'first', 'second', or 'same', indicating whether the first instruction ($I_i$) has the larger value, the second instruction ($I_j$) has the larger value, or the values are the same; d: 'first', 'second', 'both', or 'neither', indicating which of $I_i$ and $I_j$ can dual-issue with the previous instruction; ic0 and ic1: both instructions' values, expressed as one of twenty
named categories ('load', 'store', 'iarith', etc.). For ELF the values for each feature are mapped to a 1-of-n vector of bits, n being the number of distinct values.2

2 ELF works with numeric inputs; the 1-of-n encoding insures that no order is imposed on symbolic input values.

Odd Partial (odd): Is the current number of instructions scheduled odd or even? Intuition: if 'yes', we're interested in scheduling instructions that can dual-issue with the previous instruction.

Instruction Class (ic): The Alpha's instructions can be divided into about 20 equivalence classes with respect to timing properties. Intuition: the instructions in each class can be executed only in certain execution pipelines, etc.

Weighted Critical Path (wcp): The height of the instruction in the DAG (the length of the longest chain of instructions dependent on this one), with edges weighted by the expected latency of the result produced by the instruction. Intuition: instructions on longer weighted critical paths should be scheduled first, because they directly affect the lower bound of the schedule cost.

Actual Dual (d): Can the instruction dual-issue with the previously scheduled instruction? Intuition: if Odd Partial is 'yes', it is important that we find an instruction, if there is one, that can issue in the same cycle as the previously scheduled instruction.

Max Delay (e): The earliest cycle when the instruction can begin to execute, relative to the current cycle; this takes into account any wait for inputs or for functional units to become available. Intuition: we want to schedule instructions that will have their data and functional unit available earliest.

Table 1: Features for Instructions and Partial Schedule

This mapping of triples to feature values loses information. The loss does not affect learning much (as shown by preliminary experiments omitted here), but it reduces the size of the input space, and tends to improve both speed and accuracy of learning for some learning algorithms.
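As a rough sketch of how some of these features might be computed, the fragment below is our simplification: `latency` stands in for the timing model, `succs` is the dependence DAG in the successor-list form of Section 1, and memoization is omitted (fine for small DAGs).

```python
def wcp(i, succs, latency):
    """Weighted critical path: the longest chain below instruction i in
    the DAG, with edges weighted by the expected result latency."""
    if not succs[i]:
        return latency(i)
    return latency(i) + max(wcp(j, succs, latency) for j in succs[i])

def compare(vi, vj):
    """Map a numeric feature of the pair (I_i, I_j) to the symbolic
    values used for learning: which instruction has the larger value?"""
    if vi > vj:
        return 'first'
    if vj > vi:
        return 'second'
    return 'same'

def one_of_n(value, domain):
    """The 1-of-n bit-vector encoding used to feed symbolic values to ELF."""
    return [1 if value == v else 0 for v in domain]

# e.g. one_of_n(compare(5, 2), ['first', 'second', 'same']) == [1, 0, 0]
```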
3.3 Experimental Procedures

From the 18 SPEC95 programs, we used the ATOM tool [8] to extract all basic blocks, and also to determine, for sample runs of each program, the number of times each basic block was executed. For blocks having no more than ten instructions, we used exhaustive search of all possible schedules to (a) find instruction decision points with pairs of choices where one choice is optimal and the other is not, and (b) determine the best schedule cost attainable for either decision. Schedule costs are always as judged by the DEC simulator. This procedure produced over 13,000,000 distinct choice pairs, resulting in over 26,000,000 triples (given that swapping $I_i$ and $I_j$ creates a counter-example from an example and vice versa). We selected 1% of the choice pairs at random (always insuring we had matched pairs of example/counter-example triples). For each learning scheme we performed an 18-fold cross-validation, holding out one benchmark program's blocks for independent testing. That is, we would train on 17 programs' blocks, and evaluate the learned heuristic on the 18th program, so that we test the learning scheme's ability to generalize appropriately to examples it has not seen before.

We evaluated both how often the trained scheduler made optimal decisions, and three measures of execution time of the resulting schedules. The first measure, which we call predicted execution time, was computed as the sum of predicted basic block costs weighted by execution frequency, with the frequencies taken from sample program runs as described above. The predicted basic block costs are as produced by the DEC simulator. The second measure is machine cycles as predicted by the Zippy cycle-level simulator [2]. This simulator was provided by Digital Equipment Corporation, and is distinct from the previously mentioned DEC simulator, which only produces estimates for basic blocks, in isolation and without cache effects. We set Zippy to assume all cache accesses were hits in the primary cache, but otherwise it included all real instruction execution costs. We validated Zippy to insure that it was faithful to the Alpha 21064 implementation. Our third measure is machine cycles as measured using the Alpha hardware performance counters. The first and second measures are from simulators and are hence repeatable. The third measure is on actual hardware, and though we ran in single-user mode, we still saw variation from run to run. Therefore we performed 35 runs of each program to obtain reliable estimates of the average running time and its variation.
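As a minimal illustration, the first measure reduces to a frequency-weighted sum over blocks; the dictionary layout here is our hypothetical choice.

```python
def predicted_time(block_cycles, block_freq):
    """Predicted execution time: DEC-simulator cycles per basic block,
    weighted by how often the block executed in the sample runs."""
    return sum(block_cycles[b] * block_freq[b] for b in block_cycles)
```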
Scheduling Scheme   25th %ile   50th %ile (median)   75th %ile
Rand                1.030       1.087                1.130
RULE                0.979       0.993                1.015
DEC                 0.989       0.998                1.005
ELF                 0.976       0.993                1.002
Orig                0.976       0.989                1.000

Table 2: Experimental Results: Normalized Measured Execution Times
3.4 Statistical Analysis Procedures and Results

Consider first our analysis of actual running times. In examining the distributions of running times for each benchmark as scheduled by each scheduler, we found the distributions were not normal, so we used a non-parametric procedure (Friedman Repeated Measures Analysis of Variance (ANOVA) on Ranks) to compare schedulers. Since the absolute running times of the 18 benchmarks vary widely, we first normalized all runs of each benchmark to the median time of the runs for the DEC scheduler (on the same benchmark). The ANOVA test sorts the 3150 (18 benchmarks × 35 runs × 5 schedulers) normalized execution times and gives them ranks according to the sort. It then considers the rank numbers for the 630 runs associated with each scheduler, and determines whether that scheduler's runs were faster, slower, or essentially the same as runs of each other scheduler, using a test of statistical significance.

Table 2 shows the median (50th percentile) and the 25th and 75th percentile normalized execution times resulting from each scheduler's runs, in order of increasing scheduler quality as determined by the ANOVA analysis.3 The difference in each pair of schedulers was statistically significant (p < .05), but the table shows that the differences are small and perhaps not of great practical significance, except that all heuristic schedulers noticeably outperform random scheduling. (The Rand scheduler simply flips a coin at each decision point.) The same information is presented graphically in Figure 2, which shows both the variation over all runs and the variation of the medians of the benchmarks.

3 The lines of the table really are in the proper order; the RULE scheduler produced enough times worse than DEC's to be ranked slower than DEC.

We present the predicted execution time results in Table 3 and Figure 3. Again we used an ANOVA test, but this time on the 18 predicted execution times, normalized to the DEC scheduler's predicted time. The analysis ranked the schedulers Rand < Orig < DEC = ELF = RULE in quality (Rand worst, Orig next, and the other three equivalent). The Zippy results were quite similar to the predicted time results, so we omit them.
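The omnibus part of this rank-based comparison can be reproduced in outline with SciPy's Friedman test. The sketch below uses placeholder random data and omits the pairwise post-hoc comparisons the full procedure requires; the array layout and seed are our choices.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# norm[b, s, r]: run r of scheduler s on benchmark b, each time divided
# by the DEC median for that benchmark (placeholder data below).
rng = np.random.default_rng(0)
norm = rng.normal(1.0, 0.05, size=(18, 5, 35))  # benchmarks x schedulers x runs

# One sample per scheduler, treating each (benchmark, run) pair as a
# block of five related measurements.
samples = norm.transpose(1, 0, 2).reshape(5, -1)  # shape (5, 630)
stat, p = friedmanchisquare(*samples)
print(f"Friedman chi-square = {stat:.1f}, p = {p:.4g}")
```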
[Figure 2 here: for each scheduler (Rand, RULE, DEC, ELF, Orig), a plot of normalized measured execution times showing the min, max, quartiles of all runs, and quartiles of the benchmark medians; values range from about 0.8 to 1.5.]

Figure 2: Experimental Results: Normalized Measured Execution Times
[Figure 3 here: for each scheduler (Rand, RULE, DEC, ELF, Orig), the min, median, and max normalized predicted execution times; values range from about 0.8 to 2.2.]

Figure 3: Experimental Results: Predicted Execution Time
Scheduling Scheme   25th %ile   50th %ile (median)   75th %ile
Rand                1.134       1.246                1.540
Orig                1.012       1.019                1.031
RULE                0.999       1.000                1.002
DEC                 1.000       1.000                1.000
ELF                 0.995       1.000                1.002

Table 3: Experimental Results: Predicted Execution Time

While there are a number of differences in the rankings of the schedulers by measured execution time, predicted execution time, and the ANOVA analysis, overall the heuristic schedulers perform similarly, and noticeably better than random. Another way of understanding the results is to see how often the schedulers make optimal choices. As previously explained, we can determine optimality only for small basic blocks. We reported this previously [5] and repeat that result here in Table 4 for convenience; it does not include the RULE scheduler. We see that there is little room for improvement, at least in terms of our local instruction scheduling goal. In separate experiments we determined that the DEC scheduler produces optimal schedules for small blocks 99.5% of the time.

Scheduler   % Optimal Choices (95% conf. int.)
ITI         97.777 (97.474, 98.081)
NN          97.496 (97.058, 97.936)
TLU         96.683 (96.229, 97.140)
ELF         96.568 (96.083, 97.055)

Table 4: Experimental Results: Optimal Choices in Small Blocks

We tried to improve the effectiveness of our learned schedulers by adding more features, in hopes of making optimal choices closer to 100% of the time. The features we added offered no significant improvement. Indeed, with performance this close to optimal already, it may be unrealistic to expect more.
3.5 Sample Learned Preference Predicate

Some of the learning schemes we used produce boolean expressions that are difficult for humans to understand, especially those based on numeric weights and thresholds, such as ELF and NN. Decision trees (ITI) and rule sets are easier to comprehend, and it is also relatively easy to generate code that will evaluate the learned preference expression in a scheduler. We show below a rule set learned by training on examples drawn from all 18 SPEC benchmark programs. If the right-hand-side condition of any rule (except the last) is met, then the first instruction of a pair is preferred; otherwise the second is preferred. The numbers in the first two columns give the number of correct and incorrect training examples matching the condition of the rule.
(77838/ 600)  first   e=second
(15036/ 490)  first   e=same ^ wcp=first
(  412/ 264)  first   e=same ^ wcp=same ^ d=first ^ ic0='load'
(   82/  15)  first   e=same ^ wcp=same ^ d=first ^ ic0='store'
(  450/ 115)  first   e=same ^ wcp=same ^ d=first ^ ic0='ilogical'
(  190/  20)  first   e=same ^ wcp=same ^ d=first ^ ic0='fpop'
(   81/  39)  first   e=same ^ wcp=same ^ d=first ^ ic0='iarith' ^ ic1='load'
(  127/  41)  first   e=same ^ wcp=same ^ d=first ^ ic1='iarith'
(  247/   0)  first   e=same ^ wcp=same ^ d=first ^ odd=yes
(  225/  93)  first   e=same ^ wcp=same ^ d=both ^ ic0='store'
(  320/  44)  first   e=same ^ wcp=same ^ d=both ^ ic0='ilogical' ^ ic1='fpop'
(   22/   0)  first   e=same ^ wcp=same ^ d=both ^ ic0='fpop' ^ ic1='load'
(  762/ 135)  first   e=same ^ wcp=same ^ d=both ^ ic1='load' ^ odd='yes'
(96367/2130)  second  true
In this case we see that e, the time before the instruction can start executing, and wcp, the length of the critical path from the instruction to the end of the basic block, are the most important features, with the rest offering fine tuning when e and wcp are identical for both instructions.
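Transcribed directly into code, the top of this rule set reads as follows; the dictionary of symbolic feature values is our hypothetical interface, and the remaining rules are elided.

```python
def prefer_first(f):
    """The first rules of the learned set, as a predicate.  f maps
    feature names to the symbolic values of Section 3.2 (the dictionary
    layout is our choice, not the paper's)."""
    if f['e'] == 'second':       # I_j must wait longer to start
        return True
    if f['e'] == 'same':
        if f['wcp'] == 'first':  # I_i sits on the longer critical path
            return True
        if f['wcp'] == 'same' and f['d'] == 'first' and f['ic0'] == 'load':
            return True
        # ... the remaining e=same rules fine-tune in the same style ...
    return False                 # default rule: prefer the second instruction
```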
4 Related Work

Instruction scheduling is a well-known issue, and others have proposed a variety of heuristics and techniques. It is also known that optimal instruction scheduling for today's complex processors is generally NP-complete. We found two pieces of work more closely related to ours.

One is a patent [9]. From the patent's claims it appears that the inventors used a scheduler with heuristics weighted by the weights in something like a neural network. They evaluate each weight setting by scheduling an entire benchmark suite, running the resulting programs, and using the resulting times to choose among a set of possible weight adjustments. The inventors' approach appears to us to be potentially very time-consuming. It does have two advantages over our technique: in the learning process it uses measured execution times rather than simulated times, and it does not require a timing simulator. Being a patent, this work does not state any experimental results.

The other related item is the application of genetic algorithms to tuning weights of heuristics used in a list scheduler [1]. The authors showed that different hardware targets resulted in different learned weights, but they did not offer experimental evaluation of the quality of the resulting schedulers.

We previously reported [5] some of the results presented here. To that prior work we have added the RULE scheduler, the measured execution times, the simulated (Zippy) execution times, and improved statistical analyses. The validation of prior estimated execution times with actual and simulated measurements is crucial to demonstrating the effectiveness of our technique in practice. In addition we also tried using examples from large basic blocks and adding more features; neither effort produced better schedulers.
5 Conclusions and Outlook

The majority of the effort in this research was in determining how to cast this problem in a form suitable for machine learning. Once that form was accomplished, supervised learning produced quite good results on this practical problem, comparable to two vendor production compilers, as shown on a standard benchmark suite used for evaluating such optimizations. Thus the outlook for using machine learning in this application appears promising.

Another significant result is that learning with the goal of minimizing each basic block's execution time, ignoring other blocks and the effects of the memory hierarchy, worked well in the presence of those effects. Put another way, local instruction scheduling improvements generally carried over to overall execution time performance. A third result is that the supervised learning schemes performed about equally well, with only minor performance differences between them. A fourth result is that adding examples from large blocks, obtained using Monte Carlo techniques to limit the search for optimal schedules, offered no significant improvement.

On the other hand, significant work remains. The current experiments are for a particular processor; can they be generalized to other processors? After all, one of our goals is to improve and speed processor design by enabling more rapid construction of optimizing compilers for proposed architectures. What about global instruction scheduling? Since the search space of possible global schedules will almost always be large, our supervised learning approach may fail because we cannot generate examples of optimal and sub-optimal decisions, since we cannot find optimal schedules. Our ultimate goal is to extend the approach to learn policies for combined global register allocation and global instruction scheduling, which appears to be quite challenging to frame as a learning problem. Another relevant issue is the cost not of the schedules but of the schedulers: are these schedulers fast enough to use in production compilers? Again, this demands further experimental work.

Our primary conclusion is that machine learning can find, automatically, quite competent local instruction scheduling heuristics.

Acknowledgments: We thank various people at Digital Equipment Corporation for the DEC scheduler, the ATOM program instrumentation tool, and the Zippy simulator, without which this work would not have been accomplished. David Jensen provided helpful guidance on statistical analyses. Jeffrey Dean brought the related patent [9] to our attention. We also thank Sun Microsystems and Hewlett-Packard for their support.
References

[1] Steven J. Beaty, Scott Colcord, and Philip H. Sweany. Using genetic algorithms to fine-tune instruction-scheduling heuristics. In Proceedings of the International Conference on Massively Parallel Computer Systems, 1996.

[2] Brad Calder, Dirk Grunwald, and Joel Emer. A system level perspective on branch architecture performance. In Proceedings of the 28th International Symposium on Microarchitecture, pages 199–206, 1995.

[3] William W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, CA, 1995.

[4] Digital Equipment Corporation, Maynard, MA. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, October 1992.

[5] J. Eliot B. Moss, Paul E. Utgoff, John Cavazos, Doina Precup, Darko Stefanović, Carla Brodley, and David Scheeff. Learning to schedule straight-line code. In Proceedings of Neural Information Processing Systems 1997 (NIPS*97), Denver, CO, December 1997. To appear.

[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Rumelhart and McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA, 1986.

[7] Richard Sites. Alpha Architecture Reference Manual. Digital Equipment Corporation, Maynard, MA, 1992.

[8] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 196–205, Orlando, FL, June 1994. Association for Computing Machinery.
[9] Gregory Tarsy and Michael J. Woodard. Method and apparatus for optimizing cost-based heuristic instruction schedulers. US Patent #5,367,687. Filed 7/7/93, granted 11/22/94.

[10] P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, 1997. To appear.

[11] P. E. Utgoff and D. Precup. Constructive function approximation. Technical Report 97-04, University of Massachusetts, Department of Computer Science, Amherst, MA, 1997.