Architectural Features of Processors to Implement Parallelism at Instruction Level in Assembly Level Language

Nitin Mathur
M.E. Scholar, Dept. of Computer Science and Engineering, M.B.M. Engineering College, J. N. V. University, Jodhpur, India
[email protected]

Rajesh Purohit
Associate Professor, Dept. of Computer Science and Engineering, M.B.M. Engineering College, J. N. V. University, Jodhpur, India
[email protected]

Abstract— Parallel processing utilizes concurrent events in multiple operations handled by the compiler or the processor hardware. Concurrent events engage parallelism, simultaneity and pipelining. Parallelism in a program can be implemented at coarse grain, middle grain and fine grain, called job-level, module-level and instruction-level parallelism. Instruction-level parallelism is realized by processors; according to where the decision-taking power lies, instruction-level parallelism architectures can be classified into two categories: Superscalar and VLIW (Very Long Instruction Word). Superscalar architectures use special hardware to analyze the instruction stream at execution time and determine which operations in the stream are independent of all preceding operations and have their source operands available; these operations can then be issued and executed concurrently [7], [21]. VLIW architectures instead increase the resources available to the compiler: static scheduling techniques let the compiler determine sets of operations that have their source operands ready and no dependencies within the set, so the hardware can issue and execute them directly with no dynamic analysis [7], [24]. Processor architectures use different techniques, such as the reservation station, the reorder buffer and branch prediction, to achieve instruction-level parallelism up to a desired level; thus, closely tied ILP-exploiting techniques should be embedded in the processor. In this paper, various processor architectures are compared to explore the characteristics that are significant for instruction-level parallelism in assembly level language.

Index Terms— Parallelism, Superscalar, VLIW, Reservation Station, Reorder Buffer, Branch Prediction.

I. INTRODUCTION

The computer industry has grown accustomed to a spectacular rate of increase in microprocessor performance at assembly level. Architectural advances enhance performance further because they achieve a higher degree of parallelism, and future increases in performance will be forced to rely more heavily on advances in computer architecture. Parallelism can be applied at various levels of processing, such as job, module and instruction. Instruction-level parallelism (ILP) is realized by processor architectures and compiler techniques that speed up execution by causing individual machine operations to execute in parallel. It is necessary to take decisions about when and whether individual operations should be executed.

Instruction-level parallelism is implemented in a superscalar processor at the micro-architecture level by the reservation station, the reorder buffer and branch prediction. The reservation station holds instructions that are ready to execute but are waiting for their operands; when instructions receive their operands, they are dispatched to their respective execution units in parallel. The reorder buffer tracks all instructions from dispatch to retirement and restores program order to instructions that executed out of order. Branch prediction techniques predict the branches that will occur in the future; branch prediction increases the number of instructions in execution because it resolves branches before they execute. Parallelism and its different levels are discussed in the next section. Details of different ILP architectures are described in section III. Architectural features of processors are explained in section IV. In section V, different processor architectures are compared, and the conclusion appears in section VI.

II. INSTRUCTION-LEVEL PARALLELISM (ILP)

Parallelism has been exploited at various processing levels such as job, module and instruction, also called coarse grain, middle grain and fine grain. These levels of program execution represent different computational grain sizes and changing communication and control requirements. The lower the level, the finer the granularity the machine exploits. Granularity, or grain size, is a measure of the amount of computation involved in a piece of software; the simplest measure is to count the number of instructions in the grain. Grain size determines the basic program segment chosen for parallel processing.

Instruction-level parallelism is the lowest level of parallelism. At the instruction or statement level, a typical grain contains fewer than twenty instructions and is called "fine grain". Depending on the individual program, fine-grain parallelism at this level ranges from two to thousands. The advantage of fine-grain computation lies in the abundance of parallelism [24], [25]. ILP can be defined in various ways; some are as follows.

(a) Instruction-level parallelism is defined as the degree of parallelism (measured by the number of instructions) that can be achieved by issuing and executing multiple instructions concurrently [24], [39].

(b) Instruction-level parallelism may be defined as the ability to explore a sequential instruction stream, identify independent instructions, issue multiple instructions per cycle and send them to several execution units in parallel, fully utilizing the available resources [21], [39].

(c) Instruction-level parallelism results from a set of processor and compiler techniques that speed up execution by causing individual machine operations to execute in parallel [7], [21].

(d) Instruction-level parallel processing remains the only viable approach for continuously increasing performance without fundamentally rewriting applications [35].

III. ILP ARCHITECTURES

The end result of instruction-level parallel execution is that multiple operations are simultaneously in execution. It is necessary to take decisions about when and whether an operation should be executed. The alternatives can be broken down according to the extent to which these decisions are made by the compiler rather than by the hardware. On this basis, ILP architectures can be classified as follows.

(a) Superscalar Architecture- Superscalar processors are based on a sequential architecture. Superscalar machines incorporate multiple functional units to achieve greater concurrent processing of multiple instructions and higher execution throughput. A superscalar processor makes a great effort to issue an instruction every cycle so as to execute many instructions in parallel, even though the program is handed to the hardware as a sequential stream.

With every instruction that a superscalar processor issues, it must check whether the instruction's operands interfere with the operands of any other instruction in flight. Once an instruction is known to be independent of all others in flight, the hardware must also decide exactly when, and on which available functional unit, to execute it. Superscalar processors rely on hardware to schedule the instructions, which is called dynamic instruction scheduling. Figure 1 shows the pipeline structure of a superscalar processor of degree 3. To be fully utilized, a superscalar processor of degree m must issue m instructions per cycle at all times; if ILP of m is not available, stalls and dead time result while instructions wait for the results of previous instructions [24], [25], [27], [36], [39].

Fig. 1- Pipeline structure of Superscalar processor of degree 3 (successive instructions passing through Ifetch, Decode, Execute and Write Back stages over cycles 0-9)

(b) Very Long Instruction Word- VLIW processors represent the dominant example of machines with an independence architecture. Instructions in a VLIW architecture are very long and may contain hundreds of bits. Each instruction contains a number of operations that are executed in parallel. The program for a VLIW processor specifies exactly which functional unit each operation should be executed on and exactly when each operation should be issued, so as to be independent of all operations issued at the same time as well as of those still in execution. VLIW hardware is not responsible for discovering opportunities to execute multiple operations concurrently; parallelism is implemented by static scheduling, which schedules the instructions at compile time, so run-time scheduling and synchronization are completely eliminated.


Figure 2 shows the pipeline structure of a VLIW processor of degree 3. The main advantage of the VLIW architecture is its simplicity in hardware structure and instruction set. VLIW machines behave much like superscalar machines, with three differences: VLIW instructions are easier to decode than superscalar instructions; the code density of the superscalar is better than that of the VLIW; and superscalar machines can be object-code compatible with a large family of nonparallel machines [25], [27], [36], [37], [38], [39].
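The static-scheduling contract described above can be made concrete with a small sketch. This is a toy illustration in Python, not any real VLIW instruction set: the operation names, the 3-wide bundle format and the dependence check are illustrative assumptions, and only read-after-write dependences are considered.

```python
# Toy "compile-time" bundle packer for a hypothetical 3-wide VLIW machine.
# The compiler, not the hardware, groups operations so that no operation
# reads a register written earlier in the same bundle.

BUNDLE_WIDTH = 3  # degree-3 VLIW, as in Fig. 2

def pack_bundles(ops):
    """ops: (text, dest_regs, src_regs) tuples in program order.
    Greedy packing: start a new bundle when the current one is full or
    when an operation depends on a result produced in this bundle."""
    bundles, current, written = [], [], set()
    for text, dests, srcs in ops:
        if len(current) == BUNDLE_WIDTH or written & set(srcs):
            bundles.append(current)
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        bundles.append(current)
    return bundles

ops = [
    ("add r1,r2,r3", ["r1"], ["r2", "r3"]),
    ("mul r4,r5,r6", ["r4"], ["r5", "r6"]),
    ("sub r7,r1,r4", ["r7"], ["r1", "r4"]),  # RAW-dependent on both above
    ("ld  r8,[r9]",  ["r8"], ["r9"]),
]
print(pack_bundles(ops))  # the dependent sub starts a second bundle
```

Because the grouping is fixed before execution, the hardware can fetch one bundle per cycle and issue its operations with no dynamic dependence analysis; a real VLIW compiler would additionally respect write-after-read and write-after-write hazards, operation latencies and functional-unit assignments.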


Fig. 2- Pipeline structure of VLIW processor of degree 3 (successive instructions passing through Ifetch, Decode, Execute (3 operations) and Write back stages over cycles 0-9)

IV. ARCHITECTURAL FEATURES

The outcome of instruction-level parallel execution is that multiple operations are simultaneously in execution. Instruction-level parallelism is implemented in assembly level language by the micro-architectural features of processors. The term micro-architecture refers to the design features used to reach the target cost, performance and functionality goals of the processor. These features are realized by superscalar processors, since in them the hardware is responsible for ILP. Branch prediction, the reservation station and the reorder buffer are among the main features used to implement ILP in a processor's micro-architecture. In this section, these three features are discussed.

(a) Branch Prediction- The early pipeline stages deal with instruction flow, that is, the processing of branches. The primary goal of branch processing is to maximize the supply of instructions that can execute in parallel in the execution pipeline. Modern processors use branch prediction as the key approach and speculatively execute the instructions on the predicted path of program control flow, as shown in figure 3. There are two types of branch prediction techniques: (i) static branch prediction and (ii) dynamic branch prediction. Static branch prediction algorithms are very simple and by definition do not incorporate any feedback from the run-time environment. By observing run-time behavior, a dynamic branch predictor can predict branches far more effectively, though dynamic branch prediction may require more complex algorithms. Processing of conditional branches has two major components: predicting the branch direction and predicting the branch target. Prediction of branch direction decides whether a branch is taken or not taken. After the direction of a branch is predicted, the actual target address of the next instruction along the predicted path must also be determined. If the branch is predicted not taken, the target address is simply the next sequential address after the branch; if the branch is predicted taken, the target depends on the type of branch. Target prediction must also cover unconditional branches [24], [25].

Fig. 3- Structure of Branch Prediction in Superscalar processor

Most current state-of-the-art superscalar microprocessors contain an out-of-order execution core (also referred to as the dynamic execution core). The operation of such a core can be described in terms of its two critical components, the reservation station (RS) and the reorder buffer (ROB).

(b) Reservation Station- The reservation station is one of the main components of the dynamic execution core. Three tasks are associated with its operation: dispatching, waiting and issuing. The reservation station decouples instruction decoding from instruction execution and provides a buffer that takes up the slack between the decoding and execution stages caused by the temporal variation of throughput in the two stages. Each reservation station is responsible for identifying instructions and scheduling their execution. When an instruction is first dispatched to a reservation station, it may not have all of its source operands and must therefore wait in the reservation station. When an instruction in the reservation station has all of its source operands, it becomes ready for execution and can be issued to the functional unit. If, in a given machine cycle, multiple instructions in a reservation station are ready, a scheduling algorithm (typically oldest first) picks one of them for issue to the functional unit to begin execution. Figure 4 shows the structure of the reservation station.

Fig. 4- Structure of Reservation Station in Superscalar processor.

Based on the placement of the reservation station relative to instruction dispatching, two implementations are possible. If a single buffer is used at the source side of dispatching, it is identified as a centralized reservation station: one reservation station with many entries feeds all the functional units, and instructions are dispatched from it directly to all the functional units to begin execution. If, on the other hand, multiple buffers are placed at the destination side of dispatching, they are identified as distributed reservation stations: each functional unit has its own reservation station on its input side, instructions are dispatched to the individual reservation stations based on their type, and they remain there until they are ready to be issued to the functional unit for execution.

(c) Reorder Buffer (ROB)- The reorder buffer is another principal component of the dynamic execution core. Instructions are fetched and decoded in program order but are executed out of program order. To reconcile out-of-order finishing of execution with in-order completion of instructions, a reorder buffer is needed in the instruction completion stage: instructions enter the reorder buffer in program order at dispatch, may finish execution out of order, but exit the ROB in program order. The reorder buffer contains all the instructions that are in flight, i.e., all the instructions that have been dispatched but not yet completed architecturally. These include the instructions waiting in the reservation stations, those executing in the functional units and those that have finished execution but are waiting to be completed in program order. The status of each instruction is tracked in its entry of the reorder buffer; an instruction can be in one of several states, i.e., awaiting execution, in execution or finished execution, and the status is updated as the instruction traverses from one state to the next. The reorder buffer also tracks whether an instruction is speculative (on a predicted path) or not.

Fig. 5- Structure of Reorder Buffer in Superscalar processor.

When a branch is resolved, a speculative instruction becomes nonspeculative (if the prediction is correct) or invalid (if the prediction is incorrect). Only finished, nonspeculative instructions can be completed; an instruction marked invalid is not architecturally completed when it exits the reorder buffer. The reorder buffer can thus be viewed as the heart, or central control, of the dynamic execution core [24], [25].

A dynamic instruction scheduler, comprising the instruction window and its associated logic, is also used in dynamic execution cores. The instruction window is a single structure that combines the reservation station and the reorder buffer: at dispatch, one combined entry is allocated in the instruction window. Instructions are dispatched into the instruction window, entries of the window monitor the tag buses for pending operands, results are forwarded into the window when ready and instructions are completed from the window [24], [25].

V. CASE STUDY

Instruction-level parallelism is implemented in a processor's micro-architecture using the architectural features discussed above (branch prediction, reservation station, reorder buffer). The processor design and manufacturing market has been dominated over the years by Intel, AMD and Motorola, in that order. In this section, various processor micro-architectures from these companies, listed in table 1, are compared to explore these architectural characteristics.
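The interplay of reservation station and reorder buffer described above can be sketched in a few lines of Python. This is a minimal single-issue toy under stated assumptions: hypothetical instruction tuples, one issue per cycle, no speculation and no WAR/WAW hazards.

```python
# Toy dynamic execution core: a reservation station issues the oldest
# ready instruction; the reorder buffer retires strictly in program order.

def run(instrs):
    """instrs: (name, dest_reg, src_regs, latency) in program order."""
    rob = [{"name": n, "dest": d, "srcs": set(s), "left": lat,
            "state": "waiting"} for n, d, s, lat in instrs]
    produced = {e["dest"] for e in rob}
    ready = {r for e in rob for r in e["srcs"]} - produced  # initial regs
    trace = {"finished": [], "retired": []}
    while len(trace["retired"]) < len(rob):
        # issue: oldest waiting instruction whose operands are all ready
        for e in rob:
            if e["state"] == "waiting" and e["srcs"] <= ready:
                e["state"] = "executing"
                break
        # execute one cycle; instructions may finish out of order
        for e in rob:
            if e["state"] == "executing":
                e["left"] -= 1
                if e["left"] == 0:
                    e["state"] = "finished"
                    ready.add(e["dest"])          # forward the result
                    trace["finished"].append(e["name"])
        # retire only from the ROB head: in-order architectural completion
        for e in rob:
            if e["state"] == "finished":
                e["state"] = "retired"
                trace["retired"].append(e["name"])
            elif e["state"] != "retired":
                break
    return trace

t = run([("mul", "r1", ["r2", "r3"], 3),   # long-latency multiply
         ("add", "r4", ["r1", "r5"], 1),   # waits in the RS on r1 (RAW)
         ("sub", "r6", ["r7", "r8"], 1)])  # independent, finishes early
print(t["finished"])   # out-of-order finish: sub completes first
print(t["retired"])    # in-order retirement: mul, add, sub
```

The independent sub finishes long before the multiply, yet it cannot leave the ROB until the two older instructions have retired, which is exactly the decoupling of finishing and completing that the reorder buffer provides.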

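The branch predictors compared in this case study build on the 2-bit saturating counter credited to Smith. The following sketch is illustrative only: the table size, the PC-indexing scheme and the example branch pattern are assumptions, not the design of any particular processor.

```python
# 2-bit saturating ("Smith") direction predictor: counter values 0-1
# predict not taken, 2-3 predict taken; each outcome nudges the counter.

class SmithPredictor:
    def __init__(self, entries=16):          # entries: a power of two
        self.ctr = [2] * entries             # start weakly taken
        self.mask = entries - 1

    def predict(self, pc):
        return self.ctr[pc & self.mask] >= 2     # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken \
                      else max(0, self.ctr[i] - 1)

# a loop branch: taken 9 times, then falls through once, repeated 4 times
bp, hits = SmithPredictor(), 0
outcomes = ([True] * 9 + [False]) * 4
for taken in outcomes:
    hits += bp.predict(0x40) == taken
    bp.update(0x40, taken)
print(f"{hits}/{len(outcomes)} correct")   # → 36/40: only the loop exits miss
```

The hysteresis of the second bit makes the counter mispredict only the final iteration of each loop visit rather than two iterations, one reason such counters recur throughout the generations compared below.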

(a) Reservation Station- ILP was first implemented by Intel in 1993, in what is called the 5th generation. In this micro-architecture, called P5, Intel used two parallel execution pipelines, the U-pipe and the V-pipe, that worked in a lock-stepped manner [8], [13]. In the same year, Motorola came out with its 68060 micro-architecture, which used dual execution pipelines, called primary and secondary, also working in a lock-stepped manner [23]. AMD appeared with the K5 micro-architecture in 1995, implementing distributed reservation stations with 2 entries each [6], [10], [33]. In the next, 6th generation, Intel launched a new micro-architecture, P6, in 1996, which used a centralized reservation station with 20 entries to hold uops [16], [30]. AMD arrived in the same year with the K6 micro-architecture, which used a scheduler with 24 entries to hold ROPs [1], [6], [34].

In the 7th generation, AMD launched its micro-architecture earlier than Intel, arriving with K7 in 1999. AMD divided its previous scheduler into an integer scheduler and a floating-point scheduler; the integer scheduler, also called a reservation station, works in a distributed manner and has 15 entries, while the FP scheduler has 35 entries [2], [6], [26]. Intel arrived with the Netburst micro-architecture in 2001, in which several individual schedulers, each with 24 entries, are attached to the various execution units [14], [16], [18], [19], [20]. In the 8th generation, Intel introduced the Pentium M micro-architecture in 2003, with a centralized reservation station of 20 entries [19], [20], [40]. In the same year, AMD came out with the K8 micro-architecture, keeping the same scheduler style as its previous design: the integer scheduler has 8 entries and the FP scheduler 36 entries [3], [6], [9]. Intel redesigned its previous micro-architecture and came out with Core in 2006, whose centralized reservation station has 32 entries to hold uops [16], [19], [20]. In the 9th generation, AMD launched its design earlier than Intel, in 2007, with the micro-architecture called K10 or Barcelona; AMD continued its previous scheduler design but increased the entries to 24 for the integer scheduler and 42 for the FP scheduler [4], [6], [11]. Intel launched the Nehalem micro-architecture, a follow-up to the successful Core and Core 2 products, in 2008; its centralized reservation station has 36 entries to hold uops [12], [16], [19], [20], [31], [32]. In the 10th generation, AMD disclosed its new micro-architecture, Bulldozer, in 2011, just before Intel disclosed its own. In this design AMD incorporated two integer clusters and one shared FP cluster in one module, counting each module as two cores; each scheduler has 40 entries [5], [6]. Intel introduced the Sandy Bridge micro-architecture in 2011, building on the success of Core and Nehalem; its scheduler has 54 entries [16], [19], [20], [29].

Table 1- Generations of Processor's Micro-architecture

Generation   INTEL              AMD         MOTOROLA
5th          P5                 K5          68060
6th          P6                 K6          -
7th          Netburst           K7          -
8th          Pentium M, Core    K8          -
9th          Nehalem            K10         -
10th         Sandy Bridge       Bulldozer   -

(b) Reorder Buffer- In the 5th generation, Intel's P5 (1993) used two parallel execution pipelines working in a lock-stepped manner but did not support out-of-order execution [8], [13]. In the same year, Motorola's 68060 likewise used dual lock-stepped execution pipelines and did not support out-of-order execution [23]. Two years later, in 1995, AMD appeared with the K5 micro-architecture, which implemented a reorder buffer with a capacity of 16 entries [6], [10], [33]. In the next, 6th generation, Intel's P6 (1996) used a reorder buffer with 40 entries to hold uops [16], [30]. AMD arrived in the same year with K6, whose instruction control unit (ICU) works like a reorder buffer and has 24 entries to hold ROPs [1], [6], [34].

In the 7th generation, AMD launched K7 in 1999, earlier than Intel; AMD kept the ICU of its previous design and increased the number of entries to 72 [2], [6], [26]. Intel arrived with Netburst in 2001, using a reorder buffer with a capacity of 126 entries, larger than in its previous design [14], [16], [18], [19], [20]. In the 8th generation, Intel's Pentium M (2003) used a reorder buffer with 40 entries [19], [20], [40]. In the same year, AMD's K8 kept the same ICU style as its previous design, with 72 entries [3], [6], [9]. Intel's redesigned Core micro-architecture (2006) uses a reorder buffer with 96 entries [16], [19], [20]. In the 9th generation, AMD's K10/Barcelona (2007) continued the previous ICU design but increased the capacity to 84 entries [4], [6], [11]. Intel's Nehalem (2008) uses a reorder buffer with 128 entries, more than its previous design [12], [16], [19], [20], [31], [32]. In the 10th generation, AMD's Bulldozer (2011) incorporates two retirement units, also called reorder buffers, one per integer cluster, each with 128 entries [5], [6]. Intel's Sandy Bridge, introduced in 2011 just one month after Bulldozer, uses a reorder buffer with 168 entries [16], [19], [20], [29].

(c) Branch Prediction- In the 5th generation, Intel's P5 (1993) used a BTB (Branch Target Buffer) with a 2-bit-history Smith prediction algorithm [8], [13]. In the same year, Motorola's 68060 used the same prediction algorithm as Intel [23]. Two years later, AMD's K5 (1995) implemented a 1-bit branch history algorithm based on the cache line; AMD did not use a separate BTB to store target addresses [6], [10], [33]. In the next, 6th generation, Intel's P6 (1996) employed a BTB with 4 bits of branch history based on the Yeh and Patt algorithm [16], [30]. AMD arrived in the same year with K6, implementing a BTC (Branch Target Cache), a BHT (Branch History Table) based on a two-level branch prediction scheme and a RAS (Return Address Stack) [1], [6], [34].

In the 7th generation, AMD's K7 (1999) implemented a BTB, a BHT based on the 2-bit Smith prediction algorithm and a RAS [2], [6], [26]. Intel's Netburst (2001) used a BTB, BHT and RAS [14], [16], [18], [19], [20]. In the 8th generation, Intel's Pentium M (2003) used a BTB, an indirect branch predictor, a loop detector and a RAS [19], [20], [40]. In the same year, AMD's K8 implemented a BTB, a global-history bimodal counter, a RAS and a target array [3], [6], [9]. Intel's redesigned Core (2006) employed a BTB, a BHT, an indirect branch predictor, a loop stream detector and a RAS [16], [19], [20]. In the 9th generation, AMD's K10/Barcelona (2007) implemented a BTB, a global-history bimodal counter, a RAS and a target array [4], [6], [11]. Intel's Nehalem (2008) used a BTB, a renamed RAS and a loop stream detector [12], [16], [19], [20], [31], [32]. In the 10th generation, AMD's Bulldozer (2011) employed a two-level BTB, a hybrid branch predictor, an indirect target predictor, a RAS and a target array [5], [6]. Intel's Sandy Bridge, introduced in 2011 after Bulldozer, implemented a BTB, a global history table and a loop stream detector [16], [19], [20], [29].

VI. CONCLUSION

The present and future era of computer architecture belongs to micro-architecture invention in assembly level language. Instruction-level parallelism is the most appropriate technique for dealing efficiently with micro-architecture issues, and it is implemented using its most significant elements: the reservation station, the reorder buffer and branch prediction. These are the essential characteristics for realizing instruction-level parallelism at the micro-architecture level in superscalar processors. Compared to the VLIW design, the superscalar design is the preferred choice of the leading processor design companies. The reservation station can be implemented in either a centralized or a distributed manner. The reorder buffer is the central controller of the execution core, as it tracks the states of all instructions. Branch prediction forecasts the occurrence of branches.

Motorola did not utilize a reservation station or a reorder buffer in its ILP design, although a branch prediction technique was used. AMD and Intel have increased the sizes of the reservation station and reorder buffer over the generations to handle more operations simultaneously. Intel uses specific techniques for branch prediction, while AMD uses different branch prediction techniques to handle different types of branches. Intel's center of attention is the multi-core processor architecture, while AMD is moving in the direction of multithreaded processor architecture. Processor performance will depend on the instruction-level parallelism provided by each individual core using these architectural characteristics, as well as on the coarse-grain parallelism supplied by multiple cores in multicore and multithreaded environments.


REFERENCES

[1] Advanced Micro Devices, AMD K6 Processor Data Sheet, 1998.
[2] Advanced Micro Devices, AMD Athlon Processor Code Optimization Guide, 2002.
[3] Advanced Micro Devices, Software Optimization Guide for AMD64 Processors, 2005.
[4] Advanced Micro Devices, Software Optimization Guide for Family 10h and 12h Processors, 2011.
[5] Advanced Micro Devices, Software Optimization Guide for Family 15h Processors, 2012.
[6] Advanced Micro Devices, AMD64 Architecture Programmer's Manual, Volume 6, dated 28-05-2012.
[7] B. Ramakrishna Rau and Joseph A. Fisher, Instruction-Level Parallel Processing: History, Overview and Perspective, The Journal of Supercomputing, 1993.
[8] Brian Case, Intel Reveals Pentium Implementation Details, Microprocessor Report, 1993.
[9] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, Pat Conway, The AMD Opteron Processor for Multiprocessor Servers, IEEE Computer Society, 2003.
[10] Dave Christie, Developing the AMD K5 Architecture, IEEE Micro, 1996.
[11] David Kanter, Inside Barcelona: AMD's Next Generation, www.realworldtech.com, accessed 02-07-2012.
[12] David Kanter, Inside Nehalem: Intel's Future Processor and System, www.realworldtech.com, accessed 02-06-2012.
[13] Donald Alpert, Dror Avnon, Architecture of the Pentium Microprocessor, IEEE Micro, 1993.
[14] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel, The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal, Q1 2001.
[15] Harsh Sharangpani, Ken Arora, Itanium Processor Microarchitecture, IEEE Micro, 2000.
[16] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, 2011.
[17] Intel, Itanium Processor Microarchitecture Reference for Software Optimization, 2000.
[18] Intel, Intel Pentium 4 Processor Optimization Reference Manual, 2001.
[19] Intel, Intel 64 and IA-32 Architectures Optimization Reference Manual, 2012.
[20] Intel, Intel Quick Reference Guide – Product Family, dated 28-05-2012.
[21] James E. Smith, Gurindar S. Sohi, The Microarchitecture of Superscalar Processors, 1995.
[22] Jean-Michel Puiatti, Instruction-Level Parallelism for Low-Power Embedded Processors, PhD thesis, 1999.
[23] Joe Circello, Floyd Goodrich, The Motorola 68060 Microprocessor, IEEE, 1993.
[24] John Paul Shen and Mikko H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, Tata McGraw-Hill, 2005.
[25] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill International Edition, 1993.
[26] Keith Diefendorff, K7 Challenges Intel, Microprocessor Report, 1998.
[27] Kevin W. Rudd, VLIW Processors: Efficiently Exploiting Instruction Level Parallelism, PhD thesis, Stanford University, 1999.
[28] Linley Gwennap, Merced Shows Innovative Design, Microprocessor Report, 1999.
[29] Linley Gwennap, Sandy Bridge Spans Generations, Microprocessor Report, 2010.
[30] Linley Gwennap, Intel's P6 Uses Decoupled Superscalar Design, Microprocessor Report, 1995.
[31] Martin Dixon, Per Hammerlund, Stephan, Ronak Singhal, The Next-Generation Intel Core Microarchitecture, Intel Technology Journal, Volume 14, Issue 3, 2010.
[32] Michael E. Thomadakis, The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms, Research Report, Texas A&M University, 2011.
[33] Michael Slater, AMD's K5 Designed to Outrun Pentium, Microprocessor Report, 1994.
[34] Michael Slater, K6 to Boost AMD's Position in 1997, Microprocessor Report, 1996.
[35] Michael S. Schlansker, B. Ramakrishna Rau, EPIC: Explicitly Parallel Instruction Computing, IEEE, 2000.
[36] Norman P. Jouppi and David W. Wall, Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines, Proc. Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, 1989.
[37] Philips Semiconductors, An Introduction to Very Long Instruction Word (VLIW) Computer Architecture, 1990.
[38] Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, Paul K. Rodman, A VLIW Architecture for a Trace Scheduling Compiler, ACM, 1987.
[39] Siamak Arya, Howard Sachs, Sreeram Duvvuru, An Architecture for High Instruction Level Parallelism, Proc. 28th Annual Hawaii International Conference on System Sciences, 1995.
[40] Simcha Gochman, Ronny Ronen, Ittai Anati, Ariel Berkovits, Tsvika Kurts, Alon Naveh, Ali Saeed, Zeev Sperber, Robert C. Valentine, The Intel Pentium M Processor: Microarchitecture and Performance, Intel Technology Journal, Volume 7, Issue 2, 2003.
