RC23283 (97774) March 27, 2000 Computer Science
IBM Research Report Simulation and Debugging of Full System Binary Translation Erik R. Altman, Kemal Ebcioglu IBM Research Division Thomas J. Watson Research Center P.O. Box 218 Yorktown Heights, NY 10598
Research Division Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights, NY 10598 USA (email:
[email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home.
Simulation and Debugging of Full System Binary Translation Erik R. Altman and Kemal Ebcioğlu IBM T.J. Watson Research Center
Abstract
We describe full system simulation of DAISY (Dynamically Architected Instruction Set from Yorktown). At runtime, DAISY dynamically translates code for a PowerPC processor into code for an underlying VLIW processor. Our style of simulation can also be used in the context of full system emulation à la SimOS and SimICS. Unlike SimOS and SimICS, DAISY emulation is operating system and device independent. We have successfully simulated a workstation running under DAISY binary translation [7, 8, 9], starting with translation of the PowerPC firmware, continuing through the boot of the AIX 4.1.5 operating system to the login prompt, and finally execution of X-Windows and a variety of utilities including emacs, latex, dbx, grep, ping, and a user mode version of the DAISY simulator itself. Simulated aspects of DAISY include large register sets, precise exceptions, parallel semantics of VLIW instructions, virtual address translation, cacheable and non-cacheable memory, and self-modifying code, but not I/O devices, which need not be simulated.

More difficult still is debugging. DAISY’s simulation code is two levels removed from the original PowerPC code (PowerPC → VLIW → PowerPC), with bugs possible in each level of translation. As well, the VLIW code is heavily reordered from the original PowerPC code, thus making it difficult to associate a particular VLIW operation with its original PowerPC counterpart. Finally, the simulation runs on the bare hardware of a workstation where there is an element of non-determinism and where incorrect accesses to I/O can crash the system and possibly corrupt the disk. We propose novel solutions for handling these problems.
KEYWORDS: VLIW, Binary Translation, Simulation, Debugging, PowerPC
1
Introduction
As processor hardware and systems grow more complex, it is increasingly important to simulate them accurately, both to find errors in design and to find bottlenecks to higher performance. Such simulation is also generally more cost-effective than building prototype hardware. Our DAISY (Dynamically Architected Instruction Set from Yorktown) simulator works in this vein to study the DAISY binary translation architecture, which dynamically translates fragments of PowerPC code to instructions for an underlying VLIW machine. In addition to modeling a VLIW machine under binary translation, the simulator has several other noteworthy features:
The simulator runs on bare PowerPC hardware and deals directly with devices and memory with no operating system intervention.
Consequently, the simulator is operating system independent. Likewise, since the peripheral devices on the machine on which we simulate DAISY are the same as we would expect on a hardware DAISY platform, peripherals need not be simulated.
The simulator models a complete system, including both user and privileged mode code, as well as virtual address translation, exceptions and traps, cacheable and non-cacheable accesses to memory and I/O, and self-modifying code.

Since we do not currently have DAISY hardware, our simulator can be used to estimate DAISY performance. The simulator also provides a platform on which we can test the correctness and efficacy of our translation and system management algorithms, and obtain statistics, such as the amount of code reuse and the size of the code footprint in systems of interest.

Our style of simulation can also be used in a different context, that of full system emulation à la SimOS and SimICS. However, both SimOS and SimICS are dependent on a particular operating system and model only a limited set of devices in that system. The DAISY simulation system is completely independent of both the operating system and any devices present in the system. Any and all manner of networking cards, graphics cards, data collection cards, etc. can be present in a DAISY simulation system.

DAISY does have two significant drawbacks compared to SimOS and SimICS: (1) As the DAISY simulator was designed to emulate a VLIW architecture, it is slightly slower than SimOS and SimICS. (2) The DAISY simulator is significantly more difficult to debug. Debugging challenges range from non-determinism to dealing with real hardware devices to heavily reordered code that is two levels removed from the original: PowerPC → VLIW → PowerPC simulation code.

The remainder of the paper is structured as follows. Section 2 describes the DAISY architecture and the novel ways in which it is simulated. Section 3 describes how we deal with the debugging problems described above. Section 4 offers some observations from using the DAISY simulator. Section 5 discusses related work. Finally, Section 6 concludes.

2
DAISY

The DAISY system dynamically translates PowerPC code into code for an underlying VLIW machine. The translation is done at runtime and hence must be fast and efficient so as not to be perceptible to the user. VLIW code translations are saved and cached so that when the same PowerPC code is later encountered, its VLIW translation can execute immediately without retranslation.

The DAISY VLIW architecture is in the style of VLIW instruction that was pioneered by Ebcioğlu [6]. DAISY can be an 8- or 16-wide machine, with 64 integer registers, 64 floating point registers, and 64 condition register bits (16 condition register fields). The first 32 integer registers contain PowerPC values, while the upper 32 registers are used for speculative computation and for translator use. For example, r3 always contains the value that r3 would contain in a “normal” PowerPC program. The condition bits are similar, with the first 32 corresponding to the PowerPC condition bits and the second 32 being available for speculative and scratch computation. Each DAISY instruction can have 8 or 16 ALU operations, of which one half can be loads/stores.

Given that the DAISY instruction set is designed as a binary translation target for the PowerPC instruction set, DAISY’s primitive operations are similar to PowerPC operations, with the exception that complex PowerPC operations such as update instructions and string operations are cracked into simpler DAISY primitives. Each VLIW instruction can branch to up to 4 targets, based on up to 3 condition bits, in a decision tree-like fashion. For each target that the VLIW branches to, a different subset of the primitive ALU and memory operations in the VLIW can be executed (committed). Thus, each primitive ALU and memory operation in a DAISY instruction can be predicated on up to 3 condition code bits, although these are not fully independent predicates associated with each operation, as is the case with the Intel IA-64.

[Figure 1: DAISY Schematic. (a) DAISY Hardware System; boxes: AIX Applications, AIX, DAISY Translator, DAISY VLIW Machine. (b) DAISY Simulator System; boxes: AIX Applications, AIX, DAISY Translator, DAISY Simulator, PowerPC 604e Machine.]

Figure 1(a) shows the overall framework under which DAISY would operate if hardware were available. Figure 1(b) shows the framework under which we simulate DAISY. Of particular importance is the fact that DAISY runs directly on real hardware, with no intervening operating system support. This is true for both DAISY hardware systems and the DAISY simulator described here. This “bare metal” operation provides the benefit of portability: any operating system from AIX to Linux to MacOS-X should run without changes to the DAISY system (or the operating system), although to date we have only tested DAISY with AIX.

[Figure 2: DAISY Simulation System. Components shown: PowerPC processor simulating DAISY, PowerPC Flash ROM, 60x Bus, Memory Controller, Memory (DAISY and PowerPC portions), PCI Bus, Disk, Video, Network, Keyboard.]

Figure 2 shows a DAISY simulation system. The shaded boxes differ from a traditional PowerPC system, while the unshaded boxes do not. As can be seen, a traditional PowerPC processor is replaced by a PowerPC processor simulating a DAISY VLIW processor. (The DAISY VLIW processor in turn executes code which emulates PowerPC.) Translated DAISY code is kept in the DAISY portion of memory, as is simulation code for the DAISY code. During normal system operation, the PowerPC portion of the memory will look exactly as it would were a traditional PowerPC system in use. Figure 3 depicts the DAISY memory map in slightly more detail. The DAISY portion also contains the DAISY system software, translator, simulation code, and other required tables. Section 2.3 explains why PowerPC pages 0, 1, and 2 are kept in the DAISY portion of memory.
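The tree-like branching of a DAISY VLIW instruction described in this section can be illustrated with a small model. The following is a minimal sketch, not the DAISY simulator's actual code: the `Op` and `Node` classes, the register names, and the labels `V25` and `V47` are hypothetical, and it assumes that "parallel semantics" means every operation reads the register state as it was on entry to the instruction, with results committed only for the path selected by the condition bits.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]  # hypothetical flat model: registers and condition bits by name


@dataclass
class Op:
    dest: str                        # register written when the op commits
    compute: Callable[[State], int]  # reads only the pre-instruction state


@dataclass
class Node:
    ops: List[Op] = field(default_factory=list)
    cond: Optional[str] = None       # condition bit tested here; None marks a leaf
    taken: Optional["Node"] = None
    not_taken: Optional["Node"] = None
    target: Optional[str] = None     # label of the next VLIW (leaves only)


def execute_vliw(root: Node, state: State) -> Optional[str]:
    """Walk the decision tree, then commit only the ops on the selected path."""
    snapshot = dict(state)           # every op sees the state as it was on entry
    pending: List[Tuple[str, int]] = []
    node = root
    while True:
        for op in node.ops:
            pending.append((op.dest, op.compute(snapshot)))
        if node.cond is None:
            break
        node = node.taken if snapshot[node.cond] else node.not_taken
    for dest, value in pending:      # commit phase
        state[dest] = value
    return node.target


# A two-target example (the text allows up to 4 targets on up to 3 condition bits).
leaf_a = Node(ops=[Op("r24", lambda s: s["r25"] ^ s["r26"])], target="V25")
leaf_b = Node(ops=[Op("r28", lambda s: s["r63"] + 1)], target="V47")
root = Node(ops=[Op("r63", lambda s: s["r21"] & s["r22"])],
            cond="cr0", taken=leaf_a, not_taken=leaf_b)

state = {"r21": 0b1100, "r22": 0b1010, "r24": 0, "r25": 5,
         "r26": 3, "r28": 0, "r63": 7, "cr0": 0}
next_label = execute_vliw(root, state)
```

With `cr0` clear, only the root and `leaf_b` operations commit. Because `r28` is computed from the old value of `r63` (7), it becomes 8 rather than 9, which is the difference between parallel and sequential semantics in this toy model.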
2.1
Bootstrapping the DAISY Simulator
The machine on which DAISY runs has 512 Mbytes of RAM installed. The upper 384 Mbytes are reserved for the translator and simulator, leaving the lower 128 Mbytes for use by the “normal” system. The large majority of the 384 Mbytes used by the translator and simulator is for simulation code of the DAISY VLIW code. As will be discussed in Section 2.2, simulation code is several times larger than VLIW code.

[Figure 3: DAISY Memory Map. DAISY Memory: Translator, Translated Code, Side Tables, System Software, Simulator, PowerPC Pages 0, 1, 2. PowerPC Memory: everything except PowerPC Pages 0, 1, 2.]

The DAISY simulator begins by loading the translator and simulator software into high real memory on the machine on which the simulation is to be run. This load of the translator and simulator is accomplished via an AIX kernel extension. Like most of the AIX kernel, the kernel extension initially runs with address translation on. This, however, can cause problems. The real memory locations of the kernel extension, as well as those of the translator and simulator software, are arbitrary. Some pages may reside in the high real memory locations to which the translator and simulator will ultimately be loaded. Thus the kernel extension first moves all pages of itself, the translator, and the simulator to low real memory pages (taking care not to overwrite parts already in low real memory). Such pages include not only code but also data, including the kernel extension’s stack. Once everything is in low memory, the kernel extension copies all DAISY code and data to their desired locations in high real memory.

Once this load is completed, the kernel extension sets the PowerPC stack pointer and TOC¹ registers (r1 and r2) to the appropriate values for the translator, and turns off data and instruction translation in the PowerPC MSR. It likewise zeroes the BSS data for the translator and simulator, and then passes control to the translator and simulator with a translation address of 0xFFF00100, the standard reset location for the PowerPC architecture, the contents of which reside in the PowerPC Flash ROM in Figure 2. In other words, the DAISY simulator begins its simulation at the reboot of the machine.

Translation of PowerPC code to DAISY VLIW code then commences. When an unseen fragment of PowerPC code is encountered during execution, code on several paths from that fragment is translated to DAISY VLIW code. Simulation code for the DAISY VLIW code is then generated as described in Section 2.2. This simulation code is then executed until code branches out of the translated fragment or until an indirect branch is encountered. Out-of-fragment and indirect branches check whether a DAISY VLIW translation exists for their target. If so, the simulation code for the DAISY code translation is executed. If not, the code is translated as before and executed. This process continues all the way to the AIX login prompt and beyond, indeed until the machine is rebooted.
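The translate-on-first-use loop described in this section can be sketched as a cache keyed by PowerPC address. This is an illustrative model under stated assumptions, not the DAISY implementation: `TranslationCache`, `run`, `toy_translate`, and the toy program are hypothetical names, and a "fragment" is reduced to a callable that returns the PowerPC address it exits to. Only the reset address 0xFFF00100 comes from the text.

```python
from typing import Callable, Dict, List

# Hypothetical: a fragment is simulation code that runs and returns the
# PowerPC address of the out-of-fragment or indirect branch target.
Fragment = Callable[[], int]

RESET_VECTOR = 0xFFF00100  # standard PowerPC reset location, as given in the text


class TranslationCache:
    """Maps PowerPC entry addresses to previously generated fragments."""

    def __init__(self, translate: Callable[[int], Fragment]) -> None:
        self._translate = translate
        self._cache: Dict[int, Fragment] = {}
        self.translations = 0        # how many times the translator was invoked

    def lookup(self, ppc_addr: int) -> Fragment:
        frag = self._cache.get(ppc_addr)
        if frag is None:             # unseen code: translate once, then reuse
            frag = self._translate(ppc_addr)
            self._cache[ppc_addr] = frag
            self.translations += 1
        return frag


def run(cache: TranslationCache, start: int, max_fragments: int) -> List[int]:
    """Execute fragments starting at `start`; return the entry addresses visited."""
    pc = start
    trace: List[int] = []
    for _ in range(max_fragments):
        trace.append(pc)
        pc = cache.lookup(pc)()      # run the fragment, get the next target
    return trace


# Toy "program": each address simply branches to a fixed successor.
TOY_PROGRAM = {RESET_VECTOR: 0x1000, 0x1000: 0x2000, 0x2000: 0x1000}


def toy_translate(addr: int) -> Fragment:
    successor = TOY_PROGRAM[addr]
    return lambda: successor


cache = TranslationCache(toy_translate)
trace = run(cache, RESET_VECTOR, 6)
```

In this toy run, six fragment executions require only three translations, since the loop between 0x1000 and 0x2000 reuses cached fragments. This is the code reuse that the text relies on to keep translation overhead low.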
2.2
Simulation Code
The DAISY simulator generates simulation code for each VLIW instruction in a Shade-like manner [5]. This is illustrated in Figure 4. Figure 4(a) shows
¹ Under standard compiler conventions for AIX code, the TOC (table of contents) is used to point at globally visible program symbols.
[Figure 4. Recoverable operations: and X63,X21,X22; bne X_CR0; xor X24,X25,X26; addi X28,X63,1; b V25; b V47]