Paper Published in XXXII Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE 2011)
SOFTWARE REVERSE ENGINEERING TECHNIQUES APPLIED TO OPTIMIZATION OF LINEAR EQUATION SYSTEMS SOLVERS

Paulo M. Pimenta (a), Alexandre B. Ferreira (a,b), Fernando R. Gonçalves (a), Paulo S. B. Nigro (a)

(a) PPGEC, Departamento de Engenharia Civil, Universidade de São Paulo, Av. Prof. Almeida Prado, trav. 2, 83, 05508-900, São Paulo, SP, Brazil, http://ppgec.poli.usp.br

(b) Faculdade de Tecnologia Rubens Lara, Av. Bartolomeu de Gusmão, 110, 11045-908, Santos, SP, Brazil, http://www.fatecrl.edu.br
Keywords: Debugging Techniques, Linear System, Software Reverse Engineering.

Abstract. This work presents a low-level analysis, in terms of assembly instructions, registers, the Floating Point Unit, and RAM (Random-Access Memory) details, of code used by direct methods for solving linear systems, such as Crout, Cholesky, and others, written in a high-level language, and discusses mechanisms that can be used to optimize software run time. RAM debugging techniques were applied to establish which high-level instructions generate a more efficient set of low-level instructions, resulting in faster software.
1 INTRODUCTION
The solution of systems of linear equations (Arenales and Darezzo, 2008; Ruggiero and Lopes, 1996) is a recurring theme in research in Computational Mechanics as well as in related areas. However, the literature lacks a more detailed analysis of the algorithms implemented in computer programs that takes into account the architecture of the processors used in personal computers, as well as the code generated by compilers from different vendors. This work addresses the optimization of run time in the solution of systems of linear equations from the point of view of x86 architecture processors, using reverse engineering techniques, supported by software tools, which make possible a more precise analysis of the code that the processor actually executes.

2 LINEAR SYSTEMS OF ALGEBRAIC EQUATIONS
Problems in computational mechanics frequently reduce to the solution of linear systems, making this one of the main numerical tasks in this field of knowledge. There are numerous mathematical formulations and algorithms for solving this class of problems. The methods for solving linear systems of algebraic equations can basically be divided into two groups: direct and indirect methods. The direct methods are those for which it is possible to determine in advance the number of mathematical operations required to solve the system. The main direct methods include Gaussian elimination and the methods based on a prior decomposition of the coefficient matrix into a product of matrices; among these, the Crout method, the Doolittle method, and the Cholesky method can be cited. The indirect methods, in turn, are based on iterative processes, which compute a sequence of approximations to the solution of the linear system, conditioned on an initial approximation and on a tolerance for the solution. Therefore, for the indirect methods the number of mathematical operations needed to obtain the solution cannot be predicted. The main indirect methods include Gauss-Seidel, Gauss-Jacobi, the conjugate gradient method, and the biconjugate gradient method (Darezzo and Arenales, 2008; Ruggiero and Lopes, 1996). This work is restricted to the study of the methods based on prior decomposition of the coefficient matrix. Let the system of linear algebraic equations be described in matrix form by:

Ax = b,   (1)
where A is the matrix of coefficients of the algebraic equations, x is the vector of unknowns, and b is the vector of independent terms. Define the multiplicative decomposition of the matrix A as:

A = LDU,   (2)
where L is the lower triangular matrix with unit diagonal, D is the diagonal matrix, and U is the upper triangular matrix with unit diagonal. Thus, the new linear system to be solved is:

LDUx = b.   (3)

This system can be solved using the three steps outlined below:
1. Ly = b   (forward substitution)
2. Dz = y   (diagonal substitution)
3. Ux = z   (back substitution).   (4)
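As an illustration, a minimal C sketch of these three substitution steps follows; the dense row-major storage, the vector d holding the diagonal of D, and the function name are assumptions for illustration, not the paper's implementation.

    #include <stdlib.h>

    /* Sketch of the three substitution steps of Eq. (4).
       Assumptions (illustrative): L is unit lower triangular and U is
       unit upper triangular, both stored dense row-major; d holds the
       diagonal of D. */
    void ldu_solve(size_t n, const double *L, const double *d,
                   const double *U, const double *b, double *x)
    {
        double *y = malloc(n * sizeof *y);   /* auxiliary vectors of (4) */
        double *z = malloc(n * sizeof *z);

        for (size_t i = 0; i < n; i++) {     /* 1. Ly = b  (forward)     */
            y[i] = b[i];
            for (size_t j = 0; j < i; j++)
                y[i] -= L[i * n + j] * y[j];
        }
        for (size_t i = 0; i < n; i++)       /* 2. Dz = y  (diagonal)    */
            z[i] = y[i] / d[i];
        for (size_t i = n; i-- > 0; ) {      /* 3. Ux = z  (back)        */
            x[i] = z[i];
            for (size_t j = i + 1; j < n; j++)
                x[i] -= U[i * n + j] * x[j];
        }
        free(y);
        free(z);
    }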
The vectors y and z are merely auxiliary to the process. Note that to use the LDU decomposition the matrix A must be non-singular and the element a_{1,1} must be different from zero. When working with symmetric matrices, the system can be simplified by using U = L^T in the decomposition; the new decomposition is then LDL^T, which significantly reduces the number of mathematical operations needed to solve systems with a symmetric matrix. Several classical methods for solving systems derive from this methodology, by combining the decomposed matrices as follows:

a) The Crout method: when one sets L' = LD and uses the L'U decomposition to solve the system, without the diagonal substitution step, the method is called the Crout method. This method does not require the matrix to be symmetric.

b) The Doolittle method: when one sets U' = DU and uses the LU' decomposition to solve the system, without the diagonal substitution step, the method is called the Doolittle method. This method does not require the matrix to be symmetric.

c) The Cholesky method: when one sets L' = LD^{1/2} and uses the L'L'^T decomposition to solve the system, without the diagonal substitution step, the method is called the Cholesky method. This method requires the matrix to be symmetric and positive definite.
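For the symmetric case mentioned above, a hedged C sketch of the LDL^T factorization follows; the dense row-major storage and the function name are again illustrative assumptions, not the paper's code.

    #include <stddef.h>

    /* Sketch of the LDL^T factorization of a symmetric matrix A
       (illustrative; assumes A is n-by-n, dense, row-major, with all
       leading principal minors non-zero). L receives the unit lower
       triangular factor, d the diagonal of D. */
    void ldlt_factor(size_t n, const double *A, double *L, double *d)
    {
        for (size_t j = 0; j < n; j++) {
            double dj = A[j * n + j];            /* d_j = a_jj - sum   */
            for (size_t k = 0; k < j; k++)
                dj -= L[j * n + k] * L[j * n + k] * d[k];
            d[j] = dj;
            L[j * n + j] = 1.0;                  /* unit diagonal      */
            for (size_t i = j + 1; i < n; i++) { /* column j of L      */
                double s = A[i * n + j];
                for (size_t k = 0; k < j; k++)
                    s -= L[i * n + k] * L[j * n + k] * d[k];
                L[i * n + j] = s / d[j];
            }
        }
    }

Once the factors are computed, the system is solved by the three substitution steps of Eq. (4).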
3 THE X86 MICROPROCESSOR
The term x86 is widely used in computing to refer to the processors of the computers in use today. There is also another nomenclature, seen in some of the literature, which refers to the 8086 microprocessor family. The 8086 had 16-bit registers, 16-bit arithmetic instructions, and a 16-bit external data bus (Tanenbaum, 2005). The 8088, produced next, has an 8-bit external data bus, which simplified the hardware physically and in terms of cost, making the PC (personal computer) more accessible to everyone. Later came the 32-bit processors, which lasted until the last processors of the Pentium IV family; the current architecture is 64-bit.

3.1 Registers or Memory Cells
Registers are used to store values in volatile memory and can be compared to the variables of high-level languages. Each processor has registers sized in proportion to its architecture; in other words, 16-bit processors have 16-bit memory cells, 32-bit processors have 32-bit memory cells, and 64-bit processors have 64-bit memory cells. The memory cells are divided into three types: general purpose, segment, and pointer, and their names are specific to each architecture, as can be seen in Table 1.
Table 1: Registers of 16, 32 and 64 bits

Register               16 bits   32 bits   64 bits   Type
Accumulator            AX        EAX       RAX       General
Base                   BX        EBX       RBX       General
Counter                CX        ECX       RCX       General
Data                   DX        EDX       RDX       General
Source Index           SI        ESI       RSI       Pointer
Target Index           DI        EDI       RDI       Pointer
Base Pointer           BP        EBP       RBP       Pointer
Top Pointer            SP        ESP       RSP       Pointer
Instruction Pointer    IP        EIP       RIP       Pointer
Code Segment           CS        -         -         Segment
Data Segment           DS        -         -         Segment
Stack Segment          SS        -         -         Segment
Extra Segment          ES        -         -         Segment
Extra Segment          FS*       -         -         Segment
Extra Segment          GS*       -         -         Segment
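As a quick illustration of registers playing the role of variables, the sketch below reads the RSP register of Table 1 into a C variable; it assumes GCC-style inline assembly on an x86-64 platform and is not taken from the paper.

    #include <stdio.h>

    /* Illustrative sketch (assumes GCC/Clang on x86-64): copies the
       stack pointer register RSP into a C variable, showing registers
       acting as the processor's "variables". */
    int main(void)
    {
        unsigned long long rsp;
        __asm__ volatile ("mov %%rsp, %0" : "=r"(rsp));  /* read RSP */
        printf("stack pointer (RSP) = 0x%llX\n", rsp);
        return 0;
    }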
Two considerations must be made about Table 1 so that the code shown later can be understood. The first is that the segment registers in the 32- and 64-bit architectures are the same as in the 16-bit architecture; therefore, in the 32- and 64-bit architectures, memory is accessed, through specific techniques, via 16-bit registers. The second is that the FS and GS registers came into existence only with the 32-bit architecture, despite being 16-bit registers.

3.2 Assembly Language and Operation Codes

Assembly language is present in all microprogrammed or microcontrolled hardware (Tanenbaum, 2006); it is the closest form to the machine in which to develop programming code, although it is more complex for the programmer, requiring more specific knowledge of the architecture in question. Saying that assembly language is the closest to the machine for writing programs sets aside two other possibilities. One, even closer to the machine, is the binary instructions; the other is the hexadecimal abstraction of the binary codes, known as operation codes, or simply opcodes. Writing code in opcodes is quite unnecessary, as it makes the programming process even more complex. For example, the assembly instruction MOV AL, 01 would be written in opcodes as B001. Even more complex would be coding the assembly instruction MOV AL, 01 directly in binary, which would be 1011000000000001. These two examples lead to the conclusion that assembly language is the way for a programmer to develop code with extreme control and understanding of the hardware and, at the same time, with the highest possible level of readability.
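To see such opcodes without a separate disassembler, one can dump the bytes of a compiled function, as in the hedged sketch below; reading code bytes through a function pointer is non-portable and compiler-dependent, and the function name is illustrative.

    #include <stdio.h>

    /* Minimal sketch: prints the first bytes of a compiled function,
       i.e., its opcodes, mirroring what a disassembler shows. The
       exact bytes depend on compiler, options, and platform. */
    void empty_loop(void)
    {
        for (volatile int i = 0; i < 0x6969; i++) { }
    }

    int main(void)
    {
        const unsigned char *p = (const unsigned char *)&empty_loop;
        for (int k = 0; k < 16; k++)     /* first 16 opcode bytes */
            printf("%02X ", p[k]);
        printf("\n");
        return 0;
    }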
3.3 The 8086 Assembly Instruction Set

The assembly-language instruction set is specific to each microprocessor or microcontroller architecture and usually undergoes updates and changes in the subsequent versions of the same architecture (Manzano, 2004). Figure 1 lists all the 8086 instructions that were kept in the 32- and 64-bit processors.

AAA CBW CWD IN JBE JMP JNGE JNS LAHF LOOPNE NOP PUSHA REPZ SCASB STOSW
AAD CLC DAA INC JC JNA JNL JNZ LDS LOOPNZ NOT PUSHF RET SCASW SUB
AAM CLD DAS INT JCXZ JNAE JNLE JO LEA LOOPZ OR RCL ROL SHL TEST
AAS CLI DEC INTO JE JNB JNO JP LES MOV OUT RCR ROR SHR XCHG
ADC CMC DIV IRET JG JNBE JNP JPE LODSB MOVSB POP REP SAHF STC XLAT
ADD CMP HLT JA JGE JNC JNLE JPO LODSW MOVSW POPA REPE SAL STD XOR
AND CMPSB IDIV JAE JL JNE JNO JS LOOP MUL POPF REPNE SAR STI
CALL CMPSW IMUL JB JLE JNG JNP JZ LOOPE NEG PUSH REPNZ SBB STOSB
Figure 1: 8086 instruction set.
Regarding Figure 1, it should be said that for each instruction there is a set of opcodes, or binary instructions, that likewise encode its parameters, in other words, the registers, addresses, and values involved in the use of each instruction.

4 LOW-LEVEL ANALYSIS OF THE REPETITION LOOP
To better understand how a high-level instruction operates within the x86, a loop written in the C programming language (Schildt, 2000) is given here as an example, shown in Figure 2. Using a debugging tool, the loop is analyzed, obtaining the equivalent opcodes that the CPU understands, disassembled into the low-level language (assembly).

    for (i = 0; i < 0x6969; i++)
    {
    }

Figure 2: A repetition loop in the C language
The code above, apparently simplistic, helps, when analyzed in a little more depth, to understand some questions of code optimization and program operation. As can be seen, the first parameter of the loop indicates its initial condition; in other words, the variable "i" is initialized to the value 0 (zero). The second parameter indicates the stopping condition of the loop: the loop is repeated while the value of the variable "i" is smaller than the hexadecimal value 6969, which could also be represented as 26985 in decimal. Hexadecimal encoding is adopted here to make the interpretation of RAM easier, given that all opcodes and values are always represented in hexadecimal. The third and last parameter indicates the increment of the loop; in other words, at each step
of execution, or at each iteration, the variable named "i" is incremented by 1 (one), so that at some point the stopping condition given by the second parameter is reached. With the development process of the loop understood, compiling the same loop with two different programming environments (Borland Turbo C 3.0 and GCC) yields two executables whose loops differ in size and behavior. Figure 3 shows the disassembled code of the loop in assembly language, as compiled by Borland Turbo C 3.0:

            xor     si, si
            jmp     short loc_1029A
    loc_10299:
            inc     si
    loc_1029A:
            cmp     si, 6969h
            jl      short loc_10299

Figure 3: Loop from Borland Turbo C 3.0 disassembled into assembly.
The tool used to disassemble the loop code compiled in Borland Turbo C 3.0 was IDA (The Interactive Disassembler), widely used in the field of reverse engineering (McGraw and Hoglund, 2006). The first instruction (xor) performs the "exclusive or" of Boolean logic between the register called SI (source index) and itself. The result is that SI becomes zero; therefore, this instruction corresponds to the first parameter of the loop (i = 0). The second instruction (jmp) performs an unconditional jump to the memory address symbolized by the label "loc_1029A". At this label there is an instruction "cmp si, 6969h", which compares the value of the SI register with the value 6969 in hexadecimal. This instruction is equivalent to the second parameter of the loop (i < 0x6969).
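As a complement, a hedged timing harness is sketched below for comparing the run time of the executables produced by different compilers; the repetition count and the use of volatile (to keep modern optimizers from removing the empty loop) are assumptions for illustration, not part of the paper.

    #include <stdio.h>
    #include <time.h>

    /* Illustrative harness: times the empty loop of Figure 2 so that
       executables produced by different compilers, e.g. Turbo C 3.0
       and GCC, can be compared for run time. */
    int main(void)
    {
        volatile long i;                 /* volatile keeps the loop alive */
        clock_t t0 = clock();

        for (long r = 0; r < 100000L; r++)   /* repeat for measurable time */
            for (i = 0; i < 0x6969; i++) { }

        clock_t t1 = clock();
        printf("elapsed: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }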