Software Random Number Generation Based on Race Conditions∗

Adrian Coleșa

Radu Tudoran

Sebastian Bănescu

Technical University of Cluj-Napoca

Abstract

The paper presents a new software strategy for generating true random numbers, by creating several threads and letting them compete unsynchronized for a shared variable, whose value is read-modified-updated by each thread repeatedly. The generated sequence of random numbers consists of the final values of the shared variable. Our strategy is based on the functionality of the operating system's thread scheduler. Different values of the shared variable are obtained because the concurrent threads are preempted at different moments in their execution. We identified some software and hardware factors that make the scheduler generate context switches at unpredictable moments: the execution environment, cache misses, the instruction execution pipeline and the imprecision of the hardware clock used to generate timer interrupts. We implemented the strategy on the x86 architecture running the Linux operating system. The random number sequences obtained pass over 90% of the NIST tests.

1. Introduction

∗This work was partially supported by the CNMP-funded CryptoRand project, nr. 11-020/2007.

In the last few years there has been an increasing interest in random number generators (RNGs), since they are extensively used in cryptography [3], [11]. The strong dependency between RNGs and applications that require secure data can be seen in the successful attack on the communication mechanism of the Netscape V2.0 browser [6]. There is a lot of ongoing research on improving existing methods and devising new ones to generate random numbers, based on both software [5] and hardware [4] strategies. Pseudo-random number generators (PRNGs) use a seed and an algorithm in order to produce an apparently random sequence of numbers. They have an internal state, which provides the next output [7]. True random number generators (TRNGs) produce a sequence of numbers based on some physical process in hardware (e.g. jitter, ring oscillators etc.) or on

some entropy sources in software (e.g. system clock, mouse movements etc.) [9], and not on any previous outputs. The hardware methods have proved to behave better than the software ones, but at higher costs, because they use expensive specialized hardware components that must be attached to normal computers. The in-use software methods for TRNGs are not so expensive, because they can be run on any ordinary computer. However, when they use the user's unpredictable actions (e.g. mouse movements, key presses), their throughput is quite poor (the user is awfully slow compared with the speed at which a processor can execute applications), and when they are based on the activity of some hardware components (e.g. network-card activity), they depend on external factors which can be controlled by attackers.

We developed a software strategy to obtain a TRNG based only on the off-the-shelf software and hardware components a common, ordinary computer provides. We wrote an application that creates several threads that read-modify-update the same shared variable, whose final value is the random number we generate. It is well known that letting concurrent threads compete unsynchronized for a shared resource creates race conditions that bring the resource into an unpredictable state. Because race conditions rarely occur in practice, we tried to increase the frequency of their occurrence in our application, in order to improve the randomness of the numbers we generated. Even if the scheduler is deterministic, we identified a couple of software and hardware factors that influence its decisions and make our application output different, random results during different runs. We tested our application on an x86 architecture with a Pentium 4 processor, running the Linux operating system. The random number sequences we obtained pass over 90% of the NIST [10] tests.

Section 3 describes the software strategy we used.
In Section 4 we identify and explain the software and hardware factors that contribute to the generation of true random numbers using our strategy. Section 5 gives details about the application we wrote to test our strategy. Section 6 presents the tests we made and the results we got, and finally, we present the conclusions of our work and the ongoing research directions.

Thread 1                 Thread 2
v1 = shared_var;         ... (suspended)
v1 = (v1 + 1) % 2;       ... (suspended)
shared_var = v1;         ... (suspended)
... (suspended)          v2 = shared_var;
... (suspended)          v2 = (v2 + 1) % 2;
... (suspended)          shared_var = v2;

Table 1. The "normal" execution scenario: the steps of the two threads do not interleave.

Thread 1                 Thread 2
v1 = shared_var;         ... (suspended)
v1 = (v1 + 1) % 2;       ... (suspended)
... (suspended)          v2 = shared_var;
... (suspended)          v2 = (v2 + 1) % 2;
shared_var = v1;         ... (suspended)
... (suspended)          shared_var = v2;

Table 2. The "abnormal" execution scenario: the steps of the two threads interleave.

2. Related Work

The vast majority of software generators are PRNGs. They are based on the system clock, on mathematical algorithms for which the current output is a function of the previous output, on network statistics etc. Our generator is based on the correlation of cache misses, hardware-clock imprecision and execution delays within the processor's pipeline. Compared with other software strategies, which generate random numbers by monitoring various unpredictable user actions [8] (e.g. mouse movements, keyboard key presses etc.) or events occurring in the system (e.g. interrupts from the network card), our application takes such factors into account implicitly, due to the fact that they influence the scheduler decisions our strategy is based on. Another difference is that our strategy needs no seed. This is a solid argument that our application generates true random sequences of numbers. For example, the PRNG implemented in Java [14], which is fast and passes around 90% of the NIST tests, requires a seed for initialization. Our application does not need such an input to function properly and passes over 90% of the NIST tests.

3. The Strategy

The idea of our work is to create several threads of the same application that access a shared variable concurrently and without synchronization. We promote race conditions during the threads' execution, which can lead to unpredictable results in different runs of the same application. The threads modify the shared variable in three steps: (1) read its value into a local working variable, (2) modify the read value in some way (e.g. increment or decrement it) and (3) write the final value from the local working variable back into the shared variable. Race conditions occur because the three steps executed concurrently by different threads can interleave in different ways, resulting in different values of the shared variable in different runs of the application. Race conditions in our case can basically lead to the two different execution scenarios described in Table 1 and Table 2. The normal execution corresponds to the case in which the context switch occurs after the current thread succeeds in executing all three steps of the variable's manipulation. The abnormal execution corresponds to the situation in which the context switch occurs before the update step is executed, letting the next concurrent thread read an inconsistent value of the shared variable.

4. Factors of Randomness

Supposing the initial value of the shared variable used in the method described in Section 3 is 0 and the modification operation performed on it is an addition followed by a modulo 2, the final value of the shared variable in the two scenarios described above could be 0 and 1, respectively. The probability of getting one of them as the final value of the shared variable depends on the moment when the context switch between the two threads occurs, according to the first or the second scenario. With only these two scenarios possible, there is a theoretical 50% chance of getting 0 or 1 as the final value of the variable.

It is known that the thread scheduler is deterministic, i.e. it uses a deterministic algorithm, generating the same sequence of scheduled threads in the same execution configuration. With no loss of generality we can consider that only the two threads of our application are running in the system. In that case, the scheduler will always generate the same scheduling sequence: Thread 1, Thread 2, Thread 1, Thread 2 and so on, or the one in which Thread 2 is scheduled first. The two threads have the same priority and are identical from the point of view of CPU and other resource usage, so they will alternately get identical time quanta. Thus, knowing the duration of the time quantum assigned to the two threads, someone could predict the final value of the shared variable. It can be 0 or 1, depending on the moments the context switches occur, but it will surely always be the same value, because the context switches always occur at the same moments of time, applied to the same thread-scheduling sequence. It is only a matter of knowing the time quantum duration and the thread which is scheduled first. Figure 1 illustrates the deterministic scheduling of the two threads.

Figure 1. Deterministic scheduling of Thread 1 and Thread 2. The time quanta assigned to both threads have the same length, i.e. τ milliseconds. So, it can be calculated at each moment which thread will be executing and what that thread executes at that moment.

Although the behavior of the scheduler is deterministic, we observed during multiple executions of our application that the final value of the shared variable is different and unpredictable. So, the question that must be answered is: "What makes it possible to get different, random values of the shared variable in different runs under the same conditions, even if the system is deterministic?" In order to answer this question we followed two directions: finding software and, respectively, hardware factors of randomness. We also took into consideration two different perspectives on any identified factor: the theoretical and the practical possibility of that factor contributing to the generation of random values of the shared variable.

Regarding the software factors, one that contributes to the random behavior of our application is its execution environment. The scheduler being deterministic, it behaves identically each time under the same conditions, generating the same sequence of scheduled threads. From this point of view, generating an execution environment identical to a previous one means executing the same applications, starting them all at the same corresponding moments and with the system in the same state as in the previous environment. This is theoretically possible, but practically very difficult, if not impossible, to achieve. We tried to test our application in the same execution environment: (1) we ran it in Linux single-user mode with no other user processes running, except the ones the operating system automatically starts in order to function properly, and (2) we started it when the system was in an identical stabilized state, observed using a monitoring tool. Even so, the values of the shared variable we obtained were random. So, it is very possible that the software execution environment is not the only factor responsible for the observed randomness.
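The two interleavings of Tables 1 and 2 can be replayed deterministically to verify the final values 0 and 1 discussed above. The following is an illustrative simulation of ours, with the step orderings taken directly from the tables:

```python
def run(order):
    """Execute the three steps of each thread in the given global order.
    Each thread has a local variable; both share one variable starting at 0."""
    shared = 0
    local = {1: 0, 2: 0}
    for tid, action in order:
        if action == "read":
            local[tid] = shared
        elif action == "modify":
            local[tid] = (local[tid] + 1) % 2
        elif action == "write":
            shared = local[tid]
    return shared

STEPS = ["read", "modify", "write"]

# Table 1: Thread 1 finishes all three steps before Thread 2 starts.
normal = [(1, s) for s in STEPS] + [(2, s) for s in STEPS]

# Table 2: the switch occurs before Thread 1's write; Thread 2 reads a
# stale value, then each thread writes in turn.
abnormal = [(1, "read"), (1, "modify"),
            (2, "read"), (2, "modify"),
            (1, "write"), (2, "write")]

final_normal = run(normal)      # 0: both increments take effect (0 -> 1 -> 0)
final_abnormal = run(abnormal)  # 1: Thread 1's increment is lost (0 -> 1)
```

The simulation confirms that the final value depends only on where the context switch falls within the three-step block.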
Nevertheless, in a practical situation, on a system in use, the execution environment is not identical across different runs of our application, nor even during one execution, because the other running applications evolve dynamically; thus it can contribute to the randomness of our application's behavior.

Figure 2. Variation in real time of an allocated time quantum. The end time t_{k+1} of a quantum of value τ = t_{k+1} − t_k actually lies in the interval [t_k + τ − ρτ, t_k + τ + ρτ].

We identified three hardware factors that contribute to the random behavior of our application:

1. Cache misses force the CPU to bring the needed instruction or data from another memory level and, consequently, to consume some extra CPU clock cycles on behalf of the currently running thread [13]. That means that during the same time interval, i.e. the allocated time quantum, a different number of the thread's instructions can be executed in different situations, resulting in a thread being preempted at different places during its execution. Related to our application, that means that the two different execution scenarios described in Section 3 can both occur. Whether that happens randomly during one execution of our application, or across different runs of it, with the effect of generating random values of the shared variable, depends on the way the cache misses occur, which is very dependent on the execution environment. In a specially prepared test-bed system, like the one we used to test our application, where the environment was very stable, cache misses may not occur at all, if all the threads of the testing application fit in the cache (which we think is the case for our application), or may occur according to a regular periodic pattern, if there is not enough room in the cache for all the running threads. The latter happens because the scheduling sequence is fixed and also follows a regular periodic pattern, as we already mentioned. On a real, practical system, however, the environment being not stable, the cache-miss sequence will also not be regular and deterministic, so we can count it among the sources of randomness of our application.

2.
Another hardware factor that can make our application behave differently in similar situations is related to the way the instruction flow in the pipeline is affected by the occurrence of the timer interrupt, which leads to a context switch. In such a situation, the CPU's instruction execution pipeline contains instructions of the preempted thread. Due to the context switch, the next instructions introduced into the pipeline belong to the newly scheduled thread, the remaining instructions of the preempted thread being executed when it later resumes its execution. Either way, after the context switch, the pipeline contains instructions of both threads. We will not insist here on the way the CPU protects one thread from another in such a situation, because it is not important for us now (see [13] for details). Let us suppose that the first instruction that will be fetched by the CPU when the preempted thread resumes its execution depends in some way on some previous instruction, which was in the pipeline at the moment of the context switch. In an uninterrupted execution, in which the context switch had not occurred, this instruction would have had to wait some time, i.e. some number of CPU clock cycles, for the instruction it depends on to be retired (finished). In the case of the context switch, however, the instruction is not fetched by the CPU. Meanwhile, the previous instruction it depends on, being in the CPU pipeline, continues its execution, but in the context of the next executed thread. That is, it "consumes" (in fact, shares) time from the quantum allocated to the next thread. When the preempted thread resumes its execution and its next instruction is fetched by the CPU, the instruction it depends on is already retired, and consequently it can be launched immediately, thus saving CPU clock cycles from its quantum, compared with the uninterrupted execution. Similar to the case of cache misses, this results in a thread executing a different number of instructions during different, identical time quanta, and consequently meeting the two execution scenarios.

3.
The timer interrupt, which the scheduler uses to preempt the currently running thread and switch the CPU to another thread, is not generated perfectly periodically. This happens because the hardware clock that drives the timer interrupt is not perfect. It is known [1] that a hardware clock has a small bounded drift rate, i.e. the difference between the clock and real time can change every second by at most some known constant ρ ≪ 1. Consequently, the different time quanta allocated by the scheduler to the running threads are not of the same length, even if, mathematically speaking, they have the same value. If the clock used to generate timer interrupts is not the same as, or not synchronized with, the one used to generate the CPU's clock cycles, the variation in the real-time length of the time quanta results in different numbers of CPU clock cycles and, consequently, different numbers of instructions being executed in different time quanta of the same (theoretical) length. Relative to our application, that means that its execution can randomly generate either of the two scenarios described before.

The second and the third hardware factors mentioned above do not depend on the execution environment. They are tightly coupled, though: the second one, the hazard in the instruction pipeline, seems to be a consequence of the third one, the variation of the clock quanta. So, they always contribute to the randomness of our application. The hardware clock's drift rate can vary in the interval [−ρ, ρ], where ρ is in the range [10⁻⁶, 10⁻⁴] µs/s, i.e. every second the clock deviates from real time by between 10⁻⁶ and 10⁻⁴ µs. Thus, a time quantum of theoretically τ milliseconds actually lasts something in the interval [τ − ρτ, τ + ρτ]; Figure 2 illustrates this. So, the variation in time of a time quantum is 2ρτ. Considering the default value of a Linux time quantum of 100 ms, the variation interval I of a time quantum is

I = 2ρτ = 2 × 10⁻⁴ µs/s × 100 ms = 2 × 10⁻⁵ µs = 2 × 10⁻² ns = 0.02 ns.

For a processor running at 2.6 GHz, which generates a clock tick every 0.385 ns, a CPU clock cycle can be missed or saved every 20 consecutive time quanta, i.e. every 2 seconds.

All the factors mentioned above contribute, on a real system, to the random behavior of our application, due to the fact that different time quanta allocated to its threads, even if equal as mathematical values, contain different numbers of CPU clock cycles. That means that the theoretical scheduling described in Figure 1 does not hold in practice. The scheduling situation we suppose to happen in reality is illustrated in Figure 3.

Figure 3. Nondeterministic scheduling of Thread 1 and Thread 2 on a real system. All the time quanta assigned to the two threads last theoretically τ milliseconds. Practically, however, they contain a different number ν_k of CPU clock cycles. So, it cannot be calculated at a given moment which thread will be executing.

The lengths of the time quanta illustrated in that figure differ in terms of CPU clock cycles, a characteristic shared by all the factors we described above.
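The drift arithmetic above can be checked numerically. This is a sketch assuming the worst-case drift rate ρ = 10⁻⁴ µs/s and the 100 ms Linux quantum used in the text:

```python
rho = 1e-4 * 1e-6        # drift rate: 1e-4 microseconds per second (dimensionless)
tau = 100e-3             # time quantum: 100 ms, in seconds

I = 2 * rho * tau        # variation interval of one quantum, in seconds
I_ns = I * 1e9           # the same interval in nanoseconds -> 0.02 ns

tick_ns = 1e9 / 2.6e9                       # one clock tick at 2.6 GHz, about 0.385 ns
quanta_per_cycle = tick_ns / I_ns           # about 19, i.e. roughly every 20 quanta
seconds_per_cycle = quanta_per_cycle * tau  # about 2 seconds
```

The numbers reproduce the figures in the text: a 0.02 ns variation per quantum, so one whole CPU clock cycle is gained or lost roughly every 20 quanta, i.e. every 2 seconds.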

5. The Application

The application we wrote to generate random numbers is based on the strategy described in Section 3. It creates several threads, which access a shared variable without synchronization. Race conditions manifest because of context switches occurring in the code area where the shared variable is accessed (the critical code) and lead to an unpredictable final value of the shared variable out of the two possible ones, i.e. 0 and 1. Theoretically, a context switch can occur anywhere during a thread's execution, but in practice we observed that it rarely occurs inside a time quantum allocated to that thread (because of an interrupt, for example). That is why we tried to place the critical code in the areas where context switches are likely to occur, in order to increase the probability of race conditions manifesting. A context switch happens for a CPU-bound [12] thread, as the threads of our application are, when its allocated time quantum expires. So, we tried to place the critical code at the end of each time quantum allocated to a thread. Of course, when we wrote our application, we had no way to find out and control the positions in the source code where the context switches would occur. In fact, if we had, then there would have been no source of randomness. The idea was just to let things happen randomly in one way (normal, i.e. no race conditions) or another (abnormal, i.e. race conditions). Notwithstanding, the critical code had to be near the end of a time quantum. We achieved that knowing that our critical code executes in a much shorter time than a time quantum. Executing only one such step (formed by the read, modify and update operations on the shared variable) would always fit and execute within one time quantum, so it would never meet a context switch. Observing this, all we had to do was to "fill" (in fact, to "overrun") the entire time quantum with such small steps. That way, because the time quantum is not a multiple of our step size, with great probability the last step will cross the limit of the time quantum and will be interrupted by a context switch. The idea is illustrated in Figure 4.

We must explain one more thing in order to make clear why random values are generated. The critical code, which we called a step, consists of several operations, so the occurrence of a context switch during such a step can happen according to either of the two scenarios described in Section 3. Depending on which scenario takes place, the final value of the shared variable will be one or the other. As we saw in Section 4, there is a small random variation in the number of CPU clock cycles of the time quanta allocated to the concurrent threads. Because this variation is smaller than the number of CPU clock cycles needed to completely execute the critical code, it is possible for the context switch to happen randomly at different points of our step, leading to one scenario or the other and, consequently, to different random values.

Figure 4. The way we place the critical code susceptible to race conditions at the end of a time quantum, i.e. the area of the context switch between concurrent threads. The critical code is considered to be one step, and the time quantum is filled with such steps. The last one crosses the end of the time quantum, which was exactly our intention.
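Putting the pieces together, one run of the application per output bit can be sketched as below. Again, this is an illustrative Python sketch rather than the authors' C program; the thread and step counts are arbitrary examples, and the interpreter's own switching, not only the Linux quantum mechanism, decides where the switches fall:

```python
import threading

def random_bit(n_threads, steps):
    """One run of the generator: n_threads repeat the critical step
    'steps' times on a shared variable; the final value is the output bit."""
    shared = {"var": 0}

    def worker():
        for _ in range(steps):          # fill the quantum with small steps
            v = shared["var"]           # read
            v = (v + 1) % 2             # modify
            shared["var"] = v           # update

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["var"]

# One output bit per run; repeat runs to build a sequence.
bits = [random_bit(n_threads=4, steps=2000) for _ in range(16)]
```

Each run yields one bit; a sequence is accumulated over many runs, mirroring the way the paper's application builds its output files.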

6. Testing and Results

We developed our tests by extending the method described in the previous sections. In order to make a thread use more than one time quantum, a couple of modifications were required to the basic structure of a thread. As mentioned in Section 5, the three operations on the shared variable, which we refer to as a step or a basic instruction block executed by a thread, require fewer CPU clock cycles than a time quantum has. Thus, the first modification was to repeat this basic block several times (steps). In this way, for a sufficiently large number of steps, we can be sure the context-switch point where the entropy is present will be encountered. We also created more than two threads in order to further exploit the competition for the CPU. Each test was repeated many times in order to get sequences of random numbers large enough to be evaluated by the NIST benchmark programs. We tested our application using the following two approaches:

1. The first approach was to keep the modification operation simple and increase the number of threads competing for the shared variable (up to 64), using a fixed number of steps (e.g. 30000). The cost of this approach is the time required to get the final value of the variable. We observed that the sequences of numbers obtained this way passed over 90% of the NIST tests with no post-processing. Furthermore, the quality of the generated numbers was proportional to the number of threads and to the time required to obtain the output. We could conclude that in order to obtain good random numbers by this method we need a long generation time.

2. The second approach was to increase the complexity of the modification operation, in order to increase the number of CPU clock cycles required to execute it.

Steps    No post-processing    XOR post-processing
6500     10%                   45%
11000    19.88%                67.2%
13000    24.19%                80.1%
14000    38.17%                91.93%
20000    42%                   97.84%

Table 3. The pass rate for the NIST tests, for different numbers of steps.

This was done by replacing it with more arithmetic operations and introducing other shared variables. We applied a modulo 2 operation to the final result of the arithmetic operations in order to keep it in the {0, 1} set. We used 8 threads. The program throughput was improved, since there was no need for such a large number of threads. The resulting sequences also passed several NIST tests, demonstrating that this was a relatively good way of getting random numbers. The tests were carried out for different numbers of steps (see Table 3). They demonstrated that, similarly to the previous approach, increasing the number of steps improves the quality of the output. It is obvious that a good compromise must be made between the cost (the time required for generating a bit) and the quality of the sequence of random numbers. A significant improvement of the program output was obtained by a simple XOR post-processing, discussed in the next subsection.

6.1. Post-Processing

There are several post-processing methods used in this field, like the XOR operation or the von Neumann method [2]. We tried both of them in order to decide which is the most appropriate for us. The von Neumann method not only caused a loss of 75% of the output, but also did not improve the quality much. On the other hand, the XOR operation corrected the output very well. This can be seen by comparing the last two columns in Table 3. The improvement is measured based on the NIST tests. We found that this post-processing method can improve the quality by almost 60%. Although a XOR between two bits generates only one bit, so the output is reduced by 50%, the quality is improved a lot. We can conclude that this is a good compromise between cost and quality.
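The two post-processing schemes compared above can be sketched as follows. These are our own minimal versions: pairwise XOR halves the output, while the von Neumann corrector keeps one bit only from 01/10 pairs and therefore discards about 75% of a balanced stream:

```python
def xor_postprocess(bits):
    """Combine consecutive bit pairs with XOR: output length is halved."""
    return [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]

def von_neumann(bits):
    """Keep the first bit of each 01/10 pair; drop 00/11 pairs entirely."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

raw = [0, 0, 1, 0, 1, 1, 1, 0]
xored = xor_postprocess(raw)   # pairs 00, 10, 11, 10 -> [0, 1, 0, 1]
vn = von_neumann(raw)          # pairs 00 and 11 dropped, 10 twice -> [1, 1]
```

The XOR variant always keeps half the bits, whereas the von Neumann output length depends on the input, which is consistent with the roughly 75% loss reported in the text.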

7. Conclusions

We developed a new method of implementing a software TRNG. It is based on the fact that a context switch can occur at unpredictable moments in the case of race conditions between concurrent threads. The factors we identified as contributing to the randomness of our method are cache misses, hazards in the processor's instruction pipeline and the imperfection of hardware clocks. Their influence is increased by the fact that the execution environment varies in time and cannot be reproduced. The results are quite good, provided some simple post-processing is applied to the output files. Taking into account that it is a cheap TRNG compared with its hardware counterparts and does not depend on external factors (e.g. network traffic), we consider our method to be a very promising way of generating random numbers. Future work directions involve improving the throughput without affecting the quality of the RNG. A deeper analysis of the sources of randomness could be the key to further improving the quality of the generated sequences of numbers.

References

[1] C. Fetzer and F. Cristian. Building fault-tolerant hardware clocks from COTS components. In Proceedings of the Conference on Dependable Computing for Critical Applications (DCCA), pages 67–86, November 1999.
[2] M. Dichtl. Bad and good ways of post-processing biased physical random numbers. August 2007.
[3] W. Diffie and M. Hellman. New directions in cryptography. November 1976.
[4] M. Drutarovsky and P. Galajda. A robust chaos-based true random number generator embedded in reconfigurable switched-capacitor hardware. Radioelektronika, April 2007.
[5] J. E. Gentle. Random Number Generation and Monte Carlo Methods. Springer, 2003.
[6] I. Goldberg and D. Wagner. Randomness and the Netscape browser. January 1996.
[7] P. Kohlbrenner, M. Lockhead, and K. Gaj. An embedded true random number generator for FPGAs. FPGA '04, February 2004.
[8] M. Mitchell, J. Oldham, and A. Samuel. Advanced Linux Programming, 1st Edition. New Riders, 2001.
[9] D. Schellekens, B. Preneel, and I. Verbauwhede. FPGA vendor agnostic true random number generator. In Proceedings of Field Programmable Logic and Applications (FPL '06), August 2006.
[10] J. Soto. Statistical testing of random number generators. National Institute of Standards and Technology, October 1999.
[11] T. Stojanovski and L. Kocarev. Chaos-based random number generators, Part I: Analysis [cryptography]. March 2001.
[12] A. Tanenbaum. Modern Operating Systems, 3rd Edition. Prentice Hall, 2007.
[13] A. Tanenbaum. Structured Computer Organization. Prentice Hall.
[14] A. Walsh. Java Bible. John Wiley & Sons, 1998.
