Random sampling script (C++)


Wednesday, April 27, 2011


Computers, as machines with deterministic behavior, cannot generate truly random numbers. However, they can be attached to external devices that measure random-like events from physical processes (for example, a radio measuring atmospheric noise after a thunderstorm). Such devices, termed hardware random number generators, may be relatively slow in some cases, and they may produce a biased sequence of numbers (since some values could be more common, depending on the nature of the observed physical process). Computer algorithms, on the other hand, can quickly generate a sequence of pseudorandom numbers using a predefined mathematical formula (pseudorandom number generators, or PRNGs). Since such numbers are produced by a formula and can thus be predicted, they should be avoided for creating passwords or for any other cryptographic application.

In biological studies, the predictability of PRNGs is not a main concern. PRNGs are commonly used for different purposes in biology. Randomization, for example, is used to generate random nucleotide or amino-acid sequences or to shuffle existing sequences for different test purposes. The features of PRNGs that must be controlled properly, and that matter most in biological studies, are the uniform distribution of the chosen values and the absence of short cycles of repetition in the randomly chosen numbers. As shown here, in some instances uniformity of the deviates and freedom from short cycles may be mutually exclusive. However, there is a way to keep both of these essential characteristics when building a generator for use in biological data analysis.

Pseudorandom number generators are often based on linear congruential generators (LCGs). The following relation defines LCGs:

$X_{n+1} = (aX_n + c) \bmod m$

with $X$ being the sequence of pseudorandom values, $a$ the multiplier, $c$ the increment, and $m$ (the modulus) defining the range of these values.
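The LCG relation above can be sketched in a few lines of C++. This is only a minimal illustration: the function names `lcg_next` and `lcg_period` and the small parameter values mentioned below are assumptions chosen for readability, not the constants of any real rand() implementation.

```cpp
#include <cstdint>

// One step of a linear congruential generator: X_{n+1} = (a * X_n + c) mod m.
// 64-bit arithmetic keeps a * x + c from overflowing for the small values used here.
std::uint64_t lcg_next(std::uint64_t x, std::uint64_t a, std::uint64_t c, std::uint64_t m)
{
    return (a * x + c) % m;
}

// Number of steps until the sequence first returns to its seed.
// Returns 0 if the seed is not revisited within m steps
// (the seed is then not part of a cycle).
std::uint64_t lcg_period(std::uint64_t seed, std::uint64_t a, std::uint64_t c, std::uint64_t m)
{
    std::uint64_t x = seed % m;
    const std::uint64_t start = x;
    for (std::uint64_t steps = 1; steps <= m; ++steps) {
        x = lcg_next(x, a, c, m);
        if (x == start) {
            return steps;
        }
    }
    return 0;
}
```

With the illustrative parameters a = 5, c = 3, m = 16, the call `lcg_period(7, 5, 3, 16)` visits all 16 values before repeating, whereas a = 3, c = 0, m = 16 started from 1 cycles after only 4 steps (1, 3, 9, 11). The choice of a and c therefore decides whether short cycles appear.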
The built-in rand() functions of various programming languages are based on LCGs that return values up to a maximum of $2^{15} - 1$ = 32,767 (e.g. in Microsoft Visual C++, which was used in this study). Thus, the range of most commonly used LCGs is limited to $2^{15}$ values. Defining a new range of random values, whether extending or contracting the initial range, can be done in many different ways. One algorithm often used in C++ to define a desired (smaller) range of random numbers (RANGE_MAX) is based on dividing rand() by RANGE_MAX + 1: the remainder of that division is taken as a random number ranging from 0 to RANGE_MAX. However, this algorithm uses the lower-order bits of the random numbers, which exhibit much shorter cycles and are thus considered less “random”. Algorithms of this type should generally be avoided. To test the cycling properties of this algorithm, one can use the following code, which creates 262,144 random numbers ranging from 0 to 1.
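The short cycling of low-order bits under remainder-based range reduction can also be checked in isolation. The sketch below is a separate illustration, not the author's listing. It assumes a hypothetical LCG with modulus 2^31 that returns its full state (real rand() implementations may behave differently), and the names `Lcg`, `reduce_to_range`, and `low_bits_period` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical LCG with modulus m = 2^31; the multiplier and increment are
// illustrative choices, not the constants of any particular rand().
struct Lcg {
    std::uint32_t state;
    std::uint32_t next()
    {
        // Unsigned arithmetic wraps modulo 2^32; the mask reduces it modulo 2^31.
        state = (1103515245u * state + 12345u) & 0x7fffffffu;
        return state;
    }
};

// Remainder-based range reduction: maps value into 0..range_max.
std::uint32_t reduce_to_range(std::uint32_t value, std::uint32_t range_max)
{
    return value % (range_max + 1u);
}

// Steps until the lowest `bits` bits of the state return to their starting pattern.
std::uint64_t low_bits_period(std::uint32_t seed, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    const std::uint32_t range_max = (1u << bits) - 1u;  // e.g. bits = 1 gives 0..1
    Lcg gen{seed & 0x7fffffffu};
    const std::uint32_t start = reduce_to_range(gen.state, range_max);
    std::uint64_t steps = 0;
    do {
        ++steps;
    } while (reduce_to_range(gen.next(), range_max) != start);
    return steps;
}
```

Under these assumptions, reducing to the range 0 to 1 (`low_bits_period(seed, 1)`) yields a sequence that simply alternates with period 2, and the range 0 to 3 repeats every 4 values, even though the generator as a whole has a far longer period.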


About me

MILOS PJANIC POSTDOCTORAL RESEARCH FELLOW, CARDIOVASCULAR MEDICINE Stanford University School of Medicine Doctor of Philosophy-PhD, Universite De Lausanne (2010) Master of Science-MSc, University Of Belgrade (2005) milospjanic@stanford:~$ mpjanic | blog > tips.out


// Algorithm type 1
#include <iostream>
#include <cstdlib>   // srand, rand
#include <ctime>     // time

using namespace std;

int main()
{
    srand((unsigned)time(0));   // seed the generator with the current time
    long int i = 0;             // counter of generated numbers
    int random_integer;
    while(i