Tutorial for the R package rehh
Mathieu Gautier and Renaud Vitalis
February 9, 2012
Contents
1 Installation
1.1 From the R-forge repository
1.2 From source or binaries
1.2.1 Under Linux
1.2.2 Under MAC OS X
1.2.3 Under Windows
1.3 Loading the package
2 Input Files
2.1 Haplotype data file
2.2 SNP information data file
2.3 Loading data files
2.3.1 Example 1: reading haplotype file in standard format
2.3.2 Example 2: reading haplotype file in fastPHASE output format
2.3.3 Additional options
3 Computing EHH related statistics for a given marker: the calc_ehh(), calc_ehhs() and scan_hh() functions
3.1 Definition and Computation
3.1.1 Extended Haplotype Homozygosity (EHH)
3.1.2 Integrated EHH (iHH)
3.1.3 Site-specific EHH (EHHS)
3.1.4 Integrated EHHS (iES)
3.1.5 Dealing with missing data
3.2 The calc_ehh() function
3.3 The calc_ehhs() function
3.4 The scan_hh() function
4 Computing iHS and Rsb statistics with ihh2ihs() and ies2rsb()
4.1 Within population test: iHS
4.1.1 Definition
4.1.2 The ihh2ihs() function
4.1.3 Additional plotting functions
4.2 Pairwise population test: Rsb
4.2.1 Definition
4.2.2 Example
4.2.3 Additional plotting functions
4.3 Remark
5 Visualizing Haplotype structure around a core allele
6 Comparisons with other programs
6.1 The sweep software
6.2 iHS calc program
List of Figures
1 calc_ehh() graphical output
2 calc_ehhs() graphical output
3 ihsplot() graphical output
4 rsbplot() graphical output
5 distribplot() graphical output
6 bifurcation.diagram() graphical output
7 sweep and rehh give exactly the same results for EHH
8 Graphical output from sweep focusing on SNP #456 from the BTA12 CGU haplotypes (note the similarities of the bidiagram plots with those obtained with rehh (see Figure 6))
9 Comparison of the results obtained with ihs program and rehh for 1,424 SNPs of the 280 BTA12 CGU haplotypes (On this example rehh was more than 100 times faster)
This document presents additional information on the R package rehh and describes how to use it to perform whole-genome scans for footprints of selection using statistics related to the Extended Haplotype Homozygosity (EHH) [2]. Note that in the current implementation of the tests, markers are assumed to be bi-allelic.
1 Installation
The rehh package is currently available from two different sources. If the paper is accepted, we plan to upload the latest version of rehh to the CRAN website to facilitate installation and portability.
1.1 From the R-forge repository
The package is available for most platforms (Linux, MS Windows and Mac OS X) from the R-forge repository (http://r-forge.r-project.org/projects/rehh). It thus only requires R (possibly >2.13.0). To install the package, simply run the following command within R:
install.packages("rehh", repos="http://R-Forge.R-project.org")
1.2 From source or binaries
Because the R-forge repository might be off-line (a lot of maintenance is currently being done), we also provide the source of the package (rehh.tar.gz), two package binaries for Mac OS X (rehh_1.0.tgz) and Windows (rehh_1.0.zip) respectively, and the R manual (rehh_manual.pdf) in the zipped archive available here:
http://www1.montpellier.inra.fr/URLB/R-packages/rehh_1.0_all.zip
Note: In this document the following nomenclature is used: command written in the R terminal (or GUI) and Resulting Output.
Before installing the package, make sure that the gplots R package is installed (usually it is).
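The gplots prerequisite mentioned above can be checked from within R before installing rehh. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and is not part of rehh.

```r
# Hypothetical helper (not part of rehh): return the names of the
# packages in `pkgs` that are not installed in the current R library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

# gplots is the dependency named in this tutorial.
to_install <- missing_packages("gplots")
if (length(to_install) > 0) {
  message("Missing package(s): ", paste(to_install, collapse = ", "),
          "; install them with install.packages() before installing rehh.")
}
```

The check only reports what is missing and leaves the actual installation to the user, so it can be run safely on a machine without network access.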
1.2.1 Under Linux
In a terminal, cd to the directory containing the package archive and use the following command (possibly as root):
R CMD INSTALL rehh.tar.gz
1.2.2 Under MAC OS X
Within an R console, use the following commands:
setwd("PATH TO THE DIRECTORY CONTAINING THE rehh_1.0.tgz binary")
install.packages("rehh_1.0.tgz",repos=NULL)
If you encounter problems, you might need to install the package directly from the sources by using the same command in a terminal as the one described above for Linux.
1.2.3 Under Windows
Within an R console, use the following commands:
setwd("PATH TO THE DIRECTORY CONTAINING THE rehh_1.0.zip binary")
install.packages("rehh_1.0.zip",repos=NULL)
The Windows binary was produced under Windows XP (32 bits) and tested on both Windows XP (32 bits) and Windows 7 (64 bits). We nevertheless cannot guarantee that it works on more recent Windows versions. If you encounter problems, you might need to install the package from the sources by using the same command in a terminal as the one described above for Linux. To install from the sources, you also need to have the Rtools utilities installed and appropriately declared in the environment variables (see http://cran.r-project.org/).
1.3 Loading the package
Once the package has been successfully installed on your system, it can be loaded using the following command:
library(rehh)
2 Input Files
rehh requires as input i) a genotype data file (with phased haplotypes) for each population of interest and ii) a SNP information file. Three example files are provided in the package (bta12_hapguess_switch.out, bta12_cgu.hap and map.inp) and can be copied into the working directory with the following command:
make.example.files()
2.1 Haplotype data file
Two haplotype input file formats are supported:
i) Standard haplotype format. Each line represents a haplotype (the first element being the haplotype id.) with SNP genotypes in columns, as in the example file bta12_cgu.hap, which contains 280 haplotypes (id. 1 to 280) of 1424 SNPs each.
ii) Output file from fastPHASE [3], as in the bta12_hapguess_switch.out example file. Note that haplotypes might originate from several different populations (e.g. if the -u fastPHASE option was used): see 2.3.2.
By default, alleles are assumed to be coded as 0 (missing data), 1 (ancestral allele) and 2 (derived allele). However, recoding of the alleles into this format can be performed with the data2haplohh function (see 2.3.3).
Note: All the example data provided in the package rehh originate from a previously published study on the Creole cattle breed from Guadeloupe (CGU) [1]. Individuals were genotyped for more than 40,000 SNPs. The haplotype data file contains 280 haplotypes of 1424 SNPs mapping to bovine chromosome 12 (BTA12).
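The standard haplotype format and the default allele coding described above can be illustrated with a small base R sketch. It does not use rehh itself. The toy file, its whitespace separator and all variable names are assumptions made for illustration only; the data are invented and unrelated to bta12_cgu.hap.

```r
# Toy data in the standard haplotype format: one haplotype per line,
# the first element being the haplotype id, followed by one allele per SNP
# (0 = missing data, 1 = ancestral allele, 2 = derived allele).
# Whitespace-separated fields are an assumption of this sketch.
toy_lines <- c("1 1 2 0 1",
               "2 2 2 1 1",
               "3 1 2 1 0")
toy_file <- tempfile(fileext = ".hap")  # hypothetical file, not bta12_cgu.hap
writeLines(toy_lines, toy_file)

raw <- as.matrix(read.table(toy_file, header = FALSE))
hap_ids <- raw[, 1]                     # first column: haplotype ids
alleles <- raw[, -1, drop = FALSE]      # remaining columns: one per SNP
dimnames(alleles) <- NULL

# Only the three default codes are expected.
stopifnot(all(alleles %in% c(0, 1, 2)))

# Proportion of missing data per SNP (column) and per haplotype (row).
missing_per_snp <- colMeans(alleles == 0)
missing_per_hap <- rowMeans(alleles == 0)

# Derived allele frequency per SNP, ignoring missing calls.
alleles_na <- replace(alleles, alleles == 0, NA)
derived_freq <- colMeans(alleles_na == 2, na.rm = TRUE)
```

Summaries of this kind (missing data per SNP and per haplotype, allele frequencies) correspond to the quantities on which SNPs and haplotypes can be filtered when the data are loaded.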
2.2 SNP information data file
This data file should contain SNP information as in the map.inp example file. Each line describes one SNP and gives its name, its chromosome of origin, its position on the chromosome (it is up to the user to choose either physical or genetic map positions), and its ancestral and derived alleles (as coded in the haplotype input file). SNPs must be in the same order as in the haplotype file for the chromosome considered (and thus should be ordered according to the map information). If several chromosomes are represented in the map file, one can provide the name of the chromosome of interest (considered in the corresponding haplotype input file) as shown below.
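The layout of the SNP information file and the ordering requirement can be sketched in base R as follows. This is an illustration under stated assumptions, not the way rehh reads the file: the SNP names, positions, whitespace separator and column names (snp, chr, pos, anc, der) are all invented for the example.

```r
# Toy SNP information file with the five fields described above:
# SNP name, chromosome, position, ancestral allele, derived allele.
# Names and positions are invented; whitespace separation is assumed.
toy_map_lines <- c("snpA 12 1000 1 2",
                   "snpB 12 2500 1 2",
                   "snpC 13  500 2 1",
                   "snpD 12 4000 2 1")
toy_map_file <- tempfile(fileext = ".inp")  # hypothetical file, not map.inp
writeLines(toy_map_lines, toy_map_file)

map <- read.table(toy_map_file, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("snp", "chr", "pos", "anc", "der"))

# Keep only the chromosome of interest (here 12).
map_chr <- map[map$chr == 12, ]

# SNPs must be ordered by map position within the chromosome.
positions_ordered <- !is.unsorted(map_chr$pos)
```

Running such a check before loading the data can catch a map file whose SNP order does not match the haplotype file.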
2.3 Loading data files
The data2haplohh() function converts the data files into an R object of class haplohh, which is subsequently used by the functions of the rehh package. Several additional options are available to select SNPs (based on Minor Allele Frequency or percentage of missing data) and haplotypes (based on percentage of missing data), as exemplified below. More details about the different arguments of the function are available in the documentation, accessible using the command:
?data2haplohh
In the following examples, we assume that the command make.example.files() was run (see above) so that the example files are in the working directory.
2.3.1 Example 1: reading haplotype file in standard format
In this example, the haplotype input file bta12_cgu.hap (standard format) and the SNP information input file map.inp are converted into a haplohh object named haplo. Because the map file contains information for SNPs mapping to chromosomes other than the one of interest (BTA12), we use the option chr.name=12. hap