Using IWTomics to detect genomic features that discriminate between different groups of regions

Marzia A Cremona*, Alessia Pini, Francesca Chiaromonte and Simone Vantini

April 19, 2017

1 Introduction

IWTomics is a package to statistically evaluate differences in genomic features between groups of regions along the genome. Locations (within the regions) and scales at which these differences unfold need not be specified at the outset, and are in fact an output of the procedure. In particular, the package implements an extended version of the Interval-Wise Testing (IWT) for functional data (?), specifically designed to work with Omics data. IWTomics also includes a set of functions based on the package GenomicRanges to import and organize measurements of multiple genomic features in different regions, as well as a set of functions to create graphical representations of the features and of the test results. Importantly, genomic regions can have different lengths and different features can be measured at different resolutions.

In this vignette we present an example in which IWTomics is used to compare recombination hotspots in the genomic regions surrounding fixed ETns (elements of the Early Transposon family of active Endogenous Retroviruses in mouse) versus control regions. This data is part of a much larger dataset analyzed in Campos-Sánchez et al. (2016). The complete dataset comprises several genomic features, and was used to study integration and fixation preferences of different families of endogenous retroviruses in the mouse and human genomes (the software used in Campos-Sánchez et al. (2016) was a beta version of this package in which we implemented the Interval Testing Procedure of Pini and Vantini (2016) in place of the Interval-Wise Testing employed here).

We also present the complete workflow and various options in IWTomics through some synthetic datasets that are provided within the package.

* These authors contributed equally


1.1 Interval-Wise Testing (IWT)

The IWT is an inferential procedure that tests for differences in the distributions of a genomic feature between two sets of regions (e.g. between case and control regions, a two sample test), or between a single set of regions and a reference curve (e.g. between case regions and a reference null measurement, a one sample test). The feature under study must be measured in windows of fixed size in each region (e.g. at a resolution of 1 bp, or over 1 kb windows), and these contiguous measurements are considered by the IWT as curves. The IWT is a Functional Data Analysis technique, hence it can directly deal with these curves. In particular, the IWT is able to assess whether there are differences in the distributions of the curves globally, and it also investigates local effects, attributing the statistically significant differences to specific locations along the regions (e.g. only in the central part of the regions).

Since the IWT is based on permutation tests (non-parametric tests), it does not require any assumption on the statistical distribution of the data and it can be easily employed to study different types of Omics signals, from DNA conformation contents, to transcription data or chromatin modifications. Moreover, it can be used even if the sample sizes differ in the two groups under consideration, or if the sample sizes are small. Sample sizes affect the resolution of the empirical p-value, which cannot be finer than $1/P$, where $P$ is the total number of possible permutations leading to distinct values of the test statistics (see Subsection 4.2). For instance, in the two sample test between two independent populations of sizes $n_1$ and $n_2$, $P = \binom{n_1+n_2}{n_1}$.

In the extended version implemented in IWTomics, the IWT is not only location-free but also scale-free. Indeed, it is able to investigate a range of different scales (from the finest one provided by the measurement resolution, to the coarsest one given by the region length), making it possible to assess the scale at which a feature displays its effect (see Section 7 for details). Moreover, different test statistics can be employed (e.g. mean difference, variance ratio or quantile difference).
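The relation between sample sizes and p-value resolution can be illustrated with a small base R sketch. It is not the IWT and not the code used inside IWTomics: it is a generic exhaustive two sample permutation test on a single window, with a one-sided mean difference as test statistic. The measurements in `case` and `control` are made-up values chosen for illustration.

```r
# Hypothetical measurements of one feature in a single window
# (illustrative values, not taken from any IWTomics dataset)
case    <- c(0.30, 0.25, 0.40, 0.35)
control <- c(0.10, 0.20, 0.15, 0.05)
n1 <- length(case)
n2 <- length(control)

# Number of distinct ways to assign n1 of the n1 + n2 curves to group 1
P <- choose(n1 + n2, n1)

# Pool the data and enumerate every possible group-1 index set
pooled <- c(case, control)
group1_sets <- combn(n1 + n2, n1)

# One-sided mean difference for a given choice of group-1 indices
mean_diff <- function(i) mean(pooled[i]) - mean(pooled[-i])

perm_stats <- apply(group1_sets, 2, mean_diff)
observed   <- mean_diff(seq_len(n1))

# Exact permutation p-value: fraction of relabellings at least as extreme
p_value <- mean(perm_stats >= observed)

P        # 70
p_value  # equals 1/P here, the smallest value this test can return

# For the sample sizes of the ETn example (1296 and 1142 regions), P is
# too large for choose(), so its order of magnitude is computed via logs
log10_P_ETn <- lchoose(1296 + 1142, 1296) / log(10)
```

Here every case value exceeds every control value, so only the observed labelling reaches the observed statistic and the p-value is exactly 1/70. With samples this small, no p-value below 1/P is attainable, whatever the data.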

1.2 Installation and loading

The IWTomics package is available at bioconductor.org and can be installed via biocLite, typing the following commands in the R console (an internet connection is needed):

source("http://www.bioconductor.org/biocLite.R")
biocLite("IWTomics",dependencies=TRUE)

After the package is installed, it can be loaded into the R workspace by typing

library(IWTomics)

1.3 A first example: recombination hotspots around fixed ETn

We illustrate use and output of IWTomics through a real data example from Campos-Sánchez et al. (2016) (the dataset is provided as part of the package). This data contains two groups of genomic regions, ETn fixed and Control. In each region we measure one genomic feature, the content of Recombination hotspots. This is how to load the data into R:

data(ETn_example)
ETn_example
## IWTomicsData object with 2 region datasets with center alignment, and 1 feature:
## Regions:
##   ETn fixed: 1296 regions
##   Control: 1142 regions
## Features:
##   Recombination hotspots content: 1000 bp resolution
## No tests present.

The region dataset ETn fixed comprises 1296 regions flanking fixed ETn elements (elements of the Early Transposon family of active Endogenous Retroviruses in mouse). These regions correspond to 32-kb flanking sequence upstream and 32-kb flanking sequence downstream of each fixed ETn element. The content of Recombination hotspots is measured at 1000 bp resolution in each region: the 64-kb length is divided into 64 1-kb windows and 64 consecutive measurements are produced, expressing the fraction of each 1-kb window covered by recombination hotspots.

The goal is to understand whether the presence of fixed ETn elements in the genome is affected by recombination hotspots. To this end, we compare the regions in ETn fixed with control regions, defined as 64-kb regions that do not overlap with flanking sequences of ETn or of other families of Endogenous Retroviruses. The region dataset Control contains 1142 such regions, for each of which we have again 64 consecutive measurements of Recombination hotspots. Figure 1 shows a visual representation of the dataset obtained with the function plot in the following code:

plot(ETn_example,cex.main=2,cex.axis=1.2,cex.lab=1.2)

Figure 1: Plot of Recombination hotspots in ETn fixed regions (red) and Control regions (green). Panels: Recombination hotspots content by window (x axis: Windows, −30 to 30) and sample size by window.

Figure 1 suggests that the average recombination hotspots content right around the locations of fixed ETns (±5 kb flanks) is higher than that in control regions. The function IWTomicsTest, the main function of the package IWTomics, allows us to test differences in a rigorous way, returning a p-value curve (one p-value for each 1-kb window) adjusted considering the whole 64-kb region, as shown in the next chunk of code.

ETn_test=IWTomicsTest(ETn_example,
                      id_region1='ETn_fixed',id_region2='Control')
## Performing IWT for 'ETn fixed' vs. 'Control'...
## Performing IWT for feature 'Recombination hotspots content'...
## Point-wise tests...
## Interval-wise tests...

adjusted_pval(ETn_test)
## $test1
## $test1$Recombination_hotspots
##  [1] 0.456 0.457 0.504 0.747 0.864 0.890 0.890 0.890 0.931 0.990
## [11] 0.990 0.992 0.990 0.990 0.990 0.990 0.990 0.990 0.951 0.771
## [21] 0.726 0.726 0.713 0.688 0.575 0.459 0.364 0.197 0.072 0.039
## [31] 0.048 0.085 0.218 0.482 0.768 0.829 0.887 0.946 0.978 0.991
## [41] 0.992 0.996 0.996 0.996 0.996 1.000 1.000 1.000 1.000 1.000
## [51] 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993
## [61] 0.993 0.993 0.993 0.993
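To read the adjusted p-value curve programmatically, a short base R sketch such as the following can help. It hard-codes the 64 values printed above so that it runs without the package. The 5% level `alpha` is an assumption made for illustration, and the variable names `adj_p` and `significant_windows` are not part of IWTomics.

```r
# Adjusted p-values transcribed from the adjusted_pval(ETn_test) output above.
# With the package loaded, the same vector would presumably be obtained as
# adjusted_pval(ETn_test)$test1$Recombination_hotspots (names as printed above).
adj_p <- c(0.456, 0.457, 0.504, 0.747, 0.864, 0.890, 0.890, 0.890, 0.931, 0.990,
           0.990, 0.992, 0.990, 0.990, 0.990, 0.990, 0.990, 0.990, 0.951, 0.771,
           0.726, 0.726, 0.713, 0.688, 0.575, 0.459, 0.364, 0.197, 0.072, 0.039,
           0.048, 0.085, 0.218, 0.482, 0.768, 0.829, 0.887, 0.946, 0.978, 0.991,
           0.992, 0.996, 0.996, 0.996, 0.996, 1.000, 1.000, 1.000, 1.000, 1.000,
           rep(0.993, 14))

# Illustrative significance level (an assumption, not prescribed by the text)
alpha <- 0.05

# Indices (1 to 64) of the 1-kb windows with a significant adjusted p-value
significant_windows <- which(adj_p <= alpha)
significant_windows
# [1] 30 31
```

Only two adjacent windows near the center of the 64-kb region fall below the threshold, which agrees with the visual impression from Figure 1 that the difference is concentrated right around the fixed ETn elements.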

Test results can be easily understood using the visual representation provided by the function plotTest and shown in Figure 2.

plotTest(ETn_test)

In this figure, the top panel shows the adjusted p-value heatmap: the x axis represents locations, i.e. the 64 1-kb windows in the 64-kb region, while the y axis represents all possible scales used to adjust the p-value curve (from 1 window, no adjustment; to 64 windows, adjustment based on the entire region). The central panel is a plot of the adjusted p-values at the maximum scale (threshold at 64 windows; more details below and in Subsection 4.1), and the gray area corresponds to significant adjusted p-values (
