Data management for maize genomic selection


Several reasons to use a database to manage data:
- Store data, so that the cost of collecting it is not wasted
- Share data between researchers and improve communication

Databases are also important for statisticians

Sherlock Holmes told Dr. Watson: “I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (in "A Scandal in Bohemia" by Sir Arthur Conan Doyle)

Data should be sufficient to support theory

Therefore, we need to manage the data first, and only then come up with statistical models

Databases related to maize

- MaizeGDB (http://www.maizegdb.org/)
- Maize TE (Transposable Element) database (http://maizetedb.org/~maize/)
- TIGR Maize Database (http://maize.jcvi.org/)
- PPDB, Plant Proteome DataBase (http://ppdb.tc.cornell.edu/)
- Maize gene expression database (http://www.plexdb.org/plex.php?database=Corn)

None of them includes genomic selection information. Phenotype data and GBS (genotyping-by-sequencing) genotype data from different breeders are now available, so it is an appropriate time to construct a database: Zeabase.

Overall workflow of Zeabase

Raw or semi-raw data
- Phenotype data: metadata, accession name, pedigree, plot data for traits
- GBS genotype data: accession name, GBS markers

[Figure: workflow diagram with the components Perl parsing and loader, PostgreSQL with Chado schema, dbcon, server pages (Mason, HTML, jQuery, Ajax), and web browsers]

Challenge: there is no universal loader. Data formats differ because the data come from diverse sources.

Solution (based on efficiency):
- Develop filters to control the quality of the data.
- Develop parsers to make the data fit the loaders.
- Modify the loaders to fit the data.

Data Sets

Phenotype data
- 298 trials (the first 145 trials have 4 traits; the other 153 trials have 21 traits)
- 20321 accessions
- 95604 plots

Genotype data (GBS data)
- Imputed GBS data
- 955690 markers for each line

Quality Control of Data Set

- Different data sources produce different data formats. For example, different breeders encode pedigrees in different formats.
- Experiment designs differ. For example, plot data include entry number, replication, block and plot information. Each entry number is associated with an accession, and each entry has 2-3 replications. Some trials have checks and missing data.
- After quality control, the data need to be parsed into a format that the loaders can load.
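The replication rule and the missing-data problem described above lend themselves to simple automated checks. The following is a minimal sketch in Python, not the actual Zeabase filter: the field names ("entry", "rep", "block", "plot"), the "NA" missing-value marker, and both function names are assumptions made for illustration.

```python
from collections import defaultdict

def check_replications(plots, min_reps=2, max_reps=3):
    """Return {entry: rep_count} for entries whose number of
    replications falls outside the expected range."""
    reps = defaultdict(set)
    for row in plots:
        reps[row["entry"]].add(row["rep"])
    return {entry: len(r) for entry, r in reps.items()
            if not min_reps <= len(r) <= max_reps}

def find_missing(plots, trait):
    """Return the plot identifiers whose value for `trait` is missing."""
    return [row["plot"] for row in plots
            if row.get(trait) in (None, "", "NA")]

# Hypothetical plot records, for illustration only.
plots = [
    {"entry": 1, "rep": 1, "block": 1, "plot": 101, "yield": 5.2},
    {"entry": 1, "rep": 2, "block": 2, "plot": 201, "yield": "NA"},
    {"entry": 2, "rep": 1, "block": 1, "plot": 102, "yield": 4.8},
]

print(check_replications(plots))   # entries with too few or too many reps
print(find_missing(plots, "yield"))  # plots with a missing trait value
```

Rows flagged by such checks can be sent back to the data submitter or corrected before parsing.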


QC and Parsing of Pedigree data

([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B-B

([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML312/CML442

([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML395/CML444

(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B//CML144/CML159-B-16-B

This data set includes more than 7345 pedigree strings in this format (with square brackets and parentheses), so manual quality control is not feasible. We therefore need an algorithm to process them.

Algorithm for processing pedigree data

1. Check for balanced parentheses using a stack (a LIFO, “Last In, First Out”, data structure).

2. Look for “//” outside of parentheses. If it is found, replace it with “///”; if not, look for “/” outside of parentheses and replace it with “///”. Perform this step recursively until no “//” or “/” exists outside of parentheses and no ( ) or [ ] characters remain.

3. Convert pedigree strings into part1///part2 format.
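One pass of step 2 can be sketched as follows. This is an illustrative Python sketch, not the parser used for Zeabase: the function names are hypothetical, only a single level is split (the recursion into bracketed parts is left out), and using the first separator found outside brackets is an assumption.

```python
OPENING, CLOSING = "([", ")]"

def top_level_positions(s, sep):
    """Return the indices where `sep` starts outside any bracket."""
    depth, hits, i = 0, [], 0
    while i < len(s):
        ch = s[i]
        if ch in OPENING:
            depth += 1
        elif ch in CLOSING:
            depth -= 1
        elif depth == 0 and s.startswith(sep, i):
            hits.append(i)
            i += len(sep)      # skip the whole separator
            continue
        i += 1
    return hits

def split_pedigree(s):
    """Rewrite one pedigree string as part1///part2.

    Prefers "//" outside brackets, falls back to "/" outside brackets,
    and returns the string unchanged if neither is found.
    Assumes the brackets have already been checked for balance.
    """
    for sep in ("//", "/"):
        hits = top_level_positions(s, sep)
        if hits:
            i = hits[0]        # assumption: split at the first occurrence
            return s[:i] + "///" + s[i + len(sep):]
    return s

print(split_pedigree(
    "(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)"
    "-B-24B)//CML144/CML159-B-16-B"))
```

Tracking the bracket depth is what keeps the slashes inside nested crosses from being mistaken for the outermost cross.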


Use a stack to check for unbalanced parentheses

1. Scan the string from left to right.
2. If the symbol is an opening symbol, push it onto the stack. If it is a closing symbol and the opening symbol on top of the stack is of the same type, pop it.
3. The scan should end with an empty stack.

Example of an unbalanced string: [(])

[Figure: stack data structure]

Pedigree string with unbalanced parentheses:
(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B//CML144/CML159-B-16-B

The same string with balanced parentheses:
(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B)//CML144/CML159-B-16-B
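The stack check can be written in a few lines. The sketch below is in Python and is only an illustration of the technique, not the code used for Zeabase; the function name is hypothetical.

```python
PAIRS = {")": "(", "]": "["}

def is_balanced(s):
    """Return True if every ( ) and [ ] in `s` is properly matched."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)              # opening symbol: push
        elif ch in PAIRS:
            # closing symbol: the top of the stack must be its partner
            if not stack or stack[-1] != PAIRS[ch]:
                return False
            stack.pop()
    return not stack                       # must end with an empty stack

print(is_balanced("[(])"))  # False
print(is_balanced(
    "(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)"
    "-B-24B)//CML144/CML159-B-16-B"))  # True
```

All other characters are ignored, so the check can be run directly on raw pedigree strings before any parsing.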

Visualization of pedigree data

([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30-SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49-SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML312/CML442

QC, correction, parsing, loading, visualization

Standardization of phenotype traits

- The data come from different sources.
- To make communication between researchers easier, phenotype trait descriptions need to be standardized.
- We use the maize trait ontology to encode traits.
- We edited Trait Ontology (TO) terms in the maize.obo file using OBO-Edit to make them more suitable for the traits in these trials.
- These TO terms are loaded into Zeabase before we load the data.

Maize trait ontology


Zeabase Natural Diversity (ND) Chado Schema

[Figure: schema diagram showing the tables cv, cvterm, cvterm_relationship, project, project_prop, project_relationship, nd_experiment, nd_exp_prop, nd_location_description, protocol, phenotype (value), genotype, stock, stock_prop, and stock_relationship (pedigree, plot vs. accession)]

Database Demo http://zeabase.sgn.cornell.edu/


Summary

- We developed several QC filters and parsers to process raw and semi-raw data, so that these data can then be loaded into Zeabase efficiently using the loaders.
- Some functions of Zeabase are already available to users.

Further direction

- Data format issue
  - Create a phenotype trait data template, and work with data submitters to use a consistent data format so that data management and loading become easier
  - Develop ad hoc filters and parsers that fit the incoming new data (WEMA, Stage1&2, synthetics, ...)
- Optimize the current functions of Zeabase (loading, search using materialized views, downloading)
- Add new functions:
  - Training set selection
  - Integration of GS methods

Data set

- WEMA (4 traits)
- Stage1&2 (21 traits)
- Synthetics data (30 traits)

Standardization of traits of Synthetics Data set

It is important to use Trait Ontology to standardize trait description


Management of GBS data sets

- WEMA: 3402 accessions
- Stage1&2: 2334 accessions
- Synthetics: 609 accessions

Each accession has 955690 SNP markers

GBS data (Hapmap format)


Loading GBS data into Zeabase

Download GBS data for one accession


Download GBS data for multiple accessions

Genomic selection in Zeabase

GBS data encoding

For each marker, the allele whose letter is lower in alphabetical order is coded -1, the heterozygote (H) is coded 0, the other allele is coded 1, and missing data are coded -9.
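The encoding rule can be illustrated with a short Python sketch. It rests on several assumptions that the slides do not state: genotype calls are single letters, "lower in alphabetical order" is read as "earlier in the alphabet", heterozygotes are written as "H", and missing calls are written as "N". The function name is hypothetical.

```python
def encode_marker(calls, het="H", missing="N"):
    """Encode the genotype calls of one marker across accessions.

    Allele earliest in the alphabet -> -1, heterozygote -> 0,
    any other allele -> 1, missing -> -9.
    The `het` and `missing` symbols are assumptions, not taken
    from the source data.
    """
    alleles = {c for c in calls if c not in (het, missing)}
    first = min(alleles) if alleles else None
    coded = []
    for c in calls:
        if c == missing:
            coded.append(-9)
        elif c == het:
            coded.append(0)
        elif c == first:
            coded.append(-1)
        else:
            coded.append(1)
    return coded

print(encode_marker(["A", "G", "H", "N", "A"]))  # [-1, 1, 0, -9, -1]
```

Under this reading, a marker at which every accession carries the same allele is coded -1 throughout.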

Download function for GBS data


Download GBS data for multiple accessions


Integration of DOE into Zeabase

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” Ronald Fisher

It is important to supply a DOE (design of experiments) layout to the breeders before they perform breeding trials

Implementation architecture

https://github.com/cran/agricolae (includes many DOE methods used in agricultural research)

https://github.com/solgenomics/yapri

[Figure: architecture diagram with the components Perl, PostgreSQL with Chado schema, dbcon, server pages (Mason, HTML, jQuery, Ajax), and web browsers]

Designs shown: Completely Randomized trial, Complete Block, Alpha Lattice, Augmented

Trial design in Zeabase

Field layout for Completely Randomized Trial
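For readers unfamiliar with the design, a completely randomized layout assigns every replicate of every entry to a plot at random, with no blocking. The Python sketch below only illustrates that idea; it is not the Zeabase implementation, and the function name, parameters, and row-by-row field shape are assumptions.

```python
import random

def crd_layout(entries, reps, n_cols, seed=None):
    """Return a field layout (list of rows) for a completely
    randomized design: each entry appears `reps` times and all
    plots are shuffled together."""
    rng = random.Random(seed)          # seeded for a reproducible layout
    plots = [e for e in entries for _ in range(reps)]
    rng.shuffle(plots)
    # cut the shuffled plot list into field rows of n_cols plots
    return [plots[i:i + n_cols] for i in range(0, len(plots), n_cols)]

for row in crd_layout(["A", "B", "C"], reps=2, n_cols=3, seed=1):
    print(row)
```

Passing a seed makes the layout reproducible, which helps when the same layout has to be regenerated for a breeder later.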


Layout for Complete Block
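Assuming "Complete Block" refers to the randomized complete block design, each block contains every entry exactly once and the order is randomized independently within each block. The Python sketch below illustrates this under that assumption; it is not the Zeabase implementation, and the function name and parameters are hypothetical.

```python
import random

def rcbd_layout(entries, n_blocks, seed=None):
    """Return one list of entries per block; every block holds all
    entries once, in an independently shuffled order."""
    rng = random.Random(seed)          # seeded for a reproducible layout
    layout = []
    for _ in range(n_blocks):
        block = list(entries)
        rng.shuffle(block)             # randomize within the block only
        layout.append(block)
    return layout

for number, block in enumerate(rcbd_layout(["A", "B", "C", "D"], 3, seed=1), 1):
    print(number, block)
```

Restricting the randomization to within blocks is what lets block-to-block field variation be separated from entry differences in the analysis.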


Layout for Alpha Lattice


Integration of Moving Average Design (MAD) in Zeabase

MAD is one of the designs for unreplicated trials. It aims to separate genotype effects from environmental effects.

Interface for MAD


Further direction

- GBS big data issue: use HDF5 and Hadoop to improve management of these large data sets
- Optimize the current functions of Zeabase
- Add new functions