Data management for maize genomic selection
Several reasons to use a database to manage data:
- Store data, so that the cost of collecting it is not wasted
- Share data between researchers and enhance communication
A database is also important for statisticians
Sherlock Holmes told Dr. Watson: “I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts” (in "A Scandal in Bohemia" by Sir Arthur Conan Doyle)
Data should be sufficient to support theory
Therefore, we need to manage the data first, and then come up with statistical models
Databases related to maize
MaizeGDB(http://www.maizegdb.org/)
Maize TE(Transposable Element) database(http://maizetedb.org/~maize/)
TIGR Maize Database(http://maize.jcvi.org/)
PPDB(Plant Proteome DataBase)(http://ppdb.tc.cornell.edu/)
Maize gene expression database(http://www.plexdb.org/plex.php?database=Corn)
None of them includes genomic selection information
Now that phenotype and GBS genotype data from different breeders are available, it is an appropriate time to construct a database: Zeabase
Overall workflow of Zeabase
Raw or semi-raw data
- Phenotype data: metadata, accession name, pedigree, plot data for traits
- GBS genotype data: accession name, GBS markers
[Workflow diagram: raw data, Perl parsers and loaders, PostgreSQL with Chado schema, dbcon, server pages (Mason, HTML, jQuery, Ajax), web browsers]
Challenge: There is no universal loader. Data formats differ because the data sources are diverse
Solution (based on efficiency):
- Develop filters to control data quality
- Develop parsers to make the data fit the loaders
- Modify the loaders to fit the data
Data Sets
Phenotype data
298 trials (the first 145 trials have 4 traits, and the remaining 153 trials have 21 traits)
20321 accessions
95604 plots
Genotype data(GBS data)
Imputed GBS data
955690 markers for each line
Quality Control of Data Set
Different data sources lead to different data formats. For example, different breeders use different pedigree encoding formats
Experimental designs also differ. For example, plot data include entry number, replication, block and plot information. Each entry number is associated with an accession, and each entry has 2-3 replications. Some trials have checks and missing data
After quality control, the data need to be parsed into a format that the loaders can load
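The replication and missing-data checks mentioned above could be sketched roughly as follows. This is a minimal illustration, not the actual Zeabase filter: the record layout (dictionaries with `entry`, `rep`, `block`, `plot`, `value` keys), the function names, and the use of `None` for a missing value are all assumptions made for the example.

```python
from collections import defaultdict


def check_replications(plots, min_reps=2, max_reps=3):
    """Return {entry: replication count} for entries outside [min_reps, max_reps]."""
    reps_by_entry = defaultdict(set)
    for plot in plots:
        reps_by_entry[plot["entry"]].add(plot["rep"])
    return {
        entry: len(reps)
        for entry, reps in reps_by_entry.items()
        if not min_reps <= len(reps) <= max_reps
    }


def missing_values(plots):
    """Return the plot records whose trait value is missing (None, by assumption)."""
    return [plot for plot in plots if plot["value"] is None]


# Hypothetical plot records, for illustration only.
example_plots = [
    {"entry": 1, "rep": 1, "block": 1, "plot": 101, "value": 5.2},
    {"entry": 1, "rep": 2, "block": 2, "plot": 201, "value": None},
    {"entry": 2, "rep": 1, "block": 1, "plot": 102, "value": 4.8},
]
print(check_replications(example_plots))
print(len(missing_values(example_plots)))
```

Entries flagged by such a filter would be reviewed or corrected before parsing, rather than silently dropped.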
QC and Parsing of Pedigree data
([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B-B
([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML312/CML442
([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML395/CML444
(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B//CML144/CML159-B-16-B
This data set includes more than 7345 pedigree strings in this format (with square brackets and parentheses), so manual quality control is impractical. Therefore, we need an algorithm to process them
Algorithm for processing pedigree data
1. Check for balanced parentheses using a stack (a LIFO data structure; LIFO: “Last In, First Out”)
2. Look for “//” outside of parentheses. If it is found, replace it with “///”; if not, find “/” outside of parentheses and replace it with “///”. Perform this step recursively until no “//” or “/” exists outside of parentheses and no ( ) or [ ] characters remain
3. Convert pedigree strings into part1///part2 format
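A single pass of steps 2 and 3 might look like the sketch below. It is an illustration under assumptions, not the parser used for Zeabase: the function names are hypothetical, the first matching separator is chosen when several exist at the top level, and the recursive descent into bracketed parts is left out.

```python
def top_level_slash_runs(pedigree):
    """Return (start, length) for each run of '/' found outside all brackets."""
    runs = []
    depth = 0
    i = 0
    while i < len(pedigree):
        ch = pedigree[i]
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == "/":
            j = i
            while j < len(pedigree) and pedigree[j] == "/":
                j += 1
            if depth == 0:
                runs.append((i, j - i))
            i = j
            continue
        i += 1
    return runs


def mark_top_level_cross(pedigree):
    """Replace the top-level '//' (or, failing that, '/') with '///'.

    Returns None when no separator exists outside the brackets.
    """
    runs = top_level_slash_runs(pedigree)
    for wanted in (2, 1):  # prefer '//' over '/'
        for start, length in runs:
            if length == wanted:
                return pedigree[:start] + "///" + pedigree[start + length:]
    return None


marked = mark_top_level_cross("(A/B)-1//C/D")
part1, part2 = marked.split("///")
print(part1, part2)
```

The input is assumed to have passed the balanced-parentheses check first, since the depth counter is only meaningful for balanced strings.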
Use a stack data structure to check for unbalanced parentheses
1. Scan from left to right
2. If the symbol is an opening symbol, push it onto the stack. If it is a closing symbol and the opening symbol on top of the stack is of the same type, pop it
3. The scan should end with an empty stack
Example of an unbalanced string: [(])
Unbalanced pedigree string:
(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B//CML144/CML159-B-16-B
Balanced after correction:
(ECA-EE-DLN-PL1-(ECA-EE-DLN-PL1-1/PL15QPMC7SRC1F2//POOL15QPMSR)-B-24B)//CML144/CML159-B-16-B
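The stack check can be written in a few lines. The sketch below is one possible rendering in Python (the function name `is_balanced` is hypothetical); it treats round and square brackets as the only bracket types, as in the pedigree strings shown here.

```python
# Map each closing symbol to the opening symbol it must match.
PAIRS = {")": "(", "]": "["}


def is_balanced(pedigree):
    """Return True if every bracket in the string is closed in LIFO order."""
    stack = []
    for ch in pedigree:
        if ch in "([":
            stack.append(ch)  # opening symbol: push
        elif ch in PAIRS:
            # closing symbol: the top of the stack must be the same type
            if not stack or stack[-1] != PAIRS[ch]:
                return False
            stack.pop()
    return not stack  # must end with an empty stack


print(is_balanced("[(])"))
print(is_balanced("([A/B]-1)//C"))
```

Strings that fail the check are the ones that need manual or rule-based correction before parsing.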
Visualization of pedigree data
([CML202/CML442//[DTP2WC4H255-1-2-2-BB/[[NAW5867/P30-SR]-111-2/[NAW5867/P30-SR]-25-1]-8-1-1-B-1]-1-2-2-B]-1-1-1-1-BBB-B/[CML442/CML197//[TUXPSEQ]C1F2/P49-SR]F2-45-7-3-2-BBB]-2-1-1-1-1-B*4-B)DH3-B//CML312/CML442
QC, correction, parsing, loading, visualization
Standardization of phenotype traits
Data come from different sources
To ease communication between researchers, phenotype trait descriptions need to be standardized
We use the maize trait ontology to encode traits
We edited Trait Ontology (TO) terms in the maize.obo file using OBO-Edit to make them better suited to the traits in these trials
These TO terms are loaded into Zeabase before we load the data
Maize trait ontology
Zeabase Natural Diversity (ND) Chado Schema
[Schema diagram; tables shown: cv, cvterm, cvterm_relationship, project, project_prop, project_relationship, nd_experiment, nd_exp_prop, nd_location_description, protocol, stock, stock_prop, Stock-relationship (pedigree, plot vs accession), phenotype (value), genotype]
Database Demo http://zeabase.sgn.cornell.edu/
Summary
We developed several QC filters and parsers to process raw and semi-raw data; the processed data can then be loaded into Zeabase efficiently using loaders
Some functions of Zeabase are already available to users
Further direction
Data format issue
- Create a phenotype trait data template, and communicate with data submitters so that they use a consistent data format, which makes data management and loading easier
- Develop ad hoc filters and parsers that fit the incoming new data (WEMA, Stage1&2, synthetics, ...)
Optimization of current functions on Zeabase (loading, search using materialized views, downloading)
Add new functions:
- Training set selection
- Integration of GS methods
Data management for maize genomic selection
Data set
WEMA (4 traits)
Stage1&2 (21 traits)
Synthetics data (30 traits)
Standardization of traits of Synthetics Data set
It is important to use Trait Ontology to standardize trait description
Management of GBS data sets
WEMA: 3402 accessions
Stage1&2: 2334 accessions
Synthetics: 609 accessions
Each accession has 955690 SNP markers
GBS data (Hapmap format)
Loading GBS data into Zeabase
Download GBS data for one accession
Download GBS data for multiple accessions
Genomic selection in Zeabase
GBS data encoding
Allele letter that comes first in alphabetical order: -1; H: 0; other: 1; missing: -9
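The encoding rule can be illustrated with a short sketch. It follows the rule as stated on the slide; the function name, the `alleles` argument in "A/G" form, the reading of `H` as a heterozygous call, and the use of `N` as the missing-data symbol are assumptions for the example.

```python
MISSING_SYMBOLS = {"N"}  # assumed symbol for a missing call


def encode_call(call, alleles):
    """Encode one genotype call as -1, 0, 1 or -9.

    call    -- the call for one accession at one marker, e.g. "A", "G", "H", "N"
    alleles -- the two alleles of the marker, e.g. "A/G"
    """
    first_allele = min(alleles.split("/"))  # letter that comes first alphabetically
    if call in MISSING_SYMBOLS:
        return -9
    if call == "H":  # assumed to denote a heterozygous call
        return 0
    if call == first_allele:
        return -1
    return 1  # the other allele


calls = ["A", "G", "H", "N"]
print([encode_call(c, "A/G") for c in calls])
```

A whole accession would be encoded by applying `encode_call` to each of its marker calls in turn.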
Download function for GBS data
Download GBS data for multiple accessions
Integration of DOE into Zeabase
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)
It is important to supply a layout of DOE to the breeders before they perform breeding trials
Implementation architecture
https://github.com/cran/agricolae (includes many experimental designs for agricultural research)
https://github.com/solgenomics/yapri
[Architecture diagram: Perl, PostgreSQL with Chado schema, dbcon, server pages (Mason, HTML, jQuery, Ajax), web browsers]
Designs: Completely Randomized, Complete Block, Alpha Lattice, Augmented
Trial design in Zeabase
Field layout for Completely Randomized Trial
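As a rough idea of what a completely randomized layout involves, the sketch below assigns every replicate of every accession to a plot at random. It is a standalone Python illustration, not the agricolae-based implementation referenced in these slides; the function name and parameters are hypothetical.

```python
import random


def completely_randomized_layout(accessions, reps, seed=None):
    """Return (plot number, accession) pairs with all replicates shuffled together."""
    rng = random.Random(seed)
    units = [acc for acc in accessions for _ in range(reps)]
    rng.shuffle(units)  # no blocking: any replicate can land on any plot
    return list(enumerate(units, start=1))


for plot_no, acc in completely_randomized_layout(["A", "B", "C"], reps=2, seed=1):
    print(plot_no, acc)
```

Passing a seed makes the layout reproducible, which is useful when the same field plan has to be regenerated later.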
Layout for Complete Block
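In a complete block layout every block contains each accession exactly once, and the order is randomized independently within each block. The sketch below shows that idea in Python; it is an illustration only, not the agricolae-based implementation referenced in these slides, and the function name and parameters are hypothetical.

```python
import random


def complete_block_layout(accessions, n_blocks, seed=None):
    """Return (block, position, accession) triples, randomized within each block."""
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        order = list(accessions)
        rng.shuffle(order)  # independent randomization per block
        for position, acc in enumerate(order, start=1):
            layout.append((block, position, acc))
    return layout


for block, position, acc in complete_block_layout(["A", "B", "C"], n_blocks=2, seed=1):
    print(block, position, acc)
```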
Layout for Alpha Lattice
Integration of Moving Average Design (MAD) in Zeabase
MAD is one of the designs for unreplicated trials
It aims to separate genotype effects from environmental effects
Interface for MAD
Further direction
GBS big data issue:
- Use HDF5 and Hadoop to improve the management of these big data
Optimization of current functions on Zeabase
Add new functions