Galaxy High Throughput Genotyping Pipeline for GeneTitan

Oleksiy Karpenko, MS1, Neil Bahroos, MS1, Morris Chukhman, MS1, Xiao Dong, PhD1, Pinal Kanabar, MS1, Zarema Arbieva, PhD1, Tommie Jackson, MS1, William Hendrickson, PhD1

1University of Illinois at Chicago, Chicago, IL
Abstract

The latest genotyping solutions allow rapid testing of more than two million markers in a single experiment. Fully automated instruments such as the Affymetrix GeneTitan enable processing of large numbers of samples in a truly high-throughput manner. In concert with solutions like Axiom, fully customizable array plates can now be run through automated workflows that leverage multi-channel instrumentation like the GeneTitan. With the growing size of the raw data output, the serial computational architecture of the software typically distributed by vendors as turnkey desktop solutions for quality control and genotype calling becomes a legacy constraint rather than an advantage. More advanced software tools provide power and flexibility and can be deployed in an HPC environment, but are technically inconvenient for biologists to use. Here we present a pipeline that uses Galaxy to lower the barrier to complex analysis and to increase efficiency by leveraging high-throughput computing.

Introduction

The next generation of the Affymetrix genotyping platform provides automated high-throughput solutions for population-specific or custom genome-wide genotyping, but the large volume of high-throughput data does not pair well with Affymetrix Genotyping Console (AGC), the recommended software suite. The Axiom PanAFR array set, for instance, comprises three microarrays with ~700K SNP probes per plate, totaling ~2.1M probes for the array set. AGC can process only one plate at a time, requiring three separate processing runs of roughly 5 hours each, with manual user input between runs. While Affymetrix Power Tools (APT) provide a command-line interface to the same algorithms and can be used to automate most of the process, the required level of scripting skill makes them impractical for bench biologists. Galaxy (http://galaxyproject.org/) is a high-performance-computing (HPC) oriented, extensible, web-based bioinformatics workflow platform.
It features a user-friendly graphical interface to tools running on a computer cluster or cloud, along with a comprehensive suite of built-in tools and data libraries for next-generation sequencing, genotyping, and microarray analyses. An instance of Galaxy can be readily extended with a custom tool set such as APT. The increasingly broad adoption of Galaxy by the bioinformatics community makes it an obvious bridge across the gap between bench biologists and powerful tools for high-throughput genotype calling. Here we present our experience building, deploying, and running an APT-based Galaxy workflow for the GeneTitan genotyping pipeline.

Method

The GeneTitan data analysis pipeline was implemented as a Galaxy workflow incorporating four tools: wrappers for the APT quality-control and genotype-calling tools apt-geno-qc and apt-probeset-genotype, a tool for SNP metric calculations, and a tool to convert the output into Plink format for downstream genomic analyses. We registered a custom file format for a batch of 96 ".CEL" files from a single Axiom plate, as well as a few auxiliary file formats specific to APT tools, to facilitate file selection in Galaxy. The workflow starts by receiving its input of ".CEL" and ".ARR" files from the instrumentation computer. It proceeds by extracting the ".CEL" files and executing the quality-control tool with a user-specified Dish-QC threshold (0.82 by default). The names of the samples that pass QC are passed on to the genotyping tool, along with the ".CEL" dataset, for the first round of genotyping. The output of this first round contains, among other metrics, the call rate for each sample. The samples with call rates above a user-specified threshold (97% by default), along with the ".CEL" dataset again, are the input for the second round of genotyping.
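The two filtering stages above can be sketched as follows. This is an illustrative sketch only: the thresholds mirror the defaults stated in the text (Dish-QC 0.82, call rate 97%), but the data structures, helper names, and sample scores are assumptions, not APT's actual interfaces or output formats.

```python
# Hypothetical sketch of the two-stage sample filtering described above.
# Helper names and the per-sample score dictionaries are illustrative.

DISH_QC_THRESHOLD = 0.82     # default Dish-QC cutoff from the pipeline
CALL_RATE_THRESHOLD = 97.0   # default call-rate cutoff, in percent

def passing_qc(dish_qc_scores, threshold=DISH_QC_THRESHOLD):
    """Return sample names whose Dish-QC score meets the threshold."""
    return [s for s, dqc in dish_qc_scores.items() if dqc >= threshold]

def passing_call_rate(call_rates, threshold=CALL_RATE_THRESHOLD):
    """Return sample names whose first-round call rate meets the threshold."""
    return [s for s, cr in call_rates.items() if cr >= threshold]

# Made-up example: three samples from one plate.
dish_qc = {"S01.CEL": 0.91, "S02.CEL": 0.75, "S03.CEL": 0.88}
round1 = passing_qc(dish_qc)             # S02 fails Dish-QC
call_rates = {"S01.CEL": 98.4, "S03.CEL": 96.1}
round2 = passing_call_rate(call_rates)   # S03 fails the call-rate cutoff
print(round2)                            # ['S01.CEL']
```

Only the samples surviving both cuts are fed, together with the ".CEL" dataset, into the second genotyping round.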
The final genotype calling report is then annotated with the phenotype data and converted into Plink format, and simultaneously processed by the SNP metrics tool to calculate statistics such as call rate, FLD, HomFLD, and minor allele count.

Results and Conclusion

Apart from the SNP metrics and Plink formatting steps, the pipeline is strictly sequential. Nevertheless, the Galaxy implementation allows the three sets of 96 samples to be processed concurrently on three cluster nodes, providing a threefold speedup. The PanAFR data were processed in Galaxy in 5 hours, whereas AGC required over 15 hours on the same hardware and data, with additional manual steps. The overhead of unpacking the dataset is offset by the added parallelism, and the Galaxy developers are actively working on making multiple input files persistent downstream.
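To illustrate two of the per-SNP statistics named above (call rate and minor allele count), the following self-contained sketch computes them from a toy genotype table. The 0/1/2 allele-count coding, the None-for-no-call convention, and the example values are assumptions for the sketch, not the SNP metrics tool's actual input format.

```python
# Illustrative per-SNP metrics from a small genotype vector.
# Genotypes are coded as alternate-allele counts 0/1/2; None = no-call.

def snp_call_rate(genotypes):
    """Fraction of samples with a called genotype for one SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)

def minor_allele_count(genotypes):
    """Count of the rarer allele among called genotypes (0/1/2 coding)."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)              # copies of the alternate allele
    ref = 2 * len(called) - alt    # copies of the reference allele
    return min(ref, alt)

snp = [0, 1, 2, None, 1, 0]        # six samples, one no-call
print(snp_call_rate(snp))          # 5/6, about 0.833
print(minor_allele_count(snp))     # alt = 4, ref = 6, so 4
```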