Genome bioinformatics
Digital Gene: A web server for gene expression measure using the second generation sequence data
Aimin Yan, Cheng Li*
Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Summary: Next-generation sequencing data are replacing microarrays for measuring gene expression. However, online tools for calculating gene expression from next-generation sequencing data are lacking. Here we use aligned SAM files to calculate digital gene expression. The calculation is implemented in Python, and the web interface is provided through Perl and CGI.
Availability: http://155.52.45.149/
Contact: [email protected]
1 INTRODUCTION
Second-generation sequencing technology can be used to obtain reproducible gene expression measurements, and it is replacing microarrays for this purpose. So far, three platforms are in use: the Roche/454 FLX, the Illumina/Solexa Genome Analyzer, and the Applied Biosystems SOLiD System. Among these, the Illumina/Solexa platform is one of the most widely used. Typically, more than 10,000,000 short reads are produced by the Illumina/Solexa platform. The short reads are stored in FASTQ files, which contain the sequence ID, the sequence, and the quality values for each short read. These FASTQ files are subsequently aligned to the reference genome. Several alignment programs are currently available. Maq is one of them; its output is a MAP file. BWA is a faster alignment program; its output is a SAM file. SAM files can be converted into a column-readable format, from which we can identify the number of short reads mapped to each position on the reference genome. By combining these column-readable files with a gene annotation file, we can calculate the mapping count for each gene. After adjusting for gene length, we obtain the digital gene expression.
Because most biologists are not familiar with programming, it is not easy for them to perform these steps manually. It is therefore necessary to develop convenient pipelines that are easily accessible to biologists. With these points in mind, we developed an online tool for calculating gene expression from second-generation sequencing data. The aim of this paper is to provide a convenient platform that helps biologists perform gene expression research, which is important in many cancer diagnostics.
2 IMPLEMENTATION
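The per-position counting step outlined in the Introduction can be illustrated with a minimal sketch. It is not the server's code: the function name `count_positions` and the toy records in `example_sam` are hypothetical, and counting each read at its leftmost mapped position (the POS column) is an assumption. The column layout (FLAG in column 2, RNAME in column 3, POS in column 4, header lines starting with `@`, FLAG bit 0x4 for unmapped reads) follows the SAM format.

```python
from typing import Dict, Iterable


def count_positions(sam_lines: Iterable[str], chrom: str) -> Dict[int, int]:
    """Count reads per 1-based leftmost mapped position on one chromosome."""
    counts: Dict[int, int] = {}
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        flag = int(fields[1])
        rname = fields[2]
        pos = int(fields[3])
        # Skip unmapped reads (FLAG bit 0x4) and reads on other chromosomes.
        if flag & 0x4 or rname != chrom or pos == 0:
            continue
        counts[pos] = counts.get(pos, 0) + 1
    return counts


# Toy input (hypothetical records, not real data).
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tchr1\t250\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read5\t0\tchr2\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
]

chr1_counts = count_positions(example_sam, "chr1")
print(sorted(chr1_counts.items()))
```

The resulting (position, count) pairs correspond to the column-readable format that is later combined with the gene annotation file.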
We implemented the gene expression calculation in Python. Because there are more than 20,000 genes and more than 4,000,000 short reads, an exhaustive search requires intensive computation to calculate the mapping count for each gene. Bloom et al. used stabbing and interval queries to obtain the reads that overlap a specific gene's open reading frame, which improved computational efficiency [1]. However, their method requires a binary search tree (BST) data structure. Here we propose a simple genome index algorithm to calculate the count mapped to each gene. The algorithm works as follows:
1. The chromosome is indexed from 1 to n, where n is the length of the chromosome. An array A of length n is allocated and initialized to zero: A[0]=0, A[1]=0, …, A[n-1]=0.
2. For m short reads, a two-dimensional array B is allocated such that B[0,0]=p0, B[0,1]=count_p0, …, B[m-1,0]=p(m-1), B[m-1,1]=count_p(m-1), where pi is a mapped position and count_pi is the count mapped to pi. Note: pi belongs to (0,n).
3. Let A[B[i,0]]=B[i,1] for i = 0, …, m-1.
4. For each gene spanning positions s to t, where s and t belong to (0,n), the count for the gene is A[s]+…+A[t].
The worst-case running time of this algorithm is O(n+2m+Lk), where L is the length of the longest gene and k is the number of genes.
We downloaded a publicly available data set from the transcriptome genetics study of Montgomery et al. [2]. This data set includes sequences of the mRNA fraction of the transcriptome in 60 extended HapMap individuals of European descent. We calculated the mapping count for each position across the genome for these 60 samples, and then converted these position-specific counts into gene-specific mapping counts based on the gene annotation file. Afterwards, we normalized the mapping counts by the length of each gene.
Further downstream statistical analysis can be performed on the output file from this server. For example, we used principal component analysis to cluster these 60 samples. We calculated the cumulative proportion of variance explained by the first two principal components. The first two principal components explain more than 80% of the total variance, so we used them to cluster the subjects. Figure 1 shows the clustering results. Subjects with different gene expression patterns can clearly be identified using the RNA-seq-based gene expression profile. Besides cluster analysis, the output file from our server can also be used for differential gene expression analysis and other statistical analyses.

* To whom correspondence should be addressed: Cheng Li
© Oxford University Press 2010
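The four steps of the genome index algorithm in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the server's implementation: the names `build_index`, `gene_counts`, and `length_normalized` are hypothetical, the array is given length n+1 so that 1-based positions index it directly, gene ends are treated as inclusive, and the normalization is taken to be the plain count divided by gene length because the text does not specify any further scaling.

```python
from typing import Dict, List, Sequence, Tuple


def build_index(n: int, position_counts: Sequence[Tuple[int, int]]) -> List[int]:
    """Steps 1-3: allocate a zeroed array and store the count at each position."""
    a = [0] * (n + 1)  # assumption: slot 0 is unused so positions are 1-based
    for pos, count in position_counts:  # each pair plays the role of one row of B
        a[pos] = count
    return a


def gene_counts(a: List[int], genes: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Step 4: sum A[s] + ... + A[t] for each gene (s and t inclusive)."""
    return {name: sum(a[s:t + 1]) for name, (s, t) in genes.items()}


def length_normalized(counts: Dict[str, int],
                      genes: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Divide each gene's count by its length (assumed normalization)."""
    return {name: counts[name] / (t - s + 1) for name, (s, t) in genes.items()}


# Toy data (hypothetical): a chromosome of length 20 with four covered positions.
n = 20
b = [(3, 2), (5, 1), (12, 4), (18, 3)]          # (position, count) pairs
genes = {"geneA": (2, 6), "geneB": (10, 13)}    # gene name -> (start, end)

a = build_index(n, b)
counts = gene_counts(a, genes)
expression = length_normalized(counts, genes)
print(counts)
print(expression)
```

Summing a slice per gene matches the Lk term in the stated running time. When many genes are queried, a prefix-sum array is a common alternative that answers each range sum in constant time; that would be a substituted technique, not the method described here.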
Figure 1. Clustering of the 60 individuals by the first two principal components of the RNA-seq-based gene expression profile.
To make our Digital Gene program widely available to the biological community, we implemented a web server interface. The purpose of this server is to provide a platform for preprocessing RNA-seq data before users perform further statistical analysis. The server runs on a Windows XP Professional x64 workstation with 24 GB of memory. The web server is installed using XAMPP. The web interface is built with a CGI script written in HTML and Perl.
ACKNOWLEDGEMENTS
Funding: It is a pleasure to acknowledge the financial support provided by NIH grants R01 GM077122
Reference List
1. Bloom JS, Khan Z, Kruglyak L, Singh M, Caudy AA: Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics 2009, 10:221.
2. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 2010, 464:773-777.