Web-based pipeline for sequencing projects


ABSTRACT
Summary: Data manipulation in sequencing projects follows a common succession of steps: submission of the chromatogram data generated by the sequencing equipment, base-calling, vector screening, read storage, assembly and pre-annotation. Each of these steps is performed by a different tool, run manually through shell commands or by scripts that combine some of the steps. Here we present a customizable, totally web-based pipeline which can be used by many different sequencing projects without the need to re-implement an interface for data processing and storage.
Availability: http://compbio.epm.br. Open source.
Contact: [email protected]

INTRODUCTION
The entire computational process in a sequencing project is repetitive and is performed by different people with different backgrounds. Access control and user tracking are essential when the project involves different sequencing centers and many users. A standardized nomenclature for processed plates, recording the lab that originated each plate and possibly the tissue or library used, is also essential.

A common design pattern for systems that support sequencing projects is easily perceived. It involves a data submission form, through which chromatograms generated by sequencing equipment are uploaded from the workstation attached to the sequencer to a central system where the data is processed and stored. Following the submission, base-calling tools are automatically executed to render human-readable nucleotide sequences, called reads. These sequences and the corresponding quality values are stored in a relational database. Reads contain low-quality stretches and vector sequences, which are masked out after base-calling. Possible contaminants are also screened. After being processed and stored, read sequences are matched against user-defined databases of known sequences. Using the tool's web-based user interface, the assembly of the reads may be scheduled.
The contigs generated in the assembly are also stored in the database and immediately compared to the database of related, known sequences.

Users and their roles must be specified and controlled. All user interfaces used for sequence submission, scheduled process triggering, result viewing and administrative tasks must be accessible from any computer in the network, independently of the operating system and hardware platform of the workstation. A totally web-based system satisfies this requirement.

DESCRIPTION
The tool was developed using PHP for the web-based user interface and Perl scripts for internal data processing; it runs on Apache web servers and uses the MySQL database management system for data storage. All project-specific information is configured through a web-based configuration form, where paths for the tools used by the pipeline, database connection information, and project particularities such as vectors, contaminants and known-sequence databases for the pre-annotation process are specified. After the initial configuration, the project manager must define users and their roles in the system. Three basic roles are defined, with overlapping functionalities. The most restricted role is the 'viewer', which can only see the current results of the project. The 'submitter' can submit chromatogram data to the system, and the 'manager' is allowed to define users, libraries and sequencing labs, initiate the assembly process and manage submitted data.

Pipeline
Chromatogram data is submitted to the system through a web-based form which accepts a compressed file originating from the sequencer. The form allows the user to specify the library of the plate and the lab where it was processed. The system automatically generates a name for referencing the plate and its slots. The compressed file is stored in a specific folder of the system to allow backups of unprocessed data; it is then uncompressed to a working folder, where the base-calling program Phred is executed. The output of this step, the read sequences and their quality values, is stored in the database.
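
The submission step described above (name the plate, keep the uploaded archive for backup, unpack it into a working folder) can be sketched in Python. This is a minimal illustration under stated assumptions, not the tool's own code, which is written in PHP and Perl. The ZIP format, the folder layout, the function names plate_name and stage_submission, and the lab/library/serial naming scheme are all hypothetical. Base-calling with Phred is deliberately left out; the function only returns the files a base-caller would read next.

```python
from __future__ import annotations

import shutil
import zipfile
from pathlib import Path


def plate_name(lab: str, library: str, serial: int) -> str:
    """Hypothetical naming scheme: lab code, library code, running number."""
    return f"{lab}-{library}-{serial:04d}"


def stage_submission(
    archive: Path, backup_dir: Path, work_root: Path, plate: str
) -> list[Path]:
    """Keep the untouched upload for backups, then unpack it for base-calling."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    work_dir = work_root / plate
    work_dir.mkdir(parents=True, exist_ok=True)

    # 1. Store the compressed file exactly as it was uploaded.
    shutil.copy2(archive, backup_dir / f"{plate}{archive.suffix}")

    # 2. Unpack into a per-plate working folder (ZIP is an assumption).
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(work_dir)

    # 3. Return the unpacked files, sorted so that slots come in a stable order.
    return sorted(p for p in work_dir.rglob("*") if p.is_file())
```

With these assumptions, plate_name("LAB1", "LIBA", 7) yields "LAB1-LIBA-0007". Keeping the backup copy separate from the working folder means a failed base-calling run can be repeated from the original upload.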
Vector screening is performed on these reads using Cross_match. Sequences are masked and stored in the database. After masking, short and low-quality sequences are tagged. The remaining reads are compared using BLAST (Altschul et al., 1997) to a user-defined contaminant sequence database, and relevant matches are tagged as spurious sequences. Untagged read sequences are pre-annotated using BLAST against a user-defined set of known sequences, and the matching results are stored in the database.

The project manager can initiate the assembly process using the web-based interface. Once triggered, the assembly is performed in the background by Phrap. When the assembly process has ended, the resulting contigs are saved and the system automatically performs a pre-annotation of these contigs, storing the BLAST results in the database. All results and project summary information, such as the total number of processed plates and reads, the number of reads discarded due to low quality or contamination, the number of contigs assembled, and matches to known sequences, can be accessed through the web-based interface by registered users.

REFERENCES
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
