Mar 11, 2011
Intro to BioPerl

Install BioPerl

http://www.bioperl.org/    #bioperl website
#click on HOWTO in the Documentation box for some code to help you get started

    cd ~/Downloads
    wget http://bioperl.org/DIST/BioPerl-1.6.1.tar.gz
    tar -xzvf BioPerl-1.6.1.tar.gz
    cd BioPerl-1.6.1
    sudo ./Build.PL         #type your password when prompted
                            #hit enter when prompted to accept the defaults [Y/n]
    sudo ./Build install

Sequence object ($seq_obj) attributes

    attribute   how to get it
    desc        $seq_obj->desc
    seq         $seq_obj->seq
    length      $seq_obj->length

LOCUS       NP_001123420             440 aa            linear   MAM 11-MAR-2011
DEFINITION  L-gulonolactone oxidase [Sus scrofa].
ACCESSION   NP_001123420
VERSION     NP_001123420.1  GI:194018724
DBSOURCE    REFSEQ: accession NM_001129948.1
KEYWORDS    .
SOURCE      Sus scrofa (pig)
  ORGANISM  Sus scrofa
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Suina; Suidae;
            Sus.
REFERENCE   1  (residues 1 to 440)
  AUTHORS   Hasan,L., Vogeli,P., Stoll,P., Kramer,S.S., Stranzinger,G. and
            Neuenschwander,S.
  TITLE     Intragenic deletion in the gene encoding L-gulonolactone oxidase
            causes vitamin C deficiency in pigs
  JOURNAL   Mamm. Genome 15 (4), 323-333 (2004)
   PUBMED   15112110
FEATURES             Location/Qualifiers
     Protein         1..440
                     /product="L-gulonolactone oxidase"
                     /EC_number="1.1.3.8"
                     /note="L-gulono-gamma-lactone oxidase; GLO; LGO"
                     /calculated_mol_wt=50221
     CDS             1..440
                     /gene="GULO"
                     /coded_by="NM_001129948.1:228..1550"
                     /db_xref="GeneID:396759"
ORIGIN
        1 mvhghkgvkf qnwaktygcc pemyyqptsv eeirevlala rqqnkrvkvv ggghspsdia
       61 ctdgfmihmg kmnrvlkvdm ekkqvtveag illadlhpql dkhglalsnl gavsdvtagg
      121 vigsgthntg ikhgilatqv veltlltpdg tvlvcsessn aevfqaarvh lgclgviltv
      181 tlqcvpqfhl qettfpstlk evldnldshl kkseyfrflw fphsenvsvi yqdhtnkpps
      241 ssanwfwdya igfyllefll wistfvpglv gwinrfffwl lfngkkencn lshkiftyec
      301 rfkqhvqdwa iprektkeal lelkamleah pkvvahypve vrftraddil lspcfqrdsc
      361 ymniimyrpy gkdvprldyw layetimkkv ggrphwakah nctrkdfekm ypafrkfcai
      421 rekldptgmf lnaylekvfy
//

Use BioPerl to retrieve sequences from a list of GenBank accessions

Pseudocode:
1. Put your GenBank accessions into an array ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215")
2. Loop through the array values (@accessions)
3. Connect to GenBank and get the accession record (sequence object)
4. Print out the record description, sequence and length

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;                 #module for connecting to GenBank database

    my $db_obj = Bio::DB::GenBank->new;   #create your new database connection object
    my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");   #the accessions listed above

    foreach my $acc (@accessions) {
        my $seq_obj = $db_obj->get_Seq_by_acc($acc);   #connect to GenBank & get accession info
        print $seq_obj->desc . "\n";      #print out the accession description
        print $seq_obj->seq . "\n";       #print out the accession sequence
        print $seq_obj->length . "\n";    #print out the sequence length
    }

Script 1

Use the Data::Dumper module to see what is in an object

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;                 #module for connecting to GenBank database
    use Data::Dumper;                     ####module to print variable data

    my $db_obj = Bio::DB::GenBank->new;   #create your new database connection object
    my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");   #the accessions listed above

    foreach my $acc (@accessions) {
        my $seq_obj = $db_obj->get_Seq_by_acc($acc);   #connect to GenBank & get accession info
        print Dumper($seq_obj); die;      ####print object info and die (stops after the first record)
        print $seq_obj->desc . "\n";      #print out the accession description
        print $seq_obj->seq . "\n";       #print out the accession sequence
        print $seq_obj->length . "\n";    #print out the sequence length
    }

Script 1
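To get a feel for Dumper output without connecting to GenBank, here is a minimal sketch that dumps a small hash reference. The $record hash is a hypothetical stand-in, not the real layout of a BioPerl sequence object, which holds many more fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # print hash keys in sorted order so the output is repeatable

# Hypothetical stand-in for a sequence record (values taken from the
# GenBank record shown earlier); a real BioPerl object has more fields.
my $record = {
    accession => "NP_001123420",
    desc      => "L-gulonolactone oxidase [Sus scrofa]",
    length    => 440,
};

my $dump = Dumper($record);     # Dumper returns the whole structure as a string
print $dump;
```

Each key of the hash appears next to its value, which is the same way the internal fields of an object show up when you dump it.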

Print out your accession records to a file in GenBank format

Pseudocode:
1. Put your GenBank accessions into an array ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215")
2. Use BioPerl to open a file for writing your records in genbank format
3. Loop through the array values (@accessions)
4. Connect to GenBank and get the accession record (sequence object)
5. Write the GenBank records to the outfile

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;                 #module for connecting to GenBank database
    use Bio::SeqIO;                       #module for sequence Input/Output writing

    my $db_obj = Bio::DB::GenBank->new;   #create your new database connection object
    my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");   #the accessions listed above

    my $outfile_obj = Bio::SeqIO->new( -file   => '>gulo.gb',     #open file for writing data
                                       -format => 'genbank' );    #in genbank format; fasta format is also available

    foreach my $acc (@accessions) {
        my $seq_obj = $db_obj->get_Seq_by_acc($acc);   #get accession sequence object from GenBank
        $outfile_obj->write_seq($seq_obj);             #write the GenBank record to your file
    }

Script 2

Handling Errors when Record is not found by BioPerl

Pseudocode:
1. Put your GenBank accessions into an array
2. Loop through the array values (@accessions)
3. Connect to GenBank and get the accession record (object)
4. Print out sequence only if record is found

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;                 #module for connecting to GenBank database

    my $db_obj = Bio::DB::GenBank->new;   #create your new database connection object
    my @accessions = ("NP_ABC123", "NP_848862");   #array of accessions; the first one does not exist

    foreach my $acc (@accessions) {
        my $seq_obj;
        eval {
            $seq_obj = $db_obj->get_Seq_by_acc($acc);   #connect to and get accession info
        };                                #a semicolon is required after the eval block
        if ($@) {                         #$@ holds the error message if the eval block failed
            print "$acc not found.\n";
        }
        else {
            print $seq_obj->seq . "\n";   #print out the accession sequence
        }
    }

Script 3
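The eval and $@ pattern from Script 3 can be tried without a network connection. This is a minimal sketch: %fake_db, fetch_record and safe_fetch are hypothetical names made up for this example, and the sequence "MAAA" is an invented placeholder, not real GenBank data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for GenBank: one known accession with a made-up sequence.
my %fake_db = ( "NP_848862" => "MAAA" );

# Hypothetical lookup that dies on an unknown accession, the way a failed
# database call raises an error.
sub fetch_record {
    my ($acc) = @_;
    die "accession $acc does not exist\n" unless exists $fake_db{$acc};
    return $fake_db{$acc};
}

# Returns the sequence, or undef if the lookup died inside the eval block.
sub safe_fetch {
    my ($acc) = @_;
    my $seq = eval { fetch_record($acc) };   # eval traps the die
    if ($@) {                                # $@ is non-empty only when eval failed
        print "$acc not found.\n";
        return;
    }
    return $seq;
}

foreach my $acc ("NP_ABC123", "NP_848862") {
    my $seq = safe_fetch($acc);
    print "$seq\n" if defined $seq;
}
```

The loop keeps going after the bad accession because the error is caught inside eval instead of stopping the script.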

Writing your own subroutines (functions)

- subroutines are for creating your own functions, similar to length( ) and substr( )
- they keep your script concise and organized
- they save you from repeating blocks of code across multiple places in your script (code reuse)

    #!/usr/bin/perl -w
    use strict;

    my $first_number = 5;
    my $second_number = 8;
    my $total = get_total($first_number, $second_number);
    print "The total of $first_number and $second_number is $total\n";

    sub get_total {
        my ($value1, $value2) = @_;   # @_ is a Perl special variable that holds the arguments passed in
        my $value3 = $value1 + $value2;
        return $value3;
    }

Script 4
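Because @_ holds every argument, a subroutine can also accept any number of values. The sketch below uses a hypothetical name, get_total_of_list, to keep it apart from get_total in Script 4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of get_total: adds up however many numbers are passed in.
sub get_total_of_list {
    my @values = @_;          # copy all arguments out of @_
    my $total  = 0;
    foreach my $value (@values) {
        $total += $value;
    }
    return $total;            # 0 when called with no arguments
}

my @numbers = (5, 8, 13);
my $total   = get_total_of_list(@numbers);
print "The total of @numbers is $total\n";
```

Passing an array flattens it into the argument list, so the subroutine sees the three numbers as three separate arguments.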

Programming Assignment Part 1

Write a script that uses BioPerl and eval to get sequences from NCBI and print them to a file in FASTA format. This script will be used for next week's Multiple Sequence Alignment class.

Pseudocode:
1. Create a file that has the GenBank protein accessions from script 1
2. Read in (shift) the file that contains GenBank protein accessions in a single column
3. Put accessions into an array
4. Loop through the array values (@accessions)
5. Connect to GenBank and get the accession record (object)
6. Print sequence in FASTA format to a file only if record is found
# you are basically adding code from script 2 to script 3 for this assignment

Diagram: accession file (NP_848862, NP_071556) -> Perl Script -> FASTA file (>L-gulono.. MVHGYKG VKFQNWA)

Programming Assignment Part 2

Write a script to translate the mystery_ccds.fa sequence into a protein sequence in all 6 reading frames.

Create two subroutines:
- one to do the actual translation part (use the substr function here)
- one to get the reverse complement of a sequence (use the tr and reverse functions)

Use the %translation hash provided in the file dna2rna.pl.
Print out results in FASTA format and include in the header the frame that was used: +1, +2, +3, -1, -2, -3.
Identify which frame is the proper translation.
Submit your BioPerl and translation scripts to me by next week.

    wget http://140.226.65.107/mystery_ccds.fa
    wget http://140.226.65.107/dna2rna.pl
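For the FASTA output step only, here is one possible sketch of a helper that builds a record with the frame in the header. The name format_fasta, the "frame=" header layout and the line width are assumptions made for this example, not a required format; the sequence is the first 20 residues of the GenBank record shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one FASTA record as a string.
# $width is the number of residues per line (defaults to 60).
sub format_fasta {
    my ($id, $frame, $seq, $width) = @_;
    $width ||= 60;
    my $out = ">$id frame=$frame\n";                  # header line starts with >
    for (my $i = 0; $i < length($seq); $i += $width) {
        $out .= substr($seq, $i, $width) . "\n";      # substr returns a shorter piece at the end
    }
    return $out;
}

print format_fasta("example_seq", "+1", "MVHGHKGVKFQNWAKTYGCC", 10);
```

Keeping the formatting in its own subroutine means the same call can be reused for each of the six frames.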

Perl has an operator called tr (transliterate) which replaces every matching character in a string. It's basically a search and replace for single characters in a string.

    #!/usr/bin/perl -w
    use strict;

    my $sequence = "CAT";
    $sequence =~ tr/GATC/CTAG/;   #use the same number of characters in each set between the forward slashes;
                                  #each character in the first set is replaced by the character at the same
                                  #position in the second set; replacement is global by default
    print "$sequence\n";          #prints GTA
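Two more things about tr are worth knowing: it returns the number of characters it matched, and it changes the string in place, so copy the string first if you still need the original. A minimal sketch, with variable names chosen only for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = "GATTACA";

# With an empty second set, tr leaves the string unchanged and just
# returns how many characters matched.
my $count_a = ($sequence =~ tr/A//);

# Copy the string first, then transliterate the copy, so $sequence is kept.
(my $complement = $sequence) =~ tr/GATC/CTAG/;

print "$sequence has $count_a A bases\n";
print "complement: $complement\n";
```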

Perl has a function called reverse( ) which allows you to reverse a string.

    #!/usr/bin/perl -w
    use strict;

    my $sequence = "GTA";
    my $reversed_sequence = reverse($sequence);
    print "$reversed_sequence\n";   #prints ATG
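One thing to watch for: reverse only reverses the characters of a string in scalar context, such as an assignment to a scalar variable. In list context it reverses the order of the list elements instead. A short sketch of the difference, with variable names chosen only for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = "GTA";

my $reversed = reverse($sequence);             # scalar context: characters are reversed
my $in_list  = join("", reverse($sequence));   # list context: a one-item list is "reversed", string unchanged
my $forced   = scalar reverse($sequence);      # scalar forces string reversal anywhere

print "$reversed $in_list $forced\n";          # prints ATG GTA ATG
```

This is why print reverse($sequence) gives back the original string: the arguments to print are in list context.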