Mar 11, 2011
Intro to BioPerl
Install BioPerl
http://www.bioperl.org/    #bioperl website
                           #click on HOWTO in the Documentation box for some code to help you get started
cd ~/Downloads
wget http://bioperl.org/DIST/BioPerl-1.6.1.tar.gz
tar xzvf BioPerl-1.6.1.tar.gz
cd BioPerl-1.6.1
sudo ./Build.PL            #type your password when prompted
                           #hit enter when prompted to accept the defaults [Y/n]
sudo ./Build install
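After the install finishes, it can help to confirm that Perl can actually find the BioPerl modules before running the scripts below. The following is a minimal sketch, not part of the original slides: module_available is a hypothetical helper name, and the check simply tries to load each module inside eval.

```perl
#!/usr/bin/perl -w
use strict;

# Return 1 if the named module can be loaded, 0 otherwise (hypothetical helper)
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    #Bio::SeqIO becomes Bio/SeqIO.pm
    return eval { require $file; 1 } ? 1 : 0;  #eval traps the error if the module is missing
}

foreach my $module ("Bio::SeqIO", "Bio::DB::GenBank") {
    my $status = module_available($module) ? "installed" : "NOT found";
    print "$module: $status\n";
}
```

If either module is reported as NOT found, the install step probably did not complete.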
object attributes
object    $seq_obj
desc      $seq_obj->desc
seq       $seq_obj->seq
length    $seq_obj->length
LOCUS       NP_001123420             440 aa            linear   MAM 11-MAR-2011
DEFINITION  L-gulonolactone oxidase [Sus scrofa].
ACCESSION   NP_001123420
VERSION     NP_001123420.1  GI:194018724
DBSOURCE    REFSEQ: accession NM_001129948.1
KEYWORDS    .
SOURCE      Sus scrofa (pig)
  ORGANISM  Sus scrofa
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Suina; Suidae;
            Sus.
REFERENCE   1  (residues 1 to 440)
  AUTHORS   Hasan,L., Vogeli,P., Stoll,P., Kramer,S.S., Stranzinger,G. and
            Neuenschwander,S.
  TITLE     Intragenic deletion in the gene encoding L-gulonolactone oxidase
            causes vitamin C deficiency in pigs
  JOURNAL   Mamm. Genome 15 (4), 323-333 (2004)
   PUBMED   15112110
FEATURES             Location/Qualifiers
     Protein         1..440
                     /product="L-gulonolactone oxidase"
                     /EC_number="1.1.3.8"
                     /note="L-gulono-gamma-lactone oxidase; GLO; LGO"
                     /calculated_mol_wt=50221
     CDS             1..440
                     /gene="GULO"
                     /coded_by="NM_001129948.1:228..1550"
                     /db_xref="GeneID:396759"
ORIGIN
        1 mvhghkgvkf qnwaktygcc pemyyqptsv eeirevlala rqqnkrvkvv ggghspsdia
       61 ctdgfmihmg kmnrvlkvdm ekkqvtveag illadlhpql dkhglalsnl gavsdvtagg
      121 vigsgthntg ikhgilatqv veltlltpdg tvlvcsessn aevfqaarvh lgclgviltv
      181 tlqcvpqfhl qettfpstlk evldnldshl kkseyfrflw fphsenvsvi yqdhtnkpps
      241 ssanwfwdya igfyllefll wistfvpglv gwinrfffwl lfngkkencn lshkiftyec
      301 rfkqhvqdwa iprektkeal lelkamleah pkvvahypve vrftraddil lspcfqrdsc
      361 ymniimyrpy gkdvprldyw layetimkkv ggrphwakah nctrkdfekm ypafrkfcai
      421 rekldptgmf lnaylekvfy
//
Use BioPerl to retrieve sequences from a list of GenBank accessions
Pseudocode:
1. Put your GenBank accessions into an array ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215")
2. Loop through the array values (@accessions)
3. Connect to GenBank and get the accession record (sequence object)
4. Print out the record description, sequence and length
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;                  #module for connecting to GenBank database

my $db_obj = Bio::DB::GenBank->new;    #create your new database connection object
my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");    #the accessions from the pseudocode

foreach my $acc (@accessions) {
    my $seq_obj = $db_obj->get_Seq_by_acc($acc);    #connect to GenBank & get accession info
    print $seq_obj->desc . "\n";                    #print out the accession description
    print $seq_obj->seq . "\n";                     #print out the accession sequence
    print $seq_obj->length . "\n";                  #print out the sequence length
}
Script 1
Use the Data::Dumper module to see what is in an object
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;                  #module for connecting to GenBank database
use Data::Dumper;                      ####module to print variable data

my $db_obj = Bio::DB::GenBank->new;    #create your new database connection object
my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");    #the accessions from the pseudocode

foreach my $acc (@accessions) {
    my $seq_obj = $db_obj->get_Seq_by_acc($acc);    #connect to GenBank & get accession info
    print Dumper($seq_obj);                         ####print object info and die
    die;                                            ####stops the script after the first record
    print $seq_obj->desc . "\n";                    #print out the accession description
    print $seq_obj->seq . "\n";                     #print out the accession sequence
    print $seq_obj->length . "\n";                  #print out the sequence length
}
Script 1
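Data::Dumper can be tried without a network connection. The sketch below dumps a made-up object instead of a real GenBank record; the Toy::Seq class name and its three keys are assumptions for illustration only, and a real BioPerl sequence object has a larger and different internal layout.

```perl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    #print hash keys in sorted order so the output is stable

# A hypothetical stand-in for a sequence object: a blessed hash reference
my $toy_obj = bless {
    desc   => "toy record",
    seq    => "MVHG",
    length => 4,
}, "Toy::Seq";

my $dump = Dumper($toy_obj);    #Dumper returns the object contents as a string
print $dump;
```

The output shows the class name the object is blessed into along with every key and value stored inside it, which is the same kind of view Dumper gives for $seq_obj.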
Print out your accession records to a file in GenBank format
Pseudocode:
1. Put your GenBank accessions into an array ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215")
2. Use BioPerl to open a file for writing your records in genbank format
3. Loop through the array values (@accessions)
4. Connect to GenBank and get the accession record (sequence object)
5. Write the GenBank records to the outfile
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;                  #module for connecting to GenBank database
use Bio::SeqIO;                        #module for sequence Input/Output writing

my $db_obj = Bio::DB::GenBank->new;    #create your new database connection object
my @accessions = ("NP_848862", "NP_071556", "NP_001123420", "NP_001029215");    #the accessions from the pseudocode

my $outfile_obj = Bio::SeqIO->new(
    -file   => '>gulo.gb',             #open file for writing data
    -format => 'genbank'               #in genbank format; fasta format is also available
);

foreach my $acc (@accessions) {
    my $seq_obj = $db_obj->get_Seq_by_acc($acc);    #get accession sequence object from GenBank
    $outfile_obj->write_seq($seq_obj);              #write the GenBank record to your file
}
Script 2
Handling Errors when Record is not found by BioPerl
Pseudocode:
1. Put your GenBank accessions into an array
2. Loop through the array values (@accessions)
3. Connect to GenBank and get the accession record (object)
4. Print out sequence only if record is found
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;                  #module for connecting to GenBank database

my $db_obj = Bio::DB::GenBank->new;    #create your new database connection object
my @accessions = ("NP_ABC123", "NP_848862");    #array of accessions; the first one does not exist

foreach my $acc (@accessions) {
    my $seq_obj;
    eval {
        $seq_obj = $db_obj->get_Seq_by_acc($acc);    #connect to and get accession info
    };                                               #a semicolon is required after the eval block
    if ($@) {                                        #$@ holds the error message caught by eval, if any
        print "$acc not found.\n";
    }
    else {
        print $seq_obj->seq . "\n";                  #print out the accession sequence
    }
}
Script 3
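The eval pattern in Script 3 can be practiced offline. This is a minimal sketch under stated assumptions: %fake_db, fetch_seq and safe_fetch are made-up names standing in for the GenBank lookup, and fetch_seq simply calls die for an unknown accession so that eval has an error to catch.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical stand-in for a database: one known accession
my %fake_db = ("NP_848862" => "MVHG");

# Dies when the accession is unknown, like a failed lookup would
sub fetch_seq {
    my ($acc) = @_;
    die "no record for $acc\n" unless exists $fake_db{$acc};
    return $fake_db{$acc};
}

# Return the sequence, or undef if the lookup threw an error
sub safe_fetch {
    my ($acc) = @_;
    my $seq = eval { fetch_seq($acc) };    #eval traps the die
    if ($@) {                              #$@ holds the trapped error message
        warn "$acc not found: $@";
        return;
    }
    return $seq;
}

foreach my $acc ("NP_ABC123", "NP_848862") {
    my $seq = safe_fetch($acc);
    print "$acc\t$seq\n" if defined $seq;
}
```

Wrapping the eval in its own subroutine keeps the loop short; the loop only has to check whether it got a sequence back.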
Writing your own subroutines (functions)
Subroutines are for creating your own functions, similar to length( ) and substr( )
They keep your script concise and organized
They save you from repeating blocks of code across multiple places in your script (code reuse)
#!/usr/bin/perl -w
use strict;

my $first_number = 5;
my $second_number = 8;
my $total = get_total($first_number, $second_number);
print "The total of $first_number and $second_number is $total\n";

sub get_total {
    my ($value1, $value2) = @_;    # @_ is the special Perl array that holds the arguments passed to the subroutine
    my $value3 = $value1 + $value2;
    return $value3;
}
Script 4
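Because @_ is an ordinary array, a subroutine can accept any number of arguments and can return more than one value. The sketch below extends the idea in Script 4; get_sum and get_sum_and_average are illustrative names that do not appear in the original slides.

```perl
#!/usr/bin/perl -w
use strict;

# Add up any number of values passed in @_
sub get_sum {
    my @values = @_;
    my $sum = 0;
    foreach my $value (@values) {
        $sum += $value;
    }
    return $sum;
}

# Return two values at once: the sum and the average
sub get_sum_and_average {
    my @values = @_;
    my $sum = get_sum(@values);
    my $average = @values ? $sum / @values : 0;    #@values in numeric context is the item count
    return ($sum, $average);
}

my @numbers = (5, 8, 11);
my ($sum, $average) = get_sum_and_average(@numbers);
print "The sum of @numbers is $sum and the average is $average\n";
#prints The sum of 5 8 11 is 24 and the average is 8
```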
Programming Assignment Part 1
Write a script that uses BioPerl and eval to get sequences from NCBI and print them to a file in FASTA format. This script will be used for next week's Multiple Sequence Alignment class.
Pseudocode:
1. Create a file that has the GenBank protein accessions from script 1
2. Read in (shift) the file that contains GenBank protein accessions in a single column
3. Put accessions into an array
4. Loop through the array values (@accessions)
5. Connect to GenBank and get the accession record (object)
6. Print sequence in FASTA format to a file only if record is found
# you are basically adding code from script 2 to script 3 for this assignment
accession file:
NP_848862
NP_071556
Perl Script
FASTA file:
>L-gulono..
MVHGYKG
VKFQNWA
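Steps 2 and 3 of the pseudocode (read the file name with shift, then load the accessions into an array) might look roughly like the sketch below. It covers only the file reading, not the GenBank lookup; read_accessions is a hypothetical helper name and accessions.txt is an assumed file name.

```perl
#!/usr/bin/perl -w
use strict;

# Read a single-column file of accessions into an array (hypothetical helper)
sub read_accessions {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    my @accessions;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    #trim stray whitespace
        next if $line eq '';        #skip blank lines
        push @accessions, $line;
    }
    close($fh);
    return @accessions;
}

my $infile = shift;    #first command-line argument, e.g. accessions.txt (assumed name)
if (defined $infile) {
    my @accessions = read_accessions($infile);
    print scalar(@accessions), " accessions read\n";
}
```

It would be run as perl script.pl accessions.txt, where script.pl is whatever name the script is saved under.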
Programming Assignment Part 2
Write a script to translate the mystery_ccds.fa sequence into a protein sequence in all 6 reading frames
Create two subroutines:
one to do the actual translation part (use the substr function here)
one to get the reverse complement of a sequence (use the tr and reverse functions)
Use the %translation hash provided in the file dna2rna.pl
Print out results in FASTA format and include in the header the frame that was used: +1, +2, +3, -1, -2, -3
Identify which frame is the proper translation
Submit your BioPerl and translation scripts to me by next week
wget http://140.226.65.107/mystery_ccds.fa
wget http://140.226.65.107/dna2rna.pl
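As a starting point for the substr part, the sketch below only splits a sequence into three-letter codons for each forward frame; it does not translate anything and does not use the %translation hash from dna2rna.pl. The get_codons name and the example sequence are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Split a DNA string into codons starting at a given frame (1, 2 or 3)
sub get_codons {
    my ($dna, $frame) = @_;
    my @codons;
    #substr offsets start at 0, so frame 1 starts at offset 0
    for (my $i = $frame - 1; $i + 3 <= length($dna); $i += 3) {
        push @codons, substr($dna, $i, 3);    #take 3 characters starting at offset $i
    }
    return @codons;
}

my $dna = "ATGGCCTAAG";    #made-up example sequence
foreach my $frame (1, 2, 3) {
    my @codons = get_codons($dna, $frame);
    print "+$frame: @codons\n";
}
#prints
#+1: ATG GCC TAA
#+2: TGG CCT AAG
#+3: GGC CTA
```

Leftover bases at the end that do not fill a whole codon are dropped by the loop condition.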
Perl has an operator called tr (transliterate) which allows you to replace all matching characters in a string. It's basically a search and replace for characters in a string.
#!/usr/bin/perl -w
use strict;
my $sequence = "CAT";
#each character in the first set is replaced by the character at the same position in the second set,
#so you need the same number of characters inside each set of the forward slashes;
#this is global replacement by default
$sequence =~ tr/GATC/CTAG/;
print "$sequence\n";    #prints GTA
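tr also returns the number of characters it matched, which makes it handy for counting. A minimal sketch, with count_gc as an illustrative name: when the replacement set is left empty, the string is not changed and only the count is returned.

```perl
#!/usr/bin/perl -w
use strict;

# Count G and C characters; with an empty replacement set tr only counts
sub count_gc {
    my ($sequence) = @_;
    my $count = ($sequence =~ tr/GC//);    #the return value of tr is the number of matches
    return $count;
}

my $sequence = "GATTACAGC";
my $gc = count_gc($sequence);
print "$sequence has $gc G/C bases\n";    #prints GATTACAGC has 4 G/C bases
```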
Perl has a function called reverse( ) which allows you to reverse a string
#!/usr/bin/perl -w
use strict;
my $sequence = "GTA";
my $reversed_sequence = reverse($sequence);
print "$reversed_sequence\n";    #prints ATG
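One thing to watch for: reverse( ) behaves differently depending on context. Assigning to a scalar, as in the script above, reverses the characters of the string, but in list context it reverses the order of the list items instead. The sketch below shows both cases; the variable names are illustrative.

```perl
#!/usr/bin/perl -w
use strict;

my $sequence = "GTA";

#list context: reverse() reverses the LIST of arguments (one item here),
#so the string itself comes back unchanged
my @list_result = reverse($sequence);

#scalar context: reverse() reverses the characters of the string
my $scalar_result = reverse($sequence);

#scalar forces scalar context, which is useful inside print or a function call
my $forced = scalar reverse($sequence);

print "list context:   $list_result[0]\n";    #prints GTA
print "scalar context: $scalar_result\n";     #prints ATG
print "forced scalar:  $forced\n";            #prints ATG
```

This is why print reverse($sequence) prints GTA rather than ATG: print supplies list context to its arguments.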