Linux and RNA-Seq read alignment

Brian J. Knaus
USDA Forest Service, Pacific Northwest Research Station


Outline
•Intro to Linux
•Reference types
•Read filtering
•Short read alignment


The Linux operating system
•Many ‘flavors’ of Linux (Ubuntu, Fedora, CentOS, openSUSE, Slackware).
•Frequently includes a GUI (GNOME, KDE).
•Its strength is the shell: Linux is a programmer’s OS.
•Permissions.
•Multiple shells (bash, tcsh, ksh).
•Text editors (gedit, vi, emacs).
•Finding help.


Interacting with a server (PC options)
PuTTY (SSH client): http://www.chiark.greenend.org.uk/~sgtatham/putty/
Xming (X server for Windows): http://www.straightrunning.com/XmingNotes/

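Shell commands

ls                        # list files
ls -lh                    # long listing, human-readable sizes
cd ~                      # go to your home directory
cd ..                     # go up one directory
pwd                       # print the working directory
mv                        # move or rename
cp                        # copy
mkdir                     # make a directory
df                        # report free disk space
rm                        # remove a file
rmdir                     # remove an empty directory
rm -rf                    # Will delete everything without asking.
cat filename.txt          # print the whole file
head filename.txt         # print the first lines
less filename.txt         # page through the file
gedit filename.txt &      # open in gedit, in the background
top                       # show running processes
chmod u+x filename.txt    # make the file executable for its owner
tar -xvzf file.tar.gz     # extract a gzipped tar archive

(Google ‘linux cheat sheet’)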

Shell commands

Tab completion
history

Finding help with Linux

$ man command
$ info command

Google ‘Linux what you need help on’.
O’Reilly books (http://oreilly.com/).


Reference types
•From a genome project (model organisms).
•De novo or from cDNA.
  Are all isoforms present?
  How will exon skipping affect inference of regulation?


What’s in a name?
•Bowtie truncates reference names at spaces.
•Some characters don’t mix well with the sequence ontologies.
  http://www.sequenceontology.org/resources/gff3.html

Note the difference between sequence ontology and gene ontology.
http://www.geneontology.org/
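Because names are cut at the first space, one common workaround is to replace the whitespace in each FASTA header before building the index. The sketch below is a minimal illustration, not the author's method: the function names are hypothetical, and the raw header in the example is an assumed form chosen so that the cleaned name matches the style of reference names used later in this deck.

```python
import re

def clean_reference_name(header: str) -> str:
    """Collapse each run of whitespace in a FASTA header into one underscore."""
    name = header[1:] if header.startswith(">") else header
    return re.sub(r"\s+", "_", name.strip())

def clean_fasta(lines):
    """Yield FASTA lines with cleaned headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield ">" + clean_reference_name(line)
        else:
            yield line

# Assumed raw header (hypothetical): fields separated by double spaces.
raw = [">isotig18613  gene=isogroup07808  length=677  numContigs=1", "ACGTACGT"]
cleaned = list(clean_fasta(raw))
print(cleaned[0])
# prints ">isotig18613_gene=isogroup07808_length=677_numContigs=1"
```

This only handles whitespace. Whether other characters (such as "=") need replacing depends on the downstream tools, so check them against the GFF3 page linked above.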


Fastq file

@HWI-EAS121:1:1:0:952#0/1
CGTTNCCACTTCCTCCATCATGTCATCATGTGCGACAGGA
+HWI-EAS121:1:1:0:952#0/1
aab^D\babbbabbbbabbaaaabaabaaa_`aaaaa]PY
@HWI-EAS121:1:1:0:405#0/1
CGTTNTAAAGGTGCACCAGGGATCAAATCAATGGAATGCT
+HWI-EAS121:1:1:0:405#0/1
aa^[DVa^`^_Y`a^a`[\^\Z^aaYZ`a`X__]ZZ_]`_
@HWI-EAS121:1:1:0:724#0/1
CGTTNCATGCCCTTCTTTAATTTTTACACATGGTTCTTCT
+HWI-EAS121:1:1:0:724#0/1
aa`[D^aa`aaaaaaaaa_R`aaaaaaaa`aa`Y`aa``a
@HWI-EAS121:1:1:0:666#0/1
TTGTNAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
+HWI-EAS121:1:1:0:666#0/1
a`bOD[]R]`a__aT^YX\a`aMXaa[a[_a\HT\_``\[
@HWI-EAS121:1:1:0:1591#0/1
TTGTNCTCACCTATAATTTGACTTTGACATGCTACCTAGC
+HWI-EAS121:1:1:0:1591#0/1
aaaYD[aaa`aaaaWZaaaaa``_aaaa`aa`_V_``Y[a

40-mer sequences
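Each read occupies four lines: a header starting with "@", the sequence, a separator starting with "+", and a quality string of the same length as the sequence. The following is a minimal sketch of a reader for that layout, assuming unwrapped four-line records; `FastqRecord` and `read_fastq` are names made up for this example, and the two records are copied from the slide above.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (sequences not wrapped)."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not chunk:
            return  # clean end of input
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, qual)

# Raw strings keep the backslashes in the quality lines intact.
example_lines = [
    "@HWI-EAS121:1:1:0:952#0/1",
    "CGTTNCCACTTCCTCCATCATGTCATCATGTGCGACAGGA",
    "+HWI-EAS121:1:1:0:952#0/1",
    r"aab^D\babbbabbbbabbaaaabaabaaa_`aaaaa]PY",
    "@HWI-EAS121:1:1:0:405#0/1",
    "CGTTNTAAAGGTGCACCAGGGATCAAATCAATGGAATGCT",
    "+HWI-EAS121:1:1:0:405#0/1",
    r"aa^[DVa^`^_Y`a^a`[\^\Z^aaYZ`a`X__]ZZ_]`_",
]
records = list(read_fastq(example_lines))
for rec in records:
    print(rec.name, len(rec.sequence))
```

To read from a file instead, pass an open file handle, for example `read_fastq(open("sample1.fq"))`.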

Read filtering
•Adapter dimers.
•Fastq quality format (Phred, Illumina pre-1.3, Illumina post-1.3).
  http://maq.sourceforge.net/qual.shtml
•Poly(A).
•Non-target organism.
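The filters listed above can be sketched in a few lines. The quality conversions follow the standard definitions: an ASCII offset of 33 for Sanger/Phred, an offset of 64 for Illumina 1.3+, and an offset of 64 plus a log transform for the older Solexa scale. Everything else here is an assumption made for illustration: the function names, the 0.8 poly(A) threshold, the 12-base match length, and the adapter sequence, which should be replaced with the one used in your library preparation. A dedicated trimming tool would normally be the better choice for real data.

```python
import math

# ASCII offsets for the three quality encodings.
OFFSETS = {"sanger": 33, "illumina-1.3+": 64, "solexa": 64}

def phred_scores(quality: str, encoding: str = "illumina-1.3+") -> list:
    """Convert a FASTQ quality string to a list of Phred scores."""
    offset = OFFSETS[encoding]
    raw = [ord(ch) - offset for ch in quality]
    if encoding == "solexa":
        # Solexa scores use a different scale; map them onto Phred.
        return [round(10 * math.log10(10 ** (s / 10) + 1)) for s in raw]
    return raw

def is_poly_a(sequence: str, min_fraction: float = 0.8) -> bool:
    """Flag reads that are mostly A (threshold is an illustrative assumption)."""
    if not sequence:
        return False
    return sequence.upper().count("A") / len(sequence) >= min_fraction

def contains_adapter(sequence: str, adapter: str, min_match: int = 12) -> bool:
    """Flag reads containing the first min_match bases of the adapter."""
    return adapter[:min_match] in sequence

# Assumption: substitute the adapter sequence for your own library.
ADAPTER = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"

read = "TTGTNAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"  # fourth read on the Fastq slide
print(phred_scores("aab^D"))            # first five qualities of the first read
print(contains_adapter(read, ADAPTER))
print(is_poly_a(read))
```

Under the Illumina 1.3+ encoding the character "D" decodes to a score of 4, which lines up with the "N" base call at the fifth position of each read on the slide.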


Alignment software
•Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  Persistent index
  Heterogeneous read length
•BWA: http://bio-bwa.sourceforge.net/
  Persistent index
  Heterogeneous read length
  Gapped alignment
•CASHX: http://jcclab.science.oregonstate.edu/?q=node/view/56095
  Non-Smith-Waterman alignment
•SAMtools: http://samtools.sourceforge.net/
  Manipulate SAM files.

Index creation - Bowtie

mkdir btindex
mv rna_ref.fa btindex
cd btindex
bowtie-build rna_ref.fa rna_ref    # build the index from the FASTA reference
bowtie-inspect -s rna_ref          # summarize the index (sequence names and lengths)
cd ..


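Read alignment - Bowtie

mkdir btout
cd btout
# -q: FASTQ input; -n 2: up to 2 mismatches in the seed; -S: SAM output
bowtie -q -n 2 -S ../btindex/rna_ref ../fastq/sample1.fq > sample1.sam
samcounter.pl -a ../btindex/rna_ref.fa -b sample1.sam
samtools view -b -S sample1.sam > sample1.bam
# samtools 0.1.x syntax: input BAM, then output prefix (writes sample1.bam)
samtools sort sample1.bam sample1
# pileup belongs to samtools 0.1.x; later releases provide mpileup instead
samtools pileup -f ../btindex/rna_ref.fa sample1.bam > sample1-pileup.txt
samtools index sample1.bam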


Read alignment - Bowtie

#!/bin/tcsh
set index='../btindex/rna_ref'
set reads='../fastq/sample1.fq'
set samp='sample1'

##### ##### ##### ##### #####
# Main.

bowtie -q -n 2 -S $index $reads > $samp.sam
samcounter.pl -a $index.fa -b $samp.sam
samtools view -b -S $samp.sam > $samp.bam
samtools sort $samp.bam $samp
samtools pileup -f $index.fa $samp.bam > $samp-pileup.txt
samtools index $samp.bam

##### ##### ##### ##### #####
# EOF.
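The script puts the index, reads and sample name into variables so the same steps can be reused for another sample. As a rough sketch of that idea, the function below builds the same six command lines for any sample name. It only constructs and prints the strings and runs nothing. The function name, the `fastq_dir` parameter and the second sample name are assumptions for illustration; the command lines themselves are copied from the script above.

```python
def alignment_commands(sample: str,
                       index: str = "../btindex/rna_ref",
                       fastq_dir: str = "../fastq") -> list:
    """Return the slide's Bowtie/SAMtools command lines for one sample."""
    reads = f"{fastq_dir}/{sample}.fq"
    return [
        f"bowtie -q -n 2 -S {index} {reads} > {sample}.sam",
        f"samcounter.pl -a {index}.fa -b {sample}.sam",
        f"samtools view -b -S {sample}.sam > {sample}.bam",
        f"samtools sort {sample}.bam {sample}",
        f"samtools pileup -f {index}.fa {sample}.bam > {sample}-pileup.txt",
        f"samtools index {sample}.bam",
    ]

# "sample2" is a hypothetical second sample, added only to show the loop.
for sample in ["sample1", "sample2"]:
    for command in alignment_commands(sample):
        print(command)
```

The printed lines can be pasted into a shell script or passed to a job submission wrapper. The steps for a single sample depend on each other and need to run in order.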


Read alignment - Bowtie

#!/bin/tcsh
set index="../btindex/rna_ref"
set reads="../fastq/sample1.fq"
set samp="sample1"

##### ##### ##### ##### #####
# Main.

easyqsub.pl -a "bowtie -q -n 2 -S $index $reads > $samp.sam"
easyqsub.pl -a "samcounter.pl -a $index.fa -b $samp.sam"
easyqsub.pl -a "samtools view -b -S $samp.sam > $samp.bam"
easyqsub.pl -a "samtools sort $samp.bam $samp"
easyqsub.pl -a "samtools pileup -f $index.fa $samp.bam > $samp-pileup.txt"
easyqsub.pl -a "samtools index $samp.bam"

##### ##### ##### ##### #####
# EOF.


Alignment viewer - SAMtools

samtools tview sample1.bam ../btindex/rna_ref.fa


Index creation - BWA

mkdir bwaindex
cp btindex/rna_ref.fa bwaindex/
cd bwaindex
bwa index -a is -p rna_ref rna_ref.fa


Read alignment - BWA

cd ..
mkdir bwaout
cd bwaout
bwa aln -o 0 ../bwaindex/rna_ref ../fastq/sample1.fq > sample1.sai
bwa samse ../bwaindex/rna_ref sample1.sai ../fastq/sample1.fq > sample1.sam
samcounter.pl -a ../btindex/rna_ref.fa -b sample1.sam
samtools view -b -S sample1.sam > sample1.bam
samtools sort sample1.bam sample1
samtools pileup -f ../btindex/rna_ref.fa sample1.bam > sample1-pileup.txt
samtools index sample1.bam


Alignment viewer - SAMtools

samtools tview sample1.bam ../btindex/rna_ref.fa


SAM file format

@HD VN:1.0 SO:sorted
@PG TopHat VN:1.0.13 CL:/local/cluster/bin/tophat -p 4 --solexa1.3-quals ../indexes/psme_ref ../psme_seqs.fq
ILLUMINA-3AB384_0001:6:24:19059:8781#GATT 0 0_54_255 1 255 80M * 0 0 TCTTCTTCATGTTTGGCACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTATCATTATTGGTCAGTT CCCCCCCCCCCCCCCCDCCCCCCCC@CBCBBCCBCCCCCCCCCCCCCCCCCCCDCD@C@CCCC4=CCBCCCCAC>B>BBC NM:i:1
HWI-EAS121_0024_FC61F8DAAXX:7:101:7452:15154#CTGT 0 0_54_255 17 255 76M * 0 0 CACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTGTCATTATTGGTCAGTTATGACCTTAATT GGGGGGGGGGFEGFFGFEEFFBEECEFFFFFGGDGFDDGE:FBBFEGFFD?DEDEFB=DDD=ECCC=EAACDEDC= NM:i:0

Header lines (start with @):
@ header line 1: file format version
@ header line 2: program which created the file

Alignment columns (tab-separated in the file):
1  Query (read) name
2  Flag
3  Reference name
4  Leftmost mapping position
5  Mapping quality
6  CIGAR string
7  Reference name of mate
8  Position of the mate
9  Template length
10 Fragment sequence
11 Fragment quality
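The eleven mandatory columns can be read with a plain tab split. The sketch below parses alignment lines, decodes two bits of the flag (0x4 for unmapped, 0x10 for reverse strand) and tallies mapped reads per reference name. It is an illustration under stated assumptions: `SamRecord`, `parse_sam_line` and `count_by_reference` are hypothetical names, and the tally is not a description of what samcounter.pl does. The two records are the ones shown on the slide, rebuilt with tab separators.

```python
from collections import Counter
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple

def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its 11 mandatory fields plus optional tags."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))

def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)

def is_reverse(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)

def count_by_reference(lines) -> Counter:
    """Count mapped reads per reference name, skipping @ header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        rec = parse_sam_line(line)
        if not is_unmapped(rec):
            counts[rec.rname] += 1
    return counts

sam_lines = [
    "@HD\tVN:1.0\tSO:sorted",
    "\t".join(["ILLUMINA-3AB384_0001:6:24:19059:8781#GATT", "0", "0_54_255", "1",
               "255", "80M", "*", "0", "0",
               "TCTTCTTCATGTTTGGCACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTATCATTATTGGTCAGTT",
               "CCCCCCCCCCCCCCCCDCCCCCCCC@CBCBBCCBCCCCCCCCCCCCCCCCCCCDCD@C@CCCC4=CCBCCCCAC>B>BBC",
               "NM:i:1"]),
    "\t".join(["HWI-EAS121_0024_FC61F8DAAXX:7:101:7452:15154#CTGT", "0", "0_54_255", "17",
               "255", "76M", "*", "0", "0",
               "CACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTGTCATTATTGGTCAGTTATGACCTTAATT",
               "GGGGGGGGGGFEGFFGFEEFFBEECEFFFFFGGDGFDDGE:FBBFEGFFD?DEDEFB=DDD=ECCC=EAACDEDC=",
               "NM:i:0"]),
]
counts = count_by_reference(sam_lines)
first = parse_sam_line(sam_lines[1])
print(counts)
print(first.pos, first.cigar, is_reverse(first))
```

Both example reads carry flag 0, so they are mapped to the forward strand of reference 0_54_255.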


Gene                                                       cb_a  cb_b  yk_a  yk_b
isotig18613_gene=isogroup07808_length=677_numContigs=1       17    18   139   159
isotig01880_gene=isogroup00225_length=652_numContigs=4       11    10   162    56
isotig07160_gene=isogroup01638_length=3698_numContigs=4      31    81   276   226
isotig06362_gene=isogroup01321_length=1396_numContigs=4      32    31   149    91
isotig06005_gene=isogroup01197_length=1204_numContigs=4      52    68   169   198
isotig06363_gene=isogroup01321_length=1470_numContigs=4      21    27    73   100
contig29123_gene=isogroup00629_length=686                    30    15    75   161
isotig30058_gene=isogroup19254_length=1101_numContigs=1      31    36    75   400
contig50604_gene=isogroup01657_length=1247                  272   405  1153   724
contig21101_gene=isogroup01657_length=559                    47    96   264   165
isotig05419_gene=isogroup01011_length=1938_numContigs=4      32    49   103   126
contig03433_gene=isogroup00629_length=496                    21    10    55    71
isotig05877_gene=isogroup01156_length=2570_numContigs=4      91    70   154   762

Strand specificity

Parkhomchuk et al. 2009. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Research 37(18):e123.
