ALSCRIPT: a tool to format multiple sequence ... - The Barton Group

30 downloads 933511 Views 4MB Size Report
Oct 14, 1992 - AMPS @arton and Sternberg,1987; Barton, 1990) block-frle format and a set of formatting commands, and produces a. PostScript file that may ...
Protein Engineering vol.6 no.l pp.37-40, 1993

ALSCRIPT: a tool to format multiple sequencealignments

Geoffrey J.Barton University of Oxford, Laboratory of Molecular Biophysics, The Rex Richards Building, South Parks Road, Oxford OXl 3QU, UK

Introduction Alignments of protein and nucleic acid sequencesare a central aid to understanding biological systems. Although several effective tools now exist for the rapid automatic alignment of large numbers of sequences(see, for example, Barton and Sternberg, 1987; Feng and Doolittle, 1987; Higgins and Sharp, 1989; Vingron and Argos, 1989),the preparationfor publication of quality figures of such alignments with annotationsis often extremely difficult. Artwork using pen/ink/dry lettering can allow labelling and boxing. However, this involves a great deal of labour and a fair degreeof skill to producevisually pleasingresults[e.g. the large 128 sequenceglobin alignmentin Barton and Sternberg(1987)1. With the advent of powerful word-processing and graphics packagesfor personal computers, many laboratories 'tidy up' their alignmentsby judicious mouse-work. Although good results may be obtained by this route, the process, in common with conventionalartwork, is rather inflexible. A particular problem is that having chosena number ofcharacters per line and character point size, it is difficult subsequentlyto modifz the figure. It is also laborious to add new sequencesto an existing figure. In addition, one is normally limited to the use of one of a few fixedwidth fonts (e.g. Courier) rather than the full range of fonts availableto the word-processor.The table-drawing facilities of various packagescan overcome the problems of proportional fonts. For example, the Unix 'troff' macros 'tbl' were used to prepare the alignmentsin Barton and Sternberg(1990), though subsequentlyretyped by the J. Mol. Biol.l L{leX (Lamport, 1986) tables were used for the 88 sequenceannexin alignment in Barton et al. (1991) and the 67 sequence SH2 domain alignment in Russell et al. (1992): both figures were extremely time consuming to prepare, but rather easier to update than a conventionally word-processedalignment. The ALSCRIPT program described in this article was developed specifically to allow the easy formatting and graphical display of large multiple sequencealignments.Although written originally for the author's use, the interface is relatively friendly, and should be easy to learn by anyone familiar with plotting graphs. Description of ALSCRIPT ALSCRIPT takes a multiple sequencealignment in the simple AMPS @arton and Sternberg,1987; Barton, 1990) block-frle format and a set of formatting commands, and produces a PostScriptfile that may be printed on a PostScriptlaser printer or viewed using a PostScriptpreviewer (e.g. Sun Microsystem's PageView program). GCG 'MSF' format files and CLUSTAL format files (PIR) are also supported. ALSCRIPT is strictly a formatting, display and annotationtool. It is not a program for

multiple sequencealignment, or editing. The previous programs that offer the closest functionality to ALSCRIPT are PRETTYPLOT and PRETTYBOX as supplied with the GCG package (Devereux et al., 1984), and the latest colour version of SOMAP (Parry-Smith and Attwood, 1991). Whilst these programs do not provide the samedegree of flexibility in display as offered by ALSCRIPT, they do allow the calculation of consensussequencesand automatic shading/boxing according to dehned rules. The aim of ALSCRIPT is to allow the user total control over the display ofthe sequencealignment; such control is essentialif non-sequenceinformation is to be used to highlight features of the sequence:for example, the location of active-site residues,the positionsof secondarystructures(cy-helicesand pstrands),and domain or intron/exon boundaries.The flexibility of ALSCRIPT also permits it to be used as a 'front-end' display tool for programs that offer sophisticated alignment analysis features. For example, the AMAS (Analysis of Multiply Aligned Sequences;C.D. Livingstone and G.J.Barton, in preparation) system, which highlights structurally important regions of a multiple alignment, generates ALSCRIPT commands as one output option. Similarly, the consensusalgorithms encoded in the GCG program PRETTY could be used to generate ALSCRIPT shading, boxing or font-changing commands for graphical output. Given a block-file and the text point size, AIJCRIPT calculates how many residues can be fitted across the page and how many sequenceswill fit down it. It then prints the alignment at the chosen point size on as many pages as are needed. Running ALSCRIPI with a smaller or larger point size will automatically rescale the alignment to f,it on fewer or more pages, as appropriate. The actual page dimensions may be reset to any value, so if an ,A.3PostScriptprinter or fypesettingmachine is available, alignmentscan readily be scaledto make best use of the extra space. Each output pagehas three regions. The left hand edgecontains identifuing text for each sequence,the main part of the page holds the alignment, and the top part, the position numbers and optional tick marks. ALSCRIPT commands make use of a character coordinate system for font changes and other formatting commands.Thus, any residuein the alignment may be referred to by its sequenceposition number (x-axis) and sequencenumber (y-axis). Rangesof residuepositions or sequencesmay also be def,rnedin the character coordinate system. The basic ALSCRIPI commands allow the followine functionalities. Fonts Any PostScript font in any size may be def,rnedand used on individual residues, regions or identifrer codes. Boxing Simple rectangular boxes may be drawn around any part of the alignment. Particular residue types may be selected and automatically 'surrounded' by lines. For example, if the characters 'G' and 'P' are selected, lines will not be drawn 5t

G.J.Barton

a

FFFLLFFFTFFFFFFFFFIJIJDDDDDDDDDDDDDDDDDD HHHITHHH 7 HHHHIIHHH HHHTIHHHH EEEEEEDDDEEEEEDDEEEGRRRRRRAOOOOOAAOAAA KKKTTTRRRRRKRRKKRRRREDDAAAVAADDDDDD I TR HHI{HITHHHH LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL+ HHHHHHHHIT

E

E E E E E

HHHITHHHHH QEEYYYDDDKKKKKKKYYRQDDDEEERQOEEEEEAXDQ HHHHHHHHH HHHHHHIIH AAAAASSSSSSSAASSW? H GGGGGGGGGGGGGGGG@ MMMMMMMMMUMMMMMU IJLWVEEEEEEEEEEEEEEE HHHHHHBB KKKKKKKKKKKKKKKKKKKKKKKRRRLLLLLKKKKNKG HHHHHHH IIITH ffiRRRRRRKKKKKKKKIRKR VAAAAAKKKIJI'I'L I'AAAA I AKKKKKKWIIT{WVIWVTVIRI{YIL HHH + HH HHHH TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT+ RRRDDDRRRDDDRRDDRDDDDDDDDDDDDDDDDDDDDD TTITHHH IIHHDDDDDDDDDDDEEDDEEWvvvvEEEEEEEEEEEEITHHHHHHH KKKHHDKKKNDNNNKKGS HHITHHHH FFP PPNNNEEEAAWVI'IDDS AATTTTWVTTTTTTTTTTTKKKVVUIKKQQKKKKKKC HHHHHHHH LLLLLLLLLLLLLLLLLLLLWWWFFFFFFFFFFFFFFF6 HTIITHHHHII I T I I I I I I I I I I I ITT IVIiINI I INTNI I I I ILLLITTNs HTITTHHHII HITHHHHH WSSSTTTTTTYYTTTTEEM II IVWI I I\AA/II I\NI TII I I IIII I II IVVVI I I I5 HHHHHH HHHHHH VMMMMMLLLLFI,I,LLLLLLLL 6 HTTHHH WTTTTTGGGGGCCCCCCA6 sssssssssssssssssrssEEETsrrrrNNsssrLLT HITITH + HIIHH

E E

E E EEE EEEE EEEE EEEEE EEEEE EEEEE

EEEE EE EE

TTTTH TTTH TTH TTTH TTTH TTH TTH TTH T TTTH TTTTH TTTTT TTTTT TTTTTTT TTTTTTTTTT TTTTTTTTTT TTTTTTTTTT TTTTTTTY TTTTTH TTH TH H H TH TTTTH TTTTTT TTTTTTTTTT

* @) Brsic formrtting.nd rnnotrtion commmdc:

b # (A) Input ud output flcs: BLOCK_FILE OUTPUT_FILE

examp1el.bloc examplel.ps

# (B) Pegc Lryout rnd ovcrrll epeclng: ADD_SEQ 38 1 ADD_SEQ 39 3 ADD-SEQ 69 1 PORTF"AIT POINTSIZE 10 # (C) Font dc0nitiong: DEFINE-FONT DEFINE_FONT DEFINE-FONT DEFINE-FONT DEFINE_FONT DEFIM-FONT SETI'P

O 1 3 4 5 6

Helvetica DEFAULT Helvetica REL 0. ?5 Helvetica-Bold DEFAIILT Times-Bold DEFAITLT Helvetica-BoldObIigue DEFAULT Times-Ronan DEFAITLT

IDENT_FONT AJ,L 6 ALL SURROI'ND-CITARSLIV FONT-CITARS IV AI,L 5 FONT-CIIARS ASN ATL 1 FONT-CIIARS KR ALL 4 BOX_REGTON 14 1 16 38 + AIJL SIIADE-CHARS O. O INVERSE-CHARS DE AI,I, SIIADE-CITARSD AIL 0.5 SHADE-CHARSE AJ,I, O.O SURROI'ND_CHARS 56+?89 t 40 2'l 40 LINE TOP 2 40 8 BOTTOM 2 40 LINE 8 TEXT 2 42 "Helix Patterntr 12 42 nGlycine Loopn TEXT TEXT 2L 42 "Helix Patternn 49 "HeLix Predictions" SI]B_ID SUB_ID 58 nSheet Predictionsn SIIB_ID 68 nTurn Predictions" # @) Now improvc thc eppcrencc of thc prcdlction histogrems. SURROUND-CIIARS II IT SIIADE-CITARS 1 SI'B_CHARS

L L 44

4 4

4 4 2't

2 2

7 7 53

53 53 0.5 TTSPACE

Fig. 1. Illustrationof someof the formattingcapabilitiesof ALSCRIPT. (a) A small AMPS block file. Each alignedsequenceis represented by a column. This block file includes character-basedhistogramsthat show the result of a combined secondarystructure prediction method (Russell et al., 1992). (b) An ALSCRIPI commandfile. Commentlines start with a # character,sectionsA-C set up generallayout definitions:the ADD-SEQ commandallows extra spaceto be inserted in the alignment at any location; 39 3 meansadd spacefor three sequencesafter sequence39. PORTRAIT specifiesthat the output will be arrangedwith the longest side of the paper vertical. POINT SIZE setsthe default point size for plotting characters.The DEFINE-FONT commandsset up the fonts that are to be used and their point size; font 0 is used as the default. The point size may be set to the default value, an explicit value or relative (REL) to the default size. SectionsD and E show the boxing, shading and font changing commandsin action. The IDENT-FONT ALL 6 specifiesfont 6 (Times-Roman)for all identifiers. SURROUND-CHARS LIV ALL specifresthat all L, I and V characterswill be separatedfrom other charactersby a line (e.g. at sequencepositions19-21). FONT-CHARS KR ALL 4 switchesthe font of K and R charactersto font 4 (Times-Boldl0 point). When regionsare selectedthe coordinatesare given in characterunits with the origin at the top left hand corner of the plotted alignment. For example, BOX___REGION 14 I 16 38 meansbox from residue 14 to residue 16 of sequencesI -38. These coordinatesapply irrespective of how many pagesare neededto plot the alignment at the chosenpoint size. The SHADE-CHARS commandpermits specific charactersto be shadedwith a grey value; The INVERSE-CHARS command allows white lettering to be superimposedon the shadedbackground, as shown for D and E. LINE TOP 2 40 8 draws a horizontal line at the top of the charactersfrom residue 2 to residue 8 of sequence40, and similar commandsallow lines at the left, right and below the defined residues;SUBJD changesthe identifier code for a sequenceand SUB-CHARS allows a characterto be substitutedwith another. In the example shown here (c), SUB-CHARS is used to remove the H, E and T charactersthat made up the prediction histogramsand replace them with SPACE ' ' characters.The similar commandsused to format the Sheet, Turn and Summary predictions are omitted for brevity.

38

ALSCRIPT I

AnnexinI H4 Annexin I M4 Annexin I R4 AnnexinV R4 Annexin V H4 AnnexinV C4 Annexin II H4 AnnexinII 84 AnnexinII M4 AnnexinIV P4 Annexin IV 84 AnnexinIV H4 AnnexinVI H4 AnnexinVI M4 Annexin VI H8 AnnexinVI M8 AnnexinV BH4 AnnexinVII4 Annexin III H4 AnnexinIII R4 Annexin II lvlil AnnexinII H3 AnnexinII 83 AnnexinI H3 AnnexinI M3 AnnexinI R3 AnnexinV C3 AnnexinV H3 AnnexinV R3 AnnexinVI H3 AnnexinVI M3 AnnexinIV H3 AnnexinIV 83 AnnexinIV P3 AnnexinV BH3 AnnexinIII H3 AnnexinIII R3 AnnexinVII 3

lF' A

F r F e IL IA IL IA

F r F Ar F AA F AA F e F Ae F Ar F Ae F Ae F A F r F r f a A A A A

A A

K K K T T T R R R R R K R R K K R R R G R

v

l

A A A A A A

El Y A Y A

Y s IffD

l*Is l&Es K s K s K s

M M M M M M M M

K K K K K K K K

M M M M M M M M

KGI KGI KGI KGI K G K G K G K G

F K A F K ^ K S K s Y n Y A H R A H Q e

lllx cl

ILIKG

Hr EHffi R ^ R ^ R A

OE

G l v l KR G G G G G r R ^ G F Q r \: F Q r G

Ee t:lA

Y

lJ

G

EIA G

G EI A AG UI

Y K ^ G

Q T O R

EIe G Q e G

FT-.r-TN

Conservation

R R R R R R R R R R R R R R R R R R R R

G G G G G G G G

MKcl

rE Q r

A

l,ioxcl

H Q e

I

HelixPattern

T T T T T T T T T K K

R R R R R R N R N R s R s R s R Iu I R R t L lR T R

s T T T T

F F F F F F F F F F F F

16Tl ls 6--6-l f

GlycineLoop

HelixPattern

Helix hedictions

SheetPredictions

@

$.

t-.. ..ffiI ffi*ql

.ffi* # iiiiiiiiililltiililliltli,il

Tum Predictions

:i:::l:;i:li::,:::i:ii::l:::,i:ii::li:it:i::: | |

#

Summaryhedictionffi

ffi

:lll:ilitiillllitffi.,#iiilfi

i:

between G and P characters.but onlv where G and P border with other characters.

or as residue-specificshading, e.g. to 'shade all Cys residues between positions 6 and 30'.

Shading Grey shadingof any level from black to white may be applied to any region of the alignment, either as a rectangular region

Text Specific text strings may be addedto the alignment at any position and in any font or font size.

39

G.J.Barton Lines Horizontal or vertical lines may be drawn to the left, right, top or bottom of any residue position or group of positions. Defaults All defaults, e.g maximum numberAengthof sequences,character spacing,may be modified using ALSCRIPT commandswithout the need to recompile the program. Figure l(a) illusrates a small section of a multiple sequence alignment in AMPS block-file format. The sequenceidentifrer codes stored above the alignment have been deleted for brevity. In this example, the block-file also contains character-based histogramsrepresentingthe prediction of helix, strand and turn. Figure 1(c) illustratesthe result of running ALSCRIPT on this file using the commandsshown in Figure 1O). It is not suggested that this combination of options gives the clearest representation of the data; rather, the options have been chosento illustrate many of the capabilities of the program. Although written with the aim of producing figures for publication, ALSCRIPT is a useful researchtool for interpreting multiple sequencealignments. For example, the boxing, shading and font changing facilities can be applied to highlight amino acids of a particular fype and thus draw attention to clusters of positive or negative charge, hydrophobicity and so on. Furthermore, computer programs for the automatic analysis of alignments can be made to produce ALSCRIPT formatting comnxrndsand a block file, thus simpliffing the task of generating graphical representationsof such analyses. System requirements and availability ALSCRIPT is written in C and should compile and run on most computers with a C compiler. An IBM-compatible disk (1.214 MB) including the source code and a compiled version of the program for 386 DOS computersis available from the author. Acknowledgements The author thanks the Royal Society for support, and Professor L.N.Johnson for providing a stimulating working environment.

References Barton,G.J. (199O) Methods Enzymol., 183, 403-428. Barton,G.J.and Sternberg,M.J.E.(1987) l. Mol. Biol.. lW, 327-337. Barton,G.J.and Sternberg,M.J.E.(1990) l. Mol. Biol.,212,389-4U. Barton,G.J., Newman,R.H., Freemont,P.F. and Crumpton,M.J. (1991\ Eur. J. Biochem.. l9rl. 749-7ffi Devereux,J., Haeberli,P. and Smithies,O. (1984) Nucleic Acids Res., 12, 387-395. Feng,D.F. and Doolittle,R.F. (1987) J. Mol. Evol.,25,351-360. Higgins,D.G. and Sharp,P.M. (1989) Comput.Appl. Biosci.,5, 151-153. Lamport,L. (1986) IaTeX: A Document Preparation Sysrem.Addison-Wesley, New York. Parry-Smith,D.J. and Attwood,T. K. ( 199I ) Comput.Appl. Biosci., 7, 233 - 235. Russell,R.B.,Breed,J.and Barton,G.J. (1992) FEBSLett.,3{M., l5-2O. Vingron,M. and Argos,P. (1989) Comput.Appl. Biosci.,5, 115-121. Received on June 30, 1992: revised on October 14, 1992; accepted on October

20. 1992

40