Exploring protein fold space by secondary structure prediction using ...

3 downloads 0 Views 788KB Size Report
derived from the new scoring scheme for four different model proteomes was ... improvement for protein secondary structure prediction was .... Intel Pentium 4.
BIOINFORMATICS

Vol. 20 no. 18 2004, pages 3500–3507 doi:10.1093/bioinformatics/bth435

Exploring protein fold space by secondary structure prediction using data distribution method on Grid platform Soojin Lee1,† , Min-Kyu Cho2,† , Jin-Won Jung2,† , Jai-Hoon Kim1,∗ and Weontae Lee2, ∗ 1 Distributed

and Mobile Computing Laboratory, Graduate School of Information and Communication, Ajou University, Suwon 442-749, Korea and 2 Biomolecular NMR Laboratory, Department of Biochemistry, Yonsei University, Seoul 120-749, Korea Received on January 10, 2004; revised on June 21, 2004; accepted on July 17, 2004 Advance Access publication July 29, 2004

INTRODUCTION Protein secondary structure prediction has been used as a very useful tool for exploring proteins, through applications such as topology recognition (Di Francesco et al., 1997), fold recognition (McGuffin et al., 2001; Przytycka et al., 1999), screening novel folds (McGuffin and Jones, 2002), identification of domains and monitoring influences of point mutation (Rost, 2001). To derive secondary structural information of proteome from a huge amount of biological data after genome projects, computing equipment with high performance is ∗ To

whom correspondence should be addressed.



The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors

3500

essential. However, a small research unit in the university or an individual researcher has difficulties to afford the expensive computers with high performance. To achieve the required performance at a lower cost, Grid (Foster and Kesselman, 1998; Foster et al., 2001) system has been developed. Owing to the popularity of the Internet, the powerful computers and high-speed networks as low-cost commodity components are normally open to public. Grid platform combines into the equivalent of a single unified computer, a wide variety of resources such as supercomputers, storage system, data sources and special classes of devices, which are actually distributed geographically. In this paper, performance improvement for protein secondary structure prediction was achieved by making our Grid to process a large amount of protein sequence data. A program called PSIPRED (Jones, 1999), which is based on the neural network approach, is used for protein secondary structure prediction. Our Grid system distributes protein sequence data and each computing node processes its own part of protein sequence data to speed up the structure prediction. The results show that our Grid is a viable platform for processing massive biological data such as the set of all protein sequences in a living organism. On the basis of improved computing power, a genome scale secondary structure prediction and topology scoring was proposed. In this report, the results show that the structure prediction, analysis and comparison on a genomic scale become feasible and provide an insight for protein folding space of whole genomes.

SYSTEMS AND METHODS To construct our Grid test-bed, we installed Globus (Foster and Kesselman, 2002) as our Grid middleware, Portable Batch System (PBS) (Henderson, 1995) to manage resource and Nimrod (Abramson et al., 1995) API to monitor system resources and applications.

Bioinformatics vol. 20 issue 18 © Oxford University Press 2004; all rights reserved.

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

ABSTRACT Motivation: Since the newly developed Grid platform has been considered as a powerful tool to share resources in the Internet environment, it is of interest to demonstrate an efficient methodology to process massive biological data on the Grid environments at a low cost. This paper presents an efficient and economical method based on a Grid platform to predict secondary structures of all proteins in a given organism, which normally requires a long computation time through sequential execution, by means of processing a large amount of protein sequence data simultaneously. From the prediction results, a genome scale protein fold space can be pursued. Results: Using the improved Grid platform, the secondary structure prediction on genomic scale and protein topology derived from the new scoring scheme for four different model proteomes was presented. This protein fold space was compared with structures from the Protein Data Bank, database and it showed similarly aligned distribution. Therefore, the fold space approach based on this new scoring scheme could be a guideline for predicting a folding family in a given organism. Contact: [email protected]

Prediction of protein fold space by Grid computing

Globus toolkit and Nimrod

Test-bed in Grid environment All calculations were performed on both Ajou BioGrid and local BioGrid. Figure 1 shows the configuration of Ajou local BioGrid and other local BioGrid. We extended Ajou local BioGrid using the clusters of many institutes such as KISTI in Korea and AIST in Japan. In addition, we are planning to connect several other sites and expect to enlarge BioGrid network. The master nodes were excluded from the computation nodes to perform job distribution and result collection. Globus, PBS server/scheduler/executor and Nimrod are installed on the master nodes. The worker nodes are computation nodes for executing job and installed with PBS executor. Because Network File System (NFS) is used on clusters to control the file system easily, PSIPRED, a protein secondary structure prediction program, and integrated database of several protein databases are installed on the master nodes of clusters and workstations. Ajou local BioGrid consists of a master node and eight worker nodes, and extending BioGrid composes a master node, and eight worker nodes of Ajou local BioGrid together with five nodes (ten worker nodes) of Yonsei University as worker node. Table 1 describes system specification of OS, CPU and memory on our BioGrid.

Data distribution scheme Basic motivations of our data distribution scheme are that protein structure prediction programs manipulate a huge amount of data [large number of protein sequences of amino acid or protein structures in a database like GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), SwissProt (http://us.expasy.org/sprot/) or PDB (http://www.rcsb.org/ pdb/)] and there is no data dependence among the protein sequences; the computation results are same regardless of the processing order of the protein sequences.

Fig. 1. Our BioGrid test-bed. (a) Ajou BioGrid. (b) Extending BioGrid. Schematic diagram of test-bed used in this experiment.

Therefore, we are able to speed up remarkably computations for protein structure prediction by distributing the protein sequence data among worker nodes on our BioGrid, and combining the results from the worker nodes after each one has processed its own part of protein sequences. It is common to process large numbers of protein sequences from many users or from a user requiring prediction of proteome consisting of many proteins. In this paper, we demonstrate improvement of calculation speed for protein secondary structure prediction on our BioGrid by distributing the proteins in a whole proteome. Following procedure presents the data distribution scheme used in this protocol: (1) Make the work plan file and gatekeeper file on our BioGrid. The work plan file is a script file to execute jobs on Grid, containing the number of jobs decided by the user, and the order of executions such as Seqsplit1 and Seqsplit2. Gate keeper is used for job management on our BioGrid.

3501

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Globus provides users with a set of components that implement basic services such as information service, resource management and security. Globus also provides these services as independent elements for users. Therefore, it could be possible for users to select, combine and control these services according to their own particular needs. In addition, this supports the multiform application program and paradigms (Foster and Kesselman, 2002). Nimrod, which is a program to control job distribution on several computers, is able to do management of distributed applications and usage of systems (Abramson et al., 1995). Based on Metacomputing Directory Service (MDS) in Globus, data—the results of distributed and processed jobs— from each computer is obtained. Thus we allocate jobs using PBS. Incorporating with MDS and PBS, Nimrod controls and monitors the distributed operation through the remote management. Owing to Nimrod, users confirm the allocation of jobs and control jobs.

S.Lee et al.

Table 1. System specification

Site

System

No. of nodes (Number of worker nodes)

OS

CPU

Memory

Ajou University BioGrid

Workstation Master Slave Master Slave

1 (1) 1 7 (7) 1 5 (10)

Linux Redhat 9.0 Linux Redhat 9.0 Linux Redhat 9.0 Linux Redhat 9.0 Linux Redhat 9.0

Intel Pentium 4 Intel Pentium 3 Intel Pentium 4 Dual Intel Pentium 3 Dual Intel Pentium 3

512 MB 512 MB 512 MB 1 GB 1 GB

Yonsei University Cluster

Table 2. Test data

Test sequence data

Protein sequence database

Ajou University BioGrid

One hundred of sequences obtained from PDB files on Eva Server (Eyrich et al., 2001), from December 2001 to January 2002

Non-redundant protein DB (GenBank of non-redundant CDS translation, PDB, Swiss-Prot and PIR) provided by NCBI (ftp://ftp.ncbi.nih.gov/genbank/)

Present BioGrid (Ajou BioGrid and Yensei Cluster)

H.pylori 51 proteome (1389 protein sequences) M.genitalium proteome (452 protein sequences) E. coli proteome (4279 protein sequences) S. cerevisiae proteome (6322 protein sequences)

(2) User’s sequences are converted into index file, length file and sequence file by Seqsplit1. The index file for reducing the sequence size consists of index number and sequence name. The length file consists of index number and sequence length. The sequence file consists of index number and sequence. (3) For making the group files, sequences in the sequence file are sorted by the length file. Then, sorted sequences are divided into the same number of group files as the executable nodes by considering the length and number of sequences so that all group files have a similar length and number of sequences. (4) The group files are distributed among worker nodes on BioGrid. Also, we can monitor the state (executing or completed) of each group file by Nimrod. After allocating one group file on each worker node, the group file is split into protein sequences by Seqsplit2. The structure of each protein sequence is predicted by PSIPRED. The results of prediction are sent to the master node. The master node merges the results and provides a user with the final result. Work plan file is one type of script file to execute on Grid, and Gate keeper is used for job management.

Genome scale prediction and data analysis All analyzed genome sequences were Mycoplasma genitalium, Escherichia coli K12 from NCBI (http://www.ncbi.nih.gov/),

3502

Helicobacter pylori 51 from The Center for Functional Analysis of Human Genome in Korea (unpublished data) and Saccharomyces cerevisiae from SGD (http://www.wormbase.org/). Each prediction result was converted by our secondary topology scoring then clustered with five seeds by K-means procedure served as FASTCLUS in SAS version 8 (http://www.sas.com).

RESULTS AND DISCUSSION We performed two test sets of genome sequences to calculate the efficiency and accuracy on a genome scale at Ajou BioGrid and local BioGrid platform. When we performed the test, no other user(s) program was executed on our BioGrid. Table 2 shows the test dataset for calculation. First, as shown in Figure 1a, we evaluated the performance of the prediction system using data distribution scheme on Ajou BioGrid. We obtain approximately seven times speedup using eight worker nodes as shown in Figure 2a. We were unable to obtain the eight times speed up that is the ideal case. The reason might be as follows: (1) the communication overhead per node increases when Nimrod allocates the work to each node and (2) the Nimrod does not schedule all nodes simultaneously at the initial time to find available nodes, which increases the schedule time. Although the improvement in speed is slightly less than ideal, the proposed data distribution system is able to

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Test-bed

Prediction of protein fold space by Grid computing

9000

(a)

7 20 sequences

20 sequences 40 sequences

7000

60 sequences

6000

80 sequences

40 sequences

6

60 sequences 5

80 sequences 100 sequences

100 sequences

Speed Up

Processing Time(sec)

8000

5000 4000

4

3

3000

2 2000

1

1000 0

0 1

2

4

8

1

Number of Nodes

8 Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Mycoplasma genitalium : 452 Protein Sequences

Mycoplasma genitalium : 452 Protein Sequences

35

12 Helicobacter pylori 51 : 1389 Protein Sequences

30

10

Helicobacter pylori 51 : 1389 Protein Sequences

25

Speed Up

Processing Time (hour)

4

14

40

(b)

2

N umber of Nodes

20

8

6

15 4

10 2

5 0

0

1

6

12

18

Number of Nodes

1

6

12

18

The Number of Nodes

Fig. 2. Comparison of the processing time and speed up in accordance with the number of nodes and with the number of user sequences (a) and two proteomes (b). As the number of nodes increases total processing time decreases.

analyze and predict the secondary structure not only for each protein but also for an entire proteome within a short time. Second, we tested the prediction system using data distribution scheme on the present BioGrid shown in Figure 1b. It is possible for biochemists to predict protein secondary structures of proteome in a reasonable time using data distribution

scheme on BioGrid. Figure 2b shows that our BioGrid using data distribution scheme was able to process H. pylori 51 proteome (1389 protein sequences) within 4 h and 50 min and M.genitalium proteome (452 protein sequences) within 2 h and 20 min. If it were not for BioGrid, structure prediction of proteins in these two proteomes would require more than two days and one day, respectively.

3503

S.Lee et al.

..MKKLALILFMGTLVSFYADA.. ..CCHHHHHHHHHHHHEEEECC..

….1

2….

Log 3 (..12..) = Top. Score

Prediction result

SS into Number

Topology Score

If number is 0 then the topology score is –1.

Recent reports proposed that a protein fold universe or fold space could be derived from secondary structural information (Hou et al., 2003; Shindyalov and Bourne, 2000; Holm and Sander, 1996; Burley and Bonanno, 2002) and secondary structure has been employed as a useful tool for genome comparison (Gerstein and Hegyi, 1998). Hou and coworkers showed that folding space should represent three-dimensional (3D) space because it has a ‘shape’ (Hou et al., 2003). This gives us an insight about protein folding space of all genomes. However, this method was derived from the known structures of proteins. On the hypothesis that the secondary structural topology of protein gives baseline or determines the protein fold (Di Francesco et al., 1997; Przytycka et al., 1999; McGuffin et al., 2001) and in order to extend the concept of protein fold space to whole genome sequence, genome scale prediction was performed using the proposed data distribution scheme on the present BioGrid. Test data were taken from model organisms of M.genitalium, H.pylori 51, E.coli and S.cerevisiae—three bacteria and one eukaryote. To derive numerical data from predicted secondary structural topology, we devised a topology score. The linear distribution of secondary structural elements is encoded into a numerical score. Topology score for each protein is assigned as described in Figure 3. Seven or more consecutive H (helical) segments is assigned 1 while four or more consecutive E (extended) segments is assigned 2 and these numbers take each place to represent the linear location of each element. Then the logarithm of this code is used to determine the topology score. In case of code 0, the topology score is assigned −1 and removed as it 3504

Fig. 4. Real fold space. The PDB deposited sequences are analyzed by DSSP and scored by topology score, and changed into graphical form by OriginPro version 7.0 (http://www.originlab.com). Axes Helix and Extended are the proportion of H and E to the total number of residues, respectively. The correlation between proportion of extended and topology score of a protein is depicted by open diamonds, while between proportion of helix and topology score is depicted by X. Helix–extended correlation is represented by open triangles, and is drawn in detail at the bottom panel.

designates unstructured protein (1, 4, 15, 29 proteins for each genome, respectively). In this manner, each predicted result from PSIPRED is assigned a topology score. To make a standard fold space, DSSP assignment of PDBFINDER (Hooft et al., 1996) was used. PDBFINDER has the DSSP analysis result of every PDB structure. Sequences with repeated PDB ID or topology score −1 were excluded. The results of DSSP were converted by topology score and then made into a graph as shown in Figure 4. The highest score is under 80 and located between helix 0.2–0.4 and extended 0.1–0.2.

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Fig. 3. Example of topology scoring scheme. From secondary structure prediction results, a ‘converted number’ is generated for each sequence; groups of seven or more consecutive H is assigned 1 and groups of four or more consecutive E is assigned 2 as their positions in a sequence are converted into numerical unit. To avoid numerical problem of the logarithm of zero, the topology score of sequence with converted number 0 is assigned as −1 and then omitted from the next step as an unstructured protein. The logarithm to base 3 is taken as of this converted number, and the final result is used as topology score.

Prediction of protein fold space by Grid computing

(a)

(b) 60

60

50

score

50 40

Topology score

Topology

M

M

30 20

M M

10 M

0

0.5

0.0

0.4 0.2

0.2 0.1

0.8

d de ten Ex

0.0

1.0

30 20 10 0

0.5

0.0

0.4 0.2

0.3

0.4

He 0.6 lix

40

0.3 d e 0.2 nd e

0.4

He 0.6 li x 0.8

t

0.1 Ex 1.0

0.0

0.5

0.3

0.2

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Extended

0.4

0.1

0.0 0.0

0.2

0.4

0.6

0.8

1.0

Helix

Fig. 5. Fold spaces of four genomes. (a) for M.genitalium, (b) for H.pylori 51, (c) for E.coli and (d) for S.cerevisiae. Each result was clustered by FASTCLUS in SAS. Population of each cluster was ranked and changed into graphical form as shown this in Figure. Though each genome is only distantly related to the others in phylogeny, their fold spaces show similar distribution. Color scale shows the population order of each cluster (from the most populated—black, red, green, blue, cyan—to the least) and each mean of clusters is described as ‘M’. Bottom panel shows Helix–Extended plane of the fold space.

3505

S.Lee et al.

Table 3. Proteome topology score results

Cluster 1 2 3 4 5 Scored proteins score (−1) Total (predicted)

M.genitalium Mean score Proteins 3.6 10.0 18.0 29.9 38.8

74 174 146 49 8 451 1 452

% 16.4 38.6 32.4 10.9 1.8 100.0

H.pylori 51 Mean score 4.6 13.6 24.6 38.9 50.0

Proteins

%

447 665 44 4 225 1385 4 1389

32.3 48.0 3.2 0.3 16.2 100.0

E.coli Mean score 4.6 13.2 23.7 36.9 50.0

Proteins

%

1394 2081 17 651 119 4262 15 4279

32.7 48.8 0.4 15.3 2.8 100.0

S.cerevisiae Mean score 4.7 14.8 28.5 41.5 59.5

Proteins

%

2612 2612 907 153 9 6293 29 6322

41.5 41.5 14.4 2.4 0.1 100.0

The most populated cluster for each proteome is written in boldface.

CONCLUSION AND FUTURE WORK Efficient data management for a vast amount of protein sequences was successfully performed to predict secondary structure by a data distribution scheme on our BioGrid platform. According to the proposed scheme, our test-bed was constructed and the performance was examined. The results showed that the calculation speed increased proportionally to the number of worker nodes participating in our BioGrid. As an example of a massive dataset in the bioscience research area, our BioGrid approach is very powerful in handling all protein sequences in a whole genome. From the comparison of genome structures determined from this analysis, the fold space information for a given genome could be extracted, and these data gave us an insight of ‘Fold space’. As a future work, a new genome or organism will be analyzed to standardize the fold space. To manage larger proteome data more

3506

accurately, we need to extend our BioGrid test-bed and to improve the proposed system by utilizing the new scheme for optimal resource allocations.

ACKNOWLEDGEMENTS We thank Se-Jung Kook for helpful discussion about clustering. This work was supported by the NRL program of MOST NRDP (M1-0203-00-0020) and Protein Network Research Center at Yonsei University (W.L.) and by a grant of the International Mobile Telecommunications 2000 R&D Project, Ministry of Information & Communication, Korea and by the Ministry of Science and Technology of Korea/the Korea Science and Engineering Foundation.

REFERENCES Abramson,D., Sosic,R., Giddy,J. and Hall,B. (1995) Nimrod: a tool for performing parameterized simulations using distributed workstations. The 4th IEEE Symposium on High Performance Distributed Computing (HPDC-95), pp. 520–528. Burley,S.K. and Bonanno,J.B. (2002) Structuring the universe of proteins. Annu. Rev. Genomics Hum. Genet., 3, 243–262. Di Francesco,V., Garnier,J. and Munson,P.J. (1997) Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins. J. Mol. Biol., 267, 446–463. Eyrich,V.A., Marti-Renom,M.A., Przybylski,D., Madhusudhan,M.S., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243. Foster,I. and Kesselman,C. (1998) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, Elsevier Inc. Foster,I. and Kesselman,C. (2002) The Globus Project: A status Report. Proceedings of the 7th Heterogeneous Computing Workshop, March 30, 1998, Orlando, FL, USA. Foster,I., Kesselman,C. and Tuecke,S. (2001) The anatomy of the grid: enabling scalable virtual organizations. Int. J. High Performance Comput. Appl., 15, 200–222. Gerstein,M. and Hegyi,H. (1998) Comparing genomes in terms of protein structure: surveys of a finite parts list.FEMS Microb. Rev., 22, 277–304.

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

Each genome data showed an aligned distribution and the distribution has a unified structure through genomes. Threedimensional distribution map shows each genome has nearly pyramidal distribution. The result of each genome was then treated by Clustering method (K-means procedure). Each genome has five clusters. The cluster means of each genome are almost in the same region from the XY (Helix–Extended) plane panel and are in a perpendicular line of 3D space. The highest score of each genome is under 60, so the score of genome is in the region of PDB space (Fig. 4) that can be regarded as a boundary space. Interestingly, the population of each cluster is different among genomes (Table 3). In Figure 5, M.genitalium, H.pylori 51 and E.coli show that the most populated cluster is at topology score 10–20, meanwhile S.cerevisiae shows its most populated cluster under 10. We are not sure whether or not this is related to species or phylogenetic tree. To resolve this issue further investigation might be necessary; however, this significant difference could be assessed with more genome sequence prediction. In addition, ‘real’ and ‘pseudo’ fold space model might be used to improve the accuracy of protein structure prediction.

Prediction of protein fold space by Grid computing

Henderson,R.L. (1995) Job scheduling under the portable batch system. LNCS 949, Springer-Verlag, Heidelberg, pp. 279–294. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. Hooft,R.W., Sander,C. and Vriend,G. (1996) The PDBFINDER database: a summary of PDB, DSSP and HSSP information with added value. Comput. Appl. Biosci., 12, 525–529. Hou,J., Sims,G.E., Zhang,C. and Kim,S.-H. (2003) A global representation of the protein fold space. Proc. Natl Acad. Sci., USA, 100, 2386–2390. Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices J. Mol. Biol., 292, 195–202.

McGuffin,L.J., Bryson,K. and Jones,J.T. (2001) What are the baselines for protein fold recognition? Bioinformatics, 17, 63–72. McGuffin,L. and Jones,J.T. (2002) Targeting novel folds for structural genomics. Prot. Struct. Funct. Genet., 48, 44–52. Przytycka,T., Aurora,R. and Rose,G.D. (1999) A protein taxonomy based on secondary structure.Nat. Struct. Biol., 6, 672–682. Rost,B. (2001) Review: protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204–218. Shindyalov,I.N. and Bourne,P.E. (2000) An alternative view of protein fold space. Prot. Struct. Funct. Genet., 38, 247–260.

Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 27, 2011

3507