DOI 10.1002/pmic.200600157
Proteomics 2006, 6, 5688–5693
SHORT COMMUNICATION
MASCOT HTML and XML parser: An implementation of a novel object model for protein identification data
Chunguang G. Yang1, Stephen J. Granite1, Jennifer E. Van Eyk1, 2 and Raimond L. Winslow1, 2
1 Center for Cardiovascular Bioinformatics and Modeling, The Institute for Computational Medicine and The Whitaker Biomedical Engineering Institute, The Johns Hopkins University, Baltimore, MD, USA
2 Department of Medicine, Division of Cardiology, The Johns Hopkins University, School of Medicine, Baltimore, MD, USA
Protein identification using MS is an important technique in proteomics as well as a major generator of proteomics data. We have designed the protein identification data object model (PDOM) and developed a parser based on this model to facilitate the analysis and storage of these data. The parser works with HTML or XML files saved or exported from MASCOT MS/MS ions search in peptide summary report or MASCOT PMF search in protein summary report. The program creates PDOM objects, eliminates redundancy in the input file, and has the capability to output any PDOM object to a relational database. This program facilitates additional analysis of MASCOT search results and aids the storage of protein identification information. The implementation is extensible and can serve as a template to develop parsers for other search engines. The parser can be used as a stand-alone application or can be driven by other Java programs. It is currently being used as the front end for a system that loads HTML and XML result files of MASCOT searches into a relational database. The source code is freely available at http://www.ccbm.jhu.edu and the program uses only free and open-source Java libraries.
Received: February 28, 2006 Revised: June 21, 2006 Accepted: July 10, 2006
Keywords: HTML parser / Java / MASCOT parser / Protein Identification Data Object Model / XML parser
Correspondence: Dr. Raimond L. Winslow, The Johns Hopkins University, Clark Hall Room 201B, 3400 North Charles Street, Baltimore, MD 21218–2686, USA
E-mail: [email protected]
Fax: +1-410-516-5294
Abbreviation: PDOM, protein identification data object model
© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com

Modern MS is an important and powerful technique for accurately measuring masses of proteins and their digested peptide products [1–4]. These masses, when searched against a protein sequence or nucleotide database using search engines such as MASCOT and SEQUEST, allow proteins to be identified quickly and accurately [5–10]. Protein identification based on MS data has advantages over other techniques in sensitivity, accuracy, speed, and throughput [11–13]. With advances in high-throughput techniques, protein identification can be done in batches [13–15]. As a result, the amount of protein identification data has been growing rapidly, as noted by Martens et al. [16]. Therefore, archiving, validating, analyzing, and mining these important data have become challenging tasks in many laboratories [16–20]. A computer program, known as a parser, is required to extract the data before any of these tasks can be accomplished. Several parsers or programs with parsing functionality have been created (Tables 1, 2). These programs are either expensive, inaccessible, or platform specific. In this paper, we present a MASCOT parser that is free, open source, portable, robust, and extensible.

As input file formats are crucial to the design and the implementation of a parser, we first describe the file formats used by the MASCOT search engine by following the data analysis steps of a typical MASCOT user. To do a database search using MASCOT, a user first chooses one of the three types of searches: MS/MS ions search, PMF, and sequence
tag search [6, 7, 10]. The user then sets parameters such as the enzyme used for the digest, mass error of the MS equipment, PTMs, etc. These parameters are submitted along with the mass peak list obtained from the MS measurement to the MASCOT search engine. MASCOT performs the search based on the above user input. After MASCOT finishes the search, it saves the search result and the submitted parameters and masses on the server in a text file (DAT format), and then returns to the user an HTML file for viewing. The user then examines the data and may choose to save the HTML file on their local computer for future inspection and analysis. Starting with version 2.1, the MASCOT server allows a user to save search results to a variety of XML file formats as well (http://www.matrixscience.com/). Therefore, the MASCOT search engine can store the protein identification data (including user submitted parameters and masses, and MASCOT search results) in three file formats: DAT, HTML, and XML. With respect to parsing, each format has its advantages and limitations which are outlined next.

Table 1. Parsers, toolkits, or converters for protein identification data

| Name | Language | Input file format | Availability |
| MASCOT parser a) | C++/Java/Perl | MASCOT DAT files | Commercial, on multiple platforms |
| MASCOT2XML b) | C++ | MASCOT DAT files | Open source, Windows and Linux |
| SEQUEST2XML b) | C++ | SEQUEST HTML files | Open source, Windows and Linux |
| Comet2XML b) | C++ | COMET cmt.tar.gz files | Open source, Windows and Linux |
| Mres2x c) | C | MASCOT DAT files | Open source |
| DBParser d) | Perl | MASCOT DAT files | Open source |

a) http://www.matrixscience.com/msparser.html
b) http://sashimi.sourceforge.net/software_tpp.html
c) http://www.protein-ms.de or http://sourceforge.net/projects/protms
d) http://www.proteomecommons.org/archive/1109121060785

Table 2. Programs with built-in parsers for protein identification data. Many commercial programs are not included in this Table due to the limited access to these software packages by the authors

| Name | Language | Input file format | Notes |
| MSQuant a) | VB.Net | MASCOT HTML files | Open source, Windows only |
| Phenyx Pack b) | - | MASCOT HTML files | Commercial |
| SCAFFOLD c) | Java | MASCOT DAT files, SEQUEST, X!Tandem | Commercial, Windows only |
| Protein results parser d) | Perl/Tk | MASCOT HTML files in MS/MS ion search report | Free to Academia, Windows |

a) http://msquant.sourceforge.net/
b) http://www.phenyx-ms.com/about/features.html
c) http://www.proteomesoftware.com/
d) http://chemfacilities.chem.indiana.edu/facilities/proteomics/parser/main.htm

The main drawback of the DAT file as an input source is its limited accessibility. Since DAT files reside on a MASCOT server, permission needs to be granted before DAT files can be retrieved by a user. For users accessing a public MASCOT
server or having no access to a MASCOT server, DAT files cannot be retrieved. Despite this limitation, there are already five parsers using DAT files as input (Tables 1, 2), primarily due to the relative simplicity of parsing DAT files.

The XML format is not yet supported by any program we know of, despite its advantages over HTML and DAT as a data carrier. Since XML is a file format created to carry any type of data, it is the easiest to parse among the three file formats. However, visualizing an XML file still requires transforming it to HTML format using Extensible Stylesheet Language Transformations (XSLT). This inconvenience is minor when compared with the advantages of using this file format. Hence, development of an XML parser is useful.

The HTML file from MASCOT searches can currently be parsed by only three programs (Table 2). Protein Results Parser does not provide source code. Phenyx Pack is a commercial product. MSQuant is not portable, since it is written in a platform-specific language. Therefore, an open-source cross-platform HTML parser is not yet available to the proteomics community. Compared with the XML format described above, HTML format has the advantage of presenting a user-friendly view of the data. Even though XML format will likely overtake HTML as the carrier of protein identification data in
the future, an HTML parser is still needed because the majority of the legacy, yet valuable, data are in HTML format and there are no means of converting these data into other formats. Furthermore, until all the search engines can export their results to XML format, HTML format will remain the primary carrier for protein identification data. Therefore, availability of an HTML parser would be useful to the proteomics community.

One critical issue in choosing an appropriate file format to parse is the amount of data contained in each file format. For HTML and XML files, the amount of data contained also depends on the values of the report parameters. Given the number of available parameters, all sample HTML/XML files referred to in the rest of this paper were exported using the default value for each report parameter. Further examination of sample files for each file format reveals that all three file formats contain essentially the same set of information (i.e., proteins, peptides, search parameters, submitted masses, etc.). DAT files also contain less significant protein matches that are not found in either HTML or XML files. None of the file formats includes protein properties (e.g., protein amino acid sequence, the pI value, the protein mass, and sequence coverage) because these are generated dynamically upon a user's viewing request. Two ways exist to obtain these protein properties: (i) a user or a computer program can retrieve these data from the MASCOT server by issuing a query; or (ii) they can be calculated using open-source packages such as Biojava (http://www.biojava.org). However, such functionality should be implemented by nonparsing modules to avoid adding extraneous data to the input source.
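As an illustration of the kind of nonparsing module mentioned above, the following minimal Java sketch derives sequence coverage from a protein sequence and a list of matched peptide sequences. It is not part of the published parser. The class name `SequenceCoverage`, the method `coverage`, and the definition used here (the fraction of residues covered by at least one exact occurrence of a matched peptide) are assumptions made for this example; the value reported by MASCOT may be computed differently.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative helper only; not part of the published parser. */
class SequenceCoverage {

    /**
     * Returns the fraction of residues in the protein that are covered by at
     * least one exact occurrence of a matched peptide (assumed definition).
     */
    static double coverage(String proteinSequence, List<String> peptides) {
        if (proteinSequence.isEmpty()) {
            return 0.0;
        }
        boolean[] covered = new boolean[proteinSequence.length()];
        for (String peptide : peptides) {
            if (peptide.isEmpty()) {
                continue;
            }
            // Mark every occurrence of the peptide, including overlapping ones.
            int start = proteinSequence.indexOf(peptide);
            while (start >= 0) {
                Arrays.fill(covered, start, start + peptide.length(), true);
                start = proteinSequence.indexOf(peptide, start + 1);
            }
        }
        int count = 0;
        for (boolean c : covered) {
            if (c) {
                count++;
            }
        }
        return (double) count / proteinSequence.length();
    }

    public static void main(String[] args) {
        // Toy sequences chosen for readability, not real proteins.
        System.out.println(coverage("ABCDEFGHIJ", Arrays.asList("BCD", "GH"))); // prints 0.5
    }
}
```

Keeping such a calculation in its own class means the parser output stays a faithful copy of the input file, while derived values can be added or replaced independently.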
The above discussion shows that an open-source XML and HTML parser for MASCOT does not currently exist. Parsers for other search engines are also sparse. Herein we present an open-source parser implemented in the Java programming language and based on the PDOM. It can parse XML and HTML files saved from MASCOT MS/MS ion search in peptide summary report or from MASCOT PMF search in protein summary report. Although our implementation targets MASCOT files specifically, it is designed to be extensible to other search engines, especially those that can only store results in HTML files. Since HTML, DAT, and XML formats may change in the future, our open-source approach should make the program more adaptive to future changes.

The extensibility of the parser is based on the proven object-oriented programming paradigm and the novel object model (PDOM) for describing protein identification data using Java objects (Fig. 1a). Figure 1b shows the class hierarchy of our parser, which can parse HTML and XML output from the MASCOT search engine. As shown in Fig. 1a, the PDOM consists of Protein, Peptide, Hit, Query, Modification, and Form objects with references from one object to another. The references are necessary to define the relationships between different objects. For example, a reference from a Protein object to a Hit object indicates that the Hit object contributes to the identification of this Protein. Since each search engine may use a different set of parameters to describe an object, we will use the MASCOT engine as an example to illustrate how objects should be defined and implemented.

Figure 1. (a) The object model for protein identification data consists of 6 objects: Protein, Peptide, Hit, Query, Modification, and Form, all of which except Form contain references to one another as exemplified by the arrows linking them. (b) The class hierarchy of the MASCOT parser. MASCOT is the base class, from which MASCOTMS2, MASCOTPMF, and MASCOTXML are derived. HTML tags and contents are parsed and extracted by the LexerMASCOT class and its derived classes LexerMASCOTMS2 and LexerMASCOTPMF, which are used by MASCOTMS2 and MASCOTPMF, respectively. The MASCOTXML class parses and extracts data from an XML file. The MASCOTFactory class provides the convenience for a user's program to instantiate an appropriately derived MASCOT class by giving the type and name of an input file.

The Protein object defines a matched protein with properties such as accession number, description, molecular weight, the number of peptide matches, references to homologous proteins, and a reference to a set of Hit objects, which will be described later. The Peptide object contains the amino acid sequence of a matched peptide. When PTMs are included in the matching (see Fig. 2 for an example), the mass reported by the MASCOT engine includes the mass contributed by the modifications. Even though the true peptide mass can be calculated, we choose not to add extraneous data to the input source. Therefore, this mass is not the mass of a Peptide object but the total mass of a Hit object. The Query object models a submitted mass using properties such as query number, mass value, and expected mass value. The Hit object defines a match between a submitted mass (modeled by a Query object) and an amino acid sequence (modeled by a Peptide object) plus any PTMs (modeled by the Modification object). The properties of a Hit object include the total mass of peptide and PTMs (discussed above), score, rank, delta value, and terminal amino acids. The position of the modified amino acid should be a property of a Hit object. However, it is not present in either HTML or XML files. Even though it can be calculated by knowing the exact nature of a Modification object, our parser does not implement this function. This property, however, is contained in DAT files and should be implemented by a DAT parser. For other search engines, the properties of a Hit object will likely be different from those used by MASCOT.
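The relationships described above can be made concrete with a minimal Java sketch. This is not the authors' code: the class name `PdomSketch`, all field names, and the constructor signatures are assumptions made for illustration, only a few of the listed properties are shown, and the numeric values and accession string in `main` are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative PDOM-style classes; names and fields are assumptions. */
class PdomSketch {

    /** A matched peptide: the amino acid sequence only, no mass. */
    static class Peptide {
        final String sequence;
        Peptide(String sequence) { this.sequence = sequence; }
    }

    /** A submitted mass. */
    static class Query {
        final int number;
        final double mass;
        final double expectedMass;
        Query(int number, double mass, double expectedMass) {
            this.number = number;
            this.mass = mass;
            this.expectedMass = expectedMass;
        }
    }

    /** One kind of PTM and how often it occurs in one match. */
    static class Modification {
        final String name;
        final int count;
        Modification(String name, int count) {
            this.name = name;
            this.count = count;
        }
    }

    /** A match between a Query and a Peptide plus any Modifications. */
    static class Hit {
        final Query query;
        final Peptide peptide;
        final double totalMass; // peptide plus PTMs, as reported in the file
        final double score;
        final List<Modification> modifications = new ArrayList<>();
        Hit(Query query, Peptide peptide, double totalMass, double score) {
            this.query = query;
            this.peptide = peptide;
            this.totalMass = totalMass;
            this.score = score;
        }
    }

    /** A matched protein that refers to the Hits supporting it. */
    static class Protein {
        final String accession;
        final List<Hit> hits = new ArrayList<>();
        Protein(String accession) { this.accession = accession; }

        /** Counts distinct Peptide instances among the hits (by identity). */
        int distinctPeptideCount() {
            Set<Peptide> seen = new HashSet<>();
            for (Hit hit : hits) {
                seen.add(hit.peptide);
            }
            return seen.size();
        }
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Peptide peptide = new Peptide("AVMDDFAAFVEK");
        Protein protein = new Protein("EXAMPLE_ACCESSION");
        protein.hits.add(new Hit(new Query(1, 680.67, 1359.33), peptide, 1359.33, 50.0));
        protein.hits.add(new Hit(new Query(2, 680.70, 1359.39), peptide, 1359.39, 42.0));
        System.out.println(protein.hits.size() + " hits, "
                + protein.distinctPeptideCount() + " distinct peptide(s)");
    }
}
```

Because a Hit holds references rather than copies, two hits that match the same sequence can share one Peptide instance, and the mass stays on the Hit, as the text requires.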
The Modification object defines one kind of PTM, such as carbamidomethylation. The properties of Modification include count (i.e., how many times this modification is found in one match event) and name, which likely differ among search engines. A Form object is used to store the search parameters, such as the database name and version, peptide mass error tolerance, etc.

Figure 2 shows one portion of an HTML file for a MASCOT MS/MS ion search displayed in a web browser, illustrating how data are parsed into PDOM objects. The Protein object in Fig. 2 contains the accession number (gi|6013427) of the top matching protein, its mass value (69,180 Da), the total score (134), and the number of peptides matched (12). The Query object in Fig. 2 models the submitted mass (680.67 Da) and the expected mass (1359.33 Da) calculated by MASCOT. Figure 2 also highlights a Peptide object whose sequence is AVMDDFAAFVEK. Note that one row of data in Fig. 2 is separated into one Query, one Hit, one Peptide, and zero or more Modification objects. Although not shown in this paper, the data from MASCOT PMF searches are parsed in a similar way.

Figure 2. An illustration of how MASCOT data can be parsed into objects. One example is shown for each object except the Form object, which contains the data in the hidden fields of the HTML result file. The color of the original text is removed for clarity.

The redundancy of data is obvious in Fig. 2: peptide RHPDYSVVLLLR appears twice in the figure, peptide AVMDDFAAFVEK appears three times, and modification Carbamidomethyl (C) shows up in three matches. Since Fig. 2 represents only a small part of a typical search result file, redundancy in an entire file is much greater. The implementation of PDOM is such that redundant data create only one instance of an object within the file scope. Using Fig. 2 as an example, a Peptide object is created at the first occurrence of the string RHPDYSVVLLLR in the result file. Subsequent occurrences of RHPDYSVVLLLR will not create new Peptide objects but will add references to the existing one. Chepanoske et al. [21] found that the quality of protein identification correlates directly with the number of nonredundant peptides. Therefore, the elimination of redundancy in our program not only saves computer memory, but also provides insight into the validity of the proteins identified. Furthermore, references among objects provide users additional information such as: (i) how many unique peptides are in the result file, (ii) how many hits a submitted mass matches to, and (iii) how many queries a given peptide matches to. This information is then available and can be utilized by the user or a third-party tool to further evaluate and validate the data.

When parsing of the result file is finished, objects are created and made available to the other parts of the program. This implementation insulates the remainder of the program from the parsing process. The advantage of such modular design will be discussed further in the implementation of the MASCOT parser.

To implement the parser, we chose Java as the programming language due to its cross-platform portability, its high-quality open-source libraries for HTML and XML processing, its uniform database connectivity through Java Database Connectivity (JDBC), and its object-oriented programming (OOP) model. Java development was done using Websphere Studio Application Developer 5.1.2 (IBM, San Jose, CA, USA) with code level compliance set to Java version 1.3. Interested users can use an integrated Java development environment such as Eclipse to examine, edit, or extend our code (http://www.eclipse.org).
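The single-instance behaviour described above for redundant peptides can be sketched with a map keyed by sequence. This is a minimal illustration, not the authors' implementation: the class `PeptideRegistry`, its method names, and the query numbers used in `main` are hypothetical, and modern Java collections are used rather than the Java 1.3 level of the original code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative registry that keeps one Peptide instance per sequence. */
class PeptideRegistry {

    static class Peptide {
        final String sequence;
        /** Numbers of the queries that matched this peptide. */
        final List<Integer> queryNumbers = new ArrayList<>();
        Peptide(String sequence) { this.sequence = sequence; }
    }

    private final Map<String, Peptide> bySequence = new LinkedHashMap<>();

    /**
     * Returns the existing Peptide for this sequence, or creates it on first
     * occurrence, and records a reference to the matching query.
     */
    Peptide intern(String sequence, int queryNumber) {
        Peptide peptide = bySequence.computeIfAbsent(sequence, Peptide::new);
        peptide.queryNumbers.add(queryNumber);
        return peptide;
    }

    /** Number of nonredundant peptides seen so far. */
    int uniquePeptideCount() {
        return bySequence.size();
    }

    public static void main(String[] args) {
        PeptideRegistry registry = new PeptideRegistry();
        // Hypothetical query numbers; the sequences are those named in the text.
        registry.intern("RHPDYSVVLLLR", 1);
        registry.intern("AVMDDFAAFVEK", 2);
        registry.intern("RHPDYSVVLLLR", 3);
        registry.intern("AVMDDFAAFVEK", 4);
        registry.intern("AVMDDFAAFVEK", 5);
        System.out.println(registry.uniquePeptideCount()); // prints 2
    }
}
```

With the references recorded this way, the number of unique peptides and the number of queries matched by a given peptide are available directly from the registry after parsing.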
The implementation of the parser for MASCOT result files starts with an abstract base class, MASCOT, which defines fields and methods shared by all derived classes. The fields of the MASCOT class include a Form object, a set of Protein objects, a set of Peptide objects, a set of Query objects, a set of Hit objects, and a set of Modification objects. MASCOT also implements the output function, which is reused by its derived classes through inheritance. The program can print all the information of each object to the standard output device (the screen by default). Output to a relational database is also implemented using JDBC. Because of the modular design of our program, the output routines are separated from the parsing of the input files. Therefore, new output formats can be added to fit the specifications of individual users. For example, output can be written to meet the requirements of a journal or to support new standards in proteomics, such as mzIdent [22, 23]. MASCOT also defines an abstract parse method, which is implemented by MASCOTPMF, MASCOTMS2, and MASCOTXML to coordinate the parsing process for their respective file formats.
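The design described above, an abstract base class that owns the parsed objects and the output step while subclasses supply only the format-specific parsing, can be sketched as follows. The names `ParserSkeleton`, `ResultParser`, `ResultWriter`, and `LineParser` are hypothetical, and the toy input (one accession per line) is not a MASCOT format; the sketch only shows how output routines stay independent of parsing.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative skeleton; class and method names are assumptions. */
class ParserSkeleton {

    /** Output routine, kept separate from parsing. */
    interface ResultWriter {
        String write(List<String> accessions);
    }

    /** Shared state and output; subclasses supply only the parsing. */
    abstract static class ResultParser {
        protected final List<String> accessions = new ArrayList<>();

        /** Format-specific parsing, implemented by each subclass. */
        abstract void parse(String content);

        /** Shared output step, inherited unchanged by every subclass. */
        String output(ResultWriter writer) {
            return writer.write(accessions);
        }
    }

    /** Toy subclass: one accession per non-empty line (not a MASCOT format). */
    static class LineParser extends ResultParser {
        @Override
        void parse(String content) {
            for (String line : content.split("\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    accessions.add(trimmed);
                }
            }
        }
    }

    public static void main(String[] args) {
        ResultParser parser = new LineParser();
        parser.parse("ACC_1\nACC_2\n");
        // Two interchangeable output formats over the same parsed objects.
        ResultWriter plain = list -> String.join("\n", list);
        ResultWriter csv = list -> String.join(",", list);
        System.out.println(parser.output(plain));
        System.out.println(parser.output(csv));
    }
}
```

A writer that targets a relational database through JDBC would be one more implementation of the same interface, so adding it would not touch any parsing code.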
Parsing of HTML files is a major part of the program since it is difficult to parse HTML files programmatically due to their flexible and nonrigorous grammar. This is demonstrated by the fact that users of certain database systems must prepare the data input manually. Parsing is performed by using LexerMASCOT and its derived classes (Fig. 1b). These classes are necessary, given the complexity of how data are embedded in the HTML files. In an HTML file, data are marked up by tags such as <TABLE>, <TD>, <A>, <PRE>, <FONT>, <TT>, etc., which tell the browser how data should be displayed. The HTML also uses JavaScript code to interact with the user and uses cascading style sheets (CSS) to provide styles to the data. There are other issues that also complicate parsing, such as multiple standards and unclosed tags. Therefore, dedicating classes (LexerMASCOT and its derived classes) to working with the HTML tags separates the parsing process from other parts of the program and makes the program more adaptive to changes in the HTML format.

The base class, LexerMASCOT, and its derived classes make use of an open-source Java library, HTMLParser (http://htmlparser.sourceforge.net/). The LexerMASCOT class provides utility methods used by derived classes and defines methods that need to be overridden by child classes. Methods that need to be redefined in derived classes include gotoNextHit, hasMoreHit, extractHit, extractModification, gotoTextNode, gotoTagNode, and extractProtein. As their names imply, LexerMASCOTMS2 is used for HTML files saved from the peptide summary report of the MS/MS ion search and LexerMASCOTPMF for HTML files saved from the protein summary report of the PMF search. The LexerMASCOTPMF and LexerMASCOTMS2 classes are used by MASCOTPMF and MASCOTMS2, respectively (Fig. 1b).

Parsing XML is straightforward since XML is a format created for carrying data in a structured and portable fashion.
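To illustrate why XML is comparatively easy to parse, the sketch below uses the event-driven SAX API that ships with the Java platform to collect peptide sequences from a small XML string. It is a hedged example, not the MASCOTXML class: the class name `SaxSketch`, the method `peptideSequences`, and the element name `pep_seq` are assumptions made here, and the real MASCOT XML schema should be consulted before relying on any element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative SAX usage; the element name "pep_seq" is an assumption. */
class SaxSketch {

    /** Collects the text content of every pep_seq element. */
    static class SequenceHandler extends DefaultHandler {
        final List<String> sequences = new ArrayList<>();
        private StringBuilder text; // non-null only while inside pep_seq

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("pep_seq".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("pep_seq".equals(qName)) {
                sequences.add(text.toString().trim());
                text = null;
            }
        }
    }

    static List<String> peptideSequences(String xml) {
        SequenceHandler handler = new SequenceHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.sequences;
    }

    public static void main(String[] args) {
        String xml = "<results>"
                + "<peptide><pep_seq>AVMDDFAAFVEK</pep_seq></peptide>"
                + "<peptide><pep_seq>RHPDYSVVLLLR</pep_seq></peptide>"
                + "</results>";
        System.out.println(peptideSequences(xml));
    }
}
```

The handler never inspects layout markup, scripts, or unclosed tags, which is the practical difference from the HTML case described above.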
Several Java libraries have been developed for the processing of XML files [24]. We employed the Simple API for XML (SAX) to parse MASCOT XML files. The MASCOTXML class does the parsing on its own; no LexerMASCOT class is needed (Fig. 1b). The MASCOTXML class inherits all other functionality from the base class MASCOT and only needs to override the parse method.

A factory class, MASCOTFactory, is provided to instantiate the proper derived MASCOT class given the type of input file and to hide the differences among local files, network files, and input streams passed in by a web server. This class helps the integration of this program with other Java applications running in stand-alone mode or acting as a part of a server. Our program has been used in PROTEIN-DB2 to parse MASCOT HTML and XML files and to write the parsing results to this primary proteomics database. This program only requires free and open-source software including JDK 1.4, HTMLParser, and an XML parser. If the PDOM objects need to be written to a relational database, one can use PostgreSQL, an open-source database which we have tested. A graphical user interface implementation that integrates the program has been provided in the Supplementary Material for this paper. The program and its source code are freely available at http://www.ccbm.jhu.edu.

In conclusion, we presented an object model for protein identification data, the PDOM. We also implemented a parser based on this model to parse MASCOT HTML and XML result files. Our parser program is open-source and written in Java. It can be referenced by other Java programs or used stand-alone and can be extended to parse other file formats generated by MASCOT or other search engines. Given the framework provided by our implementation, more file formats and search engines can be added so that researchers can focus on other aspects of working with protein identification data.
The authors gratefully acknowledge the financial support provided by the National Heart, Lung, and Blood Institute through grant NHLBI NO1-HV-28180, the Falk Medical Trust, and the Donald W. Reynolds Foundation. The authors also thank Rebekah Gundry and Dawn Chen for providing sample files for testing.
References
[1] Tanaka, K., Waki, H., Ido, Y., Akita, S. et al., Rapid Commun. Mass Spectrom. 1988, 2, 151–153.
[2] Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F., Whitehouse, C. M., Science 1989, 246, 64–71.
[3] Mann, M., Hoejrup, P., Roepstorff, P., Biol. Mass Spectrom. 1993, 22, 338–345.
[4] Traini, M., Gooley, A. A., Ou, K., Wilkins, M. R. et al., Electrophoresis 1998, 19, 1941–1949.
[5] Perkins, D. N., Pappin, D. J. C., Creasy, D. M., Cottrell, J. S., Electrophoresis 1999, 20, 3551–3567.
[6] Yates, J. R., III, Electrophoresis 1998, 19, 893–900.
[7] Pappin, D. J. C., Hojrup, P., Bleasby, A. J., Current Biol. 1993, 3, 327–332.
[8] Boutilier, K., Ross, M., Podtelejnikov, A. V., Orsi, C. et al., Anal. Chim. Acta 2005, 534, 11–20.
[9] Mortz, E., O’Connor, P. B., Roepstorff, P., Kelleher, N. L. et al., Proc. Natl. Acad. Sci. USA 1996, 93, 8264–8267.
[10] Wilkins, M. R., Ou, K., Appel, R. D., Sanchez, J.-C. et al., Biochem. Biophys. Res. Commun. 1996, 221, 609–613.
[11] O’Farrell, P. H., J. Biol. Chem. 1975, 250, 4007–4021.
[12] Engvall, E., Perlmann, P., J. Immunol. 1972, 109, 129–135.
[13] Gevaert, K., Vandekerckhove, J., Electrophoresis 2000, 21, 1145–1154.
[14] Shen, Y., Tolic, N., Zhao, R., Pasa-Tolic, L. et al., Anal. Chem. 2001, 73, 3011–3021.
[15] Wurzel, C., Wittmann-Liebold, B., Biotecnol. Apl. 2000, 17, 117.
[16] Martens, L., Hermjakob, H., Jones, P., Adamski, M. et al., Proteomics 2005, 5, 3501–3505.
[17] Ferry-Dumazet, H., Houel, G., Montalent, P., Moreau, L. et al., Proteomics 2005, 5, 2069–2081.
[18] Hodges, P. E., Payne, W. E., Garrels, J. I., Nucleic Acids Res. 1998, 26, 68–72.
[19] Wilke, A., Ruckert, C., Bartels, D., Dondrup, M. et al., J. Biotechnol. 2003, 106, 147–156.
[20] Weatherly, D. B., Atwood, J. A., III, Minning, T. A., Cavola, C. et al., Mol. Cell. Proteomics 2005, 4, 762–772.
[21] Chepanoske, C. L., Richardson, B. E., von Rechenberg, M., Peltier, J. M., Rapid Commun. Mass Spectrom. 2005, 19, 9–14.
[22] Orchard, S., Hermjakob, H., Binz, P.-A., Hoogland, C. et al., Proteomics 2005, 5, 337–339.
[23] Orchard, S., Hermjakob, H., Taylor, C. F., Potthast, F. et al., Proteomics 2005, 5, 3552–3555.
[24] Harold, E. R., Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX, Addison-Wesley Professional, Boston 2002, p. 1120.