, where peakID = n, cmass = mn, lmass and hmass, respectively, are the low- and high-mass bounds of the topologies that can be used to interpret this peak and are stored in topologySet, and topoReconstructionSet is a set containing information for deriving topologySet. Each member in sn.topoReconstructionSet is an object topoReconstruction = representing a set of topologies that use the same root (a monosaccharide class ∈ G) and choose their branches from branchSet (each member in branchSet contributes one branch). Each member in branchSet is a candidate set of a peak preceding the n-th peak. Basically, each topology in topoReconstruction.topologySet chooses one branch from the topologySet of each member in topoReconstruction.branchSet. A topology is represented by a structure , where mass is its theoretical mass, representation is a
P. Hong et al.: Efficient Algorithm for Accurate de novo Glycan Sequencing
(c) Train IonClassifier to learn fragmentation patterns Machine Learning
IonClassifier
Mass spectra of known glycans Precursor Ion Fuc HexNAc Hex Neu5Ac [Neu5Ac Hex] [Fuc] HexNAc [Neu5Ac] [Fuc HexNAc] Hex ••• ••• •••
[4 12 128 148 324 342 459] [4 12 71 89 265 285 459] [4 12 71 89 128 148 459]
4 7* 5
(b) Interpretation-Graph Precursor Ion 459 Peaks 265/285 B/C: Fuc Neu5Gc [4 12 265 285] B/C: Hex Neu5Ac [27 265 285] B/C: Neu5Ac Hex [71 89 265 285]
285 342
386 265
324 173 294 254
27
148 240 12
Peaks 4/12 B/C: Fuc [4, 12]
128
89 71
PeakInterpreter
4
[Neu5Ac(α2-3) Hex(β1-3)] [Fuc(α1-4)] GlcNAc
(a) Input mass spectrum Figure 2. GlycoDeNovo pipeline. (a) An example EED spectrum (SLA). Peaks marked by red circles are interpretable by PeakInterpreter to derive the interpretation-graph shown in (b). The numbers inside the interpretation-graph nodes are peak IDs. Each blue dot associated with a node is a topoReconstruction object that specifies how the corresponding peak can be interpreted by other peaks pointing to it. The node-pairs in dashed rounded rectangles are interpreted as B/C-ion pairs. The reconstructed topologies of three peaks (4/12, 265/285, precursor ion) are shown as examples in the rounded rectangle callouts along with their supporting peaks in brackets. This precursor ion has 14 candidate topologies. Based on the supporting peak count, three of them tie for the best and are listed in the call out. (c) Machine learning is applied to automatically learn an IonClassifier to distinguish B and C ions from other ion types. IonClassifier is then used to score topology candidates (see main text for details). In this example, it gives the highest score of 7 to the correct topology [Neu5Ac Hex] [Fuc] HexNAc
text string following the modified IUPAC condensed text nomenclature without linkage information, and supports contains peaks in the enriched peak list that can be interpreted as B- or Ctype ions and be generated from this topology.
Let S be the candidate pool containing all non-empty candidate sets. The current design of PeakInterpreter allows candidate topologies to have up to four branches at each branching
P. Hong et al.: Efficient Algorithm for Accurate de novo Glycan Sequencing
Algorithm I: S
PeakInterpreter( {
,
,…,
})
(1) (2) (3)
Initialize the candidate pool S = { }. for n = 1 to N Initialize the candidate set sn of the n-th peak: sn.cmass = mn, sn.lmass = mn , sn.hmass = mn + , sn.topoReconstructionSet = , sn.topologySet = . (4) for all possible combinations of up to 4 candidate sets sa, sb, sc, sd S (5) Calculate lm = sa.lmass + sb.lmass + sc.lmass + sd.lmass hm = sa.hmass + sb.hmass + sc.hmass + sd.hmass = mass difference caused by creating a B-ion (or the precursor ion if n = N) by linking sa, sb, sc, sd to a monosaccharide. (6) if g G s.t. (lm, hm) = (mn , mn + ) (g.mass + lm + , g.mass + hm + ) (7) Create a topoReconstruction object r = , and add r to sn.topoReconstructionSet. Set sn.lmass = min(sn.lmass, lm) and sn.hmass = max(sn.hmass, hm). (8) end (9) end (11) if sn.topoReconstructionSet , add sn to S, end (12) end
point, but this constraint can be tightened to allow a lower degree of branching if needed. PeakInterpreter maintains a candidate pool S. Each candidate serves as a potential b u i l d i n g b l o c k f o r in t e r p r e t i n g a h e a v i e r pe a k . PeakInterpreter starts from the lightest peak and tries to interpret every peak as a B ion, C ion, or the precursor ion by searching for all allowable combinations of building blocks in the candidate pool S (steps 4–9) that can be appended to a monosaccharide g to obtain a candidate set with mass within the accuracy range specified by τ. The mass difference δ in step 5 depends on the ion type and the glycan derivatization method employed (e.g., permethylation). We use the intensities of non-precursor peaks interpretable by PeakInterpreter to normalize the intensities of all peaks into z-scores [26]. We do not need to reconstruct the topologies (i.e., sn.topologySet) at this step. Topology reconstruction will be done later by calling CandidateSetReconstructor after PeakInterpreter terminates. Although PeakInterpreter does not have the accurate mass of each candidate topology that is yet to be reconstructed, the test performed at step 6 gives an estimate of the mass range tight enough to include all true positives, but it may also include a small number of false positives (i.e., topologies with masses outside of the accuracy range). Because each interpreted peak is still represented as one yet-to-be-reconstructed candidate set, the false positives will not increase the computational complexity, and they will be removed later by CandidateSetReconstructor.
Theorem The complexity of building an interpretation-graph is O(|G| × NH + 1), where G is the monosaccharide set, N is the
number of peaks in the given spectrum, and H ≤ 4 is the maximal branching number permitted. Proof The computation of PeakInterpreter mainly resides in the for-loop between steps 4 and 9 complexity of which is O(|G| × |S(n)|H), where S(n) is the value of the candidate pool S at the n-th loop and |S(n)| is the size of S(n) (i.e., the number of interpretable peaks up to the n-th loop). The overall complexity of PeakInterpreter is O(|G| × ∑n = 1N|S(n)|H). Since |S(n)| ≤ n, O(|G| × ∑n = 1N|S(n)|H) = O(|G| × ∑n = 1NnH) = O(|G| × NH + 1). Comment In practice, we found that most peaks cannot be interpreted so that |S(n)| is often much smaller than n. Therefore, the empirical complexity of PeakInterpreter has a small constant in O(|G| × NH + 1). After obtaining the interpretation-graph, GlycoDeNovo passes the candidate set object of the precursor ion into CandidateSetReconstructor to reconstruct all legal candidate topologies. CandidateSetReconstructor first checks if each topoReconstruction object r in the input candidate set s has been reconstructed. If not, it recursively calls itself to reconstruct all branches of r. Then CandidateSetReconstructor creates all legal topologies of r (steps 11–19), which are rooted at r.root and satisfy the mass accuracy constraint. At step 14, the branches are linked by their alphabetic order to r.root so that isomorphic topologies can be effectively detected and removed at step 16. The union operation at step 15 effectively and efficiently solves the problem of repeated counting of supporting peaks, which was a problem in GLYCH [19]. Finally, at step 19, the candidate topology set of r is added to that of s. CandidateSetReconstructor runs extremely fast, and its running time is negligible compared with that of PeakInterpreter.
P. Hong et al.: Efficient Algorithm for Accurate de novo Glycan Sequencing
Algorithm II: CandidateSetReconstructor( s ) (1) if s.topologySet (2) return // s has been reconstructed. (3) end (4) for each r s.topoReconstructionSet (5) if r.topologySet (6) continue // r has been reconstructed (7) end (8) for each branch r.branchSet (9) CandidateSetReconstructor( branch ) (10) end (11) for each of all possible branch combinations (a combination is formed by choosing one topology from the topologySet of each s r.branchSet) (12) Calculate tmass = total mass of the topology with the chosen branches linked to r.root. (13) if tmass (massLow, massHigh) (14) Create a topology by linking the chosen branches to r.root, let t.mass = tmass. (15) t.supports = {peakID} {peak supports of t branches}. (16) Add t to r.topologySet. (17) end (18) end (19) Add r.topologySet to s.topologySet. (20) end
One of the major differences between GlycoDeNovo and previous de novo approaches [19–21, 23, 27] is that it uses the mass range to confine the search space within the experimental mass accuracy window without reconstructing any topology during the peak interpretation process. GlycoDeNovo delays topology reconstruction until it finishes deriving the interpretation group of the precursor ion, and hence it only needs to reconstruct topologies that are required to interpret the precursor ion. In our experiments, since most of the partial topologies did not lead to precursor ions, this simple strategy dramatically reduced the computational time and space. GlycoDeNovo starts from the non-reducing end to incrementally build up interpretations of B and C ions because (1) glycosidic fragments are in general substantially more likely to be observed than cross-ring fragments; and (2) Y and Z ions provide redundant mass information to B and C ions, and even in cases where only Y and/or Z ions are observed at a cleavage site, their information is recaptured in the enriched peak list. This strategy is different from the one used by Mizuno et al. [16] and by Sun et al. [22] that start the reconstruction procedure from the reducing end. Growing topologies from the reducing end may run into difficulties when dealing with branching points where each of the branches contain more than one monosaccharide residue. In such a scenario, some of the reconstructed topologies can correspond to internal fragments, which are more likely to be missing in data, thus making it difficult to evaluate those topologies. To handle the problem of missing peaks, we made one modification to PeakInterpreter so that it will consider monosaccharide pairs in addition to individual monosaccharides at
step 6. Basically, for each possible ordered pair of monosaccharides [g1, g2] satisfying the mass accuracy constraint, we can expand the interpretation graph by (1) creating a topoReconstruction object r1 that links sa, sb, sc, and sd to g2 and then another topoReconstruction object r2 that links r1 to g1 or (2) for each s in {sa, sb, sc, sd}, creating a topoReconstruction object r1 that links s to g2 and then another topoReconstruction object r2 that link r1 ∪ ({sa, sb, sc, sd } – s) to g1. Obviously, allowing missing peaks greatly increases the search space. Therefore, we suggest turning this option on only when no topology can be found without considering missing cleavages. Biosynthetic rules (e.g., the chitobiose N-glycan core) can also be incorporated to constrain the search space of PeakInterpreter.
Scoring Topologies via Machine Learning Mass spectrometry data can be noisy. In addition, the presence of internal fragments can greatly complicate the de novo topology reconstruction process. PeakInterpreter may misinterpret some Y, Z, or O ions as B or C ions and generate ambiguities. Misinterpretation may lead to false topologies being ranked as high as or better than the correct topology based on the supporting peak count alone. To tackle this problem, we applied machine learning to build an IonClassifier for distinguishing B and C ions from other ion types (Figure 2c). IonClassifier takes a peak and its context, currently defined as the neighboring peaks within a predetermined mass-difference
P. Hong et al.: Efficient Algorithm for Accurate de novo Glycan Sequencing
Table 1. Glycan Standards Used in This Study Short Name
Formula
SLA
[Neu5Ac(α2-3) Gal(β1-3)] [Fuc(α1-4)] GlcNAc
SLX
[Neu5Ac(α2-3) Gal(β1-4)] [Fuc(α1-3)] GlcNAc
Lewis B
[Fuc(α1-2) Gal(β1-3)] [Fuc(α1-4)] GlcNAc
Lewis Y
[Fuc(α1-2) Gal(β1-4)] [Fuc(α1-3)] GlcNAc
LNT
Gal(β1-3) GlcNAc(β1-3) Gal(β1-4) Glc
LNnT
Gal(β1-4) GlcNAc(β1-3) Gal(β1-4) Glc
LNFP I
Fuc(α1-2) Gal(β1-3) GlcNAc(β1-3) Gal(β1-4) Glc
LNFP II
[Gal(β1-3)] [Fuc(α1-4)] GlcNAc(β1-3) Gal(β1-4) Glc
LNFP III
[Gal(β1-4)] [Fuc(α1-3)] GlcNAc(β1-3) Gal(β1-4) Glc
CelHex
Glc(β1-4) Glc(β1-4) Glc(β1-4) Glc(β1-4) Glc(β1-4) Glc
MalHex
Glc(α1-4) Glc(α1-4) Glc(α1-4) Glc(α1-4) Glc(α1-4) Glc
N002
[Neu5Ac(α2-3) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [Neu5Ac(α2-3) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N003
[Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N012
[Neu5Ac(α2-3) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [[Man(α1-3)] [Man(α1-6)] Man(α1-6)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N013
[Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [[Man(α1-3)] [Man(α1-6)] Man(α1-6)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N222
[Neu5Ac(α2-3) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] [Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N223
[Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] [Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] Man(β1-4) GlcNAc(β1-4) GlcNAc
N233
[Neu5Ac(α2-3) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] Man(β1-4) GlcNAc(β1-4) GlcNAc
NA2F
[Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] [Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] Man(β1-4) GlcNAc(β1-4) [Fuc(α1-6)] GlcNAc
A2F
[Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-3)] [Neu5Ac(α2-6) Gal(β1-4) GlcNAc(β1-2) Man(α1-6)] Man(β1-4) GlcNAc(β1-4) [Fuc(α1-6)] GlcNAc
Man9
[[Man(α1-2) Man(α1-6)] [Man(α1-2) Man(α1-3)] Man(α1-6)] [Man(α1-2) Man(α1-2) Man(α1-3)] Man(β1-4) GlcNAc(β1-4) GlcNAc
Structure (CFG with linkage placement notation)
P. Hong et al.: Efficient Algorithm for Accurate de novo Glycan Sequencing
window (e.g., 105 Da), and classifies the peak as +1 (i.e., a B or C ion) or –1 (i.e., a non-B or C ion). The neighboring peaks can be expressed as an array of contextual features (i.e., mass shifts from the peak of interest). The final score of a candidate topology is calculated by summing up the IonClassifier values of its supporting peaks. IonClassifier is trained by boosting [28] the decision tree classifier [29] on the experimental tandem mass spectra of a set of known glycans. For each glycan standard, we can match its theoretical spectrum to the experimental spectrum to collect the observed context of each theoretical peak found in the experimental spectrum. We grouped the supporting peaks of candidates into true B ions, true C ions, true Y ions, true Z ions, and O ions, and trained IonClassifier to distinguish true B,ions and true C ions from Y, Z, and O ions. If a supporting peak is interpreted by PeakInterpreter as a B ion, it will be validated by the B-ion classifier of IonClassifier. Similarly, if a supporting peak is interpreted by PeakInterpreter as a C ion, it will be validated by the C-ion classifier of IonClassifier.
Experimental Although GlycoDeNovo can handle glycans containing residue(s) with up to four branches, its performance was tested only on bifurcated structures due to the availability of glycan standards. The structures of glycans used in our study are listed in Table 1.
Materials Sialyl lewis A (SLA), sialyl lewis X (SLX), lewis B, lewis Y, lacto-N-tetraose (LNT), and lacto-N-neotetraose (LNnT) were purchased from Dextra Laboratories (Reading, UK). Lacto-N-fucopentaose (LNFP) I, II, and III were acquired from V-LABS, Inc. (Covington, LA, USA). Cellohexaose (CelHex), maltohexaose (MalHex), A2 F, and N A2F gly cans we re p urchased from Carbosynth Ltd. (Berkshire, UK). Synthetic N-linked glycan standards (N002 to N233) were obtained from Chemily Glycoscience (Atlanta, GA, USA). Man9 N-glycan, H218O (97%) water, 2-aminopyridine, acetic acid, dimethyl sulfoxide (DMSO), sodium hydroxide, methyl iodide, chloroform, sodium borodeuteride, and cesium acetate were purchased from Sigma-Aldrich (St. Louis, MO, USA). Pierce PepClean C18 spin columns were acquired from ThermoFisher Scientific.
deutero reduction, approximately 10 μg each of glycan standards were incubated with 0.5 M sodium borodeuteride in 0.2 M ammonium hydroxide solution for 2 h at room temperature while mixing, followed by drop-by-drop addition of acetic acid (10%) until bubbling stopped. The reaction mixture was dried down in a centrifugal evaporator. Excess borates were removed by repeated resuspension and drying of the samples in methanol. Permethylation was performed according to the method described previously [30, 31]. Briefly, the underivatized, 18O-labeled, or deutero-reduced glycan was suspended in 100 μL of DMSO/NaOH solution and gently vortexed for 1 h at room temperature. Methyl iodide (50 μL) was added to the reaction mixture and the reaction was allowed to proceed for another 1 h at room temperature in the dark. Additional NaOH/DMSO (100 μL) and methyl iodide (50 μL) were added together followed by 1 h of vortexing. This process was repeated up to five times to ensure complete methylation before the reaction was terminated by addition of 200 μL of chloroform and 200 μL of water. Permethylated glycans were extracted by liquid–liquid fractionation in water and chloroform, and desalted using PepClean C18 spin columns.
Mass Spectrometry Analysis Permethylated glycans were dissolved to a concentration of 2–5 μM in 50/50 (v/v) methanol/water solution that also contains 20–50 μM of sodium hydroxide or cesium acetate to produce sodium or cesium adducts of permethylated glycans. For electronic excitation dissociation (EED) analysis, each glycan sample was loaded onto a pulled glass capillary tip with a 1-μm orifice diameter and directly infused into a solariX hybrid Qh-Fourier transform ion cyclotron resonance (FTICR) mass spectrometer (Bruker Daltonics, Bremen, Germany) equipped with a hollow cathode dispenser. Sodiated or cesiated precursor ions were isolated by the quadrupole mass filter, externally accumulated in the collision cell, and fragmented in the ICR cell by irradiation of electrons for up to 1 s, with the cathode bias voltage set at –14 V and the ECD lens voltage at –13.95 V. Each transient was recorded at a 0.55-s length, and up to 40 transients were summed for improved S/N ratio. Peak picking and deconvolution were achieved with the DataAnalysis software (Bruker Daltonics), using the SNAP algorithm [32] with the quality factor threshold set at 0.01, S/N threshold set at 2. All tandem MS spectra were internally calibrated with several fragment ions assigned with high confidence to give a typical mass accuracy of