Philip V. Toukach and Ksenia S. Egorova

Carbohydrate Structure Database merged from bacterial, archaeal, plant and fungal parts

SUPPLEMENTARY DATA

Homepage

Figure S1. CSDB main menu and homepage. Operations in the Maintenance section are password-protected.

CSDB linear encoding language

Carbohydrates are characterized by a multitude of monomers bonded into complex structures by linkages of various types. The monomers are subject to various chemical modifications (e.g. methylation, acetylation) and may differ in anomeric, absolute, or ring size configuration. Therefore, the standardized description of carbohydrate structures is a difficult problem. The CSDB linear encoding language has been developed within the CSDB project and is based on a text-encoded tree, in which the structure is encoded in a single line and monomer names originate from a controlled vocabulary. The language can be used to describe oligomeric or polymeric structures composed of residues linked via glycosidic, amide, diester, and other linkages. One of its main advantages is human readability, which allows easy structure verification; its main limitation is the inability to encode repeating and non-repeating parts within a single structure. Knowledge of the CSDB linear encoding language is required for using the expert query form and for submitting data to the database. The major features of this encoding are listed in Fig. S2. Detailed information is available in the CSDB ‘Help’ under the ‘Structure encoding’ section (http://csdb.glycoscience.ru/database/core/help.php?db=database&topic=rules) and in the chapter in “Glycoinformatics” (1).
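To illustrate the idea of a single-line, text-encoded structure, the following is a minimal Python sketch that splits an unbranched chain into residue names and linkage positions. It handles only the residue(position-position)residue pattern seen in the fragment D-QuipNAc4N(1-?)HEX quoted in Fig. S5. The function name `split_linear_chain` and the regular expression are assumptions made for illustration; branching, repeats, probabilities, and the other features of the actual CSDB grammar are not covered.

```python
import re
from typing import List, Tuple

# Hypothetical pattern for a linkage token such as "(1-4)" or "(1-?)".
# "?" stands for an unknown position, as in the fragment quoted in Fig. S5.
LINKAGE = re.compile(r"\(([0-9?]+)-([0-9?]+)\)")


def split_linear_chain(code: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split an unbranched chain into residue names and linkage positions.

    This is a toy illustration, not the full CSDB grammar.
    """
    # re.split keeps the two captured positions between the residue names:
    # [residue, from_pos, to_pos, residue, from_pos, to_pos, residue, ...]
    parts = LINKAGE.split(code)
    residues = parts[0::3]
    linkages = list(zip(parts[1::3], parts[2::3]))
    if any(not name for name in residues):
        raise ValueError(f"empty residue name in {code!r}")
    return residues, linkages


if __name__ == "__main__":
    print(split_linear_chain("D-QuipNAc4N(1-?)HEX"))
```

Keeping the linkage positions as strings lets the unknown position "?" pass through unchanged instead of forcing it into a number.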

Figure S2. Major features of the CSDB linear structure encoding language. (A) Residue name components (the obligatory ones are shaded). (B) Example of topology and linkage encoding: residues A and B form the polymer backbone; residues A and C are branching points, whereas residues E, D, and G are terminal; residue G forms a dual linkage with residue F. (C) Example of encoding of an undetermined or uncertain structure: an unknown hexose is linked to an unknown position of either residue D or residue E, which forms the 1–4 bond with residue A. In 25% of the molecules, residue A is (1–6)-substituted by residue B, which is partially (in an unknown part of the molecules) substituted at position 2 by an unknown position of residue C; residue A is substituted by alkyl at position 2 or by acyl at position 3 or by both of them. Reprinted with permission from (2) (copyright 2011 American Chemical Society).

User operations in an exemplary complex query

In Figs. S3-S11, red arrows indicate user interface items changed from their default state. For the logic of the query, see Fig. 3 in the main text.

Figure S3. Glycan builder with D-Quip4NAc entered and displayed in the CFG format.

Figure S4. Structure search form after returning from GlycanBuilder. Restrictions on class and domain were specified.

Figure S5. Structure wizard with the D-QuipNAc4N(1-?)HEX fragment entered. As the bacillosamine residue is not widespread, the COMPLETE LIST option was used to select it.

Figure S6. Structure search form after returning from the Structure wizard. Restrictions on class and domain were applied. The scope “OR” was selected.

Figure S7. Structure search form with the term edited after copying from the previous query. The scope “AND NOT” was selected.

Figure S8. (A) Composition search form with the selected partial composition of one amino acid and three hexoses. “Complete composition” was unchecked. The scope “AND” was selected. (B) Part of the results from the query in Fig. S8A. The data are arranged by compounds.
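The difference between a partial and a complete composition match can be sketched in Python as follows. This is an illustration only: the function `matches_composition`, the residue class labels, and the interpretation of a partial match as "at least the requested counts" are assumptions, not a description of the CSDB implementation.

```python
from collections import Counter


def matches_composition(structure: Counter, query: Counter,
                        complete: bool = False) -> bool:
    """Check a structure's residue counts against a queried composition.

    complete=True  : the structure must contain exactly the queried residues.
    complete=False : the structure must contain at least the queried residues
                     (assumed meaning of an unchecked "Complete composition").
    """
    if complete:
        return structure == query
    return all(structure[name] >= count for name, count in query.items())


if __name__ == "__main__":
    # Hypothetical structure: one amino acid, three hexoses, one pentose.
    structure = Counter({"amino acid": 1, "hexose": 3, "pentose": 1})
    query = Counter({"amino acid": 1, "hexose": 3})
    print(matches_composition(structure, query))                 # partial match
    print(matches_composition(structure, query, complete=True))  # exact match
```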

Figure S9. (A) NMR signal search form with two carbon chemical shifts specified. “Signals in the same residue” was unchecked. The scope “AND” was selected. (B) Part of the results from the query in Fig. S9A. The data are arranged by compounds.
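The effect of the “Signals in the same residue” option can be sketched in Python as follows. Everything here is hypothetical: the function `has_signals`, the tolerance of 0.5 ppm, the residue labels, and the chemical shift values are made up for illustration and do not reflect the CSDB implementation or data.

```python
from typing import Dict, List


def has_signals(spectrum: Dict[str, List[float]], shifts: List[float],
                tol: float = 0.5, same_residue: bool = False) -> bool:
    """Check whether every queried chemical shift is present within `tol` ppm.

    spectrum maps a residue label to the carbon shifts assigned to it.
    With same_residue=True all queried shifts must occur in one residue;
    otherwise they may be spread over the whole structure.
    """
    def found(shift: float, pool: List[float]) -> bool:
        return any(abs(shift - signal) <= tol for signal in pool)

    if same_residue:
        return any(all(found(q, pool) for q in shifts)
                   for pool in spectrum.values())
    all_signals = [s for pool in spectrum.values() for s in pool]
    return all(found(q, all_signals) for q in shifts)


if __name__ == "__main__":
    # Hypothetical assignment table with two residues.
    spectrum = {"A": [101.2, 72.0], "B": [55.8, 68.4]}
    print(has_signals(spectrum, [101.0, 56.0]))
    print(has_signals(spectrum, [101.0, 56.0], same_residue=True))
```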

Figure S10. (A) Bibliography search form with the terms, journal, and year span specified. A group of two terms (“azo dyes”) is combined with a term with wildcard (pollutant*) using the AND operation. The scope “AND” was selected. (B) Part of the results from the query in Fig. S10A. The data are arranged by publications. The arrow points to a record of interest.
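The combination of a grouped term with a wildcard term under AND can be sketched in Python as follows. The helper names `term_to_regex` and `all_terms_match` are hypothetical, and treating the group “azo dyes” as an exact phrase and the asterisk as "any word ending" are assumptions about the query semantics rather than documented CSDB behavior.

```python
import re
from typing import List


def term_to_regex(term: str) -> "re.Pattern[str]":
    """Turn a search term into a case-insensitive whole-word pattern.

    An asterisk is assumed to stand for any run of word characters.
    """
    escaped = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


def all_terms_match(text: str, terms: List[str]) -> bool:
    """AND operation: every term must occur somewhere in the text."""
    return all(term_to_regex(term).search(text) for term in terms)


if __name__ == "__main__":
    terms = ["azo dyes", "pollutant*"]
    # Hypothetical titles used only to exercise the sketch.
    print(all_terms_match("Degradation of azo dyes and other pollutants", terms))
    print(all_terms_match("Azo dyes in textiles", terms))
```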

Figure S11. Data for the record with persistent CSDB ID 22684. Some parts of the screenshot were omitted for clarity.

Figure S12. Phenetic tree built from the results of clustering of the genera most represented in CSDB. The outer rim reflects the taxonomic group of each genus (see the legend). White bars stand for the number of glycan structures associated with organisms from each genus; grey bars stand for the number of these organisms present in CSDB. Ward’s minimum variance method (3) was chosen in the tool interface as the clustering method. Lengths of branches connected to leaves were ignored for visual clarity. Adapted with permission from (4).

REFERENCES

1. Toukach, P.V. and Egorova, K.S. (2015) In Lütteke, T. and Frank, M. (eds.), Glycoinformatics. Springer, New York, Vol. 1273, pp. 55-85.
2. Toukach, P.V. (2011) Bacterial carbohydrate structure database 3: principles and realization. J. Chem. Inf. Model., 51, 159-170.
3. Murtagh, F. and Legendre, P. (2014) Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? J. Classif., 31, 274-295.
4. Egorova, K.S., Kalinchuk, N.A., Knirel, Y.A. and Toukach, P.V. (2015) Carbohydrate Structure Database (CSDB): New Features. Izv. Akad. Nauk Ser. Khim., 5, 1205-1210.