A machine learning pipeline for substructure detection in unknown mass spectra

Tobias Kind, Oliver Fiehn
UC Davis Genome Center, Metabolomics, Davis, CA

The challenge – thousands of unknowns
Concept of predictive data mining
The detection of unknown compounds with mass spectrometry in complex mixtures is a cumbersome and lengthy process. A single comprehensive GCxGC-MS chromatogram can contain up to 10,000 chromatographic peaks, each with an associated mass spectrum. The identification rates for unknown small molecules in such complex samples are usually less than 1%.
For biologically important substructures (amines, sugars, alkanes, amino sugars, alcohols, aromatic and aliphatic rings, steroids) we determined important statistical parameters such as sensitivity (false negative rate), specificity (false positive rate), and predictive accuracy (overall performance) using electron impact mass spectra of trimethylsilyl (TMS) derivatized compounds. Average misclassification rates between 7% and 16% were obtained. In our concept the five best performing algorithms vote on the presence or absence of a detected substructure, which is crucial for later obtaining a low number of structural isomers from deterministic and stochastic molecular isomer generators. The whole process is extensible to most possible molecular substructures and molecular classes. Successful models are automatically exported into computer languages such as C++ or Visual Basic, or as PMML code, for use in independent external programs.

References:
Varmuza K., Werther W.: Mass spectral classifiers for supporting systematic structure elucidation. J. Chem. Inf. Comput. Sci., 36, 323-333 (1996).
Stein S.E.: Chemical substructure identification by mass spectral library searching. J. Am. Soc. Mass Spectrom., 6, 644-655 (1995).
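The reported sensitivity, specificity, and predictive accuracy can be derived from simple confusion-matrix counts; a minimal sketch with made-up labels (not the poster's data):

```python
def confusion(y_true, y_pred):
    """Count true/false positives/negatives for binary labels (1 = substructure present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn)        # 1 - false negative rate
    specificity = tn / (tn + fp)        # 1 - false positive rate
    accuracy = (tp + tn) / len(y_true)  # overall predictive accuracy
    return sensitivity, specificity, accuracy

# toy example: 10 spectra, substructure truly present in the first 5
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
sens, spec, acc = metrics(y_true, y_pred)  # 0.8, 0.8, 0.8
```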
• Feature Selection: predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning.
• Model Training + Cross Validation: use only the important features and apply bootstrapping if only few datasets are available; use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, and kNN for prediction.
• Model Testing: calculate performance with percent disagreement and chi-square statistics.
• Model Deployment: deploy the model for unknown data; export as PMML, Visual Basic, or C++ code.

[Figure: Sorbitol with TMS groups]
[Figure: Percent (%) disagreement of model per classifier (MLP, NN-IPS, Naive Bayes, MARS, kNN, trees, GDA, probit, logit, SVM); lower is better]
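The Model Testing step scores classifiers by percent disagreement and chi-square statistics; a small illustrative sketch using the generic textbook formulas, not Statistica's implementation:

```python
def percent_disagreement(y_true, y_pred):
    """Percentage of cases where model and reference disagree (lower is better)."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return 100.0 * wrong / len(y_true)

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    obs = [[a, b], [c, d]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            exp = rows[i] * cols[j] / n  # expected count under independence
            stat += (obs[i][j] - exp) ** 2 / exp
    return stat

# one wrong prediction out of four -> 25 % disagreement
pd = percent_disagreement([1, 1, 0, 0], [1, 0, 0, 0])  # 25.0
```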
Mass spectral feature space (~800 features)

Mass Spectral Features:
• m/z value
• m/z intensity
• delta (m/z)
• delta (m/z) x intensity
• non-linear functions
• intensity series
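A minimal sketch of how such feature types might be derived from one centroided spectrum; the feature names and example peaks are illustrative, not the actual 800-feature definition:

```python
import math

def spectral_features(spectrum):
    """Derive simple features from a spectrum given as {m/z: intensity}."""
    features = {}
    peaks = sorted(spectrum.items())
    for mz, inten in peaks:
        features[f"I_{mz}"] = inten  # m/z intensity
    for (mz1, i1), (mz2, i2) in zip(peaks, peaks[1:]):
        features[f"delta_{mz2}_{mz1}"] = mz2 - mz1            # delta(m/z)
        features[f"deltaxint_{mz2}_{mz1}"] = (mz2 - mz1) * i2  # delta(m/z) x intensity
    base = max(spectrum.values())
    for mz, inten in peaks:
        features[f"logI_{mz}"] = math.log1p(inten / base)      # a non-linear function
    return features

# hypothetical TMS-typical fragment ions
feats = spectral_features({73: 100, 147: 40, 217: 25})
```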
Results
The algorithm is based on electron impact mass spectra from the NIST05 library, which were transformed into 800 mass spectral features (characteristic peak combinations and important peaks). Mass spectral features were calculated so that invariant m/z values could be removed. Substructure information was obtained via a substructure isomorphism matrix. The feature selection step prior to the actual classification, using partial least squares, removed invariant and insignificant features. Using an automated data mining workflow, all algorithms were trained and tested with cross validation. Finally, the five best performing algorithms vote on the presence or absence of a specific substructure in a given mass spectrum.
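The removal of invariant features can be sketched as a simple variance filter; this is a generic stand-in, not the exact procedure applied to the NIST05-derived features:

```python
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def drop_invariant(X, threshold=0.0):
    """Return indices of feature columns whose variance exceeds the threshold.
    X is a list of rows, one row of feature values per spectrum."""
    keep = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if variance(col) > threshold:
            keep.append(j)
    return keep

# column 0 is identical for every spectrum and would be dropped
X = [[100, 20, 50],
     [100, 20, 60],
     [100, 40, 20]]
kept = drop_invariant(X)  # [1, 2]
```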
Feature selection process with PLS

Occam's Razor: Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.
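PLS-based feature selection can be illustrated with the first-component weights of NIPALS PLS: features with large absolute weight co-vary most with the class label. A toy sketch, not the workflow's actual PLS code:

```python
def pls_first_component_weights(X, y):
    """Weights of the first PLS component (NIPALS): w proportional to
    X_centered^T * y_centered, normalized to unit length."""
    n, p = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = sum(y) / n
    w = [sum((X[i][j] - xm[j]) * (y[i] - ym) for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]

# toy data: feature 0 tracks the class label, feature 1 is noise
X = [[1.0, 0.3], [0.9, 0.1], [0.1, 0.2], [0.0, 0.4]]
y = [1, 1, 0, 0]
w = pls_first_component_weights(X, y)
# |w[0]| >> |w[1]|, so feature 0 would be selected
```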
Classification and machine learning methods

MS Feature matrix:

MS Spectrum   f1    f2   f3   f4   f5   ...  fn
MS1           100   20   50   60   0    ...  0
MS2           100   20   50   60   0    ...  20
MS3           100   20   60   50   0    ...  0
MS4           0     40   20   50   0    ...  40
MS5           0     40   20   50   0    ...  40
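Given such a feature matrix and a binary label per spectrum, a minimal nearest-neighbour classifier illustrates the prediction step; the labels below are invented for illustration:

```python
def one_nn(X_train, y_train, x):
    """Classify x with the label of its nearest training row (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(X_train)), key=lambda i: dist2(X_train[i], x))
    return y_train[best]

# rows MS1-MS5 of the feature matrix (f1..f5, fn)
X = [[100, 20, 50, 60, 0, 0],
     [100, 20, 50, 60, 0, 20],
     [100, 20, 60, 50, 0, 0],
     [0, 40, 20, 50, 0, 40],
     [0, 40, 20, 50, 0, 40]]
# hypothetical labels: substructure present in MS1-MS3 only
y = [1, 1, 1, 0, 0]
```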
William of Ockham (1285–1349) and Epicurus (341 BC – 270 BC)

Epicurus' principle of multiple explanations: “all consistent models should be retained”.
[Figure: PLS Feature Selection – loading plot (Component 1 vs. Component 2) with labeled mass spectral features]

Substructure matrix (substructures s1–sn are drawn as chemical fragments in the original figure):
Substructure   Molecule1   Molecule2   Molecule3   Molecule4   Molecule5
s1             Y           Y           Y           N           N
s2             Y           Y           Y           N           N
s3             N           N           N           N           N
s4             Y           Y           Y           Y           Y
s5             Y           Y           Y           Y           Y
sn             N           N           N           Y           Y
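Each row of the substructure matrix yields the binary target vector for one substructure classifier; a small sketch using the Y/N flags from the table above:

```python
# substructure matrix rows (Y = substructure present in that molecule)
matrix = {
    "s1": "Y Y Y N N".split(),
    "s2": "Y Y Y N N".split(),
    "s3": "N N N N N".split(),
    "s4": "Y Y Y Y Y".split(),
    "s5": "Y Y Y Y Y".split(),
}

def targets(matrix, substructure):
    """Binary target vector (1 = present) for one substructure across all molecules."""
    return [1 if flag == "Y" else 0 for flag in matrix[substructure]]

def trainable(matrix, substructure):
    """A classifier needs both classes; substructures that never or always occur are skipped."""
    t = targets(matrix, substructure)
    return 0 < sum(t) < len(t)
```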
Introduction

Metabolomics has the ultimate goal of giving a comprehensive overview of all small molecules in a given sample. Better software for de novo identification of the true isomer structure of small molecules is desperately needed. We developed an automated classification workflow which can recognize substructures in unknown electron impact mass spectra.

Methods

Several predictive classification algorithms, such as stochastic gradient boosting trees, LDA (linear discriminant analysis), CART (classification and regression trees), neural networks, the naive Bayes classifier, and support vector machines, were combined in a meta-learning approach for the detection of important substructures in mass spectra.

Data Preparation: basic statistics; remove extreme outliers; transform or normalize datasets; mark sets with zero variance.

[Figure: Best performers for TMS substructure – percent disagreement [%] per classifier; lower is better]
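The meta-learning combination can be sketched as a majority vote over independently trained models; the three stand-in classifiers below are invented for illustration (the actual workflow votes over trained boosting trees, LDA, CART, neural network, naive Bayes, and SVM models):

```python
def majority_vote(classifiers, x):
    """Meta-learning by voting: each model predicts 0/1, the majority wins."""
    votes = [clf(x) for clf in classifiers]
    return 1 if sum(votes) > len(votes) / 2 else 0

# stand-in models operating on a feature vector x (thresholds are arbitrary)
clf_a = lambda x: 1 if x[0] > 50 else 0   # stands in for a tree model
clf_b = lambda x: 1 if x[2] >= 50 else 0  # stands in for a discriminant model
clf_c = lambda x: 1 if x[1] < 30 else 0   # stands in for a naive Bayes model

pred = majority_vote([clf_a, clf_b, clf_c], [100, 20, 50])  # 1 (all agree)
```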
[Figure: model performance as a function of the number of mass spectral features (0–100)]
Maximum common substructures (MCS)

Data mining workflow in Statistica Dataminer

The vision – Automatic substance class annotation: aromatic compound, hydroxy groups, sugar, fatty acid (maximum common substructures clustered with ChemAxon LibMCS)