A machine learning pipeline for substructure detection in ... - Fiehn Lab

A machine learning pipeline for substructure detection in unknown mass spectra

Tobias Kind, Oliver Fiehn
UC Davis Genome Center, Metabolomics, Davis, CA

The challenge: thousands of unknowns

Concept of predictive data mining

Detecting unknown compounds in complex mixtures by mass spectrometry is a cumbersome and lengthy process. A single comprehensive GCxGC-MS chromatogram can contain up to 10,000 chromatographic peaks, each with its own mass spectrum. Identification rates for unknown small molecules in such complex samples are usually below 1%.

Data Preparation Feature Selection

For biologically important substructures (amines, sugars, alkanes, amino-sugars, alcohols, aromatic and aliphatic rings, steroids) we determined key statistical parameters such as sensitivity (which reflects the false-negative rate), specificity (which reflects the false-positive rate), and predictive accuracy (overall performance) using electron impact mass spectra from trimethylsilyl (TMS) derivatized compounds. Average misclassification rates between 7% and 16% were obtained. In our concept the five best performing algorithms vote on the presence or absence of a detected substructure, which is crucial for later obtaining a low number of structural isomers from deterministic and stochastic molecular isomer generators. The whole process is extensible to most molecular substructures and molecular classes. Successful models are automatically exported as C++, Visual Basic or PMML code for use in independent external programs.

References:
Varmuza K., Werther W., Mass spectral classifiers for supporting systematic structure elucidation, J. Chem. Inf. Comput. Sci., 36, 323-333 (1996).
Stein S.E., Chemical Substructure Identification by Mass Spectral Library Searching, J. Am. Soc. Mass Spectrom., 6, 644-655 (1995).

Feature Selection: Predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning

Model Training + Cross Validation: Use only important features, apply bootstrapping if only few datasets; use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, kNN for prediction

Model Testing: Calculate performance with percent disagreement and chi-square statistics

Model Deployment: Deploy model for unknown data; use PMML, VB, C++

[Figure: Sorbitol with TMS groups]

[Figure: bar chart of percent (%) disagreement of model, lower is better; axis from 0.0 to 12.0; model labels: MLP, NN-IPS, NaiveBayes, MARS, KNN, Trees, GDA, Probit, Logit, SVM]
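The performance measures named for model testing (percent disagreement, chi-square) and in the abstract (sensitivity, specificity) can all be derived from a 2x2 table of observed versus predicted substructure labels. The following is a minimal sketch under stated assumptions: the function names and the example labels are hypothetical, and the poster does not say which chi-square variant was used, so the plain Pearson statistic without continuity correction is assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn


def percent_disagreement(y_true, y_pred):
    """Share of spectra (in %) where prediction and truth differ."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 100.0 * (fp + fn) / (tp + tn + fp + fn)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def chi_square_2x2(y_true, y_pred):
    """Pearson chi-square for a 2x2 table (assumed variant, no correction)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    denom = (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    if denom == 0:
        raise ValueError("chi-square is undefined when a row or column is empty")
    return n * (tp * tn - fp * fn) ** 2 / denom


# Hypothetical labels: 1 = substructure present, 0 = absent
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

disagreement = percent_disagreement(observed, predicted)
sens, spec = sensitivity_specificity(observed, predicted)
chi2 = chi_square_2x2(observed, predicted)
print(disagreement, sens, spec, chi2)
```

Percent disagreement is simply 100% minus the predictive accuracy, which is why a lower value indicates a better model.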

Mass spectral feature space (~800 features)

Mass Spectral Features
• m/z value
• m/z intensity
• delta (m/z)
• delta (m/z) x intensity
• non linear functions
• intensity series
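As an illustration of the feature types listed above, the sketch below derives intensity, delta (m/z) and delta (m/z) x intensity features from a peak list. The exact feature definitions are not given on the poster, so several things here are assumptions: deltas are taken between neighbouring peaks, the product uses the intensity of the higher-mass peak, and the function name, feature names and example intensities are hypothetical.

```python
def spectrum_features(peaks):
    """Build a feature dictionary from a list of (m/z, intensity) pairs."""
    peaks = sorted(peaks)
    features = {}

    # One intensity feature per observed m/z value
    for mz, intensity in peaks:
        features[f"int_{mz}"] = intensity

    # Assumed definition: mass differences between neighbouring peaks,
    # optionally weighted by the intensity of the higher-mass peak
    for (mz_low, _), (mz_high, int_high) in zip(peaks, peaks[1:]):
        delta = mz_high - mz_low
        features[f"delta_{mz_low}_{mz_high}"] = delta
        features[f"delta_x_int_{mz_low}_{mz_high}"] = delta * int_high

    return features


# Hypothetical peak list; m/z 73 and 147 are ions commonly seen for TMS derivatives
example_peaks = [(147, 40), (73, 100), (205, 25)]
features = spectrum_features(example_peaks)
print(features)
```

Storing features by name keeps each spectrum sparse; a fixed-width feature matrix can then be assembled by filling missing names with zero.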

Results


The algorithm is based on electron impact mass spectra from the NIST05 library, which were transformed into 800 mass spectral features (important peaks and characteristic combinations thereof). Mass spectral features were calculated to remove invariant m/z values. Substructure information was obtained via a substructure isomorphism matrix. Prior to the actual classification, a feature selection step using partial least squares removed invariant and insignificant features. Using an automated data mining workflow, all algorithms were trained and tested with cross validation. Finally, the five best performing algorithms vote on the presence or absence of a specific substructure in a given mass spectrum.
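The final voting step can be sketched as follows: rank the trained models by their cross-validated percent disagreement, keep the five best, and take a simple majority. This is only an illustration. The poster does not state the voting rule, so an unweighted majority is assumed, and the model error values and predictions below are hypothetical.

```python
def top_k_models(disagreement, k=5):
    """Return the names of the k models with the lowest percent disagreement."""
    return sorted(disagreement, key=disagreement.get)[:k]


def majority_vote(predictions, voters):
    """True if more than half of the voting models predict the substructure."""
    votes_for = sum(1 for name in voters if predictions[name])
    return votes_for > len(voters) / 2


# Hypothetical cross-validation results (percent disagreement per model)
cv_disagreement = {
    "SVM": 7.0, "MLP": 8.5, "KNN": 9.0, "MARS": 10.0,
    "GDA": 11.0, "Logit": 12.0, "NaiveBayes": 16.0,
}

# Hypothetical predictions of each model for one unknown spectrum
predictions = {
    "SVM": True, "MLP": True, "KNN": False, "MARS": True,
    "GDA": False, "Logit": False, "NaiveBayes": False,
}

voters = top_k_models(cv_disagreement, k=5)
substructure_present = majority_vote(predictions, voters)
print(voters, substructure_present)
```

Using an odd number of voters, as with the five models here, avoids ties in a binary presence/absence decision.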


Feature selection process with PLS

Occam's Razor: Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.

Tree model

Classification and machine learning methods

MS Feature matrix

MS Spectrum  f1   f2  f3  f4  f5  fn
MS1          100  20  50  60  0   0
MS2          100  20  50  60  0   20
MS3          100  20  60  50  0   0
MS4          0    40  20  50  0   40
MS5          0    40  20  50  0   40

William of Ockham (1285-1349), Epicurus (341 BC to 270 BC)

Epicurus: Principle of multiple explanations: “all consistent models should be retained”.

Cluster Analysis

[Figure: mass spectrum plot with horizontal axis ticks from 90 to 290]

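PLS Feature Selection

[Figure: scatter plot of Component 1 versus Component 2; labeled features include F34, F35, F77, F145, F146, F147, F148, F189, F397, F546, F547 and F734]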

Substructure matrix

[Figure: structure drawings of the substructures s1 to s5]

Substructure  Molecule1  Molecule2  Molecule3  Molecule4  Molecule5
s1            Y          Y          Y          N          N
s2            Y          Y          Y          N          N
s3            N          N          N          N          N
s4            Y          Y          Y          Y          Y
s5            Y          Y          Y          Y          Y
sn            N          N          N          Y          Y
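The substructure matrix above supplies the class labels for training: each row becomes the binary target vector of one substructure classifier. A minimal sketch of that lookup, using the Y/N values from the table, is shown below; the data layout and function names are assumptions made for illustration only.

```python
MOLECULES = ["Molecule1", "Molecule2", "Molecule3", "Molecule4", "Molecule5"]

# Rows copied from the substructure matrix; one character per molecule
SUBSTRUCTURE_MATRIX = {
    "s1": "YYYNN",
    "s2": "YYYNN",
    "s3": "NNNNN",
    "s4": "YYYYY",
    "s5": "YYYYY",
    "sn": "NNNYY",
}


def target_vector(substructure):
    """Binary training labels (True = present) for one substructure."""
    return [flag == "Y" for flag in SUBSTRUCTURE_MATRIX[substructure]]


def substructures_in(molecule):
    """List all substructures marked as present in the given molecule."""
    column = MOLECULES.index(molecule)
    return [name for name, row in SUBSTRUCTURE_MATRIX.items() if row[column] == "Y"]


def has_both_classes(substructure):
    """A classifier can only be trained if both Y and N examples exist."""
    return len(set(SUBSTRUCTURE_MATRIX[substructure])) == 2


trainable = [name for name in SUBSTRUCTURE_MATRIX if has_both_classes(name)]
print(target_vector("s1"), substructures_in("Molecule4"), trainable)
```

In this small example s3, s4 and s5 are constant across all five molecules, so they carry no information for training until molecules of the other class are added.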

Neural Network

Feature selection

Several predictive classification algorithms, such as Stochastic Gradient Boosting Trees, LDA (linear discriminant analysis), CART (Classification and Regression Trees), Neural Networks, the Naive Bayes Classifier and Support Vector Machines, were combined in a meta-learning approach for the detection of important substructures in mass spectra.

Data Preparation: Basic statistics, remove extreme outliers, transform or normalize datasets, mark sets with zero variances
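One part of the data preparation step, marking features with zero variance, can be sketched in a few lines. The matrix below uses the values of the poster's MS feature matrix example as far as the layout allows them to be read; the function name is hypothetical, and the actual workflow was built in a data mining package rather than in Python.

```python
from statistics import pvariance

FEATURE_NAMES = ["f1", "f2", "f3", "f4", "f5", "fn"]

# Rows = spectra MS1..MS5, columns = features (values as read from the example matrix)
FEATURE_MATRIX = [
    [100, 20, 50, 60, 0, 0],
    [100, 20, 50, 60, 0, 20],
    [100, 20, 60, 50, 0, 0],
    [0, 40, 20, 50, 0, 40],
    [0, 40, 20, 50, 0, 40],
]


def zero_variance_features(matrix, names):
    """Return the names of features whose value is identical in every spectrum."""
    columns = zip(*matrix)  # transpose rows into feature columns
    return [name for name, column in zip(names, columns) if pvariance(column) == 0]


invariant = zero_variance_features(FEATURE_MATRIX, FEATURE_NAMES)
print(invariant)
```

A feature that never changes cannot help separate spectra with a substructure from those without it, so it can be dropped before feature selection.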

Introduction

Metabolomics has the ultimate goal of giving a comprehensive overview of all small molecules in a given sample. Better software for de novo identification of the true isomer structure of small molecules is urgently needed. We developed an automated classification workflow that can recognize substructures from unknown electron impact mass spectra.

Methods

[Figure: Best performers for TMS substructure. Percent disagreement (%) versus number of mass spectral features (axis from 0 to 100); series: SVM, ECHAID, Average]

Machine Learning (KNN)

Maximum common substructures (MCS)

Data mining workflow in Statistica Dataminer

The vision: automatic substance class annotation (aromatic compound, hydroxy groups, sugar, fatty acid)

ChemAxon LibMCS