Chromatographia (2014) 77:1387–1398 DOI 10.1007/s10337-014-2735-4
ORIGINAL
Support Vector Machine Applied to Study on Quantitative Structure–Retention Relationships of Polybrominated Diphenyl Ether Congeners Xiaotong Zhang · Xin Zhang · Qiang Li · Zhaolin Sun · Lijuan Song · Ting Sun
Received: 11 June 2014 / Accepted: 30 June 2014 / Published online: 30 July 2014 © Springer-Verlag Berlin Heidelberg 2014
Abstract Quantitative structure–retention relationship (QSRR) models were constructed for the GC relative retention times (RRTs) of 126 polybrominated diphenyl ether (PBDE) congeners. First, a number of topological and connectivity index descriptors were calculated with the E-Dragon software. Next, six molecular descriptors were selected by a genetic algorithm (GA) coupled with multiple linear regression (MLR). The QSRR models were then established using a support vector machine (SVM) algorithm as the regression tool. The high training-set correlation coefficients R2 indicate that >99.6 % (except for the stationary phase CP-Sil 19) of the total variation in the predicted RRTs is explained by the fitted models. These models were subsequently used to predict the RRTs of the validation sets. The excellent statistical parameters, Q2loo (the leave-one-out cross-validation coefficient) and validation-set correlation coefficients R2 > 99.0 %, show that the models are robust and have high internal and external predictive capability. According to the sum of ranking differences (SRD) validation values, the DB-1 and DB-5 models are the two best models.
X. Zhang · T. Sun (*) College of Sciences, Northeastern University, Shenyang, Liaoning, China e-mail:
[email protected] X. Zhang · X. Zhang · Q. Li · Z. Sun · L. Song (*) Liaoning Key Laboratory of Petrochemical Engineering, Liaoning Shihua University, Fushun, Liaoning, China e-mail:
[email protected] Q. Li College of Chemical Engineering, Beijing University of Chemical Technology, Beijing, China
Keywords Support vector machine (SVM) · Genetic algorithm (GA) · Gas chromatography · Quantitative structure–retention relationships (QSRR) · Polybrominated diphenyl ether congeners (PBDE)
Introduction
Polybrominated diphenyl ethers (PBDEs) are brominated flame retardants (BFRs) that have been manufactured in large quantities and widely used in a variety of consumer goods such as computers, furniture, and automobiles [1]. They have spread ubiquitously as environmental contaminants and have been found in the air, soil, sediment, and water of natural environments, in the buildings where we live and work, in the sewage we produce, and in the automobiles we drive [2–5]. PBDEs are known to be persistent, bioaccumulative, endocrine disrupting, and possibly carcinogenic [6]. They can also alter liver function, leading to changes in thyroid hormone and vitamin A homeostasis and often resulting in excessive elimination of thyroid hormone [7]. Hence, there is a need to characterize PBDEs as a prime type of emerging contaminant. Analysis of PBDEs in environmental samples is preferably conducted by gas chromatographic separation followed by identification using a mass spectrometer or an electron capture detector [8, 9]. As with polychlorinated biphenyls (PCBs), there are 209 possible PBDE congeners; however, only about a dozen congeners exist as major components in commercial PBDE mixtures [10]. Because standards are lacking for many PBDE congeners, the biggest obstacle in many previous PBDE studies has been the correct identification of individual congeners. Therefore, quantitative structure–retention relationships (QSRR) models
were developed to predict retention parameters efficiently from theoretical descriptors computed from chemical structure [11–14]. For PBDEs, Rayne and Ikonomou [15] used the 46 available PBDE standards, ranging from mono- to deca-brominated, to fit a multiple linear regression equation that was then used to predict the RRTs of the remaining 163 PBDE congeners. Later, larger GC retention databases became available [8, 16]. Wang and Li [10] used the heuristic method to build regression models for 126 PBDE congeners from the literature on seven individual GC capillary columns. Yi and Li [17] used a genetic algorithm with leave-multiple-out cross validation (LMOCV) to select descriptors and obtained multiple linear regression fitting correlation coefficients (R2) greater than 0.988. However, all these previous works focus on more traditional methods such as MLR, the heuristic method or artificial neural networks (ANN) [18, 19]. The purpose of this work was to develop novel models capable of predicting the gas chromatographic relative retention times (RRTs) of PBDEs on various stationary phases using the support vector machine (SVM) algorithm. A total of 151 topological and connectivity index descriptors were calculated with the E-Dragon software, and GA coupled with MLR was used to select optimal subsets from this large pool of molecular descriptors. The models were evaluated by predicting the GC-RRTs of PBDE congeners in a validation set containing 13 molecules. We also compared the statistical quality of the models using the sum of ranking differences (SRD) validation method. Finally, we obtained models whose statistical parameters are superior to those reported in the literature.
Materials and Methods
Genetic Algorithm
A major step in constructing QSRR models is to find a set of molecular descriptors that represents the variation in the structural properties of the molecules. However, the cost of locating a globally optimal subset of features by exhaustive search is usually prohibitive, which motivates the use of more advanced search techniques such as GA [20]. The genetic algorithm was proposed by John Holland in the early 1970s [21]. GA is a stochastic method for solving optimization problems defined by fitness criteria; it applies Darwin's evolution hypothesis together with genetic operators such as crossover and mutation. Compared to traditional search and optimization procedures, GA is global and generally more straightforward to apply where little or no a priori knowledge about the process to be controlled is available [22–24].
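As an illustration of the search strategy described above, the following is a minimal sketch of a genetic algorithm that looks for a fixed-size descriptor subset. It is not the authors' Matlab implementation: the function name `ga_select`, the truncation selection, the elitism step, and the toy fitness function are all assumptions made for the example. In a GA-MLR setting the fitness would typically be a goodness-of-fit statistic of the regression built on the candidate subset.

```python
import random

def ga_select(n_features, k, fitness, pop_size=30, generations=40,
              mutation_rate=0.2, seed=0):
    """Search for a k-descriptor subset (tuple of indices) that maximises `fitness`."""
    rng = random.Random(seed)

    def random_subset():
        return tuple(sorted(rng.sample(range(n_features), k)))

    def crossover(a, b):
        # The child draws k distinct descriptors from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(subset):
        # With probability mutation_rate, swap one chosen descriptor for an unused one.
        if rng.random() >= mutation_rate or k == n_features:
            return subset
        chosen = list(subset)
        unused = [j for j in range(n_features) if j not in subset]
        chosen[rng.randrange(k)] = rng.choice(unused)
        return tuple(sorted(chosen))

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection (assumption)
        children = [ranked[0]]              # elitism: carry the best subset over unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = children
    return max(population, key=fitness)

# Hypothetical fitness: reward overlap with a pretend set of "informative" descriptors.
INFORMATIVE = {1, 4, 7}

def toy_fitness(subset):
    return len(INFORMATIVE & set(subset))

best = ga_select(n_features=10, k=3, fitness=toy_fitness)
print(best)
```

Because the random generator is seeded, repeated calls with the same arguments return the same subset; running with several seeds mimics the repeated runs from different initial populations.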
X. Zhang et al.
Support Vector Machine
Support vector machine is a relatively new and very promising classification and regression technique developed by Vapnik [25]. Compared to traditional neural networks, SVM offers prominent advantages, including high generalization capability, avoidance of local minima, a solution that can always be found by a standard algorithm (quadratic programming), automatic determination of the network topology, and a lower workload [26]. The SVM algorithm is based on estimating a linear regression function:
$$f(x) = w \cdot \varphi(x) + b \qquad (1)$$
where w and b represent the slope and offset of the regression line, x is a vector from the input space, φ is the nonlinear mapping, induced by the kernel function, that maps the input space to a higher- or infinite-dimensional feature space, and f(x) is the linear regression function. In the case of SVM, the regression function is calculated by minimizing:
$$\frac{1}{2} w^{T} w + \frac{1}{n} \sum_{i=1}^{n} c(f(x_i), y_i) \qquad (2)$$
where (1/2)wTw is a term characterizing the model complexity, c(f(xi), yi) is a loss function, y is the target and n is the number of samples [27]. Several types of kernel functions are commonly used in SVM, including the linear, polynomial, radial basis and sigmoid functions [28]. In this study, we constructed the SVM models using the Gaussian radial basis function (RBF):
$$k(\bar{x}_i, \bar{x}_j) = \exp\left(-r \lVert \bar{x}_i - \bar{x}_j \rVert^{2}\right) \qquad (3)$$
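To make Eq. (3) concrete, the short sketch below evaluates the Gaussian RBF kernel for a few descriptor vectors and assembles the resulting kernel (Gram) matrix. The vectors and the value of the kernel parameter r are made up for illustration; the function names are hypothetical and not part of any library.

```python
import math

def rbf_kernel(x_i, x_j, r):
    """Gaussian RBF kernel of Eq. (3): exp(-r * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-r * sq_dist)

def gram_matrix(X, r):
    """Kernel value for every pair of samples in X."""
    return [[rbf_kernel(a, b, r) for b in X] for a in X]

# Made-up two-descriptor vectors for three compounds (illustrative only).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(X, r=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

The diagonal entries equal 1 because each vector has zero distance to itself, and the kernel value decays toward 0 as two compounds move apart in descriptor space; a larger r makes that decay faster.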
As a new and powerful modeling tool, SVM has recently gained much interest in pattern recognition and function approximation applications. Only a brief introduction to its main principle is given here; more details of the theoretical background of SVM can be found in Refs. [29–31].
Model Building
Data Set and GC Systems
The GC relative retention times of 126 PBDE congeners on 7 stationary phases were taken from Ref. [16] and are presented in Table 1. The specific chromatographic conditions of the capillary GC columns and the temperature programs used to collect the retention data are shown in Table 2. To enable comparison between stationary phases, a similar oven temperature program was used for all columns. The data set was divided randomly into a training set containing 100 molecules, which was used for model generation, a test set which consisted of 13
SVM Applied to Study on QSRR
Table 1 Experimental and SVM-predicted gas chromatographic relative retention time data for PBDEs on seven stationary phases (Exp, experimental; Pre, predicted). Each data row lists one value per congener, in the congener order of the PBDE row. The data rows are, in order: DB-1 Exp, DB-1 Pre, DB-5 Exp, DB-5 Pre, HT-5 Exp, HT-5 Pre, DB-17 Exp, DB-17 Pre, DB-XLB Exp, DB-XLB Pre, HT-8 Exp, HT-8 Pre, CP-Sil 19 Exp, CP-Sil 19 Pre
PBDE: 1 2v 3 4 6 7a 8 9 10 11v 12 13 14a 15 16 17 18v 19 20 22 25 26a 27 28 29 30 31v 32 33 34 35 36 37 38a 39 40 42 46 47 48 49v 50 51 53 55 58a 62
0.100 0.102 0.103 0.134 0.135 0.134 0.139 0.131 0.125 0.139 0.142 0.142 0.133 0.146 0.207 0.196 0.186 0.181 0.214 0.223 0.198 0.190 0.180 0.207 0.193 0.172 0.199 0.190 0.207 0.188 0.215 0.196 0.223 0.212 0.203 0.344 0.317 0.287 0.301 0.287 0.283 0.261 0.278 0.253 0.328 0.312 0.277
0.106 0.105 0.101 0.137 0.133 0.122 0.141 0.125 0.130 0.145 0.147 0.136 0.111 0.140 0.207 0.190 0.213 0.181 0.212 0.224 0.193 0.154 0.182 0.201 0.191 0.166 0.233 0.194 0.218 0.183 0.221 0.195 0.228 0.185 0.197 0.330 0.312 0.292 0.288 0.293 0.318 0.258 0.274 0.260 0.331 0.263 0.283
0.095 0.097 0.099 0.133 0.134 0.133 0.139 0.129 0.123 0.137 0.142 0.142 0.130 0.147 0.214 0.202 0.191 0.187 0.221 0.231 0.204 0.195 0.184 0.214 0.197 0.174 0.205 0.195 0.214 0.192 0.221 0.199 0.232 0.217 0.208 0.355 0.329 0.299 0.313 0.298 0.294 0.272 0.289 0.264 0.341 0.321 0.287
0.095 0.095 0.085 0.131 0.130 0.111 0.135 0.120 0.129 0.145 0.148 0.137 0.095 0.143 0.210 0.192 0.213 0.185 0.217 0.229 0.197 0.151 0.187 0.209 0.202 0.179 0.237 0.198 0.220 0.185 0.228 0.196 0.240 0.188 0.202 0.334 0.320 0.302 0.301 0.308 0.324 0.275 0.286 0.276 0.341 0.270 0.303
0.091 0.091 0.093 0.122 0.121 0.121 0.126 0.116 0.112 0.124 0.128 0.128 0.117 0.133 0.193 0.183 0.169 0.168 0.201 0.211 0.182 0.172 0.162 0.194 0.173 0.152 0.182 0.175 0.193 0.169 0.201 0.178 0.211 0.193 0.186 0.347 0.313 0.275 0.297 0.272 0.269 0.247 0.271 0.232 0.326 0.304 0.256
0.097 0.095 0.079 0.121 0.116 0.105 0.122 0.104 0.118 0.129 0.134 0.122 0.084 0.132 0.194 0.172 0.194 0.165 0.200 0.216 0.176 0.126 0.165 0.191 0.179 0.152 0.220 0.180 0.206 0.160 0.214 0.172 0.231 0.166 0.180 0.324 0.307 0.283 0.283 0.289 0.307 0.248 0.265 0.252 0.330 0.246 0.277
0.087 0.086 0.088 0.129 0.125 0.122 0.129 0.117 0.117 0.124 0.129 0.129 0.114 0.133 0.209 0.199 0.188 0.194 0.209 0.221 0.195 0.184 0.184 0.206 0.186 0.166 0.195 0.191 0.210 0.179 0.209 0.177 0.221 0.204 0.189 0.351 0.327 0.306 0.308 0.298 0.290 0.278 0.296 0.275 0.349 0.302 0.291
0.093 0.088 0.079 0.129 0.123 0.101 0.128 0.111 0.123 0.130 0.142 0.121 0.082 0.127 0.212 0.192 0.214 0.186 0.212 0.225 0.188 0.135 0.183 0.201 0.198 0.172 0.222 0.197 0.218 0.175 0.216 0.176 0.228 0.176 0.183 0.342 0.324 0.309 0.301 0.309 0.309 0.277 0.291 0.279 0.343 0.250 0.307
0.086 0.088 0.090 0.127 0.128 0.128 0.135 0.123 0.117 0.133 0.138 0.140 0.124 0.147 0.218 0.205 0.192 0.186 0.228 0.242 0.208 0.198 0.184 0.223 0.198 0.174 0.212 0.199 0.218 0.196 0.228 0.203 0.242 0.220 0.215 0.372 0.343 0.306 0.326 0.304 0.305 0.278 0.296 0.268 0.356 0.335 0.296
0.092 0.091 0.081 0.129 0.127 0.111 0.135 0.117 0.123 0.140 0.147 0.134 0.094 0.143 0.215 0.197 0.215 0.185 0.222 0.236 0.203 0.158 0.189 0.217 0.204 0.171 0.241 0.203 0.229 0.189 0.235 0.203 0.248 0.186 0.209 0.348 0.332 0.310 0.311 0.317 0.319 0.275 0.293 0.284 0.352 0.271 0.302
0.084 0.085 0.087 0.131 0.130 0.129 0.137 0.122 0.116 0.134 0.139 0.140 0.124 0.147 0.226 0.214 0.196 0.196 0.235 0.247 0.213 0.200 0.188 0.227 0.202 0.175 0.213 0.205 0.226 0.196 0.235 0.206 0.247 0.226 0.217 0.380 0.350 0.315 0.335 0.313 0.309 0.287 0.311 0.272 0.363 0.341 0.297
0.090 0.086 0.077 0.132 0.128 0.111 0.137 0.116 0.122 0.136 0.149 0.134 0.089 0.142 0.224 0.203 0.205 0.190 0.229 0.244 0.207 0.158 0.193 0.222 0.208 0.171 0.229 0.210 0.236 0.190 0.241 0.203 0.254 0.187 0.211 0.360 0.342 0.321 0.319 0.324 0.301 0.282 0.302 0.289 0.362 0.268 0.307
0.083 0.083 0.086 0.130 0.127 0.125 0.134 0.120 0.116 0.130 0.135 0.136 0.118 0.142 0.217 0.206 0.193 0.193 0.223 0.235 0.204 0.194 0.185 0.216 0.193 0.167 0.206 0.197 0.218 0.189 0.224 0.194 0.236 0.214 0.205 0.360 0.332 0.304 0.315 0.298 0.296 0.272 0.296 0.269 0.347 0.319 0.283
0.095 0.090 0.074 0.121 0.115 0.122 0.127 0.108 0.129 0.109 0.137 0.128 0.095 0.142 0.205 0.193 0.160 0.180 0.213 0.232 0.198 0.158 0.181 0.217 0.199 0.179 0.175 0.201 0.224 0.182 0.228 0.199 0.245 0.168 0.204 0.347 0.330 0.306 0.308 0.314 0.206 0.276 0.293 0.283 0.350 0.243 0.295
Table 1 continued (data rows in the same order: DB-1 Exp, DB-1 Pre, DB-5 Exp, DB-5 Pre, HT-5 Exp, HT-5 Pre, DB-17 Exp, DB-17 Pre, DB-XLB Exp, DB-XLB Pre, HT-8 Exp, HT-8 Pre, CP-Sil 19 Exp, CP-Sil 19 Pre)
PBDE: 66 67 68 69v 71 72 73 74 75 76a 77 78 79 80 81 85 86v 87 88 97 98 99a 100 101 102 103 104 105 106v 108 109 114 115a 116 118 119 120v 121 123 124 125a 126 127 128 131 138 139
0.316 0.294 0.285 0.258 0.287 0.274 0.257 0.309 0.278 0.311 0.343 0.325 0.312 0.284 0.339 0.476 0.452 0.450 0.411 0.450 0.405 0.425 0.395 0.401 0.385 0.358 0.357 0.496 0.461 0.451 0.406 0.483 0.434 0.434 0.450 0.405 0.408 0.363 0.453 0.436 0.414 0.490 0.449 0.679 0.586 0.618 0.576
0.322 0.299 0.281 0.293 0.309 0.268 0.263 0.312 0.275 0.287 0.352 0.330 0.316 0.290 0.341 0.458 0.477 0.444 0.410 0.446 0.405 0.390 0.379 0.395 0.397 0.355 0.363 0.494 0.495 0.445 0.415 0.472 0.424 0.428 0.456 0.417 0.432 0.366 0.450 0.438 0.402 0.484 0.444 0.662 0.596 0.613 0.562
0.328 0.303 0.294 0.268 0.299 0.281 0.266 0.319 0.290 0.322 0.355 0.334 0.321 0.289 0.350 0.486 0.462 0.460 0.423 0.457 0.414 0.433 0.405 0.410 0.396 0.369 0.369 0.506 0.469 0.461 0.417 0.491 0.442 0.444 0.457 0.414 0.414 0.371 0.462 0.443 0.424 0.495 0.454 0.677 0.586 0.617 0.577
0.336 0.311 0.290 0.302 0.318 0.270 0.274 0.329 0.295 0.295 0.371 0.340 0.327 0.290 0.356 0.459 0.480 0.445 0.426 0.446 0.410 0.397 0.390 0.404 0.406 0.369 0.374 0.496 0.493 0.445 0.431 0.480 0.435 0.439 0.461 0.426 0.434 0.374 0.454 0.437 0.407 0.491 0.442 0.642 0.589 0.606 0.566
0.313 0.279 0.270 0.238 0.275 0.253 0.234 0.299 0.271 0.300 0.347 0.318 0.304 0.267 0.335 0.498 0.455 0.457 0.409 0.455 0.400 0.424 0.396 0.389 0.370 0.336 0.344 0.519 0.464 0.456 0.395 0.494 0.435 0.424 0.455 0.400 0.398 0.339 0.459 0.432 0.403 0.507 0.453 0.730 0.598 0.644 0.595
0.328 0.290 0.265 0.277 0.306 0.237 0.247 0.314 0.273 0.276 0.373 0.328 0.312 0.261 0.349 0.461 0.477 0.440 0.411 0.442 0.396 0.378 0.369 0.383 0.392 0.341 0.349 0.508 0.493 0.439 0.416 0.481 0.417 0.418 0.458 0.417 0.414 0.345 0.452 0.426 0.395 0.501 0.434 0.675 0.601 0.624 0.572
0.326 0.294 0.280 0.265 0.306 0.261 0.262 0.310 0.296 0.322 0.350 0.322 0.302 0.255 0.344 0.500 0.478 0.471 0.444 0.452 0.414 0.430 0.403 0.403 0.408 0.372 0.383 0.519 0.472 0.460 0.427 0.495 0.445 0.463 0.452 0.416 0.394 0.362 0.462 0.435 0.436 0.489 0.430 0.702 0.595 0.626 0.610
0.332 0.301 0.276 0.290 0.322 0.252 0.268 0.320 0.293 0.283 0.362 0.333 0.308 0.261 0.350 0.472 0.475 0.450 0.437 0.451 0.421 0.375 0.395 0.397 0.415 0.370 0.384 0.506 0.473 0.444 0.433 0.484 0.411 0.457 0.453 0.429 0.390 0.366 0.456 0.429 0.398 0.489 0.428 0.670 0.606 0.612 0.576
0.343 0.312 0.305 0.274 0.306 0.289 0.269 0.334 0.296 0.329 0.372 0.346 0.335 0.299 0.365 0.499 0.466 0.473 0.431 0.469 0.423 0.443 0.413 0.418 0.396 0.374 0.373 0.519 0.477 0.473 0.427 0.503 0.457 0.454 0.469 0.423 0.423 0.377 0.471 0.453 0.426 0.508 0.465 0.680 0.590 0.619 0.579
0.349 0.322 0.300 0.289 0.330 0.283 0.281 0.339 0.298 0.295 0.382 0.352 0.338 0.305 0.366 0.476 0.469 0.462 0.429 0.463 0.422 0.389 0.396 0.413 0.417 0.373 0.378 0.513 0.479 0.461 0.434 0.489 0.413 0.445 0.473 0.436 0.412 0.382 0.469 0.452 0.406 0.505 0.459 0.661 0.597 0.613 0.564
0.350 0.318 0.308 0.278 0.315 0.291 0.273 0.338 0.311 0.339 0.380 0.354 0.341 0.304 0.370 0.508 0.475 0.476 0.438 0.472 0.428 0.447 0.424 0.418 0.404 0.373 0.382 0.524 0.481 0.474 0.425 0.505 0.458 0.452 0.472 0.428 0.424 0.375 0.476 0.454 0.431 0.513 0.469 0.682 0.586 0.620 0.584
0.356 0.325 0.304 0.277 0.340 0.283 0.285 0.344 0.305 0.295 0.390 0.360 0.344 0.304 0.374 0.485 0.441 0.465 0.436 0.467 0.430 0.378 0.402 0.412 0.423 0.376 0.386 0.521 0.442 0.467 0.436 0.492 0.395 0.446 0.474 0.442 0.378 0.381 0.474 0.453 0.398 0.511 0.463 0.665 0.598 0.610 0.563
0.332 0.301 0.290 0.264 0.303 0.277 0.264 0.318 0.296 0.322 0.359 0.333 0.320 0.281 0.349 0.479 0.453 0.456 0.416 0.446 0.403 0.422 0.394 0.400 0.389 0.360 0.359 0.500 0.457 0.450 0.407 0.478 0.430 0.430 0.447 0.403 0.400 0.356 0.449 0.432 0.415 0.485 0.439 0.704 0.562 0.605 0.551
0.339 0.311 0.288 0.194 0.328 0.272 0.276 0.330 0.298 0.269 0.371 0.342 0.320 0.294 0.356 0.467 0.290 0.443 0.415 0.445 0.412 0.333 0.379 0.377 0.399 0.347 0.372 0.508 0.306 0.439 0.415 0.474 0.355 0.421 0.443 0.414 0.264 0.356 0.445 0.420 0.364 0.487 0.427 0.684 0.588 0.607 0.538
Table 1 continued (data rows in the same order: DB-1 Exp, DB-1 Pre, DB-5 Exp, DB-5 Pre, HT-5 Exp, HT-5 Pre, DB-17 Exp, DB-17 Pre, DB-XLB Exp, DB-XLB Pre, HT-8 Exp, HT-8 Pre, CP-Sil 19 Exp, CP-Sil 19 Pre)
PBDE: 140v 141 142 144 153a 154 155 156 158 159v 160 161 166 167a 168 173 181 182v 183 184 185 190 191a 192 198 203 204v 205 206 207a 208
0.586 0.587 0.595 0.528 0.563 0.515 0.491 0.644 0.586 0.593 0.598 0.533 0.623 0.600 0.547 0.779 0.770 0.707 0.699 0.673 0.713 0.780 0.734 0.721 0.882 0.885 0.858 0.922 1.074 1.042 1.029
0.606 0.576 0.589 0.537 0.543 0.519 0.493 0.641 0.604 0.600 0.592 0.539 0.614 0.582 0.563 0.785 0.746 0.713 0.703 0.679 0.718 0.786 0.733 0.716 0.878 0.889 0.843 0.928 1.068 1.031 1.030
0.588 0.585 0.598 0.531 0.560 0.517 0.496 0.640 0.587 0.590 0.590 0.535 0.621 0.596 0.548 0.763 0.754 0.697 0.687 0.666 0.702 0.763 0.721 0.707 0.853 0.855 0.834 0.891 1.027 1.001 0.988
0.598 0.576 0.587 0.546 0.541 0.524 0.497 0.634 0.605 0.592 0.592 0.540 0.616 0.577 0.562 0.759 0.733 0.694 0.692 0.663 0.712 0.772 0.716 0.704 0.851 0.868 0.813 0.902 1.025 0.982 0.983
0.614 0.594 0.610 0.516 0.564 0.504 0.483 0.672 0.598 0.603 0.592 0.521 0.643 0.613 0.543 0.807 0.807 0.717 0.703 0.683 0.716 0.807 0.744 0.720 0.899 0.903 0.885 0.944 1.111 1.089 1.073
0.616 0.582 0.593 0.545 0.533 0.516 0.482 0.659 0.622 0.599 0.598 0.534 0.633 0.582 0.567 0.801 0.770 0.721 0.713 0.677 0.742 0.819 0.747 0.725 0.905 0.928 0.875 0.970 1.117 1.081 1.067
0.603 0.588 0.624 0.543 0.552 0.518 0.500 0.643 0.597 0.578 0.609 0.534 0.631 0.588 0.551 0.776 0.761 0.708 0.692 0.679 0.714 0.778 0.729 0.710 0.860 0.859 0.849 0.904 1.048 1.024 1.004
0.583 0.575 0.611 0.549 0.495 0.524 0.506 0.639 0.611 0.549 0.605 0.536 0.631 0.534 0.567 0.782 0.749 0.666 0.693 0.678 0.720 0.782 0.682 0.704 0.859 0.875 0.811 0.915 1.054 0.979 1.010
0.588 0.586 0.597 0.533 0.560 0.513 0.494 0.642 0.590 0.591 0.592 0.536 0.627 0.596 0.545 0.755 0.744 0.680 0.674 0.654 0.693 0.755 0.708 0.698 0.828 0.830 0.810 0.865 0.997 0.965 0.955
0.570 0.577 0.592 0.541 0.503 0.518 0.496 0.641 0.605 0.557 0.594 0.540 0.614 0.546 0.568 0.761 0.721 0.646 0.674 0.655 0.696 0.761 0.669 0.693 0.828 0.837 0.777 0.882 1.002 0.929 0.961
0.597 0.582 0.599 0.524 0.558 0.513 0.497 0.640 0.586 0.587 0.584 0.526 0.622 0.594 0.543 0.743 0.742 0.675 0.665 0.652 0.676 0.742 0.695 0.678 0.809 0.812 0.801 0.842 1.009 0.972 1.009
0.528 0.569 0.593 0.537 0.482 0.517 0.495 0.634 0.601 0.503 0.589 0.531 0.611 0.515 0.566 0.751 0.713 0.605 0.671 0.648 0.692 0.747 0.626 0.672 0.823 0.843 0.768 0.863 1.015 0.932 1.003
0.565 0.563 0.578 0.507 0.531 0.487 0.462 0.637 0.563 0.564 0.567 0.505 0.603 0.572 0.515 0.850 0.823 0.696 0.685 0.636 0.722 0.852 0.746 0.722 1.094 1.100 0.981 1.254 2.091 1.741 1.661
0.399 0.551 0.565 0.501 0.461 0.474 0.462 0.650 0.598 0.431 0.571 0.500 0.599 0.507 0.541 0.807 0.767 0.609 0.731 0.648 0.735 0.864 0.720 0.727 1.122 1.211 0.893 1.246 1.804 1.650 1.673
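PBDE 209: DB-1 Exp 1.219, DB-1 Pre 1.213, DB-5 Exp 1.171, DB-5 Pre 1.156, HT-5 Exp 1.290, HT-5 Pre 1.284, DB-17 Exp 1.282, DB-17 Pre 1.222, DB-XLB Exp 1.196, DB-XLB Pre 1.151, HT-8 Exp 1.281, HT-8 Pre 1.220, CP-Sil 19 Exp 2.540, CP-Sil 19 Pre 2.528
a indicates test set
v indicates validation set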
molecules and was used to guard against overtraining, and a validation set of 13 molecules, which was used to evaluate the resulting models.
Molecular Descriptors Generation
All structures were drawn with the ChemBioDraw 2010 software. Then, the geometry optimizations were performed with the DMol3 program package in Materials Studio (MS) from Accelrys Inc. The resulting geometries were transferred into the E-Dragon program (Talete srl, DRAGON for Windows–Software for molecular descriptors calculation, Milano, Italy, 2007) to obtain 151 descriptors, grouped into the topological and connectivity indices classes [32]. One of the challenging parts of developing models is choosing suitable parameters that encode different aspects of the molecular structure. First, any descriptor that had identical or zero values for more than 90 % of the compounds was eliminated. Furthermore, to decrease the redundancy in the descriptor data matrix, the pairwise correlations between descriptors were examined, and collinear descriptors showing high intercorrelation (i.e., r > 0.95) were detected.
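The two pre-filtering steps described above can be sketched as follows. This is an illustrative sketch only: the descriptor names and values are invented, and the rule of keeping the first descriptor of each collinear pair is an assumption, since the text states only that collinear descriptors were detected.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def prefilter(columns, max_same=0.90, max_r=0.95):
    """columns maps a descriptor name to its list of values, one per compound."""
    # Step 1: drop descriptors whose most frequent value (e.g. zero)
    # occurs in more than 90 % of the compounds.
    kept = {}
    for name, values in columns.items():
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) <= max_same:
            kept[name] = values
    # Step 2: skip any descriptor with |r| > 0.95 against one already selected
    # (keeping the first of each collinear pair is an assumption).
    selected = []
    for name in kept:
        if all(abs(pearson(kept[name], kept[s])) <= max_r for s in selected):
            selected.append(name)
    return selected

# Invented descriptor values for 11 hypothetical compounds.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0],  # 2 * A
    "C": [0.0] * 10 + [1.0],                                            # mostly zero
    "D": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0],
}
selected = prefilter(data)
print(selected)
```

Here "C" fails the first step (10 of 11 values are zero) and "B" fails the second (it is perfectly correlated with "A"), so only "A" and "D" survive.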
Table 2 GC conditions for PBDE retentions on different columns
Column | Stationary phase | Dimension (m × mm × μm) | Initial T (°C) | Hold (min) | Rate 1 (°/min) | Break T (°C) | Rate 2 (°/min) | Final T (°C) | Hold (min) | ECD T (°C)
DB-1 | 100 % methylpolysiloxane | 30 × 0.25 × 0.25 | 90 | 2 | 30 | 200 | 1.5 | 325 | 1 | 330
DB-5 | 5 % phenylmethylpolysiloxane | 30 × 0.25 × 0.25 | 90 | 2 | 30 | 200 | 1.5 | 325 | 7 | 330
HT-5 | 5 % phenylmethylpolysiloxane (carborane) | 30 × 0.25 × 0.10 | 90 | 2 | 30 | 200 | 1.5 | 325 | 7 | 330
DB-17 | 50 % phenylmethylpolysiloxane | 30 × 0.25 × 0.25 | 90 | 2 | 30 | 200 | 1.5 | 325 | 30 | 330
DB-XLB | – | 30 × 0.25 × 0.25 | 90 | 2 | 30 | 200 | 1.5 | 325 | 30 | 330
HT-8 | 8 % phenylmethylpolysiloxane (carborane) | 25 × 0.22 × 0.25 | 90 | 2 | 30 | 200 | 1.5 | 325 | 50 | 330
CP-Sil 19 | 14 % cyanopropyl methylpolysiloxane | 17 × 0.15 × 0.30 | 90 | 2 | 30 | 200 | 1.5 | 270 | 150 | 290
Carrier gas: inlet pressure 25 psi, except for CP-Sil 19 (35 psi); average velocity 38.5 cm/s, except for HT-8 (35.8 cm/s) and CP-Sil 19 (33.4 cm/s)
After this procedure, 45 molecular descriptors remained for further analysis. To select the most important descriptors, GA was run many times with different initial populations. At the end, a population of good models was obtained [33]. The descriptors selected by this method were used to construct linear and nonlinear models with the MLR and SVM techniques [34, 35]. First, for each of the seven columns, a simple model was established on the training set by MLR using GA variable selection to identify, among the 45 molecular descriptors, a small subset encoding the effect of PBDE structure on RRT for the considered stationary phase. Finally, six molecular descriptors were chosen to build the QSRR model; their specific information is presented in Table 3. Specifically, four topological descriptors and two molecular connectivity indices were selected. The topological descriptors describe the atomic connectivity in the molecule, and the idea behind the connectivity index is to use available information on some molecular property for their construction [36]. In this work, GA-MLR analysis was performed using Matlab (2010b) and SPSS software.
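To show how a topological descriptor is obtained from atomic connectivity alone, the sketch below computes the Wiener index, one of the selected descriptors, as the sum of shortest-path distances between all pairs of vertices in a hydrogen-suppressed molecular graph. For brevity it uses two small alkanes rather than a PBDE; the adjacency lists and the function name are written for this example only.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of vertices."""
    total = 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                     # every pair was counted twice

# Hydrogen-suppressed carbon skeletons (atoms numbered arbitrarily).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # C0-C1-C2-C3 chain
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}    # three carbons on a central one

print(wiener_index(n_butane), wiener_index(isobutane))  # prints "10 9"
```

The branched isomer has the smaller index, which illustrates why the Wiener index is sensitive to the degree of branching within a set of isomers.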
Table 3 The molecular descriptors selected by GA-MLR in the QSRR model
No. | Descriptor | Abbreviation | Class
1 | Wiener W index | W | Topological descriptor
2 | 3-path Kier alpha-modified shape index | S3K | Topological descriptor
3 | Second Mohar index TI2 | TI2 | Topological descriptor
4 | Centralization | CENT | Topological descriptor
5 | Connectivity index chi-5 | X5 | Connectivity indices
6 | Valence connectivity index chi-5 | X5v | Connectivity indices
SVM Model Building
Support vector machine, a non-linear algorithm, was developed for regression and classification and has gained popularity in QSPR studies for drug design and biological activity in recent years [37, 38]. Here, we used the SVM method to establish the QSRR model of PBDEs in order to obtain better predictions. All the programs were implemented in Matlab. The goodness of the correlation for the training set was examined by the correlation coefficient (R2). The stability of the models was evaluated by the cross-validated coefficient, Q2loo, which describes the stability of a model by focusing on its sensitivity to the elimination of any single data point. Q2loo is defined as follows:
$$Q^{2}_{loo} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (4)$$
where yi is the experimental value, ŷi is the estimate of the dependent variable obtained via iteration in leave-one-out cross validation, ȳ is the average of the experimental values, and n is the number of compounds in the training set. The external
Fig. 1 Correlation coefficients R2 and root mean squared error RMS vs. number of descriptors
Table 4 GA-MLR regression equations for the different stationary phases
Column | N | R2 | SE (%) | F | Equation
DB-1 | 100 | 0.995 | 1.76 | 2826.63 | RRT = −0.732 + 0.001W − 0.182S3K + 0.338TI2 + 0.001CENT − 0.123X5 + 0.048X5v
DB-5 | 100 | 0.994 | 1.79 | 2532.95 | RRT = −0.892 + 0.001W − 0.157S3K + 0.329TI2 + 0.001CENT − 0.055X5 + 0.013X5v
HT-5 | 100 | 0.992 | 2.32 | 1836.96 | RRT = −0.931 + 0.001W − 0.230S3K + 0.419TI2 + 0.001CENT − 0.118X5 + 0.037X5v
DB-17 | 100 | 0.994 | 1.90 | 2459.34 | RRT = −0.764 + 0.001W − 0.185S3K + 0.262TI2 + 0.001CENT − 0.024X5 − 0.007X5v
DB-XLB | 100 | 0.992 | 2.01 | 1933.66 | RRT = −1.047 + 0.001W − 0.147S3K + 0.357TI2 + 0.002CENT − 0.024X5 − 0.005X5v
HT-8 | 100 | 0.990 | 2.25 | 1557.69 | RRT = −0.725 + 0.001W − 0.155S3K + 0.264TI2 + 0.001CENT − 0.052X5 + 0.008X5v
CP-Sil 19 | 100 | 0.933 | 9.86 | 217.19 | RRT = 3.486 + 0.009W − 0.498S3K − 0.331TI2 − 0.009CENT − 1.065X5 + 0.522X5v
predictive ability of the model was examined by the correlation coefficient (R2) of the validation set. In addition, the root mean square error (RMS) was employed to evaluate the performance of the SVM. It was computed using the following formula:
$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \qquad (5)$$
where ŷi is the predicted value, yi is the experimental value and n is the number of compounds in the analyzed set.
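The two statistics defined in Eqs. (4) and (5) can be sketched in a few lines. To keep the example self-contained, a one-descriptor least-squares line stands in for the model that is refitted in each leave-one-out round; this stand-in, the function names and the numbers are assumptions for illustration and do not reproduce the six-descriptor models of this work.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single descriptor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def q2_loo(xs, ys):
    """Eq. (4): leave each compound out, refit, predict it, accumulate squared errors."""
    y_mean = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return 1.0 - press / sum((y - y_mean) ** 2 for y in ys)

def rms(y_pred, y_exp):
    """Eq. (5): root mean square error between predicted and experimental values."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(y_pred, y_exp)) / len(y_exp))

# Invented descriptor values and retention times.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_exact = [2.0 * x + 1.0 for x in xs]          # perfectly linear
ys_noisy = [0.10, 0.21, 0.29, 0.42, 0.50]       # nearly linear

print(q2_loo(xs, ys_exact), q2_loo(xs, ys_noisy))
print(rms([1.0, 2.0], [1.0, 4.0]))
```

For perfectly linear data every left-out point is predicted exactly, so Q2loo equals 1 up to rounding; any scatter pushes it below 1.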
Results and Discussion
The Molecular Descriptors Selected By GA-MLR
By the GA-MLR method, regression equations were obtained using the GC-RRTs of the 100 PBDE congeners in the training set. To determine the optimum number of molecular descriptors, the R2 and RMS values were plotted against the number of descriptors (Fig. 1); these plots provide guidance in deciding the number of descriptors to be retained in the QSRR models. As seen in Fig. 1a, for the first six GC stationary phases, R2 increases gradually with the number of descriptors until it reaches a plateau, while RMS declines until it reaches a lower limit. However, for CP-Sil 19, the variations of R2 and RMS with the number of descriptors differ from those of the other GC columns, as shown in Fig. 1b. Considering all seven columns, we selected the six molecular descriptors that contribute most to improving the correlation in order to build effective models. The resulting regression equations and more specific information for the individual GC stationary phases are summarized in Table 4. The descriptive performance of each GA-MLR model was evaluated by the usual statistical parameters: the correlation coefficient (R2), the relative standard error (SE) and the F test value (F). The correlation coefficients in Table 4 show that R2 of the GA-MLR models is at least 0.990 for all columns except CP-Sil 19, for which R2 is 0.933. Therefore, the six molecular descriptors we chose can successfully represent RRT in a regression equation.
Table 5 Comparison of the QSRR models developed in the present work and those previously published
Column | Parameter | Wang [10] | Zhang [19] | Yi [17] | Angelo [18] | This work
DB-1 | n | 100 | 126 | 126 | 126 | 100
DB-1 | m | 4 | 5 | 3 | 5 | 6
DB-1 | R2 | 0.990 | 0.9904 | 0.9934 | 0.9960 | 0.9983
DB-1 | Q2loo | 0.988 | 0.9886 | 0.9938 | 0.9956 | 0.9970
DB-5 | n | 100 | 126 | 126 | 126 | 100
DB-5 | m | 4 | 5 | 3 | 5 | 6
DB-5 | R2 | 0.990 | 0.9919 | 0.9947 | 0.9968 | 0.9980
DB-5 | Q2loo | 0.989 | 0.9904 | 0.9945 | 0.9965 | 0.9970
HT-5 | n | 100 | 126 | 126 | 126 | 100
HT-5 | m | 4 | 5 | 3 | 5 | 6
HT-5 | R2 | 0.985 | 0.9804 | 0.9917 | 0.9941 | 0.9962
HT-5 | Q2loo | 0.983 | 0.9844 | 0.9910 | 0.9934 | 0.9949
DB-17 | n | 100 | 126 | 126 | 126 | 100
DB-17 | m | 4 | 5 | 3 | 5 | 6
DB-17 | R2 | 0.986 | 0.9920 | 0.9924 | 0.9965 | 0.9977
DB-17 | Q2loo | 0.985 | 0.9912 | 0.9915 | 0.9960 | 0.9969
DB-XLB | n | 100 | 126 | 126 | 126 | 100
DB-XLB | m | 4 | 5 | 3 | 5 | 6
DB-XLB | R2 | 0.992 | 0.9923 | 0.9925 | 0.9953 | 0.9978
DB-XLB | Q2loo | 0.991 | 0.9915 | 0.9917 | 0.9947 | 0.9970
HT-8 | n | 100 | 126 | 126 | 126 | 100
HT-8 | m | 4 | 5 | 3 | 5 | 6
HT-8 | R2 | 0.991 | 0.9897 | 0.9890 | 0.9926 | 0.9969
HT-8 | Q2loo | 0.990 | 0.9898 | 0.9875 | 0.9912 | 0.9957
CP-Sil 19 | n | 100 | 126 | 126 | – | 100
CP-Sil 19 | m | 4 | 5 | 3 | – | 6
CP-Sil 19 | R2 | 0.990 | 0.9321 | 0.9774 | – | 0.9921
CP-Sil 19 | Q2loo | 0.988 | 0.9140 | 0.9661 | – | 0.9808
n and m indicate the number of PBDE congeners and model descriptors, respectively
The Major Contributing Molecular Descriptors
As shown in Table 4, the GC-RRTs of PBDEs on the seven stationary phases depend strongly on the Wiener index, the first topological index used in chemistry [39]. It is the sum of the distances between all pairs of vertices in a hydrogen-suppressed molecular graph. The correlations between the RRTs of PBDEs and the Wiener index show that the magnitude of the intermolecular interactions between the compounds and the stationary phase is related to the degree of branching of the molecules. Within individual homologs, the Wiener index tends to increase with an increasing number of para bromines and to decrease with an increasing number of ortho bromines. Within a PBDE homolog, the positions of the bromines determine the optimal molecular conformation and affect the GC retention. Undoubtedly, the differences in RRTs among structural isomers may not be fully accounted for by the Wiener index alone (cf. the plots of Ref. [16]). In addition, other studies have also confirmed the influence of the Wiener index on RRT. For example, Wang and Li [10] concluded that the intermolecular interactions between PBDEs and the stationary phase are related to the Wiener or Randic index. Wang and Yao [40] showed that there were significant relationships between the retention indices of saturated esters on seven stationary phases of different polarity and the Wiener index.
Influence of Different Stationary Phases
By analyzing our correlation coefficients R2 and cross-validated coefficients Q2loo of PBDEs (Table 5) on similar stationary phases (DB-1, DB-5, DB-17, DB-XLB, HT-8 and CP-Sil 19), we can clearly see that the correlation coefficients decrease (R2 from 0.9983 to 0.9921) across these stationary phases. It is well known that the polarity of the stationary phase affects the elution order of compounds. Meanwhile, this point reflected in the model represents the
influence of correlation coefficients R2 and cross-validated coefficient Q2loo. Therefore, we further confirmed that, as the polarity of the stationary phase increases, the deviations from linearity between the sets of RRTs increase. Further, the retention database for PCBs produced by Frame [41] also showed a correlation between the elution orders of these two classes of pollutants. The overall result is therefore a further indication of the usefulness of the present approach to retention time prediction for various classes of polyhalogenated aromatic compounds.
The Result of SVM Model
The purpose of this work was to develop novel models capable of predicting the chromatographic relative retention times (RRTs) of PBDEs on various stationary phases using the SVM algorithm. After the GA-MLR model had been established, SVM was used to develop a model from the training set compounds, based on the same subset of descriptors. The test set, which consisted of 13 molecules, was used to guard against SVM overtraining. The performance of SVM for regression depends on the combination of several factors, such as the kernel function type, the capacity parameter c, ε of the ε-insensitive loss function and its corresponding parameters [33]. The SVM parameters were optimized by the grid search method. The optimum values of c and of ε of the ε-insensitive loss function in the SVM model were 16 and 0.0625, respectively. All programs were written in-house in Matlab. The results for the external validation sets and test sets of the 7 stationary phases are shown in Fig. 2. From the statistical parameters of the SVMs (Table 6), we can conclude that the validation-set correlation coefficients R2 of our SVM models are larger than 0.9903, except for the CP-Sil 19 column, for which R2 is 0.9556. The RMS of all validation sets is less than 0.014, except for the CP-Sil 19 column, for which it is 0.055. This indicates that our SVM models have external predictive ability.
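The parameter optimization described above can be pictured with the following sketch, which shows the ε-insensitive loss and an exhaustive grid search over candidate (c, ε) pairs. The base-2 grid, the function names and the toy error surface are assumptions: the text states only that a grid search was used and that the optimum was c = 16 and ε = 0.0625. In practice the score would be a cross-validated error of the trained SVM; here a made-up surface with its minimum placed at the reported values is used so that the example has a known answer.

```python
import itertools
import math

def eps_insensitive_loss(y_pred, y_true, eps):
    """Errors smaller than eps cost nothing; larger errors cost linearly."""
    return max(0.0, abs(y_pred - y_true) - eps)

def grid_search(score, c_exponents, eps_exponents):
    """Return the (c, eps) pair on a base-2 grid with the lowest score."""
    c_values = [2.0 ** i for i in c_exponents]
    eps_values = [2.0 ** j for j in eps_exponents]
    return min(itertools.product(c_values, eps_values),
               key=lambda pair: score(*pair))

def toy_cv_error(c, eps):
    # Made-up stand-in for a cross-validated error (not computed from any SVM);
    # its minimum is deliberately placed at c = 2**4 and eps = 2**-4.
    return (math.log2(c) - 4) ** 2 + (math.log2(eps) + 4) ** 2

best_c, best_eps = grid_search(toy_cv_error, range(-8, 9), range(-8, 9))
print(best_c, best_eps)
print(eps_insensitive_loss(1.0, 1.5, best_eps))
```

An exponential grid is a common choice because suitable values of c and ε can span several orders of magnitude; a finer grid around the best coarse point can follow if needed.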
By comparing the residual plots, we can conclude that the points for all PBDEs in the validation sets and test sets on the 7 stationary phases are distributed near the zero line, meaning that there are no systematic errors in our calculated results. These results verify that the SVM QSRR models for each column have good fitting ability and statistical significance.
Comparison of SRD of Different SVM Models
The quality of the models was also checked by the sum of ranking differences (SRD) validation method [42], a recently developed ranking method for model comparison. The closer the SRD value is to zero, the better the model. Using the data of Table 1 and taking the experimental data of each of the seven columns in turn as the reference, we calculated the SRD of the SVM-predicted relative retention times of PBDEs on the seven stationary phases. The SRD values are compared with the model correlation coefficients R2 in Table 7.
According to the SRD values (Table 7), the DB-1 and DB-5 models are clearly the best two, and CP-Sil 19 is the worst. The order of the models by goodness of prediction is DB-1 > DB-5 > DB-17 > DB-XLB > HT-5 > HT-8 > CP-Sil 19, where the sign “>” means that the former model is better than the latter. According to the R2 values, the order is DB-1 > DB-5 > DB-XLB > DB-17 > HT-8 > HT-5 > CP-Sil 19. Comparing the R2 and SRD values of the DB-17 and DB-XLB columns shows that the two models perform similarly; the same holds for the HT-5 and HT-8 columns.
Finally, we compare our work with other QSRR models. Several authors have constructed QSRR models for predicting the retention of PBDEs. Some of their statistical parameters are listed in Table 5. Our training set correlation coefficients R2 and cross-validated coefficients Q2 loo, including those for the CP-Sil 19 column, are all higher than the literature results. It is evident that our SVM models exhibit better descriptive and predictive ability than the other QSRR models established so far.
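The SRD comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration, assuming that the SRD of one model is the sum over congeners of the absolute difference between the rank of the predicted RRT and the rank of the experimental (reference) RRT, with tied values sharing their mean rank. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper's data.

```python
def average_ranks(values):
    """Rank values in ascending order (1-based); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def sum_of_ranking_differences(reference, predicted):
    """Sum of absolute rank differences between reference and predicted values."""
    ref_ranks = average_ranks(reference)
    pred_ranks = average_ranks(predicted)
    return sum(abs(r - p) for r, p in zip(ref_ranks, pred_ranks))


# Illustrative RRTs only (not from the paper): one pair of congeners swaps order.
experimental = [0.10, 0.25, 0.40, 0.55]
predicted = [0.11, 0.42, 0.39, 0.56]
srd = sum_of_ranking_differences(experimental, predicted)
```

A model that reproduces the experimental elution order exactly gives an SRD of zero, which is why smaller values indicate better models.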
Conclusions
QSRR models relating the GC-RRTs of 126 PBDE congeners on 7 stationary phases to various topological and connectivity index molecular descriptors were successfully established using the SVM method. The molecular descriptors selected by GA-MLR in this work, such as the Wiener index, also confirm that the intermolecular interactions between PBDEs and the stationary phase are related to the degree of branching of the molecules. The SVM results show that the models have strong predictive ability, with high correlation coefficients R2 for both training and validation sets, and robust stability, with excellent values of the statistical parameter Q2 loo. Acceptable values of the different validation parameters, namely internal prediction, external prediction and sum of ranking differences, confirm the reliability of the models. Finally, our SVM QSRR models of 126 PBDE congeners are more accurate and reliable than those existing in the literature. In addition, we expect that our QSRR models may provide some guidance for the rapid analysis of PBDE congeners in complex environmental samples.
Fig. 2 Experimental vs. predicted RRTs of PBDE congeners in the validation sets and test sets on different columns, and distribution of residuals
Table 6 The statistical parameters for SVM
Column     Train R2   Train RMSE   Validation R2   Validation RMSE   Test R2   Test RMSE
DB-1       0.9983     2.97E-4      0.9963          3.97E-3           0.9981    2.96E-3
DB-5       0.9980     3.34E-4      0.9963          3.10E-3           0.9983    4.32E-3
HT-5       0.9962     6.33E-4      0.9967          3.30E-3           0.9972    4.16E-3
DB-17      0.9977     3.61E-4      0.9967          3.66E-3           0.9985    9.27E-3
DB-XLB     0.9978     3.41E-4      0.9968          3.36E-3           0.9970    10.00E-3
HT-8       0.9969     4.12E-4      0.9903          13.71E-3          0.9942    17.25E-3
CP-Sil19   0.9921     7.58E-4      0.9556          54.98E-3          0.9969    5.46E-3
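For readers who want to reproduce statistics of the kind reported in Table 6, the sketch below computes R2 and RMSE from observed and predicted RRTs. It assumes that R2 is the coefficient of determination, 1 − SS_res/SS_tot; the paper calls it a correlation coefficient and does not give the formula, so a squared Pearson correlation is also possible. The function names and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not from the paper).
observed = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.5]
print(r_squared(observed, predicted), rmse(observed, predicted))
```

The same two functions can be applied separately to the training, validation and test sets of each column.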
Table 7 SRD values compared with model correlation coefficients R2 of seven GC columns
Column     R2       SRD
DB-1       0.9983   266
DB-5       0.9980   266
HT-5       0.9962   300
DB-17      0.9977   282
DB-XLB     0.9978   288
HT-8       0.9969   368
CP-Sil19   0.9921   608
Acknowledgments The corresponding author is grateful for the financial support from the National Natural Science Foundation of China (21376114), and Department of Education of Liaoning Province, which made this work possible.
References
1. Renner R (2000) Environ Sci Technol 34:452–453
2. Hale RC, La Guardia MJ, Harvey E, Gaylor MO, Mainor TM (2006) Chemosphere 64:181–186
3. Hazrati S, Harrad S (2006) Environ Sci Technol 40:7584–7589
4. Lundstedt-Enkel K, Karlsson D, Darnerud PO (2010) J Chemometrics 24:710–718
5. Cunha S, Kalachova K, Pulkrabova J, Fernandes J, Oliveira M, Alves A, Hajslova J (2010) Chemosphere 78:1263–1271
6. Meerts IA, Van Zanden JJ, Luijks EA, van Leeuwen-Bol I, Marsh G, Jakobsson E, Bergman Å, Brouwer A (2000) Toxicol Sci 56:95–104
7. Vonderheide AP, Mueller KE, Meija J, Welsh GL (2008) Sci Total Environ 400:425–436
8. Wei H, Yang R, Li A, Christensen ER, Rockne KJ (2010) J Chromatogr A 1217:2964–2972
9. Vonderheide AP (2009) Microchem J 92:49–57
10. Wang Y, Li A, Liu H, Zhang Q, Ma W, Song W, Jiang G (2006) J Chromatogr A 1103:314–328
11. Zhang X, Ding L, Sun Z, Song L, Sun T (2009) Chromatographia 70:511–518
12. Zhang XT, Ren C, Guo JJ, Song LJ, Sun T (2012) Adv Mater Res 347:1769–1773
13. Fatemi MH, Ghorbanzad’e M, Baher E (2010) Anal Lett 43:823–835
14. Fatemi MH, Malekzadeh H (2012) J Sep Sci 35:2088–2094
15. Rayne S, Ikonomou MG (2003) J Chromatogr A 1016:235–248
16. Korytár P, Covaci A, de Boer J, Gelbin A, Brinkman UAT (2005) J Chromatogr A 1065:239–249
17. Yi Z, Li L, Zhang A, Wang L (2011) Chin J Chem 29:2495–2504
18. D’Archivio AA, Giannitto A, Maggi MA (2013) J Chromatogr A 1298:118–131
19. Zhang YH, Liu SS, Liu HY (2007) Chromatographia 65:319–324
20. Goodarzi M, Jensen R, Vander Heyden Y (2012) J Chromatogr B 910:84–94
21. Cai W, Xia B, Shao X, Guo Q, Maigret B, Pan Z (2001) J Mol Struct: THEOCHEM 535:115–119
22. Noorizadeh H, Farmany A (2010) Chromatographia 72:563–569
23. Whitley D (1994) Stat Comput 4:65–85
24. Jalali Heravi M, Kyani A (2007) Eur J Med Chem 42:649–659
25. Vapnik V (1998) Statistical learning theory. Wiley, New York
26. Luan F, Xue C, Zhang R, Zhao C, Liu M, Hu Z, Fan B (2005) Anal Chim Acta 537:101–110
27. Zhang J, Wang B, Zhang X, Huang DS, Zhang X, Reyes García C, Zhang L (2010) Advanced intelligent computing theories and applications. Springer, Berlin, pp 83–90
28. Cong Y, Li BK, Yang XG, Xue Y, Chen YZ, Zeng Y (2013) Chemom Intell Lab Syst
29. Vladimir VN, Vapnik V (1995) The nature of statistical learning theory. Springer, Heidelberg
30. Schölkopf B, Sung KK, Burges CJ, Girosi F, Niyogi P, Poggio T, Vapnik V (1997) Signal Proces, IEEE T 45:2758–2765
31. Smola AJ, Schölkopf B (2004) Stat Comput 14:199–222
32. Tetko IV, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P, Palyulin VA, Radchenko EV, Zefirov NS, Makarenko AS (2005) J Comput Aid Mol Des 19:453–463
33. Riahi S, Pourbasheer E, Ganjali MR, Norouzi P (2009) J Hazard Mater 166:853–859
34. Pourbasheer E, Riahi S, Ganjali MR, Norouzi P (2009) Eur J Med Chem 44:5023–5028
35. Pourbasheer E, Riahi S, Ganjali MR, Norouzi P (2010) Eur J Med Chem 45:1087–1093
36. Randić M (2001) J Mol Graph Model 20:19–35
37. Moskovkina M, Bangov I, Patleeva AZ (2013) Bulg Chem Commun 45:9–23
38. Bangov I, Moskovkina M, Patleeva A (2010) Bulg Chem Commun 42:338–342
39. Dobrynin AA, Entringer R, Gutman I (2001) Acta Appl Math 66:211–249
40. Wang Y, Yao X, Zhang X, Zhang R, Liu M, Hu Z, Fan B (2002) Talanta 57:641–652
41. Frame GM (1997) Fresen J Anal Chem 357:714–722
42. Héberger K, Kollár-Hunek K (2011) J Chemometrics 25:151–158