Article pubs.acs.org/EF

Multiobjective Feature Selection Approach to Quantitative Structure Property Relationship Models for Predicting the Octane Number of Compounds Found in Gasoline

Zhefu Liu,† Linzhou Zhang,*,† Ali Elkamel,‡ Dong Liang,† Suoqi Zhao,† Chunming Xu,† Stanislav Y. Ivanov,§ and Ajay K. Ray§

† State Key Laboratory of Heavy Oil Processing, China University of Petroleum, Beijing 102249, P. R. China
‡ Chemical Engineering Department, University of Waterloo, Waterloo N2L 3G1, Canada
§ Western University, London N6A 3K7, Canada

* Supporting Information available.

ABSTRACT: Octane number is one of the most important factors for determining the price of gasoline. The increasing popularity of molecular models in petroleum refining has made predicting key properties for pure components more important. In this paper, quantitative structure property relationship (QSPR) models are developed to predict the research octane number (RON) and motor octane number (MON) of pure components using two databases. The databases include oxygenated and nitrogen-containing compounds as well as hydrocarbons collected from published data. QSPR models are widely utilized because they effectively characterize molecular structures with a variety of descriptors, especially different isomeric structures. Feature subset selection is an important step for increasing the performance and reducing the complexity of a QSPR model by removing redundant and irrelevant descriptors. A two-step feature selection method is developed to identify appropriate subsets of descriptors from a multiobjective perspective: (1) a filter using the Boruta algorithm to remove noise features and (2) a multiobjective wrapper to simultaneously minimize the number of features and maximize the model accuracy. The multiobjective wrapper accounts for both the complexity and generalizability of models to resist overfitting, which commonly occurs when using a single-objective feature selection method. In the proposed procedure, optimized subsets of descriptors are used to build the final QSPR models to predict the RON and MON of pure components via support vector machine regression. The proposed models are competitive with other models found in the literature.

1. INTRODUCTION
The octane number reflects the antiknock properties of gasoline in an engine's cylinder and is a determining factor in the price of gasoline. Engine knocking is a self-detonation phenomenon of a fuel−air mixture that occurs in the absence of a spark. Low knocking resistance of fuel may result in increased fuel consumption, loss of power, and even engine damage.1 Two of the most commonly utilized approaches for measuring antiknock properties in the laboratory are the research and motor methods, which utilize different test engine working modes.2,3 The outcomes of these tests are quantitatively expressed as the research octane number (RON) and the motor octane number (MON). The engine tests used to assess the octane number are expensive and time-consuming. Therefore, in recent decades, intensive research has focused on developing correlation models to predict the RON and MON. Researchers have combined multivariate calibration with different analytical techniques, such as near-infrared spectroscopy and gas chromatography, to predict the RON and MON of mixtures. Kelly et al. utilized near-infrared spectral features in the range of 660−1215 nm to predict the octane number of 43 unleaded gasoline samples.4 They adopted partial least-squares (PLS) for the spectral correlation of octane numbers and obtained satisfactory results. Cooper et al. used Fourier transform (FT)-Raman spectroscopy coupled with PLS to determine the octane number of commercial petroleum fuels.5 Bao and Dai proposed a novel method based on PLS to detect outliers in spectral data collected via near-infrared and Raman spectroscopy.6 Nikolaou et al. developed a nonlinear model to calculate the RON of isomerization gasoline using gas chromatography data.7 They calculated the weight factor representing nonlinearity based on published pure component and blend data. Because of the application of molecular models in petroleum refining, predicting key properties for pure components has become more important. The models utilize an ensemble of molecular components rather than a bulk mixture of all components in a petroleum fraction to simulate the refining processes on a molecular level.8 Accurately predicting the properties, such as RON and MON, of individual compounds in the product fractions and determining the appropriate mixing rules are necessary for constructing effective molecular models. These models enable the design of more effective gasoline blending methods that concentrate higher octane number molecules in the target products while saving time and energy during the refining process.

Received: December 7, 2016. Revised: April 27, 2017. Published: April 28, 2017.

DOI: 10.1021/acs.energyfuels.6b03251 Energy Fuels XXXX, XXX, XXX−XXX


The octane number of pure components is highly dependent on the molecular structure. For example, the octane numbers of aromatics are affected by the ring type (five-member ring or six-member ring) and by the location and branching degree of the alkyl groups attached to the rings. These factors are associated through nonlinear relationships that determine the target properties. Empirical models have been widely adopted to correlate chemical properties with descriptors characterizing molecular structures. Kahrs et al. proposed that correlation models can be divided into two groups: group/bond contribution methods and "most significant common features" methods.9,10 The first group utilizes predefined fragments (groups) in molecules to calculate the property of interest. Albahri proposed a structural group contribution method to predict the octane number of pure hydrocarbon liquids.11 Two types of models comprising 22 and 33 structural groups were built based on more than 200 hydrocarbons, and the best predicted average deviations of RON and MON were 4.0 and 5.7 units, respectively. However, the group contribution methods neglect the positions of the group connections within a molecule, which can make distinguishing isomers difficult. In the second group of methods, molecular descriptors are calculated using group contributions, quantum chemical methods, simulated molecular mechanics, and the topology of the molecules.12 These methods are broadly designated quantitative structure property relationship (QSPR) models. QSPR approaches avoid the shortcomings of group contribution methods because they account for the molecular structures by using a variety of descriptors to characterize the differences between isomeric structures.13 QSPR models are highly promising for assessing target properties based on descriptors and are derived solely from molecular structures by fitting available experimental data in the absence of domain knowledge.

The models have been successfully used to predict pure component properties of interest to researchers. Saldana et al. adopted QSPR methods to predict the flash point and cetane number of fuel compounds, including oxygenated compounds.14 The boiling points, specific gravities, and refractive indices of hydrocarbons were predicted by Ha et al.13 Additionally, Al-Fahemi et al.15 developed a QSPR model to predict the RON of hydrocarbons. They constructed an eight-parameter linear model using multiple linear regression (MLR) based on 40 training molecules to predict the RON of 25 molecules with a determination coefficient of R2 = 0.942 and a standard error of s = 6.328 units. However, the relations between the descriptors and octane number are complicated, and linear models do not always provide satisfactory outcomes. To determine the MON of paraffins, Zhu et al.16 developed a nonlinear QSPR model using support vector machine (SVM) techniques based on 72 molecules, with a determination coefficient of R2 = 0.934 and an average absolute error of 3.961 units. Six descriptors were selected using PLS wrapped in a genetic algorithm. Both Al-Fahemi and Zhu used small databases, which restricts the application range of the models. This study develops an accurate general QSPR model to predict the RON and MON of pure components based on large and diverse databases that include oxygenates, nitrogen compounds, and hydrocarbons.

Thousands of descriptors are often included as independent variables in QSPR models. However, only a small subset carries the necessary information related to the target property. Although it is possible to construct QSPR models that include all the descriptors, there are advantages to minimizing the number of descriptors: (i) potentially improving the prediction accuracy by removing redundant and irrelevant descriptors, (ii) increasing the computing speed by decreasing the number of inputs, and (iii) decreasing the degrees of freedom when the number of descriptors is large relative to the number of available samples.17 The two main strategies used to reduce the number of descriptors are the filter and wrapper methods.18 The filter method removes features that do not greatly contribute to the model and is independent of the regression algorithm (predictor). It considers performance evaluation measures computed directly from the data without feedback from predictors.19 In most cases, relevance scores between features and properties are calculated using a function, such as the correlation coefficient or information content (mutual information), and low-scoring features are removed. The filter method is substantially more computationally efficient than other methods since it avoids the evaluation of feature subsets; however, it can bias the prediction because it neglects the intercorrelations among features. Thus, the filter method is prone to generating subsets that contain variables with high collinearity, which may lead to losing important inputs.20 The wrapper method avoids these problems by considering such bias during the feature subset evaluation.21 The core of the wrapper comprises two components: an induction algorithm trained on the samples, used as a supervised evaluation model to assess candidate feature subsets, and a search component that looks for more accurate feature subsets based on the evaluation of the induction algorithm. The feature selection algorithm runs as a wrapper around the induction algorithm to optimize the feature subsets.

The wrapper can achieve higher accuracy than the filter, but it often adopts a large number of variables for a small incremental improvement in model performance, which can result in an overfitted model. Occam's razor, or the principle of parsimony, states that all necessary information should be used for modeling and nothing more; thus, it is undesirable to include unnecessary features in the model.22 Feature selection therefore presents a multicriterion optimization problem with two competing objectives: the prediction accuracy and the complexity of the model (e.g., the number of features). The approach of combining different criteria into one fitness function, as used in genetic algorithms, has been studied to search for appropriate feature subsets.23,24 In such a fitness function, a cost term representing the complexity of a candidate feature subset is summed with the accuracy term using a particular weight. The main drawback of this approach is the difficulty of establishing an appropriate trade-off between the different terms in one fitness function. To avoid this problem, Emmanouilidis et al.25 proposed multiobjective optimization to simultaneously maximize the solution performance and minimize the number of features. Multiobjective optimization has been shown to be effective on various commonly used data sets in the field of feature selection. Wrapper methods based on multiobjective optimization prefer solutions with a smaller number of variables, in contrast to the larger number of variables selected by single-objective wrappers. Therefore, multiobjective optimization is more powerful in resisting overfitting and has better generalizability in practical applications.26,27
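The contrast between a weighted single fitness function and the multiobjective (Pareto) view of feature selection can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation, and the candidate (subset size, MAE) pairs are hypothetical; both objectives are minimized.

```python
# Illustrative sketch (not the authors' code): a weighted single-objective
# fitness versus Pareto dominance for feature selection. Both objectives
# are minimized: feature subset size and prediction error (MAE).

def dominates(a, b):
    """Candidate a dominates b if it is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep only the candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# Hypothetical (size, MAE) pairs for four candidate descriptor subsets.
candidates = [(30, 6.1), (12, 6.3), (12, 7.9), (45, 6.0)]

# A weighted fitness collapses the trade-off into a single number whose
# outcome hinges on the chosen weight...
weighted_best = min(candidates, key=lambda c: 0.01 * c[0] + c[1])

# ...whereas the Pareto front preserves every defensible trade-off.
front = pareto_front(candidates)
print(weighted_best)  # one solution, sensitive to the weight
print(front)          # several nondominated (size, MAE) options
```

Only the dominated candidate (12, 7.9) is discarded by the Pareto comparison; the single fitness function silently discards two of the three defensible trade-offs as well.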


Since the relationships between the RON and MON are not explicit for pure components, it is necessary to build separate models for RON and MON. In this paper, after data preprocessing, two steps of feature selection are used to predict the octane numbers of pure components: (1) a filter, in which the Boruta algorithm conservatively removes redundant descriptors while avoiding the loss of important inputs; and (2) a multiobjective wrapper, which is applied with two objective functions: the number of descriptors and the cross-validation (CV) accuracy on the training data sets. The optimized solutions are tested on validation subsets, and the most suitable feature subsets are selected.

2. MATERIALS
2.1. Experimental Data. The quality and size of a database are critical for building and evaluating a QSPR model. In this study, two databases are built with 279 and 273 samples for the RON and MON, respectively, using API 4528 and the API technical data book,29 and approximately 30 heteroatom compounds are included from reference 30. The databases contain not only pure hydrocarbons but also a number of oxygenates and nitrogen compounds. We aim to create universal QSPR models to predict octane numbers for compounds that could exist in gasoline fractions or potential gasoline alternatives. The octane numbers in the databases range from negative values (for a few compounds) to more than 120. The distribution of the octane numbers is shown in Figure 1. The two dashed red lines indicate the average values calculated from the RON and MON databases.

Figure 1. Distributions of the octane numbers: (a) RON and (b) MON. The dashed red line indicates the average value of the octane number.

The RON and MON databases are randomly divided into three groups: the training set, validation set, and test set, which contain 70%, 20%, and 10% of the samples, respectively. The training set is used to train the machine learning algorithms and to find optimal subsets of descriptors during the feature selection procedure. The validation set is used to determine whether and when overfitting occurs during the multiobjective optimization and provides accuracy scores to determine the feature subset with the greatest generalizability. The selected feature subset is then used to construct the final regression model and to evaluate the predictive error on the test set.

2.2. Molecular Structures. Simplified molecular input line entry specification (SMILES) strings are written manually for each molecule in our database, accounting for complicated cases such as chirality and configurations around double bonds. Open Babel (free software) is then used to convert the SMILES into canonical form.31 Next, the SMILES file is input into E-dragon,32,33 a free online version of the Dragon software, for the descriptor calculations. E-dragon supports three-dimensional (3D) atom coordinate calculations via CORINA34 and optimized 3D structure calculations using HYDROGENS online. A total of 1667 descriptors is generated for each molecule.

3. METHODOLOGY
The RON and MON databases are represented by 279 × 1667 and 273 × 1667 matrices, respectively. Problems can arise when the number of variables is greater than the number of samples; therefore, it is necessary to reduce the dimensions by selecting subsets of the variables. However, obtaining suitable descriptor subsets is difficult because the computational complexity of an exhaustive search is O(2^n), where n is the number of descriptors; when n is large, an exhaustive search is impractical. Thus, feature subset selection is a key procedure for constructing QSPR models. Figure 2 presents an overall flowchart of the proposed methodology.

Figure 2. Research process flowchart.

The data are pretreated before feature selection. In the first stage of feature selection, a filter is adopted to reduce the matrix dimensions since it is computationally fast. A multiobjective wrapper is then utilized to simultaneously refine the prediction accuracy and decrease the number of descriptors. Furthermore, the elite solutions (optimized feature subsets) on the nondominated fronts identified by the multiobjective optimization algorithm are collected to provide more options for the decision makers. The elite solutions are checked for overfitting using the validation data set because the CV used during the training process is not always able to overcome this intricate problem.35,36 Finally, the groups of descriptors performing best on both the training and validation data sets are employed as inputs to a machine learning algorithm to assess the generalizability to independent test subsets.

3.1. Data Preprocessing. Before feature selection, data preprocessing is performed to remove all low-information descriptors from the two databases, including constant, low-variance, and pairwise linearly correlated descriptors. (1) Constant descriptors, defined as those with a standard deviation of less than 0.0001, and descriptors with missing values are removed. (2) Low-variance descriptors, which may become constant descriptors when the data are split into the training, validation, and test subsets, are removed. (3) Pairwise linearly correlated descriptors are identified by the Pearson correlation coefficient (ρm,n), which is commonly used in quantitative structure activity relationship (QSAR)/QSPR models.14,37 When the absolute value of ρm,n for two descriptors is greater than or equal to a threshold value, the two descriptors are matched as a pair, and the one that is more highly correlated with all the other descriptors is deleted. The threshold value is set to 0.98, as recommended by the Dragon software.38
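The preprocessing rules in section 3.1 can be sketched in a few lines. This is an illustrative Python version (the authors worked in R), using the paper's 0.0001 standard deviation tolerance and 0.98 correlation threshold; the toy descriptor columns `x1`–`x4` are hypothetical.

```python
# Sketch of the paper's descriptor preprocessing (illustrative Python,
# not the authors' R code): drop constant descriptors, then prune one
# descriptor from every pair whose |Pearson rho| meets the 0.98 threshold.
from statistics import mean, pstdev

def pearson(x, y):
    """Population Pearson correlation: cov(x, y) / (sigma_x * sigma_y)."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def preprocess(columns, const_tol=1e-4, corr_thresh=0.98):
    """columns: dict of descriptor name -> list of values."""
    # (1)-(2) remove constant / near-constant descriptors
    cols = {k: v for k, v in columns.items() if pstdev(v) > const_tol}
    names = list(cols)
    corr = {(m, n): abs(pearson(cols[m], cols[n])) for m in names for n in names}
    mean_corr = {m: mean([corr[m, n] for n in names]) for m in names}
    # (3) for each highly correlated pair, delete the descriptor that is
    # more correlated with all the other descriptors, as the paper prescribes
    dropped = set()
    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if m not in dropped and n not in dropped and corr[m, n] >= corr_thresh:
                dropped.add(m if mean_corr[m] >= mean_corr[n] else n)
    return [m for m in names if m not in dropped]

# Hypothetical toy descriptors: x2 is an exact multiple of x1, x3 is constant.
data = {"x1": [1.0, 2.0, 3.0, 4.0],
        "x2": [2.0, 4.0, 6.0, 8.0],
        "x3": [5.0, 5.0, 5.0, 5.0],
        "x4": [1.0, 0.0, 2.0, 1.5]}
print(preprocess(data))  # x3 is removed outright; one of {x1, x2} is pruned
```

On the real 1667-column descriptor matrices this same logic reduces the RON and MON databases to the 698 and 682 descriptors reported in section 3.1.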


The Pearson correlation coefficient, which is used to identify collinearity among descriptors, is calculated as

ρm,n = cov(m,n) / (σm σn)    (1)

where ρm,n is the correlation coefficient between descriptors m and n; cov(m,n) is the covariance; and σm and σn are the standard deviations of descriptors m and n, respectively. Some of the correlation coefficients between the descriptors derived from the RON database are shown in Table S1 in the Supporting Information. This table shows that two descriptors, Sp and Sv, are paired successfully. After data preprocessing, 698 and 682 of the 1667 total descriptors remain in the RON and MON databases, respectively. Each remaining descriptor is centered and scaled to a mean of 0 and a standard deviation of 1 to eliminate dimensional differences among descriptors.

3.2. Filter Step. The Boruta algorithm39 efficiently identifies noisy variables among the inputs via a variable importance procedure computed from multiple runs of a random forest40 (RF) algorithm. The algorithm randomly generates a copy variable (shadow attribute, or permuted attribute) for each descriptor and ranks the original descriptors and shadow attributes together based on a Z-score. The Z-scores can be used to evaluate the variable importance distribution adopted by the RF. Only original features with Z-scores exceeding the highest value among the shadow attributes are considered relevant features for the prediction models; the other features are omitted. The computational complexity of the Boruta algorithm depends on that of the RF and increases linearly with the number of features. Thus, this algorithm is suitable for problems in which the number of variables is larger than the number of objects, such as the case considered in this study. The postfilter descriptors are used as inputs to various machine learning algorithms to select the best algorithm for constructing the regression model. That algorithm is then used as the induction algorithm for the multiobjective wrapper in the next step. Linear and nonlinear machine learning algorithms are included in the comparison: PLS, feed-forward neural networks41 (FFNNs), RF,40 and SVM,42 all of which have been shown to be efficient in QSAR/QSPR.

3.3. Multiobjective Wrapper Step. As mentioned in the Introduction, the wrapper method is composed of two components, an induction algorithm and a feature subset selection (optimization) search. In the latter, the candidate feature subsets are encoded as strings of "0" and "1", and the length of a string is equal to the number of features. When a descriptor is selected for a candidate feature subset, the position corresponding to the descriptor is set to "1" in the string; otherwise, it is set to "0". These strings are decoded into feature subsets and imported into the induction algorithm to assess and rank the candidate feature subsets based on a predefined objective function. Depending on the evaluation results, the optimization component in the wrapper conducts a search to find better strings. The optimization component runs around the induction algorithm until the optimization process converges or the objective function reaches the threshold. For regression problems, the common measures used for induction algorithms are the mean absolute error (MAE), mean squared error, and root mean squared error (RMSE). Single-objective wrappers are prone to selecting feature subsets comprising a large number of variables with little improvement in training accuracy, which can decrease the model performance on hold-out samples and result in overfitting. In this study, two objective functions are simultaneously evaluated and optimized using the wrapper method. In addition to the common measure of the MAE, a function assessing the feature subset size is added, conveniently implemented by summing the "1" values in each string.

The multiobjective optimization search generates a convex Pareto front, where each solution is nondominated by any other solution in at least one objective function. The Pareto front reveals the relationship between the size of a feature subset and the performance on the training data set. A curve can be generated by evaluating the optimized feature subsets corresponding to the solutions over a validation data set to verify the generalizability of the Pareto-optimal solutions. The Pareto front and validation curve can be compared to verify whether and when overfitting occurs during feature selection optimization. Furthermore, a multiobjective wrapper can generate an ensemble of feature subsets with different lengths rather than the single subset produced by a single-objective wrapper. The ensemble of feature subsets provides more options for decision makers depending on their demands. The nondominated sorting genetic algorithm II (NSGA-II) is utilized in this study to search for the multiobjective global optima. The method was developed by Deb et al.43 and has been successfully implemented to solve multiobjective optimization problems in a wide range of domains. A detailed description can be found in the original report.44

3.4. Software. The research is implemented in the R open source programming language and software environment.45 The following packages are used in this study: Boruta46 for the Boruta algorithm; randomForest47 for RF; e107148 for PLS, FFNN, and SVM; and mco49 for NSGA-II. The ggplot250 package is used for plotting, and doMC and foreach51 are used for multicore parallel computing on an Ubuntu 14.04 LTS operating system.
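The bitstring encoding and the two objective functions of section 3.3 can be sketched as follows. This is a minimal illustrative Python fragment, not the authors' R implementation; the descriptor names and toy predictions are hypothetical, and a placeholder stands in for the SVM induction step.

```python
# Sketch of the wrapper's candidate encoding (illustrative, not the
# authors' code): each candidate subset is a "0"/"1" string whose length
# equals the number of descriptors. Objective 1 is the subset size
# (the count of "1"s); objective 2 is the induction algorithm's MAE.

def decode(bits, names):
    """Map a 0/1 string onto the descriptor names it selects."""
    return [n for b, n in zip(bits, names) if b == "1"]

def subset_size(bits):
    """Objective 1: number of selected features (sum of the '1' values)."""
    return bits.count("1")

def mae(y_true, y_pred):
    """Objective 2: mean absolute error of the induction algorithm."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

names = ["nC", "MW", "Sp", "Sv", "AMW"]  # hypothetical descriptor names
bits = "01011"
print(decode(bits, names))               # descriptors fed to the model
print(subset_size(bits))                 # objective 1
print(mae([90.0, 85.0], [92.0, 84.0]))   # objective 2 on toy predictions
```

In the actual procedure, NSGA-II evolves a population of such strings, evaluating each by 5-fold CV of the SVM induction algorithm rather than by fixed toy predictions.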

4. RESULTS AND DISCUSSION
4.1. Filter Feature Selection Results. In the Boruta algorithm, the Z-score of each descriptor is computed from 500 RF runs. The default values of the parameters in the RF algorithm are utilized, as suggested by Svetnik et al.52 The parameters are set to 1000 for ntree and to one-third of the number of imported descriptors for mtry (233 for RON, 227 for MON). In the RF algorithm, ntree represents the number of "trees" used in the model, and mtry is the number of variables randomly sampled at each split. Boruta divides the inputs into three categories: Rejected, Tentative, and Confirmed. These categories represent different ways of handling the inputs: deleting, pending, and retaining. In Figure 3, the blue boxes represent the shadow variables that are randomly produced by Boruta, and the rightmost box represents the shadow variable with the highest Z-score. Descriptors with Z-scores statistically lower than that of the rightmost shadow variable are marked by red boxes and can be

Figure 3. Z-scores of a subset of the descriptors computed by the Boruta algorithm. The blue boxes represent the shadow variables that are randomly produced by Boruta; the red boxes represent the deleted descriptors; the yellow boxes represent the pending descriptors; and the green boxes represent the retained descriptors.


safely deleted. Descriptors marked with green boxes should be included in the model. Yellow boxes indicate Z-scores that are close to that of the rightmost shadow variable, for which the Boruta algorithm cannot decide a status. In this study, to avoid losing important descriptors, the descriptors in the Tentative group are retained and combined with those in the Confirmed group as inputs of the multiobjective wrapper. The distributions of the descriptors in the three categories are shown in Table 1.

Table 1. Distribution of the Number of Descriptors in the Three Categories Derived from the Boruta Algorithm

category     RON   MON
rejected     518   497
tentative     54    66
confirmed    126   119
total        698   682

Table 1 shows that the descriptor distributions for the RON and MON databases are similar. Most of the descriptors in the two databases are in the Rejected group. After combining the descriptors in the Tentative and Confirmed categories, 180 and 185 descriptors remain in the RON and MON databases, respectively.

4.2. Selection of Regression Models. In this section, several regression algorithms are implemented and compared on the postfiltered databases to determine the most effective algorithm for predicting the RON and MON. Only the most effective algorithm is used to construct the regression models. The tested algorithms include PLS, FFNN, RF, and SVM. The 5-fold cross-validation (CV) over the training set is adopted to tune the regression parameters. The samples included in each fold are the same for each method so that the regression approaches are compared under identical conditions. The validation and test subsets are used to measure the generalizability of the regression models. Five molecules with large residuals are identified during model training for both the RON and MON databases. These molecules are identified as outliers and deleted from the databases. One of the main causes of outliers is that the databases contain various compounds with a large range of molecular weights. Thus, because of sparsity in the data space, the outliers are isolated from the other samples, which increases the difficulty of precisely extrapolating their values using machine learning algorithms. The outliers are not invalid samples, and they will be studied in our future work. The molecule distributions of the training, validation, and test subsets in the databases after deleting the outliers are presented in Table 2.

Table 2. Molecular Distributions over the Subsets in the RON and MON Databases

category     RON   MON
training     197   192
validation    51    51
test          26    25
total        274   268

The performance of the regression methods depends on the tuning parameters. The following parameter ranges are used for tuning: 1 to 10 components for PLS; 5 to 60 (in increments of 5) neurons in the hidden layer for FFNN; 1000 "trees" and one-third of the descriptors for "mtry" in RF; and a grid search for SVM, where epsilon (ε) of the insensitive-loss function ranges from 0.1 to 0.8 in increments of 0.1, the cost of constraints violation C is the vector [2^3, 2^4, ..., 2^8], and gamma (γ), which represents the width of the radial basis function, ranges from 0.005 to 0.020 in increments of 0.005. The global regression model performance for each subset of the two databases is shown in Tables 3 and 4. The numbers in parentheses indicate the final optimized values of the tuning parameters. MAE and R2 are used to compare the model performance. The CV errors, as an internal validation measure of the training set fit, are combined with the external validation errors calculated for the validation and test subsets to determine the total error, which is displayed in the last row of each table. Comparing the last rows of Tables 3 and 4 shows that the regression model performance decreases in the following order for both RON and MON predictions: SVM > RF > FFNN > PLS. The MAEs of SVM are superior to those of RF, the second-best model, by 1.30 and 1.79 units for the RON and MON predictions, respectively. The differences among the other models are less than 1 unit. With the exception of the test subset of the MON database, SVM outperforms all other methods over all database subsets. Therefore, whether the entire database or only a partial database is used, SVM is the ideal induction algorithm for the multiobjective wrapper. To explain the order of superiority: the linear PLS is inferior to the three nonlinear methods since there is a highly nonlinear relationship between the descriptors and octane number. SVM is preferable to FFNN because SVM utilizes structural risk minimization rather than the empirical risk minimization used by the artificial neural network, which helps avoid overfitting.53 For SVM and RF, there is no definite theory to explain which algorithm is better.

Statnikov et al.54 showed that SVM generally outperformed RF on 22 microarray data sets. In this study, the data sparsity is an additional factor favoring SVM. Additionally, SVM utilizes support vectors instead of all the training samples to construct the regression model, which can decrease the impact of unrelated samples. However, the performance of SVM relies on selecting the model parameters, which is time-consuming. The 3D (ε, γ, and C) scatterplots of SVM for RON and MON prediction using postfilter descriptors are shown in Figures S1 and S2 in the Supporting Information. The average RMSE values in the 5-fold CV are used to evaluate the parameter combinations. The scale is expressed by the color bars on the right side of the two plots. Both figures show that the average RMSE decreases with decreasing gamma (γ) and increasing cost (C). The impact of epsilon (ε) on the performance of the SVM model can be neglected since the colors of all points in Figures S1 and S2 are constant in the vertical direction. Thus, epsilon can be neglected during wrapper optimization, which reduces the computing time.

4.3. Results of the Multiobjective Wrapper. NSGA-II uses the following parameters: mutation and crossover probabilities of 0.3 and 0.7, respectively; mutation and crossover distribution indices of 5; 100 individuals per generation; and 200 total generations per run. The impact of the NSGA-II parameters on performance is not remarkable, and the procedure for tuning the parameters is not described in detail. In practice, it is difficult to guarantee that the solutions are truly Pareto optimal; thus, a set of nondominated solutions forming a nondominated front is utilized to approximate the

DOI: 10.1021/acs.energyfuels.6b03251 Energy Fuels XXXX, XXX, XXX−XXX


Table 3. Performance of the Global Regression Models over Each Subset in the RON Database Based on the Post-Filtered Descriptors

                PLS (8 components)   FFNN (30 neurons)   RF (1000 trees, mtry = 60)   SVM (ε = 0.1, γ = 0.005, C = 256)
  data set      MAE      R2          MAE      R2         MAE      R2                  MAE      R2
  5-fold CV     9.25     0.75        9.08     0.75       8.49     0.79                7.20     0.78
  validation    5.84     0.88        6.87     0.86       4.33     0.93                3.66     0.96
  test          8.18     0.65        4.73     0.85       6.76     0.62                4.23     0.88
  total(a)      8.51     0.74        8.25     0.75       7.55     0.79                6.25     0.81

(a) The total indicates the summed performance for all training, validation, and test data sets. The training data set is included since its performance is based on the 5-fold CV, which is a type of internal validation.
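The (ε, γ, C) values reported in the table headers come from the 5-fold CV tuning described above. A hedged sketch of such a search with scikit-learn follows; the grid is built around the reported values, since the authors' exact search ranges for SVM are not given in this excerpt, and the data are synthetic.

```python
# Illustrative grid search for epsilon-SVR hyperparameters by 5-fold CV.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=150)

# Grid values chosen around those reported in Tables 3 and 4 (assumed, not
# the authors' actual ranges).
grid = {"C": [64, 128, 256], "gamma": [0.005, 0.02, 0.1], "epsilon": [0.1, 0.8]}
search = GridSearchCV(SVR(kernel="rbf"), grid,
                      cv=KFold(5, shuffle=True, random_state=1),
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
best = search.best_params_  # parameter combination with the lowest CV MAE
```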

Table 4. Performance of the Global Regression Models over Each Subset in the MON Database Based on the Post-Filtered Descriptors

                PLS (7 components)   FFNN (35 neurons)   RF (1000 trees, mtry = 62)   SVM (ε = 0.8, γ = 0.005, C = 128)
  data set      MAE      R2          MAE      R2         MAE      R2                  MAE      R2
  5-fold CV     8.66     0.77        7.62     0.75       7.82     0.74                5.44     0.86
  validation    6.37     0.80        6.90     0.80       5.98     0.81                5.50     0.83
  test          6.75     0.70        5.83     0.81       4.59     0.83                4.66     0.82
  total(a)      7.95     0.73        7.32     0.75       7.17     0.75                5.38     0.86

(a) The total indicates the summed performance for all training, validation, and test data sets. The training data set is included since its performance is based on the 5-fold CV, which is a type of internal validation.
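The multiobjective wrapper evaluated next ranks candidate feature subsets by Pareto dominance over two minimized objectives: subset length and prediction error. The dominance test at the core of NSGA-II can be sketched as follows, with hypothetical (length, error) pairs for illustration.

```python
# Minimal Pareto-front extraction for the two wrapper objectives:
# f1 = number of descriptors, f2 = error (both minimized). This is the
# dominance test inside NSGA-II, not the full algorithm.
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (subset_length, error) pairs
candidates = [(5, 8.0), (10, 6.0), (12, 5.4), (20, 5.3), (20, 6.5), (30, 5.35)]
front = sorted(pareto_front(candidates))
# (20, 6.5) is dominated by (10, 6.0); (30, 5.35) is dominated by (20, 5.3)
```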

Pareto front with acceptable trade-offs between objective functions. The performance of the multiobjective wrapper using SVM regression coupled with NSGA-II is shown in Figures 4 and 5 for the RON and MON databases, respectively.

Figure 4. Performance of multiobjective feature selection on the RON database. MAE is shown by the blue line, and RMSE is represented by the red line.

Figure 5. Performance of multiobjective feature selection on the MON database. MAE is shown by the blue line, and RMSE is represented by the red line.

The x axis shows the number of descriptors in the candidate feature subsets (i.e., the complexity of the model), and the y axis shows the errors. To distinguish the two curves calculated over the training and validation subsets, different measures are used to estimate the model performance: the RMSE averaged over the 5-fold CV of the training data is used for the training error, and the MAE is used for the validation error. The nondominated fronts generated by optimization over the training subsets are displayed as solid red lines, and the nondominated solutions over the validation data sets are shown as dashed blue lines. Figures 4 and 5 show that the solid red lines decrease with an increasing number of descriptors, although the rate of improvement diminishes as descriptors are added. Therefore, adding descriptors to a feature subset enhances the internal prediction ability over the training data, but this improvement shrinks with increasing feature subset length. In contrast, the dashed blue lines indicate that the external generalizability of a model does not decrease monotonically like the red lines. Figure 4 shows a clear valley in the blue line, while Figure 5 shows the blue line fluctuating along the x axis and increasing after the dashed black line, providing strong evidence of overfitting caused by the optimization process. One of the main causes of overfitting is that the wrapper method introduces an excessive number of descriptors into the feature subsets to maximize the internal accuracy over the training samples; even CV fails to fully overcome this issue.35,36

The multiobjective wrapper successfully identifies the overfitting problem. To avoid overfitting, the validation subsets are used to re-evaluate the solutions on the nondominated fronts. On the basis of the performance over the validation subsets, feature subsets of 12 and 23 descriptors, indicated by the dashed black lines in Figures 4 and 5, are selected to build the regression models for the RON and MON data, respectively. These feature subsets simultaneously provide the highest accuracies on the validation data sets and acceptable internal prediction capacity on the training samples. The choice is not uniquely defined, allowing researchers to make their own decisions based on the two figures. For example, the point indicated by the triangle in Figure 5 has a slightly higher error than the lowest value over the validation subset, but its feature subset has only 13 descriptors, much shorter than the subset selected in this paper; this shorter subset is a good option for MON prediction in terms of simplicity.

Furthermore, the optimization search depth can impact the quality of the nondominated fronts generated by NSGA-II.55 The nondominated fronts calculated at the 10th, 50th, 100th, 150th, and 200th generations of NSGA-II over the RON and MON databases are displayed in Figures 6 and 7, respectively. The

nondominated fronts calculated at the 200th generation are displayed as purple curves, and the solutions produced in preceding generations are shown in other colors. The figures show that the points corresponding to different generations gradually approach the purple curves until the points of the 150th generation almost coincide with those of the 200th generation. The optimization is stopped at the 200th generation because the improvement over the final 50 generations is negligible. The determination of the optimal time to stop the search is beyond the scope of this work.

Figure 6. Generation evolution of NSGA-II on RON.

Figure 7. Generation evolution of NSGA-II on MON.

The feature subsets selected by the multiobjective wrapper (MOW) over the RON and MON databases are coupled with the SVM (MOW_SVM) to construct the final regression models, and their performance is summarized in Tables 5 and 6. The results of the regression models computed with the postfiltered features combined with SVM (F_SVM) are provided for comparison. The values in parentheses represent the number of descriptors used in each model. The parameters of the MOW_SVM models, optimized via 5-fold CV, are γ = 0.02 and C = 256 for RON and γ = 0.02 and C = 128 for MON.

Table 5. Performance of the Models Obtained by Feature Selection Methods on RON

                      F_SVM (180)       MOW_SVM (12)
  data set            MAE      R2       MAE      R2
  5-fold CV           7.20     0.78     5.95     0.90
  validation          3.66     0.95     3.86     0.94
  test                4.23     0.88     3.97     0.85
  total(a)            6.25     0.81     5.37     0.90
  prediction MAE(b)   3.85     0.94     3.90     0.92

(a) The total indicates the summed performance over all training, validation, and test data sets. The training data set is included since its performance is based on the 5-fold CV, which is a type of internal validation. (b) The prediction MAE represents the weighted average of the MAEs calculated over the validation and test data sets.

Tables 5 and 6 show that the performance of the feature subsets selected by the multiobjective wrapper is superior to that of the postfiltered descriptors with respect to the total accuracy (second-to-last rows) calculated over the entire data sets. The multiobjective wrapper improves the total MAE by 0.88 and 1.30 units for RON and MON, respectively. Considering the stability and regularization of the SVM,56 this difference is not produced by deviation of the models. In Table 5, MOW_SVM significantly improves the CV accuracy


The optimization time of the multiobjective wrapper was 13.3 h for one run on 10 server cores with parallel computing. Although this is longer than the time required by a wrapper using a single-objective optimization algorithm (e.g., 4.8 h for a genetic algorithm), it is acceptable considering the satisfactory results produced by the multiobjective wrapper.

4.4. Confidence Interval Using Bootstrap. Parity plots comparing the predicted values of RON and MON computed by MOW_SVM and the experimental data are displayed in Figures 8 and 9, respectively. The error bars in the two plots are

Table 6. Performance of the Models Obtained by Feature Selection Methods on MON

                      F_SVM (185)       MOW_SVM (23)
  data set            MAE      R2       MAE      R2
  5-fold CV           5.44     0.86     4.02     0.93
  validation          5.50     0.83     4.39     0.92
  test                4.66     0.82     3.94     0.86
  total(a)            5.38     0.86     4.08     0.92
  prediction MAE(b)   5.22     0.82     4.24     0.90

(a) The total indicates the summed performance over all training, validation, and test data sets. The training data set is included since its performance is based on the 5-fold CV, which is a type of internal validation. (b) The prediction MAE represents the weighted average of the MAEs calculated over the validation and test data sets.
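Footnote b's weighted average can be checked against the subset sizes in Table 2. The sketch below reproduces the reported MOW_SVM prediction MAEs of 3.90 (RON) and 4.24 (MON) from the validation and test MAEs in Tables 5 and 6.

```python
def prediction_mae(mae_val, n_val, mae_test, n_test):
    """Weighted average of the validation and test MAEs (footnote b)."""
    return (mae_val * n_val + mae_test * n_test) / (n_val + n_test)

ron = prediction_mae(3.86, 51, 3.97, 26)   # MOW_SVM, Table 5; n from Table 2
mon = prediction_mae(4.39, 51, 3.94, 25)   # MOW_SVM, Table 6; n from Table 2
# round(ron, 2) -> 3.90 and round(mon, 2) -> 4.24, matching the reported values
```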

while showing similar performance to that of F_SVM over the validation and test data sets. In addition, MOW_SVM drastically reduces the complexity of the model, as reflected by the lower number of descriptors included. For the MON database, as shown in Table 6, MOW_SVM increases the accuracy for every data subset using only 23 descriptors, compared with the 185 descriptors used in F_SVM. As shown in Tables 3 and 4, SVM outperforms the other regression methods; thus, MOW_SVM not only possesses the most accurate prediction ability but also significantly decreases the model complexity.

The weighted averages of the MAEs calculated over the validation and test subsets, which represent the prediction MAEs and reflect the external generalizability of the constructed models, are shown in the last rows of Tables 5 and 6: 3.90 units for RON and 4.24 units for MON with the MOW_SVM models. This performance is superior to that obtained by Albahri,11 who used a group contribution method and obtained prediction MAEs of 4.0 and 5.7 units for RON and MON, respectively. To the best of our knowledge, the model of Albahri11 is the only one reported in the literature that includes more than 200 molecules in the database for predicting the octane number of pure components. The scale of the database impacts the reliability and application range of the constructed models: the RON and MON databases used in this paper include more than 260 molecules and are larger than those in reference 11. Furthermore, our databases contain oxygenated and nitrogen-containing compounds and are therefore more diverse. Al-Fahemi et al.15 developed a QSPR model based on MLR to predict the RON; 40 training molecules were used to predict 25 other compounds, and the standard error was 6.328 units, which corresponds to an MAE of 5.341 units (the MAE was not provided in the original paper). The results of this paper are superior to those of Al-Fahemi et al. because nonlinear models coupled with appropriate feature selection techniques are used instead of linear models. The MAE of the QSPR model used by Zhu et al.16 to predict MON was 3.96 units, slightly smaller than that in this work; however, that result was calculated over a database of only 72 molecules, and the model considered only the MON of paraffins without other critical hydrocarbon types such as olefins and aromatics. Compared with the model of Zhu et al., the model presented in this paper has similar accuracy and a much wider application range, covering hydrocarbons as well as oxygenated and nitrogen-containing compounds.

Figure 8. Parity plot comparing the predicted values and experimental data of RON based on SVM using 12 descriptors.

Figure 9. Parity plot comparing the predicted values and experimental data of MON based on SVM using 23 descriptors.


Table 7. List of Molecular Descriptors Selected by Multiobjective Feature Selection

RON
  symbol      | description                                                                                     | DRAGON block
  ZM2V        | Second Zagreb index by valence vertex degrees                                                   | Topological indices
  SPAN        | Size descriptor                                                                                 | Geometrical descriptors
  nCp         | Number of terminal primary C                                                                    | Functional group counts
  IC1         | Information content index (neighborhood symmetry of order 1); structural complexity per vertex  | Information indices
  Mor03u      | Signal 03/unweighted                                                                            | 3D-MoRSE descriptors
  Mor31m      | Signal 31/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor10p      | Signal 10/weighted by polarizability                                                            | 3D-MoRSE descriptors
  Mor16m      | Signal 16/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor29v      | Signal 29/weighted by van der Waals volume                                                      | 3D-MoRSE descriptors
  RDF020m     | Radial distribution function - 020/weighted by mass                                             | RDF descriptors
  ALOGP2      | Squared Ghose-Crippen octanol−water partition coeff. (logP^2); lipophilic contributions of atoms in the molecule | Molecular properties
  ALOGPS_logS | Solubility in water; hydrophilic descriptor                                                     | Molecular properties

MON
  symbol      | description                                                                                     | DRAGON block
  nAB         | Number of aromatic bonds                                                                        | Constitutional indices
  C.001       | CH3R / CH4                                                                                      | Atom-centered fragments
  C.002       | CH2R2                                                                                           | Atom-centered fragments
  SPAN        | Size descriptor                                                                                 | Geometrical descriptors
  MEcc        | Molecular eccentricity                                                                          | Geometrical descriptors
  JGI2        | Mean topological charge index of order 2                                                        | 2D autocorrelations
  X0Av        | Average valence connectivity index of order 0, which accounts for heteroatoms as well as double and triple bonds | Connectivity indices
  SIC2        | Structural information content index (neighborhood symmetry of order 2); a normalized form of the information content that removes the influence of graph size | Information indices
  RDF020m     | Radial distribution function - 020/weighted by mass                                             | RDF descriptors
  RDF020u     | Radial distribution function - 020/unweighted                                                   | RDF descriptors
  RDF050m     | Radial distribution function - 050/weighted by mass                                             | RDF descriptors
  Mor06m      | Signal 06/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor26m      | Signal 26/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor31m      | Signal 31/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor25v      | Signal 25/weighted by van der Waals volume                                                      | 3D-MoRSE descriptors
  Mor29p      | Signal 29/weighted by polarizability                                                            | 3D-MoRSE descriptors
  Mor04u      | Signal 04/unweighted                                                                            | 3D-MoRSE descriptors
  Mor18m      | Signal 18/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor29m      | Signal 29/weighted by mass                                                                      | 3D-MoRSE descriptors
  Mor23v      | Signal 23/weighted by van der Waals volume                                                      | 3D-MoRSE descriptors
  Mor15e      | Signal 15/weighted by Sanderson electronegativity                                               | 3D-MoRSE descriptors
  JhetZ       | Balaban-like index from Barysz matrix weighted by atomic number, which accounts for heteroatoms and multiple bonds | 2D matrix-based descriptors
  ESpm12u     | Spectral moment of order 12 from edge adjacency matrix                                          | Edge adjacency indices
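Several of the selected descriptors are simple graph invariants. As an illustration of how a Zagreb-type index captures branching, the sketch below computes the plain-degree second Zagreb index for n-octane and 2,2,4-trimethylpentane on their hydrogen-suppressed carbon skeletons; note that the ZM2V descriptor in Table 7 uses valence vertex degrees, so DRAGON's values differ from these.

```python
# Second Zagreb index M2 = sum over bonds of deg(u)*deg(v), computed on the
# hydrogen-suppressed carbon skeleton (plain vertex degrees, not valence).
def second_zagreb(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(deg[u] * deg[v] for u, v in edges)

octane = [(i, i + 1) for i in range(7)]  # n-octane: chain C0..C7
# 2,2,4-trimethylpentane: chain C0-C1-C2-C3-C4, methyls C5 and C6 on C1, C7 on C3
isooctane = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (1, 6), (3, 7)]
# second_zagreb(octane) -> 24; second_zagreb(isooctane) -> 32
```

The branched isomer scores higher, consistent with the text's observation that branching-sensitive descriptors track the higher octane number of isoparaffins.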


computed using bootstrapping. The prediction provided by SVM regression is a point estimate; thus, the uncertainty assessment must be completed using other methods. Bootstrap methods are applied in SVM regression to obtain prediction intervals without making assumptions about the probability distribution.57 These methods tend to be more accurate on small data sets58,59 since they rely on an empirical distribution constructed from the data rather than on asymptotic results.60 Two groups of bootstrap techniques can be distinguished by whether the regression model is resampled based on pairs or on residuals.60 Percentile confidence intervals computed by bootstrap sampling based on pairs are used in this study; the details of the algorithm are described in references 57 and 60. Each interval is calculated from 2000 runs.

In Figures 8 and 9, some of the white points representing the training samples have large intervals, especially those with octane numbers between 0 and 25. This result occurs in part because, during the CV, only four-fifths of the molecules in the training set are used in the bootstrapping to estimate the confidence intervals of the remaining one-fifth of the samples; in contrast, all the molecules in the training set are used to estimate the confidence intervals of the molecules in the validation and test subsets. Larger confidence intervals are generally produced when a smaller subset is used for bootstrapping, whereas using the entire training set yields narrower intervals and better prediction accuracy. The other reason is that the distribution of molecules with octane numbers between 0 and 25 is sparse, and this sparsity increases the size of the confidence intervals. Both effects contribute to the large confidence intervals for these points. Nevertheless, most of the points are close to the diagonal line with reasonable confidence interval scales. The parity plots comparing the predicted values of RON and MON computed with the postfilter descriptors and the experimental values are displayed in Figures S3 and S4 in the Supporting Information.

4.5. Meaning of the Selected Descriptors. The 12 and 23 descriptors selected by the multiobjective wrapper for RON and MON are presented in Table 7 along with their meanings, which agree with petroleum chemistry knowledge about antiknock properties. The branching degree of hydrocarbons positively influences the octane number, and the descriptors ZM2V, SPAN, nCp, and IC1 describe the branching of molecules from different viewpoints.61 For a straight-chain alkane and an isoparaffin with the same number of carbon atoms, the straight-chain molecule has a larger diameter (SPAN), a smaller number of terminal primary carbons (nCp), and less structural complexity per vertex (IC1). For a given set of molecules, the differences in descriptor values among molecules in an optimized feature subset should reflect the octane number differences of the molecules and can be used to judge the rationality of the optimized feature subset. For example, the descriptors ALOGP2 and ALOGPS_logS represent the hydrophobic and hydrophilic properties of the compounds, respectively. Because oxygenated compounds are contained in the RON database, the two descriptors can distinguish oxygenates from hydrocarbons to provide accurate predictions for the regression models. Compared with 2,2,4-trimethylpentane, isopropanol possesses weaker hydrophobicity and stronger hydrophilicity. As mentioned previously, the values of the descriptors are pretreated to eliminate dimensional differences in the databases. The ALOGP2 value of isopropanol is 0.04, which is much lower than the value of 4.49 for 2,2,4-trimethylpentane, and the ALOGPS_logS values of isopropanol and 2,2,4-trimethylpentane are 0.80 and −4.14, respectively. The gap in these values reflects the difference in hydrophilicity between the two compounds.

One of the main difficulties in accurately predicting the octane number is the large number of isomers in the boiling range of gasoline, especially those that differ only in configuration, such as cis- and trans-1,3-dimethylcyclohexane. These two molecules have similar values for most of the descriptors, but their RON values are 71.7 and 66.9 units, respectively. This gap in RON values is far from negligible, yet most of the descriptors do not reflect it. The Mor03u values for the two molecules are −7.181 and −5.517, and this gap adequately reflects the difference in the RON values of the two isomers. 3D-MoRSE descriptors can successfully handle the situation in which a QSAR study includes a structurally similar set of compounds, because the corresponding values can be computed from a few pairs of neighboring atoms,62 which explains why many 3D-MoRSE descriptors are selected by our methods given the large number of isomers in the databases. The 3D-MoRSE descriptor Mor03v was also selected by Zhu et al.16 to predict the MON of paraffins. If a QSPR model lacks proper feature selection, the useful descriptors are diluted by the large number of irrelevant descriptors. The examples above verify the efficiency of the feature selection methods developed in this paper.
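The bootstrap-pairs percentile interval described in section 4.4 can be sketched as follows. The data are synthetic, the SVR settings are illustrative, and 200 resamples are used instead of the paper's 2000 to keep the example light.

```python
# Percentile confidence interval by bootstrapping (x, y) pairs around an SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
x_new = np.array([[0.5]])                  # query point for the interval

preds = []
for _ in range(200):                       # paper used 2000 resamples
    idx = rng.integers(0, len(X), len(X))  # resample pairs with replacement
    model = SVR(kernel="rbf", C=10, gamma=0.5, epsilon=0.05).fit(X[idx], y[idx])
    preds.append(model.predict(x_new)[0])

lo, hi = np.percentile(preds, [2.5, 97.5])  # 95% percentile interval
```

No distributional assumption is made: the interval comes directly from the empirical spread of the refit predictions, which is the appeal of the pairs bootstrap on small data sets.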

5. CONCLUSION
QSPR models are developed to predict the pure component RON and MON. Two databases composed of published data include the RON and MON values of oxygenated and nitrogen-containing compounds as well as hydrocarbons. Feature subset selection is used to increase the performance and reduce the complexity of a QSPR model by removing redundant and irrelevant descriptors. In this study, a two-step feature selection method is used to optimize both the length of the feature subsets and the generalizability of the models. In the first step, the Boruta algorithm, which is based on multiple runs of RF, is used as a filter to remove irrelevant descriptors from the databases; after filtering, 180 and 185 descriptors remain and are used as inputs for the multiobjective feature selection method. On the basis of the postfiltered descriptors, several linear and nonlinear regression methods, including PLS, FFNN, RF, and SVM, are compared, and SVM regression is selected as the evaluation algorithm. In the second step, a multiobjective optimization algorithm (NSGA-II) runs around the SVM as a wrapper to simultaneously minimize the lengths of the feature subsets and maximize the model performance. Feature subsets with 12 and 23 descriptors are thus selected to predict the RON and MON based on the performance over the validation data sets. The final RON and MON regression models for pure components are constructed using the two optimized subsets of descriptors coupled with SVM regression. These models have higher generalizability and avoid the overfitting that is commonly produced by single-objective feature selection. The prediction MAEs for RON and MON are 3.90 and 4.24 units, respectively. Compared with previous studies, the models developed in this work achieve higher accuracies even though they are based on more diverse databases. Furthermore, bootstrap techniques are used to calculate the confidence intervals of the estimations produced by the SVM regression models.

ASSOCIATED CONTENT

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.energyfuels.6b03251. Part of the correlation coefficients among the descriptors derived from the RON database (Table S1); parameter scatter plots of SVM for the RON and MON data using postfilter descriptors (Figures S1 and S2); parity plots of the RON and MON prediction using postfilter descriptors (Figures S3 and S4) (PDF). RON and MON databases (in csv format) (ZIP).

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Linzhou Zhang: 0000-0002-8354-784X
Ali Elkamel: 0000-0002-6220-6288

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (Grants 21506254 and U1162204) and the Science Foundation of China University of Petroleum, Beijing (Grant 2462014YJRC020).

REFERENCES
(1) Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids, 4th ed.; McGraw-Hill: New York, 1987.
(2) American Society for Testing and Materials (ASTM). ASTM D2699-15a, Standard Test Method for Research Octane Number of Spark-Ignition Engine Fuel; ASTM: West Conshohocken, PA, 2015.
(3) American Society for Testing and Materials (ASTM). ASTM D2700, Standard Test Method for Motor Octane Number of Spark-Ignition Engine Fuel; ASTM: West Conshohocken, PA, 2011.
(4) Kelly, J. J.; Barlow, C. H.; Jinguji, T. M.; Callis, J. B. Anal. Chem. 1989, 61, 313−320.
(5) Cooper, J. B.; Wise, K. L.; Groves, J.; Welch, W. T. Anal. Chem. 1995, 67, 4096−4100.
(6) Bao, X.; Dai, L. Fuel 2009, 88, 1216−1222.
(7) Nikolaou, N.; Papadopoulos, C. E.; Gaglias, I. A.; Pitarakis, K. G. Fuel 2004, 83, 517−523.
(8) Neurock, M.; Nigam, A.; Trauth, D.; Klein, M. T. Chem. Eng. Sci. 1994, 49, 4153−4177.
(9) Kahrs, O.; Brauner, N.; Cholakov, G. S.; Stateva, R. P.; Marquardt, W.; Shacham, M. Comput. Chem. Eng. 2008, 32, 1397−1410.
(10) Wakeham, W. A.; St. Cholakov, G.; Stateva, R. P. J. Chem. Eng. Data 2002, 47, 559−570.
(11) Albahri, T. Ind. Eng. Chem. Res. 2003, 42, 657−662.
(12) St. Cholakov, G.; Wakeham, W. A.; Stateva, R. P. Fluid Phase Equilib. 1999, 163, 21−42.
(13) Ha, H. Z.; Ring, Z.; Liu, S. Energy Fuels 2005, 19, 152−163.
(14) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.; Creton, B. Energy Fuels 2011, 25, 3900−3908.
(15) Al-Fahemi, J. H.; Albis, N. A.; Gad, E. A. M. J. Theor. Chem. 2014, 2014, 520652.
(16) Zhu, X.; Jiang, J.-c.; Pan, Y.; Wang, R. W. Nat. Gas Chem. Ind. 2011, 36, 54−57 (in Chinese).
(17) Shahlaei, M. Chem. Rev. 2013, 113, 8093−8103.
(18) Dutta, D.; Guha, R.; Wild, D.; Chen, T. J. Chem. Inf. Model. 2007, 47, 989−997.
(19) Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L. A. Feature Extraction: Foundations and Applications; Springer-Verlag: New York, 2006.
(20) Saeys, Y.; Inza, I.; Larrañaga, P. Bioinformatics 2007, 23, 2507−2517.
(21) Kohavi, R.; John, G. Artif. Intell. 1997, 97, 273−324.
(22) Hawkins, D. M. J. Chem. Inf. Comput. Sci. 2004, 44, 1−12.
(23) Yang, J.; Honavar, V. IEEE Intell. Syst. Appl. 1998, 13, 44−49.
(24) Chaikla, N.; Qi, Y. In 1999 IEEE International Conference on Systems, Man, and Cybernetics, Conference Proceedings, Tokyo, Japan, October 12−15, 1999; Vol. 5, pp 538−540.
(25) Emmanouilidis, C.; Hunter, A.; MacIntyre, J. In Proceedings of the 2000 Congress on Evolutionary Computation, La Jolla, CA, July 16−19, 2000; Vol. 1, pp 309−316.
(26) Deb, K.; Raji Reddy, A. BioSystems 2003, 72, 111−129.
(27) Oliveira, L. S.; Sabourin, R.; Bortolozzi, F.; Suen, C. Y. Int. J. Pattern Recognit. Artif. Intell. 2003, 17, 903−929.
(28) American Society for Testing and Materials (ASTM). Knocking Characteristics of Pure Hydrocarbons: Developed under American Petroleum Institute Research Project 45; 1958.
(29) Daubert, T. E.; Danner, R. P. API Technical Data Book−Petroleum Refining; American Petroleum Institute (API): Washington, DC, 1997.
(30) Yanowitz, J.; Christensen, E.; McCormick, R. Utilization of Renewable Oxygenates As Gasoline Blending Components; Technical Report NREL/TP-5400-50791, Contract No. DE-AC36-08GO28308; 2011.
(31) O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. J. Cheminf. 2011, 3, 33.
(32) Tetko, I. V.; Gasteiger, J.; Todeschini, R.; Mauri, A.; Livingstone, D.; Ertl, P.; Palyulin, V.; Radchenko, E. V.; Zefirov, N. S.; Makarenko, A. S.; Tanchuk, V. Y.; Prokopenko, V. V. J. Comput.-Aided Mol. Des. 2005, 19, 453−463.
(33) VCCLAB, Virtual Computational Chemistry Laboratory, http://www.vcclab.org, 2005.
(34) Sadowski, J.; Gasteiger, J.; Klebe, G. J. Chem. Inf. Model. 1994, 34, 1000−1008.
(35) Radtke, P. V. W.; Wong, T.; Sabourin, R. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, Vancouver, British Columbia, Canada, July 16−21, 2006; pp 3327−3334.
(36) Oliveira, L. S.; Sabourin, R.; Bortolozzi, F.; Suen, C. Y. Feature selection using multi-objective genetic algorithms for handwritten digit recognition. In 16th International Conference on Pattern Recognition, Quebec City, Quebec, Canada, August 11−15, 2002; Vol. 1, pp 568−571, DOI: 10.1109/ICPR.2002.1044794.
(37) Mansouri, K.; Ringsted, T.; Ballabio, D.; Todeschini, R.; Consonni, V. J. Chem. Inf. Model. 2013, 53, 867−878.
(38) Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. Match Commun. Math. Comput. Chem. 2006, 56, 237−248.
(39) Kursa, M. B.; Jankowski, A.; Rudnicki, W. R. Fundam. Informaticae 2010, 101, 271−285.
(40) Breiman, L. Mach. Learn. 2001, 45, 5−32.
(41) Chen, S.; Billings, S. A. Int. J. Control 1992, 56, 319−346.
(42) Vapnik, V. N. The Nature of Statistical Learning Theory; Information Science and Statistics; Springer-Verlag: New York, 1995.
(43) Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. IEEE Trans. Evol. Comput. 2002, 6, 182−197.
(44) Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; John Wiley & Sons: Chichester, U.K., 2001.
(45) R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012; http://www.R-project.org/.
(46) Kursa, M. B.; Rudnicki, W. R. J. Stat. Softw. 2010, 36, 1−13.
(47) Liaw, A.; Wiener, M. R News 2002, 2, 18−22.
(48) Dimitriadou, E.; Hornik, K.; Leisch, F.; Meyer, D. A. Misc Functions of the Department of Statistics (e1071). R Package, version 1.5-1.8; TU Wien: Vienna, Austria, 2008.
(49) Mersmann, O. mco: Multiple Criteria Optimization Algorithms and Related Functions; 2014.
(50) Wickham, H. ggplot2: Elegant Graphics for Data Analysis; 2009.
(51) Revolution Analytics; Weston, S. foreach: Provides Foreach Looping Construct for R; 2015.
(52) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(53) Kecman, V. Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models; MIT Press: Cambridge, MA, 2001.
(54) Statnikov, A.; Wang, L.; Aliferis, C. F. BMC Bioinf. 2008, 9, 319.
(55) Loughrey, J.; Cunningham, P. Res. Dev. Intell. Syst. XXI 2005, 33−43.
(56) Xu, H.; Caramanis, C.; Mannor, S. J. Mach. Learn. Res. 2008, 10, 1485−1510.
(57) Lins, I. D.; Droguett, E. L.; Moura, M. D. C.; Zio, E.; Jacinto, C. M. Reliab. Eng. Syst. Saf. 2015, 137, 120−128.
(58) Efron, B. Ann. Stat. 1979, 7, 1−26.
(59) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall: New York, 1993.
(60) De Brabanter, K.; De Brabanter, J.; Suykens, J. K.; De Moor, B. IEEE Trans. Neural Netw. 2011, 22, 110−120.
(61) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000.
(62) Devinyak, O.; Havrylyuk, D.; Lesyk, R. J. Mol. Graphics Modell. 2014, 54, 194−203.