Int. J. Electrochem. Sci., 10 (2015) 3568 - 3583 International Journal of
ELECTROCHEMICAL SCIENCE www.electrochemsci.org
High Dimensional QSAR Study of Mild Steel Corrosion Inhibition in acidic medium by Furan Derivatives Abdo M. Al-Fakih1,3,*, Madzlan Aziz1,*, Hassan H. Abdallah4, Zakariya Y. Algamal2,5, Muhammad H. Lee2, Hasmerya Maarof1 1
Department of Chemistry, Faculty of Science, University Technology Malaysia, 81310 UTM Skudai, Johor, Malaysia 2 Department of Mathematical Sciences, University Technology Malaysia, 81310 UTM Skudai, Johor, Malaysia 3 Department of Chemistry, Faculty of Science, Sana’a University, Sana’a, Yemen 4 Department of Chemistry, College of Education, Salahaddin University, Erbil, Iraq 5 Department of Statistics and Informatics, College of Computer Science and Mathematics, University of Mosul, Iraq * E-mail:
[email protected],
[email protected] Received: 22 December 2014 / Accepted: 18 January 2015 / Published: 24 February 2015
The inhibition of mild steel corrosion in 1 M HCl by 17 furan derivatives was investigated experimentally using potentiodynamic polarization measurements. The furan derivatives inhibited the corrosion of mild steel. The experimental inhibition efficiency (IE) was used in a quantitative structure-activity relationship (QSAR) study. Dragon software was used to calculate the molecular descriptors. Penalized multiple linear regression (PMLR) was applied as a variable selection method using three penalties, namely ridge, LASSO, and elastic net. The LASSO and elastic net methods selected 8 and 38 significant molecular descriptors, respectively. The most significant descriptors, namely PJI3, P_VSA_s_4, Mor16u, MATS3p, and PDI, were selected by both LASSO and elastic net. The elastic net gave a low mean-squared error of 0.0004 for the training set (MSE_train) and 5.332 for the test set (MSE_test). The results confirm that penalized multiple linear regression based on the elastic net penalty is the most effective method for dealing with high-dimensional data.
Keywords: Polarization; corrosion inhibitors; furan derivatives; high dimensional QSAR; penalized multiple linear regression (PMLR)
1. INTRODUCTION
Metal corrosion causes a huge loss of resources and industrial equipment, especially in acidic media. Acid solutions are the most corrosive media because of their wide use in industry [1]. The
most commonly reported corrosion inhibitors are organic compounds containing heteroatoms such as oxygen, nitrogen, sulfur, and phosphorus, as well as compounds containing multiple bonds [2,3]. Experimental and theoretical methods have been used to investigate the corrosion inhibition efficiency of many organic compounds [4]. Computational methods have become more developed and are increasingly used in corrosion inhibition studies [5]. Quantitative structure-activity relationship (QSAR) modeling is a computational method that has been applied in many disciplines of chemistry [6,7]. A good QSAR model should possess high predictive power and prediction reliability [8]. In QSAR modeling, compounds are treated as observations and descriptors are treated as explanatory variables. Quantum chemical calculations are the traditional methods used to calculate molecular descriptors. In addition, software such as Molconn-Z, CODESSA, and Dragon is used to calculate descriptors from the molecular structures [9]. Dragon software is widely applied in QSAR and other scientific studies; version 6.0 can calculate 4885 descriptors [10]. High dimensionality, in which the number of molecular descriptors, p, exceeds the number of compounds, n, is one of the new challenges in QSAR modeling [11]. Statistical issues associated with high-dimensional QSAR modeling include model overfitting and multicollinearity [12,13]. Classical statistical methods such as multiple linear regression (MLR) cannot resolve overfitting and multicollinearity. Several methods have been proposed to deal with the high-dimensional data problem. For example, dimension reduction methods, such as principal component analysis (PCA) [14] and partial least squares (PLS) [15], represent the original explanatory variables by orthogonal components. Other methods, such as penalized regression, perform shrinkage and variable selection simultaneously.
Variable selection is the main objective in high-dimensional data [16]. The aim of selecting an optimal subset of molecular descriptors is to reduce the number of descriptors to those that contain relevant information, and thereby to improve QSAR modeling. This improvement should be seen in predictive performance (by decreasing the effect of multicollinearity) and in interpretability (by preventing overfitting). A procedure called penalization is used for variable selection in high-dimensional data. Penalization attaches a penalty term $P_{\lambda}(\beta)$ to the ordinary least squares (OLS) criterion to obtain a better estimate of the prediction error by avoiding overfitting and multicollinearity.
In this study, the corrosion inhibition efficiencies of furan derivatives on mild steel in 1 M HCl solution were evaluated using electrochemical potentiodynamic polarization. Dragon software version 6.0 was used to calculate the structure-based descriptors. A large number of molecular descriptors was obtained, giving a high-dimensional dataset. High-dimensional data are more informative for developing better models; however, such data are a major challenge for classical variable selection methods. Therefore, the aim of this paper is to apply recently proposed variable selection methods (i.e., penalized multiple linear regression (PMLR) based on ridge, LASSO, and elastic net penalties) in QSAR studies. In addition, the study aims to evaluate 17 furan derivatives as corrosion inhibitors for mild steel in 1 M HCl solution.
2. MATERIALS AND METHODS
2.1. Experimental Preparation of Materials and Inhibitors
Seventeen furan derivatives were obtained from Sigma-Aldrich and investigated as corrosion inhibitors of mild steel in 1 M HCl (Table 1). The test solution (1 M HCl) was prepared from analytical grade hydrochloric acid (37 wt%). The composition of the mild steel specimens (wt%) was: C-0.036, Mn-0.172, Cu-0.082, Ni-0.108, Cr-0.053, Al-0.035, Zr-0.146 and Fe balance. The surface of the steel was abraded using sandpaper of grades 240, 320, 400, 600, 800 and 1500. The specimens were cleaned thoroughly with deionized water and then with acetone.
2.2. Experimental Potentiodynamic Polarization Measurements
Potentiodynamic polarization measurements were used to investigate the inhibition efficiency of the inhibitors. The measurements were carried out at room temperature (25±1°C) using 250 ml of 1 M HCl electrolyte with and without the addition of 0.005 M of the inhibitors. Before the polarization measurements, the system was allowed to stabilize for 30 min to reach a steady-state open circuit potential (OCP). Polarization curves were recorded at a scan rate of 10 mV/s over a scan range from -0.25 to +0.25 V with respect to OCP. An Autolab Potentiostat/Galvanostat instrument was used to carry out the measurements by recording the Tafel polarization curve. The cell was a three-electrode assembly containing a 1 cm² coupon of mild steel embedded in a specimen holder. The mild steel specimen acted as the working electrode (WE). A platinum electrode was used as the counter electrode (CE). A saturated calomel electrode (SCE) was used as the reference electrode (RE).
2.3. High-Dimensional QSAR Dataset
The dataset consisted of 17 furan derivatives used as corrosion inhibitors. The molecular structures of the dataset compounds were drawn using Chem3D software. The structures were optimized using the molecular mechanics MM2 method and then further optimized with the Molecular Orbital Package (MOPAC) module in Chem3D. Dragon software version 6.0 was used to calculate the molecular descriptors from the optimized molecular structures [10]. A total of 1951 descriptors were calculated. The dataset was randomly split into a 70% training set and a 30% test set.
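The random 70/30 split described above can be sketched as follows. This is only an illustration: the seed, the rounding rule for the training-set size, and the function name `split_indices` are assumptions, since the text does not state how the split was drawn.

```python
import random

def split_indices(n_compounds, train_fraction=0.7, seed=0):
    """Randomly partition compound indices into training and test sets.

    The seed and the use of round() for the training-set size are
    illustrative assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    indices = list(range(n_compounds))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_compounds)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 17 furan derivatives, as in the dataset described above.
train_idx, test_idx = split_indices(17)
print(len(train_idx), len(test_idx))
```

With 17 compounds, this rounding rule gives 12 training and 5 test compounds. Either way, the number of training compounds is far smaller than the 1951 descriptors.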
2.4. High-Dimensional QSAR Variable Selection
The most informative descriptors must be selected carefully from all the molecular descriptors in the dataset. Variable selection is one of the most prominent problems in QSAR studies. Its purpose is to find a subset of significant descriptors that yields a QSAR model with better predictive accuracy than a model built with all the descriptors. In this work, the dataset obtained was high dimensional. Unlike classical variable selection methods,
penalization methods can deal with high-dimensional data. In this paper, penalized multiple linear regression was applied using three well-known penalties: ridge, LASSO, and elastic net. Although the ridge penalty cannot perform variable selection, it is useful for dealing with multicollinearity. In general, classical linear regression assumes that the response variable $y = (y_1, \ldots, y_n)^{T}$ is a linear combination of p molecular descriptors $x_1, \ldots, x_p$, weighted by an unknown parameter vector $\beta = (\beta_1, \ldots, \beta_p)^{T}$, plus an additive error term $e = (e_1, \ldots, e_n)^{T}$. When $n > p$, the usual estimation procedure for the parameter vector $\beta$ is the minimization of the residual sum of squares (RSS) with respect to $\beta$:
$$\hat{\beta}_{OLS} = \arg\min_{\beta} \mathrm{RSS} = \arg\min_{\beta}\, (y - X\beta)^{T}(y - X\beta). \qquad (1)$$
Then, the OLS estimator $\hat{\beta}_{OLS} = (X^{T}X)^{-1}X^{T}y$ is obtained by solving Eq. (1). The OLS estimator is optimal within the class of linear unbiased estimators if the molecular descriptors are not correlated. However, multicollinearity occurs if there are highly correlated molecular descriptors in the regression model. This can lead to problems in the computation of the OLS estimator. In the case of high-dimensional data, $n < p$, both the design matrix $X$ and the matrix $X^{T}X$ no longer have full rank p. Thus, $(X^{T}X)^{-1}$ cannot be calculated and the OLS estimator cannot be computed. The penalization methods are based on penalty terms and should yield unique estimates of the parameter vector $\beta$. The prediction accuracy can be improved by shrinking the coefficients, and the interpretability can be improved by setting some of the coefficients to zero. The resulting QSAR regression models should therefore contain only the relevant molecular descriptors, which makes them easier to interpret. In general, the penalized multiple linear regression (PMLR) criterion is defined as:
$$\mathrm{PMLR} = (y - X\beta)^{T}(y - X\beta) + P_{\lambda}(\beta). \qquad (2)$$
The estimates of the penalized parameter vector are obtained by minimizing Eq. (2) with respect to $\beta$, as shown by Eq. (3):
$$\hat{\beta}_{PMLR} = \arg\min_{\beta} \mathrm{PMLR}. \qquad (3)$$
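The rank argument for the case n < p can be checked numerically. The sketch below uses synthetic Gaussian data, not the paper's descriptors; the sizes n = 12 and p = 50 are arbitrary stand-ins chosen only so that n < p.

```python
import numpy as np

# Synthetic stand-in for a descriptor matrix (assumption: random data,
# n = 12 compounds and p = 50 descriptors, so that n < p).
rng = np.random.default_rng(0)
n, p = 12, 50
X = rng.normal(size=(n, p))

# The rank of X cannot exceed min(n, p) = n.
rank_X = np.linalg.matrix_rank(X)

# X^T X is p x p but has the same rank as X, so it is singular:
# its smallest eigenvalue is zero up to rounding error.
gram = X.T @ X
smallest_eig = np.linalg.eigvalsh(gram)[0]

print(rank_X, p, smallest_eig)
```

Because the rank of the p × p matrix is at most n, the inverse needed for the OLS estimator does not exist, which is the situation the penalty term is meant to resolve.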
The penalty term $P_{\lambda}(\beta)$ depends on the tuning parameter $\lambda$, which controls the shrinkage intensity. For $\lambda = 0$, the result is the OLS estimator. In contrast, for large values of $\lambda$, the influence of the penalty term on the coefficient estimates increases. The penalty region therefore determines the properties of the penalized estimate of the parameter vector, and the selected variables are the desirable molecular descriptors. Different forms of the penalty term have been introduced in the literature, such as the ridge, LASSO, and elastic net penalties.
2.4.1. Ridge Regression
One of the most popular penalties is ridge regression (RR), which was introduced by Hoerl and Kennard [17] as an alternative to OLS when there is multicollinearity among the molecular descriptors. Ridge regression penalizes the RSS using $P_{\lambda}(\beta) = \lambda \sum_{j=1}^{p} \beta_j^{2}$. Consequently, the ridge estimate is defined by Eq. (4):
$$\hat{\beta}_{Ridge} = \arg\min_{\beta} \Big( \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2} \Big). \qquad (4)$$
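As a minimal sketch of Eq. (4), the ridge estimate can be computed from the standard closed form $(X^{T}X + \lambda I)^{-1}X^{T}y$, which follows from setting the gradient of the penalized RSS to zero. The toy data and the function name `ridge_estimate` are hypothetical and chosen only so that n < p.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Minimise RSS + lam * sum(beta_j**2).

    Uses the closed form (X^T X + lam * I)^{-1} X^T y, solved as a
    linear system rather than by forming the inverse explicitly.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data (hypothetical): n = 2 compounds, p = 3 descriptors, so n < p
# and X^T X alone would be singular.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

beta_ridge = ridge_estimate(X, y, lam=1.0)
print(beta_ridge)
```

For any lam > 0 the matrix being solved is positive definite, so a unique solution exists even when n < p. Here all three coefficients are shrunk but none is exactly zero.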
In RR, the tuning parameter $\lambda$ controls the amount of shrinkage, but never sets molecular descriptor coefficients exactly equal to zero. Therefore, in high-dimensional data, when $n < p$, RR does not perform variable selection. Although RR lacks the variable selection property, it is preferred in high-dimensional data because high correlations between molecular descriptors are expected. Unlike the OLS estimator, the RR estimator is biased: this penalized method accepts a little bias in order to reduce the variance and the mean squared error (MSE). Since RR cannot select variables, further penalization methods such as LASSO and elastic net were developed.
2.4.2. Least Absolute Shrinkage and Selection Operator (LASSO)
Tibshirani [18] proposed the least absolute shrinkage and selection operator (LASSO) as a penalty that performs variable selection by setting some variable coefficients to zero. It performs continuous shrinkage and automatic variable selection simultaneously. Similar to RR, the LASSO estimates are obtained by adding the penalty $P_{\lambda}(\beta) = \lambda \sum_{j=1}^{p} |\beta_j|$ to the RSS. The PMLR estimate using LASSO is given by Eq. (5):
$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \Big( \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \Big). \qquad (5)$$
Owing to the nature of the LASSO penalty, some coefficients will be exactly equal to zero. Hence, LASSO performs variable selection. Although LASSO is widely used in many applications, it has some drawbacks. First, it is not robust to high correlation among molecular descriptors: it will arbitrarily choose one of the correlated descriptors and ignore the rest. Second, in high-dimensional data the maximum number of descriptors that LASSO can select is n, even if more descriptors have non-zero coefficients in the final model. Therefore, the elastic net penalized method was developed to overcome the drawbacks of the LASSO.
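The difference between ridge shrinkage and LASSO selection is easiest to see in the special case of orthonormal descriptors ($X^{T}X = I$). In that case both estimates have closed forms: the ridge estimate rescales each OLS coefficient, while the LASSO estimate soft-thresholds it. The sketch below assumes that simplified setting and uses hypothetical toy values; it does not use the paper's data.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t and set it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Assumption: orthonormal design (X^T X = I), so each coefficient can
# be estimated separately. Toy values, not the paper's data.
X = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
lam = 2.0

z = X.T @ y                                # OLS estimate when X^T X = I
beta_ridge = z / (1.0 + lam)               # minimises RSS + lam * sum(beta**2)
beta_lasso = soft_threshold(z, lam / 2.0)  # minimises RSS + lam * sum(|beta|)

print(beta_ridge)
print(beta_lasso)
```

The small second coefficient is set exactly to zero by the LASSO, while ridge only scales it down. This is the selection property described above.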
2.4.3. Elastic Net
Elastic net is a penalized method for variable selection. It was introduced by Zou and Hastie [19] to deal with the drawbacks of LASSO. The elastic net combines the LASSO and ridge penalties: the ridge penalty deals with the high correlation problem, while the LASSO penalty provides the variable selection property. The elastic net estimates for PMLR are defined by Eq. (6):
$$\hat{\beta}_{elastic} = \arg\min_{\beta} \Big( \mathrm{RSS} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \Big). \qquad (6)$$
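The effect of adding the ridge term to the LASSO penalty can be illustrated with two perfectly correlated descriptors. The sketch below minimizes the criterion of Eq. (6) by cyclic coordinate descent. That solver is an assumption made for illustration and is not the procedure described in this paper; the function name `elastic_net_cd` and the toy data are also hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink the scalar z towards zero by t; zero if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Minimise RSS + lam1 * sum(|b_j|) + lam2 * sum(b_j**2)
    by cyclic coordinate descent (illustrative solver choice)."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with descriptor j's own contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return beta

# Toy data: two identical (perfectly correlated) descriptors.
x = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x, x])
y = 2.0 * x

beta_lasso = elastic_net_cd(X, y, lam1=2.0, lam2=0.0)  # pure LASSO
beta_enet = elastic_net_cd(X, y, lam1=2.0, lam2=1.0)   # elastic net
print(beta_lasso, beta_enet)
```

With lam2 = 0 the solver leaves essentially all the weight on the first of the two identical columns. With lam2 > 0 the criterion is strictly convex and the weight is shared equally between them, which is how the ridge term deals with highly correlated descriptors.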
As can be seen from Eq. (6), the elastic net depends on two non-negative tuning parameters, $\lambda_1$ and $\lambda_2$. According to Lemma 1 in Zou and Hastie [19], to find the estimate $\hat{\beta}_{elastic}$ in Eq. (6), the given data set $(y, X)$ is extended to an augmented data set $(y^{*}, X^{*})$ defined by Eq. (7):
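$$X^{*}_{(n+p) \times p} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad y^{*}_{(n+p) \times 1} = \begin{pmatrix} y \\ 0 \end{pmatrix}. \qquad (7)$$
As a result of this augmentation, the elastic net can be written and solved as a LASSO problem. Hence, the elastic net can select all p molecular descriptors in the high-dimensional case when n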