Data Mining and Machine Learning Tools for ... - Wiley Online Library

Mar 20, 2015 - DOI: 10.1002/minf.201400174. Data Mining and Machine Learning Tools for Combinatorial. Material Science of All-Oxide Photovoltaic Cells.
www.molinf.com

DOI: 10.1002/minf.201400174

Data Mining and Machine Learning Tools for Combinatorial Material Science of All-Oxide Photovoltaic Cells

Abraham Yosipof,[a] Oren E. Nahum,[a] Assaf Y. Anderson,[a] Hannah-Noa Barad,[a] Arie Zaban,[a] and Hanoch Senderowitz*[a]

Abstract: Growth in energy demands, coupled with the need for clean energy, is likely to make solar cells an important part of future energy resources. In particular, cells made entirely of metal oxides (MOs) have the potential to provide clean and affordable energy if their power conversion efficiencies are improved. Such improvements require the development of new MOs, which could benefit from combining combinatorial material sciences for producing solar cell libraries with data mining tools to direct synthesis efforts. In this work we developed a data mining workflow and applied it to the analysis of two recently reported solar cell libraries based on titanium and copper oxides. Our results demonstrate that QSAR models with good prediction statistics for multiple solar cell properties can be developed and that these models highlight important factors affecting these properties, in accord with experimental findings. The resulting models are therefore suitable for designing better solar cells.

Keywords: Data mining · Machine learning · QSAR · Combinatorial material science · All oxide photovoltaic cells

1 Introduction

The continuous growth in energy demands, coupled with the need for new and clean energy, is likely to make photovoltaics (PV) an important part of future energy resources. Photovoltaics convert sunlight into electricity using semiconducting materials. A typical photovoltaic device operates by: (1) generation of charge carriers (electrons and holes) due to the absorption of photons; (2) separation of the photo-generated charge carriers of opposite types via charge-selective contact(s); (3) collection of the photo-generated charge carriers to an external circuit, leading to electricity. Photovoltaics has enjoyed continuous growth over the last decade, yet it is predicted that 10–15 additional years of at least a similar growth rate are required before it becomes a major source of electricity.[1] This process could potentially be accelerated by the discovery of new PV materials. Such materials should ideally be efficient in converting sunlight to electricity, cheap, stable over long periods of time, easy to manufacture, and environmentally friendly. Most of these requirements are met by metal oxides (MOs), which are emerging as new materials for the production of solar cells. Yet before all-metal-oxide photovoltaic cells enjoy widespread usage, they still need to demonstrate a substantial increase in their sunlight-to-electricity conversion efficiency.[2] This in turn requires the development of new MOs.

New MOs could be discovered by combinatorial material synthesis. In analogy to combinatorial chemistry, combinatorial material synthesis uses a limited set of target MOs to produce multiple binary, ternary, quaternary, etc. combinations. Classic mix-and-split techniques are mimicked by material deposition processes such as sputtering,[3] pulsed laser deposition (PLD),[4] and spray pyrolysis[5] to produce material libraries which could form the basis for all-metal-oxide solar cells.

Several all-metal-oxide (also known as all-oxide) photovoltaic cells have been described in the literature.[6] Briefly, their basic assembly includes (see Figure 1): (1) a transparent conducting oxide (TCO) coated on glass, typically in the form of fluorine-doped tin oxide (FTO); (2) a window layer, which is a wide band-gap n-type semiconductor; (3) a light-absorbing layer (absorber); (4) a metal back contact; (5) a metal frame (front contact) soldered directly onto the TCO. This basic design could be transformed into a solar cell library by using combinatorial material synthesis for the window and absorber layers.[2,4,6] Upon library production, each of the individual solar cells (corresponding to specific compounds in classical chemoinformatics) could be characterized by experimentally measuring attributes related to its MO composition (referred to as material descriptors) and to its PV properties. In analogy with Quantitative Structure Activity Relationship (QSAR), the former correspond to compound descriptors and the latter to activities.

[a] A. Yosipof, O. E. Nahum, A. Y. Anderson, H.-N. Barad, A. Zaban, H. Senderowitz, Department of Chemistry, Bar Ilan University, Ramat-Gan 52900, Israel. *e-mail: [email protected]

© 2015 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2015, 34, 367 – 379


Special Issue EuroQSAR

Full Paper


Figure 1. Schematic drawing of a combinatorial all-oxide PV library.

Given a well-characterized library of solar cells, data mining and machine learning tools could be used to provide insight into the empirical results, highlight factors which are responsible for PV properties, and design better cells, much in the same way QSAR methods are used for the analysis of compound collections. Only a few applications of data mining approaches in the field of PV have been published, and these primarily focused on the engineering of solar panels (i.e., large arrays of solar cells). Thus, Mellit et al.[7] reviewed the usage of artificial intelligence techniques for sizing and for optimizing parameters of photovoltaic systems, Bonanno et al.[8] used data mining to predict the output characteristic of a commercial PV module, and Ishaque et al.[9] used evolutionary algorithms to extract parameters of PV modules. However, to the best of our knowledge, the usage of machine learning algorithms to study large numbers of individual solar cells differing in their composition has not been reported to date.

The main objective of this study is therefore to establish the usefulness of data mining and machine learning techniques for combinatorial material science of all-oxide photovoltaic cells. The specific goals are to: (1) Use data mining techniques to analyze libraries of photovoltaic cells. (2) Build predictive machine learning models for PV parameters. This would establish a quantitative link between photovoltaic parameters and their corresponding material descriptors based on the quantitative structure activity relationship (QSAR) approach.[10] (3) Set the ground for the experimental design of new solar cells with better PV properties.

In order to meet these goals we developed a data mining workflow and applied it to two libraries differing in their MO composition and method of preparation, namely, a TiO2|Cu-O library reported by Anderson et al.[4] and a TiO2|Cu2O library reported by Pavan et al.[11] We demonstrate that the workflow is able to highlight those material descriptors which most affect PV properties and to develop models with good predictive statistics for several PV properties measured on the two input libraries.

2 Methods

2.1 Data Sets

Two solar cell libraries were considered in this work, namely a TiO2|Cu-O library and a TiO2|Cu2O library. These libraries involve changes in physical parameters of the different layers but no material variation within a specific layer, and are thus termed “combinatorial device libraries”.[4] In these libraries a single MO was used for the window or the absorber layers (different MOs for each layer) and only the thickness of each layer was varied during library production.

2.1.1 TiO2|Cu-O Library

A library of solar cells was obtained from Anderson et al.[4] Briefly, this library was generated on precut glass substrates onto which a TiO2 window layer with a linear gradient was deposited by spray pyrolysis, followed by pulsed laser deposition (PLD) of a Cu-O light absorber layer. This process led to cells with varying thicknesses of the window and absorber layers (and consequently to a variable thickness ratio) and to a variable composition of the Cu-O layer. Upon inserting a grid of 13 × 13 = 169 Ag back contacts, the combinatorial materials library was transformed into a device library consisting of 169 solar cells. The notation Cu-O indicates that while CuO was the metal oxide used for preparing the library, multiple oxides (e.g., CuO, Cu2O, Cu4O3) were found to be present in each cell.[4]

2.1.2 TiO2|Cu2O Library

A second library of solar cells was obtained from Pavan et al.[11] This library is based on the same target MO for the window layer but a different target MO for the absorber layer (Cu2O). In contrast with the previous library, this library was produced using spray pyrolysis only. Furthermore, two different back contacts were used, namely, silver only (Ag) and silver and copper deposited one after the other (Ag/Cu), leading to two sub-libraries each consisting of 169 cells.

2.2 Data Mining Workflow

The data mining workflow developed in this work is presented in Figure 2 and consists of the following components: (1) library characterization; (2) data visualization; (3) model development; (4) model validation; (5) experimental design. Each of these components is briefly described below.

2.2.1 Library Characterization

Figure 2. Data mining workflow for the analysis of all-oxide photovoltaic cells. See text for a description of the different components.

Each library member was characterized by seven experimentally measured material descriptors (independent variables) and by three PV properties (dependent variables). The material descriptors considered in this work are:
(1) The thickness of the window layer (Tw); in this work, the thickness of the TiO2 layer (TTiO2).
(2) The thickness of the absorber layer (Ta); in this work, the thickness of the Cu-O or the Cu2O layers (TCu-O or TCu2O, respectively).
(3) The thickness ratio between the absorber layer and the total (absorber + window) layers (Ratio = Ta/(Ta + Tw)).
(4) The distance of the cell from the center of the depositing plume of the absorber layer (Dcenter).
(5) The band gap of the absorber layer (BGP). The band gap is the energy difference (in electron volts) between the top of the valence band and the bottom of the conduction band.
(6) The measured resistance of the absorber layer (Ra).
(7) The maximum theoretical calculated photocurrent (Jmax).

The photovoltaic properties considered in this work are:
(1) The short circuit photocurrent density (JSC), which is the current density through the solar cell when the voltage across the cell is zero.
(2) The open circuit photovoltage (VOC), which is the maximum voltage available from a solar cell and occurs at zero current.
(3) The internal quantum efficiency (IQE), which reflects the charge separation and collection efficiencies of a device and is calculated by Equation 1.

$$\mathrm{IQE} = \frac{J_{SC}}{J_{max}} \tag{1}$$
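The two derived quantities defined above, the thickness ratio and the IQE of Equation 1, are simple enough to sketch in a few lines of Python. The function names and the numerical inputs below are hypothetical and serve only as an illustration; they are not taken from the libraries studied here.

```python
def thickness_ratio(t_absorber: float, t_window: float) -> float:
    """Ratio = Ta / (Ta + Tw): the absorber share of the total thickness."""
    total = t_absorber + t_window
    if total <= 0:
        raise ValueError("total thickness must be positive")
    return t_absorber / total


def internal_quantum_efficiency(j_sc: float, j_max: float) -> float:
    """IQE = JSC / Jmax (Equation 1); both inputs must share the same units."""
    if j_max <= 0:
        raise ValueError("Jmax must be positive")
    return j_sc / j_max


# Hypothetical cell: absorber thickness 300, window thickness 100 (same units),
# JSC = 0.5 and Jmax = 10 (same current-density units).
print(thickness_ratio(300.0, 100.0))
print(internal_quantum_efficiency(0.5, 10.0))
```

Because IQE is computed directly from Jmax, a model for IQE that also uses Jmax as a descriptor would be partly circular, which is why Jmax is left out of the IQE models later in the paper.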

For a more detailed discussion of these parameters and how they were measured, see References [4,11].

2.2.2 Data Visualization

A common technique for data visualization is Principal Component Analysis (PCA).[12] PCA reduces the dimensionality of a data set while retaining as much as possible of its original variance. This reduction is achieved by transforming the original variables (i.e., descriptors) into a new set of orthogonal variables called Principal Components (PCs). PCs are typically produced in an ordered manner so that the first PC retains the largest portion of the variance of the original set while subsequent PCs retain increasingly smaller portions not accounted for by the previous PCs. PCA has been extensively used in the field of chemoinformatics.[13] In this work we used PCA as implemented in IBM SPSS Statistics for Windows, Version 20.0.[14]

2.2.3 Model Building

Two machine learning techniques were used for model generation, namely, k Nearest Neighbors (kNN) and genetic programming (GP). Prior to model generation, outliers were removed using a newly developed outlier removal procedure.[15]


Figure 3. Schematic representation of the kNN optimization algorithm.

Figure 4. Schematic representation of the new kNN optimization outlier removal algorithm as applied to the solar cells library.

2.2.3.1 k-Nearest Neighbors (kNN): The k-Nearest Neighbors (kNN) algorithm is based on the idea that the activity of a given compound can be predicted by averaging the activities of its k nearest neighbors, namely, the k compounds most similar to it. This idea follows directly from the similar property principle,[16] which states that similar compounds have similar properties. In the present study we attempt to extend this principle to PV cells. Since the similarity between two objects critically depends on the descriptors used to characterize them, inherent to kNN is a variable selection procedure which identifies a set of descriptors in terms of which the similar property principle is satisfied. Furthermore, due to the large number of descriptor subsets, this variable selection procedure could be treated as an optimization problem. Here we implemented the kNN method based on the work of Zheng and Tropsha[17] using Metropolis Monte Carlo/Simulated Annealing (MC/SA) as the optimization engine. The objective of this algorithm is to optimize the leave-one-out (LOO) cross-validated value (Q2LOO, Equation 2, Section 2.2.3.6) in the space of k (the number of nearest neighbors; typically between one and five) and the descriptors. A schematic representation of the kNN procedure employed in this work is provided in Figure 3. The optimized model, namely, the model with the highest Q2LOO value, is defined by the number of nearest neighbors (k) and by the identities of the material descriptors used for the similarity calculations. Clearly, Q2LOO increases as the relevance of the descriptors to the activity increases.

2.2.3.2 kNN Optimization Based Outlier Removal: Within the kNN paradigm, predictions over short distances are likely to be more accurate than predictions over long distances. It therefore follows that the activity of an outlier in the descriptor space is unlikely to be reliably predicted by its nearest neighbors and that the removal of this outlier will improve model performance. With this in mind we have recently presented a new algorithm for the removal of outliers, which is briefly summarized in Figure 4.[15] The algorithm consists of the following steps:
(1) For a set of solar cells, run kNN to obtain the model with the highest Q2LOO.
(2) For each solar cell, calculate the improvement in Q2LOO upon its removal from the library.
(3) Remove the solar cell which provides the largest increase in Q2LOO upon its removal from the library. Removal of a solar cell from the library will also remove it from the list of nearest neighbors of all other solar cells. In such cases, the removed solar cell will be replaced by the next-in-line nearest neighbor.
(4) If no solar cell could be removed based on the first model (i.e., for all solar cells, their removal from the data set did not lead to an improvement in Q2LOO), repeat steps 2–3 for the second best model and so on. If no solar cell could be removed based on any of the best models, stop.
(5) Repeat steps (1)–(4) above until Q2LOO is sufficiently high (stopping criterion).

In the present implementation, the MC procedure was typically run for 10^4 steps per iteration, replacing a single descriptor and randomly choosing k at each step. The effective temperature was set to produce an acceptance rate of ~0.5 %. The stopping criterion for outlier removal was set to Q2LOO ≥ 0.9.

2.2.3.3 Library Partitioning: Following outlier removal, the remaining solar cells were divided into modeling sets and independent validation sets using a newly developed algorithm.[18] Briefly, this algorithm selects a subset of objects (e.g., compounds) which best represents a parent database by optimizing a newly devised representativeness function.
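Returning briefly to the outlier removal procedure of Section 2.2.3.2, its greedy core (steps 1–3 and the stopping criterion) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the descriptors and k are held fixed rather than optimized by MC, the fallback to second-best models (step 4) is omitted, and the function names `q2_loo` and `remove_outliers` as well as the toy data are hypothetical.

```python
import math


def q2_loo(X, y, k):
    """Leave-one-out Q2 of an unweighted kNN average (fixed descriptors, fixed k)."""
    n = len(X)
    y_mean = sum(y) / n
    ss_res = 0.0
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        nearest = sorted(
            (math.dist(X[i], X[j]), j) for j in range(n) if j != i
        )[:k]
        pred = sum(y[j] for _, j in nearest) / len(nearest)
        ss_res += (y[i] - pred) ** 2
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def remove_outliers(X, y, k, q2_target=0.9):
    """Greedily drop the cell whose removal raises Q2_LOO the most."""
    keep = list(range(len(X)))
    removed = []
    while len(keep) > k + 1:
        current = q2_loo([X[i] for i in keep], [y[i] for i in keep], k)
        if current >= q2_target:          # stopping criterion reached
            break
        best_gain, best_idx = 0.0, None
        for idx in keep:
            rest = [i for i in keep if i != idx]
            gain = q2_loo([X[i] for i in rest], [y[i] for i in rest], k) - current
            if gain > best_gain:
                best_gain, best_idx = gain, idx
        if best_idx is None:              # no removal improves Q2_LOO
            break
        keep.remove(best_idx)
        removed.append(best_idx)
    return keep, removed


# Toy data: y follows x except for one anomalous cell at index 3.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0.0, 1.0, 2.0, 30.0, 4.0, 5.0, 6.0]
keep, removed = remove_outliers(X, y, k=2, q2_target=0.7)
print(keep, removed)
```

Recomputing the neighbor lists after each removal, as done here by re-running `q2_loo` on the reduced set, mirrors the statement in step 3 that a removed cell is replaced by the next-in-line nearest neighbor.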
Subsets selected by this method were previously shown to be useful for the evaluation of QSAR models (as an external validation set) in terms of their ability to predict the activities of compounds residing within their applicability domain.[18] Validation sets were also obtained by random divisions and, in accord with previous reports,[19] provided QSAR models with comparable predictive power; they are therefore not discussed further.

2.2.3.4 kNN-Based Model Generation: Our implementation of the kNN algorithm was described above (Figure 3). The algorithm used for model building was identical to the one used for outlier removal except that for model building, simulated annealing was applied and the algorithm was not run iteratively. The MC/SA procedure used for model construction was typically run for 10^5 steps, replacing a single descriptor and randomly choosing k at each step. The effective temperature was set to produce an initial acceptance rate of ~10 % and an average acceptance rate of ~0.5 %.

2.2.3.5 Genetic Programming Based Symbolic Regression for Model Generation: Genetic programming[20] belongs to the family of evolutionary algorithms. The method iteratively produces populations of models so that each population contains models which are better than those found in the previous population. This is done by first generating a random population of models where each model (defined by the identity of the material descriptors) is mapped into a “chromosome”, then evaluating each chromosome based on the performance of its corresponding model, and finally letting the chromosomes that correspond to the better models transfer their “genetic information” (i.e., descriptors) to subsequent generations using operators (e.g., crossover, mutation) taken from evolution. This “genetic pressure” leads to the derivation of improved models until the best model is obtained. Symbolic regression[21] searches the space of mathematical expressions to find the model that best fits a given data set. In this work we used the genetic programming based symbolic regression tool of the Eureqa software.[22]

2.2.3.6 Model Validation Parameters: Models were evaluated by standard parameters.
For the modeling set subjected to kNN we used leave-one-out cross-validation (Q2LOO; Equation 2) and for the modeling set subjected to GP we used standard cross-validation (R2CV, which takes the same form as Q2LOO). For the test sets (external validation sets) we used, for both kNN and GP, the external explained variance (Q2ext; Equation 3) according to the OECD guidelines for model validation[23] and as found in the literature.[24]

$$Q^2_{LOO} = R^2_{CV} = 1 - \frac{\sum_Y \left(Y_{exp} - Y_{LOO/CV}\right)^2}{\sum_Y \left(Y_{exp} - \bar{Y}_{exp}\right)^2} \tag{2}$$

$$Q^2_{ext} = 1 - \frac{\sum_Y \left(Y_{exp} - Y_{pre}\right)^2}{\sum_Y \left(Y_{exp} - \bar{Y}_{exp}\right)^2} \tag{3}$$

where $Y_{exp}$ is the experimental value, $Y_{LOO}$, $Y_{CV}$, and $Y_{pre}$ are the predicted values, and $\bar{Y}_{exp}$ is the mean of the experimental results over the modeling set (training set) cells. In addition, we used R2 (the squared correlation coefficient) and the mean absolute error (MAE; Equation 4) between the predicted ($Y_{pre}$) and the experimental ($Y_{exp}$) data for the test sets with both methods.

$$\mathrm{MAE} = \frac{\sum_Y \left|Y_{exp} - Y_{pre}\right|}{n} \tag{4}$$
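Equations 2–4 translate directly into code. The sketch below is illustrative only: it pairs the metrics with a plain, unweighted kNN leave-one-out predictor on a fixed set of descriptors (an assumption; the MC/SA descriptor selection of Section 2.2.3.1 is not reproduced), and the function names and toy data are hypothetical.

```python
import math


def knn_loo_predictions(X, y, k):
    """Predict each y[i] as the mean of its k nearest neighbours, excluding i."""
    preds = []
    for i, xi in enumerate(X):
        nearest = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        preds.append(sum(y[j] for _, j in nearest) / len(nearest))
    return preds


def q2(y_exp, y_pred, y_mean_train):
    """Equations 2 and 3: 1 - SS_res / SS_tot, with SS_tot about the training mean."""
    ss_res = sum((ye - yp) ** 2 for ye, yp in zip(y_exp, y_pred))
    ss_tot = sum((ye - y_mean_train) ** 2 for ye in y_exp)
    return 1.0 - ss_res / ss_tot


def mae(y_exp, y_pred):
    """Equation 4: mean absolute error over n predictions."""
    return sum(abs(ye - yp) for ye, yp in zip(y_exp, y_pred)) / len(y_exp)


# Toy modeling set with a single descriptor (hypothetical values).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 3.0]
y_mean = sum(y_train) / len(y_train)

loo = knn_loo_predictions(X_train, y_train, k=1)
print("Q2_LOO:", q2(y_train, loo, y_mean))
print("MAE   :", mae(y_train, loo))
```

The same `q2` function gives Q2ext when it is called with the experimental and predicted values of the external validation set while `y_mean_train` is still the mean over the modeling set, which is the only difference between Equations 2 and 3.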

For predictions made for the validation set with the kNN method we invoked the concept of applicability domain (AD). Following the work of Tropsha et al.,[24a] we defined the AD as a threshold distance $D_T$ between a query compound and its nearest neighbors in the training set, calculated as follows:

$$D_T = \bar{y} + Z\sigma$$

where $\bar{y}$ is the average Euclidean distance between each compound and its k nearest neighbors in the training set, $\sigma$ is the standard deviation of these Euclidean distances, and Z is an arbitrary parameter to control the significance level. We set the value of Z to 0.5, which formally places the allowed distance threshold at the mean plus one-half of the standard deviation. If the distance of the test compound from any of its k nearest neighbors (k is optimized during model construction) in the training set exceeds the threshold, the prediction is considered unreliable.
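The applicability domain test described above can be sketched in a few lines. This is a minimal illustration: the helper names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def k_nearest_distances(x, X_train, k, skip_index=None):
    """Sorted distances from x to its k nearest training points."""
    dists = sorted(
        math.dist(x, xt) for j, xt in enumerate(X_train) if j != skip_index
    )
    return dists[:k]


def ad_threshold(X_train, k, z=0.5):
    """D_T = mean + Z * sd of all training-set k-nearest-neighbour distances."""
    all_d = []
    for i, xi in enumerate(X_train):
        all_d.extend(k_nearest_distances(xi, X_train, k, skip_index=i))
    # Assumption: population standard deviation.
    return statistics.mean(all_d) + z * statistics.pstdev(all_d)


def in_domain(x, X_train, k, threshold):
    """Reliable only if none of the k nearest-neighbour distances exceeds D_T."""
    return max(k_nearest_distances(x, X_train, k)) <= threshold


# Toy training set with one descriptor (hypothetical values).
X_train = [[0.0], [1.0], [2.0], [4.0]]
d_t = ad_threshold(X_train, k=1, z=0.5)
print(d_t)
print(in_domain([2.5], X_train, 1, d_t))   # close to the training data
print(in_domain([6.0], X_train, 1, d_t))   # far from the training data
```

A smaller Z tightens the domain and lowers the coverage of the validation set, which is the trade-off reported as "% coverage" in the results tables.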

3 Results

3.1 TiO2|Cu-O Library

The TiO2|Cu-O library contained 169 cells. Seventeen of these cells were found to be non-photovoltaic and were therefore removed, leaving a total of 152 photovoltaic cells. Each cell was characterized by seven independent material descriptors and by three dependent variables (PV properties; see Section 2.2.1 above).

3.1.1 Data Visualization

The resulting 7D space was reduced into a 2D representation using PCA. The composition of the first two PCs in terms of the original descriptors is given in Table 1. These PCs cover 70.1 % and 16.2 % of the original variance, respectively, for a total of 86.3 %. The distribution of the solar cells in the resulting PC space is presented in Figure 5A, which shows that five cells (circled) are markedly different from the bulk of the library. When these cells were mapped back into the library, color coded according to the internal quantum efficiency (see Section 2.2.1 above), they were found to concentrate at the lower right corner (Figure 5B), a library region characterized by high IQE values.

Table 1. Composition of the first two PCs (covering a total of 86.3 % of the original variance) in terms of the original material descriptors.
PC1 | TCu-O, Ratio, Dcenter, Jmax, BGP
PC2 | TTiO2, Ra

Figure 5. (A) Scatter plot of the TiO2|Cu-O library in the space defined by the first two PCs. Cells suggested to be outliers are circled. (B) Plot of calculated internal quantum efficiency (IQE) as a function of cell position for the TiO2|Cu-O library. The circled area indicates the location of the outliers.

Figure 6. (A) Plot of PC1 as a function of cell position for the TiO2|Cu-O library. (B) Plot of PC2 as a function of cell position for the TiO2|Cu-O library (white cells represent non-photovoltaic cells that were eliminated). Cells considered to be outliers are circled. (C) Plot of the thickness of the Cu-O layer as a function of cell position for the TiO2|Cu-O library. (D) Plot of the thickness of the TiO2 layer as a function of cell position for the TiO2|Cu-O library. The x axis and y axis are in mm.

Figure 6 presents the values of the two PCs for each cell as a function of its position within the library (Figure 6A for PC1 and Figure 6B for PC2). The five outliers again concentrate at the bottom right-hand side of the library (circled).
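For readers who wish to reproduce this kind of 7D-to-2D projection outside SPSS, the following is a minimal NumPy sketch. It assumes that each descriptor is standardized before the decomposition (i.e., PCA on the correlation matrix), which the text does not state explicitly; the function name and the random stand-in data are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project standardized descriptors onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    # z-score each descriptor (assumes no descriptor is constant).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: rows of Vt are the PC directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance fraction per PC
    components = Vt[:n_components]
    scores = Z @ components.T                # cell coordinates in PC space
    return scores, components, explained[:n_components]


# Random stand-in for a library of 152 cells with 7 descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(152, 7))
scores, components, explained = pca_scores(X_demo)
print(scores.shape, explained)
```

Plotting the two columns of `scores` against each other gives the kind of scatter plot shown in Figure 5A, and inspecting the absolute values in each row of `components` indicates which descriptors dominate each PC, as summarized in Table 1.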


Figure 7. Q2LOO values as a function of cell removal for the VOC model. Outlier removal began with a set of 152 cells and the stopping criterion was met after the removal of 10 cells.

The data in Table 1 demonstrate that the first PC is primarily composed of material descriptors related to the absorber layer (Cu-O) whereas the second PC has a major contribution from material descriptors related to the window layer (TiO2). In agreement with this observation, the pattern resulting from mapping the thickness of the absorber layer onto the library is highly similar to the one generated by PC1, while the pattern resulting from mapping the thickness of the window layer onto the library is highly similar to the one generated by PC2 (compare Figure 6A with Figure 6C and Figure 6B with Figure 6D).

3.1.2 Model Building

3.1.2.1 Outlier Removal: We first checked whether a kNN model could be obtained for the dependent parameters JSC, VOC, and IQE while using all 152 photovoltaic cells. For JSC and VOC, kNN models were derived using all seven material descriptors. For IQE we only used six material descriptors, omitting Jmax since it is directly related to IQE (see Section 2.2.1). This test suggested that reasonable models (Q2LOO > 0.6) could only be obtained for JSC and IQE but not for VOC. Thus, the kNN optimization based outlier removal algorithm was used to remove outliers in order to derive a model for VOC. The algorithm removed ten cells before the stopping criterion (Q2LOO ≥ 0.9) was met. Figure 7 presents the improvement in Q2LOO as a function of cell removal. The ten cells removed by this procedure were mapped back onto the original library, color coded according to the thickness of the Cu-O layer. This plot (Figure 8) demonstrates that all outliers reside in library regions characterized by a thin Cu-O layer. Moreover, three of the five cells found to be outliers in the PCA were removed by the kNN optimization based outlier removal procedure.

3.1.2.2 Model Building: For IQE and JSC, models were built using all 152 photovoltaic cells, while for VOC only the 142 cells surviving the outlier removal procedure were used.

Figure 8. Outliers (marked by X) mapped into the library, color coded according to the thickness of the Cu-O layer.

Models were built according to common QSAR principles. The data set was split into a modeling set (80 %; 122 cells for JSC and IQE and 114 cells for VOC) and a validation set (20 %; 30 cells for JSC and IQE and 28 cells for VOC) using the representativeness algorithm,[18] which selects the most representative subset from a parent data set. Models were built on the modeling sets (using Q2LOO as an evaluation criterion) and tested on the validation sets (using Q2ext as an evaluation criterion). Results obtained by kNN and by GP are shown in Tables 2 and 3, respectively. These results indicate that models derived by the kNN algorithm have both relatively high cross-validation values (Q2LOO between 0.77 and 0.87) and relatively high external prediction values (Q2ext between 0.73 and 0.86; R2 between 0.74 and 0.88), the latter largely unaffected by the application of the applicability domain. Somewhat poorer results were obtained with the GP algorithm, with R2CV between 0.62 and 0.88, Q2ext between 0.54 and 0.86, and R2 between 0.55 and 0.87.


Table 2. Results obtained with the kNN algorithm for the TiO2|Cu-O library. AD: applicability domain.
End point | Q2LOO | Descriptors | Q2ext (R2), no AD | MAE, no AD | Q2ext (R2), with AD | MAE, with AD | % coverage
JSC (Ag) | 0.87 | Ratio, BGP, Dcenter | 0.86 (0.88) | 0.01 | 0.86 (0.89) | 0.01 | 83 %
VOC (Ag) | 0.86 | TTiO2, Jmax | 0.73 (0.74) | 0.02 | 0.75 (0.77) | 0.02 | 75 %
IQE (Ag) | 0.77 | TTiO2, Ratio, Ra | 0.80 (0.84) | 0.05 | 0.83 (0.86) | 0.04 | 87 %

Table 3. Results obtained with the GP algorithm for the TiO2|Cu-O library.
Model | R2CV | Q2ext (R2) | MAE
JSC = 0.062 + 0.0004 × TCu-O − 430384.1022/Ra | 0.88 | 0.86 (0.87) | 0.01
VOC = 0.011 × Jmax + 1.201 × 10^−5 × TTiO2 × Dcenter − 0.04 − 6.62 × 10^−13 × TCu-O × Ra | 0.62 | 0.54 (0.55) | 0.03
IQE = 1.784 × Ratio + 0.072/Ratio − 2642279.244/(2356681.705 + Ra) | 0.65 | 0.74 (0.74) | 0.06
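Because the GP models of Table 3 are closed-form expressions, they can be evaluated directly. The sketch below transcribes the three equations as read from the table; the function and argument names are hypothetical, the inputs must be given in the units used in References [4,11] (not restated here), and the example values are made up purely to exercise the arithmetic.

```python
def jsc_gp(t_cu_o: float, r_a: float) -> float:
    """JSC model of Table 3 (absorber thickness and absorber resistance)."""
    return 0.062 + 0.0004 * t_cu_o - 430384.1022 / r_a


def voc_gp(j_max: float, t_tio2: float, d_center: float,
           t_cu_o: float, r_a: float) -> float:
    """VOC model of Table 3."""
    return (0.011 * j_max
            + 1.201e-5 * t_tio2 * d_center
            - 0.04
            - 6.62e-13 * t_cu_o * r_a)


def iqe_gp(ratio: float, r_a: float) -> float:
    """IQE model of Table 3 (thickness ratio and absorber resistance)."""
    return 1.784 * ratio + 0.072 / ratio - 2642279.244 / (2356681.705 + r_a)


# Made-up inputs chosen only so that the terms are easy to check by hand.
print(jsc_gp(t_cu_o=100.0, r_a=4303841.022))
print(iqe_gp(ratio=0.5, r_a=2642279.244 - 2356681.705))
```

Writing the models out this way also makes their structure easy to read: JSC and IQE both improve as the absorber resistance term shrinks, while the VOC expression is the only one that involves the window layer thickness.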

3.2 TiO2|Cu2O Library

The TiO2|Cu2O library contains two sub-libraries differing in their back contacts, with 169 cells with an Ag back contact and 169 cells with an Ag/Cu back contact. Seven non-photovoltaic cells were removed from the first sub-library and three from the second sub-library, leaving a total of 162 and 166 cells for the sub-libraries with the Ag and Ag/Cu back contacts, respectively. Both sub-libraries were characterized by five material descriptors (TTiO2, TCu2O, Ratio, BGP, and Jmax) and by the three PV parameters (JSC, VOC, IQE). While the material descriptors of both sub-libraries were identical, their PV properties were different.

3.2.1 Data Visualization

Table 4. Composition of the first two PCs (covering a total of 97.4 % of the original variance) in terms of the original material descriptors.
PC1 | TCu2O, Jmax, BGP
PC2 | TTiO2, Ratio

The resulting 5D space was reduced into a 2D representation using PCA. The composition of the first two PCs in terms of the original descriptors is given in Table 4. These PCs cover 60.7 % and 36.7 % of the original variance, respectively, for a total of 97.4 %. The distribution of the solar cells in the resulting PC space is presented in Figure 9. The cells are shown to be evenly distributed within the PC space with no obvious outliers.

3.2.2 Model Building

In this case two sets of models were constructed, one for each sub-library. As before, for JSC and VOC we used all five material descriptors whereas for IQE, Jmax was omitted. Consistent with the absence of outliers in the PC plot, in this case models for all three PV properties could be derived using the entire (photovoltaic) library. As for the previous library, the data set was split into a modeling set (80 %; 130 and 133 cells for the Ag and Ag/Cu back contacts, respectively) and a validation set (20 %; 32 and 33 cells for the Ag and Ag/Cu back contacts, respectively) using the representativeness algorithm.[18] Models were built on the modeling sets and tested on the validation sets. Models were derived using both kNN and GP and the results are presented in Tables 5 (kNN) and 6 (GP). These results indicate that models derived by the kNN algorithm have high cross-validation values (Q2LOO between 0.90 and 0.92), except for the model for VOC with the Ag back contact (Q2LOO = 0.78), and good external prediction statistics (Q2ext and R2 between 0.87 and 0.92), the latter again largely unaffected by the application of the applicability domain. As before, somewhat poorer results were obtained with the GP algorithm, with R2CV between 0.61 and 0.76, Q2ext between 0.50 and 0.78, and R2 between 0.50 and 0.77.

Figure 9. Scatter plot of the TiO2|Cu2O library in the space defined by the first two PCs.

Table 5. Results obtained with the kNN algorithm for the two TiO2|Cu2O sub-libraries (back contacts are given in parentheses). AD: applicability domain.
End point | Q2LOO | Descriptors | Q2ext (R2), no AD | MAE, no AD | Q2ext (R2), with AD | MAE, with AD | % coverage
JSC (Ag) | 0.92 | TTiO2, TCu2O | 0.92 (0.92) | 0.02 | 0.92 (0.92) | 0.02 | 91 %
VOC (Ag) | 0.78 | TTiO2, TCu2O | 0.89 (0.89) | 0.02 | 0.89 (0.89) | 0.02 | 84 %
IQE (Ag) | 0.91 | TTiO2, TCu2O | 0.87 (0.87) | 0.18 | 0.87 (0.87) | 0.19 | 91 %
JSC (Ag/Cu) | 0.92 | TCu2O, Ratio | 0.89 (0.89) | 0.02 | 0.88 (0.89) | 0.02 | 79 %
VOC (Ag/Cu) | 0.92 | TCu2O, Ratio | 0.88 (0.89) | 0.02 | 0.89 (0.89) | 0.02 | 82 %
IQE (Ag/Cu) | 0.90 | TCu2O, Ratio | 0.91 (0.91) | 0.16 | 0.89 (0.89) | 0.18 | 73 %

Table 6. Results obtained with the GP algorithm for the two TiO2|Cu2O sub-libraries (back contacts are given in parentheses).
Model | R2CV | Q2ext (R2) | MAE
JSC (Ag) = 0.0009 × TCu2O − 0.22 | 0.74 | 0.76 (0.76) | 0.04
VOC (Ag) = 0.00047 × TTiO2 + 0.0004 × TCu2O | 0.65 | 0.78 (0.77) | 0.02
IQE (Ag) = 0.0058 × TCu2O − 1.26 | 0.70 | 0.72 (0.73) | 0.28
JSC (Ag/Cu) = 0.0009 × TCu2O − 0.22 | 0.76 | 0.74 (0.76) | 0.04
VOC (Ag/Cu) = 0.00048 × TTiO2 + 0.0004 × TCu2O | 0.61 | 0.50 (0.50) | 0.04
IQE (Ag/Cu) = 0.0059 × TCu2O − 1.34 | 0.72 | 0.72 (0.73) | 0.28

Table 7. Results obtained with the kNN algorithm for the TiO2|Cu2O sub-library with the Ag/Cu back contact, omitting the “Ratio” descriptor. AD: applicability domain.
End point | Q2LOO | Descriptors | Q2ext (R2), no AD | MAE, no AD | Q2ext (R2), with AD | MAE, with AD | % coverage
JSC (Ag/Cu) | 0.91 | TTiO2, TCu2O | 0.89 (0.90) | 0.02 | 0.89 (0.90) | 0.03 | 76 %
VOC (Ag/Cu) | 0.91 | TTiO2, TCu2O | 0.88 (0.89) | 0.02 | 0.89 (0.90) | 0.03 | 82 %
IQE (Ag/Cu) | 0.90 | TTiO2, TCu2O | 0.90 (0.90) | 0.17 | 0.90 (0.90) | 0.18 | 85 %

In the case of the Ag back contact, all kNN models selected, as the final descriptors, the thicknesses of the window and absorber layers, whereas in the case of the Ag/Cu back contact, the thickness of the absorber layer (Cu2O) and the thickness ratio were selected. In search of global descriptors that would work for both sub-libraries, we re-derived the models for the Ag/Cu back contact sub-library while omitting the “Ratio” descriptor. The results are presented in Table 7 and demonstrate performance similar to that of the models derived with all descriptors (Q2LOO between 0.90 and 0.91; Q2ext between 0.88 and 0.90; R2 between 0.89 and 0.90). The ability to develop “global” models, namely models that utilize the same descriptors for both sub-libraries, is further emphasized by the GP-derived models (Table 6). Identical models were obtained for JSC for both back contacts, and similar models were obtained for IQE and VOC. The kNN and GP results for the TiO2|Cu2O library suggest that the thicknesses of the window and absorber layers are important factors in determining PV properties. To evaluate the relative importance of these two material descriptors, we linearly regressed (using MLR) the three PV parameters (JSC, VOC, and IQE) against TTiO2 and TCu2O as well as against the interaction term between them (TTiO2 × TCu2O). The results are presented in Table 8. The beta values obtained for the JSC and IQE models are in agreement with the GP-derived models in that only TCu2O is found to be a significant predictor for these two properties. Furthermore, for VOC both TCu2O and TTiO2 were

© 2015 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2015, 34, 367 – 379


Special Issue EuroQSAR

Full Paper


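Table 8. Results of the MLR analysis for regressing JSC, VOC, and IQE against T_TiO2, T_Cu2O, and the interaction term. Beta values are the normalized regression coefficients and R² is the explained variance.

| Dependent variable | Back contact | R² | Independent variable | Beta (β) | p-value |
|---|---|---|---|---|---|
| JSC | Ag | 0.77 | T_TiO2 | 0.06 | > 0.05 |
| JSC | Ag | 0.77 | T_Cu2O | 0.81 | < 0.01 |
| JSC | Ag | 0.77 | T_TiO2 × T_Cu2O | 0.09 | > 0.05 |
| JSC | Ag/Cu | 0.79 | T_TiO2 | −0.02 | > 0.05 |
| JSC | Ag/Cu | 0.79 | T_Cu2O | 0.75 | < 0.01 |
| JSC | Ag/Cu | 0.79 | T_TiO2 × T_Cu2O | 0.22 | > 0.05 |
| IQE | Ag | 0.73 | T_TiO2 | 0.10 | > 0.05 |
| IQE | Ag | 0.73 | T_Cu2O | 0.80 | < 0.01 |
| IQE | Ag | 0.73 | T_TiO2 × T_Cu2O | 0.07 | > 0.05 |
| IQE | Ag/Cu | 0.76 | T_TiO2 | 0.01 | > 0.05 |
| IQE | Ag/Cu | 0.76 | T_Cu2O | 0.75 | < 0.01 |
| IQE | Ag/Cu | 0.76 | T_TiO2 × T_Cu2O | 0.19 | > 0.05 |
| VOC | Ag | 0.68 | T_TiO2 | 0.65 | < 0.01 |
| VOC | Ag | 0.68 | T_Cu2O | 0.68 | < 0.01 |
| VOC | Ag | 0.68 | T_TiO2 × T_Cu2O | −0.16 | > 0.05 |
| VOC | Ag/Cu | 0.59 | T_TiO2 | 0.58 | < 0.01 |
| VOC | Ag/Cu | 0.59 | T_Cu2O | 0.62 | < 0.01 |
| VOC | Ag/Cu | 0.59 | T_TiO2 × T_Cu2O | −0.11 | > 0.05 |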

found to be significant predictors with similar beta weights, in agreement with both the GP- and the kNN-derived models.
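Standardized (beta) coefficients of the kind reported in Table 8 can be obtained by z-scoring every variable before an ordinary least-squares fit. The following is a minimal sketch under stated assumptions, not the analysis actually performed for this paper: the thickness values and the response are made up, all function names are hypothetical, and p-values are not computed.

```python
from statistics import mean, pstdev


def zscore(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def standardized_betas(columns, y):
    """Least-squares fit on z-scored predictors and response (normal equations)."""
    zc = [zscore(c) for c in columns]
    zy = zscore(y)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in zc] for ci in zc]
    xty = [sum(p * q for p, q in zip(ci, zy)) for ci in zc]
    return solve(xtx, xty)


# Hypothetical layer thicknesses (nm); not data from the paper.
t_tio2 = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
t_cu2o = [200.0, 100.0, 400.0, 300.0, 600.0, 500.0]
interaction = [a * b for a, b in zip(t_tio2, t_cu2o)]

# Made-up response that depends on the absorber thickness only.
jsc = [0.003 * t + 0.1 for t in t_cu2o]

betas = standardized_betas([t_tio2, t_cu2o, interaction], jsc)
for name, beta in zip(["T_TiO2", "T_Cu2O", "T_TiO2 x T_Cu2O"], betas):
    print(f"{name}: beta = {beta:.3f}")
```

Because the made-up response depends on the absorber thickness alone, the fit assigns essentially all of the weight to T_Cu2O, which mirrors the pattern reported for JSC and IQE.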

4 Discussion

Data mining techniques and machine learning algorithms have been extensively used in the fields of chemoinformatics and material sciences to provide insight into the factors affecting the activities of molecules/materials and to develop predictive models for these activities. Borrowing from many success stories, we anticipated that such methods could have a similarly large potential in the area of photovoltaics (PV), in particular to analyze and rationalize empirical results and to design new solar cells with improved properties. The emergence of libraries consisting of multiple, well-characterized metal oxide based solar cells presented us with the opportunity to test this hypothesis.

The application of data mining and machine learning tools in the field of PV is challenging for several reasons. First, classical QSAR work is based on well-understood correlations between structures and activities defined, for example, by the similar property principle. Whether such correlations also exist between, e.g., material descriptors and PV properties is still an open question. Second, years of experience have provided the scientific community with at least a rough understanding of which molecular attributes (i.e., descriptors) should correlate with specific activities. This is not the case for PV properties. Finally, and in marked contrast with well-defined molecular entities (which are the subject of most, although not all, chemoinformatic studies), metal oxide photovoltaic cells are built from layers of inorganic compounds, sometimes with an undefined composition. Thus, these materials could not be

characterized using “classical” molecular descriptors and instead experimentally measured descriptors should be used. This has the advantage of using measured rather than modeled properties for the analysis. However, the number of parameters that can be measured in a high-throughput manner is far smaller than the number of calculable descriptors, presenting model-building algorithms with far fewer options to select from. Still, if reliable models could be derived, they are likely to be free from over-fitting or chance correlation.

Two fundamental hypotheses drove this research: first, that data mining and machine learning techniques could provide useful information on factors responsible for PV properties, and second, that it is feasible to develop predictive machine learning models based on QSAR principles, using experimentally measured cell characteristics to predict photovoltaic activities. In order to test both hypotheses we developed a workflow for the study of libraries of all-oxide photovoltaic cells and evaluated it on two solar cell libraries, namely, a TiO2|Cu-O library[4] and a TiO2|Cu2O library.[11]

The first library subjected to the workflow was the TiO2|Cu-O library. This library was generated using TiO2 and CuO as the target MOs for the window and absorber layers, respectively. The particular preparation procedure for this library led to heterogeneity in the absorber layer, which was shown to include contributions from CuO, Cu4O3, and Cu2O.[4] This heterogeneity is a probable cause for the presence of outliers and for the more complex models derived for this library (see below). The library was characterized using seven material descriptors as described in Section 2.2.1. A PCA on the resulting 7-dimensional space identified the presence of five outliers. When mapped back onto the

original design of the library, color-coded according to the IQE values (Figure 5), all outliers were found to concentrate at the lower right corner and to have high IQE values. This region of the library is characterized by a thin absorber layer (thickness of the Cu-O layer < 100 nm). Interestingly, for this library high IQE values are typically associated with a thick absorber layer. This discrepancy supports the identification of these cells as outliers. Anderson et al.[4] noted that the thin Cu-O layer was primarily composed of Cu2O, suggesting that a combination of a thin Cu2O layer and a much thicker TiO2 layer can also increase the IQE. The presence of outliers compromised the ability of the kNN algorithm to derive reliable models for VOC but not for JSC or IQE. Whether this is a general phenomenon, that is, whether VOC is more sensitive to the presence of outliers than either JSC or IQE, requires the analysis of additional libraries. Removing the five outliers discovered by the PCA did not lead to a better model for VOC (Q²LOO = 0.41). However, removal of ten outliers using the newly developed outlier removal algorithm afforded a good model also for VOC (Q²LOO = 0.9, Figure 7). This suggests that removal of outliers in the descriptor space does not necessarily lead to a better model and, conversely, that outliers in the descriptor space could still be well predicted by the kNN approach. Indeed, two of the five original outliers were retained in the final data set. Following the removal of outliers and the splitting of the library into a modeling set and a test set, models with good prediction statistics for both sets were derived for all PV properties using both kNN and GP. Almost identical models, both in terms of the equations and the statistical parameters, were obtained when using the entire library for model generation.
This likely resulted from the rational division of the library members into a modeling set and a validation set which span the same region of the descriptor space. This result also holds true for the models derived for the TiO2|Cu2O library. The ability to obtain reliable models with kNN suggests that the basic assumption underlying this method, namely, that similarity in the descriptor space translates into similarity in the activity space, is also valid for the prediction of PV properties from material descriptors. Somewhat better models were obtained with kNN than with GP (Q²LOO between 0.77 and 0.87 for kNN and between 0.62 and 0.88 for GP; Q²ext between 0.73 and 0.86 for kNN and between 0.54 and 0.86 for GP; R² between 0.74 and 0.88 for kNN and between 0.55 and 0.87 for GP). This may result from the fact that kNN is a non-linear method whereas GP requires a linear relation between descriptors and activities (although within the framework of symbolic regression, the descriptors can adopt multiple forms, including non-linear ones).

In their work, Anderson et al.[4] showed that JSC values positively correlate with the thickness of the absorber (CuO) layer. These observations are generally matched by the material descriptors selected by both the GP- and kNN-generated models. Specifically, the GP model selected T_Cu-O as one of two important descriptors (the second one being constant/Ra; see Table 3), and a subsequent correlation analysis demonstrated it to be significantly positively correlated with JSC (r = 0.80, p-value < 0.01). The kNN model selected three descriptors, namely, the layers ratio (Ratio), the distance from the center of the deposition plume (Dcenter), and the band gap (BGP). Correlation analysis found Ratio to be significantly positively correlated with JSC (r = 0.82, p-value < 0.01) and both Dcenter and the band gap to be significantly negatively correlated with JSC (Dcenter: r = −0.81, p-value < 0.01; BGP: r = −0.76, p-value < 0.01). The experimentally observed dependence on the Cu-O layer thickness is introduced into the kNN model via the inter-correlations between the material descriptors. Thus, T_Cu-O positively correlates with Ratio (r = 0.92, p-value < 0.01; the higher the ratio, the higher T_Cu-O) and negatively correlates with Dcenter (r = −0.93, p-value < 0.01; the higher Dcenter, the lower T_Cu-O). In addition, T_Cu-O negatively correlates with the band gap (r = −0.94, p-value < 0.01), suggesting that narrower band gaps correlate with higher T_Cu-O values and consequently with higher JSC.

In addition, Anderson et al.[4] suggested that the ratio between the thicknesses of the absorber layer and the window layer (Cu-O/TiO2) positively correlates with IQE. Accordingly, Ratio was selected as an important descriptor by both kNN and GP, and correlation analysis demonstrated it to be significantly positively correlated with IQE (r = 0.34, p-value < 0.01). Interestingly, two other descriptors selected by kNN were also found to correlate with IQE (Ra: r = 0.23, p-value < 0.01; T_TiO2: r = −0.30, p-value < 0.01). Finally, the dependence of VOC on the material descriptors was found to be complex, with multiple descriptors significantly correlated with it. Thus, no meaningful analysis could be made for this property.
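The leave-one-out statistic quoted for the kNN models can be illustrated with a short sketch. It assumes the common definition Q²LOO = 1 − PRESS/SS, an unweighted average over the k nearest neighbors, and Euclidean distance in the descriptor space; the paper's kNN implementation may differ in all three respects. The data and function names are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a property as the mean over the k nearest training cells."""
    ranked = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def q2_loo(x, y, k):
    """Leave-one-out cross-validated Q2: 1 - PRESS / total sum of squares."""
    predictions = []
    for i in range(len(x)):
        # Hold out cell i and predict it from all remaining cells.
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        predictions.append(knn_predict(rest_x, rest_y, x[i], k))
    y_mean = sum(y) / len(y)
    press = sum((obs - pred) ** 2 for obs, pred in zip(y, predictions))
    total = sum((obs - y_mean) ** 2 for obs in y)
    return 1.0 - press / total


# Hypothetical library: one descriptor (absorber thickness, arbitrary units)
# and a property that grows linearly with it.
thickness = [(float(i),) for i in range(10)]
jsc = [float(i) for i in range(10)]

q2 = q2_loo(thickness, jsc, k=2)
print(f"Q2_LOO = {q2:.3f}")
```

In this toy library only the two cells at the edges of the descriptor range are mispredicted, since their neighbors all lie on one side. This illustrates why a modeling set and a validation set that span the same region of the descriptor space tend to give consistent statistics.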
The second library subjected to the workflow was the TiO2|Cu2O library. This library was generated using TiO2 and Cu2O as the MOs for the window and absorber layers, respectively. The particular preparation procedure for this library (spray pyrolysis for both layers) ensured homogeneity in the absorber layer. This homogeneity is a probable cause for the lack of outliers in this library and for the simpler, better (in terms of prediction statistics), and more intuitive models derived for all PV parameters (compare Tables 2–3 with Tables 5–7). As for the previous library, kNN-based models performed better than GP-based ones. The resulting models for this library are in agreement with several observations made by Pavan et al.[11] First, JSC was observed to strongly depend on the thickness of the Cu2O layer, with a Cu2O thickness > 500 nm leading to the highest values and a thin Cu2O layer leading to the poorest values. These observations are consistent with the descriptors selected by kNN (T_Cu2O and T_TiO2) and GP (T_Cu2O only). Furthermore, in a subsequent MLR analysis T_Cu2O was found to be the only significant predictor for JSC as well as for IQE, with a positive correlation to both parameters (Table 8).


Second, VOC was observed to correlate with the total heterojunction thickness, where either both TiO2 and Cu2O, or just one of them, are the thickest.[11] These observations are again consistent with the descriptors selected by kNN and GP. Both algorithms selected T_TiO2 and T_Cu2O as important descriptors (although GP did not select the interaction term between these descriptors, and a subsequent MLR analysis did not find it to be a significant predictor of VOC). The MLR analysis also revealed that both descriptors are significant predictors of VOC with comparable contributions (Table 8). Finally, T_TiO2 was found to be a good surrogate for the Ratio descriptor in predicting PV properties for the Ag/Cu back contact sub-library (Table 7). This is perhaps not surprising given the reasonably high (yet not perfect) correlation between T_TiO2 and Ratio for this library (r = −0.84).

Analyzing the results across both libraries (and across both sub-libraries in the case of the TiO2|Cu2O library), the thicknesses of the window and absorber layers emerge as the strongest determinants of PV parameters. This conclusion is perhaps not surprising since layer thickness was the only parameter directly varied in this work. In future work, we plan to introduce another layer of complexity by using combinations of MOs for each layer (i.e., combinatorial materials). This will likely introduce more complexity into the data mining process and give rise to more complex correlations between layer composition and PV properties. A hint in this direction is provided by the more complex models generated for the TiO2|Cu-O library, where the absorber layer was shown to consist of several oxides, namely, CuO, Cu2O, and Cu4O3.[4] Yet despite its conceptual simplicity, the dependence of PV properties on layer thickness is interesting for several reasons: (1) It extends the structure-function concept common in “classical” QSAR studies to the realm of PV cells. (2) It paves the way towards future experimental designs since, at present, the thicknesses of the window and absorber layers are the only parameters which can be easily controlled when producing new PV cells. While we realize that the physics of solar cells puts limits on thickness variation (e.g., the thickness cannot be increased indefinitely), the results obtained in this study suggest plausible directions for the development of better cells. (3) It establishes the tools for the analysis of more complex combinatorial material libraries.

5 Conclusions and Future Directions

In this work we presented the first application of data mining and machine learning techniques to the analysis of large numbers of all-oxide PV cells at the single cell level. For this purpose we developed a simple workflow and applied it to the analysis of two libraries, TiO2|Cu-O and TiO2|Cu2O, the second of which contains two sub-libraries. This workflow represents a working method for implementing data mining and machine learning tools in the all-oxide PV world. The main conclusions which could be drawn from the present study are:

1. Correlations between material descriptors and activities (expressed in terms of PV properties) are observed for PV cells. Furthermore, the similar property principle is followed in this system.
2. Models with good prediction statistics, correlating cell composition with PV properties, could be derived. Such models could therefore be used for experimental design. Such design could be performed, for example, by generating virtual solar cells, predicting their properties, and selecting for manufacturing those with favorable PV properties, much in the same way QSAR models are applied for the design of new bioactive compounds.
3. Derived models are in good agreement with experimental observations, in particular with respect to highlighting factors affecting PV properties. Such models could therefore be used for a similar purpose in more complex cases, for example, when interactions between different factors come into play.

An area of potential improvement in this field is the usage of additional material descriptors, either calculated or experimentally measured. The former may require new theoretical developments or the ability to apply existing methods (e.g., DFT calculations) in a high-throughput manner. The latter depends on the further application of existing technologies or the development of new technologies for characterizing PV cells. In particular, spectroscopic methods (e.g., XRD, Raman) could be employed, and the resulting spectra could be used as new material descriptors. Finally, we believe that the ideas and methodologies presented in this work will be beneficial in future studies of PV cells.
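The design loop described in the second conclusion (generate virtual cells, predict their properties, select the most promising ones) can be sketched as follows. This is an illustration under stated assumptions rather than the workflow used in this work: the measured cells, the candidate grid, and the inverse-distance-weighted nearest-neighbor predictor are all hypothetical stand-ins for a validated model.

```python
import math
from itertools import product


def predict(measured, query, k=2):
    """Inverse-distance-weighted mean over the k nearest measured cells.

    Assumes the query does not coincide with a measured cell
    (a zero distance would lead to a division by zero).
    """
    nearest = sorted(measured, key=lambda cell: math.dist(cell[0], query))[:k]
    weights = [1.0 / math.dist(descriptors, query) for descriptors, _ in nearest]
    weighted = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted / sum(weights)


# Hypothetical measured cells: ((T_window, T_absorber) in nm, property value).
measured = [
    ((100.0, 100.0), 0.1),
    ((100.0, 500.0), 0.5),
    ((300.0, 100.0), 0.2),
    ((300.0, 500.0), 0.9),
]

# Virtual cells: a grid of thickness combinations that were never fabricated.
virtual = list(product([150.0, 250.0], [200.0, 400.0]))

# Rank the virtual cells by predicted property, best first.
ranked = sorted(virtual, key=lambda cell: predict(measured, cell), reverse=True)
for cell in ranked:
    print(cell, round(predict(measured, cell), 3))
```

In practice the candidate grid would be restricted to thicknesses that are physically sensible and that fall inside the applicability domain of the model.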

Acknowledgement

The authors acknowledge financial support from the Israeli National Nanotechnology Initiative (INNI, FTA project).

References

[1] A. Jäger-Waldau, Int. J. Photoenergy 2012, 2012.
[2] S. Rühle, H. N. Barad, Y. Bouhadana, D. A. Keller, A. Ginsburg, K. Shimanovich, K. Majhi, R. Lovrincic, A. Y. Anderson, A. Zaban, Phys. Chem. Chem. Phys. 2014, 16, 7066–7073.
[3] J. Deuermeier, J. Gassmann, J. Brötz, A. Klein, J. Appl. Phys. 2011, 109, 113704.
[4] A. Y. Anderson, Y. Bouhadana, H.-N. Barad, B. Kupfer, E. Rosh-Hodesh, H. Aviv, Y. R. Tischler, S. Rühle, A. Zaban, ACS Comb. Sci. 2014, 16, 53–65.
[5] T. Kosugi, S. Kaneko, J. Am. Ceram. Soc. 1998, 81, 3117–3124.
[6] S. Rühle, A. Y. Anderson, H.-N. Barad, B. Kupfer, Y. Bouhadana, E. Rosh-Hodesh, A. Zaban, J. Phys. Chem. Lett. 2012, 3, 3755–3764.
[7] a) A. Mellit, S. A. Kalogirou, Prog. Energy Combust. Sci. 2008, 34, 574–632; b) A. Mellit, S. A. Kalogirou, L. Hontoria, S. Shaari, Renew. Sustainable Energy Rev. 2009, 13, 406–419.
[8] F. Bonanno, G. Capizzi, G. Graditi, C. Napoli, G. M. Tina, Appl. Energy 2012, 97, 956–961.
[9] K. Ishaque, Z. Salam, S. Mekhilef, A. Shamsudin, Appl. Energy 2012, 99, 297–308.
[10] A. Tropsha, Mol. Inf. 2010, 29, 476–488.
[11] M. Pavan, S. Rühle, A. Ginsburg, D. A. Keller, H.-N. Barad, P. M. Sberna, D. Nunes, R. Martins, A. Y. Anderson, A. Zaban, E. Fortunato, Sol. Energ. Mater. Sol. 2015, 132, 549–556.
[12] I. T. Jolliffe, in Principal Component Analysis, Springer, New York, 2002.
[13] a) S. Wetzel, A. Schuffenhauer, S. Roggo, P. Ertl, H. Waldmann, CHIMIA Int. J. Chem. 2007, 61, 355–360; b) N. Singh, R. Guha, M. A. Giulianotti, C. Pinilla, R. A. Houghten, J. L. Medina-Franco, J. Chem. Inf. Model. 2009, 49, 1010–1024; c) L. B. Akella, D. DeCaprio, Current Opin. Chem. Biol. 2010, 14, 325–330.
[14] IBM SPSS, Version 20.0 ed., IBM Corporation, Armonk, New York 2011.
[15] A. Yosipof, H. Senderowitz, J. Comput. Chem. 2015, 36, 493–506.
[16] M. A. Johnson, G. M. Maggiora, Concepts and Applications of Molecular Similarity, Wiley, New York, 1990.
[17] W. Zheng, A. Tropsha, J. Chem. Inform. Comput. Sci. 1999, 40, 185–194.
[18] A. Yosipof, H. Senderowitz, J. Chem. Inf. Model. 2014, 54, 1567–1577.
[19] T. M. Martin, P. Harten, D. M. Young, E. N. Muratov, A. Golbraikh, H. Zhu, A. Tropsha, J. Chem. Inf. Model. 2012, 52, 2570–2578.
[20] W. Banzhaf, P. Nordin, R. E. Keller, F. D. Francone, Genetic Programming: an Introduction, Vol. 1, Morgan Kaufmann, San Francisco, 1998.
[21] D. A. Augusto, H. J. Barbosa, in Neural Networks, 2000. Proc. 6th Brazilian Symp., IEEE, Rio de Janeiro, 2000, pp. 173–178.
[22] M. Schmidt, H. Lipson, Science 2009, 324, 81–85.
[23] OECD Series on Testing and Assessment 69, OECD Document ENV/JM/MONO(2007)2, 2007, p. 55 (paragraph no. 198) and 165 (Table 195.197).
[24] a) A. Tropsha, P. Gramatica, V. K. Gombar, QSAR Comb. Sci. 2003, 22, 69–77; b) L. M. Shi, H. Fang, W. Tong, J. Wu, R. Perkins, R. M. Blair, W. S. Branham, S. L. Dial, C. L. Moland, D. M. Sheehan, J. Chem. Inform. Comput. Sci. 2000, 41, 186–195.


Received: November 28, 2014 Accepted: February 9, 2015 Published online: March 20, 2015
