www.molinf.com
DOI: 10.1002/minf.201400068
Solved and Unsolved Problems of Chemoinformatics
Johann Gasteiger*[a]
Abstract: From humble beginnings in the Sixties, chemoinformatics has evolved into a scientific field of its own. Without the achievements of chemoinformatics, the flood of information in chemistry would simply not be manageable, and modern research in chemistry and related fields would be inconceivable. However, there is still a host of problems waiting for solutions to be found or improved. The impact of chemicals on human health and on the environment presents both challenges and concerns. Research in chemoinformatics could help in better understanding these topics and thus contribute to a better life.
Keywords: CASD · CASE · QSAR · QSPR · REACH · Databases
Introduction
Chemistry has derived most of its knowledge from observations made in experiments and from measured data. In the Sixties it was recognized that computers offer the possibility not only of performing computations but also of storing and processing information. Early applications of information processing in chemistry centered on computer-assisted structure elucidation (CASE) and computer-assisted organic synthesis design (CASD). From these beginnings, the processing of chemical information has blossomed into a field of its own, chemoinformatics, providing methods that allow the solving of many problems in chemistry and related fields.[1,2] Clearly, this communication can only scratch the surface and present only a few of the solved and unsolved problems of chemoinformatics. A more extensive, though still incomplete, presentation of the points raised here has been given in a recent publication.[3]
Solved Problems
Structure Representation and Databases
In order to face such challenges as CASE and CASD, some fundamental problems in processing chemical information had to be solved. Foremost was the need to make the computer understand the language of chemists: the representation of chemical structures by structure diagrams and the representation of chemical reactions by reaction equations. This chemical language is largely graphical in nature, and therefore molecule editors have been developed to communicate with the computer in a graphical manner. Over the years various methods for storing molecular information have been developed.[4] However, it became quite clear that molecules should be represented in a way that allows access to each atom and each bond in a molecule. Standards for exchanging such structural information have been agreed upon both in tabular form (Molfile, SDFile)[5] and in a linear code (SMILES).[6]
Having solved the computer representation of chemical structures and reactions allowed the development of databases on chemical information such as CASOnline,[7] Beilstein,[8] Gmelin,[9] CSD,[10] PDB[11] and PubChem.[12] The representation of chemical structures with atomic resolution provided the basis for full structure, substructure and similarity searches of compounds in databases.[13] Furthermore, a variety of visualization methods for chemical structures have been developed. Particularly those for proteins and nucleic acids provide new insights into the very nature of these macromolecular structures and their functions. Databases with chemical information have had a big impact on how chemical research is done. The large amount of chemical information available can only be managed through chemoinformatics methods. Without databases, modern scientific research in chemistry and related fields would simply be inconceivable.
Learning from Chemical Information
Widespread as databases are, they will nevertheless always be incomplete. For example, whereas more than 70 million compounds are known, information on the 3D structure of organic compounds is stored for only 600 000 compounds in the CSD, less than 1 %! However, the available information can be used to learn from the data and to develop models that allow the generation of 3D structures for any organic compound. This has been achieved with the development of the 3D structure generator CORINA.[14]
[a] J. Gasteiger
Computer-Chemie-Centrum, University of Erlangen-Nuremberg
D-91052 Erlangen, Germany
*e-mail: [email protected]
2014 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim
Mol. Inf. 2014, 33, 454 – 457
Special Issue STRASBOURG
Communication
Figure 1. The two-step approach to QSAR and QSPR.
Quantitative Structure Activity or Property Relationships (QSAR or QSPR) provide a two-step approach to the prediction of properties of compounds that are too difficult to predict from first principles. First, the structures of a dataset of compounds have to be represented by structure descriptors. A large variety of structure descriptors has been developed,[15] describing the whole molecule, the 2D structure, the 3D structure, or molecular surfaces.[16] In a second step, the dataset of structures represented by appropriate descriptors, together with their associated properties, is analyzed by a data modeling technique taken from statistics, chemometrics, pattern recognition, or artificial neural networks (Figure 1).[17] In this way, models for the prediction of a host of physical, chemical, or biological properties of compounds have been developed.
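The two-step approach described above can be illustrated with a minimal sketch. It is not taken from any published QSAR or QSPR model: the structure encoding, the two count descriptors, the function names (`descriptors`, `fit`, `predict`) and, in particular, the property values are hypothetical placeholders chosen so that the example is self-contained. Step 1 maps each structure to a descriptor vector; step 2 fits an ordinary least-squares model to descriptor/property pairs.

```python
from typing import List, Tuple

# Assumption: a structure is a hydrogen-suppressed graph, given as
# (list of atom symbols, list of bonds as pairs of atom indices).
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def descriptors(mol: Molecule) -> List[float]:
    """Step 1: represent a structure by a fixed-length descriptor vector."""
    atoms, _bonds = mol  # the bonds are not used by these toy descriptors
    n_carbon = sum(1 for a in atoms if a == "C")
    n_hetero = len(atoms) - n_carbon
    return [1.0, float(n_carbon), float(n_hetero)]  # leading 1.0 is the intercept


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(X: List[List[float]], y: List[float]) -> List[float]:
    """Step 2: least-squares fit via the normal equations (X^T X) w = X^T y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(w: List[float], mol: Molecule) -> float:
    return sum(wi * di for wi, di in zip(w, descriptors(mol)))


ethane: Molecule = (["C", "C"], [(0, 1)])
propane: Molecule = (["C", "C", "C"], [(0, 1), (1, 2)])
ethanol: Molecule = (["C", "C", "O"], [(0, 1), (1, 2)])
butanol: Molecule = (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3), (3, 4)])
propanol: Molecule = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

# Synthetic property values (NOT measurements): 1 + 2 * n_carbon + 5 * n_hetero.
training = [(ethane, 5.0), (propane, 7.0), (ethanol, 10.0), (butanol, 14.0)]

weights = fit([descriptors(m) for m, _ in training], [p for _, p in training])
print(weights, predict(weights, propanol))
```

Because the model is linear in descriptors that have a direct chemical meaning, each fitted weight can be read as the contribution of one structural feature. Real applications would use validated descriptors and measured property data.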
Challenges
Chemoinformatics is not an end in itself but should provide insights into chemical problems and deliver methods for making research and development more efficient. In this sense it is worthwhile to reflect on the objectives of the chemical industry. The goal of the chemical industry is not so much to produce new compounds but to produce compounds with new, interesting properties, be it a drug, a dyestuff, a plastic, etc. Thus, the fundamental questions in chemistry and the areas where chemoinformatics could come in are:
Which compound will have the desired property? → Structure property relationships (QSPR, QSAR)
How can I make this compound? → Synthesis design
What is the product of my reaction? → Prediction of the outcome of a reaction
    → Structure elucidation
All these areas have already been tackled by chemoinformatics, but some only lightly so; more effort is needed.
Society is becoming increasingly interested in and concerned about the impact of chemicals on human health and on the environment. Therefore, chemicals should be introduced or used only if they are proven to be safe. Legislation in Europe such as REACH (Registration, Evaluation, Authorization and restriction of CHemicals)[18] and the Cosmetics Directive[19] aims at providing a healthier and safer environment. Similar legislation has been introduced in other countries such as Japan and China. The objectives of this legislation also pose an all-important challenge for chemoinformatics. A similar view was expressed in 2009 in the opening statement of the Journal of Cheminformatics,[20] which identified four “grand challenges” for chemoinformatics:
– Overcoming stalled drug discovery
– Green chemistry & global warming
– Understanding life from a chemical perspective
– Enabling the network of the world’s chemical and biological information to be accessible and interpretable
Problems Still to be Solved
Access to Chemical Information
Widespread as molecule editors are, they are still somewhat cumbersome and time-consuming to use. New structure input methods, such as handwriting or voice recognition, should be developed. Text mining methods are needed to provide access to information on the Internet or in printed media that is not included in databases.
Some classes of chemical compounds, such as organometallic compounds (ferrocene etc.), boranes, Markush structures in patents, and polymers, still defy a proper representation by a structure diagram. New representations for these compounds are needed.
The quality of information in databases has to be improved. Databases contain many errors. Methods for eliminating these errors and for making quality checks before information is stored in a database have to be developed. All information that is available for a compound (all properties, all spectra) should go into databases.
Databases with information on chemical reactions are notoriously incomplete. Information has to be provided on the full stoichiometry, side products, and all conditions (the ratio of starting materials, the reaction time, and the temperature) of a reaction. Only then can we better learn about chemical reactivity and the course of chemical reactions.[21] The availability of electronic laboratory notebooks could lay the foundation for a direct flow of information from the information producer (the experimenter) to the information consumer without manual intervention (which is presently still required!).
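The requirement that a reaction entry carry full stoichiometry, side products, and conditions, and that it pass a quality check before being stored, can be sketched as follows. This is only an illustration under stated assumptions: the `ReactionRecord` class, its field names, and the simple formula parser (no brackets, charges, or isotopes) are hypothetical and do not describe the schema of any existing reaction database.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple molecular formula such as 'C2H6O'."""
    counts: Counter = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


@dataclass
class ReactionRecord:
    # Each entry is (stoichiometric coefficient, molecular formula).
    reactants: List[Tuple[int, str]]
    products: List[Tuple[int, str]]  # should include side products
    temperature_c: Optional[float] = None
    time_h: Optional[float] = None

    def imbalance(self) -> Dict[str, int]:
        """Return product-minus-reactant atom counts for unbalanced elements."""
        total: Counter = Counter()
        for coeff, formula in self.products:
            for element, n in parse_formula(formula).items():
                total[element] += coeff * n
        for coeff, formula in self.reactants:
            for element, n in parse_formula(formula).items():
                total[element] -= coeff * n
        return {element: n for element, n in total.items() if n != 0}


# Esterification of acetic acid with ethanol, recorded with its side product water.
complete = ReactionRecord(
    reactants=[(1, "C2H4O2"), (1, "C2H6O")],
    products=[(1, "C4H8O2"), (1, "H2O")],
    temperature_c=None,  # conditions left open: no real values are assumed here
    time_h=None,
)

# The same reaction as it is often stored: main product only.
incomplete = ReactionRecord(
    reactants=[(1, "C2H4O2"), (1, "C2H6O")],
    products=[(1, "C4H8O2")],
)

print(complete.imbalance())    # an empty dict means every element is balanced
print(incomplete.imbalance())  # lists the atoms missing on the product side
```

An atom-balance check of this kind flags the incomplete entry before it is stored. It cannot detect wrong structures or wrong conditions, so it would be only one of several quality checks.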
Property Prediction
The trend in QSAR/QSPR has to go from model building toward interpretation and increasing our knowledge. To that effect, structure descriptors that can be interpreted should be developed and used. Models that take account of the physicochemistry of the process should be designed. Data analysis methods that are not black boxes but allow an understanding of the effects exerted on the property investigated should be chosen. Toxicity prediction is presently a major focus. Clearly, knowledge of toxicology and of the effects governing individual endpoints has to flow into the building of models.
Statement
I have no conflicting financial or personal interests with myself or with others that might bias my work.
Areas of Application
By far the most important application of chemoinformatics is presently the area of drug design. The entire development chain of a drug can benefit from chemoinformatics. Most new drugs recently introduced into the market have in some way involved the use of chemoinformatics methods. Important areas in drug design that are open for new ideas are the conformational flexibility of compounds, particularly of proteins, protein-protein interactions, protein-DNA interactions, the representation and handling of RNA, prediction of binding energies based on the physicochemistry of the process, prediction of important properties such as water solubility or pKa values, and the prediction of ADMET properties. However, it has to be emphasized that chemoinformatics could be used in any field of chemistry and related sciences, such as organic, inorganic, physical or analytical chemistry and toxicology. These areas need more attention. CASE and CASD, being the first areas for the application of chemoinformatics, deserve renewed attention. To emphasize: The application of chemoinformatics is only limited by your own imagination!
Summary and Conclusions
From humble beginnings chemoinformatics has developed into a full-fledged scientific discipline of its own. Many problems have been solved, and the solutions have had a strong impact on how chemical research and development is done. However, many problems still have to be tackled, which emphasizes again that chemoinformatics is a discipline of its own, offering challenging tasks to be solved. This means that new researchers entering chemoinformatics are needed and that the teaching of chemoinformatics has to be a priority. However, not only are chemoinformatics specialists needed; chemoinformatics topics should also be integrated into every curriculum for chemistry majors so that the next generation of chemists better knows when and how to use chemoinformatics methods.
Abbreviations
ADMET    Absorption, Distribution, Metabolism, Excretion, Toxicity
CASD     Computer-Assisted Synthesis Design
CASE     Computer-Assisted Structure Elucidation
CSD      Cambridge Structural Database
PDB      Protein Data Bank
QSAR     Quantitative Structure Activity Relationship
QSPR     Quantitative Structure Property Relationship
REACH    Registration, Evaluation, Authorization and restriction of CHemicals
SDFile   Structure-Data File
SMILES   Simplified Molecular-Input Line-Entry System
References
[1] Handbook of Chemoinformatics – From Data to Knowledge, Vol. 1–4 (Ed: J. Gasteiger), Wiley-VCH, Weinheim, 2003.
[2] Chemoinformatics – A Textbook (Eds: J. Gasteiger, T. Engel), Wiley-VCH, Weinheim, 2003.
[3] J. Gasteiger, “Some solved and unsolved problem of chemoinformatics”, SAR QSAR Envir. Res. 2014, 25, 443 – 453.
[4] J. M. Barnard, “Representation of Molecular Structures – An Overview”, in Handbook of Chemoinformatics – From Data to Knowledge, Vol. 1 (Ed: J. Gasteiger), Wiley-VCH, Weinheim, 2003, pp. 27–50.
[5] http://en.wikipedia.org/wiki/Chemical_table_file/
[6] http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_system/
[7] http://www.cas.org/
[8] http://en.wikipedia.org/wiki/Beilstein_database/
[9] http://en.wikipedia.org/wiki/Gmelin_database/
[10] http://www.cdc.cam.ac.uk/Solutions/CSDSystem/Pages/CSD.aspx/
[11] http://www.wwpdb.org/
[12] http://pubchem.ncbi.nlm.nih.gov/
[13] J. Xu, “Two-dimensional Structure and Substructure Searching”, in Handbook of Chemoinformatics – From Data to Knowledge, Vol. 2 (Ed: J. Gasteiger), Wiley-VCH, Weinheim, 2003, pp. 868–884.
[14] J. Sadowski, J. Gasteiger, Chem. Rev. 1993, 93, 2567 – 2581; CORINA can be obtained from Molecular Networks GmbH, Germany (http://www.molecular-networks.com).
[15] R. Todeschini, V. Consonni, Molecular Descriptors for Chemoinformatics, Vol. 1 and 2, Wiley-VCH, Weinheim, 2009.
[16] J. Gasteiger, J. Med. Chem. 2006, 49, 6429 – 6434.
[17] a) J. R. Rose, “Machine Learning Techniques in Chemistry”, in Handbook of Chemoinformatics – From Data to Knowledge, Vol. 3 (Ed: J. Gasteiger), Wiley-VCH, Weinheim, 2003, pp. 1082–1097; b) K. Varmuza, “Multivariate Data Analysis in Chemistry”, in ibid., pp. 1098–1134; c) L. Eriksson, J. Lundstedt, J. Shockar, S. Wold, “Partial Least Squares (PLS) in Chemistry”, in ibid., pp. 1135–1166; d) J. Zupan, “Neural Networks”, in ibid., pp. 1167–1215; e) A. von Homeyer, “Evolutionary Algorithms and Their Applications in Chemistry”, in ibid., pp. 1239–1280.
[18] http://ec.europa.eu/enterprise/sectors/chemicals/reach/indexen.htm
[19] http://ec.europa.eu/consumers/sectors/cosmetics/documents/directive/
[20] http://www.jcheminf.com/content/1/1/1/
[21] J. Gasteiger, J. Comput. Aided Mol. Des. 2007, 21, 33 – 52.
Received: May 6, 2014 Accepted: June 2, 2014 Published online: June 6, 2014