components for the structure elucidation of organic compounds consisting of ..... menthol (II), linallol(II1) and an isomer of thujopsene(IV), reported recently, with.
Pure & Appl. Chem., Vol. 61, No. 3, pp. 609-612, 1989, Printed in Great Britain. @ 1989 IUPAC
Application of the automated structure elucidation system (CHEMICS) to the chemistry of natural products Kimito Funatsu, Yutaka Susuta, and Shin-ichi Sasaki Toyohashi University of Technology, Tempaku, Toyohashi 4 4 0 , JAPAN
Abstract
Principles o f the automated structure elucidation system for organic compounds (CHEMICS) and i t s applications t o natural product chemistry are described. I t i s made clear t h a t CHEMICS i s a t the practical level in structure elucidation of natural products through the introduction of 2D-NMR data processing and specific partial structure elucidation functions t o the system.
The CHEMICS system, developed by the authors, is a computer-assisted structure elucidation system for organic compounds, which depends on the way of structure generation method; that is, the most probable structure is generated by the automated analysis of data (also for instance, chemical spectra) of an unknown using empirical and theoretical rules.') The principle of the system is that all possible structures, which are known to exist or which might exist on chemical grounds, are listed in a computer. The number of the structures in a particular case is then narrowed down by successively entering information from spectroscopic measurements. CHEMICS is designed to store all the substructures (called 'components') necessary for building any likely structures. At present, CHEMICS contains 630 components for the structure elucidation of organic compounds consisting of C, H, 0, N, S and halogen atoms (Table 1). The set of components has been devised so that it is possible to construct any structures by selecting appropriate components from the complete set. To store such a set of components in a computer is synonymous with storing all the complete structures which could be present. Table 1. Component set for structure elucidation of organic compounds containing C,H,O,N,S, and halogens. I4
F.
IR
'~-PId, "C-NMR
1
Input o f Macrocomponent( s )
No.
1
tert-Bu-
7
51 52 53
No.
Component (S) (ND)
C H ~ C H I - (CD)
Component
372 373
-774 .. :
(CT)
403
(F) (')
626 627
(ND)
628 .-.
IS5 186
351 352 353
Candidate Stereoisomeric Structure(s)
'C = N H
'
629 630
Fig.1. Block diagram of the current CHEMICS.
-F
-CI -Br -I -D-
The current CHEMICS system is composed of the following four functional modules, as shown in Fig. 1: a) Data analysis, b) Structure generator, c) Stereo-generator, d) Input of macrocomponent (partial structure). a) Data analysis: Among the components which have survived because they are consistent with the molecular formula, some can be subsequently discharged because they are inconsistent with H-1 and C-13 NMR chemical shift values, or IR data. In the selection of components by CHEMICS these spectral data measured on the sample are compared by computer with those in component/chemical shift or component/wave number correlation tables, so that only components consistent with these data are left. Part of the correlation table showing ppm ranges for H-1 and C-13 shifts is shown in Table 2. The next step is to make component sets by use of the components which have been selected as being not contradictory to the molecular formula and spectral data. 609
K. FUNATSU, Y. SUSUTA AND S. SASAKI
610
Table 2 . Correlation table for NMR analyses. No. 16 132 196 197 198 199 208 209 21 1 274 277
Components (CHdzCHCHICO>CH2 >CH2 >CH2 >CHI -CH< -CH< -CH c -CH= -CH=
IH-NMR(ppm) (CS) (0)
(N) (0)
(Y) (CD) (N) (0)
(CD) (N) (CD)
1.20 2.50 5.60 6.10 5.40 6.20 7.30 7.70 4.40 9.60 9.00
0.50 1.80 1.10 2.30 1.70 0.50 1.10 1.80 0.80 6.60 4.50
llC-NMR (pprn) 28.9 24.2 75.5 88.6 60.4 58.0 96.9 111.1 75.8 184.5 165.0
13.4 17.7 25.7 43.2 6.6 11.9 27.1 41.3 15.4 91.6 90.1
40.7 174.0
18.2 165.8
b) Structure generator: This step is to generate structures from the individual component combinations. The generation is carried out taking all possibilities into account, in due consideration of the principle that most of the components can only be linked to a limited number of species. On the basis of specially designed logic, connectivity stack, when the system functions properly it does not reproduce the same structure nor does it fail to generate any structure which can justifiably be built. c) Stereo-generator: The major role of the above module is generation of constitutional isomers. On the other hand, this module has a function for generating all possible stereoisomeric structures due to asymmetric carbon, double bond and so on using topological information of the respective constitutional isomers generated by 'structure generator'. d) Input of macrocomponent (partial structure): The chemist often has some information about the structure of a sample. This may be obtained from the past record of the sample or the experience in its laboratory handling. When the partial structure is entered by the user, the constitutional information is degraded into its components (described above), which are then compared with the components that the system has selected. The system will adopt the information entered only when all the components derived from the partial structure inserted have already been selected by CHEMICS. This means that the components which the system has selected with a full safety factor will take precedence over the additional information which has been entered manually. The stereochemical information of the macrocomponent is reflected on the final results according to other logic. As obvious from the above explanation, the fragments for making up structures and carriers of spectral information are just components. The unit of examining the reasonable allocation of each component to NMR signals, is also component centering around the correlation table. Moreover, the examination of the input macrocomponent by spectral data is also based on the component unit. It is obvious that essentially the analytical ability of 'data analysis' in CHEMICS never exceeds what is provided by component units. Thus, correspondence of candidate structures with input data is said to become ambiguous in some cases. The number of candidates increases in proportion to the ambiguity. According to the principle of never missing correct solution, this result is said to be unavoidable. As one of the ideas for coping with this situation, CHEMICS-F, which has a file retrieval function, has been developed, and the modules for prediction of the number of C-13 NMR signals and judgement of probability on the basis of strain energy calculation, have been provided. These functions play an effective role after generation of whole structures. On the other hand, the introduction of partial structures selected by the user, has enhanced the correctness and practicality of structure elucidation by our system. However, if possible, it seems to be one of the ideal features that partial structures entered should be determined by agreement with both deduction by the computer and judgement of it by the user. In order to realize about this situation an analytical way different from that in 'data analysis' of the current CHEMICS is required. In this sense, as a new approach of automated partial structure elucidation, an interdependent analytical way based on the relationships between H-1 and C-13 NMR chemical shifts for each atomic group with specified neighboring groups, has been developed. According to this method, rather big-sized partial structure can be generated automatedly which is most helpful for narrowing the number of candidates. Furthermore, the introduction of carbon-carbon signal connectivity information provided by 2-D NMR into CHEMICS has been accomplished so that the connectivity of carbons in skeletal structure of an unknown is automatically analyzed and elucidated at the both steps of 'data analysis' and 'structure generator' in generating whole candidate structures. This also results in the fewer number of candidates.
EXAMPLES OF STRUCTURE ELUCIDATION The first example of the elucidation process of an unknown compound (actually, nicotine) is serially illustrated in Fig. 2. Chemical data of the unknown, molecular formula and spectroscopic data of IR, H-1 and C-13 NMR, are compared with 630 components at the step of 'data analysis' to give rise 65 components. The number of candidates will amount to more than ten thousand if all the possible structures are constructed on the basis of the 65 components. In this example, the interdependent analytical method based on the relationships between H-1 and C-13 NMR chemical shifts to suggest automatically the presence of '3substituted pyridyl group' as a possible substructure in the unknown structure so that the candidates' number drastically diminishes into only seven structures in which the underlined correct solution is involved.
CHEMICS: a computer assisted structure elucidation system
61 1
The second example deals with an unknown compound a part of which is already identified from chemical or instrumental analysis made by chemist. The remaining part is generated automatically and connected to the known part to form the whole structure. Let's assume an unknown compound which has the molecular formula C17H1706C1 and comprises a known substructure, denoted as A, as the host(Fig. 3). If the composition of A, i.e. C9H704C1, and the number of
(UNKNOWN
:OMPOUND) I
IR cm-1 Pcr.1 3680 3400
H.F.
-
H
3080
2910 1590 1570 1460 1420 1320 1020 920 820 620
13C-Nf4R IHT. PPW I N T . WLT. 8.5 46 22.6 28.0 3 8.4 38 35.3 32.0 3 7.7 18 40.2 32.0 4 7.6 25 56.9 40.0 3 7.2 25 68.3 36.0 2 7.1 1 7 123.4 32.0 2 3.3 19 134.8 33.0 2 3.2 16 138.8 9.0 1 3 . 1 20 140.4 25.0 2 2.5 13 149.4 22.0 2 2.4 28 2.3 41 2.1 114 1.9 21 1.8
/
Chemist's observation?
IH-NHR
PPH
+
CgH704Cl
53
pq?, coMPoNENTs
C8H1002 Requirement : Two f r e e s i n g l e bonds, c o n s i s t e n t w i t h NMR data
*
DATA ANALYSIS
I
STRUCTURE GENERATOR
II
1
Fig.3. The conditions and results for elucidating the counterpart structurp.
I
its single bonds, i.e. two, are entered in the computer, calculation will be performed on the assumption that the remaining part (counterpart) has the composition of C8HI0O2 and two single bonds. Analysis of this I I compound by CHEMICS in terms of its molecular formula, H-1 NMR and C-13 NMR allows 65 Fig. 2. The analytical processes using 3components to remain as the components consubstituted pyridyl group provided sistent with these data. However, of the 65, by the H-1 and C-13 NMR chemical 16 components are not consistent with input shift interdependent analysis. data in consideration of forming the counterpart, with the remaining 4 9 components being left for construction of the counterpart that is expressed as C8HI0O2. Thus, the operation, which is performed with reference to NMR data, generates four substructures (B, C, D, E) as candidates for the counterpart, as shown in Fig. 3. Finally, the four are connected with the host ( A ) to generate six structures, F, G, H, J, K and L, of which G correctly represents the structure of griseofulvin used as the unknown compound, as listed in Fig. 4. A f t e r recombination process
u
Me0
OMe
Me0
OMe
0
Me0
Me0 c1
Me0 c1
(J)
(K)
(L)
Fig.4. The candidates provided by connection of counterpart structures with host (A).
K. FUNATSU, Y. SUSUTA AND S. SASAKI
612
The third example shows a procedure for applying 2-D NMR data.2) CHEMICS analysis of H-1 and C-13 NMR spectra of a certain compound (C13H200) gives 39 components, from which 1417 candidate structures are generated. 2-D NMR data can serve to reduce this number drastically. Three types of carbon-carbon signal connectivity information (on C-13 NMR) to be entered in the computer are obtained from the 2-D NMR of this specimen as shown below. Type 1: 1-10, 4-11, 7-9 (long range C-H cosy : path 2) Type 2: 2-5, 6-12, 8-12 (combination of C-H and H-H cosy) Type 3: 3-4, 3-7, 4-8, 6-7 (long range C-H cosy : path 3)
Table 3 . The results of 2-D NMR analyses using H-H cosy and C-H cosy information. I n f o r m a t i o n used i n CHEMICS Ordinary Analyses
4
The number o f candidates
I4l7
203
3 4 3 6
- 12 -
12
-
4 8 7 7
STRUCTURE NO. = 1
c \ /
c-c I
\
'
2 - 5 6 8
The number o f candidates
c
1
10 - 11 7 - 9
Table 4 . The results of 2-D NMR analyses using 2D-INADEQUATE information.
I
c-c=c-c \
36
\
by Check 1
-0
=
920
by Check 1 Check 2
C
c-c \
1
C by Check Check Check Check
SAMPLE(1) : 8-IONONE
Receiving these data the computer carries out the following operations. The data of Type 1 are used mainly to decide on the possibility of existence of each component with two carbons contained in the group of 39 components. As a result, from the remaining 36 components 203 structures are generated without using Type 2 and Type 3 data. Next, if the Type 2 and Type 3 data are used together with Type 1 data, the number of generated structures is decreased to 36 by examination through the Type 2 data and then to only one, the target compound beta-ionone(I), through Type 3 data. the above results are summarized in Table 3.
1 2
3 4
c
c--c I
\ /
c--c c
c-c \
c-c
c/
/
:
C
\ / / c-c-c-c--c-c
I
c-c
\c-c/ SAMPLE ( II ) L-Menthol
C
c-+-c-
c
I/
/
I/ c
C
0
/
c
C
-C
\ C
/ \ /
c
C
C
SAMPLE( III ) Linalool
SAMPLE( IV) an isomer o f Thujopsene reported r e c e n t l y
In Table 4 is summarized the number of candidate structures given by CHEMICS for menthol (II), linallol(II1) and an isomer of thujopsene(IV), reported recently, with and without the aid of ZD-INADEQUATE information. In these samples, the number of candidates diminishes into one and only correct structure through four check stages.
REFERENCES 1) K. Funatsu, N. Miyabayashi, and S. Sasaki, J.Chem.Inf.Comput.Sci., 3, 19(1988). 2) K. Funatsu, Y. Susuta, and S. Sasaki, J.Chem.Inf.Comput.Sci., submitted.