Advances in Molecular Similarity, Volume 2


Author: R. Carbo-Dorca | P.G. Mezey



ADVANCES IN MOLECULAR SIMILARITY

Editors:

RAMON CARBO-DORCA
Institute of Computational Chemistry, University of Girona, Girona, Spain

PAUL G. MEZEY
Departments of Chemistry and Mathematics and Statistics, University of Saskatchewan, Saskatoon, Canada

VOLUME 2 • 1998

JAI PRESS INC.
Stamford, Connecticut
London, England

Copyright © 1998 by JAI PRESS INC., 100 Prospect Street, Stamford, Connecticut 06904-0811

JAI PRESS LTD., 38 Tavistock Street, Covent Garden, London WC2E 7PB, England

All rights reserved. No part of this publication may be reproduced, stored on a retrieval system, or transmitted in any form, or by any means, electronic, mechanical, photocopying, filming, recording, or otherwise without prior permission in writing from the publisher.

ISBN: 0-7623-0258-5

Manufactured in the United States of America

CONTENTS

LIST OF CONTRIBUTORS

PREFACE

QUANTUM SIMILARITY
Ramon Carbo-Dorca, Lluis Amat, Emili Besalu, and Miquel Lobato

FUZZY SETS AND BOOLEAN TAGGED SETS; VECTOR SEMISPACES AND CONVEX SETS; QUANTUM SIMILARITY MEASURES AND ASA DENSITY FUNCTIONS; DIAGONAL VECTOR SPACES AND QUANTUM CHEMISTRY
Ramon Carbo-Dorca

PATTERN RECOGNITION TECHNIQUES IN MOLECULAR SIMILARITY
W. Graham Richards and Daniel D. Robinson

TOPOLOGY AND THE QUANTUM CHEMICAL SHAPE CONCEPT
Paul G. Mezey

STRUCTURAL SIMILARITY ANALYSIS BASED ON TOPOLOGICAL FRAGMENT SPECTRA
Yoshimasa Takahashi, Hiroaki Ohoka, and Yuichi Ishiyama

ANALYSIS OF THE TRANSFERABILITY OF SIMILARITY CALCULATIONS FROM SUBSTRUCTURES TO COMPLEX COMPOUNDS
Guido Sello and Manuela Termini

SIMILARITY IN ORGANIC SYNTHESIS DESIGN: COMPARING THE SYNTHESES OF DIFFERENT COMPOUNDS
Guido Sello

BROWSABLE STRUCTURE-ACTIVITY DATASETS
Mark Johnson

CHARACTERIZATION OF THE MOLECULAR SIMILARITY OF CHEMICALS USING TOPOLOGICAL INVARIANTS
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald

OPTIMIZING HYBRID DENSITY FUNCTIONALS BY MEANS OF QUANTUM MOLECULAR SIMILARITY TECHNIQUES
Miquel Sola, Marta Fores, and Miquel Duran

ATOMIC SIMILARITY THROUGH A NEURAL NETWORK: SELF-ASSOCIATIVE PERIODIC TABLE OF ELEMENTS
Jose Fayos

COMPARISON OF QUANTUM SIMILARITY MEASURES DERIVED FROM ONE-ELECTRON, INTRACULE, AND EXTRACULE DENSITIES
Xavier Fradera, Miquel Duran, and Jordi Mestres

THE COMPLEMENTARITY PRINCIPLE AND ITS USES IN MOLECULAR SIMILARITY AND RELATED ASPECTS
Jerry Ray Dias

CORRELATIONS AND APPLICATIONS OF THE CIRCUMSCRIBING/EXCISED INTERNAL STRUCTURE CONCEPT
Jerry Ray Dias

LEAST-SQUARES AND NEURAL-NETWORK FORECASTING FROM CRITICAL DATA: DIATOMIC MOLECULAR r_e AND TRIATOMIC ΔH_a AND IP
Jason Wohlers, W. Blake Laing, Ray Hefferlin, and W. Bradford Davis

INDEX


LIST OF CONTRIBUTORS

Lluis Amat
Institute of Computational Chemistry, University of Girona, Girona, Spain

Subhash C. Basak
Natural Resources Research Institute, University of Minnesota, Duluth, Minnesota

Emili Besalu

Ramon Carbo-Dorca
Institute of Computational Chemistry, University of Girona, Girona, Spain

W. Bradford Davis
Department of Chemistry, Southern Adventist University, Collegedale, Tennessee

Jerry Ray Dias
Department of Chemistry, University of Missouri, Kansas City, Missouri

Miquel Duran

Jose Fayos
Departamento de Cristalografia, Instituto Rocasolano, CSIC, Madrid, Spain

Marta Fores

Xavier Fradera

Gregory D. Grunwald

Brian D. Gute

Ray Hefferlin
Department of Chemistry, Southern Adventist University, Collegedale, Tennessee

Yuichi Ishiyama
Department of Knowledge-Based Information Engineering, Toyohashi University of Technology, Toyohashi, Japan

Mark Johnson
Pharmacia & Upjohn, Kalamazoo, Michigan

W. Blake Laing

Miquel Lobato
Institute of Computational Chemistry, University of Girona, Girona, Spain

Jordi Mestres

Paul G. Mezey
Departments of Chemistry and Mathematics and Statistics, University of Saskatchewan, Saskatoon, Canada

Hiroaki Ohoka

W. Graham Richards
New Chemistry Laboratory, Oxford University, Oxford, England

Daniel D. Robinson
New Chemistry Laboratory, Oxford University, Oxford, England

Guido Sello
Dipartimento di Chimica Organica e Industriale, Università degli Studi di Milano, Milano, Italy

Miquel Sola

Yoshimasa Takahashi

Manuela Termini
Dipartimento di Chimica Organica e Industriale, Università degli Studi di Milano, Milano, Italy

Jason Wohlers

PREFACE

This new volume of the book series Advances in Molecular Similarity is devoted to a selection of topics and problems presented at the Third Girona Symposium on Molecular Similarity, University of Girona, Girona, Spain, May 30-31, 1997, held in conjunction with the Seventh International Conference on Mathematical Chemistry, University of Girona, Girona, Spain, May 26-29, 1997, both organized by Professor Ramon Carbo-Dorca, Director, Institute of Computational Chemistry, University of Girona, Girona, Spain. These two international scientific meetings were sponsored by several sources. Special thanks are due for the financial support provided by the following institutions and agencies:

• Institute of Computational Chemistry, University of Girona, Girona, Spain
• University of Girona, Girona, Spain
• Ministerio de Educacion y Cultura
• Fundacio Catalana per a la Recerca
• Generalitat de Catalunya
• Ajuntament de Girona
• Diputacio de Girona

The coverage of the two conferences provided a detailed cross section of the current advances in the rapidly expanding field of molecular similarity research, with a strong emphasis on both the fundamentals (quantum similarity measures, molecular shape analysis, molecular topology, and structural invariants) and the applications in such important experimental and industrial fields as pharmaceutical drug design, toxicological risk assessment, and molecular engineering for nanotechnology.

This volume offers some of the highlights of the advances presented at the two conferences. In their presentations, our authors have made a remarkable effort to emphasize the underlying connections between the fundamentals and applications. Molecular similarity research is a dynamic field where the rapid transfer of ideas and methodologies from the theoretical, quantum chemical, and mathematical chemistry disciplines to efficient algorithms and computer programs used in industrially important applications is especially evident. These applications often serve as motivating factors toward new advances in the fundamental and theoretical fields, and the combination of intellectual challenge and practical utility provides mutual advantages to theoreticians and experimentalists.

It is the aim of the Editors to present our readers with an overview of the current methodologies of molecular similarity studies, and to point out new challenges, unsolved problems, and areas where important new advances can be expected. We are convinced that this volume will serve our readers well, and that it will represent a valuable, special source of information in their studies of chemistry, where molecular similarity continues to play a central role.

Ramon Carbo-Dorca
Paul G. Mezey
Series Editors

QUANTUM SIMILARITY

Ramon Carbo-Dorca, Lluis Amat, Emili Besalu, and Miquel Lobato

I. Introduction
II. Quantum Similarity Measures
III. The Nature of Approximate First-Order Density Functions: Atomic Shell Approximations
    A. Density Functions
    B. ASA Coefficient Constraints
    C. Quadratic Error Function
    D. ASA Coefficient Optimization Using Elementary Jacobi Rotations
    E. Alternative Approximate Expression of Density Functions: Complete ASA
    F. Approximate Expectation Values
IV. Molecular Representations
    A. MQSM Surfaces, Molecular Superposition, and Density Transformations
    B. Density Maps and Overlap-Like Measures
    C. Discrete Matrix Representations
V. Manipulation of Similarity Measures: Similarity Indices
    A. C-Class Similarity Indices
    B. D-Class Dissimilarity Indices
    C. Generalized QSI
    D. Transformations between QSI
    E. A Discussion on Discrete Representation Indices
VI. The Origin of QSAR and Related Problems
    A. The Success of QSAR
    B. Convex Sets and QSPR
    C. MQSM and Molecular Topology
    D. MQSM Topological Indices
VII. Similarity Over Energy Surfaces
    A. Boltzmann Distributions and Boltzmann Similarity Measures
    B. General Distributions and Similarity Measures
VIII. Conclusions
Acknowledgments
References

Advances in Molecular Similarity, Volume 2, pages 1-42. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5

I. INTRODUCTION

As is well known, the first rough description of quantum similarity was made in a naive paper by Carbo et al. in 1980. That work discussed the initial basic concepts related to molecular similarity measures, seen from a quantum-mechanical point of view. Since then, a large amount of research has been performed on the specific subject of quantum similarity. Several laboratories and individuals have developed the seminal ideas, and many papers, book chapters, and monographs are now available in the literature. Even a specialized series has been devoted to the study of the broader concept of molecular similarity.

The theoretical setting of the quantum similarity framework has been studied in several of its aspects by various authors. A general discussion of the theory surrounding quantum similarity has also been developed in our laboratory. Among these last contributions, the description of the so-called Mendeleyev postulates must be noted: a set of points of view intended to govern the ideas that form the background of quantum similarity.

Different concepts have emerged from the research experience of the present decade in the field of quantum similarity. Among the most relevant, in our opinion, are the following:

1. Quantum similarity measures (QSM) are a natural vehicle for obtaining a discrete representation of density matrix elements, with particular emphasis on the representation of first-order density functions.
2. Results on quantum similarity constitute a new way to connect practically all fields of chemistry with quantum theory. This new relationship has to be considered as based mostly on purely geometrical grounds, as a consequence of the application of quantum-mechanical postulates.
3. Computation of QSM can be achieved in a fast, approximate, yet highly accurate framework, based on fitting the density function to a spherically symmetric atomic basis set by means of the so-called atomic shell approximation.
4. Molecular QSM (MQSM) surfaces can be easily computed, generalizing density surface analysis and producing alternative ways to visualize molecular shapes.
5. Molecular superposition may be accomplished very efficiently, using the fact that any MQSM, being a positive definite function of the relative positions of the molecules, can be easily maximized.
6. Use can be made of the discrete, n-dimensional description of the density functions, achieved by means of QSM computed over a known quantum object set. Zermelo's theorem [14] can be invoked to consider a possible order, which can be induced within a given quantum object set, opening a natural way to construct particular periodic tables over the set.
7. Quantum similarity indices are to be considered as a set of parameters strictly dependent on QSM, which can be arbitrarily described in numerous ways.
8. The success and scientific foundations of the well-known QSAR or QSPR procedures may be easily deduced from considerations attached to the molecular quantum similarity framework.
9. Molecular topological parameters, with the same structure as the classically defined ones but carrying a substantial amount of three-dimensional information, may also be deduced using simple molecular quantum similarity ideas.
10. Extensions of quantum similarity to chemically interesting functions, other than those belonging to the density matrix element family, can be envisaged without further effort.

As discussions of the main subjects listed above are dispersed throughout the literature, our aim here is to furnish a coherent and comprehensive presentation of all these quantum similarity related topics.
To achieve this goal, we will first make a simple presentation concerning the nature of similarity measures. An analysis will follow, dealing with first-order density functions, which will open the way to introduce the reader to atomic shell approximations and to the procedures to compute them. Next, the framework of the discrete molecular representation connected with quantum similarity and a short presentation of the features of quantum similarity indices will be developed. A triad of discussions, closing this contribution, will follow, consisting of the following points:

1. The origin of QSAR and related problems, studied from the point of view of the inherent discretization associated with QSM.
2. Connection of QSM with molecular topology. Manipulation of quantum similarity matrix elements to compute topological indices.
3. A discussion on the possible ways of computing similarity measures using energy surfaces or other quantum functions.

II. QUANTUM SIMILARITY MEASURES

Although QSM can be defined over an arbitrary number of density matrix elements of arbitrary order, from the practical, computational point of view they are based essentially on first-order density functions. This rule will be followed throughout this paper. A QSM involving two quantum objects (QO) $\{A, B\}$, described by the density functions $\{\rho_A, \rho_B\}$, may be defined by means of the integral

$$ z_{AB}(\Omega) = \iint \rho_A(\mathbf{r}_1)\,\Omega(\mathbf{r}_1,\mathbf{r}_2)\,\rho_B(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{1} $$

where the operator symbol $\Omega(\mathbf{r}_1,\mathbf{r}_2)$ stands for any positive definite linear operator that can be employed within Eq. 1. The most usual choice of operator has been a Dirac delta function $\delta(\mathbf{r}_1-\mathbf{r}_2)$, which defines a QSM related to the spin-spin contact correction, a term forming part of the Breit Hamiltonian. Taking into account the properties of the Dirac delta function when used as an operator, Eq. 1 can be written in a simpler integral form:

$$ z_{AB} = \int \rho_A(\mathbf{r})\,\rho_B(\mathbf{r})\,d\mathbf{r} \tag{2} $$

which constitutes the oldest definition of a QSM. Owing to its constituent parts and computational structure, Eq. 2 has customarily been called an overlap-like QSM. In both of the previous QSM definitions, the integrals involving the same density function twice, $z_{AA}$, are referred to as quantum self-similarity measures (QS-SM).

Other operators may be used in the measure integral of Eq. 1, as Coulomb or gravitational operators have been, but a particular operator choice turns out to be a matter of integral complexity, computational advantages, system description, and problem environment. Among the feasible operator selections there can also be a density matrix element itself, corresponding to another system C, such as $\rho_C(\mathbf{r}_1,\mathbf{r}_2)$. Then, a very interesting QSM can be constructed:

$$ z_{AB;C} = \iint \rho_A(\mathbf{r}_1)\,\rho_C(\mathbf{r}_1,\mathbf{r}_2)\,\rho_B(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{3} $$

constituting one of the several possible forms associated with the so-called triple-density QSM. This integral form, as presented in Eq. 3, opens the way to outline the formal structure of multiple-density QSM. These measures may be constructed, for instance, as the integral of the product of the density function set $D = \{\rho_i(\mathbf{r})\}$, attached in turn to the elements of some chosen quantum object set (QOS); that is, choosing the simplest formal notation within the integrand functions:

$$ z_D = \int \prod_i \rho_i(\mathbf{r})\,d\mathbf{r} \tag{4} $$

One can see, in this manner, how it is possible to define QSM in the form best suited to study any problem related to the computational manipulation of QO. A large variety of positive definite operators can be used to allow a tailor-made description of QO by means of QSM.

An obvious application appears once a QOS $S = \{S_j\}$ is chosen and the QSM between one set element and all the others are computed. For example, using definition 1, every element of S can be associated with all of the others belonging to the same QOS, including itself. This produces a column vector whose dimension is the cardinality of the set. The elements of this vector are obtained by computing each QSM between the chosen QO density function and the rest.
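As a minimal sketch of Eq. 2 and of the column-vector representation just described, the Python fragment below models each quantum object as a sum of normalized spherical Gaussians, for which the overlap integral has a closed form. The Gaussian model, the function names, and the three sample objects are illustrative assumptions, not data from this chapter.

```python
import math

def gaussian_overlap(alpha, center_a, beta, center_b):
    """Overlap integral of two normalized spherical 3D Gaussians.

    Each function is (k/pi)**1.5 * exp(-k * |r - center|**2), so it
    integrates to one, like a one-particle density.
    """
    dist2 = sum((p - q) ** 2 for p, q in zip(center_a, center_b))
    total = alpha + beta
    prefactor = (alpha * beta / (math.pi * total)) ** 1.5
    return prefactor * math.exp(-alpha * beta / total * dist2)

def overlap_qsm(obj_a, obj_b):
    """Eq. 2 for two densities given as lists of (weight, exponent, center)."""
    return sum(wa * wb * gaussian_overlap(ea, ca, eb, cb)
               for wa, ea, ca in obj_a
               for wb, eb, cb in obj_b)

def similarity_matrix(objects):
    """All pairwise QSM over an object set; column I represents object I."""
    return [[overlap_qsm(a, b) for b in objects] for a in objects]

# Three hypothetical quantum objects (made-up weights, exponents, centers).
objects = [
    [(0.5, 1.0, (0.0, 0.0, 0.0)), (0.5, 0.8, (1.2, 0.0, 0.0))],
    [(0.6, 1.1, (0.0, 0.0, 0.0)), (0.4, 0.7, (1.3, 0.0, 0.0))],
    [(1.0, 2.0, (0.0, 0.5, 0.0))],
]
Z = similarity_matrix(objects)
for row in Z:
    print(["%.4f" % value for value in row])
```

The diagonal of the printed matrix holds the self-similarity measures, and each column is the discrete vector attached to one object of the set.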

III. THE NATURE OF APPROXIMATE FIRST-ORDER DENSITY FUNCTIONS: ATOMIC SHELL APPROXIMATIONS

Over a long period, our laboratory has been interested in the elementary Jacobi rotations (EJR) technique. EJR constitute a body of straightforward procedures for obtaining norm-conserving variations of n-dimensional vectors. In this field, a large theoretical and computational contribution dealing with the direct optimization of quantum electronic energy has been developed over time. Some work has also been performed on the many aspects of the Jacobi diagonalization algorithm, proposing a new parallelizable procedure that constitutes a practical computational scheme, able to deal with large matrices and to produce a chosen subset of eigenvalues and eigenvectors.

On the other hand, as part of a general search for optimal QSM algorithms, the subject of electronic density function fitting was considered in a preliminary paper, using a superposition of atomic spherical shells. Afterwards, a conceptual and practical refinement of that formalism, the so-called atomic shell approximation (ASA), was also studied extensively.

Following the path of the experience accumulated in both the EJR and ASA directions, this section deals in a broad manner with the proposal of using EJR transformations to solve the problem of the constrained fitting of electronic density with ASA-type functions. A brief description of the main ideas to be employed seems necessary.

A. Density Functions

From a theoretical point of view, accurate MQSM may be obtained using ab initio molecular electronic density expressions constructed within the common LCAO approach, corresponding to the expression

$$ \rho(\mathbf{r}) = \sum_{\mu,\nu} D_{\mu\nu}\,\chi_\mu^*(\mathbf{r})\,\chi_\nu(\mathbf{r}) \tag{5} $$

where $\{D_{\mu\nu}\}$ are the charge-bond order matrix elements, and $\{\chi_\mu\}$ represent the atomic orbital (AO) basis set functions. Using this approximation, four-center integrals over the AO basis set have to be computed in MQSM calculations, like the ones attached to Eq. 2. For instance, taking expression 5 and the equivalent one corresponding to the LCAO expression of the second system's density function, the following quadratic form in terms of the charge and bond order matrices is found:

$$ z_{AB} = \sum_{\mu,\nu\in A}\;\sum_{\lambda,\sigma\in B} D^{A}_{\mu\nu}\,D^{B}_{\lambda\sigma}\,z^{AB}_{\mu\nu\lambda\sigma} \tag{6} $$

where the superscripts label the two systems and the four-index hypermatrix elements $\{z^{AB}_{\mu\nu\lambda\sigma}\}$ are the four-center overlap-like integrals over the corresponding AOs of both systems:

$$ z^{AB}_{\mu\nu\lambda\sigma} = \int \chi^{A*}_\mu(\mathbf{r})\,\chi^{A}_\nu(\mathbf{r})\,\chi^{B*}_\lambda(\mathbf{r})\,\chi^{B}_\sigma(\mathbf{r})\,d\mathbf{r} \tag{7} $$
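The quadratic form of Eqs. 6 and 7 can be checked numerically on a toy problem. The sketch below is only an illustration under strong assumptions: it works in one dimension, uses real unnormalized Gaussians as stand-ins for atomic orbitals, and takes made-up density matrices; the four-function integral is analytic, and a trapezoidal quadrature of Eq. 2 serves as an independent reference. All names are hypothetical.

```python
import math

def product_integral(funcs):
    """Integral over x of a product of 1D Gaussians exp(-a * (x - c)**2)."""
    p = sum(a for a, _ in funcs)
    mean = sum(a * c for a, c in funcs) / p
    spread = sum(a * c * c for a, c in funcs) - p * mean * mean
    return math.sqrt(math.pi / p) * math.exp(-spread)

def density(D, basis, x):
    """LCAO density of Eq. 5 at point x for real basis functions."""
    values = [math.exp(-a * (x - c) ** 2) for a, c in basis]
    n = len(basis)
    return sum(D[m][v] * values[m] * values[v]
               for m in range(n) for v in range(n))

def lcao_qsm(D_a, basis_a, D_b, basis_b):
    """Overlap-like QSM as the quadratic form of Eq. 6."""
    total = 0.0
    for mu, chi_mu in enumerate(basis_a):
        for nu, chi_nu in enumerate(basis_a):
            for lam, chi_lam in enumerate(basis_b):
                for sig, chi_sig in enumerate(basis_b):
                    # Four-function overlap-like integral, as in Eq. 7.
                    z = product_integral([chi_mu, chi_nu, chi_lam, chi_sig])
                    total += D_a[mu][nu] * D_b[lam][sig] * z
    return total

def grid_qsm(D_a, basis_a, D_b, basis_b, lo=-12.0, hi=12.0, steps=6000):
    """Reference value: trapezoidal quadrature of rho_A * rho_B (Eq. 2)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * density(D_a, basis_a, x) * density(D_b, basis_b, x)
    return total * h

# Made-up (exponent, center) pairs and symmetric density matrices.
basis_a = [(1.0, 0.0), (0.5, 0.7)]
D_a = [[1.0, 0.2], [0.2, 0.6]]
basis_b = [(0.8, 0.3), (1.5, -0.4)]
D_b = [[0.9, 0.1], [0.1, 0.7]]
print(lcao_qsm(D_a, basis_a, D_b, basis_b), grid_qsm(D_a, basis_a, D_b, basis_b))
```

The four nested loops make the cost of the exact measure explicit: it grows with the fourth power of the basis size, which is the motivation for the fitted densities discussed next.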

Besides these computational difficulties, MQSM evaluation has the additional problem of measure maximization. The function defined by the integral $z_{AB}$ depends on the relative position of the molecular systems A and B involved, and the best similarity matching between them can only be defined when the maximal value of the integral is reached. To speed up QSM integral maximization, computational algorithms have been designed to calculate high-accuracy approximate density functions.

QSM is an ideal mathematical construction, logically placed inside the theoretical and conceptual structure of quantum mechanics, which may be used whenever it is necessary to compare two or more density functions. In fact, this may be seen from a more general point of view, associating the density functions appearing in Eq. 1 with square summable positive definite functions. From this mathematical perspective, a similarity integral such as $z_{AB}$ can thus be interpreted as a positive-valued weighted scalar product.

On the other hand, the use of ASA-like density functions can be traced back to the initial papers on QSM, where a CNDO-like approach was invoked to deal with the computational evaluation of the quantum similarity integrals, $z_{AB}$, for molecules. From this initial viewpoint, the concept of approximating a given density function has evolved to consider ASA as a superposition of spherical nS-type functions, either STO or GTO. Recently, another similar approach, but circumscribed to the expression of core-electron density, has been published, although apparently no constraint conditions were used in that case, unlike the complete ones stressed in this paper and established in the following section.

The density function in ASA form may be written as a superposition of an atomic function set $\{\sigma_a\}$,

$$ \rho_A^{\mathrm{ASA}}(\mathbf{r}) = \sum_{a\in A} \sigma_a(\mathbf{r}) \tag{8} $$

where the sum runs over all of the atoms $\{a\}$ present in a given molecule A. In turn, the atomic function set $\{\sigma_a\}$ may be constructed from another function set, chosen so as to describe atomic shells $\{s_i\}$, using another sum:

$$ \sigma_a(\mathbf{r}) = \sum_{i\in a} s_i(\mathbf{r}) \tag{9} $$

where the sum in Eq. 9 is performed over all of the atomic shells of atom a, belonging to molecule A. Finally, the spherical function set $\{s_i\}$, describing some sort of atomic shells, can be defined very simply as follows:

$$ s_i(\mathbf{r}) = \sum_{k\in i} c_k\,\varphi_k(\mathbf{r}) \tag{10} $$

where now the sum is carried out over a chosen positive definite function set $\{\varphi_k\}$ belonging to the i-th atomic shell. The coefficients $\{c_k\}$ are required to be positive in all cases, so that the probability density distribution structure of the atomic shells $\{s_i\}$ remains positive definite, and this characteristic is thus transferred to the approximate density function $\rho_A^{\mathrm{ASA}}$ as well. The above ASA partition is equivalent to writing Eq. 8, in a more compact notation, as a linear combination of a positive definite function set $\{\theta_i\}$:

$$ \rho_A^{\mathrm{ASA}}(\mathbf{r}) = \sum_{i\in A} w_i\,\theta_i(\mathbf{r}) \tag{11} $$

where the sum is performed over the whole basis function set $\{\theta_i\}$, and the set of positive coefficients $\{w_i\}$ must be determined. One must insist on the necessary positive definiteness of the usual density distributions, which in the ASA approach translates into approximate density functions that have the form and properties of n-dimensional simplexes.

B. ASA Coefficient Constraints

The most interesting case of function fitting in the realm of MQSM is the ASA approximation of first-order density functions, but other higher-order density function forms may be studied as well. At any level, both the exact and the ASA density functions may be taken as normalized to one particle, by dividing the function by the appropriate particle number combinatorial coefficient. Also, considering the basis functions involved in Eq. 11 to be normalized in the usual sense:

$$ \int \theta_i(\mathbf{r})\,d\mathbf{r} = 1, \quad \forall i \in A \tag{12} $$

then the set of ASA coefficients $\{w_i\}$, besides the imposed positive definite condition:

$$ w_i > 0, \quad \forall i \in A \tag{13} $$

must necessarily fulfill the additional constraint:

$$ \sum_{i\in A} w_i = 1 \tag{14} $$

Although the second condition may be easily taken into account by employing a Lagrange multiplier technique, the first one, as expressed in Eq. 13, cannot be so easily introduced into the computation process. It will be shown that both conditions can be kept throughout the optimization procedure by using adequate algorithmic tools.

C. Quadratic Error Function

A significant MQSM computational simplification, which preserves the accuracy of the measure integral, is achieved in the so-called promolecular approximation, where the total molecular electronic density function is written as a sum of individual atomic electronic densities:

$$ \rho_A^{\mathrm{ASA}}(\mathbf{r}) = \sum_{a\in A} \rho_a^{\mathrm{ASA}}(\mathbf{r}) \tag{15} $$

Every atomic density function $\rho_a^{\mathrm{ASA}}$ is built up with the same formalism as in Eq. 11, replacing $\theta_i$ by squared nS-type functions:

$$ \rho_a^{\mathrm{ASA}}(\mathbf{r}) = \sum_{i\in a} w_i\,S_i(\mathbf{r})^2 \tag{16} $$

Using this approximation, the overlap-like QSM between two atoms can be expressed as

$$ z_{ab} = \sum_{i\in a}\sum_{j\in b} w_i\,w_j\,z_{ij} \tag{17} $$

where the elements $\{z_{ij}\}$, which can be collected into a positive definite matrix $\mathbf{Z}$, are defined by the integral over the nS-type functions:

$$ z_{ij} = \int S_i(\mathbf{r})^2\,S_j(\mathbf{r})^2\,d\mathbf{r} \tag{18} $$
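A small sketch of Eqs. 17 and 18 follows, assuming that every squared shell function is a normalized 1S Gaussian centered on its atom; this is one possible choice of nS-type function, picked here because the two-center integral is then analytic. The exponents, weights, and function names are hypothetical.

```python
import math

def shell_overlap(zeta_i, zeta_j, distance):
    """Eq. 18 for squared, normalized 1S Gaussian shells on two centers.

    S(r)**2 is taken as (2*zeta/pi)**1.5 * exp(-2*zeta*|r - R|**2),
    which integrates to one as required by Eq. 12.
    """
    a, b = 2.0 * zeta_i, 2.0 * zeta_j
    prefactor = (a * b / (math.pi * (a + b))) ** 1.5
    return prefactor * math.exp(-a * b / (a + b) * distance ** 2)

def z_matrix(zetas_a, zetas_b, distance):
    """Matrix of the z_ij integrals between the shells of atoms a and b."""
    return [[shell_overlap(zi, zj, distance) for zj in zetas_b]
            for zi in zetas_a]

def atom_pair_qsm(w_a, zetas_a, w_b, zetas_b, distance):
    """Eq. 17: z_ab = sum_i sum_j w_i w_j z_ij."""
    Z = z_matrix(zetas_a, zetas_b, distance)
    return sum(w_a[i] * w_b[j] * Z[i][j]
               for i in range(len(w_a)) for j in range(len(w_b)))

# Hypothetical two-shell fits; each weight set sums to one (Eq. 14).
w_a, zetas_a = [0.7, 0.3], [3.0, 0.5]
w_b, zetas_b = [0.6, 0.4], [4.0, 0.8]
for distance in (0.0, 0.5, 1.0, 2.0):
    print(distance, atom_pair_qsm(w_a, zetas_a, w_b, zetas_b, distance))
```

With positive weights every term decays with the interatomic distance, so in this model the pair measure is largest when the two centers coincide.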

For a given atom a, the ASA coefficients $\{w_i\}$ can be calculated by minimizing the quadratic error integral between the atomic ab initio and ASA electronic density functions. A particular form of the quadratic error function is easily written as

$$ \varepsilon^{(2)} = \int \left|\rho_a(\mathbf{r}) - \rho_a^{\mathrm{ASA}}(\mathbf{r})\right|^2 d\mathbf{r} = z_{aa} + \sum_{i,j\in a} w_i\,w_j\,z_{ij} - 2\sum_{i\in a} w_i \sum_{\mu,\nu\in a} D_{\mu\nu} \int S_i(\mathbf{r})^2\,\chi_\mu^*(\mathbf{r})\,\chi_\nu(\mathbf{r})\,d\mathbf{r} \tag{19} $$

where $z_{aa}$ corresponds to the ab initio QS-SM of atom a,

$$ z_{aa} = \int \rho_a(\mathbf{r})^2\,d\mathbf{r} \tag{20} $$

which is computed in turn within the LCAO approximation, replacing the electronic density $\rho_a(\mathbf{r})$ by the equivalent form in Eq. 5. Equation 19 may be rewritten in matrix form as

$$ \varepsilon^{(2)} = z_{aa} + \mathbf{w}^T \mathbf{Z}\,\mathbf{w} - 2\,\mathbf{A}^T \mathbf{w} \tag{21} $$

where the elements of the vector $\mathbf{A} = \{A_i\}$ are given by the integral

$$ A_i = \sum_{\mu,\nu} D_{\mu\nu} \int S_i(\mathbf{r})^2\,\chi_\mu^*(\mathbf{r})\,\chi_\nu(\mathbf{r})\,d\mathbf{r} \tag{22} $$

and $\mathbf{w}$ is the column vector containing the ASA coefficients, normalized as in Eq. 14. With the corresponding modifications within every integral involved, the quadratic error integral 19 may be rewritten using a positive definite weight operator. In fact, from the point of view of QSM, the quadratic error integral can be considered as a self-similarity integral involving the difference between the exact and ASA density functions. Thus, form 19 is nothing but an overlap-like definition, like the one appearing in Eq. 2. When a positive definite operator is chosen, a measure of the general form provided by integral 1 serves as an alternative quadratic error.

D. ASA Coefficient Optimization Using Elementary Jacobi Rotations

The set of positive coefficients collected as a vector $\mathbf{w} = \{w_i\}$ can be defined as the squared moduli of the components of an auxiliary vector $\mathbf{x} = \{x_i\}$, which will be called the generating vector:

$$ w_i = |x_i|^2, \quad \forall i \tag{23} $$

and in this way condition 13 is fulfilled. Moreover, because the matrix $\mathbf{Z}$ is positive definite, a unitary matrix $\mathbf{U}$ can be found such that

$$ \mathbf{U}^{-1}\,\mathbf{Z}\,\mathbf{U} = \mathbf{D} \tag{24} $$

where $\mathbf{D}$ is a diagonal matrix with positive real elements. The first step in the procedure consists in diagonalizing the matrix $\mathbf{Z}$, so as to obtain its eigenvalues and eigenvectors. Then, the initial coefficients $\{x_i\}$ can be made equal to the most suitable normalized eigenvector of the matrix $\mathbf{Z}$, and consequently the constraint specified in Eq. 14 is automatically fulfilled, because a unit-norm generating vector gives $\sum_i w_i = \sum_i x_i^2 = 1$. Starting from this generating vector, and applying orthogonal EJR, the constraints will hold throughout the optimization process. ASA coefficients are obtained by an optimization procedure that minimizes integral 19. Substituting every ASA coefficient $w_i$ by $|x_i|^2$, and considering only the case of real generating vector coefficients, the quadratic error integral function can be rewritten as

$$ \varepsilon^{(2)} = z_{aa} + \sum_{i,j\in a} x_i^2\,x_j^2\,z_{ij} - 2\sum_{i\in a} x_i^2\,A_i \tag{25} $$

EJR are simple tools for obtaining unitary or orthogonal transformations applicable to vectors or matrices. The origin of such transformation matrices can be found in the 1846 paper of Jacobi. Being orthogonal, EJR may also be viewed as rotation matrices over real (or complex)-valued n-dimensional spaces. Applied to a given n-dimensional vector, an EJR, which will be written here as $J_{pq}(\alpha)$, transforms the vector components p and q only, keeping the rest of them invariant. The EJR transformation on the chosen components of the generating vector is defined by the equations

$$ x_p \leftarrow c\,x_p - s\,x_q, \qquad x_q \leftarrow s\,x_p + c\,x_q \tag{26} $$

where c and s are the cosine and sine of the EJR angle $\alpha$. The norm of the transformed vector remains invariant with respect to the initial one. It is easy to apply the EJR represented by Eq. 26 to the generating vector coefficients in Eq. 25, and then the variation of $\varepsilon^{(2)}$ with respect to the active pair of elements $\{p, q\}$ may be expressed as

$$ \delta\varepsilon^{(2)} = z_{pp}\,\delta x_p^4 + z_{qq}\,\delta x_q^4 + 2 z_{pq}\,\delta(x_p^2 x_q^2) + 2\,\delta x_p^2 \sum_{i\neq p,q} x_i^2 z_{pi} + 2\,\delta x_q^2 \sum_{i\neq p,q} x_i^2 z_{qi} - 2A_p\,\delta x_p^2 - 2A_q\,\delta x_q^2 \tag{27} $$

To compute $\delta\varepsilon^{(2)}$ in Eq. 27 it is necessary to evaluate the second- and fourth-order variations of the elements $x_p$ and $x_q$. The second-order ones are easily obtained:

$$ \delta x_q^2 = -\delta x_p^2 = s^2\,(x_p^2 - x_q^2) + 2cs\,x_p x_q = \Delta \tag{28} $$

The fourth-order variation terms are obtained in turn from the second-order ones, giving

$$ \delta x_p^4 = -2x_p^2\,\Delta + \Delta^2, \qquad \delta x_q^4 = 2x_q^2\,\Delta + \Delta^2, \qquad \delta(x_p^2 x_q^2) = (x_p^2 - x_q^2)\,\Delta - \Delta^2 \tag{29} $$

Further development of the $\delta x_p^4$ and $\delta x_q^4$ expressions, as well as of the crossed term $\delta(x_p^2 x_q^2)$ in Eq. 29, provides the dependence of the quadratic error on the EJR sine s and cosine c. Substituting the expressions for $\delta x_p^2$, $\delta x_q^2$, $\delta x_p^4$, $\delta x_q^4$, and $\delta(x_p^2 x_q^2)$ into Eq. 27 and collecting terms, one finally arrives at a quartic polynomial in the rotation sine:

$$ \delta\varepsilon^{(2)} = E_{04}\,s^4 + E_{13}\,cs^3 + E_{02}\,s^2 + E_{11}\,cs \tag{30} $$

where the parameters $\{E_{ij}\}$ are described as follows:

$$ E_{04} = (z_{pp} + z_{qq} - 2z_{pq})\left[(x_p^2 - x_q^2)^2 - 4x_p^2 x_q^2\right] $$

$$ E_{13} = 4\,(z_{pp} + z_{qq} - 2z_{pq})\,(x_p^2 - x_q^2)\,x_p x_q $$

$$ E_{02} = 4\,(z_{pp} + z_{qq} - 2z_{pq})\,x_p^2 x_q^2 - 2\,(x_p^2 - x_q^2)\,G $$

$$ E_{11} = -4\,x_p x_q\,G $$

and

$$ G = \sum_{i\neq p,q} x_i^2\,(z_{pi} - z_{qi}) + x_p^2 z_{pp} - x_q^2 z_{qq} - (x_p^2 - x_q^2)\,z_{pq} - A_p + A_q $$
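The pair update of Eqs. 25 to 30 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GATOMIC implementation: the rotation angle is picked by a brute-force scan over a one-degree grid rather than from a gradient condition, the generating vector starts from a uniform unit vector instead of an eigenvector of Z, and the matrix Z, the target weights, and all function names are made up for the example.

```python
import math

def asa_error(x, Z, A, z_aa):
    """Quadratic error of Eq. 25 with ASA coefficients w_i = x_i**2."""
    w = [v * v for v in x]
    n = len(w)
    quad = sum(w[i] * Z[i][j] * w[j] for i in range(n) for j in range(n))
    return z_aa + quad - 2.0 * sum(A[i] * w[i] for i in range(n))

def rotate(x, p, q, angle):
    """Elementary Jacobi rotation of Eq. 26 on components p and q."""
    c, s = math.cos(angle), math.sin(angle)
    y = list(x)
    y[p] = c * x[p] - s * x[q]
    y[q] = s * x[p] + c * x[q]
    return y

def delta_error(x, Z, A, p, q, angle):
    """Error variation from the quartic polynomial of Eq. 30."""
    c, s = math.cos(angle), math.sin(angle)
    w = [v * v for v in x]
    a, b = w[p] - w[q], x[p] * x[q]
    K = Z[p][p] + Z[q][q] - 2.0 * Z[p][q]
    G = sum(w[i] * (Z[p][i] - Z[q][i])
            for i in range(len(x)) if i not in (p, q))
    G += w[p] * Z[p][p] - w[q] * Z[q][q] - a * Z[p][q] - A[p] + A[q]
    E04 = K * (a * a - 4.0 * b * b)
    E13 = 4.0 * K * a * b
    E02 = 4.0 * K * b * b - 2.0 * a * G
    E11 = -4.0 * b * G
    return E04 * s**4 + E13 * c * s**3 + E02 * s**2 + E11 * c * s

def ejr_fit(Z, A, z_aa, sweeps=30, n_angles=181):
    """Sweep all pairs, keeping the scanned angle that lowers the error most."""
    n = len(A)
    x = [1.0 / math.sqrt(n)] * n  # unit norm, so sum(w) = 1 (Eq. 14)
    angles = [math.pi * (k / (n_angles - 1) - 0.5) for k in range(n_angles)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                best = min(angles, key=lambda t: delta_error(x, Z, A, p, q, t))
                if delta_error(x, Z, A, p, q, best) < 0.0:
                    x = rotate(x, p, q, best)
    return x

# Toy data: the "exact" density is assumed to lie in the span of the shells,
# so A = Z w_true and z_aa = w_true^T Z w_true (made-up numbers).
Z = [[2.0, 0.6, 0.1], [0.6, 1.0, 0.3], [0.1, 0.3, 0.5]]
w_true = [0.6, 0.3, 0.1]
A = [sum(Z[i][j] * w_true[j] for j in range(3)) for i in range(3)]
z_aa = sum(w_true[i] * A[i] for i in range(3))
x = ejr_fit(Z, A, z_aa)
print([round(v * v, 3) for v in x], asa_error(x, Z, A, z_aa))
```

Because every update is a plane rotation, the squared components stay non-negative and keep summing to one, so conditions 13 and 14 never need to be imposed explicitly. The angular grid limits the attainable precision; a finer scan or an analytic angle would be needed for production use.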

The optimal sine can be chosen with the gradient condition $d\,\delta\varepsilon^{(2)}/ds = 0$:

$$ \frac{d\,\delta\varepsilon^{(2)}}{ds} = 4E_{04}\,s^3 + E_{13}\,(-t s^3 + 3cs^2) + 2E_{02}\,s + E_{11}\,(-ts + c) $$

$$ = -c\left[(E_{13}s^2 + E_{11})\,t^2 - 2\,(2E_{04}s^2 + E_{02})\,t - (3E_{13}s^2 + E_{11})\right] \tag{31} $$

$$ = -c\,(T_2\,t^2 - 2T_1\,t - T_0) = 0 \tag{32} $$

where $s/c = t$ and $dc/ds = -t$, and where, comparing Eqs. 31 and 32, $T_2 = E_{13}s^2 + E_{11}$, $T_1 = 2E_{04}s^2 + E_{02}$, and $T_0 = 3E_{13}s^2 + E_{11}$. The best Jacobi rotation angle is found by solving the quadratic polynomial equation in the EJR tangent t appearing in expression 32. The optimization is conducted through an iterative procedure, until the global variation of the Jacobi rotation angles or of the quadratic error integral function becomes negligible.

A Newton procedure has also been used to optimize the exponents of the fitted nS-type functions appearing in Eq. 16. This search algorithm is feasible because the analytic gradient and the Hessian matrix of the quadratic error are readily available. A program called GATOMIC has been written to compute fitted atomic shells using nS-GTO or nS-STO functions. The initial basis set exponents have been systematically taken from an even-tempered geometric sequence. A coefficient optimization is performed first, followed by a Newton search over this initial exponent set so as to obtain improved values. Next, a new coefficient optimization using EJR is performed to obtain the most accurate fitted positive coefficients, until convergence is reached. Details and computational examples will be given elsewhere.

E. Alternative Approximate Expression of Density Functions: Complete ASA

Although it loses the elegant simplicity of the ASA approach, there is an alternative and very natural way to express the first-order density function, using the same generating vector concept discussed in the previous sections. Suppose that a spherical basis set of nS-type functions $\{S_i\}$ is known, as in the ASA framework. Then, the first-order density function may be approximated by a function like

$$ \rho^{\mathrm{CASA}}(\mathbf{r}) = \sum_{i,j\in a} x_i\,x_j\,S_i(\mathbf{r})\,S_j(\mathbf{r}) \tag{33} $$

where $\{x_i\}$ are the elements of the generating vector $\mathbf{x}$. This approach will be called the complete atomic shell approximation (CASA). The quadratic error function, using Eq. 33, can be written in a manner similar to Eq. 25:

$$ \varepsilon^{(2)} = z_{aa} + \sum_{i,j,k,l\in a} x_i\,x_j\,x_k\,x_l\,z_{ijkl} - 2\sum_{i,j\in a} x_i\,x_j\,B_{ij} \tag{34} $$

where the hypermatrix elements $\{z_{ijkl}\}$ are overlap-like similarity integrals involving four different S-type spherical basis functions:

$$ z_{ijkl} = \int S_i(\mathbf{r})\,S_j(\mathbf{r})\,S_k(\mathbf{r})\,S_l(\mathbf{r})\,d\mathbf{r} \tag{35} $$

while the matrix $\mathbf{B} = \{B_{ij}\}$ corresponds to an integral between two AO and two S-type functions:

$$ B_{ij} = \sum_{\mu,\nu} D_{\mu\nu} \int S_i(\mathbf{r})\,S_j(\mathbf{r})\,\chi_\mu^*(\mathbf{r})\,\chi_\nu(\mathbf{r})\,d\mathbf{r} \tag{36} $$

EJR can be used in the same way as in the ASA approach to optimize the new CASA quadratic error, adapting the variation terms for the generating vector coefficients. But the structure of the CASA error function, Eq. 34, permits an alternative proposal, which leads to an elegant matrix formalism, identical to the one used in the monoconfigurational SCF computational structure. Here the new approximate expression of the density function leads to the normalization condition that the CASA generating vector coefficients must fulfill:

$$\int \rho^{CASA}(\mathbf{r}) \, d\mathbf{r} = \sum_{i,j \in a} x_i x_j \int S_i(\mathbf{r}) S_j(\mathbf{r}) \, d\mathbf{r} = \sum_{i,j \in a} x_i x_j S_{ij} = 1 \tag{37}$$

That is, in matrix form, $\mathbf{x}^T \mathbf{S} \mathbf{x} = 1$, where the matrix $\mathbf{S} = \{S_{ij}\}$ collects the metric matrix elements of the S-type basis function set. Thus, the problem is reduced to minimizing Eq. 34, subject to constraint 37. A Lagrange multiplier technique can produce a more elegant result here than EJR. The Euler equations of the constrained optimization problem are easily written in terms of a generalized secular equation:

$$\mathbf{G}\mathbf{x} = \gamma \mathbf{S}\mathbf{x} \tag{38}$$

where $\gamma$ is half of the necessary Lagrange multiplier, and the structure of the matrix $\mathbf{G}$ depends on the generating vector coefficients and on the integrals already defined:

$$G_{ij} = \sum_{k,l \in a} x_k x_l Z_{ijkl} - B_{ij} \tag{39}$$
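As a hedged numerical illustration of the CASA quantities, the sketch below builds the integrals on a one-dimensional grid and checks that, for a normalized generating vector, $\mathbf{x}^T\mathbf{G}\mathbf{x}$ equals the scalar product between the CASA density and its difference from the exact density. Everything specific is an assumption made for illustration: the 1D grid standing in for the 3D integrals, the three Gaussian exponents, the model "exact" density, and the form $G_{ij} = \sum_{kl} x_k x_l Z_{ijkl} - B_{ij}$. It is not the GATOMIC implementation.

```python
import numpy as np

# 1D toy grid standing in for the 3D integrals (assumption for illustration).
r = np.linspace(-8.0, 8.0, 4001)
dr = r[1] - r[0]

# Hypothetical s-type basis: three Gaussians with made-up exponents.
exponents = np.array([0.3, 1.0, 3.0])
S_funcs = np.exp(-exponents[:, None] * r[None, :] ** 2)

# Made-up "exact" density, normalized to one on the grid.
rho = np.exp(-0.8 * np.abs(r))
rho /= rho.sum() * dr

# Integrals by simple quadrature.
S = np.einsum("ir,jr->ij", S_funcs, S_funcs) * dr            # metric matrix
B = np.einsum("ir,jr,r->ij", S_funcs, S_funcs, rho) * dr     # analogue of B_ij
Z = np.einsum("ir,jr,kr,lr->ijkl",
              S_funcs, S_funcs, S_funcs, S_funcs) * dr       # analogue of Z_ijkl

def casa_density(x):
    """rho_CASA(r) = sum_ij x_i x_j S_i(r) S_j(r) = (sum_i x_i S_i(r))**2."""
    return (x @ S_funcs) ** 2

def g_matrix(x):
    """Assumed form G_ij = sum_kl x_k x_l Z_ijkl - B_ij."""
    return np.einsum("ijkl,k,l->ij", Z, x, x) - B

# Arbitrary trial generating vector, scaled so that x^T S x = 1.
x = np.array([0.5, 0.4, 0.2])
x = x / np.sqrt(x @ S @ x)

gamma = x @ g_matrix(x) @ x
rho_casa = casa_density(x)
overlap = np.sum(rho_casa * (rho_casa - rho)) * dr
print(gamma, overlap)  # the two numbers agree to rounding error
```

In an iterative scheme one would rebuild $\mathbf{G}$ from the current $\mathbf{x}$, solve the generalized eigenproblem, and keep the eigenvector whose eigenvalue has the smallest modulus; that loop is deliberately left out here.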

The computation of the CASA generating vector coefficients must be made iterative, and in the process the eigenvalue γ has to be chosen as the one with minimal modulus. This is so because the eigenvalues in Eq. 38 can be shown to be the same as the scalar product between the CASA density and the difference between this approximate density function and the exact one. This approach is under study in our laboratory, and the practical results will be published elsewhere.

F. Approximate Expectation Values

The ultimate use of ASA or CASA fittings to an exact density function lies in the possible fast, but accurate, computation of QSM integrals. ASA or CASA approaches become essential owing to the need to compute huge numbers of integrals in order to obtain the optimal measure values when quantum similarity is used to superpose two molecular structures. See Section IV.A for more details. But another possible application of ASA-type functions may be found in the approximate calculation of expectation values of operators other than those employed in QSM. Numerical experiments show that ASA-fitted density functions perform quite well when self-similarity values are computed, but fail when expectation values such as the kinetic energy, $\langle K \rangle$, are computed; at any approximation level this must be estimated using

$$\langle K \rangle = -\frac{1}{2} \sum_a \sum_{i \in a} c_i \int S_i(\mathbf{r}) \nabla^2 S_i(\mathbf{r}) \, d\mathbf{r} \tag{40}$$

If sound expectation values have to be obtained, then the quadratic error function must be optimized in the way discussed above, but with the errors of the expectation values, computed under ASA with respect to the ab initio ones, added to it. The overall self-similarity values then carry a greater error than those computed with the previous optimization technique, but several expectation values can be fitted accurately in the process. Numerical details will be provided elsewhere.
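The kinetic energy estimate can be illustrated with a short sketch. It assumes that each shell function is a normalized 1S Gaussian, $S(r) = (2\alpha/\pi)^{3/4} e^{-\alpha r^2}$, and that the estimate is the coefficient-weighted sum of $-\frac{1}{2}\langle S_i|\nabla^2|S_i\rangle$; the function names, exponents and coefficients are hypothetical. For such a Gaussian the integral is known analytically to be $3\alpha/2$, which the radial quadrature reproduces.

```python
import numpy as np

def kinetic_s_gaussian(alpha, r_max=20.0, n=200001):
    """-1/2 <S|laplacian|S> for S(r) = (2a/pi)^(3/4) exp(-a r^2), by radial quadrature."""
    r = np.linspace(0.0, r_max, n)
    dr = r[1] - r[0]
    s_squared = (2.0 * alpha / np.pi) ** 1.5 * np.exp(-2.0 * alpha * r**2)
    # For this S, laplacian(S) = (4 a^2 r^2 - 6 a) * S.
    lap_factor = 4.0 * alpha**2 * r**2 - 6.0 * alpha
    integrand = 4.0 * np.pi * r**2 * s_squared * lap_factor
    return -0.5 * np.sum(integrand) * dr

def asa_kinetic(coeffs, exponents):
    """Coefficient-weighted sum over shells (assumed ASA-type estimate)."""
    return sum(c * kinetic_s_gaussian(a) for c, a in zip(coeffs, exponents))

single = kinetic_s_gaussian(0.5)                 # analytic value: 3 * 0.5 / 2
fitted = asa_kinetic([0.7, 0.3], [0.5, 2.0])     # hypothetical two-shell fit
print(single, fitted)
```

Because every term is proportional to the exponent, tight shells dominate the estimate, which is one way to see why a fit tuned only for self-similarity can give poor kinetic energies.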

IV. MOLECULAR REPRESENTATIONS

In light of the previous discussion, many ways can be envisaged to obtain an appropriate, discrete or continuous, QO representation. Starting from the typical quantum-mechanical continuous description by the density function, and using the QSM adequately, one can obtain new functions, which may provide additional information on the molecular shape and environment, in the same way as other density function manipulations do, such as the well-known electrostatic potential. In this section these possible additional functions are discussed, along with the related problem of molecular superposition, which has been studied in our laboratory. A description of the QO discrete representation and its possible manipulations ends this section.


A. MQSM Surfaces, Molecular Superposition, and Density Transformations

MQSM Surfaces

Suppose one is dealing with the QSM involving two molecular structures {A, B} with attached density functions $\{\rho_A, \rho_B\}$. The MQSM integral of type 2, and more sophisticated ones too, should be written taking into account the relative coordinate positions of both molecular frames $\{\mathbf{X}_A, \mathbf{X}_B\}$, where the vectors $\mathbf{X}$ collect all of the atomic coordinates of each molecule. Supposing that the internal atomic position degrees of freedom are kept constant, MQSM integral 2 can now be written in a slightly different form:

$$z_{AB}(\mathbf{T}; \mathbf{R}) = \int \rho_A(\mathbf{r}; \mathbf{X}_A) \, \rho_B(\mathbf{r}; \{\mathbf{T}, \mathbf{R}\}[\mathbf{X}_B]) \, d\mathbf{r} \tag{41}$$

where M = {T, R} collects the three translation and three rotation parameters relating both molecular frames. Here, it is supposed that the coordinates of molecule A are kept invariant and the atomic structure B is translated and rotated with respect to the A molecular coordinate system. The net result is that the integral $z_{AB}$ depends on the six parameters collected in the vector M and, being a six-variable function, it defines a hypersurface in a seven-dimensional space. Keeping some of the translation-rotation parameters constant, the MQSM can be transformed into a function that can be depicted in the same way as the density or the electrostatic potential. However, in one very particular situation no such restrictions need to be considered: the MQSM integral involving a molecule A and an atom B. In this situation the rotation angles are irrelevant, and only the translation vector survives in M; thus, $z_{AB}(\mathbf{T})$ behaves as a function of three variables, which describe the position of atom B in space, and as such can be represented directly, as plain density or electrostatic potential functions are. Some examples can be found in Ref. [11].

The dependence of integral 41 on the translation-rotation parameters M, as mentioned above, produces a very interesting problem, which has been considered since the first time an MQSM was computed, although until recently it had not been efficiently solved. MQSM, being positive definite functions, can have a maximal positive value with respect to the variation of the parameter vector M = {T, R}. This feature can be interpreted as follows: the integral associated with the measure $z_{AB}$ reaches a maximal value when the moving structure matches, in a natural way, the fixed molecular frame. That is, one can search for the MQSM that maximally superposes, according to this particular choice of measure integral, both sets of atomic coordinates.
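The idea that the overlap-like measure peaks when the moving frame matches the fixed one can be sketched numerically. The example below is a simplification under stated assumptions: each atom carries a single unit-weight spherical Gaussian $e^{-\alpha|\mathbf{r}-\mathbf{R}|^2}$ (a crude ASA-like density), only translations along one line are scanned, and rotations are not explored. The coordinates and the function name are hypothetical. The closed-form overlap of two such Gaussians, $(\pi/2\alpha)^{3/2} e^{-\alpha d^2/2}$, is used for each atom pair.

```python
import numpy as np

def overlap_similarity(xa, xb, alpha=1.0):
    """Overlap-like measure between two sets of unit-weight spherical
    Gaussians exp(-alpha*|r - R|^2) centered on the rows of xa and xb."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    prefactor = (np.pi / (2.0 * alpha)) ** 1.5
    return prefactor * np.sum(np.exp(-0.5 * alpha * d2))

# Hypothetical three-atom frame A and a translated copy B.
xa = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 1.5, 0.3]])
shift = np.array([2.0, -1.0, 0.5])
xb = xa + shift

# Scan translations T = -s * shift applied to B; s = 1 superposes B on A.
scales = np.linspace(0.0, 2.0, 41)
values = np.array([overlap_similarity(xa, xb - s * shift) for s in scales])
best_scale = scales[np.argmax(values)]
print(best_scale)
```

Since B is a copy of A here, the scanned function is an autocorrelation, which cannot exceed its value at zero displacement; for two different molecules the maximum has to be located by a genuine six-parameter search.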



Molecular Superposition

Molecular superposition is a problem whose solution has many implications for the understanding of chemical processes, and it is certainly of great relevance in pharmacological studies. Although many solutions have been described in the literature, none, as far as we know, has been based on a coherent and natural theoretical basis. From the molecular superposition point of view, the optimal overlap-like similarity matrix, containing the MQSM between the elements of a QOS, can be formally computed using the following integral definition:

$$z_{AB}^{max} = \max_{M} \int \rho_A(\mathbf{r}; \mathbf{X}_A) \, \rho_B(\mathbf{r}; M[\mathbf{X}_B]) \, d\mathbf{r} \tag{42}$$

The problem has been solved recently, and a set of algorithms has been described. A varied set of examples has shown that maximal MQSM values can be easily reached. Using an ASA approach, or even simpler similarity integral forms, the process can be extended to proteins. Thus, maximal MQSM constitute the most appropriate theoretical structure for obtaining molecular superposition and matching.

When the similarity matrix $\mathbf{Z} = \{z_{AB}^{max}\}$ is constructed from the maximally matching molecular structure pairs, an added problem arises: in general, one can no longer assume Z to be a positive definite matrix. Usually, the similarity matrix has a positive definite structure: whenever every column or row of the matrix corresponds to an MQSM calculation using the same density function, with the molecular coordinates remaining unchanged by a transformation of type M, positive definiteness is present. The same remark holds when a positive definite weight operator is chosen in the MQSM integral, because then the similarity matrix Z also acquires a metric structure. A final remark must be made with respect to this metric property assigned to similarity matrices: it holds provided that the molecules in the QOS are essentially different, that is, described by linearly independent density functions. Matching can be extended to sets of molecular structures such as optical isomers, molecular excited states, and conformational forms of a given molecular structure.

Density Transformations

From Eq. 41, QSM expressions can be seen as a way to obtain density integral transforms (DIT). Indeed, in the above integral, one can regard the pair of density functions in the integrand as a function to be transformed and a transform kernel, respectively. The MQSM, $z_{AB}$, can be considered as a transform of $\rho_A$ employing the transform kernel $\rho_B$. This situation can be easily generalized, defining a DIT of a known density function $\rho_A$ as the integral

$$\Delta_A(\mathbf{R}) = \mathcal{T}(\rho_A) = \int K(\mathbf{R}, \mathbf{r}) \, \rho_A(\mathbf{r}) \, d\mathbf{r} \tag{43}$$

where $K(\mathbf{R}, \mathbf{r})$ is the transform kernel, an operator that, on integration, produces another function, the DIT $\Delta_A$, which can be used in turn as a new representation of the QO attached to the original density function $\rho_A$. The formal scheme in the DIT definition can be extended, because the density function can still be regarded as depending on another position vector set, $\mathbf{R}_0$, as occurs within the universally accepted Born-Oppenheimer approximation. This dependence, which can also be attached to the kernel, results in a DIT that depends on this vector too:

$$\Delta_A(\mathbf{R}, \mathbf{R}_0) = \mathcal{T}(\rho_A) = \int K(\mathbf{R}, \mathbf{r}, \mathbf{R}_0) \, \rho_A(\mathbf{r}, \mathbf{R}_0) \, d\mathbf{r} \tag{44}$$

Usual calculations treat the dependence of all of the involved functions on the coordinate vector $\mathbf{R}_0$ as implicit, so that it is not explicitly written in the formulas. In any case, if DIT are applied to a density function set $P = \{\rho_J\}$, obtained over a QOS $S = \{s_J\}$, the corresponding elements of the DIT set $D = \{\Delta_J\}$ can be considered representations of the elements of S as sound as the original density functions. The following relationships may then be established:

$$\forall s_J \in S \land \forall \rho_J \in P \rightarrow s_J \leftrightarrow \rho_J \leftrightarrow \Delta_J \tag{45}$$

B. Density Maps and Overlap-Like Measures

An electronic density distribution map may be connected to a similarity measure in the following way. Suppose that a density function is known for some molecular structure M: $\rho_M(\mathbf{r})$, say. In the same manner, suppose that the density function of a given atom a, centered at some position $\mathbf{R}$ in space, is known and symbolically expressed as $\rho_a(\mathbf{r}, \mathbf{R})$. The overlap-like QSM

$$z_{Ma}(\mathbf{R}) = \int \rho_M(\mathbf{r}) \, \rho_a(\mathbf{r}, \mathbf{R}) \, d\mathbf{r} \tag{46}$$

corresponds to an MQSM map of molecule M with respect to atom a, placed at the 3D space position $\mathbf{R}$. Let us suppose that the atomic density $\rho_a(\mathbf{r}, \mathbf{R})$ can be safely associated with a Dirac delta function, $N_a \delta(\mathbf{r} - \mathbf{R})$, with $N_a$ being the nuclear charge. Then one obtains

$$z_{Ma}(\mathbf{R}) = N_a \int \rho_M(\mathbf{r}) \, \delta(\mathbf{r} - \mathbf{R}) \, d\mathbf{r} = N_a \rho_M(\mathbf{R}) \tag{47}$$

That is, MQSM and density function maps become a unified concept: MQSM maps have the density map as a limit.
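The limiting behavior can be checked numerically with a small sketch. It works on a one-dimensional grid (an assumption made for brevity; the text refers to 3D space), takes a made-up model density $e^{-x^2}$ for the molecule, and represents the probe atom by a normalized Gaussian of width sigma scaled by a hypothetical charge $N_a = 6$. As sigma shrinks, the map value approaches $N_a \rho_M(R)$.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)      # grid spacing 0.001
dx = x[1] - x[0]
rho_M = np.exp(-x**2)                    # made-up model "molecular" density

def mqsm_map(R, sigma, N_a=1.0):
    """Overlap-like QSM between rho_M and a Gaussian probe of width sigma at R."""
    probe = N_a * np.exp(-0.5 * ((x - R) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(rho_M * probe) * dx

R = 0.7
density_limit = 6.0 * np.exp(-R**2)      # N_a * rho_M(R)
narrow = mqsm_map(R, sigma=0.01, N_a=6.0)
wide = mqsm_map(R, sigma=1.0, N_a=6.0)
print(narrow, wide, density_limit)
```

A diffuse probe gives a smoothed map that differs visibly from the density, while the narrow probe reproduces it closely.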

C. Discrete Matrix Representations

From the discussion of the QSM problems presented so far, similarity measures of this kind can be considered as a way to obtain a discrete, numerical, matrix representation of a given molecular structure, which will depend on the rest of the molecular structures taken into account. The following points illustrate the general background of the discrete matrix representation of a molecule.

The matrix representation of a molecule, with respect to a given molecular set, can be associated in turn with a set of numerical values forming a finite-dimensional vector representing the studied molecule, as mentioned in Section II. If the studied molecular set contains m molecules, the original ∞-dimensional molecular representation, by means of the density matrix or the corresponding DIT elements associated with every molecule belonging to a QOS, can be considered as projected into an m-dimensional space. Of course, when one is dealing with hypermatrices, the m-dimensionality can be achieved by using a dimension reduction, as will be described below. However, in this case the m-dimensional projection is not compulsory, and one can freely work in higher-dimensional spaces. Several discrete representations may be obtained, since the matrix of the QO coordinates may come from the QSM values collected into matrices of any kind.

A practical way to codify the QOS into a matrix is as follows. Suppose a given QOS, S, and an attached set D collecting the chosen density functions of every QO in a one-to-one correspondence:

$$\forall s \in S \rightarrow \exists \rho \in D \Rightarrow s \leftrightarrow \rho \tag{48}$$

Also, all possible QSM involving QOS elements may be considered as defined in the tensorial product space D ⊗ D, and finally collected into a similarity matrix:

$$\mathbf{Z} = \{z_{IJ}(\Omega)\} \equiv \{z_{IJ}\} \tag{49}$$

The matrix Z contains, in this manner, information about the relationships between pairs of elements of the QOS. The columns of the matrix Z,

$$\mathbf{Z} = (\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_I, \ldots, \mathbf{z}_m) \tag{50}$$

can be interpreted as the matrix representation of every element in D, in the vector space spanned by the QO density functions, which, in turn, act as a basis set. In this way, a finite-dimensional set of vectors Z represents the density functions D, by means of the correspondences

$$\forall \rho \in D \rightarrow \exists \mathbf{z} \in \mathbf{Z} \Rightarrow \rho \leftrightarrow \mathbf{z} \tag{51}$$

$$\forall s \in S \rightarrow \exists \mathbf{z} \in \mathbf{Z} \Rightarrow s \leftrightarrow \mathbf{z} \tag{52}$$

In the case of a molecular set study, this point of view leads to the concept of point-molecule, that is, any column $\mathbf{z}_I \in \mathbf{Z}$ of the similarity matrix. The collection of all point-molecules in the matrix Z is known as a molecular point-cloud. Molecular point-clouds can also be constructed by means of the eigenvectors of the matrix Z. This is possible because the set of QSM $\{z_{IJ}\}$ can be considered a set of scalar products between density function pairs. In this manner, the QSM matrix Z is a kind of Gram matrix, constructed using the elements of the density set D. The column eigenvectors of Z can be considered, in any case, a set of point-molecules possessing a canonical behavior: they are normalized vectors, each orthogonal to the rest of the elements of this point-cloud. They constitute, in this manner, a sort of uniform coordinate system, which may be defined for any QOS, provided that a similarity matrix is known and its eigenvectors are computed. An even more interesting uniform coordinate set of this sort may be formed by taking the eigenvectors of Z collected as row vectors; they can be considered a dual space representation of the former molecular point-cloud.

When the weight operator Ω in the QSM is chosen as another density function, or as a product of density functions belonging to D, as in the triple-density QSM defined in Eq. 3, then the similarity matrix Z becomes a hypermatrix. As a consequence, a symmetric matrix representation of every object in S is obtained. In general, when a multiple QSM, consisting of a product of p density functions, is chosen, a (p-1)-dimensional hypermatrix can be attached to every object in the set S.

The procedures outlined so far may be modified in the following way. Until now it has been supposed that all elements of a given molecular set are used to represent the same molecular elements, producing, in this manner, square numerical arrays for the active molecular structures. But it is not necessary to proceed in this way in order to obtain discrete molecular representations. A given molecule or set of molecules can be compared with a given molecular set, which may serve as a basis set for the discrete representation. Suppose that T is a molecular set acting as a pattern structure, for which a one-to-one correspondence with a density function set P is known. Then the following relationships may be obtained, using the similarity matrix U derived from the tensorial product space D ⊗ P or from some higher-order direct product similarity measures:

$$\forall s_I \in S \rightarrow \exists \mathbf{u}_I \in \mathbf{U} \Rightarrow s_I \leftrightarrow \mathbf{u}_I \tag{53}$$

and the column matrix elements $\{u_{JI}\} \in \mathbf{u}_I$ are obtained using the similarity measures between the density associated with $s_I$ and all of the density functions of the reference molecular set T: $\rho_J \in P; \forall J$.



From here, it is easy to see how this algorithm can be taken either as a way to obtain a molecular representation by means of a rectangular array or as a procedure leading to the square discrete representation described previously. The present numerical process can also be seen as a tool for augmenting the dimension of the discrete molecular matrix representation. To understand this last comment, let us define, by means of a direct sum, a new matrix set V = Z ⊕ U such that

$$\forall s_I \in S \rightarrow \exists \mathbf{v}_I \in \mathbf{V} \Rightarrow s_I \leftrightarrow \mathbf{v}_I = \mathbf{z}_I \oplus \mathbf{u}_I \tag{54}$$

From the diverse points of view discussed above, one can see that QSM, obtained considering some or all QOS elements, may lead to a projection of any given molecular element representation. This projection goes from the ∞-dimensional molecular density space, D, into a finite-dimensional space, whose dimension may be of diverse, arbitrary, finite magnitude, depending on the conventions chosen when gathering the MQSM into matrix form. It is possible to generalize the previous definitions: as can be deduced from the above discussion, every system set $S = \{s_I\}$ can be represented by a hypermatrix whose elements are taken as the MQSM:

$$\{Z_{i_1 i_2 \ldots i_n}(\Omega)\} = \{Z[s_{i_1}, s_{i_2}, \ldots, s_{i_n}](\Omega, \mathbf{R})\} \tag{55}$$

where a multiple QSM made of n density elements attached to the corresponding QOS elements $\{s_{i_1}, s_{i_2}, \ldots, s_{i_n}\}$ has been used. A possible explicit dependence on an operator Ω and a coordinate vector R is also indicated in the equation. To obtain a computationally manageable form of the similarity hypermatrix, it is possible to reduce the n-dimensional hypermatrix information into a matrix, A, dependent on the weight operator Ω, say, involving all available pairs of molecules contained in S:

$$\mathbf{A}(\Omega) = \{A_{IJ}(\Omega)\} \tag{56}$$

This can be reached by means of the following hypermatrix reduced product definition:

$$A_{IJ}(\Omega) = \sum (\mathbf{i} = \mathbf{1}, m\mathbf{1}, \mathbf{1}) \; Z_I(\mathbf{i}) Z_J(\mathbf{i}) \tag{57}$$

which possesses a structure similar to a scalar product. Here a nested summation symbol (NSS) formulation has been used, and the terms $Z_I(\mathbf{i})$ are a shorthand notation for the hypermatrix elements:

$$Z_I(\mathbf{i}) = Z_{I i_1 i_2 \ldots i_{n-1}}(\Omega) \tag{58}$$


The most frequent representation cases can now be envisaged within this new formulation:

1. When n = 2, one is dealing with the $\{Z_{IJ}\}$ matrix elements, as defined in the previous paragraph. Then,

$$A_{IJ} = \sum_{i=1}^{m} Z_{Ii} Z_{Ji} \tag{59}$$

where the NSS appearing in Eq. 57 has been reduced to a single summation symbol. This kind of reduction is a scalar product between the rows (or columns) of the matrix Z.

2. If n = 3, then every molecular structure is represented by a square matrix, but even so one can find a way to compact the data generated by the triple-density QSM hypermatrix elements $\{Z_{Iij}\}$; for example, it is only necessary to use the following straightforward rule:

$$A_{IJ} = \sum_{i=1}^{m} \sum_{j=1}^{m} Z_{Iij} Z_{Jij} \tag{60}$$

At this stage, we can describe the general methodology to follow in a QSM study. When a family of QO is defined, one constructs the attached hypermatrix representation of every set element, once a given kind of similarity measure has been chosen. In general, a matrix or some hypermatrix, which is reduced to manageable dimensions, can always be directly obtained, following any one of the contractions previously mentioned. Afterwards, it is convenient to analyze somehow the matrix or hypermatrix structure. The final conclusions can be used to suggest relationships between the QOS elements.
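The workflow just described can be sketched on a toy object set. The example assumes four one-dimensional Gaussian "densities" with made-up centers (real QSM involve 3D molecular densities), builds the overlap-like similarity matrix and a triple-density hypermatrix by quadrature, and applies the contractions of Eqs. 59 and 60. All names are illustrative.

```python
import numpy as np

# Hypothetical QOS: four 1D "molecules", each a normalized Gaussian density.
centers = np.array([0.0, 0.4, 1.0, 2.5])
x = np.linspace(-10.0, 12.0, 4401)
dx = x[1] - x[0]
rho = np.exp(-(x[None, :] - centers[:, None]) ** 2)
rho /= rho.sum(axis=1, keepdims=True) * dx

# Overlap-like QSM (n = 2) and triple-density QSM (n = 3).
Z2 = np.einsum("ir,jr->ij", rho, rho) * dx
Z3 = np.einsum("ir,jr,kr->ijk", rho, rho, rho) * dx

# Columns of Z2 are the point-molecules; Z2 is a Gram matrix of the densities.
eigenvalues = np.linalg.eigvalsh(Z2)

# Contractions to a manageable matrix: Eq. 59 (n = 2) and Eq. 60 (n = 3).
A2 = Z2 @ Z2.T
A3 = np.einsum("Iij,Jij->IJ", Z3, Z3)
print(eigenvalues.min() > 0.0, A2.shape, A3.shape)
```

Both contracted matrices are symmetric m × m arrays, so the same analysis tools can be applied whatever the order of the underlying QSM.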

V. MANIPULATION OF SIMILARITY MEASURES: SIMILARITY INDICES

As has been previously discussed, once the set of QO to be studied is formed and the operator related to the QSM definition is chosen, the resultant value of the QSM itself, related to the QOS, is unique. However, the similarity matrix elements, obtained as discussed in the previous sections, can be transformed or combined so as to obtain auxiliary terms of a new kind, here named quantum similarity indices (QSI). A vast number of possible QSM manipulations exist, leading to a correspondingly great variety of QSI definitions. The most common, which can be considered the standard ones, arise from the manipulation of QSM between two molecules, such as those defined in Eq. 1 or, more generally, those obtained by means of some reduction like that described in Eq. 57.



In this manner, the QSM values can always be referred to as the elements of the similarity matrix: $\mathbf{Z} = \{z_{IJ}(\Omega)\}$. For reasons that will become obvious in what follows, two kinds of similarity indices may be described and collected into two well-defined classes. These classes are related to the most elementary similarity-dissimilarity indices defined so far in molecular similarity studies or in related fields, such as pattern recognition theory. The nature of this classification of the similarity indices is developed next.

A. C-Class Similarity Indices

A well-defined similarity index constitutes the leading member of this index class. It is simply an index belonging to the correlation-like class. In fact, the mathematical interpretation of such an index is that of the generalized cosine of the angle subtended by two vectors, weighted by a chosen positive definite operator and defined in a suitable ∞-dimensional functional space containing the density matrix elements. The cosine-like similarity index between two molecules I and J may be constructed as

$$C_{IJ}(\Omega) = z_{IJ}(\Omega) \, [z_{II}(\Omega) \, z_{JJ}(\Omega)]^{-1/2} \tag{61}$$
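Given a similarity matrix, the cosine-like index of Eq. 61 reduces to a normalization by the self-similarities. A minimal sketch follows; the function name and the 2 × 2 matrix are illustrative only.

```python
import numpy as np

def cosine_like_index(Z):
    """C_IJ = z_IJ / sqrt(z_II * z_JJ) for every pair of objects."""
    self_sim = np.sqrt(np.diag(Z))
    return Z / np.outer(self_sim, self_sim)

Z_example = np.array([[2.0, 1.0],
                      [1.0, 2.0]])       # made-up similarity matrix
C_example = cosine_like_index(Z_example)
print(C_example)
```

The diagonal of the result is always 1, so only the off-diagonal elements carry information about the compared pairs.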

This C-class QSI, for any pair of compared systems, can take any value in the [0,1] interval. A value of 0 corresponds to total dissimilarity, whereas a value of 1 indicates complete similarity of the two compared objects. These values depend on the similarity matrix elements associated with both molecules. The two extreme situations have a geometrical meaning too, corresponding to a pair of orthogonal or collinear density matrix elements, respectively. Some authors refer to this index as the Carbo similarity index. Indices of this kind are not unique to quantum similarity, but have spread to other scientific areas. Take, for example, crystallography, where the cosine-like form 61 has recently been described.

Cosine-Like Indices and Multiple QSM

How must one proceed to extend the concept outlined above when three or more systems are to be explored? It seems that the natural way to extend the cosine-like index to three molecular structures simultaneously can be expressed as

$$C_{IJK} = z_{IJK} \, (z_{III} \, z_{JJJ} \, z_{KKK})^{-1/3} \tag{62}$$

Thus, when I = J = K, the plain $C_{IJK}$ index goes to 1. So, defining the Nth-order QSM as the integral

$$z_{\mathbf{I}}^{(N)} = \int \left( \prod_{K=1}^{N} \rho_{I_K} \right) dv \tag{63}$$

where $\mathbf{I} = (I_1, I_2, \ldots, I_N)$ is a set of indices associated with N QO, then an Nth-order QSI may be defined as

$$C_{\mathbf{I}}^{(N)} = z_{\mathbf{I}}^{(N)} \left( \prod_{K=1}^{N} z_{I_K I_K \ldots I_K}^{(N)} \right)^{-1/N}$$

[Table fragment: Schultz index (molecular topological index), Harary number, Balaban index; the defining formulas did not survive extraction.]

$$\ldots, \; \mathbf{c} = \{c_i = |x_i|^2\} \rightarrow K_n(\mathbf{c}) \tag{22}$$

Fuzzy Sets and Boolean Tagged Sets


Thus, EJR can be applied to transform the generating vector elements $\{x_i\}$; the elements of the ASA coefficient vector are then varied indirectly, while preserving the convex conditions $K_n(\mathbf{c})$.

ASA Structure in Molecules

With the above considerations in mind, it is easy to see that, by performing a preliminary computational task, a set of fitted atomic DF, $A = \{\rho_A\}$, can be gathered, having the ASA form of Eq. 21. The set A, being strictly formed by convex linear combinations of PD functions, can be taken as a PD operator set. Thus, A can be used as a generating set of new ASA-type DF. A molecular DF, $\rho_M$, for example, can be approximated by a linear combination of the elements of A, with a coefficient vector $\mathbf{w} = \{w_A\}$ fulfilling the convex conditions $K_n(\mathbf{w})$:

$$\rho_M(\mathbf{r}) = \sum_A w_A \, \rho_A(\mathbf{r} - \mathbf{r}_A) \land K_n(\mathbf{w}) \tag{23}$$

where $\{\mathbf{r}_A\}$ are the molecular atomic coordinates, on which the ASA DF are centered. According to the properties of the PD operators, described in Definition 7, $\rho_M$ is also a PD function. This allows the coefficient vector, w, to be fitted to the molecular DF in the same fashion as described for the atomic case. This also means that a generating vector u can be defined too, fulfilling a set of conditions similar to those shown in Eq. 22, provided ...

$$\ldots \rightarrow R^+\} \land \forall \omega \in \Omega; \; \forall \rho \in P \rightarrow \int \omega(\mathbf{r}) \rho(\mathbf{r}) \, d\mathbf{r} \in R^+$$

Nothing prevents considering the following situation:

$$\Omega \subset P \Rightarrow \forall \omega, \rho: \int \omega(\mathbf{r}) \rho(\mathbf{r}) \, d\mathbf{r} = \langle \omega \rangle \in R^+$$

where the application of the PD operator set over the PD DF set can be interpreted as a noncommutative scalar product, defined over the VSS to which both the PD operator and DF sets belong. Moreover, the scalar product in Eq. 27 can be regarded, according to the usual quantum-mechanical interpretation, as the expectation value, $\langle \omega \rangle$, of the system observable represented by the particular operator ω, in terms of the DF tag part, ρ, of the particular QO state. Because it is possible to consider the operator set, Ω, as forming part of the VSS, P, the situation applied to DF as stated in Eq. 18 can be applied to the elements of the operator set. In such a manner, if a set of coefficients $\mathbf{w} = \{w_\alpha\}$ exists, and constraints similar to Eqs. 14 and 20 are set on it, then a linear combination of the operator set Ω will yield a PD operator, γ, such as

$$K_n(\mathbf{w}) = \left\{ \mathbf{w} = \{w_\alpha\} \subset R^+ \land \sum_\alpha w_\alpha = 1 \right\} \Rightarrow \gamma = \sum_\alpha w_\alpha \omega_\alpha : \langle \gamma \rangle \in R^+ \tag{30}$$

The second constraint has been introduced so as to obtain a pattern comparable to the CS structure of the VSS P and transfer it to Ω, but it is not strictly necessary to keep this unit coefficient sum if it is not needed. The most interesting thing is the obvious result, according to Definition 7, that PD operators can yield, in a CS environment, new PD operators.

Tuned QSM, SM, and QO Descriptors

The conservation of the PD property under linear combinations of PD operators in a CS environment can be employed in the evaluation of new kinds of QSM, by constructing a new breed of PD operator weights. The γ-type operators appearing in Eq. 30 can be tuned, while maintaining the identity of the operator set, just by changing the values of the CS coefficient set, w, while conserving the initially chosen constraints. A QSM, following the definition provided in Eq. 7, can be built up under these circumstances as:

$$Z_{AB}(\gamma) = \sum_\alpha w_\alpha Z_{AB}(\omega_\alpha) \tag{31}$$

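The resulting tuned SM elements, $Z_{AB}(\gamma)$, produce another obvious result for the SM set, $\{\mathbf{Z}(\omega_\alpha)\}$, associated with every operator in Ω. With each SM attached to a PD operator, every such SM can be considered a discrete matrix representation of the associated operator in the basis set of the involved DF. These matrices, as already mentioned, can be considered PD matrices. Thus, Eq. 31 can be written in whole matrix form as

$$\mathbf{Z}(\gamma) = \sum_\alpha w_\alpha \mathbf{Z}(\omega_\alpha) \tag{32}$$

The resultant matrix is PD because, if all of the elements of the SM set, $\Theta = \{\mathbf{Z}(\omega_\alpha)\}$, are PD, then the following property holds:

$$\forall \mathbf{x} \in C_n \land \forall \mathbf{Z}(\omega_\alpha) \in \Theta \rightarrow \mathbf{x}^+ \mathbf{Z}(\omega_\alpha) \mathbf{x} \in R^+ \Rightarrow \mathbf{x}^+ \mathbf{Z}(\gamma) \mathbf{x} = \sum_\alpha w_\alpha \mathbf{x}^+ \mathbf{Z}(\omega_\alpha) \mathbf{x} \in R^+; \text{ if } \forall \alpha: w_\alpha \in R^+ \tag{33}$$

These results demonstrate that a finely tuned set of QO descriptors can be obtained in this way. This is so because Eq. 32 holds for the SM columns too, in such a way that

$$\mathbf{z}_I(\gamma) \in \mathbf{Z}(\gamma) \land \mathbf{z}_I(\omega_\alpha) \in \mathbf{Z}(\omega_\alpha) \rightarrow \mathbf{z}_I(\gamma) = \sum_\alpha w_\alpha \mathbf{z}_I(\omega_\alpha) \tag{34}$$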
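The positive definiteness of a convex combination of PD similarity matrices, and the fact that the same combination holds column by column, can be checked with a small sketch. The matrices below are random PD matrices standing in for real SM, and the weights are arbitrary; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pd(n):
    """Random symmetric positive definite matrix (stand-in for an SM)."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

sm_set = [random_pd(4) for _ in range(3)]
w = np.array([0.2, 0.5, 0.3])            # convex weights: positive, sum to 1

# Tuned SM: convex combination of the SM set.
Z_gamma = sum(wa * Za for wa, Za in zip(w, sm_set))

# The smallest eigenvalue is bounded below by the weighted smallest eigenvalues.
min_tuned = np.linalg.eigvalsh(Z_gamma).min()
lower_bound = sum(wa * np.linalg.eigvalsh(Za).min() for wa, Za in zip(w, sm_set))

# The same combination holds for any column (point-molecule descriptor).
column_0 = sum(wa * Za[:, 0] for wa, Za in zip(w, sm_set))
print(min_tuned, lower_bound)
```

Changing w within the convex set moves the descriptors continuously between those of the individual operators without ever leaving the PD matrices.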

Thus, all of the findings and definitions up to this point can be summarized as follows. A QOS is chosen in the form of a DF tagged set. A PD set of suitable operators is used, as a set of weights, in the evaluation of QSM between QO. A set of SM is thus computed, one for each operator. A CS with suitable coefficients is chosen to combine the elements of the SM set. The resultant SM columns are convex descriptors of the corresponding QO, and provide a discrete vector tagged set representation of the QOS.

E. Finely Tuned QSAR

If an immediate application of all of the previous development has to be chosen, quantitative structure-activity or structure-property relationships (QSAR or QSPR) constitute a good candidate field. In our laboratory, the basic theory connecting QSM and QSAR or QSPR was developed some time ago, and various practical applications have been reported more recently. It has been deduced that molecular properties have to be related, in some manner, to the discrete representation of molecular descriptors furnished by the columns of SM, constructed in turn from QSM over the molecular QO. As a consequence of Eqs. 29 and 34, a given property value, π, for a particular molecular QO, described in turn by a discrete descriptor, $\mathbf{z}(\gamma)$, can be expressed by means of

$$\pi = \mathbf{u}^T \mathbf{z}(\gamma) \tag{35}$$

where the vector u corresponds to an unknown discrete representation of some operator over the same PD DF basis set used to construct the convex discrete molecular descriptor $\mathbf{z}(\gamma)$. The usual procedure is to use a least-squares algorithm so that, knowing the pairs $\{\pi, \mathbf{z}(\gamma)\}$ for a molecular QOS, the values of u can be obtained. Taking into account the tuned construction of the vectors $\mathbf{z}(\gamma)$, it can easily be seen that the vector u will depend on the tuning parameter set, γ. We use the least-squares solution of the problem

$$\mathbf{u} = (\mathbf{Z}(\gamma)^T \mathbf{Z}(\gamma))^{-1} \mathbf{Z}(\gamma)^T \mathbf{p} \tag{36}$$

where the vector $\mathbf{p} = \{\pi_I\}$ contains the values of the property for each molecular QO, and $\mathbf{Z}(\gamma)$ is the SM of the QOS computed according to Eq. 32. Equation 36, however, has been written taking into account the possibility that the SM may no longer be square symmetric, but rectangular. This constitutes the more general case, where, instead of a unique tagged set, two QOS with different cardinalities, m and n, are used to compute the QSM. The resultant SM will be of dimension (m × n). From this definition, one can easily deduce that the vector u will depend on the tuning coefficients w. Optimization of the tuning coefficients w can be carried out at the same time as the classical least-squares problem is solved, keeping in mind the associated CS constraints on w. A parallel nonlinear constrained optimization of a quadratic function of the w elements then appears. The interesting feature here is that the CS constraints can be handled in the same way as in the optimal ASA problem. Thus, alongside the usual least-squares problem involving the operator-associated vector, u, there appears another least-squares equation, which starts by defining the residual vector:

$$\boldsymbol{\Delta} = \mathbf{p} - \mathbf{Z}(\gamma)\mathbf{u} = \mathbf{p} - \sum_\alpha w_\alpha \mathbf{Z}(\omega_\alpha)\mathbf{u} = \mathbf{p} - \sum_\alpha w_\alpha \mathbf{v}_\alpha = \mathbf{p} - \mathbf{V}\mathbf{w} \tag{37}$$

where the previous definitions of the involved matrices have been employed. Also, the matrix V collects the vector set $\{\mathbf{v}_\alpha = \mathbf{Z}(\omega_\alpha)\mathbf{u}\}$ as columns, and the vector $\mathbf{w} = \{w_\alpha\}$ contains the coefficients of the tuning set W. The residual vector 37 obviously depends on the classical least-squares solution u in Eq. 36. From inspection of the residual vector Δ, it is easy to see that the quadratic error is a generalized quadratic function of the new unknown vector w. This least-squares problem has to be solved under the constraints associated with the PD nature of the vector w. A CS constraint structure may be very convenient for normalizing the form of the problem. Thus, the quadratic function and the constraints may be written, in compact matrix form, as follows:

$$\varepsilon^{(2)} = \chi - 2\mathbf{q}^T\mathbf{w} + \mathbf{w}^T\mathbf{Q}\mathbf{w} \land K_n(\mathbf{w}) \tag{38}$$

where the following simplifications have been used:

$$\chi = \mathbf{p}^T\mathbf{p} \land \mathbf{q} = \mathbf{V}^T\mathbf{p} \land \mathbf{Q} = \mathbf{V}^T\mathbf{V} \tag{39}$$

It is important to define the SM set, $\{\mathbf{Z}(\omega_\alpha)\}$, appropriately, so that the columns of the matrix $\mathbf{V} = \{\mathbf{v}_\alpha\}$ are linearly independent and a PD matrix Q is obtained. This is equivalent to saying that the SM set must provide linearly independent images of the least-squares solution u. The solution of the second optimization problem, for the vector w, can be sought using a generating vector, x for example, whose elements replace those of w through the convex conditions $K_n(\mathbf{w})$ and the generating rule $R(\mathbf{x} \rightarrow \mathbf{w})$. Expression 38 then transforms into a quartic function of the components of the vector x. Optimization under the unit norm of the generating vector, $\mathbf{x}^+\mathbf{x} = 1$, may be carried out by means of EJR, as in the ASA case. The whole optimization process must be performed iteratively:

1. Using a starting approximate tuning vector w, obtain u by solving Eq. 36.
2. Knowing u, compute a new w by minimizing function 38.
3. Return to step 1 while the vector pair {u, w} remains inconsistent with that of the previous iteration.

Changing the number and nature of the SM composing Z(γ) will obviously produce different results, but within a given choice these can be coherently tuned. This can add extraordinary possibilities to QSAR procedures.
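The iterative scheme can be sketched as follows. Several choices here are assumptions rather than the procedure actually used by the author: the SM set and the property vector are random synthetic data, the generating rule is taken as $w_\alpha = x_\alpha^2$ with $\mathbf{x}^T\mathbf{x} = 1$, and the optimal rotation angle of each elementary Jacobi rotation is found by a coarse grid scan instead of by solving the polynomial equation in the tangent. A rotation is accepted only when it lowers the error, so the recorded error cannot increase.

```python
import numpy as np

def quad_error(w, chi, q, Q):
    """Quadratic error chi - 2 q^T w + w^T Q w."""
    return chi - 2.0 * (q @ w) + w @ Q @ w

def ejr_sweep(x, chi, q, Q, n_angles=181):
    """One sweep of elementary Jacobi rotations over all pairs (i, j).
    Rotations keep x^T x = 1, so w = x**2 stays positive with unit sum."""
    angles = np.linspace(-np.pi / 2, np.pi / 2, n_angles)  # grid scan (assumption)
    for i in range(len(x) - 1):
        for j in range(i + 1, len(x)):
            best_x, best_e = x, quad_error(x**2, chi, q, Q)
            for t in angles:
                c, s = np.cos(t), np.sin(t)
                y = x.copy()
                y[i], y[j] = c * x[i] - s * x[j], s * x[i] + c * x[j]
                e = quad_error(y**2, chi, q, Q)
                if e < best_e:
                    best_x, best_e = y, e
            x = best_x
    return x

def solve_u(w, sm_set, p):
    """Step 1: least-squares u for the tuned SM built from w."""
    Zg = sum(wa * Za for wa, Za in zip(w, sm_set))
    u, *_ = np.linalg.lstsq(Zg, p, rcond=None)
    return u, float(np.sum((p - Zg @ u) ** 2))

def tuned_qsar(sm_set, p, n_iter=10):
    x = np.full(len(sm_set), 1.0 / np.sqrt(len(sm_set)))   # uniform starting w
    history = []
    for _ in range(n_iter):
        u, err = solve_u(x**2, sm_set, p)
        history.append(err)
        V = np.column_stack([Za @ u for Za in sm_set])      # v_alpha = Z(omega_alpha) u
        x = ejr_sweep(x, p @ p, V.T @ p, V.T @ V)           # step 2: new w = x**2
    u, err = solve_u(x**2, sm_set, p)
    history.append(err)
    return u, x**2, history

# Synthetic data: 3 rectangular SM (8 molecules x 4 reference objects), 8 property values.
rng = np.random.default_rng(1)
sm_demo = [rng.normal(size=(8, 4)) for _ in range(3)]
p_demo = rng.normal(size=8)
u_fit, w_fit, errors = tuned_qsar(sm_demo, p_demo)
print(w_fit, errors[0], errors[-1])
```

A fixed number of iterations is used for simplicity; a convergence test on the pair {u, w}, as in step 3, would replace it in practice.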

IV. ON THE STATISTICAL INTERPRETATION OF DENSITY FUNCTIONS: DIAGONAL VECTOR SPACES AND RELATED PROBLEMS

One of the main contributions of the present paper is the definition of n-dimensional diagonal vector spaces (DVS). The objective in introducing DVS is to find a discrete vector representation that consistently reproduces some usual properties of the ∞-dimensional Hilbert spaces^23 containing the functions subsequently employed to describe QO, in accordance with the von Neumann^24 point of view. Thus, the main concern here will be to obtain, in a natural way, the CS structure of approximate DF within DVS, in the same natural manner as the DF is obtained from the squared module of the QO system wave function.

A. The Nature of Discrete QO Representations

Let us now suppose a QOS, Q, constructed in the usual way as a TS, that is, Q = S × P. Let us also suppose that the elements of the tag set part are ASA-type DF, built as in Eqs. 21 or 23. Accepting this scenario is the same as considering that a QO is described under some finite PD functional basis set Φ = {|φ_i(r)|²} with coordinates ω = {ω_i} fulfilling the convex conditions K_n(ω) and belonging to a given n-dimensional VSS, so that ω ∈ V_n(R⁺). The TS constructed as Q_n = S × {Ω ⊂ V_n(R⁺)} corresponds to a QOS whose tag set part is a subset of some VSS of finite dimension. A discrete representation of QO can thus be reached in this way, besides the one discussed in Section III.A. Considering the nature of the continuous ASA-type transform 25, it can be seen that in the discrete case the PD basis set Φ and the coefficient vector ω shall bear some equivalent structure. As the convex conditions K_n(ω) hold for the coefficient vector, its elements can readily be interpreted as a discrete probability distribution. For example, in a promolecular approach as well as in an MO monoconfigurational closed-shell structure, ω can appear as a homogeneous discrete probability distribution. Thus, the coefficient vector ω bears the equivalent statistical features of a DF in discrete n-dimensional spaces. It is not strange that there exists a generating vector, γ ∈ V_n(C), producing the PD ω elements by application of the generating rule R(γ → ω). The structure of the rule is not one that can be attached to a linear transformation, but has to bear a nonlinear form. This nonlinear relationship between generating vector and coefficient vector appears unnatural from an algebraic point of view. This is more obvious when examined in the continuous situation, as discussed previously.
The picture becomes even more conspicuous when the nature of the DF is observed from the quantum-mechanical side, because any DF has to be considered as the squared module of the QO wave function, which thus acts as a generating vector. The problem can be stated transparently using a simple, well-known mathematical device, associated with the most basic aspects of quantum mechanics. Suppose a QO wave function is known for some system state, Ψ(r) ∈ H(C). The corresponding DF is simply computed as ρ(r) = |Ψ(r)|². The interesting fact is that the DF thus defined may belong either to the Hilbert space direct product H(C) ⊗ H(C), when considered as an operator, or to some functional VSS H(R⁺), when considered as a PD real-valued function. In this sense, the generating rule can be applied here, and immediately written as R(Ψ → ρ). However, it can be interpreted in the following way, using the practically unmodified structure of Eq. 27:

\[ R(\Psi \rightarrow \rho) \equiv \left\{ \exists\, \Psi(\mathbf{r}) \in H(\mathbf{C}) \;\wedge\; \int |\Psi(\mathbf{r})|^{2}\, d\mathbf{r} = 1 \;\rightarrow\; \rho(\mathbf{r}) = |\Psi(\mathbf{r})|^{2} \right\} \qquad (40) \]

This continuous generating rule must imply a closer relationship between the vectors involved in the discrete case. There, the generating rule R(γ → ω) means that, while the normalization part for γ is simple to compute, 1 = γ⁺γ, there is no such simple operation in the second part of the generating rule. That is, when one must attach the ω coefficient vector elements to the squared modules of the generating vector γ, the algorithm is not naturally isomorphic to the one in Eq. 40. To restate the problem: the second part of the discrete generating rule, as stated in Eq. 22, lacks a simple, naturally obtained operation isomorphic to the continuous case as defined in Eqs. 27 or 40. A possible solution of this interesting situation will be discussed next.

B. The Structure of the Generating n-Dimensional VS: DVS

The generating rules 22 and 40 are a shorthand notation for a nonlinear transformation involving the generating VS, V_n(C), and the final VSS containing the coefficient vectors, V_n(R⁺). The lack of a simple natural operation producing the results implicitly stated in the discrete generating rule can be circumvented using the following scheme. Suppose an isomorphic pair of n-dimensional VS, which will be named D_n(C) and D_n(R⁺). Both can be good substitutes for the original V_n(C) and V_n(R⁺) VS described above, respectively. A sound isomorphism of column or row VS is constituted by DVS, whose elements possess the structure of diagonal matrices. Let us consider that the isomorphic D_n(C) and D_n(R⁺) VS elements are chosen as diagonal matrices. This choice is not arbitrary, because matrix multiplication is closed in DVS, that is, matrix products of diagonal matrices yield new diagonal matrices. Moreover, diagonal matrix products are commutative.
Considering only the diagonal part of the matrix elements, and discarding the off-diagonal elements, the DVS possess the same dimension as their isomorphic column-row vector counterparts. Then, it is easy to see that using this simple isomorphic device both the discrete and continuous generating rules acquire the same formal structure. Indeed, the discrete rule in Eq. 22 will be rewritten within any DVS framework as

\[ R(\mathbf{D} \rightarrow \mathbf{A}) \equiv \left\{ \exists\, \mathbf{D} \in D_n(\mathbf{C}) \;\wedge\; \langle \mathbf{D}^{+}\mathbf{D} \rangle = 1 \;\rightarrow\; \mathbf{A} = \mathbf{D}^{+}\mathbf{D} \right\} \qquad (41) \]

where the convex conditions read

\[ K_n(\mathbf{A}) \equiv \left\{ \mathbf{A} = \mathrm{Diag}(a_i) : a_i > 0 \;\; \forall i \;\wedge\; \langle \mathbf{A} \rangle = \sum_i a_i = 1 \right\} \qquad (42) \]
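The diagonal form of the generating rule lends itself to a short numerical check. The sketch below is only an illustration under stated assumptions: the function name `generating_rule` is hypothetical, the normalization is taken to be the trace of D⁺D, and the input entries are assumed to be nonzero so that every coefficient comes out strictly positive.

```python
import numpy as np

def generating_rule(d):
    """Return (D, A) with D a normalised complex diagonal matrix and A = D^+ D.

    Assumes all entries of d are nonzero, so that the diagonal of A is
    strictly positive and sums to one (the convex conditions).
    """
    D = np.diag(np.asarray(d, dtype=complex))
    norm = np.trace(D.conj().T @ D).real     # trace of the squared module
    D = D / np.sqrt(norm)                    # enforce <D^+ D> = 1
    A = (D.conj().T @ D).real                # Diag(|d_i|^2), real by construction
    return D, A

D, A = generating_rule([1 + 1j, 0.5, -2j])
coefficients = np.diag(A)                    # discrete probability distribution
```

Because D is diagonal, D⁺D and DD⁺ coincide, and products of two such matrices stay diagonal and commute, which is the closure property the text relies on.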

Thus, working with DVS instead of conventional VS and VSS, the coefficients in the discrete DF description possess the same structural properties as the DF themselves. The generating DVS elements, D ∈ D_n(C), act in the same manner as do the QO wave functions. And the resultant coefficient diagonal matrix, A ∈ D_n(R⁺), satisfying the convex conditions K_n(A), can be written as a squared module of the former diagonal matrix. This can be done using a discrete form of the generating rule, R(D → A), similar to the wave function-DF generating rule R(Ψ → ρ): A = D⁺D = DD⁺ = |D|² = Diag(|d_i|²), as described in Eq. 41. The DVS of D_n(C) type may be considered normed spaces, with one of the possible norms defined as the trace of the squared matrix module. As a consequence, the DVSS D_n(R⁺) elements are constructed in such a way that their trace is always normalizable, and thus easily made unit. A diagonal TS, 𝒟_n, can be derived in the usual way by using a given background set part, S, and a DVSS convex subset, K, as the tag set part, that is, 𝒟_n = S × {K ⊂ D_n(R⁺)}. The question now need not be: why do the n-dimensional DVS fulfill in a natural way the same conditions as ∞-dimensional functional VS? It is much better stated as follows: what consequences, if any, will this situation have for the development of a discrete quantum chemistry framework? The next section describes some of the possible features of this DVS structure.

C. Expression of the Density Functions and Other Problems

It has been shown that the discrete representation of the DF, having an ASA-like form as in Eqs. 21, 23, and 24, is better described as a diagonal matrix than as a vector, as is usually done. Then, the scalar-like expression of the ASA DF type can be redefined in terms of the natural operations presented in the preceding section. To obtain a coherent view of all the possible redefinitions that follow from adopting the DVS representation, some preliminary considerations will be made next. As the formal structure of ASA generating rules is better represented by diagonal matrices than by vectors, both the generating and coefficient vectors are transformed into elements of some DVS. The ASA forms discussed in Section III.C are associated, besides the coefficient vector, with a PD function set, which is in turn connected to the squared module of another function set, belonging to another structure which can be termed a generating function VS. This situation can be managed in the same way as in the preceding discussion. Suppose that a function basis set is known: Φ = {φ_i}. Without loss of generality, the set Φ can always be arranged into a diagonal matrix structure, Φ = Diag(φ_1, φ_2, ..., φ_n) ∈ F(C). Then the diagonal matrix product P = Φ*Φ = Diag(|φ_1|², |φ_2|², ..., |φ_n|²) ∈ P(R⁺) always produces a new diagonal matrix, whose elements belong to a special function VSS made of PD functions, that is, made of function squared modules. Thus, taking into account the definition of the diagonal product of the initial basis set, one can consider that the above result produces an entirely new PD basis set: P = {|φ_i|²}. Also, having defined the generating and coefficient VS, one can construct the following hybrid diagonal matrix:

\[ \forall\, \mathbf{D} = \mathrm{Diag}(d_i) \in D_n(\mathbf{C}) \;\wedge\; \forall\, \boldsymbol{\Phi} = \mathrm{Diag}(\varphi_i) \in F(\mathbf{C}) \;\Rightarrow\; \boldsymbol{\Psi} = \mathbf{D}\boldsymbol{\Phi} = \mathrm{Diag}(d_i\varphi_i) \in D_n(\mathbf{C}) \times F(\mathbf{C}) \qquad (43) \]

Once mixed structures of this kind are constructed, the ASA-like DF can be built simply by computing traces of the squared modules of the diagonal structures Ψ defined in Eq. 43. That is,

\[ \rho = \sum_i |d_i\varphi_i|^{2} = \sum_i |d_i|^{2}|\varphi_i|^{2} = \sum_i \omega_i|\varphi_i|^{2} \qquad (44) \]

The formalism now makes clear how to construct the necessary generating elements, and the road is open to obtain, in a very natural way, the structure of ASA-like DF. The most interesting feature of the whole procedure, perhaps, will consist in finding out how closely the formal rules deducible from discrete DVS match the formalism based on continuous quantum mechanics. It seems that nothing opposes this possibility. In fact, it only remains to express how an expectation value ⟨Ω⟩ of some observable, associated with an operator Ω, can be computed within a DVS formalism. A possible way could be:

\[ \langle \Omega \rangle = \int \Omega \rho \, dV = \int \Omega \langle \boldsymbol{\Psi}^{*}\boldsymbol{\Psi} \rangle \, dV = \sum_i |d_i|^{2} \int \Omega |\varphi_i|^{2} \, dV = \sum_i \omega_i \int \Omega \rho_i \, dV = \sum_i \omega_i \int \varphi_i^{*} \Omega \varphi_i \, dV = \int \langle \boldsymbol{\Psi}^{*} \Omega \boldsymbol{\Psi} \rangle \, dV \qquad (45) \]

The last linear combination of integrals is suited to differential operators, and can be naturally obtained when considering the operator Ω as a scalar matrix ΩI.
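The construction of Eqs. 43 to 45 can be illustrated on a one-dimensional grid. This is a hedged sketch rather than the author's implementation: the normalized Gaussian basis, the exponents, the generating coefficients, and the choice of the multiplicative operator x² are all assumptions picked so that the result can be checked analytically.

```python
import numpy as np

# One-dimensional grid and a simple Riemann-sum quadrature (assumed setup).
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

# Hypothetical basis: Gaussians normalised so that each |phi_i|^2 integrates to 1.
exponents = np.array([0.5, 1.0, 2.0])
phi = (2.0 * exponents[:, None] / np.pi) ** 0.25 * np.exp(-exponents[:, None] * x[None, :] ** 2)

# Generating diagonal elements d_i, normalised so that sum |d_i|^2 = 1.
d = np.array([1 + 1j, 0.5, -2j])
d = d / np.linalg.norm(d)
omega = np.abs(d) ** 2                      # convex coefficients

# Diagonal of Psi = D Phi at every grid point, then the trace of |Psi|^2.
psi = d[:, None] * phi
rho = (np.abs(psi) ** 2).sum(axis=0)        # ASA-like density function

norm = rho.sum() * dx                       # should be close to 1
x2_from_rho = (x ** 2 * rho).sum() * dx     # <x^2> from the density itself
x2_from_terms = sum(omega[i] * (x ** 2 * np.abs(phi[i]) ** 2).sum() * dx
                    for i in range(len(omega)))   # weighted sum over basis terms
```

For this basis the exact value of ⟨x²⟩ is the weighted sum of 1/(4a_i) over the exponents a_i, which gives an independent check on both routes to the expectation value.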


V. CONCLUSIONS

A general framework, where quantum objects can be described in a systematic way, has been constructed. The concept of density function tagged set encompasses an earlier generalization, the Boolean tagged sets, proposed as a sound substitute for fuzzy set definitions in the description of molecular structures. At the same time, the definition of quantum-mechanical density functions has been used to highlight their essential positive definite nature. This fundamental property of density functions, often forgotten in the current literature, has also been used to connect quantum similarity measures, a simple concept that compares two or more quantum objects, with the spaces containing positive definite operators. Vector semispaces and the more conventional convex set algebra have been put into the context of the computation of approximate density functions, as in the ASA framework. This kind of computational algorithmic experience has been extended to positive definite operators and their matrix representations, the similarity matrices, from the point of view of quantum similarity measures. Positive definite operators can be used to construct a convex set of new positive definite operators, and consequently their matrix representations remain positive definite. In this way, a new window is opened to obtain discrete, finely tuned molecular descriptors in the form of positive definite vectors belonging to n-dimensional vector semispaces. The utility of the presented theoretical results in the context of quantitative structure-activity relationships is but one of the vast prospective application fields. The quantum chemical, statistically coherent significance of the expansion coefficients in ASA-like DF forms has been shown: satisfying the convex conditions, they can be considered a discrete probability distribution.
Moreover, there is apparently no problem in using the fitted atomic densities to obtain expectation values of quantum chemical operators. The formalism, based on DVS and TS, thus becomes a fruitful tool on which further work can be founded.

ACKNOWLEDGMENTS

This work was partially financed by CICYT Research Project SAP 96-0158. Professors J. Karwowski and P. G. Mezey are thanked for lively debates on the subject of fuzzy and tagged sets, and Dr. E. Besalu for constructive criticism and advice. The author warmly thanks Mr. Ll. Amat for stimulating conversations on ASA and QSAR, which led to various Fortran 90 implementations, as well as for patiently performing preliminary calculation tests on some tuned QSAR problems. Enlightening, informal discussions with Dr. J. Mestres took place in previous stages of this work.



REFERENCES

1. Zadeh, L. A. Inf. Control 1965, 8, 338.
2. Trillas, E.; Alsina, C.; Terricabras, J. M. Introduccion a la Logica Difusa; Ariel Matematica: Barcelona, 1995.
3. Carbó, R., Ed. Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Kluwer: Dordrecht, 1995.
4. Carbó-Dorca, R.; Mezey, P. G., Eds. Advances in Molecular Similarity, Vol. 1; JAI Press: Greenwich, CT, 1996.
5. Carbó, R.; Calabuig, B.; Vera, L.; Besalu, E. Adv. Quantum Chem. 1994, 25, 253-313.
6. Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1681-1709.
7. Carbó, R.; Besalu, E. In Molecular Similarity and Reactivity: From Quantum Chemistry to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 3-30.
8. Carbó, R.; Arnau, M.; Leyda, L. Int. J. Quantum Chem. 1980, 17, 1185-1189.
9. Stoer, J.; Witzgall, C. Die Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen, Vol. 163; Springer-Verlag: Berlin, 1970.
10. See for example: (a) Carbó, R.; Calabuig, B. Comput. Phys. Commun. 1989, 55, 117-126. (b) Carbó, R.; Calabuig, B. J. Mol. Struct. (Theochem) 1992, 254, 517-531. (c) Carbó, R.; Calabuig, B. In Computational Chemistry: Structure, Interactions and Reactivity; Fraga, S., Ed.; Elsevier: Amsterdam, 1992, Vol. A, pp. 300-324.
11. See for example: (a) Löwdin, P. O. Phys. Rev. 1955, 97, 1474-1489. (b) McWeeny, R. Rev. Mod. Phys. 1960, 32, 335-369.
12. Carbó-Dorca, R. Fuzzy Sets and Boolean Tagged Sets; Technical Report IT-IQC-5-97, see also: J. Math. Chem. 1997, 22, 143-147.
13. Carbó, R.; Calabuig, B.; Besalu, E.; Martinez, A. Mol. Eng. 1992, 2, 43-64.
14. Carbó, R.; Calabuig, B. J. Chem. Inf. Comput. Sci. 1992, 32, 600-606.
15. Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987.
16. See for example: (a) Constans, P.; Carbó, R. J. Chem. Inf. Comput. Sci. 1995, 35, 1046-1053. (b) Constans, P.; Amat, L.; Fradera, X.; Carbó-Dorca, R. In Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, pp. 187-211. (c) Amat, L.; Carbó, R.; Constans, P. Sci. Gerundensis 1996, 22, 109-121. (d) Amat, L.; Carbó-Dorca, R. QSM and Expectation Values under ASA: First Order Density Fitting Using EJR; Technical Report IT-IQC-2-97, see also: J. Comp. Chem. 1997, 18, 2023-2039.
17. Jacobi, C. G. J. J. Reine Angew. Math. 1846, 30, 51-94.
18. Constans, P.; Amat, L.; Carbó-Dorca, R. J. Comput. Chem. 1997, 18, 826-846.
19. Carbó, R.; Besalu, E.; Amat, L.; Fradera, X. J. Math. Chem. 1995, 18, 237-246.
20. See for example: (a) Fradera, X.; Amat, L.; Besalu, E.; Carbó-Dorca, R. Quant. Struct.-Act. Relat. 1997, 16, 25-32. (b) Lobato, M.; Amat, L.; Besalu, E.; Carbó-Dorca, R. Estudi QSAR d'una familia de Quinolones; Technical Report IT-IQC-4-97. (c) Lobato, M.; Amat, L.; Besalu, E.; Carbó-Dorca, R. Structure-Activity Relationship of a Steroid Family using QSM and Topological QS Indices; Technical Report IT-IQC-8-97, see also: Quant. Struct.-Act. Relat. 1997, 16, 465-472.
21. Carbó, R.; Besalu, E. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 3-30.
22. Carbó-Dorca, R. Tagged Sets, Convex Sets and Quantum Similarity Measures; Technical Report IT-IQC-9-97, see also: J. Math. Chem. 1998, 23, 353-364.
23. Berberian, S. K. Introduccion al Espacio de Hilbert; Editorial Teide: Barcelona, 1970.
24. Von Neumann, J. Mathematical Foundations of Quantum Mechanics; Princeton University Press: Princeton, NJ, 1955.
25. Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987, Vol. 8, p. 249.
26. See for example: (a) Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987, Vol. 5, p. 126. (b) Zemanian, A. H. Generalized Integral Transformations; Dover Publications: New York, 1987.
27. Carbó, R.; Besalu, E. J. Math. Chem. 1995, 18, 37-72.
28. Carbó-Dorca, R. On the Statistical Interpretation of Density Functions: ASA, Convex Sets, Discrete Quantum Chemical Molecular Representations, Diagonal Vector Spaces and Related Problems; Technical Report IT-IQC-10-97, see also: J. Math. Chem. 1998, 23, 365-375.

PATTERN RECOGNITION TECHNIQUES IN MOLECULAR SIMILARITY

W. Graham Richards and Daniel D. Robinson

Abstract
I. Introduction
II. Two-Dimensional Representations
III. Alignment
IV. Conclusion
Acknowledgment
References

ABSTRACT The speedup in molecular similarity calculations needed to cope with libraries of tens of thousands of compounds is achievable if we adopt techniques from pattern recognition and start with two-dimensional representations derived by nonlinear mapping of the three-dimensional distance matrices. Here we describe the use of invariant moments in this respect.



I. INTRODUCTION

Molecular similarity has proved to be a useful tool since its introduction by Carbó et al.^1 and later extensions to the use of molecular electrostatic potential^2 and shape^3 as descriptors. In particular, the use of similarity matrices, where every member of a series of compounds is compared with a lead compound or even with every other member of the series, has been a powerful technique both for quantitative structure-activity studies^4 and as a measure of diversity. This utility was readily apparent when dealing with series of some tens of compounds. Now that combinatorial chemistry is providing actual libraries of tens of thousands of molecules and virtual libraries containing millions of compounds, new problems present themselves. These are difficulties of speed of calculation. If we are to align molecules optimally and then compute similarity, we need to gain an increase in speed of several orders of magnitude. In two-dimensional problems, such as optical character recognition, appropriate speeds can be achieved. Here we show how such pattern recognition approaches can be applied to molecular similarity if we can represent three-dimensional structures in two dimensions.

II. TWO-DIMENSIONAL REPRESENTATIONS

The three-dimensional structure of a molecule may be represented by a distance matrix. We have shown^5,6 that nonlinear mapping permits one to produce two-dimensional representations that retain the majority of the distance information from three dimensions. These figures are suitable material for pattern recognition.

III. ALIGNMENT

There are two aspects to the alignment of two-dimensional figures: putting the centers of the figures at the same point (translational invariance) and rotating to achieve maximal similarity (rotational invariance). Borrowing the technique from pattern recognition, Hu's method^7 of invariant moments may achieve both translational and rotational alignment. The method is based on a statistical analysis of the distribution and values of the property ρ to be compared. The moments of this distribution take the form

\[ m_{pq} = \iint x^{p} y^{q} \rho(x, y)\, dx\, dy \quad \text{for continuous data} \]

or

\[ m_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} \rho(x, y) \quad \text{for discrete data} \]


Now, from a uniqueness theorem due to Papoulis,^8 provided that ρ(x, y) is continuous and has nonzero values only in a finite part of the x, y plane, moments of all orders exist and the sequence of moments m_pq is uniquely determined by ρ(x, y). Conversely, it can be shown that the infinite sequence of m_pq uniquely determines ρ(x, y). Inherent in the equations for generating the moments is the assumption that the property ρ(x, y) is centered on the calculation coordinates. This may not be the case. However, let us define two parameters as follows:

\[ \bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}} \]

These two parameters clearly give the center of the property ρ(x, y) in the calculation coordinate system. These have been shown to be consistently placed for all reasonably comparable systems. Then let us define the central moment μ_pq by

\[ \mu_{pq} = \iint (x - \bar{x})^{p} (y - \bar{y})^{q} \rho(x, y)\, dx\, dy \quad \text{for continuous data} \]

or

\[ \mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} \rho(x, y) \quad \text{for discrete data} \]

These central moments have the desired invariance to translation. We have seen how aligning a structure along its principal axes enables us to remove any unwanted rotation. In the case of the moments, Hu utilizes this property of the principal axes to rotate ρ(x, y) so that the property's distribution is aligned along the calculation axes. He does this by forming a variety of combinations of the three second-order central moments and the four third-order central moments. In his paper Hu explains these combinations, which are able to distinguish between structures that exhibit mirror and rotational symmetry:

\[
\begin{aligned}
\eta_0 &= \mu_{20} + \mu_{02} \\
\eta_1 &= (\mu_{20} - \mu_{02})^{2} + 4\mu_{11}^{2} \\
\eta_2 &= (\mu_{30} - 3\mu_{12})^{2} + (3\mu_{21} - \mu_{03})^{2} \\
\eta_3 &= (\mu_{30} + \mu_{12})^{2} + (\mu_{21} + \mu_{03})^{2} \\
\eta_4 &= (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^{2} - 3(\mu_{21} + \mu_{03})^{2}\right] \\
&\quad + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^{2} - (\mu_{21} + \mu_{03})^{2}\right] \\
\eta_5 &= (\mu_{20} - \mu_{02})\left[(\mu_{30} + \mu_{12})^{2} - (\mu_{21} + \mu_{03})^{2}\right] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03}) \\
\eta_6 &= (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^{2} - 3(\mu_{21} + \mu_{03})^{2}\right] \\
&\quad + (3\mu_{12} - \mu_{30})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^{2} - (\mu_{21} + \mu_{03})^{2}\right]
\end{aligned}
\]

These seven invariant central moments are all that is required to gain a high degree of sensitivity in pattern recognition. Indeed, in his original paper Hu used only the first two invariant central moments to implement a crude character recognition system which by all accounts worked remarkably well. Utilizing the invariant central moments is fairly straightforward. All we have to do is to project our molecular property onto a grid and run through the calculations detailed above. This gives us a seven-dimensional vector η_{A,2D} which represents the distribution of ρ_{A,2D}(x, y). Any molecule B to be compared can also be subjected to the same treatment, remembering that we no longer have to bother aligning A and B to get the correct answer. This yields a second seven-element vector η_{B,2D}. The similarity, or more properly in this case the distance between the two molecules, is then given by the Euclidean distance of these two vectors in seven-dimensional space:

\[ d_{AB} = \left[ \sum_{i=0}^{6} \left( \eta_{i}^{A,2D} - \eta_{i}^{B,2D} \right)^{2} \right]^{1/2} \]

Clearly the above equation can be calculated in a fraction of the time required for the Carbó or Hodgkin indices.
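The moment calculation described in this section can be sketched for a property sampled on a rectangular grid. This is an illustrative implementation under stated assumptions, not the authors' code: it uses Hu's standard seven combinations built directly from the discrete central moments, and the function names `hu_invariants` and `moment_distance` are hypothetical.

```python
import numpy as np

def hu_invariants(rho):
    """Seven Hu-type invariants of a 2D property grid, from central moments."""
    ys, xs = np.mgrid[:rho.shape[0], :rho.shape[1]].astype(float)
    m00 = rho.sum()
    xc = xs - (xs * rho).sum() / m00           # coordinates relative to the centre
    yc = ys - (ys * rho).sum() / m00

    def mu(p, q):                               # discrete central moment mu_pq
        return (xc ** p * yc ** q * rho).sum()

    m20, m02, m11 = mu(2, 0), mu(0, 2), mu(1, 1)
    m30, m03, m21, m12 = mu(3, 0), mu(0, 3), mu(2, 1), mu(1, 2)
    a, b = m30 + m12, m21 + m03
    return np.array([
        m20 + m02,
        (m20 - m02) ** 2 + 4.0 * m11 ** 2,
        (m30 - 3.0 * m12) ** 2 + (3.0 * m21 - m03) ** 2,
        a ** 2 + b ** 2,
        (m30 - 3.0 * m12) * a * (a ** 2 - 3.0 * b ** 2)
        + (3.0 * m21 - m03) * b * (3.0 * a ** 2 - b ** 2),
        (m20 - m02) * (a ** 2 - b ** 2) + 4.0 * m11 * a * b,
        (3.0 * m21 - m03) * a * (a ** 2 - 3.0 * b ** 2)
        + (3.0 * m12 - m30) * b * (3.0 * a ** 2 - b ** 2),
    ])

def moment_distance(rho_a, rho_b):
    """Euclidean distance between the two seven-element invariant vectors."""
    return float(np.linalg.norm(hu_invariants(rho_a) - hu_invariants(rho_b)))
```

A grid and a copy of it that has been shifted and rotated by 90 degrees should give the same invariant vector up to rounding, so no explicit alignment step is needed before comparing two molecules. Since the invariants differ by many orders of magnitude, a practical implementation would probably rescale them before taking the distance; that choice is left open here.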

IV. CONCLUSION Using pattern recognition techniques of this type fuels the hope that we might be able to scan a whole library of compounds for examples that are similar to a chosen set of leads or other sources of ideas. At the same time we retain the essentials of the three-dimensional structure which are so important in the binding between a small molecule and its target receptor.

ACKNOWLEDGMENT This work was carried out in part pursuant to a contract with the National Foundation of Cancer Research.

REFERENCES

1. Carbó, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
2. Hodgkin, E. E.; Richards, W. G. Int. J. Quantum Chem. Quantum Biol. Symp. 1987, 14, 105.
3. Meyer, A. M.; Richards, W. G. J. Comput. Aided Mol. Des. 1991, 5, 426.
4. Good, A. C.; So, S.-S.; Richards, W. G. J. Med. Chem. 1993, 36, 433.
5. Barlow, T. W.; Richards, W. G. J. Mol. Graph. 1995, 13, 373.
6. Robinson, D. D.; Barlow, T. W.; Richards, W. G. J. Chem. Inf. Comput. Sci. 1997, 37, 943.
7. Hu, M. K. IRE Trans. Inf. Theory 1962, 179.
8. Papoulis, A. Probability, Random Variables and Stochastic Processes; McGraw-Hill: New York, 1965.


TOPOLOGY AND THE QUANTUM CHEMICAL SHAPE CONCEPT

Paul G. Mezey

I. Introduction
II. Topological Resolution and Molecular Shape
III. Molecular Similarity Measures Based on Topological Resolution of the Shape of Electron Density
IV. Summary
Acknowledgment
References

Advances in Molecular Similarity, Volume 2, pages 79-92. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5

I. INTRODUCTION

The concept of molecular shape is not based on direct observation. The size of most molecules is too small for visual examination; moreover, the wavelength range of visible light is not suitable to provide a detailed enough resolution for molecules. This fact has influenced in a fundamental way the special evolution of the molecular shape concept, since, in the absence of direct observation, the shape of molecules is usually perceived to be that of the molecular models used for their representation. Naturally, the early, somewhat simplistic molecular models, such as the "ball and

stick" or fused sphere "space-filling" models, could reflect only some highly simplified aspects of molecular shape, yet the shapes of these models have been treated by many chemists as if they were the actual shapes of the molecules. It is remarkable that, even today, one of the primary tools for conveying molecular shape information is the "ball and stick"-type stereodiagram used by many chemists. The more realistic, fuzzy, three-dimensional electron density cloud models, already easily calculable by quantum chemistry methods, have only recently started to become an appreciated tool for molecular shape representation.^"^^ Molecular electron densities can be represented by three-dimensional density functions ρ(K, r), where K is a specified nuclear configuration and r is the three-dimensional position variable. With the introduction of the additive fuzzy density fragmentation (AFDF) methods,^^'^^ including the numerical MEDLA (molecular electron density loge assembler) technique^^""*^ and the more advanced analytical ADMA (adjustable density matrix assembler) method,"^^""^^ ab initio quality electron densities ρ(K, r) can be calculated for virtually any molecule of chemical and biochemical importance, even for macromolecules such as proteins."^^"^^ The shapes and similarities of these ρ(K, r) electron density clouds can be analyzed in great detail, and additional properties can be studied, such as the forces acting on the various nuclei in macromolecules,"*^ leading to conformational changes and changes of folding patterns of long, polymeric chain molecules. Electron density clouds are rather fuzzy objects, where any sufficiently detailed graphical representation of the peripheral, low electron density range is likely to hide from view the higher density range closer to the nuclei.
This fact, and the fact that it is difficult to construct any macroscopic model that properly reflects the fuzzy nature of molecules, have apparently contributed to the popularity of the simpler and easily visualizable "ball and stick" and fused sphere "space-filling" models, where molecules appear analogous to macroscopic, classical-mechanistic constructions. Whereas for classical-mechanical objects geometry is the appropriate tool of shape description, for fuzzy, quantum-mechanical molecules geometrical methods are no longer efficient; by contrast, topology appears as an ideal tool of shape description. Topology allows for flexibility in a very natural way, and it also has the capacity to describe quantum-mechanical uncertainty and the associated fuzziness in a natural and intuitively transparent way. During the past decade, topology, specifically algebraic topology, has been advocated as a powerful framework for molecular shape description that is compatible with the fundamental, quantum-mechanical nature of molecules. Among the relevant results, the introduction of the molecular shape group methods (SGM)^"^^ has led both to novel theoretical interpretation of molecular shape properties and to various applications in molecular similarity and complementarity analysis in systematic pharmaceutical drug discovery approaches and in toxicological risk assessment.^^'"*^"^^


In this chapter, some of the more recent advances in the systematization of topological methods of molecular similarity analysis, involving the concept of topological resolution of molecular electron densities ρ(K, r) and their contour surfaces, will be reviewed.

II. TOPOLOGICAL RESOLUTION AND MOLECULAR SHAPE Before discussing the precise formulation of topological resolution and its applications to molecular similarity analysis, I shall briefly review the motivation for the approach and some of the relevant topological concepts. In the study of molecular similarity, structural features of molecules are the most commonly used properties that are analyzed and compared. Depending on the level of detail required, static shape features of molecules can be studied at various levels of resolution; these levels of resolution can be used as a tool for the introduction of various similarity measures.^^ Evidently, if three objects A, B, and C are indistinguishable at some low level of resolution, but at some higher level of resolution A is distinguishable from B and C, but B and C are still indistinguishable, then B and C are more similar to each other than to object A. Evidently, the level of resolution required to distinguish objects can be used as a measure of similarity. Since resolutions in the geometrical sense are easily characterized by numbers, this approach, the resolution-based similarity measures (RBSM) approach, provides numerical similarity measures.^^ Originally, the RBSM approach was formulated in terms of geometrical resolution,^^ for example, by placing objects on a regular, rectangular grid and using the occupied cells of the grid for comparisons. The finer the grid, the finer the resolution, and the grid size required served as a numerical measure of similarity. In this contribution I shall discuss some topological generalization of the idea of using resolution to measure the degree of similarity. The switch to topological resolution is the simplest if one focuses on the static shape features of molecules. 
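The geometrical form of the resolution-based idea can be sketched with point sets on a square grid. This is a toy illustration only: the function names, the use of bare point coordinates as "objects", and the fixed grid origin are assumptions, and the published RBSM formulation may differ in its details.

```python
import math

def occupied_cells(points, cell):
    """Set of grid cells of edge length `cell` that contain at least one point."""
    return {tuple(math.floor(c / cell) for c in p) for p in points}

def resolution_to_distinguish(a, b, cell_sizes):
    """Coarsest cell size at which objects a and b occupy different cell sets.

    Returns None when the two objects look identical at every tested size.
    A smaller returned value means a finer grid is needed, i.e. higher similarity.
    """
    for cell in sorted(cell_sizes, reverse=True):
        if occupied_cells(a, cell) != occupied_cells(b, cell):
            return cell
    return None

# Three hypothetical two-point objects in the plane.
A = [(0.1, 0.1), (0.9, 0.1)]
B = [(0.1, 0.1), (0.9, 0.3)]
C = [(0.1, 0.1), (3.5, 0.1)]
sizes = [4.0, 2.0, 1.0, 0.5, 0.25]
```

In this toy setting A and B can only be told apart on a finer grid than A and C, so A is ranked as more similar to B than to C. The result depends on where the grid origin sits, which is one reason a topological notion of resolution is attractive.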
However, static shape features provide a biased representation of the molecule, and in the evaluation of molecular similarity it is also important to use information on chemical reactions and conformational changes involving molecular interactions. Such interactions are typically determined by local molecular shape properties. In the long-range interactions during the initial stages of a chemical reaction, typically the large-scale features of local molecular moieties are dominant. However, as the reaction progresses, which usually involves a close approach of one molecule by another, details of local shape features become increasingly important. Consequently, the level of resolution required for the analysis of shape features relevant in different stages of molecular interactions changes in the course of the reaction or conformational change induced by one molecule in another. Since the local shape features of both the static and the dynamic representations of molecules are characterized by their topological properties, it is natural to invoke the concept of topological resolution in molecular similarity analysis. In the following paragraphs some of the fundamental concepts of the relevant branches of point set topology are reviewed with special focus on topological resolution, followed by a description of a topological realization of RBSM. Consider a set X, and a family T of subsets T_α of X, X ⊃ T_α, where the members T_α of family T fulfill the following three conditions:

\[ \text{(i)} \quad X, \emptyset \in T \qquad (1) \]

that is, the original set X and the empty set ∅ are included in the family T; furthermore,

\[ \text{(ii)} \quad \bigcup_{\alpha} T_{\alpha} \in T \qquad (2) \]

for any number of sets T_α in the family T, and

\[ \text{(iii)} \quad T_{\alpha} \cap T_{\beta} \in T \qquad (3) \]

for any two sets $T_\alpha, T_\beta \in T$. If properties (i)-(iii) are fulfilled, then the family T is called a topology on set X, and the members $T_\alpha$ of family T are called the T-open sets of set X. The pair (X, T) is called a topological space. The structure of set X provided with a topology T, and various functions defined on X, can be studied in terms of the T-open sets of X. Note that nowhere in the above discussion was it assumed that set X has a geometrical structure, and there is no need even for a distance function for introducing a topology on a set X. Nevertheless, properties (i)-(iii) are precisely the most fundamental properties of open sets in a metric space, for example, in a Euclidean space provided with the ordinary, Pythagorean distance function. In a metric space, open sets are defined in terms of distance: a set Y is open within a metric space if for any point y of Y one can find a ball of some nonzero radius, centered on the point y, such that the entire ball still falls within Y. Invoking balls with some nonzero radius involves the distance function of the metric space, since the ball is the collection of all points whose distance from y is less than the distance chosen as radius. What provides topology with a remarkable versatility is the fact that many of the properties of open sets, and the implied properties of continuous functions defined in terms of these open sets, are fully operational without any reference to distance. This provides a well-controlled flexibility, which is the special hallmark of topology. Furthermore, as is evident from requirements (i)-(iii), on any given set X one can introduce many different topologies. This provides a whole range of possible degrees of "flexibility," an important concern in the chemistry of actual, quantum-mechanical, nonrigid molecules. Since there are many different ways a topology T can be chosen, in the study of the topological properties of any object X (for example, a molecule X), one must specify the actual topology T used.

The following relations are crucial for the introduction of the concept of topological resolution. Consider a set X, and assume that for two topologies $T_1$ and $T_2$ on X the following relation holds: every $T_1$-open subset of X is also a $T_2$-open set. This implies that $T_1$ is a subfamily of $T_2$,

$T_2 \supset T_1$  (4)
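For a finite set X, conditions (i)-(iii) and the subfamily relation 4 can be checked mechanically. The following is a minimal sketch, assuming finite sets only (where closure under pairwise unions already gives closure under all unions); the function names `is_topology` and `is_coarser` and the example families are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def is_topology(X, T):
    """Check conditions (i)-(iii) for a family T of subsets of a finite set X."""
    X = frozenset(X)
    T = {frozenset(t) for t in T}
    # (i) X and the empty set must belong to T.
    if X not in T or frozenset() not in T:
        return False
    # Every member must be a subset of X.
    if not all(t <= X for t in T):
        return False
    # (ii) and (iii): for a finite family, pairwise closure is sufficient.
    for a, b in combinations(T, 2):
        if a | b not in T or a & b not in T:
            return False
    return True

def is_coarser(T1, T2):
    """True if every T1-open set is also T2-open, i.e. relation 4 holds."""
    return {frozenset(t) for t in T1} <= {frozenset(t) for t in T2}

X = {1, 2, 3}
T1 = [set(), {1}, X]
T2 = [set(), {1}, {1, 2}, X]
T3 = [set(), {2}, X]
not_a_topology = [set(), {1}, {2}, X]  # the union {1, 2} is missing

print(is_topology(X, T1), is_topology(X, T2), is_topology(X, not_a_topology))
print(is_coarser(T1, T2), is_coarser(T1, T3), is_coarser(T3, T1))
```

Here T1 is coarser than T2, whereas T1 and T3 are not related by relation 4 in either direction.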

If relation 4 holds, then topology $T_1$ is said to be coarser (or weaker) than topology $T_2$, or one can say that topology $T_2$ is finer (or stronger) than topology $T_1$. Two topologies on a set X do not always relate to one another in such a clear-cut fashion; in fact, a relation such as 4 above is rather special. Two topologies are regarded as incomparable if neither is finer than the other. The finer-coarser relation between topologies on a given set X gives only a partial ordering of the set of all topologies on the set X. Exploiting this partial order, the interrelations among topologies on a given set X can be studied using lattice theory. The detailed analysis of a topology, as well as the construction of new topologies on a given set X, can be carried out using the concepts of base and subbase of a topology T. Consider a subfamily B of family T:

$T \supset B$  (5)

This subfamily B is a base for topology T on X if and only if every T-open set $G \in T$ is a union of some sets in B. Consider a subfamily S of family T:

$T \supset S$  (6)

This subfamily S is a subbase for topology T on X if and only if finite intersections of elements of S form a base for T. The special role of subbases is illustrated by the fact that they can be used to define topologies. In such cases we refer to subfamily S as a defining subbase. By choosing a family of subsets of X as a subbase S, and generating a base B by the above recipe (of generating all finite intersections), one can indeed generate a family T that fulfills all three conditions (i)-(iii). Note that a special finite intersection, the empty intersection of subsets of space X, is the full space X; consequently, X is automatically included in the base B generated by this recipe, and hence X is also included in the family T. The empty union of sets from the base B is the empty set $\emptyset$, hence the empty set $\emptyset$ is automatically a member of the family T generated by this recipe. The subbase-base approach provides a very versatile method for generating topologies. Also note that, if for two generating subbases $S_1$ and $S_2$ the relation

$S_2 \supset S_1$  (7)

holds, then the corresponding topologies are necessarily comparable, and topology $T_2$ is finer than topology $T_1$,

$T_2 \supset T_1$  (8)
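The subbase-base recipe, and the implication from relation 7 to relation 8, can be made concrete for a finite set. The following sketch assumes a finite underlying set; the function name `topology_from_subbase` and the example subbases are hypothetical.

```python
from itertools import combinations

def topology_from_subbase(X, subbase):
    """Generate the topology defined by a subbase on a finite set X."""
    X = frozenset(X)
    S = [frozenset(s) for s in subbase]
    # Base: all finite intersections; the empty intersection is X itself.
    base = {X}
    for r in range(1, len(S) + 1):
        for combo in combinations(S, r):
            base.add(frozenset.intersection(*combo))
    # Topology: all unions of base sets; the empty union is the empty set.
    topology = {frozenset()}
    for b in base:
        topology |= {t | b for t in topology}
    return topology

X = {1, 2, 3, 4}
S1 = [{1, 2}]
S2 = [{1, 2}, {2, 3}]  # S2 contains S1, as in relation 7

T1 = topology_from_subbase(X, S1)
T2 = topology_from_subbase(X, S2)
print(sorted(sorted(t) for t in T1))
print(sorted(sorted(t) for t in T2))
print(T1 <= T2)  # relation 8: T2 is finer than T1
```

Enlarging the subbase can only add open sets, which is why the resulting topologies are automatically comparable.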

Consider a set X and a family $\mathbf{T}$ of topologies $T_i$ on X, where these topologies $T_i$ are fully ordered by the finer-cruder relation:

$\mathbf{T} = \{T_1, T_2, \ldots, T_i, \ldots\}$  (9)

where

$T_{i+1} \supset T_i$  (10)

for every index i for which $T_{i+1}$ is included in the family $\mathbf{T}$ of topologies. We shall also use the notation

$(X, T_{i+1}) \supset (X, T_i)$  (11)

to express the same fact in terms of topological spaces, if the specification of the underlying space X is important. If relation 11 holds, we say that the topological space $(X, T_{i+1})$ is of higher topological resolution than the topological space $(X, T_i)$. Topological resolution is defined in terms of the finer-cruder relations. In comparison 11, the former topological space, $(X, T_{i+1})$, provides a more detailed topological description of the underlying space X than the topological space $(X, T_i)$. In particular, a function f that maps X onto itself may be $T_i$-continuous but not $T_{i+1}$-continuous (note that a function is continuous if and only if the inverse image of every open set is open, where openness is interpreted within the actual topologies used). A topological description with a higher level of topological resolution provides more information than one at a lower level of topological resolution.

We consider the three-dimensional, fuzzy electron densities of molecules embedded in the ordinary 3D Euclidean space $E^3$. The shape analysis of these electron densities, leading to the determination of the algebraic-topological shape groups, has been reviewed extensively, and only a brief summary will be given here. For each nuclear arrangement K, the molecular electron density function $\rho(K, r)$ can be represented by an infinite family of molecular isodensity contour (MIDCO) surfaces G(K, a), where the density threshold a can take values from the $[0, \infty)$ interval. Each MIDCO is defined as a set

$G(K, a) = \{ r : \rho(K, r) = a \}$  (12)

that is, a surface with a specific shape for each nuclear configuration K and each value of the electron density threshold a along the MIDCO.



The local shape of each MIDCO G(K, a) is tested against a range of reference curvatures b. For each reference curvature value b, the points r along each MIDCO G(K, a) are classified according to the local curvatures, as being a point where the contour surface is either

1. Locally convex relative to b (r belonging to a domain of type $D_2(b)$),
2. Locally of the saddle type relative to reference curvature b (r belonging to a domain of type $D_1(b)$), or
3. Locally concave relative to b (r belonging to a domain of type $D_0(b)$).

For each MIDCO G(K, a), the domains $D_\mu(b)$ for various curvature types $\mu$ with reference to a curvature value b generate a pattern P(K, a, b) of domains on the MIDCO G(K, a). These patterns P(K, a, b) can be analyzed by topological methods, leading to a description of the interrelations among the domains within each topologically distinct pattern P(K, a, b). Whereas no actual construction of new objects is needed, it is useful to picture the process of focusing on a given curvature type within the actual pattern P(K, a, b) by assuming that the corresponding domain is excised from the MIDCO surface G(K, a). For each MIDCO G(K, a), the domains $D_\mu(b)$ of a specified curvature type $\mu$ with reference to a curvature value b are removed from the MIDCO, leading to a truncated object $G_\mu(K, a, b)$ with a certain set of holes. The algebraic-topological homology groups of the truncated MIDCO $G_\mu(K, a, b)$ are denoted by $H^k_\mu(K, a, b)$, and are, by definition, the shape groups of the molecule. There are three families of these groups, the zero-, one-, and two-dimensional shape groups, where in the notation $H^k_\mu(K, a, b)$ the dimension k, the truncation type $\mu$, the nuclear configuration K, the density threshold a, and the reference curvature b are all specified. Note that some of these specifications are often omitted if they are evident from the context.
The shape groups $H^k_\mu(K, a, b)$ are invariant within small intervals of the threshold values a for electron densities, also within small intervals of the reference curvature b, and also for some small molecular deformations changing the nuclear arrangement K. Note that, for a molecule of N nuclei (N > 3), the family of accessible nuclear configurations K forms a subset of the nuclear configuration space M, where M is a metric space of 3N-6 dimensions. The local invariances of the shape groups $H^k_\mu(K, a, b)$ within the parameter plane (a, b) spanned by the density and curvature thresholds a and b, and within domains of the nuclear configuration space M, imply that there are only a finite number of shape groups for each molecule. Usually a separate shape group analysis is performed for each specified nuclear configuration K of interest. The finite number of shape groups $H^k_\mu(K, a, b)$ within the parameter plane (a, b) can be characterized by their ranks, called Betti numbers, providing a numerical shape code for each conformation K of each molecule. In some sense, the topological shape group approach represents a reduction of the information content of a three-dimensional continuum of a fuzzy electron density


cloud $\rho(K, r)$ to a set of discrete Betti numbers, in a process that retains the essential shape information about molecules. The shape codes provide a concise representation of shape information. These shape codes can be compared numerically, in a process that is much simpler than the direct comparison of molecular electron densities. Note that in direct density comparisons the mutual orientation of the molecules must be optimized in the initial step of shape comparison. No such optimum superposition is needed when evaluating the similarity of molecules based on their shape codes. These direct, numerical comparisons of the topological shape codes are used to compute numerical shape similarity measures and complementarity measures for the fuzzy electron density clouds of molecules.

The shape group approach provides a detailed shape description and shape comparison. In most instances, such a detailed shape description is required to detect and interpret the shape features of the electron density $\rho(K, r)$ relevant to a given chemical problem. However, in some cases, one does not need a complete shape analysis of the electron density $\rho(K, r)$, and the focus can be shifted from the details to some of the more prominent shape features. Furthermore, for large molecules, the large amount of detail obtained in a complete shape group analysis of the entire electron density $\rho(K, r)$ may render the computational task and the interpretation of the results cumbersome. In such cases, it is warranted to use alternative shape characterization methods where the level of detail studied can be appropriately modified. A natural condition that can be used to control the amount of detail is the level of resolution. In the next section, a new set of molecular similarity measures will be discussed, based on the concept of topological resolution of fuzzy electron density clouds $\rho(K, r)$.
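The numerical comparison of shape codes mentioned above can be illustrated as follows. The text does not specify the comparison formula, so this sketch assumes, purely for illustration, that a shape code is a table of Betti-number tuples sampled on a common (a, b) grid and that similarity is the fraction of grid entries on which two codes agree. The function name `shape_code_similarity` and the Betti numbers shown are hypothetical.

```python
def shape_code_similarity(code_a, code_b):
    """Fraction of (a, b) grid entries at which two shape codes agree.

    Each code maps a grid index (i_a, i_b) to a tuple of Betti numbers.
    This matching-fraction measure is an assumption made for illustration.
    """
    if not code_a or code_a.keys() != code_b.keys():
        raise ValueError("shape codes must share the same non-empty (a, b) grid")
    matches = sum(1 for key in code_a if code_a[key] == code_b[key])
    return matches / len(code_a)

# Hypothetical shape codes on a 2 x 2 grid of (a, b) values.
code_1 = {(0, 0): (1, 0, 1), (0, 1): (1, 2, 1), (1, 0): (1, 0, 1), (1, 1): (2, 1, 0)}
code_2 = {(0, 0): (1, 0, 1), (0, 1): (1, 2, 1), (1, 0): (1, 1, 1), (1, 1): (2, 1, 0)}

print(shape_code_similarity(code_1, code_2))
```

Note that no superposition of the two molecules enters the comparison; only the discrete codes are used.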

III. MOLECULAR SIMILARITY MEASURES BASED ON TOPOLOGICAL RESOLUTION OF THE SHAPE OF ELECTRON DENSITY

Different ranges of the electron density threshold parameter a and of the reference curvature b provide a natural approach to shifting the emphasis between the local details and the large-scale features of the shapes of molecular electron densities $\rho(K, r)$. For example, the MIDCO surfaces G(K, a) corresponding to low electron density thresholds a usually exhibit less detail than the high-density contours G(K, a) running closer to the atomic nuclei in the molecule. However, for our present analysis, the role of range selection of the reference curvature parameter b is more important than the choice of density threshold a. When considering a reference curvature b of high negative value, one finds that at most points r along a MIDCO G(K, a), even for MIDCOs with high values of the electron density threshold a, both of the local canonical curvatures of the surface (that is, both eigenvalues of the local Hessian matrix expressing the local curvatures at point r) are greater than the reference curvature b. Consequently, most if not all



points r of the MIDCO G(K, a) belong to a curvature domain of type $D_0(b)$ of relative concavity with reference to the curvature b. This, in turn, implies that a truncation of type $\mu = 2$ eliminates only a few domains or no domain at all from the MIDCO G(K, a). For example, if the local canonical curvatures at all points r of the MIDCO G(K, a) are greater than the reference curvature b, then no truncation occurs, resulting in the coincidence

$G_2(K, a, b) = G(K, a)$  (13)

and in the associated trivial group as shape group $H^1_2(K, a, b)$,

$H^1_2(K, a, b) = \{0\}$  (14)
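The classification of a surface point into a domain $D_\mu(b)$ can be sketched in a few lines. The sketch assumes the convention suggested by the passage above: $\mu$ counts how many of the two local canonical curvatures (eigenvalues of the local 2 x 2 Hessian) lie below the reference curvature b, so that $\mu = 0$ corresponds to $D_0(b)$ and $\mu = 2$ to $D_2(b)$. The function names and the treatment of exact ties (counted as not below b) are assumptions made for illustration.

```python
import math

def canonical_curvatures(h11, h12, h22):
    """Eigenvalues of the symmetric 2x2 local Hessian [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius, mean + radius

def curvature_type(h11, h12, h22, b):
    """Return mu in {0, 1, 2}: the number of canonical curvatures below b.

    Assumed convention: mu = 0 -> D0(b), mu = 1 -> D1(b), mu = 2 -> D2(b).
    """
    return sum(1 for k in canonical_curvatures(h11, h12, h22) if k < b)

# Hypothetical local Hessian with canonical curvatures -1.0 and -0.5.
for b in (-2.0, -0.75, 0.0):
    print(b, curvature_type(-1.0, 0.0, -0.5, b))
```

For a strongly negative b the point falls in $D_0(b)$, and as b is raised the same point passes through $D_1(b)$ into $D_2(b)$.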

If the value of the reference curvature is gradually increased, eventually more and more local canonical curvatures fall below the value b, and more domains become subject to elimination from the MIDCO G(K, a), resulting in topologically distinct objects,

$G_2(K, a, b) \neq G(K, a)$  (15)

as well as in nontrivial groups as shape groups $H^1_2(K, a, b)$. In fact, more detail of the shape properties of the MIDCO G(K, a) of the molecular electron density function $\rho(K, r)$ becomes accessible. It is possible to consider all of the topological changes of the truncated objects $G_2(K, a, b)$ as the value of the reference curvature b is increased, and this approach has merits if the goal is to detect the actual threshold values for b where the topological change occurs. However, it is computationally simpler if one focuses on a finite series of selected b values:

$b_1, b_2, \ldots, b_i, \ldots, b_m$  (16)

where

$b_i < b_{i+1}$  (17)

and determines the truncated MIDCO $G_2(K, a, b_i)$ for each $b_i$ value. This leads to the series of truncated MIDCOs

$G_2(K, a, b_1), G_2(K, a, b_2), \ldots, G_2(K, a, b_i), \ldots, G_2(K, a, b_m)$  (18)

and, if the focus is not restricted to the curvature domains $D_\mu(b)$ of type $\mu = 2$, to the associated pattern series

$P(K, a, b_1), P(K, a, b_2), \ldots, P(K, a, b_i), \ldots, P(K, a, b_m)$  (19)

Note that all of these patterns $P(K, a, b_i)$ are generated on the same MIDCO G(K, a). Consider now the combined pattern

$P(K, a, b_1 \ldots b_i)$  (20)

obtained by superimposing all patterns on G(K, a) involving the subsequence up to and including index i:

$P(K, a, b_1), P(K, a, b_2), \ldots, P(K, a, b_i)$  (21)

The set of $D_\mu(b_i)$ domains within each pattern $P(K, a, b_i)$ can be regarded as a defining subbase $S(K, a, b_i)$ for a particular topology $T(K, a, b_i)$ on the MIDCO G(K, a),

$S(K, a, b_i) = \{D_\mu(b_i)\}$  (22)

each generating a formal topological space

$(G(K, a), T(K, a, b_i))$  (23)

The superposition of patterns on the same MIDCO G(K, a) generates finite intersections of domains belonging to various reference curvatures $b_j$. Consequently, the subbase $S'(K, a, b_1 \ldots b_i)$ taken as the set of all such intersections of domains $D_\mu(b_j)$ for j = 1, 2, ..., i generates a topology T' that is the same as the topology $T(K, a, b_1 \ldots b_i)$ obtained with a subbase $S(K, a, b_1 \ldots b_i)$ defined as the union of subbases $S(K, a, b_1), S(K, a, b_2), \ldots, S(K, a, b_i)$:

$S(K, a, b_1 \ldots b_i) = \bigcup_{j=1}^{i} S(K, a, b_j)$  (24)

where

$S'(K, a, b_1 \ldots b_i) \supset S(K, a, b_1 \ldots b_i)$  (25)

hence

$T' = T(K, a, b_1 \ldots b_i)$  (26)

Note that for the defining subbases $S(K, a, b_1 \ldots b_i)$ of these topologies $T(K, a, b_1 \ldots b_i)$ the relation

$S(K, a, b_1 \ldots b_{i+1}) \supset S(K, a, b_1 \ldots b_i)$  (27)

must hold for any choice of index i, $1 \le i \le m - 1$, as a consequence of the definition of subbase $S(K, a, b_1 \ldots b_i)$ as the union (24) of all of the individual subbases $S(K, a, b_k)$ of indices k up to and including the index i. Consequently, the corresponding topologies $T(K, a, b_1 \ldots b_i)$ are also fully ordered,

$T(K, a, b_1 \ldots b_m) \supset T(K, a, b_1 \ldots b_{m-1}) \supset \cdots \supset T(K, a, b_1 \ldots b_i) \supset \cdots \supset T(K, a, b_1)$  (28)


This complete ordering implies that these topologies $T(K, a, b_1 \ldots b_i)$ provide a monotonic series suitable for a systematic adaptation of the techniques of topological resolution for the shape characterization of the MIDCO G(K, a), using gradually increasing topological resolution as index i sweeps over the interval [1, m]. Based on these topologies $T(K, a, b_1 \ldots b_i)$, an RBSM, or more precisely, a topological RBSM (in short, TRBSM), can be constructed for molecules, as follows. Take two molecules, $M_1$ and $M_2$, of nuclear configurations $K_1$ and $K_2$, respectively, and consider two respective density thresholds $a_1$ and $a_2$, and the associated two MIDCOs $G(K_1, a_1)$ and $G(K_2, a_2)$. Generate the series of topologies $T(K_1, a_1, b_1 \ldots b_i)$ and $T(K_2, a_2, b_1 \ldots b_i)$, respectively, for the range $1 \le i \le m$ of indices i. We say that the two MIDCOs $G(K_1, a_1)$ and $G(K_2, a_2)$ have equivalent shapes at the level i of topological resolution if and only if there is a one-to-one and onto correspondence between the two defining subbases $S(K_1, a_1, b_1 \ldots b_i)$ and $S(K_2, a_2, b_1 \ldots b_i)$ that preserves the curvature index $\mu$ assignment of each element of the subbase for each sublevel k, ...

... and references cited therein.
7. Baumer, L.; Sello, G. J. Chem. Inf. Comput. Sci. 1992, 32, 125-130.
8. (a) Baumer, L.; Sala, G.; Sello, G. Tetrahedron Comput. Method 1989, 2, 37-46. (b) Baumer, L.; Sala, G.; Sello, G. ibid. 1989, 2, 93-103. (c) Baumer, L.; Sala, G.; Sello, G. ibid. 1989, 2, 105-118.
9. $\mu = (dE/dN)_Z$.
10. Sello, G. Theochem 1995, 340, 15-28.
11. Sello, G.; Termini, M. Unpublished results.
12. Gordy, W.; Thomas, W. J. O. J. Chem. Phys. 1956, 24, 439-444.
13. Pritchard, H. O.; Skinner, H. A. Chem. Rev. 1955, 55, 745-786.
14. Pauling, L. The Nature of the Chemical Bond, 3rd ed.; Cornell University Press: Ithaca, 1960.
15. Molecular Advanced Design; Aquitaine Systemes: Paris, Version 2, 1990. The standard conformation used by the builder is α-helix, which is the most representative for mRNA and tRNA.
16. A fast search for rotational minima has been performed using a Monte Carlo-Metropolis algorithm.
17. The 2D calculations also feel the effect of the geometry because the bond lengths are different.
18. The full set of the calculated EDs is available as supplementary material.
19. To demonstrate the sensitivity of the method to the atom position in space, we considered a last case concerning a triplet (ACA) where one base (the 5' P) has been rotated by 90° (Table 8). Both the triplet and the base to rotate were chosen arbitrarily. The variations of the ED values are quite large, as expected.
20. GT* is a three-point pairing whose index is considerably lower than the comparable GC index.

Transferability of Similarity Calculations


21. In similarity evaluation the possibility of having very accurate calculations could have a dangerous side effect. In fact, the more precise the model is, the less extensible is its applicability. This situation is not surprising because similarity can be seen as a fuzzy property.
22. The sensitivity of the measuring method does not exclude the possibility that, in some cases, the result will depend on the approximation level used.

SUPPLEMENTARY MATERIAL (Values calculated for the complete set of 64 base triplets.)

Values of Energy Differences of Base Triplets Calculated in 3D with SP Residue and Planar Nitrogen Atom^

AAA^

AAC^

AAG^

AAU^

ACA^

ACC^

ACG^

ACU^

1 2 3 4 5

0.261 3.694 3.789 3.744

0.260 3.693 3.789 3.744

0.262 3.694 3.789 3.740

0.260 3.693 3.788 3.743

0.270 3.696 3.793 3.744

0.268 3.693 3.792 3.742

0.270 3.695 3.795 3.743

0.269 3.693 3.792 3.740

1 2 3 4 5

0.416 3.687 3.788 3.735

0.419 3.688 3.791 3.734

0.457 3.683 3.792 3.733

0.448 3.684 3.794 3.731

0.413 2.810 2.836 2.919 2.732

0.420 2.810 2.841 2.917 2.751

0.457 2.804 2.840 2.919 2.730

0.460 2.803 2.844 2.915 2.752

1 2 3 4 5

0.437 3.690 3.785 3.735

0.421 2.817 2.836 2.900 2.741

2.787 2.912 0.329 2.814 0.487

2.785 2.914 0.336 2.924 2.727

0.418 3.690 3.785 3.734

0.413 2.818 2.828 2.922 2.728

2.776 2.916 0.321 2.812 0.494

2.765 2.918 0.338 2.926 2.713

Notes: Atom numbering refers to Figure 3. Triplets are indicated by the first letter of the names of the corresponding bases. Neutral triplets.

Atom^

AGA^

AGC^

AGG^

AGU^

AUA^

AUC^

AUG^

AUU^

1 2 3 4 5

0.374 3.692 3.793 3.742

0.377 3.691 3.794 3.742

0.379 3.691 3.794 3.740

0.383 3.690 3.793 3.742

0.381 3.694 3.797 3.743

0.381 3.693 3.797 3.743

0.387 3.692 3.798 3.741

0.386 3.690 3.796 3.739

1 2 3 4 5

2.734 2.915 0.322 2.812 0.481

2.745 2.915 0.328 2.811 0.491

2.755 2.909 0.330 2.810 0.485

2.763 2.909 0.330 2.808 0.492

2.721 2.917 0.329 2.924 2.718

2.727 2.917 0.329 2.924 2.718

2.744 2.910 0.340 2.922 2.716

2.749 2.910 0.339 2.919 2.738

1 2 3 4 5

0.468 3.689 3.793 3.734

0.444 2.814 2.843 2.919 2.749

2.818 2.906 0.341 2.811 0.494

2.813 2.909 0.347 2.923 2.734

0.445 3.687 3.795 3.733

0.441 2.812 2.838 2.919 2.739

2.817 2.907 0.335 2.808 0.501

2.805 2.908 0.353 2.922 2.725


GUIDO SELLO and MANUELA TERMINI

Atom^

CAA^

CAC^

CAG^

CAU^

CCA^

CCC^

CCG^

CCU^

1 2 3 4 5

0.288 2.818 2.842 2.917 2.770

0.287 2.818 2.842 2.918 2.770

0.288 2.818 2.842 2.917 2.769

0.288 2.816 2.844 2.918 2.770

0.295 2.819 2.848 2.915 2.791

0.297 2.818 2.850 2.914 2.794

0.395 2.816 2.851 2.914 2.792

0.297 2.818 2.850 2.913 2.794

1 2 3 4 5

0.420 3.687 3.787 3.734

0.423 3.687 3.790 3.733

0.460 3.683 3.792 3.732

0.452 3.683 3.794 3.730

0.415 2.811 2.829 2.922 2.718

0.421 2.811 2.834 2.920 2.736

0.461 2.804 2.833 2.922 2.715

0.461 2.805 2.839 2.919 2.739

1 2 3 4 5

0.442 3.690 3.785 3.735

0.428 2.818 2.837 2.920 2.739

2.785 2.914 0.330 2.813 0.491

2.782 2.915 0.336 2.923 2.725

0.436 3.690 3.786 3.734

0.434 2.819 2.829 2.923 2.724

2.782 2.915 0.331 2.812 0.511

2.769 2.919 0.344 2.927 2.710

Atom^

CGA^

CGC^

CGG^

CGU^

CUA^

CUC^

CUG^

CUU^

1 2 3 4 5

0.419 2.815 2.846 2.919 2.765

0.421 2.815 2.847 2.920 2.766

0.423 2.814 2.847 2.918 2.765

0.427 2.810 2.848 2.920 2.766

0.430 2.815 2.852 2.914 2.794

0.428 2.815 2.854 2.912 2.795

0.431 2.813 2.854 2.913 2.795

0.430 2.814 2.854 2.910 2.796

1 2 3 4 5

2.729 2.915 0.333 2.810 0.506

2.737 2.916 0.339 2.803 0.517

2.750 2.910 0.341 2.807 0.511

2.754 2.910 0.341 2.806 0.518

2.706 2.918 0.337 2.926 2.704

2.716 2.919 0.339 2.924 2.721

2.726 2.912 0.347 2.925 2.702

2.738 2.912 0.345 2.922 2.724

Atom^

CGA^

CGC^

CGG^

CGU^

CUA^

CUC^

CUG^

CUU^

1 2 3 4 5

0.469 3.689 3.793 3.733

0.448 2.815 2.843 2.918 2.746

2.816 2.907 0.342 2.810 0.497

2.809 2.910 0.347 2.922 2.731

0.463 3.688 3.795 3.732

0.461 2.814 2.839 2.918 2.734

2.822 2.907 0.342 2.808 0.520

2.809 2.909 0.357 2.923 2.721

Atom^

GAA^

GAC^

GAG^

GAU^

GCA^

GCC^

GCG^

GCU^

1 2 3 4 5

2.760 2.917 0.319 2.822 0.499

2.758 2.917 0.319 2.822 0.502

2.762 2.916 0.320 2.820 0.501

2.760 2.916 0.321 2.821 0.503

2.774 2.918 0.322 2.823 0.487

2.772 2.914 0.332 2.822 0.505

2.779 2.916 0.323 2.821 0.487

2.775 2.914 0.332 2.819 0.507

1 2 3 4 5

0.448 3.684 3.795 3.733

0.451 3.685 3.798 3.732

0.489 3.680 3.799 3.731

0.481 3.681 3.802 3.730

0.436 2.806 2.841 2.918 2.739

0.445 2.805 2.846 2.916 2.758

0.480 2.800 2.846 2.918 2.737

0.484 2.800 2.850 2.914 2.760


1 2 3 4 5

0.437 3.691 3.786 3.734

0.424 2.816 2.836 2.917 2.744

2.790 2.913 0.328 2.813 0.486

2.788 2.913 0.338 2.921 2.729

0.416 3.693 3.789 3.767

0.415 2.819 2.830 2.924 2.731

2.781 2.918 0.321 2.813 0.495

2.770 2.918 0.337 2.928 2.717

Atom^

GGA^

GGC^

GGG^

GGU^

GUA^

GUC^

GUG^

GUU^

1 2 3 4 5

2.782 2.912 0.325 2.820 0.504

2.776 2.912 0.322 2.820 0.504

2.785 2.912 0.327 2.818 0.507

2.780 2.910 0.323 2.820 0.506

2.791 2.913 0.320 2.821 0.490

2.789 2.901 0.330 2.820 0.506

2.795 2.911 0.330 2.820 0.506

2.793 2.908 0.332 2.816 0.510

1 2 3 4 5

2.762 2.909 0.336 2.809 0.488

2.711 2.908 0.342 2.808 0.498

2.786 2.903 0.344 2.806 0.493

2.791 2.902 0.344 2.805 0.500

2.744 2.912 0.340 2.923 2.726

2.747 2.912 0.344 2.920 2.743

2.768 2.906 0.352 2.920 2.723

2.770 2.904 0.351 2.917 2.745

1 2 3 4 5

0.467 3.689 3.794 3.733

0.443 2.810 2.842 2.916 2.748

2.818 2.907 0.342 2.810 0.491

2.818 2.905 0.349 2.920 2.736

0.449 3.687 3.800 3.734

0.443 2.813 2.840 2.920 2.742

2.825 2.906 0.337 2.809 0.502

2.807 2.909 0.354 2.924 2.727

Atom^

UAA^

UAC^

UAG^

UAU^

UCA^

UCC^

UCG^

UCU^

1 2 3 4 5

2.731 2.917 0.359 2.922 2.752

2.731 2.917 0.356 2.923 2.752

2.734 2.916 0.355 2.922 2.753

2.735 2.916 0.356 2.923 2.754

2.746 2.917 0.359 2.920 2.773

2.750 2.917 0.367 2.919 2.775

2.748 2.916 0.360 2.920 2.775

2.752 2.917 0.367 2.919 2.776

Atom^

UAA^

UAC^

UAG^

UAU^

UCA^

UCC^

UCG^

UCU^

1 2 3 4 5

0.451 3.682 3.796 3.731

0.454 3.683 3.799 3.730

0.491 3.679 3.801 3.730

0.483 3.679 3.803 3.728

0.445 2.805 2.838 2.919 2.728

0.452 2.805 2.843 2.916 2.748

0.491 2.799 2.842 2.919 2.726

0.491 2.799 2.847 2.915 2.750

1 2 3 4 5

0.443 3.688 3.787 3.734

0.429 2.814 2.838 2.918 2.741

2.792 2.911 0.330 2.814 0.493

2.788 2.913 0.337 2.922 2.726

0.437 3.692 3.788 3.734

0.434 2.818 2.830 2.922 2.727

2.786 2.916 0.332 2.812 0.511

2.771 2.918 0.344 2.926 2.713

Atom^

UGA^

UGC^

UGG^

UGU^

UUA^

UUC^

UUG^

UUU^

1 2 3 4 5

2.755 2.914 0.368 2.925 2.749

2.751 2.913 0.357 2.945 2.776

2.758 2.912 0.365 2.923 2.749

2.756 2.910 0.363 2.925 2.751

2.767 2.913 0.364 2.920 2.775

2.769 2.913 0.374 2.914 2.774

2.770 2.910 0.365 2.918 2.777

2.770 2.907 0.372 2.912 2.779


1 2 3 4 5

2.766 2.908 0.344 2.806 0.512

2.774 2.907 0.350 2.805 0.523

2.787 2.902 0.352 2.803 0.516

2.793 2.902 0.353 2.802 0.524

2.739 2.911 0.352 2.922 2.715

2.747 2.908 0.358 2.917 2.730

2.760 2.905 0.363 2.921 2.713

2.769 2.901 0.364 2.915 2.732

1 2 3 4 5

0.472 3.686 3.795 3.733

0.452 2.812 2.844 2.917 2.748

2.822 2.904 0.343 2.811 0.500

2.816 2.906 0.349 2.921 2.733

0.467 3.688 3.796 3.731

0.465 2.813 2.840 2.915 2.734

2.826 2.906 0.345 2.807 0.521

2.810 2.908 0.361 2.918 2.720

SIMILARITY IN ORGANIC SYNTHESIS DESIGN: COMPARING THE SYNTHESES OF DIFFERENT COMPOUNDS

Guido Sello

Abstract
I. Introduction
II. Similarity Measures
III. Comparison Methodology
IV. Results and Discussion
V. Conclusion
Acknowledgments
References

ABSTRACT

The possibility of using similarity concepts to compare syntheses of different compounds is examined and discussed. New similarity measures and indexes are described; they analyze both the strategic and tactical aspects, suggesting a systematic approach to the problem. The examples reported are used to help the reader understand the principles and methods introduced. Discussion of the results illustrates the advantages that similarity can bring into synthesis planning and emphasizes the real applicability of the procedures developed.

Advances in Molecular Similarity, Volume 2, pages 137-151. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5

I. INTRODUCTION

Organic synthesis planning is one of the most creative and difficult tasks that chemists can face. It draws on many human intellectual abilities, all of which contribute to the final result: an efficient and often elegant synthesis. In this respect, we can predict a highly productive application of similarity, as is demonstrated by several literature references. However, the explicit use of similarity in this field is scarce and incomplete. Very few examples exist that attempt to introduce similarity in the design of synthesis, and most of them just mention its possible use without making an accurate analysis of its importance and contribution. But, when examining many of the best known syntheses, the impression of its presence is immediate. It is often possible to note the intelligent use of well-known synthetic steps in the planning of the synthesis of new and diverse compounds. Recently we became involved in a project fully dedicated to the introduction of similarity concepts into synthesis design. We could thus elaborate on some preliminary ideas that helped the development of a rough initial system based on similarity measures. After some further modifications the system was applied to different evaluation phases, and its contribution was clear. However, the greatest part of the studies was devoted to the application of similarity to the synthetic design of a single target, with the aim of selecting the best path among many possibilities, of locating alternative steps, and of predicting the most different solutions. Now we are interested in the application of the analysis to the comparison of the syntheses of different compounds. It is evident that this problem is more difficult to solve and that it could even be difficult to assess the quality of the solution.
Our previous experience has shown that even structurally similar compounds (or even the same compound) can be synthesized by very different routes, for which comparisons can be problematic. In addition, while the comparison of structures is possible by diverse methods, the comparison of transformations is often done by using substructure changes, neglecting reagents. As a consequence, it becomes hard to compare different molecules whose substructures might not be similar at all. For example, the reduction of a carbon-oxygen double bond is keyed by a different substructure from the reduction of a carbon-carbon double bond, despite the evident similarity from the viewpoint of synthesis planning. Herein I will address the complex problem of the comparison of the syntheses of different compounds, developing new ideas and calculation methods at the same time. The synthetic routes, used as examples, are taken directly from the literature


and are not the best possible routes, but they can serve to assess the utility of similarity as a tool in synthesis design.

II. SIMILARITY MEASURES

To compare synthetic routes we need both structure and reaction descriptors. In fact, I will develop a system that can consider both the strategic aspect of the synthesis and its tactical realization. It is generally accepted that strategy is mainly a structure problem, because the strategic approach to the synthesis of a target cannot be conditioned by the state-of-the-art development of transformation methodologies. Tactics, on the contrary, are concerned with the application of the strategic principles to the current target and depend on the transform management. During the development of a synthetic plan it is possible to conceive a system that, using similarity, can enhance strategic and tactical efficiency. However, when comparing existing synthetic routes we are forced to use similarity measures only to weigh the alternative options. We selected two structure descriptors and one reaction descriptor. In synthesis design it is important to consider two aspects of the similarity between structures: the first is a classical substructure comparison between educts and products (substructure similarity measure, SSM); the second aims at measuring the effectiveness of the synthetic step (globularity similarity measure, GSM). Every synthetic chemist instinctively feels that a good synthetic step must correlate two compounds that partially share structural features but at the same time are as different as possible. In principle the best synthetic step transforms an educt into a product that is the most diverse where the change was predicted, but that maintains every other part of the structure unchanged. In other words, in a good educt-to-product passage all of the building blocks are conserved and all of the reacting blocks are affected. Our two structure descriptors aim at measuring the realization of this goal.
On the contrary, the comparison between transformations is best represented by a single descriptor that can measure the efficiency of a transformation in association with its group. The reaction classification scheme that we developed supports the system by defining a global transform similarity measure (GTSM). SSM is directly derived from our previous work in the field of substructure similarity. We defined a substructure similarity index as

$SSI = N \times (A + B)/(A \times B)$ (1)

where $N$ is the number of similar atoms, and $A$ and $B$ are the numbers of atoms in molecules A and B, respectively. This index is well suited for comparing structures. Nevertheless, we slightly modified its definition to take into account two problems: the first concerns the different weight that the similarity of a connected and an unconnected substructure must have; the second changes the limits of the index, which now ranges between zero and one. The calculation is thus obtained as

$SSM = \sqrt{\sum_i SF_i^2}, \qquad SF_i = 2 \times N_i/(A + B)$ (2)

where $N_i$ is the number of similar atoms in fragment $i$, and $A$ and $B$ are the numbers of significant atoms of molecules A and B, respectively. It follows that $0 \le SF_i \le 1$, equal to 0 if $N_i$ is equal to 0, and equal to 1 if $N_i$ is equal to $A$ and $A$ is equal to $B$; consequently $0 \le SSM \le 1$. Globularity is a measure of the structure complexity and of its distribution on the molecule. It is calculated by

$G = MAXD/COMP_{TOT}$ (3)

where MAXD is the greatest of the smallest distances of atom pairs measured as atom complexity, and $COMP_{TOT}$ is the molecular complexity measured as the sum of all atom complexities; from this descriptor we derive the similarity measure as

$GSM = \Delta G_{AB}$ (4)

where AG is the difference between globularity of molecules A and B, A being the educt and B the product. From our reaction classification scheme we chose one descriptor that is used to measure the similarity between transformations. It is based on the calculated chemical potential ^^ of changing atoms and is obtained as ii = dE/dn = -IC,Z,^,^/(ZZXJ

+ k^

(5)

where $Z_{eff} = Z - \sigma = Z - [N_1 + 0.85 N_2 + 0.35 (N_3 - 1)]$ and $Z'_{eff} = Z - (N_1 + 0.85 N_2 + 0.35 N_3)$; $Z$ is the atomic nuclear charge, $\sigma$ is Slater's core screening factor, $N_1$ is the number of inner-shell electrons, $N_2$ is the number of medium-shell electrons, and $N_3$ is the number of outer-shell electrons. Using this descriptor we can obtain the corresponding similarity measure as

$GTSM = \sum_i \Delta\mu_i$ (6)

where $\Delta\mu_i$ is the difference of chemical potential of atom $i$ in the product and the educt.
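Returning to the substructure measure of Eq. 2, a minimal sketch may help to show why a connected match weighs more than the same number of atoms scattered over several fragments. It assumes that Eq. 2 reads as the square root of the summed squared fragment terms; the function names and the atom counts are illustrative and do not come from the original work.

```python
import math

def fragment_similarity(n_i, a, b):
    """SF_i = 2 * N_i / (A + B) for one matched fragment."""
    return 2.0 * n_i / (a + b)

def substructure_similarity(fragment_sizes, a, b):
    """SSM as the root of the summed squared SF_i (assumed reading of Eq. 2)."""
    return math.sqrt(sum(fragment_similarity(n, a, b) ** 2 for n in fragment_sizes))

# Hypothetical molecules with 10 significant atoms each.
connected = substructure_similarity([8], a=10, b=10)     # one 8-atom fragment
scattered = substructure_similarity([4, 4], a=10, b=10)  # same 8 atoms in two fragments
identical = substructure_similarity([10], a=10, b=10)    # complete match
```

Under this reading the single connected fragment scores higher than the two separate ones, and a complete match gives the upper limit of one.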

III. COMPARISON METHODOLOGY

The definition of the similarity measure is necessary for the realization of a system that can compare synthetic routes; nevertheless, it is also necessary to conceive a methodology that can guarantee the correct use of similarity measures. The methodology must be clear and stable; but, in addition, it is worth remembering that we are comparing quite different objects using an approach that must conserve its flexibility and large scope. Consequently, I am not going to calculate similarity indexes by just combining the corresponding similarity measures, but instead will develop a procedure where the similarity measures are used following a precise scheme. The final result will still be a number but its value will be highly dependent on the current comparison. In other words, the same synthetic step either can contribute to the calculation of the similarity index with reference to one synthesis, or can be neglected when considering a different synthesis. We can define three similarity indexes, one for SSM, one for GSM, and one for GTSM, that are calculated as reported below. SSI (substructure similarity index) is obtained by taking the geometric mean of the similarity percentages of the SSMs of all of the synthetic steps that have an SSM similar to the corresponding partner to an extent greater than or equal to 80%. For example, consider step 1 of routes A and A', and define SSM of A equal to 0.75 and SSM of A' equal to 0.82; because their ratio is equal to 0.91, step 1 contributes to SSI. On the contrary, if step 2 shows an SSM of A equal to 0.70 and an SSM of A' equal to 0.50, because their ratio is equal to 0.71, step 2 does not contribute to SSI. The rationale is that a synthetic step of two different syntheses can be considered sufficiently similar if the level of similarity of the educts and the products is at least 80%, neglecting the absolute value of the SSM. It is clear that this rationale is valid only when comparing educts and products, i.e., when we can be confident that the compounds we are considering cannot be too dissimilar. Calculation of the index is as follows:

$SSI = \left(\prod_n SSM_i/SSM_j\right)^{1/n} \quad \forall\; SSM_i/SSM_j \ge 0.80$ (7)
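The worked example above can be sketched in a few lines. The sketch assumes that the "similarity percentage" of a step is the smaller SSM divided by the larger one, which is what the two quoted ratios (0.91 and 0.71) imply; the function names are illustrative only.

```python
import math

def step_ratio(x, y):
    """Similarity percentage of two positive measures: smaller over larger."""
    return min(x, y) / max(x, y)

def substructure_similarity_index(ssm_route_a, ssm_route_b, threshold=0.80):
    """Geometric mean of the step ratios that reach the threshold (Eq. 7).

    Returns (index, number of contributing steps); the index is None
    when no step qualifies.
    """
    ratios = [step_ratio(x, y) for x, y in zip(ssm_route_a, ssm_route_b)]
    kept = [r for r in ratios if r >= threshold]
    if not kept:
        return None, 0
    return math.prod(kept) ** (1.0 / len(kept)), len(kept)

# Steps 1 and 2 of routes A and A' from the text.
ssi, n_steps = substructure_similarity_index([0.75, 0.70], [0.82, 0.50])
```

Only step 1 passes the 80% filter here, so the index reduces to its single ratio.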

GSI (globularity similarity index) is based on the use of the descriptor globularity to evaluate the strategic efficiency of a synthetic step. As shown above, G is a measure of the molecular complexity and has a lower value for more complex structures; therefore, it should increase going down from target to precursors. Nevertheless, in real syntheses, G can either increase or decrease and GSM can thus be positive or negative. The corresponding index must consider this situation and we again chose to combine only GSMs that show similar trends, i.e., GSI is the geometric mean of the ratio only of GSMs that have the same sign (positive or negative):

$GSI = \left(\prod_n GSM_i/GSM_j\right)^{1/n} \quad \forall\; GSM_i/GSM_j > 0$ (8)

The rationale, in this case, is that we can compare, and then add their contribution to the similarity, only those changes in globularity that have the same strategic effect, i.e., either increase or decrease the molecular complexity. Those synthetic steps that have opposite strategic meaning cannot be compared. The importance of the contribution is measured by the similarity between the complexity changes; thus, even very small complexity variations that are strategically meaningless can contribute to the overall strategic similarity between syntheses. GTSI (global transform similarity index) is naturally connected to the similarity between transformations, i.e., between reactions transforming educts into products. Its calculation uses the measure of the global changes in chemical potential of all of the atoms. Also in this case we must decide when two reactions in two different syntheses can be compared. We have already reported a system for reaction classification based on two descriptors, electronic energy and chemical potential, that allows the hierarchical subdivision of reactions into ordered sets. Thus, we are in the position of using that classification scheme in our procedure. However, the classification scheme is too articulated and, for the sake of synthesis comparison, we decided to use only two levels of the hierarchy. GTSI is the geometric mean of the ratio of the GTSMs of the reactions that belong to the same energy class; i.e., we consider comparable only two reactions that are both additions, or eliminations, or substitutions. GTSI is calculated as follows:

$GTSI = \left(\prod_n GTSM_i/GTSM_j\right)^{1/n} \quad \text{iff}\; class(R_i) = class(R_j)$ (9)

In this case it is clear that we can use for similarity evaluation only those reactions that have similar reactivity bases. The GTSI values vary much more than their structure counterparts and their mere comparison can be misleading. As a consequence we decided to maintain a trace of the number of participating reactions (PRN) to the calculation and to use both GTSI and PRN as indexes of synthesis similarity.
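The two remaining indexes differ from SSI only in the filter that decides which steps take part. The sketch below is a hedged illustration: it assumes that each ratio is taken as the smaller magnitude over the larger one, and all numerical values and class labels are invented for the example.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values, or None for an empty list."""
    return math.prod(values) ** (1.0 / len(values)) if values else None

def magnitude_ratio(x, y):
    """Smaller magnitude over larger magnitude (an assumption of this sketch)."""
    return min(abs(x), abs(y)) / max(abs(x), abs(y))

def globularity_similarity_index(gsm_a, gsm_b):
    """Eq. 8: only step pairs whose GSM values share a sign take part."""
    ratios = [magnitude_ratio(x, y) for x, y in zip(gsm_a, gsm_b) if x * y > 0]
    return geometric_mean(ratios)

def transform_similarity_index(gtsm_a, gtsm_b, class_a, class_b):
    """Eq. 9: only step pairs in the same reaction class take part.

    Returns (GTSI, PRN), where PRN is the number of participating reactions.
    """
    ratios = [magnitude_ratio(x, y)
              for x, y, ca, cb in zip(gtsm_a, gtsm_b, class_a, class_b)
              if ca == cb and x != 0 and y != 0]
    return geometric_mean(ratios), len(ratios)

# Hypothetical three-step routes.
gsi = globularity_similarity_index([0.02, -0.01, 0.03], [0.04, 0.02, 0.03])
gtsi, prn = transform_similarity_index(
    [1.2, 0.8, 2.0], [1.5, 0.4, 1.0],
    ["addition", "elimination", "substitution"],
    ["addition", "substitution", "substitution"],
)
```

Returning PRN together with GTSI keeps the number of comparable reactions visible, which is the reason given above for reporting both.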

IV. RESULTS AND DISCUSSION

The results presented in the following concern two molecular sets: the first composed of four molecules of the same class (four prostaglandin derivatives), the second composed of the first set augmented by two different compounds, Sirenin and Methoxatin, chosen because the size and the length of their synthetic routes are of the same extent (Figure 1). The syntheses have been selected from the same literature source, thus their description is sufficiently uniform. However, not all of the synthetic steps have been explicitly considered and in the course of the discussion I will point out the differences that can appear just because of different descriptions. Before beginning the discussion I will repeat some general guidelines. The analysis is always carried out in the synthetic sense; thus, when I speak about, for example, step 3, I mean the third step from the starting material. The values of the similarity measures (SSM, GSM, GTSM) are always obtained by comparing two intermediates of the same synthetic route (e.g., PG1-5 with PG1-6); on the contrary, the values of the similarity indexes (SSI, GSI, GTSI) are calculated by comparing similarity measures of corresponding steps in different syntheses.

Similarity in Organic Synthesis Design

Figure 1. Set of examined compounds.

Finally, it must be clear that, while the similarity measures represent the values of the descriptors and consequently have constant values, the similarity indexes are strictly dependent on the calculation procedure and can be changed very easily. In the first set we have four prostaglandin derivatives: PG1 and PG2 are the same intermediate in two syntheses of PGE1; PG3 and PG4 are also intermediates in the syntheses of PGE1, but they are different from PG1 and PG2. The four synthetic routes are sketched in Figures 2, 3, 4, and 5. The syntheses of PG3 and PG4 are very similar, differing only in the last step. Nevertheless, I have deliberately described the two routes in two different ways at the second and third steps in order to verify the response of the system. Values of SSM, GSM, and GTSM are reported in Table 1. From Table 2 we can observe that the SSI factors for PG3 and PG4 are over 80% similar in the last three steps only, in agreement with the different weight that the ethereal chain has in the two structures; as a consequence the SSI is the smallest in the PG series. Looking at the GSI, on the contrary, it is immediately obvious that the two syntheses are strategically very similar; all six steps show homogeneous variations and the final result is very clear. In conclusion, the two syntheses are similar as far as strategy is concerned, but it must be clear that the use of big protective groups in small molecules can influence the overall yield. The transformations are obviously very similar, being in agreement for all but one of the steps. The GTSM values are influenced by the diverse reagents used only in step 5 and in step 3, where we deliberately used water for the hydrolysis of PG3, and NaOH for that of PG4. Comparing PG1 and PG2, which are exactly the same target, we can note that different synthetic routes to the same molecule are not necessarily similar. All of the

GUIDO SELLO

Figure 11. Joint plot of jittered BdDelRPI, BdDelMPI, delivered potency, and sweetness.

The sweet compounds, indicated by the large plotted characters, cluster in the middle of the figure. The five labeled compounds in this region form the upper row of Figure 12. This is a highly organized set of structures with an easily recognized commonality consisting of a six-membered ring substituted with analogously sized substituents in the para position.

Table 2 (continued; final rows)

Eigenvalue    Proportion of Explained Variance    Cumulative Explained Variance
2.8           4.6                                 85.2
1.9           3.1                                 88.3
1.5           2.4                                 90.8

Table 3. Summary of Principal Component Analysis of 61 Topochemical Indices for 2926 Chemicals

PC    Eigenvalue    Proportion of Explained Variance    Cumulative Explained Variance
1     20.4          33.5                                33.5
2     10.8          17.8                                51.2
3     8.1           13.3                                64.6
4     6.1           9.9                                 74.5
5     3.0           5.0                                 79.5
6     2.4           3.9                                 83.4
7     1.7           2.8                                 86.2
8     1.4           2.2                                 88.4
9     1.2           2.0                                 90.4
10    1.1           1.8                                 92.1

[Column "Top Three Correlated Indices": index symbols not legible.]

Table 4. Summary of Principal Component Analysis of 101 Topological Indices for 2926 Chemicals

PC    Eigenvalue    Proportion of Explained Variance    Cumulative Explained Variance
1     42.6          41.6                                41.6
2     13.3          13.0                                54.7
3     11.4          11.1                                65.8
4     8.9           8.7                                 74.5
5     5.1           5.0                                 79.6
6     3.7           3.6                                 83.2
7     2.6           2.6                                 85.8
8     2.0           1.9                                 87.7
9     1.7           1.7                                 89.4
10    1.4           1.4                                 90.8
11    1.1           1.1                                 91.9
12    1.0           1.0                                 92.8

[Column "Top Three Correlated Indices": index symbols not legible.]

Molecular Similarity Using Topological Invariants


Twelve PCs were retained from the PC A of the full set of 101 TIs. Each of these PCs had an eigenvalue greater than one and, cumulatively, they explained 92.8% of the variance within the full set of TIs. These PCs are summarized in Table 4.
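The retention rule used here (keep every PC whose eigenvalue exceeds one) can be sketched as follows. The eigenvalues in the example are hypothetical and describe six standardized variables; they are not taken from Tables 2-4, and the function name is illustrative.

```python
def retain_components(eigenvalues, cutoff=1.0):
    """Keep eigenvalues above the cutoff and report cumulative explained variance (%).

    For a PCA on the correlation matrix the eigenvalues sum to the number
    of variables, so the total is simply their sum.
    """
    total = sum(eigenvalues)
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > cutoff]
    cumulative = []
    running = 0.0
    for ev in kept:
        running += ev
        cumulative.append(100.0 * running / total)
    return kept, cumulative

# Hypothetical spectrum for six standardized variables.
kept, cumulative = retain_components([3.1, 1.4, 1.05, 0.3, 0.1, 0.05])
```

In this toy case three components are retained and together they account for 92.5% of the variance.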

Figure 3. The five analogues selected for the probe 3-methyl-4-chlorophenol using three molecular similarity spaces: topostructural, topochemical, and all indices. The numbers under the structures indicate the ranking of the analogues and the Euclidean distance to the probe.

SUBHASH C. BASAK, BRIAN D. GUTE, and GREGORY D. GRUNWALD

Table 5. Comparison of the Three Sets of TIs and Their Derivative PCs for Prediction of Normal Boiling Point (°C) Using K-Nearest-Neighbors (n = 2926)

Indices                          K     r        s
Topostructural                   10    0.881    39.0
Topochemical                     6     0.883    38.6
Topostructural + topochemical    8     0.896    36.6

Figure 4. Pattern of (top) correlation (r) and (bottom) standard error (s) of the estimates according to the K-nearest-neighbor selection for 2926 normal boiling points using three molecular similarity spaces. Both panels are plotted against the number of neighbors (K) for the topostructural indices, the topochemical indices, and all indices.


B. Analogue Selection

Figure 3 shows an example of analogue selection using PCs to derive a Euclidean distance space. The first five analogues (neighbors) for the probe compound, 3-methyl-4-chlorophenol, are presented for each of the three similarity spaces. The analogues selected by the topostructural model show a repetition of the same skeletal structure, ignoring substituents, throughout the first five analogues. In the topochemical model and the full set model some variability in the skeletal structure arises (chemical analogues 2 and 5, full set analogue 4). Also of interest is the repetition of chemicals between the sets of analogues. While the ordering varies between the methods, the topostructural and topochemical models select two identical structures, the topostructural and the full set have three analogues in common, and the topochemical and full set select four of the same analogues. 2-Chloro-5-methylphenol appears in all three sets, while there are only three unique compounds (topostructural analogues 4 and 5, topochemical analogue 5).

C. K-Nearest-Neighbor Property Estimation

Figure 4 presents the correlation (r) and the standard error (s) of the prediction of the normal boiling points for the 2926 chemicals for the three groups of indices over the full range of K values examined (K = 1-10, 15, 20, 25). Table 5 shows the best normal boiling point model for each set of indices. The best boiling point estimates for all three sets were for K in the range of 6 to 10. The full set of indices gave the best result, although there was only a small difference between models.
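Both procedures, analogue selection and KNN estimation, rest on ranking compounds by Euclidean distance in the PC space. The following sketch illustrates the idea under two assumptions that are not stated in this section: the estimate is taken as the plain mean of the property over the K nearest neighbors, and the probe itself is absent from the library. All names, coordinates, and boiling points are invented.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two points in PC space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbors(probe, library, k):
    """Return the k (distance, name) pairs closest to the probe."""
    ranked = sorted((euclidean(probe, coords), name) for name, coords in library.items())
    return ranked[:k]

def knn_estimate(probe, library, properties, k):
    """Estimate a property as the mean over the k nearest neighbors (assumed rule)."""
    neighbors = nearest_neighbors(probe, library, k)
    return sum(properties[name] for _, name in neighbors) / k

# Hypothetical two-dimensional PC scores and boiling points (°C).
library = {"a": (0.0, 0.1), "b": (0.2, 0.0), "c": (1.0, 1.0), "d": (3.0, 3.0)}
boiling_points = {"a": 100.0, "b": 110.0, "c": 150.0, "d": 250.0}

analogues = [name for _, name in nearest_neighbors((0.0, 0.0), library, k=2)]
estimate = knn_estimate((0.0, 0.0), library, boiling_points, k=2)
```

Repeating the estimate for several values of K and comparing r and s is then a matter of looping over K, which is how a curve such as the one in Figure 4 would be produced.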

IV. DISCUSSION

The purpose of this paper was to study the relative effectiveness of three similarity spaces derived from graph invariants in the selection of structural analogues and in the KNN-based estimation of properties. The similarity spaces were created using a PCA of calculated graph invariants. Tables 2-4 summarize the results of the PCA of the three sets of indices. The first PC is always correlated with indices that quantify molecular size. In the case of the topostructural indices, the second PC is most correlated with branching indices. In the case of PCs derived from either topochemical or the full set of topostructural and topochemical parameters, the first PC was strongly correlated with molecular size, while the second PC was highly associated with the molecular complexity indices. These results are in line with our earlier studies on different sets of chemicals. All three spaces were used in the selection of five analogues of a particular structure (Figure 3). Perusal of the three sets of structures shows that there is a substantial degree of similarity among the three groups of five chemicals selected. It is interesting to note that all five nearest neighbors of the probe selected by the topostructural method had isomorphic skeletal graphs when hydrogen atoms are suppressed. For the two similarity spaces created by topochemical indices alone and the combined set of topostructural and topochemical indices, four of the five selected neighbors are common (Figure 3) although the ordering of the molecules is different. This shows that these two similarity methods are not intrinsically very different. Our earlier results showed that similarity methods derived from experimental physical properties, atom pairs, and TIs select very similar sets of analogues. In the case of KNN-based estimation of boiling points of chemicals from their analogues, K was varied from 1 to 25. The best estimated value was obtained in the range of K = 6-10. This is in line with our earlier studies with different properties.^{11,12} In conclusion, the three similarity spaces derived in this paper have reasonable power for selecting analogous molecules from a very diverse database of chemicals. The KNN-based estimation shows that selected analogues can be used for the estimation of boiling points of diverse chemicals if more accurate methods are not available.

ACKNOWLEDGMENTS This is contribution number 161 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported herein was supported in part by grants F49620-94-1-0401 and F49620-96-1-0330 from the United States Air Force, a grant from Exxon Corporation, and the Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute of the University of Minnesota.

REFERENCES

1. Johnson, M. A.; Maggiora, G. M., Eds. Concepts and Applications of Molecular Similarity; Wiley: New York, 1990.
2. Carbó, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
3. Bowen-Jenkins, P. E.; Cooper, D. L.; Richards, W. G. J. Phys. Chem. 1985, 89, 2195.
4. Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R. Discrete Appl. Math. 1988, 19, 17.
5. Basak, S. C.; Bertelsen, S.; Grunwald, G. J. Chem. Inf. Comput. Sci. 1994, 34, 270.
6. Rum, G.; Herndon, W. C. J. Am. Chem. Soc. 1991, 113, 9055.
7. Willett, P.; Winterman, V. Quant. Struct.-Act. Relat. 1986, 5, 18.
8. Wilkins, C. L.; Randić, M. Theor. Chim. Acta 1980, 58, 45.
9. Trinajstić, N. Chemical Graph Theory, Vols. I & II; CRC Press: Boca Raton, FL, 1983.
10. Basak, S. C.; Grunwald, G. D. Math. Model. Sci. Comput., in press.
11. Basak, S. C.; Grunwald, G. D. SAR QSAR Environ. Res. 1994, 2, 289.
12. Basak, S. C.; Grunwald, G. D. New J. Chem. 1995, 19, 231.
13. Basak, S. C.; Grunwald, G. D. J. Chem. Inf. Comput. Sci. 1995, 35, 366.
14. Basak, S. C.; Grunwald, G. D. SAR QSAR Environ. Res. 1995, 3, 265.
15. Basak, S. C.; Grunwald, G. D. Chemosphere 1995, 31, 2529.
16. Basak, S. C.; Gute, B. D.; Grunwald, G. D. Croat. Chim. Acta 1996, 69, 1159.
17. Lajiness, M. S. In Computational Chemical Graph Theory; Rouvray, D. H., Ed.; Nova Science Publishers: New York, 1990, p. 300.
18. Russom, C. L. Assessment Tools for the Evaluation of Risk (Aster) v. 7.0; U.S. Environmental Protection Agency, 1992.
19. Wiener, H. J. Am. Chem. Soc. 1947, 69, 17.
20. Randić, M. J. Am. Chem. Soc. 1975, 97, 6609.
21. Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies Press: Hertfordshire, U.K., 1986.
22. Bonchev, D.; Trinajstić, N. J. Chem. Phys. 1977, 67, 4517.
23. Raychaudhury, C.; Ray, S. K.; Ghosh, J. J.; Roy, A. B.; Basak, S. C. J. Comput. Chem. 1984, 5, 581.
24. Basak, S. C.; Roy, A. B.; Ghosh, J. J. In Proceedings of the Second International Conference on Mathematical Modelling; Avula, X. J. R.; Bellman, R.; Luke, Y. L.; Rigler, A. K., Eds.; University of Missouri-Rolla, 1980, p. 851.
25. Basak, S. C.; Magnuson, V. R. Arzneim.-Forsch. Drug Res. 1983, 33, 501.
26. Roy, A. B.; Basak, S. C.; Harriss, D. K.; Magnuson, V. R. In Mathematical Modelling in Science and Technology; Avula, X. J. R.; Kalman, R. E.; Liapis, A. I.; Rodin, E. Y., Eds.; Pergamon Press: New York, 1984, p. 745.
27. Balaban, A. T. Chem. Phys. Lett. 1982, 89, 399.
28. Balaban, A. T. Pure Appl. Chem. 1983, 55, 199.
29. Balaban, A. T. Math. Chem. (MATCH) 1985, 21, 115.
30. Basak, S. C.; Harriss, D. K.; Magnuson, V. R. POLLY v. 2.3 (copyright University of Minnesota), 1988.
31. Shannon, C. E. Bell Syst. Tech. J. 1948, 27, 379.
32. Sarkar, R.; Roy, A. B.; Sarkar, R. K. Math. Biosci. 1978, 39, 299.
33. Magnuson, V. R.; Harriss, D. K.; Basak, S. C. In Studies in Physical and Theoretical Chemistry; King, R. B., Ed.; Elsevier: Amsterdam, 1983, p. 178.
34. SAS Institute Inc. In SAS/STAT User's Guide, Release 6.03 Edition; SAS Institute Inc.: Cary, NC, 1988, p. 751.
35. Basak, S. C.; Niemi, G. J.; Veith, G. D. J. Math. Chem. 1991, 7, 243.
36. Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R.; Veith, G. D. Math. Model. 1987, 8, 300.


OPTIMIZING HYBRID DENSITY FUNCTIONALS BY MEANS OF QUANTUM MOLECULAR SIMILARITY TECHNIQUES

Miquel Sola, Marta Fores, and Miquel Duran

Abstract
I. Introduction
II. Methodology
III. Results and Discussion
    A. The CO System
    B. The N2 System
    C. The LiF System
IV. Conclusions
Acknowledgment
References and Notes


ABSTRACT

The $a_0$, $a_x$, and $a_c$ semiempirical parameters of the original three-parameter method of Becke have been optimized by minimizing the difference between the density delivered by this method and the singles and doubles quadratic configuration interaction (QCISD) generalized density using quantum molecular similarity measures. The optimization is performed employing the relaxed geometry at each set of $a_0$, $a_x$, and $a_c$ parameters. This method has been applied to a series of small molecules (N2, CO, and LiF) that have experimentally known properties and molecular bonds of diverse degrees of ionicity and covalency. Results show that, at least in these diatomic molecules, it is possible to obtain a set of parameters that reproduces almost exactly the electron density obtained from the QCISD methodology. Especially interesting are the values obtained for the $a_0$ parameter, which reflect how much exact exchange should be included in the description of a particular system.

I. INTRODUCTION

The density-functional theory (DFT) of electronic structure has seen significant advances since its original formulation by Hohenberg, Kohn, and Sham. The use of density gradients in exchange and correlation corrections to the so-called local spin-density approximation (LSDA) has largely improved the computed molecular properties like geometries, vibrational frequencies, dipole moments, and particularly the molecular bond energies. Despite the unquestionable success of DFT in many fields, there is still a need to improve the current DFT schemes. For instance, the difficulties that DFT has in describing weak interactions like hydrogen bonds, charge transfer, and van der Waals complexes are well established, and they will be surmounted only with the advent of more precise exchange-correlation potentials. In this sense, the analysis of the performance and the improvement of different exchange-correlation functionals is a subject of great current interest. So far, two approaches have been followed. On one hand, some authors have investigated nonlocal schemes of the generalized gradient approximation (GGA) which include the Laplacian of the electron density and the kinetic energy density. On the other hand, different hybrid schemes have been proposed in which the Hartree-Fock (HF) exact treatment of exchange is incorporated to some extent into the available functionals. These hybrid DFT methods, which are based on the adiabatic connection formula, attempt to improve the exchange part of the exchange-correlation functional. In fact, it has been shown that errors in molecular descriptions arise mainly from the treatment of exchange, which is the dominant part of the exchange-correlation energy. So, it is commonly argued that a partial inclusion of the exact exchange must improve the overall accuracy of the exchange-correlation functional. Indeed, results show that these hybrid methods yield energetic and structural results of an accuracy comparable to those obtained by methods that are much more demanding computationally. Probably the most popular hybrid scheme for the exchange-correlation functional is Becke's three-parameter method, which was originally formulated as

$E_{xc} = E_{xc}^{LSDA} + a_0 (E_x^{exact} - E_x^{LSDA}) + a_x \Delta E_x^{B88} + a_c \Delta E_c^{PW91}$ (1)

Here the $E_x^{exact}$ term corresponds to the HF exchange energy based on Kohn-Sham orbitals, while $E_{xc}^{LSDA}$ is the uniform electron gas exchange-correlation energy, $\Delta E_x^{B88}$ is Becke's 1988 gradient correction for exchange, and $\Delta E_c^{PW91}$ is Perdew and Wang's gradient correction for correlation. Commonly, this procedure is referred to as the B3PW91 method. The coefficients $a_0$, $a_x$, and $a_c$ were determined by Becke by fitting 56 atomization energies, 42 ionization potentials, and 8 proton affinities. The values obtained, which are in some sense semiempirical, were $a_0 = 0.20$, $a_x = 0.72$, and $a_c = 0.81$. It is worth noting that, in this fitting, single point energy calculations were performed at experimental geometries. Furthermore, the exact exchange and the gradient corrections were added in the evaluation of energies in a non-self-consistent fashion using converged LSDA densities. Probably, the so-called B3LYP functional is even more popular than the B3PW91 method. In the Gaussian-94 implementation the expression used is similar to Eq. 1 with slight differences:

$E_{xc} = E_x^{LSDA} + a_0 (E_x^{exact} - E_x^{LSDA}) + a_x \Delta E_x^{B88} + E_c^{VWN} + a_c (\Delta E_c^{LYP} - E_c^{VWN})$ (2)

As the Lee-Yang-Parr (LYP) functional already contains a local part and a gradient correction, one has to remove the local part to obtain a coherent implementation. This can be done in an approximate way by subtracting $E_c^{VWN}$ from $\Delta E_c^{LYP}$. The method has been normally used with the same three parameters derived originally for the B3PW91 functional. Other common hybrid schemes are those based on Becke's half-and-half (BHandH) linear interpolation of the adiabatic connection integral, which takes the values of $a_0 = 0.5$ and $a_x = a_c = 0$ in Eq. 1. If one has a high-quality electron density for a particular system, the DFT functionals, and in particular the three parameters of the B3LYP method, can be optimized so that one can obtain an electron density in better agreement with the reference density for the system considered. Ideally, the reference density should be taken from experimental results. In practice, however, it is very difficult to obtain a reliable electron density from a direct experimental measurement. In this case, an electron density obtained from a high-level ab initio method, such as the singles and doubles quadratic configuration interaction (QCISD), can be used as the reference density. The aim of this work is to optimize the $a_0$, $a_x$, and $a_c$ semiempirical parameters of the B3LYP method by minimizing the differences between the density furnished by the B3LYP method and the QCISD generalized density, which is taken as the reference density. This minimization is performed by maximizing the quantum molecular similarity measure (QMSM) between the B3LYP and the QCISD densities. The expression used to compute the QMSM between two first-order electron density functions $\{\rho_I, \rho_J\}$ is given by

$Z_{IJ}(\Theta) = \int\!\!\int \rho_I(r_1)\, \Theta(r_1, r_2)\, \rho_J(r_2)\, dr_1\, dr_2$ (3)

$\Theta(r_1, r_2)$ being a positive definite operator depending on two-electron coordinates. Overlap-like QMSM are obtained when the $\Theta(r_1, r_2)$ operator is chosen as the Dirac delta function $\delta(r_1 - r_2)$. Use of the operator $1/r_{12}$ or $1/r_{12}^2$ gives rise to Coulomb-like QMSM and gravitational-like QMSM, respectively. In the particular case that $\rho_I = \rho_J$, one gets $Z_{II}$, which is the so-called self-QMSM. In previous works, it has been shown that comparison between densities through use of QMSM can provide detailed information on the similarities and differences between various methods. Such comparative studies should increase our understanding of the behavior of the different DFT schemes, and they should also provide hints toward improved treatments. In this work, we show how QMSM can be used to improve standard hybrid methodologies.
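For the overlap-like case the Dirac delta collapses the double integral of Eq. 3 into a single integral of the product of the two densities. The sketch below illustrates this numerically for two normalized one-dimensional Gaussian model densities, which are a deliberate simplification chosen because their overlap is known in closed form; the exponents, centers, grid, and function names are all assumptions of the example, not quantities from this work.

```python
import math

def gaussian_density(alpha, center):
    """Normalized 1D Gaussian model density with exponent alpha."""
    norm = math.sqrt(alpha / math.pi)
    return lambda x: norm * math.exp(-alpha * (x - center) ** 2)

def overlap_qmsm(rho_i, rho_j, lo=-10.0, hi=10.0, n=4000):
    """Overlap-like QMSM: integral of rho_i * rho_j on a uniform grid."""
    h = (hi - lo) / n
    return h * sum(rho_i(lo + k * h) * rho_j(lo + k * h) for k in range(n + 1))

def analytic_overlap(alpha, beta, distance):
    """Closed-form overlap of two normalized 1D Gaussians."""
    return (math.sqrt(alpha * beta / (math.pi * (alpha + beta)))
            * math.exp(-alpha * beta / (alpha + beta) * distance ** 2))

rho_a = gaussian_density(1.0, 0.0)
rho_b = gaussian_density(1.5, 0.5)

z_ab = overlap_qmsm(rho_a, rho_b)   # cross measure
z_aa = overlap_qmsm(rho_a, rho_a)   # self-QMSM of the first density
z_bb = overlap_qmsm(rho_b, rho_b)   # self-QMSM of the second density
```

The cross measure never exceeds the geometric mean of the two self-measures, which is why a larger QMSM signals a closer pair of densities once the self-terms are taken into account.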

II. METHODOLOGY

Let us define the function difference:

$p(r) = \rho_{B3LYP}(r) - \rho_{QCISD}(r)$ (4)

If Becke's $a_0$, $a_x$, and $a_c$ parameters are allowed to change, the function difference $p(r)$ will depend on these three parameters. Then, one can obtain the set of parameters that yield the B3LYP density closest to the QCISD density by minimizing the quadratic error integral:

$D = \int p^2(r)\, dr$ (5)

which is obtained as a function of $a_0$, $a_x$, and $a_c$. The Gaussian-94 program has been used to perform singles and doubles quadratic configuration interaction (QCISD) and DF (B3PW91, B3LYP, BHandHLYP, and BHandH) calculations. In all of these DF calculations, the nonlocal energy functional is incorporated into the optimization of both electronic and nuclear degrees of freedom self-consistently. B3LYP and B3PW91 calculations with $a_0$, $a_x$, and $a_c$ coefficients different from the standard coefficients provided by Becke have been performed with Gaussian-94 using internal options. To minimize basis set effects, which may produce relevant QMSM differences, the 6-311++G** basis set has been used throughout. All calculations have been done within the restricted Hartree-Fock formalism except ionization potentials, which have been calculated with the unrestricted Hartree-Fock method.

QMSM have been obtained from the Gaussian-94 electron densities using the Messem program developed in our group. For QCISD calculations, generalized densities have been used. Likewise, the DF electron densities have been calculated from self-consistently converged Kohn-Sham orbitals. All QMSM are Coulomb-like, i.e., $\Theta(r_1, r_2) = 1/r_{12}$ in Eq. 3, except self-QMSM, which are overlap-like. Gradients of the D function in Eq. 5 with respect to the $a_0$, $a_x$, and $a_c$ parameters have been computed numerically, and the optimization of this function has been performed using the quasi-Newton Davidon-Fletcher-Powell (DFP) algorithm.

Scheme 1. Flowchart of the optimization: initialize $a_0$, $a_x$, and $a_c$; obtain $D(a_0, a_x, a_c)$; compute $\partial D/\partial a_k$ from the displaced parameters $a_k \pm \epsilon$; test for convergence; if not converged, perform a DFP optimization step to obtain a new set of $a_0$, $a_x$, and $a_c$.

192


Scheme 1 depicts the program structure chart of the present algorithm. In this chart, shaded rectangles indicate calculations that use the Gaussian-94 program for geometry optimization and calculation of the electron density. Thus, starting from the initial set of a_0, a_x, and a_c parameters given by Becke, the QCISD and B3LYP electron densities are computed allowing complete electronic and nuclear relaxation of the system. With these two densities, the Messem program yields the value of the D function through Eq. 5. After that, the gradients of the D function with respect to the a_0, a_x, and a_c coefficients are computed numerically using the central difference approximation, i.e.,

$$\frac{\partial D}{\partial a_k} \approx \frac{D(a_k + \delta) - D(a_k - \delta)}{2\delta} \qquad (6)$$

with δ = 10^-?. Finally, after checking the convergence of the process, a DFP optimization step is performed. The optimization is considered converged when the norm of the gradient vector falls below 2 x 10^-? au. Throughout this contribution, the final converged a_0, a_x, and a_c parameters will be referred to as the QCISD-density-optimized parameters. Bader topological analyses and maps of electron density differences were carried out with the Electra program developed in our group.
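The central-difference gradient of Eq. 6 and the DFP update can be illustrated with a minimal Python sketch. It is not the authors' program: the real D function requires Gaussian-94 and Messem runs, so a hypothetical quadratic `toy_D` with an arbitrarily placed minimum stands in for it, and the step `delta`, the tolerance `gtol`, and the backtracking line search are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical stand-in for the quadratic error integral D(a_0, a_x, a_c).
TARGET = np.array([0.20, 0.72, 0.81])   # arbitrary location of the minimum
WEIGHTS = np.array([1.0, 2.0, 3.0])     # arbitrary curvatures


def toy_D(a):
    return float(np.sum(WEIGHTS * (np.asarray(a, dtype=float) - TARGET) ** 2))


def central_gradient(f, a, delta=1e-3):
    """Central differences: [f(a_k + delta) - f(a_k - delta)] / (2 delta)."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for k in range(a.size):
        step = np.zeros_like(a)
        step[k] = delta
        grad[k] = (f(a + step) - f(a - step)) / (2.0 * delta)
    return grad


def dfp_minimize(f, a0, delta=1e-3, gtol=1e-6, max_iter=200):
    """Davidon-Fletcher-Powell minimisation driven by numerical gradients."""
    a = np.asarray(a0, dtype=float)
    H = np.eye(a.size)                    # approximate inverse Hessian
    g = central_gradient(f, a, delta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < gtol:      # convergence test on the gradient norm
            break
        p = -H @ g                        # quasi-Newton search direction
        t, f_a = 1.0, f(a)
        # Backtracking (Armijo) line search; an assumption of this sketch.
        while f(a + t * p) > f_a + 1e-4 * t * (g @ p) and t > 1e-12:
            t *= 0.5
        s = t * p
        g_new = central_gradient(f, a + s, delta)
        y = g_new - g
        if s @ y > 1e-14:                 # update only if curvature is positive
            Hy = H @ y
            H = H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)
        a, g = a + s, g_new
    return a


a_opt = dfp_minimize(toy_D, [0.5, 0.5, 0.5])
print(a_opt)
```

Each gradient costs two evaluations of the objective per parameter, which matters in practice because every evaluation of the real D function implies a complete set of electronic structure calculations.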

III. RESULTS AND DISCUSSION

To test the methodology reported above, a study of the CO, N2, and LiF systems has been carried out. We have selected these three molecules because, from the point of view of Bader's atoms-in-molecules theory, they exhibit different bonding nature: LiF is a typical case of closed-shell ionic interaction; N2 is an appropriate example of a molecule with a shared interaction; and CO is a well-known case of an intermediate interaction. Analyses of B3LYP electron densities in these molecules can reveal the behavior of the a_0, a_x, and a_c parameters in these different types of bonding.

A. The CO System

Before starting the process of optimization, it is interesting to represent the quadratic error integral as a function of the a_0, a_x, and a_c parameters to obtain information about possible multiple minima on this surface. Figures 1-3 plot the quadratic error integral as a function of two of Becke's parameters, while the remaining parameter is kept frozen. For each of the three surfaces we have computed 100 points, varying each of the two parameters from 0.1 to 1.0 in steps of 0.1. During the calculation of these surfaces, the geometry of the CO molecule has been kept frozen at the QCISD-optimized geometry.

Optimizing Hybrid Density Functionals


where Z_AA and Z_BB are the self-similarities of molecules A and B, respectively. For positive-definite molecular fields (such as the electron density), C_AB values range from 0 to 1. The value of 1 is only achieved when the molecules under comparison are identical; any dissimilarity between the two molecules is reflected by C_AB values within the (0,1) range; finally, the value of 0 corresponds to the mathematical limit of zero overlap. So far, QSM applications in chemistry have been mainly based on first-order density functions. In fact, not only QSM but also most of the tools applied to study electron distributions in molecules, such as the well-known theory of atoms in molecules, density maps, and others, have relied on this first-order description, although in many aspects of chemistry it would be desirable to go beyond it and analyze second-order density functions directly. This would be especially important in studies in which electron correlation plays a significant role. A definition of an overlap-like second-order QSM using two-electron density functions can be obtained as an extension of measure 7, as described by Carbo et al.:


XAVIER FRADERA, MIQUEL DURAN, and JORDI MESTRES

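$$Z^{(2)}_{AB} = \iint \Gamma_A(\mathbf{r}_1, \mathbf{r}_2)\, \Gamma_B(\mathbf{r}_1, \mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (9)$$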

but, so far, only semiempirical approximations to second-order QSM have been described. There are several reasons why second-order QSM have not come into general use as a convenient tool for the analysis of molecular electron-pair distributions. First of all, second-order density functions are more difficult to visualize than first-order ones, because of their higher dimensionality. Moreover, an overlap-like QSM between ab initio second-order density functions, as in Eq. 9, is computationally too expensive to be applicable to molecular systems, even to small ones. This situation can be alleviated by reducing second-order density functions to intracule and extracule densities. That is, from the coordinates that describe the simultaneous position of two electrons, r_1 and r_2, an intracule coordinate, r, and an extracule coordinate, R, can be defined:

$$\mathbf{r} = \mathbf{r}_1 - \mathbf{r}_2 \qquad (10)$$

$$\mathbf{R} = \frac{\mathbf{r}_1 + \mathbf{r}_2}{2} \qquad (11)$$

Then, the intracule density function, I(r), is defined as

$$I(\mathbf{r}) = \iint \Gamma(\mathbf{r}_1, \mathbf{r}_2)\, \delta\big((\mathbf{r}_1 - \mathbf{r}_2) - \mathbf{r}\big)\, d\mathbf{r}_1\, d\mathbf{r}_2 \; ; \quad \int I(\mathbf{r})\, d\mathbf{r} = \frac{N(N-1)}{2} \qquad (12)$$

and the extracule density function, E(R), as

$$E(\mathbf{R}) = \iint \Gamma(\mathbf{r}_1, \mathbf{r}_2)\, \delta\!\left(\frac{\mathbf{r}_1 + \mathbf{r}_2}{2} - \mathbf{R}\right) d\mathbf{r}_1\, d\mathbf{r}_2 \; ; \quad \int E(\mathbf{R})\, d\mathbf{R} = \frac{N(N-1)}{2} \qquad (13)$$
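The definitions in Eqs. 10-13 can be made concrete with a small one-dimensional sketch in Python. It is only a toy model under stated assumptions: a single electron pair occupying a hypothetical Gaussian orbital with exponent `ALPHA`, so that the pair density is a simple product and integrates to one pair. The function names are illustrative and do not correspond to any program mentioned in the text.

```python
import numpy as np

ALPHA = 1.0  # hypothetical Gaussian exponent of the doubly occupied orbital


def orbital_density(x):
    """Normalised probability density |phi(x)|^2 of a 1D Gaussian orbital."""
    return np.sqrt(2.0 * ALPHA / np.pi) * np.exp(-2.0 * ALPHA * x**2)


def pair_density(x1, x2):
    """Toy pair density for one electron pair; integrates to 1."""
    return orbital_density(x1) * orbital_density(x2)


def intracule(u_values, grid):
    """I(u): integrate the pair density over all pairs with x1 - x2 = u."""
    dx = grid[1] - grid[0]
    return np.array([np.sum(pair_density(grid + u, grid)) * dx for u in u_values])


def extracule(R_values, grid):
    """E(R): integrate the pair density over all pairs with (x1 + x2)/2 = R."""
    ds = grid[1] - grid[0]
    return np.array(
        [np.sum(pair_density(R + grid / 2.0, R - grid / 2.0)) * ds for R in R_values]
    )


grid = np.linspace(-10.0, 10.0, 2001)
step = grid[1] - grid[0]
I = intracule(grid, grid)
E = extracule(grid, grid)
pairs_from_I = np.sum(I) * step   # should approximate the number of pairs
pairs_from_E = np.sum(E) * step
print(pairs_from_I, pairs_from_E, I.max(), E.max())
```

In this model both contracted densities integrate to the single electron pair, the intracule is symmetric about the origin, and the extracule is the narrower and taller of the two functions.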

I(r) and E(R) are the probability density functions for the electron-electron distance and for the electron-pair center of mass, respectively. Like the second-order density itself, both I(r) and E(R) must integrate to the number of electron pairs. I(r) and E(R) have the advantage of reducing the six dimensions of the original second-order density function while keeping an electron-pair character. Thus, I(r) and E(R) are three-dimensional functions like the first-order density function, and are easily visualized. Since the intracule coordinate, r, depends only on the relative positions of the electrons in the molecule, I(r) is invariant to any molecular translation. Another remarkable property of I(r) is that it always shows an inversion center at the point I(0), regardless of the symmetry of the molecule. Additional symmetry elements present in the molecule are also reflected in I(r). On the other hand, the extracule coordinate, R, is directly related to the molecular three-dimensional space, and E(R) shows the same symmetry elements as ρ(r). Calculations of I(r) and E(R) require evaluating many costly four-index two-electron integrals, whose number scales as the fourth power of the number of primitive basis functions. In the past, the lack of proper algorithms to deal efficiently with the computation of those integrals on large grids of points restricted I(r) and E(R) calculations to atoms and small molecules and, in many cases, to longitudinal or transversal atomic or molecular axes rather than rectangular grids. Recently, Cioslowski and Liu have developed a computational scheme that allows for faster calculations of I(r) and E(R) on large grids of points, which has permitted a deeper understanding of the topological characteristics of molecular I(r) and E(R) distributions and their Laplacians. The possibility of obtaining I(r) and E(R) distributions of atoms and molecules efficiently opens the path for calculating second-order QSM as a natural extension of the originally proposed first-order QSM. Thus, overlap-like intracule QSM, Y_AB,

$$Y_{AB} = \int I_A(\mathbf{r})\, I_B(\mathbf{r})\, d\mathbf{r} \qquad (14)$$

and overlap-like extracule QSM, X_AB,

$$X_{AB} = \int E_A(\mathbf{R})\, E_B(\mathbf{R})\, d\mathbf{R} \qquad (15)$$

can be computed, which are quantitative measures of the similarity of molecules A and B as represented by their contracted second-order electron-pair densities, I(r) or E(R). As with one-electron densities, in the particular case of A = B, Y_AA and X_AA are the self-similarity measures quantifying how locally concentrated the I(r) or E(R) distributions of molecule A are. Consideration of self-similarity measures allows second-order QSM to be normalized through the definition of a second-order Carbo similarity index,

$$C^{(2)}_{AB} = \frac{Z^{(2)}_{AB}}{\left(Z^{(2)}_{AA}\, Z^{(2)}_{BB}\right)^{1/2}} \qquad (16)$$

following the original form of the Carbo similarity index. In Eq. 16, Z^(2)_AB generally represents Y_AB or X_AB depending on the use of I(r) or E(R), respectively. The objective of this contribution is to compare the values and trends of quantum similarity measures and indices computed from one-electron, intracule, and extracule densities. The following sections contain, first, a description of the computational details used for evaluating I(r) and E(R) and, second, two illustrative numerical applications on atoms and linear diatomic molecules.
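The normalization in Eq. 16 can be sketched in a few lines of Python. The example assumes that both distributions are tabulated on the same grid and uses two hypothetical one-dimensional Gaussians as stand-ins for real densities; `overlap_qsm` and `carbo_index` are illustrative names, not functions of any program cited here.

```python
import numpy as np


def overlap_qsm(f_a, f_b, volume_element):
    """Overlap-like similarity measure approximated by a sum over grid points."""
    return float(np.sum(f_a * f_b) * volume_element)


def carbo_index(f_a, f_b, volume_element):
    """Similarity measure normalised by the two self-similarities."""
    z_ab = overlap_qsm(f_a, f_b, volume_element)
    z_aa = overlap_qsm(f_a, f_a, volume_element)
    z_bb = overlap_qsm(f_b, f_b, volume_element)
    return z_ab / np.sqrt(z_aa * z_bb)


# Two hypothetical model distributions of different width on a common grid.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
f_a = np.exp(-1.0 * x**2)
f_b = np.exp(-4.0 * x**2)

print(carbo_index(f_a, f_b, dx))
print(carbo_index(f_a, f_a, dx))
```

Because the index is a ratio, it is unchanged when either distribution is multiplied by a constant, which is what allows systems with different numbers of electrons or electron pairs to be compared on the same 0 to 1 scale.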

II. COMPUTATIONAL DETAILS

In this section, the approaches for calculating I(r) and E(R), first, and Y_AB and X_AB, afterwards, are briefly described, with specific mention of the numerical integration schemes used for particular systems having spherical or cylindrical symmetry. Throughout this work, ab initio second-order density matrices have been computed at the Hartree-Fock (HF) level of theory by means of the programs Gaussian 94 and Gamess.

A. Calculation of Intracule and Extracule Densities

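Within the HF approximation, the second-order D_ijkl matrix elements appearing in Eq. 6 can be obtained from the first-order D_ij matrix elements. For closed-shell systems they are evaluated as

$$D_{ijkl} = \frac{1}{2} D_{ij} D_{kl} - \frac{1}{4} D_{il} D_{jk} \qquad (17)$$

whereas for open-shell systems they are computed in a UHF framework as

$$D_{ijkl} = \frac{1}{2} D_{ij} D_{kl} - \frac{1}{2}\left(D^{\alpha}_{il} D^{\alpha}_{jk} + D^{\beta}_{il} D^{\beta}_{jk}\right) \qquad (18)$$

where first-order elements are split into α and β spin contributions,

$$D_{ij} = D^{\alpha}_{ij} + D^{\beta}_{ij} \qquad (19)$$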
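The way a Hartree-Fock second-order matrix follows from first-order matrices can be checked numerically with a short Python sketch. The index convention (first two indices for electron 1, last two for electron 2), the orthonormal basis, and the randomly generated model density matrices are assumptions of this example; the function names are hypothetical.

```python
import numpy as np


def pair_matrix_closed_shell(D):
    """Coulomb-like product minus exchange term for a closed-shell density matrix."""
    coulomb = 0.5 * np.einsum("ij,kl->ijkl", D, D)
    exchange = 0.25 * np.einsum("il,jk->ijkl", D, D)
    return coulomb - exchange


def pair_matrix_uhf(D_alpha, D_beta):
    """Same construction with separate alpha and beta exchange contributions."""
    D = D_alpha + D_beta
    coulomb = 0.5 * np.einsum("ij,kl->ijkl", D, D)
    exchange = 0.5 * (
        np.einsum("il,jk->ijkl", D_alpha, D_alpha)
        + np.einsum("il,jk->ijkl", D_beta, D_beta)
    )
    return coulomb - exchange


def model_density_matrix(n_basis, n_occupied, occupation, seed):
    """Density matrix built from random orthonormal orbitals (orthonormal basis assumed)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(n_basis, n_basis)))
    C = Q[:, :n_occupied]
    return occupation * (C @ C.T)


def number_of_pairs(G):
    """In an orthonormal basis the pair count is the double trace of the matrix."""
    return float(np.einsum("iikk->", G))


# Four electrons in five basis functions: two doubly occupied orbitals.
D = model_density_matrix(5, 2, 2.0, seed=1)
G = pair_matrix_closed_shell(D)
print(number_of_pairs(G))   # compare with N(N-1)/2 for N = 4
```

The double trace reproduces the number of electron pairs, N(N-1)/2, and the open-shell expression reduces to the closed-shell one when the α and β matrices are each half of the total matrix.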

Although I(r) distributions calculated at the HF level do not satisfy the characteristic electron-electron cusp condition at the origin, it has been shown that the main topological features of I(r) and E(R) are already manifested at this level of theory. Future work will be directed toward analyzing the effect of electron correlation on the topology of I(r) and E(R) distributions. At present, there are no analytical expressions for I(r) and E(R) such as Eqs. 5 and 6 for first- and second-order density functions, so the only feasible approach for assessing I(r) and E(R) is the numerical integration of Eqs. 12 and 13 on large grids of points. To perform these numerical integrations efficiently, the algorithmic scheme proposed recently by Cioslowski and Liu has been followed, which divides the computational load into a grid-dependent and a grid-independent part, and reduces the number of integrals that need to be computed by discarding those integrals below an arbitrary significance threshold. A detailed description of this algorithm can be found in Ref. 33. Integration of I(r) or E(R) over all space should return the total number of electron pairs. However, only an approximation to the exact number of electron pairs is obtained, because the integration is carried out numerically on three-dimensional grids as

$$\frac{N(N-1)}{2} \approx \sum_{\mathbf{r}} I(\mathbf{r})\, \Delta\mathbf{r} \qquad (20)$$

$$\frac{N(N-1)}{2} \approx \sum_{\mathbf{R}} E(\mathbf{R})\, \Delta\mathbf{R} \qquad (21)$$

where N is the number of electrons of the system studied, and Δr and ΔR are the grid spacings for the three Cartesian components of the intracule and extracule coordinates, respectively. Throughout this contribution the same grid spacing will be taken for the three Cartesian components. The approximate number of electron pairs obtained from Eqs. 20 and 21 will be used to assess quantitatively the validity and quality of the numerical integration performed. The dependence of this value on the grid extension and spacing will be examined.

B. Calculation of Second-Order Quantum Similarity Measures

Following the numerical integration scheme presented above, the evaluation of overlap-like second-order QSM between I(r) or E(R) distributions is straightforward:

$$Y_{AB} = \sum_{\mathbf{r}} I_A(\mathbf{r})\, I_B(\mathbf{r})\, \Delta\mathbf{r} \qquad (22)$$

$$X_{AB} = \sum_{\mathbf{R}} E_A(\mathbf{R})\, E_B(\mathbf{R})\, \Delta\mathbf{R} \qquad (23)$$

As stated above for the number of electron pairs, the quality of the approximate values obtained for Y_AB and X_AB will depend on the extension and spacing of the grid employed in the numerical integration, as well as on the integral screening threshold used in the calculation of I(r) and E(R), respectively. All of these aspects will be investigated below. Equations 20-23 assume the general definition of three-dimensional density grids, which means consideration of a very large number of points. However, in particular cases, atomic or molecular symmetry can be used to reduce the dimensionality of the grids needed to compute Y_AB and X_AB. For instance, the spherical symmetry of atomic systems can be exploited to compute I(r) and E(R) solely along an axis starting at the nuclear position. In this case, second-order QSM between two atoms can be computed as

$$Y_{AB} = 4\pi \sum_{r} I_A(r)\, I_B(r)\, r^2\, \Delta r \qquad (24)$$

$$X_{AB} = 4\pi \sum_{r} E_A(r)\, E_B(r)\, r^2\, \Delta r \qquad (25)$$
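The radial quadrature of Eqs. 24 and 25 can be tried out on a model system with a short Python sketch. A spherical Gaussian carrying two electrons stands in for a real atomic density (the exponent `ALPHA` is hypothetical and chosen to give a sharp peak), and the choice of sampling the axis at interval midpoints is an assumption, since the text does not state where the axial points are placed.

```python
import numpy as np

ALPHA = 30.0  # hypothetical exponent; larger values give a sharper peak


def model_density(r, n_electrons=2.0, alpha=ALPHA):
    """Spherical Gaussian density normalised to n_electrons."""
    return n_electrons * (2.0 * alpha / np.pi) ** 1.5 * np.exp(-2.0 * alpha * r**2)


def radial_grid(length, spacing):
    """Midpoints of equal intervals along an axis starting at the nucleus."""
    n_points = int(np.ceil(length / spacing))
    return (np.arange(n_points) + 0.5) * spacing


def spherical_sum(values, r, spacing):
    """4*pi * sum of f(r) r^2 dr, the radial quadrature used for atoms."""
    return 4.0 * np.pi * np.sum(values * r**2) * spacing


exact_z = 4.0 * (ALPHA / np.pi) ** 1.5   # analytical self-similarity of the model
grids = [(7.5, 0.2), (10.0, 0.1), (10.0, 0.05), (15.0, 0.01)]  # (length, spacing) in au
results = []
for length, spacing in grids:
    r = radial_grid(length, spacing)
    rho = model_density(r)
    n_el = spherical_sum(rho, r, spacing)
    z_aa = spherical_sum(rho * rho, r, spacing)
    results.append((n_el, z_aa))
    print(f"{len(r):5d} points  n.e. = {n_el:.6f}  Z_AA = {z_aa:.6f}  exact = {exact_z:.6f}")
```

Because the self-similarity integrand is the square of the density, it is sharper than the density itself, so a grid that already reproduces the particle number may still be too coarse for the self-similarity.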

For linear systems, cylindrical I(r) and E(R) distributions can be generated by rotating a planar grid around the internuclear axis. Thus, second-order QSM will be evaluated as

$$Y_{AB} = 2\pi \sum_{x,z} I_A(x, z)\, I_B(x, z)\, x\, \Delta x\, \Delta z \qquad (26)$$

$$X_{AB} = 2\pi \sum_{x,z} E_A(x, z)\, E_B(x, z)\, x\, \Delta x\, \Delta z \qquad (27)$$

where z is the rotation axis and x is an axis perpendicular to it. From a practical point of view, it is important that any two grids being compared have the same grid spacing. When the extension of the grids is not the same, the integration can be carried out only over the region common to both grids. This does not imply a significant loss of accuracy, as long as both grids are sufficiently large to include all regions with any significant contribution to the I(r) or E(R) distributions. Assuming a zero contribution from regions where data from only one of the two grids are available should then be a reasonable approach. Another aspect worth taking into account is that the similarity between two molecules depends on their superposition. As a consequence, the two molecules being compared have to be mutually aligned so as to maximize the corresponding QSM. The optimization of the similarity function, which depends on the particular density definition employed to evaluate the QSM, will also be one of the points of discussion later in this work.
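The cylindrical quadrature of Eqs. 26 and 27, restricted to the region common to two planar grids with the same spacing, can be sketched in Python as follows. The model distributions (sums of normalised three-dimensional Gaussians on the z axis), the exponent, the midpoint sampling, and the dictionary layout used to store a grid are all assumptions of this example.

```python
import numpy as np

STEP = 0.05   # common grid spacing (au); both grids must share it
ALPHA = 2.0   # hypothetical Gaussian exponent


def tabulate(k_min, k_max, n_x, centers, alpha=ALPHA, step=STEP):
    """Tabulate a sum of normalised 3D Gaussians on the half-plane x >= 0.

    z points are (k + 0.5) * step for k in [k_min, k_max); x points are
    (i + 0.5) * step for i in [0, n_x). centers holds (z_position, weight).
    """
    x = (np.arange(n_x) + 0.5) * step
    z = (np.arange(k_min, k_max) + 0.5) * step
    X, Z = np.meshgrid(x, z, indexing="ij")
    norm = (alpha / np.pi) ** 1.5
    values = np.zeros_like(X)
    for z_c, weight in centers:
        values += weight * norm * np.exp(-alpha * (X**2 + (Z - z_c) ** 2))
    return {"k_min": k_min, "k_max": k_max, "values": values}


def cylindrical_overlap(grid_a, grid_b, step=STEP):
    """2*pi * sum f_A f_B x dx dz over the region common to both grids."""
    lo = max(grid_a["k_min"], grid_b["k_min"])
    hi = min(grid_a["k_max"], grid_b["k_max"])
    n_x = min(grid_a["values"].shape[0], grid_b["values"].shape[0])
    a = grid_a["values"][:n_x, lo - grid_a["k_min"]: hi - grid_a["k_min"]]
    b = grid_b["values"][:n_x, lo - grid_b["k_min"]: hi - grid_b["k_min"]]
    x = (np.arange(n_x) + 0.5) * step
    return 2.0 * np.pi * np.sum(a * b * x[:, None]) * step * step


centers = [(-1.0, 1.0), (1.0, 1.0)]            # two equal peaks 2 au apart
large = tabulate(-200, 200, 120, centers)      # z in (-10, 10), x up to 6
small = tabulate(-160, 160, 100, centers)      # z in (-8, 8), x up to 5
exact = 2.0 * (ALPHA / (2.0 * np.pi)) ** 1.5 * (1.0 + np.exp(-ALPHA * 2.0**2 / 2.0))
print(cylindrical_overlap(large, large), cylindrical_overlap(large, small), exact)
```

Storing each grid by integer index range makes the common region an exact intersection of ranges, which avoids floating-point comparisons of coordinates; cropping to the smaller grid changes the result only negligibly here because the discarded region holds almost no density.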

III. APPLICATION EXAMPLES

Two application examples are presented to illustrate the use of second-order QSM for analyzing quantitatively atomic and molecular electron-pair density distributions. A series of two-electron atomic systems (H⁻, He, Li⁺, Be²⁺) is considered first as the simplest case where the values and trends followed by first- and second-order QSM can be analyzed and compared. One-electron, intracule, and extracule similarity matrices for a series of diatomic molecules (N2, CO, LiF) are presented next, and the topologies of the similarity functions arising from the maximization of the different QSM are discussed.

A. Two-Electron Atomic Systems: H⁻, He, Li⁺, Be²⁺

As the simplest case, the H⁻, He, Li⁺, and Be²⁺ two-electron isoelectronic series of atomic systems is studied first. First- and second-order density matrices were computed at the HF/6-31G level of theory by means of the Gaussian 94 package. Since all of these systems are spherically symmetric, ρ(r), I(r), and E(R) values were computed only along an axis starting at the nucleus. Evaluation of the number of electrons and Z_AA from ρ(r), and of the number of electron pairs and Y_AA and X_AA from I(r) and E(R), respectively, was performed numerically by spherical integration. The number of electrons, the number of electron pairs, and the analytical value of Z_AA can then be used to validate the quality of the numerical integration. Table 1 shows the results obtained using four different combinations of length and spacing for the axial calculations. For all systems considered, the correct number of electrons for ρ(r) and of electron pairs for I(r) and E(R) is reproduced with at least two decimal figures when using the coarsest calculation (considering only 38 points along the axial grid) and five decimal figures when using the finest grid (which considers a total of 1500 points along the axis). Note, however, that the number of electron pairs converges faster to the exact value than the number of electrons when the grid used for the numerical integrations is systematically refined. Values of Z_AA obtained from the finest numerical integration scheme reproduce the corresponding analytical values to within five decimal figures for the four atomic systems. However, finer grids are needed to obtain Z_AA values of good quality when going from H⁻ to Be²⁺. This is because ρ(r) attractors become sharper as the number of protons in the nucleus grows, and hence more precise integrations are required. As regards Y_AA and X_AA, although no analytical data are available for comparison, in all cases a fast convergence is achieved when the numerical integration is systematically refined, which can be considered as a guarantee of correctness for these values.

Table 1. Number of Electrons (n.e.) or Electron Pairs (n.e.p.) and Self-Similarities Computed from Several Grids for H⁻, He, Li⁺, and Be²⁺

                 ρ(r)                  I(r)                  E(R)
Atom   Grid(a)   n.e.       Z_AA       n.e.p.     Y_AA       n.e.p.     X_AA
H⁻     1         1.999995   0.08844    0.999806   0.00646    1.000000   0.05165
       2         2.000000   0.08845    1.000000   0.00646    1.000000   0.05165
       3         2.000000   0.08845    1.000000   0.00646    1.000000   0.05165
       4         2.000000   0.08845    1.000000   0.00646    1.000000   0.05165
       exact                0.08845
He     1         1.998323   0.74720    1.000000   0.04503    0.999918   0.35983
       2         2.000000   0.76010    1.000000   0.04503    1.000000   0.36022
       3         2.000000   0.76012    1.000000   0.04503    1.000000   0.36022
       4         2.000000   0.76012    1.000000   0.04503    1.000000   0.36022
       exact                0.76012
Li⁺    1         1.987350   2.83197    0.999984   0.18766    0.998087   1.46224
       2         1.999321   3.05588    1.000000   0.18769    0.999986   1.50129
       3         1.999962   3.07256    1.000000   0.18769    1.000000   1.50153
       4         2.000000   3.07376    1.000000   0.18769    1.000000   1.50153
       exact                3.07376
Be²⁺   1         1.964233   6.42902    0.999882   0.49051    0.987046   3.37330
       2         1.997353   7.76495    0.999999   0.49133    0.999882   3.92408
       3         1.999741   7.91138    1.000000   0.49133    0.999999   3.93061
       4         2.000000   7.92676    1.000000   0.49133    1.000000   3.93063
       exact                7.92676

(a) Grid definitions: 1: length 7.5 au, spacing 0.2 au (38 points); 2: length 10.0 au, spacing 0.1 au (100 points); 3: length 10.0 au, spacing 0.05 au (200 points); 4: length 15.0 au, spacing 0.01 au (1500 points). The rows labeled "exact" give the analytical Z_AA values.

Z_AA is very sensitive both to the number of electrons in the system considered and to the shape of the corresponding ρ(r) distribution, and it is also strongly dependent on local charge density concentrations. Consequently, Z_AA has been used as a measure for analyzing quantitatively the concentration of the electron density distribution in atoms and molecules. For instance, within an isoelectronic series, larger Z_AA values correspond to systems whose electron densities are more locally concentrated, whereas smaller Z_AA values are found for systems whose electron densities are more uniformly distributed. This is indeed the case in the H⁻, He, Li⁺, and Be²⁺ two-electron series, where the trend observed in the respective Z_AA values (0.088, 0.760, 3.074, and 7.927) reflects the local concentration of the electron density toward the nucleus as the atomic number increases. In this case, Y_AA (0.006, 0.045, 0.188, and 0.491) and X_AA (0.052, 0.360, 1.501, and 3.931) values follow the same trend along this series. As this is the simplest case where one-electron densities (two-electron systems) and two-electron densities (one-electron-pair systems) can be compared, it is not surprising that Z_AA, Y_AA, and X_AA are closely related. To compare the electron density distributions of the systems along this isoelectronic series, similarity matrices were constructed by computing the corresponding similarity elements using the finest grid. Since for these systems all atomic density distributions show a single maximum centered at the origin, maximization of the similarity was not necessary in this case.
The three similarity matrices containing the Z_AB, Y_AB, and X_AB similarity measures, respectively, and their corresponding first- and second-order Carbo similarity indices are collected in Table 2. Again, the overall trend is much the same for the three similarity matrices, but some points are worthy of comment. For any given pair, the magnitude of QSM values follows the order Z_AB > X_AB > Y_AB. Comparison between one-electron and electron-pair QSM is not straightforward, since they are related to different numbers of particles or particle interactions. For instance, there are two electrons and only one electron pair in this particular case. This certainly contributes to making Z_AB values larger. As the number of electron pairs grows approximately as half the square of the number of electrons, this effect would be reversed as the atomic number increases, because the number of electron pairs will be much larger than the number of electrons. More consistent comparisons can be made between Y_AB and X_AB values. Because of the inherent definition of intracule and extracule coordinates, I(r) distributions are always more dispersed than E(R). Consequently, X_AB values are found to be larger than Y_AB. As regards similarity indices for a given element, first-order indices are found to be larger than second-order indices, thus indicating that these systems are more similar from the point of view of the one-electron density than from that of the intracule or extracule densities. Interestingly, due to the spherical symmetry of atomic systems, Y_AB and X_AB second-order Carbo similarity indices are exactly the same for any atomic pair in this series.

Table 2. Similarity Matrices for Two-Electron Systems(a)

One-electron similarity matrix
        H⁻       He       Li⁺      Be²⁺
H⁻      0.0884   0.7880   0.5733   0.4335
He      0.2043   0.7601   0.9330   0.8220
Li⁺     0.2989   1.4261   3.0738   0.9672
Be²⁺    0.3629   2.0177   4.7740   7.9268

Intracule similarity matrix
        H⁻       He       Li⁺      Be²⁺
H⁻      0.0065   0.7596   0.4911   0.3380
He      0.0129   0.0450   0.8959   0.7383
Li⁺     0.0170   0.0824   0.1877   0.9496
Be²⁺    0.0190   0.1098   0.2884   0.4913

Extracule similarity matrix
        H⁻       He       Li⁺      Be²⁺
H⁻      0.0516   0.7596   0.4911   0.3380
He      0.1036   0.3602   0.8959   0.7383
Li⁺     0.1367   0.6589   1.5015   0.9496
Be²⁺    0.1523   0.8785   2.3069   3.9306

(a) Diagonal elements are self-similarities, elements below the diagonal are QSM, and elements above the diagonal are Carbo similarity indices.

B. Diatomic Molecules: N2, CO, LiF

This section presents the results for the series of N2, CO, and LiF diatomic molecules and is organized as follows. First, the profiles of ρ(r), I(r), and E(R) along the internuclear axis of these molecules are presented to show the different topological characteristics of each particular density distribution. Next, a systematic study is made of the grid quality required in numerical I(r) and E(R) calculations to evaluate Y_AB and X_AB values with sufficient accuracy. The section continues with a detailed analysis of the topology of the corresponding pairwise similarity functions in terms of the molecular alignments associated with each local similarity maximum. Finally, similarity matrices are constructed, and Z_AB, Y_AB, and X_AB values for the three possible molecular pairs in this series are compared and discussed.

Topology of One-Electron, Intracule, and Extracule Densities

All molecular geometries were optimized at the HF/6-31G* level by means of the Gamess package. First- and second-order density functions obtained at this level of theory were then used to calculate ρ(r), I(r), and E(R). The profiles of ρ(r), I(r), and E(R) when evaluated along the internuclear axis of N2, CO, and LiF are depicted in Figures 1-3, respectively. Interpretation of the topology of ρ(r) distributions for the three molecules considered is very simple, all of them having two attractors located at nuclear positions. The height of the attractors is directly related to the amount of electron density associated with each atom. This can be seen clearly for the LiF molecule, where the Li peak has a value of ca. 10 au while the F peak is about 400 au high (see Figure 3). I(r) and E(R) profiles present two attractors located at the positions defined by the positive and negative values of the internuclear distance and at nuclear positions, respectively, and an additional attractor located at the origin. Because of the electron-pair nature of I(r) and E(R) distributions, the interpretation of attractors in I(r) and E(R) is significantly different from that in ρ(r). For instance, within the Hartree-Fock approximation, intra-atomic electron pairs furnish the attractor at the origin in I(r), while contributing to the attractors at nuclear positions in E(R). On the other hand, in this particular series of diatomic molecules, interatomic electron-pair interactions are responsible for the attractors at internuclear distance positions in I(r), while furnishing the attractor at the origin in E(R) (provided that molecules were previously centered). The topological characteristics of the electron density distributions of the systems under comparison will ultimately determine the topology of the similarity function evaluated from them (vide infra).

Figure 1. ρ(r), I(r), and E(R) profiles along the internuclear axis for N2. Panels: one-electron density, intracule density, and extracule density, each plotted against the internuclear axis (a.u.).

Figure 2. ρ(r), I(r), and E(R) profiles along the internuclear axis for CO. Panels: one-electron density, intracule density, and extracule density, each plotted against the internuclear axis (a.u.).

Figure 3. ρ(r), I(r), and E(R) profiles along the internuclear axis for LiF. Panels: one-electron density, intracule density, and extracule density, each plotted against the internuclear axis (a.u.).


Computation of Second-Order Quantum Similarity Measures

To investigate the grid extension and spacing required to obtain a sufficiently accurate number of electron pairs, several definitions of two-dimensional grids were examined for computing I(r) and E(R). Values for Y_AA and X_AA can then be evaluated following Eqs. 26 and 27, respectively. For E(R) distributions, grid limits were defined by adding an extension value to the coordinates of the two atoms. For I(r) grids, grid limits were determined by adding the extension value to the interatomic distance, in the same plane. In all cases, density values were evaluated in the plane formed by the internuclear axis and one of the axes perpendicular to it. Due to the particular definition of intracule and extracule coordinates, the extension value and grid step used for evaluating E(R) were both doubled when evaluating I(r). In this way, for the same number of points, Y_AA and X_AA values computed numerically from these grid definitions are expected to achieve comparable accuracy. In addition to the dependence of the number of electron pairs, Y_AA, and X_AA on the extension value and the step of the grids, the effect of using different values for the integral neglect threshold when computing I(r) and E(R) was also studied. Results obtained from I(r) and E(R) for the series of N2, CO, and LiF molecules are listed in Tables 3 and 4, respectively. The series of grids employed for evaluating I(r) and E(R) are given in order of increasing number of points. As a general trend, the accuracy of the number of electron pairs obtained by numerical integration of I(r) and E(R) distributions (Eqs. 20 and 21, respectively) increases systematically as the grid is further extended and refined. It is also worth noting that results obtained by setting the integral neglect threshold to 10^-? do not differ significantly from those obtained with a threshold of 10^-?. For that reason, I(r) and E(R) were evaluated on the two finest, and thus computationally most expensive, grids with the integral neglect threshold set to 10^-?.

So far, the number of electron pairs obtained numerically for the different molecules considered has been used to check the accuracy of I(r) and E(R) calculations on grids of points. In this sense, it is reasonable to expect a similar trend in accuracy for the second-order quantum self-similarity measures, Y_AA and X_AA. However, since evaluation of similarity measures involves products of density values, Y_AA and X_AA are expected to be strongly dependent on the grid step. This effect will be particularly large in the regions around the attractors, which contribute significantly to the total similarity. In contrast, the dependence on the extension of the grid is smaller, because the superposition in external regions involves products of low-density values, which contribute very little to the total value of the similarity measure. As an example of this critical effect in the N2 molecule, comparison of the number of electron pairs and of the Y_AA and X_AA values in Tables 3 and 4, respectively, shows that while the number of electron pairs in grids 3 and 8 [where the only difference is that the grid step has been refined from 0.20 to 0.02 au in I(r) and from 0.10 to 0.01 au in E(R)] changes from 90.011618 to 90.990813 (in Table 3) and from 89.988485 to 90.990581 (in Table 4), Y_AA (in Table 3) changes from 128.650455 to 143.906903, and X_AA (in Table 4) changes from 1249.788903 to 1430.118215. Comparable effects are found for the CO and LiF molecules. Taken together, the results gathered in Tables 3 and 4 establish that small grid steps [0.02 au for I(r) and 0.01 au for E(R)] are required to obtain values for the number of electron pairs, Y_AA, and X_AA with acceptable accuracy. Consequently, all second-order similarities discussed below were evaluated using the definition of grid 8 in Tables 3 and 4.

Table 3. Number of Electron Pairs (n.e.p.) and Self-Similarities Computed from I(r) for the N2, CO, and LiF Molecules

                               N2                        CO                        LiF
Grid   Step   Ext.   Thres.    n.e.p.      Y_AA          n.e.p.      Y_AA          n.e.p.      Y_AA
1      0.20   5.0    10^-?     89.954418   128.650437    89.958387   132.475271    65.246080   97.756019
2      0.20   5.0    10^-?     89.954417   128.650442    89.958293   132.475262    65.248329   97.755845
3      0.20   10.0   10^-?     90.011618   128.650455    90.011981   132.475555    65.341073   97.756030
4      0.20   10.0   10^-?     90.011617   128.650449    90.011886   132.475271    65.345114   97.755856
5      0.10   10.0   10^-?     90.767124   140.695092    90.767405   144.324330    65.838921   103.210131
6      0.10   10.0   10^-?     90.767122   140.695098    90.767311   144.324043    65.842961   103.219955
7      0.04   10.0   10^-?     90.963268   143.524521    90.963391   147.097547    65.970634   104.591502
8      0.02   10.0   10^-?     90.990813   143.906903    90.990918   147.473385    65.989160   104.776202

Table 4. Number of Electron Pairs (n.e.p.) and Self-Similarities Computed from E(R) for the N2, CO, and LiF Molecules

                               N2                        CO                        LiF
Grid   Step   Ext.   Thres.    n.e.p.      X_AA          n.e.p.      X_AA          n.e.p.      X_AA
1      0.10   2.5    10^-?     89.942172   1249.788832   89.940503   1294.987260   65.210030   958.056875
2      0.10   2.5    10^-?     89.942172   1249.788832   89.940495   1294.987060   65.210924   958.059042
3      0.10   5.0    10^-?     89.988485   1249.788903   89.984618   1294.987321   65.303102   958.056971
4      0.10   5.0    10^-?     89.988485   1249.788902   89.984609   1294.987121   65.304345   958.059139
5      0.05   5.0    10^-?     90.761349   1392.135885   90.760501   1432.041380   65.833487   1014.828059
6      0.05   5.0    10^-?     90.761349   1392.135884   90.760493   1432.041152   65.833729   1014.830244
7      0.02   5.0    10^-?     90.962346   1425.586745   90.962221   1464.330352   65.972046   1028.901076
8      0.01   5.0    10^-?     90.990581   1430.118215   90.990561   1468.707001   65.991678   1030.790246

Maximization of the Similarity Functions

In this section, the dependence of Z_AB, Y_AB, and X_AB on the molecular alignment is discussed. A point that deserves special comment here is that I(r) distributions are invariant to molecular translations. Consequently, an important aspect of evaluating second-order similarities from I(r) is that, provided that linear molecules are previously defined along the same axis, no similarity maximization is needed for Y_AB. Actually, a Y_AB similarity measure cannot be assigned to a unique molecular alignment in the real molecular space, but to a set of molecular alignments that share the same orientation. Thus, in general, for rigid matchings between two molecules A and B with coordinates r_A and r_B, while the Z_AB(r_A, r_B) and X_AB(r_A, r_B) similarity functions must be optimized in a six-dimensional space (three translational and three rotational degrees of freedom), the Y_AB(r_A, r_B) similarity function needs to be optimized only in a three-dimensional space (three rotational degrees of freedom). The question of Z_AB similarity maximization has recently been studied, and it has been shown that, since ρ(r) distributions are strongly localized on atomic nuclei, the optimal superpositions are achieved when atomic nuclei overlap strongly; the heavier the atoms, the more important their contribution to the total similarity and the more dominant their overlap in the superposition. The same situation is expected to occur for E(R) distributions. As only linear molecules are considered in this work, similarity maxima for ρ(r) and E(R) distributions will arise from alignments in which the internuclear axes of both molecules coincide. In consequence, Z_AB and X_AB maxima can be located by overlapping both molecular axes and allowing one of the molecules under comparison to move along the common axis. This was done for the three possible molecular pairs from the current set: {N2,CO}, {N2,LiF}, and {CO,LiF}.
Figure 4. C^Z_AB (dotted line) and C^X_AB (solid line) similarity functions for the {N2,CO} pair, plotted against the displacement (a.u.).

In addition, for the {CO,LiF} pair, two orientations have been taken into account, corresponding to the two possible relative orientations of these two molecules. Figure 4 depicts the variation of the first-order (C^Z_AB) and second-order extracule (C^X_AB) Carbo similarity indices for the {N2,CO} pair as N2 is translated over the CO molecule, which is kept at the origin. The use of similarity indices instead of similarity measures allows a better comparison between the two types of similarity obtained when molecules are represented by their ρ(r) and E(R) distributions. Each maximum of C^Z_AB and C^X_AB in Figure 4 has been associated with a given molecular alignment and labeled accordingly. The five molecular matchings recognized when comparing {N2,CO}, together with the relative position of the molecules and the similarity index values, are collected in Table 5. Three sharp maxima appear in the C^Z_AB function: the global maximum (3, C^Z_AB = 0.9410) arises from aligning the CO and N2 molecules by optimally matching one N atom with the C atom and the other N atom with the O atom (N-C, N-O). Since N-N and C-O

Table 5. Displacement between the Centers of the Molecules (D, in au), and C^Z_AB and C^X_AB Values, for the Different Maxima Located in the Corresponding Similarity Functions for the {N2,CO} Pair (See Figure 4). Values at maxima are marked with an asterisk.

Matchings in {N2,CO}       D       C^Z_AB      C^X_AB
1. (N-C)                 -2.06     0.3766*     0.1393
2. (N-•, •-C)            -0.98     0.0530      0.5299*
3. (N-C, N-O)             0.01     0.9410*     0.9782*
4. (N-•, •-O)             1.01     0.0602      0.7368*
5. (N-O)                  2.06     0.5984*     0.2476



distances are quite similar (2.0378 au and 2.1047 au, respectively), atoms from the two molecules can be closely superposed. The other two maxima appear when aligning one N with C (1, C^Z_AB = 0.3766) and one N with O (5, C^Z_AB = 0.5984). This result is fully consistent with previous studies on Z_AB measures: since ρ(r) distributions are strongly localized around atomic nuclei, the major contributions to the Z_AB measure come from very close atom-atom overlaps. As atoms begin to separate, their contribution to the similarity function diminishes very quickly. The situation is quite different for the C^X_AB profile. In this case, there are also three maxima, but while the global maximum (3, C^X_AB = 0.9782) is again assigned to the matching of the (N-C, N-O) atoms and of the attractors at the centers of the two molecules (labeled •-•), the other two maxima arise, respectively, from matching the center of the N2 molecule with the C atom and the center of the CO molecule with one of the N atoms (2, C^X_AB = 0.5299), and from matching the N2 center with the O atom and the center of the CO molecule with one N atom (4, C^X_AB = 0.7368). For the (N-C) and (N-O) alignments located as maxima in C^Z_AB, only slight shoulders appear in the C^X_AB function (matchings 1 and 5). To understand the differences between the Z_AB and X_AB similarity spaces, one must go back to the topologies of the corresponding ρ(r) and E(R) distributions (Figures 1 and 2). As stated above, ρ(r) distributions present strong, sharp attractors at the nuclei, and ρ(r) values decay quickly away from these attractors. E(R) distributions, on the other hand, present attractors at the nuclei but also at the centers of the molecules (due to electron-electron interatomic interactions). For N2 and CO, the strongest peaks in the E(R) distributions are those located at the origin (the centers of the molecules).
E(R) values at the attractors are consistent with the number of electron pairs contributing to them, calculated from the number of electrons that would be formally assigned to each atom. Following this qualitative approach, we should find 28 electron pairs for the O attractor, 21 for N, 15 for C, 49 for the N2 center, and 48 for the CO center. The relation between these figures is in good qualitative agreement with the actual E(R) values at the corresponding attractors. It may now be clearer that the global maximum in C^X_AB collects contributions from the matchings (N-C, N-O, •-• in 3), whereas the other two maxima get contributions from the (•-N, •-C in 2) alignments and from the (•-N, •-O in 4) alignments (see Table 5). The other two maxima identified in the C^Z_AB function (matchings 1 and 5) appear only as shoulders in C^X_AB because only a single attractor from each E(R) distribution is overlapping (N-C in 1 and N-O in 5). Finally, another point worth mentioning is that the shape of the C^X_AB function is significantly smoother than the shape of the C^Z_AB function, mainly due to the close proximity of maxima in C^X_AB.

Similarity-index functions for the {N2,LiF} pair are depicted in Figure 5. As in the previous case, attractor alignments are labeled in the figure and listed in Table 6. The first interesting result from Figure 5 is that the global maxima for C^Z_AB and C^X_AB are associated with different molecular alignments. It is also observed that the topology of the C^Z_AB and C^X_AB functions for the {N2,LiF} pair is now more complicated than it was for the {N2,CO} pair. Focusing first on the variation of C^Z_AB when translating N2 along LiF, two important maxima are clearly visible, matchings 1 and 4, which are associated with external and internal (N-F) alignments with C^Z_AB values of 0.6865 and 0.6922, respectively. Furthermore, two additional maxima appear, matchings 6 and 9, which can be assigned to internal and external (N-Li) alignments with C^Z_AB values of 0.1238 and 0.0911, respectively. Four maxima appear when comparing {N2,LiF} from ρ(r) distributions, instead of the three found for {N2,CO}, because the internuclear distance in N2 (2.0378 au) is significantly smaller than in LiF (2.9384 au). In contrast, the N2 internuclear distance was comparable to that in CO (2.1047 au). Consequently, the four atoms cannot be matched simultaneously, as was the case in {N2,CO}, and an additional maximum appears. Three maxima appear in the C^X_AB function for {N2,LiF}. The global maximum (2, C^X_AB = 0.8244) is assigned to the overlap of (•-F) attractors, and the other two are assigned to the internal (N-F) alignment (4, C^X_AB = 0.6348) and to the overlap of the two central (•-•) attractors in E(R) (5, C^X_AB = 0.6497). Alignments of other attractors in the topology of E(R) do not give rise to a maximum in this case, but to a shoulder in the C^X_AB function. The internal (N-F) alignment is the only one giving rise to a maximum (matching 4) in both the C^Z_AB and C^X_AB functions (see Table 6). The differences between the C^Z_AB and C^X_AB similarity functions for the {N2,CO} and {N2,LiF} pairs are due to the differences in the electron density distributions of CO and LiF. From the one-electron density for the CO molecule, it is observed that the density at the position of the O atom is approximately twice as high as that at the C atom (see Figure 2).
For the LiF molecule, the density at the position of the Li atom is more than ten times smaller than that at the F atom (see Figure 3). This is reflected by the C^Z_AB values for matchings 1 (N-C) and 5 (N-O) for the {N2,CO} pair (see Table 5) and matchings 1 (N-F) and 9 (N-Li) for the {N2,LiF} pair (see Table 6), from which the following order for atom-atom matchings is observed: (N-F) > (N-O) > (N-C) > (N-Li). Moreover, it has been pointed out that, while for the {N2,CO} pair the global maximum arises from a double (N-C, N-O) alignment (matching 3 in Figure 4), an (N-F, N-Li) alignment for the {N2,LiF} pair is not possible because of the large LiF interatomic distance. As regards the C^X_AB function, results show that the global maxima for {N2,CO} and {N2,LiF} arise from matching the two highest attractors in the corresponding E(R) distributions which, in N2 and CO, correspond to the attractor at the center of mass (furnished by electron-pair interatomic interactions) but, in LiF, correspond to the attractor at the position of the F atom (furnished by electron-pair intra-atomic interactions). This is the reason why the global maximum for {N2,LiF} in Figure 5 aligns the center of the N2 molecule with the F atom, and not with the center of the LiF molecule.

The results of the similarity study on the {CO,LiF} pair can be anticipated from the discussion above for the {N2,CO} and {N2,LiF} pairs. However, the similarity study of the {CO,LiF} pair has the additional interest of requiring the exploration of the similarity functions for two possible relative orientations of one molecule with respect to the other. The results can be visually analyzed in Figures 6 and 7. Values of C^Z_AB and C^X_AB at each similarity-index maximum are gathered in Tables 7 and 8. Following the arguments stated above, regardless of the relative orientation of the two molecules, C^Z_AB has a maximum when (O-F) are aligned (matching 1 for the [CO,FLi] orientation in Figure 6 and matching 4 for the [OC,FLi] orientation in Figure 7). For the [OC,FLi] orientation, C^Z_AB at the global maximum (0.8403) is slightly larger than that for the [CO,FLi] orientation (0.8338), because of the small additional overlap of the C atom with the Li atom. The second most important maximum occurs when (C-F) are aligned. In this case, the C^Z_AB value for the [CO,FLi] orientation (0.5120) is slightly larger than that for the [OC,FLi] orientation (0.5013), because of the extra overlap of the O atom with the Li atom. On the other hand, the global maxima of the C^X_AB functions are achieved when matching the center of the CO molecule with the F atom. The overlap of the O atom (matching 2 in the [CO,FLi] orientation) or the C atom (matching 2 in the [OC,FLi] orientation) with the center of the LiF molecule also contributes to the C^X_AB value at the global maximum, the former giving rise to a larger C^X_AB value (0.8499) than the latter (0.7645).

Construction of Similarity Matrices

Similarity matrices containing the values of first- and second-order similarity measures and indices at the global maxima located in the previous section for each molecular pair are presented in Table 9.

Molecular Self-Similarities. Self-similarity values will be discussed first. Self-similarities are reported in the diagonal of the similarity matrices in Table 9.
According to these values, the following orderings can be derived:

Z_AA: LiF > CO > N2
Y_AA: CO > N2 > LiF
X_AA: CO > N2 > LiF

The usefulness of Z_AA as a quantitative measure of electronic concentration (or dispersion) has already been discussed. In our series of molecules, LiF (121.8149) is the molecule showing the highest concentration of the one-electron density, despite having fewer electrons (12) than CO and N2 (14). This is because most of the electron density in LiF is locally concentrated around the F atom. By the same argument, the one-electron density in CO (112.5459) is more locally concentrated (around the O atom) than in N2 (104.6178), which has its one-electron density more uniformly distributed: while CO has one attractor ca. 300 au high (on O) and one ca. 125 au high (on C), N2 has two attractors ca. 200 au high (see the one-electron density distributions in Figures 1 and 2).
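The Carbo indices reported in Table 9 can be checked against the similarity measures through the standard definition C_AB = Z_AB / sqrt(Z_AA Z_BB). The following minimal sketch does so for the intracule values printed in Table 9; the dictionary layout and function name are illustrative choices, not part of the original work.

```python
import math

# Intracule similarity measures as printed in Table 9 (self-similarities and pairs).
INTRACULE = {
    ("N2", "N2"): 143.9069,
    ("CO", "CO"): 147.4734,
    ("LiF", "LiF"): 104.7762,
    ("N2", "CO"): 144.4941,
    ("N2", "LiF"): 99.8228,
    ("CO", "LiF"): 104.3031,
}


def carbo_index(measures, a, b):
    """Carbo index from a pair measure and the two self-similarities."""
    return measures[(a, b)] / math.sqrt(measures[(a, a)] * measures[(b, b)])


if __name__ == "__main__":
    for pair in (("N2", "CO"), ("N2", "LiF"), ("CO", "LiF")):
        print(pair, f"{carbo_index(INTRACULE, *pair):.4f}")
```

To four decimal places the computed values agree with the intracule Carbo indices listed in Table 9 (0.9919, 0.8129, and 0.8391).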

Figure 5. C^Z_AB (dotted line) and C^X_AB (solid line) similarity functions for the {N2,LiF} pair, plotted against the displacement (a.u.).

Table 6. Displacement between the Centers of the Molecules (D, in au), and C^Z_AB and C^X_AB Values, for the Different Maxima Located in the Corresponding Similarity Functions for the {N2,LiF} Pair (See Figure 5). Values at maxima are marked with an asterisk.

Matchings in {N2,LiF}      D       C^Z_AB      C^X_AB
1. (N-F)                 -2.48     0.6865*     0.3602
2. (•-F)                 -1.44     0.0532      0.8244*
3. (N-•)                  --(a)    --          --
4. (N-F)                 -0.45     0.6922*     0.6348*
5. (•-•)                 -0.06     0.1086      0.6497*
6. (N-Li)                 0.42     0.1238*     0.3702
7. (N-•)                  --       --          --
8. (•-Li)                 --       --          --
9. (N-Li)                 2.48     0.0911*     0.0194

Note: (a) Dash indicates absence of maximum in both C^Z_AB and C^X_AB.

The ordering becomes quite different when comparing the values obtained for two-electron self-similarities. According to Y_AA and X_AA, the molecules are ordered as CO > N2 > LiF, although CO and N2 are much closer to each other than N2 and LiF. This trend can be easily rationalized if one realizes that only 66 electron-pair interactions

Figure 6. Carbo similarity index functions C^Z_AB and C^X_AB for the {CO,LiF} pair ([CO,FLi] orientation), plotted against the displacement (a.u.).

Table 7. Displacement between the Centers of the Molecules (D, in au), and C^Z_AB and C^X_AB Values, for the Different Maxima (Shown in Boldface in Print) Located in the Corresponding Similarity Functions for the {CO,LiF} Pair for the [CO,FLi] Orientation (See Figure 6)

Matchings in {CO,LiF}, [CO,FLi] orientation      D       C^Z_AB      C^X_AB
1. (O-F)                                       -2.51     0.8338      0.4804
2. (•-F)                                       -1.44     0.0478      0.8499
3. (O-•)                                        --(a)    --          --
4. (C-F)                                       -0.41     0.5120      0.5467
5. (•-•)                                       -0.05     0.1102      0.5776
6. (O-Li)                                       0.39     0.1306      0.3379
7. (C-•)                                        --       --          --
8. (•-Li)                                       --       --          --
9. (C-Li)                                       2.51     0.0729      0.0145

Note: (a) Dash indicates absence of maximum in both C^Z_AB and C^X_AB.


Figure 7. Carbo similarity index functions C^Z_AB and C^X_AB for the {CO,LiF} pair ([OC,FLi] orientation), plotted against the displacement (a.u.).

Table 8. Displacement between the Centers of the Molecules (D, in au), and C^Z_AB and C^X_AB Values, for the Different Maxima (Shown in Boldface in Print) Located in the Corresponding Similarity Functions for the {CO,LiF} Pair for the [OC,FLi] Orientation (See Figure 7)

Matchings in {CO,LiF}, [OC,FLi] orientation: 1. (C-F); 2. (•-F); 3. (C-•); 4. (O-F); 5. (•-•); 6. (C-Li); 7. (•-Li); 8. (O-•); 9. (O-Li)

D: -2.51, -1.44, --(a), -0.41, -0.37, -0.10, 0.36, --, 0.92, 2.51

C^Z_AB: 0.5013, 0.0486, 0.8403, 0.7960, 0.1519, 0.1171, 0.0355, 0.1036

C^X_AB: 0.2521, 0.7645, 0.7276, 0.7298, 0.7328, 0.4472, 0.3191, 0.0254

Note: (a) Dash indicates absence of maximum in both C^Z_AB and C^X_AB.


Table 9. Similarity Matrices for the Set of N2, CO, and LiF Diatomic Molecules(a)

One-electron similarity matrix
             N2           CO          LiF
N2       104.6178       0.9410       0.6922
CO       102.1060     112.5459       0.8403
LiF       78.1381      97.6297     121.8149

Intracule similarity matrix
             N2           CO          LiF
N2       143.9069       0.9919       0.8129
CO       144.4941     147.4734       0.8391
LiF       99.8228     104.3031     104.7762

Extracule similarity matrix
             N2           CO          LiF
N2      1430.1182       0.9782       0.8244
CO      1417.7297    1468.7070       0.8499
LiF     1000.8985    1045.8006    1030.7902

Note: (a) Lower triangle, quantum similarity measures (QSM); upper triangle, Carbo indices; diagonal, self-similarities.

are possible in LiF, in comparison with the 91 electron-pair interactions in CO and N2. However, within the two isoelectronic molecules, the I(r) and E(R) distributions for CO are slightly more concentrated than those for N2. Furthermore, comparison of Z_AA with Y_AA and X_AA reveals that the density redistribution between N2 and CO is not as important for the I(r) and E(R) distributions as it was for the ρ(r) distributions. For instance, it can be observed from the E(R) distributions presented in Figures 1 and 2 that the attractors at the center of mass of both molecules are ca. 400 au high and, in fact, the attractor in N2 is slightly higher than that in CO. This result is consistent with the formal assignment of 49 and 48 electron-electron interatomic interactions in N2 and CO, respectively.

Pairwise Molecular Similarities. Pairwise comparisons between molecules can be performed by analyzing the off-diagonal terms in the similarity matrices. The following discussion is based on Carbo similarity indices, which provide a more convenient means of comparing molecules across different types of similarity measures. From the values in Table 9, it follows that for all similarity matrices the ordering of the three off-diagonal elements is {N2,CO} > {CO,LiF} > {N2,LiF}, as could be qualitatively expected from the electronic nature of the molecules under study. A more detailed analysis reveals that for the {N2,CO} and {CO,LiF} pairs the differences between first-order (C^Z_AB) and second-order (C^Y_AB and C^X_AB) similarity indices are relatively small. For example, for the {N2,CO} pair, they are ordered as C^Y_AB > C^X_AB > C^Z_AB, while for the {CO,LiF} pair the ordering is C^X_AB > C^Z_AB > C^Y_AB. In contrast, for the {N2,LiF} pair, the ordering is C^X_AB > C^Y_AB > C^Z_AB and there is a large quantitative difference between the respective values: 0.8244, 0.8129, and 0.6922 (Table 9). These trends can be understood by looking at the ρ(r), I(r), and E(R) profiles in Figures 1-3. Essentially, the low height (ca. 200 au) of the attractors on N in the ρ(r) distribution for N2 and the greater height (ca. 400 au) of the attractor on F in the ρ(r) distribution for LiF are responsible for the C^Z_AB value of 0.6922 for the similarity between N2 and LiF (matching 4 in Figure 5). The situation is reversed when comparing the E(R) distributions of the two molecules. In this case, the central attractor for N2 (ca. 400 au) is higher than that for LiF (ca. 250 au), which gives a C^X_AB value of 0.8244.
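The electron-pair counts quoted in this section (28, 21, 15, 49, 48, and the totals of 91 and 66) follow from elementary counting: n electrons formally assigned to one atom form n(n-1)/2 intra-atomic pairs, and n_A and n_B electrons on two atoms form n_A n_B interatomic pairs. The sketch below reproduces these figures. The helper names are hypothetical, and the neutral-atom assignment used for LiF (3 and 9 electrons) is an assumption made here only to recover the total.

```python
def intra_atomic_pairs(n_electrons):
    """Number of distinct electron pairs within one atom."""
    return n_electrons * (n_electrons - 1) // 2


def inter_atomic_pairs(n_a, n_b):
    """Number of electron pairs with one electron on each atom."""
    return n_a * n_b


def pair_partition(n_a, n_b):
    """Formal pair counts for the two atomic attractors and the central one."""
    counts = {
        "A": intra_atomic_pairs(n_a),
        "B": intra_atomic_pairs(n_b),
        "center": inter_atomic_pairs(n_a, n_b),
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    print("N2 :", pair_partition(7, 7))   # N and N
    print("CO :", pair_partition(6, 8))   # C and O
    print("LiF:", pair_partition(3, 9))   # assumed neutral-atom assignment
```

The totals equal N(N-1)/2 for the whole molecule, which gives 91 for the 14-electron molecules and 66 for the 12-electron LiF regardless of how the electrons are divided between the two atoms.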

IV. CONCLUSIONS

A comparison of one-electron, intracule, and extracule similarity measures and indices computed from the respective density distributions for a series of atomic and molecular systems has revealed that, although similar trends can be observed in some cases, in general the values for the three types of similarity do not have to follow the same trend. Furthermore, it has been shown how the topological characteristics of one-electron, intracule, and extracule density distributions determine the topology of the similarity functions. As a consequence, different similarity measures can lead to different optimal alignments associated with their global similarity maximum. We hope that future algorithmic and computational developments will allow I(r) and E(R) distributions to be computed on large grids of points for larger molecular systems. This would allow the behavior of first- and second-order similarities to be compared for a larger series of molecules, and may reveal applications in which the newly defined second-order similarities perform better than the widely used first-order similarities. In particular, second-order similarities computed from intracule densities appear to be a good choice for analyzing the quality of wave functions calculated at different levels of theory, because of their inherent advantage during the alignment procedure and the special sensitivity of the I(0) attractor to correlation effects.

ACKNOWLEDGMENTS

This work has been supported by the Spanish DGICYT Project No. PB95-0762. X.F. benefits from a doctoral fellowship from the University of Girona. We also thank the Centre de Supercomputació de Catalunya (CESCA) for a generous allocation of computing time.

REFERENCES

1. Löwdin, P. O. Phys. Rev. 1955, 97 (6), 1474.
2. Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
3. Sola, M.; Mestres, J.; Oliva, J. M.; Duran, M.; Carbo, R. Int. J. Quantum Chem. 1996, 58, 361.
4. Carbo, R.; Domingo, L. Int. J. Quantum Chem. 1987, 32, 517.
5. Cioslowski, J.; Fleischmann, E. D. J. Am. Chem. Soc. 1991, 113, 64.
6. Cooper, D. L.; Allan, N. L. J. Am. Chem. Soc. 1992, 114, 4773.
7. Carbo, R.; Calabuig, B.; Vera, L.; Besalu, E. Adv. Quantum Chem. 1994, 25, 253.
8. Mestres, J.; Sola, M.; Duran, M.; Carbo, R. J. Comput. Chem. 1994, 15, 1113.
9. Sola, M.; Mestres, J.; Carbo, R.; Duran, M. J. Am. Chem. Soc. 1994, 116, 5909.
10. Besalu, E.; Carbo, R.; Mestres, J.; Sola, M. Top. Curr. Chem. 1995, 173, 31.
11. Constans, P.; Carbo, R. J. Chem. Inf. Comput. Sci. 1995, 35, 1046.
12. Mestres, J.; Sola, M.; Carbo, R.; Luque, F. J.; Orozco, M. J. Phys. Chem. 1996, 100, 606.
13. Sola, M.; Mestres, J.; Carbo, R.; Duran, M. J. Chem. Phys. 1996, 104, 636.
14. Carbo, R.; Besalu, E.; Amat, L.; Fradera, X. J. Math. Chem. 1996, 19, 47.
15. Cioslowski, J.; Stefanov, B.; Constans, P. J. Comput. Chem. 1996, 17, 1352.
16. Carbo-Dorca, R.; Mezey, P. G., Eds. Advances in Molecular Similarity, Vol. 1; JAI Press: Greenwich, CT, 1996.
17. Fradera, X.; Amat, L.; Besalu, E.; Carbo-Dorca, R. Quant. Struct.-Act. Relat. 1997, 16, 25.
18. Constans, P.; Amat, L.; Carbo-Dorca, R. J. Comput. Chem. 1997, 18, 826.
19. Bader, R. F. W. Atoms in Molecules: A Quantum Theory; Clarendon: London, 1990.
20. Ponec, R. In Ref. 16.
21. Strnad, M.; Ponec, R. Int. J. Quantum Chem. 1994, 49, 35.
22. Coleman, A. J. Int. J. Quantum Chem. 1967, 1S, 457.
23. Thakkar, A. J.; Smith, V. H., Jr. Chem. Phys. Lett. 1976, 42, 476.
24. Carlsson, A. E.; Ashcroft, N. W. Phys. Rev. B 1982, 25, 3474.
25. Thakkar, A. J. J. Chem. Phys. 1986, 84, 6830.
26. Cioslowski, J.; Stefanov, B.; Tang, A.; Umrigar, C. J. J. Chem. Phys. 1995, 103, 6093.
27. Wang, J.; Smith, V. H., Jr. Chem. Phys. Lett. 1994, 220, 331.
28. Sarasola, C.; Dominguez, L.; Aguado, M.; Ugalde, J. M. J. Chem. Phys. 1992, 96, 6778.
29. Thakkar, A. J.; Tripathi, A. N.; Smith, V. H., Jr. Int. J. Quantum Chem. 1984, 26, 157.
30. Breitenstein, M.; Meyer, H.; Schweig, A. Chem. Phys. 1988, 124, 47.
31. Wang, J.; Smith, V. H., Jr. Int. J. Quantum Chem. 1994, 49, 147.
32. Ugalde, J. M.; Sarasola, C. Phys. Rev. A 1994, 49, 3081.
33. Cioslowski, J.; Liu, G. J. Chem. Phys. 1996, 105, 4151.
34. Cioslowski, J.; Liu, G. J. Chem. Phys. 1996, 105, 8187.
35. Fradera, X.; Duran, M.; Mestres, J. J. Chem. Phys. 1997, 107, 3576.
36. Fradera, X.; Duran, M.; Mestres, J. Theor. Chem. Acc. 1998, 99, 44.
37. Gaussian 94: Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Gill, P. M. W.; Johnson, B. G.; Robb, M. A.; Cheeseman, J. R.; Keith, T.; Petersson, G. A.; Montgomery, J. A.; Raghavachari, K.; Al-Laham, M. A.; Zakrzewski, V. G.; Ortiz, J. V.; Foresman, J. B.; Peng, C. Y.; Ayala, P. Y.; Chen, W.; Wong, M. W.; Andres, J. L.; Replogle, E. S.; Gomperts, R.; Martin, R. L.; Fox, D. J.; Binkley, J. S.; Defrees, D. J.; Baker, J.; Stewart, J. P.; Head-Gordon, M.; Gonzalez, C.; Pople, J. A. Gaussian, Inc.: Pittsburgh, PA, 1995.
38. Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comput. Chem. 1993, 14, 1347.
39. Constans, P.; Amat, L.; Fradera, X.; Carbo-Dorca, R. In Ref. 16.


THE COMPLEMENTARITY PRINCIPLE AND ITS USES IN MOLECULAR SIMILARITY AND RELATED ASPECTS

Jerry Ray Dias

Abstract
I. Introduction
II. Basic Definitions
III. Aufbau Principle
IV. Results and Discussion
   A. Self-Complementary Molecular Graphs
   B. Infinite Series of Molecular Graphs that Are Pairwise Strongly Subspectral
V. Conclusion
References

ABSTRACT

Properties and theorems of complementary molecular graphs are delineated. Aufbau constructions that constitute inductive proofs are included. Collections of strongly subspectral molecular graphs are tabulated.



I. INTRODUCTION

Molecular modeling involves the analysis of a given structure in terms of its elementary substructures, stereochemistry, symmetry, shape, size, and similarity to other structures. These six S's (structure, stereochemistry, symmetry, shape, size, and similarity) are intricately related and interwoven. When comparing two molecular structures, some type of similarity is sought whereby one might be characterized by what is known about the other. Similarity is the degree of overlap between two or more structures and has been the subject of numerous studies. The more elementary substructures (e.g., atoms, bonds, fragments, subgraphs, functional groups) two molecules have in common and the closer they are in size and symmetry, the more similar they are. Similarity serves as a conceptual and molecular modeling tool that allows existing knowledge about molecular systems to be correlated, assembled, and integrated; parameters that are difficult or impossible to measure to be calculated; hypotheses to be formulated and inexpensively tested; and gaps in knowledge to be pinpointed. Shape, size, and stereochemistry measure different spatial characteristics of molecules. Both geometrical and orbital symmetry play a vital role in the interpretation and understanding of electronic, vibrational, and NMR spectra of molecules. The fact that even isospectral molecules (isomeric molecules having the same eigenvalue set) with different symmetries will have different photoelectron ionization spectra emphasizes the importance of this variable when considering similarity. The importance of symmetry in similarity comparisons of conjugated polyenes is also emphasized by the fact that molecular graphs with greater than twofold symmetry are guaranteed to have a doubly degenerate eigenvalue subset.
The search for and study of structural invariants is a vital undertaking in similarity studies and in the development of topological indices. Allied to this endeavor, the search for and discovery of elementary substructures with specific relations among their eigenvalues is of great relevance for the qualitative understanding of chemical systems. Characteristic polynomials, eigenvectors, recurring eigenvalues, embedding fragments (substructures), and right-hand mirror-plane fragments are just some examples of quantum chemical-based invariants. The more eigenvalues two subspectral molecular graphs have in common, the more similar they are, other things being equal. Two molecular graphs are subspectral if they have one or more eigenvalues in common. Subspectrality is one kind of measure of similarity that is maximized if the frontier molecular orbitals are included in the common eigenvalues. The HMO model is particularly important when dealing with π-electron systems. HMO does not include other variables, like strain-related components, which must be determined separately. Embedding fragments and right-hand mirror-plane fragments are molecular orbital functional groups. This chapter reports our recent studies on complementary molecular graphs, which are correlated by an expanded version of our aufbau principle.


II. BASIC DEFINITIONS

A molecular graph is the C-C σ-bond skeleton representation of a fully conjugated polyene molecule. Such a graph, therefore, omits the C and H atoms and the C-H and pπ bonds. Since most polycyclic conjugated polyenes can have more than one arrangement of their π-bonds, the molecular graph representation avoids artificially representing these molecular systems by only one of these arrangements. Molecular energy level and eigenvalue are synonymous, as are wave function and eigenvector. The highest occupied MO (HOMO) and the lowest unoccupied MO (LUMO) are called the frontier MOs (FMOs). Strongly subspectral molecular graphs have a preponderance of common eigenvalues. Isospectral molecular graphs have precisely the same eigenvalue spectrum. Almost-isospectral molecular graphs are strongly subspectral molecular graphs with 0,0, ±1, or ±2 as unique eigenvalue pairs. Functional groups are substructures (groups of interconnected atoms) having a characteristic set of properties that are conveyed to the whole structure. If two eigenvalues (X) within a single molecular graph or two related mirror-plane fragment graphs sum to zero (X_1 + X_2 = 0), they are said to be paired. The well-known pairing theorem states that all eigenvalues in a conjugated alternant hydrocarbon (AH) are either zero (nonbonding) or paired (bonding and antibonding). AHs have no odd-size rings, and every other carbon vertex can be starred so that no two starred and no two unstarred positions are adjacent. The eigenvector coefficients for the starred positions of the AH are unchanged in going from one eigenvalue (X_1) to its paired partner (X_2), whereas for the unstarred positions the sign (but not the magnitude) changes; if an eigenvalue has no paired partner (i.e., X = 0), then the coefficients of the unstarred positions are zero.
When an internal mirror-plane of symmetry divides a molecular graph into two parts, the vertices on the mirror-plane remain with the left-hand fragment, and vertices in the right-hand fragment originally connected by a bisected edge have weights of -1. If two eigenvalues in a single molecular graph, a single right-hand mirror-plane fragment, or two related molecular graphs or right-hand mirror-plane fragments sum to minus one (X_1 + X_2 = -1), they are said to be complementary. Two equal-sized right-hand mirror-plane fragments are complementary if all of their eigenvalues are complementary; the normal vertices of one of the complementary right-hand fragments correspond to -1 weighted vertices in the other, and both have the same sets of normalized eigenvector coefficients, whose relative signs are fixed for the starred positions in going from one to the other. Two AH molecular graphs are complementary if their right-hand mirror-plane fragments containing normal and -1 weighted vertices are complementary. If a molecular graph has a right-hand mirror-plane fragment that contains an equal number of normal and -1 weighted vertices which, when interchanged, give the same fragment, then both this molecular graph and its right-hand fragment are said to be self-complementary. For a given eigenvalue, the McClelland mirror-plane of symmetry defines an antisymmetric relationship among the coefficients of the relevant eigenvector.
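The definitions above can be illustrated with 1,3-butadiene, whose mirror plane bisects the central C2-C3 edge. The sketch below is a minimal illustration, not a construction from this chapter: it assumes the usual McClelland rule that the left-hand fragment carries a +1 weight on the vertex next to the bisected edge (the text states only the -1 weight of the right-hand fragment), and the helper names are hypothetical. It checks that the fragment eigenvalues are roots of the butadiene characteristic polynomial, that they are paired, and that the two right-hand eigenvalues sum to minus one.

```python
import math


def eig2(a, b, c):
    """Eigenvalues (ascending) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return (mean - radius, mean + radius)


def butadiene_charpoly(x):
    """Characteristic polynomial of the butadiene graph C1-C2-C3-C4."""
    return x ** 4 - 3.0 * x ** 2 + 1.0


# Mirror plane bisects the C2-C3 edge. Each fragment is a two-vertex path in
# which the vertex next to the bisected edge carries a diagonal weight.
LEFT = eig2(0.0, 1.0, 1.0)    # vertices C1, C2; C2 weighted +1 (assumed rule)
RIGHT = eig2(-1.0, 1.0, 0.0)  # vertices C3, C4; C3 weighted -1 (as in the text)

if __name__ == "__main__":
    print("left-hand fragment :", LEFT)
    print("right-hand fragment:", RIGHT)
    print("sum of right-hand eigenvalues:", sum(RIGHT))
```

Because its two right-hand eigenvalues are complementary to each other, butadiene serves as a small example of a self-complementary molecular graph in the sense defined above.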

III. AUFBAU PRINCIPLE

All benzenoid (polyhex) structures of a given C_nH_s formula (n = N_c and s = N_H) can be generated by a combination of the following three types of attachments to the perimeter of all of its precursor isomeric benzenoids: (1) attachment of C4H2 units to the ^(2,2) edges of all isomeric benzenoids with the formula C_{n-4}H_{s-2}, (2) attachment of C3H units to the vee regions of all isomeric benzenoids with the formula C_{n-3}H_{s-1}, and (3) attachment of C2 units to the bay regions of all isomeric benzenoids with the formula C_{n-2}H_s. Taking all of the above combinatorial attachments and deleting duplicates gives all of the benzenoids of a given C_nH_s formula. In benzenoid enumeration and structure generation, C2, C3H, and C4H2 are elementary aufbau units, as all other benzenoid aufbau units can be built by some successive union of these elementary units. In this chapter, other aufbau units will be used in construction proofs and in the generation of infinite pairs of series composed of strongly subspectral molecular graphs.
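The formula bookkeeping behind the three attachments can be sketched in a few lines. The function below is a hypothetical helper that only subtracts the composition of each elementary aufbau unit from a target formula C_nH_s. It does not generate structures, and whether a benzenoid with a given precursor formula actually exists has to be checked separately.

```python
# Composition (carbons, hydrogens) of the three elementary aufbau units.
AUFBAU_UNITS = {"C4H2": (4, 2), "C3H": (3, 1), "C2": (2, 0)}


def aufbau_precursors(n_carbon, n_hydrogen):
    """Precursor formulas (n, s) from which C_nH_s is reached by one attachment."""
    return {
        unit: (n_carbon - dc, n_hydrogen - dh)
        for unit, (dc, dh) in AUFBAU_UNITS.items()
    }


if __name__ == "__main__":
    # Target formula C16H10 (for example, pyrene).
    for unit, (n, s) in aufbau_precursors(16, 10).items():
        print(f"C{n}H{s} + {unit} -> C16H10")
```

For C16H10 this lists C12H8, C13H9, and C14H10 as candidate precursor formulas. The last corresponds to attaching a C2 unit to a bay region of a C14H10 benzenoid.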

IV. RESULTS AND DISCUSSION

Figures 1-3 present chemically relevant examples of complementary molecular graphs and their eigenvector relationships. At the head of each column in these figures is the corresponding right-hand mirror-plane fragment. For each eigenvalue belonging to these mirror-plane fragments, the eigenvector coefficients are indicated at each position on the molecular graph. The information displayed in each figure for two adjacent columns corresponds to complementary molecular systems. Recall that the mirror-plane defines an antisymmetric relationship for the eigenvector coefficients of each eigenvalue. If you identically star the complementary right-hand mirror-plane fragments, you will note that the signs of the coefficients of the starred positions remain unchanged in going from the structure in one column to the other for a given complementary set of eigenvalues, whereas the signs of the coefficients of the unstarred positions do change. Let a given right-hand mirror-plane fragment be designated by M and its complementary by M̄. If k is the index number of a specified starred position of normal weight in M, then k is also the index number for the same starred position of -1 weight in M̄; starred normal weighted vertices in M become starred -1 weighted vertices in M̄.

Theorem 1. The associated eigenvalues (X) of two complementary right-hand mirror-plane fragments are related by X(M) + X(M̄) = -1.


Theorem 2. is given by

249

If the eigenvector 0(M) of a right-hand mirror-plane fragment

^(M) = X ^*i ^1 "^ S ^j ^j

^^^ eigenvalue X(M)

where (|)* is the p AO of a starred atomic vertex and (t)° that of an unstarred atomic vertex, then the eigenvector of its complementary is given by 0(M) = ^ a* ([)* - ^ aj (^j

for eigenvalue X(M)
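Theorems 1 and 2 can be checked numerically on a small case. The sketch below assumes that the naphthalene fragment is a five-vertex path carrying weight -1 on vertices 0, 2, and 4, and that its complementary (the 1,2,4,5-tetramethylenebenzene fragment) is the same path with weight -1 on the starred vertices 1 and 3; this vertex numbering and all function names are choices made here for illustration. With T the diagonal sign matrix (+1 on starred, -1 on unstarred vertices), the identity T M T = -M̄ - I implies both theorems: an eigenvector v of M with eigenvalue X maps to T v, an eigenvector of M̄ with eigenvalue -1 - X.

```python
import math


def weighted_path(n, minus_one_vertices):
    """Adjacency matrix of an n-vertex path with weight -1 on the chosen vertices."""
    a = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        a[i][i + 1] = a[i + 1][i] = 1
    for v in minus_one_vertices:
        a[v][v] = -1
    return a


def char_poly_at(a, x):
    """det(A - xI) for a symmetric tridiagonal matrix via the three-term recurrence."""
    prev, cur = 1.0, a[0][0] - x
    for k in range(1, len(a)):
        prev, cur = cur, (a[k][k] - x) * cur - a[k][k - 1] ** 2 * prev
    return cur


def complement_relation_holds(m, m_bar, starred):
    """Check T*M*T == -M_bar - I, with T = +1 on starred and -1 on unstarred vertices."""
    n = len(m)
    t = [1 if i in starred else -1 for i in range(n)]
    return all(
        t[i] * m[i][j] * t[j] == -m_bar[i][j] - (1 if i == j else 0)
        for i in range(n)
        for j in range(n)
    )


STARRED = {1, 3}                    # normal weight in M, -1 weight in M_bar
M = weighted_path(5, {0, 2, 4})     # assumed naphthalene fragment
M_BAR = weighted_path(5, STARRED)   # assumed tetramethylenebenzene fragment

# Roots of det(M - xI) = -(1 + x)(x^2 + x - 1)(x^2 + x - 3) for this fragment
EIGENVALUES_M = [
    -1.0,
    (math.sqrt(5) - 1) / 2,
    -(math.sqrt(5) + 1) / 2,
    (math.sqrt(13) - 1) / 2,
    -(math.sqrt(13) + 1) / 2,
]

if __name__ == "__main__":
    print("T M T = -M_bar - I:", complement_relation_holds(M, M_BAR, STARRED))
    for x in EIGENVALUES_M:
        # x is an eigenvalue of M and -1 - x is an eigenvalue of M_bar
        print(f"{x:+.4f}  {-1 - x:+.4f}")
```

The sign matrix T leaves starred coefficients unchanged and flips unstarred ones, which is exactly the eigenvector relationship of Theorem 2.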

Once half of the eigenvalues/eigenvectors of an AH molecular graph have been calculated, the pairing relationship allows one to obtain the remaining values by inspection. Similarly, from the complementary relationship, if the eigenvalues/eigenvectors of one complementary molecular graph are known, then these quantities for the other can be obtained without calculation. These eigenvalue/eigenvector theorems are illustrated by the complementary pairs depicted in Figures 1-3. Naphthalene and 1,2,4,5-tetramethylenebenzene (Figure 1) have been extensively studied, both experimentally and theoretically. The molecular graphs in Figures 2 and 3 are strongly subspectral. The molecular graph of tetravinylethylene and its complementary in Figure 2 are strongly subspectral to the corresponding complementary molecular graph pair in Figure 3. While tetravinylethylene (HOMO = 0.3111) itself has been synthesized (ref 13), only air-sensitive derivatives of benzodicyclobutadiene (HOMO = 0) have been synthesized (ref 14), results that are consistent with the relative energies of their frontier orbitals (Figure 2) and conjugated circuit resonance energies (ref 15).

A. Self-Complementary Molecular Graphs

Figure 4 gives an example of a self-complementary molecular graph and its corresponding eigenvalue/eigenvector relationships. If the eigenvalues in a single right-hand mirror-plane fragment can be grouped into pairs that each sum to minus one ($X_1 + X_2 = -1$), this mirror-plane fragment and its corresponding AH molecular graph are said to be self-complementary. From the example in Figure 4, it should be evident that one needs only to determine one-fourth of the eigenvalues/eigenvectors of self-complementary molecular graphs and then use the complementarity principle and pairing theorem to determine the remaining eigenvalues/eigenvectors. Thus, self-complementary molecular graphs possess a type of hidden symmetry.

Theorem 3. Starting with 1,3-butadiene and the C4H2 set of aufbau units, all (nonbranched) self-complementary molecular graphs are generated per Figure 5.

JERRY RAY DIAS

[Figure 1 graphic not reproduced: complementary right-hand mirror-plane fragments of naphthalene and 1,2,4,5-tetramethylenebenzene. Eigenvector coefficient key: a = 0.1735, b = 0.2307, c = 0.2629, d = 0.30055, e = 0.3470, f = 0.3996, g = 0.4082, h = 0.4253, i = 0.4614.]

Figure 1. Corresponding eigenvectors for complementary eigenvalues belonging to complementary molecular graphs.


[Figure 2 graphic not reproduced: complementary right-hand mirror-plane fragments of tetravinylethylene and benzodicyclobutadiene. Eigenvector coefficient key: a = 0.1268, b = 0.1409, c = 0.2530, d = 0.3020, e = 0.3058, f = 0.35355, g = 0.3682, h = 0.3747, i = 0.5121.]

Figure 2. Corresponding eigenvectors for complementary eigenvalues belonging to complementary molecular graphs.

[Figure 3 graphic not reproduced: complementary right-hand mirror-plane fragments with eigenvector coefficients.]

Figure 7. Two series of strongly subspectral molecular graphs that approach isospectrality in the infinite limit. The unmatched eigenvalues are indicated next to the first-generation molecular graphs of each series.

Figure 8. Two series of strongly subspectral molecular graphs that approach isospectrality in the infinite limit. The unmatched eigenvalues are indicated next to the first-generation molecular graphs of each series.

pseudosymmetry). The matching members of these series grow by successive addition of a CH unit. The unique feature of the pair of strongly subspectral series in Figure 10 is that the unmatched zero eigenvalues escalate in the lower series with the increase in size. The 1,3,1',3'-phenylene triradical member of the upper series in Figure 10 has been the subject of recent theoretical study. The formulas of the aufbau units used to successively build up the series in Figures 6-10 are indicated in the upper left-hand corners. Two different attachment modes for the elementary C4H2 aufbau unit were used to generate the matching series in Figure 6. More complicated recursive constructions using larger aufbau units were employed for the series in Figure 7; the upper series is generated by successive attachment of 1,3,5-hexatriene-2,3,4,5-tetrayl and the lower series by splicing in the tetrayl of bisallyl. Successive attachment of the tetrayl of 3,4-dimethylenylcyclobutene generates the upper series and successive splicing in 1,2,4,5-tetraylbenzene generates the lower series in Figure 8. The two pairs of matching series in Figures 6 and 7 successively increase one ring at a time, whereas the two matching series in Figure 8 successively increase two rings at a time. Successive addition of CH aufbau units to benzene and trivinylmethyl in Figure 9 represents the simplest example of aufbau construction of strongly subspectral series. The same aufbau increments were used in the generation of each pair of series in Figures 6-9, but the series in Figure 10 are built up by different aufbau increments, which explains why the number of zero eigenvalues in the lower series escalates.

Figure 9. Two series of almost-isospectral molecular graphs that approach isospectrality in the infinite limit. The unmatched zero eigenvalue is indicated next to the first-generation molecular graph of the lower series.

Figure 10. Two series of almost-isospectral molecular graphs. The unmatched zero eigenvalues are indicated next to the corresponding structure.

V. CONCLUSION

Complementary molecular graphs have right-hand mirror-plane fragments with the following characteristics: the normal weighted vertices in one are $-1$ weighted vertices in the other; the starred vertices in both have identical eigenvector coefficients, whereas the unstarred vertices have eigenvector coefficients of opposite sign; and their eigenvalues $X$ are related by $X(M) + X(\bar{M}) = -1$. Complementary molecular graphs correspond to AH molecules with at least twofold symmetry, and their complementary eigenvalues correspond to antisymmetric eigenvectors. The aufbau constructions in Figures 5-10 contain the essence of inductive proofs. The complete set of self-complementary molecular graphs and their corresponding right-hand mirror-plane fragments are contained in Figure 5. Many unique strongly subspectral pairs of series are evolved by aufbau constructions. This work has more completely revealed the various relationships between structure, symmetry (both obvious and hidden), size, and similarity.

REFERENCES

1. Johnson, M.; Maggiora, G. M. Similarity in Chemistry; Wiley: New York, 1991; Trinajstić, N. Chemical Graph Theory; CRC Press: Boca Raton, FL, 1992; Randić, M. J. Chem. Inf. Comput. Sci. 1992, 32, 686-692; Sen, K., Ed. Molecular Similarity I and II; Springer-Verlag: Berlin, 1995; Klein, D. J. J. Math. Chem. 1995, 18, 321-348; Mezey, P. G. Shape in Chemistry; VCH: New York, 1993.
2. Hargittai, I.; Hargittai, M. Symmetry through the Eyes of a Chemist, 2nd ed.; Plenum: New York, 1995; Halevi, E. A. Orbital Symmetry and Reaction Mechanism; Springer-Verlag: Berlin, 1992.
3. Heilbronner, E.; Jones, T. B. J. Am. Chem. Soc. 1978, 100, 6506-6507.
4. Dias, J. R. Chem. Phys. Lett. 1996, 253, 305-312.
5. Balasubramanian, K. SAR QSAR Environ. Res. 1994, 2, 59-77; Chem. Rev. 1985, 85, 599-618; Carbo, R., Ed. Molecular Similarity and Reactivity: Quantum Chemical to Phenomenological Approaches; Kluwer: Dordrecht, 1995.
6. Randić, M. J. Math. Chem. 1992, 9, 97-146.
7. Jiang, Y.; Yu, W.; Kirby, E. C. J. Chem. Soc. Faraday Trans. 1991, 87, 3631-3640.
8. Dias, J. R. Molecular Orbital Calculations Using Chemical Graph Theory; Springer: Berlin, 1993.
9. Hall, G. G. Trans. Faraday Soc. 1957, 53, 573-581; Bull. Inst. Math. Appl. 1981, 17, 70-72; J. Math. Chem. 1993, 13, 191-203.
10. McClelland, B. J. J. Chem. Soc. Faraday Trans. 2 1974, 70, 1453-1456; J. Chem. Soc. Faraday Trans. 2 1982, 78, 911-916; Mol. Phys. 1982, 45, 189-190.
11. Dias, J. R. Z. Naturforsch. 1989, 44a, 765-771; J. Math. Chem. 1990, 4, 17-29.
12. Dias, J. R. Molec. Phys. 1996, 88, 407-417.
13. Skattebol, L.; Charlton, J. L.; deMayo, P. Tetrahedron Lett. 1966, 2257-2260.
14. Toda, F.; Garratt, P. Chem. Rev. 1992, 92, 1685-1707.
15. Randić, M. Tetrahedron 1977, 33, 1905-1920.
16. Liu, J. J. Chem. Soc. Faraday Trans. 1997, 93, 5-9.
17. Dias, J. R. J. Mol. Struct. (Theochem) 1997, 417, 49-67.
18. Dias, J. R. J. Phys. Chem. A 1997, 101, 7167-7175.
19. Zhang, J.; Baumgarten, M. Chem. Phys. Lett. 1997, 269, 187-192.

CORRELATIONS AND APPLICATIONS OF THE CIRCUMSCRIBING/EXCISED INTERNAL STRUCTURE CONCEPT

Jerry Ray Dias

Abstract
I. Introduction
II. History
III. Constant-Isomer Benzenoid Series
IV. Constant-Isomer Series of Fluoranthenoids/Fluorenoids and Indacenoids
V. Other Applications
References

ABSTRACT The circumscribing/excised internal structure concept has been used to generate constant-isomer series of strictly pericondensed benzenoids, fluoranthenoids, indacenoids, and related conjugated polycyclic hydrocarbons and identify their topological properties.

Advances in Molecular Similarity, Volume 2, pages 259-264. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5

I. INTRODUCTION

The search for and discovery of new elementary substructures is an essential strategy in the quest to understand chemical phenomena. Atoms, bonds, and functional groups are examples of the most fundamental elementary substructures that are used to describe molecules and to decipher their properties. To determine chemical properties, one must first analyze the properties of isolated molecules and then determine how these result in the observable bulk chemical properties. For example, while a paramagnetic molecule gives rise to paramagnetic material, there is no such thing as a ferromagnetic molecule, for ferromagnetic materials require that all of the magnetic moments associated with paramagnetic molecules in the bulk phase be permanently aligned in the same direction. Similarly, a molecule has no melting point transition. Melting point is a bulk property associated with a conglomerate of interacting molecules. Thus, it is necessary for one to first understand molecular properties and then deduce chemical properties from the cooperative effect of many molecules. Consideration of cooperative effects is not necessary for spectroscopy of molecules in the gas phase, but this is not true for most other types of physical property determinations.

The excised internal structure (EIS) is an elementary substructure of recent origin. It was originally defined as the conjugated hydrocarbon formed when the internal carbons of a strictly pericondensed benzenoid are excised by stripping away the perimeter carbon ring; in other words, the EIS was defined as the subgraph spanned by the internal vertices of a strictly pericondensed benzenoid system. The reverse process is called circumscribing. The EIS may be more generally defined as the connected subgraph spanned by the internal vertices of a strictly pericondensed polycyclic conjugated system. Strictly pericondensed systems have no catacondensed appendages.

II. HISTORY

The excised internal structure was forecasted by Platt's perimeter rule (ref 3) and the subsequent spectroscopic distinction of the insular versus perimeter orbitals in pericondensed benzenoids. Clar (refs 4, 5) named two highly condensed benzenoids using a circo/circum terminology: circobiphenyl (C38H16, K = 136) and circumanthracene (C40H16, K = 105). Because these are the only examples for which this terminology was used, this nomenclature was relatively unknown. Subsequently, the concept of the one-isomer coronene series put forth by Dias involved successive circumscribing of benzene to coronene (C6H6 to C24H12), coronene to circumcoronene (C24H12 to C54H18), circumcoronene to dicircumcoronene (C54H18 to C96H24), and so on. Shortly thereafter, the excised internal structure/circumscribing concept was more fully developed. Hall (ref 7) showed that the dualist (inner dual) graph of the dualist graph of a strictly pericondensed benzenoid is the excised internal structure.


III. CONSTANT-ISOMER BENZENOID SERIES

If an EIS consisting of only hexagonal rings and/or polyene branches with no less than two-carbon gaps is wrapped (circumscribed) by a perimeter of hexagonal rings, a benzenoid is generated. Every strictly pericondensed benzenoid isomer has a unique EIS. Constant-isomer series are infinite series of benzenoid hydrocarbons that successively increase in formula per $N_c' = N_c + 2N_H + 6$ and $N_H' = N_H + 6$ and have the same number of isomers at each stage of increase. They are generated by successive circumscribing with a perimeter of $2N_H + 6$ carbon atoms and incrementing with six hydrogens. Starting with the only three possible C4H6 polyene isomers (trimethylenemethane diradical, s-trans-1,3-butadiene, and s-cis-1,3-butadiene), the only three C22H12 benzenoid isomers are generated (Figure 1). Circumscribing these first-generation C22H12 benzenoid isomers gives the only three possible C52H18 benzenoids. Continuing to circumscribe in a successive fashion gives the 3-isomer benzenoid series. Symmetry, the number of bay regions and selective lineations, and the radicaloid cardinality of the benzenoid members of constant-isomer series are conserved on successive circumscribing.

[Figure 1 graphic not reproduced. Labels: s-trans-1,3-butadiene; s-cis-1,3-butadiene; isotopological mates; circum(30)triangulene; circum(30)anthanthrene; circum(30)benzo[ghi]perylene.]

Figure 1. Illustration of the excised internal structure concept in enumeration of all of the benzenoid isomers of C22H12, C52H18, C94H24, and so on.

We have shown that as one moves downward on the left-hand staircase edge of Table PAH6, a constant-isomer number pattern of . . . abb . . . is observed. For those constant-isomer series with the same cardinality of b, there exists a one-to-one topological matching of their benzenoid membership. Constant-isomer benzenoid formulas only occur on the left-hand staircase edge of Table PAH6.
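The formula recursion for circumscribing stated in this section is easy to tabulate. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and `excise` is simply the algebraic inverse of the circumscribing step, added here for illustration.

```python
def circumscribe(n_c, n_h):
    """One circumscribing step for a benzenoid or its EIS formula.

    N_c' = N_c + 2 N_H + 6 and N_H' = N_H + 6 (perimeter of 2 N_H + 6 carbons).
    """
    return n_c + 2 * n_h + 6, n_h + 6


def excise(n_c, n_h):
    """Algebraic inverse of circumscribe: formula of the excised internal structure."""
    return n_c - 2 * n_h + 6, n_h - 6


def constant_isomer_formulas(n_c, n_h, generations):
    """Formulas obtained by circumscribing a starting formula `generations` times."""
    formulas = []
    for _ in range(generations):
        n_c, n_h = circumscribe(n_c, n_h)
        formulas.append((n_c, n_h))
    return formulas


if __name__ == "__main__":
    # Three generations starting from the C4H6 polyene isomers
    for n_c, n_h in constant_isomer_formulas(4, 6, 3):
        print(f"C{n_c}H{n_h}")
    # One-isomer coronene series starting from benzene
    for n_c, n_h in constant_isomer_formulas(6, 6, 3):
        print(f"C{n_c}H{n_h}")
```

Only molecular formulas are tracked here; whether a given structure can actually be circumscribed depends on its perimeter topology, which this sketch does not examine.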

IV. CONSTANT-ISOMER SERIES OF FLUORANTHENOIDS/FLUORENOIDS AND INDACENOIDS

Polyhexes are molecular graphs that correspond to benzenoid hydrocarbons. The concepts of EIS, circumscribing, and constant-isomer series can be generalized to polypent/polyhex composite systems where the number of pentagonal rings satisfies $r_5 \le 6$. Fluoranthenoids/fluorenoids, which contain one pentagonal ring, and indacenoids, which contain two pentagonal rings among otherwise hexagonal rings, have been shown to possess constant-isomer series. A polypent/polyhex system consists of interlocking pentagons and hexagons where the degree-2 vertices correspond to methine >C-H units and degree-3 vertices correspond to =C< carbon units. The number of degree-2 vertices is $N_H$, the total number of carbon vertices is $N_c = n$, and the numbers of edges and rings are $q$ and $r$, respectively. Denote the circumscription of a polypent/polyhex (polyene) system by $P \to \text{circum-}P = P'$. It has been previously shown that $N_c = N_H + N_{pc} + N_{Ic}$, where $N_{pc}$, $N_{Ic}$, and $q_p$ are the numbers of perimeter degree-3 vertices, internal degree-3 vertices, and perimeter edges, respectively. For $P \to P'$, $N_c \to N_{Ic}'$ and $N_H \to N_{pc}'$. Thus, for circum-$P$, $N_{pc}' = N_H' - 6 + r_5 = N_H$, giving $N_H' = N_H + 6 - r_5$; similarly, $q_p' = N_{pc}' + N_H'$ and $N_c' = N_{Ic}' + N_{pc}' + N_H'$, giving $N_c' = N_c + 2N_H + 6 - r_5$. These recursive equations are useful for monitoring the progress of successive circumscription. It is presumed that polypent/polyhex constant-isomer series can successively increase without limit, and since $N_H' = N_H + 6 - r_5$ should not decrease, this places the constraint that $r_5 \le 6$.
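The generalized recursion with pentagonal rings can be written down in the same way. The sketch below assumes the relations read from this section, namely N_H' = N_H + 6 - r5, N_c' = N_c + 2 N_H + 6 - r5, N_pc = N_H - 6 + r5, and q_p = N_pc + N_H; the function names are hypothetical and only formula bookkeeping is performed.

```python
def circumscribe_polypent_polyhex(n_c, n_h, r5=0):
    """Formula of circum-P for a system with r5 pentagonal rings (r5 = 0: benzenoid)."""
    if not 0 <= r5 <= 6:
        # N_H' = N_H + 6 - r5 must not decrease along a constant-isomer series
        raise ValueError("r5 must lie between 0 and 6")
    return n_c + 2 * n_h + 6 - r5, n_h + 6 - r5


def perimeter_counts(n_h, r5=0):
    """Perimeter degree-3 vertices N_pc and perimeter edges q_p for given N_H and r5."""
    n_pc = n_h - 6 + r5
    q_p = n_pc + n_h
    return n_pc, q_p


if __name__ == "__main__":
    # Formula-level example for a C20H10 indacenoid (two pentagonal rings)
    n_c, n_h = circumscribe_polypent_polyhex(20, 10, r5=2)
    n_pc, q_p = perimeter_counts(n_h, r5=2)
    # n_pc of the circumscribed system equals N_H of the starting system
    print(f"C{n_c}H{n_h}: N_pc = {n_pc}, q_p = {q_p}")
```

The consistency check N_pc' = N_H mirrors the statement that the hydrogen-bearing perimeter vertices of P become the perimeter degree-3 vertices of circum-P.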

V. OTHER APPLICATIONS

The coronene one-isomer series has been shown by Pisanski et al. to be a rotagraph (ref 10), and Klavzar and Gutman have shown this series to have the lowest Wiener indices (ref 11). Scott and Necula have used our EIS concept to explain the relative 1H NMR shielding of C20H10 indacenoids (ref 12). Per their interpretation, indacenoids 9, 10, 18, and 19 (Figure 2) should have 1H NMR chemical shifts that are relatively more deshielded than the other C20H10 indacenoids in Figure 5 because their trimethylenemethane EISs prevent the antiaromatic perimeter ring current from participating. Thus, 9, which has a trimethylenemethane diradical EIS, was observed to have resonances of all of its hydrogens shifted downfield by 0.4-0.7 ppm relative to the corresponding hydrogens in 1 and 5, which have a closed-shell 1,3-butadiene EIS. Those benzenoid structures having an EIS with K = 1 will have a monoquinone isomer with K = 1. A two-dimensional map of a family of benzenoids has been shown to have a one-to-one almost-isospectral matching to another two-dimensional map of related EIS.

Figure 2. All 19 C20H10 indacenoid isomers possible.

REFERENCES

1. Dias, J. R. J. Chem. Inf. Comput. Sci. 1984, 24, 124-135.
2. Dias, J. R. Can. J. Chem. 1984, 62, 2914-2922.
3. Platt, J. R. J. Chem. Phys. 1954, 22, 144.
4. Clar, E.; Roberson, R.; Schlogl, R.; Schmidt, W. J. Am. Chem. Soc. 1981, 103, 1320.
5. Clar, E. Polycyclic Hydrocarbons; Wiley: New York, 1964; Vols. 1 and 2.
6. Dias, J. R. J. Chem. Inf. Comput. Sci. 1982, 22, 15-22.
7. Hall, G. G. Theor. Chim. Acta 1988, 73, 425-435.
8. Dias, J. R. J. Chem. Inf. Comput. Sci. 1990, 30, 251-256.
9. Dias, J. R. J. Chem. Inf. Comput. Sci. 1993, 33, 117-130.
10. Pisanski, T.; Zitnik, A.; Graovac, A.; Baumgartner, A. J. Chem. Inf. Comput. Sci. 1994, 34, 1090-1093.
11. Klavzar, S.; Gutman, I. J. Chem. Inf. Comput. Sci. 1996, 36, 1001-1003.
12. Scott, L. T.; Necula, A. J. Org. Chem. 1996, 61, 386-388.
13. Dias, J. R. J. Chem. Inf. Comput. Sci. 1990, 30, 53-61.
14. Dias, J. R. J. Phys. Chem. A 1997, 101, 7167-7175.

LEAST-SQUARES AND NEURAL-NETWORK FORECASTING FROM CRITICAL DATA: DIATOMIC MOLECULAR $r_e$ AND TRIATOMIC $\Delta H_a$ AND IP

Jason Wohlers, W. Blake Laing, Ray Hefferlin, and W. Bradford Davis

Abstract
I. Introduction
II. Theory
III. Data
IV. Results for Diatomic-Molecular $r_e$
    A. Least-Squares Results
    B. Neural Network Results
    C. Graphical Representations of Neural Network Results
V. Results for Triatomic-Molecular IP and $\Delta H_a$
    A. Least-Squares Results
    B. Neural Network Results
    C. Graphical Results
VI. Discussion
References

ABSTRACT

Multiple regression was used to predict 299 diatomic internuclear separations using atomic period and group numbers as a basis. Van der Waals molecules were excluded. The standard deviation $\sigma$ of the differences of predictions from 150 tabulated data is 4.128%. Neural networks, one with van der Waals molecules in the learning set and one without, each predicted the property for 2145 real and nonredundant molecules; the $\sigma$ of the differences of the predictions from 342 and 316 tabulated data are 25.00 and 8.63%, respectively. For comparable cases, the least-squares technique was more accurate. Multiple regression has been used to predict 205 triatomic ionization potentials using a 3D basis consisting of combinations of atomic period and group numbers. The $\sigma$ of differences from 80 tabulated data is 14.65%. Neural networks using the same 3D basis and using a 6D period and group number basis have predicted the IP for 2596 and 5148 molecules; the $\sigma$ of the differences from 69 and 92 tabulated data are 12.35 and 10.97%. The neural network method was more accurate than the least-squares technique. Neural networks using the same two bases have predicted the bonding energies for 16324 and 5418 molecules; the $\sigma$ of the differences from 79 and 117 tabulated data are 22.67 and 15.13%.

I. INTRODUCTION

Systematization of existing data for small molecules holds out hope for enhanced learning by chemistry students, for more efficient preparation of computer databases, for a better understanding of how molecular periodicity and molecular periodic systems are related to atomic periodicity and the chart of the elements, and even for relating those understandings to the observed periodicities and corresponding periodic tables of nuclei and nucleons and of hadrons and quarks. Systematization also makes possible the rapid forecasting of approximate data for large numbers of molecules, data that should be useful until experiments or ab initio computations produce much more precise values.

This systematization of small-molecule data has been carried out with graphical, statistical, and least-squares (LS) techniques, and has just begun with the construction of neural networks (NN). LS methods have the advantage that fitting and predicting errors can be well known, while NN have the ability to "learn" and predict data without being supplied a smoothing equation. The present report will not repeat information about LS or NN techniques, but it will evaluate and compare some of the tens of thousands of predictions of molecular data obtained by these two methods.

The properties considered here are the diatomic-molecular equilibrium internuclear separation in angstroms (here called $r_e$), and the triatomic-molecular heat of atomization in kilojoules per mole ($\Delta H_a$) and ionization potential in electronvolts (IP).

II. THEORY

The periodicity of atomic data is associated with the period number $R$ (the principal quantum number) and the group number $C$ (associated with the angular momentum and the magnetic quantum numbers). Detailed studies have shown that the periodicity of diatomic-molecular data can be very successfully associated with the independent variable basis set $\{R_1, C_1, R_2, C_2\}$, where the subscripts refer to the two atoms in the molecules. This basis is the foundation of the matrix-product theory of molecular periodicity, has been used extensively in the least-squares fitting and prediction of data, and was used in the study of main-group, neutral, ground-state, diatomic-molecular data being reported here. Diatomic-molecular data were symmetrized so that each appears with independent variables $\{R_i, C_i, R_j, C_j\}$ and $\{R_j, C_j, R_i, C_i\}$, except for the homonuclear cases $R_i = R_j$ and $C_i = C_j$ (mirror-image molecules are identical).

The matrix-product theory of molecular periodicity specifies the six-dimensional basis $\{R_1, C_1, R_2, C_2, R_3, C_3\}$, where the central atom is number 2, for the analysis of triatomic-molecular data. Triatomic data were symmetrized such that each appears with independent variables $\{R_i, C_i, R_j, C_j, R_k, C_k\}$ and $\{R_k, C_k, R_j, C_j, R_i, C_i\}$, except for the cases where $R_i = R_k$ and $C_i = C_k$.

Graphical study of triatomic-molecular data has shown that the three-dimensional basis $\{(R_1 R_2 + R_2 R_3), (C_1 + C_2 + C_3), C_2\} = \{f(r), n_e, C_2\}$ is also useful because (1) $f(r)$ is the reduced variable in $R_i$ along which isovalent molecular (fixed $C_i$) data are most monotonic, (2) $n_e$ enumerates series of isoelectronic molecules, and (3) $C_2$ allows for the very weak tendency of $C_2 = 4$ molecules to be more stable. Of course, there may be more than one triatomic molecule for a given $\{f(r), n_e, C_2\}$, and their data are learned and predicted as a single value. Triatomic data in the 3D basis require no symmetrization. LS predictions have already been made in this basis. This paper includes results using both the 6D and 3D bases for the study of acyclic, main-group, neutral, ground-state triatomic molecules.
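The two bases and the symmetrization step can be illustrated with a short sketch. The function names and the record layout (a tuple of period/group numbers paired with a data value) are assumptions made for this example; only the definitions of f(r), n_e, and C2 and the rule of adding the mirror-image record for heteronuclear molecules are taken from the text.

```python
def basis_3d(periods, groups):
    """Map a triatomic (R1, R2, R3), (C1, C2, C3) to the 3D basis (f(r), n_e, C2)."""
    r1, r2, r3 = periods
    c1, c2, c3 = groups
    return (r1 * r2 + r2 * r3, c1 + c2 + c3, c2)


def symmetrize_diatomic(records):
    """Add the mirror-image record (R2, C2, R1, C1) for every heteronuclear molecule.

    `records` is a list of ((R1, C1, R2, C2), value) pairs; homonuclear
    molecules are their own mirror image and appear only once.
    """
    out = []
    for (r1, c1, r2, c2), value in records:
        out.append(((r1, c1, r2, c2), value))
        if (r1, c1) != (r2, c2):
            out.append(((r2, c2, r1, c1), value))
    return out


if __name__ == "__main__":
    # OCO: all atoms in period 2; groups 6, 4, 6 with carbon as the central atom
    print(basis_3d((2, 2, 2), (6, 4, 6)))
    # CO (r_e = 1.128 angstrom) is duplicated, NN (r_e = 1.098 angstrom) is not
    data = [((2, 4, 2, 6), 1.128), ((2, 5, 2, 5), 1.098)]
    for row in symmetrize_diatomic(data):
        print(row)
```

Because f(r) and n_e are symmetric in the two terminal atoms, the 3D basis gives the same point for a molecule and its mirror image, which is why no symmetrization is needed there.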

III. DATA

The diatomic-molecular $r_e$ in angstroms are from Ref. 16. The triatomic-molecular $\Delta H_a$ and IP are from Refs. 17 and 18. Instead of a 4D LS fit to the diatomic data, series of 2D fits were made. For fixed values of $(R_1, R_2)$, data were fitted to smoothing equations in $C_1$ and $C_2$ to obtain

Table 1. Internuclear Separations in Angstroms: Tabulated Data, Least Squares and Neural Network Predictions, and Differences from Tabulated Values

[Table 1 body not reproduced: the row and column alignment was lost in extraction. Column groups: molecule; tabulated value; Least Squares (area, values, average, $\sigma$, % diff.); Neural Network #1 (values, % diff.); Neural Network #2 (values, % diff.). Rows run alphabetically from AlAl to SiSi.]

Summary rows                Least Squares $\sigma$   Least Squares % diff.   Neural Network #1 % diff.   Neural Network #2 % diff.
Average                     0.020                    -0.220                  0.741                       -0.442
Standard deviation          0.014                    2.609                   7.951                       8.736

[Figure 1 graphic not reproduced. Plotted quantity: D at 298 K (eV), from Sauval. Axes: X = right atom, Y = left atom, Z = central atom. Symbol key: six bins of width 2.5 eV covering 2.50 to 17.49 eV.]

Figure 1. An example of the triatomic molecular data. Shown are unpublished heats of atomization in eV from Ref. 19, in $C_1$ (right front), $C_2$ (vertical), and $C_3$ (right rear) coordinates; $C_2$ pertains to the central atom; $0 < C_i < 8$; $R_1 = R_2 = R_3 = 2$. The data are separated into bins indicated by symbols. The bins are of equal "width" and lie between the largest $\Delta H_a$, i.e., + for OCO at coordinates (6,4,6), and the smallest, i.e., A for FOF at (7,6,7). Isoelectronic molecules lie on tilted parallel planes whose intersections with planes of constant $C_2$ are indicated by small numbers ($n_e$).

Table 2. Ionization Potentials in eV: Tabulated Data, Least Squares and Neural Network Predictions, 95% Confidence-Limit Errors, and Differences from Tabulated Values

                     Tabulated          Least Squares (3D)           Neural Network (6D)          Neural Network (3D)
Molecule             Values   Errors    Values  Errors  % diff.      Values  Errors  % diff.      Values  Errors  % diff.
FBO                  13.400   0.251     12.15   0.22     -9.34       12.33   0.90     -7.95       12.84   ERR      -4.21
NNO                  12.890   0.528     12.15   0.22     -5.75       12.33   0.90     -4.31       12.83   ERR      -0.44
FBF                   8.400   0.155      8.14   0.25     -3.07       12.14   0.89     44.50       12.75   1.24     51.73
OOF                  12.600   0.155     12.60   0.28      0.01       12.56   0.92     -0.35       11.91   1.16     -5.48
FOF                  13.700   0.155     12.61   0.34     -7.98       13.90   1.02      1.46       12.46   1.21     -9.07
OCO                  13.790   0.155     12.15   0.22    -11.90       12.33   0.90    -10.56       12.42   1.21     -9.93
NCF                  13.320   0.528     12.15   0.22     -8.79       12.33   0.90     -7.40       12.50   1.21     -6.16
OSO                  12.340   0.155     11.96   0.21     -3.07       11.51   0.84     -6.76       11.50   1.12     -6.78
FSiF                 11.000   0.251     11.96   0.21      8.74       11.51   0.84      4.60       12.23   1.19     11.15
OSiO                 11.700   0.251     11.59   0.17     -0.96       11.43   0.84     -2.28       12.06   1.17      3.11
Average                                                  -4.21                         1.10                         2.39
Standard deviation                                        5.67                        15.14                        17.51

Table 3. Heats of Atomization in kJ/mol: Tabulated Data, Least Squares and Neural Network Predictions, 95% Confidence-Limit Errors, and Differences from Tabulated Values

Molecule   Tabulated           Least Squares               Neural Network (6D)         Neural Network (3D)
           Values     Errors   Values  Errors  % diff.     Values  Errors  % diff.     Values  Errors  % diff.
C3         1302.550   10.1     1187    12      -8.9        1274    157     -2.2        1232    251     -5.4
BOB        1100.000   3.2      1101    1       0.1         900     111     -18.1       1125    23      2.3
FBO        1477.058   1.6      1213    264     -17.9       1349    166     -8.7        1258    257     -14.8
NNO        1103.390   0.3      1144    41      3.7         1127    139     2.2         1246    254     12.9
FBF        1216.550   1.6      1147    69      -5.7        1276    157     4.9         1167    238     -4.0
O3         595.892    1.6      937     342     57.3        862     106     44.6        613     125     2.8
CNN        1252.821   3.2      1194    58      -4.7        1095    135     -12.6       1323    27      5.6
NCO        1251.786   1.6      1221    31      -2.5        1298    160     3.7         1346    275     7.5
CCO        1382.153   1.6      1233    149     -10.8       1441    177     4.2         1339    273     -3.1
FOF        374.578    1.6      654     279     74.5        491     60      31.0        417     85      11.2
OCO        1597.893   0.3                                  1258    155     -21.3       1282    262     -19.8
OBO        1378.566   1.6      1251    128     -9.3        1302    160     -5.6        1317    269     4.5
FCO        1215.243   3.2      1117    98      -8.1        1240    153     2.1         1161    237     -4.5
BCC        1200.000   1.6      1132    68      -5.7        1465    180     22.1        1141    233     -4.9
ONO        927.384    0.3      1079    152     16.3        1095    135     18.1        1055    215     13.8
N3         973.021    3.2      1182    209     21.5        1102    136     13.2        1328    271     36.5
ONF        857.508    1.6      984     127     14.8        953     117     11.2        846     173     -1.3
FCF        1046.213   3.2      1023    23      -2.2        973     120     -7.0        1022    28      -2.3
FNF        588.368    3.2      859     270     46.0        632     78      7.5         660     135     12.3
NCF        1225.278   1.6      1183    43      -3.5        1229    151     0.3         1282    262     4.7
FSiF       1192.217   1.6      905     287     -24.1       1153    142     -3.3        1048    214     -12.1
OSiO       1259.950   1.6                                  1174    145     -6.8        1165    238     -7.6
Average                                        6.5                         3.6                         1.2
Standard deviation                             24.9                        15.2                        11.7

Forecasting r_e, ΔH_a, and IP


the coefficients of their equations and the subsequent predicted data. Then, for fixed values of (C1,C2), data were fitted to separate smoothing equations in R1 and R2. These variables are limited by 1 < Ri < 8 and 0 < Ci < 8; however, alkaline-earth pairs at (C1,C2) = (2,2) were excluded. In cases where a datum was fitted both times, the average of the two fitted values was used. Columns 1 and 2 of Table 1 list some of the tabulated data for molecules with the additional restriction (to limit the chapter size) that the molecules are formed from row-2 (Li-Ne) and row-3 atoms. The 95% confidence-limit errors in r_e are so small as to be negligible. For ΔH_a and IP, the triatomic-molecular data tend to lie in well-defined regions. For f(r) = 8 (R1 = R2 = R3 = 2), tabulated data lie mostly within 2 < C2 < 8 and n_e + C2 > 17 (Figure 1). For f(r) > 8, the domains are at most within the tighter limits 3 < C2 < 7 and 12 < C1 + C3 < 14. Some of the data and their errors are given in columns 1 to 3 of Tables 2 and 3.
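The averaging of doubly fitted data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_predictions`, the example numbers, and the use of the population standard deviation for the spread are hypothetical choices, not details given in the chapter.

```python
from statistics import mean, pstdev


def combine_predictions(fixed_period=None, fixed_group=None):
    """Merge up to two smoothed predictions for one molecule.

    fixed_period: prediction from the fit at fixed (R1, R2), or None.
    fixed_group:  prediction from the fit at fixed (C1, C2), or None.
    Returns (value, spread); spread is None when only one fit exists.
    """
    available = [p for p in (fixed_period, fixed_group) if p is not None]
    if not available:
        raise ValueError("no prediction available for this molecule")
    if len(available) == 1:
        return available[0], None
    # Assumption: the spread is reported as a population standard deviation.
    return mean(available), pstdev(available)


# Hypothetical numbers, for illustration only.
value, spread = combine_predictions(fixed_period=1.50, fixed_group=1.54)
single_value, single_spread = combine_predictions(fixed_period=2.10)
print(value, spread, single_value, single_spread)
```

A molecule fitted only once keeps its single prediction and has no spread, which corresponds to the entries in Table 1 that carry no standard deviation.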

IV. RESULTS FOR DIATOMIC-MOLECULAR r_e

A. Least-Squares Results

Column 3 of Table 1 shows whether the predictions for r_e were obtained with fixed R1 and R2 (numbers) or fixed C1 and C2 (letters: A = 1, B = 2, ...). Columns 4 to 7 show the LS predictions, the averages and standard deviations in cases of double predictions, and the percent differences between these LS values and the tabulated values. The average of these percent differences is -0.22, clearly not different from zero given the standard deviation of 2.609. Size limitations dictated that Ri < 4 in Table 1, which is why the average of the percent differences in the summary of the LS smoothing for all periods (Table 4) is different, i.e., -0.030 with σ = 4.128.

B. Neural Network Results

The remaining columns of Table 1 show the predictions, and their percent differences from the tabulated data, for two neural networks. The averages of the percent differences of the NNs in Table 1 are 0.741 and -0.442 (neither statistically different from zero to one standard deviation), whereas in Table 4 they are 26.39 (statistically different from zero) and -0.53 (with σ = 8.63). The numbers for NN #1 differ so much primarily because no van der Waals molecules are listed in Table 1 and secondarily because of the Ri < 4 limitation. NN #2 learned tabulated data with van der Waals molecules culled out, so its fitting of the tabulated data is more comparable to the LS smoothing, for which the average percent difference is -0.030 with σ = 4.128. These and the following statistics in Section IV are summarized in Table 4. NN #2 has no prediction as low as the minimum tabulated datum but has a maximum prediction higher than the maximum tabulated datum; the latter shows



Table 4. Characteristics of Neural-Network Learning, Least-Squares Smoothing, and Predictions: (R1, R2, C1, C2) 4D and (R1, R2, R3, C1, C2, C3) 6D Bases

Diatomic Molecules, Internuclear Separation: Neural Network #1

Neural Network #2

Triatomic Molecules: Ionization Potential

Heat of Atomization

Neural network information Learning file Number of points

342

316

92

117

Points with no partner

0

0

0

0

Duplicated points

0

0

2

0

Minimum R Maximum R

2

2

2

2

6 1

6 1 7

6 1 7

6

0

0

Minimum C Maximum C Rare-gas molecules Alkaline-earth atoms bonded Minimum tabulated datum Maximum tabulated datum

8 33 2

0 0 1.150 5.100

0

Minimum prediction

1.270

1.265

3.72

298.000 1597.893 289.842

Maximum prediction

2.643

4.981

12.99

1464.713

-0.53

-1.92

-1.61

8.63 6.23 5.99

10.97 7.56 8.14

15.13 10.05 11.39

35 2.25 10.77 7.53

12 -0.04 7.33 5.74

-1.78 12.32 9.77

7.93

4.21

7.14

Average % difference Standard deviation Average abs. % difference Standard deviation Validation file Number of data Average % difference Standard deviation Average abs. % difference Standard deviation

1.150 5.200

0 3.2 14.7

1 7

26.39 25.00 31.32 30.43 37 18.55 28.95 28.39 22.41

12

Global predictions Number of predictions New maximum R

2145

Minimum prediction

7 1.270

Maximum prediction

5.994

2145 7

Least-squares information Tabulated data Number of molecular data Average percent error Least-squares smoothing Average % difference Standard deviation Number of global predictions

150 2.63 -0.030


4.128 299

5418

5418

1.265

3.65

45.270

4.981

13.24

1775.672


Figure 2. Neural network predictions (•) of internuclear separation for (R1,R2) = (2,2). Known data are shown by x.

Figure 3. Mesh surface fitted to the NN predictions in Figure 2. Note the slight asymmetries, and the maximum at left caused by data for various alkaline-earth pairs at (C1,C2) = (2,2).



Figure 4. Same as Figure 2 except for (R1,R2) = (3,3).

Figure 5. Same as Figure 2 but now the groups are fixed, (C1,C2) = (2,6), and the periods vary. The tabulated data for these alkaline-earth chalcogenides are not symmetrically located.


Figure 6. Mesh surface fitted to the NN predictions in Figure 5.

Figure 7. Same as Figure 5 for dihalides, (C1,C2) = (7,7).


that networks do not have to plateau when predictions reach extrema in the learned data. To the extent that the average absolute percent differences and standard deviations of the validation files are similar to those of the learning files, there is good indication that NN #1 and NN #2 learned the tabulated data well. The standard deviations of r_e global predictions made by these NNs can be tentatively set by the learning-file values of σ = 25.00 and 8.63%, respectively.

C. Graphical Representations of Neural Network Results

Figures 2 to 7 show NN #1 results plotted on fixed-row and fixed-column coordinates. They show where the data tend to be concentrated in portions of the base planes. The figures show slight asymmetries. These asymmetries can be quantified by computing the centroid of the data, Σ[(C1i - C2i) × r_ei] / Σ[r_ei], summed over all of the data i. For the globally predicted values, this centroid is 0.055. In fixed-row graphs for r_e, NN predicted surfaces seem to match known surfaces fairly well. NN predictions in the fixed-column graphs for r_e are also quite faithful (Figures 5 to 7). The surfaces are in qualitative agreement with the log(R1R2) formula presented in Ref. 20. There are plateaus in the predicted data for (R1,R2) = (6,6), (6,7), and (7,7).
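The centroid used above as an asymmetry measure is simple to evaluate. The following Python sketch assumes the data are available as (C1, C2, r_e) triples; the function name `asymmetry_centroid` and both small data sets are hypothetical and serve only to show that a set symmetric under exchange of the two atoms gives zero.

```python
def asymmetry_centroid(points):
    """Return sum((C1 - C2) * r_e) / sum(r_e) over (C1, C2, r_e) triples."""
    numerator = sum((c1 - c2) * r for c1, c2, r in points)
    denominator = sum(r for _, _, r in points)
    if denominator == 0:
        raise ValueError("sum of r_e values is zero; centroid undefined")
    return numerator / denominator


# Hypothetical data: the first set is symmetric under exchange of the atoms,
# the second assigns a larger r_e to one of the two mirror-image addresses.
symmetric = [(1, 7, 2.0), (7, 1, 2.0), (4, 4, 1.5)]
skewed = [(1, 7, 2.0), (7, 1, 2.4), (4, 4, 1.5)]

print(asymmetry_centroid(symmetric))
print(asymmetry_centroid(skewed))
```

A value near zero, such as the 0.055 quoted for the global predictions, indicates that the predicted surface is almost symmetric about the C1 = C2 diagonal.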

V. RESULTS FOR TRIATOMIC-MOLECULAR IP AND ΔH_a

A. Least-Squares Results

Columns 4 to 6 of Tables 2 and 3 show the LS predictions and 95% confidence-limit errors. The LS predictions were made in the 3D basis only, because it was impossible to guess at any fitting formulas from inspection of the data in the 6D basis. The molecule OCO was omitted in the LS analysis of ΔH_a because the high numerical value distorted the fitting; OSiO was also omitted.

B. Neural Network Results

Columns 7-9 and 10-12 of Tables 2 and 3 show NN predictions using the 6D and 3D bases. The prediction for a given address in the 3D basis can pertain to several molecules (Section II). Table 4 shows that the predictions for IP are inside the extrema of the tabulated data for the 6D basis; Table 5 shows that they are outside the extrema for the 3D basis. Exactly the opposite is true for ΔH_a. Again, the point is that NNs can extrapolate beyond the extrema of the learned data. The standard deviations of global predictions for IP made by these NNs in the 6D and 3D bases are given by σ = 10.97% (Table 4) and 12.35% (Table 5), respectively. For ΔH_a, the respective standard deviations are 15.13 and 22.67%.
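The percent-difference statistics quoted in the tables can be reproduced with a short Python sketch. The input numbers below are the tabulated and least-squares IP values of Table 2; the function name `percent_differences` and the choice of the population (rather than sample) standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev

# Tabulated and least-squares (3D) ionization potentials in eV, from Table 2.
TABULATED_IP = {
    "FBO": 13.400, "NNO": 12.890, "FBF": 8.400, "OOF": 12.600, "FOF": 13.700,
    "OCO": 13.790, "NCF": 13.320, "OSO": 12.340, "FSiF": 11.000, "OSiO": 11.700,
}
LS_IP = {
    "FBO": 12.15, "NNO": 12.15, "FBF": 8.14, "OOF": 12.60, "FOF": 12.61,
    "OCO": 12.15, "NCF": 12.15, "OSO": 11.96, "FSiF": 11.96, "OSiO": 11.59,
}


def percent_differences(predicted, tabulated):
    """Return 100 * (predicted - tabulated) / tabulated for each molecule."""
    return {
        name: 100.0 * (predicted[name] - tabulated[name]) / tabulated[name]
        for name in tabulated
    }


diffs = percent_differences(LS_IP, TABULATED_IP)
values = list(diffs.values())
average = mean(values)
# Assumption: population form of the standard deviation.
sigma = pstdev(values)
print(round(average, 2), round(sigma, 2))
```

With these inputs the mean comes out negative, with a magnitude close to the 4.21 listed in Table 2, and the population standard deviation comes out close to the listed 5.67. The same function applies unchanged to the neural-network columns.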



Table 5. Characteristics of Neural-Network Learning, Least-Squares Smoothing, and Predictions: {f(r), n_e, C2} 3D Basis

                                  IP (eV)     ΔH_a
Neural network information
Learning file
  Number of points                69          79
  Minimum f(r)                    8           8
  Maximum f(r)                    72          60
  Minimum n_e                     3           8
  Maximum n_e                     20          20
  Minimum C2                      2           2
  Maximum C2                      7           7
  Minimum magnitude               3.2         298.00
  Maximum magnitude               14.7        1597.89
  Minimum prediction              2.812       416.53
  Maximum prediction              13.900      1346.23
  Average % difference            1.39        4.91
  Standard deviation              12.35       22.67
  Average abs. % difference       8.77        15.02
  Standard deviation              8.74        17.62
Validation file
  Number of data                  8           9
  Average % difference            1.37        4.99
  Standard deviation              9.72        20.4
  Average abs. % difference       8.07        14.73
  Standard deviation              4.39        14.12

Global predictions Same independent-variable limits Number of predictions Minimum prediction Maximum prediction New independent-variable limits Number of predictions New maximum f(r) New minimum n_e New maximum n_e New minimum C2

5724 376.38 1346.23 2596

16324

75

98

24

3 24 1

Maximum C2

8

Minimum prediction

0.000

367.59

Maximum prediction

16.484

1556.87

Least-squares information Tabulated data Number of molecular data Average percent error

80

91

4.23

2.63

11.42

2.92 26.82

Least-squares smoothing Average % difference Standard deviation Number of global predictions

14.65 205

254



C. Graphical Results

Figure 8 shows global NN predictions for ΔH_a, when R1 = R2 = R3 = 2, plotted on the independent variables n_e and C2. Most of the tabulated data lie in the region between the solid line and the near edges of the figure. Figure 9 shows a contour map of the same surface. The region bounded by dotted lines and the edges of the figure is of considerable interest, because the contours have slopes of approximately -1. Thus the contours are described approximately by

n_e + C2 = C1 + 2C2 + C3 = (C1 + C2) + (C2 + C3) = constant

It appears that in this region, molecules with similar ΔH_a are not isoelectronic in the usual sense, n_e = constant, but in the "adjacent-DIM" sense. The phenomenon appears to be restricted, for this property, to molecules formed of atoms with high electronegativities. Figure 10 shows a contour map of the surface of global predictions for (R1,R2,R3) = (2,2,3) [and of course (3,2,2)]. Most of the tabulated data lie in a much smaller region than in Figures 7 and 8. In the region at the top right of the figure,

[Figure 8 plot labels: CCF, NCO; FCN, OCO; CCN, BCO, BeCF; C3, BCN, BeCO, LiCF.]

Figure 8. Neural-network global predictions for heat of atomization ΔH_a plotted in the 3D basis [on n_e and C2, for f(r) = 8]. As explained in the text, this basis results in there being more than one molecule for most addresses. All conceivable molecules are considered, whether or not they exist under currently studied conditions. Most of the meaningful predictions (i.e., those in the domain where most tabulated data lie) are in front of the diagonal solid line or in the corridor extending from that line along C2 = 4 to the right to n_e = 12 (C3, BCN, BeCO, and LiCF).


[Figure 9 contour legend: intervals of width 60 from 380 to 1400; axis ticks from 11 to 20.]

Figure 9. A contour map of the surface in Figure 8 (but with different intervals).

[Figure 10 contour legend: intervals of width 60 from 380 to 1340; axis ticks from 11 to 20.]

Figure 10. Same as Figure 8 except that f(r) = 10; see text for associated values of R1, R2, and R3.



[Figure 11 contour legend: intervals of width 60 from 380 to 1280.]

Figure 11. Same as Figure 8 except that f(r) = 12. Comparing this figure with the previous two figures makes it easy to see the periodicity of the predicted data, their monotonic decline, and the shrinking of the region with slope -1 as f(r) increases.

similar molecules are again approximately isoelectronic in the "adjacent-DIM" sense. Figure 11 shows the contour map pertaining to (R1,R2,R3) = (2,2,4) [and (4,2,2)], (2,3,2), and (3,2,3). Now the region bounded by the dotted lines is much smaller. At larger values of f(r), the phenomenon of "adjacent-DIM" isoelectronic similarity disappears. All IP predictions with the same f(r) and n_e were the same, and so the contours on the graphs (not shown) all consist of lines paralleling the C2 axis.
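The difference between ordinary isoelectronic grouping (n_e = constant) and the "adjacent-DIM" grouping (n_e + C2 = constant) can be illustrated with a small Python sketch. It assumes that n_e is the sum of the three group numbers, as the contour relation in this section implies; the helper names and the choice of example molecules are hypothetical, while the group numbers for B through F are the usual ones, consistent with OCO at (6,4,6) and FOF at (7,6,7) in Figure 1.

```python
from collections import defaultdict

# Group numbers C for the row-2 atoms used in this example.
GROUP = {"B": 3, "C": 4, "N": 5, "O": 6, "F": 7}


def electron_count(atoms):
    """n_e = C1 + C2 + C3 for a triatomic given as (left, central, right)."""
    c1, c2, c3 = (GROUP[a] for a in atoms)
    return c1 + c2 + c3


def adjacent_dim_sum(atoms):
    """n_e + C2 = (C1 + C2) + (C2 + C3): the central atom is counted twice."""
    c1, c2, c3 = (GROUP[a] for a in atoms)
    return (c1 + c2) + (c2 + c3)


def group_by(molecules, key):
    """Collect molecule names under the value returned by key(atoms)."""
    groups = defaultdict(list)
    for atoms in molecules:
        groups[key(atoms)].append("".join(atoms))
    return dict(groups)


molecules = [
    ("O", "C", "O"), ("N", "C", "F"), ("N", "N", "O"),
    ("F", "B", "O"), ("F", "O", "F"),
]

print(group_by(molecules, electron_count))
print(group_by(molecules, adjacent_dim_sum))
```

In this small set, OCO, NCF, NNO, and FBO share n_e = 16, but only OCO and NCF also share the adjacent-DIM sum, because the central atom enters that sum twice.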

VI. DISCUSSION

This paper assumes, just as do Refs. 14-16, that molecules exist, in spite of the questions raised in Refs. 22-24. For diatomic-molecular r_e and triatomic-molecular ΔH_a, the LS results are more accurate; for triatomic-molecular IP, the NN results are more accurate. The predictions of both methods might be improved if the tabulated data were culled so as to keep only the diatomic or triatomic molecules with the same ground-state terms; this improvement can be an area of future work. NNs are very sensitive to the presence of additional independent variables in the basis (e.g., n^, n^), may be sensitive to the extent to which the learning data are equally distributed in the space of independent variables, and are not very sensitive to various partitions of tabulated data into learning and validation files. These aspects of NNs, in the context of molecular classification, are under study now.



Inquiries concerning predictions for molecules not listed in Tables 1 through 3 should be directed to R.H.

REFERENCES

1. R. Hefferlin, J. Phys. Chem. 1995, 99, Sill.
2. R. Hefferlin, Periodic Systems of Molecules and their Relation to the Systematic Analysis of Molecular Data (Edwin Mellen Press, Lewiston, New York, 1989).
3. E. V. Bavaev and R. Hefferlin, in: Concepts in Chemistry, ed. D.H. Rouvray (Research Studies Press/John Wiley, Chichester, U.K., 1997).
4. C. M. Carlson, R. J. Cavanaugh, R. A. Hefferlin, and G. V. Zhuvikin, J. Chem. Inf. Comp. Sci. 1996, 36, 396.
5. R. Hefferlin and M. Kutzner, J. Chem. Phys. 1981, 75, 1035.
6. Ref. 2, pp. xxiii-xxxiv, 190-233.
7. Ref. 2, pp. 234-249.
8. Ref. 2, pp. 262-289.
9. C. Carlson, J. Gilkeson, K. Linderman, S. LeBlanc, and R. Hefferlin, Estimation of Properties of Triatomic Molecules from Tabulated Data Using Least-square Fitting, Croatica Chem. Acta, in press for the June, 1997, issue.
10. B. Davis, B. Laing, and R. Hefferlin, in: Proceedings of the 1997 International Arctic Seminar (Pedagogical Institute, Murmansk, Russia, 1997), pp. 31-36.
11. T. R. Cundari and E. Moody, J. Chem. Inf. Comp. Sci. 1997, 32, 871.
12. T. R. Cundari and E. Moody, J. Mol. Struct. (Theochem) 1998, 425, 43.
13. J. Lawrence, Introduction to Neural Network Design, Theory, and Applications (California Scientific Software Press, Nevada City, CA, 1994).
14. R. Hefferlin and G. Zhuvikin, J. Quant. Spectrosc. Radiat. Transfer 1984, 32, 151.
15. R. Hefferlin, J. Chem. Inf. Comput. Sci. 1994, 34, 314.
16. K. Huber and G. Herzberg, Constants of Diatomic Molecules (D. Van Nostrand Reinhold Co. Inc., New York, 1979).
17. L. V. Gurvich, et al., Thermodinamicheskie Svoista Individual'nikh Veschestv, Vols. 1-4 (Nauka, Moscow, 1978, 1979, 1981, 1982).
18. L. V. Gurvich, et al., Energii Razryva Khimicheskikh Svyazei. Potentialy Ionizatzii i Srodsvo k Electronu (Nauka, Moscow, 1974), pp. 229-289. [An earlier edition was translated into English: V. I. Vedeneyev, et al., Bond Energies, Ionization Potentials, and Electron Affinities (St. Martins, New York, 1966).]
19. A. J. Sauval and J. B. Tatum [computations for triatomic molecules done at the same time as those for diatomic molecules, the latter appearing in Astrophys. J. Suppl. 1984, 56, 193].
20. R. F. Nalewajski, J. Phys. Chem. 1979, 83, 2677.
21. R. Cavanaugh, R. Marsa, J. Robertson, and R. Hefferlin, J. Mol. Struct. 1996, 382, 137.
22. H. Primas, Chemistry, Quantum Mechanics and Reductionism: Perspectives in Theoretical Chemistry (Springer-Verlag, Berlin, Germany, 1983).
23. V. V. Nefedova, A. I. Boldyrev, and J. Simons, J. Chem. Phys. 1993, 98, 8801.
24. A. I. Boldyrev, Structure and Dynamics of Non-Rigid Molecular Systems (Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995).


INDEX ASA, 57-61 (see also "Tagged sets") density functions, quantum similarity measures (QSM) and, 43-45, 51-56, 57-68 (see also "Tagged sets") definitions, 51-56 Atomic similarity through neural network, 205-213 abstract, 205, 206 conclusions, 212, 213 introduction, 206, 207 lAC net, 206, 207-209 property layers, 207 neural network for periodic table, architecture and function of, 207-209 database retriever, application as, 209 hidden associations or atomic similarity, applications for, 209 prediction of properties for elements, 211,212 self-association of elements and properties, 209-211 families, three, for 58 elements, 209, 210 Mendeleyev-like properties, 206, 209,213 telluric screw of de Chancourtois, 210

well self-associated group, term, 209 Bader's atoms-in molecules theory, 192, 217 Betti numbers, 85, 86 (see also "Quantum chemical shape...") Boltzmann Distribution (BD), 38, 39 (see also "Quantum similarity") Boolean tagged sets, 43-65 (see also "Fuzzy sets...") degenerate and nondegenerate, 48 metric background vector spaces, 49 vector spaces, 48, 49 Born-Oppenheimer approximation, 17, 38 Breit Hamiltonian, 4 Browsable structure-activity datasets, 153-171 (see also "Structure-activity...") Calculations, similarity, transferability of, 105-134 (see also "Transferability...") Chemicals, molecular similarity of using topological invariants, ni-lSS (see also "Topological invariants...") Circumscribing/excised internal structure (EIS) concept. 289

290 correlations and applications of, 259-264 abstract, 259 applications, other, 262-264 constant-isomer benzenoid series, 261,262 constant-isomer series of fluoranthenoids/fluorenoids and indacenoids, 262, 263 history, 260 Piatt's perimeter rule, 260 introduction, 260 Comparison of quantum similarity measures (QSM) derived from one-electron, intracule, and extracule densities, 215-243 abstract, 216 application examples, 222-242 diatomic molecules, 225-242 Hartree-Fock approximation, 226 second-order QSMs, computation of, 231-233 similarity functions, maximization of, 233-237 similarity matrices, construction of, 237-242 similarity measures, three, 224 topological characteristics of electron density distributions, 225, 226 two-electron atomic systems, 222-225 Z^^, 222-224, 237, 241 computational details, 219-222 Gaussian 94 and Gamess programs, 220, 222, 225 grid spacing, 220-222 Hartree-Fock theory level, 220 intracule and extracule densities, calculation of, 220, 221 second-order QSMs, calculation of, 221, 222

INDEX

superposition of molecules, 222 conclusions, 242 introduction, 216-219 first-order density functions, 216, 217 intracule and extracule densities, 218 second-order density functions, 217,218 Complementarity principle, uses of in molecular similarity and related aspects, 245-258 abstract, 245 aufbau principle, 248 conclusion, 257, 258 definitions, basic, 247, 248 complementary, 247 eigenvalue, 247 eigenvector, 247 frontier MOs (FMOs), 247 functional groups, 247 molecular energy level, 247 molecular graph, 247 pairing theorem, 247 self-complementary, 247, 248 wave function, 247 introduction, 246 aufbau principle, 246, 248 eigenvalues, relations among, 246 HMO model and 7i-electron systems, 246 quantum chemical-based invariants, 246 similarity as modeling tool, 246 six S*s in molecular modeling, 246 subspectrality, 246 results and discussions, 248-257 infinite series of molecular graphs pairwise strongly subspectral, 254-257 self-complementary molecular graphs, 249, 253, 254

Index

Convex sets, 43-45, 55-57 {see also "Tagged sets") Datasets, browsable structure-activity, 153-170 {see also "Structure-activity...") Density function, 51 conclusions, 70 statistical interpretation of, 65-69 {see also "Diagonal vector...") tagged set, 52 Diagonal vector spaces and quantum chemistry, 43-45, 65-70 abstract, 44 conclusions, 70 density functions and other problems, expression of, 68 discrete QO representations, nature of, 66, 67 generating n-dimensional VS: DVS, 67,68 Hilbert spaces, 65, 66 introduction, 44, 45 Elementary Jacobi rotations (EJR) technique, 5, 9-12 {see also "Quantum similarity") Extracule similarity measures, comparison of, 215-243 {see also "Comparison...") Fuzzy sets and Boolean tagged sets, 43-51 abstract, 44 applications, 49, 50 conclusions, 70 definitions, preliminary, 46, 47 extensions, 50, 51 hypercube, 46,47 introduction, 44,45 metric background vector spaces, 49 Minkowski formula, 49 molecular point-cloud, 50

291

operations over Boolean tagged sets, 48, 49 point-molecule, 50 QSM^ 50 tagged classes, 47, 48 unit n-dimensional cube, 46, 47 GATOMIC program, 12 Girona index, 24 Hartree-Fock (HF) approximation, 188,190,220,226 Hybrid density functional, optimizing by quantum molecular similarity techniques, 187-203 abstract, 188 conclusions, 201 introduction, 188-190 adiabatic connection formula, 188 approaches, two, 188 B3LYPfunction, 189, 190, 195-201 B3PW91 method, 189, 193 density functional theory (DFT), 188 exchange-correlation function, 188, 189 generalized gradient approximation (GGA), 188 Hartree-Fock treatment of exchange, 188, 190 Lee-Yang-Parr (LYP) functional, 189 local spin-density approximation (LSDA), 188 quantum molecular similarity measure (QMSM), 190 singles and doubles quadratic configuration interaction (QCISD), 189-201 three-parameter function, Becke's, 189

292 methodology, 190-192 Davidon-Fletcher-Powell (DFP) algorithm, 191, 192 Hartree-Fock method, 190 Messem program, 191, 192 results and discussion, 192-201 CO molecule, 192-199 LiFmolecule, 200, 201 N2 molecule, 198-200 Intracule similarity measures, comparison of, 215-243 (see also "Comparison...") Introduction to Solid State Physics, 206 Lagrange multiplier technique, 8, 13 Least-squares (LS) and neural-network (NN) forecasting from critical data, 265-287 abstract, 266 multiple regression, use of, 266 data, 267-277 discussion, 286, 287 introduction, 266, 267 LS methods, advantage of, 266 NN methods, advantage of, 266 systemization, advantages of, 266 results for diatomic-molecular r^, 277-282 graphical representations of neural network results, 279-281,282 least-squares results, 277 neural network results, 277-282 results for triatomic-molecular IP and A//^, 282-286 graphical results, 284-286 least-squares results, 282, 283 neural network results, 282, 283 theory, 267

INDEX

diatomic-molecular data, 267, 277-282 triatomic data, 267, 274 Mendeleyev postulates, 2, 206, 209, 213 {see also "Atomic similarity...") Neural network, atomic similarity through, 205-213 {see also "Atomic similarity...") Neural-network (NN) forecasting, 265-287 {see also "Least-squares...") One-electron similarity measures, comparison of, 215-243 {see also "Comparison...") Organic synthesis design, similarity in, 137-151 abstract, 137, 138 comparison methodology, 140-142 global transform similarity index (GTSI), 142 globularity similarity index (GSI), 141 number of participating reactions (PRN), 142, 147 substructure similarity index (SSI), 141 conclusion, 150 introduction, 138, 139 results and discussion, 142-150 with Sirenin and Methoxatin, 142-150 similarity measures, 139, 140 global transform similarity measure (GTSM), 139 globularity similarity measure (GSM), 139, 140 strategy and tactics, 139 substructure similarity measure (SSM), 139, 140

Index

Pattern recognition techniques, 73-77 abstract, 73 alignment, 74-76 conclusion, 76 introduction, 74 rotational invariance, 74 seven-dimensional vector, 75, 76 translational invariance, 74 two-dimensional representations, 74 Periodic table database, applying neural network to, 205-213 (see also "Atomic similarity...") Pfeiffer rule, 166 Piatt's perimeter rule, 260 QSAR: finely tuned, 63-65 Quantum chemical shape concept, topology and, 79-92 introduction, 79-81 additive fuzzy density fragmentation (AFDF) methods, 80 adjustable density matrix assembler (ADMA) method, 80 algebraic topology as ideal tool, 80 "ball and stick"-type stereodiagram, 80 electron density cloud models, 80 molecular electron density loge assembler (MEDLA) technique, 80 shape group methods (SGM), 80 sphere "space-filling" models, 80 molecular shape and topological resolution, 81-86 Betti numbers, 85, 86 molecular isodensity contour (MEDCO) surfaces, 84-90

293 resolution-based similarity measures (RBSM) approach, 81 shape-group approach, 84-86 subbase-base approach, 83 topological space, 82-84 summary, 90 topological resolution of shape of electron density, molecular similarity measures based on, 86-90 Quantum chemistry, diagonal vector spaces and, 43-45, 65-69 (see also "Diagonal vector...") Quantum molecular similarity techniques, using to optimize hybrid density functional, 185-201 (see also "Hybrid density...") Quantum objects (QO), 51-56 Quantum similarity, 1-42 atomic shell approximations (ASA), 5-14 alternative approximate expression of density functions, 12-14 approximate expectation values, 14 ASA coefficient constraints, 7, 8 coefficient optimization using elementary Jacobi rotations, 9-12 complete ASA (CASA), 12-14 density functions, 6, 7 elementary Jacobi rotations (EJR) technique, 5, 9-12 GATOMIC program, 12 generating vector, 9, 10, 12 promolecular approximation, 8 quadratic error function, 8, 9 conclusions, 40 introduction, 2-4

294 atomic shell approximation, 3, 5-14 concepts, relevant, 2, 3 Mendeleyev postulates, 2 QSAR or QSPR procedures, 3 manipulation of similarity measures, 21-26 {see also "...similarity indices") measures, 4, 5 Breit Hamiltonian, 4 Dirac's delta function, 4, 17 multiple density QSM, 5 overlap-like QSM, 4 quantum object set (QOS), 5 quantum self-similarity measures (QS-SM), 4 triple density QSM, 4, 5 molecular representations, 14-21 density integral transformations (DIT), 16, 17 density maps and overlap-like measures, 17 discrete matrix representation, 18-21 molecular point cloud, 19 molecular superposition, 14, 16 MQSM surfaces, 15 density transformations, 14, 15-17 QO discrete representation, 14, 18-21 transform kernel, 17 QSAR and related problems, origin of, 27-37 convex sets and QSPR, 29-31 molecular descriptors, 29 molecular quantum self-similarity measure (MQS-SM), 31 MQSM and molecular topology, 31-33 MQSM topological indices (MQTI), 33-37 NESTED-MLR, 37

INDEX

quantitative structure-property or-activity relationships (QSPR or QSAR), 27 success of, 27-29 topological indices (TI), 31, 33-37 topological matrices (TM), 31 similarity over energy surfaces, 38-40 Boltzmann Distribution (BD), 38, 39 Boltzmann similarity measure (BSM), 39 electronic energy surfaces (EES), 38 Gaussian distribution (GD), 39,40 general distributions and similarity measures, 39,40 molecular electrostatic potential (MEP), 38 partition functions, 38, 39 similarity indices (QSI), 21-26 C-class, 22, 23 C-class generalized QSI, 24 Carbo similarity index, 22, 25, 26 cosine-like, and multiple QSM, 22,23 D-class dissimilarity indices, 23 D-class generalized QSI, 24 discrete representation indices, 25,26 generalized QSI, 23, 24 Girona index, 24 Hodgkin-Richards index, 24, 26 Tanimoto index, 24 transformations between QSI, 24, 25 Self-associative periodic table of elements, 205-213 {see also "Atomic similarity...") Structure-activity datasets, browsable, 153-170

Index

abstract, 153, 154 introduction, 154, 155 browsing, question of, 154 level sets, 154, 164-167 similarity searching as browsing tool, 154 substructure searching as traditional method, 154 level sets as primary structural browsing variables, 164-167 BdSetO and BdDel2, 165-167 meta and para positions, cliffs and planes related to, 165 Pfeiffer rule, 166 merchandiser, problem of, 155, 156 chemical descriptor, 155 primary browsing variable, 155 secondary browsing variables, 155 systematic browsing, need for, 155 molecular equivalence numbers as primary structural browsing variables, 160-164 cliffs and planes, 163, 164 globally quantitative chemical descriptor, 161, 162 nominal chemical descriptors, 162 Rnglso value, 162, 163, 167 ShrbSiz count, 160-163 similarity-based projections as primary browsing variables, 167-169 BdDelRPl andBdDelMPl, 167-169 structure-activity dataset, 156-160 aldoxime, 156, 158 delivered potency, 156 perillartine, 156 planes and cliffs, 159, 160 summary and conclusions, 169 Splus, 169 Spotfire, 169

295 Syntheses of different compounds, comparing, 137-151 (see also "Organic synthesis...") Tagged sets, convex sets, and QSM, 51-65 ASA, 57-61 in atoms, 58, 59 continuous case, 60 within CS environment, 57, 58 elementary Jacobi rotations, 58, 59 LCAO MO approach, 60 MO theory, considerations around, 59, 60 molecules, structure in, 59 promolecular approach, 58, 59 SCF theory, 60 conclusions, 70 convex operators, 61-63 PD operators, convex linear combinations of, 62 tuned QSM, SM, and QO descriptors, 62, 63 convex sets, 56, 57 generating vector, 56, 57 Hilbert space, 57 density function, 51 statistical interpretation of, 65-69 (see also "Diagonal vector spaces...") quantum objects (QO), 51-56 similarity matrices and discrete representations of, 53, 54 QSAR, finely tuned, 63-65 and QSM, 51-56 definitions, 51-56 molecular point cloud, 54, 55 vector semispaces, 55 Topological fragment spectra (TFS), structural similarity analysis based on, 93-104 abstract, 94

296 concluding remarks, 103 introduction, 94, 95 approaches, two, 94 exhaustive fragmentation profile, 95 graph theoretical analysis, 94, 95 substructural analysis, 94, 95 methods, 95-98 quantitative evaluation of structural similarity based on TFS, 97, 98 topological fragment spectrum (TFS), 95-97 results and discussion, 98-103 application to similar structure search in chemical database, 100-103 psychotropic agents, forty-two, structural similarity analysis of, 98 spanning tree, 98, 99 subspectrum use, 98-100 Topological invariants, characterization of molecular similarity of chemicals using, 171-185 abstract, 172 discussion, 183, 184 introduction, 172 graph invariants, 172 ^-nearest-neighbor (KNN)-based estimation method, 172 planar graphs, use of, 172 topological indices (TIs), 172 methods, 173-179 database, 173 indices, calculation of, 173-178 indices, classification of, 178 /indices, 178 ^-nearest-neighbor selection and property estimation, 179 Az-dimensional space, 179 PCA analysis, 178, 179

principal components (PCs), 178, 179 PRINCOMP, 178 statistical methods and computation of similarity, 178, 179 topochemical indices, 178 topological parameters, symbols, definitions, and classifications of, 174 topostructural indices, 178 Wiener index (W), 173 results, 179-183 analogue selection, 181, 183 K-nearest-neighbor property estimation, 182, 183 principal component analysis (PCA), 179-181 Topology and quantum chemical shape concept, 81-94 (see also "Quantum chemical shape...") Transferability of similarity calculations from substructures to complex compounds, analysis of, 105-134 abstract, 106 calculations, transferability, similarity measures and indices, 128-130 conclusions, 130 introduction, 106-108 approaches in drug design, two, 106, 107 in drug design, 106 electronic distribution calculation, 107 in nucleic bases, 107, 110, 111 QSAR measures, 106 methodology, 108-112 bases, effects of, 112 results and discussion, 112-128

base triplets, calculations on, 120-128, 131-134 charged compounds, ED values of, 115-117 ED variations: changing conformation, 117-119 ED variations: changing structure, 112-117 ED variations: changing system, 119-125 neutral compounds, ED values of, 114, 115 similarity index: juxtapositioned pairs, 127

similarity index, 2D and 3D, 125, 126 similarity index: SP influence, 126, 127 similarity index: triplets, 127, 128 similarity indices calculated using neutral and charged isolated bases, values of, 118, 119 supplementary information, 131-134 Vector semispaces, 43-45, 65-69 (see also "Diagonal vector...")

Advances in Molecular Similarity Edited by Ramon Carbo-Dorca, University of Girona and Paul G. Mezey, University of Saskatchewan Volume 1, 1996, 287 pp. ISBN 0-7623-0131-7

$112.50/£72.50

CONTENTS: Introduction to the Series: An Editor's Foreword, Albert Padwa. Preface, Ramon Carbo-Dorca and Paul G. Mezey. Quantum Molecular Similarity Measures: Concepts, Definitions, and Applications to Quantitative Structure-Property Relationships, R. Carbo-Dorca, E. Besalu, Ll. Amat, and X. Fradera. Similarity of Atoms in Molecules, B.B. Stefanov and J. Cioslowski. Momentum-Space Similarity: Some Recent Applications, P.T. Measures, N.L. Allan, and D.L. Cooper. Molecular Similarity Measures of Conformational Changes and Electron Density Deformations, P.G. Mezey. Electron Correlation in Allowed and Forbidden Pericyclic Reactions from Geminal Expansion of Pair Densities. A Similarity Approach, R. Ponec. Conformational Analysis from the Viewpoint of Molecular Similarity, J.M. Oliva, R. Carbo-Dorca, and J. Mestres. How Similar are HF, MP2 and DFT Charge Distributions in the Cr(CO)6 Complex?, M. Torrent, M. Duran, and M. Sola. Quantum Molecular Similarity Measures (QMSM) and the Atomic Shell Approximation (ASA), P. Constans, Ll. Amat, X. Fradera, and R. Carbo-Dorca. Automatic Search for Substructure Similarity: Canonical Versus Maximal Matching; Topological Versus Spatial Matching, G. Sello and M. Termini. Using Canonical Matching to Measure the Similarity Between Molecules: The Taxol and the Combretastatine A1 Case, G. Sello and M. Termini. New Antibacterial Drugs Designed by Molecular Connectivity, J. Galvez, R. Garcia-Domenech, C. de Gregorio Alapont, J.V. de Julian-Ortiz, M.T. Salabert-Salvador, R. Soler-Roca. Index.

Advances in Molecular Structure Research Edited by Magdolna Hargittai, Structural Chemistry Research Group, Hungarian Academy of Sciences, Budapest, Hungary and Istvan Hargittai, Institute of General and Analytical Chemistry, Budapest Technical University, Budapest, Hungary Volume 1, 1995, 368 pp.

$109.50/£69.50

ISBN 1-55938-799-8 CONTENTS: List of Contributors. Introduction to the Series: An Editor's Foreword, Albert Padwa. Preface, Magdolna Hargittai and Istvan Hargittai. Measuring Symmetry in Structural Chemistry, Hagit Zabrodsky and David Avnir. Some Perspectives in Molecular Structure Research: An Introduction, Istvan Hargittai and Magdolna Hargittai. Accurate Molecular Structure from Microwave Rotational Spectroscopy, Hans Dieter Rudolph. Gas-Phase NMR Studies of Conformational Processes, Nancy S. True and Cristina Suarez. Fourier Transform Spectroscopy of Radicals, Henry W. Rohrs, Gregory J. Frost, G. Barney Ellison, Erik C. Richard, and Veronica Vaida. The Interplay between X-Ray Crystallography and ab Initio Calculations, Roland Boese, Thomas Haumann and Peter Stellberg. Computational and Spectroscopic Studies on Hydrated Molecules, Alfred H. Lowrey and Robert W. Williams. Experimental Electron Densities of Molecular Crystals and Calculation of Electrostatic Properties from High Resolution X-Ray Diffraction, Claude Lecomte. Order in Space: Packing of Atoms and Molecules, Laura E. Depero. Index.
Volume 2, 1996, 272 pp. ISBN 0-7623-0025-6

$109.50/£69.50

CONTENTS: List of Contributors. Preface, Magdolna Hargittai and Istvan Hargittai. Conformational Principles of Congested Organic Molecules: Trans is Not Always More Stable Than Gauche, Eiji Osawa. Transition Metal Clusters: Molecular versus Crystal Structure, Dario Braga and Fabrizia Grepioni. A Novel Approach to Hydrogen Bonding Theory, Paola Gilli, Valeria Ferretti, Valerio Bertolasi and Gastone Gilli. Partially Bonded Molecules and Their Transition to the Crystalline State, Kenneth R. Leopold. Valence Bond Concepts, Molecular Mechanics Computations, and Molecular Shapes, Clark R. Landis. Empirical Correlations in Structural Chemistry, Vladimir S. Mastryukov and Stanley H. Simonsen. Structure Determination Using the NMR "Inadequate" Technique, Du Li and Noel L. Owen. Enumeration of Isomers and Conformers: A Complete Mathematical Solution for Conjugated Polyene Hydrocarbons, Sven J. Cyvin, Jon Brunvoll, Bjorg Cyvin, and Egil Brendsdal. Index.

Volume 3, 1997, 360 pp. ISBN 0-7623-0208-9

$109.50/£69.50

CONTENTS: List of Contributors. Preface, Magdolna Hargittai and Istvan Hargittai. Determination of Reliable Structures from Rotational Constraints, Jean Demaison, Georges Wlodarczak, and Heinz Dieter Rudolph. Equilibrium Structure and Potential Function: A Goal to Structure Determination, Victor P. Spiridonov. Structures and Conformations of Some Compounds Containing C-C, C-N, C-O, N-O, and O-O Single Bonds: Critical Comparison of Experiment and Theory, Hans-Georg Mack and Heinz Oberhammer. Absorption Spectra of Matrix-Isolated Small Carbon Molecules, Ivo Cermak, Gerold Monninger, and Wolfgang Kratschmer. Specific Intermolecular Interactions in Organic Crystals: Conjugated Hydrogen Bonds and Contacts of Benzene Rings, Peter M. Zorky and Olga N. Zorkaya. Isostructurality of Organic Crystals: A Tool to Estimate the Complementarity of Homo- and Heteromolecular Associates, Alajos Kalman and Laszlo Parkanyi. Aromatic Character of Carbocyclic π-Electron Systems Deduced from Molecular Geometry, Tadeusz Marek Krygowski and Michal Cyranski. Computational Studies of Structures and Properties of Energetic Difluoramines, Peter Politzer and Pat Lane. Chemical Properties and Structures of Binary and Ternary Se-N and Te-N Species: Application of X-Ray and ab Initio Methods, Inis C. Tornieporth-Oetting and Thomas M. Klapotke. Some Relationships between Molecular Structure and Thermochemistry, Joel F. Liebman and Suzanne W. Slayden. Index.
Volume 4, 1998, 390 pp. ISBN 0-7623-0348-4

$109.50/£69.50

CONTENTS: Preface, Magdolna Hargittai and Istvan Hargittai. Molecular Geometry of "Ionic" Molecules: A Ligand Close-Packing Model, Ronald J. Gillespie and Edward A. Robinson. The Terminal Alkynes: A Versatile Model for Weak Directional Interactions in Crystals, Thomas Steiner. Hydrogen Bonding Systems in Acid Metal Sulfates and Selenates, Erhard Kemnitz and Sergei I. Troyanov. A Crystallographic Structure Refinement Approach Using ab Initio Quality Additive, Fuzzy Density Fragments, Paul G. Mezey. Novel Inclusion Compounds with Urea/Thiourea/Selenourea-Anion Host Lattices, Thomas C. W. Mak and Qi Li. Roles of Zinc and Magnesium Ions in Enzymes, Amy Kaufman Katz and Jenny P. Glusker. The Electronic Spectra of Ethane and Ethylene, Camille Sandorfy. Formation of (E,E)- and (Z,Z)-Muconic Acid in Metabolism of Benzene: Possible Roles of Putative 2,3-Epoxyoxepins and Probes for Their Detection, Arthur Greenberg. Some Relationships between Molecular Structure and Thermochemistry, Joel F. Liebman and Suzanne W. Slayden. Index.
