Discussion #2

To: corecifchem@iucr.org
Subject: Discussion #2
From: David Brown <idbrown@mcmaster.ca>
Date: Mon, 01 Dec 2003 15:53:23 -0500
Dear Colleagues,

    You will have received the report on the NIST workshop on chemistry XMLs
that I recently circulated.  This email contains a further discussion paper
designed to build on the information I picked up at NIST, and on the comments
stimulated by discussion paper #1.  Please respond by replying to the whole
discussion group by Dec 31.

    CIF is designed to report crystallographic information, something which
it does well.  It also includes some (not much) specifically chemical
information (e.g. authors have to decide whether the distance between two
atoms is a bond or a contact, and provision is made for including bond graphs,
i.e. 2D structures).  There are also some items in which the chemical and
crystallographic concepts are inextricably mixed.  For example, in an earlier
email I wrote: 'The entity defined as an atomic site in CIF is strictly an
abstract point in the unit cell which is identified by its coordinates.  Such
points have crystallographic attributes such as symmetry, but we have also
decorated them with chemical attributes such as the names of atoms that occupy
the site.  This leads people to take the category name ATOM_SITE literally as
being a list of atoms that define some number of molecular units.  Indeed, for
Z' = 1 structures many people will transform the crystallographic positional
coordinates to orthogonal coordinates and treat the molecule as if the crystal
never existed.  We have been forced to include 'dummy atoms' in ATOM_SITE in
order to define points having no chemical significance.  We have never faced
up to the problem that CIF describes only the crystallography, not the
chemistry.'  Subsequent to my writing this, it was pointed out that ATOM_SITE
does indeed define the chemistry, in that it defines the 3D structure.  It
therefore combines both crystallographic and chemical concepts, and separating
out these two threads and making sure they are treated logically within their
own strands is one of our tasks.

    This problem was brought to a head by the innocent request of CCDC for a
way to include Z' in CIF, but Howard Flack (HDF) wrote: 'There are two HUGE
problems with this [proposed] definition. ...
(1) There is not in fact a single Z' which describes what they are trying to
express. Take the example of the compound AB composed of two molecules A and B
in the ratio 1:1 as represented by the chemical formula AB. The structure is
in space group P2. Molecules A have point symmetry 2 and sit on a 2-fold axis
in the crystal structure. There are two independent A molecules in the
asymmetric unit. Molecules B have symmetry 1 and sit in a general position.
There is one independent molecule B in the asymmetric unit. What is the value
of your Z'?
(2) And, surprise surprise coming from HDF, there are problems of chirality
with this Z'....'

    Carol Brock identifies a second problem we face when she writes:
'A problem that stands out is the difficulty of developing a set of
descriptors that apply well to all kinds of structures.  Molecular structures,
and especially organic structures, are easier (even if they contain ions)
because there is almost always a clear distinction between intra- and
intermolecular distances.  Similar distinctions can almost always be made for
organic macromolecules.  There are problems for some coordination complexes
and polymers that have metal-atom contacts that some people would describe as
bonding and others would describe as nonbonding.  Pure inorganics and
intermetallics have their own complications.'

    HDF, coming at this problem from a different direction, expressed shock
at the thought that the chemists have no unique way of defining a molecule.

    From what I learned at the workshop, chemists not only have no unique
way of defining a molecule, they don't even care!  Most of the ontologies
(dictionaries) being developed in chemistry are, like CIF, closely related to
an experimental technique where the question of defining a molecule is not
important.  Peter Murray-Rust's CML defines a molecule as composed of atoms,
but leaves it to the author to state which atoms.  Miloslav Nic in Prague is
developing GTML (Graph Theory Mark-up Language) which can be used for
molecular descriptions, but this also assumes that the molecular graph is
already known.

    mmCIF includes chemical descriptions but these are fairly specific to
the kind of macromolecules found in biological systems, i.e., polymers
composed of a limited number of different monomeric units such as amino acids.
Provision is made for the definition of the structures of the monomers and the
ways in which these are linked to form the polymeric macromolecules.  We are
unlikely to find much help in these definitions.

    HDF recently wrote: 'I wonder also whether we should, and have the
courage to, embark on representing information on supramolecules, which I
think are probably molecules made out of molecules. It all sounds too awful to
be true. Even a standard hexa-coordinated Co complex might be encoded as each
of the individual ligands as a 'molecule' and the whole complex as another
'molecule'.'

    In the rest of this email I present a possible way to resolve some of
these difficulties using a description based on graph theory, since the
familiar 2-dimensional molecular structure diagram is what the mathematicians
call a graph.  They define a graph as a set of vertices (atoms) that are
linked by a number of edges (bonds).  The selection of the atoms that form the
set of vertices does not present a problem because everyone agrees on what
atoms are present in a given compound, but not everyone will agree on which
atoms should be connected by bonds.  Graph theory does not solve this problem,
but it does help to distinguish between the chemical properties of the atoms
and bonds on the one hand and the mathematical properties of the graph on the
other.  Chemical properties can be assigned to the atoms and bonds in a graph.
This requires a chemical interpretation of the structure, but once these
properties have been assigned, graph theory can be used to explore the
different possible graphs and their properties.  The use of graph theory
separates out the intrinsically chemical concepts from the graph theoretical
description that can be manipulated mathematically.
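
    A minimal sketch in Python of the separation described above, offered
only as an illustration.  The class names Atom and BondGraph and the water
example are hypothetical and are not part of CIF, CML or GTML.  The chemical
information lives in the attributes attached to the vertices and edges, while
a query such as neighbours() uses only the graph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Atom:
    """A vertex; `element` is a chemical attribute attached to it."""
    label: str
    element: str


@dataclass
class BondGraph:
    """Hypothetical container: vertices are atoms, edges are bonds."""
    atoms: dict = field(default_factory=dict)  # label -> Atom
    bonds: dict = field(default_factory=dict)  # frozenset({a, b}) -> strength

    def add_atom(self, label, element):
        self.atoms[label] = Atom(label, element)

    def add_bond(self, a, b, strength):
        # Which pairs are bonded, and how strongly, is the chemical
        # interpretation supplied by the author, not by graph theory.
        if a == b or a not in self.atoms or b not in self.atoms:
            raise ValueError("a bond needs two distinct known atoms")
        self.bonds[frozenset((a, b))] = strength

    def neighbours(self, label):
        # Purely graph-theoretical: no chemistry is consulted here.
        return sorted(
            other
            for pair in self.bonds
            if label in pair
            for other in pair
            if other != label
        )


water = BondGraph()
for label, element in [("O1", "O"), ("H1", "H"), ("H2", "H")]:
    water.add_atom(label, element)
water.add_bond("O1", "H1", 1.0)  # illustrative strengths, not measured values
water.add_bond("O1", "H2", 1.0)

print(water.neighbours("O1"))
```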

    The bond graph represents a chemical interpretation of the 3D geometry,
i.e., the geometry tells us which atoms are neighbours but not where the bonds
are to be found.  The bonds are assigned by applying various rules relating to
the chemical properties of the atoms.  However, not all bonds are of equal
value; some are clearly stronger than others, i.e., they survive many of the
physical and chemical treatments we can subject them to, such as melting or
dissolution in a solvent.  Weaker bonds do not survive this treatment.  We can
thus imagine that each of the possible edges in the graph is associated with a
number representing its 'strength'.  (The 'strength' would be zero for the
edges that could not possibly represent a chemical bond).  I will deliberately
avoid defining in detail what I mean by 'strength', but qualitatively it
represents the number of electron pairs associated with the bond and it obeys
(by definition in some treatments) the rule that the sum of the bond
'strengths' received by any atom is equal to the number of valence electrons
the atom uses for bonding.  (Implicit in this description is the notion that a
bond 'strength' is not restricted to integer values).  For over a century
chemists have struggled to find a tight quantitative definition for bond
'strength' under such names as bond order, bond number, bond valence,
electrostatic bond strength, etc., each definition trying to capture the
concept in numeric form.  All of these definitions are incomplete in one way
or another, but in principle they allow us to order the bonds from strongest
to weakest.  Assuming that we can at least determine this order even if we
cannot assign actual numbers to the bond 'strength', our problem then reduces
to the question of where to place the cut-off between the bonds that are shown
on the graph and those that are omitted.
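
    The sum rule and the ordering of bonds can be illustrated with a small
Python sketch.  The bond 'strengths' below are invented round numbers for a
single NaHCO3 formula unit, chosen only so that the sums come out exactly;
they are not measured values, and folding the network of the crystal into one
finite unit is itself a simplifying assumption.  The names valence_sums and
ranked are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Valence electrons each atom is assumed to use for bonding.
valences = {"Na": 1, "H": 1, "C": 4, "O1": 2, "O2": 2, "O3": 2}

# Invented, non-integer bond 'strengths' (exact fractions avoid rounding).
bonds = {
    ("C", "O1"): F(6, 5),
    ("C", "O2"): F(7, 5),
    ("C", "O3"): F(7, 5),
    ("H", "O1"): F(4, 5),
    ("H", "O2"): F(1, 5),
    ("Na", "O2"): F(2, 5),
    ("Na", "O3"): F(3, 5),
}


def valence_sums(bonds):
    """Sum the 'strengths' of the bonds received by each atom."""
    sums = defaultdict(F)
    for (a, b), strength in bonds.items():
        sums[a] += strength
        sums[b] += strength
    return dict(sums)


# The sum rule: each atom's bond strengths add up to its valence.
for atom, total in valence_sums(bonds).items():
    print(atom, total, total == valences[atom])

# Bonds ordered from strongest to weakest; a cut-off is a position in this list.
ranked = sorted(bonds.items(), key=lambda item: item[1], reverse=True)
for (a, b), strength in ranked:
    print(f"{a}-{b}: {float(strength):.2f}")
```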

    The fashionable supramolecule provides an interesting example.  At one
level it can be considered as a collection of individual molecules held
together by weak intermolecular bonds.  In the bond graph generated at this
level, each ligand and each coordinated metal atom would be treated as a
separate molecule.  If however one is interested in the properties of the
supramolecule, e.g., how it 'self-assembles' (something that NaCl and other
inorganic compounds have been doing since long before 'self-assembly' became
the word of the month), then the cut-off would be set much lower and more
bonds would be included in the graph.  The result would be an infinite graph
describing the infinite supramolecule (cf. diamond and graphite).  Lest you
worry how one deals with infinite graphs, I will only point out here that in
practice this is not a problem.

    The creation of bond graphs at different levels can be explored by first
generating a graph that includes only the strongest bonds, and then step by
step adding progressively weaker bonds.  This would create a series of graphs
at different cut-off levels.  A graph containing all the bonds that could
possibly be drawn for a given 3-dimensional structure would, of necessity, be
infinite, but if only the strongest bonds were included, the graph of most
compounds would consist of several finite disconnected subgraphs, some
conceivably containing only one atom, e.g., [Na] [H] [CO3].  Adding bonds that
are a little weaker would reduce the number of disconnected subgraphs, but
they might still all be finite, e.g., [Na] [HCO3].  Adding yet weaker bonds
would ultimately result in a graph that was infinite.  One interesting graph
that can be uniquely defined is the graph obtained by progressively removing
weaker bonds from the complete graph until the graph is no longer infinite.
This might be called the maximal finite graph.  It is not clear whether this
would be a useful graph, but it would be unique, and might represent a useful
boundary.
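
    The stepwise construction described above can be sketched in Python as
follows.  This is only an illustration under stated assumptions: the bond
'strengths' are invented numbers for one NaHCO3 formula unit, the function
name components is hypothetical, and a finite toy model cannot show the final
step in which the graph of the real crystal becomes infinite.

```python
atoms = ["Na", "H", "C", "O1", "O2", "O3"]

# Invented bond 'strengths' for illustration only.
bonds = {
    ("C", "O1"): 1.2,
    ("C", "O2"): 1.4,
    ("C", "O3"): 1.4,
    ("H", "O1"): 0.8,
    ("H", "O2"): 0.2,
    ("Na", "O2"): 0.4,
    ("Na", "O3"): 0.6,
}


def components(atoms, bonds, cutoff):
    """Disconnected subgraphs when only bonds with strength >= cutoff are kept."""
    parent = {atom: atom for atom in atoms}

    def find(atom):
        # Follow parent links to the representative, compressing the path.
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]
            atom = parent[atom]
        return atom

    for (a, b), strength in bonds.items():
        if strength >= cutoff:
            parent[find(a)] = find(b)  # merge the two subgraphs

    groups = {}
    for atom in atoms:
        groups.setdefault(find(atom), []).append(atom)
    return sorted(sorted(group) for group in groups.values())


# Lower the cut-off step by step and watch the subgraphs merge.
for cutoff in (1.0, 0.7, 0.1):
    print(cutoff, components(atoms, bonds, cutoff))
```

With these invented numbers the three cut-offs give first the carbonate group
with separate Na and H, then a hydrogen carbonate group with separate Na, and
finally a single connected unit.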

    The distinction between organic and inorganic compounds is treated as a
chemical property which is separate from the treatment of the graph.  In the
bond graph the difference between the two appears in the spectrum of
'strengths' assigned to the bonds: graphs of organic compounds have a clear
gap between the strong and the weak bonds, a gap which is missing for the
inorganic compounds.  However, there is no difference in the way the graph is
treated: in both cases bonds are included down to some arbitrarily chosen
cut-off.
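
    One way to make the 'gap' concrete, assuming the bond 'strengths' are
available as numbers, is to sort them and look for the largest drop between
consecutive values.  The two lists below are invented for illustration, and
largest_gap is a hypothetical helper rather than an established criterion.

```python
def largest_gap(strengths):
    """Return (gap, upper, lower) for the biggest drop in the sorted spectrum."""
    ordered = sorted(strengths, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two bond strengths")
    drops = [
        (ordered[i] - ordered[i + 1], ordered[i], ordered[i + 1])
        for i in range(len(ordered) - 1)
    ]
    return max(drops)


# Invented spectra: strong covalent bonds plus weak contacts, versus a
# near-continuous spread of strengths.
organic_like = [1.5, 1.0, 1.0, 1.0, 0.05, 0.02]
inorganic_like = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for name, spectrum in [("organic-like", organic_like),
                       ("inorganic-like", inorganic_like)]:
    gap, upper, lower = largest_gap(spectrum)
    print(f"{name}: largest gap {gap:.2f} between {upper} and {lower}")
```

In the first list the largest drop is wide and suggests a natural cut-off; in
the second every drop is about the same size, so any cut-off is arbitrary.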

    Before we try to define CIF items for particular chemical concepts, we
need to have a consensus about the definition of a molecule.  I have made some
suggestions above, and I would be interested in people's comments.  Is graph
theory a fruitful way to go or should we take a different approach?  What are
the problems we might encounter using the approach described above?

    Please circulate your views by replying to the whole discussion group
(use the reply to: option) and let us see if we can develop a consensus.

                    David

