CHEMISTRY (was Re: Survey of available CIF software and
- To: Multiple recipients of list <comcifs-l@iucr.org>
- Subject: CHEMISTRY (was Re: Survey of available CIF software and
- From: Peter Murray-Rust <Peter.Murray-rust@nottingham.ac.uk>
- Date: Fri, 1 Dec 2000 11:13:11 GMT
I have been working on how CIF and CML can interoperate and benefit from each other - the synergy looks very good. There are a few specific comments below about chemistry.

**I would be very grateful for sample CIFs that support chemistry, see below**

At 14:22 20/09/00 +0100, Brian McMahon wrote:
> Two-dimensional chemical diagrams. CCDC and Acta have requirements for
> 2D diagrams.

I would suggest that "chemical diagrams" is replaced by "structural formula", "connection table" and "2D coordinates". CIF core provides support for all these concepts but I suspect they are drastically underused, to the detriment of everyone.

The purpose of a structural formula is to communicate to humans **and machines** what the compound(s) actually *are*! This is, of course, not trivial and at present there is a surprising amount of implicit human perception required in many diagrams. Work with CML has shown that some chemistry can *only* be represented with a graphical component but the vast majority of compounds can be represented by some or all of:
- connection tables
- 3-D coordinates

Note, of course, that neither is formally deducible from the other by an algorithm - charges, formal bond orders, etc. are matters of human opinion and hopefully convention. 3-D coordinates do not normally represent fluxional and similar molecules completely.

Does Acta currently accept (a) 2D diagrams (b) connection tables in CIF-based papers? (Please excuse ignorance here :-) If so, do they use pixel-based representations or use 2-D coordinates in _chemical_ to draw the diagrams?

> There are various possible avenues of approach. (i) One is to
> embed a graphics file (TIFF or PostScript) in a text file in the CIF.
> This would require an embedding convention, similar to the imgCIF
> MIME convention; software to de-embed and decode the graphic;
> software to render the resulting TIFF or PS image. Substantial effort,
> and the result is just a picture.
I argue very strongly against the continuing use of pixel-based diagrams. I have examples where a diagram is embedded in HTML, rescaled by the browser and **BONDS DISAPPEAR**. This happens when a horizontal or vertical bond is 1 pixel wide and falls on a non-integer coordinate. There are also many diagrams where it is impossible to be sure whether a set of pixels is (say) a 4, a + or something else.

> (ii) Another way is to embed the output
> file from common drawing packages such as ChemDraw and ISISDraw. As
> before, one needs to de-embed the file, decode it, render it in the
> style of the original package, and then parse it for chemical
> connectivity information (which is what is really wanted). The payoff
> is that the connectivity is read, but the software engineering is
> substantial and at the mercy of several proprietary formats.

I have several examples of such files that I cannot interpret. ChemDraw has some binary formats and these are unreadable without the software.

> (iii) One could use the CIF (or, better, MIF) connectivity datanames.

I support this. CIF has got it all present - we only need to use it.

> Ideally one would persuade the major manufacturers of such software
> to provide CIF/MIF as an export format from their packages. It may
> still be necessary to embed a graphics file for high-resolution
> publication, however. (iv) The other approach to connectivity is to
> infer chemical bond types from the 3D image, and allow the user to
> edit the 3D diagram interactively, trapping the result in CIF/MIF
> fields. This captures the chemical information, but loses the
> aesthetics of the commercial graphic presentation. It also alienates
> chemist authors who are familiar with the existing software
> packages. Of these options, (iii) looks best, but depends on
> persuading the manufacturers... usual story.

There is an important principle here. Any bond types, charges, etc. depend on conventions (ontologies).
Unless the source of these is documented, there is considerable opportunity for confusion. Thus (I believe) MDL Molfiles use 4 for aromatic whereas other packages use this for (the rare) quadruple bond. CCDC use -5 for aromatic, etc. Therefore any representation requires either:
- agreement to use a single convention
- careful recording of the convention.

CML uses both approaches - it has a small core ontology but can support the use of any other convention. I suspect that current CIF terminology, extended with some MIF terminology (e.g. for stereochemistry), would cater for 99% of "small" crystal structures.

There is thus a possible option (v): convert legacy formats to the appropriate CIF datanames (this is possible within Core CIF without breaking it). These can be extended if appropriate with either MIF datanames or IUPAC terms. CIF/MIF/IUPAC is probably strong enough to hold most communal concepts. The re-export to legacy formats is not always possible because these are not extensible (thus "PDB" and MDL Molfile do not support much of what is in CIF). I am developing support for this approach by using an internal DOM to hold the CIF enhanced by CMLDOM.

>[...]
>Chemistry
>---------
>As mentioned above in my lengthy discourse on the CCDC editor, it would be
>beneficial to have 2D chemical structural information output in MIF format
>by standard commercial software packages. Perhaps relevant to this is an
>IUPAC initiative to generate identifiers for chemical compounds that is
>derivable from the compound's connection table. Perhaps also of some
>relevance to CIF matters is IUPAC's official endorsement of CML (chemical
>markup language) as an information interchange mechanism.

CML interoperates very well with CIF because both are extensible and CIF provides a top-class dictionary approach. Thus I do not reinvent ontologies - I re-use existing ones. Thus CML uses CIF _cell_ concepts to hold cell data.
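As a rough illustration of option (v), and of why the source convention has to travel with the data, the idea might be sketched in Python as below. This is illustrative only: the function `to_cif_bond_loop`, the `CONVENTIONS` table and the convention label `order-as-integer` are hypothetical names, and the CIF codes (`sing`, `doub`, `trip`, `quad`, `arom`) are assumed to match the `_chemical_conn_bond_type` enumeration in the core dictionary, which should be checked before use.

```python
# Hypothetical sketch: map integer bond codes from a *named* legacy
# convention onto CIF core _chemical_conn_bond_type codes.

# Assumed lookup tables. MDL molfile codes 1-4 are single, double, triple,
# aromatic; the second entry stands in for any package that treats 4 as a
# quadruple bond. Neither table is taken from the original message.
CONVENTIONS = {
    "MDL-molfile": {1: "sing", 2: "doub", 3: "trip", 4: "arom"},
    "order-as-integer": {1: "sing", 2: "doub", 3: "trip", 4: "quad"},
}


def to_cif_bond_loop(bonds, convention):
    """Return a CIF loop_ for a list of (atom_1, atom_2, code) tuples."""
    table = CONVENTIONS[convention]  # KeyError if the convention is unknown
    lines = [
        "loop_",
        "_chemical_conn_bond_atom_1",
        "_chemical_conn_bond_atom_2",
        "_chemical_conn_bond_type",
    ]
    for atom_1, atom_2, code in bonds:
        if code not in table:
            # Refuse to guess: an undocumented code is exactly the ambiguity
            # this conversion is meant to avoid.
            raise ValueError(f"code {code!r} is undefined in {convention!r}")
        lines.append(f"{atom_1} {atom_2} {table[code]}")
    return "\n".join(lines)


bonds = [(1, 2, 4), (2, 3, 1)]
print(to_cif_bond_loop(bonds, "MDL-molfile"))
print(to_cif_bond_loop(bonds, "order-as-integer"))
```

The same input code 4 comes out as `arom` under one convention and `quad` under the other, so the conversion is only safe when the convention name is recorded alongside the numbers.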
CML will interoperate extremely closely with the IUPAC initiative and help to separate syntax from semantics and ontology. I am therefore developing CIF2CML and vice versa.

XML provides high-quality graphics tools (SVG) which make it possible to provide true vector-based graphics *with semantic and ontological enhancement*. Thus it's possible to create smart diagrams which can be clicked and carry the whole _chemical_ information underneath (see http://www.xmlcml.org for examples).

CML has now been adopted as a central part of one of the submissions to OMG for "small molecules". This means that with the mmCIF-based submission we have a very strong crystallographic input into the formal representation of both small and large molecular objects. This will make it much easier to use standard tools.

To do this I would be grateful for some sample CIFs which contain chemical connectivity, and also for any which contain 2D coordinates with or without 3D coordinates.

P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk