IUPAC workshop on XML and IChI
- To: <coredmg@iucr.org>, Comcifs List Server <comcifs@iucr.org>,coreCIFchem <corecifchem@iucr.org>, <phase-identifiers@iucr.org>
- Subject: IUPAC workshop on XML and IChI
- From: "I. David Brown" <idbrown@mcmail.cis.mcmaster.ca>
- Date: Wed, 19 Nov 2003 16:02:24 -0500 (EST)
Dear Colleague,

I have just returned from a workshop dealing with chemistry XML and the IUPAC Chemical Identifier (IChI). I have appended below a report on those aspects of the workshop that are likely to be of interest to members of IUCr committees.

I apologize to those of you who receive more than one copy of this email. I am circulating it to four groups who might be interested, and several of you will be members of more than one of these. I will be following up this report with further suggestions for discussion by the coreCIFchem and phaseID groups, but those of you who belong to other groups may find this report interesting.

Best wishes
David

Keep scrolling - More below

*****************************************************
Dr.I.David Brown, Professor Emeritus
Brockhouse Institute for Materials Research,
McMaster University, Hamilton, Ontario, Canada
Tel: 1-(905)-525-9140 ext 24710
Fax: 1-(905)-521-2773
idbrown@mcmaster.ca
*****************************************************

Report on the workshop on Chemical XML and the IUPAC Chemical Identifier (IChI) held at NIST 12-14 Nov. 2003.

I.D.Brown

Summary.
--------
There is currently no organization coordinating the XML ontologies being developed for the various branches of chemistry, even though several chemical specialties are developing detailed ontologies in their own disciplines. However, a project to develop an IUPAC Chemical Identifier (IChI), in the form of an electronic character string that uniquely identifies a compound, is well advanced and shows promise as a search key.

Introduction
------------
IUPAC has appointed a Committee on Printed and Electronic Publication (CPEP) which in turn has a subcommittee on Electronic Data Standards (EDS). The latter has two projects that were the subject of a workshop held at NIST, Gaithersburg in November 2003. The first is the development of a Chemical XML dictionary and the second the development of an IUPAC Chemical Identifier (IChI).
This document reports on this workshop for the benefit of interested groups in the International Union of Crystallography.

Chemical XML
------------
Although the EDS would appear to be the IUPAC equivalent of COMCIFS, the two committees have very different mandates. The primary role of EDS is to define XML schema or dictionaries that would allow IUPAC to produce web versions of its Gold Book (definitions of chemical terms) and Green Book (mathematical relations used in analytical chemistry). This is equivalent to producing web versions of International Tables for Crystallography. EDS is therefore interested in reproducing text, mathematical equations and chemical structure diagrams on the web using XML versions of the printed Gold and Green Books. EDS is explicitly not interested in recommending or coordinating (or believes it does not have the authority to recommend or coordinate) electronic ontologies for chemistry as a whole, including defining such items as chemical formulae that might be expected to appear in many different chemistry XML schema. In its more limited role, EDS is proposing to express mathematical formulae using the existing MathML (a general mark-up language prepared by mathematicians), units using the similarly general UnitsML, and chemical diagrams in a form that would allow them to be printed using SVG.

Even though the scope of EDS is limited, the workshop received reports from several groups developing ontologies for specialist branches of chemistry (including the report on CIF that I gave). There was a general appreciation that the most important task is to define the ontologies (the contents of the dictionaries) and that one should not worry too much about the language in which they are expressed. XML is the current flavour of the year, but XML might well be superseded by a different (better?) system in five or ten years' time. A well designed ontology could easily migrate from one delivery system to another.

Among the 8 to 10 groups working on specialized chemical ontologies in the form of XML schema, ThermoML and SpectroML stood out as being well advanced. Their schema (schemae?) are more directly comparable with CIF, in that they are designed to capture the results of experimental measurements in their respective disciplines. ThermoML has been adopted by five of the leading thermodynamic journals (representing three different publishers), but rather than requiring authors to submit papers in ThermoML, the journals will continue to accept papers in traditional formats (90% are submitted in MSWord). The mark-up into XML will be carried out by the publishers and XML versions of the results will be submitted to a thermodynamic database. Another group is producing a schema (a schemum?) for analytical measurements (AniML) and a group in Prague is working on a Mark-up Language for chemical structures based on Graph Theory (GTML). Most of these projects are closely related to particular experimental techniques where the concepts are specialized. There is no group, either existing or proposed, that is charged with coordinating these efforts to ensure that the definitions do not conflict.

From the crystallographer's point of view the most interesting project is Peter Murray-Rust's Chemical Mark-up Language (CML) which aims to capture the chemical structures that are at the heart of any description of chemistry, specifically organic chemistry. Peter has been working on this project for many years and his schema are well thought out and tested using software he has written. A number of publishers and the European Patent Office have expressed interest in CML, and Peter has been working closely with the chemical modelling community to develop a version of CML for them.
The schema in CML are very general, specifying only that molecules are composed of atoms which are linked by bonds, but molecules, atoms and bonds are not defined, leaving it to the user to decide which atoms are bonded and therefore which atoms constitute a molecule. One can see the reasons for such an open-ended approach, but the philosophy is very different from that adopted by CIF. CML is not likely to give us much guidance as we extend CIF to include chemical (as opposed to crystallographic) concepts. However, Peter has written programs that will convert DDL1 CIF to cifML and vice versa, cifML being a version of XML that explicitly employs CIF datanames and ontologies.

One attractive feature of XML that we might consider incorporating into CIF is the ability to avoid namespace collisions. Two schema (dictionaries), foo and fee, that both use the name 'bond_order', though with different definitions, would give rise to items with names like foo:bond_order and fee:bond_order where 'foo' and 'fee' are mapped to web URLs where the respective schema can be found. This allows two XML files based on different schema to be concatenated, but it does not provide precise definitions for the values of 'bond_order' in the different schema. They may be defined the same way or they may not. A search across databases would retrieve both kinds of bond_orders, but a computer would have to assume that the quantities are unrelated. The resulting different dialects of chemistry would make it difficult to synthesize information across different databases.

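To make the namespace point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The two namespace URLs, the 'record' wrapper element and the bond-order values are invented for illustration; they are not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URLs; a real schema would publish its own.
FOO = "http://example.org/schema/foo"
FEE = "http://example.org/schema/fee"

# One file carrying bond_order items from both (hypothetical) schema.
doc = f"""
<record xmlns:foo="{FOO}" xmlns:fee="{FEE}">
  <foo:bond_order>2</foo:bond_order>
  <fee:bond_order>1.5</fee:bond_order>
</record>
""".strip()

root = ET.fromstring(doc)


def bond_orders(root):
    """Return {namespace URL: text} for every element named bond_order."""
    found = {}
    for elem in root.iter():
        # ElementTree stores qualified names as "{namespace}localname".
        if elem.tag.startswith("{"):
            uri, local = elem.tag[1:].split("}", 1)
            if local == "bond_order":
                found[uri] = elem.text
    return found


for uri, value in bond_orders(root).items():
    print(uri, value)
```

The two items coexist without colliding, and a search on the local name finds both, but nothing in the file says whether the two values measure the same quantity. That has to come from the definitions in the schema themselves.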
When I asked the EDS where one could find IUPAC recommendations for an electronic coding of widely used chemical concepts such as the chemical formulae, everybody in the room started pointing to someone else (the scene was reminiscent of Alice in Wonderland!), but the eventual consensus was that IUPAC has no mechanism for making recommendations at this level of detail, because if it did, the recommendations would probably be ignored by the chemical community. This may have been the experience with IUPAC recommendations in the past, but a consortium of groups devising chemMLs would have a strong motivation to adopt compatible definitions for common chemical concepts. At present it would appear that, apart from the sum_chemical_formula for which rules already exist, it is unlikely that the various chemMLs will adopt compatible definitions of key chemical concepts. The feeling among the members of EDS is that it will be time enough to resolve these conflicts when they arise!

IChI (IUPAC Chemical Identifier)
--------------------------------
This inability to coordinate ontologies is perhaps why EDS set up the IUPAC Chemical Identifier (IChI) project, which aims to recommend an identifier that would be able to locate the same compound in different databases. This project was the subject of the second half of the workshop.

When the IChI group was set up, they approached the IUCr Nomenclature Commission for advice on how to identify different crystalline phases. The chair of the Commission at the time, S.C.Abrahams, asked me to set up a working group to make recommendations that could be passed back to IChI. Our working group, acting independently of IChI, has discussed a number of possibilities which, fortunately, should be easy to incorporate into the recommended IChI identifier.

A proposal for the first version of the identifier covering mostly organic compounds is nearly ready, and the IChI working group has given thought to how a later version might cover a wider range of compounds.

The identifier is built up of a number of layers. The top (first) layer contains only the chemical formula and will, for many compounds, be sufficient to identify the compound uniquely. The second layer includes the chemical structure, i.e. a normalized description of the connectivity. The contents of this layer are determined by computer algorithms from a connectivity diagram supplied by the author. Insofar as different authors may disagree on which atoms are bonded, the same compound may end up with different identifiers, but this layer of the identifier is made as robust as possible by ignoring hydrogen atoms, bond orders and charge assignments. Hydrogen atoms are introduced at the third layer, which can be ignored if one is not interested in a particular tautomer. Still lower levels contain information about stereocenters and isotopes, and are included only if required. Searches can be deep, returning only compounds with the same stereochemistry and isotopic content, or they can be restricted to higher levels if tautomers, stereochemistry and isotopes are not of interest. Identification of the crystallographic phase by including, e.g., the space group number, can easily be added as yet a further layer.

Version 1 of IChI has impressed those who have been testing it. It works well, as might be expected, for organic compounds, but also for many inorganic and metal-organic compounds if the bonds to the metal atoms (or cations) are not included in the second layer. They can be introduced in a lower layer if needed, e.g., to distinguish between isomers with different metal coordination.

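The idea of searching at different depths can be sketched in a few lines of Python. This is only an illustration of the layering principle described above. The layer strings are invented placeholders and do not follow the actual IChI syntax, and the function name same_compound is hypothetical.

```python
from typing import Sequence

# Layer order as described in the report. The strings stored in each
# layer below are placeholders, NOT real IChI syntax.
LAYERS = ("formula", "connectivity", "hydrogens", "stereo", "isotopes")


def same_compound(a: Sequence[str], b: Sequence[str], depth: int) -> bool:
    """Compare two layered identifiers using only the first `depth` layers."""
    return tuple(a[:depth]) == tuple(b[:depth])


# Two tautomers: same formula, same heavy-atom connectivity (C-C-O),
# but the hydrogens sit on different atoms.
acetaldehyde = ("C2H4O", "conn:C1-C2-O3", "H:C1=3,C2=1,O3=0")
vinyl_alcohol = ("C2H4O", "conn:C1-C2-O3", "H:C1=2,C2=1,O3=1")

# Shallow search (formula + connectivity) treats the tautomers as one compound.
print(same_compound(acetaldehyde, vinyl_alcohol, depth=2))
# Deeper search (adding the hydrogen layer) tells them apart.
print(same_compound(acetaldehyde, vinyl_alcohol, depth=3))
```

A crystallographic phase layer (for example a space group number) could be appended to each tuple in the same way and would take part in the comparison only when the search depth reaches it.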
At present the identifier is not designed to describe polymeric structures, clusters or disordered structures, but the IChI group is interested in including these features in future versions. We will probably wish to incorporate IChI into CIF when the final standard is approved.

I.D.Brown
2003-11-19