Re: Survey of available CIF software and request for wish list
- To: Multiple recipients of list <comcifs-l@iucr.org>
- Subject: Re: Survey of available CIF software and request for wish list
- From: Peter Murray-Rust <pazpmr@unix.ccc.nottingham.ac.uk>
- Date: Mon, 2 Oct 2000 12:56:50 +0100 (BST)
At 14:22 20/09/00 +0100, Brian McMahon wrote:
>There has been a private discussion among some members over the
>last few days about how to direct the development of software to
>advance the use of CIF. I'd like to take that discussion onto the whole
>COMCIFS list for two reasons : (1) to survey what is needed, and
>(2) to canvass opinions on how to secure development effort and funding. It
>will be best to split these two threads, so I'll start here by trying to
>categorise the types of tools we need to consider, and reviewing what I know
>about the ones that exist.

I have spent a lot of my life writing tools for CIF and am fully committed
to the CIF effort and process. I frequently tell people in other disciplines
that I think CIF is a major achievement in scientific informatics. [So, if
any of the remarks below seem to suggest anything else, I assure you that
they do not, but warn you in advance that we need to look at other
technologies as well as CIF. But CIF works and Chester/diffractometers will
continue to speak in CIF for some time yet]

Firstly, writing a protocol like CIF implies a huge amount of work that is
invisible at the start and is gradually catching up with us. I have learnt
this the hard way (!) and have found that only when I became heavily engaged
in the XML effort did I realise the full extent of what was required in
managing structured documents (SDs). CIF, and more so STAR, are structured
documents and require a *large* amount of software to process them properly.
This software is not easy to write, is tedious, does not bring glamorous
rewards and cannot normally justify research grants. So if the CIF community
intends to develop all its tools among itself I doubt whether there is the
resource and commitment, especially for quality control.

My experience is taken from XML. I have spent the last 3-4 years heavily
involved in XML, including development of the language. This includes
Chemical Markup Language (CML) for chemistry.
This does NOT mean that I have deserted the CIF camp but I can speak from
experience about the issues involved.

Essentially XML and STAR (and probably the mmCIF syntax) are of the same
complexity. XML and STAR are specifications (metalanguages) for creating
domain-specific languages - XML is used to define XHTML, MathML, CML and so
on; STAR is used to define CIF, mmCIF, pdCIF and so on. It is natural to
define other support processes using the language itself, so XML has
XMLSchemas written in XML to control the structure of the languages; CIF has
DDLs (written in CIF). Almost weekly I get a feeling of deja vu in XML
seeing something I have already tried to tackle in CIF/STAR and realising
why I found it difficult! My analysis is that *semantically*, XML and STAR
are virtually identical.

The reality of XML is that the community has put in a huge amount of effort
to prove the language and an even larger amount to build tools (many being
open source, including mine). CIF/STAR will have to go through the same
steps - there is no real alternative unless the scope and power of CIF/STAR
are dumbed down. For example, I have more or less finished writing a
Document Object Model (DOM) for CML - I didn't realise I would have to do
this when I developed CML - now I realise it is inevitable. This will be
required for CIF.

In essence there are these conservation laws:
 - you cannot hide complexity, you can only move it around
 - for everything you define in a specification, someone has to write code
 - it is far easier to write a specification than to implement it

CIF/STAR is *unavoidably* complex, and also requires a great deal of code to
support it (** if it is to be processed by machines **). If we were writing
software on an industrial basis we would be talking of 10+ years' work (and
only that low because we already have the experience from XML of what needs
to be done).

The technical options for software to support CIF/STAR are:
 - continue with CIF-specific software and commit much more resource than
   we currently do
 - re-use non-CIF tools already written in other contexts
 - re-define what we wish to do using CIF and what using other
   representations

I believe that only the last two are feasible.

My experience is that the effort involved in implementing a protocol
increases in the order:
 1 a "paper" specification
 2 tools to write documents in the specification
 3 tools to read documents in the specification (much harder if the spec is
   flexible, like STAR or XML)
 4 tools to edit and transform documents (you have to have the equivalent of
   DOM or XSLT)

I have been through all of these with CML and reached about 3.5. I would
only have got to 2.5 if there had not already been a community of 1000's of
XML developers and masses of free, high quality software (e.g. from James
Clark). (This omits all the discipline-specific stuff like checking bond
orders, cell parameters, etc. which is where our most valuable efforts
should be put).

Do not underestimate the problems that many people find with SD technology.
The W3C has developed XML schemas (very similar to DDLs) and many people -
including software developers - are questioning whether they are too
complex. There is real doubt as to whether some of the XML constructs will
be easy enough for general implementation.

My own approach to CIF - which is the only one that I personally can write
code for - is to transform it into XML and use XML tools. This may seem like
heresy, but ... If I wish to use CIF/STAR syntax then I write DOM and
XSL-based converters in both directions. This does NOT mean I abandon the
CIF effort - quite the reverse. In CML I specifically support the use of
ontologies (dictionaries) from IU's and other learned bodies and I put the
CIF dictionaries at the top. But I have to convert them to XML to make the
reading, editing, display, validation, etc. possible.
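
[Illustration: a minimal Python sketch of the "convert in both directions"
idea described above. It is not the converter referred to in this message.
The XML element names (cif, datablock, item) and the function names are
hypothetical, chosen only for this example, and the sketch assumes the
simplest possible input: data_ headers and single-line "_name value" items.
It does not handle loop_, save_ frames, semicolon-delimited text fields or
trailing comments.]

```python
import xml.etree.ElementTree as ET


def cif_to_xml(cif_text):
    """Turn simple CIF tag-value pairs into an XML tree (hypothetical layout)."""
    root = ET.Element("cif")
    block = None
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        if line.lower().startswith("data_"):
            block = ET.SubElement(root, "datablock", {"id": line[5:]})
            continue
        if not line.startswith("_") or block is None:
            # Assumption: anything else (loop_, text fields, ...) is out of scope.
            raise ValueError("unsupported CIF construct: " + line)
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("data name without a value: " + line)
        name, value = parts
        # Strip one pair of matching quote delimiters, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        item = ET.SubElement(block, "item", {"name": name})
        item.text = value
    return root


def xml_to_cif(root):
    """Write the hypothetical XML layout back out as simple CIF."""
    lines = []
    for block in root.findall("datablock"):
        lines.append("data_" + block.get("id"))
        for item in block.findall("item"):
            value = item.text or ""
            # Quote values that are empty or contain whitespace.
            if value == "" or any(ch.isspace() for ch in value):
                value = "'" + value + "'"
            lines.append(item.get("name") + " " + value)
    return "\n".join(lines) + "\n"


sample = "data_test\n_cell_length_a 5.43\n_chemical_name_common 'sodium chloride'\n"
tree = cif_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
print(xml_to_cif(tree))
```

[Once the data are in an XML tree, generic XML tools (validators, XSLT
processors, editors) can be applied to them. Even this toy version shows why
reading is harder than writing: almost all of the omitted cases are on the
input side.]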

If you have stayed so far :-), I'll comment on specifics below.

>=== Executive Summary ===
>There is a shortage of basic tools for handling syntax issues and dictionary
>validation checks. The existing ones are often incomplete or not fully
>robust. In particular, support for fashionable scripting languages (Perl,
>Tcl, Python) is poor. The needs of the small-molecule crystallographer are
>(or soon will be) reasonably well met, but uptake of mmCIF and imgCIF are
>still weak. Even with small-molecule applications much would be gained by
>working in an environment that can interface easily with existing
>lexer/parser tools, graphical widget sets and object storage conventions.
>=========================

There must be an object storage convention. There are two approaches to
this - XML effectively defines one in the DOM, and OMG/CORBA define one in
IDL. My impression is that XML specs are easier for most people to
understand but that IDL is more powerful. At the limit they can both be
defined in UML (and I have started to do this for CML). UML allows other
tools to automatically generate code, specs, etc. though there is still a
lot of manual work to be done.

>A major problem with CIF is its breadth. Unlike rendering a graphics image,
>which is well defined (so TIFF, GIF, JPEG, PNG etc are addressing the same
>problem), CIF (and friends) includes raw and processed data, connectivity
>maps, 3d coordinate sets, symmetry operations, discursive text etc etc, and
>is used to describe inorganic, molecular, macromolecular and incommensurate
>structures at least - there are already many other dictionaries in the
>pipeline.

This is a major task. XML addresses it through the namespace mechanism and
assumes that different domains will develop protocols in parallel. It also
presupposes that there are high-quality tools for processing all of the
components and a means for assembling them and managing the ensemble.
I have a list of ca 12 "media-types" I have to support in CML - and these
basically cover the range of STM documents in general:
  text, hypermedia, image, vectorgraphics, tables, units, math,
  bibliography, terminology, multimedia?, metadata, molecules
There are XML solutions for "most". CIF should not try to address these
problems independently.

I shall add comments from my CIF and XML experience below. Please treat
these as constructive, though taken altogether they may seem negative.

>So we need to envisage domain-specific applications; but we must also provide
>a core of utilities that can be used in any domain. Let's begin by thinking
>about these application-independent tools. What can we identify as essential
>or even desirable?
>
>
>1. PURE SYNTAX HANDLERS
>-----------------------
>Tools that handle CIF tokens without any interpretation, and so are
>universal across all domains.
>
>a. Standalone tools
> Function            Description                              Exists?
> --------            -----------                              -------
> Syntax checker      Returns result code if there is a        vcif (C)
>                     definite syntax error, and perhaps
>                     a human-readable error message

Agreed. Equivalent to well-formed XML checkers.

> Intelligent syntax  Indicates (probably) where the error     No
> checker             really occurred

Apart from syntax (above) checking should involve:
 - document VS dictionary (equivalent to XML validation). not trivial
 - dictionary VS DDL. ditto but additional effort
 - DDL vs DDL

> Prettifier          Enforces line lengths, aligns loop       cif2cif (Fortran)
>                     elements

equivalent to simple XSLT transforms

> Stream editor       Allows CIF elements to be added,         No
>                     deleted by command-line instruction
> Rearranger          Modifies order of existing elements      quasar (f77)
>                                                              cif2cif (f77)
> Interrogator        Extracts CIF data meeting specified      starbase (C)
>                     criteria

All these are equivalent to XSLT operations. For example, sorting a CIF is
not trivial.

> Tokeniser           Reads CIF and passes individual tokens   cifzinc (C)
>                     to stdout in some normalised meta
>                     representation
> Interactive editor  Enforces correct syntax during           emacs cif.el
>                     on-screen editing                        (Lisp)

This is ultimately equivalent to XML editors with Schema-enforced
validation. These are very complex to write. It may be that the CIF versions
can be simpler because the range of operations is less complex, but
ultimately an editor should check:
 - document structure (what elements can go here?)
 - element content (what generic content can this element have?)
 - domain-specific content. Is this value allowed (e.g. by the dictionary)?

>b. Libraries
>
> CIFtbx (Fortran), CIFLIB (C API), CIFOBJ (C++ class library) are publicly
> available, CCDC has developed a C++ class library within the CIFer
> project, Luca Lutterotti of Trento, Italy has advertised an incomplete
> Java class library on cif-developers. There is also Peter Murray-Rust's
> old C++ library (somewhere).

I can look out everything I have written and make it available! Some may
have decayed. I have some more recent Java stuff for CIF2XML which can act
as a basis for someone to work with.

> So far as I am aware, the Rutgers libraries compile (easily) on only a
> small number of platforms.

I now use Java because it is the lingua franca of the web and the first tool
for XML developers. It also comes with a huge library (e.g. Date, Math,
Collections, etc.) which simplifies a lot of things.

> Is it beneficial to define a standard applications program interface that
> different libraries could converge to?

It is ultimately essential. It is also expensive and boring. I know! This is
effectively what a DOM is. CML DOM has ca 50 classes and over 1000 methods.
CIF DOMs will be smaller (if there is no crystallography involved).
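
[Illustration: a deliberately tiny Python sketch of what a DOM-style object
model for CIF might look like. The class and method names (CifDocument,
DataBlock, set, get, remove, names) are hypothetical and do not correspond
to CIFtbx, CIFLIB, CIFOBJ or the CML DOM. The only CIF rule relied on is
that data names begin with an underscore and are case-insensitive. Loops,
save frames and dictionary validation are left out.]

```python
class DataBlock:
    """One data_ block: an ordered mapping from data names to values."""

    def __init__(self, name):
        self.name = name
        self._items = {}  # plain dict; keeps insertion order

    @staticmethod
    def _key(name):
        if not name.startswith("_"):
            raise ValueError("CIF data names start with '_': " + name)
        return name.lower()  # data names are case-insensitive

    def set(self, name, value):
        self._items[self._key(name)] = value

    def get(self, name, default=None):
        return self._items.get(self._key(name), default)

    def remove(self, name):
        # Deleting an absent item is treated as a no-op in this sketch.
        self._items.pop(self._key(name), None)

    def names(self):
        return list(self._items)


class CifDocument:
    """A whole CIF: a collection of uniquely named data blocks."""

    def __init__(self):
        self._blocks = {}

    def add_block(self, name):
        key = name.lower()
        if key in self._blocks:
            raise ValueError("duplicate data block: " + name)
        block = DataBlock(name)
        self._blocks[key] = block
        return block

    def block(self, name):
        return self._blocks[name.lower()]


doc = CifDocument()
blk = doc.add_block("example")
blk.set("_cell_length_a", "5.43")
blk.set("_Cell_Length_B", "5.43")
print(blk.names())
print(doc.block("EXAMPLE").get("_cell_length_b"))
```

[The point of agreeing on such an interface is that applications written
against it would not need to change when the underlying parser library
does. A real CIF DOM would need many more classes than this.]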

> For example, a standard set of
> exceptions defining types of syntax error (applications would of course
> use their own exception handlers, but the specific errors in a file would
> be well defined across all libraries, e.g.
>     _a A _b 'Broken char string _c C
> would raise the exception INCOMPLETE_QUOTE_DELIMITED_STRING at the end of
> the line). Likewise, how closely aligned are the library functions across
> the existing libraries? Does CIFtbx have an equivalent function to the
> CIFLIB cifGetRowByIndex, for example? Should it have?
>
>
>2. DICTIONARY TOOLS
>-------------------
>The next most general category contains tools which know how to handle
>dictionaries, but have no domain-specific content. Ideally they should be
>able to handle DDL1 and DDL2 dictionaries transparently.
>
>a. Standalone tools
> Function            Description                              Exists?
> --------            -----------                              -------
> Syntax checker      As for data files, but knows about       vcif (C)
>                     save_ frames which are absent from
>                     data files
> Intelligent syntax  Less important than for data files       No
> checker
> Prettifier          Aligns lists of definition elements      No
> Merger              Combines dictionary files and fragments  No
>                     into a single dictionary a la
>                     McMahon/Bernstein/Westbrook protocol
> Name locator        Finds CIF datanames in dictionaries      cyclops (f77)
> Extractor           Extracts definition (rudimentary)        cman (C)
> Browser             Graphical tool to browse dictionary      No
>                     (read-only)

I wrote an mmCIF dictionary browser in Java ca 2 years ago. It would be
easier now. The dictionary is sufficiently complex that it has to have a
browser.

> Web browser         Really an implementation of a            mmCIF (Rutgers)/
>                     cif2html conversion                      core/pdCIF (IUCr)

Again I wrote something which expanded CIFs into something that could be
displayed on the screen. There is a real challenge with mmCIF as it can be
viewed as a structured document and/or a set of relational tables. It is
very difficult to devise a generic approach to browsing that satisfies all
possible mmCIFs.
I would certainly now address it through XSLT which allows joins through
keys.

>b. Libraries
>
>The primary requirement is to validate data files against the contents of
>one or more nominated dictionaries. CIFtbx (f77) and CIFOBJ (C++) provide
>routines for this (probably some also in CIFLIB), but I think these are all
>incomplete - please correct me if I'm wrong. CIFOBJ is DDL2 specific. CCDC's
>HICCuP program had some Python validation routines against DDL1
>dictionaries, again incomplete.
>
>Specific things that need doing include:
>
> completing validation functions for DDL1/2 dictionaries in CIFtbx;
> a C or C++ DDL1 validator;
> a reference _type_construct parser/validator to check data typing
>   through regular expressions (_type_construct has been used in the
>   msCIF dictionary, but without software it's difficult to be sure that
>   Gotzon's expressions will work). In fact, _type_construct would need
>   to be fully specified before such software can be developed;
> an IP-enabled tool to retrieve and cache public dictionaries referenced
>   through _audit_conform... data items and the IUCr registry;
> implementation of the dictionary merging protocol.
>
>c. "Trip" test
>
>A suite of tests that would allow developers to confirm that they are
>writing CIFs fully compliant with the standard would be beneficial. This
>should be at the level of checking syntax and compliance against specified
>dictionaries.

Does this mean roundtripping? I mean the ability to transform a CIF into
something else (memory or other format) and retransform to original CIF
without information loss. I have just finished doing this for a
(non-molecular) XML application and it has been very useful.

There is also the question of whether there should be a canonical CIF
representation - given 2 CIF representations of data can we
normalise/canonicalise these to show they are identical?

>3. SEMANTIC TRANSLATORS
>-----------------------
>Still steering clear of applications that need specifically crystallographic
>programming...
>
>a. Standalone programs
> Function            Description                              Exists?
> --------            -----------                              -------
> Formatters          Render in readable format via TeX,       ciftex, cif2xml,
>                     HTML, SGML, XML etc                      Rutgers dic->HTML

If you start with XML, XSLT does all of these and could do XML2CML. XSL-FO
is also being developed to render to PDF.

> Data converters     Conversion of all (or some) CIF data     cif2sx (ShelX)
>                     to various other existing formats        pdb2cif/cif2pdb

XSLT can sometimes do this, but other times there needs to be a DOM.

>b. Libraries
> Such utilities will tend to be fairly specific, but it would help to have
> common routines for mapping tokens between identical or similar data
> structures. So an mmCIF and associated DDL2 dictionary are isomorphous
> to a relational database with an associated schema. My ciftex output is
> a linear stream of tagged values, and is essentially isomorphous to the
> input CIF. However, an SGML translation is harder, because the document
> structure in SGML (depending on how it is defined by a DTD) may be a
> hierarchical model; how does the flat-field CIF map into that structure?

This is a useful point. core CIF is less complex than STAR and is flattish.
But it still needs some of the SD technology.

>4. CRYSTALLOGRAPHIC APPLICATIONS
>--------------------------------
>Now we get to the bit where we ask what the crystallographic community
>wants. Here are a few observations and suggestions from me; others are
>welcome to add their 2 cents (or $2!).
>
>Small-molecule community
>------------------------
>a. A structured CIF editor. CCDC are working well on this. The tool can import
> data files and data blocks (so things like descriptions of equipment can
> be stored in a template block). There is a "wizard" that prompts for
> "required" data items (to be supplied by journals or other applications
> in a lookup file).
> There is a visualisation window where a 3D structure
> can be rendered and rotated - this borrows code from the CSD database
> software, and so is quite crystallographically aware - it can (I think)
> show symmetry-generated parts of a molecule and packing in a unit cell,
> in a variety of rendering styles.

I differentiate between an *editor* and a *primary authoring tool*. An
editor has to be able to read in *any* compliant CIF (which could have any
elements in any order) and validate it. A p.a.t simply has to be able to
emit valid CIF. This is normally a lot easier.

> What's missing? I would guess that the version 1 release will lack the
> following features and functionality that CCDC want to have in due course
> (please correct me if I've got anything wrong, Owen).
>
> "WYSIWYG". Text needs to be entered using the CIF backslash coding
> conventions. Probably WYSIWYG will be introduced initially through
> cut-and-paste out of a word-processor window. I don't know whether
> it's possible to support clipboard formats across different platforms
> (Microsoft Windows, Mac, Linux StarOffice etc).
>
> Two-dimensional chemical diagrams. CCDC and Acta have requirements for
> 2D diagrams. There are various possible avenues of approach. (i) One is to
> embed a graphics file (TIFF or PostScript) in a text file in the CIF.

No - please no! I have horror stories of TIFFs and GIFs for chemistry.
Sometimes when rescaled bits disappear and bonds can literally disappear. 4
can be transformed to +, etc. I strongly urge SVG - the new graphics
language from W3C. It's gorgeous. See http://www.adobe.com/svg for some
examples. See also http://www.xml-cml.org

> This would require an embedding convention, similar to the imgCIF
> MIME convention; software to de-embed and decode the graphic;
> software to render the resulting TIFF or PS image. Substantial effort,
> and the result is just a picture.
> (ii) Another way is to embed the output
> file from common drawing packages such as ChemDraw and ISISDraw. As
> before, one needs to de-embed the file, decode it, render it in the
> style of the original package, and then parse it for chemical
> connectivity information (which is what is really wanted). The payoff
> is that the connectivity is read, but the software engineering is
> substantial and at the mercy of several proprietary formats.

I sympathise with this and as a result developed CML. CML is open, and I am
developing an opensource set of tools. So far I have a CMLDOM (to be
announced shortly), and am developing display and editing software. A major
problem with *all* chemical editors is that there is no agreed ontology
(unlike CIF!!) and so conventions from different manufacturers require
proprietary software to convert them. As Brian mentioned IUPAC is working on
a unique chemical identifier (IChI) which will address the unique
representation of molecules and I have actively committed CML to this.
Proprietary tools are a first step, but open protocols should be used asap.

> (iii) One could use the CIF (or, better, MIF) connectivity datanames.
> Ideally one would persuade the major manufacturers of such software
> to provide CIF/MIF as an export format from their packages. It may
> still be necessary to embed a graphics file for high-resolution
> publication, however. (iv) The other approach to connectivity is to
> infer chemical bond types from the 3D image, and allow the user to
> edit the 3D diagram interactively, trapping the result in CIF/MIF
> fields. This captures the chemical information, but loses the
> aesthetics of the commercial graphic presentation. It also alienates
> chemist authors who are familiar with the existing software
> packages. Of these options, (iii) looks best, but depends on
> persuading the manufacturers... usual story.
>
> Polyhedron rendering for inorganics.
>
> Intensity profiles for powder patterns?
> Not one I've discussed anywhere
> else, but if a structural CIF included a powder pattern it would be nice
> to be able to visualise the intensity data. Maybe not an essential
> component of a CIF editor, though.

I think it's critical to start capturing as much data *in machine
processable form* as possible. I assume that pdCIF will do this anyway so
this is a question of how it is displayed. SVG could be very useful here - I
use it for spectra from JCAMP, but I admit that there are no high-level
tools yet.

> Consistency checks against CIF dictionaries.
>
> mmCIF compatibility (i.e. I don't think it will be able to read a
> small-molecule structure written in the DDL2 version of the Core).
>
>
>b. Three-dimensional visualiser
>
> Existing tools are: Xtal_GX - not bad, and with a lot of crystallographic
> knowledge. Accesses CIF data blocks through a Tcl/Tk parser and GUI
> editor. Undoubtedly very useful to Xtal users, but the user interface
> is probably not intuitive to other users. I don't think it can read mmCIF
> format.
>
> OpenSource RasMol - favourite tool of protein folk; can read DDL1 and
> DDL2 CIFs, though can make incorrect bond assignments in small-molecule
> structures. Not crystallographically aware - cannot generate missing
> molecular fragments through application of symmetry operations, nor
> cell packing diagrams. Displays properly annotated disordered ensembles
> in different colours. Easy to use. Its major drawback (other than its
> lack of crystallography) is that it is available only as a helper and not
> a plugin to web browser windows - though I understand from Herbert that
> developing Netscape/IE plugins is a very high-overhead business.
>
> There are also commercial products: I know of at least Crystallographica,
> WebLabViewer. These are platform-specific (Windows, Mac respectively) and
> cannot read mmCIF.
>
>
>c. Data exchange
>
> Most small-molecule refinement packages seem to read and write coreCIF
> satisfactorily.
> Some make assumptions about data ordering or content
> that are not mandated (or even warranted) by the specification.

This is a fundamental aspect of the spec. It requires all software to be
able to read CIFs *from another tool*. No assumption about ordering can be
made - the spec says so. It is a good example of how reading is a lot harder
than writing!

Note that I am NOT suggesting that diffractometers, IUCR editors, etc
abandon CIF *syntax*. It's here, works well and has a very high success
rate. But the process should accommodate more recent technologies when they
become appropriate.

>Macromolecular community
>------------------------
>
>I invite comment from mmCIFers. As I understand things, protein
>crystallographers deposit data through a web editor which transforms the
>input to mmCIF files. The web editor uses the mmCIF dictionary for
>validation as the deposition proceeds, and ensures a high degree of data
>consistency. It is configurable to different purposes, but I'm not sure that
>it would have any application to constructing a small-molecule CIF (though I
>shall be happy to be corrected). mmCIFs are also available for download for
>every structure in the PDB, generated in the case of legacy data from
>Herbert's reworking of Phil Bourne's original pdb2cif translator. Despite
>community awareness of deficiencies, the old PDB format remains the de facto
>standard for macromolecular software, though a small number of refinement
>packages now write (and read?) mmCIF. RasMol is an effective macromolecular
>structure viewer.
>
>
>Powder diffraction, modulated structures
>----------------------------------------
>pdCIF and msCIF are written by a small number of programs in their
>respective fields (msCIF still in beta). I am not aware of any visualisation
>tools or any specific requirements by journals that would impinge upon
>software for these domains.
>
>
>Image plate data
>----------------
>The imgCIF dictionary is now under active COMCIFS review, and the imgCIF/CBF
>working group have a well developed API and library. The handling of images,
>though not a trivial task, is well defined. Support is still lacking from
>equipment manufacturers.
>
>
>Chemistry
>---------
>As mentioned above in my lengthy discourse on the CCDC editor, it would be
>beneficial to have 2D chemical structural information output in MIF format
>by standard commercial software packages. Perhaps relevant to this is an
>IUPAC initiative to generate identifiers for chemical compounds that is
>derivable from the compound's connection table. Perhaps also of some
>relevance to CIF matters is IUPAC's official endorsement of CML (chemical
>markup language) as an information interchange mechanism.

See above.

I hope this is useful. I am willing to try to unearth (though not to
repair!) any CIF-related software I may have written and make it available
as a first step. But software decays if not used and I can't make promises.

My summary is roughly:
 - the CIF initiative in creating dictionaries is absolutely the right way
   to go
 - the greatest effort should go into verifying the domain-specific aspects
   of the dictionaries. Only IUCr/COMCIFS can reasonably do this
 - the dictionaries should be re-usable by other disciplines (chemistry,
   materials science, etc.) In this way we start to normalise the use of
   crystallographic information over the world
 - in reverse, CIF should borrow from other disciplines (e.g. chemistry)
   where appropriate
 - the CIF project implies a large amount of generic technology for
   structured documents. Where possible this technology should be borrowed
   from elsewhere rather than rewritten by crystallographers
 - this is a general problem facing all IUs, scientific authors and
   publishers. The last 3 years have shown dramatic changes in technology
   and appreciation of the challenges.
   Whatever is decided must have an element of flexibility and an element
   of consistency. Not easy!

And - if it is some reassurance - I see a number of other disciplines and
crystallography is often well ahead of them. Several of them are moving to
XML and I am sure that this will play a central role in the future.

	Peter

>Regards
>Brian

Peter Murray-Rust, Director Virtual School of Molecular Sciences
Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK
Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110
http://www.vsms.nottingham.ac.uk