Survey of available CIF software and request for wish list
- To: Multiple recipients of list <comcifs-l@iucr.org>
- Subject: Survey of available CIF software and request for wish list
- From: Brian McMahon <bm@iucr.org>
- Date: Wed, 20 Sep 2000 14:22:45 +0100 (BST)
There has been a private discussion among some members over the last few
days about how to direct the development of software to advance the use of
CIF. I'd like to take that discussion onto the whole COMCIFS list for two
reasons: (1) to survey what is needed, and (2) to canvass opinions on how
to secure development effort and funding. It will be best to split these
two threads, so I'll start here by trying to categorise the types of tools
we need to consider, and reviewing what I know about the ones that exist.

=== Executive Summary ===
There is a shortage of basic tools for handling syntax issues and
dictionary validation checks. The existing ones are often incomplete or
not fully robust. In particular, support for fashionable scripting
languages (Perl, Tcl, Python) is poor. The needs of the small-molecule
crystallographer are (or soon will be) reasonably well met, but uptake of
mmCIF and imgCIF is still weak. Even with small-molecule applications much
would be gained by working in an environment that can interface easily
with existing lexer/parser tools, graphical widget sets and object storage
conventions.
=========================

A major problem with CIF is its breadth. Unlike rendering a graphics image,
which is well defined (so TIFF, GIF, JPEG, PNG etc are addressing the same
problem), CIF (and friends) includes raw and processed data, connectivity
maps, 3D coordinate sets, symmetry operations, discursive text etc, and is
used to describe inorganic, molecular, macromolecular and incommensurate
structures at least - there are already many other dictionaries in the
pipeline. So we need to envisage domain-specific applications; but we must
also provide a core of utilities that can be used in any domain.

Let's begin by thinking about these application-independent tools. What can
we identify as essential or even desirable?

1. PURE SYNTAX HANDLERS
-----------------------
Tools that handle CIF tokens without any interpretation, and so are
universal across all domains.

a. Standalone tools

Function            Description                               Exists?
--------            -----------                               -------
Syntax checker      Returns result code if there is a         vcif (C)
                    definite syntax error, and perhaps a
                    human-readable error message
Intelligent syntax  Indicates (probably) where the error      No
checker             really occurred
Prettifier          Enforces line lengths, aligns loop        cif2cif (Fortran)
                    elements
Stream editor       Allows CIF elements to be added,          No
                    deleted by command-line instruction
Rearranger          Modifies order of existing elements       quasar (f77)
                                                              cif2cif (f77)
Interrogator        Extracts CIF data meeting specified       starbase (C)
                    criteria
Tokeniser           Reads CIF and passes individual tokens    cifzinc (C)
                    to stdout in some normalised meta
                    representation
Interactive editor  Enforces correct syntax during            emacs cif.el
                    on-screen editing                         (Lisp)

b. Libraries

CIFtbx (Fortran), CIFLIB (C API), CIFOBJ (C++ class library) are publicly
available, CCDC has developed a C++ class library within the CIFer project,
Luca Lutterotti of Trento, Italy has advertised an incomplete Java class
library on cif-developers. There is also Peter Murray-Rust's old C++
library (somewhere). So far as I am aware, the Rutgers libraries compile
(easily) on only a small number of platforms.

Is it beneficial to define a standard applications program interface that
different libraries could converge to? For example, a standard set of
exceptions defining types of syntax error (applications would of course use
their own exception handlers, but the specific errors in a file would be
well defined across all libraries, e.g.

    _a A
    _b 'Broken char string
    _c C

would raise the exception INCOMPLETE_QUOTE_DELIMITED_STRING at the end of
the line).

Likewise, how closely aligned are the library functions across the existing
libraries? Does CIFtbx have an equivalent function to the CIFLIB
cifGetRowByIndex, for example? Should it have?
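
To make the exception proposal above a little more concrete, here is a minimal Python sketch of how a shared error vocabulary might look to an application. Everything in it is an assumption made for illustration: the class names, the `tokenize_line` helper and the deliberately simplified quoting rules are hypothetical and are not taken from CIFtbx, CIFLIB, CIFOBJ or any other existing library.

```python
class CifSyntaxError(Exception):
    """Hypothetical base class for a shared set of CIF syntax exceptions."""

    def __init__(self, message, line):
        super().__init__(f"line {line}: {message}")
        self.line = line


class IncompleteQuoteDelimitedString(CifSyntaxError):
    """A quoted string was still open at the end of its line."""


def tokenize_line(text, line_no=1):
    """Split one line of CIF text into tokens.

    Simplified on purpose: handles whitespace-separated words and
    single- or double-quoted strings only. Comments, semicolon text
    fields and the finer points of CIF quoting are ignored.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "'\"":
            # Look for the matching closing quote on the same line.
            end = text.find(ch, i + 1)
            if end == -1:
                raise IncompleteQuoteDelimitedString(
                    "INCOMPLETE_QUOTE_DELIMITED_STRING", line_no)
            tokens.append(text[i + 1:end])
            i = end + 1
        else:
            # Unquoted word: runs to the next whitespace character.
            start = i
            while i < n and not text[i].isspace():
                i += 1
            tokens.append(text[start:i])
    return tokens
```

An application would wrap the call in its own handler, for instance `try: tokenize_line(line, n)` followed by `except CifSyntaxError as err:`, and could rely on the same exception class whichever conforming library sat underneath.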

2. DICTIONARY TOOLS
-------------------
The next most general category contains tools which know how to handle
dictionaries, but have no domain-specific content. Ideally they should be
able to handle DDL1 and DDL2 dictionaries transparently.

a. Standalone tools

Function            Description                               Exists?
--------            -----------                               -------
Syntax checker      As for data files, but knows about        vcif (C)
                    save_ frames which are absent from
                    data files
Intelligent syntax  Less important than for data files        No
checker
Prettifier          Aligns lists of definition elements       No
Merger              Combines dictionary files and fragments   No
                    into a single dictionary a la
                    McMahon/Bernstein/Westbrook protocol
Name locator        Finds CIF datanames in dictionaries       cyclops (f77)
Extractor           Extracts definition                       cman (rudimentary) (C)
Browser             Graphical tool to browse dictionary       No
                    (read-only)
Web browser         Really an implementation of a             mmCIF (Rutgers)/
                    cif2html conversion                       core/pdCIF (IUCr)

b. Libraries

The primary requirement is to validate data files against the contents of
one or more nominated dictionaries. CIFtbx (f77) and CIFOBJ (C++) provide
routines for this (probably some also in CIFLIB), but I think these are all
incomplete - please correct me if I'm wrong. CIFOBJ is DDL2 specific.
CCDC's HICCuP program had some Python validation routines against DDL1
dictionaries, again incomplete.

Specific things that need doing include:

    completing validation functions for DDL1/2 dictionaries in CIFtbx;

    a C or C++ DDL1 validator;

    a reference _type_construct parser/validator to check data typing
    through regular expressions (_type_construct has been used in the
    msCIF dictionary, but without software it's difficult to be sure that
    Gotzon's expressions will work). In fact, _type_construct would need
    to be fully specified before such software can be developed;

    an IP-enabled tool to retrieve and cache public dictionaries
    referenced through _audit_conform... data items and the IUCr registry;

    implementation of the dictionary merging protocol.
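
As a rough illustration of the kind of check a dictionary validator performs, the Python sketch below tests each data name for presence in a dictionary and each value against a regular expression. It is only a sketch under stated assumptions: `TOY_DICTIONARY` is a hand-written stand-in for a parsed dictionary, its patterns are invented for this example and are not real _type_construct expressions, and the `validate` function is hypothetical rather than part of CIFtbx, CIFOBJ or HICCuP.

```python
import re

# Hypothetical stand-in for a parsed dictionary: each data name maps to
# a regular expression that the whole value must match. The patterns
# are illustrative only, not copied from any DDL1/DDL2 dictionary.
TOY_DICTIONARY = {
    # a number with an optional standard uncertainty, e.g. 10.234(5)
    "_cell_length_a": r"[0-9]+(\.[0-9]+)?(\([0-9]+\))?",
    # a lower-case word, e.g. monoclinic
    "_symmetry_cell_setting": r"[a-z]+",
}


def validate(items, dictionary):
    """Return a list of (data_name, problem) pairs.

    `items` maps data names to their string values as read from a
    data block; an empty list means no problems were found.
    """
    problems = []
    for name, value in items.items():
        # CIF data names are case-insensitive, so compare in lower case.
        pattern = dictionary.get(name.lower())
        if pattern is None:
            problems.append((name, "not defined in dictionary"))
        elif re.fullmatch(pattern, value) is None:
            problems.append((name, f"value {value!r} fails type check"))
    return problems
```

Called with `{"_cell_length_a": "10.234(5)"}` this returns an empty list, while an unknown data name or a value that does not match its pattern is reported as a problem. A real validator would also have to deal with loops, enumeration lists, ranges and the differences between DDL1 and DDL2.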

c. "Trip" test

A suite of tests that would allow developers to confirm that they are
writing CIFs fully compliant with the standard would be beneficial. This
should be at the level of checking syntax and compliance against specified
dictionaries.

3. SEMANTIC TRANSLATORS
-----------------------
Still steering clear of applications that need specifically
crystallographic programming...

a. Standalone programs

Function            Description                               Exists?
--------            -----------                               -------
Formatters          Render in readable format via TeX,        ciftex, cif2xml,
                    HTML, SGML, XML etc                       Rutgers dic->HTML
Data converters     Conversion of all (or some) CIF data      cif2sx (ShelX)
                    to various other existing formats         pdb2cif/cif2pdb

b. Libraries

Such utilities will tend to be fairly specific, but it would help to have
common routines for mapping tokens between identical or similar data
structures. So an mmCIF and associated DDL2 dictionary are isomorphous to a
relational database with an associated schema. My ciftex output is a linear
stream of tagged values, and is essentially isomorphous to the input CIF.
However, an SGML translation is harder, because the document structure in
SGML (depending on how it is defined by a DTD) may be a hierarchical model;
how does the flat-field CIF map into that structure?

4. CRYSTALLOGRAPHIC APPLICATIONS
--------------------------------
Now we get to the bit where we ask what the crystallographic community
wants. Here are a few observations and suggestions from me; others are
welcome to add their 2 cents (or $2!).

Small-molecule community
------------------------
a. A structured CIF editor.

CCDC are working well on this. The tool can import data files and data
blocks (so things like descriptions of equipment can be stored in a
template block). There is a "wizard" that prompts for "required" data items
(to be supplied by journals or other applications in a lookup file).
There is a visualisation window where a 3D structure can be rendered and
rotated - this borrows code from the CSD database software, and so is quite
crystallographically aware - it can (I think) show symmetry-generated parts
of a molecule and packing in a unit cell, in a variety of rendering styles.

What's missing? I would guess that the version 1 release will lack the
following features and functionality that CCDC want to have in due course
(please correct me if I've got anything wrong, Owen).

"WYSIWYG". Text needs to be entered using the CIF backslash coding
conventions. Probably WYSIWYG will be introduced initially through
cut-and-paste out of a word-processor window. I don't know whether it's
possible to support clipboard formats across different platforms (Microsoft
Windows, Mac, Linux StarOffice etc).

Two-dimensional chemical diagrams. CCDC and Acta have requirements for 2D
diagrams. There are various possible avenues of approach.

(i) One is to embed a graphics file (TIFF or PostScript) in a text file in
the CIF. This would require an embedding convention, similar to the imgCIF
MIME convention; software to de-embed and decode the graphic; software to
render the resulting TIFF or PS image. Substantial effort, and the result
is just a picture.

(ii) Another way is to embed the output file from common drawing packages
such as ChemDraw and ISISDraw. As before, one needs to de-embed the file,
decode it, render it in the style of the original package, and then parse
it for chemical connectivity information (which is what is really wanted).
The payoff is that the connectivity is read, but the software engineering
is substantial and at the mercy of several proprietary formats.

(iii) One could use the CIF (or, better, MIF) connectivity datanames.
Ideally one would persuade the major manufacturers of such software to
provide CIF/MIF as an export format from their packages. It may still be
necessary to embed a graphics file for high-resolution publication,
however.

(iv) The other approach to connectivity is to infer chemical bond types
from the 3D image, and allow the user to edit the 3D diagram interactively,
trapping the result in CIF/MIF fields. This captures the chemical
information, but loses the aesthetics of the commercial graphic
presentation. It also alienates chemist authors who are familiar with the
existing software packages.

Of these options, (iii) looks best, but depends on persuading the
manufacturers... usual story.

Polyhedron rendering for inorganics.

Intensity profiles for powder patterns? Not one I've discussed anywhere
else, but if a structural CIF included a powder pattern it would be nice to
be able to visualise the intensity data. Maybe not an essential component
of a CIF editor, though.

Consistency checks against CIF dictionaries.

mmCIF compatibility (i.e. I don't think it will be able to read a
small-molecule structure written in the DDL2 version of the Core).

b. Three-dimensional visualiser

Existing tools are:

Xtal_GX - not bad, and with a lot of crystallographic knowledge. Accesses
CIF data blocks through a Tcl/Tk parser and GUI editor. Undoubtedly very
useful to Xtal users, but the user interface is probably not intuitive to
other users. I don't think it can read mmCIF format.

OpenSource RasMol - favourite tool of protein folk; can read DDL1 and DDL2
CIFs, though can make incorrect bond assignments in small-molecule
structures. Not crystallographically aware - cannot generate missing
molecular fragments through application of symmetry operations, nor cell
packing diagrams. Displays properly annotated disordered ensembles in
different colours. Easy to use. Its major drawback (other than its lack of
crystallography) is that it is available only as a helper and not a plugin
to web browser windows - though I understand from Herbert that developing
Netscape/IE plugins is a very high-overhead business.

There are also commercial products: I know of at least Crystallographica,
WebLabViewer.
These are platform-specific (Windows, Mac respectively) and cannot read
mmCIF.

c. Data exchange

Most small-molecule refinement packages seem to read and write coreCIF
satisfactorily. Some make assumptions about data ordering or content that
are not mandated (or even warranted) by the specification.

Macromolecular community
------------------------
I invite comment from mmCIFers. As I understand things, protein
crystallographers deposit data through a web editor which transforms the
input to mmCIF files. The web editor uses the mmCIF dictionary for
validation as the deposition proceeds, and ensures a high degree of data
consistency. It is configurable to different purposes, but I'm not sure
that it would have any application to constructing a small-molecule CIF
(though I shall be happy to be corrected).

mmCIFs are also available for download for every structure in the PDB,
generated in the case of legacy data from Herbert's reworking of Phil
Bourne's original pdb2cif translator.

Despite community awareness of deficiencies, the old PDB format remains the
de facto standard for macromolecular software, though a small number of
refinement packages now write (and read?) mmCIF.

RasMol is an effective macromolecular structure viewer.

Powder diffraction, modulated structures
----------------------------------------
pdCIF and msCIF are written by a small number of programs in their
respective fields (msCIF still in beta). I am not aware of any
visualisation tools or any specific requirements by journals that would
impinge upon software for these domains.

Image plate data
----------------
The imgCIF dictionary is now under active COMCIFS review, and the
imgCIF/CBF working group have a well developed API and library. The
handling of images, though not a trivial task, is well defined. Support is
still lacking from equipment manufacturers.

Chemistry
---------
As mentioned above in my lengthy discourse on the CCDC editor, it would be
beneficial to have 2D chemical structural information output in MIF format
by standard commercial software packages. Perhaps relevant to this is an
IUPAC initiative to generate identifiers for chemical compounds that are
derivable from the compound's connection table. Perhaps also of some
relevance to CIF matters is IUPAC's official endorsement of CML (chemical
markup language) as an information interchange mechanism.

Regards
Brian