Report and commentary on BCA chemical crystallography group meeting
- To: epc@iucr.org, rand@iucr.org, gh@iucr.org
- Subject: Report and commentary on BCA chemical crystallography group meeting
- From: Brian McMahon <bm@iucr.org>
- Date: Tue, 18 Nov 2003 14:47:35 +0000
BCA Chemical Crystallography Group Meeting 12 November 2003
===========================================================

Four of us from the Chester office (myself, Pete, Gillian and Mike Hoyland) attended a very interesting session on "Beyond Refinement: What Happens Next?" in Cambridge. I started to jot down my notes to record the content of the meeting, but the exercise led me on to begin to think about some of the implications for us. These are my personal thoughts, and don't as yet dictate a policy line - but you'll see the germ of some ideas in the final paragraphs. Pete, Mike and Gillian are welcome to contradict or augment any of these remarks as they see fit.

Talk 1 - Tony Linden: CheckCIF2003 - a general structure validation tool
------------------------------------------------------------------------

Tony gave a clear account of the evolution and practice of checkCIF, and the advantages to the prospective author of using it early and often. The talk was well received, and it seems to me to reinforce the perception that the crystallographers who are likely to publish in Acta consider the checkCIF system to be a good thing.

Tony spoke of the role of the checking process and the support of coeditors in providing an element of education to the non-expert crystallographers who may be involved in structure determinations. We could perhaps help this by taking the explanatory notes that are associated with checkCIF alerts (which are already very good) and providing onward links to more detailed online tutorials promoting good experimental practice. Who might be good at commissioning and editing such tutorials?

He also commented that some residual errors were not checkable by checkCIF, but the examples he gave seemed to me amenable to software analysis. For example _exptl_crystal_description was given as "needle", while _exptl_crystal_size_ had the dimensions of a plate; or _exptl_crystal_colour was "colourless" for a compound with Fe in its formula.
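
As a rough illustration of the two examples just given, a heuristic check might look something like the Python sketch below. The data names are those of the core CIF dictionary; everything else (the function names, the 3:1 aspect-ratio threshold, the short list of "often coloured" elements and the wording of the alerts) is an assumption made purely for illustration and is not how checkCIF works.

```python
import re

# Assumption for illustration: elements whose compounds are often coloured.
OFTEN_COLOURED = {"Fe", "Co", "Ni", "Cu", "Cr", "Mn"}


def shape_from_sizes(size_max, size_mid, size_min, ratio=3.0):
    """Guess a crystal habit from its three dimensions (hypothetical rule)."""
    if size_max >= ratio * size_mid:
        return "needle"  # one dimension much longer than the other two
    if size_mid >= ratio * size_min:
        return "plate"   # one dimension much shorter than the other two
    return "block"


def heuristic_alerts(cif):
    """Return a list of C-level alert strings for a dict of CIF items."""
    alerts = []

    # Rule 1: the stated habit should agree with the reported dimensions.
    desc = cif.get("_exptl_crystal_description", "").lower()
    sizes = [cif.get("_exptl_crystal_size_" + k) for k in ("max", "mid", "min")]
    if desc in ("needle", "plate", "block") and all(s is not None for s in sizes):
        guess = shape_from_sizes(*(float(s) for s in sizes))
        if guess != desc:
            alerts.append("C: described as '%s' but dimensions suggest '%s'"
                          % (desc, guess))

    # Rule 2: "colourless" is suspicious if the formula contains e.g. Fe.
    colour = cif.get("_exptl_crystal_colour", "").lower()
    formula = cif.get("_chemical_formula_sum", "")
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    suspects = sorted(elements & OFTEN_COLOURED)
    if colour == "colourless" and suspects:
        alerts.append("C: colourless crystal containing " + ", ".join(suspects))

    return alerts


if __name__ == "__main__":
    example = {
        "_exptl_crystal_description": "needle",
        "_exptl_crystal_size_max": "0.30",
        "_exptl_crystal_size_mid": "0.25",
        "_exptl_crystal_size_min": "0.02",
        "_exptl_crystal_colour": "colourless",
        "_chemical_formula_sum": "C10 H10 Fe",
    }
    for alert in heuristic_alerts(example):
        print(alert)
```

Each rule only flags a possible inconsistency for a human to look at; it does not decide that the entry is wrong.
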
So long as (human) editorial judgement is involved, it would be possible to construct heuristic procedures to flag such things as C alerts. The establishment of a set of heuristic rules is equivalent to building up a "knowledge base"; and the addition of rules to the "knowledge base" could proceed in parallel with the addition of "tips", "rules of thumb" and "indicators of good practice" to the online tutorials suggested above.

He also emphasised the use of platon with reflection data in checking, and I suggest we begin to look at extending checkCIF to use the reflection data. (The problems are largely to do with file management and associating the right structure factors with the right structure.)

Andrew Bond asked if a distributable checkCIF could be made available (e.g. for people who liked to work on trains or who have disconnected themselves from the Web because of hackers and spammers). Our difficulties with this are fluidity of the code (could be alleviated by web-based auto-update techniques) but also the logistics of maintaining multi-platform multi-version software. In the longer run, maintenance of code is a general problem - what will happen to platon when Ton Spek retires?

Simon Parsons asked what was wrong with having a structure rather well determined but with only a 95% data collection set? Would the journal reject it? Here was the beginning of what could have been an interesting debate, that never really got going, but is constantly simmering beneath the surface both of this meeting and of the whole practice of publishing crystal structure reports: cost and benefit. There are costs in collecting that extra 5% of data - maybe an extended time on the diffractometer that a service department will not let you have; maybe a very extended time at the synchrotron station that no-one could reasonably afford. The researcher has determined the structure well enough to answer the question of interest. What's the big deal?
Against that the benefits include a better-characterised data point in the CSD or other database, and that will have a small but important effect on the quality of scientific deductions that can be made from the databases. This is important, and should perhaps be pushed by the IUCr (through policy statements), by the databases (through careful research studies demonstrating the effects of high-quality data), and by individual crystallographers pressing this point of view on their chemist colleagues. All this is of course done to some extent already, but much more may be needed.

Talk 2 - Richard Cooper: Validation as you go
---------------------------------------------

An account of the CRYSTALS package, and how it enables users to validate results at the diffractometer and beyond. (There is general consensus that validation at source is a good thing.) A particular type of validation is in-situ matching of portions of molecular geometry against MOGUL, a CCDC database that stores molecular geometries from the CSD in a fast-access data structure.

A few ideas came out of conversations after this talk. MOGUL validation sounded like a nice idea (it's a "knowledge-based" extension of the limited set of geometry checks that platon can do). The problem is identifying from the input molecule the geometries and bond types against which to launch the search. Since CRYSTALS does this, I thought we might see whether CRYSTALS can be hooked into checkCIF to perform just these checks. Richard thought this was do-able, but would involve a fair amount of code manipulation, and he thought that it would be as easy for us to write something ourselves. But the rub comes in the bond typing - apparently CRYSTALS uses Sam Motherwell's code to do this. Now, it transpires that Sam's algorithms have been reimplemented (and improved) in the C++ toolkit which CCDC are planning to release along with an API to developers next year.
That may be a good time to start thinking seriously of adding such a check to our suite. Ian Bruno (who is a very farsighted programmer at CCDC) will let us have a copy of the next MOGUL release.

CCDC is also interested in offering checkCIF-style validation alongside the enCIFer editor. I spoke to Greg Shields about the possibility of building a user-friendly package that included enCIFer and checkCIF as a standalone program. It seems a nice idea in principle, and CCDC would probably have the expertise to write a wrapper to call the various programs from within a nice cross-platform GUI. What's more problematic is the very different programming languages used; here again an API would help in exchanging data between the component programs. We chatted briefly about a "generic" API for crystallographic applications, that abstracted concepts such as molecules, cells, bonds etc: the CIF dictionaries could be used as a seedbed for specifying many of the necessary objects. At the moment it seems very much a pipe dream, but the rise of open-source collaborative toolkits both emphasises the usefulness of such an approach and begins to address the knotty problem of maintainability of crystallographic algorithms once individual programs and programmers retire.

Talk 3 - Kirsty Anderson: Publishing crystallography in chemical journals
-------------------------------------------------------------------------

Kirsty is the in-house Crystallographic Data Editor for journals of the Royal Society of Chemistry. She sees her role as validating chemical results drawn from crystal structures. It seems to me that there is a danger in this emphasis that plausible chemistry may be passed on the basis of a consistent (but wrong) crystal structure model, but I'm sure that the crystallography is actually checked adequately. Checking is done by arrangement with CCDC and through use of local copies of software such as platon and Mercury.
RSC earlier asked us for permission to point authors to our checkCIF service, although RSC declines to sponsor checkcif.iucr.org. It would also appear that the Crystallographic Data Editor wishes to outsource the checking to CCDC rather than using the tailored checkCIF service that we provided for Dave Bardwell at RSC (very likely because of workload). There is no doubt, however, that RSC journals will often accept minimal crystallographic standards so long as the structure is not obviously wrong or contradictory to the indicated chemistry. It is likely that checkCIF would be seen as excessively demanding for validation at this level.

Talk 4 - John Davies: Unpublished structures
--------------------------------------------

John runs a very productive service crystallography unit at Cambridge University. He made the point that despite the efficiency of journals such as Acta E, it is likely that there will always be large numbers of structures that will never be candidates for formal publication - if it takes even a day to prepare a structure report for Acta E, he would prefer to spend the time determining another three or more structures.

Again, the fate of a structure will depend on its intrinsic interest. John in fact publishes reasonably frequently in Acta, but as part of his service he might determine the structure of dozens of reaction intermediates. These are of no especial interest to the commissioning chemist; have been refined only to the extent needed to satisfy the chemist that his research is on the right track; do not individually merit a detailed write-up; yet are valuable data points within a comprehensive collection such as the CSD.

At present, each solved structure is registered and stored in a local version of the CSD database.
This makes transfer to the CSD trivial from a technical viewpoint; but such transfers depend either upon publication, or upon the agreement of the researchers who commissioned the structure to deposit it as a private communication. There is a problem with chemists who commission such structures and then lose interest or can no longer be contacted. Without their permission, there is no way to make the structure public.

In his talk John argued for an alternative means of publication. Ideally there would be some external database, accessible over the web, to which he could transfer all his finished structures. Each structure to be deposited in such a public database should have full hkl data; and there should be evaluation software freely available to check the quality of the structures deposited, acting on the experimental data. This collection should have sufficient merit that deposits could be credited towards crystallographers' career development, and acceptance into the database may require passing some quality threshold. But the structures would not be subject to human refereeing. He would also seek from the commissioning chemists agreement that if structures were not published within a certain time (say 3 years), they could be transferred to this depository without other formality.

Talk 5 - Peter Murray-Rust: e-Science in crystallography and chemistry:
-----------------------------------------------------------------------
CIF and CML
-----------

Peter began by praising the CIF development effort, but warned that crystallographers ignore current major software developments at our peril. Bioscientists are concerned that crystallographic information is not available in the way that they want it (i.e. XML). He urged the abandonment of the CIF formalism in future informatics developments.
He argued that the democracy of the web undermines the value of the traditional closed databases with bespoke query languages, and that crystallographic laboratories should host their own data collections on peer-to-peer servers suitable for mutual discovery and harvesting of content. He also pressed the case that CIFs such as those available as supplementary documents on the RSC website should be considered as belonging to the community.

Talk 6 - Frank Allen: The future of crystallographic publication
----------------------------------------------------------------

CCDC continues to produce innovative research tools based on the contents of the CSD. MOGUL is a derivative database of molecular fragment geometries, and CCDC plan to release an API to allow developers to provide direct interaction with MOGUL (and ultimately other components of the CSD) through their own applications (see above).

CCDC remains concerned that not all solved structures are finding their way into the public domain and into the CSD - the anticipated "explosion" in structures from area detector technology has still not materialised, although the number of published structures still grows at the previous exponential rate. It's likely that John Davies' approach to publication goes some way towards explaining that. CCDC are considering ways to ensure that material deposited privately with them be placed in the public domain if there has been no publication within a certain time span. They are also interested in the direct automated harvesting of structures for direct incorporation in CSD.

Chairman's interventions
------------------------

Mike Hursthouse, who was chairing the afternoon session, commented that work he was involved in with UKOLN had the potential to address several of the issues of concern.
From subsequent inspection of his group's research proposal (http://www.ukoln.ac.uk/projects/ebank-uk/docs/bid/bid.pdf) it becomes apparent that the project involves using the Southampton University eprints.org software as the basis for a distributed preprint server-cum-data repository. One way in which this might work is that structure reports are deposited, with an accompanying write-up in preprint format. The availability of such preprints allows the possibility of community-based review, and the possible development of preprints which receive much attention into fully-fledged publications. It's an interesting development for chemists, but it doesn't entirely answer the needs of John Davies and CCDC for public access and incorporation of unpublished data sets in isolation.

General discussion
------------------

There was a broad discussion on the topics arising from the meeting, during which a number of concerns surfaced:

1. Intellectual property rights. There is consensus that it is inappropriate for journals or databases to claim copyright over individual data sets. However, the issue of ownership of IPR is not clear cut. If a structure determination is carried out by a crystallographer on behalf of a chemist as part of a University department funded publicly (or perhaps also from other sources), who has the ownership of the data set and who can dictate how that data set is disposed of?

2. How to ensure that the service crystallographer receives due credit and career development?

3. Quality of a reported structure. How to assess it? What is "acceptable"? - many structures are solved only to the extent necessary to achieve a chemical purpose.

4. It became clear that many speakers' experiences and proposals were very UK-centric. Understandable at a BCA meeting, of course, but we do need to understand how different initiatives informed by different national or regional philosophies will best serve the international nature of science.
Random thoughts of an observer
------------------------------

1. The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) certainly offers a technical solution to the location and harvesting of metadata concerning distributed data sets. It is employed in Hursthouse's e-Bank project; it could link together individual laboratory collections like John Davies', as well as larger-scale activities such as The Reciprocal Net and the Crystallography Open Database.

2. To my mind the great danger of the distributed approach is that there is no guarantee of long-term preservation of (and access to) the data. It would be helpful if some central authority could act as a mirror of the content offered by these distributed servers. Or, if that's unacceptable, that there is at least a federation of regional mirrors able and willing to interoperate and ideally provide mutual backup. It is implicit that I see this as specific to the field of crystallography; other scientific domains would make their own arrangements (but perhaps might be inspired by a well-coordinated effort in one field).

3. Peter M-R's criticism of the closed database is too harsh. While open-source collaboration is very attractive and potentially very powerful (a la Linux), it is unlikely that the functionality and intellectual capital invested in the CSD over the last 4 decades could be reproduced on any sort of short timescale by a de novo open-source project. More useful is the prospect of making available public APIs to the databases so that add-on applications may be developed in the community. The MOGUL API is an indication that CCDC would be willing to go down this road. Development of a general "API" to database contents (i.e.
a standardised meta-query language appropriate to the contents of crystallographic databases) would be useful not only in enabling common queries to be posed to the public databases, but in extending the domain of inquiry to distributed data repositories built on different architectures. Identification of appropriate elements of such a generic metaquery language could start with the existing domain ontology (i.e. from the CIF dictionaries).

4. Likewise, Peter M-R's enthusiasm to replace CIF by XML is misplaced. The bioscientists have jettisoned mmCIF because they are impatient, and ad hoc XML representations better suit their short-term needs for data exchange. There is an increasing rush towards machine-generated taxonomies and domain ontologies; but what is clear from the discussions of the CIF dictionary committees (and in part the reason they operate so slowly) is that unambiguous definition of physical concepts for electronic computing machines is a very difficult thing to achieve. Short-term exchange between consenting programmers does not guarantee extensibility and long-term fitness for purpose.

5. The structural genomicists are reasonably successful in their exertions because the field is new, well focused and well funded. However, it's worrying that for all the public money going into the subject, there's little or no emphasis on the need to preserve partial results in a well-characterised form for future study.

6. Frank Allen made the point that despite the apparent size of the CSD, individual structural or chemical motifs can be very sparsely represented. There is still a very strong need to accumulate data points to cover all of chemical/conformation space. However, there is also the need to understand the weight due to an individual data point in the collection. It is clear that economic demands will prevent the fullest possible precision in every structure experiment.
Surely objective weighting factors can be devised based on analysis of the structural models - and, so far as possible, of the reflection data too - that can be used as metrics to assess the trustworthiness of a deduction made from a population of database entries.

7. It appears that there is still very limited availability in the public domain of reflection data. Is this best addressed by renewed pressure on publishers, by exhorting the databases to require structure factors among their deposited materials, or by devising a distributed system linking resources like The Reciprocal Net by open-access protocols? Note that the old reluctance of publishers and databases to store structure factors because of the bulk of the data is barely credible in these days of terabyte PCs.

Brian

_______________________________________________
Epc mailing list
Epc@iucr.org
http://scripts.iucr.org/mailman/listinfo/epc