Report and commentary on BCA chemical crystallography group meeting

BCA Chemical Crystallography Group Meeting 12 November 2003
===========================================================

Four of us from the Chester office (myself, Pete, Gillian and Mike
Hoyland) attended a very interesting session on "Beyond Refinement: What
Happens Next?" in Cambridge. I started to jot down my notes to record the
content of the meeting, but the exercise led me to think about
some of the implications for us. These are my personal thoughts, and don't 
as yet dictate a policy line - but you'll see the germ of some ideas in
the final paragraphs. Pete, Mike and Gillian are welcome to contradict or
augment any of these remarks as they see fit.

Talk 1 - Tony Linden: CheckCIF2003 - a general structure validation tool
------------------------------------------------------------------------
Tony gave a clear account of the evolution and practice of checkCIF, and
the advantages to the prospective author of using it early and often. The
talk was well received, and it seems to me to reinforce the perception
that the crystallographers who are likely to publish in Acta consider the
checkCIF system to be a good thing.

Tony spoke of the role of the checking process and the support of
coeditors in providing an element of education to the non-expert
crystallographers who may be involved in structure determinations.
We could perhaps help this by taking the explanatory notes that are
associated with checkCIF alerts (which are already very good) and
providing onward links to more detailed online tutorials promoting good
experimental practice. Who might be good at commissioning and editing such
tutorials?

He also commented that some residual errors were not checkable by checkCIF, 
but the examples he gave seemed to me amenable to software
analysis. For example, _exptl_crystal_description was given as "needle"
while the _exptl_crystal_size_* items had the dimensions of a plate; or
_exptl_crystal_colour was "colourless" for a compound with Fe in its
formula. So long as (human) editorial judgement is involved, it would be
possible to construct heuristic procedures to flag such things as C
alerts. The establishment of a set of heuristic rules is equivalent to
building up a "knowledge base"; and the addition of rules to the
"knowledge base" could proceed in parallel with the addition of "tips",
"rules of thumb" and "indicators of good practice" to the online tutorials 
suggested above.
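
A minimal sketch of how two such heuristic rules might look, assuming
the CIF has already been parsed into a dictionary keyed by data name.
The function names, the 3:1 shape ratios and the list of "coloured"
metals are my own illustrative assumptions, not part of checkCIF or
platon; a real rule set would need editorial tuning.

```python
import re

# Assumption: metals whose compounds are usually coloured (illustrative list).
COLOURED_METALS = {"Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu"}


def elements_in_formula(formula):
    """Return the element symbols found in a formula such as 'C10 H10 Fe'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def habit_from_size(size_max, size_mid, size_min):
    """Guess a habit from three dimensions (mm) using crude 3:1 ratios."""
    if size_max >= 3 * size_mid:
        return "needle"   # one dimension much longer than the other two
    if size_mid >= 3 * size_min:
        return "plate"    # one dimension much shorter than the other two
    return "block"


def heuristic_alerts(cif):
    """Return a list of C-level alert strings for a dict of CIF items."""
    alerts = []

    # Rule 1: described habit versus habit implied by the dimensions.
    size_keys = ("_exptl_crystal_size_max",
                 "_exptl_crystal_size_mid",
                 "_exptl_crystal_size_min")
    described = cif.get("_exptl_crystal_description", "").lower()
    if described in ("needle", "plate") and all(k in cif for k in size_keys):
        implied = habit_from_size(*(float(cif[k]) for k in size_keys))
        if implied != described:
            alerts.append("C: described as '%s' but dimensions suggest '%s'"
                          % (described, implied))

    # Rule 2: a colourless crystal containing a typically coloured metal.
    colour = cif.get("_exptl_crystal_colour", "").lower()
    metals = elements_in_formula(cif.get("_chemical_formula_sum", ""))
    metals &= COLOURED_METALS
    if colour == "colourless" and metals:
        alerts.append("C: colourless crystal contains %s"
                      % ", ".join(sorted(metals)))

    return alerts
```

Each rule only raises a flag for a human editor to judge; adding a rule
to the "knowledge base" amounts to adding one more block of this kind.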

He also emphasised the use of platon with reflection data in checking, and 
I suggest we begin to look at extending checkCIF to use the reflection
data. (The problems are largely to do with file management and associating 
the right structure factors with the right structure.)

Andrew Bond asked if a distributable checkCIF could be made available
(e.g. for people who liked to work on trains or who have disconnected
themselves from the Web because of hackers and spammers). Our difficulties 
with this are fluidity of the code (could be alleviated by web-based
auto-update techniques) but also the logistics of maintaining
multi-platform multi-version software. In the longer run, maintenance of
code is a general problem - what will happen to platon when Ton Spek retires?

Simon Parsons asked what was wrong with having a structure rather well
determined but with a data set only 95% complete. Would the journal
reject it? Here was the beginning of what could have been an interesting
debate, that never really got going, but is constantly simmering beneath
the surface both of this meeting and of the whole practice of publishing
crystal structure reports: cost and benefit. There are costs in collecting 
that extra 5% of data - maybe an extended time on the diffractometer that 
a service department will not let you have; maybe a very extended time at
the synchrotron station that no-one could reasonably afford. The
researcher has determined the structure well enough to answer the question 
of interest. What's the big deal? Against that the benefits include a
better-characterised data point in the CSD or other database, and that
will have a small but important effect on the quality of scientific
deductions that can be made from the databases. This is important, and
should perhaps be pushed by the IUCr (through policy statements), by the
databases (through careful research studies demonstrating the effects of
high-quality data), and by individual crystallographers pressing this
point of view on their chemist colleagues. All this is of course done to
some extent already, but much more may be needed.

Talk 2 - Richard Cooper: Validation as you go
---------------------------------------------
An account of the CRYSTALS package, and how it enables users to validate
results at the diffractometer and beyond. (There is general consensus that 
validation at source is a good thing.) A particular type of validation is
in-situ matching of portions of molecular geometry against MOGUL, a CCDC
database that stores molecular geometries from the CSD in a fast-access 
data structure.

A few ideas came out of conversations after this talk. MOGUL validation
sounded like a nice idea (it's a "knowledge-based" extension of the
limited set of geometry checks that platon can do). The problem is
identifying from the input molecule the geometries and bond types against
which to launch the search. Since CRYSTALS does this, I thought we might
see whether CRYSTALS can be hooked into checkCIF to perform just these
checks. Richard thought this was do-able, but would involve a fair amount
of code manipulation, and he thought that it would be as easy for us to
write something ourselves. But the rub comes in the bond typing -
apparently CRYSTALS uses Sam Motherwell's code to do this. Now, it
transpires that Sam's algorithms have been reimplemented (and improved) 
in the C++ toolkit which CCDC are planning to release along with an
API to developers next year. That may be a good time to start thinking
seriously of adding such a check to our suite. Ian Bruno (who is a very
farsighted programmer at CCDC) will let us have a copy of the next MOGUL
release.

CCDC is also interested in offering checkCIF-style validation alongside
the enCIFer editor. I spoke to Greg Shields about the possibility of
building a user-friendly package that included enCIFer and checkCIF as a
standalone program. It seems a nice idea in principle, and CCDC would
probably have the expertise to write a wrapper to call the various
programs from within a nice cross-platform GUI. What's more problematic
is the very different programming languages used; here again an API would
help in exchanging data between the component programs. We chatted briefly 
about a "generic" API for crystallographic applications, that abstracted
concepts such as molecules, cells, bonds etc: the CIF dictionaries could
be used as a seedbed for specifying many of the necessary objects. At the
moment it seems very much a pipe dream, but the rise of open-source
collaborative toolkits both emphasises the usefulness of such an approach
and begins to address the knotty problem of maintainability of
crystallographic algorithms once individual programs and programmers
retire.

Talk 3 - Kirsty Anderson: Publishing crystallography in chemical journals
-------------------------------------------------------------------------
Kirsty is the in-house Crystallographic Data Editor for journals of the
Royal Society of Chemistry. She sees her role as validating chemical
results drawn from crystal structures. It seems to me that there is a
danger in this emphasis that plausible chemistry may be passed on the
basis of a consistent (but wrong) crystal structure model, but I'm sure
that the crystallography is actually checked adequately. Checking is
done by arrangement with CCDC and through use of local copies of software
such as platon and Mercury.

RSC earlier asked us for permission to point authors to our checkCIF
service, although RSC declines to sponsor checkcif.iucr.org. It would also 
appear that the Crystallographic Data Editor wishes to outsource the
checking to CCDC rather than using the tailored checkCIF service that we
provided for Dave Bardwell at RSC (very likely because of workload).

There is no doubt, however, that RSC journals will often accept minimal
crystallographic standards so long as the structure is not obviously wrong 
or contradictory to the indicated chemistry. It is likely that checkcif
would be seen as excessively demanding for validation at this level.

Talk 4 - John Davies: Unpublished structures
--------------------------------------------
John runs a very productive service crystallography unit at Cambridge
University. He made the point that despite the efficiency of journals such 
as Acta E, it is likely that there will always be large numbers of
structures that will never be candidates for formal publication - if it
takes even a day to prepare a structure report for Acta E, he would prefer 
to spend the time determining another three or more structures.

Again, the fate of a structure will depend on its intrinsic interest. John 
in fact publishes reasonably frequently in Acta, but as part of his
service he might determine the structure of dozens of reaction
intermediates. These are of no especial interest to the commissioning
chemist; have been refined only to the extent needed to satisfy the
chemist that his research is on the right track; do not individually merit 
a detailed write-up; yet are valuable data points within a comprehensive
collection such as the CSD.

At present, each solved structure is registered and stored in a local
version of the CSD database. This makes transfer to the CSD trivial from a 
technical viewpoint; but such transfers depend either upon publication, or 
upon the agreement of the researchers who commissioned the structure to
deposit it as a private communication. There is a problem with chemists
who commission such structures and then lose interest or can no longer be
contacted. Without their permission, there is no way to make the structure 
public.

In his talk John argued for an alternative means of publication. Ideally
there would be some external database, accessible over the web, to which
he could transfer all his finished structures. Each structure to be deposited 
in such a public database should have full hkl data; and there should be
evaluation software freely available to check the quality of the
structures deposited, acting on the experimental data. This collection
should have sufficient merit that deposits could be credited towards
crystallographers' career development, and acceptance into the database
may require passing some quality threshold. But the structures would not
be subject to human refereeing. He would also seek from the commissioning
chemists agreement that if structures were not published within a certain
time (say 3 years), they could be transferred to this depository without
other formality.

Talk 5 - Peter Murray-Rust: e-Science in crystallography and chemistry:
-----------------------------------------------------------------------
CIF and CML
-----------
Peter began by praising the CIF development effort, but warned that
crystallographers ignore current major software developments at our
peril. Bioscientists are concerned that crystallographic information is
not available in the way that they want it (i.e. XML). He urged the
abandonment of the CIF formalism in future informatics developments.

He argued that the democracy of the web undermines the value of the
traditional closed databases with bespoke query languages, and that
crystallographic laboratories should host their own data collections on
peer-to-peer servers suitable for mutual discovery and harvesting of
content. He also pressed the case that CIFs such as those available as
supplementary documents on the RSC website should be considered as
belonging to the community.

Talk 6 - Frank Allen: The future of crystallographic publication
----------------------------------------------------------------
CCDC continues to produce innovative research tools based on the contents
of the CSD. MOGUL is a derivative database of molecular fragment
geometries, and CCDC plan to release an API to allow developers to
provide direct interaction with MOGUL (and ultimately other components of
the CSD) through their own applications (see above).

CCDC remains concerned that not all solved structures are finding their
way into the public domain and into the CSD - the anticipated "explosion"
in structures from area detector technology has still not materialised,
although the number of published structures still grows at the previous
exponential rate. It's likely that John Davies' approach to publication
goes some way towards explaining that.

CCDC are considering ways to ensure that material deposited privately with 
them be placed in the public domain if there has been no publication
within a certain time span. They are also interested in the direct
automated harvesting of structures for direct incorporation in CSD.


Chairman's interventions
------------------------
Mike Hursthouse, who was chairing the afternoon session, commented that 
work he was involved in with UKOLN had the potential to address several of 
the issues of concern. From subsequent inspection of his group's research
proposal (http://www.ukoln.ac.uk/projects/ebank-uk/docs/bid/bid.pdf) it
becomes apparent that the project involves using the Southampton
University eprints.org software as the basis for a distributed preprint
server-cum-data repository.

One way in which this might work is that structure reports are deposited,
with an accompanying write-up in preprint format. The availability of such 
preprints allows the possibility of community-based review, and the
possible development of preprints which receive much attention into
fully-fledged publications. It's an interesting development for chemists,
but it doesn't entirely answer the needs of John Davies and CCDC for
public access and incorporation of unpublished data sets in isolation.


General discussion
------------------
There was a broad discussion on the topics arising from the meeting,
during which a number of concerns surfaced:

  1. Intellectual property rights. There is consensus that it is
     inappropriate for journals or databases to claim copyright over
     individual data sets. However, the issue of ownership of IPR is not
     clear cut. If a structure determination is carried out by a 
     crystallographer on behalf of a chemist as part of a University
     department funded publicly (or perhaps also from other sources),
     who has the ownership of the data set and who can dictate how that
     data set is disposed of?

  2. How to ensure that the service crystallographer receives due credit
     and career development?

  3. Quality of a reported structure. How to assess it? What is
     "acceptable"? - many structures are solved only to the extent
     necessary to achieve a chemical purpose.

  4. It became clear that many speakers' experiences and proposals were
     very UK-centric. Understandable at a BCA meeting, of course, but
     we do need to understand how different initiatives informed by
     different national or regional philosophies will best serve the
     international nature of science.

Random thoughts of an observer
------------------------------
1. The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
certainly offers a technical solution to the location and harvesting of
metadata concerning distributed data sets. It is employed in Hursthouse's
e-Bank project; it could link together individual laboratory collections
like John Davies', as well as larger-scale activities such as The
Reciprocal Net and the Crystallography Open Database.
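
To give a flavour of how lightweight the harvesting side is, here is a
minimal sketch of an OAI-PMH client step using only the Python standard
library. It builds a ListIdentifiers request and extracts identifiers
and datestamps from the XML response. The verb, the parameter names and
the XML namespace come from the OAI-PMH 2.0 specification; the function
names are my own, and any repository address passed in is a placeholder
rather than a real laboratory server. Fetching the URL and following
resumption tokens are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListIdentifiers request URL for a repository base URL."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        records.append((identifier, datestamp))
    return records
```

A harvester would call list_identifiers_url() for each known laboratory
server, fetch the result, and pass the response text to parse_headers()
to learn which data sets are new since its last visit.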

2. To my mind the great danger of the distributed approach is that there 
is no guarantee of long-term preservation of (and access to) the data. It
would be helpful if some central authority could act as a mirror of the
content offered by these distributed servers. Or, if that's unacceptable,
that there is at least a federation of regional mirrors able and willing
to interoperate and ideally provide mutual backup. It is implicit that I
see this as specific to the field of crystallography; other scientific
domains would make their own arrangements (but perhaps might be inspired
by a well-coordinated effort in one field).

3. Peter M-R's criticism of the closed database is too harsh. While
open-source collaboration is very attractive and potentially very powerful
(a la Linux), it is unlikely that the functionality and intellectual
capital invested in the CSD over the last 4 decades could be reproduced on
any sort of short timescale by a de novo open-source project. More useful
is the prospect of making available public APIs to the databases so that
add-on applications may be developed in the community. The MOGUL API is an
indication that CCDC would be willing to go down this road. Development of
a general "API" to database contents (i.e. a standardised meta-query
language appropriate to the contents of crystallographic databases) would
be useful not only in enabling common queries to be posed to the public
databases, but in extending the domain of inquiry to distributed data
repositories built on different architectures. Identification of
appropriate elements of such a generic metaquery language could start with
the existing domain ontology (i.e. from the CIF dictionaries).

4. Likewise, Peter M-R's enthusiasm to replace CIF by XML is
misplaced. The bioscientists have jettisoned mmCIF because they are
impatient, and ad hoc XML representations better suit their short-term
needs for data exchange. There is an increasing rush towards
machine-generated taxonomies and domain ontologies; but what is clear from 
the discussions of the CIF dictionary committees (and in part the reason
they operate so slowly) is that unambiguous definition of physical
concepts for electronic computing machines is a very difficult thing
to achieve. Short-term exchange between consenting programmers does not
guarantee extensibility and long-term fitness for purpose.

5. The structural genomicists are reasonably successful in their exertions 
because the field is new, well focused and well funded. However, it's
worrying that for all the public money going into the subject, there's
little or no emphasis on the need to preserve partial results in a
well-characterised form for future study.

6. Frank Allen made the point that despite the apparent size of the CSD,
individual structural or chemical motifs can be very sparsely
represented. There is still a very strong need to accumulate data points
to cover all of chemical/conformation space. However, there is also the
need to understand the weight due to an individual data point in the
collection. It is clear that economic demands will prevent the fullest
possible precision in every structure experiment. Surely objective
weighting factors can be devised based on analysis of the structural
models - and, so far as possible, of the reflection data too - that can be 
used as metrics to assess the trustworthiness of a deduction made from a
population of database entries.
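
As a minimal sketch of what such weighting could mean in practice,
consider averaging one geometric parameter (say a bond length) over a
population of entries. The example below uses the textbook
inverse-variance weighting, taking each entry's reported standard
uncertainty as the only quality indicator. That choice is my assumption
for illustration; weights derived from analysis of the structural model
or the reflection data would slot into the same place.

```python
import math


def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard uncertainty.

    values: parameter values, one per database entry
    sigmas: standard uncertainty of each value (must be positive)
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    # Uncertainty of the weighted mean for independent measurements.
    return mean, math.sqrt(1.0 / total)
```

A precisely determined entry then pulls the population mean towards
itself more strongly than a quick service determination, without the
latter being discarded.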

7. It appears that there is still very limited availability in the public
domain of reflection data. Is this best addressed by renewed pressure on
publishers, by exhorting the databases to require structure factors among
their deposited materials, or by devising a distributed system linking
resources like The Reciprocal Net by open-access protocols? Note that the
old reluctance of publishers and databases to store structure factors
because of the bulk of the data is barely credible in these days of
terabyte PCs.


Brian
_______________________________________________
Epc mailing list
Epc@iucr.org
http://scripts.iucr.org/mailman/listinfo/epc
