Re: Opinions on comments as part of the content
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- Subject: Re: Opinions on comments as part of the content
- From: peter murray-rust <pm286@cam.ac.uk>
- Date: Wed, 07 Mar 2007 07:48:45 +0000
- In-Reply-To: <45EDB89B.20907@niehs.nih.gov>
- References: <45EDB89B.20907@niehs.nih.gov>
At 18:53 06/03/2007, Joe Krahn wrote:

Thanks for this topic - it has concerned us in writing CIF parsers.

The first observation is that CIF does not define an abstract data model (e.g. the Infoset in XML), so it is difficult to agree on what a parser should do other than confirm validity against the CIF standard. (An analogy is early XML parsers, whose only required output was "valid" or "invalid".) I suspect that each parser writer has created their own data model. It would be extremely valuable to develop such a model for CIF.

We have written a CIF parser (CIFDOM) which parses CIFs into an abstract data model which can be exposed in XML syntax and conforms to the Document Object Model (DOM). In doing this we have had to make various interpretations of the standard, while trying to retain the goodwill of authors and readers. We have parsed ca. 80,000 CIFs (standard "small molecule", all DDL-1, core dictionary, no mmCIF, no images, etc.). We apply the following from the standard:

* within a CIF the order of the data blocks is arbitrary and changing it does not alter the data model
* within a data block the order of the items and loops is arbitrary and changing it does not affect the data model
* white space between CIF tokens (e.g. between item name and item value, between items and loops, and between loop names or loop values) can be normalised to a single space or any other conformant white space string. This may surprise and upset authors who expect the pretty printing to emerge from a parser, but the standard does not require it and it is difficult to do.
* the quoting mechanism for values can be changed or normalised. For example 'foo' can be normalised to foo. It may not always be clear how many line-ends should be preserved in semicolon values or whether a single-line semicolon value could be translated to a quoted string.
* duplicate item names are not allowed
* all CIF names can be case-normalised (e.g.
  H-M can become h-m)
* duplicate data block ids are not allowed

I would be grateful to know if any COMCIFer has a different view of these.

If these are accepted then comments can be reordered within blocks. Many comments are created on the assumption that they attach to the following CIF item or loop, but a parser need not (and in principle cannot) preserve this implicit semantic. There are no such things as inter-block comments, except that any comments preceding the first block can be identified as not belonging to any block. These can be reordered. It is therefore legitimate (if unpretty) to assemble all comments within a block together and sort them into arbitrary order; the same can be done for the non-block comments.

>It seems that some CIF parsers retain comments.

Only if the parser has a data model which can be inspected or output.

> Are there people using
>comments to hold pertinent information? If so, has there been any
>attempt to add a general purpose comment data items? My thinking is that
>the only comment that should have valid information is the CIF header
>comment,

Does this mean one or more comments before the first block? I don't think the standard defines a CIF header comment. This is one of a small number of topics which could benefit from clarification (and in some cases an arbitrary ruling):

* data blocks. Is the value of the data block id case-sensitive? Are data block ids which differ only in case identical and therefore illegal? Is it allowed to have an empty string as id, or any mixture of non-whitespace CIF chars (e.g. punctuation only)?
* data_global. This is so widespread that it would be useful to have at least an agreed heuristic for it.
* multi-data-block CIFs. Is it legitimate to split them? If so, can/should data_global be copied into each?
* what are the semantics of '?' and '.'? Is it legitimate to delete an item of the form:
    _foo ?
  or does it convey information?

>and all the rest can be stripped.
>Are there any opinions that
>comments are important to retain?

P.

Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069
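
As an illustration of the normalisation rules listed above, here is a minimal Python sketch of a data model in which block order, item order, name case, quoting and comment order do not affect equality. It is not CIFDOM and not a CIF parser: it assumes tokenising has already been done, it ignores loops and semicolon values, and the names Block, normalise_name, normalise_value, build_block and build_cif are all hypothetical. Data block ids are compared as written, because their case-sensitivity is one of the open questions above.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One data block: items keyed by normalised name, plus detached comments."""
    block_id: str
    items: dict = field(default_factory=dict)
    comments: list = field(default_factory=list)


def normalise_name(name):
    # Item names are case-normalised, so ..._H-M and ..._h-m are the same name.
    return name.lower()


def normalise_value(raw):
    # Normalise the quoting mechanism: 'foo' and "foo" both become foo.
    # Assumption: semicolon-delimited values are not handled in this sketch.
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        return raw[1:-1]
    return raw


def build_block(block_id, pairs, comments=()):
    """Build a block from already-tokenised (name, raw value) pairs."""
    block = Block(block_id)
    for name, raw in pairs:
        key = normalise_name(name)
        if key in block.items:
            # Duplicate item names are not allowed (checked after case folding).
            raise ValueError("duplicate item name: " + name)
        block.items[key] = normalise_value(raw)
    # Comments are not attached to items, so their order carries no meaning;
    # sorting them gives one arbitrary but reproducible order.
    block.comments = sorted(comments)
    return block


def build_cif(blocks):
    """Collect blocks into a mapping keyed by id; block order is not kept."""
    model = {}
    for block in blocks:
        # Assumption: ids are compared exactly as written (case-sensitively).
        if block.block_id in model:
            raise ValueError("duplicate data block id: " + block.block_id)
        model[block.block_id] = block
    return model
```

Two inputs that differ only in item order, name case, quoting style or comment order then compare equal, because Python dict comparison ignores insertion order and the comments have been sorted.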