Re: CIF Infoset
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- Subject: Re: CIF Infoset
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sat, 4 Sep 2004 09:29:10 -0400
- In-Reply-To: <Pine.LNX.4.44.0409041834570.9129-100000@owari.msl.titech.ac.jp>
- References: <Pine.LNX.4.44.0409041834570.9129-100000@owari.msl.titech.ac.jp>
I suspect this discussion is starting to sound like "how many angels can dance on the head of a pin" to some people. My apologies. For most people this discussion _is_ irrelevant. If you are simply preparing a small-molecule CIF for submission to Acta using the tags needed for journal publication, you don't really need to do anything different from what you are doing.

However, there are some people trying to understand whether they should be using CIF or XML or something else as a general data framework for the creation of other sorts of documents. For them it is very important to understand the interaction between "normal" CIF files that use tags from some standard CIF dictionary and documents using "made-up" tags that are not (yet) in any official, agreed dictionary. It is also important for them to understand whether they can fiddle a bit with the standard CIF rules (such as order-independence) and do something different, and still have a CIF that other people might be able to handle.

Let us begin with some basics:

1. Implicit sub-text question: Shouldn't we really be using XML? After all, everyone else is doing it. It has well-defined "name spaces", does not impose order independence, etc.

Everyone is entitled to their opinion, but as a practical matter, if what you are doing is populating databases, it is fairly easy to move back and forth among CIF, XML and various internal database representations. If what you are doing is marking up documents for publication, XML is easier to use, but if you don't have the equivalent of the discipline imposed by CIF (order independence, well-defined tags) you are going to have to re-invent it for documents that need to populate standard templates, such as structure reports.

Any CIF document has a simple, direct translation into an XML document.
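As a rough illustration of the CIF-to-XML direction, here is a minimal Python sketch. It assumes a data block has already been parsed into a plain dictionary; that in-memory layout and the XML element names (datablock, item, loop, row) are invented for this example and are not part of any agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of one data block: single tag-value pairs
# plus loops stored as (tags, rows). This layout is an assumption made
# for the sketch, not the output of any particular CIF parser.
block = {
    "name": "example",
    "items": {"_cell_length_a": "5.432"},
    "loops": [
        (
            ["_atom_site_label", "_atom_site_fract_x"],
            [["C1", "0.1250"], ["O1", "0.3750"]],
        )
    ],
}


def block_to_xml(block):
    """Translate one data block into an XML element tree."""
    root = ET.Element("datablock", {"name": block["name"]})
    # Tags are order-independent, so sorting them loses no information.
    for tag in sorted(block["items"]):
        item = ET.SubElement(root, "item", {"name": tag})
        item.text = block["items"][tag]
    # Each loop becomes a table: one <row> per loop packet.
    for tags, rows in block["loops"]:
        loop = ET.SubElement(root, "loop")
        for row in rows:
            row_el = ET.SubElement(loop, "row")
            for tag, value in zip(tags, row):
                cell = ET.SubElement(row_el, "item", {"name": tag})
                cell.text = value
    return root


xml_text = ET.tostring(block_to_xml(block), encoding="unicode")
print(xml_text)
```

Every tag keeps its full CIF name as an attribute, so nothing about the tag's identity is lost in the translation.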
Going the other way is harder because arbitrary XML documents do not necessarily have clearly defined tags, may need to have order preserved, and may need to have complex trees translated into tables. Doing these things is not difficult. However, we need to have discussions like this one to help the community come to agreement on how best to do that.

2. Is order-independence in CIF really necessary?

Yes and no. Clearly one could define a CIF-like language that differed from CIF in allowing multiple uses of the same tag and in requiring the ordering of tag-value pairs to be preserved. However, that would then make the publication of structure reports more difficult. So, in creating CIFs for use in the publication process, order-independence is a good idea. In addition, when you do need to provide order-dependent data, such as atoms in a particular sequence, all you need to do is to add a column to your table that contains the ordinal of each row. That may seem like a nuisance, but it is easy to hand off that nuisance to a bit of software, and most crystallographers are fairly adept at dealing with numbers anyway.

That being said, I for one think we should have a flavor of CIF that allows XML-like order dependence and repeated tags, and an agreed protocol to translate between order-independent CIFs and order-dependent CIFs.

3. What do multiple data blocks really mean?

Within a given data block a given tag may only be used once. It is legal to use that same tag again in another data block. This is a convenient way around the order independence in CIF, but in practice, if a CIF represents a single structure report, you are not going to want to do that, since it would produce confusion as to which version of the data for which tag you wish to include in your paper.

For example, you might have prepared your biblio in one data block and your coordinates in another.
You may have only one common tag between the two data blocks -- something to help you keep the two data blocks associated with the same study -- but except for removing the duplicate tag, you would have the same information if you merged the two data blocks in any order, including shuffling them together. Alternatively, you might take some huge data block and break it up into several smaller ones to help make a neater, more readable and organized file.

Now to the namespace/dictionary question.

I. David Brown has pointed out that CIF has long had a set of tags for specifying dictionary conformance, and has suggested that we should require more formal use of those tags to help readers understand what namespace is being used. DDB then asks about the precise mechanisms for using these tags in CIFs with multiple data blocks and what to do when multiple dictionaries are involved, especially in deciding the order in which to apply the dictionaries to avoid conflicts.

I must emphasize that for most people this is not an issue. Even if you are drawing tags from multiple CIF standard dictionaries, your document is remarkably unambiguous because, for the official dictionaries, COMCIFS works to avoid overlap and duplicate use of the same tags, or the use of two different tags for the same concept. The major exception is the replication of the Core dictionary in the mmCIF dictionary using slightly different tags. The two dictionaries are kept aligned with an "alias" mechanism, and most users do not have to worry about a conflict.

For people working with their own locally-defined tags, however, this is an interesting question. There is a detailed protocol for "layering" of dictionaries (including ones created locally) ( http://www.iucr.org/iucr-top/cif/spec/dictionaries/maintenance.html ). Clearly it is time to instantiate this protocol in software, and, as part of a software upgrade project for the IUCr, we will be doing that.
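To give a feel for what layering could mean in practice, here is a minimal Python sketch. It is not the protocol at the URL above; it assumes only the simplest possible rule, that a definition in a later dictionary replaces a definition of the same tag in an earlier one. The dictionary contents and the tag _mylab_sample_code are hypothetical.

```python
from collections import ChainMap

# Hypothetical, heavily abbreviated dictionaries: tag -> definition text.
official = {
    "_cell_length_a": "official definition of _cell_length_a",
    "_cell_length_b": "official definition of _cell_length_b",
}
local = {
    "_cell_length_a": "locally tightened definition of _cell_length_a",
    "_mylab_sample_code": "local definition of a made-up tag",
}

# Layers are listed lowest first; each later layer overrides earlier ones.
# This simple "last one wins" rule is an assumption of the sketch.
layers = [official, local]

# ChainMap consults its first mapping first, so the list is reversed to
# make the last-listed dictionary take precedence.
effective = ChainMap(*reversed(layers))

print(effective["_cell_length_a"])  # taken from the local layer
print(effective["_cell_length_b"])  # falls through to the official layer
```

The point of the sketch is only that the answer depends on the order of the layers, which is why that order has to be stated somewhere.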
It would be very helpful if those who have an interest in this subject would read the protocol and provide their comments and suggestions for improvement, so that the software we are writing will be useful to as many people as possible.

Regards,
Herbert

At 6:43 PM +0900 9/4/04, ddb@owari.msl.titech.ac.jp wrote:
>Hi
>
>> Here are a few more comments from IDB:
>> >So how do you intend to get around this namespace issue? No CIFs that I
>> >have encountered have ever declared their conformance to any dictionary.
>> >Even if they did, there is something called the dictionary stacking
>> >protocol which allows those definitions to be overridden without
>> >declaring a namespace.
>> >On top of that there is the boundless capacity for making up your own
>> >data names on the fly for which there may never be any dictionary
>> >definition at all. How can you reliably assign anything but a generic
>> >namespace to an infoset? It's all just ad hoc guesswork.
>>
>> The core dictionary defines three items which can be looped:
>> _audit_conform_dict_name
>> _audit_conform_dict_version
>> _audit_conform_dict_location # Contains the URL where the dictionary can be found
>> As far as I know these have not been widely used - Acta Cryst. should
>> start insisting that these be included in submitted papers. There is no
>> need to give the dictionary version in anything as ephemeral as a comment.
>
>
>That sounds like a positive step, but would that go in every data_block or
>is it a global_ thing?
>
>You may need to add something like _audit_conform_dict_stacking_order
>to ensure looped dictionaries of symmetry overriding core don't get
>confused with core overriding symmetry, for example (assuming loop order
>is not significant?), if that is possible?
>
>The problem I see is that the effort invested in implementing it for all
>newly created and submitted CIFs is wasted because it is an
>incomplete solution and no current software uses it or needs it.
>
>You still have to deal with existing archives of CIF which don't state
>their conformance, and even for CIFs that do, users are free to
>conjure up any ad hoc data names they like and use them in any context.
>
>So, to try and resolve the namespace of each name, you would need to
>(1) check the _audit_conform list of dictionaries in reverse order
>(2) check against the list of registered prefixes for accidental matches
>(3) check all versions of all publicly accessible dictionaries
>(4) then give up.
>
>Not an efficient process if there was a match, and no guarantee that
>it was a correct match if names were reused in different
>contexts in different dictionaries. Two simple things would fix that:
>associating a distinguishable prefix on each name with the _audit_conform
>stuff, and banning ad hoc data names.
>
>Anything else and you will always be just guessing.
>I don't really know what you are hoping to achieve.
>
>>
>> ># start Validation Reply Form
>> >_vrf_DIFF020_114
>> >;PROBLEM: _diffrn_standards_interval_count and
>> >RESPONSE: ... We have used an image-plate system
>> >;
>> >
>> >If intelligent software was ever intended to deal with such _vrf_s, why
>> >embed the only pointer to their purpose in supposedly non-parsable data
>> >names rather than in looped, discrete sets of tags such as
>> >
>> >loop_
>> > _vrf_suite _vrf_subroutine _vrf_error_code _vrf_authors_response
>>
>> This would tidy things up, but the parser must be able to handle ad hoc
>> data names without choking.
>
>
>If it's important enough to create a name for it then isn't it important
>enough to define its purpose somewhere? Ad hoc data names seem to provide
>nothing useful besides a legitimate excuse for laziness in the
>specification. There's no incentive to organize things tidily.
>Maybe they were important originally when COMCIFS were exploring
>the field, before dictionaries were introduced, but is it still important
>to be able to make up arbitrary stuff and stick it in a CIF without
>definition?
>Who is doing this and how are they using it?
>Do they really intend to save it for posterity?
>
>
>
>> >>>>Q Is the order of "rows" in a loop_ unimportant?
>> >>>
>> >>>Yes (in CIF).
>> >>
>> >>That is very useful (and non-obvious from the spec). It then makes it
>> >>possible to confirm the identity of two sets of coordinates, symmetry
>> >>operations, etc.
>> >>
>> >>It is also debatable.
>> >>The very recent introduction of _symmetry_equiv_pos_site_id means that
>> >>the data integrity of the majority of prior archived CIFs containing tag
>> >>values like: _geom_bond_site_symmetry_1 "4_564"
>> >>would be seriously impaired by a change of order in the
>> >>loop_ _symmetry_equiv_pos_as_xyz
>>
>> This was a serious omission in the first version of CIF (you have to
>> remember that this was produced before we even considered writing
>> dictionaries in STAR format). As you point out we have introduced the
>> list reference _symmetry_equiv_pos_site_id (which incidentally has now
>> been superseded by _space_group_symop_id taken from the symmetry_cif
>> dictionary - a dictionary which takes a more systematic and
>> forward-looking approach to symmetry). Again Acta Cryst. should insist
>> on the inclusion of these id's.
>
>Would a statement of conformance to an older dictionary version be
>sufficient grounds to escape these CIF changes (just checking :-)?
>
>But I guess my original concern here was that order independence of loop_
>structures based on earlier, and possibly alternative, dictionaries, as
>well as ad hoc looped data (maybe that's not important, but you never
>know...), is not assured in general, particularly for raw data in
>whatever form it takes (nmr? image CIF?).
>
>
>> >I had a hazy recollection that "this is a string" and this_is_a_string
>> >were equally valid CIF constructs containing identical information
>> >content, used for example in space group names. Would they be formally
>> >identical in an infoset? Does the white space in all strings have to be
>> >normalised (is that the right word?)?
>>
>> We had a discussion of this point while preparing the symmetry_CIF
>> dictionary and came to the decision that these two strings were not
>> equivalent, i.e., underscore is not white space.
>
>Bummer. I know one program that needs changes made :-(
>
>But perhaps I could also draw your attention to this:
> http://journals.iucr.org/services/cif/stdcodes.html#Appdx4.3
>as evidence that underscores do seem to be an
>officially sanctioned form of white space in uchar data types.
>
>
>And maybe I can raise another issue: in the context of PMR's interest in
>data_global, would the following construct be legitimate?
>
>data_global
> _publ_contact_author_name "Fred"
>
>data_a
> _import_data_from_block global
>
># defined in an associated dictionary as:
>data_import_data_from_block
> _name '_import_data_from_block'
> _category obscure_semantics
> _type uchar
> _definition
>;
> Import all data from the named data_block into the current data_block
>Watch out for duplicate _data_element_names though!
>Also watch out for circular imports!
>;
>
>As far as I am aware there is nothing that restricts such semantics.
>Everything seems to be above board in terms of the CIF content.
>It's just that a request for _publ_contact_author_name from
>within data block data_a seems destined to fail at the software
>access stage. Does that mean CIF conformant software can never be
>totally CIF conformant?
>
>
>Thanks for the response.
>Doug
>
>_______________________________________________
>comcifs mailing list
>comcifs@iucr.org
>http://scripts.iucr.org/mailman/listinfo/comcifs

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================
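A postscript on point 2 of the message above: the ordinal-column idea can be sketched in a few lines of Python. The tag names _example_atom_ordinal and _example_atom_label are hypothetical, and the shuffle merely stands in for any software that does not preserve the order of rows in a loop.

```python
import random

# Hypothetical loop: the first column is an explicit ordinal, added so
# that the intended sequence survives any reordering of the rows.
tags = ["_example_atom_ordinal", "_example_atom_label"]
rows = [[1, "N1"], [2, "C1"], [3, "C2"], [4, "O1"]]

# Stand-in for order-agnostic software: return the rows in another order.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# The reader recovers the intended sequence by sorting on the ordinal.
restored = sorted(shuffled, key=lambda row: row[0])
labels_in_order = [row[1] for row in restored]
print(labels_in_order)
```

Because the sequence is carried as data rather than by position, the loop stays order-independent in the CIF sense while still describing an ordered list.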
- Follow-Ups:
- Re: CIF Infoset (Brian McMahon)
- References:
- Re: CIF Infoset (ddb)