Re: the dictionary merging protocol
- Subject: Re: the dictionary merging protocol
- From: Brian McMahon <bm@xxxxxxxx>
- Date: Tue, 16 Jul 2002 15:25:29 +0100 (BST)
Hi Doug

> I hope it is okay to make a few comments here about the dictionary
> overlay protocol as documented here:

I'm very happy that the community is discussing this proposal here.
Although it has been approved by COMCIFS, I see it as still very much
a pen-and-paper description, and I'd be much happier to see it tested
in an implementation. If anyone on the list has a working
implementation (or is interested in writing one) I'd be very
interested to hear about it.

> I hope we can draw a distinction between "valid" and "conformant"
> with respect to the encouraged CIF data_block tags:
> ...
> My understanding/definition of conformant is 100% or nothing. The
> slightest discrepancy at all means it is no longer conformant.
> With this definition the _audit tags above seem mislabeled, but I
> will continue here assuming the intended meaning is "valid".

OK, we may need to work on a precise and consistent terminology.
Perhaps the current core definition for the category:

   Data items in the AUDIT_CONFORM category describe the dictionary
   versions against which the data names appearing in the current
   data block are conformant.

would be better recast as:

   Data items in the AUDIT_CONFORM category describe the dictionary
   versions against which the current data block claims to be
   conformant. Individual data items may be validated against their
   matching dictionary definitions; the data block as a whole is
   conformant if all values for which there is a dictionary
   definition are valid according to that definition.
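By way of illustration, a data block making such a claim against more
than one dictionary might carry something like this (a minimal sketch
using the existing _audit_conform_ items from the core dictionary; the
dictionary names, versions and location are invented for the example):

   data_some_structure
   loop_
       _audit_conform_dict_name
       _audit_conform_dict_version
       _audit_conform_dict_location
       cif_core.dic   2.1   ftp://ftp.iucr.org/pub/cif_core.dic
       my_local.dic   0.9   ?

A validator would then assemble the working dictionary from the files
named in the loop before checking any values.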
Notice that there are levels of "validity": a _cell_length_a value of
12.763(1) may be "valid" in the sense that it has a numeric value in
the permitted range; but it may be invalid inasmuch as there is a
discrepancy between its value and a cell volume determined from it. So
there is consistency checking to be done. But it may also be that the
numbers are all consistent, yet just plain wrong - a better experiment
finds a rather different value. So "validation" is performed against
particular localised criteria. For our current purposes I'm defining
"valid" as obeying the constraints and relationships explicitly stated
in the dictionary. (So the current core dictionary can't catch the
cell length/volume discrepancy, because that relationship isn't
stated, at least in machine-readable form; but the work of the Perth
group will make that an achievable goal in future dictionaries.)

The fact that CIF has always allowed private data items means that a
data file can always contain items not in a dictionary, so I allow the
notion of "conformance" so long as there are no demonstrably invalid
data values, even if no conclusion can be drawn about items in the
file with no matching dictionary definitions.

> From the point of view of CIF validation, the proposed dictionary
> merging protocol looks functional enough. But the protocol itself
> seems to be a set of externally based informal rules designed to be
> hard-coded into validation software. The commands for specifying how
> to create/assemble a dictionary to which a given CIF data block may
> or may not be conformant (even though it may be valid) are actually
> embedded in the CIF, or passed to the validation software as
> arguments.
>
> There is no support therein for fine-grained control over how
> individual data items and/or category classes may be totally
> replaced by, or appended to from, the separate disparate
> dictionaries. It is an all-or-nothing approach.
>
> The currently envisaged dictionary construction mechanism does not
> yet permit specification of such PREPEND, APPEND, REPLACE
> modification attributes in the CIF data_block itself, so there is no
> way to retain this information across dictionary reconstruction
> invocations.

That is a fair criticism. I thought about how to carry along the
modification attributes in the CIF, but considered that it would
produce a much greater overhead in both the writing and reading phases
to get it right. If people think it's useful, I'm willing to revisit
the possibility.

> The recent discussion of the CIF specification indicates that in
> CIF1.1 dictionary-style save_ frames will be permitted in purely
> data CIFs, opening up the possibility of combined dictionaries and
> data. I am not sure if this is the direction things are intended to
> go, but it seems to me to be tooooo flexible for something that is
> supposed to be a purely data archival format.

One doesn't even need save_ frames in DDL1 applications, because the
dictionary definitions there live in data blocks. I see an analogy
here with SGML, where the DTD is usually an external file but can be
carried along (or modifications to a library DTD can be carried along)
within the file. My feeling is that the community doesn't want to
travel in that direction; this was reflected in the explicit
definitions of "data file" and "dictionary file" in paras 2.2 and 2.3
of the specification documents, and in the classification of
dictionaries as "external reference files" in para 3 of the semantics
document.

> It also seems counterproductive to the overall scheme of
> standardization because basically any CIF can create any dictionary
> it likes and say hey, I am valid against this (even if it doesn't
> conform).

Yes, there is a danger in that, but there is also advantage. For most
physical quantities, the CIF core dictionary is permissive in what it
considers "valid" - usually anything positive definite is allowed. But
for the editorial purposes of Acta Cryst., certain ranges of values
might be excluded, while another journal might insist on a different
range. The proposed approach allows each journal to layer its own
restrictive ranges on top of what is in the core. That's also an
argument against building too much specificity about validation
criteria into the metadata carried along within the data block:
different external criteria might be applied for different purposes.
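For example, a journal might maintain a small overlay dictionary along
these lines (a sketch only: the file name, version and limits are all
invented, and I'm assuming overlay semantics in which an attribute
supplied by the later dictionary replaces the same attribute from the
earlier one, while unmentioned attributes are inherited):

   # journal_ranges.dic : a hypothetical journal overlay dictionary
   data_on_this_dictionary
       _dictionary_name      journal_ranges.dic
       _dictionary_version   0.1

   data_cell_length_a
       _name                 '_cell_length_a'
       _enumeration_range    2.0:100.0

Merged in overlay mode on top of cif_core.dic, the tighter
_enumeration_range above would supersede the core's open-ended range
for _cell_length_a, and everything else in the core definition would
stand.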
So why is my dictionary "better" than yours? So long as the
dictionaries are retrievable for inspection, they can be compared and
criticised by independent reviewers. It would be expected (or at least
hoped) that the dictionaries sanctioned by the IUCr would carry, if
you will, a higher level of trust than others, but the essence of the
matter is to ensure that the dictionaries are public and open to
independent review.

So, to revisit your question of what were the guiding principles
behind this proposal, they included the following considerations.

1. Multiple dictionaries already exist (core, powder, msCIF, mmCIF and
others). It's important to have a way of addressing the several
dictionaries that might contribute to a data file. Of course,
everything could be brought into a single, increasingly large
dictionary, but keeping them separate facilitates distributed
authorship and management. Not in itself a compelling argument
perhaps, but a very useful thing to have in practice.

2. Dictionaries of private data names can be constructed and employed
for validation in the same way as in the public arena. So if your
local archive files for Xtal have lots of _xtal_ data names you can in
principle validate them with off-the-shelf software, without needing
to add your private data names to the public dictionary. Both of these
represent a sort of horizontal integration.

3. The desire to overwrite particular attributes in a public
dictionary for more specific validation purposes is addressed by the
overlay mode. If the previous cases were "horizontal", this is more of
a "vertical" integration. Of course Acta Cryst could write its own
validation routines to satisfy the Notes for Authors (and of course it
has); but it seems attractive to be able to carry out much of the
validation using generic dictionary-based tools. And it seems
attractive to be able to overlay a small change, such as the
modification of a single enumeration range sketched above, rather than
to have to make a complete copy of the official dictionary.

One thing to consider, of course, is that the generic "off-the-shelf"
validators I envisage will need to interact sensibly with specific
applications, and one might need to think about what types of error
codes or return values the validator should produce when invalid cases
are found. Perhaps numeric codes defined in a standard header file
with symbolic names like NON_NUMERIC, OUT_OF_RANGE, ILLEGAL_CODE ... ?
Of course now we are talking at an implementation level, and it's a
topic that is also relevant to the current thoughts about the syntax
specification. How should CIF parsers handle exceptions?
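To make that concrete, such a header might look something like the
following (purely a sketch of one possible convention; none of these
names or values is established anywhere):

   /* cifcheck.h -- hypothetical return codes for a generic CIF
    * validator. All names and values are illustrative only. */
   #ifndef CIFCHECK_H
   #define CIFCHECK_H

   #define CIF_VALID          0  /* value satisfies its definition */
   #define CIF_NON_NUMERIC    1  /* numb type, non-numeric value */
   #define CIF_OUT_OF_RANGE   2  /* outside _enumeration_range */
   #define CIF_ILLEGAL_CODE   3  /* not in the _enumeration list */
   #define CIF_NO_DEFINITION  4  /* no matching dictionary entry
                                    (not an error; see "conformance"
                                    above) */

   #endif /* CIFCHECK_H */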
That's all I have time for at the moment, but there are other
interesting thoughts in Doug's messages that I would like to follow up
later. It would also be interesting to hear other views from the
community about the perceived usefulness or otherwise of this
protocol. Since we have survived for a decade without it, it may not
be of critical importance. On the other hand, I see it as potentially
having substantial impact on the way we develop and use dictionaries
in the future.

Best wishes
Brian