RE: A formal specification for CIF version 1.1 (Draft)
- Subject: RE: A formal specification for CIF version 1.1 (Draft)
- From: "Bollinger, John Clayton" <jobollin@xxxxxxxxxxx>
- Date: Thu, 11 Jul 2002 19:22:26 +0100 (BST)
This is a combined response to two messages, both from Herb. I have cut and pasted together parts of the responses, and I hope I have not thereby taken anything out of context.

Herbert J. Bernstein [mailto:yaya@bernstein-plus-sons.com] wrote:
> On Wed, 10 Jul 2002, Bollinger, John Clayton wrote:
> > I think it unfortunate that the specification lumps together CIF dictionaries and CIF data files as CIF, considering that they are in fact slightly different STAR dialects. It furthermore seems like the spec has been tailored to allow this combination (by addition of save frames, at least), which I find a questionable strategy -- especially given that it did not really accomplish the apparent goal anyway (that apparent goal being to produce a single STAR dialect with which both the dictionaries and the data files could be expressed).
>
> What in particular is still missing to allow a common format for CIFs and dictionaries?

Syntax, section 5: "Save frames may only be used in dictionary files." The language for CIF data files is therefore a restriction of the language for CIF dictionaries. I find it a bit disingenuous to claim that they are the same. Yes, perhaps this is a picky point, but picky points are what specifications are all about. A non-validating CIF parser does not have to recognize the full language specified by the draft specification.

> > I do not see any point whatsoever to adding the stop_ keyword to the accepted CIF syntax. It is not necessary as long as CIF does not permit nested loops, so it only makes parsers more difficult to write. The question should be "why add it?" rather than "why not?"
>
> stop_ has always been a reserved word, so now, instead of recognizing stop_ and declaring an error in all cases, a parser is allowed to recognize stop_ and discard it in certain cases.
[and]

> I believe that this use of stop_ and save_ does not invalidate any previously valid CIFs, and is a realistic approach to dealing with these reserved words. Any validating CIF parser needs to have a module to read dictionaries, where it will encounter save frames. Any properly written CIF parser has to recognize stop_ to distinguish it from a data value. By making these changes in the specification, we are specifying a common practice (save frames), and saying that a use of a reserved word (stop_) in a context in which it clearly is not an error, should not be treated as an error.

As far as stop_ and save_ not breaking existing valid CIFs, I agree. As for their usefulness and propriety, however, I am not persuaded. Yes, a validating CIF parser must be able to read dictionaries, which use save_ and which therefore are written in a superset of the language for data CIFs. In a sense, adding save_ is then not a problem for non-validating parsers, because they may continue to reject the keyword as an error. I would prefer, though, to just acknowledge that the two languages remain different.

Stop_ is another story. First, from a language design perspective, stop_ is absolutely useless in CIF. A valid usage in a version 1.1 compliant CIF would not express one iota of information, because removing the stop_ keyword would in no way whatsoever change the semantic interpretation. Second, from a parsing perspective, it is much simpler to recognize "stop_" and then unequivocally issue an error than it is to evaluate the parser state every time a "stop_" is encountered to check whether it is legal or not, and if so to modify the state appropriately.

As for whether or not a particular usage of stop_ is an error, I would think that that was a matter dictated by the specifications we are discussing, not by how logical or how compliant with STAR the usage may be.
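To make the parsing contrast above concrete, here is a minimal Python sketch of the two strategies. It is illustrative only: the function names, the `state` dictionary, and in particular the rule that stop_ is tolerated only at the end of a loop whose packets are complete are assumptions made for this example, not text from the draft specification or from any existing CIF parser. Reserved words are compared case-insensitively here.

```python
class CifSyntaxError(Exception):
    """Raised when a token is not legal at the current point in the parse."""


def reject_stop(token):
    """Strategy 1: stop_ is reserved and never legal, so no state is needed."""
    if token.lower() == "stop_":
        raise CifSyntaxError("stop_ is not permitted in a CIF")
    return token


def accept_stop_in_loop(token, state):
    """Strategy 2: tolerate stop_ only where the (assumed) rule allows it.

    `state` is a hypothetical dict with keys:
      in_loop  - True while a loop_ is being read
      n_names  - number of data names declared in the loop header
      n_values - number of values read for the loop so far
    Returns None when stop_ is recognized and discarded.
    """
    if token.lower() != "stop_":
        return token
    packets_complete = (
        state["in_loop"]
        and state["n_names"] > 0
        and state["n_values"] > 0
        and state["n_values"] % state["n_names"] == 0
    )
    if not packets_complete:
        raise CifSyntaxError("stop_ is not legal here")
    state["in_loop"] = False  # the loop is closed; stop_ itself carries no data
    return None
```

The first function needs no knowledge of the parse at all, whereas the second must both consult and update parser state, which is the extra burden described above.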
I don't see that it is particularly relevant that using stop_ in the contexts the draft spec permits is consistent with STAR or that it is human interpretable.

> > And what about data values beginning with a substring matching a reserved word? (Paragraph 10) In CIF 1.0 it was reasonably clear that something like this applied to data_ because such a construct had its own semantics defined, but it was not clear that this was a general restriction applied to all the reserved words. Did I just miss it somewhere, or is this one of those points of 1.0 that is being clarified via the 1.1 spec? If the latter, then let me throw in that I don't like it. I think that's because it is a departure from the normal sense of the term "reserved word." In any case, it makes a parser that incremental bit trickier to write.
>
> CIF has always been presented as an application of STAR, so the reserved words have, in fact always been reserved, and it has always been the case the having a data value beginning data_ or save_ was incorrect. By applying exactly the same logic to the full set of reserved words, I believe we should make the design of most parsers cleaner and simpler.

Well, I don't think I agree that parsers for 1.1 would be cleaner or simpler by virtue of this language feature, but I'll withdraw my claim that they would be trickier -- so long as they don't have to support both the 1.0 spec and the 1.1 spec. This change has more potential to break existing CIFs than do most, but my biggest objection remains that this is not the behavior that I would expect when presented only with the claim that loop_, stop_, save_, data_, and global_ are reserved words.
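The difference between the two readings of "reserved word" discussed above can be shown with a short Python sketch. The function names are hypothetical, and the case-insensitive comparison is an assumption of this example; only the list of five words comes from the discussion.

```python
RESERVED = ("data_", "loop_", "save_", "stop_", "global_")


def is_reserved_word(value):
    """Conventional reading: only the exact words themselves are reserved."""
    return value.lower() in RESERVED


def has_reserved_prefix(value):
    """Stricter reading: any unquoted value *starting with* one of them is reserved."""
    return value.lower().startswith(RESERVED)


for candidate in ("loop_", "stop_sign", "global_minimum", "plain"):
    print(candidate, is_reserved_word(candidate), has_reserved_prefix(candidate))
```

An unquoted value such as `stop_sign` passes the first check but fails the second, which is the kind of existing data value the stricter reading could invalidate.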
If this feature is desired then the specification text should be changed to say something to the effect that strings starting with those substrings are reserved, and the language that calls out those particular instances of such strings as reserved words should be dropped or suitably marked as describing special cases.

> > What exactly is the point of introducing the square bracket delimiters for text values?
>
> It is more convenient to use than semicolon delimiters, and allows a handy nesting.

Okay, I'll buy that it's a convenience feature -- for CIF writers. It's an inconvenience for CIF parsers, but it can be handled. I'd like it better if it served a useful role that was not otherwise performed; see below.

> > In paragraph 17: "The end-of-line associated with the closing semicolon does not form part of the data value." Is this another change/clarification, or another published detail that had previously escaped me? I had thought that that last eol was part of the value.
>
> If you exclude the terminal <eol> from the text field, you then allow the semi-colon to quote arbitrary text fields, including those that do not have a terminal semicolon. If you do not exclude the terminal <eol> from the text fields, then the only text that can be quoted with semicolons is text that ends with a semicolon.

I'm sorry, but I don't follow that. I'm guessing you mean that including the <eol> as part of the delimiter enables quotation of strings that do not include a terminal <eol>. Indeed, I always thought that the exclusion of such strings was a quirk of the CIF language. I am certain that some of the earlier BNFs floated as candidates for a CIF BNF included the <eol> in the production for the quoted content, although I suppose that means little. This must be one of those cases that has always seen conflicting interpretations, but I think this may be the wrong level at which to discuss it.
If CIF is to retain compatibility with STAR, then it is the interpretation required by STAR that we must use. The 1994 STAR specification paper describes semicolon-quoted text as "a sequence of lines," with "lines" emphasized. To me that indicates that the trailing <eol> is part of the quoted material, not part of the delimiter. I observe, however, that the bracket-delimited quoting mechanism being introduced in the draft specification does fill the hole left by the interpretation of semicolon-delimited quoting that I am advocating.

> > In paragraphs 22 and 41: Exclusion of ASCII characters 11 and 12 decimal is a departure from and incompatibility with CIF 1.0. Not that I particularly object -- handling these appropriately is a pain.
>
> The second sentence of the abstract of the Hall, Allen, Brown paper says:
>
> "The CIF is a general, flexible and easily extensible free-format archive file; it is human and machine readable and can be edited by a simple text editor."
>
> It is not always possible to edit texts containing ASCII control characters other than HT with a "simple text editor". VT and FF serve to useful purpose in a CIF, and, as you note, they can be a pain to handle.

Did you mean VT and FF serve _no_ useful purpose in CIF? If you looked hard I think you might find people who would argue in favor of FF, at least, but I personally agree with you. My point was that this is another difference from CIF 1.0, and another restriction of STAR. Both facts should be documented.

> > In paragraph 29: the data name length restriction to 75 characters is another incompatibility with CIF 1.0 (as revised) where the data name length was restricted only indirectly by the line length restriction. Thus in CIF 1.0 data names could be 80 characters long.
>
> Actually, to allow a data name to be defined in a dictionary you have to allow it to appear with a prepended "data_" or "save_".
> In DDL1 dictionaries, the leading underscore of the data name is dropped, which has created a limit of 76 characters. In DDL2 the underscore is retained, which has create a limit of 75 characters. Thus the 75 character limit is simply a recognition of the implicit line length restrictions that had been in effect in the past, and helps to ensure that old systems will be able to work with these new names.

But CIF has never before restricted data names to only those that could be defined in a DDL1 or DDL2 dictionary. Moreover, with the increased line lengths in CIF 1.1, the dictionary storage problem should be alleviated anyway.

> > Paragraph 42 makes it optional to support line termination semantics different from the host OS'. That would be another departure from CIF 1.0, I think, and, in my opinion, an all-around bad idea if CIFs are supposed to be portable. As far as I can tell, the pseudo-production presented for <eol> is in fact the required implementation for a fully-conformant CIF 1.0 parser.
>
> If you are on a unix system, the pseudo-production is almost right for a "liberal-reader" CIF parser. It misses the case of a final line in a file which has not been terminated by "\n". If you are on a VMS system, or an IBM mainframe, the pseudo-production may be completely wrong for a CIF created locally as a text file. If CIFs are truly to be portable, it must be possible for someone on a non-Unix system (and non-Windows, non-Mac system) to work with them.
>
> > Paragraph 43: In combination with the formal grammar presented earlier, the definitions of the <eol> and <noteol> non-terminals in fact seems to _preclude_ CIF parsers from handling non-native line termination semantics. Even if that's not a departure from CIF 1.0, it's still a bad idea.
>
> We are not trying to preclude people from writing parsers which are liberal and able to read a wider range of CIF formats than those produced by the text editors of their own machines, but it would be unreasonable and impractical to insist that every parser be able to read every line format that ever has or will be invented. It is not even reasonable to insist that every parser be able to read some short list of non-native line formats. That would, for example, make Fortran-implemented parsers non-conformant on certain systems.

[and]

> > Regardless of whether the end of line handling is different in 1.1 than it was in 1.0, I think that those comments are a mischaracterization of the details of the draft 1.1 spec. As far as I can tell, what the spec now says is that CIF line termination is in fact machine dependent, and that an external utility must -- must! -- be used to convert a CIF from any foreign machine line termination convention to the local machine convention (if they differ) before a conforming CIF parser can successfully parse the file. I think this is exactly the wrong direction.
>
> The issue you raise is a fundamental one for many data formats. CIF has always been specified as an editable text format. This is not unusual for archival scientific data formats. You seem to be saying that you would prefer a binary format. In that case I would suggest the binary variant of CIF: CBF/imgCIF.

I would prefer the approach taken by PostScript and some other languages: <CR>, <LF>, and a <CR><LF> sequence are all accepted as line terminators. Support for systems that have record-oriented text files or different character encodings necessarily requires conversion in both directions, which I consider a separate issue altogether.

As I reread section 42 of the syntax document, I see what appear to be conflicting statements about line-termination handling.
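The preference stated above, treating <CR>, <LF>, and <CR><LF> alike as line terminators, can be sketched in a few lines of Python. This is only an illustration under that assumption; the name `split_cif_lines` is hypothetical and the code is not taken from the draft or from any existing parser. A final line with no terminator is still returned, which covers the unterminated-last-line case raised in the quoted text.

```python
import re

# <CR><LF> is listed first so that the pair counts as one terminator, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def split_cif_lines(text):
    """Split text into lines, accepting CR, LF, or CRLF as the terminator."""
    lines = _EOL.split(text)
    # A trailing terminator yields one empty string at the end; drop it so
    # that "a\n" and "a" both produce a single line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# A file that mixes all three conventions and lacks a final terminator.
print(split_cif_lines("data_x\r\n_a 1\r_b 2\n_c 3"))
```

Python's built-in `str.splitlines` is deliberately not used here, because it also breaks on VT, FF, and several other characters, which is a different rule from the one described above.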
In fact, the first two sentences seem to be inconsistent -- the first says that <eol> is the system-dependent end-of-line, and the second says that CIF follows the same convention as XML (complete with a quote from the XML recommendation, which is more or less along the lines of my stated preference above). A few lines later, the spec proposes a parser that recognizes exactly the line termination semantics I prefer, but this is at variance with the earlier definition of <eol>. Moreover, the quotation from the XML recommendation describes how the XML processor _translates_ end-of-line sequences to a standard (normalized) representation. Is that in fact what CIF 1.1 parsers will be expected to do? That would be fine by me, but other statements in this section of the draft seem to indicate that that is not the intent.

I find it particularly troublesome that at the end of section 42 the nature of software used to transfer CIFs is specified. Not only ought this to be beyond the scope of the specification, but it also seems to be unnecessarily restrictive. It says, for instance, that I may not move a CIF from a Win32 system to a Linux system by diskette. And what about my personal desktop, which dual boots Windows and Linux? May I not reboot without worrying that I have violated the CIF spec?

As for not being able to write parsers in Fortran that support line terminations different from the host OS', I say (1) Fortran is not an ideal language for this sort of thing; (2) it will be easier when Fortran acquires stream I/O in the next iteration of its standard; and (3) it CAN be done with Fortran 77, and I have the working code to prove it. (Works with both DEC/Compaq/Intel Fortran and g77 on Win32 and Linux, at least, and requires no language extensions that I am aware of.)

> > According to paragraph 60, a file containing only whitespace and comments but no data block is not a valid 1.1 CIF. That is another departure from CIF 1.0 if it is really the intent.
> > One of the ciftest trip files actually tests this case, in fact.
>
> This sounds like a good topic for further discussion. I for one would favor allowing such a file to be a CIF, but I am not certain what I would do with it.
>
> > Paragraph 61: this is another departure from CIF 1.0, which did allow data blocks without data items. Another of the ciftest trip files tests this case. (vcif evidently produces a warning, which seems reasonable, but this is not an error.)
>
> Yet another good topic for discussion.

These two cases are similar. A CIF with no data content would often be an error case for an application, but I prefer to let the application decide that, rather than enforcing it in the CIF spec. For the sake of discussion, I point out that the formal STAR grammar in the 1994 STAR specification paper recognizes a file without any data or global block as a valid STAR file, but requires a data block to contain at least one data item, data loop, or save frame, and requires a save frame to contain at least one data item or data loop. Oddly, however, a data loop with no data values can legally be the only content of a STAR data block or save frame, according to the grammar presented there.

> There is an open debate as to whether the production for <Tag> should be:
>
>     <Tag> ::= '_'{<NonBlankChar>}+
>
> or
>
>     <Tag> ::= '_'{<NameChar>}*

Well, the former is easier for an electronic parser because it has less context dependency. Yes, it allows nasty, ugly data names, but as far as I am concerned anyone who uses such deserves what he gets. That's the one I would prefer.

> > Also in the formal grammar, the productions for <SingleQuotedString><WhiteSpace> and <DoubleQuotedString><WhiteSpace> are ambiguous. What is intended, I think, is that the [...]
>
> Please read paragraph 58, where it says: "The <WhiteSpace> on the lefthand side must evalue to the same string instance on the righthand side and the parse must terminate on the first valid match reading left to right." If one uses a parser which accepts the first full-depth match in a left to right scan, the productions are not ambiguous, and are sufficient to define the quoted strings without having to defined digraphs.

I would much rather see this expressed in the formal grammar than (or in addition to) in the commentary. It would be clearer that way. (Evidently so, as I at first missed the part of the text that explains this.)

> > Moreover, is there any value to including the <Numeric> non-terminal and its children in the grammar at all? Anything that matches <Numeric> will also match <CharString>, so <Numeric> is not necessary to describe the language.
>
> CIF differs from STAR in paying attention to numeric items. The dictionaries control the semantics and help to resolve the ambiguities.

I realize that CIF has more extensive data typing than does STAR, and that CIF dictionaries can be used to resolve ambiguities. My point is to question whether it is necessary or useful to ambiguate and expand the formal grammar by including the <Numeric> non-terminal. My current opinion is that the information conveyed by those productions is more appropriate for the description of language semantics.

John Bollinger
jobollin@indiana.edu