Discussion List Archives

Re: Revised draft of CIF 1.1 syntax document

  • Subject: Re: Revised draft of CIF 1.1 syntax document
  • From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
  • Date: Tue, 24 Sep 2002 16:10:00 +0100 (BST)
There are two issues here that would seem to need further discussion.  The
first is this concept of including a particular new line in a text field,
and the other is whether or not to allow d (or even q) as an alternative
for e in a numeric field.

Please understand that while it is appropriate to speak of lines in a CIF,
it will cause a great deal of machine dependent trouble if we require the
inclusion of a "newline" within the definition of a line.  The terminator
of a line is a _very_ machine/system dependent concept.  A
semi-colon-delimited text field consists of some number of lines, some, or
all of which may be "empty" (i.e. with an indeterminate amount of trailing
white space on the line).  Consider the fields

;
;

; This is the first line
;

; This is the first line
  This is the second line
;

In the first case we have a text field with no text, i.e. one very empty
line.  In the second case we also have a text field with one line, but it
is not empty.  In the third case we have two lines.  The obligation of a
parser is to provide one empty line in the first case, one line with
text in the second case, and two lines with text in the third case.
If we have an event-driven parser in C, one compliant approach would be
to return an empty null-terminated string in the first case, and some sort
of flag for end of text field on the next event; to return the C-string
" This is the first line" for the first event in the second case, and some
sort of flag for end of text field on the next event; to return the
C-string " This is the first line" for the first event in the third case,
"  This is the second line" for the second event and then finish off with
some sort of flag for end of text field on the last event.  It would be
an equally valid approach to have a parser spew out the entire text field
as a series of lines, but be warned that some valid imgCIF files may
demand more memory from such a parser than it is likely to have available.
I do not think it would be appropriate for a syntax document to require
a parser to carry a particular line terminator in its internal response to
a text field.
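
A minimal sketch of the event-driven approach described above, written in
Python rather than C for brevity.  The names text_field_events, LINE and END
are hypothetical, and the input is assumed to be a sequence of lines already
split by whatever means the host system provides.  The platform's line
terminator is stripped, so it never appears in what the parser returns.

```python
from typing import Iterable, Iterator, Tuple

LINE = "line"                # event: one line of the text field
END = "end_of_text_field"    # event: the closing semicolon was reached


def text_field_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (LINE, text) for each line of the field, then (END, "")."""
    it = iter(lines)
    first = next(it, None)
    if first is None or not first.startswith(";"):
        raise ValueError("a text field must open with ';' in column 1")
    # The remainder of the opening line is the first line of the field.
    yield (LINE, first[1:].rstrip("\r\n"))
    for raw in it:
        if raw.startswith(";"):
            yield (END, "")
            return
        yield (LINE, raw.rstrip("\r\n"))
    raise ValueError("text field was not terminated by ';'")


for event in text_field_events(["; This is the first line\r\n",
                                "  This is the second line\r\n",
                                ";\r\n"]):
    print(event)
```

Because the events are produced one at a time, a caller never needs to hold
more than one line of a very large text field in memory.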

  This does, however, raise the question of the semantics of such a field.
Please remember that there is a strong ambiguity in CIF about lines and
the handling of trailing whitespace.  There is no way to tell if

;
;

refers to an intentionally empty single line or a single line with
79 blanks in it.  It would be a major change to CIF, one that would break
Fortran applications, to remove this ambiguity by somehow "seeing" invisible
"\n" characters at this late date.  The best we can do is to require
applications to act as if they had trimmed trailing whitespace from all
lines within a CIF before making decisions which depend on such details.
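
As an illustration only, the "act as if trimmed" rule might look like this
in Python.  The helper names trim_trailing and same_text are hypothetical,
and treating blanks, tabs and line terminators as the whitespace to trim is
an assumption of this sketch, not a statement of the specification.

```python
from typing import Sequence


def trim_trailing(line: str) -> str:
    """Drop trailing blanks, tabs and any line terminator from one line."""
    return line.rstrip(" \t\r\n")


def same_text(a: Sequence[str], b: Sequence[str]) -> bool:
    """Compare two text fields line by line, ignoring trailing whitespace."""
    return [trim_trailing(x) for x in a] == [trim_trailing(y) for y in b]


# An empty line and a line of 79 blanks are indistinguishable after trimming.
print(same_text([""], [" " * 79]))
```

Leading whitespace is left alone, so " text" and "text" still differ.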

With respect to e, d, and q, for many years CIFtbx has recognized all of
the above as valid in numbers, since people use them.  I would suggest
a permissive approach of:

  Reserve e,E,d,D,q and Q style numbers all to be used only in writing
numbers, not in writing unquoted text (to avoid misrecognizing existing
CIFs), but encourage a transition to writing CIFs just using e or E.
Thus 1Q3 would not be a valid text string without quotation marks.
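
A rough sketch of such a permissive number recognizer, in Python.  This is
not the CIF grammar and not the CIFtbx code: the pattern and the name
parse_number are assumptions made for illustration, and standard
uncertainties in parentheses are deliberately left out.  Note that d and q
exponents are simply read as e here, so any extra precision they might
imply is not preserved.

```python
import re
from typing import Optional

# Optional sign, digits with an optional decimal point, then an optional
# exponent introduced by e, d or q in either case.  An exponent marked only
# by a bare sign (as in 12-14) is not accepted.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eEdDqQ][+-]?\d+)?")


def parse_number(token: str) -> Optional[float]:
    """Return the value of an unquoted token if it is a number, else None."""
    if _NUMBER.fullmatch(token) is None:
        return None
    # Normalize the d/D/q/Q exponent markers to the e form before converting.
    return float(re.sub(r"[dDqQ]", "e", token))


for token in ("1e5", "2.5d-1", "1Q3", "12-14", "abc"):
    print(token, parse_number(token))
```

Under this rule a token such as 1Q3 is always taken as a number, which is
why it would need quotation marks to be used as text.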

  Regards,
    Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 020
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Tue, 24 Sep 2002, Brian McMahon wrote:

> Following the latest round of discussions during this review, I intend to
> make the following changes to the draft specification document. It may take
> a little while to implement them, but input is welcomed on the proposals in
> the meantime.
>
> 1. Remove the STAR *use* of stop_ as a loop or loop header delimiter, but
>    retain it as a reserved word.
>
>    Reason: while there is some support for using stop_, there is even more
>    passionate argument against it. I am particularly persuaded by Brian
>    Toby's comment that adding a feature to CIF that is formally
>    unnecessary will have the practical effect of giving people more ways
>    to break the files.
>
> 2. Change the definition of semicolon-delimited text values to *include* the
>    terminal newline.
>
>    Reason: I am swayed by John Bollinger's reference to the STAR emphasis on
>    lines of text. For those to whom it matters, the semicolons allow a
>    distinction to be drawn between inline and line-delimited
>    strings. Specific applications may if desired choose to elide the terminal
>    newline - I see now that effectively that is what ciftex does.
>
>    However, there is a possible way of excluding the terminal newline, which
>    I shall refer to in a different context below.
>
> 3. Amend the productions for number values to permit 1e5 as valid. Greg
>    Shields has pointed out to me that this was an error already flagged that I
>    had forgotten to correct.
>
> 4. Review the representations for floating point numbers in scientific
>    notation. I wish to exclude the version that permits an exponent to be
>    identified solely by an embedded +/- sign. (Reason: Greg has pointed
>    out that 12-14, which is often entered erroneously by authors intending
>    to specify a range of values, could be parsed as a (rather small) number.)
>
>    I would prefer to retain only the 'e' notation to express an exponential
>    with machine-independent precision.
>
>    Reason: if we retain the 'd' notation also, the assumption would be that
>    a distinction should be made between e and d in the manner of the IEEE 754
>    standard (referenced below) for floating-point representations. This then
>    makes rather specific statements about machine storage (and also raises
>    such questions as whether NaN should be included as a valid string value
>    for a floating-point number representation).
>
>    I am however troubled by the fact that Herbert is already using 'd', and
>    would like to know more about how it affects internal storage in his
>    applications. I am amenable to further debate on this point.
>
>       The IEEE Floating Point Standard (IEEE 754) is an IEEE standard, used
>       by many CPUs and FPUs, which defines formats for representing floating
>       point numbers; representations of special values (i.e. zero, infinity,
>       very small values (denormal? numbers), and bit combinations that
>       don't represent a number (NaN)); five exceptions, when they occur, and
>       what happens when they do occur; four rounding modes; and a set of
>       floating-point operations that will work identically on any conforming
>       system.
>
>       IEEE 754 specifies four formats for representing floating-point values:
>       single-precision (32-bit), double-precision (64-bit), single-extended
>       precision (>= 43-bit, not commonly used) and double-extended precision
>       (>= 79-bit, usually implemented with 80 bits). Only 32-bit values are
>       required by the standard, the others are optional. Many languages
>       specify that they implement IEEE arithmetic, although sometimes it is
>       optional. The C programming language for example allows but does not
>       require IEEE arithmetic. IEEE is commonly used in C where float
>       implemented IEEE single precision and double implements IEEE double
>       precision.
>
>       Also known as IEEE Standard for Binary Floating-Point Arithmetic
>       (ANSI/IEEE Std 754-1985) and IEC 559: "Binary floating-point arithmetic
>       for microprocessor systems.
>
>
> 5. Lastly, I want to introduce into the *semantics* document a protocol for
> escaping newlines in text fields (and in comments). This idea has been
> discussed before by COMCIFS members and has had a chequered history.
> Nevertheless, I think the current work that CCDC are doing on their
> CIF editor demonstrates again its usefulness. The idea is that for a text
> field or a comment line, a convention is introduced that allows the
> end-of-line to be escaped (i.e. ignored) and the text on the following line
> to be concatenated to the current line.
>
> Why is this useful?
>  - It allows one to preprocess a CIF 1.1 with long lines of text
>    (> 80 characters) and fold them into the 80-character limit of CIF 1.0
>    without loss of information. Thus the 'folded' file can be processed by
>    older CIF 1.0 software with 80-character line buffers. If needed, a
>    postprocessor can reconstitute the longer lines.
>  - Similar of processing can wrap text into still narrower columns
>    (sometimes needed even today as text file are autowrapped by certain
>    mailers to 72 characters or less).
>  - Even with the more generous line lengths in CIF 1.1, it may be
>    necessary to handle strings longer than 2048 characters (for example a
>    protein aminoacid sequence or very complex systematic chemical name).
>    The protocol then allows wrapping into the 2048 buffer.
>  - The CCDC editor may be required to import a text stream from a word
>    processor document where embedded newlines are not used.
>
> The arguments against this convention in an earlier round of discussion had
> to do with the burden of accommodating such additional processing within any
> CIF application. However, only applications that really need to handle (in
> some sense 'understand') the contents of text fields have to worry about
> this. For many applications text fields are not parsed for content, and can
> simply be passed through the application unchanged. Standalone utilities
> will be provided to perform the line wrapping or unwrapping.
>
> I shall send out a more complete description of the proposal separately,
> because it should be discussed in the context of semantics - the meaning of
> the content of a text field - rather than syntax. I mention it here
> because it does provide a method for eliding the terminal newline of
> a text field at a semantic level, if such a result is needed in the
> light of proposal (2) above.
>
> Regards
> Brian
>

