Re: Revised draft of CIF 1.1 syntax document
- Subject: Re: Revised draft of CIF 1.1 syntax document
- From: Brian McMahon <bm@xxxxxxxx>
- Date: Tue, 24 Sep 2002 11:32:25 +0100 (BST)
Following the latest round of discussions during this review, I intend to make the following changes to the draft specification document. It may take a little while to implement them, but input is welcomed on the proposals in the meantime.

1. Remove the STAR *use* of stop_ as a loop or loop header delimiter, but retain it as a reserved word.

Reason: while there is some support for using stop_, there is even more passionate argument against it. I am particularly persuaded by Brian Toby's comment that adding a feature to CIF that is formally unnecessary will have the practical effect of giving people more ways to break the files.

2. Change the definition of semicolon-delimited text values to *include* the terminal newline.

Reason: I am swayed by John Bollinger's reference to the STAR emphasis on lines of text. For those to whom it matters, the semicolons allow a distinction to be drawn between inline and line-delimited strings. Specific applications may if desired choose to elide the terminal newline - I see now that effectively that is what ciftex does. However, there is a possible way of excluding the terminal newline, which I shall refer to in a different context below.

3. Amend the productions for number values to permit 1e5 as valid. Greg Shields has pointed out to me that this was an error already flagged that I had forgotten to correct.

4. Review the representations for floating point numbers in scientific notation. I wish to exclude the version that permits an exponent to be identified solely by an embedded +/- sign. (Reason: Greg has pointed out that 12-14, which is often entered erroneously by authors intending to specify a range of values, could be parsed as a (rather small) number.) I would prefer to retain only the 'e' notation to express an exponential with machine-independent precision.
Reason: if we retain the 'd' notation also, the assumption would be that a distinction should be made between e and d in the manner of the IEEE 754 standard (referenced below) for floating-point representations. This then makes rather specific statements about machine storage (and also raises such questions as whether NaN should be included as a valid string value for a floating-point number representation). I am however troubled by the fact that Herbert is already using 'd', and would like to know more about how it affects internal storage in his applications. I am amenable to further debate on this point.

The IEEE Floating Point Standard (IEEE 754) is an IEEE standard, used by many CPUs and FPUs, which defines formats for representing floating point numbers; representations of special values (i.e. zero, infinity, very small values (denormal numbers), and bit combinations that don't represent a number (NaN)); five exceptions, when they occur, and what happens when they do occur; four rounding modes; and a set of floating-point operations that will work identically on any conforming system. IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (>= 43-bit, not commonly used) and double-extended precision (>= 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that they implement IEEE arithmetic, although sometimes it is optional. The C programming language for example allows but does not require IEEE arithmetic. IEEE is commonly used in C where float implements IEEE single precision and double implements IEEE double precision. Also known as IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) and IEC 559: "Binary floating-point arithmetic for microprocessor systems".

5. Lastly, I want to introduce into the *semantics* document a protocol for escaping newlines in text fields (and in comments). This idea has been discussed before by COMCIFS members and has had a chequered history. Nevertheless, I think the current work that CCDC are doing on their CIF editor demonstrates again its usefulness. The idea is that for a text field or a comment line, a convention is introduced that allows the end-of-line to be escaped (i.e. ignored) and the text on the following line to be concatenated to the current line.

Why is this useful?

- It allows one to preprocess a CIF 1.1 file with long lines of text (> 80 characters) and fold them into the 80-character limit of CIF 1.0 without loss of information. Thus the 'folded' file can be processed by older CIF 1.0 software with 80-character line buffers. If needed, a postprocessor can reconstitute the longer lines.
- Similar processing can wrap text into still narrower columns (sometimes needed even today as text files are autowrapped by certain mailers to 72 characters or fewer).
- Even with the more generous line lengths in CIF 1.1, it may be necessary to handle strings longer than 2048 characters (for example a protein amino-acid sequence or a very complex systematic chemical name). The protocol then allows wrapping into the 2048-character buffer.
- The CCDC editor may be required to import a text stream from a word processor document where embedded newlines are not used.

The arguments against this convention in an earlier round of discussion had to do with the burden of accommodating such additional processing within any CIF application. However, only applications that really need to handle (in some sense 'understand') the contents of text fields have to worry about this. For many applications text fields are not parsed for content, and can simply be passed through the application unchanged. Standalone utilities will be provided to perform the line wrapping or unwrapping.
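
To make the wrapping and unwrapping idea concrete, here is a minimal Python sketch of a fold/unfold pair. It is only an illustration under stated assumptions: this message does not name the escape marker, so the trailing backslash (ESC) is a hypothetical choice, as are the function names fold and unfold. The sketch also ignores text that already happens to end in the marker, which a real protocol would have to deal with.

```python
ESC = "\\"  # hypothetical escape marker; the message does not specify one


def fold(text, width=80):
    """Fold lines longer than `width`, ending each folded piece with ESC."""
    if width < 2:
        raise ValueError("width must be at least 2")
    out = []
    for line in text.split("\n"):
        # Each folded piece carries width - 1 characters plus the marker,
        # so no output line exceeds `width`.
        while len(line) > width:
            out.append(line[: width - 1] + ESC)
            line = line[width - 1:]
        out.append(line)
    return "\n".join(out)


def unfold(text):
    """Rejoin lines whose end-of-line was escaped by a trailing ESC."""
    out = []
    buffer = []
    for line in text.split("\n"):
        if line.endswith(ESC):
            # Escaped end-of-line: drop the marker and keep accumulating.
            buffer.append(line[:-1])
        else:
            buffer.append(line)
            out.append("".join(buffer))
            buffer = []
    if buffer:  # input ended on an escaped line
        out.append("".join(buffer))
    return "\n".join(out)
```

An application that merely passes text fields through would call neither function; only a preprocessor targeting 80-character buffers (fold) or a consumer that needs the original long lines (unfold) would be involved.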
I shall send out a more complete description of the proposal separately, because it should be discussed in the context of semantics - the meaning of the content of a text field - rather than syntax. I mention it here because it does provide a method for eliding the terminal newline of a text field at a semantic level, if such a result is needed in the light of proposal (2) above.

Regards
Brian
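
As an illustration of proposals (3) and (4) above, the following Python sketch shows one way the intended number rule might be checked: 1e5 is accepted, while a bare embedded sign (12-14) and the 'd' exponent are rejected. The pattern is an assumption made for illustration, not the production from the draft; it accepts both 'e' and 'E', allows a trailing decimal point, and ignores any parenthesised standard uncertainty. The name is_number is hypothetical.

```python
import re

# Optional sign, a mantissa with at least one digit, and an optional
# exponent that must be introduced by 'e' or 'E' (never by a bare sign
# and never by 'd').
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def is_number(token):
    """Return True if the whole token matches the assumed number pattern."""
    return NUMBER.fullmatch(token) is not None


for token in ("1e5", "1.5E-3", "-.5", "12-14", "1d5"):
    print(token, is_number(token))
```

Under this rule a token such as 12-14 would fall through to being treated as a character string rather than as a very small number.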