CIF line folding/reassembly protocol
- Subject: CIF line folding/reassembly protocol
- From: Brian McMahon <bm@xxxxxxxx>
- Date: Tue, 24 Sep 2002 14:45:08 +0100 (BST)
Here is a draft description of the line-folding protocol mentioned earlier that I wish to add to the semantics document as part of the CIF 1.1 specification. It is a slightly modified version of a proposal elaborated by Herbert Bernstein.

Note that the specific aim of this proposal is to introduce a technique for folding lines within a text field or comment that exceed the CIF line-length limit into lines within that limit, for the purpose of producing a syntactically valid CIF where the semantic information within the long lines can be recovered without loss by applying the unfolding part of the protocol. This requirement has always been present, and has in the past been handled in various ad hoc ways (Acta Cryst. implemented something similar within ciftex); this proposal formalises a specific approach that may be used robustly by content handlers of text fields. As a beneficial corollary, it also facilitates mechanical interconversion between CIF 1.0 and 1.1 files.

Brian

PS: To protect the unwary, I'll draw your attention to a couple of specific things in the example folded CIF below. One is the way that a quote-delimited text string has been converted into a folded-multiline text field where the terminal newline is elided; the second is that one of the folded lines carries a colon into the first column of the next line - be careful not to see that as a semicolon!

==============================================================================
A line-folding/reassembly protocol
----------------------------------

It must be emphasized that most CIF software and applications need not be concerned with line folding. However, if one has software for CIF 1.0 and a dataset with long lines, it is useful to have a consistent way in which to convert the data to conform to CIF 1.0. Line folding using backslashes allows us to do this. In order to permit such a folding we define a special semantics for use of the backslash.

It is important to understand that this does not change the syntax of CIF 1.0. All existing CIFs conforming to the CIF 1.0 specification can be viewed as having exactly the same semantics as they now have. Use of these transformational semantics is optional, but recommended.

In order to avoid confusion between CIFs that have undergone these transformations and those that have not, the special comment beginning with a hash mark immediately followed by a backslash (#\) as the last non-blank characters on a line is reserved to mark the beginning of comments created by folding long-line comments, and the special text field beginning with the sequence line-termination, semicolon, backslash (<eol>;\) as the only non-blank characters on a line is reserved to mark the beginning of text fields created by folding long-line text fields.

The backslash character is used to fold long lines in character strings and comments.

Consider a comment which extends beyond column 80. In order to provide a comment with the same meaning which can be fitted into 80 character lines, prefix the comment with the special comment consisting of a hash mark followed by a backslash (#\) and the line terminator. Then on new lines take appropriate fragments of the original comment, beginning each fragment with a hash mark and ending all but the last fragment with a backslash. In doing this conversion, check for an original line that ends with a backslash followed only by blanks or tabs. To preserve that backslash in the conversion, add another backslash after it. If the next lexical token (not counting blanks or tabs) is another comment, to avoid fusing this comment with the next comment, be sure to insert a line with just a hash mark.
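
As an illustration only, the comment-folding steps above might be sketched in Python as follows. The function name fold_comment and its width parameter are hypothetical and not part of the draft. The sketch handles a single comment line, drops any blanks that follow a trailing backslash, and leaves it to the caller to insert the lone hash-mark line when another comment follows.

```python
def fold_comment(comment: str, width: int = 80) -> list[str]:
    """Fold one long comment line into lines of at most `width` characters.

    Hypothetical helper: one possible reading of the folding rule, not a
    reference implementation.
    """
    if not comment.startswith("#"):
        raise ValueError("a CIF comment must start with '#'")
    if len(comment) <= width:
        return [comment]  # short enough already, leave untouched

    body = comment[1:]  # the initial hash mark is re-supplied by the "#\" line
    stripped = body.rstrip(" \t")
    if stripped.endswith("\\"):
        # Preserve an original trailing backslash by doubling it
        # (this sketch discards the blanks or tabs that followed it).
        body = stripped + "\\"

    # Each fragment needs room for a leading '#' and a trailing backslash.
    step = width - 2
    chunks = [body[i:i + step] for i in range(0, len(body), step)]

    folded = ["#\\"]                                   # the special marker comment
    folded += ["#" + chunk + "\\" for chunk in chunks[:-1]]
    folded.append("#" + chunks[-1])                    # last fragment: no backslash
    return folded


if __name__ == "__main__":
    for line in fold_comment("#" + " long comment text" * 6):
        print(line)
```

Breaking at fixed offsets is the simplest choice; a folder that prefers to break at blanks would satisfy the same rule.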

Similarly, for a character string that extends beyond column 80,

 - first convert it to be a text field delimited by line-termination-semicolon (<eol>;) sequences
 - then change the initial line-termination-semicolon (<eol>;) sequence to line-termination-semicolon-backslash-line-termination (<eol>;\<eol>)
 - and break all subsequent lines that do not fit within 80 columns with a trailing backslash.

In the course of doing the translation,

 * check for any original text lines that end with a backslash followed only by blanks or tabs.
 * To preserve that backslash in the conversion, add another backslash after it, and then an empty line.

(More formally, the line folding should be done separately and directly on single-line non-semicolon-delimited character strings to allow for recognition of the fact that no terminal line-termination is intended -- see below).

In order to understand this scheme, suppose the CIF fragment (1) below were considered to have long lines, then we could transform them as follows (2):

(1) Initial CIF ==============================================================
###################################################
#                                                 #
#  Converted from PDB format to CIF format by     #
#  pdb2cif version 2.3.1         24 Aug 96        #
#                    by                           #
# P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
#                                                 #
###################################################

data_1DIN

_entry.id        1DIN

loop_
_struct.entry_id
_struct.title
  1DIN
; DIENELACTONE HYDROLASE AT 2.8 ANGSTROMS
  Compound:: MOL_ID: 1; MOLECULE: DIENELACTONE HYDROLASE; CHAIN: NULL;
  SYNONYM: DLH; EC: 3.1.1.45; ENGINEERED: YES
  Source:: MOL_ID: 1; ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.; STRAIN: B13;
  EXPRESSION_SYSTEM: EXPRESSED UNDER OWN PROMOTER;
  EXPRESSION_SYSTEM_PLASMID: PDC100; EXPRESSION_SYSTEM_GENE: CLC D
;

_exptl.entry_id 1DIN
_exptl.method  ' X-RAY DIFFRACTION '

(2) Transformed CIF ==========================================================
#\
##########################\
##########################
#                                                 #
#\
#  Converted from PDB format\
# to CIF format by     #
#  pdb2cif version 2.3.1         24 Aug 96        #
#                    by                           #
# P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
#                                                 #
###################################################

data_1DIN

_entry.id        1DIN

loop_
_struct.entry_id
_struct.title
  1DIN
;\
 DIENELACTONE HYDROLASE\
 AT 2.8 ANGSTROMS
  Compound:\
: MOL_ID: 1; MOLECULE: DIENELACTONE HYDROLASE; CHAIN: NULL;
  SYNONYM: DLH; EC: 3.1.1.45; ENGINEERED: YES
  Source:: MOL_ID: 1; ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.; STRAIN: B13;
  EXPRESSION_SYSTEM:\
 EXPRESSED UNDER OWN PROMOTER;
  EXPRESSION_SYSTEM_PLASMID: PDC100; EXPRESSION_SYSTEM_GENE: CLC D
;

_exptl.entry_id 1DIN
_exptl.method
;\
 X-RAY DIFFRACTION \
;
==============================================================================

In making the transformation from the backslash folded form to long lines, it is very important to strip trailing blanks before attempting to recognize a backslash as the last character. When re-assembling text field lines, no reassembly should be done except in text fields that begin with the special sequence described above, line-termination-semicolon-backslash-line-termination, (<eol>;\<eol>), so that text fields which happen to contain backslashes, but which were not created by folding long lines, are not changed.
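
A minimal Python sketch of text-field reassembly under these rules could look like the following. The function name unfold_text_field is hypothetical, and the sketch assumes the caller has already isolated the field: it receives the text from just after the opening semicolon up to, but not including, the closing line-termination-semicolon, split into lines.

```python
def unfold_text_field(lines: list[str]) -> str:
    """Reassemble the content lines of one text field.

    Hypothetical helper: `lines` is the field content split at line
    terminators, starting with whatever follows the opening semicolon.
    """
    blanks = " \t"
    # Only a field opening with ";\" (then nothing but blanks) was folded;
    # any other field is returned exactly as it came in.
    if not lines or lines[0].rstrip(blanks) != "\\":
        return "\n".join(lines)

    content = lines[1:]
    pieces = []
    for index, line in enumerate(content):
        # Strip trailing blanks before looking for the backslash.
        stripped = line.rstrip(blanks)
        if stripped.endswith("\\"):
            pieces.append(stripped[:-1])   # folded here: join to the next line
        elif index < len(content) - 1:
            pieces.append(line + "\n")     # a genuine line break in the value
        else:
            pieces.append(line)            # last line: the closing <eol>; ends it
    return "".join(pieces)


if __name__ == "__main__":
    # The folded _exptl.method value from the transformed CIF above.
    print(repr(unfold_text_field(["\\", " X-RAY DIFFRACTION \\"])))
    # ' X-RAY DIFFRACTION '
```

Only one backslash is removed per line, so a doubled trailing backslash comes back as the single backslash it was protecting.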
It is also important to remove the trailing backslashes when reassembling long lines. The final line-termination-semicolon sequence of a text field takes priority over the reassembly process and ends it, but a trailing backslash on the last line of a text field very nicely conveys the information that no trailing line termination is intended to be included within the character string.

Similarly, when reassembling long-line comments, the reassembly begins with a comment of the form hash-backslash-line-termination. The initial hash mark is retained and then a forward scan is made through line-terminations and blanks for the next comment, from which the initial hash mark is stripped and then the contents of the comment are appended. If that comment ends with a backslash, the trailing backslash is stripped and the process repeats. Note that the process will be ended by intervening tags, values, data blocks or other non-whitespace information, and that the process will not start at all without the special hash-backslash-line-termination comment.

Since there are very few, if any, CIFs which contain text fields and comments beginning this way, in most cases, it is reasonable to adopt the policy of doing this processing unless it is disabled.

Here is another example of folding. The following three text fields would be equivalent:

;C:\foldername\filename
;

;\
C:\foldername\filename
;

and

;\
C:\foldername\file\
name
;

but the next example would be a two-line value where the first line had the value "C:\foldername\file\" and the second had the value "name":

;
C:\foldername\file\
name
;

When these line-folding transformations are performed on long-line CIFs, and when long tags are replaced with aliases no longer than 75 characters, it is then simple to fold the entire CIF into lines of no more than 80 characters, making it conform to CIF 1.0 specifications.

Note that backslashes should not be used to fold lines outside of comments and text fields.
That would introduce extraneous characters into the CIF and violate the basic syntax rules. In any case, such an action is not necessary.

Note that the line folding and reassembly mechanism has been introduced to allow folding of long-line CIFs to the 80-character maximum width of the CIF 1.0 specification; but it is a general mechanism that may be used to fold lines into any width imposed or required by applications or transmission mechanisms (for example, some older mail transfer agents fold lines in text files at 72 characters).
==============================================================================
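
Whatever width was used when folding, reassembly does not need to know it. As a rough Python sketch of the comment reassembly procedure described earlier, under simplifying assumptions: the function name unfold_comments is hypothetical, the input is a list of lines already known to lie outside any text field, and every comment is assumed to occupy a line of its own.

```python
def unfold_comments(lines: list[str]) -> list[str]:
    """Reassemble folded comments in a run of lines outside text fields.

    Hypothetical helper: whole-line comments only, a sketch rather than a
    complete CIF scanner.
    """
    blanks = " \t"
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip(blanks) != "#\\":
            out.append(line)          # not the special marker: copy through
            continue

        text = "#"                    # the initial hash mark is retained
        while True:
            # Scan forward through blank lines for the next comment.
            j = i
            while j < len(lines) and lines[j].strip(blanks) == "":
                j += 1
            if j == len(lines) or not lines[j].lstrip(blanks).startswith("#"):
                break                 # ended by a tag, value, data block or EOF
            raw = lines[j].lstrip(blanks)[1:]   # strip the fragment's hash mark
            i = j + 1
            stripped = raw.rstrip(blanks)       # ignore trailing blanks
            if not stripped.endswith("\\"):
                text += raw           # last fragment of this comment
                break
            text += stripped[:-1]     # drop the folding backslash and go on
        out.append(text)
    return out


if __name__ == "__main__":
    folded = ["#\\", "#  Converted from PDB format\\", "# to CIF format by     #"]
    print(unfold_comments(folded))
```

Blank lines that are skipped while searching for a fragment are consumed only when a fragment is actually found, so a folded comment that is cut short by a data item leaves the surrounding layout intact.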