Discussion List Archives


Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D


On Friday, January 07, 2011 7:56 AM, John Westbrook wrote:
>I have been quiet on this issue as my bias for supporting Python semantics
>has not been popular or productive in prior DDLm/CIF2 discussions.  I would
>extend Herb's argument to the whole of this enterprise and emphasize
>my view that meaningful adoption of DDLm/CIF2 will require embracing
>and leveraging existing technologies as much as possible.
[...]
>On 1/7/11 7:52 AM, Herbert J. Bernstein wrote:
>> As noted in my prior message, I disagree. I find it
>> counter-intuitive and unproductive to adopt something
>> that looks very much like the python treble quoted
>> string but which follows confusingly different rules.
>> Remember -- for most of the community, the entire
>> CIF2 approach to quoting is something new and different.
>> It does not agree with the well-established CIF1 quoting
>> rules. By giving them the python treble quoted strings
>> we are giving them a way to simply and easily carry any
>> and all strings and text fields forward from CIF1 to CIF2
>> without having to seriously rework them. Sure, we could
>> come up with some other set of rules for treble quoted
>> strings, but by following the python rules we will
>> greatly reduce the chances of misinterpretations in
>> the marginal cases, and give ourselves an independent
>> check on our new parsers -- all the existing python
>> parsers.
>>
>> I believe that Ralf is right.

[...]

There is a large number of elide mechanisms in popular programming languages, many of them similar, but none of them identical.  I don't see why choosing any one language's conventions, Python's for instance, is an overall win over choosing any of the others' (C/C++, Java, Ruby, Perl, ...).  Furthermore, although I recognize the advantages of drawing on existing technologies, I don't think that doing so in this case necessarily requires adopting the entire package of conventions from any particular language, no matter that language's present popularity level.

For elide processing I greatly favor approaches that are invisible to a STAR lexer.  That will allow CIF to remain a subset of STAR, which matters at least to me, though it may be of little concern to some others here.  Furthermore, I prefer an approach that is as minimal as possible while still adequately addressing the problem.  That will reduce the work to implement the scheme, the potential for bugs and incompatibilities, and the details that people need to learn.

I have these objections to adopting specifically the Python scheme for CIF2:

1) I dislike tying CIF strongly to a particular programming language.

2) The Python system is incompatible with STAR.

3) It is far more complicated than we need.  In particular, I don't like \N{name} for representing characters by UCD name, and nearly all of the other elides are redundant with \uxxxx forms.

4) It needlessly co-opts some commonly used elides of the IUCr system: \a, \b, \f, \', \".

5) In Python, Unicode strings have a different data type than plain strings (though there it matters little), and I don't want to carry that impression over to CIF2, where it would be false.
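
Objections (3) and (4) can be made concrete with a small Python 3 check.  This is only an illustrative sketch: the pairs below are a sample picked for this example, not a complete inventory of Python's escapes, and the names SAMPLE_PAIRS, all_redundant and consumed_lengths are hypothetical.

```python
# Sample of Python escapes, each paired with an equivalent \uxxxx spelling.
# (Illustrative selection only; not the full list of Python escapes.)
SAMPLE_PAIRS = [
    ("\N{GREEK SMALL LETTER ALPHA}", "\u03b1"),  # character by UCD name
    ("\a", "\u0007"),    # BEL
    ("\b", "\u0008"),    # backspace
    ("\f", "\u000c"),    # form feed
    ("\t", "\u0009"),    # tab
    ("\x41", "\u0041"),  # hex escape
    ("\101", "\u0041"),  # octal escape
]

# Objection (3): every pair compares equal, so the left-hand forms add nothing
# that the \uxxxx form cannot already express.
all_redundant = all(short == long for short, long in SAMPLE_PAIRS)

# Objection (4): Python consumes the backslash in \a, \b, \f, \' and \",
# so the two-character sequences do not survive into the value.
consumed_lengths = [len(s) for s in ("\a", "\b", "\f", "\'", "\"")]
```

Each entry of consumed_lengths is 1, which means that under a Python-style scheme a value using those sequences as IUCr markup would presumably need every such backslash doubled to keep its meaning.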


I observe in passing that the Python conventions provide for eliding newlines to indicate that they should be ignored, similar to the CIF line-wrapping protocol.  I have no objection to including that, but it has been a controversial topic in the past, so I do not let it go unremarked.
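
For reference, this is how the Python convention behaves.  A minimal Python 3 illustration; the names wrapped and unwrapped are chosen only for this example.

```python
# A backslash immediately before the line terminator removes both from the value.
wrapped = """first half \
second half"""

# Without the backslash, the newline is kept as part of the value.
unwrapped = """first half
second half"""
```

Here wrapped is a single line, while unwrapped still contains the line break.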


I don't think any existing programming language's system has the characteristics we (I) want, so I propose instead yet another alternative, "Proposal E", derived from Python's system but much restricted:

1) All triple-quoted strings (either delimiter) are handled according to these conventions.  No [uUrR] sigils are recognized (or needed).

2) (Only) The following elides are recognized and handled in triple-quoted strings as they are in Python Unicode strings:
  a) \uxxxx      (represents the Unicode character having the specified 4-hex-digit code point)
  b) \Uxxxxxxxx  (represents the Unicode character having the specified 8-hex-digit code point)
  c) \(newline)  (represents nothing; that is, it is consumed and ignored)
  d) \\          (represents a single backslash, the same as \u005C)

3) As in Python, unrecognized escape sequences are treated as literals (that is, they are left uninterpreted in the string, including the backslash).  Because \' and \" are not among the recognized elides, trailing backslashes are subject to this rule as well: they are treated as literals unless part of a \\ elide.
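
Rules (1) - (3) are small enough to sketch in a few lines of Python 3.  This is a hedged illustration rather than a reference implementation: the function name decode_proposal_e is hypothetical, it receives the text between the triple-quote delimiters after lexing, and it treats a malformed \u or \U sequence (too few hex digits) as an unrecognized elide, which is an assumption the rules above do not spell out.

```python
import re

# Rule 2: the only recognized elides.  Anything else after a backslash
# does not match and is therefore left untouched (rule 3).
_ELIDE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})"   # a) \uxxxx
    r"|U([0-9A-Fa-f]{8})"       # b) \Uxxxxxxxx
    r"|(\n)"                    # c) backslash followed by a newline
    r"|(\\))"                   # d) doubled backslash
)


def decode_proposal_e(body: str) -> str:
    """Decode the body of a triple-quoted string under rules (1) - (3)."""

    def replace(match: re.Match) -> str:
        hex4, hex8, newline, backslash = match.groups()
        if hex4 is not None:
            return chr(int(hex4, 16))
        if hex8 is not None:
            # chr() raises ValueError for values above 0x10FFFF.
            return chr(int(hex8, 16))
        if newline is not None:
            return ""            # consumed and ignored
        return "\\"              # \\ yields one backslash

    # re.sub scans left to right and never rescans replacement text.
    return _ELIDE.sub(replace, body)
```

For example, decode_proposal_e(r"\u0041") gives "A", while IUCr-style sequences such as \a or \'e and a trailing lone backslash pass through unchanged.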


Those are the essentials, but a few more details are necessary to ensure consistent interpretation:

4) Elides are processed as if after lexical analysis (unlike in Java, where Unicode escapes are processed as if before lexing).

5) Elides are processed left-to-right, and when an elide is replaced by a character, elide processing continues immediately *after* the replacement character.  (Thus '''\u005Cu005C''' is equivalent to '\u005C', that is, to the six literal characters \u005C, and not to '\'.)

6) As in Python, Unicode characters outside the BMP may be represented as surrogate pairs via the \uxxxx mechanism, with the same meaning as the corresponding \Uxxxxxxxx representation.

7) Unlike the IUCr elides, these elides are considered part of the CIF _representation_ of values, not part of the values themselves.  That is, applications consuming CIF data should not have to process or generate these elides.  Of course, general STAR applications will not and should not recognize them (unless they were adopted there, too), but that is desirable.

8) Characters not allowed to appear as literals in CIF must not appear as Unicode escapes, either.
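
Rules (5), (6) and (8) can be illustrated the same way in Python 3.  The sketch below is an assumption-laden illustration: combine_surrogates, check_allowed and placeholder_is_allowed are hypothetical names, and the placeholder predicate is a stand-in, not the CIF2 definition of allowed characters.  Note that Python 3 itself keeps \uxxxx surrogates as two separate code points in a str, so the pairing in rule (6) has to be done explicitly here.

```python
# Rule 5, checked against Python's own literals, using U+005C (backslash):
# the text produced by the first escape is not rescanned.
no_rescan = "\u005Cu005C" == "\\u005C"


def combine_surrogates(text: str) -> str:
    """Rule 6: join each high/low surrogate pair into one code point."""
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])  # not part of a pair: keep as-is
        i += 1
    return "".join(out)


def placeholder_is_allowed(ch: str) -> bool:
    # ASSUMPTION: stand-in rule for illustration, NOT the CIF2 character set.
    return ch in "\t\n\r" or ord(ch) >= 0x20


def check_allowed(text: str, is_allowed=placeholder_is_allowed) -> str:
    """Rule 8: reject decoded characters that may not appear literally."""
    for ch in text:
        if not is_allowed(ch):
            raise ValueError(f"U+{ord(ch):04X} is not permitted here")
    return text
```

A decoder would apply combine_surrogates and then check_allowed to the decoded value, so that an escape cannot be used to smuggle in a character that is forbidden as a literal.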


Comments on Proposal E:
() I think (4) and (5) are consistent with Python, but I had trouble finding documentation of that (which is another reason to be wary of adopting the Python system whole-hog, by reference).  (5) could as easily go the other way, but (4) is needed as-is to avoid additional elides.

() This proposal would allow almost all existing, well-formed CIF character data to be triple-quoted as-is, without need for eliding anything, even when IUCr elides are present.

() All allowed Unicode characters can be represented in data values via this system, using only printable ASCII characters and CIF whitespace.

() There are few rules to remember or code.

() Rules (1) - (6) are, I think, a strict subset of Python's Unicode string elide system, but [uUrR] sigils are not needed or used to activate them.

() This system preserves lexical compatibility with STAR, provides for line-wrapping, is mostly compatible with CIF1-style IUCr elides, and is small and relatively easy to code.

() The biggest potential gotcha I see for users is the absence of \' and \" elides, but that is necessary for the scheme to satisfy my objective of STAR compatibility, and it is furthermore useful for compatibility with the IUCr elide system.
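
To make that gotcha concrete: because elides are processed only after lexical analysis, a lexer can find the end of a triple-quoted string without looking at backslashes at all.  A minimal Python 3 sketch, in which read_triple_quoted is a hypothetical name and any further CIF2 rules about what may follow a closing delimiter are deliberately ignored:

```python
def read_triple_quoted(source: str, start: int = 0):
    """Return (raw_body, index_after_close) for the string opening at start."""
    delim = source[start:start + 3]
    if delim not in ("'''", '"""'):
        raise ValueError("no triple-quote delimiter at start")
    # The first later occurrence of the delimiter closes the string,
    # even if a backslash immediately precedes it.
    close = source.find(delim, start + 3)
    if close == -1:
        raise ValueError("unterminated triple-quoted string")
    return source[start + 3:close], close + 3
```

Given the input '''C:\dir\''' this returns the body C:\dir\ with its trailing backslash intact, whereas Python would read the \' as an elided quote and keep scanning for a terminator.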


Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


