Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D


I have been quiet on this issue because my bias toward Python semantics
has been neither popular nor productive in prior DDLm/CIF2 discussions. I would
extend Herb's argument to the whole of this enterprise and emphasize
my view that meaningful adoption of DDLm/CIF2 will require embracing
and leveraging existing technologies as much as possible.


John


On 1/7/11 7:52 AM, Herbert J. Bernstein wrote:
> As noted in my prior message, I disagree. I find it
> counter-intuitive and unproductive to adopt something
> that looks very much like the Python treble-quoted
> string but which follows confusingly different rules.
> Remember -- for most of the community, the entire
> CIF2 approach to quoting is something new and different.
> It does not agree with the well-established CIF1 quoting
> rules. By giving them the Python treble-quoted strings
> we are giving them a way to simply and easily carry any
> and all strings and text fields forward from CIF1 to CIF2
> without having to seriously rework them. Sure, we could
> come up with some other set of rules for treble-quoted
> strings, but by following the Python rules we will
> greatly reduce the chances of misinterpretations in
> the marginal cases, and give ourselves an independent
> check on our new parsers -- all the existing Python
> parsers.
>
> I believe that Ralf is right.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Fri, 7 Jan 2011, SIMON WESTRIP wrote:
>
>> Dear All
>>
>> My initial reaction to the adoption of the Python mechanism for
>> triple-quoted strings was that it is counter-intuitive in a CIF
>> context - i.e. you might expect the base syntax of ''' and """
>> delimited strings to be the same as that of the other delimited
>> strings, which in CIF1 and the proposed CIF2 is closer to Python's
>> 'raw' strings.
>>
>> However, I am in favour of revisiting the issue to address the
>> restrictions of the current set of delimiters, and believe that there
>> may indeed be an answer amongst James's proposals, which could be
>> agreed upon quite swiftly, both respecting the legacy of CIF1 and
>> rectifying its shortcomings in this respect.
>>
>> I will follow up on this when I have considered James's proposals in
>> more detail.
>>
>> I'd rather the group spent a little more time on this than just
>> 'dumping' a bit of Python syntax into CIF.
>>
>> Cheers
>>
>> Simon
>>
>>
>> ____________________________________________________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: ddlm-group <ddlm-group@iucr.org>
>> Sent: Friday, 7 January, 2011 4:46:10
>> Subject: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D
>>
>> Dear DDLm group members,
>>
>> Most of you will be aware that the CIF2 standard has been approved by
>> COMCIFS, with one dissenting vote.  I propose to revisit the point
>> raised by Ralf in his dissenting vote, in order to see if we can't
>> improve this aspect of the standard.  The particular problem
>> identified by Ralf, and this problem exists to a more limited extent
>> with CIF1 as well, is that there is no mechanism to elide instances of
>> the string delimiter sequence, meaning that certain pathological
>> strings cannot be included in a CIF2 file.  A further issue is that
>> CIF writing programs have to run through a long series of checks when
>> determining how to delimit any given string. I propose that we revisit
>> this problem, with the restriction proposed by Ralf that we consider
>> only triple quote/triple apostrophe delimited strings.
>>
>> To get us back up to speed on this issue, you will recall some salient
>> points from previous discussions, which taken together led to our
>> failure to make any progress:
>>
>> (1) CIF files are often edited in text editors.  Working with CIF text
>> in a text editor should not produce unexpected behaviour for a typical
>> workflow.
>> (2) CIF text may include LaTeX or other marked-up text, which will be
>> cumbersome to insert in the file if it contains many instances of
>> elide characters (see point (1))
>> (3) IUCr "markup" for Greek letters uses backslash to introduce the
>> special character combination
>> (4) Any characters that function as elides must be removed from the
>> string at parse time to avoid ambiguity in interpretation when
>> returned to the calling application
>>
>> If we limit ourselves to triple quote/apostrophe delimited strings, as
>> Ralf proposes, then we can construct an elide scheme that is invisible
>> to the lexer, by simply breaking the trigraph appropriately.  I
>> propose the following general scheme, where <delimiter> refers to one
>> delimiter character, so the full string delimiter would be
>> <delimiter><delimiter><delimiter>:
>>
>> Proposal C:
>>
>> When reconstructing the datavalue from an input triple-<delimiter>
>> delimited string, the following simple transformation is performed:
>> all occurrences of <delimiter><elide> are replaced by <delimiter>.
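A minimal sketch of the Proposal C read-side transformation, assuming
single-character <delimiter> and <elide> values. The function name
decode_proposal_c and its parameters are illustrative only and do not
come from the proposal or from any CIF library.

```python
def decode_proposal_c(body: str, delimiter: str, elide: str) -> str:
    """Recover the datavalue from the text found between the triple delimiters.

    Every <delimiter><elide> pair collapses to a single <delimiter>.
    Only the first elide after a delimiter is consumed, so
    <delimiter><elide><elide> yields <delimiter><elide>.
    """
    if len(delimiter) != 1 or len(elide) != 1 or delimiter == elide:
        raise ValueError("delimiter and elide must be distinct single characters")
    # str.replace scans left to right without overlapping matches, which is
    # exactly the single-pass substitution described in the proposal.
    return body.replace(delimiter + elide, delimiter)


if __name__ == "__main__":
    # Body of the apostrophe example from this message, with '$' as the elide.
    print(decode_proposal_c("''$' AAABBB ''$' CCCDDD '$", "'", "$"))
```

Because the substitution runs only after the lexer has isolated the
string body, the lexer itself never needs to know which elide
character is in use.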
>>
>> My comments on this scheme are as follows:
>> (0) When preparing a string for output, any occurrences of
>> <delimiter><elide> *must* be replaced by <delimiter><elide><elide>;
>> <delimiter> only needs to be elided when necessary to break up triple
>> <delimiter> sequences in the source string, and when the final
>> character of a string is <delimiter>
>> (1) It is invisible to the lexer, which will correctly find the string
>> terminator characters without knowledge of the <elide> character used.
>> (2) With appropriate choice of <elide>, there is a low likelihood of
>> ever encountering a string where transformation needs to be performed,
>> which means transforming the string is necessary only where three or
>> more delimiter characters are present in a row, or the string
>> concludes with a delimiter character.
>> (3) The <elide> is a post-elide, by which I mean it elides the
>> preceding character, not the next character.  This is preferable to
>> cover the case of an input string finishing with the <delimiter>
>> character, in which case some non-<delimiter> character must appear
>> after it to ensure the lexer does not consider the final <delimiter>
>> character in the string as the first character of the terminating
>> <delimiter><delimiter><delimiter> sequence.
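The write-side rules in comments (0) to (3) can be sketched as
follows. This is one possible reading, not a reference implementation:
the name encode_proposal_c is hypothetical, and the choice to insert
the elide after the second delimiter of a run is an assumption that
happens to reproduce the ''$' pattern used in the examples in this
thread.

```python
def encode_proposal_c(value: str, delimiter: str, elide: str) -> str:
    """Prepare a datavalue for output between triple delimiters.

    An elide is inserted after a delimiter in three cases:
      * the source already has an elide right after it (protects that pair),
      * the delimiter is the last character of the value,
      * it is the second of two bare delimiters and a third one follows.
    """
    if len(delimiter) != 1 or len(elide) != 1 or delimiter == elide:
        raise ValueError("delimiter and elide must be distinct single characters")
    out = []
    run = 0  # bare delimiters emitted since the last non-delimiter character
    for i, ch in enumerate(value):
        out.append(ch)
        if ch != delimiter:
            run = 0
            continue
        run += 1
        nxt = value[i + 1 : i + 2]  # empty string at the end of the value
        if nxt == elide or nxt == "" or (run == 2 and nxt == delimiter):
            out.append(elide)
            run = 0
    return "".join(out)


if __name__ == "__main__":
    encoded = encode_proposal_c("''' AAABBB ''' CCCDDD '", "'", "$")
    print(encoded)
    # The encoded body never contains the terminator and never ends in a
    # delimiter, so a lexer can find the closing ''' without knowing the elide.
    print("'''" not in encoded, not encoded.endswith("'"))
```

Strings with no run of three delimiters, no trailing delimiter and no
<delimiter><elide> pair pass through unchanged, which is the common
case anticipated in comment (2).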
>>
>> Finally, consider a general proposal D:
>>
>> Elided triple-<delimiter> strings are delimited by
>> <char><delimiter><delimiter><delimiter>...<delimiter><delimiter><delimiter>.
>> The initial <char> defines the character to use to post-elide the
>> contents of the string as per proposal C. <char> would initially be
>> any non-alphanumeric ASCII character, with the set expanded in the
>> future to include Unicode characters once most applications were
>> Unicode-aware.
>>
>> Examples (LHS is string as written in CIF file, RHS is actual
>> datavalue inside angle brackets)
>>
>> &""" Bleg blah blah ""&"  and so forth "&"""
>>     < Bleg blah blah """  and so forth ">
>> $'''''$' AAABBB ''$' CCCDDD '$'''
>>     <''' AAABBB ''' CCCDDD '>
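A rough sketch of how a reader might handle a Proposal D string,
checked against the two examples above. The function name
read_proposal_d is hypothetical, and rejecting a <char> equal to the
delimiter is an added assumption; the proposal only says <char> is a
non-alphanumeric ASCII character.

```python
from typing import Tuple


def read_proposal_d(text: str, start: int = 0) -> Tuple[str, int]:
    """Read one <char>-prefixed triple-delimited string at text[start].

    Returns the datavalue and the index just past the closing delimiters.
    """
    elide = text[start : start + 1]
    delimiter = text[start + 1 : start + 2]
    if delimiter not in ("'", '"'):
        raise ValueError("expected a quote or apostrophe after the elide character")
    if not elide.isascii() or elide.isalnum() or elide == delimiter:
        raise ValueError("elide must be a non-alphanumeric ASCII character")
    fence = delimiter * 3
    if not text.startswith(fence, start + 1):
        raise ValueError("expected a triple delimiter")
    body_start = start + 1 + len(fence)
    # The lexing step does not look at the elide character at all: it simply
    # finds the first triple delimiter after the opening one.
    end = text.find(fence, body_start)
    if end == -1:
        raise ValueError("unterminated string")
    body = text[body_start:end]
    # Post-elide transformation from Proposal C.
    return body.replace(delimiter + elide, delimiter), end + len(fence)


if __name__ == "__main__":
    for source in ('&""" Bleg blah blah ""&"  and so forth "&"""',
                   "$'''''$' AAABBB ''$' CCCDDD '$'''"):
        value, stop = read_proposal_d(source)
        print(repr(value), stop == len(source))
```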
>>
>> This allows the string writer to choose the elide character to
>> minimise <delimiter><elide> occurrences in the source text.  Note that
>> the need to choose and prepend a character to the string minimizes the
>> likelihood that somebody will do a naive cut and paste.
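One way a writer could pick the elide character is to count, for each
candidate, how often <delimiter><candidate> already occurs in the
source text and take the least frequent. The sketch below assumes the
candidate pool is Python's string.punctuation with both quote
characters removed; the proposal itself does not fix the pool or a
tie-breaking rule, and the name choose_elide is illustrative.

```python
import string


def choose_elide(value: str, delimiter: str,
                 candidates: str = string.punctuation) -> str:
    """Return the candidate with the fewest <delimiter><candidate> pairs.

    Ties are broken by the order of `candidates` (an arbitrary choice).
    """
    pool = [ch for ch in candidates if ch not in ("'", '"', delimiter)]
    if not pool:
        raise ValueError("no usable elide candidates")
    # min() returns the first minimal element, so earlier candidates win ties.
    return min(pool, key=lambda ch: value.count(delimiter + ch))


if __name__ == "__main__":
    # '"!' and '"#' both occur in this text, so a later candidate is chosen.
    print(choose_elide('say "!" and "#"', '"'))
```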
>>
>> An even more general proposal would prepend a character to the string
>> to indicate pre-elide (as per Proposal A in a separate email) or
>> append a character to indicate post-elide.  I don't propose to
>> consider this.
>>
>> Again, please indicate your views on including any of these proposals
>> in the CIF standard.
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>>
>
>

-- 
******************************************************************
   John Westbrook, Ph.D.
   Rutgers, The State University of New Jersey
   Department of Chemistry and Chemical Biology
   610 Taylor Road
   Piscataway, NJ 08854-8087
   e-mail: jwest@rcsb.rutgers.edu
   Ph:  (732) 445-4290  Fax: (732) 445-4320
******************************************************************
