[ddlm-group] Python-type eliding for triple-quoted strings
- To: ddlm-group <ddlm-group@iucr.org>
- Subject: [ddlm-group] Python-type eliding for triple-quoted strings
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 7 Jan 2011 15:22:24 +1100
- In-Reply-To: <AANLkTi=xRn2cJgen4dUdThRGqOV_W-K6cT1FjiS-0rS4@mail.gmail.com>
- References: <AANLkTi=KRObuU61HryEUBCx=Od-RsL8GxsGWwZZ097ZK@mail.gmail.com><AANLkTi=xRn2cJgen4dUdThRGqOV_W-K6cT1FjiS-0rS4@mail.gmail.com>
I do not think Ralf's proposal as it stands is suitable for CIF, for the following reasons:

(i) It implies 10 escape sequences in non-raw strings which are syntactically irrelevant and for which no need has been identified. These are \newline, \a, \b, \f, \n, \r, \t, \v, \ooo and \xhh.

(ii) Raw strings are mostly useful for the non-lexically significant escape sequences listed in (i). Raw and cooked strings are almost equivalent when these escape sequences are removed (see below for discussion of this point). It is therefore not clear that raw strings provide much benefit.

(iii) The inclusion of Unicode strings is not required to satisfy the need for expressing any string in CIF. They are a viable stand-alone solution, however (Proposal B below).

Why are cooked and raw strings almost equivalent in our situation? The point of raw strings is to allow backslash escape sequences to be preserved in the input string, which is particularly important for backslash-rich markup such as LaTeX. If we exclude the 10 unneeded escape sequences and allow only the <backslash><delimiter> and <backslash><backslash> sequences to be significant during parsing, then we win very little by including raw strings in the proposal.

Most importantly, we cannot determine whether a <backslash><delimiter> sequence in our raw string is due to a need to elide the delimiter, or is a backslash combination that is intended for the string's consumer. To resolve this ambiguity, we need to process the string to remove the <backslash> when it is only intended to elide the delimiter, and include some way of indicating this in the string, most simply by preceding the <backslash><delimiter> with a <backslash><backslash> when we want a backslash to remain in the string. This means that the raw string is interpreted just like the cooked string, as the extra backslashes do not form part of the string's value.
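For illustration only, Python's own cooked and raw literals show the behaviour discussed above. This is a small sketch of how Python treats these cases, not part of any CIF proposal, and the variable names are arbitrary.

```python
# Cooked string: <backslash><delimiter> is an escape, so the backslash is consumed.
cooked = "a\"b"

# Raw string: the backslash still stops the quote from ending the literal,
# but it also stays in the value, so a reader cannot tell whether it was
# meant for eliding or for the string's consumer.
raw = r"a\"b"

print(len(cooked), len(raw))  # prints: 3 4

# Backslash-rich markup: in a cooked string, \a is one of the 10 escapes
# listed in (i) (the BEL character), so the LaTeX command is corrupted.
latex_cooked = "\alpha"
latex_raw = r"\alpha"

print(repr(latex_cooked))  # prints: '\x07lpha'
print(repr(latex_raw))     # prints: '\\alpha'
```

If the 10 escapes are dropped, only the first pair of lines remains relevant, which is the sense in which raw and cooked strings become almost equivalent.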
Note that the one thing that a raw string does give us in this situation is that we can ignore double backslashes elsewhere in the string as long as they are not associated with a <delimiter>. Proposal A below also has this attribute.

Most of these criticisms are rectifiable, so I'm going to simplify and divide Ralf's proposal into two parts, each of which separately solves the problem of representing every possible string in a CIF file.

Proposal A: strings can be delimited by three quotes or three apostrophes. Whenever the sequence <backslash><delimiter> is encountered when reading such a string, it is replaced by <delimiter>, and <delimiter> loses any special meaning.

Proposal B: strings can be delimited by three quotes or three apostrophes, or else by three quotes or three apostrophes immediately preceded by the letter 'u' ("unicode strings"). In a non-unicode string, no special behaviour is defined (as in the current CIF2 proposal). In a Unicode string, the escapes \uxxxx and \Uxxxxxx are defined as the corresponding Unicode code point. Delimiters and backslashes can therefore be included in the string by using their Unicode number.

JRH comments on these two proposals:

(i) Both proposals require the cooperation of the lexer. Proposal A requires it only because of the particular case of the source string terminating with <delimiter>; in all other cases the triple <delimiter> is broken by <backslash>, and so the lexer can be oblivious to any special meaning. Proposal B obviously requires that the initial <u><delimiter> sequence is recognised.

(ii) When preparing a string for output:

In Proposal A, all <backslash><delimiter> sequences *must* be replaced by <backslash><backslash><delimiter>. <delimiter> *must* be prepended with a <backslash> when to do otherwise would terminate the string prematurely. <delimiter> *may* be prepended with a <backslash> elsewhere, but this is not a requirement.

In Proposal B, <delimiter> *must* be replaced by <backslash>uxxxx, where xxxx is the Unicode code point number for <delimiter>, if to do otherwise would prematurely terminate the string. <backslash> *must* be replaced by <backslash>u005C if the character sequence <backslash><u> is contained in the source text.

(iii) Proposal B allows Unicode characters to be included in strings even without access to a Unicode-aware editor, so this proposal may be useful in general.

(iv) No proposal can make cutting and pasting foolproof, but the need for editing is considerably reduced if we restrict the operation of eliding to triple-<delimiter> delimited strings.

What responses do others have to these proposals? Even if your response is predictable from previous discussions, please at least indicate that this is the case.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
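As an illustration of the reading and output rules stated for Proposal A in the message above, here is a minimal Python sketch. The function names `encode_a` and `read_a` are hypothetical and not part of any CIF software. The encoder takes the permitted option of escaping every delimiter, which also turns each <backslash><delimiter> in the value into <backslash><backslash><delimiter>. It refuses values that end in a backslash, a case the message does not address.

```python
def encode_a(value: str, delim: str = '"') -> str:
    """Write `value` as a triple-delimited string under Proposal A (sketch)."""
    if value.endswith("\\"):
        # Assumption: not covered by the message, so this sketch rejects it.
        raise ValueError("value ends in a backslash; not handled in this sketch")
    # Prefix every delimiter with a backslash (allowed by the *may* clause).
    return delim * 3 + value.replace(delim, "\\" + delim) + delim * 3


def read_a(text: str, delim: str = '"') -> str:
    """Read a triple-delimited string that starts at text[0] (sketch)."""
    if not text.startswith(delim * 3):
        raise ValueError("missing opening triple delimiter")
    out = []
    i = 3
    while i < len(text):
        if text.startswith("\\" + delim, i):
            # <backslash><delimiter> becomes <delimiter>, with no special meaning.
            out.append(delim)
            i += 2
        elif text.startswith(delim * 3, i):
            # Unescaped triple delimiter: end of string.
            return "".join(out)
        else:
            # Any other character, including a lone backslash, is kept as is.
            out.append(text[i])
            i += 1
    raise ValueError("unterminated string")


value = 'say """hi""" and \\" too'
encoded = encode_a(value)
assert read_a(encoded) == value
```

Note that the reader never needs to treat a double backslash specially: in <backslash><backslash><delimiter>, the first backslash is copied through and the remaining pair collapses to the delimiter.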