Re: Accent escape sequences
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- Subject: Re: Accent escape sequences
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Mon, 05 Mar 2007 16:30:20 -0500
- In-Reply-To: <20070305160044.GB13871@emerald.iucr.org>
- References: <45E72969.1090100@niehs.nih.gov> <20070302101147.GA26353@emerald.iucr.org> <Pine.BSF.4.58.0703020830490.46806@epsilon.pair.com> <45EA0C29.5060604@niehs.nih.gov> <a06230900c20fde7910a9@[192.168.2.101]> <45EC3846.5070001@niehs.nih.gov> <20070305160044.GB13871@emerald.iucr.org>
Brian McMahon wrote:
>> The advantage of a simple escape mechanism, like the current scheme, is
>> that it is fairly easy to read directly. The disadvantage is that it has
>> limited abilities. With MIME, the multipart/alternative could be used,
>> where simple ASCII escapes are combined with a more accurate version
>> that is not directly readable. This gives the advantages of both forms.
>
> In principle, this is a great idea. Consider the CIF dictionaries,
> where the pure-text _definition field sometimes carries inventive
> representations of maths (e.g.
> http://www.iucr.org/iucr-top/cif/cifdic_html/1/cif_core.dic/Irefine_ls_restrained_S_gt.html )
> that have to be reverse-engineered into something more useful (e.g. TeX)
> when typesetting these for International Tables. It would make it
> easier to keep these representations in sync if they were both
> transported as multipart/alternative content in the same text field.
>
> But ... this does come at the expense of significantly more
> complexity in applications that need to do something with the
> content of text fields. Most scientific CIF applications (the
> ones that work on the data) won't be affected - they just skip
> over text fields. The others will need to have the ability to
> parse and extract MIME content (not too difficult), but also
> to *write* proper multipart content, and that's not necessarily
> so easy if you're to provide tools that ingest content from
> different input streams (TeX-savvy editors, html editors,
> clipboards...). In practice the Acta office doesn't see a
> critical mass of content provision to justify this complexity
> at this stage (it's still really only Acta C and E that use
> CIF text fields extensively, and they're catered for through
> publCIF). Having said which, there's no harm in working through
> the details of how such a system could operate.

As long as the multi-part processing is optional, it should not be a problem.
The extra effort is then needed only in cases where the content is
complex enough that the software already has to deal with that
complexity.

>
> Going back to Joe's original wishes to rationalise and perhaps
> extend the existing CIF markup, it's important also to remember
> that some data items will also occasionally require markup for
> simple string fields - e.g. how to markup the "alpha" Wyckoff
> position in the symmetry CIF dictionary? The use of
> the '\a' digraph in
> http://www.iucr.org/iucr-top/cif/cifdic_html/2/cif_sym.dic/Ispace_group_Wyckoff.letter.html
> clearly derives from the "usual" CIF markup for alpha, but that is
> nowhere made formally clear. It looks like we need unambiguous
> markup rules in these cases too.

Are you saying that the current CIF markup is defined only for
multi-line text? If so, the description sentence is an example where
'\a' needs to represent the character sequence in the non-markup form
(not converted to '<alpha>').

>
> (I'm hoping to see our publCIF developer later this week so that
> we can discuss the specifics of the proposal Joe posted recently.)
>
> Brian

When I first looked at this, I thought it would be sufficient to
convert the Latin1 and Latin2 character sets. But these do not include
the over-bar that is already defined. I also realized that RFC-1345
covers a lot of this. It defines 2-character sequences for most Latin
characters, 3-4 characters in some cases, and longer sequences for
languages like Japanese.

Maybe it would be a good goal to cover all of the Latin characters in
the 2-letter set of RFC-1345? Most of those 2-letter codes have the
alphabetic character first, then the modifier. Swapping the two
characters gives something quite similar to the CIF markup, apart from
a few differences in the modifiers, such as zero instead of % for
ring-above. RFC-1345 also adds Hook (2) and Horn (9) modifiers. It
would be nice to use the RFC-1345 set of modifier codes for increased
standardization.
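
To make the swap concrete, here is a rough Python sketch. The function
name and the modifier table are only illustrative assumptions: the 0 -> %
entry is the difference mentioned above, and the other entries are my
reading of RFC-1345 and the CIF markup, so they should be checked against
both specifications before anyone relies on them.

```python
# Partial, illustrative table: RFC-1345 modifier -> CIF markup modifier.
# Only the ring-above entry comes from the discussion above; the rest are
# assumptions to be verified against RFC-1345 and the CIF specification.
RFC1345_TO_CIF_MODIFIER = {
    "0": "%",   # ring above
    "'": "'",   # acute
    "!": "`",   # grave
    ">": "^",   # circumflex
    "?": "~",   # tilde
    ":": '"',   # diaeresis / umlaut
}


def rfc1345_to_cif(mnemonic: str) -> str:
    """Convert a 2-character RFC-1345 code (letter, then modifier)
    into CIF-style markup (backslash, modifier, then letter)."""
    if len(mnemonic) != 2:
        raise ValueError("expected a 2-character mnemonic")
    letter, modifier = mnemonic
    if not letter.isalpha():
        raise ValueError("first character must be alphabetic")
    if modifier not in RFC1345_TO_CIF_MODIFIER:
        raise ValueError("modifier not in the illustrative table")
    # Swap the order and translate the modifier where the two schemes differ.
    return "\\" + RFC1345_TO_CIF_MODIFIER[modifier] + letter


if __name__ == "__main__":
    for code in ("a0", "e'", "u:"):
        print(code, "->", rfc1345_to_cif(code))
```

Codes whose modifier is not in the table (Hook and Horn, for example)
are rejected rather than guessed at, since the current CIF markup has no
agreed equivalent for them.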

Any chance of having CIF markup "version 2" with some incompatible
changes? Maybe that is OK in the context of including Content-Type
headers?

For codes made of 2 alphanumeric characters, it is simple to map
RFC-1345 to 'word based' CIF codes, such as "\\ae " for the ae ligature.

Joe Krahn