Re: Accent escape sequences
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- Subject: Re: Accent escape sequences
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Sat, 03 Mar 2007 16:01:45 -0500
- In-Reply-To: <20070302101147.GA26353@emerald.iucr.org>
- References: <45E72969.1090100@niehs.nih.gov><20070302101147.GA26353@emerald.iucr.org>
Brian McMahon wrote:
> Dear Joe
>
> We have recently exchanged a few messages off-list, and it is
> clear that you have an interest in, and perhaps some time for,
> working on CIF-based applications. It would be great if you would
> introduce yourself to the list with a brief indication of your
> current interests.

Recently, I have been working on some tools for data management in
macromolecular programming, with an interest in combining force-field
development with crystallography. The software idea is to create a
framework for modular programming. Most applications are tied together
into one big package that makes it difficult for individual
experimentation without digging through a lot of source code. It also
typically means that individual contributions may give up ownership,
such as a lot of community efforts into programs like CNS getting
sucked into Accelrys, where the scientific development pretty much dies.

My plan involves an "in-memory database", where modular units access
molecular data using memory pointer look-ups by name. Then, a module
programmer can (for example) add atom properties without modifying
compiled data structures in the core code. It should also provide a
natural way to tie in to scripting tools.

As for CIF format, it is a fairly good fit to the molecular database
concept. I realized that there seem to be no decent Fortran tools. The
available Fortran code seems to be mostly inflexible F77 spaghetti code.
Also, most of the C/C++ code is generally oriented towards
multi-structure databases. I also want to keep things very simple, where
no CIF dictionary is needed, with float/int types automatically
recognized and stored as such.

So, I decided to implement my own Fortran95 CIF library. In the process,
I realized that some parts of CIF and mmCIF are a bit ill-defined. Now
that many people have used CIF, it seems like now is a good time to work
out some of the unfinished details.
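To illustrate what dictionary-free recognition of float/int values might look like, here is a minimal Python sketch (not the Fortran95 library mentioned above). The function name `parse_cif_value` is hypothetical, and the rule that a trailing standard uncertainty such as `(5)` is dropped and forces a float is an assumption made for this example.

```python
import re

# A CIF-style number: optional sign, digits with optional fraction and
# exponent, optionally followed by a standard uncertainty in parentheses.
_NUMBER = re.compile(
    r"([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\((\d+)\))?"
)

def parse_cif_value(token):
    """Return an int, a float, or the unchanged string for one data token."""
    m = _NUMBER.fullmatch(token)
    if m is None:
        return token                      # not numeric: keep as text
    text, su = m.group(1), m.group(2)
    # Assumption: only a bare run of digits with no uncertainty is an int.
    if su is None and re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    return float(text)                    # uncertainty, if any, is discarded

if __name__ == "__main__":
    for tok in ["42", "-3", "1.234(5)", "1e3", "C12", "?"]:
        print(repr(tok), "->", repr(parse_cif_value(tok)))
```

Non-numeric tokens, including the CIF placeholders `?` and `.`, fall through unchanged as strings, so no dictionary lookup is needed to decide how a value is stored.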
>
> Regarding the untidy typographic markup conventions in CIF text
> fields, what we currently have arises from the pragmatic
> requirements of our early 1991 (prehistoric!) CIF-handling
> procedures in Acta Cryst. We used TeX as a formatter, so
> the markup (initially) was somewhat TeX-like; but there was
> pressure on us not to rely on TeX, especially as many of our
> authors would have no experience of it. Thus a minimal set
> of markup was devised, requiring very little learning from
> authors, that covered most markup that in practice we came
> across in Acta C papers (which have rather little
> mathematical content). Very few additional codes were
> introduced; and, for example, the relatively recent <i> and
> <b> markup for italic and bold was chosen because
> non-specialist authors were beginning to become familiar
> with such codes in HTML markup.
>
> The current arrangement is, in my opinion, very inelegant,
> but it is supported by publCIF, the IUCr's own CIF editor,
> and is workable within that tool's reasonably user-friendly
> interface.
>
> To provide better formatting abilities, I think it would be
> preferable to allow text fields to contain markup in various
> different standard formats, suitably identified, and to
> pass the fields to appropriate handlers. The simplest way to
> do so would be to have a 'magic number' introducing each text
> field. There's an undocumented example of this inasmuch as
> ciftex, the old cif->TeX translator, passes through unchanged
> any text field beginning
>     ;%T
> (i.e. it treats it as containing pure TeX markup).
> The 'magic number' might be a simple character sequence
> (%T for TeX, %L for LaTeX, %H html, %R RTF, %U Unicode...)
> or could be a more general, but more verbose, signature
> involving MIME headers:
>     ;
>     Content-Type: application/tex
> (this mimics the approach for embedding binary data in imgCIF files).

Something along those lines sounds good.
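As a rough illustration of the 'magic number' idea quoted above, the following Python sketch picks a markup type from the start of a text field. The function name `classify_text_field`, the convention that the argument is the field content after the opening semicolon, and the fallback label "CIF markup" are all assumptions made for this example; only the codes %T, %L, %H, %R, %U and the Content-Type header form come from the message.

```python
# Two-character codes listed in the quoted message.
MAGIC_CODES = {
    "%T": "TeX",
    "%L": "LaTeX",
    "%H": "HTML",
    "%R": "RTF",
    "%U": "Unicode",
}

def classify_text_field(body):
    """Return (markup, content) for text that follows the opening ';'."""
    # Short form: the field starts with one of the two-character codes.
    code = body[:2]
    if code in MAGIC_CODES:
        return MAGIC_CODES[code], body[2:]
    # Verbose form: a MIME-style header on the first non-empty line.
    first_line, _, rest = body.lstrip("\n").partition("\n")
    if first_line.lower().startswith("content-type:"):
        return first_line.split(":", 1)[1].strip(), rest
    # Assumption: anything else is treated as ordinary CIF markup.
    return "CIF markup", body

if __name__ == "__main__":
    print(classify_text_field("%T$\\alpha$"))
    print(classify_text_field("\nContent-Type: application/tex\n$x^2$"))
    print(classify_text_field("plain text"))
```

A caller could then hand `content` to whichever handler is registered for the returned markup name, and leave unmarked fields to the existing escape-code processing.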
One problem with the current multi-line text is that the text fields
often are indented, with one less character in the first line to offset
the semicolon. I think the multi-line format would be much simpler if
the begin and end semicolons were both required to be the only character
on a line, i.e. the text-block delimiter is "<eol>;<eol>" instead of
just "<eol>;". Also, a line starting with a semicolon within the
multi-line text would then not be a problem. A content-type tag could be
placed on the line with the starting semicolon. A multi-line pattern
would then be:

    <eol>;<content-type><eol><multi-line text><eol>;<eol>

>
> There's nothing fundamentally wrong with extending the existing
> special character sequences, and I'm happy to consider a
> specific proposal in terms of whether we could easily provide
> publCIF support for it. The problem is that the more one offers
> to the author, the more the author will want to do, and the more
> unwieldy an ad-hoc markup will become. (And recall that even
> TeX, which is unparalleled for mathematics, does not offer as
> primitives anywhere near all the symbols that our authors do
> use.)
>

I think the current set IS fundamentally flawed. Any proper set of
'escape' codes should be able to display the escape characters
literally. Currently, there is no rule for displaying a backslash or
caret without it potentially being recognized as an escape-code
character.

I thought that the CIF codes were rather ad-hoc, but realized that
similar code sequences have been used elsewhere. The advantage of the
current codes is that they are simple enough to be read fairly well in
plain text form. For an archival format, I think that is a good thing.

My proposal is not just to make a huge list of character codes, but to
define some simple rules that keep things from getting ad-hoc.
Personally, I would not have included <I> and <B>. It would be a better
fit to use old-style /italics/ and *bold*, specifically because CIF
markup is not HTML.

Here is my idea.
Note that the second rule provides the unescaped form of any special
character by using a blank second character.

special character sequence            result
\<alphabetic>                         Greek letter
\<not alpha><char>                    combination of 2 chars
\\<one or more alpha chars><space>    named code

style rules:
superscript text:  ~text~
subscript text:    ^text^
italic text:       /text/
bold text:         *text*

Some of the existing named 'by convention' rules might be better
written with the combined-character trigraph:

\\leftarrow   to  \\<-
\\rightarrow  to  \\->
\\simeq       to  \\~=
\\square      to  \\[]

I also think that the bare codes should be changed. How do I write
"---" and not mean single bond?

--   to  \\--
+-   to  \\+-
-+   to  \\-+
---  to  \\sb

Single bond could also be "\\--", but only if other bond types are also
visual.

Also, the italic and bold style suggestion would interfere a bit with
equations if not written with separating spaces. But the caret sequence
also conflicts with its use as an exponential operator, and nobody
seems to mind the lack of a caret escape.

Joe
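The three escape rules in the table above can be sketched as a small tokenizer. This is only one possible reading of the proposal: the name `tokenize`, the token labels, and the order in which the rules are tried are assumptions for this example, and the trigraph forms such as \\<- and the style rules are deliberately left out.

```python
import re

# Alternatives are tried left to right, so the named-code rule wins
# over the two single-backslash rules when both could apply.
TOKEN = re.compile(
    r"\\\\(?P<named>[A-Za-z]+) "           # \\<alpha chars><space>: named code
    r"|\\(?P<greek>[A-Za-z])"              # \<alphabetic>: Greek letter
    r"|\\(?P<mark>[^A-Za-z])(?P<base>.)"   # \<not alpha><char>: combination
)

def tokenize(text):
    """Split text into ('text' | 'named' | 'greek' | 'combine' | 'literal', value)."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        if m.group("named") is not None:
            tokens.append(("named", m.group("named")))
        elif m.group("greek") is not None:
            tokens.append(("greek", m.group("greek")))
        elif m.group("base") == " ":
            # Blank second character: the special character itself is meant.
            tokens.append(("literal", m.group("mark")))
        else:
            tokens.append(("combine", m.group("mark") + m.group("base")))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

if __name__ == "__main__":
    for sample in [r"\a-Fe", r'M\"uller', r"x \\rightarrow y", r"2\^ 3", r"a\\ b"]:
        print(sample, "->", tokenize(sample))
```

Because a blank second character yields a literal, both the backslash and the caret get an unescaped form under this reading, which is the gap in the current codes that the proposal is meant to close. Rendering the tokens (Greek letter tables, accent composition, named-code lookup) would be a separate step.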