Re: Survey of available CIF software and request for wish list
- To: Multiple recipients of list <comcifs-l@iucr.org>
- Subject: Re: Survey of available CIF software and request for wish list
- From: Peter Murray-Rust <pazpmr@unix.ccc.nottingham.ac.uk>
- Date: Wed, 4 Oct 2000 18:28:28 +0100 (BST)
At 07:17 04/10/00 +0100, Nick Spadaccini wrote:

Thanks Nick...

>On Mon, 2 Oct 2000, Peter Murray-Rust wrote:
>
>As a general note, looking at things from a STAR point of view, I have
>never considered XML an exclusive competitor to STAR with respect to the
>discipline domains that have adopted STAR or some derivative of it.
>Certainly STAR is not a competitor to XML in wider web based applications.
>I believe that discipline specific derivatives of XML (such as CML) can
>and should coexist with STAR. It would be pointless not to leverage off
>the XML based tools which are touted to be "just around the corner".

The coexistence of XML and STAR (and other formats) is very important and
one that we shall need to address. A common question in XML is "I have
(binary) data - how do I incorporate this into XML?" I see this at 4 levels:

- encoding. Are the character sets compatible? If not, some conversion will
  be needed. A good approach is to convert "binary" data into base64 (or
  similar) and wrap this with appropriate delimiters.
- syntax. The syntax of each component must be carefully preserved. Some
  minimal escaping will always be needed in case the delimiters occur by
  chance in the included material.
- semantics. How do we determine what to do with the included material? It
  must at least be labelled with appropriate metadata, e.g.:

<?xml version="1.0" encoding="UTF-7"?>
<!DOCTYPE cml SYSTEM "http://www.xml-cml.org/dtd">
<cml xmlns="http://www.xml-cml.org/dtd/V1.0">
  <molecule id="toz">
    <string title="data" convention="org.iucr/CIF/DDL1.0/data">
<![CDATA[
data_TOZ
#==============================================================================
# 5. CHEMICAL DATA
_chemical_name_systematic
"trans-3-Benzoyl-2-(tert-butyl)-4-(iso-butyl)-1,3-oxazolidin-5one"
_chemical_formula_moiety 'C18 H25 N O3'
_chemical_formula_sum 'C18 H25 N O3'
_chemical_formula_weight 303.40
loop_
_atom_type_symbol
_atom_type_scat_dispersion_real
_atom_type_scat_dispersion_imag
_atom_type_scat_source
C .017 .009 International_Tables_Vol_IV_Table_2.3.1
H 0 0 International_Tables_Vol_IV_Table_2.3.1
O .047 .032 International_Tables_Vol_IV_Table_2.3.1
N .029 .018 International_Tables_Vol_IV_Table_2.3.1
]]>
    </string>
  </molecule>
</cml>

There are many important points here.

1. The whole file is a well-formed XML file. The CDATA mechanism escapes all
characters except ]]> so that the body of the <string> element is seen
simply as character data (#PCDATA in XML). The file identifies itself as
XML - and this mechanism (<?xml...?>) is registered with the IETF. (It would
also carry a media type of text/xml)

2. The file identifies the tags (element names) as belonging to a namespace,
uniquified by the URI www.xml-cml.org/dtd/V1.0
THERE ARE NO SEMANTICS ASSOCIATED WITH THIS STATEMENT. In XML there is no
current agreement on how to determine the semantics of a namespace; it is
simply there to uniquify the tags. Mechanisms are required and are starting
to emerge but there is no universal way of applying semantics.

3. The file can be validated against the DTD listed in the DOCTYPE. This
(for example) recognises that the <molecule> tag is allowed in CML but would
forbid (say) <unitCell> which is not part of CML. The DTD is required to
contain *prose* semantics but has no machine-processable means of delivering
semantics.

This is as far as XML can go. Beyond this it is up to the semantics of the
particular Language (application). Implicit in the file is that there are
CML semantics. Thus if I read the file into JUMBO3 it will create a molecule
object. This object has a string child - that is all that JUMBO3 knows.
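The CDATA wrapping described in point 1 can be sketched roughly as follows. This is only an illustration in Python, not how JUMBO does it: the function names `wrap_cdata`, `embed_cif` and `extract_cif` are made up for the example, and the namespace declaration is left out to keep the lookup short. The one sequence CDATA cannot carry, `]]>`, is split across two adjacent CDATA sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def wrap_cdata(text):
    # "]]>" is the only sequence a CDATA section cannot hold, so split it
    # across two adjacent sections; a parser joins them back together.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def embed_cif(cif_text, convention="org.iucr/CIF/DDL1.0/data"):
    # Element and attribute names follow the example in the message;
    # the function itself is hypothetical.
    return '<molecule id="toz"><string title="data" convention=%s>%s</string></molecule>' % (
        quoteattr(convention),
        wrap_cdata(cif_text),
    )


def extract_cif(xml_text):
    # The parser returns the CDATA body as ordinary character data.
    return ET.fromstring(xml_text).find("string").text


cif = "data_TOZ\n_chemical_formula_sum 'C18 H25 N O3'\n# awkward: ]]> & <x>\n"
assert extract_cif(embed_cif(cif)) == cif
```

The round trip shows that the CIF text, including characters that are special in XML, comes back unchanged; nothing about its meaning is interpreted.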
[If there had been <atom> children, JUMBO3 would have drawn a molecule.]
JUMBO recognises the keyword "convention" (this is part of CML). There is no
agreed way of treating this in CML at present. The intention is that there
will be a list of conventions which CML recognises. This is the real
challenge! The possibilities are:

1 the system simply carries the information through. This is likely to be
the first phase. At least we avoid information loss.

2 the system can hyperlink to appropriate dictionaries. This is also
possible if the convention-provider produces a URL. Thus:

<cml:float convention="IUCr" title="_cell.measurement_temperature"
    units="K">293</cml:float>

could be processed to something like:

<a href="http://www.iucr.org/cif/core/dic.html#_cell.measurement_temperature">_cell.measurement_temperature</a>: 293

so that at least the human reader knows what the quantity is and what it
means in human terms (by reading the dictionary)

3 there can be a mapping of equivalent terms. Thus <cml:builtin
type="a">... maps to cell.length_a in CIF. This has to be done manually and
hopefully agreed by curatorial humans.

4 the "terms" can have machine semantics included. STAR can do this through
dREL and Python/Java. JUMBO does it by associating an XML element with a
class through the DOM mechanism. There is still the question of how to
discover these semantics. In the first instance I suspect we shall compile
lists of conventions with which we can interoperate. Thus CML could know
that when it encountered a STAR/CIF term for which it had no mapping but
knew it was CIF from the convention attribute, it could extract the dRELs
(if any) and apply the Python/Java. This is getting rather hairy - but it
represents the current cutting edge.

5. We then run up against ontology. Do I mean the same by bond-order as CIF?
Probably not. In which case there has to be extensive mapping by humans.
This is a highly valuable, if very tedious activity.
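Possibility 2 above (hyperlinking a term to its dictionary entry) could look roughly like this in Python. It is a sketch under assumptions: the `DICTIONARY_URLS` table and the `link_term` function are hypothetical, the element is written without the cml: prefix to avoid a namespace declaration, and the only URL used is the one quoted in the message. An unknown convention is simply carried through unchanged, as in possibility 1.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup table: convention name -> dictionary page.
DICTIONARY_URLS = {"IUCr": "http://www.iucr.org/cif/core/dic.html"}


def link_term(xml_fragment):
    elem = ET.fromstring(xml_fragment)
    base = DICTIONARY_URLS.get(elem.get("convention"))
    if base is None:
        # Unknown convention: carry the information through untouched.
        return xml_fragment
    title = elem.get("title", "")
    value = (elem.text or "").strip()
    # The dictionary entry is addressed by using the term as a fragment.
    href = base + "#" + title
    return "<a href=%s>%s</a>: %s" % (quoteattr(href), escape(title), escape(value))


fragment = ('<float convention="IUCr" title="_cell.measurement_temperature" '
            'units="K">293</float>')
print(link_term(fragment))
```

This only helps the human reader find the definition; it attaches no machine semantics to the term.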
I suspect that we shall want to limit the number of conventions with which
each interacts. Thus for CML I would see:

- core XML tools (XSL, XSL-FO, Schemas)
- CIF
- MathML
- SVG
- UnitsML (if it happens)
- various Bio-MLs, possibly

and I have to be able to do some horrid stuff with the main legacy formats
in chemistry.

STAR/CIF will presumably have a similar list.

Back to the example. At present JUMBO would be able to:

- hold the CIF
- convert the CIF to generic XML (I have a CIFDOM for this. It has general
  elements like <data> and <loop>)
- extract some of the key equivalences (essentially cell params and atoms)
- orthogonalise things
- keep the other stuff safe but uninterpreted
- be able to write out a CIF or a CML file at the end.

The reverse might also be possible. Consider:

data_cmlfile
_cml
;
<molecule id="NaCl">
  <atomArray>
    <atom id="a1">
      <builtin type="elementType">Na</builtin>
    </atom>
    <atom id="cl1">
      <builtin type="elementType">Cl</builtin>
    </atom>
  </atomArray>
</molecule>
;

This is (I think) a valid CIF. CIF would have to decide how to wrap the
XML/CML - what metadata to provide, etc. I would suggest that this would
become increasingly common so that CML might wish to be able to support
namespaces, e.g.

_xml.namespace "http://www.xml-cml.org/dtd/V1.0"

> > XML seeing something I have already tried to tackle in CIF/STAR and
> > realising why I found it difficult! My analysis is that *semantically*,
> > XML and STAR are virtually identical.
>
>Yes, I would think so, otherwise the universality of XML would be brought
>into question. I think the new developments with respect to methods
>included in the dictionary definitions, and then compiling the dictionary
>into classes and object instantiations of those means we have moved on
>significantly from the view of STAR/CIF/DDL as piles of text. The
>dictionaries in our new system are executable, and any attempt to "access"
>a data item results in the Java/Python object for that data item being
>invoked.
>In this way all manner of validation, verification and evaluation
>can be done on data items. The Java/Python blend has worked for us because
>both support "reflection" (the programmatic term, not the crystallographic
>term), meaning that executable objects written in either code can be
>pulled in at run-time.

I now call these things "information objects". In one sense they can be
seen as documents - and XML has excellent tools for processing these - XSLT
and XSL-FO. On the other hand they are objects with methods and behaviour.
XML provides a DOM - which is fairly basic and mainly consists of navigating
the tree (and editing it) but there is no very good way of adding
element-specific semantics. Each ML has to make these up. XML schemas may
add a bit here but I think we are near the limit of consensus.

>I know all of this can be specified in XML but I think the generation of
>an executable version of the XML based dictionary isn't going to result
>from "tools", someone is going to have to knuckle down and write some
>significant code.

No question!! There are different sorts of tools:

1 generic tools. These can deal with a wide variety of documents/objects
but mainly move the material round. Examples are XSLT and DOM which can
reorder and reformat but don't do domain-specific stuff like inverting
matrices :-). However the generic tools will almost certainly form the basis
of editors, etc.

2 discipline-specific but application-independent. I put SVG, MathML, CML
in this class. They don't know what the user is going to do in detail. I
have also developed a generic dictionary application
(http://www.vhg.org.uk) which will browse any hierarchical dictionary. For
example it will index and search dictionaries. It may be useful for parts of
what CIF does.

3 application-specific. This could be a publishing application (e.g.
ActaCryst), a logfile analyser for X-ray refinement, a database of crystal
structures, etc.
In general these will have to include several components of 2 mixed in
varying proportions.

I put CIF/STAR in 2, but note that it actually contains several
disciplines. Wherever possible these should be separated and modularised.
In some cases it will be seen that there are solutions which CIF needs to
provide (e.g. validation against crystallographic concepts); in others
(typesetting) it may be easier to convert CIF to another approach.

> > My own approach to CIF - which is the only one that I personally can write
> > code for - is to transform it into XML and use XML tools. This may seem
> > like heresy, but ... If I wish to use CIF/STAR syntax then I write DOM and
> > XSL-based converters in both directions. This does NOT mean I abandon the
> > CIF effort - quite the reverse. In CML I specifically support the use of
> > ontologies (dictionaries) from IU's and other learned bodies and I put the
> > CIF dictionaries at the top. But I have to convert them to XML to make the
> > reading, editing, display, validation, etc. possible.
>
>Doesn't sound like heresy to me. It sounds like astute and sensible re-use
>of existing technologies to leverage up CIF/STAR as a usable format.

Glad it's not heresy!

> > Apart from syntax (above) checking should involve:
> > document VS dictionary (equivalent to XML validation). not trivial
> > dictionary VS DDL ditto but additional effort
> > DDL vs DDL
>
>We do this in STAR. In fact everything is driven through the dictionaries,
>the discipline specific dictionary and the DDL dictionary (the dictionary
>that defines the DDL language)
>
> > All these are equivalent to XSLT operations. For example, sorting a CIF is
> > not trivial.
>
>What do you mean by "sorting a CIF"?

I mean that some readers/authors may wish to view a CIF in a different
order from the authors/readers, perhaps based on category names, or a local
set of "most important data items". This is what XSLT is good at.
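To make "sorting a CIF" concrete, here is a minimal sketch in Python rather than XSLT. It assumes the data items have already been parsed into (name, value) pairs, ignores loops entirely, and treats the first word of a DDL1-style name as its category; the `sort_items` function and that category rule are assumptions for illustration only.

```python
def sort_items(items, priority=()):
    """Order (name, value) pairs: priority names first, then by category."""
    # Position of each "most important" data name in the local list.
    rank = {name: i for i, name in enumerate(priority)}

    def category(name):
        # Assumption: _cell_length_a belongs to the category "cell".
        return name.lstrip("_").split("_", 1)[0]

    def key(item):
        name = item[0]
        # Names outside the priority list all share the lowest rank.
        return (rank.get(name, len(rank)), category(name), name)

    return sorted(items, key=key)


items = [
    ("_chemical_formula_weight", "303.40"),
    ("_cell_measurement_temperature", "293"),
    ("_chemical_formula_sum", "'C18 H25 N O3'"),
]
for name, value in sort_items(items, priority=["_chemical_formula_sum"]):
    print(name, value)
```

With the CIF held as generic XML, the same reordering would be expressed as a sort over the item elements in a stylesheet.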
> > I now use java because it is the lingua franca of the web and the first
> > tool for XML developers. It also comes with a huge library (e.g. Date,
> > Math, Collections, etc.) which simplify a lot of things.
>
>We have focussed on both Java and Python. I think Java is more than just a
>web language (though many will disagree) and I think it will probably
>survive any onslaught from Microsoft's C#. We use the reflection
>capabilities of both languages to write code components in either
>language.

Agreed. Personally Java is ideal for what I do - development - but I don't
want to force it on others. Jon Bosak has commented on how java and XML
complement each other.

> > No - please no! I have horror stories of TIFFs and GIFs for chemistry.
> > Sometimes when rescaled bits disappear and bonds can literally
> > disappear. 4 can be transformed to +, etc.
> >
> > I strongly urge SVG - the new graphics language from W3C. It's gorgeous.
> > See http://www.adobe.com/svg for some examples.
> > See also http://www.xml-cml.org
>
>I had a quick browse of svg. Very impressive, but plug-ins are restricted
>to WinTel and Macintosh. The fact that Adobe is behind it and given the
>great job they have done with postscript and pdf I think svg is very
>likely to be around for a while.

There is a CSIRO applet/application in Java which is being used and
developed by the Apache/FOP effort. Also there was a very good early IBM
java implementation. I am not sure which efforts are being pursued most
strongly. I believe that Netscape6 also has woken up to SVG in the end but
I can't quote...

P.
>Nick > >-------------------------------- >Dr Nick Spadaccini >Department of Computer Science voice: +(61 8) 9380 3452 >University of Western Australia fax: +(61 8) 9380 1089 >Nedlands, Perth, WA 6907 email: nick@cs.uwa.edu.au >AUSTRALIA web: http://www.cs.uwa.edu.au/~nick > Peter Murray-Rust, Director Virtual School of Molecular Sciences Pharmaceutical Sciences, University of Nottingham, NG7 2RD, UK Tel: +44-(0)-115-951-5087 Fax: +44-(0)-115-951-5110 http://www.vsms.nottingham.ac.uk