Re: Fine-tuning CIF dictionary regexes
- Subject: Re: Fine-tuning CIF dictionary regexes
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 16 Jun 2005 09:20:57 -0400
- In-Reply-To: <1118909116.18478.100.camel@anbf10>
- References: <1118909116.18478.100.camel@anbf10>
Now that CIF can handle long lines and has documented handling of
special characters, it should be feasible to convert any previously
fudged regular expressions into fully POSIX-compliant regexes that can
be used for automatic validation.  I would propose that we start
collecting and testing a full set of compliant regexes for the types
in, say, the mmCIF dictionary, and, once we have general agreement on
the expressions, update the dictionaries and our on-line documentation.

I have appended John's list from the current mmCIF dictionary, which
is, I think, in fairly good shape.

I would suggest we do as much as we can before the IUCr meeting in
Florence, so that those of us who are at Florence can have a productive
discussion.

  -- Herbert

####################
## ITEM_TYPE_LIST ##
####################
#
#
#  The regular expressions defined here are not compliant
#  with the POSIX 1003.2 standard as they include the
#  '\n' and '\t' special characters.  These regular expressions
#  have been tested using the version 0.12 of Richard Stallman's
#  GNU regular expression library in POSIX mode.
#
#
#  For some data items, a standard syntax is assumed. The syntax is
#  described for each data item in the dictionary, but is summarized here:
#
#  Names:     The family name(s) followed by a comma, precedes the first
#             name(s) or initial(s).
#
#  Telephone numbers:
#             The international code is given in brackets and any extension
#             number is preceded by 'ext'.
#
#  Dates:     In the form yyyy-mm-dd.
#
##############################################################################

loop_
_item_type_list.code
_item_type_list.primitive_code
_item_type_list.construct
_item_type_list.detail
           code      char
               '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words ...
;
           ucode     uchar
               '[_,.;:"&<>()/\{}'`~!@#$%A-Za-z0-9*|+-]*'
;              code item types/single words (case insensitive) ...
;
           line      char
               '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items ...
;
           uline     uchar
               '[][ \t_(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              char item types / multi-word items (case insensitive)...
;
           text      char
               '[][ \n\t()_,.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              text item types / multi-line text ...
;
           int       numb
               '-?[0-9]+'
;              int item types are the subset of numbers that are the negative
               or positive integers.
;
           float     numb
               '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
;              float item types are the subset of numbers that are the floating
               numbers.
;
           name      uchar
               '_[_A-Za-z0-9]+\.[][_A-Za-z0-9%-]+'
;              name item types take the form...
;
           idname    uchar
               '[_A-Za-z0-9]+'
;              idname item types take the form...
;
           any       char
               '.*'
;              A catch all for items that may take any form...
;
           yyyy-mm-dd  char
               '[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9][0-9]'
;              Standard format for CIF dates.
;
           uchar3    uchar
               '[+]?[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]'
;              data item for 3 character codes
;
           uchar1    uchar
               '[+]?[A-Za-z0-9]'
;              data item for 1 character codes
;
           symop     char
               '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
;              symop item types take the form n_klm, where n refers to the
               symmetry operation that is applied to the coordinates in the
               ATOM_SITE category identified by _atom_site_label.  It must
               match a number given in _symmetry_equiv_pos_site_id.

               k, l, and m refer to the translations that are subsequently
               applied to the symmetry transformed coordinates to generate
               the atom used.  These translations (x,y,z) are related to
               (k,l,m) by
                         k = 5 + x
                         l = 5 + y
                         m = 5 + z
               By adding 5 to the translations, the use of negative numbers
               is avoided.
;
           atcode    char
               '[][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*'
;              Character data type for atom names ...
;

At 5:05 PM +0900 6/16/05, James Hester wrote:
>On Mon Apr 18th Nick wrote:
>
>> POSIX compliance makes sure you exhaust the input string until you find
>> the longest matching sequence.  This is necessary to get the "correct"
>> token.
>
>I understand it as "leftmost, longest" so that the regexp engine must
>search through all alternative matches to find the longest.
>
>> But if you are throwing the "number" to a series of compiled regular
>> expressions won't 78.456(22) also match '7', an integer, and return the
>> INT token?  If that happens to be the first rule it comes across?
>
>My question came up in connection with validating a CIF against a
>dictionary: all I want is to be able to determine whether or not a given
>string matches the regexp, so rather than throwing a series of regexps
>at a string to get a token, I'm throwing a string corresponding to a
>data item value at a single regexp.  I had hoped to be able to read the
>regexps from the dictionary rather than hard code them.
>
>(As an aside, I have split CIF processing into syntax and validation, so
>that no tokenisation in terms of INT/FLOAT/NUMBER etc. happens during
>syntax checking.  All data values after the syntax stage are strings
>which are then inspected during validation).
>
>>> One suggestion is that these two regular expressions are re-ordered so
>>> that those alternatives in an alternation which are a subset of other
>>> alternatives come later.  This remains POSIX-compliant and means many
>>> non-POSIX engines will find the longest match.
>
>> Are you sure you can order the rules such that it eliminates all instances
>> of the problem you allude to?
>
>Not at all.  However, such a reordering will increase the number of
>regexp engines which will match the entire string.  POSIX correctness is
>maintained, so nothing is lost and something (not necessarily all the
>time) practical is gained in that Perl/Python/Tcl/? programmers can
>automate type checking.
>
>(This reply is so late because I seem to have dropped off the mailing
>list and only noticed that some discussion had occurred when checking
>the archive later on).
>
>James.
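
To make the whole-string question concrete, here is a small Python sketch that is not part of the original exchange. It applies the float and int constructs from the appended list to the value 78.456(22) using Python's `re` module, which is a backtracking engine rather than a POSIX leftmost-longest one. The helper name `matches_whole` and the pattern `FLOAT_REORDERED` are hypothetical; the reordering is only one illustration of the idea and is not necessarily the one proposed in the thread.

```python
import re

# Constructs copied from the item type list appended to this message.
INT = r'-?[0-9]+'
FLOAT = r'-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'

# Illustrative reordering (an assumption): the alternative that can consume
# more of the input is tried first. The set of accepted strings is unchanged.
FLOAT_REORDERED = r'-?(([0-9]*[.][0-9]+)|([0-9]+)[.]?)([(][0-9]+[)])?([eE][+-]?[0-9]+)?'


def matches_whole(construct, value):
    """Return True only if the entire value matches the construct."""
    return re.fullmatch(construct, value) is not None


value = "78.456(22)"

# Prefix matching: the backtracking engine stops at the first alternative
# that succeeds, so the original ordering yields only a prefix.
print(re.match(FLOAT, value).group(0))            # 78.
print(re.match(FLOAT_REORDERED, value).group(0))  # 78.456(22)

# Whole-string matching, as needed for validation, is independent of the
# order of the alternatives.
print(matches_whole(FLOAT, value))  # True
print(matches_whole(INT, value))    # False
```

For validation of a single value against a single construct, anchoring the match to the whole string sidesteps the ordering problem in this case; the reordering matters mainly for engines or APIs that only offer prefix or search matching.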
>
>
>_______________________________________________
>cif-developers mailing list
>cif-developers@iucr.org
>http://scripts.iucr.org/mailman/listinfo/cif-developers

-- 
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

   Office:        +1-631-244-3035
   Lab (KSC 020): +1-631-244-3451
                 yaya@dowling.edu
=====================================================
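
As one possible starting point for the testing proposed at the top of this message, the following Python sketch checks a few constructs from the list against sample values. The sample values, their expected outcomes, and the names `CONSTRUCTS`, `CASES` and `run_cases` are illustrative assumptions, not an agreed test suite. Python's `re` stands in for a POSIX engine, so only whole-string matching is exercised. The character-class constructs (code, line, text and so on) are left out because a backslash inside a bracket expression is a literal character in POSIX but an escape in Python.

```python
import re

# Constructs copied from the item type list in this message
# (numeric and structured types only).
CONSTRUCTS = {
    "int": r"-?[0-9]+",
    "float": r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?",
    "name": r"_[_A-Za-z0-9]+\.[][_A-Za-z0-9%-]+",
    "idname": r"[_A-Za-z0-9]+",
    "yyyy-mm-dd": r"[0-9]?[0-9]?[0-9][0-9]-[0-9]?[0-9]-[0-9][0-9]",
    "symop": r"([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?",
}

# Hypothetical cases: (type code, value, whether a whole-string match is expected).
CASES = [
    ("int", "-42", True),
    ("int", "4.2", False),
    ("float", "78.456(22)", True),
    ("float", "1e5", True),
    ("float", "abc", False),
    ("name", "_item_type_list.code", True),
    ("name", "item_type_list.code", False),
    ("idname", "atom_site", True),
    ("idname", "atom site", False),
    ("yyyy-mm-dd", "2005-06-16", True),
    ("yyyy-mm-dd", "16/06/2005", False),
    ("symop", "4_555", True),
    ("symop", "192_456", True),
    ("symop", "193", False),
    ("symop", "4_505", False),
]


def run_cases(constructs, cases):
    """Return the cases whose whole-string match result differs from the expectation."""
    failures = []
    for code, value, expected in cases:
        matched = re.fullmatch(constructs[code], value) is not None
        if matched != expected:
            failures.append((code, value, expected, matched))
    return failures


if __name__ == "__main__":
    for failure in run_cases(CONSTRUCTS, CASES):
        print("unexpected result:", failure)
```

Keeping the cases in a plain table makes it easy to run the same list through other engines (Perl, Tcl, a POSIX library) and compare the results construct by construct.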