Re: Fine-tuning CIF dictionary regexes
- Subject: Re: Fine-tuning CIF dictionary regexes
- From: Nick Spadaccini <nick@xxxxxxxxxxxxx>
- Date: Mon, 18 Apr 2005 15:16:47 +0800 (WST)
- In-Reply-To: <1113801106.28243.45.camel@anbf10>
- References: <1113801106.28243.45.camel@anbf10>
On Mon, 18 Apr 2005, James Hester wrote:

> The point I want to discuss boils down to: should the regular
> expressions in the CIF dictionary be fine-tuned to be compatible not
> only with POSIX-compliant regular expression engines?

POSIX compliance makes sure you exhaust the input string until you find the longest matching sequence. This is necessary to get the "correct" token.

> The following two constructs from mm_cif, although POSIX compliant, will
> not correctly match in a Perl or Python or Tcl regular expression (and
> any other NFA engine)
>
> floating point numbers:
>
> '-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?'
>
> symmetry operations
>
> '([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?'
>
> The problem is that the non-POSIX engines will go through the
> alternations (separated by |) in the above expressions from left to
> right, returning the first match, and as the second part is optional,
> there is no requirement to match it. In contrast, a POSIX engine must
> return the longest match. So e.g. if Python is fed the number
> 78.456(22), "78." will be matched by the floating point expression, as
> this satisfies the first part of the alternation, and everything else in
> the regular expression is optional.

But if you are throwing the "number" at a series of compiled regular expressions, won't 78.456(22) also match '7', an integer, and return the INT token, if that happens to be the first rule it comes across?

> One suggestion is that these two regular expressions are re-ordered so
> that those alternatives in an alternation which are a subset of other
> alternatives come later. This remains POSIX-compliant and means many
> non-POSIX engines will find the longest match.

Are you sure you can order the rules such that it eliminates all instances of the problem you allude to?

cheers

Nick

--------------------------------
Dr N. Spadaccini
Head of School
School of Computer Science & Software Engineering
The University of Western Australia
35 Stirling Highway
CRAWLEY, Perth, WA 6009 AUSTRALIA
voice: +(61 8) 6488 3452
fax: +(61 8) 6488 1089
email: nick@csse.uwa.edu.au
w3: www.csse.uwa.edu.au/~nick
CRICOS Provider Code: 00126G

_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
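
For readers who want to reproduce the behaviour discussed in this message, here is a minimal sketch using Python's `re` module as one example of a backtracking (non-POSIX) engine. The two `*_ORIG` patterns are copied from the quoted text. The `*_REORDERED` patterns and the helper `matched_prefix` are illustrative assumptions, not expressions taken from any CIF dictionary; they only show what the re-ordering suggestion could look like. The test value `192_555` is likewise made up for illustration.

```python
import re

# Patterns exactly as quoted from mm_cif in the message above.
FLOAT_ORIG = r"-?(([0-9]+)[.]?|([0-9]*[.][0-9]+))([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
SYMOP_ORIG = r"([1-9]|[1-9][0-9]|1[0-8][0-9]|19[0-2])(_[1-9][1-9][1-9])?"

# Hypothetical re-ordering: the more specific alternatives come first.
# These are an assumption made for this sketch, not official dictionary text.
FLOAT_REORDERED = r"-?(([0-9]*[.][0-9]+)|([0-9]+)[.]?)([(][0-9]+[)])?([eE][+-]?[0-9]+)?"
SYMOP_REORDERED = r"(19[0-2]|1[0-8][0-9]|[1-9][0-9]|[1-9])(_[1-9][1-9][1-9])?"


def matched_prefix(pattern: str, text: str) -> str:
    """Return the text matched at the start of `text` (hypothetical helper)."""
    m = re.match(pattern, text)
    return m.group(0) if m else ""


if __name__ == "__main__":
    # First alternative wins, and the remaining groups are optional.
    print(matched_prefix(FLOAT_ORIG, "78.456(22)"))       # 78.
    print(matched_prefix(SYMOP_ORIG, "192_555"))          # 1

    # With the re-ordered alternatives the whole value is consumed.
    print(matched_prefix(FLOAT_REORDERED, "78.456(22)"))  # 78.456(22)
    print(matched_prefix(SYMOP_REORDERED, "192_555"))     # 192_555

    # Requiring the pattern to cover the whole string forces backtracking
    # into the later alternative, even with the original ordering.
    print(re.fullmatch(FLOAT_ORIG, "78.456(22)") is not None)  # True
```

The sketch only checks two sample values. It does not show that re-ordering removes every instance of the problem, which is the open question raised above.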
- References:
- Fine-tuning CIF dictionary regexes (James Hester)