This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: grep



I believe that subdirectory of Rx 1.5 is just the old GNU regexp, the
one that's still used in Emacs.  It's a well-coded traditional
backtracking system, as opposed to a state-caching lazy nfa->dfa
converter, which is what Rx is.

I wouldn't mind using it, but I don't want to distribute a regexp with
Guile that's likely to conflict with what's already there on the system.

Credit where credit is due: I didn't write or design the old GNU
regexp; that's (I believe) Mike Haertel and Stallman.  Haertel did
grep, too.

GNU Grep uses a hierarchy of "false positive" tests (i.e., tests that
may report a match where there is none, but will never fail to report
a real match), which bottom out to the good old (but relatively slow)
GNU backtracking regexp system.

The advantage of the false positive tests is that, if you don't have
to be 100% right, you can be much faster.  For example, back
references make regular expressions actually much more powerful than
their name would suggest: (a*)b\1b\1 is actually a context-sensitive
language.  So you can't build a DFA (strictly speaking) for a Posix
regexp.  But you can build a DFA that recognizes a strict superset of
a Posix regexp --- for example (a*)b(a*)b(a*) is a superset of the
expression given above, and it's regular.  It'll run quickly, but
it'll accept stuff you don't want.  So GNU grep runs the DFA, and if
the DFA matches, then it runs the full Posix regexp matcher.

I think grep actually does yet another test before the DFA, using
Boyer-Moore to check for all constant strings in the pattern that it
can prove must be part of any match.