This is the mail archive of the gdb@sources.redhat.com mailing list for the GDB project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Another RFC: regex in libiberty


Jim,

[ This isn't cygwin-specific, so I removed it from the recipient list. ]

Your analysis is basically correct, but the requirement for "maximum bite" 
or "greediness" (as it's variously called) is quite common; it has been the 
behavior of every Unix-based or -inspired regular expression matcher I've 
worked with for about 20 years.

If there are regular expression matchers out there that do otherwise, I 
haven't encountered them.

The maximum bite requirement really is far from "insane," because without 
it there is no other well-defined, meaningful specification of how much to 
match when the regular expression is ambiguous w.r.t. the target text. It 
would hardly do to just return the first match found, since that could 
depend on implementation details. I'm not sure, but I'd want to think about 
whether relaxing maximum bite is significant w.r.t. the choice of an NFA 
vs. a DFA matcher (I don't know which approach regex uses).

It's fine to have an option to change this, I guess, but a regular 
expression matcher that doesn't implement maximum bite by default would not 
behave the way people expect at all. In fact, I'm not certain how a user 
could predict what would be matched if maximum bite were disabled.

It seems to me that if you have an on/off option for maximum bite, then 
the only meaningful choice when maximum bite is off is minimum bite, and 
for * that is always zero repetitions and for + it is one--so what's the 
point of the closure? And if no closure is involved, static examination of 
the regular expression (when the NFA or DFA is constructed) is enough to 
determine the maximum bite, so there's no performance hit to take. 
Implementing that optimization isn't difficult at all.
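For what it's worth, Perl-derived engines later settled on exactly this 
pair of choices, exposing minimum bite per operator as "non-greedy" 
quantifiers. A quick sketch using Python's re module (an illustration, not 
something from the libraries under discussion):

```python
import re

# Greedy (maximum bite): '*' takes as many characters as it can.
greedy = re.match(r'a*', 'aaa').group()

# Non-greedy (minimum bite): '*?' takes as few as it can --
# zero repetitions for '*?', exactly one for '+?'.
minimal_star = re.match(r'a*?', 'aaa').group()
minimal_plus = re.match(r'a+?', 'aaa').group()

print(repr(greedy))        # 'aaa'
print(repr(minimal_star))  # '' (empty match)
print(repr(minimal_plus))  # 'a'
```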

As you suggest, there are other, more subtle (but still well-defined) 
optimizations that make some common cases much better behaved. For 
example, detecting disjointness between the sets of characters that can 
match at the boundaries of closed (* and +) or alternated (|) 
sub-expressions can be used to eliminate backtracking. Differentiating 
parenthesized sub-expressions that must be tracked for independent 
extraction from those that only group the sub-expression they enclose to 
override the normal operator precedence can also be used to advantage (I 
think).

Anyway, I really don't think you should change this behavior--you'd be 
breaking regex. Maximum bite is specified for a reason--a good reason.

Randall Schulz
Mountain View, CA USA


At 22:49 2001-06-11, Jim Blandy wrote:

>"Eli Zaretskii" <eliz@is.elta.co.il> writes:
> > One notorious problem with GNU regex is that it is quite slow for many
> > simple jobs, such as matching a simple regular expression with no
> > backtracking.  It seems that the main reason for this slowness is the
> > fact that GNU regex supports null characters in strings.  For
> > example, Sed 3.02 compiled with GNU regex is about 2-4 times slower
> > on simple jobs than the same Sed compiled with Spencer's regex
> > library.  (The DJGPP port of Sed is actually distributed with two
> > executables, one built with GNU regex, the other with Spencer's, for
> > this very reason.)
>
>I'm suspicious of this assertion.  I've worked on GNU regexp in the
>past, and I don't see any reason this should be so.
>
>However, I was messing around with regexps a lot when GNU regexp
>suddenly became slow on certain regexps.  I looked into the cause, and
>it turned out that this was because GNU regexp had been made to comply
>with the POSIX regexp specification.  POSIX regexp semantics require
>that the regexp match the longest possible string (I may have the
>details wrong, but it's something like that).  For backtracking regexp
>engines (the GNU, Henry Spencer, and Perl regexp matchers are all of
>this design), that innocent-sounding constraint basically requires
>insanely slow behavior.
>
>GNU regexp has a flag that allows you to turn this behavior off, and
>get the traditional, faster, non-POSIX-compliant behavior.  So I don't
>see any reason the GNU regexp library couldn't serve all the GPL'd
>software's needs.
>
>----
>
>The details, for the curious:
>
>Suppose you have a regexp like this (assume the obvious
>metacharacters)
>
>   (FOObar|FOO)(barbar)+
>    ---a-- -b-  ---c--      <= labels for pieces of the regexp
>
>which you're matching against the string
>
>   FOObarbarbarbar
>   0  3  6  9  12
>
>and let's suppose you're calling the regexp library in a manner which
>asks "does a prefix of this string match this regexp?"  (That is,
>you're not asking "does this regexp match this entire string?")
>
>The traditional behavior is for the regexp engine to match part ---a--
>of the regexp against data[0..5], match one repetition of part ---c--
>against data[6..11], and say it's done.  The Perl regexp matcher will
>return this match.
>
>But this isn't the behavior POSIX requires.  POSIX says you must
>return the *longest possible* match.  So a POSIX-compliant matcher
>must match -b- against data[0..2], and then two repetitions of ---c--
>against data[3..8] and data[9..14].  This is a longer match.
>
>To find the longest match, in general, a backtracking matcher has to
>generate every possible match, and return the longest one it found.
>This is what GNU regexp does.
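[ The example above is easy to check against a modern backtracking engine. 
Python's re module implements the traditional Perl-style semantics rather 
than POSIX leftmost-longest, so it returns the shorter match: ]

```python
import re

# Traditional (Perl-style) semantics: the first alternative that
# lets the rest of the pattern succeed wins, even if another
# choice would have produced a longer overall match.
m = re.match(r'(FOObar|FOO)(barbar)+', 'FOObarbarbarbar')

print(m.group(0))  # 'FOObarbarbar' -- 12 chars, not the full 15
print(m.group(1))  # 'FOObar'       -- alternative ---a--, not -b-

# A POSIX leftmost-longest engine would instead match all 15
# characters, with group 1 = 'FOO' and two 'barbar' repetitions.
```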
>
>So, just how bad is this?  Well, suppose you're matching a regexp like:
>
>         .*.*.*.*.*.*
>
>against a string like
>
>      aaaaaaaaaaaaaaaaaaaa
>
>To generate every possible match, you have to choose every possible
>way to divide up those twenty a's amongst six .* patterns.  That is
>the number of ways to split twenty characters into six ordered,
>possibly empty runs: C(25,5), or 53,130 matches you have to try.  In
>general, I think the time to match POSIXly can increase exponentially
>in the length of your regexp, given a long enough data string.
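[ The exact count here is a stars-and-bars calculation: splitting 20 
characters into 6 ordered, possibly empty runs gives C(25,5) = 53,130, 
which grows polynomially in the string length but exponentially in the 
number of .* pieces. A short brute-force sketch confirms it: ]

```python
from math import comb

def splits(chars, pieces):
    """Ways to divide `chars` characters into `pieces` ordered,
    possibly empty runs -- i.e., the divisions a backtracking
    matcher must enumerate for `pieces` copies of '.*'."""
    if pieces == 1:
        return 1  # the last piece takes whatever is left
    # Give the first piece k characters, recurse on the rest.
    return sum(splits(chars - k, pieces - 1) for k in range(chars + 1))

# Twenty a's divided among six '.*' pieces:
print(splits(20, 6))  # 53130
# ...which matches the stars-and-bars count C(20 + 6 - 1, 6 - 1):
print(comb(25, 5))    # 53130
```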
>
>If you have a smart regexp compiler (I understand Perl is pretty
>clever), then you could probably handle this regexp with a bit more
>aplomb.  But I'll bet that I can find a regexp where you really do
>have to try all matchings, no matter how smart your regexp compiler
>is.
>
>(So think of that the next time you propose some innocent-sounding
>constraint, like "longest match always"!)
>
>Anyway, the outcome is that all the really popular regexp matchers
>either don't implement the POSIX behavior, or provide options to turn
>it off.  For GNU regexp, you can use the RE_NO_POSIX_BACKTRACKING
>flag, and you'll get the traditional not-always-the-longest-match nice
>fast behavior.  Perl simply documents the traditional behavior ("The
>[Perl regexp] engine thinks locally and acts globally," as the Camel
>book puts it).
>
>--

