This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.
[Bug regex/12811] New: regexec/re_search consumes huge amounts of memory
- From: "emil at wojak dot eu" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: Thu, 26 May 2011 15:29:17 +0000
- Subject: [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory
- Auto-submitted: auto-generated
http://sourceware.org/bugzilla/show_bug.cgi?id=12811
Summary: regexec/re_search consumes huge amounts of memory
Product: glibc
Version: 2.13
Status: NEW
Severity: normal
Priority: P2
Component: regex
AssignedTo: drepper.fsp@gmail.com
ReportedBy: emil@wojak.eu
Created attachment 5753
--> http://sourceware.org/bugzilla/attachment.cgi?id=5753
Fix for huge memory usage
The bug is triggered under the following circumstances:
- a multibyte character encoding is in use, e.g. pl_PL.UTF-8
- either a translation table is used or the RE_ICASE flag is set
- the input buffer ends with a UTF-8 character cut in the middle, e.g.
  aaaaaaaaaaaa\xc4
- the regex does not match the input buffer and is of a kind that re_search
  would apply starting at each position of the input buffer, e.g. [^b]*ab
  or simply .*ab
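For illustration, the third condition (a buffer ending in a truncated UTF-8 character) can be detected on the caller's side. The following is only a minimal sketch and is not part of the attached patch; utf8_incomplete_tail is a hypothetical helper name, and it only inspects the byte pattern at the end of the buffer without validating the rest of the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from glibc): return the number of trailing bytes
   that form an incomplete UTF-8 sequence, or 0 if the buffer does not end
   in the middle of a character.  */
static size_t utf8_incomplete_tail(const unsigned char *s, size_t n)
{
    size_t i = n;
    size_t back = 0;

    assert(s != NULL || n == 0);

    /* Step back over at most 3 continuation bytes (10xxxxxx).  */
    while (i > 0 && back < 3 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return 0;                       /* no lead byte found */

    unsigned char lead = s[i - 1];
    size_t need;                        /* length announced by the lead byte */

    if ((lead & 0x80) == 0x00)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return 0;                       /* stray continuation or invalid lead */

    size_t have = back + 1;             /* lead byte plus continuations seen */
    return have < need ? have : 0;
}
```

For the 13-byte input aaaaaaaaaaaa\xc4 the helper returns 1, so a caller using re_search could, as a possible workaround, shorten the length argument by that amount.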
Here's a sample program that consumes 1.4 GB on a 32-bit architecture and
5.2 GB on a 64-bit machine (measured with valgrind --tool=massif):
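#include <regex.h>
#include <locale.h>

int main(void) {
    regex_t preg;

    /* A multibyte (UTF-8) locale is one of the trigger conditions. */
    setlocale(LC_CTYPE, "en_US.UTF-8");
    /* Case-insensitive matching is another trigger condition. */
    regcomp(&preg, ".*ab", REG_ICASE);
    /* The input ends with 0xc4, the first byte of a truncated UTF-8 character. */
    regexec(&preg, "aaaaaaaaaaaa\xc4", 0, NULL, 0);
    regfree(&preg);
    return 0;
}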
The excessive memory usage is caused by calling extend_buffers on each
re_search_internal iteration, even though the internal buffers are already
long enough to hold the whole string. When the matching procedure reaches
mctx->input.valid_len, the internal buffer size is doubled and the rest of the
input buffer is converted to wchar_t, except for the last byte, which is a
UTF-8 character cut in the middle. This last character is never converted,
because its continuation never comes, but the internal buffers are still
needlessly doubled.
A patch solving this problem is attached.
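The growth pattern described above can be illustrated with a toy model. This is not glibc code and does not reproduce the attached patch. model_growth and its parameters are hypothetical names, the "guarded" branch is only an assumption about one possible shape of a fix, and the numbers are not meant to match the measured figures.

```c
#include <stddef.h>

/* Toy model, not glibc code: return the buffer length (in elements) after
   `starts` search start positions, each of which reaches the end of the
   converted input and requests a buffer extension.  */
static size_t model_growth(size_t input_len, size_t buf_len,
                           size_t starts, int guarded)
{
    for (size_t i = 0; i < starts; i++) {
        /* Assumed guard: skip the extension when the buffer can already
           hold the whole input.  */
        if (guarded && buf_len >= input_len)
            continue;
        buf_len *= 2;           /* unconditional doubling otherwise */
    }
    return buf_len;
}
```

With a 13-byte input, an initial length of 14 and 13 start positions, model_growth(13, 14, 13, 0) yields 114688 elements, while the guarded variant model_growth(13, 14, 13, 1) stays at 14.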
There's another issue. Once the internal buffers are long enough to hold at
least half of the input buffer, they shouldn't be doubled, because that
wastes memory. Instead, it's enough to extend them to the actual length of the
input buffer. This can save significant amounts of memory for long input
buffers.
A patch for this issue is attached as well.
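The capped growth policy described above could be sketched as follows. This is a minimal illustration under assumptions, not the attached patch; next_buf_len is a hypothetical name and the real code may compute the new length differently.

```c
#include <stddef.h>

/* Hypothetical growth policy: double the buffer while it holds less than
   half of the input, otherwise grow exactly to the input length.  */
static size_t next_buf_len(size_t cur_len, size_t input_len)
{
    if (cur_len >= input_len)
        return cur_len;                 /* already large enough */
    if (cur_len >= input_len - cur_len)
        return input_len;               /* doubling would overshoot */
    return cur_len * 2;                 /* still below half: double */
}
```

For a 1000-element input, a 100-element buffer grows to 200, whereas a 600-element buffer grows to 1000 rather than 1200.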