This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

memmem.c improvement

From: Christian Thaeter <ct at pipapo dot org>
To: libc-alpha at sourceware dot org, git at vger dot kernel dot org
Date: Sat, 01 Dec 2007 01:48:15 +0100
Subject: memmem.c improvement

Short story, 'memmem()' is a gnuish, nonstandard function. I wanted to
provide a generic fallback in some code. So, lets borrow it somewhere else:

First looking at git's compat/memmem.c which I found was suboptimal with
roughly O(M*N) performance (albeit ok for that case since it was just a
generic fallback).

Well, second taking a look at glibc's source surprised me, there is the
same code as in git. I somewhat expected a faster implementation from a
generic C library.

That thought and done, the code is attached to this mail. The algorithm
is similar to the Rabin/Karp method for string searches but uses a
weaker and simpler additive rolling hash. The result is still somewhat
like O(M+N) for most cases (There may be corner cases where it not that
good, but its really hard to imagine those). Anyways, it is always
faster than the current implementation, in my tests with common data
(parsing multipart/form-data cgi responses) it yields approx 20%-100%
improvements and with little artificial cheating (having strings in
haystack which only differ at the last char of needle) the improvements
are much better (500%, since is another complexity class). The fact is
that the old implementation does a brute force search which has corner
cases which are quite easy to hit (repeating symbol sequences, small set
of possible symbols) while for my implementation the corner cases are
much more rare and even then, it will still perform better than the old
implementation.

The code is not much more complex as the old one, not the original
'Rabin/Karp' algorithm because that would require a efficient modulo
operation with a prime number which might be slower on some platforms
(just a guess, I didn't even tried, the performance satisfies me
perfectly). Other search algorithms like 'Knuth/Morris/Pratt' or
'Boyer/More' are more complex to implement and require O(Needle) extra
memory for common implementations, which reserve them for special
purposes imo. I just wanted to keep it simple but still considerably
better than the current one.

Feel free to use the code in glibc and/or git.

	Christian

/* Copyright (C) 1991,92,93,94,96,97,98,2000,2004 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <stddef.h>
#include <string.h>

#ifndef _LIBC
# define __builtin_expect(expr, val)   (expr)
#endif

#undef memmem

/* Return the first occurrence of NEEDLE in HAYSTACK. */
void *
memmem (haystack, haystack_len, needle, needle_len)
     const void *haystack;
     size_t haystack_len;
     const void *needle;
     size_t needle_len;
{
  /* not really Rabin-Karp, just using additive hashing */
  char* haystack_ = (char*)haystack;
  char* needle_ = (char*)needle;
  int hash = 0;		/* this is the static hash value of the needle */
  int hay_hash = 0;	/* rolling hash over the haystack */
  char* last;
  size_t i;

  if (haystack_len < needle_len)
    return NULL;

  if (!needle_len)
    return haystack_;

  /* initialize hashes */
  for (i = needle_len; i; --i)
    {
      hash += *needle_++;
      hay_hash += *haystack_++;
    }

  /* iterate over the haystack */
  haystack_ = (char*)haystack;
  needle_ = (char*)needle;
  last = haystack_+(haystack_len - needle_len + 1);
  for (; haystack_ < last; ++haystack_)
    {
      if (__builtin_expect(hash == hay_hash, 0) &&
	  *haystack_ == *needle_ &&	/* prevent calling memcmp, was a optimization from existing glibc */
	  !memcmp (haystack_, needle_, needle_len))
	return haystack_;

      /* roll the hash */
      hay_hash -= *haystack_;
      hay_hash += *(haystack_+needle_len);
    }

  return NULL;
}

Follow-Ups:
- Re: (class=ham score=-4.96032) memmem.c improvement
  - From: Tor Myklebust

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]