This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.
[Bug libc/11261] malloc uses excessive memory for multi-threaded applications
- From: "rich at testardi dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 10 Feb 2010 13:10:19 -0000
- Subject: [Bug libc/11261] malloc uses excessive memory for multi-threaded applications
- References: <20100208202339.11261.rich@testardi.com>
- Reply-to: sourceware-bugzilla at sourceware dot org
------- Additional Comments From rich at testardi dot com 2010-02-10 13:10 -------
Hi Ulrich,
I apologize in advance, and I want you to know I will not reopen this bug again,
but I felt I had to show you a new test program that clearly demonstrates that
"The cost of large amounts of allocated address space is insignificant" can be
exceedingly untrue for heavily threaded systems using large amounts of
memory. In our product, we require 2x the RAM on Linux vs. other OSes because
of this. :-(
I've reduced the problem to a program that you can invoke with no options and
it runs fine, but with the "-x" option it thrashes wildly. The only
difference is that in the "-x" case we allow the threads to do some dummy
malloc/frees up front to create thread-preferred arenas.
The program simply has a bunch of threads that, in turn (i.e., not
concurrently), allocate a bunch of memory, and then free most (but not all!)
of it. The resulting allocations easily fit in RAM, even when fragmented. It
then attempts to memset the unfreed memory to 0.
The problem is that in the thread-preferred arena case, the fragmented
allocations are now spread over 10x the virtual space, and when accessed,
result in actual commitment of at least 2x the physical space -- enough to
push us over the top of RAM and into thrashing.
As a result, without the -x option, the program's memset pass runs in two
seconds or so on my system (8-way, 2 GHz, 12 GB RAM); with the -x option, it
can take hundreds to thousands of seconds.
I know this sounds contrived, but it was in fact *derived* from a real-life
problem.
All I am hoping to convey is that there are memory intensive applications for
which thread-preferred arenas actually hurt performance significantly.
Furthermore, turning on MALLOC_PER_THREAD can actually have an even more
devastating effect on these applications than the default behavior. And
unfortunately, neither MALLOC_ARENA_MAX nor MALLOC_ARENA_TEST can prevent the
thread-preferred arena proliferation.
The test-run outputs without and with the "-x" option are below; the source
code follows.
Thank you for your time. As I said, I won't reopen this again, but I hope
you'll consider giving applications like ours a "way out" of thread-preferred
arenas in the future -- especially since our future looks even bleaker with
MALLOC_PER_THREAD, and that's the direction you are moving (and for certain
applications, MALLOC_PER_THREAD makes sense!).
Anyway, I've already written a small binned block allocator that lives on top
of mmap'd pages for us on Linux, so we're OK. But I'd rather just use
malloc(3).
-- Rich
[root@lab2-160 test_heap]# ./memx2
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes = 1557606400
in use bytes = 743366944
Total (incl. mmap):
system bytes = 1562529792
in use bytes = 748290336
max mmap regions = 2
max mmap bytes = 4923392
--- cat /proc/29565/status | grep -i vm ---
VmPeak: 9961304 kB
VmSize: 9951060 kB
VmLck: 0 kB
VmHWM: 2517656 kB
VmRSS: 2517656 kB
VmData: 9945304 kB
VmStk: 84 kB
VmExe: 8 kB
VmLib: 1532 kB
VmPTE: 19432 kB
--- accessing memory ---
--- done in 3 seconds ---
[root@lab2-160 test_heap]# ./memx2 -x
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- allowing threads to create preferred arenas ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes = 1264455680
in use bytes = 505209392
Arena 1:
system bytes = 1344937984
in use bytes = 653695200
Arena 2:
system bytes = 1396580352
in use bytes = 705338800
Arena 3:
system bytes = 1195057152
in use bytes = 503815408
Arena 4:
system bytes = 1295818752
in use bytes = 604577136
Arena 5:
system bytes = 1094295552
in use bytes = 403053744
Arena 6:
system bytes = 1245437952
in use bytes = 554196272
Arena 7:
system bytes = 1144676352
in use bytes = 453434608
Arena 8:
system bytes = 1346199552
in use bytes = 654958000
Total (incl. mmap):
system bytes = 2742448128
in use bytes = 748234656
max mmap regions = 2
max mmap bytes = 4923392
--- cat /proc/29669/status | grep -i vm ---
VmPeak: 49213720 kB
VmSize: 49182988 kB
VmLck: 0 kB
VmHWM: 12052384 kB
VmRSS: 11861284 kB
VmData: 49177232 kB
VmStk: 84 kB
VmExe: 8 kB
VmLib: 1532 kB
VmPTE: 95452 kB
--- accessing memory ---
60 secs... 120 secs... 180 secs... 240 secs... 300 secs... 360 secs... 420
secs... 480 secs... 540 secs... 600 secs... 660 secs... 720 secs... 780 secs...
--- done in 818 seconds ---
[root@lab2-160 test_heap]#
[root@lab2-160 test_heap]# cat memx2.c
// ****************************************************************************
#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <inttypes.h>
#include <time.h>       // nanosleep(), time()
#include <malloc.h>     // malloc_stats()
#include <sys/types.h>  // uint

#define NTHREADS 100
#define ALLOCSIZE 16384
#define STRAGGLERS 100

static uint cpus;
static uint pages;
static uint pagesize;
static uint nallocs;
static volatile int go;
static volatile int done;
static volatile int spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void **ps;  // allocations that are freed in turn by each thread
static int nps;
static void **ss;  // straggling allocations to prevent arena free
static int nss;

void
my_sleep(
    int ms
    )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread(
    void *context
    )
{
    int i;
    int n;
    int si;
    int rv;
    void *p;

    n = (int)(intptr_t)context;
    while (! go) {
        my_sleep(100);
    }
    // first we spin to get our own arena
    while (spin) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }
    my_sleep(1000);
    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    // N.B. we leave 1 of every STRAGGLERS allocations straggling
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        if (i%STRAGGLERS == 0) {
            si = nallocs/STRAGGLERS*n + i/STRAGGLERS;
            assert(si < nss);
            ss[si] = ps[i];
        } else {
            free(ps[i]);
        }
    }
    done++;
    printf("%d ", done);
    fflush(stdout);
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);
    return NULL;
}

int
main(int argc, char **argv)
{
    int i;
    int rv;
    time_t n;
    time_t t;
    time_t lt;
    pthread_t thread;
    char command[128];

    if (argc > 1) {
        if (! strcmp(argv[1], "-x")) {
            spin = 1;
            argc--;
            argv++;
        }
    }
    if (argc > 1) {
        printf("usage: memx2 [-x]\n");
        return 1;
    }
    cpus = sysconf(_SC_NPROCESSORS_CONF);
    pages = sysconf(_SC_PHYS_PAGES);
    pagesize = sysconf(_SC_PAGESIZE);
    printf("cpus = %d; pages = %d; pagesize = %d\n", cpus, pages, pagesize);
    nallocs = pages/10/STRAGGLERS*STRAGGLERS;
    assert(! (nallocs%STRAGGLERS));
    printf("nallocs = %d\n", nallocs);
    nps = nallocs;
    ps = malloc(nps*sizeof(*ps));
    assert(ps);
    nss = NTHREADS*nallocs/STRAGGLERS;
    ss = malloc(nss*sizeof(*ss));
    assert(ss);
    if (pagesize != 4096) {
        printf("WARNING -- this program expects 4096 byte pagesize!\n");
    }
    printf("--- creating %d threads ---\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }
    go = 1;
    if (spin) {
        printf("--- allowing threads to create preferred arenas ---\n");
        my_sleep(5000);
        spin = 0;
    }
    printf("--- waiting for threads to allocate memory ---\n");
    while (done != NTHREADS) {
        my_sleep(1000);
    }
    printf("\n");
    printf("--- malloc_stats() ---\n");
    malloc_stats();
    sprintf(command, "cat /proc/%d/status | grep -i vm", (int)getpid());
    printf("--- %s ---\n", command);
    (void)system(command);
    // access the stragglers
    printf("--- accessing memory ---\n");
    t = time(NULL);
    lt = t;
    for (i = 0; i < nss; i++) {
        memset(ss[i], 0, ALLOCSIZE);
        n = time(NULL);
        if (n-lt >= 60) {
            printf("%d secs... ", (int)(n-t));
            fflush(stdout);
            lt = n;
        }
    }
    if (lt != t) {
        printf("\n");
    }
    printf("--- done in %d seconds ---\n", (int)(time(NULL)-t));
    return 0;
}
[root@lab2-160 test_heap]#
--
What       |Removed  |Added
----------------------------------------------------------------------------
Status     |RESOLVED |REOPENED
Resolution |WONTFIX  |
http://sourceware.org/bugzilla/show_bug.cgi?id=11261