[PATCH] malloc: Use current (C11-style) atomics for fastbin access

Carlos O'Donell <carlos@redhat.com>
Wed Jan 16 16:30:00 GMT 2019


On 1/16/19 11:04 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 1/16/19 7:30 AM, Florian Weimer wrote:
>>> * Anton Blanchard:
>>>
>>>> Hi Florian,
>>>>
>>>>>> I see a 16% regression on ppc64le with a simple threaded malloc test
>>>>>> case. I guess the C11 atomics aren't as good as what we have in
>>>>>> glibc.  
>>>>>
>>>>> Uh-oh.  Would you please check if replacing the two
>>>>> atomic_load_acquire with atomic_load_relaxed restore the previous
>>>>> performance?
>>>>
>>>> As you suspect, doing this does restore the performance. The two lwsync
>>>> barrier instructions must be causing the slow down.
>>>
>>> Okay, I'll post a patch to revert that commit.
>>>
>>> However, the old code had this in _int_malloc (where the arena lock is
>>> acquired):
>>>
>>> #define REMOVE_FB(fb, victim, pp)                       \
>>>   do                                                    \
>>>     {                                                   \
>>>       victim = pp;                                      \
>>>       if (victim == NULL)                               \
>>>         break;                                          \
>>>     }                                                   \
>>>   while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
>>>          != victim);                                    \
>>> …
>>>             REMOVE_FB (fb, pp, victim);
>>>
>>> And this in _int_free (without the arena lock):
>>>
>>>       do
>>>         {
>>>           /* Check that the top of the bin is not the record we are going to
>>>              add (i.e., double free).  */
>>>           if (__builtin_expect (old == p, 0))
>>>             malloc_printerr ("double free or corruption (fasttop)");
>>>           p->fd = old2 = old;
>>>         }
>>>       while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>>>              != old2);
>>>
>>> I really don't see what makes sure that the store of p->fd happens
>>> before the load of victim->fd.
>>>
>>> It works out on POWER for some reason.  I'm attaching a test case that
>>> should exercise these two code paths.
>>
>> This is still not a root cause analysis of the problem.
> 
> We know the root of the performance problem, the additional
> synchronization, and, on Aarch64, a consequence of what was supposed to
> be a minor algorithmic change.

OK, I hadn't seen an analysis of the instruction sequences, so I wasn't
sure if this was completely understood.

This raises the question: is the existing code correct for AArch64 and POWER64?

I guess we don't know without serious review.
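
For reference, here is a minimal stand-alone sketch of the push/pop pattern
under discussion, written with C11 atomics. This is illustrative only, not
the glibc code; the names are made up and the ABA/lifetime issues the real
fastbin code has to worry about are ignored. As far as I can tell, the
acquire on the load of the bin top in the pop is what pairs with the release
CAS in the push so that the subsequent read of victim->fd is ordered, and
that acquire is exactly what shows up as the extra lwsync on POWER:

/* Stand-alone sketch of a fastbin-style push/pop using C11 atomics.
   Illustrative only; not the glibc sources.  */
#include <stdatomic.h>
#include <stddef.h>

struct chunk
{
  struct chunk *fd;
};

static _Atomic (struct chunk *) fb;   /* Top of a single "fastbin".  */

/* _int_free-style push (no lock): publish P so that a thread that
   later pops it also observes the store to p->fd.  */
static void
push (struct chunk *p)
{
  struct chunk *old = atomic_load_explicit (&fb, memory_order_relaxed);
  do
    p->fd = old;
  while (!atomic_compare_exchange_weak_explicit (&fb, &old, p,
                                                 memory_order_release,
                                                 memory_order_relaxed));
}

/* _int_malloc-style pop: victim->fd is read *before* the CAS, so the
   acquire has to be on the load that observed the pushed value, not
   just on the successful CAS.  */
static struct chunk *
pop (void)
{
  struct chunk *victim = atomic_load_explicit (&fb, memory_order_acquire);
  while (victim != NULL
         && !atomic_compare_exchange_weak_explicit (&fb, &victim, victim->fd,
                                                    memory_order_acquire,
                                                    memory_order_acquire))
    continue;
  return victim;
}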

>> I think it is OK to revert your patch under the pressure of the upcoming
>> release and the fact that this is a serious performance regression.
>>
>> However, there is some due diligence that needs to be done here, and I
>> think the patch should *right back in* as soon as master opens. We already
>> have *lots* of other code using C11-style atomics in glibc, and we want
>> to make sure those work as best we can. This issue is either one of two
>> things:
>>
>> * non-C11 atomics are *wrong* for aarch64/power64, and the fastbins can
>>   eventually become corrupted, or worse, see unbounded growth as we
>>   corrupt the chunk list.
>>
>> or
>>
>> * non-C11 atomics are more optimal than the C11 atomics, and we can adjust
>>   the C11 atomics to be as good as the non-C11 versions with a little
>>   help from Arm and IBM.
> 
> It's also possible there's a different algorithm that's fast to
> implement with C11 atomics.

Or the existing C11-style atomics for AArch64 and POWER64 are pessimistic
and we might do better.
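
If someone wants to eyeball what the compiler actually emits for the two
orderings, a tiny probe like the following (hypothetical, not a glibc test)
compiled with "gcc -O2 -S" on ppc64le and aarch64 makes the difference in
barriers easy to compare:

/* Hypothetical probe: compare the code generated for a relaxed
   vs. an acquire atomic load on a given target.  */
#include <stdatomic.h>

_Atomic (void *) top;

void *
load_relaxed (void)
{
  return atomic_load_explicit (&top, memory_order_relaxed);
}

void *
load_acquire (void)
{
  return atomic_load_explicit (&top, memory_order_acquire);
}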

> I'm not a concurrency expert, but I really don't see a way to avoid the
> additional barriers with the current algorithm.  It would also be worth
> checking if fastbins are even a win on these architectures.

Right, it needs a review.

At least on x86_64, when we did the tcache vs. fastbin testing, the addition
of fastbins was still a win.

We'd have to run the simulator and workload on AArch64 to see, which we
can ask Arm to do... and IBM for POWER64.
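
Something along these lines (illustrative only, not the test case Anton or
Florian used) would be a starting point for such a measurement. Note that on
current glibc the tcache will absorb most of these allocations, so the
fastbin paths are only really exercised if tcache is disabled, e.g. via the
glibc.malloc.tcache_count tunable:

/* Illustrative threaded malloc/free micro-benchmark; build with
   "gcc -O2 -pthread".  Thread count and sizes are arbitrary.  */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS   8
#define ITERATIONS 10000000L
#define SIZE       32           /* Well within the fastbin range.  */

static void *
worker (void *arg)
{
  (void) arg;
  for (long i = 0; i < ITERATIONS; i++)
    {
      void *p = malloc (SIZE);
      if (p == NULL)
        abort ();
      free (p);
    }
  return NULL;
}

int
main (void)
{
  pthread_t threads[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    if (pthread_create (&threads[i], NULL, worker, NULL) != 0)
      abort ();
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);
  puts ("done");
  return 0;
}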

-- 
Cheers,
Carlos.


