This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Another tweak to the multiplication algorithm
- From: Siddhesh Poyarekar <siddhesh at redhat dot com>
- To: libc-alpha at sourceware dot org
- Date: Mon, 25 Feb 2013 17:51:12 +0530
- Subject: Re: [PATCH] Another tweak to the multiplication algorithm
- References: <20130213142137.GU15748@spoyarek.pnq.redhat.com>
Ping!
I think I have defended the rationale of the patch successfully in the
earlier discussion with Joseph. OK to check in?
Thanks,
Siddhesh
On Wed, Feb 13, 2013 at 07:51:37PM +0530, Siddhesh Poyarekar wrote:
> Hi,
>
> This is a second tweak to the mpa multiplication algorithm, which is
> based on the Karatsuba algorithm[1]. This reduces multiplication
> instructions in favour of additions and halves the number of
> iterations required in calculating a single mantissa digit. Joseph
> Myers had made the original suggestion of trying to implement
> something similar to the Karatsuba algorithm some months ago.
>
> This is a slight improvement on Karatsuba though. The Karatsuba
> algorithm suggests pre-computing x[i]*y[i] and using that in
> computation. I have pre-computed the sum of x[i]*y[i] from 1 to k-1
> for the Kth resultant mantissa digit to avoid doing the summation
> within the loop.
>
> There were no regressions reported due to this patch on x86_64.
>
> Performance:
>
> Using the same pow program as before and the same inputs:
>
> On x86_64 without patch:
>
> Total:1955474682, Fastest:185881, Slowest:658253, Avg:195547.468200
>
> On x86_64 with patch:
>
> Total:1567523991, Fastest:152083, Slowest:484606, Avg:156752.399100
>
> That gives us an improvement of 19.84% in the average case and 18.18%
> in the best case.
>
> OK to commit?
>
> Siddhesh
> [1] http://en.wikipedia.org/wiki/Karatsuba_algorithm
>
> * sysdeps/ieee754/dbl-64/mpa.c: Include alloca.h.
> (__mul): Reduce iterations for calculating mantissa.
>
> diff --git a/sysdeps/ieee754/dbl-64/mpa.c b/sysdeps/ieee754/dbl-64/mpa.c
> index 5b50b0d..f3656db 100644
> --- a/sysdeps/ieee754/dbl-64/mpa.c
> +++ b/sysdeps/ieee754/dbl-64/mpa.c
> @@ -43,6 +43,7 @@
> #include "endian.h"
> #include "mpa.h"
> #include <sys/param.h>
> +#include <alloca.h>
>
> #ifndef SECTION
> # define SECTION
> @@ -612,6 +613,7 @@ __mul (const mp_no *x, const mp_no *y, mp_no *z, int p)
> {
> int i, j, k, ip, ip2;
> double u, zk;
> + double *diag;
>
> /* Is z=0? */
> if (__glibc_unlikely (X[0] * Y[0] == ZERO))
> @@ -662,12 +664,33 @@ __mul (const mp_no *x, const mp_no *y, mp_no *z, int p)
> while (k > ip + ip2 + 1)
> Z[k--] = ZERO;
>
> - zk = Z[k] = ZERO;
> + zk = ZERO;
> +
> + /* Precompute sums of diagonal elements so that we can directly use them
> + later. See the next comment to know we why need them. */
> + diag = alloca (k * sizeof (double));
> + double d = ZERO;
> + for (i = 1; i <= ip; i++)
> + {
> + d += X[i] * Y[i];
> + diag[i] = d;
> + }
> + while (i < k)
> + diag[i++] = d;
>
> while (k > p)
> {
> - for (i = k - p, j = p; i < p + 1; i++, j--)
> - zk += X[i] * Y[j];
> + int lim = k / 2;
> +
> + if (k % 2 == 0)
> + /* We want to add this only once, but since we subtract it in the sum
> + of products above, we add twice. */
> + zk += 2 * X[lim] * Y[lim];
> +
> + for (i = k - p, j = p; i < j; i++, j--)
> + zk += (X[i] + X[j]) * (Y[i] + Y[j]);
> +
> + zk -= diag[k - 1];
>
> u = (zk + CUTTER) - CUTTER;
> if (u > zk)
> @@ -676,11 +699,32 @@ __mul (const mp_no *x, const mp_no *y, mp_no *z, int p)
> zk = u * RADIXI;
> }
>
> - /* The real deal. */
> + /* The real deal. Mantissa digit Z[k] is the sum of all X[i] * Y[j] where i
> + goes from 1 -> k - 1 and j goes the same range in reverse. To reduce the
> + number of multiplications, we halve the range and if k is an even number,
> + add the diagonal element X[k/2]Y[k/2]. Through the half range, we compute
> + X[i] * Y[j] as (X[i] + X[j]) * (Y[i] + Y[j]) - X[i] * Y[i] - X[j] * Y[j].
> +
> + This reduction tells us that we're summing two things, the first term
> + through the half range and the negative of the sum of the product of all
> + terms of X and Y in the full range. i.e.
> +
> + SUM(X[i] * Y[i]) for k terms. This is precalculated above for each k in
> + a single loop so that it completes in O(n) time and can hence be directly
> + used in the loop below. */
> while (k > 1)
> {
> - for (i = 1, j = k - 1; i < k; i++, j--)
> - zk += X[i] * Y[j];
> + int lim = k / 2;
> +
> + if (k % 2 == 0)
> + /* We want to add this only once, but since we subtract it in the sum
> + of products above, we add twice. */
> + zk += 2 * X[lim] * Y[lim];
> +
> + for (i = 1, j = k - 1; i < j; i++, j--)
> + zk += (X[i] + X[j]) * (Y[i] + Y[j]);
> +
> + zk -= diag[k - 1];
>
> u = (zk + CUTTER) - CUTTER;
> if (u > zk)