Re: [GOLD] add new method for computing a build ID


On Wed, Oct 3, 2012 at 3:36 PM, Cary Coutant <ccoutant@google.com> wrote:
>> The patch adds a new hash function for the build ID, in addition
>> to the two that are available now (SHA-1 and MD5). The new function
>> runs MD5 on chunks of the output file and then runs SHA-1 on the MD5
>> hashes of the chunks. This is easy to parallelize.
>
> Why use SHA-1 to combine the MD5 hashes? Why not just use MD5
> throughout? Or SHA-1 throughout? Is it the case that feeding MD5 into
> itself is known to be weaker than one MD5 pass? If the benefit is from
> parallelization, I don't really see why you'd need to switch from
> SHA-1 to MD5 -- couldn't you just add your approach on top of whatever
> hash function is selected?

Any of the above would work fine. SHA-1 has a larger output than MD5,
20 bytes vs. 16 bytes. I wanted to use SHA-1 as the "fallback" for
--build-id=tree, i.e., if the output file is small it is nicer to fall
back to a single hash than to bother parallelizing a small computation.
That suggested using SHA-1 of <something> as the normal behavior for
--build-id=tree, so that both the "normal" and "fallback" computations
output 20 bytes. The <something> could be any hash, and I chose MD5
for its slight speed advantage over SHA-1. So that's my reasoning, but
other ideas are fine too.
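
To make that concrete, here is a minimal sketch of the tree scheme
(illustration only, not the actual patch): MD5 over fixed-size chunks,
then SHA-1 over the concatenated MD5 digests, with plain SHA-1 as the
small-file fallback. It uses OpenSSL's one-shot MD5() and SHA1()
helpers for brevity; the chunk size and threshold below are made up,
and in the real patch each chunk's MD5 would run in its own task.

#include <openssl/md5.h>
#include <openssl/sha.h>
#include <cstddef>
#include <vector>

std::vector<unsigned char>
tree_build_id(const unsigned char* data, size_t size)
{
  const size_t chunk_size = 2 * 1024 * 1024;   // arbitrary
  const size_t small_file = 16 * 1024 * 1024;  // arbitrary

  std::vector<unsigned char> id(SHA_DIGEST_LENGTH);

  // Fallback: not worth parallelizing a small computation.
  if (size < small_file)
    {
      SHA1(data, size, id.data());
      return id;
    }

  // MD5 each fixed-size chunk of the output file.
  std::vector<unsigned char> md5s;
  for (size_t off = 0; off < size; off += chunk_size)
    {
      size_t len = size - off < chunk_size ? size - off : chunk_size;
      unsigned char digest[MD5_DIGEST_LENGTH];
      MD5(data + off, len, digest);
      md5s.insert(md5s.end(), digest, digest + MD5_DIGEST_LENGTH);
    }

  // Combine: SHA-1 over the concatenated per-chunk MD5 digests.
  // Both paths produce the same 20-byte output.
  SHA1(md5s.data(), md5s.size(), id.data());
  return id;
}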

> I've got an incremental linker patch (haven't posted it yet because I
> haven't finished writing the test cases) that recomputes the build ID
> for an incremental link by saving the hash context structure and
> streaming just the new data into it. At the time I was implementing
> that, I was thinking about rewriting the regular hash so that it
> would compute the hashes of chunks in each Relocate_task, then
> combine the resulting hashes at the end (adding in a few pieces not
> covered by the relocate tasks). The difference is that each chunk
> would be the set of contributions from an individual .o file, rather
> than a fixed-size chunk of the output file. I think this approach
> would have an advantage, though, in exploiting cache locality as
> we're writing the data to the output file, rather than starting up a
> whole new set of tasks to go back over the data.

That would be more complicated than what I proposed, especially if you
wanted to be sure of hashing everything, but it would also be a tad
faster. I don't know which approach is better.
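
For comparison, here is a rough sketch of that per-input-file variant
as I understand it (names hypothetical; the per-contribution MD5 step
would really run inside each Relocate_task, while its data is still
cache-hot in the output buffer):

#include <openssl/md5.h>
#include <openssl/sha.h>
#include <cstddef>
#include <vector>

// One entry per input file: the bytes it contributed to the output.
struct Contribution
{
  const unsigned char* data;
  size_t size;
};

std::vector<unsigned char>
combine_build_id(const std::vector<Contribution>& contributions,
                 const unsigned char* extra, size_t extra_size)
{
  std::vector<unsigned char> md5s;

  // Per-task step: hash each input file's contribution, collected
  // here in input order so the result is deterministic.
  for (const Contribution& c : contributions)
    {
      unsigned char digest[MD5_DIGEST_LENGTH];
      MD5(c.data, c.size, digest);
      md5s.insert(md5s.end(), digest, digest + MD5_DIGEST_LENGTH);
    }

  // Final step: fold in the pieces not covered by any relocate task
  // (headers, linker-generated sections), then hash the digest list.
  unsigned char digest[MD5_DIGEST_LENGTH];
  MD5(extra, extra_size, digest);
  md5s.insert(md5s.end(), digest, digest + MD5_DIGEST_LENGTH);

  std::vector<unsigned char> id(SHA_DIGEST_LENGTH);
  SHA1(md5s.data(), md5s.size(), id.data());
  return id;
}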

thanks,

Geoff

