This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: hash?



Miroslav Silovic (MS) writes:
MS> Greg Harvey <Greg.Harvey@thezone.net> writes:
>> It doesn't necessarily have to (that being the nature of hashes and
>> all); the code for strhash when strings are > 5 characters long only
>> uses five characters of the string to determine the hash; in this
>> case, the particular selection of hash size, strings, and string
>> lengths have given us the dreaded `worst case scenario'.

MS> I think the Right Thing to do here (both speedy and reasonably
MS> diffused) would be to step through string with step equal lenth >> 2.
MS> This would still use only 4 chars but once the string length is 8 or
MS> more, it would at least use reasonable snapshot of the string.

I am surprised only four characters are used. It can't be that
expensive to use more. In fact, if I just picked a ready-made
hash-function, I would be disappointed if not the whole string was
used to compute the key. You want a fast hash function? Then make one
yourself, because you know the data! 

I think your suggestion will just plant a similar problem for someone
else. Notice how the author of scm_strhash actually tried to make a
decent and "almost randomized" sample of the string by letting the
partially computed key modulo string length decide where to pick the
four characters. Also, notice how you suggested hash function would
work on Han-Wen's strings. The length is six, shift twice to get the
step value one. The same first four characters would be sampled...

The real problem is that the hash function is not described in
guile-ref so that a surprise like this is easily avoided. 

>> Looking at the code I'm curious about the selection of initial hash as
>> 264 % table size; I'd think it'd be better to make the generated hash
>> totally dependant on the value being hashed, rather than a fixed start
>> value for each hash for a particular size table (264 might not be the
>> best value to go there, either, but it's been a while since I've
>> looked deeply at hashes, so I might just be being prime-centric).

MS> True - making hash independant on the table size would also allow
MS> storing the full (nondivided) hashes into the table for dynamic
MS> resize/rehash.

264 seems to be totally arbitrary and only chosen to offset long
strings from the short ones. I don't see why 256 was not chosen, or
why a prime value would make a difference. 

Furthermore, the initial modulo operation is not needed since 264 is
small and even shifted << 8 should fit in a long. 

I would not want to discover that hash keys were stored in my hash
table. Dynamic resize is a luxury feature IMHO, and could come at
some cost. I work will large volumes of data and I would certainly be
disappointed if I found that an additional word was stored for each
item in the table. 

Regarding the code, isn't it possible to simply cast a block of four
characters as a long? I realize that there can be word alignment
problems, but given a string pointer, that should not be an
issue. This way, four bytes/characters could be processed in one
read. 

	Lars

-- 
Lars Arvestad               Dept. of Numerical Analysis and Computing Science
                       Royal Institute of Technology (KTH), Stockholm, Sweden

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]