This is the mail archive of the cygwin-talk mailing list for the cygwin project.



Re: Compressing hippos really fast


Sounds like he needs data-dedupe. Google "data de-duplication" for an array of vendors.

Phil Betts wrote:
Corinna Vinschen wrote on Tuesday, March 04, 2008 3:43 PM:

Hi,


does anybody know of a compression tool which is above all capable
of compressing really fast? The compression ratio is only a mild
concern; it's more important that the tool does not act as a
bottleneck when compressing files which are poorly compressible.
Unfortunately, the usual compression tools are more interested in a good
compression ratio than in good speed when streaming lots of data.


Here are a couple of disks which are supposed to be backed up.  Right
now this is done using a script which creates tar.gz archives of all
disks.  Some of these disks are quite big and contain many files which
are already compressed.  It turns out that gzipping these disks is
*the* bottleneck when backing up.

When not compressing, tar creates archives at 37MB/s.  When creating
tar.gz archives, the compression takes so much time that the speed
goes down to 6MB/s.  Using gzip --fast doesn't help much.  bzip is a
lot slower than gzip.

So the question is, does anybody know a compression tool which can be
used with tar and which doesn't slow down the backup by a factor of 6?
It would be cool to have a tool which is as quick as the hardware
compression used in modern tape drives, but that's just dreaming...



May the hippos be with you,
Corinna

I had this problem ages ago. My solution was to run two backups: one uncompressed, including only files matching the globs *.gz, *.t[bg]z, *.[zZ], *.bz2, *.zip etc., and one for the remainder, which was piped through gzip.
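
For illustration, the two-backup idea could be sketched in Python roughly as follows. This is only a sketch under assumptions, not the script I actually used: the function names (is_precompressed, split_backup) and the archive paths are made up, and it relies on Python's standard tarfile and fnmatch modules. The output archives should live outside the directory being backed up, otherwise they would be picked up by the walk.

```python
import fnmatch
import os
import tarfile

# Globs taken from the message above; extend the list as needed.
COMPRESSED_GLOBS = ["*.gz", "*.t[bg]z", "*.[zZ]", "*.bz2", "*.zip"]


def is_precompressed(name):
    """Return True if the file name matches one of the 'already compressed' globs."""
    base = os.path.basename(name)
    return any(fnmatch.fnmatchcase(base, pat) for pat in COMPRESSED_GLOBS)


def split_backup(root, plain_tar, gzip_tar, level=6):
    """Archive precompressed files into plain_tar (no compression)
    and everything else into gzip_tar (gzip-compressed).

    Hypothetical helper: both archive paths must lie outside 'root'.
    """
    with tarfile.open(plain_tar, "w") as plain, \
         tarfile.open(gzip_tar, "w:gz", compresslevel=level) as gz:
        for dirpath, _dirs, files in os.walk(root):
            for fname in files:
                path = os.path.join(dirpath, fname)
                arcname = os.path.relpath(path, root)
                # Route each file to the archive that suits it.
                target = plain if is_precompressed(fname) else gz
                target.add(path, arcname=arcname, recursive=False)
```

Something like split_backup("/data", "/backup/packed.tar", "/backup/rest.tar.gz") would then produce the two archives (the paths are placeholders). Matching on the file name only is a cheap heuristic; it will miss compressed formats that are not in the list.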


Even a fast compression algorithm just wastes time trying to compress previously compressed files. Also, most compressors work on some variant of Lempel-Ziv, so if they're fed a mixture of compressible and incompressible data, the incompressible data flushes the dictionary, making the compression of the compressible part worse.
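
The "wasting time" part is easy to check with a small experiment. The sketch below is an illustration, not a benchmark: it uses Python's zlib module (DEFLATE, the same algorithm gzip uses), and random bytes are an assumed stand-in for files that are already compressed. The helper name ratio is made up for this example.

```python
import os
import zlib


def ratio(data, level=1):
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, level)) / len(data)


# Repetitive text: an easy case for any Lempel-Ziv style compressor.
text = b"May the hippos be with you.\n" * 4000

# Random bytes: an assumed stand-in for already compressed files.
noise = os.urandom(len(text))

if __name__ == "__main__":
    print("text :", ratio(text))
    print("noise:", ratio(noise))
```

The repetitive text shrinks to a small fraction of its size, while the random data does not shrink at all, so every CPU cycle spent on it is lost. It does not demonstrate the dictionary effect on mixed input, only the first point.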

Phil


