Pack as pack can

Back in the days when data were transported on disks capable of holding a mere 360 kB, file compression was a vital and widely used technology. In the age of terabyte hard disks, file compression appears somewhat outdated, even obsolete, since storage space is cheap and available in abundance. More importantly, most modern file formats employ compression (lossy or lossless) anyway. Pictures, for example, are stored as PNG or JPEG, documents as PDF or ODT (which is nothing but a zipped XML container), music as MP3, and movies as MPEG-4. Very little is gained by compressing these files a second time, so why bother at all?

Well, I can only speak for myself. I use LaTeX for publications and presentations, so all my texts are stored as plain text. Graphics are included as ASCII-encoded Encapsulated PostScript, and thus as plain text. Data files are plain text anyway, as are Mathematica, shell, and Python scripts. All of my important files are plain ASCII. Now, plain text can be compressed very efficiently, and since I usually synchronize my computers over the net, I welcome small sizes: the bandwidth at my command is far from infinite.

Size, however, is only one aspect: the time spent on compression and decompression should not exceed the time saved in transmitting the smaller file. Otherwise, the undisputed champion of text compression, paq8, would win single-handedly. 😉
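To make this trade-off concrete, here's a back-of-the-envelope check (via awk) using the nz 0.08 (-cdP -t6) figures from the table below; the 1 MiB/s uplink is an assumption for illustration, the sizes and times are the measured ones:

```shell
#!/bin/bash
# Compression pays off when
#   t_compress + size_compressed/bandwidth + t_decompress < size_original/bandwidth
# Sizes and times below are the nz 0.08 (-cdP -t6) desktop figures from the
# table in this post; the 1 MiB/s uplink is an assumed value for illustration.
awk 'BEGIN {
    orig = 215336960            # original tar archive, bytes
    comp = 83332611             # compressed size, bytes
    bw   = 1048576              # assumed bandwidth: 1 MiB/s
    t_c  = 19.7; t_d = 2.9      # compression/decompression times, seconds
    printf "plain transfer:               %.0f s\n", orig/bw
    printf "compress + transfer + unpack: %.0f s\n", t_c + comp/bw + t_d
}'
```

With these (partly assumed) numbers, compressing roughly halves the total transfer time: about 102 s versus 205 s.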

For the comparison, I've selected a realistic example: the project folder of this paper, roughly 200 MB in size (precisely 215336960 bytes). This folder contains all files needed in the course of the project, including many whose size cannot be reduced significantly by further compression (such as figures in PDF format). Since several of the contenders are pure compressors rather than archivers, I've first archived the folder using 'tar -a -cvf test.tar paper/'.
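The procedure can be sketched as a small script. This is a minimal stand-in, not the actual benchmark: it generates a tiny sample folder in place of the ~200 MB 'paper/' project folder and loops over only a few of the compressors tested below, at default settings:

```shell
#!/bin/bash
# Minimal sketch of the benchmark procedure. The real test used the ~200 MB
# 'paper/' project folder; the sample data here is a compressible stand-in.
mkdir -p paper
yes "some compressible plain text" | head -n 5000 > paper/sample.txt

tar -cf test.tar paper/            # archive once, compress the tar repeatedly

for c in gzip bzip2 xz; do
    command -v "$c" >/dev/null || continue   # skip compressors not installed
    cp test.tar work.tar
    time "$c" work.tar             # compression time; yields work.tar.gz etc.
    wc -c work.tar.*               # compressed size in bytes
    time "$c" -d work.tar.*        # decompression time
    rm -f work.tar
done
```

The timings reported by the shell's 'time' are wall-clock seconds, which is what matters when waiting for a transfer.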

The following table displays all relevant data: the name of the contender in the first column, the size of the compressed tar archive in the second, and the compression and decompression times in the third and fourth, respectively. I've used the default settings unless otherwise noted. The first numbers in the two time columns stem from my dual-core desktop system (Core 2 Duo E6600, 2.4 GHz, 8 GB RAM), the second ones in brackets from a 24-core workstation (Opteron 8431, 2.4 GHz, 64 GB RAM). Draw your own conclusions (but note that only pigz and pbzip2 are fully parallelized).

Ok, ok: I use nanozip. And yes, nanozip is also available for Windows, even with a GUI. 😉

compressor      size [bytes]   compression [s]   decompression [s]
zip 3.0           99850765      16.5  (11.7)       4.3   (1.9)
gzip 1.4          99850626      13.4  (11.5)       3.6   (2.0)
pigz 2.1.6        99621622      11.3   (0.5)       4.0   (1.1)
7z 9.04¹          94904167      49.8   (5.6)      11.6   (8.0)
bzip2 1.0.5       94573460      59.0  (71.3)      16.5  (13.8)
pbzip2 1.0.5      94156558      35.0   (4.1)       9.5   (0.9)
rar 3.93          88617152      76.6  (72.4)       5.2   (4.0)
lzip 1.9          85559809     196.7 (192.4)      12.3   (9.5)
xz 4.999.9        85550424     155.2 (148.3)      10.2   (8.3)
7z 9.04           83407794      86.6  (64.6)       9.9   (9.8)
nz 0.08²          83332611      19.7  (10.2)       2.9   (1.0)
7z 9.04³          81677612     105.6  (98.7)       9.3   (9.3)
zpipe 2.00        81562842     518.7 (516.5)     558.4 (529.8)
rzip 2.1          81167559      54.0  (66.1)      20.8  (13.7)
nz 0.08           78614298      80.5  (81.3)      12.6  (11.7)
lrzip 0.45        77979977     136.8  (95.2)      20.7   (9.3)
paq9              75379214     418.0 (425.7)     439.2 (416.3)

Maximum compression reference (non-competitive):

paq8o             66929961     11068             11359

¹ with -m0=BZIP2 -mmt
² with -cdP -t6
³ with -t7z -m0=lzma -mx=9 -mfb=64 -ms=on -md=32m (upon popular request 😉 )