Pack as pack can

Back in the days when data were transported on disks capable of holding a mere 360 kB, file compression was a vital and widely used technology. In the age of terabyte hard disks, file compression appears somewhat outdated, even obsolete, since storage space is cheap and available in abundance. More importantly, most modern file formats employ compression (lossy or lossless) anyway. Pictures, for example, are stored as PNG or JPEG, documents as PDF or ODT (which is nothing but a zipped XML container), music as MP3, and movies as MPEG-4. Very little is gained by compressing these files a second time, so why bother at all?

Well, I can only speak for myself. I use LaTeX for publications and presentations, so all my texts are stored as plain text. Graphics are included as ASCII-encoded Encapsulated PostScript, and thus as plain text. Data files are plain text anyway, as are Mathematica, shell, and Python scripts. All of my important files are plain ASCII. Now, plain text can be compressed very efficiently, and since I usually synchronize my computers over the net, I welcome small sizes: the bandwidth at my command is far from infinite.

Size, however, is only one aspect: the time spent on compression and decompression should not exceed the time saved in transmitting the smaller file. Otherwise, the undisputed champion of text compression, paq8, would win single-handedly. 😉
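To make this trade-off concrete, here's a back-of-the-envelope check (via awk) using the nz 0.08 (-cdP -t6) figures from the table below; the 1 MiB/s uplink is an assumption for illustration, the sizes and times are the measured ones:

```shell
#!/bin/bash
# Compression pays off when
#   t_compress + size_compressed/bandwidth + t_decompress < size_original/bandwidth
# Sizes and times below are the nz 0.08 (-cdP -t6) desktop figures from the
# table in this post; the 1 MiB/s uplink is an assumed value for illustration.
awk 'BEGIN {
    orig = 215336960            # original tar archive, bytes
    comp = 83332611             # compressed size, bytes
    bw   = 1048576              # assumed bandwidth: 1 MiB/s
    t_c  = 19.7; t_d = 2.9      # compression/decompression times, seconds
    printf "plain transfer:               %.0f s\n", orig/bw
    printf "compress + transfer + unpack: %.0f s\n", t_c + comp/bw + t_d
}'
```

With these (partly assumed) numbers, compressing roughly halves the total transfer time: about 102 s versus 205 s.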

For the comparison, I've selected a realistic example: the project folder of this paper, roughly 200 MB in size (precisely 215336960 bytes). This folder contains all files needed in the course of the project, including many whose size cannot be reduced significantly by further compression (such as figures in PDF format). Since several of the contenders are pure compressors rather than archivers, I've first archived the folder using 'tar -a -cvf test.tar paper/'.
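The procedure can be sketched as a small script. This is a minimal stand-in, not the actual benchmark: it generates a tiny sample folder in place of the ~200 MB 'paper/' project folder and loops over only a few of the compressors tested below, at default settings:

```shell
#!/bin/bash
# Minimal sketch of the benchmark procedure. The real test used the ~200 MB
# 'paper/' project folder; the sample data here is a compressible stand-in.
mkdir -p paper
yes "some compressible plain text" | head -n 5000 > paper/sample.txt

tar -cf test.tar paper/            # archive once, compress the tar repeatedly

for c in gzip bzip2 xz; do
    command -v "$c" >/dev/null || continue   # skip compressors not installed
    cp test.tar work.tar
    time "$c" work.tar             # compression time; yields work.tar.gz etc.
    wc -c work.tar.*               # compressed size in bytes
    time "$c" -d work.tar.*        # decompression time
    rm -f work.tar
done
```

The timings reported by the shell's 'time' are wall-clock seconds, which is what matters when waiting for a transfer.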

The following table displays all relevant data: the name of the contender in the first column, the size of the compressed tar archive in the second, and the compression and decompression times in the third and fourth, respectively. I've used the default settings unless otherwise noted. The first numbers in the two time columns stem from my dual-core desktop system (Core 2 Duo E6600, 2.4 GHz, 8 GB RAM), the second ones in brackets from a 24-core workstation (Opteron 8431, 2.4 GHz, 64 GB RAM). Draw your own conclusions (but note that only pigz and pbzip2 are fully parallelized).

Ok, ok: I use nanozip. And yes, nanozip is also available for Windows, even with a GUI. 😉

compressor      size [bytes]   compression [s]   decompression [s]
zip 3.0           99850765      16.5  (11.7)       4.3   (1.9)
gzip 1.4          99850626      13.4  (11.5)       3.6   (2.0)
pigz 2.1.6        99621622      11.3   (0.5)       4.0   (1.1)
7z 9.04¹          94904167      49.8   (5.6)      11.6   (8.0)
bzip2 1.0.5       94573460      59.0  (71.3)      16.5  (13.8)
pbzip2 1.0.5      94156558      35.0   (4.1)       9.5   (0.9)
rar 3.93          88617152      76.6  (72.4)       5.2   (4.0)
lzip 1.9          85559809     196.7 (192.4)      12.3   (9.5)
xz 4.999.9        85550424     155.2 (148.3)      10.2   (8.3)
7z 9.04           83407794      86.6  (64.6)       9.9   (9.8)
nz 0.08²          83332611      19.7  (10.2)       2.9   (1.0)
7z 9.04³          81677612     105.6  (98.7)       9.3   (9.3)
zpipe 2.00        81562842     518.7 (516.5)     558.4 (529.8)
rzip 2.1          81167559      54.0  (66.1)      20.8  (13.7)
nz 0.08           78614298      80.5  (81.3)      12.6  (11.7)
lrzip 0.45        77979977     136.8  (95.2)      20.7   (9.3)
paq9              75379214     418.0 (425.7)     439.2 (416.3)

Maximum compression reference (non-competitive):

paq8o             66929961     11068             11359

¹ with -m0=BZIP2 -mmt
² with -cdP -t6
³ with -t7z -m0=lzma -mx=9 -mfb=64 -ms=on -md=32m (upon popular request 😉 )