No magic

Cobra

2016-11-06 13:52

In times of rain, fog and drizzles, I take the U-Bahn for commuting to and from work. I'm not as regular as your proverbial clockwork, but still punctual enough to see certain co-commuters on an almost daily basis. Some of them are individual enough to stick out of the crowd. Sir David, for example, a long thin figure with short grey hair and a short, accurately trimmed grey beard, invariably dressed in a checkered grey suit with a light grey shirt and a gunmetal grey tie, dark grey suede shoes and charcoal grey socks. Even his beyerdynamic headphones, which he's sure to wear, are grey, and while listening to Vivaldi, he's studying the daily Frankfurter Allgemeine with great interest and concentration.

Such intense is his concentration that he does not even notice Mike sitting next to him with a battered Nokia cell phone and talking very loudly, as every day, in an unidentifiable language with what I assume to be a subordinate of Mike. Mike is ebony black, 6'4'', 150+ kg, wears a gold Rolex and other golden accessories over a black Savile Row suit but still manages to look like a very serious and very worried business man. When U2 goes underground at Potsdamer Platz and his phone loses the connection (as every day), his worries grow troublesome for his health as accentuated by repeated violent bursts of shouting at his unfortunate subordinate who, luckily for him or her, can't hear the verbal assault because of the severed connection.

Despite this acoustic disturbance, Sir David remains concentrated on his newspaper and entirely misses Audrey stepping into the train. Audrey is a seriously cute American girl in her twenties with dazzling blue eyes and jet black hair she's wearing as an asymmetric bob. Unlike many Americans, she has a soft, melodious voice which is as pleasant to listen to as she's a pleasure to look at, despite her constant fumbling with a white Apple Watch. Today, however, she's discussing a technical subject with one of her colleagues from the advertising agency she's working for. The guy is the living prototype of a Berlin-Mitte hipster with the trade-mark combination of an undercut leaving a 2 cm ponytail stub at the back and a beard of Abrahamian dimensions in front, together with the indispensable oversized horn-rimmed black glasses fulfilling no medical function but representing a fashion statement.

He seems to believe, as many non-native speakers, that an excessive use of the F-word documents an intimate familiarity with both the American language and the American culture, and thus demonstrates that you are on the level. Know what I'm sayin'? Fuck, eh. Right now, he keeps whining, in a high wheezy voice not fitting the beard, that his brochure wasn't accepted because the fucking pdf was too fucking big, fuck eh. Know what I fucking mean, eh? Audrey knows, and contrary to what I would have expected, she's reacting in a totally enthusiastic way. Oh so true, she cheers, and adds immediately that she's recently found the solution for big pdfs that seem to be the major, if not the problem troubling her agency. A wonderful, an absolutely fabulous web service! Her original pdf of 500 MB reduced to 10 MB without any loss of quality! Pure magic! But ... who ... ?, her hipster colleague manages to ask, and she's shouting at him, full of delight: ILOVEPDF DOT COM!

Two days later one of my colleagues (with a PhD in physics) tells me that he used ilovepdf.com to compress the pdf of his recent publication, so that it's of a size suitable for uploading to arXiv. He shows me the result, and I'm impressed: there are no immediately obvious compression artifacts, although the file size has been reduced from more than eight to just one MB. Now I'm really getting curious. Is it possible that these ilovepdf guys are doing something ... clever? Perhaps they employ one of the new image formats, such as webp, bpg, or even flif? That would be most interesting, and I thus set out to get to the core of this business.

Several web services promise to compress images (or, more precisely, pixel graphics or, synonymously, raster image files, or short, bitmaps or pixmaps) of various formats or entire pdf documents (where the compression, of course, boils down to exactly the same: compressing the pixmaps embedded in the pdf container). All of them also promise to respect our privacy. For example, ilovepdf.com (and iloveimg.com) states:

Absolutely all uploaded files on ilovepdf.com are deleted from our servers one, two or twenty-four hours (depending on if a user is non registered, registered or premium) after been processed.

Hmmm. Why only the uploaded files? What about the compressed ones? I like the statement of smallpdf.com better:

Please note that uploaded and processed files are never stored longer than an hour on our servers and then are deleted permanently. During this hour your files are not accessed, copied, analyzed or anything else except we have the explicit permission of the user for example for a support case.

“Analyzed” is the key term here. Even if the uploaded and processed files are deleted, it takes only fractions of a second to extract the text of pdf documents, or the raw pixmaps embedded in them:

        less upload.pdf > text.txt
        mutool extraxt upload.pdf

On tinypng.com, we only read:

Submitted content will not be shared with third parties other than Voormedia’s service providers, unless required to comply with the law or requests of governmental entities. Voormedia uses service providers based in the USA.

For the present case of use (getting a file size acceptable for arXiv), all of that doesn't seem to matter. After all, our intention is to publish our content, not to keep it confidential. The same goes for Audrey and her agency. Still ... if you can do it yourself, why should you become dependent on others? And after you have made yourself depending on this service, what will you do when it really matters?

But can we do it ourselves? Are the results of these services within our reach, or are their makers truly magicians with capabilities beyond the John and Susan Does of the interwebs? To narrow it down to the point which matters most for me: can these services, given a file that I deem to be suitable for publication, significantly compress it further? After all, I know how to treat my images, don't I? Well, at least I believe I do.

I've made a comparison using several pdf documents (including the publication of my colleague above) as well with a few raster images. To my simultaneous relief and disappointment, the most frequent statement I got from the web services under consideration included:

We are sorry, your file is very well compressed and we can't compress it without quality loss.

Or, as honest as cute:

We compressed your file from 30.36 kB to 29.57 kB. That's not that much. Sorry.

No magic, no new formats. What a pity! At the same time, I was impressed by the technical quality of these services. All documents and images returned from them were an excellent compromise between image quality and size. Furthermore, I actually never managed to produce an image of equal quality but smaller file size. But I was always close with very little effort.

To give explicit examples: the publication of my colleague was 8.3 MB in size. The sole reason for the large size were a couple of images embedded as uncompressed tiff. I would normally compress these images prior to generating the pdf with pdfLaTeX, of course, but we can equally well compress them afterwards. Our two services return files of 0.95 (ilove) and 0.86 MB (small). I have to magnify them a five, even a tenfold to see the effects of compression. Yes, these services use lossy compression schemes, specifically jpeg, but they do that expertly.

What could John Doe do against specialized services jam-packed with expertise and knowledge on image compression? Well, I just tried to apply the little I know, like this (well-known) ghostscript one-liner:

            gs -sDEVICE=pdfwrite -dPDFSETTINGS=/default -dNOPAUSE -dQUIET 
            -dBATCH -sOutputFile=out.pdf in.pdf

resulting in

                gs (default):       1.41 MB     83% 
                gs (ebook)          1.05 MB     87%
                ilovepdf:           0.95 MB     89%         
                gs (screen)         0.94 MB     89%
                smallpdf            0.86 MB     90%

With the default setting, image quality is basically indistinguishable from the original and thus even better than the files produced by ilove and small. The higher compression stages of ghostscript with compression ratios rivaling those of ilove and small produce visual artifacts in this particular case, but are always worth a glance when trying to compress a pdf.

The conclusion of this exercise is obvious, I believe. Pixmaps for inclusion in documents should be compressed, most preferably losslessly. At the moment of writing, the most suitable format for images in my trade is png. These images can be compressed further (but lossily!) by reducing the range of colors (with pngquant, for example, or simply by converting them to greyscale). For photographs with soft contrasts and 16 M colors, jpeg remains the reigning format, but we have to be very careful not to reduce the image quality in an obvious way. And if you forgot all that and are facing a giant pdf that cannot be uploaded anywhere, use ghostscript.

That's it. That's all there is to know (at least concerning the current topic). I plan to look at new image formats, though, but those will be the subject of future posts. 😉