Oid's Graffel

I generally like Debian, as documented by the fact that it's my Linux distribution of choice for pdes-net.org, for the two compute servers at the office, for the Mini (which is currently out of order due to a defunct SSD), and for the virtual machine that I've reserved for online banking (biig mistake...see below). Since the stable version of Debian delivers only outdated software, I'm using 'testing' as the base, and if needed, I also install packages from 'sid'.

On my main systems, however, I don't use Debian, but Archlinux. I have several good reasons for this decision. One of them is that packages that belong in a museum are not confined to Debian Stable, but are also regularly found in Testing or Sid.

One example is 'look', which I've recently reported to be a fast way of finding an entry in a huge file. The version of look in Debian, however, contains a bug that was fixed ten years ago. Except, of course, in Debian (and all derivatives).

But what are 10 years if you can have 20? In 2010, c't presented a Perl script for downloading and processing the transactions from an account at Deutsche Bank. The script served me well for several years, but it broke a number of times due to changes of the web interface and Perl itself. I was able to fix the script the first four times, but the last time, about five years ago, I had to ask haui for help. And a few weeks ago, it simply broke completely, and I decided to let it go and extend my old bash script to process the csv files downloaded from Deutsche Bank.

Part of one of the new scripts is the following one-liner:

tail -n +4 $current_rates | iconv -f ISO8859-1 -t utf8 | awk '{split($0,a,";"); print a[14]}' | sed 's/,/./g' | bc -l | xargs printf %.2f"\n" | tr '\n' ' ' | awk '{print strftime("%Y-%m-%d")"\t"$7"\t"$6"\t"$1"\t"$5"\t"$4" \t"$2" \t"$3}'> $cleaned_rates

Worked perfectly on my notebook running Archlinux, but in the virtual machine reserved for online banking, I got the following error message:

mawk: line 2: function strftime never defined

Hmmm...

$ awk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

Are you kidding me? That's rather extreme even by Debian standards. Particularly when considering that version 1.3.4 was published in 2009, and strftime was added to it in 2012. But surely, sid has a more recent version...NOT 😒

Even CentOS 6 came with mawk 1.3.4. Shame on you, Debilian!

Well, the only choice was to install gawk, and in this particular case, the performance hit doesn't matter at all. But why isn't that the default, if the Debilians have chosen to neglect mawk? And why do they do that anyway?
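For what it's worth, the same processing can be done without relying on any particular awk. Here's a minimal sketch in Python of what the one-liner above does, as far as I can read it: skip three header lines, take the 14th semicolon-separated field, turn the decimal comma into a point, format to two decimals, and print the seven values in the order 7, 6, 1, 5, 4, 2, 3 after today's date. The function name, the file names in the usage comment, and the decision to skip short lines are my assumptions, and I use plain tabs where the original has a few stray spaces.

```python
from datetime import date

# Order in which the original awk call prints the seven values.
COLUMN_ORDER = (7, 6, 1, 5, 4, 2, 3)


def clean_rates(lines, today=None):
    """Reduce the CSV lines to one tab-separated record, like the shell pipeline."""
    today = today or date.today()
    values = []
    for line in lines[3:]:                       # tail -n +4: skip three header lines
        fields = line.rstrip("\r\n").split(";")  # awk: split($0, a, ";")
        if len(fields) < 14:
            continue                             # assumption: ignore short lines
        number = float(fields[13].replace(",", "."))  # a[14], decimal comma -> point
        values.append(f"{number:.2f}")           # printf %.2f
    picked = [values[i - 1] for i in COLUMN_ORDER]  # needs at least seven values
    return "\t".join([today.isoformat()] + picked)


# Usage with hypothetical file names:
# with open("current_rates.csv", encoding="iso-8859-1") as source:
#     record = clean_rates(source.readlines())
# with open("cleaned_rates.tsv", "w", encoding="utf-8") as target:
#     target.write(record + "\n")
```

The date comes from date.today().isoformat(), which gives the same YYYY-MM-DD string as strftime("%Y-%m-%d") in gawk.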

Well, whatever. The scripts are working now. 😉

Saving Nexie

We own a first-generation Nexus 7 from 2012, which my wife affectionately calls Nexie because of its compact form factor. It impressed me with its high build quality, particularly considering its modest price point of €199. Its performance was more than satisfactory with the stock Android 4.1, but when it got updated to Android 5.1 in 2015, it was reduced to an unresponsive brick, no matter what I tried (including a factory reset). We finally decided to retire it and to use it as a, well, wall clock. But recently, the time on the Nexus was often running slow by more than an hour. The reason was resource-intensive background processes updating the various Google apps installed by default.

Can we resurrect the Nexus by flashing it with an alternative ROM such as LineageOS? That's at least what I'm going to try. Note that I'm fairly ignorant with respect to Android and its devices. But anyway, let's get going.

1st step

I search the web for “Nexus 7 (2012) LineageOS”. To my relief, there's general agreement that our Nexus can run LineageOS 14.1, corresponding to Android 7.1. I thus download the image for our (GSM free) version of the Nexus 7. I also find some instructions regarding the installation. I learn from them that I have the option to install the Google Apps (particularly the play store) or not.

2nd step

I'm in favor of installing a pure LineageOS system free of Google Apps and using F-Droid instead of the Google Play store, but my wife pleads for the latter. Since she's the primary user, I look at the options on OpenGapps and download the pico build for ARM 32 bit, Android 7.1.

3rd step

To install the custom ROM and the Google Apps, I need a custom recovery image such as the one provided by TWRP. I download the latest version for my Nexus.

4th step

Prepare your device, they say. All right, I tap the 'Build Number' under 'About Phone' in 'Settings' seven times and thus become a developer (not a joke, it really works that way 😲). I then scroll down and enable USB debugging.

5th step

The instructions mention the commands adb and fastboot, which I find to be contained in the package android-tools. I thus install these tools on my Fujitsu Lifebook:

sudo pacman -S android-tools

6th step

I use the USB cable of my Kobo reader (a conventional microUSB-to-USB cable) to connect the Nexus to my Lifebook, and choose MTP in the USB dialog on the Nexus.

7th step

All right, here it comes. (In hindsight, I could certainly do better if I tried a second time. But anyway: it worked. 😎)

# adb reboot bootloader

Yup, the Nexus boots and is now in a kind of repair mode. 😊

Now as root:

$ fastboot oem unlock
$ fastboot flash recovery twrp-3.3.0-0-grouper.img

A subsequent 'adb reboot bootloader' doesn't work (I now believe that 'fastboot reboot' would have). I reboot the Nexus manually by navigating with the volume and power keys. I then switch in the same way to recovery mode, upon which TWRP starts up.

# adb push /home/cobra/Downloads/lineage-14.1-20171122_224807-UNOFFICIAL-aaopt-grouper.zip /sdcard/
# adb push /home/cobra/Downloads/open_gapps-arm-7.1-pico-20190426.zip /sdcard/

Next, as found in the instructions, I navigate in TWRP to the Wipe menu. For some reason, wiping fails, and I'm stuck in a boot loop. 😨

I search the web and find that boot loops are rather common. A recommended solution is to either erase or format the userdata:

$ fastboot erase userdata
$ fastboot format userdata

but that doesn't do anything (it just tells me that it's <waiting for device>). Only after holding power/volume-down for 10 s do I see the repair menu, and when I go to recovery mode, TWRP seems to finish what it had tried to do. Phew.

I have no idea what went wrong there, nor precisely how it was corrected. But I found one statement on the web that gave me courage: “as long as your device does anything when switching it on, it is NOT bricked.”

The rest is easy: I go to install, select LineageOS and Google Apps, install and reboot.

The result

Much better than I had hoped for. The interface reacts instantaneously, animations run smoothly, and apps start fast. It feels as good as new. Well, my wife says: even better. ☺

Search package providing a certain command

I posted a short note on this topic almost exactly ten years ago, and it's time for an update. The situation: you've heard or read about a certain tool and want to install it, but you can't find it no matter how hard you try.

My first advice: don't search and install via graphical applications. “Software centers” popular in consumer distributions may not show command line applications at all, so if you've read my last post and search for dc, you won't find what you are looking for.

Second: not every command comes in a package with the same name. For example, in Archlinux, dc is bundled with bc, and it is the latter (much more popular) application which gives the package its name.

To master such situations, it's time to leave graphical software centers behind and to learn a few basics about the actual package manager underneath. As an example, I'm showing a search for dig and drill, each of which is contained in a differently named package, with the name of these packages depending (as always) on the distribution.

Archlinux

pacman -Fs dig
        bind-tools
pacman -Fs drill
        ldns

Debian

wajig whichpkg /usr/bin/dig
        dnsutils
wajig whichpkg /usr/bin/drill
        ldnsutils

CentOS/Fedora

yum/dnf provides /usr/bin/dig
        bind-utils
yum/dnf provides /usr/bin/drill
        ldns

OpenSUSE

Reportedly, zypper offers the same functionality.

Of the big six, that leaves Gentoo and Slackware. If you use one of these, or another distribution whose package manager is not covered here but offers the desired functionality, send me a note.
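If you hop between distributions a lot, the lookups above can be wrapped in a tiny helper. The following is only a sketch: it picks the first package tool it finds and builds the query exactly as written in this post. The names QUERIES and build_query are mine, and I simply assume that the binary lives in /usr/bin, as in the examples above.

```python
import shutil

# Query commands as used in this post. pacman takes the bare command name,
# the others take the (assumed) absolute path of the binary.
QUERIES = {
    "pacman": lambda cmd: ["pacman", "-Fs", cmd],
    "wajig": lambda cmd: ["wajig", "whichpkg", f"/usr/bin/{cmd}"],
    "dnf": lambda cmd: ["dnf", "provides", f"/usr/bin/{cmd}"],
    "yum": lambda cmd: ["yum", "provides", f"/usr/bin/{cmd}"],
}


def build_query(command, which=shutil.which):
    """Return the argument list for the first available package tool, or None."""
    for tool, make_query in QUERIES.items():
        if which(tool):
            return make_query(command)
    return None


print(build_query("dig"))
```

The resulting list can be handed to subprocess.run() as is. The `which` parameter exists only so that the lookup can be tried out on a machine that doesn't have the tool in question.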

Calculators

For quick calculations, I prefer to use my HP handheld calculators whenever possible, simply because I'm much faster with them than with anything else thanks to their responsive physical keypad and, of course, RPN. Alas, there are computational tasks that few, if any, handhelds are up to. Big numbers, in particular, usually result in an overflow rather than in the desired solution. Let's take factorials as an example – they grow faster than any ordinary function (including exponential ones) and are thus perfectly suited for getting big numbers.

Here's how the factorial \(n!\) looks in comparison to its little sister, the exponential \(e^n\):

../images/factorial.svg

The dashed line shows the Stirling approximation \(\sqrt{2 \pi n} \left(\frac{n}{e}\right)^n\), which reveals that the factorial essentially grows with \(n^n\) and thus faster than any exponential whatever its base.

Now, the largest factorial my HP42s can handle is 253!, which amounts to 5.173460992e+499. For a handheld, this is more than respectable: the largest factorial one can compute on a Linux desktop with, for example, xcalc as the calculator application, is 170!, limited simply by the fact that numbers in xcalc are represented by double precision floats.
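Both statements are easy to check. Here's a small sketch in Python (the function names are mine): it compares the base-10 logarithm of the exact factorial with that of the Stirling approximation given above, and it tests whether a factorial still fits into a double precision float.

```python
import math


def log10_exact(n):
    """log10(n!) computed from the exact integer factorial."""
    return math.log10(math.factorial(n))


def log10_stirling(n):
    """log10 of the Stirling approximation sqrt(2*pi*n) * (n/e)**n."""
    return 0.5 * math.log10(2 * math.pi * n) + n * math.log10(n / math.e)


def fits_in_double(n):
    """True if n! can be represented as a double precision float."""
    try:
        float(math.factorial(n))
        return True
    except OverflowError:
        return False


for n in (10, 170, 171, 253):
    print(n, round(log10_exact(n), 4), round(log10_stirling(n), 4), fits_in_double(n))
```

The two logarithms differ by roughly 1/(12 n ln 10), so the approximation gets better, in relative terms, the larger n becomes.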

All right, xcalc is ancient. But as a matter of fact, most calculator applications running on Windows, MacOS, or Linux have difficulties with large numbers. The Windows calculator, for example, gives up at any number bigger than 1e+10,000, and hence can't calculate factorials larger than 3249!. And we haven't even talked about getting exact results, which demand arbitrary precision arithmetic already for much smaller numbers.

Let's have a look at some calculators for Linux that can do better than those above. I use Mathematica 11.2 as reference:

  • 1,000,000! = 8.263932e+5,565,708, taking 0.185/1 s for an exact result/numerical approximation
  • 100,000,000! = 1.617204e+756,570,556, taking 46/344 s for an exact result/numerical approximation
  • Overflows at $MaxNumber 1.605216761933662e+1,355,718,576,299,609

Note that a file storing the result of 100,000,000! has an uncompressed size of 0.757 GB. So be careful when writing even larger factorials to disk ;) ($MaxNumber would be 1.35 PB!).
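The size of such a file can be estimated without computing the factorial at all, since the number of decimal digits of n! follows from the log-gamma function. A minimal sketch (the function name is mine, and I assume one byte per digit):

```python
import math


def digits_of_factorial(n):
    """Number of decimal digits of n!, estimated via lgamma(n + 1) = ln(n!)."""
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1


for n in (10**6, 10**8):
    digits = digits_of_factorial(n)
    print(f"{n:,}! has {digits:,} digits, about {digits / 1e9:.3f} GB as plain text")
```

Since lgamma works with ordinary floats, the estimate could be off by one digit in the rare case that log10(n!) lies extremely close to an integer.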

CLI Calculators

bc/dc

The Unix calculators. They have offered arbitrary precision since 1970, and now you know what the 'bc' stands for in this blog's title! ;) Neither of them supports factorials out of the box, but hey, these are programming languages, not plain calculators. Examples for scripts computing factorials can be found on Rosettacode and on Stackoverflow, but be aware that these examples are horribly inefficient; for fast algorithms see Peter Luschny's page. Here's the “script” for dc:

dc -e '?[q]sQ[d1=Qd1-lFx*]dsFxp'

After typing a number like 1000, we get an exact result. Overflows somewhat below 67,000!. bc does not, but it's too slow to be of much use for very much larger numbers.
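To illustrate why the naive scripts are slow: they multiply an ever-growing result by one small number at a time. A simple improvement is to multiply in a balanced tree, so that factors of similar size meet and fast big-integer multiplication can pay off. The following Python sketch shows the idea; it is not one of the algorithms from Peter Luschny's page, just the simplest variant I can think of, and the function names are mine.

```python
import math


def factorial_naive(n):
    """Multiply 2 * 3 * ... * n one factor at a time, like the simple scripts."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def product(lo, hi):
    """Product of the integers lo..hi (inclusive) by recursive halving."""
    if lo > hi:
        return 1
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    return product(lo, mid) * product(mid + 1, hi)


def factorial_tree(n):
    """n! computed as a balanced product tree."""
    return product(1, n)


print(factorial_tree(20))
```

The recursion depth is only about log2(n), so even a million poses no problem in this respect. For large n, the tree variant should be noticeably faster than the naive loop, but I'd measure before relying on it.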

wcalc

Approximate results with, in principle, arbitrary precision defined by the command line parameter P (which accepts only integers and is thus useless for really large numbers).

wcalc -P 10 'fact(1000000)'

Takes 1.42 s, overflows somewhat below 44,500,000!

calc

My default calculator on PCs. Gives exact results.

calc 1000000!

Takes 330 s, and overflows somewhat below 2,200,000,000!

hypercalc

When firing up hypercalc, we are greeted with “Go ahead -- just TRY to make me overflow!”. And indeed, that's not an easy task at first. Hypercalc gives approximate results only, but essentially instantaneous ones even for absolutely monstrous numbers. The factorials we have considered so far are child's play for this program. Instead of the factorial of a million, a billion, a trillion, why not ask for the factorial of a Googol! Hypercalc tells us this number amounts to 1e+(9.9565705518098e+101), and that agrees with the solution from Wolfram Alpha (see below), the only tool that can at least partly follow hypercalc into the realm of big numbers.

But not when it comes to really big ones. Let's have a look, for example, at Pickover's superfactorial n$. What about, say, 10$? That's completely out of reach for any program I know, but not for hypercalc: 8pt8e+23804068 (PT stands for PowerTower). But even this is still a very very very small number: hypercalc overflows only at 1e+308pt1e+34 or, equivalently in Donald Knuth's up-arrow notation, 10↑↑1.7976e+308.

All of this is contained in a Perl script available for download (or for installation in the AUR for Archlinux users), and additionally in a Javascript powered web interface.

iPython

So far, we haven't been able to get exact results faster than with Mathematica. Let's see how python is doing in this regard. There are two possibilities, the first using plain python, the second scipy:

In [1]: import math
In [2]: %timeit math.factorial(1000000)
6.5 s ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: from scipy.special import factorial
In [4]: %timeit factorial(1000000, exact=True)
6.53 s ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's disappointing that scipy doesn't outperform regular python, but instead seems to use exactly the same code. A special function should, IMHO, perform better than Mathematica (which needs only 1 s for the same task).

You can use python also directly from the terminal, and write the result to disk instead of displaying it directly:

echo 1000000 | python -c 'import sys; import math; print(math.factorial(int(sys.stdin.readline())))' > fac1M.dat
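One caveat: as far as I know, newer Python versions (3.11 and later, plus some patched older releases) refuse to convert integers with more than 4300 digits to a string by default, in which case the one-liner above ends with a ValueError. A minimal sketch that lifts the limit where it exists (the function name is mine):

```python
import math
import sys


def factorial_as_text(n):
    """Return n! as a decimal string, lifting the int-to-str digit limit if present."""
    if hasattr(sys, "set_int_max_str_digits"):
        sys.set_int_max_str_digits(0)  # 0 disables the limit
    return str(math.factorial(n))


text = factorial_as_text(5000)
print(len(text), "digits")

# Writing the result to disk, as in the one-liner above:
# with open("fac1M.dat", "w") as target:
#     target.write(factorial_as_text(1000000) + "\n")
```

Note that the conversion to decimal is itself expensive for numbers of this size, so writing 1,000,000! to disk takes considerably longer than computing it.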

julia

Python turned out to be a disappointment. What about Julia, which is advertised to be suitable for high-performance numerical analysis and computational science?

julia> @time factorial(big(1000000));
0.162650 seconds (1.57 k allocations: 53.524 MiB, 2.00% gc time)
julia> @time factorial(big(100000000));
45.550577 seconds (432.60 k allocations: 11.586 GiB, 0.90% gc time)

Now we're talking!

You can write the results to disk in this way:

julia> using DelimitedFiles
julia> fac1M=factorial(big(1000000));
julia> writedlm("fac1M.dat", fac1M)

Graphical calculators

Most desktop calculators still attempt to imitate the look of handhelds, just like media players used to resemble stereo decks. Some of these reconstructions are historically accurate and appeal to our nostalgia, but in terms of usability, these relics of the 1990s are among the most clumsy and inefficient user interfaces ever invented, particularly since people never use the keyboard to interact with these abominations, but the mouse. However, some graphical calculators give you the choice.

qalculate!

A great general-purpose calculator packed with features. Includes an RPN mode and a plotting interface to gnuplot, as well as excellent conversion utilities that can be updated daily (important for currencies). We can get an approximate result for 1,000,000! in under 1 s, but 100,000,000! takes so long that I didn't wait. Overflows reportedly at 922,337,203,854,775,808!, which is impressive, but of little use because of the comparatively poor performance.

speedcrunch

Easy to use and insanely fast. Even on slow hardware, the result is there as soon as you type it. Overflows at 72,306,961!

Web calculators

Young people tell me that installing local apps is so 1990ish, and of course, you can perform essentially all calculations you ever need in the interwebs.

Casio

When I look at my colleagues' desks at the office, I frequently see Casio calculators from the 1980s. Even I have one, although I have no idea when and where I acquired it (and I also don't remember ever using it). In any case, true to their roots, Casio offers a quite capable online calculator. It gives 1,000,000! in about 2 s and overflows only at 1e+100,000,000, just below 14,845,000!

Wolfram Alpha

More than a calculator: a knowledge engine. You may ask what the weather is in Berlin today, and get the interesting bit of information that on April 22nd, it was -4°C in 1997 and 31°C in 1968. But foremost, Wolfram Alpha is an arbitrary precision calculator. It gives exact results where appropriate (for reasonably sized outputs) and approximate ones when the output seems too large. Regardless of the task, the answers take a few seconds, whether you calculate 2+2 or 1e+(9^9^9)! [1e+1e+(1.58274e+369693108)]. Overflows at (2e+1573347107)! or 1e+(1e+9.196824545990035)! That's much higher than Mathematica, but still nothing compared to hypercalc.

Quality journalism, the second

This post was originally in German, since most of the links and the quotes are.

The decline of newspapers and magazines does not spare computer magazines, quite the contrary. c't is still comparatively little affected by it, but that is solely due to its loyal subscribers. Yet even here, the base has been crumbling for years, slowly but steadily. I have subscribed to c't for more than 20 years, and as bathroom reading, I wouldn't want to miss it under any circumstances. In recent years, however, errors of a kind that thoroughly spoil the pleasure have been piling up. If you have to question every statement, it is easier to do the research yourself. And for pure entertainment, I can just as well read Fix und Foxi.

The latest issue was the last straw, and I let myself get carried away and sent a letter to the editor:

...

It is a truly laudable goal to bring Lua or Python programming closer to people, but I expect that it is at least mentioned when there is a much simpler way (if there is one).

c't 7/2019, p. 158. “Despite this flexibility, you eventually run into limits: out of the box, the tool cannot query and display the current weather from wttr.in. These and other functions, however, can easily be added with home-made Lua scripts.”

Whatever “out of the box” is supposed to mean, you don't need Lua scripts for that.

Hourly weather query:

${execpi 3600 curl -s "wttr.in/Berlin?nT&lang=de" | head -n -2}

Hourly weather query in color. ;)

${execpi 3600 curl -s "wttr.in/Berlin?n&lang=de" | ~/.config/conky/ansito | head -n -2}

Ansito: https://github.com/pawamoy/ansito

And quite similarly in

c't 5/2019, p. 42. “The grep command searches the file in a little more than a minute on a computer with a Core i5 and an SSD. But it does not exploit the fact that the file is already sorted by hash. In a sorted list, a binary search is much faster. A self-written binary search in Python takes only a few lines of code. We therefore quickly developed a script that combs through the masses of data in record time.”

Also very nice, but not a single word mentions that under Linux, there is a much simpler way that is about four times faster:

$ look $(echo -n "111111" | sha1sum | awk '{print toupper($1)}') pwned-passwords-sha1-ordered-by-hash-v4.txt

For heaven's sake, that's not so hard. One sentence pointing this out is not too much to ask. Or is it?

...

These impressions are, of course, accompanied by the development of heise online (a medium that is in principle editorially independent of c't), which I, like so many long-standing c't subscribers, regard as my online home port. Recently, it published an article by an author from the gender-feminist SJW corner, which stands out above all for the complete absence of any competence whatsoever. A quote:

While developers always strive to enter the program code as precisely as possible without making typos, social scientists are trained to recognize the "big picture", to grasp the systemic relationships in the world.

The more than 5000 comments left no doubt that Heise has managed to thoroughly annoy its core clientele with this. :)

A few days later, I received this e-mail from the “new online service heise+”:

Dear Mr. Brandt, we are very pleased that you, as a 7 reader, support our quality journalism (“Qualitäts-Journalismus”).

I didn't ask what “7 reader” is supposed to mean. To put such a glaring error, in a mail that goes out to an estimated half a million people, in the same breath as the self-proclaimed hallmark of quality journalism (only genuine with the superfluous hyphen, the “Deppenbindestrich”) is rather cheeky. I wonder whether they will ever notice why people no longer buy them?

The fastest search, or: find your password

A couple of weeks ago, a monstrous set of login data was passed around on the dark web. Now known as collections #1 – #5, the files contain an estimated 2.2 billion unique email address/password combinations. And that means cracked passwords, mind you, not only hashes.

Some of my account data were leaked ages ago by Adobe and Dropbox (at that time, I believed in simple – 8-character – passwords for test accounts). Incidentally, I had already deleted these accounts when the breach became public.

Currently, all of my accounts are secured by a random password of at least 25 characters with an actual entropy not lower than 100 bits, which is essentially impossible to brute force even from unsalted SHA1 hashes. Hence, none of my accounts should show up in the collections mentioned above. If one did, it would show that the password had been saved in plain text, and I would react correspondingly by deleting this account.

The most comprehensive list of password hashes (including collection #1) has been assembled by Troy Hunt. We can download and search this list very easily as shown in the following.

$ cd /hdd/Downloads/pwned
$ wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v4.7z
$ cp pwned-passwords-sha1-ordered-by-hash-v4.7z /ssd/temp/pwned
$ cd /ssd/temp/pwned/
$ unp pwned-passwords-sha1-ordered-by-hash-v4.7z
$ wc -l <pwned-passwords-sha1-ordered-by-hash-v4.txt
551509767
$ awk -F: '{ SUM += $2 } END { print SUM }' pwned-passwords-sha1-ordered-by-hash-v4.txt
3344070078

551 million passwords, and 3.344 billion accounts: holy cow. What's the fastest way to search for a single string in such a huge file? c't 5/2019 actually has an entire article on this issue, complete with a github repository.

Pina Merkert points out that grep, the standard tool for searching text files under Linux, does not exploit the fact that Troy Hunt offers the download also in a version where the passwords are ordered by hash (that's the version I've downloaded above). The linear search performed by grep is indeed rather slow:

$ time echo -n "111111" | sha1sum | awk '{print toupper($1)}' | grep -f - pwned-passwords-sha1-ordered-by-hash-v4.txt
3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D:3093220

real    0m48.227s
user    0m30.939s
sys     0m5.752s

Since the data are sorted, a binary search can be performed, which is potentially orders of magnitude faster – O(log[n]) vs. O(n) – than a linear search: for 551 million lines, roughly 30 comparisons suffice instead of up to 551 million. Pina's Python script is indeed dramatically faster:

$ time ./binary_search.py '111111'
Searching for hash 3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D of password "111111".
Password found at byte  5824044773: "3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D:3093220"
Your password "111111" was in 3093220 leaks or hacked databases! Please change it immediately.

real    0m0.042s
user    0m0.035s
sys     0m0.007s

But my one-liner is four times faster than her 57-line script: ;)

$ time look $(echo -n "111111" | sha1sum | awk '{print toupper($1)}') pwned-passwords-sha1-ordered-by-hash-v4.txt
3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D:3093220

real    0m0.011s
user    0m0.009s
sys     0m0.005s

'Look', by the way, is a binary search tool, part of the util-linux package and thus present on any Linux installation (and also on the Windows Subsystem for Linux, I guess). Unfortunately, several distributions (such as Debian and Ubuntu) have managed to ship a broken version of look for about 10 years. There's an unofficial patched version available on GitHub if you'd like to try.
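In case you wonder how such a binary search on a sorted text file works in principle, here's a minimal sketch in Python. It is neither Pina's script nor the actual implementation of look, just my own illustration of the idea: bisect on byte offsets, always read the first complete line after the current offset, and finally collect all lines starting with the search string. The function names are mine, and I assume a file with plain ASCII lines sorted in byte order, such as the hash list above.

```python
import hashlib
import os


def sha1_upper(password):
    """Upper-case hex SHA-1 digest, the key format of the pwned-passwords file."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def look(path, prefix):
    """Return all lines of the sorted file `path` that start with `prefix`."""
    target = prefix.encode("ascii")
    with open(path, "rb") as handle:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            handle.seek(mid)
            if mid > 0:
                handle.readline()        # skip the line containing byte `mid`
            line = handle.readline()     # first complete line after `mid`
            if line and line[:len(target)] < target:
                lo = mid + 1             # this line still sorts before the target
            else:
                hi = mid                 # at or past the target, or end of file
        handle.seek(lo)
        if lo > 0:
            handle.readline()
        matches = []
        for line in handle:              # collect the run of matching lines
            if not line.startswith(target):
                break
            matches.append(line.decode("ascii").rstrip("\n"))
        return matches


# Usage with the file from above:
# print(look("pwned-passwords-sha1-ordered-by-hash-v4.txt", sha1_upper("111111")))
```

Even for a file of many gigabytes, the loop needs only a few dozen seeks, which is why the run time is dominated by starting the interpreter rather than by the search itself.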

With the help of 'look', one can also very easily go through lists of passwords – let's take these completely arbitrary examples:

less topten.dat

123456
123456789
qwerty
password
111111
12345678
abc123
1234567
password1
12345

$ time while read -r password; do look $(echo -n "$password" | sha1sum | awk '{print toupper($1)}') pwned-passwords-sha1-ordered-by-hash-v4.txt; done <topten.dat

7C4A8D09CA3762AF61E59520943DC26494F8941B:23174662
F7C3BC1D808E04732ADF679965CCC34CA7AE3441:7671364
B1B3773A05C0ED0176787A4F1574FF0075F7521E:3810555
5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:3645804
3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D:3093220
7C222FB2927D828AF22F592134E8932480637C0D:2889079
6367C48DD193D56EA7B0BAAD25B19455E529F5EE:2834058
20EABE5D64B0E216796E834F52D61FD0B70332FC:2484157
E38AD214943DAAD1D64C102FAEC29DE4AFE9DA3D:2401761
8CB2237D0679CA88DB6464EAC60DA96345513964:2333232

real    0m0.053s
user    0m0.048s
sys     0m0.023s

If you ever had a shred of doubt that humanity has a glorious future, look at these 10 ingenious passwords, which secure the 54 million accounts of our most brilliant minds!

Modern file compression

Unknown to most users, file compression silently works behind the scenes. Updates for any operating system, for example, are compressed. That happens automatically and the user doesn't even need to know about it.

But sometimes, we have a choice. In Archlinux, for example, we can set the compression we'd like to use for packages created by makepkg (such as those installed over the AUR) – but how to choose between gz, bz2, xz, lrz, lzo, and z? And some backup software adds further options: Borg, for example, offers zlib, lzma, lz4, and zstd.

Most surprisingly, some of these algorithms have been developed only very recently: zstd comes from Facebook (2016), and there's brotli from Google (2015) and lzfse from Apple (2015). Why do these multi-billion-dollar companies develop compression algorithms? Because of the multi-billion dollars.

Instead of testing each of these algorithms yourself, you can use lzbench. It tests all open source algorithms of the lz family with the de facto standard file package in the compression business, the silesia suite.

Here are three examples geared toward high compression ratio, high speed compression, and high speed decompression:

High compression ratio (<25%)

➜  lzbench -c -ebrotli,11/xz,6,9/zstd,22 silesia.tar
lzbench 1.7.3 (64-bit Linux)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9814 MB/s  9852 MB/s   211947520 100.00
brotli 2017-12-12 -11    0.48 MB/s   385 MB/s    51136654  24.13
xz 5.2.3 -6              2.30 MB/s    74 MB/s    48745306  23.00
zstd 1.3.3 -22           2.30 MB/s   600 MB/s    52845025  24.93

These are single core values. xz compression (but not decompression) profits from multithreading, while brotli and zstd do not.

High speed compression (for compression ratios <50%)

➜  lzbench -c -elz4/lzo1x silesia.tar
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9861 MB/s  9768 MB/s   211947520 100.00
lz4 1.8.0                 524 MB/s  2403 MB/s   100880800  47.60
lzo1x 2.09 -12            521 MB/s   738 MB/s   103238859  48.71

High speed decompression (> 2000 MB/s)

↪ lzbench -c -elz4/lizard,10/lzsse8,6 silesia.tar
Compressor name         Compress. Decompress. Compr. size  Ratio
memcpy                   9579 MB/s 10185 MB/s   211947520 100.00
lz4 1.8.0                 525 MB/s  2421 MB/s   100880800  47.60
lizard 1.0 -10            421 MB/s  2115 MB/s   103402971  48.79
lzsse8 2016-05-14 -6     8.25 MB/s  3359 MB/s    75469717  35.61

What do we learn from these benchmarks?

  1. If we want high compression reasonably fast, nothing beats xz. It's just perfect for what some (all?) Linux distributions actually use it for: distributing updates with acceptable computational resources over a channel with very limited bandwidth.
  2. If the distributor commands virtually unlimited resources, and compression speed is thus not an issue, brotli and zstd are clearly superior to all other choices. That's how we would like to have our updates: small and fast to decompress.
  3. If size is not of primary importance, but compression speed is, lz4 and lzo are the champions.
  4. If decompression speed is essential, lzsse8 wins. This is a lesser known member of the lz family and not widely available, in contrast to lz4 which thus scores again.
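If you just want to get a feeling for such trade-offs without installing lzbench, the compressors that ship with Python's standard library (zlib, bz2, and lzma, i.e., the algorithms behind gz, bz2, and xz) can be compared in a few lines. This is only a rough sketch with a made-up sample text, not a substitute for lzbench and the silesia suite; brotli, zstd, and lz4 are not covered since I don't want to assume any additional modules.

```python
import bz2
import lzma
import time
import zlib

CODECS = {
    "zlib (gz)": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma (xz)": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Return {name: (ratio, compression time, decompression time)} for `data`."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        unpacked = decompress(packed)
        end = time.perf_counter()
        if unpacked != data:
            raise ValueError(f"{name}: round trip failed")
        results[name] = (len(packed) / len(data), middle - start, end - middle)
    return results


sample = b"The quick brown fox jumps over the lazy dog.\n" * 20000
for name, (ratio, t_comp, t_decomp) in benchmark(sample).items():
    print(f"{name:10s} {100 * ratio:6.2f} %  {1e3 * t_comp:8.1f} ms  {1e3 * t_decomp:8.1f} ms")
```

All three run with their default settings here. For a meaningful comparison, feed the function real data (for example, the contents of silesia.tar) rather than my highly repetitive sample.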

Weather widget

To my great dismay, Yahoo announced earlier this year that their weather API would be retired on January 3rd. And indeed, my little weather conky ceased to work on that day and thus reached its EOL. :(

That's really too bad. Common forecast sites are peppered with dozens of javascripts and take ages to open on my Mini or my Lifebook. Here's an example which I find disagreeably slow even on my desktop. Besides, I also liked my weather conky for its aesthetic merit – at least _I_ found it pleasing to look at.

Well well well. There’s no point in crying over spilled milk. What are the options? There's conkywx, of course, the script collection for true weather aficionados. I look at it every second year and always feel that it's a tad overwhelming, at least for me, both with regard to features and visuals (one, two, three).

Alternatively, I could rebuild my weather conky around a new API – perhaps even the new Yahoo API. But I don't like to register anywhere for a simple weather forecast (and certainly not at Yahoo), and also don't want to spend more than, say, 15 min on getting a new forecast.

And then I found wttr.in – the fastest weather forecast ever. It's accessible by browser or in the terminal as a curl-based service, can be configured on the fly, and its ASCII graphs breathe nerdy charm.

It's a one-liner in conky:

${execpi 3600 curl -s "wttr.in/Berlin?nT&lang=de" | head -n -2}

for a narrow (?n) and black-and-white (?T) view and

${execpi 3600 curl -s "wttr.in/Berlin?n&lang=de" | ~/.config/conky/ansito | head -n -2}

in full color thanks to ansito, which translates the ANSI color codes to the corresponding ones on conky.

But do we actually need a desktop widget when the forecast is so readily accessible? One could simply have an extra tab open in a browser, for example. Or, one could have a tab reserved for wttr.in in a Guake style terminal. For convenience, one could define an alias (works for all major shells) that updates the weather forecast every hour:

alias wttr='watch -ct -n 3600 "curl -s wttr.in/Berlin?lang=de | head -n -2"'

Hey, I like that. :)
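And for those who'd rather call wttr.in from a script than from curl: a minimal sketch in Python that builds the same URLs as above and drops the last two lines like `head -n -2`. The function names are mine, and setting the User-Agent to "curl" is an assumption on my part, based on the observation that the service seems to decide by the client whether to send plain text or HTML.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def wttr_url(city, narrow=True, monochrome=True, lang="de"):
    """Build the wttr.in URL; 'n' selects the narrow view, 'T' switches off colors."""
    flags = ("n" if narrow else "") + ("T" if monochrome else "")
    query = "&".join(part for part in (flags, f"lang={lang}") if part)
    return f"https://wttr.in/{quote(city)}?{query}"


def drop_last_lines(text, count=2):
    """Remove the last `count` lines, like `head -n -2` in the conky entries."""
    lines = text.splitlines()
    return "\n".join(lines[:-count] if count else lines)


def forecast(city, **options):
    """Fetch the forecast as text (needs network access, not called here)."""
    request = Request(wttr_url(city, **options), headers={"User-Agent": "curl"})
    with urlopen(request, timeout=10) as response:
        return drop_last_lines(response.read().decode("utf-8"))


print(wttr_url("Berlin"))
```

Calling forecast("Berlin") should then return the same text as the first conky entry above.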

A simple automated backup scheme

I haven't talked about backups for more than two years, but as detailed in my previous post, I was recently most emphatically reminded that it isn't clear to most people that their data can vanish anytime. Anytime? Anytime. poof

I've detailed what I expect from a backup solution and what I've used over the years in several previous posts (see one, two, and three) and do not need to repeat that here. My current backup scheme is based on borg with support from rsync and ownCloud. The following chart (which looks way more complicated than it is) visualizes this scheme and is explained below.

../images/backup.svg.png

My home installation is on the left, separated from my office setup on the right by the dashed line in the center. The desktops (top) are equipped with an SSD containing the system and home partitions as well as an HDD for archiving, backup, and caching purposes. Active projects are stored in a dedicated ownCloud folder and synchronized between all clients, including my notebooks. The synchronization is done via the ownCloud server (the light gray box at the lower right) and is denoted by (1) in the chart. The server keeps a number of previous versions of each file, and in case a file is deleted, the latest version is kept, providing a kind of poor man's backup. But I'm using ownCloud to sync my active projects between my machines, not for its backup features.

The actual backup from the SSD to the HDD is denoted by (2) or by (2') for my notebooks (to the same HDD, but over WiFi), and is done by borg. The backup is then transferred to a NAS (the gray boxes at the bottom) by simple rsync scripts indicated by (3). All of that is orchestrated with the help of a few crontab entries – here's the one for my office system as an example:

$ crontab -l
15  7-21        *   *   *       $HOME/bin/backup.borg
30  21          *   *   *       source $HOME/.keychain/${HOSTNAME}-sh;  $HOME/bin/sync.backup
30  23          *   *   *       $HOME/bin/sync.vms
30  02          *   *   *       $HOME/bin/sync.archive

The first entry invokes my borg backup script (see below) every hour between 7:15 and 21:15 (I'm highly unlikely to work outside of this time interval). The second entry (see also below) transfers the entire backup folder on the HDD to the NAS, 15 min after the last backup of the day. Since my rsync script invokes ssh for transport, I use keychain to inform cron about the current values of the environment variables SSH_AUTH_SOCK and SSH_AGENT_PID. The third entry triggers the transfer of any changed virtual machine to the internal HDD. And finally, the fourth entry syncs the archive on the internal HDD to an external one (3'). I do that because once a project is finished, the corresponding folder is moved out of the ownCloud folder to the archive, effectively taking it out of the daily backup. This way, the size of my ownCloud folder has never exceeded 3.5 GB over the past five years. Since the projects typically don't change anymore once they are in the archive, this step effectively just creates a copy of the archive folder.
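
What the keychain step in the second entry amounts to can be sketched as a small guard function. This is an illustration only: load_agent_env is a made-up name, and the sketch assumes that keychain has written its usual Bourne-shell snippet (which sets and exports SSH_AUTH_SOCK and SSH_AGENT_PID) to ~/.keychain/${HOSTNAME}-sh.

```shell
#!/bin/bash
# Sketch only: load_agent_env is a hypothetical helper, not part of keychain.

load_agent_env() {
    # Default location is an assumption based on the crontab entry above
    local envfile="${1:-$HOME/.keychain/${HOSTNAME}-sh}"

    # Without this file cron knows nothing about a running ssh-agent
    if [ ! -r "$envfile" ]; then
        echo "no keychain file: $envfile" >&2
        return 1
    fi

    # Import the agent variables into the current shell
    . "$envfile"

    # Succeed only if an agent socket is actually known afterwards
    [ -n "${SSH_AUTH_SOCK:-}" ]
}
```

A cron job could then run something like load_agent_env && $HOME/bin/sync.backup, so that the transfer is skipped with a message in the cron mail instead of hanging on a passphrase prompt.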

What's not shown in the chart above: there's a final backup level involving the NAS. At home I do that manually (and thus much too infrequently) by rsyncing the NAS content to the HDDs I've collected over the years. ;) At the office, the NAS is backed up to tape every night automatically. The tapes are part of a tape library located in a different building (unfortunately, too close to survive a nuclear strike ;) ), and are kept for ten years as legally required.

What does all of that mean when I prepare an important document such as a publication or a patent? Well, let's count. It doesn't matter where I start working: since the document is in my ownCloud folder, it is soon present on five different disks (two desktops and notebooks, and the ownCloud server). The backup and sync add four more disks, and the final backup of the NAS results in two more copies (one on disk, one on tape). Altogether, within one day, my important document is automatically duplicated to ten different storage media (disks or tape) in three different locations. And when I continue working on this document over the following days, my borg configuration (see below) keeps previous copies up to six months in the past (see cron mail below).

You're probably thinking that I'm completely paranoid. 10 different storage media in 3 different locations! Crazy! Well, the way I do it, I get this kind of redundancy and the associated peace of mind for free. See for yourself:

(1) ownCloud

My employer runs an ownCloud server. I just need to install the client on all of my desktops and notebooks. If you are not as lucky: there are very affordable ownCloud or Nextcloud (recommended) servers available in the interwebs.

(2) backup.borg

A simple shell script:

#!/bin/bash

# https://github.com/borgbackup/borg
# https://borgbackup.readthedocs.io/en/stable/index.html

ionice -c3 -p$$

repository="/bam/backup/attic" # directory backing up to
excludelist="/home/ob/bin/exclude_from_attic.txt"
hostname=$HOSTNAME

notify-send "Starting backup"

# create a new archive named <host>-<timestamp>
borg create --info --stats --compression lz4                        \
     "${repository}::${hostname}-$(date +%Y-%m-%d--%H:%M:%S)"       \
     /home/ob                                                       \
     --exclude-from "$excludelist"                                  \
     --exclude-caches

notify-send "Backup complete"

# thin out old archives according to the retention rules
borg prune --info "$repository" --keep-within=1d --keep-daily=7 --keep-weekly=4 --keep-monthly=6

borg list "$repository"

(3) sync.backup

A simple shell script:

#!/bin/bash

# http://everythinglinux.org/rsync/
# http://troy.jdmz.net/rsync/index.html

ionice -c3 -p$$

RHOST=nas4711

BUSPATH=/bam/backup/attic
BUDPATH=/home/users/brandt/backup

nice -n +10 rsync -az -e 'ssh -l brandt' --stats --delete $BUSPATH $RHOST:$BUDPATH

The two other scripts in the crontab listing (sync.vms and sync.archive) are entirely analogous to this one.
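
To give an idea of what "analogous" could look like for a local target, here is a minimal sketch of a sync.archive-style script. It is not the actual script: both paths are invented, the helper sync_dir is hypothetical, and the mount check is an extra safeguard added here because --delete on an unmounted target directory would be unpleasant.

```shell
#!/bin/bash
# Hypothetical sync.archive, modelled on sync.backup. Both paths are assumptions.

SRC=/bam/archive          # assumed: archive folder on the internal HDD
DEST=/media/external      # assumed: mount point of the external HDD

sync_dir() {
    # RSYNC may be overridden (e.g. RSYNC=echo) to print the arguments
    # instead of copying anything.
    "${RSYNC:-rsync}" -a --stats --delete "$1" "$2"
}

# Only sync if the source exists and the external disk is really mounted
if [ -d "$SRC" ] && mountpoint -q "$DEST"; then
    sync_dir "$SRC" "$DEST"
fi
```

As in sync.backup, the source is given without a trailing slash, so rsync recreates the folder itself below the destination rather than spilling its contents into it.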

That's all. It's just these scripts and the associated crontab entries above, nothing more. And since (2) and (3) are managed by cron, I'm informed about the status of my backup every time one is performed. The list of entries you see in the mail below are the individual backups I could roll back to, or just copy individual files from after mounting the whole caboodle with 'borg mount -v /bam/backup/attic /bam/attic_mnt/' (see the screenshot below). You see how these backups are organized: hourly for the last 24h, daily for the last week, weekly for the past month, and monthly for the five months after.

From: "(Cron Daemon)" <ob@pdi282>
Subject: Cron <ob@pdi282> $HOME/bin/backup.borg

------------------------------------------------------------------------------
Archive name: pdi282-2018-12-21--14:15:01
Archive fingerprint: d31db1cd8223ca084cc367deb62e440bfe2dfe4fd163aefc6b6294935f1877b8
Time (start): Fri, 2018-12-21 14:15:01
Time (end):   Fri, 2018-12-21 14:15:21
Duration: 19.52 seconds
Number of files: 173314
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               30.14 GB             22.17 GB              1.40 MB
All archives:              975.54 GB            715.02 GB             43.41 GB

                       Unique chunks         Total chunks
Chunk index:                  216890              6892358
------------------------------------------------------------------------------
pdi282-2018-06-30--21:15:01          Sat, 2018-06-30 21:15:01 [113186c574898837d0fb11e6fb7b71f62b0a5422d71b627662aec0d2d6a0e0bf]
pdi282-2018-07-31--21:15:01          Tue, 2018-07-31 21:15:01 [8af0cccbab5645490fcec5e88576dad1a3fbbfd3d726a35e17851d7bec545958]
pdi282-2018-08-31--21:15:01          Fri, 2018-08-31 21:15:01 [2d763ea253d18222015d124c48826425e75b83efedeedcc11b24cf8f0d7e8899]
pdi282-2018-09-30--21:15:01          Sun, 2018-09-30 21:15:01 [39932a0d8c081bc05f9cdff54637e3962fd9e622edce8ef64160e79ae767541f]
pdi282-2018-10-31--21:15:01          Wed, 2018-10-31 21:15:02 [49386980b5270554c6c92b8397809736dea5d07c7ccb3861187a6ed5065ba7a6]
pdi282-2018-11-18--21:15:01          Sun, 2018-11-18 21:15:02 [c2eb215ce883fa5a0800a9d4b9a6c53ac82ace48151180e6a15e944dbf65e009]
pdi282-2018-11-25--21:15:01          Sun, 2018-11-25 21:15:01 [e99c2f3baed4a863b08551605eb8ebeaa5ed6a02decccdb88268c89a9b9b9cc0]
pdi282-2018-11-30--21:15:01          Fri, 2018-11-30 21:15:01 [882f6466adcbc43d7e1a12df5f38ecc9b257a436143b00711aa37e16a4dbf54d]
pdi282-2018-12-02--21:15:02          Sun, 2018-12-02 21:15:02 [7436da61af62faf21ca3f6aeb38f536ec5f1a4241e2d17c9f67271c3ba76c188]
pdi282-2018-12-09--21:15:01          Sun, 2018-12-09 21:15:02 [82e6c845601c1a12266b0b675dfeaee44cd4ab6f33dafa981f901a3e84567bbb]
pdi282-2018-12-13--21:15:01          Thu, 2018-12-13 21:15:01 [9ac3dfd4aca2e56df8927c7bc676cd476ea249f4dd2c1c39fc2a4997e0ada896]
pdi282-2018-12-14--21:15:02          Fri, 2018-12-14 21:15:02 [c8c1358f58dae6eb28bd66e9b49c7cfe237720de21214ebd99cc4b4964ec9249]
pdi282-2018-12-15--21:15:01          Sat, 2018-12-15 21:15:01 [e24d3b26dcdf81d0b0899085fb992c7a7d33d16671fba7a2c9ef1215bd3ae8fb]
pdi282-2018-12-16--21:15:01          Sun, 2018-12-16 21:15:01 [27a8a6943f1053d106ced8d40848eccbfb6c145d80d5e2a9e92f891ed98778ce]
pdi282-2018-12-17--21:15:02          Mon, 2018-12-17 21:15:02 [14118ea958387e0a606e9f627182e521b92b4e2c2dd9fb5387660b84a08971a6]
pdi282-2018-12-18--21:15:01          Tue, 2018-12-18 21:15:01 [842c3f7e301de89944d8edf7483956aff2b7cf9e15b64b327f476464825bd250]
pdi282-2018-12-19--21:15:01          Wed, 2018-12-19 21:15:01 [b7f99c56a8e6ee14559b3eddec04646c8a756515765db562c35b8fbefcd4e58e]
pdi282-2018-12-20--15:15:01          Thu, 2018-12-20 15:15:01 [e832afd41762a69cb8c5fe1c14395dde313dc4368871fd27073fdc50e9f7c6c9]
pdi282-2018-12-20--16:15:01          Thu, 2018-12-20 16:15:01 [8471ccb87d513604d31320ff91c2e0aaf0d31e5ff908ff41b8653c55ee11c1e5]
pdi282-2018-12-20--17:15:01          Thu, 2018-12-20 17:15:01 [73a3ae72815a10732fc495317a7e0f8cd9d05eb2ea862f8c01b437138ac82103]
pdi282-2018-12-20--18:15:01          Thu, 2018-12-20 18:15:01 [7eced8e18b52d00300c8f1b17e188fbfc1124dc60adf68ef2924425677615a96]
pdi282-2018-12-20--19:15:01          Thu, 2018-12-20 19:15:01 [6b7dbc4095b704209921424a52ed37d854b3a61c49cc65ac6889d215aad95a6f]
pdi282-2018-12-20--20:15:01          Thu, 2018-12-20 20:15:01 [66da0f57d6c93b149a9fdf679acf5e43fc22ce6b582db4da3ab606df741bdf82]
pdi282-2018-12-20--21:15:01          Thu, 2018-12-20 21:15:01 [1fce9aa4751be905a45ccce7fca3d44be3cf580d5e4b7c4f5091167099df57ad]
pdi282-2018-12-21--07:15:01          Fri, 2018-12-21 07:15:02 [ee551653a18d400719f9ffe1a67787326f5d5dad41be7d7b5482d5610ed86d43]
pdi282-2018-12-21--08:15:01          Fri, 2018-12-21 08:15:01 [264d7ce1dab3bc1578b521a170ee944598fa99f894d6ca273793ad14824b1689]
pdi282-2018-12-21--09:15:01          Fri, 2018-12-21 09:15:01 [b37de3616438e83c7184af57080690db3a76de521e77fd1ae6e90262f6beb1cc]
pdi282-2018-12-21--10:15:01          Fri, 2018-12-21 10:15:01 [6862d0136b2e4ac7fc0544eb74c0085e7baceca7147bd59b13cd68cbf00cb089]
pdi282-2018-12-21--11:15:01          Fri, 2018-12-21 11:15:01 [e5c6ee4ea65d6dacb34badb850353da87f9d5c19bb42e4fb3b951efecd58e64f]
pdi282-2018-12-21--12:15:01          Fri, 2018-12-21 12:15:01 [5b93f864b9422ed953c1aabb5b1b98ce9ae04fe2f584c05e91b87213082e2ff0]
pdi282-2018-12-21--13:15:01          Fri, 2018-12-21 13:15:01 [461f976422c45a7d10d38d1db097abd30a4885181ec7ea2086d05f0afd9169eb]
pdi282-2018-12-21--14:15:01          Fri, 2018-12-21 14:15:01 [d31db1cd8223ca084cc367deb62e440bfe2dfe4fd163aefc6b6294935f1877b8]

Here's a screenshot of nemo running on my notebook with an sftp connection to my office desktop, after having mounted the available backups with the command given above.

../images/backup_list.png
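
Mounting is convenient for browsing, but a single file can also be pulled straight out of the newest archive. The following is a sketch under assumptions: latest_archive is a made-up helper, and it relies on the listing being ordered oldest first, as in the cron mail above.

```shell
#!/bin/bash
# Sketch: latest_archive is a hypothetical helper, not a borg command.
# It reads `borg list` output on stdin and prints the first column
# (the archive name) of the last non-empty line.
latest_archive() {
    awk 'NF { name = $1 } END { if (name != "") print name }'
}
```

With borg installed, borg extract "/bam/backup/attic::$(borg list /bam/backup/attic | latest_archive)" home/ob/path/to/file would then restore that file below the current directory. The file path is a placeholder; note that borg stores paths without the leading slash.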

The age of (digital) decline

About a decade ago, AppleInsider presented an enthusiastic report on the latest innovation from the iPhone inventor:

Apple is dramatically rethinking how applications organize their documents on iPad, leaving behind the jumbled file system [...].

Outside of savvy computer users, the idea of opening a file by searching through hierarchical paths in the file system is a bit of a mystery.

Apple has already taken some steps to hide complexity in the file system in Mac OS X, [...] the iPhone similarly abstracts away the file system entirely; there is no concept of opening or saving files.

I remember reading that and being very sceptical, perhaps as one of a few, but not the only one:

Heck, my 6 year old daughter can understand the idea of saving some files to a folder with her name on it, and others to different locations.

Another lone voice in the digital wilderness:

While this might sound like some kind of user experience utopia, I have a grave concern that eliminating a file system in this manner misses a huge audience. Us.

Now, almost 10 years later, we are beginning to pay the price for this development. How's that? Well, my experience shows that users who grew up with iOS or Android as their prime computing environment have difficulty grasping the basic paradigms that still dominate professional work with computers. In particular, only a few young users seem to understand the concept of a file (yes, a file, but see above), file types, and file systems. Even less understood is the client-server model, a concept that is indispensable in a modern IT infrastructure.

Consequences of this erosion of knowledge range from the comical to the disastrous. As an example of the former: when I ask for original data, I do not mean ASCII data embedded in an MS Word document or a photo- or micrograph embedded in a PowerPoint presentation. However, many young users do not know that 'data.dat' or 'photograph.tiff' are valid files in their own right that can be viewed and edited by suitable applications. A secretary at an associated university had the opposite tendency: she wrote invitations for seminars with Word, printed them, scanned the printout at 1200 dpi, and attached the resulting 100 MB bitmap to electronic invitations sent by e-mail.

That's funny if your e-mail account has no size limit. But even then, you may see that this development also has much less amusing consequences. On a very general level, these users are incapable of interacting appropriately with professional IT infrastructures, including common desktop environments (regardless of their provenance). More specifically, users with these deficiencies should not be trusted with handling and managing important data at all. Because ... they will lose them.

At least that's what's happening here: the number of users who experience a total loss of their data increased rapidly over the past few years. In most cases, the cause was not negligence and carelessness, but an alarming level of ignorance. Often, the root cause arose simply from bypassing the infrastructure we provide, and employing the private notebook for data analysis and presentation instead of the dedicated office desktop. Now, our employees can bring their own devices if they like, but if they don't register them with our IT staff, they will be classified as guest devices that have no access to our intranet – with the rather obvious result that the data on these devices cannot be synced to the home directory of the respective user on our file server (which is part of a daily incremental backup on tape, covering every day over the last 10 years as legally required).

The users, naturally, don't find that obvious at all (although they have been informed at length about these facts). They claim to have acted in the firm belief that the data on their private notebook will be automatically backed up to “the cloud” as soon as they enter a certain “zone” around their workplace. When I asked how they imagined this miracle backup would work, one of them referred to Apple commercials in which photographs were transferred from an iPhone to an iPad “magically”. “That's the state-of-the-art, right? I expected that it would be implemented here!” She also said that she imagined the mechanism to work wirelessly, but that she wouldn't care how it worked, as long as it did.

Now, when people approach me with these stories, they want (i) forgiveness and understanding and (ii) an immediate solution. Well...

../images/mccoy-doctor-not-magician.svg

Fortunately, we now have first level support consisting of an invariably cheery youth who finds these problems most entertaining. Let's see what he says a few years from now, when pampering the “digital natives” has become the next big thing.

And let's see where we are then, with our big hopes and high flying dreams of, for example, artificial intelligence and quantum computing, autonomous electric mobility, populating Mars, establishing controlled fusion on Earth, and controlling the world's climate. Personally, when I see the present generation, of which the majority has difficulty counting to three, well, you know, I'm not all that optimistic. ;)