The fastest search, or: find your password

A couple of weeks ago, a monstrous set of login data was passed around in the dark web. Now known as collections #1 – #5, the files contain an estimated 2.2 billion of unique email address/password combinations. And that means cracked passwords, mind you, not only hashes.

Some of my account data have been leaked ages ago by Adobe and Dropbox (at that time, I believed in simple – 8-digit – passwords for test accounts). Incidentally, I had already deleted these accounts when the breach became public.

Currently, all of my accounts are secured by an at least 25-digit random password with an actual entropy not lower than 100 bits, which is essentially impossible to brute force even from unsalted SHA1 hashes. Hence, none of my accounts should show up in the collections mentioned above. If one did, it would show that the password had been saved in plain text, and I would react correspondingly by deleting this account.

The most comprehensive list of password hashes (including collection #1) has been assembled by Troy Hunt. We can download and search this list very easily as shown in the following.

$ cd /hdd/Downloads/pwned
$ wget
$ cp pwned-passwords-sha1-ordered-by-hash-v4.7z /ssd/temp/pwned
$ cd /ssd/temp/pwned/
$ unp ppwned-passwords-sha1-ordered-by-hash-v4.7z
$ wc -l <pwned-passwords-sha1-ordered-by-hash-v4.txt
$ awk -F: '{ SUM += $2 } END { print SUM }' pwned-passwords-sha1-ordered-by-hash-v4.txt

551 million passwords, and 3.344 billion accounts: holy cow. What's the fastest way to search for a single string in such a huge file? C't 5/2019 has actually an entire article on this issue, complete with a github repository.

Pina Merkert points out that grep, the standard tool for searching text files under Linux, does not exploit the fact that Troy Hunt offers the download also in a version where the passwords are ordered by hash (that's the version I've downloaded above). The linear search performed by grep is indeed rather slow:

$ time echo -n "111111" | sha1sum | awk '{print toupper($1)}' | grep -f - pwned-passwords-sha1-ordered-by-hash-v4.txt

real    0m48.227s
user    0m30.939s
sys     0m5.752s

The possibility to order the data allows a binary search to be performed, which is potentially orders of magnitude faster – O(log[n]) vs. O(n) – than a linear search. Pina's python script is indeed dramatically faster:

$ time ./ '111111'
Searching for hash 3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D of password "111111".
Password found at byte  5824044773: "3D4F2BF07DC1BE38B20CD6E46949A1071F9D0E3D:3093220"
Your password "111111" was in 3093220 leaks or hacked databases! Please change it immediately.

real    0m0.042s
user    0m0.035s
sys     0m0.007s

But my one-liner is four times faster than her 57-line script: ;)

$ time look $(echo -n "111111" | sha1sum | awk '{print toupper($1)}') pwned-passwords-sha1-ordered-by-hash-v4.txt

real    0m0.011s
user    0m0.009s
sys     0m0.005s

'Look', by the way, is a binary search tool, part of the util-linux package and thus present on any Linux installation (and also on the Windows subsystem for Linux, I guess). Unfortunately, several distributions (such as Debian and Ubuntu) manage to ship a broken version of look since about 10 years. There's an unofficial patched version available on GitHub if you'd like to try.

With the help of 'look', one can also very easily go through lists of passwords – let's take these completely arbitrary examples:

less topten.dat


$ time while read -r password; do look $(echo -n "$password" | sha1sum | awk '{print toupper($1)}') pwned-passwords-sha1-ordered-by-hash-v4.txt; done <topten.dat


real    0m0.053s
user    0m0.048s
sys     0m0.023s

If you ever had a shred of doubt that humanity has a glorious future, look at these 10 ingenious passwords, which secure the 54 million accounts of our most brilliant minds!