Modern plague

Since I was just talking about publications: one of the cornerstones of the modern publishing business is the peer review process. Non-scientists usually have a hard time believing what this term actually means. Believe me, I've had a hard time too.

In short, peer review means that you have to read the manuscripts of absolute strangers and provide an authoritative and tactfully written review detailing the reasons for acceptance or rejection. All of this for the editors of journals in which you pay to publish and which you pay to read, and all for free, of course. That's right, we don't get anything for it. Some journals send a Christmas card, but the main motivation is that the community sells it as an honor and a moral obligation.

That wouldn't be a problem if they asked you once a month, and you had nothing more important to do anyway. I'm currently getting asked twice a week, and I know that in 95% of all cases, the manuscript under consideration is from China or Korea and I will reject it anyway.

Why? Well, let me quote an insider:

'One Chinese scientist has referred to the majority of China's publications as "pollution".'

Poorly written, zero content. And if they continue to flood the journals with their trash, the peer review system is going to end.

Act of desperation

An essential, even defining element of modern science is that it takes place in public. Scientific results must be published: this principle has been valid for centuries, but the process itself has changed drastically over just the past few decades.


Back in 1958, for example, John J. Hopfield obtained his PhD in physics at Cornell University. The results of his work were summarized in a single paper submitted to the Physical Review.

The manuscript itself was handwritten, as was usual in those days. It was probably John's girlfriend rather than the department's secretary who transcribed the text on a mechanical typewriter, excluding, of course, the numerous equations, which John carefully inserted into the transcript by hand. John also provided rough sketches of the figures, which were then traced by the ladies in Cornell's drafting office and photographed for submission. The people at Cornell's post office helped with the submission of the paper.

Upon acceptance, the manuscript was carefully proofread and converted into a typographically correct typescript by the Physical Review's assistant editor responsible for the manuscript. In particular, the handwritten equations in John's manuscript had to be transcribed correctly, and that was no small task:

Hamiltonian

Now, let's compare that to the situation encountered in 2012 by the imaginary PhD student Hans Weißwurst. Hans has been told that prospective employers will not be very impressed by a single Physical Review paper. He has heard lots of talk about illustrious "high-impact" journals, and he has the vague hope of submitting his work to one of them.

What he doesn't know, and what nobody prepared him for: he will be author, secretary, designer, drafter, editor, first critic, and proofreader all in one person. Unfortunately, he does not have a clue about any of these jobs. When he's told that it's about time to write a paper, the drama unfolds.

Hans believes that the first thing he has to work on is the title, which is not altogether unreasonable if nobody tells you otherwise. He gives himself a week to find an appropriate title, but is still unsatisfied after the second week. He thus starts to write the abstract and the introduction, but instead finds himself still brooding over the title after six weeks.

After three months, his supervisor asks about the progress of the paper, and Hans hurries to finish his work. What's still missing is the presentation and discussion of his results, but that's not too difficult, since he knows them all by heart. He thus quickly describes them and finally rushes to create some rough plots of the data.


Hans is imaginary, but his manuscript is not. In the past twenty years I've edited and rewritten many manuscripts by young scientists, and several of them reflected a shocking naïveté and an astounding ignorance. This year, the situation has worsened to such a degree that I couldn't see any other way than to write a tutorial. At first, this tutorial was intended for internal use only, but on second glance I believe it may also be read profitably by a broader audience. Have fun. 😉

Growth

Every now and then I pick up Indy and step on our balance to check his weight. The results are displayed in the figure below.

Indy's weight

The thick, light blue line is the result of a linear fit, revealing a weekly weight increase of about 140 g. A deviation from linearity is not yet apparent, which is why the second fit with a logistic function (thin dark blue line), and the resulting prediction of a final weight of about 6 kg, is not very reliable at the moment.
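For reference, and with the same parameter names as in the script below, the two fitted models are

w_{\text{lin}}(t) = a + b\,(t - t_0), \qquad
w_{\text{log}}(t) = \frac{c}{1 + d\,\exp\!\left(-e\,(t - t_0)\right)}

Since the weight is given in kg and gnuplot's time axis runs in seconds, the slope b comes out in kg/s; the script converts it to g/week via gain = 1000*7*24*60*60*b. The predicted final weight is simply the large-t limit of the logistic curve, i.e., the parameter c.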

Here's the gnuplot script generating this figure:

set terminal svg size 600,400 dynamic enhanced fname 'palatino' fsize 12 solid
set output 'indyweight.svg'
set xlabel "date"
set ylabel "weight (kg)"
set xtics nomirror
set ytics nomirror
set style line 1 ps 1.5 pt 4 lc -1
set style line 2 lw 8 lc 8
set style line 3 lw 2 lc 6
set key right bottom
set xdata time
set timefmt "%d.%m.%y"
set xrange ["11.07.11":"27.11.11"]
toffset=strptime("%d.%m.%y","18.07.11")
linear(x) = a + b*(x-toffset)
c = 5; d = 0.1; e = 3e-9
logistic(x) = c/(1 + d*exp(-e*(x-toffset)))
fit linear(x) "/home/cobra/Documents/indy" using 1:2 via a,b
fit logistic(x) "/home/cobra/Documents/indy" using 1:2 via c,d,e
gain = 1000*7*24*60*60*b
finalweight = c
set label "linear gain: %g g/week", gain at graph  0.02, graph  0.95
set label "predicted final weight: %g kg",  finalweight at graph  0.02, graph  0.9
plot "/home/cobra/Documents/indy" using 1:2 notitle with points ls 1, linear(x) with line ls 2, logistic(x) with lines ls 3

Update: For some reason, WebKit-based browsers such as Chromium don't handle the particular svg above correctly, but add huge upper and lower margins. I have thus replaced it with a png for the moment.

The encoding hell

At work, we've now discarded JabRef in favor of Mendeley for reference management. This step turned out to be a breakthrough: adding papers to Mendeley is so easy (you simply drag and drop from the browser to the Mendeley client) that people actually do it unsolicited. Our databases are thus rapidly growing.

Mendeley extracts all bibliographic information directly from the pdf and is fully unicode aware. That's nice, since all author names and special characters in the title will be displayed correctly in the Mendeley client. However, what will happen when this information is exported for further use as a LaTeX bibliography? Mendeley itself actually sports a conversion facility, but will that be sufficient?

Well, let's try and analyze the resulting bibliography with the help of the little script I've shown previously. Just as I feared, the encoding is reported to be ok (the file does not contain any non-UTF-8 encoded characters), but compilation fails. The reason is simple: good ol' LaTeX is ASCII only, unicode support via inputenc is of a very limited nature, and Mendeley translates only very few characters (and, perversely, just those for which such a translation would not be required).

What to do now?

The right thing to do would be to use a modern TeX system. Both XeTeX and LuaTeX fully support unicode, and so does the BibTeX successor 'biber'. The main problem with that approach is simply that the journals to which we submit will not change from the original TeX to one of its modern incarnations earlier than...say, 2017. And that's optimistic.

I hoped that biber alone could solve the problem, since it has the capability to convert from one encoding (in the input) to another one (in the output). However, it turned out that biber also knows only a very limited set of unicode characters. What's worse is that biber/biblatex is not compatible with natbib, a prerequisite for RevTeX. Of course, the fact that biber is not available in the standard repositories of the major Linux distributions will not contribute to its further dissemination.

A partial solution is the package 'mab2bib', which contains the Python script 'utf8_to_latex.py'. Using the included conversion map 'latex.py', this script converts the majority of characters to LaTeX-compliant command sequences. Those it doesn't know are converted to an expression like '\char{xxxx}', where xxxx is the decimal (HTML or UTF-16) code of the character in question.

What you will thus see whenever you attempt to convert a bibliography from Mendeley containing names such as 'Sánchez-García' to pure LaTeX are sequences like 769, 771, and 776. These do not correspond to actual characters, but to combining accents accompanying certain German and Spanish letters:

769  COMBINING ACUTE ACCENT (U+0301)  i.e., an acute accent as in á
771  COMBINING TILDE (U+0303)         i.e., a tilde as in ñ
776  COMBINING DIAERESIS (U+0308)     i.e., an umlaut as in ö

The character 'á' can thus be represented in two different ways using unicode...

(i)  a + 'U+0301'   (letter first)
(ii)  'U+00E1'      (precomposed)

...while in LaTeX, this character is represented by

\'{a}           (letter last)

The Python script mentioned above lacks the ability to translate these combining characters. Looks like we have to do it ourselves. Stay tuned.
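In the meantime, here's a minimal sketch in Python of what such a translation could look like. The accent map covers only the three combining characters listed above, and the example name is the one from before; a real solution would of course have to handle many more cases.

# -*- coding: utf-8 -*-
# Minimal sketch: rewrite a combining accent (which follows its letter
# in unicode) as the corresponding LaTeX accent command (which precedes
# its letter). Only the three accents discussed above are mapped.

accents = {
    u'\u0301': "\\'",   # COMBINING ACUTE ACCENT (769), as in á
    u'\u0303': '\\~',   # COMBINING TILDE (771), as in ñ
    u'\u0308': '\\"',   # COMBINING DIAERESIS (776), as in ö
}

def combining_to_latex(text):
    out = []
    for char in text:
        if char in accents and out:
            # wrap the preceding letter into the accent command
            out[-1] = accents[char] + '{' + out[-1] + '}'
        else:
            out.append(char)
    return ''.join(out)

print(combining_to_latex(u'Sa\u0301nchez-Garci\u0301a'))
# -> S\'{a}nchez-Garc\'{i}a

Running the exported bibliography through something like this before feeding it to utf8_to_latex.py would at least take care of the names above.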

Cloudy

Geek [giːk]: A person discovering the cloud on the day iCloud is announced.
Nerd [nɜːd]: A person bored to death by geeks chattering about the cloud.
Freak [fɹi:k]: A person believing that clouds are in the sky.

Synchronization of my data had been an issue for me long before Dropbox materialized in 2008. I used a crude but simple solution based on rsync scripts started manually or via the crontab. Something more elegant and efficient would have been possible with inotify as described here. Lsyncd is another option aimed at the same purpose. However, building an automatic two-way sync service based on these tools that is comparable to Dropbox or Ubuntu One is far from trivial. Since the amount of data I have to sync is steadily increasing, I have started to feel a little frustrated with this situation.
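Just to illustrate the principle (this is not my actual setup): a one-way "push every change to the server" service takes only a few lines of Python with pyinotify. The paths and the host below are placeholders, and the genuinely hard part, conflict-free two-way synchronization between several clients, is exactly what such a script does not address.

#!/usr/bin/env python
# One-way sync sketch: watch a directory with inotify and mirror every
# change to the server via rsync. Paths and host are placeholders;
# requires the pyinotify package.
import subprocess
import pyinotify

SRC = '/home/cobra/Documents/'        # local directory to watch
DEST = 'user@server.org:Documents/'   # rsync target on the server

class Pusher(pyinotify.ProcessEvent):
    def process_default(self, event):
        # any file system event triggers an (incremental, hence cheap) rsync run
        subprocess.call(['rsync', '-az', '--delete', SRC, DEST])

wm = pyinotify.WatchManager()
mask = (pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE |
        pyinotify.IN_DELETE | pyinotify.IN_MOVED_TO | pyinotify.IN_MOVED_FROM)
wm.add_watch(SRC, mask, rec=True, auto_add=True)
pyinotify.Notifier(wm, Pusher()).loop()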

As much as I disapprove of the recent hype about cloud services, I cannot deny that Dropbox & Co. are far more complete synchronization services than the primitive and rudimentary solutions I've been using. In particular, the real-time synchronization offered by these services results in a level of data integrity unattainable by conventional sync or even backup schemes. For example, while I'm typing this very blog entry, any one of these services would ensure that not a word is lost even if my cat suddenly hits the power button, because everything I type is synced to the cloud in real time.

Well, then, why don't I use these services if they are so great? For reasons of control, security, and privacy. As a general rule, I prefer to have control over my data rather than turning them over to an organization which I do not trust by default (and why should I?). This attitude is corroborated by experience, and indeed, there has never been a better example than Dropbox. How can we be expected to trust the system if it's proven to be broken by design after one critical glance (see also here and here)? These security concerns compromise the usability of Dropbox: I really don't want to have to think about which data had better be put into an encfs-encrypted folder before moving them to the cloud.

Wouldn't it be great if there were a free and trustworthy service capable of the same effortless, instantaneous synchronization of data as offered by Dropbox & Co? Ideally, this service could be installed on our own servers, so that there'd be no need to register or pay, no size limit, and no one to trust except ourselves. Meet sparkleshare.


Sparkleshare

Server

Any machine running openssh-server and git-core will do. On pdes-net.org, piet took care of the latter dependency some days ago—thx, piet! 😊

After installation, issuing

cd
git init --bare sync.git

will initialize a git repository in the directory /home/user/sync.git.

For older versions of git that do not yet understand the syntax above, do the following instead:

mkdir sync.git
cd sync.git
git --bare init-db

Client

I assume that you already have a public-key ssh connection to the server of your choice. If you connect to server.org via a non-standard port, for example 1234, define it in ~/.ssh/config:

Host server.org
Port 1234

Now, install Sparkleshare and its dependencies. Both Archlinux and Ubuntu had the latest version in their repositories, but YMMV.

If you did not use git before, introduce yourself:

git config --global user.name "Firstname Lastname"
git config --global user.email "first.last@email.com"

Now, start Sparkleshare from the menu or from the command line by issuing 'sparkleshare start'. Answer the questions. Note that the server address should be of the form "user@server" (for the setup above, e.g., user@server.org), and the subsequent path should be absolute (e.g., /home/user/sync.git).

That's it. You have established your own personal cloud. 😊

Bundestrojaner

Almost 5 years ago, I speculated that a state trojan launched by the German government and the BKA would soon be detected by common anti-virus scanners ("Eine hohe Verbreitung vorausgesetzt, werden nach einer gewissen Anlaufzeit vermutlich auch alle Malwarescanner in der Lage sein, den Bundestrojaner zu erkennen.", roughly: "Provided it is widely deployed, all malware scanners will presumably be able to detect the Bundestrojaner after a certain ramp-up period.")

The following screenshot illustrates the situation just one day after the CCC disassembled code they believe to represent such a state trojan.

results from jotti

Groundhog Day

Every year at this time we prepare the annual report, and every year I'm shocked by the horrid look of many of the submitted figures. This year I decided to try to understand what people do and why (C: Cobra, S: Student).

C: Your figures are not suitable for the annual report. They look...eh...horrible.
S: Why?
C: Well, don't you see all the compression artifacts here *point* and there *point* and all these pixels all over the place?
S: Now that you say that...but what can I do? *shrug*
C: What about telling me what you did?
S: Nothing special, the standard way.
C: The standard way?
S: Sure. I create the figure with Powerpoint, copy and paste it into this Gimp thing and then save it as eps.
C: Hm...do you actually know the difference between vector and pixel graphics?
S: Of course! Most certainly!
C: Tell me.
S: In pixel graphics, the information, I mean the color and so on, is encoded per pixel. In vector graphic, each pixel is represented by three vectors, one for each color. That's why vector graphics is so much bigger. But the advantage is that you can distort the image as you like, while pixel graphics is fixed because the pixel is always square shaped.
C: ...

Let's examine these statements with the help of an example.

Here's the original Archlinux logo as vector art, scaled to the column width of 600 px. Regardless of scaling, its size is 4 kB when saved as svgz. Saving it as pdf reduces its size further to 2.9 kB.

ArchLinux logo as vector graphics

And here's the same image when saved in the format of the 21st century, scaled to the column width of 600 px, and reduced in quality to finally yield a size of 4.8 kB (and thus similar to the vector art above).

ArchLinux logo as raster image

For educational reasons, I invite you to press the '+' key on your keyboard five, no, ten times. I'm sure you'll see the difference and understand it, at least from a practical point of view. 😉

When we finally save the pathetic remains of the logo as a vector graphic, the resulting size is indeed on the order of a hundred kB. The reason is obvious.

If not, magnify the images again and have a closer look.

PS: I'm not alone:

Your figures are ugly.

The last day of summer

Or: How to lose loyal users: a beginner's guide for soon-to-be-extinct Linux distributions

  1. Promise that the badly needed upgrade will be recognized by the update manager.
  2. In case it isn't, put a description on the Wiki which can't possibly work. Let the user find out why.
  3. After the user has found out and forced the upgrade, arrange for numerous conflicts that boost the user's problem-solving abilities.
  4. When the user has sorted out all the challenges, present a kernel panic upon reboot. Give him the real deal!

Farewell, Mandriva! You were my trusted companion for a decade, and I'm sure to miss many of your amenities. But I can't use a system which offers a TeX distribution from 2007, and which breaks upon an online upgrade.

There was no question what I'd install instead: that had been clear since my discovery of Arch Linux more than two years ago. I use Debian Testing on all compute servers, but it's not quite up-to-date enough for a desktop if you ask me (I was an avid user of Mandriva Cooker until I decided that this platform, while offering comparatively current packages, is simply too unstable to be of use). In contrast, I have not seen Arch break in the two years I've been following its progress in two virtual machines. I have also become moderately familiar with Arch Linux itself, which I still believe to be the most transparent and, in a sense, simplest distribution I've ever tested and used.

The installation and configuration was, as usual, straightforward, but two issues remain. First, 'keychain' works, but neither 'openssh-askpass' nor 'ksshaskpass' do. I thus have to call 'ssh-add' manually after each reboot. Admittedly not a big thing. The second issue is more disturbing: while both 'privoxy' and 'pdnsd' work perfectly when used separately, they don't work together. I just get 404s when trying, and I have no clue as to the reason.

Everything else, however, functions perfectly. There are, of course, many small things to be taken care of when changing from a very old to a very new distribution (just think about python 2.x and 3.x), but most of this tinkering is over and done. I can lean back and enjoy. 😉

New desktop

The little pacman you see in the tray, by the way, is the icon of yapan, a cute little update manager which keeps the system up-to-date in its own cute little way. 😊

Resolution

True to my word, I've acquired a camera to be able to document Indy's growth and progress in appropriate detail. You are the judge, of course, but I admit that I'm very happy with the first snapshots. 😊

Snapshots of Indy

Integrity check

Now I've got this super-bibliography with thousands of entries and want to know whether all entries are ok, or whether one of them contains a weird character that is displayed incorrectly or even prevents a successful compilation by LaTeX/BibTeX. How do I do that? Probably the easiest way is to run the following LaTeX file

\documentclass{amsart}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
\nocite{*}
\bibliographystyle{amsplain}
\bibliography{bibliography}
\end{document}

through the following script (which assumes that the LaTeX file above is saved as biblist.tex and the bibliography as bibliography.bib):

#!/bin/bash
# check the encoding reported by file(1) for the bibliography
charset=$(file -bi bibliography.bib | awk '{print $2}')
if [ "$charset" = "charset=utf-8" ] || [ "$charset" = "charset=utf8" ]; then
  echo "Encoding ok"
else
  echo "Non UTF-8 character detected"
fi
# compile the test document; rubber stays quiet unless something goes wrong
errors=$(rubber --pdf --quiet biblist.tex 2>&1)
if [ -z "$errors" ]; then
  echo "Compilation successful"
else
  echo "Compilation failed"
  echo "$errors"
fi
# clean up the files generated by the compilation
rubber --pdf --clean biblist.tex
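
For a clean bibliography, the script then simply reports:

Encoding ok
Compilation successful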