The encoding hell

Cobra

2011-11-13 17:13

At work, we've now discarded JabRef in favor of Mendeley for reference management. This step turned out to be a breakthrough: adding papers to Mendeley is so easy (you simply drag and drop from the browser to the Mendeley client) that people actually do it unsolicited. Our databases are thus rapidly growing.

Mendeley extracts all bibliographic information directly from the PDF and is fully Unicode aware. That's nice, since all author names and special characters in the title will be displayed correctly in the Mendeley client. However, what happens when this information is exported for further use as a LaTeX bibliography? Mendeley itself actually sports a conversion facility, but will that be sufficient?

Well, let's try and analyze the resulting bibliography with the help of the little script I've shown previously. Just as I feared, the encoding turns out to be OK (the file contains nothing that is not valid UTF-8), but compilation fails. The reason is simple: good ol' LaTeX is ASCII only, Unicode support via inputenc is very limited, and Mendeley translates only a very few characters (and, perversely, just those for which no translation would be required).
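A check along these lines can be sketched in a few lines of Python. This is an illustration only and not necessarily what the script from the earlier post does; the function name find_non_ascii and the sample entry are made up for the example. It confirms that the data decodes as UTF-8 and then lists every character that plain ASCII LaTeX would stumble over.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, character) pairs for every non-ASCII character.

    Raises UnicodeDecodeError if the data is not valid UTF-8.
    """
    text = data.decode("utf-8")
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:  # outside the 7-bit ASCII range
                hits.append((lineno, ch))
    return hits


# Hypothetical two-line excerpt of an exported bibliography.
sample = "author = {S\u00e1nchez-Garc\u00eda, J.}\ntitle = {Plain ASCII}\n".encode("utf-8")
for lineno, ch in find_non_ascii(sample):
    print(f"line {lineno}: {ch!r} (U+{ord(ch):04X}, decimal {ord(ch)})")
```

A file can pass the UTF-8 check and still produce a long list here, which is exactly the situation described above.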

What to do now?

The right thing to do would be to use a modern TeX system. Both XeTeX and LuaTeX fully support Unicode, and so does the BibTeX successor 'biber'. The main problem with that approach is simply that the journals to which we submit will not switch from the original TeX to one of its modern incarnations earlier than...say, 2017. And that's optimistic.

I had hoped that biber alone could solve the problem, since it can convert from one encoding (in the input) to another (in the output). However, it turned out that biber, too, knows only a very limited set of Unicode characters. What's worse, biber/biblatex is not compatible with natbib, a prerequisite for RevTeX. And of course, the fact that biber is not available in the standard repositories of the major Linux distributions will not contribute to its further dissemination.

A partial solution is the package 'mab2bib', which contains the Python script 'utf8_to_latex.py'. Using the included conversion map 'latex.py', this script converts the majority of characters to LaTeX-compliant command sequences. Those it doesn't know are converted to an expression like '\char{xxxx}', where xxxx is the decimal Unicode code point of the character in question (the same number that HTML uses in numeric character references).
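The general idea of such a map-with-fallback conversion can be illustrated as follows. This is a minimal sketch and not the actual code of utf8_to_latex.py; the three-entry table LATEX_MAP and the function to_latex are invented for the example, and the real map in latex.py is far larger.

```python
# Tiny illustrative map (an assumption, not the contents of latex.py).
LATEX_MAP = {
    "\u00e1": "\\'{a}",   # precomposed a with acute
    "\u00f1": "\\~{n}",   # precomposed n with tilde
    "\u00f6": '\\"{o}',   # precomposed o with diaeresis
}


def to_latex(text, table=LATEX_MAP):
    """Replace known non-ASCII characters, fall back to \\char{decimal}."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # plain ASCII passes through
        elif ch in table:
            out.append(table[ch])                # known character
        else:
            out.append(f"\\char{{{ord(ch)}}}")   # unknown: decimal code point
    return "".join(out)


print(to_latex("\u00f1"))    # precomposed character, found in the map
print(to_latex("a\u0301"))   # letter followed by a combining accent
```

With a map that only knows precomposed characters, a letter followed by a separate accent character ends up as the untouched letter plus a '\char{...}' expression.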

What you will thus see whenever you convert a Mendeley bibliography containing names such as 'Sánchez-García' to pure LaTeX are the following numbers: 769, 771, and 776. They do not correspond to stand-alone characters, but to combining accents that accompany certain German and Spanish letters:

769  [Unicode Character 'COMBINING ACUTE ACCENT' (U+0301)]  i.e., an acute accent as in á
771  [Unicode Character 'COMBINING TILDE' (U+0303)]         i.e., a tilde as in ñ
776  [Unicode Character 'COMBINING DIAERESIS' (U+0308)]     i.e., an umlaut as in ö
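These numbers are easy to verify with Python's standard unicodedata module. The sketch below (the helper name combining_marks is made up) decomposes a string into base letters plus combining marks and reports each mark's decimal code point and official name. It assumes the input may be precomposed and therefore normalizes to NFD first; a Mendeley export that is already decomposed passes through that step unchanged.

```python
import unicodedata


def combining_marks(text):
    """List (decimal code point, Unicode name) for each combining mark in text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [
        (ord(ch), unicodedata.name(ch))
        for ch in decomposed
        if unicodedata.combining(ch)  # non-zero for combining marks
    ]


for code, label in combining_marks("S\u00e1nchez-Garc\u00eda"):
    print(code, label)
```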

The character 'á' can thus be represented in two different ways in Unicode...

(i)  a + 'U+0301'   (letter first)
(ii)  'U+00E1'   (precomposed)

...while in LaTeX, this character is represented by

\'{a}           (letter last)
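The reordering from 'letter first' to 'letter last' can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a finished converter: the table ACCENTS covers only the three marks listed above, the function name combining_to_latex is made up, and stacked accents and the dotless i (which some LaTeX setups want written as \i) are not handled.

```python
import unicodedata

# Combining mark -> LaTeX accent command (only the three marks discussed here).
ACCENTS = {
    "\u0301": "\\'",   # combining acute accent (769)
    "\u0303": "\\~",   # combining tilde (771)
    "\u0308": '\\"',   # combining diaeresis (776)
}


def combining_to_latex(text):
    """Rewrite 'letter + combining mark' as an accent command around the letter."""
    # Decompose first so precomposed characters are treated the same way.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch in ACCENTS and out:
            base = out.pop()                          # the letter came first ...
            out.append(f"{ACCENTS[ch]}{{{base}}}")    # ... and now goes last
        else:
            out.append(ch)
    # Recompose whatever was decomposed but is not covered by ACCENTS.
    return unicodedata.normalize("NFC", "".join(out))


print(combining_to_latex("Sa\u0301nchez-Garci\u0301a"))
```

Both representations of a character give the same result here, since the precomposed form is decomposed before the accents are moved.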

The Python script mentioned above lacks the ability to translate these combining characters. Looks like we have to do it ourselves. Stay tuned.