Module:LaTeX2UTF8/doc

This is the documentation page for Module:LaTeX2UTF8

Context

The mathematical typesetting systems TeX and LaTeX are in near-universal use in mathematics as well as several scientific disciplines. Sites as diverse as Google Scholar, Mathematical Reviews and Zentralblatt MATH allow citations to be exported in BibTeX, a bibliographical system that uses TeX/LaTeX markup. Unfortunately, this software was written well before internationalization and localization efforts such as Unicode and UTF-8 reached critical mass. As a result, there is a fundamental impedance mismatch between how the two systems handle typography.

LaTeX (by which I'll refer to the three systems as a whole) is primarily concerned with character composition: if the user wishes to add an umlaut to the number 7, LaTeX provides that capability (and indeed, there are multiple ways of coding up such a construct in the language). Unicode, on the other hand, focuses on mapping glyphs to unique numerical identifiers. As such, Unicode tends to capture only extant glyphs.

Mapping LaTeX to Unicode (or UTF-8) thus presents two difficulties: LaTeX has an effectively infinite number of ways of representing the same symbol (although only a handful are in common use), and only a subset of possible symbols will be found in the set covered by Unicode. The second problem tends to be rare and we will leave it for another day. This module addresses the first problem.

Consider the following (legal) LaTeX code:

\'o {\'o} {\' o} {\'{o}}

If this were compiled, the result would look like:

ó ó ó ó

This module handles all four cases as well as a few LaTeX escape codes. Attempting to capture even a significant subset of the Unicode glyph set would be a herculean task, and nearly all of that effort might never be used. Instead, I have seeded this module with the glyphs that I expect to use in my own editing and hope that others add to it as its shortcomings become obvious.
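To make the four cases concrete, here is a minimal sketch of how they could be collapsed onto a single glyph with Lua patterns. This is not the module's actual implementation: the names ACCENTS, compose and translate_sketch are hypothetical, and the lookup table covers only a few letters for illustration.

```lua
-- Hypothetical lookup table: accent command -> base letter -> UTF-8 glyph.
local ACCENTS = {
  ["'"] = { a = "á", e = "é", o = "ó" },
  ['"'] = { o = "ö", u = "ü" },
  ["H"] = { o = "ő", u = "ű" },
}

-- Returns the composed glyph, or nil when the combination is not in the table.
local function compose( accent, letter )
  local row = ACCENTS[accent]
  return row and row[letter]
end

-- Patterns are tried from the most heavily braced form to the least, so that
-- an inner match never leaves stray braces behind.
-- Simplification: the sketch does not distinguish "\H o" from "\Ho".
local PATTERNS = {
  "{\\(['\"H])%s*{(%a)}}",  -- {\'{o}}
  "{\\(['\"H])%s*(%a)}",    -- {\'o} and {\' o}
  "\\(['\"H])%s*{(%a)}",    -- \'{o}
  "\\(['\"])(%a)",          -- \'o (symbol accents only)
}

local function translate_sketch( s )
  for _, pattern in ipairs( PATTERNS ) do
    -- When compose returns nil, gsub leaves the matched text unchanged.
    s = s:gsub( pattern, compose )
  end
  return s
end

print( translate_sketch( "\\'o {\\'o} {\\' o} {\\'{o}}" ) )
```

Combinations missing from the table are left as they were, which matches the seeding approach described above: unsupported glyphs stay visible as LaTeX until someone adds them.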

One final note: a properly cynical reader might have noticed that in the last example above there are two adjacent closing braces. If a user attempts to pass this string from article space using #invoke, the parser will assume the LaTeX "}}" closes the template, even if it is embedded in a string. The best solution I have so far is to keep LaTeX-encoded data in Module-space and have article-space calls request the translated strings. Thus, this module is not designed to work reliably via #invoke.
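The Module-space workaround can be sketched as follows. Everything here is an assumption made for illustration: the wrapper module name (say, Module:MyBibData), the entries table, the author function, and the stand-in translate helper, which on a wiki would be replaced by a call into this module.

```lua
-- Hypothetical wrapper module kept in Module space.
local p = {}

-- LaTeX-encoded strings live here, so their "}}" never passes through
-- the wikitext parser.
local entries = {
  MR0161818 = "Erd{\\H{o}}s, P{\\'a}l",
}

-- Stand-in for the real translation. On a wiki this would instead be:
--   local l2u = require( "Module:LaTeX2UTF8" )
--   return l2u.translate_diacritics( s )
local function translate( s )
  s = s:gsub( "{\\H{o}}", "ő" )
  s = s:gsub( "{\\'a}", "á" )
  return s
end

-- Article space passes only a plain key, never LaTeX markup.
function p.author( frame )
  local raw = entries[frame.args[1]]
  if not raw then
    return ""
  end
  return translate( raw )
end

-- A real Scribunto module would end with: return p
```

An article would then call something like {{#invoke:MyBibData|author|MR0161818}}, which contains no LaTeX braces for the parser to misread.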

Usage

This module is designed to be called from within other Lua modules. As an example:

function TranslateLaTeX()
  local l2u = require( "Module:LaTeX2UTF8" )  
  local s = [[
  @book {MR0161818,
  AUTHOR = {Erd{\H{o}}s, P{\'a}l and Sur{\'a}nyi, J{\'a}nos},
  TITLE = {V\'alogatott fejezetek a sz\'amelm\'eletb{\H o}l},
  PUBLISHER = {Tank\"onyvkiad\'o V\'allalat, Budapest},
  YEAR = {1960},
  PAGES = {250},
  MRCLASS = {10.00},
  MRNUMBER = {0161818 (28 \#5022)},
  } 
  ]]
  s = l2u.translate_diacritics( s )
  s = l2u.translate_special_characters( s )
  return s
end

The BibTeX entry should be transformed to a UTF-8 string that looks (line breaks aside) like this:

@book {MR0161818,
AUTHOR = {Erdős, Pál and Surányi, János}, 
TITLE = {Válogatott fejezetek a számelméletből},
PUBLISHER = {Tankönyvkiadó Vállalat, Budapest},
YEAR = {1960}, 
PAGES = {250}, 
MRCLASS = {10.00}, 
MRNUMBER = {0161818 (28 #5022)}, }
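
The MRNUMBER line shows the other half of the job: the escape code \# becomes a plain #. A minimal sketch of that step follows, assuming a hypothetical helper unescape_sketch and an assumed set of escaped characters (# & % _ $). The set the module actually handles is not stated here and may differ.

```lua
-- Hypothetical helper: drop the backslash in front of a few escaped characters.
local function unescape_sketch( s )
  -- Inside the character class, "%%" is a literal percent sign and "$" is literal.
  -- The parentheses around the call discard gsub's second return value (the count).
  return ( s:gsub( "\\([#&%%_$])", "%1" ) )
end

print( unescape_sketch( "0161818 (28 \\#5022)" ) )
```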