Encodings and R

The use of encodings is raised sporadically on the R mailing lists, with discussion of ideas to `do better'.  R has been developed by authors speaking English or a Western European language, and its current mindset is the ISO Latin 1 (aka ISO 8859-1) character set.  Even these authors find some problems, for example the lack of  some currency symbols (notably the Euro, € if it displays for you).  Users of R in Central Europe need more characters and are sometimes puzzled that Latin 2 (aka ISO 8859-2) works only partially.  Other languages present much greater challenges, and there is a project to `Japanize' R which (for obvious reasons) is little known outside Japan.

One of the challenges is that in European usage nchar(x) is the number of characters in a string, but it is also used for adjusting layouts.  In other encodings there can be three different values:

  1. The number of characters in the string,
  2. The number of bytes used to store the string, and
  3. The number of columns used to display the string (some characters may be double width even in a monospaced font).
Fortunately nchar is little used at R level (often just to see whether a string is empty), but the C-level equivalents are widely used in all three senses, and where nchar is used it is used in all three senses.
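
To make the three counts concrete, here is a minimal C sketch for one accented string.  The string and the assumption of a UTF-8 locale are only for illustration, and wcswidth() is a POSIX extension rather than part of C99.

    #define _XOPEN_SOURCE 600   /* for wcswidth() on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");            /* assume the environment supplies a UTF-8 locale */
        const char *s = "caf\xc3\xa9";      /* "café" encoded in UTF-8 */

        size_t nbytes = strlen(s);              /* 1. bytes used to store it: 5 */
        size_t nchars = mbstowcs(NULL, s, 0);   /* 2. characters: 4 (in a UTF-8 locale) */

        wchar_t w[16];
        mbstowcs(w, s, 16);
        int ncols = wcswidth(w, 16);            /* 3. display columns: 4 here, but CJK characters count as 2 */

        printf("bytes %zu, characters %zu, columns %d\n", nbytes, nchars, ncols);
        return 0;
    }

In a Latin-1 locale all three numbers coincide, which is why the distinction is easy to overlook.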

Update: This document was first written in December 2003: see below for the changes made for R 2.1.0.

Encoding in R 1.8.x

The default behaviour is to treat characters as a stream of 8-bit bytes, and not to interpret them other than to assume that each byte represents one character.  The only exceptions are:

With these exceptions, character encoding is the responsibility of the environment provided by the OS, so:

Towards Unicode?

It seems generally agreed that Unicode is the way to cope with all known character sets.  There is a comprehensive FAQ.  Unicode defines a numbering of characters up to 31 bits, although it seems agreed that only 21 bits will ever be used.  However, to use that numbering directly as an encoding would be rather wasteful, and most people seem to use UTF-8 (see this FAQ, rather Unix-oriented), in which each character is represented as 1, 2, ..., 6 bytes (and how many can be deduced from the first byte).  As 7-bit ASCII characters are represented as a single byte (with the high bit zero) there is no storage overhead unless non-American characters are used.  An alternative encoding is UTF-16, which is a two-byte encoding of most characters and a pair of two-byte units for the others (`surrogate pairs').  UTF-16 without surrogates is sometimes known as UCS-2, and was the Unicode standard prior to version 3.0.  (Note that the ISO C99 wide characters need not be encoded as UCS-2.)  UTF-16 is big-endian unless otherwise specified (as UTF-16LE).  There is the concept of a BOM, a non-printing first character that can be used to determine the endianness (and which Microsoft code expects to see in UTF-16 files).
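
As a sketch of the `deduced from the first byte' rule: the lead byte of a UTF-8 sequence determines its total length.  The helper below is written for this note, not taken from any library.

    #include <stdio.h>

    /* Total bytes in a UTF-8 sequence, deduced from its first byte.
       Returns 0 for a continuation byte or an invalid lead byte. */
    static int utf8_sequence_length(unsigned char b)
    {
        if (b < 0x80) return 1;   /* 0xxxxxxx: 7-bit ASCII */
        if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte, not a start */
        if (b < 0xE0) return 2;   /* 110xxxxx */
        if (b < 0xF0) return 3;   /* 1110xxxx */
        if (b < 0xF8) return 4;   /* 11110xxx */
        if (b < 0xFC) return 5;   /* 111110xx (allowed by the original scheme) */
        if (b < 0xFE) return 6;   /* 1111110x (ditto) */
        return 0;                 /* 0xFE and 0xFF never start a sequence */
    }

    int main(void)
    {
        const unsigned char s[] = "a\xc3\xa9";   /* 'a' then 'é' in UTF-8 */
        for (int i = 0; s[i]; ) {
            int len = utf8_sequence_length(s[i]);
            printf("character starting at byte %d uses %d byte(s)\n", i, len);
            i += len;
        }
        return 0;
    }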

Not only can a single character be stored in a variable number of bytes, but it can also be displayed in 1, 2 or even 0 columns.

Linux and other systems based on glibc are moving towards UTF-8 support: if the locale is set to en_GB.utf8 then the run-time assumes UTF-8 encoding is required. Here is a somewhat outdated Linux HOWTO: its advice is to use wide characters internally and ISO C99 facilities to convert to and from external representations.
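
A minimal sketch of that advice, assuming the environment supplies a UTF-8 locale: the bytes of the external representation are converted one character at a time with the C99 function mbrtowc(), and any further work is done on the wide characters.

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");              /* take the (assumed UTF-8) locale from the environment */
        const char *s = "a\xc3\xa9" "b";      /* "aéb" in UTF-8 */
        mbstate_t st;
        memset(&st, 0, sizeof st);

        const char *p = s;
        size_t left = strlen(s);
        wchar_t wc;
        while (left > 0) {
            size_t n = mbrtowc(&wc, p, left, &st);   /* decode one external character */
            if (n == (size_t)-1 || n == (size_t)-2) {
                fprintf(stderr, "invalid or incomplete multi-byte sequence\n");
                return 1;
            }
            if (n == 0) break;                       /* embedded NUL */
            printf("wide character %#06lx took %zu byte(s)\n", (unsigned long) wc, n);
            p += n; left -= n;
        }
        return 0;
    }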

Major Unix distributions (e.g. Solaris 2.8) are also incorporating UTF-8 support. It appears that the Mac part of MacOS X uses UTF-16.

Windows has long supported `wide characters', that is 2-byte representations of characters, and provides fonts covering a very wide range of glyphs (at least under NT-based versions of Windows).  This appears to be little-endian UCS-2, and it is said that internally Windows NT uses wide characters, converting the usual byte-based characters to and from wide characters as needed.  Some Asian versions of Windows use a double-byte character set (DBCS) which represents characters in one or two bytes: this is the meaning of char in Japanese-language versions of Windows.  Long filenames are stored in `Unicode', and are sometimes automatically translated to the `OEM' character set (that is ASCII plus an 8-bit extension set by the code page). Windows 2000 and later have limited support for the surrogate pairs of UTF-16.  Translations from `Unicode' to UTF-8 and vice versa by the functions WideCharToMultiByte and MultiByteToWideChar are supported in NT-based versions of Windows, and in earlier ones with the `Microsoft Layer for Unicode'.
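
For illustration, a sketch (Windows only) of a round trip between wide characters and UTF-8 using those two functions; the sample string and buffer sizes are arbitrary.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const wchar_t *w = L"caf\u00e9";        /* "café" as wide (UTF-16) characters */
        char utf8[64];
        wchar_t back[64];

        /* Wide characters -> UTF-8 (returns the byte count, including the NUL). */
        int nb = WideCharToMultiByte(CP_UTF8, 0, w, -1, utf8, (int) sizeof utf8,
                                     NULL, NULL);
        /* UTF-8 -> wide characters again. */
        int nw = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, back, 64);

        printf("UTF-8 form uses %d bytes; round trip gives %d wide characters\n",
               nb, nw);
        return 0;
    }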

Implementation issues

If R were to use UTF-8 internally we would need to handle at least the following issues:
The API for extending R would be problematic.  There are a few hundred R extensions written in C and FORTRAN, and a few of them manipulate character vectors.  They would not be expecting UTF-8 encoding (and probably have not thought about encodings at all).  Possible ways forward are:
This does raise the issue of whether the CHAR internal type should be used for UTF-8 or a new type created.  It would probably be better to create a new type for raw bytes.

Eiji Nakama's paper on `Japanizing R' seems to take the earlier multi-byte character approach rather than UTF-8 or UCS-2, except for Windows fonts.  Functions such as isalpha do not work correctly in MBCSs (including UTF-8).
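
A sketch of the problem and the usual remedy, assuming a UTF-8 locale: applying isalpha to the individual bytes of a multi-byte character misfires, whereas converting to a wide character first and using iswalpha behaves as intended.

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <wchar.h>
    #include <wctype.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");            /* assume a UTF-8 locale */
        const char *s = "\xc3\xa9";         /* "é": a single character stored in two bytes */

        /* Byte-at-a-time classification: in a UTF-8 locale neither byte is
           classified as alphabetic, so the character is missed entirely. */
        for (size_t i = 0; i < strlen(s); i++)
            printf("isalpha(byte %zu) = %d\n", i, isalpha((unsigned char) s[i]) != 0);

        /* Convert to a wide character first, then classify that. */
        wchar_t wc;
        mbstate_t st;
        memset(&st, 0, sizeof st);
        if (mbrtowc(&wc, s, strlen(s), &st) == 2)
            printf("iswalpha(whole character) = %d\n", iswalpha(wc) != 0);
        return 0;
    }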

The Debian guide to internationalization is a useful background resource.  Note that internationalization is often abbreviated as 'i18n', and localization (support for a particular locale) as 'L10n'.  The main other internationalization/localization issue is to allow for the translation of messages (and to translate them).

Encodings in R 2.1.0

Work started in December 2004 on implementing UTF-8 support for R 2.1.0, expected to be released in April 2005.  Currently implemented are:

For many of these features R needs to be configured with --enable-utf8.

Implementation details

The C code often handles character strings as a whole.  We have identified the following places where character-level access is used:

There are many other places which do a comparison with a single ASCII character (such as . or / or \ or LF) and so cause no problem in UTF-8 but might in other MBCSs.  These include filbuf (in platform.c, which looks for CR and LF and seems safe) and fillBuffer (in scan.c), among others.

Encodings which are likely to cause problems include:

fillBuffer (scan.c) has now been rewritten to be aware of double-byte character sets and to only test the lead byte.
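
A sketch of what `only test the lead byte' involves when scanning for a single-byte delimiter in a DBCS; the lead-byte test below is a rough Shift-JIS-style stand-in, not R's actual code.

    #include <stdio.h>
    #include <string.h>

    /* Rough stand-in for a real lead-byte test such as Windows'
       IsDBCSLeadByteEx(); the ranges below are approximately Shift-JIS. */
    static int is_lead_byte(unsigned char b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    /* Find a single-byte delimiter, skipping the trail byte of each
       two-byte character so it can never be mistaken for the delimiter. */
    static const char *dbcs_strchr(const char *s, char sep)
    {
        while (*s) {
            if (is_lead_byte((unsigned char) *s) && s[1] != '\0')
                s += 2;                   /* two-byte character: skip both bytes */
            else if (*s == sep)
                return s;
            else
                s++;
        }
        return NULL;
    }

    int main(void)
    {
        /* In Shift-JIS the katakana SO is 0x83 0x5C, and its trail byte is the
           same byte as a backslash: the classic way naive scans go wrong. */
        const char buf[] = "A\x83\x5C" "\\B";
        printf("naive strchr: offset %d; DBCS-aware scan: offset %d\n",
               (int)(strchr(buf, '\\') - buf), (int)(dbcs_strchr(buf, '\\') - buf));
        return 0;
    }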

Windows

Windows does things somewhat differently. `Standard' versions of Windows have only single-byte locales, with the interpretation of those bytes being determined by code pages. However, `East Asian' versions (an optional install at least on Windows XP) use double-byte locales in which characters can be represented by one or two bytes (and can be one or two columns wide). Windows also has `Unicode' (UCS-2) applications in which all information is transferred as 16-bit wide characters, and the locale does not affect the interpretation. Windows 2000 and later have optional support for surrogate pairs (UTF-16) but this is not normally enabled. (See here for how to enable it.) Currently R-devel has three levels of MBCS support under Windows.

Localization of messages

As from 2005-01-25, R uses GNU gettext where available. So far only the start-up message is marked for translation, as a proof-of-concept: there are several thousand C-level messages that could potentially be translated. The same mechanism could be applied to R packages, provided they call dgettext with a PACKAGE specific to the package, and install their own PACKAGE.mo files, say via an inst/po directory. The splines package has been converted to show how this might be done: it only has one error message.
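
As a sketch of the C side of such a call (the domain name mypkg, the message and the installation path are all made up for illustration):

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "");   /* pick up the user's language settings */

        /* Tell gettext where the package's compiled catalogues (mypkg.mo)
           live; both the domain name and the path are invented here. */
        bindtextdomain("mypkg", "/usr/local/lib/R/library/mypkg/po");

        /* Look the message up in the mypkg domain; if no translation is
           installed, the msgid itself comes back unchanged. */
        printf("%s\n", dgettext("mypkg", "invalid argument"));
        return 0;
    }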

Brian Ripley
2004-01-11, 2005-01-25