Daily News about R-devel/NEWS

This blog is updated daily.

There are new ‘configure’ options ‘--with-internal-iswxxxxx’, ‘--with-internal-towlower’ and ‘--with-internal-wcwidth’ which allow the system functions for wide-character classification, case-switching and width (‘wcwidth’ and ‘wcswidth’) to be replaced by internal ones. The first has long been used on macOS, AIX (and Windows), but this enables it to be unselected there and selected for other platforms (it is the new default on Solaris). The second is new in this version of R and is selected by default on macOS and Solaris. The third has long been the default and remains so, as it contains customizations for East Asian languages.

System versions of these functions are often minimally implemented (sometimes only for ASCII characters) and do not cover the full range of Unicode points: for example Solaris (and Windows) only cover the Basic Multilingual Plane.

~~Unicode character width tables (as used by ‘nchar(, type="w")’) have been updated to Unicode 12.1 by Brodie Gaslam (PR#17781).~~

The parser now treats ‘\Unnnnnnnn’ escapes larger than the upper limit for Unicode points (‘\U10FFFF’) as an error as they cannot be represented by valid UTF-8.
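
A minimal sketch of how the parser change might be checked from R, assuming a build that includes it. The source text is handed to ‘parse()’ as a string so that a parse error comes back as a condition object instead of stopping the script. The helper name ‘try_parse’ is made up for illustration, and older versions of R may accept the second literal.

```r
# Hypothetical helper: parse R source held in a string and return either
# the parsed expression or the error condition that was signalled.
try_parse <- function(src) {
  tryCatch(parse(text = src), error = function(e) e)
}

# '\U10FFFF' is the upper limit for Unicode points, so this literal
# should parse.
at_limit <- try_parse('"\\U10FFFF"')

# '\U110000' is above the limit; with the change described above the
# parser is expected to signal an error rather than build a string.
above_limit <- try_parse('"\\U110000"')

inherits(at_limit, "error")
inherits(above_limit, "error")
```

The result of the last line depends on the version of R that runs it, so no output is shown here.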

Where such escapes are used for outputting non-printable (including unassigned) characters, 6 hex digits are used (rather than 8 with leading zeros). For clarity, braces are used, for example ‘\U{0effff}’.
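
The output format just described can be imitated in a few lines. This is only an illustrative sketch of the notation, not R's internal printing code, and the function name ‘format_unicode_escape’ is an assumption made for the example.

```r
# Hypothetical helper, not R's internal code: format a code point as
# 6 lower-case hex digits inside braces, as in the example in the text.
format_unicode_escape <- function(code_point) {
  stopifnot(code_point >= 0, code_point <= 0x10FFFF)
  sprintf("\\U{%06x}", as.integer(code_point))
}

# cat() writes the characters of the result without re-escaping the
# backslash; this prints \U{0effff}
cat(format_unicode_escape(0x0effff), "\n")
```

Six digits are enough because the largest Unicode point, ‘10FFFF’, has six hex digits.
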

There is a build-time option to replace the system's wide-character ‘wctrans’ C function by tables shipped with R: use ‘configure’ option ‘--with-internal-towlower’ or (on Windows) ‘-DUSE_RI18N_CASE’ in ‘CFLAGS’ when building R. This may be needed to allow ‘tolower()’ and ‘toupper()’ to work with Unicode characters beyond the Basic Multilingual Plane where not supported by system functions (e.g. on Solaris where it is the new default).
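
One way to see whether a given build handles case conversion beyond the Basic Multilingual Plane is to try a cased letter from a supplementary plane. The sketch below uses the Deseret letters U+10428 and U+10400 as an example pair; the choice of letters is an assumption made for illustration, and the outcome depends on the platform and on how R was configured.

```r
# DESERET SMALL LETTER LONG I and the corresponding capital letter.
# Both lie outside the Basic Multilingual Plane (code points above 0xFFFF).
small <- "\U10428"
capital <- "\U10400"

# Confirm that the test character really is beyond the BMP.
utf8ToInt(small) > 0xFFFF

# Whether these are TRUE depends on the case-conversion tables in use
# (system functions or the tables shipped with R).
identical(toupper(small), capital)
identical(tolower(capital), small)
```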

Unicode character width tables (as used by ‘nchar(, type = "w")’) have been updated to Unicode 12.1 by Brodie Gaslam (PR#17781), including many emoji.
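
A short sketch of how the width tables show up in practice. The sample characters are chosen for illustration only; the widths reported for the wide character and the emoji depend on the tables compiled into the running version of R, so no output is shown.

```r
x <- c(ascii = "a", cjk = "\u4e2d", emoji = "\U1F600")

# Number of characters: one per element, regardless of display width.
nchar(x, type = "chars")

# Display width in columns, taken from the Unicode width tables.
nchar(x, type = "width")
```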