Daily News about R-devel/NEWS

This blog is updated daily.

The parser now treats ‘\Unnnnnnnn’ escapes larger than the upper limit for Unicode points (‘\U10FFFF’) as an error as they cannot be represented by valid UTF-8.
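The limit mentioned above comes from Unicode itself rather than from R. As a minimal sketch (in Python, purely to illustrate the rule; the helper name `is_valid_code_point` is hypothetical and is not how R's parser is implemented), the check amounts to a range comparison against U+10FFFF:

```python
# Illustration of the Unicode code point ceiling; not R's actual parser code.
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode


def is_valid_code_point(cp: int) -> bool:
    """Return True if cp lies inside the Unicode code space."""
    return 0 <= cp <= MAX_CODE_POINT


# The highest code point still encodes to a 4-byte UTF-8 sequence.
highest = chr(MAX_CODE_POINT).encode("utf-8")
print(highest.hex())                  # f48fbfbf
print(is_valid_code_point(0x110000))  # False

# Python itself refuses anything above the limit.
try:
    chr(0x110000)
except ValueError as err:
    print("rejected:", err)
```

Anything above that ceiling has no valid UTF-8 form, which is why treating such an escape as an error is the only consistent option.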

Where such escapes are used for outputting non-printable characters, 6 (not 8) hex digits are used (as it was decided by Unicode that the first two would always be zero).
The parser now looks for non-ASCII spaces on Solaris, in addition to Windows, macOS, FreeBSD and OSes such as Linux that declare that ‘wchar_t’ is encoded as Unicode.
There are warnings (including from the parser) on the use of unpaired surrogate Unicode points such as ‘\uD834’ (which cannot be converted to valid UTF-8).
‘tolower()’, ‘toupper()’ and ‘chartr()’ have more support for inputs with a marked encoding (UTF-8 or Latin-1) in a single-byte locale.
The code for evaluating default (extended) regular expressions now uses the same character-classification functions as the rest of R; in some cases (Windows, AIX, macOS) these replace limited system ones, the differences being in non-Latin characters.
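Two of the entries above rest on Unicode facts that are easy to check: the surrogate range U+D800 to U+DFFF cannot be encoded as UTF-8 on its own, and every valid code point fits in 6 hex digits. The following is a rough sketch in Python, used only to demonstrate those rules; the helper names are hypothetical and none of this reflects R's internal routines.

```python
# Illustration of surrogate handling; helper names are hypothetical.

def is_surrogate(cp: int) -> bool:
    """True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def can_encode_utf8(cp: int) -> bool:
    """True if the code point has a valid UTF-8 encoding."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_utf8(0xD834))                  # False: unpaired surrogate
print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e
print(format(0x10FFFF, "06X"))                  # 10FFFF
```

An unpaired ‘\uD834’ therefore has nothing it can be converted to, whereas the same value followed by a matching low surrogate identifies a single ordinary code point.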

~~The parser now treats ‘\Unnnnnnnn’ escapes larger than the upper limit for Unicode points (‘\U10FFFF’) as an error as they cannot be represented by valid UTF-8.~~

Where such escapes are used for outputting non-printable characters, 6 (not 8) hex digits are used (as it was decided by Unicode that the first two would always be zero).
~~Code converting UTF-8 strings (e.g., ‘tolower()’ and some ‘printing’) now uses internal routines rather than system functions: these detect more uses of invalid UTF-8 strings.~~

There are warnings (including from the parser) on the use of unpaired surrogate Unicode points such as ‘\uD834’ (which cannot be converted to valid UTF-8).