Daily News about R-devel/NEWS

This blog is updated daily.

A different regular expression engine is used for basic and extended regexps and also for approximate matching. This is based on the TRE library of Ville Laurikari, a modified copy of which is included in the R sources.

This is often faster, especially in a MBCS locale.

Known differences are that it is less tolerant of invalid inputs in MBCS locales, and conforms more strictly to the POSIX standard in its interpretation of incorrect regexps such as "^*".
The use of repeated boundary regexps in gsub() and gregexpr() warned about in the help page does not work in this engine (it did in the previous one since 2005).
Basic and extended regexps now support the same set of options as for fixed = TRUE and perl = TRUE, including 'useBytes' and support for UTF-8-encoded strings in non-UTF-8 locales.
agrep() now has full support for MBCS locales with a modest speed penalty. This enables help.search() to use approximate matching character-wise rather than byte-wise.
[g]sub use a single-pass algorithm instead of matching twice.
The perl = TRUE versions now work correctly in a non-UTF-8 MBCS locale, by translating the inputs to UTF-8.
useBytes = TRUE now inhibits the translation of inputs with marked encodings.
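
As a rough illustration of the options described above, the sketch below searches a UTF-8-encoded string with the default (extended) engine, with fixed = TRUE, with perl = TRUE, and with useBytes = TRUE. The sample string and variable names are made up for this example. The reported positions assume that "é" occupies two bytes in UTF-8.

```r
# Sample text with one non-ASCII character; the \u escape stores it as UTF-8.
x <- "caf\u00e9 au lait"

# Default extended regexp, fixed and perl matching all report the match
# position in characters.
pos_ere   <- as.integer(regexpr("au", x))
pos_fixed <- as.integer(regexpr("au", x, fixed = TRUE))
pos_perl  <- as.integer(regexpr("au", x, perl = TRUE))

# useBytes = TRUE matches byte-wise, so the position is counted in bytes.
pos_bytes <- as.integer(regexpr("au", x, useBytes = TRUE))

print(c(ere = pos_ere, fixed = pos_fixed, perl = pos_perl, bytes = pos_bytes))
```

The character-wise calls should agree with each other, while the byte-wise position is one larger because of the two-byte "é" before the match.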

~~grep() and friends now use useBytes=TRUE to inhibit the translation of inputs with marked encodings.~~
~~The perl=TRUE versions of strsplit(), grep() and friends now work in a non-UTF-8 MBCS locale, by translating the inputs to UTF-8.~~
agrep() now makes use of the TRE library of Ville Laurikari rather than apse, and so has full support for MBCS locales (and enables help.search() to use approximate matching characterwise rather than bytewise).
There is a different regular expression engine for basic and extended regexps, based on the TRE library. This is often faster, especially in a MBCS locale, and it allows us to implement the same set of options as for fixed = TRUE and perl = TRUE, including 'useBytes' and support for UTF-8-encoded strings in all locales.
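
A small sketch of approximate matching with agrep(), as mentioned in the agrep() item above. The first call follows the style of the examples in the agrep() help page. The second uses a made-up pattern containing a non-ASCII character to show matching on a multi-byte string, and it relies on the default max.distance.

```r
# One substitution ("s" for "z") is within the default max.distance.
hit_ascii <- agrep("lasy", "1 lazy 2")

# A pattern with a non-ASCII character, matched against plain ASCII candidates.
candidates <- c("cafe", "tea")
hit_mbcs <- agrep("caf\u00e9", candidates)

print(hit_ascii)
print(candidates[hit_mbcs])
```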

One known difference is that it is less tolerant of invalid inputs in MBCS locales, and conforms more strictly to the POSIX standard in its interpretation of incorrect regexps such as "^*".

[Currently experimental, can be deselected by the configure option --without-TRE.]
"\uxxxx" and "\Uxxxxxxxx" escapes can now be parsed to a UTF-8 encoded string even in non-UTF-8 locales (this has been implemented on Windows since R 2.7.0). The semantics have been changed slightly: a string containing such escapes is always stored in UTF-8 (and hence is suitable for portably including Unicode text in packages). ~~Use of such escape requires MBCS support.~~
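
To illustrate the semantics of the escapes, here is a minimal sketch. The string is an arbitrary example. It checks the declared encoding and compares the length in characters with the length in bytes.

```r
# A string written with a \u escape is stored in UTF-8 and marked as such.
s <- "na\u00efve"

enc     <- Encoding(s)               # declared encoding of the string
n_chars <- nchar(s, type = "chars")  # number of characters
n_bytes <- nchar(s, type = "bytes")  # number of bytes in the UTF-8 form
code    <- utf8ToInt("\u00ef")       # Unicode code point of the escaped character

print(list(encoding = enc, chars = n_chars, bytes = n_bytes, code = code))
```

Because the string carries a UTF-8 mark regardless of the session locale, the same source text should give the same characters on every platform.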