UNPROTECT_PTR is dangerous and should not be used. This text explains why and describes how to replace it, including the mset-based functions that have been introduced as a substitute for situations where unprotection by value is really needed. This could be of interest to anyone who writes native code to interface with the R heap, and definitely to all who use UNPROTECT_PTR in their code.
R provides several functions to protect pointers to R objects held by local C variables (typed
SEXP) from the garbage collector. As documented in Writing R Extensions, there are two structures to hold protected pointers: the pointer protection stack and the precious list.
The pointer protection stack is accessed using PROTECT/UNPROTECT. Pointers are unprotected by being removed from the top of the stack. One can also use
PROTECT_WITH_INDEX and then
REPROTECT to replace a pointer defined by its position in the stack, which simplifies and speeds up code that repeatedly updates local variables holding pointers (in such scenarios, one could in principle still use a sequence of PROTECT/UNPROTECT operations instead). The pointer protection stack needs to be managed in line with the C call stack: after returning from a function, the stack depth should be the same as when the function was called (pointer protection balance). These and other rules are described in Writing R Extensions and The danger of PROTECT errors; they are relatively easy to follow and check, both visually and by tools (also, pointer protection balance is checked to some level at runtime). The stack-based protection and unprotection are fast, do not require additional allocation, and are automatically handled during R errors (long jumps): a long jump recovers the previous stack depth, unprotecting the values that have been left on the stack by the code executed after the jump was set but before the jump was executed.
Although such situations are very rare, achieving pointer protection balance is sometimes difficult. For example, package code may wish to keep some allocated space without returning a pointer to it (hence without making the caller protect it), as when we have global variables pointing to the R heap and for some reason cannot turn them into locals. This is addressed by the precious list, which is accessed using R_PreserveObject/R_ReleaseObject. It is implemented as a linked list (and yes, R_PreserveObject allocates!) and objects are unprotected by value. There is no automated unprotection on error; the user is always responsible for unprotecting objects stored on the precious list. To achieve that in case of R errors (long jumps) or in callbacks (e.g. unloading of a package), it may be necessary to allocate a dummy object, set up its finalizer, and let the finalizer release the needed objects from the precious list. R_PreserveObject/R_ReleaseObject are also much slower than PROTECT/UNPROTECT.
The API was still not sufficient for very special applications, applications which used generated code that allocated memory from the R heap, such as the R parser generated by
bison. The parser code uses a stack of semantic values, which are pointers to objects on the R heap. Values are pushed on the stack by the tokenizer during shift operations, are both pushed and removed during actions of reduce operations, and are removed on some parse errors. R errors (long jumps) can also occur during parsing. The stack is local to a parsing function. The key problem is that the code of the parser is generated and
bison cannot be customized enough to ensure insertion of PROTECT/UNPROTECT operations. It would be natural to allocate the semantic values stack on the R heap, protect it, and protect semantic values when held in local variables but not yet on the semantic values stack, all using PROTECT/UNPROTECT. But this is not possible. In principle, R_PreserveObject/R_ReleaseObject could be used, but one would have to handle the errors and, most importantly, the performance overhead would not be acceptable.
To work around this problem,
UNPROTECT_PTR has been introduced. It allows a relatively fast unprotect-by-value operation for semantic values protected in the pointer protection stack. When new semantic values are created, they are immediately put on the protection stack using PROTECT by the tokenizer and reduce rules. The values are unprotected by UNPROTECT_PTR inside the reduce rules, and the pointer protection stack depth is restored after certain parse errors that did not cause a long jump (one can also define a destructor in bison for some tokens and make it call UNPROTECT_PTR, as done in the Rd parser in package tools). UNPROTECT_PTR removes the first occurrence of the pointer (starting at stack top) and squeezes the stack, reducing the stack depth. Using UNPROTECT_PTR this way causes pointer protection imbalance by design (the tokenizer and reduce rules are implemented in different functions), which increases the cognitive complexity of the code. It is, however, faster than the precious list and uses less memory; when used carefully, it works with R long jumps (automated unprotection); and it may well be that there is no better way to do protection in the parser than unprotect-by-value (if we don’t modify the generated parser code). It has been used for many years in the parser and, unfortunately, started to be used also outside the parser, where it is not necessary.
It has been known and documented that combining UNPROTECT_PTR with PROTECT_WITH_INDEX is dangerous, because by removing a certain object from the stack by UNPROTECT_PTR and squeezing the stack, the protect index may become invalid/unexpected (object locations on the stack change).
REPROTECT would then replace the wrong pointer, resulting in a memory leak (the object intended for unprotection stays protected) and, worse, premature unprotection (
REPROTECT would replace an object that still was to be protected). Code which uses
UNPROTECT_PTR is also rather hard to read.
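The interaction with PROTECT_WITH_INDEX can be reproduced in a small model. The sketch below assumes a hypothetical array-based stack (all ix_ names are illustrative, not R's implementation) in which ix_unprotect_ptr removes the first occurrence found from the top and squeezes the stack, as described above.

```c
#include <string.h>

/* Toy model of the pointer protection stack (hypothetical names). */
#define IX_STACK_MAX 16
static const void *ix_stack[IX_STACK_MAX];
static int ix_top = 0;

/* Models PROTECT_WITH_INDEX: push and return the slot used. */
int ix_protect_with_index(const void *p) {
    ix_stack[ix_top] = p;
    return ix_top++;
}

/* Models REPROTECT: overwrite the slot recorded earlier. */
void ix_reprotect(const void *p, int idx) { ix_stack[idx] = p; }

/* Models UNPROTECT_PTR: remove the first occurrence of p, searching
   from the top, and squeeze the stack. A pointer that is not found
   is simply ignored in this model. */
void ix_unprotect_ptr(const void *p) {
    for (int i = ix_top - 1; i >= 0; i--) {
        if (ix_stack[i] == p) {
            memmove(&ix_stack[i], &ix_stack[i + 1],
                    (size_t)(ix_top - 1 - i) * sizeof ix_stack[0]);
            ix_top--;
            return;
        }
    }
}

/* Returns 1 if p is currently somewhere on the stack. */
int ix_is_protected(const void *p) {
    for (int i = 0; i < ix_top; i++)
        if (ix_stack[i] == p) return 1;
    return 0;
}

/* Four distinct dummy objects. */
int ix_tmp, ix_old, ix_other, ix_new;

/* Replays the scenario and leaves the stack in its final state. */
void ix_scenario(void) {
    ix_top = 0;
    ix_protect_with_index(&ix_tmp);             /* slot 0 */
    int idx = ix_protect_with_index(&ix_old);   /* slot 1, index kept */
    ix_protect_with_index(&ix_other);           /* slot 2 */
    ix_unprotect_ptr(&ix_tmp);                  /* squeeze: old -> 0, other -> 1 */
    ix_reprotect(&ix_new, idx);                 /* overwrites "other", not "old" */
}
```

After ix_scenario(), ix_old is still on the stack although it was meant to be replaced, and ix_other is no longer protected although nothing asked for that.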
UNPROTECT_PTR is dangerous
While working on some improvements of the parser I realized that
UNPROTECT_PTR is unsafe also in combination with
PROTECT/UNPROTECT. The problem occurs when the same pointer is stored multiple times on the protection stack. One can accidentally use
UNPROTECT_PTR to unprotect the unintended instance of the object, an instance that was intended to be unprotected by
UNPROTECT, instead. At the point of
UNPROTECT_PTR, nothing bad yet happens, but, when one later gets to the
UNPROTECT, the wrong object gets unprotected, resulting in a premature unprotection (protect bug). Unfortunately, it is quite common, particularly in the parser, for the same pointer to be protected multiple times.
To illustrate this, imagine this sequence of pointers on the stack, listed from bottom to top (3 is protected last, A and A' are the same pointer, A is intended for unprotection by value):
    1 A 2 A' 3
after UNPROTECT_PTR(A), we get
    1 A 2 3
instead of what the enclosing code expected:
    1 2 A' 3
The depth is ok. Say the code later calls UNPROTECT(1) intending to unprotect 3 and actually doing so, so still ok. But then it calls UNPROTECT(1) intending to unprotect A', but instead unprotecting 2. As a result, A' (A) will still be kept alive (a memory leak, possibly temporary, so not that bad), but 2 will be prematurely unprotected, causing a protect bug (and one that may be very hard to debug).
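The sequence described above can be replayed in a small model. The layout used here (bottom to top: 1, A, 2, A', 3) is an assumption chosen to be consistent with the description, the dup_ names are hypothetical, and A and A' are represented by the same address.

```c
/* Toy model of the pointer protection stack (hypothetical names). */
#define DUP_STACK_MAX 16
static const void *dup_stack[DUP_STACK_MAX];
static int dup_top = 0;

void dup_protect(const void *p) { dup_stack[dup_top++] = p; }  /* PROTECT */
void dup_unprotect(int n) { dup_top -= n; }                     /* UNPROTECT */

/* Models UNPROTECT_PTR: drop the topmost occurrence of p and squeeze. */
void dup_unprotect_ptr(const void *p) {
    int i = dup_top - 1;
    while (i >= 0 && dup_stack[i] != p) i--;
    if (i < 0) return;                  /* not found: ignored in this model */
    for (; i < dup_top - 1; i++) dup_stack[i] = dup_stack[i + 1];
    dup_top--;
}

/* Returns 1 if p is currently somewhere on the stack. */
int dup_is_protected(const void *p) {
    for (int i = 0; i < dup_top; i++)
        if (dup_stack[i] == p) return 1;
    return 0;
}

/* Dummy objects; A' is the same address as A, i.e. &dup_a. */
int dup_one, dup_a, dup_two, dup_three;

void dup_scenario(void) {
    dup_top = 0;
    dup_protect(&dup_one);
    dup_protect(&dup_a);        /* A: meant for unprotection by value */
    dup_protect(&dup_two);
    dup_protect(&dup_a);        /* A': meant for a later UNPROTECT(1) */
    dup_protect(&dup_three);
    dup_unprotect_ptr(&dup_a);  /* removes A', leaving 1 A 2 3 */
    dup_unprotect(1);           /* pops 3, as intended */
    dup_unprotect(1);           /* meant to pop A', pops 2 instead */
}
```

At the end of dup_scenario(), the pointer A is still on the stack while 2 is not, which is exactly the leak plus premature unprotection described in the text.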
R_NilValue and symbols do not need to be protected at all, but in practice they are, and sometimes it makes the code more readable when the distinction is not made. Moreover, any function returning a pointer may sometimes return a fresh pointer and sometimes a pointer that already exists (including in the parser, where some list-manipulating functions work(ed) that way). So, this seems to be a real danger. Also, using UNPROTECT_PTR the way it is used in the parser makes verification of other, purely stack-based PROTECT/UNPROTECT operations harder, both manually and by tools, because it is not made explicit which pointers were intended to be unprotected by UNPROTECT_PTR and which by UNPROTECT.
Phasing out UNPROTECT_PTR
I have thus removed the use of
UNPROTECT_PTR from all R base code. It was relatively easy in the few cases where it was used outside the parser: I have just rewritten the code using stack-based protection functions. I think in all cases this actually simplified the code.
For use in the parsers (the R parser and the two parsers from package
tools), I’ve introduced an API for value-based unprotection outside the pointer protection stack. These functions use a precious multi-set to protect these objects; the multi-set is allocated on the R heap and needs to be protected by the caller (e.g. using PROTECT). Consequently, it is automatically unprotected on a long jump, and hence all pointers protected in the mset get indirectly unprotected as well. The current implementation uses a (vector-) list instead of a pair-list, so it is also faster than R_PreserveObject/R_ReleaseObject, but this is just an implementation detail that can change, and the unprotection could certainly be made faster if it turns out to be a bottleneck in practice. The main benefit is that these functions use a separate structure for unprotection by value, not polluting the pointer protection stack.
SEXP R_NewPreciousMSet(int initialSize);
void R_PreserveInMSet(SEXP x, SEXP mset);
void R_ReleaseFromMSet(SEXP x, SEXP mset);
void R_ReleaseMSet(SEXP mset, int keepSize);
To use this API, one first needs to create a new mset using R_NewPreciousMSet and PROTECT it. The mset is expanded automatically as needed (R_PreserveInMSet may allocate). Objects are released by value via R_ReleaseFromMSet using the same (naive) algorithm as was used in UNPROTECT_PTR, so there should be no performance hit (in principle, the operations could be faster as they do not have to deal with objects intended for stack-based protection). One does not have to release objects explicitly: they will all be released when the mset is garbage collected (e.g. after a long jump that unprotects the mset). For performance reasons, one may however use R_ReleaseMSet to clear the mset but keep it allocated, if the allocated size is not bigger than a given number of elements (this can be used e.g. on errors that are not implemented as long jumps). As with anything in the R-devel code base, the API is still subject to change.
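The behaviour of such a multi-set can be pictured with a plain C model. This is a sketch under stated assumptions, not the R implementation: the ms_ names are hypothetical, storage comes from malloc rather than the R heap, and the model leaves out the part where the real mset is itself an R object that the caller must PROTECT. The comments name the R function that each call is meant to mimic.

```c
#include <stdlib.h>

/* Toy multiset of pointers, kept apart from any protection stack. */
typedef struct {
    const void **items;
    int size;        /* number of stored pointers */
    int capacity;    /* allocated slots */
} ms_set;

/* Mimics R_NewPreciousMSet(initialSize). Returns 0 on success. */
int ms_init(ms_set *s, int initial_size) {
    if (initial_size < 1) initial_size = 1;
    s->items = malloc((size_t)initial_size * sizeof *s->items);
    s->size = 0;
    s->capacity = s->items ? initial_size : 0;
    return s->items ? 0 : -1;
}

/* Mimics R_PreserveInMSet(x, mset): add one occurrence, growing the
   storage when it is full (so this call may allocate). */
int ms_preserve(ms_set *s, const void *x) {
    if (s->size == s->capacity) {
        int new_cap = s->capacity ? s->capacity * 2 : 4;
        const void **grown = realloc(s->items, (size_t)new_cap * sizeof *grown);
        if (!grown) return -1;
        s->items = grown;
        s->capacity = new_cap;
    }
    s->items[s->size++] = x;
    return 0;
}

/* Mimics R_ReleaseFromMSet(x, mset): remove one occurrence, searching
   from the most recently added element. */
void ms_release(ms_set *s, const void *x) {
    for (int i = s->size - 1; i >= 0; i--) {
        if (s->items[i] == x) {
            for (; i < s->size - 1; i++) s->items[i] = s->items[i + 1];
            s->size--;
            return;
        }
    }
}

/* Number of occurrences of x currently held. */
int ms_count(const ms_set *s, const void *x) {
    int n = 0;
    for (int i = 0; i < s->size; i++)
        if (s->items[i] == x) n++;
    return n;
}

/* Mimics R_ReleaseMSet(mset, keepSize): drop all elements and keep the
   storage only if it is not larger than keep_size. */
void ms_release_all(ms_set *s, int keep_size) {
    s->size = 0;
    if (s->capacity > keep_size) {
        free(s->items);
        s->items = NULL;
        s->capacity = 0;
    }
}

/* Preserves the same pointer twice, releases it once, and returns the
   number of occurrences left (or -1 if allocation failed). */
int ms_demo(void) {
    static int obj;
    ms_set s;
    if (ms_init(&s, 1) != 0) return -1;      /* size 1 forces one expansion */
    if (ms_preserve(&s, &obj) != 0 || ms_preserve(&s, &obj) != 0) {
        ms_release_all(&s, 0);
        return -1;
    }
    ms_release(&s, &obj);                    /* removes exactly one occurrence */
    int left = ms_count(&s, &obj);
    ms_release_all(&s, 0);                   /* capacity exceeds 0: storage freed */
    return left;
}
```

Because the set is a structure of its own, releasing a pointer from it cannot disturb pointers that other code has protected on the stack, which is the main benefit named above.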
Converting code from UNPROTECT_PTR to the new API is harder than it may first seem, as one has to identify the PROTECT operations that are intended for unprotection by value (and rewrite the code when some code paths unprotect the same “variable” in one way and other code paths in another).
Choosing the right API
I think for memory protection one should always use PROTECT/UNPROTECT, possibly with PROTECT_WITH_INDEX/REPROTECT in performance-critical code.
R_PreserveObject/R_ReleaseObject help if we have global variables holding on to R memory, but global variables should be avoided anyway, also for other reasons, so this should be very rare. Also, arranging for unprotection on error is a bit tedious.
R_PreserveInMSet/R_ReleaseFromMSet should be used only in bison/yacc parsers, and UNPROTECT_PTR should be phased out from all code.