This text is about a new feature in R, staged installation of packages. It may be of interest to package authors and maintainers, and particularly to those who maintain packages that are affected.
The problem
I often have to run checks for all CRAN and BIOC packages to test the impact of my changes to R. This is to find out about my own bugs, but often I also wake up existing bugs in packages or R, or find out that some packages rely on undocumented API or behavior. I run all CRAN/BIOC package tests for the baseline R-devel version, then for my modified version, and then I compare the outcomes, looking for packages that newly fail or newly produce warnings. In each run, I install (the same version of) packages afresh, and, to get that done in a reasonable time, the installation is run in parallel.
During the last months this process has been increasingly complicated by randomly appearing warnings during installation, like:
Warning: S3 methods '[.fun_list', '[.grouped_df', 'all.equal.tbl_df' ... [... truncated]
These warnings appeared for many packages, but not repeatably, so they complicated the analysis of check results. Some of the processing is automated, re-checking packages in base and modified version to reduce the number of differences due to temporary unavailability of remote systems. Initially the install warnings were also accompanied by check warnings like:
Warning in grep(pattern, x, invert = TRUE, value = TRUE, ...) : input string 1 is invalid in this locale
These check warnings turned out to be emitted because of the truncation that sometimes accidentally split multi-byte UTF-8 characters. I fixed the truncation and then found out the original installation warning was actually saying “S3 methods were declared in NAMESPACE but not found”.
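The effect of such a truncation can be reproduced in a few lines of R. This is only an illustrative sketch, not the code R uses to truncate warnings; the example string and the variable names are made up for the illustration.

```r
# A string whose last character takes two bytes in UTF-8.
x <- enc2utf8("caf\u00e9")
bytes <- charToRaw(x)          # 5 bytes: 63 61 66 c3 a9

# Truncating by bytes cuts the two-byte character in half ...
cut_bytes <- rawToChar(bytes[1:4])
validUTF8(cut_bytes)           # FALSE

# ... while truncating by characters keeps the string valid.
cut_chars <- substr(x, 1, 3)
validUTF8(cut_chars)           # TRUE
```

A string like cut_bytes is what later trips functions such as grep() with "input string 1 is invalid in this locale".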
Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from the dplyr and rlang packages. These two packages take very long to install due to C++ code compilation. They also have a lot of reverse dependencies, and so while they are being installed, it is very likely that another package being installed would use them in a partially-installed state, and this is why these warnings were emitted.
I learned that the CRAN team had indeed been affected by this problem for a long time as well, and that, unsurprisingly, they have also seen it caused by other packages that take a long time to install, not just dplyr and rlang.
In principle, this problem does not only happen during parallel installation and does not affect only repository maintainers and R core developers who regularly check all CRAN and/or BIOC packages. The problem is present any time the same R library is used from different R sessions (and in some installations there could be sessions run by different users).
The package installation process has become complicated and can run arbitrary code, even from packages themselves, so the consequences of accessing other packages in inconsistent/partially-installed state are unpredictable and potentially dangerous. The probability of this race condition happening seems to have increased in the last years with wider use of C++ (in patterns that take long to compile), as the problem has not been observed before.
Existing lock directories do not solve the problem
The current implementation of package installation by default backs up the old installation of the package by moving it into a per-library 00LOCK directory (or per-package 00LOCK-pkgname). The installation is performed directly into the final directory pkgname in the library. If it fails, it is by default cleaned up and the old version is moved back; otherwise, if it succeeds, the old version is deleted. If the lock directory already exists when the installation is requested, the installation fails with an error and one typically would delete the directory manually. During parallel install, the per-package locking is used (00LOCK-pkgname).
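As a small illustration of the directory layout described above, the following sketch lists lock directories left behind in a library. The helper name find_lock_dirs is hypothetical (it is not part of R), and the "library" here is just a throw-away temporary directory.

```r
# Hypothetical helper: list lock directories present in a library.
find_lock_dirs <- function(lib) {
  entries <- list.files(lib, pattern = "^00LOCK", full.names = TRUE)
  basename(entries[dir.exists(entries)])
}

# Demonstration on a temporary directory standing in for a library.
lib <- tempfile("library")
dir.create(lib)
dir.create(file.path(lib, "pkgname"))          # an installed package
dir.create(file.path(lib, "00LOCK-pkgname"))   # its per-package lock

find_lock_dirs(lib)   # "00LOCK-pkgname"
```

On a real system one could call the helper on .libPaths()[1] to see whether a failed or interrupted installation left a lock directory behind.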
This locking mechanism works for backing up and recovering previous versions of packages in case of error, but it does not prevent access to partially installed packages. I initially tried to extend it to do so; after all, it would seem natural to make R respect the lock directories and ignore packages that were “locked”, getting a cheap partial solution to the problem. “Partial” because of the obvious race condition: what happens between checking the existence of a lock directory and accessing the package? It turned out to be neither cheap nor easy to implement, and in the end we decided on staged install instead.
The first observation was that one cannot simply hide/ignore the packages for which there is a lock directory; this is not possible because during installation, one needs to be able to see the (partially installed) package. For example, this is while the lazy loading database is being built (so one has to be able to load the namespace), but also when running a custom installation script from the package (install.libs.R). One would have to customize all package access/discovery functions so that they would make the locked package visible just to the R session(s) that were installing the package. Passing function arguments all the way down to the package discovery functions would not be realistic, but in principle this would be possible via environment variables, some of which are already in use.
For a start, I’ve looked at how packages check if another package is installed. This is a surprisingly common task and I found many popular ways (installed.packages(), requireNamespace(), require(), .packages(), system.file(), find.package(), packageVersion()). I may have easily overlooked some cases, as I’ve just grepped the source code of all the packages, and there will most likely be many more types of access to packages than just checking if they are installed. If we failed to handle any of the cases, the resulting race conditions would be extremely hard to debug (not repeatable runs, only showing on some systems, etc). Also, it is not impossible that some tools or packages are looking directly into the library directory to discover packages. Finally, there would be a non-trivial performance overhead in package access functions.
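To make the variety concrete, here is a sketch that asks the same question ("is this package installed?") through several of the functions listed above. It uses the base package stats so that it runs anywhere; the variable names are only for illustration.

```r
# Several common ways to ask whether a package is installed.
pkg <- "stats"

checks <- c(
  requireNamespace   = requireNamespace(pkg, quietly = TRUE),
  system.file        = nzchar(system.file(package = pkg)),
  find.package       = length(find.package(pkg, quiet = TRUE)) > 0,
  installed.packages = pkg %in% rownames(installed.packages()),
  packageVersion     = !inherits(try(packageVersion(pkg), silent = TRUE),
                                 "try-error")
)
print(checks)
```

Each of these goes through a different code path, which is why making all of them respect lock directories would have been invasive.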
Staged installation
Staged installation is hence the implemented solution to the problem. It only works together with the lock directories, which are used by default. A package is first installed into a temporary directory under the lock directory (under 00LOCK or 00LOCK-pkgname). When the package is being installed, this temporary directory is the R library for that R session, so the R session sees the partially installed package using the standard means. Other packages, however, do not see it. After the package is installed (byte-compiled, lazy loading database created, native code compiled and built, test-loaded, etc), it is moved to the final location (pkgname) and becomes visible to other packages. A directory move is a very fast operation within the same filesystem and in POSIX/Unix it is atomic (on Windows it is also fast, but it is not easy to guarantee that it is atomic).
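The stage-then-move idea can be sketched with plain file operations. This is not the code of R CMD INSTALL, only an illustration of the pattern on a temporary directory; the intermediate directory name 00new and the one-line DESCRIPTION are assumptions made for the example.

```r
# Stand-in for an R library.
lib <- tempfile("library")
dir.create(lib)

# 1. Build the package in a staging directory under the lock directory.
stage <- file.path(lib, "00LOCK-pkgname", "00new", "pkgname")
dir.create(stage, recursive = TRUE)
writeLines("Package: pkgname", file.path(stage, "DESCRIPTION"))

# While staging, anyone looking at `lib` sees no "pkgname" yet.
visible_before <- dir.exists(file.path(lib, "pkgname"))

# 2. Publish it with a single directory rename (same filesystem).
final <- file.path(lib, "pkgname")
ok <- file.rename(stage, final)

# 3. Remove the lock directory, which no longer holds the package.
unlink(file.path(lib, "00LOCK-pkgname"), recursive = TRUE)
```

After step 2 the package appears in the library in one step, so other sessions see either no package or the complete one.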
Staged installation thus provides isolation of partially installed packages on the file-system level and all package access APIs or even file-based API usage can stay as they are now. It was clear from the beginning that the problems would, instead, arise from the fact that packages are moved to a different directory after they are installed and the original directory no longer exists.
Packages fail with staged install when they hard-code the temporary installation directory name (save it to some configuration file, keep it in an R object, or save it via the linker to a shared object as an absolute path or linker rpath). Luckily, this is the case with only a small number of packages from CRAN and BIOC and it is relatively easy to find out without spending days of debugging (compared to the debugging that would be needed if package access code had to be updated to respect lock directories).
Paths hard-coded in R code
Packages often need to access files from their own installation directory, which can always be obtained by a system.file(package=) call. Some packages save the directory names obtained by system.file(), but that practice is dangerous with staged install and should be avoided.
With staged install, it may happen that the saving of the directory is executed when the package still runs in the temporary installation directory, typically while the package is being prepared for lazy loading. The preparation for lazy loading involves sourcing all R files of the package, hence also executing all the assignments to global variables.
Therefore, assignments like this (from pd.ecoli) at the top level in an R source file in a package save the temporary installation directory:
globals$DB_PATH <- system.file("extdata", "pd.ecoli.sqlite",
                               package="pd.ecoli")
Sometimes the calls to system.file(package=) are hidden deeper in assignments that are executed when the namespace is loaded for preparation of the lazy loading database, including in assignments setting up S4 classes. I think the best way to fix these patterns is to just always call system.file(), so in this case have a function like the one below, and never save the result in anything that is not an obviously local variable in a function.
getDbPath <- function() system.file("extdata", "pd.ecoli.sqlite",
                                    package="pd.ecoli")
However, even though not ideal, it is also possible to fix such hard-coded paths in the .onLoad package hook (pd.ecoli does already fix them, even before staged install, but only in .onAttach, so one can still access the wrong path):
.onAttach <- function(libname, pkgname) {
  globals$DB_PATH <- system.file("extdata", "pd.ecoli.sqlite",
                                 package="pd.ecoli",
                                 lib.loc=libname)
  ...
The problem with fixing in .onLoad is that the binary image of the package still includes the hard-coded temporary installation directory name, and thus checking tools that look at the files without loading the namespace would report errors (the tool described later in this text, however, loads the namespace so it would see the state after hooks have been executed).
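For comparison with the .onAttach excerpt, a minimal sketch of the same fix done in .onLoad might look as follows. This is not code from pd.ecoli; the globals environment is recreated here only so that the example is self-contained, and the file name is taken from the excerpt above.

```r
# Package-level environment holding mutable state (as in the excerpt above).
globals <- new.env(parent = emptyenv())

# Recompute the path every time the namespace is loaded, so that the value
# saved at installation time is never used.
.onLoad <- function(libname, pkgname) {
  globals$DB_PATH <- system.file("extdata", "pd.ecoli.sqlite",
                                 package = pkgname, lib.loc = libname)
}
```

Because .onLoad runs whenever the namespace is loaded, the path is also correct for code that uses the package without attaching it.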
During staged installation, R checks for hard-coded paths that include the temporary installation directory, and if it finds any, the installation fails with an informative message. This is a conservative approach, because in some cases the hard-coded installation directory would never really be used to access files, but it is a prevention against hard-to-find bugs.
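The idea behind such a check can be illustrated on a single R object. The sketch below is not R's actual implementation: it serializes an object and searches the resulting bytes for a directory name. The helper name contains_path and the example paths are hypothetical.

```r
# Hypothetical check: does the serialized form of `obj` contain `path`?
contains_path <- function(obj, path) {
  bytes <- serialize(obj, connection = NULL)   # uncompressed binary form
  length(grepRaw(path, bytes, fixed = TRUE)) > 0
}

# Made-up staging directory and an environment that saved a path under it.
staging <- "/tmp/lib/00LOCK-pkgname/00new/pkgname"
globals <- new.env(parent = emptyenv())
globals$DB_PATH <- file.path(staging, "extdata", "db.sqlite")

contains_path(globals, staging)                   # TRUE
contains_path(list(a = 1, b = "text"), staging)   # FALSE
```

Such a check is conservative in the same sense as described above: it reports a saved path even if that path is never used to access files.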
The problem of hard-coded paths in R code is a bit more common than that of paths in shared objects, but it still directly affects only a small number of packages from CRAN and BIOC.
Testing packages for staged install
Package authors can test their packages for staged installation by attempting the install using R CMD INSTALL --staged-install with a recent version of R-devel. The checks during the installation should be defensive enough to catch most problems: if staged installation succeeds and the package worked with non-staged installation (to be applied also to package dependencies), it should also work with staged installation. Currently, the only known exception is when a package saves its temporary installation path into an external file, which is not checked automatically. I would be happy for reports about any other issues that are undetected by the checks.
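For those who prefer to drive the test from an R session, the command line can be assembled as in the sketch below. The helper install_args and the tarball name are hypothetical; the actual call is left commented out so that nothing gets installed by accident.

```r
# Hypothetical helper: arguments for `R CMD INSTALL` with or without staging.
install_args <- function(pkg, staged = TRUE) {
  flag <- if (staged) "--staged-install" else "--no-staged-install"
  c("CMD", "INSTALL", flag, shQuote(pkg))
}

# Hypothetical tarball name; uncomment to run the installation with the
# R that is currently running.
# status <- system2(file.path(R.home("bin"), "R"),
#                   install_args("mypkg_1.0.tar.gz"))
install_args("mypkg_1.0.tar.gz")
```

Running it once with staged = TRUE and once with staged = FALSE makes it easy to tell a staged-install problem from an unrelated installation failure.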
My tests on Linux suggest that currently 21 CRAN and 4 BIOC packages fail to install because they have hard-coded temporary installation paths in their R code. 2 CRAN and 2 BIOC packages fail to install because they have hard-coded temporary installation paths in their shared objects. Some packages fail to install because they depend on these: in total, out of CRAN/BIOC, 48 packages failed to install with staged installation, but could be installed with non-staged installation. The CRAN team has been running many more tests on multiple platforms and with multiple C compilers.
The problem of hard-coded paths in shared objects is trivial to diagnose from the installation log/output, which contains the name of the shared object in the error message and typically also the compilation/linking commands used for building the native code of the package (so most of the time one can just search the output for “rpath”). Also, package authors had to specify linking using rpath or an absolute path explicitly, so there needs to be a record of it in the build scripts or make files of the package.
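Searching a saved installation log for “rpath” can be done with grep on the command line, or from R as in this sketch. The helper name rpath_lines and the log content are made up for the illustration; a real log will look different.

```r
# Hypothetical helper: lines of an installation log that mention rpath.
rpath_lines <- function(log_file) {
  grep("rpath", readLines(log_file, warn = FALSE), fixed = TRUE, value = TRUE)
}

# Made-up log content, for illustration only.
log_file <- tempfile(fileext = ".log")
writeLines(c(
  "* installing *source* package 'mypkg' ...",
  "gcc -shared -o mypkg.so mypkg.o -Wl,-rpath,/tmp/lib/00LOCK-mypkg/00new/mypkg/libs",
  "** R"
), log_file)

rpath_lines(log_file)
```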
The problem of hard-coded paths in R code is a bit harder to diagnose. The installation only performs a trivial check to find out that there is a hard-coded path, but finding out where it is takes a bit more time. I’ve written a simple program (sicheck) that finds out what the hard-coded paths are (already knowing the path sometimes helps, when one can search for its suffix in the R package sources). It also tries to find R expressions (object paths) that lead to these hard-coded paths from the environment of the package namespace. The program and results for recent versions of CRAN and BIOC 3.9 packages can be found here.
For example, package franc has these reports:
Package contains these hard-coded paths (sercheck):
CONTAINS: franc/speakers.json
CONTAINS: franc/data.json
Package contains these objects with hard-coded paths (walkcheck):
OBJPATH: as.list(getNamespace("franc"), all.names=TRUE)[["speakers_file"]] franc/speakers.json
SPATH: franc$speakers_file franc/speakers.json
OBJPATH: as.list(getNamespace("franc"), all.names=TRUE)[["datafile"]] franc/data.json
SPATH: franc$datafile franc/data.json
In the above, CONTAINS: franc/speakers.json means that the sicheck tool found a hard-coded path to franc/speakers.json (the output copied to this text excludes the prefix of the full path, including the 00LOCK-franc directory). The name is hard-coded in variable speakers_file of the package namespace (OBJPATH: and SPATH: sections). It is easy to see that this happens because source file speakers.R of the package has this assignment at the top level:
speakers_file <- system.file("speakers.json", package = packageName())
A slightly less trivial example is package zonator. Its report includes:
CONTAINS: zonator/extdata/test_project/zsetup/01/01_out
OBJPATH: as.list(as.list(getNamespace("zonator"), all.names=TRUE)[[".options"]],all.names=TRUE)[["results.dir"]] zonator/extdata/test_project/zsetup/01/01_out
SPATH: zonator$.options$results.dir zonator/extdata/test_project/zsetup/01/01_out
The hard-coded path is extdata/test_project/zsetup/01/01_out. It is being hard-coded in source file options.R of the package, in this top-level command:
assign("results.dir", file.path(.options$setup.dir, "01/01_out"), envir = .options)
I found this line of code first using grep on the sources, looking for 01_out. It is probably always easiest to try this first before trying to interpret more complicated object paths, but it does not help when the hard-coded path does not have a unique suffix, e.g. when it is just the path to the root of the package installation. Then, one needs to analyze the object path. In this example, the object path is still easy to understand. The executable one (OBJPATH) can be executed in R to get the value (excluding the hard-coded path prefix):
> as.list(as.list(getNamespace("zonator"), all.names=TRUE)[[".options"]],all.names=TRUE)[["results.dir"]]
Registered S3 methods overwritten by 'ggplot2':
method from
[.quosures rlang
c.quosures rlang
print.quosures rlang
[1] "zonator/extdata/test_project/zsetup/01/01_out"
SPATH (zonator$.options$results.dir) tries to be more concise, but is not executable. The special elements of these paths are:
$name | named vector element
[i] | unnamed vector element
-A | attributes
-E | environment
@ | S4 data part
Note that currently the tool does not attempt to find the shortest path to the object.
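To give an idea of how such object paths can be produced, here is a much simplified walker. It is not the sicheck implementation: it only descends into lists and environments, ignores attributes and S4 objects, and has no cycle detection. The function name find_paths and the example data are hypothetical.

```r
# Hypothetical walker: report where strings containing `needle` can be reached.
find_paths <- function(x, needle, where) {
  if (is.character(x)) {
    return(if (any(grepl(needle, x, fixed = TRUE))) where else character(0))
  }
  hits <- character(0)
  if (is.environment(x)) {
    for (nm in ls(x, all.names = TRUE)) {
      value <- get(nm, envir = x, inherits = FALSE)
      hits <- c(hits, find_paths(value, needle, paste0(where, "$", nm)))
    }
  } else if (is.list(x)) {
    nms <- names(x)
    for (i in seq_along(x)) {
      named <- !is.null(nms) && nzchar(nms[i])
      step <- if (named) paste0("$", nms[i]) else paste0("[", i, "]")
      hits <- c(hits, find_paths(x[[i]], needle, paste0(where, step)))
    }
  }
  hits
}

# Made-up stand-in for a package namespace with a saved staging path.
opts <- new.env()
assign("results.dir",
       "/tmp/lib/00LOCK-zonator/00new/zonator/extdata/01_out", envir = opts)
ns <- list(.options = opts, other = list("a", 1))

find_paths(ns, "00LOCK-zonator", "zonator")   # "zonator$.options$results.dir"
```

The result uses the same $name and [i] notation as the SPATH lines above.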
Opting out
Staged installation is not currently turned on by default, but the plan is to do so soon. Packages that for some reason could not be fixed for staged installation (or could not be fixed in time) can still be installed after the switch using the current, non-staged, procedure.
Packages can opt out via the StagedInstall field in their DESCRIPTION file.
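As an illustration, the sketch below writes a made-up DESCRIPTION for a hypothetical package that opts out, and reads the field back with read.dcf(). The text above does not spell out the field value; "no" is assumed here, following the usual yes/no convention of logical DESCRIPTION fields described in Writing R Extensions.

```r
# Made-up DESCRIPTION of a hypothetical package opting out of staged install.
desc <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: mypkg",
  "Version: 1.0",
  "StagedInstall: no"   # assumed value, see the note above
), desc)

# read.dcf() returns a character matrix with one column per requested field.
read.dcf(desc, fields = c("Package", "StagedInstall"))
```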
There is no need for packages to opt-in as this is going to be the default.
There are also new options for R CMD INSTALL: --staged-install and --no-staged-install.
Summary
Staged installation is a new feature of R CMD INSTALL in R-devel, which is intended to be turned on by default soon. It isolates packages during installation time so that they are not accidentally accessed by other R sessions, which is key to the correct function of parallel installation, but is relevant to any installation that may use multiple R sessions.
Some packages need to be fixed to work with staged installation and package authors are kindly asked to cooperate with repository maintainers and update their packages promptly. It may not be immediately obvious that the role of the repository maintainers is very important also in the process of enhancing R. Adding a feature to R often puts a significant amount of work on them as they test packages on different platforms, analyze the outputs, and sometimes debug the packages to figure out whom to report the bugs to or to help package maintainers who do not have enough technical skill to do so on their own.
In addition to that “usual” load for repository maintainers, this feature has been implemented in close collaboration with the CRAN team; in particular, Brian Ripley has provided valuable advice, comments and reviews, and found a number of issues by testing.