Tomas Kalibera: GFortran Issues with LAPACK



Recent version of the GNU Fortran compiler (7, 8, 9) include optimizations that break interoperability between C and Fortran code with BLAS/LAPACK. The compiled code of BLAS/LAPACK corrupts stack, often resulting in crashes. This impacts R, R packages directly calling into BLAS/LAPACK, and all other applications of BLAS/LAPACK. The work-around is to compile BLAS/LAPACK with -fno-optimize-sibling-calls. This option is now used in R-Devel and R-Patched, so that the reference BLAS and LAPACK included in R are compiled with that option.

However, often other BLAS/LAPACK are used and they need to be compiled with that option as well. GFortran developers are looking into this and hopefully it will be addressed soon (GCC PR#90329), currently it seems that the tail-call optimization will be avoided but only in situations when it causes these problems. Unfortunately, GFortran releases with this issue started appearing in recent Linux distributions.

The problem was detected initially by Brian Ripley and confirmed by Kurt Hornik (RCore/CRAN): recent, then unreleased versions of GFortran 8 and 9 caused about 20 CRAN packages to fail, often with error codes from BLAS/LAPACK. I’ve debugged the issue on one of those packages, BDgraph. It turned out that the interoperability interface between C/Fortran used for decades by BLAS/LAPACK and their applications no longer works with GFortran. This interoperability interface depends on Fortran compiler behavior not guaranteed by the standard. Fortran 2003 provides ways to create a portable interface, yet the standard is newer than the code of this key software.

Assembly corner

When I started debugging, I first saw crashes in GOMP (OpenMP). I’ve recompiled all of R and the required packages with OpenMP disabled and luckily the bug did not go away. This example

bdgraph.sim( n = 70, p = 5, size = 7, vis = TRUE )

resulted in error

BLAS/LAPACK routine 'DPOTRS' gave error code -1

while it worked with older GFortran. I found the error was issued because UPLO in LAPACK function DPOTRS has been 0. So I added a watchpoint on UPLO and got to the line

CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

I had to look at the assembly to get why, gdb gave me the instruction (movq $0x1,0x70(%rsp)) and I have read the full luckily short disassembly of DPOSV. This is the interesting section:

        CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )
  1174d4:       48 8b 04 24             mov    (%rsp),%rax <======= rax holds LDB
  1174d8:       4c 89 7c 24 68          mov    %r15,0x68(%rsp) <=== save INFO to output param
  1174dd:       49 89 d8                mov    %rbx,%r8 <========== pass LDA as LDA
  1174e0:       4c 89 e1                mov    %r12,%rcx <========= pass A as A
  1174e3:       4c 8b 4c 24 08          mov    0x8(%rsp),%r9 <===== pass B as B
  1174e8:       4c 89 ea                mov    %r13,%rdx <========= pass NRHS as NRHS
  1174eb:       48 89 ee                mov    %rbp,%rsi <========= pass N as N
  1174ee:       4c 89 f7                mov    %r14,%rdi <========= pass UPLO as UPLO
  1174f1:       48 c7 44 24 70 01 00    movq   $0x1,0x70(%rsp) <=== pass 1 hidden arg on stack
  1174f8:       00 00 
  1174fa:       48 89 44 24 60          mov    %rax,0x60(%rsp) <=== pass LDB as LDB (stack)
      END
  1174ff:       48 83 c4 28             add    $0x28,%rsp <== remove 5 vars from stack
  117503:       5b                      pop    %rbx
  117504:       5d                      pop    %rbp
  117505:       41 5c                   pop    %r12
  117507:       41 5d                   pop    %r13
  117509:       41 5e                   pop    %r14
  11750b:       41 5f                   pop    %r15
         CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )
  11750d:       e9 de 56 ef ff          jmpq   cbf0 <dpotrs_@plt> <=== tail call to dpotrs

The instruction at 1174f1 is the move that corrupts UPLO (uplo) in the stack frame of the caller of DPOSV, in a function from the BDgraph package

void inverse( double A[], double A_inv[], int *p )
{
        int info, dim = *p;
        char uplo = 'U';

        // creating an identity matrix
        #pragma omp parallel for
        for( int i = 0; i < dim; i++ )
                for( int j = 0; j < dim; j++ )
                        A_inv[ j * dim + i ] = ( i == j );
  
        // LAPACK function: computes solution to A * X = B, where A is symmetric positive definite ma
        F77_NAME(dposv)( &uplo, &dim, &dim, A, &dim, A_inv, &dim, &info );
}

In fact the move corrupts also other bytes on the stack, incidentally the byte held by uplo gets one of the 7 zero bytes from the move. The compiler writes 1 to the stack, because it wants to pass it as a hidden argument to DPOTRS:

SUBROUTINE DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

DPOTRS takes 8 arguments above plus the one hidden argument, which is the length of UPLO, which is 1. The generated code, instead of calling into DPOTRS, modifies the call arguments given already to DPOSV and jumps to DPOTRS. This is called tail-call optimization/elimination (in this case sibling-call optimization) and it saves the stack depth in recursive programs, which could allow to run deep-recursive code with limited stack, and it could in principle improve performance. This 1 as hidden argument for DPOTRS should already have been passed to DPOSV, but it wasn’t, hence there was no space for it on the stack, hence the stack corruption.

The crash did not happen with an older version of GFortran, which did not tail-call optimize the call to DPOTRS, instead it would allocate more space on the stack for the call arguments and normally call DPOTRS.

Strings in Fortran and LAPACK

Unlike in C, strings in Fortran are not zero-terminated. It is clear that for a CHARACTER*(*) parameter of a subroutine, Fortran needs to communicate the length to the called function somehow. In some systems, the length is part of the memory layout (e.g. Pascal), but in Fortran it is typically passed automatically as a hidden argument to the function at the end of the argument list. The argument is invisible in the Fortran program. These hidden length arguments are currently used in GFortran, ifort, flang and probably other compilers. They are, however, compiler-specific and for instance their type changes between GFortran 7 and GFortran8 (int to size_t, on my system this means also 32 bit to 64 bit). This results in problems for interoperability between C and Fortran (and also between different Fortran compilers), and it has been known for many years, and BLAS/LAPACK/CBLAS/LAPACKE avoid this problem by not passing such strings between languages.

Subroutine parameters of CHARACTER/CHARACTER*(1), that is strings of length 1, are, however used widely in BLAS and LAPACK. Sometimes the actual argument can be longer, e.g. “Upper” for “U”, but it is only the first character that is used as a parameter. The actual length of the string arguments is never needed and never accessed. This is why the interface has been working (and is still, with other compilers and older versions of GFortran, and with -fno-optimize-sibling-calls): the hidden length arguments were last on the argument list, so with common calling conventions that allow variadic arguments they did not cause trouble when present neither when missing. The only situation when they are now required is the tail-call optimizations with recent GFortran.

LAPACK C/Fortran interface

The practice adopted by BLAS/LAPACK is that when calling from C, the string length arguments (always 1) are not passed. The same interface is then used for calls from Fortran and calls from C, and even for calls between different Fortran compilers (maybe it did not work always, Writing R Extensions reports that sometimes the same compiler may be needed to build BLAS/LAPACK and R). Applications using BLAS/LAPACK are linked against those libraries via that interface, and often it is dynamically. R is no exception: it can be built with the included reference BLAS/LAPACK, but also it can use an external implementation, and it is expected that the choice will be made at dynamic linking time: R can be and often is built so that when a different BLAS/LAPACK implementation is installed as a system package (e.g. in a Linux distribution), it will be used by R (see R Installation and Administration for more details).

As part of reference BLAS and LAPACK, Netlib provides also CBLAS, a C interface to BLAS, and LAPACKE, a C interface to LAPACK. These are wrappers and take slightly different arguments from the underlying Fortran code. However, they call from C to Fortran following the same practice: they do not pass string lengths. Also, applications typically call into BLAS/LAPACK directly, not via CBLAS/LAPACKE, but they can be hardly blamed for following the same convention as CBLAS/LAPACKE itself, and using those interfaces as they are implemented now would not solve the problem.

R does not use CBLAS/LAPACKE. BLAS is used directly from the R runtime and base packages and LAPACK is used (only) from the stats package out of the base packages. R packages include native code which interacts directly with BLAS and LAPACK. If the solution to this problem in the end will be a change of the interface, R will follow suite and it means that all R packages using BLAS/LAPACK directly will have to be updated.

The standard

It seems that the standard does not guarantee that the practice used by BLAS/LAPACK has to be supported by a Fortran compiler. To get a C interface with that guarantee, one can use BIND(C) and declare the length-1 string parameter as CHARACTER (c_char). The book “Numerical Computing with Modern Fortran” by Richard J. Hanson and Tim Hopkins has an example which uses these features to tell the Fortran compiler to generate a function C_dgemm that can be called from C using the natural interface that does not require the 1 as string length (DGEMM is from BLAS). This wrapper function C_dgemm calls into Fortran function dgemm_ compiled by the same Fortran compiler, but from the original Netlib code (the example is available here). With this approach, one gets an extra function call in the chain, but that should be negligible overhead in most cases.

In principle, Netlib and then all implementors of optimized BLAS/LAPACK could switch to such an interface, and then all applications including R and R packages. If it was possible without changing anything but the name of the functions, that would be great (and with the 1s it is the case, C_dgemm does not require the length). Calls from Fortran to Fortran (bypassing those C interfaces) would remain in the zone unsupported by standards as now.

This solution could not be implemented on the user side (say in R/R packages), because the C interface needs to be generated by the same Fortran compiler as the library, so one would not be able to switch between BLAS/LAPACK implementations at dynamic linking time.

Passing the lengths

One can also pass the 1s as string lengths when calling the BLAS/LAPACK Fortran code from C, even though that this is compiler specific and prone to compiler changes as well as the current practice of not passing the lengths at all (e.g. the type changed between GFortran 7 and GFortran 8).

If this was employed on the user side (GNU Octave does that via macros), there would again be the problem with switching between BLAS/LAPACK implementations.

Recent GFortran 9 (trunk) adds an option -fc-prototypes-external to generate C prototypes for Fortran functions, that is including all hidden parameters with their correct size (option documentation, gfortran convention specifics) This is useful, but indeed compiler-dependent, so could not be really used in user code interfacing with BLAS/LAPACK.

Detecting the problem

The example from BDgraph can be extracted into a standalone example to demonstrate the problem outside of any R code:

#include <stdio.h>

void dposv_(const char* uplo, const int* n, const int* nrhs,
            double* a, const int* lda,
            double* b, const int* ldb, int* info);

int main(int argc, char **argv) {
  char UPLO = 'U';  
  int N = 1;  
  int NRHS = 1;
  double A[1] = {2};
  int LDA = 1;
  double B[1] = {10};
  int LDB = 1;
  int INFO;

  dposv_(&UPLO, &N, &NRHS, A, &LDA, B, &LDB, &INFO);
  printf("Result: %.0f Info: %d UPLO: %c\n", B[0], INFO, UPLO);
  return 0;
}

It can be linked against the build of the latest reference LAPACK, compiled using the default options for GFortran (`-O2 -frecursive’). I built the example as

gfortran -O2 -g -o dposv dposv.c liblapack.a librefblas.a

The output should be

Result: 5 Info: 0

but with the GFortran 7, 8, 9 on Ubuntu 19.04, I get

 ** On entry to DPOTRS parameter number  1 had an illegal value

NOTE: this should not be used to check whether the installed compiler has the problem. The problem may be present, but this program may pass because some other part of the stack gets corrupted. I’ve seen this happen on Debian and Fedora.

A more reliable way (but yet with the limitation it is just for this one LAPACK call) is that one can look at the disassembly (objdump -dS) and search for dposv, whether it has the problematic tail-call shown above. Also, one can use GFortran option -fdump-tree-all to get outputs from different compiler passes and then look say into dlapack.f.231t.optimized (-fdump-tree-optimized can be used to get just that), search for dposv and then check if there is (note the [tail call] at the end):

dpotrs (uplo_18(D), n_22(D), nrhs_23(D), a_29(D), lda_14(D), b_31(D), ldb_15(D), info_16(D), 1); [tail call]

The current scope of the problem

Based on analyzing the GFortran 9 output of compilation of the reference LAPACK included in R, it seems that 38 LAPACK functions may be affected: dlarrc dpotrs dtrtrs dgetrs dlarzb dlatzm dlarfx ilaenv dtrtri dsygst dpotrf dlauum dpbtrf dlalsd dbdsdc dpbsv dpftrs dposv dpotri dppsv dpstrf dsbgv dspsv dsytri2 zdrscl zgesv zgetrf zgetrs zlaed0 zlahr2 zlalsd zlarfx zlatdf zlauum zpotrf zpotri ztrtri ztrtrs. This is a list of functions that take a character argument and tail-call into another function. I checked by parsing the output from -fdump-tree-all, but I did not check whether they all actually re-use the hidden length argument, and whether that argument is on the stack, which is platform-dependent. The number of R packages that call into some of these functions is over a hundred, even though only about ~20 were found to actually fail their tests.

The problematic GFortran releases are already appearing in Linux distributions.

In Ubuntu 19.04, all of GFortran 7,8,9 are already with this problem (7.4.0-8ubuntu1, 8.3.0-6ubuntu1, 9-20190402-1ubuntu1). GFortran 6 is not affected and works fine. GFortran 8 is the default. In Debian Sid, GFortran 7 and GFortran 8 already have the problem (‘Debian 8.3.0-7’, ‘Debian 7.4.0-9’). In Debian Buster, GFortran 8 (‘Debian 8.3.0-6’) already has the problem. In Debian Stretch, GFortran 6 (‘Debian 6.3.0-18+deb9u1’) is ok. In Fedora 30, GFortran 9 (‘Red Hat 9.1.1-1’) already has the problem. I’ve only been looking at x86_64.

That the distributions already contain GFortran versions with this problem does not mean that the binary packages for LAPACK/BLACK in those distributions have the problem, and I have not found any binary packages with that problem, but I’ve just been looking for dposv, which may not be enough as illustrated below, and also this can change in case binaries are re-built.

For example in Ubuntu 19.04, liblapack-dev 3.8.0-2, libopenblas-dev 0.3.5+ds-2 and libatlas3-base 3.10.3-8 are stil ok (and the same packages in Debian Sid). However, when re-built using the default compiler (apt-get build-dep, apt-get source, dpkg-buildpackage), they already surely or most likely have the problem. liblapack-dev gets the bad dposv just when re-built. OpenBLAS and ATLAS use liblapack-pic for the LAPACK code, so the problem only appears after re-building liblapack-pic, installing it, and then re-building libopenblas-dev or libatlas-base-dev. With OpenBLAS, one then gets the bad dposv. With ATLAS, one does not, but it is a coincidence because ATLAS re-implements selected LAPACK functions, including dposv, which is then compiled differently (and I did not examine its disassembly in detail), but other functions are taken from LAPACK (via liblapack-pic) as well.

ATLAS internally uses wrappers to call from C to Fortran correctly including the lengths, but only has them for several routines that the author needed, and it does not impact the external interface to LAPACK. Note ATLAS does this for Fortran sources it builds, and does not substitute at linking time, so the Fortran compiler is known and fixed when the C wrappers are built.

What next

GFortran developers are actively working on resolving this issue, so that BLAS/LAPACK still works with GFortran, and more updates are available on the bug report discussion. After all, a Fortran compiler that would not work with BLAS/LAPACK (abstracting from whose fault it is) would be of little use.

For the time being, everyone should use -fno-optimize-sibling-calls with GFortran version 7 and newer. This is already the default in R-Devel and R-Patched, but unfortunately it has to be added explicitly when building released versions of R with GFortran version 7 and newer (Debian Sid already has R 3.6.0-2 which uses this option). Though, this setting only fixes the reference BLAS/LAPACK included in R.

With -fno-optimize-sibling-calls, all of the new package failures due to the compiler change found by the CRAN team started to work again (this means, they’ve been checked with R, the packages, and most importantly the BLAS/LAPACK recompiled with -fno-optimize-sibling-calls). To be save, I think it would make sense to just compile all Fortran code with this option, and it is essential that this option is used for all BLAS/LAPACK implementations. Clearly there is a lot of work there for the Linux distributions to ensure that this option is used when building binary packages.

I think that the interface to BLAS/LAPACK needs to be fixed/complemented upstream, in the reference implementation, then in optimized implementations, and then in all applications (unless the authors find a way for the interface to stay as now, but be guaranteed to work by the standard). Trying to fix this individually in R packages using different ways would have the problems I mentioned above, and might contribute to chaos.

R packages written in a mix of C and Fortran, however, should not depend on the non-standardized interoperability behavior of the Fortran compilers (Writing R Extensions discourages the use of strings, even of length one), except when calling to BLAS/LAPACK.

Problems with incompatibility between C and Fortran calls (but also between C declarations and definitions in different source files) can sometimes be detected by the compiler/linker with LTO (Link-time Optimization). With GCC it is -Wlto-type-mismatch warning. They have to be expected in calls to BLAS/LAPACK due to the missing hidden string length argument, but all other cases should be fixed, if possible, and package authors are encouraged to try it out (see --enable-lto configure option of R). Note that when building the reference LAPACK distribution with LTO, one gets these warnings for CBLAS and LAPACKE (and that has been the case even before this GFortran change after which these incompatibilities are no longer harmless). We are also looking at these warnings issued for R itself.

For the curious, interesting pointers and more up-to-date information can be found in the discussion of the GFortran developers GCC PR#90329, and their issue opened against reference lapack.