PyTorch autograd sometimes fails with CUDNN_STATUS_INTERNAL_ERROR

🐛 Bug

PyTorch autograd sometimes fails with CUDNN_STATUS_INTERNAL_ERROR.
This happens to me on a specific dataset, but it is reproducible with the following code, independent of my system or data.

To Reproduce

Run the following code to reproduce:

import torch

# Backend settings under which the failure occurs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True

# Minimal Conv3d forward + backward that raises CUDNN_STATUS_INTERNAL_ERROR
data = torch.randn([2, 4, 5, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
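
Note that the system libcudnn files listed in the environment sections below are not necessarily the library PyTorch loads; the conda build string for environment 1 points to a bundled cuDNN 8.0.5. A quick check along these lines (a sketch using only standard torch introspection calls) confirms the versions actually in use:

import torch

# Report the versions the running PyTorch build actually uses; the bundled
# cuDNN can differ from the system-wide libcudnn files listed below.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))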

Expected behavior

The script should run without errors.

Environment

Environment 1 of 2:

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.142
GPU models and configuration: GPU 0: GeForce RTX 2070 with Max-Q Design
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0

Versions of relevant libraries:
[pip3] botorch==0.3.3
[pip3] gpytorch==1.3.1
[pip3] numpy==1.20.0
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchcontrib==0.0.2
[pip3] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] botorch 0.3.3 pypi_0 pypi
[conda] cudatoolkit 11.0.221 h6bb024c_0
[conda] gpytorch 1.3.1 pypi_0 pypi
[conda] libblas 3.9.0 8_mkl conda-forge
[conda] libcblas 3.9.0 8_mkl conda-forge
[conda] liblapack 3.9.0 8_mkl conda-forge
[conda] liblapacke 3.9.0 8_mkl conda-forge
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] numpy 1.20.0 py38h18fd61f_0 conda-forge
[conda] pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.7.2 py38 pytorch
[conda] torchcontrib 0.0.2 pypi_0 pypi
[conda] torchvision 0.2.2 py_3 pytorch

Environment 2 of 2:

PyTorch version: 1.8.0a0+17f8c32
Is debug build: True
CUDA used to build PyTorch: 11.1

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.74
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti

Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.0
[pip3] numpy==1.19.2
[pip3] pytorch-transformers==1.1.0
[pip3] torch==1.8.0a0+17f8c32
[pip3] torchcontrib==0.0.2
[pip3] torchtext==0.8.0a0
[pip3] torchvision==0.8.0a0
[conda] efficientnet-pytorch 0.7.0 pypi_0 pypi
[conda] magma-cuda110 2.5.2 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] nomkl 3.0 0
[conda] numpy 1.19.2 py36h6163131_0
[conda] numpy-base 1.19.2 py36h75fe3a5_0
[conda] pytorch-transformers 1.1.0 pypi_0 pypi
[conda] torch 1.8.0a0+17f8c32 pypi_0 pypi
[conda] torchtext 0.8.0a0 pypi_0 pypi
[conda] torchvision 0.8.0a0 pypi_0 pypi

Additional context

Full trace of the error:

Traceback (most recent call last):
File "/*******/scratch_2.py", line 10, in <module>
out.backward(torch.randn_like(out))
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 227, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 138, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 4, 5, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x5600a29d1c30
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 2, 4, 5, 360, 640,
strideA = 4608000, 1152000, 230400, 640, 1,
output: TensorDescriptor 0x7fe8cc00c690
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 2, 1, 5, 360, 640,
strideA = 1152000, 1152000, 230400, 640, 1,
weight: FilterDescriptor 0x7fe8cc03b1f0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 1, 4, 3, 3, 3,
Pointer addresses:
input: 0x7fe976000000
output: 0x7fe95b4ca000
weight: 0x7fe95aa00600

Process finished with exit code 1

cc @csarofeen @ptrblck @xwang233

1 possible answer(s) on "PyTorch autograd sometimes fails with CUDNN_STATUS_INTERNAL_ERROR"

  1. Thanks for reporting this issue!
    I'm able to reproduce it with cuDNN 8.0.5 on an RTX 2080 Ti, and the issue appears to be fixed in the cuDNN 8.1 release.

    To test it, you could install the 1.8 release candidate via:

    conda install pytorch cudatoolkit=11.2 -c pytorch-test -c conda-forge
    

    which ships with cuDNN 8.1:

    >>> import torch
    >>> torch.__version__
    '1.8.0'
    >>> torch.version.cuda
    '11.2'
    >>> torch.backends.cudnn.version()
    8100
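
If upgrading is not an option right away, one untested workaround sketch is to disable the cuDNN benchmark autotuner (or cuDNN altogether) around the Conv3d backward; whether this sidesteps the INTERNAL_ERROR in this particular case has not been verified:

import torch

# Untested workaround sketch: avoid the cuDNN benchmark autotuner (or bypass
# cuDNN entirely); not verified to avoid the INTERNAL_ERROR reported above.
torch.backends.cudnn.benchmark = False    # skip algorithm benchmarking
# torch.backends.cudnn.enabled = False    # or fall back to the native conv path (slower)

data = torch.randn([2, 4, 5, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=3, padding=1).cuda()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()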