Wrong gradients when using DistributedDataParallel and autograd.grad.

🐛 Bug

The gradients are not reduced (averaged) across workers when:

  1. gradient penalty is used as a loss,
  2. a bias is used in the last layer.

To Reproduce

import os
import torch
import torch.nn as nn

import torch.distributed as dist
import torch.multiprocessing as mp


NUM_GPUS = 2


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.w = nn.Parameter(torch.rand(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.w * x.pow(2) + self.b * x/x  # !!! the x/x term keeps b in the autograd graph of dy/dx


def worker(rank):

    torch.manual_seed(rank)
    torch.cuda.set_device(rank)
    device = torch.device(rank)
    dist.init_process_group(backend='nccl', world_size=NUM_GPUS, rank=rank)

    def parallelize(module):
        return nn.parallel.DistributedDataParallel(module, device_ids=[rank])

    model = Model().to(device)
    model = parallelize(model)

    # all workers have the same initial model
    w = model.module.w
    b = model.module.b
    print(f'initial weights at {rank}:', w.data, b.data)

    x = torch.randn(3).to(device)
    x.requires_grad = True
    y = model(x)  # shape [3]

    # all workers have different data
    print(f'input data at {rank}:', x)

    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    loss = grad.pow(2).mean(0)  # gradient penalty

    # compare with gradient calculated by hand
    assert torch.isclose(2 * x * w, grad).all()

    model.zero_grad()
    loss.backward()

    # all workers have the same grad
    print(f'final gradient at {rank}:', w.grad, b.grad)

    # compare with gradient calculated by hand
    t = (8 * x.pow(2) * w).mean(0)
    print(f'local gradient at {rank}:', t)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert torch.isclose(t/NUM_GPUS, w.grad).all()


def main():
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    mp.spawn(worker, nprocs=NUM_GPUS, args=())


if __name__ == '__main__':
    main()

The above code works as it should and outputs this:

initial weights at 1: tensor([0.4963], device='cuda:1') tensor([0.], device='cuda:1')
initial weights at 0: tensor([0.4963], device='cuda:0') tensor([0.], device='cuda:0')
input data at 1: tensor([ 0.1994,  2.0394, -0.2148], device='cuda:1', requires_grad=True)
input data at 0: tensor([0.2072, 0.2699, 0.5507], device='cuda:0', requires_grad=True)
final gradient at 1: tensor([3.0861], device='cuda:1') tensor([0.], device='cuda:1')
final gradient at 0: tensor([3.0861], device='cuda:0') tensor([0.], device='cuda:0')
local gradient at 1: tensor(5.6176, device='cuda:1', grad_fn=<MeanBackward1>)
local gradient at 0: tensor(0.5546, device='cuda:0', grad_fn=<MeanBackward1>)

But when I change the model to this:

class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.w = nn.Parameter(torch.rand(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.w * x.pow(2) + self.b  # x/x is removed here!

The output is wrong:

initial weights at 1: tensor([0.4963], device='cuda:1') tensor([0.], device='cuda:1')
initial weights at 0: tensor([0.4963], device='cuda:0') tensor([0.], device='cuda:0')
input data at 1: tensor([ 0.1994,  2.0394, -0.2148], device='cuda:1', requires_grad=True)
input data at 0: tensor([0.2072, 0.2699, 0.5507], device='cuda:0', requires_grad=True)
final gradient at 1: tensor([5.6176], device='cuda:1') None
final gradient at 0: tensor([0.5546], device='cuda:0') None
local gradient at 1: tensor(5.6176, device='cuda:1', grad_fn=<MeanBackward1>)
local gradient at 0: tensor(0.5546, device='cuda:0', grad_fn=<MeanBackward1>)
...
    assert torch.isclose(t/NUM_GPUS, w.grad).all()
AssertionError

Expected behavior

  1. The final gradient must be the same on every worker.
  2. The gradient for b must be zero, not None.
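
Note that the x/x term in the first model is effectively a workaround: it keeps b connected to the autograd graph of the gradient with respect to x, which appears to be why the reduction succeeds in that case. A minimal sketch of the same idea applied to the penalty loss instead of the forward pass (hypothetical, not verified against this setup; `model`, `x`, and `y` as in the repro above):

grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
loss = grad.pow(2).mean(0)

# zero-valued term that touches every parameter; it does not change the
# penalty value, but keeps all parameters in the backward graph so each
# one receives a (zero) gradient that DDP can reduce
loss = loss + sum((p * 0.0).sum() for p in model.parameters())

model.zero_grad()
loss.backward()  # b.grad should now be 0 instead of None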

Environment

PyTorch version: 1.7.0+cu110
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti

Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip] numpy==1.18.2
[pip] torch==1.7.0+cu110
[pip] torchaudio==0.7.0
[pip] torchvision==0.8.1+cu110

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski


1 thought on “Wrong gradients when using DistributedDataParallel and autograd.grad.”

  1. Also note that the DDP doc explicitly states: “This module doesn’t work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).”
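
For completeness, a hedged sketch of one way around that limitation: run the penalty step with DDP's automatic reduction disabled and average the parameter gradients by hand, the same way the repro already all-reduces `t` (hypothetical, assuming `model`, `x`, and `NUM_GPUS` as in the repro above):

with model.no_sync():  # keep DDP's reducer out of this forward/backward pass
    y = model(x)
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    loss = grad.pow(2).mean(0)
    model.zero_grad()
    loss.backward()

# average the gradients manually across workers
for p in model.parameters():
    if p.grad is None:  # e.g. the bias b, which never receives a gradient
        p.grad = torch.zeros_like(p)
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= NUM_GPUS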
