test/distributed/test_c10d.py::RendezvousEnvTest::test_common_errors is failing sometimes

Error message (from this job):

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1492, in wrapper
    return func(*args, **kwargs)
  File "distributed/test_c10d.py", line 507, in test_common_errors
    next(gen)
AssertionError: ValueError not raised

According to the HUD, this is the timeline:

  • c0adabe test started failing
  • f595ba1 test stopped failing
  • 8c798e0 test started failing again
  • 1fe6a65 test switched from shard 2 to shard 1

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

2 thoughts on “test/distributed/test_c10d.py::RendezvousEnvTest::test_common_errors is failing sometimes”

  1. I can confirm that the following environment variables were set before the test was run:

    BACKEND=gloo
    INIT_METHOD=file:///tmp/tmpu73gqc0r/shared_init_file
    WORLD_SIZE=2
    TEST_REPORT_SOURCE_OVERRIDE=dist-gloo
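
One possible reading, not confirmed in this thread, is that a variable such as WORLD_SIZE being set already could stop a "missing variable" check from raising ValueError. The sketch below reproduces only that mechanism, using the standard library. `rendezvous_from_env` and `check_missing_world_size` are hypothetical stand-ins and are not the actual torch.distributed code or the real test body.

```python
import os
import unittest
from unittest import mock


def rendezvous_from_env():
    """Hypothetical stand-in for an env-based rendezvous that needs WORLD_SIZE."""
    if "WORLD_SIZE" not in os.environ:
        raise ValueError("WORLD_SIZE expected, but not set")
    return int(os.environ["WORLD_SIZE"])


def check_missing_world_size():
    """Return None if ValueError was raised, else the assertion failure message."""
    case = unittest.TestCase()
    try:
        with case.assertRaises(ValueError):
            rendezvous_from_env()
    except AssertionError as exc:
        return str(exc)
    return None


# Clean environment: WORLD_SIZE is absent, so ValueError is raised as expected.
with mock.patch.dict(os.environ, {}, clear=True):
    clean_result = check_missing_world_size()

# Environment where the runner already exported WORLD_SIZE (assumed scenario).
with mock.patch.dict(os.environ, {"WORLD_SIZE": "2"}, clear=True):
    leaked_result = check_missing_world_size()

print(clean_result)
print(leaked_result)
```

If this is what happens, clearing or patching the relevant variables inside the test (for example with `mock.patch.dict`) would make it independent of the runner's environment.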