CUDA out of memory in subprocesses spawned by unit tests in Windows

🐛 Bug

In PR #41377 (extra details about this issue can be found there if needed), I added a new test to TestTorchDeviceTypeCUDA in test_torch.py which spawns several new processes using subprocess.Popen(). These subprocesses all create new CUDA tensors and run operations on them. The test works fine for all CI jobs that run it, except for pytorch_windows_vs2019_py36_cuda10.1_test2, in which the subprocesses throw a CUDA out of memory error when they try to create CUDA tensors. For now, I am disabling my test under Windows, and it should be re-enabled if this issue is fixed.
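For context, the failing pattern boils down to something like the following sketch (illustrative only: the child script, the number of subprocesses, and the tensor sizes are placeholders, not the exact test body from PR #41377):

    import subprocess
    import sys

    # Each child process creates CUDA tensors and runs an operation on them;
    # this is the step that hits "CUDA out of memory" on the Windows CI job.
    child_script = (
        "import torch; "
        "a = torch.randn(128, 128, device='cuda'); "
        "b = torch.randn(128, 128, device='cuda'); "
        "print((a @ b).sum().item())"
    )

    processes = [
        subprocess.Popen([sys.executable, "-c", child_script],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        for _ in range(4)
    ]
    for p in processes:
        out, err = p.communicate()
        assert p.returncode == 0, err.decode()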

To Reproduce

Steps to reproduce the behavior:

  1. Check out my branch from PR #41377. Or, if that PR has been merged already, just use the most recent main branch.
  2. Remove the @unittest.skipIf(IS_WINDOWS, ...) decorator from test_cublas_config_deterministic_error in test/test_torch.py.
  3. Run either the whole test suite or just python test/test_torch.py TestTorchDeviceTypeCUDA inside the pytorch_windows_vs2019_py36_cuda10.1_test2 CI job.

Running only the affected test by itself with python test/test_torch.py TestTorchDeviceTypeCUDA.test_cublas_config_deterministic_error_cuda does not reproduce the error; the entire TestTorchDeviceTypeCUDA class or the entire test suite must be run. I believe the reason is that other unit tests in TestTorchDeviceTypeCUDA reserve CUDA memory and never release it. I printed out torch.cuda.memory_summary() right before spawning the subprocesses, and it showed this:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   11573 MB |   13202 MB |   13202 MB |
|       from large pool |       0 B  |   11573 MB |   12740 MB |   12740 MB |
|       from small pool |       0 B  |       4 MB |     462 MB |     462 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   11573 MB |   13202 MB |   13202 MB |
|       from large pool |       0 B  |   11573 MB |   12740 MB |   12740 MB |
|       from small pool |       0 B  |       4 MB |     462 MB |     462 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   11890 MB |   11890 MB |   11890 MB |       0 B  |
|       from large pool |   11884 MB |   11884 MB |   11884 MB |       0 B  |
|       from small pool |       6 MB |       6 MB |       6 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |   53913 KB |    7541 MB |    7541 MB |
|       from large pool |       0 B  |   51869 KB |     831 MB |     831 MB |
|       from small pool |       0 B  |    3134 KB |    6710 MB |    6710 MB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |      27    |   86688    |   86688    |
|       from large pool |       0    |      10    |     127    |     127    |
|       from small pool |       0    |      27    |   86561    |   86561    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |      27    |   86688    |   86688    |
|       from large pool |       0    |      10    |     127    |     127    |
|       from small pool |       0    |      27    |   86561    |   86561    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      17    |      17    |      17    |       0    |
|       from large pool |      14    |      14    |      14    |       0    |
|       from small pool |       3    |       3    |       3    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       9    |   38427    |   38427    |
|       from large pool |       0    |       8    |      85    |      85    |
|       from small pool |       0    |       7    |   38342    |   38342    |
|===========================================================================|

Whereas running only the one test by itself showed this:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |     512 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |     512 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |     512 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |     512 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    2047 KB |    4095 KB |    4095 KB |
|       from large pool |       0 B  |       0 KB |       0 KB |       0 KB |
|       from small pool |       0 B  |    2047 KB |    4095 KB |    4095 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       1    |       2    |       2    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       1    |       2    |       2    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       1    |       2    |       2    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       1    |       2    |       2    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       1    |       1    |       1    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       1    |       2    |       2    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       1    |       2    |       2    |
|===========================================================================|

The big difference I see between those two outputs is that the first one shows almost 12 GB of CUDA memory reserved before the subprocesses are spawned, while the second one shows only 2 MB. I get very similar measurements when I run the test in these two ways on my Linux work machine; on Linux, however, the test completes successfully in both cases.
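For a quicker check than the full summary, the reserved and allocated figures can also be read directly from the allocator counters right before the subprocesses are spawned (the MB conversion below is just for readability):

    import torch

    # memory_allocated() is what live tensors currently use; memory_reserved()
    # is what the parent process is holding from the driver and does not give
    # back between tests.
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MB")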

I found that I could make this test run before all other tests in TestTorchDeviceTypeCUDA by renaming it to TestTorchDeviceTypeCUDA.test_000_cublas_config_deterministic_error_cuda, and with that ordering the test passes. The likely reason is that the other TestTorchDeviceTypeCUDA tests don't get a chance to reserve all the available memory before the subprocesses need it. (I chose not to use this workaround because it felt a little too hacky.)

Why does the failure only happen on Windows though? I have no idea.

I also originally tried using torch.multiprocessing instead of subprocess, but I ran into seemingly the same issue with that approach.
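For reference, the torch.multiprocessing variant looked roughly like this (a sketch only; the worker body and process count are illustrative, not the exact code I tried):

    import torch
    import torch.multiprocessing as mp

    def worker(rank):
        # Each child allocates its own CUDA tensor; the out-of-memory error
        # was raised here as well once the parent had reserved most of the GPU.
        x = torch.randn(128, 128, device='cuda')
        print(rank, x.sum().item())

    if __name__ == '__main__':
        ctx = mp.get_context('spawn')  # CUDA requires the 'spawn' start method
        procs = [ctx.Process(target=worker, args=(i,)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()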

Expected behavior

The test should pass on Windows without the CUDA out of memory error.

Environment

CircleCI job pytorch_windows_vs2019_py36_cuda10.1_test2

Additional context

A possible way to fix this issue, if the cause really is the roughly 12 GB of reserved CUDA memory, could be to record torch.cuda.memory_reserved() before and after each unit test in TestTorchDeviceTypeCUDA runs. This would help find and fix the unit tests that contribute most to the reserved-memory problem.
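As a rough illustration of that idea (a hypothetical sketch only; the mixin name and reporting are mine, and a real change would hook into the device-type test machinery rather than a plain unittest mixin):

    import unittest
    import torch

    class ReservedMemoryCheckMixin(unittest.TestCase):
        # Hypothetical helper: record the caching allocator's reserved bytes
        # around each test so that tests which grow the reserved pool can be
        # identified and fixed.
        def setUp(self):
            super().setUp()
            self._reserved_before = torch.cuda.memory_reserved()

        def tearDown(self):
            delta = torch.cuda.memory_reserved() - self._reserved_before
            if delta > 0:
                print(f"{self.id()} grew reserved CUDA memory by {delta / 2**20:.1f} MB")
            super().tearDown()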

cc @ngimel @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @smartcat2010 @mszhanyi

1 possible answer

  1. This can be fixed by adding poll() after communicate(), possibly because poll() takes care of reaping the finished child process and recycling its resources.
    However, with the latest code after #42627 this test passes on Windows anyway, because communicate() has been replaced by check_output(), so the subprocesses now execute serially.
    The test was re-enabled on Windows by #42796. A sketch of both patterns follows below.
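For clarity, the two patterns mentioned in the answer look roughly like this (a sketch; the child command is illustrative):

    import subprocess
    import sys

    cmd = [sys.executable, "-c", "import torch; torch.randn(8, device='cuda')"]

    # Workaround from the answer: call poll() after communicate() so the
    # finished child is reaped before anything else runs.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    p.poll()

    # Pattern used after #42627: check_output() waits for each child to exit,
    # so the subprocesses run one at a time instead of concurrently.
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)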