Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 3.16.7
TensorFlow installed from (source or binary): source, hash: b1e174e
TensorFlow version (use command below): v1.1.0-rc2-119-gb1e174e 1.1.0-rc2
Bazel version (if compiling from source): 0.4.5-jdk7
CUDA/cuDNN version: CUDA 8.0, cudnn-v5.1
GPU model and memory: Titan X with 12 GiB
Exact command to reproduce: ./tf_multiGPU.py
tensorflow/benchmark version: source, hash: 9165a70
Describe the problem
Recently, we are testing the TensorFlow scalability on multiGPU machines. We use the official scripts provided in the benchmarks using codes from GitHub repository tensorflow/benchmark. We execute the script according to the official website to test the scalability of TensorFlow on the machine equipped with 8 Titan X GPUs. We test the model VGG16 with batch size equaling to 64.
The results are shown in the following table:
|Num of GPUs||Throughput (images/sec)|
We are surprised to find that the total throughput of 5 GPUs is smaller than 4 GPUs, which means TensorFlow incurs significant overheads when the number of GPU is larger than 4. Because I don’t know whether this performance issue belongs to the tensorflow/tensorflow repository or tensorflow/benchmark repository, I submit this issue here looking forward to the official response. The script for reproducing this issue can be found in Source code/logs section with the results.
The scalability is strongly related to the topology of GPU interconnection. In our machine, we have a tree topology for GPUs. The topology details can be found at the nv-topo-matrix.txt
We believe this is a performance bug since if more GPUs can not achieve higher throughput, at least they should obtain similar throughput.
Source code / logs
Source code: tf_multiGPU.py
The log after executing the script above: https://drive.google.com/file/d/0Bw3_V-EwBVToYXVqZDkzQkN6c3M/view?usp=sharing