Hey guys,
I can compile the cpu version of tensorflow 1.2.0-rc2 without any problem, however, i am blocked when i try to compile the gpu version. I spend almost the whole day to install tensorflow 1.2.0-rc2 gpu version in our cluster, however there is no lucky.
Here is the error from the terminal:
ERROR: /home/hpc-xin/.cache/bazel/_bazel_hpc-xin/c5e302e32d7acb9c3e4fce8142d1cd66/external/nccl_archive/BUILD:33:1: undeclared inclusion(s) in rule ‘@nccl_archive//:nccl’:
this rule is missing dependency declarations for the following files included by ‘external/nccl_archive/src/libwrap.cu.cc’:
‘/gpfs/gpfs1/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include-fixed/limits.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include-fixed/syslimits.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/stddef.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/new’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/c++config.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/os_defines.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/cpu_defines.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/exception’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/bits/atomic_lockfree_defines.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/bits/exception_ptr.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/bits/exception_defines.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/bits/nested_exception.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/stdarg.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/cmath’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/bits/cpp_type_traits.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/ext/type_traits.h’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/cstdlib’
‘/gpfs/gpfs1/apps2/gcc/5.4.0/include/c++/5.4.0/cstdio’.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use –verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 22.226s, Critical Path: 5.20s
OS Information
Red Hat Enterprise Linux Workstation release 6.7 (Santiago)
Modules I loaded
gcc/5.4.0, cuda/8.0.61, cudnn/6.0, java/1.8.0_31, sqlite/3.18.0, tcl/8.6.6.8606, python/3.6.1
Here is how i make the configuration
gcc (v5.4.0): /apps2/gcc/5.4.0 (customized path)
enabled cuda (v8.0): /apps2/cuda/8.0.61 (customized path)
enabled cudnn: /apps2/cudnn/6.0 (customized path)
python (v3.6.1, which is compiled with gcc 5.4.0): /apps2/python/3.6.1 (customized path)
Other options are disabled (such as hadoop, google cloud).
Here are some files i modified manually before compiling
(1) /tensorflow-1.2.0-rc2/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
/tensorflow-1.2.0-rc2/third_party/gpus/crosstool/CROSSTOOL_nvcc.tpl
Add the following contents in all of “toolchain”:
cxx_builtin_include_directory: “/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include”
cxx_builtin_include_directory: “/apps2/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include-fixed”
cxx_builtin_include_directory: “/apps2/gcc/5.4.0/include/c++/5.4.0”
cxx_builtin_include_directory: “/apps2/gcc/5.4.0/include”
cxx_builtin_include_directory: “/apps2/cuda/8.0.61/include”
cxx_builtin_include_directory: “/apps2/cudnn/6.0/include”
(2) /home/hpc-xin/.cache/bazel/_bazel_hpc-xin/c5e302e32d7acb9c3e4fce8142d1cd66/external/protobuf/protobuf.bzl
Add the following content in “ctx.action”
env=ctx.configuration.default_shell_env
(3) /home/hpc-xin/.cache/bazel/_bazel_hpc-xin/c5e302e32d7acb9c3e4fce8142d1cd66/external/nccl_archive/Makefile
Some modification as below:
CUDA_HOME ?= /usr/local/cuda –> CUDA_HOME ?= /apps2/cuda/8.0.61
I noticed that some of people said this issue could be bypassed by delete the nccl dependence in “/tensorflow-1.2.0-rc2/tensorflow/contrib/BUILD”. However, i do not think this is the correct way to solve this issue. I wanna keep this dependence.
Thanks so much for your help. Any comments are appreciated.
Best Regards
Xin
I had this exact problem on CentOS 7.1, with gcc 4.8.3, bazel 0.4.5, CUDA 8.0, cuDNN 5.1. CUDA and cuDNN are installed in the default path (
/usr/local/cuda
).The error messages are as follows:
Adding
to
seems fixed the problem.
@lxwgcool I noticed that the file path in your errors starts with
/gpfs/gpfs1/
, while your added path inCROSSTOOL_nvcc.tpl
starts with/apps2
, could this be the problem?you have to run bazel with this option: –cxxopt=”-I/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/*.h”