https://github.com/horovod/horovod/blob/master/docs/docker.rst
Prerequisites
1. Passwordless SSH access to all machines
http://www.linuxproblem.org/art_9.html
The master must add its public key to the authorized_keys file on every host.
Example setup using two hosts running Docker:
(Host A / Docker A) ------------------------------------------------> (Host B / Docker B)
(136.18.225.72 / INDFCQ4RG2-l / 1 GPU)    (136.18.225.116 / antpc-MS-7A94 / 2 GPUs)
a) Generate a key pair inside Docker A and place its public key in the authorized_keys of Host A, Host B, and Docker B.
b) Generate a key pair inside Host A and place its public key in the authorized_keys of Docker A, Host B, and Docker B.
c) Generate a key pair inside Docker B and place its public key in the authorized_keys of Host A, Host B, and Docker A.
d) Generate a key pair inside Host B and place its public key in the authorized_keys of Docker B, Host A, and Docker A.
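Steps a) to d) amount to a full mesh: every endpoint's public key ends up in the authorized_keys of the other three. As a rough sketch (the endpoint labels and the function name are illustrative, not part of any tool), the list of copies to perform can be enumerated like this:

```python
from itertools import permutations

# Labels follow the example setup above; they are illustrative only.
ENDPOINTS = ["Host A", "Docker A", "Host B", "Docker B"]


def key_distribution_plan(endpoints):
    """Return (source, destination) pairs.

    Each pair means: the public key generated on `source` must be
    appended to authorized_keys on `destination`.
    """
    return list(permutations(endpoints, 2))


plan = key_distribution_plan(ENDPOINTS)
for source, destination in plan:
    print(f"{source}: public key -> authorized_keys on {destination}")
```

With four endpoints this gives twelve copies, which is a handy checklist when one of the SSH checks later fails.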
2. The connected links must have the same interface name on the primary worker and all secondary workers.
For example, the interface should be named enp0s31f6 on every host involved.
You can use the following commands (add them to .bashrc if you want the rename to be reapplied after each reboot):
sudo /sbin/ip link set <current-name> down
sudo /sbin/ip link set <current-name> name <new-name>
sudo /sbin/ip link set <new-name> up
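The rename sequence is: bring the link down, rename it, bring it back up under the new name. A minimal sketch that only builds the three command lines (it does not run them) is shown here; "eth0" is a hypothetical current name, and the helper name is made up for illustration:

```python
import shlex


def rename_interface_commands(current_name, new_name):
    """Build the three `ip link` commands: down, rename, up."""
    return [
        ["sudo", "/sbin/ip", "link", "set", current_name, "down"],
        ["sudo", "/sbin/ip", "link", "set", current_name, "name", new_name],
        ["sudo", "/sbin/ip", "link", "set", new_name, "up"],
    ]


# "eth0" is a hypothetical current name; enp0s31f6 is the common
# interface name used in this example setup.
commands = rename_interface_commands("eth0", "enp0s31f6")
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Note that the last command uses the new name, since the old one no longer exists after the rename.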
Note: this is not a limitation with the following pull request:
https://github.com/horovod/horovod/issues/1724#issuecomment-603613522
https://github.com/horovod/horovod/pull/1808
3. Add the hostnames to /etc/hosts on all machines and Docker containers:
136.18.225.72 INDFCQ4RG2-l    # primary worker
136.18.225.116 antpc-MS-7A94    # secondary worker (there can be several such machines)
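To check whether a given machine or container already has both entries, something along these lines may help. It is only a sketch: the function name is hypothetical, and it works on the text of a hosts file passed in as a string rather than reading /etc/hosts itself.

```python
def missing_hosts_entries(hosts_text, required):
    """Return the "ip hostname" lines from `required` that hosts_text lacks."""
    present = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            ip = fields[0]
            for name in fields[1:]:
                present.add((ip, name))
    return [f"{ip} {name}" for ip, name in required if (ip, name) not in present]


# Addresses and hostnames are the ones from the example setup.
REQUIRED = [
    ("136.18.225.72", "INDFCQ4RG2-l"),
    ("136.18.225.116", "antpc-MS-7A94"),
]

sample = "127.0.0.1 localhost\n136.18.225.72 INDFCQ4RG2-l  # primary worker\n"
to_add = missing_hosts_entries(sample, REQUIRED)
print(to_add)
```

On a real machine the `sample` string would be replaced by the contents of /etc/hosts.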
Setting up (horovod/horovod 0.19.0-tf2.0.0-torch1.3.0-mxnet1.5.0-py3.6-gpu)
A) Primary worker
1. Run Docker:
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
2. Test the sample app:
horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:2 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py
B) Secondary worker
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
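The horovodrun command line used on the primary worker can be assembled from a host-to-slots mapping. The sketch below only builds the argument list, using the flags that already appear in these notes; the function and parameter names are hypothetical, and the check that the process count does not exceed the total number of slots is an assumption about what the launcher will accept.

```python
def horovodrun_command(slots_per_host, script, num_proc=None,
                       ssh_port=12345, start_timeout=360):
    """Build a horovodrun argument list from a {hostname: slots} mapping."""
    total_slots = sum(slots_per_host.values())
    if num_proc is None:
        num_proc = total_slots
    # Assumption: asking for more processes than slots is an error.
    if num_proc > total_slots:
        raise ValueError("num_proc exceeds the total number of slots")
    hosts = ",".join(f"{host}:{slots}" for host, slots in slots_per_host.items())
    return [
        "horovodrun", "-np", str(num_proc), "-H", hosts,
        f"--start-timeout={start_timeout}", "--verbose",
        "-p", str(ssh_port), "python", script,
    ]


# One slot on each host, as in the sample run in these notes.
argv = horovodrun_command(
    {"INDFCQ4RG2-l": 1, "antpc-MS-7A94": 1}, "tensorflow2_mnist.py"
)
print(" ".join(argv))
```

The port passed with -p has to match the port sshd listens on inside the secondary worker's container (12345 here).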
Sample run
root@INDFCQ4RG2-l:/examples# horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py
Filtering local host names.
Remote host found: antpc-MS-7A94
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovodrun server.
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
('127.0.0.1', 26576)
('136.18.225.116', 55673)
Notified all the hosts that the registration is complete.
Waiting for hosts to perform host-to-host interface checking.
Host-to-host interface checking successful.
Interfaces on all the hosts were successfully checked.
Common interface found: enp0s31f6
Checking whether extension tensorflow was built with MPI.
Extension tensorflow was built with MPI.
mpirun --allow-run-as-root --tag-output -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_include enp0s31f6 -x NCCL_SOCKET_IFNAME=enp0s31f6 -x CUDA_PKG_VERSION -x CUDA_VERSION -x CUDNN_VERSION -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOSTNAME -x LD_LIBRARY_PATH -x LIBRARY_PATH -x LS_COLORS -x MXNET_VERSION -x NCCL_VERSION -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x PATH -x PWD -x PYTHON_VERSION -x PYTORCH_VERSION -x SHLVL -x TENSORFLOW_VERSION -x TERM -x TORCHVISION_VERSION -x _ python tensorflow2_mnist.py
[1,0]:2020-03-20 10:43:54.841973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.911264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.920605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:45:59.921154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,1]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,1]:pciBusID: 0000:02:00.0
[1,1]:2020-03-20 10:45:59.921189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:45:59.922238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:45:59.923157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:45:59.923417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:45:59.924681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:45:59.925597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:45:59.928283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:45:59.930282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
[1,0]:2020-03-20 10:43:54.895746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:54.896396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,0]:name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
[1,0]:pciBusID: 0000:02:00.0
[1,0]:2020-03-20 10:43:54.896450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:54.898509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:54.900181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:54.900609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:54.903080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:54.906525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:54.912235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:54.915803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1731] Ignoring visible gpu device (device: 1, name: Quadro M2000, pci bus id: 0000:02:00.0, compute capability: 5.2) with core count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
[1,0]:2020-03-20 10:43:54.915841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.242715: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[1,1]:2020-03-20 10:46:00.271269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
[1,1]:2020-03-20 10:46:00.272378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440d420 executing computations on platform Host. Devices:
[1,1]:2020-03-20 10:46:00.272396: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
[1,1]:2020-03-20 10:46:00.444966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x446f770 executing computations on platform CUDA. Devices:
[1,1]:2020-03-20 10:46:00.444995: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[1,1]:2020-03-20 10:46:00.445805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:46:00.445842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.445867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:46:00.445880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:46:00.445892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:46:00.445903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:46:00.445915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:46:00.445929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:00.447077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.447111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.504039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]:2020-03-20 10:46:00.504074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
[1,1]:2020-03-20 10:46:00.504080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
[1,1]:2020-03-20 10:46:00.505187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.160047: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1,0]:2020-03-20 10:43:56.198625: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394525000 Hz
[1,0]:2020-03-20 10:43:56.201578: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f2c430 executing computations on platform Host. Devices:
[1,0]:2020-03-20 10:43:56.201618: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
[1,0]:2020-03-20 10:43:56.348911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41fad60 executing computations on platform CUDA. Devices:
[1,0]:2020-03-20 10:43:56.348964: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
[1,0]:2020-03-20 10:43:56.350439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:56.350531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.350591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:56.350629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:56.350664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:56.350695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:56.350732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:56.350769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:56.353401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,0]:2020-03-20 10:43:56.353472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.457968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2020-03-20 10:43:56.458023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
[1,0]:2020-03-20 10:43:56.458032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
[1,0]:2020-03-20 10:43:56.464616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7594 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.468105: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 376320000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.434876: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.605212: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:44:02.771952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:44:03.001307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:10.693376: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 358.89MiB (rounded to 376320000). Current allocation summary follows.
[1,1]:2020-03-20 10:46:10.693435: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693672: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693684: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693699: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 358.89MiB was 256.00MiB, Chunk State:
[1,1]:2020-03-20 10:46:10.693710: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size:
[1,1]:2020-03-20 10:46:10.693721: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 0B
[1,1]:2020-03-20 10:46:10.693733: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 0 memory_limit_: 139591680 available bytes: 139591680 curr_region_allocation_bytes_: 1048576
[1,1]:2020-03-20 10:46:10.693748: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
[1,1]:Limit: 139591680
[1,1]:InUse: 0
[1,1]:MaxInUse: 0
[1,1]:NumAllocs: 0
[1,1]:MaxAllocSize: 0
[1,1]:
[1,1]:2020-03-20 10:46:10.693764: W tensorflow/core/common_runtime/bfc_allocator.cc:424]
....
.....
.....
.....
......
.....
....
....
stdout>:Step #230 Loss: 0.049809
[1,1]:Step #240 Loss: 0.114416
[1,0]:Step #240 Loss: 0.118383
[1,1]:Step #250 Loss: 0.072860
[1,0]:Step #250 Loss: 0.075680
[1,0]:Step #260 Loss: 0.077773
[1,1]:Step #260 Loss: 0.256634
[1,1]:Step #270 Loss: 0.052928
[1,0]:Step #270 Loss: 0.055714
[1,0]:Step #280 Loss: 0.065934
[1,1]:Step #280 Loss: 0.129994
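Because mpirun is started with --tag-output, each line carries a [job,rank] prefix, so the interleaved loss lines can be separated per rank. A small sketch, assuming the lines look like the ones above once each entry is on a single line (the function name and regular expression are illustrative only):

```python
import re
from collections import defaultdict

# Matches lines such as "[1,0]:Step #240 Loss: 0.118383".
STEP_LINE = re.compile(r"^\[(\d+),(\d+)\]:Step\s+#(\d+)\s+Loss:\s+([0-9.]+)$")


def losses_by_rank(lines):
    """Group "Step #N Loss: x" lines by the rank in the [job,rank] tag."""
    result = defaultdict(dict)
    for line in lines:
        match = STEP_LINE.match(line.strip())
        if match:
            rank = int(match.group(2))
            step = int(match.group(3))
            result[rank][step] = float(match.group(4))
    return dict(result)


sample_lines = [
    "stdout>:Step #230 Loss: 0.049809",  # truncated tag, skipped
    "[1,1]:Step #240 Loss: 0.114416",
    "[1,0]:Step #240 Loss: 0.118383",
    "[1,0]:Step #260 Loss: 0.077773",
    "[1,1]:Step #260 Loss: 0.256634",
]
per_rank = losses_by_rank(sample_lines)
for rank in sorted(per_rank):
    print(rank, per_rank[rank])
```

Lines whose tag is missing or truncated are ignored rather than guessed at.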