Sunday, April 5, 2020

Setting Up Horovod for Distributed Training on 2 Hosts


https://github.com/horovod/horovod/blob/master/docs/docker.rst

Prerequisites

1. Passwordless SSH access to all machines
http://www.linuxproblem.org/art_9.html
The master has to add its public key to the authorized_keys file on every host.

Example: setup using 2 hosts, each running a Docker container

(Host A / Docker A) ------------------------------------------------> (Host B / Docker B)
(136.18.225.72 / INDFCQ4RG2-l / 1 GPU)         (136.18.225.116 / antpc-MS-7A94 / 2 GPU)

a) Generate a public key inside Docker A and place it in the authorized_keys of Host A, Host B and Docker B.
b) Generate a public key inside Host A and place it in the authorized_keys of Docker A, Host B and Docker B.
c) Generate a public key inside Docker B and place it in the authorized_keys of Host A, Host B and Docker A.
d) Generate a public key inside Host B and place it in the authorized_keys of Docker B, Host A and Docker A.
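
Steps a) to d) amount to "every endpoint's public key goes into the authorized_keys of the other three". The sketch below lists those 12 pairs and shows one way to append a key without creating duplicates. The endpoint labels come from the diagram; the function names and the file permissions (0700 for .ssh, 0600 for authorized_keys) are assumptions of this sketch, not something the Horovod docs prescribe.

```python
from itertools import permutations
from pathlib import Path

# Labels taken from the diagram above; they are not real hostnames.
ENDPOINTS = ["Host A", "Docker A", "Host B", "Docker B"]


def key_distribution_plan(endpoints):
    """Return (source, destination) pairs: the source's public key is
    appended to the destination's authorized_keys file."""
    return list(permutations(endpoints, 2))


def add_authorized_key(public_key, authorized_keys):
    """Append public_key to the authorized_keys file unless it is already
    listed. Returns True if the key was added, False if it was present."""
    path = Path(authorized_keys)
    key = public_key.strip()
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    text = path.read_text() if path.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    with path.open("a") as handle:
        # Keep the new key on its own line even if the file lacks a final newline.
        if text and not text.endswith("\n"):
            handle.write("\n")
        handle.write(key + "\n")
    path.chmod(0o600)
    return True


if __name__ == "__main__":
    for source, destination in key_distribution_plan(ENDPOINTS):
        print(f"{source} public key -> {destination} authorized_keys")
```

Run on each endpoint, add_authorized_key would be called once per key received from the other three endpoints.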


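2. The connected links must have the same interface name on the primary worker and all secondary workers,
e.g. every interface should be named enp0s31f6 on all hosts involved.
You can rename an interface with the following commands (put them in bashrc if you want the name retained after each reboot), where CURRENT_NAME is the interface's present name and NEW_NAME is the common name:
sudo /sbin/ip link set CURRENT_NAME down
sudo /sbin/ip link set CURRENT_NAME name NEW_NAME
sudo /sbin/ip link set NEW_NAME up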
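
The rename sequence needs the interface name in each command: the link is taken down under its current name, renamed, then brought up under the new name. A small sketch that builds the three command lines is shown here; the function name and the example name eno1 are hypothetical, and only enp0s31f6 comes from this setup.

```python
import shlex
from typing import List


def rename_interface_commands(current_name: str, new_name: str) -> List[str]:
    """Build the three `ip link` commands that rename a network interface.

    The strings are only generated, never executed, so they can be
    reviewed before being pasted into a shell or a startup script.
    """
    steps = [
        ["sudo", "/sbin/ip", "link", "set", current_name, "down"],
        ["sudo", "/sbin/ip", "link", "set", current_name, "name", new_name],
        ["sudo", "/sbin/ip", "link", "set", new_name, "up"],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # "eno1" is a made-up current name used for illustration only.
    for command in rename_interface_commands("eno1", "enp0s31f6"):
        print(command)
```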

Note: this is no longer a limitation with the following pull request:
https://github.com/horovod/horovod/issues/1724#issuecomment-603613522
https://github.com/horovod/horovod/pull/1808

3. Add the hostnames to /etc/hosts on all machines and Docker containers
136.18.225.72 INDFCQ4RG2-l (primary worker)
136.18.225.116 antpc-MS-7A94 (secondary worker; there can be several of these)
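
Before launching anything, it can help to confirm that each machine and container really has both entries. The following is a rough sketch of such a check; the helper names are made up for this post, and it only looks at the text of a hosts file, not at DNS.

```python
REQUIRED_HOSTS = {
    "INDFCQ4RG2-l": "136.18.225.72",    # primary worker
    "antpc-MS-7A94": "136.18.225.116",  # secondary worker
}


def parse_hosts(text):
    """Map every hostname in hosts-file text to the first IP listed for it."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


def missing_hosts(text, required):
    """Return the required hostnames that are absent or mapped to another IP."""
    mapping = parse_hosts(text)
    return [name for name, ip in required.items() if mapping.get(name) != ip]


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        print("Missing or wrong entries:", missing_hosts(handle.read(), REQUIRED_HOSTS))
```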



Setting up


(Docker image: horovod/horovod, tag 0.19.0-tf2.0.0-torch1.3.0-mxnet1.5.0-py3.6-gpu)

A) Primary Worker

1. Run Docker
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest


2. Test the sample app (-np is the total number of processes, -H lists each host with its slot count)
horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:2 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py
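
The -H argument and the -np count are easy to get out of step when hosts are added. As a sketch, the command line can be assembled from a host-to-slots mapping so that -np defaults to the total slot count. The function name is hypothetical, and the check that -np must not exceed the total slots is a conservative assumption of this sketch, not a rule taken from Horovod.

```python
import shlex


def horovodrun_command(hosts, script, np=None, ssh_port=12345, start_timeout=360):
    """Assemble the horovodrun argument list used in this post.

    hosts maps hostname -> number of slots (typically one per GPU).
    np defaults to the sum of all slots.
    """
    total_slots = sum(hosts.values())
    if np is None:
        np = total_slots
    if np > total_slots:
        # Assumption of this sketch: refuse to request more processes than slots.
        raise ValueError(f"-np {np} exceeds the {total_slots} available slots")
    host_arg = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "horovodrun", "-np", str(np), "-H", host_arg,
        f"--start-timeout={start_timeout}", "--verbose",
        "-p", str(ssh_port), "python", script,
    ]


if __name__ == "__main__":
    argv = horovodrun_command(
        {"INDFCQ4RG2-l": 1, "antpc-MS-7A94": 1}, "tensorflow2_mnist.py"
    )
    print(shlex.join(argv))
```

With one slot per host this prints the same command as the sample run further down.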

B) Secondary Worker

Run Docker with sshd listening on port 12345 (the port passed to horovodrun with -p):

nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
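
Before starting horovodrun on the primary worker, it is worth checking that the secondary worker's sshd is actually reachable on port 12345. A minimal sketch using only the standard library follows; the function name is made up, and a successful TCP connection only shows that something is listening, not that key-based login works.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    print("sshd reachable:", port_is_open("antpc-MS-7A94", 12345))
```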


Sample run

root@INDFCQ4RG2-l:/examples# horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py
Filtering local host names.
Remote host found: antpc-MS-7A94
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovodrun server.
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
('127.0.0.1', 26576)
('136.18.225.116', 55673)
Notified all the hosts that the registration is complete.
Waiting for hosts to perform host-to-host interface checking.
Host-to-host interface checking successful.
Interfaces on all the hosts were successfully checked.
Common interface found: enp0s31f6
Checking whether extension tensorflow was built with MPI.
Extension tensorflow was built with MPI.
mpirun --allow-run-as-root --tag-output -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_include enp0s31f6 -x NCCL_SOCKET_IFNAME=enp0s31f6  -x CUDA_PKG_VERSION -x CUDA_VERSION -x CUDNN_VERSION -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOSTNAME -x LD_LIBRARY_PATH -x LIBRARY_PATH -x LS_COLORS -x MXNET_VERSION -x NCCL_VERSION -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x PATH -x PWD -x PYTHON_VERSION -x PYTORCH_VERSION -x SHLVL -x TENSORFLOW_VERSION -x TERM -x TORCHVISION_VERSION -x _  python tensorflow2_mnist.py
[1,0]:2020-03-20 10:43:54.841973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.911264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.920605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:45:59.921154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,1]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,1]:pciBusID: 0000:02:00.0
[1,1]:2020-03-20 10:45:59.921189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:45:59.922238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:45:59.923157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:45:59.923417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:45:59.924681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:45:59.925597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:45:59.928283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:45:59.930282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
[1,0]:2020-03-20 10:43:54.895746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:54.896396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,0]:name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
[1,0]:pciBusID: 0000:02:00.0
[1,0]:2020-03-20 10:43:54.896450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:54.898509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:54.900181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:54.900609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:54.903080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:54.906525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:54.912235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:54.915803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1731] Ignoring visible gpu device (device: 1, name: Quadro M2000, pci bus id: 0000:02:00.0, compute capability: 5.2) with core count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
[1,0]:2020-03-20 10:43:54.915841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.242715: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[1,1]:2020-03-20 10:46:00.271269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
[1,1]:2020-03-20 10:46:00.272378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440d420 executing computations on platform Host. Devices:
[1,1]:2020-03-20 10:46:00.272396: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[1,1]:2020-03-20 10:46:00.444966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x446f770 executing computations on platform CUDA. Devices:
[1,1]:2020-03-20 10:46:00.444995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[1,1]:2020-03-20 10:46:00.445805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:46:00.445842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.445867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:46:00.445880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:46:00.445892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:46:00.445903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:46:00.445915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:46:00.445929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:00.447077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.447111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.504039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]:2020-03-20 10:46:00.504074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
[1,1]:2020-03-20 10:46:00.504080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
[1,1]:2020-03-20 10:46:00.505187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.160047: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1,0]:2020-03-20 10:43:56.198625: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394525000 Hz
[1,0]:2020-03-20 10:43:56.201578: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f2c430 executing computations on platform Host. Devices:
[1,0]:2020-03-20 10:43:56.201618: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[1,0]:2020-03-20 10:43:56.348911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41fad60 executing computations on platform CUDA. Devices:
[1,0]:2020-03-20 10:43:56.348964: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
[1,0]:2020-03-20 10:43:56.350439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:56.350531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.350591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:56.350629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:56.350664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:56.350695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:56.350732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:56.350769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:56.353401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,0]:2020-03-20 10:43:56.353472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.457968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2020-03-20 10:43:56.458023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
[1,0]:2020-03-20 10:43:56.458032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
[1,0]:2020-03-20 10:43:56.464616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7594 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.468105: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 376320000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.434876: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.605212: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:44:02.771952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:44:03.001307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:10.693376: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 358.89MiB (rounded to 376320000).  Current allocation summary follows.
[1,1]:2020-03-20 10:46:10.693435: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693672: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693684: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693699: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 358.89MiB was 256.00MiB, Chunk State:
[1,1]:2020-03-20 10:46:10.693710: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:
[1,1]:2020-03-20 10:46:10.693721: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 0B
[1,1]:2020-03-20 10:46:10.693733: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 0 memory_limit_: 139591680 available bytes: 139591680 curr_region_allocation_bytes_: 1048576
[1,1]:2020-03-20 10:46:10.693748: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
[1,1]:Limit:                   139591680
[1,1]:InUse:                           0
[1,1]:MaxInUse:                        0
[1,1]:NumAllocs:                       0
[1,1]:MaxAllocSize:                    0
[1,1]:
[1,1]:2020-03-20 10:46:10.693764: W tensorflow/core/common_runtime/bfc_allocator.cc:424]
[... output truncated ...]
stdout>:Step #230    Loss: 0.049809
[1,1]:Step #240    Loss: 0.114416
[1,0]:Step #240    Loss: 0.118383
[1,1]:Step #250    Loss: 0.072860
[1,0]:Step #250    Loss: 0.075680
[1,0]:Step #260    Loss: 0.077773
[1,1]:Step #260    Loss: 0.256634
[1,1]:Step #270    Loss: 0.052928
[1,0]:Step #270    Loss: 0.055714
[1,0]:Step #280    Loss: 0.065934
[1,1]:Step #280    Loss: 0.129994
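
Because mpirun is run with --tag-output, every line carries a [job,rank] prefix, which makes it simple to separate the loss values per rank. The sketch below is one way to do that for a saved copy of the output; the function name is hypothetical, and lines without a complete [job,rank] prefix are skipped.

```python
import re
from collections import defaultdict

# Matches lines such as "[1,0]:Step #240    Loss: 0.118383".
# An optional "<stdout>" style stream tag after the bracket is tolerated.
STEP_RE = re.compile(r"^\[(\d+),(\d+)\](?:<\w+>)?:Step #(\d+)\s+Loss:\s+([0-9.]+)")


def losses_by_rank(lines):
    """Return {rank: {step: loss}} for every tagged 'Step #N Loss: X' line."""
    result = defaultdict(dict)
    for line in lines:
        match = STEP_RE.match(line.strip())
        if match:
            rank = int(match.group(2))
            step = int(match.group(3))
            result[rank][step] = float(match.group(4))
    return dict(result)


if __name__ == "__main__":
    sample = [
        "[1,1]:Step #240    Loss: 0.114416",
        "[1,0]:Step #240    Loss: 0.118383",
    ]
    for rank, steps in sorted(losses_by_rank(sample).items()):
        print(rank, steps)
```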


