Wednesday, August 7, 2024

Setting Up Horovod for Distributed Training on 2 Hosts

 


https://github.com/horovod/horovod/blob/master/docs/docker.rst

 

Prerequisites

1. Passwordless SSH access to all machines

http://www.linuxproblem.org/art_9.html

The master has to add its public key to the authorized_keys file of every host.

 

E.g., a setup using 2 hosts, each running a Docker container

 

(Host A/DOCKER A) ------------------------------------------------> (Host B/DOCKER B)

(136.18.225.72/INDFCQ4RG2-l/1GPU)         (136.18.225.116 antpc-MS-7A94 /2 GPU)

 

a) Generate a public key inside Docker A and place it in the authorized_keys of Host A, Host B and Docker B.

b) Generate a public key inside Host A and place it in the authorized_keys of Docker A, Host B and Docker B.

c) Generate a public key inside Docker B and place it in the authorized_keys of Host A, Host B and Docker A.

d) Generate a public key inside Host B and place it in the authorized_keys of Docker B, Host A and Docker A.
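
Steps a) to d) amount to one rule: every endpoint installs its public key on the other three. The sketch below enumerates those pairs as a checklist. The endpoint labels are only illustrative and mirror the diagram above; the script does not run ssh itself.

```python
from itertools import permutations

# Illustrative labels for the four endpoints in the diagram above.
ENDPOINTS = ["Host A", "Docker A", "Host B", "Docker B"]


def key_distribution(endpoints):
    """Return (source, destination) pairs.

    For each pair, the public key generated on `source` has to be appended
    to ~/.ssh/authorized_keys on `destination`.
    """
    return list(permutations(endpoints, 2))


if __name__ == "__main__":
    for src, dst in key_distribution(ENDPOINTS):
        print(f"public key of {src} -> authorized_keys on {dst}")
```

With four endpoints this gives 12 pairs, which is why the manual procedure is easy to get wrong and worth checking off one by one.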

 

 

2. The connected links must have the same interface name on the primary worker and all secondary workers

E.g., the interface should be enp0s31f6 on every host involved.

You can use the following commands (put them in .bashrc if you want them applied after each reboot):

sudo /sbin/ip link set <current_name> down

sudo /sbin/ip link set <current_name> name <new_name>

sudo /sbin/ip link set <new_name> up

 

Note: this is no longer a limitation with the following pull request:

https://github.com/horovod/horovod/issues/1724#issuecomment-603613522

https://github.com/horovod/horovod/pull/1808
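
To see why the names must match, here is a simplified model of a "common interface" check. This is not Horovod's actual implementation, and the interface lists in the usage example are hypothetical. Interface names are intersected across hosts, ignoring loopback.

```python
def common_interfaces(interfaces_by_host, ignore=("lo",)):
    """Return the sorted interface names present on every host.

    `interfaces_by_host` maps a hostname to its list of interface names.
    Names listed in `ignore` (loopback by default) are dropped first.
    """
    sets = [set(names) - set(ignore) for names in interfaces_by_host.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


if __name__ == "__main__":
    # Hypothetical interface lists, for illustration only.
    hosts = {
        "INDFCQ4RG2-l": ["lo", "enp0s31f6"],
        "antpc-MS-7A94": ["lo", "enp0s31f6", "wlp2s0"],
    }
    print(common_interfaces(hosts))
```

If one host still used a different name for the same link, the intersection would be empty, which corresponds to the failure the renaming commands above are meant to avoid.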

 

3. Add the hostnames to /etc/hosts on all machines and Docker containers

136.18.225.72 INDFCQ4RG2-l (primary worker)

136.18.225.116 antpc-MS-7A94 (secondary worker, this can be multiple machines)
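
A small sketch for verifying this step on each machine or container: it parses hosts-file text and reports which required hostnames are missing. The function name is hypothetical. Note that the parenthetical remarks shown above are annotations for the reader; in a real /etc/hosts they would have to be written as `#` comments, otherwise they are parsed as aliases.

```python
def missing_hosts(hosts_text, required):
    """Return the names from `required` that have no entry in `hosts_text`."""
    present = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        present.update(fields[1:])  # fields[0] is the IP address
    return [name for name in required if name not in present]


if __name__ == "__main__":
    with open("/etc/hosts") as f:
        missing = missing_hosts(f.read(), ["INDFCQ4RG2-l", "antpc-MS-7A94"])
    print("missing entries:", missing or "none")
```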

 

 

 

Setting up


(Docker image: horovod/horovod, tag 0.19.0-tf2.0.0-torch1.3.0-mxnet1.5.0-py3.6-gpu)

A) Primary Worker

1. Run docker
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest


2. Test the sample app
  horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:2 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py

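B) Secondary Worker (start this container before running horovodrun on the primary, since horovodrun connects to its sshd on port 12345)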

nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
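
The horovodrun invocation can also be assembled from a host-to-slots mapping, which keeps -np and -H consistent. The helper below is a hypothetical convenience, not part of Horovod. It uses only the flags that appear in the commands in this post, and it assumes you want one process per listed slot.

```python
def horovodrun_command(slots_by_host, script, ssh_port=12345, start_timeout=360):
    """Build the horovodrun argument list for the given hosts.

    `slots_by_host` maps hostname -> number of processes (usually GPUs)
    to start on that host; -np is set to the sum of all slots.
    """
    hosts = ",".join(f"{host}:{slots}" for host, slots in slots_by_host.items())
    total = sum(slots_by_host.values())
    return [
        "horovodrun", "-np", str(total), "-H", hosts,
        f"--start-timeout={start_timeout}", "--verbose",
        "-p", str(ssh_port), "python", script,
    ]


if __name__ == "__main__":
    cmd = horovodrun_command(
        {"INDFCQ4RG2-l": 1, "antpc-MS-7A94": 1}, "tensorflow2_mnist.py"
    )
    print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` inside the primary container if you prefer to launch from Python.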



Sample run

root@INDFCQ4RG2-l:/examples# horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py

Filtering local host names.

Remote host found: antpc-MS-7A94

Checking ssh on all remote hosts.

SSH was successful into all the remote hosts.

Testing interfaces on all the hosts.

Launched horovodrun server.

Attempted to launch horovod task servers.

Waiting for the hosts to acknowledge.

('127.0.0.1', 26576)

('136.18.225.116', 55673)

Notified all the hosts that the registration is complete.

Waiting for hosts to perform host-to-host interface checking.

Host-to-host interface checking successful.

Interfaces on all the hosts were successfully checked.

Common interface found: enp0s31f6

Checking whether extension tensorflow was built with MPI.

Extension tensorflow was built with MPI.

mpirun --allow-run-as-root --tag-output -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_include enp0s31f6 -x NCCL_SOCKET_IFNAME=enp0s31f6  -x CUDA_PKG_VERSION -x CUDA_VERSION -x CUDNN_VERSION -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOSTNAME -x LD_LIBRARY_PATH -x LIBRARY_PATH -x LS_COLORS -x MXNET_VERSION -x NCCL_VERSION -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x PATH -x PWD -x PYTHON_VERSION -x PYTORCH_VERSION -x SHLVL -x TENSORFLOW_VERSION -x TERM -x TORCHVISION_VERSION -x _  python tensorflow2_mnist.py

[1,0]:2020-03-20 10:43:54.841973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1

[1,1]:2020-03-20 10:45:59.911264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1

[1,1]:2020-03-20 10:45:59.920605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683

[1,1]:pciBusID: 0000:b3:00.0

[1,1]:2020-03-20 10:45:59.921154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:

[1,1]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,1]:pciBusID: 0000:02:00.0

[1,1]:2020-03-20 10:45:59.921189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:45:59.922238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,1]:2020-03-20 10:45:59.923157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,1]:2020-03-20 10:45:59.923417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,1]:2020-03-20 10:45:59.924681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,1]:2020-03-20 10:45:59.925597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,1]:2020-03-20 10:45:59.928283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:45:59.930282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1

[1,0]:2020-03-20 10:43:54.895746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,0]:pciBusID: 0000:03:00.0

[1,0]:2020-03-20 10:43:54.896396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:

[1,0]:name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625

[1,0]:pciBusID: 0000:02:00.0

[1,0]:2020-03-20 10:43:54.896450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:54.898509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:43:54.900181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,0]:2020-03-20 10:43:54.900609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,0]:2020-03-20 10:43:54.903080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,0]:2020-03-20 10:43:54.906525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,0]:2020-03-20 10:43:54.912235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,0]:2020-03-20 10:43:54.915803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1731] Ignoring visible gpu device (device: 1, name: Quadro M2000, pci bus id: 0000:02:00.0, compute capability: 5.2) with core count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.

[1,0]:2020-03-20 10:43:54.915841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,1]:2020-03-20 10:46:00.242715: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

[1,1]:2020-03-20 10:46:00.271269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz

[1,1]:2020-03-20 10:46:00.272378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440d420 executing computations on platform Host. Devices:

[1,1]:2020-03-20 10:46:00.272396: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

[1,1]:2020-03-20 10:46:00.444966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x446f770 executing computations on platform CUDA. Devices:

[1,1]:2020-03-20 10:46:00.444995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1

[1,1]:2020-03-20 10:46:00.445805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683

[1,1]:pciBusID: 0000:b3:00.0

[1,1]:2020-03-20 10:46:00.445842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:46:00.445867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,1]:2020-03-20 10:46:00.445880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,1]:2020-03-20 10:46:00.445892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,1]:2020-03-20 10:46:00.445903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,1]:2020-03-20 10:46:00.445915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,1]:2020-03-20 10:46:00.445929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:46:00.447077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,1]:2020-03-20 10:46:00.447111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:46:00.504039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:

[1,1]:2020-03-20 10:46:00.504074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0

[1,1]:2020-03-20 10:46:00.504080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N

[1,1]:2020-03-20 10:46:00.505187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)

[1,0]:2020-03-20 10:43:56.160047: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

[1,0]:2020-03-20 10:43:56.198625: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394525000 Hz

[1,0]:2020-03-20 10:43:56.201578: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f2c430 executing computations on platform Host. Devices:

[1,0]:2020-03-20 10:43:56.201618: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

[1,0]:2020-03-20 10:43:56.348911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41fad60 executing computations on platform CUDA. Devices:

[1,0]:2020-03-20 10:43:56.348964: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1

[1,0]:2020-03-20 10:43:56.350439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,0]:pciBusID: 0000:03:00.0

[1,0]:2020-03-20 10:43:56.350531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:56.350591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:43:56.350629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,0]:2020-03-20 10:43:56.350664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,0]:2020-03-20 10:43:56.350695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,0]:2020-03-20 10:43:56.350732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,0]:2020-03-20 10:43:56.350769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,0]:2020-03-20 10:43:56.353401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,0]:2020-03-20 10:43:56.353472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:56.457968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:

[1,0]:2020-03-20 10:43:56.458023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0

[1,0]:2020-03-20 10:43:56.458032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N

[1,0]:2020-03-20 10:43:56.464616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7594 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)

[1,0]:2020-03-20 10:43:56.468105: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 376320000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:43:57.434876: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:43:57.605212: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:44:02.771952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:44:03.001307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:46:10.693376: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 358.89MiB (rounded to 376320000).  Current allocation summary follows.

[1,1]:2020-03-20 10:46:10.693435: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693672: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693684: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693699: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 358.89MiB was 256.00MiB, Chunk State:

[1,1]:2020-03-20 10:46:10.693710: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:

[1,1]:2020-03-20 10:46:10.693721: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 0B

[1,1]:2020-03-20 10:46:10.693733: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 0 memory_limit_: 139591680 available bytes: 139591680 curr_region_allocation_bytes_: 1048576

[1,1]:2020-03-20 10:46:10.693748: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:

[1,1]:Limit:                   139591680

[1,1]:InUse:                           0

[1,1]:MaxInUse:                        0

[1,1]:NumAllocs:                       0

[1,1]:MaxAllocSize:                    0

[1,1]:

[1,1]:2020-03-20 10:46:10.693764: W tensorflow/core/common_runtime/bfc_allocator.cc:424]

....

stdout>:Step #230    Loss: 0.049809

[1,1]:Step #240    Loss: 0.114416

[1,0]:Step #240    Loss: 0.118383

[1,1]:Step #250    Loss: 0.072860

[1,0]:Step #250    Loss: 0.075680

[1,0]:Step #260    Loss: 0.077773

[1,1]:Step #260    Loss: 0.256634

[1,1]:Step #270    Loss: 0.052928

[1,0]:Step #270    Loss: 0.055714

[1,0]:Step #280    Loss: 0.065934

[1,1]:Step #280    Loss: 0.129994

 

Offline Real-time Natural Sounding Speech Synthesis

 


 

The present embedded solution does not sound very natural and makes it hard to support new languages, accents and voices. Neural-network-based solutions address these issues, but they are not real-time on low-footprint devices and have huge models that require more memory.

An NN-based solution usually contains two parts:

1.       Mel spectrogram generator

Generates a mel spectrogram (the important acoustic features that the human brain uses when processing speech) from text.

2.       Vocoder

A speech synthesizer that generates the speech waveform from the mel spectrogram.



 


 

Based on our real-time and low-footprint requirements, we are exploring various combinations of mel spectrogram generator and vocoder.

Solutions

1.       Tacotron2 + LPCNET

2.       FastSpeech + Squeezewave

3.       FastSpeech + LPCNET

 

 

1.      Tacotron2 + LPCNET

 



Tacotron2 is an end-to-end TTS system with a WaveNet vocoder. To make it real-time, we made the following optimizations:

a)       WaveNet vocoder replaced with LPCNET

b)      Tacotron2 retrained with only 20 mels

c)       Memory-mapped model

d)      Streaming of the LPCNET output

 

 

A prototype app running on Android produces output fairly quickly. It is not yet real-time, but the latency is acceptable for short utterances. In addition, more training is required for more natural-sounding speech.

 

On Qualcomm 820 Board

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 s            15 s                   10 s       25 s
3 s             2 s                    1.5 s      3.5 s

 

On X86 Single Core

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 s            6 s                    4 s        10 s
3 s             -                      -          -

 

 

 

 

 

2.      FastSpeech + Squeezewave

 

FastSpeech is an astonishingly fast mel generator.



Combining it with the Squeezewave vocoder on a single-core x86 CPU, we obtained the following timings.

On X86 Single Core

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 s            1.2 s                  2.7 s      3.9 s
3 s             0.5 s                  0.8 s      1.3 s
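
One way to compare these measurements is the real-time factor (RTF): total processing time divided by audio duration, where a value below 1 means the whole utterance is synthesized faster than it plays back. The sketch below applies it to the complete rows reported in this post. The RTF metric and the helper name are added for illustration; the numbers are copied from the tables.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Processing time divided by audio duration (below 1 is faster than playback)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# (configuration, audio length in s, total synthesis time in s), from the tables.
TIMINGS = [
    ("Tacotron2 + LPCNET, Qualcomm 820", 12.0, 25.0),
    ("Tacotron2 + LPCNET, Qualcomm 820", 3.0, 3.5),
    ("Tacotron2 + LPCNET, x86 single core", 12.0, 10.0),
    ("FastSpeech + Squeezewave, x86 single core", 12.0, 3.9),
    ("FastSpeech + Squeezewave, x86 single core", 3.0, 1.3),
]

if __name__ == "__main__":
    for name, audio, total in TIMINGS:
        print(f"{name}: {audio:g} s audio, RTF = {real_time_factor(audio, total):.2f}")
```

An RTF below 1 for the whole utterance does not by itself guarantee low latency; that also depends on whether the output can be streamed as it is generated.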

 

However, both implementations are in PyTorch (Python), and converting the models so that inference can run from C/C++ code is somewhat difficult. The options are:

 

1.       Converting the models to TorchScript and running inference from C/C++

2.       Converting both FastSpeech and Squeezewave to a TensorFlow implementation, then using the model from C/C++ code.

 

 

3.      FastSpeech + LPCNET

 

Because converting a model from PyTorch to TorchScript is complex and LPCNET's performance is proven, we are trying to integrate FastSpeech mel spectrogram generation with the LPCNET vocoder.

Since the LPCNET vocoder only works with fewer mels, we need to modify FastSpeech and retrain the model for use with LPCNET.

 

 

 

 

 
