Wednesday, August 7, 2024

Setting Up Horovod for Distributed Training on 2 Hosts

 


https://github.com/horovod/horovod/blob/master/docs/docker.rst

 

Prerequisites

1. Passwordless SSH access to all machines

http://www.linuxproblem.org/art_9.html

The primary worker has to add its public key to the authorized_keys file of every host.

 

e.g. a setup using 2 hosts running Docker containers

 

(Host A/DOCKER A) ------------------------------------------------> (Host B/DOCKER B)

(136.18.225.72 / INDFCQ4RG2-l / 1 GPU)                    (136.18.225.116 / antpc-MS-7A94 / 2 GPU)

 

a) Generate a public key inside Docker A and place it in the authorized_keys files of Host A, Host B and Docker B.

b) Generate a public key inside Host A and place it in the authorized_keys files of Docker A, Host B and Docker B.

c) Generate a public key inside Docker B and place it in the authorized_keys files of Host A, Host B and Docker A.

d) Generate a public key inside Host B and place it in the authorized_keys files of Docker B, Host A and Docker A. The commands for one such exchange are sketched below.
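A minimal sketch of exchange (a), run from inside Docker A (the root user, the RSA key type and an sshd already listening on port 12345 inside Docker B are assumptions of this sketch; ssh-copy-id appends the key to the remote authorized_keys):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa          # generate the key pair once inside Docker A
ssh-copy-id root@INDFCQ4RG2-l                     # Host A
ssh-copy-id root@antpc-MS-7A94                    # Host B
ssh-copy-id -p 12345 root@antpc-MS-7A94           # Docker B (its sshd listens on port 12345)

Repeat the equivalent commands for the keys generated on Host A, Docker B and Host B, as described in (b)-(d).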

 

 

2. The connected links should have the same interface name on the primary worker and all secondary workers,

e.g. the interface should be named enp0s31f6 on every involved host.

You can use the following commands to rename an interface (put them in .bashrc or a boot script if you want the name to persist across reboots):

sudo /sbin/ip link set <old_ifname> down

sudo /sbin/ip link set <old_ifname> name <new_ifname>

sudo /sbin/ip link set <new_ifname> up
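For example, a sketch assuming the NIC on a secondary worker currently comes up as eno1 (a hypothetical name) and has to match the primary worker's enp0s31f6:

sudo /sbin/ip link set eno1 down
sudo /sbin/ip link set eno1 name enp0s31f6
sudo /sbin/ip link set enp0s31f6 up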

 

Note: this is no longer a limitation with the following pull request:

https://github.com/horovod/horovod/issues/1724#issuecomment-603613522

https://github.com/horovod/horovod/pull/1808

 

3. Enter the hostnames in /etc/hosts on all machines/Docker containers:

136.18.225.72 INDFCQ4RG2-l (primary worker)

136.18.225.116 antpc-MS-7A94 (secondary worker, this can be multiple machines)

 

 

 

Setting Up


(Docker image: horovod/horovod:0.19.0-tf2.0.0-torch1.3.0-mxnet1.5.0-py3.6-gpu)

A) Primary Worker

1. Run the Docker container
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest


2. Test the sample app
  horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:2 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py

B) Secondary Worker

nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
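Before launching the training, a quick sanity check from inside the primary worker's container confirms that passwordless SSH to the secondary worker's container works on the custom port (hostnames are the ones added to /etc/hosts above):

ssh -p 12345 root@antpc-MS-7A94 hostname          # should print the remote hostname without asking for a password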



Sample run:

root@INDFCQ4RG2-l:/examples# horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py

Filtering local host names.

Remote host found: antpc-MS-7A94

Checking ssh on all remote hosts.

SSH was successful into all the remote hosts.

Testing interfaces on all the hosts.

Launched horovodrun server.

Attempted to launch horovod task servers.

Waiting for the hosts to acknowledge.

('127.0.0.1', 26576)

('136.18.225.116', 55673)

Notified all the hosts that the registration is complete.

Waiting for hosts to perform host-to-host interface checking.

Host-to-host interface checking successful.

Interfaces on all the hosts were successfully checked.

Common interface found: enp0s31f6

Checking whether extension tensorflow was built with MPI.

Extension tensorflow was built with MPI.

mpirun --allow-run-as-root --tag-output -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_include enp0s31f6 -x NCCL_SOCKET_IFNAME=enp0s31f6  -x CUDA_PKG_VERSION -x CUDA_VERSION -x CUDNN_VERSION -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOSTNAME -x LD_LIBRARY_PATH -x LIBRARY_PATH -x LS_COLORS -x MXNET_VERSION -x NCCL_VERSION -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x PATH -x PWD -x PYTHON_VERSION -x PYTORCH_VERSION -x SHLVL -x TENSORFLOW_VERSION -x TERM -x TORCHVISION_VERSION -x _  python tensorflow2_mnist.py

[1,0]:2020-03-20 10:43:54.841973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1

[1,1]:2020-03-20 10:45:59.911264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1

[1,1]:2020-03-20 10:45:59.920605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683

[1,1]:pciBusID: 0000:b3:00.0

[1,1]:2020-03-20 10:45:59.921154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:

[1,1]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,1]:pciBusID: 0000:02:00.0

[1,1]:2020-03-20 10:45:59.921189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:45:59.922238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,1]:2020-03-20 10:45:59.923157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,1]:2020-03-20 10:45:59.923417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,1]:2020-03-20 10:45:59.924681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,1]:2020-03-20 10:45:59.925597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,1]:2020-03-20 10:45:59.928283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:45:59.930282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1

[1,0]:2020-03-20 10:43:54.895746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,0]:pciBusID: 0000:03:00.0

[1,0]:2020-03-20 10:43:54.896396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:

[1,0]:name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625

[1,0]:pciBusID: 0000:02:00.0

[1,0]:2020-03-20 10:43:54.896450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:54.898509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:43:54.900181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,0]:2020-03-20 10:43:54.900609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,0]:2020-03-20 10:43:54.903080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,0]:2020-03-20 10:43:54.906525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,0]:2020-03-20 10:43:54.912235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,0]:2020-03-20 10:43:54.915803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1731] Ignoring visible gpu device (device: 1, name: Quadro M2000, pci bus id: 0000:02:00.0, compute capability: 5.2) with core count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.

[1,0]:2020-03-20 10:43:54.915841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,1]:2020-03-20 10:46:00.242715: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

[1,1]:2020-03-20 10:46:00.271269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz

[1,1]:2020-03-20 10:46:00.272378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440d420 executing computations on platform Host. Devices:

[1,1]:2020-03-20 10:46:00.272396: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

[1,1]:2020-03-20 10:46:00.444966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x446f770 executing computations on platform CUDA. Devices:

[1,1]:2020-03-20 10:46:00.444995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1

[1,1]:2020-03-20 10:46:00.445805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683

[1,1]:pciBusID: 0000:b3:00.0

[1,1]:2020-03-20 10:46:00.445842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:46:00.445867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,1]:2020-03-20 10:46:00.445880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,1]:2020-03-20 10:46:00.445892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,1]:2020-03-20 10:46:00.445903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,1]:2020-03-20 10:46:00.445915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,1]:2020-03-20 10:46:00.445929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:46:00.447077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,1]:2020-03-20 10:46:00.447111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,1]:2020-03-20 10:46:00.504039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:

[1,1]:2020-03-20 10:46:00.504074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0

[1,1]:2020-03-20 10:46:00.504080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N

[1,1]:2020-03-20 10:46:00.505187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)

[1,0]:2020-03-20 10:43:56.160047: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

[1,0]:2020-03-20 10:43:56.198625: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394525000 Hz

[1,0]:2020-03-20 10:43:56.201578: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f2c430 executing computations on platform Host. Devices:

[1,0]:2020-03-20 10:43:56.201618: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

[1,0]:2020-03-20 10:43:56.348911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41fad60 executing computations on platform CUDA. Devices:

[1,0]:2020-03-20 10:43:56.348964: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1

[1,0]:2020-03-20 10:43:56.350439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715

[1,0]:pciBusID: 0000:03:00.0

[1,0]:2020-03-20 10:43:56.350531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:56.350591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:43:56.350629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0

[1,0]:2020-03-20 10:43:56.350664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0

[1,0]:2020-03-20 10:43:56.350695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0

[1,0]:2020-03-20 10:43:56.350732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0

[1,0]:2020-03-20 10:43:56.350769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,0]:2020-03-20 10:43:56.353401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

[1,0]:2020-03-20 10:43:56.353472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[1,0]:2020-03-20 10:43:56.457968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:

[1,0]:2020-03-20 10:43:56.458023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0

[1,0]:2020-03-20 10:43:56.458032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N

[1,0]:2020-03-20 10:43:56.464616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7594 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)

[1,0]:2020-03-20 10:43:56.468105: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 376320000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:43:57.434876: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:43:57.605212: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.

[1,0]:2020-03-20 10:44:02.771952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

[1,0]:2020-03-20 10:44:03.001307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

[1,1]:2020-03-20 10:46:10.693376: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 358.89MiB (rounded to 376320000).  Current allocation summary follows.

[1,1]:2020-03-20 10:46:10.693435: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693672: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693684: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

[1,1]:2020-03-20 10:46:10.693699: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 358.89MiB was 256.00MiB, Chunk State:

[1,1]:2020-03-20 10:46:10.693710: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:

[1,1]:2020-03-20 10:46:10.693721: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 0B

[1,1]:2020-03-20 10:46:10.693733: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 0 memory_limit_: 139591680 available bytes: 139591680 curr_region_allocation_bytes_: 1048576

[1,1]:2020-03-20 10:46:10.693748: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:

[1,1]:Limit:                   139591680

[1,1]:InUse:                           0

[1,1]:MaxInUse:                        0

[1,1]:NumAllocs:                       0

[1,1]:MaxAllocSize:                    0

[1,1]:

[1,1]:2020-03-20 10:46:10.693764: W tensorflow/core/common_runtime/bfc_allocator.cc:424]

..... (output truncated) .....

stdout>:Step #230    Loss: 0.049809

[1,1]:Step #240    Loss: 0.114416

[1,0]:Step #240    Loss: 0.118383

[1,1]:Step #250    Loss: 0.072860

[1,0]:Step #250    Loss: 0.075680

[1,0]:Step #260    Loss: 0.077773

[1,1]:Step #260    Loss: 0.256634

[1,1]:Step #270    Loss: 0.052928

[1,0]:Step #270    Loss: 0.055714

[1,0]:Step #280    Loss: 0.065934

[1,1]:Step #280    Loss: 0.129994

 

Offline Real-time Natural Sounding Speech Synthesis

 


 

The present embedded solution does not sound very natural, and it is hard to add support for new languages, accents and voices. Neural-network-based solutions address these issues, but they are not real-time on low-footprint devices and their huge models require more memory.

An NN-based solution usually contains two parts:

1.       Mel spectrogram generator

Generates a mel spectrogram (acoustic features close to the ones the human brain uses when processing speech) from text.

2.       Vocoder

A speech synthesizer that generates the speech waveform from the mel spectrogram.



 


 

Based on our real-time and low-footprint requirements, we are exploring various options for the mel spectrogram generator and the vocoder.

Solutions

1.       Tacotron2 + LPCNet

2.       FastSpeech + LPCNet

3.       FastSpeech + SqueezeWave

 

 

1.      Tacotron2 + LPCNet

 



Tacotron2 is an end-to-end TTS model with a WaveNet vocoder. To make it real-time we applied the following optimizations:

a)       Replaced the WaveNet vocoder with LPCNet

b)      Retrained Tacotron2 with only 20 mel bands

c)       Memory-mapped the model

d)      Streamed the LPCNet output

 

 

A prototype app running on the Android platform produces output quite fast; it is still not real-time, but the latency is acceptable for short utterances. In addition, more training is required for a more natural sound.

 

On Qualcomm 820 Board

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 sec          15 s                   10 s       25 s
3 sec           2 s                    1.5 s      3.5 s

 

On X86 Single Core

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 sec          6 s                    4 s        10 s
3 sec

 

 

 

 

 

2.      FastSpeech + SqueezeWave

 

FastSpeech is an astonishingly fast mel generator.



Combining it with the SqueezeWave vocoder on a single x86 CPU core, we attained the following timings.

On X86 Single Core

Audio Sample    Mel Spectrogram Gen    Vocoder    Total Time
12 sec          1.2 s                  2.7 s      3.9 s
3 sec           0.5 s                  0.8 s      1.3 s

 

But both of these implementations are in PyTorch (Python), and converting the models so that they can be inferenced via C/C++ code is a little difficult. The options are:

 

1.       Convert the models to TorchScript and run inference via C/C++.

2.       Convert both FastSpeech and SqueezeWave to TensorFlow implementations, then use the models from C/C++ code.

 

 

3.      FastSpeech + LPCNet

 

Because of the complexity of converting the models from PyTorch to TorchScript, and because of LPCNet's proven performance, we are trying to integrate FastSpeech mel spectrogram generation with the LPCNet vocoder.

Since the LPCNet vocoder works only with a smaller number of mel bands, we need to modify FastSpeech and retrain the model so it can be used with LPCNet.

 

 

 

 

 

Tuesday, December 14, 2021

KPATCH - live patch loading/unloading in a running kernel

Prerequisite
=============
apt-get install gcc-7-plugin-dev
yum install python2-devel
yum install python3-devel
yum install yum-utils
Install kpatch
================
git clone https://github.com/dynup/kpatch.git
cd kpatch
source test/integration/lib.sh
kpatch_dependencies
make -j
make install

Run kpatch on kernel src with patch to be applied
==================================================
1. Build the kernel from source
2. kpatch-build -j20 nvme.patch -s <kernel_src> -c <kernel_src>/.config

This will create kpatch-nvme.ko 

Installing module
==================
kpatch install kpatch-nvme.ko
kpatch list

Loading/unloading kpatch module.
=============================
kpatch load kpatch-nvme.ko
kpatch unload kpatch-nvme.ko
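Besides kpatch list, you can also confirm that the patch is active through the kernel's livepatch sysfs interface (a sketch; it assumes a livepatch-capable kernel, and the module name kpatch_nvme is illustrative):

ls /sys/kernel/livepatch/
cat /sys/kernel/livepatch/kpatch_nvme/enabled     # 1 means the patch is applied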

To check build logs -> tail -f /root/.kpatch/build.log
Good Read -> https://blog.kernelcare.com/live-patching-debian-10-linux-kernel-with-kpatch

Wednesday, July 28, 2021

Pyverbs - RDMA with Python

 



>> Install the following packages, prerequisites for pyverbs compilation in rdma-core:

rpm -ivh http://mirror.centos.org/centos/8/PowerTools/x86_64/os/Packages/python3-Cython-0.28.1-3.el8.x86_64.rpm
yum install python3-devel
yum install libudev-devel
yum install pkgconfig valgrind-devel libudev-devel cmake libnl3-devel python3-devel python3-docutils

>> With the above packages installed, now build rdma-core:

./build.sh → with the above packages in place, this will also compile pyverbs in rdma-core

>> Run the sample application present in rdma-core using:
PYTHONPATH='/opt/rdma-core/build/python' python3 pyverbs/examples/ib_devices.py
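If the example prints nothing, it is worth first confirming that RDMA devices are visible to the verbs stack at all, e.g. with the ibv_devices utility that ships with rdma-core:

ibv_devices                                       # lists the RDMA devices and their node GUIDs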


Good resources

https://github.com/linux-rdma/rdma-core/blob/master/Documentation/pyverbs.md

https://webcache.googleusercontent.com/search?q=cache:ichFGVm_EvkJ:https://bugzilla.redhat.com/show_bug.cgi%3Fid%3D1894516+&cd=2&hl=en&ct=clnk&gl=in

Server and DRAC Identification via SNMP

Sometimes network IP changes make it difficult to identify the DRAC IP or the server IP. With an SNMP agent enabled on the DRAC and on the server,

we can use Advanced IP Scanner to quickly scan the network range, and with the names that appear we can easily get the IPs.


Setting the Name for the DRAC

Enable SNMP and change the DNS name.










Assign a DNS name to the iDRAC and enable the SNMP agent.






Setting Name on Servers

RHEL/CENTOS

Installation

Execute the command:

yum install -y net-snmp

Add the line below to the configuration file (/etc/snmp/snmpd.conf):

rocommunity public

agentAddress udp:161,udp6:[::1]:161

Start the snmpd service:

systemctl enable snmpd && systemctl start snmpd

Allowing SNMP ports in firewall

Execute the following commands:


firewall-cmd --zone=public --add-port=161/udp --permanent

firewall-cmd --zone=public --add-port=162/udp --permanent

firewall-cmd --reload






CentOS

Installation

Execute the commands:


> yum update

> yum install net-snmp


Configuration

Edit the file: /etc/snmp/snmpd.conf 


Add the line:

rocommunity public

Replace the line below:

view systemview included .1.3.6.1.2.1.25.1.1

with the following line:

view systemview included .1.3.

Restart the SNMP Service:

service snmpd restart

Allowing SNMP ports in Firewall

Execute the commands:


firewall-cmd --zone=public --add-port=161/udp --permanent

firewall-cmd --zone=public --add-port=162/udp --permanent

firewall-cmd --reload


Ubuntu

Installation

Execute the command:


> apt update

> apt install snmpd


Configuration

Edit the file: /etc/snmp/snmpd.conf 


Add the line:

rocommunity public

Comment the line:

#agentAddress udp:127.0.0.1:161

Uncomment the line: 

agentAddress udp:161,udp6:[::1]:161

Restart the SNMP Service:

service snmpd restart

Allowing SNMP ports in firewall

Execute the following commands to allow necessary ports:


ufw allow 161/udp

ufw allow 162/udp
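Once snmpd is running on a server (whichever distro), you can verify from another machine that the hostname is being served over SNMP before scanning, e.g. with snmpwalk from the net-snmp client utilities (replace <server-ip>; the utility may need to be installed separately):

snmpwalk -v2c -c public <server-ip> 1.3.6.1.2.1.1.5    # sysName, should return the server's hostname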




Using Advanced IP Scanner to Scan the Network


Now running Advanced IP Scanner will show the names of the server and the DRAC.










https://www.site24x7.com/help/admin/adding-a-monitor/configuring-snmp-linux.html


Friday, October 9, 2020

Binding VF to VFIO inside QEMU


For binding the VF to vfio-pci in the guest VM, there are two options:

1.   Enabling vIOMMU inside QEMU/the VM

2.   Using the no-IOMMU mode of the VFIO driver

 


NO_IOMMU_MODE

On Guest VM

1. modprobe vfio-pci

2. echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode

3. usertools/dpdk-devbind.py -b vfio-pci 07:00.0
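As a sanity check (the BDF 07:00.0 is whatever your VF shows up as inside the VM), confirm that vfio-pci now owns the device:

lspci -nnk -s 07:00.0                             # should report "Kernel driver in use: vfio-pci"
usertools/dpdk-devbind.py --status-dev net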

 

VIOMMU MODE

 

On HOST Machine

================

1. Load the modules

   modprobe qede

  modprobe vfio-pci

 

2. Check the PF B:D:F (bus:device:function)

   #lspci | grep QL

  04:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)

  04:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)

 

3. Create a VF on the PF

echo 1 > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.1/sriov_numvfs

 

4. Unbind the qede driver from the VF so that it can be imported into the VM.

  echo -n "0000:04:0e.0" > /sys/bus/pci/drivers/qede/unbind

 

5. Get the vendor and device ID of the VF device

  # lspci -nn | grep QL | grep IOV

  04:0e.0 Ethernet controller [0200]: QLogic Corp. FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) [1077:8090] (rev 0

  * The values are in the square brackets ([vendor:device])

 

6. Bind the device to the vfio-pci driver

  echo "1077 8090" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
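To confirm that the VF is now claimed by vfio-pci on the host (BDF from step 4):

  lspci -nnk -s 04:0e.0                           # should show "Kernel driver in use: vfio-pci"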

 

7. Start QEMU with the guest VM.

  /usr/bin/qemu-system-x86_64  -machine q35,kernel-irqchip=split,accel=kvm -smp 4 -m 2G \

  -device intel-iommu,intremap=on,caching-mode=on -nographic /home/fastlinq/centos-7.8.qcow2 \

  -device vfio-pci,host=04:0e.0

 

 

Guest VM

=======

1. Edit /etc/default/grub

and add "iommu=pt intel_iommu=on" at the end of GRUB_CMDLINE_LINUX.

 

e.g

GRUB_CMDLINE_LINUX="console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=40ff14688-2619-4046-a9eb-b7333fff1b84 console=ttyS0,115200 iommu=pt intel_iommu=on"

 

 

2. Update grub using:

a) grub2-mkconfig -o /boot/grub2/grub.cfg (for RedHat/CentOS)

b) update-grub (for Ubuntu)

 

3. Reboot

 

4. Check the BDF of the VF inside the VM

# lspci | grep QL

00:03.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) (rev 02)

 

5. modprobe vfio-pci

 

6. Bind the VF to VFIO using the command below

  usertools/dpdk-devbind.py -b vfio-pci 00:03.0

 

7. Check the status

 

[root@centos-8 dpdk-stable-19.11.5]#  usertools/dpdk-devbind.py --status-dev net

Network devices using DPDK-compatible driver

============================================

0000:00:03.0 'FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) 8090' drv=vfio-pci unused=qede

 

Network devices using kernel driver

===================================

0000:00:02.0 '82574L Gigabit Network Connection 10d3' if=enp0s2 drv=e1000e unused=vfio-pci *Active*

Saturday, October 3, 2020

XDP - Getting Started with XDP (Linux)

 

XDP


Introduced in Linux 4.8, XDP is an eBPF hook at the driver level (ingress).

It intercepts a packet before it reaches the stack, before an sk_buff is allocated.

Rationale: implement a faster data path which is part of the kernel and maintained by the kernel community, aimed rather at simple use cases. For complex processing the packet is forwarded to the stack; XDP is not a "kernel bypass", it works in cooperation with the networking stack.

Essentially, user-space networking achieves high-speed performance by moving packet processing out of the kernel's realm into user space. XDP does in fact the opposite: it moves user-space networking programs (filters, mappers, routing, etc.) into the kernel's realm. XDP allows us to execute our network function as soon as a packet hits the NIC, and before it starts moving upwards into the kernel's networking layer, which results in a significant increase in packet-processing speed.

Accelerating-VM-Networking-through-XDP_Jason-Wang.pdf

https://help.netronome.com/support/solutions/articles/36000050009-agilio-ebpf-2-0-6-extended-berkeley-packet-filter

https://www.netronome.com/blog/hello-xdp_drop/

https://archive.fosdem.org/2018/schedule/event/xdp/attachments/slides/2220/export/events/attachments/xdp/slides/2220/fosdem18_SdN_NFV_qmonnet_XDPoffload.pdf


XDP MODES

In total, XDP supports three operation modes which iproute2 implements as well: xdpdrv, xdpoffload and xdpgeneric.

xdpdrv stands for native XDP, meaning the BPF program is run directly in the driver’s receive path at the earliest possible point in software.
This is the normal / conventional XDP mode and requires drivers to implement XDP support, which all major 10G/40G/+ networking drivers
in the upstream Linux kernel already provide.

xdpgeneric stands for generic XDP and is intended as an experimental test bed for drivers which do not yet support native XDP.
Given the generic XDP hook in the ingress path comes at a much later point in time when the packet already enters the stack’s
main receive path as a skb, the performance is significantly less than with processing in xdpdrv mode.
xdpgeneric therefore is for the most part only interesting for experimenting, less for production environments.

xdpoffload, last but not least, is implemented by SmartNICs such as those supported by Netronome's nfp driver and
allows for offloading the entire BPF/XDP program into hardware, so the program is run on each packet reception directly
on the card. This provides even higher performance than running in native XDP, although not all BPF map types or BPF helper
functions are available for use compared to native XDP. The BPF verifier will reject the program in such a case and report
to the user what is unsupported. Other than staying in the realm of supported BPF features and helper functions,
no special precautions have to be taken when writing BPF C programs.











#include <linux/bpf.h>
int main()
{
        return XDP_DROP;
}


clang -target bpf -O2 -c xdp.c -o xdp.o


ip -force link set dev ens1f0 xdpdrv obj xdp.o sec .text


ip link show ens1f0
32: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether f4:e9:d4:ed:25:38 brd ff:ff:ff:ff:ff:ff
prog/xdp id 36


0 XDP_ABORTED - Error, Block the packet
1 XDP_DROP - Block the packet
2 XDP_PASS - Allow the packet to continue up the kernel
3 XDP_TX - Bounce the packet back in the direction it came from



$ hping3 [IP Address of Host]
Traffic can be monitored using tcpdump, however it will show that no packets are received.
This is due to XDP dropping packets at the start of the kernel path, before the packets can reach tcpdump

unload xdp
ip link set dev [DEV] xdpdrv off


H/w offload load
ip -force link set dev ens1f0 xdpoffload obj xdp.o sec .text


Testing XDP



Step 1

Check whether "clang" is installed; if not, install it with "yum install clang". Note that XDP is supported only on RHEL 8 and later kernels.


Step 2

Create xdp_drop.c file in "/usr/src/kernels/$(uname -r)/net/xdp" directory 

touch /usr/src/kernels/$(uname -r)/net/xdp/xdp_drop.c


Step 3 

Write xdp_drop code inside xdp_drop.c file



#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
    __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return XDP_DROP;
}

char __license[] __section("license") = "GPL";




Step 4 

Compile this code with the command below to create the object file:

clang -O2 -Wall -target bpf -c xdp_drop.c -o xdp_drop.o


Step 5

Insert/probe the xdp_drop.o file on both interfaces (PFs) with the commands below:

ip link set dev ens3f0 xdp obj xdp_drop.o

ip link set dev ens3f1 xdp obj xdp_drop.o



Step 6

With the "ip link show" command, check that XDP is loaded with a program id.


[root@Gen9-XDP-Host-RHEL8 xdp]# ip link show

4: ens3f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000

     link/ether 00:0e:1e:d6:62:fc brd ff:ff:ff:ff:ff:ff

     prog/xdp id 1 tag f95672269956c10d jited

 5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000

     link/ether 00:0e:1e:d6:62:fd brd ff:ff:ff:ff:ff:ff

     prog/xdp id 2 tag f95672269956c10d jited


Step 7

Send traffic through the scapy tool from the peer system to both interfaces simultaneously:



sendp (Ether(src="00:0e:1e:d6:62:fc",dst="14:02:ec:d3:af:0a")/IP(src="44.44.44.1",dst="55.55.55.1")/TCP(sport=0xbbbb,dport=0xaaaa)/("x"*200), iface="ens3f0",count=1000000)

sendp (Ether(src="00:0e:1e:d6:62:fd",dst="14:02:ec:d3:af:0b")/IP(src="44.44.44.1",dst="55.55.55.1")/TCP(sport=0xbbbb,dport=0xaaaa)/("x"*200), iface="ens3f1",count=1000000)


1. We observed that packets were being dropped and the "xdp_no_pass" counters were increasing. No packets were seen in tcpdump, which suggests that the express data path was being used.


[root@Gen9-XDP-Host-RHEL8 xdp]# ethtool -S ens3f0 | grep xdp

      0: xdp_no_pass: 5000

      1: xdp_no_pass: 3731

      2: xdp_no_pass: 5000

      3: xdp_no_pass: 4000

      4: xdp_no_pass: 4609

      5: xdp_no_pass: 5000

      6: xdp_no_pass: 4000

      7: xdp_no_pass: 5000


2. You should not see any unexpected failures in dmesg or /var/log/messages

 3. You should not see any driver/FW failure messages or a system hang.




LOADING IN NATIVE MODE

# ip -force link set dev em1 xdpdrv obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
prog/xdp id 1 tag 57cd311f2e27366b
[...]
# ip link set dev em1 xdpdrv off

The option verb can be appended when loading programs in order to dump the verifier log:
# ip -force link set dev em1 xdpdrv obj prog.o verb

LOADING IN GENERIC MODE

# ip -force link set dev em1 xdpgeneric obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
prog/xdp id 4 tag 57cd311f2e27366b <-- BPF program ID 4
[...]
# bpftool prog dump xlated id 4 <-- Dump of instructions running on em1
0: (b7) r0 = 1
1: (95) exit
# ip link set dev em1 xdpgeneric off


XDP Related Config
==================
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_BPF_JIT=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
CONFIG_XDP_SOCKETS=y
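On a running system you can quickly check whether these options are enabled (a sketch; the config file location varies by distro, and some kernels expose it as /proc/config.gz instead):

grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_JIT=|CONFIG_XDP_SOCKETS=' /boot/config-$(uname -r)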

$ cd tools/testing/selftests/bpf/
$ make
$ sudo ./test_verifier


Sample code to drop IP traffic from 50.50.50.1

#include "../../include/uapi/linux/bpf.h"
#include "../../include/uapi/linux/if_ether.h"
#include "../../include/uapi/linux/if_packet.h"
#include "../../include/uapi/linux/ip.h"
#include "../../include/uapi/linux/in.h"
#include "../../include/uapi/linux/tcp.h"
#include "../../include/uapi/linux/udp.h"
//#include "bpf_helpers.h"

#ifndef __section
# define __section(NAME)                  \
        __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data     = (void *)(long)ctx->data;

        /* bounds check the Ethernet header */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end) {
                return XDP_PASS;
        }

        /* bounds check the IP header */
        struct iphdr *iph = data + sizeof(struct ethhdr);
        if ((void *)(iph + 1) > data_end) {
                return XDP_PASS;
        }

        unsigned int ip_src = iph->saddr;

        /* https://www.vultr.com/resources/ipv4-converter/?ip_address=50.50.50.1 -> 842150401
         * saddr is in network byte order, i.e. htonl(842150401) == 20066866 on this host */
        if (ip_src == 20066866) {
                return XDP_DROP;
        }

        return XDP_PASS;
}
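The constant 20066866 is just 50.50.50.1 in network byte order read back as a host-order integer on a little-endian (x86) machine; one way to double-check it, assuming python3 is available:

python3 -c "import socket,struct; print(struct.unpack('=I', socket.inet_aton('50.50.50.1'))[0])"   # prints 20066866 on little-endian hosts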



Good Links

https://docs.cilium.io/en/latest/bpf/

https://medium.com/@fntlnz/load-xdp-programs-using-the-ip-iproute2-command-502043898263
https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf

