Friday, October 9, 2020

Binding VF to VFIO inside QEMU


To bind the VF to vfio-pci inside the guest VM, there are two options:

1.   Enabling vIOMMU inside QEMU/VM

2.   Using no-iommu mode of VFIO drivers
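A quick way to tell which option applies inside a given guest (a small sketch, assuming a sysfs-based guest): if QEMU does not present a vIOMMU, the IOMMU-groups directory stays empty and only no-IOMMU mode will work.

# Inside the guest: empty output means no vIOMMU is visible to the kernel
ls /sys/kernel/iommu_groups/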

 


NO_IOMMU_MODE 

On Guest VM

1. modprobe vfio-pci

2. echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode

3. usertools/dpdk-devbind.py -b vfio-pci 07:00.0
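
4. (Optional) Confirm the binding with the same dpdk-devbind script; the VF (07:00.0 in this example) should now be listed under "Network devices using DPDK-compatible driver" with drv=vfio-pci.

usertools/dpdk-devbind.py --status-dev net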

 

VIOMMU MODE

 

On HOST Machine

================

1. Load the driver modules

   modprobe qede

  modprobe vfio-pci

 

2. Check PF B:D:F ( bus:device:function)

   #lspci | grep QL

  04:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)

  04:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)

 

3. Create a VF on the PF

echo 1 > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.1/sriov_numvfs
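
To verify the VF was created (a quick sketch; the exact lspci description depends on the adapter, see step 5 below):

cat /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.1/sriov_numvfs
lspci | grep -i "SR-IOV"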

 

4. Unbind the qede driver from the VF so that it can be imported into the VM.

  echo -n "0000:04:0e.0" > /sys/bus/pci/drivers/qede/unbind

 

5. Get the vendor and device ID of the VF

  # lspci -nn | grep QL | grep IOV

  04:0e.0 Ethernet controller [0200]: QLogic Corp. FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) [1077:8090] (rev 0

  * The vendor:device ID is the value in square brackets ([1077:8090])

 

6. Bind the VF to the vfio-pci driver

  echo "1077 8090" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
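
  Optionally verify that the VF (0000:04:0e.0 from step 4) is now claimed by vfio-pci:

  ls /sys/bus/pci/drivers/vfio-pci/
  readlink /sys/bus/pci/devices/0000:04:0e.0/driver   # should point to .../vfio-pci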

 

7. Start QEMU with the guest VM.

  /usr/bin/qemu-system-x86_64  -machine q35,kernel-irqchip=split,accel=kvm -smp 4 -m 2G \

  -device intel-iommu,intremap=on,caching-mode=on -nographic /home/fastlinq/centos-7.8.qcow2 \

  -device vfio-pci,host=04:0e.0

 

 

Guest VM

=======

1. Edit /etc/default/grub and add "iommu=pt intel_iommu=on" at the end of GRUB_CMDLINE_LINUX.

 

e.g.

GRUB_CMDLINE_LINUX="console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=40ff14688-2619-4046-a9eb-b7333fff1b84 console=ttyS0,115200 iommu=pt intel_iommu=on"

 

 

2. Update grub using

a) grub2-mkconfig -o /boot/grub2/grub.cfg (for RedHat/CentOS)

b) update-grub (for Ubuntu)

 

3.reboot

 

4. Check the BDF of the VF inside the VM

# lspci | grep QL

00:03.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) (rev 02)

 

5. modprobe vfio-pci

 

6. Bind the VF to vfio-pci using dpdk-devbind

  usertools/dpdk-devbind.py -b vfio-pci 00:03.0

 

7.Check status

 

[root@centos-8 dpdk-stable-19.11.5]#  usertools/dpdk-devbind.py --status-dev net

Network devices using DPDK-compatible driver

============================================

0000:00:03.0 'FastLinQ QL41000 Series Gigabit Ethernet Controller (SR-IOV VF) 8090' drv=vfio-pci unused=qede

 

Network devices using kernel driver

===================================

0000:00:02.0 '82574L Gigabit Network Connection 10d3' if=enp0s2 drv=e1000e unused=vfio-pci *Active*
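
As an optional sanity check you can start testpmd on the VF. This is only a sketch: the binary path below assumes a DPDK 19.11 "make" build (as used above) and will differ for other build methods.

./x86_64-native-linuxapp-gcc/app/testpmd -l 0-1 -n 4 -- -i
# then at the testpmd> prompt, run "start" followed by "show port stats all"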

Saturday, October 3, 2020

XDP - Getting Started with XDP (Linux)

 

XDP


XDP is an eBPF hook at the driver level (ingress), introduced in Linux 4.8.

It intercepts packets before they reach the network stack, before an sk_buff is even allocated.

Rationale: implement a faster data path that is part of the kernel and maintained by the kernel community, aimed at simple use cases. Packets that need complex processing are forwarded to the stack, so XDP is not a "kernel bypass"; it works in cooperation with the networking stack.

Essentially, user-space networking achieves high performance by moving packet processing out of the kernel into user space.

XDP does the opposite: it moves user-space networking functions (filters, mappers, routing, etc.) into the kernel.

XDP allows us to execute our network function as soon as a packet hits the NIC, before it starts moving up into the
kernel's networking layer, which results in a significant increase in packet-processing speed.

Accelerating-VM-Networking-through-XDP_Jason-Wang.pdf

https://help.netronome.com/support/solutions/articles/36000050009-agilio-ebpf-2-0-6-extended-berkeley-packet-filter

https://www.netronome.com/blog/hello-xdp_drop/

https://archive.fosdem.org/2018/schedule/event/xdp/attachments/slides/2220/export/events/attachments/xdp/slides/2220/fosdem18_SdN_NFV_qmonnet_XDPoffload.pdf


XDP MODES

In total, XDP supports three operation modes which iproute2 implements as well: xdpdrv, xdpoffload and xdpgeneric.

xdpdrv stands for native XDP, meaning the BPF program is run directly in the driver's receive path at the earliest possible point in software.
This is the normal / conventional XDP mode and requires drivers to implement XDP support, which all major 10G/40G/+ networking drivers
in the upstream Linux kernel already provide.

xdpgeneric stands for generic XDP and is intended as an experimental test bed for drivers which do not yet support native XDP.
Given the generic XDP hook in the ingress path comes at a much later point in time when the packet already enters the stack’s
main receive path as a skb, the performance is significantly less than with processing in xdpdrv mode.
xdpgeneric therefore is for the most part only interesting for experimenting, less for production environments.

xdpoffload Last but not least, this mode is implemented by SmartNICs such as those supported by Netronome's nfp driver and
allows offloading the entire BPF/XDP program into hardware, so the program is run on each packet reception directly
on the card. This provides even higher performance than native XDP, although not all BPF map types or BPF helper
functions are available for use compared to native XDP. The BPF verifier will reject the program in such a case and report
to the user what is unsupported. Other than staying within the realm of supported BPF features and helper functions,
no special precautions have to be taken when writing BPF C programs.











A minimal XDP program (xdp.c) that simply drops every packet:

#include <linux/bpf.h>

/* return XDP_DROP for every packet seen at the hook point */
int main()
{
        return XDP_DROP;
}


clang -target bpf -O2 -c xdp.c -o xdp.o


ip -force link set dev ens1f0 xdpdrv obj xdp.o sec .text


ip link show ens1f0
32: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether f4:e9:d4:ed:25:38 brd ff:ff:ff:ff:ff:ff
prog/xdp id 36


0 XDP_ABORTED - Error, Block the packet
1 XDP_DROP - Block the packet
2 XDP_PASS - Allow the packet to continue up the kernel
3 XDP_TX - Bounce the packet back in the direction it came from



$ hping3 [IP Address of Host]
Traffic can be monitored using tcpdump; however, it will show that no packets are received.
This is because XDP drops the packets at the very start of the kernel receive path, before they can reach tcpdump.
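
For example, while the drop program is attached you can leave tcpdump running on the interface and it should stay silent even while hping3 floods the host (interface name as used above):

tcpdump -i ens1f0 -n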

unload xdp
ip link set dev [DEV] xdpdrv off


Load in hardware offload mode:
ip -force link set dev ens1f0 xdpoffload obj xdp.o sec .text


Testing XDP



Steps

Step 1

Check "clang" is installed or not else install it by yum install clang and XDP only supports on RHEL8 and above kernel 


Step 2

Create xdp_drop.c file in "/usr/src/kernels/$(uname -r)/net/xdp" directory 

touch /usr/src/kernels/$(uname -r)/net/xdp/xdp_drop.c


Step 3 

Write xdp_drop code inside xdp_drop.c file



#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
    __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return XDP_DROP;
}

char __license[] __section("license") = "GPL";




Step 4 

Compile this code with the command below to create the object file

clang -O2 -Wall -target bpf -c xdp_drop.c -o xdp_drop.o
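
Optionally inspect the object file to confirm the program landed in the "prog" ELF section (a quick check; llvm-objdump ships with the LLVM/clang toolchain):

llvm-objdump -h xdp_drop.o
# look for sections named "prog" and "license"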


Step 5

Attach xdp_drop.o to both interfaces (PFs) with the commands below

ip link set dev ens3f0 xdp obj xdp_drop.o

ip link set dev ens3f1 xdp obj xdp_drop.o



Step 6

With  "ip link show"  command  check xdp loaded with some id.


[root@Gen9-XDP-Host-RHEL8 xdp]# ip link show

4: ens3f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000

     link/ether 00:0e:1e:d6:62:fc brd ff:ff:ff:ff:ff:ff

     prog/xdp id 1 tag f95672269956c10d jited

 5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000

     link/ether 00:0e:1e:d6:62:fd brd ff:ff:ff:ff:ff:ff

     prog/xdp id 2 tag f95672269956c10d jited


Step 7

Send traffic with the scapy tool from the peer system to both interfaces simultaneously



sendp (Ether(src="00:0e:1e:d6:62:fc",dst="14:02:ec:d3:af:0a")/IP(src="44.44.44.1",dst="55.55.55.1")/TCP(sport=0xbbbb,dport=0xaaaa)/("x"*200), iface="ens3f0",count=1000000)

sendp (Ether(src="00:0e:1e:d6:62:fd",dst="14:02:ec:d3:af:0b")/IP(src="44.44.44.1",dst="55.55.55.1")/TCP(sport=0xbbbb,dport=0xaaaa)/("x"*200), iface="ens3f1",count=1000000)


1. Observe that packets are being dropped and the "xdp_no_pass" counters are increasing. No packets are seen in tcpdump, which suggests that the eXpress Data Path is being used.


[root@Gen9-XDP-Host-RHEL8 xdp]# ethtool -S ens3f0 | grep xdp

      0: xdp_no_pass: 5000

      1: xdp_no_pass: 3731

      2: xdp_no_pass: 5000

      3: xdp_no_pass: 4000

      4: xdp_no_pass: 4609

      5: xdp_no_pass: 5000

      6: xdp_no_pass: 4000

      7: xdp_no_pass: 5000


2. You should not see any unexpected failures in dmesg or /var/log/messages.

3. You should not see any driver/firmware failure messages or a system hang.
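
To detach the program from both interfaces after the test (same interfaces as above):

ip link set dev ens3f0 xdp off
ip link set dev ens3f1 xdp off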




LOADING IN NATIVE MODE

# ip -force link set dev em1 xdpdrv obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
prog/xdp id 1 tag 57cd311f2e27366b
[...]
# ip link set dev em1 xdpdrv off

The option verb can be appended when loading programs in order to dump the verifier log:
# ip -force link set dev em1 xdpdrv obj prog.o verb

LOADING IN GENERIC MODE

# ip -force link set dev em1 xdpgeneric obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
prog/xdp id 4 tag 57cd311f2e27366b <-- BPF program ID 4
[...]
# bpftool prog dump xlated id 4 <-- Dump of instructions running on em1
0: (b7) r0 = 1
1: (95) exit
# ip link set dev em1 xdpgeneric off


XDP Related Config
==================
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_BPF_JIT=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
CONFIG_XDP_SOCKETS=y

$ cd tools/testing/selftests/bpf/
$ make
$ sudo ./test_verifier


Sample code to drop IP traffic from 50.50.50.1

#include "../../include/uapi/linux/bpf.h"

#include "../../include/uapi/linux/if_ether.h"

#include "../../include/uapi/linux/if_packet.h"

#include "../../include/uapi/linux/ip.h"

#include "../../include/uapi/linux/in.h"

#include "../../include/uapi/linux/tcp.h"

#include "../../include/uapi/linux/udp.h"

//#include "bpf_helpers.h"

#ifndef __section

# define __section(NAME)                  \

           __attribute__((section(NAME), used))

#endif__section("prog")

         //https://www.vultr.com/resources/ipv4-converter/?ip_address=50.50.50.1

         //842150401

int xdp_drop(struct xdp_md *ctx)

{

        void *data_end = (void *)(long)ctx->data_end;

        void *data     = (void *)(long)ctx->data;

        struct ethhdr *eth = data;        if (eth + 1 > data_end) {

                return XDP_PASS;

        }        struct iphdr *iph = data + sizeof(struct ethhdr);        if (iph + 1 > data_end) {

                return XDP_PASS;

        }

        unsigned int ip_src = iph->saddr;

        //printf("%ld\n",htonl(842150401));  network byte order conversion for

        //50.50.50.1

        if(ip_src == 20066866)

        {

                return XDP_DROP;

        }        return XDP_PASS;

}
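
To try it out, compile and attach it the same way as the earlier example. This is a sketch only: the source file name is assumed here, and "sec prog" matches the __section name used in the code.

clang -O2 -Wall -target bpf -c xdp_drop_ip.c -o xdp_drop_ip.o
ip -force link set dev ens1f0 xdpdrv obj xdp_drop_ip.o sec prog
# traffic sourced from 50.50.50.1 should now be dropped; everything else passes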



Good Links

https://docs.cilium.io/en/latest/bpf/

https://medium.com/@fntlnz/load-xdp-programs-using-the-ip-iproute2-command-502043898263
https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf


Tuesday, May 26, 2020

Memory Usage of Kernel Driver Module in Linux


A nice tool for this is memstrack.

1. kthread.c (an example kernel module that allocates 50 MB)

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/kthread.h>
#include <linux/delay.h>


static int thread_init(void){

    char *buffer = NULL;
    int i = 0;
    /* allocate 50 x 1 MB buffers; the pointers are deliberately leaked
       so that memstrack can report ~50 MB against this module */
    for (i = 0; i < 50; i++)
    {
        buffer = (char *)kmalloc(1000*1000, GFP_KERNEL);
    }
    if (buffer == NULL)
        printk(KERN_ERR "low memory...");
    else
        printk(KERN_ERR "Allocation succeeded...\n");
    return 0;
}

void thread_exit(void){
        printk(KERN_INFO "done.");
}

module_init(thread_init);
module_exit(thread_exit);
MODULE_LICENSE("GPL");

2. Run memstrack

./memstrack --report module_summary,proc_slab_static --notui -o mem.txt

3. cat mem.txt

======== Report format module_summary: ========
Module kthread using 50.0MB (12800 pages), peak allocation 50.0MB (12800 pages)
Module xfs using 0.1MB (31 pages), peak allocation 0.1MB (31 pages)
Module tg3 using 0.1MB (16 pages), peak allocation 0.1MB (16 pages)
Module sr_mod using 0.0MB (1 pages), peak allocation 0.0MB (1 pages)
Module cdrom using 0.0MB (0 pages), peak allocation 0.0MB (0 pages)
======== Report format module_summary END ========

You can clearly see the 50 MB tracked against the kthread module by the tool.

Saturday, May 16, 2020

Forcing Packet to go through Wire using Two Ports of Same Card or 2 NIC on Single HOST Linux Machine



Src
https://wiki.psuter.ch/doku.php?id=force_local_traffic_through_external_ethernet_cable_by_using_ip_namespaces
https://serverfault.com/questions/127636/force-local-ip-traffic-to-an-external-interface




Every Linux host has a loopback interface (lo). When we ping or send traffic to test a local
interface, it is the loopback interface that replies.

Let's say we have three interfaces on a Linux PC: eth1, eth2 and lo (loopback).
Whatever the IPs of eth1 and eth2 are, you can always ping them, but the packets will never actually go over the wire.

To force packets over the wire we can use either approach:
1. iptables modification
2. Network namespaces (netns)

This blog will use netns, as it is a much cleaner method.

Normally an OS has only one instance of the network stack and its related tables (ARP, routing table, etc.).
With namespaces you logically get a separate copy of all of the above.





ip netns add ns_server
ip netns add ns_client


ip link set ens1f0 netns ns_server
ip netns exec ns_server ip addr add dev ens1f0 192.168.1.1/24
ip netns exec ns_server ip link set dev ens1f0 up

ip link set ens1f1 netns ns_client
ip netns exec ns_client ip addr add dev ens1f1 192.168.1.2/24
ip netns exec ns_client ip link set dev ens1f1 up


ip netns exec ns_server iperf -s -B 192.168.1.1
ip netns exec ns_client iperf -c 192.168.1.1 -B 192.168.1.2
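
Before running iperf you can confirm connectivity; this ping now has to traverse the physical cable between the two ports:

ip netns exec ns_client ping -c 3 192.168.1.1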






ethtool shows the actual hardware stats (don't rely on ifconfig/ip command output; those are kernel stats).

root@hp-p70:/home/fastlinq# ip netns exec ns_server ethtool -S ens1f0 | grep rcv
           rcv_pkts: 187171024
root@hp-p70:/home/fastlinq# ip netns exec ns_server ethtool -S ens1f0 | grep xmit
           xmit_pkts: 98899174
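
When you are done, deleting the namespaces returns the interfaces to the default namespace:

ip netns del ns_server
ip netns del ns_client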

Sunday, April 5, 2020

Setting UP Horovod for distributed Training on 2 Hosts




https://github.com/horovod/horovod/blob/master/docs/docker.rst

Prerequisites

1. Passwordless SSH access to all machines
http://www.linuxproblem.org/art_9.html
The master has to add its public key to the authorized_keys file on every host.

e.g. setup using 2 hosts, each running a Docker container

(Host A/DOCKER A) ------------------------------------------------> (Host B/DOCKER B)
(136.18.225.72 INDFCQ4RG2-l / 1 GPU)         (136.18.225.116 antpc-MS-7A94 / 2 GPUs)

a) Generate a public key inside Docker A and place it in the authorized_keys of Host A, Host B and Docker B.
b) Generate a public key inside Host A and place it in the authorized_keys of Docker A, Host B and Docker B.
c) Generate a public key inside Docker B and place it in the authorized_keys of Host A, Host B and Docker A.
d) Generate a public key inside Host B and place it in the authorized_keys of Docker B, Host A and Docker A.


2. The connected links must have the same interface name on the primary worker and all secondary workers,
e.g. the interface should be named enp0s31f6 on every involved host.
You can rename an interface with the following commands (put them in bashrc if you want the name to persist across reboots):
sudo /sbin/ip link set <old-name> down
sudo /sbin/ip link set <old-name> name <new-name>
sudo /sbin/ip link set <new-name> up

Note: this is no longer a limitation with the following pull request:
https://github.com/horovod/horovod/issues/1724#issuecomment-603613522
https://github.com/horovod/horovod/pull/1808

3. Enter the hostnames in /etc/hosts on all machines/containers
136.18.225.72 INDFCQ4RG2-l (primary worker)
136.18.225.116 antpc-MS-7A94 (secondary worker, this can be multiple machines)



Setting up


( horovod/horovod                0.19.0-tf2.0.0-torch1.3.0-mxnet1.5.0-py3.6-gpu)

A)
Primary Worker

1. Run docker
nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest


2.Test Sample App
  horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:2 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py

B) Secondary Worker

nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
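
Before launching horovodrun, you can verify from inside the primary worker's container that SSH on the non-standard port of the secondary worker works (hostnames and port as configured above):

ssh -p 12345 root@antpc-MS-7A94 hostname
# should print the hostname without prompting for a password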


sample run

root@INDFCQ4RG2-l:/examples# horovodrun -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 --start-timeout=360 --verbose -p 12345 python tensorflow2_mnist.py
Filtering local host names.
Remote host found: antpc-MS-7A94
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovodrun server.
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
('127.0.0.1', 26576)
('136.18.225.116', 55673)
Notified all the hosts that the registration is complete.
Waiting for hosts to perform host-to-host interface checking.
Host-to-host interface checking successful.
Interfaces on all the hosts were successfully checked.
Common interface found: enp0s31f6
Checking whether extension tensorflow was built with MPI.
Extension tensorflow was built with MPI.
mpirun --allow-run-as-root --tag-output -np 2 -H INDFCQ4RG2-l:1,antpc-MS-7A94:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_include enp0s31f6 -x NCCL_SOCKET_IFNAME=enp0s31f6  -x CUDA_PKG_VERSION -x CUDA_VERSION -x CUDNN_VERSION -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOSTNAME -x LD_LIBRARY_PATH -x LIBRARY_PATH -x LS_COLORS -x MXNET_VERSION -x NCCL_VERSION -x NVIDIA_DRIVER_CAPABILITIES -x NVIDIA_REQUIRE_CUDA -x NVIDIA_VISIBLE_DEVICES -x PATH -x PWD -x PYTHON_VERSION -x PYTORCH_VERSION -x SHLVL -x TENSORFLOW_VERSION -x TERM -x TORCHVISION_VERSION -x _  python tensorflow2_mnist.py
[1,0]:2020-03-20 10:43:54.841973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.911264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[1,1]:2020-03-20 10:45:59.920605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:45:59.921154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,1]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,1]:pciBusID: 0000:02:00.0
[1,1]:2020-03-20 10:45:59.921189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:45:59.922238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:45:59.923157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:45:59.923417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:45:59.924681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:45:59.925597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:45:59.928283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:45:59.930282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
[1,0]:2020-03-20 10:43:54.895746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:54.896396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
[1,0]:name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
[1,0]:pciBusID: 0000:02:00.0
[1,0]:2020-03-20 10:43:54.896450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:54.898509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:54.900181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:54.900609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:54.903080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:54.906525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:54.912235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:54.915803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1731] Ignoring visible gpu device (device: 1, name: Quadro M2000, pci bus id: 0000:02:00.0, compute capability: 5.2) with core count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
[1,0]:2020-03-20 10:43:54.915841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.242715: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[1,1]:2020-03-20 10:46:00.271269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
[1,1]:2020-03-20 10:46:00.272378: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x440d420 executing computations on platform Host. Devices:
[1,1]:2020-03-20 10:46:00.272396: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[1,1]:2020-03-20 10:46:00.444966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x446f770 executing computations on platform CUDA. Devices:
[1,1]:2020-03-20 10:46:00.444995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[1,1]:2020-03-20 10:46:00.445805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,1]:name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
[1,1]:pciBusID: 0000:b3:00.0
[1,1]:2020-03-20 10:46:00.445842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.445867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,1]:2020-03-20 10:46:00.445880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,1]:2020-03-20 10:46:00.445892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,1]:2020-03-20 10:46:00.445903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,1]:2020-03-20 10:46:00.445915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,1]:2020-03-20 10:46:00.445929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:00.447077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,1]:2020-03-20 10:46:00.447111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,1]:2020-03-20 10:46:00.504039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]:2020-03-20 10:46:00.504074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
[1,1]:2020-03-20 10:46:00.504080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
[1,1]:2020-03-20 10:46:00.505187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.160047: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[1,0]:2020-03-20 10:43:56.198625: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394525000 Hz
[1,0]:2020-03-20 10:43:56.201578: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f2c430 executing computations on platform Host. Devices:
[1,0]:2020-03-20 10:43:56.201618: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[1,0]:2020-03-20 10:43:56.348911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41fad60 executing computations on platform CUDA. Devices:
[1,0]:2020-03-20 10:43:56.348964: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
[1,0]:2020-03-20 10:43:56.350439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[1,0]:name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
[1,0]:pciBusID: 0000:03:00.0
[1,0]:2020-03-20 10:43:56.350531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.350591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:43:56.350629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
[1,0]:2020-03-20 10:43:56.350664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
[1,0]:2020-03-20 10:43:56.350695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
[1,0]:2020-03-20 10:43:56.350732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
[1,0]:2020-03-20 10:43:56.350769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,0]:2020-03-20 10:43:56.353401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[1,0]:2020-03-20 10:43:56.353472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[1,0]:2020-03-20 10:43:56.457968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]:2020-03-20 10:43:56.458023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
[1,0]:2020-03-20 10:43:56.458032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
[1,0]:2020-03-20 10:43:56.464616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7594 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
[1,0]:2020-03-20 10:43:56.468105: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 376320000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.434876: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:43:57.605212: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 188160000 exceeds 10% of system memory.
[1,0]:2020-03-20 10:44:02.771952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[1,0]:2020-03-20 10:44:03.001307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-03-20 10:46:10.693376: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 358.89MiB (rounded to 376320000).  Current allocation summary follows.
[1,1]:2020-03-20 10:46:10.693435: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693672: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693684: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
[1,1]:2020-03-20 10:46:10.693699: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 358.89MiB was 256.00MiB, Chunk State:
[1,1]:2020-03-20 10:46:10.693710: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:
[1,1]:2020-03-20 10:46:10.693721: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 0B
[1,1]:2020-03-20 10:46:10.693733: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 0 memory_limit_: 139591680 available bytes: 139591680 curr_region_allocation_bytes_: 1048576
[1,1]:2020-03-20 10:46:10.693748: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
[1,1]:Limit:                   139591680
[1,1]:InUse:                           0
[1,1]:MaxInUse:                        0
[1,1]:NumAllocs:                       0
[1,1]:MaxAllocSize:                    0
[1,1]:
[1,1]:2020-03-20 10:46:10.693764: W tensorflow/core/common_runtime/bfc_allocator.cc:424]
....
.....
.....
.....
......
.....
....
....
stdout>:Step #230    Loss: 0.049809
[1,1]:Step #240    Loss: 0.114416
[1,0]:Step #240    Loss: 0.118383
[1,1]:Step #250    Loss: 0.072860
[1,0]:Step #250    Loss: 0.075680
[1,0]:Step #260    Loss: 0.077773
[1,1]:Step #260    Loss: 0.256634
[1,1]:Step #270    Loss: 0.052928
[1,0]:Step #270    Loss: 0.055714
[1,0]:Step #280    Loss: 0.065934
[1,1]:Step #280    Loss: 0.129994


