RDMA Verb Programming FAQs
(Compiled from RDMAMojo.com)
What will happen if I destroy an AH while there are still outstanding Send Requests that use it?
- Doing this may lead to a Work Completion with error, or to a segmentation fault.
I want to destroy an AH, but there are still outstanding Send Requests that use it. What can I do?
- Wait until the outstanding Send Requests are completed, or
- Flush the Send Queue that those Send Requests were posted to (for example, by moving the QP that they were posted to into the IBV_QPS_ERR state).
What is OFED?
- The OpenFabrics Enterprise Distribution (OFED) is a package for Linux that includes all of the software (libraries, utilities and drivers) needed to work with RDMA-enabled protocols (InfiniBand and iWARP).
Given that we want to transfer 100 pages (contiguous or not) through RDMA Read, which way is more efficient: 100 WRs with only one SGE each, or 10 WRs with 10 SGEs each?
- IMHO, 10 WRs with 10 SGEs each will be more efficient,
- since the overhead of checking the Send Request attributes that are not related to the SGEs is reduced.
- For example: checking that the QP exists, checking whether the Work Queue is full, etc.
Can RDMA be used inside the kernel, or in a kernel module?
- Yes, RDMA can be used at kernel level.
- IPoIB is an example of such a module.
Why is RDMA Write/Read better than Send/Recv?
- With Send/Recv: the data travels over the network, and when it reaches the remote side,
- a Receive Request is fetched and the device scatters/writes the data to the buffers
- described by that request's S/G list.
- With RDMA Write: when the data reaches the remote side,
- the place where it will be written is already known (the remote address is known to the sender),
- and the data is written to a contiguous memory block (no extra Receive Request fetch is required).
- So, RDMA Write is better than Send/Recv because:
- * No extra Work Request fetch is performed on the remote side
- * The write on the remote side is to one contiguous memory block
What does ibv_open_device() do?
- It doesn't actually open the device: the device was already opened by the kernel low-level driver, and may be used by other user- or kernel-level code. This verb only opens a context to allow user-level applications to use the device.
I called ibv_get_device_list() and it returned NULL, what does it mean?
- This is a basic verb that shouldn't fail; check whether the module ib_uverbs is loaded.
I called ibv_get_device_list() and it didn't find any RDMA device at all (empty list), what does it mean?
- The driver couldn't find any RDMA device.
- - Check with lspci whether you have any RDMA device in your machine
- - Check with lsmod whether the low-level driver for your RDMA device is loaded
- - Check dmesg and /var/log/messages for errors
When should you post Receive WRs?
- In the INIT state, because if you move the QP state to RTR and somehow delay the posting of Receive WRs,
- then when data arrives you will get an RNR (Receiver Not Ready) error.
Explaining the different QP states
- In a QP's lifetime, the possible states are: Reset, Init, RTR (Ready To Receive), RTS (Ready To Send), SQD, SQE and Error.
- A QP can be transitioned from one state to another in two possible ways:
- 1. An explicit call to ibv_modify_qp()
- 2. An automatic transition by the device in case of a processing error
- Reset: a QP is created in the Reset state. Creating a QP takes some time (there are context switches, memory allocations for the work queue buffers, QP number allocation, etc.),
- so repeatedly creating, using and destroying QPs consumes a lot of time.
- If one needs a new QP in the fast path, a better way is to create QPs in advance and then:
- - Use a QP when needed (modify it to RTS and send/receive data)
- - Modify the QP back to Reset for later reuse
- Moving the QP to the Reset state prevents the QP from sending or receiving packets.
- Init: the state a QP is moved to after creation. Receive Requests can already be posted here; those Work Requests won't be processed until the QP is transitioned to the RTR state.
- RTR: in this state, the QP handles incoming packets. If for any reason (for example: the OS scheduler didn't give time to that process) no Work Requests were posted to the receive queue of that QP after the transition to RTR,
- a Receiver Not Ready (RNR) error may occur at the requester of those packets.
- To prevent this from happening, one can post Receive Requests to the QP while it is still in the Init state.
- RTS: in most applications, the QPs are transitioned to the RTS state. In this state, the QP can send packets as a requester and also handle incoming packets.
- Error: a QP can be moved to the Error state in two ways:
- 1. Automatically, by the RDMA device, in case of a completion with error
- 2. Explicitly, from any other state, using ibv_modify_qp()
- If there was a processing error, the first completion in the Completion Queue of the queue (Send or Receive Queue) that got the error will hold a status indicating the reason for the error.
- The rest of the Work Requests in that queue, and in the other queue, will be flushed with error.
- If the QP was transitioned to this state using ibv_modify_qp(), all outstanding Work Requests in both the send and receive queues will be flushed with error.
In which state do you post Send Requests, and in which Receive Requests?
- Receive Requests: from the INIT state onward (post them in INIT, then move the QP to RTR to receive data).
- Send Requests: in the RTS state. (Actually, in RTS Work Requests can be posted to both the send and receive queues, and Work Requests in both queues will be processed.)
What is the Queue Key (Q_Key)?
- Since in RDMA any UD QP can send/receive a packet to/from any other UD QP in the subnet, there is a need for a protection mechanism that allows a UD QP to specify from which other UD QPs it is willing to receive packets. This mechanism is the Q_Key.
- The Q_Key is a 32-bit value used in UD QPs to validate that a remote sender has the right to access a local Receive Queue.
What are P_Key and Q_Key?
- P_Keys are configured and enforced by the Subnet Manager (a partition is similar to a VLAN);
- a packet may be dropped by a switch if its sender isn't part of that partition.
- The Q_Key works at the Queue Pair level and is relevant only for Unreliable Datagram QPs
- (which can get packets from any other QP in the subnet). The software configures its value.
- So, the P_Key prevents packets from reaching a QP that isn't a member of that partition at all,
- while the Q_Key blocks QPs (in the same partition) that don't have the right Q_Key value.
I'm trying to do an RDMA Write of 5 GB of memory, but I see only 1 GB arriving in the remote buffer?
- In general, RDMA (the protocol itself) supports up to 2 GB in one message.
- RDMA devices may have a lower limit (check max_msg_sz in the port attributes).
- If you need to send more data than the maximum supported value (1 GB in your example),
- you can use several RDMA Writes to send the local (big) buffer to the remote buffer.
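The splitting logic can be sketched in plain C. The actual posting of each RDMA Write is only indicated by a comment, since it needs a connected QP and registered MRs; the chunk arithmetic itself is shown in full.

```c
#include <stdint.h>

/* Split a transfer of `total` bytes into chunks of at most `max_msg`
 * bytes (the device's maximum message size). Records each chunk length
 * in `lens` and returns the number of RDMA Writes needed. */
static unsigned split_into_writes(uint64_t total, uint64_t max_msg,
                                  uint64_t *lens, unsigned cap)
{
    unsigned n = 0;
    uint64_t off = 0;

    while (off < total && n < cap) {
        uint64_t len = total - off;
        if (len > max_msg)
            len = max_msg;
        /* a real application would post one RDMA Write here:
         * local buf + off  ->  remote_addr + off, length len */
        lens[n++] = len;
        off += len;
    }
    return n;
}
```

For the 5 GB example with a 1 GB device limit, this yields five 1 GB RDMA Writes, each targeting the matching offset in the remote buffer.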
What is the meaning of a scatter/gather entry with zero bytes?
- In the example below, a single scatter/gather entry (.num_sge = 1) with .length = 0 is used to transfer 2 GB:
- .addr = (uintptr_t) ctx->buf + 40,
- .length = 0, /* instead of ctx->size */
- The reason is that 0 is actually 2 GB modulo 2 GB:
- since the maximum size of a message in RDMA is 2 GB, taking a scatter/gather entry length modulo 2 GB maps 2 GB to 0.
What are zero-byte messages good for?
- Note that this is different from the zero-length scatter/gather entry above (.num_sge = 1 with .length = 0):
- RDMA supports zero-byte messages, and this is done by posting a Send Request without a scatter/gather list at all (i.e., a list with zero entries, .num_sge = 0).
- Zero-byte messages can be useful in the following scenarios:
- * When only the immediate data is used, e.g. to mark a directive or a status update.
- * For keep-alive messages on a reliable QP.
So what's the alternative method of exchanging the QP parameters that doesn't use TCP/IP?
- 1) Use the Communication Manager (CM), which exchanges the information over a well-known QP (see the next question)
- 2) Use multicast groups to learn about new members
How to retrieve information from the remote side to establish a connection? What is QP1?
- In order to establish a connection, the two sides need to exchange information.
- In RDMA there are two options for establishing a connection between two sides:
- changing the QP states explicitly in the application by calling ibv_modify_qp(), or using the Communication Manager (CM).
- Each side cannot send the information about a QP over that QP itself, since in order to send/receive data the QP already needs this information (a chicken-and-egg problem).
- So in RDMA, the connection is established using the Communication Manager (CM),
- which uses a well-known QP (QP1) to exchange the needed information.
How to increase the memory-pinning limit for a non-root user?
- RDMA needs to work with pinned memory, i.e., memory which cannot be swapped out by the kernel. By default, every process running as a non-root user is allowed to pin only a small amount of memory (64 KB).
- (ulimit -l prints 64 in that case; it should be unlimited.)
- In order to work properly as a non-root user, it is highly recommended to increase the amount of memory which can be locked. Edit the file /etc/security/limits.conf and add the following lines:
- * soft memlock unlimited
- * hard memlock unlimited
- This will allow a process running as any user to pin an unlimited amount of memory. The change becomes effective for new login sessions.
Why and how to flush a Work Queue?
- When one wishes to stop outstanding Work Requests from being processed, flushing the Work Queues may be useful.
- The most common reason for doing this is to reclaim the memory buffers that the Work Requests refer to.
- In order to flush the Send Queue, one should call ibv_modify_qp() and move the QP to the Error state.
- Flushing a Receive Queue (SRQ):
- In order to flush the Receive Queue of a QP that is associated with an SRQ, one should call ibv_modify_qp() and move the QP to the Error state.
- The SRQ itself cannot be flushed, and the Receive Requests posted to it cannot be reclaimed. One possible workaround is to associate a QP with this SRQ and consume all of the WRs from the SRQ by sending it messages with an opcode that consumes a Receive Request.
How do I know when all the messages have been flushed? Will some event be generated for it?
- 1) No. There won't be any special event specifying that all messages have been flushed;
- you simply need to count the Work Completions and figure it out yourself.
- 2) Flushing of incomplete messages stops if the QP state is moved to Reset,
- so the suggestion is to move the QP to Error, wait until all messages are flushed, and only then move it to Reset.
What is a CQ? How is it created?
- ibv_create_cq() creates a Completion Queue (CQ) for an RDMA device context.
- When an outstanding Work Request in a Send or Receive Queue is completed, a Work Completion is added to the CQ of that Work Queue.
- This Work Completion indicates that the outstanding Work Request has been completed (and is no longer considered outstanding) and provides details about it (status, direction, opcode, etc.).
- A single CQ can be shared by the Send and Receive Queues of a QP, and across multiple QPs.
What is a CQ good for anyway?
- A CQ is used to hold a Work Completion for any Work Request that was completed and should produce a Work Completion, and provides details about it.
Can I use different CQs for the Send/Receive Queues of the same QP?
- Yes. In any QP, the CQ of the Send Queue and the CQ of the Receive Queue may be the same or different. This is flexible and up to the user to decide.
Can several QPs be associated with the same CQ?
- Yes. Several QPs can be associated with the same CQ, in their Send Queues, their Receive Queues, or both.
What should the CQ size be?
- A CQ should have enough room to hold the Work Completions of all the outstanding Work Requests of the queues associated with that CQ, so the CQ size shouldn't be less than the total number of Work Requests that may be outstanding.
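As a sketch, for a group of identical QPs sharing one CQ for both their queues, a safe lower bound (assuming every Work Request is signaled and may be outstanding at once) is:

```c
#include <stdint.h>

/* Lower bound for the size of a CQ shared by both queues of num_qps
 * identical QPs: it must be able to hold a Work Completion for every
 * Work Request that may be outstanding at the same time. */
static uint32_t min_cq_size(uint32_t num_qps,
                            uint32_t max_send_wr, uint32_t max_recv_wr)
{
    return num_qps * (max_send_wr + max_recv_wr);
}
```

For example, 4 QPs with 128-deep send and receive queues need a CQ of at least 1024 entries; selective signaling reduces the send-side contribution accordingly.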
What will happen if the CQ size I choose is too small?
- If the CQ is full and a new Work Completion is about to be added to it, a CQ overrun will occur: a completion with error will be added to the CQ, and the asynchronous event IBV_EVENT_CQ_ERR will be generated.
ibv_create_cq() fails when the CQ size is 1024, but succeeds when it is 512, although the device limit is much larger than 1024. Memory registration also fails for a 20 MB block. Any idea why this is happening?
- Check ulimit -l; it should be unlimited. If it isn't, raise the locked-memory limit (see the memlock question above) or run with root access.
Can two HCAs share the same Completion Queue?
- No. Two different RDMA devices cannot share any RDMA resource.
How to map PCI memory for RDMA?
- mmap() the PCI memory region, and then use the peer-memory API to enable RDMA Writes to this PCI memory.
What is an MR good for anyway?
- MR registration is a process in which the RDMA device takes a memory buffer and prepares it for local and/or remote access.
How to register an MR?
- ibv_reg_mr() registers a Memory Region (MR) associated with a Protection Domain. Doing so allows the RDMA device to read and write data in this memory.
- Performing this registration takes some time, so memory registration isn't recommended in the data path, where a fast response is required.
- Every successful registration results in an MR with lkey and rkey values that are unique within a specific RDMA device.
What is the total size of memory that can be registered?
- There isn't any way to know the total size of memory that can be registered. Theoretically, there isn't any limit to this value.
- However, if one wishes to register a huge amount of memory (hundreds of GB), the default values of the low-level drivers may not be enough;
- look at the "Device Specific" section to learn how to change the default parameter values in order to solve this issue.
What kind of memory can be registered with the MR registration call?
- Every memory address in the virtual space of the calling process can be registered, including, but not limited to:
- - Local memory (either a variable or an array)
- - Global memory (either a variable or an array)
- - Dynamically allocated memory (using malloc() or mmap())
- - Addresses from the text segment
What will happen if I release the memory buffer associated with an MR before deregistering it?
- Doing this may lead to a segmentation fault.
What will happen if I use the keys (lkey or rkey) associated with an MR after I deregistered it?
- Doing this will lead to a Work Completion with error, since those keys are invalid. One should make sure that there aren't any local Work Requests
- or remote operation requests that use those keys before deregistering the MR.
Can I use a memory block in RDMA without this registration?
- Basically, no. However, there are RDMA devices that have the ability to read memory without memory registration (inline data send).
Do internal objects like QPs, CQs and related handles use pinned (locked) memory?
- Yes. The internal queues that require space (such as the QP, CQ and SRQ) use pinned memory.
When an RDMA Read/Write operation is performed via ibv_post_send(), will the hardware use the DLID in the associated QP and the rkey in wr.rdma.rkey to locate which remote MR to read from or write to? Is that correct?
- When performing an RDMA Write or Read, the DLID and remote QP number are taken from the (local) QP context,
- and the remote RDMA device uses the rkey (that was posted in the SR) to determine which MR to use.
Is it better to register the entire buffer only once (at the beginning) and use the same registered memory key for all send (receive) operations? Or is it better to register a new Memory Region for each part of the buffer before a send (receive)?
- I would suggest using one big Memory Region and using different parts of it
- on demand (managing it is easy, and you will get many cache hits).
Where are the Work Queues created?
- There are adapters whose Work Queues are onboard, but more and more adapters now use host memory
- (lower cost, no need for different adapters with different amounts of memory, etc.). In those adapters there will be an extra PCI access.
The receive side has posted two Receive Work Requests with n bytes' worth of buffer each, so the receiver has 2n bytes of buffer available in total. Now the sender issues one Send Work Request with 2n bytes of data. Can the receiver use two Receive Work Requests to satisfy one Send Work Request?
- No. Receive Requests work at the resolution of messages, not bytes.
- Every Receive Request will handle only one incoming message:
- for each incoming message, one Receive Request is fetched from the head
- of the Receive Queue, and messages are handled in the order of their arrival.
- In the example above, the transfer will fail, since each posted Receive Request has only n bytes of buffer, but the incoming message is 2n bytes.
What happens if the local node calls ibv_post_send() with opcode IBV_WR_SEND before the remote node calls ibv_post_recv()?
- If a message that consumes a Receive Request is received by a Queue Pair when there isn't any available Receive Request in that queue,
- an RNR (Receiver Not Ready) flow will start for reliable QPs. For unreliable QPs, the incoming message will be (silently) dropped.
What is RNR and how to tackle it?
- When the responder hasn't posted a Receive Request, the requester will get an RNR (Receiver Not Ready) error.
- When you get an RNR error, your local (requester) QP moves to the Error state, so you can't post another Send Request without reconnecting it with the remote QP.
- To avoid RNR errors:
- * You can increase the RNR timeout
- * You can increase the RNR retry count (the value 7 means infinite retries)
- * If you have several QPs at the receiver side, you can use an SRQ and make sure that the SRQ is never empty
- (the SRQ limit mechanism can help you detect when the number of Receive Requests drops below a specific watermark).
How does ibv_post_send() work?
- ibv_post_send() posts a linked list of Work Requests (WRs) to the Send Queue of a Queue Pair (QP). It goes over all of the entries in the linked list, one by one, checks that each is valid, generates a HW-specific Send Request out of it, and adds it to the tail of the QP's Send Queue, all without performing any context switch.
- The RDMA device will handle the Send Requests (later) in an asynchronous way. If one of the WRs fails, because the Send Queue is full or one of the attributes in the WR is bad, processing stops immediately and a pointer to that WR is returned (this is called the bad WR).
- Send Requests can be posted when the QP is in the RTS state.
Does ibv_post_send() cause a context switch?
- No. Posting an SR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).
How many WRs can I post?
- There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.
If the remote side isn't aware that RDMA operations are being performed in its memory, isn't this a security hole?
- Actually, no. For several reasons:
- - In order to allow incoming RDMA operations to a QP, the QP must be configured to enable remote operations
- - In order to allow incoming RDMA access to an MR, the MR must be registered with those remote permissions enabled
- - The side that performs the access must know the rkey and the memory addresses in order to be able to access the remote memory
What is the difference between inline data and immediate data?
- Using immediate data means that out-of-band data is sent from the local QP to the remote QP: for a SEND opcode, this data appears in the remote Work Completion; for an RDMA WRITE with immediate opcode, a WR is also consumed from the remote QP's Receive Queue. Inline data influences only the way the RDMA device fetches the data to be sent; the remote side isn't aware that the WR was sent inline.
I called ibv_post_send() and got a segmentation fault, what happened?
- There may be several reasons for this to happen:
- 1) At least one of the sg_list entries is at an invalid address
- 2) In one of the posted SRs, IBV_SEND_INLINE is set in send_flags, but one of the buffers in sg_list points to an illegal address
- 3) The value of next points to an invalid address
- 4) An error occurred in one of the posted SRs (a bad value in the SR, or a full Work Queue) and the variable bad_wr is NULL
- 5) A UD QP is used and wr.ud.ah points to an invalid address
I've posted a Send Request and it wasn't completed with a corresponding Work Completion. What happened?
- In order to debug this kind of problem, one should do the following:
- - Verify that a Send Request was actually posted
- - Wait enough time; maybe a Work Completion will eventually be generated
- - Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
- - Verify that the QP state is RTS
- - For an RC QP, check the timeout and retry count values
Does the control path involve RDMA? Where does kernel bypass happen?
- "Kernel bypass" means that in the data path, your user-level code is able
- to work directly with the HW (without performing a context switch).
- But the kernel must be involved in the control path in order to
- synchronize the resources (between different processes/modules) and configure the HW,
- since a user-level application can't write directly to the device memory space (this is a privileged operation).
What is the best approach to implement a client/server application where the client wants to send large data to the server (if we don't want to use RDMA Send/Recv operations)?
- If we don't want to use the two-sided Send/Recv operations, we can use (one-sided) RDMA Write:
- the server allocates blocks and advertises their attributes to the client, and the client initiates the RDMA Write(s).
- We can use RDMA Write with Immediate Data:
- 1. For the last message, to mark the end of the transfer
- 2. Also for sending keep-alive messages in between, so the server knows how many
- more messages it can expect.
If we want to send a huge message via ibv_post_send() that requires more than one Work Request (we will use a Send Work Request list), for example a list that contains 2 Work Requests (sendwr0, sendwr1):
1) Do I need to assign them the same Work Request ID, because they basically represent the same message?
2) About the send flags: do I only need to set the signaled flag on the last request (in the case above, sendwr1)?
- The RDMA stack doesn't know (or care) that you used two Send Requests for one application message
- (from the RDMA stack's point of view, you have two different messages).
- 1) No, you don't *need* to do it, but you *can* do it.
- wr_id is an attribute for the application to use (or not use).
- If your application needs to know that the two Work Completions belong to the same message, you can use it as a hint.
- 2) Yes, you can set the SIGNALED flag only on the second Send Request and get one Work Completion if everything goes fine.
If I have a very large amount of data (divided into multiple chunks) to send out, there are two possible ways of doing it:
1. Using one Work Request with multiple SGEs
2. Multiple Work Requests, posted with multiple post-send calls
Which one is better?
- You can use one Send Request with a gather list;
- note that the best solution depends on the total message size:
- * If it is small (~ < 1 KB), I think that the first option is the best.
- * If the total message size is big, the second approach will give you the best performance. I suggest using selective signaling, creating a Work Completion only for the last Send Request.
Do RC QPs guarantee the ordering of RDMA_WRITE WRs? For example, if an "initiator" issues 2 consecutive RDMA Writes:
- * From the network point of view, the first message will reach the destination before the second one.
- * The memory will be DMA'ed (by the RDMA device) according to the message ordering.
Is there any limit on the maximal message size posted using ibv_post_send()? Say 16 MB, 32 MB, 64 MB, 128 MB?
- The maximal message size can be found in the port properties: max_msg_sz (in general, RDMA supports up to 2 GB messages).
- Posting bigger messages will end with a completion with error.
I was wondering what the behavior of an RDMA Read of remote memory is if the remote machine is also writing to it concurrently?
- The RDMA Read and the concurrent local write are not atomic with respect to each other, so you may get garbage (a mix of old and new data).
- If you want to guarantee atomicity, you must use the Atomic operations.
When I use ibv_post_send() to transfer one large message (200 KB) using one Work Request in UD mode, I get IBV_WC_LOC_LEN_ERR.
- A UD QP doesn't support messages bigger than the path MTU:
- this value is in the range 256-4096 bytes (depending on your subnet).
- Options:
- 1. The application splits the (big) message into smaller messages,
- 2. using multiple Work Requests,
- 3. or uses a different QP transport type (e.g., RC, which supports up to 2 GB).
If I want to use ibv_post_send(): since we already have IBV_WR_SEND, why do we need IBV_WR_RDMA_WRITE? Is there any performance difference between these two approaches?
- Yes. There is a performance difference:
- * A Send operation will consume a Receive Request on the remote side
- * An RDMA Write operation won't, and the PCI read needed to fetch that Receive Request is avoided (better latency)
How to create an SRQ?
- ibv_create_srq() creates a Shared Receive Queue (SRQ) associated with a Protection Domain.
- This SRQ can later be used when calling ibv_create_qp() to indicate that the Receive Queue of that Queue Pair is shared with other Queue Pairs.
What is an SRQ good for anyway? / Why use an SRQ?
- An SRQ, as its name states, is a Shared Receive Queue. Using SRQs decreases the total number of posted Receive Requests, compared to using Queue Pairs each with its own Receive Queue.
- It can also be used to reduce RNR errors: if you have several QPs at the receiver side, you can use an SRQ and make sure that the SRQ is never empty
- (the SRQ limit mechanism can help you detect when the number of Receive Requests drops below a specific watermark).
Can I use one SRQ with different QPs?
- Yes, you can.
Can I use one SRQ with QPs of different transport types?
- Yes, you can.
How do I use srq_limit?
- If an SRQ was armed with a value, then when the number of RRs in that SRQ drops below that value, the affiliated asynchronous event IBV_EVENT_SRQ_LIMIT_REACHED will be generated.
Which attributes should be used for an SRQ?
The number of Work Requests should be enough to hold Receive Requests for all of the incoming Messages in all of the QPs that are associated with this SRQ.
The number of scatter/gather elements should be the maximum number of scatter/gather elements in any Work Request that will be posted to this SRQ.
How to post a Receive Work Request to an SRQ?
- If the QP is associated with a Shared Receive Queue (SRQ), one must call ibv_post_srq_recv(), and not ibv_post_recv(), since the QP's own Receive Queue will not be used.
Must the Work Requests be consumed in order, or can they be consumed in a different order?
- The Receive Requests in a Receive Queue are consumed in the order in which they were posted.
- However, when you have an SRQ, you cannot predict which Receive Request will be consumed by which QP,
- so all of the Receive Requests in that SRQ should be able to contain the incoming message (in terms of length).
How does ibv_post_srq_recv() work?
- ibv_post_srq_recv() posts a linked list of Work Requests (WRs) to a Shared Receive Queue (SRQ).
- The RDMA device will take one of those Work Requests as soon as an incoming message to a QP that is associated with this SRQ
- consumes a Receive Request (RR).
How to handle the unpredictable Receive Request consumption in an SRQ?
- One cannot control or predict in advance which WR will be fetched from the SRQ by which QP,
- so it is highly advised that all of the WRs in the SRQ be able to hold the maximum message that any QP may receive.
Can I know which QP will fetch a specific WR from the SRQ?
- No, you can't. This is the reason that all of the WRs in the SRQ should be able to hold the maximum message that any of the QPs associated with the SRQ may receive.
How does ibv_post_recv() work?
- ibv_post_recv() posts a linked list of Work Requests (WRs) to the Receive Queue of a Queue Pair (QP).
- The RDMA device will take one of those Work Requests as soon as an incoming message to that QP
- consumes a Receive Request (RR).
- If the QP is in the RTR, RTS, SQD or SQE state, Receive Requests can be posted and they will be processed.
Why is the data in a UD QP at an offset of 40 bytes?
- If an RR is posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list.
- If no GRH is present in the incoming message, then those first 40 bytes will be undefined.
- This means that in all cases, the actual data of the incoming message starts at an offset of 40 bytes into the buffer(s) in the scatter list.
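The buffer layout above can be sketched in plain C (the helper names are illustrative; posting the SGE to the UD QP itself happens elsewhere):

```c
#include <stdlib.h>

enum { GRH_BYTES = 40 };  /* size of the Global Routing Header */

/* Allocate a receive buffer for a UD QP: the SGE posted for it must
 * cover msg_size + 40 bytes, since the first 40 bytes hold the GRH
 * (or are undefined when no GRH is present). */
static char *alloc_ud_recv_buf(size_t msg_size, size_t *sge_len)
{
    *sge_len = msg_size + GRH_BYTES;
    return malloc(*sge_len);
}

/* The payload of an incoming message always starts at offset 40. */
static char *ud_payload(char *recv_buf)
{
    return recv_buf + GRH_BYTES;
}
```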
Which operations will consume RRs?
- If the remote side posts a Send Request with one of the following opcodes, an RR will be consumed:
- - Send
- - Send with Immediate
- - RDMA Write with Immediate
I called ibv_post_recv() and got a segmentation fault, what happened?
- There may be several reasons for this to happen:
- 1) At least one of the sg_list entries is at an invalid address
- 2) The value of next points to an invalid address
- 3) An error occurred in one of the posted RRs (a bad value in the RR, or a full Work Queue) and the variable bad_wr is NULL
I had code that worked with UC or RC QPs, and I added support for UD QPs, but I keep getting Work Completions with error. What happened?
- For a UD QP, an extra 40 bytes should be added to the RR buffers (to allow saving the GRH, if one exists in the message).
What are "chained WRs posted on an SQ of a QP"? I think WRs are individual to each other; how can they be chained together?
- "Chained WRs posted on an SQ of a QP" is actually a linked list of Send Requests.
- As you (correctly) said, every Send Request by itself is individual (and independent),
- but posting them together allows some optimizations, compared to posting them one by one.
If we have a program involving different message sizes (from 4 KB to 4 MB, or even more), what is the best practice for posting buffers? For example, if the server side posts ten 4 KB RRs and ten 4 MB RRs, can it match the incoming payload sizes?
- Since one can't predict which message size will consume which Receive Request,
- IMHO there are two options for handling this:
- 1) Prepare every Receive Request for the maximum incoming message size (4 MB in your example)
- 2) Work with two SRQs: one handling the 4 KB messages, and the second one handling the 4 MB messages
- To use big and small buffers, you should associate each QP with either the small or the big SRQ
- (i.e., the SRQ that accepts big messages, or the one that accepts small messages; a QP can be associated with only one SRQ).
- You need to know which QPs to send the big and small messages to,
- otherwise the buffers in the Receive Requests won't be big enough.
What is an RDMA fence?
- If you perform an RDMA Read followed by an RDMA Write, Send or Atomic operation, you may need to use a fence
- (if they access the same addresses): a fenced Send Request is not processed until all previous RDMA Read and Atomic operations on that Send Queue have completed.
How do out-of-order packets happen in RDMA?
- In certain fabric configurations, IB packets of a given QP may take
- different paths in the network from source to destination. This results
- in packets reaching the receiver side out of order.
What are the types of QP?
- In RDMA, there are several QP types. They can be represented as XY, where X is the reliability (Reliable/Unreliable) and Y is the connectivity (Connected/Datagram), e.g. RC, UC and UD.
- Reliable: there is a guarantee that messages are delivered at most once, in order and without corruption.
- Unreliable: there isn't any guarantee that the messages will be delivered, or about the order of the packets.
- In RDMA, every packet has a CRC, and corrupted packets are dropped (for any transport type). The reliability of a QP transport type refers to the reliability of the whole message.
- Connected: one QP sends/receives to/from exactly one other QP.
- Datagram (unconnected): one QP sends/receives to/from any QP.
How is packet validity checked in RDMA?
- * CRC: the CRC fields validate that packets weren't corrupted along the path.
- * PSN: the Packet Sequence Number makes sure that packets are received in order.
- This helps detect missing packets and packet duplications.
What is Reliable Connection?
- One RC QP is connected (i.e. sends and receives messages) to exactly one other RC QP in a reliable way.
- It is guaranteed that messages are delivered from a requester to a responder at most once, in order
- and without corruption. The maximum supported message size is up to 2GB (this value may be lower,
- depending on the RDMA device attributes).
- If a message is bigger than the path MTU, it is fragmented on the side that sends the
- data and reassembled on the receiver side.
- The requester considers a message operation complete once there is an ACK from the responder side that
- the message was read from/written to its memory.
- RC QPs support both RDMA Write and Send operations.
- An RC QP should be chosen if:
- Reliability by the fabric is needed
- The fabric isn't big, or the cluster is big but not all nodes send traffic to the same node (one victim)
- Several uses for an RC QP: FTP over RDMA or a file system over RDMA.
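A minimal sketch of creating an RC QP (the helper name and queue depths are illustrative; the PD and CQ are assumed to exist already):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Create a Reliable Connection QP on an existing PD and CQ. */
static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_type          = IBV_QPT_RC;  /* Reliable Connection */
    attr.send_cq          = cq;
    attr.recv_cq          = cq;          /* the same CQ may serve both queues */
    attr.cap.max_send_wr  = 16;          /* illustrative queue depths */
    attr.cap.max_recv_wr  = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    return ibv_create_qp(pd, &attr);     /* NULL on failure */
}
```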
What is an Unreliable Datagram (UD) QP?
- One UD QP can send messages to and receive messages from any other UD QP, in either a unicast (one to one) or
- multicast (one to many) way, in an unreliable manner.
- There isn't any guarantee that the messages will be received by the other side:
- corrupted or out-of-sequence packets are silently dropped, and there isn't any guarantee
- about the packet ordering. The maximum supported message size is the path MTU;
- unlike the connected transport types, messages are never fragmented and reassembled.
- UD QPs support only Send operations.
- A UD QP should be chosen if:
- Reliability by the fabric isn't needed (i.e. reliability isn't important at all, or it is taken care of by the application)
- The fabric is big and every node sends messages to every other node; UD is one of the best solutions for scalability problems
- Multicast messages are needed
- One use for a UD QP: voice over RDMA.
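Since a UD QP isn't connected, each Send Request carries its own destination: an Address Handle plus the remote QP number and Q_Key. A minimal sketch (the helper name is illustrative; the QP, SGE and AH are assumed to exist already):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Post a Send on a UD QP; the destination is described per-WR. */
static int post_ud_send(struct ibv_qp *qp, struct ibv_sge *sge,
                        struct ibv_ah *ah, uint32_t remote_qpn,
                        uint32_t remote_qkey)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list           = sge;
    wr.num_sge           = 1;
    wr.opcode            = IBV_WR_SEND;  /* UD supports Send only */
    wr.send_flags        = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = ah;           /* path to the remote node */
    wr.wr.ud.remote_qpn  = remote_qpn;   /* destination QP number */
    wr.wr.ud.remote_qkey = remote_qkey;  /* must match the remote QP's Q_Key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```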
What is an Unreliable Connected (UC) QP?
- One UC QP is connected (i.e. sends and receives messages) to exactly one other UC QP in an unreliable way.
- There isn't any guarantee that the messages will be received by the other side: corrupted or out-of-
- sequence packets are silently dropped. If a packet is dropped, the whole message that it belongs to is dropped.
- In this case, the responder doesn't stop, but continues to receive incoming packets.
- There isn't any guarantee about the packet ordering.
- If a message is bigger than the path MTU, it is fragmented on the side that sends the
- data and reassembled on the receiver side.
- A UC QP should be chosen if:
- Reliability by the fabric isn't needed (i.e. reliability isn't important at all, or it is taken care of by the application)
- The fabric isn't big, or the cluster is big but not all nodes send traffic to the same node (one victim)
- Big messages (bigger than the path MTU) are being sent
- One use for a UC QP: video over RDMA.
What is ibv_req_notify_cq() used for? Do I have to use it?
- ibv_req_notify_cq() is relevant if one wishes to work with CQ events,
- to decrease CPU utilization compared to constantly polling the CQ.
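A minimal sketch of the event-driven pattern (the helper name is illustrative; the CQ is assumed to have been created with a completion channel):

```c
#include <infiniband/verbs.h>

/* Sleep until a Work Completion is added to the CQ instead of busy-polling.
 * 'channel' is the ibv_comp_channel the CQ was created with. */
static int wait_for_completion(struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  /* blocks */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);       /* every event must be acknowledged */
    if (ibv_req_notify_cq(ev_cq, 0))   /* re-arm before draining the CQ */
        return -1;
    /* now drain the CQ with ibv_poll_cq() */
    return 0;
}
```

Note that ibv_req_notify_cq() must be called again after each event, otherwise no further events will be generated for that CQ.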
What is a Completion Queue and how is it created?
- ibv_create_cq() creates a Completion Queue (CQ) for an RDMA device context.
- When an outstanding Work Request, within a Send or Receive Queue, is completed, a Work Completion is added to the CQ of that Work Queue. This Work Completion indicates that the outstanding Work Request has been completed (and is no longer considered outstanding) and provides details on it (status, direction, opcode, etc.).
- A single CQ can be shared by the Send and Receive Queues, and shared across multiple QPs.
How is the CQ processed?
- There are two ways to process a CQ:
- 1. Polling (calling ibv_poll_cq(), usually in a loop)
- 2. Events (by passing a struct ibv_comp_channel when creating the CQ)
Can I use different CQs for Send/Receive Queues in the same QP?
- Yes. In any QP the CQ of the Send Queue and the CQ of the Receive Queue may be the same or may be different. This is flexible and up to the user to decide.
Can several QPs be associated with the same CQ?
- Yes. Several QPs can be associated with the same CQ in their Send or Receive Queues, or in both of them.
What should be the CQ size?
- A CQ should have enough room to hold all of the Work Completions of the outstanding Work Requests of the Work Queues associated with it, so the CQ size shouldn't be less than the total number of Work Requests that may be outstanding.
What will happen if the CQ size that I choose is too small?
- If the CQ is already full when a new Work Completion is to be added to it, a CQ overrun occurs: a completion with error is added to the CQ and an
- asynchronous event IBV_EVENT_CQ_ERR is generated.
How to destroy CQ?
- ibv_destroy_cq() destroys a Completion Queue.
- The destruction of a CQ will fail if:
- 1. Any QP is still associated with it
- 2. There is an affiliated asynchronous event on that CQ that was read using ibv_get_async_event() but still wasn't acknowledged using ibv_ack_async_event()
- A CQ can be destroyed whether it is empty or still contains Work Completions that weren't polled by ibv_poll_cq().
What is polling of CQ?
- ibv_poll_cq() polls Work Completions from a Completion Queue (CQ).
- A Work Completion indicates that a Work Request in a Work Queue has ended.
- When a Work Request ends, a Work Completion is added to the tail of the CQ that the Work Queue is associated with. ibv_poll_cq() checks whether Work Completions are present in a CQ and pops them from the head of the CQ in the order they entered it (FIFO). After a Work Completion has been popped from a CQ, it can't be returned to it.
- One should consume Work Completions at a rate that prevents the CQ from being overrun (i.e. from holding more Work Completions than the CQ size).
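A minimal polling sketch (the helper name is illustrative; the CQ is assumed to exist already):

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Drain all currently available Work Completions from a CQ. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n, total = 0;

    /* ibv_poll_cq() never blocks: 0 means the CQ is currently empty */
    while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
        if (wc.status != IBV_WC_SUCCESS)
            fprintf(stderr, "WR %llu failed: %s\n",
                    (unsigned long long)wc.wr_id,
                    ibv_wc_status_str(wc.status));
        total++;
    }
    return n < 0 ? n : total;  /* a negative value means a poll error */
}
```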
What is that Work Completion anyway?
- A Work Completion means that the corresponding Work Request has ended and its buffers can be (re)used, read, or freed.
I got a Work Completion from the Receive Queue of a UD QP and it ended well. I read the data from the memory buffers and I got bad data. Why?
- Maybe you looked at the data starting at offset 0. For any Work Completion on a UD QP, the data is placed at offset 40 of the relevant memory buffers, whether or not a GRH was present (the first 40 bytes are reserved for the GRH).
What is this GRH and why do I need it?
- The Global Routing Header (GRH) provides information that is most useful for sending a message back to the sender of this message if it came from a different subnet or from a multicast group.
Does an entry in the Completion Queue of the sender indicate that the receiver has received the data, or
does it only indicate that the sender can now reuse the buffer (as it was sent over the wire)?
- Assuming that the Work Completion ended successfully:
- For Reliable QP (for example, RC): this means that the sent buffer was written at the receiver side.
- For Unreliable QP: this means that the sent buffer can be reused, since the message was already sent.
Is it possible that I do not generate a completion entry for a send operation?
- Yes. When the QP is created with sq_sig_all = 0, only Send Requests posted with .send_flags = IBV_SEND_SIGNALED generate a completion entry.
- This is called selective signaling; for it to work, signal at least one WR in every SQ-depth's worth of posted WRs.
- For example, if the SQ depth is 16, we must signal at least one out of every 16 WRs.
- This ensures proper flow control for HW resources.
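A minimal sketch of selective signaling (the helper name is illustrative; the QP is assumed to have been created with sq_sig_all = 0):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Post 'count' sends, signaling only one WR per 'sq_depth' posts so the
 * Send Queue resources can be reclaimed when its completion is polled. */
static int post_selectively_signaled(struct ibv_qp *qp, struct ibv_sge *sge,
                                     unsigned count, unsigned sq_depth)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    unsigned i;
    int rc;

    for (i = 0; i < count; i++) {
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = i;
        wr.sg_list = sge;
        wr.num_sge = 1;
        wr.opcode  = IBV_WR_SEND;
        /* signal the last WR of every sq_depth-sized batch */
        if (i % sq_depth == sq_depth - 1)
            wr.send_flags = IBV_SEND_SIGNALED;
        rc = ibv_post_send(qp, &wr, &bad_wr);
        if (rc)
            return rc;
    }
    return 0;
}
```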
How to get work completion in Receive side?
- If you want to get a Work Completion on the receiver side, I suggest that you:
- 1) Post a Receive Request at the receiver side
- 2) Use RDMA Write with Immediate, which will consume the Receive Request on the receiver side and generate a Work Completion
- If you are using plain RDMA Write, you won't get any Work Completion on the receiver side at all.
Is ibv_poll_cq() a blocking function? For example, if the CQ is empty, would ibv_poll_cq() return 0 immediately, or would it block and only sporadically return 0?
- ibv_poll_cq() isn't a blocking function and it will always return immediately. Its return value is:
- Negative in case of an error
- Otherwise, the number of Work Completions returned (0 means that no Work Completions were found in that CQ)
What is a PD and how is it allocated?
- ibv_alloc_pd() allocates a Protection Domain (PD) for an RDMA device context.
- The created PD will be used to:
- Create AHs, SRQs, and QPs
- Register MRs
- Allocate MWs
- A PD is a means of protection and helps you create a group of objects that can work together.
- If several objects were created using PD1, and others were created using PD2, working with objects from
- group 1 together with objects from group 2 will result in a Work Completion with error.
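A minimal sketch of opening a device and allocating a PD on it (the helper name is illustrative, and the code simply picks the first device in the list):

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Open the first RDMA device found and allocate a PD on it. */
static struct ibv_pd *open_first_pd(struct ibv_context **out_ctx)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    struct ibv_context *ctx;
    struct ibv_pd *pd = NULL;

    if (!list || !list[0])
        return NULL;               /* no RDMA device present */
    ctx = ibv_open_device(list[0]);
    if (ctx) {
        pd = ibv_alloc_pd(ctx);    /* MRs, QPs, SRQs and AHs will use this PD */
        *out_ctx = ctx;
    }
    ibv_free_device_list(list);    /* the list is no longer needed */
    return pd;
}
```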
Can I destroy the PD and all of the RDMA resources which are associated with it with one verb call?
- No, libibverbs doesn't support it. If a user wishes to deallocate a PD, he needs to destroy all of the RDMA resources which are associated with it before calling ibv_dealloc_pd().
Which QP Transport Types can be associated with an SRQ?
- RC and UD QPs can be associated with an SRQ by all RDMA devices. In some RDMA devices, you can associate a UC QP with an SRQ as well.
I'm using an UD communication. At the "server" side I do a ibv_create_qp every time that a client "connects" (with quotes because there is no connection in the traditional sense). However, since there is no real connection, how can I know that the client disconnected in order to release the QP created with ibv_create_qp?
- In order to know when to destroy the QP, you have several options:
- 1) Use the CM libraries (libibcm/librdmacm) for connection establishment and teardown
- 2) Handle this within your application: maintain "keep alive" messages and/or a "leaving" message
How to increase RDMA code performance?
1. When posting multiple WRs, post them as a linked list in one call
2. When using Work Completion events, acknowledge several events in one call
3. Avoid using many scatter/gather entries
4. Read multiple Work Completions at once
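The first tip can be sketched as follows (the helper name and fixed array size are illustrative; the QP and SGEs are assumed to exist already):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Chain several Send Requests through their 'next' pointers and hand the
 * whole list to the device with a single ibv_post_send() call. */
static int post_wr_chain(struct ibv_qp *qp, struct ibv_sge *sges, int n)
{
    struct ibv_send_wr wrs[16], *bad_wr = NULL;
    int i;

    if (n < 1 || n > 16)
        return -1;
    memset(wrs, 0, sizeof(wrs));
    for (i = 0; i < n; i++) {
        wrs[i].wr_id   = i;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
        wrs[i].opcode  = IBV_WR_SEND;
        wrs[i].next    = (i + 1 < n) ? &wrs[i + 1] : NULL;  /* link the WRs */
    }
    wrs[n - 1].send_flags = IBV_SEND_SIGNALED;   /* signal only the last WR */
    return ibv_post_send(qp, &wrs[0], &bad_wr);  /* one call for all WRs */
}
```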
What is Unsignaled Completion?
By default, all Work Requests generate Work Completions when their processing is finished. However, a Send Request may or may not generate a Work Completion when its processing is finished; this is fully controllable by the application and is called Unsignaled Completion. It is enabled at QP creation time with:
qp_init_attr.sq_sig_all = 0;
Why do we need multiple QPs?
- A single QP can't always reach line rate.
What is an Address Handle?
- This object describes the path from the requester side to the responder side.
- ibv_create_ah() creates an Address Handle (AH) associated with a Protection Domain.
- This AH will later be used when a Send Request (SR) is posted to an Unreliable Datagram QP.
How can I get the needed information for the AH when calling ibv_create_ah()?
There are several ways to obtain this information:
- Perform path query to the Subnet Administrator (SA)
- Out of band connection to the remote node, for example: using socket
- Using well-known values; for example, this can be done in a static subnet (in which all of the addresses are predefined) or using multicast groups
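A minimal sketch of creating an AH toward a peer on the local subnet, assuming its LID was obtained out of band (the helper name and zeroed attributes are illustrative):

```c
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

/* Create an AH toward a peer on the same subnet, given its LID
 * (obtained out of band, e.g. over a socket). */
static struct ibv_ah *make_ah(struct ibv_pd *pd, uint16_t remote_lid,
                              uint8_t port_num)
{
    struct ibv_ah_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.dlid          = remote_lid;  /* destination LID from the peer */
    attr.sl            = 0;           /* service level */
    attr.src_path_bits = 0;
    attr.is_global     = 0;           /* same subnet: no GRH needed */
    attr.port_num      = port_num;    /* local port to send from */

    return ibv_create_ah(pd, &attr);  /* NULL on failure */
}
```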
What is the maximum message size for RC and UD mode?
* The maximum message size of RC QPs is 2GB (unless one of the end nodes supports a lower value)
* The maximum message size of UD QPs is the path MTU, up to 4KB (unless one of the end nodes/switches in the path supports a lower value)
Why do we need to create multiple QPs in an application when the same thing can be done by a single QP? I mean, what is the use case for creating multiple QPs?
- One can use one or multiple QPs in the same application, depending on its usage.
- If you develop an all-to-all application, a single QP may not provide the best performance
- (will you use a UD QP, or one RC QP connected to each of the other clients?).
- This is just like the question: should I use one or multiple sockets?
- It depends on what your application is doing, how many parallel connections it needs, performance requirements, etc.