
RDMA over Converged Ethernet and Network Offload Logic

The implementation of RDMA over Converged Ethernet (RoCE) represents a critical shift in high-performance data center architecture: it replaces the traditional kernel TCP/IP data path with a direct memory access model that bypasses the operating system kernel. In standard networking, each data transfer costs CPU cycles for context switching and for copying buffers between user space and kernel space. This creates a significant bottleneck in cloud environments and distributed storage systems where high throughput and low latency are non-negotiable requirements. By offloading the transport layer to specialized hardware, RoCE allows a host to read or write the registered memory of a remote host directly, without involving the remote CPU. This minimizes latency and CPU load and removes most of the overhead associated with software-based networking. Within the broader data center stack, RoCE serves as a foundational transport for technologies such as NVMe over Fabrics (NVMe-oF) and distributed database clusters. It addresses the “I/O Wall” problem, ensuring that network latency does not become the primary constraint for fast storage devices and the applications that depend on them.

TECHNICAL SPECIFICATIONS

| Requirements | Default Port/Operating Range | Protocol/Standard | Impact Level (1-10) | Recommended Resources |
| :--- | :--- | :--- | :--- | :--- |
| HCA (Host Channel Adapter) | N/A (Hardware Layer) | IBTA RoCE v2 | 10 | PCIe Gen4 x16 Slot |
| L3 Routing Support | UDP Port 4791 | RoCE v2 (IBTA Annex A17) | 8 | 100GbE+ Managed Switch |
| Flow Control | IEEE 802.1Qbb (PFC) | DCB / ECN | 9 | Support for Jumbo Frames |
| Kernel Support | Linux 4.9+ / Windows Server 2016+ | Verbs API | 7 | 16GB+ RAM (Pinned Memory) |
| MTU Configuration | 1500 to 9000 Bytes | IEEE 802.3 (jumbo frames are vendor-defined) | 6 | Minimum 4200 MTU recommended |

THE CONFIGURATION PROTOCOL

Environment Prerequisites:

Successful deployment requires a consistent configuration across the network fabric. Ensure that all ConnectX or equivalent adapters run current firmware; for NVIDIA/Mellanox adapters, the Mellanox Firmware Tools (MFT) handle the update. The operating system must have the OFED (OpenFabrics Enterprise Distribution) or the native RDMA Core libraries installed. Hardware requirements include a non-blocking switch fabric that supports Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to prevent packet loss, as RoCE is highly sensitive to drops in the Ethernet layer.

Section A: Implementation Logic:

The engineering design of RDMA over Converged Ethernet is predicated on the concept of “Zero-copy” data transfers. In a traditional stack, a packet travels from the wire through the NIC, into a kernel buffer, and finally into the application buffer. Each step consumes CPU cycles and increases latency. RoCE moves the entire transport logic (sequencing, acknowledgment, and retransmission) into the hardware of the Host Channel Adapter (HCA). By using a Queue Pair (QP) architecture consisting of a Send Queue and a Receive Queue, the application can post work requests directly to the hardware. The hardware then moves the data by DMA, using its own address translation tables for “pinned” (registered) memory regions that cannot be swapped to disk. This keeps the CPU and the kernel out of the data path, maximizing concurrency and throughput.
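The queue pair flow described above can be pictured with a small conceptual model. This is only a teaching sketch in plain Python, not the Verbs API: the class names (`MemoryRegion`, `WriteRequest`, `QueuePair`) and the status strings are invented for illustration, and the `process` method merely stands in for work that a real adapter performs in hardware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """Stand-in for a registered (pinned) buffer; rkey is its remote access key."""
    buf: bytearray
    rkey: int


@dataclass
class WriteRequest:
    """A work request asking the adapter to write `data` into remote memory."""
    wr_id: int
    data: bytes
    remote_offset: int
    rkey: int


@dataclass
class QueuePair:
    """Toy queue pair: the application posts requests, the 'adapter' drains them."""
    remote: MemoryRegion
    send_queue: deque = field(default_factory=deque)
    completions: list = field(default_factory=list)

    def post_send(self, wr: WriteRequest) -> None:
        # The application only enqueues a descriptor; no data moves here.
        self.send_queue.append(wr)

    def process(self) -> None:
        # Plays the adapter's role: validate the key and bounds, move the
        # bytes, then report a completion for each work request.
        while self.send_queue:
            wr = self.send_queue.popleft()
            end = wr.remote_offset + len(wr.data)
            if wr.rkey != self.remote.rkey or end > len(self.remote.buf):
                self.completions.append((wr.wr_id, "REMOTE_ACCESS_ERROR"))
                continue
            self.remote.buf[wr.remote_offset:end] = wr.data
            self.completions.append((wr.wr_id, "SUCCESS"))
```

The point of the model is the split of responsibilities: posting is cheap and involves no copy, while validation, data movement, and completion reporting all happen on the adapter side.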

Step-By-Step Execution

1. Enable Priority Flow Control (PFC) on the Network Interface

mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
System Note: This command configures the adapter to use IEEE 802.1Qbb flow control on a specific priority (Priority 3 in this example; the list holds one entry for each priority from 0 to 7). With PFC enabled, a receiver whose buffers are filling sends a per-priority “PAUSE” frame to its link partner instead of dropping packets. This is essential for maintaining the lossless behavior that RDMA over Converged Ethernet relies on.
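The comma-separated PFC argument is easy to get wrong by one position, so it can help to generate it. The helper below is a minimal sketch; the function name `pfc_bitmap` is hypothetical, and it only builds the string with one 0/1 entry per priority, it does not run any command.

```python
def pfc_bitmap(enabled_priorities, num_priorities=8):
    """Return a comma-separated 0/1 list with one entry per priority 0..7."""
    enabled = set(enabled_priorities)
    bad = sorted(p for p in enabled if not 0 <= p < num_priorities)
    if bad:
        raise ValueError(f"priorities out of range: {bad}")
    return ",".join("1" if p in enabled else "0" for p in range(num_priorities))


print(pfc_bitmap({3}))  # prints 0,0,0,1,0,0,0,0
```

The same priority must be enabled for PFC on the switch port that faces this host.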

2. Load the RDMA Kernel Modules

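modprobe -a ib_uverbs rdma_ucm
System Note: This command loads the necessary drivers into the running Linux kernel; the -a flag is required when more than one module is named on the same line. The ib_uverbs module allows user space applications to communicate directly with the hardware via the “Verbs” API, while rdma_ucm handles the connection management logic. The ib_ipoib module (IP over InfiniBand) is not needed on an Ethernet link layer.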
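To confirm that the modules are present, the contents of /proc/modules can be checked. The sketch below takes the file contents as a string so that it stays testable; the function name `missing_modules` and the default module list are assumptions for illustration. Drivers compiled directly into the kernel do not appear in /proc/modules, so a reported "missing" module is a prompt to investigate rather than proof of a fault.

```python
def missing_modules(proc_modules_text, required=("ib_uverbs", "rdma_ucm")):
    """Return the required module names that are absent from /proc/modules text."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


# Illustrative input in the /proc/modules layout (values are made up).
sample = (
    "rdma_ucm 32768 0 - Live 0x0000000000000000\n"
    "ib_core 450560 2 rdma_ucm,ib_uverbs, Live 0x0000000000000000\n"
)
print(missing_modules(sample))  # prints ['ib_uverbs']
```

On a live system the input would come from reading /proc/modules.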

3. Identify and Verify the RDMA Device Status

ibv_devinfo -d mlx5_0
System Note: This utility queries the HCA state from the hardware registers. It confirms that the physical port is “PORT_ACTIVE” and that the “link_layer” is set to “Ethernet”. If the state is “PORT_DOWN”, the physical layer or the SFP28/QSFP56 module has failed to establish a link.
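The two fields named above can be checked automatically. The following sketch parses `key: value` lines of the kind ibv_devinfo prints; the sample text is abbreviated and illustrative, the function names are hypothetical, and only the first port found in the output is considered.

```python
def port_summary(devinfo_text):
    """Extract 'state' and 'link_layer' from ibv_devinfo-style output."""
    fields = {}
    for line in devinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # setdefault keeps the first occurrence, i.e. the first port.
            fields.setdefault(key.strip(), value.strip())
    # "PORT_ACTIVE (4)" -> "PORT_ACTIVE"
    state = fields.get("state", "").split(" ")[0]
    return {"state": state, "link_layer": fields.get("link_layer", "")}


def is_roce_ready(summary):
    """True when the port is active and runs over an Ethernet link layer."""
    return summary["state"] == "PORT_ACTIVE" and summary["link_layer"] == "Ethernet"


SAMPLE = """\
hca_id: mlx5_0
    transport:          InfiniBand (0)
        port:   1
            state:          PORT_ACTIVE (4)
            link_layer:     Ethernet
"""
print(is_roce_ready(port_summary(SAMPLE)))  # prints True
```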

4. Verify the GID (Global Identifier) Table

show_gids
System Note: RoCE v2 requires a GID that maps to an IP address for L3 routing. This step verifies that the HCA has correctly associated a GID with the local IP address assigned to the VLAN interface. Without a valid GID, the UDP encapsulation of the RDMA payload will fail.
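When reading the GID table it helps to know what entry to expect for a given address. For IPv4, RoCE v2 GIDs are written as IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) in fully expanded form. The sketch below computes that form; the function name and the address 192.168.1.10 are examples only, not values from any particular deployment.

```python
import ipaddress


def ipv4_to_roce_gid(addr):
    """Format an IPv4 address as a fully expanded IPv4-mapped IPv6 GID."""
    v4 = int(ipaddress.IPv4Address(addr))
    value = (0xFFFF << 32) | v4  # ::ffff:a.b.c.d as a 128-bit integer
    groups = [(value >> shift) & 0xFFFF for shift in range(112, -1, -16)]
    return ":".join(f"{group:04x}" for group in groups)


print(ipv4_to_roce_gid("192.168.1.10"))
# prints 0000:0000:0000:0000:0000:ffff:c0a8:010a
```

If no table entry matches the computed value for the interface address, the address is likely not assigned to the expected interface or VLAN.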

5. Set the MTU for High Throughput

ip link set dev eth0 mtu 9000
System Note: Increasing the Maximum Transmission Unit (MTU) to 9000 (Jumbo Frames) reduces the per-packet overhead. Because RDMA over Converged Ethernet optimizes for large data transfers, a larger payload size significantly improves effective throughput and reduces the number of packets the HCA must process. Every switch port along the path must be configured with an equal or larger MTU, or oversized frames will be dropped.
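The RDMA path MTU is limited to a fixed set of sizes (256, 512, 1024, 2048, or 4096 bytes), so the Ethernet MTU determines which of these sizes fits. The sketch below picks the largest one; the 96-byte header allowance is an assumption used for illustration, and the actual allowance depends on the driver and the headers in use.

```python
IB_MTUS = (256, 512, 1024, 2048, 4096)


def roce_path_mtu(eth_mtu, header_allowance=96):
    """Largest RDMA path MTU that fits in eth_mtu after reserving header room.

    header_allowance is an assumed, conservative figure; adjust it for the
    environment being modelled.
    """
    usable = eth_mtu - header_allowance
    fitting = [mtu for mtu in IB_MTUS if mtu <= usable]
    if not fitting:
        raise ValueError(f"Ethernet MTU {eth_mtu} is too small for RoCE")
    return max(fitting)


for eth_mtu in (1500, 4200, 9000):
    print(eth_mtu, roce_path_mtu(eth_mtu))
# prints:
# 1500 1024
# 4200 4096
# 9000 4096
```

Under this assumption, an Ethernet MTU of about 4200 is already enough to reach the 4096-byte path MTU, which is consistent with the minimum recommended in the specification table.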

Section B: Dependency Fault-Lines:

The most common failure point in RDMA over Converged Ethernet deployments is a mismatch in PFC settings between the host and the switch. If the host expects a lossless fabric but the switch is configured for “tail-drop” congestion management, packet loss will trigger massive retransmission timeouts in the RDMA layer, leading to a performance collapse. Another common fault is the “Memory Pinning” limit; if ulimit -l (the locked-in-memory size) is not set to “unlimited”, the application may fail to register memory regions with the HCA.
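The locked-memory limit can be checked from inside a process before it attempts to register memory. This is a minimal sketch using Python's standard `resource` module; the function name is hypothetical, and accepting a finite limit that is large enough is an assumption that goes slightly beyond the “unlimited” setting recommended above.

```python
import resource


def memlock_sufficient(limits, needed_bytes):
    """True if the soft RLIMIT_MEMLOCK is unlimited or at least needed_bytes."""
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or soft >= needed_bytes


# Query the limit of the current process and test it against 1 GiB.
current = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(memlock_sufficient(current, 1 << 30))
```

A False result for the service account that runs the RDMA application points at the limits configuration rather than at the network.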

THE TROUBLESHOOTING MATRIX

Section C: Logs & Debugging:

When performance degrades, the primary sources of diagnostic data are the rdma tool and the kernel ring buffer. Review the output of dmesg | grep -i rdma to identify hardware initialization errors or firmware version mismatches.

For specific error codes:
1. Remote Access Error: This usually indicates that the application attempted to access a memory address that was not properly registered or the R_Key (Remote Key) was invalid.
2. Local Length Violation: A posted send exceeds the maximum message size supported by the port, or an incoming message is larger than the receive buffer that was posted for it.
3. Transport Retry Counter Exceeded: This is a critical error signifying that the sender did not receive an ACK from the receiver. Check the physical cabling and transceivers (link error counters, a reseat, or a known-good replacement) and check the switch counters for “PFC PAUSE” frames and drops.
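Applications usually see these conditions as work completion statuses from libibverbs. The lookup below maps the three corresponding status names (`IBV_WC_REM_ACCESS_ERR`, `IBV_WC_LOC_LEN_ERR`, `IBV_WC_RETRY_EXC_ERR`) to the first checks suggested above. It is a small triage sketch: the dictionary, the function name, and the hint wording are illustrative, not part of any library.

```python
WC_HINTS = {
    "IBV_WC_REM_ACCESS_ERR": (
        "Remote Access Error: verify that the remote buffer is registered "
        "and that the R_Key and address range are valid."
    ),
    "IBV_WC_LOC_LEN_ERR": (
        "Local Length Violation: compare the message size with the posted "
        "buffer size and the maximum message size."
    ),
    "IBV_WC_RETRY_EXC_ERR": (
        "Transport Retry Counter Exceeded: check cabling, transceivers, "
        "and switch pause and drop counters."
    ),
}


def hint_for(status_name):
    """Return a first troubleshooting hint for a work completion status name."""
    return WC_HINTS.get(status_name, "No hint recorded for this status.")


print(hint_for("IBV_WC_RETRY_EXC_ERR"))
```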

Path-specific diagnostics:
HCA Counters: Check /sys/class/infiniband//ports/1/counters/. Focus on port_rcv_errors and port_xmit_discards. Any non-zero value here indicates a physical or flow control issue.
Ethernet Statistics: Use ethtool -S to monitor rx_priority_pause_frames. High counts in this field indicate the network is congested and the flow control logic is actively throttling traffic to prevent loss.
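Reading the HCA counters can be scripted. The sketch below reads the two counters named above from a counters directory supplied by the caller; the function names are hypothetical, and the directory is deliberately a parameter because the device name differs between systems.

```python
from pathlib import Path


def read_counters(counter_dir, names=("port_rcv_errors", "port_xmit_discards")):
    """Read integer counter files from counter_dir; unreadable ones map to None."""
    result = {}
    for name in names:
        try:
            result[name] = int((Path(counter_dir) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def nonzero_counters(counters):
    """Keep only the counters that hold a non-zero value."""
    return {name: value for name, value in counters.items() if value}
```

Sampling the counters twice, a few seconds apart, shows whether a non-zero value is still growing or is left over from an earlier incident.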

OPTIMIZATION & HARDENING

– Performance Tuning: Aligning the IRQ (Interrupt Request) affinity of the network interface to the same NUMA node as the application is vital. Use lscpu to identify the topology and map the interface interrupts to the local cores to minimize memory latency. Additionally, increasing the Ring Buffer size via ethtool -G rx 4096 tx 4096 can mitigate transient bursts of high traffic.

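– Security Hardening: RDMA by design provides direct access to memory, which introduces risks. On Ethernet fabrics, isolate tenants with VLANs (and with VRFs or ACLs at Layer 3); Partition Keys (PKEYs) provide this isolation on InfiniBand but are not an effective boundary for RoCE. Permit UDP Port 4791 only between known RDMA endpoints, preferably with switch or router ACLs, because hardware-offloaded RoCE traffic bypasses the kernel network stack and may not be seen by host firewalls such as iptables/nftables. Implement strict permissions on the /dev/infiniband/uverbsX device nodes to ensure only authorized service accounts can post RDMA commands.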

– Scaling Logic: As the fabric grows, move from RoCE v1 (Layer 2 only) to RoCE v2 (Layer 3) to allow routing across different subnets. Maintain a consistent ECN (Explicit Congestion Notification) policy across all switches and routers so that congested devices mark packets; the receiving HCA then returns “Congestion Notification Packets” (CNPs), allowing the sending HCA to slow its transmission rate before a “PAUSE” frame is required.
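For the NUMA alignment described under Performance Tuning, the local cores of an interface can be read from sysfs. The sketch below is a minimal example that assumes the usual Linux layout (`class/net/<iface>/device/numa_node` and `devices/system/node/node<N>/cpulist`); the function names and the `sysfs` parameter are hypothetical, and it only reports the cores without changing any IRQ affinity.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        low, sep, high = part.partition("-")
        cpus.extend(range(int(low), int(high) + 1) if sep else [int(low)])
    return cpus


def local_cpus(iface, sysfs="/sys"):
    """Return the CPUs on the interface's NUMA node, or None if none is reported."""
    node_file = Path(sysfs, "class/net", iface, "device/numa_node")
    node = int(node_file.read_text())
    if node < 0:  # -1 means the platform reports no NUMA affinity
        return None
    cpulist = Path(sysfs, "devices/system/node", f"node{node}", "cpulist")
    return parse_cpulist(cpulist.read_text())


print(parse_cpulist("0-3,8-11"))  # prints [0, 1, 2, 3, 8, 9, 10, 11]
```

The resulting list is the set of cores to which both the interface interrupts and the application threads should be pinned.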

THE ADMIN DESK (FAQ)

Why am I seeing high CPU usage despite using RoCE?
The system might be falling back to “Soft RoCE” (rxe). Verify that the hardware HCA is correctly detected and that the application is using the hardware GID. Run ibv_devinfo to confirm hardware transport offload is active.

Can I run RoCE over standard unmanaged switches?
No; RDMA over Converged Ethernet requires a lossless fabric. Unmanaged switches typically lack PFC (Priority Flow Control). Without it, the network will experience packet loss, causing the RDMA connection to time out and disconnect frequently.

How do I test the maximum bandwidth of the RoCE link?
Use the ib_send_bw or ib_write_bw utilities included with the perftest package. These tools bypass the filesystem and test the raw throughput of the HCA and the network fabric directly via the Verbs API.

What is the difference between RoCE v1 and RoCE v2?
RoCE v1 is an Ethernet link layer protocol (Ethertype 0x8915) and cannot be routed. RoCE v2 encapsulates the RDMA payload in UDP/IP, allowing it to traverse routers and operate at Layer 3 of the OSI model.
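The distinction can be seen directly in a captured frame. The sketch below classifies a raw Ethernet frame using only the two markers mentioned above, Ethertype 0x8915 for RoCE v1 and UDP destination port 4791 for RoCE v2. It is deliberately simplified: it handles untagged IPv4 frames only (no VLAN tag, no IPv6), and the function name is hypothetical.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915
IPV4_ETHERTYPE = 0x0800
UDP_PROTOCOL = 17
ROCE_V2_UDP_PORT = 4791


def classify_frame(frame: bytes) -> str:
    """Return 'RoCE v1', 'RoCE v2', or 'other' for an untagged Ethernet frame."""
    if len(frame) < 14:
        return "other"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"
    if ethertype == IPV4_ETHERTYPE and len(frame) >= 14 + 20:
        ihl = (frame[14] & 0x0F) * 4   # IPv4 header length in bytes
        protocol = frame[14 + 9]       # protocol field of the IPv4 header
        udp_start = 14 + ihl
        if protocol == UDP_PROTOCOL and len(frame) >= udp_start + 4:
            (dport,) = struct.unpack("!H", frame[udp_start + 2:udp_start + 4])
            if dport == ROCE_V2_UDP_PORT:
                return "RoCE v2"
    return "other"
```

Because RoCE v2 is ordinary UDP/IP on the wire, routers can forward it like any other IP traffic, whereas RoCE v1 frames are confined to a single Layer 2 domain.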

Does RoCE require specific cabling?
While it runs on standard fiber or copper (DAC), the high throughput and low latency nature of rdma over converged ethernet often require high quality OM4/OM5 fiber or Active Optical Cables (AOC) to maintain signal integrity over longer distances.
