Mellanox OFED for Linux User Manual
Rev 2.3-1.0.1
Document Revision History
Table 1 - Document Revision History
Release 2.3-1.0.1, September 2014:
• Major restructuring of the User Manual
• Added the following sections:
  • Section 2.6, "Installing MLNX_OFED using apt-get", on page 30
  • Section 2.6.1, "Setting up MLNX_OFED apt-get Repository", on page 30
  • Section 2.6.2, "Installing MLNX_OFED using the apt-get Tool", on page 30
  • Section 2.6.3, "Uninstalling Mellanox OFED using the apt-get Tool", on page 31
Rev 2.3-1.0.1 About this Manual This preface provides general information concerning the scope and organization of this User’s Manual. Intended Audience This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Table 3 - Glossary (Sheet 2 of 2)
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative and holds the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch, providing the list of ports to which received multicast packets are forwarded. The table is organized by MLID.
Rev 2.3-1.0.1 Related Documentation Table 4 - Reference Documents Document Name Description InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 The InfiniBand Architecture Specification that is provided by IBTA IEEE Std 802.3ae™-2002 (Amendment to IEEE Std 802.
Rev 2.3-1.0.1 1 Introduction 1.1 Overview Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.
Rev 2.3-1.0.1 1.2.1 Introduction mlx4 VPI Driver mlx4 is the low level driver implementation for the ConnectX® family adapters designed by Mellanox Technologies. ConnectX® family adapters can operate as an InfiniBand adapter, or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations.
Rev 2.3-1.0.1 • MLX5_SHUT_UP_BF • Disables blue flame feature • Otherwise - do not disable • MLX5_SINGLE_THREADED • All spinlocks are disabled • Otherwise - spinlocks enabled • Used by applications that are single threaded and would like to save the overhead of taking spinlocks.
Rev 2.3-1.0.1 Introduction provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 3.3.1, “SCSI RDMA Protocol (SRP)”. uDAPL User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative.
Rev 2.3-1.0.1 • Generation of a standard or customized Mellanox firmware image for burning—in .bin (binary) or .img format • Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device • Querying the firmware version loaded on an HCA board • Displaying the VPD (Vital Product Data) of an HCA board • flint This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device.
Rev 2.3-1.0.1 Introduction • IPoIB, RDS*, SRP Initiator and SRP * NOTE: RDS was not tested by Mellanox Technologies. • MPI • Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces • OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces • MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) • OpenSM: InfiniBand Subnet Manager • Utilities • Diagnostic tools • Performance tests 1.3.
1.4 Module Parameters
1.4.1 mlx4 Module Parameters
In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core <parameter>=<value>
and/or
options mlx4_ib <parameter>=<value>
and/or
options mlx4_en <parameter>=<value>
The following sections list the available mlx4 parameters; a combined example is shown after the mlx4_en parameter list below.
1.4.1.1 mlx4_core Parameters
Rev 2.3-1.0.1 Introduction probe_vf: log_num_mgm_entry_size: high_rate_steer: fast_drop: enable_64b_cqe_eqe: log_num_mac: log_num_vlan: log_mtts_per_seg: port_type_array: log_num_qp: log_num_srq: log_rdmarc_per_qp: log_num_cq: log_num_mcg: log_num_mpt: log_num_mtt: enable_qos: internal_err_reset: 20 Mellanox Technologies Either a single value (e.g.
1.4.1.3 mlx4_en Parameters
inline_thold: Threshold for using inline data (int). Default and maximum value is 104 bytes. Saves a PCI read transaction; packets smaller than the threshold size are copied directly to the hardware buffer.
udp_rss: Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS is performed for incoming UDP traffic.
pfctx: Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint).
pfcrx: Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint).
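For instance, a combined /etc/modprobe.conf entry using parameters from the lists above might look as follows (the values shown are illustrative assumptions, e.g. taking a port_type_array value of 2 to mean Ethernet; consult the parameter descriptions before use):

options mlx4_core port_type_array=2,2
options mlx4_en udp_rss=1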
Rev 2.3-1.0.1 2 Installation Installation This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. 2.1 Hardware and Software Requirements Table 1 - Software and Hardware Requirements Requirements 2.
Rev 2.3-1.0.1 2.3 Installing Mellanox OFED The installation script, mlnxofedinstall, performs the following: • Discovers the currently installed kernel • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel) • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically1 upgrades the firmware Usage .
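For instance, a default interactive installation, or one that skips the automatic firmware upgrade, might be run as follows (the --without-fw-update flag name is an assumption; check ./mlnxofedinstall --help for the exact option set of your package version):

# ./mlnxofedinstall
# ./mlnxofedinstall --without-fw-update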
Example
The following command will create a MLNX_OFED_LINUX TGZ archive for RedHat 6.3 under the /tmp directory.
# ./MLNX_OFED_LINUX-x.x-x-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-x.x-x-rhel6.3-x86_64/ --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.
Rev 2.3-1.0.1 In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following: Device #1: ---------Device Type: ConnectX3Pro Part Number: MCX354A-FCC_Ax Description: ConnectX-3 Pro VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE;PCIe3.0 x8 8GT/s;RoHS R6 PSID: MT_1090111019 PCI Device Name: 0000:05:00.0 Versions: Current Available FW 2.31.5000 2.31.5000 PXE 3.4.0224 3.4.
Rev 2.3-1.0.1 2.3.5 mlnxofedinstall Return Codes Table 2 lists the mlnxofedinstall script return codes and their meanings. Table 2 - mlnxofedinstall Return Codes Return Code 2.4 Meaning 0 The Installation ended successfully 1 The installation failed 2 No firmware was found for the adapter device 22 Invalid parameter 28 Not enough free space 171 Not applicable to this system configuration. This can occur when the required hardware is not present on the system. 172 Prerequisites are not met.
Rev 2.3-1.0.1 Installation http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox # wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox --2014-04-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox Resolving www.mellanox.com... 72.3.194.0 Connecting to www.mellanox.com|72.3.194.0|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 1354 (1.
Rev 2.3-1.0.1 Step 9. Check that the repository was successfully added. # yum repolist Loaded plugins: product-id, security, subscription-manager This system is not registered to Red Hat Subscription Management. tion-manager to register. repo id repo name mlnx_ofed MLNX_OFED Repository rpmforge RHEL 6Server - RPMforge.net - dag You can use subscripstatus 108 4,597 repolist: 8,351 2.5.
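Once the repository is set up, MLNX_OFED can be installed through yum; for example, installing the full set of packages via the group mechanism (the group name mlnx-ofed-all is taken from the apt-get example in Section 2.6.2 and should be verified against your repository):

# yum groupinstall mlnx-ofed-all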
• Run the mlnxofedinstall script with the "--fw-update-only" flag, or
• Update the firmware to the latest version available on Mellanox Technologies' Web site as described in Section 2.7, "Updating Firmware After Installation", on page 31.
2.5.3 Uninstalling Mellanox OFED using the YUM Tool
If MLNX_OFED was installed using the yum tool, then it can be uninstalled as follows:
yum groupremove '<MLNX_OFED package group name>'
Step 2. Install the desired group.
apt-get install '<package group name>'
Example:
apt-get install mlnx-ofed-all
2.6.3 Uninstalling Mellanox OFED using the apt-get Tool
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
2.7 Updating Firmware After Installation
The following command burns firmware onto the ConnectX® device with the device name obtained in the example of Step 2.
> flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_31_5050-MCX353A-FCA_A1.bin burn
Step 4. Reboot your machine after the firmware burning is completed.
2.8 UEFI Secure Boot
All kernel modules included in MLNX_OFED for RHEL7 and SLES12 are signed with an x.509 key to support loading the modules when Secure Boot is enabled.
2.8.1 Enrolling Mellanox's x.509 Public Key On your Systems
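A common enrollment flow uses the mokutil utility; a sketch, assuming Mellanox's public key has been downloaded to mlnx_signing_key_pub.der (the file name, download location, and the reboot-time MOK confirmation step are assumptions):

# mokutil --import mlnx_signing_key_pub.der

mokutil prompts for a one-time password; on the next boot, the key is confirmed in the MOK manager screen using that password.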
Rev 2.3-1.0.1 To remove the signature from the MLNX_OFED kernel modules: # rpm -qa | grep -E "kernel-ib|mlnx-ofa_kernel" | xargs rpm -ql | grep "\.
Rev 2.3-1.0.1 Features Overview and Configuration 3 Features Overview and Configuration 3.1 Ethernet Network 3.1.1 Interface 3.1.1.1 Port Type Management ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type use the connectx_port_config script after the driver is loaded.
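For example, to display the current configuration (the -s flag is also shown in Section 3.2.1.1) and then change it, run the script; invoking it with no arguments is assumed here to start its interactive mode:

# /sbin/connectx_port_config -s
# /sbin/connectx_port_config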
Rev 2.3-1.0.1 Upon driver start up: 1. Sense the adapter card’s port type: If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module): • Set the port type to the sensed link type (IB/Ethernet) Otherwise: • Set the port type as default (Ethernet) During driver run time: • Sense a link every 3 seconds if no link is sensed/detected • If sensed, set the port type as sensed 3.1.1.
Rev 2.3-1.0.1 Features Overview and Configuration Counter rx_dropped Number of receive packets which were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol.
Rev 2.3-1.0.1 Counter Description tx_1548_bytes_packets Number of transmitted 1523-to-1548-octet frames tx_gt_1548_bytes_packets Number of transmitted 1549-or-greater-octet frames Counter Description rx_prio__packets Total packets successfully received with priority i. rx_prio__bytes Total bytes in successfully received packets with priority i. rx_novlan_packets Total packets successfully received with no VLAN priority.
Rev 2.3-1.0.1 • Example for IPoIB interfaces: SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="32", NAME="ib0" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x1", ATTR{type}=="32", NAME="ib1" 3.1.2 Quality of Service (QoS) Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma_cm connection) and manage its guarantees, limitations and its priority over other flows.
Rev 2.3-1.0.1 Features Overview and Configuration • If the underlying device is not a VLAN device, the tc command is used. In this case, even though tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping. Mapping the sk_prio to the UP is done by using tc_wrap.py -i -u 0,1,2,3,4,5,6,7 4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
Rev 2.3-1.0.1 1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP: • Sets qp_attrs.ah_attrs.sl = up • Calls modify_qp with IB_QP_AV set in the mask 2. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used When using Raw Ethernet QP mapping, the TOS/sk_prio to UP mapping is lost. Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP.
Strict Priority
When setting a TC's transmission algorithm to 'strict', that TC has absolute (strict) priority over the lower-numbered strict TCs (precedence is determined by the TC number: TC 7 is the highest priority, TC 0 is the lowest). It also has absolute priority over non-strict TCs (ETS). This property needs to be used with care, as it may easily cause starvation of other TCs.
Rev 2.3-1.0.1 Usage: mlnx_qos -i [options] Options: --version show program's version number and exit -h, --help show this help message and exit -p LIST, --prio_tc=LIST maps UPs to TCs. LIST is 8 comma seperated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1 -s LIST, --tsa=LIST Transmission algorithm for each TC. LIST is comma seperated algorithm names for each TC. Possible algorithms: strict, etc. Example: ets,strict,ets sets TC0,TC2 to ETS and TC1 to strict.
Rev 2.3-1.0.1 Set ratelimit. 3Gbps for tc0 4Gbps for tc1 and 2Gbps for tc2: tc: 0 ratelimit: 3 Gbps, up: 0 skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: skprio: up: 1 up: 2 up: 3 up: 4 up: 5 up: 6 up: 7 tsa: strict 0 1 2 (tos: 8) 3 4 (tos: 24) 5 6 (tos: 16) 7 8 9 10 11 12 13 14 15 Configure QoS. map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc 2. set tc0,tc1 as ets and tc2 as strict.
Rev 2.3-1.0.1 Features Overview and Configuration up: 1 up: 2 up: 3 tc: 2 ratelimit: 2 Gbps, tsa: strict up: 4 up: 5 up: 6 tc and tc_wrap.py The 'tc' tool is used to setup sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The 'tc_wrap.py' tool will use either the sysfs or the 'tc' tool to configure the sk_prio to UP mapping. Usage: tc_wrap.
Rev 2.3-1.0.1 UP UP UP UP UP UP 2 3 4 5 6 7 Additional Tools tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available. 3.1.3 • mlnx_qos tool • tc_wrap.py (package: ofed-scripts) requires python >= 2.5 (package: ofed-scripts) requires python >= 2.
Rev 2.3-1.0.1 priority 0: rpg_enable: 0 rppp_max_rps: 1000 rpg_time_reset: 1464 rpg_byte_reset: 150000 rpg_threshold: 5 rpg_max_rate: 40000 rpg_ai_rate: 10 rpg_hai_rate: 50 rpg_gd: 8 rpg_min_dec_fac: 2 rpg_min_rate: 10 cndd_state_machine: 0 priority 1: rpg_enable: 0 rppp_max_rps: 1000 rpg_time_reset: 1464 rpg_byte_reset: 150000 rpg_threshold: 5 rpg_max_rate: 40000 rpg_ai_rate: 10 rpg_hai_rate: 50 rpg_gd: 8 rpg_min_dec_fac: 2 rpg_min_rate: 10 cndd_state_machine: 0 ............................. .............
Rev 2.3-1.0.1 3.1.4 Features Overview and Configuration Ethtool ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices.
Rev 2.3-1.0.1 Table 3 - ethtool Supported Options Options Description ethtool -C eth adaptive-rx on|off Enables/disables adaptive interrupt moderation. ethtool -C eth [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N] Sets the values for packet rate limits and for moderation time high and low values. ethtool -C eth [rx-usecs N] [rxframes N] Sets the interrupt coalescing settings when the adaptive moderation is disabled.
Rev 2.3-1.0.1 3.1.5 Features Overview and Configuration Checksum Offload MLNX_OFED supports the following Receive IP/L4 Checksum Offload modes: • CHECKSUM_UNNECESSARY: By setting this mode the driver indicates to the Linux Networking Stack that the hardware successfully validated the IP and L4 checksum so the Linux Networking Stack does not need to deal with IP/L4 Checksum validation.
Rev 2.3-1.0.1 • Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port • With RoCE, the alternate path is not set for RC QP and therefore APM is not supported • Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection.
Rev 2.3-1.0.1 Features Overview and Configuration Figure 2: RoCE and v2 Protocol Stack 3.1.6.2 RoCE Configuration In order to function reliably, RoCE requires a form of flow control. While it is possible to use global flow control, this is normally undesirable, for performance reasons. The normal and optimal way to use RoCE is to use Priority Flow Control (PFC). To use PFC, it must be enabled on all endpoints and switches in the flow path.
• Ports facing the network should be configured as trunk ports, and use the Priority Code Point (PCP) field for priority flow control
For further information on how to configure SwitchX, please refer to the SwitchX User Manual.
3.1.6.2.3 Configuring the RoCE Mode
RoCE mode is configured via the module parameter, roce_mode, in the mlx4_core kernel module. The value can be set globally, in which case it is applied to all HCAs on the node, or specifically to an HCA (identified by BDF).
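A sketch of a global setting in the modprobe configuration follows; the numeric value and its mapping to RoCE versions, as well as the per-BDF syntax, are assumptions and should be checked against the parameter description in the release notes:

options mlx4_core roce_mode=2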
Rev 2.3-1.0.1 Features Overview and Configuration Step 2. Display the existing MLNX_OFED version. # ibv_devinfo hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.31.
The fw_ver parameter shows that the firmware version is 2.31.5050. The firmware version can also be obtained by running the following commands:
# cat /sys/class/infiniband/mlx4_0/fw_ver
2.31.5050
#
Although the InfiniBand over Ethernet Port MTU is 2K bytes at maximum, the actual MTU cannot exceed the mlx4_en interface's MTU. Since the mlx4_en interface's MTU is typically 1560, port 2 will run with an MTU of 1K. Please note that RoCE's MTU is subject to InfiniBand MTU restrictions.
Rev 2.3-1.0.1 Features Overview and Configuration Step 2. Make sure that ping is working. # ping 20.4.3.219 PING 20.4.3.219 (20.4.3.219) 56(84) bytes of data. 64 bytes from 20.4.3.219: icmp_seq=1 ttl=64 time=0.873 ms 64 bytes from 20.4.3.219: icmp_seq=2 ttl=64 time=0.198 ms 64 bytes from 20.4.3.219: icmp_seq=3 ttl=64 time=0.167 ms --- 20.4.3.219 ping statistics --3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.167/0.412/0.873/0.326 ms 3.1.6.2.
Rev 2.3-1.0.1 Step 4. Examine the GID table. # cat /sys/class/infiniband/mlx4_0/ports/2/gids/0 fe80:0000:0000:0000:0202:c9ff:fe08:e811 # # cat /sys/class/infiniband/mlx4_0/ports/2/gids/1 fe80:0000:0000:0000:0202:c900:0708:e811 3.1.6.2.11Running an - ibv_rc_pingpong Test on the VLAN Step 1. Start the server.
Rev 2.3-1.0.1 Features Overview and Configuration Step 2. Use rdma_cm test on the client. # ucmatose -s 20.4.3.219 cmatose: starting client cmatose: connecting receiving data transfers sending replies data transfers complete test complete return status 0 # This server-client run is without PCP or VLAN because the IP address used does not belong to a VLAN interface. If you specify a VLAN IP address, then the traffic should go over VLAN. 3.1.6.
Rev 2.3-1.0.1 Step 2. Enable ECN CLI. options mlx4_ib en_ecn=1 Step 3. Restart the driver. /etc/init.d/openibd restart Step 4. Mount debugfs to access ECN attributes. mount -t debugfs none /sys/kernel/debug/ Please note, mounting of debugfs is required.
Rev 2.3-1.0.1 Features Overview and Configuration pm_qos feature is both global and static, once a request is issued, it is enforced on all CPUs and does not change in time. MLNX_OFED provides an option to trigger a request when required and to remove it when no longer required. It is disabled by default and can be set/unset through the ethtool priv-flags. For further information on how to enable/disable this feature, please refer to Table 3, “ethtool Supported Options,” on page 50. 3.1.
Rev 2.3-1.0.1 SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
Rev 2.3-1.0.1 Features Overview and Configuration Receive side time sampling: • Enabled by ifreq.hwtstamp_config.
Rev 2.3-1.0.1 a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket. When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to Documentation/networking/timestamping.txt in kernel.
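On the shell side, hardware time stamping can be toggled with the hwstamp_ctl utility from the linuxptp package, which wraps the SIOCSHWTSTAMP ioctl described above (the interface name, and the availability of the utility on your distribution, are assumptions):

# hwstamp_ctl -i eth2 -t 1 -r 1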
Rev 2.3-1.0.1 Features Overview and Configuration For example: struct ibv_exp_device_attr attr; ibv_exp_query_device(context, &attr); if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) { if (attr.timestamp_mask) { /* Time stamping is supported with mask attr.timestamp_mask */ } } if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) { if (attr.hca_core_clock) { /* reporting the device's clock is supported. */ /* attr.
Rev 2.3-1.0.1 Querying the Hardware Time Querying the hardware for time is done via the ibv_exp_query_values verb. For example: ret = ibv_exp_query_values(context, IBV_EXP_VALUES_HW_CLOCK, &queried_values); if (!ret && queried_values.comp_mask & IBV_EXP_VALUES_HW_CLOCK) queried_time = queried_values.hwclock; To change the queried time in nanoseconds resolution, use the IBV_EXP_VALUES_HW_CLOCK_NS flag along with the hwclock_ns field.
Rev 2.3-1.0.1 Features Overview and Configuration bit Operation Description b1 Disable IPoIB Flow Steering When set to 1, it disables the support of IPoIB Flow Steering. This bit should be set to 1 when "b2- Enable A0 static DMFS steering" is used (see Section 3.1.11.3, “A0 Static Device Managed Flow Steering”, on page 69). b2 Enable A0 static DMFS steering (see Section 3.1.11.3, “A0 Static Device Managed Flow Steering”, on page 69) When set to 1, A0 static DMFS steering is enabled.
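For example, flow steering is commonly enabled by setting this parameter in the modprobe configuration; the value -1 is the widely used "let the driver enable DMFS" setting for mlx4_core, but verify it against the bit definitions above for your release:

options mlx4_core log_num_mgm_entry_size=-1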
Rev 2.3-1.0.1 Flow Steering support in InfiniBand is determined according to the EXP_MANAGED_FLOW_STEERING flag. 3.1.11.3 A0 Static Device Managed Flow Steering This mode enables fast steering, however it might impact flexibility. Using it increases the packet rate performance by ~30%, with the following limitations for Ethernet link-layer unicast QPs: • Limits the number of opened RSS Kernel QPs to 96. MACs should be unique (1 MAC per 1 QP). The number of VFs is limited.
Rev 2.3-1.0.1 Features Overview and Configuration Be advised that as of MLNX_OFED v2.0-3.0.0, the parameters (both the value and the mask) should be set in big-endian format. Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter.
The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.
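As an illustration, a rule that steers TCP/IPv4 traffic from a given source IP and to a given destination port into receive ring 4 might be attached and listed as follows (the interface name, addresses, and ring index are hypothetical):

ethtool -U eth5 flow-type tcp4 src-ip 192.168.1.5 dst-port 5001 action 4 loc 1
ethtool -u eth5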
Rev 2.3-1.0.1 Features Overview and Configuration 3.2 InfiniBand Network 3.2.1 Interface 3.2.1.1 Port Type Management ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type use the connectx_port_config script after the driver is loaded. Running “/sbin/connectx_port_config -s” will show current port configuration for all ConnectX devices.
Rev 2.3-1.0.1 3.2.2 OpenSM OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called “opensm”, accompanied by a testing application called “osmtest”. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15). 3.2.2.
Rev 2.3-1.0.1 Features Overview and Configuration Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes. 3.2.2.1.3 Running opensm The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
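For example, to run opensm bound to a specific local port GUID and with a dedicated log file (the GUID shown is hypothetical; -g and -f are standard opensm options):

# opensm -g 0x0002c9030002fb73 -f /var/log/opensm.log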
Rev 2.3-1.0.1 The default mode runs all the flows except for the Quality of Service flow (see Section 3.2.2.6). After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file: host1# osmtest -f c Immediately afterwards, run the following command to test opensm: host1# osmtest -f a Finally, it is recommended to occasionally run “osmtest -v” (with verbosity) to verify that nothing in the fabric has changed. 3.2.2.
Rev 2.3-1.0.1 Features Overview and Configuration • : [|]* | • : [,] • : [=[full|limited|both]] where • PortGUID GUID of partition member EndPort. Hexadecimal numbers should start from 0x, decimal numbers are accepted too. full, limited Indicates full and/or limited membership for this both port. When omitted (or unrecognized) limited membership is assumed.
PortGUIDs list:
PortGUID - GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited - indicates full or limited membership for this port. When omitted (or unrecognized) limited membership is assumed.
There are several useful keywords for PortGUID definition:
• 'ALL' means all end ports in this subnet.
• 'ALL_CAS' means all Channel Adapter end ports in this subnet.
• 'ALL_SWITCHES' means all Switch end ports in this subnet.
Rev 2.3-1.0.1 Features Overview and Configuration mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group ALL=full; The following rule is equivalent to how OpenSM used to run prior to the partition manager: Default=0x7fff,ipoib:ALL=full; 3.2.2.4 Effect of Topology Changes If a link is added or removed, OpenSM may not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop.
Rev 2.3-1.0.1 7. “Unicast Routing Cache” Unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation (for example, when one or more CAs/RTRs/leaf switches going down, or one or more of these nodes coming back after being down). 8.
connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure). The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences).
Rev 2.3-1.0.1 ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any Constant Bisectional Ratio (CBB )ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks.
/|\ /|\
/ | \ / | \
Going down to compute nodes
To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.
Rev 2.3-1.0.1 Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available. In general LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic and fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down.
Rev 2.3-1.0.1 • Features Overview and Configuration Two levels of QoS, assuming switches support 8 data VLs • Ability to route around a single failed switch, and/or multiple failed links, without: • introducing credit loops • changing path SL values • Very short run times, with good scaling properties as fabric size increases 3.2.2.5.6.
Rev 2.3-1.0.1 Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments.
Rev 2.3-1.0.1 Features Overview and Configuration set. As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T: In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q using a VL with bit 1 set.
Rev 2.3-1.0.1 tion, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus. Torus-2QoS uses these ideas to create a master spanning tree.
Rev 2.3-1.0.1 Features Overview and Configuration Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example. 3.2.2.5.6.
Rev 2.3-1.0.1 and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS. Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated.
(looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology. Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).
Rev 2.3-1.0.1 portgroup_max_ports max_ports - This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails. port_order p1 p2 p3 ...
Rev 2.3-1.0.1 Features Overview and Configuration 4. Define routing engine chains over previously defined topologies and configuration files. Defining Port Groups The basic idea behind the port groups is the ability to divide the fabric into sub-groups and give each group an identifier that can be used to relate to all nodes in this group. The port groups is a separate feature from the routing chains, but is a mandatory prerequisite for it.
Rev 2.3-1.0.1 Parameter guid list Description Comma separated list of guids to include in the group. If no specific physical ports were configured, all physical ports of the guid are chosen. However, for each guid, one can detail specific physical ports to be included in the group.
Rev 2.3-1.0.1 Features Overview and Configuration Parameter Example port name One can configure a list of hostnames as a rule. Hosts with a node description that is built out of these hostnames will be chosen. Since the node description contains the network card index as well, one might also specify a network card index and a physical port to be chosen. For example, the given configuration will cause only physical port 2 of a host with the node description ‘kuku HCA-1’ to be chosen.
Rev 2.3-1.0.1 Parameter subtract rule Description One can define a rule that subtracts one port group from another. The given rule, for example, will cause all the ports which are a part of grp1, but not included in grp2, to be chosen. In subtraction (unlike union), the order does matter, since the purpose is to subtract the second group from the first one. There is no option to define more than two groups for union/subtraction.
Rev 2.3-1.0.
Rev 2.3-1.0.1 Table 4 - Topology Qualifiers Parameter Description Example id Topology ID. Legal Values – any positive value. Must be unique. id: 1 sw-grp Name of the port group that includes all switches and switch ports to be used in this topology. sw-grp: ys_switches hca-grp Name of the port group that includes all HCA’s to be used in this topology. hca-grp: ys_hosts Configuration File per Routing Engine Each engine in the routing chain can be provided by its own configuration file.
Rev 2.3-1.0.1 Features Overview and Configuration Routing Engine Qualifiers Unlike unicast-step and end-unicast-step which do not require a colon, all qualifiers must end with a colon (':'). Also - a colon is a predefined mark that must not be used inside qualifier values. An inclusion of a colon in the qualifier values will result in the policy's failure. Parameter id Description ‘id’ is mandatory. Without an id qualifier for each engine, the policy fails.
Rev 2.3-1.0.1 Parameter fallback-to Description This is an optional qualifier that enables one to define the current unicast step as a fallback to another unicast step. This can be done by defining the id of the unicast step that this step is a fallback to. • • • • path-bit Example - If undefined, the current unicast step is not a fallback. If the value of this qualifier is a non-existent engine id, this step will be ignored.
If, for example, engine 2 runs ftree and it has a fallback engine with 3 as its id that runs minhop, one should expect to find 2 sets of dump files, one for each engine:
• opensm-lid-matrix.2.ftree.dump
• opensm-lid-matrix.3.minhop.dump
• opensm.fdbs.2.ftree
• opensm.fdbs.3.minhop
3.2.2.6 Quality of Service Management in OpenSM
When Quality of Service (QoS) in OpenSM is enabled (using the '-Q' or '--qos' flags), OpenSM looks for a QoS Policy file.
Rev 2.3-1.0.1 • Port GUID • Port name, which is a combination of NodeDescription and IB port number • PKey, which means that all the ports in the subnet that belong to partition with a given PKey belong to this port group • Partition name, which means that all the ports in the subnet that belong to partition with a given name belong to this port group • Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port).
Rev 2.3-1.0.1 Features Overview and Configuration Simple QoS Policy Definition Simple QoS policy definition comprises of a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion - its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and QoS Level has only one constraint - Service Level (SL).
Rev 2.3-1.0.1 name: Storage # "use" is just a description that is used for logging # Other than that, it is just a comment use: SRP Targets port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA port-guid: 0x1000000000FFFF end-port-group port-group name: Virtual Servers # The syntax of the port name is as follows: # "node_description/Pnum". # node_description is compared to the NodeDescription of the node, # and "Pnum" is a port number on that node.
Rev 2.3-1.0.1 Features Overview and Configuration name: DEFAULT use: default QoS Level sl: 0 end-qos-level # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime qos-level name: WholeSet sl: 1 mtu-limit: 4 rate-limit: 5 pkey: 0x1234 packet-life: 8 end-qos-level end-qos-levels # Match rules are scanned in order of their apperance in the policy file. # First matched rule takes precedence.
Rev 2.3-1.0.1 source: Virtual Servers destination: Storage service-id: 0x0000000000010000-0x000000000001FFFF pkey: 0x0F00-0x0FFF qos-level-name: WholeSet end-qos-match-rule end-qos-match-rules Simple QoS Policy - Details and Examples Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
Rev 2.3-1.0.
Rev 2.3-1.0.1 connect to. Default port number for RDS is 0x48CA, which makes a default Service-ID 0x00000000010648CA. The following two match rules are equivalent: rds : any, service-id 0x00000000010648CA : 3.2.2.6.4 SRP Service ID for SRP varies from storage vendor to vendor, thus SRP query is matched by the target IB port GUID.
Rev 2.3-1.0.1 Features Overview and Configuration qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 qos_swe_max_vls 15 qos_swe_high_limit 0 qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 VL arbitration tables (both high and low) are lists of VL/Weight pairs.
Rev 2.3-1.0.1 Deployment Example Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs. Figure 4: Example QoS Deployment on InfiniBand Subnet 3.2.2.7 QoS Configuration Examples The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.
Rev 2.3-1.0.1 Features Overview and Configuration In the following policy file example, replace OST* and MDS* with the real port GUIDs.
Rev 2.3-1.0.1 ipoib :1 sdp :1 srp, target-port-guid SRPT1,SRPT2,SRPT3 :2 end-qos-ulps • OpenSM options file qos_max_vls 8 qos_high_limit 0 qos_vlarb_high 1:32,2:32 qos_vlarb_low 0:1, qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15 EDC (3-tier): IPoIB, RDS, SRP The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
Rev 2.3-1.0.1 Features Overview and Configuration rds :3 srp, target-port-guid SRPT1, SRPT2, SRPT3 : 4 end-qos-ulps • OpenSM options file qos_max_vls 8 qos_high_limit 0 qos_vlarb_high 1:32,2:96,3:96,4:96 qos_vlarb_low 0:1 qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15 • Partition configuration file Default=0x7fff, ipoib : ALL=full; PartA=0x8001, sl=1, ipoib : ALL=full; Adaptive Routing Adaptive Routing is at beta stage.
Rev 2.3-1.0.1 Running Subnet Manager with Adaptive Routing Manager Adaptive Routing (AR) Manager can be enabled/disabled through SM options file. 3.2.2.7.1 Enabling Adaptive Routing To enable Adaptive Routing, perform the following: 1. Create the Subnet Manager options file. Run: opensm -c 2. Add 'armgr' to the 'event_plugin_name' option in the file: # Event plugin name(s) event_plugin_name armgr 3.
Rev 2.3-1.0.1 Features Overview and Configuration others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'. Adaptive Routing Manager Options File The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following: 1.
Rev 2.3-1.0.1 3.2.2.7.3 General AR Manager Options Option File ENABLE: AR_ALGORITHM: AR_MODE: Description Values Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as device that does not support AR, AR Manager will not try to enable AR on this switch.
Rev 2.3-1.0.1 Features Overview and Configuration Option File LOG_SIZE: Description This option defines maximal AR Manager log file size in MB. The logfile will be truncated and restarted upon reaching this limit. This option cannot be changed on-the-fly. Values 0: unlimited log file size. Default: 5 Per-switch AR Options A user can provide per-switch configuration options with the following syntax: SWITCH { ; ; ...
Rev 2.3-1.0.1 AGEING_TIME: 44; } SWITCH 0xabcde { ENABLE: false; } 3.2.2.8 Congestion Control Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of Mellanox OFED installation.
event_plugin_options ccmgr --conf_file <cc-mgr options file name>
2. Run the SM with the new options file: 'opensm -F <options file name>'
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration. For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 117.
• The default is: 0x200
• When the number of send/receive errors or timeouts exceeds 'max_errors' in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters:
max_errors
error_window
• The values are:
max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking. [0-48K]
• The default is: 5
Option File Description Values
ccti_min Sets the CC Table Index (CCTI) minimum. Default: 0
cct Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table. When the value is set to 0, the CCT calculation is based on the number of nodes. Default: 0
ccti_timer Sets the given ccti timer for all SLs.
Rev 2.3-1.0.1 Figure 5: I/O Consolidation Over InfiniBand The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow utilization of fabric resources.
Rev 2.3-1.0.1 Features Overview and Configuration the policy, so clients (ULPs, programs) can obtain a policy enforced QoS. The SM may also set up partitions with appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime. 3. IPoIB is being setup. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition. 4.
Rev 2.3-1.0.1 Path Bits are not implemented in OFED. Matching Rules A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed in order such as the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply.
Rev 2.3-1.0.1 3.2.4 Features Overview and Configuration Secure Host Secure host enables the device to protect itself and the subnet from malicious software.
Rev 2.3-1.0.1 If you do not explicitly restore hardware access when the maintenance operation is completed, the driver restart will NOT do so. The driver will come back after restart with hardware access disabled. Note, though, that the SMP firewall will still be active. A host reboot will restore hardware access (with SMP firewall active).
Rev 2.3-1.0.1 3.2.5 Features Overview and Configuration Upper Layer Protocols 3.2.5.1 IP over InfiniBand (IPoIB) The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service.
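The transport mode of an IPoIB interface can be inspected and switched at runtime through sysfs; a minimal sketch, assuming the standard upstream IPoIB mode file and an interface named ib0:

# cat /sys/class/net/ib0/mode
datagram
# echo connected > /sys/class/net/ib0/mode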
3.2.5.1.1.1 Port Configuration
The physical port MTU in Datagram mode (which indicates the port capability) has a default value of 4K, whereas the IPoIB port MTU (the "logical" MTU) has a default value of 2K, as it is set by OpenSM.
To change the IPoIB MTU to 4K, edit the OpenSM partition file in the section of the IPoIB setting as follows:
Default=0xffff, ipoib, mtu=5 : ALL=full;
*Where "mtu=5" indicates that all IPoIB ports in the fabric are using 4K MTU ("mtu=4" indicates 2K MTU).
Rev 2.3-1.0.1 Features Overview and Configuration The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier. 3.2.5.1.2.2 DHCP Server In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc.
Rev 2.3-1.0.1 ration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface: • A static IPoIB configuration • An IPoIB configuration based on an Ethernet configuration See your Linux distribution documentation for additional information about configuring IP addresses. The following code lines are an excerpt from a sample IPoIB configuration file: # Static settings; all values provided by this file IPADDR_ib0=11.4.3.175 NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.
Rev 2.3-1.0.1 Features Overview and Configuration Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration: host1$ ifconfig ib0 b0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.
Rev 2.3-1.0.1 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in Section 3.2.5.1.2.5. Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized. 3.2.5.1.3.
• The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup)
• Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence, the only supported value is the default: 0 (or "none" in SLES11)
For a persistent bonding IPoIB Network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g., ifcfg-bond0) ... (see the sketch below)
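A minimal sketch of such a persistent configuration on a Red Hat style system, assuming a bond master bond0 with a single IPoIB slave ib0 (addresses and file names are illustrative only):

/etc/sysconfig/network-scripts/ifcfg-bond0:
    DEVICE=bond0
    IPADDR=10.10.10.1
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=active-backup miimon=100"

/etc/sysconfig/network-scripts/ifcfg-ib0:
    DEVICE=ib0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none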
Rev 2.3-1.0.1 Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge The diagram below describes the topology that was created after these steps: The diagram shows how the traffic from the Virtual Machine goes to the virtual-bridge in the Hypervisor and from the bridge to the eIPoIB interface. eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath. 3.2.
Rev 2.3-1.0.1 Features Overview and Configuration For example, on a system with dual port HCA, the following two interfaces might be created; eth4 and eth5. cat /sys/class/net/eth_ipoib_interfaces eth4 over IB port: ib0 eth5 over IB port: ib1 These interfaces can be used to configure the network for the guest.
Rev 2.3-1.0.1 Figure 6: An Example of a Virtual Network The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines. To display the services provided to the Virtual Machine interfaces: # cat /sys/class/net/eth0/eth/vifs Example: # cat /sys/class/net/eth0/eth/vifs SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A In the example above the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface. 3.2.5.2.
Rev 2.3-1.0.1 Features Overview and Configuration 3.2.5.2.4 Setting Performance Tuning • Use 4K MTU over OpenSM. Default=0xffff, ipoib, mtu=5 : ALL=full; • Use MTU for 4K (4092 Bytes): • In UD mode, the maximum MTU value is 4092 Bytes Make sure that all interfaces (including the guest interface and its virtual bridge) have the same MTU value (MTU 4092 Bytes). For further information of MTU settings, please refer to the Hypervisor User Manual.
Rev 2.3-1.0.1 request to open an interface on a specific gateway identifying it by the BridgeX box and eport name. Distinguishing between gateways is essential because they determine the network topology and affect the path that a packet traverses between hosts. A packet that is sent from the host on a specific EoIB interface will be routed to the Ethernet subnet through a specific external port connection on the BridgeX box. 3.2.5.3.1.
Rev 2.3-1.0.1 Features Overview and Configuration Both forms of configuration supply the same functionality. If both forms of configuration files exist, the central configuration file has precedence and only this file will be used. 3.2.5.3.2.2 Central Configuration File - /etc/infiniband/mlx4_vnic.conf The mlx4_vnic.conf file consists of lines, each describing one vNic.
Rev 2.3-1.0.1 BXADDR=BX001 BXEPORT=A10 VNICIBPORT=mlx4_0:1 VNICVLAN=3 (Optional field) GW_PKEY=0xfff1 The fields used in the file for vNic configuration have the following meaning: Table 6 - Red Hat Linux mlx4_vnic.conf file format Field Description DEVICE An optional field. The name of the interface that is displayed when running ifconfig. If it is not present, the trailer of the configuration file name (e.g. ifcfg-eth47 => "eth47") is used instead. HWADDR The mac address to assign the vNic.
Rev 2.3-1.0.1 Features Overview and Configuration The MAC and VLAN values are set using the configuration files only, other tools such as (vconfig) for VLAN modification, or (ifconfig) for MAC modification, are not supported. 3.2.5.3.2.5 EoIB Network Administered vNic In network administered mode, the configuration of the vNic is done by the BridgeX®. If a vNic is configured for a specific host, it will appear on that host once a connection is established between the BridgeX and the mlx4_vnic module.
Rev 2.3-1.0.1 Add "VNICVLAN=" or remove VNICVLAN property for no VLAN Using a VLAN tag value of 0 is not recommended because the traffic using it would not be separated from non VLAN traffic. For Host administered vNics, VLAN entry must be set in the BridgeX first. For further information, please refer to BridgeX® documentation. 3.2.5.3.2.8 EoIB Multicast Configuration Configuring Multicast for EoIB interfaces is identical to multicast configuration for native Ethernet interfaces.
Rev 2.3-1.0.1 Features Overview and Configuration Ethernet configuration files are located at /etc/sysconfig/network-scripts/ on a RedHat machine and at /etc/sysconfig/network/ on a SuSE machine. 3.2.5.3.2.13 Sub Interfaces (VLAN) EoIB interfaces do not support creating sub interfaces via the vconfig command, unless working in ALL VLAN mode.. To create interfaces with VLAN, refer to Section 3.2.5.3.2.7, “Configuring VLANs”, on page 140. 3.2.5.3.3 Retrieving EoIB Information 3.2.5.3.3.
Rev 2.3-1.0.1 To query the link state run the following command and look for "Link detected": ethtool Example: ethtool eth10 Settings for eth10: Supported ports: [ ] Supported link modes: Supports auto-negotiation: No Advertised link modes: Not reported Advertised auto-negotiation: No Speed: Unknown! (10000) Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: off Supports Wake-on: d Wake-on: d Current message level: 0x00000000 (0) Link detected: yes 3.2.5.3.3.
Rev 2.3-1.0.1 Features Overview and Configuration 3.2.5.3.3.5 Discovery Partitions Configuration EoIB enables mapping of VLANs to InfiniBand partitions. Mapping VLANs to partitions causes all EoIB data traffic and all vNic related control traffic to be sent to the mapped partitions. In rare cases, it might be useful to ensure that EoIB discovery packets (packets used for discovery of Gateways (GWs) and vice versa) are sent to a non default partition.
Rev 2.3-1.0.1 ALL VLAN must be supported by both the BridgeX® and by the host side. When enabling ALL VLAN, all gateways (LAG or legacy) that have eports belonging to a gateway group (GWG) must be configured to the same behavior. For example it is impossible to have gateway A2 configured to all-vlan mode and A3 to regular mode, because both belong to GWG A.
Rev 2.3-1.0.1 Features Overview and Configuration Example: # mlx4_vnic_info -g A2 IOA_PORT mlx4_0:1 BX_NAME bridge-119c64 BX_GUID 00:02:c9:03:00:11:61:67 EPORT_NAME A2 EPORT_ID 63 STATE connected GW_TYPE LEGACY PKEY 0xffff ALL_VLAN yes • vNic Support To verify the vNIC is configured to All-VLAN mode.
To check the current module parameters, run:
mlx4_vnic_info -P
Default RX/TX rings number is the number of logical CPUs (threads). To set non-default values to module parameters, the following line should be added to the modprobe configuration file (e.g. /etc/modprobe.conf file):
options mlx4_vnic <param_name>=<value> <param_name>=<value> ...
For additional information about discovery_pkeys please refer to Section 3.2.5.3.3.5, "Discovery Partitions Configuration", on page 144
3.2.5.3.4.4 Driver Configuration
For PV-EoIB to work properly, the following features must be disabled in the driver:
• Large Receive Offload (LRO)
• TX completion polling
• RX fragmented buffers
To disable the features above, edit the modprobe configuration file as follows:
options mlx4_vnic lro_num=0 tx_polling=0 rx_linear=1
For the full list of mlx4_vnic module parameters, run:
# modinfo mlx4_vnic
b. Enslave it to a virtual bridge to be used by the Guest OS.
The VLAN tagging/untagging is transparent to the Guest and is managed at the EoIB driver level. The vconfig utility is not supported by the EoIB driver; a new vNic instance must be created instead. For further information, see Section 3.2.5.3.2.6, “VLAN Configuration”, on page 140.
Virtual Guest Tagging (VGT) is not supported. The model explained above applies to Virtual Switch Tagging (VST) only.
For further information on how to increase dom0_mem, please refer to:
http://support.citrix.com/article/CTX126531
b. Lower the mlx4_vnic driver's memory consumption by decreasing its RX/TX ring count and length. For further information, please refer to Section 3.2.5.3.4.1, “Module Parameters”, on page 146.
3.2.6 Advanced Transport
3.2.6.1 Atomic Operations
3.2.6.3 Dynamically Connected Transport (DCT)
The Dynamically Connected Transport (DCT) service is an extension of the transport services that enables a higher degree of scalability while maintaining high performance for sparse traffic. DCT reduces the total number of QPs required system-wide by having reliable-type QPs dynamically connect to and disconnect from any remote node. DCT connections stay connected only while they are active.
The following environment variables can be used to control error cases and contiguity:
Parameter: MLX_MR_ALLOC_TYPE
Description: Configures the allocator type.
• ALL (Default) - Uses all possible allocators and selects the most efficient one.
• ANON - Enables the usage of anonymous pages and disables the allocator.
• CONTIG - Forces the usage of the contiguous pages allocator.
The request to share the MR can be repeated multiple times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
Usage:
• Use the “handle” field that was returned from ibv_exp_reg_mr as the mr_handle.
• Supply the desired “access mode” for that MR.
• Supply the address field, which can be either NULL or any hint for the required output.
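The following is a minimal sketch of sharing an MR, assuming the experimental verb ibv_exp_reg_shared_mr and the input-structure field names shown; only the “handle”, access-mode, and address fields are quoted in the text above, and the remaining names are assumptions based on this release's experimental verbs header:
/* Sketch: register a second MR over the physical memory of orig_mr. */
struct ibv_exp_reg_shared_mr_in in;
memset(&in, 0, sizeof(in));                 /* requires <string.h> */
in.mr_handle  = orig_mr->handle;            /* "handle" returned by ibv_exp_reg_mr */
in.pd         = pd;                         /* protection domain for the new MR */
in.addr       = NULL;                       /* NULL, or a hint for the output address */
in.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE; /* desired access mode for this MR */

struct ibv_mr *shared_mr = ibv_exp_reg_shared_mr(&in);
if (!shared_mr)
        perror("ibv_exp_reg_shared_mr");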
@access: If IBV_EXP_REREG_MR_CHANGE_ACCESS is set in flags, this field specifies the new memory access rights; otherwise, this parameter is ignored.
@attr:
• Create memory regions, which support the definition of regular non-contiguous memory regions.
3.2.7.6 On-Demand-Paging (ODP)
On-Demand-Paging (ODP) is a technique that alleviates many of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space or track the validity of the mappings.
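As a minimal sketch, assuming the experimental registration call ibv_exp_reg_mr and the IBV_EXP_ACCESS_ON_DEMAND access flag as exposed by this release (pd, buf, and len are assumed to exist), an ODP memory region could be requested as follows:
struct ibv_exp_reg_mr_in in;
memset(&in, 0, sizeof(in));                 /* requires <string.h> */
in.pd         = pd;
in.addr       = buf;      /* pages need not be pinned; faults are resolved on demand */
in.length     = len;
in.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE |
                IBV_EXP_ACCESS_ON_DEMAND;   /* request an ODP MR */

struct ibv_mr *odp_mr = ibv_exp_reg_mr(&in);
if (!odp_mr)
        perror("ibv_exp_reg_mr (ODP)");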
To check which operations are supported for a given transport, the capabilities field needs to be masked with one of the following masks:
enum ibv_odp_transport_cap_bits {
        IBV_EXP_ODP_SUPPORT_SEND     = 1 << 0,
        IBV_EXP_ODP_SUPPORT_RECV     = 1 << 1,
        IBV_EXP_ODP_SUPPORT_WRITE    = 1 << 2,
        IBV_EXP_ODP_SUPPORT_READ     = 1 << 3,
        IBV_EXP_ODP_SUPPORT_ATOMIC   = 1 << 4,
        IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,
};
For example, to check if RC supports send:
if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND)
Example:
struct ibv_exp_prefetch_attr prefetch_attr;
prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;
prefetch_attr.addr = addr;
prefetch_attr.length = length;
prefetch_attr.comp_mask = 0;
ibv_exp_prefetch_mr(mr, &prefetch_attr);
For further information, please refer to the ibv_exp_prefetch_mr manual page.
3.2.7.6.6 ODP Statistics
To aid in debugging and performance measurements and tuning, ODP support includes an extensive set of statistics.
Counter Name            Description
num_failed_resolutions  Number of failed page faults that could not be resolved due to non-existing mappings in the OS.
num_mrs_not_found       Number of faults that specified a non-existing ODP MR.
num_odp_mr_pages        Total size, in pages, of current ODP MRs.
num_odp_mrs             Number of current ODP MRs.
3.2.7.7 Inline-Receive
When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE.
For example:
struct ibv_exp_device_attr device_attr = {.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1};
ibv_exp_query_device(context, &device_attr);
if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW ||
    device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MW_TYPE_2B) {
        /* Memory window is supported */
3.2.7.7.4 Allocating Memory Window
Allocating a memory window is done by calling the ibv_alloc_mw verb.
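For illustration, a type-1 memory window can be allocated and later freed as follows (pd is an existing protection domain; error handling shortened):
/* Allocate a type-1 memory window on an existing protection domain. */
struct ibv_mw *mw = ibv_alloc_mw(pd, IBV_MW_TYPE_1);
if (!mw)
        perror("ibv_alloc_mw");
/* ... bind and use the window ... */
ibv_dealloc_mw(mw);  /* release it when no longer needed */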
3.3 Storage Protocols
3.3.1 SCSI RDMA Protocol (SRP)
The SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric.
reconnect_delay   Time between successive reconnect attempts of the SRP initiator to a disconnected target, until the dev_loss_tmo timer expires (if enabled); after that, the SCSI target will be removed.
fast_io_fail_tmo  Number of seconds between the observation of a transport layer error and failing all I/O.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
pkey=ffff,service_id=[service[0] value] > \
/sys/class/infiniband_srp/srp-mlx[hca number]-[port number]/add_target
See “SRP Tools - ibsrpdm, srp_daemon and srpd Service Script” on page 164 for instructions on how the parameters in this echo command may be obtained.
service_id       A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target. How to find out the value of the service ID is specified in the documentation of the SRP target.
max_sect         A decimal number specifying the maximum number of 512-byte sectors to be transferred via a single SCSI command.
max_cmd_per_lun  A decimal number specifying the maximum number of outstanding commands for a single LUN.
• A service script, srpd, which may be started at stack startup.
The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided in their man pages.
Below, several usage scenarios for these utilities are presented.
3.3.1.1.4 ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a.
b. To establish a connection with an SRP Target using the output from the ‘ibsrpdm -c’ example above, execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the ‘fdisk -l’ command.
• Executing srp_daemon over a port without the -a option will display only the targets that are reachable via the port and to which the initiator is not connected. When executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. (See Section for more details.)
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf.
each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:
id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,\
dgid=fe800000000000000002c90200402bed,pkey=ffff,\
service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
Notes:
1.
It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the blacklist of multipath. Edit the ‘blacklist’ section in /etc/multipath.conf and make sure the SRP LUNs are not blacklisted.
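For example, a sketch of a blacklist exception that keeps an SRP LUN visible to multipath (the WWID shown is a hypothetical value; use the WWIDs of your own SRP LUNs):
blacklist {
        # local (non-SRP) disks kept out of multipath
        devnode "^sda$"
}
blacklist_exceptions {
        # hypothetical WWID of an SRP LUN that must stay multipathed
        wwid "36001405abcdef0123456789"
}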
3.3.2 iSCSI Extensions for RDMA (iSER)
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
3.3.2.1 iSER Initiator
Setting up the iSER target is outside the scope of this manual. For guidelines on how to do so, please refer to the relevant target documentation (e.g. stgt, clitarget).
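A minimal sketch of connecting an open-iscsi initiator over iSER (the portal IP and target IQN are hypothetical):
# Discover the target portal (IP is hypothetical)
iscsiadm -m discovery -t sendtargets -p 10.0.0.1:3260
# Switch the node's transport from TCP to iSER
iscsiadm -m node -T iqn.2014-09.com.example:tgt1 -o update \
        -n iface.transport_name -v iser
# Log in over iSER
iscsiadm -m node -T iqn.2014-09.com.example:tgt1 -l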
3.4 Virtualization
3.4.1 Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. In ConnectX®-3 adapter cards, Mellanox adapters can expose up to 126 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately.
Step 2. Enable "Intel Virtualization Technology".
Step 3. Install a hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, for Intel systems, add:
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-36.
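The kernel line above is cut off; as a hedged sketch only (the root device and version string are illustrative, not taken from this manual), a complete line would end with the IOMMU parameter enabled:
kernel /vmlinuz-2.6.32-36.x86-645 ro root=/dev/VolGroup00/LogVol00 intel_iommu=on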
1. Verify that the following fields appear in the [HCA] section:
[HCA]
num_pfs = 1
total_vfs = <0-126>
sriov_en = true
Parameter: num_pfs
Recommended Value: 1
Note: This field is optional and might not always appear.
Parameter: total_vfs
Recommended Value:
• When using firmware version 2.31.5000 and above, the recommended value is 126.
• When using firmware version 2.30.8000 and below, the recommended value is 63.
Note: Before setting the number of VFs in SR-IOV, please make sure your system can support that number of VFs.
Parameter: num_vfs
Recommended Value:
• If absent, or zero: no VFs will be available.
• If its value is a single number in the range of 0-63: the driver will enable num_vfs VFs on the HCA, and this will be applied to all ConnectX® HCAs on the host.
Notes:
• PFs not included in the above list will not have SR-IOV enabled.
• Triplets and single-port VFs are only valid when all ports are configured as Ethernet. When an InfiniBand port exists, only the num_vfs=a syntax is valid, where “a” is a single value that represents the number of VFs.
• The second parameter in a triplet is valid only when there is more than one physical port.
Parameter: probe_vf
Recommended Value:
• probe_vf=1,2,3 - The PF driver will activate 1 VF on physical port 1, 2 VFs on physical port 2, and 3 dual-port VFs (applies only to dual-port HCAs when all ports are Ethernet ports). This applies to all ConnectX® HCAs in the host.
• probe_vf=00:04.0-5;6;7,00:07.0-8;9;10 - The PF driver will activate:
• HCA positioned in BDF 00:04.
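Putting the two parameters together, a minimal sketch of a module configuration file (the file path may vary by distribution and is an assumption here):
# /etc/modprobe.d/mlx4_core.conf (path assumed)
# Enable 5 VFs on each ConnectX HCA and probe 1 of them on the hypervisor:
options mlx4_core num_vfs=5 probe_vf=1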
Step 10. Load the driver and verify that SR-IOV is supported. Run:
lspci | grep Mellanox
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.1 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.2 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.3 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.4 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.5 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Step 4. Attach a virtual NIC to the VM.
ifconfig -a
…
eth6      Link encap:Ethernet  HWaddr 52:54:00:E7:77:99
          inet addr:13.195.15.5  Bcast:13.195.255.255  Mask:255.255.0.0
          inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:481 errors:0 dropped:0 overruns:0 frame:0
          TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22440 (21.9 KiB)  TX bytes:19232 (18.
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, so it is not necessary to add it.
3.4.1.5 Uninstalling SR-IOV Driver
To uninstall the SR-IOV driver, perform the following:
Step 1. For hypervisors, detach all the Virtual Functions (VFs) from all the Virtual Machines (VMs) or stop the Virtual Machines that use the Virtual Functions.
Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.
Virtual Function InfiniBand Ports
Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, or applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.
• <pci id> directories - one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual-to-physical mapping of gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can.
These directories have the structure:
• <pci id>/port/<m>/gid_idx/0, where m = 1..2 (this is read-only), and
• <pci id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126
For instructions on configuring pkey_idx, please see below.
3.4.1.6.3 Multi-GUID Support in InfiniBand
As of MLNX_OFED v2.2-1.0.0, InfiniBand VFs in an SR-IOV setting can have more than a single GUID for their purposes. In total, there are 128 GUIDs per port: the PF occupies 2 entries, and the remaining GUIDs are divided equally between all the VFs. If there is any remainder, those GUIDs are given to the VFs with the lowest IDs. For example, with 14 VFs, each VF receives (128 - 2) / 14 = 9 GUIDs.
3.4.1.6.4 Partitioning IPoIB Communication using PKeys
PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero.
The following describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2.
Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a. Check the PCI ID for the Physical Function and the Virtual Functions.
lspci | grep Mel
Step b. Assuming that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0", on Host1 do the following:
cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0  0000:02:00.1  0000:02:00.2 ...
The feature may be controlled on the Hypervisor from userspace via iproute2 / netlink:
ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
        ...
        [ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ] ]
        ...
        [ spoofchk { on | off } ]
Use:
ip link set dev <PF device> vf <NUM> vlan <vlan id> [qos <qos>]
• where NUM = 0..max-vf-num
• vlan_id = 0..4095 (4095 means "set VGT")
• qos = 0..7
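For example (device name and VF number are illustrative):
# Assign VLAN 100 with priority (qos) 3 to VF 0 of PF p1p1:
ip link set dev p1p1 vf 0 vlan 100 qos 3
# Return VF 0 to VGT (no hypervisor-enforced VLAN):
ip link set dev p1p1 vf 0 vlan 4095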
When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under. In the example above, vf 38 is not assigned to the same port as p1p1, in contrast to vf 0. However, even VFs that are not assigned to the net device can be used to set and change its settings. For example, the following is a valid command to change the spoof check:
ip link set dev p1p1 vf 38 spoofchk on
This command will affect only vf 38.
The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this entry is zero (disabled). When set to “1”, the SMI will be enabled for the VF on the next rebind or openibd restart on the VM that the VF is bound to. If the VF is currently bound, it must be unbound and then re-bound.
The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indicates disabled, and 1 indicates enabled. This entry is read-only.
Rev 2.3-1.0.1 Features Overview and Configuration 3.4.1.8.1 Configuring VGT+ The default operating mode is VGT: cat /sys/class/net/eth5/vf0/vlan_set oper: admin: Both states (operational and administrative) are empty. If you set the vlan_set parameter with more the 10 VLAN IDs, the driver chooses the first 10 VLAN IDs provided and ignores all the rest. To enable VGT+ mode: Step 1. Set the corresponding port/VF (in the example below port eth5 VF0) list of allowed VLANs.
To add a VLAN:
In the example below, the following state exists:
# cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3
admin: 0 1 2 3
Step 1. Make the operational VLAN set identical to the administrative VLAN set:
echo 2 3 4 5 6 > /sys/class/net/eth5/vf0/vlan_set
The delta will be added to the operational state immediately (4 5 6):
# cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3 4 5 6
admin: 2 3 4 5 6
Step 2. Reset the VF for changes to take effect.
3.4.2 VXLAN
(e.g. 1450 instead of 1500), or the uplink NIC MTU has to be increased by 50 bytes (e.g. 1550 instead of 1500).
• From upstream 3.15-rc1 and onward, it is possible to use an arbitrary UDP port for VXLAN. Note that this requires firmware version 2.31.2800 or higher. Additionally, you need to enable the kernel configuration option CONFIG_MLX4_EN_VXLAN=y.
• On upstream kernels 3.12/3.13, GRO with VXLAN is not supported.
3.5 Resiliency
3.5.1.4 Forcing the VF to Reset
If an outside "reset" is forced by using the PCI sysfs entry for a VF, a reset is executed on that VF once it runs any command over its communication channel. For example, the following command can be used on a hypervisor to reset a VF defined by 0000:04:00.1:
echo 1 > /sys/bus/pci/devices/0000\:04\:00.1/reset
4 InfiniBand Fabric Utilities
This section first describes common configuration, interface, and addressing for all the tools in the package.
4.1 Common Configuration, Interface and Addressing
4.1.1 Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent).
4.3 Addressing
This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination (tool option ‘-d’): This option defines a directed route of output port numbers from the local port to the destination.
Table 7 - Diagnostic Utilities (Sheet 2 of 6)
Utility    Description
ibcongest  Provides static congestion analysis. It calculates routing for a given topology (topo-mode) or uses extracted lst/fdb files (lst-mode). Additionally, it analyzes congestion for a traffic schedule provided in a "schedule-file" or uses an automatically generated schedule of all-to-all-shift.
Table 7 - Diagnostic Utilities (Sheet 3 of 6)
Utility     Description
ibdiagpath  Traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path. The way ibdiagpath operates depends on the addressing mode used on the command line.
Table 7 - Diagnostic Utilities (Sheet 4 of 6)
Utility        Description
iblinkinfo     Reports link info for each port in an InfiniBand fabric, node by node. Optionally, iblinkinfo can do partial scans and limit its output to parts of a fabric. For further information, please refer to the tool’s man page.
ibnetdiscover  Performs InfiniBand subnet discovery and outputs a human-readable topology file.
Table 7 - Diagnostic Utilities (Sheet 5 of 6)
Utility   Description
ibstat    ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state. For further information, please refer to the tool’s man page.
ibstatus  Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.
Table 7 - Diagnostic Utilities (Sheet 6 of 6)
Utility   Description
mstflint  Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access. To run mstflint, you must know the device location on the PCI bus.
          Note: If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.
• Bandwidth of links can be reduced if cable performance degrades and LLR retransmissions become too numerous. Traditional IB bandwidth performance utilities can be used to monitor any bandwidth impact.
Due to these factors, an LLR retransmission rate counter has been added to the ibdiagnet utility to give end users an indication of link health.
To monitor the LLR retransmission rate:
1. Run ibdiagnet; no special flags are required.
2.
Table 8 - Diagnostic Utilities (Sheet 2 of 3)
Utility      Description
ib_read_lat  Calculates the latency of RDMA read operations of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-reads the memory of the other side only after the other side has read its memory.
Table 8 - Diagnostic Utilities (Sheet 3 of 3)
Utility           Description
raw_ethernet_lat  Calculates the latency of sending a packet of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which a packet is sent only when one is received. Each side samples the CPU each time it receives a packet in order to calculate the latency. Using "-a" provides results for all message sizes.
5 Troubleshooting
You may be able to easily resolve the issues described in this section. If a problem persists and you are unable to resolve it yourself, please contact your Mellanox representative or Mellanox Support at support@mellanox.com.
5.1 General Related Issues
Table 9 - General Related Issues
Issue: The system panics when it is booted with a failed adapter installed.
Cause: Malfunctioning hardware component.
Solution: 1. Remove the failed adapter. 2.
Table 10 - Ethernet Related Issues
Issue: Degraded performance is measured when having a mixed-rate environment (10GbE, 40GbE and 56GbE).
Cause: Sending traffic from a node with a higher rate to a node with a lower rate.
Solution: Enable Flow Control on both the switch ports and the nodes:
• On the server side run: ethtool -A <interface> rx on tx on
• On the switch side, run the following commands on the relevant interface: send on force and receive on force
Issue: No link with break-out cable.
5.4 InfiniBand/Ethernet Related Issues
Table 12 - InfiniBand/Ethernet Related Issues
Issue: Physical link fails to negotiate to maximum supported rate.
Cause: The adapter is running an outdated firmware.
Solution: Install the latest firmware on the adapter.
Issue: Physical link fails to come up while port physical state is Polling.
Cause: The cable is not connected to the port, or the port on the other end of the cable is disabled.
5.5 Installation Related Issues
Table 13 - Installation Related Issues
Issue: Driver installation fails.
Cause: The install script may fail for the following reasons:
• Using an unsupported installation option
• Failure to uninstall the previous installation due to dependencies in use
• The operating system is not supported
• The kernel is not supported. You can run mlnx_add_kernel_support.sh
5.6 Performance Related Issues
Table 14 - Performance Related Issues
Issue: The driver works, but the transmit and/or receive data rates are not optimal.
Solution: These recommendations may assist with gaining immediate improvement:
1. Confirm that the PCI link is negotiated at its maximum capability.
2. Stop the IRQ Balancer service: /etc/init.d/irq_balancer stop
3. Start the mlnx_affinity service.
Table 15 - SR-IOV Related Issues
Issue: When assigning a VF to a VM, the following message is reported on the screen: "PCI-assgine: error: requires KVM support"
Cause: SR-IOV and virtualization are not enabled in the BIOS.
Solution:
1. Verify that they are both enabled in the BIOS.
2. Add the following kernel parameter to the GRUB configuration file: "intel_iommu=on" (see Section 3.4.1.2, “Setting Up SR-IOV”, on page 171).
5.9 RDMA Related Issues
Table 17 - RDMA Related Issues
Issue: Perftest utilities, such as 'ib_write_bw', fail between systems with different driver releases.
Cause: Running a test between 2 systems in the fabric with different perftest packages installed.
Solution: Run the test using the same perftest RPM on both systems.