Mellanox OFED for Linux User Manual Rev 2.0-3.0.0 Last Updated: 03 October, 2013 www.mellanox.
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT (“PRODUCT(S)”) AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES “AS-IS” WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT.
Table of Contents
Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED Overview
4.1.1 Overview
4.1.2 SRP Initiator
4.2 iSCSI Extensions for RDMA (iSER)
4.2.1 Overview
4.13.2 Setting Up SR-IOV
4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
4.13.4 Assigning a Virtual Function to a Virtual Machine
4.13.5 Uninstalling SR-IOV Driver
4.13.6 Burning Firmware with SR-IOV
4.13.7
7.2.3 Preserving Your Performance Settings after a Reboot
7.2.4 Tuning Power Management
7.2.5 Interrupt Moderation
7.2.6 Tuning for NUMA Architecture
7.2.7 IRQ Affinity
7.2.8
8.9 Congestion Control
8.9.1 Congestion Control Overview
8.9.2 Running OpenSM with Congestion Control Manager
8.9.3 Configuring Congestion Control Manager
8.9.4 Configuring Congestion Control Manager Main Settings
C.1 mlx4_ib Parameters
C.2 mlx4_core Parameters
C.3 mlx4_en Parameters
Appendix D mlx5 Module Parameters
Appendix E Lustre Compilation over MLNX_OFED

List of Figures
Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards
Figure 2: I/O Consolidation Over InfiniBand
Figure 3: An Example of a Virtual Network
Figure 4: QoS Manager

List of Tables
Table 1: Document Revision History
Table 2: Abbreviations and Acronyms
Table 3: Glossary
Table 4: Reference Documents
Document Revision History
Table 1 - Document Revision History

Release: 2.0-3.0.0
Date: October 2013
Description: Updated the following sections:
• Appendix E, “Lustre Compilation over MLNX_OFED” page 226

Release: 2.0-3.0.0
Date: August 2013
Description: Updated the following sections:
• Section 1.3.4, “ULPs”, on page 21
• Section 4.12, “Flow Steering”, on page 77 and its subsections
• Section 1.3.3, “Mid-layer Core”, on page 21
• Section 4.
About this Manual
This Preface provides general information concerning the scope and organization of this User’s Manual.

Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Table 3 - Glossary (Sheet 2 of 2)
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch, providing the list of ports to which a received multicast packet is forwarded. The table is organized by MLID.
Table 4 - Reference Documents
Document Name: Description
Firmware Release Notes for Mellanox adapter devices: See the Release Notes PDF file relevant to your adapter device under docs/ folder of installed package.
MFT User’s Manual: Mellanox Firmware Tools User’s Manual. See under docs/ folder of installed package.
MFT Release Notes: Release Notes for the Mellanox Firmware Tools. See under docs/ folder of installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.
  • mlx4_en (Ethernet)
• Mid-layer core
  • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
  • IPoIB, RDS*, SRP Initiator and SRP
* NOTE: RDS was not tested by Mellanox Technologies.
1.3 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the range of markets that Mellanox OFED applies to.
mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer
1.3.2 mlx5 Driver
mlx5 is the low level driver implementation for the Connect-IB™ adapters designed by Mellanox Technologies. Connect-IB™ operates as an InfiniBand adapter. The mlx5 driver comprises the following kernel modules:
mlx5_core
Acts as a library of common functions (e.g.
MLX5_SCATTER_TO_CQE
• Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for RC transport.
• Default is 1, otherwise disabled.
1.3.3 Mid-layer Core
Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.3.5 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks.
This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII® switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
• Debug utilities
A set of debug utilities (e.g., itrace, mstdump, isw, and i2c)
For additional details, please refer to the MFT User’s Manual under docs/.
1.
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.
2.
2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs the following:
2.3.
Example
The following command will create a MLNX_OFED_LINUX ISO image for RedHat 6.3 under the /tmp directory.
# ./MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m / MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64 --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.1380.
2.3.3 Installation Procedure
Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine
host1# mount -o ro,loop MLNX_OFED_LINUX---.iso /mnt
Step 3. Run the installation script.
./mlnxofedinstall
This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Installing user level RPMs:
Preparing... ofed-scripts
Preparing... libibverbs
Preparing... libibverbs
Preparing... libibverbs-devel
Preparing... libibverbs-devel
Preparing... libibverbs-devel-static
Preparing... libibverbs-devel-static
Preparing... libibverbs-utils
Preparing... libmlx4
Preparing... libmlx4
Preparing... libmlx4-devel
Preparing... libmlx4-devel
Preparing... libmlx5
Preparing... libmlx5
Preparing... libmlx5-devel
Preparing... libmlx5-devel
Preparing... libcxgb3
Preparing...
Preparing... libcxgb4-devel
Preparing... libnes
Preparing... libnes
Preparing... libnes-devel-static
Preparing... libnes-devel-static
Preparing... libipathverbs
Preparing... libipathverbs
Preparing... libipathverbs-devel
Preparing... libipathverbs-devel
Preparing... libibcm
Preparing... libibcm
Preparing... libibcm-devel
Preparing... libibcm-devel
Preparing... libibumad
Preparing... libibumad
Preparing... libibumad-devel
Preparing... libibumad-devel
Preparing...
Preparing... libibmad-devel
Preparing... libibmad-devel
Preparing... libibmad-static
Preparing... libibmad-static
Preparing... ibsim
Preparing... ibacm
Preparing... librdmacm
Preparing... librdmacm
Preparing... librdmacm-utils
Preparing... librdmacm-devel
Preparing... librdmacm-devel
Preparing... opensm-libs
Preparing... opensm-libs
Preparing... opensm
Preparing... opensm-devel
Preparing... opensm-devel
Preparing... opensm-static
Preparing... opensm-static
Preparing... dapl
Preparing...
Preparing... dapl-devel-static
Preparing... dapl-devel-static
Preparing... dapl-utils
Preparing... perftest
Preparing... mstflint
Preparing... mft
Preparing... srptools
Preparing... rds-tools
Preparing... rds-devel
Preparing... ibutils2
Preparing... ibutils
Preparing... cc_mgr
Preparing... dump_pr
Preparing... ar_mgr
Preparing... ibdump
Preparing... infiniband-diags
Preparing... infiniband-diags-compat
Preparing... qperf
Preparing... fca
INFO: updating ...
- The FCA Manager and FCA MPI Runtime library are installed in /opt/mellanox/fca directory.
- The FCA Manager will not be started automatically.
- To start FCA Manager now, type: /etc/init.d/fca_managerd start
- There should be single process of FCA Manager running per fabric.
- To start FCA Manager automatically after boot, type: /etc/init.d/fca_managerd install_service
- Check /opt/mellanox/fca/share/doc/fca/README.txt for quick start instructions.
Preparing...
In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:
...
The firmware version on /dev/mst/mt26448_pci_cr0 - 2.9.1000 is up to date.
Note: To force firmware update use '--force-fw-update' flag.
The firmware version on /dev/mst/mt4099_pci_cr0 - 2.11.500 is up to date.
Note: To force firmware update use '--force-fw-update' flag.
Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 2
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-2.0-2.0.0 (OFED-2.0-2.0.0): 2.6.32-279.el6.x86_64
Host Driver RPM Check .................. PASS
Firmware on CA #0 NIC .................. v2.9.
Firmware
• The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
a. You run the installation script in default mode; that is, without the option ‘--without-fw-update’.
b.
host1# mst status
MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
MST Calibre (I2C) module is not loaded
MST devices:
------------
/dev/mst/mt25418_pciconf0 - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.reg=88 data.reg=92 Chip revision is: A0
/dev/mst/mt25418_pci_cr0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000 Chip revision is: A0
/dev/mst/mt25418_pci_msix0 - PCI direct access. bus:dev.
/dev/mst/mt25418_pci_uar0
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
Step 1. Download the tarball to your host. The image’s name has the format MLNX_OFED_LINUX--.tgz. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 2. Extract the MLNX_OFED tarball package to a shared location in your network.
# tar xzf MLNX_OFED_LINUX--rhel6.4-x86_64.tgz
Step 3.
2.5.2 Installing MLNX_OFED using the YUM Tool
After setting up the YUM repository for MLNX_OFED package, perform the following:
Step 1. View the available package groups by invoking:
# yum grouplist | grep MLNX_OFED
MLNX_OFED ALL
MLNX_OFED BASIC
MLNX_OFED GUEST
MLNX_OFED HPC
MLNX_OFED HYPERVISOR
MLNX_OFED VMA
MLNX_OFED VMA-ETH
MLNX_OFED VMA-VPI
Step 2. Install the desired group.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt
3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.
4 Driver Features
4.1 SCSI RDMA Protocol
4.1.1 Overview
As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric.
4.1.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.
• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:
ibsrpdm
This command will output information on each SRP Target detected, in human-readable form.
srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also
• Establish an SRP connection by itself (without the need to issue the “echo” command described in Section 4.1.2.
4.1.2.4 Automatic Discovery and Connection to Targets
• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run “srp_daemon -e -o”. This utility will scan the fabric once, connect to every Target it detects, and then exit.
srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:
id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,\
dgid=fe800000000000000002c90200402bed,pkey=ffff,\
service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.
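The example target string above can be taken apart programmatically. The following is a minimal Python sketch, not part of MLNX_OFED: the helper names are hypothetical, and the derivation rule is an assumption inferred only from the example, where initiator_ext equals the byte-reversed low 64 bits of dgid.

```python
def parse_target(line: str) -> dict:
    """Split a 'key=value,key=value' SRP target description into a dict."""
    fields = {}
    for item in line.strip().split(","):
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def initiator_ext_from_dgid(dgid: str) -> str:
    """Assumed convention: byte-reverse the low 64 bits of the dgid."""
    low_64_bits = bytes.fromhex(dgid[-16:])
    return low_64_bits[::-1].hex()


def with_initiator_ext(line: str) -> str:
    """Return the target description with an initiator_ext field set."""
    fields = parse_target(line)
    fields["initiator_ext"] = initiator_ext_from_dgid(fields["dgid"])
    return ",".join(f"{key}={value}" for key, value in fields.items())


# Values copied from the example in this section.
EXAMPLE_TARGET = (
    "id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,"
    "dgid=fe800000000000000002c90200402bed,pkey=ffff,"
    "service_id=200500a0b81146a1"
)
```

Calling with_initiator_ext(EXAMPLE_TARGET) reproduces the initiator_ext value shown in the example; whether the same rule holds for every adapter should be checked against the srp_daemon documentation.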
Manual Activation of High Availability
Initialization: (Execute after each boot of the driver)
1. Execute modprobe dm-multipath
2. Execute modprobe ib-srp
3. Make sure you have created file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:
srp_daemon -c -e -R 300 -i -p
This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Kill the SRP daemon instances.
c. Make sure there are no multipath instances running. If any are running, wait for them to end or kill them.
d. Run: multipath -F
3.
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service.
4.3.3 IPoIB Configuration
Unless you have run the installation script mlnxofedinstall with the flag ‘-n’, IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
To run the DHCP server from the command line, enter:
dhcpd -d
Example:
host1# dhcpd ib0 -d
4.3.3.1.2 DHCP Client (Optional)
A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under “Example: Adding an IB Driver to initrd (Linux)”.
In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier.
4.3.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the ‘-n’ option) containing the full IP configuration.
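A full static IP configuration usually needs the network and broadcast addresses in addition to the address and netmask. The following Python sketch derives them with the standard ipaddress module. It is only an illustration: the function name and dictionary keys are hypothetical and do not reflect the key names expected by the configuration file, and the address is an illustrative value.

```python
import ipaddress


def static_ip_fields(address: str, netmask: str) -> dict:
    """Derive the values a static interface configuration typically needs."""
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = interface.network
    return {
        "address": str(interface.ip),
        "netmask": str(network.netmask),
        "network": str(network.network_address),
        "broadcast": str(network.broadcast_address),
        "prefixlen": network.prefixlen,
    }


# Illustrative values for an ib0 interface.
fields = static_ip_fields("11.4.3.175", "255.255.0.0")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Deriving the dependent values from one address and one netmask avoids typing an inconsistent network or broadcast address by hand.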
• The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.
The following example shows how to verify the configuration:
host1$ ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.175 Bcast:11.4.
Step 3. Verify the configuration of this interface by running:
host1$ ifconfig .
Using the example of Step 2:
host1$ ifconfig ib0.8001
ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4.
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2
4.3.6 Bonding IPoIB
To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS).
Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.
• Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.
4.4 Quality of Service InfiniBand
4.4.1 Quality of Service Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
Figure 2: I/O Consolidation Over InfiniBand
[Figure labels: Servers, Unified I/O, Administrator, QoS Manager, IPC, Storage, InfiniBand Subnet, Net.]
4.4.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the “chronology approach” to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that defines how the network is configured and how its resources are split among the different QoS-Levels. The policy also defines how to decide which QoS-Level each application, ULP or service uses.
2.
II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up. In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
III. QoS-Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED.
IV.
4.4.5 OpenSM Features
The QoS related functionality that is provided by OpenSM, the Subnet Manager described in Chapter 8, can be split into two main parts:
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests.
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into the sk_prio using a fixed translation:
TOS 0 <=> sk_prio 0
TOS 8 <=> sk_prio 2
TOS 24 <=> sk_prio 4
TOS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the UP:
• If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is a per-VLAN mapping.
• If the underlying device is not a VLAN device, the tc command is used.
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
4.5.5 Raw Ethernet QP Quality of Service Mapping
Applications open a Raw Ethernet QP using verbs directly. The following is the RoCE QoS mapping flow:
1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
• Sets qp_attrs.ah_attrs.
• After mapping the skb_priority to UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos takes a list that maps UPs to TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.
4.5.
The tool will also display maps configured by the tc and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
• Set UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
• Set rate limit to TCs
For unlimited ratelimit set the ratelimit to 0.
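The UP to TC mapping is a positional list: the index is the user priority and the value is the traffic class. The following Python sketch shows how such a list can be validated and grouped per TC before it is handed to the tool. It is not part of mlnx_qos; the function names are hypothetical, and it assumes one entry for each of the eight user priorities.

```python
from collections import defaultdict

NUM_UPS = 8  # user priorities 0..7


def parse_up_to_tc(arg: str) -> list:
    """Parse a comma-separated list: index is the UP, value is the TC."""
    tcs = [int(item) for item in arg.split(",")]
    if len(tcs) != NUM_UPS:
        raise ValueError(f"expected {NUM_UPS} entries, got {len(tcs)}")
    if any(tc < 0 for tc in tcs):
        raise ValueError("traffic class numbers must be non-negative")
    return tcs


def group_by_tc(tcs: list) -> dict:
    """Return {tc: [user priorities mapped to that tc]}."""
    groups = defaultdict(list)
    for up, tc in enumerate(tcs):
        groups[tc].append(up)
    return dict(groups)


def describe_ratelimit(gbps: int) -> str:
    """A rate limit of 0 means unlimited, as stated above."""
    return "unlimited" if gbps == 0 else f"{gbps} Gbps"


for tc, ups in group_by_tc(parse_up_to_tc("0,0,0,0,1,1,1,1")).items():
    print(f"tc {tc}: UPs {ups}")
```

Grouping by TC makes it easy to confirm that every user priority lands in the intended traffic class before the mapping is applied.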
Set ratelimit. 3Gbps for tc0 4Gbps for tc1 and 2Gbps for tc2:
tc: 0 ratelimit: 3 Gbps, tsa: strict
    up: 0
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
    up: 1
    up: 2
    up: 3
    up: 4
    up: 5
    up: 6
    up: 7
Configure QoS. map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc 2.
as strict.
Usage: tc_wrap.py -i [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-u SKPRIO_UP, --skprio_up=SKPRIO_UP
maps sk_prio to UP. LIST is <=16 comma separated UP. index of element is sk_prio.
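The sk_prio to UP list follows the same positional idea: the index is the sk_prio and the value is the UP. The following Python sketch, which is not part of tc_wrap.py, parses such a list and chains it with the fixed ToS to sk_prio translation given in the QoS mapping section. The function names and the example list are hypothetical.

```python
# Fixed ToS -> sk_prio translation, as listed in the QoS mapping section.
TOS_TO_SKPRIO = {0: 0, 8: 2, 24: 4, 16: 6}

MAX_SKPRIO_ENTRIES = 16
MAX_UP = 7  # user priorities are 0..7


def parse_skprio_up(arg: str) -> dict:
    """Parse up to 16 comma-separated UPs; the list index is the sk_prio."""
    ups = [int(item) for item in arg.split(",")]
    if len(ups) > MAX_SKPRIO_ENTRIES:
        raise ValueError("at most 16 entries are allowed")
    if any(not 0 <= up <= MAX_UP for up in ups):
        raise ValueError("each UP must be in the range 0-7")
    return dict(enumerate(ups))


def tos_to_up(tos: int, skprio_up: dict) -> int:
    """Resolve a ToS value to a UP through the sk_prio stage."""
    sk_prio = TOS_TO_SKPRIO[tos]
    return skprio_up[sk_prio]


# Hypothetical mapping: sk_prio 0-1 -> UP 0, 2-3 -> UP 1, 4-5 -> UP 2, 6-7 -> UP 3.
mapping = parse_skprio_up("0,0,1,1,2,2,3,3")
print(tos_to_up(24, mapping))
```

With this mapping, ToS 24 resolves to sk_prio 4 and therefore to UP 2.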
4.6 Time-Stamping Service
Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.
Time Stamping is currently supported in ConnectX®-3/ConnectX®-3 Pro adapter cards only.
Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time.
• Enabled by ifreq.hwtstamp_config.tx_type when
/* possible values for hwtstamp_config->tx_type */
enum hwtstamp_tx_types {
    /*
     * No outgoing packet will need hardware time stamping;
     * should a packet arrive which asks for it, no hardware
     * time stamping will be done.
     */
    HWTSTAMP_TX_OFF,
    /*
     * Enables hardware time stamping for outgoing packets;
     * the sender of the packet decides which are to be
     * time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
     * before sending the packet.
Receive side time sampling:
• Enabled by ifreq.hwtstamp_config.
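Both the transmit and the receive settings are fields of struct hwtstamp_config, which is passed to the driver through an ifreq with the SIOCSHWTSTAMP ioctl. The following Python sketch builds such a request with ctypes. It is an illustration only: the constants mirror the Linux kernel headers, the ifreq layout assumes a 64-bit Linux host, the helper names are hypothetical, and the ioctl itself requires root privileges and a device that supports hardware time stamping.

```python
import ctypes
import fcntl
import socket

SIOCSHWTSTAMP = 0x89B0                 # from linux/sockios.h
HWTSTAMP_TX_OFF, HWTSTAMP_TX_ON = 0, 1              # enum hwtstamp_tx_types
HWTSTAMP_FILTER_NONE, HWTSTAMP_FILTER_ALL = 0, 1    # enum hwtstamp_rx_filters
IFNAMSIZ = 16


class HwtstampConfig(ctypes.Structure):
    """Mirror of struct hwtstamp_config (three ints)."""
    _fields_ = [("flags", ctypes.c_int),
                ("tx_type", ctypes.c_int),
                ("rx_filter", ctypes.c_int)]


class Ifreq(ctypes.Structure):
    """Minimal struct ifreq: name, data pointer, padding for the union."""
    _fields_ = [("ifr_name", ctypes.c_char * IFNAMSIZ),
                ("ifr_data", ctypes.c_void_p),
                ("_pad", ctypes.c_char * (24 - ctypes.sizeof(ctypes.c_void_p)))]


def build_request(ifname: str, tx_type: int, rx_filter: int):
    """Return (ifreq, config); the caller must keep config alive."""
    config = HwtstampConfig(0, tx_type, rx_filter)
    request = Ifreq()
    request.ifr_name = ifname.encode()
    request.ifr_data = ctypes.addressof(config)
    return request, config


def enable_hw_timestamping(ifname: str) -> HwtstampConfig:
    """Ask the driver to time stamp all sent and received packets."""
    request, config = build_request(ifname, HWTSTAMP_TX_ON, HWTSTAMP_FILTER_ALL)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        fcntl.ioctl(sock.fileno(), SIOCSHWTSTAMP, request)
    return config  # the driver may have adjusted rx_filter
```

A call such as enable_hw_timestamping("eth0") would then request time stamping on that interface; the interface name is only an example.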
a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.
When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to Documentation/networking/timestamping.txt in kernel.org
4.7 Atomic Operations
Atomic Operations are applicable to the mlx4 driver only.
4.7.
The diagram below describes the topology that was created after these steps:
The diagram shows how the traffic from the Virtual Machine goes to the virtual-bridge in the Hypervisor and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath.
4.8.
For example, on a system with dual port HCA, the following two interfaces might be created: eth4 and eth5.
cat /sys/class/net/eth_ipoib_interfaces
eth4 over IB port: ib0
eth5 over IB port: ib1
These interfaces can be used to configure the network for the guest.
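When scripting guest network setup, the listing above can be turned into a lookup table. The following Python sketch parses text in the format shown in the example; the function name is hypothetical, and the line format is assumed to match the example exactly.

```python
SEPARATOR = " over IB port: "


def parse_eipoib_interfaces(text: str) -> dict:
    """Map each eIPoIB Ethernet interface to its underlying IB port."""
    mapping = {}
    for line in text.splitlines():
        eth, found, ib_port = line.strip().partition(SEPARATOR)
        if found:  # skip blank or unrecognized lines
            mapping[eth] = ib_port
    return mapping


# Sample text copied from the example above.
SAMPLE = "eth4 over IB port: ib0\neth5 over IB port: ib1\n"
print(parse_eipoib_interfaces(SAMPLE))
```

On a live system the text would be read from /sys/class/net/eth_ipoib_interfaces instead of the sample string.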
Figure 3: An Example of a Virtual Network
[Figure labels: Host, ib0.2, ib0.3, KVM GUEST1, KVM GUEST2, IPoIB LAN (via port #1), vif0.2, vif0.3, eth0, tap0, tap1, br0]
The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines.
To display the services provided to the Virtual Machine interfaces:
# cat /sys/class/net/eth0/eth/vifs
Example:
# cat /sys/class/net/eth0/eth/vifs
SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
In the example above the ib0.
Rev 2.0-3.0.0 4.8.
1. Values are NOT case sensitive.
Usage:
The application calls the ibv_reg_mr API which turns on the IBV_ACCESS_ALLOCATE_MR bit and sets the input address to NULL. Upon success, the address field of the struct ibv_mr will hold the address to the allocated memory block. This block will be freed implicitly when the ibv_dereg_mr() is called.
• Turns on one or more of the sharing access bits via ibv_reg_mr. The sharing bits are described in the ibv_reg_mr man page.
• Turns on the IBV_ACCESS_ALLOCATE_MR bit
Step 2. Request to register to a shared MR
A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR.
4.12 Flow Steering
Flow Steering is applicable to the mlx4 driver only.
Flow steering is a new model which steers network flows based on flow specifications to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority.
• struct ibv_flow_attr - attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields:
struct ibv_flow_attr can be followed by the optional flow headers structs:
struct ibv_flow_spec_ib
struct ibv_flow_spec_eth
struct ibv_flow_spec_ipv4
struct ibv_flow_spec_tcp_udp
For further information, please refer to the ibv_create_flow man page.
• ethtool -u eth5
Shows all of ethtool’s steering rules
When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in kernel (v2.6.28).
MLX4 Driver Support
The mlx4 driver supports only a subset of the flow specification the ethtool API defines.
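The overwrite behavior described above for rules with the same priority can be pictured as a table keyed by priority. The following Python sketch is a toy in-memory model of that behavior, not the driver's or ethtool's implementation; the class name and the rule strings are hypothetical.

```python
class SteeringTable:
    """Toy model: one slot per priority, later inserts replace earlier ones."""

    def __init__(self):
        self._rules = {}

    def insert(self, priority: int, rule: str):
        """Store the rule; return the rule it replaced, or None."""
        replaced = self._rules.get(priority)
        self._rules[priority] = rule
        return replaced

    def rules(self) -> list:
        """Return (priority, rule) pairs ordered by priority."""
        return sorted(self._rules.items())


table = SteeringTable()
table.insert(1, "tcp dst-port 5001 -> queue 2")   # hypothetical rule text
old = table.insert(1, "udp dst-port 5002 -> queue 3")
print(old)
print(table.rules())
```

After the second insert only the UDP rule remains at priority 1, which is the table-like behavior the text describes.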
We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.
4.13 Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources.
Step 2. Enable "Intel Virtualization Technology".
Step 3. Install the hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel.
For example, for Intel systems, add:
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
root (hd0,0)
kernel /vmlinuz-2.6.32-36.
If the current firmware version is the same as the one provided with MLNX_OFED, run it in combination with the '--force-fw-update' parameter.

This configuration option is supported only for HCAs whose configuration file (INI) is included in MLNX_OFED.

Parameter    Recommended Value
num_pfs      1 (Note: This field is optional and might not always appear.)
total_vfs    63
sriov_en     true

• If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.
Rev 2.0-3.0.0 Parameter Recommended Value port_type_array Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices or list of BDF to port_type_array 'bb:dd.f-t1;t2,...'. (string) probe_vf Absent, or zero: • • • • No VFs will be used by the PF driver Its value is a single number in the range of 0-63. Physical Function driver will use probe_vf VFs and this will be applied to all ConnectX® HCAs on the host.
Rev 2.0-3.0.0 Driver Features Step 10. Load the driver and verify the SR-IOV is supported. Run: lspci | grep Mellanox 03:00.0 InfiniBand: Mellanox / 10GigE] (rev b0) 03:00.1 InfiniBand: Mellanox (rev b0) 03:00.2 InfiniBand: Mellanox (rev b0) 03:00.3 InfiniBand: Mellanox (rev b0) 03:00.4 InfiniBand: Mellanox (rev b0) 03:00.5 InfiniBand: Mellanox (rev b0) Technologies MT26428 [ConnectX VPI PCIe 2.
Rev 2.0-3.0.0 Step 4. Attach a virtual NIC to VM. ifconfig -a … eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99 inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.0 inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:481 errors:0 dropped:0 overruns:0 frame:0 TX packets:450 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB) Interrupt:10 Base address:0xa000 … Step 5.
Step 3. Go to Details->Add hardware ->PCI host device.
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1)
Step 5. If the Virtual Machine is up reboot it, otherwise start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
lspci | grep Mellanox
00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Step 7.
4.13.7 Configuring Pkeys and GUIDs under SR-IOV

4.13.7.1 Port Type Management

Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX® HCAs, or it may specify an individual configuration for each HCA.
To configure the GUID at index <n> on port <port_num>:
cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
echo <GUID> > <n>

Example (see note 1):
cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
echo "0x002fffff8118" > 3

1. echo "0x0" means let the SM assign a value to that GUID
   echo "0xffffffffffffffff" means delete that GUID
   echo <GUID> means request the SM to assign this GUID to this index

Step 3.
Rev 2.0-3.0.0 • The vm2's virt-to-phys pkey mapping will be: pkey_idx 0 = 2 pkey_idx 1 = 0 so that the default pkey will reside on the vms at index 1 instead of at index 0. The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys. To partition IPoIB communication using PKeys: Step 1. Create a file "/etc/opensm/partitions.conf" on the host on which OpenSM runs, containing lines.
vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.

Step d. On Host2 do the following.
cd /sys/class/infiniband/mlx4_0/iov
echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
echo 2 > 0000:03:00.2/ports/1/pkey_idx/0

Step e.
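The echo commands in Step d can be wrapped in a small helper so that each virtual-to-physical pkey index mapping is written explicitly. This is only a sketch: the function name map_pkey_idx and its argument order are hypothetical, and it assumes the sysfs layout shown above (iov/<VF BDF>/ports/<port>/pkey_idx/<virtual index>).

```shell
#!/bin/sh
# Hypothetical helper (not part of MLNX_OFED): map a VF's virtual pkey index
# to a physical pkey index by writing to the pkey_idx sysfs entry.
#   $1 = iov root, e.g. /sys/class/infiniband/mlx4_0/iov
#   $2 = VF PCI address, e.g. 0000:03:00.1
#   $3 = port number
#   $4 = virtual pkey index as seen by the VM
#   $5 = physical pkey index on the host
map_pkey_idx() {
    target="$1/$2/ports/$3/pkey_idx/$4"
    if [ ! -e "$target" ]; then
        echo "no such pkey_idx entry: $target" >&2
        return 1
    fi
    echo "$5" > "$target"
}
```

For example, `map_pkey_idx /sys/class/infiniband/mlx4_0/iov 0000:03:00.1 1 0 1` performs the same write as `echo 1 > 0000:03:00.1/ports/1/pkey_idx/0` in the step above.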
4.13.7.3.2 Additional Ethernet VF Configuration Options

• Guest MAC configuration
By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure guest MACs before the guest driver comes up.
4.15 Ethtool

ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices.
Rev 2.0-3.0.0 Table 6 - ethtool Supported Options Options Description ethtool -C eth adaptive-rx on|off Enables/disables adaptive interrupt moderation. ethtool -C eth [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N] Sets the values for packet rate limits and for moderation time high and low values. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
5 HPC Features

5.1 Shared Memory Access

The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
Rev 2.0-3.0.0 5.1.2 Running SHMEM with FCA The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager and MPI nodes without requiring additional hardware.
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

5.1.4 Running SHMEM with Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr. To activate MLNX_OFED 2.
These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 7 lists some useful MPI links.

Table 7 - Useful MPI Links
MPI Standard     http://www-unix.mcs.anl.gov/mpi
Open MPI         http://www.open-mpi.org
MVAPICH 2 MPI    http://mvapich.cse.ohio-state.edu/
MPI Forum        http://www.mpi-forum.org

This chapter includes the following sections: 5.2.2 • Section 5.
Rev 2.0-3.0.0 HPC Features -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub Step 3. Check the public key. host1$ cat id_rsa.
5.2.4 Compiling MPI Applications

Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html.
To review the default configuration of the installation, check the default configuration file:
/usr/mpi//mvapich-/etc/mvapich.conf

Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.

5.
To upgrade MLNX_OFED v2.0 or later with a newer MXM:
Step 1. Remove MXM v1.1.
rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI.
rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile the OpenMPI with it.

To run OpenMPI without MXM, run:
% mpirun -mca mtl ^mxm <...>

When upgrading to MXM v1.5, OMPI compiled with MXM v1.1 should be recompiled with MXM v1.5.

5.3.2 Enabling MXM in OpenMPI

MXM Rev 2.0-3.0.
5.3.4 Configuring Multi-Rail Support

Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.

To configure dual rail support:
• Specify the list of ports you would like to use to enable multi rail support.
-x MXM_RDMA_PORTS=cardName:portNum
mpirun -x MXM_RDMA_PORTS=mlx4_0:1,mlx4_0:2 <...>

5.3.
5.5 ScalableUPC

Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor.
Rev 2.0-3.0.0 Please note, the binary distribution of ScalableUPC is compiled with the following defaults: 5.5.2 • FCA support. FCA is disabled at runtime by default and must be configured prior to using it from the ScalableUPC. For further information, please refer to FCA User Manual.
Rev 2.0-3.0.0 HPC Features ScalableUPC contains modules configuration file (http://modules.sf.net) which can be found at /opt/mellanox/bupc/2.2/etc/bupc_modulefile. 5.5.3 Various Executable Examples The following are various executable examples.
6 Working With VPI

VPI allows ConnectX ports to be independently configured as either IB or Eth.

6.1 Port Type Management

ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type use the connectx_port_config script after the driver is loaded.
6.2 Auto Sensing

Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.

6.2.1 Enabling Auto Sensing

Upon driver start up:
1.
Rev 2.0-3.0.0 Performance 7 Performance 7.1 General System Configurations The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable. 7.1.1 PCI Express (PCIe) Capabilities Table 9 - Recommended PCIe Configuration PCIe Generation 3.
Rev 2.0-3.0.0 7.1.3.2 Intel® Sandy Bridge Processors The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.
7.1.3.3 Intel® Nehalem/Westmere Processors

The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.

Configuring the Completion Queue Stall Delay.
Table 12 - Recommended BIOS Settings for AMD Processors

BIOS Option                      Values
Memory  Memory speed             Max performance
        Memory channel mode      Independent
        Node Interleaving        Disabled / NUMA
        Channel Interleaving     Enabled
        Thermal Mode             Performance

7.2 Performance Tuning for Linux

You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance.
• Enable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=1

7.2.3 Preserving Your Performance Settings after a Reboot

To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.
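A setting applied with sysctl -w can be recorded as a "key = value" line so that it is reapplied at boot. The sketch below is illustrative only: persist_sysctl is a hypothetical helper, and the default path /etc/sysctl.conf is an assumption (the standard sysctl configuration file on most distributions), since the file name is cut off in the text above.

```shell
#!/bin/sh
# Hypothetical helper: record a sysctl setting in a configuration file,
# replacing any earlier assignment of the same key.
#   $1 = key, e.g. net.ipv4.tcp_sack
#   $2 = value
#   $3 = configuration file (assumed default: /etc/sysctl.conf)
persist_sysctl() {
    key=$1
    value=$2
    conf=${3:-/etc/sysctl.conf}
    tmp="$conf.tmp.$$"
    if [ -f "$conf" ]; then
        # Keep every line except previous assignments of this key.
        grep -v "^[[:space:]]*$key[[:space:]]*=" "$conf" > "$tmp" || true
    else
        : > "$tmp"
    fi
    echo "$key = $value" >> "$tmp"
    # Write back in place so the original file keeps its permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

For example, `persist_sysctl net.ipv4.tcp_sack 1` would store the selective-acks setting shown above; running `sysctl -p` afterwards reloads the file.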
7.2.4.1 Setting the Scaling Governor

If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
• freq_table
• acpi_cpufreq: this module is architecture dependent.

It is also recommended to disable the module cpuspeed; this module is also architecture dependent.
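One way to set the scaling mode on every core is to write the governor name to each core's cpufreq entry. This is a sketch under assumptions: set_scaling_governor is a hypothetical helper, and it relies on the standard Linux cpufreq sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor), which is not spelled out in the text above.

```shell
#!/bin/sh
# Hypothetical helper: write a governor name to every CPU that exposes
# a cpufreq scaling_governor entry.
#   $1 = governor name (default: performance)
#   $2 = CPU sysfs root (assumed default: /sys/devices/system/cpu)
set_scaling_governor() {
    governor=${1:-performance}
    cpu_root=${2:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip cores without cpufreq support (or an unmatched glob).
        [ -e "$f" ] || continue
        echo "$governor" > "$f"
    done
}
```

Running `set_scaling_governor performance` as root applies the setting to all cores that support CPU scaling; cores without a cpufreq entry are left untouched.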
7.2.5 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly. To manually set Tx and/or Rx interrupt moderation, use the ethtool utility.
Example for supported system:
# cat /sys/class/net/eth3/device/numa_node
0

Example for unsupported system:
# cat /sys/class/net/ib0/device/numa_node
-1

7.2.6.1.1 Improving Application Performance on Remote NUMA Node

Verbs API applications that mostly use polling will see a performance impact when using the remote NUMA node.
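The numa_node check shown in the examples above can be scripted so that an unsupported system (value -1) is reported through the exit status. This is a minimal sketch; iface_numa_node is a hypothetical name, and the optional second argument exists only so the sysfs root can be overridden.

```shell
#!/bin/sh
# Hypothetical helper: report the NUMA node of a network interface.
#   $1 = interface name, e.g. eth3 or ib0
#   $2 = sysfs network class root (default: /sys/class/net)
# Returns 0 when a node is reported, 1 when the value is negative
# (NUMA node not reported), 2 when the entry cannot be read.
iface_numa_node() {
    iface=$1
    node_file="${2:-/sys/class/net}/$iface/device/numa_node"
    if [ ! -r "$node_file" ]; then
        echo "cannot read $node_file" >&2
        return 2
    fi
    node=$(cat "$node_file")
    if [ "$node" -lt 0 ]; then
        echo "$iface: NUMA node not reported ($node)"
        return 1
    fi
    echo "$iface: NUMA node $node"
}
```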
7.2.6.3.1 Running an Application on a Certain NUMA Node

In order to run an application on a certain NUMA node, the process affinity should be set either on the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.

To run an application, run one of the following commands:
taskset -c 8-15 ib_write_bw -a
or:
taskset 0xff00 ib_write_bw -a

7.2.
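The two taskset forms in 7.2.6.3.1 are equivalent because bit N of the mask stands for core N, so cores 8-15 correspond to 0xff00. The helper below is an illustrative sketch (core_range_to_mask is a hypothetical name) that derives the mask from an inclusive core range; it relies on shell integer arithmetic and is therefore only suitable for core numbers below 63.

```shell
#!/bin/sh
# Hypothetical helper: convert an inclusive core range into the
# hexadecimal affinity mask accepted by taskset.
#   $1 = first core, $2 = last core
core_range_to_mask() {
    mask=0
    i=$1
    while [ "$i" -le "$2" ]; do
        # Set the bit that corresponds to core i.
        mask=$(( mask | (1 << i) ))
        i=$(( i + 1 ))
    done
    printf '0x%x\n' "$mask"
}
```

For example, `taskset "$(core_range_to_mask 8 15)" ib_write_bw -a` expands to the second command shown in 7.2.6.3.1.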
• Stop
# mlnx_affinity stop
• Restart
# mlnx_affinity restart

mlnx_affinity can also be started by driver load/unload.

To enable mlnx_affinity by default:
• Add the line below to the /etc/infiniband/openib.conf file.
RUN_AFFINITY_TUNER=yes

7.2.7.3 Tuning for Multiple Adapters

When optimizing the system performance for using more than one adapter, it is recommended to separate the adapter's core utilization so there will be no interleaving between interfaces.
7.2.8 Tuning Multi-Threaded IP Forwarding

To optimize NIC usage for IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
• For MLNX_OFED-2.0.x:
options mlx4_en inline_thold=0
options mlx4_core high_rate_steer=1
• For MLNX_EN-1.5.10:
options mlx4_en num_lro=0 inline_thold=0
options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
# set_irq_affinity_bynode.sh
4.
8 OpenSM – Subnet Manager

8.1 Overview

OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).

8.
bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

--lmc, -l
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports.
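As a quick check of the 2^LMC relationship described for --lmc, the following sketch computes the number of LIDs per port and rejects values outside the 0-7 range. The function name lids_per_port is hypothetical and is not an OpenSM command.

```shell
#!/bin/sh
# Hypothetical helper: number of LIDs assigned to each port for a given LMC.
#   $1 = LMC value (must be in the range 0-7)
lids_per_port() {
    if [ "$1" -lt 0 ] || [ "$1" -gt 7 ]; then
        echo "LMC must be in the range 0-7" >&2
        return 1
    fi
    # 2^LMC expressed as a left shift.
    echo $(( 1 << $1 ))
}
```

With LMC 0 each port has a single LID; with the maximum LMC of 7 each port is assigned 128 LIDs.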
Rev 2.0-3.0.0 --do_mesh_analysis This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes which may reduce the number of SLs required to give a deadlock free routing --lash_start_vl Sets the starting VL to use for the lash routing algorithm. Defaults to 0. --sm_sl Sets the SL to use to communicate with the SM/SA. Defaults to 0.
Rev 2.0-3.0.0 --timeout, -t This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds. --retries This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions. --maxsmps, -n This option specifies the number of VL15 SMP MADs allowed on the wire at any one time.
Rev 2.0-3.0.0 OpenSM – Subnet Manager --port_search_ordering_file, -O This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover this option provides the means to define non default routing port order. --dimn_ports_file, -O (DEPRECATED) This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).
--part_enforce, -Z [both, in, out, off]
This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both or disabled (off). Default is both.

--allow_both_pkeys, -W
This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.

--qos, -Q
This option enables QoS setup.
--consolidate_ipv6_snm_req
Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

--consolidate_ipv4_mask
Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.

--pid_file
Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.
Rev 2.0-3.0.0 This option sets the log verbosity level. A flags field must follow the -D option.
opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
• guid2lid – stores the LID range assigned to each GUID

8.2.3 Signaling

When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

8.2.
Rev 2.0-3.0.0 • 8.3.
Rev 2.0-3.0.0 OpenSM – Subnet Manager -s, -M, -t, -l, -v, -V -vf 132 received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat.
Rev 2.0-3.0.0 -h, --help 8.3.2 0x08 - DEBUG (diagnostic, high volume) 0x10 - FUNCS (function entry/exit, very high volume) 0x20 - FRAMES (dumps all SMP and GMP frames) 0x40 - ROUTING (dump FDB routing information) 0x80 - currently unused. Without -vf, osmtest defaults to ERROR + INFO (0x3) Specifying -vf 0 disables all messages Specifying -vf 0xFF enables all messages (see -V) High verbosity levels may require increasing the transaction timeout with the -t option Display this usage info then exit.
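Because -vf takes a bit mask, several message classes can be combined by OR-ing their values. The sketch below builds such a mask from level names. The helper osm_vf_mask is hypothetical, and the individual values for ERROR (0x01) and INFO (0x02) are an assumption inferred from the stated default of ERROR + INFO (0x3); the remaining values are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: combine verbosity level names into a -vf mask.
# ERROR=0x01 and INFO=0x02 are assumed; the other values come from the
# flag list for -vf.
osm_vf_mask() {
    mask=0
    for name in "$@"; do
        case "$name" in
            ERROR)   bit=0x01 ;;
            INFO)    bit=0x02 ;;
            DEBUG)   bit=0x08 ;;
            FUNCS)   bit=0x10 ;;
            FRAMES)  bit=0x20 ;;
            ROUTING) bit=0x40 ;;
            *) echo "unknown level: $name" >&2; return 1 ;;
        esac
        mask=$(( mask | $bit ))
    done
    printf '0x%x\n' "$mask"
}
```

For example, `osm_vf_mask ERROR INFO` yields the default value 0x3, and the result can be passed to osmtest as `-vf "$(osm_vf_mask ERROR INFO DEBUG)"`.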
where

PartitionName - string, will be used with logging. When omitted, an empty string will be used.
PKey - P_Key value for this partition. Only low 15 bits will be used. When omitted, P_Key will be autogenerated.
flag - used to indicate IPoIB capability of this partition.
defmember=full|limited - specifies default membership for port guid list. Default is limited.
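Since only the low 15 bits of the PKey value are used, two values that differ only in the top bit name the same partition. A minimal sketch of that masking follows; pkey_base is a hypothetical helper, not an OpenSM tool.

```shell
#!/bin/sh
# Hypothetical helper: keep only the low 15 bits of a P_Key value,
# which are the bits used to identify the partition.
#   $1 = P_Key, decimal or 0x-prefixed hexadecimal
pkey_base() {
    printf '0x%04x\n' $(( $1 & 0x7fff ))
}
```

For instance, both 0xffff and 0x7fff reduce to 0x7fff, and 0x8001 reduces to 0x0001.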
8.5 Routing Algorithms

OpenSM offers six routing engines:
1. “Min Hop Algorithm”
Based on the minimum hops to each node where the path length is optimized.
2. “UPDN Algorithm”
Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3.
Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected.

If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have same MinHop
b. First prefer the ones that go to different systemImageGuid (then the previous LID of the same LMC group)
c.
8.5.3 UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure).
Rev 2.0-3.0.0 1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded. 2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node. 8.5.4 Fat-tree Routing Algorithm The fat-tree algorithm optimizes routing for "shift" communication pattern.
8.5.4.1 Routing between non-CN Nodes

The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CN have routes to and from them, there will not necessarily be a route between N1, N2 and N3.
LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.

In more detail, the algorithm works as follows:
1. LASH determines the shortest-path between all pairs of source / destination switches.
8.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension.
Rev 2.0-3.0.0 Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]: For a pristine fabric the path from S to D would be S-n-T-r-D.
because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T. Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR.
Rev 2.0-3.0.0 not arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch: For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn.
Rev 2.0-3.0.0 OpenSM – Subnet Manager Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric.
occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.

8.5.7.4 Quality Of Service Configuration

OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q.
Rev 2.0-3.0.0 OpenSM – Subnet Manager 8.5.7.6 Torus-2QoS Configuration File Syntax The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
Rev 2.0-3.0.0 eter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed. next_seed If any of the switches used to specify a seed were to fail torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification.
Rev 2.0-3.0.0 OpenSM – Subnet Manager 8.6 Quality of Service Management in OpenSM 8.6.1 Overview When Quality of Service (QoS) in OpenSM is enabled (using the ‘-Q’ or ‘--qos’ flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests.
Rev 2.0-3.0.0 II) QoS Setup (denoted by qos-setup) This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location - /var/cache/opensm/opensm.opts).
Rev 2.0-3.0.0 8.6.4 8.6.5 OpenSM – Subnet Manager Policy File Syntax Guidelines • Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability. • Comments are started with the pound sign (#) and terminated by EOL. • Any keyword should be the first non-blank in the line, unless it's a comment. • Keywords that denote section/subsection start have matching closing keywords.
Rev 2.0-3.0.0 port-group name: Virtual Servers # The syntax of the port name is as follows: # "node_description/Pnum". # node_description is compared to the NodeDescription of the node, # and "Pnum" is a port number on that node.
sl: 1
mtu-limit: 4
rate-limit: 5
pkey: 0x1234
packet-life: 8
end-qos-level
end-qos-levels
# Match rules are scanned in order of their appearance in the policy file.
# First matched rule takes precedence.
Rev 2.0-3.0.0 8.6.6 Simple QoS Policy - Details and Examples Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
Rev 2.0-3.0.0 8.6.6.4 SRP Service ID for SRP varies from storage vendor to vendor, thus SRP query is matched by the target IB port GUID.
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14), and a weighting value (values 0-255), indicating the number of 64 byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped.
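To make the VL/Weight semantics concrete, the sketch below validates a comma-separated VL:weight list against the ranges given above and prints how many bytes each entry may transmit per turn (weight x 64). The helper name vlarb_bytes and the sample list 0:96,1:224 are illustrative assumptions, not OpenSM tooling.

```shell
#!/bin/sh
# Hypothetical helper: check a VL arbitration list and print the number of
# bytes each VL may send per arbitration turn (one credit = 64 bytes).
#   $1 = comma-separated VL:weight pairs, e.g. 0:96,1:224
vlarb_bytes() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the list on commas
    IFS=$old_ifs
    for pair in "$@"; do
        case "$pair" in
            *:*) ;;
            *) echo "bad entry: $pair" >&2; return 1 ;;
        esac
        vl=${pair%%:*}
        weight=${pair#*:}
        case "$vl" in ''|*[!0-9]*) echo "bad VL: $pair" >&2; return 1 ;; esac
        case "$weight" in ''|*[!0-9]*) echo "bad weight: $pair" >&2; return 1 ;; esac
        if [ "$vl" -gt 14 ] || [ "$weight" -gt 255 ]; then
            echo "out of range: $pair" >&2
            return 1
        fi
        # A weight of 0 means the entry is skipped during arbitration.
        [ "$weight" -eq 0 ] && continue
        echo "VL$vl $(( weight * 64 ))"
    done
}
```

For the list 0:96,1:224 this reports 6144 bytes for VL0 and 14336 bytes for VL1 per arbitration turn.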
Rev 2.0-3.0.0 Figure 5: Example QoS Deployment on InfiniBand Subnet 8.7 QoS Configuration Examples The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files. 8.7.
qos-ulps
default :0 # default SL (for MPI)
any, target-port-guid OST1,OST2,OST3,OST4:1 # SL for Lustre OST
any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:96,1:224
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1,
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.3 EDC (3-tier): IPoIB, RDS, SRP

The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:96,3:96,4:96
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

• Partition configuration file
Default=0x7fff, ipoib : ALL=full;
PartA=0x8001, sl=1, ipoib : ALL=full;

8.8 Adaptive Routing

8.8.1 Overview

Adaptive Routing is at beta stage.

Adaptive Routing (AR) enables the switch to select the output port based on the port's load.
8.8.2 Installing the Adaptive Routing

Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of Mellanox OFED installation.

8.8.3 Running Subnet Manager with Adaptive Routing Manager

Adaptive Routing (AR) Manager can be enabled/disabled through SM options file.

8.8.3.1 Enabling Adaptive Routing

To enable Adaptive Routing, perform the following:
1.
Rev 2.0-3.0.0 OpenSM – Subnet Manager Adaptive Routing mechanism is automatically disabled once the switch receives setting of the usual linear routing table (LFT). Therefore, no action is required to clear Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing. 8.8.4 Querying Adaptive Routing Tables When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid, thus the standard tools that query LFT (e.g.
Rev 2.0-3.0.0 8.8.5.1 General AR Manager Options Table 13 - Adaptive Routing Manager Options File Option File Description Values ENABLE: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as device that does not support AR, AR Manager will not try to enable AR on this switch.
SWITCH { ; ; ... }

The following are the per-switch options:

Table 14 - Adaptive Routing Manager Per-Switch Options File

Option File     Description
ENABLE:         Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on the fly. Default: true
AGEING_TIME:    Applicable to bounded AR mode only.
8.9 Congestion Control

8.9.1 Congestion Control Overview

Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of Mellanox OFED installation.

The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks.
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.

For the full list of CC Manager options with all the default values, see “Configuring Congestion Control Manager” on page 167. For further details on the list of CC Manager options, please refer to the IB spec.

8.9.
• When the number of send/receive errors or timeouts exceeds 'max_errors' in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters:
max_errors
error_window
• The values are:
max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking.[0-48K]
• The default is: 5

8.9.4.
Table 17 - Congestion Control Manager CA Options File
Option File: Description. Values
ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff
ccti_increase: Sets the CC Table Index (CCTI) increase. Default: 1
trigger_threshold: Sets the trigger threshold. Default: 2
ccti_min: Sets the CC Table Index (CCTI) minimum.
9 InfiniBand Fabric Diagnostic Utilities

9.1 Overview

The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.

9.2 Utilities Usage

This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves including: operation, synopsis and options descriptions, error codes, and examples.

9.2.
9.2.3 Addressing

This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination: (Tool option ‘-d’)
This option defines a directed route of output port numbers from the local port to the destination.
-i|--device : Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
-p|--port : Specifies the local device's port number used to connect to the IB fabric.
-g|--guid : Specifies the local port GUID value of the port used to connect to the IB fabric.
--ber_test : Provides a BER test for each port. Calculates the BER for each port and checks that no BER value exceeds the BER threshold (default threshold="10^-12").
--ber_use_data : Indicates that the BER test will use the received data for calculation.
--ber_thresh : Specifies the threshold value for the BER test. The reciprocal of the BER should be provided. Example: for a threshold of 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12).
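The reciprocal convention for --ber_thresh can be checked with a few lines of Python. This is only an illustrative sketch: the helper name ber_thresh_value is hypothetical and not part of ibdiagnet. It converts a BER threshold into the integer the option expects and prints it in decimal and hexadecimal.

```python
from fractions import Fraction


def ber_thresh_value(ber):
    """Return the reciprocal of a BER threshold as an integer.

    Hypothetical helper for illustration; ibdiagnet itself only
    receives the resulting number on its command line.
    """
    if ber <= 0:
        raise ValueError("BER threshold must be positive")
    reciprocal = 1 / Fraction(ber)
    if reciprocal.denominator != 1:
        raise ValueError("reciprocal of the BER is not a whole number")
    return reciprocal.numerator


# Threshold of 10^-12, as in the example in the option description.
value = ber_thresh_value(Fraction(1, 10**12))
print(value, hex(value))
```

Using Fraction rather than a float keeps the reciprocal exact, which matters for thresholds as small as 10^-12.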
Table 19 - ibdiagnet (of ibutils2) Output Files
Output File
ibdiagnet2.
Options
-c  Min number of packets to be sent across each link (default = 10)
-v  Enable verbose mode
-r  Provides a report of the fabric qualities
-t  Specifies the topology file name
-s  Specifies the local system name.
Table 20 - ibdiagnet (of ibutils) Output Files
Output File: Description
ibdiagnet.fdbs: A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs: A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks: In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm: List of all the SMs (state and priority) in the fabric
ibdiagnet.
Error Codes
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package

9.5 ibdiagpath - IB diagnostic path

ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.
Options
-n <[src-name,]dst-name>  Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
-l <[src-lid,]dst-lid> -d -c -v -t -s -i -p -o -lw <1x|4x|12x> -ls <2.
Error Codes
1 - The path traced is unhealthy
2 - Failed to parse command line options
3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
4 - Unable to traverse the LFT data from source to destination
5 - Failed to use Topology File
6 - Failed to load required Package

9.6 ibv_devices

Lists InfiniBand devices available for use from userspace, including node GUIDs.

Synopsis
ibv_devices

Examples
1.
Table 22 - ibv_devinfo Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
-l --list (Optional; Inactive): Only list the names of InfiniBand devices
-v --verbose (Optional; Inactive): Print all available information about the InfiniBand device(s)

Examples
1. List the names of all available InfiniBand devices.
> ibv_devinfo -l
2 HCAs found:
mthca0
mlx4_0
2.
Options
-v  Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
-h  Print help messages.

Example:
sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 (Down)
mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.
Table 23 - ibstatus Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
(Optional, but requires specifying a device name; All ports of the specified device): Print information for the specified port only (of the specified device)

Examples
1. List the status of all available InfiniBand devices and their ports.
2. List the status of specific ports of specific devices.
> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c900:0101:d151
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0000:0000:0007:3897
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)

9.
Table 24 - ibportstate Flags and Options (Continued)
Flag (Optional / Mandatory; Default If Not Specified): Description
-v(erbose) (Optional): Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion) (Optional): Show version info
-D(irect) (Optional): Use directed path address arguments. The path is a comma separated list of out ports. Examples:
‘0’ – self port
‘0,1,2,1,4’ – out via port 1, then 2, ...
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID – 3 in this case) to obtain additional link information using ibportstate.
> ibstatus mlx4_0:1
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0000:0000:9289:3895
base lid: 0x3
sm lid: 0x3
state: 2: INIT
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
> ibportstate -C mlx4_0 3 1 query
PortInfo:
# Port info: Lid 3 port 1
LinkState:.......................Initialize
PhysLinkState:...
LinkSpeedActive:.................2.5 Gbps

3. Change the speed of a port.
# First query for current configuration
> ibportstate -C mlx4_0 -D 0 1
PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState:.......................Initialize
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.
Synopsis
ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s ] \
[-C ] [-P ] [ -t ] \
[ [ []]]

Output Files
Table 25 lists the various flags of the command.

Table 25 - ibroute Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
-h(help) (Optional): Print the help menu
-d(ebug) (Optional): Raise the IB debug level.
Table 25 - ibroute Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
-t (Optional): Override the default timeout for the solicited MADs [msec]
(Optional): Destination’s directed path, LID, or GUID
(Optional): Starting LID in an MLID range
(Optional): Ending LID in an MLID range

Examples
1. Dump all Lids with valid out ports of the switch with Lid 2.
Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid Out Destination Port Info
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
3 valid lids dumped

4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016.
9.12 smpquery

Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis
smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s ] [-V] [-C ] [-P ] [-t ] [--node-name-map ] [op params]

Output Files
Table 26 lists the various flags of the command.
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................0

3. Query NodeInfo by direct route.
> smpquery -D nodeinfo 0
# Node info: DR path slid 65535; dlid 65535; 0
BaseVers:........................1
ClassVers:.......................
Table 27 - perfquery Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
-G(uid) (Optional): Use GUID address argument. In most cases, it is the Port GUID.
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................55178210
RcvData:.........................55174680
XmtPkts:.........................766366
RcvPkts:.........................766315

2. Read performance counters from LID 2, all ports.
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................0
RcvData:.........................0
XmtPkts:.........................0
RcvPkts:.........................0

9.14 ibcheckerrs

Validates an IB port (or node) and reports errors in counters above threshold.
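The idea of comparing counters against thresholds can be sketched in Python using text in the "Name:....value" layout shown in the perfquery examples. This is not how ibcheckerrs is implemented. The function names, the sample counter values, the threshold values, and the reading of "above threshold" as strictly greater than are all assumptions made for illustration.

```python
import re

# Matches lines such as "XmtDiscards:.....................0".
# Lines with non-integer values (for example "2.5 Gbps") are skipped.
COUNTER_LINE = re.compile(r"^\s*(\w+):\.*(\d+)\s*$")


def parse_counters(text):
    """Collect integer counters from perfquery-style output text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def over_threshold(counters, thresholds):
    """Return the counters whose value is strictly above its threshold."""
    return {
        name: value
        for name, value in counters.items()
        if name in thresholds and value > thresholds[name]
    }


# Sample input and thresholds are made up for this sketch.
sample = "LinkIntegrityErrors:.............0\nXmtDiscards:.....................3\n"
print(over_threshold(parse_counters(sample), {"XmtDiscards": 0}))
```

Counters that have no entry in the thresholds mapping are ignored rather than treated as errors, which keeps the check limited to counters the caller explicitly cares about.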
Table 28 - ibcheckerrs Flags and Options
Flag (Optional / Mandatory; Default If Not Specified): Description
-C (Optional): Use the specified channel adapter or router
-P (Optional): Use the specified port
-t (Optional): Override the default timeout for the solicited MADs [msec]
(Mandatory with -G flag): Use the specified port’s or node’s LID/GUID (with -G option)
[] (Mandatory without -G flag): Use the specified port

Examples
1.
> ibcheckerrs -v -T thresh1 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

9.15 mstflint

Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access. If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.
Table 29 - mstflint Switches (Sheet 2 of 3)
Switch (Affected / Relevant Commands): Description
-mac (burn, sg): MAC address base value. Two MACs are automatically assigned to the following values:
mac -> port1
mac+1 -> port2
Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-macs (burn, sg): Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively.
Table 29 - mstflint Switches (Sheet 3 of 3)
Switch (Affected / Relevant Commands): Description
-vsd (burn): Write this string of up to 208 characters to VSD upon a burn command.
-use_image_ps (burn): Burn vsd as it appears in the given image - do not keep existing VSD on Flash.
-dual_image (burn): Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
Possible command return values are:
0 - successful completion
1 - error has occurred
7 - the burn command was aborted because firmware is current

Examples
1. Find Mellanox Technologies’ ConnectX® VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE.
> /sbin/lspci -d 15b3:634a
04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0).
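A wrapper script that calls mstflint may want to distinguish between the return values listed above. The following Python sketch maps them to their documented meanings. The function names are hypothetical, and treating return value 7 as "no action needed" rather than as a failure is an assumption of this sketch, not a statement from the tool's documentation.

```python
# Return values as listed in this section for mstflint.
MSTFLINT_RETURN_VALUES = {
    0: "successful completion",
    1: "error has occurred",
    7: "the burn command was aborted because firmware is current",
}


def describe_return_value(code):
    """Return the documented meaning of a return value, if there is one."""
    return MSTFLINT_RETURN_VALUES.get(
        code, "undocumented return value {}".format(code)
    )


def burn_needs_attention(code):
    """Assumption: 0 and 7 both mean there is nothing left to fix."""
    return code not in (0, 7)


for code in (0, 1, 7):
    print(code, describe_return_value(code), burn_needs_attention(code))
```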
9.16 ibv_asyncwatch

Displays asynchronous events forwarded to userspace for an InfiniBand device.
Examples
1. Display asynchronous events.
> ibv_asyncwatch
mlx4_0: async event FD 4

9.17 ibdump

Dumps InfiniBand traffic that flows to and from the InfiniBand ports of Mellanox Technologies ConnectX® family adapters. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
--mem-mode  When specified, packets are written to file only after the capture is stopped. It is faster than the default mode (less chance for packet loss), but takes more memory. In this mode, ibdump stops after bytes are captured
--decap  Decapsulate port mirroring headers. Should be used when capturing RSPAN traffic.
-h, --help  Display this help screen.
-v, --version  Print version information.

Examples
1. Run ibdump.
Appendix A: Mellanox FlexBoot

A.1 Overview

Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX® adapters, FlexBoot gives IT managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.
Prerequisites
1. Expansion ROM Image
The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot_release_notes.txt.
2. Firmware Burning Tools
You need to install the Mellanox Firmware Tools (MFT) package (version 2.7.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads.

Image Burning Procedure
To burn the composite image, perform the following steps:
1.
A.3.2 Configuring the DHCP Server

A.3.2.1 For ConnectX Family Devices

When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix (ff:00:00:00:00:00:02:00:00:02:c9:00) and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).
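The composition of the client identifier can be illustrated with a short Python sketch. The function name is hypothetical, and accepting GUIDs with an optional 0x prefix or colon separators is a convenience assumed here. The port GUID 0x0002c90300001039 is used only as sample input.

```python
# Prefix as given in this section for ConnectX family devices.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the DHCP client identifier from an 8-byte port GUID."""
    text = port_guid.lower()
    if text.startswith("0x"):
        text = text[2:]
    digits = text.replace(":", "")
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("expected an 8-byte port GUID in hexadecimal")
    # Split the 16 hex digits into 8 colon-separated bytes.
    octets = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(octets)


print(dhcp_client_identifier("0x0002c90300001039"))
```

The resulting string has 20 colon-separated bytes: 12 from the prefix and 8 from the port GUID.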
Placing Client Identifiers in /etc/dhcpd.conf

The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server.

host host1 {
    next-server 11.4.3.7;
    filename "pxelinux.0";
    fixed-address 11.4.3.130;
    option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

A.4 Subnet Manager – OpenSM

This section applies to ports configured as InfiniBand only.
A.7 Operation

A.7.1 Prerequisites

• Make sure that your client is connected to the server(s)
• The FlexBoot image is already programmed on the adapter card – see Section A.2
• For InfiniBand ports only: Start the Subnet Manager as described in Section A.4
• The DHCP server should be configured and started (see Section 4.3.3.1, “IPoIB Configuration Based on DHCP”, on page 50)
• Configure and start at least one of the services iSCSI Target (see Section A.
After configuring the IB/ETH port, the client attempts to connect to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.

For ConnectX (InfiniBand):

Next, FlexBoot attempts to boot as directed by the DHCP server.

A.8 Command Line Interface (CLI)

A.8.1 Invoking the CLI

When the boot process begins, the computer starts its Power On Self Test (POST) sequence.
A.8.3.1 ifstat

Displays the available network interfaces (in a similar manner to Linux’s ifconfig).

A.8.3.2 ifopen

Opens the network interface net. The list of network interfaces is available via the ifstat command.
Example:
iPXE> ifopen net1

A.8.3.3 ifclose

Closes the network interface net. The list of network interfaces is available via the ifstat command.
Example:
iPXE> ifclose net1

A.8.3.4 autoboot

Starts the boot process from the device(s).

A.8.3.
A.8.3.8 help

Displays the available list of commands.

A.8.3.9 exit

Exits from the command line interface.

A.9 Diskless Machines

Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it.
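As an illustration of the two parts of this requirement (the module files must be present in the initrd, and the init script must load them), the following Python sketch generates /sbin/insmod lines for a list of modules and reports which module files are missing from an unpacked initrd tree. The function names, the /tmp/initrd_ib root, the /lib/modules/ib directory, and the module list are assumptions chosen for the example, not a prescribed layout.

```python
import os


def insmod_lines(module_dir, modules):
    """Build init-script lines that load each module in the given order."""
    return ["/sbin/insmod {}/{}.ko".format(module_dir, name) for name in modules]


def missing_modules(initrd_root, module_dir, modules):
    """Return the modules whose .ko file is absent from the unpacked initrd."""
    base = os.path.join(initrd_root, module_dir.lstrip("/"))
    return [
        name
        for name in modules
        if not os.path.isfile(os.path.join(base, name + ".ko"))
    ]


# Hypothetical values for illustration only.
modules = ["ib_core", "ib_mad", "mlx4_core", "mlx4_ib"]
print("\n".join(insmod_lines("/lib/modules/ib", modules)))
print("missing:", missing_modules("/tmp/initrd_ib", "/lib/modules/ib", modules))
```

The order of the list matters because insmod does not resolve dependencies, so a module must appear after the modules it depends on.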
A.9.1.1 Example: Adding an IB Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, “IPoIB Configuration Based on DHCP”, and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image.
Step 5. IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:
host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko \
/tmp/initrd_ib/lib/modules
Step 6. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
host1$ cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 7. If you plan to give your IB device a static IP address, then copy ifconfig.
/sbin/insmod /lib/modules/ib/ib_core.ko
/sbin/insmod /lib/modules/ib/ib_mad.ko
/sbin/insmod /lib/modules/ib/ib_sa.ko
/sbin/insmod /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.
A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1 on page 50, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image.
echo "loading Mellanox ConnectX EN driver"
/sbin/insmod lib/modules/mlnx_en/mlx4_core.ko
/sbin/insmod lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd.
host1$ cd /tmp/initrd_en
host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
host1$ gzip /tmp/new_init_en.
A.10.1 Configuring an iSCSI Target in Linux Environment

Prerequisites
Step 1. Make sure that an iSCSI Target is installed on your server side. You can download and install an iSCSI Target from the following location:
http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3. Configure your iSCSI Target to work with the partition you dedicated.
Appendix B: SRP Target Driver

The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or the InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the back end.
1. scst_vdisk – fileio and blockio modes.
The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

Example 1: Working with VDISK BLOCKIO mode (Using the md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
g.
1. Run: modprobe ib_srp
2. Run: ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered scsi disks)

Example:
Assume that you use port 1 of first HCA in the system, i.e.
echo "add "mgmt"" > /proc/scsi_tgt/trace_level
echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
*********************** End srpt.sh ****************************

B.3 How-to Unload/Shutdown

1. Unload ib_srpt
$ modprobe -r ib_srpt
2. Unload scst and its dev_handlers first
$ modprobe -r scst_vdisk scst
3. Unload ofed
$ /etc/rc.
Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=
and/or
options mlx4_ib parameter=
and/or
options mlx4_en parameter=

The following sections list the available mlx4 parameters.

C.1 mlx4_ib Parameters

sm_guid_assign: Enable SM alias_GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
dev_assign_str: Map device function numbers to IB device numbers (e.g.'0000:04:00.
log_num_mgm_entry_size: log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12.
high_rate_steer:
fast_drop:
enable_64b_cqe_eqe:
log_num_mac:
log_num_vlan:
log_mtts_per_seg:
port_type_array:
log_num_qp:
log_num_srq:
log_rdmarc_per_qp:
log_num_cq:
log_num_mcg:
log_num_mpt:
log_num_mtt:
enable_qos:
internal_err_reset:

C.3 mlx4_en Parameters

inline_thold:
udp_rss:
pfctx:
pfcrx:
Appendix D: mlx5 Module Parameters

The mlx5_ib module supports a single parameter used to select the profile which defines the number of resources supported. The parameter name for selecting the profile is prof_sel.
Appendix E: Lustre Compilation over MLNX_OFED

To compile Lustre version 2.3.65 and higher:
$ ./configure --with-o2ib=/usr/src/ofa_kernel/default/
$ make rpms

To compile older Lustre versions:
$ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" .