i40iw Linux* Driver for Intel(R) Ethernet Connection X722
=========================================================

May 16, 2019

Contents
========
- Prerequisites
- Building and Installation
- Testing
- Interoperability
- RDMA Statistics
- Known Issues
- Unsupported and Discontinued Items

Prerequisites
=============
One of the following:
* Latest stable upstream kernel with the inbox Infiniband* support installed
* Red Hat* Enterprise Linux* (RHEL) 7.4/7.5/7.6 with the inbox Infiniband*
  support installed
* SUSE* Linux Enterprise Server (SLES) 12 SP3/12 SP4/15 with the inbox
  Infiniband* support installed

NOTE: The i40e driver must be built from source on your system prior to
installing i40iw.

Memory Requirements:
--------------------
The default i40iw load requires a minimum of 6 GB of memory for
initialization. For applications where the amount of memory is constrained,
you can decrease the required memory by lowering the resources available to
the i40iw driver. To do this, load the driver with the following profile
setting:

    modprobe i40iw resource_profile=2

Note: This can have performance and scaling impacts, as the number of queue
pairs and other RDMA resources is decreased in order to lower memory usage
to approximately 1.2 GB.

Scaling Limits
--------------
Intel(R) Ethernet Connection X722 has limited RDMA resources, including the
number of Queue Pairs (QPs), Completion Queues (CQs), and Memory Regions
(MRs). In highly scaled environments or highly interconnected HPC-style
applications such as all-to-all, users may experience QP failure errors once
they reach the RDMA resource limits.

Below are the per-physical-port limits for 4-port devices for the three
resources associated with the default i40iw driver load:

    QPs: 16384
    CQs: 32768
    MRs: 2453503

Other resource profiles allocate resources differently. If i40iw is loaded
with resource_profile=2, then resources will be more limited.
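To make the reduced profile persist across reboots, the same modprobe
options-file convention this document uses later for iw_cxgb4 can be applied
to i40iw (a sketch; the file name is illustrative, the options syntax is
the standard modprobe.d format):

```
# /etc/modprobe.d/i40iw.conf -- picked up the next time i40iw is loaded
options i40iw resource_profile=2
```

Remove the file and reload the driver to return to the default profile.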
The example below shows the per-physical-port resource limits when you use
modprobe i40iw resource_profile=2. (Note that these limits may increase if
you load fewer than 32 VFs using the max_rdma_vfs module parameter.)

    QPs: 2048
    CQs: 3584
    MRs: 6143

Building and Installation
=========================
1. Untar i40iw-<version>.tar.gz.

2. Install the i40iw PF driver as follows:
       cd i40iw-<version> directory
       ./build.sh <path to i40e source directory> k
   For example:
       ./build.sh /opt/i40e-2.3.3 k

3. Download the latest rdma-core user-space package from
   https://github.com/linux-rdma/rdma-core/releases and follow its
   installation procedure.

   Note: There might be errors resulting from conflicting packages when
   upgrading rdma-core to the latest content. If so, use the following
   procedure:
       Remove the existing packages:
           rpm -e <package name>
       Install the newer version of the package:
           rpm -ivh <package name>

Adapter and Switch Flow Control Setting
---------------------------------------
We recommend enabling link-level flow control (both TX and RX) on the X722
and the connected switch. For better performance, enable flow control on
all the nodes and on the switch they are connected to.

To enable flow control on the X722, use the ethtool -A command. For example:

    ethtool -A p4p1 rx on tx on

where p4p1 is the iWARP interface name.

Confirm the setting with the ethtool -a command. For example:

    ethtool -a p4p1

You should see this output:

    Pause parameters for p4p1:
    Autonegotiate: off
    RX: on
    TX: on

To enable link-level flow control on the switch, please consult your switch
vendor's documentation. Look for flow control and make sure both TX and RX
are set. Here is an example for a generic switch to enable both TX and RX
flow control on port 45:

    enable flow-control tx-pause ports 45
    enable flow-control rx-pause ports 45

Recommended Settings for Intel MPI 2017.0.x
-------------------------------------------
Note: The following instructions assume that Intel MPI is installed in the
default locations.
Refer to the Intel MPI documentation for further details on parameters and
general instructions.

1. Add or modify the following line in /etc/dat.conf, changing
   <interface name> to match your interface name:

       ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0
       "<interface name> 0" ""

2. To select the iWARP device, add the following to the mpiexec command:

       -genv I_MPI_FALLBACK_DEVICE disable
       -genv I_MPI_DEVICE rdma:ofa-v2-iwarp

Example mpiexec command line for uDAPL-2.0:

    mpiexec -machinefile mpd.hosts_impi -genv I_MPI_FALLBACK_DEVICE disable
    -genv I_MPI_DEVICE rdma:ofa-v2-iwarp -ppn <processes per node>
    -n <number of processes> <application>

Note: mpd.hosts_impi is a text file with a list of the nodes' qualified
hostnames or IP addresses, one per line, in the MPI ring.

Note: Recommended optional_parameters if running the IMB-MPI1 benchmark:
    -time 1000000 (a benchmark will run at most this many seconds per
     message size)
    -mem 2GB (at most this many GBytes are allocated per process for the
     message buffers)

Recommended Settings for Open MPI 3.x.x
---------------------------------------
Note: The following instructions assume that Open MPI is installed in the
default locations. Refer to the Open MPI documentation at open-mpi.org for
further details on parameters and general instructions.

Note: There is more than one way to specify MCA parameters in Open MPI.
Please visit this link and use the best method for your environment:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Necessary parameters to the mpirun command:

    -mca btl openib,self,vader
        Use openib (OpenFabrics device), send-to-self semantics, and
        shared memory.
    -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
        Set the receive queue sizes.
        This is especially useful for interop between iWARP RDMA vendors,
        because the queue sizes could be different per vendor in the file
        "openmpi/mca-btl-openib-device-params.ini".
    -mca oob ^ud
        Do not use UD QPs.

Example mpirun command line:

    mpirun -np <number of processes> -hostfile mpd.hosts_ompi --map-by node
    --allow-run-as-root --display-map -v -tag-output
    -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
    -mca btl openib,self,vader -mca mpi_leave_pinned 0 -mca oob ^ud
    /openmpi_benchmarks/3.x.x/benchmark [optional_parameters]

Note: mpd.hosts_ompi is a text file with a list of the nodes' qualified
hostnames or IP addresses and "slots=<n>", one per line, in the MPI ring.
The slots parameter is required when the number of processes is greater
than 72. Refer to the Open MPI documentation for more details.

Note: Underscores are not allowed in hostnames.

Example:
    QA0094-1-0 slots=72
    QA0096-1-0 slots=72

Recommended optional_parameters for the IMB-MPI1 benchmark:
    -time 1000000 (a benchmark will run at most this many seconds per
     message size)

Testing
=======

Verify RDMA traffic
-------------------
The following rping test can be run to confirm RDMA functionality:

    Server side: rping -s -a <server IP address> -vVd
    Client side: rping -c -a <server IP address> -vVd

Execute the server side before the client side. Rping will run endlessly,
printing the data on the console.

First run the rping server and client both on machine A. After confirming
machine A is operating correctly, run the rping server and client both on
machine B. After confirming machine B is operating correctly, run rping
from machine A to machine B.

* Make sure the port mapper is running. To check the status:
      systemctl status iwpmd
  To start the port mapper:
      systemctl start iwpmd

Interoperability
================
To interoperate with Chelsio iWARP devices, load the Chelsio T4/T5 RDMA
driver (iw_cxgb4) with the parameter dack_mode set to 0:
    modprobe iw_cxgb4 dack_mode=0

If iw_cxgb4 is loaded on system boot, create the
/etc/modprobe.d/iw_cxgb4.conf file with the following entry:

    options iw_cxgb4 dack_mode=0

Reload iw_cxgb4 for the new parameters to take effect.

RDMA Statistics
===============
Use the following command to read RDMA protocol statistics:

    cd /sys/class/infiniband/i40iw0/proto_stats; for f in *; do
    echo -n "$f: "; cat "$f"; done; cd

The following counters will increment when RDMA applications are
transferring data over the network:
    - ipInReceives
    - tcpInSegs
    - tcpOutSegs

Known Issues/Troubleshooting
============================

RDMA fails perftest
-------------------
When testing iWARP devices with perftest 4.4-0.5, most tests will fail. See
https://github.com/linux-rdma/perftest/issues/52 for details.

Incompatible Drivers in initramfs
---------------------------------
There may be incompatible drivers in the initramfs image. You can either
update the image or remove the drivers from initramfs.

Specifically, look for i40e, ib_addr, ib_cm, ib_core, ib_mad, ib_sa,
ib_ucm, ib_uverbs, iw_cm, rdma_cm, and rdma_ucm in the output of the
following command:

    lsinitrd | less

If you see any of those modules, rebuild initramfs with the following
command and include the names of the modules in the quoted list. Below is
an example:

    dracut --force --omit-drivers "i40e ib_addr ib_cm ib_core ib_mad ib_sa
    ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"

Unsupported and Discontinued Items
==================================
Support for libi40iw has been discontinued.

i40iw does not support NFSoRDMA.

Intel(R) Ethernet Connection X722 iWARP RDMA VF driver discontinued
-------------------------------------------------------------------
Support for the Intel(R) Ethernet Connection X722 iWARP RDMA VF driver
(i40iwvf) has been discontinued. There is no change to the Linux iWARP RDMA
PF driver (i40iw).
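With the VF driver discontinued, a quick way to confirm that the PF driver
is the one providing RDMA devices is to inspect the sysfs device list and
the loaded modules (a sketch; the device name i40iw0 and the output depend
on the system, and nothing is printed if the driver is not loaded):

```shell
# List RDMA devices registered with the kernel (e.g. i40iw0 for the PF
# driver); prints nothing if no RDMA-capable driver is loaded.
ls /sys/class/infiniband 2>/dev/null || true

# Filter the kernel module list for the i40e/i40iw family; the
# discontinued i40iwvf module should not appear here.
awk '$1 ~ /^(i40e|i40iw|i40iwvf)$/ {print $1}' /proc/modules 2>/dev/null || true
```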
Support
=======
For general information, go to the Intel support website at:
    http://www.intel.com/support/
or the Intel Wired Networking project hosted by Sourceforge at:
    http://sourceforge.net/projects/e1000

If an issue is identified with the released source code on a supported
kernel with a supported adapter, email the specific information related to
the issue to e1000-rdma@lists.sourceforge.net.

License
-------
This software is available to you under a choice of one of two licenses.
You may choose to be licensed under the terms of the GNU General Public
License (GPL) Version 2, available from the file COPYING in the main
directory of this source tree, or the OpenFabrics.org BSD license below:

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
    OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
    IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
    CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
    TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
    SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Copyright(c) 2016-2019 Intel Corporation.

Trademarks
----------
Intel is a trademark or registered trademark of Intel Corporation or its
subsidiaries in the United States and/or other countries.
* Other names and brands may be claimed as the property of others.