IBM pSeries High Performance Switch Tuning and Debug Guide Version 1.
1.0 Introduction
This paper is intended to help you tune and debug the performance of the IBM pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be a comprehensive guide, but rather to help in initial tuning and debugging of performance issues. Additional detailed information on the materials presented here can be found in the sources noted in the text and listed in section 7.0.
2.0 Tunables and settings for switch software
To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7.0 of this paper.
2.1 MPI tunables for Parallel Environment
The following sections list the most common MPI tunables for applications that use the HPS.
thread, and from within the MPI/LAPI polling code that is invoked when the application makes blocking MPI calls. MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread should wait (sleep) before it checks whether any data previously sent by the MPI task needs to be retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the internal MPI/LAPI polling routine between checks for whether any data needs to be resent.
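For example, both variables are ordinarily exported in the environment of the POE job before the parallel program is started. The values and the program invocation below are only illustrative placeholders, not recommendations from this guide:
export MP_POLLING_INTERVAL=400000      # microseconds the service thread sleeps between retransmit checks
export MP_RETRANSMIT_INTERVAL=10000    # polling passes between retransmit checks
poe ./my_app -procs 32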
2.1.5 MP_TASK_AFFINITY
Setting MP_TASK_AFFINITY to SNI tells the Parallel Operating Environment (POE) to bind each task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter.
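A minimal sketch of enabling this for a POE run follows; the program name and task count are placeholders:
export MP_TASK_AFFINITY=SNI
poe ./my_app -procs 16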
Sometimes MPI-IO is used in an application as if it were basic POSIX read/write, either because there is no need for more complex read/write patterns or because the application was previously hand-optimized to use POSIX read/write. In such cases, it is often better to use the IBM_largeblock_io hint on MPI_FILE_OPEN. By default, the PE/MPI implementation of MPI-IO tries to take advantage of the information the MPI-IO interface can provide to do file I/O more efficiently.
rfifosize   0x1000000    receive fifo size      False
rpoolsize   0x02000000   IP receive pool size   True
spoolsize   0x02000000   IP send pool size      True

3.0 Tunables and settings for AIX 5L
Several settings in AIX 5L impact the performance of the HPS. These include the IP and memory subsystems. The following sections provide a brief overview of the most commonly used tunables. For more information about these subjects, see the AIX 5L tuning manuals listed in section 7.0.
The overhead in maintaining the file cache can impact the performance of large parallel applications. Much of the overhead is associated with the sync() system call (by default, run every minute from the syncd daemon). The sync() system call scans all of the pages in the file cache to determine if any pages have been modified since the last sync(), and therefore need to be written to disk.
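One quick way to see how much real memory is currently holding file pages (and therefore how much work each sync() pass has to do) is the vmstat -v report. This is only a rough check; the exact field names shown by vmstat -v can vary with the AIX 5L level:
vmstat -v | grep -i perm
The numperm percentage line in the output is the fraction of real memory occupied by file pages.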
3.3.1 svmon
The svmon command provides information about the virtual memory usage by the kernel and user processes in the system at any given time. For example, to see system-wide information about the segments (256 MB chunks of virtual memory), type the following command as root:
svmon -S
The command prints out segment information sorted according to values in the Inuse field, which shows the number of virtual pages in the segment that are mapped into the process address space.
PageSize   Inuse    Pin    Pgsp   Virtual
4KB        448221   3687   2675   449797
16MB       0        0      0      0

Vsid     Esid        Type  Description
1f187f   11          work  text data BSS heap
218a2    70000000    work  default shmat/mmap
131893   17          work  text data BSS heap
0        0           work  kernel segment
1118b1   8001000a    work  private load
d09ad    90000000    work  loader segment
1611b6   90020014    work  shared library text
31823    10          clnt  text data BSS heap
1a187a   ffffffff    work  application stack
c17ec    f00000002   work  process private
b11ab    9fffffff    pers  shared library text
statistics in 5-second intervals, with the first set of statistics being the statistics since the node or LPAR was last booted.
vmstat 5
The pi and po columns in the page group show the number of 4KB pages read from and written to the paging device between consecutive samplings. If po is high, it could indicate that thrashing is taking place. In that case, it is a good idea to run the svmon command to see the system-wide virtual segment allocation.
adapter is configured. The size of the reservation is proportional to the number of user windows configured on the HPS adapter. A private window is required for each MPI task. Here is a formula to calculate the number of TLPs needed by the HPS adapter. In the formula below, number_of_sni refers to the number of sniX logical interfaces present in the partition.
3.5 Large pages and IP support
One of the most important ways to improve IP performance on the HPS is to ensure that large pages are enabled. Large page support is required because the HPS IP driver allocates a number of large pages at boot time. Each sniX interface needs one large page for the IP FIFO, plus pages for the send pool and receive pool, which are shared among all adapters. Here is the formula for the number of large pages, assuming that the send pool and receive pool each need two pages.
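Once you have worked out how many large pages are needed, large page support is typically configured with the vmo command. The region count below is only a placeholder; 16777216 bytes (16 MB) is the POWER4/POWER5 large page size, and changing these tunables generally requires a bosboot and a reboot of the LPAR:
vmo -r -o lgpg_size=16777216 -o lgpg_regions=64
bosboot -ad /dev/ipldevice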
If you have eight cards for p690 (or four cards for p655), this command also indicates whether you have full memory bandwidth.
3.8 Debug settings in the AIX 5L kernel
The AIX 5L kernel has several debug settings that affect the performance of an application.
4.2 LoadLeveler daemons
The LoadLeveler® daemons are needed for MPI applications using HPS. However, you can lower their impact on a parallel application by changing the default settings for these daemons:
• Reducing the number of daemons running
• Reducing daemon communication or placing daemons on a switch
• Reducing logging
SCHEDD_DEBUG = -D_ALWAYS
4.3 Settings for AIX 5L threads
Several variables help you use AIX 5L threads to tune performance. These are the recommended initial settings for AIX 5L threads when using HPS. Set them in the /etc/environment file.
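The definitive list of settings is in the Parallel Environment documentation; as an illustration only, thread-related entries in /etc/environment commonly look like the following (treat the variable choices and values as assumptions to confirm for your AIX 5L and PE levels):
AIXTHREAD_SCOPE=S
AIXTHREAD_MUTEX_DEBUG=OFF
AIXTHREAD_COND_DEBUG=OFF
AIXTHREAD_RWLOCK_DEBUG=OFF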
5.0 Debug settings and data collection tools
Several debug settings and data collection tools can help you debug a performance problem on systems using HPS. This section contains a subset of the most common setting changes and tools. If a performance problem persists after you check the debug settings and the data that was collected, call IBM service for assistance.
5.1 lsattr tuning
The lsattr command lists two trace and debug-level settings for the HPS links.
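For example, to display the current attribute settings (including the trace and debug levels) for an HPS interface, run the standard AIX lsattr command; sni0 is used here only as an example device name:
lsattr -El sni0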
5.3 Affinity LPARs
On p690 systems, if you are running with more than one LPAR for each CEC, make sure you are running affinity LPARs. To check affinity between CPU, memory, and HPS links, run the associativity scripts on the LPARs. To check the memory affinity setting, run the vmo command.
5.4 Small Real Mode Address Region on HMC GUI
Because the HMC and hypervisor code on POWER4 systems uses up physical memory, some physical memory is unavailable to the LPARs.
On the HMC GUI, select Service Applications -> Service Focal Point -> Select Serviceable Events.
5.7 errpt command
On AIX 5L, the errpt command lists a summary of system error messages. Some of the HPS subsystem errors are collected by errpt. To find out if you have hardware errors, you can either run the errpt command, or you can run the dsh command from the CSM manager:
dsh errpt | grep " 0223" | grep sysplanar0
(The value 0223 is the month and day in the errpt timestamp; substitute the date you want to check.)
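If the summary shows an entry you want to investigate further, display the detailed record with the -a flag; the identifier below is a placeholder for the error ID shown in the summary output:
errpt -a -j <error_identifier> | more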
• For HAL libraries: dsh sum /usr/sni/aix52/lib/libhal_r.a
• For MPI libraries: dsh sum /usr/lpp/ppe.poe/lib/libmpi_r.a (or run with MP_PRINTENV=yes)
To make sure you are running the correct combination of HAL, LAPI, and MPI, check the Service Pack Release Notes.
5.10 MP_PRINTENV
If you set MP_PRINTENV=YES or MP_PRINTENV=script_name, the output includes the following information about environment variables. The output for the user script is also printed, if it was specified.
MEMORY_AFFINITY
Single Thread Usage (MP_SINGLE_THREAD)
Hints Filtered (MP_HINTS_FILTERED)
MPI-I/O Buffer Size (MP_IO_BUFFER_SIZE)
MPI-I/O Error Logging (MP_IO_ERRLOG)
MPI-I/O Node File (MP_IO_NODEFILE)
MPI-I/O Task List (MP_IO_TASKLIST)
System Checkpointable (CHECKPOINT)
LoadLeveler Gang Scheduler
DMA Receive FIFO Size (Bytes)
Max outstanding packets
LAPI Max Packet Size (Bytes)
LAPI Ack Threshold (MP_ACK_THRESH)
LAPI Max retransmit buf size (MP_REXMIT_BUF_SIZE)
LAPI Max retransmit buf count (MP_REXMIT_BUF_C
MPCI: sends = 14
MPCI: sendsComplete = 14
MPCI: sendWaitsComplete = 17
MPCI: recvs = 17
MPCI: recvWaitsComplete = 13
MPCI: earlyArrivals = 5
MPCI: earlyArrivalsMatched = 5
MPCI: lateArrivals = 8
MPCI: shoves = 10
MPCI: pulls = 13
MPCI: threadedLockYields = 0
MPCI: unorderedMsgs = 0
LAPI: Tot_dup_pkt_cnt=0
LAPI: Tot_retrans_pkt_cnt=0
LAPI: Tot_gho_pkt_cnt=0
LAPI: Tot_pkt_sent_cnt=14
LAPI: Tot_pkt_recv_cnt=15
LAPI: Tot_data_sent=4194
LAPI: Tot_data_recv=3511
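The counters shown above are produced when task statistics are enabled for the job. A minimal way to turn them on for a POE run is sketched below; the value print is intended to request the end-of-job report (confirm against the PE documentation for your level), and the program name and task count are placeholders:
export MP_STATISTICS=print
poe ./my_app -procs 32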
Run the following command:
/usr/sbin/ifsn_dump -a
The data is collected in sni.snap (sni_dump.out.Z), and provides useful information, such as the local MAC address:
mac_addr 0:0:0:40:0:0
If you are seeing arpq drops, ensure the source has the correct mac_addr for its destination. The ndd statistics listed in ifsn_dump are useful for measuring packet drops in relation to the overall number of packets sent and received.
To help you isolate the exact cause of packet drops, the ifsn_dump -a command also lists the following debug statistics. If you isolate packet drops to these statistics, you will probably need to contact IBM support.
There are two routes. sending packet using route No. 1
ml ip address structure, starting:
ml flag (ml interface up or down) = 0x00000000
ml tick = 0
ml ip address = 0xc0a80203, 192.168.2.3
There are two preferred route pairs:
from local if 0 to remote if 0
from local if 1 to remote if 1
There are two actual routes (two preferred).
---------------------------------------
from local if 0 to remote if 0
destination ip address structure:
if flag (up or down) = 0x000000c1
if tick = 0
ipaddr = 0xc0a80003, 192.168.
MAC WOF [. . .] (2F870): Bit: 1
5.12.4 Packets dropped in the switch hardware
If a packet is dropped within the switch hardware itself (for example, when traversing the link between two switch chips), evidence of the packet drop is on the HMC, where the switch Federation Network Manager (FNM) runs. You can run /opt/hsc/bin/fnm.snap to create a snap archive in /var/hsc/log (for example, /var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz).
5.14 LAPI_DEBUG_COMM_TIMEOUT
If the LAPI protocol experiences communication timeouts, set the environment variable LAPI_DEBUG_COMM_TIMEOUT to PAUSE. This causes the application to issue a pause() call when it encounters a timeout, stopping the application instead of terminating it.
5.15 LAPI_DEBUG_PERF
The LAPI_DEBUG_PERF flag is not supported and should not be used in production. However, it can provide useful information about packet loss.
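Both of these LAPI debug variables are set in the environment of the POE job. As an illustration for the timeout case described in section 5.14 (the program name and task count are placeholders):
export LAPI_DEBUG_COMM_TIMEOUT=PAUSE
poe ./my_app -procs 32
Because the tasks are paused rather than terminated, you can attach a debugger such as dbx -a <pid> to a stopped task to examine its state.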
5.16 AIX 5L trace for daemon activity
If you suspect that a system daemon is causing a performance problem on your system, run AIX 5L trace to check for daemon activity. For example, to find out which daemons are taking up CPU time, use the following process:
trace -j 001,002,106,200,10c,134,139,465 -a -o /tmp/trace.aux -L 40000000 -T 20000000
sleep XX     (XX is the time for your trace)
trcstop
trcrpt -O 'cpuid=on exec=on pid=on tid=on' /tmp/trace.aux > /tmp/trace.out
Look at /tmp/trace.out for the daemon activity.
7.2 MPI documentation
Parallel Environment for AIX 5L V4.1.1 Hitchhiker's Guide, SA22-7947-01
Parallel Environment for AIX 5L V4.1.1 Operation and Use, Volume 1, SA22-7948-01
Parallel Environment for AIX 5L V4.1.1 Operation and Use, Volume 2, SA22-7949-01
Parallel Environment for AIX 5L V4.1.1 Installation, GA22-7943-01
Parallel Environment for AIX 5L V4.1.1 Messages, GA22-7944-01
Parallel Environment for AIX 5L V4.1.1 MPI Programming Guide, SA22-7945-01
Parallel Environment for AIX 5L V4.1.
© IBM Corporation 2005
IBM Corporation
Marketing Communications
Systems Group
Route 100
Somers, New York 10589
Produced in the United States of America
April 2005
All Rights Reserved
This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice.