HPCPI and Xtools Version 0.6.
© Copyright 2008 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
About This Document

This document describes how to install and use the HPCPI and Xtools performance analysis tools on Linux systems running on HP Integrity Servers.

Intended Audience

This document is intended for programmers with Linux experience and knowledge of Intel® Itanium® or AMD Opteron™ processor architecture.

Document Organization

This document is organized as follows:

Chapter 1: "Introduction"
This chapter provides an overview of the product components.
Key          The name of a keyboard key. Return and Enter both refer to the same key.
Term         The defined use of an important word or phrase.
User input   Commands and other text that you type.
Variable     The name of a placeholder in a command, function, or other syntax display that you replace with an actual value.
[]           The contents are optional in syntax. If the contents are a list separated by |, you can choose one of the items.
{}           The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
...          The preceding element can be repeated an arbitrary number of times.
|            Separates items in a list of choices.
WARNING      A warning calls attention to important information that if not understood or followed can result in personal injury or nonrecoverable system problems.
CAUTION      A caution calls attention to important information that if not understood or followed can result in loss of data or damage to equipment or software.
IMPORTANT    This alert provides essential information to explain a concept or to complete a task.
NOTE         A note contains additional information to emphasize or supplement important points of the main text.
product changes. To ensure that you receive the updated or new editions, subscribe to the appropriate product support service. See your HP sales representative for details.

Manufacturing Part Number: 5992–4009
Supported Operating Systems: 2.6-based versions of Red Hat Linux
Supported Versions: Version 0.6.6
Edition Number: 1
Publication Date: March 2008

HP Encourages Your Comments

HP encourages your comments concerning this document. We are committed to providing documentation that meets your needs.
1 Introduction The HP Continuous Profiling Infrastructure (HPCPI) and Xtools are performance analysis tools for Linux systems running on HP Integrity Servers. HPCPI enables you to analyze the performance and execution of programs and to identify ways to improve runtime performance. You can also use HPCPI to analyze CPU events for a system.
— hpcpiprof
The hpcpiprof utility displays performance profiles for systems (per image) or images (per procedure). The following excerpt from hpcpiprof output shows the number of CPU cycles used per image on a system:

CPU_CYCLES  %      cum%   image
----------  -----  -----  ---------------------------
283629e6    96.9%  96.9%  vmlinux-2.6.9-34.7hp.XCsmp
3824e6      1.3%   98.2%  libm-2.3.4.so
2117e6      0.7%   98.9%  :
:
greater the number of samples, the closer the statistical correspondence. Therefore, the statistical event samples provide a reasonably accurate profile of actual event distributions in a program. Comparison with End-to-End Event Counts Some profilers monitor the total number of events that occur during a time interval, such as the duration of a program. These end-to-end event counts are usually accurate, even for short programs, because they are direct measurements and not statistical.
Xtools

The Xtools utilities are X11 clients with GUIs that enable you to monitor the performance of multiple systems and individual systems. The Xtools bundle consists of the following utilities:
• xclus
• xcxclus
• xperf
• xcxperf

xclus and xcxclus

The xclus and xcxclus utilities enable you to monitor performance and resource utilization for multiple systems or nodes in a cluster.
Figure 1-1 xclus Display for AMD Opteron Systems

Table 1-3 Statistics for xclus and xcxclus

xclus Statistics (Enhanced)
For Itanium and Opteron processors:
• CPU utilization
For Itanium processors only:
• Front-side bus (FSB) activity
• Memory interface data (MID) bus activity
• I/O bus activity

xcxclus Statistics (Generic)
• Processor activity
• Ethernet activity
• Physical memory utilization
• Interconnect I/O (Gigabit Ethernet, Infiniband, and Elan Quadrics QsNetII)
• Disk I/O (for nodes with attached storage)
Figure 1-2 xperf Display for an Itanium System
Table 1-4 Statistics for xperf and xcxperf

xperf Statistics (Enhanced)
For Itanium and Opteron processors:
• CPU utilization
• Instructions per cycle
• Floating point operations retired per cycle (FPC)
For Itanium processors only:
• Per cycle statistics for numerous execution and stall events
• Cache miss events
• System bus utilization
• I/O bus activity
• Direct Memory Access (DMA) bus activity
For Opteron processors only:
• Per cycle statistics ...

xcxperf Statistics (Generic)
Processor-independent statistics, including CPU utilization, memory and swap utilization, and disk, NFS, Lustre, Infiniband, Ethernet, and Quadrics QsNetII (Elan) activity. See "Viewing xcxperf (Generic) Statistics" (page 101).
2 Installing HPCPI and Xtools

This chapter describes the installation requirements and procedures for HPCPI and Xtools. This chapter addresses the following topics:
• "Installation Requirements" (page 25)
• "RPM Packages" (page 25)
• "Installing the Software" (page 26)
• "Verifying the Installation" (page 30)
• "Removing the Software" (page 30)

Installation Requirements

This section contains installation requirements.

Patch Requirements

See the HPCPI and Xtools Release Notes for any patch requirements.
• hpcpi
This package contains all the files necessary to use HPCPI.
• xtools-common
This package contains files and utilities that are common to xclus and xperf, and to xcxclus and xcxperf (the HP XC variants of xclus and xperf). You must install this package if you are installing the xtools-clients or xtools-xc_clients package.
• xtools-clients
This package contains xclus and xperf and associated files. You must also install the xtools-common package to use xclus and xperf.
HP also recommends that you install the HPCPI or Xtools software when the system is idle to minimize the effects of the installation procedure on other computing tasks. You can use the SLURM scontrol command with the State=drain parameter to enable existing jobs to complete on a node and prevent new jobs from starting.
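For example, the following command (the node name is hypothetical) drains node n16 so that its running jobs complete and no new jobs start:
# scontrol update NodeName=n16 State=drain Reason="HPCPI installation"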
— You can immediately install software on the clients; you do not have to wait until you have created the golden image on the head node.
— You do not have to manually run commands that are automatically executed by RPM.
The disadvantage of this method is that it is not a standard HP XC installation procedure. These procedures are described in the following sections.
4. Set the shell variable nn to `nodename` to shorten the commands in the remainder of this procedure:
   # nn=`nodename`
   Where `nodename` resolves to the name of the local node, which is the head node and image server. (The nn shell variable is used in the cexec commands to exclude the local node from the command execution.)
5. Create a new golden image but do not set the clients for network reboot as follows:
   # updateimage --gc `nodename` --no-netboot
6.
4. Copy the package files to the shared directory /hptc_cluster as follows:
   # cp package_file package_file ... /hptc_cluster
5. Verify that HPCPI and Xtools are not running on the client, and that no time-sensitive tasks are running. Run RPM on each remote client. You can do this using cexec, or you can use the job scheduler to submit one job for each node, where the job runs rpm on the target client. For example:
   # cexec -x `nodename` -a 'cd /hptc_cluster ;\
   rpm -i package_name package_name ...'
6.
3 Getting Started with HPCPI

This chapter shows the commands used in a simple HPCPI user session.

NOTE: The program analyzed in this chapter is a simple program selected for illustrative purposes and is not representative of the types of programs most users analyze.
You will create the directory in the next step. The following example uses the directory /tmp/hpcpidb:
% setenv HPCPIDB /tmp/hpcpidb
For information about selecting directories for HPCPI databases, see "Selecting a Location for the HPCPI Database Directory" (page 36).
Step 8: Viewing Per Procedure Statistics for the Application

The following command enables you to view per-procedure statistics for the image myApp:
% hpcpiprof myApp
The output is as follows:

Event Name  Events         Period  Samples
----------  -------------  ------  --------
CPU_CYCLES  1925103240000  60000   32085054

CPU_CYCLES  %      cum%    procedure
----------  -----  ------  -----------
191201e7    99.3%  99.3%   routine1
1309e7      0.7%   100.0%  unknown_rou
4 Using HPCPI

This chapter describes how to perform basic HPCPI tasks, including how to start HPCPI, control the HPCPI daemon, and view data using HPCPI tools. This chapter also includes tips on using HPCPI.
Selecting a Location for the HPCPI Database Directory The HPCPI database directory contains files with performance data. The files are organized in subdirectories by epoch date and system name (see “HPCPI Database Directories and Files” (page 113) ). The hpcpid daemon writes data to the files and the HPCPI analysis tools (hpcpiprof, hpcpilist, hpcpitopcounts, and hpcpicat) read data from the files.
1 groups; user definition:
  1 CPU_CYCLES
--- multiplexing interval = 1000000 ---
Logging to /usr/users/who1/myDB/hpcpid-node6.log
Daemon is running on pid 1297

Many of the data fields are for HP use only. You can use the following data fields:
1 Build date for the daemon. Include this information when reporting HPCPI problems.
2 Table showing the events hpcpid is monitoring.
3 Table showing the number of event groups and the events in each group.
event_set_name   Specifies an event set name.
value            Specifies the event interval, which is the number of times an event is recorded by the PMU before generating an interrupt for hpcpid to record a sample. Range: 2000-65535. Default: 60000.

Commonly Used Event Sets

Table 4-1 describes some of the more commonly used event sets. To see a complete list and the events contained in each group, use the hpcpid -show-event-sets command.
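For example, the following commands list the available event sets and then start the daemon with one of them (the event set name, interval value, and argument separator shown here are illustrative assumptions; see hpcpid(1) for the exact -events syntax):
% hpcpid -show-event-sets
% hpcpid -events dcache:30000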
Event Duty Qualifier

The -events statement also supports a duty qualifier, which enables you to control how often an event is monitored when you are monitoring more events than the number of hardware event counters. For more information, see hpcpid(1).
Running an Application for Analysis After you start the HPCPI daemon, you can run the applications you want to analyze; run the applications as you normally would. If you want to use an HPCPI label to isolate data for a specific process, you can start the process and establish the label using the hpcpictl label command. Labeling Data An HPCPI label enables you to isolate performance data for processes according to process ID, process group ID, user ID, or CPU number.
Controlling the Daemon with hpcpictl

The hpcpictl utility is a userspace application that controls the operation of the hpcpid daemon. You can use hpcpictl to do the following:
• Flush HPCPI data to disk (hpcpictl flush)
• Stop the HPCPI daemon (hpcpictl quit)
• Start a new epoch (hpcpictl epoch)
• Show information about the HPCPI daemon (hpcpictl show)

Flushing Data to Disk: hpcpictl flush

By default, the hpcpid daemon flushes data to disk every 10 minutes.
pretty  proper name  interval  rnd  duty    active
------  -----------  --------  ---  ------  ------
Cycles  CPU_CYCLES   60000     no   always  1/1

hpcpictl show successful
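For example, a typical control sequence flushes the current data, starts a new epoch, checks the daemon status, and finally stops the daemon:
% hpcpictl flush
% hpcpictl epoch
% hpcpictl show
% hpcpictl quit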
Viewing Data with hpcpiprof, hpcpilist, and hpcpitopcounts

HPCPI provides the following utilities to display HPCPI data:
• hpcpiprof
Displays performance profiles for systems (per-image data) or images (per-procedure data).
• hpcpilist
Lists per-line performance statistics for a procedure.
• hpcpitopcounts
Lists the instructions with the most counts for performance events.
HPCPI also includes the hpcpicat utility, which displays the contents of a performance data file with minimal formatting.
Viewing Per-Image Data: hpcpiprof

If you run hpcpiprof without an image name, it displays statistics for the system, partitioned per-image. For example:

$ hpcpiprof
Event Name  Events         Period
----------  -------------  ------
CPU_CYCLES  7969037220000  60000

CPU_CYCLES  %      cum%   image
----------  -----  -----  ---------------------------
385649e7    48.4%  48.4%  vmlinux-2.6.9-34.7hp.XCsmp
198708e7    24.9%  73.3%  libm-2.3.4.so
192510e7    24.2%  97.5%  myApp
10636e7     1.3%   98.8%  libperl.so
4963e7      0.6%   99.4%  libc-2.3.4.so
:
%      Lists the percentage of event samples for the event type that occurred in the image.
cum%   Lists the cumulative percentage of all event samples for this entry and all entries above it. In this example, the event count for the first image (vmlinux-2.6.9-34.7hp.XCsmp) is 48.4% of the recorded total, and the event count for the first and second images together is 73.3% of the recorded total.
image  Lists the image name.
Viewing Per-Procedure Data: hpcpiprof image_name

If you run hpcpiprof with an image name, it displays statistics for the image, partitioned per-procedure. For example:

% hpcpiprof myApp
myApp: not found.
+ Found and using /var/users/who1/bin/myApp
-------
Event Name  Events         Period  Samples
----------  -------------  ------  --------
CPU_CYCLES  1925103240000  60000   32085054

CPU_CYCLES  %      cum%    procedure
----------  -----  ------  -----------
191201e7    99.3%  99.3%   routine1
1309e7      0.7%   100.0%  unknown_rou
Viewing Per-Instruction Data: hpcpilist procedure_name image_name

The hpcpilist utility lists HPCPI performance statistics per line of source and/or assembly code in a procedure within the specified image file. For example:

% hpcpilist routine1 myApp
myApp: not found.
Interpreting hpcpilist Event Counts The value of the instruction pointer recorded is typically several or many instructions after the instruction that caused the event. This lag or skid is common to all profilers that sample instruction pointers and HPCPI does not attempt to model the system to correct for this. As a result, HP recommends that you examine the assembly code surrounding regions where high event counts occur and consider if the surrounding code might be triggering the events.
Listing the Instructions with the Highest Event Counts: hpcpitopcounts The hpcpitopcounts utility displays the n instructions with the highest counts for an event. By default, n is 100. To display an alternate number of instructions, use the -n option. If the HPCPI daemon monitored multiple events, hpcpitopcounts uses the first event in the database as the sort key. To specify an alternate sort key, use the -st option, as described in “Specifying an Alternate Sort Key” (page 53).
Listing Instructions in an Image: hpcpitopcounts image_name

You can run hpcpitopcounts with an image name to list the instructions with the highest event counts within an image. For example:

% hpcpitopcounts myApp
myApp: not found.
+ Found and using /var/users/who1/bin/myApp
-------
Event Name  Events        Period  Samples
----------  ------------  ------  --------
CPU_CYCLES  116980320000  60000   1949672

CPU_CYCLES  %     cum   procedure
----------  ----  ----  ---------
60800e06    52.0  52.0  main
8010e06     6.8   58.8  main
7973e06     6.8   65.6  main
:
HPCPI Utility Options

This section describes options for the hpcpiprof, hpcpilist, and hpcpitopcounts utilities.

Specifying an Alternate Database

By default, the HPCPI utilities use the value of the HPCPIDB environment variable as the HPCPI database directory. Use the -db option to specify an alternate database. The syntax is as follows:
-db database
Where:
database   Specifies the directory for the HPCPI database.
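For example, the following command (the directory name is hypothetical) displays per-image data from an alternate database:
% hpcpiprof -db /tmp/otherdb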
In a cluster environment with a consolidated HPCPI database and synchronized epochs, you might want to include or exclude the data from specific systems or nodes. To view data from individual nodes, use the -hosts option to include or exclude the data from specific systems or nodes. The syntax to include data from specific systems or nodes is as follows:
-hosts hostname[,hostname]...
The syntax to exclude data from specific systems or nodes is as follows:
-hosts all-hostname[,hostname]...
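For example, the first of the following commands (node names are hypothetical) includes data only from nodes n1 and n2, and the second includes data from all nodes except n3:
% hpcpiprof -hosts n1,n2
% hpcpiprof -hosts all-n3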
data for myApp with the label myLabel. The following command displays performance data for code called by myApp from libc-2.3.4.so:
% hpcpiprof -label myLabel libc-2.3.4.so

Specifying an Alternate Sort Key

When hpcpiprof and hpcpitopcounts display information for multiple events, the utilities sort the data table entries according to the event count for the first event in the HPCPI database. To specify an alternate sort key, use the -st option.
Events       %      cum%    Samples  procedure    image
-----------  -----  ------  -------  -----------  -----
85189620000  89.5%  89.5%   28200    main         myApp
10002660000  10.5%  100.0%  3000     unknown_rou  myApp

Limiting the hpcpiprof Output

The hpcpiprof -keep option lists entries only until the cumulative percentage meets a specified value. This option is useful if you do not want to display entries with low statistical values. The syntax for the option is as follows:
hpcpiprof -keep percentage
Where percentage is a floating point number in the 0 - 100 range.
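For example, the following command lists entries only until the cumulative percentage reaches 95:
% hpcpiprof -keep 95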
Tips and Best Practices for Using HPCPI

This section contains tips and best practices for using HPCPI.

Tips

To profile an application, you start by monitoring CPU cycles. After collecting and flushing the HPCPI data, you can run the hpcpiprof command without specifying an image name to view system activity, such as kernel and library activity. Next, run the hpcpiprof command with your image name (hpcpiprof image_name) to determine which procedures are consuming the most CPU cycles.
Limiting the Event Count Display (hpcpiprof -keep Option)

If you have a lot of data, you can use the -keep option with hpcpiprof to limit the number of event counts it displays. For example:
% hpcpiprof -keep 99

Using Database Directories, Epochs, or Labels to Organize Your Data

You can use different HPCPI database directories, epochs, or labels to organize performance data from different applications or instances of an application.
Itanium Instruction Metrics

On Itanium processors, the event counter IA64_INST_RETIRED includes retired instructions and retired no-operation instructions (NOP_RETIRED) but not retired predicate-squashed instructions (PREDICATE_SQUASHED_RETIRED).
• To calculate the total number of retired instructions, add IA64_INST_RETIRED and PREDICATE_SQUASHED_RETIRED.
• To determine the number of effective retired instructions, subtract NOP_RETIRED from IA64_INST_RETIRED.
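Expressed as formulas, these two metrics are:

total retired instructions     = IA64_INST_RETIRED + PREDICATE_SQUASHED_RETIRED
effective retired instructions = IA64_INST_RETIRED - NOP_RETIRED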
5 Using HPCPI Labels This chapter describes how to use HPCPI labels.
Simple HPCPI Session Using Labels In the following session, the user associates the label myLabel with the performance data for a single process, myApp. This example also uses the label with hpcpiprof to extract performance data for myApp, including data for routines called by myApp from a shared library. The following HPCPI session shows the commands for using HPCPI with labels. The steps are numbered and described in the sections that follow.
:
If you run the same hpcpiprof command and specify the label name (hpcpiprof -label myLabel), hpcpiprof displays event counts for code executed in all images for myApp, such as code in shared libraries called from myApp. An extract of the output is as follows:

% hpcpiprof -label myLabel
Event Name  Events         Period
----------  -------------  ------
CPU_CYCLES  3914574240000  60000

CPU_CYCLES  %      image
----------  -----  --------------------------
198708e7    50.8%  libm-2.3.4.so
192510e7    49.2%  myApp
192e7       0.0%   vmlinux-2.6.9-34.7hp.XCsmp
:
Label Selectors Using the hpcpictl label command in its simplest form is sufficient if you are executing and monitoring a single process that is executed directly from a run string. To monitor groups of processes or processes that are started indirectly, you can specify label selectors. When you specify selectors, HPCPI associates data from all processes that match the selectors with the label, independent of the process launched by the command in the run string.
-or
-equiv

-not Operator

The unary postfix operator -not negates the specification. The following example uses the -not operator to select events for nonsuperuser processes:
% hpcpictl label nonsuper -uid 0 -not sleep 30
This selects systemwide events for nonsuperuser processes (processes that do not have UID 0) for 30 seconds (the runtime for the sleep 30 process) and associates them with the label nonsuper.
Multiple Labels

An event can be recorded in only one data set, that is, under one label. If you have multiple labels defined and a process matches the selectors for more than one label, the events for that process are recorded in only one data set, and which data set receives them is indeterminate.

Reusing Labels

You can specify the same label name in multiple hpcpictl commands.
Label Examples

This section contains HPCPI label examples.

Existing Processes: -pid pid

You can use the ps utility to determine the PID of an existing process and use the -pid pid selector to attach a label to performance data for that process.
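For example, the following commands (the PID shown is hypothetical) find the PID of a running process and then label its data for ten minutes:
% ps -ef | grep myApp
% hpcpictl label myLabel -pid 28224 sleep 600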
Alternatively, you can use the sleep 99999 command and manually terminate the sleep process when you are done taking measurements. For example:
% hpcpictl label all -pid -1 -not sleep 99999
You can also use this selector with the srun utility in a cluster environment to capture data for all processes on the local system for the duration of the srun execution. This would include the daemon started by srun and any user processes that srun launches on the local system.
Creating Labels in Programs You can use a function such as popen() to invoke the hpcpictl label command within an application and assign a label to specific code areas in the application. For example, you can profile the execution phase of an application only, without the initialization, reporting, or finalization phases. This is analogous to benchmarks, which typically report results for only the execution phase.
        } else {
            perror("popen()");
        }
    }
}

Notes

Note the following items:
• The first if block terminates an existing label process. This block provides a locking mechanism and is included for applications that use multiple phases or start and stop labels multiple times during execution.
• You can construct the label name using environment variable values, numeric function arguments (such as problem size or phase number), or text function arguments (such as data set name or phase name).
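A minimal, self-contained sketch of this technique follows. It is illustrative only: the function name, the label name exec_phase, and the use of a background sleep are placeholders, and the locking described in the notes above is omitted.

#include <stdio.h>
#include <unistd.h>

/* Illustrative sketch, not the original listing: attach an HPCPI
 * label to this process at the start of an application phase. */
static void start_phase_label(void)
{
    char cmd[128];
    FILE *fp;

    /* Label events for this PID; the backgrounded sleep keeps the
     * label process alive for the duration of the phase. */
    snprintf(cmd, sizeof(cmd),
             "hpcpictl label exec_phase -pid %d sleep 99999 &",
             (int)getpid());
    fp = popen(cmd, "r");
    if (fp == NULL) {
        perror("popen()");
    } else {
        pclose(fp);
    }
}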
6 Using HPCPI on an HP XC Cluster

This chapter describes additional procedures for using HPCPI on an HP XC cluster. This chapter addresses the following topics:
• "Overview" (page 69)
• "Collecting Data on Multiple Nodes" (page 70)
• "Collecting Data on One Node" (page 73)

Overview

When using HPCPI on an HP XC cluster you can do the following:
• Collect performance data from some or all nodes in the job allocation.
• Collect performance data from one node in the job allocation.
Collecting Data on Multiple Nodes This section describes the tasks you must perform to collect data on multiple nodes, and includes an example using HP-LSF, SLURM, and MPI. Consolidating and Synchronizing Data If you are collecting performance data from all nodes in your job allocation, you can consolidate the HPCPI data in one database and in one epoch. By default, each hpcpid daemon starts a new epoch. To consolidate and synchronize the data, follow these steps: 1. 2. 3.
Submitting the Job

Use the HP-LSF bsub command to submit the following job:
% bsub -n num_nodes \
  mpirun -srun \
  --task-prolog=`pwd`/slurm.task-prolog.hpcpi \
  --task-epilog=`pwd`/slurm.task-epilog.hpcpi \
  myApp myArgs
Here, num_nodes is the number of nodes for the job, myApp is the name of the MPI application, and myArgs are any arguments for the MPI application. The prolog file is slurm.task-prolog.hpcpi and the epilog file is slurm.task-epilog.hpcpi.
the daemon when the specified PID process terminates. In this case, pid is the PID of the initial slurmstepd on the node for this task. By default, the -terminate-with option does not flush HPCPI data to disk before terminating the daemon. You can specify the -doflush option with the -terminate-with option to flush the data before terminating the daemon.
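A minimal task prolog along these lines might look like the following sketch; the database path is hypothetical, the use of $PPID to obtain the slurmstepd PID is an assumption, and the shipped slurm.task-prolog.hpcpi file may differ:

#!/bin/sh
# Illustrative sketch, not the shipped slurm.task-prolog.hpcpi:
# start an HPCPI daemon that writes to a shared database directory
# and terminates (flushing first) when the launching slurmstepd
# exits. A complete prolog would also avoid starting a second
# daemon on the same node.
hpcpid -db /hptc_cluster/hpcpidb -terminate-with $PPID -doflush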
Collecting Data on One Node

To collect data on one node in a cluster environment, you can use the procedures described in Chapter 3 (page 31) and Chapter 4 (page 35) with the following additional guidelines:
• Do not start the hpcpid daemon on each node.
• Do not start the daemon from a script that is executed on every node.
• If you are using the hpcpictl label command, execute the distribution utility (such as mpirun) from the hpcpictl label command (hpcpictl label...mpirun...).
7 Using Xtools

This chapter describes how to use xclus, xcxclus, xperf, and xcxperf.
Xtools Overview

The Xtools utilities are X11 clients with GUIs that enable you to monitor the performance of multiple systems and individual systems. The Xtools bundle consists of the following utilities:
• xclus and xcxclus
The xclus and xcxclus utilities enable you to monitor performance and resource utilization for multiple systems or nodes in a cluster.
• not require superuser privileges to use the -unrestricted-nodes option and supports the -unrestricted-nodes option for all users. On non-cluster systems, you must specify the nodes you want xclus to monitor. You do not need to specify the -unrestricted-nodes option when running xclus on non-cluster systems.
Starting xclus and xcxclus

To start xclus or xcxclus, follow these steps:
1. Set up the Xtools environment.
2. Set the DISPLAY environment variable.
3. Start the xclus or xcxclus program.

If you are using xclus and are running on a non-cluster system or do not have a job allocation, you must specify the nodes you want to monitor.
Specifying Nodes with xcxclus

By default, you do not need to specify the nodes you want to monitor with xcxclus, and xcxclus monitors all the nodes that are in your job allocation when it starts. However, you can specify nodes with xcxclus to:
• Monitor a subset of nodes in your job allocation.
• Monitor nodes outside of your job allocation.
To monitor nodes outside of your job allocation, you must specify the -unrestricted-nodes option and have superuser privileges.
Specifying the Cluster File Name with the -cluster Option

If the cluster file is not named cluster and is not located in the current working directory, you must use the -cluster option to specify the name of the cluster file.
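For example, the following command (the file name is hypothetical) starts xclus with an explicitly named cluster file:
% xclus -cluster /home/who1/mycluster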
Viewing xclus and xcxclus Displays

Figure 7-1 shows an xclus display for four Itanium systems. To view an xclus display with AMD Opteron systems, see Figure 1-1 (page 21).

Figure 7-1 xclus Display for Itanium Systems

Each icon shows data for one node. By default, xclus or xcxclus displays one icon for each node if it is monitoring fewer than 64 nodes; otherwise each icon represents a group of nodes with similar performance statistics, as described in "Viewing Grouped Nodes" (page 92).
Viewing xclus (Enhanced) Itanium Icons

By default, xclus displays enhanced icons for Itanium processors. Figure 7-2 shows an enhanced icon for a node with two single-core Itanium processors.

Figure 7-2 Itanium xclus Display

1 Node designator.
2 Utilization rates for core 0 and 1.
3 FSB bus utilization rate.
4 The dual-headed arrows at the bottom of this icon each represent an I/O bus. In this example, the left-most arrow shows data for I/O bus 0.
Viewing xclus (Enhanced) Single-Core and Dual-Core AMD Opteron Node Icons

By default, xclus displays enhanced icons for AMD Opteron processors. Figure 7-3 shows an enhanced icon for a node with four single-core AMD Opteron processors.

Figure 7-3 Four Single-Core AMD Opteron xclus Display

The four largest rectangles represent one processor each. Each processor has a smaller rectangle attached to it representing local DRAM and arrows representing HyperTransport links.
1 Node designator.
Viewing xclus (Enhanced) Native Quad-Core AMD Opteron Node Icons

By default, xclus displays enhanced icons for AMD Opteron processors. Figure 7-4 shows an enhanced icon for a node with two native quad-core AMD Opteron processors.

Figure 7-4 Native Quad-Core AMD Opteron xclus Display

The cores are represented by two sets of four small rectangles. The processors are dual-railed, so each processor has two HyperTransport buses to the other processor.
Viewing xcxclus (Generic) Node Icons

By default, the xcxclus utility displays generic icons for all processor types, and the information displayed is the same for all processor types. Figure 7-5 shows a generic icon for a node.

Figure 7-5 Generic Node xcxclus Display

1 Node designator.
2 Utilization rates for core 0 and core 1. The xcxclus utility displays a rectangle with the utilization rate for each processor.
Showing Statistic Names and Descriptions If you move your mouse over an icon area, xclus or xcxclus opens a window with the name of the statistic and more information about the data. Figure 7-6 is an example of this display for CPU 0 utilization on node onfire16. Figure 7-6 CPU Description Window Showing Bandwidth or Utilization Rates By default, xclus and xcxclus show utilization rates for I/O devices.
You can also modify the number of icons that xclus or xcxclus displays per row. By default, the xclus or xcxclus utility attempts to display eight node icons per row. You can specify an alternate value for the row width as follows:
• Specifying the -row-width argument when you start xclus or xcxclus
• Setting the X11 resource *xclus.
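For example, the following command displays four node icons per row:
% xclus -row-width 4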
Table 7-2 xcxclus (Generic) Menu Options

Menu     Option            Description
File     Exit...           Stops the xcxclus utility.
Options  Group Control...  Opens a dialog box that enables you to control node grouping parameters. This option is present only when node grouping is active. For more information, see "Viewing Grouped Nodes" (page 92).
         Refresh...        Opens a dialog box that enables you to set the refresh rate.
         Modify Key...
View
Hold
Recording, Replaying, and Plotting xclus and xcxclus Data

You can save the data from the xclus or xcxclus utility in a file. The utilities update data for each monitored node every second. You can use this data file either to replay the data or to plot graphs of node performance statistics.

Recording Data

To record data and create a data file, specify the -output option when starting the xclus or xcxclus utility.
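For example, the following command (the file name is hypothetical) records the monitoring data in a file for later replay or plotting:
% xclus -output /tmp/run1.xclus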
plot_file_prefix.xclus.gnuplot Script file for gnuplot. You can redisplay the plotted data using the /opt/xtools/gnu/bin/gnuplot command with the plot_file_prefix.xclus.gnuplot file name as its operand. Figure 7-7 shows xclus plotted data.
Figure 7-7 Plotted Data from xclus
Starting xperf or xcxperf from xclus or xcxclus To start xperf from xclus or to start xcxperf from xcxclus, click a node icon. Viewing Grouped Nodes If you are monitoring a large number of nodes, xclus or xcxclus groups nodes with similar performance profiles and displays a single icon for the group.
• CPU utilization
• DRAM utilization
• HyperTransport link utilization (processor-to-processor)
• HyperTransport link utilization (to external devices)
If all the above utilization rates (for a given processor type) are the same within a tolerance range, the nodes are placed in the same group.
Using xperf and xcxperf The following sections describe general procedures for using xperf and xcxperf. The xperf and xcxperf utilities are similar, and the procedures for using them are the same, with the following differences: • By default, the xperf utility displays enhanced data. By default, the xcxperf utility displays generic data.
Viewing xperf and xcxperf Displays By default, xperf displays graphs for the statistics listed in “Viewing Itanium xperf (Enhanced) Statistics” (page 96) or “Viewing AMD Opteron xperf (Enhanced) Statistics” (page 98), and xcxperf displays graphs for the statistics listed in “Viewing xcxperf (Generic) Statistics” (page 101).
Viewing Itanium xperf (Enhanced) Statistics Figure 1-2 (page 22) shows an xperf display for an Itanium system. By default, xperf displays graphs with processor-dependent, enhanced statistics. The xperf utility displays graphs with the following enhanced statistics for Itanium processors. NOTE: The processor event names listed in this section are used for Itanium Processor 9000 series and may differ slightly from the event names used for other Itanium processor types.
• L3cache misses: Level 3 cache misses
• TLB misses: Translation Lookaside Buffer misses

SysBus
Displays the following system bus utilization rates:
• Address: BUS_ALL.
Viewing AMD Opteron xperf (Enhanced) Statistics Figure 7-9 shows an xperf display for an AMD Opteron system. By default, xperf displays graphs with processor-dependent, enhanced statistics. Figure 7-9 xperf Display for an AMD Opteron System The xperf utility displays graphs with the following enhanced statistics for AMD Opteron processors.
NOTE: AMD does not provide code-usable names for AMD Opteron processor events. In addition, the names listed in this section are used for single-core and dual-core AMD Opteron processors and may differ slightly from the event names used for native quad-core AMD Opteron processors.
• L2I Misses: ICACHE_REFILLS_FROM_LS_FROM.SYSTEM
• Misses: ICACHE_MISSES
• Icache Fetches: ICACHE_FETCHES

Branch
Displays the following branch metrics:
• Branch Rate: RETIRED_BRANCHES / RETIRED_INSTRS
• Mispredicts: RETIRED_BRANCHES_MISPREDICTED / RETIRED_BRANCHES
• Branches Taken: RETIRED_TAKEN_BRANCHES / RETIRED_BRANCHES

DRAM
Displays the following events per unhalted cycles:
• DRAM Conflicts Per 10k Cycles: DRAM_ACCESSES.PAGE_CONFLICT
• DRAM Misses Per 10k Cycles: DRAM_ACCESSES.
Viewing xcxperf (Generic) Statistics Figure 7-10 shows an xcxperf display. By default, xcxperf displays graphs with processor-independent statistics for all processors. Figure 7-10 xcxperf Display The xcxperf utility displays graphs with the following generic statistics. If a component is not installed on a system, xcxperf does not display the corresponding graph.
Disk
Displays the throughput rates in Mb/s for the following disk activities from /proc/diskstats:
• Write
• Read

NFS
Displays statistics for the following NFS activities in calls per second from /proc/net/rpc/nfs:
• Write
• Read

Lustre
Displays the throughput rates in Mb/s for the following Lustre activities from /proc/fs/lustre/llite:
• Write
• Read

Infiniband
Displays the throughput rates in Mb/s for the following Infiniband activities from /proc/voltaire:
• Write
• Read

Ethernet
Displays the throughput rates in Mb/s for the following Ethernet activities:
• Write
• Read
Elan
Displays the throughput rates in Mb/s for the following Quadrics QsNetII interconnect activities from Elan memory registers:
• Write
• Read

Memory
Displays the following utilization rates (percentages) from /proc/meminfo:
• Free: Free memory
• Buffers: Memory in the buffer caches
• Cached: Memory in the page cache minus the swap cache
• SwapCached: Memory in the swap cache; memory that once was swapped out and is swapped back in, but is also still in the swapfile
• Application: Memory that is not in a cache
Displaying Color Legends and Creating Tear-Away Legends

To display the color legend for a graph, select the menu item with the graph name, such as CPU in Figure 7-11. If you select the tear-away icon (the perforated line at the top of the drop-down menu, which is circled in Figure 7-11), the xperf and xcxperf utilities create a tear-away (standalone) color legend for the graph. You can move the legend next to the appropriate graph for visual correlation.
Table 7-3 xperf (Enhanced) Menu Options (continued)

Option              Description
System Information  Opens a dialog box that displays system information, as shown in Figure 7-12 (page 108).
Instructions        Opens a dialog box that enables you to display cycles per instruction instead of instructions per cycle.
I/O                 Enables you to display the I/O data in terms of bandwidth (Mb/s) or utilization. (Supported only on Itanium processors.)
Starting an HPCPI Label from xperf You can start an HPCPI label and collect data for that label from the xperf utility. An HPCPI label enables you to analyze a time interval of an application or system. To start an HPCPI label from xperf, select HPCPI→Start Label from the menu at the top of the display. When you start an HPCPI label from the xperf utility, the label applies to all processes on the system.
Recording, Replaying, and Plotting xperf and xcxperf Data You can save the data from the xperf or xcxperf utility in a file. The utilities update data for each monitored node every second. You can use this data file either to replay the data or to plot graphs of node performance statistics. The procedures and options (-output, -plot) for recording, replaying, and plotting data are the same as the procedures used with xclus and xcxclus (“Recording, Replaying, and Plotting xclus and xcxclus Data” (page 89)).
Displaying System Information with xperf or xcxperf If you select Options→System Information from the menu at the top of the display, xperf or xcxperf opens a display window with system information. Figure 7-12 shows an xcxperf system information window.
Viewing Generic Data with xclus or xperf By default, the xclus and xperf utilities display enhanced data. You can force xclus and xperf to display generic data by specifying the -generic option. For example, the following command starts xclus so it displays generic data: % xclus -generic Starting xclus with the -generic option causes it to display the same data that xcxclus displays.
Viewing Enhanced Data with xcxclus or xcxperf By default, the xcxclus and xcxperf utilities display generic data. You can force xcxclus and xcxperf to display enhanced data by specifying the -enhanced and either the -apmond or -clusmond option. For example, the following command starts xcxclus so it displays enhanced data: % xcxclus -enhanced -apmond Starting xcxclus with the -enhanced option causes it to display the same data that xclus displays.
Xtools Daemons Xtools use the following daemons: • apmond and clusmond The apmond and clusmond daemons are included with the Xtools software and collect enhanced statistics. The Xtools start these daemons when you run xclus or xperf using default parameters if they are not already running. The apmond daemon collects processor-specific data for an individual node and runs on each node being monitored. Only one instance of apmond is needed per node.
A Product Specifications

This appendix contains product specifications.

HPCPI Database Directories and Files

The database root directory contains the following items:
• A subdirectory for each epoch.
Figure A-1 HPCPI Database (a directory tree: epoch subdirectories such as 200802141532, 200802141712, and 200802141744 under $HPCPIDB; node subdirectories such as node1, node2, and node3 under each epoch; and profile files such as App12.ebadcb63fb63e830_myLabel_5 and sum.0479a583cd891014_3 under the node subdirectories)

The HPCPIDB environment variable is set to /tmp/hpcpidb. One of the epochs started on February 14, 2008 at 17:12 GMT on the system node2 contains the following profile file with data for sum:
/tmp/hpcpidb/200802141712/node2/sum.0479a583cd891014_3
3 The fully-qualified path name for the image file.
4 The epoch. See "HPCPI Database Directories and Files" (page 113) for the epoch name format.
5 The host name of the system on which hpcpid ran.
6 The starting virtual memory address for the loaded image.
...the number of samples recorded for the event multiplied by the sampling interval.
10 The number of samples recorded for the event.
11 The event name.
12 The sampling interval.
Multi-Issue Architectures

In multi-issue architectures (those that can execute more than one instruction per cycle), the interrupt handler associates only one instruction in a bundle with an event. The other instructions in the bundle have no associated events. This can skew the attribution of events to instructions.

Calls to exec()

If a process uses the exec() system call or its variants, HPCPI can attribute events to the wrong image and it is possible to get samples for unexecuted instructions.
B HPCPI Quick Reference This appendix contains quick reference information for basic HPCPI tasks.
Viewing HPCPI Data

Table B-3 Viewing HPCPI Data

To Perform this Task: Display per-image data
Use this Command: hpcpiprof
Reference: "Viewing Per-Image Data: hpcpiprof" (page 44)

To Perform this Task: Display per-procedure data
Use this Command: hpcpiprof image_name
Reference: "Viewing Per-Procedure Data: hpcpiprof image_name" (page 46)

To Perform this Task: Display per-instruction data
Use this Command: hpcpilist procedure_name image_name
Reference: "Viewing Per-Instruction Data: hpcpilist procedure_name image_name" (page 47)

To Perform this Task: Display the instructions with the highest event counts
Use this Command: hpcpitopcounts
Reference: "Listing the Instructions with the Highest Event Counts: hpcpitopcounts" (page 49)
C Xtools Quick Reference

This appendix contains quick reference information for Xtools.

xclus and xcxclus Tasks

This section contains quick reference information for basic xclus and xcxclus tasks.
Table C-2 Modifying xclus or xcxclus Displays (continued)

To Perform this Task: Show HyperTransport statistics for data packets only (useful for confirming data rates) instead of statistics for data and control packets
Use this Procedure: Select the menu option Options→HT All.vs.Data→Data.
Reference: "Showing HyperTransport Data Statistics or Data and Control Statistics" (page 86)

To Perform this Task: Change the refresh rate
Use this Procedure: Select the menu option Options→Refresh.
xperf and xcxperf Tasks This section contains quick reference information for basic xperf and xcxperf tasks.
Additional xperf and xcxperf Tasks Table C-6 Additional xperf and xcxperf Tasks To Perform this Task 122 Use this Procedure Reference Start an HPCPI label from Select the menu option HPCPI→Start Label. xperf “Starting an HPCPI Label from xperf” (page 106) Stop the HPCPI label Select the menu option HPCPI→Stop Label.
Glossary

active fraction
The fraction of time an event was active in the PMU. See also duty group.

duty group
A group of HPCPI events, used to multiplex the events monitored. If hpcpid is monitoring more events than the number of event counters available for the processor PMU, hpcpid places the events in duty groups and multiplexes (cycles through) the duty groups so that only the events in one duty group are monitored at any time.

enhanced statistics
Statistics that are processor-dependent.
RPM
Red Hat Package Manager. 1. A utility that is used for software package management on a Linux operating system, most notably to install and remove software packages. 2. A software package that is capable of being installed or removed with the RPM software package management utility.

SLURM
Simple Linux Utility for Resource Management. A set of commands for system resource management and job scheduling.