HP Caliper User Guide Release 5.2 March 2010 HP Part Number: 5969-7014 Published: March 2010 Edition: 5.
© Copyright 2010 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents
About This Document ..... 17
1 HP Caliper at a Glance ..... 23
What Is HP Caliper? ..... 23
What Does HP Caliper Run On?
Diagnostics View ..... 50
Help View ..... 50
Tips for Using Views ..... 51
Making Measurements
--advice-details ..... 72
--analysis-focus ..... 72
--branch-sampling-spec ..... 72
--bus-speed
--module-exclude ..... 86
--module-include ..... 87
--module-search-path ..... 87
--noinlines
Simplest Example ..... 106
More Typical Examples ..... 106
Explanation of Report Output ..... 107
How to Read an Advisor Report
HP Caliper Environment Variables ..... 136
9 Controlling the Content of Reports ..... 137
Layout of an HP Caliper Text or CSV Report ..... 137
Metrics You Can Use for Report Sorting and Cutoffs ..... 138
Module-Centric Reports
11 Producing a Sampled Call Stack Profile Analysis ..... 171
Running HP Caliper to Produce a Call Stack Profile ..... 171
Call Stack Profile Text Report Example for HP-UX ..... 172
Call Stack Profile Text Report Example for Linux ..... 182
Call Stack Profile Report Details
B Descriptions of Measurement Reports ..... 221
alat Measurement Report Description ..... 222
Example Command Line for Text Report ..... 222
Example Command Line for CSV Report ..... 222
alat Metrics Summed for Entire Run
Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems ..... 250
dcache Measurement Report Metrics ..... 252
Example dcache Reports ..... 255
Using the --dcache-data-profile Option to Produce a Data Summary
Example Command Line for Text Report ..... 276
Example Command Line for CSV Report ..... 276
icache Metrics Summed for Entire Run ..... 276
Metrics for Integrity Servers Itanium 2 Systems
cpubus Event Set ..... 309
Metrics Available from this Measurement ..... 309
cspec Event Set ..... 311
Metrics Available from this Measurement ..... 311
dispersal Event Set
List of Figures
1-1 HP Caliper Components (User Interfaces) ..... 24
2-1 fprof Measurement Report for matmul, with Default Report Output ..... 30
2-2 fprof Measurement Report for matmul, with IP Sample Counts for One Function
List of Tables
4-1 Available Measurements in Each Measurement Type ..... 59
8-1 Command-Line and Measurement Configuration File Syntax for -p Option ..... 129
8-2 Action Options Used with -p some
List of Examples
6-1 HP Caliper Advisor Report ..... 102
9-1 Example of a caliper merge Run ..... 151
9-2 Example of a caliper diff Run ..... 154
11-1 Sample cstack Report - Blocking Primitives Details
About This Document This document describes how to use HP Caliper to measure the performance of native applications running on HP-UX and Linux Integrity servers. NOTE: For the latest version of this document, go to the HP Caliper Web site at the following URL and click on Documentation in the Product Information box: http://hp.com/go/caliper This document is sometimes updated after a release. The document publication date appears on the title page.
• --latency-buckets
• --per-module-data
Deleted Information
None.
Document Organization
For information to help you get started, read these chapters:
• “HP Caliper at a Glance” (p. 23) provides an introduction to HP Caliper.
• “Getting Started with the HP Caliper Command-Line Interface” (p. 29) helps you get started using the HP Caliper command-line interface.
• “Getting Started with the HP Caliper GUI” (p. 43) introduces you to the HP Caliper graphical user interface (GUI).
• “HP Caliper Diagnostic and Warning Messages” (p. 215) describes some diagnostic and warning messages you might receive.
• “Descriptions of Measurement Reports” (p. 221) provides descriptions of reports produced for all of the HP Caliper measurements.
• “Event Set Descriptions for CPU Metrics” (p. 299) contains descriptions for the output of each event set available with the cpu measurement.
...  The preceding element can be repeated an arbitrary number of times.
|  Separates items in a list of choices.
Related Documents
The complete HP Caliper documentation set contains the following:
• HP Caliper Quick Start
• HP Caliper User Guide
• HP Caliper Advisor Rule Writer Guide
• HP Caliper for HP-UX Release Notes
• HP Caliper for Linux Release Notes
• HP Caliper Ktrace Features Guide
You can get more information about HP Caliper in these ways:
• An HP Caliper man page is provided.
• Using HP Caliper to analyze effective floating-point load latency
• Using HP Caliper with an application program to characterize the Itanium memory hierarchy
• Using HP Caliper to measure performance data related to translation lookaside buffers (TLBs)
You can also read these technical reports about the microarchitecture used in HP Integrity servers:
• Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization, Document Number 308065-001.
1 HP Caliper at a Glance
What Is HP Caliper?
HP Caliper is a general-purpose performance analysis tool for applications on HP-UX and Linux systems running on HP Integrity Servers. HP Caliper allows you to understand the performance and execution of your application and to identify ways to improve its run-time performance. HP Caliper works with any native Integrity Server application.
Figure 1-1 HP Caliper Components (User Interfaces)
[Figure labels: HP Caliper CLI; Application; Performance reports; HP Caliper; HP Caliper GUI (local); X11 server; HP Caliper GUI (remote); HP Caliper database(s); Integrity Server (HP-UX or Linux); X86 desktop (Windows or Linux)]
HP Caliper selectively measures the processes, threads, and load modules of your application.
This command uses a default measurement called scgprof to produce a sampled call graph profile of the program myprog. The result of the measurement (the output) is saved automatically in a database. The output database is called scgprof and is placed in a databases directory in your current directory unless you specify otherwise. By default, a text report describing the results of the measurement run is sent to stdout.
On Linux, HP Caliper has been primarily validated with code produced with gcc and g++, and minimally tested with code produced from the Intel icc compiler.
HP Caliper can generally be used on:
• Programs compiled using HP, Intel, and GNU compilers, with and without optimization.
• Assembly and Java programs.
• Stripped programs, and programs compiled with optimization or debug information or both. This includes support for both the +objdebug and +noobjdebug options.
• Disassembly listings (template types, symbolic branch targets, branch targets as function offsets, optionally marked branch target instructions) are provided.
• Support for command-line options from a file is provided.
• The ability to attach to and detach from running processes for certain measurements is provided.
• You can use the Advisor expert system to analyze applications, using either the default rules or rules you write yourself.
2 Getting Started with the HP Caliper Command-Line Interface
This chapter provides some example programs to show you how to get started using the HP Caliper command-line interface. The programs are chosen for illustration purposes and are not necessarily representative of programs you might actually want to analyze.
Example: Running fprof on a Short Program, with Default Output
HP Caliper provides many types of performance measurements.
Figure 2-1 fprof Measurement Report for matmul, with Default Report Output ================================================================================ HP Caliper 4.3.
Invocation:          ./matmul
Process ID:          6991 (started by Caliper)
Start time:          09:31:37 AM
End time:            09:31:38 AM
Termination Status:  1
Last modified:       May 5, 2007 at 09:09 AM
Memory model:        ILP32
Processor set:       default

Target Execution Time  10
  Real time:    0.428 seconds
  User time:    0.415 seconds
  System time:  0.008 seconds

Sampling Specification  11
  Number of samples:  1319
  Data sampled:       IP

Metrics Summed for Entire Run  12
----------------------------------------------
PLM Event Name U..
-----------------------------------------[/home/meagher/matmul.c] 34 ~16 > mata[i][j] = matb[i][j] = (float) rand() ; 1 ~28 > for (k = 0 ; k < INDEX ; k++) { 1244 ~29 > matres[i][j] = --------------------------------------------------2.88 [libc.so.1::rand, 0x4119f60, rand.c] 38 ~57 Function Totals -----------------------------------------[File not found: /ux/libsobj_i380em/libs/libc/shared_em_32/obj/../../../ ../../core/libs/libc/shared_em_32/../core/gen/rand.
14 15
• % Total IP Samples: Percent of the total IP samples attributable to a particular load module.
• Cumulat % of Total: Running sum of the percent of total IP samples accounted for by the particular load module and those listed above it.
• IP Samples: Total number of IP samples attributed to the particular load module.
Function Summary:
• % Total IP Samples: Percent of the total IP samples attributable to a particular function.
We will also specify an output file using the -o option: $ caliper fprof -o out.txt -r all In the resulting report output file (out.txt), you will find an fprof report that shows IP sample counts down to the instruction level. “fprof Measurement Report for matmul, with IP Sample Counts for One Function” (p. 34) shows a section of the report that contains IP sample counts for one function.
2 Shows total IP sample hits (1275) and starting line number (38) for the function. A tilde (~) preceding a line number (~38) indicates that the line number is approximate due to optimization.
3 Names the source file of function main.
4 Shows sample IP hits (41) for the statement at line 16 (~ indicates the line number is approximate due to optimization) of matmul.c. Sample IP hits for statements are shown in parentheses; sample IP hits for instructions are not. Source lines are preceded by >.
Sampled Measurements A sampled measurement measures your program's performance at regular intervals, based on CPU events, recording the current program location and selected performance metrics.
What to Look for in Using HP Caliper A useful approach for tuning performance is to start with easy-to-perform global measurements to identify likely areas for tuning. Then you can use more specific measurements on areas of concern.
measurement
    The name of a measurement that contains information about what you want HP Caliper to measure and report. For more information, see “HP Caliper Measurement Configuration Files” (p. 57).
caliper_options
    Parameters used to customize the performance analysis. For more information, see “HP Caliper Options” (p. 63).
program
    The name of the executable program you want HP Caliper to measure.
program_arguments
    Any number of arguments expected by your executable.
$ kill -s TERM caliper_process_ID
CAUTION: If HP Caliper is measuring processes using instrumentation on HP-UX (that is, generating a precise measurement), stopping HP Caliper will cause it to forcibly terminate all processes that are being measured. Any resources used by the processes, such as shared memory segments, temporary files, and so forth, will not be cleaned up.
For PMU-only measurements, HP Caliper will simply detach from the processes and stop. The processes will continue normally.
Using the HP Caliper Advisor
One way to get started using HP Caliper quickly is to use the HP Caliper Advisor. See “Using the HP Caliper Advisor” (p. 101).
Restrictions on Using HP Caliper
Some restrictions are:
• When HP Caliper detaches from a process, it can affect the I/O relationship between the user and that process. An example is an application that reads stdin from the user; e.g.
Additional HP Caliper Commands In addition to the caliper measurement command, there are three more HP Caliper commands you can use. For information about these commands, including required syntax, see the references below: • caliper info Displays reference information about the CPU counters or reports. See “How to Display Reference Information About CPU Counters or HP Caliper Report Types” (p. 133). • caliper report | merge | diff Creates a report from an HP Caliper database.
3 Getting Started with the HP Caliper GUI
In addition to the command-line interface, HP Caliper supports a full-featured, intuitive graphical user interface (GUI). This chapter describes how to get started using the GUI. For information on the command-line interface, see Chapter 2 (page 29).
What Is the HP Caliper GUI?
The GUI has the same underlying measurement technology and capabilities as the command-line interface. With the GUI, however, you can dynamically interact with HP Caliper.
Window Basics The main window of the HP Caliper GUI is divided up into views. A view is a tabbed page that groups related tasks. These views and their icons are: • Projects view • Collect view • Analyze view • Advisor view • Console view • Diagnostics view • Help view As is typical of most GUIs, the HP Caliper GUI lets you reconfigure, resize, and reposition all of the views to suit your needs.
Figure 3-1 HP Caliper GUI
Projects View
The Projects view helps you manage your performance information by storing it in projects. A project consists of one or more folders and can contain two types of information:
• Collection specifications
• Measurement runs
A project is stored in a directory called a workspace. A collection specification, represented by measurement.
from collection runs are stored in a folder in your project. This folder is called Saved Collection Specifications. A dataset is a set of performance data collected by a single HP Caliper measurement run. Examples of datasets are: Run Summary, Memory Usage, Process Tree, Histogram, Call Graph, and so forth. Each HP Caliper measurement run produces several datasets. These datasets are shown in the Projects view for each run.
Required fields are typically indicated by red labels, and the Start button (for starting the measurement) becomes active when you have provided all required information. The figure below shows the Collect view with the Target tab selected: Figure 3-3 Collect View Analyze View The Analyze view lets you explore the performance data you collect. When displayed, the Analyze view is located, by default, to the right of the Projects view, overlaying the Collect and Advisor views.
The name given to the Analyze view is that of the measurement run for which data is shown. You can open multiple Analyze views, one for each measurement run shown in the Projects view. Multiple types of data are shown in the Analyze view in tabbed pages with the following names: Process Tree, Memory Usage, CPU Metrics, CPU Events, Histogram, Call Graph, and Call Stack Graph. The figure below shows the Analyze view with the Histogram tab selected.
To open the Advisor view, click on the Generate Advice button or toolbar choice. This will analyze the collected data and produce advice output. The figure below shows the Advisor view:
Figure 3-5 Advisor View
Console View
The Console view displays any output your application writes to standard output and standard error streams. You can also use the Console view to provide any input your application expects to read from standard input.
Figure 3-6 Console View Diagnostics View The Diagnostics view contains any warning messages that HP Caliper might generate when measuring your application or retrieving its performance data for viewing. By default, this view overlays the Console view at the bottom of the GUI window. Any errors produced will appear in popup dialogs.
Figure 3-8 Help View
Tips for Using Views
All views have the following features:
• Each view has its own Maximize and Minimize buttons (top right), and many views have their own pull-down menus (also top right).
• Double-clicking a view's tab causes the view to take up the entire GUI window. Double-clicking a view's tab a second time returns it to its previous size and restores the previous GUI layout. This feature is particularly useful when viewing performance data.
• To move a view out of the main window, press and hold mouse button 1 over a view's tab, drag the view out of the main window, and release mouse button 1. You can return the view to the main window by again dragging the view (using the view's tab) back to the main window.
• You can retrieve views you have closed by selecting Window→Show View. To restore all views to their default state, select Window→Reset View Locations.
  - The measured application completes.
  - All the attached processes terminate.
  - The measurement duration you set on the Target page expires.
  - You select the Kill/Stop button. The application program being measured will be terminated immediately if you select the Kill button.
• When a measurement run completes, its performance data is automatically added to the current project within the Projects view.
Once you have collected at least one set of performance data, click on the Generate Advice button or toolbar choice. HP Caliper will analyze the data and open an Advisor view with its advice. Every time you make another collection run, re-run the analysis to get updated advice. Getting Help Several forms of online help are available in the GUI: • “Getting started” help Select Help→Help Contents and then choose Getting Started. • Dynamic/context help Select Help→Context-sensitive Help or use the F1 key.
• .metadata, containing information that the GUI needs as an Eclipse application
• Default_Project (native mode) or Default Project (client mode), which is a placeholder project folder
Native (Local) GUI
This is the simpler method. At a shell prompt on the Integrity server where the measurements will be made, enter:
$ caliper -g [--jre path]
This starts the GUI in its own X11 window.
Figure 3-9 Login Screen
To connect to the remote system:
1. Enter the appropriate information for the Integrity server you want to make measurements on and your username and password on that system.
2. Check the Save password box to save your password in a secure, encrypted form.
3. Make sure that the Caliper Path value is correct for your system.
4. By default, HP Caliper will execute the standard login scripts.
4 HP Caliper Measurement Configuration Files
Each run of HP Caliper uses a particular measurement, which you can specify on the command line. Each measurement corresponds to a particular measurement configuration file supplied by HP Caliper. The measurement configuration files contain variables that control the types of measurements performed and the content of the reports.
its thread's call stacks. See “Producing a Sampled Call Stack Profile Analysis” (page 171) and “cstack Measurement Report Description” (page 240). • cycles (Integrity Servers dual-core Itanium 2 processor and Itanium 9300 quad-core processor only) The cycles measurement measures and reports a flat profile of the instruction pointers (IPs). See “cycles Measurement Report Description ” (p. 244). • dcache The dcache measurement measures and reports sampled data cache metrics.
• scgprof The scgprof measurement measures and reports an (inexact) call graph profile, produced by sampling the PMU to determine function calls. See “Producing a Sampled Call Graph Profile Analysis” (p. 157) and “scgprof Measurement Report Description” (page 288). • traps The traps measurement collects and reports a profile of traps, interrupts, and faults. See “traps Measurement Report Description” (page 292).
The overview measurement multiplexes the fprof, dcache, and cstack measurements at a time interval of 1 second. The default switch interval of 1 second can be changed using the --switch-interval=SECONDS option. After each switch interval, Caliper will switch from one measurement to the next. On system-wide runs, Caliper will multiplex fprof and dcache measurements.
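The switching behavior described above can be pictured with a small Python sketch. This is only an illustration, not HP Caliper code: the function name `active_measurement`, the round-robin order, and the choice of fprof as the first measurement are assumptions made for the example.

```python
def active_measurement(elapsed_seconds, switch_interval=1.0, system_wide=False):
    """Return the measurement assumed to be active after elapsed_seconds.

    Assumption (not stated in the guide): measurements rotate round-robin
    in the order listed, starting with fprof.
    """
    if system_wide:
        # System-wide runs multiplex only fprof and dcache.
        measurements = ["fprof", "dcache"]
    else:
        measurements = ["fprof", "dcache", "cstack"]
    # Each measurement owns one slot of switch_interval seconds.
    slot = int(elapsed_seconds // switch_interval)
    return measurements[slot % len(measurements)]


if __name__ == "__main__":
    for t in (0.5, 1.5, 2.5, 3.5):
        print(t, active_measurement(t))
```

With a larger switch interval each measurement simply keeps its slot longer, so a short run may never reach the later measurements in the rotation.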
Simultaneous fprof Sampling on Multiple PMU Counters
Up until Caliper 5.1, the fprof measurement sampled the instruction pointer (IP) on only one counter (every 500,000 CPU cycles by default). As of Caliper 5.2, the fprof measurement is enhanced to support simultaneous IP sampling on multiple PMU events. The -s or --sampling-spec option can be used multiple times to specify the list of sampling events, based on which IP samples are to be collected and reported.
You are free to rename measurement configuration files. Specifying Option Values in Measurement Configuration Files You can specify options on the command line, in a measurement configuration file, or in the .caliperinit file. See “Multiple Ways to Specify HP Caliper Option Values” (p. 63). Using the Command Line to Override Measurement Configuration File Parameters You can use the HP Caliper command line to override parameters specified in measurement configuration files.
5 HP Caliper Options
This chapter describes basic information about options and presents them in alphabetical order. For a listing of the most commonly used options, see the HP Caliper Quick Start reference card.
Basic Information About Options
Options are used to customize the performance analysis. You can specify one or more options on the command line when you start HP Caliper. You can abbreviate options and their modifiers as long as the abbreviations are unambiguous.
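What "unambiguous" means for an abbreviation can be sketched in Python as a unique-prefix lookup. This is a minimal illustration, not HP Caliper's actual option parser: `resolve_option` and `KNOWN_OPTIONS` are hypothetical names, and the list holds only a few of the option names that appear in this guide.

```python
# A handful of option names taken from this guide; the real option set is larger.
KNOWN_OPTIONS = [
    "--advice-details",
    "--analysis-focus",
    "--module-exclude",
    "--module-include",
    "--module-search-path",
    "--noinlines",
]


def resolve_option(abbrev, known=KNOWN_OPTIONS):
    """Expand an abbreviated option to its full name if exactly one option matches."""
    if abbrev in known:
        return abbrev  # An exact match always wins.
    matches = [name for name in known if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown option: " + abbrev)
    raise ValueError("ambiguous option " + abbrev + ": " + ", ".join(matches))


if __name__ == "__main__":
    print(resolve_option("--module-e"))  # unique prefix of --module-exclude
    print(resolve_option("--no"))        # unique prefix of --noinlines
```

In this sketch `--module` would be rejected because it is a prefix of three different options.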
Examples Assume these command-line options: • -s 400000,10%,BRANCH_EVENT (also specified as --sampling-spec 400000,10%,BRANCH_EVENT) • --context-lines 10 You can replace them in the measurement configuration file or the .caliperinit file with these lines: • sampling_spec = "400000, 10%, BRANCH_EVENT" • context_lines = 10 Hierarchy for Processing an Option Value HP Caliper uses this sequential order to process an option value: 1. 2. 3. 4.
If you do not use the -d option, the database file is saved in the databases directory. Examples -d foo.db,unique --database bar.db,unique These options would result in database names such as the following: foo.db-14921 bar.db-288 For more information, see “How HP Caliper Saves Data in Databases” (p. 147). -e or --duration -e seconds Elapsed time in real-time seconds before detaching from a running process.
-h or -? -h or -? Displays the short version of the help text. For the long version of the help text, use -H or --help. -H or --help -H or --help Displays the long version of the help text. For the short version of the help text, use -h or -?. -m or --metrics This option has two forms: -m cpu_event and -m event_set -m cpu_event This form is used to specify one or more CPU events to measure: -m cpu_event[[:threshold=int][:privilege-level-mask=level]][,cpu_event[:...
threshold=int An integer value that specifies how HP Caliper counts events: • If the value is zero, HP Caliper counts all events. • If the value is greater than zero, HP Caliper counts only the CPU cycles in which the number of events is greater than or equal to the value you specify. The default value is zero. privilege-level-mask=level Determines the privilege level setting for a given counter. The default is user (counters are measured when your application runs in user space).
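The effect of the threshold modifier described above can be illustrated with a short Python sketch. It models the counting rule only and does not touch real PMU hardware; `counter_value` and the per-cycle event list are hypothetical names introduced for the example.

```python
def counter_value(events_per_cycle, threshold=0):
    """Model what a counter reports for a sequence of per-cycle event counts.

    threshold == 0: count every event.
    threshold  > 0: count only the cycles in which the number of events
                    is greater than or equal to the threshold.
    """
    if threshold == 0:
        return sum(events_per_cycle)
    return sum(1 for events in events_per_cycle if events >= threshold)


if __name__ == "__main__":
    cycles = [0, 1, 3, 2, 4, 0]                # events observed in six consecutive cycles
    print(counter_value(cycles))               # all events
    print(counter_value(cycles, threshold=3))  # cycles with three or more events
```

Note that the unit of the result changes with the threshold: with zero it is a number of events, and with a positive threshold it is a number of cycles.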
append Adds the report results to the end of an existing file that has the specified name. create Creates a file with the specified name and writes the report results to the file. Replaces any existing file with the specified name. Default value is create. When you generate multiprocess reports, you can specify whether results are combined in a single report file or in individual files by process: per-process Creates individual report files for each process with program name appended to each file.
-r for PMU Histogram Reports
-r [statement][:instruction][none][all]
Default value is -r statement.
statement
    Shows source statements (if source code exists, or line numbers otherwise) with associated performance data. Machine instructions are not shown. You can shorten this option to -rs.
instruction
    Shows machine instructions. Source statements are not shown. You can shorten this option to -ri.
none
    Prevents generation of the Function Details section. You can shorten this option to -rn.
• cpu For the cpu measurement, you can use this option in conjunction with the --event-defaults option to control how samples are taken. For more information, see “Performing CPU Metrics Analysis ” (p. 197). • cstack For more information, see Chapter 11 (page 171). period Sampling period in seconds or milliseconds or microseconds, measured in CPU cycles (for cpu) or real time (for cstack).
threshold=int An integer value that specifies how HP Caliper counts events: • If the value is zero, HP Caliper counts all events. • If the value is greater than zero, HP Caliper counts only the CPU cycles in which the number of events is greater than or equal to the value you specify. The default value is zero. privilege-level-mask=level Determines the privilege level setting for a given counter. By default, counters are measured when your application runs in user space (user).
-w Equivalent to one form of the option for system-wide measurement. The -w option is equivalent to --scope system,attr-mod, which is the default for --scope system. See “Using --scope system for System-Wide Measurements” (p. 94). --advice-classes Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 104). --advice-cutoff Used only with the caliper advise command. See “Command Line to Invoke the Advisor” (p. 104).
variation Specifies how much to vary the number of events between samples. The number or percentage of CPU_counter events that HP Caliper uses to vary the sampling rate. HP Caliper adds or subtracts this number from the interval to vary the sampling frequency. The default value is 5 percent. cpu_event Specifies CPU events to measure. The default value is BRANCH_EVENT (or ETB_EVENT on the Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processors).
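How a variation value perturbs the sampling interval can be sketched as follows. This is an assumption-laden illustration rather than HP Caliper's implementation: the helper names `parse_variation` and `next_interval` are hypothetical, and drawing the offset uniformly at random is an assumption, since the guide says only that the value is added to or subtracted from the interval.

```python
import random


def parse_variation(rate, variation):
    """Convert a variation such as "10%" or "2000" into an event count."""
    text = str(variation).strip()
    if text.endswith("%"):
        # Percentage of the sampling rate, for example 10% of 400000.
        return int(rate * float(text[:-1]) / 100.0)
    return int(text)  # Absolute number of events.


def next_interval(rate, variation="5%", rng=random):
    """Return one sampling interval: the rate plus or minus up to the variation.

    Assumption: the offset is uniform over [-spread, +spread].
    """
    spread = parse_variation(rate, variation)
    return rate + rng.randint(-spread, spread)


if __name__ == "__main__":
    rng = random.Random(0)  # Seeded so that repeated runs give the same intervals.
    print([next_interval(400000, "10%", rng) for _ in range(3)])
```

Varying the interval in this way helps keep samples from repeatedly landing on the same point of a loop whose period happens to divide the sampling rate.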
--bus-speed [--bus-speed int] Specifies the bus speed in MHz for the sysbus event set. If you specify the sysbus event set, you must use this option. For example: --bus-speed 200. --callpath-cutoff --callpath-cutoff percent_cutoff[,cum_percent_cutoff[,min_count]] Specifies cutoff values that limit the hot call paths reported in the Hot Call Paths section of the cstack, scgprof, and cgprof reports.
Default value is --context-lines 5 for source-only reports or --context-lines 0 for reports with disassembly. Specify all to report all source lines for reported functions. count_source Number of source lines to show count_disassembly Number of disassembly lines to show --cpu-aggregation --cpu-aggregation count Specifies how many samples will be aggregated into one sample. This option is valid only with the cpu measurement. Default value for count is 125 (125 samples will be aggregated into one sample).
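The aggregation controlled by --cpu-aggregation can be pictured with the Python sketch below. It is illustrative only: the guide does not say how samples are combined, so summing each group and keeping a final partial group are assumptions, and `aggregate_samples` is a hypothetical name.

```python
def aggregate_samples(samples, count=125):
    """Combine every `count` consecutive samples into one aggregated sample.

    Assumptions: values within a group are summed, and a final group with
    fewer than `count` samples is kept rather than dropped.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [sum(samples[i:i + count]) for i in range(0, len(samples), count)]


if __name__ == "__main__":
    raw = [1] * 250                     # 250 raw samples of one event each
    print(aggregate_samples(raw))       # two aggregated samples
    print(aggregate_samples([1, 2, 3, 4, 5], count=2))
```

A larger count gives fewer, coarser data points; a smaller count keeps more detail at the price of more data to store and report.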
NOTE: This option was formerly known as --cpu-metrics-details. The former option name is still accepted by HP Caliper, but will be removed in a future release. --csv-file --csv-file filename[append|create][,per-process|shared][,unique] Generates report output in Comma Separated Values (CSV) format. You can produce a CSV report for any HP Caliper measurement.
--database See “-d or --database” (p. 64). --dbrp --dbrp=DBRP_INDEX,PLM,ADDR_MATCH,ADDR_MASK,PROC_FLAGS Specifies the bits to program the PMU's data address range matching registers. Forces event monitoring to be constrained by data address ranges. DBRP_INDEX can be 0, 1, 2 or 3. It identifies one of the four Data Breakpoint Registers (DBRs) used to specify the desired address range. PLM specifies the privilege level setting. The privilege levels available are: "user", "kernel", and "all".
Previously available options [:processor][:run][:help] have been deprecated and have no effect on the report. --details Used only with the caliper info command. See “How to Display Reference Information About CPU Counters or HP Caliper Report Types” (p. 133). --detail-cutoff --detail-cutoff percent_cutoff[,cum_percent_cutoff[,min_count]] Specifies cutoff values that limit the functions reported in the Function Details section of a PMU histogram report.
min_count Sets the minimum number of functions to be displayed for all load modules. This is shown as minimum function entries on reports. Default value is zero.
Example
If you specify:
$ caliper fprof --detail-cutoff ,80 wordplay
The contents of the Function Details section is a list of functions containing:
• The functions that account for 80 percent of the total IP samples in the wordplay program.
--etb-walkback-cycles --etb-walkback-cycles integer Controls the number of cycles to walk back when iterating from the most recent execution trace buffer (ETB) entry to the oldest ETB entry. Use this option to change the way in which HP Caliper picks the instruction pointer (IP) samples from the 16 IP entries in an ETB sample. When iterating from the most recent entry, HP Caliper computes the cumulative elapsed cycles by adding up each entry's bubble cycles plus one cycle per entry.
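The cumulative cycle count described above can be sketched in Python as follows. This is only an illustration of the arithmetic (each entry contributes its bubble cycles plus one cycle); the function name and the input list are hypothetical, and the sketch does not model how HP Caliper then uses the totals together with the --etb-walkback-cycles value.

```python
from itertools import accumulate


def cumulative_elapsed_cycles(bubble_cycles):
    """Return the running cycle total for each ETB entry.

    bubble_cycles -- bubble cycles per ETB entry, ordered from the most
                     recent entry to the oldest (hypothetical input format)
    """
    # Each entry costs its bubble cycles plus one cycle for the entry itself.
    return list(accumulate(bubbles + 1 for bubbles in bubble_cycles))


if __name__ == "__main__":
    # Three entries with 0, 3, and 1 bubble cycles, most recent first.
    print(cumulative_elapsed_cycles([0, 3, 1]))
```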
• Number of cycles during which three or more FP_OPS_RETIRED events occurred, while executing in kernel space
• Number of cycles during which four or more NOPS_RETIRED events occurred, in user space
• Total number of NOPS_RETIRED at all levels
In addition, assume that you want the samples to be captured every time 10,000 cycles occur, during which two or more IA64_INST_RETIRED events occur, at all privilege levels.
The default depth is 32. --group-by --group-by executable | module | none Specifies how “matching” processes or modules should have their data combined in reports. (Matching processes or modules are processes or modules that have the same basename.) This option can be used anytime you use the caliper command to produce a report (including the default stdout report).
--hotpaths --hotpaths Specifies whether or not the hot call paths section should appear in callgraph measurement reports. The default value is True. This option is used only with the cgprof and scgprof reports. --ibrp --ibrp=IBRP_INDEX,PLM,ADDR_MATCH,ADDR_MASK Specifies the bits to program the PMU's instruction address range matching registers. Forces event monitoring to be constrained by instruction address ranges. IBRP_INDEX can be 0, 1, 2 or 3.
You can use this option with the following measurements: alat, branch, cgprof, dcache, dtlb, fprof, fcount, icache, itlb, and scgprof. For the cgprof and fcount measurements, this option should be specified during data collection as well as reporting, because the inline functions must be instrumented at collection time for the data to be available at report time. For the other measurements, this option needs to be specified only during reporting (though it can also be specified during data collection).
By default, HP Caliper uses this kernel file for symbol lookup and disassembly: • /stand/current/vmunix NOTE: On Linux, a default kernel path is not defined for a sampling level of kernel or all, so reports show only kernel module and function information for samples. To show disassembled instructions for kernel modules, use the --kernel-path option. To produce an uncompressed kernel image for HP Caliper to work with, do the following: TMP_FILE=path_to_use_for_kernel_image gunzip -c /boot/efi/...
only The PMU is enabled only during interrupt processing, and disabled during regular processing. The default is on for the cpu measurement, no matter which --scope option you use. For all the other measurements, the default is on if you use the --scope system option, and off if you use the --scope process option. This option is not available on Linux. The behavior on Linux is equivalent to --measure-on-interrupts on.
Specifies explicitly the load modules to be excluded from measurement. You can specify module names in several ways:
• As a simple file name (libapplib1.so) that matches libraries of this name in any directory
• As a full-path file name (/home/dev/libs/libapplib1.
NOTE: In --scope system measurements on HP-UX, HP Caliper cannot locate an executable or a shared library if it is invoked using a relative path. In addition, at certain times, executables and shared libraries cannot be located even if they are specified with complete paths. This problem is due to limitations in APIs provided to collect information about executables and shared libraries associated with a process on HP-UX.
PROC_FLAGS is a comma-separated list of "inv", "ign", "ibrp0", "ibrp1", "ibrp2" or "ibrp3". The values "inv" and "ign" set the "inv" and "ig_ad" bits of the opcode match register. These bits are cleared by default. The values "ibrp0", "ibrp1", "ibrp2" and "ibrp3" will clear the corresponding bits in the opcode match configuration register. These bits are set by default. --options-file See “-f or --options-file” (p. 65). --output-file See “-o or --output-file” (p. 67).
Displays per thread data with percentages calculated for samples collected for each thread or for the process. The default value is process. It is helpful if an application has distinct groups of threads that perform different tasks entirely, so displaying their IP statistics as a percentage of individual threads is more relevant.
--process-cutoff --process-cutoff percent_cutoff[,cum_percent_cutoff[,min_count]] Specifies a cutoff value that limits the processes reported in the Process Summary section of a PMU histogram report. This option is used in conjunction with the --sort-by metric option, which you can use to specify which metric you want to be used for sorting and cutoffs.
Example
If you specify:
$ caliper fprof --process-cutoff ,80,0 -w
The contents of the Process Summary section is a list of processes containing:
• The processes that account for 80 percent of the total IP samples of all the processes running in the system.
• Only those processes that each account for more than two percent of total samples. Because percent_cutoff was not specified, HP Caliper used the default value, 2 percent.
The default is --scope process. process The subject of measurements is processes, specifically the threads of execution that make up those processes. With --scope process, the default privilege level is user, but you can change this with the --event-defaults option. system The subject of measurements is user and kernel activity on all CPUs in the system, at any privilege level you choose. (You can specify the privilege level as user, kernel, or all with the --event-defaults option.
possible is to kernel modules. Therefore, this qualifier makes sense only if the privilege level is kernel or all. When the scope is system, the command-line arguments program and program_args should not be provided. pset pset_id[:pset_id:...] The subject of measurement is user and kernel activity on all CPUs belonging to the specified processor sets (psets). (You can change the privilege level using the --event-defaults option.) For example, --scope pset 0:1 measures all CPUs belonging to psets 0 and 1.
Limitations in Using --scope system
• You cannot use the --scope system option if another HP Caliper process is running on the system. Using the --scope system option also prevents another HP Caliper process from starting until the kernel measurement finishes.
• On HP-UX, you can use the --scope system option only when logged in as the root user. (On Linux, you do not need to be root user.
--source-path-map --source-path-map pathmap1[:pathmap2:...] Specifies the path map to use for finding source files used for reporting source statements. Applies to any PMU histogram report, which is the only kind of report that references source code. Path map entries are separated by a colon (:) and applied in order until HP Caliper finds a file match.
• Simple entries are prepended to file names.
• You can provide substitute paths by using comma-separated entries.
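The two entry forms described above can be illustrated with a short Python sketch. This is a simplified model, not HP Caliper's implementation: the function names are hypothetical, a comma-separated entry is assumed to mean "replace the first path prefix with the second", and a simple entry is assumed to be prepended to the recorded file name. Any other entry forms are not modeled.

```python
import posixpath


def candidate_paths(path_map, source_file):
    """List the paths to try for source_file, in path-map order."""
    candidates = []
    for entry in path_map.split(":"):
        if not entry:
            continue
        if "," in entry:
            # Substitute form (assumed meaning): old_prefix,new_prefix
            old, new = entry.split(",", 1)
            if source_file.startswith(old):
                candidates.append(new + source_file[len(old):])
        else:
            # Simple form: prepend the entry to the recorded file name.
            candidates.append(posixpath.join(entry, source_file.lstrip("/")))
    return candidates


def resolve(path_map, source_file, exists):
    """Return the first candidate for which exists(path) is true, else None."""
    for candidate in candidate_paths(path_map, source_file):
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path_map = "/proj/src,/net/dogbert/proj/src:/home/wilson/work"
    for path in candidate_paths(path_map, "/proj/src/main.c"):
        print(path)
```

The exists argument is a callable (for example os.path.exists) so that the order-of-application rule can be exercised without touching the file system.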
HP Caliper stops reporting information when it reaches either a percentage cutoff or a cumulative percentage cutoff:
• You can limit the report only to functions that exceed a specified percentage of the total for the sorting/cutoff metric. Once HP Caliper encounters this percentage cutoff, it stops reporting functions.
• You can limit the report by having HP Caliper stop reporting functions once the cumulative percent of the functions so far listed exceeds the cumulative percentage cutoff value.
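The interaction of the two cutoffs can be sketched in Python. This is a minimal model under stated assumptions, not HP Caliper's code: the function name apply_cutoffs is hypothetical, entries below the percentage cutoff are assumed to end the listing, and min_count is assumed to guarantee that at least that many entries are shown when available. The sample counts in the usage example are the ones from the Process Summary example in this guide.

```python
def apply_cutoffs(entries, percent_cutoff=2.0, cum_percent_cutoff=100.0,
                  min_count=0):
    """Return the (name, percent, cumulative_percent) rows that would be listed.

    entries -- list of (name, sample_count) pairs
    """
    total = sum(count for _, count in entries)
    if total == 0:
        return []
    # Report the largest contributors first.
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    shown = []
    running = 0
    for name, count in ranked:
        percent = 100.0 * count / total
        below_cutoff = percent < percent_cutoff
        # Stop once the entries already listed exceed the cumulative cutoff.
        past_cumulative = 100.0 * running / total > cum_percent_cutoff
        if (below_cutoff or past_cumulative) and len(shown) >= min_count:
            break
        running += count
        shown.append((name, round(percent, 2),
                      round(100.0 * running / total, 2)))
    return shown


if __name__ == "__main__":
    samples = [("ld", 48), ("ecom", 38), ("aCC", 9), ("c++filt", 8)]
    for row in apply_cutoffs(samples, cum_percent_cutoff=80.0):
        print(row)
```

With a cumulative cutoff of 80 percent, the listing stops after ecom, because ld and ecom together account for about 83.5 percent of the samples.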
This option is necessary if you want HP Caliper to report system-specific latency buckets on Linux. If you do not use this option, a default set of latency buckets will be used. On HP-UX, HP Caliper automatically obtains the model number using the model command. For more information, see “dcache Measurement Report Description” (page 249). Example --system-model rx8640,1 --system-usage= --system-usage=[all][:runstatus][:cpu][:io][:syscalls] Controls the collection and reporting of system usage data.
Collect and report data per thread. all For a multithreaded program, the Function Summary and the Function Details sections of reports show information across threads in addition to the per-thread Function Summary and Function Details sections. sum-all Collect and report data summed across all threads. sum-all measures multithreaded applications as one entity. That is, HP Caliper produces a single report with the results of all threads aggregated together.
--user-regions --user-regions default|rum-sum For runs involving the PMU, specifies whether the data should be collected for the entire run (--user-regions default), or only in regions delimited by the PMU enable/disable instructions rum and sum. For more information, see “Restricting PMU Measurements to Specific Code Regions” (p. 211). --version See “-v or --version” (p. 71).
6 Using the HP Caliper Advisor This chapter introduces you to the HP Caliper Advisor and provides some example programs to show you how to get started using the Advisor from the command line. For information on how to use the Advisor in the HP Caliper graphical user interface (GUI), see Chapter 7 (page 113). For details about how to write rules for the Advisor, see the HP Caliper Advisor Rule Writer Guide.
Example 6-1 HP Caliper Advisor Report
===========================================================================
HP Caliper 4.3.0 Advisor Report for my_app
===========================================================================
Analysis Focus
  Executable:       /tmp/my_app
  Last modified:    August 15, 2004 at 03:10 PM
  Processor type:   Itanium2 9M
  Processor speed:  1599 MHz
  OS version:       HP-UX 11.23
Performance Databases
  /home/me/.hp_caliper_databases/cpu - March 23, 2005 at 11:17 AM
  /home/me/.
Figure 6-1 Steps in Using the Advisor
[Flowchart. Boxes: Start; Build application; One or more HP Caliper performance runs; HP Caliper Advisor; Gain better understanding of application performance; End. Feedback paths: Make suggested changes; Make suggested performance runs.]
To use the HP Caliper Advisor, you perform these steps:
1. Build the application with an initial set of compiler/linker options.
• Change the application code and/or build options and start over at step 1 with the revised application and a new set of performance measurements and databases.
or:
• Make new performance measurements in step 2, saving the data to a new database, and rerun the Advisor.
5. Continue making adjustments as long as you keep receiving meaningful data.
separated by colons (:). Advice messages don’t necessarily contain all of these portions. The default is all. --analysis-focus Specifies which application object(s) to analyze and report on. Currently, only executable programs can be analyzed, so specifying the default focus type of executable is optional. The object name can be all or a specific, simple executable name such as my_app. One or more executables can be specified with this option, separated by commas (,). The default is executable:all.
Getting Started with the Advisor: Examples
To run the Advisor, you need to make one or more HP Caliper measurement runs on an application.
Simplest Example
Assume that you have made these data collection runs:
$ caliper cpu my_app
$ caliper fprof my_app
$ caliper ecount my_app
The output databases are saved in the databases directory by default and are named cpu, fprof, and ecount.
$ caliper dcache my_app
Then, re-run the Advisor to analyze the full set of performance data and produce a more comprehensive analysis report:
$ caliper advise
If any suggested changes are made to the application, then you can measure and analyze the revised program:
$ caliper cpu my_new_app
or:
$ caliper ecount my_new_app
followed by:
$ caliper fprof my_new_app
$ caliper dcache my_new_app
Then, run the Advisor on the composite performance data:
$ caliper advise
Explanation of Report Output
Figure 6-2
Figure 6-2 HP Caliper Advisor Report, with Annotations
===========================================================================
HP Caliper 4.3.0 Advisor Report for my_app
===========================================================================
Analysis Focus 1
  Executable:       /tmp/my_app
  Last modified:    August 15, 2004 at 03:10 PM
  Processor type:   Itanium2 9M
  Processor speed:  1599 MHz
  OS version:       HP-UX 11.23
Performance Databases 2
  /home/me/.hp_caliper_databases/cpu - March 23, 2005 at 11:17 AM
  /home/me/.
This was run on an HP-UX 11i V2 September 2004 OE system. Reports run on other systems look similar, except that the specific advice given is unique to the application and the system. How to Read an Advisor Report Each Advisor run analyzes one or more application objects. (Currently, only executable objects can be analyzed.) A separate report is output for each object analyzed. The reports are in alphabetic name order. See “HP Caliper Advisor Report, with Annotations” (p. 108) for an example report.
The numbers (which are bold in the PDF version of this guide) are annotations to explain the report—they are not part of the output you receive. See the list at the end of the report for the explanations. ------------------------------------------------------------------------------Index Class Analysis ------------------------------------------------------------------------------23.9 cpu Function profile 1 [cpu_fprof_1] 2 The percentage of ITLB misses (16.6%) is higher than normal.
• Make all of the performance runs on the same type of system. Performance data from runs made on incompatible systems will be ignored.
• Save each HP Caliper performance run in a separate database. This makes it easier later to mix-and-match databases for analysis.
• When you re-run the Advisor, be sure to list all of the available, current databases: the initial ones and all additional ones. This gives the Advisor the most performance data to work with.
3. All the databases are scanned for executable objects to analyze that match the selection criteria. By default, all objects with performance data are analyzed. Alternatively, you can use the --analysis-focus option or the analysis_focus variable in the .caliperinit file to choose specific objects to analyze. If multiple versions of the same object exist, the one with the most recent modification date/time is chosen and the older versions are ignored.
7 Using the HP Caliper Advisor in the GUI This chapter describes how to use the HP Caliper Advisor in the HP Caliper graphical user interface (GUI). It assumes that you have some familiarity with the Advisor. For information about the HP Caliper Advisor, see Chapter 6 (page 101). For information about the HP Caliper graphical user interface (GUI), see Chapter 3 (page 43).
Figure 7-1 HP Caliper GUI In this screen shot of the GUI, you can see that three measurement runs have already been made: two in the Before Changes project (a CPU Cycles Run and a Data Cache Misses Run) and one in the After Changes project (a CPU Cycles Run). The application being measured is the HP C/C++ compiler, compiling the “Hello World” program. The application consists of three processes: cc, ecom, and ld.
Note that these are default measurement runs. That is, no special measurement options were used. The user entered the compile command to be measured on the Target page of the Collect view, selected the measurement to make on the Measurement page, and clicked the Start button. The requested measurement run was made and an index of the new datasets available was added to the list in the Projects view.
Figure 7-2 Projects View, with a Single Project Selected Figure 7-3 shows the Projects view, with a single measurement run, Data Cache Misses Run, selected. Every dataset in the run is also selected.
Figure 7-3 Projects View, with a Single Measurement Run Selected Generating Advice The easiest step is getting the HP Caliper Advisor to analyze the selected performance data and generate advice. Figure 7-4 shows the GUI toolbar. The square icon with a blue checkmark inside means check the performance data. If you “hover” over the icon, the popup tooltip says Generate Advice. Simply click on the icon.
Figure 7-4 HP Caliper GUI Toolbar Figure 7-5 shows the Advisor menu, which has two actions: Generate Advice and Show Advisor View. Figure 7-5 HP Caliper GUI Advisor Menu Generate Advice does the same thing as the toolbar icon: generate new advice from the selected performance data and display it in an Advisor view. Show Advisor View brings up the Advisor view with the advice from the last analysis run. You can use this option to retrieve the Advisor view if you previously closed it.
Not every identified performance problem will include all three types of advice. The advice you receive depends on the actual situation, the type of problem, and the data currently available. Figure 7-6 shows an example Advisor report in the GUI. Figure 7-6 Advisor Report in the HP Caliper GUI The individual (potential) performance issues are separated by horizontal lines.
headings Executable, Index, or Class to make that item the sort key. The down arrow on a column heading indicates the sorted column. When measurement advice is given, it includes the pages and values to change within the Collect view. For example, consider this line: Measurement / Measurement = Data cache misses Sampling / Rate = 10 This advice tells you to select the Data cache misses measurement on the Collect view's Measurement page and to set the Rate field on the Sampling page to 10.
8 Configuring HP Caliper HP Caliper gives you multiple methods for configuring how HP Caliper collects data and reports results. Specifying Option Values with a .caliperinit Initialization File If you have an initialization file (called .caliperinit), HP Caliper automatically uses it at startup for data collection or data reporting runs. Putting the options in an initialization file simplifies the command line you use. This file is not required, but can be useful.
Figure 8-1 .caliperinit File
********************************************************************
#Options applied to all report types.
application = 'myapp'
arguments = '-myarg 2'
context_lines = 0,3
summary_cutoff = 1
detail_cutoff = 5
source_path_map = '/proj/src,/net/dogbert/proj/src:/home/wilson/work'
#Report-specific options.
• suppress_statement_data = True|False If True, no statement-level data will be reported. (Default: False.)
• use_parens_for_statement_data = True|False If True, statement-level data in reports is placed in parentheses. (Default: True.)
Configuring Data Collection
HP Caliper gives you flexible control over the data you collect from your program. The types of control you have include:
• Particular CPU events to measure. See “Specifying Which CPU Events to Measure” (p. 123).
NOTE: You can specify a maximum of four events at a time, or 12 on Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processors. Depending on how the PMU registers are used, you might be limited to fewer events on any of the Integrity servers.
Shortening CPU Event Names
HP Caliper allows you to shorten CPU event names:
• You can truncate the name to the smallest number of characters that uniquely identifies the event, such as CPU_CY to represent CPU_CYCLES.
Module inclusion or exclusion does not affect the measurements made through the PMU. Counting and sampling still occur when code is executed in excluded modules. However, the reports show condensed data for those modules.
module-exclude
• uld.so
• dld.so
• libsin.so
You cannot override the settings for uld.so, dld.so, and libsin.so.
How to Specify Load Module Names
HP Caliper matches load module names in the following way:
• If you provide a full path for the module name, only an exact match succeeds.
• To imply all modules within a directory and its subdirectories, you provide a directory name with a trailing slash (/).
• If you do not specify the full path, only the base file name is used to match module names.
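The three matching rules above can be sketched in Python. This is an illustrative model only: the function name module_matches is hypothetical, specifications that start with a slash are treated as full paths, and base names are compared for exact equality, which may be stricter than what HP Caliper actually does.

```python
import posixpath


def module_matches(spec, module_path):
    """Return True if a load module path matches a module specification.

    spec        -- name as given on the command line
    module_path -- full path of a load module seen during the run
    """
    if spec.endswith("/"):
        # Directory form: everything in the directory and its subdirectories.
        return module_path.startswith(spec)
    if spec.startswith("/"):
        # Full path: only an exact match succeeds.
        return spec == module_path
    # Otherwise only the base file name takes part in the comparison.
    return posixpath.basename(spec) == posixpath.basename(module_path)


if __name__ == "__main__":
    print(module_matches("/usr/lib/", "/usr/lib/hpux32/libc.so.1"))
    print(module_matches("libapplib1.so", "/home/dev/libs/libapplib1.so"))
    print(module_matches("/home/dev/libs/libapplib1.so", "/opt/libapplib1.so"))
```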
If you specify --module-default none, then HP Caliper uses only files specified in the --module-include list. For example, if you only want to include libc in a measurement, you would use these options: --module-default none --module-include libc Controlling Granularity of Data Collection and Reports You can control the granularity of data collection and reports. If you want finer granularity (that is, more samples), use the -s option to lower the number of events between samples.
produces these files:
COUT contains an overview of the entire collection run
COUT.cc contains report data for all cc processes
COUT.ecom contains report data for the ecom process
COUT.ld contains report data for the ld process
HP Caliper can measure shell script files. By default, HP Caliper measures the shell program and the programs that the script invokes. For Fortran MPI programs, by default, HP Caliper measures the mpirun controlling process and the real application.
Table 8-1 Command-Line and Measurement Configuration File Syntax for -p Option Command-Line Syntax Measurement Configuration File Syntax • • • • • • • • • • • • • -p default -p root -p root-forks -p all -p custom -p custom:function_name -p [some:][(opt1[,opt2,...])]pattern process="default" process="root" process="root-forks" process="all" process="custom:function_name" process="[some:][(opt1,[opt2,...
For more information, see “Using -p some ” (p. 130). If you specify multiple -p options, the last one takes precedence. Using -p some The syntax for -p some is the most complex. -p [some:][(opt1[,opt2,...
Table 8-4 Name Source Options Used with -p some
Option          Description
file            Default. The name is the executable base name of the process.
arg0 or argv0   The name is argument 0 of the process.
arg1 or argv1   The name is argument 1 of the process or "" (empty string) if there is no such argument.
The last option specified takes precedence.
Table 8-5 Process Origin Options Used with -p some
Option          Description
root            Denotes the initial root process.
$ cat /path/ls.sh
#!/bin/sh
ls
$ caliper ecount -p "(arg1)*ls.sh" /path/ls.sh
Reports information for: /bin/sh /path/ls.sh, but does not report ls.
• To select based on argv0:
$ caliper ecount -p "(arg0)ora*" ./sqlplus
Reports information on processes having ora as a prefix of their argument 0.
• To detach from unimportant processes:
$ caliper ecount -p "*" -p (ignore)basename ...
HP Caliper does not track or measure children of the detached process.
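The name-source examples above can be modeled with a short Python sketch. It is a simplified illustration, not HP Caliper's matcher: all function names are hypothetical, patterns are assumed to behave like shell-style wildcards (matched here with Python's fnmatch), and options other than the name-source options (file, arg0/argv0, arg1/argv1) are ignored.

```python
import fnmatch
import posixpath
import re

NAME_SOURCES = ("file", "arg0", "argv0", "arg1", "argv1")


def parse_spec(spec):
    """Split '[some:][(opt1,opt2,...)]pattern' into (options, pattern)."""
    if spec.startswith("some:"):
        spec = spec[len("some:"):]
    match = re.match(r"^\(([^)]*)\)(.*)$", spec)
    if match:
        options = [o.strip() for o in match.group(1).split(",") if o.strip()]
        return options, match.group(2)
    return [], spec


def process_name(options, executable, argv):
    """Pick the name to match; the last name-source option wins."""
    source = "file"
    for option in options:
        if option in NAME_SOURCES:
            source = option
    if source == "file":
        return posixpath.basename(executable)
    index = 0 if source in ("arg0", "argv0") else 1
    # A missing argument is treated as the empty string.
    return argv[index] if index < len(argv) else ""


def process_matches(spec, executable, argv):
    options, pattern = parse_spec(spec)
    return fnmatch.fnmatchcase(process_name(options, executable, argv), pattern)


if __name__ == "__main__":
    # The shell running the script has /path/ls.sh as argument 1.
    print(process_matches("(arg1)*ls.sh", "/bin/sh", ["/bin/sh", "/path/ls.sh"]))
    # The ls child has no argument 1, so it is not selected.
    print(process_matches("(arg1)*ls.sh", "/bin/ls", ["ls"]))
```

Under these assumptions the sketch reproduces the behavior described for the ls.sh example: the shell process is selected and the ls child is not.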
• cstack
• dcache
• dtlb
• ecount
• fprof
• icache
• itlb
• pmu_trace
• scgprof
• traps
NOTE: For information about attaching to a running process for precise measurements, see “Attaching to a Running Process to Perform Precise Measurements” (p. 207).
To attach to a process, you must specify the process ID (PID). The syntax is:
caliper measurement [options] pid[ pid][ ...]
The process IDs should be placed at the end of the command line. Do not specify a target program name.
Figure 8-3 “Example Output of the caliper info Command” shows some example output of the caliper info command. In this case, the following command is specified: $ caliper info cycle Figure 8-3 Example Output of the caliper info Command $ caliper info cycle ================================================================================ HP Caliper 4.3.
The output of this option comes from two text files in the HP Caliper directory. See “Specifying Which CPU Events to Measure” (p. 123). -d or --details -d all|[name][:title][:category][:description] Specifies which information fields to include in CPU counter reports. This can be any combination of name, title, category, or description separated by colons or all. The default value is name:title.
This is the same as: $ caliper info -c L1D because --cpu-counter is assumed unless you specify the mutually exclusive --report option. To get all of the descriptive information on the BACK_END_BUBBLE.ALL processor event, use: $ caliper info -d all back_end_bubble.all To get information on the branch report, use: $ caliper info -r branch HP Caliper Environment Variables HP Caliper uses environment variables to control certain default settings.
9 Controlling the Content of Reports HP Caliper allows you to control the content of reports based on the data collected. Processor Information, Run Information, and Sampling Specifications are present by default in all collection run reports. Layout of an HP Caliper Text or CSV Report HP Caliper uses a consistent layout for the sections in all of the measurement reports produced for text or CSV output.
— Event Counts for ecount — Function Count Details for fcount — Source Directory Summary, Source File Summary, and other function information for fcover — Call Graph and Function Indexes for scgprof and cgprof — Hot Call Paths, Call Graph, and Function Indexes for cstack • Blocking Primitives Summary — Hot Call Paths, Call Graph, and Function Indexes for cstack • Report Help — A description of how to get help in understanding the report • Diagnostic Messages (possibly) See Table 9-1.
For most reports, you can control which metric to use for sorting and cutoffs. Specify the metric with the --sort-by metric option. Specify the amount of data with the --detail-cutoff and the --summary-cutoff options. These two options have parameters that let you specify how much of the Function Details section and how much of the Function Summary section should be displayed in the report.
Table 9-2 Available Metrics for Report Sorting and Cutoffs (continued) Report Name Notes Available Metrics fcount (HP-UX only) Sorting only • call-count (Default) fcover (HP-UX only) • • • • • • icache • avg-latency • latency (default) • sampled-misses itlb • hpw-fills • sampled-misses (default) • soft-fills scgprof • • • • traps Default by first trap address name reached_count reached_percent unreached_count unreached_percent (default) call-count msecs-per-call samples (default) seconds •
----------------------------------------------
Metrics Summed for Entire Run
----------------------------------------------
                          PLM
Event Name                U..K  TH       Count
----------------------------------------------
CPU_CYCLES                x___   0    79328752
BACK_END_BUBBLE.ALL       x___   0    35578366
BE_EXE_BUBBLE.
% Total  Cumulat
     IP     % of        IP
Samples    Total   Samples   Process
-------------------------------------------
  46.60    46.60        48   ld (pid: 19386)
  36.89    83.50        38   ecom (pid: 19385)
   8.74    92.23         9   aCC (4 instances)
   7.77   100.00         8   c++filt (pid: 19387)
-------------------------------------------
[Minimum process entries: 5, percent cutoff: 2.00, cumulative percent cutoff: 100.
Disassembly Listing A PMU histogram report gives a disassembly listing, which is a source listing with disassembly code for the top performance bottlenecks. The disassembly listing is not provided by default. To see it, you must use the -r (--report-details) option. Specify -ri (instructions) or -ra (all). Figure 9-1 “Disassembly Listing Example” shows a disassembly listing in a report.
Branch Targets in Disassembly Listings By default, the symbols shown for branch targets in disassembly are limited to 30 characters. You can change the limit by setting the following variable in the measurement configuration file or the .caliperinit file: disasm_target_name_limit = limit Source Position Correlation In addition to printing address and function names, HP Caliper prints source position information when it is available and appropriate.
• In the list of included and excluded load modules near the top of reports, HP Caliper reports the run-time address range of each load module.
• For function start addresses, HP Caliper reports link-time addresses, the offset from the text base of the containing load module.
• HP Caliper reports each disassembled instruction address as an offset from a function start address.
• HP Caliper reports addresses with no known function name and no known load module as run-time addresses.
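The relationship between these address forms is simple arithmetic, sketched here in Python. The function name and the example addresses are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def describe_address(runtime_addr, text_base, function_start):
    """Convert a run-time address into the forms used in reports.

    runtime_addr   -- address observed while the program was running
    text_base      -- run-time text base of the containing load module
    function_start -- link-time start address of the containing function
    Returns (link_time_address, offset_within_function).
    """
    # Link-time address: offset from the text base of the load module.
    link_time = runtime_addr - text_base
    # Instruction address: offset from the start of the function.
    return link_time, link_time - function_start


if __name__ == "__main__":
    link_time, offset = describe_address(0x4001A58, 0x4000000, 0x1A40)
    print(hex(link_time), hex(offset))
```

In this made-up example, the run-time address 0x4001a58 in a module loaded at 0x4000000 corresponds to link-time address 0x1a58, which is 0x18 bytes into a function that starts at 0x1a40.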
Unknown Functions When HP Caliper cannot find the name of a function (because, for example, strip(1) was used to remove symbolic information), the report lists the name of the function as: unknown_0xaddr where addr is the starting address of the function. The report lists a function as unknown when HP Caliper is, for any reason, unable to resolve a function name in a shared library through import stubs.
Possible status outputs are:
• Not available on this system: The processor is capable of HyperThreading, but the operating system used does not support it. A newer operating system (for example, HP-UX 11.31) is needed.
• Disabled in firmware: The processor and operating system are capable of HyperThreading, but it was disabled by the firmware.
For more information, see “Creating Reports from Multiple Databases” (page 149). Names and Locations for the Databases If you do not use the -d option to specify the name of the output database, by default, HP Caliper saves the database in the databases directory in a directory with the same name as the measurement. For example, if you are doing an scgprof measurement run, by default the database is named scgprof and is placed in the databases directory.
Creating Reports from Multiple Databases
You can use the caliper report | merge | diff command to:
• Create a report from one or more databases
• Create a report that merges the data from two or more databases
• Create a report that differences the data collected in two databases
You can adjust any of the following aspects of the report:
• Volume of data reported (--summary-cutoff and --detail-cutoff and --process-cutoff)
• Metric used to sort the data (--sort-by)
• Which statistics to report (
and database(s) is one of these:
• [database ... ] (for caliper report)
• [database1 database2 ... ] (for caliper merge)
• database2 database1 (for caliper diff)
Using the caliper report Command to Create a Report from One or More Databases
Use caliper report to create a single output report from one or more databases. The syntax for this command is:
caliper report [report_options] [database ...]
You can specify multiple databases, either individually or by using wildcards.
Example 9-1 Example of a caliper merge Run ================================================================================ HP Caliper 4.3.
Processor speed:  1600 MHz
Virtual machine:  no
Run Information
Configuration:      /opt/caliper/config/fprof
Date:               January 05, 2007
Version:            HP Caliper - HP-UX Itanium Version 4.3.0
OS:                 HP-UX B.11.23 U ia64
Database:           /home/sujoys/db3
Measurement scope:  per-process
Sampling Specification
Sampling event:             CPU_CYCLES
Sampling period:            500000 events
Sampling period variation:  25000 (5.
Using the caliper diff Command to Difference Data Collected in Two Databases Use caliper diff to create a report that differences the data collected in two databases. In the report, the contributing collection runs are appended one after another. The syntax for this command is: caliper diff [report_options] database2 database1 HP Caliper will produce a report that shows the difference in data collected between matching processes in the two databases, but only for processes with the same measurement type.
Example 9-2 Example of a caliper diff Run ================================================================================ HP Caliper 4.3.
1. The measurement type of the second database listed in the command is the only one that will be reported in the diff report output. (In other words, the second database listed is assumed to be the “base” database.)
2. All matching processes (of the measurement type of the second database specifier) in each database specifier are merged together, with data in the first database specifier essentially subtracted from data in the second database specifier.
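The subtraction described above can be pictured with a small Python sketch. This is not how HP Caliper stores or reports its data; the per-function sample counts are invented for illustration, and treating a function that is missing from one database as zero samples is an assumption.

```python
def diff_samples(first, base):
    """Subtract the counts in `first` from the counts in `base`, per function.

    `base` stands for the second database listed on the command line,
    `first` for the first one. Missing functions count as zero (assumption).
    """
    result = {}
    for name in sorted(set(first) | set(base)):
        result[name] = base.get(name, 0) - first.get(name, 0)
    return result


if __name__ == "__main__":
    # Hypothetical sample counts keyed by function name.
    first = {"main": 10, "extract": 4}
    base = {"main": 8, "extract": 6, "uppercase": 3}
    for name, delta in diff_samples(first, base).items():
        print(f"{name:12s} {delta:+d}")
```

A positive value means the base database recorded more samples for that function than the first database did.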
10 Producing a Sampled Call Graph Profile Analysis HP Caliper can produce a sampled call graph profile report (using the scgprof measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. The call graph is produced by sampling the processor's performance monitoring unit (PMU) to determine function calls. The call graph is not exact, because it does not show every function call, but it is statistically useful. This chapter provides an overview.
Differences Between scgprof and cgprof
These are differences between scgprof and cgprof collection runs:
• scgprof collection runs have significantly less overhead than cgprof collection runs, because the call graphs are produced using sampling instead of instrumentation.
• The --duration option is supported for use with scgprof (but not cgprof).
• The --branch-sampling-spec option is supported for use with scgprof (but not cgprof).
Figure 10-1 Sampled Call Graph Text Report Example ================================================================================ HP Caliper A.4.3.
IP Sampling Specification Number of samples: 54 Data sampled: IP Branch Sampling Specification Number of samples: 2311 Data sampled: BTB Load Modules Included -------------------------------------------------------------------------Load Module Start Address End Address Full Path -------------------------------------------------------------------------dld.so 0x60000000c0011000 0x60000000c009d0c0 /usr/lib/hpux32/dld.so libc.so.1 0x60000000c013a000 0x60000000c03bf3e0 /usr/lib/hpux32/libc.so.
16 Function Totals -----------------------------------------0 0x0000:0 M addp4 :1 F nop.f :2 I nop.i 0x0010:0 M nop.m :1 F nop.f :2 I nop.i 1 0x0020:0 M cmp.eq.unc :1 M mov :2 I dep 0x0030:0 M and :1 M and :2 B (p15) br.ret.dpnt.few 1 0x0040:0 M alloc :1 M ld8 :2 I mov.i 0x0050:0 M sub :1 I czx1.r :2 I czx1.l 1 0x0060:0 M ld8.s :1 M adds :2 I shl 4 0x0070:0 M cmp.ltu.unc :1 M (p15) cmp.ne.unc :2 I (p14) cmp.ltu.unc 1 0x0080:0 M ld8.s :1 I mov :2 I czx1.l 0x0090:0 M (p13) sub :1 B (p11) br.cond.dpnt.
:1 I cmp.eq.unc p15=8,r37 :2 B (p15) br.wtop.dptk.few {self}+0x180;; 4 0x01a0:0 M chk.s.m r37,{self}+0x1f0 :1 M cmp.ltu.unc p15=3,r37 :2 I tbit.nz.unc p14=r37,1;; 0x01b0:0 M and r11=0x1,r37 :1 M add r17=r31,r37 :2 I (p15) shrp r34=r34,r34,32;; --------------------------------------------------14.81 [wordplay::main, 0x40018f0, wordplay.c] 8 ~184 Function Totals -----------------------------------------[/home/meagher/wordplay.
2 :2 ~5,0x2440:0 :1 :2 ~301 ~2,0x2450:0 :1 :2 B (p2) br.cond.dpnt.many {self}+0x3390;; M (p3) mov r47=1 M nop.m 0x0 B br.many {self}+0x2370;; *> while (i < (int) strlen(argv[iarg])) M ld4 r85=[r78] M ld8 r1=[r44] I mov b7=r75 ~ ~ ~ ~ ~ ~ ~ ~ ~499 *> strcpy (leftover, extract (initword, ubuffer)); ~5,0x2e50:0 M nop.m 0x0 :1 M nop.m 0x0 :2 B br.call.sptk.many b0=b7;; (1) ~500 *> if (leftover[0] == '0') continue; 1 ~5,0x2e60:0 M ld1 r8=[r43] :1 I mov r1=r48;; :2 I cmp4.eq.unc p6=48,r8 ~5,0x2e70:0 M nop.
/ux/libsobj_i380em/libs/libc/shared_em_32/obj/../../../../../core/libs/libc/shared_em_32/../core/gen/toupper.c] (4) 0 ~48 > ~1,0x0000:0 M addl r8=0xffffffffffffd8d8,r1;; :1 M ld4 r8=[r8] :2 I nop.i 0x0;; 2 ~1,0x0010:0 M shladdp4 r8=r32,2,r8;; :1 M ld4 r8=[r8] :2 I nop.i 0x0 ~1,0x0020:0 M nop.m 0x0 :1 M nop.m 0x0 :2 B br.ret.sptk.many b0;; --------------------------------------------------5.56 [wordplay::alphabetic, 0x4005b80, wordplay.
~ ~ ~ ~ ~ ~ ~ ~ (1) 1 ~1075 ~3,0x0110:0 :1 :2 *> ~1087 ~7,0x01c0:0 :1 :2 ~7,0x01d0:0 :1 :2 ~7,0x01e0:0 :1 :2 > M M B s1len = (int) strlen (s1p); nop.m 0x0 nop.m 0x0 br.call.sptk.many b0=b1;; if (*s2p == *s1p) ld1.s r22=[r15] nop.m 0x0 br.cond.dpnt.many {self}+0x290;; chk.s.m r22,{self}+0x440 cmp4.eq.unc p6=r18,r22 br.cond.dpnt.many {self}+0x280;; lfetch [r9],2 adds r16=2,r16 br.cloop.dptk.
:1 :2 M I cmp4.eq.unc addl p1,p2=r0,r36 r8=0xffffffffffffdf24,r1;; ~ ~ ~ ~ ~ ~ ~ ~ ~6912 > ~8,0x01a0:0 M (p6) ld4 r48=[r49] :1 I shladdp4 r8=r33,2,r9 :2 I chk.s.i r9,{self}+0x2f0;; (1) ~6947 > 1 ~8,0x01b0:0 M ld4 r11=[r8];; :1 M cmp4.ne.unc p6=r0,r11 :2 I nop.i 0x0 ~8,0x01c0:0 M nop.m 0x0 :1 M nop.m 0x0 :2 B (p6) br.cond.dptk.many {self}+0xe0;; --------------------------------------------------1.85 [(No source information) libc.so.
wordplay::main [2] *ROOT* [1] ---------------------------5.6 wordplay::extract [7] wordplay::main [2] *ROOT* [1] ---------------------------4.5 libc.so.1::strcpy [5] wordplay::extract [7] wordplay::main [2] *ROOT* [1] ---------------------------3.7 libc.so.1::_fgets [9] wordplay::main [2] *ROOT* [1] ---------------------------1.9 libc.so.1::memccpy [10] libc.so.1::_fgets [9] wordplay::main [2] *ROOT* [1] ---------------------------1.9 dld.so::LE_sym_name [11] dld.so::LL_best_fit [12] dld.
[6] 37.70 39 wordplay::alphabetic [6] 62.30 458/1478 31 libc.so.1::strlen [3] -----------------------------------------------------------------------100.00 44/44 100 wordplay::main [2] [7] 11.8 47.09 44 wordplay::extract [7] 38.12 51/189 27 libc.so.1::strcpy [5] 14.78 87/1478 6 libc.so.1::strlen [3] -----------------------------------------------------------------------100.00 392/392 100 wordplay::uppercase [4] [8] 7.4 100.00 392 libc.so.
0.00 1/1 100 libc.so.
The preceding lines in the entry describe the callers of this function. The lines following the primary line describe its subroutines (called children in the call graph). The entries are sorted by time spent in the function and its subroutines. Hot Call Paths Part of the Report This section reports the most probable hottest call paths. A call path represents a subset of the program's execution.
11 Producing a Sampled Call Stack Profile Analysis HP Caliper can produce a sampled call stack profile report (using the cstack measurement) from any compiled program. You do not need to compile your program in any special way to use this feature. HP Caliper periodically samples the application program counter and each of its threads' call stacks and then creates a call stack profile of the program's execution.
Call Stack Profile Text Report Example for HP-UX An example report for HP Caliper on HP-UX is shown here. This report is the result of this command line: $ /opt/caliper/bin/caliper cstack -o results.
Figure 11-1 Call Stack Profile Text Report Example ================================================================================ HP Caliper A.4.4.
Thread Summary (3 Threads) ---------------------------------------------------------------------------------------------------% Total Cumulat Sample Sample Kernel IP % of IP Hits Hits --- Sample Hits Waiting -Thread Samples Total Samples Running Blocked Spinning Blocked ID ---------------------------------------------------------------------------------------------------40.82 40.82 20 1 19 0 0 6065593@main 38.78 79.59 19 0 19 0 9 6065598@start_routine 20.41 100.
------------------------------------------------------------------------------------------------------------38.78 [(No source information) libpthread.so.1::___lwp_wait_sys, 0x4093be0] 19 0 19 0 0 Function Totals ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------18.37 [(No source information) libpthread.so.
libpthread.so.1::__vp_join [9] libpthread.so.1::pthread_join [12] enh_thr_mutex1::main [13] dld.so::main_opd_entry [6] ---------------------------------------------18.4 0.0 18.4 libpthread.so.1::__lwp_mutex_lock_sys [17] libpthread.so.1::_lwp_mutex_lock [15] libpthread.so.1::*unnamed@0x404(1670-5b70)* [14] libpthread.so.1::pthread_mutex_lock [16] enh_thr_mutex1::start_routine [3] libpthread.so.1::__pthread_bound_body [2] ---------------------------------------------2.0 2.0 0.0 enh_thr_mutex1::main [13] dld.
-------------------------------------------------------------------100.00 libpthread.so.1::pthread_mutex_lock [16] [14] 18.4 0.0 18.4 0.00 libpthread.so.1::*unnamed@0x404(1670-5b70)* [14] 100.00 libpthread.so.1::_lwp_mutex_lock [15] -------------------------------------------------------------------100.00 libpthread.so.1::*unnamed@0x404(1670-5b70)* [14] [15] 18.4 0.0 18.4 0.00 libpthread.so.1::_lwp_mutex_lock [15] 100.00 libpthread.so.
Instruction ------------------------------------------------------------------------------------------------------------38.78 [(No source information) libpthread.so.1::___lwp_wait_sys, 0x4093be0] 19 0 19 0 0 Function Totals ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------2.04 [enh_thr_mutex1::main, 0x4000d50, enh_thr_mutex1.
-------------------------------------------------------------------100.00 libpthread.so.1::_lwp_wait [4] [3] 95.0 0.0 95.0 100.00 libpthread.so.1::___lwp_wait_sys [3] -------------------------------------------------------------------100.00 libpthread.so.1::__vp_join [5] [4] 95.0 0.0 95.0 0.00 libpthread.so.1::_lwp_wait [4] 100.00 libpthread.so.1::___lwp_wait_sys [3] -------------------------------------------------------------------100.00 libpthread.so.1::pthread_join [6] [5] 95.0 0.0 95.0 0.00 libpthread.
10 0 10 0 0 Function Totals ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------18.37 [(No source information) libpthread.so.
100.00 enh_thr_mutex1::start_routine [1] 0.00 enh_thr_mutex1::foo [6] 100.00 libc.so.1::_sleep [4] -------------------------------------------------------------------100.00 libc.so.1::sigtimedwait [5] [7] 52.6 0.0 52.6 100.00 libc.so.1::__sigtimedwait_sys [7] -------------------------------------------------------------------100.00 enh_thr_mutex1::start_routine [1] [8] 47.4 0.0 47.4 0.00 libpthread.so.1::pthread_mutex_lock [8] 100.00 libpthread.so.
20.41 [(No source information) libc.so.1::__sigtimedwait_sys, 0x422ab40] 10 0 10 0 Function Totals 0 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------[Minimum function entries: 0, percent cutoff: 1.00, cumulative percent cutoff: 100.
Figure 11-2 Call Stack Profile Text Report Example for Linux ================================================================================ HP Caliper C.4.4.
IP % of Clock Hits Thread Samples Total Samples Waiting ID --------------------------------------------------------48.78 48.78 20 0 31021@main 26.83 75.61 11 1 31024@start_routine 24.39 100.00 10 0 31023@start_routine --------------------------------------------------------100.00 100.
[Minimum primitives: 10, Percent cutoff: 1.00, cumulative percent cutoff: 100.00] Hot Call Paths (All Threads) ---------------------------% Total Hits In Only Name ---------------------------48.8 *kernel gateway* [2] libc.so.6.1::__GC___libc_nanosleep [9] libc.so.6.1::sleep [10] enh_thr_mutex1::foo [11] enh_thr_mutex1::start_routine [3] libpthread.so.0::start_thread [4] libc.so.6.1::__clone2 [5] ---------------------------26.8 *kernel gateway* [2] libpthread.so.
100.00 libc.so.6.1::__libc_start_main [7] -----------------------------------------------100.00 libc.so.6.1::sleep [10] [9] 48.8 0.00 libc.so.6.1::__GC___libc_nanosleep [9] 100.00 *kernel gateway* [2] -----------------------------------------------100.00 enh_thr_mutex1::foo [11] [10] 48.8 0.00 libc.so.6.1::sleep [10] 100.00 libc.so.6.1::__GC___libc_nanosleep [9] -----------------------------------------------100.00 enh_thr_mutex1::start_routine [3] [11] 48.8 0.00 enh_thr_mutex1::foo [11] 100.00 libc.so.6.
---------------------------------------------------------[/home/bindu/TESTS/enh_thr_mutex1.
2 6 __libc_start_main pthread_join 1 3 main _start Load Module Summary (Thread 31024@start_routine) -------------------------------------------------------------% Total Cumulat WallSample IP % of Clock Hits Samples Total Samples Waiting Load Module -------------------------------------------------------------26.83 26.83 11 1 *kernel gateway* -------------------------------------------------------------26.83 26.
Under In Children Children -----------------------------------------------90.91 libc.so.6.1::__GC___libc_nanosleep [6] 9.09 libpthread.so.0::__lll_lock_wait [9] [1] 100.0 100.00 *kernel gateway* [1] -----------------------------------------------100.00 libpthread.so.0::start_thread [3] [2] 100.0 0.00 enh_thr_mutex1::start_routine [2] 90.91 enh_thr_mutex1::foo [8] 9.09 libpthread.so.0::pthread_mutex_lock [10] -----------------------------------------------100.00 libc.so.6.1::__clone2 [4] [3] 100.0 0.
Function Details (Thread 31023@start_routine) -----------------------------------------------------------------% Total WallSample Line| IP Clock Hits Slot| >Statement| Samples Samples Waiting Col,Offset Instruction -----------------------------------------------------------------24.
Call Stack Profile Report Details This section gives some information about the sampled call stack profile report.
Example 11-1 Sample cstack Report - Blocking Primitives Details Blocking Primitives Details (All Threads) -----------------------------------------------------------------------------------------------Sample Callpath Holder's % Total Sample Sample Sample Hits Index Kernel Hits Hits Hits Hits Blocking For Holder Holder Thread Waiting Waiting Spinning Blocked Primitive --For Waiter --Waiter ID -----------------------------------------------------------------------------------------------20.
is reported with a kernel thread ID suffixed with the name of the routine that the thread will execute once it is created. Hot Call Paths Part of the Report This section reports the hottest call paths. A call path represents a subset of the program's execution. The Hot Call Paths section reports the percentage of the program's real time that is attributed to specific call paths. You can use the --callpath-cutoff option to specify cutoff values that limit the hot call paths reported in this section.
• In programs that dynamically load and then unload shared libraries, the cstack measurement might not attribute measurement results to the appropriate shared library.
• The kernel does not allow HP Caliper to stop a thread in uninterruptible sleep state. Hence, if a process has one or more threads in uninterruptible sleep state, the call stack profile may be inaccurate.
Pstack-like Functionality
With HP Caliper 5.2, the cstack measurement can be used to generate a pstack-like report.
------------------------------------------------------------0 100.0 0.0 100.0 libc.so.1::__ksleep { cond_var@0x6000000016586e78 } libpthread.so.1::__mxn_sleep libpthread.so.1::*unnamed@0x400000000001(d320-ec30)* libpthread.so.1::pthread_cond_wait caliper::cal_mqueue_read caliper::pmm_slds_write_main libpthread.so.
12 Performing CPU Metrics Analysis HP Caliper can measure and report per-process or system-wide metrics based on sampled CPU events. This is enabled by the cpu measurement. Specify the events and sampling period with the -m event_set and -s period options, respectively. You can measure multiple metrics in the same run. For most applications, the cpu measurement is the first measurement you should take when you begin using HP Caliper. Run this command: $ caliper cpu -o cpu.
13 HP Caliper Features Specific to HP-UX These features are available only when using HP Caliper on the HP-UX operating system: • These measurements: — cgprof — cpu See “Performing CPU Metrics Analysis ” (p. 197). — fcount — fcover • These command-line options: — --bus-speed See “--bus-speed ” (p. 74). — --cpu-aggregation See “--cpu-aggregation ” (p. 75). — --cpu-details See “--cpu-details ” (p. 75). — --exclude-caliper See “--exclude-caliper ” (p. 81). — --exclude-idle See “--exclude-idle ” (p.
--memory-usage=all|[begin][:timed][:end][:PERIOD[s|m|h]] For example: $ caliper cpu -o REPORT --memory-usage=all my_app Use of this option causes two different sets of memory measurements to be taken, each reported in its own table in the report: • Overall memory available (and currently in use and free) on the system • Memory currently being consumed by the process(es) being measured by a particular HP Caliper run If the HP Caliper run is made on a ccNUMA system, then the memory usage of every “logical do
Examples of the --memory-usage= Option Some examples of the option follow: • --memory-usage=all Causes process memory usage to be measured at the beginning, at the end, and every 1 second of the process's execution. • --memory-usage=begin:end Causes process memory usage to be measured twice: at the beginning and at the end of the process's execution. • --memory-usage=timed:15 Causes process memory usage to be measured every 15 seconds of the process's execution.
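To make the option grammar concrete, here is a minimal Python sketch that parses a --memory-usage value of the form all|[begin][:timed][:end][:PERIOD[s|m|h]] into a small settings dictionary. It is not HP Caliper's parser. The dictionary keys are invented for illustration, and two behaviors are assumptions: a period with no unit suffix is read as seconds (consistent with the timed:15 example), and giving a period switches timed sampling on.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_memory_usage(spec):
    """Parse a --memory-usage value into a dict of sampling settings."""
    # The period defaults to 1 second, as stated for the timed samples.
    settings = {"begin": False, "timed": False, "end": False, "period_s": 1}
    if spec == "all":
        settings.update(begin=True, timed=True, end=True)
        return settings
    for token in spec.split(":"):
        if token in ("begin", "timed", "end"):
            settings[token] = True
            continue
        match = re.fullmatch(r"(\d+)([smh]?)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count, unit = match.groups()
        settings["period_s"] = int(count) * UNIT_SECONDS[unit or "s"]
        settings["timed"] = True  # assumption: a period implies timed sampling
    return settings


if __name__ == "__main__":
    for example in ("all", "begin:end", "timed:15"):
        print(example, parse_memory_usage(example))
```

With these rules, begin:end yields two measurement points and no periodic sampling, while timed:15 yields periodic sampling every 15 seconds.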
Figure 13-1 Example Memory Usage Report Output for an SMP System System Memory Configuration -----------------------------------------------------------------------Domain Physical # Used Free Total Id Id Type CPUs Pages Pages Pages -----------------------------------------------------------------------0 0 CLM 4 312922 731708 1044630 -----------------------------------------------------------------------Process Memory Usage ------------------------------------------------------------------------------Domain
Free Pages Current number of unused memory pages. Total Pages Sum of Used Pages and Free Pages. The Used Pages and Free Pages values are not very useful, because they reflect the total activity (kernel plus all user processes) on the system, not just the memory usage of the process(es) being measured by HP Caliper. The most important information in this table is the topology of the total memory available on the system.
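The relationship between the three page columns can be checked against the single domain row in the SMP example report (Used Pages 312922, Free Pages 731708, Total Pages 1044630). The short Python sketch below does that arithmetic; the function names are illustrative only.

```python
def total_pages(used_pages, free_pages):
    """Total Pages is the sum of Used Pages and Free Pages."""
    return used_pages + free_pages


def used_fraction(used_pages, free_pages):
    """Fraction of the domain's pages currently in use (0.0 to 1.0)."""
    return used_pages / total_pages(used_pages, free_pages)


if __name__ == "__main__":
    # Values taken from the domain 0 row of the SMP example report.
    used, free, reported_total = 312922, 731708, 1044630
    assert total_pages(used, free) == reported_total
    print(f"used fraction: {used_fraction(used, free):.4f}")
```

As the text notes, the used fraction reflects all activity on the system, not only the measured process.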
be a multiple of the requested sample period (--memory-usage=nnn), which defaults to 1 second. “Gaps” in the time sequence of snapshots indicate a stretch of time where the process's memory usage did not change. Domain Id System identification number of the logical domain. On ccNUMA systems, cell local memory domains are numbered starting at 1, and the interleaved memory domain Id is –1. On SMP systems, the only domain is numbered 0.
• Run status: how much time each process or thread spent running, eligible to run but not running, and waiting (runstatus parameter).
• cpu: how much of each process or thread running time was spent on each CPU and how often the process or thread was moved to another CPU (cpu parameter).
• io: gives a breakdown of all I/O a process or thread does, by category (io parameter).
Figure 13-2 Example System Usage Report Output System Usage - Run Status (All Threads) -------------------------------------------------------------------------------Relative -------- Time (thread secs) -------------- Percentage -------Time Running Eligible Waiting Running Eligible Waiting -------------------------------------------------------------------------------Overall 5.4534 0.0060 18.3617 22.89% 0.03% 77.
read 7 345.27 0.00000 0.00029 0.00200 0.00203 lwp_sema_post 3 49.32 0.00006 0.00016 0.00021 0.00047 lwp_sema_wait 3 147.97 0.00012 0.00013 0.00015 0.00040 mmap 25 1233.11 0.00000 0.00001 0.00015 0.00034 munmap 12 591.89 0.00001 0.00002 0.00002 0.00018 write 12 591.89 0.00000 0.00001 0.00010 0.00018 siginhibit 132 1627.70 0.00000 0.00000 0.00000 0.00015 sigenable 132 1627.70 0.00000 0.00000 0.00000 0.00012 pstat 1 49.32 0.00010 0.00010 0.00010 0.00010 lwp_cond_broadcast 6 73.99 0.00000 0.00001 0.00006 0.
You can attach to the process for these measurements:
• cgprof
• fcount
• fcover
To attach to a process, you must specify the process ID (PID). The syntax is:
caliper measurement [options] pid
For example:
$ caliper cgprof 7654
To perform precise measurements of a process:
1. Run chatr(1) with the +dbg enable option on the program you want to measure. For example:
$ chatr +dbg enable ./myprog
2. Run ./myprog and find the process ID of the process.
3. Specify the process you want to measure.
1. Modify your source code to add trigger statements to take the samples. Use the macros in include/caliper_control.h in the HP Caliper home directory. The macros are named CALIPER_PMU_TAKE_SAMPLE_n, with n varying from 1 to 8. The report shows the value of n next to each Sample Origin column. Figure 13-3 “Using Macros to Trigger PMU Samples” shows an example of how to use the macros. Figure 13-3 Using Macros to Trigger PMU Samples #include ....
Figure 13-4 Example of PMU Trace Report PMU Trace Buffer 1, Kernel Thread Id 2218765, Samples 1 - 170 [IIR:IA64_INST_RETIRED, NR: NOPS_RETIRED, CC: CPU_CYCLES] -----------------------------------------------------------------------------------CPU Event---CPU Event--CPU Event-- ------IP Samples-----Sample IIR NR CC Bundle Address Sample Number Count Count Count (module:function) Origin ---------------------------------------------------------------------------------1 1996102 357759 2030895 0x4001230 TRG(0x1)
While executing those instructions will not cause an application to crash in the absence of HP Caliper, they will still have an impact on performance. Executing a break instruction causes a trap to the breakpoint handler in the kernel. • The presence of trigger macros may disable some optimization that the compiler could perform. The trigger instructions are defined so that code will not be moved around them.
Reasons to use this feature include:
• Analyzing a particular loop or function. You can restrict measurements to a particular loop to get information such as:
ecount Number of events occurring in the loop
fprof Hot spots in the loop
branch Analysis of the loop branches
dcache Data cache misses in the loop
• Analyzing a particular phase in an application. For applications with important startup or shutdown phases, it is sometimes beneficial to limit measurements to the “in-between” phase.
3. Use the command-line option --user-regions rum-sum. (Or place user_regions="rum-sum" in a measurement configuration file.) This option causes HP Caliper to allow the measured applications to control the PMU. When you specify --user-regions rum-sum, the PMU is initially disabled, and HP Caliper will not measure the application until the first CALIPER_PMU_ENABLE() is executed. The default behavior is to disallow such control and measure the full run (--user-regions default).
A HP Caliper Diagnostic and Warning Messages This appendix describes some diagnostic and warning messages you might receive. HP Caliper always attempts to measure everything that you request. When this is not possible, however, HP Caliper gives you diagnostic or warning messages. You can usually safely ignore these messages. Several situations can cause these messages: • A sampled address is outside the measurement context. • A function contains specialized assembly code. • A function cannot be identified.
Figure A-1 “Mispredicted Branches Example ” shows example output for mispredicted branches.
Figure A-1 Mispredicted Branches Example Function Details ---------------------------------------------------------------------------------------------% Total Target Line| Taken of Branch Branch Taken NTaken % Slot| >Statement| Mispr Branch Taken NTaken Mispr Mispr Mispr Col,Offset Instruction ---------------------------------------------------------------------------------------------25.00 [libc.so.1::__thread_mutex_lock, 0x40000000002123a0, wrappers1.c] 2 2 0 1 0 50.
~1,0x0080:0 M_ adds :1 M ld8 :2 I nop.i M_ ld8 :1 M nop.m :2 I mov [bundle] ~1,0x00a0:0 M nop.m :1 M nop.m :2 B_ r9=8,r8 ;; r8=[r8] 0 ~1,0x0090:0 r1=[r9] ;; 0 b6=r8,.+0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0 1 0 1 0 100.00 0 0 0 0 0 0 1 1 br.call.dptk.many rp=b6 ;; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ~1,0x00b0:0 M adds I_ mov r1=0,r35 :1 rp=r34,.+0 ;; :2 I mov.i ar.
Figure A-2 Comments in the branch Measurement Configuration File # -----------------------------------------------# Criteria for limiting types of branches recorded # branch_taken_criterion # #VALUES: # Caliper_no_branches - do not record branches taken/not taken # NOTE: Caliper_no_branches results in no # branch traces # Caliper_not_taken_branches - record branches not taken # Caliper_taken_branches - record branches taken # Caliper_all_branch_outcomes - record both taken and not taken branches # branch_ta
scgprof Reports Require Kernel Patch On HP-UX 11i v2, if you run the scgprof measurement, you might see this error message: On HP-UX, sampled call graph reports require kernel patch PHKL_34020. To install this patch, check the HP IT Resource Center for availability and download information. Email the HP Caliper team at caliper-help@cup.hp.com if you have questions about this patch.
B Descriptions of Measurement Reports This appendix contains descriptions of reports produced for each HP Caliper measurement. It shows example command lines you can use to produce the reports and describes the data available with the measurements.
alat Measurement Report Description With the alat measurement, produced by the alat measurement configuration file, HP Caliper measures and reports advance load address table (ALAT) misses. The ALAT keeps track of speculative (that is, advance) loads. An excessive number of ALAT compares that result in a failed advance load (an ALAT miss) can seriously degrade performance.
INST_FAILED_CHKA_LDC_ALAT.ALL The number of failed advance check load (chk.a) and check load (ld.c) instructions that reached retirement, including both integer and floating-point instructions. These failures occur when the ALAT does not contain the expected data. Up to two such events can happen in a given cycle. However, the processor only counts a maximum of one event per cycle. Each sample taken can potentially mask one increment of this counter.
could change.) It does not count cycles when the CPU is in low power mode. When HyperThreading is on, this is the number of variable clock cycles used by only this process's hyperthread. CPU_OP_CYCLES.ALL:all_threads=true The number of variable clock cycles used by both hyperthreads. Available only when HyperThreading is on. IA64_INST_RETIRED The number of retired instructions excluding hardware-generated RSE operations.
Percent float ALAT miss Percentage of floating-point components in all the misses incurred by instructions accessing the ALAT. Percent integer ALAT miss Percentage of integer components in all the misses incurred by instructions accessing the ALAT. Advanced check load per kinst The number of data speculation events per 1000 retired instructions. Failed advanced check load per kinst The number of data speculation fail events per 1000 retired instructions.
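The "per kinst" metrics are rates per 1000 retired instructions. A minimal Python sketch of that normalization follows; the event and instruction counts are hypothetical, and which hardware counters HP Caliper feeds into each metric is not shown here.

```python
def per_kinst(event_count, retired_instructions):
    """Return the number of events per 1000 retired instructions."""
    if retired_instructions <= 0:
        raise ValueError("retired_instructions must be positive")
    return event_count * 1000 / retired_instructions


if __name__ == "__main__":
    # Hypothetical counts: 250 data speculation events in 2,000,000 instructions.
    print(per_kinst(250, 2_000_000))
```

Normalizing by instruction count makes runs of different lengths comparable.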
Table B-1 Information in alat Measurement Reports (continued) Column Description Function Routine from your application. File Source file associated with a function.
branch Measurement Report Description With the branch measurement, produced by the branch measurement configuration file, HP Caliper measures and reports two levels of information: • Exact counts of branch prediction metrics summed across the entire run of an application • Sampled branch prediction metrics that are associated with particular locations in the application The report shows measured data by thread, load module, function, statement, instruction bundle, and instruction.
• Percent Wrong Paths Percentage of branch predictions that mispredicted the branch predicate.
• Percent Wrong Branch Targets Percentage of branch predictions that mispredicted the branch target.
Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems
• BE_FLUSH_BUBBLE.ALL Number of full-pipe bubbles in the main pipe due to either an exception/interruption or a branch misprediction flush.
• BR_MISPRED_DETAIL.IPREL.
• Percent Correct Predictions Percentage of branch predictions that predicted correctly. • Percent Wrong Paths Percentage of branch predictions that mispredicted the branch predicate. • Percent Wrong Branch Targets Percentage of branch predictions that mispredicted the branch target. • Percent iprel branch Percentage of IP-relative branches among all branches. • Percent ind branch Percentage of non-return indirect branches among all branches.
branch Measurement Report Metrics See Table B-2 “Information in branch Measurement Reports”. In this table, “program object” refers to any of the following: • Thread • Load module • Function • Source statement • Instruction bundle • Instruction Table B-2 Information in branch Measurement Reports Column Description % Total Percent of the total attributable to a given program object.
Table B-2 Information in branch Measurement Reports (continued) Column Description File Source file associated with a function.
cgprof Report Description Available only on HP-UX. With the cgprof measurement, produced by the cgprof measurement configuration file, HP Caliper measures and reports both flat and call graph profiles (much like standard gprof).
• Source statement
• Instruction bundle
Table B-3 Information in cgprof Measurement Report Fields (Flat Profile)
Column Description
% Total IP Samples Percent of the total IP samples attributable to a given program object.
Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it.
IP Samples Total number of IP samples attributed to the given program object.
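The percent and running-sum columns of a flat profile can be reproduced from raw sample counts. The Python sketch below is one plausible way to compute them, not HP Caliper's implementation. The object names are hypothetical; the counts 20, 19 and 10 mirror the three-thread summary in the call stack profile example, where the same kind of columns read 40.82, 79.59 and 100.00.

```python
def flat_profile(rows):
    """Return (name, percent, cumulative percent, samples) rows, hottest first.

    `rows` is a list of (name, samples) pairs.
    """
    total = sum(samples for _, samples in rows)
    if total == 0:
        raise ValueError("no samples to report")
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    running = 0
    report = []
    for name, samples in ordered:
        running += samples
        # The cumulative value is derived from the running count, so it does
        # not accumulate the rounding error of the per-row percentages.
        report.append((name,
                       round(100 * samples / total, 2),
                       round(100 * running / total, 2),
                       samples))
    return report


if __name__ == "__main__":
    for row in flat_profile([("thread_a", 20), ("thread_b", 19), ("thread_c", 10)]):
        print(row)
```

Summing the already rounded percentages instead (40.82 + 38.78) would give 79.60 for the second row, not the 79.59 shown in the example report.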
Table B-4 Information in cgprof Measurement Report: Function Entries (Self Entries) Column Description Index Index of the function in the call graph listing, as an aid to locating it. % Total Hits In or Under Percentage of the total hits of the program accounted for by this function and its descendants. % Func Hits In Func Number of hits due to this function, expressed as a percentage of the number of hits accounted for by this function and its descendants.
*This field is omitted for parents, or children, in the same cycle as the function. If the function, or child, is a member of a cycle, the propagated times and propagation denominator represent the self time and descendant time of the cycle as a whole. **Static-only parents and children are indicated by a call count of 0.
cpu Measurement Report Description Available only on HP-UX. With the cpu measurement, produced by the cpu measurement configuration file, HP Caliper measures and reports per-process or system-wide metrics based on sampled CPU events. Specify the events and sampling period with the -m event_set and -s period options, respectively. You can measure multiple metrics in the same run. The --cpu-aggregation option specifies how many low-level samples will be aggregated into one user-reported sample.
The result is saved in the text file cpu.txt in the current directory.
$ caliper cpu -m cpi:user,cpi:kernel my_app
This command gets separate CPI data at user and kernel privilege levels.
$ caliper cpu -o cpu.txt -w -e 120
This command collects system-wide overview metrics for two minutes and saves the result in the text file cpu.txt in the current directory.
Example Command Line for CSV Report
$ caliper cpu --csv csvout .
l2dcache    Provides miss rate information for the L2 data cache for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
l2icache    Provides miss rate information for the L2 instruction cache for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
l3cache     Provides miss rate information for the L3 unified cache.
overview queues stall
sysbus          Provides metrics on system bus utilization. If you specify the sysbus event set, you must use the --bus-speed option to provide bus speed in MHz. For example: --bus-speed 200.
threadswitch    Provides data about the effect of HyperThreading on the measured processes for Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
tlb             Provides metrics related to translation lookaside buffer (TLB) misses.
cstack Measurement Report Description With the cstack measurement, produced by the cstack measurement configuration file, HP Caliper measures and reports a sampled call stack profile, produced by periodically sampling the application program counter and each of its threads' call stacks. It also measures and reports the blocking primitives that are responsible for the blocked samples. Example Command Line for Text Report $ /opt/caliper/bin/caliper cstack -o results.
Table B-8 Information in cstack Measurement Report Fields (Flat Profile) (continued) Column Description Sample Hits Blocked Number of direct sample hits taken when process was blocked, attributed to the given object. (HP-UX only) Wall-clock Samples Total number of direct sample hits attributed to the given object.
Table B-9 Information in cstack Measurement Report Fields (Blocking Primitives Profile) (continued) Column Description Callpath Index Holder --Waiter Holder thread and waiter threads are identified by an index into Hot Call Path section. (HP-UX only) Holder's Kernel Thread ID Holder thread's kernel thread suffixed with the name of the routine that the thread will execute once it is created.
Table B-11 Information in cstack Measurement Report Fields (Call Graph Profile) (continued) Column Description % Total Hits In/Under Percentage of the total sample hits in or under function. (Linux only) % Func Hits In Func Percentage of the total sample hits in function; run and blocked hits combined. % Func Hits Under Parent Percentage of the function's total sample hits under parents; run and blocked hits combined.
cycles Measurement Report Description Available only on Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems. With the cycles measurement, produced by the cycles measurement configuration file, HP Caliper measures and reports a flat profile of the instruction pointers (IPs). This measurement uses the IP-EAR of the dual-core Itanium 2 and Itanium 9300 quad-core processor systems.
cycles Metrics Summed for Entire Run This section describes the metrics summed over the entire run of your application under HP Caliper. BACK_END_BUBBLE.FE Full pipe bubbles in main pipe due to front end. This is the number of cycles lost (stall cycles) due to instruction cache, ITLB, and branch execution stalls. BE_EXE_BUBBLE.ALL Full pipe bubbles in main pipe due to execution unit stalls. This is the number of cycles lost (stall cycles) due to stalls caused by the execution unit. BE_EXE_BUBBLE.
CPU_OP_CYCLES.ALL:all_threads=true    Number of elapsed CPU operating cycles used by both hyperthreads. Available only when HyperThreading is on.
% Unstalled execution (higher is better)    Percentage of unstalled cycles with respect to total number of elapsed CPU operating cycles.
% of Cycles lost due to front end stalls (lower is better)    Percentage of cycles lost due to I-cache, ITLB, and branch execution stalls.
In this table, “program object” refers to any of the following: • Thread • Load module • Function • Source statement • Instruction bundle Table B-12 Information in cycles Measurement Reports Column Description % Total IP Samples Percent of the total IP samples attributable to a given program object. (ETB) Cumulat % of Total Running sum of the percent of total IP samples accounted for by the given program object and those listed above it.
How cycles Metrics Are Obtained HP Caliper obtains cycles metrics using the execution trace buffer (ETB) of the performance monitoring unit (PMU). The ETB is configured to capture IPs of retired instructions. When a bundle is retired, the IP address and the number of elapsed cycles to retire the bundle are recorded in the ETB. The dual-core Itanium 2 and Itanium 9300 quad-core processor ETB is a circular buffer and can contain up to 16 entries.
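The behavior of a 16-entry circular buffer, as described above, can be modeled in a few lines of Python. This is a conceptual sketch only, not a description of the PMU hardware or of HP Caliper internals; the class and method names are hypothetical.

```python
from collections import deque

ETB_ENTRIES = 16  # capacity stated in the text above


class TraceBuffer:
    """Toy model of a circular buffer of (IP, elapsed cycles) records."""

    def __init__(self, capacity=ETB_ENTRIES):
        # A deque with maxlen silently discards the oldest entry when full.
        self._entries = deque(maxlen=capacity)

    def record(self, ip, cycles):
        """Record one retired bundle: its IP and the cycles it took to retire."""
        self._entries.append((ip, cycles))

    def drain(self):
        """Return the buffered records, oldest first, and empty the buffer."""
        entries = list(self._entries)
        self._entries.clear()
        return entries


buf = TraceBuffer()
for i in range(20):  # 20 retired bundles, 16 bytes apart (hypothetical addresses)
    buf.record(0x4000 + 16 * i, i)
print(len(buf.drain()))  # only the most recent records survive
```

Because the buffer wraps, a reader that drains it sees at most the most recent 16 records, regardless of how many bundles retired since the last drain.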
dcache Measurement Report Description With the dcache measurement, produced by the dcache measurement configuration file, HP Caliper measures and reports on data cache metrics. This measurement is similar to the icache measurement.
Example Command Line for CSV Report
$ caliper dcache --csv csvout ./wordplay thequickbrownfox

dcache Metrics Summed for Entire Run
This section describes the metrics summed over the entire run of your application under HP Caliper.

Metrics for Integrity Servers Itanium 2 Systems
L1D_READS L1D_READ_MISSES.
BE_EXE_BUBBLE.GRALL BE_EXE_BUBBLE.GRGR CPU_OP_CYCLES.ALL DATA_REFERENCES IA64_INST_RETIRED L1D_READS L1D_READ_MISSES.ALL L2D_INSERT_MISSES of cycles lost (stall cycles) due to FR/FR or FR/load dependency. Full Pipe Bubbles in Main Pipe due to GR/GR or GR/load dependency stalls. This is the number of cycles lost (stall cycles) due to GR/GR or GR/load dependency. Full Pipe Bubbles in Main Pipe due to GR/GR dependency stalls. This is the number of cycles lost (stall cycles) due to GR/GR dependency.
not include secondary misses and other misses that were forced to recirculate. L2D_MISSES Number of L2 data cache misses (in terms of the number of L2 data cache line requests sent to L3). It includes all cacheable data requests. This does not include secondary misses of the L2D. L2D_REFERENCES.ALL Number of requests made to L2D due to a data read and/or write accesses. Semaphore operations are counted as one read and one write.
Table B-13 Information in dcache Measurement Reports
Column                          Description
% Total Dcache Latency Cycles   Total cache miss latency cycles, expressed as a percent of the total cycles. For example, in Example B-1, 84.29 percent of the total cycles were expended on data cache misses.
Sampled Dcache Misses           Total number of sampled data cache misses attributed to the given program object.
Dcache Latency Cycles           Number of cycles expended on data cache misses summed across samples for the given program object.
Avg.
Table B-13 Information in dcache Measurement Reports (continued)
Column                         Description
Latency Buckets as % Misses    The latency data is reported under eight different buckets: three for cache information and five for memory information. The top row(s) of the heading specifies the names of the cache level (such as L2 or L3) and system memory names. For example, in Example B-1, cache levels L2 and L3 are shown and the system memory is shown as simply Memory (spanning five buckets).
Table B-13 Information in dcache Measurement Reports (continued) Column Description Line | Slot | Col,Offset The column contains one of these: • A source-code line number for rows showing statements • An instruction slot number for rows showing instructions not on a bundle boundary • A source-code column number followed by an offset from the beginning address of a function for rows showing instructions on a bundle boundary Column and line numbers are preceded by “~” when they are approximate due to optim
Example B-2 Example of a dcache Report for a Superdome Integrity Server Function Details --------------------------------------------------------------------------------------------------% Total Avg. ---Latency buckets as % Misses--Dcache Latency Sampled Dcache Dcache Latency Dcache Laten.
4.84 85.02 4 libc.so.1::_arena_rmutex 4.72 89.75 5 42 41 10.5 8.2 25 40 50 60 25 0 0 0 0 0 0 0 0 0 0 0 Process Data Region The Data Entry column shows the global variable name, process region name, or unknown data address. The process regions are: • Process Text Region - the address space occupied by the process text/instructions • Process Data Region - the address space occupied by initialized data and uninitialized data (.
can process only a subset of data cache misses. The PMU randomizes which loads it monitors. This means that the number of data cache misses observed through sampling—number of sampled misses multiplied by sampling rate—is only a subset of the total number of actual data cache misses. Therefore, it is best to interpret sampling data not as an indication of how many data cache misses a particular instruction incurred, but, instead, as an indication of which instructions incur the most data cache misses.
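The recommended interpretation, treating sampled misses as a ranking rather than as absolute counts, can be sketched as follows. The function name and the sample data are hypothetical, and the sketch assumes you already have per-instruction sampled miss counts from a report.

```python
def rank_by_sampled_misses(sampled_misses):
    """Rank program objects by sampled data cache misses.

    Returns (name, sampled count, share of all sampled misses in percent),
    highest first. The share is relative to the samples only; it is not an
    estimate of the true number of misses.
    """
    total = sum(sampled_misses.values())
    ranked = sorted(sampled_misses.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, count, 100.0 * count / total if total else 0.0)
        for name, count in ranked
    ]


# Hypothetical sampled miss counts keyed by instruction address.
hot = rank_by_sampled_misses({"0x4001a0": 42, "0x4002c0": 7, "0x400310": 1})
for name, count, share in hot:
    print("%-10s %4d %6.1f%%" % (name, count, share))
```

The ordering is the useful result: the instruction at the top of the list is the best candidate for tuning, even though its true miss count is unknown.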
to a region in the process. HP Caliper creates a map of different regions within a process. This map is used to assign sample data addresses to a process region.
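A region map of this kind can be sketched as a sorted list of address ranges searched by binary search. This is an illustration under stated assumptions, not HP Caliper's implementation: the class name and the address ranges are hypothetical, and regions are assumed not to overlap.

```python
import bisect


class RegionMap:
    """Map a sampled data address to the name of the region that contains it."""

    def __init__(self, regions):
        # regions: iterable of (start, end, name); end is exclusive.
        self._regions = sorted(regions)
        self._starts = [start for start, _end, _name in self._regions]

    def lookup(self, address):
        # Find the last region whose start is <= address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            start, end, name = self._regions[i]
            if start <= address < end:
                return name
        return "unknown data address"


# Hypothetical layout, for illustration only.
region_map = RegionMap([
    (0x4000, 0x8000, "Process Text Region"),
    (0x10000, 0x20000, "Process Data Region"),
])
print(region_map.lookup(0x12340))
```

Addresses that fall outside every known range are reported as unknown rather than being forced into the nearest region.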
dtlb Measurement Report Description With the dtlb measurement, produced by the dtlb measurement configuration file, HP Caliper measures and reports two levels of information: • Exact counts of data translation lookaside buffer (TLB) metrics summed across the entire run of an application. • Sampled data TLB metrics that are associated with particular locations in the measured application. Data TLB misses can hit the L2 TLB, can be handled by the hardware page walker (HPW), or can be handled by software.
Percentage of Data References Covered by the HPW Percentage of Data References Covered by Software Trap Percentage of L2 DTLB Misses Covered by the HPW following: 100 * (1– L2DTLB_MISSES / DATA_REFERENCES). Percentage of data references that were satisfied by the hardware page walker (HPW). This is calculated as the following: 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES). Percentage of data references that were serviced by the software trap handler for the TLB misses fault.
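The two formulas quoted above can be evaluated directly from the raw counter values. The following Python sketch uses hypothetical function names and counter values; the second function simply evaluates the expression 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES) as written, without assuming which report label it belongs to.

```python
def pct_covered_by_hpw(dtlb_inserts_hpw, data_references):
    """100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)."""
    if data_references == 0:
        return 0.0
    return 100.0 * dtlb_inserts_hpw / data_references


def pct_without_l2_dtlb_miss(l2dtlb_misses, data_references):
    """100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)."""
    if data_references == 0:
        return 0.0
    return 100.0 * (1 - l2dtlb_misses / data_references)


# Hypothetical counter values, for illustration only.
print(pct_covered_by_hpw(5, 1000))
print(pct_without_l2_dtlb_miss(250, 1000))
```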
DTLB_INSERTS_HPW IA64_INST_RETIRED L1DTLB_TRANSFER L1D_READS L2DTLB_MISSES % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to GR/load dependency stalls (lower is better) % of Cycles lost due to GR/GR dependency stalls (lower is better) % of Cycles lost due to FR/load and FR/FR dependency stalls (lower is better) Total L1 data TLB references L1 data TLB for L1D miss percentage L2 data TLB misses L2 data TLB miss percentage Percentage of L2 DTLB misses covered by the HPW Percenta
Percentage of data references covered by the HPW    Percentage of data references that were satisfied by the hardware page walker (HPW).
Percentage of data references covered by software trap    Percentage of data references that were serviced by the software trap handler for the TLB misses fault.
L1 DTLB miss per 1000 instructions retired    Number of L1 DTLB misses per 1000 instructions retired.
L2 DTLB miss per 1000 instructions retired    Number of L2 DTLB misses per 1000 instructions retired.
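The "per 1000 instructions retired" metrics normalize an event count by the amount of work done, which makes runs of different lengths comparable. A minimal sketch, with a hypothetical function name and hypothetical counter values:

```python
def per_1000_instructions(event_count, instructions_retired):
    """Events per 1000 instructions retired."""
    if instructions_retired == 0:
        return 0.0
    return 1000.0 * event_count / instructions_retired


# Hypothetical values: 50 misses over 100,000 retired instructions.
print(per_1000_instructions(50, 100000))
```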
Table B-14 Information in dtlb Measurement Reports (continued) Column Description % DTLB Soft Fill Percent of sampled data TLB misses that were handled by software for the given program object. Kernel Thread Identification Number. Kernel Thread ID suffixed with the name of the routine that the thread will execute once it is created. Load Module Shared library or the main executable. Function Routine from your application. File Source file associated with a function.
ecount Measurement Report Description With the ecount measurement, produced by the ecount measurement configuration file, HP Caliper measures and reports total counts of processor metrics accumulated during an application's execution under HP Caliper control. These metrics are collected using the processor's performance monitoring unit (PMU). The number of metrics HP Caliper can accumulate during a single run of your application is limited to four by PMU constraints.
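Because at most four metrics fit in a single run, a longer list of events has to be split across several runs. The following sketch shows one way to plan that split; it is only an illustration, the function name is hypothetical, and it does not invoke HP Caliper. Counts gathered in separate runs are comparable only to the extent that the application behaves the same way each time.

```python
PMU_MAX_METRICS_PER_RUN = 4  # limit stated in the text above


def plan_runs(events, limit=PMU_MAX_METRICS_PER_RUN):
    """Split a list of event names into groups that each fit in one run."""
    return [events[i:i + limit] for i in range(0, len(events), limit)]


wanted = [
    "CPU_OP_CYCLES.ALL",
    "IA64_INST_RETIRED",
    "BACK_END_BUBBLE.ALL",
    "BE_EXE_BUBBLE.GRALL",
    "L1D_READS",
    "L2DTLB_MISSES",
]
for number, group in enumerate(plan_runs(wanted), start=1):
    print("run %d: %s" % (number, ", ".join(group)))
```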
Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems The following CPU events are directly measured: • BACK_END_BUBBLE.ALL — The number of cycles when the back end of the pipeline was stalled. This is the number of cycles lost (stall cycles) due to any of five possible events (FPU/L1D, RSE, EXE, branch/exception, or the front end). • BE_EXE_BUBBLE.GRALL — The number of Full Pipe Bubbles in Main Pipe due to GR/GR or GR/load dependency stalls.
• Raw CPI (lower is better) — The cycles per instruction, including nop and predicated off instructions.
• Effective CPI (lower is better) — The cycles per effective instruction, excluding nop and predicated off instructions.
• Effective CPI during unstalled execution (lower is better) — The cycles per effective instruction, excluding stall cycles, nop, and predicated off instructions.
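The three CPI variants differ only in which instructions and which cycles are counted. The sketch below follows the definitions above; the function and parameter names are hypothetical, and it assumes you already have counts of cycles, stall cycles, retired instructions, nops, and predicated off instructions.

```python
def raw_cpi(cycles, instructions_retired):
    """Cycles per instruction, counting nop and predicated off instructions."""
    return cycles / instructions_retired


def effective_cpi(cycles, instructions_retired, nops, predicated_off):
    """Cycles per effective instruction (nop and predicated off excluded)."""
    effective = instructions_retired - nops - predicated_off
    return cycles / effective


def effective_cpi_unstalled(cycles, stall_cycles, instructions_retired,
                            nops, predicated_off):
    """Effective CPI with stall cycles removed from the cycle count."""
    effective = instructions_retired - nops - predicated_off
    return (cycles - stall_cycles) / effective


# Hypothetical counts, for illustration only.
print(raw_cpi(2000, 1000))
print(effective_cpi(2000, 1000, 100, 100))
print(effective_cpi_unstalled(2000, 400, 1000, 100, 100))
```

Effective CPI is never lower than raw CPI for the same run, because it divides the same cycle count by fewer instructions.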
fcount Measurement Report Description Available only on HP-UX. With the fcount measurement, produced by the fcount measurement configuration file, HP Caliper measures and reports exact function call counts. This gives the total number of times each function is called, either directly or indirectly. Command-line options allow you to control how the report data are sorted. Example Command Line for Text Report $ caliper fcount -o reports/fcount.
fcover Measurement Report Description Available only on HP-UX. With the fcover measurement, produced by the fcover measurement configuration file, HP Caliper lists each function in your application and indicates whether or not the function was executed. It also lists the percent of functions in each load module, source directory, and source file that were executed. Command-line options allow you to control how the report data are sorted.
Unknown Source Files If the source file information for a function has been stripped, which is often done with system libraries, then HP Caliper reports on those functions separately. The report shows an additional line in the Source Directory Summary and Source File Summary tables for Unknown Source Files and the Totals coverage statistic includes them. At the end of the per-source-file function coverage tables is an optional table for Unknown Source Files.
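The way functions without source file information are grouped and still counted in the totals can be sketched as follows. This is an illustration, not HP Caliper's implementation; the function name, input format, and sample data are hypothetical.

```python
from collections import defaultdict

UNKNOWN = "Unknown Source Files"


def coverage_by_source_file(functions):
    """Summarize function coverage per source file.

    functions: iterable of (function name, source file or None, executed flag).
    Returns {file: (executed, total, percent)}, plus a "Totals" entry that
    includes functions whose source file is unknown.
    """
    counts = defaultdict(lambda: [0, 0])  # file -> [executed, total]
    for _name, source_file, executed in functions:
        key = source_file if source_file else UNKNOWN
        counts[key][1] += 1
        if executed:
            counts[key][0] += 1
    summary = {f: (e, t, 100.0 * e / t) for f, (e, t) in counts.items()}
    executed_all = sum(e for e, _t in counts.values())
    total_all = sum(t for _e, t in counts.values())
    pct_all = 100.0 * executed_all / total_all if total_all else 0.0
    summary["Totals"] = (executed_all, total_all, pct_all)
    return summary


# Hypothetical data: two functions with source info, two stripped.
report = coverage_by_source_file([
    ("main", "a.c", True),
    ("helper", "a.c", False),
    ("memcpy", None, True),
    ("strlen", None, True),
])
for name, (executed, total, pct) in report.items():
    print("%-22s %d/%d %.1f%%" % (name, executed, total, pct))
```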
fprof Measurement Report Description With the fprof measurement, produced by the fprof measurement configuration file, HP Caliper measures and reports sampled instruction pointers (IPs). The fprof measurement samples the instruction pointer (IP) at a regular interval (that is, at a particular number of CPU cycles). This provides a statistical identification of where CPU events are occurring.
Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems BACK_END_BUBBLE.FE BE_EXE_BUBBLE.ALL BE_EXE_BUBBLE.FRALL BE_EXE_BUBBLE.GRALL BE_EXE_BUBBLE.GRGR BE_FLUSH_BUBBLE.ALL BE_L1D_FPU_BUBBLE.L1D BE_RSE_BUBBLE.ALL CPU_CPL_CHANGES.ALL CPU_OP_CYCLES.ALL Full pipe bubbles in main pipe due to front end. This is the number of cycles lost (stall cycles) due to instruction cache, ITLB, and branch execution stalls.
% Unstalled execution (higher is better) % of Cycles lost due to front end stalls (lower is better) % of Cycles lost due to Pipeline flush stalls (lower is better) % of Cycles lost due to data access stalls (lower is better) % of Cycles lost due to RSE stalls (lower is better) % of Cycles lost due to Scoreboard stalls (lower is better) Percentage of unstalled cycles with respect to total number of elapsed CPU operating cycles. Percentage of cycles lost due to I-cache, ITLB, and branch execution stalls.
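Each "% of Cycles lost" metric expresses a stall counter as a share of the elapsed CPU operating cycles. The sketch below shows the arithmetic for two of them. It assumes that BACK_END_BUBBLE.ALL holds the total stall cycles and BACK_END_BUBBLE.FE holds the front-end stall cycles, which is an inference from the event descriptions and not a statement of how HP Caliper computes its report; the counter values are hypothetical.

```python
def pct_of_cycles(event_cycles, total_cycles):
    """Express a cycle count as a percentage of total elapsed cycles."""
    if total_cycles == 0:
        return 0.0
    return 100.0 * event_cycles / total_cycles


def stall_summary(counters):
    """Derive two percentage metrics from raw counters (assumed mapping)."""
    total = counters["CPU_OP_CYCLES.ALL"]
    return {
        "% Unstalled execution":
            100.0 - pct_of_cycles(counters["BACK_END_BUBBLE.ALL"], total),
        "% of Cycles lost due to front end stalls":
            pct_of_cycles(counters["BACK_END_BUBBLE.FE"], total),
    }


# Hypothetical counter values, for illustration only.
summary = stall_summary({
    "CPU_OP_CYCLES.ALL": 1000,
    "BACK_END_BUBBLE.ALL": 400,
    "BACK_END_BUBBLE.FE": 100,
})
for label, value in summary.items():
    print("%-45s %6.2f" % (label, value))
```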
• Source statement
• Instruction bundle

Table B-18 Information in fprof Measurement Reports
Column                Description
% Total IP Samples    Percent of the total IP samples attributable to a given program object.
Cumulat % of Total    Running sum of the percent of total IP samples accounted for by the given program object and those listed above it.
IP Samples            Total number of IP samples attributed to the given program object.
The list of processor metrics you can use for the sampling event is available in the file itanium2_cpu_counters.txt, located in the doc/text subdirectory of the HP Caliper home directory. The IP collected at each sampling point is the IP recorded by the kernel (in the process's save state) when the PMU overflow trap is taken. The kernel does not record an instruction slot number. Thus, the finest granularity HP Caliper reports is the instruction bundle.
icache Measurement Report Description With the icache measurement, produced by the icache measurement configuration file, HP Caliper measures and reports on instruction cache metrics. This measurement is similar to the dcache measurement.
both the L1 instruction cache and the ISB regardless of whether they hit or miss in the RAB. If a demand fetch does not have an L1 instruction TLB miss, L2_INST_DEMAND_READS and L1I_READS line up in time. If a demand fetch does not have an L2 instruction TLB miss, L2_INST_DEMAND_READS follows L1I_READS by 3-4 clocks (unless a flushed iwalk is pending ahead of it, which will increase the delay until the pending iwalk is finished).
L1I_READS L2I_DEMAND_READS L2I_PREFETCHES L2I_READS.ALL.ALL L2I_READS.MISS.ALL L2I_READS.MISS.DMND % of Cycles lost due to all stalls (lower is better) % of Cycles lost due to Front end stalls (ICACHE, ITLB and branch execution) L1 instruction cache references L1 instruction cache misses Number of demand fetch reads to the L1 instruction cache (32-byte chunks). For more information, see L1I_READS . Number of instruction requests to L2 instruction cache due to L1 instruction cache demand fetch misses.
L1 instruction prefetch misses per 1000 instructions retired L1 instruction demand misses per 1000 instructions retired L2 instruction cache misses per 1000 instructions retired L2 instruction prefetch misses per 1000 instructions retired L2 instruction demand misses per 1000 instructions retired Number of instructions retired per L1 instruction prefetch miss. Number of instructions retired per L1 instruction demand fetch miss. Number of instructions retired per L2 instruction cache miss.
Table B-19 Information in icache Measurement Reports (continued) Column Description Function Routine from your application. File Source file associated with a function.
HP Caliper attributes samples for a given cache line to the function associated with the start address of the cache line. Because cache lines can cross function boundaries, data attributed to functions will not always be accurate. However, only cache-line data at the boundaries of the function are potentially misattributed. More frequent sampling increases HP Caliper's perturbation of your application.
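The attribution rule described above, and the boundary effect it causes, can be illustrated with a short sketch. The function names and addresses are hypothetical, and the 64-byte line size is an assumption chosen for the example, not a statement about any particular processor.

```python
import bisect

CACHE_LINE_BYTES = 64  # assumed line size, for illustration only


def line_start(address, line_bytes=CACHE_LINE_BYTES):
    """Round an address down to the start of its cache line."""
    return address & ~(line_bytes - 1)


def attribute(sample_address, functions, line_bytes=CACHE_LINE_BYTES):
    """Attribute a sample to the function that owns the cache line's start.

    functions: list of (start address, name), sorted by start address.
    Returns None if the line starts before the first known function.
    """
    starts = [start for start, _name in functions]
    i = bisect.bisect_right(starts, line_start(sample_address, line_bytes)) - 1
    return functions[i][1] if i >= 0 else None


# Hypothetical layout: bar begins in the middle of a cache line.
functions = [(0x1000, "foo"), (0x1050, "bar")]
print(attribute(0x1060, functions))  # address lies in bar, line starts in foo
print(attribute(0x1080, functions))  # line starts inside bar
```

Only the line that straddles the boundary between foo and bar is affected; samples in every later line of bar are attributed to bar.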
itlb Measurement Report Description With the itlb measurement, produced by the itlb measurement configuration file, HP Caliper measures and reports two levels of information:
• Exact counts of instruction translation lookaside buffer (TLB) metrics summed across the entire run of an application
• Sampled instruction TLB metrics that are associated with particular locations in the application
The report shows measured data by thread, load module, function, statement, and cache line.
L1 Instruction TLB Miss Ratio Ratio of L1 instruction TLB misses to Total L1 instruction TLB references. Metrics for Integrity Servers Dual-Core Itanium 2 and Itanium 9300 Quad-Core Processor Systems BACK_END_BUBBLE.ALL Number of cycles when the back end of the pipeline was stalled. This is the number of cycles lost (stall cycles) due to any of five possible events (FPU/L1D, RSE, EXE, branch/exception, or the front end). BACK_END_BUBBLE.FE Full Pipe Bubbles in Main Pipe due to front end.
% of Cycles lost due to all stalls (lower is better) % of Cycles lost due to Front end stalls (ICACHE, ITLB, and branch execution) % of Cycles lost due to instruction TLB stalls % of Cycles lost due to instruction cache stalls % of Cycles lost due to instruction access stalls (ICACHE and ITLB) % of Cycles lost due to branch execution Total L1 instruction TLB references L1 instruction TLB miss percentage L2 instruction TLB misses Percentage of L2 ITLB misses covered by the HPW L1 ITLB miss per 1000 instructi
Table B-20 Information in itlb Measurement Reports (continued) Column Description Sampled ITLB Misses Total number of sampled instruction TLB misses attributed to the given program object. ITLB L2 Fills Number of sampled instruction TLB misses that hit the L2 instruction TLB for the given program object. L2 fills are not reported for, and do not apply to, Itanium systems. ITLB HPW Fills Number of sampled instruction TLB misses that were handled by the HPW for the given program object.
These reports show data associated with a cache line on the same row as the first instruction of the cache line. Each set of instructions that make up a cache line are preceded and followed by a row of dashes (“- - - -”). The cache lines shown might not be contiguous. Non-contiguous cache lines are separated by a row of tildes (“~ ~ ~ ~”). How Instruction TLB Metrics Are Obtained HP Caliper obtains instruction TLB metrics from the processor's performance monitoring unit (PMU).
pmu_trace Measurement Report Description With the pmu_trace measurement, produced by the pmu_trace measurement configuration file, HP Caliper measures traces of sampled PMU data associated with the application for each kernel thread. This data includes cache misses, TLB misses, ALAT misses, branch mispredictions, instruction addresses, and CPU events. These metrics are sampled using the processor's performance monitoring unit (PMU).
scgprof Measurement Report Description With the scgprof measurement, produced by the scgprof measurement configuration file, HP Caliper measures and reports both flat and call graph profiles (much like standard gprof). The call graph is produced by statistical sampling of the processor's performance monitoring unit (PMU) to determine function calls in an application. The call graph is not exact, because it does not show every function call, but it is statistically useful.
Table B-21 Information in scgprof Measurement Report Fields (Flat Profile) (continued) Column Description File Source file associated with a function.
Table B-22 Information in scgprof Measurement Report: Function Entries (Self Entries) (continued) Column Description +Self Number of times this function calls itself recursively. For a cycle entry, this denotes the number of calls within the cycle. If this column contains a hyphen (-), this means that there is at least one call, but the exact number of calls is unknown. This notation can mean that the function was called inline.
Table B-24 Information in scgprof Measurement Report: Parent Listings (continued) Column Description Parents Name of this parent function. Cycle Cycle that this parent is a member of, if any. *Static-only parents and children are indicated by a call count of 0. **This field is omitted for parents, or children, in the same cycle as the function.
traps Measurement Report Description Available only on Integrity servers dual-core Itanium 2 and Itanium 9300 quad-core processor systems. With the traps measurement, produced by the traps measurement configuration file, HP Caliper collects and reports a profile of traps, interrupts, and faults. The trap profile is produced by statistical sampling of the execution trace buffer (ETB) configured to capture all changes to/from privilege level 0.
Privileged operation fault Reserved register/field fault IA32EXP IACCS IARGHT IKEY INT ITLB KPERM LPTRP IA32 Exception Instruction access bit fault Instruction access rights fault Instruction key miss fault External interrupt Instruction translation lookaside buffer fault Key permission fault Lower Privilege Transfer Trap or Unimplemented Instruction Address Trap NATC NAT Consumption fault PNotP Page Not Present fault SPECOP Speculative Operation fault SSTRP Single Step Trap TBTRP Taken Branch Trap UADREF
BACK_END_BUBBLE.FE BE_EXE_BUBBLE.ALL BE_EXE_BUBBLE.FRALL BE_EXE_BUBBLE.GRALL BE_EXE_BUBBLE.GRGR BE_FLUSH_BUBBLE.ALL BE_L1D_FPU_BUBBLE.ALL BE_L1D_FPU_BUBBLE.L1D BE_RSE_BUBBLE.ALL CPU_OP_CYCLES.ALL Full pipe bubbles in main pipe due to front end. This is the number of cycles lost (stall cycles) due to instruction cache, ITLB, and branch execution stalls. Full pipe bubbles in main pipe due to execution unit stalls.
% of Cycles lost due to front end stalls (lower is better) % of Cycles lost due to Pipeline flush stalls (lower is better) % of Cycles lost due to data access stalls (lower is better) % of Cycles lost due to RSE stalls (lower is better) % of Cycles lost due to Scoreboard stalls (lower is better) Percentage of cycles lost due to instruction cache, ITLB, and branch execution stalls. Percentage of cycles lost due to branch misprediction or interruption flush.
core's other hyperthread or were lost to HyperThreading overhead. traps Measurement Metrics See Table B-26 “Information in traps Measurement Reports”. In this table, “program object” refers to any of the following: • Thread • Load module • Function • Source statement • Instruction Table B-26 Information in traps Measurement Reports Column Description % Total Trap Samples Percent of the total trap samples attributable to a given program object.
How traps Metrics Are Obtained HP Caliper obtains traps metrics using the execution trace buffer (ETB) of the performance monitoring unit (PMU). The ETB is configured to capture all changes to/from privilege level 0. HP Caliper takes samples by using the overflow of one of the PMU's event counters as a sampling trigger.
C Event Set Descriptions for CPU Metrics
This appendix contains descriptions for the output of each event set available when you use the cpu measurement.
NOTE: The information provided in this appendix for each report description is the same information you receive when you use the --info option to append help to the end of text reports, or when you use this command:
$ caliper info -r event-set
For more information, see “cpu Measurement Report Description” (p. 236).
brpath Event Set The brpath event set provides information on the dynamic mix of branch types (IPREL, indirect, and return), branch path distribution, branch density, and so forth. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
— %Indirect Branch — This metric provides the percentage of Indirect branches among all branches. — %Return Branch This metric provides the percentage of Return branches among all branches. • IPREL Path Statistics This metric provides path distribution and mispredict rate for both paths of a non-call IPREL branch. Unconditional IPREL branches are included, so there is a slight bias toward the taken path.
brpred Event Set The brpred event set provides information useful in assessing the effectiveness of branch prediction for the three major classes of branches: IPREL, Indirect, and Return. The Itanium 2 branch semantics treat predicated off branches as retired untaken branches, so the branch statistics include the impact of predicated off branches that were predicted to be taken.
branches. Predicated off IPREL branches that were predicted as taken will be counted as wrong path outcomes. — Weight Fraction of IPREL branches among all branch types. — Correct Percentage of correctly predicted IPREL branches. — Wrong Path Percentage of IPREL branches for which the target path (taken/not-taken) was predicted incorrectly. — Wrong Target Percentage of IPREL branches for which the target address was predicted incorrectly.
— Wrong Path Percentage of Return branches for which the target path (taken/not-taken) was predicted incorrectly. — Wrong Target Percentage of Return branches for which the target address was predicted incorrectly.
c2c Event Set Available only on Itanium 2 and dual-core Itanium 2 systems. The c2c (“cache-to-cache”) event set provides information relating to cache coherence activity from the local processor perspective, the source of data for local data cache misses, and rate at which the local processor satisfies the data cache miss of remote processors. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states).
• Snoops/Sec This is the number of snoops per second that the local processor sees on a bus-based system when a remote processor takes a data cache miss or issues a flush cache (fc) instruction. It also includes local processor self snoops.
• Dmiss/Sec This is the number of data misses from the L3 cache of the local processor per second.
• Load This is the fractional component of the total data misses per second that is due to load misses for the local processor.
cpi Event Set The cpi event set provides information related to cycles per instruction (CPI). If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
• %Useful This gives an estimate of the percentage of instructions that have an architecturally visible result. It is only an estimate, because predicated off branches are considered useful as a result of the semantics that the Itanium 2 ascribes to predicated off instructions. • %Nops This is the percentage of instructions that were observed during the sample period that were NOPS. • %Pred This is an estimate of the percentage of instructions that are predicated off.
cpubus Event Set Available only on Itanium 2 and dual-core Itanium 2 systems. The cpubus event set provides information on the demand that a specific CPU presents to the central electronics complex (CEC), the chip set surrounding the CPU, and the demand the CPU experiences due to the CEC traffic initiated by other CPUs or I/O components in the system. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states).
• Flush This is the number of flush cache (fc) operations executed per second by the local processor. • Prtl This is the total number of partial (less than 128 byte) reads (BRP) or writes (BWP) initiated by the local processor per second. Partial transactions are normally due to reading/writing memory-mapped I/O control registers, semaphore operations, clean castouts (if monitoring a system with directory-based cache coherency), and sending interprocessor interrupts.
cspec Event Set The cspec event set provides information on the effectiveness of control speculation. Control speculation is the execution of an operation before the branch that guards it. Control speculation involves the movement of loads above the basic block with which they are normally associated. This can give the optimizer added degrees of freedom for instruction scheduling.
• Chks Retired This is the total number of non-predicated-off chk.s instructions retired during the sample interval. • Chks Failed This is the total number of failed chk.s instructions that were retired during the sample interval. • Control Speculation: — Spec/Sec: Total This is the total number of control speculation events per second. — Spec/Sec: Fail This is the number of control speculation fail events per second.
dispersal Event Set The dispersal event set provides a qualitative view of the parallelism that is available as seen at instruction dispersal and provides information on the compiler's architectural effectiveness. Instruction dispersal is the process of mapping instructions within bundles to functional units. Architectural effectiveness is the extent to which the compiler is able to exploit the available instruction level parallelism provided by the underlying CPU implementation.
• %Dispersed Instr retired This is a metric of dispatch efficiency, that is, the percentage of the instructions that were dispersed that reached retirement. The number of instructions dispersed will not equal the number reaching retirement primarily because of pipeline flushes mainly due to mispredicted branches, traps, and interrupts. If the retirement ratio is low, this is likely due to poor branch prediction.
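As a rough illustration of the %Dispersed Instr retired metric described above, the following Python sketch computes the percentage from two raw counts. The function name and the sample counts are hypothetical; they are not HP Caliper identifiers or measured output.

```python
def percent_dispersed_retired(dispersed: int, retired: int) -> float:
    """Percentage of dispersed instructions that reached retirement."""
    if dispersed <= 0:
        raise ValueError("dispersed count must be positive")
    return 100.0 * retired / dispersed


# Hypothetical counts for one sample interval (assumed values).
ratio = percent_dispersed_retired(dispersed=1_250_000, retired=1_000_000)
print(f"%Dispersed Instr retired: {ratio:.1f}")
```

A value well below 100 means that many dispersed instructions were discarded by pipeline flushes before they could retire.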
dspec Event Set The dspec event set provides information on the effectiveness of data speculation. Data speculation is the execution of a memory load prior to a store which preceded it and which might potentially alias with it. Data speculation occurs when the ordering of data accesses is changed by the optimizer. The ability to alter the ordering of memory operations can greatly increase the degrees of freedom when attempting to generate optimal code.
• Chka Retired This is the total number of advanced check loads (chk.a) and check loads (ld.c) that were retired during the sample interval. • Chka Failed This is the total number of advanced check loads (chk.a) and check loads (ld.c) that failed during the sample interval. • ALAT Access This is the number of times the ALAT is accessed during the sample interval. In effect, this is the count of all instructions that access the ALAT. Instructions that access the ALAT include ld.a, ld.sa, ldf.a, ldf.
fp Event Set The fp event set provides information relating to floating-point operation density, execution rate, and flush/trap events density. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
FCVT.fx FPMA FPMPY FPMS FPMIN FPMAX FPAMIN FPAMAX FPCMP FPCVT.
• SIR This is the total number of safe instruction recognition (SIR) stalls observed during the sample interval. The count includes both false (stall only, no trap taken) stalls and true (SWFA trap taken) stalls. • FP Events /Sec: FOPS This is the number of floating-point operations (not instructions) that are executed per second. • FP Events/Sec: zero flush This is the number of flush to zero events that occur per second.
l1dcache Event Set The l1dcache event set provides information on L1 data cache miss rates for read misses. The L1 data cache (L1D cache) is special in that it does not handle all types of memory references. In particular, the L1D cache does not handle floating-point loads, semaphores, lfetch instructions and VHPT loads. The L1D cache is also a write-through, non-store allocate cache. Thus, the only operations that access the L1D are integer loads, RSE loads, and load checks.
• Total - Misses per Kinst This is the total number of L1D cache misses per 1000 retired instructions, including nops, predicated off instructions, and speculative instructions/associated recovery code. • NON RSE - Misses per Kinst This is the number of non-RSE L1D cache misses per 1000 retired instructions, including nops, predicated off instructions, and speculative instructions/associated recovery code.
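The "per Kinst" normalization used by these metrics can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the counts are assumptions and do not come from HP Caliper.

```python
def misses_per_kinst(misses: int, retired_instructions: int) -> float:
    """Cache misses per 1000 retired instructions.

    retired_instructions is assumed to include nops, predicated off
    instructions, and speculative instructions, as the metric text states.
    """
    if retired_instructions <= 0:
        raise ValueError("retired instruction count must be positive")
    return 1000.0 * misses / retired_instructions


# Hypothetical counts for one measurement run (assumed values).
retired = 3_000_000
total_misses = 45_000
rse_misses = 5_000

print(f"Total:   {misses_per_kinst(total_misses, retired):.2f}")
print(f"NON RSE: {misses_per_kinst(total_misses - rse_misses, retired):.2f}")
```

Normalizing by retired instructions rather than by time lets runs of different lengths be compared directly.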
l1icache Event Set The l1icache event set provides information on L1 instruction cache miss rates for both demand fetches and prefetches. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
• Total - Misses Per Kinst This is the number of demand instruction cache line accesses and instruction prefetch cache line accesses that miss the L1 instruction cache and ISB per 1000 instructions retired. • Dfetch - Misses Per Kinst This is the number of demand instruction cache line accesses that miss the L1 instruction cache and ISB per 1000 instructions retired.
l2cache Event Set The l2cache event set provides miss rate information for the L2 unified cache on Itanium 2 systems. This measurement is valid only on Itanium 2 systems. On dual-core Itanium 2 and Itanium 9300 quad-core processor systems, the event set name l2cache will produce the l2dcache and l2icache metrics.
The metrics are: • Total - Misses Per Second This is the total number of L2 cache misses per second. It includes all instruction prefetch misses, instruction demand misses, and data misses. • Pfetch - Misses Per Second This is the number of instruction line prefetch requests (streaming and non-streaming) that miss the L2 cache per second. • Dfetch - Misses Per Second This is the number of instruction line demand requests that miss the L2 cache per second.
• Writeback Hits Per Kinst This is the number of cache line writebacks that hit the L3 cache per 1000 retired instructions, including nops and predicated off instructions. • Writeback Misses Per Kinst This is the number of cache line writebacks that miss the L3 cache per 1000 retired instructions, including nops and predicated off instructions. Writeback misses are sent directly to memory; they do not allocate the line in the L3 cache.
l2dcache Event Set The l2dcache event set provides miss rate information for the L2 data cache on dual-core Itanium 2 and Itanium 9300 quad-core processor systems. On other Itanium 2 systems, use the l2cache event set, which provides the miss rate information for the unified L2 cache.
The metrics are: • Total - Misses Per Second This is the total number of L2 data cache misses per second. It includes all data load and store misses. • Load - Misses Per Second This is the number of data load requests that miss the L2 cache per second. • Store - Misses Per Second This is the number of data store requests that miss the L2 cache per second. • Writebacks Per Second This is the total number of L2 data cache writebacks (L3 hit and miss) per second.
integer stores, all floating-point loads/stores, and semaphore (counted once) operations. • %Miss This is the percentage of all the L2 data cache misses out of the total number of L2 data cache accesses. Accesses include all integer and RSE loads that miss the L1 data cache, all RSE and integer stores, all floating-point loads/stores, and semaphore (counted once) operations.
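The %Miss definition above can be illustrated with a short Python sketch that sums the access categories named in the text and divides the miss count by that total. The class, field names, and counts are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class L2DAccesses:
    """Access categories counted for the L2 data cache (hypothetical names)."""
    loads_missing_l1d: int   # integer and RSE loads that miss the L1 data cache
    int_rse_stores: int      # all RSE and integer stores
    fp_loads_stores: int     # all floating-point loads/stores
    semaphores: int          # semaphore operations, each counted once

    def total(self) -> int:
        return (self.loads_missing_l1d + self.int_rse_stores
                + self.fp_loads_stores + self.semaphores)


def percent_miss(l2d_misses: int, accesses: L2DAccesses) -> float:
    """L2 data cache misses as a percentage of all L2 data cache accesses."""
    total = accesses.total()
    if total <= 0:
        raise ValueError("no L2 data cache accesses recorded")
    return 100.0 * l2d_misses / total


# Hypothetical counts (assumed values, not measured output).
acc = L2DAccesses(600_000, 250_000, 100_000, 50_000)
print(f"%Miss: {percent_miss(42_000, acc):.1f}")
```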
l2icache Event Set The l2icache event set provides miss rate information for the L2 instruction cache on dual-core Itanium 2 and Itanium 9300 quad-core processor systems. On other Itanium 2 systems, use the l2cache event set, which provides the miss rate information for the unified L2 cache. The L2 instruction cache metrics include miss information for instruction prefetch requests and instruction demand requests.
The metrics are: • Total - Misses Per Second This is the total number of L2 instruction cache misses per second. It includes all instruction prefetch misses and instruction demand fetch misses. • Pfetch - Misses Per Second This is the number of instruction line prefetch requests (streaming and non-streaming) that miss the L2 instruction cache per second. • Dfetch - Misses Per Second This is the number of instruction line demand requests that miss the L2 instruction cache per second.
l3cache Event Set The l3cache event set provides miss rate information for the L3 unified cache, including miss information for instruction prefetch requests, instruction demand requests, integer loads and stores as well as L2 cache writebacks that might either hit or miss the L3 cache. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement.
• Writebacks Per Second This is the total number of L2 cache writebacks (L3 hit and miss) per second. • Total - Misses Per Kinst This is the total number of L3 cache misses per 1000 retired instructions, including nops and predicated off instructions. It includes instruction prefetch misses, instruction demand misses, and data misses.
RSE loads/stores, and instruction fetches/prefetches that miss higher levels of the cache hierarchy. memreq Event Set Available only on Itanium 9300 quad-core processor systems. The memreq event set provides data about memory read latency and cacheable and uncacheable memory requests. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement.
• Load This is the total number of cacheable loads per 1000 retired instructions, including nops and predicated off instructions. • Store This is the total number of cacheable RFO (Read For Ownership) stores per 1000 retired instructions, including nops and predicated off instructions. • Hint This is the total number of cacheable RFO (Read For Ownership) hints per 1000 retired instructions, including nops and predicated off instructions.
queues Event Set Available only on Itanium 2 and dual-core Itanium 2 systems. The queues event set provides bus request queue (BRQ) information that might give insight into possible performance problems related to the system bus. The BRQ is a centralized queueing structure that collects almost all requests from the L1 cache and then schedules those requests to the L2 cache or front side bus (FSB). High values on the available metrics will likely indicate high levels of bus utilization.
processor is being delayed during bus request arbitration, probably due to excessive bus utilization by a priority agent (I/O). • AVG IOQ Live Entries Per Cycle The in-order queue (IOQ) monitors all outstanding bus transactions generated by any bus agent. Requests are loaded into the IOQ during the bus transaction request phase. Transactions are retired from the IOQ on receipt of a positive response status from the bus.
• Data Percentage of snoops that are data snoops. • Inv Percentage of snoops that are invalid snoops. • 128 Byte - Miss This is the fraction of 128-byte data snoops that miss, out of all data snoops (64-byte and 128-byte). • 128 Byte - Hit This is the fraction of 128-byte data snoops that hit a cache line, out of all data snoops (64-byte and 128-byte). • 128 Byte - Hitm This is the fraction of 128-byte data snoops that hit a modified cache line, out of all data snoops (64-byte and 128-byte).
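The 128-byte snoop fractions all share one denominator, the count of all data snoops (64-byte and 128-byte). A minimal Python sketch of that arithmetic follows. The function name, dictionary keys, and counts are assumptions for illustration; the total is passed in separately so that the sketch makes no claim about how the individual outcomes overlap.

```python
def snoop_fractions_128(counts_128: dict[str, int],
                        total_data_snoops: int) -> dict[str, float]:
    """Fraction of each 128-byte snoop outcome out of all data snoops."""
    if total_data_snoops <= 0:
        raise ValueError("total data snoop count must be positive")
    return {outcome: count / total_data_snoops
            for outcome, count in counts_128.items()}


# Hypothetical counts (assumed values, not measured output).
fractions = snoop_fractions_128(
    {"miss": 120_000, "hit": 30_000, "hitm": 10_000},
    total_data_snoops=200_000,  # 64-byte and 128-byte data snoops together
)
for outcome, value in fractions.items():
    print(f"128 Byte - {outcome}: {value:.2f}")
```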
stall Event Set The stall event set provides information on primary CPU performance limiters by breaking the CPI into seven components. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states). By default, the idle state is not included in the measurement. You can use command-line options to limit the scope of the measurement.
of instructions executing varies from 1 to 6, which is the maximum dispatch for the Itanium 2 processor. Taken branches, non-double-bundle aligned branch targets, and explicit stop bits are the primary determinants of code-based execution limitations. You can obtain some idea of this from the dispersal event set. • BE Flush This counts the number of stall cycles resulting from a pipeline flush caused by a branch misprediction, an exception, an ALAT flush, or a serialization flush.
sysbus Event Set Available only on Itanium 2 and dual-core Itanium 2 systems. The sysbus event set provides data on system bus utilization and its breakdown into: • Transaction originator (all, local cpu, io) • Transaction type (brl, bril, bil, bwl, partial) If you use this option, you must use the --bus-speed option. If you use this event set, the default is to make the measurements irrespective of CPU operating state (that is, user, system, or interrupt states).
For control-dominated code or for workloads that seldom miss the internal caches, this value will be very small. For data-flow-type workloads, this number can, if extensive prefetching is employed, be quite high, up to a maximum of 16, which is the Itanium 2 bus limit. The reported average latency value will be incorrect on Itanium 2 steppings earlier than B2. • CPU CPU transaction component is a measure of the percentage of all bus transactions generated by all CPUs on a shared front side bus (FSB).
• BRIL Bus Read Invalidate Line is the transaction used when a store miss occurs, thus a read for ownership. In Itanium 2, this transaction is also used when a store hit occurs on a shared line. In this case, the BRIL is used to invalidate all remote copies on this cache line and have the memory controller return the line we already have to the cache. Itanium 2 does not implement the BIL optimization, which would have allowed remote copies to be invalidated without performing a superfluous memory request.
threadswitch Event Set Available only on dual-core Itanium 2 systems. The threadswitch event set provides data about the impact of HyperThreading on the measured process. It provides a full statistical breakdown of thread switch activity. HyperThreading (formally called Hyper-Threading Technology) provides the ability for a processor to create an additional logical processor that might allow additional efficiencies of processing.
• Hint Percentage of all thread switches that were triggered by the “hint@pause” instruction. This is when the measured process voluntarily gives up the processor because it is about to wait for something (like a mutex). A non-zero value indicates a “good” use of HyperThreading: this process has natural “idle” time that another process can make use of. • Other Percentage of all other reasons that thread switches occurred.
tlb Event Set The tlb event set provides information related to translation lookaside buffer (TLB) misses. The Itanium 2 TLB implementation is split for instructions and data, with two levels for each. The first level only maps 4K pages. Thus, the miss rate (per sec/per kinst) might be quite high. The second level supports large pages and is backed up by hardware that automatically inserts the required translation if it is found to be the head element on the page table list.
translation into the I2TLB. If the required translation is not the head element, a trap will be taken to the software TLB handler to perform the requisite update. • D1TLB Misses Per Sec This is the number of level 1 DTLB misses per second. This level of the DTLB only operates on 4K pages. Thus, its miss rate will be high, but it is normally the case that any required translation would be provided by the level 2 DTLB in three cycles.
This problem generally only occurs when there is a loop that has a statically mispredicted branch. This can lead to accesses to code that is never executed and thus never in the cache, but is continually being accessed by the mispredicted branch. This results in an ITLB miss, which is then dismissed before the trap is actually taken. • %DTLB H/W Update This is the percentage of level 2 ITLB misses out of ITLB hardware inserts. This is a metric of the effectiveness of the HPW.
Glossary advance load address table (ALAT) In the Integrity servers processor family, a table that keeps track of speculative (that is, advance) loads. An excessive number of ALAT compares that result in a failed advance load (an ALAT miss) can seriously degrade performance. advice class A grouping for advice from the Advisor. Every piece of advice belongs to one of these classes: general, CPU, memory, IO, and system.
cpu_metrics measurement The non-preferred name for the cpu measurement. This name was used in releases prior to Release 3.9. cstack measurement A measurement, provided by the cstack measurement configuration file, that measures and reports a sampled call stack profile, produced by periodically sampling the application program counter and each of its threads' call stacks.
fcount measurement A measurement, provided by the fcount measurement configuration file, that measures and reports function counts in a program. fcover measurement A measurement, provided by the fcover measurement configuration file, that measures and reports functions used by a program. fprof measurement A measurement, provided by the fprof measurement configuration file, that measures and reports sampled instruction addresses.
instruction dispersal The process of mapping instructions within bundles to functional units. See “dispersal Event Set” (p. 313). instruction event address register (I-EAR) The component of the Integrity servers processor that records the instruction addresses of data cache misses for loads, the instruction addresses of data TLB misses, and the instruction addresses of instruction TLB and cache misses. See also data event address register (D-EAR).
performance monitor configuration (PMC)/performance monitor data (PMD) A set of registers used to configure the performance monitors (PMC) and provide data values from the performance monitors (PMD). In other words, the PMC register maintains control information about what to monitor and the PMD register holds the actual data that results from the monitoring.
scgprof measurement A measurement, provided by the scgprof measurement configuration file, that measures and reports an (inexact) call graph profile, produced by sampling the performance monitoring unit (PMU) to determine function calls. system-wide measurement A measurement that is performed on all CPUs in the system. Compare with per-process measurement. total_cpu measurement The non-preferred name for the ecount measurement. This name was used in releases prior to Release 3.9.
Index Symbols --[no]fold option, 81 --advice-classes option used with HP Caliper Advisor, 104 --advice-cutoff option used with HP Caliper Advisor, 104 --advice-details option used with HP Caliper Advisor, 104 --analysis-focus option used with HP Caliper Advisor, 105 --branch-sampling-spec option, 72 --bus-speed option, 74 --callpath-cutoff option, 74 --context-lines option, 74 --cpu-aggregation option, 75 --cpu-counter option used with caliper info command, 134 --cpu-details option, 75 --cpu-metrics-aggrega
-o option, 67 used with caliper info command, 135 used with HP Caliper Advisor, 105 -p option, 68 examples, 131 syntax, 128 -p some option syntax, 130 -r option, 68 -s option, 69 used with caliper info command, 135 -t option, 98 -w option, 72 .
Disabling the PMU, 211 disasm_target_name_limit constant, 122 Disassembly listing branch targets in, 143 Disassembly listings, 143 Disassembly, adding to report, 33 dispersal event set, 313 Displaying reference information, 133 Documentation about HP Caliper, 41 Documentation resources, 20 dspec event set, 315 dtlb measurement report description, 260 Dual-core Itanium 2 processor HyperThreading information, 146 fprof measurement report description, 271 fprof sampling on multiple PMU Counters, 61 Function d
L l1dcache event set, 320 l1icache event set, 322 l2cache event set, 324 l2dcache event set, 327 l2icache event set, 330 l3cache event set, 332 latest file, 148 Layout of reports, 137 Limiting PMU measurements, 211 Load modules collecting data for, 124 M Machine instructions omitting from reports, 69 Makefile including HP Caliper commands in, 132 Measurement global, 35 precise, 36 sampled, 36 Measurement configuration file, 57 Measurement configuration files Overview measurement, 59 provided with HP Calipe
pmu_trace measurement , 287 scgprof measurement, 288 traps measurement, 292 Report generation from saved data, 147 Report layout, 137 Report output HyperThreading information shown in, 146 Report structure, 137 Reports configuring, 137 Restricting PMU measurements, 211 S Sampled call graph profile analysis, producing, 157 Sampled call graph reports error message, 220 Sampled call stack profile, producing, 171 Sampled measurement, 36 Sampled measurements performing, 132 Sampling performance, 36 scgprof meas