SUSE Linux Enterprise Server 11 SP3
October 16, 2014
www.suse.com
System Analysis and Tuning Guide

Copyright © 2006–2014 SUSE LLC and contributors. All rights reserved. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
Contents

About This Guide
1 Available Documentation
2 Feedback
3 Documentation Conventions

I Basics
1 General Notes on System Tuning

II System Monitoring
2 System Monitoring Utilities
3 Monitoring with Nagios
4 Analyzing and Managing System Log Files

III Kernel Monitoring
5 SystemTap—Filtering and Analyzing System Data
6 Kernel Probes
7 Perfmon2—Hardware-Based Performance Monitoring
8 OProfile—System-Wide Profiler

IV Resource Management
9 General System Resource Management
10 Kernel Control Groups
11 Power Management

V Kernel Tuning
[...]
14 Tuning the Task Scheduler
[...]
18 kexec and kdump
About This Guide

SUSE Linux Enterprise Server is used for a broad range of usage scenarios in enterprise and scientific data centers. SUSE has ensured that SUSE Linux Enterprise Server is set up to accommodate different operating purposes with optimal performance. However, SUSE Linux Enterprise Server must meet very different demands when employed on a number-crunching server compared to a file server, for example.
Part IV, “Resource Management” (page 113)
Learn how to set up a tailor-made system that exactly fits the server's needs. Get to know how to use power management while at the same time keeping the performance of a system at a level that matches the current requirements.

Part V, “Kernel Tuning” (page 153)
The Linux kernel can be optimized either by using sysctl or via the /proc file system. This part covers tuning the I/O performance and optimizing the way Linux schedules processes.
Deployment Guide (↑Deployment Guide) Shows how to install single or multiple systems and how to exploit the product inherent capabilities for a deployment infrastructure. Choose from various approaches, ranging from a local installation or a network installation server to a mass deployment using a remote-controlled, highly-customized, and automated installation technique.
Virtualization with KVM (↑Virtualization with KVM)
Describes how to manage KVM with libvirt or QEMU. The guide also contains detailed information about requirements, limitations, and support status.

AutoYaST (↑AutoYaST)
AutoYaST is a system for installing one or more SUSE Linux Enterprise systems automatically and without user intervention, using an AutoYaST profile that contains installation and configuration data. The manual guides you through the basic steps of auto-installation: preparation, installation, and configuration.
Bugs and Enhancement Requests For services and support options available for your product, refer to http:// www.suse.com/support/. To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support > Service Request. User Comments We want to hear your comments about and suggestions for this manual and the other documentation included with this product.
• ►#amd64 em64t ipf: This paragraph is only relevant for the architectures amd64, em64t, and ipf. The arrows mark the beginning and the end of the text block.◄

• ►#ipseries zseries: This paragraph is only relevant for the architectures System z and ipseries. The arrows mark the beginning and the end of the text block.◄

• Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.
Part I. Basics
1 General Notes on System Tuning

This manual discusses how to find the reasons for performance problems and provides means to solve these problems. Before you start tuning your system, you should make sure you have ruled out common problems and have found the cause (bottleneck) of the problem. You should also have a detailed plan on how to tune the system, because applying random tuning tips will not help (and could make things worse).

1.1 Be Sure What Problem to Solve

Try to describe the problem as exactly as possible; a vague “the system is too slow” is not a useful problem description. If you plan to tune your Web server for faster delivery of static pages, for example, it makes a difference whether you need to generally improve the speed or whether it only needs to be improved at peak times. Furthermore, make sure you can apply a measurement to your problem, otherwise you will not be able to verify whether the tuning was a success. You should always be able to compare “before” and “after”.
If you need a more in-depth analysis, the Linux kernel offers means to perform it. See Part III, “Kernel Monitoring” (page 69) for coverage. Once you have collected the data, it needs to be analyzed. First, inspect whether the server's hardware (memory, CPU, bus) and its I/O capacities (disk, network) are sufficient. If these basic conditions are met, the system might benefit from tuning.

1.4 Step-by-step Tuning

Make sure to carefully plan the tuning itself. It is of vital importance to only do one step at a time.
Part II. System Monitoring
2 System Monitoring Utilities

There are a number of programs, tools, and utilities which you can use to examine the status of your system. This chapter introduces some of them and describes their most important and frequently used parameters.

For each of the described commands, examples of the relevant outputs are presented. In the examples, the first line is the command itself (after the > or # prompt). Omissions are indicated with square brackets ([...]) and long lines are wrapped where necessary.
of the system at a glance. Use these tools first in order to get an overview and find out which part of the system to examine further. 2.1.1 vmstat vmstat collects information about processes, memory, I/O, interrupts and CPU. If called without a sampling rate, it displays average values since the last reboot. When called with a sampling rate, it displays actual samples: Example 2.
b Shows the number of processes waiting for a resource other than a CPU. A high number in this column may indicate an I/O problem (network or disk). swpd The amount of swap space (KB) currently used. free The amount of unused memory (KB). inact Recently unused memory that can be reclaimed. This column is only visible when calling vmstat with the parameter -a (recommended). active Recently used memory that normally does not get reclaimed.
in
Interrupts per second. A high value indicates a high I/O level (network and/or disk).

cs
Number of context switches per second. Put simply, this means that the kernel has to replace the executable code of one program in memory with that of another program.

us
Percentage of CPU usage from user processes.

sy
Percentage of CPU usage from system processes.

id
Percentage of CPU time spent idling. If this value is zero over a longer period of time, your CPU(s) are working to full capacity.

wa
Percentage of CPU time spent waiting for I/O.

st
Percentage of CPU time stolen from a virtual machine.
the fly or query existing reports gathered by the system activity data collector (sadc). sar and sadc both gather all their data from the /proc file system.

NOTE: sysstat Package

sar and sadc are part of the sysstat package. You need to install the package either with YaST, or with zypper in sysstat.

2.1.2.1 Automatically Collecting Daily Statistics With sadc

If you want to monitor your system over a longer period of time, use sadc to automatically collect the data.
and count. If filename, interval and count are not specified, sar attempts to generate a report from /var/log/sa/saDD, where DD stands for the current day. This is the default location where sadc writes its data. Query multiple files with multiple -f options.
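For example, to combine the CPU statistics recorded on the 14th and the 15th day of the month into one report (the file names follow the default saDD naming scheme and are assumptions):

sar -u -f /var/log/sa/sa14 -f /var/log/sa/sa15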
            kbmemfree kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
16:12:32       259320   1796356     87.39      20808     72660   2229080    62.06
16:12:42       381096   1674580     81.46      21084     75460   2328192    64.82
16:12:52       642668   1413008     68.74      21392     81212   1938820    53.98
16:13:02       311984   1743692     84.82      21712     84040   2212024    61.58
Average:       428651   1627025     79.15      21104     75515   2209280    61.51

The last two columns (kbcommit and %commit) show an approximation of the total amount of memory (RAM plus swap) the current workload would need in the worst case (in kilobytes or percent, respectively).
16:28:51   scd0    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00

16:28:51   DEV      tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:29:01   sdc    32.47    876.72    647.35     46.94      0.33  10.20   3.67  11.91
16:29:01   scd0    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00

16:29:01   DEV      tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:29:11   sdc    48.75   2852.45    366.77     66.04      0.82  16.93   4.91  23.94
16:29:11   scd0    0.00      0.00      0.00      0.00      0.00   0.00   0.00   0.00

16:29:11   DEV      tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:29:21   sdc    13.[...]
16:29:21   scd0   [...]
2.2 System Information 2.2.1 Device Load Information: iostat iostat monitors the system device loading. It generates reports that can be useful for better balancing the load between physical disks attached to your system. The first iostat report shows statistics collected since the system was booted. Subsequent reports cover the time since the previous report. tux@mercury:~> iostat Linux 2.6.32.7-0.
With the -P option, you can specify the number of processors to be reported (note that 0 is the first processor). The timing arguments work the same way as with the iostat command. Entering mpstat -P 1 2 5 prints five reports for the second processor (number 1) at 2 second intervals.

tux@mercury:~> mpstat -P 1 2 5
Linux 2.6.32.7-0.2-default (geeko@buildhost)

08:57:10   CPU    %usr  %guest  %idle
08:57:12     1    4.46    0.00  89.11
08:57:14     1    1.98    0.00  93.07
08:57:16     1    2.50    0.00  93.50
08:57:18     1   14.36    0.00  83.[...]
2.2.4 Kernel Ring Buffer: dmesg The Linux kernel keeps certain messages in a ring buffer. To view these messages, enter the command dmesg: tux@mercury:~> dmesg [...] end_request: I/O error, dev fd0, sector 0 subfs: unsuccessful attempt to mount media (256) e100: eth0: e100_watchdog: link up, 100Mbps, half-duplex NET: Registered protocol family 17 IA-32 Microcode Update Driver: v1.14 microcode: CPU0 updated from revision 0xe to 0x2e, date = 08112004 IA-32 Microcode Update Driver v1.
useful. However, the list of all files can be combined with search functions to generate useful lists.
UEVENT[1138806687] UEVENT[1138806687] UEVENT[1138806687] UEVENT[1138806687] UDEV [1138806687] UDEV [1138806687] UDEV [1138806687] UDEV [1138806687] UEVENT[1138806692] UEVENT[1138806692] UEVENT[1138806692] UEVENT[1138806692] UDEV [1138806693] UDEV [1138806693] UDEV [1138806693] UDEV [1138806693] UEVENT[1138806694] UDEV [1138806694] UEVENT[1138806694] UEVENT[1138806697] add@/devices/pci0000:00/0000:00:1d.7/usb4/4-2/4-2.2 add@/devices/pci0000:00/0000:00:1d.7/usb4/4-2/4-2.2/4-2.
------ Shared Memory Segments --------
key         shmid      owner   perms   bytes    nattch  status
0x00000000  83984391   tux     666     282464   2       dest
0x00000000  84738056   root    644     151552   2

------ Semaphore Arrays --------
key         semid   owner   perms   nsems
0x4d038abf  0       tux     600     8

------ Message Queues --------
key         msqid   owner   perms   used-bytes   messages

2.3.2 Process List: ps

The command ps produces a list of processes. Most parameters must be written without a minus sign. Refer to ps --help for a brief help or to the man page for extensive help.
   3      0  [events/0]
   4      0  [khelper]
   5      0  [kthread]
  11      0  [kblockd/0]
  12      0  [kacpid]
 472      0  [pdflush]
 473      0  [pdflush]
[...]
4028  17556  nautilus --no-default-window --sm-client-id default2
4118  17800  ksnapshot
4114  19172  sound-juicer
4023  25144  gnome-panel --sm-client-id default1

Useful ps Calls

ps aux --sort column
Sort the output by column.
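For example, to list the processes that consume the most resident memory first, sort by the rss column in descending order:

tux@mercury:~> ps aux --sort=-rss | head -5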
| `-hald-addon-stor |-kded |-kdeinit-+-kdesu---su---kdesu_stub---yast2---y2controlcenter | |-kio_file | |-klauncher | |-konqueror | |-konsole-+-bash---su---bash | | `-bash | `-kwin |-kdesktop---kdesktop_lock---xmatrix |-kdesud |-kdm-+-X | `-kdm---startkde---kwrapper [...] The parameter -p adds the process ID to a given name. To have the command lines displayed as well, use the -a parameter: 2.3.
1746 1752 2151 2165 2166 2171 2235 2289 2403 2709 2714 root root root messageb root root root root root root root 15 15 16 16 15 16 15 16 23 19 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S 0 0 0 S 1464 496 416 S 3340 1048 792 S 1840 752 556 S 1600 516 320 S 1736 800 652 S 4192 2852 1444 S 1756 600 524 S 2668 1076 944 S 1756 648 564 S 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 0.1 0.1 0.2 0.6 0.1 0.2 0.1 0:00.00 0:00.00 0:00.00 0:00.64 0:00.01 0:00.00 0:00.10 0:02.05 0:00.00 0:00.00 0:00.
Output for the sys_list window under LPAR:

12:30:48 | CPU-T: IFL(18) CP(3) UN(3)
system    #cpu     cpu     mgm     Cpu+     Mgm+
(str)     (#)      (%)     (%)     (hm)     (hm)
H05LP30     10   461.14   10.18  1547:41    8:15
H05LP33      4   133.73    7.57   220:53    6:12
H05LP50      4    99.26    0.01   146:24    0:12
H05LP02      1    99.09    0.00   269:57    0:00
TRX2CFA      1     2.14    0.03     3:24    0:04
H05LP13      6     1.36    0.34     4:23    0:54
TRX1        19     1.22    0.14    13:57    0:22
TRX2        20     1.16    0.11    26:05    0:25
H05LP55      2     0.00    0.00     0:22    0:00
H05LP56      3     0.00    0.00     0:00    0:00
           413   823.39   23.[...]
8   IFL    0.00 |   0.00 |  534.79 |
9   IFL    0.00 |   0.00 |   10.78 |

Output for the sys window under z/VM:

15:46:57 | T6360003 | CPU-T: UN(16) | ? = help
cpuid   cpu       visual
(#)     (%)       (vis)
0       548.72    |#########################################|

2.3.6 A top-like I/O Monitor: iotop

The iotop utility displays a table of I/O usage by processes or threads.

TIP

iotop is not installed by default. You need to install it manually with zypper in iotop as root.
2.3.7 Modify a process' niceness: nice and renice The kernel determines which processes require more CPU time than others by the process' nice level, also called niceness. The higher the “nice” level of a process is, the less CPU time it will take from other processes. Nice levels range from -20 (the least “nice” level) to 19. Negative values can only be set by root.
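A minimal example, assuming a long-running archive job and an already running process with PID 1234:

tux@mercury:~> nice -n 10 tar czf backup.tar.gz /home/tux
tux@mercury:~> renice -n 5 -p 1234

The first command starts the archive job with a lowered priority (nice level 10); the second lowers the priority of an already running process. Remember that only root may assign negative nice levels.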
[...]
Swap:      2104472          0    2104472

The options -b, -k, -m, -g show the output in bytes, KB, MB, or GB, respectively. The parameter -d delay ensures that the display is refreshed every delay seconds. For example, free -d 1.5 produces an update every 1.5 seconds.

2.4.2 Detailed Memory Usage: /proc/meminfo

Use /proc/meminfo to get more detailed information on memory usage than with free. Actually, free uses some of the data from this file. See an example output from a 64-bit system below.
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     2689024 kB
DirectMap2M:     5691392 kB

The most important entries are:

MemTotal
Total amount of usable RAM

MemFree
Total amount of unused RAM

Buffers
File buffer cache in RAM

Cached
Page cache (excluding buffer cache) in RAM

SwapCached
Page cache in swap

Active
Recently used memory that normally is not reclaimed.
Mapped
Memory claimed with the mmap system call

Slab
Kernel data structure cache

SReclaimable
Reclaimable slab caches (inode, dentry, etc.)

Committed_AS
An approximation of the total amount of memory (RAM plus swap) the current workload needs in the worst case.

2.4.3 Process Memory Usage: smaps

Determining exactly how much memory a certain process is consuming is not possible with standard tools like top or ps. Use the smaps subsystem, introduced in Kernel 2.6.14, if you need exact data.
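For example, to approximate the real memory footprint of a single process, sum up the proportional set size (Pss) values of its mappings (the PID 1234 is an assumption; Pss is reported by kernels 2.6.25 and newer):

tux@mercury:~> grep '^Pss:' /proc/1234/smaps | awk '{ sum += $2 } END { print sum " kB" }'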
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:68562268 errors:0 dropped:4609817 overruns:0 frame:0 TX packets:113273547 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:5375024474 (5126.0 Mb) TX bytes:321602834105 (306704.3 Mb) 2.5.2 Ethernet Cards in Detail: ethtool ethtool can display and change detailed aspects of your ethernet network device. By default it prints the current setting of the specified device.
2.5.3 Show the Network Status: netstat netstat shows network connections, routing tables (-r), interfaces (-i), masquerade connections (-M), multicast memberships (-g), and statistics (-s). tux@mercury:~> netstat -r Kernel IP routing table Destination Gateway 192.168.2.0 * link-local * loopback * default 192.168.2.254 Genmask 255.255.254.0 255.255.0.0 255.0.0.0 0.0.0.
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0

2.5.4 Interactive Network Monitor: iptraf

The iptraf utility is a menu-based Local Area Network (LAN) monitor. It generates network statistics, including TCP and UDP counts, Ethernet load information, IP checksum errors and others.

TIP

iptraf is not installed by default. Install it with zypper in iptraf as root. If you enter the command without any option, it runs in an interactive mode.
Mon Mar 23 10:08:02 2010; \ 239.255.255.253:427 Mon Mar 23 10:08:02 2010; 224.0.0.18 Mon Mar 23 10:08:03 2010; 224.0.0.18 Mon Mar 23 10:08:03 2010; 224.0.0.18 [...] Mon Mar 23 10:08:06 2010; 10.20.7.255:111 Mon Mar 23 10:08:06 2010; 10.20.7.255:8765 Mon Mar 23 10:08:06 2010; \ 10.20.7.255:111 Mon Mar 23 10:08:06 2010; 224.0.0.18 --More--(7%) UDP; eth0; 107 bytes; from 192.168.1.192:33157 to VRRP; eth0; 46 bytes; from 192.168.1.252 to \ VRRP; eth0; 46 bytes; from 192.168.1.
 10:       0   XT-PIC  uhci_hcd:usb3
 11:   71772   XT-PIC  uhci_hcd:usb2, eth0
 12:  101150   XT-PIC  i8042
 14:   33146   XT-PIC  ide0
 15:  149202   XT-PIC  ide1
NMI:       0
LOC:       0
ERR:       0
MIS:       0

Some of the important files and their contents are:

/proc/devices
Available devices

/proc/modules
Kernel modules loaded

/proc/cmdline
Kernel command line

/proc/meminfo
Detailed information about memory usage

/proc/config.
sw irq:    1:15:35.25      0.4%
idle  :  9d 16:07:56.79   73.[...]
uptime:  6d 13:07:11.14
2.7 Hardware Information 2.7.1 PCI Resources: lspci NOTE: Accessing PCI configuration. Most operating systems require root user privileges to grant access to the computer's PCI configuration. The command lspci lists the PCI resources: mercury:~ # lspci 00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE \ DRAM Controller/Host-Hub Interface (rev 01) 00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE \ Host-to-AGP Bridge (rev 01) 00:1d.
Capabilities: [dc] Power Management version 2 Capabilities: [e4] PCI-X non-bridge device Kernel driver in use: e1000 Kernel modules: e1000 Information about device name resolution is obtained from the file /usr/share/ pci.ids. PCI IDs not listed in this file are marked “Unknown device.” The parameter -vv produces all the information that could be queried by the program. To view the pure numeric values, use the parameter -n. 2.7.2 USB Devices: lsusb The command lsusb lists all USB devices.
tux@mercury:~> file /usr/share/man/man1/file.1.gz /usr/share/man/man1/file.1.gz: gzip compressed data, from Unix, max compression tux@mercury:~> file -z /usr/share/man/man1/file.1.gz /usr/share/man/man1/file.1.gz: troff or preprocessor input text \ (gzip compressed data, from Unix, max compression) The parameter -i outputs a mime type string rather than the traditional description. tux@mercury:~> file -i /usr/share/misc/magic /usr/share/misc/magic: text/plain charset=utf-8 2.8.
tux@mercury:~> du -sh /opt 192M /opt 2.8.3 Additional Information about ELF Binaries Read the content of binaries with the readelf utility.
File: "/etc/profile" ID: d4fb76e70b4d1746 Namelen: 255 Type: ext2/ext3 Block size: 4096 Fundamental block size: 4096 Blocks: Total: 2581445 Free: 1717327 Available: 1586197 Inodes: Total: 655776 Free: 490312 2.9 User Information 2.9.1 User Accessing Files: fuser It can be useful to determine what processes or users are currently accessing certain files. Suppose, for example, you want to unmount a file system mounted at /mnt. umount returns "device is busy.
2.10.1 Time Measurement with time

Determine the time spent by commands with the time utility. This utility is available in two versions: as a shell built-in and as a program (/usr/bin/time).

tux@mercury:~> time find . > /dev/null

real    0m4.051s
user    0m0.042s
sys     0m0.205s

real is the wall-clock time that elapsed from the command's start-up until it finished, user is the CPU time of the user as reported by the times system call, and sys is the CPU time of the system as reported by the times system call.
As mentioned above, RRDtool is designed to work with data that change in time. The ideal case is a sensor which repeatedly reads measured data (like temperature, speed etc.) in constant periods of time, and then exports them in a given format. Such data are perfectly ready for RRDtool, and it is easy to process them and create the desired output. Sometimes it is not possible to obtain the data automatically and regularly.
INTERVAL=4
while true ; do
  # Unix time stamp of this measurement
  DATE=`date +%s`
  # free memory in bytes (the exact extraction is an assumption;
  # the original listing was truncated here)
  FREEMEM=`free -b | grep Mem | awk '{ print $4 }'`
  sleep $INTERVAL
  echo "rrdtool update free_mem.rrd $DATE:$FREEMEM"
done

Points to Notice

• The time interval is set to 4 seconds, and is implemented with the sleep command.

• RRDtool accepts time information in a special format - so called Unix time. It is defined as the number of seconds since the midnight of January 1, 1970 (UTC). For example, 1272907114 represents 2010-05-03 17:18:34.

• The free memory information is reported in bytes with free -b.
rrdtool create free_mem.rrd --start 1272974834 --step=4 \ DS:memory:GAUGE:600:U:U RRA:AVERAGE:0.5:1:24 Points to Notice • This command creates a file called free_mem.rrd for storing our measured values in a Round Robin type database. • The --start option specifies the time (in Unix time) when the first value will be added to the database. In this example, it is one less than the first time value of the free_mem.sh output (1272974835).
2.11.2.4 Viewing Measured Values We have already measured the values, created the database, and stored the measured value in it. Now we can play with the database, and retrieve or view its values. To retrieve all the values from our database, enter the following on the command line: tux@mercury:~> rrdtool fetch free_mem.rrd AVERAGE --start 1272974830 \ --end 1272974871 memory 1272974832: nan 1272974836: 1.1729059840e+09 1272974840: 1.1461806080e+09 1272974844: 1.0807572480e+09 1272974848: 1.
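The graph itself is created with rrdtool graph. The following command is a sketch that matches the notes below; the exact time stamps, the line width, and the color are assumptions:

tux@mercury:~> rrdtool graph free_mem.png \
--start 1272974835 --end 1272974871 \
--step=4 \
DEF:free_memory=free_mem.rrd:memory:AVERAGE \
LINE2:free_memory#FF0000 \
--vertical-label "free memory [bytes]"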
Points to Notice • free_mem.png is the filename of the graph to be created. • --start and --end limit the time range within which the graph will be drawn. • --step specifies the time resolution (in seconds) of the graph. • The DEF:... part is a data definition called free_memory. Its data are read from the free_mem.rrd database and its data source called memory. The average value points are calculated, because no others were defined in Section 2.11.2.2, “Creating Database” (page 46). • The LINE...
Apart from RRDtool's man page (man 1 rrdtool), which gives you only basic information, you should have a look at the RRDtool home page [http://oss.oetiker.ch/rrdtool/]. There is a detailed documentation [http://oss.oetiker.ch/rrdtool/doc/index.en.html] of the rrdtool command and all its sub-commands. There are also several tutorials [http://oss.oetiker.ch/rrdtool/tut/index.en.html] to help you understand the common RRDtool workflow.
3 Monitoring with Nagios

Nagios is a stable, scalable and extensible enterprise-class network and system monitoring tool which allows administrators to monitor network and host resources such as HTTP, SMTP, POP3, disk usage and processor load. Originally Nagios was designed to run under Linux, but it can also be used on several UNIX operating systems. This chapter covers the installation and parts of the configuration of Nagios (http://www.nagios.org/).
For further information on how to install packages see:

• Section “Using Zypper” (Chapter 6, Managing Software with Command Line Tools, ↑Administration Guide)

• Section “Installing and Removing Packages or Patterns” (Chapter 9, Installing or Removing Software, ↑Deployment Guide)

Both methods install the packages nagios and nagios-www. The latter RPM package contains a Web interface for Nagios which allows, for example, to view the service status and the problem history.
3.3.1 Object Definition Files In addition to those configuration files Nagios comes with very flexible and highly customizable configuration files called Object Definition configuration files. Those configuration files are very important since they define the following objects: • Hosts • Services • Contacts The flexibility lies in the fact that objects are easily enhanceable. Imagine you are responsible for a host with only one service running.
with the max_check_attempts directive. All configuration flags beginning with notification handle how Nagios should behave when a failure of a monitored service occurs. In the host definition above, Nagios notifies the administrators only during working hours. However, this can be adjusted with notification_period. According to notification_interval, notifications will be resent every two hours. notification_options contains four different flags: d, u, r and n.
the person who is contacted on a failure of a service. Usually this is the responsible administrator. use inherits configuration values from the generic-contact definition. An overview of all Nagios objects and further information about them can be found at: http://nagios.sourceforge.net/docs/3_0/ objectdefinitions.html. 3.4 Configuring Nagios Learn step-by-step how to configure Nagios to monitor different things like remote services or remote host-resources. 3.4.
       max_check_attempts      10
       contact_groups          admins
       notification_interval   60
       notification_options    d,u,r
}

5 Insert a service object in services.cfg:

define service {
       use                     generic-service
       host_name               host.name.com
       service_description     HTTP
       contact_groups          router-admins
       check_command           check_http
}

6 Insert a contact and contactgroup object in contacts.cfg:
3.4.2 Monitoring Remote Host-Resources with Nagios This section explains how to monitor remote host resources with Nagios. Proceed as follows on the Nagios server: Procedure 3.2: Monitoring a Remote Host Resource with Nagios (Server) 1 Install nagios-nsca (for example, zypper in nagios-nsca). 2 Set the following options in /etc/nagios/nagios.cfg: check_external_commands=1 accept_passive_service_checks=1 accept_passive_host_checks=1 command_file=/var/spool/nagios/nagios.
5 Execute rcnagios restart and rcnsca restart. Proceed as follows on the client you want to monitor: Procedure 3.3: Monitoring a Remote Host Resource with Nagios (client) 1 Install nagios-nsca-client on the host you want to monitor. 2 Write your test scripts (for example a script that checks the disk usage) like this: #!/bin/bash NAGIOS_SERVER=10.10.4.
(Return code of 127 is out of bounds - plugin may be missing) Make sure that you have installed nagios-plugins. E-mail notification does not work Make sure that you have installed and configured a mail server like postfix or exim correctly. You can verify if your mail server works with echo "Mail Server Test!" | mail foo@bar.com which sends an e-mail to foo@bar.com. If this e-mail arrives, your mail server is working correctly. Otherwise, check the log files of the mail server. 3.
4 Analyzing and Managing System Log Files

System log file analysis is one of the most important tasks when analyzing the system. In fact, looking at the system log files should be the first thing to do when maintaining or troubleshooting a system. SUSE Linux Enterprise Server automatically logs almost everything that happens on the system in detail. Normally, system log files are written in plain text and therefore can be easily read using an editor or pager.
audit Logs from the audit framework. See Part “The Linux Audit Framework” (↑Security Guide) for details. boot.msg Log of the system init process—this file contains all boot messages from the Kernel, the boot scripts and the services started during the boot sequence. Check this file to find out whether your hardware has been correctly initialized or all services have been started successfully. boot.
mail* Mail server (postfix, sendmail) logs. messages This is the default place where all Kernel and system log messages go and should be the first place (along with /var/log/warn) to look at in case of problems. NetworkManager NetworkManager log files news/* Log messages from a news server. ntp Logs from the Network Time Protocol daemon (ntpd). pk_backend_zypp PackageKit (with libzypp backend) log files. puppet/* Log files from the data center automation tool puppet.
Xorg.0.log X startup log file. Refer to this in case you have problems starting X. Copies from previous X starts are numbered Xorg.?.log. YaST2/* All YaST log files. zypp/* libzypp log files. Refer to these files for the package installation history. zypper.log Logs from the command line installer zypper. 4.2 Viewing and Parsing Log Files To view log files, you can use your favorite text editor.
logrotate is usually run as a daily cron job. It does not modify any log files more than once a day unless the log is to be modified because of its size, because logrotate is being run multiple times a day, or the --force option is used. The main configuration file of logrotate is /etc/logrotate.conf. System packages as well as programs that produce log files (for example, apache2) put their own configuration files in the /etc/logrotate.d/ directory, which is included from /etc/logrotate.conf.
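A typical per-package configuration file looks like the following sketch (the log file path and the rotation values are assumptions):

/var/log/myapp.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}

This rotates /var/log/myapp.log once a week, keeps four compressed old copies, and silently skips the file if it is missing or empty.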
4.4 Monitoring Log Files with logwatch logwatch is a customizable, pluggable log-monitoring script. It parses system logs, extracts the important information and presents them in a human readable manner. To use logwatch, install the logwatch package. logwatch can either be used at the command-line to generate on-the-fly reports, or via cron to regularly create custom reports. Reports can either be printed on the screen, saved to a file, or be mailed to a specified address.
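For example, to print a detailed on-the-fly report for the SSH service covering the current day (the service name and range values are assumptions):

logwatch --service sshd --range today --detail high --print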
ignore.conf Filter for all lines that should globally be ignored by logwatch. services/*.conf The service directory holds configuration files for each service you can generate a report for. logfiles/*.conf Specifications on which log files should be parsed for each service. 4.5 Using logger to Make System Log Entries logger is a tool for making entries in the system log. It provides a shell command interface to the syslog(3) system log module.
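For example, the following command writes a tagged message with a given facility and priority to the system log, where it typically shows up in /var/log/messages (the tag and message text are arbitrary):

tux@mercury:~> logger -t backup -p local0.info "nightly backup finished"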
Part III. Kernel Monitoring
5 SystemTap—Filtering and Analyzing System Data

SystemTap provides a command line interface and a scripting language to examine the activities of a running Linux system, particularly the kernel, in fine detail. SystemTap scripts are written in the SystemTap scripting language, are then compiled to C-code kernel modules and inserted into the kernel. The scripts can be designed to extract, filter and summarize data, thus allowing the diagnosis of complex performance problems or functional problems.
5.1.1 SystemTap Scripts

SystemTap usage is based on SystemTap scripts (*.stp). They tell SystemTap which type of information to collect, and what to do once that information is collected. The scripts are written in the SystemTap scripting language that is similar to AWK and C. For the language definition, see http://sourceware.org/systemtap/langref/. The essential idea behind a SystemTap script is to name events, and to give them handlers. When SystemTap runs the script, it monitors for certain events.
5.1.3 Commands and Privileges The main commands associated with SystemTap are stap and staprun. To execute them, you either need root privileges or must be a member of the stapdev or stapusr group. stap SystemTap front-end. Runs a SystemTap script (either from file, or from standard input). It translates the script into C code, compiles it, and loads the resulting kernel module into a running Linux kernel. Then, the requested system trace or probe functions are performed. staprun SystemTap back-end.
/usr/share/systemtap/tapset/ Holds the standard library of tapsets. /usr/share/doc/packages/systemtap/examples Holds a number of example SystemTap scripts for various purposes. Only available if the systemtap-docs package is installed. ~/.systemtap/cache Data directory for cached SystemTap files. /tmp/stap* Temporary directory for SystemTap files, including translated C code and kernel object. 5.
• kernel-*-devel • kernel-source-* • gcc To get access to the man pages and to a helpful collection of example SystemTap scripts for various purposes, additionally install the systemtap-docs package. To check if all packages are correctly installed on the machine and if SystemTap is ready to use, execute the following command as root. stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}' It probes the currently used kernel by running a script and returning an output.
In case any error messages appear during the test, check the output for hints about any missing packages and make sure they are installed correctly. Rebooting and loading the appropriate kernel may also be needed.

5.3 Script Syntax

SystemTap scripts consist of the following two components:

SystemTap Events (Probe Points) (page 77)
Name the kernel events at which the associated handler should be executed.
exit ()
}

Start of the probe. Event begin (the start of the SystemTap session). Start of the handler definition, indicated by {. First function defined in the handler: the printf function. String to be printed by the printf function, followed by a line break (\n). Second function defined in the handler: the exit() function. Note that the SystemTap script will continue to run until the exit() function executes. If you want to stop the execution of the script before that, stop it manually by pressing Ctrl + C.
• Synchronous events: Occur when any process executes an instruction at a particular location in kernel code. This gives other events a reference point (instruction address) from which more contextual data may be available. An example for a synchronous event is vfs.file_operation: The entry to the file_operation event for Virtual File System (VFS). For example, in Section 5.2, “Installation and Setup” (page 74), read is the file_operation event used for VFS.
by a name. They take any number of string or numeric arguments (by value) and may return a single string or number.

function function_name(arguments) {statements}
probe event {function_name(arguments)}

The statements in function_name are executed when the probe for event executes. The arguments are optional values passed into the function. Functions can be defined anywhere in the script. One of the functions needed very often was already introduced in Example 5.
tid() ID of the current thread. pid() Process ID of the current thread. uid() ID of the current user. cpu() Current CPU number. execname() Name of the current process. gettimeofday_s() Number of seconds since UNIX epoch (January 1, 1970). ctime() Convert time into a string. pp() String describing the probe point currently being handled. thread_indent() Useful function for organizing print results. It (internally) stores an indentation counter for each thread (tid()).
For more information about supported SystemTap functions, refer to the stapfuncs man page.

5.3.3.2 Other Basic Constructs

Apart from functions, you can use several other common constructs in SystemTap handlers, including variables, conditional statements (like if/else), while loops, for loops, arrays, and command line arguments.

Variables

Variables may be defined anywhere in the script.
hardware platform). With the global statement it is possible to use the variables count_jiffies and count_ms also in the probe timer.ms(12345). With ++ the value of a variable is incremented by 1.

Conditional Statements

There are a number of conditional statements that you can use in SystemTap scripts.
!= : Is not equal to
>= : Is greater than or equal to
<= : Is less than or equal to

5.4 Example Script

If you have installed the systemtap-docs package, you can find a number of useful SystemTap example scripts in /usr/share/doc/packages/systemtap/examples. This section describes a rather simple example script in more detail: /usr/share/doc/packages/systemtap/examples/network/tcp_connections.stp.

Example 5.5: Monitoring Incoming TCP Connections with tcp_connections.stp
• IP address from which the TCP connection originated (IP_SOURCE)

To run the script, execute

stap /usr/share/doc/packages/systemtap/examples/network/tcp_connections.stp

and follow the output on the screen. To manually stop the script, press Ctrl + C.

5.5 User-Space Probing

For debugging user-space applications (like DTrace can do), SUSE Linux Enterprise Server 11 SP3 supports user-space probing with SystemTap: Custom probe points can be inserted in any user-space application.
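As a minimal sketch, the following one-liner probes a function in a user-space binary (the binary path and function name are assumptions, and the target must provide debug information):

stap -e 'probe process("/bin/ls").function("main") { printf("main() entered\n") }' -c /bin/ls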
5.6 For More Information This chapter only provides a short SystemTap overview. Refer to the following links for more information about SystemTap: http://sourceware.org/systemtap/ SystemTap project home page. http://sourceware.org/systemtap/wiki/ Huge collection of useful information about SystemTap, ranging from detailed user and developer documentation to reviews and comparisons with other tools, or Frequently Asked Questions and tips.
6 Kernel Probes

Kernel probes are a set of tools to collect Linux kernel debugging and performance information. Developers and system administrators usually use them either to debug the kernel, or to find system performance bottlenecks. The reported data can then be used to tune the system for better performance. You can insert these probes into any kernel routine, and specify a handler to be invoked after a particular break-point is hit.
6.1 Supported Architectures Kernel probes are fully implemented on the following architectures: • i386 • x86_64 (AMD-64, EM64T) • ppc64 • arm • ppc Kernel probes are partially implemented on the following architectures: • ia64 (does not support probes on instruction slot1) • sparc64 (return probes not yet implemented) 6.2 Types of Kernel Probes There are three types of kernel probes: kprobes, jprobes, and kretprobes. Kretprobes are sometimes referred to as return probes.
6.2.2 Jprobe Jprobe is implemented through the kprobe mechanism. It is inserted on a function's entry point and allows direct access to the arguments of the function which is being probed. Its handler routine must have the same argument list and return value as the probed function. It also has to end by calling the jprobe_return() function. When jprobe is hit, the processor registers are saved, and the instruction pointer is directed to the jprobe handler routine.
register_jprobe()
Inserts a break-point at the specified address. The address has to be the address of the first instruction of the probed function. When the break-point is hit, the specified handler is run. The handler should have the same argument list and return type as the probed function.

register_kretprobe()
Inserts a return probe for the specified function. When the probed function returns, a specified handler is run. This function returns 0 on success, or a negative error number on failure.
6.4.1 How to List Registered Kernel Probes The list of all currently registered kprobes is in the /sys/kernel/de bug/kprobes/list file. saturn.example.com:~ # cat /sys/kernel/debug/kprobes/list c015d71a k vfs_read+0x0 [DISABLED] c011a316 j do_fork+0x0 c03dedc5 r tcp_v4_rcv+0x0 The first column lists the address in the kernel where the probe is inserted. The second column prints the type of the probe: k for kprobe, j for jprobe, and r for return probe.
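The same debugfs directory also provides a global switch for all registered probes. Assuming the standard kprobes debugfs interface, all probes can be turned off and back on as follows (as root):

echo 0 > /sys/kernel/debug/kprobes/enabled
echo 1 > /sys/kernel/debug/kprobes/enabled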
• Thorough but more technically oriented information about kernel probes is in /usr/src/linux/Documentation/kprobes.txt (package kernel-source).

• Examples of all three types of probes (together with related Makefile) are in the /usr/src/linux/samples/kprobes/ directory (package kernel-source).

• In-depth information about Linux kernel modules and the printk kernel routine is in The Linux Kernel Module Programming Guide [http://tldp.org/LDP/lkmpg/2.6/html/lkmpg.
7 Perfmon2—Hardware-Based Performance Monitoring

Perfmon2 is a standardized, generic interface to access the performance monitoring unit (PMU) of a processor. It is portable across all PMU models and architectures, supports system-wide and per-thread monitoring, counting and sampling.

7.1 Conceptual Overview

The following subsections give you a brief overview about Perfmon.
Figure 7.1: Architecture of perfmon2

(The figure shows the layering: the pfmon tool in userspace; the generic and the architecture-specific perfmon code in the Linux kernel; and the PMU of the CPU in hardware.)

Each PMU model consists of a set of registers: the performance monitor configuration (PMC) and the performance monitor data (PMD). Only PMCs are writeable, but both can be read. These registers store configuration information and data.

7.1.2 Sampling and Counting

Perfmon2 supports two modes where you can run your profiling: sampling or counting.
1306604 CPU_OP_CYCLES_ALL

The following command gives the count of a specific function and the percentage of the total cycles:

pfmon --no-cmd-output --short-smpl-periods=100000 bin/ls
# results for [28119:28119<-[28102]] (/bin/ls)
# total samples          : 12
# total buffer overflows : 0
#
# event00
#   counts   %self    %cum  code addr
         1   8.33%   8.33%  0x2000000000007180
         1   8.33%  16.67%  0x20000000000195a0
         1   8.33%  25.00%  0x2000000000019260
         1   8.33%  33.33%  0x2000000000014e60
         1   8.33%  41.67%  0x20000000001f38c0
         1   8.33%  50.
Model       Processor
[...]       9000/9100 (Montecito, Montvale) and Generic
AMD X86     Opteron (K8, fam 10h)
Intel X86   Intel P6 (Pentium II, Pentium Pro, Pentium III, Pentium M); Yonah (Core Duo, Core Solo); Netburst (Pentium 4, Xeon); Core (Merom, Penryn, Dunnington) Core 2 and Quad; Atom; Nehalem; architectural perfmon v1, v2, v3

Install the following packages depending on your architecture:

Table 7.2: Needed Packages

Architecture   Packages
ia64           pfmon
ALAT_CAPACITY_MISS_ALL ALAT_CAPACITY_MISS_FP ALAT_CAPACITY_MISS_INT BACK_END_BUBBLE_ALL BACK_END_BUBBLE_FE BACK_END_BUBBLE_L1D_FPU_RSE ... CPU_CPL_CHANGES_ALL CPU_CPL_CHANGES_LVL0 CPU_CPL_CHANGES_LVL1 CPU_CPL_CHANGES_LVL2 CPU_CPL_CHANGES_LVL3 CPU_OP_CYCLES_ALL CPU_OP_CYCLES_QUAL CPU_OP_CYCLES_HALTED DATA_DEBUG_REGISTER_FAULT DATA_DEBUG_REGISTER_MATCHES DATA_EAR_ALAT ...
pfmon -v --system-wide
...
selected CPUs (2 CPU in set, 2 CPUs online): CPU0 CPU1

2 Delimit your session. The following list describes options which are used in the examples below (refer to the man page for more details):

-e/--events
Profile only selected events. See Section 7.3.1, “Getting Event Information” (page 96) for how to get a list.

--cpu-list
Specifies the list of processors to monitor. Without this option, all available processors are monitored.
pfmon --cpu-list=0-1 --system-wide -u -e CPU_OP_CYCLES_ALL,IA64_INST_RETIRED -- ls -l /dev/null crw-rw-rw- 1 root root 1, 3 27.
The data is located under /sys/kernel/debug/perfmon/ and organized per CPU. Each CPU contains a set of metrics, accessible as ASCII files. The following data is taken from /usr/src/linux/Documentation/perfmon2-debugfs.txt:

Table 7.3:
File                   Description
[...]                  is called (used for timeout-based set switching)
handle_work_count      Number of times the pfm_handle_work() routine is called
ovfl_intr_all_count    Number of PMU interrupts received by the kernel
ovfl_intr_nmi_count    Number of non-maskable interrupts (NMI) received by the kernel from perfmon (only for X86 hardware)
ovfl_intr_ns           Number of nanoseconds spent in the perfmon2 PMU interrupt handler routine
File                   Description
reset_pmds_count       Number of times pfm_reset_pmds() is called
set_switch_count       Number of event set switches
set_switch_ns          Number of nanoseconds spent in the set switching routine

Average cost of switching sets = set_switch_ns / set_switch_count

This might be useful to compare your metrics before and after the perfmon run. For example, collect your data first:

for i in /sys/kernel/debug/perfmon/cpu0/*; do
  echo "$i:"
  cat $i
done >> pfmon-before.txt
Chapter 8, OProfile—System-Wide Profiler (page 105) Consult this chapter for other performance optimizations.
8 OProfile—System-Wide Profiler

OProfile is a profiler for dynamic program analysis. It investigates the behaviour of a running program and gathers information. This information can be viewed and gives hints for further optimizations. It is not necessary to recompile or use wrapper libraries in order to use OProfile. Not even a Kernel patch is needed. Usually, when you profile an application, a small overhead is expected, depending on work load and sampling frequency.
8.2 Installation and Requirements In order to make use of OProfile, install the oprofile package. OProfile works on IA-64, AMD64, s390, and PPC64 processors. It is useful to install the *-debuginfo package for the respective application you want to profile. If you want to profile the Kernel, you need the debuginfo package as well. 8.3 Available OProfile Utilities OProfile contains several utilities to handle the profiling process and its profiled data.
Applications usually do not need to profile the Kernel, so it is better to use the --no-vmlinux option to reduce the amount of information.

8.4.1 General Steps

In its simplest form, start the daemon, collect data, stop the daemon, and create your report. This method is described in detail in the following procedure:

1 Open a shell and log in as root.
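The steps between logging in and viewing the report follow this pattern (a sketch; the profiled command is an assumption):

opcontrol --no-vmlinux    # profile without kernel symbols
opcontrol --start         # start the daemon and begin collecting samples
./my_application          # run the workload to be profiled
opcontrol --stop          # stop collecting
opcontrol --dump          # flush samples to /var/lib/oprofile/samples

Afterwards, generate a first summary with opreport: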
opreport Overflow stats not available CPU: CPU with timer interrupt, speed 0 MHz (estimated) Profiling through timer interrupt TIMER:0| samples| %| -----------------84877 98.3226 no-vmlinux ... 8 Shutdown the OProfile daemon: opcontrol --shutdown 8.4.2 Getting Event Configurations The general procedure for event configuration is as follows: 1 Use first the events CPU-CLK_UNHALTED and INST_RETIRED to find optimization opportunities. 2 Use specific events to find bottlenecks.
----------
0x4f: No unit mask
BR_MISS_PRED_RETIRED: (counter: all)
        number of mispredicted branches retired (precise) (min count: 500)

You can get the same output from opcontrol --list-events. Specify the performance counter events with the option --event. Multiple options are possible.
8.6 Generating Reports

Before generating a report, make sure OProfile has dumped your data to the /var/lib/oprofile/samples directory using the command opcontrol --dump. A report can be generated with the commands opreport or opannotate. Calling opreport without any options gives a complete summary. With an executable as an argument, retrieve profile data only from this executable. If you analyze applications written in C++, use the --demangle smart option.
/usr/share/doc/packages/oprofile/oprofile.html Contains the OProfile manual. http://developer.intel.com/ Architecture reference for Intel processors. http://www.amd.com/us-en/assets/content_type/ white_papers_and_tech_docs/22007.pdf Architecture reference for AMD Athlon/Opteron/Phenom/Turion. http://www-01.ibm.com/chips/techlib/techlib.nsf/product families/PowerPC/ Architecture reference for PowerPC64 processors in IBM iSeries, pSeries, and blade server systems.
Part IV. Resource Management
9 General System Resource Management

Tuning the system is not only about optimizing the kernel or getting the most out of your application; it begins with setting up a lean and fast system. The way you set up your partitions and file systems can influence the server's speed. The number of active services and the way routine tasks are scheduled also affect performance.

9.1 Planning the Installation

9.1.1 Partitioning

Depending on the server's range of applications and the hardware layout, the partitioning scheme can influence the machine's performance. It is, for example, possible to increase I/O performance by:
• using separate disks for the operating system, the data, and the log files

• placing a mail server's spool directory on a separate disk

• distributing the user directories of a home server between different disks

9.1.2 Installation Scope

The installation scope as such has no direct influence on the machine's performance, but a carefully chosen scope of packages nevertheless has advantages. It is recommended to install the minimum of packages needed to run the server.
with network, no X). You will still be able to start graphical applications remotely or use the startx command to start a local graphical desktop.

9.2 Disabling Unnecessary Services

The default installation starts a number of services (the number varies with the installation scope). Since each service consumes resources, it is recommended to disable the ones not needed. Run YaST > System > System Services (Runlevel) > Expert Mode to start the services management module.
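Services can also be disabled from the command line; for example (the service name is an example):

chkconfig smbfs off

Typical candidates to disable if they are not used include the following: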
smbfs
Services needed to mount SMB/CIFS file systems from a Windows server.

splash / splash_early
Shows the splash screen on start-up.

9.3 File Systems and Disk Access

Hard disks are the slowest components in a computer system and therefore often the cause of a bottleneck. Using the file system that best suits your workload helps to improve performance. Using special mount options or prioritizing a process' I/O priority are further means to speed up the system.
To turn off access time updates, mount the file system with the noatime option. To do so, either edit /etc/fstab directly, or use the Fstab Options dialog when editing or adding a partition with the YaST Partitioner.

9.3.3 Prioritizing Disk Access with ionice

The ionice command lets you prioritize disk access for single processes. This enables you to give less I/O priority to non-time-critical background processes with heavy disk access, such as backup jobs.
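For example, to run a backup job in the idle I/O scheduling class, or to lower the I/O priority of an already running process (the PID is an assumption):

ionice -c 3 tar czf backup.tar.gz /home
ionice -c 2 -n 7 -p 1234

Class 3 (idle) only grants disk access when no other process needs it; class 2 (best-effort) with priority 7 is the lowest best-effort level.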
10 Kernel Control Groups

Kernel Control Groups (abbreviated as “cgroups”) are a kernel feature that allows aggregating or partitioning tasks (processes) and all their children into hierarchically organized groups. These hierarchical groups can be configured to show a specialized behavior that helps with tuning the system to make best use of available hardware and network resources.
• Every task running in the system is in exactly one of the cgroups in the hierarchy. 10.2 Scenario See the following resource planning scenario for a better understanding (source: / usr/src/linux/Documentation/cgroups/cgroups.
Figure 10.
Web browsers such as Firefox will be part of the Web network class, while the NFS daemons such as (k)nfsd will be part of the NFS network class. On the other side, Firefox will share appropriate CPU and memory classes depending on whether a professor or student started it.
# Create a child cgroup:
mkdir /freezer/0
# Put a task into this cgroup:
echo $task_pid > /freezer/0/tasks
# Freeze it:
echo FROZEN > /freezer/0/freezer.state
# Unfreeze (thaw) it:
echo THAWED > /freezer/0/freezer.state

Checkpoint/Restart (Control)
Save the state of all processes in a cgroup to a dump file. Restart it later (or just save the state and continue). Move a “saved container” between physical machines (as a VM can do). Dump all process images of a cgroup to a file.
• Anonymous and file cache.

• No limits for kernel memory.

• Maybe in another subsystem if needed.

For more information, see /usr/src/linux/Documentation/cgroups/memory.txt.

Blkio (Resource Control)
The blkio (Block IO) controller is now available as a disk I/O controller. With the blkio controller you can currently set policies for proportional bandwidth and for throttling. These are the basic commands to configure proportional weight division of bandwidth by setting weight values in blkio.weight:
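A sketch of such a proportional-weight setup, assuming the blkio controller is mounted at /sys/fs/cgroup/blkio:

mkdir /sys/fs/cgroup/blkio/group1
# Assign a weight between 100 and 1000:
echo 500 > /sys/fs/cgroup/blkio/group1/blkio.weight
# Move the current shell into the group:
echo $$ > /sys/fs/cgroup/blkio/group1/tasks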
Network Traffic (Resource Control)
With cgroup_tc, a network traffic controller is available. It can be used to manage traffic that is associated with the tasks in a cgroup. Additionally, cls_flow can classify packets based on the tc_classid field in the packet. For example, to limit the traffic from all tasks of a file_server cgroup to 100 Mbps, proceed as follows:

# create a file_transfer cgroup and assign it a unique classid
# of 0x10 - this will be used later to direct packets.
10.4.2 Checking the Environment The kernel shipped with SUSE Linux Enterprise Server supports cgroups. There is no need to apply additional patches.
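To verify which cgroup subsystems are available in the running kernel, and whether they are enabled, inspect /proc/cgroups:

cat /proc/cgroups    # columns: subsys_name, hierarchy, num_cgroups, enabled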
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
# List of CPUs in this cpuset:
echo 2-3 > cpuset.cpus
# List of memory nodes in this cpuset:
echo 1 > cpuset.mems
echo $$ > tasks
# The subshell 'sh' is now running in cpuset Charlie
# The next line should display '/Charlie'
cat /proc/self/cpuset

3 Remove the cpuset using shell commands:

rmdir /sys/fs/cgroup/cpuset/Charlie

This fails as long as this cpuset is in use.
cd /sys/fs/cgroup/cpuset/cgroup
mkdir priority
cd priority
cat cpu.shares

2 Understanding cpu.shares:

• 1024 is the default (for more information, see /Documentation/scheduler/sched-design-CFS.txt) = 50% utilization

• 1524 = 60% utilization

• 2048 = 67% utilization

• 512 = 40% utilization

3 Changing cpu.shares:

echo 1024 > cpu.shares
• /usr/src/linux/Documentation/cgroups/ resource_counter.txt • For Linux Containers (LXC) based on cgroups, see Virtualization with Linux Containers (LXC) (↑Virtualization with Linux Containers (LXC)). • http://lwn.net/Articles/243795/—Corbet, Jonathan: Controlling memory use in containers (2007). • http://lwn.net/Articles/236038/—Corbet, Jonathan: Process containers (2007).
11 Power Management

Power management aims at reducing operating costs for energy and cooling systems while at the same time keeping the performance of a system at a level that matches the current requirements. Thus, power management is always a matter of balancing the actual performance needs and power saving options for a system. Power management can be implemented and used at different levels of the system.
11.1.1 C-States (Processor Operating States) Modern processors have several power saving modes called C-states. They reflect the capability of an idle processor to turn off unused components in order to save power. Whereas C-states have been available for laptops for some time, they are a rather recent trend in the server market (for example, with Intel* processors, C-modes are only available since Nehalem). When a processor runs in the C0 state, it is executing instructions.
Mode   Definition

C2     [...] The processor maintains all software-visible states, but may take longer to wake up through interrupts.

C3     Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts.

To avoid needless power consumption, it is recommended to test your workloads with deep sleep states enabled versus deep sleep states disabled.
C-states and P-states can vary independently of one another. 11.1.3 T-States (Processor Throttling States) T-states refer to throttling the processor clock to lower frequencies in order to reduce thermal effects. This means that the CPU is forced to be idle a fixed percentage of its cycles per second. Throttling states range from T1 (the CPU has no forced idle cycles) to Tn, with the percentage of idle cycles increasing the greater n is.
11.2 The Linux Kernel CPUfreq Infrastructure Processor performance states (P-states) and processor operating states (C-states) are the capability of a processor to switch between different supported operating frequencies and voltages to modulate power consumption. In order to dynamically scale processor frequencies at runtime, you can use the CPUfreq infrastructure to set a static or dynamic power policy for the system.
However, using this governor often does not lead to the expected power savings as the highest savings can usually be achieved at idle through entering C-states. Due to running processes at the lowest frequency with the powersave governor, processes will take longer to finish, thus prolonging the time for the system to enter any idle C-states. Tuning options: The range of minimum frequencies available to the governor can be adjusted (for example, with the cpupower command line tool).
The CPUfreq settings are exposed below /sys/devices/system/cpu/. If you list the contents of this directory, you will find a cpu{0..x} subdirectory for each processor, and several other files and directories. A cpufreq subdirectory in each processor directory holds a number of files and directories that define the parameters for CPUfreq. Some of them are writable (for root), some of them are read-only.
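For example, to inspect and adjust these attributes for the first CPU (a minimal sketch; the available governors and frequency values depend on your hardware):

cd /sys/devices/system/cpu/cpu0/cpufreq
cat scaling_governor               # currently active governor
cat scaling_available_governors    # governors compiled into the kernel
echo 1400000 > scaling_min_freq    # raise the minimum frequency (value in kHz)

The same minimum can also be set with cpupower frequency-set -d 1400MHz.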
Monitoring Power Consumption with powerTOP (page 145)
powerTOP combines various sources of information (analysis of programs, device drivers, kernel options, amounts and sources of interrupts waking up processors from sleep states) and shows them in one screen. The tool helps you to identify the reasons for unnecessarily high power consumption (for example, processes that are mainly responsible for waking up a processor from its idle state) and to optimize your system settings to avoid these.

11.3.
available frequency steps: 3.40 GHz, 2.80 GHz
available cpufreq governors: conservative, userspace, powersave, ondemand, performance
current policy: frequency should be within 2.80 GHz and 3.40 GHz.
                The governor "performance" may decide which speed to use
                within this range.
current CPU frequency is 3.40 GHz.
cpupower -c 4 frequency-info (versus cpufreq-info -c 4)

cpupower also lets you specify a list of CPUs with -c. For example, the following command would affect the CPUs 1, 2, 3, and 5:

cpupower -c 1-3,5 frequency-set

• If cpufreq* and cpupower are used without the -c option, the behavior differs: cpufreq-set automatically applies the command to CPU 0, whereas cpupower frequency-set applies the command to all CPUs in this case.
11.3.2.2 Viewing and Modifying Kernel Idle Statistics with cpupower

The idle-info subcommand shows the statistics of the cpuidle driver used in the Kernel. It works on all architectures that use the cpuidle Kernel framework.

Example 11.3: Example cpupower idle-info Output
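The output listing itself did not survive in this copy of the document. As an illustration only (the driver, governor, and state names depend on your hardware, and the values below are invented), cpupower idle-info typically reports something like:

CPUidle driver: acpi_idle
CPUidle governor: menu
Analyzing CPU 0:
Number of idle states: 3
Available idle states: C1 C2 C3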
The monitor subcommand allows you to execute performance benchmarks and to compare Kernel statistics with hardware statistics for specific workloads.

Example 11.4: Example cpupower monitor Output

             |Mperf                || Idle_Stats
CPU | C0    | Cx    | Freq || POLL | C1    | C2    | C3
   0|   3.71|  96.29|  2833||  0.00|  0.00 |  0.02 | 96.32
   1|  100.0|  -0.00|  2833||  0.00|  0.00 |  0.00 |  0.00
   2|   9.06|  90.94|  1983||  0.00|  7.69 |  6.98 | 76.45
   3|   7.43|  92.57|  2039||  0.00|  2.60 | 12.62 | 77.
The frequency-set subcommand lets you, for example, set the minimum or maximum CPU frequency the governor may select, or create a new governor. With the -c option, you can also specify for which of the processors the settings should be modified. That makes it easy to use a consistent policy across all processors without adjusting the settings for each processor individually. For more details and the available options, refer to the cpupower-frequency-set man page or run cpupower frequency-set --help.

11.3.
[...]
Suggestion: Enable SATA ALPM link power management via:
echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
or press the S key.

The first column shows the C-states. When working, the CPU is in state 0; when resting, it is in some state greater than 0, depending on which C-states are available and how deep the CPU is sleeping. The second column shows the average time in milliseconds spent in the particular C-state. The third column shows the percentages of time spent in various C-states.
11.4.1 Tuning Options for P-States

The CPUfreq subsystem offers several tuning options for P-states: you can switch between the different governors, influence the minimum or maximum CPU frequency to be used, or change individual governor parameters. To switch to another governor at runtime, use cpupower frequency-set (or cpufreq-set) with the -g option.
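For example, the following commands (a sketch; governor availability depends on your kernel configuration) activate the ondemand governor on all CPUs and cap the maximum frequency on CPUs 0 to 3:

cpupower -c all frequency-set -g ondemand
cpupower -c 0-3 frequency-set -u 2.8GHz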
1 Change to the directory of the governor whose parameter you want to modify, for example:

cd /sys/devices/system/cpu/cpu0/cpufreq/conservative/

2 Show the current value of ignore_nice_load with:

cat ignore_nice_load

3 To set the value to 1, execute:

echo 1 > ignore_nice_load

TIP: Using the Same Value for All Cores
When setting the ignore_nice_load value for cpu0, the same value is automatically used for all cores. In this case, you do not need to repeat the steps above for each of the processors where you want to modify this governor parameter.
Procedure 11.2: Scheduling Processes on Cores

1 Become root on a command line.

2 To view the current value of sched_mc_power_savings, use the following command:

cpupower info -m

3 To set sched_mc_power_savings to 1, execute:

cpupower set -m 1

11.5 Creating and Using Power Management Profiles

SUSE Linux Enterprise Server includes pm-profiler, intended for server use. It is a script infrastructure to enable or disable certain power management functions via configuration files.
3 Edit the settings in /etc/pm-profiler/testprofile/config and save the file. You can also remove variables that you do not need—they will be handled like empty variables, and the settings will not be touched at all.

4 Edit /etc/pm-profiler.conf. The PM_PROFILER_PROFILE variable defines which profile will be activated on system start. If it has no value, the default system or kernel settings will be used.
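To activate the profile, the configuration could look as follows. This is a sketch: PM_PROFILER_PROFILE is the variable described above, while the rcpm-profiler shortcut is an assumption based on the usual SUSE rc-script convention:

# In /etc/pm-profiler.conf:
PM_PROFILER_PROFILE="testprofile"

# Apply the profile without rebooting (assumed init script shortcut):
rcpm-profiler start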
In case of a CPU upgrade, make sure to upgrade your BIOS, too. The BIOS needs to know the new CPU and its valid frequency steps in order to pass this information on to the operating system.

CPUfreq subsystem enabled?
In SUSE Linux Enterprise Server, the CPUfreq subsystem is enabled by default. To find out if the subsystem is currently enabled, check for the following path in your system: /sys/devices/system/cpu/cpufreq (or /sys/devices/system/cpu/cpu*/cpufreq for machines with multiple cores).
brary/l-cpufreq-3/?ca=dgr-lnxw03ReduceLXPWR-P1dthLX&S_TACT=105AGX59&S_CMP=grlnxw03

• The LessWatts.org project deals with how to save power, reduce costs and increase efficiency on Linux systems. Find the project home page at http://www.lesswatts.org/. The project page also holds an informative FAQ section at http://www.lesswatts.org/documentation/faq/index.php and provides useful tips and tricks. For tips dealing with the CPU level, refer to http://www.lesswatts.org/tips/cpu.php.
Part V.
12 Installing Multiple Kernel Versions

SUSE Linux Enterprise Server supports the parallel installation of multiple kernel versions. When installing a second kernel, a boot entry and an initrd are automatically created, so no further manual configuration is needed. When rebooting the machine, the newly added kernel is available as an additional boot option. Using this functionality, you can safely test kernel updates while being able to always fall back to the proven former kernel.
For more information, refer to Section “The File /etc/sysconfig/bootloader” (Chapter 10, The Boot Loader GRUB, ↑Administration Guide).

12.1 Enabling and Configuring Multiversion Support

Installing multiple versions of a software package (multiversion support) is not enabled by default. To enable this feature, proceed as follows:

1 Open /etc/zypp/zypp.conf with the editor of your choice as root.

2 Search for the string multiversion.
2.6.32.12-0.7: keep the kernel with the specified version number

latest: keep the kernel with the highest version number

latest-N: keep the kernel with the Nth highest version number

running: keep the running kernel

oldest: keep the kernel with the lowest version number (the one that was originally shipped with SUSE Linux Enterprise Server)

oldest+N: keep the kernel with the Nth lowest version number

Here are some examples:
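The concrete example lines did not survive in this copy; the following are plausible settings, assuming the multiversion.kernels syntax in /etc/zypp/zypp.conf described above:

multiversion.kernels = latest,running
multiversion.kernels = latest,latest-1,running
multiversion.kernels = 2.6.32.12-0.7,running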
Figure 12.1: The YaST Software Manager - Multiversion View

3 Select a package and open its Version tab in the bottom pane on the left.

4 To install a package, click its check box. A green check mark indicates it is selected for installation. To remove an already installed package (marked with a white check mark), click its check box until a red X indicates it is selected for removal.

5 Click Accept to start the installation.

12.
S | Name | Type | Version | Arch | Repository --+----------------+------------+-----------------+--------+------------------v | kernel-default | package | 2.6.32.10-0.4.1 | x86_64 | Alternative Kernel i | kernel-default | package | 2.6.32.9-0.5.1 | x86_64 | (System Packages) | kernel-default | srcpackage | 2.6.32.10-0.4.1 | noarch | Alternative Kernel i | kernel-default | package | 2.6.32.9-0.5.1 | x86_64 | (System Packages) ... 2 Specify the exact version when installing: zypper in kernel-default-2.6.32.
13 Tuning I/O Performance

I/O scheduling controls how input/output operations will be submitted to storage. SUSE Linux Enterprise Server offers various I/O algorithms—called elevators—suiting different workloads. Elevators can help to reduce seek operations, can prioritize I/O requests, or make sure an I/O request is carried out before a given deadline.

Choosing the best suited I/O elevator not only depends on the workload, but on the hardware, too.
echo SCHEDULER > /sys/block/DEVICE/queue/scheduler

where SCHEDULER is one of cfq, noop, or deadline, and DEVICE is the block device (sda for example).

NOTE: Default Scheduler on IBM System z
On IBM System z, the default I/O scheduler for a storage device is set by the device driver.

13.2 Available I/O Elevators

In the following, the elevators available on SUSE Linux Enterprise Server are listed.
/sys/block/DEVICE/queue/iosched/quantum
This option limits the maximum number of requests that are being processed by the device at once. The default value is 4. For a storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance, although this may increase the latency of some I/O due to more requests being buffered inside the storage.
several parallel readers from a SAN and for databases (especially when using “TCQ” disks). The DEADLINE scheduler has the following tunable parameters:

/sys/block/DEVICE/queue/iosched/writes_starved
Controls how many reads can be sent to disk before it is possible to send writes. A value of 3 means that three read operations are carried out for one write operation.
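Putting these pieces together, switching a device to the DEADLINE elevator and adjusting one of its parameters could look like this (a sketch; sda stands for whatever block device you are tuning):

# Check which elevator is active (the current one is shown in brackets):
cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

# Switch to DEADLINE and let writes wait for at most two reads:
echo deadline > /sys/block/sda/queue/scheduler
echo 2 > /sys/block/sda/queue/iosched/writes_starved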
14 Tuning the Task Scheduler

Modern operating systems, such as SUSE® Linux Enterprise Server, normally run many different tasks at the same time. For example, you can be searching in a text file while receiving an e-mail and copying a big file to an external hard drive. These simple tasks require many additional processes to be run by the system. To provide each task with its required system resources, the Linux kernel needs a tool to distribute available system resources to individual tasks.
14.1.1 Preemption

The theory behind task scheduling is very simple. If there are runnable processes in a system, at least one process must always be running. If there are more runnable processes than processors in a system, not all the processes can be running all the time. Therefore, some processes need to be stopped temporarily, or suspended, so that others can be running again. The scheduler decides what process in the queue will run next.
One approach is to classify a process as either I/O-bound or processor-bound.

I/O-bound
I/O stands for Input/Output devices, such as keyboards, mice, or optical and hard disks. I/O-bound processes spend the majority of time submitting and waiting for requests. They are run very frequently, but only for short time intervals, so as not to block other processes waiting for I/O requests.
The scheduler calculates the timeslices dynamically. However, determining the appropriate timeslice is a complex task: too long timeslices cause the system to be less interactive and responsive, while too short ones make the processor waste a lot of time on the overhead of switching processes too frequently. The default timeslice is usually rather low, for example 20ms.
As a result, CFS brings more optimized scheduling for both servers and desktops.

14.4.1 How CFS Works

CFS tries to guarantee a fair approach to each runnable task. To find the most balanced way of task scheduling, it uses the concept of a red-black tree: a type of self-balancing binary search tree that inserts and removes entries efficiently while keeping itself well balanced. For more information, see the wiki pages of Red-black tree [http://en.wikipedia.
14.4.3 Kernel Configuration Options

Basic aspects of the task scheduler behavior can be set through the kernel configuration options. Setting these options is part of the kernel compilation process. Because the kernel compilation process is a complex task and out of this document's scope, refer to a relevant source of information.
SCHED_BATCH
Scheduling policy designed for CPU-intensive tasks.

SCHED_IDLE
Scheduling policy intended for very low prioritized tasks.

SCHED_OTHER
Default Linux time-sharing scheduling policy used by the majority of processes.

SCHED_RR
Similar to SCHED_FIFO, but uses the Round Robin scheduling algorithm.

14.4.5 Changing Real-time Attributes of Processes with chrt

The chrt command sets or retrieves the real-time scheduling attributes of a running process, or runs a command with the specified attributes.
saturn.example.com:~ # chrt -b -p 0 16244
saturn.example.com:~ # chrt -p 16244
pid 16244's current scheduling policy: SCHED_BATCH
pid 16244's current scheduling priority: 0

For more information on chrt, see its man page (man 1 chrt).

14.4.6 Runtime Tuning with sysctl

The sysctl interface for examining and changing kernel parameters at runtime introduces important variables by means of which you can change the default behavior of the task scheduler.
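For example, scheduler variables can be read and changed with sysctl like any other kernel parameter (a sketch; the exact variable set depends on the kernel version):

sysctl kernel.sched_child_runs_first
kernel.sched_child_runs_first = 0
sysctl -w kernel.sched_child_runs_first=1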
A list of the most important task scheduler sysctl tuning variables (located at /proc/sys/kernel/) with a short description follows:

sched_child_runs_first
A freshly forked child runs before the parent continues execution. Setting this parameter to 1 is beneficial for an application in which the child performs an execution after fork. For example, make -j performs better when sched_child_runs_first is turned off. The default value is 0.
sched_latency_ns
Targeted preemption latency for CPU bound tasks. This value also specifies the maximum amount of time during which a sleeping task is considered to be running for entitlement calculations. Increasing this variable increases the amount of time a waking task may consume before being preempted, thus increasing scheduler latency for CPU bound tasks. The default value is 20000000 (ns).

sched_min_granularity_ns
Minimal preemption granularity for CPU bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).
SCHED_OTHER policy, they will all be run on the same processor. The default value is 32. Increasing this value gives a performance boost to large SCHED_OTHER threads at the expense of increased latencies for real-time tasks.

14.4.7 Debugging Interface and Scheduler Statistics

CFS comes with a new improved debugging interface, and provides runtime statistics information. Relevant files were added to the /proc file system, which can be examined simply with the cat or less command.
runnable tasks:
  task   PID        tree-key   switches  prio     exec-runtime   sum-exec  sum-sleep
-------------------------------------------------------------------------------------
R  cat  16884  54410632.307072     0      120  54410632.307072  13.836804   0.000000

/proc/schedstat
Displays statistics relevant to the current run queue. Also domain-specific statistics for SMP systems are displayed for all connected processors. Because the output format is not user-friendly, read the contents of /usr/src/linux/Documentation/scheduler/sched-stats.txt.
• General information on Linux task scheduling is described in Inside the Linux scheduler [http://www.ibm.com/developerworks/linux/library/l-scheduler/].

• Information specific to Completely Fair Scheduler is available in Multiprocessing with the Completely Fair Scheduler [http://www.ibm.com/developerworks/linux/library/l-cfs/?ca=dgr-lnxw06CFC4Linux]

• Information specific to tuning Completely Fair Scheduler is available in Tuning the Linux Kernel’s Completely Fair Scheduler [http://www.hotaboutlinux.
15 Tuning the Memory Management Subsystem

In order to understand and tune the memory management behavior of the kernel, it is important to first have an overview of how it works and cooperates with other subsystems. The memory management subsystem, also called the virtual memory manager, will subsequently be referred to as “VM”. The role of the VM is to manage the allocation of physical memory (RAM) for the entire kernel and user programs.
Finally, the workload itself should be examined and tuned as well. If an application is allowed to run more processes or threads, the effectiveness of VM caches can be reduced if each process is operating in its own area of the file system. Memory overheads are also increased. If applications allocate their own buffers or caches, larger caches will mean that less memory is available for VM caches.
as inode tables, allocation bitmaps, and so forth. Buffercache can be reclaimed similarly to pagecache.

15.1.4 Buffer Heads

Buffer heads are small auxiliary structures that tend to be allocated upon pagecache access. They can generally be reclaimed easily when the pagecache or buffercache pages are clean.

15.1.5 Writeback

As applications write to files, the pagecache (and buffercache) becomes dirty.
15.1.7.2 Directory Entry Cache

This is an in-memory cache of the directory entries in the system. These contain a name (the name of a file), the inode which it refers to, and children entries. This cache is used when traversing the directory structure and accessing a file by name.

15.2 Reducing Memory Usage

15.2.1 Reducing malloc (Anonymous) Usage

Applications running on SUSE Linux Enterprise Server 11 SP3 can allocate more memory compared to SUSE Linux Enterprise Server 10.
15.2.3 Memory Controller (Memory Cgroups)

If the memory cgroups feature is not needed, it can be switched off by passing cgroup_disable=memory on the kernel command line, reducing memory consumption of the kernel a bit.

15.3 Virtual Memory Manager (VM) Tunable Parameters

When tuning the VM it should be understood that some of the changes will take time to affect the workload and take full effect. If the workload changes throughout the day, it may behave very differently at different times.
/proc/sys/vm/vfs_cache_pressure
This variable controls the tendency of the kernel to reclaim the memory which is used for caching of VFS caches, versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed. It is difficult to know when this should be changed, other than by experimentation. The slabtop command (part of the package procps) shows top memory objects used by the kernel. The vfs caches are the "dentry" and the "*_inode_cache" objects.
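For example, to inspect the kernel's slab caches and to experiment with a higher reclaim tendency (a sketch; 200 is an arbitrary experimental value, the default is 100):

# Show the top kernel slab caches, sorted by cache size:
slabtop -s c

# Double the tendency to reclaim VFS caches:
sysctl -w vm.vfs_cache_pressure=200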
/proc/sys/vm/dirty_ratio
Similar percentage value as above. When this is exceeded, applications that want to write to the pagecache are blocked and start performing writeback as well. The default value is 40 (%).

These two values together determine the pagecache writeback behavior. If these values are increased, more dirty memory is kept in the system for a longer time. With more dirty memory allowed in the system, the chance to improve throughput by avoiding writeback I/O and by submitting more optimal I/O patterns increases.
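Both knobs live under /proc/sys/vm and can be changed with sysctl; for example (a sketch with illustrative values, to be validated against your workload):

sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 40
sysctl -w vm.dirty_ratio=60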
Multi-socket servers are NUMA machines. NUMA is a secondary concern to managing swapping and caches in terms of performance, and there are lots of documents about improving NUMA memory allocations. One particular parameter interacts with page reclaim:

/proc/sys/vm/zone_reclaim_mode
This parameter controls whether memory reclaim is performed on a local NUMA node even if there is plenty of memory free on other nodes. This parameter is automatically turned on on machines with more pronounced NUMA characteristics.
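To check whether zone reclaim is active on a given machine, and to turn it off if local reclaim hurts your workload (a sketch):

cat /proc/sys/vm/zone_reclaim_mode
echo 0 > /proc/sys/vm/zone_reclaim_mode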
16 Tuning the Network

The network subsystem is rather complex and its tuning highly depends on the system use scenario and also on external factors such as software clients or hardware components (switches, routers, or gateways) in your network. The Linux kernel aims more at reliability and low latency than low overhead and high throughput. Other settings can mean less security, but better performance.

16.
The special files in the /proc file system can modify the size and behavior of kernel socket buffers; for general information about the /proc file system, see Section 2.6, “The /proc File System” (page 35). Find networking related files in:

/proc/sys/net/core
/proc/sys/net/ipv4
/proc/sys/net/ipv6

General net variables are explained in the kernel documentation (linux/Documentation/sysctl/net.txt). Special ipv4 variables are explained in linux/Documentation/networking/ip-sysctl.txt
/proc/sys/net/ipv4/tcp_sack
Selective acknowledgments (SACK).

Use sysctl to read or write variables of the /proc file system. sysctl is preferable to cat (for reading) and echo (for writing), because it also reads settings from /etc/sysctl.conf and, thus, those settings survive reboots reliably.
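For example (a sketch; the tunable is the one listed above):

# Read the current value:
sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1

# Change it for the running system:
sysctl -w net.ipv4.tcp_sack=0

# Make the change persistent by adding the following line to /etc/sysctl.conf:
# net.ipv4.tcp_sack = 0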
When the kernel queue becomes full, all new packets are dropped, causing existing connections to fail. The 'fail-open' feature, available since SUSE Linux Enterprise Server 11 SP3, allows a user to temporarily disable the packet inspection and maintain the connectivity under heavy network traffic. For reference, see https://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/. For more information, see the home page of the Netfilter and iptables project, http://www.netfilter.org
Part VI.
17 Tracing Tools

SUSE Linux Enterprise Server comes with a number of tools that help you obtain useful information about your system. You can use the information for various purposes, for example, to debug and find problems in your program, to discover places causing performance drops, or to trace a running process to find out what system resources it uses. The tools are mostly part of the installation media; otherwise, you can install them from the downloadable SUSE Software Development Kit.
To run a new command and start tracing its system calls, enter the command to be monitored as you normally do, and add strace at the beginning of the command line: tux@mercury:~> strace ls execve("/bin/ls", ["ls"], [/* 52 vars */]) = 0 brk(0) = 0x618000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) \ = 0x7f9848667000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) \ = 0x7f9848666000 access("/etc/ld.so.
The -e option understands several sub-options and arguments. For example, to trace all attempts to open or write to a particular file, use the following: tux@mercury:~> strace -e trace=open,write ls ~ open("/etc/ld.so.cache", O_RDONLY) = 3 open("/lib64/librt.so.1", O_RDONLY) = 3 open("/lib64/libselinux.so.1", O_RDONLY) = 3 open("/lib64/libacl.so.1", O_RDONLY) = 3 open("/lib64/libc.so.6", O_RDONLY) = 3 open("/lib64/libpthread.so.0", O_RDONLY) = 3 [...] open("/usr/lib/locale/cs_CZ.
brk(0) = 0x69e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) \
= 0x7f3bb553b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) \
= 0x7f3bb553a000
[...]
[pid 4823] rt_sigprocmask(SIG_SETMASK, [],
[pid 4822] close(4
[pid 4823] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 4822] <... close resumed> ) = 0
[...]
17.2 Tracing Library Calls with ltrace

ltrace traces dynamic library calls of a process. It is used in a similar way to strace, and most of their parameters have a very similar or identical meaning. By default, ltrace uses /etc/ltrace.conf or ~/.ltrace.conf configuration files. You can, however, specify an alternative one with the -F config_file option. In addition to library calls, ltrace with the -S option can trace system calls as well:

tux@mercury:~> ltrace -S -o ltrace_find.txt find /etc -name \
xorg.conf
clock_gettime(1, 0x7fff4b5c34d0, 0, 0, 0) = 0
clock_gettime(1, 0x7fff4b5c34c0, 0xffffffffff600180, -1, 0) = 0
+++ exited (status 0) +++

You can make the output more readable by indenting each nested call by the specified number of spaces with the -n num_of_spaces option.

17.3 Debugging and Profiling with Valgrind

Valgrind is a set of tools to debug and profile your programs so that they can run faster and with fewer errors.
• ppc64
• System z

17.3.3 General Information

The main advantage of Valgrind is that it works with existing compiled executables. You do not have to recompile or modify your programs to make use of it. Run Valgrind like this:

valgrind valgrind_options your-prog your-program-options

Valgrind consists of several tools, and each provides specific functionality. Information in this section is general and valid regardless of the used tool. The most important configuration option is --tool.
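For example, to run a program under the Memcheck tool explicitly (a minimal sketch, assuming a binary named ./myprog):

valgrind --tool=memcheck ./myprog

If --tool is omitted, memcheck is used by default.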
massif
A heap profiler. Heap is an area of memory used for dynamic memory allocation. This tool helps you tune your program to use less memory.

lackey
An example tool showing instrumentation basics.

17.3.4 Default Options

Valgrind can read options at start-up. There are three places which Valgrind checks:

1. The file .valgrindrc in the home directory of the user who runs Valgrind.
2. The environment variable $VALGRIND_OPTS
3. The file .valgrindrc in the current directory where Valgrind is run from.
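For example, options that you want applied to every Valgrind run can be put into the environment (a sketch; the option values are illustrative):

export VALGRIND_OPTS="--num-callers=30 --error-limit=no"
valgrind --tool=memcheck ./myprog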
Besides your program itself, Valgrind also detects errors in associated libraries (like the C, X11, or Gtk libraries). Because you probably do not need those errors, Valgrind can selectively suppress such error messages using suppression files. The --gen-suppressions=yes option tells Valgrind to report these suppressions, which you can copy to a file. Note that you should supply a real executable (machine code) as a Valgrind argument.
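A typical round-trip looks like this (a sketch; my.supp is a hypothetical file name):

# Let Valgrind offer a ready-made suppression for each error:
valgrind --gen-suppressions=yes ./myprog

# After copying the suppressions into my.supp, silence them in future runs:
valgrind --suppressions=my.supp ./myprog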
1. By default, Valgrind sends its messages to the file descriptor 2, which is the standard error output. You can tell Valgrind to send its messages to any other file descriptor with the --log-fd=file_descriptor_number option.

2. The second and probably more useful way is to send Valgrind's messages to a file with --log-file=filename. This option accepts several variables, for example, %p gets replaced with the PID of the currently profiled process.
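For example, to give each profiled process its own log file (a sketch):

valgrind --log-file=valgrind-%p.log ./myprog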
Valgrind remembers all error messages. In case of a duplicate error, it is recorded but no message is shown. This mechanism prevents you from being overwhelmed by millions of duplicate errors. The -v option will add a summary of all reports (sorted by their total count) to the end of the Valgrind's execution output. Moreover, Valgrind stops collecting errors if it detects either 1000 different errors, or 10 000 000 errors in total. If you want to suppress this limit and wish to see all error messages, use --error-limit=no.
18 kexec and kdump

kexec is a tool to boot to another kernel from the currently running one. You can perform faster system reboots without any hardware initialization. You can also prepare the system to boot to another kernel if the system crashes.

18.1 Introduction

With kexec, you can replace the running kernel with another one without a hard reboot. The tool is useful for several reasons:

• Faster system rebooting
If you need to reboot the system frequently, kexec can save you significant time.
• Booting without GRUB or LILO configuration
When the system boots a kernel with kexec, it skips the boot loader stage. Normal booting procedure can fail due to an error in the boot loader configuration. With kexec, you do not depend on a working boot loader configuration.

18.2 Required Packages

If you intend to use kexec on SUSE® Linux Enterprise Server to speed up reboots or avoid potential hardware problems, you need to install the kexec-tools package.
The capture kernel is loaded to the reserved area and waits for the kernel to crash. Then kdump tries to invoke the capture kernel because the production kernel is no longer reliable at this stage. This means that even kdump can fail. To load the capture kernel, you need to include the kernel boot parameters. Usually, the initial RAM file system is used for booting. You can specify it with --initrd=filename. With --append=cmdline, you append options to the command line of the kernel to boot.
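Loading a kernel with kexec could then look as follows (a minimal sketch; the kernel and initrd paths are placeholders, and reusing /proc/cmdline is one common way to keep the current boot options):

kexec -l /boot/vmlinuz --initrd=/boot/initrd \
      --append="$(cat /proc/cmdline)"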
6 Remount the root file system in read-only mode:

mount -o remount,ro /

7 Initiate the reboot of the kernel that you loaded in Step 4 (page 207) with:

kexec -e

It is important to unmount the previously mounted disk volumes in read-write mode. The reboot system call acts immediately upon calling. Hard drive volumes mounted in read-write mode neither synchronize nor unmount automatically. The new kernel may find them “dirty”. Read-only disk volumes and virtual file systems do not need to be unmounted.
18.6 Basic kdump Configuration

You can use kdump to save kernel dumps. If the kernel crashes, it is useful to copy the memory image of the crashed environment to the file system. You can then debug the dump file to find the cause of the kernel crash. This is called a “core dump”. kdump works similarly to kexec (see Chapter 18, kexec and kdump (page 205)). The capture kernel is executed after the running production kernel crashes.
18.6.1 Manual kdump Configuration

kdump reads its configuration from the /etc/sysconfig/kdump file. To make sure that kdump works on your system, its default configuration is sufficient. To use kdump with the default settings, follow these steps:

1 Append the following kernel command line option to your boot loader configuration, and reboot the system:

crashkernel=size@offset

You can find the corresponding values for size and offset in the following table:

Table 18.
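The table itself is not reproduced in this copy of the document. As an illustration only (the correct size and offset depend on the architecture and the amount of installed RAM; consult the table in the original document), such an option might look like:

crashkernel=256M@16M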
2 Enable kdump init script:

chkconfig boot.kdump on

3 You can edit the options in /etc/sysconfig/kdump. Reading the comments will help you understand the meaning of individual options.

4 Execute the init script once with rckdump start, or reboot the system.

After configuring kdump with the default values, check if it works as expected. Make sure that no users are currently logged in and no important services are running on your system.
Do not reset the computer because kdump always needs some time to complete its task.

18.6.2 YaST Configuration

In order to configure kdump with YaST, you need to install the yast2-kdump package. Then either start the Kernel Kdump module in the System category of YaST Control Center, or enter yast2 kdump in the command line as root.

Figure 18.1: YaST2 Kdump Module - Start-Up Page

In the Start-Up window, select Enable Kdump. The default value for kdump memory is sufficient on most systems.
In the Dump Target window, select the type of the dump target and the URL where you want to save the dump. If you selected a network protocol, such as FTP or SSH, you need to enter relevant access information as well. Fill in the Email Notification window if you want kdump to inform you about its events via e-mail. After fine-tuning kdump in the Expert Settings window, confirm your changes with OK. kdump is now configured.
The first parameter represents the kernel image. The second parameter is the dump file captured by kdump. You can find this file under /var/crash by default.

18.7.1 Kernel Binary Formats

The Linux kernel comes in Executable and Linkable Format (ELF). This file is usually called vmlinux and is directly generated in the compilation process. Not all boot loaders, especially on x86 (i386 and x86_64) architecture, support ELF binaries.
If you decide to analyze the dump on another machine, you must check both the architecture of the computer and the files necessary for debugging. You can analyze the dump on another computer only if it runs a Linux system of the same architecture. To check the compatibility, use the command uname -i on both computers and compare the outputs. If you are going to analyze the dump on another computer, you also need the appropriate files from the kernel and kernel debug packages.
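An invocation of crash on the machine that produced the dump could then look like this (a sketch; the kernel image name and the timestamped dump directory are placeholders for your own paths). The startup messages shown next are what crash typically prints:

crash /boot/vmlinux-$(uname -r).gz /var/crash/2014-10-16-12:34/vmcore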
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
RIP: 00007fa958991f60  RSP: 00007fff61330390  RFLAGS: 00010246
RAX: 0000000000000001  RBX: ffffffff8020bfbb  RCX: 0000000000000001
RDX: 0000000000000002  RSI: 00007fa959284000  RDI: 0000000000000001
RBP: 0000000000000002   R8: 00007fa9592516f0   R9: 00007fa958c209c0
R10: 00007fa958c209c0  R11: 0000000000000246  R12: 00007fa958c1f780
R13: 00007fa959284000  R14: 0000000000000002  R15: 00000000595569d0
ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>

Now it is clear what happened: the internal echo command of the Bash shell [...]
You can change the directory for the kernel dumps with the KDUMP_SAVEDIR option. Keep in mind that the size of kernel dumps can be very large. kdump will refuse to save the dump if the free disk space, subtracted by the estimated dump size, drops below the value specified by the KDUMP_FREE_DISK_SIZE option. Note that KDUMP_SAVEDIR understands URL format protocol://specification, where protocol is one of file, ftp, sftp, nfs or cifs, and specification varies for each protocol.
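In /etc/sysconfig/kdump, the two options could be set like this (a sketch; the FTP URL and the 64 MB reserve are illustrative values):

KDUMP_SAVEDIR="ftp://user@dumpserver/var/crash"
KDUMP_FREE_DISK_SIZE="64"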
• For more details on kdump specific to SUSE Linux, see http://ftp.suse.com/pub/people/tiwai/kdump-training/kdump-training.pdf.

• An in-depth description of kdump internals can be found at http://lse.sourceforge.net/kdump/documentation/ols2oo5-kdump-paper.pdf.

For more details on crash dump analysis and debugging tools, use the following resources:

• In addition to the info page of GDB (info gdb), you might want to read the printable guides at http://sourceware.org/gdb/documentation/.
GNU Licenses This appendix contains the GNU Free Documentation License version 1.2. GNU Free Documentation License Copyright (C) 2000, 2001, 2002 Free Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 0.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A.Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission. B.
6. COLLECTIONS OF DOCUMENTS You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
with the Invariant Sections being LIST THEIR TITLES, with the Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST. If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation. If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.