Front cover
Draft Document for Review May 4, 2007 11:35 am
REDP-4285-00
Linux Performance and Tuning Guidelines
Operating system tuning methods
Performance monitoring tools
Performance analysis
Eduardo Ciliendo
Takechika Kunimasa
ibm.
Note: Before using this information and the product it supports, read the information in “Notices” on page vii.
First Edition (April 2007)
This edition applies to kernel 2.6 Linux distributions.
This document was created or updated on May 4, 2007.
© Copyright International Business Machines Corporation 2007. All rights reserved.
Note to U.S.
Contents
Notices
Trademarks
Preface
2.1 Introduction
2.2 Overview of tool function
2.3 Monitoring tools
2.3.1 top
4.2.4 SELinux
4.2.5 Compiling the kernel
4.3 Changing kernel parameters
4.3.1 Where the parameters are stored
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area.
Preface
Linux® is an open source operating system developed by people all over the world. The source code is freely available and can be used under the GNU General Public License. The operating system is made available to users in the form of distributions from companies such as Red Hat and Novell. Some desktop Linux distributions can be downloaded at no charge from the Web, but the server versions typically must be purchased.
Tuning the operating system
With a basic knowledge of how the operating system works and skills in a variety of performance measurement utilities, the reader is now ready to go to work and explore the various performance tweaks available in the Linux operating system.
Computer Sciences from Texas A&M University. He writes extensively in the areas of networking, application integration middleware, and personal computer software. Before joining the ITSO, Byron worked in IBM Learning Services Development in networking education development.
Chapter 1. Understanding the Linux operating system
We begin this Redpaper with a quick overview of how the Linux operating system handles its tasks and interacts with its hardware resources. Performance tuning is a difficult task that requires in-depth understanding of the hardware, operating system, and application.
We can tune the I/O subsystem for weeks in vain if the disk subsystem for a 20,000-user database server consists of a single IDE drive. Often a new driver or an update to the application will yield impressive performance gains. Even as we discuss specific details, never forget the complete picture of systems performance.
1.1 Linux process management
Process management is one of the most important roles of any operating system. Effective process management enables an application to operate steadily and effectively. The Linux process management implementation is similar to the UNIX® implementation. It includes process scheduling, interrupt handling, signaling, process prioritization, process switching, process state, process memory, and so on.
1.1.2 Lifecycle of a process
Every process has its own lifecycle, consisting of creation, execution, termination, and removal. These phases will be repeated literally millions of times as long as the system is up and running. Therefore, the process lifecycle is a very important topic from the performance perspective. Figure 1-3 shows the typical lifecycle of processes.
From the performance perspective, thread creation is less expensive than process creation because a thread does not need to copy resources on creation. On the other hand, processes and threads have similar characteristics in terms of the scheduling algorithm. The kernel deals with both of them in a similar manner.
Linux supports nice levels from 19 (lowest priority) to -20 (highest priority). The default value is 0. To change the nice level of a program to a negative number (which makes it higher priority), it is necessary to log on or su to root.
1.1.5 Context switching
During process execution, information about the running process is stored in the registers of the processor and in its cache.
In a multi-processor environment, interrupts are handled by each processor. Binding interrupts to a single physical processor may improve system performance. For further details, refer to 4.4.2, “CPU affinity for interrupt handling”.
1.1.7 Process state
Every process has its own state to show what is currently happening in the process. Process state changes during process execution.
Zombie processes
When a process has already terminated, having received a signal to do so, it normally takes some time to finish all tasks (such as closing open files) before ending itself. In that normally very short time frame, the process is a zombie. After the process has completed all of these shutdown tasks, it reports to the parent process that it is about to terminate.
Data segment
The data segment consists of these three areas.
– Data: The area where initialized data such as static variables is stored.
– BSS: The area where zero-initialized data is stored.
– Heap: The area where malloc() allocates dynamic memory on demand. The heap grows toward higher addresses.
Figure 1-8 Linux kernel 2.6 O(1) scheduler (two arrays of run queues, array[0] and array[1], serving as the active and expired arrays, each holding processes queued by priority from priority 0 to priority 139)
Another significant advantage of the new scheduler is the support for Non-Uniform Memory Architecture (NUMA) and symmetric multithreading processors, such as Intel® Hyper-Threading technology.
1.2 Linux memory architecture
To execute a process, the Linux kernel allocates a portion of the memory area to the requesting process. The process uses the memory area as workspace and performs the required work. It is similar to you having your own desk allocated and then using the desktop to scatter papers, documents and memos to perform your work. The difference is that the kernel has to allocate space in a more dynamic manner.
Figure 1-10 Linux kernel memory layout for 32-bit and 64-bit systems (32-bit architecture: ZONE_DMA up to 16MB, ZONE_NORMAL up to 896MB, 128MB reserved for kernel data structures, and ZONE_HIGHMEM up to 64GB, where pages in ZONE_HIGHMEM must be mapped into ZONE_NORMAL; 64-bit architecture: ZONE_DMA and ZONE_NORMAL)
Virtual memory addressing layout
Figure 1-11 shows the Linux virtual addressing layout for 32-bit and 64-bit architecture.
1.2.2 Virtual memory manager
The physical memory architecture of an operating system is usually hidden from the application and the user because operating systems map any memory into virtual memory. If we want to understand the tuning possibilities within the Linux operating system, we have to understand how Linux handles virtual memory. As explained in 1.2.
Page frame allocation
A page is a group of contiguous linear addresses in physical memory (page frame) or virtual memory. The Linux kernel handles memory in units of pages. A page is usually 4 KB in size. When a process requests a certain number of pages, if there are available pages, the Linux kernel can allocate them to the process immediately. Otherwise pages have to be taken from some other process or from the page cache.
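Because the kernel hands out memory in whole pages, a request is always rounded up to a page boundary. The following is a minimal sketch of that arithmetic, assuming the usual 4 KB page size mentioned above; the function name pages_needed is ours and is not a kernel interface.

```python
def pages_needed(nbytes: int, page_size: int = 4096) -> int:
    """Return how many whole pages are needed to hold nbytes."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    # Round up: a partially used page still occupies a full page frame.
    return (nbytes + page_size - 1) // page_size


# A 10,000-byte request does not fit in two 4 KB pages, so it takes three.
print(pages_needed(10_000))
```

On a running system the actual page size can differ from 4 KB, so it is safer to query it (for example with the getconf PAGE_SIZE command) than to hard-code it.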
memory segments” on page 8). When kswapd reclaims pages, it would rather shrink the page cache than page out (or swap out) the pages owned by processes.
Note: The phrases “page out” and “swap out” are sometimes confusing. “Page out” means moving some pages (a part of the entire address space) into swap space, while “swap out” means moving the entire address space into swap space. They are sometimes used interchangeably.
Figure 1-14 VFS concept (a user process such as cp issues system calls such as open(), read(), and write(); VFS translates them for each file system: ext3, ext2, NFS, Reiserfs, AFS, XFS, VFAT, JFS, proc)
1.3.2 Journaling
In a non-journaling file system, when a write is performed to a file system, the Linux kernel makes changes to the file system metadata first and then writes the actual user data. This sequence of operations sometimes increases the chance of losing data integrity.
1.3.3 Ext2
The extended 2 file system is the predecessor of the extended 3 file system. A fast, simple file system, it features no journaling capabilities, unlike most other current file systems. Figure 1-16 shows the Ext2 file system data structure. The file system starts with a boot sector, followed by block groups.
i-node of the file. The Linux kernel uses file object caches, such as the directory entry cache and the i-node cache, to accelerate finding the corresponding i-node. Once the Linux kernel knows the i-node of the file, it tries to reach the actual user data block. As we described, the i-node has a pointer to the data block. By following it, the kernel can get to the data block. For large files, Ext2 implements direct and indirect references to data blocks.
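To get a feeling for how far direct and indirect references reach, the following sketch counts the data blocks one i-node can address. It assumes the standard Ext2 on-disk layout, which this section does not spell out: 12 direct pointers plus one single, one double, and one triple indirect pointer, with 4-byte block pointers. The function name is ours.

```python
def ext2_addressable_blocks(block_size: int, pointer_size: int = 4, direct: int = 12) -> int:
    """Count the data blocks reachable from one i-node (assumed layout)."""
    # An indirect block is an ordinary block filled with block pointers.
    per_block = block_size // pointer_size
    single = per_block            # one level of indirection
    double = per_block ** 2       # two levels
    triple = per_block ** 3       # three levels
    return direct + single + double + triple


for size in (1024, 2048, 4096):
    blocks = ext2_addressable_blocks(size)
    print(size, blocks, round(blocks * size / 2**30, 1), "GiB")
```

This is only the capacity of the pointer tree. Other limits in the file system implementation can cap the usable file size below this figure.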
the capability of manipulating Ext3 file systems. For example, PartitionMagic can handle the modification of Ext3 partitions.
Mode of journaling
Ext3 supports three types of journaling mode.
journal
This journaling option provides the highest form of data consistency by causing both file data and metadata to be journaled. It also has a higher performance overhead.
ordered
In this mode only metadata is written.
We will take a look at the Linux disk I/O subsystem to gain a better understanding of the components that have a large effect on system performance. 1.4.
6. A device driver, such as the SCSI driver or another device-specific driver, takes care of the write operation.
7. The disk device firmware performs the hardware operations such as head seek, rotation, and data transfer to the sector on the platter.
1.4.2 Cache
In the past 20 years, the performance improvement of processors has outperformed that of the other components in a computer system such as processor cache, bus, RAM, disk and so on.
• A process reads data from disk. The data in memory and the data on disk are identical at this time.
• A process writes new data. Only the data in memory has been changed, so the data on disk and the data in memory are no longer identical. The changed data in the cache is held in a dirty buffer.
• Flushing writes the data in memory to the disk. The data on disk is now identical to the data in memory.
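The read, write, and flush steps above can be mimicked with a toy model. This is only an illustration under our own assumptions (the disk and the cache are plain dictionaries, and the class name WriteBackCache is hypothetical); it is not how the kernel implements its cache.

```python
class WriteBackCache:
    """Toy model of a cache in front of a disk; both are plain dicts."""

    def __init__(self, disk):
        self.disk = disk      # block number -> data "on disk"
        self.cache = {}       # block number -> data "in memory"
        self.dirty = set()    # blocks changed in memory but not yet on disk

    def read(self, block):
        # A miss copies the block from disk; memory and disk are identical.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def write(self, block, data):
        # Only the in-memory copy changes; the block becomes a dirty buffer.
        self.cache[block] = data
        self.dirty.add(block)

    def flush(self):
        # Write every dirty buffer back, after which memory and disk agree.
        for block in sorted(self.dirty):
            self.disk[block] = self.cache[block]
        self.dirty.clear()


disk = {7: "old"}
cache = WriteBackCache(disk)
cache.read(7)
cache.write(7, "new")
print(disk[7], cache.dirty)   # disk still holds the old data
cache.flush()
print(disk[7], cache.dirty)   # disk now matches memory
```

The model makes the trade-off visible: writes are fast because they touch only memory, but anything still marked dirty would be lost if the system failed before the flush.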
laptop computer quite likely has different I/O requirements from a 10,000-user database system. To accommodate this, four I/O elevators are available.
Anticipatory
The anticipatory I/O elevator was created based on the assumption of a block device with only one physical seek head (for example a single SATA drive). The anticipatory elevator uses the deadline mechanism described in more detail below plus an anticipation heuristic.
SCSI
The Small Computer System Interface (SCSI) is the most commonly used I/O device technology, especially in the enterprise server environment. In Linux kernel implementations, SCSI devices are controlled by device driver modules. They consist of the following types of modules.
Upper level drivers: sd_mod, sr_mod (SCSI-CDROM), st (SCSI Tape), sg (SCSI generic device), etc.
1.5 Network subsystem
The network subsystem is another important subsystem from the performance perspective. Networking operations interact with many components other than Linux itself, such as switches, routers, gateways, PC clients, etc. Though these components may be out of the control of Linux, they have much influence on the overall performance. Keep in mind that you have to work closely with the people working on the network system.
7. The frame is moved into the network interface card buffer if the MAC address matches the MAC address of the interface card.
8. The network interface card eventually moves the packet into a socket buffer and issues a hard interrupt at the CPU.
9. The CPU then processes the packet and moves it up through the layers until it arrives at (for example) a TCP port of an application such as Apache.
Network API (NAPI)
The network subsystem has undergone some changes with the introduction of the new network API (NAPI). The standard implementation of the network stack in Linux focuses more on reliability and low latency than on low overhead and high throughput.
Because of this, NAPI was introduced to counter the overhead associated with processing network traffic. For the first packet, NAPI works just like the traditional implementation: it issues an interrupt.
ACCEPT: Accept the packet and let it through.
DROP: Silently discard the packet.
REJECT: Discard the packet and send back a reply, such as ICMP port unreachable or TCP reset, to the originating host.
LOG: Log the matching packet.
Figure 1-27 TCP 3-way handshake (client: send SYN, SYN_SENT, receive SYN+ACK, ESTABLISHED, FIN_WAIT1, receive ACK, FIN_WAIT2, receive FIN, TIME_WAIT, send ACK, timeout, CLOSED; server: LISTEN, receive SYN, SYN_RECV, SYN+ACK sent, receive ACK, ESTABLISHED, receive FIN, CLOSE_WAIT, LAST_ACK, receive ACK, CLOSED)
The state of a connection changes during the session.
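The state changes shown in Figure 1-27 can be written down as a small transition table. The sketch below is our own reading of the figure and assumes that the client closes the connection first; the event names and the walk helper are illustrative, not kernel identifiers.

```python
# (current state, event) -> next state, as read from the figure.
CLIENT = {
    ("CLOSED", "send SYN"): "SYN_SENT",
    ("SYN_SENT", "receive SYN+ACK"): "ESTABLISHED",
    ("ESTABLISHED", "send FIN"): "FIN_WAIT1",      # assumption: client closes first
    ("FIN_WAIT1", "receive ACK"): "FIN_WAIT2",
    ("FIN_WAIT2", "receive FIN"): "TIME_WAIT",
    ("TIME_WAIT", "timeout"): "CLOSED",
}
SERVER = {
    ("LISTEN", "receive SYN"): "SYN_RECV",
    ("SYN_RECV", "receive ACK"): "ESTABLISHED",
    ("ESTABLISHED", "receive FIN"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "receive ACK"): "CLOSED",
}


def walk(table, start, events):
    """Follow a list of events and return every state visited."""
    state = start
    path = [state]
    for event in events:
        state = table[(state, event)]   # KeyError means an invalid transition
        path.append(state)
    return path


print(walk(CLIENT, "CLOSED", ["send SYN", "receive SYN+ACK"]))
print(walk(SERVER, "LISTEN", ["receive SYN", "receive ACK"]))
```

The state names match what tools such as netstat print for each connection, which is why a table like this is handy when interpreting a large number of sockets stuck in one state.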
Figure 1-29 Sliding window and delayed ack
As an option, high-speed networks may use a technique called window scaling to increase the maximum transfer window size even more. We will analyze the effects of these implementations in more detail in “Tuning TCP options” on page 132.
1.5.4 Bonding module
The Linux kernel provides network interface aggregation capability by using a bonding driver. This is a device-independent bonding driver, although device-specific drivers exist as well. The bonding driver supports the 802.3ad link aggregation specification as well as some original load balancing and fault tolerant implementations. It achieves a higher level of availability and improved performance.
That is, the average of the sum of TASK_RUNNING and TASK_UNINTERRUPTIBLE processes. If processes that request CPU time are blocked (which means that the CPU has no time to process them), the load average will increase. On the other hand, if each process gets immediate access to CPU time and there are no CPU cycles lost, the load will decrease.
Runnable processes
This value depicts the processes that are ready to be executed.
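On Linux, both the load averages and the current number of runnable entities can be read from /proc/loadavg. The following is a minimal parsing sketch; the sample string is made up for illustration, and the names LoadAvg and parse_loadavg are ours.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that currently exist


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the single line found in /proc/loadavg."""
    fields = text.split()
    # The fourth field has the form "runnable/total".
    runnable, total = fields[3].split("/")
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]),
                   int(runnable), int(total))


sample = "0.20 0.18 0.12 1/80 11206"   # hypothetical content of /proc/loadavg
print(parse_loadavg(sample))
```

To use it on a live system, pass the result of open("/proc/loadavg").read() instead of the sample string. Comparing the runnable count with the number of processors gives a quick impression of whether processes are queuing for CPU time.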
1.6.3 Network interface metrics
Packets received and sent
This metric informs you of the quantity of packets received and sent by a given network interface.
Bytes received and sent
This value depicts the number of bytes received and sent by a given network interface.
Collisions per second
This value provides an indication of the number of collisions that occur on the network the respective interface is connected to.
Chapter 2. Monitoring and benchmark tools
The open and flexible nature of the Linux operating system has led to a significant number of performance monitoring tools. Some of them are Linux versions of well-known UNIX utilities, and others were specifically designed for Linux. The fundamental support for most Linux performance monitoring tools lies in the virtual proc file system.
2.1 Introduction
The Enterprise Linux distributions are shipped with many monitoring tools. Some of them cover many metrics in a single tool and give us well-formatted output for easy understanding of system activities. Others are specific to certain performance metrics (for example, disk I/O) and give us detailed information.
Tool       Most useful tool function
netperf    Network performance benchmark
2.3 Monitoring tools
In this section, we discuss the monitoring tools. Most of the tools come with Enterprise Linux distributions. You should be familiar with the tools for better understanding of system behavior and performance tuning.
2.3.1 top
The top command shows actual process activity.
NI Niceness level (that is, whether the process tries to be nice by adjusting the priority by the number given; see below for details).
SIZE Amount of memory (code+data+stack) used by the process in kilobytes.
RSS Amount of physical RAM used, in kilobytes.
SHARE Amount of memory shared with other processes, in kilobytes.
STAT State of the process: S=sleeping, R=running, T=stopped or traced, D=uninterruptible sleep, Z=zombie.
The columns in the output are as follows:
Process (procs)
r: The number of processes waiting for runtime.
b: The number of processes in uninterruptible sleep.
Memory
swpd: The amount of virtual memory used (KB).
free: The amount of idle memory (KB).
buff: The amount of memory used as buffers (KB).
cache: The amount of memory used as cache (KB).
Swap
si: Amount of memory swapped in from the disk (KBps).
Example 2-3 Sample output of uptime
1:57am up 4 days 17:05, 2 users, load average: 0.00, 0.00, 0.00
2.3.4 ps and pstree
The ps and pstree commands are some of the most basic commands when it comes to system analysis. ps accepts three different styles of command options: UNIX style, BSD style, and GNU style. Here we use UNIX style options. The ps command provides a list of existing processes.
WCHAN Name of the kernel function in which the process is sleeping, a “-” if the process is running, or a “*” if the process is multi-threaded and ps is not displaying threads.
RSS Resident set size, the non-swapped physical memory that a task has used (in kilobytes).
PSR Processor that the process is currently assigned to.
STIME Time the command started.
TTY Terminal
TIME Total CPU time used by the process (since it was started).
Swap: 2031608 332 2031276
You can also determine how many chunks of memory are available in each zone using the /proc/buddyinfo file. Each column of numbers shows the number of chunks of that order which are available. In Example 2-10, there are 5 chunks of 2^2*PAGE_SIZE available in ZONE_DMA, and 16 chunks of 2^4*PAGE_SIZE available in ZONE_DMA32. Remember how the buddy system allocates pages (refer to “Buddy system” on page 14).
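The per-order counts can be turned into a total amount of free memory per zone with a few lines of code. The sketch below assumes the usual line shape of /proc/buddyinfo (Node n, zone name, then one count per order) and a 4 KB page size. The sample line is hypothetical and only mirrors the "5 chunks of order 2" case from the text; the function names are ours.

```python
def parse_buddyinfo_line(line: str):
    """Split one /proc/buddyinfo line into node, zone name, and counts."""
    parts = line.split()
    # Expected shape: Node <n>, zone <name> <order 0> <order 1> ...
    node = int(parts[1].rstrip(","))
    zone = parts[3]
    counts = [int(x) for x in parts[4:]]
    return node, zone, counts


def free_pages(counts) -> int:
    # A chunk of order k spans 2**k contiguous pages.
    return sum(n * (2 ** order) for order, n in enumerate(counts))


sample = "Node 0, zone DMA 2 1 5 0 0"   # hypothetical line for illustration
node, zone, counts = parse_buddyinfo_line(sample)
pages = free_pages(counts)
print(zone, pages, "pages,", pages * 4096, "bytes")
```

A zone can have plenty of free pages in total and still be unable to satisfy a high-order request, which is why the distribution across the columns matters as much as the sum.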
tps
The number of transfers per second (I/O requests per second) to the device. Multiple single I/O requests can be combined in a transfer request, because a transfer request can have different sizes.
Blk_read/s, Blk_wrtn/s
Blocks read and written per second indicate the amount of data read from or written to the device per second. Blocks may also have different sizes. Typical sizes are 1024, 2048, and 4096 bytes, depending on the partition size.
Example 2-13 Using iostat -x -d to analyze the average I/O size
Device:  rrqm/s wrqm/s  r/s     w/s rsec/s   wsec/s rkB/s    wkB/s avgrq-sz avgqu-sz await svctm  %util
dasdc      0.00   0.00 0.00 2502.97   0.00 24601.98  0.00 12300.99     9.83   142.93 57.08  0.40 100.00
The iostat output in Example 2-13 shows that the device dasdc had to write 12300.99 kB of data per second, as displayed under the wkB/s heading.
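The average I/O size can be derived from the figures in Example 2-13 in two ways that should agree: dividing the write throughput by the number of write requests, or converting avgrq-sz from sectors. The following sketch assumes that iostat counts sectors of 512 bytes; the function names are ours.

```python
SECTOR_BYTES = 512  # assumption: iostat reports sectors as 512-byte units


def avg_request_kb(kb_per_s: float, requests_per_s: float) -> float:
    """Average size of one request in kB, from throughput and request rate."""
    return kb_per_s / requests_per_s


def sectors_to_kb(sectors: float, sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert a size given in sectors (such as avgrq-sz) to kB."""
    return sectors * sector_bytes / 1024


# Values taken from Example 2-13: wkB/s, w/s, and avgrq-sz.
print(round(avg_request_kb(12300.99, 2502.97), 2))
print(round(sectors_to_kb(9.83), 2))
```

Both calculations come out at a little under 5 kB per request, so the device in the example is saturated (100% utilization) by a large number of small writes rather than by a few large ones.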
Example 2-15 Displaying system statistics with sar
[root@linux sa]# sar -n DEV -f sa21 | less
Linux 2.6.9-5.ELsmp (linux.itso.ral.ibm.com) 04/21/2005
12:00:01 AM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
12:10:01 AM lo       0.00    0.00    0.00    0.00    0.00    0.00     0.00
12:10:01 AM eth0     1.80    0.00  247.89    0.00    0.00    0.00     0.00
12:10:01 AM eth1     0.00    0.00    0.00    0.00    0.00    0.00     0.
To display three entries of statistics for all processors of a multiprocessor server at one-second intervals, use the command:
mpstat -P ALL 1 2
Example 2-18 Output of mpstat command on two-way machine
[root@linux ~]# mpstat -P ALL 1 2
Linux 2.6.9-5.ELsmp (linux.itso.ral.ibm.com) 04/22/2005
03:31:51 03:31:52 03:31:52 03:31:52 PM PM PM PM Average: Average: Average: Average: CPU all 0 1 %user 0.00 0.00 0.00 %nice %system %iowait 0.00 0.00 0.
whether this amount of memory is a cause of memory bottlenecks. For detailed information, use the pmap -d option.
There are many other useful options. Please check the man page. The following example displays sample output of socket information.
Example 2-21 Showing socket information with netstat
[root@lnxsu5 ~]# netstat -natuw
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address    Foreign Address
tcp   0      0      0.0.0.0:111      0.0.0.0:*
tcp   0      0      127.0.0.1:25     0.0.0.0:*
tcp   0      0      127.0.0.1:2207   0.0.0.0:*
tcp   0      0      127.0.0.1:36285  127.0.0.
Figure 2-2 iptraf output of TCP/IP statistics by protocol
Figure 2-3 iptraf output of TCP/IP traffic statistics by packet size
2.3.13 tcpdump / ethereal
tcpdump and ethereal are used to capture and analyze network traffic. Both tools use the libpcap library to capture packets. They monitor all the traffic on a network adapter in promiscuous mode and capture all the frames the adapter has received.
You can use these tools to dig into network-related problems. You can find TCP/IP retransmissions, window size scaling, name resolution problems, network misconfiguration, etc. Just keep in mind that these tools can monitor only the frames the network adapter has received, not the entire network traffic.
tcpdump
tcpdump is a simple but robust utility.
Example 2-22 Example of tcpdump output
21:11:49.555340 10.1.1.1.2542 > 66.218.71.102.http: S 2657782764:2657782764(0) win 65535 (DF)
21:11:49.671811 66.218.71.102.http > 10.1.1.1.2542: S 2174620199:2174620199(0) ack 2657782765 win 65535
21:11:51.211869 10.1.1.18.2543 > 216.239.57.99.
21:11:51.332371
21:11:56.972822
21:11:57.133615
21:11:57.656919
21:11:57.818058
Figure 2-4 ethereal GUI
2.3.14 nmon
nmon, short for Nigel's Monitor, is a popular tool for monitoring Linux system performance, developed by Nigel Griffiths. Since nmon incorporates the performance information for several subsystems, it can be used as a single source for performance monitoring.
Example 2-23 Using nmon to record performance data
# nmon -f -s 30 -c 120
The output of the above command will be stored in a text file in the current directory named _date_time.nmon. With -s 30 and -c 120, nmon takes 120 snapshots at 30-second intervals, so the recording covers one hour.
For more information on nmon we suggest you visit http://www-941.haw.ibm.com/collaboration/wiki/display/WikiPtype/nmon
In order to download nmon, visit http://www.ibm.com/collaboration/wiki/display/WikiPtype/nmonanalyser
2.3.
Example 2-25 Output of strace counting for system time
[root@lnxsu4 ~]# strace -c find /etc -name httpd.conf
/etc/httpd/conf/httpd.conf
Process 3563 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 25.12    0.026714          12      2203           getdents64
 25.09    0.026689           8      3302           lstat64
 17.20    0.018296           8      2199           chdir
  9.05    0.009623           9      1109           open
  8.06    0.008577           8      1108           close
  7.50    0.007979           7      1108           fstat64
  7.
bus
This subdirectory contains information about the bus subsystems such as the PCI bus or the USB interface of the respective system.
irq
The irq subdirectory contains information about the interrupts in a system. Each subdirectory in this directory refers to an interrupt and possibly to an attached device such as a network interface card.
Figure 2-6 Default KDE System Guard window
The graphical front end (Figure 2-6) uses sensors to retrieve the information it displays. A sensor can return simple values or more complex information such as tables. For each type of information, one or more displays are provided. Displays are organized in worksheets that can be saved and loaded independently of each other.
Figure 2-7 KDE System Guard sensor browser
System Load
The System Load worksheet shows four sensor windows: CPU Load, Load Average (1 Min), Physical Memory, and Swap Memory. Multiple sensors can be displayed in one window. To see which sensors are being monitored in a window, mouse over the graph and descriptive text will appear. You can also right-click the graph and click Properties, then click the Sensors tab (Figure 2-8).
Process Table
Clicking the Process Table tab displays information about all running processes on the server (Figure 2-9). The table, by default, is sorted by System CPU utilization, but this can be changed by clicking another one of the headings.
Figure 2-9 Process Table view
Configuring a worksheet
For your environment or the particular area that you wish to monitor, you might have to use different sensors for monitoring.
Note: The fastest update interval that can be defined is two seconds.
Figure 2-11 Empty worksheet
3. Fill in the sensor boxes by dragging the sensors on the left side of the window to the desired box on the right. The types of display are:
– Signal Plotter: This displays samples of one or more sensors over time. If several sensors are displayed, the values are layered in different colors.
Figure 2-12 Example worksheet
Find more information about KDE System Guard at: http://docs.kde.org/
2.3.18 Gnome System Monitor
Although not as powerful as the KDE System Guard, the Gnome desktop environment features a graphical performance analysis tool. The Gnome System Monitor can display performance-relevant system resources as graphs for visualizing possible peaks and bottlenecks. Note that all statistics are generated in real time.
Figure 2-13 The task list in the IBM Director Console
Drag and drop the icon for Monitor Activator over a single system or a group of systems that have the Capacity Manager package installed. A window opens (Figure 2-14) in which you can select the various subsystems to be monitored over time. Capacity Manager for Linux does not yet support the full-feature set of available performance counters.
Figure 2-15 Scheduling reports
In a production environment, it is a good idea to have Capacity Manager generate reports on a regular basis. Our experience is that weekly reports that are generated in off-hours over the weekend can be very valuable. A report is generated either immediately or on a schedule, according to your choice.
Figure 2-16 A sample Capacity Manager report
The Report Viewer window enables you to select the different performance counters that were collected and correlate this data to a single system or to a selection of systems. Data acquired by Capacity Manager can be exported to an HTML or XML file to be displayed on an intranet Web server or for future analysis.
2.4 Benchmark tools
In this section, we introduce some of the major benchmark tools.
platform shares many of the technologies available for desktop computers. Server benchmarks spawn multiple threads in order to utilize the SMP capabilities of the system and in order to simulate a true multi user environment. While a PC might start one web browser faster than a high-end server, the server will start a thousand web browsers faster than a PC.
make see: Finally, after a minimum of three runs, the results can be viewed using the make see command. The results will be displayed and can be copied to a spreadsheet application for further analysis or graphical representation of the data.
The LMbench benchmark can be found at http://sourceforge.net/projects/lmbench/
2.4.
Note: Any benchmark using files that fit into the system's memory and that are stored on asynchronous file systems will measure the memory throughput rather than the disk subsystem performance. Hence you should either mount the file system of interest with the sync option or use a file size roughly twice the size of the system's memory.
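The sizing rule in the note can be sketched as a small shell helper. This is only an illustration: the function name bench_file_size_kb is hypothetical, and reading MemTotal from /proc/meminfo is an assumption about how you obtain the memory size.

```shell
#!/bin/sh
# Hypothetical helper: suggest a benchmark file size of twice the RAM size.
# $1 = total memory in kB
bench_file_size_kb() {
    echo $(( $1 * 2 ))
}

# On Linux, MemTotal in /proc/meminfo reports physical memory in kB.
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$mem_kb" ]; then
        echo "Memory: ${mem_kb} kB; suggested test file size: $(bench_file_size_kb "$mem_kb") kB"
    fi
fi
```

For a system with 4 GB of memory (4194304 kB), the helper suggests a test file of 8388608 kB.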
netserver, generates network traffic, and gets the result from netserver via a control connection that is separate from the actual benchmark traffic connection. During the benchmarking, no communication occurs on the control connection, so it does not have any effect on the result. The netperf benchmark tool also has a reporting capability, including a CPU utilization report. The current stable version is 2.4.3 at the time of writing.
-l Test length of benchmarking. If a positive value is set, netperf performs the benchmarking for testlen seconds. If the value is negative, netperf runs until testlen bytes of data have been exchanged for bulk data transfer benchmarking, or until testlen transactions have completed for the request/response type.
-c Local CPU utilization report
-C Remote CPU utilization report

Note: The CPU utilization report may not be accurate on some platforms.
16384 2048 87380 1024 64 1 60.00 3830.65 25.27 10.16 131.928 53.039

When you perform benchmarking, it is wise to use the sample test scripts that come with netperf. By changing some variables in the scripts, you can adapt the benchmarking to your needs. The scripts are in the doc/examples/ directory of the netperf package. For more details, refer to http://www.netperf.org/

2.4.
Chapter 3. Analyzing performance bottlenecks

This chapter is useful for finding a performance problem that may be already affecting one of your servers. We outline a series of steps to lead you to a concrete solution that you can implement to restore the server to an acceptable performance level.

The topics that are covered in this chapter are:
3.1, “Identifying bottlenecks”
3.2, “CPU bottlenecks”
3.
3.1 Identifying bottlenecks

The following steps are used as our quick tuning strategy:
1. Know your system.
2. Back up the system.
3. Monitor and analyze the system’s performance.
4. Narrow down the bottleneck and find its cause.
5. Fix the bottleneck cause by trying only one single change at a time.
6. Go back to step 3 until you are satisfied with the performance of the system.
The fact that the problem can be reproduced enables you to see and understand it better. Document the sequence of actions that are necessary to reproduce the problem:
– What are the steps to reproduce the problem? Knowing the steps may help you reproduce the same problem on a different machine under the same conditions.
3.1.2 Analyzing the server’s performance

Important: Before taking any troubleshooting actions, back up all data and the configuration information to prevent a partial or complete loss.

At this point, you should begin monitoring the server. The simplest way is to run monitoring tools from the server that is being analyzed. (See Chapter 2, “Monitoring and benchmark tools” on page 39, for information.
3.2 CPU bottlenecks

For servers whose primary role is that of an application or database server, the CPU is a critical resource and can often be a source of performance bottlenecks. It is important to note that high CPU utilization does not always mean that a CPU is busy doing work; it may, in fact, be waiting on another subsystem.
monitoring it, the CPU load will appear to be very balanced and not necessarily peaking on any CPU. Affinity is also useful in NUMA-based systems such as the IBM System x 3950, where it is important to keep memory, cache, and CPU access local to one another.

3.2.3 Performance tuning options

The first step is to ensure that the system performance problem is being caused by the CPU and not one of the other subsystems.
Figure 3-1 KDE System Guard memory monitoring

The indicators in Table 3-1 can also help you define a problem with memory.

Table 3-1 Indicator for memory analysis

Memory indicator    Analysis
Memory available    This indicates how much physical memory is available for use. If, after you start your application, this value has decreased significantly, you may have a memory leak.
A process behaves poorly.

Paging can be a serious performance problem when the number of free memory pages falls below the specified minimum, because the paging mechanism is not able to handle the requests for physical memory pages and the swap mechanism is called to free more pages. This significantly increases I/O to disk and will quickly degrade a server’s performance.
Disk I/O can take a relatively long time and disk queues will become full, so the CPUs will be idle or have low utilization because they wait for long periods before processing the next request.

The disk subsystem is perhaps the most challenging subsystem to configure properly.
Changes made to the elevator algorithm as described in 4.6.2, “I/O elevator tuning and selection” on page 116 will be seen in avgrq-sz (average size of request) and avgqu-sz (average queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz will decrease. You can also monitor the rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk can manage.

3.4.
Figure 3-2 KDE System Guard network monitoring

It is important to remember that there are many possible reasons for these performance problems and that sometimes problems occur simultaneously, making it even more difficult to pinpoint the origin. The indicators in Table 3-3 can help you determine the problem with your network.
3.5.2 Performance tuning options

These steps illustrate what you should do to solve problems related to network bottlenecks:
• Ensure that the network card configuration matches router and switch configurations (for example, frame size).
• Modify how your subnets are organized.
• Use faster network cards.
• Tune the appropriate IPV4 TCP kernel parameters. (See Chapter 4, “Tuning the operating system” on page 91.
Chapter 4. Tuning the operating system

By their nature and heritage, the Linux distributions and the Linux kernel offer a variety of parameters and settings to let the Linux administrator tweak the system to maximize performance. As stated earlier in this Redpaper, there is sadly no magic tuning knob that will improve system performance for every application.
4.1 Tuning principles

Tuning any system should follow some rather simple principles, of which the most important is change management, as described below.

Generally, the first step in systems tuning should be to analyze and evaluate the current system configuration. Ensuring that the system performs as stated by the hardware manufacturer and that all devices are running in their optimal mode will create a solid base for any later tuning.
What flavor and version of Linux do I need?

After you have collected the business and application requirements, decide which version of Linux to use. Enterprises often have contractual agreements that allow the general use of a specific Linux distribution. In this case, financial and contractual benefits will most likely dictate the version of Linux that can be used.
suggest that you select the respective file system based on these requirements. Refer to 4.6, “Tuning the disk subsystem” for detailed selection criteria.

Package selection: minimal or everything?

During an installation of Linux, administrators are faced with the decision of a minimal-or-everything installation approach. Philosophies differ somewhat in this area.
In addition, with dmesg, you can determine what hardware is installed on your server. During every boot, Linux checks your hardware and logs information about it. You can view these logs using the command /bin/dmesg.

Example 4-1 partial output from dmesg

Linux version 2.6.18-8.el5 (brewbuilder@ls20-bc1-14.build.redhat.com) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP
Command line:
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[76180000] dma_mask[64-bit]
EXT3 FS on dm-0, internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS on sda1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
In addition, make sure that the default pam configuration file (/etc/pam.d/system-auth for Red Hat Enterprise Linux, /etc/pam.d/common-session for SUSE Linux Enterprise Server) has the following entry:

session required pam_limits.so

This entry is required so that the system can enforce these limits. For the complete syntax of the ulimit command, issue:

ulimit -?

4.2.
Daemons       Description
hpoj          HP OfficeJet support. Do not disable if you plan to use an HP OfficeJet printer with your server.
irqbalance    Balances interrupts between multiple processors. You may safely disable this daemon on a single-CPU system or if you plan to balance IRQs statically.
isdn          ISDN modem support. Do not disable if you plan to use an ISDN modem with your server.
kudzu         Detects and configures new hardware.
Tip: Instead of wasting precious time waiting for a reboot to complete, simply change the run level to 1 and back to 3 or 5, respectively.

There is another useful system command, /sbin/service, that enables an administrator to immediately change the status of any registered service.
Figure 4-2 The System Services panel in YaST

In the YaST panel in Figure 4-2, various services can be enabled or disabled on a per run level basis. However, this requires the use of expert mode, as displayed at the top of Figure 4-2.
Changing runlevels

Whenever possible, do not run the graphical user interface on a Linux server. Normally, there is no need for a GUI on a Linux server, as most Linux administrators will happily assure you. All administrative tasks can be achieved efficiently via the command line, by redirecting the X display, or through a Web browser interface.
... (lines not displayed)
# The default runlevel is defined here
id:3:initdefault:

(Figure callout: To start Linux without starting the GUI, set the run level to 3.)

# First script to be executed, if not booting in emergency (-b) mode
si::bootwait:/etc/init.d/boot
# /etc/init.
policy model that overcomes the limitations of the standard discretionary access model employed by Linux. SELinux enforces security on user and process levels; hence a security flaw of any given process affects only the resources allocated to this process and not the entire system. SELinux works similarly to a virtual machine.
# disabled - SELinux is fully disabled.
SELINUX=disabled
# SELINUXTYPE= type of policy in use. Possible values are:
# targeted - Only targeted network daemons are protected.
# strict - Full SELinux protection.
SELINUXTYPE=targeted

If you decide to use SELinux with your Linux-based server, its settings can be tweaked to better accommodate your environment.
file system provides an interface to the running kernel that may be used for monitoring purposes and for changing kernel settings on the fly. To view the current kernel configuration, choose a kernel parameter in the /proc/sys directory and use the cat command on the respective file. In Example 4-5 we query the system for its current memory overcommit strategy.
Figure 4-5 Red Hat kernel tuning

For Novell SUSE-based systems, YaST, and more specifically powertweak, is again the tool of choice for changing any kernel parameter.
Figure 4-6 The powertweak utility

The big advantage of powertweak over sysctl, for instance, is that all tuning parameters are presented with a short explanation. Note that all changes made with the help of powertweak will be stored under /etc/powertweak/tweaks.

4.3.1 Where the parameters are stored

The kernel parameters that control how the kernel behaves are stored in /proc (in particular, /proc/sys).
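The relationship between a dotted parameter name, as used with sysctl -w, and its file under /proc/sys can be sketched as follows. The function name sysctl_to_proc_path is hypothetical, and the sketch assumes that no component of the name itself contains a dot (which can happen with some interface names).

```shell
#!/bin/sh
# Hypothetical helper: map a dotted sysctl name to its file under /proc/sys.
# Assumption: no name component contains a literal dot.
sysctl_to_proc_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Show where the memory overcommit setting lives and, if readable, its value.
path=$(sysctl_to_proc_path vm.overcommit_memory)
echo "$path"
if [ -r "$path" ]; then
    cat "$path"
fi
```

Reading the file with cat and reading the parameter with sysctl should report the same value, because both are views of the same kernel setting.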
File/directory        Purpose
/proc/sys/fs/*        Used to increase the number of open files the OS allows and to handle quota.
/proc/sys/kernel/*    For tuning purposes, you can enable hotplug, manipulate shared memory, and specify the maximum number of PID files and level of debug in syslog.
/proc/sys/net/*       Tuning of network in general, IPV4 and IPV6.
/proc/sys/vm/*        Management of cache memory and buffer.

4.3.
4.4.1 Tuning process priority

As we stated in 1.1.4, “Process priority and nice level” on page 5, it is not possible to change the priority of a process directly. This is only indirectly possible through the use of the nice level of the process, but even this is not always possible. If a process is running too slowly, you can assign more CPU to it by giving it a lower nice level.
To help with spotting bottlenecks, statistics provided by the numastat tool are available in the /sys/devices/system/node/%{node number}/numastat file. High values in numa_miss and the other_node field signal a likely NUMA issue.
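One way to put these counters into proportion is to express numa_miss as a share of all allocations on a node. The following is a minimal sketch: the function name numa_miss_percent is hypothetical, and it assumes the numastat file consists of "name value" lines and that the per-node directories are named node0, node1, and so on.

```shell
#!/bin/sh
# Hypothetical helper: read numastat-style "name value" lines on stdin and
# print numa_miss as a percentage of numa_hit + numa_miss.
numa_miss_percent() {
    awk '
        $1 == "numa_hit"  { hit = $2 }
        $1 == "numa_miss" { miss = $2 }
        END {
            total = hit + miss
            if (total > 0) printf "%.1f\n", 100 * miss / total
            else print "0.0"
        }'
}

# Report the share for every node that exposes a numastat file.
for f in /sys/devices/system/node/node*/numastat; do
    [ -r "$f" ] || continue
    echo "$f: $(numa_miss_percent < "$f")% numa_miss"
done
```

What counts as a "high" percentage depends on the workload, so treat the output as a starting point for comparison between nodes rather than as a fixed threshold.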
parameter stored in /proc/sys/vm/dirty_ratio, the system administrator can define at what level the actual disk writes will take place. The value stored in dirty_ratio is a percentage of main memory. A value of 10 would hence mean that data will be written into system memory until the file system cache has a size of 10% of the server’s RAM.
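The percentage can be turned into an absolute figure with simple arithmetic. The sketch below follows the simplified description above (a percentage of total RAM); the function name dirty_threshold_kb is hypothetical, and the kernel's own accounting may use a somewhat different memory base.

```shell
#!/bin/sh
# Hypothetical helper: approximate write-back threshold in kB.
# $1 = total memory in kB, $2 = dirty_ratio in percent
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Use live values when both files are readable (Linux only).
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ]; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    ratio=$(cat /proc/sys/vm/dirty_ratio)
    if [ -n "$mem_kb" ] && [ -n "$ratio" ]; then
        echo "dirty_ratio=${ratio}% of ${mem_kb} kB is about $(dirty_threshold_kb "$mem_kb" "$ratio") kB"
    fi
fi
```

With 8 GB of RAM (8388608 kB) and a dirty_ratio of 10, the threshold works out to roughly 838860 kB, or about 819 MB.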
/dev/sdd2 swap swap sw,pri=1 0 0

Swap partitions are used from the highest priority to the lowest (where 32767 is the highest and 0 is the lowest). Giving the same priority to the first three disks causes the data to be written to all three disks; the system does not wait until the first swap partition is full before it starts to write on the next partition.
In this section we cover the characteristics and tuning options of the standard file systems such as ReiserFS and Ext3, as well as the tuning potential found in the kernel 2.6 I/O elevators.

4.6.1 Hardware considerations before installing Linux

Minimum requirements for CPU speed and memory are well documented for current Linux distributions.
cache and the cache of the disk subsystem can no longer accommodate the amount or size of a read or write request, the physical disk spindles have to work.

Consider the following example. A disk device is able to handle 200 I/Os per second. You have an application that performs 4kB write requests at random locations on the file system, so streaming or request merging is not an option.
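The arithmetic behind this example can be sketched in shell. It assumes that every random 4kB request costs exactly one disk I/O and that nothing is served from cache; the function name max_random_throughput_kb is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on throughput for purely random requests.
# $1 = I/O operations per second the disk can sustain
# $2 = request size in kB
max_random_throughput_kb() {
    echo $(( $1 * $2 ))
}

# 200 I/Os per second at 4kB per request
max_random_throughput_kb 200 4   # prints 800 (kB per second)
```

Under these assumptions a single such disk delivers no more than about 800kB per second for this workload, far below its streaming rate, which is why the number of spindles matters for random I/O.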
Table 4-4 Linux partitions and server environments

Partition    Contents and possible server environments
/home        A file server environment would benefit from separating out /home to its own partition. This is the home directory for all users on the system; if no disk quotas are implemented, separating this directory should isolate a user’s runaway consumption of disk space.
cases the anticipatory elevator usually has the least throughput and the highest latency. The three other schedulers perform equally well up to an I/O size of roughly 16kB, at which point the CFQ and the NOOP elevators begin to outperform the deadline elevator (unless disk access is very seek intense), as can be seen in Figure 4-7.
Sometimes, however, the workload is limited not so much by the performance of the disk subsystem as by the performance of the CPU. Such a case could be a scientific workload or a data warehouse processing very complex queries. In such scenarios the NOOP elevator offers some advantage over the other elevators because it causes less CPU overhead, as shown in the following chart.
Figure 4-9 Impact of nr_requests on the Deadline elevator (random write ReiserFS) [chart: throughput in kB/sec against I/O size in kB/op for several nr_requests settings]

A larger request queue may offer higher throughput for workloads that write many small files.
Figure 4-10 Impact of nr_requests on the CFQ elevator (random write Ext3) [chart: throughput in kB/sec against I/O size in kB/op for several nr_requests settings]

It is important to point out that the current enterprise distributions from Red Hat and Novell offer the possibility to set nr_requests on a per disk subsystem basis.
4.6.3 File system selection and tuning

As stated in 1.3, “Linux file systems” on page 15, the different file systems that are available for Linux have been designed with different workload and availability characteristics in mind.
Figure 4-12 Random write throughput comparison between Ext3 and ReiserFS (asynchronous) [chart: throughput in kB/sec against I/O size in kB/op for ReiserFS, Ext3, and Ext2]

Using ionice to assign I/O priority

A new feature of the CFQ I/O elevator is the possibility to assign priorities at the process level.
Example 4-15 ionice command

# ionice -c3 -p113

Access time updates

The Linux file system keeps records of when files are created, updated, and accessed. Default operations include updating the last-time-read attribute for files during reads and writes to files. Because writing is an expensive operation, eliminating unnecessary I/O can lead to overall improved performance.
Figure 4-13 Random write performance impact of data=writeback [chart: throughput in kB/sec against I/O size in kB/op for data=ordered and data=writeback]

There are three ways to change the journaling mode on a file system:

When executing the mount command:

mount -o data=writeback /dev/sdb1 /mnt/mountpoint

• /dev/sdb1 is the file system being mounted.
Streaming and sequential content usually benefits from large stripe sizes by reducing disk head seek time and improving throughput, but the more random type of activity, such as that found in databases, performs better with a stripe size that is equivalent to the record size.

4.
netstat, tcpdump, and ethereal are useful tools to get more accurate characteristics. (Refer to 2.3.11, “netstat” on page 53 and 2.3.13, “tcpdump / ethereal” on page 55.)

4.7.2 Speed and duplexing

It may sound trivial, but one of the easiest ways to improve network performance is to check the actual speed of the network interface, because there can be issues between network components (such as switches or hubs) and the network interface cards.
change the configuration, you can use ethtool if the device driver supports the ethtool command. You may have to change /etc/modules.conf for some device drivers.

4.7.3 MTU size

Especially in Gigabit networks, large maximum transmission unit (MTU) sizes (also known as jumbo frames) may provide better network performance.
125Mbytes/sec (1Gbit/sec) * 1msec = 125Kbytes

As the default values of rmem_max and wmem_max are about 128Kbytes in most enterprise distributions, they may be sufficient for a low-latency, general-purpose network environment. However, if the latency is large, the default size may be too small.
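The same calculation can be repeated for other link speeds and latencies with a small shell function. This is a minimal sketch: the function name bdp_bytes is hypothetical, and the 50 msec latency is an arbitrary illustration of a high-latency link, not a measured value.

```shell
#!/bin/sh
# Hypothetical helper: bandwidth-delay product in bytes.
# $1 = bandwidth in Mbit/sec, $2 = latency in msec
# Mbit/sec * 1000000 / 8 gives bytes/sec; multiplying by msec / 1000
# leaves a combined factor of 125.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

bdp_bytes 1000 1     # 1Gbit/sec at 1msec:  prints 125000
bdp_bytes 1000 50    # 1Gbit/sec at 50msec: prints 6250000
```

In the second case the product is about 6 Mbytes, far above a 128Kbyte default, which illustrates why rmem_max and wmem_max may need to be raised on high-latency paths.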
22:01:25.515601 IP plnxsu4.34088 > plnxsu5.40500: .
22:01:25.515610 IP plnxsu4.34088 > plnxsu5.40500: .
22:01:25.515617 IP plnxsu4.34088 > plnxsu5.40500: .
22:01:25.515707 IP plnxsu5.40500 > plnxsu4.34088: .
22:01:25.515714 IP plnxsu5.40500 > plnxsu4.34088: .
22:01:25.515764 IP plnxsu5.40500 > plnxsu4.34088: .
22:01:25.515768 IP plnxsu5.40500 > plnxsu4.34088: .
22:01:25.515774 IP plnxsu5.40500 > plnxsu4.34088: .
sysctl -w net.ipv4.conf.lo.accept_source_route=0
sysctl -w net.ipv4.conf.default.accept_source_route=0
sysctl -w net.ipv4.conf.all.accept_source_route=0

The following commands configure the server to accept redirects only from machines that are listed as gateways. Redirects can be used to perform attacks, so we only want to allow them from trusted sources:

sysctl -w net.ipv4.conf.eth0.secure_redirects=1
sysctl -w net.ipv4.conf.lo.
Tuning TCP behavior

Here we describe some of the tuning parameters that change TCP behavior. The following commands can be used for tuning servers that support a large number of simultaneous connections:

For servers that receive many connections at the same time, the TIME-WAIT sockets for new connections can be reused. This is useful in Web servers, for example:

sysctl -w net.ipv4.
should be changed only after careful monitoring, as there is a risk of overflowing memory because of the number of dead sockets:

sysctl -w net.ipv4.tcp_fin_timeout=30

One of the problems found in servers with many simultaneous TCP connections is the large number of connections that are open but unused. TCP has a keepalive function that probes these connections and, by default, drops them after 7200 seconds (2 hours).
4.7.6 Performance impact of Netfilter

As Netfilter provides TCP/IP connection tracking and packet filtering capability (refer to “Netfilter” on page 29), in certain circumstances it may have a large performance impact. The impact is clearly visible when the number of connection establishments is high. Figure 4-18 and Figure 4-19 show benchmark results with large and small connection establishment counts.
However, Netfilter provides packet filtering capability and enhances network security. It can be a trade-off between security and performance. The size of the Netfilter performance impact depends on the following factors:
• Number of rules
• Order of rules
• Complexity of rules
• Connection tracking level (depends on protocols)
• Netfilter kernel parameter configuration

4.7.7 Offload configuration

As we described in 1.5.
Figure 4-20 CPU usage improvement by offloading [chart "cpu usage improvement - default vs offload off": CPU usage improvement (%) against recv data size, one series per socket size in bytes]

However, a slight performance degradation is observed when using offloading (Figure 4-21).
LAN adapters are efficient when network applications requesting data generate requests for large frames. Applications that request small blocks of data require the LAN adapter communication processor to spend a larger percentage of time executing overhead code for every byte of data transmitted. This is why most LAN adapters cannot sustain full wire speed for all frame sizes. Refer to Tuning IBM System x Servers for Performance, SG24-5287.
After obtaining the interrupt number, you can use the smp_affinity parameter found in /proc/irq/%{irq number} to tie an interrupt to a CPU. Example 4-25 illustrates this for the above output of interrupt 169 of eth1 being bound to the second processor in the system.

Example 4-25 Setting the CPU affinity of an interrupt

[root@linux ~]# echo 02 > /proc/irq/169/smp_affinity
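The value written to smp_affinity is a hexadecimal bitmask in which bit n selects CPU n, counting from zero; this is why the second processor corresponds to 02. A minimal sketch of computing the mask for a single CPU follows. The function name cpu_affinity_mask is hypothetical, and the sketch does not cover masks spanning several CPUs or systems with very large CPU counts.

```shell
#!/bin/sh
# Hypothetical helper: hexadecimal affinity mask for one CPU.
# $1 = zero-based CPU index (0 = first processor, 1 = second, ...)
cpu_affinity_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

cpu_affinity_mask 0   # first processor:  prints 1
cpu_affinity_mask 1   # second processor: prints 2
cpu_affinity_mask 4   # fifth processor:  prints 10
```

The printed value can then be written to /proc/irq/%{irq number}/smp_affinity as root, in the same way as in Example 4-25.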
Appendix A. Testing configurations

This appendix lists the hardware and software configurations used to load and test various tuning parameters, monitoring software, and benchmark runs.
Hardware and software configurations

The tests, tuning modifications, benchmark runs, and monitoring performed for this redpaper were executed with Linux installed on two different hardware platforms:
• Guest on IBM z/VM systems
• Native on IBM System x servers

Linux installed on guest IBM z/VM systems

IBM z/VM V5.2.0 was installed on an LPAR on an IBM z9 processor. Installed z/VM components were tcpip, dirmaint, rscs, pvm, and vswitch.
/perf (RAID 5EE, 4*74GB) ReiserFS 200 GB Ext3 200 GB Ext3 200 GB
Related publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this Redpaper.

IBM Redbooks

For information about ordering these publications, see “How to get IBM Redbooks” on page 147. Note that some of the documents referenced here may be available in softcopy only.
http://www.faqs.org/docs/securing/index.html
Linux 2.6 Performance in the Corporate Data Center
http://www.osdl.org/docs/linux_2_6_datacenter_performance.pdf
Developer of ReiserFS
http://www.namesys.com
New features of V2.6 kernel
http://www.infoworld.com/infoworld/article/04/01/30/05FElinux_1.html
WebServing on 2.4 and 2.6
http://www.ibm.com/developerworks/linux/library/l-web26/
man page about the ab command
http://cmpp.linuxforum.
Information about EM64T
http://www.intel.com/technology/64bitextensions/

How to get IBM Redbooks

You can search for, view, or download Redbooks, Redpapers, Hints and Tips, draft publications and Additional materials, as well as order hardcopy Redbooks or CD-ROMs, at this Web site:
ibm.com/redbooks

Help from IBM

IBM Support and downloads
ibm.com/support
IBM Global Services
ibm.
Back cover

Linux Performance and Tuning Guidelines

Redpaper

Operating system tuning methods
Performance monitoring tools
Performance analysis

IBM® has embraced Linux, and it is recognized as an operating system suitable for enterprise-level applications running on IBM systems. Most enterprise applications are now available on Linux, including file and print servers, database servers, Web servers, and collaboration and mail servers.