HP Scalable File Share User's Guide G3.
© Copyright 2008 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
About This Document

This document provides installation and configuration information for HP Scalable File Share (SFS) G3.0-0. Overviews of installing and configuring the Lustre® File System and MSA2000 Storage Arrays are also included in this document. Pointers to existing documents are provided where possible. Refer to those documents for related information.

Intended Audience

This document is intended for anyone who installs and uses HP SFS.
WARNING    A warning calls attention to important information that, if not understood or followed, will result in personal injury or nonrecoverable system problems.

CAUTION    A caution calls attention to important information that, if not understood or followed, will result in data loss, data corruption, or damage to hardware or software.

IMPORTANT  This alert provides essential information to explain a concept or to complete a task.
•   HP StorageWorks Scalable File Share for SFS20 Enclosure Hardware Installation Guide Version 2.2 at:
    http://docs.hp.com/en/8958/HP_StorageWorks_SFS_SFS20_Installation_Guide_V2_2–0.pdf

Structure of This Document

This document is organized as follows:
Chapter 1   Provides information about what is included in this product.
Chapter 2   Provides information about installing and configuring MSA2000fc arrays.
1 What's In This Version

1.1 About This Product

HP SFS G3.0-0 uses the Lustre File System on MSA2000fc hardware to provide a storage system for standalone servers or compute clusters. Currently, HP SFS servers V2.3 and earlier cannot be upgraded because G3.0-0 supports MSA2000 storage only. Contact your HP representative for updates on the status of upgrade support for both servers and clients.

1.2 Benefits and Features
•   Keyboard, video, and mouse (KVM) switch
•   TFT console display

All of the DL380 G5 file system servers must have their eth0 Ethernet interfaces connected to the ProCurve switch, which makes up an internal Ethernet network. The iLOs for the DL380 G5 servers should also be connected to the ProCurve switch to enable Heartbeat failover power control operations. HP recommends that at least two nodes have Ethernet interfaces connected to an external network.
Figure 1-2 MDS and Administration Server

Figure 1-2 shows a block diagram of an MDS server and an administration server with two MSA2000fc enclosures. The network configuration should be adapted to your requirements.
Figure 1-3 OSS Server

Figure 1-3 shows a block diagram of a pair of OSS servers with two HP MSA2000fc enclosures.

1.3.1.1 Fibre Channel Switch Zoning

If your Fibre Channel is configured with a single Fibre Channel switch connected to more than one server node failover pair and its associated MSA2000 storage devices, you must set up zoning on the Fibre Channel switch. Most configurations are expected to require this zoning.
2 Installing and Configuring MSA2000fc Arrays

This chapter provides a summary of steps to install and configure MSA2000fc arrays for use in HP SFS G3.0-0 systems.

2.1 Installation

For detailed instructions on how to set up and install the MSA2000fc, see Chapter 4 of the HP StorageWorks 2012fc Modular Smart Array User Guide on the HP website at:
http://bizsupport.austin.hp.com/bc/docs/support/SupportManual/c01394283/c01394283.pdf
2.3.2 Creating New Volumes

To create new volumes on a set of MSA2000 arrays, follow these steps:
1.  Power on all the MSA2000 shelves.
2.  Define an alias. One way to execute commands on a set of arrays is to define a shell alias that calls /opt/hp/sfs/msa2000/msa2000cmd.pl for each array. The alias defines a shell for-loop that is terminated with ; done. For example:
    # alias forallmsas='for NN in `seq 101 2 119` ; do \
    ./msa2000cmd.pl 192.168.16.
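    As an illustrative sketch only (the array management addresses 192.168.16.101 through 192.168.16.119, odd addresses matching the seq range above, are assumptions for this example, and the full script path is used instead of a relative path), a complete alias of this form might be:

    # alias forallmsas='for NN in `seq 101 2 119` ; do /opt/hp/sfs/msa2000/msa2000cmd.pl 192.168.16.$NN '

    A command is then run on every array by appending the MSA2000 CLI arguments and the terminating ; done, for example:

    # forallmsas show vdisks ; done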
    # forostmsas create vdisk level raid6 disks 16-26 spare \
    27 vdisk2 ; done

    NOTE: For MGS and MDS nodes, HP SFS uses RAID10. An example MSA2000 CLI command to do this is:
    # formdsmsas create vdisk level raid10 disks 0-4:5-9 spare 10,11 vdisk1

    # forallmsas show vdisks ; done
    Make a note of the size of the vdisks and use that number in the following step.
5.  Create a volume to fill each vdisk. For example:
    # forostmsas create volume vdisk vdisk1 size size \
    mapping 0-1.
3 Installing and Configuring HP SFS Software on Server Nodes

This chapter provides information about installing and configuring HP SFS G3.0-0 Software on the Lustre file system server. The following list is an overview of the installation and configuration procedure for file system servers and clients. These steps are explained in detail in the following sections and chapters.
1.  Update firmware.
2.  Installation Phase 1
    a.  Choose an installation method.
For the minimum firmware versions supported, see Table 3-1. Upgrade the firmware versions, if necessary. You can download firmware from the HP IT Resource Center on the HP website at:
http://www.itrc.hp.com

Table 3-1 Minimum Firmware Versions

Component                         Minimum Firmware Version
HP J4903A ProCurve Switch 2824    I.10.43, 08/15/2007
MSA2000fc Storage Controller      Code Version J00P19, Memory Controller F300R21, Loader Version 15.010,
                                  Code Version W420R35, Loader Version 12.
•   Provide root password information.

These edits must be made, or the Kickstart process will halt, prompt for input, and/or fail. There are also some optional edits you can perform that make setting up the system easier, such as:
•   Setting the system name.
•   Configuring network devices.
•   Configuring ntp servers.
•   Setting the system networking configuration and name.
•   Setting the name server and ntp configuration.
## Template ADD bootloader --location=mbr --driveorder=%{ks_harddrive}
## Template ADD ignoredisk --drives=%{ks_ignoredisk}
## Template ADD clearpart --drives=%{ks_harddrive} --initlabel
## Template ADD part /boot --fstype ext3 --size=150 --ondisk=%{ks_harddrive}
## Template ADD part / --fstype ext3 --size=27991 --ondisk=%{ks_harddrive}
## Template ADD part swap --size=6144 --ondisk=%{ks_harddrive}

The following optional, but recommended, lines set up the name server and ntp server.
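The specific lines are site dependent. As an illustrative sketch only (the addresses, host name, and ntp server name below are placeholders, not values from this guide), a name server can be supplied through the Kickstart network directive and an ntp server can be configured in the %post section:

    network --bootproto=static --device=eth0 --ip=192.168.1.10 --netmask=255.255.255.0 --nameserver=192.168.1.1 --hostname=mynode1
    %post
    echo "server ntp.example.com" >> /etc/ntp.conf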
On another system, if it has not already been done, you must create and mount a Linux file system on the thumb drive. After you insert the thumb drive into the USB port of the system, examine the dmesg output on the system to determine the USB drive device name. The USB drive name is the first unused alphabetical device name of the form /dev/sd[a-z]1. There might be some /dev/sd* devices on your system already, some of which may map to MSA2000 drives.
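A hedged sketch of this procedure follows (the device name /dev/sdc1 and mount point /mnt/usb are assumptions; confirm the actual device name from the dmesg output before running a destructive command such as mkfs):

    # dmesg | tail                  # identify the newly attached USB device, for example sdc
    # mkfs.ext3 /dev/sdc1           # create a Linux file system on the thumb drive partition
    # mkdir -p /mnt/usb
    # mount /dev/sdc1 /mnt/usb      # mount the thumb drive so files can be copied to it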
%{nfs_server} must be replaced by the installation NFS server address or FQDN.
%{nfs_iso_path} must be replaced by the NFS path to the RHEL5U2 ISO directory.
%{post_image_dir} must be replaced by the NFS path to the HP SFS G3.0-0 ISO directory.
%{post_image} must be replaced by the name of the HP SFS G3.0-0 ISO file.

Each server node that is to be installed must be accessible over a network from an installation server that contains the Kickstart file, the RHEL5U2 ISO image, and the HP SFS G3.0-0 ISO image.
HOSTNAME=mynode1

3.5.2 Creating the /etc/hosts file

Create an /etc/hosts file with the names and IP addresses of all the Ethernet interfaces on each system in the file system cluster, including the following:
•   Internal interfaces
•   External interface
•   iLO interfaces
•   InfiniBand interfaces
•   Interfaces to the Fibre Channel switches
•   MSA2000 controllers
•   InfiniBand switches
•   Client nodes (optional)

This file should be propagated to all nodes in the file system cluster. An example entry for each interface type is sketched below.
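As a hedged illustration only (the node names, interface name suffixes, and IP addresses are assumptions used for illustration, not values from this guide), an /etc/hosts fragment for one server might look like the following. Note that the plain node name appears first on its line; the Heartbeat configuration script described in Chapter 5 requires this ordering.

    192.168.8.151   node1 node1-adm      # internal (eth0) interface
    10.10.1.151     node1-ext            # external interface
    192.168.9.151   node1-ilo            # iLO interface
    172.31.80.151   node1-ib0            # InfiniBand IPoIB interface
    192.168.16.101  msa1-ctrl-a          # MSA2000 controller A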
4 Installing and Configuring HP SFS Software on Client Nodes

This chapter provides information about installing and configuring HP SFS G3.0-0 Software on client nodes running RHEL5U2, SLES10 SP2, and HP XC V4.0.

4.1 Installation Requirements

HP SFS G3.0-0 Software supports file system clients running RHEL5U2 and SLES10 SP2, as well as HP XC V4.0 cluster clients. The HP SFS G3.0-0 Software tarball contains the latest supported Lustre client RPMs for these systems. Use the correct type for your system.
NOTE: The network address shown above is the InfiniBand IPoIB ib0 interface for the HP SFS G3.0-0 Management Server (MGS) node, which must be accessible from the client system: the client must be connected to the same InfiniBand fabric and have a compatible IPoIB IP address and netmask.

6.  Reboot the node; the Lustre file system is mounted on /testfs (a sketch of the corresponding mount entry follows this list).
7.  Repeat steps 1 through 6 for additional client nodes, using the appropriate node replication or installation tools available on your client cluster.
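For reference, a persistent Lustre client mount of this kind is typically expressed as an /etc/fstab entry. The following is a sketch only (the MGS NID 172.31.80.1@o2ib and the file system name testfs are assumptions; substitute the values for your MGS and file system):

    # /etc/fstab entry on the client (single line):
    172.31.80.1@o2ib:/testfs   /testfs   lustre   defaults,_netdev   0 0

    # equivalent manual mount:
    # mount -t lustre 172.31.80.1@o2ib:/testfs /testfs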
•   kernel-source-xxx RPM to go with the installed kernel

1.  Install the Lustre source RPM as provided on the HP SFS G3.0-0 Software tarball in the /opt/hp/sfs/SRPMS directory. Enter the following command on one line:
    # rpm -ivh lustre-source-1.6.6-2.6.18_92.1.10.el5_ \
    lustre.1.6.6smp.x86_64.rpm
2.  Change directories:
    # cd /usr/src/linux-xxx
3.  Copy in the /boot/config-xxx for the running/target kernel, and name it .config.
4.  Run the following:
    # make oldconfig
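As a hedged sketch of the overall sequence (the kernel source path and config file name are placeholders as above, the Lustre source location is an assumed install location, and the final configure and make steps follow the standard Lustre 1.6 source build procedure rather than steps confirmed by this guide):

    # rpm -ivh lustre-source-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm
    # cd /usr/src/linux-xxx
    # cp /boot/config-xxx .config
    # make oldconfig
    # cd /usr/src/lustre-1.6.6                    # assumed install location of the Lustre source
    # ./configure --with-linux=/usr/src/linux-xxx
    # make rpms                                   # builds the Lustre client RPMs for this kernel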
5 Using HP SFS Software

This chapter provides information about creating, configuring, and using the file system.

5.1 Creating a Lustre File System

The first required step is to create the Lustre file system configuration. At the lowest level, this is achieved with the mkfs.lustre command. However, HP recommends using the lustre_config command as described in section 6.1.2.3 of the Lustre 1.6 Operations Manual.
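For orientation, the following is a minimal sketch of the underlying mkfs.lustre commands for an MGS, an MDT, and one OST (the device names, the file system name testfs, and the MGS NID 172.31.80.1@o2ib are assumptions for illustration; lustre_config generates and runs equivalent commands from a CSV description of the whole file system):

    # mkfs.lustre --mgs /dev/mpath/mpath0
    # mkfs.lustre --fsname=testfs --mdt --mgsnode=172.31.80.1@o2ib /dev/mpath/mpath1
    # mkfs.lustre --fsname=testfs --ost --mgsnode=172.31.80.1@o2ib /dev/mpath/mpath2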
mpath5 (3600c0ff000d5455bc8c95f4801000000) dm-3 HP,MSA2212fc
[size=4.1T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=50][active]
 \_ 1:0:1:5 sdf 8:80  [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:1:5 sdb 8:16  [active][ready]
mpath4 (3600c0ff000d5467634ca5f4801000000) dm-2 HP,MSA2212fc
[size=4.
how Heartbeat is configured. Manual fail back can prevent system oscillation if, for example, a bad node reboots continuously. Heartbeat nodes send messages over the network interfaces to exchange status information and determine whether the other member of the failover pair is alive. The HP SFS G3.0-0 implementation sends these messages using IP multicast. Each failover pair uses a different IP multicast group.
NOTE: The gen_hb_config_files.pl script only works if the host names in the /etc/hosts file appear with the plain node name first, as follows:
192.168.8.151 node1 node1-adm
The script will not work if a hyphenated host name appears first. For example:
192.168.8.151 node1-adm node1

Example

The following example assumes a single OSS pair, nodes node5 and node6. Each node has four OSTs.
It is possible to generate the simple files ha.cf, haresources, and authkeys by hand if necessary. One set of ha.cf with haresources is needed for each failover pair. A single authkeys is suitable for all failover pairs.

ha.cf

The /etc/ha.d/ha.cf file for the example configuration is shown below:

use_logd yes
deadtime 10
initdead 60
mcast eth0 239.0.0.3 694 1 0
mcast ib0 239.0.0.3 694 1 0
node node5
node node6
stonith_host * external/riloe node5 node5_ilo_ipaddress ilo_login ilo_password 1 2.
5.2.3.2 Editing cib.xml

The haresources2cib.py script places a number of default values in the cib.xml file that are unsuitable for HP SFS G3.0-0.
•   By default, a server fails back to the primary node for that server when the primary node returns from a failure. If this behavior is not desired, change the value of the default-resource-stickiness attribute from 0 to INFINITY. Below is a sample of the line in cib.xml.
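    (The sample line is sketched here; the id value is an assumption, and the element follows the Heartbeat v2 CIB nvpair format.)

    <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="INFINITY"/>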
10 Resources configured.
•   Heartbeat uses iLO for STONITH I/O fencing. If a Heartbeat configuration has two nodes in a failover pair, Heartbeat expects both of those nodes to be up and running Heartbeat. If a node boots, starts Heartbeat, and does not see Heartbeat running on the other node within what it considers a reasonable time, it power-cycles the other node.

5.3 Starting the File System

After the file system has been created, it can be started.
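The start-up details are site specific. As a hedged sketch only, the targets are brought up in the order MGS, then MDS, then OSTs, either by starting Heartbeat on each failover pair (the normal method) or by mounting the Lustre targets manually (the mount points below are assumptions, following the examples used elsewhere in this guide):

    # service heartbeat start       # on each server of a failover pair (normal method)

    Manual alternative, in order:
    # mount /mnt/mgs                # on the MGS node
    # mount /mnt/mds                # on the MDS node
    # mount /mnt/ost0               # on each OSS node, for each of its OSTs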
5.5 Testing Your Configuration

The best way to sanity test your Lustre file system is to perform normal file system operations, using standard Linux shell commands such as df, cd, and ls. If you want to measure the performance of your installation, you can use your own application or the standard file system performance benchmarks described in Chapter 17, Benchmarking, of the Lustre 1.6 Operations Manual at:
http://manual.lustre.org/images/9/92/LustreManual_v1_14.pdf
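A brief, hedged sketch of such a sanity check (the mount point /testfs is an assumption; the dd test writes a 1GB scratch file that should be removed afterward):

    # df -h /testfs                                        # confirm the file system is mounted with the expected capacity
    # lfs df -h                                            # per-OST usage as seen from the client
    # dd if=/dev/zero of=/testfs/ddtest bs=1M count=1024   # simple streaming write test
    # rm /testfs/ddtest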
13 UP osc hpcsfsc-OST0004-osc hpcsfsc-mdtlov_UUID 5
14 UP osc hpcsfsc-OST0006-osc hpcsfsc-mdtlov_UUID 5
15 UP osc hpcsfsc-OST0007-osc hpcsfsc-mdtlov_UUID 5
16 UP osc hpcsfsc-OST0001-osc hpcsfsc-mdtlov_UUID 5
17 UP osc hpcsfsc-OST0002-osc hpcsfsc-mdtlov_UUID 5
18 UP osc hpcsfsc-OST0000-osc hpcsfsc-mdtlov_UUID 5
19 UP osc hpcsfsc-OST0003-osc hpcsfsc-mdtlov_UUID 5

Check the recovery status on an MDS or OSS server as follows:
# cat /proc/fs/lustre/*/*/recovery_status
INACTIVE
This displays INACTIVE if no recovery is in progress.
    Run this command on each server node for all the mpaths that the node normally mounts.
4.  Run chkconfig heartbeat off on all server nodes and reboot them.
5.  Restart the file system as described in section 5.3, in this order: MGS, MDS, OSTs.

5.5.1.2 On the Client

Use the following command on a client to check whether the client can communicate properly with the MDS node:
# lfs check mds
testfs-MDT0000-mdc-ffff81012833ec00 active
Use the following command to check the OSTs, or all servers (both the MDS and OSTs).
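A hedged sketch of such a check, using the standard lfs check interface (the device names in the output are illustrative, not output captured from this configuration):

# lfs check servers
testfs-MDT0000-mdc-ffff81012833ec00 active
testfs-OST0000-osc-ffff81012833ec00 active
testfs-OST0001-osc-ffff81012833ec00 active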
Also see man collectl.
6 Licensing Information

When you ordered the licenses for your HP SFS G3.0-0 system, you received letters from HP with the title License-To-Use. There is one License-To-Use letter for each HP SFS G3.0-0 license that you purchased. Retain these letters for future use.
7 Known Issues and Workarounds

The following items are known issues and workarounds.

7.1 Server Reboot

After the server reboots, it checks the file system and reboots again:
/boot: check forced
This message can be ignored.

7.2 Errors from install2

You might receive the following errors when running install2. They can be ignored.
4.  Manually mount mgs on the MGS node:
    # mount /mnt/mgs
5.  Manually mount mds on the MDS node:
    # mount /mnt/mds
    In the MDS /var/log/messages file, look for a message similar to the following:
    kernel: Lustre: Setting parameter testfs-MDT0000.mdt.group_upcall in log testfs-MDT0000
    This indicates the change has been made successfully.
6.  Unmount /mnt/mds and /mnt/mgs from the MDS and MGS nodes, respectively.
7.  Restart the SFS server in the normal way using Heartbeat.
A HP SFS G3.0-0 Performance

A.1 Benchmark Platform

HP SFS G3.0-0, based on Lustre File System Software, is designed to provide the performance and scalability needed for very large high-performance computing clusters. This appendix presents HP SFS G3.0-0 performance measurements. These measurements can also be used to estimate the I/O performance of HPC clusters and to specify their performance requirements.
The Lustre servers were DL380 G5s with two quad-core processors and 16GB of memory, running RHEL v5.1. These servers were configured in failover pairs using Heartbeat v2. Each server could see its own storage and that of its failover mate, but mounted only its own storage until failover. Figure A-2 shows more detail about the storage configuration. The storage comprised a number of HP MSA2212fc arrays. Each array had a redundant pair of RAID controllers with mirrored caches supporting failover.
Figure A-3 shows single stream performance for a single process writing and then reading a single 8GB file. The file was written in a directory with a stripe width of 1MB and stripe count as shown. The client cache was purged after the write and before the read.

Figure A-3 Single Stream Throughput

For a file written on a single OST (a single RAID volume), throughput is in the neighborhood of 200MB per second. As the stripe count is increased, spreading the load over more OSTs, throughput increases.
The test shown in Figure A-5 did not use direct I/O. Nevertheless, it shows the cost of client cache management on throughput. In this test, two processes on one client node each wrote 10GB. Initially, the writes proceeded at over 1.0GB per second. The data was sent to the servers, and the cache filled with the new data. At the point (14:10:14 in the graph) where the amount of data reached the cache limit imposed by Lustre (12GB), throughput dropped by about a third.
Figure A-6 Multi-Client Throughput Scaling

In general, Lustre scales quite well with additional OSS servers if the workload is evenly distributed over the OSTs and the load on the metadata server remains reasonable. Neither the stripe size nor the I/O size had much effect on throughput when each client wrote to or read from its own OST. Changing the stripe count for each file did have an effect, as shown in Figure A-7.
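Stripe count and stripe size are normally set per file or per directory with the lfs setstripe command. A hedged sketch (the directory name and values are illustrative; the option syntax is that of Lustre 1.6):

    # lfs setstripe -s 1m -c 4 /testfs/results      # 1MB stripe size across 4 OSTs
    # lfs getstripe /testfs/results                 # verify the layout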
A.4 One Shared File Frequently in HPC clusters, a number of clients share one file either for read or for write. For example, each of N clients could write 1/N'th of a large file as a contiguous segment. Throughput in such a case depends on the interaction of several parameters including the number of clients, number of OSTs, the stripe size, and the I/O size.
Another way to measure throughput is to average only over the time when all the clients are active. This is represented by the taller, narrower box in Figure A-8. Throughput calculated this way shows the system's capability, and the stragglers are ignored. This alternate calculation method is sometimes called "stonewalling". It can be accomplished in a number of ways. The test run is stopped as soon as the fastest client finishes (IOzone does this by default).
For workloads that require a lot of disk head movement relative to the amount of data moved, SAS disk drives provide a significant performance benefit. Random writes present additional complications beyond those involved in random reads. These additional complications are related to Lustre locking and the type of RAID used. Small random writes to a RAID6 volume require a read-modify-write sequence to update a portion of a RAID stripe and compute a new parity block.