SteelEye Protection Suite for Linux v8.
This document and the information herein are the property of SIOS Technology Corp. (previously known as SteelEye® Technology, Inc.), and all unauthorized use and reproduction is prohibited. SIOS Technology Corp. makes no warranties with respect to the contents of this document and reserves the right to revise this publication and make changes to the products described herein without prior notification. It is the policy of SIOS Technology Corp.
Table of Contents
Chapter 1: Introduction
Chapter 2: SPS Installation
Chapter 3: SteelEye LifeKeeper for Linux Introduction
Chapter 4: SteelEye DataKeeper for Linux Introduction
Troubleshooting
Index
Chapter 1: Introduction About SteelEye Protection Suite for Linux SteelEye Protection Suite (SPS) for Linux integrates high availability clustering with innovative data replication functionality in a single, enterprise-class solution. SPS for Linux Integrated Components SteelEye LifeKeeper is a complete fault-resilient software solution that provides high availability for your servers' file systems, applications, and processes. LifeKeeper does not require any customized, fault-tolerant hardware.
SPS Core Package Cluster
The SPS for Linux image file includes a core package cluster containing the following software packages:
l LifeKeeper (steeleye-lk). The LifeKeeper core packages provide recovery software for core system components, such as memory, CPUs, the operating system, the SCSI disk subsystem and file systems.
l LifeKeeper GUI (steeleye-lkGUI). The LifeKeeper GUI package provides a graphical user interface for LifeKeeper administration and monitoring.
SPS for Linux Installation Guide: Provides useful information for planning and setting up your SPS environment, installing and licensing SPS and configuring the LifeKeeper graphical user interface (GUI).
Configuration: Contains detailed information and instructions for configuring the LifeKeeper software on each server in your cluster.
Training
Technical Support
l Log a Case to report new incidents.
l View Cases to see all of your open and closed incidents.
l Review Top Solutions providing information on the most popular problem resolutions being viewed by our customers.
Contact SIOS Technology Corp. Support at support@us.sios.com to set up and activate your Self-Service Portal account. You can also contact SIOS Technology Corp. Support at:
1-877-457-5113 (Toll Free)
1-803-808-4270 (International)
Email: support@us.sios.com
Chapter 2: SPS Installation The SteelEye Protection Suite (SPS) Installation Guide contains information on how to plan and install your SPS environment. In addition to providing the necessary steps for setting up your server, storage device and network components, it includes details for configuring your LifeKeeper graphical user interface (GUI). Once you have completed the steps in this guide, you will be ready to configure your LifeKeeper and DataKeeper resources.
Answer Yes to each question in order to complete all the steps required by the installation image file.
SPS Core Package Cluster
The SPS for Linux image file includes a core package cluster containing the following software packages:
l LifeKeeper (steeleye-lk). The LifeKeeper core packages provide recovery software for core system components, such as memory, CPUs, the operating system, the SCSI disk subsystem and file systems.
l LifeKeeper GUI (steeleye-lkGUI).
Planning Your SPS Environment The following topics will assist in defining the SPS for Linux cluster environment. Mapping Server Configurations Document your server configuration using the following guidelines: 1. Determine the server names, processor types, memory and other I/O devices for your configuration. When you specify a backup server, you should ensure that the server you select has the capacity to perform the processing should a failure occur on the primary server. 2.
locked resources at any given time. LifeKeeper device locking is done at the Logical Unit (LUN) level. For active/active configurations, each hierarchy must access its own unique LUN. All hierarchies accessing a common LUN must be active (in-service) on the same server.
4. Determine your shared memory requirements.
Sample Configuration Map for LifeKeeper Pair
reside on a disk array subsystem (Redundant Array of Inexpensive Disks, or RAID). LifeKeeper supports a number of hardware RAID peripherals for use in LifeKeeper configurations. See Storage and Adapter Options for a list of the supported peripherals.
Storage and Adapter Options
Supported Storage Models Vendor Storage Model Certification Consan CRD5440 SIOS Technology Corp. testing CRD7220 (f/w 3.00) SIOS Technology Corp. testing DataCore SANsymphony SIOS Technology Corp. testing Dell 650F (CLARiiON) SIOS Technology Corp. testing Dell | EMC CX3−10c / CX3−40c / CX3−20c, CX3−80 / CX3−40(F) / CX3−20(F) Partner Testing Dell | EMC CX300 / CX600 / CX400 / CX700 / CX500 SIOS Technology Corp. testing PowerVault (w/ Dell PERC, LSI Logic MegaRAID) SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification EMC Symmetrix 3000 Series SIOS Technology Corp. testing Symmetrix 8000 Series Vendor support statement Symmetrix DMX / DMX2 Partner testing Symmetrix DMX3 / DMX4 Partner testing Symmetrix VMAX Series Partner testing CLARiiON CX200, CX400, CX500, CX600, and CX700 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification Fujitsu ETERNUS3000 (w/ PG-FC105, PG-FC106, or PGFC107), single path only Partner testing ETERNUS6000 (w/ PG-FC106), single path only Partner testing ETERNUS4000 Model 80 and Model 100 (w/ PGFC106, PG-FC107, or PG-FC202), single path only Partner testing FibreCAT S80 (See Storage and Adapter Configuration) Partner testing ETERNUS SX300 (w/ PG-FC106 or PG-FC107), multipath only Partner testing ETERNUS2000 Series: Model 50, Model 100, an
Supported Storage Models Vendor Storage Model Certification Hitachi Data Systems HDS RAID 700 (VSP) Partner testing HDS 7700 Vendor support statement HDS 5800 Vendor support statement HDS 9570V Partner testing HDS 9970V Partner testing HDS 9980V Partner testing AMS 500 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification HP/Compaq RA 4100 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification MA / RA 8000 SIOS Technology Corp. testing MSA1000 / MSA1500 (active/active and active/passive firmware configurations) SIOS Technology Corp. testing HP MSA1000 Small Business SAN Kit SIOS Technology Corp. testing HP P2000 G3 MSA FC (w/ DMMP on RHEL5.4) SIOS Technology Corp. testing HP P2000 G3 MSA SAS Partner testing HP P4000 / P4300 G2 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification IBM FAStT200 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification FAStT500 SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification DS4100 * Partner testing
Supported Storage Models Vendor Storage Model Certification DS4200 Partner testing DS4300 (FAStT600) * SIOS Technology Corp. testing DS4400 (FAStT700) * SIOS Technology Corp. testing DS4500 (FAStT900) * SIOS Technology Corp. testing DS4700 Partner testing DS4800 Partner testing DS4300 (FAStT600) SIOS Technology Corp. testing DS4400 (FAStT700) SIOS Technology Corp. testing DS5000 Partner testing ESS Model 800 * SIOS Technology Corp. testing DS6800 * SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification JetStor JetStor II SIOS Technology Corp. testing MicroNet Genesis One Vendor support statement MTI Gladiator 2550 Vendor support statement Gladiator 3550 Vendor support statement Gladiator 3600 Vendor support statement NEC iStorage M100 FC (single path) Partner testing NEC iStorage M10e / M300 / M500 FC (single path) Vendor support statement NEC iStorage S500 / S1500 / S2500 (single path) SIOS Technology Corp.
Supported Storage Models Vendor Storage Model Certification Sun StorEdge 3310 Partner testing StorEdge 3510 FC (w/ Sun StorEdge 2Gb PCI Single FC Network Adapter) Partner testing StorEdge 6130 FC (w/ Sun StorEdge 2Gb PCI Single FC Network Adapter) Partner testing StorageTek 2540 (w/ Sun StorageTek 4Gb PCI-E Dual Partner testing FC Host Bus Adapter or Sun StorageTek 4Gb PCI Dual FC Network Adapter TID Winchester Systems Xiotech MassCareRAID Partner testing MassCareRAIDⅡ Partner testing Flas
Supported Adapter Models Supported Adapter Models Adapter Type Adapter Model Certification Adaptec 2944 W, Adaptec 2944 UW, or Adaptec 2940 U2W SIOS Technology Corp. testing Compaq 64bit PCI Dual Channel Wide Ultra2 SCSI Adapter SIOS Technology Corp. testing Compaq SA 5i, 6i, 532, and 642 PCI Dual Channel Wide Ultra3 SCSI Adapters SIOS Technology Corp. testing Dell PERC 2/DC, PERC 4/DC SIOS Technology Corp.
Supported Adapter Models Adapter Type Fibre Channel Adapter Model Certification QLogic QLA 2100, QLogic QLA 2200, QLogic QLA 2340, QLogic QLA 200 (HP Q200) SIOS Technology Corp. testing HP StorageWorks 2GB 64-bit / 133MHz PCI-X to Fibre Channel Host Bus Adapter (FCA2214) SIOS Technology Corp. testing Compaq 64 bit / 66MHz Fibre Channel Host Bus Adapter 120186-B21 SIOS Technology Corp.
Setting Up Your SPS Environment Now that the requirements have been determined and LifeKeeper configuration has been mapped, components of this SPS environment can be set up. Note: Although it is possible to perform some setup tasks in a different sequence, this list is provided in the recommended sequence.
Verifying Network Configuration 1. Partition disks and LUNs. Because all disks placed under LifeKeeper protection must be partitioned, your shared disk arrays must now be configured into logical units, or LUNs. Use your disk array management software to perform this configuration. You should refer to your disk array software documentation for detailed instructions. Note: l Remember that LifeKeeper locks its disks at the LUN level. Therefore, one LUN may be adequate in an Active/Standby configuration.
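As a minimal illustration only (the device name /dev/sdb is hypothetical, and your disk array management software may handle this step for you), giving a shared LUN a single partition from one cluster node might look like the following:
# label the hypothetical shared LUN and create one partition spanning it
parted -s /dev/sdb mklabel msdos
parted -s /dev/sdb mkpart primary 0% 100%
# run partprobe on each server that shares the LUN so it re-reads the partition table
partprobe /dev/sdb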
VLAN Interface Support Matrix b. Change the server’s hostname using the Linux hostname command. c. Before continuing, you should ensure that the new hostname is resolvable by each server in the cluster (see the previous bullets). d. Run the following command on every server in the cluster to update LifeKeeper’s hostname. (Refer to lk_chg_value(1M) for details.) /opt/LifeKeeper/bin/lk_chg_value -o oldhostname -n newhostname e. Start LifeKeeper using the command: /etc/init.
Installing and Setting Up Database Applications If a failure occurs on the primary server, that IP address "switches" to the backup server. If you plan to configure resource hierarchies for switchable IP addresses, you must do the following on each server in the cluster:
l Verify that the computer name is correct and will not be changed.
l Verify that the switchable IP addresses are unique using the ping command.
l Edit the /etc/hosts file to add an entry for each switchable IP address.
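For illustration only (the address 192.168.10.150 and the name appsvc-vip are hypothetical), the checks above might look like this on each server:
# the planned switchable address should not answer if it is not yet in use
ping -c 3 192.168.10.150
# /etc/hosts entry for the switchable IP address
192.168.10.150   appsvc-vip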
Installing the SteelEye Protection Suite Software Install the SPS software on each server in the SPS configuration. Each SPS server must have the packages necessary to support your configuration requirements, including any optional SPS Recovery Kit packages. The SPS core package cluster and any optional recovery kits will be installed through the command line using the SPS Installation Image File (sps.img).
Installing the SPS Software
IMAGE_NAME is the name of the image
MOUNT_POINT is the path to mount location
2. Change to the sps.img mounted directory and type the following: ./setup
3. Text will appear explaining what is going to occur during the installation procedure. You will now be asked a series of questions where you will answer "y" for Yes or "n" for No. The type and sequence of the questions are dependent upon your Linux distribution. Read each question carefully to ensure a proper response.
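As a minimal sketch (the image path /root/sps.img and the mount point /mnt/sps are example values), mounting the installation image and starting the setup script looks like the following; the setup script then walks through the questions described in step 3:
# mount the installation image on a loopback device and run setup
mkdir -p /mnt/sps
mount -t iso9660 -o loop /root/sps.img /mnt/sps
cd /mnt/sps
./setup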
Obtaining and Installing the License SPS for Linux requires a unique license for each server. The license is a run-time license, which means that you can install SPS without it, but the license must be installed before you can successfully start and run the product. Note: If using newer hardware with RHEL 6.1, please see the IP Licensing Known Issues in the SPS for Linux Troubleshooting Section.
Primary Network Interface Change May Require a License Rehost 3. Ensure you have your LifeKeeper Entitlement ID (Authorization Code). You should have received an email with your software containing the Entitlement ID needed to obtain the license. 4. Obtain your licenses from the SIOS Technology Corp. Licensing Operations Portal. a. Using the system that has internet access, log in to the SIOS Technology Corp. Licensing Operations Portal. b. Select Manage Entitlements.
Internet/IP Licensing
For information regarding Internet/IP Licensing, please see the Known Issue in the SPS for Linux Troubleshooting section and Obtaining an Internet HOST ID.
Subscription Licensing
A subscription license is a time-limited license with renewal capability. Similar to an evaluation license, it will expire after a set amount of time unless renewed. This renewal process can be set up to renew automatically by following the procedure below.
l If ownership of the license certificate has changed, please contact SIOS Technology Corp. support personnel to have the certificate moved to the new owner. Once ownership has been moved, the automatic license renewal service will need to be updated with these new credentials by running the following command using the new User ID and Password:
/opt/LifeKeeper/bin/lmsubscribe --login
Obtaining an Internet HOST ID
Use lmhostid to obtain your machine's Internet Host ID.
Upgrading SPS
Upgrade from the older version to one of the two acceptable versions, then perform the upgrade to the current version. Note: If using lkbackup during your upgrade, see the lkbackup Known Issue for further information.
1. If you are upgrading an SPS cluster with only two nodes, proceed directly to Step 2. If you are upgrading an SPS cluster with greater than two nodes, switch all applications away from the server to be upgraded now.
upgrades, LifeKeeper should not be started when a different version or release is resident and running on another system in the cluster.
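If you choose to take a configuration backup with lkbackup before upgrading, as referenced in the note earlier in this section, a minimal sketch is shown below; the -c (create) option is an assumption here, so confirm the exact syntax against the lkbackup(1M) man page before relying on it:
# assumed invocation to create a LifeKeeper configuration backup archive
/opt/LifeKeeper/bin/lkbackup -c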
Chapter 3: SteelEye LifeKeeper for Linux Introduction SteelEye LifeKeeper for Linux provides high availability clustering for up to 32 nodes with many supported storage configurations, including shared storage (Fiber Channel SAN, iSCSI), network attached storage (NAS), host-based replication and integration with array-based SAN replication including HP Continuous Access.
LifeKeeper Core LifeKeeper Core LifeKeeper Core is composed of four major components: l LifeKeeper Core Software l File System, Generic Application, Raw I/O and IP Recovery Kit Software l LifeKeeper GUI Software l LifeKeeper Man Pages LifeKeeper Core Software The LifeKeeper Core Software consists of the following components: l LifeKeeper Configuration Database (LCD) - The LCD stores information about the LifeKeeperprotected resources.
File System, Generic Application, IP and RAW I/O Recovery Kit Software indicates the server has failed. l LifeKeeper Alarm Interface - The LifeKeeper Alarm Interface provides the infrastructure for triggering an event. The sendevent program is called by application daemons when a failure is detected in a LifeKeeper-protected resource. The sendevent program communicates with the LCD to determine if recovery scripts are available.
LifeKeeper GUI Software LifeKeeper GUI Software The LifeKeeper GUI is a client / server application developed using Java technology that provides a graphical administration interface to LifeKeeper and its configuration data. The LifeKeeper GUI client is implemented as both a stand-alone Java application and as a Java applet invoked from a web browser. LifeKeeper Man Pages The LifeKeeper Core reference manual pages for the LifeKeeper product.
Components Common to All LifeKeeper Configurations LifeKeeper automatically manages the unlocking of the disks from the failed server and the locking of the disks to the next available back-up server. 4. Shared communication. LifeKeeper can automatically manage switching of communications resources, such as TCP/IP addresses, allowing users to connect to the application regardless of where the application is currently active.
Active - Active Grouping In an active/active group, all servers are active processors, but they also serve as the backup server for resource hierarchies on other servers. In an active/standby group, the primary server is processing and any one of the backup servers can be configured to stand by in case of a failure on the primary server. The standby systems can be smaller, lower-performance systems, but they must have the processing capability to assure resource availability should the primary server fail.
Active - Standby Grouping AppB and AppC, however, have several grouping options because all four servers have access to the AppB and AppC shared resources. AppB and AppC could also be configured to failover to Server1 and/or Server2 as a third or even fourth backup system. Note: Because LifeKeeper applies locks at the disk level, only one of the four systems connected to the AppB and AppC disk resources can have access to them at any time.
A standby server can provide backup for more than one active server. For example, in the figure above, Server 2 is the standby server in three active/standby resource pairs. The LifeKeeper resource definitions specify the following active/standby paired relationships:
l AppA on Server1 fails over to Server2.
l AppB on Server3 fails over to Server2.
l AppC on Server4 fails over to Server2.
Intelligent Versus Automatic Switchback
l LifeKeeper never performs an automatic switchback from a higher priority server to a lower priority server.
Logging With syslog
Beginning with LifeKeeper 8.0, logging is done through the standard syslog facility. LifeKeeper supports three syslog implementations: standard syslog, rsyslog, and syslog-ng. During package installation, syslog will be configured to use the "local6" facility for all LifeKeeper log messages.
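As an illustration, on a system using rsyslog a rule such as the following routes the local6 facility to a dedicated file (the destination path is an example, not necessarily the one the LifeKeeper installer configures):
# /etc/rsyslog.d/lifekeeper.conf (example rule for the local6 facility)
local6.*    /var/log/lifekeeper.log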
Resource States
In-Service, Protected (ISP): Resource is operational. LifeKeeper local recovery operates normally. LifeKeeper inter-server recovery and failure recovery is operational.
In-Service, Unprotected: Resource is operational. LifeKeeper local recovery mechanism is not operational for this resource. LifeKeeper inter-server recovery and failure recovery is operational.
Hierarchy Relationships Hierarchy Relationships LifeKeeper allows you to create relationships between resource instances. The primary relationship is a dependency, for example one resource instance depends on another resource instance for its operation . The combination of resource instances and dependencies is the resource hierarchy.
Resource Hierarchy Information Resource Hierarchy Information The resource status of each resource is displayed in the Detailed Status Display and the Short Status Display. The LifeKeeper tag names of root resources are displayed beginning in the left-most position of the TAG column, with tag names of resources within the hierarchy indented appropriately to indicate dependency relationships between resources.
Resource Hierarchy Example Resource Hierarchy Example Detailed Status Display This topic describes the categories of information provided in the detailed status display as shown in the following example of output from the lcdstatus command. For information on how to display this information, see the LCD(1M) man page. At the command line, you can enter either man lcdstatus or man LCD.
Detailed Status Display SHARED equivalency with "apache-home.fred" on "roadrunner", priority = 10 FAILOVER ALLOWED ipeth0-172.17.104.25: id=IP-172.17.104.25 app=comm type=ip state=ISP initialize=(AUTORES_ISP) automatic restore to IN-SERVICE by LifeKeeper info=wileecoyote eth0 172.17.104.25 fffffc00 reason=restore action has succeeded these resources are dependent: apache-home.fred Local priority = 1 SHARED equivalency with "ipeth0-172.17.104.25" on "roadrunner", priority = 10 FAILOVER ALLOWED ipeth0-172.
Resource Hierarchy Information to machine=roadrunner type=TCP addresses=192.168.1.1/192.168.105.19 state="DEAD" priority=2 #comm_downs=0 LifeKeeper Flags The following LifeKeeper flags are on: shutdown_switchover Shutdown Strategy The shutdown strategy is set to: switchover. Resource Hierarchy Information LifeKeeper displays the resource status beginning with the root resource. The display includes information about all resource dependencies.
Communication Status Information l these resources are dependent. If present, indicates the tag names of all parent resources that are directly dependent on this object. l Local priority. Indicates the failover priority value of the targeted server, for this resource. l SHARED equivalency. Indicates the resource tag and server name of any remote resources with which this resource has a defined equivalency, along with the failover priority value of the remote server, for that resource.
LifeKeeper Flags path in the DEAD state, which initiates a failover event if there are no other communications paths marked ALIVE. l Remote system writer process transmits a LifeKeeper maintenance message, causing the reader process to perform the protocol necessary to receive the message. l #NAKs. Number of times the writer process received a negative acknowledgment (NAK).
l Other typical flags include !nofailover!machine, !notarmode!machine, and shutdown_switchover. The !nofailover!machine and !notarmode!machine flags are internal, transient flags created and deleted by LifeKeeper, which control aspects of server failover. The shutdown_switchover flag indicates that the shutdown strategy for this server has been set to switchover such that a shutdown of the server will cause a switchover to occur.
Communication Status Information resources within the hierarchy indented appropriately to indicate dependency relationships between resources. The BACKUP column indicates the next system in the failover priority order, after the system for which the status display pertains. If the target system is the lowest priority system for a given resource, the BACKUP column for that resource contains dashes (for example, ------). l TAG column. Contains the root tag for the resource. l ID column.
Local Recovery Scenario when a particular interface fails on a server, the protected IP address can be made to function on the backup interface, therefore avoiding an entire application/resource hierarchy failing over to a backup server. Local Recovery Scenario IP local recovery allows you to specify a single backup network interface for each LifeKeeperprotected IP address on a server.
This operation will fail if there is no backup interface configured for this instance. If the specified resource instance is currently in service, the move will be implemented by using the ipaction remove operation to un-configure the IP address on the current interface, and ipaction restore to configure it on the backup interface.
Resource Error Recovery Scenario
Resource Error Recovery Scenario 1. lkcheck runs. By default, the lkcheck process runs once every two minutes. When lkcheck runs, it invokes the appropriate quickCheck script for each in-service resource on the system. 2. quickCheck script checks resource. The nature of the checks performed by the quickCheck script is unique to each resource type.
determines whether the failing resource or a resource that depends upon the failing resource has any shared equivalencies with a resource on any other systems, and selects the one belonging to the highest priority alive server. Only one equivalent resource can be active at a time. If no equivalency exists, the recover process halts. If a shared equivalency is found and selected, LifeKeeper initiates inter-server recovery.
Server Failure Recovery Scenario
Server Failure Recovery Scenario The following steps describe the recovery scenario, illustrated above, if LifeKeeper marks all communications connections to a server DEAD. 1. LCM activates eventslcm. When LifeKeeper marks all communications paths dead, the LCM initiates the eventslcm process. Only one activity stops the eventslcm process: l Communication path alive. If one of the communications paths begins sending the heartbeat signal again, the LCM stops the eventslcm process.
Installation and Configuration SPS for Linux Installation For complete installation instructions on installing the SPS for Linux software, see the SPS for Linux Installation Guide. Refer to the SPS for Linux Release Notes for additional information. SPS for Linux Configuration Once the SPS environment has been installed, the SPS software can be configured on each server in the cluster. Follow the steps in the SPS Configuration Steps topic below which contains links to topics with additional details.
Set Up TTY Connections 5.
echo Helloworld | /opt/LifeKeeper/bin/portio -p port -b baud
where:
l baud is the same baud rate selected for Server 1.
l port is the serial port being tested on Server 2, for example /dev/ttyS0.
3. View the console. If the communications path is operational, the software writes "Helloworld" on the console on Server 1. If you do not see that information, perform diagnostic and correction operations before continuing with LifeKeeper configuration.
LifeKeeper Event Forwarding via SNMP
LifeKeeper Events Table LifeKeeper Event/Description Trap Object ID # LifeKeeper Startup Complete 100 .1.3.6.1.4.1.7359.1.0.100 101 .1.3.6.1.4.1.7359.1.0.101 102 .1.3.6.1.4.1.7359.1.0.102 110 .1.3.6.1.4.1.7359.1.0.110 111 .1.3.6.1.4.1.7359.1.0.111 112 .1.3.6.1.4.1.7359.1.0.112 120 .1.3.6.1.4.1.7359.1.0.120 121 .1.3.6.1.4.1.7359.1.0.121 122 .1.3.6.1.4.1.7359.1.0.
Configuring LifeKeeper Event Forwarding LifeKeeper Communications Path Up 140 .1.3.6.1.4.1.7359.1.0.140 141 .1.3.6.1.4.1.7359.1.0.141 Trap message all .1.3.6.1.4.1.7359.1.1 Resource Tag 130 .1.3.6.1.4.1.7359.1.2 Resource Tag 131 .1.3.6.1.4.1.7359.1.2 Resource Tag 132 .1.3.6.1.4.1.7359.1.2 List of recovered resources 111 .1.3.6.1.4.1.7359.1.3 List of recovered resources 121 .1.3.6.1.4.1.7359.1.3 List of failed resources 112 .1.3.6.1.4.1.7359.1.4 List of failed resources 122 .1.3.
Verifying the Configuration 1. Ensure that the snmptrap utility is available as noted above. 2. Specify the network management node to which the SNMP traps will be sent. This can be done either by command line or by editing the /etc/default/LifeKeeper file. You must specify the IP address rather than domain name to avoid DNS issues. l By command line, use the lk_configsnmp (see the lk_configsnmp(1M) man page for details). This utility will only accept IP addresses.
SNMP Troubleshooting SNMP Troubleshooting Following are some possible problems and solutions related to SNMP Event Forwarding. For specific error messages, see the LifeKeeper Message Catalog. Problem: No SNMP trap messages are sent from LifeKeeper. Solution: Verify that the snmptrap utility is installed on the system (it is usually located in /usr/bin). If it is not installed, install the appropriate snmp package (see Prerequisites ).
To disable Email Notification, either run lk_confignotifyalias (see the lk_confignotifyalias(1M) man page for an example) with the --disable argument or edit the defaults file /etc/default/LifeKeeper and remove the setting of LK_NOTIFY_ALIAS (change the line to LK_NOTIFY_ALIAS=).
LifeKeeper Events Generating Email
The following LifeKeeper events will generate email notices when LK_NOTIFY_ALIAS is set.
Configuring LifeKeeper Event Email Notification LifeKeeper Communications Path Up A communications path to a node has become operational. LifeKeeper Communications Path Down A communications path to a node has gone down. Configuring LifeKeeper Event Email Notification Prerequisites The Event Email Notification feature is included as part of the LifeKeeper core functionality and does not require additional LifeKeeper packages to be installed.
The LifeKeeper log can be inspected to determine if there was a problem sending the email message. See Email Notification Troubleshooting for more information.
Disabling Event Email Notification
To disable the generation of email notices by LifeKeeper, simply remove the assignment of an email address or alias from the LK_NOTIFY_ALIAS environment variable in the file /etc/default/LifeKeeper.
Optional Configuration Tasks generate email notification messages. Also, see the Overview of LifeKeeper Event Email Notification to see if the LifeKeeper event generates an email message (not all events generate email messages). Optional Configuration Tasks Adding the LifeKeeper GUI Icon to the Desktop Toolbar The LifeKeeper GUI icon is automatically added to the desktop menu under the System sub-menu during installation of the LifeKeeper GUI package.
Do Not Switch Over Resources (default): LifeKeeper will not bring resources in service on a backup server during an orderly shutdown.
Switch Over Resources: LifeKeeper will bring resources in service on a backup server during an orderly shutdown.
The Shutdown Strategy is set by default to "Do Not Switch Over Resources." You should decide which strategy you want to use on each server in the cluster, and if you wish, change the Shutdown Strategy to "Switch Over Resources".
Tuning the LifeKeeper Heartbeat
Important Note: The values for both tunables MUST be the SAME on all servers in the cluster.
Example
Consider a LifeKeeper cluster in which both intervals are set to the default values. LifeKeeper sends a heartbeat between servers every 5 seconds. If a communications problem causes the heartbeat to skip two beats, but it resumes on the third heartbeat, LifeKeeper takes no action.
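In terms of the tunables involved, the default scenario above corresponds roughly to the following entries in /etc/default/LifeKeeper; the values are inferred from the description above and shown only for illustration, so confirm the actual defaults on your systems:
# heartbeat interval in seconds (inferred default)
LCMHBEATTIME=5
# consecutive missed heartbeats before a comm path is declared failed (inferred default)
LCMNUMHBEATS=3
With these values, a communications path is not marked failed until roughly 5 x 3 = 15 seconds of missed heartbeats.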
Using Custom Certificates with the SPS API LCMNUMHBEATS=2 LifeKeeper will use a 1 second interval for the TCP communications path, and a 2 second interval for TTY. In the case of a server failure, LifeKeeper will detect the TCP failure first because its interval is shorter (2 heartbeats that are 1 second apart), but then will do nothing until it detects the TTY failure, which will be after 2 heartbeats that are 2 seconds apart. Using Custom Certificates with the SPS API Beginning with Release 7.
Linux Configuration Linux Configuration Operating System The default operating system must be installed to ensure that all required packages are installed. The minimal operating system install does not contain all of the required packages, and therefore, cannot be used with LifeKeeper.
Linux Configuration In order to provide the highest level of availability for a LifeKeeper cluster, the kernel version used on a system is very important. The table below lists each supported distribution and version with the kernel that has passed LifeKeeper certification testing. Note: Beginning with SPS 8.1, when performing a kernel upgrade on RedHat Enterprise Linux systems, it is no longer a requirement that the setup script (./setup) from the installation image be rerun.
Linux Configuration Dynamic device addition Prior to LifeKeeper startup, Linux must configure all devices. If a LifeKeeper protected device is configured after LifeKeeper is started, LifeKeeper must be stopped on each server that shares the device and then be restarted. This will enable the device detection and validation to confirm the configuration and enable LifeKeeper to access the device.
Data Replication Configuration While running the SPS Installation setup script, you may encounter a message regarding a failed dependency requirement for a libstdc++ library. This library is provided in one of several compat-libstdc++ rpm packages, depending on the hardware platform and Linux distribution you are running.
Network Configuration Item Description SteelEye DataKeeper documentation The documentation for SteelEye DataKeeper is located within the SteelEye Protection Suite Technical Documentation on the SIOS Technology Corp. Website. Network Configuration Item Description LifeKeeper-protected IP addresses are implemented on Linux as logical interfaces.
Storage and Adapter Configuration Item Localized Oracle mount points Description Localized Oracle environments are different depending on whether you connect as internal or as sysdba. A database on a localized mount point must be created with “connect / as sysdba” if it is to be put under LifeKeeper protection. Upgrading an SPS protected Apache application as part of upgrading the Linux operating system requires that the default server instance be disabled on start up.
Storage and Adapter Configuration Item Description In multipath configurations, performing heavy I/O while paths are being manipulated can cause a system to temporarily appear to be unresponsive. When the multipath software moves the access of a LUN from one path to another, it must also move any outstanding I/Os to the new path. The rerouting of the I/Os can cause a delay in the response times for these I/Os.
Storage and Adapter Configuration Item Special Considerations for Switchovers with Large Storage Configurations Description With some large storage configurations (for example, multiple logical volume groups with 10 or more LUNs in each volume group), LifeKeeper may not be able to complete a sendevent within the default timeout of 300 seconds when a failure is detected. This results in the switchover to the backup system failing.
Storage and Adapter Configuration Item HP 3PAR V400 Description The HP 3PAR V400 was tested by a SIOS Technology Corp. partner with the following configurations: HP 3PAR V400 (Firmware (InForm OS) version 3.1.1) using HP 82E 8Gb Dual Port PCI-e FC HBA AJ763A/AH403A (Firmware version 1.11A5 (U3D1.11A5) sli-3, driver version 8.3.5.30.1p (RHEL bundled)) with DMMP (device-mapper-1.02.62-3, devicemapper-multipath-0.4.9-41.el6). The test was performed with LifeKeeper for Linux v7.5 using RHEL 6.1 .
Storage and Adapter Configuration Item Description HP MSA2000fc Certified by Hewlett-Packard Company with Fibre Channel in both single path and multipath configurations. Models tested were the MSA2012fc and the MSA2212fc with the QLogic QMH2462 HBA using driver version 8.01.07.25 in a single path configuration. The multipath configuration testing was completed using the same models with HP DMMP and the LifeKeeper DMMP Recovery Kit.
Storage and Adapter Configuration Item Description HP P2000 G3 MSA SAS Certified by SIOS Technology Corp. in multipath configurations using the Device Mapper Multipath Recovery Kit. LifeKeeper for Linux can support up to 11 LUNs in a single cluster with the P2000 G3 SAS array. HP P4000/P4300 G2 Certified by SIOS Technology Corp. in both a single path and multipath configuration on RHEL 5.5 using the built-in SCSI support in the core of LifeKeeper with iSCSI Software Initiators.
Storage and Adapter Configuration Item HP XP20000/XP24000 Description Certified by SIOS Technology Corp. using LifeKeeper forLinux with DMMP ARK in RHEL 5, SLES10 and SLES 11,configured as multipath by DMMP. The model numbers of tested storage are XP20000 and XP24000. The connection interface is FC. The model number of tested HBA is QLogic QMH2562 and firmware is 4.04.09; driver version is 8.03.00.10.05.04-k. SIOS Technology Corp.
Storage and Adapter Configuration Item IBM DS3500 (FC Model) Description Certified by SIOS Technology Corp. in single path and multipath configurations on Red Hat Enterprise Linux Server Release 5.5 (Tikanga), HBA: QLE2560, QLE2460, RDAC: RDAC 09.03.0C05.0331. RDAC is needed for both single path and multipath. Note: SAS and iSCSI connect are not supported. IBM DS3400 Storage Certified by SIOS Technology Corp. with QLogic 2300 adapters in both single and multiple path configurations.
Storage and Adapter Configuration Item IBM eServer xSeries Storage Solution Server Type445-R / Type445-FR for SANmelody IBM Storwize V7000 iSCSI Description Certified by partner testing with IBM TotalStorage FC2-133 Host Bus Adapters in multiple path configurations. Use the qla2300 driver, version 7.00.61(non-failover) or later, as defined by IBM.
Storage and Adapter Configuration Item Description SIOS Technology Corp. has certified the Dell PowerVault storage array for use in a 2-node cluster with the Dell PERC 2/DC, Dell PERC 4/DC, and LSI Logic MegaRAID Elite 1600 storage controllers, as long as the following set of configuration requirements are met. (Note that the Dell PERC 3/DC is the OEM version of the MegaRAID Elite 1600.
Storage and Adapter Configuration Item Description Dell | EMC (CLARiiON) CX200 EMC has approved two QLogic driver versions for use with this array and the QLA2340 adapter: the qla2x00-clariion-v6.03.00-1 and the qla2x00clariion-v4.47.18-1. Both are available from the QLogic website at www.qlogic.com. DELL MD3000 Certified by Partner testing in both single path and multipath configurations with DELL SAS 5/e adapters.
Storage and Adapter Configuration Item Description The Dell EqualLogic was tested by a SIOS Technology Corp. partner with the following configurations: l Dell EqualLogic PS5000 using SCSI -2 reservations with the iscsi-initiator (Software initiator) using Red Hat Enterprise Linux ES release 4 (Nahant Update 5) with kernel 2.6.9-55.EL. The testing was completed using iscsiinitiator-utils-4.0.3.0-5 and multipath configuration using bonding with activebackup (mode=1).
Storage and Adapter Configuration Item Description Certified by SIOS Technology Corp. in a single path configuration with QLogic 23xx adapters. Use the qla2300 driver, version 6.04 or later. Hitachi HDS 9570V, 9970V and 9980V 92Configuration Note: SIOS Technology Corp. recommends the use of only single controller (i.e. single loop) configurations with these arrays, using a fibre channel switch or hub.
Storage and Adapter Configuration Item Description There are certain specific “host mode” settings required on the Hitachi arrays in order to allow LifeKeeper to work properly in this kind of directly connected configuration.
Storage and Adapter Configuration Item Description Fujitsu ETERNUS3000 This storage unit was tested by a SIOS Technology Corp. partner organization in a single path configuration only, using the PGFC105 (Emulex LP9001), PG-FC106 (Emulex LP9802), and PG-PC107 host bus adapters and the lpfc driver v7.1.14-3. Fujitsu ETERNUS 6000 This storage unit was tested by a SIOS Technology Corp.
Storage and Adapter Configuration Item Fujitsu ETERNUS2000 Model 200 Description This storage unit was tested by Fujitsu Limited in a multipath configuration using PG-FC203L (Emulex LPe1250-F8) host bus adapter (Firmware version 1.11A5, driver version 8.2.0.48.2p) with EMPD multipath driver (driver version V2.0L20, patch version T000973LP-1). The test was performed with LifeKeeper for Linux v7.1 using RHEL5 (kernel 2.6.18-164.el5).
Storage and Adapter Configuration Item Description This storage unit was tested by a SIOS Technology Corp. partner with the following single path configuration: TID MassCareRAID l Host1 Qlogic QLE2562 (HBA BIOS 2.10, driver version qla2xxx-8.03.01.04.05.05-k *) l Host2 HP AE312A (HBA BIOS 1.26, driver version qla2xxx-8.03.01.04.05.05-k *) l The test was performed with LifeKeeper for Linux v7.3 using Red Hat Enterprise Linux 5.5 (kernel2.6.18-194.
Storage and Adapter Configuration Item Description This storage device was successfully tested with SUSE SLES 9 Service Pack 3, Device Mapper - Multipath and Qlogic 2340 adapters. We expect that it should work with other versions, distributions and adapters; however, this has not been tested. See DataCore for specific support for these and other configurations.
HP Multipath I/O Configurations HP Multipath I/O Configurations Item Description MSA1000 and MSA1500 Multipath Requirements with Secure Path LifeKeeper supports Secure Path for multipath I/O configurations with the MSA1000 and MSA1500. This support requires the use of the Secure Path v3.0C or later. HP P2000 LifeKeeper supports the use of HP P2000 MSA FC. This storage unit was tested by SIOS Technology Corp. in a multipath configuration on RHEL 5.4.
HP Multipath I/O Configurations Multipath Support for EVA with QLogic Failover Driver LifeKeeper supports the EVA 3000/5000 and the EVA 4X00/6X00/8X00 with the QLogic failover driver. The 3000/5000 requires firmware version 4000 or higher. The 4000/6000/8000 requires firmware version 5030 or higher. The latest QLogic driver supplied by HP (v8.01.03 or later) should be used. The host connection must be "Linux". There is no restriction on the path/mode setting by LifeKeeper.
Device Mapper Multipath I/O Configurations Single Path on Boot Up Does Not Cause Notification If a server can access only a single path to the storage when the system is loaded, there will be no notification of this problem. This can happen if a system is rebooted where there is a physical path failure as noted above, but transient path failures have also been observed.
Device Mapper Multipath I/O Configurations The Device Mapper Multipath Kit has been tested by SIOS Technology Corp. with the EMC CLARiiON CX300, the HP EVA 8000, HP MSA1500, HP P2000, the IBM SAN Volume Controller (SVC), the IBM DS8100, the IBM DS6800, the IBM ESS, the DataCore SANsymphony, and the HDS 9980V. Check with your storage vendor to determine their support for Device Mapper Multipath.
Device Mapper Multipath I/O Configurations While running IO tests on Device Mapper Multipath devices, it is not uncommon for actions on the SAN, for example, a server rebooting, to cause paths to temporarily be reported as failed. In most cases, this will simply cause one path to fail leaving other paths to send IOs down resulting in no observable failures other than a small performance impact. In some cases, multiple paths can be reported as failed leaving no paths working.
LifeKeeper I-O Fencing Introduction LifeKeeper I-O Fencing Introduction I/O fencing is the locking away of data from a malfunctioning node preventing uncoordinated access to shared storage. In an environment where multiple servers can access the same data, it is essential that all writes are performed in a controlled manner to avoid data corruption. Problems can arise when the failure detection mechanism breaks down because the symptoms of this breakdown can mimic a failed node.
Non-Shared Storage Non-Shared Storage If planning to use LifeKeeper in a non-shared storage environment, the risk of data corruption that exists with shared storage is not an issue; therefore, reservations are not necessary. However, partial or full resyncs and merging of data may be required. To optimize reliability and availability, the above options should be considered with non-shared storage as well.
I/O Fencing Chart
The chart in this section summarizes the fencing options discussed above (Quorum/Witness, Watchdog, STONITH (serial), CONFIRM_SO and combinations of these) for shared storage with reservations, shared storage with reservations off and non-shared storage, ordered from most reliable to least reliable.
* While CONFIRM_SO is highly reliable (depending upon the knowledge of the administrator), it has lower availability due to the fact that automatic failover is turned off.
Quorum/Witness
Package Installation and Configuration Package Installation and Configuration The Quorum/Witness Server Support Package for LifeKeeper will need to be installed on each server in the quorum/witness mode cluster, including any witness-only servers. The only configuration requirement for the witness node is to create appropriate comm paths.
Available Quorum Modes Available Quorum Modes Three quorum checking modes are available which can be set via the QUORUM_MODE setting in /etc/default/LifeKeeper: majority (the default), tcp_remote and none/off. Each of these is described below: majority The majority setting, which is the default, will determine quorum based on the number of visible/alive LifeKeeper nodes at the time of the check. This check is a simple majority -- if more than half the total nodes are visible, then the node has quorum.
Available Witness Modes # This can be # mode, core # death of a # also think either off/none or remote_verify. In remote_verify event handlers (comm_down) will doublecheck the system by seeing if other visible nodes it is dead. QUORUM_LOSS_ACTION=fastboot # This can be one of osu, fastkill or fastboot. # fastboot will IMMEDIATELY reboot the system if a loss of quorum # is detected. # fastkill will IMMEDIATELY halt/power off the system upon # loss of quorum.
Available Actions When Quorum is Lost Available Actions When Quorum is Lost The witness package offers three different options for how the system should react if quorum is lost -“fastboot”, “fastkill” and “osu”. These options can be selected via the QUORUM_LOSS_ACTION setting in /etc/default/LifeKeeper. All three options take the system’s resources out of service; however, they each allow a different behavior. The default option, when the quorum package is installed, is fastboot.
The GUI will not automatically connect to the other servers with established communication paths. As this behavior is not typical LifeKeeper GUI behavior, it may lead an installer to incorrectly conclude that there is a network or other configuration problem. To use the LifeKeeper GUI on a witness server with this setting, connect manually to one of the other nodes in the cluster, and the remaining nodes in the cluster will be shown in the GUI correctly.
Adding a Witness Node to a Two-Node Cluster
Expected Behaviors (Assuming Default Modes) Simple Two-Node Cluster with Witness Node Server A and Server B should already be set up with LifeKeeper core with resource hierarchies created on Server A and extended to Server B (Server W will have no resource hierarchies extended to it). Using the following steps, a third node will be added as the witness node. 1. Set up the witness node, making sure network communications are available to the other two nodes. 2.
Scenario 4 In this case, Server B will do the following: l Begin processing a communication failure event from Server A. l Determine that it can still communicate with the Witness Server W and thus has quorum. l Verify via Server W that Server A really appears to be lost and, thus, begin the usual failover activity. l Server B will now have the protected resources in service.
SCSI Reservations SCSI Reservations Storage Fence Using SCSI Reservations While LifeKeeper for Linux supports both resource fencing and node fencing, its primary fencing mechanism is storage fencing through SCSI reservations. This fence, which provides the highest level of data protection for shared storage, allows for maximum flexibility and maximum security providing very granular locking to the LUN level. The underlying shared resource (LUN) is the primary quorum device in this architecture.
from partially hung servers. In cases where a hung server goes undetected by LifeKeeper, watchdog will begin recovery. Also, in the case where a server is hung and not able to detect that the reservation has been stolen, watchdog can reboot the server to begin its recovery.
Alternative Methods for I/O Fencing
In addition to resource fencing using SCSI reservations, LifeKeeper for Linux also supports disabling reservations.
Package Requirements Package Requirements l VMware vSphere SDK Package (e.g. VMware-vSphere-SDK-4.X.X-XXXXX.i386.tar.gz) l l VMware vSphere CLI (vSphere CLI is included in the same installation package as the vSphere SDK.) (Note: Only required when using vmware-cmd) VMware Tools (e.g. VMwareTools-8.3.7-341836.tar.gz) Installation and Configuration After installing LifeKeeper and configuring communication paths for each node in the cluster, install and configure STONITH. 1.
/opt/LifeKeeper/config/stonith.conf
# LifeKeeper STONITH configuration
#
# Each system in the cluster is listed below. To enable STONITH for a
# given system,
# remove the '#' on that line and insert the STONITH command line to power off
# that system.
# Example1: ipmi command
# node-1 ipmitool -I lanplus -H 10.0.0.1 -U root -P secret power off
# Example2: vCLI-esxcli command
# node-2 esxcli --server=10.0.0.
vmware-cmd -H 10.0.0.1 -U root -P secret /vmfs/volumes/4e08c1b9-d741c09c-1d3e0019b9cb28be/lampserver/lampserver.vmx stop hard
Note: For further information on VMware commands, use vmware-cmd with no arguments to display a help page about all options.
Expected Behaviors When LifeKeeper detects a communication failure with a node, that node will be powered off and a failover will occur. Once the issue is repaired, the node will have to be manually powered on.
Configuration Read the next section carefully. The watchdog daemon is designed to recover from errors and will reset the system if it is not configured carefully. Planning and care should be given to how watchdog is installed and configured. This section is not intended to explain how to install and configure watchdog itself, but only how LifeKeeper interoperates in such a configuration. Configuration The following steps should be carried out by an administrator with root user privileges.
Uninstall Note: Configuring watchdog may cause some unexpected reboots from time to time. This is the general nature of how watchdog works. If processes are not responding correctly, the watchdog feature will assume that LifeKeeper (or the operating system) is hung, and it will reboot the system (without warning). Uninstall Care should be taken when uninstalling LifeKeeper. The configuration steps above should be undone in reverse order, as listed below.
SteelEye Protection Suite/vAppKeeper Recovery Behavior SteelEye Protection Suite and SteelEye vAppKeeper are designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications, for example, are hierarchies where the primary application depends on lower-level storage or network resources.
Meta Policies prior to performing a failover. This policy setting can be used to turn on/off local recovery. l TemporalRecovery - Normally, SteelEye Protection Suite will perform local recovery of a failed resource. If local recovery fails, SteelEye Protection Suite will perform a resource hierarchy failover to another node (vAppKeeper will trigger VMware HA). If the local recovery succeeds, failover will not be performed.
The lkpolicy Tool In the above resource hierarchy, app depends on both IP and file system. A policy can be set to disable local recovery or failover of a specific resource. This means that, for example, if the IP resource's local recovery fails and a policy was set to disable failover of the IP resource, then the IP resource will not fail over or cause a failover of the other resources.
Listing Policies
[root@thor49 ~]# lkpolicy -l -d v6test4
Please enter your credentials for the system 'v6test4'.
Configuring Credentials Credentials for communicating with other systems are managed via a credential store. This store can be managed, as needed, by the /opt/LifeKeeper/bin/credstore utility. This utility allows server access credentials to be set, changed and removed on a per-server basis. Adding or Changing Credentials Adding and changing credentials are handled in the same way. A typical example of adding or changing credentials for a server, server.mydomain.
LifeKeeper API This will show the entire man/help page for the command. LifeKeeper API The LifeKeeper API is used to allow communications between LifeKeeper servers. IMPORTANT NOTE: Currently, this API is reserved for internal use only but may be opened up to customer and third party usage in a future release. Network Configuration Each LifeKeeper server provides the API via an SSL Connection on port 778. This port may be changed using the configuration variable API_SSL_PORT in /etc/default/LifeKeeper.
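As a hedged sketch (the alternate port number is purely illustrative), moving the API off its default port in /etc/default/LifeKeeper might look like:
# LifeKeeper API SSL port (778 is the default)
API_SSL_PORT=7780
As noted in the troubleshooting material later in this document, such a setting should be applied on all nodes in the cluster, followed by a LifeKeeper restart on those nodes.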
LifeKeeper Administration Overview LifeKeeper does not require administration during operation. LifeKeeper works automatically to monitor protected resources and to perform the specified recovery actions if a fault should occur. LifeKeeper provides these interface options: l LifeKeeper GUI. l LifeKeeper command line interface. You use the LifeKeeper GUI in these cases: l Resource and hierarchy definition. l Resource monitoring.
Administrator Tasks Administrator Tasks Editing Server Properties 1. To edit the properties of a server, bring up the Server Properties dialog just as you would for viewing server properties. 2. If you are logged into that server with the appropriate permissions, the following items will be editable. l Shutdown Strategy l Failover Confirmation 3. Once you have made changes, the Apply button will be enabled. Clicking this button will apply your changes without closing the window. 4.
Deleting a Communication Path Help for an explanation of each choice. 3. Select the Local Server from the list box and click Next. 4. Select one or more Remote Servers in the list box. If a remote server is not listed in the list box (i.e. it is not yet connected to the cluster), you may enter it using Add. You must make sure that the network addresses for both the local and remote servers are resolvable (for example, with DNS or added to the /etc/hosts file). Click Next. 5.
Server Properties - Failover l Right-click on a server icon, then click Delete Comm Path when the server context menu appears. l On the global toolbar, click the Delete Comm Path button. l On the server context toolbar, if displayed, click the Delete Comm Path button. l On the Edit menu, select Server, then Delete Comm Path. 2. A dialog entitled Delete Comm Path will appear. For each of the options that follow, click Help for an explanation of each choice. 3.
Creating Resource Hierarchies To commit your selections, press the Apply button. Creating Resource Hierarchies 1. There are four ways to begin creating a resource hierarchy. l Right-click on a server icon to bring up the server context menu, then click on Create Resource Hierarchy. l On the global toolbar, click on the Create Resource Hierarchy button. l On the server context toolbar, if displayed, click on the Create Resource Hierarchy button.
LifeKeeper Application Resource Hierarchies 3. Select the Switchback Type and click Next. 4. Select the Server and click Next. Note: If you began from the server context menu, the server will be determined automatically from the server icon that you clicked on, and this step will be skipped. 5. Continue through the succeeding dialogs, entering whatever data is needed for the type of resource hierarchy that you are creating.
Creating a File System Resource Hierarchy button. l On the Edit menu, select Server, then click on Create Resource Hierarchy. 2. A dialog entitled Create Resource Wizard will appear with a Recovery Kit list. Select File System Resource and click Next. 3. Select the Switchback Type and click Next. 4. Select the Server and click Next. Note: If you began from the server context menu, the server will be determined automatically from the server icon that you clicked on, and this step will be skipped. 5.
Creating a Generic Application Resource Hierarchy Creating a Generic Application Resource Hierarchy Use this option to protect a user-defined application that has no associated recovery kit. Templates are provided for the user supplied scripts referenced below in $LKROOT/lkadm/subsys/gen/app/templates. Copy these templates to another directory before customizing them for the application that you wish to protect and testing them.
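A hedged sketch of that preparatory copy step, assuming $LKROOT is /opt/LifeKeeper and using a hypothetical working directory of your choosing:
mkdir -p /root/myapp-scripts
cp /opt/LifeKeeper/lkadm/subsys/gen/app/templates/* /root/myapp-scripts/
# customize the copied template scripts for the application you wish to protect, then test them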
Creating a Raw Device Resource Hierarchy cause the resource state to be set to OSU following the create; selecting Yes will cause the previously provided restore script to be executed. For applications depending upon other resources such as a file system, disk partition, or IP address, select No if you have not already created the appropriate dependent resources. 11. Enter the Root Tag, which is a unique name for the resource instance.
Editing Resource Properties 7. Click Create Instance to start the creation process. A window titled Creating scsi/raw resource will display text indicating what is happening during creation. 8. Click Next. A window will display a message that the hierarchy has been created successfully. 9. At this point, you can click Continue to move on to extending the raw resource hierarchy, or you can click Cancel to return to the GUI.
Using the Up and Down Buttons There are two ways to modify the priorities: l Reorder the priorities by moving an equivalency with the Up/Down buttons, or l Edit the priority values directly. Using the Up and Down Buttons 1. Select an equivalency by clicking on a row in the Equivalencies table. The Up and/or Down buttons will become enabled, depending on which equivalency you have selected. The Up button is enabled unless you have selected the highest-priority server.
Editing the Priority Values 1. Select a priority by clicking on a priority value in the Priority column of the Equivalencies table. A box appears around the priority value, and the value is highlighted. 2. Enter the desired priority and press Enter. l Note: Valid server priorities are 1 to 999. After you have edited the priority, the Equivalencies table will be re-sorted.
Extending a File System Resource Hierarchy the Pre-Extend Wizard dialog appears, select a Template Server and a Tag to Extend, clicking on Next after each choice. 2. Either select the default Target Server or enter one from the list of choices, then click Next. 3. Select the Switchback Type, then click Next. 4. Either select the default or enter your own Template Priority, then click Next. 5. Either select or enter your own Target Priority, then click Next. 6.
Extending a Raw Device Resource Hierarchy extending resource hierarchies. After you have done that, you then complete the steps below, which are specific to generic application resources. 1. Select the Root Tag that LifeKeeper offers, or enter your own tag for the resource hierarchy on the target server, then click Next. 2. Enter any Application Information next (optional), then click Next. 3.
Creating a Resource Dependency l Right-click on the icon for the global resource hierarchy that you want to unextend. When the resource context menu appears, click Unextend Resource Hierarchy. When the dialog comes up, select the server in the Target Server list from which you want to unextend the resource hierarchy, and click Next. l On the global toolbar, click the Unextend Resource Hierarchy button.
Deleting a Resource Dependency l On the Edit menu, point to Resource and then click Create Dependency. When the dialog comes up, select the server in the Server list from which you want to begin creating the resource dependency, and click Next. On the next dialog, select the parent resource from the Parent Resource Tag list, and click Next again. 2. Select a Child Resource Tag from the drop down box of existing and valid resources on the server.
Deleting a Hierarchy from All Servers servers in the cluster. 4. If the output panel is enabled, the dialog closes, and the results of the commands to delete the dependency are shown in the output panel. If not, the dialog remains up to show these results, and you click Done to finish when all results have been displayed. Deleting a Hierarchy from All Servers 1. There are five possible ways to begin.
LifeKeeper User Guide The User Guide is a complete, searchable resource containing detailed information on the many tasks that can be performed within the LifeKeeper GUI. Click User Guide to access this documentation. The tasks that can be performed through the GUI can be grouped into three areas: Common Tasks - These are basic tasks that can be performed by any user such as connecting to a cluster, viewing server or resource properties, viewing log files and changing GUI settings.
Using LifeKeeper for Linux Using LifeKeeper for Linux The following topics provide detailed information on the LifeKeeper graphical user interface (GUI) as well as the many tasks that can be performed within the LifeKeeper GUI. GUI The GUI components should have already been installed as part of the LifeKeeper Core installation. The LifeKeeper GUI uses Java technology to provide a graphical user interface to LifeKeeper and its configuration data.
Exiting GUI Clients tasks. l The context (popup) and global menus provide access to all tasks. Exiting GUI Clients Select Exit from the File Menu to disconnect from all servers and close the client. The LifeKeeper GUI Software Package The LifeKeeper GUI is included in the steeleye-lkGUI software package which is bundled with the LifeKeeper Core Package Cluster. The steeleye-lkGUI package: l Installs the LifeKeeper GUI Client in Java archive format. l Installs the LifeKeeper GUI Server.
Menus Menus SteelEye LifeKeeper for Linux Menus Resource Context Menu The Resource Context Menu appears when you right-click on a global (cluster-wide) resource, as shown above, or a server-specific resource instance, as shown below, in the status table. The default resource context menu is described here, but this menu might be customized for specific resource types, in which case the menu will be described in the appropriate resource kit documentation.
Server Context Menu Extend Resource Hierarchy. Copy a resource hierarchy to another server for failover support. Unextend Resource Hierarchy. Remove an extended resource hierarchy from a single server. Create Dependency. Create a parent/child relationship between two resources. Delete Dependency. Remove a parent/child relationship between two resources. Delete Resource Hierarchy. Remove a resource hierarchy from all servers in the LifeKeeper cluster. Properties. Display the Resource Properties Dialog.
File Menu File Menu Connect. Connect to a LifeKeeper cluster. Connection to each server in the LifeKeeper cluster requires login authentication on that server. Exit. Disconnect from all servers and close the GUI window. Edit Menu - Resource In Service. Bring a resource hierarchy into service. Out of Service. Take a resource hierarchy out of service. Extend Resource Hierarchy. Copy a resource hierarchy to another server for failover support. Unextend Resource Hierarchy.
Edit Menu - Server Edit Menu - Server Disconnect. Disconnect from a cluster. Refresh. Refresh GUI. View Logs. View LifeKeeper log messages on connected servers. Create Resource Hierarchy. Create a resource hierarchy. Create Comm Path. Create a communication path between servers. Delete Comm Path. Remove communication paths from a server. Properties. Display the Server Properties Dialog. View Menu Global Toolbar. Display this component if the checkbox is selected. Message Bar.
Help Menu Properties Panel. Display this component if the checkbox is selected. Output Panel. Display this component if the checkbox is selected. Options. Edit the display properties of the GUI. History. Display the newest messages that have appeared in the Message Bar in the LifeKeeper GUI Message History dialog box (up to 1000 lines). Expand Tree. Expand the entire resource hierarchy tree. Collapse Tree. Collapse the entire resource hierarchy tree. Help Menu Technical Documentation.
GUI Toolbar Disconnect. Disconnect from a LifeKeeper cluster. Refresh. Refresh GUI. View Logs. View LifeKeeper log messages on connected servers. Create Resource Hierarchy. Create a resource hierarchy. Delete Resource Hierarchy. Remove a resource hierarchy from all servers in the LifeKeeper cluster. Create Comm Path. Create a communication path between servers. Delete Comm Path. Remove communication paths from a server. In Service. Bring a resource hierarchy into service. Out of Service.
Resource Context Toolbar Create Dependency. Create a parent/child relationship between two resources. Delete Dependency. Remove a parent/child relationship between two resources. Migrate Hierarchy to Multi-Site Cluster. Migrate an existing hierarchy to a Multi-Site Cluster Environment. Resource Context Toolbar The resource context toolbar is displayed in the properties panel when you select a server-specific resource instance in the status table.
Resource Context Toolbar Remove Dependency. Remove a parent/child relationship between two resources. Delete Resource Hierarchy. Remove a resource hierarchy from all servers.
Server Context Toolbar Server Context Toolbar The server context toolbar is displayed in the properties panel when you select a server in the status table. The actions are invoked for the server that you select. Disconnect. Disconnect from a LifeKeeper cluster. Refresh. Refresh GUI. View Logs. View LifeKeeper log messages on connected servers. Create Resource Hierarchy. Create a resource hierarchy. Delete Resource Hierarchy. Remove a resource hierarchy from all servers in the LifeKeeper cluster.
GUI Server its configuration data. Since the LifeKeeper GUI is a client/server application, a user will run the graphical user interface on a client system in order to monitor or administer a server system where LifeKeeper is executing. The client and the server may or may not be the same system. The LifeKeeper GUI allows users working on any machine to administer, operate, or monitor servers and resources in any cluster, as long as they have the required group memberships on the cluster machines.
Starting the application client NOTE: When you run the applet, if your system does not have the required Java Plug-in, you will be automatically taken to the web site for downloading the plug-in. You must also set your browser security parameters to enable Java. If you have done this and the client still is not loading, see GUI Troubleshooting. Starting the application client Users with administrator privileges on a LifeKeeper server can run the application client from that server.
GUI Configuration See Running the GUI on a Remote System for information on configuring and running the GUI on a remote system outside your LifeKeeper cluster. GUI Configuration Item Description GUI client and server communication The LifeKeeper GUI client and server use Java Remote Method Invocation (RMI) to communicate. For RMI to work correctly, the client and server must use resolvable hostnames or IP addresses.
GUI Limitations GUI Limitations Item Description GUI interoperability restriction The LifeKeeper for Linux client may only be used to administer LifeKeeper on Linux servers. The LifeKeeper for Linux GUI will not interoperate with LifeKeeper for Windows.
LifeKeeper GUI Server Processes This command halts all LifeKeeper GUI Server daemon processes on the server being administered if they are currently running. The following messages are displayed.
Java Security Policy They take effect on the user's next login or when the GUI server is restarted, whichever comes first. Each user has a single permission on a given server. Previous permission entries are deleted if a new permission is specified on that server.
Policy File Creation and Management The user policy file name starts with '.' and by default is located at: <USER.HOME>\.java.policy Note: USER.HOME refers to the value of the system property named "user.home", which specifies the user's home directory. For example, the home directory on a Windows NT workstation for a user named Paul might be "paul.000". For Windows systems, the user.
Sample Policy File perm = new java.io.FilePermission("/tmp/abc","read"); In this example, the target name is "/tmp/abc" and the action string is "read". A policy file specifies what permissions are allowed for code from specified code sources. An example policy file entry granting code from the /home/sysadmin directory read access to the file /tmp/abc is:
grant codeBase "file:/home/sysadmin/" {
permission java.io.FilePermission "/tmp/abc", "read";
};
Java Plug-In Java Plug-In Regardless of the browser you are using (see supported browsers), the first time your browser attempts to load the LifeKeeper GUI, it will either automatically download the Java Plug-In software or redirect you to a web page to download and install it. From that point forward, the browser will automatically invoke the Java Plug-in software every time it comes across web pages that support the technology.
Running the GUI on a Remote System l If you already have a user policy file, you can add the required entries specified in /opt/LifeKeeper/htdoc/java.policy on a LifeKeeper server into the existing file using a simple text editor. See Java Security Policy for further information. 2. You must set your browser security parameters to low. This generally includes enabling of Java and Java applets.
Applet Troubleshooting Applet Troubleshooting If you suspect that the applet failed to load and initialize, try the following: 1. Verify that applet failed. Usually a message is printed somewhere in the browser window specifying the state of the applet. In Netscape and Internet Explorer, an icon may appear instead of the applet in addition to some text status. Clicking this icon may bring up a description of the failure. 2. Verify that you have installed the Java Plug-in.
Browser Security Parameters for GUI Applet and password. 4. Once a connection to the cluster is established, the GUI window displays a visual representation and status of the resources protected by the connected servers. The GUI menus and toolbar buttons provide administration functions. Browser Security Parameters for GUI Applet WARNING: Be careful of other sites you visit with security set to low values. Firefox 1. From the Edit menu, select Preferences. 2.
Properties Panel resources. It shows: l the state of each server in the top row, l the global (cross-server) state and the parent-child relationships of each resource in the leftmost column, and l the state of each resource on each server in the remaining cells. The states of the servers and resources are shown using graphics, text and color. An empty table cell under a server indicates that a particular resource has not been defined on that server.
Message Bar that command is added under this label. If you are running multiple commands at the same time (typically on different servers), the output from each command is sent to the corresponding section, making it easy to see the results of each. You can increase or decrease the size of the output panel by sliding the separator at the top of the panel up or down. If you want to open or close this panel, use the Output Panel checkbox on the View Menu.
Enabling Automatic LifeKeeper Restart Note: If you receive an error message referencing the LifeKeeper Distribution Enabling Package when you start LifeKeeper, you should install / re-install the LifeKeeper Installation Image File. See the LCD(1M) man page by entering man LCD at the command line for details on the /etc/init.d/lifekeeper start command. Enabling Automatic LifeKeeper Restart While the above command will start LifeKeeper, it will need to be performed each time the system is re-booted.
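The enable command itself is truncated above; based on the disable command shown in the next section, the corresponding enable step is presumably:
chkconfig lifekeeper on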
Disabling Automatic LifeKeeper Restart Disabling Automatic LifeKeeper Restart If you do not want LifeKeeper to automatically restart when the system is restarted, type the following command: chkconfig lifekeeper off See the chkconfig man page for further information.
Connecting Servers to a Cluster REGISTRY=internal -DGUI_WEB_PORT=81 com.steeleye.LifeKeeper.beans.S_LK Connecting Servers to a Cluster 1. There are two possible ways to begin. l On the global toolbar, click the Connect button. l On the File Menu, click Connect. 2. In the Server Name field of the Cluster Connect dialog, enter the name of a server within the cluster to which you want to connect. Note: If using an IPv6 address, this address will need to be enclosed in brackets [ ].
Viewing Connected Servers 1. There are three possible ways to begin. l On the Global Toolbar, click the Disconnect button. l On the Edit Menu, select Server and then click Disconnect. l On the Server Context Toolbar, if displayed, click the Disconnect button. 2. In the Select Server in Cluster list of the Cluster Disconnect Dialog, select the name of a server in the cluster from which you want to disconnect. 3. Click OK. A Confirmation dialog listing all the servers in the cluster is displayed. 4.
Viewing Server Properties Client has valid connection to the server. Comm paths originating from this server to an ALIVE remote server are ALIVE. ALIVE Comm paths which may be marked DEAD and which target a DEAD server are ignored because the DEAD server will be reflected in its own graphic. Client has valid connection to the server. One or more comm paths from this server to a given remote server are marked as DEAD. ALIVE No redundant comm path exists from this server to a given remote server.
Viewing Resource Tags and IDs l On the Edit Menu, point to Server, click View Log, then select the server that you want to view from the Server list in the LifeKeeper Log Viewer Dialog. 2. If you started from the Global Toolbar or the Edit Menu and you want to view logs for a different server, select that server from the Server list in the LifeKeeper Log Viewer Dialog. This feature is not available if you selected View Logs from the Server Context Menu or Server Context Toolbar. 3.
Global Resource Status Server Resource State - Visual State / What it Means: l Active (ISP) - Resource is operational on this server and protected. l Degraded (ISU) - Resource is operational on this server, but not protected by a backup resource. l StandBy (OSU) - Server can take over operation of the resource. l Failed - Problem with resource detected on this server. For example, an attempt to bring the resource in-service failed.
Viewing Resource Properties Visual State Description What it Means / Causes Normal Resource is active (ISP) and all backups are active. Warning Resource is active (ISP). One or more backups are marked as unknown or failed (OSF). Resource has been taken out-of-service for normal reasons. Failed. Resource is not active on any servers (OSF). Resource has stopped running by unconventional means. Recovery has not been completed or has failed. More than one server is claiming to be active. Unknown.
Setting View Options for the Status Window Setting View Options for the Status Window The Options Dialog is available from the View menu. This allows you to specify various LifeKeeper display characteristics. These settings, along with all checkbox menu item settings and the various window sizes, are stored between sessions in the file .lkGUIpreferences in your home folder on the client machine. This file is used by both the web and application clients.
Resource Tree Resource Tree This option group allows you to specify the sorting order of the resources in the resource hierarchy tree. l Sort By Resource will sort resources by resource label only. l Sort By Cluster will sort by server cluster and resource label such that resources belonging in the same cluster of servers will be grouped together. l No Sort will disable sorting such that the resources are displayed in the order in which they are discovered by the GUI.
Viewing Message History l Automatic: Automatically resizes all columns to fill available space. Note: The 7 (seven) and 8 (eight) keys are defined as hot/accelerator keys to facilitate quickly resizing the column size of resources in the resource hierarchy table. Viewing Message History 1. On the View Menu, click History. The LifeKeeper GUI Message History dialog is displayed. 2. If you want to clear all messages from the history, click Clear. 3. Click OK to close the dialog.
Expanding and Collapsing a Resource Hierarchy Tree In this segment of the tree, the resource file_system_2 is expanded and the resource nfs-/opt/qe_auto/NFS/export1 is collapsed. An expand/collapse indicator appears to the left of a resource icon: one form of the indicator is shown when the resource is expanded and another when it is collapsed. To expand a resource hierarchy tree, l Click the indicator, or l Double-click the resource icon to the right of the indicator.
Cluster Connect Dialog Note: The "9" and "0" keys are defined as hot/accelerator keys to facilitate quickly expanding or collapsing all resource hierarchy trees. Cluster Connect Dialog Server Name. The name of the server to which you want to connect. Login. The login name of a user with LifeKeeper authorization on the server to which you want to connect. Password. The password that authorizes the specified login on the server to which you want to connect.
Resource Properties Dialog Resource Properties Dialog The Resource Properties dialog is available from the Edit menu or from a resource context menu. This dialog displays the properties for a particular resource on a server. When accessed from the Edit menu, you can select the resource and the server. When accessed from a resource context menu, you can select the server. General Tab l Tag. The name of a resource instance, unique to a system, that identifies the resource to an administrator. l ID.
Relations Tab resources can be active on only one of the grouped systems at a time. l Initialization. The setting that determines resource initialization behavior at boot time, for example, AUTORES_ISP, INIT_ISP, or INIT_OSU. Relations Tab l Parent. Identifies the tag names of the resources that are directly dependent on this resource. l Child. Identifies the tag names of all resources on which this resource depends. l Root. Tag name of the resource in this resource hierarchy that has no parent.
General Tab General Tab l Name. Name of the selected server. l State. Current state of the server. These are the possible server state values: l l ALIVE - server is available. l DEAD - server is unavailable. l UNKNOWN - state could not be determined. The GUI server may not be available. Permission. The permission level of the user currently logged into that server.
General Tab l Administrator - the user can perform any LifeKeeper task. l Operator - the user can monitor LifeKeeper resource and server status, and can bring resources in service and take them out of service. l Guest - the user can monitor LifeKeeper resource and server status. l Shutdown Strategy. (editable if the user has Administrator permission) The setting that governs whether or not resources are switched over to a backup server in the cluster when a server is shutdown.
CommPaths Tab CommPaths Tab l Server. The server name of the other server the communication path is connected to in the LifeKeeper cluster. l Priority. The priority determines the order by which communication paths between two servers will be used. Priority 1 is the highest and priority 99 is the lowest. l State. State of the communications path in the LifeKeeper Configuration Database (LCD). These are the possible communications path state values: l ALIVE - functioning normally.
Resources Tab These are the possible communications path status values displayed below the detailed text in the lower panel: l NORMAL - all comm paths functioning normally. l FAILED - all comm paths to a given server are dead. l UNKNOWN - comm path status could not be determined. The GUI server may not be available. l WARNING - one or more comm paths to a given server are dead. l DEGRADED - one or more redundant comm paths to a given server are dead. l NONE DEFINED - no comm paths defined.
Operator Tasks l Name. The tag name of a resource instance on the selected server. l Application. The application name of a resource type (gen, scsi, ...) l Resource Type. The resource type, a class of hardware, software, or system entities providing a service (for example, app, filesys, nfs, device, disk,...) l State. The current state of a resource instance: l ISP - In-service locally and protected. l ISU - In-service locally, but local recovery will not be attempted.
Taking a Resource Out of Service service without bringing its parent resource into service as well. Click In Service to bring the resource(s) into service along with any dependent child resources. 3. If the Output Panel is enabled, the dialog closes and the results of the commands to bring the resource(s) in service are shown in the output panel. If not, the dialog remains up to show these results and you click Done to finish when all results have been displayed.
Related Topics The LifeKeeper Configuration Database (LCD) maintains the object-oriented resource hierarchy information and stores recovery direction information for all resource types known to LifeKeeper. The data is cached within system shared memory and stored in files so that configuration data is retained over system restarts. The LCD also contains state information and specific details about resource instances required for recovery.
Hierarchy Definition Hierarchy Definition These are the tasks required to construct the example application hierarchy: 1. Create file system resources. The LifeKeeper GUI provides menus to create file system resources. See Creating File System Resource Hierarchies.
Hierarchy Definition Note: Although you can create much of the definition using the LifeKeeper GUI, the rest of this example demonstrates the command interface. 3. Create directories. On each system, you create the necessary application recovery directories under the directory /opt/LifeKeeper/subsys with the command: mkdir -p /opt/LifeKeeper/subsys/projectapp/Resources/plan/actions 4. Define application.
LCD Configuration Data
dep_create -d Server1 -p the-project-plan -c schedule-onServer1
dep_create -d Server2 -p the-project-plan -c schedule-fromServer1
9. Execute lcdsync. Execute the following lcdsync commands to inform LifeKeeper to update its copy of the configuration:
lcdsync -d Server1
lcdsync -d Server2
10. Bring resources into service. Access the LifeKeeper GUI on the primary server and on the Edit menu, select Resource, then In-Service to bring the resources into service.
LCD Directory Structure Disks on a Small Computer System Interface (SCSI) bus are one example of equivalent resources. With the SCSI locking (or reserve) mechanism, only one server can own the lock for a disk device at any point in time. This lock ownership feature guarantees that two or more servers cannot access the same disk resource at the same time.
Resources Subdirectories different resource type for other application resource hierarchies or app for generic or user-defined applications. l Other typical flags include !nofailover!machine and shutdown_switchover. The !nofailover!machine flag is an internal, transient flag created and deleted by LifeKeeper which controls aspects of server failover.
Resource Actions l actions. This directory contains the set of recovery action programs that act only on resource instances of the specific resource type. If, for your application, any actions apply to all resource types within an application, place them in an actions subdirectory under the application directory rather than under the resource type directory. Recovery direction software is used to modify or recover a resource instance.
LCM LCM The LifeKeeper Communications Manager (LCM) provides reliable communication between processes on one or more LifeKeeper servers. This process can use redundant communication paths between systems so that failure of a single communication path does not cause failure of LifeKeeper or its protected resources. The LCM supports a variety of communication alternatives including RS232 (TTY) and TCP/IP connections. The LCM provides the following: l LifeKeeper Heartbeat.
Communication Status Information l Failover Recovery. If a resource fails on a system, the LCM notifies LifeKeeper to recover the resource on a backup system. In addition to the LifeKeeper services provided by the LCM, inter-system application communication is possible through a set of shell commands for reliable communication. These commands include snd_msg, rcv_msg, and can_talk. These commands are described in the LCMI_mailboxes (1M) manual pages.
Alarm Processing Defining additional scripts for the sendevent alarming functionality is optional. When you define LifeKeeper resources, LifeKeeper provides the basic alarming functionality described in the processing scenarios later in this chapter. Note: Local recovery for a resource instance is the attempt by an application under control of LifeKeeper to return interrupted resource services to the end-user on the same system that generated the event.
Changing LifeKeeper Configuration Values 2. If you are changing the uname of a LifeKeeper server, change the server's hostname using the Linux hostname(1) command. 3. Before continuing, ensure that any new host names are resolvable by all of the servers in the cluster. If you are changing comm path addresses, check that the new addresses are configured and working (the ping and telnet utilities can be used to verify this). 4.
File System Health Monitoring The values to be changed in this example are:
l uname: Server1 (old) to Newserver1 (new)
l comm path address: 172.17.100.48 (old) to 172.17.105.49 (new)
l IP resource address: 172.17.100.220 (old) to 172.17.100.221 (new)
The following steps should be performed to make these changes. 1. Stop LifeKeeper on both Server1 and Server2 using the command: /etc/init.d/lifekeeper stop-nofailover 2. Change the uname of Server1 to Newserver1 using the command: hostname Newserver1 3.
Condition Definitions l A warning message can be logged and email sent to a system administrator. l Local recovery of the resource can be attempted. l The resource can be failed over to a backup server. Condition Definitions Full or Almost Full File System A "disk full" condition can be detected, but cannot be resolved by performing a local recovery or failover - administrator intervention is required. A message will be logged by default. Additional notification functionality is available.
Maintaining a LifeKeeper Protected System l mount point is busy l mount failure l LifeKeeper internal error Maintaining a LifeKeeper Protected System When performing shutdown and maintenance on a LifeKeeper-protected server, you must put that system’s resource hierarchies in service on the backup server before performing maintenance. This process stops all activity for shared disks on the system needing maintenance.
Recovering After a Failover 3. Restore the hierarchy. Use the LifeKeeper GUI to bring the resource hierarchy back in service. See Bringing a Resource In Service for instructions. Recovering After a Failover After LifeKeeper performs a failover recovery from a primary server (Server A) to a backup server (Server B), perform the following steps: 1. Review logs. When LifeKeeper on Server B performs a failover recovery from Server A, status messages are displayed during the failover.
Removing via GnoRPM l Remove all packages. If you remove the LifeKeeper core, you should first remove other packages that depend upon LifeKeeper; for example, LifeKeeper recovery kits. It is recommended that before removing a LifeKeeper recovery kit, you first remove the associated application resource hierarchy. Note: It is recommended that before removing recovery kit software, first remove any associated hierarchies from that server. You may do this using the Unextend Resource configuration task.
LifeKeeper Communication Paths network access requirements. Note: If you wish to simply disable your firewall, see Disabling a Firewall below. LifeKeeper Communication Paths Communication paths are established between pairs of servers within the LifeKeeper cluster using specific IP addresses. Although TCP Port 7365 is used by default on the remote side of each connection as it is being created, the TCP port on the initiating side of the connection is arbitrary.
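As a hedged sketch only (the peer address is hypothetical, and your distribution may manage its firewall through a different front end than raw iptables), a rule permitting the default comm path port from a specific cluster peer might look like:
iptables -A INPUT -p tcp -s 192.168.1.2 --dport 7365 -j ACCEPT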
Disabling a Firewall formula: 10001 + <mirror number> + <256 * i>, where i starts at zero and is incremented until the formula calculates a port number that is not in use. In use constitutes any port found defined in /etc/services, found in the output of netstat -an --inet, or already defined as in use by another LifeKeeper Data Replication resource.
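As a hedged illustration of that "in use" check (the port number is only an example), a candidate replication port can be inspected before relying on it:
grep -w 10001 /etc/services
netstat -an --inet | grep -w 10001
If either command produces output, the port is considered in use and the formula moves on to the next candidate.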
Starting LifeKeeper 1. Make sure your IT department has opened the secure shell port on the corporate firewall sufficiently to allow you to get behind the firewall. Often the machine IT allows you to get to is not actually a machine in your cluster but an intermediate one from which you can get into the cluster. This machine must be a Unix or Linux machine. 2.
Starting LifeKeeper Server Processes Starting LifeKeeper Server Processes If LifeKeeper is not currently running on your system, type the following command as the user root on all servers: /etc/init.d/lifekeeper start Following the delay of a few seconds, an informational message is displayed. Note: If you receive an error message referencing the LifeKeeper Distribution Enabling Package when you start LifeKeeper, you should install / re-install the LifeKeeper Installation Image File.
Disabling Automatic LifeKeeper Restart This command will remove the resources from service but does not set the !nofailover! flag [see LCDIflag(1M)] on any of the systems that it can communicate with. This means that failover will occur if the shutdown_switchover flag is set. If shutdown_switchover is not set, then this command behaves the same as /etc/init.d/lifekeeper stop-nofailover. LifeKeeper will automatically restart when the system is restarted.
Tuning Large Cluster Support LifeKeeper supports large cluster configurations, up to 32 servers. There are many factors other than LifeKeeper, however, that can affect the number of servers supported in a cluster. This includes items such as the storage interconnect and operating system or storage software limitations. Refer to the vendor-specific hardware and software configuration information to determine the maximum supported cluster size. LifeKeeper for Linux v5.
LifeKeeper Operations IPC Semaphores and IPC Shared Memory LifeKeeper requires Inter-Process Communication (IPC) semaphores and IPC shared memory. The default Red Hat values for the following Linux kernel options are located in /usr/src/linux/include/linux/sem.h and should be sufficient to support most LifeKeeper configurations. Option Required Default Red Hat 6.
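To inspect the limits currently in effect on a running system (a general-purpose check, not a LifeKeeper-specific command):
ipcs -ls
ipcs -lm
cat /proc/sys/kernel/sem
The first two commands report the semaphore and shared memory limits respectively; the last prints the raw kernel semaphore parameters.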
LifeKeeper Operations System Panic on Locked Shared Device: LifeKeeper uses a lock to protect shared data from being accessed by other servers on a shared SCSI bus. If LifeKeeper cannot access a device as a result of another server taking the lock on that device, then a critical error has occurred and quick action should be taken or data can be corrupted. When this condition is detected, LifeKeeper enables a feature that will cause the system to panic. If LifeKeeper is stopped by some means other than '/etc/init.
Server Configuration Server Configuration Item Description BIOS Updates The latest available BIOS should always be installed on all LifeKeeper servers. Package Dependencies List for LifeKeeper 7.5 and Later The following is a list of dependencies that may be necessary for the required packages in LifeKeeper 7.5 and later depending upon your OS distribution. IMPORTANT: The 32-bit versions of these packages are required.
Set Block Resource Failover On: system administrator before allowing LifeKeeper to perform a failover recovery of a system that it detects as failed. Use the lk_confirmso command to confirm the failover. By default, the administrator has 10 minutes to run this command. This time can be changed by modifying the CONFIRMSOTO setting in /etc/default/LifeKeeper. If the administrator does not run the lk_confirmso command within the time allowed, the failover will either proceed or be blocked.
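A hedged sketch of that tunable in /etc/default/LifeKeeper, assuming the value is expressed in seconds (600 corresponds to the documented 10-minute default):
# Time allowed for the administrator to run lk_confirmso before the
# failover proceeds or is blocked
CONFIRMSOTO=600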
NFS Client Mounting Considerations NFS Client Mounting Considerations An NFS Server provides a network-based storage system to client computers. To utilize this resource, the client systems must “mount” the file systems that have been NFS exported by the NFS server. There are several options that system administrators must consider on how NFS clients are connected to the LifeKeeper protected NFS resources.
Expanded Multicluster Example
Troubleshooting The Message Catalog (located on our Technical Documentation site under “Search for an Error Code”) provides a listing of all error codes, including operational, administrative and GUI, that may be encountered while using SteelEye Protection Suite for Linux and, where appropriate, provides additional explanation of the cause of the error code and necessary action to resolve the issue.
Installation Description GUI does not work with default RHEL6 64-bit There is a compatibility issue against Red Hat Enterprise Linux 6 64-bit. Solution: Install the following packages, which are contained in the installation media of the OS, prior to installing LifeKeeper. If these are not installed prior to installing LifeKeeper, the install will not finish correctly.
libXau-1.0.5-1.el6.i686.rpm
libxcb-1.5-1.el6.i686.rpm
libX11-1.3-2.el6.i686.rpm
libXext-1.1-3.el6.i686.rpm
libXi-1.3-3.el6.i686.
LifeKeeper Core
LifeKeeper Core Description Language Environment Effects Some LifeKeeper scripts parse the output of Linux system utilities and rely on certain patterns in order to extract information. When some of these commands run under non-English locales, the expected patterns are altered, and LifeKeeper scripts fail to retrieve the needed information. For this reason, the language environment variable LC_MESSAGES has been set to the POSIX “C” locale (LC_MESSAGES=C) in /etc/default/LifeKeeper.
LifeKeeper Core Description lkscsid will halt system when it should issue a sendevent When lkscsid detects a disk failure, it should, by default, issue a sendevent to LifeKeeper to recover from the failure. The sendevent will first try to recover the failure locally and if that fails, will try to recover the failure by switching the hierarchy with the disk to another server.
LifeKeeper Core Description The use of lkbackups taken from versions of LifeKeeper previous to 8.0.0 requires manually updating /etc/default/LifeKeeper when restored on 8.0.0 In LifeKeeper/SPS 8.0.0, there have been significant enhancements to the logging and other major core components. These enhancements affect tunables in the /etc/default/LifeKeeper file. When an lkbackup is restored on 8.0.0, these tunables will no longer have the right values causing a conflict.
LifeKeeper Core Description LifeKeeper syslog EMERG severity messages do not display to a SLES10 or SLES11 host's console which has AppArmor enabled LifeKeeper is accessing /var/run/utmp which is disallowed by the SLES10 or SLES11 AppArmor syslog-ng configuration. Solution: To allow LifeKeeper syslog EMERG severity messages to appear on a SLES10 or SLES11 console with AppArmor enabled, add the following entry to /etc/apparmor.d/sbin.syslog-ng:
/var/run/utmp kr
If added to sbin.
Internet/IP Licensing Internet/IP Licensing Description INTERFACELIST syntax, /etc/hosts settings dependency /etc/hosts settings: When using internet-based licensing (IPv4 address), the configuration of /etc/hosts can negatively impact license validation. If LifeKeeper startup fails with: Error in obtaining LifeKeeper license key: Invalid host. The hostid of this system does not match the hostid specified in the license file.
GUI GUI Description GUI login prompt may not re-appear when reconnecting via a web browser after exiting the GUI When you exit or disconnect from the GUI applet and then try to reconnect from the same web browser session, the login prompt may not appear. Workaround: Close the web browser, re-open the browser and then connect to the server. When using the Firefox browser, close all Firefox windows and re-open.
GUI Description Java Mixed Signed/Unsigned Code Warning - When loading the LifeKeeper Java GUI client applet from a remote system, the following security warning may be displayed: Enter “Run” and the following dialog will be displayed: Block? Enter “No” and the LifeKeeper GUI will be allowed to operate. Recommended Actions: To reduce the number of security warnings, you have two options: 1. Check the “Always trust content from this publisher” box and select “Run”.
Data Replication Description steeleye-lighttpd process fails to start if Port 778 is in use If a process is using Port 778 when steeleye-lighttpd starts up, steeleye-lighttpd fails causing a failure to connect to the GUI. Solution: Set the following tunable on all nodes in the cluster and then restart LifeKeeper on all the nodes: Add the following line to /etc/default/LifeKeeper: API_SSL_PORT=port_number where port_number is the new port to use.
Data Replication Description 32-bit zlib packages should be installed to RHEL 6 (64-bit) for Set Compression Level When using SDR with RHEL 6 (64-bit), the following error may appear: Could not start balance on Target when Compression Level is set on RHEL 6 (64-bit) Solution: To resolve the issue, please install the 32-bit zlib packages from RHEL 6 when using RHEL 6 (64-bit). Mirror breaks and fills up /var/log/messages with errors This issue has been seen occasionally (on Red Hat EL 6.
Data Replication Description Mirror resyncs may hang in early RedHat/CentOS 6.x kernels with a "Failed to remove device" message in the LifeKeeper log Kernel versions prior to version 2.6.32-131.17.1 (RHEL 6.1 kernel version 2.6.32-131.0.15 before update, etc) contain a problem in the md driver used for replication. This problem prevents the release of the nbd device from the mirror resulting in the logging of multiple "Failed to remove device" messages and the aborting of the mirror resync.
IPv6
IPv6 Description SIOS has migrated to the use of the ip command and away from the ifconfig command. Because of this change, customers with external scripts are advised to make a similar change. Instead of issuing the ifconfig command and parsing the output looking for a specific interface, scripts should instead use "ip -o addr show" and parse the output looking for lines that contain the words "inet" and "secondary".
IPv6 Description 'IPV6_AUTOCONF = No' for /etc/sysconfig/network-scripts/ifcfg- is not being honored on reboot or boot On boot, a stateless, auto-configured IPv6 address is assigned to the network interface. If a comm path is created with a stateless IPv6 address of an interface that has IPV6_AUTOCONF=No set, the address will be removed if any system resources manage the interface, e.g. ifdown ;ifup .
Apache Description IPv6 resource reported as ISP when address assigned to bonded NIC but in 'tentative' state IPv6 protected resources in LifeKeeper will incorrectly be identified as 'In Service Protected' (ISP) on SLES systems where the IPv6 resource is on a bonded interface, a mode other than 'activebackup' (1) and Linux kernel 2.6.21 or lower. The IPv6 bonded link will remain in the 'tentative' state with the address unresolvable.
NFS Server Recovery Kit Description The Oracle Recovery Kit does not support the ASM or grid component features of Oracle 10g The following information applies to Oracle 10g database instances only. The Oracle Automatic Storage Manager (ASM) feature provided in Oracle 10g is not currently supported with LifeKeeper. In addition, the grid components of 10g are not protected by the LifeKeeper Oracle Recovery Kit.
SAP Recovery Kit Description NFS v4 cannot be configured with IPv6 The IPv6 virtual IP gets rolled up into the NFSv4 hierarchy. Solution: Do not use an IPv6 virtual IP resource when creating an NFSv4 resource. NFS v4: Unable to re-extend hierarchy after unextend Extend fails because the export point is already exported on the target server.
LVM Recovery Kit Description When changes are made to res_state, monitoring is disabled If Protection Level is set to BASIC and SAP is taken down manually (i.e. for maintenance), it will be marked as FAILED and monitoring will stop. Solution: In order for monitoring to resume, LifeKeeper will need to start up the resource instead of starting it up manually.
DMMP Recovery Kit DMMP Recovery Kit Description DMMP: Write issued on standby server can hang If a write is issued to a DMMP device that is reserved on another server, then the IO can hang indefinitely (or until the device is no longer reserved on the other server). If/when the device is released on the other server and the write is issued, this can cause data corruption. The problem is due to the way the path checking is done along with the IO retries in DMMP.
MD Recovery Kit MD Recovery Kit Description MD Kit does not support mirrors created with “homehost” The LifeKeeper MD Recovery Kit will not work properly with a mirror created with the "homehost" feature. Where "homehost" is configured, LifeKeeper will use a unique ID that is improperly formatted such that in-service operations will fail. On SLES 11 systems, the “homehost” will be set by default when a mirror is created.
Samba Recovery Kit Description MD resource instances can be adversely impacted by udev processing during restore During udev processing, device nodes are removed and recreated. Occasionally during a restore, LifeKeeper will try to access a node before it has been recreated causing the restore to fail. Solution: Perform the LifeKeeper restore action again. Samba Recovery Kit Description The Samba Recovery Kit is currently not supported with SLES 11 SP2.
Running from a Modem:
e.g.: 208.2.84.61 homer.somecompany.com homer
This should reduce the time it takes to make the first lookup. In addition, incorrect settings of the Subnet Mask and Gateway address may result in connection delays and failures. Verify with your Network Administrator that these settings are correct. Running from a Modem: When you connect to a network in which the servers reside via modem (using PPP or SLIP), your computer acquires a temporary IP number for its operation.
From Windows: addresses. When unresolvable names, WINS names or unqualified DHCP names are used, this causes Java to throw an UnknownHostException. This error message may also occur under the following conditions: l Server name does not exist. Check for misspelled server name. l Misconfigured DHCP servers may set the fully-qualified domain name of RMI servers to be the domain name of the resolver domain instead of the domain in which the RMI server actually resides.
From Linux: l Try editing the hosts file to include entries for the local host and the LifeKeeper servers that it will be connected to. On Windows 95/98 systems the hosts file is: %windir%\HOSTS (for example, C:\WINDOWS\HOSTS). Note: On Windows 95/98, if the last entry in the hosts file is not concluded with a carriage-return/line-feed then the hosts file will not be read at all.
Unable to Connect to X Window Server: 5. Try setting the hostname property to be used by the GUI client. This will need to be changed for each administrator. To do this from a browser with the Plugin, open the Java Plug-In Control Panel and set the hostname for the client by adding the following to "Java Run Time Parameters": -Djava.rmi.server.hostname= To do this from the HotJava browser, append the following to the hotjava command line: -Djava.rmi.server.hostname= For Example: -Djava.
Communication Paths Going Up and Down As a result of this problem, your users may have trouble creating or changing resources during the frozen interval. To adjust the system date/time counters backward: 1. Go to single-user mode (which stops LifeKeeper). 2. Change the date or time backwards. 3. Go back to multi-user mode. 4. Restart LifeKeeper. The operation builds a new ha_xref_tbl with the new current time so that the operation can continue.
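A hedged sketch of those four steps on a SysV-style system (the runlevels and example date are illustrative; adapt them to your distribution):
telinit 1                        # go to single-user mode (stops LifeKeeper)
date -s "2013-05-01 10:00:00"    # set the clock backwards
telinit 3                        # return to multi-user mode
/etc/init.d/lifekeeper start     # restart LifeKeeper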
Restoring Your Hierarchy to a Consistent State resources for a hierarchy. In order to maintain consistency in a hierarchy, LifeKeeper requires that priority changes be made to all resources in a hierarchy for each server. The GUI enforces this requirement by displaying all root resources for the hierarchy selected after the OK or Apply button is pressed. You have the opportunity at this point to accept all of these roots or cancel the operation.
No Shared Storage Found When Configuring a Hierarchy 6. As a last resort, if the hierarchy cannot be repaired, you may have to delete and re-create the hierarchy. No Shared Storage Found When Configuring a Hierarchy When you are configuring resource hierarchies there are a number of situations that might cause LifeKeeper to report a "No shared storage" message: Possible Cause: Communications paths are not defined between the servers with the shared storage.
Recovering from a LifeKeeper Server Failure with "NU-" then LifeKeeper was unable to get a unique ID from the device. Without a unique ID LifeKeeper cannot determine if the device is shared. Possible Cause: The storage may require a specific LifeKeeper software to be installed before the device can be used by LifeKeeper. Examples are the steeleye-lkRAW kit to enable Raw I/O support and the steeleye-lkDR software to enable data replication.
Recovering from a Non-Killable Process If a process is not killable, LifeKeeper may not be able to unmount a shared disk partition. Therefore, the resource cannot be brought into service on the other system. The only way to recover from a non-killable process is to reboot the system.
Taking the System to init state S WARNING
Systems with a serial (RS-232 TTY) console can experience severe problems with LifeKeeper service. During operation, LifeKeeper generates console messages. If your configuration has a serial console (instead of the standard VGA console), the entire data path from LifeKeeper to the end-user terminal must be operational in order to ensure the delivery of these console messages.
Suggested Action: reason for alarm. However, if this message appears frequently in the log, or the number is 2 or 3, then two actions may be necessary:
• Attempt to decrease the load on the storage. If the storage is taking longer than 3 times the FAILFASTTIMER (3 times 5 seconds, or 15 seconds, by default), then consider the load that is being placed on the storage and re-balance the load to avoid these long I/O delays.
Chapter 4: SteelEye DataKeeper for Linux Introduction SteelEye DataKeeper for Linux provides an integrated data mirroring capability for LifeKeeper environments. This feature enables LifeKeeper resources to operate in shared and non-shared storage environments.
Synchronous vs. Asynchronous Mirroring
• Provides failover protection for mirrored data.
• Integrates into the LifeKeeper Graphical User Interface.
• Fully supports other LifeKeeper Application Recovery Kits.
• Automatically resynchronizes data between the primary server and backup servers upon system recovery.
• Monitors the health of the underlying system components and performs a local recovery in the event of failure.
• Supports Stonith devices for I/O fencing.
Synchronization (and Resynchronization)
SteelEye DataKeeper uses a NetRAID device that consists of a local disk or partition and a Network Block Device (NBD), as shown in the diagram below. A LifeKeeper supported file system can be mounted on a NetRAID device like any other storage device. In this case, the file system is called a replicated file system. LifeKeeper protects both the NetRAID device and the replicated file system. The NetRAID device is created by building the DataKeeper resource hierarchy.
Standard Mirror Configuration
The most common mirror configuration involves two servers with a mirror established between local disks or partitions on each server, as shown below. Server1 is considered the primary server containing the mirror source. Server2 is the backup server containing the mirror target.
N+1 Configuration
A commonly used variation of the standard mirror configuration above is a cluster in which two or more servers replicate data to a common backup server.
Multiple Target Configuration
When used with an appropriate Linux distribution and kernel version 2.6.7 or higher, SteelEye DataKeeper can also replicate data from a single disk or partition on the primary server to multiple backup systems, as shown below.
SteelEye DataKeeper Resource Hierarchy A given source disk or partition can be replicated to a maximum of 7 mirror targets, and each mirror target must be on a separate system (i.e. a source disk or partition cannot be mirrored to more than one disk or partition on the same target system). This type of configuration allows the use of LifeKeeper’s cascading failover feature, providing multiple backup systems for a protected application and its associated data.
Failover Scenarios The resource datarep-ext3-sdr is the NetRAID resource, and the parent resource ext3-sdr is the file system resource. Note that subsequent references to the DataKeeper resource in this documentation refer to both resources together. Because the file system resource is dependent on the NetRAID resource, performing an action on the NetRAID resource will also affect the file system resource above it.
Scenario 2
Continuing from Scenario 1, Server 2 (still the primary server) becomes inoperable during the resynchronization with Server 1 (now the backup server). Result: Because the resynchronization process did not complete successfully, there is potential for data corruption. As a result, LifeKeeper will not attempt to fail over the DataKeeper resource to Server 1. Only when Server 2 becomes operable will LifeKeeper attempt to bring the DataKeeper resource in-service (ISP) on Server 2.
Scenario 4 Result: LifeKeeper will not bring the DataKeeper resource ISP on Server 2. When Server 1 comes back up, LifeKeeper will automatically bring the DataKeeper resource ISP on Server 1.
Installation and Configuration Before Configuring Your DataKeeper Resources The following topics contain information for consideration before beginning to create and administer your DataKeeper resources. They also describe the three types of DataKeeper resources. Please refer to the LifeKeeper Configuration section for instructions on configuring LifeKeeper Core resource hierarchies.
Software Requirements
• Operating System – SteelEye DataKeeper can be used with any major Linux distribution based on the 2.6 Linux kernel. See the SPS for Linux Release Notes for a list of supported distributions. Asynchronous mirroring and intent logs are supported only on distributions that use a 2.6.16 or later Linux kernel. Multiple target support (i.e., support for more than 1 mirror target) requires a 2.6.7 or later Linux kernel.
• This release of SteelEye DataKeeper does not support Automatic Switchback for DataKeeper resources. Additionally, the Automatic Switchback restriction applies to any other LifeKeeper resource sitting on top of a DataKeeper resource.
• If using Fusion-io, see the Network section of Clustering with Fusion-io for further network configuration information.
Changing the Data Replication Path
Starting with LK 7.
Determine Network Bandwidth Requirements sufficient bandwidth to successfully replicate the partition and keep the mirror in the mirroring state as the source partition is updated throughout the day?" Keeping the mirror in the mirroring state is critical because a switchover of the partition is not allowed unless the mirror is in the mirroring state.
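As a rough cross-check when making this determination, the theoretical daily capacity of a link can be estimated from its bit rate. The sketch below does this for a T1 line; the planning figures quoted next are lower because they allow for protocol and mirroring overhead (the prompt is illustrative only):
root@server# echo $(( 1500000 / 8 * 86400 / 1000000 ))   # 1.5 Mbps expressed as MB/day, raw
16200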
SteelEye DataKeeper can mirror daily, approximately:
• T1 (1.5 Mbps) - 14,000 MB/day (14 GB)
• T3 (45 Mbps) - 410,000 MB/day (410 GB)
• Gigabit (1 Gbps) - 5,000,000 MB/day (5 TB)
Measuring Detailed Rate of Change
The best way to collect Rate of Change data is to log disk write activity for some period of time (one day, for instance) to determine what the peak disk write periods are.
Analyze Collected Detailed Rate of Change Data msg "\n"; msg "This utility takes a /proc/diskstats output file that contains\n"; msg "output, logged over time, and calculates the rate of change of\n"; msg "the disks in the dataset\n"; msg "OUTPUT_CSV=1 set in env. dumps the full stats to a CSV file on STDERR\n"; msg "\n"; msg "Example: $0 1hour \"jun 23 12pm\" steeleye-iostat.
Analyze Collected Detailed Rate of Change Data
    'ios_pending' => 11,
    'ms_time_total' => 12,
    'weighted_ms_time_total' => 13
);
my $devfield = $fields{'dev'};
my $calcfield = $ENV{'ROC_CALC_FIELD'} || $fields{'sectors_written'};
dbg "using field $calcfield\n";
open(FD, "$file") or die "Cannot open $file: $!\n";
foreach (<FD>) {
    chomp;
    @_ = split;
    if (exists($days{$_[0]})) { # skip datestamp divider
        if ($firsttime eq '') {
            $firsttime = join ' ', @_[0..5];
        }
        $lasttime = join ' ', @_[0..
Analyze Collected Detailed Rate of Change Data my $dev = shift; my $blksize = shift; my $interval = shift; my ($max, $maxindex, $i, $first, $last, $total); my $prev = -1; my $first = $_[0]; if ($ENV{'OUTPUT_CSV'}) { print STDERR "$dev," } foreach (@_) { if ($prev != -1) { if ($_ < $prev) { dbg "wrap detected at $i ($_ < $prev)\n"; $prev = 0; } my $this = ($_ - $prev) * $blksize / $interval; if ($this > $max) { $max = $this; $maxindex = $i; } if ($ENV{'OUTPUT_CSV'}) { print STDERR "$this," } } $prev = $_; #
Analyze Collected Detailed Rate of Change Data return @totalvals; } # converts to KB, MB, etc. and outputs size in readable form sub HumanSize { # params: bytes/bits my $bytes = shift; my @suffixes = ( '', 'K', 'M', 'G', 'T', 'P' ); my $i = 0; while ($bytes / 1024.0 >= 1) { $bytes /= 1024.0; $i++; } return sprintf("%.
Graph Detailed Rate of Change Data
# ./roc-calc-diskstats 2m "Jul 22 16:04:01" /root/diskstats.txt sdb1,sdb2,sdc1 > results.txt
The above example dumps a summary (with per disk peak I/O information) to results.txt
Usage Example (Summary + Graph Data):
# export OUTPUT_CSV=1
# ./roc-calc-diskstats 2m "Jul 22 16:04:01" /root/diskstats.txt sdb1,sdb2,sdc1 2> results.csv > results.txt
The above example dumps graph data to results.csv and the summary (with per disk peak I/O information) to results.txt
1. Open results.csv, and select all rows, including the total column.
2. Open diskstats-template.xlsx and select the diskstats.csv worksheet.
3. In cell A1, right-click and select Insert Copied Cells.
4. Adjust the bandwidth value in the cell towards the bottom left of the worksheet to reflect the amount of bandwidth you have allocated for replication.
5. Make a note of the following row/column numbers:
a. Total (row 6 in screenshot below)
b. Bandwidth (row 9 in screenshot below)
c. Last datapoint (column R in screenshot below)
6. Select the bandwidth vs ROC worksheet.
7. Right-click on the graph and select Select Data...
a. Adjust Bandwidth Series
i. From the Series list on the left, select bandwidth
ii. Click Edit
iii.
"=diskstats.csv!$B$<row>:$<last column>$<row>"
example: "=diskstats.csv!$B$9:$R$9"
iv. Click OK
b. Adjust ROC Series
i. From the Series list on the left, select ROC
ii. Click Edit
iii. Adjust the Series Values: field with the following syntax:
"=diskstats.csv!$B$<row>:$<last column>$<row>"
example: "=diskstats.csv!$B$6:$R$6"
iv. Click OK
c. Click OK to exit the Wizard
8. The Bandwidth vs ROC graph will update. Please analyze your results to determine if you have sufficient bandwidth to support replication of your data.
Confirm Failover and Block Resource Failover Settings
In certain configurations, it may be desirable to require manual confirmation by a system administrator before allowing SPS to perform a failover recovery of a system that it detects as failed.
When to Select This Setting time in seconds that LifeKeeper should wait before taking the default action (setting this to zero means “don’t wait before taking default action”). If the administrator does not run the lk_confirmso command within the time allowed, the failover will either proceed or be blocked. By default, the failover will proceed. This behavior can be changed by modifying the CONFIRMSODEF setting in /etc/default/LifeKeeper. CONFIRMSODEF specifies the action to be taken.
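As a sketch only (the timeout value below is illustrative, not a documented default), these two tunables live in /etc/default/LifeKeeper and might look like this, where CONFIRMSODEF=0 lets the failover proceed when the timer expires, CONFIRMSODEF=1 blocks it, and CONFIRMSOTO is the number of seconds to wait for lk_confirmso:
CONFIRMSODEF=0
CONFIRMSOTO=600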
failure. It will only block failovers due to local resource failures.
Setting the Flags on Each Server
1. Log in to the LifeKeeper GUI and select a server in your cluster. If the Properties panel option is selected on the View menu, the Properties panel will display (on the right side of the GUI). On the General tab at the bottom of the panel, your system configuration will be displayed:
2.
Examples 3. In the Set Block Resource Failover On column, select the checkbox for each server in the cluster as required. In the following example, ServerA properties are set to Block Failover to ServerB from ServerA. In order to set Block Failover to ServerA from ServerB, you will need to go into ServerB properties and check the box next to ServerA.
3. In /etc/default/LifeKeeper, set the following:
CONFIRMSODEF = 1
CONFIRMSOTO = 0
4. Perform the above steps on each server in your cluster.
Block Failover in One Direction
1. Select the server in your cluster that would fail over in this scenario and view Properties.
2. On the General tab, check the “Set Confirm Failover On“ box on the server that you want to block failover on.
3.
type. There are several different DataKeeper resource types. The following information can help you determine which type is best for your environment.
Replicate New File System
Choosing a New Replicated File System will create/extend the NetRAID device, mount the given mount point on the NetRAID device and put both the LifeKeeper supported file system and the NetRAID device under LifeKeeper protection. The local disk or partition will be formatted.
Resource Configuration Tasks
You can perform all SteelEye DataKeeper configuration tasks via the LifeKeeper Graphical User Interface (GUI). The LifeKeeper GUI provides a guided interface to configure, administer and monitor SteelEye DataKeeper resources.
Overview
The following tasks are available for configuring SteelEye DataKeeper:
• Create a Resource Hierarchy - Creates a DataKeeper resource hierarchy.
Creating a DataKeeper Resource Hierarchy
Switchback Type: You must select intelligent switchback. This means that after a failover to the backup server, an administrator must manually switch the DataKeeper resource back to the primary server. CAUTION: This release of SteelEye DataKeeper does not support Automatic Switchback for DataKeeper resources. Additionally, the Automatic Switchback restriction is applicable for any other LifeKeeper resource sitting on top of a DataKeeper resource.
Extending Your Hierarchy
This operation can be started from the Edit menu or initiated automatically upon completing the Create Resource Hierarchy option, in which case you should refer to Step 2 below.
1. On the Edit menu, select Resource, then Extend Resource Hierarchy. The Pre-Extend Wizard appears. If you are unfamiliar with the Extend operation, click Next.
Extending a DataKeeper Resource
Target Priority: Select or enter the Target Priority. This is the priority for the new extended DataKeeper hierarchy relative to equivalent hierarchies on other servers. Any unused priority value from 1 to 999 is valid, indicating a server’s priority in the cascading failover sequence for the resource. A lower number means a higher priority (1=highest). Note that LifeKeeper assigns the number “1” to the server on which the hierarchy is created by default.
Unextending Your Hierarchy
DataKeeper Resource Tag: Select or enter the DataKeeper Resource Tag name.
Bitmap File: Select or edit the name of the bitmap file used for intent logging. If you choose none, then an intent log will not be used, and every resynchronization will be a full resync instead of a partial resync.
Replication Path: Select the pair of local and remote IP addresses to use for replication between the target server and the other indicated server in the cluster.
Deleting a Resource Hierarchy 1. On the Edit menu, select Resource then Unextend Resource Hierarchy. 2. Select the Target Server where you want to unextend the DataKeeper resource. It cannot be the server where the DataKeeper resource is currently in service (active). Note: If you selected the Unextend task by right-clicking from the right pane on an individual resource instance, this dialog box will not appear. Click Next. 3. Select the DataKeeper Hierarchy to Unextend and click Next.
Bringing a DataKeeper Resource In Service
Taking a DataKeeper resource out of service breaks the mirror, unmounts the file system (if applicable), stops the md device and kills the nbd server and client. WARNING: Do not take your DataKeeper resource out of service unless you wish to stop mirroring your data and remove LifeKeeper protection. Use the Pause operation to temporarily stop mirroring.
1. In the right pane of the LifeKeeper GUI, right-click on the DataKeeper resource that is in service.
2. Click Out of Service from the resource popup menu.
Performing a Manual Switchover from the LifeKeeper GUI If you execute the Out of Service request, the resource hierarchy is taken out of service without bringing it in service on the other server. The resource can only be brought in service on the same server if it was taken out of service during resynchronization.
Administration Administering SteelEye DataKeeper for Linux The following topics provide information to help in understanding and managing SteelEye DataKeeper for Linux operations and issues after DataKeeper resources are created.
GUI Mirror Administration
A SteelEye DataKeeper mirror can be administered through the LifeKeeper GUI in two ways:
1. By enabling the Properties Panel and clicking the toolbar icons (shown in the screenshot).
Click on each icon below for a description or,
2. By right-clicking the data replication resource and selecting an action from the popup menu.
Create and View Rewind Bookmarks
A bookmark is an entry that is placed in the rewind log file. Bookmarks are useful for keeping track of important system events (such as upgrades) in case a rewind needs to be performed. When you perform a rewind, all bookmarked log entries will be displayed as choices for the rewind point.
Force Mirror Online
Force Mirror Online should be used only in the event that both servers have become inoperable and the primary server cannot bring the resource in service after rebooting. Selecting Force Mirror Online removes the data_corrupt flag and brings the DataKeeper resource in service. For more information, see Primary server cannot bring the resource ISP in the Troubleshooting section.
Rewind and Recover Data
1. The mirror is paused.
2. A timestamp associated with a previous disk write is selected, and the disk is rewound to that time.
3. The user is asked to verify the rewound data and indicate its condition (good or bad).
4. The user then has the option to use the current data (go to Step 5) or continue rewinding by selecting another timestamp (go to Step 2).
5.
Dialog 3
4. The data is being rewound. After the data is rewound, the target disk is mounted for read-only access so that the data can be verified. Click Next.
Dialog 4
5. You are asked for comments on the data. Enter any comments (not mandatory) and click Next.
6. You are now asked to indicate whether the data is valid or not. Answer Yes or No and click Next.
7.
resynced to the old source disk. Click Finish. Rewind is complete.
10. You are asked to manually copy files to the source system. Copy any rewound data that you wish to keep to a safe location, then click Next.
11. The mirror is now being resumed. A full resynchronization will occur from the source to target. Click Finish. Rewind is complete.
Set Compression Level
The Network Compression Level may be set to a value from 0 to 9. A value of 0 disables compression entirely.
maximum size. However, the log will wrap around and overwrite the earliest entries when it detects that it has run out of disk space.
Command Line Mirror Administration
In addition to performing actions through the LifeKeeper GUI, the mirror can also be administered using the command line. There are several commands (found in the $LKROOT/bin directory) that can be used to administer a DataKeeper resource.
Examples: Note: mirror_settings should be run on the target system(s) (or on all systems, if you want the settings to take effect regardless of which system becomes the mirror source). The mirror must be paused and restarted before any settings changes will take effect.
Monitoring Mirror Status via Command Line
Normally, the mirror status can be checked using the Replication Status tab in the Resource Properties dialog of the LifeKeeper GUI.
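From the command line, one hedged example (assuming the DataKeeper resource tag datarep-ext3-sdr used earlier in this chapter; check $LKROOT/bin on your system for the exact command names and syntax) would be:
root@server# mirror_status datarep-ext3-sdr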
Server Failure
If both your primary and backup servers become inoperable, your DataKeeper resource will be brought into service/activated only when both servers are functional again. This is to avoid data corruption that could result from initiating the resynchronization in the wrong direction.
Avoiding Full Resynchronizations
When replicating large amounts of data over a WAN link, it is desirable to avoid full resynchronizations which can consume large amounts of network bandwidth and time. With newer kernels, SteelEye DataKeeper can avoid almost all full resyncs by using its bitmap technology. However, the initial full resync, which occurs when the mirror is first set up, cannot be avoided when existing data is being replicated.
Method 2
7. Bring the mirror and dependent filesystem and applications (if any) into service. The bitmap file will track any changes made while the data is transferred to the target system.
8. Transfer the disk image to the target system using your preferred transfer method.
9. Optional Step – Uncompress the disk image file on the target system:
root@target# gunzip /tmp/sdr_disk.img.gz
10.
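The disk image referenced in these steps (created in an earlier step not shown here) could, for example, be produced on the source and written back on the target with standard tools along these lines; the device name /dev/sdb1 and the image path are placeholders:
root@source# dd if=/dev/sdb1 bs=1M | gzip > /tmp/sdr_disk.img.gz     # capture and compress the source partition
root@target# gunzip -c /tmp/sdr_disk.img.gz | dd of=/dev/sdb1 bs=1M  # write the image onto the target partition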
Clustering with Fusion-io
3. Configure the communication paths in LifeKeeper.
4. Create the mirror and extend to the target. A full resync will occur.
5. Pause the mirror. Changes will be tracked in the bitmap file until the mirror is resumed.
6. Delete the static routes:
root@source# route del -net 10.10.20.0/24
root@target# route del -net 10.10.10.0/24
7. Shut down the target system and ship it to its permanent location.
8. Boot the target system and ensure network connectivity with the source.
9.
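For reference, the temporary static routes deleted in step 6 would have been added in the earlier step where the two systems are first connected side by side; a sketch of that step, assuming a direct crossover link on interface eth1 (the interface name is a placeholder, and the addresses match the route del example above):
root@source# route add -net 10.10.20.0/24 dev eth1
root@target# route add -net 10.10.10.0/24 dev eth1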
Network
• Use a 10 Gbps NIC: Flash-based storage devices from Fusion-io (or other similar products from OCZ, LSI, etc.) are capable of writing data at speeds of hundreds of MB/sec (750+ MB/sec) or more. A 1 Gbps NIC can only push a theoretical maximum of approximately 125 MB/sec, so anyone taking advantage of an ioDrive’s potential can easily write data much faster than a 1 Gbps network connection could replicate it.
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.optmem_max = 16777216
net.ipv4.tcp_congestion_control=htcp
Configuration Recommendations
• Allocate a small (~100 MB) disk partition, located on the Fusion-io drive, to place the bitmap file.
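Returning to the net.* tuning values listed above, one common way to apply them is to add the lines to /etc/sysctl.conf and reload; a brief sketch (confirm against your distribution's preferred mechanism):
root@server# vi /etc/sysctl.conf    # append the net.core and net.ipv4 lines shown above
root@server# sysctl -p              # reload settings from /etc/sysctl.conf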
Multi-Site Cluster SteelEye Protection Suite for Linux Multi-Site Cluster The SteelEye Protection Suite for Linux Multi-Site Cluster is a separately licensed product that uses a LifeKeeper shared storage configuration between two or more servers with the additional ability to replicate the shared disk(s) to one or more target servers using SteelEye DataKeeper for Linux.
and are familiar with the SteelEye LifeKeeper graphical user interface, the SteelEye Protection Suite Multi-Site Cluster resource hierarchy display in the LifeKeeper GUI will appear differently from previous releases of SteelEye DataKeeper.
Multi-Site Cluster Configuration Considerations
Before you begin configuring your systems, it’s important to understand what hierarchy configurations you should avoid in the Linux Multi-Site Cluster hierarchy environment.
Multi-Site Cluster Restrictions
Example 1: Using the Multi-Site Cluster hierarchy’s mirror disk resource more than once in the same or different hierarchies.
Example 2: Using the same Multi-Site Cluster file system or disk resource for the mirror bitmap in more than one Multi-Site Cluster hierarchy. (Each mirror’s bitmap file must reside on a unique LUN and can’t be shared.)
Example 3: Using the bitmap file system, device or disk resource with another hierarchy (Multi-Site or non-Multi-Site).
Replicate New File System
Hierarchy Type: Choose the data replication type you wish to create by selecting one of the following:
• Replicate New File System
• Replicate Existing File System
• DataKeeper Resource
The next sequence of dialog boxes depends on which Hierarchy Type you have chosen. While some of the dialog boxes may be the same for each Hierarchy Type, their sequence and the required information may be slightly different.
3. Select Back to select a different source disk or partition that is shared. Provide the remaining information to finish configuring the SteelEye Protection Suite for Linux Multi-Site Cluster resource.
New Mount Point: Enter the New Mount Point of the new file system. This should be the mount point where the replicated disk or partition will be located.
New File System Type: Select the File System Type.
Bitmap File: Select the bitmap file entry from the pull down list. Displayed in the list are all available shared file systems that can be used to hold the bitmap file. The bitmap file must be placed on a shared device that can switch between the local nodes in the cluster. Important: The bitmap file should not reside on a btrfs filesystem.
Replicate Existing File System
This option will unmount a currently mounted file system on a local disk or partition, create a NetRAID device, then re-mount the file system on the NetRAID device. Both the NetRAID device and the mounted file system are placed under LifeKeeper protection. You should select this option if you want to create a mirror on an existing file system and place it under LifeKeeper protection.
1.
DataKeeper Resource
Bitmap File: Select the bitmap file entry from the pull down list. Displayed in the list are all available shared file systems that can be used to hold the bitmap file. The bitmap file must be placed on a shared device that can switch between the local nodes in the cluster. Important: The bitmap file should not reside on a btrfs filesystem.
Source Disk or Partition: The list of Source Disks or Partitions in the drop down box contains all the available disks or partitions that are not:
• currently mounted
• swap type disks or partitions
• LifeKeeper-protected disks or partitions
The drop down list will also filter out special disks or partitions, for example, root (/), boot (/boot), /proc, floppy and cdrom. Note: If using VMware, see the VMware Known Issue.
2.
Extending Your Hierarchy
DataKeeper Resource Tag: Select or enter a unique DataKeeper Resource Tag name for the DataKeeper resource instance.
Bitmap File: Select the bitmap file entry from the pull down list. Displayed in the list are all available shared file systems that can be used to hold the bitmap file. The bitmap file must be placed on a shared device that can switch between the local nodes in the cluster. Important: The bitmap file should not reside on a btrfs filesystem.
input/confirmation, click Accept Defaults.
2. The Pre-Extend Wizard will prompt you to enter the following information. Note: The first two fields appear only if you initiated the Extend from the Edit menu.
Template Server: Select the Template Server where your DataKeeper resource hierarchy is currently in service.
4. Depending upon the hierarchy being extended, LifeKeeper will display a series of information boxes showing the Resource Tags to be extended, some of which cannot be edited. Click Next to launch the Extend Resource Hierarchy configuration task. The next section lists the steps required to complete the extension of a DataKeeper resource to another server.
Extending a DataKeeper Resource
1.
Extending a Hierarchy to a Disaster Recovery System During resynchronization, the DataKeeper resource and any resource that depends on it will not be able to fail over. This is to avoid data corruption.
Extending a Hierarchy to a Disaster Recovery System Note: Depending upon the hierarchy being extended, LifeKeeper will display a series of information boxes showing the Resource Tags to be extended, some of which cannot be edited. 4. Click Next to launch the Extend Resource Hierarchy configuration task. The next section lists the steps required to complete the extension of a DataKeeper resource to another server. 1.
Configuring the Restore and Recovery Setting for Your IP Resource
Replication Path: Select the pair of local and remote IP addresses to use for replication between the target server and the other indicated server in the cluster. The valid paths and their associated IP addresses are derived from the set of LifeKeeper communication paths that have been defined for this same pair of servers. Due to the nature of DataKeeper, it is strongly recommended that you use a private (dedicated) network.
Migrating to a Multi-Site Cluster Environment
The SteelEye Multi-Site Migrate feature is included in the SteelEye Protection Suite for Linux Multi-Site Cluster product. This additional feature enables an administrator to migrate an existing SteelEye Linux LifeKeeper environment to a Multi-Site Cluster environment. The migration procedure allows selected shared file system resources to be safely migrated and replicated with minimum hierarchy downtime.
Performing the Migration
There are three methods for configuring and performing a Multi-Site Migrate. You can:
• Select the Migrate icon from the LifeKeeper GUI toolbar and then select the resource to migrate.
• Select the file system resource and right-click the mouse to display the Migrate Hierarchy to Multi-Site Cluster menu option.
• Select the file system resource and select the Migration icon from the Properties Panel toolbar.
Performing the Migration If you initiate the Migrate from the global toolbar icon, the following dialog box will display: 1. Select the server where the hierarchy to migrate exists and is in-service. Click Next.
Performing the Migration
2. Select the root hierarchy tag that will be migrated and click Next. The root tag can be a file system or other application resource. The tag selected (for non-file system resources) must contain a file system dependent resource. If you select a File System in the LifeKeeper GUI window and select Migrate Hierarchy to Multi-Site Cluster from the pop-up window or the Migrate icon in the Properties Panel, the following initialization screen displays.
Performing the Migration 3. Press Continue when the Continue button is enabled. The following bitmap dialog will display.
Performing the Migration 4. Select a bitmap file for the file system you are migrating. Select Next. Important: Once you select Next, you will not be able to change the Bitmap File Selection for this file system resource.
Performing the Migration 5. Select the second bitmap file for the second file system being migrated within the hierarchy. After selecting the first bitmap file in the previous dialog box, any additional file system tags will be displayed so that the user can enter a unique bitmap file for each additional file system tag. Note: This screen will not appear if there is only one file system being migrated. Also, multiple screens similar to this will exist if there are more than two file systems being migrated.
Performing the Migration 7. This Summary screen displays all the configuration information you’ve submitted during the Migrate procedure. Once you select Migrate, the following screen displays.
Performing the Migration 8. The Migration status will display in this window. Press Finish when the Finish button is enabled.
Successful Migration
The following image is an example of a file system resource hierarchy after the Multi-Site migration is completed. At this time, the hierarchy can be extended to the non-shared node (megavolt).
Troubleshooting This section provides information regarding issues that may be encountered with the use of DataKeeper for Linux. Where appropriate, additional explanation of the cause of an error is provided along with necessary action to resolve the error condition. Messages specific to DataKeeper for Linux can be found in the DataKeeper Message Catalog. Messages from other SPS components are also possible. In these cases, please refer to the Combined Message Catalog.
Symptom: Primary server cannot bring the resource ISP when it reboots after both servers became inoperable.
Suggested Action: If the primary server becomes operable before the secondary server, you can force the DataKeeper resource online by opening the resource properties dialog, clicking the Replication Status tab, clicking the Actions button, and then selecting Force Mirror Online. Click Continue to confirm, then Finish.
Symptom: Core - Language Environment Effects
Suggested Action: Some LifeKeeper scripts parse the output of Linux system utilities and rely on certain patterns in order to extract information. When some of these commands run under non-English locales, the expected patterns are altered and LifeKeeper scripts fail to retrieve the needed information.
Symptom: Core - Shutdown hangs on SLES10 systems
Symptom: Data Replication - GUI does not show proper state on SLES 10 SP2 system
Suggested Action: On SLES 10 SP2, netstat is broken due to a new format in /proc/<pid>/fd. This issue is due to a SLES 10 SP2 kernel bug and has been fixed in kernel update version 2.6.16.60-0.23. Solution: Please upgrade to kernel version 2.6.16.60-0.23 if running on SLES 10 SP2.
Symptom: Data Replication - Size limitation on 32-bit machines
Suggested Action: Note: Beginning with SPS 8.