HP StorageWorks Scalable File Share Release Notes - Version 2.3
1.4 Known issues and workarounds in HP SFS Version 2.3
This section describes known issues and workarounds in the HP SFS Version 2.3 software, and is organized
as follows:
• "bad dst nid" events in the log are harmless (Section 1.4.1)
• When both power supplies of a node fail, its peer node may reboot repeatedly (Section 1.4.2)
• NFS clients may kernel panic when trying to create files larger than physical memory size (Section 1.4.3)
• Start filesystem will hang on badly initialized InfiniBand fabrics (Section 1.4.4)
• Propagating contents of /etc/modprobe.conf(.lustre*) into the XC systemimage (Section 1.4.5)
• First boot of XC compute nodes may hang or fail during SFS nconfig operation (Section 1.4.6)
• lfs quotacheck issues (Section 1.4.7)
• colplot MDS performance counters do not reflect effective metadata activity (Section 1.4.8)
• Server management commands fail when targeting more than 24 servers at a time (Section 1.4.9)
• Need to avoid DHCP conflicts (Section 1.4.10)
• The SFS Web Server updates sometimes fail (Section 1.4.11)
• OFED 1.2 does not support RH4U6 (Section 1.4.12)
• What to do in case of large numbers of 'database not responding' messages (Section 1.4.13)
1.4.1 "bad dst nid" events in the log are harmless
Messages like the following often appear in the system event log at client eviction/recovery time:
May 2 13:16:42 sfs4gre18-adm kernel: LustreError:
8172:0:(viblnd_cb.c:2327:kibnal_recv_connreq()) Can't accept 172.22.0.68@vib:
bad dst nid 172.22.130.90@vib
These events are harmless and can safely be ignored.
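If these messages clutter the system event log when you review it, they can be filtered out on the command line. The following is a minimal sketch; it assumes the events are written to the standard syslog file /var/log/messages, which may differ on your installation:
grep -c "bad dst nid" /var/log/messages    # Count how many of these harmless events occurred
grep -v "bad dst nid" /var/log/messages    # Review the log with these events filtered out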
1.4.2 When both power supplies of a node fail, its peer node may reboot repeatedly
To avoid this, run sfsmgr on the admin node and manually mark the node with the failed power supplies as disabled, using the disable server command. After one final reboot, the peer node then remains up.
Example: If server 4 is dead, enter the following command:
sfsmgr disable server 4 force=yes
When the problem occurs on the admin/MDS server pair, it is more difficult to solve, because the peer that reboots repeatedly runs the admin service, which is required for the disable server command to work. There is only a very short time window in which to type the commands that break the cycle and finally disable the bad server.
Open the console of the remaining server in the pair, and log in as soon as possible after the next reboot.
Then run the following commands at the bash prompt:
chkconfig cluster off              # Prevent the cluster service from restarting again.
service cluster stop               # Stop the cluster service now.
                                   # At this stage it is safe; the machine will stop rebooting.
grep dev/hpls /etc/cluster.conf    # Get the NAME of the admin LUN. Ex: name = /dev/hpls/dev1a1
mount NAME /var/hpls               # Mount device NAME as the admin LUN. Ex: mount /dev/hpls/dev1a1 /var/hpls
service mysqld start               # Start the system database
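Taken together, and substituting the example admin LUN name /dev/hpls/dev1a1 from the comments above for NAME, the sequence on the surviving server would look like the following. The device name is only an example; always take the actual name from /etc/cluster.conf on your system:
chkconfig cluster off
service cluster stop
grep dev/hpls /etc/cluster.conf
mount /dev/hpls/dev1a1 /var/hpls
service mysqld start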