[Veritas-ha] Question about HA and disks
Andrey Dmitriev
admitriev at mentora.com
Tue Oct 28 01:56:20 CDT 2008
Sure,
We don't actually know what's causing this. We've found that when the NFS server fails, one of the DB servers falls over. (NFS is used for backups.)
It's not clear from the logs whether pluto actually crashed. It's almost as if the heartbeat somehow went away (not sure how that's possible, since the heartbeats go across 2 hubs), and _then_ it crashed (when the other server took over the resources).
And per the other posts: yes, I'll look into I/O fencing (I forgot about the feature, thanks for the reminder), and yes, we are thinking of upgrading to RH5 with kdump (crash dump).
Oct 18 13:42:45 pluto Had[6220]: VCS WARNING V-16-1-51047 HAD Self Check: Excessive delay in the HAD heartbeat to GAB (10 seconds)
Oct 18 13:42:45 pluto Had[6220]: VCS WARNING V-16-1-53024 HAD Signal SIGABRT received
Oct 18 13:42:45 pluto Had[6220]: VCS NOTICE V-16-1-53028 Beginning execution of the diagnostics script
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 7 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 8 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 9 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 10 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 11 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 12 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 13 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20057 Port h process 6220 inactive 14 sec
Oct 18 13:42:45 pluto kernel: GAB WARNING V-15-1-20058 Port h process 6220: heartbeat failed, killing process
Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20059 Port h heartbeat interval 15000 msec. Statistics:
Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 0 ~ 3000 msec: 87923248
Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 3000 ~ 6000 msec: 0
Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 6000 ~ 9000 msec: 0
Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 9000 ~ 12000 msec: 0
Oct 18 14:05:52 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 12000 ~ 15000 msec: 0
Oct 18 14:05:52 pluto kernel: GAB INFO V-15-1-20041 Port h: client process failure: killing process
Oct 18 14:05:52 pluto kernel: 000000 0 2283 1 2339 1564 (NOTLB)
Oct 18 14:05:52 pluto kernel: 00000106f3467b98 0000000000000002 0000010527635800 0000000000000246
Oct 18 14:05:52 pluto AgentFramework[6681]: VCS ERROR V-16-1-13027 Thread(4017064880) Resource(hok2_listener) - monitor procedure did not complete within the expected time.
Oct 18 14:05:52 pluto kernel: 0000000000000246 ffffffff8029eb47 00000105fd042080 000000002d877320
Oct 18 14:05:52 pluto AgentFramework[6681]: VCS ERROR V-16-1-13027 Thread(4006575024) Resource(str2_listener) - monitor procedure did not complete within the expected time.
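The jump in the pluto syslog above, from 13:42:45 straight to 14:05:52, is easy to miss by eye. A minimal Python sketch for flagging such silent periods follows. It assumes the standard 15-character syslog timestamp prefix and the year 2008 (syslog does not record the year); the function name `find_gaps` and the 60-second threshold are illustrative choices, not part of any VCS tooling.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(seconds=60), year=2008):
    """Return (previous, current, gap) for consecutive lines further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        # Syslog prefix is "Mon DD HH:MM:SS" (15 characters); the year is an assumption.
        ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps

# Three consecutive lines copied from the pluto log above.
sample = [
    "Oct 18 13:42:45 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 9000 ~ 12000 msec: 0",
    "Oct 18 14:05:52 pluto kernel: GAB INFO V-15-1-20129 Port h: heartbeats in 12000 ~ 15000 msec: 0",
    "Oct 18 14:05:52 pluto kernel: GAB INFO V-15-1-20041 Port h: client process failure: killing process",
]

for before, after, gap in find_gaps(sample):
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: silent for {gap}")
```

On these three lines the sketch reports a single gap of 23 minutes 7 seconds, during which pluto logged nothing.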
And the other node (vcs log)
2008/10/18 13:43:06 VCS INFO V-16-1-10077 Received new cluster membership
2008/10/18 13:43:06 VCS NOTICE V-16-1-10080 System (sun) - Membership: 0xc, Jeopardy: 0x0
2008/10/18 13:43:06 VCS ERROR V-16-1-10079 System pluto (Node '1') is in Down State - Membership: 0xc
2008/10/18 13:43:06 VCS ERROR V-16-1-10322 System pluto (Node '1') changed state from RUNNING to FAULTED
2008/10/18 13:43:06 VCS NOTICE V-16-1-10446 Group pluto_gp is offline on system pluto
2008/10/18 13:43:06 VCS INFO V-16-1-10493 Evaluating pluto as potential target node for group pluto_gp
2008/10/18 13:43:06 VCS INFO V-16-1-10494 System pluto not in RUNNING state
2008/10/18 13:43:06 VCS INFO V-16-1-10493 Evaluating sun as potential target node for group pluto_gp
2008/10/18 13:43:06 VCS INFO V-16-1-10493 Evaluating mars as potential target node for group pluto_gp
2008/10/18 13:43:06 VCS NOTICE V-16-1-10301 Initiating Online of Resource plutodg (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:06 VCS NOTICE V-16-1-10301 Initiating Online of Resource orapluto (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:06 VCS INFO V-16-6-15004 (mars) hatrigger:Failed to send trigger for sysoffline; script doesn't exist
2008/10/18 13:43:06 VCS NOTICE V-16-10031-1514 (sun) DiskGroup:plutodg:online:Diskgroups will be imported without reservations.
2008/10/18 13:43:07 VCS WARNING V-16-10031-1516 (sun) DiskGroup:plutodg:online:Trying force import for the diskgroup.
2008/10/18 13:43:07 VCS WARNING V-16-10031-1506 (sun) DiskGroup:plutodg:online:vxdg import (force) succeeded on Disk Group plutodg.
2008/10/18 13:43:07 VCS INFO V-16-2-13001 (sun) Resource(plutodg): Output of the completed operation (online)
VxVM vxdg WARNING V-5-1-1328 Volume u81vol: Temporarily renumbered due to conflict
2008/10/18 13:43:08 VCS INFO V-16-1-10298 Resource plutodg (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)
2008/10/18 13:43:08 VCS NOTICE V-16-1-10301 Initiating Online of Resource u83vol (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:08 VCS NOTICE V-16-1-10301 Initiating Online of Resource u81vol (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:09 VCS INFO V-16-10031-12501 (sun) Volume:u83vol:online:Volume u83vol is started. Any mirrors are updated in background.
2008/10/18 13:43:10 VCS INFO V-16-10031-12501 (sun) Volume:u81vol:online:Volume u81vol is started. Any mirrors are updated in background.
2008/10/18 13:43:11 VCS INFO V-16-1-10298 Resource u83vol (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)
2008/10/18 13:43:11 VCS NOTICE V-16-1-10301 Initiating Online of Resource u83_mt (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:11 VCS INFO V-16-1-10298 Resource u81vol (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)
2008/10/18 13:43:11 VCS NOTICE V-16-1-10301 Initiating Online of Resource u81_mt (Owner: unknown, Group: pluto_gp) on System sun
2008/10/18 13:43:11 VCS NOTICE V-16-10031-5511 (sun) Mount:u83_mt:online:Trying force mount...
2008/10/18 13:43:12 VCS NOTICE V-16-10031-5515 (sun) Mount:u83_mt:online:Performing log replay...
2008/10/18 13:43:22 VCS INFO V-16-1-10298 Resource orapluto (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)
2008/10/18 13:43:41 VCS NOTICE V-16-10031-5511 (sun) Mount:u81_mt:online:Trying force mount...
2008/10/18 13:43:42 VCS INFO V-16-2-13001 (sun) Resource(u83_mt): Output of the completed operation (online)
UX:vxfs mount.vxfs: ERROR: V-3-21268: /dev/vx/dsk/plutodg/u83vol is corrupted. needs checking
UX:vxfs mount.vxfs: ERROR: V-3-21268: /dev/vx/dsk/plutodg/u83vol is corrupted. needs checking
fsck 1.35 (28-Feb-2004)
log replay in progress
replay complete - marking super-block as CLEAN
2008/10/18 13:43:42 VCS NOTICE V-16-10031-5515 (sun) Mount:u81_mt:online:Performing log replay...
2008/10/18 13:43:43 VCS INFO V-16-1-10298 Resource u83_mt (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)
2008/10/18 13:44:20 VCS INFO V-16-2-13001 (sun) Resource(u81_mt): Output of the completed operation (online)
UX:vxfs mount.vxfs: ERROR: V-3-21268: /dev/vx/dsk/plutodg/u81vol is corrupted. needs checking
UX:vxfs mount.vxfs: ERROR: V-3-21268: /dev/vx/dsk/plutodg/u81vol is corrupted. needs checking
fsck 1.35 (28-Feb-2004)
log replay in progress
replay complete - marking super-block as CLEAN
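To see how long each resource took to come online on sun, the "Initiating Online" and "is online" messages in the engine log above can be paired up. A rough Python sketch, assuming the message layout shown in this log (V-16-1-10301 for the start, V-16-1-10298 for completion); the helper name `online_durations` is made up for illustration.

```python
import re
from datetime import datetime

# Patterns assume the exact message layout seen in the log excerpt above.
INITIATE = re.compile(r"^(\S+ \S+) VCS NOTICE V-16-1-10301 Initiating Online of Resource (\S+) ")
ONLINE = re.compile(r"^(\S+ \S+) VCS INFO V-16-1-10298 Resource (\S+) .* is online on ")

def parse_ts(text):
    return datetime.strptime(text, "%Y/%m/%d %H:%M:%S")

def online_durations(lines):
    """Map resource name -> seconds between 'Initiating Online' and 'is online'."""
    started, durations = {}, {}
    for line in lines:
        m = INITIATE.match(line)
        if m:
            started[m.group(2)] = parse_ts(m.group(1))
            continue
        m = ONLINE.match(line)
        if m and m.group(2) in started:
            delta = parse_ts(m.group(1)) - started[m.group(2)]
            durations[m.group(2)] = delta.total_seconds()
    return durations

# Lines copied from the log above.
sample = [
    "2008/10/18 13:43:06 VCS NOTICE V-16-1-10301 Initiating Online of Resource plutodg (Owner: unknown, Group: pluto_gp) on System sun",
    "2008/10/18 13:43:06 VCS NOTICE V-16-1-10301 Initiating Online of Resource orapluto (Owner: unknown, Group: pluto_gp) on System sun",
    "2008/10/18 13:43:08 VCS INFO V-16-1-10298 Resource plutodg (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)",
    "2008/10/18 13:43:11 VCS NOTICE V-16-1-10301 Initiating Online of Resource u83_mt (Owner: unknown, Group: pluto_gp) on System sun",
    "2008/10/18 13:43:22 VCS INFO V-16-1-10298 Resource orapluto (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)",
    "2008/10/18 13:43:43 VCS INFO V-16-1-10298 Resource u83_mt (Owner: unknown, Group: pluto_gp) is online on sun (VCS initiated)",
]

for name, seconds in online_durations(sample).items():
    print(f"{name}: {seconds:.0f} s")
```

Resources with no matching "is online" line in the excerpt (u81_mt here) are simply left out of the result.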
-----Original Message-----
From: Jon E Price/SYS/NYTIMES [mailto:jon at nytimes.com]
Sent: Monday, October 27, 2008 8:55 PM
To: Jim Senicka
Cc: Andrey Dmitriev; Joshua Fielden; veritas-ha at mailman.eng.auburn.edu
Subject: RE: [Veritas-ha] Question about HA and disks
So...
1.) If a system panics, then filesystem corruption found when the Service Group onlines on the other node was likely caused by the same problem that caused the panic.
and
2.) If a system "hangs" (to use a word) -- but does not panic -- and filesystem corruption is found when the Service Group onlines on the other node, it is possible the first node was still writing to the filesystem resulting in a split brain scenario. And if so, it's likely that caused the filesystem corruption.
And one of the ways to prevent the split brain scenario is I/O Fencing...
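The idea behind fencing can be shown with a toy model: the shared disk accepts writes only from nodes that still hold a registration, and the surviving node ejects the other node's registration before taking over. The Python sketch below is only an illustration of that registration-and-preempt idea (the approach SCSI-3 persistent reservations provide); the class and method names are hypothetical, and it is not how VCS or the storage actually implements it.

```python
class SharedDisk:
    """Toy disk that accepts writes only from currently registered nodes."""

    def __init__(self):
        self.registered = set()
        self.blocks = []

    def register(self, node):
        self.registered.add(node)

    def preempt(self, winner, loser):
        # Only a node that is itself registered may eject another node.
        if winner not in self.registered:
            raise PermissionError(f"{winner} is not registered")
        self.registered.discard(loser)

    def write(self, node, data):
        # A fenced-off node has its write rejected instead of corrupting data.
        if node not in self.registered:
            return False
        self.blocks.append((node, data))
        return True


disk = SharedDisk()
disk.register("pluto")
disk.register("sun")

# pluto stops heartbeating but is not actually dead; sun fences it, then takes over.
disk.preempt(winner="sun", loser="pluto")
sun_ok = disk.write("sun", "log replay")
pluto_ok = disk.write("pluto", "stale buffered write")

print("sun write accepted:", sun_ok)
print("pluto write accepted:", pluto_ok)
```

Without the `preempt` step both writes would be accepted, which is the split brain case described in 2.) above.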
Jon
From: "Jim Senicka" <james_senicka at symantec.com>
Sent: 10/27/2008 08:23 PM
To: "Jon E Price/SYS/NYTIMES" <jon at nytimes.com>, "Andrey Dmitriev" <admitriev at mentora.com>, "Joshua Fielden" <Joshua_Fielden at symantec.com>, <veritas-ha at mailman.eng.auburn.edu>
Subject: RE: [Veritas-ha] Question about HA and disks
In the original message
"We had an issue where serverA failed and serverB took over.
However, serverB took over when serverA was still 'crashing' (it took a good 10-15mins to crash),"
I can assume crash = panic, as "crashing" has to refer to dumping core to disk.
If this is the case, there will be no logs on server A, as it is mid-panic.
In this case (the node is in the middle of a crash dump), it will not be writing to data disks. Whatever was written happened before the kernel call to panic. Fencing will protect that data once the new node imports, but in the case described here, the corruption had to happen before the panic, so fencing would not have helped.
Bottom line is the node ceased writing as soon as the non-maskable interrupt was called for panic (unless Linux somehow violates every Unix kernel rule, which I seriously doubt). When VCS took over the service group on Server B, Server A was down and could not have been writing.
-----Original Message-----
From: Jon E Price/SYS/NYTIMES [mailto:jon at nytimes.com]
Sent: Monday, October 27, 2008 8:14 PM
To: Jim Senicka; Andrey Dmitriev; Joshua Fielden; veritas-ha at mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Question about HA and disks
Hi,
A few questions..
Andrey: Could you post the logs (or even portions of them) which show what ServerA was doing during the takeover?
Joshua: You're saying that IO Fencing can prevent split brain situations in which one server is still writing to a filesystem while a 2nd server has taken over that same service group and begun writing to the same fs, thus possibly causing corruption?
http://sfdoccentral.symantec.com/sf/5.0/linux/html/vcs_install/ch_vcs_install_iofence.html#190559
Jim: What's the evidence that the server panic'd?
And is 16 seconds the default for the heartbeat failure?
Jon
From: "Jim Senicka" <james_senicka at symantec.com>
Sent by: veritas-ha-bounces at mailman.eng.auburn.edu
Sent: 10/27/2008 07:19 PM
To: "Andrey Dmitriev" <admitriev at mentora.com>, <veritas-ha at mailman.eng.auburn.edu>
Subject: Re: [Veritas-ha] Question about HA and disks
When a server panics, it stops writing to anything but the dump device.
VCS did exactly as designed. 16 seconds after heartbeat failure it started takeover. Whatever was damaged on your file system was already damaged at that point, regardless of how long it took to dump core to the dump device. I would look at the cause of the panic; it likely had something to do with whatever garbaged your FS.
-----Original Message-----
From: veritas-ha-bounces at mailman.eng.auburn.edu
[mailto:veritas-ha-bounces at mailman.eng.auburn.edu] On Behalf Of Andrey Dmitriev
Sent: Monday, October 27, 2008 2:01 PM
To: veritas-ha at mailman.eng.auburn.edu
Subject: [Veritas-ha] Question about HA and disks
We had an issue where serverA failed and serverB took over.
However, serverB took over when serverA was still 'crashing' (it took a good 10-15 mins to crash), and apparently still had a hold of the file systems (system logs confirm that takeover occurred while serverA was still 'puking').
The file systems on serverB came up corrupt, and we lost some data because of that.
HA is set up via heartbeats. File system is vxfs, OS is RedHat 4.0.
Is there any way to avoid that?
Thanks,
Andrey
_______________________________________________
Veritas-ha maillist - Veritas-ha at mailman.eng.auburn.edu http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha