RSM Controller mis-reporting failures

Rick Rose rrose@dsw.net
Thu, 12 Mar 1998 15:01:40 -0800


Frank,
I've gotten ghost drives before when a drive was not marked for replacement
(through the "# vxdiskadm" interface) before a Sun service SE replaced
the drive with a new drive.  I had to add the new drive to the VxVM configuration
so the customer could continue recovering (good thing it was the hot spare drive,
which is fine as long as no other drive crashes). So you get 2 drives with the same
device name, one blue and one grayed out (i.e. it shows up as bad), in the GUI screen.
To fix the GUI, all that was needed was to delete the grayed-out drive through the GUI.
In the past I've seen the GUI get out of sync with the output from "vxprint -ht" and
"vxdisk list", and I've had to go in and edit the corrupted VxVM configuration db
file.
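A rough sketch of the kind of cross-check I mean, assuming a saved listing in the
five-column layout that "vxdisk list" prints (DEVICE TYPE DISK GROUP STATUS). The
sample rows and the helper name find_suspect_disks are made up for illustration,
not taken from a real system; it only flags device names that show up more than
once and disk records that have no device behind them.

```python
from collections import defaultdict

# Hypothetical sample in the DEVICE/TYPE/DISK/GROUP/STATUS layout; not real output.
SAMPLE = """\
DEVICE       TYPE      DISK         GROUP        STATUS
c1t0d0s2     sliced    disk01       datadg       online
c1t1d0s2     sliced    disk02       datadg       online
c1t1d0s2     sliced    -            -            error
-            -         disk03       datadg       failed was:c1t2d0s2
"""


def find_suspect_disks(listing):
    """Return (duplicates, detached) for text shaped like the sample above.

    duplicates: device name -> list of statuses, for devices listed more than once
    detached:   (disk name, status) pairs for records with no device ("-")
    """
    seen = defaultdict(list)
    detached = []
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)       # status may contain spaces
        if len(fields) < 5:
            continue                       # ignore blank or short lines
        device, _type, disk, _group, status = fields
        if device == "-":
            detached.append((disk, status))
        else:
            seen[device].append(status)
    duplicates = {dev: st for dev, st in seen.items() if len(st) > 1}
    return duplicates, detached


if __name__ == "__main__":
    dups, detached = find_suspect_disks(SAMPLE)
    print("listed more than once:", dups)
    print("no device behind record:", detached)
```

Anything it flags is only a hint about where to look; compare it against what the
GUI shows before deleting or editing anything.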
Good Luck
Rick Rose

terrido wrote:

> Frank,
>
> First, the bug number you included is not an RSM2k bug.
> I am wondering if you meant to refer to bug 4104543?  It discusses a "ghost" drive
> as well.  To answer the question about this bug, we do not yet know what causes it,
> except that it showed up after an upgrade of the RM software/firmware was done.
> Now, if this is not the bug you are referring to, then all of this is moot, I suppose.
> I will try to address your other issues.
>
> > Date: Thu, 12 Mar 1998 13:58:20 -0500
> > From: Frank Gutierrez <frankg@delphi.com>
>
> >
> > Hi,
> >
> > I'm posting this question about the RSM here since
> > I do not know if there is a specific RSM group,
> > so here it goes:
> >
> > I have an RSM 2000 with dual-active controller
> > configuration. Recently, one of the controllers
> > has been giving us problems. But instead of
> > failing the controller, the RSM is marking multiple
> > drives as bad (when in reality the controller is
> > the one at fault). I did some searching on SunSolve
> > and found bug id 4042763 that describes this problem
> > but found no patch, or permanent fix for it. Is there
> > a patch or an upgrade to the RSM software that fixes
> > this so that controller failures are properly recognized
> > and controller fail-over can occur?
>
> First, how do you know that it is indeed a controller card causing the problem?
> I can well believe that it would be the source of the issue; however, there is also
> disk tray related hardware that has caused a good 90% of these anomalies with disks.
> Have you verified that the SEN card, the interface card, and the SCSI cable to the 'Sym1000
> controller chassis' are not the source of the problem?  Normally it is one of the above.
> For instance, very simply, if the SCSI cables are not "locked down" by their thumb screws
> securely, there can be enough vibration to back them off the connectors.  They may make
> intermittent connections, or none at all.
>
> I have even had one site where the SEN card in a disk tray had been replaced, but when the
> person put the replacement into the unit, they neglected to seat the board all the way!
> These things do (unfortunately) happen on occasion.
>
> Now, as I stated, it may well be a faulty controller card.  If you believe this is the
> cause, then you should get a replacement.  As for a patch to make the software recognize the
> difference... well, all we can do is forward data on to the company that builds this device.
> We have many requests filed for enhancements already.  I do not know if this is one area
> addressed or not.  This device can be confusing at best when trying to go by what the Recovery
> Guru is telling you, but there are also commands that can be used to further isolate what is
> really happening.  Keep in mind that the controller cards must communicate with the disks.
> If any part of the physical path to them is not quite correct, you will have these strange
> indications and behaviors.  This does include the controller card itself, as it has
> communications circuitry on it (for each SCSI channel).
>
> >
> > Also, I have 16 drives on this RSM (15 in 3 LUNs
> > and a hot spare). However, the RM6 tool seems to
> > think that there are 17 drives in the Array and
> > keeps complaining about an "unresponsive drive"
> > which is, of course, this "ghost" drive. Is there
> > any way to clear this drive?
>
> I would very much like to know about this drive.
> What is its "address"?  Was it always there?  Did it only appear after upgrading to RM6.1?
> After a firmware download?  When?  Any information you can provide will help us to further isolate
> what is happening.
>
> Terrie Douglas
> SMCC/CTE Engineer
> Mass Storage Specialist
> terrie.douglas@Eng.Sun.COM
>
> > TIA for any help!
> >
> > --
> > Frank Gutierrez
> > Harris Corporation