RSM Controller mis-reporting failures

terrido terrido@stormy1.Eng.Sun.COM
Thu, 12 Mar 1998 12:30:34 -0800 (PST)


Frank,

First, the bug number you included is not an RSM2k bug.
Did you perhaps mean to refer to bug 4104543?  It discusses a "ghost" drive
as well.  As for that bug, we do not yet know what causes it,
except that it showed up after an upgrade of the RM software/firmware was done.
Now, if this is not the bug you are referring to, then all of this is moot, I suppose.
I will try to address your other issues.

> Date: Thu, 12 Mar 1998 13:58:20 -0500
> From: Frank Gutierrez <frankg@delphi.com>

> 
> Hi,
> 
> I'm posting this question about the RSM here since
> I do not know if there is a specific RSM group,
> so here it goes:
> 
> I have an RSM 2000 with dual-active controller
> configuration. Recently, one of the controllers
> has been giving us problems. But instead of
> failing the controller, the RSM is marking multiple
> drives as bad (when in reality the controller is
> the one at fault). I did some searching on SunSolve
> and found bug id 4042763 that describes this problem
> but found no patch, or permanent fix for it. Is there
> a patch or an upgrade to the RSM software that fixes
> this so that controller failures are properly recognized
> and controller fail-over can occur?

First, how do you know that it is indeed a controller card causing the problem?
I can well believe that it is the source of the issue; however, there is also
disk-tray-related hardware that has caused a good 90% of these anomalies with disks.
Have you verified that the SEN card, the interface card, and the SCSI cable to the 'Sym1000
controller chassis' are not the source of the problem?  Normally it is one of the above.
For instance, very simply, if the SCSI cables are not securely "locked down" by their thumb
screws, there can be enough vibration to back them off the connectors.  They may then make
intermittent connections, or none at all.

I have even had one site where the SEN card in a disk tray had been replaced, but when the 
person put the replacement into the unit, they neglected to seat the board all the way!
These things do (unfortunately) happen on occasion.

Now, as I stated, it may well be a faulty controller card.  If you believe this is the
cause, then you should get a replacement.  As for a patch to make the software recognize the
difference... well, all we can do is forward data on to the company that builds this device.
We have many requests filed for enhancements already.  I do not know whether this is one of the
areas addressed.  This device can be confusing at best when you try to go by what the Recovery
Guru is telling you, but there are also commands that can be used to further isolate what is
really happening.  Keep in mind that the controller cards must communicate with the disks.
If any part of the physical path to them is not quite correct, you will have these strange
indications and behaviors.  This does include the controller card itself, as it has
communications circuitry on it (for each SCSI channel).

> 
> Also, I have 16 drives on this RSM (15 in 3 LUNs
> and a hot spare). However, the RM6 tool seems to
> think that there are 17 drives in the Array and
> keeps complaining about an "unresponsive drive"
> which is, of course, this "ghost" drive. Is there
> any way to clear this drive? 

I would very much like to know about this drive.
What is its "address"?  Was it always there?  Did it only appear after upgrading to RM6.1?
After a firmware download?  When?  Any information you can provide will help us to further isolate
what is happening.
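
If it helps in gathering that information, here is a minimal sketch (in Python) of the kind
of comparison I have in mind: list the drive positions that are physically populated, list
what the tool reports, and see which entry is left over.  It assumes the addresses have been
copied down by hand as (channel, SCSI id) pairs.  The layout and the extra (5, 0) entry below
are purely hypothetical and are not taken from your array, and find_ghost_drives is just an
illustrative helper, not part of RM6.

```python
def find_ghost_drives(expected, reported):
    """Return addresses the tool reports that are not physically installed."""
    return sorted(set(reported) - set(expected))


# Hypothetical inventory: 16 drives as (channel, SCSI id) pairs.
expected = [(channel, scsi_id) for channel in range(1, 5) for scsi_id in range(4)]

# Hypothetical tool output: the same 16 drives plus one extra entry.
reported = expected + [(5, 0)]

ghosts = find_ghost_drives(expected, reported)
print("expected:", len(expected), "reported:", len(reported))
print("unaccounted-for addresses:", ghosts)
```

Whatever address is left over is the one to tell us about, along with when it first appeared.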

Terrie Douglas
SMCC/CTE Engineer
Mass Storage Specialist
terrie.douglas@Eng.Sun.COM

> TIA for any help!
> 
> --
> Frank Gutierrez
> Harris Corporation