RSM Controller mis-reporting failures
Frank Gutierrez
fgutierr@delphi.com
Fri, 13 Mar 1998 10:40:54 -0500
terrido wrote:
>
> Frank,
>
> First, how do you know that it is indeed a controller card causing the problem?
> I can well believe that it would be the source of the issue, however, there is also
> disk tray related hardware that has caused a good 90% of these anomalies with disks.
> Have you verified the SEN card, the interface card, the scsi cable to the 'Sym1000
> controller chassis' as not the source of the problem? Normally it is one of the above.
> For instance, very simply, if the SCSI cables are not "locked down" by their thumb screws
> securely, there can be enough vibration to back them off the connectors. They may make
> intermittent connections, or none at all.
>
> I have even had one site where the SEN card in a disk tray had been replaced, but when the
> person put the replacement into the unit, they neglected to seat the board all the way!
> These things do (unfortunately) happen on occasion.
>
> Now, I stated that yes, it may well be a faulty controller card. If you believe this is the
> cause, then you should get a replacement. As for a patch to make the software recognize the
> difference... well, all we can do is forward data on to the company that builds this device.
> We have many requests filed for enhancements already. I do not know if this is one area
> addressed or not. This device can be confusing at best when trying to go by what the Recovery
> Guru is telling you, but there are also commands that can be used to further isolate what is
> really happening. Keep in mind that the controller cards must communicate with the disks.
> If any part of the physical path to them is not quite correct, you will have these strange
> indications and behaviors. This does include the controller card itself, as it has
> communications circuitry on it (for each scsi channel).
>
BTW, the bug report I was referring to was 4070772.
We went and checked all the connections between the trays and the
controllers, as well as the SCSI cables between the controllers and the
workstation that is driving them. We didn't find any loose connections
or anything that would lead us to believe we had some other problem.
I think it's the controller because on one occasion we had a system
panic when writing to the filesystem on the RSM. The failure lights
were lit on the controller. We re-seated the controller and the system
came back up. After it came back up is when we noticed the "ghost
drive" (see below).
After this initial failure, the system ran fine for about 3 weeks. Then
the system console started getting flooded with error messages about
trying to write to the RSM. The error messages pointed to the same
controller as before, only this time it had multiple drive failures on
the same LUN. Since I had a lot of data on the LUN and didn't want to
lose it all, I did some searching on SunSolve, and that's when I ran
across bug report 4070772. I followed the instructions in the bug report
to "revive" the drives and it worked. The LUN returned to "optimal"
status and I was able to fsck and remount the LUN.
> I would very much like to know about this drive.
> What is its "address"? Was it always there? Did it only appear after upgrading to RM6.1?
> After a firmware download? When? Any information you can provide will help us to further isolate
> what is happening.
The drive is reported in RM6 as "5,12", where "5,11" is the last
physical drive, and it's marked as a hot spare. This drive appeared
after a reboot from a panic caused by trying to write to the filesystem
on the RSM, as I described earlier. All attempts to delete this drive
using the RM6 tool fail.
The host is an Ultra 2 running 2.5.1 with patches updated as of Oct. 97.
>
> Terrie Douglas
> SMCC/CTE Engineer
> Mass Storage Specialist
> terrie.douglas@Eng.Sun.COM
>
TIA!
Frank Gutierrez
Harris Corporation