Pertinent Info
Doug Hughes
Doug.Hughes
Fri, 6 Jun 1997 13:30:44 -0500
Just got this:
(We've already fixed this here, by the way, and had gotten bitten on it
many many times in the past, and submitted many many bug reports before
it finally got fixed. It's nefarious.. You go to move some subdisks.
Everything goes fine. Then, at some later time (usually within a week)
your computer crashes on "free: freeing free frag", or bad inode, or FS
corruption of some other kind. We're REALLY GLAD they finally fixed it...
Saves having to get up at 3am to fsck 40GB of data.. thank goodness
for remote consoles.. enough rambling..)
================================================================================
SunService
SUNSOLVE EARLYNOTIFIER(SM) ALERT
SunSolve EarlyNotifier Alert is published periodically to provide
SunService customers with the latest and most important technical
information regarding Sun hardware and software.
******************************************************************************
******************************************************************************
DATE: May/22/97
SYNOPSIS: "Bad parity" with RAID-5 and SPARCstorage Array Volume Manager
Software version 2.x.
PRODUCT CATEGORY: Storage/Software
PRODUCTS AFFECTED:
Any SPARCstorage Array Volume Manager Version 2.X Software releases,
patched and unpatched, that support RAID-5 configurations.
PART NUMBERS AFFECTED:
N/A
REFERENCES:
BUGID#s 1223482, 1242923, 4010911, 4043658
ESC #s 509297, 508222, 506310, 504099, 504031
PROBLEM DESCRIPTION:
In all Veritas releases that supported RAID-5 prior to VxVM 2.3, some
maintenance functions on RAID-5 volumes under Veritas control can create
bad parity. This can occur during maintenance of the RAID-5 volume, such
as growing the volume or file system, moving subdisks, or running vxevac.
Data corruption can be present in non-typical RAID-5 configurations
that have both of the following characteristics:
- a RAID-5 column is split into multiple subdisks, and the subdisks
do NOT end and begin on a stripe-unit aligned boundary
- a RAID-5 reconstruction operation was performed.
The corruption appears in the region of the RAID-5 column where the
split subdisks meet.
The following example shows a RAID-5 column composed of split subdisks,
where the split falls inside a stripe-unit rather than on a stripe-unit
boundary. (Note the split in column 2, which is composed of subdisk 2
and subdisk 4.)
e.g.:
3 column RAID-5 (using defaults):
    Column 1              Column 2           Column 3
    =========             =========          =========
    subdisk 1             subdisk 2          subdisk 3
+---------------+     +===============+  +---------------+
|stripe-unit    |     |  (subdisk 2)  |  |               |  stripe-width
|is 16k         |     |               |  |               |  is 48k
+---------------+     +===============+  +---------------+
|               |     |  (subdisk 2)  |  |               |
|               |     |               |  |               |
+---------------+     +===============+  +---------------+<-
                                                              stripe-unit
                                                              boundaries
       ...                   ...                ...           / /
+---------------+     +===============+  +---------------+<--/ /
|               |     |  (subdisk 2)  |  |               |    /
|               |     |               |  |               |   /
+---------------+     +=======+++++++++  +---------------+<--/
|               | +-->| (sd 2)+ (sd 4)+  |               |
|               | |   |       +       +  |               |
+---------------+ |   +=======+++++++++  +---------------+
|               | |   +  (subdisk 4)  +  |               |
|               | |   +               +  |               |
+---------------+ |   +++++++++++++++++  +---------------+
|               | |   +  (subdisk 4)  +  |               |
|               | |   +               +  |               |
+---------------+ |   +++++++++++++++++  +---------------+
       ...        |          ...                ...
       ...        |          ...                ...
       ...        |          ...                ...
                  |
                  |
                  |
          Note that this stripe-unit for this RAID-5 volume
          is "split" by 2 subdisks (i.e. the end region of
          subdisk 2 and the beginning region of subdisk
          4). Note also that subdisk 2 does not end on a
          stripe-unit boundary, and that subdisk 4 does
          not start on a stripe-unit boundary, but rather
          somewhere within the stripe-unit.
If a RAID-5 column contains multiple subdisks whose boundaries are not
stripe-unit aligned, you may see data corruption after a RAID-5
reconstruction operation, at the point in the volume where the subdisks
are split. If the RAID-5 volume contains a file system, the corruption
can manifest itself as a failed fsck pass.
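
As a rough illustration of the alignment condition described above, the
following shell sketch checks whether a subdisk boundary falls on a
stripe-unit boundary. The function name boundary_aligned and the sample
offsets are made up for this example; only the 16k stripe-unit size comes
from the diagram. Offsets are assumed to be in KB, measured from the start
of the column, so convert whatever units your volume manager reports
before using it.

```shell
#!/bin/sh
# Hypothetical helper (not part of VxVM): report whether the point where
# one subdisk ends and the next begins lies on a stripe-unit boundary.
# Usage: boundary_aligned <boundary_offset_kb> <stripe_unit_kb>
boundary_aligned() {
    offset_kb=$1
    unit_kb=$2
    remainder=$(( $offset_kb % $unit_kb ))
    if [ "$remainder" -eq 0 ]; then
        echo "aligned"
    else
        echo "split ${remainder}k into a stripe unit"
    fi
}

boundary_aligned 1024 16   # prints: aligned
boundary_aligned 1032 16   # prints: split 8k into a stripe unit
```

A column whose subdisk boundaries all report "aligned" does not match the
geometry described in this alert.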
CORRECTIVE ACTION:
Upgrade the customer to VxVM 2.3. If a current backup exists, use it to
restore any RAID-5 volume suspected of having bad parity. If no backup
exists, execute the following procedure on each RAID-5 volume you
suspect may be bad.
How to repair the parity if suspected bad:
A RAID-5 parity resync can be forced by doing the following:
1) Unmount the file system on the RAID-5 volume
2) vxvol -g <diskgroup> stop <volume>
3) vxmend -g <diskgroup> fix empty <volume>
4) vxvol -g <diskgroup> start <volume>
This will take a while and will resync the parity.
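
The four steps above could be wrapped in a small script along these
lines. This is only a sketch: the run and resync_parity helpers, the
DRY_RUN switch, and the disk group, volume, and mount point names are
all invented for illustration. The vxvol and vxmend invocations are the
ones given in the alert. By default the script only prints what it would
do; set DRY_RUN=0 to actually execute the commands.

```shell
#!/bin/sh
# Sketch of the parity resync procedure from the alert.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Usage: resync_parity <diskgroup> <volume> <mountpoint>
# Stops at the first step that fails.
resync_parity() {
    dg=$1
    vol=$2
    mnt=$3
    run umount "$mnt"                    || return 1   # 1) unmount the file system
    run vxvol -g "$dg" stop "$vol"       || return 1   # 2) stop the volume
    run vxmend -g "$dg" fix empty "$vol" || return 1   # 3) vxmend fix empty
    run vxvol -g "$dg" start "$vol"      || return 1   # 4) start; parity is resynced
}

# Hypothetical names, for illustration only.
resync_parity datadg vol01 /export/data
```

Running it in dry-run mode first is a cheap way to confirm the disk group
and volume names before the volume is actually stopped.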
COMMENTS:
Timing tests took 58 minutes to resync the parity on a 10 GB volume with
enough I/O load on the system to cause all other disks to be an average
of 75% busy.
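
For planning purposes, the single timing figure above can be turned into
a very rough estimate for other volume sizes. The sketch below assumes
that resync time scales linearly with volume size and that the system is
under a load similar to the one in the test; neither assumption is stated
in the alert, and the function name is made up.

```shell
#!/bin/sh
# Rough estimate only: scales the alert's figure of 58 minutes per 10 GB
# linearly. Real resync times depend on hardware and I/O load.
# Usage: estimate_resync_minutes <volume_size_gb>
estimate_resync_minutes() {
    echo $(( $1 * 58 / 10 ))
}

estimate_resync_minutes 40   # prints 232
```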
A patch (for VxVM 2.3 only) is in test at this time and will be released
in the near future with a utility that will allow the checking of the
RAID-5 volumes to determine if they need to have their parity synced.
******************************************************************************
******************************************************************************
Other SunService Information Resources
--------------------------------------
SunSolve Online including SunSolve Bulletin Board(SM)
World Wide Web:
North America: http://sunsolve.sun.com
UK: http://online.sunsolve.sun.co.uk
France: http://sunsolve.sun.fr
Germany: http://sunsolve.sun.de/sunsolve
Switzerland: http://sunsolve.sun.ch
Japan: http://sunsolve.sun.co.jp/sunsolve
Australia: http://sunsolve1.sun.com.au
Telnet access is available on each of the above servers
(except in Germany, Telnet to suninfo.sun.de).
SunSolve CD-ROM(TM):
Updated and distributed every six weeks to SunService contract customers.
For a complete list of patches, check the patch reports in the EarlyNotifier
data collection on SunSolve.