Pertinent Info

Doug Hughes Doug.Hughes
Fri, 6 Jun 1997 13:30:44 -0500


Just got this:

(We've already fixed this here, by the way, and had gotten bitten on it
many many times in the past, and submitted many many bug reports before
it finally got fixed. It's nefarious.. You go to move some subdisks.
Everything goes fine. Then, at some later time (usually within a week)
your computer crashes on "free: freeing free frag", or bad inode, or FS
corruption of some other kind. We're REALLY GLAD they finally fixed it...
Saves having to get up at 3am to fsck 40GB of data.. thank goodness
for remote consoles.. enough rambling..)

 
================================================================================
                                SunService

                    SUNSOLVE EARLYNOTIFIER(SM) ALERT


SunSolve EarlyNotifier Alert is published periodically to provide 
SunService customers with the latest and most important technical 
information regarding Sun hardware and software.

******************************************************************************
******************************************************************************

DATE: May/22/97

SYNOPSIS: "Bad parity" with RAID-5 and SPARCstorage Array Volume Manager 
	  Software version 2.x.

PRODUCT CATEGORY: Storage/Software

PRODUCTS AFFECTED:

	Any SPARCstorage Array Volume Manager Version 2.X Software releases,
	patched and unpatched, that support RAID-5 configurations.

PART NUMBERS AFFECTED:

	N/A

REFERENCES:

	BUGID#s 1223482, 1242923, 4010911, 4043658
	ESC #s 509297, 508222, 506310, 504099, 504031

PROBLEM DESCRIPTION:

In all Veritas releases that supported RAID-5 prior to VxVM 2.3, some
maintenance functions of RAID-5 volumes under Veritas control can create
bad parity.  This could occur during maintenance of the RAID-5 volume when
growing the volume or file system, moving subdisks, running vxevac, etc.

For non-typical RAID-5 configurations which have both of the following
characteristics:

        - a RAID-5 column is split into multiple subdisks, where the
          subdisks do NOT end and begin on a stripe-unit aligned
          boundary

        - and a RAID-5 reconstruction operation was performed

data corruption can be present in the region of the RAID-5 column
where the split subdisks meet.

The following is an example of a RAID-5 column composed of subdisks
whose split falls within a RAID-5 stripe-unit rather than on a
stripe-unit boundary.  (Note the split in column 2, composed of subdisk
2 and subdisk 4.)

e.g.:
		3 column RAID-5 (using defaults):

    Column 1                Column 2                Column 3
    =========               =========               =========  
        
    subdisk 1               subdisk 2               subdisk 3   
+---------------+       +===============+       +---------------+ 
|stripe-unit    |       | (subdisk 2)   |       |               | stripe-width
|is 16k         |       |               |       |               | is 48k
+---------------+       +===============+       +---------------+
|               |       | (subdisk 2)   |       |               |
|               |       |               |       |               |
+---------------+       +===============+       +---------------+<- stripe-unit
                                                               boundaries
        ...                     ...                     ...          /  /
+---------------+       +===============+       +---------------+<--/  /
|               |       | (subdisk 2)   |       |               |     /
|               |       |               |       |               |    /
+---------------+       +=======+++++++++       +---------------+<--/
|               |   +-->| (sd 2)+ (sd 4)+       |               |
|               |   |   |       +       +       |               |
+---------------+   |   +=======+++++++++       +---------------+
|               |   |   + (subdisk 4)   +       |               |   
|               |   |   +               +       |               |
+---------------+   |   +++++++++++++++++       +---------------+
|               |   |   + (subdisk 4)   +       |               |
|               |   |   +               +       |               |
+---------------+   |   +++++++++++++++++       +---------------+
        ...         |           ...                     ...
        ...         |           ...                     ...
        ...         |           ...                     ...
                    |
                    |
                    |
                    Note that this stripe-unit for this RAID-5 volume
                        is "split" by 2 subdisks (i.e. the end region of
			subdisk 2 and the beginning region of subdisk
			4).  Note also that subdisk 2 does not end on a
			stripe-unit boundary, and that subdisk 4 does
			not start on a stripe-unit boundary, but rather
			somewhere within the stripe-unit.


If a RAID-5 column geometry has multiple subdisks whose boundaries
are not stripe-unit aligned, you may see data corruption after a
RAID-5 reconstruction operation in the RAID-5 volume at the point
where the subdisks are split.  If this RAID-5 volume contains a
file system, this could manifest itself as a failed fsck pass.
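
The alignment condition described above can be sketched in a few lines
of Python.  This is only an illustration, not a Sun or Veritas tool: the
function names are made up, the subdisk lengths are hypothetical values
in KB (chosen so that the split lands mid stripe-unit, as in the
diagram), and the only figure taken from the alert is the 16k
stripe-unit.

```python
STRIPE_UNIT_KB = 16  # stripe-unit size from the example diagram


def misaligned_boundaries(subdisk_lengths_kb, stripe_unit_kb=STRIPE_UNIT_KB):
    """Return column offsets (KB) where a subdisk split is not
    stripe-unit aligned.

    subdisk_lengths_kb lists the subdisks of ONE column, in order.
    """
    bad = []
    offset = 0
    # The end of the last subdisk is the end of the column, not a split,
    # so only boundaries between consecutive subdisks are checked.
    for length in subdisk_lengths_kb[:-1]:
        offset += length
        if offset % stripe_unit_kb != 0:
            bad.append(offset)
    return bad


def affected_stripe_unit(offset_kb, stripe_unit_kb=STRIPE_UNIT_KB):
    """Return the (start, end) KB range of the stripe-unit containing offset."""
    start = (offset_kb // stripe_unit_kb) * stripe_unit_kb
    return start, start + stripe_unit_kb


# Hypothetical column 2: "subdisk 2" is 56 KB (3.5 stripe-units) and
# "subdisk 4" is 40 KB, so the split falls inside a stripe-unit.
column_2 = [56, 40]
for off in misaligned_boundaries(column_2):
    print(off, affected_stripe_unit(off))
```

A column whose subdisk lengths are all multiples of the stripe-unit
(for example [64, 32]) yields an empty list, which corresponds to the
typical, unaffected layout.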


CORRECTIVE ACTION:

Upgrade the customer to VxVM 2.3 and execute the following procedure on
those RAID-5 volumes that you suspect may be bad and for which no backup
exists.  If a current backup exists, restore from it any RAID-5 volume
suspected of having bad parity.

How to repair the parity if suspected bad:
   
   A raid parity resync can be forced by doing the following:
   
   1) Unmount the file system on the RAID-5 volume
   2) vxvol -g <diskgroup> stop <volume>
   3) vxmend -g <diskgroup> fix empty <volume>
   4) vxvol -g <diskgroup> start <volume>
   
This will take a while and will resync the parity.
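
As a rough sketch, steps 2 through 4 can be generated for a given disk
group and volume with a small Python helper.  The helper name and the
example names "datadg" and "vol01" are hypothetical; the command lines
themselves are exactly those listed above.  The sketch only prints the
commands (a dry run) and leaves step 1, unmounting the file system, to
the operator.

```python
def parity_resync_commands(diskgroup, volume):
    """Build the command lines for steps 2-4 of the procedure, in order."""
    return [
        ["vxvol", "-g", diskgroup, "stop", volume],
        ["vxmend", "-g", diskgroup, "fix", "empty", volume],
        ["vxvol", "-g", diskgroup, "start", volume],
    ]


# Dry run only: print the commands instead of executing them.
# "datadg" and "vol01" are placeholder names, not from the alert.
for cmd in parity_resync_commands("datadg", "vol01"):
    print(" ".join(cmd))
```

Printing first makes it easy to review the disk group and volume names
before anything is run against a live volume.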


COMMENTS:

In timing tests, it took 58 minutes to resync the parity on a 10 GB
volume with enough I/O load on the system to keep all other disks an
average of 75% busy.
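
For planning purposes, the single data point above can be turned into a
very rough estimate.  The sketch below assumes that resync time scales
linearly with volume size and that 1 GB = 1024 MB; neither assumption
comes from the alert, and real times will vary with hardware and I/O
load.

```python
MEASURED_GB = 10       # volume size in the timing test
MEASURED_MINUTES = 58  # resync time in the timing test


def implied_rate_mb_per_s():
    """Average resync rate implied by the test (1 GB taken as 1024 MB)."""
    return MEASURED_GB * 1024 / (MEASURED_MINUTES * 60)


def estimated_resync_minutes(volume_gb):
    """Linear extrapolation from the one measurement; an assumption only."""
    return volume_gb * MEASURED_MINUTES / MEASURED_GB


print(round(implied_rate_mb_per_s(), 2))  # 2.94
print(estimated_resync_minutes(40))       # 232.0
```

Under these assumptions a 40 GB volume would need roughly four hours,
so the resync is worth scheduling in a maintenance window.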

A patch (for VxVM 2.3 only) is in test at this time and will be released
in the near future with a utility that will allow RAID-5 volumes to be
checked to determine whether they need to have their parity synced.

******************************************************************************
******************************************************************************

Other SunService Information Resources
--------------------------------------
SunSolve Online including SunSolve Bulletin Board(SM)

World Wide Web:
   North America:  http://sunsolve.sun.com
   UK:             http://online.sunsolve.sun.co.uk
   France:         http://sunsolve.sun.fr
   Germany:        http://sunsolve.sun.de/sunsolve
   Switzerland:    http://sunsolve.sun.ch
   Japan:          http://sunsolve.sun.co.jp/sunsolve
   Australia:      http://sunsolve1.sun.com.au

Telnet access is available on each of the above servers
(except in Germany, Telnet to suninfo.sun.de).

SunSolve CD-ROM(TM):
   Updated and distributed every six weeks to SunService contract customers.

For a complete list of patches, check the patch reports in the EarlyNotifier
data collection on SunSolve.