Pertinent Info

Joe Harman jh@hsmpk12a-s1.Eng.Sun.COM
Fri, 6 Jun 1997 13:15:00 -0700


All,

The patch that is being worked on will provide the ability to check
your RAID5 volumes to determine whether they have bad parity.  This is
the only intent of the patch.

The FIN already provides the procedure to fix those RAID5 volumes that
you suspect of having bad parity.

		Joe

<jh> From mlloyd@cbis.com Fri Jun  6 12:13:59 1997
<jh> From: Mark Lloyd <mlloyd@cbis.com>
<jh> Date: Fri, 6 Jun 1997 15:06:46 -0400 (EDT)
<jh> To: ssa-managers@eng.auburn.edu
<jh> Subject: Re: Pertinent Info
<jh> Mime-Version: 1.0
<jh> Content-Transfer-Encoding: 7bit
<jh> Content-Md5: WWFCEsfWaCyGOgv8gYJl9w==
<jh> X-Info: To unsubscribe, send 'unsubscribe ssa-managers' to majordom@Eng.Auburn.EDU in message body
<jh> 
<jh> Am I the only one confused by this Alert ? There are certain parts that seem to
<jh> imply if you are running VxVM then everything should be ok. However, there's
<jh> another line stating they are working on a patch for VxVM version 2.3
<jh> 
<jh> What's reality ?
<jh> 
<jh> Mark E. Lloyd
<jh> Cincinnati Bell Information Systems (CBIS)
<jh> 
<jh> voice : (513) 784 - 7455
<jh> email : mlloyd@cbis.com
<jh> 
<jh> 
<jh> 
<jh> > From owner-ssa-managers@Eng.Auburn.EDU Fri Jun  6 14:34:02 1997
<jh> > From: Doug Hughes <Doug.Hughes@Eng.Auburn.EDU>
<jh> > Date: Fri, 6 Jun 1997 13:30:44 -0500
<jh> > To: ssa-managers@Eng.Auburn.EDU
<jh> > Subject: Pertinent Info
<jh> > X-Info: To unsubscribe, send 'unsubscribe ssa-managers' to majordom@Eng.Auburn.EDU in message body
<jh> > 
<jh> > Just got this:
<jh> > 
<jh> > (We've already fixed this here, by the way, and had gotten bitten on it
<jh> > many many times in the past, and submitted many many bug reports before
<jh> > it finally got fixed. It's nefarious.. You go to move some subdisks.
<jh> > Everything goes fine. Then, at some later time (usually within a week)
<jh> > your computer crashes on free freeing free frag, or bad inode, or FS
<jh> > corruption of some other kind. We're REALLY GLAD they finally fixed it...
<jh> > Saves having to get up at 3am to fsck 40GB of data.. thank goodness
<jh> > for remote consoles.. enough rambling..)
<jh> > 
<jh> >  
<jh> > 
<jh> > ================================================================================
<jh> >                                 SunService
<jh> > 
<jh> >                     SUNSOLVE EARLYNOTIFIER(SM) ALERT
<jh> > 
<jh> > 
<jh> > SunSolve EarlyNotifier Alert is published periodically to provide 
<jh> > SunService customers with the latest and most important technical 
<jh> > information regarding Sun hardware and software.
<jh> > 
<jh> > ******************************************************************************
<jh> > ******************************************************************************
<jh> > 
<jh> > DATE: May/22/97
<jh> > 
<jh> > SYNOPSIS: "Bad parity" with RAID-5 and SPARCstorage Array Volume Manager 
<jh> > 	  Software version 2.x.
<jh> > 
<jh> > PRODUCT CATEGORY: Storage/Software
<jh> > 
<jh> > PRODUCTS AFFECTED:
<jh> > 
<jh> > 	Any SPARCstorage Array Volume Manager Version 2.X Software releases,
<jh> > 	patched and unpatched, that support RAID-5 configurations.
<jh> > 
<jh> > PART NUMBERS AFFECTED:
<jh> > 
<jh> > 	N/A
<jh> > 
<jh> > REFERENCES:
<jh> > 
<jh> > 	BUGID#s 1223482, 1242923, 4010911, 4043658
<jh> > 	ESC #s 509297, 508222, 506310, 504099, 504031
<jh> > 
<jh> > PROBLEM DESCRIPTION:
<jh> > 
<jh> > In all Veritas releases that supported RAID-5 prior to VxVM 2.3, some
<jh> > maintenance functions of RAID-5 volumes under Veritas control can create
<jh> > bad parity.  This could occur during maintenance of the RAID-5 volume,
<jh> > for example when growing the volume or file system, moving subdisks,
<jh> > or running vxevac.
<jh> > 
<jh> > Data corruption can be present in non-typical RAID-5 configurations
<jh> > that have both of the following characteristics:
<jh> > 
<jh> >         - a RAID-5 column is split into multiple subdisks, and the
<jh> > 	  subdisks do NOT end and begin on a stripe-unit aligned
<jh> > 	  boundary
<jh> > 
<jh> >         - a RAID-5 reconstruction operation was performed.
<jh> > 
<jh> > The corruption appears in the region of the RAID-5 column where the
<jh> > split subdisks meet.
<jh> > 
<jh> > The following is an example of a RAID-5 column which is comprised of
<jh> > split subdisks within a RAID-5 stripe-unit, which is not on a
<jh> > stripe-unit boundary:  (Note the split in column 2 composed of subdisk
<jh> > 2 and subdisk 4)
<jh> > 
<jh> > e.g.:
<jh> > 		3 column RAID-5 (using defaults):
<jh> > 
<jh> >     Column 1                Column 2                Column 3
<jh> >     =========               =========               =========  
<jh> >         
<jh> >     subdisk 1               subdisk 2               subdisk 3
<jh> > +---------------+       +===============+       +---------------+ 
<jh> > |stripe-unit    |       | (subdisk 2)   |       |               | stripe-width
<jh> > |is 16k         |       |               |       |               | is 48k
<jh> > +---------------+       +===============+       +---------------+
<jh> > |               |       | (subdisk 2)   |       |               |
<jh> > |               |       |               |       |               |
<jh> > +---------------+       +===============+       +---------------+<- stripe-unit
<jh> >                                                                boundaries
<jh> >         ...                     ...                     ...          /  /
<jh> > +---------------+       +===============+       +---------------+<--/  /
<jh> > |               |       | (subdisk 2)   |       |               |     /
<jh> > |               |       |               |       |               |    /
<jh> > +---------------+       +=======+++++++++       +---------------+<--/
<jh> > |               |   +-->| (sd 2)+ (sd 4)+       |               |
<jh> > |               |   |   |       +       +       |               |
<jh> > +---------------+   |   +=======+++++++++       +---------------+
<jh> > |               |   |   + (subdisk 4)   +       |               |   
<jh> > |               |   |   +               +       |               |
<jh> > +---------------+   |   +++++++++++++++++       +---------------+
<jh> > |               |   |   + (subdisk 4)   +       |               |
<jh> > |               |   |   +               +       |               |
<jh> > +---------------+   |   +++++++++++++++++       +---------------+
<jh> >         ...         |           ...                     ...
<jh> >         ...         |           ...                     ...
<jh> >         ...         |           ...                     ...
<jh> >                     |
<jh> >                     |
<jh> >                     |
<jh> >                     Note that this stripe-unit for this RAID-5 volume
<jh> >                         is "split" by 2 subdisks (i.e. the end region of
<jh> > 			subdisk 2 and the beginning region of subdisk
<jh> > 			4).  Note also that subdisk 2 does not end on a
<jh> > 			stripe-unit boundary, and that subdisk 4 does
<jh> > 			not start on a stripe-unit boundary, but rather
<jh> > 			somewhere within the stripe-unit.
<jh> > 
<jh> > 
<jh> > If a RAID-5 column geometry has multiple subdisks where the subdisk
<jh> > boundaries are not stripe-unit aligned, you may see data corruption
<jh> > after a RAID-5 reconstruction operation in the RAID-5 volume at the
<jh> > point of the split subdisks.  If this RAID-5 volume contains a
<jh> > filesystem, this could manifest itself as a failed "FSCK" pass.
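
The alignment condition described above can be sketched in a few lines
of Python.  This is only an illustration, not the checking utility
mentioned in the alert: the function name, the use of kilobytes, and the
example subdisk lengths are all assumptions; only the 16k default
stripe-unit comes from the diagram.

```python
STRIPE_UNIT_KB = 16  # default stripe-unit size quoted in the alert's diagram


def misaligned_boundaries(subdisk_lengths_kb, stripe_unit_kb=STRIPE_UNIT_KB):
    """Return the offsets (in KB from the start of the column) of internal
    subdisk boundaries that do not fall on a stripe-unit boundary.

    subdisk_lengths_kb is a hypothetical list of the lengths of the
    subdisks that make up one RAID-5 column, in column order.
    """
    offsets = []
    position = 0
    # The end of the last subdisk is the end of the column, not a split,
    # so only the boundaries between consecutive subdisks are examined.
    for length in subdisk_lengths_kb[:-1]:
        position += length
        if position % stripe_unit_kb != 0:
            offsets.append(position)
    return offsets


if __name__ == "__main__":
    # Two subdisks meeting on a stripe-unit boundary: nothing reported.
    print(misaligned_boundaries([64, 64]))
    # Two subdisks meeting 8k into a stripe-unit, as in the diagram.
    print(misaligned_boundaries([72, 56]))
```

An empty result means every split in the column lands on a stripe-unit
boundary; any offset returned marks a stripe-unit shared by two
subdisks, which is the geometry the alert warns about.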
<jh> > 
<jh> > 
<jh> > CORRECTIVE ACTION:
<jh> > 
<jh> > Upgrade the customer to VxVM 2.3 and execute the following procedure on
<jh> > those RAID-5 volumes that you suspect may be bad and for which no backup
<jh> > exists.  If a current backup exists, then restore any RAID-5 volume
<jh> > suspected of having bad parity.
<jh> > 
<jh> > How to repair the parity if suspected bad:
<jh> >    
<jh> >    A raid parity resync can be forced by doing the following:
<jh> >    
<jh> >    1) Unmount the file system on the RAID-5 volume
<jh> >    2) vxvol -g <diskgroup> stop <volume>
<jh> >    3) vxmend -g <diskgroup> fix empty <volume>
<jh> >    4) vxvol -g <diskgroup> start <volume>
<jh> >    
<jh> > This will take a while and will resync the parity.
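
For anyone who wants to script the four steps, here is a minimal Python
sketch that only builds and prints the command lines in order; it does
not run anything.  The function name and the disk group, volume, and
mount point in the example ("datadg", "vol01", "/export/data") are
made-up placeholders; the vxvol and vxmend invocations are the ones
listed in the alert.

```python
import shlex


def resync_commands(diskgroup, volume, mount_point=None):
    """Return the argv lists for a forced RAID-5 parity resync, in the
    order given in the alert.  mount_point is optional because the volume
    may not carry a mounted file system."""
    commands = []
    if mount_point is not None:
        # Step 1: unmount the file system that lives on the RAID-5 volume.
        commands.append(["umount", mount_point])
    # Step 2: stop the volume.
    commands.append(["vxvol", "-g", diskgroup, "stop", volume])
    # Step 3: mark the volume empty so that the next start resyncs parity.
    commands.append(["vxmend", "-g", diskgroup, "fix", "empty", volume])
    # Step 4: start the volume; this is where the resync time is spent.
    commands.append(["vxvol", "-g", diskgroup, "start", volume])
    return commands


if __name__ == "__main__":
    # Placeholder names; substitute the real disk group, volume and mount.
    for argv in resync_commands("datadg", "vol01", "/export/data"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first makes it easy to review them before running
anything by hand against a production volume.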
<jh> > 
<jh> > 
<jh> > COMMENTS:
<jh> > 
<jh> > Timing tests took 58 minutes to resync the parity on a 10 GB volume with
<jh> > enough I/O load on the system to cause all other disks to be an average
<jh> > of 75% busy.
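
As a rough planning aid, the timing figure above can be turned into a
back-of-the-envelope estimate.  The sketch below assumes resync time
scales linearly with volume size and that 1 GB = 1024 MB; neither
assumption is stated in the alert, and real times will depend on
hardware and I/O load.  The function names are illustrative only.

```python
REFERENCE_GB = 10        # volume size in the alert's timing test
REFERENCE_MINUTES = 58   # resync time reported for that volume


def reference_rate_mb_per_s():
    """Effective resync rate implied by the alert's figures, in MB/s
    (taking 1 GB as 1024 MB)."""
    return REFERENCE_GB * 1024 / (REFERENCE_MINUTES * 60)


def estimated_resync_minutes(volume_gb):
    """Scale the reported time linearly to another volume size.
    Linear scaling is an assumption, not something the alert claims."""
    return volume_gb * REFERENCE_MINUTES / REFERENCE_GB


if __name__ == "__main__":
    print("implied rate: %.2f MB/s" % reference_rate_mb_per_s())
    print("40 GB estimate: %.0f minutes" % estimated_resync_minutes(40))
```

Under those assumptions the reported test works out to a little under
3 MB/s, so a volume several times larger should be planned as a
multi-hour outage.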
<jh> > 
<jh> > A patch (for Vm 2.3, only) is in test at this time and will be released
<jh> > in the near future with a utility that will allow the checking of the
<jh> > RAID-5 volumes to determine if they need to have their parity synced.
<jh> > 
<jh> > ******************************************************************************
<jh> > ******************************************************************************
<jh> > 
<jh> > Other SunService Information Resources
<jh> > --------------------------------------
<jh> > SunSolve Online including SunSolve Bulletin Board(SM)
<jh> > 
<jh> > World Wide Web:
<jh> >    North America:  http://sunsolve.sun.com
<jh> >    UK:             http://online.sunsolve.sun.co.uk
<jh> >    France:         http://sunsolve.sun.fr
<jh> >    Germany:        http://sunsolve.sun.de/sunsolve
<jh> >    Switzerland:    http://sunsolve.sun.ch
<jh> >    Japan:          http://sunsolve.sun.co.jp/sunsolve
<jh> >    Australia:      http://sunsolve1.sun.com.au
<jh> > 
<jh> > Telnet access is available on each of the above servers
<jh> > (except in Germany, Telnet to suninfo.sun.de).
<jh> > 
<jh> > SunSolve CD-ROM(TM):
<jh> >    Updated and distributed every six weeks to SunService contract customers.
<jh> > 
<jh> > For a complete list of patches, check the patch reports in the EarlyNotifier
<jh> > data collection on SunSolve.
<jh> > 
<jh> > 
<jh>