Plaiding, again.

Stuart Remphrey - Sun Computer Systems SE - QLD Australia Stuart.Remphrey@ausmail.Aus.Sun.COM
Mon, 17 Jan 2000 11:24:21 +1000 (EST)


Amos,

Either is possible--Doug's approach of two filesystems is probably
a good idea. From the way I read Doug's response I think he's assuming
you'd do all the volume management from VxVM, not RM6+VxVM, so you lose
the benefit of the controller-based RAID-5. He may not have picked up that
the A3500 does RAID-5 locally on the controller, and very efficiently.

Hence Solaris and VM see each RAID-5 stripe as a single disk, which can
then be striped. If this is done, make the VM stripe unit the same as
the RAID-5 stripe width, i.e. a single VM stripe should incorporate one
stripe from each RAID-5 LUN.

If I understand correctly, there's a bunch of 3+1 RAID-5s with an 8 KB
segment size, hence a stripe width of 3x8=24 KB. If there are 5 such
LUNs, you could use a 24 KB stripe unit for the VxVM stripe, giving a
full VxVM stripe of 5x24=120 KB, so each stripe in the volume is spread
across all RAID-5 LUNs.
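
To make that arithmetic concrete, here is a rough Python sketch. The
function names are mine for illustration only, they are not anything from
RM6 or VxVM, and it assumes every LUN has the same geometry:

```python
def raid5_stripe_width_kb(disks_per_lun, segment_kb):
    """Data held in one full RAID-5 stripe.

    One segment per stripe is parity, so only (disks - 1) segments
    carry data.
    """
    return (disks_per_lun - 1) * segment_kb


def plaid_layout(luns, disks_per_lun, segment_kb):
    """Sizes for striping across identical RAID-5 LUNs (hypothetical helper)."""
    per_lun = raid5_stripe_width_kb(disks_per_lun, segment_kb)
    return {
        "per_lun_kb": per_lun,              # one full RAID-5 stripe on one LUN
        "across_all_luns_kb": luns * per_lun,  # one full stripe over every LUN
    }


# Amos's proposed setup: 5 LUNs, each 3+1 RAID-5, 8 KB segments.
print(plaid_layout(luns=5, disks_per_lun=4, segment_kb=8))
# {'per_lun_kb': 24, 'across_all_luns_kb': 120}
```

If the segment size is changed on the A3500 (say to 64 KB), the same
calculation gives the matching sizes for the VxVM side.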

Four hot spares for only 24 disks is more than usual; one would be more
common, two if you're just a little paranoid. Note that just because
one *is* paranoid, it doesn't mean they're *not* out to get you :-)

Of course, if you only had 1 or 2 hot spares you'd then need to decide
what to do with the other 2-3 disks--do you have something useful to
do with a 2-3 disk stripe, or a 2-disk mirror? Or perhaps go with no
hot spares and 6 x RAID-5 stripes, but you'd want to replace a failed
disk reasonably quickly, which is a pain if it fails overnight.
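
The trade-off can be tabulated with a few lines of Python. This is only
a sketch: `disk_budget` is a made-up helper, and it assumes 24 disks in
total and 4-disk (3+1) RAID-5 LUNs as described above:

```python
def disk_budget(total_disks, hot_spares, disks_per_lun=4):
    """Return (number of RAID-5 LUNs, disks left over) for a spare count."""
    available = total_disks - hot_spares
    luns, leftover = divmod(available, disks_per_lun)
    return luns, leftover


for spares in (4, 2, 1, 0):
    luns, leftover = disk_budget(24, spares)
    print(f"{spares} hot spares: {luns} RAID-5 LUNs, {leftover} disks left over")
```

With 1 or 2 spares you still get 5 LUNs, plus 3 or 2 loose disks; only
dropping to no spares frees enough disks for a sixth LUN.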

HTH,

Stuart.


` On 15 Jan 2000, Amos Gouaux wrote:
` 
` > Yeah, I know this has been gone over before, but I'm still a little
` > bit stuck on this plaiding business, at least in this specific case
` > (Cyrus message store).
` > 
` > We've got an E250 that's hooked up to an A3500 controller, which in
` > turn has 2 D1000 trays associated with it.  These D1000 trays are
` > divided so that each half is on a different bus on the A3500
` > controller.  The trays are fully populated with the 9GB, 10K RPM
` > drives.  This results in 4 SCSI chains of 6 disk drives each.  We've
` > also purchased VxVM and VxFS for this box, which is currently
` > running Solaris 2.6.  (I just happened to get the VxVM and VxFS
` > releases that will run on Solaris 7, but I don't think I'll be able
` > to upgrade the box just yet.)
` > 
` > These trays are going to be used as the message store for a Cyrus
` > IMAP server that will be running on this E250.  This message store
` > is quite similar to MH or INN in that it stores the messages one per
` > file.  There are also some cache files per folder (directory) that
` > store accesses and message headers, among other things.  After
` > running some crude scripts, we've observed that the predominant file
` > size is roughly 8KB.
` > 
` > So, what we've considered doing was to create 5 RAID5 LUNs composed
` > of 4 drives.  That way each drive of a LUN is on a separate bus.
` > The segment size (this time anyway) is 8KB (16 blocks).  We settled
` > on 8KB because of the message sizes.  We were then thinking of using
` > the remaining 4 drives as hot spares.
` > 
` > How does that sound so far?  Suggestions?
` 
` I would actually make the stripe unit larger than 8K; 64K would probably
` be good. For best results, you should experiment. (We use Cyrus on a
` SPARCstorage Array with a large stripe width for folders, and for
` the INBOX folder we use a stripe width of 512K.)
` 
` > 
` > Now comes the part that we've had the most difficulty in resolving:
` > filesystem layouts.  I suppose we could have 5 separate filesystems
` > on each of these LUNs.  Unfortunately, this would tend to mean that
` > we have to shuffle accounts between these filesystems to keep things
` > more or less balanced.  Instead, what I'd like to see is one
` > filesystem composed of these LUNs, if that's at all reasonable.
` 
` It is, but you'd have to sort of scratch your proposed volume layout.
` The only way to get one filesystem is to have one volume. There are
` two ways to do this:
` 1) You can create a single RAID-5 volume on 4 disks and then grow
` it over the other disks (thus having a data column and parity column
` split over all the disks and having 1/3 overhead).
` 2) You can create a single RAID-5 volume across all of the disks, thus
` having a single disk of overhead. The disadvantage: if you lose a single
` controller, SCSI bus, or cable, you lose everything. Anecdotal reports
` in the past have also indicated this may be inefficient.
` 
` > 
` > If the one filesystem approach is taken, the next question is how to
` > organize these LUNs.  The simplest approach would be to have one big
` > concat.  The other approach would be this "plaiding", where the LUNs
` > are striped together.  If a stripe were to be used, exactly how
` > should it be configured?
` The concat would be analogous to #1 above. You can't really do the
` striping in the way you think. You could with SDS, but then your
` configuration would be forever fixed, and you'd lose some of the
` autoconfiguration advantage you get when running VxFS on top of VxVM.
` > 
` > After reading some of these ssa-manager posts over and over, it would
` > seem that the "chunk" (one of our problems, I'm sure, is the different
` > terms being used for segment size, interlace value, etc.) should be
` > some multiple of the 8KB "chunk" size on the RAID5 LUNs.  Would that
` > be 3 * 8KB = 24KB, because we're using RAID5 LUNs composed of 4
` > member disks?  If so, isn't this 24KB kinda big for I/Os that would
` > be around 8KB?  Is the goal to try to cache the writes so that
` > several messages are written at once?  But VxVM, since it doesn't use
` > NVRAM, isn't going to be able to do that, right?
` > 
` I don't think 24KB is big at all, because these writes will be aggregated
` and bundled anyway. Remember, you'll have lots of things being
` delivered at any given time.
` 
` > We're thinking that because we're talking about so many small files,
` > that a VxVM stripe might not be all that advantageous, and it might
` > simply be more reasonable in this case to do a concat.  But then
` > again, this is getting into areas that we haven't dealt with much
` > before, at least not at this level of complexity.
` 
` Personally, what we do is have 2 separate volumes. We have one set up
` as a stripe+mirror for the inbox messages. All of the incoming email
` goes into a high-performance, high-reliability volume. All folders
` are set up on a separate Cyrus partition (in the Cyrus sense) in
` RAID-5. This seems to be working well for us.
` In your case, it would mean taking 8 disks, striping and mirroring, and
` then probably either doing a RAID-5 across another 8 (giving 1 parity)
` or doing a RAID-5 across 4 and growing across another 4 (giving 2 parity
` disks, but probably better fault tolerance)
` 
` ____________________________________________________________________________
` Doug Hughes					Engineering Network Services
` System/Net Admin  				Auburn University
` 			doug@eng.auburn.edu