Plaiding, again.
Doug Hughes
doug@Eng.Auburn.Edu
Sun, 16 Jan 2000 10:27:53 -0600 (CST)
On 15 Jan 2000, Amos Gouaux wrote:
> Yeah, I know this has been gone over before, but I'm still a little
> bit stuck on this plaiding business, at least in this specific case
> (Cyrus message store).
>
> We've got an E250 that's hooked up to an A3500 controller, which in
> turn has 2 D1000 trays associated with it. These D1000 trays are
> divided so that each half is on a different bus on the A3500
> controller. The trays are fully populated with the 9GB, 10K RPM
> drives. This results in 4 SCSI chains of 6 disk drives each. We've
> also purchased VxVM and VxFS for this box, which is currently
> running Solaris 2.6. (I just happened to get the VxVM and VxFS
> releases that will run on Solaris 7, but I don't think I'll be able
> to upgrade the box just yet.)
>
> These trays are going to be used as the message store for a Cyrus
> IMAP server that will be running on this E250. This message store
> is quite similar to MH or INN in that it stores the messages one per
> file. There are also some cache files per folder (directory) that
> store accesses and message headers, among other things. After
> running some crude scripts, we've observed that the predominant file
> size is roughly 8KB.
>
> So, what we've considered doing was to create 5 RAID5 LUNs composed
> of 4 drives. That way each drive of a LUN is on a separate bus.
> The segment size (this time anyway) is 8KB (16 blocks). We settled
> on 8KB because of the message sizes. We were then thinking of using
> the remaining 4 drives as hot spares.
>
> How does that sound so far? Suggestions?
I would actually make the stripe unit larger than 8K; 64K would probably
be good. For best results, you should experiment. (We use Cyrus on a
SPARCstorage Array with a large stripe width for folders, and for
the INBOX folder we use a stripe width of 512K.)
>
> Now comes the part that we've had the most difficulty in resolving:
> filesystem layouts. I suppose we could have 5 separate filesystems
> on each of these LUNs. Unfortunately, this would tend to mean that
> we have to shuffle accounts between these filesystems to keep things
> more or less balanced. Instead, what I'd like to see is one
> filesystem composed of these LUNs, if that's at all reasonable.
It is, but you'd have to sort of scratch your proposed volume layout.
The only way to get one filesystem is to have one volume. There are
two ways to do this:
1) You can create a single RAID-5 volume on 4 disks and then grow
it over the other disks (thus having a data column and parity column
split over all the disks, and having 1/3 overhead).
2) You can create a single RAID-5 volume across all of the disks, thus
having a single disk of overhead. Disadvantage: if you lose a single
controller, SCSI bus, or cable, you lose everything. Anecdotal reports
in the past have also indicated this may be inefficient.
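To put rough numbers on the two options, here is a quick Python sketch.
It assumes the 20 non-spare 9GB drives from the proposal above, and it
reads "1/3 overhead" as parity relative to data; the function name is
made up for illustration and is not a VxVM command.

```python
from fractions import Fraction

DISK_GB = 9   # drive size from the original post
TOTAL = 20    # assumption: 24 drives minus the 4 proposed hot spares


def raid5_parity_disks(total_disks, group_size):
    """Parity cost, in whole disks, when total_disks are organised as
    RAID-5 groups of group_size (one disk's worth of parity per group)."""
    if group_size < 3 or total_disks % group_size != 0:
        raise ValueError("total_disks must be a multiple of group_size (>= 3)")
    return total_disks // group_size


# Option 1: 4-disk RAID-5 grown over further 4-disk groups.
# Option 2: a single RAID-5 across every disk.
for label, group in (("option 1", 4), ("option 2", TOTAL)):
    parity = raid5_parity_disks(TOTAL, group)
    data = TOTAL - parity
    print(label, data * DISK_GB, "GB usable; parity/data =",
          Fraction(parity, data))
```

Under those assumptions option 1 leaves 15 data disks (135 GB) and
option 2 leaves 19 (171 GB). The extra space in option 2 comes at the
cost of the single point of failure described above.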
>
> If the one filesystem approach is taken, the next question is how to
> organize these LUNs. The simplest approach would be to have one big
> concat. The other approach would be this "plaiding", where the LUNs
> are striped together. If a stripe were to be used, exactly how
> should it be configured?
The concat would be analogous to #1 above. You can't really do the
striping in the way you think. You could with SDS (Solstice DiskSuite),
but then your configuration would be forever fixed, and you'd lose some
of the autoconfiguration advantage you get when running VxFS on top of
VxVM.
>
> After reading some of these ssa-manager posts over and over, it would
> seem that the "chunk" (one of our problems, I'm sure, is the different
> terms being used for segment size, interlace value, etc.) should be
> some multiple of the 8KB "chunk" size on the RAID5 LUNs. Would that
> be 3 * 8KB = 24KB, because we're using RAID5 LUNs composed of 4
> member disks? If so, isn't this 24KB kinda big for I/Os that would
> be around 8KB? Is the goal to try to cache the writes so that
> several messages are written at once? But VxVM, since it doesn't use
> NVRAM, isn't going to be able to do that, right?
>
I don't think 24K is big at all, because these writes will be aggregated
and bundled anyway. Remember, you'll have lots of things being
delivered at any given time.
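The arithmetic behind the 24K figure, as a small Python sketch. It
assumes a full stripe on an n-disk RAID-5 LUN holds (n - 1) segments of
data plus one of parity; the function names are illustrative only.

```python
def raid5_full_stripe_kb(member_disks, segment_kb):
    """Data carried by one full stripe: every member but one holds data,
    the remaining segment holds parity."""
    if member_disks < 3:
        raise ValueError("RAID-5 needs at least 3 member disks")
    return (member_disks - 1) * segment_kb


def messages_per_full_stripe(member_disks, segment_kb, message_kb):
    """How many messages of message_kb fit in one full stripe of data."""
    return raid5_full_stripe_kb(member_disks, segment_kb) // message_kb


# 4-disk LUN with the proposed 8KB segment, then with a 64KB segment.
for segment_kb in (8, 64):
    print(segment_kb, "KB segment:",
          raid5_full_stripe_kb(4, segment_kb), "KB per full stripe,",
          messages_per_full_stripe(4, segment_kb, 8), "messages of 8KB")
```

With 8KB segments a full stripe is only three typical messages, which
is why 24K is not much once deliveries are being bundled together.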
> We're thinking that because we're talking about so many small files,
> that a VxVM stripe might not be all that advantageous, and it might
> simply be more reasonable in this case to do a concat. But then
> again, this is getting into areas that we haven't dealt with much
> before, at least not at this level of complexity.
Personally, what we do is have 2 separate volumes. We have one set up
as a stripe+mirror for the inbox messages, so all of the incoming email
goes into a high-performance, high-reliability volume. All folders
are set up on a separate Cyrus partition (in the Cyrus sense) in
RAID-5. This seems to be working well for us.
In your case, it would mean taking 8 disks, striping and mirroring, and
then probably either doing a RAID-5 across another 8 (giving 1 parity
disk) or doing a RAID-5 across 4 and growing across another 4 (giving 2
parity disks, but probably better fault tolerance).
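For a sense of what that split yields in space, a rough Python sketch.
It assumes 9GB drives, 8 disks for the mirrored stripe and 8 for the
RAID-5 folders volume; the helper names are hypothetical.

```python
DISK_GB = 9  # drive size from the original post


def mirrored_stripe_data_disks(disks):
    """Stripe+mirror: half the disks hold data, half hold the copy."""
    if disks < 2 or disks % 2 != 0:
        raise ValueError("need an even number of disks")
    return disks // 2


def raid5_data_disks(disks, group_size):
    """RAID-5 in groups of group_size: one disk's worth of parity each."""
    if group_size < 3 or disks % group_size != 0:
        raise ValueError("disks must be a multiple of group_size (>= 3)")
    return disks - disks // group_size


inbox_gb = mirrored_stripe_data_disks(8) * DISK_GB   # stripe+mirror
folders_wide_gb = raid5_data_disks(8, 8) * DISK_GB   # one 8-disk RAID-5
folders_grown_gb = raid5_data_disks(8, 4) * DISK_GB  # 4 disks grown by 4

print("inbox:", inbox_gb, "GB")
print("folders, single RAID-5:", folders_wide_gb, "GB")
print("folders, 4 + 4:", folders_grown_gb, "GB")
```

The 4 + 4 layout gives up one more disk of space than the single
8-disk RAID-5 in exchange for the second parity disk.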
____________________________________________________________________________
Doug Hughes Engineering Network Services
System/Net Admin Auburn University
doug@eng.auburn.edu