[blfs-dev] New page

Qrux qrux.qed at gmail.com
Wed Feb 8 18:56:03 PST 2012

On Feb 8, 2012, at 6:00 PM, Ken Moffat wrote:

> On Wed, Feb 08, 2012 at 04:46:23PM -0800, Qrux wrote:
> [ confining my remarks to SW RAID, I have no experience of the other
> sorts ]
>> SW RAID is great.  Generally faster (at least for RAID-0, back when I used to benchmark this sort of thing).  But, to be fair, while SW has the benefit of being open-sourced, it does suffer from version skew, too.
> I've used SW RAID-1 for several years : my impression is that the
> change happens in mdadm, rather than the kernel, and that (so far)
> backwards-compatibility has been a major consideration.

I think there's some level of kernel support:

	xlapp-linux-3.1-kernel.config:# CONFIG_MD_LINEAR is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_RAID0 is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_RAID1 is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_RAID10 is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_RAID456 is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_MULTIPATH is not set
	xlapp-linux-3.1-kernel.config:# CONFIG_MD_FAULTY is not set

But, even if the disk format is only controlled by mdadm, you still count on people doing "TheRightThing(TM)" and making sure backwards compatibility is there.  That's the same as with hardware vendors.  They have investments, too, and some companies are quite friendly to the Linux community (e.g., 3ware, at least before the AMCC/LSI acquisition).  I'm just saying that's a very similar argument for both HW and SW, and doesn't necessarily favor one over the other.
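
Going back to the kernel config listing for a second: checking which MD personalities a given kernel was built with is easy to script. Here is a minimal Python sketch. The function name md_options and the sample text are mine, purely for illustration, and it assumes the usual .config conventions ("CONFIG_X=y" or "=m" when enabled, "# CONFIG_X is not set" when disabled). It expects plain .config lines, without the "filename:" prefix that grep adds in the listing above.

```python
import re

def md_options(config_text):
    """Map each CONFIG_MD_* option to 'y', 'm', or 'n' (not set)."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like CONFIG_MD_RAID1=y (built in) or =m (module).
        m = re.match(r"^(CONFIG_MD_\w+)=(\w+)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        # Disabled options are recorded as a comment line.
        m = re.match(r"^# (CONFIG_MD_\w+) is not set$", line)
        if m:
            options[m.group(1)] = "n"
    return options

# Hypothetical sample, not taken from any real machine.
sample = """\
CONFIG_MD_RAID1=y
CONFIG_MD_RAID456=m
# CONFIG_MD_RAID0 is not set
"""

for name, value in sorted(md_options(sample).items()):
    print(name, value)
```

If every personality comes back as "n", as in my listing, mdadm alone will not get you a running array on that kernel.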

>> It's also worth saying that in software RAID, you have to shut down the machine to do any repairs, even if the array is running in a degraded state.  Unless you have PCI- or SATA-hotplug in your kernel (is this widely supported or stable?)...and even then, you'd have to be able to put those drives in a hot-plug bay.
>> Might also want to mention hot spares.
>> And...(again, still trying to be constructive, not a jerk)...a page about RAID absolutely has to have a recovery HOWTO.  It's just dangerous not to include it, lest someone get a machine running, and has no idea how to recover from it.  And, in addition to the "normal" recovery scenarios, point out how it might be worth using with udev (disk/by-id) long names lest they reorder devices (or the kernel does it on a version change).  I personally just went through this the hard way on a colo server...
> A recovery HOWTO might be useful (for RAID-1, the hardest part is
> actually making sure you have identified the bad drive - using
> different brands of drive [ if there is a choice ] can help!).  For

Different drives for RAID-1?  I'm not sure that should go into the book.  It's probably enough to say: "Make sure you have the right drive in a recovery scenario."

I'm of the school that the drives should be as similar as possible.  Sure, that's probably only critical on a HW RAID system, but even SW RAID should adhere to some of the same principles.  Having drives with potentially large differences in operating characteristics (rotation rate, seek times, latencies, caches, different levels of UDMA support, different NCQ behavior) is going to create a lot of unpleasantness.  I'm not sure I'd advise that.  Probably better to get drives, record their serial numbers, and label the drive bays.  Or, get a Sharpie, and write on the disk itself.  When you have to replace it, update the label (and your notebook).
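
For the record-keeping part, something like the following Python sketch could produce the list to copy into the notebook. It assumes udev populates /dev/disk/by-id with symlinks whose names typically embed the drive model and serial number; the function name map_by_id is hypothetical, and the directory is a parameter so the sketch can be tried against a scratch directory.

```python
import os

def map_by_id(by_id_dir="/dev/disk/by-id"):
    """Return {persistent name: resolved device path} for each symlink."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        # Only symlinks are of interest; udev creates one per persistent name.
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping

if __name__ == "__main__":
    # Skip quietly on systems that do not have the udev directory.
    if os.path.isdir("/dev/disk/by-id"):
        for name, device in map_by_id().items():
            print(f"{name} -> {device}")
```

The point is that the left-hand names stay attached to the physical drive, while the right-hand kernel names (sda, sdb, ...) are the ones that can get reordered.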

Consider a long sequential read in a R1 situation.  IIRC, SW RAID tries to optimize that read by reading different parts of the file from different drives.  If one drive finishes before the other, it gets throttled while the other drive catches up.  Maybe the first drive keeps reading a bit, while waiting, filling its cache.  That might be good, up to cache-full.  But, then, if the next page of the file is at a different place, it has to dump its cache and seek somewhere else.  And, because these drives are physically different, that might be happening back and forth.  Sounds like it's introducing a lot of cache-thrashing and maybe making overall seek times worse.  Not to mention that's probably a use pattern that creates a lot of extra work on the drives.

> RAID-5, I've avoided using it - if it was something I dealt with
> regularly, I'm sure it would be fine, but for something (recovery) I
> only ever do infrequently, I've seen too many reports on lkml where
> recovery has been non-obvious to a layman.  OTOH, wrong information
> in a HOWTO is probably worse than none.

Yes, wrong info is bad.  But, if recovery information can't be reliably summarized, maybe avoid RAID in the book.  If you can only advise people on how to set it up, and leave them in a situation where their data is at the mercy of this system, I think not having a recovery page is equally bad.  Anyone who cares to run RAID (in a non-R0 setup) is probably 5% interested in how easy it is to set up, and 95% concerned with how to get their data back.  If you don't include recovery, don't include a usage page at all--just let them install mdadm, and take the approach the rest of the book does: "If you want to use it, investigate further; these are just the build instructions."
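
As one small piece of what a recovery page could cover, here is a hedged Python sketch of spotting a degraded array from /proc/mdstat. It assumes the common status-line shape "[2/1] [U_]", where "_" marks a missing member; the function name degraded_arrays and the sample text are made up for illustration. It does not repair anything, and it is not a substitute for a real recovery HOWTO.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid1 sdb1[1] sda1[0]".
        m = re.match(r"^(md\w+)\s*:", line)
        if m:
            current = m.group(1)
            continue
        # Status lines end like "[2/2] [UU]" (expected/active, per-member flags).
        m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if m and current:
            if "_" in m.group(3) or m.group(1) != m.group(2):
                degraded.append(current)
            current = None
    return degraded

# Hypothetical sample in the /proc/mdstat style.
sample = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      1048576 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(sample))
```

In the sample, md1 is the array running on one member. Working out which physical drive is the missing one is still the hard part.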

> What surprised me is that /etc/mdadm.conf isn't mentioned.  I
> thought I had to create this (either manually, or by running some
> command - I forget which), and without it the kernel cannot assemble
> the array(s) ?

Yes & no.  In general, I totally agree with you; one ought to have an mdadm.conf file, and it needs to be discussed on the page.  But, not because it's strictly necessary for array detection:

"Back in "The Good Old Days" (TM), the raidtools would read your /etc/raidtab file, and then initialize the array. However, this would require that the filesystem on which /etc/raidtab resided was mounted. This was unfortunate if you want to boot on a RAID.

"Also, the old approach led to complications when mounting filesystems on RAID devices. They could not be put in the /etc/fstab file as usual, but would have to be mounted from the init-scripts.

"The persistent superblocks solve these problems. When an array is created with the persistent-superblock option (the default now), a special superblock is written to a location (different for different superblock versions) on all disks participating in the array. This allows the kernel to read the configuration of RAID devices directly from the disks involved, instead of reading from some configuration file that may not be available at all times.

"It's not a bad idea to maintain a consistent /etc/mdadm.conf file, since you may need this file for later recovery of the array.

"The persistent superblock is mandatory if you want auto-detection of your RAID devices upon system boot. This is described in the Autodetect section."

	-- https://raid.wiki.kernel.org/articles/r/a/i/RAID_setup_cbb2.html

The "new" MD system is more like current LVM; it writes metadata to the disk.  And, it's a decent example of a change that may be "strictly compatible", but where supporting documentation would be even better.  For example, consider someone using an array built without persistent superblocks, and then trying to recover it much later, without having known to create--or remembered to save--the mdadm.conf file.
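
To make the "save your mdadm.conf" point concrete, here is a minimal Python sketch that pulls the ARRAY lines out of a config file's text, so they can be kept somewhere off the array itself. It assumes the simple one-line form "ARRAY <device> key=value ..." (the kind of line "mdadm --detail --scan" prints) and ignores continuation lines and abbreviated keywords, which the real file format allows. The function name parse_array_lines and the sample UUID are illustrative only.

```python
def parse_array_lines(conf_text):
    """Return one dict per ARRAY line: the device plus its key=value words."""
    arrays = []
    for line in conf_text.splitlines():
        # Drop trailing comments, then split the line into words.
        words = line.split("#", 1)[0].split()
        if not words or words[0].upper() != "ARRAY":
            continue
        entry = {"device": None}
        for word in words[1:]:
            if "=" in word:
                key, value = word.split("=", 1)
                entry[key.lower()] = value
            elif entry["device"] is None:
                # The first bare word is taken to be the array device.
                entry["device"] = word
        arrays.append(entry)
    return arrays

# Hypothetical config text; the UUID is made up.
sample_conf = """\
# mdadm.conf sample
DEVICE partitions
ARRAY /dev/md0 metadata=1.2 UUID=3aaa0122:29827cfa:5331ad66:ca767371
"""

for array in parse_array_lines(sample_conf):
    print(array["device"], array.get("uuid"))
```

Keeping the device-to-UUID pairs on paper or on another machine is the sort of thing that is cheap now and very valuable later.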

