bcache on md or md on bcache


Solution 1

I think caching the whole md device makes the most sense.

Putting bcache in front of the whole md device partly sacrifices the point of having RAID, because it introduces another single point of failure.

  • On the other hand, failures of SSDs are relatively rare, and bcache can be put into writethrough or writearound mode (in contrast to writeback mode), in which no data is stored only on the cache device. A failure of the cache then doesn't destroy the information in the RAID, which makes this a relatively safe option.

  • Another factor is the significant computational overhead of software RAID-5: when caching each spinning RAID member separately, the computer still has to recalculate all the parities, even on cache hits.

  • Obviously, you'd sacrifice some expensive SSD space if you cache each spinning drive separately, unless you plan to use a RAIDed SSD cache.

  • Neither option much affects the duration of a grow operation, although the option with the spinning drives cached separately has the potential to be slower due to more bus traffic.
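Under those assumptions, caching the whole array might be set up roughly as follows. This is a sketch only: the device names are placeholders for your hardware, and the sysfs paths follow the bcache kernel documentation.

```shell
# Sketch; /dev/sd{a,b,c} are the spinning disks, /dev/md1 an SSD RAID1 cache.
# 1. Build the RAID5 array from the spinning disks:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc

# 2. Format the array as a bcache backing device and the SSD array as a cache:
make-bcache -B /dev/md0
make-bcache -C /dev/md1

# 3. Attach the cache set to the backing device (UUID via bcache-super-show):
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# 4. Writethrough mode: no data lives only on the cache, so an SSD
#    failure cannot destroy what is on the RAID:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```

The filesystem then goes on /dev/bcache0 rather than on /dev/md0 directly.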

Configuring bcache to remove the SSD when you need to replace it is a fast and relatively simple process. Thanks to blocks it should be possible to migrate the RAID setup in place in either direction.
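A minimal sketch of such a cache replacement, assuming the sysfs interface described in the bcache documentation (the UUID and device names are placeholders):

```shell
# Detach the cache set from the backing device; bcache flushes any
# dirty data to the array before the detach completes:
echo <cset-uuid> > /sys/block/bcache0/bcache/detach

# After physically replacing the SSD, format the new one and re-attach:
make-bcache -C /dev/newssd
echo <new-cset-uuid> > /sys/block/bcache0/bcache/attach
```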

You should also remember that, at the moment, most (all?) live-CD distributions don't support bcache, so you can't simply access your data with such tools, regardless of which bcache-mdraid layout you chose.
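That said, if a rescue system at least ships the bcache kernel module (as some do; see the comments), the devices can be registered by hand. A sketch, with placeholder device names:

```shell
modprobe bcache
# Register both halves; the /dev/bcacheN block device appears once
# the backing device and its cache set are both registered:
echo /dev/sdX > /sys/fs/bcache/register
echo /dev/sdY > /sys/fs/bcache/register
```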

Solution 2

I'd think the sane approach is to cache the resulting MD device.

bcache is designed to pass sequential reads and writes through to the backing device.

If you bcache each device separately, then from bcache's perspective several disks striped together into a RAIDed or striped MD will constantly be written with random blocks.

A bcached MD volume, by contrast, will look normal: files are written to the volume, rather than random blocks to several devices.

The entire point of hardware and software RAID is to do the striping of data in the backend so that the resulting filesystem looks like a normal volume.

This might not be correct (the bcache devs might be clever and account for that kind of situation), but the logically optimal thing to do is to cache volumes rather than block devices.
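For what it's worth, that pass-through behaviour is tunable: bcache bypasses the cache for IO it detects as sequential beyond a cutoff, which can be inspected or disabled through sysfs (paths per the bcache kernel documentation):

```shell
# Show the current cutoff (sequential IO larger than this bypasses the cache):
cat /sys/block/bcache0/bcache/sequential_cutoff

# Cache everything, regardless of detected access pattern:
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
```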

Author by Admin
Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    bcache allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.

    If I understand correctly,

    • an SSD* could be assigned to cache multiple backing HDDs, and then the resulting cached devices could be RAIDed with mdadm
      or
    • multiple HDDs could be RAIDed into a single backing md device and the SSD assigned to cache that

    I'm wondering which is the saner approach. It occurs to me that growing a RAID5/6 may be simpler with one or the other technique, but I'm not sure which!

    Are there good reasons (eg growing the backing storage or anything else) for choosing one approach over the other (for a large non-root filesystem containing VM backing files)?


    * by "an SSD" I mean some sort of redundant SSD device, eg a RAID1 of two physical SSDs

    • Admin
      Admin almost 10 years
      In either case all disks that bcache backs will have to be formatted with bcache - so you'll either have to create an md array, format the single resulting disk entirely as a bcache backed partition, link it to its cache drive and go from there, or format many disks with bcache, link them to their cache drive, then format the many disks as one array. In either case there are multiple points of possible failure all of which depend on interoperability between two filesystems - not to mention the final fs. see here: scroll down.
    • Admin
      Admin almost 10 years
      @mikeserv I understand all that, this is for a purpose built server so it's all good. What do you mean "two filesystems"? bcache is not a filesystem - the only filesystem I'll have will be XFS on the final bcache or mdadm device (depending which option I choose).
    • Admin
      Admin almost 10 years
      Thanks @Adam, in-place conversion is no issue for me.
    • Admin
      Admin almost 10 years
      @mikeserv no it isn't. Filesystems (eg btrfs, xfs, extN etc) live on top of block devices. mdadm and bcache work at the block device level, not at the filesystem level (btrfs confuses the issue with its layering violation, but that is a completely separate conversation).
    • Admin
      Admin almost 10 years
      maybe you're right. but bcache does also live on top of block devices - it just serves more of them.
    • Admin
      Admin over 9 years
      great question, so pity so many people didn't understand what you're after. :)
  • Admin
    Admin almost 10 years
    I've updated the question to make it clear I'm not planning to have a non-redundant SSD cache. Your second bullet point is an excellent point, thanks for that. Your third bullet about space: do you mean because you'd be storing the parity on SSD? Re your last para, I'm using F20 but will eventually be using RHEL/CentOS7 or Debian Jessie (if bcache-tools makes the cut).
  • Adam Ryczkowski
    Adam Ryczkowski almost 10 years
    @JackDouglas Ad 3rd bullet: Yes, exactly that. But since you plan to use raided ssd drives, that doesn't apply to you.
  • Admin
    Admin almost 10 years
    It still does because they'll not only be mirrored but will also need to store the RAID parity for the backing drives. This isn't the case if the RAID is done beneath bcache which I thought was your point
  • Adam Ryczkowski
    Adam Ryczkowski almost 10 years
    I believe you mean the opposite: the SSD array doesn't have to store the spinning disks' parity if it is fed the whole mdraid drive.
  • Admin
    Admin almost 10 years
    yes, that's exactly what I mean!
  • Peter Cordes
    Peter Cordes about 8 years
    A large sequential write to a RAID5/6 produces sequential writes to all the component devices. Each component device gets every (N-1)th data block (or parity), but the data it does get is sequential. But you're right that it will distort things. If there are some chunks that see frequent partial-stripe writes, resulting in a read-modify-write of (part of) the parity stripe, that could be cached by bcache. Caching it higher up, before the partial-stripe write ever hit the MD device, would be even better though.
  • Louis Gerbarg
    Louis Gerbarg over 7 years
    "You should also remember, that at the moment most (all?) live-CD distributions don't support bcache, so you can't simply access your data with such tools regardless of the bcache-mdraid layout option you chose. " - many (sysresccd/arch) do support bcache module. You need to issue modprobe bcache; echo /dev/device1 > /sys/fs/bcache/register; echo /dev/device2 > /sys/fs/bcache/register manually.
  • KJ7LNW
    KJ7LNW almost 3 years
    SSD failures are not rare, they have a max TBW and then they get really slow and latent or just plain fail. We've burned out lots of SSDs due to wear.