Discussion:
OT: btrfs raid 5/6
Bill Kenworthy
2017-11-27 22:30:13 UTC
Hi all,
I need to expand two bcache-fronted 4x-disk btrfs raid 10's - this
requires purchasing 4 drives (and one system does not have room for two
more drives), so I am trying to see if using raid 5 is an option.

I have been trying to find out whether btrfs raid 5/6 is stable enough to
use, but while there is mention of improvements in kernel 4.12, and fixes
for the write hole problem, I can't see any reports that it's "working
fine now", though there is a Phoronix article saying Oracle has been
using it since the fixes.

Is anyone here successfully using btrfs raid 5/6? What is the status of
scrub and self healing? The btrfs wiki is woefully out of date :(

BillK
J. Roeleveld
2017-12-01 15:59:07 UTC
Post by Bill Kenworthy
Hi all,
I need to expand two bcache fronted 4xdisk btrfs raid 10's - this
requires purchasing 4 drives (and one system does not have room for two
more drives) so I am trying to see if using raid 5 is an option
I have been trying to find if btrfs raid 5/6 is stable enough to use but
while there is mention of improvements in kernel 4.12, and fixes for the
write hole problem I cant see any reports that its "working fine now"
though there is a phoronix article saying Oracle is using it since the
fixes.
Is anyone here successfully using btrfs raid 5/6? What is the status of
scrub and self healing? The btrfs wiki is woefully out of date :(
BillK
I have not seen any indication that BTRFS raid 5/6/.. is usable.
Last status I heard: no scrub, no rebuild when a disk fails, ...
It should work as long as all disks stay functioning, but then I wonder
why bother with anything more advanced than raid-0?

It's the lack of progress with regards to proper "raid" support in BTRFS
which made me stop considering it and simply go with ZFS.

--
Joost
Wols Lists
2017-12-01 16:58:49 UTC
Post by Bill Kenworthy
Hi all,
I need to expand two bcache fronted 4xdisk btrfs raid 10's - this
requires purchasing 4 drives (and one system does not have room for two
more drives) so I am trying to see if using raid 5 is an option
I have been trying to find if btrfs raid 5/6 is stable enough to use but
while there is mention of improvements in kernel 4.12, and fixes for the
write hole problem I cant see any reports that its "working fine now"
though there is a phoronix article saying Oracle is using it since the
fixes.
Is anyone here successfully using btrfs raid 5/6? What is the status of
scrub and self healing? The btrfs wiki is woefully out of date :(
Or put btrfs over md-raid?

Thing is, with raid-6 over four drives, you have a 100% certainty of
surviving a two-disk failure. With raid-10 you have a 33% chance of
losing your array.
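Wol's 33% figure can be checked by brute force. A minimal sketch (my own illustration, assuming a four-disk raid-10 laid out as two mirror pairs): enumerate every two-disk failure and count the ones that kill a whole pair.

```python
from itertools import combinations

# 4-disk RAID-10 as two mirror pairs: disks (0,1) mirror each other, as do (2,3).
# The array is lost iff BOTH disks of the same pair fail.
pairs = [(0, 1), (2, 3)]

fatal = sum(
    1
    for failed in combinations(range(4), 2)   # all two-disk failures
    if any(set(p) <= set(failed) for p in pairs)
)
total = len(list(combinations(range(4), 2)))  # 6 possible two-disk failures

print(fatal, total, fatal / total)  # 2 of 6 -> 1/3 chance of losing the array
```

Raid-6 on the same four disks survives all 6 combinations, which is the 100% figure above.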

Cheers,
Wol
Rich Freeman
2017-12-01 17:14:12 UTC
Post by Wols Lists
Post by Bill Kenworthy
Hi all,
I need to expand two bcache fronted 4xdisk btrfs raid 10's - this
requires purchasing 4 drives (and one system does not have room for two
more drives) so I am trying to see if using raid 5 is an option
I have been trying to find if btrfs raid 5/6 is stable enough to use but
while there is mention of improvements in kernel 4.12, and fixes for the
write hole problem I cant see any reports that its "working fine now"
though there is a phoronix article saying Oracle is using it since the
fixes.
Is anyone here successfully using btrfs raid 5/6? What is the status of
scrub and self healing? The btrfs wiki is woefully out of date :(
Or put btrfs over md-raid?
Thing is, with raid-6 over four drives, you have a 100% certainty of
surviving a two-disk failure. With raid-10 you have a 33% chance of
losing your array.
I tend to be a fan of parity raid in general for these reasons. I'm
not sure the performance gains with raid-10 are enough to warrant the
waste of space.

With btrfs though I don't really see the point of "Raid-10" vs just a
pile of individual disks in raid1 mode. Btrfs will do a so-so job of
balancing the IO across them already (they haven't really bothered to
optimize this yet).
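For reference, a "pile of individual disks in raid1 mode" is a one-liner to set up (device names below are hypothetical examples, not from the original posts):

```shell
# btrfs raid1 across four disks: every data and metadata chunk is stored
# twice, on two different devices, regardless of how many disks are present.
mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mount /dev/sdb /mnt/pool             # any member device mounts the whole fs
btrfs filesystem usage /mnt/pool     # shows how chunks are spread per device
```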

I've moved away from btrfs entirely until they sort things out.
However, I would not use btrfs for raid-5/6 under any circumstances.
That has NEVER been stable, and if anything has gone backwards. I'm
sure they'll sort it out sometime, but I have no idea when. RAID-1 on
btrfs is reasonably stable, but I've still had it run into issues
(nothing that kept me from reading the data off the array, but I've
had various issues with it, and when I finally moved it to ZFS it was
in a state where I couldn't run it in anything other than degraded
mode).

You could run btrfs over md-raid, but other than the snapshots I think
this loses a lot of the benefit of btrfs in the first place. You are
vulnerable to the write hole, the ability of btrfs to recover data
from soft errors is compromised (though you can still detect them), and
you're potentially faced with more read-modify-write cycles when raid
stripes are modified. Both zfs and btrfs were really designed to work
best on raw block devices without any layers below. They still work,
of course, but you don't get some of those optimizations since they
don't have visibility into what is happening at the disk level.
--
Rich
Wols Lists
2017-12-01 17:24:26 UTC
Post by Rich Freeman
You could run btrfs over md-raid, but other than the snapshots I think
this loses a lot of the benefit of btrfs in the first place. You are
vulnerable to the write hole,
The write hole is now "fixed".

In quotes because, although journalling has now been merged and is
available, there still seem to be a few corner case (and not so corner
case) bugs that need ironing out before it's solid.

Cheers,
Wol
Frank Steinmetzger
2017-12-06 23:28:29 UTC
Post by Rich Freeman
Post by Wols Lists
Post by Bill Kenworthy
[…]
Is anyone here successfully using btrfs raid 5/6? What is the status of
scrub and self healing? The btrfs wiki is woefully out of date :(
[…]
Thing is, with raid-6 over four drives, you have a 100% certainty of
surviving a two-disk failure. With raid-10 you have a 33% chance of
losing your array.
[…]
I tend to be a fan of parity raid in general for these reasons. I'm
not sure the performance gains with raid-10 are enough to warrant the
waste of space.
[…]
and when I finally moved it to ZFS
[…]
I am about to upgrade my Gentoo NAS from 2× to 4× 6 TB WD Red (non-Pro).
The current setup is a ZFS mirror. I had been holding off the purchase for
months, all the while pondering which RAID scheme to use. First it was
raidz1 due to space (I only have four bays), but I eventually discarded it
due to reduced resilience.

Which brought me to raidz2 (any 2 drives may fail). But then I came across
that famous post by a developer, “You should always use mirrors unless you
are really really sure what you’re doing”. The main points were higher
strain on the entire array during resilvering (all drives need to read
everything, instead of just one drive) and easier maintainability of a
mirror set (e.g. faster and easier upgrades).

I don’t really care about performance. It’s a simple media archive powered
by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though
^^). Sorry if I more or less stole the thread, but this is almost the same
topic. I could use a nudge in either direction. My workplace’s storage
comprises many 2× mirrors, but I am not a company and I am capped at four
bays.

So, do you have any input for me before I fetch the dice?
--
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

All PCs are compatible. Some are just more compatible than others.
Rich Freeman
2017-12-06 23:35:10 UTC
I don’t really care about performance. It’s a simple media archive powered
by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though
^^). Sorry if I more or less stole the thread, but this is almost the same
topic. I could use a nudge in either direction. My workplace’s storage
comprises many 2× mirrors, but I am not a company and I am capped at four
bays.
So, Do you have any input for me before I fetch the dice?
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.

Now, with ZFS it is frustrating because arrays are relatively
inflexible when it comes to expansion, though that applies to all
types of arrays. That is one major advantage of btrfs (and mdadm) over
zfs. I hear they're working on that, but in general there are a lot
of things in zfs that are more static compared to btrfs.
--
Rich
Frank Steinmetzger
2017-12-07 00:13:28 UTC
Post by Rich Freeman
Post by Frank Steinmetzger
I don’t really care about performance. It’s a simple media archive powered
by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though
^^). Sorry if I more or less stole the thread, but this is almost the same
topic. I could use a nudge in either direction. My workplace’s storage
comprises many 2× mirrors, but I am not a company and I am capped at four
bays.
So, Do you have any input for me before I fetch the dice?
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.
Cost saving compared to what? In my four-bay scenario, mirror and raidz2
yield the same available space (I hope so).
--
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

Advanced mathematics have the advantage that you can err more accurately.
Rich Freeman
2017-12-07 00:29:08 UTC
Post by Frank Steinmetzger
Post by Rich Freeman
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.
Cost saving compared to what? In my four-bay-scenario, mirror and raidz2
yield the same available space (I hope so).
Sure, if you only have 4 drives and run raid6/z2 then it is no more
efficient than mirroring. That said, it does provide more security
because raidz2 can tolerate the failure of any two disks, while
2xraid1 or raid10 can tolerate only half of the combinations of two
disks.

The increased efficiency of parity raid comes as you scale up.
They're equal at 4 disks. If you had 6 disks then raid6 holds 33%
more. If you have 8 then it holds 50% more. That and it takes away
the chance factor when you lose two disks. If you're really unlucky
with 4xraid1 the loss of two disks could result in the loss of 25% of
your data, while with an 8-disk raid6 the loss of two disks will never
result in the loss of any data. (Granted, a 4xraid1 could tolerate
the loss of 4 drives if you're very lucky - the luck factor is being
eliminated and that cuts both ways.)
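The capacity figures above can be worked through explicitly. A quick sketch (my own, assuming equal-size disks, raid6 reserving two disks' worth of parity and mirrors pairing disks up):

```python
# Usable capacity in "disks' worth of data" for n equal-size disks.
def usable_raid6(n):
    return n - 2        # two disks' worth reserved for P and Q parity

def usable_mirror(n):
    return n // 2       # every disk has a mirror partner

for n in (4, 6, 8):
    r6, m = usable_raid6(n), usable_mirror(n)
    extra = (r6 - m) / m * 100
    print(f"{n} disks: raid6={r6}, mirrors={m}, raid6 holds {extra:.0f}% more")
# 4 disks: raid6=2, mirrors=2, raid6 holds 0% more
# 6 disks: raid6=4, mirrors=3, raid6 holds 33% more
# 8 disks: raid6=6, mirrors=4, raid6 holds 50% more
```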

If I had only 4 drives I probably wouldn't use raidz2. I might use
raid5/raidz1, or two mirrors. With mdadm I'd probably use raid5
knowing that I can easily reshape the array if I want to expand it
further.
--
Rich
Frank Steinmetzger
2017-12-07 21:37:55 UTC
Post by Rich Freeman
Post by Frank Steinmetzger
Post by Rich Freeman
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.
Cost saving compared to what? In my four-bay-scenario, mirror and raidz2
yield the same available space (I hope so).
Sure, if you only have 4 drives and run raid6/z2 then it is no more
efficient than mirroring. That said, it does provide more security
because raidz2 can tolerate the failure of any two disks, while
2xraid1 or raid10 can tolerate only half of the combinations of two
disks.
Ooooh, I just came up with another good reason for raidz over mirror:
I don't encrypt my drives, because they don't hold sensitive stuff. (AFAIK
native ZFS encryption is available in Oracle ZFS, so it might eventually
come to the Linux world.)

So in case I ever need to send in a drive for repair/replacement, no one
can read from it (or only in tiny bits'n'pieces from a hexdump), because
each disk contains a mix of data and parity blocks.

I think I'm finally sold. :)
And with that, good night.
--
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

“I think Leopard is a much better system [than Windows Vista] … but OS X in
some ways is actually worse than Windows to program for. Their file system is
complete and utter crap, which is scary.” – Linus Torvalds
Wols Lists
2017-12-07 21:49:29 UTC
Post by Frank Steinmetzger
I don't encrypt my drives because it doesn't hold sensitive stuff. (AFAIK
native ZFS encryption is available in Oracle ZFS, so it might eventually
come to the Linux world).
So in case I ever need to send in a drive for repair/replacement, noone can
read from it (or only in tiny bits'n'pieces from a hexdump), because each
disk contains a mix of data and parity blocks.
I think I'm finally sold. :)
And with that, good night.
So you've never heard of LUKS?

GPT
LUKS
MD-RAID
Filesystem

Simple stack, so if you ever have to pull a disk, just delete the LUKS
key from it and everything on that disk is now random garbage.
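A sketch of that stack in commands (device names are hypothetical examples, and these commands are destructive, so adjust before running anything):

```shell
# LUKS below md: encrypt each member disk, then build the array on the
# unlocked mappings, then put the filesystem on the array.
cryptsetup luksFormat /dev/sdb1
cryptsetup luksFormat /dev/sdc1
cryptsetup open /dev/sdb1 crypt_b    # unlocked device at /dev/mapper/crypt_b
cryptsetup open /dev/sdc1 crypt_c
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/mapper/crypt_b /dev/mapper/crypt_c
mkfs.ext4 /dev/md0                   # any filesystem on top

# Before pulling/RMA-ing a disk: destroy its LUKS key slots so the
# on-disk data is unrecoverable random garbage.
cryptsetup luksErase /dev/sdb1
```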

(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)

Cheers,
Wol
Frank Steinmetzger
2017-12-07 22:35:45 UTC
Post by Wols Lists
Post by Frank Steinmetzger
So in case I ever need to send in a drive for repair/replacement, noone can
read from it (or only in tiny bits'n'pieces from a hexdump), because each
disk contains a mix of data and parity blocks.
I think I'm finally sold. :)
And with that, good night.
So you've never heard of LUKS?
Sure thing, my laptop’s whole SSD is LUKSed and so are all my other home and
backup partitions. But encrypting ZFS is different, because every disk needs
to be encrypted separately since there is no separation between the FS and
the underlying block device.

This will result in a big computational overhead, choking my poor Celeron.
When I benchmarked reading from a single LUKS container in a ramdisk, it
managed around 160 MB/s IIRC. I might give it a try over the weekend before
I migrate my data, but I’m not expecting miracles. Should have bought an i3
for that.
Post by Wols Lists
(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)
Ok, I wasn’t aware of that. I thought I read in a ZFS article that this was
a special thing.
--
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

This is no signature.
Wols Lists
2017-12-07 23:48:45 UTC
Post by Wols Lists
(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)
Ok, wasn’t aware of that. I thought I read in a ZFS article that this were a
special thing.
Say you've got a four-drive raid-6, it'll be something like

data1 data2 parity1 parity2
data3 parity3 parity4 data4
parity5 parity6 data5 data6

The only thing to watch out for (and zfs is likely the same) is that if a
file fits inside a single chunk, it will be recoverable from a single drive.
And I think chunks can be anything up to 64MB.
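That rotating mix of data and parity can be sketched in a few lines. This is my own toy illustration of the table above (the parity pair shifts left by one disk per stripe; real md layouts such as left-symmetric differ in detail):

```python
# Toy rotating RAID-6 layout: each stripe on 4 disks holds 2 data blocks
# and 2 parity blocks, and the parity pair moves one disk left per stripe.
def raid6_layout(n_disks=4, n_stripes=3):
    rows, d = [], 0
    for s in range(n_stripes):
        parity_cols = {(2 - s) % n_disks, (3 - s) % n_disks}
        row = []
        for col in range(n_disks):
            if col in parity_cols:
                row.append("parity")
            else:
                d += 1
                row.append(f"data{d}")
        rows.append(row)
    return rows

for row in raid6_layout():
    print(" ".join(row))
# data1 data2 parity parity
# data3 parity parity data4
# parity parity data5 data6
```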

Cheers,
Wol
J. Roeleveld
2017-12-09 16:58:21 UTC
Post by Wols Lists
Post by Wols Lists
(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)
Ok, wasn’t aware of that. I thought I read in a ZFS article that this were
a special thing.
Say you've got a four-drive raid-6, it'll be something like
data1 data2 parity1 parity2
data3 parity3 parity4 data4
parity5 parity6 data5 data6
The only thing to watch out for (and zfs is likely the same) if a file
fits inside a single chunk it will be recoverable from a single drive.
And I think chunks can be anything up to 64MB.
Except that ZFS doesn't have fixed on-disk chunk sizes (especially if you
use compression).

See:
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz

--
Joost
Wols Lists
2017-12-09 18:28:20 UTC
Post by J. Roeleveld
Post by Wols Lists
Post by Wols Lists
(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)
Ok, wasn’t aware of that. I thought I read in a ZFS article that this were
a special thing.
Say you've got a four-drive raid-6, it'll be something like
data1 data2 parity1 parity2
data3 parity3 parity4 data4
parity5 parity6 data5 data6
The only thing to watch out for (and zfs is likely the same) if a file
fits inside a single chunk it will be recoverable from a single drive.
And I think chunks can be anything up to 64MB.
Except that ZFS doesn't have fixed on-disk-chunk-sizes. (especially if you use
compression)
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
Which explains nothing, sorry ... :-(

It goes on about 4K or 8K database blocks (and I'm talking about 64 MEG
chunk sizes). And the OP was talking about files being recoverable from
a disk that was removed from an array. Are you telling me that a *small*
file has bits of it scattered across multiple drives? That would be *crazy*.

If I have a file of, say, 10MB, and write it to an md-raid array, there
is a good chance it will fit inside a single chunk, and be written -
*whole* - to a single disk. With parity on another disk. How big does a
file have to be on ZFS before it is too big to fit in a typical chunk,
so that it gets split up across multiple drives?

THAT is what I was on about, and that is what concerned the OP. I was
just warning the OP that a chunk is typically rather more than just one
disk block, so anybody harking back to the days of 512-byte sectors could
get a nasty surprise ...

Cheers,
Wol
Rich Freeman
2017-12-09 23:36:45 UTC
Post by Wols Lists
Post by J. Roeleveld
Post by Wols Lists
Post by Wols Lists
(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)
Ok, wasn’t aware of that. I thought I read in a ZFS article that this were
a special thing.
Say you've got a four-drive raid-6, it'll be something like
data1 data2 parity1 parity2
data3 parity3 parity4 data4
parity5 parity6 data5 data6
The only thing to watch out for (and zfs is likely the same) if a file
fits inside a single chunk it will be recoverable from a single drive.
And I think chunks can be anything up to 64MB.
Except that ZFS doesn't have fixed on-disk-chunk-sizes. (especially if you use
compression)
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
Which explains nothing, sorry ... :-(
It goes on about 4K or 8K database blocks (and I'm talking about 64 MEG
chunk sizes). And the OP was talking about files being recoverable from
a disk that was removed from an array. Are you telling me that a *small*
file has bits of it scattered across multiple drives? That would be *crazy*.
I'm not sure why it would be "crazy." Granted, most parity RAID
systems seem to operate just as you describe, but I don't see why with
Reed Solomon you couldn't store ONLY parity data on all the drives.
All that matters is that you generate enough to recover the data - the
original data contains no more information than an equivalent number
of Reed-Solomon sets. Of course, with the original data I imagine you
need to do less computation assuming you aren't bothering to check its
integrity against the parity data.

In case my point isn't clear: a RAID would work perfectly fine if you had
5 drives with the capacity to store 4 drives' worth of data, but instead
of storing the original data across 4 drives and having 1 of parity,
you instead compute 5 sets of parity so that you now have 9 sets of
data that can tolerate the loss of any 5, then throw away the sets
containing the original 4 sets of data and store the remaining 5 sets
of parity data across the 5 drives. You can still tolerate the loss
of one more set, but all 4 of the original sets of data have been
tossed already.
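This parity-only scheme really does work. A toy Reed-Solomon-style sketch of my own, over the prime field GF(257) for readability (real implementations use GF(2^8)): treat the four data values as points of a polynomial at x = 0..3, store only five evaluations at x = 4..8, and recover the data from any four of them by Lagrange interpolation.

```python
P = 257  # small prime field for illustration; real Reed-Solomon uses GF(2^8)

def interp(shares, x_target):
    """Lagrange interpolation over GF(P): evaluate, at x_target, the unique
    polynomial passing through the given (x, y) shares."""
    total = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (x_target - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^-1 via Fermat
    return total

data = [10, 20, 30, 40]                # four "drives' worth" of data at x = 0..3
points = list(enumerate(data))         # defines the unique polynomial of degree < 4
parity = [(x, interp(points, x)) for x in range(4, 9)]  # five parity-only shares

# Store ONLY the five parity shares; any four of them recover all the data:
recovered = [interp(parity[:4], x) for x in range(4)]
print(recovered)  # [10, 20, 30, 40]
```

Nine shares total, any four suffice, and the four stored ones need not be the originals, which is exactly the point above.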
--
Rich
Wols Lists
2017-12-10 09:45:49 UTC
Post by Rich Freeman
you instead compute 5 sets of parity so that now you have 9 sets of
data that can tolerate the loss of any 5, then throw away the sets
containing the original 4 sets of data and store the remaining 5 sets
of parity data across the 5 drives. You can still tolerate the loss
of one more set, but all 4 of the original sets of data have been
tossed already.
Is that how ZFS works?

Cheers,
Wol
Rich Freeman
2017-12-10 15:07:02 UTC
Post by Wols Lists
Post by Rich Freeman
you instead compute 5 sets of parity so that now you have 9 sets of
data that can tolerate the loss of any 5, then throw away the sets
containing the original 4 sets of data and store the remaining 5 sets
of parity data across the 5 drives. You can still tolerate the loss
of one more set, but all 4 of the original sets of data have been
tossed already.
Is that how ZFS works?
I doubt it, hence why I wrote "most parity RAID systems seem to
operate just as you describe."
--
Rich
Wols Lists
2017-12-10 21:00:25 UTC
Post by Rich Freeman
Post by Wols Lists
Is that how ZFS works?
I doubt it, hence why I wrote "most parity RAID systems seem to
operate just as you describe."
So the OP needs to be aware that, if his file is smaller than the chunk
size, then it *will* be recoverable from a disk pulled from an array, be
it md-raid or zfs.

The question is, then, how big is a chunk? And if zfs is anything like
md-raid, it will be a lot bigger than the 512B or 4KB that a naive user
would think.

Cheers,
Wol
Rich Freeman
2017-12-11 01:33:23 UTC
Post by Wols Lists
So the OP needs to be aware that, if his file is smaller than the chunk
size, then it *will* be recoverable from a disk pulled from an array, be
it md-raid or zfs.
The question is, then, how big is a chunk? And if zfs is anything like
md-raid, it will be a lot bigger than the 512B or 4KB that a naive user
would think.
I suspect the data is striped/chunked/etc at a larger scale.

However, I'd really go a step further. Unless a filesystem or block
layer is explicitly designed to prevent the retrieval of data without
a key/etc, then I would not rely on something like this for security.
Even actual encryption systems can have bugs that render them
vulnerable. Something that at best provides this kind of security "by
accident" is not something you should rely on. Data might be stored
in journals, or metadata, or unwiped free space, or in any number of
ways that makes it possible to retrieve even if it isn't obvious from
casual inspection.

If you don't want somebody recovering data from a drive you're
disposing of, then you should probably be encrypting that drive one
way or another with a robust encryption layer. That might be built
into the filesystem, or it might be a block layer. If you're
desperate I guess you could use the SMART security features provided
by your drive firmware, which probably work, but which nobody can
really vouch for but the drive manufacturer. Any of these are going
to provide more security than relying on RAID striping to make data
irretrievable.

If you really care about security, then you're going to be paranoid
about the tools that actually are designed to be secure, let alone the
ones that aren't.
--
Rich
Frank Steinmetzger
2017-12-11 23:20:48 UTC
Post by Wols Lists
Post by Frank Steinmetzger
I don't encrypt my drives because it doesn't hold sensitive stuff. (AFAIK
native ZFS encryption is available in Oracle ZFS, so it might eventually
come to the Linux world).
So in case I ever need to send in a drive for repair/replacement, noone can
read from it (or only in tiny bits'n'pieces from a hexdump), because each
disk contains a mix of data and parity blocks.
I think I'm finally sold. :)
And with that, good night.
So you've never heard of LUKS?
GPT
LUKS
MD-RAID
Filesystem
My new drives are finally here. One of them turned out to be an OEM. -_-
The shop says it will cover any warranty claims and it’s not a backyard
seller either, so methinks I’ll keep it.

To evaluate LUKS, I created the following setup (I just love ASCII-painting
in vim ^^):

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tmpfs                                                           ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │ LUKS device │ │ LUKS device │ │ LUKS device │ │ LUKS device │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────────────────────────────────────────────────────┐ ┃
┃ │                           RaidZ2                            │ ┃
┃ └─────────────────────────────────────────────────────────────┘ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

While dd'ing a 1500 MB file from and to the pool, my NAS Celeron achieved
(with the given number of vdevs out of all 4 being encrypted):

             non-encrypted   2 encrypted   4 encrypted
────────────────────────────────────────────────────
read           1600 MB/s       465 MB/s      290 MB/s
write          ~600 MB/s       ~200 MB/s     ~135 MB/s
scrub time     10 s (~ 100 MB/s)

So performance would be juuuust enough to satisfy GbE. I wonder though how
long a real scrub/resilver would take. The last scrub of my mirror, which
has 3.8 TB allocated, took 9½ hours. Once the z2 pool is created and the
data migrated, I *will* have to do a resilver in any case, because I only
have four drives and they will all go into the pool, but two of them
currently make up the mirror.


I see myself buying an i3 before too long. Talk about first-world problems.
--
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

When you are fine, don’t worry. It will pass.
Neil Bothwick
2017-12-12 10:15:23 UTC
Post by Frank Steinmetzger
My new drives are finally here. One of them turned out to be an OEM. -_-
The shop says it will cover any warranty claims and it’s not a backyard
seller either, so methinks I’ll keep it.
To evaluate LUKS, I created the following setup (I just love
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tmpfs                                                           ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │ LUKS device │ │ LUKS device │ │ LUKS device │ │ LUKS device │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────────────────────────────────────────────────────┐ ┃
┃ │                           RaidZ2                            │ ┃
┃ └─────────────────────────────────────────────────────────────┘ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
That means every write has to be encrypted 4 times, whereas using
encryption in the filesystem means it only has to be done once. I tried
setting up encrypted BTRFS this way and there was a significant performance
hit. I'm seriously considering going back to ZoL now that encryption is
on the way.
--
Neil Bothwick

A printer consists of three main parts: the case, the jammed paper tray
and the blinking red light.
Wols Lists
2017-12-12 12:18:23 UTC
Post by Neil Bothwick
That means every write has to be encrypted 4 times, whereas using
encryption in the filesystem means it only has to be done once. I tried
setting encrypted BTRFS this way and there was a significant performance
hit. I'm seriously considering going back to ZoL now that encryption is
on the way.
DISCLAIMER - I DON'T HAVE A CLUE HOW THIS ACTUALLY WORKS IN DETAIL

but there have been a fair few posts on LKML sublists about how Linux is
very inefficient at using hardware encryption. Setup/teardown is
expensive, and it only encrypts in small disk-sized blocks, so somebody's
been trying to make it encrypt in filesystem-sized chunks. When/if they
get this working, you'll probably notice a speedup of the order of 90%
or so ...

Cheers,
Wol
Neil Bothwick
2017-12-12 13:24:46 UTC
Post by Wols Lists
Post by Neil Bothwick
That means every write has to be encrypted 4 times, whereas using
encryption in the filesystem means it only has to be done once. I
tried setting encrypted BTRFS this way and there was a significant
performance hit. I'm seriously considering going back to ZoL now that
encryption is on the way.
DISCLAIMER - I DON'T HAVE A CLUE HOW THIS ACTUALLY WORKS IN DETAIL
but there's been a fair few posts on LKML sublists about how linux is
very inefficient at using hardware encryption. Setup/teardown is
expensive, and it only encrypts in small disk-size blocks, so somebody's
been trying to make it encrypt in file-system-sized chunks. When/if they
get this working, you'll probably notice a speedup of the order of 90%
or so ...
This isn't so much a matter of hardware vs. software encryption; it's more
that encrypting below the RAID level means everything has to be encrypted
multiple times.
--
Neil Bothwick

There's no place like ~
Richard Bradfield
2017-12-07 07:54:41 UTC
Post by Rich Freeman
I don’t really care about performance. It’s a simple media archive powered
by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though
^^). Sorry if I more or less stole the thread, but this is almost the same
topic. I could use a nudge in either direction. My workplace’s storage
comprises many 2× mirrors, but I am not a company and I am capped at four
bays.
So, Do you have any input for me before I fetch the dice?
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.
Now, with ZFS it is frustrating because arrays are relatively
inflexible when it comes to expansion, though that applies to all
types of arrays. That is one major advantage of btrfs (and mdadm) over
zfs. I hear they're working on that, but in general there are a lot
of things in zfs that are more static compared to btrfs.
--
Rich
When planning ZFS pools, at least for home use, it's worth thinking
about your usage pattern and whether you'll need to expand the pool before
the lifetime of the drives rolls around.

I incorporated ZFS' expansion inflexibility into my planned
maintenance/servicing budget. I started out with 4x 2TB disks, limited
to those 4 bays as you are, but planned to replace those drives after a
period of 3-4 years.

By the time the first of my drives began to show SMART errors, the price
of a 3TB drive had dropped to what I had paid for the 2TB models, so I
bought another set and did a rolling upgrade, bringing the pool up to
6TB.
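That rolling upgrade can be sketched with standard zpool commands (pool and device names are hypothetical; each resilver must finish before the next disk is swapped):

```shell
# Grow a raidz pool by replacing its members one at a time with bigger disks.
zpool set autoexpand=on tank          # let the pool claim the extra space
zpool replace tank ata-OLD_DISK_1 ata-NEW_DISK_1
zpool status tank                     # wait until resilvering completes
# ...repeat "zpool replace" for each remaining disk in turn...
# After the last replacement the pool expands to the new capacity:
zpool list tank
```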

I expect I'll do the same thing late next year, I wonder if 4TB will be
the sweet spot, or if I might be able to get something larger.
--
Richard
Frank Steinmetzger
2017-12-07 09:28:56 UTC
Post by Richard Bradfield
Post by Rich Freeman
I don’t really care about performance. It’s a simple media archive powered
by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though
^^). Sorry if I more or less stole the thread, but this is almost the same
topic. I could use a nudge in either direction. My workplace’s storage
comprises many 2× mirrors, but I am not a company and I am capped at four
bays.
So, Do you have any input for me before I fetch the dice?
IMO the cost savings for parity RAID trumps everything unless money
just isn't a factor.
Now, with ZFS it is frustrating because arrays are relatively
inflexible when it comes to expansion, though that applies to all
types of arrays. That is one major advantage of btrfs (and mdadm) over
zfs. I hear they're working on that, but in general there are a lot
of things in zfs that are more static compared to btrfs.
--
Rich
When planning for ZFS pools, at least for home use, it's worth thinking
about your usage pattern, and if you'll need to expand the pool before
the lifetime of the drives rolls around.
When I set the NAS up, I migrated everything from my existing individual
external harddrives onto it (the biggest of which was 3 TB). So the main
data slurping is over. Going from 6 to 12 TB should be enough™ for a loooong
time unless I start buying TV series on DVD for which I don't have physical
space.
Post by Richard Bradfield
I incorporated ZFS' expansion inflexibility into my planned
maintenance/servicing budget.
What was the conclusion? That having no more free slots meant you might as
well use the inflexible RaidZ, and that otherwise you would have gone with a
mirror?
Post by Richard Bradfield
I expect I'll do the same thing late next year, I wonder if 4TB will be
the sweet spot, or if I might be able to get something larger.
Me thinks 4 TB was already the sweet spot when I bought my drives a year
back (regarding ¤/GiB). Just checked: 6 TB is the cheapest now according to
a pricing search engine. Well, the German version anyway[1]. The Brits are a
bit more picky[2].

[1] https://geizhals.de/?cat=hde7s&xf=10287_NAS~957_Western+Digital&sort=r
[2] https://skinflint.co.uk/?cat=hde7s&xf=10287_NAS%7E957_Western+Digital&sort=r
--
This message was written using only recycled electrons.
Richard Bradfield
2017-12-07 09:52:55 UTC
Permalink
Post by Frank Steinmetzger
Post by Richard Bradfield
I incorporated ZFS' expansion inflexibility into my planned
maintenance/servicing budget.
What was the conclusion? That having no more free slots meant you might as
well use the inflexible RaidZ, and that otherwise you would have gone with a
mirror?
Correct, I had gone back and forth between RaidZ2 and a pair of Mirrors.
I needed the space to be extendable, but I calculated my usage growth
to be below the rate at which drive prices were falling, so I could
budget to replace the current set of drives in 3 years, and that
would buy me a set of bigger ones when the time came.

I did also investigate USB3 external enclosures, they're pretty
fast these days.
--
I apologize if my web client has mangled my message.
Richard
Frank Steinmetzger
2017-12-07 14:53:34 UTC
Permalink
Post by Richard Bradfield
Post by Frank Steinmetzger
Post by Richard Bradfield
I incorporated ZFS' expansion inflexibility into my planned
maintenance/servicing budget.
What was the conclusion? That having no more free slots meant you might as
well use the inflexible RaidZ, and that otherwise you would have gone with a
mirror?
Correct, I had gone back and forth between RaidZ2 and a pair of Mirrors.
I needed the space to be extendable, but I calculated my usage growth
to be below the rate at which drive prices were falling, so I could
budget to replace the current set of drives in 3 years, and that
would buy me a set of bigger ones when the time came.
I see. I'm always looking for ways to optimise expenses and cut down on
environmental footprint by keeping stuff around until it really breaks. In
order to increase capacity, I would have to replace all four drives, whereas
with a mirror, two would be enough.
Post by Richard Bradfield
I did also investigate USB3 external enclosures, they're pretty
fast these days.
When I configured my kernel the other day, I discovered network block
devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I
can do zpool replace with the drive-to-be-replaced still in the pool, which
improves resilver read distribution and thus lessens the probability of a
failure cascade.

[0] http://www.sharkoon.com/?q=de/node/2171
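In sketch form, that replace-over-the-network idea might look like this
(the pool, host, and device names are all invented, and the nbd-client
invocation details vary between versions):

```shell
# On the PC with the free hotswap bay, export the new disk over the LAN:
#   nbd-server 10809 /dev/sdX
# On the NAS, attach it and start the replace with the old disk still
# in the pool, so resilver reads are spread across all members:
modprobe nbd
nbd-client pc.lan 10809 /dev/nbd0      # new disk shows up as /dev/nbd0
zpool replace tank old-disk /dev/nbd0
zpool status tank                      # watch the resilver progress
# Once it finishes: detach, move the disk into the NAS bay, import again.
```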
Rich Freeman
2017-12-07 15:26:34 UTC
Permalink
Post by Frank Steinmetzger
I see. I'm always looking for ways to optimise expenses and cut down on
environmental footprint by keeping stuff around until it really breaks. In
order to increase capacity, I would have to replace all four drives, whereas
with a mirror, two would be enough.
That is a good point. Though I would note that you can always replace
the raidz2 drives one at a time - you just get zero benefit until
they're all replaced. So, if your space use grows at a rate lower
than the typical hard drive turnover rate that is an option.
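That one-at-a-time swap can be sketched as a loop (pool and device names
are hypothetical, `zpool wait` assumes a reasonably recent OpenZFS, and
the real commands are commented out so this reads as a dry run):

```shell
#!/bin/sh
# Sketch: replace each raidz2 member in turn, resilvering between swaps.
POOL=tank
for old in da0 da1 da2 da3; do
    new="${old}-new"
    echo "replacing ${old} with ${new} in ${POOL}"
    # zpool replace "$POOL" "$old" "$new"
    # zpool wait -t resilver "$POOL"   # block until the resilver finishes
done
# Extra capacity appears only after the last member is swapped, and only
# if autoexpand=on is set on the pool.
```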
Post by Frank Steinmetzger
When I configured my kernel the other day, I discovered network block
devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I
can do zpool replace with the drive-to-be-replaced still in the pool, which
improves resilver read distribution and thus lessens the probability of a
failure cascade.
If you want to get into the network storage space I'd keep an eye on
cephfs. I don't think it is quite to the point where it is a
zfs/btrfs replacement option, but it could get there. I don't think
the checksums are quite end-to-end, but they're getting better.
Overall stability for cephfs itself (as opposed to ceph object
storage) is not as good from what I hear. The biggest issue with it
though is RAM use on the storage nodes. They want 1GB/TB RAM, which
rules out a lot of the cheap ARM-based solutions. Maybe you can get
by with less, but finding ARM systems with even 4GB of RAM is tough,
and even that means only one hard drive per node, which means a lot of
$40+ nodes to go on top of the cost of the drives themselves.

Right now cephfs mainly seems to appeal to the scalability use case.
If you have 10k servers accessing 150TB of storage and you want that
all in one managed well-performing pool that is something cephfs could
probably deliver that almost any other solution can't (and the ones
that can cost WAY more than just one box running zfs on a couple of
RAIDs).
--
Rich
Frank Steinmetzger
2017-12-07 16:04:34 UTC
Permalink
Post by Rich Freeman
Post by Frank Steinmetzger
I see. I'm always looking for ways to optimise expenses and cut down on
environmental footprint by keeping stuff around until it really breaks. In
order to increase capacity, I would have to replace all four drives, whereas
with a mirror, two would be enough.
That is a good point. Though I would note that you can always replace
the raidz2 drives one at a time - you just get zero benefit until
they're all replaced. So, if your space use grows at a rate lower
than the typical hard drive turnover rate that is an option.
Post by Frank Steinmetzger
When I configured my kernel the other day, I discovered network block
devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I
can do zpool replace with the drive-to-be-replaced still in the pool, which
improves resilver read distribution and thus lessens the probability of a
failure cascade.
If you want to get into the network storage space I'd keep an eye on
cephfs.
No, I was merely talking about the use case of replacing drives on-the-fly
with the limited hardware available (all slots are occupied). It was not
about expanding my storage beyond what my NAS case can provide.

Resilvering is risky business, more so with big drives and especially once
they get older. That's why I was talking about adding the new drive
externally, which allows me to use all old drives during resilvering. Once
it is resilvered, I install it physically.
Post by Rich Freeman
[…] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
solutions. Maybe you can get by with less, but finding ARM systems with
even 4GB of RAM is tough, and even that means only one hard drive per
node, which means a lot of $40+ nodes to go on top of the cost of the
drives themselves.
No need to overshoot. It's a simple media archive and I'm happy with what I
have, apart from a few shortcomings of the case regarding quality and space.
My main goal was reliability, hence ZFS, ECC, and a Gold-rated PSU. They say
RAID is not a backup. For me it is -- in case of disk failure, which is my
main dread.

You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
have a tiny (probably one of the smallest available) M-ITX case for four
3.5″ bays and an internal 2.5″ mount:
https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100

Tata...
--
I cna ytpe 300 wrods pre mniuet!!!
Rich Freeman
2017-12-07 23:09:01 UTC
Permalink
Post by Frank Steinmetzger
Post by Rich Freeman
[…] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
solutions. Maybe you can get by with less, but finding ARM systems with
even 4GB of RAM is tough, and even that means only one hard drive per
node, which means a lot of $40+ nodes to go on top of the cost of the
drives themselves.
You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
have a tiny (probably one of the smallest available) M-ITX case for four
https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100
I don't think ECC is readily available on ARM (most of those boards
are SoCs where the RAM is integral and can't be expanded). If CephFS
were designed with end-to-end checksums that wouldn't really matter
much, because the client would detect any error in a storage node and
could obtain a good copy from another node and trigger a resilver.
However, I don't think Ceph is quite there, with checksums being used
at various points but I think there are gaps where no checksum is
protecting the data. That is one of the things I don't like about it.

If I were designing the checksums for it I'd probably have the client
compute the checksum and send it with the data, then at every step the
checksum is checked, and stored in the metadata on permanent storage.
Then when the ack goes back to the client that the data is written the
checksum would be returned to the client from the metadata, and the
client would do a comparison. Any retrieval would include the client
obtaining the checksum from the metadata and then comparing it to the
data from the storage nodes. I don't think this approach would really
add any extra overhead (the metadata needs to be recorded when writing
anyway, and read when reading anyway). It just ensures there is a
checksum on separate storage from the data and that it is the one
captured when the data was first written. A storage node could be
completely unreliable in this scenario as it exists apart from the
checksum being used to verify it. Storage nodes would still do their
own checksum verification anyway since that would allow errors to be
detected sooner and reduce latency, but this is not essential to
reliability.
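As a toy model of that write/read flow (every path and name here is
invented for illustration and has nothing to do with Ceph's real code;
the "storage node" is a scratch directory and the "metadata" is a
checksum file kept apart from the data):

```shell
#!/bin/sh
# Toy model of the proposed end-to-end checksum flow.
store=$(mktemp -d); meta=$(mktemp -d)
printf 'hello' > /tmp/payload.$$

# Client write: compute the checksum first, ship the data to storage,
# and record the checksum in the (separate) metadata.
sum=$(sha256sum < /tmp/payload.$$ | cut -d ' ' -f1)
cp /tmp/payload.$$ "$store/obj1"
printf '%s' "$sum" > "$meta/obj1.sum"

# Client read: fetch data from storage, checksum from metadata, compare.
got=$(sha256sum < "$store/obj1" | cut -d ' ' -f1)
[ "$got" = "$(cat "$meta/obj1.sum")" ] && echo "obj1 verified"

# Simulate silent corruption at rest on the storage node:
printf 'hellp' > "$store/obj1"
got=$(sha256sum < "$store/obj1" | cut -d ' ' -f1)
[ "$got" = "$(cat "$meta/obj1.sum")" ] || echo "obj1 corruption detected"

rm -rf "$store" "$meta" /tmp/payload.$$
```

The storage node can be completely untrusted here: any corruption, on
the wire or at rest, shows up as a mismatch against the metadata copy.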

Instead I think Ceph does not store checksums in the metadata. The
client checksum is used to verify accurate transfer over the network,
but then the various nodes forget about it, and record the data. If
the data is backed on ZFS/btrfs/bluestore then the filesystem would
compute its own checksum to detect silent corruption while at rest.
However, if the data were corrupted by faulty software or memory
failure after it was verified upon reception but before it was
re-checksummed prior to storage then you would have a problem. In
that case a scrub would detect non-matching data between nodes but
with no way to determine which node is correct.

If somebody with more knowledge of Ceph knows otherwise I'm all ears,
because this is one of those things that gives me a bit of pause.
Don't get me wrong - most other approaches have the same issues, but I
can reduce the risk of some of that with ECC, but that isn't practical
when you want many RAM-intensive storage nodes in the solution.
--
Rich
Wols Lists
2017-12-07 20:02:41 UTC
Permalink
Post by Frank Steinmetzger
When I configured my kernel the other day, I discovered network block
devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I
can do zpool replace with the drive-to-be-replaced still in the pool, which
improves resilver read distribution and thus lessens the probability of a
failure cascade.
Or with mdadm, there's "mdadm --replace". If you want to swap a drive
(rather than replace a failed drive), this both preserves redundancy and
reduces the stress on the array by doing disk-to-disk copy rather than
recalculating the new disk.
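As a command sketch (array and device names invented):

```shell
# Hot-replace /dev/sdd in /dev/md0 with a new /dev/sde. Redundancy is
# kept throughout, and mdadm copies from the outgoing disk where it can
# instead of recomputing everything from parity.
mdadm /dev/md0 --add /dev/sde                      # new disk joins as a spare
mdadm /dev/md0 --replace /dev/sdd --with /dev/sde  # copy sdd -> sde
# When the copy completes, sdd is marked faulty and can be pulled:
mdadm /dev/md0 --remove /dev/sdd
```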

Cheers,
Wol
Wols Lists
2017-12-07 18:35:16 UTC
Permalink
Post by Richard Bradfield
I did also investigate USB3 external enclosures, they're pretty
fast these days.
AARRGGHHHHH !!!

If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!!

I don't know the details, but I gather the problems are very similar to
the timeout problem, but much worse.

I know the wiki says you can "get away" with USB, but only for a broken
drive, and only when recovering *from* it.

Cheers,
Wol
Richard Bradfield
2017-12-07 20:17:55 UTC
Permalink
Post by Wols Lists
Post by Richard Bradfield
I did also investigate USB3 external enclosures, they're pretty
fast these days.
AARRGGHHHHH !!!
If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!!
I don't know the details, but I gather the problems are very similar to
the timeout problem, but much worse.
I know the wiki says you can "get away" with USB, but only for a broken
drive, and only when recovering *from* it.
Cheers,
Wol
I'm using ZFS on Linux, does that make you any less terrified? :)

I never ended up pursuing the USB enclosure, because disks got bigger
faster than I needed more storage, but I'd be interested in hearing if
there are real issues with trying to mount drive arrays over XHCI; given
the failure of eSATA to achieve wide adoption, it looked like a good
route for future expansion.
--
Richard
Wols Lists
2017-12-07 20:39:40 UTC
Permalink
Post by Richard Bradfield
Post by Wols Lists
Post by Richard Bradfield
I did also investigate USB3 external enclosures, they're pretty
fast these days.
AARRGGHHHHH !!!
If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!!
I don't know the details, but I gather the problems are very similar to
the timeout problem, but much worse.
I know the wiki says you can "get away" with USB, but only for a broken
drive, and only when recovering *from* it.
Cheers,
Wol
I'm using ZFS on Linux, does that make you any less terrified? :)
I never ended up pursuing the USB enclosure, because disks got bigger
faster than I needed more storage, but I'd be interested in hearing if
there are real issues with trying to mount drive arrays over XHCI; given
the failure of eSATA to achieve wide adoption, it looked like a good
route for future expansion.
Sorry, not a clue. I don't know zfs.

The problem with USB, as I understand it, is that USB itself times out.
If that happens, there is presumably a tear-down/setup delay, much like
the drive timeout problem, and that upsets mdadm.
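For directly attached drives, the usual timeout-problem check goes
roughly like this (the device name is a placeholder; desktop drives that
lack SCT ERC need the kernel-side timeout raised instead):

```shell
# Does the drive support SCT error recovery control?
smartctl -l scterc /dev/sdX
# If it does: cap error recovery at 7 seconds (units of 100 ms), so the
# drive reports the error before the kernel's default 30 s timeout fires:
smartctl -l scterc,70,70 /dev/sdX
# If it does not: raise the kernel timeout well above the drive's
# internal retry time instead:
echo 180 > /sys/block/sdX/device/timeout
```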

My personal experience is that the USB protocol also seems vulnerable to
crashing and losing drives.

In the --replace scenario, the fact that you are basically streaming
from the old drive to the new one seems not to trip over the problem,
but anything else is taking rather unnecessary risks ...

As for eSATA, I want to get hold of a JBOD enclosure, but I'll then need
to get a PCI card with an external port-multiplier ESATA capability. I
suspect one of the reasons it didn't take off was the multiplicity of
specifications, such that people probably bought add-ons that were
"unfit for purpose" because they didn't know what they were doing, or
the mobo suppliers cut corners so the on-board ports were unfit for
purpose, etc etc. So the whole thing sank with a bad rep it didn't
deserve. Certainly, when I've been looking, the situation is, shall we
say, confusing ...

Cheers,
Wol