

@MawKKe
Last active April 29, 2024 21:19
dm-crypt + dm-integrity + dm-raid = awesome!
#!/usr/bin/env bash
#
# Author: Markus (MawKKe) ekkwam@gmail.com
# Date: 2018-03-19
#
#
# What?
#
# Linux dm-crypt + dm-integrity + dm-raid (RAID1)
#
# = Secure, redundant array with data integrity protection
#
# Why?
#
# You see, RAID1 is a dead-simple tool for disk redundancy,
# but it does NOT protect you from bit rot. RAID1 has no way
# to tell which drive has the correct data when rot occurs.
# This is a silent killer.
#
# With dm-integrity, you can now have error detection
# at the block level. However, it alone does not provide error correction,
# and it is pretty useless with just one disk (disks fail, shit happens).
#
# But if you use dm-integrity *below* RAID1, you get disk redundancy
# AND error checking AND error correction. Invalid data received from
# a drive causes a checksum error, which the RAID array notices and
# repairs using the correct data from the other drive.
#
# If you throw encryption into the mix, you'll have a secure,
# redundant array. Oh, and the data integrity can be protected with
# authenticated encryption, so no one can maliciously tamper with your data.
#
# How cool is that?
#
# Also: If you use RAID1 arrays as LVM physical volumes, the overall
# architecture is quite similar to ZFS! All with native Linux tools,
# and no hacky Solaris compatibility layers or licensing issues!
#
# (I guess you can use whatever RAID level you want, but RAID1 is the
# simplest and fastest to set up)
#
#
# Let's try it out!
#
# ---
# NOTE: The dm-integrity target is available since Linux kernel version 4.12.
# NOTE: This example requires LUKS2, which was only recently released (2018-03)
# NOTE: The authenticated encryption is still experimental (2018-03)
# ---
set -eux
# 1) Make dummy disks
cd /tmp
truncate -s 500M disk1.img
truncate -s 500M disk2.img
# Format the disk with luksFormat:
dd if=/dev/urandom of=key.bin bs=512 count=1
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk1.img key.bin
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk2.img key.bin
# The luksFormat commands might take a while, since --integrity causes the disks to be wiped.
# dm-integrity is usually configured with 'integritysetup' (see below), but as
# it happens, cryptsetup can do all the integrity configuration automatically if
# the --integrity flag is specified.
# Open/attach the encrypted disks
cryptsetup luksOpen disk1.img disk1luks --key-file key.bin
cryptsetup luksOpen disk2.img disk2luks --key-file key.bin
# Create raid1:
mdadm \
--create \
--verbose --level 1 \
--metadata=1.2 \
--raid-devices=2 \
/dev/md/mdtest \
/dev/mapper/disk1luks \
/dev/mapper/disk2luks
# Create a filesystem, add to LVM volume group, etc...
mkfs.ext4 /dev/md/mdtest
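# Alternatively, use the array as an LVM physical volume instead of formatting
# it directly (volume group / logical volume names below are just examples):
#
# $ pvcreate /dev/md/mdtest
# $ vgcreate vgtest /dev/md/mdtest
# $ lvcreate -L 100M -n lvtest vgtest
# $ mkfs.ext4 /dev/vgtest/lvtest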
# Cool! Now you can 'scrub' the raid setup, which verifies
# the contents of each drive. Ordinarily detecting an error would
# be problematic, but since we are now using dm-integrity, the raid1
# *knows* which one has the correct data, and is able to fix it automatically.
#
# To scrub the array:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
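#
# (You can watch the check's progress with:
#
# $ cat /proc/mdstat
# )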
#
# $ dmesg | tail -n 30
#
# You should see
#
# [957578.661711] md: data-check of RAID array md127
# [957586.932826] md: md127: data-check done.
#
#
# Let's simulate disk corruption:
#
# $ dd if=/dev/urandom of=disk2.img seek=30000 count=30 bs=1k conv=notrunc
#
# (this writes 30kB of random data into disk2.img)
#
#
# Run scrub again:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
#
# $ dmesg | tail -n 30
#
# Now you should see
# ...
# [959146.618086] md: data-check of RAID array md127
# [959146.962543] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39784
# [959146.963086] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39840
# [959154.932650] md: md127: data-check done.
#
# But now if you run scrub yet again:
# ...
# [959212.329473] md: data-check of RAID array md127
# [959220.566150] md: md127: data-check done.
#
# And since we didn't get any errors a second time, we can deduce that the invalid
# data was repaired automatically.
#
# Great! We are done.
#
# --------
#
# If you don't need encryption, then you can use 'integritysetup' instead of cryptsetup.
# It works in a similar fashion:
#
# $ integritysetup format --integrity sha256 disk1.img
# $ integritysetup format --integrity sha256 disk2.img
# $ integritysetup open --integrity sha256 disk1.img disk1int
# $ integritysetup open --integrity sha256 disk2.img disk2int
# $ mdadm --create ...
#
# ...and so on. You can still detect and repair disk errors, but you have no
# protection against malicious (cold-storage) tampering, and the data is readable by anybody.
#
# 2018-03 NOTE:
#
# if you override the default --integrity value (whatever it is) during formatting,
# then you must specify it again when opening, like in the example above. For some
# reason the algorithm is not autodetected. I guess there is no header written
# onto the disk like there is with LUKS?
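#
# (As far as I can tell, a small dm-integrity superblock IS written to disk;
# you can inspect it with
#
# $ integritysetup dump disk1.img
#
# if your integritysetup provides the 'dump' action. It records e.g. the tag
# size, but not the algorithm, which is why --integrity must be given again
# when opening.)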
#
# ----------
#
# Read more:
# https://fosdem.org/2018/schedule/event/cryptsetup/
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMIntegrity
# https://mirrors.edge.kernel.org/pub/linux/utils/cryptsetup/v2.0/v2.0.0-rc0-ReleaseNotes
@tomato42

tomato42 commented Sep 27, 2020

Unfortunately no, I don't know which exact versions have the fixes. I only know that they are in current version of RHEL-8. I've also verified that the behaviour is as expected: even gigabytes of read errors caused by checksum failures don't cause the dm-integrity volumes to be kicked from the array.

@Salamandar

@tomato42 @khimaros So the setup you're presenting here is RAID over LUKS+integrity.
If I understand properly, it's done that way so that RAID can detect disk failures and provide the correct data.

Is it possible to do LUKS over RAID over dm-integrity?
I'd prefer having a single encrypted partition to having multiple ones, unless you tell me there are good reasons for having multiple LUKS volumes below RAID.

@tomato42

tomato42 commented Oct 3, 2020

It's not us presenting the RAID over LUKS+integrity setup :)
In my article I'm describing RAID over dm-integrity. Yes, I do it to detect and correct disk failures (both vocal, when the disk returns read errors, and silent, when the disk just returns garbage instead of the data previously written).

Yes, it's possible to do LUKS over RAID over dm-integrity. If you want both encryption and protection against disk failures, I'd suggest doing it like this. Using LUKS below RAID has the unfortunate effect that you then have to encrypt the data multiple times, so you will get worse performance than with LUKS above RAID.

One reason to do RAID over LUKS with integrity is that it's much easier to set up (the only difference is the use of special options when formatting the LUKS volume; opening and using it is the same as with regular LUKS, so you can use most of the guides explaining setup and migration). As dm-integrity is much newer, setting it up is much more manual and thus complicated. I've recently written an article on how to do it in Fedora 31, RHEL 8, CentOS 8 and Archlinux: https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/
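
A rough sketch of that stack, reusing the commands from the gist above (device names and labels are just placeholders):

    integritysetup format --integrity sha256 /dev/sdX1
    integritysetup format --integrity sha256 /dev/sdY1
    integritysetup open --integrity sha256 /dev/sdX1 int1
    integritysetup open --integrity sha256 /dev/sdY1 int2
    mdadm --create /dev/md/secure --level 1 --raid-devices=2 /dev/mapper/int1 /dev/mapper/int2
    cryptsetup luksFormat --type luks2 /dev/md/secure
    cryptsetup luksOpen /dev/md/secure securedata
    mkfs.ext4 /dev/mapper/securedata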

@Salamandar

@tomato42 Yes, I didn't even think about the fact that RAID over LUKS needs to encrypt the data multiple times. That's one more argument for LUKS over RAID instead.

One reason to do RAID over LUKS with integrity is that it's much easier to set up

Well… In my own head LUKS over RAID is easier to understand because RAID is at "hardware level" and LUKS is at "OS level". But dm-integrity is also at OS level so… :/

I've recently written an article on how to do it in Fedora 31, RHEL 8, CentOS 8 and Archlinux: https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/

Thanks a lot. I'm keeping it, and if you want comments about the Debian implementation I can give you feedback.

@tomato42

tomato42 commented Oct 4, 2020

@Salamandar

Well… In my own head LUKS over RAID is easier to understand because RAID is at "hardware level" and LUKS is at "OS level". But dm-integrity is also at OS level so… :/

Well, if we're talking about Linux, there's no limit to shenanigans with block devices :)

You can have LVM on top of dm-crypt, on top of md-raid, on top of dm-integrity, on top of loop devices that use regular files, on an LVM...

And this is only about directly attached storage; with network-based devices it can get really crazy.

Thanks a lot. I'm keeping it, and if you want comments about the Debian implementation I can give you feedback.

Sure, feel free to add a comment about Debian-specific changes to the setup steps.

@Salamandar

Yeah, my daily job is about network-attached storage hardware, so… setups can be funny some days. Well, I just received my NAS today, so I'll start playing with your dm-integrity tutorial.

@khimaros

khimaros commented Oct 8, 2020

Unfortunately no, I don't know which exact versions have the fixes. I only know that they are in current version of RHEL-8. I've also verified that the behaviour is as expected: even gigabytes of read errors caused by checksum failures don't cause the dm-integrity volumes to be kicked from the array.

According to the Git tags on torvalds/linux@b76b471 the fix you referenced should be included in any kernel released after 5.4-rc1.

I re-ran my tests on Linux 5.8.10 and mdadm 4.1 (Debian Bullseye). The result was very positive: the raid6 array survived even 1 MB+ of random corruption on 2/4 disks, and a manual scrub identified and corrected the corruption.

My take-away from this is that dm-integrity + md is DANGEROUS for unpatched kernels <5.4-rc1, but seems to be quite reliable for kernels including torvalds/linux@b76b471

@khimaros

khimaros commented Oct 8, 2020

@tomato42 -- following up on this after running some tests where I intentionally corrupted data beyond what raid6 can tolerate (100 KB+ of randomized corruption on 4/4 disks in the array): in cases where md doesn't have enough parity information to recalculate the correct value, checksum failures continue in perpetuity, even following mdadm --action=repair.

One solution I've found is to run fsck.ext4 -c -y -f to add these to the bad block list at the filesystem layer, but I'm curious whether you're aware of any other solutions at either the md or dm level, such as recalculating the integrity journal? Are you aware of any way to identify corrupted files based on these kernel messages?

@tomato42

tomato42 commented Oct 8, 2020

If there is not enough redundancy left in the array to recover a sector, then there's only one way to fix it: write some valid data to that sector.

So the way I'd do it is to use something like dd if=/dev/md0 of=/dev/null bs=4096 to find the first failing block, and then dd if=/dev/zero of=/dev/md0 bs=4096 seek=<number of valid blocks> count=1 to overwrite it.

Mapping a file to the bad block is rather hard, and it depends on the file system used; but then again, you can do tar cf /dev/null /file/system/on/md and tar will complain about the read errors...

(Yeah, it's a rather brute-force approach, but it will definitely work.)
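
Spelled out with the same device and block size as above:

    # find the first failing block; dd stops at the first read error and
    # reports how many full blocks it managed to copy
    dd if=/dev/md0 of=/dev/null bs=4096
    # then overwrite just that block with valid (zeroed) data
    dd if=/dev/zero of=/dev/md0 bs=4096 seek=<number of valid blocks> count=1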

@flower1024

If you are looking for performance, it is better to keep the checksum data on another device.
You can't do that with cryptsetup, but you can with integritysetup.

I have them on an SSD.

4x integritysetup -> mdraid -> cryptsetup -> ext4
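
For example, something like this (device names are made up; see the integritysetup man page for the --data-device option):

    # integrity metadata goes on the SSD partition, data stays on the HDD
    integritysetup format /dev/nvme0n1p1 --data-device /dev/sda1 --integrity sha256
    integritysetup open /dev/nvme0n1p1 sda1int --data-device /dev/sda1 --integrity sha256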

@ggeorgovassilis

ggeorgovassilis commented Aug 26, 2021

Thanks for the writeup! There seems to be a significant performance issue when dm-integrity is added to a RAID 6 + dm-crypt stack.

I tried out the following setup: the baseline is a RAID6 with 4 rotational HDDs + dm-raid + dm-crypt, which I converted one disk at a time to dm-integrity + dm-raid + dm-crypt. The first disk resynced in about 12 hours (a full disk resync usually takes 10 hours); the second disk resynced at about 10 MB/s, so I stopped the process after a few hours and reverted to my original setup. CPU load was at no point particularly high.

I wrote about the issue here: https://blog.georgovassilis.com/2021/05/02/dm-raid-and-dm-integrity-performance-issue/

@khimaros

khimaros commented Oct 3, 2021

@tomato42 -- My more recent tests with Linux 5.10 seem to consistently repair corrupted data on raid6 by simply running a scrub.

However, I can't say the same for raid1/raid10. In my experiments, it correctly returns EILSEQ and avoids reading the corrupted data, but md does not try to repair it. Do you know of a way to force this?

@tomato42

tomato42 commented Oct 3, 2021

@khimaros including when you run --action=repair?
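
(For reference, that maps to mdadm --action=repair /dev/md127 with a reasonably recent mdadm, or the sysfs equivalent used earlier in the gist: echo repair > /sys/block/md127/md/sync_action.)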

@khimaros

khimaros commented Oct 6, 2021

@tomato42 -- It looks like my case just required a few scrubs to repair all of the corrupt data.

@Salamandar -- There are some resilience considerations around whether you run dm-crypt above or below the raid layer. If you run it above the raid layer, you get automatic replication of your crypt headers, whereas if a header is corrupted below the raid layer, you need to manually restore it (or re-image the entire disk).

@ggeorgovassilis -- I've been experimenting with a similar stack (dm-integrity as a separate layer on the bottom). What made you choose dm-raid over md for your use case?

@ggeorgovassilis

@khimaros I wasn't aware that those are different things - I learned something. What I meant, I realise now, was md.

@khimaros

khimaros commented Oct 6, 2021

Note: if you are using dm-crypt + integrity on Debian for your root filesystem, you may need to ensure the dm_integrity module is in your initrd. One way to do this is: echo dm_integrity >> /etc/initramfs-tools/modules && update-initramfs -u -k all
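
To double-check that the module actually ended up in the image (the path depends on your kernel version):

    lsinitramfs /boot/initrd.img-$(uname -r) | grep dm-integrity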

@MrM40

MrM40 commented Oct 10, 2021

Regarding dm-integrity, mdadm and LVM (keeping LUKS/crypto out): any recommendations on enabling and configuring this using the LVM tools versus doing it Linux-native, as you all do above? I guess using the LVM tools you can be sure it's done "correctly"; going Linux-native I'd have to read this thread 4 times thoroughly :-P
And do the LVM tools add any further performance/stability enhancement compared to going Linux-native, or are they only for convenience?

@khimaros

khimaros commented Oct 10, 2021

A discovery while working on https://github.com/khimaros/raid-explorations: a separate dm-integrity layer below the md array (dm-integrity > md > dm-crypt > lvm > ext4) reduces sysbench (rndwr) performance by about 20% in a virtual machine, compared to combined dm-crypt/integrity (dm-crypt + integrity > md > lvm > ext4).
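
(For anyone wanting to reproduce a similar comparison, a generic sysbench random-write run looks roughly like this; these are not necessarily the exact parameters used above:

    sysbench fileio --file-test-mode=rndwr prepare
    sysbench fileio --file-test-mode=rndwr run
    sysbench fileio --file-test-mode=rndwr cleanup
)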

@tomato42

@MrM40 for dm-integrity to have any effect in protecting against silent data corruption, it must be used below the RAID level.
With md-raid that means the individual devices making up the md device must be dm-integrity backed; for LVM raid that means the individual PVs must be dm-integrity backed.
While technically you can put LVM on LVM, I don't think that it will be any easier or more straightforward than doing dm-integrity > md > lvm.
It's a complex setup, no matter how you slice or dice it.

@MrM40

MrM40 commented Oct 11, 2021

So if you have 5 disks in RAID5/6 you need 5 independent dm-integrity setups, with 5 independent checksum "files" and 5 independent concurrent checksum writes? I can see the last part being problematic performance-wise.
Since RHEL ditched Btrfs, is not too happy about ZFS, and therefore seems to bet 100% on LVM (Stratis) for their storage solution, one could hope it is optimized (if it is a bottleneck at all).
E.g. lvcreate --type <raid-level> --raidintegrity y -L <usable-size> -n <logical-volume> <volume-group>
So what I mean is, you can either do the setup yourself manually or do it all with lvcreate. Will LVM/lvcreate add any "optimization", or is it technically 100% the same as doing it yourself? Going the lvcreate route will probably give you less control, but also less chance to mess something up / make a configuration mistake.
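
For comparison, the lvcreate route spelled out with concrete (made-up) names would look something like:

    vgcreate vg_data /dev/sdb /dev/sdc
    lvcreate --type raid1 --raidintegrity y -L 100G -n lv_data vg_data
    mkfs.ext4 /dev/vg_data/lv_data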

@tomato42

So if you have 5 disks in RAID5/6 you need 5 independent dm-integrity setups, with 5 independent checksum "files" and 5 independent concurrent checksum writes? I can see the last part being problematic performance-wise.

Correct, and yes, it has a significant impact on performance, especially if the dm-integrity volume uses journaling.

Since RHEL ditched Btrfs, is not too happy about ZFS, and therefore seems to bet 100% on LVM (Stratis) for their storage solution, one could hope it is optimized (if it is a bottleneck at all). E.g. lvcreate --type <raid-level> --raidintegrity y -L <usable-size> -n <logical-volume> <volume-group> So what I mean is, you can either do the setup yourself manually or do it all with lvcreate. Will LVM/lvcreate add any "optimization", or is it technically 100% the same as doing it yourself? Going the lvcreate route will probably give you less control, but also less chance to mess something up / make a configuration mistake.

I don't think Stratis supports integrity just yet. I haven't inspected what raid + integrity at the lvcreate level does, or whether that combination works at all.

@MrM40

MrM40 commented Oct 11, 2021

The lvcreate command line was taken from RHEL's documentation, so it should work ;-)
It's my impression that Stratis is roughly just a stacking of current storage technologies wrapped in a new management package, nothing new under the hood. LVM is of course a key component, and since it supports mdadm and dm-integrity, Stratis does too. So says the spec for Stratis 2.0, at least.
It's been a good question for a long time whether you should use LVM or native mdadm to create your md arrays (if you run LVM anyway, of course). I have not been able to find a good answer on whether LVM actually adds any goodies other than a management layer.

@tallero

tallero commented Jun 24, 2022

@MawKKe about the note on the default integrity algorithm for cryptsetup:
https://gitlab.com/cryptsetup/cryptsetup/-/issues/754

@wq9578

wq9578 commented Mar 11, 2023

Are there any known disadvantages with encryption for ZFS shipped with FreeBSD?
Obviously encryption happens in a separate layer there.
Main thread: https://forums.raspberrypi.com/viewtopic.php?p=2089261#p2089261

@dfgshdsfh

Currently (as of July 2023) there are some unresolved bugs with ZFS's native encryption and edge cases involving send/recv that can potentially cause data corruption on both the home pool and a snapshot-receiving pool.

See the following for references:

While these seem to be reported mostly on Linux systems, that's likely simply because that's where most of the users are; I personally wouldn't expect FreeBSD to be free from these issues.

Basically if you don't send/recv an encrypted dataset and instead use something like rsync, then you should be fine.

@tkonyves

tkonyves commented Nov 3, 2023

I have read through the thread. Great discussion, and a lot of useful info!

@khimaros , @tomato42

Does anyone know whether disabling journaling for dm-integrity (--integrity-no-journal) is safe with a dm-integrity > md-raid > filesystem kind of setup?

I’m trying to research this, but there is hardly any info. My logic is this: If journaling is off, and there is a corruption in the dm-integrity layer, e.g. due to a power-cut, then this would be presented to the md-raid layer as a normal unrecoverable read error. Then, md-raid would correct the error the same way as in any other case. The fact that the error was a result of disabled journaling is irrelevant.

The only problem I can see with this setup is if the integrity error is not discovered in time, and is only found when the md-raid layer no longer has a copy (RAID-1) or parity (RAID-5/6) for the lost data. E.g. there is a power cut -> a faulty block is written in the dm-integrity layer -> some time later a disk dies -> we try to repair the array, but the faulty block causes a problem.

But thinking about it further: if we do a RAID scrub after every power cut, then is it safe to disable journaling? Is my logic right?

Apart from a power-cut or yanking a disk out mid-operation, are there any situations where dm-integrity would write faulty blocks silently, and where this wouldn't happen with journaling enabled?

@tomato42

tomato42 commented Nov 3, 2023

Given that it's likely both copies will be written at once, there's a high chance that both copies will be mismatched with the checksums. So, no, it's definitely not safe in general.

@csdvrx

csdvrx commented Nov 15, 2023

This is a nice example of getting ZFS-like error detection with regular filesystems like ext4 or XFS, but what about alignment issues and write amplification? Do we have to correct the mdadm or filesystem chunk/stride calculations for alignment, or is that now taken care of by default?

The linked article https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/ suggests:

  • avoiding read-modify-write by passing --chunk 2048 to mdadm
  • having a /sys/block/md0/md/stripe_cache_size of 4096
  • using stripe unit and stripe width when doing mkfs, to reflect the hardware reality, as the mdadm device wrongly looks like a 4Kn device (sketched below)
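
For reference, the article's suggestions spelled out could look roughly like this (device names and the 4-disk RAID6 layout are just an example; with a 2048 KiB chunk and 2 data disks, the ext4 numbers work out to stride = 2048/4 = 512 and stripe-width = 512 * 2 = 1024):

    mdadm --create /dev/md0 --level=6 --raid-devices=4 --chunk 2048 \
        /dev/mapper/disk1int /dev/mapper/disk2int /dev/mapper/disk3int /dev/mapper/disk4int
    echo 4096 > /sys/block/md0/md/stripe_cache_size
    mkfs.ext4 -E stride=512,stripe-width=1024 /dev/md0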

@tomato42

Alignment issues are no worse than with any RAID: for a file system it looks like regular MD RAID.
There is journaling at the integrity level, so write amplification is on the horrible side. It may be better if you have a device with a configurable sector size (like 520 instead of 512 bytes).

@muay-throwaway

If this is for the root drive, one potential advantage of doing LUKS over mdadm RAID1 instead of the reverse is that LUKS over a degraded RAID will boot normally since this patch, whereas a degraded RAID over LUKS array may fail the boot process and drop into an initramfs shell due to a missing encrypted root disk. That may require typing exit to continue, although this may not be a big deal to some.
