
OpenCAS writeback caching for a home NAS with large media files and multi-tier (NVMe+SSD+HDD) storage #1487

Open
TheLinuxGuy opened this issue Jul 27, 2024 · 9 comments
Labels
question (Further information is requested)

Comments

@TheLinuxGuy

TheLinuxGuy commented Jul 27, 2024

Question

I want to make sure I'm using the correct settings to achieve my goal of always reading from and writing to the NVMe disks, and promoting data from the HDDs as soon as files are accessed.

Motivation

I'm comparing bcache and OpenCAS to see which fits my needs. I have some notes in one of my repositories here.

Background:

  • 40TB worth of large video files (Plex Media server library)
  • Three 20TB disks on /dev/md127 (mdadm RAID5)
  • 2x 1TB NVMe for cache
  • 2x 2TB SSD for a middle cache tier (not yet put to use in my initial OpenCAS tests)
  • The typical workload is someone streaming a video file for several hours. My assumption is that once a file is accessed and copied from the HDDs into the cache, the HDDs can go to sleep and the stream can be served entirely from NVMe (see the spin-down sketch after this list).
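
For the spin-down part of that assumption, something along these lines would be needed on the RAID member disks. This is only a hedged sketch; the hdparm values are arbitrary examples, not settings from my actual setup:

```bash
# Hypothetical spin-down settings for the md127 member disks (sdc-sdf in the
# lsblk output below); timeout values are arbitrary examples.
for disk in /dev/sd[cdef]; do
    hdparm -B 127 "$disk"   # APM level that still permits spin-down
    hdparm -S 120 "$disk"   # enter standby after 10 minutes idle (120 x 5 s)
done
```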

My goal is as follows:

  • The slow 20TB spinning hard drives should be in sleep/powersave mode as much as possible.
  • Ideally a btrfs or zfs filesystem would sit on top of /dev/cas1-1, for snapshot backups and btrfs-send/zfs-send capabilities (a sketch follows below).
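
As a rough sketch of that last point (not a tested recipe; the backup target path is a placeholder), btrfs on top of the exported CAS device would look something like:

```bash
# Hypothetical sketch: btrfs on the exported CAS device, plus a send/receive
# backup flow. /mnt/backup is a placeholder for a separate btrfs filesystem.
mkfs.btrfs -L media /dev/cas1-1
mkdir -p /mnt/btrfs
mount /dev/cas1-1 /mnt/btrfs

SNAP=/mnt/btrfs/.snap-$(date +%F)
btrfs subvolume snapshot -r /mnt/btrfs "$SNAP"   # read-only snapshot for backup
btrfs send "$SNAP" | btrfs receive /mnt/backup   # replicate to the backup filesystem
```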

Your Environment

  • OpenCAS version (commit hash or tag): 22.12.0.0855.master
  • Operating System: Debian bookworm 12 - Proxmox
  • Kernel version: 6.8.8-2-pve
  • Cache device type (NAND/Optane/other): NVME WD Red SN700 1000GB__1
  • Core device type (HDD/SSD/other): 20TB 7200RPM disks
  • Cache configuration:
    • Cache mode: wb
    • Cache line size: 4 [KiB]
    • Promotion policy: always
    • Cleaning policy: nop
    • Sequential cutoff policy: never
# casadm -G -n seq-cutoff  -i 1 -j 1
╔═════════════════════════════════════════════════════╤═══════╗
║ Parameter name                                      │ Value ║
╠═════════════════════════════════════════════════════╪═══════╣
║ Sequential cutoff threshold [KiB]                   │  1024 ║
║ Sequential cutoff policy                            │ never ║
║ Sequential cutoff promotion request count threshold │     8 ║
╚═════════════════════════════════════════════════════╧═══════╝
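
For completeness, here is a rough sketch of how this configuration could be (re)created with casadm. The flags are written from memory of the casadm help, so please treat them as assumptions and verify with `casadm -H` before running anything:

```bash
# Hedged sketch of the cache setup described above; verify flags for your
# casadm version before running.
casadm -S -d /dev/nvme1n1 -c wb -x 4        # start write-back cache with 4 KiB cache lines
casadm -A -i 1 -d /dev/md127                # add the RAID5 array as core 1 -> /dev/cas1-1
casadm -X -n cleaning -i 1 -p nop           # cleaning policy: nop (no background flushing)
casadm -X -n promotion -i 1 -p always       # promotion policy: always
casadm -X -n seq-cutoff -i 1 -j 1 -p never  # never bypass the cache for sequential I/O
```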
  • Other (e.g. lsblk, casadm -P, casadm -L)

lsblk

# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda            8:0    0  12.7T  0 disk
sdb            8:16   0  16.4T  0 disk
sdc            8:32   0   9.1T  0 disk
├─sdc1         8:33   0   7.3T  0 part
│ └─md127      9:127  0  21.8T  0 raid5
│   └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
└─sdc2         8:34   0   1.8T  0 part
  └─md126      9:126  0   1.8T  0 raid1
sdd            8:48   0   9.1T  0 disk
├─sdd1         8:49   0   7.3T  0 part
│ └─md127      9:127  0  21.8T  0 raid5
│   └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
└─sdd2         8:50   0   1.8T  0 part
  └─md126      9:126  0   1.8T  0 raid1
sde            8:64   0   7.3T  0 disk
└─sde1         8:65   0   7.3T  0 part
  └─md127      9:127  0  21.8T  0 raid5
    └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
sdf            8:80   0   7.3T  0 disk
└─sdf1         8:81   0   7.3T  0 part
  └─md127      9:127  0  21.8T  0 raid5
    └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
zd0          230:0    0    32G  0 disk
├─zd0p1      230:1    0    31G  0 part
├─zd0p2      230:2    0     1K  0 part
└─zd0p5      230:5    0   975M  0 part
zd16         230:16   0    32G  0 disk
zd32         230:32   0    32G  0 disk
├─zd32p1     230:33   0    31G  0 part
├─zd32p2     230:34   0     1K  0 part
└─zd32p5     230:37   0   975M  0 part
nvme1n1      259:0    0 931.5G  0 disk
nvme0n1      259:1    0 931.5G  0 disk
nvme2n1      259:2    0 119.2G  0 disk
├─nvme2n1p1  259:3    0  1007K  0 part
├─nvme2n1p2  259:4    0     1G  0 part
└─nvme2n1p3  259:5    0   118G  0 part

casadm -P

# casadm -P -i 1
Cache Id                  1
Cache Size                241643190 [4KiB Blocks] / 921.80 [GiB]
Cache Device              /dev/nvme1n1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           nop
Promotion Policy          always
Cache line size           4 [KiB]
Metadata Memory Footprint 10.6 [GiB]
Dirty for                 695 [s] / 11 [m] 35 [s]
Status                    Running

╔══════════════════╤═══════════╤═══════╤═════════════╗
║ Usage statistics │   Count   │   %   │   Units     ║
╠══════════════════╪═══════════╪═══════╪═════════════╣
║ Occupancy        │      5356 │   0.0 │ 4KiB Blocks ║
║ Free             │ 241637834 │ 100.0 │ 4KiB Blocks ║
║ Clean            │         2 │   0.0 │ 4KiB Blocks ║
║ Dirty            │      5354 │   0.0 │ 4KiB Blocks ║
╚══════════════════╧═══════════╧═══════╧═════════════╝

╔══════════════════════╤══════════╤═══════╤══════════╗
║ Request statistics   │  Count   │   %   │ Units    ║
╠══════════════════════╪══════════╪═══════╪══════════╣
║ Read hits            │ 29676031 │  67.7 │ Requests ║
║ Read partial misses  │        0 │   0.0 │ Requests ║
║ Read full misses     │       40 │   0.0 │ Requests ║
║ Read total           │ 29676071 │  67.7 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Write hits           │ 10558462 │  24.1 │ Requests ║
║ Write partial misses │        0 │   0.0 │ Requests ║
║ Write full misses    │  3628633 │   8.3 │ Requests ║
║ Write total          │ 14187095 │  32.3 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Pass-Through reads   │        0 │   0.0 │ Requests ║
║ Pass-Through writes  │        0 │   0.0 │ Requests ║
║ Serviced requests    │ 43863166 │ 100.0 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Total requests       │ 43863166 │ 100.0 │ Requests ║
╚══════════════════════╧══════════╧═══════╧══════════╝

╔══════════════════════════════════╤═══════════╤═══════╤═════════════╗
║ Block statistics                 │   Count   │   %   │   Units     ║
╠══════════════════════════════════╪═══════════╪═══════╪═════════════╣
║ Reads from core(s)               │       264 │ 100.0 │ 4KiB Blocks ║
║ Writes to core(s)                │         0 │   0.0 │ 4KiB Blocks ║
║ Total to/from core(s)            │       264 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼───────────┼───────┼─────────────╢
║ Reads from cache                 │  74297900 │  63.3 │ 4KiB Blocks ║
║ Writes to cache                  │  43010161 │  36.7 │ 4KiB Blocks ║
║ Total to/from cache              │ 117308061 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼───────────┼───────┼─────────────╢
║ Reads from exported object(s)    │  74298164 │  63.3 │ 4KiB Blocks ║
║ Writes to exported object(s)     │  43009897 │  36.7 │ 4KiB Blocks ║
║ Total to/from exported object(s) │ 117308061 │ 100.0 │ 4KiB Blocks ║
╚══════════════════════════════════╧═══════════╧═══════╧═════════════╝

╔════════════════════╤═══════╤═════╤══════════╗
║ Error statistics   │ Count │  %  │ Units    ║
╠════════════════════╪═══════╪═════╪══════════╣
║ Cache read errors  │     0 │ 0.0 │ Requests ║
║ Cache write errors │     0 │ 0.0 │ Requests ║
║ Cache total errors │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Core read errors   │     0 │ 0.0 │ Requests ║
║ Core write errors  │     0 │ 0.0 │ Requests ║
║ Core total errors  │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Total errors       │     0 │ 0.0 │ Requests ║
╚════════════════════╧═══════╧═════╧══════════╝

casadm -L

# casadm -L
type    id   disk           status    write policy   device
cache   1    /dev/nvme1n1   Running   wb             -
└core   1    /dev/md127     Active    -              /dev/cas1-1
TheLinuxGuy added the question label on Jul 27, 2024
@TheLinuxGuy
Author

Also, an important question: it seems that btrfs is not in the supported filesystems list: https://open-cas.com/guide_system_requirements.html

Btrfs seems to be working okay with OpenCAS on my test bench... is the OpenCAS team testing btrfs, or planning to support btrfs or other advanced filesystems like zfs? Ext4 did give me better benchmarks, but I would rather use at least btrfs... this is not an issue with bcache.

@robertbaldyga
Member

@TheLinuxGuy Technically Open CAS should be able to handle any filesystem, as it conforms to the standard Linux bdev interface, so btrfs and zfs almost certainly work just fine. What "supported" means in our case is that we actually test the listed filesystems. Open CAS has quite an extensive set of functional tests which we execute for each release. I'm not sure how much the bcache developers test it with various filesystems (I was not able to find this information), but extending our test set is certainly possible. So far we have not considered adding tests for other filesystems because no one asked for them.

We'll try to evaluate how much it would cost to include btrfs and zfs in our testing scope. For context, a full execution of the Open CAS functional tests currently takes about a week (day and night), so the cost is not negligible. We value the stability of the project, and as much as we'd like to support every single configuration and scenario, we first need to make sure that whatever we decide to support, we can do it with excellent quality over a long period of time.

@TheLinuxGuy
Author

> A full execution of the Open CAS functional tests currently takes about a week (day and night), so the cost is not negligible. We value the stability of the project, and as much as we'd like to support every single configuration and scenario, we first need to make sure that whatever we decide to support, we can do it with excellent quality over a long period of time.

Understood, thank you for the detailed explanation and for the consideration.

XFS and ext4 are reliable and great, but snapshotting, checksumming, and btrfs-send/zfs-send are modern filesystem features that I feel are in demand. IIRC Facebook uses btrfs across their production fleet: https://facebookmicrosites.github.io/btrfs/docs/btrfs-facebook.html

@dkmn-123

I second that zfs support would be great...

@bubundas17

For ZFS and btrfs, wouldn't it be great if somebody just merged a proper SSD cache into those projects?

I think btrfs support in Open CAS would be more useful for now, because most of the time btrfs is used on a single block device for its snapshots and other quality-of-life features. For ZFS it's different.

@TheLinuxGuy
Author

TheLinuxGuy commented Oct 18, 2024

> For ZFS and btrfs, wouldn't it be great if somebody just merged a proper SSD cache into those projects?

Fun fact: I submitted the tiered-storage feature request for btrfs a few years ago: kdave/btrfs-progs#610

Open CAS officially supporting btrfs is another way of trying to get something reliable, rather than a Frankenstein experiment by a single guy hoping for the best and hoping not to hit some data corruption issue down the line. I know I can throw btrfs on top of OpenCAS today, but since it isn't officially supported, you are on your own.

@bubundas17

> > For ZFS and btrfs, wouldn't it be great if somebody just merged a proper SSD cache into those projects?
>
> Fun fact: I submitted the tiered-storage feature request for btrfs a few years ago: kdave/btrfs-progs#610
>
> Open CAS officially supporting btrfs is another way of trying to get something reliable, rather than a Frankenstein experiment by a single guy hoping for the best and hoping not to hit some data corruption issue down the line. I know I can throw btrfs on top of OpenCAS today, but since it isn't officially supported, you are on your own.

Ahh, it's just a feature request which might never see the light of day.

I am also trying to find a good solution for storage tiering myself. I guess, like me, you are frustrated by the current storage-tiering solutions in Linux and are looking for something that just works and is well tested in production.

@TheLinuxGuy
Author

> I guess, like me, you are frustrated by the current storage-tiering solutions in Linux and are looking for something that just works and is well tested in production.

Exactly right.

The closest solution I have found is Windows ReFS with Storage Spaces on Windows Server 2025. Linux (open source) doesn't really have a good answer right now; you can try hacking away at LVM + bcache + zfs, which is the closest and fastest thing to what I want that doesn't eat 10 GB of RAM: https://github.com/TheLinuxGuy/ugreen-nas/blob/main/experiments-bench/mdadm-lvm2-bcache-zfs.md#setup-notes

But that is a Frankenstein of a setup: unsupported, and data reliability is highly questionable when you bundle that many layers together.
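
Roughly, that stack looks like the following. This is only a sketch with placeholder device names; the linked notes have the exact commands:

```bash
# Rough sketch of the mdadm + LVM + bcache + zfs stack; device names are
# placeholders, see the linked notes for the real setup.
pvcreate /dev/md127
vgcreate slowvg /dev/md127
lvcreate -n slowlv -l 100%FREE slowvg

make-bcache -B /dev/slowvg/slowlv -C /dev/nvme0n1      # HDD-backed LV fronted by an NVMe cache
echo writeback > /sys/block/bcache0/bcache/cache_mode  # switch bcache to write-back

zpool create tank /dev/bcache0                         # single-vdev pool on the cached device
```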

@bubundas17

> > I guess, like me, you are frustrated by the current storage-tiering solutions in Linux and are looking for something that just works and is well tested in production.
>
> Exactly right.
>
> The closest solution I have found is Windows ReFS with Storage Spaces on Windows Server 2025. Linux (open source) doesn't really have a good answer right now; you can try hacking away at LVM + bcache + zfs, which is the closest and fastest thing to what I want that doesn't eat 10 GB of RAM: https://github.com/TheLinuxGuy/ugreen-nas/blob/main/experiments-bench/mdadm-lvm2-bcache-zfs.md#setup-notes
>
> But that is a Frankenstein of a setup: unsupported, and data reliability is highly questionable when you bundle that many layers together.

Wow, you ran a lot of experiments.
Did you try just running ZFS with an L2ARC, SLOG, and special vdev, and benchmarking that?
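
To be concrete, I mean roughly this layout; all device names here are hypothetical placeholders, not real hardware from this thread:

```bash
# Hypothetical pool layout for comparison benchmarks; all device paths are
# placeholders.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc   # HDD data vdev
zpool add tank cache /dev/nvme0n1                     # L2ARC read cache
zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1   # SLOG for synchronous writes
zpool add tank special mirror /dev/sdd /dev/sde       # metadata / small-block special vdev

zfs set special_small_blocks=64K tank                 # optionally steer small blocks to the special vdev
```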

And don't you think it's strange to see such a large performance difference between the LVM + bcache + zfs and LVM + bcache + btrfs results?

I think Open CAS should work well with btrfs.
Today I received a server with 12x 10TB drives. I will try some of your configurations.
