Hello fellow Proxmox enjoyers!
I have questions regarding the ZFS disk IO stats and hope you all may be able to help me understand.
Setup (hardware, software)
I have Proxmox VE installed on a ZFS mirror (2x 500 GB M.2 PCIe SSD) rpool. The data (VMs, disks) resides on a separate ZFS RAID-Z1 (3x 4TB SATA SSD) data_raid.
I use ~2 TB of all that, 1.6 TB being data (movies, videos, music, old data + game setup files, ...).
I have 6 VMs, all for my use alone, so there's not much going on there.
Question 1 - constant disk writes going on?
I have a monitoring setup (CheckMK) to monitor my server and VMs. This monitoring reports constant write IO on the disks, ongoing without interruption, at 20+ MB/s.

I think the monitoring gets the data from zpool iostat, so I watched it with watch -n 1 'sudo zpool iostat', but the numbers didn't seem to change.
The read/write operations and bandwidth have been exactly the same for the last minute or so (after the time it took to write this, it now lists 543 read ops instead of 545).
Every 1.0s: sudo zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data_raid   2.29T  8.61T    545    350  17.2M  21.5M
rpool       4.16G   456G      0     54  8.69K  2.21M
----------  -----  -----  -----  -----  -----  -----
The same happens if I use the -lv or -w flags for zpool iostat.
So, are there really 350 write operations constantly going on? Or does zpool iostat just not update its stats very often?
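If I'm reading the zpool-iostat man page right, running it without an interval only shows averages accumulated since boot / pool import, which would explain why the numbers barely move under watch. Passing an interval should print fresh per-interval rates instead, something like:

# live per-second stats for the data pool; the first report is still the since-boot average
sudo zpool iostat -v data_raid 1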
Question 2 - what about disk longevity?
This isn't my first homelab setup, but it is my first ZFS and RAID setup of my own. If somebody has any SSD RAID or SSD ZFS experience to share, I'd like to hear it.
The disks I'm using are:
- 3x Samsung SSD 870 EVO 4TB for data_raid
- 2x Samsung SSD 980 500GB M.2 for rpool
Best regards from a fellow rabbit-hole-enjoyer.
I delved into exactly this when I was running Proxmox on consumer SSDs, since they were wearing out so fast.
Proxmox does a ton of logging, and a ton of small updates to places like /etc/pve and /var/lib/pve-cluster as part of cluster communication, and also to /var/lib/rrdcached for the web UI metrics dashboard, etc. All of these small writes go through huge amounts of write amplification via ZFS, so a small write to the filesystem ends up being quite a large write to the backing disk itself.
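If you want to see for yourself which processes are behind the writes, plain Linux tooling on the host is enough; something along these lines (nothing Proxmox-specific, package availability assumed):

# per-process disk write rates, sampled every 5 seconds (pidstat ships with the sysstat package)
pidstat -d 5
# or accumulated I/O per process, showing only processes that actually did I/O
iotop -aoP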
I found that vms running on the same zfs pool didn't have quite the degree of write amplification when their writes were cached - they would accumulate their small writes into one large one at intervals, and amplification on the larger dump would be smaller.
For a while I worked on identifying everywhere these small writes were happening and backing those directories with HDDs instead of SSDs, and on moving /var/log in each VM onto its own virtual disk on that same HDD-backed zpool, and my disk wearout issues mostly stopped.
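In Proxmox terms that was essentially: give each VM a small extra virtual disk on the HDD-backed storage and mount it at /var/log inside the guest. A rough sketch, with made-up VM IDs, storage and device names:

# host: add an 8 GB disk on the HDD-backed storage to VM 101 (ID and storage name are examples)
qm set 101 --scsi1 hdd-zfs:8
# guest: format the new disk and mount it over /var/log (device name is an example;
# copy the old contents across and restart logging services afterwards)
mkfs.ext4 /dev/sdb
mount /dev/sdb /var/log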
Eventually, though, I found some super cheap retired enterprise ssds on ebay, and moved everything back to the much simpler stock configuration. Back to high sustained ssd writes, but I'm 3 years in and still at only around 2% wearout. They should last until the heat death of the universe.
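For anyone who wants to watch the same wearout number outside the Proxmox UI, smartctl reports it directly (device names are just examples):

# SATA SSDs like the 870 EVO expose wear as a SMART attribute (Wear_Leveling_Count on Samsung drives)
smartctl -A /dev/sda | grep -i wear
# NVMe drives like the 980 report it as "Percentage Used"
smartctl -a /dev/nvme0 | grep -i 'percentage used'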
Is "Discard" the write caching you refer to?
Or are you talking about the actual Write Cache?
The actual write cache there: writeback accumulates writes before flushing them in larger chunks. It doesn't make a huge difference, nor did tweaking ZFS cache settings when I tried it a few years ago, but it can help if the guest is doing a constant stream of very small writes.
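Concretely, that's the per-disk cache option in Proxmox; switching an existing disk to writeback from the CLI looks roughly like this (VM ID, storage and volume names are placeholders):

# set the cache mode of VM 100's first SCSI disk to writeback (IDs and names are examples)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback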