Ceph cluster, SATA HDDs out, SATA SSDs in
When I built my Ceph cluster in March 2020, I used old SATA HDDs I still had on the shelf. It was always clear that at some point these old consumables would die and need replacing. That time has come, and I went with 3x 2 TB SATA SSD OSDs per node.
Summary
Previously each of the four nodes had a small SATA SSD (500 GB for WAL and DB) plus 3 HDDs. After the change, each of the four nodes has three 2TB SATA SSDs, one OSD per SSD, no separate WAL/DB device.
While performance is better and I now run on 12 new storage devices (as opposed to 12 old ones), this is of course still not comparable to enterprise-class hardware. But for the homelab it will do for now.
Ordered devices
quantity | name |
---|---|
2 | Patriot SSD P210 2.5 SATA 2TB |
2 | Intenso SSD 3812470 SATA3 2TB |
2 | Silicon Power SSD Ace A55 2TB |
2 | PNY CS900 2.5 SATA3 2TB |
2 | Crucial BX500 SSD 2.5 2TB |
2 | Samsung SSD 870 QVO 2TB SATA 2.5 |
Yes, I am fully aware that price-wise this is the very bottom of the barrel. That is why I chose 6 different vendors: SATA SSDs are a consumable to me and they will break eventually, so this mix should help spread the risk of multiple SSDs dying in close succession.
OSDs with HDD out, only SSD in
My old OSDs were 3x HDD (1 or 2 TB) per node, plus 1x SSD (500 GB) shared as WAL/DB device by those 3 OSDs. These were all removed and replaced with 3x 2 TB SATA SSD OSDs. The one small (500 GB) SATA SSD per node, previously used for WAL/DB for the 3 HDD OSDs, was recycled in my Proxmox VE playground.
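For reference, this is roughly what such an HDD-plus-shared-DB layout looks like as a cephadm drive spec these days (a hypothetical sketch, not what my old cluster was deployed with; the service_id is made up):

service_type: osd
service_id: osd_spec_hdd_with_ssd_db
placement:
  host_pattern: 'f5-422-*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0

With such a spec, the DB (and, unless configured otherwise, the WAL) for the rotational data devices lands on the non-rotational db_devices.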
First node
Node number 4 was evacuated of OSDs earlier (after 2 of the 3 HDDs in it died), so I plugged 3 SSDs in and now have the following;
bay | model | note |
---|---|---|
1 | Crucial CT120BX500SSD1 | 120 GB SSD with operating system |
2 | Patriot SSD P210 2.5 SATA 2TB | osd.0 |
3 | Silicon Power SSD Ace A55 2TB | osd.1 |
4 | Crucial BX500 SSD 2.5 2TB | osd.5 |
5 | empty | |
Since my nodes only have 12 GiB RAM, I decided on 3 OSDs per node with a 3 GiB memory target per OSD. Far from ideal, but the nodes cannot take more memory (if the 4 GiB they come with were not soldered, I could have 16 GiB).
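Such a memory target can be set cluster-wide with ceph config; a minimal sketch (3 GiB expressed in bytes; note that cephadm can also autotune this via osd_memory_target_autotune):

# cap each OSD daemon at roughly 3 GiB of RAM
ceph config set osd osd_memory_target 3221225472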
Used drive spec
I wrote myself the following drive spec, wanting all free SSDs on my F5-422 nodes to be used and encrypted.
service_type: osd
service_id: osd_spec_F5-422_SSDs
placement:
  host_pattern: 'f5-422-*'
encrypted: true
data_devices:
  rotational: 0
Test the drive spec
Simply as per the RHCS 5 docs;
[root@f5-422-01 ~]# cephadm shell --mount f5-422-SSD-as-OSD-spec.yml:/var/lib/ceph/osd/f5-422-SSD-as-OSD-spec.yml
[root@f5-422-01 ~]# vim f5-422-SSD-as-OSD-spec.yml
[root@f5-422-01 ~]# cephadm shell --mount f5-422-SSD-as-OSD-spec.yml:/var/lib/ceph/osd/f5-422-SSD-as-OSD-spec.yml
Inferring fsid 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
Using recent ceph image registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a42c490ba7aa8732ebc53a90ce33c4cb9cf8e556395cc9598f8808e0b719ebe7
[ceph: root@f5-422-01 /]# cd /var/lib/ceph/osd/
[ceph: root@f5-422-01 osd]# cat f5-422-SSD-as-OSD-spec.yml
service_type: osd
service_id: osd_spec_F5-422_SSDs
placement:
  host_pattern: 'f5-422-*'
encrypted: true
data_devices:
  rotational: 0
[ceph: root@f5-422-01 osd]# ceph orch apply -i f5-422-SSD-as-OSD-spec.yml --dry-run
WARNING! Dry-Runs are snapshots of a certain point in time and are bound
to the current inventory setup. If any of these conditions change, the
preview will be invalid. Please make sure to have a minimal
timeframe between planning and applying the specs.
####################
SERVICESPEC PREVIEWS
####################
+---------+------+--------+-------------+
|SERVICE |NAME |ADD_TO |REMOVE_FROM |
+---------+------+--------+-------------+
+---------+------+--------+-------------+
################
OSDSPEC PREVIEWS
################
Preview data is being generated.. Please re-run this command in a bit.
I waited a bit, as instructed, before checking again;
[ceph: root@f5-422-01 osd]# ceph orch apply -i f5-422-SSD-as-OSD-spec.yml --dry-run
WARNING! Dry-Runs are snapshots of a certain point in time and are bound
to the current inventory setup. If any of these conditions change, the
preview will be invalid. Please make sure to have a minimal
timeframe between planning and applying the specs.
####################
SERVICESPEC PREVIEWS
####################
+---------+------+--------+-------------+
|SERVICE |NAME |ADD_TO |REMOVE_FROM |
+---------+------+--------+-------------+
+---------+------+--------+-------------+
################
OSDSPEC PREVIEWS
################
+---------+----------------------+-----------------------------+----------+----+-----+
|SERVICE |NAME |HOST |DATA |DB |WAL |
+---------+----------------------+-----------------------------+----------+----+-----+
|osd |osd_spec_F5-422_SSDs |f5-422-04.internal.pcfe.net |/dev/sdb |- |- |
|osd |osd_spec_F5-422_SSDs |f5-422-04.internal.pcfe.net |/dev/sdd |- |- |
|osd |osd_spec_F5-422_SSDs |f5-422-04.internal.pcfe.net |/dev/sde |- |- |
+---------+----------------------+-----------------------------+----------+----+-----+
Use the drive spec
As it picked up the drives I was expecting, I applied the specification;
[ceph: root@f5-422-01 osd]# ceph orch apply -i f5-422-SSD-as-OSD-spec.yml
Scheduled osd.osd_spec_F5-422_SSDs update...
A short while later, as expected, cephadm deployed the 3 OSDs.
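To double check what the orchestrator rolled out, something like this should do (a sketch using positional arguments):

# list the OSD service and the daemons placed on the freshly equipped node
ceph orch ls osd
ceph orch ps f5-422-04.internal.pcfe.net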
Remaining nodes
It’s really just a case of repeating the same procedure;
- remove the OSDs of a node via ceph orch
- wait for completion
- physically replace the drives
- wait for automatic creation of OSDs on the new, empty drives, as per the drive spec
- wait for all Placement Groups to be active+clean
As I put in a drive spec that applies to all four nodes, the only thing to watch out for was to not wipe the small SSD before physically removing it, or to tell Ceph to temporarily not apply the spec (see the sketch below). This is expected behaviour, see the upstream docs for details.
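Telling Ceph to temporarily not apply the spec boils down to re-applying it with unmanaged set; a sketch of the same spec, paused:

service_type: osd
service_id: osd_spec_F5-422_SSDs
placement:
  host_pattern: 'f5-422-*'
unmanaged: true
encrypted: true
data_devices:
  rotational: 0

Re-apply with ceph orch apply -i as before, and drop the unmanaged line (or set it to false) once the disk shuffling is done.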
Since rinse, wash, repeat procedures are boring, that text is collapsed. But you can expand it if you care about the details.
Details of the remaining nodes' drive swaps (click to expand).
Evacuating node 3
Turns out the SSDs arrived just in time; this morning one more HDD died: OSD 4 on node 3. So I removed it;
[ceph: root@f5-422-01 osd]# ceph orch osd rm 4
Scheduled OSD(s) for removal
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm status
Fri Mar 17 15:53:27 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
4 f5-422-03.internal.pcfe.net done, waiting for purge 0 False False False
Might as well evacuate the remaining 2 OSDs on that node;
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm 8 15 --zap
Fri Mar 17 15:54:48 UTC 2023
Scheduled OSD(s) for removal
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm status
Fri Mar 17 15:54:53 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
4 f5-422-03.internal.pcfe.net done, waiting for purge 0 False False False
8 f5-422-03.internal.pcfe.net draining 91 False False True 2023-03-17 15:54:50.285459
15 f5-422-03.internal.pcfe.net draining 186 False False True 2023-03-17 15:54:51.171398
Waiting for clean PGs
Now all that is left to do is wait for clean placement groups (PGs) and then rinse, wash, repeat.
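Instead of re-running the status command by hand, a trivial poll loop works too (a sketch, matching the "No OSD remove/replace operations reported" output seen further down):

# poll every 5 minutes until the orchestrator reports no pending removals
while ! ceph orch osd rm status | grep -q 'No OSD remove/replace operations reported'; do
    sleep 300
done
ceph -s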
Just under an hour in
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm status
Fri Mar 17 16:44:15 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
4 f5-422-03.internal.pcfe.net done, waiting for purge 0 False False False
8 f5-422-03.internal.pcfe.net draining 65 False False True 2023-03-17 15:54:50.285459
15 f5-422-03.internal.pcfe.net draining 140 False False True 2023-03-17 15:54:51.171398
[ceph: root@f5-422-01 osd]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 13.73222 root default
-9 4.09348 host f5-422-01
2 hdd 1.97089 osd.2 up 1.00000 1.00000
6 hdd 1.06129 osd.6 up 1.00000 1.00000
10 hdd 1.06129 osd.10 up 1.00000 1.00000
-7 3.03218 host f5-422-02
3 hdd 1.97089 osd.3 up 1.00000 1.00000
11 hdd 1.06129 osd.11 up 1.00000 1.00000
-5 1.06129 host f5-422-03
4 hdd 1.06129 osd.4 down 0 1.00000
8 hdd 0 osd.8 up 1.00000 1.00000
15 hdd 0 osd.15 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
[ceph: root@f5-422-01 osd]# ceph -s
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_WARN
1 failed cephadm daemon(s)
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 2h)
mgr: f5-422-02(active, since 7d), standbys: f5-422-01, f5-422-03
mds: 1/1 daemons up, 1 standby
osd: 11 osds: 10 up (since 65m), 10 in (since 67m); 204 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 736 GiB
usage: 3.4 TiB used, 12 TiB / 16 TiB avail
pgs: 1591310/4303725 objects misplaced (36.975%)
192 active+remapped+backfill_wait
125 active+clean
12 active+remapped+backfilling
io:
client: 103 KiB/s rd, 2.7 MiB/s wr, 0 op/s rd, 103 op/s wr
recovery: 48 MiB/s, 35 objects/s
About 2.5 hours in
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm status
Fri Mar 17 18:14:35 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
4 f5-422-03.internal.pcfe.net done, waiting for purge 0 False False False
8 f5-422-03.internal.pcfe.net draining 40 False False True 2023-03-17 15:54:50.285459
15 f5-422-03.internal.pcfe.net draining 76 False False True 2023-03-17 15:54:51.171398
[ceph: root@f5-422-01 osd]# date ; ceph -s
Fri Mar 17 18:14:44 UTC 2023
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_WARN
1 failed cephadm daemon(s)
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 4h)
mgr: f5-422-02(active, since 7d), standbys: f5-422-01, f5-422-03
mds: 1/1 daemons up, 1 standby
osd: 11 osds: 10 up (since 2h), 10 in (since 2h); 116 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 736 GiB
usage: 3.4 TiB used, 12 TiB / 16 TiB avail
pgs: 1050758/4303731 objects misplaced (24.415%)
213 active+clean
107 active+remapped+backfill_wait
9 active+remapped+backfilling
io:
client: 914 KiB/s wr, 0 op/s rd, 20 op/s wr
recovery: 31 MiB/s, 86 objects/s
Nearly done
[ceph: root@f5-422-01 osd]# date ; ceph orch osd rm status
Fri Mar 17 20:58:51 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
4 f5-422-03.internal.pcfe.net done, waiting for purge 0 False False False
8 f5-422-03.internal.pcfe.net draining 5 False False True 2023-03-17 15:54:50.285459
15 f5-422-03.internal.pcfe.net draining 11 False False True 2023-03-17 15:54:51.171398
[ceph: root@f5-422-01 osd]# date ; ceph -s
Fri Mar 17 20:58:55 UTC 2023
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_WARN
1 failed cephadm daemon(s)
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 6h)
mgr: f5-422-02(active, since 7d), standbys: f5-422-01, f5-422-03
mds: 1/1 daemons up, 1 standby
osd: 11 osds: 10 up (since 5h), 10 in (since 5h); 16 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 736 GiB
usage: 3.4 TiB used, 12 TiB / 16 TiB avail
pgs: 348391/4303734 objects misplaced (8.095%)
313 active+clean
8 active+remapped+backfilling
8 active+remapped+backfill_wait
io:
client: 85 KiB/s rd, 2.1 MiB/s wr, 8 op/s rd, 37 op/s wr
recovery: 17 MiB/s, 59 objects/s
3/17/23 11:10:20 PM[INF]Cluster is now healthy
3/17/23 11:10:20 PM[INF]Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
3/17/23 11:10:15 PM[INF]Detected new or changed devices on f5-422-03.internal.pcfe.net
3/17/23 11:10:01 PM[INF]Zapping devices for osd.15 on f5-422-03.internal.pcfe.net
3/17/23 11:10:01 PM[INF]Successfully purged osd.15 on f5-422-03.internal.pcfe.net
3/17/23 11:10:01 PM[INF]Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
3/17/23 11:10:01 PM[INF]Health check cleared: OSD_DOWN (was: 1 osds down)
3/17/23 11:10:01 PM[INF]Successfully removed osd.15 on f5-422-03.internal.pcfe.net
3/17/23 11:10:01 PM[INF]Removing key for osd.15
3/17/23 11:10:00 PM[INF]Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
3/17/23 11:09:55 PM[WRN]Health check failed: Reduced data availability: 3 pgs peering (PG_AVAILABILITY)
3/17/23 11:09:53 PM[INF]Removing daemon osd.15 from f5-422-03.internal.pcfe.net
3/17/23 11:09:53 PM[INF]osd.15 now down
3/17/23 11:09:53 PM[WRN]Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
3/17/23 11:09:53 PM[WRN]Health check failed: 1 osds down (OSD_DOWN)
3/17/23 11:09:52 PM[INF]Successfully purged osd.4 on f5-422-03.internal.pcfe.net
3/17/23 11:09:52 PM[INF]Successfully removed osd.4 on f5-422-03.internal.pcfe.net
3/17/23 11:09:52 PM[INF]Removing key for osd.4
3/17/23 11:09:50 PM[INF]Removing daemon osd.4 from f5-422-03.internal.pcfe.net
3/17/23 11:09:50 PM[INF]osd.4 now down
3/17/23 11:00:59 PM[INF]Detected new or changed devices on f5-422-03.internal.pcfe.net
3/17/23 11:00:49 PM[INF]Successfully zapped devices for osd.8 on f5-422-03.internal.pcfe.net
3/17/23 11:00:44 PM[INF]Zapping devices for osd.8 on f5-422-03.internal.pcfe.net
3/17/23 11:00:44 PM[INF]Successfully purged osd.8 on f5-422-03.internal.pcfe.net
3/17/23 11:00:44 PM[INF]Health check cleared: OSD_DOWN (was: 1 osds down)
3/17/23 11:00:44 PM[INF]Successfully removed osd.8 on f5-422-03.internal.pcfe.net
3/17/23 11:00:44 PM[INF]Removing key for osd.8
3/17/23 11:00:38 PM[WRN]Monitor daemon marked osd.8 down, but it is still running
3/17/23 11:00:38 PM[INF]osd.8 marked itself dead as of e42811
3/17/23 11:00:37 PM[INF]Removing daemon osd.8 from f5-422-03.internal.pcfe.net
3/17/23 11:00:37 PM[INF]osd.8 now down
And, as expected because of the above drive spec, once the next "Detected new or changed devices on f5-422-03.internal.pcfe.net" happened, 3 new OSDs were rolled out automatically.
Evacuating node 2
[ceph: root@f5-422-01 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 18.17250 root default
-9 4.09348 host f5-422-01
2 hdd 1.97089 osd.2 up 1.00000 1.00000
6 hdd 1.06129 osd.6 up 1.00000 1.00000
10 hdd 1.06129 osd.10 up 1.00000 1.00000
-7 3.03218 host f5-422-02
3 hdd 1.97089 osd.3 up 1.00000 1.00000
11 hdd 1.06129 osd.11 up 1.00000 1.00000
-5 5.50157 host f5-422-03
4 ssd 1.86299 osd.4 up 1.00000 1.00000
7 ssd 1.81929 osd.7 up 1.00000 1.00000
8 ssd 1.81929 osd.8 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
[ceph: root@f5-422-01 /]# ceph orch osd rm 3 11
Scheduled OSD(s) for removal
[ceph: root@f5-422-01 /]# date ; ceph orch osd rm status
Fri Mar 17 22:53:03 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
3 f5-422-02.internal.pcfe.net draining 181 False False False 2023-03-17 22:53:03.110470
11 f5-422-02.internal.pcfe.net draining 94 False False False 2023-03-17 22:53:04.100753
I went to bed; it was finished the next day.
[ceph: root@f5-422-01 /]# date ; ceph orch osd rm status
Sat Mar 18 11:19:06 UTC 2023
No OSD remove/replace operations reported
[ceph: root@f5-422-01 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 15.14032 root default
-9 4.09348 host f5-422-01
2 hdd 1.97089 osd.2 up 1.00000 1.00000
6 hdd 1.06129 osd.6 up 1.00000 1.00000
10 hdd 1.06129 osd.10 up 1.00000 1.00000
-7 0 host f5-422-02
-5 5.50157 host f5-422-03
4 ssd 1.86299 osd.4 up 1.00000 1.00000
7 ssd 1.81929 osd.7 up 1.00000 1.00000
8 ssd 1.81929 osd.8 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
[ceph: root@f5-422-01 /]# ceph -s
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_OK
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 2m)
mgr: f5-422-02(active, since 8d), standbys: f5-422-01, f5-422-03
mds: 1/1 daemons up, 1 standby
osd: 9 osds: 9 up (since 89s), 9 in (since 12h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 737 GiB
usage: 2.7 TiB used, 12 TiB / 15 TiB avail
pgs: 329 active+clean
io:
client: 400 KiB/s wr, 0 op/s rd, 12 op/s wr
I zapped the 500 GB SATA SSD that will be put in the Proxmox VE playground;
[ceph: root@f5-422-01 /]# ceph orch device ls f5-422-02.internal.pcfe.net
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
f5-422-02.internal.pcfe.net /dev/sdc ssd Samsung_SSD_860_[REDACTED] 500G 9m ago Insufficient space (<10 extents) on vgs, LVM detected, locked
f5-422-02.internal.pcfe.net /dev/sdd hdd ST2000VM003-1CT1_[REDACTED] 2000G 9m ago Insufficient space (<10 extents) on vgs, LVM detected, locked
f5-422-02.internal.pcfe.net /dev/sdf hdd WDC_WD10EFRX-68P_WD-[REDACTED] 1000G 9m ago Insufficient space (<10 extents) on vgs, LVM detected, locked
[ceph: root@f5-422-01 /]# ceph orch device zap f5-422-02.internal.pcfe.net /dev/sdc
Error EINVAL: must pass --force to PERMANENTLY ERASE DEVICE DATA
[ceph: root@f5-422-01 /]# ceph orch device zap f5-422-02.internal.pcfe.net --force /dev/sdc
zap successful for /dev/sdc on f5-422-02.internal.pcfe.net
and then pulled that device. But, as I half expected, that was dumb, as the drive spec was still active, doh!
So I powered down node 2, swapped all the storage, and powered it back up. Ceph healed itself :-)
[ceph: root@f5-422-01 /]# date ; ceph osd tree
Sat Mar 18 11:56:12 UTC 2023
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 20.68559 root default
-9 4.09348 host f5-422-01
2 hdd 1.97089 osd.2 up 1.00000 1.00000
6 hdd 1.06129 osd.6 up 1.00000 1.00000
10 hdd 1.06129 osd.10 up 1.00000 1.00000
-7 5.54527 host f5-422-02
3 ssd 1.86299 osd.3 up 1.00000 1.00000
9 ssd 1.86299 osd.9 up 1.00000 1.00000
11 ssd 1.81929 osd.11 up 1.00000 1.00000
-5 5.50157 host f5-422-03
4 ssd 1.86299 osd.4 up 1.00000 1.00000
7 ssd 1.81929 osd.7 up 1.00000 1.00000
8 ssd 1.81929 osd.8 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
Last node
[ceph: root@f5-422-01 /]# date ; ceph osd tree
Sat Mar 18 11:56:12 UTC 2023
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 20.68559 root default
-9 4.09348 host f5-422-01
2 hdd 1.97089 osd.2 up 1.00000 1.00000
6 hdd 1.06129 osd.6 up 1.00000 1.00000
10 hdd 1.06129 osd.10 up 1.00000 1.00000
-7 5.54527 host f5-422-02
3 ssd 1.86299 osd.3 up 1.00000 1.00000
9 ssd 1.86299 osd.9 up 1.00000 1.00000
11 ssd 1.81929 osd.11 up 1.00000 1.00000
-5 5.50157 host f5-422-03
4 ssd 1.86299 osd.4 up 1.00000 1.00000
7 ssd 1.81929 osd.7 up 1.00000 1.00000
8 ssd 1.81929 osd.8 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
[ceph: root@f5-422-01 /]# ceph orch osd rm 2 6 10
Scheduled OSD(s) for removal
[ceph: root@f5-422-01 /]# date ; ceph orch osd rm status
Sat Mar 18 11:57:44 UTC 2023
OSD HOST STATE PGS REPLACE FORCE ZAP DRAIN STARTED AT
2 f5-422-01.internal.pcfe.net draining 142 False False False 2023-03-18 11:57:41.521775
6 f5-422-01.internal.pcfe.net draining 72 False False False 2023-03-18 11:57:42.516314
10 f5-422-01.internal.pcfe.net draining 60 False False False 2023-03-18 11:57:40.984641
It finished fine;
[ceph: root@f5-422-01 /]# date ; ceph orch osd rm status
Sat Mar 18 13:54:29 UTC 2023
No OSD remove/replace operations reported
[ceph: root@f5-422-01 /]# date ; ceph osd tree
Sat Mar 18 13:54:34 UTC 2023
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 16.59212 root default
-9 0 host f5-422-01
-7 5.54527 host f5-422-02
3 ssd 1.86299 osd.3 up 1.00000 1.00000
9 ssd 1.86299 osd.9 up 1.00000 1.00000
11 ssd 1.81929 osd.11 up 1.00000 1.00000
-5 5.50157 host f5-422-03
4 ssd 1.86299 osd.4 up 1.00000 1.00000
7 ssd 1.81929 osd.7 up 1.00000 1.00000
8 ssd 1.81929 osd.8 up 1.00000 1.00000
-3 5.54527 host f5-422-04
0 ssd 1.86299 osd.0 up 1.00000 1.00000
1 ssd 1.86299 osd.1 up 1.00000 1.00000
5 ssd 1.81929 osd.5 up 1.00000 1.00000
-11 0 host ts-473a-01
[ceph: root@f5-422-01 /]# ceph -s
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_OK
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 119s)
mgr: f5-422-01(active, since 2h), standbys: f5-422-03, f5-422-02
mds: 1/1 daemons up, 1 standby
osd: 9 osds: 9 up (since 67s), 9 in (since 2h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 737 GiB
usage: 2.2 TiB used, 14 TiB / 17 TiB avail
pgs: 329 active+clean
io:
client: 274 KiB/s wr, 0 op/s rd, 15 op/s wr
Then I shut down node 1;
[root@f5-422-01 ~]# systemctl poweroff
[root@f5-422-01 ~]# Connection to f5-422-01 closed by remote host.
Connection to f5-422-01 closed.
I swapped the storage and powered back up;
pcfe@t3600 ~ $ ssh -l root f5-422-01
check-mk-agent installed from monitoring server
kickstarted at Mon Jun 8 01:25:42 CEST 2020 for RHEL 8.2 on TerraMaster F5-422
Activate the web console with: systemctl enable --now cockpit.socket
Last login: Fri Mar 17 16:22:15 2023 from 192.168.50.59
lsblk
any tmux running?
error connecting to /tmp/tmux-0/default (No such file or directory)
[root@f5-422-01 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 111.8G 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 110.3G 0 part
├─VG_OS_SSD_bay1-LV_root 253:0 0 4G 0 lvm /
├─VG_OS_SSD_bay1-LV_swap 253:1 0 4G 0 lvm [SWAP]
├─VG_OS_SSD_bay1-LV_home 253:2 0 1G 0 lvm /home
├─VG_OS_SSD_bay1-LV_containers 253:3 0 8G 0 lvm /var/lib/containers
├─VG_OS_SSD_bay1-LV_var_crash 253:4 0 15G 0 lvm /var/crash
├─VG_OS_SSD_bay1-LV_var_log 253:5 0 5G 0 lvm /var/log
└─VG_OS_SSD_bay1-LV_var 253:6 0 8G 0 lvm /var
sdc 8:32 0 1.9T 0 disk
sdd 8:48 0 1.8T 0 disk
sde 8:64 0 1.8T 0 disk
[root@f5-422-01 ~]#
The orchestrator picked up the 3 new SSDs after a short while;
[ceph: root@f5-422-01 /]# ceph -W cephadm
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_OK
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 81s)
mgr: f5-422-03(active, since 18m), standbys: f5-422-02, f5-422-01
mds: 1/1 daemons up, 1 standby
osd: 9 osds: 9 up (since 21m), 9 in (since 2h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 737 GiB
usage: 2.2 TiB used, 14 TiB / 17 TiB avail
pgs: 329 active+clean
io:
client: 553 KiB/s wr, 0 op/s rd, 13 op/s wr
2023-03-18T14:15:52.013150+0000 mgr.f5-422-03 [INF] Detected new or changed devices on ceph-ansible.internal.pcfe.net
2023-03-18T14:20:20.567312+0000 mgr.f5-422-03 [INF] Deploying daemon osd.10 on f5-422-01.internal.pcfe.net
2023-03-18T14:20:31.839535+0000 mgr.f5-422-03 [INF] Deploying daemon osd.2 on f5-422-01.internal.pcfe.net
2023-03-18T14:20:44.269021+0000 mgr.f5-422-03 [INF] Deploying daemon osd.6 on f5-422-01.internal.pcfe.net
2023-03-18T14:21:20.815248+0000 mgr.f5-422-03 [INF] Detected new or changed devices on f5-422-01.internal.pcfe.net
2023-03-18T14:22:35.514871+0000 mgr.f5-422-03 [INF] Detected new or changed devices on f5-422-01.internal.pcfe.net
2023-03-18T14:23:50.739530+0000 mgr.f5-422-03 [INF] Detected new or changed devices on f5-422-01.internal.pcfe.net
Now it was just a matter of waiting for the final reshuffling of PGs to finish;
[ceph: root@f5-422-01 /]# date ; ceph -s
Sat Mar 18 14:48:26 UTC 2023
cluster:
id: 720b4911-bb6e-4774-88ae-dc2d3d3b9bfb
health: HEALTH_OK
services:
mon: 3 daemons, quorum f5-422-01,f5-422-02,f5-422-03 (age 34m)
mgr: f5-422-03(active, since 52m), standbys: f5-422-02, f5-422-01
mds: 1/1 daemons up, 1 standby
osd: 12 osds: 12 up (since 27m), 12 in (since 2h); 75 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 16 pools, 329 pgs
objects: 1.43M objects, 737 GiB
usage: 2.3 TiB used, 20 TiB / 22 TiB avail
pgs: 523749/4304157 objects misplaced (12.168%)
254 active+clean
63 active+remapped+backfill_wait
12 active+remapped+backfilling
io:
client: 21 KiB/s rd, 517 KiB/s wr, 21 op/s rd, 29 op/s wr
recovery: 267 MiB/s, 0 keys/s, 586 objects/s
All replacements completed
After replacing them all, every node now runs three of the 2 TB SATA SSDs listed at the top as OSDs.
Once all HDDs were replaced, it was of course time to run some benchmarks. Ideally ones that are at least somewhat comparable to my earlier tests. Note that these days the Ceph cluster also serves my SO's OCP4 cluster, whereas for the test in 2020 the cluster was brand new and not serving other workloads while I tested. All I want here is some rough numbers.
RADOS Benchmarks
This was straightforward; just the same rados bench runs as last time, executed on one of the Ceph nodes. All tests were once again run on the node f5-422-01.
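For completeness, the two test pools can be created like this (a sketch; 16 and 256 placement groups respectively, to match the runs below):

ceph osd pool create testbench 16 16
ceph osd pool create testbench2 256 256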
Sequential Write Test, pool with 16 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench 600 write --no-cleanup
[…]
Total time run: 600.186
Total writes made: 20749
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 138.284
Stddev Bandwidth: 32.5208
Max bandwidth (MB/sec): 232
Min bandwidth (MB/sec): 16
Average IOPS: 34
Stddev IOPS: 8.13019
Max IOPS: 58
Min IOPS: 4
Average Latency(s): 0.462764
Stddev Latency(s): 0.375508
Max latency(s): 4.27566
Min latency(s): 0.0361997
Sequential Read Test, pool with 16 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench 600 seq
[…]
Total time run: 449.178
Total reads made: 20749
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 184.773
Average IOPS: 46
Stddev IOPS: 5.82675
Max IOPS: 64
Min IOPS: 29
Average Latency(s): 0.342997
Max latency(s): 2.15112
Min latency(s): 0.0206885
Random Read Test, pool with 16 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench 600 rand
Total time run: 600.487
Total reads made: 28454
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 189.54
Average IOPS: 47
Stddev IOPS: 5.39549
Max IOPS: 68
Min IOPS: 34
Average Latency(s): 0.334573
Max latency(s): 2.01502
Min latency(s): 0.00787465
Sequential Write Test, pool with 256 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench2 600 write --no-cleanup
[…]
Total time run: 600.45
Total writes made: 18984
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 126.465
Stddev Bandwidth: 41.7122
Max bandwidth (MB/sec): 228
Min bandwidth (MB/sec): 0
Average IOPS: 31
Stddev IOPS: 10.4356
Max IOPS: 57
Min IOPS: 0
Average Latency(s): 0.505888
Stddev Latency(s): 0.581471
Max latency(s): 4.40465
Min latency(s): 0.0374363
Sequential Read Test, pool with 256 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench2 600 seq
[…]
Total time run: 522.082
Total reads made: 18984
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 145.448
Average IOPS: 36
Stddev IOPS: 3.67769
Max IOPS: 52
Min IOPS: 26
Average Latency(s): 0.436814
Max latency(s): 5.47685
Min latency(s): 0.0212489
Random Read Test, pool with 256 PGs
[ceph: root@f5-422-01 /]# rados bench -p testbench2 600 rand
[…]
Total time run: 600.5
Total reads made: 21323
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 142.035
Average IOPS: 35
Stddev IOPS: 3.7057
Max IOPS: 45
Min IOPS: 25
Average Latency(s): 0.447278
Max latency(s): 2.15062
Min latency(s): 0.00771143
VM on Proxmox VE using RBD benchmarks
This is not fully comparable to earlier tests, as this time I used a different hypervisor (one with a better network connection): the host running the VM has a 10 Gigabit uplink, while the 2020 test was on a libvirt hypervisor that only has a 1 Gigabit uplink.
On the PVE side, the RBD-backed disk was of course defined with cache=none, as I want to measure my Ceph, not PVE's caching abilities.
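On the PVE command line that boils down to something like this (a sketch with a hypothetical VM ID, storage name and disk size):

# attach a 100 GiB disk from the RBD-backed storage to VM 129, with host caching disabled
qm set 129 --virtio1 ceph-rbd:100,cache=none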
VM side of taking the RBD image in use (click to expand)
[root@guest29 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 32G 0 disk
├─sda1 8:1 0 600M 0 part /boot/efi
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 30.4G 0 part
├─rhel_guest29-root 253:0 0 27.2G 0 lvm /
└─rhel_guest29-swap 253:1 0 3.2G 0 lvm [SWAP]
sr0 11:0 1 11.3G 0 rom
vda 252:0 0 100G 0 disk
[root@guest29 ~]# mkfs.xfs /dev/vda
meta-data=/dev/vda isize=512 agcount=4, agsize=6553600 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=0 inobtcount=0
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=12800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
[root@guest29 ~]# mount /dev/vdb /mnt/
mount: /mnt: special device /dev/vdb does not exist.
[root@guest29 ~]# mount /dev/vda /mnt/
[root@guest29 ~]# df -h /mnt/
Filesystem Size Used Avail Use% Mounted on
/dev/vda 100G 746M 100G 1% /mnt
[root@guest29 ~]# uname -a
Linux guest29.internal.pcfe.net 4.18.0-425.13.1.el8_7.x86_64 #1 SMP Thu Feb 2 13:01:45 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@guest29 ~]# dnf check-upgrade
Updating Subscription Management repositories.
Last metadata expiration check: 0:06:44 ago on Sun 19 Mar 2023 05:07:57 PM CET.
[root@guest29 ~]# mkdir /mnt/fiotest
[root@guest29 ~]# free -h
total used free shared buff/cache available
Mem: 3.8Gi 184Mi 3.2Gi 12Mi 444Mi 3.4Gi
Swap: 3.2Gi 0B 3.2Gi
Sequential Write, 4k Blocksize
WRITE: bw=140MiB/s
iops: avg=46693.22
fio --name=rbd-test --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest29 ~]# fio --name=rbd-test --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4k
rbd-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
rbd-test: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=67.9MiB/s][w=17.4k IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1347: Sat Mar 25 21:13:50 2023
write: IOPS=35.9k, BW=140MiB/s (147MB/s)(16.0GiB/116787msec); 0 zone resets
clat (nsec): min=1578, max=27871M, avg=27445.51, stdev=13625842.17
lat (nsec): min=1629, max=27871M, avg=27506.21, stdev=13625842.18
clat percentiles (nsec):
| 1.00th=[ 1672], 5.00th=[ 1848], 10.00th=[ 2256],
| 20.00th=[ 2416], 30.00th=[ 2480], 40.00th=[ 2512],
| 50.00th=[ 2576], 60.00th=[ 2608], 70.00th=[ 2704],
| 80.00th=[ 2960], 90.00th=[ 3536], 95.00th=[ 4192],
| 99.00th=[ 10176], 99.50th=[ 13888], 99.90th=[ 8028160],
| 99.95th=[12386304], 99.99th=[25559040]
bw ( KiB/s): min= 32, max=790767, per=100.00%, avg=186773.34, stdev=145600.91, samples=178
iops : min= 8, max=197691, avg=46693.22, stdev=36400.24, samples=178
lat (usec) : 2=7.95%, 4=86.40%, 10=4.64%, 20=0.76%, 50=0.10%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.05%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%, >=2000=0.01%
cpu : usr=3.21%, sys=9.02%, ctx=5485, majf=0, minf=17
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4194304,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=16.0GiB (17.2GB), run=116787-116787msec
Disk stats (read/write):
vda: ios=0/15895, merge=0/5, ticks=0/14068216, in_queue=14068216, util=98.50%
Sequential Write, 4m Blocksize
WRITE: bw=181MiB/s
iops: avg=46.05
fio --name=rbd-test --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest29 ~]# fio --name=rbd-test --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4m
rbd-test: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=184MiB/s][w=45 IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1412: Sat Mar 25 21:18:02 2023
write: IOPS=45, BW=181MiB/s (190MB/s)(16.0GiB/90592msec); 0 zone resets
clat (usec): min=1471, max=2129.4k, avg=21938.67, stdev=46484.99
lat (usec): min=1526, max=2129.6k, avg=22003.56, stdev=46491.34
clat percentiles (usec):
| 1.00th=[ 1549], 5.00th=[ 1614], 10.00th=[ 1663],
| 20.00th=[ 1827], 30.00th=[ 2114], 40.00th=[ 12125],
| 50.00th=[ 16188], 60.00th=[ 18482], 70.00th=[ 26608],
| 80.00th=[ 34341], 90.00th=[ 47973], 95.00th=[ 63177],
| 99.00th=[ 130548], 99.50th=[ 198181], 99.90th=[ 425722],
| 99.95th=[ 675283], 99.99th=[2122318]
bw ( KiB/s): min= 8175, max=630784, per=100.00%, avg=189632.49, stdev=133240.64, samples=175
iops : min= 1, max= 154, avg=46.05, stdev=32.59, samples=175
lat (msec) : 2=26.37%, 4=11.94%, 10=0.68%, 20=24.37%, 50=27.61%
lat (msec) : 100=7.76%, 250=0.98%, 500=0.22%, 750=0.02%, 1000=0.02%
lat (msec) : >=2000=0.02%
cpu : usr=0.33%, sys=9.03%, ctx=6223, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s), io=16.0GiB (17.2GB), run=90592-90592msec
Disk stats (read/write):
vda: ios=0/16111, merge=0/0, ticks=0/11828596, in_queue=11828596, util=97.95%
Sequential Read, 4k Blocksize
READ: bw=196MiB/s
iops: avg=50392.08
fio --name=rbd-test --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest29 ~]# fio --name=rbd-test --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4k
rbd-test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=222MiB/s][r=56.7k IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1416: Sat Mar 25 21:20:42 2023
read: IOPS=50.2k, BW=196MiB/s (206MB/s)(16.0GiB/83520msec)
clat (nsec): min=807, max=233877k, avg=19463.80, stdev=885938.42
lat (nsec): min=850, max=233877k, avg=19525.01, stdev=885938.82
clat percentiles (nsec):
| 1.00th=[ 908], 5.00th=[ 948], 10.00th=[ 972],
| 20.00th=[ 996], 30.00th=[ 1020], 40.00th=[ 1032],
| 50.00th=[ 1048], 60.00th=[ 1080], 70.00th=[ 1096],
| 80.00th=[ 1128], 90.00th=[ 1192], 95.00th=[ 1272],
| 99.00th=[ 2928], 99.50th=[ 7584], 99.90th=[ 33024],
| 99.95th=[11206656], 99.99th=[44302336]
bw ( KiB/s): min=36978, max=274432, per=100.00%, avg=201568.42, stdev=39879.74, samples=166
iops : min= 9244, max=68608, avg=50392.08, stdev=9969.93, samples=166
lat (nsec) : 1000=21.06%
lat (usec) : 2=77.22%, 4=1.05%, 10=0.24%, 20=0.28%, 50=0.05%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.02%, 50=0.03%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=3.56%, sys=7.02%, ctx=3524, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4194304,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=16.0GiB (17.2GB), run=83520-83520msec
Disk stats (read/write):
vda: ios=16358/2, merge=0/0, ticks=973234/37, in_queue=973272, util=94.65%
Sequential Read, 4m Blocksize
READ: bw=199MiB/s
iops: avg=50.58
fio --name=rbd-test --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest29 ~]# fio --name=rbd-test --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4m
rbd-test: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=216MiB/s][r=53 IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1421: Sat Mar 25 21:23:55 2023
read: IOPS=49, BW=199MiB/s (209MB/s)(16.0GiB/82165msec)
clat (usec): min=443, max=845275, avg=19947.46, stdev=32040.72
lat (usec): min=443, max=845275, avg=19947.72, stdev=32040.68
clat percentiles (usec):
| 1.00th=[ 502], 5.00th=[ 529], 10.00th=[ 553], 20.00th=[ 603],
| 30.00th=[ 914], 40.00th=[ 1745], 50.00th=[ 8455], 60.00th=[ 17433],
| 70.00th=[ 26870], 80.00th=[ 36963], 90.00th=[ 53740], 95.00th=[ 69731],
| 99.00th=[ 96994], 99.50th=[112722], 99.90th=[287310], 99.95th=[633340],
| 99.99th=[843056]
bw ( KiB/s): min=24576, max=270336, per=100.00%, avg=207331.80, stdev=39224.53, samples=161
iops : min= 6, max= 66, avg=50.58, stdev= 9.56, samples=161
lat (usec) : 500=0.90%, 750=27.81%, 1000=1.98%
lat (msec) : 2=10.67%, 4=3.54%, 10=7.30%, 20=10.30%, 50=25.83%
lat (msec) : 100=10.91%, 250=0.63%, 500=0.02%, 750=0.07%, 1000=0.02%
cpu : usr=0.04%, sys=6.59%, ctx=3550, majf=0, minf=525
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=199MiB/s (209MB/s), 199MiB/s-199MiB/s (209MB/s-209MB/s), io=16.0GiB (17.2GB), run=82165-82165msec
Disk stats (read/write):
vda: ios=16374/0, merge=0/0, ticks=982509/0, in_queue=982509, util=97.45%
Sequential, 80% Read, 20% Write, 4k Blocksize
READ: bw=182MiB/s
iops: avg=46941.35
WRITE: bw=45.4MiB/s
iops: avg=11732.77
fio --name=rbd-test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest29 ~]# fio --name=rbd-test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
rbd-test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=178MiB/s,w=44.7MiB/s][r=45.5k,w=11.5k IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1424: Sat Mar 25 21:25:53 2023
read: IOPS=46.5k, BW=182MiB/s (190MB/s)(12.8GiB/72175msec)
clat (nsec): min=805, max=627098k, avg=20144.85, stdev=1047272.38
lat (nsec): min=843, max=627098k, avg=20187.87, stdev=1047272.48
clat percentiles (nsec):
| 1.00th=[ 932], 5.00th=[ 980], 10.00th=[ 1012],
| 20.00th=[ 1064], 30.00th=[ 1096], 40.00th=[ 1128],
| 50.00th=[ 1176], 60.00th=[ 1208], 70.00th=[ 1240],
| 80.00th=[ 1288], 90.00th=[ 1384], 95.00th=[ 1496],
| 99.00th=[ 3376], 99.50th=[ 10048], 99.90th=[ 36608],
| 99.95th=[10682368], 99.99th=[45350912]
bw ( KiB/s): min=36978, max=258048, per=100.00%, avg=187765.51, stdev=40535.60, samples=142
iops : min= 9244, max=64512, avg=46941.35, stdev=10133.91, samples=142
write: IOPS=11.6k, BW=45.4MiB/s (47.6MB/s)(3277MiB/72175msec); 0 zone resets
clat (nsec): min=1300, max=2313.3k, avg=3144.75, stdev=14867.24
lat (nsec): min=1344, max=2313.7k, avg=3202.44, stdev=14873.30
clat percentiles (nsec):
| 1.00th=[ 1816], 5.00th=[ 1960], 10.00th=[ 2040], 20.00th=[ 2256],
| 30.00th=[ 2384], 40.00th=[ 2480], 50.00th=[ 2544], 60.00th=[ 2640],
| 70.00th=[ 2768], 80.00th=[ 2928], 90.00th=[ 3600], 95.00th=[ 4320],
| 99.00th=[ 12480], 99.50th=[ 15424], 99.90th=[ 38656], 99.95th=[ 58112],
| 99.99th=[847872]
bw ( KiB/s): min= 9708, max=63168, per=100.00%, avg=46931.39, stdev=10065.63, samples=142
iops : min= 2427, max=15792, avg=11732.77, stdev=2516.38, samples=142
lat (nsec) : 1000=6.06%
lat (usec) : 2=73.57%, 4=18.37%, 10=1.33%, 20=0.45%, 50=0.13%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
cpu : usr=4.27%, sys=10.50%, ctx=2664, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3355445,838859,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=182MiB/s (190MB/s), 182MiB/s-182MiB/s (190MB/s-190MB/s), io=12.8GiB (13.7GB), run=72175-72175msec
WRITE: bw=45.4MiB/s (47.6MB/s), 45.4MiB/s-45.4MiB/s (47.6MB/s-47.6MB/s), io=3277MiB (3436MB), run=72175-72175msec
Disk stats (read/write):
vda: ios=13108/3037, merge=0/1, ticks=844664/637742, in_queue=1482406, util=90.78%
Sequential, 80% Read, 20% Write, 4m Blocksize
READ: bw=174MiB/s
iops: avg=44.48
WRITE: bw=45.4MiB/s
iops: avg=11.60
fio --name=rbd-test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest29 ~]# fio --name=rbd-test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
rbd-test: (g=0): rw=rw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=200MiB/s,w=39.0MiB/s][r=49,w=9 IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1433: Sat Mar 25 21:29:16 2023
read: IOPS=43, BW=174MiB/s (182MB/s)(12.7GiB/74761msec)
clat (usec): min=444, max=1021.3k, avg=22443.71, stdev=44040.41
lat (usec): min=445, max=1021.3k, avg=22443.95, stdev=44040.37
clat percentiles (usec):
| 1.00th=[ 502], 5.00th=[ 537], 10.00th=[ 562],
| 20.00th=[ 611], 30.00th=[ 685], 40.00th=[ 1614],
| 50.00th=[ 5604], 60.00th=[ 15139], 70.00th=[ 27919],
| 80.00th=[ 41157], 90.00th=[ 61604], 95.00th=[ 78119],
| 99.00th=[ 135267], 99.50th=[ 181404], 99.90th=[ 750781],
| 99.95th=[ 851444], 99.99th=[1019216]
bw ( KiB/s): min=16384, max=262144, per=100.00%, avg=182256.38, stdev=47410.09, samples=145
iops : min= 4, max= 64, avg=44.48, stdev=11.58, samples=145
write: IOPS=11, BW=45.4MiB/s (47.6MB/s)(3392MiB/74761msec); 0 zone resets
clat (usec): min=1076, max=9656, avg=1668.65, stdev=468.88
lat (usec): min=1130, max=9936, avg=1726.70, stdev=476.32
clat percentiles (usec):
| 1.00th=[ 1123], 5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1270],
| 30.00th=[ 1598], 40.00th=[ 1663], 50.00th=[ 1696], 60.00th=[ 1729],
| 70.00th=[ 1745], 80.00th=[ 1844], 90.00th=[ 2024], 95.00th=[ 2147],
| 99.00th=[ 2311], 99.50th=[ 2606], 99.90th=[ 9634], 99.95th=[ 9634],
| 99.99th=[ 9634]
bw ( KiB/s): min= 8192, max=139264, per=100.00%, avg=47614.06, stdev=26724.67, samples=144
iops : min= 2, max= 34, avg=11.60, stdev= 6.53, samples=144
lat (usec) : 500=0.81%, 750=24.34%, 1000=0.66%
lat (msec) : 2=27.10%, 4=5.76%, 10=5.52%, 20=7.01%, 50=17.36%
lat (msec) : 100=9.57%, 250=1.66%, 500=0.10%, 750=0.02%, 1000=0.07%
lat (msec) : 2000=0.02%
cpu : usr=0.11%, sys=7.57%, ctx=2730, majf=0, minf=16
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3248,848,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=174MiB/s (182MB/s), 174MiB/s-174MiB/s (182MB/s-182MB/s), io=12.7GiB (13.6GB), run=74761-74761msec
WRITE: bw=45.4MiB/s (47.6MB/s), 45.4MiB/s-45.4MiB/s (47.6MB/s-47.6MB/s), io=3392MiB (3557MB), run=74761-74761msec
Disk stats (read/write):
vda: ios=12971/3192, merge=0/0, ticks=869566/579994, in_queue=1449561, util=95.40%
Random, 80% Read, 20% Write, 4k Blocksize
READ: bw=1324KiB/s
iops: avg=340.60
WRITE: bw=331KiB/s
iops: avg=85.84
While I expected this to be not good™, the results are worse than I would have guessed. But then all my use cases involve some caching, so the numbers here are not too worrisome; my real use should never look like this synthetic test.
fio --name=rbd-test --rw=randrw --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest29 ~]# fio --name=rbd-test --rw=randrw --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
rbd-test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=1730KiB/s,w=443KiB/s][r=432,w=110 IOPS][eta 00m:00s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=2340: Sun Mar 26 14:27:04 2023
read: IOPS=331, BW=1324KiB/s (1356kB/s)(12.8GiB/10136307msec)
clat (usec): min=697, max=30885k, avg=3009.66, stdev=28079.13
lat (usec): min=698, max=30885k, avg=3010.01, stdev=28079.13
clat percentiles (usec):
| 1.00th=[ 1270], 5.00th=[ 1418], 10.00th=[ 1500], 20.00th=[ 1631],
| 30.00th=[ 1745], 40.00th=[ 1909], 50.00th=[ 2114], 60.00th=[ 2311],
| 70.00th=[ 2540], 80.00th=[ 2704], 90.00th=[ 2966], 95.00th=[ 5276],
| 99.00th=[ 20841], 99.50th=[ 31851], 99.90th=[ 68682], 99.95th=[183501],
| 99.99th=[599786]
bw ( KiB/s): min= 8, max= 2656, per=100.00%, avg=1362.46, stdev=555.01, samples=19700
iops : min= 2, max= 664, avg=340.60, stdev=138.76, samples=19700
write: IOPS=82, BW=331KiB/s (339kB/s)(3277MiB/10136307msec); 0 zone resets
clat (usec): min=2, max=5118, avg=19.35, stdev=23.17
lat (usec): min=2, max=5118, avg=19.79, stdev=23.29
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 11],
| 30.00th=[ 15], 40.00th=[ 17], 50.00th=[ 19], 60.00th=[ 20],
| 70.00th=[ 21], 80.00th=[ 29], 90.00th=[ 34], 95.00th=[ 37],
| 99.00th=[ 52], 99.50th=[ 61], 99.90th=[ 79], 99.95th=[ 116],
| 99.99th=[ 1172]
bw ( KiB/s): min= 7, max= 894, per=100.00%, avg=343.46, stdev=147.67, samples=19536
iops : min= 1, max= 223, avg=85.84, stdev=36.92, samples=19536
lat (usec) : 4=0.72%, 10=3.12%, 20=9.46%, 50=6.48%, 100=0.21%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=35.88%, 4=39.33%, 10=3.05%, 20=0.89%, 50=0.73%
lat (msec) : 100=0.07%, 250=0.03%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=0.37%, sys=1.34%, ctx=3355523, majf=0, minf=18
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3355445,838859,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1324KiB/s (1356kB/s), 1324KiB/s-1324KiB/s (1356kB/s-1356kB/s), io=12.8GiB (13.7GB), run=10136307-10136307msec
WRITE: bw=331KiB/s (339kB/s), 331KiB/s-331KiB/s (339kB/s-339kB/s), io=3277MiB (3436MB), run=10136307-10136307msec
Disk stats (read/write):
vda: ios=3355366/834422, merge=0/4, ticks=9964327/30041062, in_queue=40005389, util=99.81%
Random, 80% Read, 20% Write, 4m Blocksize
READ: bw=67.2MiB/s
iops: avg=18.34
WRITE: bw=17.5MiB/s
iops: avg= 5.53
fio --name=rbd-test --rw=randrw --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest29 ~]# fio --name=rbd-test --rw=randrw --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
rbd-test: (g=0): rw=randrw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [m(1)][99.5%][r=87.9MiB/s,w=47.0MiB/s][r=21,w=11 IOPS][eta 00m:01s]
rbd-test: (groupid=0, jobs=1): err= 0: pid=1475: Sat Mar 25 21:52:14 2023
read: IOPS=16, BW=67.2MiB/s (70.5MB/s)(12.7GiB/193365msec)
clat (msec): min=41, max=2161, avg=55.61, stdev=69.20
lat (msec): min=41, max=2161, avg=55.61, stdev=69.20
clat percentiles (msec):
| 1.00th=[ 44], 5.00th=[ 45], 10.00th=[ 45], 20.00th=[ 45],
| 30.00th=[ 46], 40.00th=[ 46], 50.00th=[ 46], 60.00th=[ 47],
| 70.00th=[ 48], 80.00th=[ 50], 90.00th=[ 57], 95.00th=[ 70],
| 99.00th=[ 262], 99.50th=[ 418], 99.90th=[ 995], 99.95th=[ 1418],
| 99.99th=[ 2165]
bw ( KiB/s): min= 8192, max=90112, per=100.00%, avg=75188.45, stdev=19320.94, samples=353
iops : min= 2, max= 22, avg=18.34, stdev= 4.73, samples=353
write: IOPS=4, BW=17.5MiB/s (18.4MB/s)(3392MiB/193365msec); 0 zone resets
clat (usec): min=1371, max=5103, avg=1871.27, stdev=428.74
lat (usec): min=1449, max=5129, avg=1931.56, stdev=434.49
clat percentiles (usec):
| 1.00th=[ 1483], 5.00th=[ 1532], 10.00th=[ 1565], 20.00th=[ 1614],
| 30.00th=[ 1647], 40.00th=[ 1663], 50.00th=[ 1696], 60.00th=[ 1762],
| 70.00th=[ 1893], 80.00th=[ 2147], 90.00th=[ 2311], 95.00th=[ 2704],
| 99.00th=[ 3523], 99.50th=[ 3752], 99.90th=[ 5080], 99.95th=[ 5080],
| 99.99th=[ 5080]
bw ( KiB/s): min= 8175, max=65536, per=100.00%, avg=22674.75, stdev=13647.46, samples=306
iops : min= 1, max= 16, avg= 5.53, stdev= 3.33, samples=306
lat (msec) : 2=15.45%, 4=5.15%, 10=0.10%, 50=65.16%, 100=11.62%
lat (msec) : 250=1.64%, 500=0.61%, 750=0.10%, 1000=0.10%, 2000=0.05%
lat (msec) : >=2000=0.02%
cpu : usr=0.04%, sys=2.75%, ctx=6509, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3248,848,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=67.2MiB/s (70.5MB/s), 67.2MiB/s-67.2MiB/s (70.5MB/s-70.5MB/s), io=12.7GiB (13.6GB), run=193365-193365msec
WRITE: bw=17.5MiB/s (18.4MB/s), 17.5MiB/s-17.5MiB/s (18.4MB/s-18.4MB/s), io=3392MiB (3557MB), run=193365-193365msec
Disk stats (read/write):
vda: ios=13014/4244, merge=0/0, ticks=579911/1582547, in_queue=2162458, util=98.08%
Sequential Write, 2300 Bytes, fsync
WRITE: bw=101KiB/s
iops: avg=45.15
fio --name=etcd-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --directory=/mnt/fiotest/
[root@guest29 ~]# fio --name=etcd-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --directory=/mnt/fiotest/
etcd-test: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.19
Starting 1 process
etcd-test: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=186KiB/s][w=82 IOPS][eta 00m:00s]
etcd-test: (groupid=0, jobs=1): err= 0: pid=1436: Sat Mar 25 21:34:32 2023
write: IOPS=45, BW=101KiB/s (104kB/s)(21.0MiB/222696msec); 0 zone resets
clat (usec): min=9, max=1535, avg=20.46, stdev=30.24
lat (usec): min=9, max=1536, avg=20.97, stdev=30.28
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 15],
| 30.00th=[ 16], 40.00th=[ 17], 50.00th=[ 18], 60.00th=[ 20],
| 70.00th=[ 22], 80.00th=[ 24], 90.00th=[ 32], 95.00th=[ 37],
| 99.00th=[ 50], 99.50th=[ 56], 99.90th=[ 78], 99.95th=[ 161],
| 99.99th=[ 1418]
bw ( KiB/s): min= 4, max= 193, per=99.85%, avg=100.85, stdev=60.58, samples=443
iops : min= 2, max= 86, avg=45.15, stdev=26.94, samples=443
lat (usec) : 10=0.05%, 20=62.27%, 50=36.73%, 100=0.89%, 250=0.01%
lat (msec) : 2=0.05%
fsync/fdatasync/sync_file_range:
sync (msec): min=5, max=526, avg=22.18, stdev=35.52
sync percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 9], 20.00th=[ 10],
| 30.00th=[ 12], 40.00th=[ 15], 50.00th=[ 17], 60.00th=[ 18],
| 70.00th=[ 19], 80.00th=[ 21], 90.00th=[ 27], 95.00th=[ 47],
| 99.00th=[ 236], 99.50th=[ 288], 99.90th=[ 338], 99.95th=[ 372],
| 99.99th=[ 430]
cpu : usr=0.05%, sys=0.32%, ctx=30229, majf=0, minf=14
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=101KiB/s (104kB/s), 101KiB/s-101KiB/s (104kB/s-104kB/s), io=21.0MiB (23.1MB), run=222696-222696msec
Disk stats (read/write):
vda: ios=0/25659, merge=0/4, ticks=0/310626, in_queue=310627, util=100.00%
Sequential Write, 2300 Bytes, NO fsync
WRITE: bw=171MiB/s
iops: avg=77371.25
fio --name=etcdd-test --rw=write --ioengine=sync --size=16g --bs=2300 --directory=/mnt/fiotest/
[root@guest29 ~]# fio --name=etcdd-test --rw=write --ioengine=sync --size=16g --bs=2300 --directory=/mnt/fiotest/
etcdd-test: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.19
Starting 1 process
etcdd-test: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [W(1)][99.0%][w=258MiB/s][w=118k IOPS][eta 00m:01s]
etcdd-test: (groupid=0, jobs=1): err= 0: pid=1453: Sat Mar 25 21:39:54 2023
write: IOPS=77.9k, BW=171MiB/s (179MB/s)(15.0GiB/95839msec); 0 zone resets
clat (nsec): min=1026, max=207763k, avg=12453.03, stdev=443304.45
lat (nsec): min=1084, max=207763k, avg=12511.36, stdev=443304.47
clat percentiles (nsec):
| 1.00th=[ 1080], 5.00th=[ 1128], 10.00th=[ 1176],
| 20.00th=[ 1224], 30.00th=[ 1256], 40.00th=[ 1368],
| 50.00th=[ 2384], 60.00th=[ 2704], 70.00th=[ 2832],
| 80.00th=[ 2928], 90.00th=[ 3312], 95.00th=[ 3856],
| 99.00th=[ 7456], 99.50th=[ 12608], 99.90th=[ 108032],
| 99.95th=[ 8978432], 99.99th=[17694720]
bw ( KiB/s): min= 256, max=848884, per=99.27%, avg=173782.75, stdev=135308.55, samples=190
iops : min= 114, max=377938, avg=77371.25, stdev=60241.76, samples=190
lat (usec) : 2=45.65%, 4=50.40%, 10=3.23%, 20=0.55%, 50=0.06%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.04%, 20=0.04%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=6.46%, sys=14.98%, ctx=6237, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7469508,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=171MiB/s (179MB/s), 171MiB/s-171MiB/s (179MB/s-179MB/s), io=15.0GiB (17.2GB), run=95839-95839msec
Disk stats (read/write):
vda: ios=0/15905, merge=0/4, ticks=0/12849013, in_queue=12849013, util=98.23%
VM on libvirt using RBD benchmarks
These are just for comparison with an earlier test. The host running the VM (my workstation) only has a single 1 Gigabit Ethernet uplink, and I know from the test VM on my Proxmox host that the Ceph cluster now delivers more bandwidth than this libvirt host can take.
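To confirm that the libvirt guest's RBD-backed disk also runs with cache='none', something like this works (a sketch; guest name as in the transcript below):

# show the disk definition of the RBD-backed device
virsh dumpxml guest21 | grep -B 2 -A 4 "protocol='rbd'"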
VM side of taking the RBD image in use (click to expand)
[root@guest21 ~]# mkfs.xfs /dev/vdb
meta-data=/dev/vdb isize=512 agcount=4, agsize=67108864 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=0 inobtcount=0
data = bsize=4096 blocks=268435456, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=131072, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
[root@guest21 ~]# mount /dev/vdb /mnt/
[root@guest21 ~]# df -h /mnt/
Filesystem Size Used Avail Use% Mounted on
/dev/vdb 1.0T 7.2G 1017G 1% /mnt
[root@guest21 ~]# mkdir /mnt/fiotest/
[root@guest21 ~]# uname -r
4.18.0-425.13.1.el8_7.x86_64
[root@guest21 ~]# dnf check-upgrade
Updating Subscription Management repositories.
Last metadata expiration check: 0:06:15 ago on Mon 20 Mar 2023 08:35:28 PM CET.
[root@guest21 ~]# dnf repolist
Updating Subscription Management repositories.
repo id repo name
rhel-8-for-x86_64-appstream-rpms Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)
rhel-8-for-x86_64-baseos-rpms Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)
[root@guest21 ~]#
Sequential Read, 4k Blocksize
READ: bw=107MiB/s
iops: avg=27482.47
fio --name=readtest --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest21 ~]# fio --name=readtest --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4k
readtest: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
readtest: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=106MiB/s][r=27.2k IOPS][eta 00m:00s]
readtest: (groupid=0, jobs=1): err= 0: pid=1338: Mon Mar 20 20:48:21 2023
read: IOPS=27.5k, BW=107MiB/s (113MB/s)(16.0GiB/152659msec)
clat (nsec): min=1139, max=229620k, avg=35329.79, stdev=1485689.68
lat (nsec): min=1191, max=229620k, avg=35476.71, stdev=1485689.87
clat percentiles (nsec):
| 1.00th=[ 1272], 5.00th=[ 1400], 10.00th=[ 1464],
| 20.00th=[ 1576], 30.00th=[ 2352], 40.00th=[ 2992],
| 50.00th=[ 3184], 60.00th=[ 3248], 70.00th=[ 3344],
| 80.00th=[ 3440], 90.00th=[ 3632], 95.00th=[ 3792],
| 99.00th=[ 19072], 99.50th=[ 40192], 99.90th=[ 140288],
| 99.95th=[22151168], 99.99th=[85458944]
bw ( KiB/s): min=77446, max=131072, per=100.00%, avg=109931.98, stdev=9122.31, samples=304
iops : min=19361, max=32768, avg=27482.47, stdev=2280.57, samples=304
lat (usec) : 2=26.98%, 4=69.83%, 10=1.57%, 20=0.63%, 50=0.55%
lat (usec) : 100=0.26%, 250=0.09%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.03%
lat (msec) : 100=0.02%, 250=0.01%
cpu : usr=5.51%, sys=9.89%, ctx=3285, majf=0, minf=16
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4194304,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=107MiB/s (113MB/s), 107MiB/s-107MiB/s (113MB/s-113MB/s), io=16.0GiB (17.2GB), run=152659-152659msec
Disk stats (read/write):
vdb: ios=16378/8, merge=0/1, ticks=1880033/636, in_queue=1880668, util=90.76%
Sequential Read, 4m Blocksize
READ: bw=103MiB/s
iops: avg=24.95
fio --name=readtest --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest21 ~]# fio --name=readtest --rw=read --size=16g --directory=/mnt/fiotest/ --bs=4m
readtest: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=108MiB/s][r=26 IOPS][eta 00m:00s]
readtest: (groupid=0, jobs=1): err= 0: pid=1366: Mon Mar 20 21:05:58 2023
read: IOPS=25, BW=103MiB/s (108MB/s)(16.0GiB/159617msec)
clat (usec): min=693, max=321996, avg=38794.65, stdev=49394.27
lat (usec): min=694, max=321998, avg=38796.13, stdev=49394.24
clat percentiles (usec):
| 1.00th=[ 799], 5.00th=[ 955], 10.00th=[ 1336], 20.00th=[ 1598],
| 30.00th=[ 1729], 40.00th=[ 3359], 50.00th=[ 7701], 60.00th=[ 27395],
| 70.00th=[ 55837], 80.00th=[ 85459], 90.00th=[122160], 95.00th=[143655],
| 99.00th=[162530], 99.50th=[170918], 99.90th=[214959], 99.95th=[254804],
| 99.99th=[320865]
bw ( KiB/s): min=65405, max=131072, per=100.00%, avg=105557.24, stdev=12718.77, samples=317
iops : min= 15, max= 32, avg=24.95, stdev= 3.15, samples=317
lat (usec) : 750=0.24%, 1000=5.79%
lat (msec) : 2=27.20%, 4=7.74%, 10=10.40%, 20=5.20%, 50=11.72%
lat (msec) : 100=15.92%, 250=15.72%, 500=0.07%
cpu : usr=0.07%, sys=7.78%, ctx=3297, majf=0, minf=527
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=103MiB/s (108MB/s), 103MiB/s-103MiB/s (108MB/s-108MB/s), io=16.0GiB (17.2GB), run=159617-159617msec
Disk stats (read/write):
vdb: ios=16368/0, merge=0/0, ticks=2047074/0, in_queue=2047074, util=95.09%
Sequential Write, 4k Blocksize
WRITE: bw=81.4MiB/s
iops: avg=20710.70
fio --name=writetest --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest21 ~]# fio --name=writetest --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4k
writetest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
writetest: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=81.0MiB/s][w=20.0k IOPS][eta 00m:00s]
writetest: (groupid=0, jobs=1): err= 0: pid=1369: Mon Mar 20 21:10:34 2023
write: IOPS=20.8k, BW=81.4MiB/s (85.3MB/s)(16.0GiB/201381msec); 0 zone resets
clat (usec): min=3, max=31664, avg=46.07, stdev=547.93
lat (usec): min=3, max=31664, avg=46.33, stdev=547.94
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 7], 40.00th=[ 10], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 13], 95.00th=[ 19],
| 99.00th=[ 71], 99.50th=[ 176], 99.90th=[ 8160], 99.95th=[11338],
| 99.99th=[12256]
bw ( KiB/s): min=65604, max=812845, per=99.44%, avg=82843.81, stdev=36957.59, samples=401
iops : min=16403, max=203211, avg=20710.70, stdev=9239.40, samples=401
lat (usec) : 4=21.36%, 10=61.88%, 20=12.28%, 50=2.79%, 100=1.00%
lat (usec) : 250=0.22%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.40%, 20=0.07%, 50=0.01%
cpu : usr=6.78%, sys=16.79%, ctx=19491, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4194304,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=81.4MiB/s (85.3MB/s), 81.4MiB/s-81.4MiB/s (85.3MB/s-85.3MB/s), io=16.0GiB (17.2GB), run=201381-201381msec
Disk stats (read/write):
vdb: ios=0/15844, merge=0/4, ticks=0/27150060, in_queue=27150060, util=99.56%
Sequential Write, 4m Blocksize
WRITE: bw=85.9MiB/s
iops: avg=20.69
fio --name=writetest --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest21 ~]# fio --name=writetest --rw=write --size=16g --directory=/mnt/fiotest/ --bs=4m
writetest: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=79.8MiB/s][w=19 IOPS][eta 00m:00s]
writetest: (groupid=0, jobs=1): err= 0: pid=1374: Mon Mar 20 21:14:36 2023
write: IOPS=21, BW=85.9MiB/s (90.1MB/s)(16.0GiB/190748msec); 0 zone resets
clat (usec): min=1592, max=183102, avg=46168.11, stdev=12784.28
lat (usec): min=1648, max=183382, avg=46393.96, stdev=12794.64
clat percentiles (usec):
| 1.00th=[ 1811], 5.00th=[ 32113], 10.00th=[ 35390], 20.00th=[ 40109],
| 30.00th=[ 41681], 40.00th=[ 44827], 50.00th=[ 45351], 60.00th=[ 47449],
| 70.00th=[ 49546], 80.00th=[ 53216], 90.00th=[ 59507], 95.00th=[ 67634],
| 99.00th=[ 81265], 99.50th=[ 87557], 99.90th=[108528], 99.95th=[127402],
| 99.99th=[183501]
bw ( KiB/s): min=32702, max=1054658, per=100.00%, avg=88297.97, stdev=51626.48, samples=379
iops : min= 7, max= 257, avg=20.69, stdev=12.63, samples=379
lat (msec) : 2=2.29%, 4=0.63%, 10=0.05%, 20=0.05%, 50=67.85%
lat (msec) : 100=28.96%, 250=0.17%
cpu : usr=0.75%, sys=8.00%, ctx=20234, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=85.9MiB/s (90.1MB/s), 85.9MiB/s-85.9MiB/s (90.1MB/s-90.1MB/s), io=16.0GiB (17.2GB), run=190748-190748msec
Disk stats (read/write):
vdb: ios=0/15892, merge=0/0, ticks=0/25821278, in_queue=25821278, util=99.15%
Sequential, 80% Read, 20% Write, 4k Blocksize
READ: bw=97.7MiB/s
iops: avg=25029.70
WRITE: bw=24.4MiB/s
iops: avg=6256.78
fio --name=read80write20test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
[root@guest21 ~]# fio --name=read80write20test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4k
read80write20test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
read80write20test: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [M(1)][99.3%][r=106MiB/s,w=26.2MiB/s][r=27.1k,w=6696 IOPS][eta 00m:01s]
read80write20test: (groupid=0, jobs=1): err= 0: pid=1387: Mon Mar 20 21:21:09 2023
read: IOPS=25.0k, BW=97.7MiB/s (102MB/s)(12.8GiB/134185msec)
clat (nsec): min=1114, max=386692k, avg=36088.52, stdev=1603607.30
lat (nsec): min=1167, max=386692k, avg=36244.56, stdev=1603607.33
clat percentiles (nsec):
| 1.00th=[ 1352], 5.00th=[ 1464], 10.00th=[ 1512],
| 20.00th=[ 1640], 30.00th=[ 2040], 40.00th=[ 2704],
| 50.00th=[ 3120], 60.00th=[ 3312], 70.00th=[ 3440],
| 80.00th=[ 3600], 90.00th=[ 4048], 95.00th=[ 4448],
| 99.00th=[ 20352], 99.50th=[ 41216], 99.90th=[ 146432],
| 99.95th=[20316160], 99.99th=[85458944]
bw ( KiB/s): min=32702, max=130810, per=100.00%, avg=100120.35, stdev=18395.24, samples=267
iops : min= 8175, max=32702, avg=25029.70, stdev=4598.73, samples=267
write: IOPS=6251, BW=24.4MiB/s (25.6MB/s)(3277MiB/134185msec); 0 zone resets
clat (usec): min=2, max=2826, avg= 9.17, stdev=13.73
lat (usec): min=2, max=2826, avg= 9.37, stdev=13.85
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 4], 20.00th=[ 5],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 8], 60.00th=[ 10],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 14], 95.00th=[ 19],
| 99.00th=[ 53], 99.50th=[ 86], 99.90th=[ 133], 99.95th=[ 157],
| 99.99th=[ 314]
bw ( KiB/s): min= 8119, max=33029, per=100.00%, avg=25028.41, stdev=4603.53, samples=267
iops : min= 2029, max= 8257, avg=6256.78, stdev=1150.88, samples=267
lat (usec) : 2=23.52%, 4=51.48%, 10=19.44%, 20=3.92%, 50=1.10%
lat (usec) : 100=0.35%, 250=0.12%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.02%
lat (msec) : 100=0.02%, 250=0.01%, 500=0.01%
cpu : usr=7.20%, sys=13.53%, ctx=2694, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3355445,838859,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=97.7MiB/s (102MB/s), 97.7MiB/s-97.7MiB/s (102MB/s-102MB/s), io=12.8GiB (13.7GB), run=134185-134185msec
WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=3277MiB (3436MB), run=134185-134185msec
Disk stats (read/write):
vdb: ios=13120/3142, merge=0/3, ticks=1615070/38929, in_queue=1653999, util=88.28%
Sequential, 80% Read, 20% Write, 4m Blocksize
READ: bw=97.0MiB/s
iops: avg=23.57
WRITE: bw=25.3MiB/s
iops: avg= 6.04
fio --name=read80write20test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
[root@guest21 ~]# fio --name=read80write20test --rw=readwrite --rwmixread=80 --size=16g --directory=/mnt/fiotest/ --bs=4m
read80write20test: (g=0): rw=rw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=79.8MiB/s,w=63.9MiB/s][r=19,w=15 IOPS][eta 00m:00s]
read80write20test: (groupid=0, jobs=1): err= 0: pid=1391: Mon Mar 20 21:24:07 2023
read: IOPS=24, BW=97.0MiB/s (102MB/s)(12.7GiB/133923msec)
clat (usec): min=714, max=471420, avg=40075.34, stdev=52774.81
lat (usec): min=715, max=471421, avg=40076.83, stdev=52774.70
clat percentiles (usec):
| 1.00th=[ 816], 5.00th=[ 963], 10.00th=[ 1237], 20.00th=[ 1614],
| 30.00th=[ 1762], 40.00th=[ 3064], 50.00th=[ 10028], 60.00th=[ 30016],
| 70.00th=[ 54264], 80.00th=[ 85459], 90.00th=[120062], 95.00th=[143655],
| 99.00th=[202376], 99.50th=[248513], 99.90th=[316670], 99.95th=[379585],
| 99.99th=[471860]
bw ( KiB/s): min=32702, max=131072, per=100.00%, avg=99904.80, stdev=19149.75, samples=265
iops : min= 7, max= 32, avg=23.57, stdev= 4.74, samples=265
write: IOPS=6, BW=25.3MiB/s (26.6MB/s)(3392MiB/133923msec); 0 zone resets
clat (usec): min=986, max=9089, avg=3457.38, stdev=1500.40
lat (usec): min=1036, max=9471, avg=3622.83, stdev=1535.59
clat percentiles (usec):
| 1.00th=[ 1106], 5.00th=[ 1549], 10.00th=[ 1893], 20.00th=[ 2114],
| 30.00th=[ 2278], 40.00th=[ 2540], 50.00th=[ 3163], 60.00th=[ 4047],
| 70.00th=[ 4359], 80.00th=[ 4555], 90.00th=[ 5276], 95.00th=[ 6325],
| 99.00th=[ 7767], 99.50th=[ 8356], 99.90th=[ 9110], 99.95th=[ 9110],
| 99.99th=[ 9110]
bw ( KiB/s): min= 8110, max=98304, per=100.00%, avg=28232.20, stdev=16757.83, samples=245
iops : min= 1, max= 24, avg= 6.04, stdev= 4.09, samples=245
lat (usec) : 750=0.12%, 1000=4.71%
lat (msec) : 2=24.93%, 4=15.26%, 10=15.33%, 20=4.42%, 50=10.11%
lat (msec) : 100=12.67%, 250=12.08%, 500=0.37%
cpu : usr=0.20%, sys=9.49%, ctx=2680, majf=0, minf=16
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=3248,848,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=97.0MiB/s (102MB/s), 97.0MiB/s-97.0MiB/s (102MB/s-102MB/s), io=12.7GiB (13.6GB), run=133923-133923msec
WRITE: bw=25.3MiB/s (26.6MB/s), 25.3MiB/s-25.3MiB/s (26.6MB/s-26.6MB/s), io=3392MiB (3557MB), run=133923-133923msec
Disk stats (read/write):
vdb: ios=12993/3237, merge=0/0, ticks=1740150/33009, in_queue=1773159, util=93.71%
Sequential Write, 2300 Bytes, fsync
This job approximates etcd's write pattern: small 2300-byte writes, each followed by an fdatasync.
WRITE: bw=71.9KiB/s
iops: avg=32.50
fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --name=etcdtest --directory=/mnt/fiotest/
[root@guest21 ~]# fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300 --name=etcdtest --directory=/mnt/fiotest/
etcdtest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.19
Starting 1 process
etcdtest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][99.7%][w=114KiB/s][w=50 IOPS][eta 00m:01s]
etcdtest: (groupid=0, jobs=1): err= 0: pid=1568: Mon Mar 20 21:44:36 2023
write: IOPS=32, BW=71.9KiB/s (73.6kB/s)(21.0MiB/313385msec); 0 zone resets
clat (usec): min=14, max=444, avg=54.91, stdev=22.56
lat (usec): min=14, max=445, avg=56.20, stdev=22.80
clat percentiles (usec):
| 1.00th=[ 19], 5.00th=[ 29], 10.00th=[ 36], 20.00th=[ 38],
| 30.00th=[ 42], 40.00th=[ 49], 50.00th=[ 51], 60.00th=[ 55],
| 70.00th=[ 60], 80.00th=[ 69], 90.00th=[ 85], 95.00th=[ 96],
| 99.00th=[ 127], 99.50th=[ 137], 99.90th=[ 182], 99.95th=[ 273],
| 99.99th=[ 330]
bw ( KiB/s): min= 4, max= 210, per=100.00%, avg=74.15, stdev=35.57, samples=602
iops : min= 1, max= 93, avg=32.50, stdev=15.86, samples=602
lat (usec) : 20=1.38%, 50=45.19%, 100=49.29%, 250=4.09%, 500=0.06%
fsync/fdatasync/sync_file_range:
sync (msec): min=5, max=1610, avg=31.18, stdev=53.58
sync percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 13], 20.00th=[ 16],
| 30.00th=[ 19], 40.00th=[ 22], 50.00th=[ 24], 60.00th=[ 28],
| 70.00th=[ 30], 80.00th=[ 36], 90.00th=[ 47], 95.00th=[ 68],
| 99.00th=[ 138], 99.50th=[ 203], 99.90th=[ 1003], 99.95th=[ 1250],
| 99.99th=[ 1603]
cpu : usr=0.09%, sys=0.48%, ctx=30288, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=71.9KiB/s (73.6kB/s), 71.9KiB/s-71.9KiB/s (73.6kB/s-73.6kB/s), io=21.0MiB (23.1MB), run=313385-313385msec
Disk stats (read/write):
vdb: ios=0/25682, merge=0/4, ticks=0/313614, in_queue=313614, util=100.00%
Sequential Write, 2300 Bytes, NO fsync
The same job without the per-write fdatasync; the gap to the fsynced run above is the cost of syncing every single write.
WRITE: bw=101MiB/s
iops: avg=45746.66
fio --rw=write --ioengine=sync --size=16g --bs=2300 --name=etcdtest --directory=/mnt/fiotest/
[root@guest21 ~]# fio --rw=write --ioengine=sync --size=16g --bs=2300 --name=etcdtest --directory=/mnt/fiotest/
etcdtest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.19
Starting 1 process
etcdtest: Laying out IO file (1 file / 16384MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=97.8MiB/s][w=44.6k IOPS][eta 00m:00s]
etcdtest: (groupid=0, jobs=1): err= 0: pid=1622: Mon Mar 20 23:53:50 2023
write: IOPS=45.9k, BW=101MiB/s (106MB/s)(15.0GiB/162817msec); 0 zone resets
clat (nsec): min=1967, max=73232k, avg=20253.25, stdev=340705.21
lat (usec): min=2, max=73232, avg=20.45, stdev=340.71
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 4],
| 30.00th=[ 5], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 7],
| 70.00th=[ 10], 80.00th=[ 10], 90.00th=[ 11], 95.00th=[ 14],
| 99.00th=[ 44], 99.50th=[ 62], 99.90th=[ 7439], 99.95th=[ 9372],
| 99.99th=[11338]
bw ( KiB/s): min=83544, max=519038, per=99.72%, avg=102751.82, stdev=32582.46, samples=324
iops : min=37195, max=231085, avg=45746.66, stdev=14506.30, samples=324
lat (usec) : 2=0.01%, 4=24.24%, 10=63.12%, 20=9.88%, 50=2.01%
lat (usec) : 100=0.49%, 250=0.09%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.11%, 20=0.04%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=12.73%, sys=27.16%, ctx=11507, majf=0, minf=16
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7469508,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=101MiB/s (106MB/s), 101MiB/s-101MiB/s (106MB/s-106MB/s), io=15.0GiB (17.2GB), run=162817-162817msec
Disk stats (read/write):
vdb: ios=0/15891, merge=0/4, ticks=0/21726052, in_queue=21726052, util=99.40%
Conclusion
For now this mainly gives me more peace of mind; some of the HDDs used previously had died, and all had power-on hours between 5 and 8 years.
Even with all 4 nodes set to
- no swap
- tuned profile throughput-performance
- reduced osd_memory_target
I still get latency spikes in the seconds range (yes, full seconds, not the low-ms range). A picture from the Ceph Dashboard follows, then a sketch of the commands behind these three settings:
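A minimal sketch of applying the three settings; the sed expression assumes the swap entries in /etc/fstab contain the word swap, and the memory target is the 3 GiB per OSD mentioned earlier:

# Disable swap now and keep it off across reboots
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab

# Switch the tuned profile
tuned-adm profile throughput-performance

# Lower the memory target for all OSDs (3 GiB = 3221225472 bytes)
ceph config set osd osd_memory_target 3221225472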
This is no wonder, as I see noticeable network error rates on the 10 GbE Aquantia interfaces these days. A screenshot from my checkmk server follows; it shows 25 hours of error rates on node 1 (the other 3 nodes look much the same):
The state of the built-in 1 GbE interfaces is not rosy these days either.
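Independent of checkmk, the raw counters can be eyeballed directly on a node; a quick sketch, with enp1s0 standing in for whichever interface is being checked:

# Per-interface RX/TX statistics, including error and drop counters
ip -s link show enp1s0

# NIC/driver level counters, where the driver exposes them
ethtool -S enp1s0 | grep -iE 'err|drop'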
When the F5-422 are eventually replaced, the SATA SSDs will of course be recycled in the new nodes.
All in all, the Ceph side was painless (and was done while the cluster remained in active use), but in 2020 I really should have spent a bit more money on higher-quality nodes.
As the German saying goes: when you buy cheap, you buy twice.