From 057d602db445f8c066b303d2f614fc55038ab054 Mon Sep 17 00:00:00 2001
From: Arti Zirk
Date: Sat, 2 Aug 2025 15:35:07 +0300
Subject: [PATCH] Add CEPH NUMA docs

---
 proxmox/README.md | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/proxmox/README.md b/proxmox/README.md
index be10c52..ac30d92 100644
--- a/proxmox/README.md
+++ b/proxmox/README.md
@@ -77,3 +77,52 @@
 After that is done you can modify
 under Pools change the cephfs_data and
 cephfs_metadata Crush rules to use NVMe drives.
+
+### CEPH NUMA Pinning
+
+Pinning the OSD daemons to the NUMA node of their hardware helps a bit with
+read latency (482.28us vs 437.22us in our tests).
+
+The `hwloc-nox` package contains a program called `hwloc-ls` that visualizes
+the connected hardware and NUMA nodes. In our case the Ceph network interface
+and the NVMe drive are both connected to the same NUMA node. We can use
+`hwloc-calc -I core os=nvme0n1` to get the list of CPU cores attached to the
+NVMe drive.
+
+    # hwloc-calc -I core os=nvme0n1
+    8,9,10,11,12,13,14,15
+
+From that output we can create a systemd override file for the `ceph-osd@`
+daemons.
+
+    systemctl edit ceph-osd@0
+
+And then paste:
+
+    [Service]
+    CPUAffinity=8,9,10,11,12,13,14,15
+    NUMAPolicy=default
+    NUMAMask=8,9,10,11,12,13,14,15
+
+After restarting the OSD, `numastat ceph-osd` should show that the OSD's
+memory is mostly allocated from a single NUMA node.
+
+Here are a bunch of example `fio` benchmark commands that can be used to
+verify this change:
+
+https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
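+
+For example, a short 4k random-read latency test could look like this (just a
+sketch; `/mnt/pve/cephfs/fio-test` is a placeholder test file on the CephFS
+mount, adjust the path, size and runtime to your setup):
+
+    fio --name=read-latency --filename=/mnt/pve/cephfs/fio-test --size=4G \
+        --rw=randread --bs=4k --ioengine=libaio --direct=1 --iodepth=1 \
+        --numjobs=1 --time_based --runtime=60 --group_reporting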
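+
+Run the same test before and after the pinning change and compare the reported
+completion latency (`clat`). To double-check that the override itself is
+active, the unit properties can be queried after the restart (property names
+may vary between systemd versions):
+
+    systemctl show ceph-osd@0 -p CPUAffinity -p NUMAPolicy -p NUMAMask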