diff --git a/proxmox/README.md b/proxmox/README.md
index be10c52..ac30d92 100644
--- a/proxmox/README.md
+++ b/proxmox/README.md
@@ -77,3 +77,32 @@
 After that is done you can modify under Pools change the
 cephfs_data and cephfs_metadata Crush rules
 to use NVMe drives.
+
+### CEPH NUMA Pinning
+
+This helps a bit with read latency (482.28us vs 437.22us).
+
+The `hwloc-nox` package contains a program called `hwloc-ls` that visualizes the
+connected hardware and the NUMA nodes. In our case the Ceph network interface and
+the NVMe drive are both connected to the same NUMA node. We can use
+`hwloc-calc -I core os=nvme0n1` to get the list of CPU cores attached to the NVMe drive.
+
+    # hwloc-calc -I core os=nvme0n1
+    8,9,10,11,12,13,14,15
+
+From that output we can create a systemd override file for the `ceph-osd@` daemons.
+
+    systemctl edit ceph-osd@0
+
+And then paste:
+
+    [Service]
+    CPUAffinity=8,9,10,11,12,13,14,15
+    NUMAPolicy=default
+    NUMAMask=8,9,10,11,12,13,14,15
+
+After restarting the OSD, `numastat ceph-osd` should show that the OSD is mostly
+contained to a single NUMA node.
+
+This page lists a bunch of example `fio` benchmark commands that can be used to
+verify the change:
+
+https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
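+
+For a quick read-latency check along the same lines, a 4k random read job such as
+the one below can be used (a sketch only: the `--filename` path, size and job name
+are placeholders, point it at a file or block device backed by the pool in question):
+
+    fio --name=read-latency --filename=/mnt/cephfs/fio-testfile --size=1G \
+        --rw=randread --bs=4k --ioengine=libaio --direct=1 --iodepth=1 \
+        --numjobs=1 --runtime=60 --time_based --group_reporting
+
+Comparing the `clat` averages reported by `fio` before and after the pinning shows
+whether the override made a difference.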
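+
+Putting the steps above together, a minimal sketch for applying the same override to
+every local OSD without the interactive `systemctl edit` (assuming the standard
+`/var/lib/ceph/osd/ceph-<id>` layout and a single NVMe, `nvme0n1`, as in the example
+above) could look like this:
+
+    #!/bin/sh
+    # Cores local to the NVMe drive's NUMA node, e.g. "8,9,10,11,12,13,14,15".
+    CORES="$(hwloc-calc -I core os=nvme0n1)"
+
+    for osd in /var/lib/ceph/osd/ceph-*; do
+        id="${osd##*-}"
+        dir="/etc/systemd/system/ceph-osd@${id}.service.d"
+        mkdir -p "${dir}"
+        # Same drop-in file that `systemctl edit ceph-osd@<id>` would create.
+        printf '[Service]\nCPUAffinity=%s\nNUMAPolicy=default\nNUMAMask=%s\n' \
+            "${CORES}" "${CORES}" > "${dir}/override.conf"
+    done
+
+    systemctl daemon-reload
+    # Restart each OSD one at a time afterwards, e.g. `systemctl restart ceph-osd@0`,
+    # and let the cluster settle in between.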