# Proxmox Virtual Environment

## K-Space Hyper-Converged CEPH setup

1. Configure a mesh network

   ```
   ansible-playbook proxmox/ceph.yaml
   ```

   This will configure the 40Gbit interfaces and the FRR daemon with OpenFabric routing.
   Our CEPH setup uses a private IPv6 subnet for internal cluster communication: `fdcc:a182:4fed::/64`

   > You can check mesh network status by launching the FRR shell `vtysh` and then typing
   > `show openfabric topology`

   ```
   root@pve91:~# vtysh
   pve91# show openfabric topology
   IS-IS paths to level-2 routers that speak IPv6
   Vertex                 Type         Metric Next-Hop Interface  Parent
   ----------------------------------------------------------------------
   pve91
   fdcc:a182:4fed::91/128 IP6 internal 0                          pve91(4)
   pve93                  TE-IS        10     pve93    enp161s0   pve91(4)
   pve90                  TE-IS        10     pve90    enp161s0d1 pve91(4)
   fdcc:a182:4fed::93/128 IP6 internal 20     pve93    enp161s0   pve93(4)
   fdcc:a182:4fed::90/128 IP6 internal 20     pve90    enp161s0d1 pve90(4)
   ```

2. Set up CEPH packages on all nodes

   ```
   pveceph install --repository no-subscription --version squid
   ```

3. CEPH init

   ```
   pveceph init --network fdcc:a182:4fed::/64
   ```

4. Create CEPH monitors on each node

   ```
   pveceph mon create
   ```

5. Also create CEPH managers on each node

   ```
   pveceph mgr create
   ```

6. Create OSD daemons for each disk on all nodes

   NVMe drives get 2 OSD daemons per disk for better IOPS:

   ```
   pveceph osd create /dev/nvme0n1 --crush-device-class nvme --osds-per-device 2
   ```

   HDDs get just 1:

   ```
   pveceph osd create /dev/sdX --crush-device-class hdd
   ```

7. Create CRUSH Maps

   We want to separate HDD and NVMe storage into different storage buckets.
   The default `replicated_rule` would place data blocks on all of the available disks.

   ```
   # ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <device-class>
   ceph osd crush rule create-replicated replicated_nvme default host nvme
   ceph osd crush rule create-replicated replicated_hdd default host hdd
   ```

   > **NB**: Using the default `replicated_rule` for **ANY** CEPH pool will result in
   > the Placement Group (PG) Autoscaler not working, as it can't properly calculate
   > how much space is available in CEPH due to the different device classes we are using.

8. Create CEPH Pools for VM disk images

   This is done in the individual node's Ceph -> Pools configuration.

   **NB:** Under Advanced, select the correct CRUSH rule (nvme or hdd).

9. Create a CephFS storage pool for ISO images

   First create a metadata server on each node:

   ```
   pveceph mds create
   ```

   Then create a CephFS on one of the individual nodes. After that is done, change the
   `cephfs_data` and `cephfs_metadata` CRUSH rules under Pools to use the NVMe drives.

### CEPH NUMA Pinning

This helps a bit with read latency (482.28us vs 437.22us).

The `hwloc-nox` package ships a program called `hwloc-ls` that visualizes the connected
hardware and NUMA nodes. In our case the Ceph network interface and the NVMe drive are
both connected to the same NUMA node.

We can use `hwloc-calc -I core os=nvme0n1` to get the list of CPU cores attached to the NVMe drive:

```
# hwloc-calc -I core os=nvme0n1
8,9,10,11,12,13,14,15
```

From that output we can create a systemd override file for the `ceph-osd@` daemons:

```
systemctl edit ceph-osd@0
```

And then paste:

```
[Service]
CPUAffinity=8,9,10,11,12,13,14,15
NUMAPolicy=default
NUMAMask=8,9,10,11,12,13,14,15
```

After restarting the OSD, `numastat ceph-osd` should show that the OSD is mostly contained
to a single NUMA node.

A bunch of example `fio` benchmark commands that can be used to verify this change can be found at
https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
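
For instance, a simple 4k random-read latency test in the spirit of those examples could look like the sketch below. The `--filename` path and `--size` are placeholders, not part of our setup; point them at a test file on a CEPH-backed disk and compare the reported completion latency (`clat`) before and after pinning.

```
# 4k random-read latency test (sketch; adjust --filename and --size to your environment)
fio --name=randread-latency \
    --filename=/path/to/testfile \
    --size=10G \
    --rw=randread --bs=4k \
    --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1 \
    --time_based --runtime=60 \
    --group_reporting
```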