# Proxmox Virtual Environment

User-facing docs: https://wiki.k-space.ee/en/hosting/proxmox

## Adding new node

1. Upgrade existing nodes.
1. Install new nodes:
   - Hostname `pveXX.proxmox.infra.k-space.ee`
   - Boot disk ZRAID-1
   - 172.21 or DHCP may be used as the initial IP. The installer configuration will be overwritten by the cluster join and Ansible.
1. Add `non-free-firmware` as a component to the Debian (not PVE) `bookworm`, `bookworm-updates` and `bookworm-security` entries in `/etc/apt/sources.list`, next to `main` and `contrib` (see the example right after this list).
1. Upgrade the new nodes.
   - (unsure if still needed nowadays: disabling pve-enterprise and enabling pve-no-subscription)
1. Add the new node to DNS (secretspace/ns1) and Ansible.
1. Apply Ansible and reboot.
1. `$ systemctl status watchdog-mux` should say `Watchdog driver 'IPMI', version 1` and NOT `Software Watchdog`.
1. Join to the cluster in UI → Datacenter.
   - The IP to use is the last one, the IPv6 on vmbr0 <!-- TODO: might have changed -->
1. `$ passwd` on the new node.
1. `$ vim ~/.ssh/authorized_keys` → sort in the new key. **Keys are managed manually** since PVE manages the file as well.
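
For reference, the resulting Debian `sources.list` entries might look roughly like this (a sketch; the mirror URLs are examples, keep whatever the node already uses):

```
# /etc/apt/sources.list — Debian entries only, PVE repos unchanged
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
deb http://deb.debian.org/debian bookworm-updates main contrib non-free-firmware
deb http://security.debian.org/debian-security bookworm-security main contrib non-free-firmware
```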
TODO: prometheus node exporter

TODO: create-external-cluster-resources.py in pve90

TODO: PVE backup server. We want local snapshots and offsite.

TODO: reinstate restic for /etc and /root

TODO: d12 discard
## K-SPACE Hyper-Converged CEPH setup

> [!WARNING]
> K-SPACE Kubernetes uses PVE's CEPH cluster; the k8s pools are not visible in the general PVE UI.

1. Configure a mesh network

```
ansible-playbook proxmox/ceph.yaml
```

This will configure the 40Gbit interfaces and the FRR daemon with OpenFabric routing.
Our CEPH setup uses a private IPv6 subnet for intra-cluster communication:

```
fdcc:a182:4fed::/64
```

> You can check the mesh network status by launching the FRR shell `vtysh` and then typing
> `show openfabric topology`:

```
root@pve91:~# vtysh
pve91# show openfabric topology
IS-IS paths to level-2 routers that speak IPv6
Vertex                  Type         Metric  Next-Hop  Interface   Parent
------------------------------------------------------------------------------
pve91
fdcc:a182:4fed::91/128  IP6 internal 0                             pve91(4)
pve93                   TE-IS        10      pve93     enp161s0    pve91(4)
pve90                   TE-IS        10      pve90     enp161s0d1  pve91(4)
fdcc:a182:4fed::93/128  IP6 internal 20      pve93     enp161s0    pve93(4)
fdcc:a182:4fed::90/128  IP6 internal 20      pve90     enp161s0d1  pve90(4)
```
2. Set up the CEPH packages on all nodes

```
pveceph install --repository no-subscription --version squid
```

3. CEPH init

```
pveceph init --network fdcc:a182:4fed::/64
```

4. Create CEPH monitors on each node

```
pveceph mon create
```

5. Also create CEPH managers on each node

```
pveceph mgr create
```

6. Create OSD daemons for each disk on all nodes

NVMe drives get two OSD daemons per disk for better IOPS:

```
pveceph osd create /dev/nvme0n1 --crush-device-class nvme --osds-per-device 2
```

HDDs get just one:

```
pveceph osd create /dev/sdX --crush-device-class hdd
```
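
To sanity-check the result (an extra step, not part of the original procedure), the device classes and per-OSD utilization can be inspected with standard Ceph tooling:

```
ceph osd crush class ls   # should list "nvme" and "hdd"
ceph osd df tree          # OSDs grouped by host, with device class and usage
```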
7. Create CRUSH Maps

We want to separate HDD and NVMe storage into different storage buckets.

The default `replicated_rule` would put data blocks on all of the available disks:

```
# ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_hdd default host hdd
```

> **NB**: Using the default `replicated_rule` for **ANY** CEPH pool will result in
> the Placement Group (PG) Autoscaler not working, as it can't properly calculate
> how much space is available in CEPH due to the different device classes we are using.
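
Once the pools have been created (steps 8 and 9), the autoscaler's view of the per-class capacity can be confirmed with (an extra verification step, not from the original notes):

```
ceph osd pool autoscale-status
```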
8. Create CEPH pools for VM disk images

This is done in the individual node's Ceph → Pools configuration.

**NB:** Under Advanced, select the correct CRUSH rule (nvme or hdd).
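
The same can also be done from the CLI; a sketch, where the pool name `vm-nvme` is just an example (this only creates the pool, attaching it as PVE storage is a separate step):

```
pveceph pool create vm-nvme --crush_rule replicated_nvme --pg_autoscale_mode on
```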
9. Create a CephFS storage pool for ISO images

First create a metadata server on each node:

```
pveceph mds create
```

Then create a CephFS on one of the individual nodes.

After that is done, you can change the CRUSH rules of the `cephfs_data` and `cephfs_metadata`
pools under Pools to use the NVMe drives.
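
A rough CLI equivalent, assuming the default `cephfs` name and its default pool names:

```
pveceph fs create --name cephfs --add-storage   # run on one node
ceph osd pool set cephfs_data crush_rule replicated_nvme
ceph osd pool set cephfs_metadata crush_rule replicated_nvme
```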
### CEPH NUMA Pinning

This helps a bit with read latency (482.28us vs 437.22us).

The `hwloc-nox` package contains a program called `hwloc-ls` that visualizes the
connected hardware and NUMA nodes. In our case the Ceph network interface and the NVMe drive
are both connected to the same NUMA node. We can use `hwloc-calc -I core os=nvme0n1`
to get the list of CPU cores attached to the NVMe drive:

```
# hwloc-calc -I core os=nvme0n1
8,9,10,11,12,13,14,15
```

From that output we can create a systemd override file for the `ceph-osd@<ID>` daemons:

```
systemctl edit ceph-osd@0
```

And then paste:

```
[Service]
CPUAffinity=8,9,10,11,12,13,14,15
NUMAPolicy=default
NUMAMask=8,9,10,11,12,13,14,15
```

After restarting the OSD you should see in `numastat ceph-osd` that the OSD is mostly contained to a single NUMA node.
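
The restart-and-verify step as a sketch (OSD id 0 and the core list are just the example values from above):

```
systemctl restart ceph-osd@0
numastat ceph-osd   # memory usage of the ceph-osd processes, broken down per NUMA node
```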
Here are a bunch of example `fio` benchmark commands that can be used to verify this change:

https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
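
For example, a 4k random-read latency run could look like this (a sketch, not taken from the linked page; point `--filename` at a scratch disk or file inside a test VM):

```
fio --name=randread-latency --filename=/dev/vdb --size=10G \
    --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
```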