# Proxmox Virtual Environment

User-facing docs: https://wiki.k-space.ee/en/hosting/proxmox

## Adding new node

1. Upgrade existing nodes.
1. Install new nodes:
   - Hostname `pveXX.proxmox.infra.k-space.ee`
   - Boot disk ZRAID-1
   - 172.21 or DHCP may be used as the initial IP. The installer configuration will be overwritten by the cluster join and Ansible.
1. Add `non-free-firmware` as a component to the Debian (not PVE) `bookworm`, `bookworm-updates` and `bookworm-security` entries in `/etc/apt/sources.list`, next to `main` and `contrib` (see the example right after this list).
1. Upgrade the new nodes.
   - (unsure if still needed nowadays: disabling pve-enterprise and enabling pve-no-subscription)
1. Add the new node to DNS (secretspace/ns1) and Ansible.
1. Apply Ansible and reboot.
1. `$ systemctl status watchdog-mux` should say `Watchdog driver 'IPMI', version 1` and NOT `Software Watchdog`.
1. Join to the cluster in UI → Datacenter.
   - The IP to use is the last one, the IPv6 on vmbr0 <!-- TODO: might have changed -->
1. `$ passwd` on the new node.
1. `$ vim ~/.ssh/authorized_keys` → sort in the new key. **Keys are managed manually** since PVE manages the file as well.
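
For reference, the resulting Debian `sources.list` entries might look roughly like this (a sketch; the mirror URLs are examples, keep whatever the node already uses):

```
# /etc/apt/sources.list — Debian entries only, PVE repos unchanged
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
deb http://deb.debian.org/debian bookworm-updates main contrib non-free-firmware
deb http://security.debian.org/debian-security bookworm-security main contrib non-free-firmware
```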
TODO: prometheus node exporter

TODO: create-external-cluster-resources.py in pve90

TODO: PVE backup server. We want local snapshots and offsite.

TODO: reinstate restic for /etc and /root

TODO: d12 discard
## K-SPACE Hyper-Converged CEPH setup

> [!WARNING]
> K-SPACE Kubernetes uses PVE's CEPH cluster; the k8s pools are not visible in the general PVE UI.

1. Configure a mesh network

```
ansible-playbook proxmox/ceph.yaml
```

This will configure the 40Gbit interfaces and the FRR daemon with OpenFabric routing.
Our CEPH setup uses a private IPv6 subnet for intra-cluster communication:

```
fdcc:a182:4fed::/64
```

> You can check the mesh network status by launching the FRR shell `vtysh` and then typing
> `show openfabric topology`:

```
root@pve91:~# vtysh
pve91# show openfabric topology
IS-IS paths to level-2 routers that speak IPv6
Vertex                  Type         Metric  Next-Hop  Interface   Parent
------------------------------------------------------------------------------
pve91
fdcc:a182:4fed::91/128  IP6 internal 0                             pve91(4)
pve93                   TE-IS        10      pve93     enp161s0    pve91(4)
pve90                   TE-IS        10      pve90     enp161s0d1  pve91(4)
fdcc:a182:4fed::93/128  IP6 internal 20      pve93     enp161s0    pve93(4)
fdcc:a182:4fed::90/128  IP6 internal 20      pve90     enp161s0d1  pve90(4)
```
2. Set up the CEPH packages on all nodes

```
pveceph install --repository no-subscription --version squid
```

3. CEPH init

```
pveceph init --network fdcc:a182:4fed::/64
```

4. Create CEPH monitors on each node

```
pveceph mon create
```

5. Also create CEPH managers on each node

```
pveceph mgr create
```

6. Create OSD daemons for each disk on all nodes

NVMe drives get two OSD daemons per disk for better IOPS:

```
pveceph osd create /dev/nvme0n1 --crush-device-class nvme --osds-per-device 2
```

HDDs get just one:

```
pveceph osd create /dev/sdX --crush-device-class hdd
```
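
To sanity-check the result (an extra step, not part of the original procedure), the device classes and per-OSD utilization can be inspected with standard Ceph tooling:

```
ceph osd crush class ls   # should list "nvme" and "hdd"
ceph osd df tree          # OSDs grouped by host, with device class and usage
```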
7. Create CRUSH Maps

We want to separate HDD and NVMe storage into different storage buckets.

The default `replicated_rule` would put data blocks on all of the available disks:

```
# ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_hdd default host hdd
```

> **NB**: Using the default `replicated_rule` for **ANY** CEPH pool will result in
> the Placement Group (PG) Autoscaler not working, as it can't properly calculate
> how much space is available in CEPH due to the different device classes we are using.
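
Once the pools have been created (steps 8 and 9), the autoscaler's view of the per-class capacity can be confirmed with (an extra verification step, not from the original notes):

```
ceph osd pool autoscale-status
```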
8. Create CEPH pools for VM disk images

This is done in the individual node's Ceph → Pools configuration.

**NB:** Under Advanced, select the correct CRUSH rule (nvme or hdd).
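
The same can also be done from the CLI; a sketch, where the pool name `vm-nvme` is just an example (this only creates the pool, attaching it as PVE storage is a separate step):

```
pveceph pool create vm-nvme --crush_rule replicated_nvme --pg_autoscale_mode on
```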
9. Create a CephFS storage pool for ISO images

First create a metadata server on each node:

```
pveceph mds create
```

Then create a CephFS on one of the individual nodes.

After that is done, you can change the CRUSH rules of the `cephfs_data` and `cephfs_metadata`
pools under Pools to use the NVMe drives.
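
A rough CLI equivalent, assuming the default `cephfs` name and its default pool names:

```
pveceph fs create --name cephfs --add-storage   # run on one node
ceph osd pool set cephfs_data crush_rule replicated_nvme
ceph osd pool set cephfs_metadata crush_rule replicated_nvme
```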
### CEPH NUMA Pinning

This helps a bit with read latency (482.28us vs 437.22us).

The `hwloc-nox` package contains a program called `hwloc-ls` that visualizes the
connected hardware and NUMA nodes. In our case the Ceph network interface and the NVMe drive
are both connected to the same NUMA node. We can use `hwloc-calc -I core os=nvme0n1`
to get the list of CPU cores attached to the NVMe drive:

```
# hwloc-calc -I core os=nvme0n1
8,9,10,11,12,13,14,15
```

From that output we can create a systemd override file for the `ceph-osd@<ID>` daemons:

```
systemctl edit ceph-osd@0
```

And then paste:

```
[Service]
CPUAffinity=8,9,10,11,12,13,14,15
NUMAPolicy=default
NUMAMask=8,9,10,11,12,13,14,15
```

After restarting the OSD you should see in `numastat ceph-osd` that the OSD is mostly contained to a single NUMA node.
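
The restart-and-verify step as a sketch (OSD id 0 and the core list are just the example values from above):

```
systemctl restart ceph-osd@0
numastat ceph-osd   # memory usage of the ceph-osd processes, broken down per NUMA node
```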
Here are a bunch of example `fio` benchmark commands that can be used to verify this change:

https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
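
For example, a 4k random-read latency run could look like this (a sketch, not taken from the linked page; point `--filename` at a scratch disk or file inside a test VM):

```
fio --name=randread-latency --filename=/dev/vdb --size=10G \
    --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
```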