Proxmox Virtual Environment
User-facing docs: https://wiki.k-space.ee/en/hosting/proxmox
Adding new node
- Upgrade existing nodes.
- Install new nodes:
- Hostname: pveXX.proxmox.infra.k-space.ee
- Boot disk ZRAID-1
- 172.21 or DHCP may be used as initial IP. Installer configuration will be overwritten by cluster join and ansible.
- Add non-free-firmware as a component in /etc/apt/sources.list for the Debian (not PVE) suites bookworm, bookworm-updates and bookworm-security (next to main and contrib).
- Upgrade new nodes.
- (Unsure if still needed nowadays: disabling pve-enterprise and enabling pve-no-subscription.)
- Add new node to DNS (secretspace/ns1) and Ansible.
- Apply Ansible and reboot.
$ systemctl status watchdog-mux
should say Watchdog driver 'IPMI', version 1 and NOT Software Watchdog.
- Join to cluster in UI → Datacenter.
- The IP to use is the last one offered: the IPv6 address on vmbr0.
$ passwd
on the new node.
$ vim ~/.ssh/authorized_keys
→ sort the new key. Keys are managed manually, since PVE manages the file as well.
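The non-free-firmware step above can be sketched as a small filter. This is a minimal sketch, not the method Ansible or the installer uses: the function name add_non_free_firmware is made up for illustration, and it assumes that Debian (not PVE) entries can be recognised by a debian.org mirror URL. It only rewrites text on stdin, so nothing on the node changes until you copy the result into place yourself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads sources.list text on stdin, writes it to stdout,
# appending "non-free-firmware" to Debian deb/deb-src lines that lack it.
# Assumption: Debian mirrors contain "debian.org" in the URL, PVE repos do not.
add_non_free_firmware() {
  awk '
    # Active Debian entries that do not list the component yet.
    /^deb(-src)?[ \t]/ && /debian\.org/ && !/non-free-firmware/ {
      print $0 " non-free-firmware"
      next
    }
    # Comments, blank lines, PVE repos and already-fixed lines pass through.
    { print }
  '
}
```

A possible way to use it is add_non_free_firmware < /etc/apt/sources.list > /tmp/sources.list.new, followed by a diff against the original before replacing the file. Running it twice gives the same result as running it once.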
TODO: prometheus node exporter
TODO: create-external-cluster-resources.py in pve90
TODO: PVE backup server. We want local snapshots and offsite.
TODO: reinstate restic for /etc and /root
TODO: d12 discard
K-SPACE Hyper-Converged CEPH setup
Warning
K-SPACE Kubernetes uses PVE's CEPH cluster; k8s pools are not visible in the general PVE UI.
- Configure a mesh network
ansible-playbook proxmox/ceph.yaml
This will configure the 40Gbit interfaces and FRR daemon with OpenFabric routing. Our CEPH setup uses a private IPv6 subnet for inner cluster communication.
fdcc:a182:4fed::/64
You can check the mesh network status by launching the FRR shell
vtysh
and then typing
show openfabric topology

root@pve91:~# vtysh
pve91# show openfabric topology
IS-IS paths to level-2 routers that speak IPv6
Vertex                  Type          Metric  Next-Hop  Interface   Parent
------------------------------------------------------------------------------
pve91
fdcc:a182:4fed::91/128  IP6 internal  0                             pve91(4)
pve93                   TE-IS         10      pve93     enp161s0    pve91(4)
pve90                   TE-IS         10      pve90     enp161s0d1  pve91(4)
fdcc:a182:4fed::93/128  IP6 internal  20      pve93     enp161s0    pve93(4)
fdcc:a182:4fed::90/128  IP6 internal  20      pve90     enp161s0d1  pve90(4)
- Set up CEPH packages on all nodes
pveceph install --repository no-subscription --version squid
- CEPH init
pveceph init --network fdcc:a182:4fed::/64
- Create CEPH monitors on each node
pveceph mon create
- Also create CEPH managers on each node
pveceph mgr create
- Create OSD daemons for each disk on all nodes
NVMe drives get 2 OSD daemons per disk for better IOPS
pveceph osd create /dev/nvme0n1 --crush-device-class nvme --osds-per-device 2
HDDs get just 1
pveceph osd create /dev/sdX --crush-device-class hdd
- Create CRUSH Maps
We want to separate HDD and NVMe storage into different storage buckets. The default replicated_rule would put data blocks on all of the available disks.
# ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_hdd default host hdd
NB: Using the default replicated_rule for ANY CEPH pool will stop the Placement Group (PG) autoscaler from working, as it can't properly calculate how much space is available in CEPH due to the different device classes we are using.
- Create CEPH Pools for VM disk images
This is done in the individual node's Ceph -> Pools configuration.
NB: Under advanced, select correct Crush Rule (nvme or hdd)
- Create CephFS Storage pool for ISO images
First create a metadata server on each node
pveceph mds create
Then on one of the individual nodes create a CephFS.
After that is done, you can go to Pools and change the Crush rules of cephfs_data and cephfs_metadata to use NVMe drives.
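The per-disk OSD rule from the steps above (2 OSDs per NVMe drive, 1 per HDD) can be sketched as a dry-run helper. This is an illustration only: the function name osd_create_cmd is hypothetical, and it assumes a device is NVMe exactly when its name starts with nvme. It prints the pveceph command instead of running it, so the output can be reviewed first.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: print (do not run) the pveceph command for one disk.
# Assumption: device names starting with "nvme" are NVMe, anything else is an HDD.
osd_create_cmd() {
  local dev="$1"
  case "$(basename "$dev")" in
    nvme*)
      # NVMe: two OSD daemons per device, as described in the steps above.
      echo "pveceph osd create $dev --crush-device-class nvme --osds-per-device 2"
      ;;
    *)
      # HDD: a single OSD daemon per device.
      echo "pveceph osd create $dev --crush-device-class hdd"
      ;;
  esac
}
```

For example, osd_create_cmd /dev/nvme0n1 prints the NVMe variant of the command shown earlier, and any other device path yields the hdd variant.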
CEPH NUMA Pinning
This helps a bit with read latency (482.28us vs 437.22us)
The hwloc-nox package contains a program called hwloc-ls that visualizes
the connected hardware and NUMA nodes. In our case the Ceph network interface and the NVMe drive
are both connected to the same NUMA node. We can use hwloc-calc -I core os=nvme0n1
to get a list of the CPU cores attached to the NVMe drive.
# hwloc-calc -I core os=nvme0n1
8,9,10,11,12,13,14,15
From that output we can create a systemd override file for the ceph-osd@<ID> daemons.
systemctl edit ceph-osd@0
And then paste
[Service]
CPUAffinity=8,9,10,11,12,13,14,15
NUMAPolicy=default
NUMAMask=8,9,10,11,12,13,14,15
After restarting the OSD, numastat ceph-osd should show that the OSD is mostly confined to a single node.
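The override from this section can also be generated from the hwloc-calc output rather than pasted by hand. The sketch below is a minimal example with a hypothetical function name, osd_numa_override. It reproduces the three settings exactly as this document lists them and does not decide what the values should be. Note that systemd.exec(5) describes NUMAMask= as a list of NUMA node indices, so that value may be worth checking on your hardware.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a comma-separated core list, as printed by
# `hwloc-calc -I core os=<device>`, into the drop-in body from this section.
osd_numa_override() {
  local cores="$1"
  # Refuse anything that is not a plain list of digits and commas.
  case "$cores" in
    ''|*[!0-9,]*)
      echo "expected a comma-separated core list, got: '$cores'" >&2
      return 1
      ;;
  esac
  # Same three settings as the hand-written override in this document.
  printf '[Service]\nCPUAffinity=%s\nNUMAPolicy=default\nNUMAMask=%s\n' "$cores" "$cores"
}
```

One way to use it: create /etc/systemd/system/ceph-osd@0.service.d/ if it does not exist, write the output of osd_numa_override "$(hwloc-calc -I core os=nvme0n1)" to override.conf in that directory (the file systemctl edit would create), then run systemctl daemon-reload and restart the OSD.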
Here are a bunch of example fio benchmark commands that can be used to verify this change:
https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm