Toba

Meet Toba the cat, the latest security shield model, equipped with 24/7 surveillance mode and an aggressive defense mechanism when unauthorized movement is detected. 🚨😼

Any attempt to relocate this elite guardian from the server rack results in immediate resistance, accompanied by sharp glares and indignant meows. 🛑🐾

Plan to Host Mastodon on 4 × Raspberry Pis

In my current homelab project, I am building a bare-metal Kubernetes cluster to host ~10–50 containers per node. What app will run there? Well, I’ll figure that out along the way, but I’m considering hosting Mastodon. (For those who don’t know, Mastodon is a decentralized microblogging platform.)

No Hypervisors for This Project 🚫

As you might know from my previous posts, I have 4× Raspberry Pi 5 boards. Each features a 64-bit quad-core ARM Cortex-A76 processor running at 2.4 GHz with 8 GB of RAM. In my setup, 3 Raspberry Pis act as worker nodes, while 1 serves as the control plane (in the future, I’ll increase that to 3 to satisfy Raft consensus, but for now this setup is fine).

Kubernetes pods will run directly on the host OS. The OS is Raspberry Pi OS (64-bit), booted over the LAN network. If you’re interested, you can read more in this post about setting up PXE boot.

Even though Raspberry Pi 5 delivers a 2–3× increase in CPU performance compared to Raspberry Pi 4, I’m avoiding virtualization solutions like VMware and Proxmox (I like Proxmox, but not for this setup) because they introduce unnecessary overhead.

➕ One day, I’m thinking about building a voice assistant (currently, I have a Siri HomePod but want to make it smarter) that interacts with LLMs and controls physical devices like smart sockets, lights, or energy monitoring systems. For that, I need access to GPIO pins (direct hardware interaction). Virtualization makes passing through “exotic” hardware problematic, so running on bare metal is the better choice.

What’s for dinner? 💭

In today’s post, I’m going to shed light on an issue I faced with PXE boot over NFS: containerd’s snapshotter component is not compatible with NFS. But how did I come to that conclusion? Let’s check it out.

dmesg, the tool that dumps the Linux kernel ring buffer, spews these logs ⬇️

[Thu Feb  6 09:20:56 2025] overlayfs: upper fs does not support tmpfile.
[Thu Feb  6 09:20:56 2025] overlayfs: upper fs does not support RENAME_WHITEOUT.
[Thu Feb  6 09:20:56 2025] overlayfs: failed to set xattr on upper
[Thu Feb  6 09:20:56 2025] overlayfs: ...falling back to redirect_dir=nofollow.
[Thu Feb  6 09:20:56 2025] overlayfs: ...falling back to uuid=null.
[Thu Feb  6 09:20:56 2025] overlayfs: upper fs missing required feature

and kubectl describe pod shows this error; both point to the same underlying problem.

kubectl describe pod -l app=nginx --namespace default --kubeconfig /opt/hl-control-node/_tmp/kubeconfigs/admin.kubeconfig

Events:
  Type     Reason                  Age                      From     Message
  ----     ------                  ----                     ----     -------
  Warning  FailedCreatePodSandBox  13m (x19065 over 2d23h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "1323c487a948330ef32a782dc48095e74998758163f49401ce91cd649013d8ec": failed to create containerd task: failed to create shim task: failed to mount rootfs component: invalid argument
  Warning  FailedCreatePodSandBox  5m29s                    kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start s

overlayfs is one of the filesystems used by the containerd snapshotter. The “upper fs” is literally my upper filesystem, which in my case is NFS, since my entire root filesystem (/) lives on NFS.

 df -h
Filesystem                                      Size  Used Avail Use% Mounted on
udev                                            3.8G     0  3.8G   0% /dev
tmpfs                                           806M  5.3M  801M   1% /run
192.168.101.253:/volume1/RPi5-PXE/node4/rootfs  885G  148G  737G  17% /
tmpfs                                           4.0G     0  4.0G   0% /dev/shm
tmpfs                                           5.0M   48K  5.0M   1% /run/lock
192.168.101.253:/volume1/RPi5-PXE/node4         885G  148G  737G  17% /boot
tmpfs                                           806M     0  806M   0% /run/user/1000

All of this means that the network filesystem lacks features overlayfs requires: tmpfile, RENAME_WHITEOUT, and xattr support.
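If you want to see one of these missing features with your own eyes, a quick and crude check is to try setting a user xattr on a file that lives on the NFS root (setfattr/getfattr come from the attr package):

sudo apt install -y attr                      # provides setfattr/getfattr
touch ~/xattr-test                            # home dir sits on the NFS root in this setup
setfattr -n user.test -v hello ~/xattr-test   # on an NFSv3 mount this typically fails with "Operation not supported"
rm ~/xattr-test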

containerd architecture (image taken from the official source: https://containerd.io)

My first thought was to switch from overlayfs to the btrfs or devmapper snapshotter, but that would be trading one problem for another rather than a fix: both of those snapshotters need a real block device underneath.
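For the curious: switching the snapshotter is just a one-line change in containerd’s config, shown here purely for illustration (the plugin section name below matches containerd 1.7.x and may differ on other versions; I did not go this route):

# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "btrfs"   # or "devmapper"; both still need a real block device underneath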

Here is my illustration of how block storage works:

block storage

Compared to network file storage, block storage adds a layer of logic that assigns unique identifiers to blocks of data and stores them in a lookup table. Read operations are therefore much faster than on unstructured storage like NFS.

This means that to make Kubernetes work in my current setup, I need to “transform” the directories Kubernetes depends on into block storage. And the solution for that is iSCSI (/aɪˈskʌzi/ eye-SKUZ-ee). Boom.

iSCSIfication

iSCSI (Internet Small Computer System Interface)

Wiki: iSCSI provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network.

iSCSI allows remote data sharing (just like NFS) but at the block level. It enables data exchange between multiple client machines and a block storage device (or block server), which is accessed just like a local disk drive.
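In practice, once a node logs in to its iSCSI target, the exported LUNs simply show up as ordinary local disks. Roughly what lsblk prints on one of my nodes afterwards (device letters, sizes, and column names will vary with your setup and util-linux version):

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda    8:0    0  10G  0 disk
sdb    8:16   0   5G  0 disk
sdc    8:32   0  10G  0 disk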

Here is an illustration of how it works with my homelab devices:

iSCSI client/server auth flow

Setting up iSCSI on Synology NAS

  1. create an iSCSI Target for each cluster node.
  2. define logical unit numbers (LUNs) for:
    • /var/lib/containerd (stores container images and metadata)
    • /var/lib/kubelet (Kubernetes node-specific data)
    • /var/logs/pods (stores logs from running containers)
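To give a sense of scale, here is roughly how the node3 target ends up laid out (the IQN follows Synology’s default naming and appears again in the iscsiadm commands below; the sizes are simply what I picked and show up later in the df output):

Target: iqn.2000-01.com.synology:nas.Target-node3
  LUN  containerd   ~10 GB   → /var/lib/containerd
  LUN  kubelet      ~10 GB   → /var/lib/kubelet
  LUN  logs          ~5 GB   → /var/logs/pods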

Configuring iSCSI Initiator on Raspberry Pi nodes

  1. install the iscsi client:
    sudo apt install -y open-iscsi
    sudo systemctl enable --now open-iscsi
    
  2. discover the iscsi targets:
    sudo iscsiadm -m discovery -t st -p 192.168.101.253
    
  3. log in to the iscsi target:
    sudo iscsiadm -m node --targetname "iqn.2000-01.com.synology:nas.Target-node3" --portal "192.168.101.253:3260" --login
    

3.1. enable iSCSI auto-login to restore the session after reboot:

sudo iscsiadm -m node --targetname "iqn.2000-01.com.synology:nas.Target-node3" --portal "192.168.101.253:3260" --op update -n node.startup -v automatic
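A quick way to confirm the login worked (and, after a reboot, that auto-login kicked in) is to list the active sessions; the exact output format varies a bit between open-iscsi versions:

sudo iscsiadm -m session
# expect one line per logged-in target, here the node3 target on 192.168.101.253:3260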
  4. format the block devices as ext4:
    sudo mkfs.ext4 -L kubelet /dev/sda
    sudo mkfs.ext4 -L logs /dev/sdb
    sudo mkfs.ext4 -L containerd /dev/sdc
    
     yuklia@node3:~ $ sudo blkid /dev/sda
     /dev/sda: LABEL="kubelet" UUID="317346de-f632-4f5d-9ec0-90770f56938d" BLOCK_SIZE="4096" TYPE="ext4"
     yuklia@node3:~ $ sudo blkid /dev/sdb
     /dev/sdb: LABEL="logs" UUID="d35111a5-f56a-4bde-a731-87e630fa0aed" BLOCK_SIZE="4096" TYPE="ext4"
     yuklia@node3:~ $ sudo blkid /dev/sdc
     /dev/sdc: LABEL="containerd" UUID="8c83f195-eb3e-4226-914d-98837e3ddcca" BLOCK_SIZE="4096" TYPE="ext4"
    

4.1 update /etc/fstab to persist the mounts after reboot

cat /etc/fstab 
proc            /proc           proc    defaults          0       0
192.168.101.253:/volume1/RPi5-PXE/node4 /boot nfs defaults,vers=3,proto=tcp 0 0
UUID=317346de-f632-4f5d-9ec0-90770f56938d /var/lib/kubelet ext4 defaults,_netdev 0 0
UUID=d35111a5-f56a-4bde-a731-87e630fa0aed /var/logs/pods ext4 defaults,_netdev 0 0
UUID=8c83f195-eb3e-4226-914d-98837e3ddcca /var/lib/containerd ext4 defaults,_netdev 0 0
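Before trusting those entries on the next boot, it doesn’t hurt to lint the file; findmnt ships with util-linux on Raspberry Pi OS, and --verify is available in reasonably recent versions:

sudo findmnt --verify
# parses /etc/fstab and reports typos, unknown UUIDs, missing mount points, etc.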
  5. mount the block devices:

    sudo mount /dev/sda /var/lib/kubelet
    sudo mount /dev/sdb /var/logs/pods
    sudo mount /dev/sdc /var/lib/containerd
    
  6. job done

    yuklia@node3:~ $ df -h
    Filesystem                                      Size  Used Avail Use% Mounted on
    udev                                            3.8G     0  3.8G   0% /dev
    tmpfs                                           806M  5.4M  800M   1% /run
    192.168.101.253:/volume1/RPi5-PXE/node4/rootfs  885G  227G  658G  26% /
    tmpfs                                           4.0G     0  4.0G   0% /dev/shm
    tmpfs                                           5.0M   48K  5.0M   1% /run/lock
    /dev/sdb                                        4.9G   24K  4.6G   1% /var/logs/pods
    /dev/sda                                         11G  124K   11G   1% /var/lib/kubelet
    /dev/sdc                                        9.8G  270M  9.0G   3% /var/lib/containerd
    192.168.101.253:/volume1/RPi5-PXE/node4         885G  227G  658G  26% /boot
    tmpfs                                           806M     0  806M   0% /run/user/1000
    

Smoke test?

kubectl get pod -l app=nginx

oh no, CrashLoopBackOff

...
     Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unified resource "memory.oom.group" can't be set: controller "memory" not available

Root Cause: memory.oom.group Can’t Be Set

This means the container runtime (containerd + runc) is trying to set memory cgroup configurations, but the memory cgroup controller is missing or not enabled on the node.

Updating the kernel boot parameters with cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=1 helped.

sudo cat /boot/cmdline.txt
dwcotg.lpm_enable=0 console=serial0,115200 console=tty1 elevator=deadline rootwait rw root=/dev/nfs nfsroot=192.168.101.253:/volume1/RPi5-PXE/node4/rootfs,v3,tcp ip=dhcp cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=1
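After a reboot, it’s easy to confirm the memory controller is really available; with the unified (cgroup v2) hierarchy it is listed here:

cat /sys/fs/cgroup/cgroup.controllers
# the list should now include "memory" (alongside cpu, io, pids, ...)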

Smoke test, take two 🧪

check that all nodes are up & running

kubectl get nodes --kubeconfig /opt/hl-control-node/_tmp/kubeconfigs/admin.kubeconfig
NAME    STATUS   ROLES    AGE   VERSION
node1   Ready    <none>   37d   v1.31.2
node2   Ready    <none>   37d   v1.31.2
node3   Ready    <none>   37d   v1.31.2

check that the nginx pod is up & running

kubectl get pod -l app=nginx --namespace default --kubeconfig /opt/hl-control-node/_tmp/kubeconfigs/admin.kubeconfig
NAME                     READY   STATUS    RESTARTS   AGE
nginx-54f87867d6-9tbgl   1/1     Running   0          3h13m

Final Thoughts

I started with PXE boot and NFS, thinking it would be a clean and efficient solution. However, I quickly encountered limitations due to OverlayFS incompatibility in containerd. After troubleshooting, I realized that NFS wasn’t sufficient and decided to pivot to iSCSI block storage.

Setting up iSCSI on my Synology NAS and configuring the initiators on the Raspberry Pi nodes required some effort. Once completed, it resolved the persistent storage issue. Of course, Kubernetes wouldn’t let me off that easy. I hit another roadblock with memory cgroup issues, which I fixed by tweaking kernel boot parameters.

After all the adjustments and smoke tests, the cluster is now running smoothly. I can finally deploy workloads without storage headaches. Next step? Automating iSCSI provisioning with OpenTofu (Terraform) to make scaling effortless. Stay tuned for that adventure! 🚀