Introduction
At ungleich we are running multiple Ceph clusters. Some of them are running Ceph Nautilus (14.x) based on Devuan. Our newer Ceph Pacific (16.x) clusters are running based on Rook on Kubernetes on top of Alpine Linux.
In this blog article we will describe how to migrate Ceph/Native/Devuan to Ceph/k8s+rook/Alpine Linux.
Work in Progress [WIP]
This blog article is a work in progress. The migration planning has started, but the migration has not been finished yet. This article will feature the different paths we take for the migration.
The Plan
To continue operating the cluster during the migration, the following steps are planned:
- Set up a k8s cluster that can potentially communicate with the existing ceph cluster
- Use the disaster recovery guidelines from rook to modify the rook configuration to use the previous fsid
- Spin up ceph monitors and ceph managers in rook
- Retire existing monitors
- Shut down a ceph OSD node, remove its OS disk, boot it with Alpine Linux
- Join the node into the k8s cluster
- Have rook pick up the existing disks and start the osds
- Repeat if successful
- Migrate to ceph pacific
Original cluster
The target ceph cluster we want to migrate lives in the 2a0a:e5c0::/64 network. Ceph is using:
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
Kubernetes cluster networking inside the ceph network
To be able to communicate with the existing OSDs, we will be using subnetworks of 2a0a:e5c0::/64 for kubernetes. As these networks are part of the network assigned to the link (2a0a:e5c0::/64), we will use BGP routing on the existing ceph nodes to create more specific routes into the kubernetes cluster.
As we plan to use either cilium or calico as the CNI, we can configure kubernetes to directly BGP peer with the existing Ceph nodes.
The setup
Kubernetes Bootstrap
As usual we bootstrap 3 control plane nodes using kubeadm. The proxy for the API resides in a different kubernetes cluster.
We run
kubeadm init --config kubeadm.yaml
on the first node and then join the other two control plane nodes. As usual, the workers are joined last.
k8s Networking / CNI
For this setup we are using calico as described in the ungleich kubernetes manual.
VERSION=v3.23.3
helm repo add projectcalico https://docs.projectcalico.org/charts
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
BGP Networking on the old nodes
To be able to import the BGP routes from Kubernetes, all old / native hosts will run bird. The installation and configuration is as follows:
apt-get update
apt-get install -y bird2
router_id=$(hostname | sed 's/server//')
cat > /etc/bird/bird.conf <<EOF
router id $router_id;
log syslog all;
protocol device {
}
# We are only interested in IPv6, skip another section for IPv4
protocol kernel {
ipv6 { export all; };
}
protocol bgp k8s {
local as 65530;
neighbor range 2a0a:e5c0::/64 as 65533;
dynamic name "k8s_"; direct;
ipv6 {
import filter { if net.len > 64 then accept; else reject; };
export none;
};
}
EOF
/etc/init.d/bird restart
The router id must be adjusted for every host. As all hosts have a unique number, we use that number as the router id. The bird configuration uses dynamic peers, so that any k8s node in the network can peer with the old servers.
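The derivation of the router id can be sketched as a small helper. This is only an illustration: the function name `router_id_for` is hypothetical, and it assumes host names of the form `serverNN`. A fully qualified name or a host that does not follow that pattern leaves non-numeric text behind, in which case the router id would have to be set by hand.

```shell
# Derive a numeric BIRD router id from a host name such as "server57".
# Prints the number, or fails when something non-numeric is left over.
router_id_for() {
    id=$(printf '%s\n' "$1" | sed 's/server//')
    case "$id" in
        ''|*[!0-9]*)
            echo "cannot derive router id from '$1'" >&2
            return 1
            ;;
        *)
            printf '%s\n' "$id"
            ;;
    esac
}

router_id_for server57
router_id_for red3 || echo "red3 needs a manually assigned router id"
```

On a real host the argument would be the output of `hostname`, as in the installation snippet above.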
We also use a filter to avoid receiving /64 routes, as they overlap with the on-link route.
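To make the effect of the import filter concrete, the following sketch mirrors the `net.len > 64` condition in plain shell. The helper name `accepts_route` and the /122 example prefix are assumptions for illustration only; bird itself evaluates the filter, not this script.

```shell
# Mirror of the BIRD import filter "if net.len > 64 then accept; else reject;".
# Takes a prefix in CIDR notation and succeeds only for prefixes longer than /64.
accepts_route() {
    prefix_len=${1##*/}
    [ "$prefix_len" -gt 64 ]
}

# The /64 is the on-link network, the /108 is the service range from this
# article, the /122 is a hypothetical per-node pod block.
for route in 2a0a:e5c0::/64 2a0a:e5c0:0:aaaa::/108 2a0a:e5c0:0:14:8000::/122; do
    if accepts_route "$route"; then
        echo "accept $route"
    else
        echo "reject $route"
    fi
done
```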
BGP networking in Kubernetes
Calico supports BGP peering and we use a rather standard calico configuration:
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65533
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
Plus for each server and router we create a BGPPeer:
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: serverXX
spec:
  peerIP: 2a0a:e5c0::XX
  asNumber: 65530
  keepOriginalNextHop: true
We apply the whole configuration using calicoctl:
./calicoctl create -f - < ~/vcs/k8s-config/bootstrap/p5-cow/calico-bgp.yaml
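Since one BGPPeer is needed per server and router, the manifests can be generated instead of written by hand. The sketch below is an assumption-laden illustration: the function name `bgp_peers` is hypothetical, and it assumes that server NN is reachable at 2a0a:e5c0::NN, which is how we read the XX placeholder above.

```shell
# Print one Calico BGPPeer manifest per server number given as argument.
# Assumption: server NN is reachable at 2a0a:e5c0::NN (the XX placeholder).
bgp_peers() {
    for n in "$@"; do
        cat <<EOF
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: server$n
spec:
  peerIP: 2a0a:e5c0::$n
  asNumber: 65530
  keepOriginalNextHop: true
EOF
    done
}

bgp_peers 3 4
```

The output could then be piped into calicoctl, for example `bgp_peers 3 4 | ./calicoctl create -f -`.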
And a few seconds later we can observe the routes on the old / native hosts:
bird> show protocols
Name Proto Table State Since Info
device1 Device --- up 23:09:01.393
kernel1 Kernel master6 up 23:09:01.393
k8s BGP --- start 23:09:01.393 Passive
k8s_1 BGP --- up 23:33:01.215 Established
k8s_2 BGP --- up 23:33:01.215 Established
k8s_3 BGP --- up 23:33:01.420 Established
k8s_4 BGP --- up 23:33:01.215 Established
k8s_5 BGP --- up 23:33:01.215 Established
Testing networking
To verify that the new cluster is working properly, we can deploy a tiny test deployment and see if it is globally reachable:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.0-alpine
        ports:
        - containerPort: 80
And the corresponding service:
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
Using curl to access a sample service from the outside shows that networking is working:
% curl -v http://[2a0a:e5c0:0:aaaa::e3c9]
* Trying 2a0a:e5c0:0:aaaa::e3c9:80...
* Connected to 2a0a:e5c0:0:aaaa::e3c9 (2a0a:e5c0:0:aaaa::e3c9) port 80 (#0)
> GET / HTTP/1.1
> Host: [2a0a:e5c0:0:aaaa::e3c9]
> User-Agent: curl/7.84.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.20.0
< Date: Sat, 27 Aug 2022 22:35:49 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 20 Apr 2021 16:11:05 GMT
< Connection: keep-alive
< ETag: "607efd19-264"
< Accept-Ranges: bytes
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
* Connection #0 to host 2a0a:e5c0:0:aaaa::e3c9 left intact
So far we have found 1 issue:
- Sometimes the old/native servers can reach the service, sometimes they get a timeout
Old calico notes on GitHub mention that overlapping pod CIDR networks might be a problem. Additionally, we cannot use kubeadm to initialise the pod subnet as a proper subnet of the node subnet:
[00:15] server57.place5:~# kubeadm init --service-cidr 2a0a:e5c0:0:cccc::/108 --pod-network-cidr 2a0a:e5c0::/100
I0829 00:16:38.659341 19400 version.go:255] remote version is much newer: v1.25.0; falling back to: stable-1.24
podSubnet: Invalid value: "2a0a:e5c0::/100": the size of pod subnet with mask 100 is smaller than the size of node subnet with mask 64
To see the stack trace of this error execute with --v=5 or higher
[00:16] server57.place5:~#
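The comparison behind this error message can be reproduced with a few lines of shell. This is a sketch of the arithmetic only, not kubeadm's implementation; the helper name `pod_subnet_fits` is hypothetical and the node mask of 64 is taken from the error message.

```shell
# Reproduce the comparison behind the kubeadm error above: the pod subnet
# must not be smaller than the per-node subnet that is carved out of it.
# Assumption: the node mask is the /64 named in the error message.
pod_subnet_fits() {
    pod_mask=${1##*/}
    node_mask=${2:-64}
    [ "$pod_mask" -le "$node_mask" ]
}

pod_subnet_fits 2a0a:e5c0::/100 || echo "/100 is smaller than a /64 node subnet"
pod_subnet_fits 2a0a:e5c0:0:14::/64 && echo "/64 passes this check"
```

A /100 therefore cannot work as long as each node is expected to receive a /64 of its own.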
Networking 2022-09-03
- Instead of trying to merge the cluster networks, we will use separate ranges
- According to the ceph users mailing list discussion it is actually not necessary for mons/osds to be in the same network. In fact, we might be able to remove these settings completely.
So today we start with
- podSubnet: 2a0a:e5c0:0:14::/64
- serviceSubnet: 2a0a:e5c0:0:15::/108
Using BGP and calico, the kubernetes cluster is set up "as usual" (by ungleich standards).
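For reference, the two ranges would typically end up in the networking section of the kubeadm configuration. The following is a minimal sketch, assuming the `kubeadm.k8s.io/v1beta3` configuration format. The real kubeadm.yaml is not shown in this article and contains more settings, and the helper name `write_kubeadm_networking` is hypothetical.

```shell
# Write a minimal kubeadm ClusterConfiguration carrying the two ranges.
# Assumption: the real kubeadm.yaml contains further settings that are
# omitted here (control plane endpoint, versions and so on).
write_kubeadm_networking() {
    cat > "$1" <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 2a0a:e5c0:0:14::/64
  serviceSubnet: 2a0a:e5c0:0:15::/108
EOF
}

cfg=$(mktemp)
write_kubeadm_networking "$cfg"
cat "$cfg"
```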
Ceph.conf change
Originally our ceph.conf contained:
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
As of today they have been removed and all daemons have been restarted, allowing the native cluster to speak with the kubernetes cluster.
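Removing the two settings from a ceph.conf can be sketched as follows. The helper name `strip_network_settings` and the sample file are assumptions for illustration; the fsid is a placeholder, and the helper only prints the result instead of editing the file in place.

```shell
# Drop the "public network" and "cluster network" settings from a ceph.conf.
# Both spellings (spaces or underscores) are matched; other lines are kept.
strip_network_settings() {
    sed -E '/^[[:space:]]*(public|cluster)[ _]network[[:space:]]*=/d' "$1"
}

# Hypothetical sample file with a placeholder fsid.
conf=$(mktemp)
cat > "$conf" <<EOF
[global]
fsid = 00000000-0000-0000-0000-000000000000
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
EOF

strip_network_settings "$conf"
```

The daemons only pick up the change after a restart, which is why all of them were restarted.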
Setting up rook
Usually we deploy rook via argocd. However, as we want to be able to intervene manually with ease, we will first bootstrap rook via helm directly and turn off various services:
helm repo add rook https://charts.rook.io/release
helm repo update
We will use rook 1.8, as it is the last version to support Ceph nautilus, which is our current ceph version. The latest 1.8 version is 1.8.10 at the moment.
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.8.10 rook-ceph rook/rook-ceph
Joining the 2 clusters, step 1: monitors and managers
In the first step we want to add rook based monitors and managers and replace the native ones. For rook to be able to talk to our existing cluster, it needs to know
- the current monitors/managers ("the monmap")
- the right keys to talk to the existing cluster
- the fsid
As we are using v1.8, we will follow the disaster recovery guidelines of rook 1.8.
Later we will need to create all the configurations so that rook knows about the different pools.
Rook: CephCluster
Rook has a configuration of type CephCluster that typically looks something like this:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # see the "Cluster Settings" section below for more details on which image of ceph to run
    image: quay.io/ceph/ceph:{{ .Chart.AppVersion }}
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: true
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    daysToRetain: 30
For migrating, we don't want rook in the first stage to create any OSDs. So we will replace useAllNodes: true with useAllNodes: false and useAllDevices: true also with useAllDevices: false.
Extracting a monmap
To get access to the existing monmap, we can export it from the native cluster using ceph-mon -i {mon-id} --extract-monmap {map-path}.
More details can be found in the documentation for adding and removing ceph monitors.
Rook and Ceph pools
Rook uses CephBlockPool to describe ceph pools as follows:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd
In this particular cluster we have 2 pools:
- one (ssd based, device class = ssd)
- hdd (hdd based, device class = hdd-big)
The device class "hdd-big" is specific to this cluster as it used to contain 2.5" and 3.5" HDDs in different pools.
[old] Analysing the ceph cluster configuration
Taking the view from the old cluster, the following items are important for adding new services/nodes:
- We have a specific fsid that needs to be known
  - The expectation would be to find that fsid in a configmap/secret in rook
- We have a list of running monitors
  - This is part of the monmap and ceph.conf
  - ceph.conf is used for finding the initial contact point
  - Afterwards the information is provided by the monitors
  - For rook it would be expected to have a configmap/secret listing the current monitors
- The native clusters have a "ceph.client.admin.keyring" deployed which allows adding and removing resources.
  - Rook probably has a secret for keyrings
  - Maybe multiple depending on how services are organised
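The fsid and the monitor list can be read directly from a native ceph.conf. The following is a minimal sketch, assuming a simple `key = value` layout. The helper name `ceph_conf_get` is hypothetical, and the sample file uses a placeholder fsid and a made-up monitor list.

```shell
# Read a single key from a ceph.conf-style file. Keys may be written with
# spaces or underscores ("mon host" and "mon_host" are treated the same).
ceph_conf_get() {
    awk -v want="$2" '
        /=/ {
            key = $0
            sub(/=.*/, "", key)                  # text left of the first "="
            gsub(/^[ \t]+|[ \t]+$/, "", key)     # trim surrounding whitespace
            gsub(/_/, " ", key)                  # mon_host -> mon host
            if (key == want) {
                val = $0
                sub(/^[^=]*=[ \t]*/, "", val)    # text right of the first "="
                print val
                exit
            }
        }
    ' "$1"
}

# Hypothetical sample file: placeholder fsid, made-up monitor list.
sample=$(mktemp)
cat > "$sample" <<EOF
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_host = red1,red2,red3
EOF

ceph_conf_get "$sample" fsid
ceph_conf_get "$sample" "mon host"
```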
Analysing the rook configurations
Taking the opposite view, we can also check out a running rook cluster and the rook disaster recovery documentation to identify what to modify.
Let's have a look at the secrets first:
cluster-peer-token-rook-ceph kubernetes.io/rook 2 320d
default-token-xm9xs kubernetes.io/service-account-token 3 320d
rook-ceph-admin-keyring kubernetes.io/rook 1 320d
rook-ceph-admission-controller kubernetes.io/tls 3 29d
rook-ceph-cmd-reporter-token-5mh88 kubernetes.io/service-account-token 3 320d
rook-ceph-config kubernetes.io/rook 2 320d
rook-ceph-crash-collector-keyring kubernetes.io/rook 1 320d
rook-ceph-mgr-a-keyring kubernetes.io/rook 1 320d
rook-ceph-mgr-b-keyring kubernetes.io/rook 1 320d
rook-ceph-mgr-token-ktt2m kubernetes.io/service-account-token 3 320d
rook-ceph-mon kubernetes.io/rook 4 320d
rook-ceph-mons-keyring kubernetes.io/rook 1 320d
rook-ceph-osd-token-8m6lb kubernetes.io/service-account-token 3 320d
rook-ceph-purge-osd-token-hznnk kubernetes.io/service-account-token 3 320d
rook-ceph-rgw-token-wlzbc kubernetes.io/service-account-token 3 134d
rook-ceph-system-token-lxclf kubernetes.io/service-account-token 3 320d
rook-csi-cephfs-node kubernetes.io/rook 2 320d
rook-csi-cephfs-plugin-sa-token-hkq2g kubernetes.io/service-account-token 3 320d
rook-csi-cephfs-provisioner kubernetes.io/rook 2 320d
rook-csi-cephfs-provisioner-sa-token-tb78d kubernetes.io/service-account-token 3 320d
rook-csi-rbd-node kubernetes.io/rook 2 320d
rook-csi-rbd-plugin-sa-token-dhhq6 kubernetes.io/service-account-token 3 320d
rook-csi-rbd-provisioner kubernetes.io/rook 2 320d
rook-csi-rbd-provisioner-sa-token-lhr4l kubernetes.io/service-account-token 3 320d
TBC
Creating additional resources after the cluster is bootstrapped
To let rook know what should be there, we already create the two CephBlockPool instances that match the existing pools:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: one
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: ssd
And for the hdd based pool:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd-big
Saving both of these in ceph-blockpools.yaml and applying it:
kubectl -n rook-ceph apply -f ceph-blockpools.yaml
### Configuring ceph after the operator deployment
As soon as the operator and the crds have been deployed, we deploy the
following
[CephCluster](https://rook.io/docs/rook/v1.8/ceph-cluster-crd.html)
configuration:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v14.2.21
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: false
    useAllDevices: false
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    daysToRetain: 30
We wait for the cluster to initialise and stabilise before applying
changes. Important to note is that we use the ceph image version
v14.2.21, which is the same version as the native cluster.
### rook v1.8 is incompatible with ceph nautilus
After deploying the rook operator, the following error message is
printed in its logs:
2022-09-03 15:14:03.543925 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: the version does not meet the minimum version "15.2.0-0 octopus"
So we need to downgrade to rook v1.7. Using `helm search repo rook/rook-ceph --versions` we identify `v1.7.11` as the latest usable version.
We start the downgrade process using
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.7.11 rook-ceph rook/rook-ceph
After the downgrade, the operator starts the canary monitors and
continues to bootstrap the cluster.
### The ceph-toolbox
To be able to view the current cluster status, we also deploy the
ceph-toolbox for interacting with rook:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  namespace: rook-ceph # namespace:cluster
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: rook-ceph-tools
          image: rook/ceph:v1.7.11
          command: ["/bin/bash"]
          args: ["-m", "-c", "/usr/local/bin/toolbox.sh"]
          imagePullPolicy: IfNotPresent
          tty: true
          securityContext:
            runAsNonRoot: true
            runAsUser: 2016
            runAsGroup: 2016
          env:
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-username
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-secret
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-config
            - name: mon-endpoint-volume
              mountPath: /etc/rook
      volumes:
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
              - key: data
                path: mon-endpoints
        - name: ceph-config
          emptyDir: {}
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 5
### Checking the deployments
After the rook-operator finished deploying, the following deployments
are visible in kubernetes:
[17:25] blind:~% kubectl -n rook-ceph get deployment
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
csi-cephfsplugin-provisioner        2/2     2            2           21m
csi-rbdplugin-provisioner           2/2     2            2           21m
rook-ceph-crashcollector-server48   1/1     1            1           2m3s
rook-ceph-crashcollector-server52   1/1     1            1           2m24s
rook-ceph-crashcollector-server53   1/1     1            1           2m2s
rook-ceph-crashcollector-server56   1/1     1            1           2m17s
rook-ceph-crashcollector-server57   1/1     1            1           2m1s
rook-ceph-mgr-a                     1/1     1            1           2m3s
rook-ceph-mon-a                     1/1     1            1           10m
rook-ceph-mon-b                     1/1     1            1           8m3s
rook-ceph-mon-c                     1/1     1            1           5m55s
rook-ceph-mon-d                     1/1     1            1           5m33s
rook-ceph-mon-e                     1/1     1            1           4m32s
rook-ceph-operator                  1/1     1            1           102m
rook-ceph-tools                     1/1     1            1           17m
Relevant for us are the mgr, mon and operator deployments. To stop the
cluster, we will shut down the deployments in the following order:
* rook-ceph-operator: to prevent the other deployments from being recreated
### Data / configuration comparison
Logging into a host that is running mon-a, we find the following data
in it:
[17:36] server56.place5:/var/lib/rook# find
.
./mon-a
./mon-a/data
./mon-a/data/keyring
./mon-a/data/min_mon_release
./mon-a/data/store.db
./mon-a/data/store.db/LOCK
./mon-a/data/store.db/000006.log
./mon-a/data/store.db/000004.sst
./mon-a/data/store.db/CURRENT
./mon-a/data/store.db/MANIFEST-000005
./mon-a/data/store.db/OPTIONS-000008
./mon-a/data/store.db/OPTIONS-000005
./mon-a/data/store.db/IDENTITY
./mon-a/data/kv_backend
./rook-ceph
./rook-ceph/crash
./rook-ceph/crash/posted
./rook-ceph/log
Which is pretty similar to the native nodes:
[17:37:50] red3.place5:/var/lib/ceph/mon/ceph-red3# find
.
./sysvinit
./keyring
./min_mon_release
./kv_backend
./store.db
./store.db/1959645.sst
./store.db/1959800.sst
./store.db/OPTIONS-3617174
./store.db/2056973.sst
./store.db/3617348.sst
./store.db/OPTIONS-3599785
./store.db/MANIFEST-3617171
./store.db/1959695.sst
./store.db/CURRENT
./store.db/LOCK
./store.db/2524598.sst
./store.db/IDENTITY
./store.db/1959580.sst
./store.db/2514570.sst
./store.db/1959831.sst
./store.db/3617346.log
./store.db/2511347.sst
### Checking how monitors are created on native ceph
To prepare for the migration we take one step back and verify how
monitors are created in the native cluster. The script used for
monitor creation can be found on
[code.ungleich.ch](https://code.ungleich.ch/ungleich-public/ungleich-tools/src/branch/master/ceph/ceph-mon-create-start)
and contains the following logic:
* get "mon." key
* get the monmap
* Run ceph-mon --mkfs using the monmap and keyring
* Start it
In theory we could re-use these steps on a rook deployed monitor to
join our existing cluster.
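The four steps can be sketched as a dry run. This is not a copy of the linked script, just an illustration of the sequence. The helper names `run` and `create_mon` and the paths under /tmp are assumptions, and with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
# Dry-run sketch of the four steps: fetch the mon. key, fetch the monmap,
# create the monitor store, start the daemon. With DRY_RUN=1 (the default
# here) the commands are only printed, nothing is executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_mon() {
    mon_id=$1
    run ceph auth get mon. -o /tmp/mon-key
    run ceph mon getmap -o /tmp/monmap
    run ceph-mon -i "$mon_id" --mkfs --monmap /tmp/monmap --keyring /tmp/mon-key
    run ceph-mon -i "$mon_id"
}

create_mon a
```

Setting `DRY_RUN=0` would execute the commands, which only makes sense on a host that has the ceph tools and an admin keyring available.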
### Checking the toolbox and monitor pods for migration
When the ceph-toolbox is deployed, we get a ceph.conf and a keyring in
/etc/ceph. The keyring is actually the admin keyring and allows us to
make modifications to the ceph cluster. The ceph.conf points to the
monitors and does not contain an fsid.
The ceph-toolbox gets this information via one configmap
("rook-ceph-mon-endpoints") and one secret ("rook-ceph-mon").
The monitor pods on the other hand have an empty ceph.conf and no
admin keyring deployed.
### Try 1: recreating a monitor inside the existing cluster
Let's try to reuse an existing monitor and join it into the existing
cluster. For this we will first shut down the rook-operator, to
prevent it from interfering with our migration. Then we
modify the relevant configmaps and secrets and import the settings
from the native cluster.
Lastly we will patch one of the monitor pods, inject the monmap from
the native cluster and then restart it.
Let's give it a try. First we shut down the rook-ceph-operator:
% kubectl -n rook-ceph scale --replicas=0 deploy/rook-ceph-operator
deployment.apps/rook-ceph-operator scaled
Then we patch the mon deployments to not run a monitor, but only
sleep:
for mon in a b c d e; do
    kubectl -n rook-ceph patch deployment rook-ceph-mon-${mon} -p \
        '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'
    kubectl -n rook-ceph patch deployment rook-ceph-mon-${mon} --type='json' -p \
        '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
done
Now the pod is restarted and when we execute into it, we will see that
no monitor is running in it:
% kubectl -n rook-ceph exec -ti rook-ceph-mon-a-c9f8f554b-2fkhm -- sh
Defaulted container "mon" out of: mon, chown-container-data-dir (init), init-mon-fs (init)
sh-4.2# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4384 664 ? Ss 19:44 0:00 sleep infinity
root 7 0.0 0.0 11844 2844 pts/0 Ss 19:44 0:00 sh
root 13 0.0 0.0 51752 3384 pts/0 R+ 19:44 0:00 ps aux
sh-4.2#
Now for this pod to work with our existing cluster, we want to import
the monmap and join the monitor to the native cluster. As with any
mon, the data is stored below `/var/lib/ceph/mon/ceph-a/`.
Before importing the monmap, let's have a look at the different rook
configurations that influence the ceph components.
### Looking at the ConfigMap in detail: rook-ceph-mon-endpoints
As the name says, it contains the list of monitor endpoints:
kubectl -n rook-ceph edit configmap rook-ceph-mon-endpoints
...
csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["[2a0a:e5c0:0:15::fc2]:6789"...
data: b=[2a0a:e5c0:0:15::9cd9]:6789,....
mapping: '{"node":{"a":{"Name":"server56","Hostname":"server56","Address":"2a0a:e5c0::...
As eventually we want the cluster and csi to use the in-cluster
monitors, we don't need to modify it right away.
### Looking at Secrets in detail: rook-ceph-admin-keyring
The first interesting secret is **rook-ceph-admin-keyring**, which
contains the admin keyring. At this point it still holds the keyring of
the cluster that rook bootstrapped, so we edit this secret and replace
it with the client.admin keyring from our native cluster.
We encode the original admin keyring using:
cat ceph.client.admin.keyring | base64 -w 0; echo ""
And then we update the secret with it:
kubectl -n rook-ceph edit secret rook-ceph-admin-keyring
[done]
### Looking at Secrets in detail: rook-ceph-config
This secret contains two keys, **mon_host** and
**mon_initial_members**. The **mon_host** key is a list of monitor
addresses. The **mon_initial_members** key only contains the monitor names a, b, c, d and e.
The environment variable **ROOK_CEPH_MON_HOST** in the monitor
deployment is set to the **mon_host** key of that secret, so monitors
will read from it.
### Looking at Secrets in detail: rook-ceph-mon
This secret contains the following interesting keys:
* ceph-secret: the admin key (just the base64 key, without the section
around it) [done]
* ceph-username: "client.admin"
* fsid: the ceph cluster fsid
* mon-secret: The key of the [mon.] section
It is important to use `echo -n` when inserting the keys or the fsid,
so that no trailing newline ends up in the encoded value.
[done]
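The effect of the trailing newline can be demonstrated without touching the cluster. The key below is a made-up placeholder, not a real Ceph key. `printf '%s'` is used as a portable stand-in for `echo -n`.

```shell
# Show why the trailing newline matters when encoding values for a Secret.
# The key below is a made-up placeholder, not a real Ceph key.
key='AQDplaceholderplaceholderplaceholder=='

with_newline=$(echo "$key" | base64 -w 0)
without_newline=$(printf '%s' "$key" | base64 -w 0)

echo "with newline:    $with_newline"
echo "without newline: $without_newline"

# Decoding shows the difference in length: one extra byte for the newline.
printf '%s' "$with_newline" | base64 -d | wc -c
printf '%s' "$without_newline" | base64 -d | wc -c
```

The two encoded strings differ, so a key stored with the newline no longer matches the key known to the cluster.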
### Looking at Secrets in detail: rook-ceph-mons-keyring
This secret contains the key "keyring", which holds the [mon.] and
[client.admin] sections:
[mon.]
key = ...
[client.admin]
key = ...
caps mds = "allow"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
We encode it using `base64 -w0 < ~/mon-and-client`.
[done]
### Importing the monmap
Getting the current monmap from the native cluster:
ceph mon getmap -o monmap-20220903
scp root@old-monitor:monmap-20220903 .
Adding it into the mon pod:
kubectl cp monmap-20220903 rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/tmp
Moving the old mon db away:
cd /var/lib/ceph/mon/ceph-a
mkdir _old
mv [a-z]* _old/
Recreating the mon fails, as a volume is mounted directly on this directory:
% ceph-mon -i a --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key
2022-09-03 21:44:48.268 7f1a738f51c0 -1 '/var/lib/ceph/mon/ceph-a' already exists and is not empty: monitor may already exist
% mount | grep ceph-a
/dev/sda1 on /var/lib/ceph/mon/ceph-a type ext4 (rw,relatime)
We can work around this by creating all monitors on pods with other
names. So we can create mon b to e on the mon-a pod and mon-a on any
other pod.
On rook-ceph-mon-a:
for mon in b c d e; do ceph-mon -i $mon --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key; done
On rook-ceph-mon-b:
mon=a
ceph-mon -i $mon --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key
Then we export the newly created mon dbs:
for mon in b c d e; do kubectl cp rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/var/lib/ceph/mon/ceph-$mon ceph-$mon; done
for mon in a; do kubectl cp rook-ceph/rook-ceph-mon-b-57d888dd9f-w8jkh:/var/lib/ceph/mon/ceph-$mon ceph-$mon; done
And finally we test it by importing the mondb to mon-a:
kubectl cp ceph-a rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/var/lib/ceph/mon/
And the other mons:
kubectl cp ceph-b rook-ceph/rook-ceph-mon-b-57d888dd9f-w8jkh:/var/lib/ceph/mon/
### Re-enabling the rook-operator
We scale the operator deployment back up:
kubectl -n rook-ceph scale --replicas=1 deploy/rook-ceph-operator
The operator sees the mons as running (with only a shell in them):
2022-09-03 22:29:26.725915 I | op-mon: mons running: [d e a b c]
Triggering recreation:
% kubectl -n rook-ceph delete deployment rook-ceph-mon-a
deployment.apps "rook-ceph-mon-a" deleted
Connected successfully to the cluster:
services:
mon: 6 daemons, quorum red1,red2,red3,server4,server3,a (age 8s)
mgr: red3(active, since 8h), standbys: red2, red1, server4
osd: 46 osds: 46 up, 46 in
A bit later:
mon: 8 daemons, quorum (age 2w), out of quorum: red1, red2, red3, server4, server3, a, c, d
mgr: red3(active, since 8h), standbys: red2, red1, server4
osd: 46 osds: 46 up, 46 in
And a little bit later also the mgr joined the cluster:
services:
mon: 8 daemons, quorum red2,red3,server4,server3,a,c,d,e (age 46s)
mgr: red3(active, since 9h), standbys: red1, server4, a, red2
osd: 46 osds: 46 up, 46 in
And a few minutes later all mons joined successfully:
mon: 8 daemons, quorum red3,server4,server3,a,c,d,e,b (age 31s)
mgr: red3(active, since 105s), standbys: red1, server4, a, red2
osd: 46 osds: 46 up, 46 in
We also need to ensure the toolbox is being updated/recreated:
kubectl -n rook-ceph delete pods rook-ceph-tools-5cf88dd58f-fwwlc
### Original monitors vanish
We did not add BGP peering.
Ceph cannot be reached through the routers.
It seems that rook removed the original monitors.
Updating the ceph.conf for the native nodes:
mon host = rook-ceph-mon-a.rook-ceph.svc..,
### Post monitor migration issue 1: OSDs start crashing
A day after the monitor migration some OSDs start to crash. Checking
out the debug log we found the following error:
2022-09-05 10:24:02.881 7fe005ce7700 -1 Processor -- bind unable to bind to v2:[2a0a:e5c0::225:b3ff:fe20:3554]:7300/3712937 on any port in range 6800-7300: (99) Cannot assign requested address
2022-09-05 10:24:02.881 7fe005ce7700 -1 Processor -- bind was unable to bind. Trying again in 5 seconds
2022-09-05 10:24:07.897 7fe005ce7700 -1 Processor -- bind unable to bind to v2:[2a0a:e5c0::225:b3ff:fe20:3554]:7300/3712937 on any port in range 6800-7300: (99) Cannot assign requested address
2022-09-05 10:24:07.897 7fe005ce7700 -1 Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
2022-09-05 10:24:07.897 7fe0127b1700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-09-05 10:24:07.897 7fe0127b1700 -1 osd.49 100709 Got signal Interrupt
2022-09-05 10:24:07.897 7fe0127b1700 -1 osd.49 100709 Immediate shutdown (osd_fast_shutdown=true)
The OSD is trying to bind to an IPv6 address that is **not** configured
on the system, see https://tracker.ceph.com/issues/24602.
Calico/CNI does IP rewriting and thus tells the OSD the wrong IPv6
address.
As a fix we add
public_addr = 2a0a:e5c0::92e2:baff:fe26:642c
to the configuration of the node. Verifying the binding after restarting the crashing OSD:
[10:35:06] server4.place5:/var/log/ceph# netstat -lnpW | grep 3717792
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6821 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 :::6822 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 :::6823 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6816 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6817 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 :::6818 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 :::6819 :::* LISTEN 3717792/ceph-osd
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6820 :::* LISTEN 3717792/ceph-osd
unix 2 [ ACC ] STREAM LISTENING 16880318 3717792/ceph-osd /var/run/ceph/ceph-osd.49.asok
### Post monitor migration issue 2: OSDs fail to restart
After roughly a week an OSD on the native cluster started to fail on
restart with the following error:
unable to parse addrs in 'rook-ceph-mon-a.rook-ceph.svc.p5-cow.k8s.ooo, rook-ceph-mon-b.rook-ceph.svc.p5-cow.k8s.ooo, rook-ceph-mon-c.rook-ceph.svc.p5-cow.k8s.ooo, rook-ceph-mon-d.rook-ceph.svc.p5-cow.k8s.ooo, rook-ceph-mon-e.rook-ceph.svc.p5-cow.k8s.ooo'
Checking the cluster, it seems rook has replaced mon-a with mon-f:
[22:38] blind:~% kubectl -n rook-ceph get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
csi-cephfsplugin-metrics ClusterIP 2a0a:e5c0:0:15::f2ac
At this moment it is unclear why the monitor was replaced, but if the
native hosts had already been migrated, this would probably not have
caused an issue. However, as long as ceph.conf files are deployed with
static references to the monitors, this problem might repeat.
## Changelog
### 2023-05-18
* rook 1.7.11 does not work on kubernetes 1.26.1 anymore
* PodSecurityPolicy is missing
The Kubernetes API could not find policy/PodSecurityPolicy for requested resource rook-ceph/00-rook-ceph-operator. Make sure the "PodSecurityPolicy" CRD is installed on the destination cluster.
Same issue on rook 1.8.10:
kind: PodSecurityPolicy
message: >-
  The Kubernetes API could not find policy/PodSecurityPolicy for
  requested resource rook-ceph/00-rook-privileged. Make sure the
  "PodSecurityPolicy" CRD is installed on the destination cluster.
name: 00-rook-privileged
namespace: rook-ceph
status: SyncFailed
syncPhase: Sync
version: v1beta1
revision: v1.8.10
source:
* rook 1.9.12 deploys in kubernetes 1.26.1
2023-05-18 11:44:36.169248 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: the version does not meet the minimum version "15.2.0-0 octopus"
- requires 15.2.0 octopus as a minimum
2022-09-10
- Added missing monitor description
2022-09-03
- Next try starting for migration
- Looking deeper into configurations
2022-08-29
- Added kubernetes/kubeadm bootstrap issue
2022-08-27
- The initial release of this blog article
- Added k8s bootstrapping guide
Follow up or questions
You can join the discussion about this migration in the matrix room #kubernetes:ungleich.ch.
If you don't have a matrix account, you can join using our chat on https://chat.with.ungleich.ch.