Upgrade from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1: machine-config-daemon issues #2078
-
Hello, I'm trying to upgrade my cluster from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1. The cluster has 2 masters and 3 workers on Fedora CoreOS 39.20240210.3.0, all running on Proxmox without Secure Boot. I followed the documentation that explains how to do the upgrade: https://okd.io/docs/project/upgrade-notes/from-4-15/force-upgrade-to-stable-4-16

I modified the kube-apiserver-operator deploy config as described in the documentation and the update began. Practically all components were updated (except machine-config) and the first master rebooted into CentOS Stream CoreOS 416.9.202411211032-0.

I have some issues with the 2 machine-config-daemon pods that run on the first master:

    I0101 15:34:37.800798 123225 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-2860-g4bb33649-dirty (4bb3364914c4dbcdfcc08b0914f402cdd38f014f)

The 2nd pod, machine-config-daemon-lnshc, is in CrashLoopBackOff:

    I0101 15:37:58.741413 121218 update.go:2641] Disk currentConfig "rendered-worker-b3a57dcbf341fcf2ff062281d8f0c1dd" overrides node's currentConfig annotation "rendered-worker-84ea878f8910625351bfcf5b66a72542"

It tries to download a specific image but gets another one. I tried to modify the pod configuration to point it at the right image, but I couldn't find the original sha (sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83); I don't know where it is set.

I connected to the first node; it has 2 errors. I tried to start the first service: it waits for something and then fails:

    Jan 01 15:46:00 okd4-control-plane-1.okd.ia5-f1.net kubenswrapper[2502]: I0101 15:46:00.923494 2502 scope.go:117] "RemoveContainer" containerID="f5aeb01967dd1addda481439d9ec19b82cdc7066b0658ce4086ff689df9a9e5d"

The second service tries to create a group, sees that the group already exists, and fails. I read some topics on GitHub saying to use rpm-ostree to rebase onto the scos-content image:

    rpm-ostree rebase --experimental quay.io/okd/scos-content:tag--stream-coreos

To me it looks like the machine-config-daemon tries to pull the wrong image; I tried to change it to the right sha, but I don't know where it is set. Does anyone have an idea what the problem could be?
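For what it's worth, a sketch of where these references can be inspected (assuming a standard MCO setup; the object and node names are taken from the logs above):

    # The osImageURL (the image the daemon tries to pull) is set in the
    # rendered MachineConfig, not in the daemon pod spec:
    oc get machineconfig rendered-worker-84ea878f8910625351bfcf5b66a72542 \
      -o jsonpath='{.spec.osImageURL}{"\n"}'

    # The node annotation the daemon compares against:
    oc get node okd4-control-plane-1.okd.ia5-f1.net \
      -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'

    # The on-disk copy that the "Disk currentConfig ... overrides" message
    # refers to (run on the node; jq assumed available, e.g. via toolbox):
    jq -r '.metadata.name' /etc/machine-config-daemon/currentconfig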
-
Hello, I'm continuing to look for a solution ;) I checked the filesystems: everything is on xfs except /boot, which is on ext4. I checked the shas in the pod configurations in openshift-machine-config-operator; they are all good. I tried to delete the worker / master MachineConfigPools, without success. I'm out of ideas; I don't understand what the cause of the error could be, and rpm-ostree is blocked because of a dependency… When I look at the machine-config-daemon log, it tries to access something, can't get access to it, and then crashes.
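For anyone retracing these checks, a rough sketch (the pod name is taken from the first post):

    # Filesystem types per device; on these nodes everything is xfs except
    # /boot, which is ext4:
    lsblk -f

    # The crash output of the failing daemon, to see what it cannot access:
    oc -n openshift-machine-config-operator logs machine-config-daemon-lnshc \
      -c machine-config-daemon --previous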
-
In my case, on one of the nodes, /boot was not mounted. Check if your /boot is there; if not, try the following
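As an illustration (not necessarily the exact steps this reply had in mind), checking the mount state and why it failed might look like:

    # Is /boot mounted at all?
    findmnt /boot

    # If not, see what the generated mount unit reports and why it failed:
    systemctl status boot.mount
    journalctl -b -u boot.mount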
-
There won't be an e2fsck upgrade, since you need to step through 4.16 to get to current builds, and there aren't any plans to build new versions of 4.16. There is a solution to this, which requires manual intervention on each node once the node fails to mount /boot during the update.
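A sketch of the kind of intervention this points at, going by the orphan_file explanation below (an illustration on my part, not a verified procedure):

    # Illustration: drop the orphan_file ext4 feature from the boot filesystem
    # so the e2fsck 1.46 shipped with SCOS can pass the check again.
    # tune2fs must come from e2fsprogs >= 1.47 (a version that knows the
    # feature), e.g. from a Fedora rescue/live environment or a toolbox
    # container on the node:
    umount /boot                                    # if it is currently mounted
    e2fsck -f /dev/disk/by-label/boot               # feature removal needs a clean fs
    tune2fs -O ^orphan_file /dev/disk/by-label/boot
    mount /boot

With orphan_file removed, the fsck prerequisite of boot.mount can pass again and the update can continue.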
You should now have a working 4.16 install and can progress through the upgrade process to 4.17 and 4.18.
The issue occurred because Fedora CoreOS version 40 or later was used for the installation. This means the boot filesystem was created by e2fsprogs 1.47 or later, which by default enables the orphan_file feature. One of the significant changes introduced with the upgrade starting from version 4.15 is the transition from a Fedora-based to a CentOS Stream-based operating system; as a result, the e2fsck version is effectively downgraded to 1.46, which does not support the orphan_file feature.
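To check whether a node is affected, one can compare the e2fsprogs version on the OS with the features the boot filesystem was created with (illustrative commands):

    # e2fsprogs version shipped with the OS (orphan_file support arrived in 1.47):
    rpm -q e2fsprogs

    # Features set on the boot filesystem:
    dumpe2fs -h /dev/disk/by-label/boot | grep -i features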
A prerequisite for mounting /boot is passing the filesystem check, as you can see in the unit systemd generates at /run/systemd/generator/boot.mount.
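An illustrative excerpt of what systemd-fstab-generator produces for a label-based /boot entry (the unit on an affected node may differ in detail):

    # /run/systemd/generator/boot.mount (illustrative excerpt)
    [Unit]
    Requires=systemd-fsck@dev-disk-by\x2dlabel-boot.service
    After=systemd-fsck@dev-disk-by\x2dlabel-boot.service

    [Mount]
    What=/dev/disk/by-label/boot
    Where=/boot
    Type=ext4

If the fsck service fails because e2fsck does not understand orphan_file, the Requires= dependency keeps boot.mount from starting.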
However, during testing, the incompatible e2fsck version resulted in an error:

    System is unable to proceed with the update un…