Upgrade from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1: machine-config-daemon issues #2078
-
Hello, I'm trying to upgrade my cluster from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1. The cluster has 2 masters and 3 workers on Fedora CoreOS 39.20240210.3.0, all running on Proxmox without Secure Boot. I followed the documentation that explains how to do the upgrade: https://okd.io/docs/project/upgrade-notes/from-4-15/force-upgrade-to-stable-4-16

I modified the kube-apiserver-operator deploy config as described in the documentation and the update began. Practically all components were updated (except machine-config) and the first master rebooted into CentOS Stream CoreOS 416.9.202411211032-0.

I have some issues with the 2 machine-config-daemon pods that run on the first master:

    I0101 15:34:37.800798 123225 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-2860-g4bb33649-dirty (4bb3364914c4dbcdfcc08b0914f402cdd38f014f)

The 2nd pod, machine-config-daemon-lnshc, is in CrashLoopBackOff:

    I0101 15:37:58.741413 121218 update.go:2641] Disk currentConfig "rendered-worker-b3a57dcbf341fcf2ff062281d8f0c1dd" overrides node's currentConfig annotation "rendered-worker-84ea878f8910625351bfcf5b66a72542"

It tries to download a specific image but gets another one. I tried to modify the pod configuration to point it at the right image, but I couldn't find the original sha (sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83); I don't know where it is set.

I connected to the first node; it has 2 errors. I tried to start the first service: it waits for something and then fails:

    Jan 01 15:46:00 okd4-control-plane-1.okd.ia5-f1.net kubenswrapper[2502]: I0101 15:46:00.923494 2502 scope.go:117] "RemoveContainer" containerID="f5aeb01967dd1addda481439d9ec19b82cdc7066b0658ce4086ff689df9a9e5d"

The second service tries to create a group, sees that the group already exists, and fails. I read some topics on GitHub saying to use rpm-ostree to rebase onto the scos-content image:

    rpm-ostree rebase --experimental quay.io/okd/scos-content:tag--stream-coreos

To me it looks like the machine-config-daemon tries to pull the wrong image; I tried to change it to the right sha, but I don't know where it is set. Does anyone have an idea what the problem could be?
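For what it's worth, a sketch of where these references can be inspected (assuming a standard MCO setup; the object and node names are taken from the logs above):

    # The osImageURL (the image the daemon tries to pull) is set in the
    # rendered MachineConfig, not in the daemon pod spec:
    oc get machineconfig rendered-worker-84ea878f8910625351bfcf5b66a72542 \
      -o jsonpath='{.spec.osImageURL}{"\n"}'

    # The node annotation the daemon compares against:
    oc get node okd4-control-plane-1.okd.ia5-f1.net \
      -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'

    # The on-disk copy that the "Disk currentConfig ... overrides" message
    # refers to (run on the node; jq assumed available, e.g. via toolbox):
    jq -r '.metadata.name' /etc/machine-config-daemon/currentconfig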
-
Hello, I'm continuing to look for a solution ;) I checked the filesystems: everything is on xfs except /boot, which is on ext4. I checked the shas in the pod configurations in openshift-machine-config-operator; they are all good. I tried to delete the worker / master MachineConfigPools, without success. I'm out of ideas; I don't understand what the cause of the error could be, and rpm-ostree is blocked because of a dependency… When I look at the machine-config-daemon log, it tries to access something, can't get access to it, and then crashes.
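For anyone retracing these checks, a rough sketch (the pod name is taken from the first post):

    # Filesystem types per device; on these nodes everything is xfs except
    # /boot, which is ext4:
    lsblk -f

    # The crash output of the failing daemon, to see what it cannot access:
    oc -n openshift-machine-config-operator logs machine-config-daemon-lnshc \
      -c machine-config-daemon --previous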
-
In my case, on one of the nodes, /boot was not mounted. Check if your /boot is there; if not, try the following
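As an illustration (not necessarily the exact steps this reply had in mind), checking the mount state and why it failed might look like:

    # Is /boot mounted at all?
    findmnt /boot

    # If not, see what the generated mount unit reports and why it failed:
    systemctl status boot.mount
    journalctl -b -u boot.mount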
-
There won't be an e2fsck upgrade, since you need to step through 4.16 to get to current builds, and there aren't any plans to build new versions of 4.16. There is a solution to this, which requires manual intervention on each node once the node fails to mount /boot during the update.
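A sketch of the kind of intervention this points at, going by the orphan_file explanation below (an illustration on my part, not a verified procedure):

    # Illustration: drop the orphan_file ext4 feature from the boot filesystem
    # so the e2fsck 1.46 shipped with SCOS can pass the check again.
    # tune2fs must come from e2fsprogs >= 1.47 (a version that knows the
    # feature), e.g. from a Fedora rescue/live environment or a toolbox
    # container on the node:
    umount /boot                                    # if it is currently mounted
    e2fsck -f /dev/disk/by-label/boot               # feature removal needs a clean fs
    tune2fs -O ^orphan_file /dev/disk/by-label/boot
    mount /boot

With orphan_file removed, the fsck prerequisite of boot.mount can pass again and the update can continue.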
You should now have a working 4.16 install and can progress through the upgrade process to 4.17 and 4.18.
The issue occurred because Fedora CoreOS version 40 or later was used for the installation. This means the boot filesystem was created by e2fsprogs 1.47 or later, which by default enables the orphan_file feature. One of the significant changes introduced with the upgrade starting from version 4.15 is the transition from a Fedora-based to a CentOS Stream-based operating system; as a result, the e2fsck version is effectively downgraded to 1.46, which does not support the orphan_file feature.
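To check whether a node is affected, one can compare the e2fsprogs version on the OS with the features the boot filesystem was created with (illustrative commands):

    # e2fsprogs version shipped with the OS (orphan_file support arrived in 1.47):
    rpm -q e2fsprogs

    # Features set on the boot filesystem:
    dumpe2fs -h /dev/disk/by-label/boot | grep -i features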
A prerequisite for mounting /boot is passing the filesystem check, as you can see in the unit systemd generates at /run/systemd/generator/boot.mount.
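An illustrative excerpt of what systemd-fstab-generator produces for a label-based /boot entry (the unit on an affected node may differ in detail):

    # /run/systemd/generator/boot.mount (illustrative excerpt)
    [Unit]
    Requires=systemd-fsck@dev-disk-by\x2dlabel-boot.service
    After=systemd-fsck@dev-disk-by\x2dlabel-boot.service

    [Mount]
    What=/dev/disk/by-label/boot
    Where=/boot
    Type=ext4

If the fsck service fails because e2fsck does not understand orphan_file, the Requires= dependency keeps boot.mount from starting.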
However, during testing, the incompatible e2fsck version resulted in an error:

    System is unable to proceed with the update un…