Make bootstrapping opt out and remove the legacy master install path #7486
Conversation
/retest |
Force-pushed from 2aede15 to 18aede1.
/retest |
not an expert in this code space yet, but this looks good to me. |
Interesting - looks like when the kubelet creates the api mirror pod, seeing the new pod from the api server causes the containers to get restarted:
|
This is what is causing the install job to choke - we poll waiting for the api to come up, then we continue on to initializing things. The kubelet is able to create the mirror pod, that causes the kubelet to get a new sync event, and then it looks like that triggers a restart of the api container, which is almost exactly when the first CLI call is made and fails. |
Job fails right around 05:28:23, which is while the api is stopped. |
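A minimal sketch (task and variable names here, including openshift_master_api_host, are assumptions rather than the playbook's actual tasks) of how the install could wait for the API to be stable rather than merely up, requiring several consecutive healthy /healthz probes before the first CLI call is made:

# Sketch only: wait for /healthz, then require it to stay healthy across a short
# window so a kubelet-triggered restart of the api static pod (right after the
# mirror pod is created) does not race the first oc call.
- name: Wait for the API /healthz endpoint to respond
  uri:
    url: "https://{{ openshift_master_api_host | default('127.0.0.1') }}:8443/healthz"
    validate_certs: no
  register: api_health
  until: api_health.status == 200
  retries: 60
  delay: 5

- name: Require the API to stay healthy across several consecutive probes
  uri:
    url: "https://{{ openshift_master_api_host | default('127.0.0.1') }}:8443/healthz"
    validate_certs: no
    status_code: 200
  with_sequence: count=6
  loop_control:
    pause: 10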
What happened to the atomic bot? We're going to run into lots of trouble without that bot. |
This PR is missing the logic we implemented for osm_default_node_selector
osm_default_node_selector should match what we implemented in 3.9.
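For reference, a minimal sketch of what carrying the 3.9 behavior forward could look like; the default value and the use of the repo's yedit module here are assumptions, not the exact 3.9 logic:

# Sketch only: default the selector when the inventory does not set one, then
# render it into the master config. The default shown is an assumption about
# the 3.9 value.
- name: Default osm_default_node_selector when not provided
  set_fact:
    osm_default_node_selector: "node-role.kubernetes.io/compute=true"
  when: osm_default_node_selector is not defined

- name: Set projectConfig.defaultNodeSelector in the master config
  yedit:
    src: /etc/origin/master/master-config.yaml
    key: projectConfig.defaultNodeSelector
    value: "{{ osm_default_node_selector }}"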
"stdout": "-- Logs begin at Mon 2018-03-12 16:29:00 UTC, end at Mon 2018-03-12 16:36:41 UTC. --\nMar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed to load environment files: No such file or directory\nMar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed to run 'start-pre' task: No such file or directory\nMar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: Failed to start origin-node.service.\nMar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Unit entered failed state.\nMar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed with result 'resources'.",
"stdout_lines": [
"-- Logs begin at Mon 2018-03-12 16:29:00 UTC, end at Mon 2018-03-12 16:36:41 UTC. --",
"Mar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed to load environment files: No such file or directory",
"Mar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed to run 'start-pre' task: No such file or directory",
"Mar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: Failed to start origin-node.service.",
"Mar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Unit entered failed state.",
"Mar 12 16:36:41 ip-172-18-12-1.ec2.internal systemd[1]: origin-node.service: Failed with result 'resources'."
] Fedora atomic host, single master, v3.9.0 image tag. |
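The "Failed to load environment files" failure suggests the unit's EnvironmentFile is missing before the service is started. A minimal sketch, assuming the conventional /etc/sysconfig/origin-node path and a hypothetical template name, of guarding against that:

# Sketch only: make sure the environment file the origin-node unit references
# exists before the service is started, so systemd does not fail with
# "Failed to load environment files".
- name: Check for the node environment file
  stat:
    path: /etc/sysconfig/origin-node
  register: node_env_file

- name: Lay down the node environment file if it is missing
  template:
    src: origin-node.sysconfig.j2   # hypothetical template name
    dest: /etc/sysconfig/origin-node
    owner: root
    group: root
    mode: '0644'
  when: not node_env_file.stat.exists

- name: Enable and start origin-node
  systemd:
    name: origin-node
    enabled: yes
    state: started
    daemon_reload: yes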
Task openshift_control_plane : create service account kubeconfig with csr rights fails. The full traceback is:
File "/tmp/ansible_9dfpdoj9/ansible_modlib.zip/ansible/module_utils/basic.py", line 2736, in run_command
cmd = subprocess.Popen(args, **kwargs)
File "/usr/lib64/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
fatal: [ec2-54-175-14-73.compute-1.amazonaws.com]: FAILED! => {
"attempts": 24,
"changed": false,
"cmd": "oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra",
"failed": true,
"invocation": {
"module_args": {
"_raw_params": "oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra",
"_uses_shell": false,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"warn": true
}
},
"msg": "[Errno 2] No such file or directory: 'oc': 'oc'",
"rc": 2
}
On Fedora Atomic Host, single master. |
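On Atomic Host the oc client is not on PATH until it has been extracted, which is what the "[Errno 2] No such file or directory: 'oc'" error reflects. A minimal sketch, treating openshift_client_binary and its default path as assumptions, of resolving the client explicitly and failing early with a clearer message:

# Sketch only: resolve the client binary explicitly instead of assuming oc is
# on PATH; openshift_client_binary and its default path are assumptions.
- name: Locate the oc client
  stat:
    path: "{{ openshift_client_binary | default('/usr/local/bin/oc') }}"
  register: oc_binary

- name: Fail early with a clear message when the client is missing
  fail:
    msg: >-
      {{ openshift_client_binary | default('/usr/local/bin/oc') }} was not found;
      on Atomic Host the client must be extracted from the image before CLI tasks run.
  when: not oc_binary.stat.exists

- name: Create the node-bootstrapper kubeconfig
  command: >-
    {{ openshift_client_binary | default('/usr/local/bin/oc') }}
    serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra
  register: bootstrapper_kubeconfig
  until: bootstrapper_kubeconfig.rc == 0
  retries: 24
  delay: 5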
openshift_bootstrap_autoapprover : Create auto-approver on cluster fails due to the oc command as well. |
/hold We need to get the Fedora Atomic bot back online before we continue down this road. We're going to have a lot of drift. |
On RHEL Atomic I get:
A similar story with an rpm-based install of Origin 3.9. Maybe Origin 3.10 is required to make it pass? |
Seeing the following on Fedora Atomic Host, origin-node service not starting:
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: I0312 19:45:58.092754 26462 start_node.go:309] Reading node configuration from /etc/origin/node/node-config.yaml
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: Invalid NodeConfig /etc/origin/node/node-config.yaml
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: servingInfo.certFile: Invalid value: "/etc/origin/node/server.crt": could not read file: stat /etc/origin/node/server.crt: no such file or directory
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: servingInfo.keyFile: Invalid value: "/etc/origin/node/server.key": could not read file: stat /etc/origin/node/server.key: no such file or directory
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: servingInfo.clientCA: Invalid value: "/etc/origin/node/ca.crt": could not read file: stat /etc/origin/node/ca.crt: no such file or directory
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal origin-node[26381]: masterKubeConfig: Invalid value: "/etc/origin/node/system:node:ip-172-18-10-39.ec2.internal.kubeconfig": could not read file: stat /etc/origin/node/system:node:ip-172-18-10-39.ec2.internal.kubeconfig: no such file or directory
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal systemd[1]: docker-43c6b0e23eaf9e63e9deee0e7fa2a465274d8347b72759225986891fd605d109.scope: Consumed 319ms CPU time
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal oci-systemd-hook[26501]: systemdhook <debug>: 43c6b0e23eaf: Skipping as container command is /usr/local/bin/openshift-node, not init or systemd
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal oci-umount[26502]: umounthook <debug>: 43c6b0e23eaf: only runs in prestart stage, ignoring
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal docker-containerd-current[1028]: time="2018-03-12T19:45:58.179543299Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 43c6b0e23eaf9e63e9deee0e7fa2a465274d8347b72759225986891fd605d109 does not exist\\none or more of the container deletions failed\\n\""
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal dockerd-current[2672]: time="2018-03-12T19:45:58.220459881Z" level=warning msg="43c6b0e23eaf9e63e9deee0e7fa2a465274d8347b72759225986891fd605d109 cleanup: failed to unmount secrets: invalid argument"
Mar 12 19:45:58 ip-172-18-10-39.ec2.internal systemd[1]: origin-node.service: Main process exited, code=exited, status=255/n/a
This may be due to trying to use 3.9. |
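With bootstrapping, the node should start from a bootstrap kubeconfig and the cluster CA rather than pre-generated serving certs, so the missing server.crt/server.key above point at the wrong node-config being laid down. A minimal sketch of a pre-start check; the exact file list is an assumption:

# Sketch only: with bootstrapping the node starts from a bootstrap kubeconfig
# and the cluster CA; the serving cert is requested via CSR afterwards. The
# exact file list is an assumption.
- name: Check for bootstrap artifacts before starting origin-node
  stat:
    path: "{{ item }}"
  register: bootstrap_files
  with_items:
    - /etc/origin/node/bootstrap.kubeconfig
    - /etc/origin/node/ca.crt

- name: Fail with a readable message when a bootstrap artifact is missing
  fail:
    msg: "{{ item.item }} is missing; the node cannot bootstrap without it."
  when: not item.stat.exists
  with_items: "{{ bootstrap_files.results }}"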
No dice with openshift_image_tag=v3.10.0:
[root@ip-172-18-13-148 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
ip-172-18-13-148.ec2.internal NotReady <none> 6m v1.9.1+a0ce1bc657
[root@ip-172-18-13-148 ~]# journalctl -f
-- Logs begin at Mon 2018-03-12 19:14:12 UTC. --
Mar 12 20:11:39 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:11:39.473186 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:11:39 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:11:39.473330 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:11:44 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:11:44.474651 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:11:44 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:11:44.474796 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:11:49 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:11:49.476303 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:11:49 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:11:49.476460 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:11:54 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:11:54.477969 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:11:54 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:11:54.478119 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:11:59 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:11:59.479803 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:11:59 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:11:59.479998 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:12:04 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:12:04.481382 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:12:04 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:12:04.481530 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 12 20:12:09 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:12:09.482953 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:12:09 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:12:09.483111 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized |
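The repeated "cni config uninitialized" lines mean the node is up but openshift-sdn has not written a CNI config yet. A minimal sketch, assuming the conventional config file name, of waiting for it before expecting nodes to go Ready:

# Sketch only: wait for openshift-sdn to write its CNI config before expecting
# nodes to go Ready; the file name is an assumption about the SDN's convention.
- name: Wait for the openshift-sdn CNI configuration to appear
  wait_for:
    path: /etc/cni/net.d/80-openshift-network.conf
    timeout: 300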
Master services don't come back after reboot. |
Mar 12 20:12:09 ip-172-18-13-148.ec2.internal origin-node[10260]: W0312 20:12:09.482953 10290 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 20:12:09 ip-172-18-13-148.ec2.internal origin-node[10260]: E0312 20:12:09.483111 10290 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
That is not an error in the install, but it does mean openshift-sdn didn't get installed.
Re bootstrap-autoapprover: if you don't have a master label you'll get that. Masters being properly labeled with the standard labels will be a requirement for 3.10; will have to look at why that isn't being applied.
…On Mon, Mar 12, 2018 at 7:52 PM, Vadim Rutkovsky ***@***.***> wrote:
bootstrap-autoapprover-0 is stuck in Pending:
message: '0/1 nodes are available: 1 CheckServiceAffinity, 1 MatchNodeSelector,
1 NodeNotReady, 1 NodeOutOfDisk.'
|
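Since the auto-approver pod schedules onto masters via a node selector, bootstrapped masters need the standard role label before it can leave Pending. A minimal sketch; the label command and the l_kubelet_node_name variable are assumptions for illustration:

# Sketch only: apply the standard master role label so the auto-approver's node
# selector can be satisfied. l_kubelet_node_name and the client binary default
# are assumptions.
- name: Ensure the master node carries the standard master role label
  command: >-
    {{ openshift_client_binary | default('oc') }} label node {{ l_kubelet_node_name }}
    node-role.kubernetes.io/master=true --overwrite
  register: label_master
  changed_when: "'labeled' in label_master.stdout"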
@jlebon can you speak to why the f27 bot is dead? |
Yeah.
was there a system container hack to extract this onto disk somewhere? |
Oh. Join wasn't starting the node. Sigh. |
Force-pushed from 7e74967 to 1c55072.
Atomic f27 is now failing because the hostname value reported by ansible in openshift facts doesn't match the nodename the kubelet calculates during bootstrapping. |
Force-pushed from 0883e40 to 77758ed.
F27 atomic passed!!!!!!!!!!1!!!!! |
Last set of changes was that bootstrapped nodes need Ansible to use the correct calculated nodename that the kubelet uses during bootstrapping (which is hostname, not hostname -f). That got f27 to pass, and the crio fix is in the origin merge queue. This is ready for final review. |
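A minimal sketch of the nodename alignment described above, with l_kubelet_node_name used as an assumed fact name: compute the name from the short hostname so Ansible looks for the same node object the kubelet registers:

# Sketch only: derive the node name from the short hostname so Ansible queries
# the same node object the kubelet registers. l_kubelet_node_name is an
# assumed fact name.
- name: Calculate the kubelet node name from the short hostname
  set_fact:
    l_kubelet_node_name: "{{ ansible_hostname | lower }}"

- name: Wait for the node to register under that name
  command: "{{ openshift_client_binary | default('oc') }} get node {{ l_kubelet_node_name }}"
  register: get_node
  until: get_node.rc == 0
  retries: 36
  delay: 5
  changed_when: false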
/skip |
/test logging |
If you'd prefer, you can review the most recent changes to openshift facts instead of merging the still-outstanding PRs that may simplify things. |
/test crio |
\o/ Awesome! |
Initial upgrade implementation in #7723 |
/skip logging |
Force-pushed from cd7dd23 to a97c508.
@smarterclayton: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Change the defaults for node bootstrapping to true; all nodes will bootstrap unless opted out. Remove containerized node artifacts. Remove the openshift_master role - it is dead.
Force-pushed from a97c508 to ddf1aa2.
openshift/origin#19190 is what is blocking atomic |
We will need to incorporate some changes from #7694 after it merges that won't be picked up by a rebase. Specifically this: https://github.com/openshift/openshift-ansible/pull/7694/files#diff-6430bef27c2912d81bffbc6e7c0ca7a3R81
Need to migrate this from master to ocp: https://github.com/openshift/openshift-ansible/pull/7694/files#diff-96ca052d3820630a5305e1a7b4628856R1
Need to migrate this from master to ocp: https://github.com/openshift/openshift-ansible/pull/7694/files#diff-eb4389ec1aecfaf5b21994caaae7fcd5R1 |
bot, retest this please |
This job is now green. Going to stick the label on to unblock the queue based on prior approvals. |
Change the defaults for node bootstrapping to true, all nodes will bootstrap unless opted out. During setup, we pre-configure all nodes that elect for bootstrapping before the master is configured, then install the control plane, then configure any nodes that opt out of bootstrapping. I'd like to completely remove the old node path, or perhaps move it to an "add new nodes to the cluster" sort of config, until we know whether users are ready for it to be removed.
Remove the openshift_master role - it is dead. Copied in a few changes that happened in master before the role was killed. Copied upgrades over, although nothing has been done there.
Follow-on to #6916
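For illustration, a hypothetical group_vars sketch of the new opt-out model; openshift_node_bootstrap is used here as an assumed variable name for the switch, not necessarily the exact one this PR introduces:

# group_vars/all.yml (sketch): bootstrapping is now the default for every node.
openshift_node_bootstrap: true

# host_vars/legacy-node.example.com.yml (sketch): a single host opts out and
# keeps the non-bootstrap path until it can be removed entirely.
# openshift_node_bootstrap: false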