-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Refactor dind #11061
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor dind #11061
Conversation
eaff020
to
87a5263
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only small stuff on the Bash bits.
--hostname="${name}" "${DIND_IMAGE}")" | ||
node_cids+=( "${cid}" ) | ||
node_ips+=( "$(get-docker-ip "${cid}")" ) | ||
local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't co-locate expressions and scoping statements on the same line:
local cid
cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"
node_cids+=( "${cid}" ) | ||
node_ips+=( "$(get-docker-ip "${cid}")" ) | ||
local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")" | ||
local ip="$(get-docker-ip "${cid}")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scoping
local rc_file="dind-${INSTANCE_PREFIX}.rc" | ||
local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})" | ||
local rc_file="dind-${cluster_id}.rc" | ||
local admin_config="$(get-admin-config ${CONFIG_ROOT})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scoping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quoting
admin_config="$(get-admin-config "${CONFIG_ROOT}")"
^ ^
local rc_file="dind-${INSTANCE_PREFIX}.rc" | ||
local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})" | ||
local rc_file="dind-${cluster_id}.rc" | ||
local admin_config="$(get-admin-config ${CONFIG_ROOT})" | ||
local bin_path="$(os::build::get-bin-output-path "${OS_ROOT}")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scoping
os::provision::disable-node "${OS_ROOT}" "${CONFIG_ROOT}" \ | ||
"${SDN_NODE_NAME}" | ||
if [[ -n "${wait_for_cluster}" ]]; then | ||
wait-for-cluster "$(get-admin-config ${config_root})" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait-for-cluster "$(get-admin-config "${config_root}")" \
^ ^
;; | ||
wait-for-cluster) | ||
wait-for-cluster | ||
wait-for-cluster "$(get-admin-config ${CONFIG_ROOT})" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait-for-cluster "$(get-admin-config "${CONFIG_ROOT}")" \
^ ^
} | ||
|
||
mount --make-shared / | ||
os::provision::enable-overlay-storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass "$@"
through if you wanted args to the script to go to the func?
local error_msg="[ERROR] Timeout waiting for ${msg}" | ||
|
||
local counter=0 | ||
while ! $(${condition}); do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as above
local error_msg="[ERROR] Timeout waiting for ${msg}" | ||
|
||
local counter=0 | ||
while ! $(${condition}); do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again
|
||
# Deploy the node config | ||
mkdir -p "${DEPLOYED_CONFIG_PATH}" | ||
cp -r ${CONFIG_PATH}/* "${DEPLOYED_CONFIG_PATH}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cp -r "${CONFIG_PATH}"/* "${DEPLOYED_CONFIG_PATH}"
^ ^
87a5263
to
e9a3a31
Compare
@stevekuznetsov I've hopefully address your comments. I've also broken that wait function out into hack/lib/util/dind.sh so it can be reused instead of copying everywhere. |
|
||
function ensure-node-config() { | ||
local deployed_config_path="/var/lib/origin/openshift.local.config/node" | ||
local deployed_config_file="${deployed_config_path}/node-config.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I needed to fix the case of the conditional below.
# The container will have created configuration as root | ||
sudo rm -rf ${CONFIG_ROOT}/openshift.local.* | ||
sudo rm -rf ${config_root}/openshift.local.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this one last time
sudo rm -rf "${config_root}"/openshift.local.*
^ ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
local rc_file="dind-${cluster_id}.rc" | ||
local admin_config | ||
admin_config="$(get-admin-config "${CONFIG_ROOT}")" | ||
local bin_bath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/bath/path/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if [[ "${counter}" != "0" && "${timeout}" != "${OS_WAIT_FOREVER}" ]]; then | ||
echo -e '\nDone' | ||
fi | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to match our import
s in hack/lib/init.sh
we always declare our functions readonly
-- otherwise there is a chance that a nested script with nested callouts to hack/lib/init.sh
will have a different version of a function, or an alias for it or other such nonsense, which we don't really want to deal with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
fi | ||
sleep 1 | ||
else | ||
echo -e "\n${error_msg}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point in the future I intend to walk through our code and enforce the use of os::log::{info,warn,error}
. one of those may be more appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this file is intended to be distributed standalone in the dind images, I want to avoid adding any dependencies. I've updated the file's header to indicate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're in hack/lib
which is pretty much the antithesis to standalone -- this will get source
d by everything that brings in hack/lib/init.sh
. Are we shipping this in an image for customers? Can we put these in origin/images/networking-diagnostics/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is not shipping to customers, is is only used by the dind (test+dev) images and hack/dind-cluster.sh
. Would you prefer it be located in the dind image path, say images/dind/node/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's meant to be a standalone script, yes, I think it would be better if it were divorced from hack/
and did not source
hack/lib/init.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ed1ad8e
to
4157c4f
Compare
Hmm, the intra-pod e2e is failing. Not sure what change I made could have triggered that. |
4157c4f
to
1e22413
Compare
|
||
While it is possible to run a dind cluster directly on a linux host, | ||
it is recommended to consider the warnings at the top of the | ||
dind-cluster.sh script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning is kind of weird given that the doc doesn't talk about any other way of running dind now. You should probably incorporate it in the "Prerequisites" (eg, "4. You don't mind loading a few kernel modules etc")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume there are some folks for whom disabling selinux and running privileged docker containers directly on their host won't be desirable. Fair point, though, that there isn't much detail on how they would run a dind cluster otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
${DOCKER_CMD} run --privileged --net=host --rm -v /lib/modules:/lib/modules \ | ||
openshift/dind-node bash -e -c \ | ||
'/usr/sbin/sysctl -w net.bridge.bridge-nf-call-iptables=0 > /dev/null; | ||
modprobe openvswitch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somewhat random to specify the path for sysctl but not for modprobe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
local master_cid | ||
master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")" | ||
local master_ip | ||
master_ip="$(get-docker-ip "${master_cid}")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't know the style guide, but could you merge some of the declarations? local master_cid master_ip
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per @stevekuznetsov's comments in a previous review, mixing expressions and scoping statements has the potential to bypass fail-on-error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can still declare them on a single line, if it's two statements, though:
local my_var; my_var="$( some command )"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant changing the above to:
local master_cid master_ip
master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")"
master_ip="$(get-docker-ip "${master_cid}")"
ie, declaration assignment assignment, rather than declaration assignment declaration assignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair point. I think declaring things on one line is nicer, but that's just me. A chunk of these are gone now that /etc/hosts is synced inside the container, though, and I'm not sure if it's worth changing the rest.
# Ensure the master can resolve node names to ip for kubelet communication | ||
# | ||
# Attempts to keep /etc/hosts in sync with the api's node records | ||
# have thus far proved unreliable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does that mean / how will we know when that comment no longer applies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having trouble naively using a systemd unit on a timer. Now that @eparis pointed me at @smarterclayton's guide to using oc observe
, I think a bash control loop might do the trick.
# The container will have created configuration as root | ||
sudo rm -rf ${CONFIG_ROOT}/openshift.local.* | ||
sudo rm -rf "${config_root}"/openshift.local.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is pre-existing, but "sudo rm -rf" and "*" really don't belong in the same command...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain. The goal is to remove openshift.local.config
and openshift.local.etcd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then do
sudo rm -rf "${config_root}"/openshift.local.config "${config_root}"/openshift.local.etcd
It's just scary to "rm -rf" a glob pattern. Especially as root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the danger in a qualified glob pattern, but fair enough, I'll make the change.
local deployed_config_file="${deployed_config_path}/node-config.yaml" | ||
|
||
# If the node config hasn't been deployed | ||
if [[ ! -f "${deployed_config_file}" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if [[ -f "${deployed_config_file}" ]]; then
return
fi
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
8277be8
to
705aa1f
Compare
705aa1f
to
9897da9
Compare
I've left out redeploy for now pending confirmation of its utility. I suggest we run the extended test repeatedly to assure ourselves that there are no latent issues. |
re-[testextended][extended:networking] |
2 similar comments
re-[testextended][extended:networking] |
re-[testextended][extended:networking] |
The tests are passing reliably, so I think this is ready from a functional perspective. Still looking for feedback on the ux changes. |
I like the switch to command-line args. I could never remember the environment variable names |
The new "dind: enable ssh access to cluster" commit has unrelated "os::util::is-master" stuff mixed in |
@danwinship It's not entirely unrelated. It's a refactor so that there is a common way of identifying if a host is the master and the method is used in the script that enables ssh access. Would you prefer that be a separate commit? |
Oh, I skimmed quickly but I guess I missed that it was used by the new commit. That's fine then. |
LGTM, all I want is 'less' installed in the images :) |
Evaluated for origin testextended up to f518e27 |
continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/557/) (Extended Tests: networking) |
Bash bits look good to me -- some of the smaller scripts could use some doc at the top, but that's minor. |
[merge] |
[Test]ing while waiting on the merge queue |
Evaluated for origin test up to f518e27 |
continuous-integration/openshift-jenkins/test ABORTED (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9662/) |
re-[merge] |
flake is #11240 re-[merge] |
Evaluated for origin merge up to f518e27 |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9749/) (Image: devenv-rhel7_5143) |
source /data/network-plugin | ||
|
||
function ensure-node-config() { | ||
local deployed_config_path="/var/lib/origin/openshift.local.config/node" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible multiple nodes are sharing this folder and the last one in is stomping? Trying to figure out what is causing #11274
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this path is where the certs will be copied locally after being generated. The generation target is a unique path on the shared volume (e.g. /data/openshift.local.config/node-[node name]).
--certificate-authority="${master_config_path}/ca.crt" \ | ||
--signer-cert="${master_config_path}/ca.crt" \ | ||
--signer-key="${master_config_path}/ca.key" \ | ||
--signer-serial="${master_config_path}/ca.serial.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are nodes generating config themselves with copies of the signing cert/key/serial? Can we verify the serial numbers in the resulting certs? Wondering if two nodes are both allocating the "next" serial number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Entirely likely possible there is a race here. Is ca.serial.txt incremented by each node cert generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the cert-generating commands are not intended to be run concurrently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll fix.
This looks like a big change, but it's mostly shifting code around to remove dependency on contrib/vagrant and move towards being able to deploy on kube/openshift.
cc: @openshift/networking