Skip to content

Refactor dind #11061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 7, 2016
Merged

Refactor dind #11061

merged 5 commits into from
Oct 7, 2016

Conversation

marun
Copy link
Contributor

@marun marun commented Sep 23, 2016

This looks like a big change, but it's mostly shifting code around to remove dependency on contrib/vagrant and move towards being able to deploy on kube/openshift.

  • refactor hack/dind-cluster.sh
    • remove dependency on contrib/vagrant/* so it can be deleted as part of the move to openshift-ansible for dev cluster deployment (cc: @stevekuznetsov)
    • use flags as much as possible when starting a cluster (cc: @danwinship)
    • build images and binaries only if necessary or requested to do so, and build on the host rather than in the container (cc: @smarterclayton)
    • perform host modification in a privileged docker container to ensure the docker host is modified even if docker is running remotely
    • minimize unnecessary output
  • create new openshift/dind base image that runs only systemd+dind so it can be reused for testing things like openshift-ansible (cc: @sdodson, @stevekuznetsov)
  • create new openshift/dind-node and openshift/dind-master images
    • use systemd units instead of post-build configuration for:
      • generating cluster configuration (likely to be replaced by something kubeadm-like)
      • disabling the master node

cc: @openshift/networking

Copy link
Contributor

@stevekuznetsov stevekuznetsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only small stuff on the Bash bits.

--hostname="${name}" "${DIND_IMAGE}")"
node_cids+=( "${cid}" )
node_ips+=( "$(get-docker-ip "${cid}")" )
local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't co-locate expressions and scoping statements on the same line:

local cid
cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"

node_cids+=( "${cid}" )
node_ips+=( "$(get-docker-ip "${cid}")" )
local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"
local ip="$(get-docker-ip "${cid}")"
Copy link
Contributor

@stevekuznetsov stevekuznetsov Sep 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scoping

local rc_file="dind-${INSTANCE_PREFIX}.rc"
local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})"
local rc_file="dind-${cluster_id}.rc"
local admin_config="$(get-admin-config ${CONFIG_ROOT})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scoping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quoting

admin_config="$(get-admin-config "${CONFIG_ROOT}")"
                                 ^              ^

local rc_file="dind-${INSTANCE_PREFIX}.rc"
local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})"
local rc_file="dind-${cluster_id}.rc"
local admin_config="$(get-admin-config ${CONFIG_ROOT})"
local bin_path="$(os::build::get-bin-output-path "${OS_ROOT}")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scoping

os::provision::disable-node "${OS_ROOT}" "${CONFIG_ROOT}" \
"${SDN_NODE_NAME}"
if [[ -n "${wait_for_cluster}" ]]; then
wait-for-cluster "$(get-admin-config ${config_root})" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait-for-cluster "$(get-admin-config "${config_root}")" \
                                     ^              ^

;;
wait-for-cluster)
wait-for-cluster
wait-for-cluster "$(get-admin-config ${CONFIG_ROOT})" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait-for-cluster "$(get-admin-config "${CONFIG_ROOT}")" \
                                     ^              ^

}

mount --make-shared /
os::provision::enable-overlay-storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass "$@" through if you wanted args to the script to go to the func?

local error_msg="[ERROR] Timeout waiting for ${msg}"

local counter=0
while ! $(${condition}); do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as above

local error_msg="[ERROR] Timeout waiting for ${msg}"

local counter=0
while ! $(${condition}); do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again


# Deploy the node config
mkdir -p "${DEPLOYED_CONFIG_PATH}"
cp -r ${CONFIG_PATH}/* "${DEPLOYED_CONFIG_PATH}"
Copy link
Contributor

@stevekuznetsov stevekuznetsov Sep 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cp -r "${CONFIG_PATH}"/* "${DEPLOYED_CONFIG_PATH}"
      ^              ^

@marun
Copy link
Contributor Author

marun commented Sep 23, 2016

@stevekuznetsov I've hopefully address your comments. I've also broken that wait function out into hack/lib/util/dind.sh so it can be reused instead of copying everywhere.


function ensure-node-config() {
local deployed_config_path="/var/lib/origin/openshift.local.config/node"
local deployed_config_file="${deployed_config_path}/node-config.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I needed to fix the case of the conditional below.

# The container will have created configuration as root
sudo rm -rf ${CONFIG_ROOT}/openshift.local.*
sudo rm -rf ${config_root}/openshift.local.*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this one last time

sudo rm -rf "${config_root}"/openshift.local.*
            ^              ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

local rc_file="dind-${cluster_id}.rc"
local admin_config
admin_config="$(get-admin-config "${CONFIG_ROOT}")"
local bin_bath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bath/path/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if [[ "${counter}" != "0" && "${timeout}" != "${OS_WAIT_FOREVER}" ]]; then
echo -e '\nDone'
fi
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to match our imports in hack/lib/init.sh we always declare our functions readonly -- otherwise there is a chance that a nested script with nested callouts to hack/lib/init.sh will have a different version of a function, or an alias for it or other such nonsense, which we don't really want to deal with

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

fi
sleep 1
else
echo -e "\n${error_msg}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point in the future I intend to walk through our code and enforce the use of os::log::{info,warn,error}. one of those may be more appropriate here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this file is intended to be distributed standalone in the dind images, I want to avoid adding any dependencies. I've updated the file's header to indicate this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're in hack/lib which is pretty much the antithesis to standalone -- this will get sourced by everything that brings in hack/lib/init.sh. Are we shipping this in an image for customers? Can we put these in origin/images/networking-diagnostics/ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not shipping to customers, is is only used by the dind (test+dev) images and hack/dind-cluster.sh. Would you prefer it be located in the dind image path, say images/dind/node/?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's meant to be a standalone script, yes, I think it would be better if it were divorced from hack/ and did not source hack/lib/init.sh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@marun marun force-pushed the refactor-dind branch 2 times, most recently from ed1ad8e to 4157c4f Compare September 23, 2016 20:22
@marun
Copy link
Contributor Author

marun commented Sep 24, 2016

Hmm, the intra-pod e2e is failing. Not sure what change I made could have triggered that.


While it is possible to run a dind cluster directly on a linux host,
it is recommended to consider the warnings at the top of the
dind-cluster.sh script.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning is kind of weird given that the doc doesn't talk about any other way of running dind now. You should probably incorporate it in the "Prerequisites" (eg, "4. You don't mind loading a few kernel modules etc")

Copy link
Contributor Author

@marun marun Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there are some folks for whom disabling selinux and running privileged docker containers directly on their host won't be desirable. Fair point, though, that there isn't much detail on how they would run a dind cluster otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

${DOCKER_CMD} run --privileged --net=host --rm -v /lib/modules:/lib/modules \
openshift/dind-node bash -e -c \
'/usr/sbin/sysctl -w net.bridge.bridge-nf-call-iptables=0 > /dev/null;
modprobe openvswitch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhat random to specify the path for sysctl but not for modprobe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

local master_cid
master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")"
local master_ip
master_ip="$(get-docker-ip "${master_cid}")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know the style guide, but could you merge some of the declarations? local master_cid master_ip etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per @stevekuznetsov's comments in a previous review, mixing expressions and scoping statements has the potential to bypass fail-on-error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still declare them on a single line, if it's two statements, though:

local my_var; my_var="$( some command )"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant changing the above to:

local master_cid master_ip
master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")"
master_ip="$(get-docker-ip "${master_cid}")"

ie, declaration assignment assignment, rather than declaration assignment declaration assignment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair point. I think declaring things on one line is nicer, but that's just me. A chunk of these are gone now that /etc/hosts is synced inside the container, though, and I'm not sure if it's worth changing the rest.

# Ensure the master can resolve node names to ip for kubelet communication
#
# Attempts to keep /etc/hosts in sync with the api's node records
# have thus far proved unreliable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does that mean / how will we know when that comment no longer applies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having trouble naively using a systemd unit on a timer. Now that @eparis pointed me at @smarterclayton's guide to using oc observe, I think a bash control loop might do the trick.

# The container will have created configuration as root
sudo rm -rf ${CONFIG_ROOT}/openshift.local.*
sudo rm -rf "${config_root}"/openshift.local.*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is pre-existing, but "sudo rm -rf" and "*" really don't belong in the same command...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain. The goal is to remove openshift.local.config and openshift.local.etcd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then do

sudo rm -rf "${config_root}"/openshift.local.config "${config_root}"/openshift.local.etcd

It's just scary to "rm -rf" a glob pattern. Especially as root.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the danger in a qualified glob pattern, but fair enough, I'll make the change.

local deployed_config_file="${deployed_config_path}/node-config.yaml"

# If the node config hasn't been deployed
if [[ ! -f "${deployed_config_file}" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if [[ -f "${deployed_config_file}" ]]; then
    return
fi
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2016
@marun marun force-pushed the refactor-dind branch 2 times, most recently from 8277be8 to 705aa1f Compare September 30, 2016 05:23
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2016
@marun marun changed the title WIP Refactor dind Refactor dind Sep 30, 2016
@marun
Copy link
Contributor Author

marun commented Sep 30, 2016

I've left out redeploy for now pending confirmation of its utility. I suggest we run the extended test repeatedly to assure ourselves that there are no latent issues.

@marun
Copy link
Contributor Author

marun commented Oct 1, 2016

re-[testextended][extended:networking]

2 similar comments
@marun
Copy link
Contributor Author

marun commented Oct 1, 2016

re-[testextended][extended:networking]

@marun
Copy link
Contributor Author

marun commented Oct 1, 2016

re-[testextended][extended:networking]

@marun
Copy link
Contributor Author

marun commented Oct 1, 2016

The tests are passing reliably, so I think this is ready from a functional perspective. Still looking for feedback on the ux changes.

@danwinship
Copy link
Contributor

Still looking for feedback on the ux changes.

I like the switch to command-line args. I could never remember the environment variable names

@danwinship
Copy link
Contributor

The new "dind: enable ssh access to cluster" commit has unrelated "os::util::is-master" stuff mixed in

@marun
Copy link
Contributor Author

marun commented Oct 3, 2016

@danwinship It's not entirely unrelated. It's a refactor so that there is a common way of identifying if a host is the master and the method is used in the script that enables ssh access. Would you prefer that be a separate commit?

@danwinship
Copy link
Contributor

Oh, I skimmed quickly but I guess I missed that it was used by the new commit. That's fine then.

@dcbw
Copy link
Contributor

dcbw commented Oct 4, 2016

LGTM, all I want is 'less' installed in the images :)

@openshift-bot
Copy link
Contributor

Evaluated for origin testextended up to f518e27

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/557/) (Extended Tests: networking)

@stevekuznetsov
Copy link
Contributor

Bash bits look good to me -- some of the smaller scripts could use some doc at the top, but that's minor.

@danwinship
Copy link
Contributor

[merge]

@openshift-bot
Copy link
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to f518e27

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test ABORTED (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9662/)

@dcbw
Copy link
Contributor

dcbw commented Oct 5, 2016

re-[merge]

@dcbw
Copy link
Contributor

dcbw commented Oct 6, 2016

flake is #11240 re-[merge]

@marun
Copy link
Contributor Author

marun commented Oct 6, 2016

flake #11240, #10489

re-[merge]

@danwinship
Copy link
Contributor

flakes #11240 #10773. [merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to f518e27

@openshift-bot
Copy link
Contributor

openshift-bot commented Oct 7, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9749/) (Image: devenv-rhel7_5143)

source /data/network-plugin

function ensure-node-config() {
local deployed_config_path="/var/lib/origin/openshift.local.config/node"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible multiple nodes are sharing this folder and the last one in is stomping? Trying to figure out what is causing #11274

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this path is where the certs will be copied locally after being generated. The generation target is a unique path on the shared volume (e.g. /data/openshift.local.config/node-[node name]).

--certificate-authority="${master_config_path}/ca.crt" \
--signer-cert="${master_config_path}/ca.crt" \
--signer-key="${master_config_path}/ca.key" \
--signer-serial="${master_config_path}/ca.serial.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are nodes generating config themselves with copies of the signing cert/key/serial? Can we verify the serial numbers in the resulting certs? Wondering if two nodes are both allocating the "next" serial number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entirely likely possible there is a race here. Is ca.serial.txt incremented by each node cert generation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the cert-generating commands are not intended to be run concurrently

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll fix.

@marun marun deleted the refactor-dind branch November 29, 2016 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants