Refactor dind #11061

marun · 2016-09-23T03:13:47Z

This looks like a big change, but it's mostly shifting code around to remove dependency on contrib/vagrant and move towards being able to deploy on kube/openshift.

refactor hack/dind-cluster.sh
- remove dependency on contrib/vagrant/* so it can be deleted as part of the move to openshift-ansible for dev cluster deployment (cc: @stevekuznetsov)
- use flags as much as possible when starting a cluster (cc: @danwinship)
- build images and binaries only if necessary or requested to do so, and build on the host rather than in the container (cc: @smarterclayton)
- perform host modification in a privileged docker container to ensure the docker host is modified even if docker is running remotely
- minimize unnecessary output
create new openshift/dind base image that runs only systemd+dind so it can be reused for testing things like openshift-ansible (cc: @sdodson, @stevekuznetsov)
create new openshift/dind-node and openshift/dind-master images
- use systemd units instead of post-build configuration for:
  - generating cluster configuration (likely to be replaced by something kubeadm-like)
  - disabling the master node

cc: @openshift/networking

stevekuznetsov

Only small stuff on the Bash bits.

stevekuznetsov · 2016-09-23T15:55:42Z

hack/dind-cluster.sh

-        --hostname="${name}" "${DIND_IMAGE}")"
-    node_cids+=( "${cid}" )
-    node_ips+=( "$(get-docker-ip "${cid}")" )
+    local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"


Don't co-locate expressions and scoping statements on the same line:

local cid
cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"
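
The reasoning behind this suggestion is that `local` is itself a command, so its own exit status replaces that of the command substitution and a failure is never seen by fail-on-error. A minimal sketch, using hypothetical function names (`combined`, `split`) that are not part of the PR:

```shell
# Hypothetical demo; none of these names come from the PR.

combined() {
  # 'local' succeeds, so the failure of 'false' is masked.
  local value="$(false)"
  echo "combined: rc=$?"
}

split() {
  local value
  # A plain assignment carries the exit status of the substitution.
  value="$(false)"
  echo "split: rc=$?"
}

combined_out="$(combined)"
split_out="$(split)"
echo "${combined_out}"
echo "${split_out}"
```

Only the split form exposes the non-zero status to the caller.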

stevekuznetsov · 2016-09-23T15:55:52Z

hack/dind-cluster.sh

-    node_cids+=( "${cid}" )
-    node_ips+=( "$(get-docker-ip "${cid}")" )
+    local cid="$(${run_cmd} --name="${name}" --hostname="${name}" "${NODE_IMAGE}")"
+    local ip="$(get-docker-ip "${cid}")"


stevekuznetsov · 2016-09-23T15:56:33Z

hack/dind-cluster.sh

-  local rc_file="dind-${INSTANCE_PREFIX}.rc"
-  local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})"
+  local rc_file="dind-${cluster_id}.rc"
+  local admin_config="$(get-admin-config ${CONFIG_ROOT})"


quoting

admin_config="$(get-admin-config "${CONFIG_ROOT}")"
                                 ^              ^
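
To illustrate why the inner quotes matter: an unquoted expansion is word-split, so a path containing a space would reach the function as two arguments. A small sketch, assuming a hypothetical helper `count_args` and a made-up path:

```shell
# Hypothetical helper that reports how many arguments it received.
count_args() {
  echo "$#"
}

config_root="/tmp/my config"   # made-up path containing a space

# Unquoted: the value is split on whitespace into two words.
unquoted="$(count_args ${config_root})"
# Quoted: the value is passed through as a single argument.
quoted="$(count_args "${config_root}")"

echo "unquoted=${unquoted} quoted=${quoted}"
```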

stevekuznetsov · 2016-09-23T15:57:26Z

hack/dind-cluster.sh

-  local rc_file="dind-${INSTANCE_PREFIX}.rc"
-  local admin_config="$(os::provision::get-admin-config ${CONFIG_ROOT})"
+  local rc_file="dind-${cluster_id}.rc"
+  local admin_config="$(get-admin-config ${CONFIG_ROOT})"
  local bin_path="$(os::build::get-bin-output-path "${OS_ROOT}")"


stevekuznetsov · 2016-09-23T15:58:09Z

hack/dind-cluster.sh

-    os::provision::disable-node "${OS_ROOT}" "${CONFIG_ROOT}" \
-        "${SDN_NODE_NAME}"
+  if [[ -n "${wait_for_cluster}" ]]; then
+    wait-for-cluster "$(get-admin-config ${config_root})" \


wait-for-cluster "$(get-admin-config "${config_root}")" \
                                     ^              ^

stevekuznetsov · 2016-09-23T16:11:41Z

hack/dind-cluster.sh

    ;;
  wait-for-cluster)
-    wait-for-cluster
+    wait-for-cluster "$(get-admin-config ${CONFIG_ROOT})" \


wait-for-cluster "$(get-admin-config "${CONFIG_ROOT}")" \
                                     ^              ^

stevekuznetsov · 2016-09-23T16:27:20Z

images/dind/dind-setup.sh

+}
+
+mount --make-shared /
+os::provision::enable-overlay-storage


Pass "$@" through if you wanted args to the script to go to the func?
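
For context, a function called without arguments does not see the script's positional parameters; they have to be forwarded explicitly. A rough sketch with hypothetical function names (the real function is `os::provision::enable-overlay-storage`, whose arguments are not shown in this thread):

```shell
# Hypothetical stand-in that just reports its argument count.
enable_storage() {
  echo "args: $#"
}

without_passthrough() {
  # The callee sees no positional parameters here.
  enable_storage
}

with_passthrough() {
  # "$@" forwards every argument, preserving word boundaries.
  enable_storage "$@"
}

dropped="$(without_passthrough one "two words")"
forwarded="$(with_passthrough one "two words")"
echo "${dropped}"
echo "${forwarded}"
```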

stevekuznetsov · 2016-09-23T16:27:41Z

images/dind/master/disable-master-node.sh

+  local error_msg="[ERROR] Timeout waiting for ${msg}"
+
+  local counter=0
+  while ! $(${condition}); do


same comments as above

stevekuznetsov · 2016-09-23T16:28:31Z

images/dind/node/openshift-generate-node-config.sh

+  local error_msg="[ERROR] Timeout waiting for ${msg}"
+
+  local counter=0
+  while ! $(${condition}); do


stevekuznetsov · 2016-09-23T16:29:37Z

images/dind/node/openshift-generate-node-config.sh

+
+  # Deploy the node config
+  mkdir -p "${DEPLOYED_CONFIG_PATH}"
+  cp -r ${CONFIG_PATH}/* "${DEPLOYED_CONFIG_PATH}"


cp -r "${CONFIG_PATH}"/* "${DEPLOYED_CONFIG_PATH}"
      ^              ^

marun · 2016-09-23T18:21:31Z

@stevekuznetsov I've hopefully addressed your comments. I've also broken that wait function out into hack/lib/util/dind.sh so it can be reused instead of being copied everywhere.
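
The general shape of such a shared wait helper might look like the sketch below. This is not the code from hack/lib/util/dind.sh; the name `wait_for`, its argument order, and the one-second polling interval are assumptions made for illustration.

```shell
# Hypothetical helper: poll a condition once per second until it
# succeeds or the timeout (in seconds) is reached.
wait_for() {
  local condition="$1"
  local timeout="$2"
  local counter=0
  until eval "${condition}"; do
    if [ "${counter}" -ge "${timeout}" ]; then
      echo "[ERROR] Timeout waiting for: ${condition}" >&2
      return 1
    fi
    counter=$((counter + 1))
    sleep 1
  done
}

# A condition that is already true returns immediately.
ready_rc=0
wait_for "true" 1 || ready_rc=$?

# A condition that never becomes true times out with status 1.
timeout_rc=0
wait_for "false" 1 2>/dev/null || timeout_rc=$?

echo "ready_rc=${ready_rc} timeout_rc=${timeout_rc}"
```

Keeping the loop in one sourced file means each caller supplies only the condition and the timeout.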

stevekuznetsov · 2016-09-23T18:29:21Z

images/dind/node/openshift-generate-node-config.sh

+
+function ensure-node-config() {
+  local deployed_config_path="/var/lib/origin/openshift.local.config/node"
+  local deployed_config_file="${deployed_config_path}/node-config.yaml"


this looks unused

Done. I needed to fix the case of the conditional below.

stevekuznetsov · 2016-09-23T18:30:47Z

hack/dind-cluster.sh

  # The container will have created configuration as root
-  sudo rm -rf ${CONFIG_ROOT}/openshift.local.*
+  sudo rm -rf ${config_root}/openshift.local.*


I missed this one last time

sudo rm -rf "${config_root}"/openshift.local.*
            ^              ^

stevekuznetsov · 2016-09-23T18:31:19Z

hack/dind-cluster.sh

+  local rc_file="dind-${cluster_id}.rc"
+  local admin_config
+  admin_config="$(get-admin-config "${CONFIG_ROOT}")"
+  local bin_bath


s/bath/path/

stevekuznetsov · 2016-09-23T18:32:55Z

hack/lib/util/dind.sh

+  if [[ "${counter}" != "0" && "${timeout}" != "${OS_WAIT_FOREVER}" ]]; then
+    echo -e '\nDone'
+  fi
+}


to match our imports in hack/lib/init.sh we always declare our functions readonly -- otherwise there is a chance that a nested script with nested callouts to hack/lib/init.sh will have a different version of a function, or an alias for it or other such nonsense, which we don't really want to deal with
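
As a rough illustration of what declaring a function readonly buys, the bash-specific `readonly -f` makes later redefinitions fail instead of silently replacing the function. The function name `greet` below is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical function; 'readonly -f' is a bash builtin feature.
function greet() {
  echo "original"
}
readonly -f greet

# Try to redefine the function in a subshell; bash rejects it.
redefine_rc=0
( eval 'function greet() { echo "replacement"; }' ) 2>/dev/null || redefine_rc=$?

# The original definition is still the one that runs.
result="$(greet)"
echo "redefine_rc=${redefine_rc} result=${result}"
```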

stevekuznetsov · 2016-09-23T18:34:01Z

hack/lib/util/dind.sh

+      fi
+      sleep 1
+    else
+      echo -e "\n${error_msg}"


at some point in the future I intend to walk through our code and enforce the use of os::log::{info,warn,error}. one of those may be more appropriate here

Given that this file is intended to be distributed standalone in the dind images, I want to avoid adding any dependencies. I've updated the file's header to indicate this.

We're in hack/lib which is pretty much the antithesis to standalone -- this will get sourced by everything that brings in hack/lib/init.sh. Are we shipping this in an image for customers? Can we put these in origin/images/networking-diagnostics/ ?

No, this is not shipping to customers; it is only used by the dind (test+dev) images and hack/dind-cluster.sh. Would you prefer it be located in the dind image path, say images/dind/node/?

If it's meant to be a standalone script, yes, I think it would be better if it were divorced from hack/ and did not source hack/lib/init.sh

marun · 2016-09-24T00:24:58Z

Hmm, the intra-pod e2e is failing. Not sure what change I made could have triggered that.

danwinship · 2016-09-28T17:44:38Z

CONTRIBUTING.adoc

+
+While it is possible to run a dind cluster directly on a linux host,
+it is recommended to consider the warnings at the top of the
+dind-cluster.sh script.


This warning is kind of weird given that the doc doesn't talk about any other way of running dind now. You should probably incorporate it in the "Prerequisites" (eg, "4. You don't mind loading a few kernel modules etc")

I assume there are some folks for whom disabling selinux and running privileged docker containers directly on their host won't be desirable. Fair point, though, that there isn't much detail on how they would run a dind cluster otherwise.

danwinship · 2016-09-28T17:47:17Z

hack/dind-cluster.sh

+  ${DOCKER_CMD} run --privileged --net=host --rm -v /lib/modules:/lib/modules \
+                openshift/dind-node bash -e -c \
+                '/usr/sbin/sysctl -w net.bridge.bridge-nf-call-iptables=0 > /dev/null;
+                modprobe openvswitch;


somewhat random to specify the path for sysctl but not for modprobe

danwinship · 2016-09-28T17:49:51Z

hack/dind-cluster.sh

+  local master_cid
+  master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")"
+  local master_ip
+  master_ip="$(get-docker-ip "${master_cid}")"


i don't know the style guide, but could you merge some of the declarations? local master_cid master_ip etc

As per @stevekuznetsov's comments in a previous review, mixing expressions and scoping statements has the potential to bypass fail-on-error.

We can still declare them on a single line, if it's two statements, though:

local my_var; my_var="$( some command )"

I meant changing the above to:

local master_cid master_ip
master_cid="$(${run_cmd} --name="${MASTER_NAME}" --hostname="${MASTER_NAME}" "${MASTER_IMAGE}")"
master_ip="$(get-docker-ip "${master_cid}")"

ie, declaration assignment assignment, rather than declaration assignment declaration assignment

Ah, fair point. I think declaring things on one line is nicer, but that's just me. A chunk of these are gone now that /etc/hosts is synced inside the container, though, and I'm not sure if it's worth changing the rest.

danwinship · 2016-09-28T17:51:28Z

hack/dind-cluster.sh

+    # Ensure the master can resolve node names to ip for kubelet communication
+    #
+    # Attempts to keep /etc/hosts in sync with the api's node records
+    # have thus far proved unreliable.


what does that mean / how will we know when that comment no longer applies?

I was having trouble naively using a systemd unit on a timer. Now that @eparis pointed me at @smarterclayton's guide to using oc observe, I think a bash control loop might do the trick.

danwinship · 2016-09-28T17:54:27Z

hack/dind-cluster.sh

  # The container will have created configuration as root
-  sudo rm -rf ${CONFIG_ROOT}/openshift.local.*
+  sudo rm -rf "${config_root}"/openshift.local.*


I know this is pre-existing, but "sudo rm -rf" and "*" really don't belong in the same command...

Please explain. The goal is to remove openshift.local.config and openshift.local.etcd.

Then do

sudo rm -rf "${config_root}"/openshift.local.config "${config_root}"/openshift.local.etcd

It's just scary to "rm -rf" a glob pattern. Especially as root.

I don't see the danger in a qualified glob pattern, but fair enough, I'll make the change.
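
The difference between the two forms can be seen in a scratch directory: the glob removes whatever happens to match, while the explicit list removes only the named paths. A small sketch, assuming a temporary directory and a made-up third entry (`openshift.local.other`), and run without sudo:

```shell
# Scratch directory standing in for config_root (hypothetical).
config_root="$(mktemp -d)"
mkdir "${config_root}/openshift.local.config" \
      "${config_root}/openshift.local.etcd" \
      "${config_root}/openshift.local.other"

# The glob matches all three entries, including the unintended one.
glob_matches="$(ls -d "${config_root}"/openshift.local.* | wc -l | tr -d ' ')"

# The explicit form removes only the two named directories.
rm -rf "${config_root}/openshift.local.config" "${config_root}/openshift.local.etcd"
remaining="$(ls "${config_root}")"

echo "glob_matches=${glob_matches} remaining=${remaining}"
rm -rf "${config_root}"
```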

danwinship · 2016-09-28T18:18:46Z

images/dind/node/openshift-generate-node-config.sh

+  local deployed_config_file="${deployed_config_path}/node-config.yaml"
+
+  # If the node config hasn't been deployed
+  if [[ ! -f "${deployed_config_file}" ]]; then


if [[ -f "${deployed_config_file}" ]]; then
  return
fi
...
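
The early-return form keeps the main path unindented and makes the function safe to call more than once. A minimal sketch, using a hypothetical stand-in (`ensure_node_config`) that only touches a file in a temporary directory rather than generating real configuration:

```shell
# Hypothetical stand-in for the real function.
ensure_node_config() {
  local deployed_config_file="$1"
  # Guard clause: nothing to do when the config already exists.
  if [ -f "${deployed_config_file}" ]; then
    echo "already deployed"
    return
  fi
  echo "deploying"
  touch "${deployed_config_file}"
}

scratch_dir="$(mktemp -d)"
first_run="$(ensure_node_config "${scratch_dir}/node-config.yaml")"
second_run="$(ensure_node_config "${scratch_dir}/node-config.yaml")"
rm -rf "${scratch_dir}"

echo "${first_run}"
echo "${second_run}"
```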

marun · 2016-09-30T21:38:02Z

I've left out redeploy for now pending confirmation of its utility. I suggest we run the extended test repeatedly to assure ourselves that there are no latent issues.

marun · 2016-10-01T01:39:05Z

re-[testextended][extended:networking]

marun · 2016-10-01T08:25:36Z

re-[testextended][extended:networking]

marun · 2016-10-01T11:33:04Z

re-[testextended][extended:networking]

marun · 2016-10-01T11:33:56Z

The tests are passing reliably, so I think this is ready from a functional perspective. Still looking for feedback on the ux changes.

danwinship · 2016-10-03T15:38:37Z

> Still looking for feedback on the ux changes.

I like the switch to command-line args. I could never remember the environment variable names

danwinship · 2016-10-03T15:39:12Z

The new "dind: enable ssh access to cluster" commit has unrelated "os::util::is-master" stuff mixed in

marun · 2016-10-03T16:36:52Z

@danwinship It's not entirely unrelated. It's a refactor so that there is a common way of identifying if a host is the master and the method is used in the script that enables ssh access. Would you prefer that be a separate commit?

danwinship · 2016-10-03T17:50:06Z

Oh, I skimmed quickly but I guess I missed that it was used by the new commit. That's fine then.

dcbw · 2016-10-04T20:58:46Z

LGTM, all I want is 'less' installed in the images :)

openshift-bot · 2016-10-04T21:12:30Z

Evaluated for origin testextended up to f518e27

openshift-bot · 2016-10-04T22:52:17Z

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/557/) (Extended Tests: networking)

stevekuznetsov · 2016-10-05T12:01:28Z

Bash bits look good to me -- some of the smaller scripts could use some doc at the top, but that's minor.

danwinship · 2016-10-05T13:33:17Z

[merge]

openshift-bot · 2016-10-05T13:52:28Z

[Test]ing while waiting on the merge queue

openshift-bot · 2016-10-05T14:02:18Z

Evaluated for origin test up to f518e27

openshift-bot · 2016-10-05T17:07:29Z

continuous-integration/openshift-jenkins/test ABORTED (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9662/)

dcbw · 2016-10-05T21:47:04Z

re-[merge]

dcbw · 2016-10-06T02:51:21Z

flake is #11240 re-[merge]

marun · 2016-10-06T16:44:29Z

flake #11240, #10489

re-[merge]

danwinship · 2016-10-07T12:31:06Z

flakes #11240 #10773. [merge]

openshift-bot · 2016-10-07T12:32:22Z

Evaluated for origin merge up to f518e27

openshift-bot · 2016-10-07T15:52:35Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9749/) (Image: devenv-rhel7_5143)

liggitt · 2016-10-08T16:41:48Z

images/dind/node/openshift-generate-node-config.sh

+source /data/network-plugin
+
+function ensure-node-config() {
+  local deployed_config_path="/var/lib/origin/openshift.local.config/node"


Is it possible multiple nodes are sharing this folder and the last one in is stomping? Trying to figure out what is causing #11274

No, this path is where the certs will be copied locally after being generated. The generation target is a unique path on the shared volume (e.g. /data/openshift.local.config/node-[node name]).

liggitt · 2016-10-08T16:43:29Z

images/dind/node/openshift-generate-node-config.sh

+      --certificate-authority="${master_config_path}/ca.crt" \
+      --signer-cert="${master_config_path}/ca.crt" \
+      --signer-key="${master_config_path}/ca.key" \
+      --signer-serial="${master_config_path}/ca.serial.txt"


Are nodes generating config themselves with copies of the signing cert/key/serial? Can we verify the serial numbers in the resulting certs? Wondering if two nodes are both allocating the "next" serial number

It's entirely possible there is a race here. Is ca.serial.txt incremented by each node cert generation?

yes, the cert-generating commands are not intended to be run concurrently

Ok, I'll fix.
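
One generic way to serialize concurrent writers to a shared file is a lock directory, since `mkdir` either creates the directory or fails atomically. The sketch below is not the fix that was applied for this PR; the paths, the `next_serial` function, and the plain integer counter are assumptions chosen to show the idea.

```shell
# Scratch directory standing in for the shared volume (hypothetical).
work_dir="$(mktemp -d)"
serial_file="${work_dir}/ca.serial.txt"
lock_dir="${work_dir}/serial.lock"
echo 0 > "${serial_file}"

next_serial() {
  # mkdir is atomic: only one caller can create the lock directory.
  until mkdir "${lock_dir}" 2>/dev/null; do
    sleep 0.1
  done
  local serial
  serial="$(cat "${serial_file}")"
  echo $((serial + 1)) > "${serial_file}"
  # Release the lock so the next waiter can proceed.
  rmdir "${lock_dir}"
}

# Three simulated nodes allocate a serial number concurrently.
for node in node-1 node-2 node-3; do
  next_serial &
done
wait

final_serial="$(cat "${serial_file}")"
echo "final_serial=${final_serial}"
rm -rf "${work_dir}"
```

Because the read and the write both happen while the lock is held, each caller sees a distinct value.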

marun force-pushed the refactor-dind branch from eaff020 to 87a5263 Compare September 23, 2016 03:17

stevekuznetsov suggested changes Sep 23, 2016

marun force-pushed the refactor-dind branch from 87a5263 to e9a3a31 Compare September 23, 2016 18:18

stevekuznetsov suggested changes Sep 23, 2016

marun force-pushed the refactor-dind branch 2 times, most recently from ed1ad8e to 4157c4f Compare September 23, 2016 20:22

marun force-pushed the refactor-dind branch from 4157c4f to 1e22413 Compare September 24, 2016 00:38

danwinship reviewed Sep 28, 2016

openshift-bot added the needs-rebase label Sep 29, 2016

marun force-pushed the refactor-dind branch 2 times, most recently from 8277be8 to 705aa1f Compare September 30, 2016 05:23

openshift-bot removed the needs-rebase label Sep 30, 2016

marun added the component/networking label Sep 30, 2016

marun force-pushed the refactor-dind branch from 705aa1f to 9897da9 Compare September 30, 2016 20:37

marun added 3 commits September 30, 2016 13:37

Refactor dind

9897da9

dind: enable ssh access to cluster

2f7655d

dind: update warning of system modification

e4f4629

marun changed the title ~~WIP Refactor dind~~ Refactor dind Sep 30, 2016

marun added 2 commits October 4, 2016 14:13

dind: slim down base image size

60cb02d

dind: install 'less' to ease debugging

f518e27

stevekuznetsov approved these changes Oct 7, 2016

openshift-bot merged commit fe5ec65 into openshift:master Oct 7, 2016

danwinship mentioned this pull request Oct 8, 2016

networking test flake "certificate is valid for nettest-node-1, 172.17.0.3, not nettest-node-2" #11274

Closed

liggitt reviewed Oct 8, 2016

marun mentioned this pull request Oct 18, 2016

dind: bump deployment timeout #11404

Merged

marun deleted the refactor-dind branch November 29, 2016 19:21
