add retry logic to flaky commands #5125

sosiouxme · 2017-08-17T23:26:49Z

Introduces the openshift_package action plugin which adds retry logic to package tasks.
Also modifies the repoquery plugin with retry logic.
Updates openshift_checks python code with retry logic.

NOTE: doing this with an action plugin requires that plugin to be loaded pretty much everywhere. That means using an ansible.cfg that refers to the repo's action_plugins/ directory when running playbooks.

sosiouxme · 2017-08-17T23:27:41Z

I'm not really happy with the need to add symlinks everywhere for playbooks to enable the plugin. Why not put all the common plugins in an openshift_common role (with no tasks) and just have every playbook invoke that? It's not pretty either but it solves the problem once and for all.

sdodson · 2017-08-18T13:24:49Z

oh man! I've never been this excited before! 99% of all the problems we see in upgrading the starter clusters are related to flaky yum commands.

ashcrow · 2017-08-18T13:52:55Z

Will review shortly.

ashcrow · 2017-08-18T13:54:03Z

Looks like travis caught a typo:

Syntax checking playbook: /home/travis/build/openshift/openshift-ansible/playbooks/byo/vagrant.yml
ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

The error appears to have been in '/home/travis/build/openshift/openshift-ansible/roles/openshift_repos/tasks/main.yaml': line 9, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  block:
  - name: Ensure libselinux-python is installed
    ^ here

sosiouxme · 2017-08-18T14:17:52Z

On Fri, Aug 18, 2017 at 9:54 AM, Stephen Milner wrote:

> Looks like travis caught a typo:
>
> Syntax checking playbook: /home/travis/build/openshift/openshift-ansible/playbooks/byo/vagrant.yml
> ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

Not actually a typo; it's ansible not picking up the action plugin so it doesn't recognize the action. I couldn't immediately see why but it just underscores why the symlink-everywhere approach sucks... this is the error you get when you miss one (or something else happens that prevents the action plugin from loading).

ashcrow

A few requested updates/nits. Really good work @sosiouxme!

ashcrow · 2017-08-18T14:10:15Z

action_plugins/openshift_package.py

@@ -0,0 +1,98 @@
+"""


nit: Unless this is an Ansible-ism, docstrings generally come after the license block.

ashcrow · 2017-08-18T14:11:43Z

action_plugins/openshift_package.py

+                if 'use' in new_module_args:
+                    del new_module_args['use']
+
+                display.vvvv("Running %s" % module)


nit: Use current string formatting for later 2.x and 3+

display.vvvv("Running {}".format(module))

ashcrow · 2017-08-18T14:12:33Z

action_plugins/openshift_package.py

+                        result.update(res)
+                        break
+                    result['last_failed'] = res
+                    display.v("{} module failed on try {} with result: {}".format(module, tries, res))


👍 for format and letting the user know how many tries it's attempted!!

ashcrow · 2017-08-18T14:15:26Z

action_plugins/openshift_package.py

+                        wrap_async=self._task.async,
+                    )
+                    tries += 1
+                    if tries > 3 or not res.get('failed'):


Not required by any means, but it may be worth noting in a comment for future developers that result['last_failed'] = res cannot be executed on the last try, or else Ansible will hit a recursion error.

Guess I'm a little confused what you mean here. res only ever gets placed into result, not itself; where would recursion occur?

Ansible holds its result in a single instance. If a result is appended to itself, it causes a Python recursion error, roughly equivalent to:

result['data'] = result['data'] = result['data'] = ...

Your code specifically avoids this, but I'm mentioning it because it's an easy accident to make when adding features.
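A minimal Python sketch of the kind of self-reference described above. The dicts `result` and `res` are hypothetical stand-ins for the plugin's variables, not Ansible's internal code, and how Ansible itself fails on such a structure is an assumption here. The sketch only shows that nesting a separate dict is harmless, while a dict stored inside itself can no longer be serialized with the standard `json` module.

```python
import json

# Hypothetical stand-ins for the plugin's `result` and one module run's `res`.
result = {"failed": True}
res = {"failed": True, "msg": "example failure"}

# Safe: `res` is a separate dict, so nesting it creates no cycle.
result["last_failed"] = res
safe_json = json.dumps(result)

# Unsafe: storing `result` inside itself creates a circular reference.
result["last_failed"] = result

try:
    json.dumps(result)
    cycle_error = None
except ValueError as exc:
    # The json module refuses to serialize circular structures.
    cycle_error = str(exc)
```

Keeping the failed attempt in its own dict, as the plugin does with `res`, is what keeps the structure serializable.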

ashcrow · 2017-08-18T14:15:57Z

library/openshift_package.py

+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+"""


nit: Unless this is an Ansible-ism, docstrings generally come after the license block.

ashcrow · 2017-08-18T14:17:55Z

library/openshift_package.py

+
+ANSIBLE_METADATA = {'metadata_version': '1.0',
+                    'status': ['stableinterface'],
+                    'supported_by': 'core'}


nit: I realize this is a slightly modified copy but this is technically inaccurate. I believe it should be community.

ashcrow · 2017-08-18T14:18:34Z

library/openshift_package.py

+module: openshift_package
+version_added: 2.0
+author:
+    - Ansible Inc


nit: Not sure if this should be updated or not.

ashcrow · 2017-08-18T14:19:34Z

library/openshift_package.py

+    - Ansible Inc
+short_description: Generic OS package manager
+description:
+     - Installs, upgrade and removes packages using the underlying OS package manager.


'Installs, upgrades and removes packages using the underlying OS package manager. Attempts up to 3 times on failure', or something like that.

ashcrow · 2017-08-18T14:24:18Z

A helpful future enhancement for this action would be to add support for denoting how many times to try and how long to sleep between tries.
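One way the suggested enhancement could look, as a standalone Python sketch rather than the plugin's actual code. The function name `run_with_retries` and the parameters `retries`, `delay`, and `sleep` are hypothetical; the only things borrowed from the diff are the `failed` key that marks a failure and the `last_failed` key that records the previous failed attempt.

```python
import time


def run_with_retries(run_once, retries=3, delay=0.0, sleep=time.sleep):
    """Call run_once() until it succeeds or `retries` attempts are used up.

    run_once must return a dict; a truthy 'failed' key marks a failure.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")

    result = {}
    res = {}
    for attempt in range(1, retries + 1):
        res = run_once()
        if not res.get("failed"):
            break
        if attempt < retries:
            # Remember this failure and wait before the next attempt.
            result["last_failed"] = res
            sleep(delay)

    # The final attempt (success or failure) becomes the reported result.
    result.update(res)
    result["attempts"] = attempt
    return result


# Usage: a fake command that fails twice, then succeeds.
outcomes = iter([
    {"failed": True, "msg": "timeout"},
    {"failed": True, "msg": "timeout"},
    {"changed": True},
])
final = run_with_retries(lambda: next(outcomes), retries=3, delay=0)
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a no-op instead of waiting for real.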

ashcrow · 2017-08-18T14:24:54Z

> Not actually a typo; it's ansible not picking up the action plugin so it doesn't recognize the action. I couldn't immediately see why but it just underscores why the symlink-everywhere approach sucks... this is the error you get when you miss one (or something else happens that prevents the action plugin from loading).

Ah, good catch.

mtnbikenc · 2017-08-18T15:35:17Z

@sosiouxme I don't care for the symlinking everywhere either. I'd rather use the ansible.cfg file for this, as we already do for callback_plugins. We even have roles_path in there, so we could do away with all the roles symlinks as well. But ALL playbooks would need to be launched from the root of the repo for the ansible.cfg to take effect. This does mean relying on a properly configured ansible.cfg; however, we already want that anyway, and we are not getting our desired defaults if we let the user launch a playbook from any directory. These considerations are higher level than just this PR, so a decision on this approach should not block the addition of this action plugin.

callback_plugins = callback_plugins/
action_plugins = action_plugins/  <== maybe add this and not use symlinks
roles_path = roles/

I've opened #5140 to test some of these ideas.

sosiouxme · 2017-08-21T13:39:20Z

On Fri, Aug 18, 2017 at 11:35 AM, Russell Teague wrote:

> @sosiouxme I don't care for the symlinking everywhere either. I'd rather use the ansible.cfg file for this, as we already do for callback_plugins. We even have roles_path in there, so we could do away with all the roles symlinks as well. But ALL playbooks would need to be launched from the root of the repo for the ansible.cfg to take effect. This does mean relying on a properly configured ansible.cfg; however, we already want that anyway, and we are not getting our desired defaults if we let the user launch a playbook from any directory.

OK, I have some concerns with that approach but it's surely cleaner. The difference is, if you miss a callback plugin, nothing really breaks. If you miss an action plugin, then any task that invokes it comes back as a syntax error, which is pretty terrible user experience.

> callback_plugins = callback_plugins/
> action_plugins = action_plugins/  <== maybe add this and not use symlinks

Sure, I can just nuke the symlinks commit :)

> roles_path = roles/
>
> I've opened #5140 to test some of these ideas.

Will discuss over there.

sosiouxme · 2017-08-23T00:10:04Z

I really cannot figure out why ansible isn't using the action plugin on playbooks/byo/rhel_subscribe.yml (and thus vagrant.yml). It works for the other playbooks in the same dir. Ansible bug? Something weird about the rhel_subscribe role?

EDIT: Adding library in ansible.cfg fixed it. I still don't know why only that playbook needs it.

sosiouxme · 2017-08-23T02:34:55Z

should pass travis now but I'm curious how badly tests will break

michaelgugino · 2017-08-23T03:12:52Z

Is there a reason we can't use ansible's built-in retry?

sosiouxme · 2017-08-23T10:47:01Z

@michaelgugino we can certainly use retries/until and forgo the openshift_package plugin. I just thought it would be nice to have the logic in one central place instead of scattered throughout the roles.

openshift-bot · 2017-08-25T01:47:31Z

Evaluated for openshift ansible test up to a925d06

openshift-bot · 2017-08-25T02:55:32Z

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/524/) (Base Commit: 3db50ad) (PR Branch Commit: a925d06)

ashcrow · 2017-08-25T13:35:36Z

test/integration/openshift_health_checker/common.go

+
+	// determine the playbook path relative to repoBase
+	playbook, err := filepath.Abs(p.Path)
+	if err != nil {


Just an idea: If you reuse this pattern (which is super common in Go), it may be worth factoring it into a helper:

func failOnError(t *testing.T, err error) {
	if err != nil {
		t.Fatal(err)
	}
}

sosiouxme · 2017-08-25T13:47:04Z

Test failures are now all because either action_plugins or library isn't picked up from the config, so the openshift_package plugin does not work. This requires fixing how the jobs run.

sosiouxme · 2017-08-28T18:21:21Z

Will base this on changes from #4264 so adding action_plugins to the RPM is painless.

Also, since it will break too many things to require running from repo base with the ansible.cfg, will punt on that and continue to add symlinks everywhere.

also needed library linked in playbooks/byo for some unknown reason

Of course this is only helpful if you're using the ansible.cfg and have the root of the repo as current working directory. Also aligned some entries in utils/etc/ansible.cfg for the RPM

Just happened to see some things that no longer trigger pylint.

openshift-merge-robot · 2017-09-19T08:04:46Z

@sosiouxme PR needs rebase

lhuard1A · 2017-10-27T08:56:50Z

Hello,

I’m wondering what kind of yum flakiness issues you are experiencing.

We regularly see this kind of Ansible failure:

TASK [openshift_facts : Ensure various deps are installed] ******************************************************************************************************
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-3.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=python-dbus)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=python-six)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=PyYAML)
failed: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] (item=python-dbus) => {"failed": true, "item": "python-dbus", "msg": "Failure talking to yum: [Errno 2] No such file or directory: '/var/cache/yum/x86_64/7Server/rhel7-optional/gen/primary_db.sqlite'"}
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=yum-utils)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=python-six)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=PyYAML)

I could understand flakiness caused by the network or by the health of the remote yum server, but I have no evidence that the yum error we get is linked to the remote server or a network glitch.

Is the yum error above something that is “known to happen” and supposed to be addressed by this PR?

michaelgugino · 2017-10-27T13:44:50Z

@lhuard1A

That error is typically due to some other process holding a lock on the file, i.e. a yum process is already running.

This commit mostly addresses network failures.

If you are having an issue you need help with, and you think it might be caused by ansible, please file an issue in this repository and we will try to assist.

sosiouxme requested review from ashcrow and mtnbikenc August 17, 2017 23:26

ashcrow requested changes Aug 18, 2017

ashcrow approved these changes Aug 23, 2017

ashcrow previously approved these changes Aug 24, 2017

ashcrow reviewed Aug 25, 2017

sosiouxme mentioned this pull request Aug 30, 2017

WIP add retries #5045

Closed

rhcarvalho mentioned this pull request Aug 31, 2017

Verify Requirements.[localhost] openshift_health_check openshift/origin#15715

Closed

openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2017

tbielawa mentioned this pull request Sep 1, 2017

yum flake network unreachable openshift/origin#16103

Closed

sosiouxme dismissed ashcrow’s stale review via 541073e September 8, 2017 18:45

openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 8, 2017

openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2017

This was referenced Sep 12, 2017

Improve docker_image_availability check reliability/performance #5365

Merged

Verify Requirements: [localhost] openshift_health_check (Not all of the required packages are available at their requested version: docker:1.12) openshift/origin#16143

Closed

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 13, 2017

sosiouxme mentioned this pull request Sep 13, 2017

add retries on repoquery #5401

Merged

openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 14, 2017

sosiouxme added 5 commits September 15, 2017 10:07

implement yum retries with openshift_package

3337e50

add action_plugins link everywhere

6235f20

also needed library linked in playbooks/byo for some unknown reason

ansible.cfg: configure action_plugins and library

32d01ec

Of course this is only helpful if you're using the ansible.cfg and have the root of the repo as current working directory. Also aligned some entries in utils/etc/ansible.cfg for the RPM

test/integration: cd and use repo ansible.cfg

8a97607

reduce pylint disabling

52b5164

Just happened to see some things that no longer trigger pylint.

openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2017

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2017

sosiouxme mentioned this pull request Nov 28, 2017

retry package operations #6302

Merged

sosiouxme closed this Nov 28, 2017
