Skip to content

add retry logic to flaky commands #5125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

add retry logic to flaky commands #5125

wants to merge 5 commits into from

Conversation

sosiouxme
Copy link
Member

@sosiouxme sosiouxme commented Aug 17, 2017

Introduces the openshift_package action plugin which adds retry logic to package tasks.
Also modifies the repoquery plugin with retry logic.
Updates openshift_checks python code with retry logic.

NOTE: in doing this with an action plugin, we require that action plugin to be loaded pretty much everywhere. This will require using an ansible.cfg that refers to the repo's action_plugin/ directory when running playbooks.

@sosiouxme sosiouxme requested review from ashcrow and mtnbikenc August 17, 2017 23:26
@sosiouxme
Copy link
Member Author

I'm not really happy with the need to add symlinks everywhere for playbooks to enable the plugin. Why not put all the common plugins in an openshift_common role (with no tasks) and just have every playbook invoke that? It's not pretty either but it solves the problem once and for all.

@sdodson
Copy link
Member

sdodson commented Aug 18, 2017

oh man! I've never been this excited before! 99% of all the problems we see in upgrading the starter clusters are related to flaky yum commands.

@ashcrow
Copy link
Member

ashcrow commented Aug 18, 2017

Will review shortly.

@ashcrow
Copy link
Member

ashcrow commented Aug 18, 2017

Looks like travis caught a typo:

Syntax checking playbook: /home/travis/build/openshift/openshift-ansible/playbooks/byo/vagrant.yml

ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

The error appears to have been in '/home/travis/build/openshift/openshift-ansible/roles/openshift_repos/tasks/main.yaml': line 9, column 5, but may

be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  block:

  - name: Ensure libselinux-python is installed

    ^ here

The error appears to have been in '/home/travis/build/openshift/openshift-ansible/roles/openshift_repos/tasks/main.yaml': line 9, column 5, but may

be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  block:

  - name: Ensure libselinux-python is installed

    ^ here

@sosiouxme
Copy link
Member Author

sosiouxme commented Aug 18, 2017 via email

Copy link
Member

@ashcrow ashcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few requested updates/nits. Really good work @sosiouxme!

@@ -0,0 +1,98 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Unless this is an ansible-ism generally docstrings occur after the license block.

if 'use' in new_module_args:
del new_module_args['use']

display.vvvv("Running %s" % module)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use current string formatting for later 2.x and 3+

display.vvvv("Running {}".format(module))

result.update(res)
break
result['last_failed'] = res
display.v("{} module failed on try {} with result: {}".format(module, tries, res))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for format and letting the user know how many tries it's attempted!!

wrap_async=self._task.async,
)
tries += 1
if tries > 3 or not res.get('failed'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required by any means but it may be worth noting via comment for future developers that result['last_failed'] = res can not be called on the last try else ansible will have a recursion error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess I'm a little confused what you mean here. res only ever gets placed into result, not itself; where would recursion occur?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that ansible houses it's result is via a single instance. If a result appends itself it causes a Python recursion error as something similar to:

result['data'] = result['data'] = result['data'] = ...

In your code it specifically avoids this, but it's just an idea as it's an easy accident to make when adding features.

#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Unless this is an ansible-ism generally docstrings occur after the license block.


ANSIBLE_METADATA = {'metadata_version': '1.0',
'status': ['stableinterface'],
'supported_by': 'core'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I realize this is a slightly modified copy but this is technically inaccurate. I believe it should be community.

module: openshift_package
version_added: 2.0
author:
- Ansible Inc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not sure if this should be updated or not.

- Ansible Inc
short_description: Generic OS package manager
description:
- Installs, upgrade and removes packages using the underlying OS package manager.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Installs, upgrade and removes packages using the underlying OS package manager. Attempts up to 3 times on failure' or something like that.

@ashcrow
Copy link
Member

ashcrow commented Aug 18, 2017

A helpful future enhancement for this action would be to add support for denoting how many times to try and how long to sleep between tries.

@ashcrow
Copy link
Member

ashcrow commented Aug 18, 2017

Not actually a typo; it's ansible not picking up the action plugin so it doesn't recognize the action. I couldn't immediately see why but it just underscores why the symlink-everywhere approach sucks... this is the error you get when you miss one (or something else happens that prevents the action plugin from loading).

Ah, good catch.

@mtnbikenc
Copy link
Member

@sosiouxme I don't care for the symlinking everywhere either. I'd rather use the ansible.cfg file for this like we already do for callback_plugins. We even have roles_path in there too so we could do away with all the roles symlinks as well. But, ALL playbooks would need to be launched from the root of the repo in order for the ansible.cfg to take effect. This does require the reliance on a properly configured ansible.cfg, however, we already want that anyway and we are not getting our desired defaults if we let the user launch a playbook from any directory. These are considerations that are higher level than just this PR so a decision on this approach should not block the addition of this action plugin.

callback_plugins = callback_plugins/
action_plugins = action_plugins/  <== maybe add this and not use symlinks
roles_path = roles/

I've opened #5140 to test some of these ideas.

@sosiouxme
Copy link
Member Author

sosiouxme commented Aug 21, 2017 via email

@sosiouxme
Copy link
Member Author

sosiouxme commented Aug 23, 2017

I really cannot figure out why ansible isn't using the action plugin on playbooks/byo/rhel_subscribe.yml (and thus vagrant.yml). It works for the other playbooks in the same dir. Ansible bug? Something weird about the rhel_subscribe role?

EDIT: Adding library in ansible.cfg fixed it. I still don't know why only that playbook needs it.

@sosiouxme
Copy link
Member Author

sosiouxme commented Aug 23, 2017

should pass travis now but I'm curious how badly tests will break

@michaelgugino
Copy link
Contributor

Is there a reason we can't use ansible's built-in retry?

@sosiouxme
Copy link
Member Author

@michaelgugino we can certainly use retries/until and forgo the openshift_package plugin. I just thought it would be nice to have the logic in one central place instead of scattered throughout the roles.

ashcrow
ashcrow previously approved these changes Aug 24, 2017
@openshift-bot
Copy link

Evaluated for openshift ansible test up to a925d06

@openshift-bot
Copy link

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/524/) (Base Commit: 3db50ad) (PR Branch Commit: a925d06)


// determine the playbook path relative to repoBase
playbook, err := filepath.Abs(p.Path)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea: If you reuse this pattern (which is super common in Go) it may be worth functionizing it:

func failOnError(err Error):
    if err != nil {
        t.Fatal(err)
    }

@sosiouxme
Copy link
Member Author

Test failures are now all because either action_plugin or library aren't picked up in config and so the openshift_package plugin does not work. This requires fixing how the jobs run.

@sosiouxme
Copy link
Member Author

Will base this on changes from #4264 so adding action_plugins to the RPM is painless.

Also, since it will break too many things to require running from repo base with the ansible.cfg, will punt on that and continue to add symlinks everywhere.

@sosiouxme sosiouxme mentioned this pull request Aug 30, 2017
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2017
@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 8, 2017
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2017
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 13, 2017
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 14, 2017
also needed library linked in playbooks/byo for some unknown reason
Of course this is only helpful if you're using the ansible.cfg and have
the root of the repo as current working directory.
Also aligned some entries in utils/etc/ansible.cfg for the RPM
Just happened to see some things that no longer trigger pylint.
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2017
@openshift-merge-robot
Copy link
Contributor

@sosiouxme PR needs rebase

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2017
@lhuard1A
Copy link

Hello,

I’m wondering what kind of yum command flakiness issues are you guys experiencing.

We’re currently regularly have this kind of ansible failures:

TASK [openshift_facts : Ensure various deps are installed] ******************************************************************************************************
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-3.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=iproute)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=python-dbus)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=python-six)
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=PyYAML)
failed: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] (item=python-dbus) => {"failed": true, "item": "python-dbus", "msg": "Failure talking to yum: [Errno 2] No such file or directory: '/var/cache/yum/x86_64/7Server/rhel7-optional/gen/primary_db.sqlite'"}
ok: [ose3-int-a-master-1.node.jawed-1a-eu-central-1.acs] => (item=yum-utils)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=python-six)
ok: [ose3-int-a-master-2.node.jawed-1a-eu-central-1.acs] => (item=PyYAML)

I could understand flakiness caused by the network or by the health of the remote yum server, but I have no evidence that the yum error we get is linked to the remote server or a network glitch.

Is the yum error above something that is “known to happen” and supposed to be addressed by this PR?

@michaelgugino
Copy link
Contributor

@lhuard1A

That error is typically due to some other process has a lock on the file; ie a yum process is already running.

This commit is due to network failures, mostly.

If you are having an issue you need help with, and you think it might be caused by ansible, please file an issue in this repository and we will try to assist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants