-
Notifications
You must be signed in to change notification settings - Fork 2.3k
add retry logic to flaky commands #5125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not really happy with the need to add symlinks everywhere for playbooks to enable the plugin. Why not put all the common plugins in an |
oh man! I've never been this excited before! 99% of all the problems we see in upgrading the starter clusters are related to flaky yum commands. |
Will review shortly. |
Looks like travis caught a typo:
|
On Fri, Aug 18, 2017 at 9:54 AM, Stephen Milner ***@***.***> wrote:
Looks like travis caught a typo:
Syntax checking playbook: /home/travis/build/openshift/openshift-ansible/playbooks/byo/vagrant.yml
ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.
Not actually a typo; it's ansible not picking up the action plugin so it
doesn't recognize the action. I couldn't immediately see why but it just
underscores why the symlink-everywhere approach sucks... this is the error
you get when you miss one (or something else happens that prevents the
action plugin from loading).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few requested updates/nits. Really good work @sosiouxme!
@@ -0,0 +1,98 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Unless this is an ansible-ism generally docstrings occur after the license block.
action_plugins/openshift_package.py
Outdated
if 'use' in new_module_args: | ||
del new_module_args['use'] | ||
|
||
display.vvvv("Running %s" % module) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use current string formatting for later 2.x and 3+
display.vvvv("Running {}".format(module))
result.update(res) | ||
break | ||
result['last_failed'] = res | ||
display.v("{} module failed on try {} with result: {}".format(module, tries, res)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for format
and letting the user know how many tries it's attempted!!
wrap_async=self._task.async, | ||
) | ||
tries += 1 | ||
if tries > 3 or not res.get('failed'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required by any means but it may be worth noting via comment for future developers that result['last_failed'] = res
can not be called on the last try else ansible will have a recursion error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess I'm a little confused what you mean here. res
only ever gets placed into result
, not itself; where would recursion occur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way that ansible houses it's result is via a single instance. If a result appends itself it causes a Python recursion error as something similar to:
result['data'] = result['data'] = result['data'] = ...
In your code it specifically avoids this, but it's just an idea as it's an easy accident to make when adding features.
library/openshift_package.py
Outdated
#!/usr/bin/python | ||
# -*- coding: utf-8 -*- | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Unless this is an ansible-ism generally docstrings occur after the license block.
library/openshift_package.py
Outdated
|
||
ANSIBLE_METADATA = {'metadata_version': '1.0', | ||
'status': ['stableinterface'], | ||
'supported_by': 'core'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I realize this is a slightly modified copy but this is technically inaccurate. I believe it should be community
.
library/openshift_package.py
Outdated
module: openshift_package | ||
version_added: 2.0 | ||
author: | ||
- Ansible Inc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Not sure if this should be updated or not.
library/openshift_package.py
Outdated
- Ansible Inc | ||
short_description: Generic OS package manager | ||
description: | ||
- Installs, upgrade and removes packages using the underlying OS package manager. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Installs, upgrade and removes packages using the underlying OS package manager. Attempts up to 3 times on failure' or something like that.
A helpful future enhancement for this action would be to add support for denoting how many times to try and how long to sleep between tries. |
Ah, good catch. |
@sosiouxme I don't care for the symlinking everywhere either. I'd rather use the ansible.cfg file for this like we already do for callback_plugins. We even have
I've opened #5140 to test some of these ideas. |
On Fri, Aug 18, 2017 at 11:35 AM, Russell Teague ***@***.***> wrote:
@sosiouxme <https://github.com/sosiouxme> I don't care for the symlinking
everywhere either. I'd rather use the ansible.cfg file for this like we
already do for callback_plugins. We even have roles_path in there too so
we could do away with all the roles symlinks as well. But, *ALL*
playbooks would need to be launched from the root of the repo in order for
the ansible.cfg to take effect. This does require the reliance on a
properly configured ansible.cfg, however, we already want that anyway and
we are not getting our desired defaults if we let the user launch a
playbook from any directory.
OK, I have some concerns with that approach but it's surely cleaner.
The difference is, if you miss a callback plugin, nothing really breaks. If
you miss an action plugin, then any task that invokes it comes back as a
syntax error, which is pretty terrible user experience.
callback_plugins = callback_plugins/
action_plugins = action_plugins/ <== maybe add this and not use symlinks
Sure, I can just nuke the symlinks commit :)
Will discuss over there.
|
I really cannot figure out why ansible isn't using the action plugin on EDIT: Adding |
should pass travis now but I'm curious how badly tests will break |
Is there a reason we can't use ansible's built-in retry? |
@michaelgugino we can certainly use retries/until and forgo the |
Evaluated for openshift ansible test up to a925d06 |
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/524/) (Base Commit: 3db50ad) (PR Branch Commit: a925d06) |
|
||
// determine the playbook path relative to repoBase | ||
playbook, err := filepath.Abs(p.Path) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea: If you reuse this pattern (which is super common in Go) it may be worth functionizing it:
func failOnError(err Error):
if err != nil {
t.Fatal(err)
}
Test failures are now all because either |
Will base this on changes from #4264 so adding action_plugins to the RPM is painless. Also, since it will break too many things to require running from repo base with the |
also needed library linked in playbooks/byo for some unknown reason
Of course this is only helpful if you're using the ansible.cfg and have the root of the repo as current working directory. Also aligned some entries in utils/etc/ansible.cfg for the RPM
Just happened to see some things that no longer trigger pylint.
@sosiouxme PR needs rebase |
Hello, I’m wondering what kind of yum command flakiness issues are you guys experiencing. We’re currently regularly have this kind of ansible failures:
I could understand flakiness caused by the network or by the health of the remote yum server, but I have no evidence that the yum error we get is linked to the remote server or a network glitch. Is the |
That error is typically due to some other process has a lock on the file; ie a yum process is already running. This commit is due to network failures, mostly. If you are having an issue you need help with, and you think it might be caused by ansible, please file an issue in this repository and we will try to assist. |
Introduces the
openshift_package
action plugin which adds retry logic to package tasks.Also modifies the
repoquery
plugin with retry logic.Updates
openshift_checks
python code with retry logic.NOTE: in doing this with an action plugin, we require that action plugin to be loaded pretty much everywhere. This will require using an
ansible.cfg
that refers to the repo'saction_plugin/
directory when running playbooks.