-
Notifications
You must be signed in to change notification settings - Fork 175
Lock upgrade marker #8254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lock upgrade marker #8254
Conversation
This pull request does not have a backport label. Could you fix it @pchila? 🙏
|
This pull request is now in conflicts. Could you fix it? 🙏
|
03380a0
to
6dc582d
Compare
8aa3860
to
1b578ff
Compare
buildkite test this |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
internal/pkg/agent/application/upgrade/marker_access_windows.go
Outdated
Show resolved
Hide resolved
From what I can see, all the locking when reading or writing to the marker file is non-blocking. What will actually happen if the agent and watcher try to access the file concurrently? One will get an error, and then what? |
@swiatekm As an alternative we are using a so the FileLocker will keep trying to acquire a lock periodically for the timeout duration and if it fails it will error out. Reading and writing the update marker right now consists of a single writer (main process or watcher) in every phase. |
|
💚 Build Succeeded
History
cc @pchila |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a question about whether we need the CHANGELOG entry for this change but the implementation itself + tests LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not fun of panics but the alternative is not good as well. this will send us to startup spiral loop but at least we know something's wrong right away - should not happen in very most of the cases.
LGTM otherwise.
…ts-oblt-cli * feature/serverless-its-oblt-cli: (51 commits) as agreed let's move the group to the kb.integration pipeline Update .github/workflows/serverless-project.yml ci: invoke serverless integration tests package from BK github-actions: create a serverless project daily [tests] split up serverless and resource leaks integration tests (#8396) chore: Update to elastic/beats@dfdc12e33de0 (#8446) Lock upgrade marker (#8254) build(deps): bump github.com/elastic/elastic-agent-system-metrics from 0.11.12 to 0.11.13 (#8420) Add docker image name template and renamed fips cloud specs (#8429) buildkite(scripts): refactor common scripts (#8365) Use require.Eventually to try and address flakiness (#8421) Fix pre-command to support extended testing (#8418) [Automation] Bump Golang version to 1.24.4 (#8384) chore: Update to elastic/beats@aed2a8b768bd (#8423) [main][Automation] Update versions (#8425) Add Pipeline to deploy custom agent image for FIPS testing (#8035) ci: automatically update OTel components (#8288) [Automation] Bump VM Image version to 1749258065 (#8390) fix: increase context timeout to accommodate for slower machines in Test_checkForUnprivilegedVault (#8374) otel: add cumulativetodeltaprocessor to EDOT collector (#8372) ...
What does this PR do?
Introduce file locking for modifying and reading elastic-agent update marker consistently.
FileLocker is a wrapper around github.com/gofrs/flock that exposes both blocking and non-blocking file locking.
A lock on an update marker lockfile has been introduced at creation, reading and writing: this should ensure that no concurrent read/writes happen between elastic-agent main process and watcher.
This is a prerequisite for manual rollback feature: with such a feature both process will read/write the update marker concurrently.
Why is it important?
With the manual rollback feature that will be introduced by #6889, both elastic-agent main process and watcher need to modify the update marker. Locking before modification should avoid inconsistent writes.
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testDisruptive User Impact
How to test this PR locally
Aside from unit tests that have been introduced under
internal/pkg/agent/application/filelock
andinternal/pkg/agent/application/upgrade/step_mark.go
, the only use case here is to perform an upgrade and verify that it succeeds.If we want to test the file locking we can use the
testlocker
binary ininternal/pkg/agent/application/filelock/testlocker/main.go
and lock the update marker lock file in<elastic agent root>/data/.update-marker.lock
(holding this lock for too long may generate some failures as there's a timeout of 30s for locking operations.Related issues
Questions to ask yourself