Skip to content

Lock upgrade marker #8254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 11, 2025
Merged

Lock upgrade marker #8254

merged 17 commits into from
Jun 11, 2025

Conversation

pchila
Copy link
Member

@pchila pchila commented May 27, 2025

What does this PR do?

Introduce file locking for modifying and reading elastic-agent update marker consistently.
FileLocker is a wrapper around github.com/gofrs/flock that exposes both blocking and non-blocking file locking.
A lock on an update marker lockfile has been introduced at creation, reading and writing: this should ensure that no concurrent read/writes happen between elastic-agent main process and watcher.
This is a prerequisite for manual rollback feature: with such a feature both process will read/write the update marker concurrently.

Why is it important?

With the manual rollback feature that will be introduced by #6889, both elastic-agent main process and watcher need to modify the update marker. Locking before modification should avoid inconsistent writes.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Aside from unit tests that have been introduced under internal/pkg/agent/application/filelock and internal/pkg/agent/application/upgrade/step_mark.go, the only use case here is to perform an upgrade and verify that it succeeds.
If we want to test the file locking we can use the testlocker binary in internal/pkg/agent/application/filelock/testlocker/main.go and lock the update marker lock file in <elastic agent root>/data/.update-marker.lock (holding this lock for too long may generate some failures as there's a timeout of 30s for locking operations.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila self-assigned this May 27, 2025
Copy link
Contributor

mergify bot commented May 27, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 27, 2025
Copy link
Contributor

mergify bot commented May 28, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lock-upgrade-marker upstream/lock-upgrade-marker
git merge upstream/main
git push upstream lock-upgrade-marker

@pchila pchila force-pushed the lock-upgrade-marker branch from 03380a0 to 6dc582d Compare May 30, 2025 10:06
@pchila pchila force-pushed the lock-upgrade-marker branch from 8aa3860 to 1b578ff Compare June 3, 2025 15:33
@pchila
Copy link
Member Author

pchila commented Jun 3, 2025

buildkite test this

@pchila pchila marked this pull request as ready for review June 5, 2025 13:56
@pchila pchila requested a review from a team as a code owner June 5, 2025 13:56
@pchila pchila requested review from michalpristas and swiatekm June 5, 2025 13:56
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ycombinator
ycombinator previously approved these changes Jun 6, 2025
@swiatekm
Copy link
Contributor

swiatekm commented Jun 9, 2025

From what I can see, all the locking when reading or writing to the marker file is non-blocking. What will actually happen if the agent and watcher try to access the file concurrently? One will get an error, and then what?

@pchila
Copy link
Member Author

pchila commented Jun 9, 2025

From what I can see, all the locking when reading or writing to the marker file is non-blocking. What will actually happen if the agent and watcher try to access the file concurrently? One will get an error, and then what?

@swiatekm
We don't use flock's Lock() API as in the documentation we read It's recommended that TryLock() be used over this function..

As an alternative we are using a TryLockContext() with a timeout

https://github.com/elastic/elastic-agent/pull/8254/files#diff-a21d895563344387fd97f5e4b5f5769a266362240b0b6ac420113eabfa1301a6R99

https://github.com/elastic/elastic-agent/pull/8254/files#diff-a21d895563344387fd97f5e4b5f5769a266362240b0b6ac420113eabfa1301a6R59-R62

so the FileLocker will keep trying to acquire a lock periodically for the timeout duration and if it fails it will error out.
Right now such an error is fatal when creating the update marker (but under normal condition there should be no contention at that point).

Reading and writing the update marker right now consists of a single writer (main process or watcher) in every phase.
As #6890 and #6889 get implemented we are gonna have multiple concurrent writers on update marker where the lock needs to be in place to avoid overwriting information. If a writer gets a locking error it will need to reload the update marker data and retry the (partial) modification

swiatekm
swiatekm previously approved these changes Jun 9, 2025
@pchila pchila dismissed stale reviews from swiatekm and ycombinator via 955b09a June 9, 2025 12:08
Copy link

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @pchila

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a question about whether we need the CHANGELOG entry for this change but the implementation itself + tests LGTM!

Copy link
Contributor

@michalpristas michalpristas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not fun of panics but the alternative is not good as well. this will send us to startup spiral loop but at least we know something's wrong right away - should not happen in very most of the cases.
LGTM otherwise.

@pchila pchila merged commit 5b280bd into elastic:main Jun 11, 2025
13 checks passed
v1v added a commit that referenced this pull request Jun 12, 2025
…ts-oblt-cli

* feature/serverless-its-oblt-cli: (51 commits)
  as agreed let's move the group to the kb.integration pipeline
  Update .github/workflows/serverless-project.yml
  ci: invoke serverless integration tests package from BK
  github-actions: create a serverless project daily
  [tests] split up serverless and resource leaks integration tests (#8396)
  chore: Update to elastic/beats@dfdc12e33de0 (#8446)
  Lock upgrade marker (#8254)
  build(deps): bump github.com/elastic/elastic-agent-system-metrics from 0.11.12 to 0.11.13 (#8420)
  Add docker image name template and renamed fips cloud specs (#8429)
  buildkite(scripts): refactor common scripts (#8365)
  Use require.Eventually to try and address flakiness (#8421)
  Fix pre-command to support extended testing (#8418)
  [Automation] Bump Golang version to 1.24.4 (#8384)
  chore: Update to elastic/beats@aed2a8b768bd (#8423)
  [main][Automation] Update versions (#8425)
  Add Pipeline to deploy custom agent image for FIPS testing (#8035)
  ci: automatically update OTel components (#8288)
  [Automation] Bump VM Image version to 1749258065 (#8390)
  fix: increase context timeout to accommodate for slower machines in Test_checkForUnprivilegedVault (#8374)
  otel: add cumulativetodeltaprocessor to EDOT collector (#8372)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lock upgrade marker before writing both in agent and agent watcher
5 participants