etcd_debugging_mvcc_watcher_total may become negative value #19987

Open
4 tasks done
ahrtr opened this issue May 20, 2025 · 9 comments

@ahrtr
Member

ahrtr commented May 20, 2025

Bug report criteria

What happened?

etcd_debugging_mvcc_watcher_total may become a negative value due to a race condition. The issue was fixed in #19600.
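
To make the failure mode concrete, here is a minimal, hypothetical Go sketch. The watcher type, function names, and metric name are invented for illustration and are not etcd's actual mvcc code; it only shows how a race on the cancel path can decrement a gauge like this twice and drive it negative, plus the general shape a guarded fix might take (the real fix in #19600 may differ).

```go
package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// watcherGauge stands in for etcd_debugging_mvcc_watcher_total.
var watcherGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "example_watcher_total",
	Help: "Illustrative watcher gauge, not etcd's real metric.",
})

// watcher is a toy stand-in for the real watcher type.
type watcher struct {
	mu       sync.Mutex
	canceled bool
}

// cancelRacy decrements the gauge unconditionally, so two concurrent
// cancellations of the same watcher decrement it twice.
func (w *watcher) cancelRacy() {
	watcherGauge.Dec()
}

// cancelSafe records the cancellation under a lock and decrements the
// gauge at most once; the actual fix in #19600 may look different.
func (w *watcher) cancelSafe() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.canceled {
		return
	}
	w.canceled = true
	watcherGauge.Dec()
}

func main() {
	w := &watcher{}
	watcherGauge.Inc() // one watcher created

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // e.g. a client cancel racing with stream teardown
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.cancelRacy()
		}()
	}
	wg.Wait()

	// Prints -1 even though only one watcher ever existed.
	fmt.Println(testutil.ToFloat64(watcherGauge))
}
```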

Two problems/followups:

cc @kjgorman @ivanvc @jmhbnz @serathius

What did you expect to happen?

.

How can we reproduce it (as minimally and precisely as possible)?

.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

@ahrtr
Member Author

ahrtr commented May 20, 2025

I think we should clearly document our review policy:

  • Every PR should have at least 2 approvals before merging.
  • For smaller or simple changes (like tests, workflows, etc.), 1 approval is fine.
  • But if the author is a maintainer, they still need approval from another maintainer, reviewer, or submodule owner — even for minor changes.

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

@serathius
Member

The PR was merged before it was properly reviewed. The test needs to be enhanced using a failpoint. Refer to #19600 (comment)

I personally don't agree with it. The change was pretty simple, covered with proper unit tests and reviewed by me and @ivanvc.

Every PR should have at least 2 approvals before merging.

While I agree that for complicated changes it's good and customary to ask for more reviewers, I don't think it's a good default. I would prefer that we start moving in the K8s direction of having directory owners who are trusted to approve a PR or delegate to the best person to approve it.

@ahrtr
Member Author

ahrtr commented May 20, 2025

I personally don't agree with it. The change was pretty simple, covered with proper unit tests and reviewed by me and @ivanvc.

@ivanvc did not approve the PR. Any production change to the core parts (i.e. mvcc, the apply workflow, etcdserver, etc.) should have at least two approvals. A rule is a rule (which, of course, is itself subject to review & approval).

Also, I see that some PRs were merged without any reviews.

Note we have a blameless culture. It's not about blaming anyone; it's just for safety & quality, to prevent any potential risk in the future.

@siyuanfoundation
Contributor

While I agree that, in general, more reviews would make etcd safer and improve code quality, and that we should do better, we also need to take into consideration that we only have a small pool of reviewers across different time zones. Having hard requirements could greatly hinder dev speed, especially for the ongoing antithesis project. I think it is acceptable to trust the approver's judgement calls while we all lean toward the more cautious side.

In addition, if etcd is going to own submodules in kubernetes, it would be more consistent to adopt the review process from kubernetes.

@ahrtr
Member Author

ahrtr commented May 20, 2025

Having hard requirements could greatly hinder dev speed, especially for the ongoing antithesis project

It isn't just antithesis; it's also the robustness tests. When a robustness test raises an issue, we need to analyse it as well, even though not everyone works on it. So the quality of the robustness tests matters.

In addition, if etcd is going to own submodules in kubernetes

See my #19988; the reviewers include the submodule owners.

Having hard requirements could greatly hinder dev speed

Quality & stability outweigh dev speed. If you look at the PRs listed above, they were all raised & merged within a few hours.

I think it is acceptable to trust the approver's judgement calls

I think we should rely on process/policy to guarantee quality instead of any individual's judgement. Humans make mistakes.

@ahrtr
Member Author

ahrtr commented May 20, 2025

Having hard requirements

If you read #19988, they aren't hard requirements. There is some flexibility.

@serathius
Member

Also, I see that some PRs were merged without any reviews.

Note we have a blameless culture. It's not about blaming anyone; it's just for safety & quality, to prevent any potential risk in the future.

That doesn't really seem blameless; a good practice is to ensure diverse examples. However, I don't think there is a need to scour PR history to make a point. I can take the fault because I merged those PRs. I did that after I broke the robustness tests on main, after I missed that the tests were not run on the PR. It's easy to miss that the tests didn't run, because they don't show up until ok-to-test is applied and GitHub collapses the list of tests. I should have reverted the faulty PR and just waited for review. Unfortunately, instead of that I repeatedly tried to fix the error across multiple attempts. I just didn't want to revert a PR from a new contributor, and later I desperately tried to keep the main branch from being broken.

I think the proper solution is not a policy, but just taking back the repo write access from maintainers. It's just too easy to make a mistake, and too tempting to try a dirty fix.

I think we should rely on process/policy to guarantee quality instead of any individual's judgement. Humans make mistakes.

I agree with the second part, but I draw a different conclusion. People make mistakes, but we should trust them not to be malicious and to take accountability. Sorry if I broke your trust, but the goal was not to abuse the system.

@kjgorman
Contributor

👋 Hello team, sorry for any confusion my PR may have caused here. With respect to the failpoint test and cherry-picking to backport, I'm happy to help here, bearing in mind I'm not at all familiar with your processes, so I might need some pointers.

For the failpoint, for example: is this what you had in mind, #19997? It does make the test fail deterministically locally without the patch (after gofail enable etc.), but I notice we don't seem to use failpoints in unit tests at the moment (e.g., I couldn't find a reference to them in ./scripts/test.sh for running unit tests), so it seems like this won't run in CI / on PRs. Is that a correct understanding?

@ahrtr
Member Author

ahrtr commented May 21, 2025

That doesn't really seem blameless; a good practice is to ensure diverse examples. However, I don't think there is a need to scour PR history to make a point.

This is to clarify why I added the policy: "If the author is a maintainer, they should still get approval from another maintainer, reviewer, or submodule owner, even for minor changes." Sorry if it made you feel uncomfortable; that was not my intention.

I think the proper solution is not a policy, but just taking back the repo write access from maintainers

Agreed. It's also exactly why I raised the comment in kubernetes/org#5523 (comment) in the first place.

cc @ivanvc @jmhbnz

but we should trust them not to be malicious and to take accountability. Sorry if I broke your trust, but the goal was not to abuse the system.

I think we trust all members, or at least all maintainers and reviewers.

With respect to the failpoint test and cherry-picking to backport, I'm happy to help here

thx

but I notice we don't seem to use failpoints in unit tests at the moment

Yes, failpoints aren't enabled by default for unit tests. You can add an e2e test instead. Please refer to the example in #19608.
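
For anyone following along who hasn't used gofail, below is a rough sketch of how a failpoint is typically declared and then activated from an e2e-style test. The failpoint name (beforeDecrementWatcherGauge), the binary path, and the readiness handling are assumptions made up for this illustration, not taken from #19997 or #19608; the sketch assumes the binary was built with gofail enable and that gofail's GOFAIL_HTTP endpoint accepts a PUT of the term to the failpoint path, as described in etcd-io/gofail.

```go
// In the production code path, a gofail failpoint is declared as a special
// comment; `gofail enable <packages>` rewrites it into real hook code:
//
//	// gofail: var beforeDecrementWatcherGauge struct{}
//
// The name above is made up for this sketch.

package e2e_test

import (
	"bytes"
	"net/http"
	"os"
	"os/exec"
	"testing"
	"time"
)

// TestWatcherGaugeNeverNegative assumes ./bin/etcd was built with
// `gofail enable` so that failpoints are compiled in.
func TestWatcherGaugeNeverNegative(t *testing.T) {
	cmd := exec.Command("./bin/etcd") // hypothetical binary path
	// Expose gofail's HTTP endpoint so the test can toggle failpoints.
	cmd.Env = append(os.Environ(), "GOFAIL_HTTP=127.0.0.1:22381")
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}
	defer cmd.Process.Kill()
	time.Sleep(time.Second) // a real test would wait for readiness instead

	// Activate the failpoint: sleep 100ms at the injected point to widen
	// the race window that a plain unit test cannot hit reliably.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/beforeDecrementWatcherGauge",
		bytes.NewBufferString(`sleep(100)`))
	if err != nil {
		t.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		t.Fatal(err)
	}
	resp.Body.Close()

	// ...create and cancel watches concurrently, then scrape /metrics and
	// assert etcd_debugging_mvcc_watcher_total never drops below zero.
}
```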
