etcd_debugging_mvcc_watcher_total may become negative value #19987

Open
4 tasks done
ahrtr opened this issue May 20, 2025 · 9 comments

@ahrtr
Member

ahrtr commented May 20, 2025

Bug report criteria

What happened?

etcd_debugging_mvcc_watcher_total may become a negative value due to a race condition. The issue was fixed in #19600.
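
To make the failure mode concrete, here is a minimal, hypothetical Go sketch. The watcher type, function names, and metric name are invented for illustration and are not etcd's actual mvcc code; it only shows how a race on the cancel path can decrement a gauge like this twice and drive it negative, plus the general shape a guarded fix might take (the real fix in #19600 may differ).

```go
package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// watcherGauge stands in for etcd_debugging_mvcc_watcher_total.
var watcherGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "example_watcher_total",
	Help: "Illustrative watcher gauge, not etcd's real metric.",
})

// watcher is a toy stand-in for the real watcher type.
type watcher struct {
	mu       sync.Mutex
	canceled bool
}

// cancelRacy decrements the gauge unconditionally, so two concurrent
// cancellations of the same watcher decrement it twice.
func (w *watcher) cancelRacy() {
	watcherGauge.Dec()
}

// cancelSafe records the cancellation under a lock and decrements the
// gauge at most once; the actual fix in #19600 may look different.
func (w *watcher) cancelSafe() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.canceled {
		return
	}
	w.canceled = true
	watcherGauge.Dec()
}

func main() {
	w := &watcher{}
	watcherGauge.Inc() // one watcher created

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // e.g. a client cancel racing with stream teardown
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.cancelRacy()
		}()
	}
	wg.Wait()

	// Prints -1 even though only one watcher ever existed.
	fmt.Println(testutil.ToFloat64(watcherGauge))
}
```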

Two problems/followups:

cc @kjgorman @ivanvc @jmhbnz @serathius

What did you expect to happen?

.

How can we reproduce it (as minimally and precisely as possible)?

.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

@ahrtr
Member Author

ahrtr commented May 20, 2025

I think we should clearly document our review policy:

  • Every PR should have at least 2 approvals before merging.
  • For smaller or simple changes (like tests, workflows, etc.), 1 approval is fine.
  • But if the author is a maintainer, they still need approval from another maintainer, reviewer, or submodule owner — even for minor changes.

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

@serathius
Member

The PR was merged before it was properly reviewed. The test needs to be enhanced using a failpoint. Refer to #19600 (comment)

I personally don't agree with it. The change was pretty simple, covered with proper unit tests and reviewed by me and @ivanvc.

Every PR should have at least 2 approvals before merging.

While I agree that for complicated changes it's good and customary to ask for more reviewers, I don't think it's a good default. I would prefer that we start moving in the K8s direction of having directory owners who are trusted to approve a PR or delegate to the best person to approve it.

@ahrtr
Member Author

ahrtr commented May 20, 2025

I personally don't agree with it. The change was pretty simple, covered with proper unit tests and reviewed by me and @ivanvc.

@ivanvc did not approve the PR. Any production change to the core parts (i.e. mvcc, the apply workflow, etcdserver, etc.) should have at least two approvals. A rule is a rule (which, of course, is itself subject to review & approval).

Also, I see that some PRs were merged without any reviews.

Note we have a blameless culture. It's not about blaming anyone; it's just for safety & quality, to prevent any potential risk in the future.

@siyuanfoundation
Contributor

While I agree that, in general, more reviews would make etcd safer and improve code quality, and that we should do better, we also need to take into consideration that we only have a small pool of reviewers across different time zones. Having hard requirements could greatly hinder dev speed, especially for the ongoing antithesis project. I think it is acceptable to trust the approver's judgement calls while we all lean toward the more cautious side.

In addition, if etcd is going to own submodules in kubernetes, it would be more consistent to adopt the review process from kubernetes.

@ahrtr
Member Author

ahrtr commented May 20, 2025

Having hard requirements could greatly hinder dev speed, especially for the ongoing antithesis project

It isn't just antithesis; it's also the robustness tests. When a robustness test raises an issue, we need to analyse it as well, even though not everyone works on it. So the quality of the robustness tests matters.

In addition, if etcd is going to own submodules in kubernetes

See my #19988; the reviewers include the submodule owners.

Having hard requirements could greatly hinder dev speed

Quality & stability outweigh dev speed. If you look at the PRs listed above, they were all raised & merged within a few hours.

I think it is acceptable to trust the approver's judgement calls

I think we should rely on process/policy to guarantee quality instead of any individual's judgement. Humans make mistakes.

@ahrtr
Member Author

ahrtr commented May 20, 2025

Having hard requirements

If you read #19988, they aren't hard requirements. There is some flexibility.

@serathius
Member

Also, I see that some PRs were merged without any reviews.

Note we have a blameless culture. It's not about blaming anyone; it's just for safety & quality, to prevent any potential risk in the future.

That doesn't really seem blameless; a good practice is to ensure diverse examples. However, I don't think there is a need to scour PR history to make a point. I can take the fault because I merged those PRs. I did that after I broke the robustness tests on main, after I missed that the tests were not run on the PR. It's easy to miss that the tests didn't run, because they don't show up until ok-to-test is applied and GitHub collapses the list of tests. I should have reverted the faulty PR and just waited for review. Unfortunately, instead of that I repeatedly tried to fix the error across multiple attempts. I just didn't want to revert a PR from a new contributor, and later I desperately tried to keep the main branch from being broken.

I think the proper solution is not a policy, but just taking back the repo write access from maintainers. It's just too easy to make a mistake, and too tempting to try a dirty fix.

I think we should rely on process/policy to guarantee quality instead of any individual's judgement. Humans make mistakes.

I agree with the second part, but I draw a different conclusion. People make mistakes, but we should trust them not to be malicious and to take accountability. Sorry if I broke your trust, but the goal was not to abuse the system.

@kjgorman
Contributor

👋 Hello team, sorry for any confusion my PR may have caused here. With respect to the failpoint test and cherry-picking to backport, I'm happy to help here, bearing in mind I'm not at all familiar with your processes, so I might need some pointers.

For the failpoint, for example: is this what you had in mind, #19997? It does make the test fail deterministically locally without the patch (after gofail enable etc.), but I notice we don't seem to use failpoints in unit tests at the moment (e.g., I couldn't find a reference to them in ./scripts/test.sh for running unit tests), so it seems like this won't run in CI / on PRs. Is that a correct understanding?

@ahrtr
Member Author

ahrtr commented May 21, 2025

That doesn't really seem blameless; a good practice is to ensure diverse examples. However, I don't think there is a need to scour PR history to make a point.

This is to clarify why I added the policy: "If the author is a maintainer, they should still get approval from another maintainer, reviewer, or submodule owner, even for minor changes." Sorry if it made you feel uncomfortable; that was not my intention.

I think the proper solution is not a policy, but just taking back the repo write access from maintainers

Agreed. It's also exactly why I raised the comment in kubernetes/org#5523 (comment) in the first place.

cc @ivanvc @jmhbnz

but we should trust them not to be malicious and to take accountability. Sorry if I broke your trust, but the goal was not to abuse the system.

I think we trust all members, or at least all maintainers and reviewers.

With respect to the failpoint test and cherry-picking to backport, I'm happy to help here

thx

but I notice we don't seem to use failpoints in unit tests at the moment

Yes, failpoints aren't enabled by default for unit tests. You can add an e2e test instead. Please refer to the example in #19608.
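
For anyone following along who hasn't used gofail, below is a rough sketch of how a failpoint is typically declared and then activated from an e2e-style test. The failpoint name (beforeDecrementWatcherGauge), the binary path, and the readiness handling are assumptions made up for this illustration, not taken from #19997 or #19608; the sketch assumes the binary was built with gofail enable and that gofail's GOFAIL_HTTP endpoint accepts a PUT of the term to the failpoint path, as described in etcd-io/gofail.

```go
// In the production code path, a gofail failpoint is declared as a special
// comment; `gofail enable <packages>` rewrites it into real hook code:
//
//	// gofail: var beforeDecrementWatcherGauge struct{}
//
// The name above is made up for this sketch.

package e2e_test

import (
	"bytes"
	"net/http"
	"os"
	"os/exec"
	"testing"
	"time"
)

// TestWatcherGaugeNeverNegative assumes ./bin/etcd was built with
// `gofail enable` so that failpoints are compiled in.
func TestWatcherGaugeNeverNegative(t *testing.T) {
	cmd := exec.Command("./bin/etcd") // hypothetical binary path
	// Expose gofail's HTTP endpoint so the test can toggle failpoints.
	cmd.Env = append(os.Environ(), "GOFAIL_HTTP=127.0.0.1:22381")
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}
	defer cmd.Process.Kill()
	time.Sleep(time.Second) // a real test would wait for readiness instead

	// Activate the failpoint: sleep 100ms at the injected point to widen
	// the race window that a plain unit test cannot hit reliably.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/beforeDecrementWatcherGauge",
		bytes.NewBufferString(`sleep(100)`))
	if err != nil {
		t.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		t.Fatal(err)
	}
	resp.Body.Close()

	// ...create and cancel watches concurrently, then scrape /metrics and
	// assert etcd_debugging_mvcc_watcher_total never drops below zero.
}
```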
