Skip to content

Replace checkAUTHORS.pl with updateAUTHORS.pl/updateAUTHORS.pm, allow users to be excluded #20070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Aug 21, 2022

Conversation

demerphq
Copy link
Collaborator

@demerphq demerphq commented Aug 8, 2022

In #20010 a user has asked that we do not list their name or email any of our latest files in the repo.

This was a bit problematic as updateAUTHORS.pl would just add the user back if we did so. I added functionality to updateAUTHORS.pl to support us using hashed contributor details as a way to prevent users from being added by our automation. Unfortunately this then meant that checkAUTHORS.pl and updateAUTHORS.pl had different idea about what should be in the files. There are other issues, the checkAUTHORS.pl file contains a lot of data in a hard to reuse or manage form, and in plaintext, (although slightly munged so its not a direct scraper target). In the end it made a lot more sense to me to simply drop checkAUTHORS.pl and replace it with updateAUTHORS.pl for all use cases.

This patch sequence adds support for a bunch of features that checkAUTHORS.pl had that updateAUTHOR.pl did not, almost uniformly related to reporting so that the tool is now nearly feature equivalent with checkAUTHORS.pl.

There are a few differences:

  1. The old tool supported saving the commit history to a file and then piping into the checkAUTHORS.pl, this mode is no longer supported, although if we need it we can bring it back, but with a certain amount of bodge. The reason for the bodge is that updateAUTHORS.pm expects its log data in a very particular format that is not supported out of the box by git log. You have to construct a long and awkward format string to provide for log to use, so it is not quite so simple to pipe the results to a file. If we really need it we can figure out a solution. An upside of this is that the tool doesn't expect filenames as arguments, instead it expects a get ref spec just like git log would do so it shouldnt really be needed to pipe the results.
  2. The old tool supported querying github. updateAUTHORS.pl doesnt support this as far as I remember. I dont think we were using it anyway. We can teach updateAUTHORS.pl to do the same thing if someone remembers why were doing this.
  3. The old tools supported a couple of options that IMO no longer make sense. Specifically the "--missing" option is redundant. --validate/--tap does the same thing.
  4. The old tool had some (out of date) information on committers which I have abandonded. We currently do not track this. I was thinking if we want to track it we should add a * to peoples AUTHOR data and then add a footnote explaining they are committers. Then we can use the AUTHORS file as the primary source of this data.

This PR adds two new files

  1. Porting/updateAUTHORS.pm which is the brains of the updateAUTHORS.pl tool separated from the command line part of the two so that the POD for using the tool could be separated from the pod of how the tools works.
  2. Porting/exclude_contrib.txt. This contains a list of SHA-256 in base64 of the email details of contributors (commit author/email) that we should not include in .mailmap or AUTHORS.

This PR removes two files:

  1. Porting/checkAUTHORS.pl - replaced by Porting/updateAUTHORS.pl
  2. t/porting/pending-author.t - I couldnt figure out what this was for, and it doesnt seem to be required with the updateAUTHORS.pl workflow. If someone can explain what it was used for better than the test file itself then I would have no problem adding it back.

This PR removes the personal details as requested in #20010

The latest version of updateAUTHORS.pl has support for additional reports, and I expect as I have time I will add more. IMO it is much easier to do this with updateAUTHORS.pl than it was with checkAUTHORS.pl (because of the reliance on the git standard log formats). I plan to add a "who-works-on-what" to show file data by author. Eg, it might show that @khwilliamson, myself and @hvds do most of the changes to the regex engine.

I have marked this PR as draft for now, as i expect to make changes as I get feedback, but the patch it pretty stable, review appreciated!

@Leont
Copy link
Contributor

Leont commented Aug 8, 2022

t/porting/pending-author.t - I couldnt figure out what this was for, and it doesnt seem to be required with the updateAUTHORS.pl workflow. If someone can explain what it was used for better than the test file itself then I would have no problem adding it back.

If there are pending changes, it checks if the current author is listed as author. This is to prevent the test-suite from succeeding before making one's first commit but failing after it.

@demerphq
Copy link
Collaborator Author

demerphq commented Aug 9, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Aug 9, 2022 via email

@demerphq demerphq force-pushed the yves/exclude_contrib branch from d27aa44 to e25eb4c Compare August 9, 2022 10:10
@Leont
Copy link
Contributor

Leont commented Aug 9, 2022

What is a pending change?

Uncommited changes

@demerphq
Copy link
Collaborator Author

demerphq commented Aug 9, 2022

@Leont - Ok, when I first read pending-author i think i misunderstood what it does and got myself in a muddle, the term "pending" really threw me. I should have just ignored that and looked at the code more closely.

So it is looking to see if there are uncommitted changes that if they were committed with the current users details would result in a test failure. I dont see why we need another test file for this, we can just put it into the authors.t testing. I will add that.

@demerphq demerphq force-pushed the yves/exclude_contrib branch from 1d8e24a to 60cfdac Compare August 9, 2022 16:22
@demerphq
Copy link
Collaborator Author

demerphq commented Aug 9, 2022

Ok, authors.t now does the equivalent of what pending authors used to do, arguably better as it now checks (nearly) all the things that git would. I also added copious verbiage if the tests fail.

@jkeenan jkeenan added the Infrastructure Things needed to maintain Perl development label Aug 10, 2022
@bram-perl
Copy link

Some early feedback without really looking at the code yet:

  • commit Porting/updateAUTHORS.pl - add support for excluding contributors:
    • based on the commit description I got a feeling that the commit tries to do too much at once (i.e. consider splitting the commit?)
    • Porting/checkAUTHORS.pl which does not do the right thing these days anyway.: Is it possible/sensible to include a small description of how/when it doesn't do the right thing?
  • commit remove redundant data from .mailmap:
    • can you describe in the commit message what you consider redundant data?
    • and the reason why this matters? [I'm not disputing that there - probably - is a good reason for it; but I like seeing it in the commit message]

@demerphq
Copy link
Collaborator Author

demerphq commented Aug 11, 2022 via email

Copy link

@bram-perl bram-perl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I reviewed this commit per commit and not by final changes..
GitHub UI doesn't really make clear on which commit the comment was added tho :(

Some of the comments apply to code that was later changed, so it might be a bit confusing when looking at the current code..

@demerphq demerphq force-pushed the yves/exclude_contrib branch from db4f96f to 3f1d192 Compare August 12, 2022 11:13
@demerphq
Copy link
Collaborator Author

BTW, I am still not entirely satisfied with a couple of aspects of this PR, it might be wise to wait until I say so before anyone wastes their time reviewing code I am going to kill.

@demerphq demerphq force-pushed the yves/exclude_contrib branch from 3f1d192 to 2627f5b Compare August 13, 2022 13:22
@demerphq
Copy link
Collaborator Author

Hi @bram-perl I have update the code and include I think all of your feedback. With the amount that I have rebased, squashed, and otherwise munged the sequence I am going to mark all your previous feedback as resolved (as I hope it is) and start fresh. The patch sequence should be much cleaner now with much of the interim stuff squashed out. Please let me know what you find that I overlooked. I have been staring at it so long I have become desensitized. :-)

@demerphq demerphq force-pushed the yves/exclude_contrib branch from 2627f5b to 046763a Compare August 13, 2022 13:35
@demerphq
Copy link
Collaborator Author

That last push just removes a function made obsolete by the checkAUTHORS removal.

Copy link

@bram-perl bram-perl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done a commit-per-commit review.

(I may have gotten carried away a bit during reviewing - hope you don't mind :) )

@demerphq
Copy link
Collaborator Author

Hi @bram-perl I have incorporated most of your feedback, either literally as you suggested, or in some case in spirit but i went a different way.

The only points I did not go along with were the following two:

  1. extensive docs in updateAUTHORS.pm. First, updateAUTHORS.pm is the "brain" of "updateAUTHORS.pl", and many things are documented in the .pl script that apply to the .pm file. So I added a paragraph saying that people should be familiar with both if they want to hack on this, but to put it simply I don't think the docs for the internals of a Porting/ script need to be at CPAN quality levels, and definitely that they aren't shouldn't be a blocker to move this ball forward now. This code isn't intended for reuse (outside of the Porting directory anyway) and its already better documented and structured than what it replaces, likely I will be one of the few who work on it, and frankly I'd like to put it to bed for now for a while. Patches welcome of course if you think otherwise, and I might even do a follow up PR with more docs when I have more energy, but right now I am growing exhausted with this PR and I would like to get it merged and rest up on it a bit before I do anything else. Doc patches are sometimes best written by the person who didnt write the code anyway, as the have a better eye for what is obvious or not.
  2. The control logic for authors.t, I think we should try to avoid commit range scanning in our testing as I believe it is fragile and likely to break. I am open to the idea that we need to do something special for TRAVIS but i would like to see what breaks before I say what it is. My feeling is right now are doing it wrong, and we should consider an alternative strategy, but i need data about how the test fails without the TRAVIS special case.

So if you don't mind I will let you push a follow PR for point 1. For point 2, I am open to doing something, but I would really like to know how this fails when that travis logic fires, which is simplest done by merging, waiting for it to fail (if it does) and then looking at the failure and then dealing with it. If folks are ok with that I would like to proceed that way.

@bram-perl
Copy link

  1. The docs: ok,you did document some of it so I wasn't sure how extensive you wanted the docs so I just added a comment when I spotted something. Not wanting to write/have extensive docs for it is acceptable (at least to me it is).

  2. As suggested earlier: I would just take it out of this PR and submit it as a separate PR. (I think the error in the revision logic is already know, and that PR fix t/porting/authors.t detection of merge commits #20113 is the fix for it but the reasons why it's needed - if it's needed at all - should be better documented)

I'll try todo a final review tonight on this PR;

@demerphq demerphq force-pushed the yves/exclude_contrib branch from e279804 to 7203f62 Compare August 18, 2022 17:32
@demerphq
Copy link
Collaborator Author

Re: POD docs for the module. I would /like/, long run, to have complete docs for this, but I don't see those docs as a necessary precondition to getting the PR merged. Rather I would like to get this merged and then incrementally improve it over time, ideally with the support from others who care about these things. IMO after the PR (very much with your help) updateAUTHORS.pl will be a better tool than we had with either itself or checkAUTHORS.pl before this PR. We can incrementally make it even better as we like once its merged.

A. Sinan Unur was listed twice in AUTHORS, under Sinan Unur
as well. This removes the dupe.

This is in prep for various changes to the updateAUTHORS.pl process.
they are not pod, but rather editor droppings.
…lved_debug"

The existing name is confusing to me, especially when the workflow gets
skipped so change the name to something more intuitive.

We also change it to respect .mailmap by using %aN and %aE instead of
%an and %ae.

We also change it to use ".." and "..." in the git log statement, and
use `sort | uniq -c` instead of `sort -u`, it is helpful to see counts.
We also show the resolved begin and end commits.
We can at least make sure it compiles... Later we can build on this, for
now this sanity checks that we haven't *completely* broken the script
while we refactor it.
As we extend the module logic mixing the docs for the module and the
command line tool becomes problematic, and it is pain passing around the
state via distinct variables. This is first step to moving to a simpler
model. We don't make much use of the OO in this patch at all, that will
come in a follow up patch.
It is much cleaner when the object stores its state internally. That
way we don't have to adjust the transitive caller graph when we need
something new.

Also normalize some var names, we were using "author_info" and
"authors_info" in a way that was confusing, now it is consistently
"author_info".
This makes it easier and simpler to add new options and
extend the script in the future.
Makes it easy to add diagnostics while developing.
If we want to just check if the files are up to date we dont want
to have to write to the disk. We will use this in a future patch.
checkAUTHORS.pl supports specifying a commit range, and we actually
need it even more as we do not support piping from git log (as our
git log format is ungainly) so this allows us to be feature compatible.
Exactly what "update_authors" does is a little less clear than what
"update_authors_file" does.
This add a new file Porting/exclude_contrib.txt to hold a list
of base64 SHA-256 digests of the user name and email who should be
excluded.

This adds the options

    --exclude-me
    --exclude-contrib=NAME_AND_EMAIL
    --exclude-missing

to add exclusions in different ways. --exclude-me uses the git
credentials for the current user, --exclude-contrib expects a name
and email, and --exclude-missing excludes someone who is missing from
the change log.

When excluding someone their name will be removed from AUTHORS if it is
already listed, and the relevant entries to .mailmap will be removed,
and if necessary additional exclusion entries will be added for any of
their old identities which were mentioned in .mailmap.

This feature will be used in a later patch to resolve GH issue #20010.
Adds the --stats, --files, --who, and related options similar to
what checkAUTHORS.pl offers. See perldoc Porting/updateAUTHORS.pl for
the list of options it supports.
Binmode the pipe, use tabs not nulls, and on windows use double quotes
not single quotes for arguments to subprocess scripts.

Since we weren't using Porting/updateAUTHORS.pm as part of our test
pipeline we didn't know it didnt work on Windows builds.
Use consistent hash key and list sort orders.
This is helpful when adding a new option to updateAUTHORS.pl. It not
really meant for users, more for someone working on the tool itself, and
thus is not documented.
The blurb mentions checkAUTHORS.pl, and we should point people at
updateAUTHORS.pl instead.
This emulates the test feature in checkAUTHORS.pl and in
t/porting/pending-author.t which overlaps with the tests that
checkAUTHORS.pl does on behalf of t/porting/authors.t

We test that the .mailmap and AUTHORS files are up to date. If the
working directory is not clean (eg git status --porcelain returns a non
empty list of files) then we also validate the git credentials in use
to make sure that if the changes were applied we would also still pass
the test.

The script implements the same logic that `git commit` uses to decide
what identity a commit would use with the exception of the last resort
measures git undertakes to use the hostname and similar files.
Improved POD verbiage. Explain how to exclude people in a more friendly
way. Various other pod cleanups and improvements.
A number of devs have noted and raised concerns that having .mailmap
entries where the LHS and RHS of the entry is redundant. A few have also
expressed the view that this exposes unnecessarily exposes their email
address in an easily harvestible form.

On the technical level as far as git specifically is concerned it is
true this data is redundant, as git uses .mailmap to transform user data
that match the RHS of an entry into the value on LHS, and when they are
the same obviously it is a no-op. However on the technical level for our
infra these entries are not redundant. We can use them to identify and
correctly respond to many cases of manual update of the AUTHORS file,
for instance changing ones preferred name. With the .mailmap entries we
have the data to identify the old preferred name, and join it together
with the unchanged email for the user and then automatically update
their .mailmap entries. This is why these entries were created
originally.

However, I believe that this functionality is not useful enough to
require us to have discussions with contributors on this subject on a
regular basis. We can add command line options that allow people to
change the AUTHORS file and the .mailmap file properly, so we can drop
the "redundant" data and avoid the need to talk about why it is there.

The required functionality for changing names will come in a follow up
patch in this PR.
With the removal of the redundant data from .mailmap we lacked a way to
rename an author properly. Previously someone could manually update a
name or email in AUTHORS files and things would "just work" as the
.mailmap file would tell us the old mapping. Now we have removed those
mappings we need a way to easily rename someones canonical data. This
also means manually changing the AUTHORS file is now a problem.
updateAUTHORS.pl now can replace all the capabilities of
checkAUTHORS.pl, and checkAUTHORS.pl has a slightly different idea of
what needs to be validated, and is not aware of the
Porting/exclude_contrib.txt. Instead of teaching it to be aware and
maintaining two tools this drops checkAUTHORS.pl

In the process we move the logic for t/porting/pending-author.t into
t/porting/authors.t, by letting updateAUTHORS.pl do it as well. This is
helpful because updateAUTHORS.pl contains similar core git logic so that
it can implement some of its options.
Contributor has requested they be excluded from our contributor
databases.
@demerphq demerphq force-pushed the yves/exclude_contrib branch from 08c5367 to 9a25b33 Compare August 21, 2022 08:09
@demerphq
Copy link
Collaborator Author

rebased to deal with the conflicts in .mailmap

And remove the duplicate for "vividsnow" that this found.
@demerphq demerphq force-pushed the yves/exclude_contrib branch from 9a25b33 to 17f95b8 Compare August 21, 2022 08:39
@demerphq demerphq merged commit 015aea7 into blead Aug 21, 2022
@demerphq demerphq deleted the yves/exclude_contrib branch August 21, 2022 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hasConflicts Infrastructure Things needed to maintain Perl development
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants