
runtime/pprof: incorrect mutex contention time #73760


Open
func25 opened this issue May 18, 2025 · 7 comments
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone
Backlog

Comments

@func25
Contributor

func25 commented May 18, 2025

Go version

go1.23.9

Output of go env in your module/workspace:

none

What did you do?

I tried a simple snippet to evaluate the mutex profile:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

var mtx sync.Mutex
var wg sync.WaitGroup

func acquire(n time.Duration) {
	mtx.Lock()
	defer mtx.Unlock()

	time.Sleep(n)
	wg.Done()
}

func main() {
	runtime.SetMutexProfileFraction(1)

	wg.Add(5)
	go acquire(1 * time.Second) // gowrap1
	go acquire(2 * time.Second) // gowrap2
	go acquire(3 * time.Second) // gowrap3
	go acquire(4 * time.Second) // gowrap4
	go acquire(5 * time.Second) // gowrap5

	wg.Wait()

	// Write mutex profile to file
	f, err := os.Create("mutex.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}

What did you see happen?

The order in which the goroutines acquire the lock is gowrap1 → gowrap5 → gowrap2 → gowrap3 → gowrap4. However, the result shows an overshoot in gowrap3's contention time and underestimates gowrap2's:

[Image: mutex profile output showing the contention time attributed to each goroutine]

What did you expect to see?

The contention time for gowrap2 should be 4 seconds, and for gowrap3 it should be 3 seconds. I'm not sure if this behavior is intentional.
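
To spell out that expectation (my reading: each Unlock is charged its hold time multiplied by the number of goroutines still waiting when it releases), the order above would give:

gowrap1: 1s held * 4 waiting = 4s
gowrap5: 5s held * 3 waiting = 15s
gowrap2: 2s held * 2 waiting = 4s
gowrap3: 3s held * 1 waiting = 3s
gowrap4: 4s held * 0 waiting = 0s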

@func25 func25 closed this as completed May 18, 2025
@func25 func25 reopened this May 18, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label May 18, 2025
@adonovan adonovan added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 19, 2025
@dr2chase
Contributor

I'm not at all convinced that this is a bug, and the order in which goroutines run and/or acquire a mutex is not generally specified.

@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 21, 2025
@func25
Contributor Author

func25 commented May 21, 2025

I logged the goroutine name to confirm the order. I've also tried several other acquire orders, and the issue is consistently reproducible.

I'm not at all convinced that this is a bug

Initially I thought it might be a known limitation, but given how simple the case is, I wanted to file a report and get feedback.

@rhysh
Contributor

rhysh commented May 21, 2025

gowrap1 has 4s, or 1s * 4 goroutines
gowrap5 has 15s, or 5s * 3 goroutines
gowrap3 has 6s, or 3s * 2 goroutines
gowrap2 has 3s
gowrap4 has 0s, or 4s * 0 goroutines

Without approximations, I'd expect gowrap2 to take the blame for 2 seconds of delay to 1 goroutine for a total of 2 seconds.

But semrelease1 does an approximation to avoid taking O(N) time to visit every waiting goroutine. That may be to blame here.
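
To illustrate the shape of that approximation with made-up numbers (not from this run): suppose the woken head waiter had been delayed 4s, the tail waiter 1s, and there were 3 waiters queued beyond the head. Rather than reading every waiter's timestamp, the releaser would be charged:

dt0   = 4s                      // head's delay, measured exactly
dtail = 1s                      // tail's delay
dt    = 4s + (4s + 1s)/2 * 3    // = 11.5s; the 3 extra waiters share the head/tail average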

@func25
Contributor Author

func25 commented May 21, 2025

@rhysh Hmm, that's not the order I mentioned. Btw, all the goroutines wait for the same duration before unlocking, so the approximation using the head and tail should be correct.

I guess the gowrap2 case is wrong when it's reset here:

go/src/runtime/sema.go

Lines 428 to 429 in 2a7ca15

t.acquiretime = now
tailtime = s.waittail.acquiretime

@rhysh
Contributor

rhysh commented May 21, 2025

Btw, all the goroutines wait for the same duration before unlocking

It looks like the argument of acquire ends up as a sleep while holding the lock. The 4th goroutine to be created (gowrap4) has an argument of 4*time.Second, so it does mtx.Lock(); time.Sleep(4*time.Second); mtx.Unlock(). I read that as each goroutine waiting for a different duration before unlocking.

It's not the order I mentioned

How did you determine the order? You mentioned logging, but there isn't any in the demo code. I agree with @dr2chase that the demo code doesn't enforce a specific order.

I guess the gowrap2 case is wrong when it's reset here

It's right to be reset — each critical section should only get the blame for the time that it was the lock-holder, so each lock-releaser needs to take full blame for the waiters it delayed via advancing their timers to now. The approximation is that it only updates the head and tail nodes, not all the nodes in between.

It's a theory based on my understanding of semrelease1, without working through in detail which sudog values get updated at which times.
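
A simplified sketch of that accounting as I understand it (not the actual runtime code):

// Simplified sketch, not the actual runtime code: the releaser is charged
// the woken (head) waiter's delay exactly, and each remaining waiter's
// delay is approximated by the average of the head's and the tail's delays.
func approxBlame(now, headAcquiretime, tailAcquiretime, extraWaiters int64) int64 {
	dt0 := now - headAcquiretime // exact delay of the goroutine being woken
	dt := dt0
	if extraWaiters != 0 {
		dtail := now - tailAcquiretime // delay of the tail waiter
		dt += (dtail + dt0) / 2 * extraWaiters
	}
	return dt
}

Updating only the head (and reading the tail) keeps semrelease on the fast path, O(1) instead of O(N) in the number of waiters.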

@func25
Contributor Author

func25 commented May 21, 2025

There's no specific order; I thought you were working from the order I reported, since your numbers are quite similar. But as I mentioned, the issue is reproducible regardless of the order.

You mentioned logging, but there isn't any in the demo code.

Sorry, here's the updated code:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

var mtx sync.Mutex
var wg sync.WaitGroup

func acquire(n time.Duration, name string) {
	mtx.Lock()
	defer mtx.Unlock()

	println("start wait", name)

	time.Sleep(n)
	wg.Done()
}

func main() {
	runtime.SetMutexProfileFraction(1)

	wg.Add(5)
	go acquire(1*time.Second, "gowrap1") // gowrap1
	go acquire(2*time.Second, "gowrap2") // gowrap2
	go acquire(3*time.Second, "gowrap3") // gowrap3
	go acquire(4*time.Second, "gowrap4") // gowrap4
	go acquire(5*time.Second, "gowrap5") // gowrap5

	wg.Wait()

	// Write mutex profile to file
	f, err := os.Create("mutex.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}

I read that as each goroutine waiting for a different duration before unlocking.

Yes, but when an unlock occurs, the wait times of the remaining waiters' head and tail nodes are reset. So their wait times are the same until the next unlock happens.

It's right to be reset

I didn't say it's entirely wrong, but in the case of gowrap2 (in the order I tested) it's misleading. t.acquiretime is reset to now, then tailtime captures s.waittail.acquiretime, but s.waittail == t, so it's captured after the reset.

go/src/runtime/sema.go

Lines 428 to 429 in 2a7ca15

t.acquiretime = now
tailtime = s.waittail.acquiretime

The measured wait time of the tail will then be 0, which underestimates the tail's actual wait time. Here is the formula:

dt0 := t0 - acquiretime    // Head wait time
dt := dt0
dtail := t0 - tailtime     // Tail wait time
dt += (dtail + dt0) / 2 * int64(s.waiters)  // Average wait time * number of waiters

So when gowrap2 releases the mutex:

dt0   = t0 - acquiretime = 2s
dt    = dt0 = 2s
dtail = t0 - tailtime = 0        // underestimated
dt   += (0 + 2s) / 2 * 1         // -> 3s

The dt is 3s, as shown in the diagram.
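
For comparison, if dtail reflected the 2 seconds the tail waiter actually spent waiting while gowrap2 held the lock, the same formula would give:

dt0   = 2s
dtail = 2s
dt    = 2s + (2s + 2s) / 2 * 1 = 4s

which is the 4 seconds I expected for gowrap2.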

@mknyszek mknyszek added this to the Backlog milestone May 28, 2025
@mknyszek mknyszek changed the title from runtime/sema: incorrect mutex contention time to runtime/pprof: incorrect mutex contention time May 28, 2025