MPS slows down after sleep #124056
Comments
This might be an issue with `mps.synchronize()`. I think there is another issue talking about this...
I don't think it's related to `mps.synchronize()`. For example:

```python
import time

import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18()
model.to(device)
model.eval()

for i in range(10):
    x = torch.randn(1, 3, 224, 224, device=device)
    start = time.perf_counter()
    with torch.no_grad():
        y_mps = model(x)
        y_cpu = y_mps.cpu()
    end = time.perf_counter()
    print(end - start)
    # time.sleep(1)  # <--- postprocessing on y_cpu
```
Anyone else have any ideas here? We have been unable to solve this so far.
I agree @hdnhan, this is not related to `mps.synchronize()`. We use commit-and-continue logic to do concurrent processing between the CPU and GPU. If you add a sleep, the memory we had wired in for the command buffer we are processing will get released (I think there is a 1s or a fixed counter at which we unwire Metal resources), and then we will have to map those back in. That slows things down and causes the performance regression. We can confirm this in Instruments (take a Metal System trace, https://developer.apple.com/documentation/xcode/analyzing-the-performance-of-your-metal-app/) and see paging activity when you add the sleep.

The other reason could be that by adding a sleep we moved the P state of the GPU from a higher state of 8 down to, say, 4. This simply means that we need to keep the GPU fed and not add delays in between.
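A minimal sketch of how the effect might be measured, assuming the same `resnet18`-on-`mps` setup as the script above; `torch.mps.synchronize()` is used so each measurement covers only that iteration's GPU work:

```python
import time

import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18().to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

def timed_pass(sleep_between: float) -> list:
    times = []
    for _ in range(10):
        start = time.perf_counter()
        with torch.no_grad():
            model(x)
        torch.mps.synchronize()  # wait for the GPU so the timing is per-iteration
        times.append(time.perf_counter() - start)
        if sleep_between:
            time.sleep(sleep_between)  # idle gap that lets resources unwire / P state drop
    return times

print("no sleep:", timed_pass(0.0))
print("1s sleep:", timed_pass(1.0))
```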
Hi @kulinseth - really appreciate you getting back to us. This is causing us to run inference at 200% of the time it should take, so we are very eager to solve it. Additionally, I'm sure this is something other development teams will bump into soon (and in great numbers), so fixing it is for sure serving a greater good! Thank you. I'll let @hdnhan and @Amusesmile add more comments here, but TL;DR: we are aware that the sleep is the trigger (we figured that part out), but that awareness doesn't fix our problem. What you mention above that we don't know how to do is moving the P state of the GPU to a higher state of 8. Thank you again for the help.
I'll provide a few more details and mention things we've tried. This is in a situation where we're processing incoming audio in a real-time application. We wait for 120ms of collected audio, then send it into our PyTorch MPS model in a separate thread. It should take 50ms to process, then wait ~70ms for the next audio frame. Because of the sleep slowdown issue, it's taking 100ms, so we only have ~20ms of safety margin.

Intuitively we had figured out that the sleep, or the time in between calls, was what was slowing down each pass, as you mention. We'd like to find a way to prevent this memory re-wiring, or to control the prioritization of the GPU. @kulinseth, you mention that there's a 1s or fixed counter after which the resources are unwired and have to be mapped back in. Is there a way to make this slower or gain control over it, or is it managed by the system and out of reach?

For your second item, the P state of the GPU, we have done experiments where we use small PyTorch "filler" models instead of sleep calls to try to keep the GPU active on this thread; however, they don't seem to help. Do you think that if they had the same memory input/output signature (but no processing), they might better keep things primed and fast for the next real processing call? Any additional information you have about what we have control over or what to try would be appreciated. The filler PyTorch model we created was something trivial, like taking an input of 100 values and returning the same. Thanks!
Hi @kulinseth Thanks for your response. I had played around with Instruments for a few days. With the script above, I got some findings.
Hi @hdnhan! I'm taking a look at the issue on our side; I have a couple of findings so far.

However, controlling the power states is something that takes place at the OS level, without apps having access to those tools, so we'll need to work around it here. I think @Amusesmile is on the right track with the idea of a filler model, in the sense that busy waiting while keeping the GPU active should prevent the power state from winding down. However, based on the description, it sounds like that model might get optimized away by MPSGraph, since it should recognize when an op is just an identity and avoid doing work it deems unnecessary. I'll keep digging around on what would be the best way to avoid the power-state slump in this case and update here.
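As an illustration, a hypothetical filler workload that does a small but real amount of arithmetic (so it cannot be folded away as an identity) might look something like this; the buffer sizes and duration are arbitrary guesses and would need tuning:

```python
import time

import torch

device = torch.device("mps")
# Small persistent buffers for the filler work; 256x256 is an arbitrary size.
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)

def keep_gpu_busy(duration_s: float = 0.05) -> None:
    """Issue small matmuls on the MPS device for roughly duration_s seconds."""
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        _ = a @ b  # real work (not an identity), so it cannot be optimized out
    torch.mps.synchronize()  # wait for the queued work before returning
```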
Just to fill in on the earlier answer, here is an inelegant example of busy waiting to prevent the resources from falling to a lower power state: it simply re-runs the inference until the 1s is up, just to demonstrate.

```python
import time

import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18()
model.to(device)
model.eval()

for i in range(10):
    x = torch.randn(1, 3, 224, 224, device=device)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.mps.synchronize()
    end = time.perf_counter()
    print(end - start)
    # Busy-wait: keep re-running the inference until 1s has passed
    t1 = time.perf_counter()
    while time.perf_counter() - t1 < 1.0:
        model(x)
```

The results on my machine:

Running without sleeping:
```
0.045149333000154
0.005514041000424186
0.00550837500122725
0.005526875000214204
0.005485875000886153
0.005400875001214445
0.0024398329987889156
0.0024834159994497895
0.0024525839999114396
0.00240375000066706
```

Running with sleep(1):
```
0.04453287499927683
0.010447082999235136
0.01648600000044098
0.016526749999684398
0.01672737499939103
0.012525374999313499
0.014056874999369029
0.01697608299946296
0.01570920800077147
0.02034816699961084
```

Busy waiting for 1s with the amended script above:
```
0.04514091599958192
0.0025789999999688007
0.00240733299870044
0.002496540999345598
0.0026197499992122175
0.0025347499995405087
0.0026297910007997416
0.0025359999999636784
0.002526082998883794
0.0025174589991365792
```

We can't tell you guys how much we appreciate the help. Super looking forward to a solution here - this will be life changing for our team. Thank you guys SO much for helping. We are on standby to support in any way.

Best,
_BT
Thanks for the information and measurements related to the busy-waiting workaround. I wonder if there's any other method/flag/system setting to reach for, as the timing seems to prevent us from trying this directly. I'll stick to 1 second so we're always on the same scale.

Using those numbers, it's as if our warmed function takes 450ms, our unwarmed function takes 850ms, and we need to run every second (1000ms). If we try to run an identical inference buffer instead of sleeping, it takes 450ms (real) + 450ms (fake) = 900ms, and then we still need to sleep for 100ms (waiting for the new audio buffer), which slows it down again slightly and gets everything out of sync. Basically, our times are so large compared to the sleep timing that we can't run more work in between. We may be able to run smaller inference calls, but they would have a different memory signature, so I'm not sure the GPU would remain warmed (wired?) in that sense.

We'll continue to try different combinations, but I wanted to see if there's any other idea or method to try besides busy-wait. Much appreciated and best regards,
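Spelled out as a back-of-the-envelope check (same numbers, purely illustrative):

```python
# Rough budget on the 1 s scale described above (values in ms, illustrative).
period = 1000          # a new audio buffer arrives every second
warm_inference = 450   # inference when the GPU has stayed busy
cold_inference = 850   # inference after the GPU has idled between calls
filler = 450           # an identical "fake" inference used as busy-work

print(period - cold_inference)             # 150 ms margin without any workaround
print(period - (warm_inference + filler))  # 100 ms left over, and it is still idle time
```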
Hi @Amusesmile! Yeah, I get that it's a bit of a clumsy workaround.

The different memory signature should not be an issue as long as the GPU stays busy. I just happened to use the same network as in the example script since it was already there. However, you are onto something here: it is annoyingly difficult to estimate what is "busy enough", especially since it might be device dependent. I'm still asking around for solutions that have public APIs you could use, but these seem to be a bit hard to come by.

So far the only recommendation has been to investigate whether your application could parallelize recording with processing the previously recorded parts, so the GPU stays active in a more organic way. If I understood correctly (using the 1s time scale again), you kick off an audio recording every second; while waiting for the recording to finish, you process the previous chunk, and then the GPU sits idle until the next recording completes. So in order to avoid the wait time ending up so long, one would need to adjust the recording window to, let's say, 500ms (assuming the GPU will not then have time to power down between the more frequent, smaller workloads).

Would something along those lines be possible? This might help especially if the processing does not scale down quite that linearly, since there's some overhead in kicking off the computation. It would also help avoid the busy wait, which I must add is highly discouraged in most cases, as it might tie up resources the system would like to use for other tasks in the background, possibly leading to a worse user experience if the app makes the OS feel sluggish.

Also, am I understanding correctly that the consistency of processing times might be more important here than having it be the absolute fastest possible? That is, it would be preferable to have the processing take exactly 650ms every iteration instead of risking it varying between 450-850ms depending on the power state? I'll also ask around about that in case it brings up more ideas, since there could be other APIs for guaranteeing consistent performance instead of maximum performance. But in any case, thanks for the patience; unfortunately this is a bit outside my range of expertise on macOS, so I'm having to rely heavily on others for guidance.
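To make the suggestion concrete on the same illustrative 1s scale (the real application numbers would be the 120ms-scale ones):

```python
# Halving the recording window, assuming processing time scales roughly
# linearly with the amount of audio (values in ms, illustrative only).
window = 500                  # audio collected per chunk
processing = 450 / 2          # ~225 ms if 1 s of audio takes ~450 ms when warm
idle_per_window = window - processing

# The GPU now receives work every ~500 ms instead of every ~1000 ms, so each
# idle gap (~275 ms) stays well below the ~1 s horizon after which resources
# appear to unwire and the power state winds down.
print(idle_per_window)  # -> 275.0
```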
Hi, it's a little off the topic, but I am trying to look into a different aspect. I tried In this case, it seems the
Thanks - this is all great info, and yes, the timing represents what we're experiencing, albeit with numbers scaled to 120ms instead of 1s. That's great to know about the busy-work solution being possible with a smaller inference call. I'm most intrigued by your idea of "consistent performance vs. max performance", because yes, if we could get it to be semi-fast always instead of the fastest possible, that would be very helpful and would solve many of the surrounding issues.

About the threading: we do actually use separate threads. We collect new audio in the main DAW audio-processing thread, then send the collected buffer to the second inference thread when it's ready, complete inference, then stitch and fade for playback back in the main thread. This was a bit difficult to engineer in a safe manner, but it was the method that finally let us get it working without locking up Logic or the other DAWs.

Any information you'd be able to find out about the OS flags or other ideas is still very much appreciated. Thanks so much for asking around and thinking about the problem. Cheers!
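A bare-bones sketch of that hand-off, assuming a `queue.Queue` between the audio thread and a dedicated inference thread; the `resnet18` model and `handle_result` are stand-ins, not the actual plugin code:

```python
import queue
import threading

import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18().to(device).eval()
audio_queue = queue.Queue()

def handle_result(y_cpu):
    # Stand-in for the stitch/fade step done back on the audio side.
    pass

def inference_worker():
    while True:
        frame = audio_queue.get()    # blocks until the audio thread hands off a buffer
        if frame is None:            # sentinel: shut the worker down
            break
        with torch.no_grad():
            out = model(frame.to(device))
        handle_result(out.cpu())     # .cpu() synchronizes with the GPU

threading.Thread(target=inference_worker, daemon=True).start()

# The DAW's real-time audio thread only enqueues; it never touches the MPS
# device directly, so it is never blocked waiting on the GPU.
def on_audio_frame(frame_tensor):
    audio_queue.put(frame_tensor)
```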
|
🐛 Describe the bug
Sample output shows each iteration getting noticeably slower once a sleep is added between inference calls. Is it expected? And how can I keep the model running as fast as it does before the sleep? I remember facing the same problem with CUDA on Nvidia Jetson; the solution there is to use sudo jetson_clocks. So is there any similar solution on macOS?

Versions
cc @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen