
MPS slows down after sleep #124056


Open
hdnhan opened this issue Apr 15, 2024 · 14 comments

Comments

@hdnhan

hdnhan commented Apr 15, 2024

🐛 Describe the bug

import time
import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18()
model.to(device)
model.eval()

for i in range(10):
    x = torch.randn(1, 3, 224, 224, device=device)

    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.mps.synchronize()
    end = time.perf_counter()

    print(end - start)
    # time.sleep(1) # <--- Cooldown time

Sample output:

  • With the sleep commented out:
0.013851750001776963
0.011806082969997078
0.009203707973938435
0.006015666993334889
0.0059204999706707895
0.00472587498370558
0.004517166991718113
0.0045674999710172415
0.004554332990664989
0.004492584033869207
  • With the sleep uncommented:
0.013814875041134655
0.0132357919937931
0.013537375023588538
0.016312292020302266
0.014802166959270835
0.01544141594786197
0.01633595797466114
0.015492709004320204
0.01755624997895211
0.015840500011108816

Is this expected? And how can I keep the model running as fast as it does without the sleep? I remember facing the same problem with CUDA on an Nvidia Jetson, where the solution was to run sudo jetson_clocks. Is there a similar solution on macOS?

Versions

OS: macOS 14.4.1 (arm64)
CPU: Apple M2

Python version: 3.10.14 (main, Mar 21 2024, 11:21:31) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit

[pip3] torch==2.4.0.dev20240331
[pip3] torchaudio==2.2.0.dev20240331
[pip3] torchvision==0.19.0.dev20240331

cc @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen

@tringwald tringwald added module: performance Issues related to performance, either of kernel code or framework glue module: mps Related to Apple Metal Performance Shaders framework labels Apr 15, 2024
@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 15, 2024
@albanD
Collaborator

albanD commented Apr 15, 2024

This might be an issue with the mps.synchronize(). I think there is another issue talking about this...

@hdnhan
Author

hdnhan commented Apr 16, 2024

This might be an issue with the mps.synchronize(). I think there is another issue talking about this...

I don't think mps.synchronize() is the problem (though it might be). The example above just simplifies the flow of my actual project; basically, mps.synchronize() and the sleep can be replaced by any post-processing.

For example:

import time
import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18()
model.to(device)
model.eval()

for i in range(10):
    x = torch.randn(1, 3, 224, 224, device=device)

    start = time.perf_counter()
    with torch.no_grad():
        y_mps = model(x)
    y_cpu = y_mps.cpu()
    end = time.perf_counter()

    print(end - start)
    # time.sleep(1) # <--- postprocessing on y_cpu

@SonikArchitects

Anyone else have any ideas here? We have been unable to solve this so far.

@kulinseth
Collaborator

I agree @hdnhan, this is not related to mps.synchronize(). I can think of a few reasons why this can happen.

We use commit-and-continue logic to overlap processing between the CPU and GPU. If you add a sleep, the memory we had wired in for the command buffer being processed gets released (I think there is a 1s, or some fixed, counter after which we unwire Metal resources), and then we have to map it back in. That slows things down and causes the performance regression. We can confirm this in Instruments (take a Metal System Trace, https://developer.apple.com/documentation/xcode/analyzing-the-performance-of-your-metal-app/) and look for Paging activity when you add the sleep.

The other reason could be that adding the sleep moves the GPU's P-state down from a higher state of, say, 8 to 4. If so, this simply means we need to keep the GPU fed and not add delays in between.
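
For reference, a rough sketch (not part of the original comment) of how the region of interest could be marked so it can be lined up with the paging activity in an Instruments trace; it assumes the torch.mps.profiler signpost hooks are available in the PyTorch build being used:

import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18().to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

# Emit OS signposts around the region of interest so it shows up
# next to the Metal System Trace data in Instruments.
torch.mps.profiler.start(mode="interval", wait_until_completed=False)
with torch.no_grad():
    model(x)
torch.mps.synchronize()
torch.mps.profiler.stop()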

@SonikArchitects

Hi @kulinseth, really appreciate you getting back to us. This is causing our inference to take 200% of the time it should, so we are very eager to solve it. Additionally, I'm sure this is something other development teams will bump into soon (and in great numbers), so it is for sure serving a greater good! Thank you.

I'll let @hdnhan and @Amusesmile add more comments here, but TL;DR: we are aware of the sleep effect (we had figured that part out), but that doesn't fix our problem. What we don't know how to do is what you mention above: moving the GPU's P-state to a higher state such as 8.

Thank you again for the help.

@Amusesmile

I'll provide a few more details and mention things we've tried. This is in a situation where we're processing incoming audio in a real-time application. We wait for 120ms of collected audio, then send it into our PyTorch MPS model in a separate thread. It should take 50ms to process, then wait ~70ms for the next audio frame. Because of the sleep slowdown issue, it's taking 100ms, so there's only ~20ms of safety margin. Intuitively we had figured out that the sleep, or the time in between calls, was what was slowing each pass down, as you mention. We'd like to find a way to prevent this memory re-wiring, or to control the prioritization of the GPU.

@kulinseth you mention that there's a 1s or fixed counter after which the resources are unwired and have to be mapped back in. Is there a way to make this slower or gain control over it, or is it managed by the system and out of reach?

For your second item, the GPU P-state, we have done experiments where we use small PyTorch "filler" models instead of sleep calls to try to keep the GPU active on this thread; however, they don't seem to help. Do you think that if they had the same memory input/output signature (but no processing), they might better keep things primed and fast for the next real processing call?

Any additional information you have about what we have control over, or what to try, would be appreciated. The filler PyTorch model we created was something trivial, like taking an input of 100 and returning the same.

Thanks!

@hdnhan
Author

hdnhan commented Aug 24, 2024

Hi @kulinseth

Thanks for your response.

I played around with Instruments for a few days. With the script above, here are my findings.

  • With or without sleep, the GPU Performance State seems unchanged and stays in Minimum mode. Here are two successive model inference runs without sleep:
    [screenshot]

  • During the 1s sleep, the Unwire Memory and Wire Memory rows under the Driver Processing (Metal Driver) graph were not showing anything, so I think nothing happened with memory during the sleep period.

  • During the 1s sleep, the Vertex and Fragment graphs show some activity; I'm not sure whether it affects anything.

  • The model has 12 command buffer blocks (if I understand correctly, the model has more than 12 nodes/layers, so there is a fusion step that merges some layers into one big command buffer block).
    1. A command buffer block in a loop with sleep is ~2-3x slower than the corresponding block in a loop without sleep.
    2. The time between command buffer blocks is bigger in a loop with sleep.
    Without sleep:
    [screenshot]
    With sleep:
    [screenshot]

@jhavukainen
Collaborator

Hi @hdnhan! I'm taking a look at the issue on our side. A couple of findings I have so far:

  • The paging activity does not look like the issue, based on the traces. It's pretty much the same with or without the sleep in the loop, to my understanding.
  • The GPU is certainly winding down to a lower power state with the sleep in the loop. But forcing only the GPU to stay in the maximum power state explains away only some of the perf loss here. Adjusting other SoC power state parameters seems to be required to get the full performance back with the sleep in the loop.

However, controlling the power states is something that takes place at the OS level, without apps having access to those tools. So we'll need to work around it here.

I think @Amusesmile is on the right track with the idea of a filler model, in the sense that busy-waiting while keeping the GPU active should prevent the power state from winding down. However, based on the description

The filler pytorch model we created was something trivial like input 100 and returning the same.

it sounds like this model might get optimized away by MPSGraph, since it should recognize that the op is just an identity and avoid doing work it deems unnecessary.

I'll keep digging into what would be the best way to avoid the power-state slump in this case and will update here.
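
As an illustration of that point, here is a hedged sketch of a filler workload that is harder for the graph compiler to elide than an identity op; the tensor size and duration are arbitrary placeholders, and the only goal is to keep the GPU busy between real inference calls:

import time
import torch

device = torch.device("mps")
a = torch.randn(256, 256, device=device)

def keep_gpu_busy(duration_s: float = 0.1) -> None:
    # Chain small matmuls so there is real arithmetic to schedule,
    # rather than an identity op the compiler can skip.
    b = a
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < duration_s:
        b = torch.tanh(b @ a)
    torch.mps.synchronize()  # make sure the queued work actually runs

# e.g. call keep_gpu_busy() in the gaps between real inference calls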

@jhavukainen
Collaborator

To fill in the earlier answer, here is an inelegant example of busy-waiting to prevent the resources from falling to a lower power state, simply by re-running the inference until the 1s is up, just to demonstrate.

import time
import torch
from torchvision import models

device = torch.device("mps")
model = models.resnet18()
model.to(device)
model.eval()

for i in range(10):
    x = torch.randn(1, 3, 224, 224, device=device)

    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.mps.synchronize()
    end = time.perf_counter()

    print(end - start)

    t1 = time.perf_counter()
    # Busy-wait: keep re-issuing inference work until ~1 s has elapsed instead of sleeping.
    while time.perf_counter() - t1 < 1.0:
        model(x)

The results on my machine:

Running without sleeping:

0.045149333000154
0.005514041000424186
0.00550837500122725
0.005526875000214204
0.005485875000886153
0.005400875001214445
0.0024398329987889156
0.0024834159994497895
0.0024525839999114396
0.00240375000066706

Running with sleep(1):

0.04453287499927683
0.010447082999235136
0.01648600000044098
0.016526749999684398
0.01672737499939103
0.012525374999313499
0.014056874999369029
0.01697608299946296
0.01570920800077147
0.02034816699961084

Busy waiting for 1s with the amended script above:

0.04514091599958192
0.0025789999999688007
0.00240733299870044
0.002496540999345598
0.0026197499992122175
0.0025347499995405087
0.0026297910007997416
0.0025359999999636784
0.002526082998883794
0.0025174589991365792

@jhavukainen jhavukainen self-assigned this Sep 3, 2024
@SonikArchitects

SonikArchitects commented Sep 4, 2024 via email

@Amusesmile

Thanks for the information and the measurements related to the busy-waiting workaround. I wonder if there's any other method, flag, or system setting to reach for, as the timing seems to prevent us from trying this directly.

I'll stick to 1 second so we're always on the same scale. Using those numbers, it's as if our warmed function takes 450ms, our unwarmed function takes 850ms, and we need to run every second (1000ms). If we try to run an identical inference buffer instead of sleeping, it takes 450ms (real) + 450ms (fake) = 900ms, and then we still need to sleep for 100ms (waiting for the new audio buffer), which slows it down again slightly and gets everything out of sync. Basically, our times are so large in comparison to the sleep timing that we can't run more in between.

We may be able to run smaller inference calls but they would have a different memory signature, so I'm not sure if the GPU would remain warmed (wired?) in that sense. We'll continue to try different combinations but I wanted to see if there's any other idea or method to try besides busy-wait.

Much appreciated and best regards,

@jhavukainen
Collaborator

jhavukainen commented Sep 12, 2024

Hi @Amusesmile! Yeah, I get that it's a bit of a clumsy workaround.

We may be able to run smaller inference calls but they would have a different memory signature, so I'm not sure if the GPU would remain warmed (wired?) in that sense. We'll continue to try different combinations but I wanted to see if there's any other idea or method to try besides busy-wait.

The different memory signature should not be an issue as long as the GPU stays busy. I just happened to use the same network as in the example script since it was already there. However, you are onto something here: it is annoyingly difficult to estimate what is "busy enough", especially since it might be device-dependent.

I'm still asking around for solutions with public APIs that you could use, but these seem to be a bit hard to come by. There is os_workgroup_interval_t, which lets a workgroup of threads synchronize so that the other thread gets warmed up just in time as the preset interval comes to an end. However, the issue is that the API is only for real-time audio CPU control, so it cannot control the GPU states.

So far the only recommendation has been to investigate whether your application could parallelize the recording and the processing of the previously recorded parts, so that the GPU stays active in a more organic way:

If I understood correctly (using the 1s time scale again), you kick off an audio recording every second. While waiting for the recording to finish, you:

  • Process the previous interval (450ms or 850ms, depending on whether the machine is sleepy or not)
  • Wait for the remainder (550ms or 150ms, until the 1s mark is up and the recording of the next interval is complete)
  • Start the cycle over again

So, to avoid the wait time ending up so long, one would need to shorten the recording window to, let's say, 500ms (assuming the GPU will not have time to power down), leading to:

  • Process the previous interval (225ms, making the unfounded assumption that the time scales linearly here)
  • Wait for the remainder (275ms, until the 0.5s mark is up and the next 0.5s interval is available for processing)
  • Possibly stitch the processed 500ms interval together with the previous one in another thread, in case the rest of the application is built expecting the original 1s time interval

Would something along those lines be possible? It might help especially if the processing does not scale down quite that linearly, since there's some overhead in kicking off the computation. It would also avoid the busy wait, which, I must add, is highly discouraged in most cases, as it might tie up resources the system would like to use for other background tasks, possibly leading to a worse user experience if the app makes the OS feel sluggish.

Also, am I understanding correctly that the consistency of the processing time might be more important here than having it be the absolute fastest possible? That is, it would be preferable to have the processing take exactly 650ms every iteration instead of risking it varying between 450-850ms depending on the power state? I'll also ask around about that, in case it brings up more ideas, since there could be other APIs for guaranteeing consistent performance instead of maximum performance. In any case, thanks for the patience; unfortunately this is a bit outside my range of expertise on macOS, so I have to rely heavily on others for guidance.
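
As a rough illustration of the restructuring described above, here is a hedged sketch using shorter chunks and a producer/consumer pair of threads; the chunk length, the dummy model, and the stitching step are placeholders rather than the actual application code:

import queue
import threading
import time
import torch

device = torch.device("mps")
model = torch.nn.Linear(1024, 1024).to(device).eval()

chunks = queue.Queue()
results = []

def recorder(n_chunks=8, chunk_seconds=0.5):
    # Stand-in for capturing ~0.5s of audio features per chunk.
    for _ in range(n_chunks):
        time.sleep(chunk_seconds)
        chunks.put(torch.randn(1, 1024))
    chunks.put(None)  # sentinel: recording finished

def processor():
    # Process each chunk as soon as it arrives so the GPU never idles for a
    # full interval; stitch the results back together on the CPU side.
    while (chunk := chunks.get()) is not None:
        with torch.no_grad():
            out = model(chunk.to(device))
        results.append(out.cpu())

threading.Thread(target=recorder, daemon=True).start()
processor()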

@hdnhan
Author

hdnhan commented Sep 12, 2024

Hi,

It's a little off-topic, but I'm trying to look into a different aspect.

I tried 01-MetalAdder from https://github.com/larsgeb/m1-gpu-cpp and modified main.cpp along these lines. The outputs without and with a 1-second sleep are almost the same.
[screenshot]

In this case, it seems 01-MetalAdder doesn't experience the issue. I wonder whether the add_arrays kernel is too simple, or whether there is something wrong with the implementation in MPS or in torch's MPS backend?
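
Since the screenshot of the main.cpp change is not reproduced here, a rough PyTorch/MPS analogue of the same experiment (a plain elementwise add timed with and without a 1-second sleep) would look something like this; the tensor size is an arbitrary choice:

import time
import torch

device = torch.device("mps")
a = torch.randn(1 << 24, device=device)
b = torch.randn(1 << 24, device=device)

for use_sleep in (False, True):
    print(f"sleep between iterations: {use_sleep}")
    for _ in range(5):
        start = time.perf_counter()
        c = a + b  # simple elementwise add, analogous to add_arrays
        torch.mps.synchronize()
        print(time.perf_counter() - start)
        if use_sleep:
            time.sleep(1)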

@Amusesmile

Thanks, this is all great info, and yes, the timing represents what we're experiencing, albeit with numbers scaled to 120ms instead of 1s. That's great to know about the busy-work solution being possible with a smaller inference call. I'm most intrigued by your idea of "consistent performance vs. max performance", because yes, if we could get it to be semi-fast always instead of the fastest possible, that would be very helpful and would solve many of the surrounding issues.

About the threading: we do actually use separate threads. We collect new audio in the main DAW audio-processing thread, then send the collected buffer to the second inference thread when it's ready, complete inference, then stitch and fade for playback back in the main thread. This was a bit difficult to engineer in a safe manner, but it was the method that finally let us get it working without locking up Logic or the other DAWs.

Any information you're able to find out about the OS flags or other ideas is still very much appreciated. Thanks so much for asking around and thinking about the problem. Cheers!
