HDR: fix read performance regression (fixes #3585) #3588

aras-p · 2022-10-07T07:26:28Z

Description

The HDR code overhaul to support IOProxy (#3218) changed all reads into `pread()` calls with explicit file position tracking. It turns out that this has considerably more overhead than simple sequential reads, even more so on Windows.

On my PC (Windows 10, VS2022, Ryzen 5950X) this change gets file read time for an 8x resolution .HDR image (with RLE compression) from 8.28s back to 1.10s, just like it was in OIIO 2.3.

On Windows, part of the extra cost of using `pread()` is a few additional mutex locks, but the real cost is in the eventual ReadFile call; it looks like any seek operation makes it take some sort of "way slower" path with various callbacks and whatnot inside the kernel.

Before the fix (the extra ~2 sec of file seeking cost is outside the shot):

After the fix:

(exercise for the future: get rid of that ldexp cost, and perhaps also stop reading the file mostly two bytes at a time -- but both are optimizations in a "nice to have" category, not part of this performance regression)

Tests

No new tests needed.

Checklist:

I have read the contribution guidelines.
If this is more extensive than a small change to existing code, I
have previously submitted a Contributor License Agreement
(individual, and if there is any way my
employers might think my programming belongs to them, then also
corporate).
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite
(adding new test cases if necessary).
My code follows the prevailing code style of this project.

lgritz · 2022-10-07T11:02:29Z

This is awesome, thanks!

Now for some discussion (which I think should not preclude accepting this for now).

Switching from IOProxy::pread to using IOProxy::seek + IOProxy::read makes the IOProxy stateful and not thread-safe without an external lock (which this HdrInput has, so it's not going to break anything), precluding any future removal of the locks from read_native_scanline to make the whole ImageInput able to read concurrently. That's probably not important for this file format, but for others (especially, and crucially, those usable as textures), it would be desirable to rely on those stateless pread calls and remove the lock from the ImageInput.
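To illustrate the stateless vs. stateful distinction, here is a minimal sketch using a hypothetical in-memory class. `ToyProxy` and its members are made up for illustration and are not OIIO's actual IOProxy implementation; only the general shape (a positional `pread` next to `seek` + `read`) follows the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for an I/O proxy (not OIIO's IOProxy).
class ToyProxy {
public:
    explicit ToyProxy(std::vector<unsigned char> data)
        : m_data(std::move(data)) {}

    // Stateless: the caller supplies the offset and the object is not
    // modified, so concurrent calls need no lock in this toy version.
    size_t pread(void* buf, size_t size, int64_t offset) const
    {
        assert(buf != nullptr || size == 0);
        if (offset < 0 || size_t(offset) >= m_data.size())
            return 0;
        size_t n = std::min(size, m_data.size() - size_t(offset));
        std::memcpy(buf, m_data.data() + offset, n);
        return n;
    }

    // Stateful: seek() and read() both touch m_pos, so two threads sharing
    // one ToyProxy would need an external lock around each seek+read pair.
    bool seek(int64_t offset)
    {
        if (offset < 0)
            return false;
        m_pos = offset;
        return true;
    }

    size_t read(void* buf, size_t size)
    {
        size_t n = pread(buf, size, m_pos);
        m_pos += int64_t(n);  // advance the shared position
        return n;
    }

    int64_t tell() const { return m_pos; }

private:
    std::vector<unsigned char> m_data;
    int64_t m_pos = 0;
};
```

In this sketch, a `pread` at any offset leaves `tell()` unchanged, while every `read` moves it; that shared position is the state that forces the external lock.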

I think this patch is fine. We don't expect a lot of concurrent reads from hdr like we do for tiff and openexr that we use extensively for highly threaded on-demand reading of texture. It fixes the recent performance regression, and it doesn't really cost us concurrency because we already had the lock in HdrInput. But in general, we want to prefer eschewing read/seek and using pread instead, so we can expect maximum concurrency from multiple threads using an ImageInput (as we do in ImageCache/TextureSystem).

So I believe that a revised and more complete diagnosis is that the problem isn't the pread vs read per se, but that for RLE-compressed images we were calling pread separately for every 4 bytes! The calls to read are faster simply because they are buffered (in FILE) and require fewer OS system calls. But the real sin is that we call it way too many times.

I think that the rle case could go back to pread if we want, but do just one pread per scanline (rather than one per pixel value run), asking for the maximum amount that will be needed to read the scanline for the worst case rle representation (it's ok if it's the end of the file, pread will just read up to the end and return the true number of bytes read). In other words, we'd be doing the buffering ourselves.
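A rough sketch of that idea, not taken from the OIIO source: one positional read per scanline into a worst-case-sized buffer, then RLE decoding entirely from memory. All names here (`PreadFn`, `decode_rle_scanline`, `read_scanline`, `make_memory_pread`) are hypothetical. The sketch assumes the usual Radiance RLE layout (a 4-byte header `2, 2, width_hi, width_lo`, then four channel planes where a count byte above 128 means a run) and assumes every run encodes at least one value, which gives a bound of 2 bytes per channel value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical positional-read callback: (dst, size, offset) -> bytes read.
using PreadFn = std::function<size_t(void*, size_t, int64_t)>;

// Decode one RLE scanline from memory into interleaved RGBE bytes.
// Returns the number of encoded bytes consumed, or 0 on malformed input.
size_t decode_rle_scanline(const unsigned char* buf, size_t avail, int width,
                           std::vector<unsigned char>& rgbe)
{
    if (avail < 4 || buf[0] != 2 || buf[1] != 2
        || ((buf[2] << 8) | buf[3]) != width)
        return 0;
    rgbe.assign(size_t(width) * 4, 0);
    size_t pos = 4;
    for (int c = 0; c < 4; ++c) {  // planes: R, G, B, E
        int x = 0;
        while (x < width) {
            if (pos >= avail)
                return 0;
            int count   = buf[pos++];
            bool is_run = count > 128;
            if (is_run)
                count -= 128;
            if (count == 0 || x + count > width)
                return 0;
            if (is_run) {  // one value repeated `count` times
                if (pos >= avail)
                    return 0;
                unsigned char v = buf[pos++];
                for (int i = 0; i < count; ++i)
                    rgbe[size_t(x++) * 4 + c] = v;
            } else {  // `count` literal values
                if (pos + size_t(count) > avail)
                    return 0;
                for (int i = 0; i < count; ++i)
                    rgbe[size_t(x++) * 4 + c] = buf[pos++];
            }
        }
    }
    return pos;
}

// One pread per scanline; `offset` advances by the bytes actually consumed.
bool read_scanline(const PreadFn& pread, int64_t& offset, int width,
                   std::vector<unsigned char>& rgbe)
{
    assert(width > 0);
    // Assumed worst case: 4-byte header + 2 bytes per channel value.
    std::vector<unsigned char> buf(4 + size_t(width) * 8);
    size_t got  = pread(buf.data(), buf.size(), offset);  // short at EOF is ok
    size_t used = decode_rle_scanline(buf.data(), got, width, rgbe);
    offset += int64_t(used);
    return used != 0;
}

// Hypothetical helper: a pread over an in-memory "file" that counts calls.
PreadFn make_memory_pread(const std::vector<unsigned char>& data, int& calls)
{
    return [&data, &calls](void* dst, size_t size, int64_t offset) -> size_t {
        ++calls;
        if (offset < 0 || size_t(offset) >= data.size())
            return 0;
        size_t n = std::min(size, data.size() - size_t(offset));
        std::memcpy(dst, data.data() + offset, n);
        return n;
    };
}
```

The over-read past the end of the current scanline is harmless because the next call starts from the updated `offset`, so the cost is some redundant bytes per read in exchange for one call per scanline instead of one per run.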

But anyway, the real misdesign of this HDR reader is ancient, and consists of doing many many tiny reads. Even the buffered fread would be sped up a lot by not doing it separately for every 2-4 bytes of the file. I don't know if it's worth fixing for this format. Depends on whether anybody really needs to maximize its performance or expects multiple threads reading the same hdr file to be fully concurrent.

lgritz

LGTM. Solves the perf regression, though it's a little sad to revert back to the stateful kind of read. But that's probably fine for this format. It would only really make a difference for file types that we expected to want to read concurrently by multiple threads from the same ImageInput (which primarily is the case for tiled formats that we use for TextureSystem).

aras-p · 2022-10-07T11:21:20Z

Yeah I agree -- the actual cause of perf regression is like "pread has some extra overhead compared to just read", which is not an issue if you're reading in "sensible" chunk sizes. But does very much become an issue if you're reading 2 or 4 bytes at a time :) Maybe someday Someone ™️ will redo the reading code here to read in larger chunks, and it could get back to pread.
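As a sketch of what "reading in larger chunks" could look like in general (hypothetical code, not part of this PR): a small wrapper that serves tiny sequential reads from a chunk filled by one positional read at a time. `ChunkedReader` and `PreadFn` are invented names, and the chunk size is an arbitrary parameter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical positional-read callback: (dst, size, offset) -> bytes read.
using PreadFn = std::function<size_t(void*, size_t, int64_t)>;

// Serves small sequential reads out of a chunk-sized buffer, so the
// underlying pread is called once per chunk rather than once per read.
class ChunkedReader {
public:
    ChunkedReader(PreadFn pread, size_t chunk_size)
        : m_pread(std::move(pread)), m_buf(chunk_size)
    {
        assert(chunk_size > 0);
    }

    // Copy up to `size` bytes to dst; returns fewer only at end of data.
    size_t read(void* dst, size_t size)
    {
        auto* out   = static_cast<unsigned char*>(dst);
        size_t done = 0;
        while (done < size) {
            if (m_cur == m_end && !refill())
                break;  // no more data
            size_t n = std::min(size - done, m_end - m_cur);
            std::memcpy(out + done, m_buf.data() + m_cur, n);
            m_cur += n;
            done += n;
        }
        return done;
    }

    // Number of underlying pread calls made so far.
    size_t refills() const { return m_refills; }

private:
    bool refill()
    {
        m_end = m_pread(m_buf.data(), m_buf.size(), m_offset);
        m_cur = 0;
        m_offset += int64_t(m_end);
        ++m_refills;
        return m_end > 0;
    }

    PreadFn m_pread;
    std::vector<unsigned char> m_buf;
    size_t m_cur = 0, m_end = 0, m_refills = 0;
    int64_t m_offset = 0;
};
```

With a 256-byte chunk, reading 1000 bytes two at a time costs four underlying reads in this sketch rather than 500. The wrapper keeps its own position, so it is per-reader state layered on top of the stateless pread.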

lgritz · 2022-10-07T15:29:32Z

I was mainly writing that all down for the record, for anybody who treads here next, and for my own clarity of thought on the matter. HDR is primarily a legacy/import format (generally on the path to turn it into an exr), and when used tends to read the whole image at once. It doesn't need to be especially performant compared to more common formats and especially ones that would be read primarily concurrently such as true texture formats. So I'm not sure that Someone™️ needs to take another pass on it unless it's just for fun or to scratch an itch.

But thanks for tracking this down! Even for a format that doesn't need to be especially high performance, I definitely was not aiming for a 10x perf regression!

aras-p marked this pull request as ready for review October 7, 2022 08:26

aras-p changed the title ~~HDR: fix read performance regression (#3585)~~ HDR: fix read performance regression (fixes #3585) Oct 7, 2022

aras-p mentioned this pull request Oct 7, 2022

HDR: speed up reading by around 4x #3590

Merged

5 tasks

lgritz approved these changes Oct 7, 2022

lgritz merged commit a285a5d into AcademySoftwareFoundation:master Oct 7, 2022

aras-p deleted the hdr-opt branch October 7, 2022 19:56
