-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Implement statx(STATX_DIOALIGN)
so applications can discover correct O_DIRECT
alignment
#16972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@gamanakis @hanneskasp FYI. Hannes, I'd love any feedback from your development team; it's hard to guess what applications are expecting here. Thank you! |
Hello, we use the following code for testing today
Server 1
Server 2
Does that help to answer the question (I'm not a developer and would bring one in if needed)? Best regards |
@hanneskasp OpenZFS 2.3.0 does not include the present pull-request, that's why memalign=0 and offsetalign=0 in your example. In order for your program to output some meaningful memalign and offsetalign you have to apply this pull-request locally and then compile OpenZFS. @robn is actually asking how you would use memalign and offsetalign, what values do you expect? |
quoting my colleague:
Does that help? The plan is to continue testing once the PR is in the 2.3 RPM (the main testing system is Rocky Linux 9 with ZFS 2.3) |
@robn on my veeam backup OpenZFS 2.3.0 filesystem with 128k recordsize, with this PR applied and using the script from @hanneskasp I get:
|
@robn May be I misremember something, but the patch looks weird to me. I'd expect you to statically set |
What also makes me wonder is how this API supposed to be used for creating and writing new files? I wonder if applications in that case call it on directory, in which case we might not care about specific zp properties, but only the dataset in general? Or it specifies name of still absent file, in which case I wonder what call ZFS will receive? |
I'm just returning from vacation today; I'll get an update pushed in the next day or two. Should be straightforward, you've all dug in and found all the wibbly bits already! Thanks for the great feedback, really appreciated :) |
Last push simplifies it all to match @behlendorf's summary above:
Tests updated accordingly. |
@robn if you can address my last couple comments and update this PR we should be able to pull this in. |
Not sure it's relevant, but I compiled the statx-dioalign branch as per below and then attempted to run a Veeam copy job and it immediately fails with a message similar to the following as soon as I enable DirectIO;
As soon as I disable DirectIO, the problem goes away.
|
Hello, Best regards |
Sorry it's taken so long to get back to this. I've pushed what I hope is the right thing, basically always the object block size, except in the first block, then the max recordsize. Tests have been extended to prove all this, and I've added the write test for correct/incorrect alignment. I suspect I could do more there; I'm not sure if "incorrect" alignment is expected to work in some situations. Let me know, it'd be easy enough to add a loop with lots of different sizes or alignments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
awesome work guys! Do I read correctly that this didn't make it into 2.3.1 though? I'm guessing its targeted for 2.3.2? |
@ashleyw-gh yeah, I was too slow on it, sorry about that. If no one else gets to it, I'll make sure it's submitted for consideration for 2.3.2. |
A standalone helper program to call statx(2) with various masks, and display the returned value, or fail if the kernel does not support it. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
Uses the statx helper to test the results of the STATX_DIOALIGN request as we manipulate DIO enable, dataset recordsize and file size and structure. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16972
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16972
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16972
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16972
@robn do you know if this was approved for 2.3.2? I only see the 2.3.2 staging tag at the moment, so I'm assuming 2.3.2 is some time off yet? Reason why is I think @hanneskasp needs this to get his Veeam dev team to make changes they need to on their side. |
@ashleyw-gh it's in the zfs-2.3.2 list that's going though review #17214. Hope to have it out this week or next. |
thanks for adding it. I just tested ZFS 2.3.2 with a beta build and now it works out of the box to write data to ZFS. There is still the challenge that reflink is not stable with default settings (ashley changed zfs_bclone_wait_dirty=1 to get reflink stable), but that's a different story. |
Motivation and Context
statx(STATX_DIOALIGN)
is a Linux-specific syscall to let an application discover if direct IO (O_DIRECT
) to a file is possible and, if so, what the alignment requirements are. This is useful since Linux does not mandate what happens when an unaligned access is requested; the filesystem may returnEINVAL
(as OpenZFS does), or may fall back to a non-aligned write.This PR wires this call up to OpenZFS.
Further reading:
Closes: #16970
Description
zfs_get_direct_alignment()
, to return the direct read and write alignment for the given open znode;zpl_getattr_impl
to handleSTATX_DIOALIGN
by filling it out with information derived fromzfs_get_direct_alignment()
;statx(2)
requesting specific fields and display the results;direct=
,recordsize=
and object size and structure.Notes
STATX_DIOALIGN
is required to return two alignment values:dio_mem_align
: the required alignment for a userspace memory region used inO_DIRECT
calls;dio_offset_align
: the required alignment for DMA to the underlying storage device.Both set to 0 indicates that direct IO is not possible on the file. If direct IO is possible, both must be non-zero. It's not valid to set one to zero and the other non-zero.
Since its a
stat
-like query, it's not even necessary that the file be open, so our decision on whether or not direct is possible is simply checking that DIO would be possible on this dataset:zfs_dio_enabled
set, anddirect=always
orstandard
.OpenZFS has different alignment requirements for reads and writes, which can't be expressed through this interface, so I've taken the larger of the read and write alignment, so at least the caller gets something that is guaranteed to work for both cases.
Write alignment for direct IO is documented to be recordsize. This is implemented as
z_blksz
, which makes sense, but presents a conundrum. If an object is only one L0 block, thenz_blksz
may be smaller than the recordsize right now, because it will grow if more data is added. If we only returnedz_blksz
, then it would be possible for the apparent alignment requirement to change after a single write, which seems like it would make things difficult for an application that wanted to get the alignment once, when the file is opened, and then go to work. So, I've simply made it that if the object is still in the first block and thez_blksz
is less than the dataset'sz_max_blksz
(ie recordsize), we use the dataset max instead.I'm not sure if this really lines up with what really happens in
zfs_setup_direct()
, but it seems that it falling out as the recordsize most of the time is probably what's expected, and honestly, direct IO for files smaller than recordsize is kind of a misuse of the whole thing? If anyone knows a program that uses thisSTATX_DIOALIGN
, I'd like to find out what it actually expects.For the IO alignment, I've just gone with the pool ashift. Again, without knowing how its used I don't really know, but it seems like the closest to not outright lying, even if we don't ever pass user pages down to IO devices. Possibly we should return the same value for both, given that an arbitrary 4K alignment probably doesn't work for direct access to a 128K record, so we don't really want anything to choose it.
How Has This Been Tested?
New test included.
Test passes on Linux 6.6.x and 6.1.x. Test skipped on Linux 5.4.x and 4.19.x. Test also skipped on FreeBSD 14.2.
Types of changes
Checklist:
Signed-off-by
.