
Implement statx(STATX_DIOALIGN) so applications can discover correct O_DIRECT alignment #16972


Merged
merged 3 commits into openzfs:master from robn:statx-dioalign on Mar 13, 2025

Conversation

robn
Member

@robn robn commented Jan 22, 2025

Motivation and Context

statx(STATX_DIOALIGN) is a Linux-specific interface that lets an application discover whether direct IO (O_DIRECT) to a file is possible and, if so, what the alignment requirements are. This is useful because Linux does not mandate what happens when an unaligned access is requested; the filesystem may return EINVAL (as OpenZFS does), or may fall back to a non-direct write.

This PR wires this call up to OpenZFS.

Closes: #16970

Description

  • Adds a ZPL function, zfs_get_direct_alignment(), to return the direct read and write alignment for the given open znode;
  • Extends zpl_getattr_impl to handle STATX_DIOALIGN by filling it out with information derived from zfs_get_direct_alignment();
  • Adds a ZTS helper program to call statx(2) requesting specific fields and display the results;
  • Adds a ZTS test to use the program to check the values are returned correctly for various combinations of direct=, recordsize= and object size and structure.

Notes

STATX_DIOALIGN is required to return two alignment values:

  • dio_mem_align: the required alignment for a userspace memory region used in O_DIRECT calls;
  • dio_offset_align: the required alignment for file offsets and IO segment lengths.

Both set to 0 indicates that direct IO is not possible on the file. If direct IO is possible, both must be non-zero; it's not valid to set one to zero and the other non-zero.

Since it's a stat-like query, it's not even necessary that the file be open, so the decision on whether direct IO is possible is simply a check that DIO would be possible on this dataset: zfs_dio_enabled set, and direct=always or direct=standard.

OpenZFS has different alignment requirements for reads and writes, which can't be expressed through this interface, so I've taken the larger of the read and write alignments; that way the caller at least gets something that is guaranteed to work for both cases.

Write alignment for direct IO is documented to be the recordsize. This is implemented as z_blksz, which makes sense, but presents a conundrum: if an object is only one L0 block, then z_blksz may currently be smaller than the recordsize, because the block will grow as more data is added. If we only returned z_blksz, the apparent alignment requirement could change after a single write, which would make things difficult for an application that wants to get the alignment once, when the file is opened, and then go to work. So I've simply made it that if the object is still in its first block and z_blksz is less than the dataset's z_max_blksz (ie recordsize), we use the dataset max instead.

I'm not sure if this really lines up with what happens in zfs_setup_direct(), but falling out as the recordsize most of the time is probably what's expected, and honestly, direct IO for files smaller than the recordsize is arguably a misuse of the whole thing. If anyone knows a program that uses STATX_DIOALIGN, I'd like to find out what it actually expects.

For the memory alignment, I've just gone with the pool ashift. Again, without knowing how it's used I don't really know, but it seems like the closest to not outright lying, even if we don't ever pass user pages down to IO devices. Possibly we should return the same value for both, given that an arbitrary 4K alignment probably doesn't work for direct access to a 128K record, so we don't really want anything to choose it.

How Has This Been Tested?

New test included.

Test passes on Linux 6.6.x and 6.1.x. Test skipped on Linux 5.4.x and 4.19.x. Test also skipped on FreeBSD 14.2.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@robn robn requested review from amotin and bwatkinson January 22, 2025 06:11
@robn
Member Author

robn commented Jan 22, 2025

@gamanakis @hanneskasp FYI. Hannes, I'd love any feedback from your development team; it's hard to guess what applications are expecting here. Thank you!

@hanneskasp

Hello,
We are trying to solve the O_DIRECT misalignment problem mentioned in #16953 (there is a strace of what fails today in #16953 (comment)).

We use the following code for testing today:

#define _GNU_SOURCE     /* must precede all includes for statx(2) */
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>
#include <cstdio>       /* perror */

#include <iostream>

int main(int argc, char* argv[])
{
        struct statx statxbuf{};
        if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &statxbuf))
        {
                perror("Cannot execute statx");
                return -1;
        }
        std::cout << "mask" << statxbuf.stx_mask << " memalign: " << statxbuf.stx_dio_mem_align << " offsetalign: " << statxbuf.stx_dio_offset_align << std::endl;
        return 0;
}

Server 1

  • Rocky Linux 9.5 with 5.14.0-503.21.1.el9_5.x86_64 (we know that statx needs kernel 6.1, but I did not find a quick way to get the mainline kernel working with ZFS on Rocky 9.5 and it seems to work fine for XFS)
  • zfs-2.3.0-1 & zfs-kmod-2.3.0-1
    [root@rocky-zfs23 ~]# ./a.out /etc/passwd (this is XFS)
    mask14335 memalign: 512 offsetalign: 512
    [root@rocky-zfs23 ~]# ./a.out /datareduction/recordsize16k/VBR-FK/BJ-HK-zfs-fastdedup_DC01/HK-DC01_7B18C.vbm
    mask6143 memalign: 0 offsetalign: 0

Server 2

  • Ubuntu 24.04 with 6.8.0-51-generic kernel
  • zfs-2.2.2-0ubuntu9.1 and zfs-kmod-2.2.2-0ubuntu9.1
    root@ubuntu:# ./a.out /etc/passwd (this is ext4)
    mask14335 memalign: 4 offsetalign: 512
    root@ubuntu:# ./a.out /alignment/test.txt
    mask6143 memalign: 0 offsetalign: 0

Does that help to answer the question (I'm not a developer and would bring one in if needed)?

Best regards
Hannes

@gamanakis
Contributor

@hanneskasp OpenZFS 2.3.0 does not include the present pull request; that's why memalign=0 and offsetalign=0 in your example. For your program to output meaningful memalign and offsetalign values, you have to apply this pull request locally and then build OpenZFS.

@robn is actually asking how you would use memalign and offsetalign, and what values you expect.

@hanneskasp

quoting my colleague:

We're expecting values that will tell us the needed alignment for correct execution of direct IO requests. We will align the memory address of the memory buffer we're passing to the request, offset of the data for IO and size of the data for IO according to these values.

Does that help?

The plan is to continue testing once the PR is in the 2.3 RPM (the main testing system is Rocky Linux 9 with ZFS 2.3)

@gamanakis
Contributor

@robn on my Veeam backup OpenZFS 2.3.0 filesystem with 128k recordsize, with this PR applied and using the program from @hanneskasp, I get:

mask14303 memalign: 131072 offsetalign: 4096

@amotin
Member

amotin commented Jan 22, 2025

@robn Maybe I misremember something, but the patch looks weird to me. I'd expect you to statically set dio_mem_align to PAGE_SIZE, since the Direct I/O code strictly requires all buffers to be page-aligned and a page multiple. dio_offset_align is indeed tricky, since it is different for reads and writes. For reads it can be anything page-aligned and a page multiple, but preferably the file block size to avoid partial reads. For writes it should be a full block size so we don't fall back to the ARC, but as you pointed out, the block size might change. I think we should return the file's current block size if the file has more than one block, or otherwise the larger of the recordsize and the file's current block size (rounded to the next power of 2 if needed), but in either case no less than PAGE_SIZE.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jan 22, 2025
@amotin
Member

amotin commented Jan 22, 2025

What also makes me wonder is how this API is supposed to be used when creating and writing new files. Do applications in that case call it on the directory, in which case we might not care about specific zp properties, but only the dataset in general? Or do they specify the name of a still-absent file, in which case I wonder what call ZFS will receive?

@robn
Member Author

robn commented Jan 27, 2025

I'm just returning from vacation today; I'll get an update pushed in the next day or two. Should be straightforward, you've all dug in and found all the wibbly bits already! Thanks for the great feedback, really appreciated :)

@robn
Member Author

robn commented Jan 27, 2025

Last push simplifies it all to match @amotin's summary above:

dio_mem_align = PAGE_SIZE;
dio_offset_align = MAX(zp->z_blksz, PAGE_SIZE);

Tests updated accordingly.

@behlendorf
Contributor

@robn if you can address my last couple comments and update this PR we should be able to pull this in.

@ashleyw-gh

ashleyw-gh commented Feb 28, 2025

Not sure it's relevant, but I compiled the statx-dioalign branch as per below and then attempted to run a Veeam copy job, and it immediately fails with a message similar to the following as soon as I enable DirectIO:

Asynchronous request operation has failed. [size = 524288] [offset = 4096] [bufaddr = 0x00007f3638cfe200] [res = -22; 0]
Failed to open storage for read/write access. Storage: [/VeeamBackup/cj_mgmt/mgmt/VeeamProxy8.vm-651871D2025-03-01T113926_E6DE.vib].

As soon as I disable DirectIO, the problem goes away.
@hanneskasp - do you know if Veeam needs to make changes to the Veeam code before this patch will work properly with Veeam?

# git clone https://github.com/robn/zfs
# cd ./zfs
# git checkout "statx-dioalign"
# sh autogen.sh
# ./configure
# make -j1 rpm-utils rpm-dkms
# yum localinstall *.$(uname -p).rpm *.noarch.rpm
# reboot
# zfs version
zfs-2.3.99-1
zfs-kmod-2.3.99-1
# cat /sys/module/zfs/parameters/zfs_dio_enabled
0
# vi /etc/modprobe.d/zfs.conf
options zfs zfs_bclone_enabled=1 zfs_bclone_wait_dirty=1 zfs_dio_enabled=1
# echo 1 >/sys/module/zfs/parameters/zfs_dio_enabled
# dracut -f initramfs-$(uname -r).img  $(uname -r)
# reboot
# cat /sys/module/zfs/parameters/zfs_dio_enabled
1

@hanneskasp

Hello,
yes, we have a code change planned in V13 to query statx(). It cannot work today, and the earliest point in time for testing would be a V13 beta; I have you on my list to send you the beta if you have time for testing.

Best regards
Hannes

@robn robn force-pushed the statx-dioalign branch from 9f8146b to b8a8e13 on March 7, 2025 10:13
@robn
Member Author

robn commented Mar 7, 2025

Sorry it's taken so long to get back to this. I've pushed what I hope is the right thing: basically, always the object block size, except while still in the first block, where we use the max recordsize instead.

Tests have been extended to prove all this, and I've added a write test for correct/incorrect alignment. I suspect I could do more there; I'm not sure whether "incorrect" alignment is expected to work in some situations. Let me know; it'd be easy enough to add a loop over lots of different sizes or alignments.

Member

@amotin amotin left a comment

Thank you.

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 10, 2025
@ashleyw-gh

Awesome work, guys! Do I read correctly that this didn't make it into 2.3.1 though? I'm guessing it's targeted for 2.3.2?

@robn
Member Author

robn commented Mar 11, 2025

Awesome work, guys! Do I read correctly that this didn't make it into 2.3.1 though? I'm guessing it's targeted for 2.3.2?

@ashleyw-gh yeah, I was too slow on it, sorry about that. If no one else gets to it, I'll make sure it's submitted for consideration for 2.3.2.

A standalone helper program to call statx(2) with various masks, and
display the returned value, or fail if the kernel does not support it.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
robn added 2 commits March 11, 2025 19:57
Uses the statx helper to test the results of the STATX_DIOALIGN request
as we manipulate DIO enable, dataset recordsize and file size and
structure.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
@github-actions github-actions bot removed the Status: Accepted Ready to integrate (reviewed, tested) label Mar 11, 2025
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Mar 11, 2025
@amotin amotin merged commit 13ec35c into openzfs:master Mar 13, 2025
22 of 25 checks passed
@robn robn deleted the statx-dioalign branch March 13, 2025 21:49
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Mar 27, 2025
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16972
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 3, 2025
fuporovvStack pushed a commit to fuporovvStack/zfs that referenced this pull request Apr 11, 2025
@ashleyw-gh

Awesome work, guys! Do I read correctly that this didn't make it into 2.3.1 though? I'm guessing it's targeted for 2.3.2?

@ashleyw-gh yeah, I was too slow on it, sorry about that. If no one else gets to it, I'll make sure it's submitted for consideration for 2.3.2.

@robn do you know if this was approved for 2.3.2? I only see the 2.3.2 staging tag at the moment, so I'm assuming 2.3.2 is some time off yet? The reason I ask is that I think @hanneskasp needs this so his Veeam dev team can make the changes they need on their side.

@tonyhutter
Contributor

@ashleyw-gh it's in the zfs-2.3.2 list that's going through review (#17214). Hope to have it out this week or next.

@hanneskasp

Thanks for adding it. I just tested ZFS 2.3.2 with a beta build, and now writing data to ZFS works out of the box.

There is still the challenge that reflink is not stable with default settings (Ashley set zfs_bclone_wait_dirty=1 to get reflink stable), but that's a different story.

Labels
Status: Accepted Ready to integrate (reviewed, tested)

Successfully merging this pull request may close these issues.

O_DIRECT implementation misses statx(STATX_DIOALIGN) extension
9 participants