Skip to content

Linux v5.13 broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) #12301

Closed
@jgottula

Description

@jgottula

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version rolling release / 2021 Jun 28
Linux Kernel 5.13
Architecture x86_64
ZFS Version 2.0.5-1 and 2.1.99-335_g4694131a0
SPL Version 2.0.5-1 and 2.1.99-335_g4694131a0

The problem

At system startup right after pool import, or at any other time when a large number of zvol dev nodes pop into existence (e.g. when setting snapdev=visible on a volume dataset with many snapshots), a small random smattering of the /dev/zvol/* symlinks that are expected to be created, are missing. This is just the immediately user-noticeable symptom, though.

Adding the following udev rule to heighten the log level for zvol devices helps diagnose what's going on. You can throw this in as /etc/udev/rules.d/00-debug.rules and load it up at runtime via udevadm control -R if desired.

SUBSYSTEM=="block", KERNEL=="zd*", OPTIONS="log_level=debug"

The rule in /usr/lib/udev/rules.d/60-zvol.rules for creating the /dev/zvol/* symlinks includes a PROGRAM token which causes udev (systemd-udevd) to spawn an instance of /usr/lib/udev/zvol_id per /dev/zd* dev node. Many of these are run in parallel at once.

The debug logging rule shown above reveals that what's happening is that, occasionally and somewhat randomly, one of these zvol_id processes will not be able to successfully perform the open("/dev/zdN", O_RDONLY) syscall to open the dev node it was asked to identify. The syscall fails, and the value in errno after the failure is 512.

So... 512 isn't a valid userspace error value at all. I was fortunate to find that 512 corresponds to ERESTARTSYS in Linux kernel land. As far as I know, this is a Thing That Should Never Leak To Userspace. Yet, it has, for whatever reason. (Maybe something about the signal configuration in udev processes is unusual and allows for this? No idea.)

Anyway. My naive understanding is that if a syscall handler (such as zvol_open) returns -ERESTARTSYS, then the Linux kernel is supposed to automatically make things "just work" such that the thread doesn't exit back to userspace but instead makes a U-turn and auto-retries the syscall again. (Which, in this case, would serve the purpose of effecting a retry on some heavily contended locks.) And so in principle, glibc shouldn't have to do any magic; and the program itself shouldn't have to do some sort of EINTR loop. And, at least if things work as intended, no ERESTARTSYS 512 values should be seen on the userspace side. (Yet, here we are.)

It should also be noted that other udev-spawned programs (e.g. udisks-part-id) also occasionally encountered the exact same syscall error with the same ERESTARTSYS error; often well-correlated with the failure in zvol_id, presumably due to both programs running at nearly the same time and both facing similar lock contention.

I did hack around with the zvol_id program, and experimented with putting the open syscall into a loop that would retry on failure if (errno == 512). I put number-of-tries printouts to stderr in that loop, and I found that the open call would eventually succeed; but generally after something on the order of 20-400 retries. (Probably would fare better with some kind of sleep call in between, but ehhh.)

I do half-wonder if the reorganization that happened in 5df7e9d has anything to do with this issue: prior to that commit, zvol_first_open would directly return (-SET_ERROR(ERESTARTSYS));, whereas after the commit, zvol_first_open (in one file) does return (SET_ERROR(EINTR)); while zvol_open (in a different file) does some -EINTR to -ERESTARTSYS stuff. Doesn't seem like the whole story by any means, but possibly a complicating factor.

I also wonder if there was some internal change to how the Linux kernel does its ERESTARTSYS magic, perhaps in 5.12. I haven't been able to actually track down where that logic is located, at the moment; I think it used to be in one particular file, but has since been moved to some other file; and my on-the-fly kretprobes knowledge apparently isn't good enough to figure out how to just print out a damn stack trace whenever zvol_first_open returns -EINTR.

Other issues and PRs of similarity or relevance

Information that might kinda help reproduce the problem maybe perhaps

I don't know if I have any specific reliable way to reproduce the problem.

The best I can say is that my pool has a lot of zvols (dozens to hundreds, particularly when snapdev=visible), and that I see this pretty consistently whenever a bunch of /dev/zd* nodes come into existence at the same time and make systemd-udevd spin up a crapton of workers that all attempt to access some large number of those dev nodes at the same time (undoubtedly running into some degree of lock contention).

I don't do any exotic things like attempting to build nested zfs contraptions involving using zvols as vdevs. (I generally regard myself as a sane person.)

I use a custom kernel that has, among other things, CONFIG_PREEMPT=y and CONFIG_HZ=1000. The involuntary preemption thing could be relevant; not entirely sure, to be honest.

This has been happening on an SMP system with multiple cores (AMD FX-8350).

If it's of any interest, I'm currently running systemd/udev version 249rc2. I believe this was happening on 248.3 as well, and possibly prior to that. I'm 95% sure I was also seeing this behavior back on kernel 5.12.2 (at which time I was using zfs 2babd20).

Oh yeah and I guess that little udev rule file I posted earlier that enables debug log_level for /dev/zd* dev nodes is probably a handy thing to have around if you're trying to reproduce this.

Logs of interest

Nothing of particular note shows up in /proc/spl/kstat/zfs/dbgmsg (with zfs_flags=1).

What I've posted below, however, is some interesting udev log_level=debug output from the journal, isolated out for one particular zvol device that had difficulty when I switched snapdev=hidden to snapdev=visible on a particular volume.

In this case I was using an unmodified zvol_id binary that didn't have any printouts added to indicate e.g. what errno was; but you can see it from the udisks-part-id invocation that also failed on the open syscall.

systemd-udevd[3765278]: zd736: The log level is changed to 'debug' while processing device (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/50-udev-default.rules:69 GROUP 6
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 Running PROGRAM '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765289.
systemd-udevd[3765278]: zd736: '/usr/lib/udev/zvol_id /dev/zd736'(out) 'Unable to open device file: /dev/zd736'
systemd-udevd[3765278]: zd736: Process '/usr/lib/udev/zvol_id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'zvol/Unable'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'to'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'open'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'device'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'file:'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK '/dev/zd736'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/80-udisks.rules:84 Importing properties from results of 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765294.
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'using device_file=/dev/zd736 syspath=/sys/devices/virtual/block/zd736, offset=0 ao=0 and number=0 for /dev/zd736'
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'Error opening /dev/zd736: Unknown error 512'
systemd-udevd[3765278]: zd736: Process 'udisks-part-id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Setting permissions /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: sd-device: Created db file '/run/udev/data/b230:736' for '/devices/virtual/block/zd736'
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Preserve permissions of /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: Adding watch on '/dev/zd736'
systemd-udevd[3765278]: zd736: Device processed (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: sd-device-monitor: Passed 757 byte to netlink monitor

In case you read the log above and are now thinking "wtf are those symlinks?!"

In the log sequence posted above, it's hard not to notice that another idiotic thing is going on. Because there's some goofs in the udev rules file and in zvol_id, the nonzero exit status of 512 from zvol_id is modulo'd into 0, which of course is then treated as a "success", meaning that the contents of stdout are not ignored, but are consumed normally (as if they were the name of the /dev/zvol/* symlink to create). And then zvol_id, for reasons I don't particularly understand, only ever prints stuff (including its error messages) to stdout; combine that with the wonders of space-separated list syntax, and you wind up with udev deciding that the rules are requesting that the following symlinks be created:

/dev/zvol/Unable -> /dev/zd736
/dev/to          -> /dev/zd736
/dev/open        -> /dev/zd736
/dev/device      -> /dev/zd736
/dev/file:       -> /dev/zd736
/dev//dev/zd736  -> /dev/zd736

(I plan to address this in a separate PR that I have pretty much all lined up and ready to go already. The 60-zvol.rules file is extremely ancient; like, it's literally a decade old. And it uses a fairly impressive number of deprecated syntax quirks for a mere one-liner file. Most of the code for zvol_id is about a decade old as well, and does dumb stuff like printing error messages to stdout. So yeah, I'll be taking care of all of this particular dumbness myself shortly. That PR will fix the bizarre symlinks symptom that's been going on; but it doesn't do anything to address the core mystery of "why the hell are these zvol_open calls failing, and in such a way that errno=ERESTARTSYS is getting to userspace?")

(This is now up as PR #12302.)

Stuff I can do to help

If this issue turns out to be a giant pain to reproduce, I'm happy to do stuff like flipping on zfs_flags bits, dumping various logs, adding extra debug printouts in particular places, creating/building/running custom one-off builds of the zfs module that I use on my personal system with my actual data with reckless abandon; basically whatever helps track this down. 😄

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions