### Description

### System information
| Type | Version/Name |
|---|---|
| Distribution Name | Arch Linux |
| Distribution Version | rolling release / 2021 Jun 28 |
| Linux Kernel | 5.13 |
| Architecture | x86_64 |
| ZFS Version | 2.0.5-1 and 2.1.99-335_g4694131a0 |
| SPL Version | 2.0.5-1 and 2.1.99-335_g4694131a0 |
### The problem
At system startup right after pool import, or at any other time when a large number of zvol dev nodes pop into existence (e.g. when setting `snapdev=visible` on a volume dataset with many snapshots), a small random smattering of the `/dev/zvol/*` symlinks that are expected to be created are missing. This is just the immediately user-noticeable symptom, though.

Adding the following udev rule to heighten the log level for zvol devices helps diagnose what's going on. You can throw this in as `/etc/udev/rules.d/00-debug.rules` and load it up at runtime via `udevadm control -R` if desired.

```
SUBSYSTEM=="block", KERNEL=="zd*", OPTIONS="log_level=debug"
```
The rule in `/usr/lib/udev/rules.d/60-zvol.rules` for creating the `/dev/zvol/*` symlinks includes a `PROGRAM` token which causes udev (systemd-udevd) to spawn an instance of `/usr/lib/udev/zvol_id` per `/dev/zd*` dev node. Many of these are run in parallel at once.
The debug logging rule shown above reveals that what's happening is that, occasionally and somewhat randomly, one of these `zvol_id` processes will not be able to successfully perform the `open("/dev/zdN", O_RDONLY)` syscall to open the dev node it was asked to identify. The syscall fails, and the value in `errno` after the failure is 512.
So... 512 isn't a valid userspace error value at all. I was fortunate to find that 512 corresponds to `ERESTARTSYS` in Linux kernel land. As far as I know, this is a Thing That Should Never Leak To Userspace. Yet, it has, for whatever reason. (Maybe something about the signal configuration in udev processes is unusual and allows for this? No idea.)
Anyway. My naive understanding is that if a syscall handler (such as `zvol_open`) returns `-ERESTARTSYS`, then the Linux kernel is supposed to automatically make things "just work" such that the thread doesn't exit back to userspace but instead makes a U-turn and auto-retries the syscall. (Which, in this case, would serve the purpose of effecting a retry on some heavily contended locks.) So in principle, glibc shouldn't have to do any magic, and the program itself shouldn't have to do some sort of `EINTR` loop. And, at least if things work as intended, no `ERESTARTSYS` 512 values should be seen on the userspace side. (Yet, here we are. One plausible mechanism, if I understand the kernel's contract correctly: the restart magic only runs on the signal-delivery path, so if a handler returns `-ERESTARTSYS` while no signal is actually pending, nothing ever rewrites the value, and the raw 512 escapes to userspace.)
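(For what it's worth, here's what userspace makes of the leaked value; this trivial standalone snippet is my own, and its output matches the `udisks-part-id` log line further below.)

```c
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* glibc has no name for errno 512, so this prints
	 * "Unknown error 512" -- the exact text udisks-part-id logs. */
	printf("%s\n", strerror(512));
	return 0;
}
```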
It should also be noted that other udev-spawned programs (e.g. `udisks-part-id`) occasionally encountered the exact same failure, with the same `ERESTARTSYS` value, on the same syscall; often well-correlated with the failure in `zvol_id`, presumably due to both programs running at nearly the same time and both facing similar lock contention.
I did hack around with the `zvol_id` program, and experimented with putting the `open` syscall into a loop that would retry on failure `if (errno == 512)`. I put number-of-tries printouts to stderr in that loop, and I found that the `open` call would eventually succeed; but generally after something on the order of 20-400 retries. (Probably would fare better with some kind of sleep call in between, but ehhh.)
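For reference, the hack was roughly the following sketch (my own reconstruction, not anything in the real `zvol_id`; the `open_retry` helper name and the `LEAKED_ERESTARTSYS` constant are mine):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* ERESTARTSYS is kernel-internal and has no userspace header,
 * so hard-code the leaked value. */
#define LEAKED_ERESTARTSYS 512

static int
open_retry(const char *path)
{
	int fd, tries = 0;

	/* Keep retrying only while the kernel-internal value leaks out;
	 * any other failure is returned to the caller as usual. */
	while ((fd = open(path, O_RDONLY)) < 0 &&
	    errno == LEAKED_ERESTARTSYS)
		tries++;

	fprintf(stderr, "open(%s): %d retries\n", path, tries);
	return (fd);
}
```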
I do half-wonder if the reorganization that happened in 5df7e9d has anything to do with this issue: prior to that commit, `zvol_first_open` would directly `return (-SET_ERROR(ERESTARTSYS));`, whereas after the commit, `zvol_first_open` (in one file) does `return (SET_ERROR(EINTR));` while `zvol_open` (in a different file) does some `-EINTR` to `-ERESTARTSYS` translation. Doesn't seem like the whole story by any means, but possibly a complicating factor.
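To make the two-file split concrete, here's my paraphrase of the post-5df7e9d arrangement (reconstructed from my reading, heavily elided; the file names and surrounding details are approximate, so don't treat this as verbatim source):

```c
/* Common code (module/zfs/zvol.c, roughly): give up on the open with
 * a generic error when the locks can't be taken without deadlock risk. */
int
zvol_first_open(zvol_state_t *zv, boolean_t readonly)
{
	/* ... if lock acquisition would risk deadlock ... */
	return (SET_ERROR(EINTR));	/* was: return (-SET_ERROR(ERESTARTSYS)); */
	/* ... */
}

/* Linux-specific code (module/os/linux/zfs/zvol_os.c, roughly):
 * translate the generic EINTR into the Linux-internal restart value. */
static int
zvol_open(struct block_device *bdev, fmode_t flag)
{
	int error;

	/* ... look up the zvol_state_t for this bdev ... */
	error = -zvol_first_open(zv, !(flag & FMODE_WRITE));
	if (error == -EINTR)
		error = -ERESTARTSYS;
	return (error);
}
```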
I also wonder if there was some internal change to how the Linux kernel does its `ERESTARTSYS` magic, perhaps in 5.12. I haven't been able to actually track down where that logic is located, at the moment; I think it used to be in one particular file, but has since been moved to some other file; and my on-the-fly kretprobes knowledge apparently isn't good enough to figure out how to just print out a damn stack trace whenever `zvol_first_open` returns `-EINTR`.
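(If anyone more fluent wants to try, something like this out-of-tree kretprobe module is what I was reaching for; an untested sketch of my own, assuming the symbol is visible to kprobes, i.e. present in kallsyms and not inlined. Build it like any out-of-tree module, insmod, and watch dmesg.)

```c
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/errno.h>
#include <linux/printk.h>

/* Runs on every return from zvol_first_open; dump a kernel stack
 * trace whenever the function is returning -EINTR. */
static int
zvol_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	if ((long)regs_return_value(regs) == -EINTR) {
		pr_info("zvol_first_open returned -EINTR\n");
		dump_stack();
	}
	return 0;
}

static struct kretprobe zvol_krp = {
	.kp.symbol_name = "zvol_first_open",
	.handler = zvol_ret_handler,
};

static int __init zvol_probe_init(void)
{
	return register_kretprobe(&zvol_krp);
}

static void __exit zvol_probe_exit(void)
{
	unregister_kretprobe(&zvol_krp);
}

module_init(zvol_probe_init);
module_exit(zvol_probe_exit);
MODULE_LICENSE("GPL");
```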
### Other issues and PRs of similarity or relevance
- zpool command hangs when attempting to create a new pool on a zvol #612: ancient issue concerning deadlocks when using zvols as vdevs (for whatever pathological reason someone would want to do such a thing); resolved by 65d5608, which originally introduced the `-ERESTARTSYS` mechanism to `zvol_first_open`
- Fix zpool on zvol deadlock #1220: PR a month after the above, tweaking the code a bit
- Zpool import crashes the single CPU machines. #6283: issue concerning deadlocks on 1-CPU systems due to endless `-ERESTARTSYS` retries without any possibility for task switching to take place and resolve the lock situation
- Reschedule processes on error. #6312: PR fixing the above issue by adding an explicit `schedule()` call in the `-ERESTARTSYS` case
- zpool create with zvols can hang when racing with udev #7773: issue concerning lock contention and `-ERESTARTSYS` in `zvol_open` (never resolved, closed as stale)
### Information that might kinda help reproduce the problem maybe perhaps

I don't know if I have any specific reliable way to reproduce the problem.

The best I can say is that my pool has a lot of zvols (dozens to hundreds, particularly when `snapdev=visible`), and that I see this pretty consistently whenever a bunch of `/dev/zd*` nodes come into existence at the same time and make systemd-udevd spin up a crapton of workers that all attempt to access some large number of those dev nodes at the same time (undoubtedly running into some degree of lock contention).
I don't do any exotic things like attempting to build nested zfs contraptions involving using zvols as vdevs. (I generally regard myself as a sane person.)
I use a custom kernel that has, among other things, `CONFIG_PREEMPT=y` and `CONFIG_HZ=1000`. The involuntary preemption thing could be relevant; not entirely sure, to be honest.
This has been happening on an SMP system with multiple cores (AMD FX-8350).
If it's of any interest, I'm currently running systemd/udev version 249rc2. I believe this was happening on 248.3 as well, and possibly prior to that. I'm 95% sure I was also seeing this behavior back on kernel 5.12.2 (at which time I was using zfs 2babd20).
Oh yeah, and I guess that little udev rule file I posted earlier that enables the debug `log_level` for `/dev/zd*` dev nodes is probably a handy thing to have around if you're trying to reproduce this.
### Logs of interest

Nothing of particular note shows up in `/proc/spl/kstat/zfs/dbgmsg` (with `zfs_flags=1`).

What I've posted below, however, is some interesting udev `log_level=debug` output from the journal, isolated out for one particular zvol device that had difficulty when I switched `snapdev=hidden` to `snapdev=visible` on a particular volume.

In this case I was using an unmodified `zvol_id` binary that didn't have any printouts added to indicate e.g. what `errno` was; but you can see it from the `udisks-part-id` invocation that also failed on the `open` syscall.

```
systemd-udevd[3765278]: zd736: The log level is changed to 'debug' while processing device (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/50-udev-default.rules:69 GROUP 6
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 Running PROGRAM '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting '/usr/lib/udev/zvol_id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765289.
systemd-udevd[3765278]: zd736: '/usr/lib/udev/zvol_id /dev/zd736'(out) 'Unable to open device file: /dev/zd736'
systemd-udevd[3765278]: zd736: Process '/usr/lib/udev/zvol_id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'zvol/Unable'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'to'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'open'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'device'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK 'file:'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/60-zvol.rules:6 LINK '/dev/zd736'
systemd-udevd[3765278]: zd736: /usr/lib/udev/rules.d/80-udisks.rules:84 Importing properties from results of 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: zd736: Starting 'udisks-part-id /dev/zd736'
systemd-udevd[3765278]: Successfully forked off '(spawn)' as PID 3765294.
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'using device_file=/dev/zd736 syspath=/sys/devices/virtual/block/zd736, offset=0 ao=0 and number=0 for /dev/zd736'
systemd-udevd[3765278]: zd736: 'udisks-part-id /dev/zd736'(err) 'Error opening /dev/zd736: Unknown error 512'
systemd-udevd[3765278]: zd736: Process 'udisks-part-id /dev/zd736' succeeded.
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Setting permissions /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Creating symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: sd-device: Created db file '/run/udev/data/b230:736' for '/devices/virtual/block/zd736'
systemd-udevd[3765278]: zd736: Handling device node '/dev/zd736', devnum=b230:736
systemd-udevd[3765278]: zd736: Preserve permissions of /dev/zd736, uid=0, gid=6, mode=0660
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/file:'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/file:' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/open'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/open' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/zvol\x2fUnable'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/zvol/Unable' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/to'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/to' to 'zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/dev\x2fzd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/dev/zd736' to '../zd736'
systemd-udevd[3765278]: zd736: Found 'b230:736' claiming '/run/udev/links/device'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/device' to 'zd736'
systemd-udevd[3765278]: zd736: Preserve already existing symlink '/dev/block/230:736' to '../zd736'
systemd-udevd[3765278]: zd736: Adding watch on '/dev/zd736'
systemd-udevd[3765278]: zd736: Device processed (SEQNUM=8400, ACTION=add)
systemd-udevd[3765278]: zd736: sd-device-monitor: Passed 757 byte to netlink monitor
```
### In case you read the log above and are now thinking "wtf are those symlinks?!"
In the log sequence posted above, it's hard not to notice that another idiotic thing is going on. Because there are some goofs in the udev rules file and in `zvol_id`, the nonzero exit status of 512 from `zvol_id` is modulo'd into 0 (a truncation demonstrated by the little toy program after the list below), which of course is then treated as a "success", meaning that the contents of stdout are not ignored, but are consumed normally (as if they were the name of the `/dev/zvol/*` symlink to create). And then `zvol_id`, for reasons I don't particularly understand, only ever prints stuff (including its error messages) to stdout; combine that with the wonders of space-separated list syntax, and you wind up with udev deciding that the rules are requesting that the following symlinks be created:
```
/dev/zvol/Unable -> /dev/zd736
/dev/to -> /dev/zd736
/dev/open -> /dev/zd736
/dev/device -> /dev/zd736
/dev/file: -> /dev/zd736
/dev//dev/zd736 -> /dev/zd736
```
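The exit-status truncation mentioned above is easy to demonstrate in isolation; this toy program of mine (nothing from udev or zfs) mimics a child exiting with a raw 512:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	int status;
	pid_t pid = fork();

	if (pid == 0)
		exit(512);	/* out-of-range status, e.g. a raw errno value */

	waitpid(pid, &status, 0);
	/* Only the low 8 bits survive; this prints 0, i.e. "success". */
	printf("child exit status: %d\n", WEXITSTATUS(status));
	return 0;
}
```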
(I plan to address this in a separate PR that I have pretty much all lined up and ready to go already. The `60-zvol.rules` file is extremely ancient; like, it's literally a decade old. And it uses a fairly impressive number of deprecated syntax quirks for a mere one-liner file. Most of the code for `zvol_id` is about a decade old as well, and does dumb stuff like printing error messages to stdout. So yeah, I'll be taking care of all of this particular dumbness myself shortly. That PR will fix the bizarre symlinks symptom that's been going on; but it doesn't do anything to address the core mystery of "why the hell are these `zvol_open` calls failing, and in such a way that `errno=ERESTARTSYS` is getting to userspace?")

(This is now up as PR #12302.)
### Stuff I can do to help

If this issue turns out to be a giant pain to reproduce, I'm happy to do stuff like flipping on `zfs_flags` bits, dumping various logs, adding extra debug printouts in particular places, and creating/building/running custom one-off builds of the zfs module, which I'll use on my personal system with my actual data with reckless abandon; basically whatever helps track this down. 😄