As we discussed in the support bundle huddle (admittedly only tangentially related to support bundles, but if you want the recording, it's from that meeting on May 22nd):
Keep the behavior where a blkdev driver unbinding implies "the device is gone", but also add the equivalent of a "flush" command after omicron_set_config creates disks. Effectively: (1) ask the hardware manager to re-synchronize (which may then trigger messages to the StorageManager), and (2) ask the StorageManager to "synchronize all disks" (it does this periodically anyway).
If we do this after ensuring physical disks, but before requesting datasets, the dataset operation should see that all initialized disks are back online.
This solution is fine, but it still means that "disk was removed" messages get sporadically fired off whenever the blkdev driver is rebound, which is unfortunate. We're not proceeding down this path unless there's an issue with the align-on-NVMe approach.
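A minimal sketch of that ordering, with stub types: none of these method names are the real sled-agent API, this just shows where the two "flush" steps would sit between ensuring disks and ensuring datasets.

```rust
// Hypothetical sketch of the proposed "flush" ordering. SledAgent and all of
// its methods here are stubs, not the actual omicron sled-agent API.
struct SledAgent;

impl SledAgent {
    fn physical_disks_ensure(&self) -> Result<(), String> { Ok(()) }
    fn hardware_manager_rescan(&self) -> Result<(), String> { Ok(()) }
    fn storage_manager_synchronize(&self) -> Result<(), String> { Ok(()) }
    fn datasets_ensure(&self) -> Result<(), String> { Ok(()) }

    /// The proposed config-handler ordering.
    fn omicron_set_config(&self) -> Result<(), String> {
        // 1. Ensure physical disks; this may detach, reformat, and reattach
        //    blkdev instances, so the device tree can flap here.
        self.physical_disks_ensure()?;
        // 2. Ask the hardware manager to rescan the device tree now, so any
        //    removals/additions caused by step 1 are observed immediately.
        self.hardware_manager_rescan()?;
        // 3. Ask the storage manager to synchronize disks now, rather than
        //    waiting for its periodic pass, re-adopting re-added devices.
        self.storage_manager_synchronize()?;
        // 4. Only now ensure datasets: initialized disks are back online.
        self.datasets_ensure()
    }
}

fn main() {
    assert!(SledAgent.omicron_set_config().is_ok());
}
```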
It started during RSS on a new 32-sled rack, where we hit this failure:
Before starting RSS, we verified all sleds were seen, and all sleds had disks present (via diskinfo).
Logs gathered can be found here: /staff/core/omicron-8203
Interleaved here is an edited chat log with highlights of the debug we walked through.
john (he/him):
(cc sean (he/him), who has more context) I'm pretty confused by this failure, on its face. It's reported by RSS here:
omicron/sled-agent/src/rack_setup/service.rs
Lines 429 to 450 in d4e0bc1
the error came from some sled here, I think:
omicron/sled-storage/src/manager.rs
Lines 1507 to 1518 in d4e0bc1
but based on the RSS check, we just successfully set up all the disks
so I'm not sure why we would be missing them when we go to do datasets immediately after
("immediately after" comes from the http handler for putting a new config: we set up the disks then immediately set up the datasets, but only if disks all succeeded:
omicron/sled-agent/src/sled_agent.rs
Lines 978 to 1000 in d4e0bc1
let datasets = self.datasets_ensure(datasets_config).await?;
// If we only had partial success deploying datasets, don't proceed.
if datasets.has_error() {
    return Ok(OmicronSledConfigResult {
        disks: disks.status,
        datasets: datasets.status,
    });
}
alan:
We did find the pools were created on sled 26:
sean (he/him):
This looks like it's invoking
omicron/sled-agent/src/sled_agent.rs
Lines 978 to 1008 in d4e0bc1
which, as you said, should be setting up disks, reporting zero errors, and then attempting to set up datasets
I'm going to dig into the code more closely, but the ordering here should be okay, unless some part of "disk setup" is returning too early and racing with dataset initialization
john (he/him):
possible smoking gun in sled-agent.log
at the end of ensuring physical disks, we see 10 successes:
but the next couple lines, we see two disks removed:
john (he/him):
looks like that tracks:
those two serials do match the pool IDs that we see failures for
so: sled-agent successfully started managing all 10 disks, then thinks those two disks went away
sean (he/him): so, next question: Why did we see "raw disk removal"
john (he/him): it looks like one of them then immediately came back?
but maybe we didn't "re-manage" it in time
sean (he/him):
I think the other one also came back (serial: 23470B801016), within about a second
sean (he/him):
To lay out a timeline:
01:48:41.541Z INFO SledAgent: physical disks ensure - I think we started the "set omicron config" call here
01:48:46.802Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Device with serial 23470B800536 was observed removed from the device tree
01:48:51.869Z INFO SledAgent (StorageResources): Ensuring zpool has datasets - Interestingly, I see additional calls to "Ensure zpool has datasets" which are continuing to operate on the disk which was "removed" (serial: 23470B800536). I don't think the removal interrupts the "physical disks ensure" call, and the operations are succeeding, so it's still going. This includes: setting the LBA on that disk to 4096, doing other setup, etc.
01:48:54.309Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - We see the disk come back, with serial 23470B800536.
01:50:27.018Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B800991
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B800991
01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval - We see the disk with serial 23470B800536 go away again.
01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDisk - We see the same disk with serial 23470B800536 come back
01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - We think the disk exists again. I see the serial in the result output for DisksManagementResult.
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B801016
01:50:53.685Z INFO SledAgent: datasets ensure - We try to ensure datasets. I see the "Failed to find zpool" errors here, in particular, for the zpool with UUID 8809d9a9-1544-42d5-b3b7-28129a2f6773.
Digging back further, this is the zpool associated with the disk with serial 23470B801016 - this was removed earlier, and was just detected by the device tree again
01:51:02.863Z INFO SledAgent (StorageResources): Managing disk - I see the sled agent trying to manage the disk with serial 23470B801016; I think this is a downstream attempt to re-synchronize the disks. Looks successful.
01:51:03.147Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - Finishes the "disk synchronization" from earlier
TL;DR:
We asked the sled agent to manage the disks it could, this seemed to happen successfully
The device tree was flapping a lot, showing multiple devices being added, removed, and re-added. Weirdly, while the devices were missing from the device tree, they still appeared to be mutable.
While one of those devices was "removed", we asked to put datasets on it. That failed, which I believe is currently expected, because we consider "managed" devices to be ones that we see in the configuration + ones that we think are physically plugged in.
The disk later re-appeared, and was "managed" by the control plane again, but this was after the request to put datasets on it failed.
rm:
Just thinking out loud for a moment, to do the namespace reformat we actually will detach the blkdev instance to issue the format and then once that's done it'll come back.
So that is a way we can see that happen without the NVMe-level device disappearing.
But I don't know where we do that kind of disk device manipulation in that timeline or if relevant. Just trying to think out loud here and what little we can get from what's in here. It's certainly possible (though weird) that the NVMe device went away and came back.
rm:
It's specifically doing a blkdev detach so it can reformat.
The problem is that we have to do something that destroys all the data on the device and renumbers all the sectors.
So we can't do it while it is up.
mike:
Not at the laptop at the moment, but yea detach, format, attach
rm:
So if rescanning and this operation are independent from the hardware monitor's loop we're going to be a little split brain.
Do we have logs around the reformat actions that we might be able to correlate?
sean (he/him):
We do, I think this roughly lines up with the timeline I posted earlier
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:36.228Z INFO SledAgent (StorageResources): Resized 23470B801016 from 3840 to 3200
01:50:40.627Z INFO SledAgent (StorageResources): Formatted disk with serial 23470B801016 to an LBA with data size 4096
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded
So: yeah, looks like the formatting happened in the window between "device removed" and "device added"
sean (he/him):
So this makes more sense:
We decide "what disks are plugged in" by scanning the device tree in the HardwareMonitor
This information exists in the Sled's inventory, and gets plumbed to RSS
When RSS asks for a disk to be used, we call ensure_partition_layout -> if it's a U.2, we call ensure_size_and_formatting, which can remove the device from the device tree
If we poll the device tree during this period of removal, this is interpreted as "this disk has been removed"
When the disk later re-appears in the device tree, we'll try to manage it again, and "re-synchronize" the requested config with the disks observed in the device tree
However, if datasets are requested while we are not synchronized (which can happen as part of formatting the disk, when the block device is detached), they'll show up as "not found"
sean (he/him):
To answer the question "Why is the disk not visible immediately after it's re-added to the device tree", the answer is basically: "Monitoring hardware and propagating that info to the sled agent is asynchronous".
In this particular case, we may want to add messaging to force things to be more synchronous -- e.g., after formatting any disks, ensure we're operating on an up-to-date view of the device tree, and ensure that the storage manager's state is synchronized. By "synchronized", I mean "at least up-to-date enough that it can see the block device which has been re-added".
sean (he/him):
I think the timeline is:
ensure_partition_layout removes the device, formats it.
If the HardwareMonitor scans the device tree, it'll see the device as missing. This fires off a message to the "storage manager" part of the sled agent, telling it "this disk is gone".
If the HardwareMonitor scans the device tree and sees the disk again, it'll send a message to the "storage manager" telling it "this disk was just added, adopt it into the control plane at your convenience".
Every ten seconds, the storage manager "synchronizes disks", which means: it reads the configured disks, it looks at the disks it thinks are online, and it adopts the disks it thinks should be part of the control plane.
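To make the race concrete, here's a toy model (all types and names here are hypothetical, not the actual sled-agent code): the manager's view of online disks only updates when a synchronize pass runs, so a dataset request that lands between a "disk removed" observation and the next pass fails even though the disk is physically back.

```rust
use std::collections::HashSet;

/// Toy model of the storage manager's view (hypothetical, not sled-agent's API).
struct ToyStorageManager {
    /// Disks the control plane has been asked to manage.
    configured: HashSet<&'static str>,
    /// Disks currently adopted (the last synchronized view).
    managed: HashSet<&'static str>,
}

impl ToyStorageManager {
    /// The periodic pass: adopt every configured disk the device tree reports.
    fn synchronize(&mut self, device_tree: &HashSet<&'static str>) {
        self.managed = self.configured.intersection(device_tree).copied().collect();
    }

    /// A dataset request only succeeds on a disk we currently manage.
    fn datasets_ensure(&self, serial: &str) -> Result<(), String> {
        if self.managed.contains(serial) {
            Ok(())
        } else {
            Err(format!("Failed to find zpool for disk {serial}"))
        }
    }
}

fn main() {
    let mut tree: HashSet<_> = ["23470B801016"].into();
    let mut mgr = ToyStorageManager {
        configured: ["23470B801016"].into(),
        managed: HashSet::new(),
    };

    mgr.synchronize(&tree); // disk adopted
    assert!(mgr.datasets_ensure("23470B801016").is_ok());

    // blkdev detach for the reformat: device vanishes from the tree...
    tree.remove("23470B801016");
    mgr.synchronize(&tree); // ...and a poll in that window drops the disk
    tree.insert("23470B801016"); // reformat done, device is back

    // A dataset request arriving before the next periodic synchronize fails:
    assert!(mgr.datasets_ensure("23470B801016").is_err());

    mgr.synchronize(&tree); // the next pass re-adopts it
    assert!(mgr.datasets_ensure("23470B801016").is_ok());
}
```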
rm:
I think if we scanned NVMe devices rather than blkdev we could probably avoid that.
(And wouldn't get tripped up by a device missing blkdev namespaces, or having too many)
sean (he/him):
Yeah, this makes sense to me
which would basically mean: from the perspective of the sled agent, reformatting the disk does not cause it to "go away"