RSS on a new rack failed: Errors initializing datasets #8203

Open · leftwo opened this issue May 22, 2025 · 2 comments
leftwo commented May 22, 2025

It started during RSS on a new 32-sled rack, where we hit this failure:

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮                                  
│ Current Rack Setup Status                                                                                                           │  
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Status: Initialization Failed                                                                                                       │ 
│ Last initialization operation ID: 116f7904-35dd-4301-a895-139f71c603f4                                                              │                                  
│ Message: Errors initializing datasets: Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crucible: Zpool Not      │ 
│ Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug:   │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset                                               │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/debug: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing │
│ dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone/oxz_crucible_65a0d412-2bc5-4b37-9f57-d5830d688c86: Zpool Not Found:     │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crucible: Zpool Not   │
│ Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone:    │
│ Zpool Not Found: oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773, Error initializing dataset                                               │
│ oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414/crypt/zone: Zpool Not Found: oxp_6ee32179-9c93-4ad2-9816-05d8e5a9b414, Error initializing  │
│ dataset oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone/oxz_crucible_586474b6-bcc3-4013-894c-6422ed120cef: Zpool Not Found:     │
│ oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773    

Before starting RSS, we verified that all sleds were seen and that all sleds had disks present (via diskinfo).

Logs gathered can be found here: /staff/core/omicron-8203
Interleaved here is an edited chat log with highlights of the debug we walked through.

john (he/him):
(cc sean (he/him), who has more context) I'm pretty confused by this failure, on its face. It's reported by RSS here:

// No individual disk errors reported; all disks were
// initialized. Check for any dataset errors; these are not
// retryable.
let dataset_errors = response
    .datasets
    .into_iter()
    .filter_map(|status| {
        status.err.map(|err| {
            format!(
                "Error initializing dataset {}: {err}",
                status.dataset_name.full_name()
            )
        })
    })
    .collect::<Vec<_>>();
if !dataset_errors.is_empty() {
    return Err(BackoffError::permanent(
        SetupServiceError::DatasetInitialization {
            errors: dataset_errors,
        },
    ));
}

the error came from some sled here, I think:
// We can only place datasets within managed disks.
// If a disk is attached to this sled, but not a part of the Control
// Plane, it is treated as "not found" for dataset placement.
if !self
    .resources
    .disks()
    .iter_managed()
    .any(|(_, disk)| disk.zpool_name() == zpool)
{
    warn!(self.log, "Failed to find zpool");
    return Err(Error::ZpoolNotFound(format!("{}", zpool)));
}

but based on the RSS check, we just successfully set up all the disks,
so I'm not sure why we would be missing them when we go to set up datasets immediately after
("immediately after" comes from the HTTP handler for putting a new config: we set up the disks, then immediately set up the datasets, but only if all disks succeeded:
// Until the config-reconciler work lands: unpack the unified config
// into the three split configs for independent ledgering.
let disks_config = OmicronPhysicalDisksConfig {
    generation: config.generation,
    disks: config.disks.into_iter().collect(),
};
let disks = self.omicron_physical_disks_ensure(disks_config).await?;
// If we only had partial success deploying disks, don't proceed.
if disks.has_error() {
    return Ok(OmicronSledConfigResult {
        disks: disks.status,
        datasets: Vec::new(),
    });
}
let datasets_config = DatasetsConfig {
    generation: config.generation,
    datasets: config.datasets.into_iter().map(|d| (d.id, d)).collect(),
};
let datasets = self.datasets_ensure(datasets_config).await?;
)

alan:
We did find the pools were created on sled 26:

support@oxz_switch0:~$ pilot host exec -c 'zfs list | grep oxp_8809d9a9' 0-31                                                                                            
...                                                                                                                                                                      
26  BRM06240015        ok:                                                                                                                                               
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773              1.44M  2.82T  96K  /oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773                                                       
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt         648K  2.82T 248K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt                                                                                                                                                                                                                     
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug   200K   100G 200K  /pool/ext/8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/debug                                                                                                                                                                                                               
oxp_8809d9a9-1544-42d5-b3b7-28129a2f6773/crypt/zone                                                                                                                                                                                                                                                                                               

sean (he/him):
(quoting john) "the error came from some sled here, I think:"

// We can only place datasets within managed disks.
// If a disk is attached to this sled, but not a part of the Control
// Plane, it is treated as "not found" for dataset placement.
if !self
    .resources
    .disks()
    .iter_managed()
    .any(|(_, disk)| disk.zpool_name() == zpool)
{
    warn!(self.log, "Failed to find zpool");
    return Err(Error::ZpoolNotFound(format!("{}", zpool)));
}

This looks like it's invoking

// Until the config-reconciler work lands: unpack the unified config
// into the three split configs for independent ledgering.
let disks_config = OmicronPhysicalDisksConfig {
    generation: config.generation,
    disks: config.disks.into_iter().collect(),
};
let disks = self.omicron_physical_disks_ensure(disks_config).await?;
// If we only had partial success deploying disks, don't proceed.
if disks.has_error() {
    return Ok(OmicronSledConfigResult {
        disks: disks.status,
        datasets: Vec::new(),
    });
}
let datasets_config = DatasetsConfig {
    generation: config.generation,
    datasets: config.datasets.into_iter().map(|d| (d.id, d)).collect(),
};
let datasets = self.datasets_ensure(datasets_config).await?;
// If we only had partial success deploying datasets, don't proceed.
if datasets.has_error() {
    return Ok(OmicronSledConfigResult {
        disks: disks.status,
        datasets: datasets.status,
    });
}
which, as you said, should be setting up disks, reporting zero errors, and then attempting to set up datasets.
I'm going to dig into the code more closely, but the ordering here should be okay, unless some part of "disk setup" is returning too early and racing with dataset initialization

john (he/him):
possible smoking gun in sled-agent.log
at the end of ensuring physical disks, we see 10 successes:

01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error                                                                                                                                                                                                                                                      
    file = sled-storage/src/manager.rs:709                                                                                                                                                                                                                                                                                                        
    result = DisksManagementResult { status: [DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800536" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800633" }, err: None }, DiskManagementStatus { identity: D
iskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800674" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800723" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800724" }
, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800895" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800985" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", m
odel: "WUS5EA138ESP7E3", serial: "23470B800991" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }, err: None }, DiskManagementStatus { identity: DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801059" }, err: None }] }               

but the next couple lines, we see two disks removed:

01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", mo
del: "WUS5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }                                            
    file = sled-storage/src/manager.rs:615                                          
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk                      
    file = sled-storage/src/resources.rs:569                                        
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }                                                                         
01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,3/pci1b96,2722@0/blkdev@w0014EE8401D0C500,0", dev_path: Some("/dev/dsk/c7t0014EE8401D0C500d0") }, slot: 7, variant: U2, identity: DiskIdentity { vendor: "1b96", mo
del: "WUS5EA138ESP7E3", serial: "23470B801016" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }                                            
    file = sled-storage/src/manager.rs:615                                          
01:50:52.920Z INFO SledAgent (StorageResources): Removing disk                      
    file = sled-storage/src/resources.rs:569                                        
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B801016" }                                                                         

john (he/him):
looks like that tracks:

    {                                                                               
      "identity": {                                                                 
        "vendor": "1b96",                                                           
        "model": "WUS5EA138ESP7E3",                                                 
        "serial": "23470B800991"                                                    
      },                                                                            
      "id": "ece45ca3-f756-43bc-836e-425e650fe194",                                 
      "pool_id": "6ee32179-9c93-4ad2-9816-05d8e5a9b414"                             
    },                                                                              
    {                                                                               
      "identity": {                                                                 
        "vendor": "1b96",                                                           
        "model": "WUS5EA138ESP7E3",                                                 
        "serial": "23470B801016"                                                    
      },                                                                            
      "id": "c873bf6a-bcb3-45fd-95a1-1c0362861289",                                 
      "pool_id": "8809d9a9-1544-42d5-b3b7-28129a2f6773"                             
    },                                                                              

those two serials do match the pool IDs that we see failures for
so: sled-agent successfully started managing all 10 disks, then thinks those two disks went away
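
As a minimal illustration (not omicron code; the helper `affected_serials` is hypothetical and uses serde_json), the correlation above can be done mechanically: filter the ledgered disk entries by the pool IDs from the RSS error message and read off the serials.

// Minimal sketch, not omicron code: match failing pool IDs to disk serials.
use serde_json::Value;

fn affected_serials(disks: &[Value], failing_pools: &[&str]) -> Vec<String> {
    disks
        .iter()
        .filter(|d| {
            d["pool_id"]
                .as_str()
                .map(|p| failing_pools.contains(&p))
                .unwrap_or(false)
        })
        .filter_map(|d| d["identity"]["serial"].as_str().map(str::to_string))
        .collect()
}

fn main() {
    // The two ledgered entries quoted above.
    let disks: Vec<Value> = serde_json::from_str(
        r#"[
          {"identity": {"vendor": "1b96", "model": "WUS5EA138ESP7E3", "serial": "23470B800991"},
           "id": "ece45ca3-f756-43bc-836e-425e650fe194",
           "pool_id": "6ee32179-9c93-4ad2-9816-05d8e5a9b414"},
          {"identity": {"vendor": "1b96", "model": "WUS5EA138ESP7E3", "serial": "23470B801016"},
           "id": "c873bf6a-bcb3-45fd-95a1-1c0362861289",
           "pool_id": "8809d9a9-1544-42d5-b3b7-28129a2f6773"}
        ]"#,
    )
    .unwrap();
    let failing = [
        "8809d9a9-1544-42d5-b3b7-28129a2f6773",
        "6ee32179-9c93-4ad2-9816-05d8e5a9b414",
    ];
    // Prints ["23470B800991", "23470B801016"] -- the serials seen flapping below.
    println!("{:?}", affected_serials(&disks, &failing));
}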

sean (he/him): so, next question: why did we see "raw disk removal"?

john (he/him): it looks like one of them then immediately came back?

01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "W
US5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }                                                   
    file = sled-storage/src/manager.rs:615                                          
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk                     
    file = sled-storage/src/resources.rs:491                                        
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }                                                                         

but maybe we didn't "re-manage" it in time

sean (he/him):
I think the other one also came back (serial: 23470B801016), within about a second

01:50:52.920Z INFO SledAgent (StorageManager): Received DetectedRawDisk { raw_disk: Real(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,2722@0/blkdev@w0014EE8401D00800,0", dev_path: Some("/dev/dsk/c6t0014EE8401D00800d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", model: "W
US5EA138ESP7E3", serial: "23470B800991" }, is_boot_disk: false, firmware: DiskFirmware { active_slot: 2, next_active_slot: Some(2), slot1_read_only: true, number_of_slots: 4, slot_firmware_versions: [Some("RC610004"), Some("RC610007"), Some("RC610004"), Some("RC610004")] } }), tx: ... }                                                   
    file = sled-storage/src/manager.rs:615                                          
01:50:52.920Z INFO SledAgent (StorageResources): Inserting disk                     
    file = sled-storage/src/resources.rs:491                                        
    identity = DiskIdentity { vendor: "1b96", model: "WUS5EA138ESP7E3", serial: "23470B800991" }                                                                         

sean (he/him):
To lay out a timeline:

01:48:41.541Z INFO SledAgent: physical disks ensure - I think we started the "set omicron config" call here
01:48:46.802Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Device with serial 23470B800536 was observed removed from the device tree
01:48:51.869Z INFO SledAgent (StorageResources): Ensuring zpool has datasets - Interestingly, I see additional calls to "Ensure zpool has datasets" which continue to operate on the disk which was "removed" (serial: 23470B800536). I don't think the removal interrupts the "physical disks ensure" call, and the operations are succeeding, so it's still going. This includes: setting the LBA on that disk to 4096, doing other setup, etc.
01:48:54.309Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - We see the disk come back, with serial 23470B800536.
01:50:27.018Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B800991
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B800991
01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDiskRemoval - We see the disk with serial 23470B800536 go away again.
01:50:52.556Z INFO SledAgent (StorageManager): Received DetectedRawDisk - We see the same disk with serial 23470B800536 come back
01:50:52.920Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - We think the disk exists again. I see the serial in the result output for DisksManagementResult.
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded - Addition of disk with serial 23470B801016
01:50:53.685Z INFO SledAgent: datasets ensure - We try to ensure datasets. I see the "Failed to find zpool" errors here, in particular, for the zpool with UUID 8809d9a9-1544-42d5-b3b7-28129a2f6773.

Digging back further, this is the zpool associated with the disk with serial 23470B801016 - this was removed earlier, and was just detected by the device tree again
01:51:02.863Z INFO SledAgent (StorageResources): Managing disk - I see the sled agent trying to manage the disk with serial 23470B801016 ; I think this is a down-stream attempt to re-synchronize the disks. Looks successful.
01:51:03.147Z INFO SledAgent (StorageManager): Successfully synchronized disks without error - Finishes the "disk synchronization" from earlier

TL;DR:

  • We asked the sled agent to manage the disks it could; this seemed to happen successfully.
  • The device tree was flapping a lot, showing multiple devices being added, removed, and re-added. Weirdly, while the devices were missing from the device tree, they still appeared to be mutable.
  • While one of those devices was "removed", we asked to put datasets on it. That failed, which I believe is currently expected, because we consider "managed" devices to be ones that we see in the configuration + ones that we think are physically plugged in.
  • The disk later re-appeared, and was "managed" by the control plane again, but this was after the request to put datasets on it failed.

rm:
Just thinking out loud for a moment: to do the namespace reformat, we actually detach the blkdev instance to issue the format, and once that's done it comes back.
So that is a way we can see this happen without the NVMe-level device disappearing.
But I don't know where we do that kind of disk device manipulation in that timeline, or whether it's relevant. Just trying to think out loud here with what little we can get from what's in here. It's certainly possible (though weird) that the NVMe device went away and came back.

rm:
It's specifically doing a blkdev detach so it can reformat.
The problem is that we have to do something that destroys all the data on the device and renumbers all the sectors.
So we can't do it while it is up.

mike:
Not at the laptop at the moment, but yea detach, format, attach

rm:
So if rescanning and this operation are independent from the hardware monitor's loop we're going to be a little split brain.
Do we have logs around the reformat actions that we might be able to correlate?

sean (he/him):
We do, I think this roughly lines up with the timeline I posted earlier
01:50:32.936Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskRemoved - Removal of disk with serial 23470B801016
01:50:36.228Z INFO SledAgent (StorageResources): Resized 23470B801016 from 3840 to 3200
01:50:40.627Z INFO SledAgent (StorageResources): Formatted disk with serial 23470B801016 to an LBA with data size 4096
01:50:53.336Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded
So: yeah, looks like the formatting happened in the window between "device removed" and "device added"

sean (he/him):
So this makes more sense:

  • We decide "what disks are plugged in" by scanning the device tree in the HardwareMonitor
  • This information exists in the Sled's inventory, and gets plumbed to RSS
  • When RSS asks for a disk to be used, we call ensure_partition_layout -> if it's a U.2, we call ensure_size_and_formatting, which can remove the device from the device tree
  • If we poll the device tree during this period of removal, this is interpreted as "this disk has been removed"
  • When the disk later re-appears in the device tree, we'll try to manage it again, and "re-synchronize" the requested config with the disks observed in the device tree
  • However, if datasets are requested while we are not synchronized (which can happen as a part of formatting the disk, when the block device is detached), they'll show up as "not found"

sean (he/him):
To answer the question "Why is the disk not visible immediately after it's re-added to the device tree", the answer is basically: "Monitoring hardware and propagating that info to the sled agent is asynchronous".

In this particular case, we may want to add messaging to force things to be more synchronous -- e.g., after formatting any disks, ensure we're operating on an up-to-date view of the device tree, and ensure that storage manager state is synchronized. By "synchronized", I mean "at least up-to-date enough that it can see the block device which has been re-added".

sean (he/him):
I think the timeline is:

  • ensure_partition_layout removes the device, formats it.
  • If HardwareMonitor scans the device tree, it'll see the device as missing. This fires off a message to the "storage manager" part of the sled agent, telling it "this disk is gone".
  • If the HardwareMonitor scans the device tree and sees the disk again, it'll send a message to the "storage manager" telling it "this disk was just added, adopt it into the control plane at your convenience"
  • Every ten seconds, the storage manager "synchronizes disks", which means: it reads the configured disks, it looks at the disks that it thinks are online, and it adopts the disks it thinks should be part of the control plane.
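
To make the race concrete, here is a much-simplified model of the flow described above (the types `HardwareEvent` and `StorageManagerModel` are illustrative stand-ins, not the real StorageManager/HardwareMonitor code): hardware add/remove events and a periodic synchronization tick both feed one task, so a dataset request that arrives while a configured disk is absent from the observed set is answered with "not found".

// Simplified model only; hypothetical types, assuming tokio for the event loop.
use std::collections::HashSet;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::interval;

enum HardwareEvent {
    DiskAdded(String),   // serial observed in the device tree
    DiskRemoved(String), // serial no longer observed
}

struct StorageManagerModel {
    configured: HashSet<String>, // disks the control plane asked us to manage
    observed: HashSet<String>,   // disks currently visible in the device tree
    managed: HashSet<String>,    // what we actually operate on, post-sync
}

impl StorageManagerModel {
    // "Synchronize disks": adopt every configured disk we can currently see.
    fn synchronize(&mut self) {
        self.managed =
            self.configured.intersection(&self.observed).cloned().collect();
    }

    // A dataset request against a disk that is not (yet) managed fails,
    // which is the analogue of Error::ZpoolNotFound in the code quoted earlier.
    fn ensure_dataset(&self, serial: &str) -> Result<(), String> {
        if self.managed.contains(serial) {
            Ok(())
        } else {
            Err(format!("Zpool Not Found (disk {serial} not managed)"))
        }
    }

    async fn run(mut self, mut events: mpsc::Receiver<HardwareEvent>) {
        let mut tick = interval(Duration::from_secs(10)); // "every ten seconds"
        loop {
            tokio::select! {
                Some(ev) = events.recv() => match ev {
                    HardwareEvent::DiskAdded(s) => { self.observed.insert(s); }
                    HardwareEvent::DiskRemoved(s) => { self.observed.remove(&s); }
                },
                _ = tick.tick() => self.synchronize(),
            }
        }
    }
}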

rm:
I think if we scanned NVMe devices rather than blkdev we could probably avoid that.
(And we wouldn't get tripped up by devices missing blkdev namespaces, or by devices with too many.)

sean (he/him):
Yeah, this makes sense to me
which would basically mean: from the perspective of the sled agent, reformatting the disk does not cause it to "go away"


leftwo commented May 22, 2025

Also related, or at least discussed: #5241


smklein commented May 22, 2025

As we discussed in the support bundle huddle (admittedly only tangentially related to support bundles, but if you want the recording, it's from that meeting on May 22nd):

There are two viable solutions to this problem.

  1. Fix #5241 ("StorageManger should search by NVMe node and not blkdev nodes"): make "blkdev driver going away and coming back" no longer send a message indicating that the disk has been removed and re-added. This is the approach I believe @papertigers is currently taking -- I'll assign the issue to him.

  2. Keep the behavior where a blkdev driver unbinding implies "the device is gone", but also add the equivalent of a "flush" command after omicron_set_config creates disks. Effectively: (1) ask the hardware manager to re-synchronize (which may then trigger messages to the StorageManager), and (2) ask the StorageManager to "synchronize all disks" (it does this periodically anyway). If we do this after ensuring physical disks, but before requesting datasets, the operation should see that "all initialized disks are back online". This solution is fine, but it still means that "disk was removed" messages get sporadically fired off when the blkdev driver is rebound, which is unfortunate. We're not moving forward with this path unless there's an issue with the align-on-NVMe approach.
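
For reference, a hedged sketch of what option 2 could look like, using stub types and a hypothetical resynchronize_disks call (none of these names are the real sled-agent API); the point is only the ordering relative to the handler quoted earlier: ensure disks, force a resynchronization against the device tree, and only then ensure datasets.

// Sketch only: stub types standing in for the real sled-agent structures.
struct DisksResult {
    errors: Vec<String>,
}

impl DisksResult {
    fn has_error(&self) -> bool {
        !self.errors.is_empty()
    }
}

struct SledAgentModel;

impl SledAgentModel {
    async fn physical_disks_ensure(&self) -> DisksResult {
        DisksResult { errors: Vec::new() }
    }

    // Hypothetical "flush": ask the hardware monitor to rescan and wait until
    // every configured disk is adopted by the storage manager again.
    async fn resynchronize_disks(&self) -> Result<(), String> {
        Ok(())
    }

    async fn datasets_ensure(&self) -> Result<(), String> {
        Ok(())
    }
}

async fn set_config(agent: &SledAgentModel) -> Result<(), String> {
    let disks = agent.physical_disks_ensure().await;
    if disks.has_error() {
        return Err("disk initialization failed".to_string());
    }
    // Without this step, a blkdev detach/attach during formatting can leave the
    // dataset request below racing against a stale view of the device tree.
    agent.resynchronize_disks().await?;
    agent.datasets_ensure().await
}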
