Support for parallel upload of large artifacts #573
Comments
Thanks for writing this up @syed! To capture my own thoughts on the discussion, it may make more sense to move the concept of the chunk index into the manifest. E.g.

```json
{
"mediaType": "application/vnd.oci.image.manifest.v1+json",
// ...
"layers": [
{
"mediaType": "application/vnd.example.llmco.chunked",
"digest": "sha256:8dbaf885340e047cecd34321def92f2dbbff99fe064295f3a2c507c424893606",
"size": 5000000,
"annotations": {
"com.example.llmco.chunked.full-digest": "sha256:1b278385074125759e2782b52bcbf6b24640769737c8003c832d0f12c4d56651"
}
},
{
"mediaType": "application/vnd.example.llmco.chunked",
"digest": "sha256:8b461e6703adccca9526545a1475ffa3370dd47a598ac96f02ffc21755d925ef",
"size": 4567123,
"annotations": {
"com.example.llmco.chunked.full-digest": "sha256:1b278385074125759e2782b52bcbf6b24640769737c8003c832d0f12c4d56651"
}
}
]
}
```

The full-digest annotations let you disambiguate multiple chunked blobs in the same manifest and validate the assembled result. The advantage of this approach is that existing registries and clients can continue to use the same registry blob and manifest APIs; only the tooling that consumes the artifact needs to know how to chunk and reassemble the content. The disadvantage is that it's not possible to pull the assembled blob directly from the registry, but we don't have a proposal that does that yet.
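As a rough sketch of what the consuming-tool side of that could look like: fetch each chunked layer in manifest order, stream it into the output while updating a running digest, and compare against the full-digest annotation at the end. `fetchBlob` is a hypothetical helper standing in for a normal blob pull; the media type and annotation key are the example values from the manifest above, and this assumes a single chunked artifact per manifest for brevity.

```go
package chunked

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// Layer mirrors the descriptor fields that matter here.
type Layer struct {
	MediaType   string
	Digest      string
	Size        int64
	Annotations map[string]string
}

const (
	chunkedMediaType = "application/vnd.example.llmco.chunked" // example media type from the manifest above
	fullDigestKey    = "com.example.llmco.chunked.full-digest" // example annotation key from the manifest above
)

// assembleChunked concatenates the chunked layers in manifest order and
// verifies the result against the shared full-digest annotation.
// fetchBlob is a hypothetical helper that streams one blob by digest.
func assembleChunked(layers []Layer, fetchBlob func(digest string) (io.ReadCloser, error), out io.Writer) error {
	h := sha256.New()
	var fullDigest string
	for _, l := range layers {
		if l.MediaType != chunkedMediaType {
			continue // non-chunked layers are handled as usual
		}
		fullDigest = l.Annotations[fullDigestKey]
		rc, err := fetchBlob(l.Digest)
		if err != nil {
			return err
		}
		// Write the chunk to the output and the running hash at the same time.
		if _, err := io.Copy(io.MultiWriter(out, h), rc); err != nil {
			rc.Close()
			return err
		}
		rc.Close()
	}
	got := "sha256:" + hex.EncodeToString(h.Sum(nil))
	if got != fullDigest {
		return fmt.Errorf("assembled digest %s does not match expected %s", got, fullDigest)
	}
	return nil
}
```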
The issue I see with this approach is
This might actually be a good self-regulating problem. If someone makes 10,000 chunks, the overhead of all of the connections needed to pull each chunk will exceed the benefit from concurrency on the push. I'd want to see some real-world testing of how much additional connections improve transfer speed and at what point the value of additional connections becomes a negative.
It moves the logic into the client, but I'd suggest that is client-side logic that needs to be written with either implementation. Longer term, I think a better alternative is a way for registry servers to communicate support for a concurrent (out-of-order) chunked blob push, combined with a hashing algorithm that can run in parallel across multiple segments of the blob. Effectively, the hashing algorithm could work as a merkle tree that tracks the hashes of many incomplete parts and, as ranges complete, propagates the summarized result up the tree to free up memory. The registry could even decide whether it will support concurrent chunks based on the digest algorithm of the blob being pushed.
I can get some numbers on this
What you are suggesting is a commutative hashing algorithm.
The hash of hashes is a merkle tree, and that's used at a more abstract level everywhere. I'd still want order to be important, so a tool implementing this would need to store as much as block_size * number_of_concurrent_chunks in memory to account for incomplete blocks. It would trade memory to gain speed and concurrency.
I'm not sure I follow. Let's say there's a 100 GB file and each chunk is 5 GB. Are you saying that the registry has to hold and allocate 100 GB of memory until all the chunks are uploaded?
The hash algorithm would probably default to a chunk size of maybe 50 kb, and each node of the tree could have perhaps 20 child nodes (just to make the math easy). The hash chunk size is independent of the network chunk size because they are optimizing different things (memory/cpu vs network concurrency/latency). Once all child node hashes are complete, or you finish a leaf chunk, you propagate up the hash of that node and free any child memory. The tree dynamically expands (on a logarithmic scale) to reach large offsets, and only allows the hash to be output when all but the last leaf is complete.

So, for example, you could start reading the first mb; you would fill up the leaves of the first node, and once the node is done it only needs to track the single hash of hashes. If you then started reading a chunk around the 50 mb offset, the tree would expand to a three-tier tree (1st tier = 1 mb, 2nd tier = 20 mb, 3rd tier = 400 mb, 4th tier = 8 gb, 5th tier = 160 gb, 6th tier = 3.2 tb), but you would only need to store the tail of any partial chunk if you started part way into the 50 kb, plus the tree state and the hash state of any leaves that aren't complete. For any nodes that have no data read yet, you wouldn't have a node, just a state in the parent node indicating no data.

This is off the top of my head, so more than likely someone else has already come up with this or something better. I hear blake3 has some concurrency support. I'm mainly throwing out that a hash algorithm could be designed to work with concurrent / out-of-order chunks, and without needing to wait for the last chunk to compute the full hash. And given an algorithm that handles that scenario, our change could be to the API, allowing concurrent uploads of a single blob, rather than redesigning the concept of a blob.
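This isn't the dynamic tree described above (and it isn't blake3), but a minimal flat hash-of-hashes sketch of the core idea: fixed-size chunks can be hashed concurrently and out of order, and the final digest is computed over the ordered list of chunk hashes, so only the per-chunk hashes need to be held in memory. The chunk size is an arbitrary illustration value.

```go
package parhash

import (
	"crypto/sha256"
	"sync"
)

const chunkSize = 64 * 1024 // illustration value; a real algorithm would pick this carefully

// HashOfHashes hashes each fixed-size chunk of data independently (possibly
// in parallel and out of order), then hashes the concatenation of the chunk
// hashes in order to produce a single root digest.
func HashOfHashes(data []byte) [32]byte {
	n := (len(data) + chunkSize - 1) / chunkSize
	chunkHashes := make([][32]byte, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := i * chunkSize
			end := start + chunkSize
			if end > len(data) {
				end = len(data)
			}
			chunkHashes[i] = sha256.Sum256(data[start:end]) // chunks can complete in any order
		}(i)
	}
	wg.Wait()

	// Summarize: hash the ordered chunk hashes to get the root digest.
	root := sha256.New()
	for _, h := range chunkHashes {
		root.Write(h[:])
	}
	var out [32]byte
	copy(out[:], root.Sum(nil))
	return out
}
```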
I see where you are going with this @sudo-bmitch. This can work with the parallel upload in the API. Two things I want to point out as a registry operator
Another point - we (Docker Hub) see a noticeable increase in download failures as blob size increases. Downloading 10 different 500 MB layers is more likely to succeed than one 5 GB layer. Clients could do this by using range requests on a single large blob, but since range requests are a
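For the download side, here's a rough sketch of pulling one large blob with concurrent HTTP Range requests, assuming the registry honors Range on blob GETs; `blobURL` and `partSize` are placeholders for illustration.

```go
package rangepull

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchRanged downloads a blob of the given size from blobURL using
// concurrent HTTP Range requests, writing each part at its offset.
// blobURL and partSize are placeholders for illustration.
func fetchRanged(blobURL string, size, partSize int64, out io.WriterAt) error {
	type part struct{ off, end int64 }
	var parts []part
	for off := int64(0); off < size; off += partSize {
		end := off + partSize - 1
		if end >= size {
			end = size - 1
		}
		parts = append(parts, part{off, end})
	}

	errs := make([]error, len(parts))
	var wg sync.WaitGroup
	for i, p := range parts {
		wg.Add(1)
		go func(i int, p part) {
			defer wg.Done()
			req, err := http.NewRequest(http.MethodGet, blobURL, nil)
			if err != nil {
				errs[i] = err
				return
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", p.off, p.end))
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				errs[i] = err
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusPartialContent {
				errs[i] = fmt.Errorf("range request not honored: %s", resp.Status)
				return
			}
			// Parts can finish in any order; write each at its own offset.
			buf, err := io.ReadAll(resp.Body)
			if err == nil {
				_, err = out.WriteAt(buf, p.off)
			}
			errs[i] = err
		}(i, p)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
```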
Adding my test results here for completeness. I did an upload test with S3 multipart upload from a VM in the same region as the S3 bucket, and did an OCI blob upload to ECR in the same region; these are the results:
The multipart upload is significantly faster than the OCI blob upload.
What if you split it into chunks and upload? Obviously this doesn't account for the work on the client to split/join, but how does that look performance-wise?
I did this test today: split the same 100 GB file into 1 GB chunks and uploaded them as layers in parallel
compared this to
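For reference, a rough sketch of the kind of client-side logic that test implies: split a large file into fixed-size chunks and push each chunk as its own layer with a bounded number of concurrent uploads. `pushBlob` is a hypothetical helper standing in for a normal OCI blob push; the chunk size matches the test above and the concurrency is arbitrary.

```go
package chunkpush

import (
	"io"
	"os"
	"sync"
)

const (
	chunkSize   = 1 << 30 // 1 GB chunks, as in the test above
	concurrency = 8       // arbitrary illustration value
)

// pushChunked splits the file into chunkSize pieces and pushes each piece as
// a separate blob using pushBlob, a hypothetical helper that performs a
// normal OCI blob upload. At most `concurrency` uploads run at once.
// (A real client would stream each chunk from its file offset instead of
// buffering a whole chunk in memory.)
func pushChunked(path string, pushBlob func(data []byte) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for {
		buf := make([]byte, chunkSize)
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			chunk := buf[:n]
			sem <- struct{}{} // limit concurrent uploads
			wg.Add(1)
			go func(chunk []byte) {
				defer wg.Done()
				defer func() { <-sem }()
				if perr := pushBlob(chunk); perr != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = perr
					}
					mu.Unlock()
				}
			}(chunk)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return err
		}
	}
	wg.Wait()
	return firstErr
}
```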
Issue
The OCI spec limits uploads of artifacts to being serial in nature. The upload has to be processed in order because the checksum of the incoming data must be calculated as it streams in, and the sha256 algorithm is not "distributable"; it has to be run serially. This creates problems when uploading large artifacts like ML models, and it makes OCI unattractive for large artifacts compared to uploading to an object store like S3, which supports multipart uploads.
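To make the constraint concrete: sha256 implementations (Go's `crypto/sha256` shown here) expose only a streaming write, so bytes must be fed in order; there is no operation to hash chunks independently and merge the results, which is why the registry has to see the blob serially.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	a, b := []byte("chunk-1"), []byte("chunk-2")

	// In-order streaming: the only way to get the digest of a||b.
	h := sha256.New()
	h.Write(a)
	h.Write(b)
	fmt.Printf("%x\n", h.Sum(nil))

	// Hashing the chunks independently gives unrelated digests; sha256
	// offers no way to combine them into the digest of a||b.
	fmt.Printf("%x %x\n", sha256.Sum256(a), sha256.Sum256(b))
}
```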
There have been prior attempts at addressing this, where the idea was to support out-of-order chunked uploads. This, however, leaves the assembly and checksum validation to the registry, which might take a long time or may not have the resources to pull a large blob into memory or onto disk to calculate the final checksum.
Use cases
Large artifacts are becoming prevalent in the OCI space. A few examples
Proposal
The proposal is to introduce a new layer mediaType which is an indirection to an index of chunks.
The chunk index is another blob that holds a list of chunks with their sizes and offsets. When uploading, the client chunks the file and uploads each chunk in parallel. Once all the chunks are uploaded, the client creates the chunk index, pushes it, and finally creates the manifest, which references the chunk-index blob.
The advantage here is that there is no extra processing required on the registry side. There is no reassembly required on the server, which means the blob can be "committed" on the registry as soon as the final chunk is uploaded.
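The exact chunk-index format isn't pinned down in this issue; purely as an illustration, the kind of structure it might carry could look like the following Go types (all field names and media types here are hypothetical, not defined by the OCI spec).

```go
package chunkindex

// Chunk describes one uploaded piece of the original artifact.
// These types are a hypothetical sketch of a chunk-index blob.
type Chunk struct {
	// Digest of the individual chunk blob as stored in the registry.
	Digest string `json:"digest"`
	// Size of the chunk in bytes.
	Size int64 `json:"size"`
	// Offset of the chunk within the assembled artifact.
	Offset int64 `json:"offset"`
}

// ChunkIndex is the blob referenced by the new layer mediaType. A client
// reassembles the artifact by fetching each chunk and writing it at its
// offset, then validating the result against FullDigest.
type ChunkIndex struct {
	// MediaType of the chunk index itself, e.g. something like
	// "application/vnd.example.chunk-index.v1+json" (illustrative only).
	MediaType string `json:"mediaType"`
	// FullDigest is the digest of the fully assembled artifact.
	FullDigest string `json:"fullDigest"`
	// FullSize is the total size of the assembled artifact in bytes.
	FullSize int64 `json:"fullSize"`
	// Chunks lists every chunk with its size and offset, in order.
	Chunks []Chunk `json:"chunks"`
}
```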
Considerations/Issues