
Support for parallel upload of large artifacts #573

Open · syed opened this issue Apr 18, 2025 · 12 comments


syed commented Apr 18, 2025

Issue

The OCI spec effectively forces artifact uploads to be serial: the upload must be processed in order because the registry needs to calculate the checksum of the incoming data, and the sha256 algorithm is not "distributable"; it has to be run serially over the whole stream. This creates problems when uploading large artifacts like ML models, and it makes OCI unattractive for large artifacts compared to object storage like S3, which supports multipart uploads.

There have been prior attempts at addressing this by supporting out-of-order chunked uploads. That approach, however, leaves assembly and checksum validation to the registry, which may take a long time or may not have the resources to pull a large blob into memory or onto disk to calculate the final checksum.

Use cases

Large artifacts are becoming prevalent in the OCI space. A few examples:

  • AI models
  • VM images
  • DB backups
  • Binaries for libs

Proposal

The proposal is to introduce a new layer mediaType which is an indirection to an index of chunks.

[Image: diagram of the proposed chunk-index indirection]

The chunk index is another blob which holds a list of chunks with their sizes and offsets. When uploading, the client chunks the file and uploads each chunk in parallel. Once all the chunks are uploaded, the client creates the chunk index, pushes it, and finally creates the manifest, which references the chunk-index blob.

The advantage here is that no extra processing is required on the registry side. There is no reassembly on the server, which means the blob can be "committed" on the registry as soon as the final chunk is uploaded.
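For illustration only, a chunk-index blob might look something like this (the mediaType and field names here are hypothetical, not part of any existing spec):

{
  "mediaType": "application/vnd.example.chunk-index.v1+json",
  "totalSize": 107374182400,
  "chunks": [
    { "digest": "sha256:<digest-of-chunk-1>", "size": 5368709120, "offset": 0 },
    { "digest": "sha256:<digest-of-chunk-2>", "size": 5368709120, "offset": 5368709120 },
    { "digest": "sha256:<digest-of-chunk-3>", "size": 5368709120, "offset": 10737418240 }
  ]
}

Each chunk is an ordinary blob, so the existing blob push and pull APIs apply to it unchanged; only tooling that understands the new layer mediaType needs to follow the indirection.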

Considerations/Issues

  1. Older clients and registries will not support this, so we need some way of remaining backwards compatible.
  2. Since there is no reassembly on the registry, the full artifact is only available once assembly is done on the client side (as opposed to S3, where reassembly happens on S3).
@sudo-bmitch (Contributor)

Thanks for writing this up @syed! To capture my own thoughts on the discussion, it may make more sense to move the concept of the chunk index into the manifest. E.g.

{
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  // ...
  "layers": [
    {
      "mediaType": "application/vnd.example.llmco.chunked",
      "digest": "sha256:8dbaf885340e047cecd34321def92f2dbbff99fe064295f3a2c507c424893606",
      "size": 5000000,
      "annotations": {
        "com.example.llmco.chunked.full-digest": "sha256:1b278385074125759e2782b52bcbf6b24640769737c8003c832d0f12c4d56651"
      }
    },
    {
      "mediaType": "application/vnd.example.llmco.chunked",
      "digest": "sha256:8b461e6703adccca9526545a1475ffa3370dd47a598ac96f02ffc21755d925ef",
      "size": 4567123,
      "annotations": {
        "com.example.llmco.chunked.full-digest": "sha256:1b278385074125759e2782b52bcbf6b24640769737c8003c832d0f12c4d56651"
      }
    }
  ] 
}

The full-digest annotations let you disambiguate multiple chunked blobs in the same manifest and validate the assembled result. The advantage of this is that existing registries and clients can continue to use the same registry blob and manifest APIs, and it's only the tooling that is consuming the artifact that needs to know how to chunk and reassemble the content.

The disadvantage is that it's not possible to pull the assembled blob from the registry, but we don't have a proposal that does that yet.
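As a rough sketch of that consumer-side tooling (in Go; pullBlob is a hypothetical stand-in for whatever registry client the tool uses, and the helper names are mine, not from any existing library), reassembly plus validation against the full-digest annotation could look like:

package chunked

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

// pullBlob is a hypothetical placeholder for streaming one chunk blob by digest.
func pullBlob(digest string) (io.ReadCloser, error) {
	return nil, errors.New("not implemented")
}

// assemble writes the chunk blobs to out in manifest order while hashing the
// concatenated stream, then checks the result against the expected full digest.
func assemble(out io.Writer, chunkDigests []string, wantFullDigest string) error {
	h := sha256.New()
	w := io.MultiWriter(out, h)
	for _, d := range chunkDigests {
		rc, err := pullBlob(d)
		if err != nil {
			return err
		}
		_, err = io.Copy(w, rc)
		rc.Close()
		if err != nil {
			return err
		}
	}
	got := "sha256:" + hex.EncodeToString(h.Sum(nil))
	if got != wantFullDigest {
		return fmt.Errorf("digest mismatch: got %s, want %s", got, wantFullDigest)
	}
	return nil
}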


syed commented Apr 18, 2025

The issues I see with this approach are:

  1. It can lead to large manifests which can exceed the maximum allowed manifest size, whereas the chunk-index blob can be of arbitrary size.
  2. It's hard to address the chunked blob as a single entity. With the approach I suggested, the chunk-index blob can act as a proxy for the actual blob (especially when you don't want to download the full blob).


sudo-bmitch commented Apr 18, 2025

> 1. It can lead to large manifests which can exceed the maximum allowed manifest size, whereas the chunk-index blob can be of arbitrary size.

This might actually be a good self-regulating problem. If someone makes 10,000 chunks, the overhead of all of the connections to pull each chunk will exceed the benefit from concurrency on the push. I'd want to see some real-world testing of how the number of connections improves transfer speed, and at what point additional connections become a net negative.

> 2. It's hard to address the chunked blob as a single entity. With the approach I suggested, the chunk-index blob can act as a proxy for the actual blob (especially when you don't want to download the full blob).

It moves the logic into the client, but I'd suggest that is client-side logic that needs to be written with either implementation.


Longer term, I think a better alternative is a way for registry servers to communicate support for a concurrent (out-of-order) chunked blob push, and combine that with a hashing algorithm that can run in parallel across multiple segments of the blob. Effectively, the hashing algorithm could work as a Merkle tree that tracks lots of hashes of incomplete parts and, as ranges complete, propagates the summarized result up the tree to free up memory. The registry could even decide whether it will support concurrent chunks based on the digest algorithm of the blob being pushed.
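To make the idea concrete, here is a deliberately simplified, flat hash-of-hashes in Go (my own illustration, not the dynamically expanding tree described above and not any existing registry or spec behavior): fixed-size segments are hashed independently and possibly concurrently, and the blob digest is then a hash over the ordered segment digests.

package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// segmentSize is arbitrary for illustration; a real design would tune it.
const segmentSize = 4 << 20 // 4 MiB

// treeDigest hashes fixed-size segments concurrently, then hashes the
// concatenation of the ordered segment digests. Only the small per-segment
// digests are kept until the end; note the result is not the same value as
// a plain sha256 over the whole blob.
func treeDigest(data []byte) [sha256.Size]byte {
	n := (len(data) + segmentSize - 1) / segmentSize
	leaves := make([][sha256.Size]byte, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			end := (i + 1) * segmentSize
			if end > len(data) {
				end = len(data)
			}
			leaves[i] = sha256.Sum256(data[i*segmentSize : end])
		}(i)
	}
	wg.Wait()
	root := sha256.New()
	for i := range leaves {
		root.Write(leaves[i][:])
	}
	var out [sha256.Size]byte
	copy(out[:], root.Sum(nil))
	return out
}

func main() {
	fmt.Printf("%x\n", treeDigest(make([]byte, 10<<20)))
}

With a construction like this, out-of-order chunk uploads only need to agree on the segment boundaries: each chunk's segments can be hashed as they arrive, and the final digest can be emitted as soon as every segment digest is known.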


syed commented Apr 18, 2025

> I'd want to see some real-world testing of how the number of connections improves transfer speed, and at what point additional connections become a net negative.

I can get some numbers on this.

> combine that with a hashing algorithm that can run in parallel across multiple segments of the blob.

What you are suggesting is a commutative hashing algorithm, H(a + b) = H(b + a). As far as I know, this doesn't exist (while remaining cryptographically secure). All the literature that tries to do this ends up with some form of hash-of-hashes approach.

@sudo-bmitch (Contributor)

> combine that with a hashing algorithm that can run in parallel across multiple segments of the blob.

> What you are suggesting is a commutative hashing algorithm, H(a + b) = H(b + a). As far as I know, this doesn't exist (while remaining cryptographically secure). All the literature that tries to do this ends up with some form of hash-of-hashes approach.

A hash of hashes is a Merkle tree, which is used at a more abstract level everywhere. I'd still want order to be significant, so a tool implementing this would need to store as much as block_size * number_of_concurrent_chunks in memory to account for incomplete blocks. It trades memory for speed and concurrency.


syed commented Apr 18, 2025

I'm not sure I follow. Let's say there's a 100 GB file and each chunk is 5 GB. Are you saying that the registry has to hold 100 GB of memory until all the chunks are uploaded?

@sudo-bmitch (Contributor)

> I'm not sure I follow. Let's say there's a 100 GB file and each chunk is 5 GB. Are you saying that the registry has to hold 100 GB of memory until all the chunks are uploaded?

The hash algorithm would probably default to a chunk size of maybe 50 KB, and each node of the tree could have perhaps 20 child nodes (just to make the math easy). The hash chunk size is independent of the network chunk size because they optimize for different things (memory/CPU vs network concurrency/latency). Once all child node hashes are complete, or you finish a leaf chunk, you propagate the hash of that node upward and free any child memory. The tree dynamically expands (on a logarithmic scale) to handle reads at large offsets, and only allows the hash to be output when all but the last leaf is complete.

So, for example, you could start reading the first MB; you would fill up the leaves of the first node, and once the node is done it only needs to track the single hash of hashes. If you then started reading a chunk around the 50 MB offset, the tree would expand to a three-tier tree (1st tier = 1 MB, 2nd tier = 20 MB, 3rd tier = 400 MB, 4th tier = 8 GB, 5th tier = 160 GB, 6th tier = 3.2 TB), but you only need to store the tail of any partial chunk if you started partway into a 50 KB block, plus the tree state and the hash state of any incomplete leaves. For any nodes with no data read yet, you wouldn't store a node at all, just a marker in the parent node indicating no data.

This is off the top of my head, so more than likely someone else has already come up with this or something better. I hear BLAKE3 has some concurrency support. I'm mainly pointing out that a hash algorithm could be designed to work with concurrent / out-of-order chunks, without needing to wait for the last chunk to compute the full hash. Given an algorithm that handled that scenario, our change could be to the API, allowing concurrent uploads of a single blob, rather than redesigning the concept of a blob.


syed commented Apr 23, 2025

I see where you are going with this @sudo-bmitch. This can work with parallel upload in the API. Two things I want to point out as a registry operator:

  1. Most registry operators use some kind of object storage as their backend. The way I imagine this flow working is that there will be a 1:1 mapping between a chunk and a multipart upload part (taking S3 as an example; see the sketch after this list). At the end of the upload we will have to call a complete-multipart-upload, which "commits" the full blob in the blob storage. This takes a lot of time for large files.
  2. Large single blobs are not cacheable by many CDN providers. Most of them will just forward/proxy the request to the origin (e.g. CloudFront has a 30 GB limit). By chunking the big file into multiple blobs, we can cache it effectively in CDNs.
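For reference on point 1, the S3 flow behind a chunked push would look roughly like this with the AWS SDK for Go (a sketch only: error handling trimmed, bucket and key names are placeholders, and parts are shown sequentially though they could be uploaded concurrently). The final CompleteMultipartUpload call is where S3 stitches the parts together, which is the step that gets slow for very large objects:

package registry

import (
	"bytes"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// uploadChunks maps each incoming chunk 1:1 onto an S3 multipart part.
func uploadChunks(chunks [][]byte) error {
	svc := s3.New(session.Must(session.NewSession()))
	bucket := aws.String("example-registry-storage") // placeholder bucket
	key := aws.String("blobs/<digest>")              // placeholder key

	mpu, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{Bucket: bucket, Key: key})
	if err != nil {
		return err
	}

	var parts []*s3.CompletedPart
	for i, chunk := range chunks {
		num := aws.Int64(int64(i + 1))
		out, err := svc.UploadPart(&s3.UploadPartInput{
			Bucket: bucket, Key: key, UploadId: mpu.UploadId,
			PartNumber: num, Body: bytes.NewReader(chunk),
		})
		if err != nil {
			return err
		}
		parts = append(parts, &s3.CompletedPart{ETag: out.ETag, PartNumber: num})
	}

	// The blob is only "committed" in object storage after this call returns.
	_, err = svc.CompleteMultipartUpload(&s3.CompleteMultipartUploadInput{
		Bucket: bucket, Key: key, UploadId: mpu.UploadId,
		MultipartUpload: &s3.CompletedMultipartUpload{Parts: parts},
	})
	return err
}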

@jcarter3

Another point: we (Docker Hub) see a noticeable increase in download failures as blob size increases. Downloading ten different 500 MB layers is more likely to succeed than one 5 GB layer. Clients could get a similar effect by using range requests on a single large blob, but since range requests are a SHOULD in the spec, they might have more success chunking large blobs into multiple layers.
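For anyone comparing the two options, a ranged pull of a single large blob is just a standard HTTP Range request against the blob endpoint; a minimal sketch in Go (auth omitted, URL pieces are placeholders):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchRange pulls one byte range of a blob from a registry.
func fetchRange(blobURL, byteRange string, dst io.Writer) error {
	req, err := http.NewRequest("GET", blobURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", byteRange) // e.g. "bytes=0-1048575" for the first 1 MiB
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// 206 Partial Content means the registry honored the range; a 200 with the
	// full body means it did not, since range support is only a SHOULD.
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("range not honored: got HTTP %d", resp.StatusCode)
	}
	_, err = io.Copy(dst, resp.Body)
	return err
}

func main() {
	err := fetchRange("https://registry.example.com/v2/myrepo/blobs/sha256:<digest>",
		"bytes=0-1048575", os.Stdout)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}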


syed commented May 9, 2025

Adding my test results here for completeness.

I ran an S3 multipart upload from a VM in the same region as the S3 bucket, and an OCI blob upload to ECR in the same region. These are the results:

./parallel-blob-upload-testing -bucket syed-s3-access-logs -file largefile.test -workers 8

Starting multipart upload with 8 parallel workers...
No of parts 20
Uploaded part 8 of 20
Uploaded part 3 of 20
Uploaded part 1 of 20
Uploaded part 4 of 20
Uploaded part 5 of 20
Uploaded part 2 of 20
Uploaded part 7 of 20
Uploaded part 6 of 20
Uploaded part 10 of 20
Uploaded part 11 of 20
Uploaded part 12 of 20
Uploaded part 15 of 20
Uploaded part 14 of 20
Uploaded part 17 of 20
Uploaded part 16 of 20
Uploaded part 9 of 20
Uploaded part 13 of 20
Uploaded part 18 of 20
Uploaded part 20 of 20
Uploaded part 19 of 20
Multipart upload completed in 14m41.245502783s
Parts upload done in 14m41.045911214s

time oras push 942491940283.dkr.ecr.us-east-1.amazonaws.com/large-layers-testing:latest largefile.test

⠏ [         ...........](31.7 MB/s) Uploading largefile.test                                                        48.90/100 GB  48.92% 22m28s
  └─ sha256:672cea3784975f9ec1d75d28fb8a2977cc80fbbd573a3a5cabfc29e5515c8fec
✓ Uploaded  application/vnd.oci.empty.v1+json                                                                             2/2  B 100.00%  414ms
  └─ sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a
Error response from registry: denied: Adding this part to the layer with upload id '70548a0d-9a69-39b1-829c-6c1118106311' in the repository with name 'large-layers-testing' in registry with id '942491940283' exceeds the maximum allowed size of a layer which is '52428800000'

real    37m26.213s
user    8m31.067s
sys     1m53.056s

The multipart upload is significantly faster than the OCI upload (which also hit ECR's maximum layer size and failed before completing).


jcarter3 commented May 9, 2025

What if you split it into chunks and upload those? Obviously this doesn't account for the client-side work to split/join, but how does that look performance-wise?


syed commented May 24, 2025

I ran this test today: I split the same 100 GB file into 1 GB chunks and uploaded them as layers in parallel.

Successfully uploaded largefile.test as OCI artifact with tag chunked (100 layers)

real    17m17.242s
user    10m29.860s
sys     4m46.382s

Compared to 37m for the single-layer upload, I'd say this is a significant improvement.
