docker/k8s/batch: increase /dev/shm size for larger datasets #428

Closed
1 of 10 tasks
d4l3k opened this issue Mar 21, 2022 · 0 comments


d4l3k commented Mar 21, 2022

🐛 Bug

Models that load large datasets through PyTorch DataLoaders need /dev/shm to be large enough for batches to be transferred between worker processes. Docker and Kubernetes default /dev/shm to 64MB, which is far too small. Since shm is tmpfs-backed and doesn't consume memory until it's actually written to, it should be safe to set its size to the full memory allocated for the container.
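
For the local_docker path, Docker already exposes this knob as --shm-size (shm_size in the Python SDK). A minimal sketch of the proposed behavior, assuming the scheduler knows the container's memory limit (the image name and limit below are illustrative, not the actual TorchX implementation):

import docker

client = docker.from_env()

# Hypothetical memory limit; the fix would derive this from the
# resources requested for the role.
mem_bytes = 16 * 1024**3  # 16GiB

container = client.containers.run(
    "my-training-image:latest",   # illustrative image name
    command="python large-shm.py",
    mem_limit=mem_bytes,
    shm_size=mem_bytes,           # size /dev/shm to the full container memory
    detach=True,
)

Because shm is tmpfs, sizing it to the full limit costs nothing until pages are actually written.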

Module (check all that apply):

  • [ ] torchx.spec
  • [ ] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

import torch
from torch.utils.data import Dataset, DataLoader

class BigDataset(Dataset):
    """Synthetic dataset whose items are large enough to overflow a 64MB /dev/shm."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # Each item is a ~400MB float32 tensor (size elements * 4 bytes).
        return torch.zeros((1, self.size))

if __name__ == "__main__":
    dataset = BigDataset(100_000_000)
    # DataLoader workers hand batches back to the main process through /dev/shm.
    dataloader = DataLoader(dataset, batch_size=4, num_workers=4)

    for i, x in enumerate(dataloader):
        print(i, x.shape)

Save this as large-shm.py and launch it with:

torchx run --scheduler local_docker --wait --log dist.ddp -j 1x1 --script large-shm.py

Expected behavior

It runs to completion rather than failing due to insufficient shared memory.

Environment

tristanr@tristanr-arch2 ~> docker version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.17.5
 Git commit:        e91ed5707e
 Built:             Mon Dec 13 22:31:40 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.5
  Git commit:       459d0dfbbb
  Built:            Mon Dec 13 22:30:43 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.0
  GitCommit:        39259a8f35919a0d02c9ecc2871ddd6ccf6a7c6e.m
 runc:
  Version:          1.1.0
  GitCommit:        v1.1.0-0-g067aaf85
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Additional context

https://stackoverflow.com/questions/46085748/define-size-for-dev-shm-on-container-engine/46434614#46434614
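
On Kubernetes there is no first-class shm-size option; the standard workaround from the answer above is to mount a memory-backed emptyDir at /dev/shm. A sketch of what the generated pod spec could include, using the kubernetes Python client (container name, image, and size are illustrative):

from kubernetes.client import (
    V1Container, V1EmptyDirVolumeSource, V1PodSpec, V1Volume, V1VolumeMount,
)

# Memory-backed emptyDir mounted at /dev/shm. With medium="Memory" the
# volume is tmpfs, so its usage counts against the container's memory
# limit; size_limit caps it at the container's allocation.
shm_volume = V1Volume(
    name="dshm",
    empty_dir=V1EmptyDirVolumeSource(medium="Memory", size_limit="16Gi"),
)

pod_spec = V1PodSpec(
    containers=[
        V1Container(
            name="trainer",  # hypothetical container name
            image="my-training-image:latest",  # illustrative image
            volume_mounts=[V1VolumeMount(name="dshm", mount_path="/dev/shm")],
        )
    ],
    volumes=[shm_volume],
)

AWS Batch exposes an equivalent knob through the job definition's linuxParameters.sharedMemorySize field.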

@d4l3k d4l3k self-assigned this Mar 21, 2022
@d4l3k d4l3k added module: runner issues related to the torchx.runner and torchx.scheduler modules kubernetes kubernetes and volcano schedulers docker aws_batch bug Something isn't working good first issue Good for newcomers labels Mar 21, 2022
@d4l3k d4l3k added this to the 0.1.2 release milestone Mar 21, 2022