
[feature] Add debug_dataloader_samples utility to preview decoded dataloader samples (#184) #364


Closed
wants to merge 7 commits

Conversation

garongkim

What does this PR do?

This PR adds a utility function debug_dataloader_samples() under utils/debug_utils.py.

It allows users to preview a few decoded examples from the first batch of the training DataLoader, which can help ensure that tokenization is working correctly before launching a long training run.
This debug utility addresses #184.

Although another contributor expressed interest in the issue, there has been no progress or maintainer feedback in the last three weeks.
As this feature is simple but useful, I decided to implement and contribute it directly.

How to use

This utility can be optionally called from run_train.py or any other training script to inspect tokenized input samples before training.
Here is an example snippet:

```python
from nanotron.utils.debug_utils import debug_dataloader_samples

debug_dataloader_samples(train_dataloader, tokenizer, num_samples=3)
```

Notes

To avoid a circular import between `utils.py` and `distributed.py`, this function is placed in a new module, `debug_utils.py`, instead of the main `utils.py`.

Logging is used instead of print() to remain consistent with Nanotron’s logging system.
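The implementation itself is not shown in this description. As a rough sketch of the behavior described above (assuming the collator yields batches with an `input_ids` key and the tokenizer exposes a Hugging Face-style `decode()`; the stdlib `logging` module stands in here for Nanotron's own logger):

```python
import logging

logger = logging.getLogger(__name__)

def debug_dataloader_samples(dataloader, tokenizer, num_samples=3):
    """Log a few decoded examples from the first batch of `dataloader`."""
    batch = next(iter(dataloader))   # only the first batch is inspected
    input_ids = batch["input_ids"]   # assumes the collator returns this key
    for i in range(min(num_samples, len(input_ids))):
        text = tokenizer.decode(input_ids[i], skip_special_tokens=True)
        logger.info("Sample %d: %s", i, text)
```

Because only the first batch is pulled, the cost is negligible and the call can be left in training scripts behind a debug flag.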

Fixes #184

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines?
  • Did you write any new necessary tests?
  • Did you log the throughput and loss you get to ensure the PR works as expected in actual training?
  • Did you log the memory usage? You can use this tool to understand the memory usage breakdown in nanotron.
  • If you modified anything related to checkpoints, did you verify that saving and reloading checkpoints still works correctly?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

NouamaneTazi and others added 7 commits April 14, 2025 16:17
…face#353)

* can only merge to main from dev

* Fix UnBoundLocalError in `clm_collator.py` (huggingface#339)

* Update clm_collator.py

* can only merge to main from dev (huggingface#348)

---------

Co-authored-by: Nouamane Tazi <[email protected]>

* fix init and init scaling factor and run evals in background (huggingface#349)

* InitScalingMethod

* InitScalingMethod

* run evals in background (huggingface#352)

* eval

* try adding lightevalrunner to trainer

* amend

* amend

* amend

* amend

* amend

* amend

* .

* amend

* amend

* .

* qos to low

* add nanotron_path

* some fix: logs, and config

* cp instead of sync

* eval_interval

* serialize sanity checks

* add output dir and s3_save path in the config

* fix s3 only if define

* fixes

---------

Co-authored-by: elie <[email protected]>
Co-authored-by: “eliebak” <[email protected]>

---------

Co-authored-by: elie <[email protected]>
Co-authored-by: “eliebak” <[email protected]>

---------

Co-authored-by: Connector Switch <[email protected]>
Co-authored-by: elie <[email protected]>
Co-authored-by: “eliebak” <[email protected]>
* deepwiki

* Update README.md
Comment on lines +152 to +154
_fused_rotary_emb: bool = False
_fused_rms_norm: bool = False
_use_qkv_packed: bool = False
Member

these are duplicated

Comment on lines +65 to +70
slurm_job_id, slurm_log = run_slurm_one_job(
config=self.config,
lighteval_config=self.lighteval_config,
model_checkpoint_path=checkpoint_path,
current_step=self.config.general.step,
)
Member

same

@garongkim garongkim closed this May 26, 2025
@garongkim
Author

Closing this in favor of a cleaner PR: #368
