
Cannot run the Model generated from the example script #251

Closed
@hz-nm

Description


I was testing the library by training the model on a single GPU. I used the following command to launch training:

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 run_train.py --config-file examples/config_tiny_llama.yaml

I made the following changes in the config_tiny_llama.yaml file:

parallelism:
  dp: 1 # 2
  expert_parallel_size: 1
  pp: 1 # 2
  pp_engine: 1f1b
  tp: 1 # 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
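For reference, the edited degrees multiply out to the single-process world size. This check is my own assumption about nanotron's requirement (that dp * tp * pp must equal torchrun's --nproc_per_node), not something stated in the docs I have at hand:

```python
# Parallelism degrees from the edited YAML above.
dp, tp, pp = 1, 1, 1
# Value passed to torchrun via --nproc_per_node.
nproc_per_node = 1

# Assumed nanotron constraint: the degrees must cover the whole world size.
world_size = dp * tp * pp
assert world_size == nproc_per_node
print(world_size)  # → 1
```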

The training ran smoothly and the checkpoints were generated. However, when I try to run the model using

torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1

I get the following error:

[rank0]:   File "/mnt/d/nanotron-pretrain/nanotron/src/nanotron/models/llama.py", line 529, in forward
[rank0]:     (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 4)
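In case it helps triage: my guess (not confirmed) is a flash-attn version mismatch, since newer flash-attn releases return five values from bert_padding.unpad_input (an extra seqused tensor) while the line in llama.py unpacks four. A version-tolerant unpack would look something like the sketch below; the helper name is mine, and the stand-in strings just simulate the tensors in the real return tuple:

```python
def unpack_unpad_output(outputs):
    """Accept either the 4-tuple returned by older flash-attn versions
    or the 5-tuple (trailing seqused tensor) returned by newer ones,
    discarding any extra trailing values."""
    hidden_unpad, indices, cu_seqlens, max_seqlen = outputs[:4]
    return hidden_unpad, indices, cu_seqlens, max_seqlen

# Stand-ins for real tensors, just to show both shapes are accepted.
print(unpack_unpad_output(("q", "idx", "cu", 128)))             # → ('q', 'idx', 'cu', 128)
print(unpack_unpad_output(("q", "idx", "cu", 128, "seqused")))  # → ('q', 'idx', 'cu', 128)
```

Pinning flash-attn to the version nanotron was developed against may be the simpler workaround, but I have not verified which version that is.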

Any help to resolve this issue will be greatly appreciated. Thanks.
