I was testing out the library by training the model on a single GPU. I used the following command to run the training:

```
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 run_train.py --config-file examples/config_tiny_llama.yaml
```
I made some changes in the `config_tiny_llama.yaml` file, which include:

```yaml
parallelism:
  dp: 1 # 2
  expert_parallel_size: 1
  pp: 1 # 2
  pp_engine: 1f1b
  tp: 1 # 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
```
The training ran smoothly and the checkpoints were generated. However, when I try to run generation with

```
torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
```

I get the following error:
```
[rank0]:   File "/mnt/d/nanotron-pretrain/nanotron/src/nanotron/models/llama.py", line 529, in forward
[rank0]:     (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 4)
```
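For what it's worth, this looks like it could be the flash-attn `unpad_input` return-arity change: newer flash-attn releases (2.6+, as far as I can tell) return a five-tuple (the extra item being the used sequence lengths per batch), while the code at `llama.py` line 529 unpacks four values. Below is a minimal sketch of a version-tolerant unpacking, assuming that is the cause; the tensor names and shapes are illustrative, not nanotron's actual ones:

```python
import torch
from flash_attn import bert_padding

# Dummy inputs just to exercise the call; shapes follow the usual
# (batch, seqlen, heads, head_dim) layout that flash-attn expects.
query_states = torch.randn(2, 8, 4, 16, dtype=torch.float16)
sequence_mask = torch.ones(2, 8, dtype=torch.bool)

# flash-attn >= 2.6 appears to return a five-tuple from unpad_input
# (the extra item is the used sequence lengths), while older releases
# return a four-tuple. Taking only the first four items tolerates both.
outputs = bert_padding.unpad_input(query_states, sequence_mask)
query_unpad, indices_q, cu_seqlens_q, max_seqlen_q = outputs[:4]
print(query_unpad.shape, cu_seqlens_q, max_seqlen_q)
```

Alternatively, pinning flash-attn below 2.6 (e.g. `pip install "flash-attn<2.6"`) might sidestep the mismatch, if the arity change is indeed the culprit.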
Any help to resolve this issue would be greatly appreciated. Thanks.