Commit 05c181d
Fix 2DParallel test (#219)
Use `rmsnorm` instead of the fused version, since 2D parallelism does not
support the fused version yet.
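
For context, a minimal sketch of the kind of norm selection the `--model.norm_type` flag drives. The names `build_norm` and this plain `RMSNorm` implementation are illustrative assumptions mirroring the override used in the test below, not necessarily the repo's exact code:

```
# Illustrative sketch only: build_norm and this RMSNorm are assumed names,
# mirroring the --model.norm_type switch used in the test below.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain (unfused) RMSNorm; works under 2D (FSDP + tensor) parallelism."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x_normed * self.weight

def build_norm(norm_type: str, dim: int, eps: float = 1e-5) -> nn.Module:
    if norm_type == "rmsnorm":
        return RMSNorm(dim, eps)
    if norm_type == "fused_rmsnorm":
        # The fused kernel path is what 2D parallelism cannot handle yet,
        # hence the override to plain rmsnorm in this test.
        raise NotImplementedError("fused rmsnorm unsupported with 2D parallelism")
    raise ValueError(f"unknown norm_type: {norm_type}")
```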
Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.tensor_parallel_degree
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 3 -ne 0 ']'
+ overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm'
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757]
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
[rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-04-10 15:50:40,318 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-04-10 15:50:41,474 - root - INFO - step: 1 loss: 10.8403 memory: 5.76GiB(6.06%) wps: 8,988 mfu: 0.11%
[rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-04-10 15:50:41,652 - root - INFO - step: 2 loss: 10.7703 memory: 6.74GiB(7.09%) wps: 46,364 mfu: 0.57%
[rank0]:2024-04-10 15:50:41,744 - root - INFO - step: 3 loss: 10.6447 memory: 6.74GiB(7.09%) wps: 89,916 mfu: 1.10%
[rank0]:2024-04-10 15:50:41,847 - root - INFO - step: 4 loss: 10.4428 memory: 6.74GiB(7.09%) wps: 80,467 mfu: 0.99%
[rank0]:2024-04-10 15:50:41,946 - root - INFO - step: 5 loss: 10.1726 memory: 6.74GiB(7.09%) wps: 83,747 mfu: 1.03%
[rank0]:2024-04-10 15:50:42,038 - root - INFO - step: 6 loss: 9.9676 memory: 6.74GiB(7.09%) wps: 89,380 mfu: 1.09%
[rank0]:2024-04-10 15:50:42,135 - root - INFO - step: 7 loss: 9.7356 memory: 6.74GiB(7.09%) wps: 85,526 mfu: 1.05%
[rank0]:2024-04-10 15:50:42,232 - root - INFO - step: 8 loss: 9.4619 memory: 6.74GiB(7.09%) wps: 85,349 mfu: 1.05%
[rank0]:2024-04-10 15:50:42,396 - root - INFO - step: 9 loss: 9.2633 memory: 6.74GiB(7.09%) wps: 50,402 mfu: 0.62%
[rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-10 15:50:42,511 - root - INFO - step: 10 loss: 9.2156 memory: 6.74GiB(7.09%) wps: 71,449 mfu: 0.88%
[rank0]:NCCL version 2.20.5+cuda12.0
```
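
The log shows the trainer building a 2-D device mesh over `['dp', 'tp']` with shape `[2, 2]`. A minimal sketch of that construction using PyTorch's `init_device_mesh`, assuming 4 GPUs launched via torchrun; this is an illustration, not the repo's exact setup code:

```
# Sketch of the 2-D ('dp' x 'tp') mesh reported in the log above.
# Assumes 4 GPUs and a torchrun launch (which supplies the rendezvous env).
from torch.distributed.device_mesh import init_device_mesh

# 2 data-parallel groups x 2 tensor-parallel groups = 4 ranks, matching
# "Building 2-D device mesh with ['dp', 'tp'], [2, 2]".
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh_2d["dp"]  # submesh used for FSDP sharding
tp_mesh = mesh_2d["tp"]  # submesh used for tensor parallelism
```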
1 file changed: 1 addition, 1 deletion.