
Commit 05c181d

Fix 2DParallel test (#219)
Use `rmsnorm` instead of the fused version, since 2D parallelism does not support the fused version yet.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.tensor_parallel_degree
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 3 -ne 0 ']'
+ overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm'
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757]
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
[rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-04-10 15:50:40,318 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-04-10 15:50:41,474 - root - INFO - step:  1  loss: 10.8403  memory:  5.76GiB(6.06%)  wps: 8,988  mfu: 0.11%
[rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-04-10 15:50:41,652 - root - INFO - step:  2  loss: 10.7703  memory:  6.74GiB(7.09%)  wps: 46,364  mfu: 0.57%
[rank0]:2024-04-10 15:50:41,744 - root - INFO - step:  3  loss: 10.6447  memory:  6.74GiB(7.09%)  wps: 89,916  mfu: 1.10%
[rank0]:2024-04-10 15:50:41,847 - root - INFO - step:  4  loss: 10.4428  memory:  6.74GiB(7.09%)  wps: 80,467  mfu: 0.99%
[rank0]:2024-04-10 15:50:41,946 - root - INFO - step:  5  loss: 10.1726  memory:  6.74GiB(7.09%)  wps: 83,747  mfu: 1.03%
[rank0]:2024-04-10 15:50:42,038 - root - INFO - step:  6  loss:  9.9676  memory:  6.74GiB(7.09%)  wps: 89,380  mfu: 1.09%
[rank0]:2024-04-10 15:50:42,135 - root - INFO - step:  7  loss:  9.7356  memory:  6.74GiB(7.09%)  wps: 85,526  mfu: 1.05%
[rank0]:2024-04-10 15:50:42,232 - root - INFO - step:  8  loss:  9.4619  memory:  6.74GiB(7.09%)  wps: 85,349  mfu: 1.05%
[rank0]:2024-04-10 15:50:42,396 - root - INFO - step:  9  loss:  9.2633  memory:  6.74GiB(7.09%)  wps: 50,402  mfu: 0.62%
[rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-10 15:50:42,511 - root - INFO - step: 10  loss:  9.2156  memory:  6.74GiB(7.09%)  wps: 71,449  mfu: 0.88%
[rank0]:NCCL version 2.20.5+cuda12.0
```
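For context, `--model.norm_type=rmsnorm` selects a plain, unfused RMSNorm layer. Below is a minimal sketch of such a layer, assuming the standard RMSNorm formulation and the `norm_eps=1e-05` default visible in the `ModelArgs` log above; it is illustrative only, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal unfused RMSNorm (illustrative sketch, not the repo's code)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each token by the reciprocal root-mean-square of its features.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because this form is built from ordinary elementwise and reduction ops rather than a fused kernel, it composes with the Tensor Parallel sharding applied in this test, whereas the fused variant does not yet.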
1 parent 144b229 · commit 05c181d


test_runner.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -43,7 +43,7 @@ class OverrideDefinitions:
     ),
     OverrideDefinitions(
         [
-            ["--training.tensor_parallel_degree 2"],
+            ["--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm"],
         ],
         "Eager mode 2DParallel",
     ),
```
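For illustration, the override list above is what the test runner splices into the `torchrun` command shown in the commit-message log. A hypothetical sketch of that wiring follows (the function name and structure are assumptions, not the repo's exact code):

```python
import subprocess

def run_override_test(config_file: str, overrides: list[str], ngpu: int = 4) -> None:
    """Launch train.py under torchrun with a list of CLI overrides.

    Mirrors the invocation visible in the commit-message log; illustrative only.
    """
    cmd = (
        f"torchrun --nproc_per_node={ngpu} --rdzv_backend c10d "
        "--rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 "
        f"train.py --job.config_file {config_file} " + " ".join(overrides)
    )
    subprocess.run(cmd, shell=True, check=True)

# The fixed test case from the diff above:
run_override_test(
    "./train_configs/debug_model.toml",
    ["--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm"],
)
```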
