
Commit 05c181d

Fix 2DParallel test (#219)
Use `rmsnorm` instead of the fused version, since 2D parallelism does not support the fused version yet.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.tensor_parallel_degree
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 3 -ne 0 ']'
+ overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm'
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757]
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
[rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-04-10 15:50:40,318 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-04-10 15:50:41,474 - root - INFO - step:  1  loss: 10.8403  memory:  5.76GiB(6.06%)  wps: 8,988  mfu: 0.11%
[rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-04-10 15:50:41,652 - root - INFO - step:  2  loss: 10.7703  memory:  6.74GiB(7.09%)  wps: 46,364  mfu: 0.57%
[rank0]:2024-04-10 15:50:41,744 - root - INFO - step:  3  loss: 10.6447  memory:  6.74GiB(7.09%)  wps: 89,916  mfu: 1.10%
[rank0]:2024-04-10 15:50:41,847 - root - INFO - step:  4  loss: 10.4428  memory:  6.74GiB(7.09%)  wps: 80,467  mfu: 0.99%
[rank0]:2024-04-10 15:50:41,946 - root - INFO - step:  5  loss: 10.1726  memory:  6.74GiB(7.09%)  wps: 83,747  mfu: 1.03%
[rank0]:2024-04-10 15:50:42,038 - root - INFO - step:  6  loss:  9.9676  memory:  6.74GiB(7.09%)  wps: 89,380  mfu: 1.09%
[rank0]:2024-04-10 15:50:42,135 - root - INFO - step:  7  loss:  9.7356  memory:  6.74GiB(7.09%)  wps: 85,526  mfu: 1.05%
[rank0]:2024-04-10 15:50:42,232 - root - INFO - step:  8  loss:  9.4619  memory:  6.74GiB(7.09%)  wps: 85,349  mfu: 1.05%
[rank0]:2024-04-10 15:50:42,396 - root - INFO - step:  9  loss:  9.2633  memory:  6.74GiB(7.09%)  wps: 50,402  mfu: 0.62%
[rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-10 15:50:42,511 - root - INFO - step: 10  loss:  9.2156  memory:  6.74GiB(7.09%)  wps: 71,449  mfu: 0.88%
[rank0]:NCCL version 2.20.5+cuda12.0
```
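For context, `--model.norm_type=rmsnorm` selects a plain, unfused RMSNorm layer. Below is a minimal sketch of such a layer, assuming the standard RMSNorm formulation and the `norm_eps=1e-05` default visible in the `ModelArgs` log above; it is illustrative only, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal unfused RMSNorm (illustrative sketch, not the repo's code)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each token by the reciprocal root-mean-square of its features.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because this form is built from ordinary elementwise and reduction ops rather than a fused kernel, it composes with the Tensor Parallel sharding applied in this test, whereas the fused variant does not yet.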
1 parent 144b229 · commit 05c181d


test_runner.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -43,7 +43,7 @@ class OverrideDefinitions:
     ),
     OverrideDefinitions(
         [
-            ["--training.tensor_parallel_degree 2"],
+            ["--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm"],
         ],
         "Eager mode 2DParallel",
     ),
```
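For illustration, the override list above is what the test runner splices into the `torchrun` command shown in the commit-message log. A hypothetical sketch of that wiring follows (the function name and structure are assumptions, not the repo's exact code):

```python
import subprocess

def run_override_test(config_file: str, overrides: list[str], ngpu: int = 4) -> None:
    """Launch train.py under torchrun with a list of CLI overrides.

    Mirrors the invocation visible in the commit-message log; illustrative only.
    """
    cmd = (
        f"torchrun --nproc_per_node={ngpu} --rdzv_backend c10d "
        "--rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 "
        f"train.py --job.config_file {config_file} " + " ".join(overrides)
    )
    subprocess.run(cmd, shell=True, check=True)

# The fixed test case from the diff above:
run_override_test(
    "./train_configs/debug_model.toml",
    ["--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm"],
)
```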
