Thank you for your help in resolving the earlier issues! However, I'm now facing a new problem during inference:
Generating: 0%| | 0/3000 [00:00<?, ?it/s]We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Generating: 16%|████████████████████████▊ | 469/3000 [00:24<02:12, 19.07it/s]
[2025-03-31 20:48:37][root][INFO] - LLM Inference Time: 25.14s
Error executing job with overrides: ['++model_config.llm_name=qwen2-0.5b', '++model_config.llm_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5B', '++model_config.llm_dim=896', '++model_config.encoder_name=whisper', '++model_config.encoder_projector_ds_rate=5', '++model_config.encoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/small.pt', '++model_config.encoder_dim=768', '++model_config.encoder_projector=linear', '++model_config.codec_decoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/pretrained_models/CosyVoice-300M-SFT', '++model_config.codec_decode=true', '++model_config.vocab_config.code_layer=3', '++model_config.vocab_config.total_audio_vocabsize=4160', '++model_config.vocab_config.total_vocabsize=156160', '++model_config.code_type=CosyVoice', '++model_config.codec_decoder_type=CosyVoice', '++model_config.group_decode=true', '++model_config.group_decode_adapter_type=linear', '++dataset_config.dataset=speech_dataset_s2s', '++dataset_config.val_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.train_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.input_type=mel', '++dataset_config.mel_size=80', '++dataset_config.inference_mode=true', '++dataset_config.manifest_format=jsonl', '++dataset_config.split_size=0.002', '++dataset_config.load_from_cache_file=false', '++dataset_config.task_type=s2s', '++dataset_config.seed=777', '++dataset_config.vocab_config.code_layer=3', '++dataset_config.vocab_config.total_audio_vocabsize=4160', '++dataset_config.vocab_config.total_vocabsize=156160', '++dataset_config.code_type=CosyVoice', '++dataset_config.num_latency_tokens=0', '++dataset_config.do_layershift=false', '++train_config.model_name=s2s', '++train_config.freeze_encoder=true', '++train_config.freeze_llm=true', '++train_config.freeze_encoder_projector=true', '++train_config.freeze_group_decode_adapter=true', 
'++train_config.batching_strategy=custom', '++train_config.num_epochs=1', '++train_config.val_batch_size=1', '++train_config.num_workers_dataloader=2', '++train_config.task_type=s2s', '++decode_config.text_repetition_penalty=1.2', '++decode_config.audio_repetition_penalty=1.2', '++decode_config.max_new_tokens=3000', '++decode_config.task_type=s2s', '++decode_config.do_sample=false', '++decode_config.top_p=1.0', '++decode_config.top_k=0', '++decode_config.temperature=1.0', '++decode_config.decode_text_only=false', '++decode_config.do_layershift=false', '++decode_log=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/s2s_decode__trp1.2_arp1.2_seed777_greedy', '++decode_config.num_latency_tokens=0', '++ckpt_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/model.pt', '++output_text_only=false', '++inference_online=false', '++speech_sample_rate=22050', '++audio_prompt_path=/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/audio_prompt/en/prompt_3.wav']
Traceback (most recent call last):
File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/inference_s2s.py", line 102, in main_hydra
batch_inference(cfg)
File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/generate/generate_s2s_batch.py", line 176, in main
q.write(key + "\t" + source_text + "\n")
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I hit this error when running inference_s2s_batch.sh with both the pre-trained and the fine-tuned model. However, when I load the pre-trained model with inference_s2s_online.sh, it successfully generates both the target text and the audio. Please look into this.
It seems that the entries in the JSONL file you provided don't contain the key field, so key is None when the script reaches that line and the string concatenation fails. You can either add the missing key field to your data or manually remove the line q.write(key + "\t" + source_text + "\n") from generate_s2s_batch.py.
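If you would rather keep the log line than delete it, a guard along these lines might work. This is only a minimal sketch, not the repository's actual code: format_line and find_missing_keys are hypothetical helpers, the utt_<index> fallback ID is made up for illustration, and it assumes each manifest line is a JSON object whose ID field is literally named "key".

```python
import json
from typing import Iterable, List, Optional


def format_line(key: Optional[str], text: Optional[str], index: int) -> str:
    """Build one tab-separated log line, tolerating a missing key or text."""
    # Hypothetical fallback: synthesize an ID from the sample index
    # when the manifest entry has no key.
    safe_key = key if key is not None else f"utt_{index}"
    safe_text = text if text is not None else ""
    return f"{safe_key}\t{safe_text}\n"


def find_missing_keys(jsonl_lines: Iterable[str], field: str = "key") -> List[int]:
    """Return the 1-based line numbers of manifest entries that lack `field`."""
    missing = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumes every non-blank line is a JSON object.
        if json.loads(line).get(field) is None:
            missing.append(lineno)
    return missing
```

In the batch script, the failing call could then become q.write(format_line(key, source_text, step)), where step stands for whatever loop counter is available there (also an assumption). Running find_missing_keys over the opened dev_manifest.jsonl first would show which entries need a key added.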