issue in inference_s2s_batch.sh #218

Open
Lalaramarya opened this issue Mar 31, 2025 · 1 comment

@Lalaramarya

Thank you for your help in resolving the earlier issues! However, I'm now facing a new problem during inference:

Generating: 0%| | 0/3000 [00:00<?, ?it/s]We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Generating: 16%|████████████████████████▊ | 469/3000 [00:24<02:12, 19.07it/s]
[2025-03-31 20:48:37][root][INFO] - LLM Inference Time: 25.14s
Error executing job with overrides: ['++model_config.llm_name=qwen2-0.5b', '++model_config.llm_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5B', '++model_config.llm_dim=896', '++model_config.encoder_name=whisper', '++model_config.encoder_projector_ds_rate=5', '++model_config.encoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/small.pt', '++model_config.encoder_dim=768', '++model_config.encoder_projector=linear', '++model_config.codec_decoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/pretrained_models/CosyVoice-300M-SFT', '++model_config.codec_decode=true', '++model_config.vocab_config.code_layer=3', '++model_config.vocab_config.total_audio_vocabsize=4160', '++model_config.vocab_config.total_vocabsize=156160', '++model_config.code_type=CosyVoice', '++model_config.codec_decoder_type=CosyVoice', '++model_config.group_decode=true', '++model_config.group_decode_adapter_type=linear', '++dataset_config.dataset=speech_dataset_s2s', '++dataset_config.val_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.train_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.input_type=mel', '++dataset_config.mel_size=80', '++dataset_config.inference_mode=true', '++dataset_config.manifest_format=jsonl', '++dataset_config.split_size=0.002', '++dataset_config.load_from_cache_file=false', '++dataset_config.task_type=s2s', '++dataset_config.seed=777', '++dataset_config.vocab_config.code_layer=3', '++dataset_config.vocab_config.total_audio_vocabsize=4160', '++dataset_config.vocab_config.total_vocabsize=156160', '++dataset_config.code_type=CosyVoice', '++dataset_config.num_latency_tokens=0', '++dataset_config.do_layershift=false', '++train_config.model_name=s2s', '++train_config.freeze_encoder=true', '++train_config.freeze_llm=true', '++train_config.freeze_encoder_projector=true', '++train_config.freeze_group_decode_adapter=true', 
'++train_config.batching_strategy=custom', '++train_config.num_epochs=1', '++train_config.val_batch_size=1', '++train_config.num_workers_dataloader=2', '++train_config.task_type=s2s', '++decode_config.text_repetition_penalty=1.2', '++decode_config.audio_repetition_penalty=1.2', '++decode_config.max_new_tokens=3000', '++decode_config.task_type=s2s', '++decode_config.do_sample=false', '++decode_config.top_p=1.0', '++decode_config.top_k=0', '++decode_config.temperature=1.0', '++decode_config.decode_text_only=false', '++decode_config.do_layershift=false', '++decode_log=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/s2s_decode__trp1.2_arp1.2_seed777_greedy', '++decode_config.num_latency_tokens=0', '++ckpt_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/model.pt', '++output_text_only=false', '++inference_online=false', '++speech_sample_rate=22050', '++audio_prompt_path=/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/audio_prompt/en/prompt_3.wav']
Traceback (most recent call last):
  File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/inference_s2s.py", line 102, in main_hydra
    batch_inference(cfg)
  File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/generate/generate_s2s_batch.py", line 176, in main
    q.write(key + "\t" + source_text + "\n")
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I'm facing this issue while running inference_s2s_batch.sh with both the pre-trained and fine-tuned models. However, when I load the pre-trained model using inference_s2s_online.sh, it successfully generates both the target text and audio. Please look into this.

@cwx-worst-one
Collaborator

It seems that the entries in the JSONL file you provided don't contain the key field, so key is None when the script reaches line 176 of generate_s2s_batch.py and the string concatenation fails. You can either add the missing key field to each entry in your data or manually remove the line q.write(key + "\t" + source_text + "\n") from the code.
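
Both options can be sketched in Python. This is a minimal sketch under stated assumptions, not code from the SLAM-LLM repository. Only the field name key and the tab-separated line format come from this thread. The helper names add_missing_keys and format_tsv_line, the generated key pattern (utt_000000 and so on), and the fallback value "unknown" are hypothetical. The first helper copies a manifest and fills in a key wherever it is absent or null. The second shows a guarded replacement for the failing concatenation, in case you prefer to keep the write rather than delete it.

```python
import json
from pathlib import Path


def add_missing_keys(src: Path, dst: Path, prefix: str = "utt") -> int:
    """Copy a JSONL manifest, adding a "key" field where it is absent or null.

    The generated key format is an assumption made for illustration.
    Returns the number of entries that received a generated key.
    """
    filled = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for index, line in enumerate(fin):
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            if entry.get("key") is None:
                entry["key"] = f"{prefix}_{index:06d}"
                filled += 1
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return filled


def format_tsv_line(key, text, fallback_key: str = "unknown") -> str:
    """Build the "key<TAB>text" line without failing when key or text is None."""
    safe_key = fallback_key if key is None else str(key)
    safe_text = "" if text is None else str(text)
    return safe_key + "\t" + safe_text + "\n"
```

To use the first helper, write the fixed manifest to a new path and point ++dataset_config.val_data_path at it. For the second, the write call would become q.write(format_tsv_line(key, source_text)), assuming q, key and source_text are the same variables shown in the traceback.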
