
Added onEmitChunk callback to extract audio before onSpeechEnd (live audio) #122 #191


Open
wants to merge 1 commit into master

Conversation

gencerege

I was trying to implement whisper_streaming via a websocket. For the implementation I needed to access the frames before onSpeechEnd triggered. I first implemented it outside the package using the onFrameProcessed callback, then implemented it as an onEmitChunk callback after seeing issues #186, #122, and #68.

I have added a callback function, onEmitChunk, that returns an audio segment of length numFramesToEmit * frameSamples.
When speech end is detected, it returns all the frames accumulated since the last call to onEmitChunk.

After this implementation, I was able to do live transcription solely using this callback, which I believe is a nice simplification.
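The chunking behavior described above can be sketched roughly as follows. This is a hypothetical TypeScript illustration, not the PR's actual code; `ChunkEmitter` and its methods are invented names, while `numFramesToEmit` and `onEmitChunk` mirror the option names the PR introduces:

```typescript
type EmitChunkCallback = (audio: Float32Array) => void

// Illustrative sketch: accumulate frames while speech is active, emit a
// chunk every numFramesToEmit frames, and flush the remainder at speech end.
class ChunkEmitter {
  private buffer: Float32Array[] = []

  constructor(
    private numFramesToEmit: number,
    private onEmitChunk: EmitChunkCallback
  ) {}

  // Called once per processed frame while speech is active.
  processFrame(frame: Float32Array): void {
    this.buffer.push(frame)
    if (this.buffer.length >= this.numFramesToEmit) {
      this.flush()
    }
  }

  // Called when speech end is detected: emits whatever has accumulated
  // since the last onEmitChunk call.
  endSpeech(): void {
    if (this.buffer.length > 0) this.flush()
  }

  private flush(): void {
    const total = this.buffer.reduce((n, f) => n + f.length, 0)
    const audio = new Float32Array(total)
    let offset = 0
    for (const f of this.buffer) {
      audio.set(f, offset)
      offset += f.length
    }
    this.buffer = []
    this.onEmitChunk(audio)
  }
}
```

With numFramesToEmit = 2, three frames of speech followed by a speech end would yield one full chunk of two frames and one final chunk of one frame.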

I have also made a small modification to the algorithm by adding endSpeechPadFrames. Mainly, this allows flexibility in the ending region of the audio segment. As an example, one may want to wait 0.5s before ending a speech segment, but a padding of 0.5s can be overkill, and a smaller pad such as 0.2s may be preferable.
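For reference, converting those durations into frame counts depends on the sample rate and frame size. A rough helper, assuming 16 kHz audio and frameSamples = 1536 (a common default for this library, though it varies by model version):

```typescript
// Convert a desired pad duration (seconds) into a frame count.
// sampleRate and frameSamples are assumptions; adjust to your config.
function secondsToFrames(
  seconds: number,
  sampleRate: number,
  frameSamples: number
): number {
  return Math.round((seconds * sampleRate) / frameSamples)
}

// secondsToFrames(0.5, 16000, 1536) → 5 frames of redemption
// secondsToFrames(0.2, 16000, 1536) → 2 frames of end padding
```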

Using endSpeechPadFrames and redemptionFrames together, I changed how the audio buffer resets after speech end is detected. If there are frames that fall between endSpeechPadFrames and redemptionFrames, they are kept in the buffer to be used as preSpeechPadFrames in case speech starts right away.
Ideally preSpeechPadFrames + endSpeechPadFrames >= redemptionFrames, so that we can always pad with the desired preSpeechPadFrames. However, even if this is not the case, as long as endSpeechPadFrames < redemptionFrames there is extra padding compared to simply resetting the audio buffer.
This allows for better segmentation when speech starts right after speech end is raised, by removing or reducing the period where speech cannot be prepended and reducing the chance of the starting syllables of buffer t leaking into buffer t - 1. This is not a problem if consecutive buffers are appended before processing; however, if they are processed separately, it is not ideal.
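The retention logic described above can be sketched as follows. This is an illustrative TypeScript sketch, not the PR's actual implementation; `splitAtSpeechEnd` is an invented name, while `redemptionFrames` and `endSpeechPadFrames` mirror the option names:

```typescript
// At speech end, the emitted segment keeps endSpeechPadFrames of trailing
// silence; the remaining (redemptionFrames - endSpeechPadFrames) frames
// are retained in the buffer as potential pre-speech padding for the
// next segment, instead of being discarded.
function splitAtSpeechEnd(
  frames: Float32Array[],
  redemptionFrames: number,
  endSpeechPadFrames: number
): { emitted: Float32Array[]; retained: Float32Array[] } {
  const keep = Math.max(0, redemptionFrames - endSpeechPadFrames)
  const boundary = Math.max(0, frames.length - keep)
  return {
    emitted: frames.slice(0, boundary),
    retained: frames.slice(boundary),
  }
}
```

For example, with a 10-frame buffer, redemptionFrames = 5, and endSpeechPadFrames = 2, the emitted segment contains 7 frames and 3 frames are retained for the next segment's pre-speech padding.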

I would appreciate any feedback on things you might want to change. Please let me know!

  • [x] Verified that changes work on the test site, adding changes to the test site if necessary to try out your changes
  • [x] Updated relevant changelogs
  • [x] Ran npm run format


vercel bot commented Feb 22, 2025

The latest updates on your projects.

vad_test_site: ✅ Ready (updated Feb 22, 2025 6:54pm UTC)

@altyni86

This is cool!

@ricky0123
Owner

Hey, thanks for the PR, I really appreciate it. I'm going to play around with it and get back to you. Thanks!

frames.length -
(this.options.redemptionFrames - this.options.endSpeechPadFrames)
)
const audio = concatArrays(audioBufferPad)
handleEvent({ msg: Message.SpeechEnd, audio })


Shouldn't SpeechEnd be emitted after EmitChunk?

@yourenyouyu

When will the code be merged?

@niron1

niron1 commented May 25, 2025

I'm not sure how to use this commit. All the examples are based on vad-web, not vad, and vad-web does not expose the newly added onEmitChunk.

6 participants