Added onEmitChunk callback to extract audio before onSpeechEnd (live audio) #122 #191
I was trying to implement whisper_streaming via a websocket. For that implementation I needed access to the frames before onSpeechEnd fired. I first implemented this outside the package using the onFrameProcessed callback, then moved it into the package as an onEmitChunk callback after seeing issues #186 #122 #68.
I have added a callback, onEmitChunk, that is called with an audio segment of length numFramesToEmit * frameSamples samples.
When speech end is detected, it is called with all of the frames accumulated since the last call to onEmitChunk.
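For concreteness, here is a minimal sketch of the emission logic; the class and member names are illustrative only, not the actual frame processor internals:

```ts
// Sketch only: illustrative names, not the real frame processor code.
type EmitOptions = {
  numFramesToEmit: number
  frameSamples: number
  onEmitChunk: (audio: Float32Array) => void
}

class ChunkEmitter {
  private accumulated: Float32Array[] = []

  constructor(private options: EmitOptions) {}

  // Called once per processed frame while speech is active.
  pushFrame(frame: Float32Array) {
    this.accumulated.push(frame)
    if (this.accumulated.length >= this.options.numFramesToEmit) {
      this.flush()
    }
  }

  // Called when speech end is detected: emit whatever has accumulated so far.
  onSpeechEnd() {
    if (this.accumulated.length > 0) {
      this.flush()
    }
  }

  private flush() {
    const { frameSamples, onEmitChunk } = this.options
    const audio = new Float32Array(this.accumulated.length * frameSamples)
    this.accumulated.forEach((frame, i) => audio.set(frame, i * frameSamples))
    this.accumulated = []
    onEmitChunk(audio)
  }
}
```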
After this change, I was able to do live transcription using this callback alone, which I believe is a nice simplification.
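This is roughly how I use it for live transcription over a websocket (a minimal sketch: the websocket endpoint is a placeholder, and numFramesToEmit / onEmitChunk are the option names proposed in this PR):

```ts
import { MicVAD } from "@ricky0123/vad-web"

// Placeholder: a transcription server that accepts raw Float32 PCM chunks.
const ws = new WebSocket("ws://localhost:9090/transcribe")

const vad = await MicVAD.new({
  // Options proposed in this PR:
  numFramesToEmit: 16, // emit roughly every 16 * frameSamples samples
  onEmitChunk: (audio: Float32Array) => {
    // Stream partial speech audio before onSpeechEnd fires.
    ws.send(audio.buffer)
  },
  // Existing callback still receives the full segment as before.
  onSpeechEnd: (audio: Float32Array) => {
    console.log(`segment finished: ${audio.length} samples`)
  },
})

vad.start()
```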
I have also made a small modification to the algorithm by adding endSpeechPadFrames. This allows flexibility in the ending region of the audio segment: for example, one may want to wait 0.5s before ending a speech segment, but a 0.5s end padding can be overkill, and a smaller pad such as 0.2s may be preferable.
Using endSpeechPadFrames and redemptionFrames together, I changed how the audio buffer resets after speech end is detected. If there are frames that fall between endSpeechPadFrames and redemptionFrames, they are kept in the buffer so they can be used as preSpeechPadFrames in case speech starts again right away.
Ideally preSpeechPadFrames + endSpeechPadFrames >= redemptionFrames, so that we can always pad with the desired preSpeechPadFrames. Even if this is not the case, as long as endSpeechPadFrames < redemptionFrames there is extra padding compared to simply resetting the audio buffer.
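To make the buffer handling concrete, here is a simplified sketch of the idea (illustrative names, not the actual frame processor code):

```ts
// Illustrative sketch of the end-of-speech buffer handling described above.
function endSegment(
  audioBuffer: Float32Array[], // frames accumulated for the current segment
  redemptionFrames: number,    // silent frames waited before raising speech end
  endSpeechPadFrames: number,  // silent frames kept as end padding (<= redemptionFrames)
): { segment: Float32Array[]; carryOver: Float32Array[] } {
  // The last `redemptionFrames` frames are silence; keep only `endSpeechPadFrames` of them
  // as end padding for the emitted segment.
  const speechEnd = audioBuffer.length - redemptionFrames + endSpeechPadFrames
  const segment = audioBuffer.slice(0, speechEnd)

  // The remaining tail (redemptionFrames - endSpeechPadFrames frames) is not thrown away:
  // it stays in the buffer so it can serve as pre-speech padding if speech resumes immediately.
  const carryOver = audioBuffer.slice(speechEnd)
  return { segment, carryOver }
}
```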
This allows for better segmentation when speech starts right after a speech end event is raised: it removes or reduces the period where speech cannot be prepended and reduces the chance that the starting syllables of buffer t leak into buffer t - 1. This is not a problem if consecutive buffers are appended before processing; however, if they are processed separately it is not ideal.
If there are things you would like me to change, please let me know!