Lightweight, low-latency voice assistant running fully offline on a Raspberry Pi or Linux machine.
Powered by PyAudio, Vosk STT, Piper TTS, and local LLMs via Ollama.
- Microphone Input using PyAudio
- Real-Time Transcription with Vosk
- LLM-Powered Responses using Ollama with models like `gemma2:2b` and `qwen2.5:0.5b`
- Natural Voice Output via Piper TTS
- Optional Noise & Filter FX using SoX for realism
- ALSA Volume Control
- Modular Python code ready for customization
- Raspberry Pi 5 or Linux desktop
- Python 3.9+
- PyAudio, NumPy, requests, soxr, pydub, vosk
- SoX + ALSA utilities
- Ollama with one or more small LLMs (e.g., Gemma or Qwen)
- Piper TTS with ONNX voice models
Install dependencies:

```bash
pip install pyaudio requests soxr numpy pydub vosk
sudo apt install sox alsa-utils
```
Place a config file at `va_config.json`:

```json
{
  "volume": 8,
  "mic_name": "Plantronics",
  "audio_output_device": "Plantronics",
  "model_name": "gemma2:2b",
  "voice": "en_US-kathleen-low.onnx",
  "enable_audio_processing": false,
  "history_length": 6,
  "system_prompt": "You are a helpful assistant."
}
```
Note: if the configuration file is not found, the defaults defined within the main Python app will be used:

```python
# ------------------- CONFIG FILE LOADING -------------------
DEFAULT_CONFIG = {
    "volume": 9,
    "mic_name": "Plantronics",
    "audio_output_device": "Plantronics",
    "model_name": "qwen2.5:0.5b",
    "voice": "en_US-kathleen-low.onnx",
    "enable_audio_processing": False,
    "history_length": 4,
    "system_prompt": "You are a helpful assistant."
}
```
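For reference, the fallback can be as simple as merging the JSON file over the defaults. A minimal sketch, assuming `DEFAULT_CONFIG` is defined as above (the helper name `load_config` is illustrative, not necessarily what the script uses):

```python
import json
import os

def load_config(path="va_config.json"):
    """Return DEFAULT_CONFIG overridden by any keys found in the config file."""
    config = dict(DEFAULT_CONFIG)          # start from the built-in defaults
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            config.update(json.load(f))    # user settings override defaults
    else:
        print(f"{path} not found, using built-in defaults")
    return config

CONFIG = load_config()
```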
The `history_length` setting controls how many previous exchanges (user + assistant messages) are included when generating each new reply.

- A value of `6` means the model receives the last 6 exchanges, plus the system prompt.
- This allows the assistant to maintain short-term memory for more coherent conversations.
- Setting it lower (e.g., `2`) increases speed and memory efficiency.
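As an illustration, trimming the history to this window before each request could look like the sketch below (variable and function names are assumptions, not taken from voice_assistant.py):

```python
def build_messages(system_prompt, history, user_text, history_length=6):
    """Keep only the most recent exchanges, then append the new user turn."""
    # Each exchange is a (user_msg, assistant_msg) pair; keep the last N of them.
    recent = history[-history_length:]
    messages = [{"role": "system", "content": system_prompt}]
    for user_msg, assistant_msg in recent:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": user_text})
    return messages
```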
Contents of `requirements.txt`:

```text
pyaudio
vosk
soxr
numpy
requests
pydub
```

If you plan to run this on a Raspberry Pi, you may also need:

```text
soundfile  # for pydub compatibility on some distros
```
```bash
# 1. Clone the repo
git clone https://github.com/your-username/voice-assistant-local.git
cd voice-assistant-local

# 2. Create and activate a virtual environment
python3 -m venv env
source env/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install SoX and ALSA utilities (if not already installed)
sudo apt install sox alsa-utils

# 5. (Optional) Upgrade packaging tools before installing PyAudio
python -m pip install --upgrade pip setuptools wheel
```
Tip: if you get errors installing PyAudio on Raspberry Pi, try:

```bash
sudo apt install portaudio19-dev
pip install pyaudio
```
Piper is a standalone text-to-speech engine used by this assistant. It's not a Python package, so it must be installed manually.
- Download the appropriate Piper binary from: https://github.com/rhasspy/piper/releases

  For Ubuntu Linux, download `piper_linux_x86_64.tar.gz`.

- Extract it:

  ```bash
  tar -xvzf piper_linux_x86_64.tar.gz
  ```

- Move the binary into your project directory:

  ```bash
  mkdir -p bin/piper
  mv piper bin/piper/
  chmod +x bin/piper/piper
  ```

- Done! The script will automatically call it from `bin/piper/piper`.
Expected project structure:

```text
voice_assistant.py
va_config.json
requirements.txt
bin/
└── piper/
    └── piper        (binary)
voices/
├── en_US-kathleen-low.onnx
└── en_US-kathleen-low.onnx.json
```
To configure the correct audio devices, use these commands on your Raspberry Pi or Linux terminal:
1. List Microphones (Input Devices)

```bash
python3 -m pip install pyaudio
python3 -c "import pyaudio; p = pyaudio.PyAudio(); \
[print(f'{i}: {p.get_device_info_by_index(i)}') for i in range(p.get_device_count())]"
```

Look for your microphone name (e.g., Plantronics) and use that as `mic_name`.

2. List Speakers (Output Devices)

```bash
aplay -l
```

Example output:

```text
card 3: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
```

Use this info to set your `audio_output_device` to something like:

```json
"audio_output_device": "USB PnP"
```
Ollama is a local model runner for LLMs. You need to install it separately (outside of Python).
On Linux (x86 or ARM):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Or follow the detailed instructions at https://ollama.com/download

Then start the daemon:

```bash
ollama serve
```

After Ollama is installed and running, open a terminal and run:

```bash
ollama run gemma2:2b
ollama run qwen2.5:0.5b
```
This will automatically download and start the models. You only need to run this once per model.
Ollama is not a Python package; it is a background service. Do not add it to `requirements.txt`. Just make sure it's installed and running before launching the assistant.
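For context, a reply from the local daemon boils down to one HTTP call. A minimal sketch using `requests` against Ollama's standard `/api/chat` endpoint (the helper name and defaults are illustrative, not necessarily how voice_assistant.py does it):

```python
import requests

def ask_ollama(messages, model="qwen2.5:0.5b", host="http://localhost:11434"):
    """Send a chat request to the local Ollama daemon and return the reply text."""
    response = requests.post(
        f"{host}/api/chat",
        json={"model": model, "messages": messages, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]
```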
To enable speech synthesis, you'll need to download a voice model (.onnx) and its matching config (.json) file.
- Visit the official Piper voices list: https://github.com/rhasspy/piper/blob/master/VOICES.md

- Choose a voice you like (e.g., `en_US-lessac-medium` or `en_US-amy-low`).

- Download both files for your chosen voice:
  - `voice.onnx`
  - `config.json`

- If you wish, you can rename the ONNX file and config file using the same base name. For example:

  ```text
  amy-low.onnx
  amy-low.json
  ```

- Place both files in a directory called `voices/` next to your script. Example directory structure:

  ```text
  voice_assistant.py
  voices/
  ├── amy-low.onnx
  └── amy-low.json
  ```

- Update your `va_config.json`:

  ```json
  "voice": "amy-low.onnx"
  ```

Make sure both the `.onnx` and `.json` files are present in the `voices/` folder with matching names (excluding the extension).
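For reference, the Piper binary is typically driven by piping text to stdin; a minimal sketch of how the script could invoke it (the helper name and paths are illustrative, and exact flags can vary between Piper releases):

```python
import subprocess

def synthesize(text, voice="voices/amy-low.onnx", out_path="reply.wav"):
    """Run the Piper binary to turn text into a WAV file and return its path."""
    subprocess.run(
        ["bin/piper/piper", "--model", voice, "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )
    return out_path
```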
The script prints out debug timing for the STT, LLM, and TTS parts of the pipeline. I asked ChatGPT-4 to analyze some of the results I obtained.

- System: Ubuntu laptop, Intel Core i5
- Model: `qwen2.5:0.5b` (local via Ollama)
- TTS: `piper` with `en_US-kathleen-low.onnx`
- Audio: Plantronics USB headset
| Stage | Average time | Notes |
|---|---|---|
| STT Parse | 4.5 ms | Vosk transcribes near-instantly |
| LLM Inference | ~2,200 ms | Ranges from ~1 s (short queries) to 5 s |
| TTS Generation | ~1,040 ms | Piper ONNX performs well on CPU |
| Audio Playback | ~7,250 ms | Reflects actual audio length, not delay |
- STT speed is excellent: under 10 ms consistently.
- LLM inference is snappy for a 0.5b model running locally; the best response came in under 1.1 s.
- TTS is consistent and fast: the Kathleen-low voice is fully synthesized in ~800–1600 ms.
- Playback timing matches response length: no lag, just actual audio time.
- End-to-end round-trip time from speaking to hearing a reply is about 8–10 seconds, including speech and playback time.
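If you want to reproduce this kind of breakdown yourself, per-stage timings can be captured with `time.perf_counter()`. A small illustrative helper (not part of the script itself):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, in milliseconds."""
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"[timing] {label}: {elapsed_ms:.1f} ms")

# Usage:
# with timed("LLM inference"):
#     reply = ask_ollama(messages)
```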
- Offline smart assistants
- Wearable or embedded AI demos
- Voice-controlled kiosks
- Character-based roleplay agents
MIT © 2024 M15.ai