May 21, 2026

Vosk as a Local STT Provider

When running agentic systems with speech-to-text on edge devices such as a Raspberry Pi with 4 GB of RAM, resource constraints quickly become the main bottleneck. Running a heavy local model such as faster-whisper alongside a real-time video stream often degrades overall performance severely and visibly lowers video quality.

To address this, we integrated Vosk, a lightweight offline speech recognition toolkit that targets mobile and edge platforms. Configured as a local speech-to-text (STT) provider, Vosk gives us a low-overhead speech pipeline that preserves video quality and system responsiveness.

Implementation Architecture

The solution introduces native Vosk support into the transcription pipeline through three key architectural areas:

1. Configuration Interface

In hermes_cli/config.py, we added dedicated configuration blocks for Vosk under the speech-to-text (stt) block. This allows users to easily customize the model path, download lightweight language packs, and specify Vosk as their active provider.
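As a rough illustration, the Vosk settings could take a shape like the one below. The key names (provider, vosk, model) and the helper get_vosk_settings are assumptions made for this sketch, not the actual contents of hermes_cli/config.py; the model name is one of the small English models published for Vosk.

```python
from typing import Any, Dict, Optional

# Hypothetical layout: the key names are illustrative, not the real schema.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "stt": {
        "provider": "vosk",
        "vosk": {"model": "vosk-model-small-en-us-0.15"},
    }
}


def get_vosk_settings(config: Dict[str, Any]) -> Dict[str, Optional[object]]:
    """Return whether Vosk is the active provider and which model is set."""
    stt = config.get("stt") or {}
    vosk_block = stt.get("vosk") or {}
    return {
        "active": stt.get("provider") == "vosk",
        "model": vosk_block.get("model"),
    }
```

Keeping the Vosk options in their own sub-block means switching providers only requires changing one value, while the model choice stays in place.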

2. Highly Optimized Core Transcription

Inside tools/transcription_tools.py, we implemented the core logic. To prevent heavy disk reads and reload latency on every voice request, we designed a model caching singleton that holds the model structure in memory:

# Singleton for Vosk model caching
_vosk_model: Optional[object] = None
_vosk_model_name: Optional[str] = None
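A minimal sketch of how such a cache might be used follows. The accessor get_vosk_model and its loader parameter are hypothetical names introduced here; the default loader assumes that vosk.Model accepts a model_name keyword, as recent releases of the Python package do.

```python
from typing import Callable, Optional

_vosk_model: Optional[object] = None
_vosk_model_name: Optional[str] = None


def get_vosk_model(
    model_name: str,
    loader: Optional[Callable[[str], object]] = None,
) -> object:
    """Return the cached model, loading it only on first use or name change."""
    global _vosk_model, _vosk_model_name
    if _vosk_model is None or _vosk_model_name != model_name:
        if loader is None:
            # Imported lazily so the module still imports without vosk installed.
            from vosk import Model

            def loader(name: str) -> object:
                return Model(model_name=name)

        _vosk_model = loader(model_name)
        _vosk_model_name = model_name
    return _vosk_model
```

The injectable loader is only there so the caching behavior can be exercised without loading a real model from disk.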

Vosk requires incoming audio as 16 kHz mono PCM WAV. The preparation helper converts audio to that format as a safeguard, and the resulting file is then fed in 4000-frame chunks directly to the KaldiRecognizer engine:

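rec = KaldiRecognizer(_vosk_model, wf.getframerate())
rec.SetWords(False)  # word-level timestamps are not needed

results = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    # AcceptWaveform returns True once an utterance boundary is reached
    if rec.AcceptWaveform(data):
        part = json.loads(rec.Result())
        if part.get("text"):
            results.append(part["text"])

# Flush whatever is still buffered after the last chunk
final = json.loads(rec.FinalResult())
if final.get("text"):
    results.append(final["text"])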
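The format requirement mentioned above can be checked with the standard library before audio reaches the recognizer. The sketch below is illustrative only: is_vosk_ready_wav and ffmpeg_convert_command are hypothetical helpers, the 16-bit sample width is an assumption based on Vosk's own example scripts, and the conversion assumes ffmpeg is installed on the device.

```python
import wave
from typing import List


def is_vosk_ready_wav(path: str, expected_rate: int = 16000) -> bool:
    """True if the file is uncompressed mono 16-bit PCM at expected_rate."""
    try:
        with wave.open(path, "rb") as wf:
            return (
                wf.getnchannels() == 1
                and wf.getsampwidth() == 2  # 2 bytes per sample = 16-bit
                and wf.getcomptype() == "NONE"
                and wf.getframerate() == expected_rate
            )
    except (wave.Error, EOFError):
        # Not a readable WAV file at all, so it needs conversion.
        return False


def ffmpeg_convert_command(src: str, dst: str, rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that resamples src to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", str(rate),     # target sample rate
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # signed 16-bit little-endian PCM
        dst,
    ]
```

A caller could run the command with subprocess.run(cmd, check=True) only when the check fails, which skips the conversion for audio that is already in the right format.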

We also implemented a dynamic auto-detection fallback: if no other high-resource local or remote provider is available, the runtime gracefully defaults to Vosk to maintain STT functionality.
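The fallback can be pictured as a simple priority scan. This is a sketch under stated assumptions: the provider names, their order, the "auto" value, and resolve_stt_provider itself are all hypothetical, and the real routing logic may differ.

```python
from typing import Callable, Optional, Sequence

# Hypothetical provider names, heavier options first, Vosk as the last resort.
AUTO_ORDER: Sequence[str] = ("faster_whisper", "remote", "vosk")


def resolve_stt_provider(
    configured: Optional[str],
    is_available: Callable[[str], bool],
    order: Sequence[str] = AUTO_ORDER,
) -> Optional[str]:
    """Pick the configured provider, or auto-detect the first available one."""
    if configured and configured != "auto":
        # An explicit choice is respected, never silently replaced.
        return configured if is_available(configured) else None
    for name in order:
        if is_available(name):
            return name
    return None
```

For example, resolve_stt_provider("auto", lambda name: name == "vosk") selects "vosk" because nothing ahead of it in the order is available.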

3. Verification and Unit Tests

To ensure reliability across edge configurations, comprehensive tests were added in tests/tools/test_transcription_tools.py, covering package absence handling, mocking successful Kaldi recognizer stream loops, and routing fallback scenarios.
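As one example of the package-absence case, a test can simulate a missing vosk install by patching sys.modules. The helper vosk_available and the test names are hypothetical and are not taken from the actual test file.

```python
import sys
import types
from unittest import mock


def vosk_available() -> bool:
    """Report whether the vosk package can be imported."""
    try:
        import vosk  # noqa: F401
    except ImportError:
        return False
    return True


def test_reports_missing_package():
    # A None entry in sys.modules makes "import vosk" raise ImportError.
    with mock.patch.dict(sys.modules, {"vosk": None}):
        assert vosk_available() is False


def test_detects_installed_package():
    # An empty stand-in module is enough for the import to succeed.
    with mock.patch.dict(sys.modules, {"vosk": types.ModuleType("vosk")}):
        assert vosk_available() is True
```

Both tests run the same way whether or not vosk is installed on the machine executing them, which matters for CI environments that skip the optional dependency.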

