Whisper
Local speech-to-text (STT) service using OpenAI's Whisper model.
Overview
Whisper provides automatic speech recognition for the Voice Platform. It transcribes audio from voice calls and WebRTC sessions into text for processing by Norns.
Container: whisper
Image: onerahmet/openai-whisper-asr-webservice:latest
Port: 9000 (internal)
Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Voice Agent │────▶│ Whisper │────▶│ Norns Agent │
│ (Audio Input) │ │ (STT Engine) │ │ (Text Output) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Configuration
| Variable | Value | Description |
|---|---|---|
| ASR_ENGINE | faster_whisper | Uses faster-whisper for improved performance |
| ASR_MODEL | base.en | English-optimized base model |
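As an illustration of how these two variables might reach the container, here is a minimal sketch that assembles `docker run` arguments. It assumes the container is started with plain `docker run` rather than an orchestration file, and `whisper_run_args` is a hypothetical helper name, not something shipped with the image. No port is published because the port is described as internal.

```shell
# Sketch: print the `docker run` arguments for the Whisper container,
# one argument per line. whisper_run_args is a hypothetical helper.
whisper_run_args() {
  local engine="${1:-faster_whisper}"   # value for ASR_ENGINE
  local model="${2:-base.en}"           # value for ASR_MODEL
  printf '%s\n' \
    run -d --name whisper \
    -e "ASR_ENGINE=${engine}" \
    -e "ASR_MODEL=${model}" \
    onerahmet/openai-whisper-asr-webservice:latest
}
```

Running `docker $(whisper_run_args faster_whisper small.en)` would then start the container with a different model. The unquoted expansion relies on none of the values containing spaces.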
Model Options
| Model | Parameters | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny.en | 39M | Fastest | Good | Real-time, low latency |
| base.en | 74M | Fast | Better | Current (balanced) |
| small.en | 244M | Medium | High | Accurate transcription |
| medium.en | 769M | Slow | Higher | High accuracy needs |
API Usage
Transcribe Audio
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=json" \
  -F "audio_file=@audio.wav"
Response
{
  "text": "Hello, this is a test transcription."
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| audio_file | file | Audio file (WAV, MP3, FLAC, etc.) |
| task | string | transcribe or translate |
| language | string | Language code of the audio (e.g., en) |
| output | string | Response format: json, txt, srt, or vtt |
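The parameters above can be combined into a small wrapper. This is a sketch, not the platform's actual client: `asr_url` and `transcribe` are hypothetical helper names, and it assumes the upstream webservice reads `task`, `language`, and `output` from the query string while `audio_file` is uploaded as a multipart form field.

```shell
# Sketch: build an /asr URL from the parameters in the table.
# asr_url and transcribe are hypothetical helpers.
asr_url() {
  local base="${1:-http://whisper:9000}"
  local task="${2:-transcribe}"
  local language="${3:-en}"
  local output="${4:-json}"
  printf '%s/asr?task=%s&language=%s&output=%s\n' \
    "$base" "$task" "$language" "$output"
}

# Upload one audio file and print the response body.
# --fail makes curl return non-zero on HTTP errors.
transcribe() {
  local file="$1"
  curl -sS --fail -X POST "$(asr_url)" -F "audio_file=@${file}"
}
```

For example, `transcribe audio.wav` would post the file with the defaults, and `asr_url http://localhost:9000 transcribe en srt` prints a URL that requests subtitles instead of JSON.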
Integration with Voice Platform
Whisper is called by the Voice Agent for:
- Phone Calls - Transcribes caller speech via Telephony
- WebRTC - Transcribes browser-based voice input
- Voice Commands - Converts speech to actionable text
Flow
User speaks → LiveKit/Telephony → Audio chunk
↓
Whisper (STT)
↓
Text transcript
↓
Norns Agent → Response
↓
Piper (TTS) → Audio response
Performance
- Latency: ~200-500ms for short utterances (base.en model)
- Memory: ~1GB RAM
- GPU: Not required (CPU inference on M4)
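To check the latency figure on your own hardware, one option is to time a single request with curl's `%{time_total}` write-out variable. The sketch below is an assumption-laden example: `to_ms` and `time_asr` are hypothetical helper names, and the default URL and file path are placeholders taken from elsewhere in this page.

```shell
# Convert curl's %{time_total} value (seconds) to whole milliseconds.
to_ms() {
  awk -v s="$1" 'BEGIN { printf "%.0f\n", s * 1000 }'
}

# Time one transcription request and print the elapsed milliseconds.
# The default file path and URL are assumptions for illustration.
time_asr() {
  local file="${1:-/tmp/test.wav}"
  local url="${2:-http://whisper:9000/asr}"
  local secs
  secs="$(curl -sS -o /dev/null -w '%{time_total}' \
    -X POST "$url" -F "audio_file=@${file}")" || return 1
  to_ms "$secs"
}
```

Calling `time_asr /tmp/test.wav` a few times gives a rough picture. The first request after startup may be slower if the model is still loading.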
Quick Commands
# Check service health
docker exec whisper curl -s http://localhost:9000/health
# View logs
docker logs -f whisper
# Test transcription (/tmp/test.wav must exist inside the container)
docker exec whisper curl -s http://localhost:9000/asr \
-F "audio_file=@/tmp/test.wav"
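When scripting around these commands, it can help to wait until the health check passes before sending audio. The following is a minimal sketch: `retry` and `whisper_healthy` are hypothetical helper names, and the probe reuses the health command shown above unchanged.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Probe the health endpoint from inside the container.
# -f makes curl return non-zero on an HTTP error status.
whisper_healthy() {
  docker exec whisper curl -sf http://localhost:9000/health >/dev/null
}
```

For example, `retry 30 2 whisper_healthy && echo "whisper is up"` polls every two seconds for up to a minute.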
See Also
- Piper - Text-to-speech (TTS)
- Voice Gateway - Voice processing
- Telephony - Phone integration
- LiveKit - WebRTC infrastructure