Whisper

Local speech-to-text (STT) service using OpenAI's Whisper model.

Overview

Whisper provides automatic speech recognition for the Voice Platform. It transcribes audio from voice calls and WebRTC sessions into text for processing by Norns.

Container: whisper
Image: onerahmet/openai-whisper-asr-webservice:latest
Port: 9000 (internal)

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Voice Agent   │────▶│     Whisper     │────▶│   Norns Agent   │
│  (Audio Input)  │     │   (STT Engine)  │     │   (Text Output) │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Configuration

Variable	Value	Description
`ASR_ENGINE`	`faster_whisper`	Uses faster-whisper for improved performance
`ASR_MODEL`	`base.en`	English-optimized base model

Model Options

Model	Parameters	Speed	Accuracy	Use Case
`tiny.en`	39M	Fastest	Good	Real-time, low latency
`base.en`	74M	Fast	Better	Current (balanced)
`small.en`	244M	Medium	High	Accurate transcription
`medium.en`	769M	Slow	Higher	High accuracy needs
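
Switching models comes down to changing `ASR_MODEL` and recreating the container. The sketch below assumes the container is started by hand with `docker run`; the real deployment may use Compose or another orchestrator, and the helper name `whisper_docker_args` is hypothetical. The container name, image, and variable names are taken from this page.

```shell
# Print the `docker` arguments for a Whisper container running the given model.
# Hypothetical helper: adjust networks, volumes, and restart policy to match
# the actual deployment, which this page does not describe.
whisper_docker_args() {
  local model="${1:-base.en}"
  printf '%s\n' \
    run -d --name whisper \
    -e ASR_ENGINE=faster_whisper \
    -e "ASR_MODEL=${model}" \
    onerahmet/openai-whisper-asr-webservice:latest
}

# Usage (not executed here): replace the running container with small.en.
#   docker rm -f whisper
#   docker $(whisper_docker_args small.en)
```

Printing one argument per line keeps the helper easy to inspect before anything is actually started.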

API Usage

Transcribe Audio

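# Options go in the query string; only the audio file is sent as form data.
# output=json is needed to get the JSON response shown below.
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=json" \
  -H "Content-Type: multipart/form-data" \
  -F "audio_file=@audio.wav"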

Response

{
  "text": "Hello, this is a test transcription."
}

Parameters

Parameter	Type	Description
`audio_file`	file	Audio file (WAV, MP3, FLAC, etc.)
`task`	string	`transcribe` or `translate`
`language`	string	Language code (e.g., `en`)
`output`	string	`json`, `text`, `srt`, `vtt`
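
As a sketch of how these parameters combine, the helper below builds an `/asr` URL and posts a file to it. It assumes the upstream webservice reads `task`, `language`, and `output` from the query string, with only `audio_file` in the multipart body; the function names and the `WHISPER_URL` variable are hypothetical.

```shell
# Build the /asr URL. Arguments: task, language, output (all optional).
# Assumption: options are passed as query parameters, not form fields.
asr_url() {
  local base="${WHISPER_URL:-http://whisper:9000}"
  printf '%s/asr?task=%s&language=%s&output=%s\n' \
    "$base" "${1:-transcribe}" "${2:-en}" "${3:-json}"
}

# Upload one audio file and print the response body.
# Arguments: path to the audio file, then the same optional arguments as asr_url.
# --fail makes curl exit non-zero on HTTP errors.
transcribe() {
  if [ "$#" -lt 1 ]; then
    echo "usage: transcribe FILE [TASK] [LANGUAGE] [OUTPUT]" >&2
    return 2
  fi
  local file="$1"
  shift
  curl -sS --fail -X POST "$(asr_url "$@")" -F "audio_file=@${file}"
}

# Usage (not executed here): request SubRip subtitles for a recording.
#   transcribe call.wav transcribe en srt > call.srt
```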

Integration with Voice Platform

Whisper is called by the Voice Agent for:

Phone Calls - Transcribes caller speech via Telephony
WebRTC - Transcribes browser-based voice input
Voice Commands - Converts speech to actionable text

Flow

User speaks → LiveKit/Telephony → Audio chunk
                                      ↓
                               Whisper (STT)
                                      ↓
                               Text transcript
                                      ↓
                               Norns Agent → Response
                                      ↓
                               Piper (TTS) → Audio response

Performance

Latency: ~200-500ms for short utterances (base.en model)
Memory: ~1GB RAM
GPU: Not required (CPU inference on M4)
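
To check the latency figure against a real deployment, curl's `-w '%{time_total}'` write-out can time a single request. This is a minimal sketch, assuming the service is reachable at `http://whisper:9000` from wherever the command runs; `WHISPER_URL` and the function names are hypothetical, and the measurement includes upload time as well as inference.

```shell
# Convert fractional seconds (as printed by curl) to whole milliseconds.
seconds_to_ms() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Time one transcription request and print the elapsed time in milliseconds.
# The response body is discarded; only curl's total request time is kept.
asr_latency_ms() {
  local secs
  secs="$(curl -sS --fail -o /dev/null -w '%{time_total}' \
    -X POST "${WHISPER_URL:-http://whisper:9000}/asr" \
    -F "audio_file=@$1")" || return 1
  seconds_to_ms "$secs"
}

# Usage (not executed here): time five requests with the same short clip.
#   for i in 1 2 3 4 5; do asr_latency_ms utterance.wav; done
```

Repeating the request a few times helps separate warm-up cost from steady-state latency.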

Quick Commands

# Check service health
docker exec whisper curl -s http://localhost:9000/health

# View logs
docker logs -f whisper

# Test transcription
docker exec whisper curl -s http://localhost:9000/asr \
  -F "audio_file=@/tmp/test.wav"
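
Because the test command runs `curl` inside the container, `/tmp/test.wav` must exist in the container's filesystem rather than on the host. One way to get it there, sketched here with a hypothetical helper name, is `docker cp`:

```shell
# Copy a host audio file into the whisper container, then transcribe it there.
# Argument: path to an audio file on the host.
whisper_smoke_test() {
  docker cp "$1" whisper:/tmp/test.wav || return 1
  docker exec whisper curl -s http://localhost:9000/asr \
    -F "audio_file=@/tmp/test.wav"
}

# Usage (not executed here):
#   whisper_smoke_test ./sample.wav
```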
