Whisper Transcription Service
Available from Release 2026.06
Overview
Starting with release 2026.06, the speech service supports OpenAI Whisper as an alternative transcription backend alongside Microsoft Azure Speech-to-Text. Whisper can be deployed in two ways:
Azure OpenAI Whisper — cloud-managed via your existing Azure OpenAI resource
Whisper OX — self-hosted using faster-whisper via Ray Serve, for on-premise / data-sovereignty scenarios
For general voice architecture, provisioning, and troubleshooting, refer to:
- Voice Infrastructure: architecture diagrams, deployment, and MS STT setup
- Voice Administration: admin configuration and permissions
- Using Voice: end-user documentation
When to Choose Whisper
| | Microsoft STT (default) | Azure OpenAI Whisper | Whisper OX (self-hosted) |
|---|---|---|---|
| Data residency | Azure region | Azure region | Fully on-premise |
| Authentication | Workload Identity | API Key | Network-level |
| Cost | CHF 0.298/hr audio | Azure OpenAI pricing | Infrastructure only |
| Best for | Low-latency live dictation | Cloud customers wanting OpenAI models | Data-sovereignty requirements |
How Whisper Differs from MS STT
The existing MS STT provider streams audio in real time and returns interim results. Whisper providers instead process audio in buffered chunks:
Audio arrives from the browser as PCM chunks via WebSocket (same as MS STT)
The speech service buffers audio for a configurable duration
The buffer is converted to WAV and sent to the Whisper endpoint
A transcript is returned to the browser
On stop, remaining audio is sent as a final chunk
The chunk duration is controlled by WHISPER_CHUNK_DURATION_SECONDS:
| Value | Behavior |
|---|---|
| | Buffer all audio until the user stops (highest accuracy, highest latency) |
| | Lower latency, potentially less accurate on short chunks |
| | Recommended: good balance of latency and accuracy |
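The buffering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the speech service's actual code: `ChunkBuffer` and `pcm_to_wav` are hypothetical names, the use of `None` to mean "buffer until stop" is a convention of this sketch only, and the PCM format (16 kHz, 16-bit, mono) is assumed to match the WAV format given in the Whisper OX API contract.

```python
from __future__ import annotations

import io
import wave

SAMPLE_RATE_HZ = 16_000   # assumed: 16 kHz, as in the Whisper OX WAV format
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM bytes in a WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return out.getvalue()


class ChunkBuffer:
    """Hypothetical buffer that turns a PCM stream into fixed-length WAV chunks."""

    def __init__(self, chunk_seconds: float | None) -> None:
        # None is this sketch's way of saying "buffer everything until stop".
        if chunk_seconds is not None and chunk_seconds <= 0:
            raise ValueError("chunk_seconds must be positive or None")
        self._limit = (
            None
            if chunk_seconds is None
            else int(chunk_seconds * SAMPLE_RATE_HZ) * SAMPLE_WIDTH_BYTES * CHANNELS
        )
        self._buf = bytearray()

    def push(self, pcm: bytes) -> list[bytes]:
        """Add PCM from the WebSocket; return any WAV chunks that are now full."""
        self._buf.extend(pcm)
        chunks: list[bytes] = []
        while self._limit is not None and len(self._buf) >= self._limit:
            chunks.append(pcm_to_wav(bytes(self._buf[: self._limit])))
            del self._buf[: self._limit]
        return chunks

    def flush(self) -> bytes | None:
        """On stop: return the remaining audio as a final WAV chunk, if any."""
        if not self._buf:
            return None
        wav_bytes = pcm_to_wav(bytes(self._buf))
        self._buf.clear()
        return wav_bytes
```

Each chunk returned by `push` would be sent to the Whisper endpoint as it becomes available, and `flush` covers the "on stop" step. A shorter chunk duration means `push` yields chunks more often, which is where the latency and accuracy trade-off in the table comes from.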
Option A: Azure OpenAI Whisper
Use this if you already have an Azure OpenAI resource.
Prerequisites
An Azure OpenAI resource with a Whisper model deployment
API key for the resource
Network connectivity from the speech service pods to the Azure OpenAI endpoint
Provisioning
In the Azure Portal, navigate to your Azure OpenAI resource
Under Model Deployments, deploy the whisper model
Note the endpoint URL and API key
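Once the deployment exists, the endpoint URL and API key can be checked independently of the speech service. The sketch below builds the URL and header for the Azure OpenAI audio transcription REST route. The function name is hypothetical, the `api_version` value shown in the usage note is only an example (use a version your resource supports), and nothing here is claimed to match how the speech service constructs its own calls.

```python
from urllib.parse import quote, urlencode


def build_transcription_request(
    endpoint: str, deployment: str, api_key: str, api_version: str
) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for an Azure OpenAI audio transcription call.

    endpoint   : resource endpoint, e.g. https://<resource>.openai.azure.com
    deployment : name you gave the Whisper model deployment
    api_version: assumption, supplied by the caller
    """
    base = endpoint.rstrip("/")
    path = f"/openai/deployments/{quote(deployment, safe='')}/audio/transcriptions"
    url = f"{base}{path}?{urlencode({'api-version': api_version})}"
    # Key-based auth uses the "api-key" header; never log or hardcode the key.
    headers = {"api-key": api_key}
    return url, headers
```

For example, `build_transcription_request("https://example.openai.azure.com", "whisper", key, "2024-06-01")` yields a URL to which a short WAV file can be posted as the multipart form field `file` using any HTTP client. A 2xx response with a transcript confirms that the deployment name, endpoint, and key belong together.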
Environment Variables (Speech Service)
| Variable | Required | Description |
|---|---|---|
| | Yes | Set to |
| | Yes | e.g., |
| | Yes | Store in Azure Key Vault, load via secrets provider (CSI driver). Never hardcode. |
| | No | Defaults to |
| | No | Defaults to |
| | No | Defaults to |
Budget
Refer to Azure OpenAI pricing for current Whisper rates.
Option B: Whisper OX (Self-Hosted)
Use this for on-premise or data-sovereignty scenarios where audio must not leave your network.
Prerequisites
A server or Kubernetes cluster with GPU support (recommended) or sufficient CPU
Network connectivity from the speech service to the Whisper OX endpoint
A deployed Whisper OX server implementing the API contract below
API Contract
The endpoint must accept POST with multipart/form-data:
| Form Field | Type | Description |
|---|---|---|
| | File (WAV) | 16 kHz, 16-bit, mono PCM in WAV container |
| generate_kwargs | String (JSON) | Transcription parameters |
Parameters sent in generate_kwargs:
{
  "multilingual": true,
  "vad_filter": true,
  "task": "transcribe",
  "language": "en",
  "batch_size": 8,
  "beam_size": 5,
  "condition_on_previous_text": true
}
Expected response:
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello world",
      "sequence_confidence": -0.25,
      "words": null
    }
  ],
  "language": "en",
  "duration": 2.5
}
Sizing Guide
| Concurrent Users | Hardware | Model | Throughput |
|---|---|---|---|
| 1-10 | 1x NVIDIA T4 (16 GB) | | ~5-10x realtime |
| 10-50 | 1x NVIDIA A10G (24 GB) | | ~15-20x realtime |
| 50-200 | 2-4x NVIDIA A10G | | ~40-80x realtime |
| CPU only | 8+ cores, 16 GB RAM | | ~1-3x realtime |
"Nx realtime" = N seconds of audio transcribed per 1 second of wall-clock time.
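The realtime factor translates directly into processing time per chunk. The helper below is a small worked example of that arithmetic; the function name and the 10-second chunk length are illustrative assumptions, and the estimate ignores queueing when several users share the same hardware.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate wall-clock transcription time for one chunk of audio.

    realtime_factor is the N in "Nx realtime": seconds of audio
    transcribed per second of wall-clock time.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 10-second chunk at the low end of two rows in the sizing table:
gpu_estimate = transcription_seconds(10, 5)   # 5x realtime -> 2.0 seconds
cpu_estimate = transcription_seconds(10, 1)   # 1x realtime -> 10.0 seconds
```

At 1x realtime the server only just keeps up with a single speaker, which is why the CPU-only row is best suited to light use.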
A reference implementation is available in the Unique monorepo at .local/whisper-ox/.
Environment Variables (Speech Service)
| Variable | Required | Description |
|---|---|---|
| | Yes | Set to |
| | Yes | e.g., |
| | No | Defaults to |
| | No | Defaults to |
| WHISPER_OX_API_KEY | No | API key to send to the Whisper OX server. When set, the key is sent in the header specified by WHISPER_OX_API_KEY_HEADER |
| WHISPER_OX_API_KEY_HEADER | No | Header name used to send the API key. Defaults to X-API-Key |
Security
The endpoint supports optional API key authentication. Set WHISPER_OX_API_KEY and optionally WHISPER_OX_API_KEY_HEADER (defaults to X-API-Key) on the speech service, and the corresponding WHISPER_API_KEY / WHISPER_API_KEY_HEADER on the Whisper OX server. When unset, no authentication is enforced; secure the endpoint at the network level (private endpoints, NetworkPolicy, service mesh).
Ensure TLS if traffic crosses network boundaries
Audio is sent as raw WAV; treat the endpoint as sensitive infrastructure
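The API contract and the optional API key header can be exercised together from a small client. The sketch below covers only the pieces this page specifies: the transcription parameters, the auth header, and the response shape. All function names are hypothetical, and joining segment texts with a single space is an assumption of this sketch, not a description of how the speech service assembles transcripts.

```python
from __future__ import annotations

import json


def build_generate_kwargs(language: str = "en") -> str:
    """Serialize the transcription parameters listed in the API contract."""
    return json.dumps(
        {
            "multilingual": True,
            "vad_filter": True,
            "task": "transcribe",
            "language": language,
            "batch_size": 8,
            "beam_size": 5,
            "condition_on_previous_text": True,
        }
    )


def build_auth_headers(
    api_key: str | None = None, header_name: str = "X-API-Key"
) -> dict[str, str]:
    """Return the API key header, or no headers when no key is configured."""
    return {header_name: api_key} if api_key else {}


def join_segments(response_body: str) -> str:
    """Concatenate segment texts from a response in the expected format."""
    payload = json.loads(response_body)
    # Assumption: segments are joined with one space, in the order returned.
    return " ".join(seg["text"].strip() for seg in payload.get("segments", []))
```

A test client would post the WAV file and the `build_generate_kwargs()` string as multipart form fields with the headers from `build_auth_headers(...)`, then pass the response body to `join_segments`. When the server has an API key configured, sending the request without the header is a quick way to confirm that authentication is actually enforced.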
Frontend: Enabling Whisper Mode on Assistants
The assistant must be configured to use Whisper mode. Set the following in assistant settings:
| Setting | Value |
|---|---|
| sttConfig.speechToTextMode | whisper |
When set, the frontend shows a waveform during recording and a processing indicator while waiting for the transcript. The WebSocket URL (SPEECH_BACKEND_API_URL) does not change — the backend handles provider routing transparently.
To revert an assistant to live streaming, set speechToTextMode back to live.
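When many assistants need switching, the change can be scripted. The helper below is a sketch that assumes assistant settings are available as a nested dictionary with an `sttConfig` object, as the `sttConfig.speechToTextMode` path suggests. The function name is hypothetical, and only the two modes named on this page are accepted.

```python
import copy

# Only the two modes mentioned on this page; others may exist.
VALID_MODES = {"whisper", "live"}


def with_stt_mode(settings: dict, mode: str) -> dict:
    """Return a copy of the assistant settings with speechToTextMode set."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported speechToTextMode: {mode!r}")
    updated = copy.deepcopy(settings)  # leave the caller's settings untouched
    updated.setdefault("sttConfig", {})["speechToTextMode"] = mode
    return updated
```

Returning a copy rather than mutating in place makes it easy to diff the old and new settings before saving them, and the same helper covers rollback by passing `"live"`.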
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| "Whisper OX endpoint is not configured" | | Set the env var on the speech service |
| Timeout during transcription | Server overloaded or unreachable | Check connectivity and server health ( |
| Empty transcript | Audio too short or silence-only | Check microphone input; |
| Slow transcription | CPU-only with large model | Use GPU or a smaller model ( |
| API error (4xx/5xx) | Server-side issue | Check Whisper OX server logs |
Enable debug logging with LOG_LEVEL=debug on the speech service for detailed chunk sizes, API calls, and transcription timings.
Migration Checklist
Enabling Whisper (from MS STT)
Deploy the Whisper backend (Azure OpenAI or Whisper OX)
Set TRANSCRIPTION_SERVICE and endpoint env vars on the speech service
Restart the speech service
Set sttConfig.speechToTextMode: "whisper" on target assistants
No frontend deployment changes needed
Rolling Back to MS STT
Remove or set TRANSCRIPTION_SERVICE=MS_STT
Restart the speech service
Set sttConfig.speechToTextMode: "live" on assistants