Whisper Transcription Service

Available from Release 2026.06

Overview

Starting with release 2026.06, the speech service supports OpenAI Whisper as an alternative transcription backend alongside Microsoft Azure Speech-to-Text. Whisper can be deployed in two ways:

Azure OpenAI Whisper — cloud-managed via your existing Azure OpenAI resource
Whisper OX — self-hosted using faster-whisper via Ray Serve, for on-premise / data-sovereignty scenarios

For general voice architecture, provisioning, and troubleshooting, refer to:

Voice Infrastructure: architecture diagrams, deployment, and MS STT setup
Voice Administration: admin configuration and permissions
Using Voice: end-user documentation

When to Choose Whisper

	Microsoft STT (default)	Azure OpenAI Whisper	Whisper OX (self-hosted)
Data residency	Azure region	Azure region	Fully on-premise
Authentication	Workload Identity	API key	Network-level (optional API key)
Cost	CHF 0.298/hr audio	Azure OpenAI pricing	Infrastructure only
Best for	Low-latency live dictation	Cloud customers wanting OpenAI models	Data-sovereignty requirements

How Whisper Differs from MS STT

The existing MS STT provider streams audio in real time and returns interim results. Whisper providers use chunked batch processing instead:

Audio arrives from the browser as PCM chunks via WebSocket (same as MS STT)
The speech service buffers audio for a configurable duration
The buffer is converted to WAV and sent to the Whisper endpoint
A transcript is returned to the browser
On stop, remaining audio is sent as a final chunk
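
The buffer-to-WAV step in the flow above can be sketched with Python's standard library. This is a minimal illustration, not the speech service's actual code: `pcm_to_wav` is a hypothetical helper, and the 16 kHz, 16-bit, mono format is assumed from the Whisper OX API contract.

```python
import io
import wave
from typing import Iterable

# Assumption: browser audio is 16 kHz, 16-bit, mono PCM.
SAMPLE_RATE_HZ = 16_000
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


def pcm_to_wav(pcm_chunks: Iterable[bytes]) -> bytes:
    """Wrap buffered raw PCM chunks in a WAV container (hypothetical helper)."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        # Concatenate the buffered chunks into one contiguous frame block.
        wav.writeframes(b"".join(pcm_chunks))
    return out.getvalue()
```

The returned bytes can be used directly as the file payload of the transcription request.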

The chunk duration is controlled by WHISPER_CHUNK_DURATION_SECONDS:

Value	Behavior
`0`	Buffer all audio until the user stops (highest accuracy, highest latency)
`5-10`	Lower latency, potentially less accurate on short chunks
`15-30`	Recommended — good balance of latency and accuracy
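
To illustrate how the chunk duration could translate into buffering behavior, here is a minimal sketch. `ChunkBuffer` is a hypothetical class, and the byte rate assumes 16 kHz, 16-bit, mono PCM; the speech service's real buffering logic may differ.

```python
from typing import Optional

# Assumption: 16 kHz * 2 bytes per sample * 1 channel.
BYTES_PER_SECOND = 16_000 * 2


class ChunkBuffer:
    """Accumulates PCM and releases it once the configured duration is reached."""

    def __init__(self, chunk_duration_seconds: int) -> None:
        # A duration of 0 yields a threshold of 0, meaning "never flush on size":
        # all audio is held until stop() is called.
        self._threshold = chunk_duration_seconds * BYTES_PER_SECOND
        self._buf = bytearray()

    def add(self, pcm: bytes) -> Optional[bytes]:
        """Append audio; return a chunk when the threshold is reached, else None."""
        self._buf.extend(pcm)
        if self._threshold and len(self._buf) >= self._threshold:
            return self._drain()
        return None

    def stop(self) -> Optional[bytes]:
        """Return the remaining audio as the final chunk, or None if empty."""
        return self._drain() if self._buf else None

    def _drain(self) -> bytes:
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk
```

With the default of 15 seconds, a chunk is released after 480,000 bytes of audio under these assumptions.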

Option A: Azure OpenAI Whisper

Use this if you already have an Azure OpenAI resource.

Prerequisites

An Azure OpenAI resource with a Whisper model deployment
API key for the resource
Network connectivity from the speech service pods to the Azure OpenAI endpoint

Provisioning

In the Azure Portal, navigate to your Azure OpenAI resource
Under Model Deployments, deploy the whisper model
Note the endpoint URL and API key

Environment Variables (Speech Service)

Variable	Required	Description
`TRANSCRIPTION_SERVICE`	Yes	Set to `WHISPER`
`AZURE_OPENAI_WHISPER_ENDPOINT`	Yes	e.g., `https://your-resource.openai.azure.com/`
`AZURE_OPENAI_WHISPER_API_KEY`	Yes	Store in Azure Key Vault, load via secrets provider (CSI driver). Never hardcode.
`AZURE_OPENAI_WHISPER_DEPLOYMENT`	No	Defaults to `whisper`
`AZURE_OPENAI_WHISPER_API_VERSION`	No	Defaults to `2024-02-01`
`WHISPER_CHUNK_DURATION_SECONDS`	No	Defaults to `15`
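
As a sketch of how these variables could combine into a request, the helper below builds the transcription URL and headers. `azure_whisper_request` is a hypothetical name; the URL path and the `api-key` header follow the public Azure OpenAI REST format, and the way the speech service actually issues the call is not documented here.

```python
import os
from typing import Dict, Mapping, Tuple


def azure_whisper_request(env: Mapping[str, str] = os.environ) -> Tuple[str, Dict[str, str]]:
    """Build the transcription URL and headers from the variables in the table above."""
    # Required variables raise KeyError when missing.
    endpoint = env["AZURE_OPENAI_WHISPER_ENDPOINT"].rstrip("/")
    api_key = env["AZURE_OPENAI_WHISPER_API_KEY"]
    # Optional variables fall back to the documented defaults.
    deployment = env.get("AZURE_OPENAI_WHISPER_DEPLOYMENT", "whisper")
    api_version = env.get("AZURE_OPENAI_WHISPER_API_VERSION", "2024-02-01")
    url = (
        f"{endpoint}/openai/deployments/{deployment}"
        f"/audio/transcriptions?api-version={api_version}"
    )
    return url, {"api-key": api_key}
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a configuration before deploying it.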

Budget

Refer to Azure OpenAI pricing for current Whisper rates.

Option B: Whisper OX (Self-Hosted)

Use this for on-premise or data-sovereignty scenarios where audio must not leave your network.

Prerequisites

A server or Kubernetes cluster with GPU support (recommended) or sufficient CPU
Network connectivity from the speech service to the Whisper OX endpoint
A deployed Whisper OX server implementing the API contract below

API Contract

The endpoint must accept POST with multipart/form-data:

Form Field	Type	Description
`audio_file`	File (WAV)	16 kHz, 16-bit, mono PCM in a WAV container
`generate_kwargs`	String (JSON)	Transcription parameters

Parameters sent in generate_kwargs:

json

{
  "multilingual": true,
  "vad_filter": true,
  "task": "transcribe",
  "language": "en",
  "batch_size": 8,
  "beam_size": 5,
  "condition_on_previous_text": true
}
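
A client-side request matching this contract might be encoded as follows. This is a sketch using only the standard library; `build_multipart` and the `chunk.wav` filename are hypothetical, and the parameter values simply mirror the JSON above.

```python
import json
import uuid
from typing import Tuple


def build_multipart(wav_bytes: bytes, language: str = "en") -> Tuple[bytes, str]:
    """Encode audio_file and generate_kwargs as multipart/form-data.

    Returns (body, content_type). Hypothetical helper for illustration.
    """
    generate_kwargs = {
        "multilingual": True,
        "vad_filter": True,
        "task": "transcribe",
        "language": language,
        "batch_size": 8,
        "beam_size": 5,
        "condition_on_previous_text": True,
    }
    boundary = uuid.uuid4().hex
    head_audio = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio_file"; filename="chunk.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    head_kwargs = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="generate_kwargs"\r\n\r\n'
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    body = head_audio + wav_bytes + head_kwargs + json.dumps(generate_kwargs).encode() + tail
    return body, f"multipart/form-data; boundary={boundary}"
```

The body and content type can then be sent in a POST to the configured endpoint with any HTTP client.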

Expected response:

json

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello world",
      "sequence_confidence": -0.25,
      "words": null
    }
  ],
  "language": "en",
  "duration": 2.5
}
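
One way to turn a response of this shape into a single transcript string is sketched below. `transcript_from_response` is a hypothetical helper; how the speech service actually assembles segments is not specified here.

```python
import json


def transcript_from_response(raw: str) -> str:
    """Join the text of all segments, in order, into one transcript string."""
    payload = json.loads(raw)
    # A missing or null "segments" field is treated as an empty transcript.
    segments = payload.get("segments") or []
    texts = (seg.get("text", "").strip() for seg in segments)
    return " ".join(text for text in texts if text)
```

An empty `segments` list yields an empty string, which corresponds to the "Empty transcript" symptom in the troubleshooting table.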

Sizing Guide

Concurrent Users	Hardware	Model	Throughput
1-10	1x NVIDIA T4 (16 GB)	`large-v3`	~5-10x realtime
10-50	1x NVIDIA A10G (24 GB)	`large-v3`	~15-20x realtime
50-200	2-4x NVIDIA A10G	`large-v3`	~40-80x realtime
CPU only	8+ cores, 16 GB RAM	`small` / `medium`	~1-3x realtime

"Nx realtime" = N seconds of audio transcribed per 1 second of wall-clock time.

A reference implementation is available in the Unique monorepo at .local/whisper-ox/.
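
As a worked example of the "Nx realtime" figures in the sizing guide, the hypothetical helper below estimates transcription time for one chunk. It ignores network transfer and queueing, so treat the result as a lower bound.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds to transcribe audio at a given "Nx realtime" throughput."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

For instance, a 15-second chunk at 5x realtime takes about 3 seconds to transcribe, while the same chunk at 1x realtime takes 15 seconds.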

Environment Variables (Speech Service)

Variable	Required	Description
`TRANSCRIPTION_SERVICE`	Yes	Set to `WHISPER_OX`
`WHISPER_OX_ENDPOINT`	Yes	e.g., `https://whisper.internal.company.com/transcribe`
`WHISPER_OX_LANGUAGE`	No	Defaults to `en`
`WHISPER_CHUNK_DURATION_SECONDS`	No	Defaults to `15`
`WHISPER_OX_API_KEY`	No	API key to send to the Whisper OX server. When set, the key is sent in the header specified by `WHISPER_OX_API_KEY_HEADER`. When unset, no auth header is sent.
`WHISPER_OX_API_KEY_HEADER`	No	Header name used to send the API key. Defaults to `X-API-Key`.
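
The API key behavior described in the table can be sketched as a small header builder. `whisper_ox_auth_headers` is a hypothetical name; only the variable names and the `X-API-Key` default come from the table above.

```python
import os
from typing import Dict, Mapping


def whisper_ox_auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the auth header for Whisper OX, or no header when no key is set."""
    api_key = env.get("WHISPER_OX_API_KEY")
    if not api_key:
        # Unset key: no auth header is sent.
        return {}
    header_name = env.get("WHISPER_OX_API_KEY_HEADER", "X-API-Key")
    return {header_name: api_key}
```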

Security

The endpoint supports optional API key authentication. Set WHISPER_OX_API_KEY and, optionally, WHISPER_OX_API_KEY_HEADER (defaults to X-API-Key) on the speech service, and set the corresponding WHISPER_API_KEY / WHISPER_API_KEY_HEADER on the Whisper OX server. When no key is set, no authentication is enforced, so secure the endpoint at the network level (private endpoints, NetworkPolicy, service mesh).
Use TLS if traffic crosses network boundaries
Audio is sent as raw WAV, so treat the endpoint as sensitive infrastructure

Frontend: Enabling Whisper Mode on Assistants

The assistant must be configured to use Whisper mode. Set the following in assistant settings:

Setting	Value
`sttConfig.speechToTextMode`	`whisper`

When set, the frontend shows a waveform during recording and a processing indicator while waiting for the transcript. The WebSocket URL (SPEECH_BACKEND_API_URL) does not change — the backend handles provider routing transparently.

To revert an assistant to live streaming, set speechToTextMode back to live.

Troubleshooting

Symptom	Cause	Resolution
"Whisper OX endpoint is not configured"	`WHISPER_OX_ENDPOINT` not set	Set the env var on the speech service
Timeout during transcription	Server overloaded or unreachable	Check connectivity and server health (`GET /health`)
Empty transcript	Audio too short or silence-only	Check microphone input; `vad_filter` is enabled by default and drops non-speech audio
Slow transcription	CPU-only with large model	Use GPU or a smaller model (`small`, `medium`)
API error (4xx/5xx)	Server-side issue	Check Whisper OX server logs

Enable debug logging with LOG_LEVEL=debug on the speech service for detailed chunk sizes, API calls, and transcription timings.

Migration Checklist

Enabling Whisper (from MS STT)

Deploy the Whisper backend (Azure OpenAI or Whisper OX)
Set TRANSCRIPTION_SERVICE (WHISPER or WHISPER_OX) and the required env vars for the chosen backend on the speech service
Restart the speech service
Set sttConfig.speechToTextMode: "whisper" on target assistants
No frontend deployment changes needed

Rolling Back to MS STT

Remove TRANSCRIPTION_SERVICE or set it to MS_STT
Restart the speech service
Set sttConfig.speechToTextMode: "live" on assistants