Whisper Transcription Service

Info: Available from release 2026.06

Overview

Starting with release 2026.06, the speech service supports OpenAI Whisper as an alternative transcription backend alongside Microsoft Azure Speech-to-Text. Whisper can be deployed in two ways:

  • Azure OpenAI Whisper — cloud-managed via your existing Azure OpenAI resource

  • Whisper OX — self-hosted using faster-whisper via Ray Serve, for on-premise / data-sovereignty scenarios

For general voice architecture, provisioning, and troubleshooting, refer to:

  • Voice Infrastructure: architecture diagrams, deployment, and MS STT setup

  • Voice Administration: admin configuration and permissions

  • Using Voice: end-user documentation

When to Choose Whisper

 

| | Microsoft STT (default) | Azure OpenAI Whisper | Whisper OX (self-hosted) |
| --- | --- | --- | --- |
| Data residency | Azure region | Azure region | Fully on-premise |
| Authentication | Workload Identity | API key | Network-level (optional API key) |
| Cost | CHF 0.298 per hour of audio | Azure OpenAI pricing | Infrastructure only |
| Best for | Low-latency live dictation | Cloud customers wanting OpenAI models | Data-sovereignty requirements |


How Whisper Differs from MS STT

The existing MS STT provider streams audio in real-time with interim results. Whisper providers use chunked batch processing:

  1. Audio arrives from the browser as PCM chunks via WebSocket (same as MS STT)

  2. The speech service buffers audio for a configurable duration

  3. The buffer is converted to WAV and sent to the Whisper endpoint

  4. A transcript is returned to the browser

  5. On stop, remaining audio is sent as a final chunk
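The buffering steps above can be sketched as follows. This is a minimal illustration, not the speech service's actual code: the `ChunkBuffer` class and `pcm_to_wav` helper are hypothetical names, and the 16 kHz, 16-bit, mono format is assumed from the Whisper OX API contract described later on this page.

```python
import io
import wave
from typing import Optional

SAMPLE_RATE_HZ = 16_000   # assumed: 16 kHz, as in the Whisper OX contract
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container (step 3)."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return out.getvalue()


class ChunkBuffer:
    """Hypothetical buffer: collects PCM chunks and emits WAV payloads."""

    def __init__(self, chunk_seconds: int) -> None:
        # A threshold of 0 means "never flush early": hold everything until stop().
        self.threshold = chunk_seconds * SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS
        self.pcm = bytearray()

    def add(self, chunk: bytes) -> Optional[bytes]:
        """Buffer one PCM chunk (step 2); return a WAV payload once enough audio is held."""
        self.pcm.extend(chunk)
        if self.threshold and len(self.pcm) >= self.threshold:
            return self._flush()
        return None

    def stop(self) -> Optional[bytes]:
        """Return any remaining audio as the final WAV chunk (step 5)."""
        return self._flush() if self.pcm else None

    def _flush(self) -> bytes:
        wav_bytes = pcm_to_wav(bytes(self.pcm))
        self.pcm.clear()
        return wav_bytes
```

Each non-`None` return value is what would be sent to the Whisper endpoint in step 3.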

The chunk duration is controlled by WHISPER_CHUNK_DURATION_SECONDS:

| Value (seconds) | Behavior |
| --- | --- |
| 0 | Buffer all audio until user stops (highest accuracy, highest latency) |
| 5-10 | Lower latency, potentially less accurate on short chunks |
| 15-30 | Recommended: good balance of latency and accuracy |
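To make the trade-off concrete, the arithmetic below estimates how many transcription requests one recording produces and how much audio must be captured before the first request can go out. Both function names are hypothetical, and the model ignores network and inference time.

```python
import math


def transcription_requests(recording_seconds: float, chunk_seconds: int) -> int:
    """Number of Whisper requests for one recording, counting the final chunk on stop."""
    if recording_seconds <= 0:
        return 0
    if chunk_seconds == 0:
        return 1  # everything is buffered and sent once when the user stops
    return math.ceil(recording_seconds / chunk_seconds)


def audio_before_first_request(recording_seconds: float, chunk_seconds: int) -> float:
    """Seconds of audio captured before the first request can be sent."""
    if chunk_seconds == 0:
        return recording_seconds
    return min(chunk_seconds, recording_seconds)
```

For a 60-second dictation, a chunk duration of 15 gives 4 requests with the first one sent after 15 seconds of audio, while 0 gives a single request sent only after the full 60 seconds.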


Option A: Azure OpenAI Whisper

Use this if you already have an Azure OpenAI resource.

Prerequisites

  1. An Azure OpenAI resource with a Whisper model deployment

  2. API key for the resource

  3. Network connectivity from the speech service pods to the Azure OpenAI endpoint

Provisioning

  1. In the Azure Portal, navigate to your Azure OpenAI resource

  2. Under Model Deployments, deploy the whisper model

  3. Note the endpoint URL and API key

Environment Variables (Speech Service)

| Variable | Required | Description |
| --- | --- | --- |
| TRANSCRIPTION_SERVICE | Yes | Set to WHISPER |
| AZURE_OPENAI_WHISPER_ENDPOINT | Yes | e.g., https://your-resource.openai.azure.com/ |
| AZURE_OPENAI_WHISPER_API_KEY | Yes | Store in Azure Key Vault, load via secrets provider (CSI driver). Never hardcode. |
| AZURE_OPENAI_WHISPER_DEPLOYMENT | No | Defaults to whisper |
| AZURE_OPENAI_WHISPER_API_VERSION | No | Defaults to 2024-02-01 |
| WHISPER_CHUNK_DURATION_SECONDS | No | Defaults to 15 |
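As an illustration of how these variables fit together, the sketch below assembles the request URL and auth header for the Azure OpenAI audio transcription REST endpoint. The function name is hypothetical, and it is an assumption that the speech service composes the request this way; the defaults mirror the table above.

```python
from typing import Dict, Mapping, Tuple
from urllib.parse import urlencode


def azure_whisper_request(env: Mapping[str, str]) -> Tuple[str, Dict[str, str]]:
    """Build the transcription URL and auth header from the environment variables."""
    endpoint = env["AZURE_OPENAI_WHISPER_ENDPOINT"].rstrip("/")
    deployment = env.get("AZURE_OPENAI_WHISPER_DEPLOYMENT", "whisper")
    api_version = env.get("AZURE_OPENAI_WHISPER_API_VERSION", "2024-02-01")
    query = urlencode({"api-version": api_version})
    url = f"{endpoint}/openai/deployments/{deployment}/audio/transcriptions?{query}"
    # Azure OpenAI key authentication uses the "api-key" request header.
    return url, {"api-key": env["AZURE_OPENAI_WHISPER_API_KEY"]}
```

Passing `os.environ` as `env` reads the values the speech service pod would see, which can help when checking connectivity from inside the cluster.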

Budget

Refer to Azure OpenAI pricing for current Whisper rates.


Option B: Whisper OX (Self-Hosted)

Use this for on-premise or data-sovereignty scenarios where audio must not leave your network.

Prerequisites

  1. A server or Kubernetes cluster with GPU support (recommended) or sufficient CPU

  2. Network connectivity from the speech service to the Whisper OX endpoint

  3. A deployed Whisper OX server implementing the API contract below

API Contract

The endpoint must accept POST with multipart/form-data:

| Form Field | Type | Description |
| --- | --- | --- |
| audio_file | File (WAV) | 16 kHz, 16-bit, mono PCM in WAV container |
| generate_kwargs | String (JSON) | Transcription parameters |

Parameters sent in generate_kwargs:

```json
{
  "multilingual": true,
  "vad_filter": true,
  "task": "transcribe",
  "language": "en",
  "batch_size": 8,
  "beam_size": 5,
  "condition_on_previous_text": true
}
```
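For testing a Whisper OX server against this contract, a request body can be assembled with the standard library alone. The sketch below is an assumption-laden illustration: the function name, the `chunk.wav` filename, and the `audio/wav` part content type are hypothetical choices, not values confirmed by the speech service.

```python
import json
import uuid
from typing import Dict, Tuple


def build_whisper_ox_body(wav_bytes: bytes, language: str = "en") -> Tuple[bytes, Dict[str, str]]:
    """Encode audio_file and generate_kwargs as a multipart/form-data body."""
    generate_kwargs = {
        "multilingual": True,
        "vad_filter": True,
        "task": "transcribe",
        "language": language,
        "batch_size": 8,
        "beam_size": 5,
        "condition_on_previous_text": True,
    }
    boundary = uuid.uuid4().hex
    # First part: the WAV file. Filename and part content type are assumptions.
    audio_head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio_file"; filename="chunk.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("ascii")
    # Second part: the JSON-encoded parameters, followed by the closing boundary.
    kwargs_part = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="generate_kwargs"\r\n\r\n'
        f"{json.dumps(generate_kwargs)}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return audio_head + wav_bytes + kwargs_part, headers
```

The returned body and headers can be passed to any HTTP client that issues the POST to the configured endpoint.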

Expected response:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello world",
      "sequence_confidence": -0.25,
      "words": null
    }
  ],
  "language": "en",
  "duration": 2.5
}
```
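A consumer of this response typically only needs the joined segment text. The helper below is a minimal sketch with a hypothetical name; how the speech service actually assembles the transcript is not specified here.

```python
from typing import Any, Dict


def transcript_from_response(payload: Dict[str, Any]) -> str:
    """Join segment texts in start-time order into one transcript string."""
    segments = sorted(payload.get("segments", []), key=lambda s: s["start"])
    texts = (s["text"].strip() for s in segments)
    # Skip empty segments so silence does not produce stray spaces.
    return " ".join(t for t in texts if t)
```

An empty `segments` list yields an empty string, which corresponds to the "Empty transcript" case in the troubleshooting table.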

Sizing Guide

| Concurrent Users | Hardware | Model | Throughput |
| --- | --- | --- | --- |
| 1-10 | 1x NVIDIA T4 (16 GB) | large-v3 | ~5-10x realtime |
| 10-50 | 1x NVIDIA A10G (24 GB) | large-v3 | ~15-20x realtime |
| 50-200 | 2-4x NVIDIA A10G | large-v3 | ~40-80x realtime |
| CPU only | 8+ cores, 16 GB RAM | small / medium | ~1-3x realtime |

"Nx realtime" = N seconds of audio transcribed per 1 second of wall-clock time.

A reference implementation is available in the Unique monorepo at .local/whisper-ox/.
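As a worked example of the "Nx realtime" figures in the sizing guide, the arithmetic below converts a throughput factor into the expected processing time for one chunk. The function name is hypothetical, and the estimate ignores queueing, network transfer, and model warm-up.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at N x realtime throughput."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

With the default 15-second chunk, a server running at 5x realtime needs about 3 seconds per chunk, while a CPU-only server at 1x realtime needs the full 15 seconds and leaves no headroom for a second concurrent user.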

Environment Variables (Speech Service)

| Variable | Required | Description |
| --- | --- | --- |
| TRANSCRIPTION_SERVICE | Yes | Set to WHISPER_OX |
| WHISPER_OX_ENDPOINT | Yes | e.g., https://whisper.internal.company.com/transcribe |
| WHISPER_OX_LANGUAGE | No | Defaults to en |
| WHISPER_CHUNK_DURATION_SECONDS | No | Defaults to 15 |
| WHISPER_OX_API_KEY | No | API key to send to the Whisper OX server. When set, the key is sent in the header specified by WHISPER_OX_API_KEY_HEADER. When unset, no auth header is sent. |
| WHISPER_OX_API_KEY_HEADER | No | Header name used to send the API key. Defaults to X-API-Key. |
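The interaction between the two API key variables can be sketched as a small function. The function name is hypothetical; the behavior follows the descriptions above (no header when the key is unset, X-API-Key as the default header name).

```python
from typing import Dict, Mapping


def whisper_ox_auth_headers(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the auth header to attach to Whisper OX requests, if any."""
    api_key = env.get("WHISPER_OX_API_KEY")
    if not api_key:
        return {}  # unset: no auth header is sent
    header_name = env.get("WHISPER_OX_API_KEY_HEADER") or "X-API-Key"
    return {header_name: api_key}
```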

Security

  • The endpoint supports optional API key authentication. Set WHISPER_OX_API_KEY and, optionally, WHISPER_OX_API_KEY_HEADER (defaults to X-API-Key) on the speech service, and set the corresponding WHISPER_API_KEY and WHISPER_API_KEY_HEADER on the Whisper OX server. When no key is set, no authentication is enforced, so secure the endpoint at the network level (private endpoints, NetworkPolicy, service mesh).

  • Use TLS if traffic crosses network boundaries

  • Audio is sent as raw WAV, so treat the endpoint as sensitive infrastructure
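For teams writing their own Whisper OX server, the key check on the server side might look like the sketch below. It is an assumption about how such a server could honor WHISPER_API_KEY and WHISPER_API_KEY_HEADER, not a description of the reference implementation; the function name is hypothetical.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(
    request_headers: Mapping[str, str],
    expected_key: Optional[str],
    header_name: str = "X-API-Key",
) -> bool:
    """Check the API key header against the configured key."""
    if not expected_key:
        return True  # no key configured: authentication is not enforced
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    supplied = lowered.get(header_name.lower(), "")
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```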


Frontend: Enabling Whisper Mode on Assistants

The assistant must be configured to use Whisper mode. Set the following in assistant settings:

| Setting | Value |
| --- | --- |
| sttConfig.speechToTextMode | whisper |

When set, the frontend shows a waveform during recording and a processing indicator while waiting for the transcript. The WebSocket URL (SPEECH_BACKEND_API_URL) does not change — the backend handles provider routing transparently.

To revert an assistant to live streaming, set speechToTextMode back to live.


Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| "Whisper OX endpoint is not configured" | WHISPER_OX_ENDPOINT not set | Set the env var on the speech service |
| Timeout during transcription | Server overloaded or unreachable | Check connectivity and server health (GET /health) |
| Empty transcript | Audio too short or silence-only | Check microphone input; vad_filter is enabled by default |
| Slow transcription | CPU-only with large model | Use GPU or a smaller model (small, medium) |
| API error (4xx/5xx) | Server-side issue | Check Whisper OX server logs |

Enable debug logging with LOG_LEVEL=debug on the speech service for detailed chunk sizes, API calls, and transcription timings.
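The connectivity check from the table can be scripted from a pod or host that shares the speech service's network path. This sketch assumes that /health is served at the root of the same host as WHISPER_OX_ENDPOINT and that a healthy server answers with HTTP 200; both function names are hypothetical.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def health_url(transcribe_endpoint: str) -> str:
    """Derive the health URL from WHISPER_OX_ENDPOINT (assumes /health at the server root)."""
    parts = urlsplit(transcribe_endpoint)
    return urlunsplit((parts.scheme, parts.netloc, "/health", "", ""))


def is_healthy(transcribe_endpoint: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(health_url(transcribe_endpoint), timeout=timeout_seconds) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses (URLError subclasses OSError).
        return False
```

A `False` result points to the "Server overloaded or unreachable" row; a `True` result with continued timeouts suggests looking at server load instead.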


Migration Checklist

Enabling Whisper (from MS STT)

  1. Deploy the Whisper backend (Azure OpenAI or Whisper OX)

  2. Set TRANSCRIPTION_SERVICE and endpoint env vars on the speech service

  3. Restart the speech service

  4. Set sttConfig.speechToTextMode: "whisper" on target assistants

  5. No frontend deployment changes needed
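Before restarting the speech service in step 3, the required variables can be verified with a short pre-flight check. The sketch below encodes only the "Required: Yes" entries from the environment variable tables on this page; the `REQUIRED_VARS` mapping and the function name are hypothetical.

```python
from typing import List, Mapping

# Required variables per TRANSCRIPTION_SERVICE value, taken from the tables above.
REQUIRED_VARS = {
    "WHISPER": ["AZURE_OPENAI_WHISPER_ENDPOINT", "AZURE_OPENAI_WHISPER_API_KEY"],
    "WHISPER_OX": ["WHISPER_OX_ENDPOINT"],
}


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """List required variables that are unset or empty for the selected backend."""
    service = env.get("TRANSCRIPTION_SERVICE", "")
    return [name for name in REQUIRED_VARS.get(service, []) if not env.get(name)]
```

Calling `missing_vars(os.environ)` in the deployment pipeline and failing on a non-empty result catches the "endpoint is not configured" error before users see it.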

Rolling Back to MS STT

  1. Remove TRANSCRIPTION_SERVICE from the speech service environment, or set it to MS_STT

  2. Restart the speech service

  3. Set sttConfig.speechToTextMode: "live" on assistants
