Live Transcription (RTT)

VALSEA provides a real-time speech-to-text API via WebSocket, allowing you to stream audio and receive transcriptions with low latency.

Connection

Endpoint: wss://api.valsea.ai/v1/realtime

Authentication

You must authenticate the WebSocket connection by passing your API key in the HTTP headers during the handshake.

Headers:

Authorization: Bearer YOUR_API_KEY (Recommended)
X-API-Key: YOUR_API_KEY (Supported)

Browser clients can instead pass the key as a query parameter (?api_key=YOUR_API_KEY), because the standard browser WebSocket constructor does not support custom headers.
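As a minimal sketch of the browser fallback, the helper below builds the connection URL with the api_key query parameter. The function name buildRealtimeUrl is hypothetical; only the endpoint and the api_key parameter come from this page.

```javascript
// Hypothetical helper: builds the realtime URL for clients that cannot set headers.
function buildRealtimeUrl(apiKey) {
  const url = new URL('wss://api.valsea.ai/v1/realtime');
  // searchParams.set percent-encodes the key if it contains reserved characters.
  url.searchParams.set('api_key', apiKey);
  return url.toString();
}

// In a browser: const ws = new WebSocket(buildRealtimeUrl('YOUR_API_KEY'));
```

Where custom headers are available (for example in server-side clients), the header forms listed above remain the recommended option.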

Paid Rate-Limit Bypass

If you need to exceed the realtime connection RPM limit for a session, you can bypass the rate-limit check by sending one of these opt-in flags during the WebSocket handshake:

Header: X-Bypass-Rate-Limit: true
Query parameter: bypass_rate_limit=true
Query parameter: bypassRateLimit=true

Bypass applies only to the per-organization rate-limit check. Authentication, credit checks, and session billing still apply. RTT sessions using this bypass are billed at 2x the normal realtime credit cost. The initial session.created event includes rateLimitBypass: true and billingMultiplier: 2 when bypass is active. If speaker diarization is also enabled for the session, its 2x multiplier compounds with the bypass multiplier, so the final session cost is 4x the normal realtime credit cost.
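The multipliers described above can be sketched as a small calculation. The function name and option names are illustrative only and are not part of the API.

```javascript
// Illustrative only: combines the billing multipliers described on this page.
function realtimeBillingMultiplier({ bypassRateLimit = false, diarize = false } = {}) {
  let multiplier = 1;
  if (bypassRateLimit) multiplier *= 2; // rate-limit bypass: 2x
  if (diarize) multiplier *= 2; // speaker diarization: 2x
  return multiplier; // both enabled: 4x
}
```

For example, a session with both bypass and diarization enabled yields a multiplier of 4 relative to the normal realtime credit cost.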

Message Flow

Connect: Client establishes WebSocket connection.
Session Created: Server sends session.created event.
Start Session: Client sends session.start to configure language and model.
Stream Audio: Once the server sends session.ready, client sends audio.append messages with base64-encoded PCM16 audio chunks.
Receive Transcripts: Server streams transcript.partial and transcript.final events.
Commit Audio: Client sends audio.commit when a user stops speaking (optional/VAD dependent).
Stop Session: Client sends session.stop to end the session.
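The flow above can be sketched as a small transport-agnostic driver. This is a sketch under assumptions: createSessionDriver and its parameters are hypothetical names, send is any function that transmits a string (for example (s) => ws.send(s)), and session.ready is assumed to arrive after session.start has been sent.

```javascript
// Hypothetical driver for the message flow; not part of any SDK.
function createSessionDriver(send, config, onFinal) {
  let ready = false;

  function handleServerMessage(raw) {
    const msg = JSON.parse(String(raw));
    switch (msg.type) {
      case 'session.created':
        // Configure the session as soon as the server announces it.
        send(JSON.stringify({ type: 'session.start', ...config }));
        break;
      case 'session.ready':
        ready = true; // the engine can now accept audio
        break;
      case 'transcript.final':
        onFinal(msg.text || '');
        break;
    }
  }

  return {
    handleServerMessage,
    // Returns false while the session is not ready, so the caller can buffer.
    appendAudio(base64Pcm16) {
      if (!ready) return false;
      send(JSON.stringify({ type: 'audio.append', audio: base64Pcm16 }));
      return true;
    },
    commit() {
      send(JSON.stringify({ type: 'audio.commit' }));
    },
    stop() {
      send(JSON.stringify({ type: 'session.stop' }));
    },
  };
}
```

Keeping the driver independent of the socket makes the ordering easy to exercise with a fake send function before wiring it to a real connection.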

Partial vs Final (Important)

RTT emits two transcript event types for each utterance:

transcript.partial: low-latency, in-progress text. This can change as more audio arrives.
transcript.final: stable text for a completed segment. Treat this as the committed result.

Recommended client behavior:

Keep a temporary currentPartial string for transcript.partial.
Append only transcript.final to your persisted transcript history.
Clear currentPartial when you receive a matching transcript.final.

Do not persist partial text as final output. Partials are intentionally mutable and may be revised by the engine before a final segment is produced.

Client Messages

`session.start`

Initialize the session with configuration.

{
  "type": "session.start",
  "model": "valsea-rtt",
  "hint_text": "Optional context or vocabulary",
  "enable_correction": true,
  "language": "singlish",
  "target_language": "english",
  "diarize": true,
  "diarization_min_speakers": 2,
  "diarization_max_speakers": 6
}

| Field | Type | Description |
| --- | --- | --- |
| `model` | string | Model to use (e.g., `valsea-rtt`). |
| `hint_text` | string | Optional list of words or context to improve accuracy. |
| `enable_correction` | boolean | Enable post-processing for grammar/language correction (default: `true`). |
| `language` | string | Language hint for correction (e.g., `singlish`, `english`, `chinese`, `korean`). |
| `target_language` | string | Optional translation target for final transcript text. Also accepted as `targetLanguage`. Omit it to return transcription only. |
| `diarize` | boolean | Enable speaker diarization on final transcript events. Default: `false`. Bills at 2x normal realtime credits. |
| `diarization_min_speakers` | integer | Minimum expected speaker count for diarization. Default: `2`. |
| `diarization_max_speakers` | integer | Maximum expected speaker count for diarization. Default: `6`. |

When target_language is set and differs from the input language, transcript.final.text is translated to that target language. Partial transcripts are not translated. Supported translation targets are english, chinese, japanese, korean, vietnamese, thai, french, spanish, german, russian, indonesian, malay, filipino, tamil, khmer, and lao.
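A client may want to reject unsupported translation targets before opening a session. The sketch below assumes a hypothetical helper named buildSessionStart; the target list is copied from the paragraph above, and the default model value follows the session.start example.

```javascript
// Translation targets listed on this page.
const TRANSLATION_TARGETS = new Set([
  'english', 'chinese', 'japanese', 'korean', 'vietnamese', 'thai', 'french', 'spanish',
  'german', 'russian', 'indonesian', 'malay', 'filipino', 'tamil', 'khmer', 'lao',
]);

// Hypothetical helper: builds a session.start message and fails fast on unknown targets.
function buildSessionStart({ language, targetLanguage, ...rest } = {}) {
  const message = { type: 'session.start', model: 'valsea-rtt', ...rest };
  if (language) message.language = language;
  if (targetLanguage !== undefined) {
    if (!TRANSLATION_TARGETS.has(targetLanguage)) {
      throw new RangeError(`Unsupported target_language: ${targetLanguage}`);
    }
    message.target_language = targetLanguage;
  }
  return message; // omitting targetLanguage yields transcription only
}
```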

When diarize=true, final events include speaker-labeled words and utterances metadata. Speaker IDs are zero-based integers. Diarization is emitted only on final events.
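One way to present diarized output is to merge consecutive words from the same speaker into turns. This sketch assumes the word shape shown in the transcript.final example on this page ({ word, start, end, speaker }) and joins words with spaces, which suits space-delimited languages only. The function name speakerTurns is hypothetical.

```javascript
// Merges consecutive words from the same speaker into a single turn.
function speakerTurns(words) {
  const turns = [];
  for (const w of words || []) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      last.text += ' ' + w.word;
      last.end = w.end; // extend the turn to the end of the latest word
    } else {
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}
```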

`audio.append`

Send audio data.

{
  "type": "audio.append",
  "audio": "BASE64_ENCODED_PCM16_DATA"
}

Format: Raw 16-bit PCM, 16 kHz (recommended), mono.
Encoding: Base64 string.

You may also send raw binary PCM16 frames directly over the WebSocket. Binary frames are treated as audio.append messages by the server.
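Audio captured as floating-point samples has to be converted to PCM16 before it is sent. The following Node.js sketch shows one way to do that and to split the result into audio.append messages. Little-endian byte order and the 100 ms chunk size are assumptions, since this page specifies neither; both function names are hypothetical.

```javascript
// Converts Float32 samples in [-1, 1] to 16-bit signed PCM.
// Assumption: little-endian byte order (not stated on this page).
function floatToPcm16(samples) {
  const out = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid overflow
    out.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return out;
}

// Splits mono PCM16 audio into audio.append messages of chunkMs milliseconds each.
function toAppendMessages(pcm, sampleRate = 16000, chunkMs = 100) {
  const bytesPerChunk = Math.floor((sampleRate * chunkMs) / 1000) * 2; // 2 bytes per sample
  const messages = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    const chunk = pcm.subarray(offset, offset + bytesPerChunk);
    messages.push(JSON.stringify({ type: 'audio.append', audio: chunk.toString('base64') }));
  }
  return messages;
}
```

At 16 kHz mono, a 100 ms chunk is 1,600 samples, or 3,200 bytes before base64 encoding. Buffer is specific to Node.js; a browser client would need an equivalent built on DataView.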

`audio.commit`

Signal the end of a speech segment (e.g., VAD triggered silence).

{
  "type": "audio.commit"
}

`session.stop`

End the session gracefully.

{
  "type": "session.stop"
}

Server Messages

`session.created`

Sent immediately upon connection.

{
  "type": "session.created",
  "sessionId": "rtt_...",
  "supported_models": ["valsea-rtt"]
}

`session.ready`

Sent when the backend engine is connected and ready to receive audio.

{
  "type": "session.ready",
  "sessionId": "rtt_..."
}

`transcript.partial`

Intermediate transcription results (low latency, may change).

{
  "type": "transcript.partial",
  "text": "Hello world",
  "isFinal": false,
  "timestampMs": 1230
}

`transcript.final`

Finalized text for a speech segment.

{
  "type": "transcript.final",
  "text": "Hello, world.",
  "rawText": "hello world",
  "isFinal": true,
  "timestampMs": 2500,
  "corrections": [],
  "words": [{ "word": "Hello", "start": 0.4, "end": 0.8, "speaker": 0 }],
  "utterances": [{ "start": 0.4, "end": 0.8, "speaker": 0, "transcript": "Hello", "words": [] }]
}

If realtime translation is enabled, text contains the translated final text and the event includes translation metadata:

{
  "type": "transcript.final",
  "text": "Hello, world.",
  "rawText": "hello world",
  "isFinal": true,
  "timestampMs": 2500,
  "translated": true,
  "sourceLanguage": "singlish",
  "targetLanguage": "english"
}

`error`

Sent when an error occurs.

{
  "type": "error",
  "code": "INVALID_MESSAGE",
  "message": "Failed to parse message"
}

Event Handling Pattern

Use this pattern to avoid duplicated or unstable transcript content:

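let currentPartial = ''; // in-progress text, replaced on every partial
const finalSegments = []; // committed transcript history

ws.on('message', (raw) => {
  // The ws library delivers a Buffer; convert it to a string before parsing
  const msg = JSON.parse(raw.toString());

  if (msg.type === 'transcript.partial') {
    // Overwrite rather than append: each partial replaces the previous one
    currentPartial = msg.text || '';
  } else if (msg.type === 'transcript.final') {
    // Persist only final text, then clear the in-progress partial
    finalSegments.push(msg.text || '');
    currentPartial = '';
  }
});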

Browser Interpreter Compatibility

For simple browser live-translation demos, /v1/realtime also accepts a compact init message and binary PCM16 frames.

Client init

{
  "language": "en",
  "model": "v2.0",
  "translation": {
    "type": "two_way",
    "language_a": "en",
    "language_b": "vi"
  }
}

Supported compact language codes are en, vi, ja, ko, zh, th, fr, es, de, ru, and id. They are mapped to the canonical API language names internally.
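A compact init message can be assembled and checked against the supported codes before it is sent. In this sketch the helper name buildCompactInit is hypothetical, and setting the top-level language to language_a simply mirrors the example above rather than a documented rule.

```javascript
// Compact language codes listed on this page.
const COMPACT_CODES = new Set(['en', 'vi', 'ja', 'ko', 'zh', 'th', 'fr', 'es', 'de', 'ru', 'id']);

// Hypothetical helper: builds a two-way compact init message.
function buildCompactInit(languageA, languageB, model = 'v2.0') {
  for (const code of [languageA, languageB]) {
    if (!COMPACT_CODES.has(code)) {
      throw new RangeError(`Unsupported language code: ${code}`);
    }
  }
  return {
    language: languageA, // assumption: mirrors language_a, as in the example above
    model,
    translation: { type: 'two_way', language_a: languageA, language_b: languageB },
  };
}
```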

Server messages

When the compact init shape is used, transcript events are returned in a paired interpreter shape:

{
  "type": "ready"
}

{
  "type": "partial",
  "text": "hello"
}

{
  "type": "final",
  "text": "hello",
  "translation": "xin chao",
  "language": "en",
  "translation_language": "vi"
}

Only final transcript events include translations. Use text as the original transcript and translation as the translated output.
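The compact message shape can be handled with a small reducer that keeps the in-progress partial separate from completed lines. This is a sketch only: the state shape ({ partial, lines }) and the function name applyInterpreterMessage are assumptions made for illustration.

```javascript
// Hypothetical reducer for the compact interpreter messages.
// state: { partial: string, lines: Array<{ text, translation }> }
function applyInterpreterMessage(state, msg) {
  switch (msg.type) {
    case 'partial':
      // Partials carry no translation and replace the previous partial.
      return { ...state, partial: msg.text || '' };
    case 'final':
      return {
        partial: '',
        lines: [...state.lines, { text: msg.text || '', translation: msg.translation || '' }],
      };
    default:
      return state; // 'ready' and unknown types leave the transcript unchanged
  }
}
```

Starting from { partial: '', lines: [] }, each incoming message produces the next state, which a UI can render as completed original and translated pairs followed by the current partial.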

Example (Node.js)

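const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('wss://api.valsea.ai/v1/realtime', {
  headers: { 'X-API-Key': 'YOUR_KEY' },
});

ws.on('open', () => {
  // 1. Configure the session
  ws.send(
    JSON.stringify({
      type: 'session.start',
      model: 'valsea-rtt',
      language: 'singlish',
      target_language: 'english',
    }),
  );
});

ws.on('message', (data) => {
  // The ws library delivers a Buffer; convert it to a string before parsing
  const msg = JSON.parse(data.toString());

  if (msg.type === 'session.ready') {
    // 2. Stream audio once the engine is ready (audio.raw: raw PCM16, 16 kHz, mono)
    const audioStream = fs.createReadStream('audio.raw');
    audioStream.on('data', (chunk) => {
      ws.send(
        JSON.stringify({
          type: 'audio.append',
          audio: chunk.toString('base64'),
        }),
      );
    });
    audioStream.on('end', () => {
      // 3. Mark the end of the last segment, then end the session gracefully
      ws.send(JSON.stringify({ type: 'audio.commit' }));
      ws.send(JSON.stringify({ type: 'session.stop' }));
    });
  } else if (msg.type === 'transcript.final') {
    console.log('Final:', msg.text); // translated when target_language is set
  } else if (msg.type === 'error') {
    console.error(`Server error ${msg.code}: ${msg.message}`);
  }
});

ws.on('error', (err) => console.error('WebSocket error:', err.message));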

Model Selection Guide

Default: Auto-Detect

Use the default model to automatically detect and transcribe speech across 100+ languages with no configuration required.

{
  "model": "valsea-auto"
}

Best for:

Multi-language or unknown input
Global applications
Fast setup with minimal tuning

How it works:

Automatically detects the spoken language
Applies a general-purpose transcription pipeline optimized for broad coverage

Accent & Dialect Coverage — The auto-detect model is designed for broad language coverage, not deep regional specialization. It may be less accurate for heavy regional accents, code-switched speech (e.g. Singlish, Manglish), and local slang or informal speech patterns.

For Higher Accuracy (Recommended for SEA)

For Southeast Asian speech and mixed-language inputs, use a region-specific language setting:

{
  "model": "valsea-rtt",
  "language": "singlish"
}

These specialized modes provide:

Accent-aware transcription
Better handling of mixed-language speech
Improved semantic correction for local expressions

List of Languages

Southeast Asia

Singlish — singlish

Indonesian — indonesian

Malaysian — malay

Vietnamese — vietnamese

Thai — thai

Javanese — javanese

Lao — lao

Khmer — khmer

Filipino/Tagalog — filipino

English (Philippines) — english-philippines

Middle East & North Africa

Arabic — arabic arabic-algeria arabic-bahrain arabic-egypt arabic-israel arabic-jordan arabic-kuwait arabic-lebanon arabic-mauritania arabic-morocco arabic-oman arabic-palestine arabic-qatar arabic-saudi arabic-syria arabic-tunisia arabic-uae arabic-yemen

Persian — persian

Hebrew — hebrew

Sub-Saharan Africa

Amharic — amharic

Wolof — wolof

Swahili — swahili swahili-ke

Afrikaans — afrikaans

Akan — akan

Bemba — bemba

Fulani — fulani

Ga — ga

Hausa — hausa

Igbo — igbo

Luganda — luganda

Xhosa — xhosa

Yoruba — yoruba

Zulu — zulu

Northern Sotho — northern-sotho

Nyankole — nyankole

Oromo — oromo

Pidgin — pidgin

Kinyarwanda — kinyarwanda

Shona — shona

Sotho — sotho

Tswana — tswana

Twi — twi

South Asia

Bengali — bengali-bd bengali-in

Hindi — hindi

Gujarati — gujarati

Kannada — kannada

Malayalam — malayalam

Marathi — marathi

Nepali — nepali

Oriya — oriya

Punjabi — punjabi

Sinhala — sinhala

Tamil — tamil

Telugu — telugu

Assamese — assamese

East Asia

Chinese — cantonese chinese chinese-simplified chinese-traditional

Covers a wide range of regional accents and dialects, including those from Anhui, Beijing, Chongqing, Gansu, Guangdong, Guangxi, Guizhou, Hangzhou, Hebei, Henan, Hong Kong, Hubei, Jiangsu, Jianghuai, Jiaoliao, Jilu, Lanyin, Nanjing, Northeast, Ningxia, Shaanxi, Shandong, Sichuan, Taiwan, Tianjin, and Yunnan.

Japanese — japanese

Korean — korean

Mongolian — mongolian

Central & Western Asia

Azerbaijani — azerbaijani

Armenian — armenian

Georgian — georgian

Kazakh — kazakh

Kurdish — kurdish

Kyrgyz — kyrgyz

Uzbek — uzbek

Turkish — turkish

Western Europe

English — english english-au english-gb english-in english-philippines english-us

French — french french-ca

Spanish — spanish spanish-es spanish-mexico spanish-us

Portuguese — portuguese portuguese-br

German — german

Dutch — dutch

Italian — italian

Catalan — catalan

Galician — galician

Asturian — asturian

Basque — basque

Welsh — welsh

Luxembourgish — luxembourgish

Maltese — maltese

Northern Europe

Danish — danish

Finnish — finnish

Icelandic — icelandic

Norwegian — norwegian

Swedish — swedish

Estonian — estonian

Latvian — latvian

Lithuanian — lithuanian

Eastern Europe & Balkans

Bulgarian — bulgarian

Croatian — croatian

Czech — czech

Hungarian — hungarian

Macedonian — macedonian

Polish — polish

Romanian — romanian

Russian — russian

Serbian — serbian

Slovak — slovak

Slovenian — slovenian

Ukrainian — ukrainian

Albanian — albanian

Greek — greek

Pacific & Oceania

Maori — maori