Live Transcription (RTT)
VALSEA provides a real-time speech-to-text API via WebSocket, allowing you to stream audio and receive transcriptions with low latency.
Connection
Endpoint: wss://api.valsea.ai/v1/realtime
Authentication
You must authenticate the WebSocket connection by passing your API key in the HTTP headers during the handshake.
Headers:
- Authorization: Bearer YOUR_API_KEY (Recommended)
- X-API-Key: YOUR_API_KEY (Supported)
Browser clients can also authenticate with ?api_key=YOUR_API_KEY because standard browser
WebSocket constructors do not support custom headers.
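The query-parameter form can be assembled with the standard URL API. This is a minimal sketch: the helper name buildRealtimeUrl is hypothetical, and only the endpoint and the api_key parameter come from the text above.

```javascript
// Hypothetical helper: builds a realtime URL that carries the API key as a
// query parameter, for clients that cannot set handshake headers.
function buildRealtimeUrl(apiKey, base = 'wss://api.valsea.ai/v1/realtime') {
  const url = new URL(base);
  url.searchParams.set('api_key', apiKey); // URL-encodes the key
  return url.toString();
}

// In a browser:
// const ws = new WebSocket(buildRealtimeUrl('YOUR_API_KEY'));
```

Keys placed in a URL can end up in logs and browser history, so the header form is likely preferable wherever headers are available.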
Paid Rate-Limit Bypass
If you need to exceed the realtime connection RPM limit for a session, you can bypass the rate-limit check by sending one of these opt-in flags during the WebSocket handshake:
- Header: X-Bypass-Rate-Limit: true
- Query parameter: bypass_rate_limit=true
- Query parameter: bypassRateLimit=true
Bypass applies only to the per-organization rate-limit check. Authentication, credit checks, and
session billing still apply. RTT sessions using this bypass are billed at 2x the normal realtime
credit cost. The initial session.created event includes rateLimitBypass: true and
billingMultiplier: 2 when bypass is active. If speaker diarization is also enabled for the session,
the final session cost is 4x the normal realtime credit cost.
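The billing rules above combine multiplicatively: 2x for bypass, 2x for diarization, 4x for both. A small sketch for estimating the multiplier on the client side; the function name is hypothetical and this is not an official billing calculation.

```javascript
// Multipliers taken from the text: bypass = 2x, diarization = 2x, both = 4x.
// Hypothetical helper for client-side cost estimates only.
function realtimeBillingMultiplier({ bypass = false, diarize = false } = {}) {
  return (bypass ? 2 : 1) * (diarize ? 2 : 1);
}

console.log(realtimeBillingMultiplier({ bypass: true, diarize: true }));
```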
Message Flow
- Connect: Client establishes WebSocket connection.
- Session Created: Server sends session.created event.
- Start Session: Client sends session.start to configure language and model.
- Stream Audio: Client sends audio.append messages with base64-encoded PCM16 audio chunks.
- Receive Transcripts: Server streams transcript.partial and transcript.final events.
- Commit Audio: Client sends audio.commit when a user stops speaking (optional/VAD dependent).
- Stop Session: Client sends session.stop to end the session.
Partial vs Final (Important)
RTT emits two transcript event types for each utterance:
- transcript.partial: low-latency, in-progress text. This can change as more audio arrives.
- transcript.final: stable text for a completed segment. Treat this as the committed result.
Recommended client behavior:
- Keep a temporary currentPartial string for transcript.partial.
- Append only transcript.final to your persisted transcript history.
- Clear currentPartial when you receive a matching transcript.final.
Do not persist partial text as final output. Partials are intentionally mutable and may be revised by the engine before a final segment is produced.
Client Messages
session.start
Initialize the session with configuration.
{
"type": "session.start",
"model": "valsea-rtt",
"hint_text": "Optional context or vocabulary",
"enable_correction": true,
"language": "singlish",
"target_language": "english",
"diarize": true,
"diarization_min_speakers": 2,
"diarization_max_speakers": 6
}
| Field | Type | Description |
|---|---|---|
| model | string | Model to use (e.g., valsea-rtt). |
| hint_text | string | Optional list of words or context to improve accuracy. |
| enable_correction | boolean | Enable post-processing for grammar/language correction (default: true). |
| language | string | Language hint for correction (e.g., singlish, english, chinese, korean). |
| target_language | string | Optional translation target for final transcript text. Also accepted as targetLanguage. Omit it to return transcription only. |
| diarize | boolean | Enable speaker diarization on final transcript events. Default: false. Bills at 2x normal realtime credits. |
| diarization_min_speakers | integer | Minimum expected speaker count for diarization. Default: 2. |
| diarization_max_speakers | integer | Maximum expected speaker count for diarization. Default: 6. |
When target_language is set and differs from the input language, transcript.final.text is translated to that target language. Partial transcripts are not translated. Supported translation targets are english, chinese, japanese, korean, vietnamese, thai, french, spanish, german, russian, indonesian, malay, filipino, tamil, khmer, and lao.
When diarize=true, final events include speaker-labeled words and utterances metadata. Speaker
IDs are zero-based integers. Diarization is emitted only on final events.
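One way to consume the speaker-labeled words is to collapse them into consecutive speaker turns. The sketch below assumes each entry carries word, start, end, and speaker fields, as in the transcript.final example under Server Messages. The helper name is hypothetical, and joining words with a space is an assumption that will not suit every language.

```javascript
// Hypothetical helper: collapses the `words` array of a diarized
// transcript.final event into consecutive speaker turns.
function wordsToSpeakerTurns(words = []) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.text += ' ' + w.word; // space-joining is an assumption
      last.end = w.end;
    } else {
      // Speaker changed (or first word): open a new turn.
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}
```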
audio.append
Send audio data.
{
"type": "audio.append",
"audio": "BASE64_ENCODED_PCM16_DATA"
}
- Format: Raw PCM 16-bit, 16kHz (recommended), mono.
- Encoding: Base64 string.
You may also send raw binary PCM16 frames directly over the WebSocket. Binary frames are treated as
audio.append messages by the server.
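At the recommended format (16 kHz, 16-bit, mono), one second of audio is 32,000 bytes. The sketch below splits a PCM16 Buffer into fixed-duration audio.append messages. The 100 ms chunk size and the helper name are assumptions, not values from this API.

```javascript
// 16,000 samples/s * 2 bytes per sample * 1 channel = 32,000 bytes per second.
const BYTES_PER_SECOND = 16000 * 2 * 1;

// Hypothetical helper: yields JSON audio.append messages for a PCM16 Buffer.
// chunkMs = 100 is an assumed chunk duration, not a documented requirement.
function* pcmToAppendMessages(pcm, chunkMs = 100) {
  // Round down to a whole number of 16-bit samples; never go below one sample.
  const chunkBytes = Math.max(2, Math.floor((BYTES_PER_SECOND * chunkMs) / 1000 / 2) * 2);
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    const chunk = pcm.subarray(offset, offset + chunkBytes);
    yield JSON.stringify({ type: 'audio.append', audio: chunk.toString('base64') });
  }
}

// Usage: for (const message of pcmToAppendMessages(buffer)) ws.send(message);
```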
audio.commit
Signal the end of a speech segment (e.g., VAD triggered silence).
{
"type": "audio.commit"
}
session.stop
End the session gracefully.
{
"type": "session.stop"
}
Server Messages
session.created
Sent immediately upon connection.
{
"type": "session.created",
"sessionId": "rtt_...",
"supported_models": ["valsea-rtt"]
}
session.ready
Sent when the backend engine is connected and ready to receive audio.
{
"type": "session.ready",
"sessionId": "rtt_..."
}
transcript.partial
Intermediate transcription results (low latency, may change).
{
"type": "transcript.partial",
"text": "Hello world",
"isFinal": false,
"timestampMs": 1230
}
transcript.final
Finalized text for a speech segment.
{
"type": "transcript.final",
"text": "Hello, world.",
"rawText": "hello world",
"isFinal": true,
"timestampMs": 2500,
"corrections": [],
"words": [{ "word": "Hello", "start": 0.4, "end": 0.8, "speaker": 0 }],
"utterances": [{ "start": 0.4, "end": 0.8, "speaker": 0, "transcript": "Hello", "words": [] }]
}
If realtime translation is enabled, text contains the translated final text and the event includes translation metadata:
{
"type": "transcript.final",
"text": "Hello, world.",
"rawText": "hello world",
"isFinal": true,
"timestampMs": 2500,
"translated": true,
"sourceLanguage": "singlish",
"targetLanguage": "english"
}
error
Sent when an error occurs.
{
"type": "error",
"code": "INVALID_MESSAGE",
"message": "Failed to parse message"
}
Event Handling Pattern
Use this pattern to avoid duplicated or unstable transcript content:
let currentPartial = '';
const finalSegments = [];

ws.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.type === 'transcript.partial') {
    currentPartial = msg.text || '';
  }
  if (msg.type === 'transcript.final') {
    finalSegments.push(msg.text || '');
    currentPartial = '';
  }
});
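The same logic can also be written as a pure function, which makes it easy to unit-test without a socket. This is an illustrative sketch; the state shape and function names are hypothetical.

```javascript
// Pure-function variant of the pattern above. The state shape is illustrative.
const initialState = { finalSegments: [], currentPartial: '' };

function applyTranscriptEvent(state, msg) {
  if (msg.type === 'transcript.partial') {
    // Partials replace each other; they are never persisted.
    return { ...state, currentPartial: msg.text || '' };
  }
  if (msg.type === 'transcript.final') {
    // Finals are appended to history and clear the pending partial.
    return { finalSegments: [...state.finalSegments, msg.text || ''], currentPartial: '' };
  }
  return state; // ignore non-transcript events
}

// Text to show in a UI: committed segments followed by the in-progress partial.
function displayText(state) {
  return [...state.finalSegments, state.currentPartial].filter(Boolean).join(' ');
}
```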
Browser Interpreter Compatibility
For simple browser live-translation demos, /v1/realtime also accepts a compact init message and
binary PCM16 frames.
Client init
{
"language": "en",
"model": "v2.0",
"translation": {
"type": "two_way",
"language_a": "en",
"language_b": "vi"
}
}
Supported compact language codes are en, vi, ja, ko, zh, th, fr, es, de, ru, and
id. They are mapped to the canonical API language names internally.
Server messages
When the compact init shape is used, transcript events are returned in a paired interpreter shape:
{
"type": "ready"
}
{
"type": "partial",
"text": "hello"
}
{
"type": "final",
"text": "hello",
"translation": "xin chao",
"language": "en",
"translation_language": "vi"
}
Only final transcript events include translations. Use text as the original transcript and
translation as the translated output.
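A client using the compact shape might normalize these messages before rendering. The sketch below relies only on the fields shown in the examples above; the function name and the returned object shape are hypothetical.

```javascript
// Hypothetical helper: normalizes a compact-shape server message for display.
// Returns null for messages that carry no transcript text (e.g. "ready").
function formatInterpreterMessage(msg) {
  switch (msg.type) {
    case 'partial':
      // Partials are never translated.
      return { stable: false, original: msg.text, translated: null };
    case 'final':
      return { stable: true, original: msg.text, translated: msg.translation ?? null };
    default:
      return null;
  }
}
```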
Example (Node.js)
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('wss://api.valsea.ai/v1/realtime', {
  headers: { 'X-API-Key': 'YOUR_API_KEY' },
});

ws.on('open', () => {
  // 1. Configure the session
  ws.send(
    JSON.stringify({
      type: 'session.start',
      model: 'valsea-rtt',
      language: 'singlish',
      target_language: 'english',
    }),
  );
});

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.type === 'session.ready') {
    // 2. Start streaming audio (audio.raw should hold raw PCM16, 16 kHz, mono)
    const audioStream = fs.createReadStream('audio.raw');
    audioStream.on('data', (chunk) => {
      ws.send(
        JSON.stringify({
          type: 'audio.append',
          audio: chunk.toString('base64'),
        }),
      );
    });
  } else if (msg.type === 'transcript.final') {
    console.log('Final:', msg.text); // translated when target_language is set
  } else if (msg.type === 'error') {
    console.error('Server error:', msg.code, msg.message);
  }
});

// Without an 'error' listener, connection failures surface as unhandled 'error' events.
ws.on('error', (err) => console.error('WebSocket error:', err));
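The example above never ends the session. One possible way to finish is to send audio.commit followed by session.stop once the file has been read. The helper name is hypothetical, and the ordering is an assumption: this document does not specify whether a commit is needed before a stop.

```javascript
// Hypothetical helper: commits any pending audio, then ends the session.
// Sending audio.commit before session.stop is an assumption.
function finishSession(ws) {
  ws.send(JSON.stringify({ type: 'audio.commit' }));
  ws.send(JSON.stringify({ type: 'session.stop' }));
}

// Usage with the example above:
// audioStream.on('end', () => finishSession(ws));
```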
Model Selection Guide
Default: Auto-Detect
Use the default model to automatically detect and transcribe speech across 100+ languages with no configuration required.
{
"model": "valsea-auto"
}
Best for:
- Multi-language or unknown input
- Global applications
- Fast setup with minimal tuning
How it works:
- Automatically detects the spoken language
- Applies a general-purpose transcription pipeline optimized for broad coverage
Accent & Dialect Coverage — The auto-detect model is designed for broad language coverage, not deep regional specialization. It may be less accurate for heavy regional accents, code-switched speech (e.g. Singlish, Manglish), and local slang or informal speech patterns.
For Higher Accuracy (Recommended for SEA)
For Southeast Asian speech and mixed-language inputs, use a region-specific language setting:
{
"model": "valsea-rtt",
"language": "singlish"
}
These specialized modes provide:
- Accent-aware transcription
- Better handling of mixed-language speech
- Improved semantic correction for local expressions
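The choice between the two configurations can be made when building the session.start message. A minimal sketch using the two model names shown in this guide; the helper name is hypothetical.

```javascript
// Hypothetical helper: uses the region-specific setting when the language is
// known, and falls back to auto-detect otherwise.
function buildSessionStart(language) {
  const base = { type: 'session.start' };
  return language
    ? { ...base, model: 'valsea-rtt', language }
    : { ...base, model: 'valsea-auto' };
}

// Usage: ws.send(JSON.stringify(buildSessionStart('singlish')));
```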
List of Languages
Southeast Asia
Singlish — singlish
Indonesian — indonesian
Malaysian — malay
Vietnamese — vietnamese
Thai — thai
Javanese — javanese
Lao — lao
Khmer — khmer
Filipino/Tagalog — filipino
English (Philippines) — english-philippines
Middle East & North Africa
Arabic — arabic arabic-algeria arabic-bahrain arabic-egypt arabic-israel arabic-jordan arabic-kuwait arabic-lebanon arabic-mauritania arabic-morocco arabic-oman arabic-palestine arabic-qatar arabic-saudi arabic-syria arabic-tunisia arabic-uae arabic-yemen
Persian — persian
Hebrew — hebrew
Amharic — amharic
Wolof — wolof
Sub-Saharan Africa
Swahili — swahili swahili-ke
Afrikaans — afrikaans
Akan — akan
Bemba — bemba
Fulani — fulani
Ga — ga
Hausa — hausa
Igbo — igbo
Luganda — luganda
Xhosa — xhosa
Yoruba — yoruba
Zulu — zulu
Northern Sotho — northern-sotho
Nyankole — nyankole
Oromo — oromo
Pidgin — pidgin
Kinyarwanda — kinyarwanda
Shona — shona
Sotho — sotho
Tswana — tswana
Twi — twi
South Asia
Bengali — bengali-bd bengali-in
Hindi — hindi
Gujarati — gujarati
Kannada — kannada
Malayalam — malayalam
Marathi — marathi
Nepali — nepali
Oriya — oriya
Punjabi — punjabi
Sinhala — sinhala
Tamil — tamil
Telugu — telugu
Assamese — assamese
East Asia
Chinese — cantonese chinese chinese-simplified chinese-traditional
Covers a wide range of regional accents and dialects, including those from Anhui, Beijing, Chongqing, Gansu, Guangdong, Guangxi, Guizhou, Hangzhou, Hebei, Henan, Hong Kong, Hubei, Jiangsu, Jianghuai, Jiaoliao, Jilu, Lanyin, Nanjing, Northeast, Ningxia, Shaanxi, Shandong, Sichuan, Taiwan, Tianjin, and Yunnan.
Japanese — japanese
Korean — korean
Mongolian — mongolian
Central & Western Asia
Azerbaijani — azerbaijani
Armenian — armenian
Georgian — georgian
Kazakh — kazakh
Kurdish — kurdish
Kyrgyz — kyrgyz
Uzbek — uzbek
Turkish — turkish
Western Europe
English — english english-au english-gb english-in english-philippines english-us
French — french french-ca
Spanish — spanish spanish-es spanish-mexico spanish-us
Portuguese — portuguese portuguese-br
German — german
Dutch — dutch
Italian — italian
Catalan — catalan
Galician — galician
Asturian — asturian
Basque — basque
Welsh — welsh
Luxembourgish — luxembourgish
Maltese — maltese
Northern Europe
Danish — danish
Finnish — finnish
Icelandic — icelandic
Norwegian — norwegian
Swedish — swedish
Estonian — estonian
Latvian — latvian
Lithuanian — lithuanian
Eastern Europe & Balkans
Bulgarian — bulgarian
Croatian — croatian
Czech — czech
Hungarian — hungarian
Macedonian — macedonian
Polish — polish
Romanian — romanian
Russian — russian
Serbian — serbian
Slovak — slovak
Slovenian — slovenian
Ukrainian — ukrainian
Albanian — albanian
Greek — greek
Pacific & Oceania
Maori — maori