Live Transcription (RTT)

    VALSEA provides a real-time speech-to-text API via WebSocket, allowing you to stream audio and receive transcriptions with low latency.

    Connection

    Endpoint: wss://api.valsea.ai/v1/realtime

    Authentication

    You must authenticate the WebSocket connection by passing your API key in the HTTP headers during the handshake.

    Headers:

    • Authorization: Bearer YOUR_API_KEY (Recommended)
    • X-API-Key: YOUR_API_KEY (Supported)
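
As a rough sketch of the two header options above, the helper below builds the handshake headers for a Node.js client. The name `buildAuthHeaders` and its `scheme` argument are illustrative and not part of the VALSEA API; the returned object would be passed as the `headers` option of a WebSocket client such as `ws`.

```javascript
// Hypothetical helper: builds handshake headers for either supported scheme.
function buildAuthHeaders(apiKey, scheme = 'bearer') {
  if (typeof apiKey !== 'string' || apiKey.length === 0) {
    throw new Error('An API key is required');
  }
  if (scheme === 'bearer') {
    return { Authorization: `Bearer ${apiKey}` }; // recommended form
  }
  if (scheme === 'x-api-key') {
    return { 'X-API-Key': apiKey }; // also supported
  }
  throw new Error(`Unknown auth scheme: ${scheme}`);
}

console.log(buildAuthHeaders('YOUR_API_KEY'));
```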

    Paid Rate-Limit Bypass

    If you need to exceed the realtime connection RPM limit for a session, you can bypass the rate-limit check by sending one of these opt-in flags during the WebSocket handshake:

    • Header: X-Bypass-Rate-Limit: true
    • Query parameter: bypass_rate_limit=true
    • Query parameter: bypassRateLimit=true

    Bypass applies only to the per-organization rate-limit check. Authentication, credit checks, and session billing still apply. RTT sessions using this bypass are billed at 2x the normal realtime credit cost. The initial session.created event includes rateLimitBypass: true and billingMultiplier: 2 when bypass is active. If speaker diarization is also enabled for the session, the final session cost is 4x the normal realtime credit cost.
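
The billing rules above multiply together: 2x for bypass, 2x for diarization, 4x when both are active. The sketch below shows that arithmetic and one way to add the opt-in query parameter to the endpoint URL. `realtimeUrl` and `billingMultiplier` are hypothetical helper names; only the query parameter and the multipliers come from the text above.

```javascript
const ENDPOINT = 'wss://api.valsea.ai/v1/realtime';

// Hypothetical helper: adds the documented opt-in query parameter when requested.
function realtimeUrl({ bypassRateLimit = false } = {}) {
  const url = new URL(ENDPOINT);
  if (bypassRateLimit) url.searchParams.set('bypass_rate_limit', 'true');
  return url.toString();
}

// Hypothetical helper: 2x for bypass, 2x for diarization, 4x for both.
function billingMultiplier({ bypassRateLimit = false, diarize = false } = {}) {
  return (bypassRateLimit ? 2 : 1) * (diarize ? 2 : 1);
}

console.log(realtimeUrl({ bypassRateLimit: true }));
console.log(billingMultiplier({ bypassRateLimit: true, diarize: true }));
```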

    Message Flow

    1. Connect: Client establishes WebSocket connection.
    2. Session Created: Server sends session.created event.
    3. Start Session: Client sends session.start to configure language and model.
    4. Session Ready: Server sends session.ready once the backend engine is ready to receive audio.
    5. Stream Audio: Client sends audio.append messages with base64-encoded PCM16 audio chunks.
    6. Receive Transcripts: Server streams transcript.partial and transcript.final events.
    7. Commit Audio: Client sends audio.commit when a user stops speaking (optional/VAD dependent).
    8. Stop Session: Client sends session.stop to end the session.
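
One way to keep a client honest about this ordering is a small state tracker that rejects out-of-order messages. The state names and `nextState` are illustrative only. The sketch assumes the client waits for session.ready (described under Server Messages) after sending session.start and before streaming audio, which is how the Node.js example on this page is written.

```javascript
// Illustrative state table: state -> { event: nextState }.
// "send:" events are client messages, "recv:" events are server messages.
const TRANSITIONS = {
  connecting: { 'recv:session.created': 'created' },
  created: { 'send:session.start': 'starting' },
  starting: { 'recv:session.ready': 'ready' },
  ready: {
    'send:audio.append': 'ready',
    'send:audio.commit': 'ready',
    'send:session.stop': 'stopped',
  },
};

// Returns the next state, or throws if the event is out of order.
function nextState(state, event) {
  const next = (TRANSITIONS[state] || {})[event];
  if (!next) throw new Error(`Unexpected ${event} while ${state}`);
  return next;
}

const events = [
  'recv:session.created',
  'send:session.start',
  'recv:session.ready',
  'send:audio.append',
  'send:audio.commit',
  'send:session.stop',
];
console.log(events.reduce(nextState, 'connecting'));
```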

    Partial vs Final (Important)

    RTT emits two transcript event types for each utterance:

    • transcript.partial: low-latency, in-progress text. This can change as more audio arrives.
    • transcript.final: stable text for a completed segment. Treat this as the committed result.

    Recommended client behavior:

    1. Keep a temporary currentPartial string for transcript.partial.
    2. Append only transcript.final to your persisted transcript history.
    3. Clear currentPartial when the transcript.final for that segment arrives.

    Client Messages

    session.start

    Initialize the session with configuration.

    {
      "type": "session.start",
      "model": "valsea-rtt",
      "hint_text": "Optional context or vocabulary",
      "enable_correction": true,
      "language": "singlish",
      "target_language": "english",
      "diarize": true,
      "diarization_min_speakers": 2,
      "diarization_max_speakers": 6
    }
    
    Field                      Type      Description
    model                      string    Model to use (e.g., valsea-rtt).
    hint_text                  string    Optional list of words or context to improve accuracy.
    enable_correction          boolean   Enable post-processing for grammar/language correction (default: true).
    language                   string    Language hint for correction (e.g., singlish, english, chinese, korean).
    target_language            string    Optional translation target for final transcript text. Also accepted as targetLanguage. Omit it to return transcription only.
    diarize                    boolean   Enable speaker diarization on final transcript events. Default: false. Bills at 2x normal realtime credits.
    diarization_min_speakers   integer   Minimum expected speaker count for diarization. Default: 2.
    diarization_max_speakers   integer   Maximum expected speaker count for diarization. Default: 6.

    When target_language is set and differs from the input language, transcript.final.text is translated to that target language. Partial transcripts are not translated. Supported translation targets are english, chinese, japanese, korean, vietnamese, thai, french, spanish, german, russian, indonesian, malay, filipino, tamil, khmer, and lao.

    When diarize=true, final events include speaker-labeled words and utterances metadata. Speaker IDs are zero-based integers. Diarization is emitted only on final events.
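
A minimal sketch of assembling a session.start payload from the fields above. `buildSessionStart` and its option names are hypothetical. The client-side checks are a convenience for catching typos early and do not describe how the server validates the message.

```javascript
// Translation targets listed in the text above.
const TRANSLATION_TARGETS = new Set([
  'english', 'chinese', 'japanese', 'korean', 'vietnamese', 'thai', 'french',
  'spanish', 'german', 'russian', 'indonesian', 'malay', 'filipino', 'tamil',
  'khmer', 'lao',
]);

// Hypothetical helper: validates options and returns a session.start payload.
function buildSessionStart(options = {}) {
  const {
    model = 'valsea-rtt', language, targetLanguage,
    diarize = false, minSpeakers = 2, maxSpeakers = 6,
  } = options;
  const msg = { type: 'session.start', model };
  if (language) msg.language = language;
  if (targetLanguage) {
    if (!TRANSLATION_TARGETS.has(targetLanguage)) {
      throw new Error(`Unsupported translation target: ${targetLanguage}`);
    }
    msg.target_language = targetLanguage;
  }
  if (diarize) {
    if (minSpeakers > maxSpeakers) throw new Error('minSpeakers exceeds maxSpeakers');
    msg.diarize = true;
    msg.diarization_min_speakers = minSpeakers;
    msg.diarization_max_speakers = maxSpeakers;
  }
  return msg;
}

console.log(JSON.stringify(buildSessionStart({ language: 'singlish', targetLanguage: 'english' })));
```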

    audio.append

    Send audio data.

    {
      "type": "audio.append",
      "audio": "BASE64_ENCODED_PCM16_DATA"
    }
    
    • Format: Raw PCM 16-bit, 16 kHz (recommended), mono.
    • Encoding: Base64 string.

    You may also send raw binary PCM16 frames directly over the WebSocket. Binary frames are treated as audio.append messages by the server.
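
To make the format concrete: at 16 kHz, 16-bit, mono, one second of audio is 16,000 × 2 = 32,000 bytes, so a 100 ms chunk is 3,200 bytes. The sketch below splits a PCM16 buffer into audio.append messages. The 100 ms chunk duration and the helper names are assumptions for illustration; the text above does not prescribe a chunk size.

```javascript
const SAMPLE_RATE = 16000; // Hz, the recommended rate
const BYTES_PER_SAMPLE = 2; // PCM16, mono

// Bytes needed for a chunk of the given duration (100 ms -> 3200 bytes).
function chunkBytes(durationMs) {
  return Math.floor((SAMPLE_RATE * durationMs) / 1000) * BYTES_PER_SAMPLE;
}

// Splits a PCM16 buffer into audio.append messages of at most `size` bytes.
function toAppendMessages(pcm, size = chunkBytes(100)) {
  const messages = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    const chunk = pcm.subarray(offset, offset + size);
    messages.push({ type: 'audio.append', audio: chunk.toString('base64') });
  }
  return messages;
}

const oneSecondOfSilence = Buffer.alloc(SAMPLE_RATE * BYTES_PER_SAMPLE);
console.log(toAppendMessages(oneSecondOfSilence).length);
```

Because the chunk size is always a multiple of two bytes, a 16-bit sample is never split across two messages.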

    audio.commit

    Signal the end of a speech segment (e.g., when VAD detects silence).

    {
      "type": "audio.commit"
    }
    

    session.stop

    End the session gracefully.

    {
      "type": "session.stop"
    }
    

    Server Messages

    session.created

    Sent immediately upon connection.

    {
      "type": "session.created",
      "sessionId": "rtt_...",
      "supported_models": ["valsea-rtt"]
    }
    

    session.ready

    Sent when the backend engine is connected and ready to receive audio.

    {
      "type": "session.ready",
      "sessionId": "rtt_..."
    }
    

    transcript.partial

    Intermediate transcription results (low latency, may change).

    {
      "type": "transcript.partial",
      "text": "Hello world",
      "isFinal": false,
      "timestampMs": 1230
    }
    

    transcript.final

    Finalized text for a speech segment.

    {
      "type": "transcript.final",
      "text": "Hello, world.",
      "rawText": "hello world",
      "isFinal": true,
      "timestampMs": 2500,
      "corrections": [],
      "words": [{ "word": "Hello", "start": 0.4, "end": 0.8, "speaker": 0 }],
      "utterances": [{ "start": 0.4, "end": 0.8, "speaker": 0, "transcript": "Hello", "words": [] }]
    }
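
When diarization is enabled, the utterances array can be turned into speaker-labeled lines. The sketch below is one possible rendering. `formatUtterances` is a hypothetical helper, and it assumes start is expressed in seconds, which the sample values suggest but the text does not state.

```javascript
// Formats the `utterances` array of a diarized transcript.final event.
// Speaker IDs are zero-based, so they are shown as 1-based labels here.
function formatUtterances(finalEvent) {
  const utterances = finalEvent.utterances || [];
  return utterances.map(
    (u) => `[${u.start.toFixed(1)}s] Speaker ${u.speaker + 1}: ${u.transcript}`,
  );
}

const sample = {
  type: 'transcript.final',
  text: 'Hello, world.',
  utterances: [{ start: 0.4, end: 0.8, speaker: 0, transcript: 'Hello', words: [] }],
};
console.log(formatUtterances(sample).join('\n'));
```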
    

    If realtime translation is enabled, text contains the translated final text and the event includes translation metadata:

    {
      "type": "transcript.final",
      "text": "Hello, world.",
      "rawText": "hello world",
      "isFinal": true,
      "timestampMs": 2500,
      "translated": true,
      "sourceLanguage": "singlish",
      "targetLanguage": "english"
    }
    

    error

    Sent when an error occurs.

    {
      "type": "error",
      "code": "INVALID_MESSAGE",
      "message": "Failed to parse message"
    }
    

    Event Handling Pattern

    Use this pattern to avoid duplicated or unstable transcript content:

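    let currentPartial = ''; // in-progress text, replaced on every partial
    const finalSegments = []; // committed transcript history
    
    // Assumes `ws` is an open WebSocket from the `ws` package.
    ws.on('message', (raw) => {
      const msg = JSON.parse(raw.toString());
    
      if (msg.type === 'transcript.partial') {
        currentPartial = msg.text || '';
      } else if (msg.type === 'transcript.final') {
        finalSegments.push(msg.text || '');
        currentPartial = '';
      }
    });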
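
To display the result of this pattern, the committed segments and the in-progress partial can be joined into one string on every update. `renderTranscript` is a hypothetical helper, and joining with a single space is an assumption that suits space-delimited languages only.

```javascript
// Joins committed segments with the current partial, skipping empty strings.
function renderTranscript(finalSegments, currentPartial) {
  return [...finalSegments, currentPartial].filter(Boolean).join(' ');
}

console.log(renderTranscript(['Hello, world.'], 'how are'));
```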
    

    Browser Interpreter Compatibility

    For simple browser live-translation demos, /v1/realtime also accepts a compact init message and binary PCM16 frames.

    Client init

    {
      "language": "en",
      "model": "v2.0",
      "translation": {
        "type": "two_way",
        "language_a": "en",
        "language_b": "vi"
      }
    }
    

    Supported compact language codes are en, vi, ja, ko, zh, th, fr, es, de, ru, and id. They are mapped to the canonical API language names internally.

    Server messages

    When the compact init shape is used, transcript events are returned in a paired interpreter shape:

    {
      "type": "ready"
    }
    
    {
      "type": "partial",
      "text": "hello"
    }
    
    {
      "type": "final",
      "text": "hello",
      "translation": "xin chao",
      "language": "en",
      "translation_language": "vi"
    }
    

    Only final transcript events include translations. Use text as the original transcript and translation as the translated output.
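
A minimal sketch of the compact shape, building the init message and rendering the paired server messages. The helper names are hypothetical, and setting the top-level language to language_a mirrors the sample init above rather than a documented rule.

```javascript
// Compact language codes listed above.
const COMPACT_CODES = new Set(['en', 'vi', 'ja', 'ko', 'zh', 'th', 'fr', 'es', 'de', 'ru', 'id']);

// Hypothetical helper: builds the compact init message for two-way translation.
function buildCompactInit(languageA, languageB, model = 'v2.0') {
  for (const code of [languageA, languageB]) {
    if (!COMPACT_CODES.has(code)) throw new Error(`Unsupported language code: ${code}`);
  }
  return {
    language: languageA,
    model,
    translation: { type: 'two_way', language_a: languageA, language_b: languageB },
  };
}

// Turns a compact server message into a display line, or null if it carries no text.
function renderCompact(msg) {
  if (msg.type === 'partial') return `(partial) ${msg.text}`;
  if (msg.type === 'final') {
    return `${msg.language}: ${msg.text} | ${msg.translation_language}: ${msg.translation}`;
  }
  return null; // e.g. { "type": "ready" }
}

console.log(JSON.stringify(buildCompactInit('en', 'vi')));
```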

    Example (Node.js)

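    const WebSocket = require('ws');
    const fs = require('fs');
    
    const ws = new WebSocket('wss://api.valsea.ai/v1/realtime', {
      headers: { Authorization: 'Bearer YOUR_API_KEY' },
    });
    
    ws.on('open', () => {
      // 1. Configure the session
      ws.send(
        JSON.stringify({
          type: 'session.start',
          model: 'valsea-rtt',
          language: 'singlish',
          target_language: 'english',
        }),
      );
    });
    
    ws.on('message', (data) => {
      const msg = JSON.parse(data.toString());
    
      if (msg.type === 'session.ready') {
        // 2. Stream raw PCM16 audio (16 kHz, mono) as base64 chunks
        const audioStream = fs.createReadStream('audio.raw');
        audioStream.on('data', (chunk) => {
          ws.send(
            JSON.stringify({
              type: 'audio.append',
              audio: chunk.toString('base64'),
            }),
          );
        });
        // 3. Commit the last segment and end the session once the file is sent
        audioStream.on('end', () => {
          ws.send(JSON.stringify({ type: 'audio.commit' }));
          ws.send(JSON.stringify({ type: 'session.stop' }));
        });
      } else if (msg.type === 'transcript.final') {
        console.log('Final:', msg.text); // translated when target_language is set
      } else if (msg.type === 'error') {
        console.error(`Error ${msg.code}: ${msg.message}`);
      }
    });
    
    ws.on('error', (err) => console.error('WebSocket error:', err.message));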
    

    Model Selection Guide

    Default: Auto-Detect

    Use the default model to automatically detect and transcribe speech across 100+ languages with no configuration required.

    {
      "model": "valsea-auto"
    }
    

    Best for:

    • Multi-language or unknown input
    • Global applications
    • Fast setup with minimal tuning

    How it works:

    • Automatically detects the spoken language
    • Applies a general-purpose transcription pipeline optimized for broad coverage

    For Higher Accuracy (Recommended for SEA)

    For Southeast Asian speech and mixed-language inputs, use a region-specific language setting:

    {
      "model": "valsea-rtt",
      "language": "singlish"
    }
    

    These specialized modes provide:

    • Accent-aware transcription
    • Better handling of mixed-language speech
    • Improved semantic correction for local expressions
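
The choice between the two configurations can be sketched as a single branch. `chooseModelConfig` is a hypothetical helper. It assumes any identifier from the language list below can be passed as language with valsea-rtt, which this guide shows only for singlish.

```javascript
// Hypothetical helper: picks model settings following the guide above.
function chooseModelConfig(language) {
  if (!language) return { model: 'valsea-auto' }; // unknown or multi-language input
  return { model: 'valsea-rtt', language }; // known language, e.g. 'singlish'
}

console.log(chooseModelConfig());
console.log(chooseModelConfig('singlish'));
```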

    List of Languages

    Southeast Asia

    Singlish — singlish

    Indonesian — indonesian

    Malaysian — malay

    Vietnamese — vietnamese

    Thai — thai

    Javanese — javanese

    Lao — lao

    Khmer — khmer

    Filipino/Tagalog — filipino

    English (Philippines) — english-philippines

    Middle East & North Africa

    Arabic — arabic arabic-algeria arabic-bahrain arabic-egypt arabic-israel arabic-jordan arabic-kuwait arabic-lebanon arabic-mauritania arabic-morocco arabic-oman arabic-palestine arabic-qatar arabic-saudi arabic-syria arabic-tunisia arabic-uae arabic-yemen

    Persian — persian

    Hebrew — hebrew

    Amharic — amharic

    Wolof — wolof

    Sub-Saharan Africa

    Swahili — swahili swahili-ke

    Afrikaans — afrikaans

    Akan — akan

    Bemba — bemba

    Fulani — fulani

    Ga — ga

    Hausa — hausa

    Igbo — igbo

    Luganda — luganda

    Xhosa — xhosa

    Yoruba — yoruba

    Zulu — zulu

    Northern Sotho — northern-sotho

    Nyankole — nyankole

    Oromo — oromo

    Pidgin — pidgin

    Kinyarwanda — kinyarwanda

    Shona — shona

    Sotho — sotho

    Tswana — tswana

    Twi — twi

    South Asia

    Bengali — bengali-bd bengali-in

    Hindi — hindi

    Gujarati — gujarati

    Kannada — kannada

    Malayalam — malayalam

    Marathi — marathi

    Nepali — nepali

    Oriya — oriya

    Punjabi — punjabi

    Sinhala — sinhala

    Tamil — tamil

    Telugu — telugu

    Assamese — assamese

    East Asia

    Chinese — cantonese chinese chinese-simplified chinese-traditional

    Covers a wide range of regional accents and dialects, including those from Anhui, Beijing, Chongqing, Gansu, Guangdong, Guangxi, Guizhou, Hangzhou, Hebei, Henan, Hong Kong, Hubei, Jiangsu, Jianghuai, Jiaoliao, Jilu, Lanyin, Nanjing, Northeast, Ningxia, Shaanxi, Shandong, Sichuan, Taiwan, Tianjin, and Yunnan.

    Japanese — japanese

    Korean — korean

    Mongolian — mongolian

    Central & Western Asia

    Azerbaijani — azerbaijani

    Armenian — armenian

    Georgian — georgian

    Kazakh — kazakh

    Kurdish — kurdish

    Kyrgyz — kyrgyz

    Uzbek — uzbek

    Turkish — turkish

    Western Europe

    English — english english-au english-gb english-in english-philippines english-us

    French — french french-ca

    Spanish — spanish spanish-es spanish-mexico spanish-us

    Portuguese — portuguese portuguese-br

    German — german

    Dutch — dutch

    Italian — italian

    Catalan — catalan

    Galician — galician

    Asturian — asturian

    Basque — basque

    Welsh — welsh

    Luxembourgish — luxembourgish

    Maltese — maltese

    Northern Europe

    Danish — danish

    Finnish — finnish

    Icelandic — icelandic

    Norwegian — norwegian

    Swedish — swedish

    Estonian — estonian

    Latvian — latvian

    Lithuanian — lithuanian

    Eastern Europe & Balkans

    Bulgarian — bulgarian

    Croatian — croatian

    Czech — czech

    Hungarian — hungarian

    Macedonian — macedonian

    Polish — polish

    Romanian — romanian

    Russian — russian

    Serbian — serbian

    Slovak — slovak

    Slovenian — slovenian

    Ukrainian — ukrainian

    Albanian — albanian

    Greek — greek

    Pacific & Oceania

    Maori — maori
