Live Transcription

    VALSEA provides a real-time speech-to-text API via WebSocket, allowing you to stream audio and receive transcriptions with low latency.

    Connection

    Endpoint: wss://api.valsea.ai/v1/realtime

    Authentication

    You must authenticate the WebSocket connection by passing your API key in the HTTP headers during the handshake.

    Headers:

    • Authorization: Bearer YOUR_API_KEY (Recommended)
    • X-API-Key: YOUR_API_KEY (Supported)
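    Both header forms are supplied at handshake time. A minimal sketch of building handshake options for a Node.js client such as the ws package (authOptions is a hypothetical helper, not part of any VALSEA SDK):

```javascript
// Build WebSocket handshake options carrying the API key.
// 'bearer' uses the recommended Authorization header; anything else
// falls back to the supported X-API-Key header.
function authOptions(apiKey, scheme = 'bearer') {
  return scheme === 'bearer'
    ? { headers: { Authorization: `Bearer ${apiKey}` } }
    : { headers: { 'X-API-Key': apiKey } };
}

// Usage: new WebSocket('wss://api.valsea.ai/v1/realtime', authOptions('YOUR_API_KEY'));
```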

    Message Flow

    1. Connect: Client establishes WebSocket connection.
    2. Session Created: Server sends session.created event.
    3. Start Session: Client sends session.start to configure language and model.
    4. Stream Audio: Client sends audio.append messages with base64-encoded PCM16 audio chunks.
    5. Receive Transcripts: Server streams transcript.partial and transcript.final events.
    6. Commit Audio: Client sends audio.commit when the user stops speaking (optional; typically driven by voice activity detection, VAD).
    7. Stop Session: Client sends session.stop to end the session.

    Partial vs Final (Important)

    The realtime transcription (RTT) API emits two transcript event types for each utterance:

    • transcript.partial: low-latency, in-progress text. This can change as more audio arrives.
    • transcript.final: stable text for a completed segment. Treat this as the committed result.

    Recommended client behavior:

    1. Keep a temporary currentPartial string for transcript.partial.
    2. Append only transcript.final to your persisted transcript history.
    3. Clear currentPartial when you receive a matching transcript.final.

    Client Messages

    session.start

    Initialize the session with configuration.

    {
      "type": "session.start",
      "model": "valsea-rtt",
      "hint_text": "Optional context or vocabulary",
      "enable_correction": true,
      "language": "singlish"
    }
    
    • model (string): Model to use (e.g., valsea-rtt).
    • hint_text (string): Optional words or context to improve accuracy.
    • enable_correction (boolean): Enable post-processing for grammar/language correction (default: true).
    • language (string): Language hint for correction (e.g., singlish, english, chinese, korean).

    audio.append

    Send audio data.

    {
      "type": "audio.append",
      "audio": "BASE64_ENCODED_PCM16_DATA"
    }
    
    • Format: Raw PCM 16-bit, 16kHz (recommended), mono.
    • Encoding: Base64 string.
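    Building an audio.append payload from raw samples might look like this (a sketch; toAppendMessage is a hypothetical helper, and the Int16Array stands in for real 16kHz mono microphone data):

```javascript
// Serialize 16-bit PCM samples into a base64-encoded audio.append message.
function toAppendMessage(samples /* Int16Array, 16 kHz mono */) {
  const buf = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength);
  return JSON.stringify({ type: 'audio.append', audio: buf.toString('base64') });
}

const msg = toAppendMessage(new Int16Array([0, 16384, -16384]));
```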

    audio.commit

    Signal the end of a speech segment (e.g., after VAD-detected silence).

    {
      "type": "audio.commit"
    }
    

    session.stop

    End the session gracefully.

    {
      "type": "session.stop"
    }
    

    Server Messages

    session.created

    Sent immediately upon connection.

    {
      "type": "session.created",
      "sessionId": "rtt_...",
      "supported_models": ["valsea-rtt"]
    }
    

    session.ready

    Sent when the backend engine is connected and ready to receive audio.

    {
      "type": "session.ready",
      "sessionId": "rtt_..."
    }
    

    transcript.partial

    Intermediate transcription results (low latency, may change).

    {
      "type": "transcript.partial",
      "text": "Hello world",
      "isFinal": false,
      "timestampMs": 1230
    }
    

    transcript.final

    Finalized text for a speech segment.

    {
      "type": "transcript.final",
      "text": "Hello, world.",
      "raw_text": "hello world",
      "isFinal": true,
      "timestampMs": 2500,
      "corrections": []
    }
    

    error

    Sent when an error occurs.

    {
      "type": "error",
      "code": "INVALID_MESSAGE",
      "message": "Failed to parse message"
    }
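    A defensive way to route incoming frames, including error events (a sketch; handleServerMessage and the CLIENT_PARSE_ERROR code are hypothetical, client-side names):

```javascript
// Parse a server frame and dispatch it; malformed frames become local errors.
function handleServerMessage(raw, { onEvent, onError }) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    onError({ code: 'CLIENT_PARSE_ERROR', message: 'Unparseable frame from server' });
    return;
  }
  if (msg.type === 'error') onError(msg);
  else onEvent(msg);
}
```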
    

    Event Handling Pattern

    Use this pattern to avoid duplicated or unstable transcript content:

    let currentPartial = '';
    const finalSegments = [];
    
    ws.on('message', (raw) => {
      const msg = JSON.parse(raw);
    
      if (msg.type === 'transcript.partial') {
        currentPartial = msg.text || '';
      }
    
      if (msg.type === 'transcript.final') {
        finalSegments.push(msg.text || '');
        currentPartial = '';
      }
    });
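    With this pattern, the text to render at any moment is the committed segments plus the in-flight partial (a sketch; displayText is a hypothetical helper):

```javascript
// Join committed segments with the current partial for UI display.
function displayText(finalSegments, currentPartial) {
  return [...finalSegments, currentPartial].filter(Boolean).join(' ');
}

console.log(displayText(['Hello, world.'], 'how are'));
// prints "Hello, world. how are"
```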
    

    Example (Node.js)

    const WebSocket = require('ws');
    const fs = require('fs');
    
    const ws = new WebSocket('wss://api.valsea.ai/v1/realtime', {
      headers: { 'X-API-Key': 'YOUR_KEY' },
    });
    
    ws.on('open', () => {
      // 1. Configure the session
      ws.send(
        JSON.stringify({
          type: 'session.start',
          model: 'valsea-rtt',
        }),
      );
    });
    
    ws.on('message', (data) => {
      const msg = JSON.parse(data);
    
      if (msg.type === 'session.ready') {
        // 2. Stream audio once the engine is ready (example)
        const audioStream = fs.createReadStream('audio.raw');
        audioStream.on('data', (chunk) => {
          ws.send(
            JSON.stringify({
              type: 'audio.append',
              audio: chunk.toString('base64'),
            }),
          );
        });
        audioStream.on('end', () => {
          // 3. Commit the last segment and end the session gracefully
          ws.send(JSON.stringify({ type: 'audio.commit' }));
          ws.send(JSON.stringify({ type: 'session.stop' }));
        });
      } else if (msg.type === 'transcript.final') {
        console.log('Final:', msg.text);
      } else if (msg.type === 'error') {
        console.error('Error:', msg.code, msg.message);
      }
    });
    
