Realtime Text to Speech

    Valsea provides realtime text-to-speech over WebSocket for low-latency audio playback. The message protocol is Valsea-specific and is designed so that additional realtime TTS engines can later be served behind the same public messages.

    Connection

    Endpoint: wss://api.valsea.ai/v1/realtime/tts

    Authenticate during the WebSocket handshake with one of these options:

    • Authorization: Bearer YOUR_API_KEY
    • X-API-Key: YOUR_API_KEY
    • Query parameter: api_key=YOUR_API_KEY
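
    As a rough sketch, the three options can be expressed as connection parameters like this. The helper name buildConnection and its style argument are hypothetical; only the endpoint, the two header names, and the api_key query parameter come from this page.

```javascript
const ENDPOINT = 'wss://api.valsea.ai/v1/realtime/tts';

// Hypothetical helper: returns the URL and handshake headers for one auth style.
function buildConnection(apiKey, style = 'bearer') {
  const url = new URL(ENDPOINT);
  const headers = {};
  if (style === 'bearer') {
    headers.Authorization = `Bearer ${apiKey}`;
  } else if (style === 'x-api-key') {
    headers['X-API-Key'] = apiKey;
  } else if (style === 'query') {
    // The key travels in the URL, so it may end up in logs or proxies.
    url.searchParams.set('api_key', apiKey);
  } else {
    throw new Error(`Unknown auth style: ${style}`);
  }
  return { url: url.toString(), headers };
}
```

    With the ws package, the result could be passed as new WebSocket(url, { headers }). The browser WebSocket API cannot set custom handshake headers, so the query parameter is likely the practical option there.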

    Message flow

    1. Connect to the WebSocket endpoint.
    2. Receive session.created.
    3. Send session.start with model, voice, language, and output options.
    4. Receive session.ready.
    5. Send speech.create with the text input.
    6. Receive speech.started, then binary audio chunks, then speech.finished.
    7. Send session.stop to close cleanly.
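
    The ordering above can be checked on the client with a small state machine. This is a minimal sketch: createFlowTracker and the state names are hypothetical, and it treats speech.finished as a terminal state because this page does not say whether another speech.create may follow in the same session.

```javascript
// Server event expected in each client state, and the state it leads to.
const TRANSITIONS = {
  connecting: { 'session.created': 'created' },
  created: { 'session.ready': 'ready' },      // after the client sends session.start
  ready: { 'speech.started': 'speaking' },    // after the client sends speech.create
  speaking: { 'speech.finished': 'finished' },
};

function createFlowTracker() {
  let state = 'connecting';
  return {
    get state() {
      return state;
    },
    // Advance on a server event; throw if it arrives out of order.
    onServerEvent(type) {
      const next = TRANSITIONS[state] && TRANSITIONS[state][type];
      if (!next) {
        throw new Error(`Unexpected ${type} in state ${state}`);
      }
      state = next;
      return state;
    },
  };
}
```

    Feeding each JSON event's type into onServerEvent surfaces out-of-order messages early instead of leaving the client waiting for audio that never arrives.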

    Client messages

    session.start

    {
      "type": "session.start",
      "model": "valsea-tts-realtime",
      "language": "vietnamese",
      "voice": "valsea-neutral",
      "response_format": "mp3",
      "speed": 1,
      "normalization": "basic",
      "audio_quality": 64
    }
    
    Field            Type                                          Description
    model            valsea-tts-realtime                           Realtime TTS model.
    language         vietnamese | english                          Language used for voice routing.
    voice            valsea-neutral | valsea-male | valsea-female  Stable Valsea voice alias.
    response_format  mp3 | wav | opus                              Binary audio chunk format.
    speed            number                                        Playback speed from 0.25 to 4.
    normalization    no | basic | advanced                         Text normalization level.
    audio_quality    integer                                       Audio quality value. Default: 64.
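
    The field constraints can be checked before sending, as in the sketch below. validateSessionStart is a hypothetical client-side helper, not part of any Valsea SDK. It skips omitted fields because this page does not state which fields are required, and it does not range-check audio_quality because no range is given.

```javascript
const ALLOWED = {
  model: ['valsea-tts-realtime'],
  language: ['vietnamese', 'english'],
  voice: ['valsea-neutral', 'valsea-male', 'valsea-female'],
  response_format: ['mp3', 'wav', 'opus'],
  normalization: ['no', 'basic', 'advanced'],
};

// Hypothetical helper: returns a list of problems; an empty list means no problem was found.
function validateSessionStart(msg) {
  const errors = [];
  for (const [field, values] of Object.entries(ALLOWED)) {
    // Omitted fields are not flagged (assumption: the server may apply defaults).
    if (msg[field] !== undefined && !values.includes(msg[field])) {
      errors.push(`${field} must be one of: ${values.join(', ')}`);
    }
  }
  if (
    msg.speed !== undefined &&
    !(typeof msg.speed === 'number' && msg.speed >= 0.25 && msg.speed <= 4)
  ) {
    errors.push('speed must be a number from 0.25 to 4');
  }
  if (msg.audio_quality !== undefined && !Number.isInteger(msg.audio_quality)) {
    errors.push('audio_quality must be an integer');
  }
  return errors;
}
```

    Calling validateSessionStart on the session.start example above returns an empty list, while a speed of 5 or an unknown voice each adds one entry.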

    speech.create

    {
      "type": "speech.create",
      "input": "Xin chao, day la giong noi Valsea."
    }
    

    session.stop

    { "type": "session.stop" }
    

    Server messages

    session.created

    {
      "type": "session.created",
      "sessionId": "tts_...",
      "supportedModels": ["valsea-tts-realtime"],
      "supportedVoices": ["valsea-neutral", "valsea-male", "valsea-female"],
      "supportedLanguages": ["vietnamese", "english"]
    }
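
    The advertised lists make it possible to confirm the desired settings before sending session.start. A minimal sketch, assuming the lists have the shape shown above; unsupportedSettings is a hypothetical helper name.

```javascript
// Hypothetical helper: names the requested settings that session.created does not advertise.
function unsupportedSettings(created, wanted) {
  const checks = [
    ['model', created.supportedModels],
    ['voice', created.supportedVoices],
    ['language', created.supportedLanguages],
  ];
  return checks
    // A list missing from the payload is skipped rather than treated as a failure.
    .filter(([field, supported]) => Array.isArray(supported) && !supported.includes(wanted[field]))
    .map(([field]) => field);
}
```

    An empty result means every requested value appears in the corresponding list; otherwise the returned field names indicate what to change before starting the session.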
    

    session.ready

    {
      "type": "session.ready",
      "sessionId": "tts_...",
      "model": "valsea-tts-realtime"
    }
    

    Audio events

    For each speech.create, the server sends a speech.started JSON message, then binary WebSocket frames carrying audio in the session's response_format, then a speech.finished JSON message.

    { "type": "speech.started", "sessionId": "tts_..." }
    
    { "type": "speech.finished", "sessionId": "tts_..." }
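
    One way to separate the JSON events from the audio frames is a small collector like the sketch below. AudioCollector is a hypothetical class. It assumes that binary frames belong to the utterance bracketed by speech.started and speech.finished, and that concatenating them in arrival order yields the complete audio.

```javascript
// Hypothetical collector: gathers binary frames that arrive between
// speech.started and speech.finished, and ignores binary frames outside that window.
class AudioCollector {
  constructor() {
    this.chunks = [];
    this.active = false;
  }

  // Pass every WebSocket message here.
  // Returns a Buffer with the full audio when speech.finished arrives, otherwise null.
  handle(data, isBinary) {
    if (isBinary) {
      if (this.active) {
        this.chunks.push(Buffer.from(data));
      }
      return null;
    }
    const message = JSON.parse(data.toString());
    if (message.type === 'speech.started') {
      this.chunks = [];
      this.active = true;
    } else if (message.type === 'speech.finished') {
      this.active = false;
      return Buffer.concat(this.chunks);
    }
    return null;
  }
}
```

    With the ws package this could be wired up as ws.on('message', (data, isBinary) => { const audio = collector.handle(data, isBinary); ... }), writing or playing the buffer whenever the return value is not null.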
    

    Example

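    // Requires the ws package (npm install ws); version 8 or later passes isBinary.
    const WebSocket = require('ws');
    const fs = require('fs');
    
    const ws = new WebSocket('wss://api.valsea.ai/v1/realtime/tts', {
      headers: { Authorization: 'Bearer YOUR_API_KEY' },
    });
    
    // Binary audio frames, in arrival order.
    const chunks = [];
    
    ws.on('message', (data, isBinary) => {
      // Binary frames carry audio; text frames carry JSON events.
      if (isBinary) {
        chunks.push(Buffer.from(data));
        return;
      }
    
      const message = JSON.parse(data.toString());
    
      // Step 3: configure the session once the server announces it.
      if (message.type === 'session.created') {
        ws.send(
          JSON.stringify({
            type: 'session.start',
            model: 'valsea-tts-realtime',
            language: 'vietnamese',
            voice: 'valsea-neutral',
            response_format: 'mp3',
          }),
        );
      }
    
      // Step 5: request speech once the session is ready.
      if (message.type === 'session.ready') {
        ws.send(
          JSON.stringify({
            type: 'speech.create',
            input: 'Xin chao, day la giong noi Valsea.',
          }),
        );
      }
    
      // Step 7: all audio has arrived; save it and close the session cleanly.
      if (message.type === 'speech.finished') {
        fs.writeFileSync('speech.mp3', Buffer.concat(chunks));
        ws.send(JSON.stringify({ type: 'session.stop' }));
      }
    });
    
    // Surface connection problems instead of failing silently.
    ws.on('error', (err) => {
      console.error('WebSocket error:', err.message);
    });
    
    ws.on('close', (code) => {
      console.log('Connection closed with code', code);
    });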
    

    Billing

    Realtime TTS is billed by generated audio duration, rounded up to the next whole minute.
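
    As an illustration of the rounding, the sketch below converts an audio duration into billable minutes. billableMinutes is a hypothetical helper, and it assumes the rounding is applied once to the duration passed in; this page does not state whether rounding happens per request, per session, or per billing period.

```javascript
// Billable minutes for a given audio duration, rounding any partial minute up.
// Assumption: rounding is applied once to the total duration passed in.
function billableMinutes(durationSeconds) {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError('durationSeconds must be a non-negative number');
  }
  return Math.ceil(durationSeconds / 60);
}
```

    Under that assumption, 61 seconds of generated audio counts as 2 minutes, and a 5-second clip counts as 1 minute.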
