Realtime Text to Speech

Valsea provides realtime text-to-speech over WebSocket for low-latency audio playback. The protocol is Valsea-specific, and its public messages are designed so that additional realtime TTS engines can be routed behind them later.

Connection

Endpoint: wss://api.valsea.ai/v1/realtime/tts

Authenticate during the WebSocket handshake with one of these options:

Authorization: Bearer YOUR_API_KEY
X-API-Key: YOUR_API_KEY
Query parameter: api_key=YOUR_API_KEY
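
As a rough illustration of how the three options map onto a connection, the sketch below returns the URL and handshake headers for a chosen option. The helper name `buildConnection` and the method labels `bearer`, `header`, and `query` are hypothetical; only the endpoint, the header names, and the `api_key` query parameter come from this page.

```javascript
// Endpoint as documented on this page.
const ENDPOINT = 'wss://api.valsea.ai/v1/realtime/tts';

// Hypothetical helper: returns the URL and handshake headers for one auth option.
// The method labels ('bearer', 'header', 'query') are illustrative only.
function buildConnection(apiKey, method = 'bearer') {
  const url = new URL(ENDPOINT);
  const headers = {};

  if (method === 'bearer') {
    headers.Authorization = `Bearer ${apiKey}`;
  } else if (method === 'header') {
    headers['X-API-Key'] = apiKey;
  } else if (method === 'query') {
    // Useful where custom handshake headers cannot be set, e.g. browser WebSocket clients.
    url.searchParams.set('api_key', apiKey);
  } else {
    throw new Error(`Unknown auth method: ${method}`);
  }

  return { url: url.toString(), headers };
}
```

With the `ws` package, the result can be passed as `new WebSocket(url, { headers })`.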

Message flow

1. Connect to the WebSocket endpoint.
2. Receive `session.created`.
3. Send `session.start` with model, voice, language, and output options.
4. Receive `session.ready`.
5. Send `speech.create` with the text input.
6. Receive `speech.started`, then binary audio chunks, then `speech.finished`.
7. Send `session.stop` to close cleanly.
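
The flow above can be read as a small request/response table: each server event determines the next client message. The sketch below is one way to express that, assuming a single utterance per session; the function name `nextClientMessage` and its parameters are hypothetical.

```javascript
// Hypothetical helper: given a parsed server message, return the next client
// message to send, or null when no reply is needed.
// `config` holds the session.start options; `text` is the speech input.
function nextClientMessage(serverMessage, config, text) {
  switch (serverMessage.type) {
    case 'session.created':
      // Spread first so the message type cannot be overridden by config.
      return { ...config, type: 'session.start' };
    case 'session.ready':
      return { type: 'speech.create', input: text };
    case 'speech.finished':
      return { type: 'session.stop' };
    default:
      // speech.started and any other event need no reply in this sketch.
      return null;
  }
}
```

A caller would serialize any non-null result with `JSON.stringify` before sending it over the socket.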

Client messages

`session.start`

{
  "type": "session.start",
  "model": "valsea-tts-realtime",
  "language": "vietnamese",
  "voice": "valsea-neutral",
  "response_format": "mp3",
  "speed": 1,
  "normalization": "basic",
  "audio_quality": 64
}

| Field | Type | Description |
| --- | --- | --- |
| `model` | `valsea-tts-realtime` | Realtime TTS model. |
| `language` | `vietnamese` \| `english` \| `english-in` \| `hindi` \| `bengali-in` \| `kannada` \| `malayalam` \| `marathi` \| `odia` \| `oriya` \| `punjabi` \| `tamil` \| `telugu` \| `gujarati` | Language used for voice routing. |
| `voice` | `valsea-neutral` \| `valsea-male` \| `valsea-female` | Stable Valsea voice alias. |
| `response_format` | `mp3` \| `wav` \| `opus` | Binary audio chunk format. |
| `speed` | number | Playback speed from `0.25` to `4`. |
| `normalization` | `no` \| `basic` \| `advanced` | Text normalization level. |
| `audio_quality` | integer | Audio quality value. Default: `64`. |
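
Validating these fields on the client can catch mistakes before the message is sent. The following is a minimal sketch, not the server's own validation: the helper name `buildSessionStart` is hypothetical, the `speed` range is assumed to be inclusive, and the fallback values are taken from the example message above rather than from documented server defaults (apart from `audio_quality`).

```javascript
// Allowed values copied from the field table above.
const LANGUAGES = ['vietnamese', 'english', 'english-in', 'hindi', 'bengali-in',
  'kannada', 'malayalam', 'marathi', 'odia', 'oriya', 'punjabi', 'tamil',
  'telugu', 'gujarati'];
const VOICES = ['valsea-neutral', 'valsea-male', 'valsea-female'];
const FORMATS = ['mp3', 'wav', 'opus'];
const NORMALIZATION = ['no', 'basic', 'advanced'];

// Hypothetical helper: builds a session.start message and throws on values
// the field table does not list.
function buildSessionStart(options = {}) {
  const message = {
    // Fallbacks mirror the example message (an assumption, not server defaults).
    model: 'valsea-tts-realtime',
    language: 'vietnamese',
    voice: 'valsea-neutral',
    response_format: 'mp3',
    speed: 1,
    normalization: 'basic',
    audio_quality: 64,
    ...options,
    type: 'session.start',
  };

  const oneOf = (field, allowed) => {
    if (!allowed.includes(message[field])) {
      throw new RangeError(`${field} must be one of: ${allowed.join(', ')}`);
    }
  };
  oneOf('model', ['valsea-tts-realtime']);
  oneOf('language', LANGUAGES);
  oneOf('voice', VOICES);
  oneOf('response_format', FORMATS);
  oneOf('normalization', NORMALIZATION);

  // The comparison is written so that NaN is rejected as well.
  if (typeof message.speed !== 'number' || !(message.speed >= 0.25 && message.speed <= 4)) {
    throw new RangeError('speed must be a number from 0.25 to 4');
  }
  if (!Number.isInteger(message.audio_quality)) {
    throw new RangeError('audio_quality must be an integer');
  }
  return message;
}
```

For example, `buildSessionStart({ voice: 'valsea-female', speed: 1.25 })` returns a complete message, while `buildSessionStart({ speed: 5 })` throws.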

`speech.create`

{
  "type": "speech.create",
  "input": "Xin chao, day la giong noi Valsea."
}

`session.stop`

{ "type": "session.stop" }

Server messages

`session.created`

{
  "type": "session.created",
  "sessionId": "tts_...",
  "supportedModels": ["valsea-tts-realtime"],
  "supportedVoices": ["valsea-neutral", "valsea-male", "valsea-female"],
  "supportedLanguages": ["vietnamese", "english", "english-in", "hindi", "bengali-in", "kannada", "malayalam", "marathi", "odia", "oriya", "punjabi", "tamil", "telugu", "gujarati"]
}
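
Because `session.created` lists the supported models, voices, and languages, a client can compare its intended `session.start` values against them before sending. This page does not say whether the lists vary between accounts or deployments, so the check below is only a sketch; the helper name `unsupportedFields` is hypothetical.

```javascript
// Hypothetical helper: returns the names of requested fields whose values are
// missing from the capabilities advertised in session.created.
function unsupportedFields(created, request) {
  const problems = [];
  if (!created.supportedModels.includes(request.model)) problems.push('model');
  if (!created.supportedVoices.includes(request.voice)) problems.push('voice');
  if (!created.supportedLanguages.includes(request.language)) problems.push('language');
  return problems;
}
```

An empty array means every requested value was advertised by the server.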

`session.ready`

{
  "type": "session.ready",
  "sessionId": "tts_...",
  "model": "valsea-tts-realtime"
}

Audio events

Audio is delivered between two JSON events: the server sends `speech.started`, then the audio as binary WebSocket frames, then `speech.finished`.

{ "type": "speech.started", "sessionId": "tts_..." }

{ "type": "speech.finished", "sessionId": "tts_..." }
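
One way to handle this sequence is to buffer binary frames only between the two events and to reset the buffer on each `speech.started`. The sketch below does that; the helper name `createAudioCollector` is hypothetical, and ignoring binary frames that arrive outside a started/finished pair is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: accumulates binary frames between speech.started and
// speech.finished. handle() takes the same (data, isBinary) arguments as the
// ws 'message' event and returns the complete audio Buffer when
// speech.finished arrives, otherwise null.
function createAudioCollector() {
  let chunks = [];
  let speaking = false;

  return function handle(data, isBinary) {
    if (isBinary) {
      // Assumption: binary frames outside a started/finished pair are ignored.
      if (speaking) chunks.push(Buffer.from(data));
      return null;
    }

    const message = JSON.parse(data.toString());
    if (message.type === 'speech.started') {
      chunks = [];
      speaking = true;
    } else if (message.type === 'speech.finished') {
      speaking = false;
      const audio = Buffer.concat(chunks);
      chunks = [];
      return audio;
    }
    return null;
  };
}
```

Resetting on `speech.started` keeps audio from separate utterances apart if a client sends more than one `speech.create` in a session, which this page neither confirms nor rules out.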

Example

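// Requires the `ws` package: npm install ws
const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('wss://api.valsea.ai/v1/realtime/tts', {
  headers: { Authorization: 'Bearer YOUR_API_KEY' },
});

// Binary audio frames collected until speech.finished arrives.
const chunks = [];

ws.on('message', (data, isBinary) => {
  // Binary frames carry audio; text frames carry JSON events.
  if (isBinary) {
    chunks.push(Buffer.from(data));
    return;
  }

  const message = JSON.parse(data.toString());

  // The server has opened the session; configure it.
  if (message.type === 'session.created') {
    ws.send(
      JSON.stringify({
        type: 'session.start',
        model: 'valsea-tts-realtime',
        language: 'vietnamese',
        voice: 'valsea-neutral',
        response_format: 'mp3',
      }),
    );
  }

  // The session is configured; request speech for the input text.
  if (message.type === 'session.ready') {
    ws.send(
      JSON.stringify({
        type: 'speech.create',
        input: 'Xin chao, day la giong noi Valsea.',
      }),
    );
  }

  // All audio has arrived; write it to disk and end the session.
  if (message.type === 'speech.finished') {
    fs.writeFileSync('speech.mp3', Buffer.concat(chunks));
    ws.send(JSON.stringify({ type: 'session.stop' }));
  }
});

// Without an 'error' listener, a failed connection would throw as an unhandled event.
ws.on('error', (err) => {
  console.error('WebSocket error:', err.message);
});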

Billing

Realtime TTS is billed by generated audio duration, rounded up to the next whole minute.
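
As a worked example of the rounding, the sketch below converts an audio duration in seconds to billed minutes. The function name `billedMinutes` is hypothetical. It assumes that an exact multiple of 60 seconds is not rounded up further; this page does not state whether rounding applies per utterance, per session, or per billing period.

```javascript
// Sketch: billed minutes for a generated audio duration given in seconds.
// Assumption: exactly 60 seconds bills as 1 minute, not 2.
function billedMinutes(durationSeconds) {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError('durationSeconds must be a non-negative number');
  }
  return Math.ceil(durationSeconds / 60);
}
```

Under these assumptions, 61 seconds of audio is billed as 2 minutes and 60 seconds as 1 minute.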