Realtime Text to Speech
Valsea provides realtime text-to-speech over WebSocket for low-latency audio playback. The protocol is Valsea-specific and designed so additional realtime TTS engines can be routed behind the same public messages later.
Connection
Endpoint: wss://api.valsea.ai/v1/realtime/tts
Authenticate during the WebSocket handshake with one of these options:
- Authorization: Bearer YOUR_API_KEY
- X-API-Key: YOUR_API_KEY
- Query parameter: api_key=YOUR_API_KEY
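As a minimal sketch of the three authentication options, the helper below builds the URL and handshake headers for each one. `buildConnection` and its `method` labels are hypothetical names chosen for illustration; only the endpoint, the header names, and the `api_key` query parameter come from this page.

```javascript
// Endpoint as documented above.
const ENDPOINT = 'wss://api.valsea.ai/v1/realtime/tts';

// Hypothetical helper (not part of any Valsea SDK): returns the URL and the
// handshake headers for one of the three documented auth options.
function buildConnection(apiKey, method = 'bearer') {
  switch (method) {
    case 'bearer':
      return { url: ENDPOINT, headers: { Authorization: `Bearer ${apiKey}` } };
    case 'x-api-key':
      return { url: ENDPOINT, headers: { 'X-API-Key': apiKey } };
    case 'query': {
      // The key travels in the URL, so no extra headers are needed.
      const url = new URL(ENDPOINT);
      url.searchParams.set('api_key', apiKey);
      return { url: url.toString(), headers: {} };
    }
    default:
      throw new Error(`Unknown auth method: ${method}`);
  }
}

// Usage with the `ws` package would look like:
//   const { url, headers } = buildConnection('YOUR_API_KEY', 'x-api-key');
//   const ws = new WebSocket(url, { headers });
```

The browser WebSocket API does not let a page set custom handshake headers, so the query parameter is probably the option to reach for there. Keys placed in URLs can end up in logs, so the header forms are likely the safer choice on a server.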
Message flow
- Connect to the WebSocket endpoint.
- Receive session.created.
- Send session.start with model, voice, language, and output options.
- Receive session.ready.
- Send speech.create with the text input.
- Receive speech.started, then binary audio chunks, then speech.finished.
- Send session.stop to close cleanly.
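The flow above is a strict request/response ladder, so it can be sketched as a pure function that maps each server message type to the client message that follows it. `nextClientMessage` and its `session` and `text` arguments are hypothetical names for illustration. The sketch assumes one utterance per session, which is the only case this page describes.

```javascript
// Hypothetical helper: given the type of the server message just received,
// return the client message to send next, or null when nothing is due.
function nextClientMessage(serverType, { session, text }) {
  switch (serverType) {
    case 'session.created':
      // Spread first so the documented type always wins.
      return { ...session, type: 'session.start' };
    case 'session.ready':
      return { type: 'speech.create', input: text };
    case 'speech.finished':
      return { type: 'session.stop' };
    default:
      // e.g. speech.started: binary audio frames follow, nothing to send.
      return null;
  }
}
```

Keeping this step free of socket code makes the ordering easy to unit test. A real client would call it from its message handler and pass any non-null result to `ws.send(JSON.stringify(...))`.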
Client messages
session.start
{
  "type": "session.start",
  "model": "valsea-tts-realtime",
  "language": "vietnamese",
  "voice": "valsea-neutral",
  "response_format": "mp3",
  "speed": 1,
  "normalization": "basic",
  "audio_quality": 64
}
| Field | Type | Description |
|---|---|---|
| model | valsea-tts-realtime | Realtime TTS model. |
| language | vietnamese \| english | Language used for voice routing. |
| voice | valsea-neutral \| valsea-male \| valsea-female | Stable Valsea voice alias. |
| response_format | mp3 \| wav \| opus | Binary audio chunk format. |
| speed | number | Playback speed from 0.25 to 4. |
| normalization | no \| basic \| advanced | Text normalization level. |
| audio_quality | integer | Audio quality value. Default: 64. |
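A client can check a session.start payload against the table before sending it. The sketch below is one way to do that; `validateSessionStart` is a hypothetical helper. This page does not say which fields are required, so the sketch treats every field as optional and only validates the ones that are present.

```javascript
// Allowed values copied from the field table above.
const ALLOWED = {
  model: ['valsea-tts-realtime'],
  language: ['vietnamese', 'english'],
  voice: ['valsea-neutral', 'valsea-male', 'valsea-female'],
  response_format: ['mp3', 'wav', 'opus'],
  normalization: ['no', 'basic', 'advanced'],
};

// Hypothetical helper: returns a list of problems, empty when none are found.
// Assumption: absent fields are allowed (required fields are not documented).
function validateSessionStart(msg) {
  const errors = [];
  for (const [field, values] of Object.entries(ALLOWED)) {
    if (msg[field] !== undefined && !values.includes(msg[field])) {
      errors.push(`${field} must be one of: ${values.join(', ')}`);
    }
  }
  // The positive range check also rejects NaN.
  if (msg.speed !== undefined &&
      !(typeof msg.speed === 'number' && msg.speed >= 0.25 && msg.speed <= 4)) {
    errors.push('speed must be a number from 0.25 to 4');
  }
  if (msg.audio_quality !== undefined && !Number.isInteger(msg.audio_quality)) {
    errors.push('audio_quality must be an integer');
  }
  return errors;
}
```

The server remains the authority on what it accepts. A local check like this mainly catches typos in enum values before a round trip.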
speech.create
{
  "type": "speech.create",
  "input": "Xin chao, day la giong noi Valsea."
}
session.stop
{ "type": "session.stop" }
Server messages
session.created
{
  "type": "session.created",
  "sessionId": "tts_...",
  "supportedModels": ["valsea-tts-realtime"],
  "supportedVoices": ["valsea-neutral", "valsea-male", "valsea-female"],
  "supportedLanguages": ["vietnamese", "english"]
}
session.ready
{
  "type": "session.ready",
  "sessionId": "tts_...",
  "model": "valsea-tts-realtime"
}
Audio events
The server sends speech.started, then binary audio WebSocket frames, then speech.finished.
{ "type": "speech.started", "sessionId": "tts_..." }
{ "type": "speech.finished", "sessionId": "tts_..." }
Example
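const WebSocket = require('ws');
const fs = require('fs');

const ws = new WebSocket('wss://api.valsea.ai/v1/realtime/tts', {
  headers: { Authorization: 'Bearer YOUR_API_KEY' },
});

// Binary audio frames received between speech.started and speech.finished.
const chunks = [];

// Surface handshake and network failures instead of crashing on an
// unhandled 'error' event.
ws.on('error', (err) => {
  console.error('WebSocket error:', err.message);
});

ws.on('message', (data, isBinary) => {
  // Binary frames carry audio; text frames carry JSON control messages.
  if (isBinary) {
    chunks.push(Buffer.from(data));
    return;
  }

  const message = JSON.parse(data.toString());

  // Step 1: the server greets us, so configure the session.
  if (message.type === 'session.created') {
    ws.send(
      JSON.stringify({
        type: 'session.start',
        model: 'valsea-tts-realtime',
        language: 'vietnamese',
        voice: 'valsea-neutral',
        response_format: 'mp3',
      }),
    );
  }

  // Step 2: the session is ready, so submit the text to synthesize.
  if (message.type === 'session.ready') {
    ws.send(
      JSON.stringify({
        type: 'speech.create',
        input: 'Xin chao, day la giong noi Valsea.',
      }),
    );
  }

  // Step 3: all audio has arrived, so write the file and end the session.
  if (message.type === 'speech.finished') {
    fs.writeFileSync('speech.mp3', Buffer.concat(chunks));
    ws.send(JSON.stringify({ type: 'session.stop' }));
  }
});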
Billing
Realtime TTS is billed by generated audio duration, rounded up to the next whole minute.
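To make the rounding concrete, the sketch below converts a duration in seconds to billable minutes. `billableMinutes` is a hypothetical helper. This page does not state whether rounding applies per request, per session, or to an aggregated total, so the function only shows the arithmetic for a single duration.

```javascript
// Hypothetical helper: round a generated-audio duration up to whole minutes.
function billableMinutes(audioSeconds) {
  if (!Number.isFinite(audioSeconds) || audioSeconds < 0) {
    throw new RangeError('audioSeconds must be a non-negative finite number');
  }
  return Math.ceil(audioSeconds / 60);
}

console.log(billableMinutes(60)); // 1
console.log(billableMinutes(61)); // 2 (one second over rounds up a full minute)
```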