# StableVoice API

Base URL: `https://stablevoice.dev`

Pay-per-request text-to-speech. Reserve a StableUpload output slot, call `POST /api/speech`, then poll `GET /api/jobs/{jobId}`.

## Models

Default to `voxcpm2` for custom voice cloning. It has the best fidelity, multilingual coverage, 48kHz output, and long-form handling.

| model | best for | license | notes |
|---|---|---|---|
| `voxcpm2` | production custom clones | Apache-2.0 | Use `cloneMode: "ultimate"` + exact `referenceText` for best similarity. |
| `qwen3-tts-1.7b` | experimental clone evals | Apache-2.0 | Compare against Vox; try exact `referenceText` and `xVectorOnlyMode`. |
| `f5-tts` | cheap/fast English clones | MIT | Lower fidelity; reference is clipped to 12s. |
| `chatterbox-turbo` | bundled-voice TTS | MIT | Best default for short catalog-voice utterances and paralinguistic tags. |
| `chatterbox` | expressive bundled TTS | MIT | English, CFG/exaggeration controls. |
| `chatterbox-multilingual` | multilingual bundled TTS | MIT | 23 languages. |

Hard rules:

- Exact transcript available: pass `referenceText`; for Vox use `options.cloneMode: "ultimate"`.
- Transcript uncertain: omit `referenceText`; for Qwen also compare `options.xVectorOnlyMode: true`.
- Cadence steering: use short Vox `options.stylePrompt` only. Long style instructions can be spoken by the model.
- Designed voice without a reference: use Vox `options.cloneMode: "voice-design"` + `options.voiceDescription`; do not send `referenceAudioUrl`.

Call `GET /api/voices` for the full model catalog, bundled voice guide, supported languages, cloning notes, and pricing.

## Workflow

```
1. Optional: GET stablevoice.dev/api/voices
2. POST stableupload.dev/api/upload        # reserve output slot, filename matches format
3. POST stablevoice.dev/api/speech         # paid job
4. GET stablevoice.dev/api/jobs/{jobId}    # SIWX poll every 2-5s
```

## Endpoints

- `POST /api/speech`: paid TTS job.
  Body: `text` (1-2500), `model`, `voice`, `language`, `format`, `output`, optional `referenceAudioUrl`, optional `referenceText`, `options`, `clientRequestId`.
- `GET /api/jobs/{jobId}`: SIWX status. When complete, read `result.outputs.audio.publicUrl`.
- `GET /api/voices`: SIWX model/voice catalog and decision guidance.
- `GET /api/voice-samples`: SIWX bundled voice MP3 previews.
- `GET /api/jobs?cursor=...&limit=50` and `DELETE /api/jobs/{jobId}`: SIWX job history.

## Voice cloning

Use `referenceAudioUrl` for cloning. It must be a `https://f.stableupload.dev/...` URL.

Reference duration:

- `voxcpm2` and `qwen3-tts-1.7b`: 3-10s, recommended 8s.
- `f5-tts`: 10-15s, recommended 12s.
- `chatterbox-*`: 5-15s, recommended 10s.

Cloning workflow:

```
1. POST stableupload.dev /api/upload     # reserve slot for the reference clip
2. PUT or POST the audio to that slot    # wav/mp3/m4a, mono or stereo, any sample rate
3. POST stableupload.dev /api/upload     # reserve slot for generated speech
4. POST stablevoice.dev  /api/speech     # set referenceAudioUrl + output to the two publicUrls
5. GET  stablevoice.dev  /api/jobs/{id}  # SIWX poll
```

Best clone request:

```json
{
  "model": "voxcpm2",
  "text": "Your output text here.",
  "referenceAudioUrl": "https://f.stableupload.dev/.../voice.wav",
  "referenceText": "Exact transcript of the reference clip.",
  "options": { "cloneMode": "ultimate" }
}
```

Qwen comparison request:

```json
{
  "model": "qwen3-tts-1.7b",
  "referenceAudioUrl": "https://f.stableupload.dev/.../voice.wav",
  "options": { "xVectorOnlyMode": true }
}
```

## Capture flow: record from a human without a wallet

Use this when the recordee does not have a wallet:

1. Reserve a StableUpload slot: filename `.mp3`, `contentType: "audio/mpeg"`, `policyTtlSeconds: 3600`.
2. `POST /api/recording-tokens` ($0.01) with that slot's `uploadUrl` or `postUrl/postFields`, `publicUrl`, `expiresAt=uploadUrlExpiresAt`, and optional `suggestedText`/speaker label.
3. Send `recordUrl` to the person.
   They record up to 60s; StableVoice transcodes and uploads it to your slot.
4. Poll `GET /api/recording-tokens/{token}` until complete, then use the returned `publicUrl` as `referenceAudioUrl`.

Minimum speech price: $0.02.
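
## Example: building a clone request and polling (sketch)

The hard rules under Models and the polling step under Workflow can be sketched in Python as below. This is a minimal illustration, not an official client. The function names `build_clone_request` and `poll_for_audio_url` are hypothetical, the `text` limit is assumed to count characters, `output` is assumed to take the reserved slot's `publicUrl` (as the cloning workflow suggests), and `fetch_job` is a caller-supplied function that performs the SIWX-authenticated `GET /api/jobs/{jobId}` and returns the parsed JSON. Payment and SIWX handling are out of scope here.

```python
import time
from typing import Any, Callable, Dict, Optional

REFERENCE_URL_PREFIX = "https://f.stableupload.dev/"
MAX_TEXT_LENGTH = 2500  # assumption: the 1-2500 limit counts characters


def build_clone_request(
    text: str,
    reference_audio_url: str,
    output_url: str,
    reference_text: Optional[str] = None,
    model: str = "voxcpm2",
) -> Dict[str, Any]:
    """Assemble a POST /api/speech body that follows the documented hard rules."""
    if not 1 <= len(text) <= MAX_TEXT_LENGTH:
        raise ValueError("text must be 1-2500 characters")
    if not reference_audio_url.startswith(REFERENCE_URL_PREFIX):
        raise ValueError("referenceAudioUrl must be a https://f.stableupload.dev/... URL")

    body: Dict[str, Any] = {
        "model": model,
        "text": text,
        "referenceAudioUrl": reference_audio_url,
        "output": output_url,  # assumption: publicUrl of the reserved output slot
    }
    options: Dict[str, Any] = {}
    if reference_text:
        # Exact transcript available: pass it; Vox additionally uses "ultimate".
        body["referenceText"] = reference_text
        if model == "voxcpm2":
            options["cloneMode"] = "ultimate"
    elif model == "qwen3-tts-1.7b":
        # Transcript uncertain: omit referenceText; Qwen compares x-vector-only mode.
        options["xVectorOnlyMode"] = True
    if options:
        body["options"] = options
    return body


def poll_for_audio_url(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_seconds: float = 2.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until result.outputs.audio.publicUrl appears, or give up and return None."""
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        outputs = (job.get("result") or {}).get("outputs") or {}
        url = (outputs.get("audio") or {}).get("publicUrl")
        if url:
            return url
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # the workflow suggests polling every 2-5s
    return None
```

The poller only looks for the documented `result.outputs.audio.publicUrl` field and does not assume any particular status or error field. A real client would likely also inspect whatever failure indicator the job response carries.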