# StableVoice API

Base URL: `https://stablevoice.dev`

Pay-per-request text-to-speech. Reserve a StableUpload output slot, call `POST /api/speech`, then poll `GET /api/jobs/{jobId}`.

## Models

Default to `voxcpm2` for custom voice cloning. Of the models below, it has the best fidelity, multilingual coverage, and long-form handling, and it outputs 48kHz audio.

| model | best for | license | notes |
|---|---|---|---|
| `voxcpm2` | production custom clones | Apache-2.0 | Use `cloneMode: "ultimate"` + exact `referenceText` for best similarity. |
| `qwen3-tts-1.7b` | experimental clone evals | Apache-2.0 | Compare against `voxcpm2`; try exact `referenceText` and `xVectorOnlyMode`. |
| `f5-tts` | cheap/fast English clones | MIT | Lower fidelity; reference is clipped to 12s. |
| `chatterbox-turbo` | bundled-voice TTS | MIT | Best default for short catalog-voice utterances and paralinguistic tags. |
| `chatterbox` | expressive bundled TTS | MIT | English, CFG/exaggeration controls. |
| `chatterbox-multilingual` | multilingual bundled TTS | MIT | 23 languages. |

Hard rules:
- Exact transcript available: pass `referenceText`; for Vox use `options.cloneMode: "ultimate"`.
- Transcript uncertain: omit `referenceText`; for Qwen also compare `options.xVectorOnlyMode: true`.
- Cadence steering: use only a short Vox `options.stylePrompt`. The model may read long style instructions aloud as part of the output.
- Designed voice without a reference: use Vox `options.cloneMode: "voice-design"` + `options.voiceDescription`; do not send `referenceAudioUrl`.
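The rules above can be sketched as a small helper that picks the clone-related fields for a request. This is only an illustration: `build_clone_fields` and its parameters are hypothetical names, and the field names are the ones documented in this guide. It does not set `xVectorOnlyMode`, since the Qwen rule asks you to compare results with and without it rather than always enable it.

```python
from typing import Optional


def build_clone_fields(
    model: str,
    reference_audio_url: Optional[str] = None,
    reference_text: Optional[str] = None,
    voice_description: Optional[str] = None,
) -> dict:
    """Return the clone-related request fields implied by the hard rules."""
    fields: dict = {}
    options: dict = {}

    if voice_description is not None:
        # Designed voice: Vox only, and no reference audio may be sent.
        if model != "voxcpm2":
            raise ValueError("voice-design is documented for voxcpm2 only")
        if reference_audio_url is not None:
            raise ValueError("do not send referenceAudioUrl with voice-design")
        options["cloneMode"] = "voice-design"
        options["voiceDescription"] = voice_description
    else:
        if reference_audio_url is not None:
            fields["referenceAudioUrl"] = reference_audio_url
        if reference_text:
            # Exact transcript available: pass it through.
            fields["referenceText"] = reference_text
            if model == "voxcpm2":
                options["cloneMode"] = "ultimate"
        # Transcript uncertain: referenceText is simply omitted.

    if options:
        fields["options"] = options
    return fields
```

Merge the returned dict into the `POST /api/speech` body alongside `text`, `model`, and `output`.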

Call `GET /api/voices` for the full model catalog, bundled voice guide, supported languages, cloning notes, and pricing.

## Workflow

```
1. Optional: GET stablevoice.dev/api/voices
2. POST stableupload.dev/api/upload       # reserve output slot, filename matches format
3. POST stablevoice.dev/api/speech        # paid job
4. GET  stablevoice.dev/api/jobs/{jobId}  # SIWX poll every 2-5s
```
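Step 4 can be sketched as a polling loop. The job endpoint reports the finished audio at `result.outputs.audio.publicUrl`, so this sketch treats the presence of that value as completion. `fetch_job` is a hypothetical caller-supplied function that performs the authenticated `GET /api/jobs/{jobId}` and returns the parsed JSON; SIWX signing and failure states are not covered here, so the loop relies on a timeout.

```python
import time
from typing import Any, Callable, Optional


def audio_url(job: dict) -> Optional[str]:
    """Return result.outputs.audio.publicUrl if present, else None."""
    node: Any = job
    for key in ("result", "outputs", "audio", "publicUrl"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def poll_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 3.0,   # within the documented 2-5s range
    timeout_s: float = 300.0,  # assumed upper bound, not from the API
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job exposes an audio URL or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        url = audio_url(fetch_job(job_id))
        if url is not None:
            return url
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete in time")
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP client and makes it easy to exercise with canned responses.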

## Endpoints

- `POST /api/speech` — paid TTS job. Body: `text` (1-2500), `model`, `voice`, `language`, `format`, `output`, optional `referenceAudioUrl`, optional `referenceText`, `options`, `clientRequestId`.
- `GET /api/jobs/{jobId}` — SIWX status. When complete, read `result.outputs.audio.publicUrl`.
- `GET /api/voices` — SIWX model/voice catalog and decision guidance.
- `GET /api/voice-samples` — SIWX bundled voice MP3 previews.
- `GET /api/jobs?cursor=...&limit=50` and `DELETE /api/jobs/{jobId}` — SIWX job history.
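As a rough sketch, a `POST /api/speech` body could be assembled and checked locally before paying for the request. `build_speech_body` is a hypothetical helper; it assumes the `text` limit of 1-2500 counts characters, and it passes `output` through unchanged because its exact shape is not spelled out in this section.

```python
from typing import Any, Optional


def build_speech_body(
    text: str,
    model: str,
    output: Any,
    voice: Optional[str] = None,
    language: Optional[str] = None,
    fmt: Optional[str] = None,
    client_request_id: Optional[str] = None,
) -> dict:
    """Build a speech request body, omitting fields that were not given."""
    # Assumption: the documented 1-2500 limit is measured in characters.
    if not 1 <= len(text) <= 2500:
        raise ValueError("text must be 1-2500 characters")

    body = {"text": text, "model": model, "output": output}
    optional = {
        "voice": voice,
        "language": language,
        "format": fmt,
        "clientRequestId": client_request_id,
    }
    # Only include optional fields that the caller actually set.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```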

## Voice cloning

Use `referenceAudioUrl` for cloning. It must be an `https://f.stableupload.dev/...` URL. Reference duration:

- `voxcpm2` and `qwen3-tts-1.7b`: 3-10s, recommended 8s.
- `f5-tts`: 10-15s, recommended 12s.
- `chatterbox-*`: 5-15s, recommended 10s.
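A local pre-flight check for these constraints might look like the sketch below. `check_reference` and `reference_limits` are hypothetical names, and the code assumes that plain `chatterbox` shares the `chatterbox-*` limits; the service may enforce these differently.

```python
from typing import List, Tuple

# (min seconds, max seconds, recommended seconds), from the list above.
REFERENCE_SECONDS = {
    "voxcpm2": (3, 10, 8),
    "qwen3-tts-1.7b": (3, 10, 8),
    "f5-tts": (10, 15, 12),
}
CHATTERBOX_SECONDS = (5, 15, 10)
REFERENCE_URL_PREFIX = "https://f.stableupload.dev/"


def reference_limits(model: str) -> Tuple[int, int, int]:
    """Look up the reference duration limits for a model."""
    # Assumption: plain "chatterbox" follows the chatterbox-* limits.
    if model == "chatterbox" or model.startswith("chatterbox-"):
        return CHATTERBOX_SECONDS
    return REFERENCE_SECONDS[model]


def check_reference(model: str, url: str, duration_s: float) -> List[str]:
    """Return a list of problems; an empty list means the clip looks usable."""
    problems = []
    if not url.startswith(REFERENCE_URL_PREFIX):
        problems.append("referenceAudioUrl must be a f.stableupload.dev URL")
    low, high, recommended = reference_limits(model)
    if not low <= duration_s <= high:
        problems.append(
            f"{model} expects {low}-{high}s of reference audio "
            f"(recommended {recommended}s), got {duration_s}s"
        )
    return problems
```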

Cloning workflow:

```
1. POST stableupload.dev/api/upload        # reserve slot for the reference clip
2. PUT or POST the audio to that slot      # wav/mp3/m4a, mono or stereo, any sample rate
3. POST stableupload.dev/api/upload        # reserve slot for generated speech
4. POST stablevoice.dev/api/speech         # set referenceAudioUrl + output to the two publicUrls
5. GET  stablevoice.dev/api/jobs/{jobId}   # SIWX poll
```
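The wiring between the five steps can be sketched with the network calls injected as functions, so the example stays independent of any HTTP client or payment flow. Everything here is illustrative: `clone_speech` and the four callables are hypothetical, each slot is assumed to be a dict with a `publicUrl` key, `submit_speech` is assumed to return the job ID, and the output filename is assumed to match `format` by its extension.

```python
from pathlib import Path
from typing import Callable


def clone_speech(
    text: str,
    reference_path: str,
    transcript: str,
    fmt: str,
    *,
    reserve_slot: Callable[[str], dict],
    upload_file: Callable[[dict, str], None],
    submit_speech: Callable[[dict], str],
    wait_for_audio: Callable[[str], str],
) -> str:
    """Run the cloning workflow and return the generated audio URL."""
    # Steps 1-2: reserve a slot for the reference clip and upload it.
    ref_slot = reserve_slot(Path(reference_path).name)
    upload_file(ref_slot, reference_path)

    # Step 3: reserve the output slot; the extension follows the format.
    out_slot = reserve_slot(f"speech.{fmt}")

    # Step 4: submit the paid job, pointing at both public URLs.
    body = {
        "model": "voxcpm2",
        "text": text,
        "format": fmt,
        "referenceAudioUrl": ref_slot["publicUrl"],
        "referenceText": transcript,
        "output": out_slot["publicUrl"],
        "options": {"cloneMode": "ultimate"},
    }
    job_id = submit_speech(body)

    # Step 5: poll until the job reports its audio URL.
    return wait_for_audio(job_id)
```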

Best clone request:

```json
{
  "model": "voxcpm2",
  "text": "Your output text here.",
  "referenceAudioUrl": "https://f.stableupload.dev/.../voice.wav",
  "referenceText": "Exact transcript of the reference clip.",
  "options": {
    "cloneMode": "ultimate"
  }
}
```

Qwen comparison request:

```json
{ "model": "qwen3-tts-1.7b", "text": "Your output text here.", "referenceAudioUrl": "https://f.stableupload.dev/.../voice.wav", "options": { "xVectorOnlyMode": true } }
```

## Capture flow — record from a human without a wallet

Use this when the recordee does not have a wallet:

1. Reserve a StableUpload slot: filename `.mp3`, `contentType: "audio/mpeg"`, `policyTtlSeconds: 3600`.
2. `POST /api/recording-tokens` ($0.01) with that slot's `uploadUrl` (or its `postUrl` and `postFields`), its `publicUrl`, `expiresAt` set to the slot's `uploadUrlExpiresAt`, and an optional `suggestedText` or speaker label.
3. Send `recordUrl` to the person. They record up to 60s; StableVoice transcodes and uploads it to your slot.
4. Poll `GET /api/recording-tokens/{token}` until complete, then use the returned `publicUrl` as `referenceAudioUrl`.
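Steps 1 and 2 can be sketched as two payload builders. Both function names are hypothetical, the `filename` key and the example filename are assumptions about the StableUpload request, and the slot is assumed to be a dict carrying the field names mentioned above. The speaker label is left out because its field name is not given here.

```python
from typing import Optional


def capture_slot_request(filename: str = "recording.mp3") -> dict:
    """Upload-slot request for a captured recording (step 1)."""
    return {
        "filename": filename,  # assumed key name; must end in .mp3
        "contentType": "audio/mpeg",
        "policyTtlSeconds": 3600,
    }


def recording_token_body(slot: dict, suggested_text: Optional[str] = None) -> dict:
    """Body for POST /api/recording-tokens built from a reserved slot (step 2)."""
    body = {
        "publicUrl": slot["publicUrl"],
        "expiresAt": slot["uploadUrlExpiresAt"],
    }
    # A slot offers either a direct upload URL or a form-style POST target.
    if "uploadUrl" in slot:
        body["uploadUrl"] = slot["uploadUrl"]
    elif "postUrl" in slot and "postFields" in slot:
        body["postUrl"] = slot["postUrl"]
        body["postFields"] = slot["postFields"]
    else:
        raise ValueError("slot needs uploadUrl or postUrl + postFields")

    if suggested_text:
        body["suggestedText"] = suggested_text
    return body
```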

Minimum speech price: $0.02.