{"openapi":"3.1.0","info":{"title":"StableVoice","description":"Pay-per-request text-to-speech on Modal. StableVoice serves the Chatterbox, F5-TTS, VoxCPM2, and Qwen3-TTS open models with bundled voices, optional custom voice references, mp3/wav output, and StableUpload storage slots.","version":"0.1.0","x-guidance":"# StableVoice API\n\nBase URL: `https://stablevoice.dev`\n\nPay-per-request text-to-speech. Reserve a StableUpload output slot, call `POST /api/speech`, then poll `GET /api/jobs/{jobId}`.\n\n## Models\n\nDefault to `voxcpm2` for custom voice cloning. It has the best fidelity, multilingual coverage, 48kHz output, and long-form handling.\n\n| model | best for | license | notes |\n|---|---|---|---|\n| `voxcpm2` | production custom clones | Apache-2.0 | Use `cloneMode: \"ultimate\"` + exact `referenceText` for best similarity. |\n| `qwen3-tts-1.7b` | experimental clone evals | Apache-2.0 | Compare against Vox; try exact `referenceText` and `xVectorOnlyMode`. |\n| `f5-tts` | cheap/fast English clones | MIT | Lower fidelity; reference is clipped to 12s. |\n| `chatterbox-turbo` | bundled-voice TTS | MIT | Best default for short catalog-voice utterances and paralinguistic tags. |\n| `chatterbox` | expressive bundled TTS | MIT | English, CFG/exaggeration controls. |\n| `chatterbox-multilingual` | multilingual bundled TTS | MIT | 23 languages. |\n\nHard rules:\n- Exact transcript available: pass `referenceText`; for Vox use `options.cloneMode: \"ultimate\"`.\n- Transcript uncertain: omit `referenceText`; for Qwen also compare `options.xVectorOnlyMode: true`.\n- Cadence steering: use short Vox `options.stylePrompt` only. 
Long style instructions can be spoken by the model.\n- Designed voice without a reference: use Vox `options.cloneMode: \"voice-design\"` + `options.voiceDescription`; do not send `referenceAudioUrl`.\n\nCall `GET /api/voices` for the full model catalog, bundled voice guide, supported languages, cloning notes, and pricing.\n\n## Workflow\n\n```\n1. Optional: GET stablevoice.dev/api/voices\n2. POST stableupload.dev/api/upload       # reserve output slot, filename matches format\n3. POST stablevoice.dev/api/speech        # paid job\n4. GET  stablevoice.dev/api/jobs/{jobId}  # SIWX poll every 2-5s\n```\n\n## Endpoints\n\n- `POST /api/speech` — paid TTS job. Body: `text` (1-2500), `model`, `voice`, `language`, `format`, `output`, optional `referenceAudioUrl`, optional `referenceText`, `options`, `clientRequestId`.\n- `GET /api/jobs/{jobId}` — SIWX status. When complete, read `result.outputs.audio.publicUrl`.\n- `GET /api/voices` — SIWX model/voice catalog and decision guidance.\n- `GET /api/voice-samples` — SIWX bundled voice MP3 previews.\n- `GET /api/jobs?cursor=...&limit=50` and `DELETE /api/jobs/{jobId}` — SIWX job history.\n\n## Voice cloning\n\nUse `referenceAudioUrl` for cloning. It must be a `https://f.stableupload.dev/...` URL. Reference duration:\n\n- `voxcpm2` and `qwen3-tts-1.7b`: 3-10s, recommended 8s.\n- `f5-tts`: 10-15s, recommended 12s.\n- `chatterbox-*`: 5-15s, recommended 10s.\n\nCloning workflow:\n\n```\n1. POST stableupload.dev /api/upload   # reserve slot for the reference clip\n2. PUT or POST the audio to that slot   # wav/mp3/m4a, mono or stereo, any sample rate\n3. POST stableupload.dev /api/upload   # reserve slot for generated speech\n4. POST stablevoice.dev /api/speech    # set referenceAudioUrl + output to the two publicUrls\n5. 
GET  stablevoice.dev /api/jobs/{id} # SIWX poll\n```\n\nBest clone request:\n\n```json\n{\n  \"model\": \"voxcpm2\",\n  \"text\": \"Your output text here.\",\n  \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\",\n  \"referenceText\": \"Exact transcript of the reference clip.\",\n  \"options\": {\n    \"cloneMode\": \"ultimate\"\n  }\n}\n```\n\nQwen comparison request:\n\n```json\n{ \"model\": \"qwen3-tts-1.7b\", \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\", \"options\": { \"xVectorOnlyMode\": true } }\n```\n\n## Capture flow — record from a human without a wallet\n\nUse this when the recordee does not have a wallet:\n\n1. Reserve a StableUpload slot: filename `.mp3`, `contentType: \"audio/mpeg\"`, `policyTtlSeconds: 3600`.\n2. `POST /api/recording-tokens` ($0.01) with that slot's `uploadUrl` or `postUrl/postFields`, `publicUrl`, `expiresAt=uploadUrlExpiresAt`, and optional `suggestedText`/speaker label.\n3. Send `recordUrl` to the person. They record up to 60s; StableVoice transcodes and uploads it to your slot.\n4. Poll `GET /api/recording-tokens/{token}` until complete, then use the returned `publicUrl` as `referenceAudioUrl`.\n\nMinimum speech price: $0.02.\n","guidance":"# StableVoice API\n\nBase URL: `https://stablevoice.dev`\n\nPay-per-request text-to-speech. Reserve a StableUpload output slot, call `POST /api/speech`, then poll `GET /api/jobs/{jobId}`.\n\n## Models\n\nDefault to `voxcpm2` for custom voice cloning. It has the best fidelity, multilingual coverage, 48kHz output, and long-form handling.\n\n| model | best for | license | notes |\n|---|---|---|---|\n| `voxcpm2` | production custom clones | Apache-2.0 | Use `cloneMode: \"ultimate\"` + exact `referenceText` for best similarity. |\n| `qwen3-tts-1.7b` | experimental clone evals | Apache-2.0 | Compare against Vox; try exact `referenceText` and `xVectorOnlyMode`. |\n| `f5-tts` | cheap/fast English clones | MIT | Lower fidelity; reference is clipped to 12s. 
|\n| `chatterbox-turbo` | bundled-voice TTS | MIT | Best default for short catalog-voice utterances and paralinguistic tags. |\n| `chatterbox` | expressive bundled TTS | MIT | English, CFG/exaggeration controls. |\n| `chatterbox-multilingual` | multilingual bundled TTS | MIT | 23 languages. |\n\nHard rules:\n- Exact transcript available: pass `referenceText`; for Vox use `options.cloneMode: \"ultimate\"`.\n- Transcript uncertain: omit `referenceText`; for Qwen also compare `options.xVectorOnlyMode: true`.\n- Cadence steering: use short Vox `options.stylePrompt` only. Long style instructions can be spoken by the model.\n- Designed voice without a reference: use Vox `options.cloneMode: \"voice-design\"` + `options.voiceDescription`; do not send `referenceAudioUrl`.\n\nCall `GET /api/voices` for the full model catalog, bundled voice guide, supported languages, cloning notes, and pricing.\n\n## Workflow\n\n```\n1. Optional: GET stablevoice.dev/api/voices\n2. POST stableupload.dev/api/upload       # reserve output slot, filename matches format\n3. POST stablevoice.dev/api/speech        # paid job\n4. GET  stablevoice.dev/api/jobs/{jobId}  # SIWX poll every 2-5s\n```\n\n## Endpoints\n\n- `POST /api/speech` — paid TTS job. Body: `text` (1-2500), `model`, `voice`, `language`, `format`, `output`, optional `referenceAudioUrl`, optional `referenceText`, `options`, `clientRequestId`.\n- `GET /api/jobs/{jobId}` — SIWX status. When complete, read `result.outputs.audio.publicUrl`.\n- `GET /api/voices` — SIWX model/voice catalog and decision guidance.\n- `GET /api/voice-samples` — SIWX bundled voice MP3 previews.\n- `GET /api/jobs?cursor=...&limit=50` and `DELETE /api/jobs/{jobId}` — SIWX job history.\n\n## Voice cloning\n\nUse `referenceAudioUrl` for cloning. It must be a `https://f.stableupload.dev/...` URL. 
Reference duration:\n\n- `voxcpm2` and `qwen3-tts-1.7b`: 3-10s, recommended 8s.\n- `f5-tts`: 10-15s, recommended 12s.\n- `chatterbox-*`: 5-15s, recommended 10s.\n\nCloning workflow:\n\n```\n1. POST stableupload.dev /api/upload   # reserve slot for the reference clip\n2. PUT or POST the audio to that slot   # wav/mp3/m4a, mono or stereo, any sample rate\n3. POST stableupload.dev /api/upload   # reserve slot for generated speech\n4. POST stablevoice.dev /api/speech    # set referenceAudioUrl + output to the two publicUrls\n5. GET  stablevoice.dev /api/jobs/{id} # SIWX poll\n```\n\nBest clone request:\n\n```json\n{\n  \"model\": \"voxcpm2\",\n  \"text\": \"Your output text here.\",\n  \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\",\n  \"referenceText\": \"Exact transcript of the reference clip.\",\n  \"options\": {\n    \"cloneMode\": \"ultimate\"\n  }\n}\n```\n\nQwen comparison request:\n\n```json\n{ \"model\": \"qwen3-tts-1.7b\", \"referenceAudioUrl\": \"https://f.stableupload.dev/.../voice.wav\", \"options\": { \"xVectorOnlyMode\": true } }\n```\n\n## Capture flow — record from a human without a wallet\n\nUse this when the recordee does not have a wallet:\n\n1. Reserve a StableUpload slot: filename `.mp3`, `contentType: \"audio/mpeg\"`, `policyTtlSeconds: 3600`.\n2. `POST /api/recording-tokens` ($0.01) with that slot's `uploadUrl` or `postUrl/postFields`, `publicUrl`, `expiresAt=uploadUrlExpiresAt`, and optional `suggestedText`/speaker label.\n3. Send `recordUrl` to the person. They record up to 60s; StableVoice transcodes and uploads it to your slot.\n4. 
Poll `GET /api/recording-tokens/{token}` until complete, then use the returned `publicUrl` as `referenceAudioUrl`.\n\nMinimum speech price: $0.02.\n","contact":{"name":"Merit Systems","url":"https://stablevoice.dev"}},"servers":[{"url":"https://stablevoice.dev"}],"tags":[{"name":"Jobs"},{"name":"Recording Tokens"},{"name":"Speech"},{"name":"Voice Samples"},{"name":"Voices"}],"paths":{"/api/recording-tokens":{"post":{"operationId":"recording-tokens","summary":"Mint a browser recording URL for a StableUpload slot. The recordee speaks optional suggestedText; StableVoice transcodes the audio to MP3 and uploads it to the slot. Use the slot publicUrl as /api/speech referenceAudioUrl. Pass uploadUrlExpiresAt as expiresAt so the recording URL cannot outlive the upload.","tags":["Recording Tokens"],"x-payment-info":{"price":{"mode":"fixed","currency":"USD","amount":"0.01"},"protocols":[{"x402":{}},{"mpp":{"method":"tempo","intent":"charge","currency":"0x20c000000000000000000000b9537d11c60e8b50"}}]},"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri","description":"StableUpload public URL where the resulting MP3 will live. Must end with .mp3."},"uploadUrl":{"description":"Presigned PUT URL from StableUpload. Provide either uploadUrl or postUrl+postFields.","type":"string","format":"uri"},"postUrl":{"description":"Presigned POST URL from StableUpload. Pair with postFields.","type":"string","format":"uri"},"postFields":{"description":"Required form fields when posting to postUrl.","type":"object","propertyNames":{"type":"string"},"additionalProperties":{"type":"string"}},"suggestedText":{"description":"Optional script the recordee will see on the recording page (max 500 chars). 
Improves clone quality with consistent prosody.","type":"string","minLength":1,"maxLength":500},"speakerLabel":{"description":"Optional name shown on the recording page (\"Recording for: …\").","type":"string","minLength":1,"maxLength":80},"expiresAt":{"type":"string","format":"date-time","pattern":"^(?:(?:\\d\\d[2468][048]|\\d\\d[13579][26]|\\d\\d0[48]|[02468][048]00|[13579][26]00)-02-29|\\d{4}-(?:(?:0[13578]|1[02])-(?:0[1-9]|[12]\\d|3[01])|(?:0[469]|11)-(?:0[1-9]|[12]\\d|30)|(?:02)-(?:0[1-9]|1\\d|2[0-8])))T(?:(?:[01]\\d|2[0-3]):[0-5]\\d(?::[0-5]\\d(?:\\.\\d+)?)?(?:Z))$","description":"Required StableUpload uploadUrlExpiresAt value. We clamp the recording-token TTL to this value. Do not pass StableUpload's public-file retention expiresAt."},"requestedExpiresAt":{"description":"Advanced optional epoch timestamp, seconds or milliseconds, for an explicit recording-link expiry. Default is 1 hour; max is 24 hours and still clamped to StableUpload uploadUrlExpiresAt.","anyOf":[{"type":"integer","exclusiveMinimum":0,"maximum":9007199254740991},{"type":"string","pattern":"^\\d+$"}]}},"required":["publicUrl","expiresAt"]}}}},"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"token":{"type":"string"},"recordUrl":{"type":"string","format":"uri"},"expiresAt":{"type":"string","format":"date-time","pattern":"^(?:(?:\\d\\d[2468][048]|\\d\\d[13579][26]|\\d\\d0[48]|[02468][048]00|[13579][26]00)-02-29|\\d{4}-(?:(?:0[13578]|1[02])-(?:0[1-9]|[12]\\d|3[01])|(?:0[469]|11)-(?:0[1-9]|[12]\\d|30)|(?:02)-(?:0[1-9]|1\\d|2[0-8])))T(?:(?:[01]\\d|2[0-3]):[0-5]\\d(?::[0-5]\\d(?:\\.\\d+)?)?(?:Z))$"},"publicUrl":{"type":"string","format":"uri"}},"required":["token","recordUrl","expiresAt","publicUrl"],"additionalProperties":false}}}},"402":{"description":"Payment Required"}}}},"/api/speech":{"post":{"operationId":"speech","summary":"Create a paid StableVoice TTS job. 
Reserve an output slot on stableupload.dev first, then pass that slot as output. Poll /api/jobs/{jobId} until result.outputs.audio.publicUrl is ready.","tags":["Speech"],"x-payment-info":{"price":{"mode":"dynamic","currency":"USD","min":"0.02","max":"1.00"},"protocols":[{"x402":{}},{"mpp":{"method":"tempo","intent":"charge","currency":"0x20c000000000000000000000b9537d11c60e8b50"}}]},"requestBody":{"required":true,"content":{"application/json":{"schema":{"type":"object","properties":{"type":{"default":"stablevoice-speech","type":"string","const":"stablevoice-speech"},"text":{"type":"string","minLength":1,"maxLength":2500,"description":"Text to synthesize, max 2500 characters."},"model":{"default":"chatterbox-turbo","description":"Self-hosted Modal model. chatterbox-turbo is fastest; chatterbox adds expressive controls; chatterbox-multilingual supports 23 languages.","type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"voice":{"default":"Lucy","description":"Bundled reference voice. Match to the character's identity using `voiceGuide` from /api/voices — each voice carries an accent/ethnicity tag (Indian, Australian, Slavic, Black, Latina, American, older country) and a casting cue. Default Lucy is a neutral North American female. Ignored only when referenceAudioUrl is supplied.","type":"string","enum":["Aaron","Abigail","Anaya","Andy","Archer","Brian","Chloe","Dylan","Emmanuel","Ethan","Evelyn","Gavin","Gordon","Ivan","Laura","Lucy","Madison","Marisol","Meera","Walter"]},"language":{"default":"en","description":"Language ID. Required for chatterbox-multilingual.","type":"string","enum":["ar","da","de","el","en","es","fi","fr","he","hi","it","ja","ko","ms","nl","no","pl","pt","ru","sv","sw","tr","zh"]},"referenceAudioUrl":{"description":"Optional StableUpload URL for a custom voice reference. WAV/MP3/M4A. 
Required reference duration depends on the model: voxcpm2 and qwen3-tts-1.7b take 3-10s, f5-tts takes 10-15s, chatterbox-* takes 5-15s.","type":"string","format":"uri"},"referenceText":{"description":"Optional exact transcript of referenceAudioUrl. Used by f5-tts, voxcpm2, and qwen3-tts-1.7b to skip on-worker Whisper transcription. Omit it if uncertain. Ignored by chatterbox-*.","type":"string","minLength":1,"maxLength":1000},"format":{"default":"wav","type":"string","enum":["wav","mp3"]},"output":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri","description":"StableUpload public URL where the generated audio will land."},"uploadUrl":{"description":"Presigned PUT URL from StableUpload. Do not pass a POST slot's bare S3 postUrl here.","type":"string","format":"uri"},"postUrl":{"description":"Presigned POST URL from StableUpload. Pair with postFields.","type":"string","format":"uri"},"postFields":{"description":"Required form fields when posting to postUrl.","type":"object","propertyNames":{"type":"string"},"additionalProperties":{"type":"string"}}},"required":["publicUrl"],"description":"Reserved StableUpload output slot for the generated 
audio."},"options":{"type":"object","properties":{"temperature":{"default":0.8,"type":"number","minimum":0.05,"maximum":2},"topP":{"default":0.95,"type":"number","minimum":0.05,"maximum":1},"topK":{"default":1000,"type":"integer","minimum":1,"maximum":1000},"minP":{"default":0.05,"type":"number","minimum":0,"maximum":0.5},"repetitionPenalty":{"default":1.2,"type":"number","minimum":1,"maximum":3},"exaggeration":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"cfgWeight":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"normalizeReferenceLoudness":{"default":true,"type":"boolean"},"nfeStep":{"type":"integer","minimum":8,"maximum":64},"cfgStrength":{"type":"number","minimum":0,"maximum":5},"speed":{"type":"number","minimum":0.5,"maximum":2},"cfgValue":{"type":"number","minimum":0,"maximum":5},"inferenceTimesteps":{"type":"integer","minimum":4,"maximum":32},"cloneMode":{"type":"string","enum":["standard","controllable","ultimate","voice-design"]},"stylePrompt":{"description":"VoxCPM2 style instruction, wrapped into the generated text as `(instruction)...`.","type":"string","minLength":1,"maxLength":200},"voiceDescription":{"description":"VoxCPM2 voice-design description. 
Use without referenceAudioUrl.","type":"string","minLength":1,"maxLength":200},"xVectorOnlyMode":{"description":"Qwen3-TTS clone mode that skips reference transcript use; faster but lower similarity.","type":"boolean"}}},"clientRequestId":{"type":"string","minLength":1,"maxLength":128}},"required":["text","output"]}}}},"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"jobId":{"type":"string"},"status":{"type":"string","enum":["pending","queued","processing"]},"type":{"type":"string","const":"stablevoice-speech"},"price":{"type":"string"},"deduplicated":{"type":"boolean"}},"required":["jobId","status","type","price"],"additionalProperties":false}}}},"402":{"description":"Payment Required"}}}},"/api/voices":{"get":{"operationId":"voices","summary":"SIWX StableVoice model and voice catalog. Every voice in `voiceGuide` leads with an accent/ethnicity tag (Indian, Australian, Slavic, Black, Latina, American, older country, etc.) 
and a one-line casting cue, so an agent can match a voice to a character's identity rather than only to a use case.","tags":["Voices"],"security":[{"siwx":[]}],"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"models":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"label":{"type":"string"},"license":{"type":"string"},"languages":{"type":"array","items":{"type":"string"}},"defaultLanguage":{"type":"string"},"defaultVoice":{"type":"string"},"description":{"type":"string"},"options":{"type":"array","items":{"type":"string"}},"cloning":{"type":"object","properties":{"minRefSeconds":{"type":"number"},"recommendedRefSeconds":{"type":"number"},"acceptsReferenceText":{"type":"boolean"},"notes":{"type":"string"}},"required":["minRefSeconds","recommendedRefSeconds","acceptsReferenceText","notes"],"additionalProperties":false}},"required":["id","label","license","languages","defaultLanguage","defaultVoice","description","options","cloning"],"additionalProperties":false}},"cloning":{"type":"object","properties":{"referenceUrlPrefix":{"type":"string","const":"https://f.stableupload.dev/"},"recommended":{"type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"fallback":{"type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"summary":{"type":"string"},"decisionGuide":{"type":"array","items":{"type":"string"}},"optionGuide":{"type":"object","properties":{"voxcpm2":{"type":"object","properties":{"cloneMode":{"type":"string"},"stylePrompt":{"type":"string"},"voiceDescription":{"type":"string"}},"required":["cloneMode","stylePrompt","voiceDescription"],"additionalProperties":false},"qwen3Tts17b":{"type":"object","properties":{"referenceText":{"type
":"string"},"xVectorOnlyMode":{"type":"string"}},"required":["referenceText","xVectorOnlyMode"],"additionalProperties":false}},"required":["voxcpm2","qwen3Tts17b"],"additionalProperties":false},"workflow":{"type":"array","items":{"type":"string"}},"captureFlow":{"type":"object","properties":{"endpoint":{"type":"string","const":"/api/recording-tokens"},"price":{"type":"string"},"ttlSeconds":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991},"maxRecordingSeconds":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991},"description":{"type":"string"},"workflow":{"type":"array","items":{"type":"string"}}},"required":["endpoint","price","ttlSeconds","maxRecordingSeconds","description","workflow"],"additionalProperties":false}},"required":["referenceUrlPrefix","recommended","fallback","summary","decisionGuide","optionGuide","workflow","captureFlow"],"additionalProperties":false},"voices":{"type":"array","items":{"type":"string"}},"voiceGuide":{"type":"array","items":{"type":"object","properties":{"voice":{"type":"string","enum":["Aaron","Abigail","Anaya","Andy","Archer","Brian","Chloe","Dylan","Emmanuel","Ethan","Evelyn","Gavin","Gordon","Ivan","Laura","Lucy","Madison","Marisol","Meera","Walter"]},"description":{"type":"string"},"traits":{"type":"array","items":{"type":"string"}},"bestFor":{"type":"array","items":{"type":"string"}}},"required":["voice","description","traits","bestFor"],"additionalProperties":false}},"outputFormats":{"type":"array","items":{"type":"string","enum":["wav","mp3"]}},"paralinguisticTags":{"type":"array","items":{"type":"string","enum":["[laugh]","[chuckle]","[sigh]","[gasp]","[cough]"]}},"pricing":{"type":"object","properties":{"minimum":{"type":"string"},"formula":{"type":"string"}},"required":["minimum","formula"],"additionalProperties":false}},"required":["models","cloning","voices","voiceGuide","outputFormats","paralinguisticTags","pricing"],"additionalProperties":false}}}},"402":{"description":"Auth
entication Required"}}}},"/api/jobs":{"get":{"operationId":"jobs","summary":"List StableVoice jobs for the authenticated wallet","tags":["Jobs"],"security":[{"siwx":[]}],"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"items":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string","const":"stablevoice-speech"},"status":{"type":"string","enum":["pending","queued","processing","complete","failed"]},"progress":{"type":"integer","minimum":0,"maximum":100},"input":{"type":"object","properties":{"type":{"default":"stablevoice-speech","type":"string","const":"stablevoice-speech"},"text":{"type":"string","minLength":1,"maxLength":2500,"description":"Text to synthesize, max 2500 characters."},"model":{"default":"chatterbox-turbo","description":"Self-hosted Modal model. chatterbox-turbo is fastest; chatterbox adds expressive controls; chatterbox-multilingual supports 23 languages.","type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"voice":{"default":"Lucy","description":"Bundled reference voice. Match to the character's identity using `voiceGuide` from /api/voices — each voice carries an accent/ethnicity tag (Indian, Australian, Slavic, Black, Latina, American, older country) and a casting cue. Default Lucy is a neutral North American female. Ignored only when referenceAudioUrl is supplied.","type":"string","enum":["Aaron","Abigail","Anaya","Andy","Archer","Brian","Chloe","Dylan","Emmanuel","Ethan","Evelyn","Gavin","Gordon","Ivan","Laura","Lucy","Madison","Marisol","Meera","Walter"]},"language":{"default":"en","description":"Language ID. 
Required for chatterbox-multilingual.","type":"string","enum":["ar","da","de","el","en","es","fi","fr","he","hi","it","ja","ko","ms","nl","no","pl","pt","ru","sv","sw","tr","zh"]},"referenceAudioUrl":{"description":"Optional StableUpload URL for a custom voice reference. WAV/MP3/M4A. Required reference duration depends on the model: voxcpm2 and qwen3-tts-1.7b take 3-10s, f5-tts takes 10-15s, chatterbox-* takes 5-15s.","type":"string","format":"uri"},"referenceText":{"description":"Optional exact transcript of referenceAudioUrl. Used by f5-tts, voxcpm2, and qwen3-tts-1.7b to skip on-worker Whisper transcription. Omit it if uncertain. Ignored by chatterbox-*.","type":"string","minLength":1,"maxLength":1000},"format":{"default":"wav","type":"string","enum":["wav","mp3"]},"output":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri","description":"StableUpload public URL where the generated audio will land."},"uploadUrl":{"description":"Presigned PUT URL from StableUpload. Do not pass a POST slot's bare S3 postUrl here.","type":"string","format":"uri"},"postUrl":{"description":"Presigned POST URL from StableUpload. 
Pair with postFields.","type":"string","format":"uri"},"postFields":{"description":"Required form fields when posting to postUrl.","type":"object","propertyNames":{"type":"string"},"additionalProperties":{"type":"string"}}},"required":["publicUrl"],"additionalProperties":false,"description":"Reserved StableUpload output slot for the generated audio."},"options":{"type":"object","properties":{"temperature":{"default":0.8,"type":"number","minimum":0.05,"maximum":2},"topP":{"default":0.95,"type":"number","minimum":0.05,"maximum":1},"topK":{"default":1000,"type":"integer","minimum":1,"maximum":1000},"minP":{"default":0.05,"type":"number","minimum":0,"maximum":0.5},"repetitionPenalty":{"default":1.2,"type":"number","minimum":1,"maximum":3},"exaggeration":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"cfgWeight":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"normalizeReferenceLoudness":{"default":true,"type":"boolean"},"nfeStep":{"type":"integer","minimum":8,"maximum":64},"cfgStrength":{"type":"number","minimum":0,"maximum":5},"speed":{"type":"number","minimum":0.5,"maximum":2},"cfgValue":{"type":"number","minimum":0,"maximum":5},"inferenceTimesteps":{"type":"integer","minimum":4,"maximum":32},"cloneMode":{"type":"string","enum":["standard","controllable","ultimate","voice-design"]},"stylePrompt":{"description":"VoxCPM2 style instruction, wrapped into the generated text as `(instruction)...`.","type":"string","minLength":1,"maxLength":200},"voiceDescription":{"description":"VoxCPM2 voice-design description. 
Use without referenceAudioUrl.","type":"string","minLength":1,"maxLength":200},"xVectorOnlyMode":{"description":"Qwen3-TTS clone mode that skips reference transcript use; faster but lower similarity.","type":"boolean"}},"required":["temperature","topP","topK","minP","repetitionPenalty","exaggeration","cfgWeight","normalizeReferenceLoudness"],"additionalProperties":false},"clientRequestId":{"type":"string","minLength":1,"maxLength":128}},"required":["type","text","model","voice","language","format","output"],"additionalProperties":false},"result":{"anyOf":[{"type":"object","properties":{"outputs":{"type":"object","properties":{"audio":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri"}},"required":["publicUrl"],"additionalProperties":false}},"required":["audio"],"additionalProperties":false},"metrics":{"type":"object","properties":{"t_fetch_reference_s":{"type":"number"},"t_transcribe_s":{"type":"number"},"t_load_s":{"type":"number"},"t_generate_s":{"type":"number"},"t_encode_s":{"type":"number"},"t_upload_s":{"type":"number"},"duration_s":{"type":"number"},"sample_rate":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991},"bytes":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991}},"required":["t_fetch_reference_s","t_load_s","t_generate_s","t_encode_s","t_upload_s","sample_rate","bytes"],"additionalProperties":false},"provider":{"type":"object","properties":{"name":{"type":"string","const":"modal"},"callId":{"type":"string"},"model":{"type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"voice":{"type":"string"},"format":{"type":"string","enum":["wav","mp3"]}},"required":["name","callId","model","voice","format"],"additionalProperties":false}},"required":["outputs","metrics","provider"],"additionalProperties":false},{"type":"null"}]},"error":{"anyOf":[{"type":"string"},{"type":"null"}]},"provider":{"type":"object","properties":{"name"
:{"type":"string","const":"modal"},"requestId":{"anyOf":[{"type":"string"},{"type":"null"}]},"status":{"anyOf":[{"type":"string"},{"type":"null"}]},"logs":{"type":"array","items":{"type":"string"}},"error":{"anyOf":[{"type":"object","properties":{"code":{"type":"string"},"message":{"type":"string"}},"required":["code","message"],"additionalProperties":false},{"type":"null"}]}},"required":["name","requestId","status","logs","error"],"additionalProperties":false},"createdAt":{"type":"string"},"updatedAt":{"type":"string"}},"required":["id","type","status","progress","input","result","error","provider","createdAt","updatedAt"],"additionalProperties":false}},"nextCursor":{"anyOf":[{"type":"string"},{"type":"null"}]}},"required":["items","nextCursor"],"additionalProperties":false}}}},"402":{"description":"Authentication Required"}}}},"/api/jobs/{jobId}":{"get":{"operationId":"jobs_status","summary":"Get a StableVoice job status by ID","tags":["Jobs"],"security":[{"siwx":[]}],"parameters":[{"name":"jobId","in":"path","required":true,"schema":{"type":"string"}}],"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string","const":"stablevoice-speech"},"status":{"type":"string","enum":["pending","queued","processing","complete","failed"]},"progress":{"type":"integer","minimum":0,"maximum":100},"input":{"type":"object","properties":{"type":{"default":"stablevoice-speech","type":"string","const":"stablevoice-speech"},"text":{"type":"string","minLength":1,"maxLength":2500,"description":"Text to synthesize, max 2500 characters."},"model":{"default":"chatterbox-turbo","description":"Self-hosted Modal model. chatterbox-turbo is fastest; chatterbox adds expressive controls; chatterbox-multilingual supports 23 languages.","type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"voice":{"default":"Lucy","description":"Bundled reference voice. 
Match to the character's identity using `voiceGuide` from /api/voices — each voice carries an accent/ethnicity tag (Indian, Australian, Slavic, Black, Latina, American, older country) and a casting cue. Default Lucy is a neutral North American female. Ignored only when referenceAudioUrl is supplied.","type":"string","enum":["Aaron","Abigail","Anaya","Andy","Archer","Brian","Chloe","Dylan","Emmanuel","Ethan","Evelyn","Gavin","Gordon","Ivan","Laura","Lucy","Madison","Marisol","Meera","Walter"]},"language":{"default":"en","description":"Language ID. Required for chatterbox-multilingual.","type":"string","enum":["ar","da","de","el","en","es","fi","fr","he","hi","it","ja","ko","ms","nl","no","pl","pt","ru","sv","sw","tr","zh"]},"referenceAudioUrl":{"description":"Optional StableUpload URL for a custom voice reference. WAV/MP3/M4A. Required reference duration depends on the model: voxcpm2 and qwen3-tts-1.7b take 3-10s, f5-tts takes 10-15s, chatterbox-* takes 5-15s.","type":"string","format":"uri"},"referenceText":{"description":"Optional exact transcript of referenceAudioUrl. Used by f5-tts, voxcpm2, and qwen3-tts-1.7b to skip on-worker Whisper transcription. Omit it if uncertain. Ignored by chatterbox-*.","type":"string","minLength":1,"maxLength":1000},"format":{"default":"wav","type":"string","enum":["wav","mp3"]},"output":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri","description":"StableUpload public URL where the generated audio will land."},"uploadUrl":{"description":"Presigned PUT URL from StableUpload. Do not pass a POST slot's bare S3 postUrl here.","type":"string","format":"uri"},"postUrl":{"description":"Presigned POST URL from StableUpload. 
Pair with postFields.","type":"string","format":"uri"},"postFields":{"description":"Required form fields when posting to postUrl.","type":"object","propertyNames":{"type":"string"},"additionalProperties":{"type":"string"}}},"required":["publicUrl"],"additionalProperties":false,"description":"Reserved StableUpload output slot for the generated audio."},"options":{"type":"object","properties":{"temperature":{"default":0.8,"type":"number","minimum":0.05,"maximum":2},"topP":{"default":0.95,"type":"number","minimum":0.05,"maximum":1},"topK":{"default":1000,"type":"integer","minimum":1,"maximum":1000},"minP":{"default":0.05,"type":"number","minimum":0,"maximum":0.5},"repetitionPenalty":{"default":1.2,"type":"number","minimum":1,"maximum":3},"exaggeration":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"cfgWeight":{"default":0.5,"type":"number","minimum":0,"maximum":1.5},"normalizeReferenceLoudness":{"default":true,"type":"boolean"},"nfeStep":{"type":"integer","minimum":8,"maximum":64},"cfgStrength":{"type":"number","minimum":0,"maximum":5},"speed":{"type":"number","minimum":0.5,"maximum":2},"cfgValue":{"type":"number","minimum":0,"maximum":5},"inferenceTimesteps":{"type":"integer","minimum":4,"maximum":32},"cloneMode":{"type":"string","enum":["standard","controllable","ultimate","voice-design"]},"stylePrompt":{"description":"VoxCPM2 style instruction, wrapped into the generated text as `(instruction)...`.","type":"string","minLength":1,"maxLength":200},"voiceDescription":{"description":"VoxCPM2 voice-design description. 
Use without referenceAudioUrl.","type":"string","minLength":1,"maxLength":200},"xVectorOnlyMode":{"description":"Qwen3-TTS clone mode that skips reference transcript use; faster but lower similarity.","type":"boolean"}},"required":["temperature","topP","topK","minP","repetitionPenalty","exaggeration","cfgWeight","normalizeReferenceLoudness"],"additionalProperties":false},"clientRequestId":{"type":"string","minLength":1,"maxLength":128}},"required":["type","text","model","voice","language","format","output"],"additionalProperties":false},"result":{"anyOf":[{"type":"object","properties":{"outputs":{"type":"object","properties":{"audio":{"type":"object","properties":{"publicUrl":{"type":"string","format":"uri"}},"required":["publicUrl"],"additionalProperties":false}},"required":["audio"],"additionalProperties":false},"metrics":{"type":"object","properties":{"t_fetch_reference_s":{"type":"number"},"t_transcribe_s":{"type":"number"},"t_load_s":{"type":"number"},"t_generate_s":{"type":"number"},"t_encode_s":{"type":"number"},"t_upload_s":{"type":"number"},"duration_s":{"type":"number"},"sample_rate":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991},"bytes":{"type":"integer","minimum":-9007199254740991,"maximum":9007199254740991}},"required":["t_fetch_reference_s","t_load_s","t_generate_s","t_encode_s","t_upload_s","sample_rate","bytes"],"additionalProperties":false},"provider":{"type":"object","properties":{"name":{"type":"string","const":"modal"},"callId":{"type":"string"},"model":{"type":"string","enum":["chatterbox-turbo","chatterbox","chatterbox-multilingual","f5-tts","voxcpm2","qwen3-tts-1.7b"]},"voice":{"type":"string"},"format":{"type":"string","enum":["wav","mp3"]}},"required":["name","callId","model","voice","format"],"additionalProperties":false}},"required":["outputs","metrics","provider"],"additionalProperties":false},{"type":"null"}]},"error":{"anyOf":[{"type":"string"},{"type":"null"}]},"provider":{"type":"object","properties":{"name"
:{"type":"string","const":"modal"},"requestId":{"anyOf":[{"type":"string"},{"type":"null"}]},"status":{"anyOf":[{"type":"string"},{"type":"null"}]},"logs":{"type":"array","items":{"type":"string"}},"error":{"anyOf":[{"type":"object","properties":{"code":{"type":"string"},"message":{"type":"string"}},"required":["code","message"],"additionalProperties":false},{"type":"null"}]}},"required":["name","requestId","status","logs","error"],"additionalProperties":false},"createdAt":{"type":"string"},"updatedAt":{"type":"string"}},"required":["id","type","status","progress","input","result","error","provider","createdAt","updatedAt"],"additionalProperties":false}}}},"402":{"description":"Authentication Required"}}},"delete":{"operationId":"jobs_delete","summary":"Soft-delete a finished StableVoice job","tags":["Jobs"],"security":[{"siwx":[]}],"responses":{"200":{"description":"Successful response"},"402":{"description":"Authentication Required"}}}},"/api/voice-samples":{"get":{"operationId":"voice-samples","summary":"List static StableVoice MP3 previews for every bundled voice. 
Use this before choosing a bundled voice for /api/speech.","tags":["Voice Samples"],"security":[{"siwx":[]}],"responses":{"200":{"description":"Successful response","content":{"application/json":{"schema":{"type":"object","properties":{"count":{"type":"integer","minimum":0,"maximum":9007199254740991},"items":{"type":"array","items":{"type":"object","properties":{"voice":{"type":"string","enum":["Aaron","Abigail","Anaya","Andy","Archer","Brian","Chloe","Dylan","Emmanuel","Ethan","Evelyn","Gavin","Gordon","Ivan","Laura","Lucy","Madison","Marisol","Meera","Walter"]},"slug":{"type":"string"},"description":{"type":"string"},"traits":{"type":"array","items":{"type":"string"}},"bestFor":{"type":"array","items":{"type":"string"}},"text":{"type":"string"},"model":{"type":"string","const":"chatterbox-turbo"},"language":{"type":"string","const":"en"},"format":{"type":"string","const":"mp3"},"audioUrl":{"type":"string","format":"uri"}},"required":["voice","slug","description","traits","bestFor","text","model","language","format","audioUrl"],"additionalProperties":false}},"note":{"type":"string"}},"required":["count","items","note"],"additionalProperties":false}}}},"402":{"description":"Authentication Required"}}}}},"components":{"securitySchemes":{"siwx":{"type":"apiKey","in":"header","name":"SIGN-IN-WITH-X"}}}}