What is Synclip Audio Studio?
Synclip Audio Studio is the audio production hub inside your workspace. It consolidates three separate audio workflows — text-to-speech, voice cloning, and stem separation — into a single mode-switching panel, so you never have to leave your project to produce the audio track a video needs.
The three live modes are: Text to Speech (TTS), Voice Clone, and Audio Separation. Two more — Text to Music and Speech to Text (ASR) — are in development and will roll out as they reach production quality.
Every mode is connected to the same coin balance and task queue. Results land in My Creations automatically, and any audio file produced in the studio can be handed straight to the Lipsync workspace with one click.
The five modes at a glance
Audio Studio is built around a mode-switching interface. You pick the workflow you need, and the input panel reconfigures for that task.
Text to Speech: Live
Convert a script into natural, human-sounding speech. Choose from 77 voices across Chinese, English, Japanese, Korean, French, Spanish, and more.
- 77 voices across 7+ languages — Chinese (Mandarin), English (US/UK/AU/IN), Japanese, Korean, French, Spanish, Italian, Portuguese
- Character limit scales with your subscription tier: 1,000 (free) → 3,000 → 5,000 → 10,000 characters
- Standard and premium voices — premium voices have richer, more expressive delivery
- Speed control at generation time
Voice Clone: Live
Upload a short reference audio file and generate new speech in that voice. No long training session required — one upload is enough.
- Upload any WAV or MP3 up to 10 MB as the reference
- Type the target script in the left panel, then generate speech that matches the reference voice
- Works best with clean, single-speaker audio — at least 5–10 seconds of natural speech
- Output lands in My Creations alongside your TTS files
- Useful for branded narrators, multilingual dubbing, or keeping an existing voice consistent across new content
Audio Separation: Live
Upload a mixed audio or video file and split it into two stems: foreground (vocals) and background (music / ambient).
- Upload any audio file up to 10 MB
- Two output files: _fg (foreground / vocals) and _bg (background / backing track)
- Priced at 4 coins per minute of audio
- Use cases: extract clean vocals for dubbing, isolate background music for B-roll, remove a backing track before applying lip-sync
Text to Music: Coming soon
Describe the music you need and generate a matching track. This mode is in development — it will appear as an active option once it reaches production quality.
- Prompt-based music generation
- Designed to produce background scores for video content
Speech to Text (ASR): Coming soon
Transcribe an audio file to text. Like Text to Music, this mode is still in development.
- Strong multi-language support
- Output as plain text or timed transcript
Text to Speech — 77 voices, 7+ languages
The TTS mode is the most-used part of Audio Studio, primarily because it feeds directly into lip-sync video production. Here is a sample of the voices available across the main language groups:
Chinese (Mandarin)
| Voice | Gender | Style | Best for |
|---|---|---|---|
| 云健 (Yúnjiàn) | Male | Steady | Audiobook, narration |
| 云扬 (Yúnyáng) | Male | Energetic | Podcast, social media |
| 小妮 (Xiǎo Ní) | Female | Sweet | Animation characters |
| 小小 (Xiǎo Xiǎo) | Female | Gentle | Voice assistant |
| 凌雨燕 (Líng Yǔyàn) | Female | Elegant | Storytelling |
| 刘平 (Liú Píng) | Male | Authoritative | Presentation, news |
English (US / UK / AU / IN)
| Voice | Gender | Style | Best for |
|---|---|---|---|
| Jessica | Female | Friendly | Podcast |
| Onyx | Male | Deep | Movie trailer, promo |
| Nova | Female | Modern | Vlog, social content |
| Nicole | Female | Professional | Tutorial, e-learning |
| Fenrir | Male | Dramatic | Fantasy narration |
| River | Female | Soothing | Audiobook, meditation |
Japanese / Korean / French / Spanish / Italian / Portuguese
| Voice | Gender | Style | Best for |
|---|---|---|---|
| Sakura (JA) | Female | Warm | Tutorial, commercial |
| Nori (JA) | Male | Professional | Corporate, presentation |
| Chae-won (KO) | Female | Clear | Podcast, vlog |
| Sophie (FR) | Female | Natural | E-learning, documentary |
| Carlos (ES) | Male | Energetic | Ads, YouTube |
| Isabella (PT) | Female | Friendly | Social media, tutorials |
Tips for better TTS results
- Use punctuation to control pacing. A full stop produces a longer natural pause than a comma. If you need a distinct beat between two ideas, end the first sentence with a full stop rather than joining them with a comma.
- Break long paragraphs into short sentences — shorter sentences produce noticeably cleaner, more natural-sounding delivery.
- Slow down the rate slightly (0.85×) for brand names, technical terms, or any phrase the listener needs to register.
- Premium voices have richer tonal variation; use them for hero narration or final productions. Standard voices are great for drafts and functional content.
- Match voice energy to the video context: an energetic, warm voice works over fast cuts and product demos; a measured, calm voice suits documentaries and e-learning.
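If you prepare scripts outside the studio, a quick pre-flight check can catch the two issues mentioned above before you paste: text that exceeds your tier's character limit, and sentences long enough to be worth splitting. The following is a minimal Python sketch, not part of Synclip. The limits come from the tier list earlier in this article, but every tier name except "free" is a hypothetical placeholder, and the 20-word threshold is an arbitrary assumption. It also counts words by whitespace, so it only suits space-delimited languages.

```python
import re

# Character limits per tier, as listed in this article. Only "free" is named
# there; "tier_2" to "tier_4" are hypothetical placeholder names.
TIER_CHAR_LIMITS = {"free": 1_000, "tier_2": 3_000, "tier_3": 5_000, "tier_4": 10_000}


def split_sentences(script: str) -> list[str]:
    """Split on whitespace that follows a full stop, '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [part for part in parts if part]


def review_script(script: str, tier: str = "free", max_words: int = 20) -> dict:
    """Compare length with the tier limit and list sentences worth shortening."""
    sentences = split_sentences(script)
    # Assumption: more than max_words words counts as a "long" sentence.
    long_sentences = [s for s in sentences if len(s.split()) > max_words]
    return {
        "characters": len(script),
        "within_limit": len(script) <= TIER_CHAR_LIMITS[tier],
        "sentences": sentences,
        "long_sentences": long_sentences,
    }
```

Calling `review_script(my_script, tier="free")` returns the character count, whether it fits the 1,000-character free limit, and any sentences you may want to break up before generating.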
Voice Clone — match any voice from a reference file
Voice Clone lets you generate speech that sounds like a specific person — without any long setup. You upload a short reference recording, type your script, and Audio Studio produces that voice reading your new text.
The most common use case is brand consistency: if a client has existing narration or a brand voice they want to carry into new content, Voice Clone handles that without a new studio recording session.
It also works for multilingual dubbing: clone a speaker's English voice and generate the Spanish version of the same script, keeping the same voice character across languages.
How to use Voice Clone
- Switch to the Voice Clone tab in Audio Studio.
- In the right panel, click the upload zone and select a WAV or MP3 reference file (up to 10 MB).
- In the left panel, type the script you want generated in that voice.
- Click Generate — the result is saved to My Creations.
For best results: use a clean reference with minimal background noise, a single speaker, and at least 5–10 seconds of natural speech. Recordings with music, reverb, or multiple speakers will reduce accuracy.
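If you handle many reference recordings, the format and size requirements above can be checked locally before uploading. This is a rough Python sketch under stated assumptions: the function names are made up for illustration, and "10 MB" is read as 10 × 1024 × 1024 bytes, which this article does not specify. It cannot judge clip length, background noise, or the number of speakers; those still need a listen.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3"}
# Assumption: 10 MB means 10 * 1024 * 1024 bytes; the exact cutoff may differ.
MAX_REFERENCE_BYTES = 10 * 1024 * 1024


def check_reference(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format {suffix!r}; expected WAV or MP3")
    if size_bytes > MAX_REFERENCE_BYTES:
        size_mb = size_bytes / (1024 * 1024)
        problems.append(f"file is {size_mb:.1f} MB; the limit is 10 MB")
    return problems


def check_reference_file(path: str) -> list[str]:
    """Convenience wrapper that reads the name and size from a file on disk."""
    file = Path(path)
    return check_reference(file.name, file.stat().st_size)
```

An empty list from `check_reference_file("narrator.wav")` means the file passes both checks; otherwise each entry describes one thing to fix.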
Audio Separation — split any track into vocals and backing
Audio Separation takes a mixed audio file and returns two stems: a foreground file containing the vocals or primary speaker, and a background file containing the music, ambience, or backing track.
The clearest use case for video production: you have a clip with a speaker and background music, but you need clean vocals to feed into lip-sync or dubbing. Upload the mixed file, run separation, and you get the isolated voice track in seconds.
The reverse works too. If you have a great piece of background music buried inside a clip, separation pulls it out as a standalone file ready to drop onto a new timeline.
Output files
- _fg: foreground stem (vocals, primary speaker, lead instrument)
- _bg: background stem (music, ambience, and any other sound behind the speaker)
Audio Separation is priced at 4 coins per minute of uploaded audio. A 3-minute track costs 12 coins.
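To budget coins before uploading, the pricing rule can be written as a small calculation. The sketch below is in Python and rests on one assumption that this article does not confirm: partial minutes are billed as full minutes. If Synclip pro-rates by the second instead, the real cost for clips that are not a whole number of minutes would be lower.

```python
import math

COINS_PER_MINUTE = 4  # rate stated in this article


def estimate_separation_coins(duration_seconds: float) -> int:
    """Estimate the coin cost of separating a clip of the given length.

    Assumption: a partial minute is charged as a full minute.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return math.ceil(duration_seconds / 60) * COINS_PER_MINUTE


print(estimate_separation_coins(180))  # 3-minute track -> 12
```

Treat the result as an estimate and check it against your coin balance after the first run.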
How Audio Studio connects to your lip-sync workflow
Audio Studio was designed first as a feeder for lip-sync video production. The connection between the two workspaces is direct:
- Produce your voice track in Audio Studio (TTS, Voice Clone, or a cleaned separation output).
- The result lands in My Creations.
- Open the Lipsync workspace, select "From My Creations" as the audio source, and pick the file.
- Upload your portrait (or use an existing one), configure body movement if needed, and render.
This loop (script → audio → lip-sync video) can run entirely inside Synclip without downloading or re-uploading files between tools.
Start in Audio Studio
- Open your Synclip workspace.
- Select Audio Studio from the left sidebar.
- Pick your mode: TTS, Voice Clone, or Audio Separation.
- Generate your track, then send it to the Lipsync workspace or download it directly.
If you already have a Synclip account, Audio Studio is available now. The three live modes — TTS, Voice Clone, and Separation — are ready to use. Text to Music and ASR will appear in the mode switcher once they reach production readiness.