Voice Cloning
Clone any voice with just 3-5 seconds of reference audio. No fine-tuning needed (zero-shot).
How It Works
- You provide a short audio clip + its transcript
- The codec encodes the audio into speech tokens
- These tokens are used as context for the LLM
- The LLM continues generating in the same voice style
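The four steps above can be pictured with a small, library-independent sketch. It is not Vieneu's actual implementation: `ClonePrompt`, `build_clone_prompt`, and the way the two texts are joined are hypothetical names and choices made for illustration only. The sketch only shows the idea that the reference transcript and its speech tokens form a prefix that the model then continues.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClonePrompt:
    """Context handed to the language model (hypothetical layout)."""
    text: str                 # reference transcript followed by the target text
    speech_tokens: List[int]  # codec tokens of the reference audio


def build_clone_prompt(ref_text: str, ref_codes: List[int], target_text: str) -> ClonePrompt:
    """Assemble the context for zero-shot cloning (illustrative assumption)."""
    # Join the transcript of what was already "spoken" (covered by ref_codes)
    # with the text that still has to be spoken.
    combined = f"{ref_text.strip()} {target_text.strip()}"
    # Copy the codes so later changes to the caller's list do not leak in.
    return ClonePrompt(text=combined, speech_tokens=list(ref_codes))


# Made-up token values; a real codec would produce these from the audio clip.
prompt = build_clone_prompt("Xin chào.", [11, 12, 13], "Tạm biệt.")
print(prompt.text)
print(len(prompt.speech_tokens), "reference tokens used as context")
```

Because the model sees the reference tokens as something it has already generated, the tokens it produces next tend to stay in the same voice, which is why an accurate transcript matters.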
Basic Cloning
from vieneu import Vieneu

tts = Vieneu()

# Synthesize the target text in the voice of the reference speaker.
audio = tts.infer(
    text="Đây là giọng nói được clone.",  # "This is a cloned voice."
    ref_audio="path/to/speaker.wav",
    ref_text="Exact transcript of what the speaker says in the audio.",
)
tts.save(audio, "cloned_output.wav")
Tips for Best Results
- Audio quality: Use clean recordings without background noise
- Duration: 3-5 seconds is ideal. Too short = poor quality, too long = wasted context
- Transcript accuracy: The ref_text must accurately match what's spoken
- Language: Reference audio should be in Vietnamese for best results
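The 3-5 second guideline above can be checked before cloning. The following is a minimal sketch using only Python's standard `wave` module, so it only reads uncompressed PCM WAV files. The helper names and the threshold constants are assumptions made for this example and are not part of the Vieneu API.

```python
import wave

# Thresholds taken from the duration tip; adjust as needed.
MIN_SECONDS = 3.0
MAX_SECONDS = 5.0


def wav_duration_seconds(path: str) -> float:
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_duration(path: str,
                             min_s: float = MIN_SECONDS,
                             max_s: float = MAX_SECONDS) -> str:
    """Classify a reference clip as 'too short', 'too long', or 'ok'."""
    duration = wav_duration_seconds(path)
    if duration < min_s:
        return "too short"
    if duration > max_s:
        return "too long"
    return "ok"
```

Calling `check_reference_duration("path/to/speaker.wav")` before `tts.infer(...)` gives an early warning when a clip falls outside the recommended range. It does not measure noise or transcript accuracy, which still need a manual check.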
Using Encoded Codes Directly
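# Encode the reference audio once, then reuse the codes for every sentence.
ref_codes = tts.encode_reference("speaker.wav")
# `texts` is assumed to be a list of strings defined elsewhere.
for text in texts:
    audio = tts.infer(
        text=text,
        ref_codes=ref_codes,
        ref_text="Transcript of reference audio.",
    )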
Preset Voices
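# List the built-in voices as (description, voice_id) pairs.
voices = tts.list_preset_voices()
for description, voice_id in voices:
    print(f"{voice_id}: {description}")

# Pick a preset voice and synthesize with it; no reference audio is needed.
voice = tts.get_preset_voice("voice_name")
audio = tts.infer(text="Chào bạn!", voice=voice)  # "Hello!"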