Voice Cloning

Clone a voice from just 3-5 seconds of reference audio. Cloning is zero-shot: no fine-tuning is needed.

How It Works

You provide a short audio clip + its transcript
The codec encodes the audio into speech tokens
These tokens are used as context for the LLM
The LLM continues generating in the same voice style
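The steps above can be sketched as a toy continuation loop. This is a conceptual illustration only, not the library's implementation: the token strings, `build_context`, `continue_speech`, and `toy_next_token` are all hypothetical names made up for this sketch, and the real prompt layout may differ.

```python
from typing import Callable, List


def build_context(ref_text: str, ref_codes: List[int], target_text: str) -> List[str]:
    """Assemble one prompt: the texts first, then the reference speech tokens.

    The token format here is invented for illustration.
    """
    text_tokens = [f"<text>{ref_text} {target_text}</text>"]
    speech_tokens = [f"<speech_{code}>" for code in ref_codes]
    return text_tokens + speech_tokens


def continue_speech(
    context: List[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int,
) -> List[str]:
    """Generate new speech tokens one at a time, conditioned on the context."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context + generated)
        if token == "<eos>":
            break
        generated.append(token)
    return generated


def toy_next_token(sequence: List[str]) -> str:
    """Stand-in for the LLM: repeats the last speech token, stops at 6 speech tokens."""
    speech = [t for t in sequence if t.startswith("<speech_")]
    return "<eos>" if len(speech) >= 6 else speech[-1]


context = build_context(
    ref_text="reference transcript",
    ref_codes=[11, 42, 7],  # pretend output of the codec
    target_text="text to speak",
)
new_tokens = continue_speech(context, toy_next_token, max_new_tokens=10)
```

Because the reference speech tokens sit at the end of the context, the newly generated tokens read as a continuation of the same speaker, which is why no fine-tuning step is involved.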

Basic Cloning

from vieneu import Vieneu

tts = Vieneu()

audio = tts.infer(
    text="Đây là giọng nói được clone.",
    ref_audio="path/to/speaker.wav",
    ref_text="Exact transcript of what the speaker says in the audio.",
)
tts.save(audio, "cloned_output.wav")

Tips for Best Results

Audio quality: Use clean recordings without background noise
Duration: 3-5 seconds is ideal. Shorter clips reduce cloning quality; longer clips waste context
Transcript accuracy: The ref_text must accurately match what's spoken
Language: Reference audio should be in Vietnamese for best results
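The duration tip can be checked before a clip is used as a reference. The following is a minimal sketch using only Python's standard `wave` module, so it assumes an uncompressed WAV file. The helper names and the hard 3-5 second thresholds are assumptions for this example, not part of the library.

```python
import wave

# Assumed bounds, taken from the "3-5 seconds is ideal" tip above.
IDEAL_MIN_S = 3.0
IDEAL_MAX_S = 5.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_duration(
    duration_s: float,
    min_s: float = IDEAL_MIN_S,
    max_s: float = IDEAL_MAX_S,
) -> str:
    """Classify a clip length as 'too_short', 'too_long', or 'ok'."""
    if duration_s < min_s:
        return "too_short"
    if duration_s > max_s:
        return "too_long"
    return "ok"
```

A typical use would be `check_reference_duration(wav_duration_seconds("path/to/speaker.wav"))` before calling `tts.infer`. This only validates length; it does not detect background noise or transcript mismatches.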

Using Encoded Codes Directly

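# Encode the reference audio once, then reuse the codes for every sentence.
# `tts` is the Vieneu instance created in Basic Cloning.
ref_codes = tts.encode_reference("speaker.wav")

texts = ["Xin chào!", "Hẹn gặp lại."]  # example sentences to synthesize

for i, text in enumerate(texts):
    audio = tts.infer(
        text=text,
        ref_codes=ref_codes,
        ref_text="Transcript of reference audio.",
    )
    tts.save(audio, f"cloned_{i}.wav")  # save each result instead of discarding it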

Preset Voices

voices = tts.list_preset_voices()
for description, voice_id in voices:
    print(f"{voice_id}: {description}")

voice = tts.get_preset_voice("voice_name")
audio = tts.infer(text="Chào bạn!", voice=voice)
