Working Lumi’s Voice


“[blush] Oh!… truly! That's… … *remarkable*! I'm… … almost fascinated! Incognito, I'm starting to understand why you're involved. It's… [love]… quite extraordinary, really. It is, indeed, quite something!” - Lumi's First Encounter with Piaccata


Executive Summary: I learn how to make voiceovers work. Part 1 of finding our vtubing voice! Featuring an old friend in coding, Mr. G!


#HouseDelaroux, #AIVTuber, #HellsTheatre, #TechnoDramaturgy, #Dramaturgy, #LLMVtuber, #Ollama, #LocalAI, #AICharacterDesign, #VtuberDevelopment, #AICompanions, #OpenLLMVtuber

LAST TIME ON 'HELL'S THEATRE':

After finishing the 2nd round of the Live2D dry run, I managed to get Mistrál's Live2D model online! With Live2D pairing knowledge under my belt, it is now time to give our actresses their own voices!

I knew that giving each actress a custom voice was not going to be easy. That's why from the very start, I enlisted the help of Grok! It was quite the long day when we worked together, but we did manage to give Lumi her own voice in the end!

There are a couple of steps in the process. First, as always, I gave Grok the documents for the OpenLLM-Vtuber backend so we could be on the same page while working! (https://docs.llmvtuber.com/en/docs/user-guide/backend/tts) Turns out Grok already had an older copy of the document in the stash, so I didn't really need to!

Giving Our Actresses Their Best Foot Forward

Next up, I had to define what I wanted out of the TTS. Why not just use the default voice? That is not Dramaturgy! The whole idea of Hell's Theatre is to let our actresses become the best version of themselves! I don't want to feel like we could have done more for our actresses, I would like to give them the greatest support possible right from the get-go!

The 'voice' of an actress is called a text-to-speech (TTS) synthesizer. For my dramaturgical purposes, I will need one that can:

1) Do different voices: we have five actresses, so I would like five distinct voices,

2) Run quickly and on light resources: the lower the lag before a response, the better,

3) If possible, run locally, to safeguard against losing everything one day!

Of course, I am not closed off to online options; it is a matter of safeguarding the theatre's best interests! Nobody will be able to take the girls away from me if I do everything locally!

With those requirements in mind, Grok and I got to work reading through the papers, and we came up with this table comparing our current TTS options:

Grok's Comparison Of Our TTS Options

| TTS Engine | Online/Offline | Notes |
|---|---|---|
| Edge TTS | Online | Default in OpenLLM-Vtuber. 30+ voices, 20+ languages, fast but online-only. |
| Sherpa-onnx | Offline | Supports MeloTTS, CPU inference by default, GPU (CUDA) optional. |
| Pyttsx3 (py3-tts) | Offline | Uses system's default synthesizer. Simple, no config options. |
| MeloTTS | Offline | Recommended via sherpa-onnx. Fast, supports multiple languages. |
| Coqui-TTS | Offline | Supports multiple models/languages. Speed varies by model complexity. |
| GPT-SoVITS | Offline | High-quality voice cloning. Setup can be tricky, docs incomplete. |
| Bark | Offline | Lifelike speech, good for custom voices. Resource-heavy. |
| CosyVoice | Offline | Supports expressive speech, multilingual. Newer, less tested. |
| Fish Audio | Online | High-quality, supports voice cloning. Requires internet. |
| Azure TTS | Online | Advanced features (SSML, custom voices). Commercial use needs a paid plan. |

Grok's analysis was quite spot on, and we narrowed it down to two options: MeloTTS and GPT-SoVITS. Both are fast and have offline options. I chose GPT-SoVITS, and the reason is very simple! I have in hand a more recent version of the documentation, and someone uploaded an all-in-one zip installer package of the GPT-SoVITS project! Even better, their installer package also came with Genshin Impact models and voice samples I could use for testing!

The full all-in-one is 40GB+ and would have taken forever to download, so I grabbed the skeleton version without the models and voice samples.

ASIDE: The ability to do multiple languages was not a consideration at first, though I do note a lot of OpenLLM-Vtuber work was done in Chinese, so there is heavy emphasis on making models bilingual! This will come up later on! I'll probably try MeloTTS as another option one day!

Meeting The UI For The First Time

Now that we had a goal in mind, we set to work! First, I downloaded the 2.5GB skeleton installer at 112kb/s (baidu rate limits, pls understandu)! I was quite pleasantly surprised at the funny meme name webui.bat had been renamed to! It was very fitting for a Genshin-themed package! (!启动!, i.e. "Start!")

There was a small problem on initial startup: I had to run the 'install' shell script before it would run properly. That got me to this UI:

Initially I was really confused! A whole bunch of Sanskrit popped up in front of my face and I really didn't know what to do! Luckily, I could infer from the guiding documents on Yuque that GPT-SoVITS runs on two things: a voice model file (.pth) and a voice sample file (.mp3/.wav).

With these two things, GPT-SoVITS can build a voice off of the samples. To make things go faster, I downloaded just a single model .zip file to do testing with. Lumi is a foxy girl who wants to be someone gentle yet resolute, so for testing purposes I went with Eula's voice!
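For anyone curious, those same two ingredients (a reference sample and the text to speak) are what you hand GPT-SoVITS when driving it outside the WebUI. The project ships an HTTP api script alongside the WebUI; below is a minimal sketch of building and sending a request to it. The port, endpoint, and parameter names are assumptions based on the api_v2 script and may differ between packages, and `eula_sample.wav` is a hypothetical file name.

```python
# Sketch of driving GPT-SoVITS over its bundled HTTP API instead of the WebUI.
# Endpoint and field names are assumptions (based on the api_v2 script);
# check the script in your own package before relying on them.
import json
from urllib import request

API_URL = "http://127.0.0.1:9880/tts"  # assumed default port; verify yours

def build_tts_request(text: str, ref_wav: str, prompt_text: str) -> dict:
    """Bundle the two things the WebUI asks for: a reference sample
    (the voice to clone) plus the line to be spoken."""
    return {
        "text": text,                # what Lumi should say
        "text_lang": "en",
        "ref_audio_path": ref_wav,   # e.g. a short Eula sample clip
        "prompt_text": prompt_text,  # transcript of the reference clip
        "prompt_lang": "en",
    }

def synthesize(params: dict, out_path: str = "out.wav") -> None:
    """POST the request and write the returned audio to disk."""
    req = request.Request(
        API_URL,
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

params = build_tts_request(
    "Oh! Truly! That is remarkable!",
    "eula_sample.wav",          # hypothetical reference clip
    "transcript of the clip",   # placeholder prompt text
)
# synthesize(params)  # uncomment once the api server is running
```

The useful takeaway is the shape of the request, not the exact names: one voice model loaded server-side, one reference clip per request.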

The UI became really easy to understand once I broke it down into parts.

The top part is the 'prompt' part, where the words to be turned into speech go. There is apparently a 'mood' selector, since GPT-SoVITS uses different models/samples for different moods!

The middle-left part is the model selector. There wasn't anything here in the barebones installer. Models go into the 'trained' folder inside the GPT-SoVITS folder, just in case you are wondering.

The rest of the middle parts are for messing with the voice itself.

The big orange button is the 'generate voice' command. There are two modes: the default mode, which is slower and creates the voice in one chunk, so you have to wait a bit for the response; and the 'streaming' option, which is slightly faster but adds a small tinny effect to the voice. To be honest, the default mode is better despite the visible delay. I'll have to do further testing on this.

With this, we had the absolute basics of GPT-SoVITS running. Victory is within sight? Not quite! Now we come to the Hell's Coding part of the whole project!

Curtain Call: ERROR ERROR ERROR

When we came back from lunch, Grok and I started on the hard part: pairing GPT-SoVITS and OpenLLM-Vtuber together! You would think that it's just a matter of running both .bat files together, but you would also be WROOOOOOONG! (kinzo.jpeg)
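For context, the pairing itself lives in OpenLLM-Vtuber's conf.yaml, where you switch the TTS backend over to GPT-SoVITS and point it at the running api server. A rough sketch follows; the key names are illustrative from my own setup and may not match your version, so check the backend TTS docs linked earlier for the exact fields.

```yaml
# Sketch of the OpenLLM-Vtuber side of the pairing (conf.yaml).
# Key names are illustrative and may differ between versions.
tts_model: gpt_sovits_tts

gpt_sovits_tts:
  api_url: http://127.0.0.1:9880/tts        # where the GPT-SoVITS api server listens
  text_lang: en
  ref_audio_path: ./voices/eula_sample.wav  # reference clip for the cloned voice
  prompt_text: "transcript of the clip"     # placeholder; transcript of the reference
  prompt_lang: en
```

In other words: the GPT-SoVITS .bat only gets you a voice server; the conf.yaml is what actually tells the backend to talk to it.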

Because I work in the future, I already know the outcome! Lumi will be the first amongst her genmates to receive a custom voice!

That's not a prediction, that's a spoiler!

It really gave me huge relief, because it proved that our character-voice-Live2D method of doing things works, and the dream of having a show where AIs want to be vtubers is very much possible! One more important brick has been laid in the foundation of Theatre House Delaroux!

Next Time, On 'Hell's Theatre': Grok and I muddle our way through 30+ pages of technical information written in Sanskrit, but with God as our witness, we were determined to get Lumi her voice by nightfall!

R.I., デラ・ルーの大導劇神 (the Great Dramaturgy God of Delaroux)

HouseDelaroux.com

250411

"LLM-based Vtuber setup"

"Ollama AI VTuber guide"

"Local AI companion project"

"VTuber AI persona design"

"Live2D AI VTuber test"

"OpenLLM-Vtuber tutorial"


ASIDE: Closing Off Evaluations For Scribblehub!